opendevreview | OpenStack Release Bot proposed openstack/nova stable/2024.1: Update .gitreview for stable/2024.1 https://review.opendev.org/c/openstack/nova/+/913631 | 10:20 |
---|---|---|
opendevreview | OpenStack Release Bot proposed openstack/nova stable/2024.1: Update TOX_CONSTRAINTS_FILE for stable/2024.1 https://review.opendev.org/c/openstack/nova/+/913632 | 10:21 |
opendevreview | OpenStack Release Bot proposed openstack/nova master: Update master for stable/2024.1 https://review.opendev.org/c/openstack/nova/+/913633 | 10:21 |
opendevreview | Rocky proposed openstack/nova master: Fix the command to list hw_machine_type unset instances https://review.opendev.org/c/openstack/nova/+/913652 | 10:35 |
bauzas | guys, please welcome our master branch that's now a Dalmatian :) | 10:48 |
bauzas | so long Caracal and thanks for the fish | 10:48 |
bauzas | (RC1 tag is set and caracal branch created) | 10:49 |
gibi | bauzas: \o/ | 10:53 |
sean-k-mooney | :) | 11:15 |
*** sfinucan is now known as stephenfin | 11:38 |
opendevreview | Fabian Wiesel proposed openstack/nova master: Do not close returned image-chunk iterator & get_verifier early https://review.opendev.org/c/openstack/nova/+/909474 | 12:00 |
*** mklejn__ is now known as mklejn | 13:11 |
fwiesel | bauzas: I've reopened the bug #2055700 as #2058225. In my mind _rebuild_default_impl is implemented in a libvirt-specific way. I can implement driver.rebuild for vmwareapi to work around the bug, but I don't think that would be a great way forward. | 13:18 |
opendevreview | Merged openstack/nova stable/2023.2: Power on cores for isolated emulator threads https://review.opendev.org/c/openstack/nova/+/913195 | 13:21 |
bauzas | fwiesel: cool | 13:40 |
fwiesel | But I do not know enough about how it is supposed to work in libvirt to make a good suggestion on how to implement it differently. I mean, I assume it makes sense that it is implemented the way it is. | 13:44 |
bauzas | fwiesel: you're referring to https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4132-L4137 | 13:53 |
bauzas | fwiesel: _rebuild_default_impl is a virt-agnostic solution for rebuild which deletes the old guest and then creates a new one https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3648-L3731 | 14:00 |
gibi | elodilles: thanks for the reviews so far. the parent of https://review.opendev.org/c/openstack/nova/+/913196 is merged now. So when you have time please check this backport too | 14:02 |
bauzas | as you can read, we power off the guest, detach its volumes, (reimage the root disk if volume-backed), destroy the guest, attach the volumes again and spawn the guest | 14:02 |
sean-k-mooney | fwiesel: the way it's currently implemented should work for vmware too | 14:14 |
sean-k-mooney | fwiesel: the only driver that does anything custom is ironic | 14:15 |
sean-k-mooney | the vmware driver can provide its own solution | 14:16 |
elodilles | gibi: ACK, will do | 14:17 |
sean-k-mooney | the important thing to remember about rebuild is that rebuild is not a move operation; evacuate is a rebuild to a different host, but by default a normal rebuild should never change host. if you change image however you do need to call the scheduler with the current host and the existing flavor and new image to confirm that host is valid. that should all be handled for you in the | 14:18 |
sean-k-mooney | generic compute manager code so the only thing the driver-level rebuild needs to do is ensure the root disk is reimaged | 14:18 |
sean-k-mooney | if there is a more efficient way to do that in vmware than the generic approach then that is fine to use | 14:19 |
dansmith | not "should never change host" but "does not change host" | 14:20 |
dansmith | even the driver shouldn't be able to override that | 14:20 |
sean-k-mooney | right | 14:22 |
sean-k-mooney | i'm just trying to say that if there were to be a vmware version of rebuild in the vmware driver, the only thing it should do is reimage the disk | 14:22 |
sean-k-mooney | the vm must remain on the same hypervisor host | 14:22 |
sean-k-mooney | i just mentioned evacuate because the driver code will have to work when called in that code path too | 14:23 |
dansmith | yep | 14:23 |
fwiesel | sean-k-mooney: The current implementation calls driver.destroy, and *then* calls detach volume. In my mind, that shouldn't work in the general case. | 14:27 |
bauzas | no | 14:33 |
bauzas | fwiesel: https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3675-L3676 | 14:33 |
bauzas | poweroff, detach, destroy is the workflow | 14:33 |
fwiesel | Destroy: https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3695-L3701 and then via https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3713-L3715 comes https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3589-L3595 | 14:34 |
fwiesel | bauzas: The root volume is being detached after the instance has been destroyed if reimage_boot_volume and is_volume_backed | 14:35 |
fwiesel | So, poweroff, detach everything except the root-disk, destroy, detach the root_disk (in the case of reimage_boot_volume and is_volume_backed) | 14:36 |
bauzas | we still detach the volumes, including the root bdm in https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3675-L3676 | 14:36 |
fwiesel | Sure, and after the destroy comes _rebuild_volume_backed_instance -> _detach_root_volume | 14:37 |
fwiesel | Well, we have `reimage_boot_volume=True` ergo `detach_root_bdm = not reimage_boot_volume = False`, so `detach_block_devices(..., detach_root_bdm=False)`. | 14:40 |
bauzas | fwiesel: ah you're right, it's a bool | 14:51 |
bauzas | okay, so the root volume isn't detached yet, my bad | 14:51 |
bauzas | and ack, we rebuild the root volume in _rebuild_volume_backed_instance() | 14:53 |
bauzas | where we detach the volume, delete the old attachment, and then reimage the volume thru cinder | 14:55 |
bauzas | interestingly enough, we destroy the guest indeed before the detach, but we skip its bdm when we call the driver destroy | 14:56 |
fwiesel | There is probably a reason for that. I mean it is done very explicitly so. But as I said, I don't have any idea why. | 15:06 |
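To make the ordering the thread converges on easier to follow, here is a very rough sketch paraphrasing _rebuild_default_impl. It is not nova's actual code: the `compute`/`driver` objects and every method name are illustrative stand-ins for the real helpers in nova/compute/manager.py.

```python
# Ordering sketch only; all names are illustrative stand-ins, not nova's API.
def rebuild_default_impl_sketch(compute, driver, instance, bdms,
                                reimage_boot_volume, is_volume_backed):
    driver.power_off(instance)
    # The root BDM stays attached when the boot volume will be reimaged:
    # detach_root_bdm = not reimage_boot_volume
    compute.detach_block_devices(bdms, detach_root_bdm=not reimage_boot_volume)
    # The guest is destroyed *before* the root volume is detached; the root
    # BDM is simply skipped in the block device info given to destroy().
    driver.destroy(instance)
    if reimage_boot_volume and is_volume_backed:
        # _rebuild_volume_backed_instance(): detach the root volume, delete
        # the old attachment, then have cinder reimage the volume.
        compute.rebuild_volume_backed_instance(instance)
    compute.attach_block_devices(bdms)
    driver.spawn(instance)  # new guest, same host
```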
opendevreview | Fabian Wiesel proposed openstack/nova master: Do not close returned image-chunk iterator & get_verifier early https://review.opendev.org/c/openstack/nova/+/909474 | 15:21 |
opendevreview | Elod Illes proposed openstack/nova stable/2024.1: [stable-only] Update .gitreview for stable/2024.1 https://review.opendev.org/c/openstack/nova/+/913631 | 15:30 |
opendevreview | Elod Illes proposed openstack/nova stable/2024.1: [stable-only] Update TOX_CONSTRAINTS_FILE for stable/2024.1 https://review.opendev.org/c/openstack/nova/+/913632 | 15:30 |
bauzas | #startmeeting nova | 16:00 |
opendevmeet | Meeting started Tue Mar 19 16:00:45 2024 UTC and is due to finish in 60 minutes. The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot. | 16:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 16:00 |
opendevmeet | The meeting name has been set to 'nova' | 16:00 |
bauzas | heyho | 16:00 |
bauzas | #link https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting | 16:00 |
fwiesel | o/ | 16:01 |
elodilles | o/ | 16:01 |
auniyal | o/ | 16:01 |
gibi | o/ | 16:01 |
bauzas | okay, let's softly start | 16:02 |
sean-k-mooney | o/ | 16:02 |
grandchild | o/ | 16:02 |
bauzas | #topic Bugs (stuck/critical) | 16:03 |
bauzas | #info No Critical bug | 16:03 |
bauzas | #link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New 63 new untriaged bugs (-2 since the last meeting) | 16:03 |
bauzas | #info Add yourself in the team bug roster if you want to help https://etherpad.opendev.org/p/nova-bug-triage-roster | 16:04 |
bauzas | gibi: could you please take the baton for this week ? | 16:04 |
gibi | bauzas: sorry I cannot promise to do meaningful triage this week either | 16:04 |
gibi | can we skip me? | 16:05 |
bauzas | okay | 16:05 |
gibi | I hope I will have more time in the coming months | 16:05 |
bauzas | melwitt is the next in the queue | 16:05 |
bauzas | melwitt: would you be okay ? | 16:05 |
bauzas | okay, anyone else wanting to take the baton ? if not, I'll take it again... :) | 16:07 |
bauzas | okay | 16:08 |
elodilles | i'd rather wait until we release 2024.1 Caracal o:) | 16:08 |
bauzas | #info bug baton is bauzas | 16:08 |
bauzas | moving on | 16:08 |
bauzas | #topic Gate status | 16:08 |
bauzas | #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs | 16:08 |
bauzas | #link https://etherpad.opendev.org/p/nova-ci-failures-minimal | 16:09 |
bauzas | #link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&pipeline=periodic-weekly Nova&Placement periodic jobs status | 16:09 |
bauzas | #info Please look at the gate failures and file a bug report with the gate-failure tag. | 16:09 |
bauzas | anything to discuss about the gate ? | 16:09 |
bauzas | oh, one point | 16:09 |
bauzas | I just saw slaweq's email on bare rechecks | 16:09 |
Uggla | o/ | 16:10 |
bauzas | as a reminder, please avoid saying only "recheck" when you have a gate failure | 16:10 |
dansmith | or "recheck unrelated" | 16:10 |
bauzas | and please add a good comment explaining what the root problem is | 16:10 |
dansmith | it can be unrelated, but at least say why | 16:10 |
bauzas | and try to find a second to understand the reason | 16:10 |
bauzas | yeah, or at least please tell in the comment which job fails and with which exception | 16:11 |
bauzas | #link https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/thread/7L6DSFN6GB65FBVHQATTLSQ7HWGBZJHM/ | 16:11 |
sean-k-mooney | honestly i dont think calling this out here helps | 16:12 |
bauzas | #info please avoid bare rechecks | 16:12 |
dansmith | sean-k-mooney: really? the tc has asked PTLs and other project leaders to help get this word out | 16:12 |
sean-k-mooney | the people that are here already know we should avoid it | 16:12 |
bauzas | it's not only the folks discussing here who look at our meetings :) | 16:13 |
sean-k-mooney | dansmith: sure but i don't think doing this in the team meeting helps | 16:13 |
dansmith | sean-k-mooney: it's exactly what we've asked for, AFAIK | 16:13 |
sean-k-mooney | ok but we discussed stopping doing this here in the past | 16:13 |
bauzas | I know some folks look at our minutes | 16:13 |
dansmith | and I think it does help because as bauzas says, people might not be here but read the minutes | 16:13 |
sean-k-mooney | ok but we stopped doing it in the meeting because we didn't think it was working | 16:13 |
dansmith | sean-k-mooney: 15% on nova are still bare, and a higher proportion are "recheck unrelated" so there's still work to do | 16:14 |
sean-k-mooney | we can go back to doing it | 16:14 |
bauzas | 15% is a bad value IMHO | 16:14 |
dansmith | bauzas: better than some projects for sure, but still work to do yeah | 16:14 |
bauzas | so I'll then discuss this every meeting | 16:14 |
sean-k-mooney | do we have the list of those? because i don't think just calling it out here is going to have an impact | 16:15 |
bauzas | #action bauzas to tell about the bare rechecks every week in our meeting | 16:15 |
bauzas | sean-k-mooney: please look at the link above | 16:15 |
sean-k-mooney | i have the email open | 16:15 |
bauzas | we had 16 bare rechecks against 105 rechecks, so we have 15% | 16:16 |
sean-k-mooney | i might see if i can pull out the bare rechecks and see what they were | 16:16 |
bauzas | fwiw, I can find which changes have bare rechecks | 16:16 |
bauzas | this is quite simple | 16:16 |
bauzas | but I don't want to ping folks saying "dude, don't do it" | 16:17 |
sean-k-mooney | there are a few people that tend to only review +1 | 16:17 |
sean-k-mooney | and i wonder if it's the same subset that are blind rechecking | 16:17 |
sean-k-mooney | i did ask one person to stop rechecking | 16:17 |
sean-k-mooney | because there was an actual issue | 16:18 |
sean-k-mooney | anyway lets move on we can chat about this after the meeting | 16:19 |
bauzas | okay | 16:20 |
bauzas | #topic Release Planning | 16:20 |
bauzas | #link https://releases.openstack.org/caracal/schedule.html#nova | 16:20 |
bauzas | #info RC1 delivered this morning | 16:20 |
bauzas | \o/ | 16:20 |
bauzas | #info master branch is now on Dalmatian | 16:21 |
elodilles | \o/ | 16:21 |
bauzas | as a reminder, we won't accept backports for the caracal branch until GA unless they are regression fixes | 16:21 |
bauzas | ditto with other stable backports given you need to have it in Caracal too | 16:22 |
bauzas | (unless of course the backports are from a caracal merged change) | 16:22 |
elodilles | +1 | 16:22 |
bauzas | #link https://etherpad.opendev.org/p/nova-caracal-rc-potential | 16:23 |
bauzas | I'll continue to provide this etherpad for the possible new Caracal RC tags | 16:23 |
bauzas | I just hope we won't find a regression | 16:23 |
bauzas | any question about RC phases or Dalmatian ? | 16:24 |
bauzas | looks not, moving on | 16:25 |
bauzas | #topic Review priorities | 16:25 |
bauzas | #action bauzas to provide a new etherpad for Dalmatian | 16:26 |
bauzas | for the moment, we don't have priorities | 16:26 |
bauzas | I'll start to look at some spec changes next week | 16:26 |
bauzas | #topic Stable Branches | 16:26 |
bauzas | elodilles: go for it | 16:26 |
elodilles | o/ | 16:26 |
elodilles | #info stable/2024.1 branch is open for bugfix backports | 16:26 |
elodilles | of course as bauzas wrote: only rc critical | 16:27 |
elodilles | for this time | 16:27 |
elodilles | #info stable gates seem to be OK | 16:27 |
bauzas | cool | 16:27 |
elodilles | (as we discussed, there are jobs that fail intermittently :/) | 16:27 |
elodilles | #info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci | 16:28 |
elodilles | that's it from me about stable branches | 16:28 |
bauzas | cool thanks | 16:29 |
bauzas | #topic vmwareapi 3rd-party CI efforts Highlights | 16:29 |
bauzas | fwiesel: ? | 16:29 |
fwiesel | #info CI comments on openstack/nova CR confirmed to be working. | 16:29 |
fwiesel | Best maybe seen here: | 16:30 |
fwiesel | #link https://review.opendev.org/dashboard/36571 | 16:30 |
fwiesel | #info Fixed FIPs instability. Tests quite reproducible. Only one error left with all proposed patches applied. | 16:30 |
fwiesel | The issue and fix were in retrospect easy... we have now a working pipeline. | 16:30 |
bauzas | \o/ | 16:31 |
fwiesel | The new branches brought in some instability, though, as the CI expects all services to actually have the branch. | 16:31 |
fwiesel | I.e. it now fails to check out stable/2024.1 for requirements, while we have it in nova. Not sure if I should do something about that one. | 16:32 |
fwiesel | Missed that one: | 16:32 |
fwiesel | #link http://openstack-ci-logs.global.cloud.sap/sapcc-nova--nmllh/tempest.html | 16:32 |
fwiesel | The tempest test with the last open problem. | 16:32 |
elodilles | now ~everything should have branched (well, requirements, devstack and grenade will be created around end of week) | 16:32 |
bauzas | are you sure you use the requirements from the stable releases ? | 16:32 |
bauzas | fwiesel: awesome about where you are | 16:32 |
fwiesel | Hopefully only cleaning up the rough edges remains... We are still missing a nicer UI with linkable lines. | 16:33 |
bauzas | okay, anything else ? | 16:34 |
fwiesel | No, not from my side. | 16:34 |
fwiesel | Questions? | 16:35 |
fwiesel | One remark: A test run with fixes in place takes roughly 20 minutes in total. | 16:35 |
bauzas | nope | 16:35 |
bauzas | wow | 16:35 |
fwiesel | bauzas: Back to you then | 16:36 |
bauzas | cool, that will be short | 16:36 |
bauzas | #topic Open discussion | 16:36 |
bauzas | anything anyone ? | 16:36 |
dansmith | a tempest run takes 20 minutes? that would be shockingly fast | 16:37 |
fwiesel | Well, it is a hefty machine, and I run it as parallel as I can. | 16:37 |
bauzas | ok guys, I think I can close the meeting | 16:38 |
bauzas | thanks all | 16:38 |
bauzas | #endmeeting | 16:38 |
opendevmeet | Meeting ended Tue Mar 19 16:38:23 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:38 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/nova/2024/nova.2024-03-19-16.00.html | 16:38 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/nova/2024/nova.2024-03-19-16.00.txt | 16:38 |
opendevmeet | Log: https://meetings.opendev.org/meetings/nova/2024/nova.2024-03-19-16.00.log.html | 16:38 |
fwiesel | Thank you too | 16:38 |
sean-k-mooney | well i guess it depends on how many tests you run. devstack can stack in about 8 mins if it's not io bound | 16:38 |
elodilles | thanks o/ | 16:39 |
fwiesel | The tempest run takes 10 minutes, by the way :) | 16:39 |
sean-k-mooney | yep but you're also only running 525-ish tests | 16:40 |
sean-k-mooney | at least based on | 16:40 |
fwiesel | Sure, are there tests I am missing? | 16:40 |
sean-k-mooney | http://openstack-ci-logs.global.cloud.sap/openstack-nova-909474-f54wv/tempest.html | 16:40 |
sean-k-mooney | fwiesel: well you're only running the compute api tests | 16:40 |
sean-k-mooney | you're not running the scenario tests or any of the tests for other services like cinder/neutron etc. | 16:41 |
sean-k-mooney | we don't need you to run all of those | 16:41 |
sean-k-mooney | but i was just providing context for others | 16:41 |
sean-k-mooney | i.e. its not tempest-full | 16:41 |
fwiesel | No, indeed it isn't. | 16:42 |
fwiesel | I am not even running all compute.api-tests, only those that are implemented for the vmware api driver. | 16:43 |
sean-k-mooney | ack that kind of makes sense | 16:43 |
sean-k-mooney | for the hardware-based CIs they took a similar approach | 16:44 |
sean-k-mooney | i.e. the old intel ci just ran the cpu pinning/sriov/numa related subset | 16:44 |
fwiesel | Possibly it would make sense to check what code-coverage the tempest test generates on the driver implementation, and adjust the tests accordingly. | 16:46 |
sean-k-mooney | dansmith: correct me if i'm wrong but we do not have the ability to ask placement for a host that does not have a resource type, right? like a forbidden trait; we would have to do that in a scheduler filter | 17:55 |
sean-k-mooney | related to our conversation about non-numa instances | 17:56 |
dansmith | sean-k-mooney: I really don't know currently, but I'm pretty sure that was one of the early limitations.. we could ask for inclusion but not exclusion | 17:56 |
sean-k-mooney | ack i think it's still that way. i believe we could have a new filter that does the following check | 17:57 |
sean-k-mooney | for a given instance look at the guest flavor/image to see if it is requesting numa-aware memory i.e. hw:mem_page_size | 17:58 |
sean-k-mooney | if it is not, check the allocation candidates for the host to see if it has either the file-backed memory trait or an inventory of PCPU | 17:58 |
sean-k-mooney | if it has either, reject the host | 17:59 |
sean-k-mooney | that would allow operators to opt in/out by not using the filter | 18:00 |
sean-k-mooney | we can add it to the default filters to enforce the correct behavior. | 18:00 |
dansmith | well, that feels like a more formal feature opt-out than a workaround named I_like_to_oom_my_instances=True | 18:00 |
dansmith | but it would mean an abort in scheduling instead of compute, which would be better | 18:01 |
sean-k-mooney | well we could perhaps not make it a filter but a config option that would work like a filter | 18:02 |
sean-k-mooney | i.e. have a workaround to remove this mandatory filter | 18:02 |
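A very rough sketch of the filter sean-k-mooney is describing is below. BaseHostFilter and host_passes() are nova's real scheduler-filter interface, but the filter name and the two `_host_has_*` helpers are hypothetical placeholders: how the file-backed-memory trait and PCPU inventory would actually be looked up from a filter is exactly the open question above.

```python
# Sketch only: the two helpers below are hypothetical, not existing nova APIs.
from nova.scheduler import filters


def _host_has_file_backed_memory_trait(host_state):
    """Hypothetical trait lookup (would need placement/host-state plumbing)."""
    raise NotImplementedError


def _host_has_pcpu_inventory(host_state):
    """Hypothetical PCPU inventory lookup (same caveat as above)."""
    raise NotImplementedError


class NonNumaMemoryGuardFilter(filters.BaseHostFilter):
    """Reject hosts needing NUMA-aware memory for guests that don't request it."""

    def host_passes(self, host_state, spec_obj):
        extra_specs = spec_obj.flavor.extra_specs or {}
        image_props = spec_obj.image.properties if spec_obj.image else None
        requests_page_size = (
            'hw:mem_page_size' in extra_specs or
            (image_props is not None and
             image_props.get('hw_mem_page_size') is not None))
        if requests_page_size:
            return True  # guest memory is NUMA-aware, any host is acceptable
        # Non-NUMA guest: reject hosts that would need NUMA-aware memory.
        return not (_host_has_file_backed_memory_trait(host_state) or
                    _host_has_pcpu_inventory(host_state))
```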
dansmith | sean-k-mooney: btw, I am still interested in the answer to my question here: https://review.opendev.org/c/openstack/nova/+/877773/9/nova/virt/libvirt/migration.py | 18:02 |
dansmith | because it seems like it should always fail and I don't know why it's not | 18:02 |
sean-k-mooney | reading that now. i'll need to check but if i remember correctly we pop the value from the dict before it ever gets there | 18:03 |
dansmith | that's a bad solution IMHO if so because it would break computes that don't have this code yet, | 18:04 |
dansmith | but I also couldn't find any such pop | 18:04 |
sean-k-mooney | your assertion is that calling int(a string containing a list of ints) | 18:04 |
dansmith | sean-k-mooney: no it's calling int() on the fake key | 18:04 |
sean-k-mooney | should always explode, so either we are not hitting that path in the test or there is something else going on here | 18:04 |
dansmith | that dict is supposed to have keys of '1', '2', etc and so it int()'s the keys, | 18:04 |
dansmith | but now there's a key of "dst_cpu_whatever" in the dict | 18:05 |
dansmith | and I don't know how it's not failing to int() that when it iterates the items | 18:05 |
dansmith | and I'm wondering if it's because we don't have test coverage that was hitting that code path with the three conditions being true | 18:05 |
sean-k-mooney | oh right "guest_id, host_ids in info.cpu_pins.items()" so guest_id would be the key | 18:07 |
dansmith | right | 18:08 |
dansmith | so I don't know how that's not breaking | 18:08 |
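To make the concern concrete, here is a tiny stand-alone illustration (not nova code) of why int()'ing the keys should blow up once a stashed, non-numeric key is in the dict; the key name is the one mentioned later in the discussion.

```python
# cpu_pins normally maps guest vCPU ids (as strings) to host CPU lists,
# and the XML-update loop casts each key with int().
cpu_pins = {
    "0": [4, 5],
    "1": [6, 7],
    "dst_cpu_shared_set_info": [0, 1, 2, 3],  # the stashed, non-numeric key
}
for guest_id, host_ids in cpu_pins.items():
    guest_id = int(guest_id)  # raises ValueError on the stashed key
```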
sean-k-mooney | artom: https://review.opendev.org/c/openstack/nova/+/631053 and https://bugs.launchpad.net/nova/+bug/1811886 are related to our conversation about the issue relating to not requesting a page size. | 19:23 |
sean-k-mooney | it's not the example i was giving when we talked but it's a similar, equally wrong case. | 19:23 |
opendevreview | Merged openstack/nova master: Remove nova.wsgi module https://review.opendev.org/c/openstack/nova/+/902686 | 19:42 |
opendevreview | Merged openstack/nova master: Add new nova.wsgi module https://review.opendev.org/c/openstack/nova/+/902687 | 19:42 |
sean-k-mooney | dansmith: i looked briefly but i don't know why that is not breaking things; my best guess is we are somehow mocking this and we don't have cpu_shared_set defined in any of the jobs | 19:48 |
sean-k-mooney | there is no reason why we can't have cpu_shared_set in one of our live migration jobs so i might try and enable that | 19:49 |
sean-k-mooney | i.e. 0-5 on one node and 2-7 on the other | 19:50 |
sean-k-mooney | that's pretty trivial to do so i'll add a patch on top of rene's series to do that | 19:51 |
sean-k-mooney | in theory rene's code should be running even if that is not defined but we should be able to test this in the gate anyway | 19:52 |
opendevreview | Stephen Finucane proposed openstack/nova master: Make overcommit check for pinned instance pagesize aware https://review.opendev.org/c/openstack/nova/+/631053 | 19:59 |
dansmith | sean-k-mooney: yeah, I'm unsure either, but I'm positive that means we're just not running the code we think we're running | 20:01 |
opendevreview | ribaudr proposed openstack/nova master: Fix [backport]: Live migrating to a host with cpu_shared_set configured will now update the VM's configuration accordingly. https://review.opendev.org/c/openstack/nova/+/877773 | 20:02 |
opendevreview | ribaudr proposed openstack/nova master: Fix: Live migrating to a host with cpu_shared_set configured will now update the VM's configuration accordingly. https://review.opendev.org/c/openstack/nova/+/903706 | 20:02 |
sean-k-mooney | dansmith: my guess right now is we are never populating the numa topology info in the migrate_data | 20:07 |
dansmith | sean-k-mooney: you mean not running a test that causes it to be populated right? | 20:08 |
dansmith | I mean...not running a test _case_ where it would be populated and that code path exercised | 20:08 |
sean-k-mooney | no i meant not setting the object that had the new data at all, but we appear to be | 20:10 |
artom | sean-k-mooney, dansmith, hrmm, I wonder if we're fine because... let me type this out | 20:10 |
artom | The NUMA live migration code kicks in only if the instance has a NUMA topology | 20:11 |
artom | Uggla's patch _always_ adds that hidden item in the dst_numa_info | 20:11 |
sean-k-mooney | so it always uses the numa migration code? | 20:12 |
artom | Hold on | 20:12 |
artom | No, I thought I had it... | 20:13 |
sean-k-mooney | the fact that we have to have this conversation and it's not obvious how it's working after a few of us looking at it is a code smell to me in either case | 20:16 |
sean-k-mooney | the code might be correct but it's too complex to debug | 20:16 |
artom | So to trigger it, all that should be required is live migrating a NUMA VM to a dest with cpu_shared_set, right? | 20:18 |
artom | That would cause Uggla's patch to stash the cpu_shared_set in the "hidden" key in dst_numa_info, and *should* cause the update XML code to trip over and explode? | 20:19 |
artom | We've _definitely_ done that ^^ in whitebox, and nothing has exploded | 20:19 |
sean-k-mooney | artom: no | 20:19 |
sean-k-mooney | to trigger the bug you need to migrate a non-numa vm | 20:19 |
sean-k-mooney | oh | 20:20 |
sean-k-mooney | you mean the driver issue | 20:20 |
artom | Yeah, the thing that we're trying to understand how it even works | 20:20 |
sean-k-mooney | yes if the loop over the keys was broken then you would have to migrate a pinned instance | 20:20 |
sean-k-mooney | right but since this is always set https://review.opendev.org/c/openstack/nova/+/877773/10/nova/virt/libvirt/driver.py#10028 | 20:22 |
sean-k-mooney | any numa instance should trigger this code path | 20:22 |
artom | Exactly | 20:22 |
artom | I'm expecting migrate_data.dst_numa_info.cpu_pins["dst_cpu_shared_set_info"] to always be there | 20:23 |
artom | Ohhhhh | 20:25 |
artom | We just clobber the entire object for NUMA live migration: https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L10106 | 20:25 |
sean-k-mooney | but again numa instance | 20:26 |
artom | And that ^^ happens later in the flow | 20:26 |
sean-k-mooney | sorry that was in my buffer | 20:26 |
sean-k-mooney | ya that would make sense | 20:26 |
sean-k-mooney | as that is where it would normally be created | 20:27 |
artom | That's after we make the claim on the dest, and Uggla's code was adding its thing earlier in check_can_live_migrate_destination | 20:27 |
artom | So it just gets overwritten | 20:27 |
sean-k-mooney | is it in check_can_live_migrate_destination? i thought it was in pre_live_migration (at dest) | 20:28 |
dansmith | but as sean-k-mooney said now and I said earlier, this is hiding demons in the details on so many levels, especially since this transforms to a real object property in the next patch | 20:28 |
dansmith | artom: just a thought - are you running the tests you expect to pass now on the lower patch or only the stack? if the latter, then this code is "gone" by then right? | 20:29 |
artom | sean-k-mooney, https://review.opendev.org/c/openstack/nova/+/877773/10/nova/virt/libvirt/driver.py#9958 | 20:29 |
sean-k-mooney | oh it's in check_can_live_migrate_destination in the driver; i was thinking of the conductor method with a similar name | 20:29 |
artom | dansmith, definitely the backportable one | 20:29 |
dansmith | okay just checking | 20:30 |
artom | dansmith, but I think we've figured it out :) | 20:30 |
dansmith | yeah, but I haven't figured out yet if you're thinking that we're not covering the case where we would trip over that code, or if they're mutually exclusive somehow (not that it makes it okay) | 20:31 |
sean-k-mooney | well at a minimum this shows we are missing a functional test to demonstrate that the xml is updated properly between the source and destination hosts | 20:32 |
artom | dansmith, they're not mutually exclusive, they definitely both run, and the NUMA live migration code runs _after_ the hack, so it clobbers the entire dst_numa_info object that was used as the hiding place for the "secret" cpu shared set info | 20:33 |
dansmith | I don't understand how that can be | 20:34 |
artom | dansmith lulz another call ^_^ ? | 20:34 |
sean-k-mooney | its because of the 3 conditions in the if | 20:34 |
dansmith | we have to put the dst info into the blob so it can be used to generate the xml right? | 20:34 |
sean-k-mooney | if ('cpu_pins' in info and | 20:34 |
sean-k-mooney | 'cell_pins' in info and | 20:34 |
sean-k-mooney | 'emulator_pins' in info): | 20:34 |
sean-k-mooney | in the non-numa case only one of those 3 'in' checks is true | 20:34 |
sean-k-mooney | so we never enter the loop | 20:35 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/877773/9/nova/virt/libvirt/migration.py#186 | 20:35 |
sean-k-mooney | in the numa case the numa live migration code clobbers the object that rene modified | 20:35 |
sean-k-mooney | so it just happens to work because we get lucky: in the case where it's not clobbered we don't enter the loop | 20:36 |
dansmith | and to be clear, is that always the case, or is it because we're missing a test scenario where we could get unlucky? | 20:36 |
sean-k-mooney | cell_pins and emulator_pins are not defined in the non-numa case | 20:36 |
artom | The case where it's not clobbered is when a VM doesn't have a NUMA topology | 20:36 |
dansmith | I guess the point is. yeah that ^ | 20:36 |
sean-k-mooney | that's always the case | 20:36 |
artom | So by pure luck we've made them functionally mutually exclusive | 20:37 |
sean-k-mooney | when we don't have a numa guest, cell_pins and emulator_pins are not set | 20:37 |
sean-k-mooney | and in the numa case the dst_numa_info is regenerated by the numa live migration code | 20:38 |
sean-k-mooney | so we never execute _update_numa_xml with cpu_pins that contains the stashed value | 20:39 |
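Putting that explanation into a tiny stand-alone illustration (again not nova code, just the shape of the logic described above): the three-way `in` guard means the int() loop is only reached for NUMA guests, and for those the stashed key has already been clobbered away by the NUMA live-migration code.

```python
# Stand-alone illustration of the two paths described above (not nova code).
def update_numa_xml_guard(info):
    if ('cpu_pins' in info and
            'cell_pins' in info and
            'emulator_pins' in info):
        return [int(guest_id) for guest_id in info['cpu_pins']]
    return None  # guard not satisfied, int() never runs

# Non-NUMA guest: only cpu_pins (holding the stashed key) is present,
# so the guard is False and nothing explodes.
non_numa = {'cpu_pins': {'dst_cpu_shared_set_info': [0, 1, 2, 3]}}
assert update_numa_xml_guard(non_numa) is None

# NUMA guest: dst_numa_info is rebuilt later by the NUMA live-migration code,
# dropping the stashed key, so only numeric guest-vCPU keys remain.
numa = {'cpu_pins': {'0': [4], '1': [5]}, 'cell_pins': {}, 'emulator_pins': []}
assert update_numa_xml_guard(numa) == [0, 1]
```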
dansmith | usually when I say "this will be a nightmare of complexity to debug in the future" I have to wait years to collect on that debt.. glad this time it was under 24 hours :D | 20:39 |
sean-k-mooney | me too because i probably would have been the one trying to figure this out in a few years :) | 20:39 |
dansmith | yep | 20:40 |
artom | In a hypothetical few years wherein we've chosen to merge the haxx ;) | 20:41 |
dansmith | s/we've/we'd/ yeah | 20:41 |
sean-k-mooney | Uggla: have you had time to sync with artom on the change of direction | 20:41 |
sean-k-mooney | i need to go eat so ill call it a day there | 20:45 |
Uggla | sean-k-mooney, more or less. But I guess Artom and I will sync tomorrow.... | 20:45 |
sean-k-mooney | i'm glad we know why it was working; i'm not happy with how it was working | 20:45 |
spatel_ | sean-k-mooney I believe I talked about this in the past. Just re-checking again. Does snapshot support include the memory footprint? | 20:51 |