Friday, 2024-09-13

*** __ministry is now known as Guest332401:33
opendevreviewSam Morrison proposed openstack/nova master: Filter out deleted instances when looking for build timouts  https://review.opendev.org/c/openstack/nova/+/88012503:59
fricklerstephenfin: sean-k-mooney: note that we release osc 7.1.0 yesterday, so likely related to that. 7.0.0 was also broken and never made it into u-c. I can also reproduce the above issue locally fwiw05:59
fricklermaybe let's move this to sdks channel06:01
*** ralonsoh_ is now known as ralonsoh06:55
*** bauzas_ is now known as bauzas07:46
gibisean-k-mooney: I have small things in  https://review.opendev.org/q/topic:%22bug/2080556%22 but I'm overall positive.07:56
gibibauzas: I have a nit in the prelude https://review.opendev.org/c/openstack/nova/+/92899508:02
gibibauzas: also I think we can go with the RC1 for placement even if we need to wait with the nova RC108:03
gibibauzas: also we have negative answer for our proposal of fixing the live migration issue https://bugs.launchpad.net/nova/+bug/208043608:04
bauzascool, I'm just busy with another task08:05
bauzasyou're next in my waiting queue :)08:05
gibiI'm not waiting :)08:05
opendevreviewSylvain Bauza proposed openstack/nova master: Add Dalmatian prelude section  https://review.opendev.org/c/openstack/nova/+/92899508:14
bauzasgibi: I have full attention to the regression bug now08:26
bauzasso, that sounds the conditional clause isn't enough08:26
bauzassomething in the cleanup really deletes the instance path08:27
gibiI just published a set of observations in https://review.opendev.org/c/openstack/nova/+/92897008:29
gibiIf I have to guess the complication is around: > NUMA pinned VMs and an NFS mount for the VMs, but we also use cinder boot volumes08:32
gibiI guess we won't be able to avoid setting up a multinode devstack..08:33
bauzasI tried yesterday to write a functional test but that was hard08:36
bauzasI would have to create a specific tempfs for the instance paths in order to correctly reproduce it08:37
bauzasbut I failed finding where we mock the paths08:37
bauzas++08:37
bauzas(oops, wrong window)08:37
gibiI will try to set up multinode devstack with NFS but I cannot promise it will work today08:40
bauzasI have two nodes, I can also try to modify them08:41
bauzaseither way, we'll probably need to come to a decision today or monday08:41
bauzaswe can't delay RC1 more08:41
bauzasbut I'd like some core quorum 08:42
gibian alternative is to revert the power mgmt live migration fix. Declare that that power mgmt does not work with live migration and fix the whole thing later08:43
bauzasthat's what I had in mind08:44
bauzasor a third alternative is ship power management but let the call be optional08:44
bauzasif you don't setup power management, you shouldn't be impacted08:44
gibiwe could hack a fix together where shared storage live migration works if power mgmt is not used, and power mgmt live migration works if no shared storage is used (basically the original proposal from the reported) but that is uggly08:44
gibiand power mgmt with shared storage will still not work08:45
bauzasthat's the concern I have with the current master, which inconditionnally says you're power manageable08:45
bauzasI wish we had a flag on the migrate_data object that would signal that info08:45
bauzaslike we did for mdevs or vpmems08:45
gibithat would not remove the problem that shared storage live migration does not work if you have power mgmt, mdevs, or vpmems. 08:46
gibias those would trigger calling clenaup08:46
bauzasyup, but that's something we can doc08:46
bauzashere the problem is that shared storage is broken *anyway*08:47
bauzasand you can't opt out from it08:47
gibiyepp it seem shared storage live migration only works we we don't call cleanup08:48
gibibut I'm not 100% sure whw08:48
gibiwhy08:48
gibido you have any comment on my las comment https://review.opendev.org/c/openstack/nova/+/928970/comment/c28b9ea3_30ed15a5/08:49
gibiespecially around deleting the instance dir when the VM is volume backed?08:49
bauzaseven if the instance is volume backed, we still have an instance dir, right?08:51
bauzaswe just don't have disks08:51
bauzasI could imagine that they have NFS on /var/lib/instances 08:53
bauzasso the BFV instances are present on that shared path08:53
bauzasin that case, 08:54
bauzascleanup_instance_dir = migrate_data.is_shared_block_storage08:54
bauzasit will cleanup the dir, right ?08:54
bauzasoh wait no08:55
bauzasthat's not shared storage, that's shared RBD image backend08:55
gibithe instance dir for a volume booted instance (without config drive) only holds the console.log fiel nothing else08:55
bauzasI don't disagree, but that's still an instance dir08:56
gibiis_shared_block_storage will be true if the instance is booted from volume08:56
gibi(and in other cases too)08:56
bauzasokay, so that's why this is deleting the instance dir08:56
gibiIs it a problem if we delete hte instance dir08:56
gibi?08:57
bauzasgood question08:57
bauzasas you said, the instance dir is quite empty08:57
bauzaswhere is stored the domain definition ?08:57
*** ralonsoh_ is now known as ralonsoh08:59
gibithat is nova does not store the xml09:01
gibi* nova does not store the xml09:02
gibiI'm not sure libvirt stores it by default09:02
gibiit is under /run/libvirt/qemu/instance-00000002.xml09:03
gibiin devstack09:03
bauzasokay09:04
bauzasyour question is actually good : does it harm if we delete an instance dir from a BFV instance that's shared ?09:05
bauzashttps://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/virt/libvirt/driver.py#L1708-L171109:06
bauzasI enjoy mdbooth's comment : "I'm pretty sure this is wrong"09:06
bauzasfrom what I read from both comments is that this is EXPECTED that we delete the instance dir if the instance is BFV09:07
bauzasnot sure we do support shared storage with BFV09:07
bauzassee ?09:07
gibiI see09:12
gibiI'm not sure wht mdbooth's comment really mean. I.e. does the comment really implies that the logic is wrong and we should change it?09:13
gibiwe need to figure out if live migration on shared storage with a volume backed VM actually fails if the instance dir is deleted09:16
gibiwe never seen the real failure from the reporter09:16
gibithey just stated that the dir is deleted09:16
gibithe only thing they said is09:17
gibi"making the VM unusable for future operations."09:17
gibiI need to drop for an hour but will continue from here09:17
bauzas🤷‍09:17
bauzassure, let's continue to discuss this later once more people are here09:18
gibione thing we can suggest is to only delete the instance dir if the instance is volume backed but the instance path is not shared09:19
gibiI can amend the my suggested patch with that logic09:19
gibiand they can test09:19
bauzasyeah, seems a good point09:31
bauzasbut this would revert the previous logic :)09:31
elinuxany one available here ?10:01
elinuxAnyone online 10:08
supamatto/10:08
fricklerstephenfin: sean-k-mooney: https://review.opendev.org/c/openstack/python-openstackclient/+/929236 should fix the second part of the evacuate command, not sure about the microversion thing yet10:41
sean-k-mooney frickler so i think this is an sdk regression?10:45
sean-k-mooneyi guess it was intorduced with https://github.com/openstack/python-openstackclient/commit/e6dc0f39c0891ea551867b77663a463b5f76987c perhaps10:46
sean-k-mooneyfrickler: the actul password handeling code has not changed in over a year10:48
sean-k-mooneybut the args are now being passed to an sdk client instead of nova client10:48
opendevreviewMatthew Heler proposed openstack/nova master: Fix regression with live migration on shared storage  https://review.opendev.org/c/openstack/nova/+/92897010:50
sean-k-mooneyfrickler: the microversion we used is https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#id6210:50
sean-k-mooneywhich removed foraced evacuates10:50
supamattbauzas the latest patch I posted for 928970 works with shared storage now10:51
sean-k-mooneyfrickler: the password is adminPass10:52
sean-k-mooneyin the rest api10:53
opendevreviewMatthew Heler proposed openstack/nova master: Fix regression with live migration on shared storage  https://review.opendev.org/c/openstack/nova/+/92897010:54
gibisupamatt: nice, I think that was what I imagined above by "one thing we can suggest is to only delete the instance dir if the instance is volume backed but the instance path is not shared"10:59
gibisupamatt: so the patch you posted proves my theory. 10:59
gibilets wait for bauzas as well but I think we can make your patch mergeable by adding some unti tests around it and a release notes11:00
supamatt+111:05
fricklersean-k-mooney: yes, I think that that is a regression that was missed in the commit you mention11:13
gibisupamatt: if you need pointers / help to add unit tests and a release notes then let me know11:13
supamattgibi: for the unit test, I'm not sure how best to do that. the release notes I can do11:35
gibisupamatt: for he compute manager change you can extend the unit test coverage here based on the existing examples https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/tests/unit/compute/test_compute_mgr.py#L11385-L1144811:51
opendevreviewMatthew Heler proposed openstack/nova master: Fix regression with live migration on shared storage  https://review.opendev.org/c/openstack/nova/+/92897011:54
gibisupamatt: for the driver change here is an example test case you can use https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/tests/unit/virt/libvirt/test_driver.py#L20993-L2101311:54
bauzassorry, was offline due to some network change12:59
bauzassupamatt: gibi : can you please give me a summary?13:00
gibibauzas: supamatt: basically implemented what I described before I dropped this morning.13:02
gibiSo the patch now only deletes the instance dir if the instance is booted from volume but the instance directory is not on shared storage13:03
gibisupamatt: stated that this solves their issue13:03
gibiand it is aligned with my understanding13:03
gibinow supamatt works on unit test coverage for the change13:03
bauzasexcellent then13:08
bauzasI tried to work on a functest but wasn't simple13:08
bauzasbut we can try to write a specific shared storage test if we want in another patch13:09
gibigiven the time pressure and the fact that we have first hand info from the reporter that it solves the regression I'm OK not to have functional reproduce for this13:09
gibiif we can add functional test later that is fine too13:09
bauzasyup13:10
bauzasbut yeah, I think we need some way to check shared storage by a functest13:10
elinux_I have observed somethign while testing the nova api calls. after the request comes in , the request spec object is built. after that I have placed a debugger at the scheduler level. after the pdb prompt appears, by that time when I inspect the spec obj, it is loosing the metadata key,value if passed by the user. but when I dont use the debugger and use only a simple print statement , it is printing the metadata value13:15
elinux_os my question is , is there a time span after which the data within that object is removed ?13:15
elinux_*so my13:15
gibielinux: probably the rpc all to the scheduler is timed out while you are debugged the scheduler13:36
elinuxgibi: but all other details r intact for the request. Only metadata is missing 13:45
supamattbauzas gibi: yea the unit test doesn't look simple to me either ;S13:45
gibisupamatt: If you are stuck I can try to create a follow up patch top of yours with the unit tests13:46
bauzassupamatt: I was talking of functional tests which are in another directory13:46
bauzasyes, we can help on the UTs13:46
bauzashave you found the test methods to update ?13:46
bauzaswe need another core to chime in https://review.opendev.org/c/openstack/nova/+/92899513:47
bauzaselodilles: I think we will reasonable release nova RC1 by monday13:48
supamattIf you guys wouldn't mind adding that UT, that'd be awesome. This patch seems like a big regression. I've done a number of openstack deployments over the last decade and NFS being used as shared storage is in almost a quarter of those deployments.13:48
gibisupamatt: sure, I can propose the UTs13:49
elodillesbauzas: ACK, sounds good13:49
sean-k-mooneyfrickler: sorry was gone for lunch with some local readhatter 13:50
bauzaselodilles: I'll +1 the placement RC113:50
elodillesthanks o/13:50
elodilleswe still have 2 weeks until the final RC deadline o:)13:50
sean-k-mooneyfrickler: so the password shoudl be valid for the requested microversion but i think i agree that just not passing it when its not preseent on the command line will work around it13:50
sean-k-mooneybut there might also be bugs in the sdk about that13:51
elodillesbut let's hope RC1 will be enough :] fingers crossed :X13:51
fricklersean-k-mooney: the real issue is that the parameter is named admin_password in the sdk, that change was missed while switching evacuate to use the sdk13:54
fricklerthe sdk always sends none in the request anyway, so I reverted that part in PS213:55
sean-k-mooney well the real issue is its named adminPassword in the api and the sdk and clinet were both calling it somethign else13:55
sean-k-mooneyfrickler: the sdk has historically renamed the parmater form there real api names13:55
sean-k-mooneywhich causes a lot of similear bug that will hopefully stop in the futrue13:56
sean-k-mooneyif we complete the openapi stuff and starge generaging parts of the sdk form the openapi definitions13:56
sean-k-mooneyactully its adminPass in the api13:57
artomgibi, bauzas, shouldn't mdev and vpmem live migration have uncovered the problem before? They're also cases where we started calling cleanup, but the destroy_disks logic was incorrect?13:58
bauzasmaybe13:59
sean-k-mooneyeither of those are tested in ci currently13:59
sean-k-mooney*neitehr13:59
bauzasbut here, we now call cleanup everytime if you have dst_numa_info13:59
sean-k-mooneyso i have a ci job that will test this we just have not merged it yet14:00
artomRight, but before we had this:14:00
artom            do_cleanup = (not migrate_data.is_shared_instance_path or14:00
artom                          has_vpmem or has_mdevs)14:00
bauzaswhile for mdev, it was only if you were migrating the mdevs14:00
sean-k-mooneyoh actully no becasue it need nfs too14:00
artomSo we called cleanup every time we had vpmem or mdevs14:00
sean-k-mooneyyes14:00
artomSo if any of those were on NFS, they would see their disk deleted as well?14:00
gibiartom: I'm fairly certain that mdev + shared storage would trigger it too14:00
sean-k-mooneybecause we do need to clean up the persitent memeory14:00
artomAnd we just never caught it?14:00
sean-k-mooneywe did not supprot mdev live migation until caracal14:01
bauzaswe're calling it for mdevs because we do something in cleanup about mdevs14:01
artomRight, I understand _why_ we need to call it14:01
gibiartom: I think we don't have much coverage with NFS under nova instances path14:01
sean-k-mooneymy point is that we have no test coverage of any of the condtion in ci today14:01
artomI'm saying it's suspect that power management live migration is the first thing to have caught the fault destroy_disks logic14:01
sean-k-mooneywe dont have coverage fo mdev or pmem in ci14:01
artomOK, mdev is new in Caracal, but vpmem? That's been around for a while...14:02
gibiartom: correct the power mgmt condition was vide enough to hit people14:02
bauzasrather the problem is that we don't really test shared instances14:02
sean-k-mooneyartom: intel killed pmem remeber14:02
* dansmith sean-k-mooney: have you looked at the post_failures on this? https://review.opendev.org/c/openstack/nova/+/928829/5?tab=change-view-tab-header-zuul-results-summary14:02
artomgibi, OK, that makes sense I guess.14:02
gibias it hits all deployments with NFS + numa14:02
sean-k-mooneydansmith yep frickler has a possibel fix14:02
dansmith(stupid client auto /me'ing)14:02
sean-k-mooneyhttps://review.opendev.org/c/openstack/python-openstackclient/+/92923614:03
bauzasmigrating instances with mdevs is not a problem if you don't have shared storage14:03
artomI'm always terrified when I'm touching that code. I had to do it with vTPM live migration as well14:03
dansmithsean-k-mooney: okay it seemed relevant to your discussion above but I wasn't sure14:03
gibiartom: the fix propsed will work for not just NUMA but also for mdev and vpmem instances too14:03
bauzasin order to see this problem, you would need to have the CI checking *both* mdevs *and* shared storage14:03
artomgibi, right, because power management isn't the cause, it just happened to have exposed the bug14:03
gibiyes14:03
bauzasbecause we default to check *every* numa instance14:03
gibiyepp14:04
artomMaybe that check should have taken into account the power management config option :P14:04
sean-k-mooneybauzas: we can test this in ci going forward 14:04
bauzaswe could have another field in the migrate_data that would say "heh, my compute supports power management, so please do it too"14:04
sean-k-mooneywe will just do the numa live mgiration testing in a multi node nfs job14:04
sean-k-mooneyi was going to enable it in the ceph job 14:04
sean-k-mooneybut i can just change which job i enable numa migration in14:05
artomsean-k-mooney, so actually, would this bug have been visible in CI? The instance might remain ACTIVE even if its disk is gone...14:05
bauzasright, we don't have any exceptions14:05
artomWe would need to do something like soft-reboot it after the live migration?14:05
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/91384214:05
sean-k-mooneyartom: it should have caused the instance to crash and caused the connectivy test to fail when we ssh in14:06
gibiartom: in the volume booted case I think the instance will remain ACTIVE. For the shared storage case I don't know what happen when the local disk is deleted under it14:06
sean-k-mooneywe also have one broken the back and forth live migration14:06
artomAh, yeah, that would probably expose this14:06
artomOK, I think I've saturated - I've literally had a feverish night tossing and turning because of some mystery illness one of the kids (presumably the baby) brought home. I'll do go dumb paperwork stuff :P14:07
bauzassean-k-mooney: does the ceph job check shared storage with cephfs ?14:08
sean-k-mooneybauzas: no14:08
bauzasor is it just rbd ?14:08
bauzasokay, then we wouldn't see the bug14:08
sean-k-mooneyits images_type=rbd14:08
gibiartom: take care14:08
bauzasI see14:08
sean-k-mooneyright i said i need to create a new job for this14:08
sean-k-mooneyi tought we had an nfs job already14:08
bauzasartom: joys of being parent of kids going back at school14:09
sean-k-mooneywe have         - devstack-plugin-nfs-tempest-full:14:09
* gibi goes typing in UTs14:09
sean-k-mooneyin the experimatal line but not in check14:09
sean-k-mooneybauzas: basically we have 2 chioices i can continue enabling numa testing in the ceph job14:09
bauzasso you'd add numa live-migration tests in there ?14:09
sean-k-mooneyor i could pivort and create a shared storage migration job and put them in that14:10
bauzasI'm cool but we're just adding more tests to the CI14:10
sean-k-mooneywe dont have a job that would catch this currently14:10
sean-k-mooneywe do not have any testing with nova on shared storage14:10
sean-k-mooneybauzas: so im saying i can create one and also put the numa stuff in it, or i can proceed with putting the numa stuff in teh ceph jobs and we can create a seperate nova-shared-stroage-migration job14:11
sean-k-mooneybauzas: do you have a prefernce?14:12
bauzaskill shared storage with fire ? 14:12
sean-k-mooney:)14:12
bauzasyou can have a shared image backend, why would you mind using NFS which would create you don't know ?14:13
sean-k-mooneywell i cant argue agaisnt that but we coudl certenly make it less teribad in the future if we included this as part of the image backend refactor14:13
bauzasanother good case, yes14:13
sean-k-mooneylike if we had image_typs=nfs that would simplfy things a lot14:14
sean-k-mooneybauzas: lets pause this dicussion for now and we can loop back to the tempest coverage after rc114:15
sean-k-mooneybauzas: https://github.com/openstack/devstack-plugin-nfs currently just supprot seting up cinder on nfs so to test this i woudl eiter need to extend that or just put it in the nova devstack plugin which is honestly simpler14:17
sean-k-mooneywhen we have a little more time ill try and figure out which job to extend to give us the most test coverage without needing extra josb14:18
sean-k-mooneyim thinkg of using  nova-ovs-hybrid-plug to test shared storage and numa and then just leave the ceph job as is14:21
sean-k-mooneysummerised that here https://review.opendev.org/c/openstack/nova/+/913842/comments/ee7e862a_c1fd1b8c14:25
sean-k-mooneyfrickler: is there a dnm that testing your openstack client patch14:26
sean-k-mooneyfrickler: if not i need to rebase one of my patches to adress a nit anyway so i might add a depends on to it14:27
gibibauzas: which way you want to land the shared storage fix: a) I add the UT to supamatt's commit directly or b) I push the UT separately and then I can +2 the commit with the fix still14:56
bauzascan we squash both ?14:58
bauzasyou can +2 the patch, I'm fine14:59
bauzasI'll explain in my comment why it's fine14:59
bauzas(and I want to stop by the top of the hour :) )14:59
gibiOK I will push the squash in 5 mins 15:01
opendevreviewOpenStack Release Bot proposed openstack/placement stable/2024.2: Update .gitreview for stable/2024.2  https://review.opendev.org/c/openstack/placement/+/92929015:08
opendevreviewOpenStack Release Bot proposed openstack/placement stable/2024.2: Update TOX_CONSTRAINTS_FILE for stable/2024.2  https://review.opendev.org/c/openstack/placement/+/92929115:08
opendevreviewOpenStack Release Bot proposed openstack/placement master: Update master for stable/2024.2  https://review.opendev.org/c/openstack/placement/+/92929215:08
opendevreviewBalazs Gibizer proposed openstack/nova master: Fix regression with live migration on shared storage  https://review.opendev.org/c/openstack/nova/+/92897015:09
gibibauzas: ^^15:09
gibisorry it was 8 minutes15:09
gibibut I can offer an extended commit message in return :)15:09
gibicc artom: ^^15:12
gibiand I need to drop now15:16
opendevreviewsean mooney proposed openstack/nova master: allow upgrade of pre-victoria InstanceNUMACells  https://review.opendev.org/c/openstack/nova/+/92918715:16
bauzassean-k-mooney: can you please review https://review.opendev.org/c/openstack/nova/+/928970 ?15:22
* bauzas will stop by now15:22
fricklersean-k-mooney: no, I only tested locally, feel free to add a dep to verify15:22
sean-k-mooneyfrickler:i have +2 the client patch after verifyign the filed in the sdk and novaclient15:24
sean-k-mooneyfrickler: i have also added it as a depedn on in https://review.opendev.org/c/openstack/nova/+/92918715:24
sean-k-mooneyaltough i dont knwo fi the jobs are set up to pull in osc form souce15:24
sean-k-mooneybauzas: sure ill take a look now15:25
bauzasthanks, now I need to go off15:25
sean-k-mooneyfrickler: i think at this point we just need someone form the sdk team to +w the clinet patch15:25
sean-k-mooneybauzas: o/ enjoy your weekend15:26
bauzasthanks15:26
sean-k-mooney+2w, the approch looks good. i woudl have prefer a functioanl regression test but the coverage is enough to proceed with. ill look at tempest ci coverage for this next week if i have time15:30
*** bauzas_ is now known as bauzas16:44
*** bauzas_ is now known as bauzas19:44
opendevreviewMerged openstack/nova master: Add Dalmatian prelude section  https://review.opendev.org/c/openstack/nova/+/92899520:02
*** bauzas_ is now known as bauzas23:27

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!