Friday, 2024-09-13

*** __ministry is now known as Guest3324		01:33
opendevreview	Sam Morrison proposed openstack/nova master: Filter out deleted instances when looking for build timouts https://review.opendev.org/c/openstack/nova/+/880125	03:59
frickler	stephenfin: sean-k-mooney: note that we released osc 7.1.0 yesterday, so likely related to that. 7.0.0 was also broken and never made it into u-c. I can also reproduce the above issue locally fwiw	05:59
frickler	maybe let's move this to sdks channel	06:01
*** ralonsoh_ is now known as ralonsoh		06:55
*** bauzas_ is now known as bauzas		07:46
gibi	sean-k-mooney: I have small things in https://review.opendev.org/q/topic:%22bug/2080556%22 but I'm overall positive.	07:56
gibi	bauzas: I have a nit in the prelude https://review.opendev.org/c/openstack/nova/+/928995	08:02
gibi	bauzas: also I think we can go with the RC1 for placement even if we need to wait with the nova RC1	08:03
gibi	bauzas: also we have negative answer for our proposal of fixing the live migration issue https://bugs.launchpad.net/nova/+bug/2080436	08:04
bauzas	cool, I'm just busy with another task	08:05
bauzas	you're next in my waiting queue :)	08:05
gibi	I'm not waiting :)	08:05
opendevreview	Sylvain Bauza proposed openstack/nova master: Add Dalmatian prelude section https://review.opendev.org/c/openstack/nova/+/928995	08:14
bauzas	gibi: I have full attention to the regression bug now	08:26
bauzas	so, it sounds like the conditional clause isn't enough	08:26
bauzas	something in the cleanup really deletes the instance path	08:27
gibi	I just published a set of observations in https://review.opendev.org/c/openstack/nova/+/928970	08:29
gibi	If I have to guess the complication is around: > NUMA pinned VMs and an NFS mount for the VMs, but we also use cinder boot volumes	08:32
gibi	I guess we won't be able to avoid setting up a multinode devstack..	08:33
bauzas	I tried yesterday to write a functional test but that was hard	08:36
bauzas	I would have to create a specific tmpfs for the instance paths in order to correctly reproduce it	08:37
bauzas	but I failed finding where we mock the paths	08:37
bauzas	++	08:37
bauzas	(oops, wrong window)	08:37
gibi	I will try to set up multinode devstack with NFS but I cannot promise it will work today	08:40
bauzas	I have two nodes, I can also try to modify them	08:41
bauzas	either way, we'll probably need to come to a decision today or monday	08:41
bauzas	we can't delay RC1 more	08:41
bauzas	but I'd like some core quorum	08:42
gibi	an alternative is to revert the power mgmt live migration fix. Declare that power mgmt does not work with live migration and fix the whole thing later	08:43
bauzas	that's what I had in mind	08:44
bauzas	or a third alternative is ship power management but let the call be optional	08:44
bauzas	if you don't setup power management, you shouldn't be impacted	08:44
gibi	we could hack a fix together where shared storage live migration works if power mgmt is not used, and power mgmt live migration works if no shared storage is used (basically the original proposal from the reporter) but that is ugly	08:44
gibi	and power mgmt with shared storage will still not work	08:45
bauzas	that's the concern I have with the current master, which unconditionally says you're power manageable	08:45
bauzas	I wish we had a flag on the migrate_data object that would signal that info	08:45
bauzas	like we did for mdevs or vpmems	08:45
gibi	that would not remove the problem that shared storage live migration does not work if you have power mgmt, mdevs, or vpmems.	08:46
gibi	as those would trigger calling cleanup	08:46
bauzas	yup, but that's something we can doc	08:46
bauzas	here the problem is that shared storage is broken anyway	08:47
bauzas	and you can't opt out from it	08:47
gibi	yepp it seems shared storage live migration only works when we don't call cleanup	08:48
gibi	but I'm not 100% sure why	08:48
gibi	do you have any comment on my last comment https://review.opendev.org/c/openstack/nova/+/928970/comment/c28b9ea3_30ed15a5/	08:49
gibi	especially around deleting the instance dir when the VM is volume backed?	08:49
bauzas	even if the instance is volume backed, we still have an instance dir, right?	08:51
bauzas	we just don't have disks	08:51
bauzas	I could imagine that they have NFS on /var/lib/instances	08:53
bauzas	so the BFV instances are present on that shared path	08:53
bauzas	in that case,	08:54
bauzas	cleanup_instance_dir = migrate_data.is_shared_block_storage	08:54
bauzas	it will cleanup the dir, right ?	08:54
bauzas	oh wait no	08:55
bauzas	that's not shared storage, that's shared RBD image backend	08:55
gibi	the instance dir for a volume booted instance (without config drive) only holds the console.log file, nothing else	08:55
bauzas	I don't disagree, but that's still an instance dir	08:56
gibi	is_shared_block_storage will be true if the instance is booted from volume	08:56
gibi	(and in other cases too)	08:56
bauzas	okay, so that's why this is deleting the instance dir	08:56
gibi	Is it a problem if we delete the instance dir	08:56
gibi	?	08:57
bauzas	good question	08:57
bauzas	as you said, the instance dir is quite empty	08:57
bauzas	where is the domain definition stored ?	08:57
*** ralonsoh_ is now known as ralonsoh		08:59
gibi	that is nova does not store the xml	09:01
gibi	* nova does not store the xml	09:02
gibi	I'm not sure libvirt stores it by default	09:02
gibi	it is under /run/libvirt/qemu/instance-00000002.xml	09:03
gibi	in devstack	09:03
bauzas	okay	09:04
bauzas	your question is actually good : does it harm if we delete an instance dir from a BFV instance that's shared ?	09:05
bauzas	https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/virt/libvirt/driver.py#L1708-L1711	09:06
bauzas	I enjoy mdbooth's comment : "I'm pretty sure this is wrong"	09:06
bauzas	from what I read from both comments is that this is EXPECTED that we delete the instance dir if the instance is BFV	09:07
bauzas	not sure we do support shared storage with BFV	09:07
bauzas	see ?	09:07
gibi	I see	09:12
gibi	I'm not sure what mdbooth's comment really means. I.e. does the comment really imply that the logic is wrong and we should change it?	09:13
gibi	we need to figure out if live migration on shared storage with a volume backed VM actually fails if the instance dir is deleted	09:16
gibi	we have never seen the real failure from the reporter	09:16
gibi	they just stated that the dir is deleted	09:16
gibi	the only thing they said is	09:17
gibi	"making the VM unusable for future operations."	09:17
gibi	I need to drop for an hour but will continue from here	09:17
bauzas	🤷‍	09:17
bauzas	sure, let's continue to discuss this later once more people are here	09:18
gibi	one thing we can suggest is to only delete the instance dir if the instance is volume backed but the instance path is not shared	09:19
gibi	I can amend my suggested patch with that logic	09:19
gibi	and they can test	09:19
bauzas	yeah, seems a good point	09:31
bauzas	but this would revert the previous logic :)	09:31
elinux	any one available here ?	10:01
elinux	Anyone online	10:08
supamatt	o/	10:08
frickler	stephenfin: sean-k-mooney: https://review.opendev.org/c/openstack/python-openstackclient/+/929236 should fix the second part of the evacuate command, not sure about the microversion thing yet	10:41
sean-k-mooney	frickler so i think this is an sdk regression?	10:45
sean-k-mooney	i guess it was introduced with https://github.com/openstack/python-openstackclient/commit/e6dc0f39c0891ea551867b77663a463b5f76987c perhaps	10:46
sean-k-mooney	frickler: the actual password handling code has not changed in over a year	10:48
sean-k-mooney	but the args are now being passed to an sdk client instead of nova client	10:48
opendevreview	Matthew Heler proposed openstack/nova master: Fix regression with live migration on shared storage https://review.opendev.org/c/openstack/nova/+/928970	10:50
sean-k-mooney	frickler: the microversion we used is https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#id62	10:50
sean-k-mooney	which removed forced evacuates	10:50
supamatt	bauzas the latest patch I posted for 928970 works with shared storage now	10:51
sean-k-mooney	frickler: the password is adminPass	10:52
sean-k-mooney	in the rest api	10:53
opendevreview	Matthew Heler proposed openstack/nova master: Fix regression with live migration on shared storage https://review.opendev.org/c/openstack/nova/+/928970	10:54
gibi	supamatt: nice, I think that was what I imagined above by "one thing we can suggest is to only delete the instance dir if the instance is volume backed but the instance path is not shared"	10:59
gibi	supamatt: so the patch you posted proves my theory.	10:59
gibi	let's wait for bauzas as well but I think we can make your patch mergeable by adding some unit tests around it and a release note	11:00
supamatt	+1	11:05
frickler	sean-k-mooney: yes, I think that that is a regression that was missed in the commit you mention	11:13
gibi	supamatt: if you need pointers / help to add unit tests and a release note then let me know	11:13
supamatt	gibi: for the unit test, I'm not sure how best to do that. the release notes I can do	11:35
gibi	supamatt: for the compute manager change you can extend the unit test coverage here based on the existing examples https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/tests/unit/compute/test_compute_mgr.py#L11385-L11448	11:51
opendevreview	Matthew Heler proposed openstack/nova master: Fix regression with live migration on shared storage https://review.opendev.org/c/openstack/nova/+/928970	11:54
gibi	supamatt: for the driver change here is an example test case you can use https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/tests/unit/virt/libvirt/test_driver.py#L20993-L21013	11:54
bauzas	sorry, was offline due to some network change	12:59
bauzas	supamatt: gibi : can you please give me a summary?	13:00
gibi	bauzas: supamatt basically implemented what I described before I dropped this morning.	13:02
gibi	So the patch now only deletes the instance dir if the instance is booted from volume but the instance directory is not on shared storage	13:03
gibi	supamatt stated that this solves their issue	13:03
gibi	and it is aligned with my understanding	13:03
gibi	now supamatt works on unit test coverage for the change	13:03
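A minimal sketch of the rule gibi summarizes above, assuming the decision can be reduced to two booleans. This is not the code from the Nova patch: `MigrateData` and `should_delete_instance_dir` are hypothetical names, and the two fields only mirror the `migrate_data` attribute names quoted elsewhere in this log.

```python
from dataclasses import dataclass


@dataclass
class MigrateData:
    """Hypothetical stand-in for Nova's migrate_data object."""
    # True for volume-backed instances (and, per gibi, in other cases too).
    is_shared_block_storage: bool
    # True when the instance directory lives on shared storage such as NFS.
    is_shared_instance_path: bool


def should_delete_instance_dir(migrate_data: MigrateData) -> bool:
    """Delete the source instance dir only if the destination does not share it."""
    return (migrate_data.is_shared_block_storage
            and not migrate_data.is_shared_instance_path)


# Volume-backed instance with a local instance dir: safe to delete on the source.
print(should_delete_instance_dir(MigrateData(True, False)))
# Volume-backed instance with the instance dir on NFS: must be kept.
print(should_delete_instance_dir(MigrateData(True, True)))
```

Under these assumptions, the case reported in the bug (boot-from-volume plus an NFS-mounted instance path) falls into the second call, where the directory is left alone.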
bauzas	excellent then	13:08
bauzas	I tried to work on a functest but it wasn't simple	13:08
bauzas	but we can try to write a specific shared storage test if we want in another patch	13:09
gibi	given the time pressure and the fact that we have first hand info from the reporter that it solves the regression I'm OK not to have a functional reproducer for this	13:09
gibi	if we can add functional test later that is fine too	13:09
bauzas	yup	13:10
bauzas	but yeah, I think we need some way to check shared storage by a functest	13:10
elinux_	I have observed something while testing the nova API calls. After the request comes in, the request spec object is built. I then placed a debugger at the scheduler level. By the time the pdb prompt appears and I inspect the spec object, it has lost the metadata key/value passed by the user. But when I don't use the debugger and only use a simple print statement, it prints the metadata value	13:15
elinux_	so my question is: is there a time span after which the data within that object is removed?	13:15
gibi	elinux: probably the RPC call to the scheduler timed out while you were debugging the scheduler	13:36
elinux	gibi: but all other details are intact for the request. Only metadata is missing	13:45
supamatt	bauzas gibi: yea the unit test doesn't look simple to me either ;S	13:45
gibi	supamatt: If you are stuck I can try to create a follow up patch on top of yours with the unit tests	13:46
bauzas	supamatt: I was talking of functional tests which are in another directory	13:46
bauzas	yes, we can help on the UTs	13:46
bauzas	have you found the test methods to update ?	13:46
bauzas	we need another core to chime in https://review.opendev.org/c/openstack/nova/+/928995	13:47
bauzas	elodilles: I think we will reasonably release nova RC1 by monday	13:48
supamatt	If you guys wouldn't mind adding that UT, that'd be awesome. This patch seems like a big regression. I've done a number of openstack deployments over the last decade and NFS being used as shared storage is in almost a quarter of those deployments.	13:48
gibi	supamatt: sure, I can propose the UTs	13:49
elodilles	bauzas: ACK, sounds good	13:49
sean-k-mooney	frickler: sorry was gone for lunch with some local redhatters	13:50
bauzas	elodilles: I'll +1 the placement RC1	13:50
elodilles	thanks o/	13:50
elodilles	we still have 2 weeks until the final RC deadline o:)	13:50
sean-k-mooney	frickler: so the password should be valid for the requested microversion but i think i agree that just not passing it when it's not present on the command line will work around it	13:50
sean-k-mooney	but there might also be bugs in the sdk about that	13:51
elodilles	but let's hope RC1 will be enough :] fingers crossed :X	13:51
frickler	sean-k-mooney: the real issue is that the parameter is named admin_password in the sdk, that change was missed while switching evacuate to use the sdk	13:54
frickler	the sdk always sends none in the request anyway, so I reverted that part in PS2	13:55
sean-k-mooney	well the real issue is it's named adminPassword in the api and the sdk and client were both calling it something else	13:55
sean-k-mooney	frickler: the sdk has historically renamed parameters from their real api names	13:55
sean-k-mooney	which causes a lot of similar bugs that will hopefully stop in the future	13:56
sean-k-mooney	if we complete the openapi stuff and start generating parts of the sdk from the openapi definitions	13:56
sean-k-mooney	actually it's adminPass in the api	13:57
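To illustrate the naming mismatch discussed above, here is a small sketch of translating a Python-style `admin_password` argument into the `adminPass` key of the evacuate request body, while leaving the key out when no password was given (the workaround sean-k-mooney mentions). The helper `build_evacuate_body` is hypothetical and is not the openstacksdk or python-openstackclient implementation.

```python
def build_evacuate_body(host=None, admin_password=None):
    """Build an evacuate action body (hypothetical helper, for illustration only).

    The caller uses the Python-friendly name ``admin_password``; the REST API
    key is ``adminPass``. Arguments left as None are omitted from the body.
    """
    action = {}
    if host is not None:
        action["host"] = host
    if admin_password is not None:
        # Rename to the key the compute API expects.
        action["adminPass"] = admin_password
    return {"evacuate": action}


print(build_evacuate_body(host="compute-2", admin_password="s3cret"))
print(build_evacuate_body())
```

Keeping the rename in one place like this is one way to avoid the client and the SDK each inventing their own spelling of the parameter.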
artom	gibi, bauzas, shouldn't mdev and vpmem live migration have uncovered the problem before? They're also cases where we started calling cleanup, but the destroy_disks logic was incorrect?	13:58
bauzas	maybe	13:59
sean-k-mooney	neither of those are tested in ci currently	13:59
bauzas	but here, we now call cleanup everytime if you have dst_numa_info	13:59
sean-k-mooney	so i have a ci job that will test this we just have not merged it yet	14:00
artom	Right, but before we had this:	14:00
artom	do_cleanup = (not migrate_data.is_shared_instance_path or	14:00
artom	has_vpmem or has_mdevs)	14:00
bauzas	while for mdev, it was only if you were migrating the mdevs	14:00
sean-k-mooney	oh actually no because it needs nfs too	14:00
artom	So we called cleanup every time we had vpmem or mdevs	14:00
sean-k-mooney	yes	14:00
artom	So if any of those were on NFS, they would see their disk deleted as well?	14:00
gibi	artom: I'm fairly certain that mdev + shared storage would trigger it too	14:00
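A minimal sketch of the condition artom quotes above, written as a standalone function so the interaction with shared storage is easy to see. The function name and the plain-boolean parameters are assumptions for illustration; in Nova the values come from `migrate_data` and the instance.

```python
def needs_cleanup(is_shared_instance_path: bool,
                  has_vpmem: bool = False,
                  has_mdevs: bool = False) -> bool:
    """Mirror of the quoted do_cleanup condition (illustrative only)."""
    return (not is_shared_instance_path) or has_vpmem or has_mdevs


# Plain instance on a shared instance path: cleanup is skipped.
print(needs_cleanup(is_shared_instance_path=True))
# Same shared path, but the instance has mdevs: cleanup still runs.
print(needs_cleanup(is_shared_instance_path=True, has_mdevs=True))
```

The second call is the combination gibi refers to: once cleanup is triggered for another reason, whether the disks or instance dir are removed depends on separate logic rather than on this condition.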
sean-k-mooney	because we do need to clean up the persistent memory	14:00
artom	And we just never caught it?	14:00
sean-k-mooney	we did not support mdev live migration until Caracal	14:01
bauzas	we're calling it for mdevs because we do something in cleanup about mdevs	14:01
artom	Right, I understand _why_ we need to call it	14:01
gibi	artom: I think we don't have much coverage with NFS under nova instances path	14:01
sean-k-mooney	my point is that we have no test coverage of any of the conditions in ci today	14:01
artom	I'm saying it's suspect that power management live migration is the first thing to have caught the faulty destroy_disks logic	14:01
sean-k-mooney	we dont have coverage of mdev or pmem in ci	14:01
artom	OK, mdev is new in Caracal, but vpmem? That's been around for a while...	14:02
gibi	artom: correct the power mgmt condition was wide enough to hit people	14:02
bauzas	rather the problem is that we don't really test shared instances	14:02
sean-k-mooney	artom: intel killed pmem remember	14:02
* dansmith sean-k-mooney: have you looked at the post_failures on this? https://review.opendev.org/c/openstack/nova/+/928829/5?tab=change-view-tab-header-zuul-results-summary		14:02
artom	gibi, OK, that makes sense I guess.	14:02
gibi	as it hits all deployments with NFS + numa	14:02
sean-k-mooney	dansmith yep frickler has a possible fix	14:02
dansmith	(stupid client auto /me'ing)	14:02
sean-k-mooney	https://review.opendev.org/c/openstack/python-openstackclient/+/929236	14:03
bauzas	migrating instances with mdevs is not a problem if you don't have shared storage	14:03
artom	I'm always terrified when I'm touching that code. I had to do it with vTPM live migration as well	14:03
dansmith	sean-k-mooney: okay it seemed relevant to your discussion above but I wasn't sure	14:03
gibi	artom: the fix proposed will work not just for NUMA but also for mdev and vpmem instances	14:03
bauzas	in order to see this problem, you would need to have the CI checking both mdevs and shared storage	14:03
artom	gibi, right, because power management isn't the cause, it just happened to have exposed the bug	14:03
gibi	yes	14:03
bauzas	because we default to check every numa instance	14:03
gibi	yepp	14:04
artom	Maybe that check should have taken into account the power management config option :P	14:04
sean-k-mooney	bauzas: we can test this in ci going forward	14:04
bauzas	we could have another field in the migrate_data that would say "heh, my compute supports power management, so please do it too"	14:04
sean-k-mooney	we will just do the numa live migration testing in a multi node nfs job	14:04
sean-k-mooney	i was going to enable it in the ceph job	14:04
sean-k-mooney	but i can just change which job i enable numa migration in	14:05
artom	sean-k-mooney, so actually, would this bug have been visible in CI? The instance might remain ACTIVE even if its disk is gone...	14:05
bauzas	right, we don't have any exceptions	14:05
artom	We would need to do something like soft-reboot it after the live migration?	14:05
sean-k-mooney	https://review.opendev.org/c/openstack/nova/+/913842	14:05
sean-k-mooney	artom: it should have caused the instance to crash and caused the connectivity test to fail when we ssh in	14:06
gibi	artom: in the volume booted case I think the instance will remain ACTIVE. For the shared storage case I don't know what happen when the local disk is deleted under it	14:06
sean-k-mooney	we also have one broken the back and forth live migration	14:06
artom	Ah, yeah, that would probably expose this	14:06
artom	OK, I think I've saturated - I've literally had a feverish night tossing and turning because of some mystery illness one of the kids (presumably the baby) brought home. I'll go do dumb paperwork stuff :P	14:07
bauzas	sean-k-mooney: does the ceph job check shared storage with cephfs ?	14:08
sean-k-mooney	bauzas: no	14:08
bauzas	or is it just rbd ?	14:08
bauzas	okay, then we wouldn't see the bug	14:08
sean-k-mooney	its images_type=rbd	14:08
gibi	artom: take care	14:08
bauzas	I see	14:08
sean-k-mooney	right i said i need to create a new job for this	14:08
sean-k-mooney	i thought we had an nfs job already	14:08
bauzas	artom: joys of being parent of kids going back at school	14:09
sean-k-mooney	we have - devstack-plugin-nfs-tempest-full:	14:09
* gibi goes typing in UTs		14:09
sean-k-mooney	in the experimental line but not in check	14:09
sean-k-mooney	bauzas: basically we have 2 choices i can continue enabling numa testing in the ceph job	14:09
bauzas	so you'd add numa live-migration tests in there ?	14:09
sean-k-mooney	or i could pivot and create a shared storage migration job and put them in that	14:10
bauzas	I'm cool but we're just adding more tests to the CI	14:10
sean-k-mooney	we dont have a job that would catch this currently	14:10
sean-k-mooney	we do not have any testing with nova on shared storage	14:10
sean-k-mooney	bauzas: so im saying i can create one and also put the numa stuff in it, or i can proceed with putting the numa stuff in the ceph jobs and we can create a separate nova-shared-storage-migration job	14:11
sean-k-mooney	bauzas: do you have a preference?	14:12
bauzas	kill shared storage with fire ?	14:12
sean-k-mooney	:)	14:12
bauzas	you can have a shared image backend, why would you mind using NFS which would create you don't know ?	14:13
sean-k-mooney	well i cant argue against that but we could certainly make it less teribad in the future if we included this as part of the image backend refactor	14:13
bauzas	another good case, yes	14:13
sean-k-mooney	like if we had images_type=nfs that would simplify things a lot	14:14
sean-k-mooney	bauzas: lets pause this discussion for now and we can loop back to the tempest coverage after rc1	14:15
sean-k-mooney	bauzas: https://github.com/openstack/devstack-plugin-nfs currently just supports setting up cinder on nfs so to test this i would either need to extend that or just put it in the nova devstack plugin which is honestly simpler	14:17
sean-k-mooney	when we have a little more time ill try and figure out which job to extend to give us the most test coverage without needing extra jobs	14:18
sean-k-mooney	im thinking of using nova-ovs-hybrid-plug to test shared storage and numa and then just leave the ceph job as is	14:21
sean-k-mooney	summarised that here https://review.opendev.org/c/openstack/nova/+/913842/comments/ee7e862a_c1fd1b8c	14:25
sean-k-mooney	frickler: is there a dnm patch that is testing your openstack client patch	14:26
sean-k-mooney	frickler: if not i need to rebase one of my patches to address a nit anyway so i might add a depends on to it	14:27
gibi	bauzas: which way you want to land the shared storage fix: a) I add the UT to supamatt's commit directly or b) I push the UT separately and then I can +2 the commit with the fix still	14:56
bauzas	can we squash both ?	14:58
bauzas	you can +2 the patch, I'm fine	14:59
bauzas	I'll explain in my comment why it's fine	14:59
bauzas	(and I want to stop by the top of the hour :) )	14:59
gibi	OK I will push the squash in 5 mins	15:01
opendevreview	OpenStack Release Bot proposed openstack/placement stable/2024.2: Update .gitreview for stable/2024.2 https://review.opendev.org/c/openstack/placement/+/929290	15:08
opendevreview	OpenStack Release Bot proposed openstack/placement stable/2024.2: Update TOX_CONSTRAINTS_FILE for stable/2024.2 https://review.opendev.org/c/openstack/placement/+/929291	15:08
opendevreview	OpenStack Release Bot proposed openstack/placement master: Update master for stable/2024.2 https://review.opendev.org/c/openstack/placement/+/929292	15:08
opendevreview	Balazs Gibizer proposed openstack/nova master: Fix regression with live migration on shared storage https://review.opendev.org/c/openstack/nova/+/928970	15:09
gibi	bauzas: ^^	15:09
gibi	sorry it was 8 minutes	15:09
gibi	but I can offer an extended commit message in return :)	15:09
gibi	cc artom: ^^	15:12
gibi	and I need to drop now	15:16
opendevreview	sean mooney proposed openstack/nova master: allow upgrade of pre-victoria InstanceNUMACells https://review.opendev.org/c/openstack/nova/+/929187	15:16
bauzas	sean-k-mooney: can you please review https://review.opendev.org/c/openstack/nova/+/928970 ?	15:22
* bauzas will stop by now		15:22
frickler	sean-k-mooney: no, I only tested locally, feel free to add a dep to verify	15:22
sean-k-mooney	frickler: i have +2'd the client patch after verifying the field in the sdk and novaclient	15:24
sean-k-mooney	frickler: i have also added it as a depends on in https://review.opendev.org/c/openstack/nova/+/929187	15:24
sean-k-mooney	although i dont know if the jobs are set up to pull in osc from source	15:24
sean-k-mooney	bauzas: sure ill take a look now	15:25
bauzas	thanks, now I need to go off	15:25
sean-k-mooney	frickler: i think at this point we just need someone from the sdk team to +w the client patch	15:25
sean-k-mooney	bauzas: o/ enjoy your weekend	15:26
bauzas	thanks	15:26
sean-k-mooney	+2w, the approach looks good. i would have preferred a functional regression test but the coverage is enough to proceed with. ill look at tempest ci coverage for this next week if i have time	15:30
*** bauzas_ is now known as bauzas		16:44
*** bauzas_ is now known as bauzas		19:44
opendevreview	Merged openstack/nova master: Add Dalmatian prelude section https://review.opendev.org/c/openstack/nova/+/928995	20:02
*** bauzas_ is now known as bauzas		23:27
