*** __ministry is now known as Guest3215 | 01:18 | |
opendevreview | melanie witt proposed openstack/nova master: DNM try minimalistic imagebackend refactor https://review.opendev.org/c/openstack/nova/+/925635 | 01:20 |
opendevreview | melanie witt proposed openstack/nova master: DNM try minimalistic imagebackend refactor https://review.opendev.org/c/openstack/nova/+/925635 | 05:00 |
gibi | bauzas: supamatt's bug feels like an RC candidate https://review.opendev.org/c/openstack/nova/+/928970 | 06:42 |
gibi | bauzas: I've added it to the tracking etherpad and I will review it properly after a coffee | 06:43 |
gibi | supamatt: thanks for bringing this to our attention | 06:43 |
gibi | supamatt: I have doubts about the problem with the code pointed out by the report, and the fix does not affect the setting of the destroy_disks variable that is responsible for signalling deletion of files from disk at cleanup | 07:15 |
gibi | it seems nobody is around so I will update the bug report with my doubt/questions | 07:23 |
gibi | done | 07:43 |
bauzas | gibi: ack, sorry, I had trouble with my home network | 07:47 |
gibi | bauzas: no worries | 07:48 |
bauzas | gibi: this is on 2024.1 | 07:48 |
bauzas | Caracal so | 07:48 |
gibi | bauzas: it was a backport from master | 07:49 |
gibi | merged this cycle | 07:49 |
bauzas | technically, we could deliver RC1 and then backport that one in a .z release | 07:49 |
gibi | https://review.opendev.org/c/openstack/nova/+/909806 | 07:49 |
bauzas | okay, lemme look at the commit SHA1 | 07:49 |
gibi | that is the offending patch according to the reporter | 07:49 |
gibi | bauzas: I lean towards thinking that the reporter does see a numa live migration regression on shared storage, but I'm not sure that the root cause is the power mgmt flag that was added. It's more that the flag now triggers a driver cleanup call that wasn't triggered before | 07:50 |
gibi | and that cleanup call might have issues | 07:50 |
bauzas | agreed | 07:50 |
bauzas | that's the cleanup | 07:51 |
bauzas | gibi: oh, that's the conditional | 07:53 |
bauzas | we only cleanup if we don't have shared path | 07:53 |
bauzas | but here, the 'or' makes it now optional | 07:53 |
bauzas | we just need to change the conditional | 07:53 |
gibi | calling the cleanup with destroy_disks should be OK | 07:56 |
gibi | destroy_disks=False | 07:56 |
gibi | apparently it isn't OK, but changing the logic of when to trigger do_cleanup seems like the wrong fix to me | 07:57 |
gibi | as destroy_disks=False should prevent the disk deletion | 07:57 |
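The destroy_disks semantics gibi describes can be sketched as follows. This is a toy function, not nova's actual code: the point is only that cleanup may run for other reasons, but destroy_disks=False must leave the instance files alone.

```python
# Illustrative sketch (not nova's real API): destroy_disks only gates
# file deletion inside cleanup; cleanup itself may still run for other
# reasons (e.g. power mgmt, releasing mdevs), so calling it with
# destroy_disks=False should leave the instance files intact.
def cleanup(instance_dir_files, destroy_disks, extra_cleanup=None):
    """Run post-live-migration cleanup on the source host (toy model)."""
    if destroy_disks:
        # only delete local instance files when disks are not shared
        instance_dir_files.clear()
    if extra_cleanup:
        extra_cleanup()  # e.g. release reserved mdevs, power down cores

files = ["disk", "disk.info", "console.log"]
cleanup(files, destroy_disks=False)
# files are preserved because destroy_disks is False
```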
bauzas | true | 07:57 |
bauzas | but we check two different fields for shared storage | 07:58 |
bauzas | the conditional checks whether both computes use the same path | 07:58 |
gibi | yes I spotted that but did not dig deeper there | 07:58 |
bauzas | while the destroy_disks arg checks whether we use block storage | 07:59 |
bauzas | I assume the reporter gets something like True for the shared path but False for shared block storage | 07:59 |
gibi | what is a shared block storage? | 08:01 |
bauzas | for mdevs and vpmems, there was a reason to call cleanup | 08:01 |
gibi | just cinder? or something more special? | 08:01 |
bauzas | we need to delete the local residues | 08:01 |
bauzas | honestly, that's the first time I'm hearing this | 08:02 |
bauzas | I'm used to the regular FS share between instance paths | 08:02 |
bauzas | that's the regular shared storage I know | 08:03 |
gibi | yeah I think is_shared_instance_path is the case when the instance dir is on e.g. NFS | 08:03 |
bauzas | yup | 08:03 |
gibi | we need to call cleanup if power mgmt is enabled so the current fix proposed would regress that | 08:04 |
bauzas | I need to see what makes migrate_data.is_shared_block_storage | 08:04 |
gibi | yep | 08:04 |
bauzas | and it would also regress the mdev live-migration support | 08:05 |
bauzas | we need to call cleanup to delete the mdevs we reserved | 08:05 |
bauzas | hence the 'or' | 08:05 |
bauzas | I need a coffee | 08:06 |
bauzas | it's early and I'm already fried | 08:06 |
bauzas | trying to improve my local LTE connection took too much of my time | 08:06 |
gibi | https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/virt/libvirt/driver.py#L10367-L10369 | 08:08 |
bauzas | ooooh I see | 08:09 |
gibi | https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/virt/libvirt/driver.py#L10470 | 08:09 |
bauzas | yeah, so this is when you have BFV or you share the same libvirt image backend between both | 08:10 |
bauzas | that's not NFS | 08:10 |
gibi | yeah it is the RBD image backend | 08:11 |
bauzas | so | 08:11 |
bauzas | I think of something quick | 08:11 |
bauzas | previously, we were NOT calling cleanup on shared storage | 08:11 |
bauzas | right? | 08:12 |
bauzas | now we WANT to call it too, but we don't want to destroy the disks | 08:12 |
bauzas | I see a simple solution | 08:12 |
bauzas | which is to add another clause to destroy_disks | 08:13 |
bauzas | gibi: thoughts ? | 08:13 |
gibi | thinking... | 08:14 |
gibi | we did not call cleanup if is_shared_instance_path was true. And now we need it due to power mgmt. | 08:14 |
bauzas | yup | 08:14 |
bauzas | but we need to call it smartly | 08:15 |
gibi | yes | 08:15 |
gibi | would the change destroy_disks = not (migrate_data.is_shared_block_storage or migrate_data.is_shared_instance_path) work? | 08:16 |
bauzas | that's my hope | 08:16 |
bauzas | but we need to check the cleanup method | 08:16 |
gibi | https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/virt/libvirt/driver.py#L1707-L1724 this is the relevant path. I think it would work | 08:20 |
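The boolean logic of gibi's proposed fix can be checked in isolation. The attribute names is_shared_block_storage and is_shared_instance_path come from the chat; plain booleans stand in for the real migrate_data object here.

```python
# Sketch of the proposed fix from the discussion above: disks should
# only be destroyed when storage is shared in *neither* sense
# (neither a shared block backend like RBD nor a shared instance path
# like NFS).
def destroy_disks(is_shared_block_storage, is_shared_instance_path):
    return not (is_shared_block_storage or is_shared_instance_path)

# Only the fully-local case should delete disks:
assert destroy_disks(False, False) is True
assert destroy_disks(True, False) is False   # shared block backend (e.g. RBD)
assert destroy_disks(False, True) is False   # shared instance dir (e.g. NFS)
assert destroy_disks(True, True) is False
```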
gibi | but we don't have reproduction | 08:20 |
bauzas | yeah | 08:21 |
bauzas | we need a functional test | 08:21 |
gibi | In 20 mins I need to step away for more than an hour | 08:22 |
bauzas | I can try to work on the reproducer | 08:22 |
gibi | I will summarize the above discussion in the bug | 08:22 |
bauzas | I already did in the report | 08:22 |
gibi | cool | 08:23 |
gibi | I will add a request to the reporter to try, if they can, our fix proposal | 08:23 |
bauzas | please, but I assume he's on West-US timezones | 08:23 |
gibi | done | 08:36 |
*** bauzas_ is now known as bauzas | 09:30 | |
opendevreview | Sylvain Bauza proposed openstack/nova master: Add Dalmatian prelude section https://review.opendev.org/c/openstack/nova/+/928995 | 10:28 |
opendevreview | Merged openstack/osc-placement master: Update master for stable/2024.2 https://review.opendev.org/c/openstack/osc-placement/+/928360 | 10:28 |
yusufgungor_ | hi everyone, we think we hit a strange bug. We got the mtu error below on the source compute node when live migrating an instance in the scenarios listed. | 11:32 |
yusufgungor_ | "Target network card MTU 0 does not match source 1450: libvirt.libvirtError: unsupported configuration: Target network card MTU 0 does not match source 1450" | 11:32 |
yusufgungor_ | Also debug logs shows that mtu is null for one of the interfaces for the failed scenarios. _update_vif_xml function prints log as below on debug: | 11:32 |
yusufgungor_ | Updating guest XML with vif config: <interface type="ethernet"> | 11:32 |
yusufgungor_ | <mac address="fa:16:3e:19:a8:ba"/> | 11:33 |
yusufgungor_ | <model type="virtio"/> | 11:33 |
yusufgungor_ | <target dev="tapfb1b3816-ed"/> | 11:33 |
yusufgungor_ | </interface> | 11:33 |
yusufgungor_ | Updating guest XML with vif config: <interface type="ethernet"> | 11:33 |
yusufgungor_ | <mac address="fa:16:3e:8e:e7:b3"/> | 11:33 |
yusufgungor_ | <model type="virtio"/> | 11:33 |
yusufgungor_ | <mtu size="1500"/> | 11:33 |
yusufgungor_ | <target dev="tapd001bd30-9c"/> | 11:33 |
yusufgungor_ | </interface> | 11:33 |
yusufgungor_ | Should we create a bug report? This mtu error causes the migration to revert and triggers another strange bug on the os-brick side which unmaps the incorrect iscsi device for a random instance :/ That is another story. | 11:33 |
yusufgungor_ | Scenarios: | 11:33 |
yusufgungor_ | 1. When creating an instance from a vxlan tenant network, then attaching an interface from the provider network, and live migrating, it gives an mtu error! | 11:33 |
yusufgungor_ | 2. When creating an instance from a vxlan tenant network, then attaching an interface from the provider network, rebooting, and live migrating, it does not give an mtu error. | 11:33 |
yusufgungor_ | 3. When creating an instance only from the provider network, it does not give an mtu error. | 11:33 |
yusufgungor_ | 4. When creating an instance from the provider network, then attaching an interface from the vxlan tenant network, and migrating, it gives an mtu error! | 11:33 |
yusufgungor_ | 5. When creating an instance, giving an IP from both the vxlan tenant network and the provider network, and live migrating, it does not give an mtu error. | 11:33 |
yusufgungor_ | 6. When creating an instance only from the provider network, it does not give an mtu error. Then, when attaching a leg from the vxlan tenant network and migrating again, it gives an mtu error! | 11:33 |
yusufgungor_ | nova-compute --version : 26.2.1 (osp cluster is zed version) | 11:36 |
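The debug output pasted above shows the symptom: one interface element carries an `<mtu>` child and the other does not, and libvirt reads the missing value as 0, which produces the "Target network card MTU 0 does not match source 1450" error. A standalone sketch of that mismatch (this is not nova's _update_vif_xml, just an illustration):

```python
# Minimal illustration of the symptom in the debug output above: one
# interface element has an <mtu> child and one does not; an absent
# <mtu> effectively reads as 0, hence the libvirt mismatch error.
import xml.etree.ElementTree as ET

good = ET.fromstring(
    '<interface type="ethernet">'
    '<mac address="fa:16:3e:8e:e7:b3"/><mtu size="1500"/>'
    '</interface>')
bad = ET.fromstring(
    '<interface type="ethernet">'
    '<mac address="fa:16:3e:19:a8:ba"/>'
    '</interface>')

def mtu_of(iface):
    # mirror the observed behaviour: absent <mtu> reads as 0
    el = iface.find('mtu')
    return int(el.get('size')) if el is not None else 0

print(mtu_of(good), mtu_of(bad))  # 1500 0
```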
sean-k-mooney | yusufgungor_: so it should be noted that nova does not support multi-segment networks with different mtus | 12:04 |
sean-k-mooney | yusufgungor_: vxlan has 48(?) bytes of encapsulation overhead so we round that up to 50 and reduce its mtu vs vlan/flat in neutron automatically | 12:05 |
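The overhead sean-k-mooney is recalling works out to exactly 50 bytes for VXLAN over IPv4 (per the RFC 7348 framing), which matches the 1450 tenant-network MTU mentioned in this conversation:

```python
# VXLAN-over-IPv4 encapsulation overhead, per RFC 7348 framing.
# The inner Ethernet header rides inside the VXLAN payload, so it
# counts against the outer 1500-byte MTU along with the outer headers.
inner_eth = 14   # inner Ethernet header carried in the VXLAN payload
outer_ip = 20    # outer IPv4 header
outer_udp = 8    # UDP header
vxlan_hdr = 8    # VXLAN header
overhead = inner_eth + outer_ip + outer_udp + vxlan_hdr
print(overhead)            # 50
print(1500 - overhead)     # 1450, the tenant-network MTU from the chat
```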
sean-k-mooney | yusufgungor_: you should file a bug with as much detail as is relevant | 12:07 |
yusufgungor_ | @sean thanks for the reply. so a nova instance with both a vxlan tenant network (mtu: 1450) and a provider network (mtu: 1500) is not supported? Actually it works except for live migration in specific cases. For example, after hard rebooting the instance we can live migrate too. | 12:08 |
yusufgungor_ | @sean-k-mooney thanks, then i will file a bug report to nova | 12:08 |
sean-k-mooney | no, you can have an instance with a mix of both | 12:09 |
sean-k-mooney | that should be fine | 12:09 |
sean-k-mooney | just not a neutron network that has a vlan segment and a vxlan segment with different mtus | 12:10 |
sean-k-mooney | that actually might work from the nova side as we take the mtu from the port | 12:10 |
sean-k-mooney | but it would break l2 connectivity between the segments, as mtus should only change on l3 (subnet) boundaries | 12:11 |
yusufgungor_ | @sean-k-mooney thanks. actually we have tried setting the provider network mtu to 1450 too and tested, but hit the same live migration error | 12:12 |
yusufgungor_ | @sean-k-mooney i will test it again, and file a bug | 12:12 |
sean-k-mooney | oh this might be an old bug then | 12:12 |
sean-k-mooney | so we do not support changing the mtu of neutron networks | 12:13 |
sean-k-mooney | and by we i mean nova | 12:13 |
sean-k-mooney | the only way to do that properly is if you detach and reattach all the ports or otherwise regenerate the guest xml | 12:14 |
yusufgungor_ | @sean-k-mooney thanks 🙏 | 12:15 |
ratailor_ | sean-k-mooney, could you please provide your suggestion on this approach https://review.opendev.org/c/openstack/nova/+/928933 for https://bugs.launchpad.net/nova/+bug/2058928 | 12:55 |
ratailor_ | sean-k-mooney, the finish_time field is not shown in the instance action show output; that's why we need to do this before calling the instance action finish method for the respective APIs where we need to finish the started action. | 12:56 |
ratailor_ | sean-k-mooney, I will fix failing tests and add new tests wherever required, once this approach looks good. | 12:57 |
ratailor_ | artom, gibi please provide your suggestions as well ^^ | 13:00 |
opendevreview | sean mooney proposed openstack/nova master: repoduce post liberty pre vicoria instance numa db issue https://review.opendev.org/c/openstack/nova/+/929025 | 13:01 |
sean-k-mooney | dansmith: gibi bauzas ^ | 13:02 |
sean-k-mooney | ill file a bug for that and update it when i push the fix patch | 13:04 |
yusufgungor_ | @sean-k-mooney i have filed a bug: https://bugs.launchpad.net/nova/+bug/2080531 | 13:31 |
opendevreview | Dan Smith proposed openstack/nova stable/2023.1: [stable-only]: Port qcow2v2 support from oslo https://review.opendev.org/c/openstack/nova/+/928937 | 14:02 |
opendevreview | Rajesh Tailor proposed openstack/nova master: Add support for showing finish_time https://review.opendev.org/c/openstack/nova/+/928933 | 14:08 |
dansmith | sean-k-mooney: are you respinning the safety patch to invert the safe/unsafe logic or no? | 16:29 |
dansmith | your comment made it sound like yes, but... | 16:30 |
sean-k-mooney | no | 16:31 |
sean-k-mooney | i inverted the logic already | 16:31 |
sean-k-mooney | so it should be good as is | 16:31 |
dansmith | okay | 16:31 |
sean-k-mooney | unsafe would have been less work but i was already 30 mins into fixing the tests | 16:31 |
sean-k-mooney | so i decided to just finish it | 16:31 |
dansmith | yeah I just wasn't sure from your comment if you were going to respin is all | 16:32 |
bauzas | hum, looks like we don't have any existing functional tests checking shared storage | 16:32 |
sean-k-mooney | ya i probably should have dropped the comment | 16:32 |
sean-k-mooney | bauzas: we may under the libvirt section | 16:32 |
sean-k-mooney | but in general it's light if it exists | 16:33 |
bauzas | my grep on shared_storage or just storage shows me nothing | 16:33 |
bauzas | I assume I may have to mock this | 16:34 |
bauzas | iirc, we just detect shared storage if we can see the same instance files on both computes, right? | 16:34 |
sean-k-mooney | yep you can mock that function | 16:36 |
sean-k-mooney | i would just create a regression test in a separate file | 16:36 |
sean-k-mooney | and use https://github.com/openstack/nova/blob/master/nova/tests/functional/libvirt/base.py#L264 to make live migration work | 16:36 |
sean-k-mooney | then stub the function for the shared storage check to return true | 16:36 |
sean-k-mooney | https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/tests/functional/regressions/test_bug_1939545.py | 16:36 |
bauzas | we have a couple of functional tests for live-migration, that's not a problem | 16:36 |
sean-k-mooney | i think has all the ground work | 16:37 |
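The stubbing pattern sean-k-mooney suggests can be shown on a toy driver rather than nova's real classes; _is_shared_instance_path and live_migration_flags here are hypothetical names standing in for the real check and the code path that consumes it.

```python
# Sketch of the stubbing approach discussed above, on a toy driver
# instead of nova's real libvirt driver. The method names are
# illustrative, not nova's actual API.
from unittest import mock

class FakeDriver:
    def _is_shared_instance_path(self, instance):
        # the real code would compare instance files across both computes
        return False

    def live_migration_flags(self, instance):
        shared = self._is_shared_instance_path(instance)
        destroy_disks = not shared
        return destroy_disks

driver = FakeDriver()
assert driver.live_migration_flags(None) is True  # local storage path

with mock.patch.object(
        FakeDriver, '_is_shared_instance_path', return_value=True):
    # the test now exercises the shared-storage code path
    assert driver.live_migration_flags(None) is False
```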
bauzas | oh cool, thanks for the digging but I was about to do it :) | 16:37 |
bauzas | okay, I'm out of steam and I don't have ideas on how to correctly assert we delete the local disks, except checking we call the delete | 17:20 |
bauzas | see you tomorrow | 17:20 |
*** bauzas_ is now known as bauzas | 19:48 | |
opendevreview | sean mooney proposed openstack/nova master: repoduce post liberty pre vicoria instance numa db issue https://review.opendev.org/c/openstack/nova/+/929025 | 20:09 |
opendevreview | sean mooney proposed openstack/nova master: allow upgrade of pre-victoria InstanceNUMACells https://review.opendev.org/c/openstack/nova/+/929187 | 20:09 |
opendevreview | sean mooney proposed openstack/nova master: repoduce post liberty pre vicoria instance numa db issue https://review.opendev.org/c/openstack/nova/+/929025 | 20:14 |
opendevreview | sean mooney proposed openstack/nova master: allow upgrade of pre-victoria InstanceNUMACells https://review.opendev.org/c/openstack/nova/+/929187 | 20:14 |
sean-k-mooney | looks like openstack client is broken | 22:11 |
sean-k-mooney | 2024-09-12 21:04:32.171696 | controller | + /opt/stack/nova/roles/run-evacuate-hook/files/test_negative_evacuate.sh:evacuate_and_wait_for_error:19 : openstack --os-compute-api-version 2.67 server evacuate --host np0038485592 evacuate-test | 22:11 |
sean-k-mooney | 2024-09-12 21:04:33.305045 | controller | compute version 2.67 is not in supported versions: 2, 2.1 | 22:11 |
sean-k-mooney | 2024-09-12 21:04:34.474915 | controller | Proxy.evacuate_server() got an unexpected keyword argument 'password' | 22:11 |
sean-k-mooney | stephenfin: ^ fyi for the morning | 22:12 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!