*** __ministry is now known as Guest3215 | 01:18 | |
opendevreview | melanie witt proposed openstack/nova master: DNM try minimalistic imagebackend refactor https://review.opendev.org/c/openstack/nova/+/925635 | 01:20 |
opendevreview | melanie witt proposed openstack/nova master: DNM try minimalistic imagebackend refactor https://review.opendev.org/c/openstack/nova/+/925635 | 05:00 |
gibi | bauzas: supamatt's bug feels like an RC candidate https://review.opendev.org/c/openstack/nova/+/928970 | 06:42 |
gibi | bauzas: I've added it to the tracking etherpad and I will review it properly after a coffee | 06:43 |
gibi | supamatt: thanks for bringing this to our attention | 06:43 |
gibi | supamatt: I have doubts about the problem with the code pointed out by the report, and the fix does not affect the setting of the destroy_disks variable that is responsible for signalling deletion of files from disk at cleanup | 07:15 |
gibi | it seems nobody is around so I will update the bug report with my doubt/questions | 07:23 |
gibi | done | 07:43 |
bauzas | gibi: ack, sorry, I had trouble with my home network | 07:47 |
gibi | bauzas: no worries | 07:48 |
bauzas | gibi: this is on 2024.1 | 07:48 |
bauzas | Caracal so | 07:48 |
gibi | bauzas: it was a backport from master | 07:49 |
gibi | merged this cycle | 07:49 |
bauzas | technically, we could deliver RC1 and then backport that one in a .z release | 07:49 |
gibi | https://review.opendev.org/c/openstack/nova/+/909806 | 07:49 |
bauzas | okay, lemme look at the commit SHA1 | 07:49 |
gibi | that is the offending patch according to the reporter | 07:49 |
gibi | bauzas: I lean towards thinking that the reporter does see a numa live migration regression on shared storage, but I'm not sure that the root cause is the power mgmt flag that was added. It's more that the flag now triggers a driver cleanup call that wasn't triggered before | 07:50 |
gibi | and that cleanup call might have issues | 07:50 |
bauzas | agreed | 07:50 |
bauzas | that's the cleanup | 07:51 |
bauzas | gibi: oh, that's the conditional | 07:53 |
bauzas | we only cleanup if we don't have shared path | 07:53 |
bauzas | but here, the 'or' makes it now optional | 07:53 |
bauzas | we just need to change the conditional | 07:53 |
gibi | calling the cleanup with destroy_disks should be OK | 07:56 |
gibi | destroy_disks=False | 07:56 |
gibi | apparently it isn't OK, but changing the logic of when to trigger do_cleanup seems like the wrong fix to me | 07:57 |
gibi | as destroy_disks=False should prevent the disk deletion | 07:57 |
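The destroy_disks semantics gibi describes can be sketched as follows. This is a toy function, not nova's actual code: the point is only that cleanup may run for other reasons, but destroy_disks=False must leave the instance files alone.

```python
# Illustrative sketch (not nova's real API): destroy_disks only gates
# file deletion inside cleanup; cleanup itself may still run for other
# reasons (e.g. power mgmt, releasing mdevs), so calling it with
# destroy_disks=False should leave the instance files intact.
def cleanup(instance_dir_files, destroy_disks, extra_cleanup=None):
    """Run post-live-migration cleanup on the source host (toy model)."""
    if destroy_disks:
        # only delete local instance files when disks are not shared
        instance_dir_files.clear()
    if extra_cleanup:
        extra_cleanup()  # e.g. release reserved mdevs, power down cores

files = ["disk", "disk.info", "console.log"]
cleanup(files, destroy_disks=False)
# files are preserved because destroy_disks is False
```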
bauzas | true | 07:57 |
bauzas | but we check two different fields for shared storage | 07:58 |
bauzas | the conditional checks whether both computes use the same path | 07:58 |
gibi | yes I spotted that but did not dig deeper there | 07:58 |
bauzas | while the destroy_disks arg checks whether we use block storage | 07:59 |
bauzas | I assume the reporter gets something like True for the shared path but False for shared block storage | 07:59 |
gibi | what is a shared block storage? | 08:01 |
bauzas | for mdevs and vpmems, there was a reason to call cleanup | 08:01 |
gibi | just cinder? or something more special? | 08:01 |
bauzas | we need to delete the local residues | 08:01 |
bauzas | honestly, that's the first time I'm hearing this | 08:02 |
bauzas | I'm used to the regular FS share between instance paths | 08:02 |
bauzas | that's the regular shared storage I know | 08:03 |
gibi | yeah I think is_shared_instance_path is the case when the instance dir is on e.g. NFS | 08:03 |
bauzas | yup | 08:03 |
gibi | we need to call cleanup if power mgmt is enabled so the current fix proposed would regress that | 08:04 |
bauzas | I need to see what makes migrate_data.is_shared_block_storage | 08:04 |
gibi | yep | 08:04 |
bauzas | and it would also regress the mdev live-migration support | 08:05 |
bauzas | we need to call cleanup to delete the mdevs we reserved | 08:05 |
bauzas | hence the 'or' | 08:05 |
bauzas | I need a coffee | 08:06 |
bauzas | it's early and I'm already fried | 08:06 |
bauzas | trying to improve my local LTE connection took too much of my time | 08:06 |
gibi | https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/virt/libvirt/driver.py#L10367-L10369 | 08:08 |
bauzas | ooooh I see | 08:09 |
gibi | https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/virt/libvirt/driver.py#L10470 | 08:09 |
bauzas | yeah, so this is when you have BFV or you share the same libvirt image backend between both | 08:10 |
bauzas | that's not NFS | 08:10 |
gibi | yeah it is the RBD image backend | 08:11 |
bauzas | so | 08:11 |
bauzas | I think of something quick | 08:11 |
bauzas | previously, we were NOT calling cleanup on shared storage | 08:11 |
bauzas | right? | 08:12 |
bauzas | now we WANT to call it too, but we don't want to destroy the disks | 08:12 |
bauzas | I see a simple solution | 08:12 |
bauzas | which is to add another clause to destroy_disks | 08:13 |
bauzas | gibi: thoughts ? | 08:13 |
gibi | thinking... | 08:14 |
gibi | we did not call cleanup if is_shared_instance_path was true. And now we need it due to power mgmt. | 08:14 |
bauzas | yup | 08:14 |
bauzas | but we need to call it smartly | 08:15 |
gibi | yes | 08:15 |
gibi | would the change destroy_disks = not (migrate_data.is_shared_block_storage or migrate_data.is_shared_instance_path) work? | 08:16 |
bauzas | that's my hope | 08:16 |
bauzas | but we need to check the cleanup method | 08:16 |
gibi | https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/virt/libvirt/driver.py#L1707-L1724 this is the relevant path. I think it would work | 08:20 |
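The boolean logic of gibi's proposed fix can be checked in isolation. The attribute names is_shared_block_storage and is_shared_instance_path come from the chat; plain booleans stand in for the real migrate_data object here.

```python
# Sketch of the proposed fix from the discussion above: disks should
# only be destroyed when storage is shared in *neither* sense
# (neither a shared block backend like RBD nor a shared instance path
# like NFS).
def destroy_disks(is_shared_block_storage, is_shared_instance_path):
    return not (is_shared_block_storage or is_shared_instance_path)

# Only the fully-local case should delete disks:
assert destroy_disks(False, False) is True
assert destroy_disks(True, False) is False   # shared block backend (e.g. RBD)
assert destroy_disks(False, True) is False   # shared instance dir (e.g. NFS)
assert destroy_disks(True, True) is False
```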
gibi | but we don't have reproduction | 08:20 |
bauzas | yeah | 08:21 |
bauzas | we need a functional test | 08:21 |
gibi | In 20 mins I need to step away for more than an hour | 08:22 |
bauzas | I can try to work on the reproducer | 08:22 |
gibi | I will summarize the above discussion in the bug | 08:22 |
bauzas | I already did in the report | 08:22 |
gibi | cool | 08:23 |
gibi | I will add a request to the reporter to try, if they can, our fix proposal | 08:23 |
bauzas | please, but I assume he's on West-US timezones | 08:23 |
gibi | done | 08:36 |
*** bauzas_ is now known as bauzas | 09:30 | |
opendevreview | Sylvain Bauza proposed openstack/nova master: Add Dalmatian prelude section https://review.opendev.org/c/openstack/nova/+/928995 | 10:28 |
opendevreview | Merged openstack/osc-placement master: Update master for stable/2024.2 https://review.opendev.org/c/openstack/osc-placement/+/928360 | 10:28 |
yusufgungor_ | hi everyone, we think we hit a strange bug. We got the mtu error below on the source compute node when live migrating an instance in the scenarios listed. | 11:32 |
yusufgungor_ | "Target network card MTU 0 does not match source 1450: libvirt.libvirtError: unsupported configuration: Target network card MTU 0 does not match source 1450" | 11:32 |
yusufgungor_ | Also debug logs shows that mtu is null for one of the interfaces for the failed scenarios. _update_vif_xml function prints log as below on debug: | 11:32 |
yusufgungor_ | Updating guest XML with vif config: <interface type="ethernet"> | 11:32 |
yusufgungor_ | <mac address="fa:16:3e:19:a8:ba"/> | 11:33 |
yusufgungor_ | <model type="virtio"/> | 11:33 |
yusufgungor_ | <target dev="tapfb1b3816-ed"/> | 11:33 |
yusufgungor_ | </interface> | 11:33 |
yusufgungor_ | Updating guest XML with vif config: <interface type="ethernet"> | 11:33 |
yusufgungor_ | <mac address="fa:16:3e:8e:e7:b3"/> | 11:33 |
yusufgungor_ | <model type="virtio"/> | 11:33 |
yusufgungor_ | <mtu size="1500"/> | 11:33 |
yusufgungor_ | <target dev="tapd001bd30-9c"/> | 11:33 |
yusufgungor_ | </interface> | 11:33 |
yusufgungor_ | Should we create a bug report? This mtu error causes the migration to revert and triggers another strange bug on the os-brick side which unmaps the incorrect iscsi device for a random instance :/ That is another story. | 11:33 |
yusufgungor_ | Scenarios: | 11:33 |
yusufgungor_ | 1. When creating an instance from a vxlan tenant network, then attaching an interface from the provider network, and live migrating, it gives an mtu error! | 11:33 |
yusufgungor_ | 2. When creating an instance from a vxlan tenant network, then attaching an interface from the provider network, rebooting, and live migrating, it does not give an mtu error. | 11:33 |
yusufgungor_ | 3. When creating an instance only from the provider network, it does not give an mtu error. | 11:33 |
yusufgungor_ | 4. When creating an instance from the provider network, then attaching an interface from the vxlan tenant network, and migrating, it gives an mtu error! | 11:33 |
yusufgungor_ | 5. When creating an instance, giving an IP from both the vxlan tenant network and the provider network, and live migrating, it does not give an mtu error. | 11:33 |
yusufgungor_ | 6. When creating an instance only from the provider network, it does not give an mtu error. Then, when attaching a leg from the vxlan tenant network and migrating again, it gives an mtu error! | 11:33 |
yusufgungor_ | nova-compute --version : 26.2.1 (osp cluster is zed version) | 11:36 |
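The debug output pasted above shows the symptom: one interface element carries an `<mtu>` child and the other does not, and libvirt reads the missing value as 0, which produces the "Target network card MTU 0 does not match source 1450" error. A standalone sketch of that mismatch (this is not nova's _update_vif_xml, just an illustration):

```python
# Minimal illustration of the symptom in the debug output above: one
# interface element has an <mtu> child and one does not; an absent
# <mtu> effectively reads as 0, hence the libvirt mismatch error.
import xml.etree.ElementTree as ET

good = ET.fromstring(
    '<interface type="ethernet">'
    '<mac address="fa:16:3e:8e:e7:b3"/><mtu size="1500"/>'
    '</interface>')
bad = ET.fromstring(
    '<interface type="ethernet">'
    '<mac address="fa:16:3e:19:a8:ba"/>'
    '</interface>')

def mtu_of(iface):
    # mirror the observed behaviour: absent <mtu> reads as 0
    el = iface.find('mtu')
    return int(el.get('size')) if el is not None else 0

print(mtu_of(good), mtu_of(bad))  # 1500 0
```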
sean-k-mooney | yusufgungor_: so it should be noted that nova does not support multi-segment networks with different mtus | 12:04 |
sean-k-mooney | yusufgungor_: vxlan has 48(?) bytes of encapsulation overhead so we round that up to 50 and reduce its mtu vs vlan/flat in neutron automatically | 12:05 |
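The overhead sean-k-mooney is recalling works out to exactly 50 bytes for VXLAN over IPv4 (per the RFC 7348 framing), which matches the 1450 tenant-network MTU mentioned in this conversation:

```python
# VXLAN-over-IPv4 encapsulation overhead, per RFC 7348 framing.
# The inner Ethernet header rides inside the VXLAN payload, so it
# counts against the outer 1500-byte MTU along with the outer headers.
inner_eth = 14   # inner Ethernet header carried in the VXLAN payload
outer_ip = 20    # outer IPv4 header
outer_udp = 8    # UDP header
vxlan_hdr = 8    # VXLAN header
overhead = inner_eth + outer_ip + outer_udp + vxlan_hdr
print(overhead)            # 50
print(1500 - overhead)     # 1450, the tenant-network MTU from the chat
```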
sean-k-mooney | yusufgungor_: you should file a bug with as much detail as is relevant | 12:07 |
yusufgungor_ | @sean thanks for the reply. so a nova instance with both a vxlan tenant network (mtu: 1450) and a provider network (mtu: 1500) is not supported? Actually it works except for live migration in specific cases. For example, after hard rebooting the instance we can live migrate too. | 12:08 |
yusufgungor_ | @sean-k-mooney thanks, then i will file a bug report to nova | 12:08 |
sean-k-mooney | no, you can have an instance with a mix of both | 12:09 |
sean-k-mooney | that should be fine | 12:09 |
sean-k-mooney | just not a neutron network that has a vlan segment and a vxlan segment with different mtus | 12:10 |
sean-k-mooney | that actually might work from the nova side as we take the mtu from the port | 12:10 |
sean-k-mooney | but it would break l2 connectivity between the segments, as mtus should only change on l3 (subnet) boundaries | 12:11 |
yusufgungor_ | @sean-k-mooney thanks. actually we have tried setting the provider network mtu to 1450 too and tested, but hit the same live migration error | 12:12 |
yusufgungor_ | @sean-k-mooney i will test it again, and file a bug | 12:12 |
sean-k-mooney | oh this might be an old bug then | 12:12 |
sean-k-mooney | so we do not support changing the mtu of neutron networks | 12:13 |
sean-k-mooney | and by we i mean nova | 12:13 |
sean-k-mooney | the only way to do that properly is if you detach and reattach all the ports or otherwise regenerate the guest xml | 12:14 |
yusufgungor_ | @sean-k-mooney thanks 🙏 | 12:15 |
ratailor_ | sean-k-mooney, could you please provide your suggestion on this approach https://review.opendev.org/c/openstack/nova/+/928933 for https://bugs.launchpad.net/nova/+bug/2058928 | 12:55 |
ratailor_ | sean-k-mooney, the finish_time field is not shown in the instance action show output; that's why we need to do this before calling the instance action finish method for the respective APIs where we need to finish the started action. | 12:56 |
ratailor_ | sean-k-mooney, I will fix failing tests and add new tests wherever required, once this approach looks good. | 12:57 |
ratailor_ | artom, gibi please provide your suggestions as well ^^ | 13:00 |
opendevreview | sean mooney proposed openstack/nova master: repoduce post liberty pre vicoria instance numa db issue https://review.opendev.org/c/openstack/nova/+/929025 | 13:01 |
sean-k-mooney | dansmith: gibi bauzas ^ | 13:02 |
sean-k-mooney | ill file a bug for that and update it when i push the fix patch | 13:04 |
yusufgungor_ | @sean-k-mooney i have filed a bug: https://bugs.launchpad.net/nova/+bug/2080531 | 13:31 |
opendevreview | Dan Smith proposed openstack/nova stable/2023.1: [stable-only]: Port qcow2v2 support from oslo https://review.opendev.org/c/openstack/nova/+/928937 | 14:02 |
opendevreview | Rajesh Tailor proposed openstack/nova master: Add support for showing finish_time https://review.opendev.org/c/openstack/nova/+/928933 | 14:08 |
dansmith | sean-k-mooney: are you respinning the safety patch to invert the safe/unsafe logic or no? | 16:29 |
dansmith | your comment made it sound like yes, but... | 16:30 |
sean-k-mooney | no | 16:31 |
sean-k-mooney | i inverted the logic already | 16:31 |
sean-k-mooney | so it should be good as is | 16:31 |
dansmith | okay | 16:31 |
sean-k-mooney | unsafe would have been less work but i was already 30 mins into fixing the tests | 16:31 |
sean-k-mooney | so i decided to just finish it | 16:31 |
dansmith | yeah I just wasn't sure from your comment if you were going to respin is all | 16:32 |
bauzas | hum, looks like we don't have any existing functional tests checking shared storage | 16:32 |
sean-k-mooney | ya i probably should have dropped the comment | 16:32 |
sean-k-mooney | bauzas: we may under the libvirt section | 16:32 |
sean-k-mooney | but in general it's light if it exists | 16:33 |
bauzas | my grep on shared_storage or just storage shows me nothing | 16:33 |
bauzas | I assume I may have to mock this | 16:34 |
bauzas | iirc, we just detect shared storage if we can see the same instance files on both computes, right? | 16:34 |
sean-k-mooney | yep you can mock that function | 16:36 |
sean-k-mooney | i would just create a regression test in a separate file | 16:36 |
sean-k-mooney | and use https://github.com/openstack/nova/blob/master/nova/tests/functional/libvirt/base.py#L264 to make live migration work | 16:36 |
sean-k-mooney | then stub the function for the shared storage check to return true | 16:36 |
sean-k-mooney | https://github.com/openstack/nova/blob/16d815990b53b9afa35fc4da38609f87820fa690/nova/tests/functional/regressions/test_bug_1939545.py | 16:36 |
bauzas | we have a couple of functional tests for live-migration, that's not a problem | 16:36 |
sean-k-mooney | i think has all the ground work | 16:37 |
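The stubbing pattern sean-k-mooney suggests can be shown on a toy driver rather than nova's real classes; _is_shared_instance_path and live_migration_flags here are hypothetical names standing in for the real check and the code path that consumes it.

```python
# Sketch of the stubbing approach discussed above, on a toy driver
# instead of nova's real libvirt driver. The method names are
# illustrative, not nova's actual API.
from unittest import mock

class FakeDriver:
    def _is_shared_instance_path(self, instance):
        # the real code would compare instance files across both computes
        return False

    def live_migration_flags(self, instance):
        shared = self._is_shared_instance_path(instance)
        destroy_disks = not shared
        return destroy_disks

driver = FakeDriver()
assert driver.live_migration_flags(None) is True  # local storage path

with mock.patch.object(
        FakeDriver, '_is_shared_instance_path', return_value=True):
    # the test now exercises the shared-storage code path
    assert driver.live_migration_flags(None) is False
```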
bauzas | oh cool, thanks for the digging but I was about to do it :) | 16:37 |
bauzas | okay, I'm out of steam and I don't have ideas on how to correctly assert we delete the local disks, except checking we call the delete | 17:20 |
bauzas | see you tomorrow | 17:20 |
*** bauzas_ is now known as bauzas | 19:48 | |
opendevreview | sean mooney proposed openstack/nova master: repoduce post liberty pre vicoria instance numa db issue https://review.opendev.org/c/openstack/nova/+/929025 | 20:09 |
opendevreview | sean mooney proposed openstack/nova master: allow upgrade of pre-victoria InstanceNUMACells https://review.opendev.org/c/openstack/nova/+/929187 | 20:09 |
opendevreview | sean mooney proposed openstack/nova master: repoduce post liberty pre vicoria instance numa db issue https://review.opendev.org/c/openstack/nova/+/929025 | 20:14 |
opendevreview | sean mooney proposed openstack/nova master: allow upgrade of pre-victoria InstanceNUMACells https://review.opendev.org/c/openstack/nova/+/929187 | 20:14 |
sean-k-mooney | looks like openstack client is broken | 22:11 |
sean-k-mooney | 2024-09-12 21:04:32.171696 | controller | + /opt/stack/nova/roles/run-evacuate-hook/files/test_negative_evacuate.sh:evacuate_and_wait_for_error:19 : openstack --os-compute-api-version 2.67 server evacuate --host np0038485592 evacuate-test | 22:11 |
sean-k-mooney | 2024-09-12 21:04:33.305045 | controller | compute version 2.67 is not in supported versions: 2, 2.1 | 22:11 |
sean-k-mooney | 2024-09-12 21:04:34.474915 | controller | Proxy.evacuate_server() got an unexpected keyword argument 'password' | 22:11 |
sean-k-mooney | stephenfin: ^ fyi for the morning | 22:12 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!