Wednesday, 2025-07-16

[01:18] <gmaan> finally this is merged, the ceph job should be green now https://review.opendev.org/c/openstack/tempest/+/954949
[01:18] <gmaan> gibi: ^^
[07:32] <gibi> gmaan: thanks!
[07:34] <gibi> gmaan: https://review.opendev.org/c/openstack/nova/+/954956 now proves that the original problem is ceph version 19.2.1: with the packages pinned back to 19.2.0 the tempest test still passes. I'm in the process of finding a way to report the bug to ceph upstream...
[10:38] <sean-k-mooney> gibi: as a workaround, should we just move the affected jobs to debian, like the nova-hybrid-plug job?
[10:38] <sean-k-mooney> i can push up a patch to just update the nodeset and see if that passes
[10:39] <sean-k-mooney> we can always swap back later if we want, or keep the ceph jobs on debian. we have had no real issues with the hybrid-plug job since we swapped
[10:40] <sean-k-mooney> we are installing ceph via cephadm using containers, so the ceph version won't change; it would just be the qemu and librbd/librados versions that change
[10:40] <sean-k-mooney> since that's the only part that actually comes from the distro packages in the devstack job
[10:41] <sean-k-mooney> assuming that works we could avoid skipping the testing
[10:41] <sean-k-mooney> oh, https://review.opendev.org/c/openstack/tempest/+/954949 merged...
[10:42] <sean-k-mooney> well, i guess i can propose a revert and depend-on and test that as well
[10:57] <opendevreview> sean mooney proposed openstack/nova master: move ceph jobs to debian to avoid bug 2116852  https://review.opendev.org/c/openstack/nova/+/955179
[10:58] <sean-k-mooney> so something like ^
[10:59] <sean-k-mooney> we can see if that passes and decide if we want to keep it disabled or just do that. we could also consider doing the debian swap in the parent devstack jobs so it fixes it for everyone
[10:59] <sean-k-mooney> so devstack-plugin-ceph-tempest-py3 and devstack-plugin-ceph-multinode-tempest-py3
[11:00] <sean-k-mooney> which are defined in https://github.com/openstack/devstack-plugin-ceph/blob/master/.zuul.yaml
[11:29] <gibi> I've added the Ceph project on https://bugs.launchpad.net/nova/+bug/2116852 to let ubuntu see the issue
[12:09] <opendevreview> Merged openstack/nova master: Move ComputeManager to use spawn_on  https://review.opendev.org/c/openstack/nova/+/948186
[13:45] <noonedeadpunk> hey folks! am I right that there is so far no way to supply arbitrary arguments as migration_flags?
[13:48] <noonedeadpunk> specifically thinking about VIR_MIGRATE_PARALLEL and VIR_MIGRATE_PARAM_PARALLEL_CONNECTIONS right now
[14:08] <sean-k-mooney> noonedeadpunk: by design, no, not really. that's an internal interface of nova's libvirt driver
[14:08] <sean-k-mooney> we add flags when we support new features in our config.
[14:08] <sean-k-mooney> unofficially you can very occasionally enable something via query args on the migration uri
[14:10] <sean-k-mooney> we would need to actually add support for that in nova properly if we were to support it
[14:15] <noonedeadpunk> yeah, I was planning to look into adding this to nova, but also wondering how to work around the current migration failures
[14:16] <noonedeadpunk> as it seems the two ways so far are either to disable tls or to increase throughput
[14:24] <noonedeadpunk> do you think I'd need to submit a full-fledged spec for that?
[14:26] <dansmith> for which? always-on migration flags, a config knob to enable one, or something like generic flag-merging support?
[14:27] <dansmith> in any of those cases we probably want to document the need, risk, config behavior, etc
[14:30] <gibi> eventlet sync https://meet.google.com/bcy-uqoz-hje
[14:32] <noonedeadpunk> dansmith: a config knob, I believe. to enable/disable parallel and the number of forks
[14:32] <noonedeadpunk> ok, I guess I'll come up with something relatively simple
[14:46] <dansmith> noonedeadpunk: I think a spec for that, yeah
[15:19] <noonedeadpunk> sean-k-mooney: do you recall the live migration failure issue I came to you with a month ago or so?
[15:19] <noonedeadpunk> I was able to reproduce it on plain libvirt+qemu on ubuntu
[15:20] <noonedeadpunk> and pretty much it's a single usecase which is failing....
[15:20] <noonedeadpunk> virsh migrate --live --auto-converge --persistent --copy-storage-all --tls instance-00016a41 qemu+tls://compute10/system
[15:20] <noonedeadpunk> if I drop tls, add parallel, or use compress - the migration passes.
[15:21] <noonedeadpunk> but this is also really confusing... how in the world does using qemu+tls without --tls not actually use tls...
[15:22] <sean-k-mooney> noonedeadpunk: reading back, and yes, vaguely
[15:23] <sean-k-mooney> spec vs specless blueprint in this context partly comes down to the upgrade impact, the other factor being what versions of libvirt/qemu support it etc.
[15:23] <noonedeadpunk> but anyway, wanted to confirm it has nothing to do with nova. But I'm very surprised nobody else has reported it or come with that
[15:23] <sean-k-mooney> we do allow new config options that default to the old behavior without a spec
[15:23] <noonedeadpunk> the intention was to keep the behavior for sure
[15:24] <sean-k-mooney> but if it impacts upgrades in a mixed compute node version case it almost always needs a spec
[15:25] <sean-k-mooney> noonedeadpunk: but if i recall correctly you were seeing qemu crashes, right?
[15:25] <noonedeadpunk> I was thinking to add an option, say live_migration_parallel_connections, with a default of 1, and when it's >1 add https://libvirt.org/html/libvirt-libvirt-domain.html#VIR_MIGRATE_PARAM_PARALLEL_CONNECTIONS and libvirt.VIR_MIGRATE_PARALLEL
[15:25] <noonedeadpunk> sean-k-mooney: I do
[15:26] <noonedeadpunk> also thinking to report a bug to qemu, and gathering some debug logs now
[15:26] <sean-k-mooney> so i don't actually have a problem with a config option that sets the number of parallel connections and using that to enable this
[15:27] <sean-k-mooney> but beyond that we would need to make sure the version of libvirt supports it
[15:27] <sean-k-mooney> and qemu
[15:27] <sean-k-mooney> both on the source and dest host
[15:27] <sean-k-mooney> and we may or may not want to schedule on it.
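[editor's note] The option discussed above — a hypothetical `live_migration_parallel_connections` defaulting to 1 so behaviour only changes when an operator opts in — could map onto libvirt's migration API roughly as follows. This is a sketch, not nova's actual driver code; the constant values mirror libvirt's public enum (`VIR_MIGRATE_PARALLEL`) and typed-param key (`"parallel.connections"`) and are inlined so the snippet stands alone without the libvirt python bindings:

```python
# Sketch: translating a hypothetical config option into libvirt
# migration flags/params. The constants below mirror libvirt's API so
# the example is self-contained; real code would use the libvirt module.
VIR_MIGRATE_PARALLEL = 1 << 17                       # virDomainMigrateFlags
VIR_MIGRATE_PARAM_PARALLEL_CONNECTIONS = "parallel.connections"

def build_migration_args(base_flags: int, parallel_connections: int = 1):
    """Return (flags, params) for migrateToURI3-style calls.

    Parallel mode is only enabled when connections > 1, so the default
    of 1 preserves today's behaviour exactly.
    """
    flags, params = base_flags, {}
    if parallel_connections > 1:
        flags |= VIR_MIGRATE_PARALLEL
        params[VIR_MIGRATE_PARAM_PARALLEL_CONNECTIONS] = parallel_connections
    return flags, params
```

As noted in the discussion, a real implementation would also need to gate this on libvirt/qemu versions (and possibly a minimum compute service version) on both source and destination.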
[15:27] <noonedeadpunk> any gotchas on how to trace that back?
[15:28] <sean-k-mooney> you have replicated this with that virsh migrate command, yes?
[15:28] <sean-k-mooney> if you have, ideally without openstack at all, i think we should file a libvirt bug for that
[15:29] <noonedeadpunk> yeah, it's without openstack at all
[15:29] <noonedeadpunk> or well
[15:29] <dansmith> I think a spec to explain how this works and what the considerations are is important.. like, does this have a dependency on the remote version? do we want to schedule for the feature? if not, why not?
[15:29] <noonedeadpunk> I copied a big chunk of the domain xml, minus the nova metadata
[15:30] <sean-k-mooney> ok, but you actually created the vm with libvirt directly, right?
[15:30] <noonedeadpunk> yep
[15:30] <noonedeadpunk> just 2 independent hosts
[15:30] <sean-k-mooney> if you can do that it's easier for the libvirt maintainers to recreate this
[15:30] <noonedeadpunk> right
[15:31] <sean-k-mooney> so i think there are 3 things that can/should happen: 1 file a libvirt bug, 2 we can discuss a spec for the parallel connection support, 3 we can discuss a workaround for your production cloud
[15:31] <noonedeadpunk> so parallel migration was added in libvirt 5.2: https://github.com/libvirt/libvirt/commit/d3ea986af24fdb320a54854b6d6668a51ecb0cd0
[15:32] <sean-k-mooney> that's a good data point as it's below our min version. what about qemu?
[15:32] <noonedeadpunk> I'm not that fast, sorry
[15:32] <sean-k-mooney> it's fine
[15:33] <sean-k-mooney> gemini thinks it was in qemu 2.1 but i'm not sure i trust it
[15:34] <sean-k-mooney> https://wiki.qemu.org/Features/Migration-Multiple-fds
[15:35] <sean-k-mooney> by the way, i know there was also interest in enabling this for vgpu live migration
[15:35] <noonedeadpunk> it's hard to find so far, as plenty of related multifd work was done last year
[15:35] <sean-k-mooney> multifd was introduced in QEMU 2.11 in late 2017 by Juan Quintela
[15:36] <noonedeadpunk> yeah, ok, so the min versions pass for sure
[15:36] <sean-k-mooney> https://kvm-forum.qemu.org/2024/kvm-forum-2024-multifd-device-state-transfer_3K5EQIG.pdf#page=7
[15:36] <noonedeadpunk> yeah, 10gbit is exactly what I see
[15:36] <noonedeadpunk> which is annoying on 100gbit cards...
[15:37] <sean-k-mooney> and when you have gpus with large amounts of vram that they are actively updating
[15:37] <noonedeadpunk> I was also thinking about the rdma option, but we don't have hugepages everywhere...
[15:38] <noonedeadpunk> (or well, we have them only on *very* selected hosts, added after you helped last time)
[15:38] <sean-k-mooney> enabling multiple connections is a generically useful feature so i would keep that proposal simple
[15:38] <noonedeadpunk> ok, cool, so I'll do 1 and 2 for now then
[15:38] <sean-k-mooney> i would probably do something like a min compute service version check to know if this is supported
[15:39] <sean-k-mooney> which avoids the need to schedule on it, but if you file a spec we can discuss that there
[15:39] <sean-k-mooney> for 3, your production case.
[15:40] <sean-k-mooney> i assume you need tls and can't take the performance overhead of using ssh instead
[15:40] <noonedeadpunk> I'm not sure it's working either...
[15:40] <noonedeadpunk> so far I'm thinking about doing a local backport of the proposal
[15:41] <sean-k-mooney> ack
[15:41] <sean-k-mooney> you have been around openstack long enough to know the pros and cons of that
[15:42] <noonedeadpunk> I can't think of a better option either :(
[15:42] <noonedeadpunk> and then hope it would be merged for 2026.1...
[15:43] <sean-k-mooney> if using ssh worked instead of qemu+tls it would act as a mitigation, although the better option would be for the qemu/libvirt bug to be fixed
[15:44] <noonedeadpunk> I kinda still don't get how qemu+tls without --tls works tbh
[15:45] <sean-k-mooney> the protocol is meant to imply --tls i think, although there are 2 connections to consider. you have the connection to your local libvirt and the connection to the remote one
[15:45] <sean-k-mooney> i think --tls is specifying that tls should be used for the remote one
[15:45] <noonedeadpunk> yes, right, a qemu/libvirt fix would be really nice, though I'd guess the lead time might be even higher than waiting for a spec to merge
[15:46] <noonedeadpunk> sean-k-mooney: well, it seems so, but in qemu+tls://compute10/system, compute10 is the remote host...
[15:46] <sean-k-mooney> also, to be fair, i never actually use virsh to do migrations manually so i'm not familiar with the specifics here
[15:47] <noonedeadpunk> and also I have listen_tls = 1 and listen_tcp = 0 in /etc/libvirt/libvirtd.conf
[15:47] <noonedeadpunk> so wtf...
[15:48] <noonedeadpunk> probably I should try going to their IRC with this though
[15:50] <sean-k-mooney> https://libvirt.org/migration.html implies that the uri is enough but it does not say one way or another
[15:51] <sean-k-mooney> there is a way to force the use of tls via migrate_tls_force in /etc/libvirt/qemu.conf but that should not be required, especially with nova
[15:53] <sean-k-mooney> oh right, you use -c to specify the local connection to virsh
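[editor's note] For reference, the knob mentioned above is a one-line setting in libvirt's QEMU driver config; per the discussion it should not normally be needed with nova, but it makes migrations fail closed rather than silently dropping TLS:

```ini
# /etc/libvirt/qemu.conf
# Refuse any migration that would not use TLS, regardless of whether the
# client passed --tls / VIR_MIGRATE_TLS. Restart libvirtd after changing.
migrate_tls_force = 1
```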
[15:54] <noonedeadpunk> sean-k-mooney: https://gitlab.com/qemu-project/qemu/-/issues/1937
[15:58] <sean-k-mooney> oh, you found a 2023 issue in qemu
[16:01] <noonedeadpunk> yeah
[16:01] <sean-k-mooney> looks like dan berrange is actively engaged
[16:01] <sean-k-mooney> i can mention to him today that this affects nova as well
[16:01] <noonedeadpunk> and in #virt they suggest using parallel as a general rule
[16:01] <sean-k-mooney> they also recommended multifd migration
[16:02] <sean-k-mooney> so that's another reason to consider your proposal and add support in nova
[16:04] <noonedeadpunk> yeah, I will work on it tomorrow then :)
[16:18] <sean-k-mooney> noonedeadpunk: so there is a possible workaround via config files, assuming you do not need fips
[16:19] <sean-k-mooney> noonedeadpunk: i was talking to danpb downstream about this and https://issues.redhat.com/browse/RHEL-103240?focusedId=27595864&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-27595864 is a workaround
[16:20] <sean-k-mooney> noonedeadpunk: you can disable AES as one of the allowed encryption methods
[16:20] <noonedeadpunk> ah, he mentioned that in IRC actually
[16:21] <noonedeadpunk> but thanks for pointing to that report
[16:21] <sean-k-mooney> that breaks fips compliance, but it might be a better option than a downstream code change if it's acceptable from a security perspective until the issue is fixed properly
[16:21] <noonedeadpunk> I think I will try it, as all the other options are kinda not perfect
[16:21] <noonedeadpunk> I don't want tunneled migrations as we have local storage here and there
[16:22] <noonedeadpunk> and parallel and compress both don't solve the issue, just make it more distant
[16:22] <noonedeadpunk> so it's a question of how big and loaded the VM has to be
[16:23] <sean-k-mooney> well, as someone who prefers the simple life of migrating unloaded cirros images in my devstack, small and light :) but customers want to run real workloads, how unreasonable :)
[16:23] <noonedeadpunk> It's also a question of whether there could be more side-effects from disabling aes
[16:24] <sean-k-mooney> so there are some. if you don't use aes and use one of the other algorithms you may not be able to leverage aes-ni or similar extensions to offload the encryption to hardware
[16:24] <noonedeadpunk> yeah, and then we start talking about the customer who moved from VMWare with their 1Tb MSSQL with 200gb of ram...
[16:24] <sean-k-mooney> the side effect of that would be a bottleneck on throughput
[16:25] <noonedeadpunk> yeah, but there's also e.g. OVN on the host likely using AES as well
[16:25] <sean-k-mooney> i remember jay pipes talking about one of the customers mirantis had where they literally ran 1 VM per compute node that used all the cpus/ram
[16:25] <noonedeadpunk> or well, ovs
[16:26] <sean-k-mooney> they literally just wanted an api to create/delete vms effectively and to provide some level of security via a vm abstraction
[16:26] <noonedeadpunk> I wonder how scheduling worked for them if they needed to evacuate the VM on hypervisor failure
[16:26] <sean-k-mooney> hopefully they had hot spares
[16:27] <sean-k-mooney> as in free hosts
[16:27] <sean-k-mooney> so i believe this drop-in file is specific to qemu
[16:27] <sean-k-mooney> i.e. it's creating /etc/crypto-policies/local.d/gnutls-qemu.config
[16:27] <sean-k-mooney> but it would apply to all tls connections used by qemu for anything it's doing
[16:28] <sean-k-mooney> but again i'm not an expert on this so that may be incorrect
[16:28] <noonedeadpunk> well, that would be fine then
[16:28] <noonedeadpunk> I was more wondering if crypto-policies just does a fileglob of /etc/crypto-policies/local.d/*.config
[16:28] <noonedeadpunk> but will check on that
[16:29] <sean-k-mooney> i think that is what the QEMU= at the start of it is doing, not the file name
[16:29] <sean-k-mooney> but again i have not looked into this deeply at all
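[editor's note] Per the workaround discussed above, the drop-in would look something like the following. The mechanism, as understood here, is that the `QEMU=` key defines a GnuTLS priority string named QEMU which qemu resolves via its default TLS priority; the exact algorithm list below is illustrative only — consult the RHEL issue linked above for the recommended string:

```ini
# /etc/crypto-policies/local.d/gnutls-qemu.config  (illustrative sketch)
# Defines a GnuTLS priority string named "QEMU" that drops AES ciphers,
# steering qemu's TLS migration channel to e.g. CHACHA20-POLY1305.
# Run update-crypto-policies after creating this file.
QEMU=NORMAL:-CIPHER-ALL:+CHACHA20-POLY1305
```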
[16:30] <noonedeadpunk> actually, I'd argue that using CHACHA20-POLY1305 should be faster than AES
[16:32] <sean-k-mooney> it depends. aes on sufficiently new intel chips will be done in dedicated hardware, especially if you have access to qat accelerators
[16:32] <sean-k-mooney> but it might be faster, and you will really only know once you try it on your hardware
[16:36] <noonedeadpunk> I totally need to sleep on all this info
[17:19] <opendevreview> Merged openstack/nova master: FUP: Translate scatter-gather to futurist  https://review.opendev.org/c/openstack/nova/+/953338
[17:19] <opendevreview> Merged openstack/nova master: Add Project Manager role context in unit tests  https://review.opendev.org/c/openstack/nova/+/941056

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!