gmaan | finally this is merged, ceph job should be green now https://review.opendev.org/c/openstack/tempest/+/954949 | 01:18 |
---|---|---|
gmaan | gibi: ^^ | 01:18 |
gibi | gmaan: thanks! | 07:32 |
gibi | gmaan: https://review.opendev.org/c/openstack/nova/+/954956 this now proves that the original problem is the ceph version 19.2.1: with the packages pinned back to 19.2.0 the tempest test still passes. I'm in the process of finding a way to report the bug to ceph upstream... | 07:34 |
sean-k-mooney | gibi: as a workaround should we just move the affected jobs to Debian like the nova-hybrid-plug job | 10:38 |
sean-k-mooney | i can push up a patch to just update the nodeset and see if that passes | 10:38 |
sean-k-mooney | we can always swap back later if we want, or keep the ceph jobs on Debian. we have had no real issues with the hybrid plug job since we swapped | 10:39 |
sean-k-mooney | we are installing ceph via cephadm using containers so the ceph version won't change, it would just be the qemu and librbd/rados versions that change | 10:40 |
sean-k-mooney | since that's the only part that actually comes from the distro packages in the devstack job | 10:40 |
sean-k-mooney | assuming that works we could avoid skipping the testing | 10:41 |
sean-k-mooney | oh https://review.opendev.org/c/openstack/tempest/+/954949 merged... | 10:41 |
sean-k-mooney | well i guess i can propose a revert, depend on it, and test that as well | 10:42 |
opendevreview | sean mooney proposed openstack/nova master: move ceph jobs to debian to avoid bug 2116852 https://review.opendev.org/c/openstack/nova/+/955179 | 10:57 |
sean-k-mooney | so something like ^ | 10:58 |
sean-k-mooney | we can see if that passes and decide if we want to keep it disabled or just do that. we could also consider doing the Debian swap in the parent devstack jobs so it fixes it for everyone | 10:59 |
sean-k-mooney | so devstack-plugin-ceph-tempest-py3 and devstack-plugin-ceph-multinode-tempest-py3 | 10:59 |
sean-k-mooney | which are defined in https://github.com/openstack/devstack-plugin-ceph/blob/master/.zuul.yaml | 11:00 |
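The nodeset swap being discussed would be a small Zuul override in nova's own .zuul.yaml, leaving the parent job in devstack-plugin-ceph untouched. A hedged sketch only; the job name and Debian nodeset name here are assumptions, not taken from the actual patch:

```yaml
# Hypothetical sketch of the Debian swap: override only the nodeset of the
# ceph job and inherit everything else from the devstack-plugin-ceph parent.
# "nova-ceph-multistore" and the nodeset name are assumed, not from 955179.
- job:
    name: nova-ceph-multistore
    parent: devstack-plugin-ceph-tempest-py3
    nodeset: devstack-single-node-debian-bookworm
```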
gibi | I've added the Ceph project on https://bugs.launchpad.net/nova/+bug/2116852 to let Ubuntu see the issue | 11:29 |
opendevreview | Merged openstack/nova master: Move ComputeManager to use spawn_on https://review.opendev.org/c/openstack/nova/+/948186 | 12:09 |
noonedeadpunk | hey folks! am I right that there is no way so far to supply arbitrary arguments as migration_flags? | 13:45 |
noonedeadpunk | specifically thinking about VIR_MIGRATE_PARALLEL and VIR_MIGRATE_PARAM_PARALLEL_CONNECTIONS now | 13:48 |
sean-k-mooney | noonedeadpunk: by design no, not really. that's an internal interface of nova's libvirt driver | 14:08 |
sean-k-mooney | we add flags when we support new features in our config. | 14:08 |
sean-k-mooney | unofficially you can very occasionally enable something via query args on the migration uri | 14:08 |
sean-k-mooney | we would need to actually add support for that in nova properly if we were to support it | 14:10 |
noonedeadpunk | yeah, I was planning to look into adding this thing to nova, but also wondering about how to workaround current failures of migration | 14:15 |
noonedeadpunk | As it seems the two ways so far are either to disable tls or to increase throughput | 14:16 |
noonedeadpunk | do you think I'd be needing to submit a full-fledged spec for that? | 14:24 |
dansmith | for which? always-on migration flags, config knob to enable one, or something like generic flag merging support? | 14:26 |
dansmith | in any of those cases we probably want to document the need, risk, config behavior, etc | 14:27 |
gibi | eventlet sync https://meet.google.com/bcy-uqoz-hje | 14:30 |
noonedeadpunk | dansmith: config knob I believe. to enable/disable parallel and the number of forks | 14:32 |
noonedeadpunk | ok, I guess I'll come up with something relatively simple | 14:32 |
dansmith | noonedeadpunk: I think a spec for that yeah | 14:46 |
noonedeadpunk | sean-k-mooney: do you recall the issue I was coming with for live migration failures a month ago or so? | 15:19 |
noonedeadpunk | as I was able to reproduce it on plain libvirt+qemu on ubuntu | 15:19 |
noonedeadpunk | and pretty much it's kind of a single use case which is failing.... | 15:20 |
noonedeadpunk | virsh migrate --live --auto-converge --persistent --copy-storage-all --tls instance-00016a41 qemu+tls://compute10/system | 15:20 |
noonedeadpunk | if I: drop tls, add parallel or use compress - migration passes. | 15:20 |
noonedeadpunk | but this also is really confusing... how in the world using qemu+tls without --tls does not actually use tls... | 15:21 |
sean-k-mooney | noonedeadpunk: reading back and yes, vaguely | 15:22 |
sean-k-mooney | spec vs specless blueprint in this context partly comes down to the upgrade impact, the other factor being what versions of libvirt/qemu support it etc. | 15:23 |
noonedeadpunk | but anyway, wanted to confirm it has nothing to do with nova. But I'm very surprised nobody else was reporting or coming with that | 15:23 |
sean-k-mooney | we do allow new config options that default to the old behaviour without a spec | 15:23 |
noonedeadpunk | intention was to keep the behavior for sure | 15:23 |
sean-k-mooney | but if it impacts upgrades in a mixed compute node version case it almost always needs a spec | 15:24 |
sean-k-mooney | noonedeadpunk: but if i recall correctly you were seeing qemu crashes right | 15:25 |
noonedeadpunk | I was thinking to add an option for instance live_migration_parallel_connections with default of 1, and when it's >1 add https://libvirt.org/html/libvirt-libvirt-domain.html#VIR_MIGRATE_PARAM_PARALLEL_CONNECTIONS and libvirt.VIR_MIGRATE_PARALLEL | 15:25 |
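A minimal sketch of what that knob could look like, assuming the hypothetical option name live_migration_parallel_connections from the line above; the constants are redefined locally so the snippet is self-contained, whereas real driver code would take them from the python libvirt module:

```python
# Values mirror libvirt's libvirt-domain.h; real code would use the libvirt
# python bindings rather than redefining these.
VIR_MIGRATE_PARALLEL = 1 << 17                        # needs libvirt >= 5.2
VIR_MIGRATE_PARAM_PARALLEL_CONNECTIONS = "parallel.connections"


def apply_parallel_option(flags: int, parallel_connections: int = 1):
    """Merge the multifd flag/param only when the operator opts in (>1),
    so the default of 1 keeps the existing migration behaviour."""
    params = {}
    if parallel_connections > 1:
        flags |= VIR_MIGRATE_PARALLEL
        params[VIR_MIGRATE_PARAM_PARALLEL_CONNECTIONS] = parallel_connections
    return flags, params
```

With the default of 1 the flags and params are untouched, which keeps this in the "new config option that defaults to the old behaviour" category.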
noonedeadpunk | sean-k-mooney: I do | 15:25 |
noonedeadpunk | also thinking to report a bug to qemu, and gather some debug logs now | 15:26 |
sean-k-mooney | so i don't actually have a problem with a config option that sets the number of parallel connections and using that to enable this | 15:26 |
sean-k-mooney | but beyond that we would need to make sure the version of libvirt supports that | 15:27 |
sean-k-mooney | and qemu | 15:27 |
sean-k-mooney | both on the source and dest host | 15:27 |
sean-k-mooney | and we may or may not want to schedule on it. | 15:27 |
noonedeadpunk | any gotchas on how to trace that back? | 15:27 |
sean-k-mooney | you have replicated this with that virsh migrate command yes | 15:28 |
sean-k-mooney | if you have, ideally without openstack at all, i think we should file a libvirt bug for that | 15:28 |
noonedeadpunk | yeah, it's without openstack at all | 15:29 |
noonedeadpunk | or well | 15:29 |
dansmith | I think a spec to explain how this works and what the considerations are is important.. like, does this have a dependency on the remote version? do we want to schedule for the feature? if not, why not? | 15:29 |
noonedeadpunk | I copied big chunk of domain xml, minus nova metadata | 15:29 |
sean-k-mooney | ok but you actually created the vm with libvirt directly right | 15:30 |
noonedeadpunk | yep | 15:30 |
noonedeadpunk | just 2 independent hosts | 15:30 |
sean-k-mooney | if you can do that it's easier for the libvirt maintainers to recreate this | 15:30 |
noonedeadpunk | right | 15:30 |
sean-k-mooney | so i think there are 3 things that can/should happen: 1) file a libvirt bug, 2) we can discuss a spec for the parallel connection support, 3) we can discuss a workaround for your production cloud | 15:31 |
noonedeadpunk | so parallel migration was added in libvirt 5.2: https://github.com/libvirt/libvirt/commit/d3ea986af24fdb320a54854b6d6668a51ecb0cd0 | 15:31 |
sean-k-mooney | that's a good data point as it's below our min version, what about qemu? | 15:32 |
noonedeadpunk | I'm not that fast, sorry | 15:32 |
sean-k-mooney | its fine | 15:32 |
sean-k-mooney | gemini thinks it was in qemu 2.1 but i'm not sure i trust it | 15:33 |
sean-k-mooney | https://wiki.qemu.org/Features/Migration-Multiple-fds | 15:34 |
sean-k-mooney | by the way i know there was also interest in enabling this for vgpu live migration | 15:35 |
noonedeadpunk | it's kind of hard to find so far, as plenty of multifd-related work was done last year | 15:35 |
sean-k-mooney | Multifd was introduced in QEMU 2.11 in late 2017 by Juan Quintela | 15:35 |
noonedeadpunk | yeah, ok, so then min versions are passing for sure | 15:36 |
sean-k-mooney | https://kvm-forum.qemu.org/2024/kvm-forum-2024-multifd-device-state-transfer_3K5EQIG.pdf#page=7 | 15:36 |
noonedeadpunk | yeah, exactly what I see is 10gbit | 15:36 |
noonedeadpunk | which is annoying on 100gbit cards... | 15:36 |
sean-k-mooney | and when you have gpus with large amounts of vram that they are actively updating | 15:37 |
noonedeadpunk | I was thinking also about rdma option, but we don't have hugepages everywhere... | 15:37 |
noonedeadpunk | (or well, we have only on *very* selected hosts as added them after you helped last time) | 15:38 |
sean-k-mooney | enabling multiple connections is a generically useful feature so i would keep that proposal simple | 15:38 |
noonedeadpunk | ok, cool, so I'll do 1 and 2 for now then | 15:38 |
sean-k-mooney | i would probably do something like a min compute service version check to know if this is supported | 15:38 |
sean-k-mooney | which avoids the need to schedule on it, but if you file a spec we can discuss that there | 15:39 |
sean-k-mooney | for 3 your production case. | 15:39 |
sean-k-mooney | i assume you need tls and can't take the performance overhead of using ssh instead | 15:40 |
noonedeadpunk | I'm not sure it's working either... | 15:40 |
noonedeadpunk | so far I'm thinking about doing a local backport of proposal | 15:40 |
sean-k-mooney | ack | 15:41 |
sean-k-mooney | you have been around openstack long enough to know the pros and cons of that | 15:41 |
noonedeadpunk | I can't think of better option either :( | 15:42 |
noonedeadpunk | and then hope it would be merged for 2026.1... | 15:42 |
sean-k-mooney | if using ssh worked instead of qemu+tls it would act as a mitigation, although the better option would be for the qemu/libvirt bug to be fixed | 15:43 |
noonedeadpunk | I kinda still don't get how qemu+tls without --tls works tbh | 15:44 |
sean-k-mooney | the protocol is meant to imply --tls i think, although there are 2 qemu connections to consider. you have the connection to your local libvirt and the connection to the remote one | 15:45 |
sean-k-mooney | i think --tls is specifying that tls should be used for the remote one | 15:45 |
noonedeadpunk | yes, right, a qemu/libvirt fix would be really nice, though I'd guess the lead time might be even higher than waiting for the spec to merge | 15:45 |
noonedeadpunk | sean-k-mooney: well. it seems so, but qemu+tls://compute10/system - compute10 is a remote host... | 15:46 |
sean-k-mooney | also to be fair i never actually use virsh to do migrations manually so i'm not familiar with the specifics here | 15:46 |
noonedeadpunk | and also I have listen_tls = 1 and listen_tcp = 0 in /etc/libvirt/libvirtd.conf | 15:47 |
noonedeadpunk | so wtf... | 15:47 |
noonedeadpunk | probably I should try going to their IRC with this though | 15:48 |
sean-k-mooney | https://libvirt.org/migration.html implies that the uri is enough but it does not say one way or the other | 15:50 |
sean-k-mooney | there is a way to force the use of tls via migrate_tls_force in /etc/libvirt/qemu.conf but that should not be required, especially with nova | 15:51 |
sean-k-mooney | oh right, you use -c to specify the local connection to virsh | 15:53 |
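The distinction being teased apart here can be sketched as two virsh invocations, reusing the hosts from the command above: `-c` picks the local libvirt endpoint, the destination URI picks the remote libvirt, and `--tls` additionally asks for TLS on the QEMU-to-QEMU migration data stream, which is separate from the libvirt-to-libvirt transport implied by `qemu+tls://`.

```shell
# The libvirt-to-libvirt control connection uses TLS (qemu+tls://), but
# without --tls the QEMU migration data stream may still go over plain TCP.
virsh -c qemu:///system migrate --live --persistent \
    instance-00016a41 qemu+tls://compute10/system

# --tls additionally requests TLS on the QEMU migration stream itself
# (VIR_MIGRATE_TLS), matching the failing command from the log above.
virsh -c qemu:///system migrate --live --persistent --tls \
    instance-00016a41 qemu+tls://compute10/system
```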
noonedeadpunk | sean-k-mooney: https://gitlab.com/qemu-project/qemu/-/issues/1937 | 15:54 |
sean-k-mooney | oh you found a 2023 issue in qemu | 15:58 |
noonedeadpunk | yeah | 16:01 |
sean-k-mooney | looks like dan berrange is actively engaged | 16:01 |
sean-k-mooney | i can mention to him today that this affects nova as well | 16:01 |
noonedeadpunk | and in #virt they are suggesting using parallel as a general rule | 16:01 |
sean-k-mooney | they also recommended multi-fd migration | 16:01 |
sean-k-mooney | so that's another reason to consider your proposal and add support in nova | 16:02 |
noonedeadpunk | yeah, I will work on it tomorrow then :) | 16:04 |
sean-k-mooney | noonedeadpunk: so there is a possible workaround via config files assuming you do not need fips | 16:18 |
sean-k-mooney | noonedeadpunk: i was talking to danpb downstream about this and https://issues.redhat.com/browse/RHEL-103240?focusedId=27595864&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-27595864 is a workaround | 16:19 |
sean-k-mooney | noonedeadpunk: you can disable AES as one of the allowed encryption methods | 16:20 |
noonedeadpunk | ah, he mentioned that in IRC actually | 16:20 |
noonedeadpunk | but thanks for pointing to that report | 16:21 |
sean-k-mooney | that breaks fips compliance but it might be a better option than a downstream code change, if it's something that is acceptable from a security perspective until the issue is fixed properly | 16:21 |
noonedeadpunk | I think I will try it, as all other options are kinda not perfect | 16:21 |
noonedeadpunk | I don't want tunneled migrations as we have local storage here and there | 16:21 |
noonedeadpunk | and parallel and compress both don't solve the issue, just make it more distant | 16:22 |
noonedeadpunk | so it's a question of how big and loaded the VM is | 16:22 |
sean-k-mooney | well, as someone who prefers the simple life of migrating unloaded cirros images in my devstack, small and light :) but customers want to run real workloads, how unreasonable :) | 16:23 |
noonedeadpunk | It's also a question if there could be more side-effects from disabling aes | 16:23 |
sean-k-mooney | so there are some, if you don't use aes and use one of the other algorithms you may not be able to leverage aes-ni or similar extensions to offload the encryption to hardware | 16:24 |
noonedeadpunk | yeah, and we start talking about that when a customer moved from VMware with their 1TB MSSQL with 200GB of ram... | 16:24 |
sean-k-mooney | the side effect of that would be a bottleneck on throughput | 16:24 |
noonedeadpunk | yeah, but there's also OVN on the host likely using AES as well | 16:25 |
sean-k-mooney | i remember jay pipes talking about one of the customers mirantis had where they literally ran 1 VM per compute node that used all the cpus/ram | 16:25 |
noonedeadpunk | or well, ovs | 16:25 |
sean-k-mooney | they literally just wanted an api to create/delete vms effectively and to provide some level of security via a vm abstraction | 16:26 |
noonedeadpunk | I wonder how scheduling work for them if they need to evacuate the VM on hypervisor failure | 16:26 |
sean-k-mooney | hopefully they had hot spares | 16:26 |
sean-k-mooney | as in free hosts | 16:27 |
sean-k-mooney | so i believe this drop-in file is specific to qemu | 16:27 |
sean-k-mooney | i.e. it's creating /etc/crypto-policies/local.d/gnutls-qemu.config | 16:27 |
sean-k-mooney | but it would apply to all tls connections used by qemu for anything it is doing | 16:27 |
sean-k-mooney | but again i'm not an expert on this so that may be incorrect | 16:28 |
noonedeadpunk | well, this would be fine then | 16:28 |
noonedeadpunk | I was wondering a bit more if crypto-policies just does a fileglob of /etc/crypto-policies/local.d/*.config | 16:28 |
noonedeadpunk | but will check on that | 16:28 |
sean-k-mooney | i think that is what the QEMU= at the start of it is doing, not the file name | 16:29 |
sean-k-mooney | but again i have not looked into this deeply at all | 16:29 |
noonedeadpunk | actually, I'd argue that using CHACHA20-POLY1305 should be faster than AES | 16:30 |
sean-k-mooney | it depends, aes on sufficiently new intel chips will be done in dedicated hardware, especially if you have access to qat accelerators | 16:32 |
sean-k-mooney | but it might be faster and you will really only know once you try it on your hardware | 16:32 |
noonedeadpunk | I totally need to sleep on all this info | 16:36 |
opendevreview | Merged openstack/nova master: FUP: Translate scatter-gather to futurist https://review.opendev.org/c/openstack/nova/+/953338 | 17:19 |
opendevreview | Merged openstack/nova master: Add Project Manager role context in unit tests https://review.opendev.org/c/openstack/nova/+/941056 | 17:19 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!