Thursday, 2026-04-30

opendevreviewMerged openstack/nova master: TPM: handle key manager Forbidden errors consistently  https://review.opendev.org/c/openstack/nova/+/983504 01:39
opendevreviewMerged openstack/nova master: TPM: clean up orphaned libvirt secret on guest creation failure  https://review.opendev.org/c/openstack/nova/+/983505 04:35
opendevreviewchenker proposed openstack/nova master: add risc-v  https://review.opendev.org/c/openstack/nova/+/986752 08:27
opendevreviewchenker proposed openstack/nova master: Add compatibility for nova with RISC-V architecture.  https://review.opendev.org/c/openstack/nova/+/986752 08:34
opendevreviewKamil Sambor proposed openstack/nova master: Change nova-alt-configurations job  https://review.opendev.org/c/openstack/nova/+/983179 08:44
opendevreviewJulien LE JEUNE proposed openstack/nova master: Reproduce bug #2150616: KeyError in _rollback_volume_bdms  https://review.opendev.org/c/openstack/nova/+/986754 09:00
opendevreviewMerged openstack/nova master: Fix invalid jsonschema for keypair list response  https://review.opendev.org/c/openstack/nova/+/986660 12:10
opendevreviewMerged openstack/os-vif master: Remove url tags from README  https://review.opendev.org/c/openstack/os-vif/+/976271 12:56
dansmithgibi: I'm not sure if my point is getting across in the spec reviews.. I'm not really blocking on anything I just feel like we're taking an already very confusing syntax and blowing it up, which I think is unfortunate15:47
gibidansmith: yeah I agree that the config syntax is complicated. I like your idea of listing addresses to define the group. I just went one step further and wanted to do that in a separate group_spec option, not directly in the already complicated device_spec option16:01
gibiI think we are not *ready* with the spec so your comment alone is not blocking anything 16:01
gibiI think we are exploring the possible configurations to express a group right now16:02
gibiso no hard feelings at all16:02
gibi(if I come across as angry in my comment then sorry, it wasn't the intention; today I got angry about a totally unrelated thing but I guess some of it then leaked into my spec comment)16:03
dansmithoh no, no leakage that I noticed. I'm just trying to make sure it's clear that "I don't like this but I don't really know what the best plan is"16:05
dansmithand it's really just about the actual [pci] syntax part16:05
gibiOK cool then16:06
gibiI do still want to spend a couple more rounds of back and forth before settling on the syntax. I think it is worth exploring the possibilities16:06
gibihonestly the [pci]device_spec allows too many things already16:07
dansmithfor sure16:07
dansmithcan you answer though - is it just managed=True that needs to be per-device instead of per-group?16:07
gibiI think live_migratable is also per device, especially if we eventually extend that flag to support live migration via unplug and replug (how we do it for neutron port requested PCI devs today). physical_network can be per device in the future when we start supporting neutron port requested PCI devices with PCI in Placement. 16:10
gibione_time_use is an edge case that does not need to be per device today16:10
gibibut logically if I group an NVMe and a NIC and I want to define that the NVMe is the one in the group that needs the cleaning, then16:11
gibiit would be better to model one_time_use on the device instead of on the group16:11
dansmithlive_migratable is per group though right? you can't have a group with one migratable device and one not16:11
dansmithI guess I'm not sure why you would group a nic and a nvme together, but fair enough - for the cleaning it makes more sense to account for them individually16:12
gibiimagine that we start grouping a "normally" live_migratable GPU with a NIC that is stateless and should not prevent the live migration, but instead we allow saying live_migratable=unplug for the NIC16:12
dansmithI know we _should_ be able to group any two devices for any reason, but it's also okay for this to be minimally useful and leave complex scenarios to cyborg :)16:13
dansmiththat unplug case could be cyborg-only too I think, but yeah fair as well16:13
gibithe whole grouping could be / should be cyborg only :D16:15
dansmithright, which is why I guess I'm (perhaps too) willing to have some limitations :D16:15
dansmithbut okay so it sounds like we have to keep the device_spec lines per device16:15
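
A rough sketch of the two layers being discussed. The group_spec option and everything about its syntax below is hypothetical (the spec is still being drafted), and the flag names simply follow the chat above; the idea is that per-device flags stay on the existing device_spec lines, while a separate option only lists which addresses form a group:

    [pci]
    # existing style: one device_spec line per device carries the per-device flags
    device_spec = {"address": "0000:81:00.0", "live_migratable": "true"}
    device_spec = {"address": "0000:82:00.0", "one_time_use": "true"}
    # hypothetical separate option defining only group membership; whether the
    # group identifier is called "name" or "key" is debated just below
    group_spec = {"name": "gpu_nic_pair", "addresses": ["0000:81:00.0", "0000:82:00.0"]}
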
dansmiththe "name" is required for each group so we know which provider in placement goes with each set of devices right?16:16
gibiyeah the name is an ugly leak of how we implement resource tracking16:19
gibiwe need to name the RP in placement with a stable name16:20
dansmithI _hate_ that16:20
gibiI thought about generating it from the hash of the PCI addresses in the group but that is still not stable16:20
gibiI hate it too16:20
dansmithyeah, I also thought .. even just the one/first pci address would be enough, no?16:21
dansmithbut it makes it a bit messy on the backend in case the order changes, etc16:21
gibiso far I thought we allow reconfiguring the content of a *not allocated* group16:21
gibithat breaks naming by address16:21
gibiI thought about just indexing group-type-1 -2 -316:22
gibibut then we will be config line order dependent16:22
dansmithsorted(group.addresses)[0] would be somewhat stable against reorders16:22
dansmithbut if they add a new earlier one it won't of course16:22
dansmitheither way, I think "name" is the wrong label for this thing.. it's an identifier, a key, a ... something16:23
dansmith"name" is really most appropriate for the alias, IMHO, the rest are basically FK's16:23
gibiOK that is fair16:27
gibikey works for me16:27
gibiwe even save a char in each def :D16:27
dansmithack, I'm not too concerned over the char of course, but the meaning16:29
gibiI know, I just entertain myself with the saving of chars :D16:30
dansmithand I guess we could have more than one alias reference a type of group? I'm not sure why really, but I assume that's "desirable"16:31
dansmithugh, I'm getting 403 from docs.o.o16:31
gibiyes, two aliases can refer to the same group_type but can ask for different traits, and two groups in the same group_type can have two different sets of traits. I have no good example why this would be useful but the current alias syntax allows it today already16:32
gibiohh I have a good example16:33
gibitwo aliases can differ in numa_policy16:33
dansmithyeah okay16:33
gibiso you want two GPUs, but one alias gives them with NUMA affinity required, the other just preferred16:33
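
For reference, [pci]alias options are already JSON blobs today; a hypothetical sketch of the example above, where group_type is the proposed (not yet existing) key and the other fields use the existing alias syntax:

    [pci]
    # two aliases pointing at the same proposed group_type, differing only in
    # NUMA affinity policy
    alias = {"name": "gpu_numa_strict", "group_type": "gpu_pair", "numa_policy": "required"}
    alias = {"name": "gpu_numa_soft", "group_type": "gpu_pair", "numa_policy": "preferred"}
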
dansmithI wish that we could have alias reference device_spec or group_spec things instead of the devices (by address or vendor/model), and I was just thinking of other ways to collapse that a bit16:35
dansmithbut probably not possible16:35
dansmithalias can find things by address, vendor/model, and now group_type.. just feels far too undefined16:35
gibiyeah I know16:35
gibiat least alias with a group_type is better in my eyes than using vendor/product16:36
dansmithyes definitely, and kinda where I was going with "group with one device".. I don't want to make things more complicated, but having alias reference only a group, which references device_specs would be a nicer hierarchy16:37
dansmith*group_type I should say16:37
gibihm16:37
gibiso we 1) add group_type to alias 2) allow one device per group in the group_spec 3) deprecate vendor/product in alias 4)... 5) profit16:38
dansmith4) annoy people with more typing for no reason16:38
dansmithit would be a better diagram but not a better UX I imagine16:38
dansmithif we could imply the auto-generated single-device group type maybe, but idk16:39
gibinah, people will use LLMs to generate the config :D16:39
dansmithlol, of course16:39
dansmithsilly me16:39
dansmithmaybe then 4) sell the users an LLM as the only way to actually generate this config so that 5) profit16:40
gibiI need to start a frontier model company, easy-peasy16:41
dansmiththat's the obvious solution to any problem in 202616:42
gibitomorrow is public holiday here so I will have time to *plan*  :D16:43
opendevreviewDan Smith proposed openstack/nova-specs master: Add unpin-az spec  https://review.opendev.org/c/openstack/nova-specs/+/986539 17:03
gouthamrhey melwitt: question regarding the ML post and this bug: https://bugs.launchpad.net/nova/+bug/2149965 18:39
gouthamri was curious if the migration flow in any way _needs_ cephadm? i.e., does anything run "ceph orch" commands? i wasn't able to find that info and was wondering18:40
gouthamrmelwitt: how did you arrive at ceph orch being the problem, i guess is my question18:41
melwittgouthamr: sorry, I didn't include enough info in the bug, I did it in haste. I don't think so -- my assumption is that things failed during ceph installation, well before any tests were run. this is an example run https://zuul.opendev.org/t/openstack/build/2199510cc500444b9638763287ebc62f 18:42
gouthamri see, it's happening on the migration job, but, that's unrelated18:43
gouthamrmelwitt: thanks.. that was my initial thought; maybe that part of the handling is just .. flaky18:43
gouthamrmelwitt: the job history is pretty decent: https://zuul.opendev.org/t/openstack/builds?job_name=nova-live-migration-ceph 18:44
melwittgouthamr: yeah I wondered if maybe something is just not waiting for 'ceph mgr module enable orchestrator' to finish or something like that?18:45
melwittyeah. it is definitely a "sometimes" bug but I have personally seen it often enough to ask someone if they know something to stop it haha18:45
gouthamrmelwitt: ack, i've added some retry logic here: https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/986818 18:46
gouthamrso we don't bail out early.. 18:46
gouthamrdisabling cephadm isn't necessary.. just something that'll help save resources if we have nothing more to setup/change on the ceph cluster18:47
melwittah gotcha, makes sense18:47
gouthamrsean-k-mooney headed in the same direction.. so i guess we can see if this helps with stability. thanks for flagging this18:49
melwittthank you for looking at it :)18:50
sean-k-mooneyso i looked at the plugin briefly but didn't know which job failed18:50
sean-k-mooneyso i didn't know which method was actually the failure18:50
sean-k-mooneygouthamr: i saw your patch to add the retry on disable but i suspect that wasn't the issue18:51
sean-k-mooneyperhaps it was, but i was assuming, like mel, that it was happening during the install18:51
melwittI added another comment on https://bugs.launchpad.net/nova/+bug/2149965 to explain a bit more, for future reference18:52
gouthamrsean-k-mooney: the restart you stated is an internal ceph thingy - we don't explicitly ask to restart.. the manager module reloads things each time and during that reload  the API is down.. so "orch" just throws a random error: "set a backend 18:52
sean-k-mooney+ /opt/stack/devstack-plugin-ceph/devstack/lib/cephadm:start_ceph:210 18:53
gouthamr" - that's just catching a giant exception in the client18:53
sean-k-mooneyright, so we need to be more robust to the internal errors18:53
gouthamryes18:54
sean-k-mooneyin any case it's failing here https://github.com/openstack/devstack-plugin-ceph/blob/master/devstack/lib/cephadm#L207 18:54
melwittthanks to both of you for looking despite the major lack of info on the bug 😬 18:55
sean-k-mooneymelwitt: i definitely have not seen that error several times in the last month and not looked at it because of time...18:55
sean-k-mooneydefinitely not a thing that would happen18:56
sean-k-mooneyanyway https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/986818 won't fix it because that is not what is failing18:57
sean-k-mooneybut the approach might work if we use it elsewhere18:57
sean-k-mooneyi don't know if it's just safe to invoke the same command again when we are in this state18:57
sean-k-mooneyi.e. will it just fail again18:57
melwitthm ok18:57
sean-k-mooneyor is that the right thing to do, possibly with a sleep or a query to see if the thing that restarted is up18:58
sean-k-mooneythis honestly feels like a cephadm bug that should eventually get fixed there18:58
sean-k-mooneygouthamr: i assume you wrapped https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/986818/1/devstack/lib/cephadm because that is the only place we call set backend18:59
sean-k-mooneybut you will notice in the traceback it's failing when we are setting the backend to cephadm, and there we are not setting it to anything19:00
sean-k-mooney cli(['orch', 'set', 'backend', 'cephadm'])19:00
melwittso you don't think it's something to do with commands the devstack plugin is running, like it could wait for a thing to be ready first19:01
gouthamrwe'll just be yelling at it to disable orch, what's wrong with that?19:01
gouthamrit should be idempotent19:01
sean-k-mooneyit's https://zuul.opendev.org/t/openstack/build/2199510cc500444b9638763287ebc62f/log/job-output.txt#12495 19:01
gouthamrhttps://docs.ceph.com/en/reef/mgr/orchestrator/#disable-the-orchestrator 19:01
sean-k-mooneyso if you look at the logs we are failing in start_ceph19:02
sean-k-mooneyso the failing call is all internal to that initial call to cephadm19:02
sean-k-mooney2026-04-29 21:16:02.191617 | controller | + /opt/stack/devstack-plugin-ceph/devstack/lib/cephadm:start_ceph:1 :   exit_trap19:03
gouthamrAH! see that's why i needed the full log :D 19:03
sean-k-mooneywe trap on the exit code from that call19:03
sean-k-mooneyso this is very much a race in cephadm rather than our devstack plugin19:04
gouthamrlemme see if there's something new they invented to prevent a "bootstrap" failure 19:06
sean-k-mooneyack, we should not need to disable the orch module by the way, we definitely want to use it because that is how real clusters work19:06
gouthamrno we had to do this just for CI19:06
gouthamrwhen we're spinning up single-node ceph and openstack compute on the flavor we use, we ran into OOMs19:07
gouthamrlocally, i give it a beefy-enough node, and for jobs that do more, we use multi-node and spin up instances on compute nodes19:08
sean-k-mooneyright, in this case i don't think it's an OOM19:09
gouthamryeah19:10
sean-k-mooneyhttps://paste.opendev.org/show/bf4uNdKhVfHSw5XRwFXS/ 19:12
sean-k-mooneywell a little more https://paste.opendev.org/show/bClxtqx7PviIQSfdnEkq/ 19:12
gouthamrokay, we can just retry the bootstrap until it succeeds... there are some opts that can help it be idempotent; check if mon is up, --allow-overwrite so we can be okay with cruft19:12
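
A rough bash sketch of that "retry the bootstrap until it succeeds" idea (not the actual devstack-plugin-ceph change; the attempt count, sleep, and MON_IP placeholder are made up, and --allow-overwrite is the option mentioned above for tolerating leftovers from a failed first attempt):

    # hypothetical retry loop; a real change would live in devstack-plugin-ceph
    for attempt in 1 2 3; do
        if sudo cephadm bootstrap --mon-ip "$MON_IP" --allow-overwrite; then
            break
        fi
        if [ "$attempt" -eq 3 ]; then
            echo "cephadm bootstrap still failing after ${attempt} attempts" >&2
            exit 1
        fi
        echo "cephadm bootstrap failed (attempt ${attempt}); retrying in 30s"
        sleep 30
    done
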
gouthamryeah, that message isn't helpful there too :/ we can't enable orch, we're trying to bootstrap $machine ¯\_(ツ)_/¯ 19:13
gouthamrit's as good as "something unexpected happened"19:13
gouthamri can report a ceph tracker if they don't already know about this issue19:14
sean-k-mooneybut is it as good as "no valid host found"19:14
sean-k-mooneynova's most useful of useful errors19:14
gouthamr:D19:15
opendevreviewMerged openstack/nova master: Use python-native keyword-only arguments  https://review.opendev.org/c/openstack/nova/+/969050 21:55
