| opendevreview | Merged openstack/nova master: TPM: handle key manager Forbidden errors consistently https://review.opendev.org/c/openstack/nova/+/983504 | 01:39 |
|---|---|---|
| opendevreview | Merged openstack/nova master: TPM: clean up orphaned libvirt secret on guest creation failure https://review.opendev.org/c/openstack/nova/+/983505 | 04:35 |
| opendevreview | chenker proposed openstack/nova master: add risc-v https://review.opendev.org/c/openstack/nova/+/986752 | 08:27 |
| opendevreview | chenker proposed openstack/nova master: Add compatibility for nova with RISC-V architecture. https://review.opendev.org/c/openstack/nova/+/986752 | 08:34 |
| opendevreview | Kamil Sambor proposed openstack/nova master: Change nova-alt-configurations job https://review.opendev.org/c/openstack/nova/+/983179 | 08:44 |
| opendevreview | Julien LE JEUNE proposed openstack/nova master: Reproduce bug #2150616: KeyError in _rollback_volume_bdms https://review.opendev.org/c/openstack/nova/+/986754 | 09:00 |
| opendevreview | Merged openstack/nova master: Fix invalid jsonschema for keypair list response https://review.opendev.org/c/openstack/nova/+/986660 | 12:10 |
| opendevreview | Merged openstack/os-vif master: Remove url tags from README https://review.opendev.org/c/openstack/os-vif/+/976271 | 12:56 |
| dansmith | gibi: I'm not sure if my point is getting across in the spec reviews.. I'm not really blocking on anything I just feel like we're taking an already very confusing syntax and blowing it up, which I think is unfortunate | 15:47 |
| gibi | dansmith: yeah I agree that the config syntax is complicated. I like your idea of listing addresses to define the group. I just went one step further and wanted to do that in a separate group_spec option, not directly in the already complicated device_spec option | 16:01 |
| gibi | I think we are not *ready* with the spec so your comment alone is not blocking anything | 16:01 |
| gibi | I think we are exploring the possible configurations to express a group right now | 16:02 |
| gibi | so no hard feelings at all | 16:02 |
| gibi | (if I came across as angry in my comment then sorry, it wasn't the intention; today I got angry about a totally unrelated thing but I guess some of that leaked into my spec comment) | 16:03 |
| dansmith | oh no, no leakage that I noticed. I'm just trying to make sure it's clear that "I don't like this but I don't really know what the best plan is" | 16:05 |
| dansmith | and it's really just about the actual [pci] syntax part | 16:05 |
| gibi | OK cool then | 16:06 |
| gibi | I do want to spend a couple more rounds of back and forth before settling on the syntax. I think it is worth exploring the possibilities | 16:06 |
| gibi | honestly the [pci]device_spec allows too many things already | 16:07 |
| dansmith | for sure | 16:07 |
| dansmith | can you answer though - is it just managed=True that needs to be per-device instead of per-group? | 16:07 |
| gibi | I think live_migratable is also per device, especially if we eventually extend that flag to support live migration via unplug and replug (how we do it for neutron port requested PCI devs today). physical_network can become per device in the future when we start supporting neutron port requested PCI devices with PCI in Placement. | 16:10 |
| gibi | one_time_use is an edge case that does not need to be per device today | 16:10 |
| gibi | but logically if I group an NVMe and a NIC and I want to define that the NVMe in the group is the one that needs the cleaning then | 16:11 |
| gibi | it would be better to model one_time_use on the device instead of on the group | 16:11 |
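For readers following along, the shape being debated might look like the sketch below. Only the device_spec and alias lines reflect syntax nova actually has today; `group_spec`, the `key` field, and `group_type` in the alias are hypothetical names taken from this conversation, and the flag values are illustrative placeholders.

```ini
[pci]
# Real, existing style: one JSON document per device_spec line; the
# per-device flags (managed, live_migratable, one_time_use) stay here.
device_spec = {"address": "0000:81:00.0", "live_migratable": "yes"}
device_spec = {"address": "0000:81:00.1", "one_time_use": "yes"}

# Hypothetical option sketched in this conversation (does NOT exist):
# a group lists its member addresses, nothing per-device.
group_spec = {"key": "gpu-plus-nic", "addresses": ["0000:81:00.0", "0000:81:00.1"]}

# alias is real, but referencing a group via "group_type" is hypothetical.
alias = {"name": "gpu-bundle", "group_type": "gpu-plus-nic", "numa_policy": "preferred"}
```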
| dansmith | live_migratable is per group though right? you can't have a group with one migratable device and one not | 16:11 |
| dansmith | I guess I'm not sure why you would group a nic and a nvme together, but fair enough - for the cleaning it makes more sense to account for them individually | 16:12 |
| gibi | imagine that we start grouping a "normally" live_migratable GPU with a NIC that is stateless and should not prevent the live migration but instead we will allow saying for the NIC that live_migratable=unplug | 16:12 |
| dansmith | I know we _should_ be able to group any two devices for any reason, but it's also okay for this to be minimally useful and leave complex scenarios to cyborg :) | 16:13 |
| dansmith | that unplug case could be a cyborg-only too I think, but yeah fair as well | 16:13 |
| gibi | the whole grouping could be / should be cyborg only :D | 16:15 |
| dansmith | right, which is why I guess I'm (perhaps too) willing to have some limitations :D | 16:15 |
| dansmith | but okay so sounds like we have to keep the device_spec lines per device | 16:15 |
| dansmith | the "name" is required for each group so we know which provider in placement goes with each set of devices right? | 16:16 |
| gibi | yeah the name is an ugly leak of how we implement resource tracking | 16:19 |
| gibi | we need to name the RP in placement with a stable name | 16:20 |
| dansmith | I _hate_ that | 16:20 |
| gibi | I thought about generating it from the hash of the PCI addresses in the group but that is still not stable | 16:20 |
| gibi | I hate it too | 16:20 |
| dansmith | yeah, I also thought .. even just the one/first pci address would be enough, no? | 16:21 |
| dansmith | but it makes it a bit messy on the backend in case the order changes, etc | 16:21 |
| gibi | so far I thought we allow reconfiguring the content of a *not allocated* group | 16:21 |
| gibi | that breaks naming by address | 16:21 |
| gibi | I thought about just indexing group-type-1 -2 -3 | 16:22 |
| gibi | but then we will be config line order dependent | 16:22 |
| dansmith | sorted(group.addresses)[0] would be somewhat stable against reorders | 16:22 |
| dansmith | but if they add a new earlier one it won't of course | 16:22 |
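The stability problem dansmith points out can be shown in a few lines of Python (`group_key` is a hypothetical helper for illustration, not nova code): keying the placement resource provider off the lowest PCI address survives reordering the config lines, but not adding a member that sorts earlier.

```python
def group_key(addresses):
    """Derive a key for a device group from its PCI addresses.

    Hypothetical sketch of the scheme discussed above: taking the lowest
    address means reordering the config lines does not change the key.
    """
    return sorted(addresses)[0]

# Reordering the same members yields the same key...
assert group_key(["0000:81:00.1", "0000:81:00.0"]) == \
       group_key(["0000:81:00.0", "0000:81:00.1"]) == "0000:81:00.0"

# ...but adding a member that sorts earlier silently renames the group,
# which would orphan the existing resource provider in placement.
assert group_key(["0000:81:00.0", "0000:05:00.0"]) == "0000:05:00.0"
```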
| dansmith | either way, I think "name" is the wrong label for this thing.. it's an identifier, a key, a ... something | 16:23 |
| dansmith | "name" is really most appropriate for the alias, IMHO, the rest are basically FK's | 16:23 |
| gibi | OK that is fair | 16:27 |
| gibi | key works for me | 16:27 |
| gibi | we even save a char in each def :D | 16:27 |
| dansmith | ack, I'm not too concerned over the char of course, but the meaning | 16:29 |
| gibi | I know, I just entertain myself with the saving of chars :D | 16:30 |
| dansmith | and I guess we could have more than one alias reference a type of group? I'm not sure why really, but I assume that's "desirable" | 16:31 |
| dansmith | ugh, I'm getting 403 from docs.o.o | 16:31 |
| gibi | yes, two aliases can refer to the same group_type but ask for different traits, and two groups in the same group_type can have two different sets of traits. I have no good example why this would be useful but the current alias syntax allows it today already | 16:32 |
| gibi | ohh I have a good example | 16:33 |
| gibi | two aliases can differ in numa_policy | 16:33 |
| dansmith | yeah okay | 16:33 |
| gibi | so you want two GPUs but one alias gives it with NUMA affinity required, the other just preferred | 16:33 |
| dansmith | I wish that we could have aliases reference device_spec or group_spec things instead of the devices (by address or vendor/model); I was just thinking of other ways to collapse that a bit | 16:35 |
| dansmith | but probably not possible | 16:35 |
| dansmith | alias can find things by address, vendor/model, and now group_type.. just feels far too undefined | 16:35 |
| gibi | yeah I know | 16:35 |
| gibi | at least alias with a group_type is better in my eyes than using vendor/product | 16:36 |
| dansmith | yes definitely, and kinda where I was going with "group with one device".. I don't want to make things more complicated, but having alias reference only a group, which references device_specs would be a nicer hierarchy | 16:37 |
| dansmith | *group_type I should say | 16:37 |
| gibi | hm | 16:37 |
| gibi | so we 1) add group_type to alias 2) allow one device per group in the group_spec 3) deprecate vendor/product in alias 4) ... 5) profit | 16:38 |
| dansmith | 4) annoy people with more typing for no reason | 16:38 |
| dansmith | it would be a better diagram but not a better UX I imagine | 16:38 |
| dansmith | if we could imply the auto-generated single-device group type maybe, but idk | 16:39 |
| gibi | nah, people will use LLMs to generate the config :D | 16:39 |
| dansmith | lol, of course | 16:39 |
| dansmith | silly me | 16:39 |
| dansmith | maybe then 4) sell the users an LLM as the only way to actually generate this config so that 5) profit | 16:40 |
| gibi | I need to start a frontier model company, easy-peasy | 16:41 |
| dansmith | that's the obvious solution to any problem in 2026 | 16:42 |
| gibi | tomorrow is public holiday here so I will have time to *plan* :D | 16:43 |
| opendevreview | Dan Smith proposed openstack/nova-specs master: Add unpin-az spec https://review.opendev.org/c/openstack/nova-specs/+/986539 | 17:03 |
| gouthamr | hey melwitt: question regarding the ML post and this bug: https://bugs.launchpad.net/nova/+bug/2149965 | 18:39 |
| gouthamr | i was curious whether the migration flow in any way _needs_ cephadm? i.e., does anything run "ceph orch" commands? i wasn't able to find that info and was wondering | 18:40 |
| gouthamr | melwitt: how did you arrive at ceph orch being the problem, is i guess my question | 18:41 |
| melwitt | gouthamr: sorry, I didn't include enough info in the bug, I did it in haste. I don't think so -- my assumption is that things failed during ceph installation, well before any tests were run. this is an example run https://zuul.opendev.org/t/openstack/build/2199510cc500444b9638763287ebc62f | 18:42 |
| gouthamr | i see, it's happening on the migration job, but, that's unrelated | 18:43 |
| gouthamr | melwitt: thanks.. that was my initial thought; maybe that part of the handling is just .. flaky | 18:43 |
| gouthamr | melwitt: the job history is pretty decent: https://zuul.opendev.org/t/openstack/builds?job_name=nova-live-migration-ceph | 18:44 |
| melwitt | gouthamr: yeah I wondered if maybe something is just not waiting for 'ceph mgr module enable orchestrator' to finish or something like that? | 18:45 |
| melwitt | yeah. it is definitely a "sometimes" bug but I have personally seen it often enough to ask someone if they know something to stop it haha | 18:45 |
| gouthamr | melwitt: ack, i've added some retry logic here: https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/986818 | 18:46 |
| gouthamr | so we don't bail out early.. | 18:46 |
| gouthamr | disabling cephadm isn't necessary.. just something that'll help save resources if we have nothing more to setup/change on the ceph cluster | 18:47 |
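The actual change under review is shell code in devstack-plugin-ceph, but the retry idea gouthamr describes can be sketched in Python (hypothetical helper; the function name and defaults are illustrative): repeat an idempotent CLI call, sleeping between attempts so a transient mgr-module reload doesn't bail the job out early.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=5, delay=2.0):
    """Retry an idempotent command until it exits 0.

    Hypothetical sketch of the retry discussed above: 'ceph orch set
    backend' is safe to repeat, so a transient failure while the mgr
    module reloads can be ridden out instead of failing the run.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            time.sleep(delay)  # give the mgr module time to finish reloading
    raise RuntimeError(
        f"{cmd!r} still failing after {attempts} attempts: "
        f"{result.stderr.strip()}"
    )
```

Usage would be something like `run_with_retries(["ceph", "orch", "set", "backend", ""])`; the key assumption is that the wrapped command really is idempotent, which the ceph docs confirm for enabling/disabling the orchestrator backend.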
| melwitt | ah gotcha, makes sense | 18:47 |
| gouthamr | sean-k-mooney headed in the same direction.. so i guess we can see if this helps with stability. thanks for flagging this | 18:49 |
| melwitt | thanks you for looking at it :) | 18:50 |
| sean-k-mooney | so i looked at the plugin briefly but didn't know which job failed | 18:50 |
| sean-k-mooney | so i didn't know which method was actually the failure | 18:50 |
| sean-k-mooney | gouthamr: i saw your patch to add the retry on disable but i suspect that wasn't the issue | 18:51 |
| sean-k-mooney | perhaps it was but i was assuming like mel that it was happening during the install | 18:51 |
| melwitt | I added another comment on https://bugs.launchpad.net/nova/+bug/2149965 to explain a bit more, for future reference | 18:52 |
| gouthamr | sean-k-mooney: the restart you stated is an internal ceph thingy - we don't explicitly ask to restart.. the manager module reloads things each time and during that reload the API is down.. so "orch" just throws a random error: "set a backend | 18:52 |
| sean-k-mooney | + /opt/stack/devstack-plugin-ceph/devstack/lib/cephadm:start_ceph:210 | 18:53 |
| gouthamr | " - that's just catching a giant exception in the client | 18:53 |
| sean-k-mooney | right so we need to be more robust to the internal errors | 18:53 |
| gouthamr | yes | 18:54 |
| sean-k-mooney | in any case it's failing here https://github.com/openstack/devstack-plugin-ceph/blob/master/devstack/lib/cephadm#L207 | 18:54 |
| melwitt | thanks to both of you for looking despite the major lack of info on the bug 😬 | 18:55 |
| sean-k-mooney | melwitt: i definitely have not seen that error several times in the last month and not looked at it because of time... | 18:55 |
| sean-k-mooney | definitely not a thing that would happen | 18:56 |
| sean-k-mooney | anyway https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/986818 won't fix it because that is not what is failing | 18:57 |
| sean-k-mooney | but the approach might work if we use it elsewhere | 18:57 |
| sean-k-mooney | i don't know if it's just safe to invoke the same command again when we are in this state | 18:57 |
| sean-k-mooney | i.e. will it just fail again | 18:57 |
| melwitt | hm ok | 18:57 |
| sean-k-mooney | or is that the right thing to do, possibly with a sleep or a query to see if the thing that restarted is up | 18:58 |
| sean-k-mooney | this honestly feels like a cephadm bug that should eventually get fixed there | 18:58 |
| sean-k-mooney | gouthamr: i assume you wrapped https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/986818/1/devstack/lib/cephadm because that is the only place we call set backend | 18:59 |
| sean-k-mooney | but you will notice in the traceback it's failing when we are setting the backend to cephadm, and there we are not setting it to nothing | 19:00 |
| sean-k-mooney | cli(['orch', 'set', 'backend', 'cephadm']) | 19:00 |
| melwitt | so you don't think it's something to do with commands the devstack plugin is running, like it could wait for a thing to be ready first | 19:01 |
| gouthamr | we'll just be yelling at it to disable orch, what's wrong with that? | 19:01 |
| gouthamr | it should be idempotent | 19:01 |
| sean-k-mooney | its https://zuul.opendev.org/t/openstack/build/2199510cc500444b9638763287ebc62f/log/job-output.txt#12495 | 19:01 |
| gouthamr | https://docs.ceph.com/en/reef/mgr/orchestrator/#disable-the-orchestrator | 19:01 |
| sean-k-mooney | so if you look at the logs we are failing in start_ceph | 19:02 |
| sean-k-mooney | so the failing call is all internal to that initial call to cephadm | 19:02 |
| sean-k-mooney | 2026-04-29 21:16:02.191617 | controller | + /opt/stack/devstack-plugin-ceph/devstack/lib/cephadm:start_ceph:1 : exit_trap | 19:03 |
| gouthamr | AH! see that's why i needed the full log :D | 19:03 |
| sean-k-mooney | we trap on the exit code from that call | 19:03 |
| sean-k-mooney | so this is very much a race in cephadm rather than our devstack plugin | 19:04 |
| gouthamr | lemme see if there's something new they invented to prevent a "bootstrap" failure | 19:06 |
| sean-k-mooney | ack, we should not need to disable the orch module by the way, we definitely want to use it because that is how real clusters work | 19:06 |
| gouthamr | no we had to do this just for CI | 19:06 |
| gouthamr | when we're spinning up single-node ceph and openstack compute on the flavor we use, we ran into OOMs | 19:07 |
| gouthamr | locally, i give it a beefy-enough node, and for jobs that do more, we use multi-node and spin up instances on compute nodes | 19:08 |
| sean-k-mooney | right in this case i dont think its an OOM | 19:09 |
| gouthamr | yeah | 19:10 |
| sean-k-mooney | https://paste.opendev.org/show/bf4uNdKhVfHSw5XRwFXS/ | 19:12 |
| sean-k-mooney | well a little more https://paste.opendev.org/show/bClxtqx7PviIQSfdnEkq/ | 19:12 |
| gouthamr | okay, we can just retry the bootstrap until it succeeds... there are some opts that can help it be idempotent; check if mon is up, --allow-overwrite so we can be okay with cruft | 19:12 |
| gouthamr | yeah, that message isn't helpful there too :/ we can't enable orch, we're trying to bootstrap $machine ¯\_(ツ)_/¯ | 19:13 |
| gouthamr | it's as good as "something unexpected happened" | 19:13 |
| gouthamr | i can report a ceph tracker if they don't already know about this issue | 19:14 |
| sean-k-mooney | but is it as good as "no valid host found" | 19:14 |
| sean-k-mooney | nova's most useful of useful errors | 19:14 |
| gouthamr | :D | 19:15 |
| opendevreview | Merged openstack/nova master: Use python-native keyword-only arguments https://review.opendev.org/c/openstack/nova/+/969050 | 21:55 |
Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!