| opendevreview | Merged openstack/nova master: TPM: handle key manager Forbidden errors consistently https://review.opendev.org/c/openstack/nova/+/983504 | 01:39 |
|---|---|---|
| opendevreview | Merged openstack/nova master: TPM: clean up orphaned libvirt secret on guest creation failure https://review.opendev.org/c/openstack/nova/+/983505 | 04:35 |
| opendevreview | chenker proposed openstack/nova master: add risc-v https://review.opendev.org/c/openstack/nova/+/986752 | 08:27 |
| opendevreview | chenker proposed openstack/nova master: Add compatibility for nova with RISC-V architecture. https://review.opendev.org/c/openstack/nova/+/986752 | 08:34 |
| opendevreview | Kamil Sambor proposed openstack/nova master: Change nova-alt-configurations job https://review.opendev.org/c/openstack/nova/+/983179 | 08:44 |
| opendevreview | Julien LE JEUNE proposed openstack/nova master: Reproduce bug #2150616: KeyError in _rollback_volume_bdms https://review.opendev.org/c/openstack/nova/+/986754 | 09:00 |
| opendevreview | Merged openstack/nova master: Fix invalid jsonschema for keypair list response https://review.opendev.org/c/openstack/nova/+/986660 | 12:10 |
| opendevreview | Merged openstack/os-vif master: Remove url tags from README https://review.opendev.org/c/openstack/os-vif/+/976271 | 12:56 |
| dansmith | gibi: I'm not sure if my point is getting across in the spec reviews.. I'm not really blocking on anything I just feel like we're taking an already very confusing syntax and blowing it up, which I think is unfortunate | 15:47 |
| gibi | dansmith: yeah I agree that the config syntax is complicated. I like your idea of listing addresses to define the group. I just went one step further and wanted to do that in a separate group_spec option, not directly in the already complicated device_spec option | 16:01 |
| gibi | I think we are not *ready* with the spec so your comment alone is not blocking anything | 16:01 |
| gibi | I think we are exploring the possible configurations to express a group right now | 16:02 |
| gibi | so no hard feelings at all | 16:02 |
| gibi | (if I came across as angry in my comment then sorry, it wasn't the intention; today I got angry about a totally unrelated thing but I guess some of that leaked into my spec comment) | 16:03 |
| dansmith | oh no, no leakage that I noticed. I'm just trying to make sure it's clear that "I don't like this but I don't really know what the best plan is" | 16:05 |
| dansmith | and it's really just about the actual [pci] syntax part | 16:05 |
| gibi | OK cool then | 16:06 |
| gibi | I do want to spend a couple more rounds of back and forth before settling on the syntax. I think it is worth exploring the possibilities | 16:06 |
| gibi | honestly the [pci]device_spec allows too many things already | 16:07 |
| dansmith | for sure | 16:07 |
| dansmith | can you answer though - is it just managed=True that needs to be per-device instead of per-group? | 16:07 |
| gibi | I think live_migratable is also per device, especially if we eventually extend that flag to support live migration via unplug and replug (how we do it for neutron port requested PCI devs today). physical_network can become per device in the future when we start supporting neutron port requested PCI devices with PCI in Placement. | 16:10 |
| gibi | one_time_use is an edge case that does not need to be per device today | 16:10 |
| gibi | but logically if I group an NVMe and a NIC and I want to define that the NVMe in the group is the one that needs the cleaning then | 16:11 |
| gibi | it would be better to model one_time_use on the device instead of on the group | 16:11 |
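For readers following along, the shape being debated might look like the sketch below. Only the device_spec and alias lines reflect syntax nova actually has today; `group_spec`, the `key` field, and `group_type` in the alias are hypothetical names taken from this conversation, and the flag values are illustrative placeholders.

```ini
[pci]
# Real, existing style: one JSON document per device_spec line; the
# per-device flags (managed, live_migratable, one_time_use) stay here.
device_spec = {"address": "0000:81:00.0", "live_migratable": "yes"}
device_spec = {"address": "0000:81:00.1", "one_time_use": "yes"}

# Hypothetical option sketched in this conversation (does NOT exist):
# a group lists its member addresses, nothing per-device.
group_spec = {"key": "gpu-plus-nic", "addresses": ["0000:81:00.0", "0000:81:00.1"]}

# alias is real, but referencing a group via "group_type" is hypothetical.
alias = {"name": "gpu-bundle", "group_type": "gpu-plus-nic", "numa_policy": "preferred"}
```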
| dansmith | live_migratable is per group though right? you can't have a group with one migratable device and one not | 16:11 |
| dansmith | I guess I'm not sure why you would group a nic and a nvme together, but fair enough - for the cleaning it makes more sense to account for them individually | 16:12 |
| gibi | imagine that we start grouping a "normally" live_migratable GPU with a NIC that is stateless and should not prevent the live migration but instead we will allow saying for the NIC that live_migratable=unplug | 16:12 |
| dansmith | I know we _should_ be able to group any two devices for any reason, but it's also okay for this to be minimally useful and leave complex scenarios to cyborg :) | 16:13 |
| dansmith | that unplug case could be a cyborg-only too I think, but yeah fair as well | 16:13 |
| gibi | the whole grouping could be / should be cyborg only :D | 16:15 |
| dansmith | right, which is why I guess I'm (perhaps too) willing to have some limitations :D | 16:15 |
| dansmith | but okay so sounds like we have to keep the device_spec lines per device | 16:15 |
| dansmith | the "name" is required for each group so we know which provider in placement goes with each set of devices right? | 16:16 |
| gibi | yeah the name is an ugly leak of how we implement resource tracking | 16:19 |
| gibi | we need to name the RP in placement with a stable name | 16:20 |
| dansmith | I _hate_ that | 16:20 |
| gibi | I thought about generating it from the hash of the PCI addresses in the group but that is still not stable | 16:20 |
| gibi | I hate it too | 16:20 |
| dansmith | yeah, I also thought .. even just the one/first pci address would be enough, no? | 16:21 |
| dansmith | but it makes it a bit messy on the backend in case the order changes, etc | 16:21 |
| gibi | so far I thought we allow reconfiguring the content of a *not allocated* group | 16:21 |
| gibi | that breaks naming by address | 16:21 |
| gibi | I thought about just indexing group-type-1 -2 -3 | 16:22 |
| gibi | but then we will be config line order dependent | 16:22 |
| dansmith | sorted(group.addresses)[0] would be somewhat stable against reorders | 16:22 |
| dansmith | but if they add a new earlier one it won't of course | 16:22 |
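The stability problem dansmith points out can be shown in a few lines of Python (`group_key` is a hypothetical helper for illustration, not nova code): keying the placement resource provider off the lowest PCI address survives reordering the config lines, but not adding a member that sorts earlier.

```python
def group_key(addresses):
    """Derive a key for a device group from its PCI addresses.

    Hypothetical sketch of the scheme discussed above: taking the lowest
    address means reordering the config lines does not change the key.
    """
    return sorted(addresses)[0]

# Reordering the same members yields the same key...
assert group_key(["0000:81:00.1", "0000:81:00.0"]) == \
       group_key(["0000:81:00.0", "0000:81:00.1"]) == "0000:81:00.0"

# ...but adding a member that sorts earlier silently renames the group,
# which would orphan the existing resource provider in placement.
assert group_key(["0000:81:00.0", "0000:05:00.0"]) == "0000:05:00.0"
```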
| dansmith | either way, I think "name" is the wrong label for this thing.. it's an identifier, a key, a ... something | 16:23 |
| dansmith | "name" is really most appropriate for the alias, IMHO, the rest are basically FK's | 16:23 |
| gibi | OK that is fair | 16:27 |
| gibi | key works for me | 16:27 |
| gibi | we even save a char in each def :D | 16:27 |
| dansmith | ack, I'm not too concerned over the char of course, but the meaning | 16:29 |
| gibi | I know, I just entertain myself with the saving of chars :D | 16:30 |
| dansmith | and I guess we could have more than one alias reference a type of group? I'm not sure why really, but I assume that's "desirable" | 16:31 |
| dansmith | ugh, I'm getting 403 from docs.o.o | 16:31 |
| gibi | yes, two aliases can refer to the same group_type but ask for different traits, and two groups in the same group_type can have two different sets of traits. I have no good example why this would be useful but the current alias syntax allows it today already | 16:32 |
| gibi | ohh I have a good example | 16:33 |
| gibi | two aliases can differ in numa_policy | 16:33 |
| dansmith | yeah okay | 16:33 |
| gibi | so you want two GPUs but one alias gives it with NUMA affinity required, the other just preferred | 16:33 |
| dansmith | I wish that we could have aliases reference device_spec or group_spec things instead of the devices (by address or vendor/model); I was just thinking of other ways to collapse that a bit | 16:35 |
| dansmith | but probably not possible | 16:35 |
| dansmith | alias can find things by address, vendor/model, and now group_type.. just feels far too undefined | 16:35 |
| gibi | yeah I know | 16:35 |
| gibi | at least alias with a group_type is better in my eyes than using vendor/product | 16:36 |
| dansmith | yes definitely, and kinda where I was going with "group with one device".. I don't want to make things more complicated, but having alias reference only a group, which references device_specs would be a nicer hierarchy | 16:37 |
| dansmith | *group_type I should say | 16:37 |
| gibi | hm | 16:37 |
| gibi | so we 1) add group_type to alias 2) allow one device per group in the group_spec 3) deprecate vendor/product in alias 4) ... 5) profit | 16:38 |
| dansmith | 4) annoy people with more typing for no reason | 16:38 |
| dansmith | it would be a better diagram but not a better UX I imagine | 16:38 |
| dansmith | if we could imply the auto-generated single-device group type maybe, but idk | 16:39 |
| gibi | nah, people will use LLMs to generate the config :D | 16:39 |
| dansmith | lol, of course | 16:39 |
| dansmith | silly me | 16:39 |
| dansmith | maybe then 4) sell the users an LLM as the only way to actually generate this config so that 5) profit | 16:40 |
| gibi | I need to start a frontier model company, easy-peasy | 16:41 |
| dansmith | that's the obvious solution to any problem in 2026 | 16:42 |
| gibi | tomorrow is public holiday here so I will have time to *plan* :D | 16:43 |
| opendevreview | Dan Smith proposed openstack/nova-specs master: Add unpin-az spec https://review.opendev.org/c/openstack/nova-specs/+/986539 | 17:03 |
| gouthamr | hey melwitt: question regarding the ML post and this bug: https://bugs.launchpad.net/nova/+bug/2149965 | 18:39 |
| gouthamr | i was curious whether the migration flow in any way _needs_ cephadm? i.e., does anything run "ceph orch" commands? i wasn't able to find that info and was wondering | 18:40 |
| gouthamr | melwitt: how did you arrive at ceph orch being the problem, is i guess my question | 18:41 |
| melwitt | gouthamr: sorry, I didn't include enough info in the bug, I did it in haste. I don't think so -- my assumption is that things failed during ceph installation, well before any tests were run. this is an example run https://zuul.opendev.org/t/openstack/build/2199510cc500444b9638763287ebc62f | 18:42 |
| gouthamr | i see, it's happening on the migration job, but, that's unrelated | 18:43 |
| gouthamr | melwitt: thanks.. that was my initial thought; maybe that part of the handling is just .. flaky | 18:43 |
| gouthamr | melwitt: the job history is pretty decent: https://zuul.opendev.org/t/openstack/builds?job_name=nova-live-migration-ceph | 18:44 |
| melwitt | gouthamr: yeah I wondered if maybe something is just not waiting for 'ceph mgr module enable orchestrator' to finish or something like that? | 18:45 |
| melwitt | yeah. it is definitely a "sometimes" bug but I have personally seen it often enough to ask someone if they know something to stop it haha | 18:45 |
| gouthamr | melwitt: ack, i've added some retry logic here: https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/986818 | 18:46 |
| gouthamr | so we don't bail out early.. | 18:46 |
| gouthamr | disabling cephadm isn't necessary.. just something that'll help save resources if we have nothing more to setup/change on the ceph cluster | 18:47 |
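The actual change under review is shell code in devstack-plugin-ceph, but the retry idea gouthamr describes can be sketched in Python (hypothetical helper; the function name and defaults are illustrative): repeat an idempotent CLI call, sleeping between attempts so a transient mgr-module reload doesn't bail the job out early.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=5, delay=2.0):
    """Retry an idempotent command until it exits 0.

    Hypothetical sketch of the retry discussed above: 'ceph orch set
    backend' is safe to repeat, so a transient failure while the mgr
    module reloads can be ridden out instead of failing the run.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            time.sleep(delay)  # give the mgr module time to finish reloading
    raise RuntimeError(
        f"{cmd!r} still failing after {attempts} attempts: "
        f"{result.stderr.strip()}"
    )
```

Usage would be something like `run_with_retries(["ceph", "orch", "set", "backend", ""])`; the key assumption is that the wrapped command really is idempotent, which the ceph docs confirm for enabling/disabling the orchestrator backend.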
| melwitt | ah gotcha, makes sense | 18:47 |
| gouthamr | sean-k-mooney headed in the same direction.. so i guess we can see if this helps with stability. thanks for flagging this | 18:49 |
| melwitt | thanks you for looking at it :) | 18:50 |
| sean-k-mooney | so i looked at the plugin briefly but didn't know which job failed | 18:50 |
| sean-k-mooney | so i didn't know which method was actually the failure | 18:50 |
| sean-k-mooney | gouthamr: i saw your patch to add the retry on disable but i suspect that wasn't the issue | 18:51 |
| sean-k-mooney | perhaps it was but i was assuming like mel that it was happening during the install | 18:51 |
| melwitt | I added another comment on https://bugs.launchpad.net/nova/+bug/2149965 to explain a bit more, for future reference | 18:52 |
| gouthamr | sean-k-mooney: the restart you stated is an internal ceph thingy - we don't explicitly ask to restart.. the manager module reloads things each time and during that reload the API is down.. so "orch" just throws a random error: "set a backend | 18:52 |
| sean-k-mooney | + /opt/stack/devstack-plugin-ceph/devstack/lib/cephadm:start_ceph:210 | 18:53 |
| gouthamr | " - that's just catching a giant exception in the client | 18:53 |
| sean-k-mooney | right so we need to be more robust to the internal errors | 18:53 |
| gouthamr | yes | 18:54 |
| sean-k-mooney | in any case it's failing here https://github.com/openstack/devstack-plugin-ceph/blob/master/devstack/lib/cephadm#L207 | 18:54 |
| melwitt | thanks to both of you for looking despite the major lack of info on the bug 😬 | 18:55 |
| sean-k-mooney | melwitt: i definitely have not seen that error several times in the last month and not looked at it because of time... | 18:55 |
| sean-k-mooney | definitely not a thing that would happen | 18:56 |
| sean-k-mooney | anyway https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/986818 won't fix it because that is not what is failing | 18:57 |
| sean-k-mooney | but the approach might work if we use it elsewhere | 18:57 |
| sean-k-mooney | i don't know if it's just safe to invoke the same command again when we are in this state | 18:57 |
| sean-k-mooney | i.e. will it just fail again | 18:57 |
| melwitt | hm ok | 18:57 |
| sean-k-mooney | or is that the right thing to do, possibly with a sleep or a query to see if the thing that restarted is up | 18:58 |
| sean-k-mooney | this honestly feels like a cephadm bug that should eventually get fixed there | 18:58 |
| sean-k-mooney | gouthamr: i assume you wrapped https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/986818/1/devstack/lib/cephadm because that is the only place we call set backend | 18:59 |
| sean-k-mooney | but you will notice in the traceback it's failing when we are setting the backend to cephadm, and there we are not setting it to nothing | 19:00 |
| sean-k-mooney | cli(['orch', 'set', 'backend', 'cephadm']) | 19:00 |
| melwitt | so you don't think it's something to do with commands the devstack plugin is running, like it could wait for a thing to be ready first | 19:01 |
| gouthamr | we'll just be yelling at it to disable orch, what's wrong with that? | 19:01 |
| gouthamr | it should be idempotent | 19:01 |
| sean-k-mooney | its https://zuul.opendev.org/t/openstack/build/2199510cc500444b9638763287ebc62f/log/job-output.txt#12495 | 19:01 |
| gouthamr | https://docs.ceph.com/en/reef/mgr/orchestrator/#disable-the-orchestrator | 19:01 |
| sean-k-mooney | so if you look at the logs we are failing in start_ceph | 19:02 |
| sean-k-mooney | so the failing call is all internal to that initial call to cephadm | 19:02 |
| sean-k-mooney | 2026-04-29 21:16:02.191617 | controller | + /opt/stack/devstack-plugin-ceph/devstack/lib/cephadm:start_ceph:1 : exit_trap | 19:03 |
| gouthamr | AH! see that's why i needed the full log :D | 19:03 |
| sean-k-mooney | we trap on the exit code from that call | 19:03 |
| sean-k-mooney | so this is very much a race in cephadm rather than our devstack plugin | 19:04 |
| gouthamr | lemme see if there's something new they invented to prevent a "bootstrap" failure | 19:06 |
| sean-k-mooney | ack, we should not need to disable the orch module by the way, we definitely want to use it because that is how real clusters work | 19:06 |
| gouthamr | no we had to do this just for CI | 19:06 |
| gouthamr | when we're spinning up single-node ceph and openstack compute on the flavor we use, we ran into OOMs | 19:07 |
| gouthamr | locally, i give it a beefy-enough node, and for jobs that do more, we use multi-node and spin up instances on compute nodes | 19:08 |
| sean-k-mooney | right in this case i dont think its an OOM | 19:09 |
| gouthamr | yeah | 19:10 |
| sean-k-mooney | https://paste.opendev.org/show/bf4uNdKhVfHSw5XRwFXS/ | 19:12 |
| sean-k-mooney | well a little more https://paste.opendev.org/show/bClxtqx7PviIQSfdnEkq/ | 19:12 |
| gouthamr | okay, we can just retry the bootstrap until it succeeds... there are some opts that can help it be idempotent; check if mon is up, --allow-overwrite so we can be okay with cruft | 19:12 |
| gouthamr | yeah, that message isn't helpful there too :/ we can't enable orch, we're trying to bootstrap $machine ¯\_(ツ)_/¯ | 19:13 |
| gouthamr | it's as good as "something unexpected happened" | 19:13 |
| gouthamr | i can report a ceph tracker if they don't already know about this issue | 19:14 |
| sean-k-mooney | but is it as good as "no valid host found" | 19:14 |
| sean-k-mooney | nova's most useful of useful errors | 19:14 |
| gouthamr | :D | 19:15 |
| opendevreview | Merged openstack/nova master: Use python-native keyword-only arguments https://review.opendev.org/c/openstack/nova/+/969050 | 21:55 |
Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!