| opendevreview | melanie witt proposed openstack/nova master: TPM: fixup for bump service version documentation https://review.opendev.org/c/openstack/nova/+/978836 | 03:33 |
|---|---|---|
| opendevreview | melanie witt proposed openstack/nova master: TPM: support live migration of `user` secret security https://review.opendev.org/c/openstack/nova/+/983333 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: support live migration of `deployment` secret security https://review.opendev.org/c/openstack/nova/+/925771 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: test live migration between hosts with different security https://review.opendev.org/c/openstack/nova/+/952629 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: add late check for supported TPM secret security https://review.opendev.org/c/openstack/nova/+/956975 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: enable conversion of secret security modes via resize https://review.opendev.org/c/openstack/nova/+/962052 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: handle key manager Forbidden errors consistently https://review.opendev.org/c/openstack/nova/+/983504 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: clean up orphaned libvirt secret on guest creation failure https://review.opendev.org/c/openstack/nova/+/983505 | 03:33 |
| opendevreview | Robbert Nijgh proposed openstack/nova master: Remove proxy wrapping in libvirt device listing, slowing down update_available_resource https://review.opendev.org/c/openstack/nova/+/983529 | 10:52 |
| MaxLamprecht[m] | sean-k-mooney: Can you maybe have another look at this https://review.opendev.org/c/openstack/nova/+/934984. We merged already the reproducer a while ago | 13:27 |
| MaxLamprecht[m] | Would be great to also get https://review.opendev.org/c/openstack/nova/+/958556 reviewed/merged. | 13:29 |
| sean-k-mooney | MaxLamprecht[m]: sure. i'll see if i can do it before my next meeting at the top of the hour, but i also need to prepare some content for that | 13:32 |
| sean-k-mooney | it looks like dan is now happy with it and there has only been 1 revision since i last looked so i suspect this will be good to proceed | 13:32 |
| dansmith | I wasn't ever unhappy, I just thought we should wait until after the release | 13:37 |
| sean-k-mooney | ack, i just finished reviewing it. i think it's ok to proceed with, although i won't have time to look at the other 2 related reproducers/fixes today | 13:42 |
| sean-k-mooney | i'm not against fixing the correct handling of build, reschedule and migrate separately so that they are independently backportable, but i am wondering if we need a better long term fix for this in general. i have not spent any time thinking about what that could look like however. i'm just concerned that unshelve or resize will be next | 13:45 |
| MaxLamprecht[m] | thx, as far as I can tell, these fixes are pretty good for the nova side. We have had all of them running since the beginning of this year and our cinder/nova out of sync problems reduced a lot. At least I can't even report a single new case that is nova related currently. I have 2 more race conditions on cinder side that I will submit there. | 13:53 |
| sean-k-mooney | MaxLamprecht[m]: that's good context. so you do not currently expect to need to address similar failures in other nova codepaths outside of the 2 others you have already submitted bugs/fixes for? | 14:01 |
| sean-k-mooney | at least based on your current workloads/usage patterns | 14:01 |
| dansmith | sean-k-mooney: oh I was talking about this: | 14:11 |
| dansmith | https://review.opendev.org/c/openstack/nova/+/958556 | 14:11 |
| dansmith | this ^ should be a slam dunk early in the cycle I think | 14:12 |
| sean-k-mooney | dansmith: oh ok, and ya i think we can proceed with that. i'll loop back to it when i'm finished with the cyborg team meeting | 14:13 |
| MaxLamprecht[m] | sean-k-mooney: Exactly! | 14:34 |
| melwitt | someone has reported a bug related to the tpool.Proxy wrapping of virNodeDevice of PCI devices during update_available_resource() https://bugs.launchpad.net/nova/+bug/2147424 saying that it causes update_available_resource() to increase time taken to complete over time. does anyone have any idea why this would be? if so, comments on the bug would be appreciated | 15:45 |
| sean-k-mooney | did we fix that already | 15:48 |
| sean-k-mooney | melwitt: that feels like something we did before | 15:48 |
| melwitt | sean-k-mooney: they are saying that fix caused a different problem, that now update_available_resource() run time progressively increases over time. whereas the problem before the fix was that update_available_resource() could block other nova-compute eventlet threads. they are proposing a revert but I don't see how it could be that simple? | 15:50 |
| melwitt | (i.e. wouldn't that return things to the blocked threads problem) | 15:52 |
| gibi | clearly, reverting until we know a bit more about the root cause of the increasing runtime feels premature to me | 15:53 |
| melwitt | I don't really get how tpool.Proxy wrapping could result in a run time increase for the periodic task | 15:53 |
| dansmith | ...if it increases contention | 15:53 |
| gibi | if it increases for each periodic then we are probably leaking something somewhere that accumulates | 15:53 |
| melwitt | yeah. they did test with and without the wrapping and said without the wrapping the run time does not increase | 15:55 |
| gibi | I think we can try to reproduce the degradation. We just need to set the periodic delay to a short enough number and wait | 15:56 |
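For reference, the knob gibi describes could look like this in nova.conf, assuming the standard `update_resources_interval` option; the 10-second value is only for the experiment:

```ini
# nova.conf on the compute node: run the update_available_resource periodic
# every 10 seconds instead of the default cadence, so any per-run growth in
# its runtime becomes visible within minutes rather than days
[DEFAULT]
update_resources_interval = 10
```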
| melwitt | yeah. I wonder would we need real PCI devices or can this be faked somehow? | 15:58 |
| dansmith | I doubt it requires real PCI devices.. I don't think anything is actually touching the devices there, just getting the list result from libvirt | 15:59 |
| gibi | we do filter for certain devices in that call | 16:00 |
| gibi | https://github.com/openstack/nova/blob/b84864d9781ece8387daa967be5208b859e20d97/nova/virt/libvirt/driver.py#L8905-L8913 | 16:01 |
| gibi | if the number of devs are not increasing over time in their env then that call should not increase its runtime over time either | 16:02 |
| gibi | as crawlers are killing our infra I cannot set up a new devstack at the moment to poke this | 16:07 |
| sean-k-mooney | it should in theory be reproducible in the functional tests | 16:08 |
| gibi | but I added a comment to the bug after a couple of retries | 16:08 |
| * | sean-k-mooney is not on a call anymore | 16:08 |
| sean-k-mooney | gibi: so we cache the actual devices from libvirt in that code path | 16:08 |
| sean-k-mooney | so after the first call to libvirt to get the pci devices | 16:08 |
| sean-k-mooney | we should be using it from memory | 16:08 |
| gibi | I don't see the cache, but it is getting late here so... | 16:10 |
| sean-k-mooney | but we might want to double check that we are not creating new cached copies due to self or similar breaking the caching | 16:10 |
| melwitt | hm I wonder could the cache be making the tpool.Proxy threads stay around and not be cleaned up or something like that? | 16:11 |
| sean-k-mooney | perhaps. i'm trying to find where we do it today to confirm exactly how we are doing that | 16:12 |
| dansmith | gibi: s/our/all/ | 16:16 |
| sean-k-mooney | melwitt: i'm wondering if i am misremembering and perhaps wrote a caching patch that we have not merged? | 16:17 |
| gibi | dansmith: indeed | 16:17 |
| sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/922855/6/nova/virt/libvirt/driver.py | 16:19 |
| melwitt | oh ok so not yet a factor | 16:21 |
| sean-k-mooney | ya | 16:21 |
| sean-k-mooney | that is not the first time we have looked at that, for what it's worth | 16:21 |
| sean-k-mooney | there are some other optimisation patches like https://review.opendev.org/c/openstack/nova/+/925382 but i thought we had merged at least one of the caching ones before | 16:22 |
| sean-k-mooney | for now that's not a factor i guess | 16:22 |
| melwitt | I think the thing I keep focusing on is, why with tpool.Proxy only? seems the only way would be contention like Dan said | 16:23 |
| sean-k-mooney | melwitt: do you know if they are seeing an increase in memory usage as well as time | 16:23 |
| melwitt | I don't really know the details about how tpool.Proxy works. I pulled up the eventlet doc about it but it doesn't say much | 16:23 |
| melwitt | sean-k-mooney: they did not mention memory usage but that would be good to know | 16:24 |
| sean-k-mooney | if we were keeping more and more info in memory over time we should see that increase. | 16:24 |
| dansmith | "how it works" is "terribly" | 16:24 |
| dansmith | I mean it does what it says but it's really a gross hack we should avoid whenever we can | 16:24 |
| melwitt | well, that's not reassuring | 16:25 |
| melwitt | (haha) | 16:25 |
| dansmith | I can also see it just straight up hurting performance if we wrap an object and then have some really terrible code that hits a bunch of its properties/functions a thousand times, like in a sort routine or something | 16:26 |
| sean-k-mooney | i did look at the impl a long time ago but i don't remember the details | 16:26 |
| gibi | ... OK it seems I got lucky and have at least an aio devstack built with PCI devices now. So I can start experimenting with the periodic | 16:26 |
| dansmith | each of those would be a few instructions for regular code and with the proxy would be a spawn..wait..collect..return instead | 16:26 |
| sean-k-mooney | ya so we are actually collecting the pci devices 3 times per call today | 16:27 |
| dansmith | unless we need the actual object to be wrapped, we should wrap the call to libvirt and return the objects, or wrap a wrapper that does the call and copies the objects into like NamedTuples or something | 16:27 |
| sean-k-mooney | that is what https://review.opendev.org/c/openstack/nova/+/925382 fixes | 16:27 |
| sean-k-mooney | but we also don't support adding/removing devices at runtime anyway without an agent restart | 16:28 |
| sean-k-mooney | so we can cache this once on the first invocation safely today | 16:28 |
| melwitt | it's that the call of virNodeDevice.listCaps() needs to be wrapped somehow to not block stuff. so my thinking was wrap the object and thus get the call wrapped? | 16:28 |
| sean-k-mooney | thats in host.py right | 16:29 |
| dansmith | melwitt: so we need to call listCaps() later and that makes another call into libvirt? | 16:30 |
| melwitt | yes | 16:30 |
| dansmith | ...and do we do that for all the returned devices? | 16:30 |
| melwitt | all of the devices call their listCaps() method so I had thought so. I might be wrong | 16:31 |
| dansmith | so that means each one of those listCaps() calls will be a thread spawn, wait, return with moving some data around for communication | 16:31 |
| sean-k-mooney | we normally convert them into dictionaries for pci devices https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L1593 | 16:31 |
| melwitt | it's get a list of [dev, dev, dev, ...] and then for each dev: dev.listCaps() | 16:31 |
| dansmith | without, those will just be sequential calls, so blocking but way faster | 16:32 |
| sean-k-mooney | we try not to return the libvirt dev object out of host.py | 16:32 |
| dansmith | sean-k-mooney: right converting them to dicts or NamedTuple or something all at once in a single tpool'd function would be the more efficient thing | 16:32 |
| sean-k-mooney | _get_device_capabilities takes the virNodeDevice list and converts them today for the pci devices in general | 16:33 |
| sean-k-mooney | but i don't know if that pattern is applied everywhere | 16:33 |
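For concreteness, a hedged sketch of the shape dansmith is proposing above: do all the libvirt calls in one tpool'd function and hand plain dicts back, so no libvirt object (and no proxy) escapes host.py. The function names and dict keys here are illustrative, not actual nova code:

```python
from eventlet import tpool


def _fetch_device_dicts(conn):
    # Runs in a single native worker thread: every libvirt call here still
    # blocks that thread, but there is only one thread hop for the whole batch.
    result = []
    for dev in conn.listAllDevices():
        result.append({
            "name": dev.name(),
            "capabilities": dev.listCaps(),  # no per-device proxy round-trip
            "xml": dev.XMLDesc(0),
        })
    return result


def list_all_device_dicts(conn):
    # One spawn/wait round-trip total, instead of one per attribute access
    # on a tpool.Proxy-wrapped virNodeDevice.
    return tpool.execute(_fetch_device_dicts, conn)
```

Callers would then iterate plain dicts, and caching the result (as suggested later in the discussion) becomes a one-line change.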
| melwitt | yeah I mean maybe it's just that, that the calls are too many and too long and then the threads pile up? | 16:33 |
| dansmith | not the threads piling up so much as just way more work doing it with threads | 16:34 |
| sean-k-mooney | well this is 2025.1 | 16:34 |
| melwitt | yeah I was thinking about the progressive increase over time | 16:34 |
| sean-k-mooney | so that's not using our threaded implementation | 16:34 |
| sean-k-mooney | this is just for the eventlet flow right | 16:34 |
| melwitt | something has to be accumulating I thought, just not sure exactly what | 16:35 |
| melwitt | yeah this should just be for eventlet | 16:35 |
| sean-k-mooney | so we can create fake pci devices in the functional tests | 16:35 |
| sean-k-mooney | and we can run the periodics explicitly in those | 16:35 |
| sean-k-mooney | so it might be worth calling it like 100 times and timing each one to see if we can reproduce | 16:35 |
| sean-k-mooney | sort of like in my ironic simulator patch | 16:36 |
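A rough sketch of that timing loop in the style of nova's functional tests; `_run_periodics()` is the existing helper, while the iteration count and the 2x threshold are arbitrary choices for illustration:

```python
import time


def test_update_available_resource_runtime_is_stable(self):
    # meant to live on a nova functional test class that provides the
    # existing _run_periodics() helper and a started compute service
    durations = []
    for _ in range(100):
        start = time.monotonic()
        self._run_periodics()  # runs update_available_resource among others
        durations.append(time.monotonic() - start)
    # if something accumulates per run, the last iterations should be
    # visibly slower than the first; the 2x bound is a judgement call
    head = sum(durations[:10]) / 10
    tail = sum(durations[-10:]) / 10
    self.assertLess(tail, head * 2)
```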
| melwitt | gibi said he just got an env set up and will run some experiments | 16:36 |
| sean-k-mooney | if we can we can then debug that | 16:36 |
| sean-k-mooney | ack, i have an env live with pci devices too, although i currently have the pci grouping service deployed on that. but i can at least check my logs i guess and see what the time looks like | 16:36 |
| sean-k-mooney | melwitt: the other thing of note was the fact they said more instances make it degrade faster | 16:37 |
| sean-k-mooney | the number of instances doesn't affect the libvirt retrieval path | 16:37 |
| sean-k-mooney | so i wonder if it's related to the instance objects instead, or the instance part of this | 16:37 |
| melwitt | I guess I assumed more instances = more PCI devices but I guess you're saying not? | 16:38 |
| sean-k-mooney | well if its the same hosts in both cases | 16:38 |
| sean-k-mooney | or same server sku | 16:38 |
| dansmith | sean-k-mooney: in eventlet with tpool.proxy you're creating worker threads for every property invocation on a wrapped object | 16:38 |
| dansmith | nothing to do with our threaded implementation | 16:38 |
| sean-k-mooney | dansmith: ack yes, you are hooking getattr or similar and dispatching the callables to a thread pool | 16:39 |
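A toy illustration of that mechanism (deliberately simplified, not eventlet's actual implementation): trap attribute access, and run any resulting call in a native thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


class ToyProxy:
    """Dispatch every method call on the wrapped object to a worker thread."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        def call_in_thread(*args, **kwargs):
            # every single call pays a submit + wait round-trip, which is
            # why a loop of cheap listCaps()-style calls gets expensive
            return _pool.submit(attr, *args, **kwargs).result()

        return call_in_thread
```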
| melwitt | or yeah I guess I see. so if more instances = more other unrelated tpool.Proxy things maybe | 16:39 |
| sean-k-mooney | melwitt: i think we pull all of the domains for example in this periodic | 16:39 |
| sean-k-mooney | so i'm wondering if it's related to the number of vms with passthrough devices in some way | 16:39 |
| sean-k-mooney | i.e. if on an empty host it's fixed/stable but on a host with instances it degrades over time | 16:40 |
| sean-k-mooney | it is odd however that it was reported against _get_pci_passthrough_devices if that was the case | 16:41 |
| melwitt | I feel like it must be something less common ... like otherwise this complaint would have come up a long time ago I would think. and more people affected | 16:41 |
| dansmith | if we iterate all the host pci devices once (or more) per instance, then that could be the increase that scales with instances | 16:41 |
| sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8921-L8925 | 16:42 |
| dansmith | melwitt: this change happened just a year ago right? so real production people are probably just starting to run it | 16:42 |
| sean-k-mooney | we are iterating over raw nodedev devices from libvirt here | 16:42 |
| melwitt | dansmith: yeah a little over a year ago | 16:43 |
| sean-k-mooney | list_all_devices here https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8913 | 16:43 |
| melwitt | but it got backported everywhere also | 16:43 |
| dansmith | ack about the backports, that's fair | 16:43 |
| sean-k-mooney | well no, that's wrapped into a proxy in list_all_devices | 16:44 |
| sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L1744 | 16:44 |
| melwitt | I had seen a lot of positive feedback about the change (since before that people were getting nova-compute lockups) but I hadn't seen negative until today | 16:44 |
| dansmith | yeah, I mean.. it improves one thing at the expense of another, so not surprising I think | 16:44 |
| dansmith | and if you have fewer PCI devices or slower periodic config you probably appreciate the change :) | 16:45 |
| melwitt | yeah. I just meant as far as, the conditions under which it happens being less common or more common | 16:45 |
| MaxLamprecht[m] | I remember about a similar problem. See my patch and comment https://review.opendev.org/c/openstack/nova/+/922855/comment/a0361266_9c58663c/. There was a bug/performance regression in libvirt < 10.5. | 16:45 |
| melwitt | oh interesting | 16:46 |
| sean-k-mooney | dansmith: so we are only calling it once | 16:46 |
| sean-k-mooney | not once per instance, as we are calling it here https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L10506-L10560 | 16:46 |
| dansmith | sean-k-mooney: ack, I hadn't looked I was just saying _if_ it scales with instances and _if_ we are calling once per instance... | 16:47 |
| sean-k-mooney | yep i was just double-checking that we were not | 16:47 |
| melwitt | MaxLamprecht[m]: I guess though, the bug reporter did say they tested with and without tpool.Proxy wrap (assuming same libvirt version) and said they didn't get the slowdown as long as tpool.Proxy wrap was not used | 16:48 |
| melwitt | I'm not sure if that means it could not be related. might be worth asking them on the bug what their libvirt version is regardless | 16:49 |
| dansmith | that's expected because of the overhead it introduces.. but MaxLamprecht[m]'s change just makes us never call it again which is also an improvement regardless of the proxy | 16:49 |
| sean-k-mooney | yes we could also just improve this to convert the data to dicts earlier | 16:50 |
| MaxLamprecht[m] | ack, for us upgrading libvirt to 10.5 (which includes https://gitlab.com/libvirt/libvirt/-/commit/98f1cf88fa7e0f992d93f376418fbfb3996a9690) solved it. I would start to check this :) | 16:51 |
| sean-k-mooney | list_all_devices is only used in 2 places | 16:51 |
| sean-k-mooney | it would not be hard to replace it with 2 functions, one that returns the list of mdev names and a second that returns the capabilities as a dict, and just remove the tpool wrapper entirely | 16:52 |
| dansmith | you had me at "remove the tpool wrapper entirely" | 16:53 |
| sean-k-mooney | hum well not quite https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L1225-L1242 but that call can be moved to host.py and it does not return the object so there is no long-lived usage | 16:54 |
| sean-k-mooney | _start_inactive_mediated_devices is only called once in init_host | 16:54 |
| dansmith | just converting to dicts when we do it, in a single worker thread and returning (probably cached if that's appropriate) would solve all the things.. then callers can iterate over the dicts as they iterate over objects today without the penalty | 16:55 |
| sean-k-mooney | +1 | 16:55 |
| melwitt | yeah, let's not remove the wrapper entirely please and return to the lockups | 16:55 |
| sean-k-mooney | well we can do it in phases | 16:56 |
| dansmith | we want to remove the wrappER, just not remove the wrappING (of the initial call) | 16:56 |
| melwitt | O_O | 16:56 |
| melwitt | ok | 16:57 |
| dansmith | I can cook up the start of a change if it's still not clear | 16:57 |
| melwitt | let him cook | 16:58 |
| melwitt | I think I understand the basic idea, I was not thinking of the distinction between wrappER and wrappING so I'm good :) | 17:00 |
| dansmith | ack | 17:00 |
| melwitt | regardless if anyone wants to cook a change, please do. I didn't come up with any of the ideas | 17:00 |
| opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-SNP support https://review.opendev.org/c/openstack/nova-specs/+/983376 | 17:02 |
| -opendevstatus- | NOTICE: Load on the opendev.org Gitea backends is under control again for now, if any Zuul jobs failed with SSL errors or disconnects reaching the service prior to 16:15 UTC they can be safely rechecked | 17:04 |
| opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-SNP support https://review.opendev.org/c/openstack/nova-specs/+/983376 | 17:07 |
| melwitt | just to keep things rolling with problems, I noticed it seems like the test_resize_reschedule_uses_host_lists_3_fails unit test is still nondeterministically failing somehow https://f1460b24809dc27f2808-70ad550016ca78bab2d4ae8445294bcc.ssl.cf2.rackcdn.com/openstack/243619dc5a054b6cbca4265b38ce41ed/testr_results.html | 17:07 |
| melwitt | I don't understand why yet | 17:07 |
| sean-k-mooney | hum i have not seen that, but have not looked lately. is it always the 3 variant or is it also failing on test_resize_reschedule_uses_host_lists_1_fail | 17:10 |
| sean-k-mooney | all the logic is common bar the parameters so the instability must be in https://github.com/openstack/nova/blob/b84864d9781ece8387daa967be5208b859e20d97/nova/tests/functional/test_servers.py#L3453-L3490 | 17:10 |
| sean-k-mooney | oh that keeps going https://github.com/openstack/nova/blob/b84864d9781ece8387daa967be5208b859e20d97/nova/tests/functional/test_servers.py#L3453-L3558 | 17:11 |
| melwitt | I have only seen the 3 variant. and I think gmaan did something to improve it but I don't remember. I need to look at the git blame | 17:11 |
| gibi | initial results: 142 devs returned from libvirt (10.0), no instances on the host. Over 10 periodic runs I don't see any trends. But I confirm that eventlet mode is slower overall. | 17:14 |
| sean-k-mooney | is that on master | 17:14 |
| gibi | no proxy: _get_pci_passthrough_devices took 0.137605510999947 secs for 142 libvirt devices | 17:14 |
| gibi | with proxy: periodic update_available_resource took 3.2481384640000215 secs | 17:14 |
| gibi | sorry | 17:14 |
| gibi | with proxy: _get_pci_passthrough_devices took 0.5433623499998248 secs for 142 libvirt devices | 17:15 |
| sean-k-mooney | i assume you're just setting the env flag on master? | 17:15 |
| gibi | yepp | 17:15 |
| sean-k-mooney | its nice we can do that :) | 17:15 |
| gibi | :) | 17:15 |
| gibi | I will start adding VMs | 17:15 |
| gibi | and taking a longer measurement window | 17:16 |
| gibi | this is the current data I have https://paste.openstack.org/show/bNCNkXgjaGe1EXOjAxNf/ | 17:16 |
| sean-k-mooney | they said 60 reproduces it, but if we could get 4-8 it should still be visible. it just might take longer to notice, but running the periodic like every 10 seconds should offset that | 17:16 |
| gibi | yepp | 17:17 |
| sean-k-mooney | i guess those are new logs | 17:18 |
| sean-k-mooney | let me check my sriov dev env for the overall periodic time quickly | 17:18 |
| gibi | yeah I needed new logs otherwise it is not clear how long things take | 17:18 |
| sean-k-mooney | i think in debug mode we print the total periodic time | 17:19 |
| sean-k-mooney | Lock "compute_resources" "released" by "nova.compute.resource_tracker.ResourceTracker._update_available_resource" :: held 2.483s | 17:21 |
| sean-k-mooney | so we have those logs | 17:21 |
| sean-k-mooney | but i'm not sure if the call is in that lock | 17:22 |
| sean-k-mooney | it ideally is not | 17:22 |
| sean-k-mooney | ok no, we are computing the resource view well before that | 17:24 |
| gibi | I'm wondering whether the "scaling with the number of VMs" is an artifact of the compute node being more and more under load by those VMs. It is definitely the case for my compute by default. I will measure this with VMs in shutoff state to limit that effect | 17:25 |
| sean-k-mooney | even so the full periodic time for me locally is 12 seconds | 17:25 |
| sean-k-mooney | well there will be more contention for the compute_resources lock | 17:26 |
| sean-k-mooney | in the resource tracker from other periodics | 17:26 |
| sean-k-mooney | we do print the wait time: Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._update_available_resource" :: waited 0.001s, so maybe that's a factor | 17:27 |
| sean-k-mooney | although locally the other thing that uses it is Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker.clean_compute_node_cache" | 17:28 |
| sean-k-mooney | i think we acquire it when booting vms as well | 17:29 |
| sean-k-mooney | but in terms of periodics that contention does not seem very high | 17:29 |
| sean-k-mooney | Active: active (running) since Wed 2026-03-11 17:16:44 UTC; 3 weeks 6 days ago | 17:31 |
| sean-k-mooney | so i do not currently have instances on this env | 17:31 |
| sean-k-mooney | but i have had them running with pci devices in the past, so it does seem to return to a sane baseline at least on an empty host | 17:32 |
| sean-k-mooney | gibi: i guess the next best thing to do is leave it run overnight in your env and see if it reproduces when there are instances | 17:34 |
| gibi | 10 VMs that are not using PCI devs on the host and are in shutoff state do not increase the time of the _get_pci_passthrough_devices call. It does increase the time of the whole periodic obviously. I don't yet see any trends in either of the times over time | 17:34 |
| gibi | sean-k-mooney: yeah I will, I'm just trying to see what setup makes sense to use | 17:35 |
| gibi | next I will try with PCI user VMs | 17:35 |
| gibi | I have 6+6 VFs (2 PFs) so each VM can take a VF | 17:37 |
| sean-k-mooney | ack, i'm currently comparing in my cyborg/pci grouping env. i'll leave it alone for now, but if you don't see anything i can maybe run ~50-80 vms | 17:37 |
| sean-k-mooney | i believe i have 2 sriov capable nics on this host and i should barely have enough disk/ram, but i haven't really pushed this host to find its limits | 17:39 |
| gibi | I don't see any timing data difference between 10VMs without PCI usage and 10 VMs with one VF usage each (both cases the VMs are in shutoff) | 17:44 |
| gibi | so what I will do is keep the 10 VMs with one VF each, take a GMR, and then let the periodic run every 10 seconds during the night | 17:45 |
| sean-k-mooney | hum, i'm not sure what running vs not would cause in this context | 17:47 |
| sean-k-mooney | technically that does change the retrieval of mac addresses for VFs | 17:48 |
| sean-k-mooney | i.e. if libvirt is restarted after the vf is passed to the vm it may not see the mac | 17:49 |
| sean-k-mooney | but did the bug say it was related to running/active vms or are you just leaving them stopped to reduce load | 17:49 |
| gibi | I simply want to avoid the cpu usage noise of the 10 VMs running | 17:49 |
| sean-k-mooney | ah ok | 17:49 |
| sean-k-mooney | you could start them and then pause them | 17:49 |
| gibi | it is non 0 even with cirros | 17:49 |
| gibi | sure | 17:50 |
| gibi | that is better | 17:50 |
| sean-k-mooney | but i suspect we won't see much of a delta on the periodic, although if we do it would be an interesting datapoint i guess | 17:50 |
| gibi | a quick look at two GMRs ~5 minutes apart shows no increasing number of Green Threads or Threads reported. | 18:17 |
| gibi | OK. This is the starting state of my measurement setup for the night https://paste.openstack.org/show/b96mgBQpguw6KUseBAgZ/ | 18:25 |
| melwitt | awesome gibi | 18:34 |
| melwitt | dansmith: fyi potential cross-project PTG topic re: barbican secrets https://etherpad.opendev.org/p/nova-2026.2-ptg#L90 cc Uggla sorry if I put the text in the wrong location in the etherpad, feel free to move it | 20:10 |
| dansmith | melwitt: ack, not sure that needs to be a cross-project session really, but cool.. I also didn't realize you were worried about the deployment side of the ownership thing, more the user one | 20:28 |
| dansmith | but good point that neutron would have access to those keys | 20:28 |
| dansmith | unless nova could change the ACL itself | 20:28 |
| melwitt | dansmith: my brain started on user and then the realization about service hit. but yeah nova can change the ACL itself as far as I know | 20:29 |
| melwitt | I thought you had suggested we talk to ppl who know more about barbican so I put it cross-project | 20:30 |
| melwitt | it does suck that there's no way to do ACLs through castellan. not sure if that's by design | 20:30 |
| melwitt | I guess maybe not all key managers support ACLs | 20:31 |
| dansmith | ack | 20:33 |
| sean-k-mooney | melwitt: by the way i believe there was a specific commit, possibly as part of their srbac work, that changed the secret to be owned by the project | 20:56 |
| sean-k-mooney | melwitt: i don't exactly know why i feel that way but i think we talked about it before | 20:57 |
| melwitt | sean-k-mooney: I wondered if that could be a possibility. I'll take another gander | 20:58 |
| sean-k-mooney | i.e. the fact that it is not only accessible by the user is something that i was vaguely aware of, and that it was not always that way | 20:58 |
| melwitt | as far as I could find it goes down to the model and object level, so always been that way | 21:00 |
| melwitt | (this was just me looking in the code) | 21:01 |
| dansmith | yeah I really thought we were told that it was specific to the user and not the project | 21:01 |
| sean-k-mooney | melwitt: maybe https://opendev.org/openstack/barbican/commit/cd9b0ba9f0b4606193858d3cf05acd413a66ba1e | 21:02 |
| sean-k-mooney | they use to enforce _SECRET_CREATOR = "user_id:%(target.secret.creator_id)s" | 21:02 |
| sean-k-mooney | ... storyboard 2010235 has more info | 21:03 |
| sean-k-mooney | huh, it's actually working https://storyboard.openstack.org/#!/story/2010235 | 21:03 |
| sean-k-mooney | i guess the bots don't know about storyboard | 21:03 |
| sean-k-mooney | ya so in https://opendev.org/openstack/barbican/commit/cd9b0ba9f0b4606193858d3cf05acd413a66ba1e they allowed project members, not just the original secret creator, to decrypt the secret and basically manage it | 21:05 |
| melwitt | lol storyboard | 21:06 |
| sean-k-mooney | without the need for acls, so that probably should have at least had a release note... or a doc update.... | 21:06 |
| sean-k-mooney | https://review.opendev.org/c/openstack/barbican/+/853680 apparently that was also cherry-picked, which is questionable | 21:07 |
| melwitt | yeah that's a pretty major change | 21:07 |
| sean-k-mooney | although i could be misreading those policies, but i don't think i am | 21:07 |
| dansmith | but these are API policies - doesn't barbican also have a more robust internal ACL system? | 21:08 |
| dansmith | I guess I thought that | 21:08 |
| dansmith | is what we were talking about here and not just API policy | 21:08 |
| melwitt | I guess the old rule had rule:secret_project_admin which allows project admin also | 21:08 |
| sean-k-mooney | hum, actually on gerrit i think i might be https://review.opendev.org/c/openstack/barbican/+/853680/9/barbican/common/policies/secrets.py#72 | 21:08 |
| melwitt | secrets are not as secret as I had thought :P | 21:09 |
| sean-k-mooney | i'm not familiar with rule:secret_is_not_private | 21:09 |
| melwitt | that's the project_access thing. project_access=false=private. at least that's how it is today | 21:09 |
| sean-k-mooney | ah | 21:10 |
| sean-k-mooney | so separately from the acl you can mark it as private to a user or public to the project | 21:10 |
| sean-k-mooney | i assume acls were an evolution of that, with more granularity or something, in a later release | 21:11 |
| melwitt | no that is the ACL. i.e. if you want to change it, it's a PUT to the /acl API | 21:11 |
| dansmith | so the api policy defines the default but I can still ACL individual secrets however I want? | 21:11 |
| melwitt | as far as I can tell the default is hard-coded, I don't think you can change it | 21:12 |
| melwitt | *don't think you can change the default | 21:12 |
| melwitt | and you can't create a secret to have project_access=false from the get go | 21:12 |
| sean-k-mooney | https://docs.openstack.org/barbican/latest/api/reference/acls.html | 21:12 |
| sean-k-mooney | so apparently the acl isn't defined on individual secrets but on the container the secrets are contained in | 21:12 |
| sean-k-mooney | it feels like this is not fully baked based on the documentation warning | 21:13 |
| melwitt | yeah... kinda also why I was thinking cross-project at the ptg, to just get a sanity check from barbican experts if they are out there | 21:14 |
| sean-k-mooney | https://docs.openstack.org/api-guide/key-manager/acls.html#default-acl | 21:14 |
| sean-k-mooney | { "read": { "project-access": true } } | 21:15 |
| sean-k-mooney | so that's apparently the default. i guess when they added acls it changed the behavior | 21:15 |
| melwitt | ok so you are saying before ACLs even existed, it was like private secrets? | 21:15 |
| melwitt | vtpm was added in the victoria release which was in 2020 | 21:17 |
| sean-k-mooney | i'm not sure | 21:17 |
| sean-k-mooney | but just skimming the api ref | 21:17 |
| sean-k-mooney | i think if we wanted to change the behavior we would need 3 calls minimum | 21:17 |
| sean-k-mooney | 1 to create a metadata-only secret | 21:18 |
| sean-k-mooney | then to apply the acl to it | 21:18 |
| sean-k-mooney | and finally to upload the data | 21:18 |
| sean-k-mooney | i would assume we are not doing that today | 21:18 |
| melwitt | yeah that's exactly what I thought we would have to do and no we are not doing that today | 21:19 |
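For concreteness, a hedged sketch of that 3-call flow against the documented barbican REST API (per the acls api-guide linked above); auth setup, error handling, and all concrete values here are illustrative, not nova code:

```python
import requests

BARBICAN = "http://controller:9311/v1"
HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

# 1. create a metadata-only secret (no payload yet)
resp = requests.post(BARBICAN + "/secrets", headers=HEADERS,
                     json={"name": "vtpm passphrase"})
secret_ref = resp.json()["secret_ref"]  # full URL of the new secret

# 2. apply the ACL before any data exists: close project access and
#    grant read to the nova service user
requests.put(secret_ref + "/acl", headers=HEADERS,
             json={"read": {"users": ["<nova service user id>"],
                            "project-access": False}})

# 3. only then upload the actual payload
requests.put(secret_ref, data="s3cr3t-passphrase",
             headers={"X-Auth-Token": "<token>",
                      "Content-Type": "text/plain"})
```

The same PUT on /acl is what the backfill idea discussed below would use on already-existing secrets.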
| sean-k-mooney | so when we are creating the "user" secret it's likely project scoped | 21:19 |
| melwitt | secrets in nova have always been created with the default acl, so have always been open to the project | 21:19 |
| melwitt | I don't know why I thought they were private to a user :( | 21:20 |
| sean-k-mooney | so it's the secret:decrypt | 21:21 |
| sean-k-mooney | policy that we care about rather than get | 21:22 |
| melwitt | correct | 21:22 |
| sean-k-mooney | get is just the metadata | 21:22 |
| melwitt | yes | 21:22 |
| melwitt | man the test_resize_reschedule_uses_host_lists_3_fails unit test is going crazy in CI | 21:23 |
| sean-k-mooney | https://opendev.org/openstack/barbican/blame/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/secrets.py#L21 | 21:23 |
| sean-k-mooney | so that acl rule has been there for at least 10 years | 21:24 |
| sean-k-mooney | so in effect they have never been user scoped unless you change the acl | 21:24 |
| melwitt | I don't think so right because of this https://opendev.org/openstack/barbican/src/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/base.py#L50-L52 | 21:24 |
| melwitt | oh wait that's not decrypt | 21:25 |
| sean-k-mooney | right, if you look at https://opendev.org/openstack/barbican/blame/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/secrets.py#L21 | 21:25 |
| melwitt | https://opendev.org/openstack/barbican/src/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/base.py#L53-L55 | 21:26 |
| sean-k-mooney | that has the 'rule:secret_acl_read' rule already | 21:26 |
| melwitt | if I read that right it was allow project admin, project observer, project creator | 21:26 |
| melwitt | so it's true it was not project member | 21:26 |
| sean-k-mooney | right, so creator was the old role they mapped to member | 21:26 |
| sean-k-mooney | and observer i guess was reader | 21:27 |
| sean-k-mooney | so before srbac, to use barbican at all an admin had to assign you the creator role | 21:27 |
| sean-k-mooney | melwitt: i think this is perhaps the relevant spec https://github.com/openstack/barbican-specs/blob/master/specs/kilo/add-creator-only-option.rst | 21:30 |
| sean-k-mooney | well and https://github.com/openstack/barbican-specs/blob/master/specs/kilo/add-per-secret-policy.rst | 21:30 |
| melwitt | ok yeah that is showing that private secrets was a new feature at the time. wasn't even possible originally | 21:31 |
| sean-k-mooney | right, so the second spec is the one that introduced the acl api | 21:32 |
| sean-k-mooney | and the first was specifically using it for providing different options for the creator role | 21:32 |
| melwitt | interesting. good to know | 21:32 |
| sean-k-mooney | i hate to say this but.... this would allow a 4th mode of operation | 21:34 |
| sean-k-mooney | we could have a "delegated" (for lack of a better term) mode where the secret is owned by the end user instead of the nova user, but we add an acl for it when creating it so the nova user is able to read it when needed | 21:35 |
| sean-k-mooney | and do that only for the tpm secrets that we are creating on their behalf | 21:35 |
| sean-k-mooney | that's almost what the deployment mode is, with the only delta being the secret remains owned by the project that owns the vm | 21:35 |
| melwitt | yeah .. that would work I think | 21:36 |
| sean-k-mooney | you could also upgrade to that by backfilling the ACL | 21:36 |
| melwitt | honestly that seems easier because then the service project thing doesn't come up | 21:36 |
| sean-k-mooney | although well, it kind of does, but only in the sense that whatever auth cred we are using to talk to it needs to be one from our config | 21:37 |
| sean-k-mooney | but just looking at master for a second | 21:37 |
| sean-k-mooney | https://opendev.org/openstack/barbican/src/branch/master/barbican/common/policies/secrets.py#L19-L26 | 21:37 |
| sean-k-mooney | rule:secret_project_admin | 21:38 |
| melwitt | yeah but I mean if we do it by giving the nova user access via acl, then it's never open to the service project right | 21:38 |
| sean-k-mooney | i assume that is someone with the admin role that has a project-scoped token for that project, right | 21:38 |
| melwitt | I think that's what that means, yes | 21:38 |
| sean-k-mooney | oh well there may or may not be a shared service project | 21:39 |
| sean-k-mooney | we do that in devstack but in our downstream we don't, i believe | 21:39 |
| melwitt | true, but the "common" deployment is done that way | 21:39 |
| sean-k-mooney | it would allow us to restrict it to just the nova user however | 21:39 |
| sean-k-mooney | if there was a shared service project, so that's still better | 21:39 |
| sean-k-mooney | well, not necessarily "better" than owned by nova in general, just better than "accessible by anyone in the service project" | 21:41 |
| melwitt | right | 21:41 |
| melwitt | it's the open-to-the-service-project that seems a bit too out there | 21:41 |
| sean-k-mooney | right. so looking at barbican base policy they do implement project-scoped admin to some degree | 21:42 |
| sean-k-mooney | but that's really global admin | 21:42 |
| sean-k-mooney | so today if an admin added themselves to an end user's project they should be able to live migrate | 21:43 |
| sean-k-mooney | provided we just don't block it in the api and try to download the secret when needed | 21:43 |
| sean-k-mooney | i.e. with the default "user" policy | 21:43 |
| sean-k-mooney | what "deployment" or "delegated" or whatever we want to call the acl approach gives | 21:44 |
| melwitt | yeah exactly. it is possible for project admin today | 21:44 |
| sean-k-mooney | is the ability to skip adding yourself to the end user's project | 21:44 |
| sean-k-mooney | just to do the live migration | 21:44 |
| sean-k-mooney | so it's still a net win, but the security posture is mostly the same as user or host | 21:45 |
| melwitt | yeah. it feels like it makes more sense too in a way. like more intuitive | 21:45 |
| sean-k-mooney | i think this also means that an admin could also just live migrate any vm with the user policy if they just add the acl to the relevant secret, right? | 21:46 |
| sean-k-mooney | for their user? | 21:46 |
| melwitt | I really don't want to have to migrate 'deployment' secrets though, ugh. | 21:46 |
| sean-k-mooney | well you don't have to move them | 21:47 |
| sean-k-mooney | that's the nice thing about deployment, you just get them from barbican in pre-live-migrate | 21:47 |
| sean-k-mooney | there is no copying, right. | 21:47 |
| melwitt | yes you can with acl also. a bit more code is needed to enable 'user' live migration and I have them proposed | 21:47 |
| sean-k-mooney | by migrate deployment secret did you mean add the acl? | 21:47 |
| sean-k-mooney | or did you mean move them during live migrate? | 21:48 |
| sean-k-mooney | or i guess copy them to a new secret | 21:48 |
| sean-k-mooney | but that would be on resize right | 21:48 |
| melwitt | yeah but I mean if we wanted to do the user + acl for nova user thing, taking the nova user secret and creating a user secret and copying the passphrase over and adding the acl, I dunno. mess | 21:48 |
| sean-k-mooney | oh, you're saying supporting the 4th policy | 21:49 |
| sean-k-mooney | of adding the nova user by acl | 21:49 |
| melwitt | for potentially already existing 'deployment' | 21:49 |
| melwitt | secrets | 21:49 |
| sean-k-mooney | ah well we don't have that yet, do we? | 21:49 |
| sean-k-mooney | i thought that was not merged | 21:49 |
| melwitt | we do have the secret generation but not the live migration | 21:49 |
| sean-k-mooney | ah ok | 21:50 |
| sean-k-mooney | well if we wanted to support it, it feels like a 4th policy | 21:50 |
| sean-k-mooney | but i think as you suggested talking to the barbican folks at the ptg would be a good next step | 21:50 |
| melwitt | yeah.. that would be easier to not have to deal with it. and then let it stay open to the service project? I dunno. just rambling | 21:50 |
| sean-k-mooney | to confirm that the acls are actually working as we think they do | 21:50 |
| sean-k-mooney | melwitt: well if it's a new policy we can recommend that you resize to delegated for better security or something | 21:51 |
| sean-k-mooney | where deployment is "nova-owned secret" and delegated is "user-owned secret with nova acl" | 21:51 |
| melwitt | yeah. that would be much less pain | 21:52 |
| melwitt | ++ | 21:52 |
| sean-k-mooney | in any case i think we should finish the deployment live migration support first | 21:52 |
| melwitt | yeah, I agree | 21:52 |
| sean-k-mooney | then maybe tack this on if folks agree | 21:52 |
| melwitt | the way I have it is I put 'user' ahead of 'deployment' since 'user' support is a subset of 'deployment' | 21:52 |
| sean-k-mooney | ya host was the most invasive one | 21:53 |
| melwitt | much smaller change than 'deployment' bc 'deployment' ultimately needs to be able to do secret conversions for resizes | 21:53 |
| sean-k-mooney | as all the rest didn't need rpc/object changes | 21:53 |
| sean-k-mooney | right ya i see | 21:54 |
| sean-k-mooney | it's not a lot different for the acl approach, as you would still need to add/remove it when you change from user | 21:54 |
| sean-k-mooney | but that's a single api call | 21:54 |
| sean-k-mooney | no need to copy the data and reupload it | 21:55 |
| melwitt | yeah, I don't think it would be much to add | 21:55 |
| sean-k-mooney | all of this falls apart if acls don't work that way, but that seems to be what those specs imply | 21:55 |
| sean-k-mooney | kilo era features with "This ACL documentation is work in progress and may change in near future." | 21:57 |
| sean-k-mooney | makes me cautiously optimistic | 21:57 |
| melwitt | they seem to work how we think, at least in my unexpert testing in devstack. I also used the acl in the 'user' live migration tempest tests I wrote and that works too | 21:57 |
| sean-k-mooney | ok i'm going to call it a night o/ | 22:00 |
| melwitt | seeya o/ | 22:00 |
| opendevreview | sean mooney proposed openstack/nova master: Add regression test for unified limits resource bug https://review.opendev.org/c/openstack/nova/+/975859 | 22:01 |
| opendevreview | sean mooney proposed openstack/nova master: Fix unified limits to include all resource types https://review.opendev.org/c/openstack/nova/+/975872 | 22:01 |
| sean-k-mooney | i don't know if i resolved all the comments on that but i just pushed what i had. i'll take a look at it again tomorrow | 22:03 |
| melwitt | ack | 22:08 |