Tuesday, 2026-04-07

opendevreviewmelanie witt proposed openstack/nova master: TPM: fixup for bump service version documentation  https://review.opendev.org/c/openstack/nova/+/978836  03:33
opendevreviewmelanie witt proposed openstack/nova master: TPM: support live migration of `user` secret security  https://review.opendev.org/c/openstack/nova/+/983333  03:33
opendevreviewmelanie witt proposed openstack/nova master: TPM: support live migration of `deployment` secret security  https://review.opendev.org/c/openstack/nova/+/925771  03:33
opendevreviewmelanie witt proposed openstack/nova master: TPM: test live migration between hosts with different security  https://review.opendev.org/c/openstack/nova/+/952629  03:33
opendevreviewmelanie witt proposed openstack/nova master: TPM: add late check for supported TPM secret security  https://review.opendev.org/c/openstack/nova/+/956975  03:33
opendevreviewmelanie witt proposed openstack/nova master: TPM: enable conversion of secret security modes via resize  https://review.opendev.org/c/openstack/nova/+/962052  03:33
opendevreviewmelanie witt proposed openstack/nova master: TPM: handle key manager Forbidden errors consistently  https://review.opendev.org/c/openstack/nova/+/983504  03:33
opendevreviewmelanie witt proposed openstack/nova master: TPM: clean up orphaned libvirt secret on guest creation failure  https://review.opendev.org/c/openstack/nova/+/983505  03:33
opendevreviewRobbert Nijgh proposed openstack/nova master: Remove proxy wrapping in libvirt device listing, slowing down update_available_resource  https://review.opendev.org/c/openstack/nova/+/983529  10:52
MaxLamprecht[m]sean-k-mooney: Can you maybe have another look at this https://review.opendev.org/c/openstack/nova/+/934984. We already merged the reproducer a while ago13:27
MaxLamprecht[m]Would be great to also get https://review.opendev.org/c/openstack/nova/+/958556 reviewed/merged.13:29
sean-k-mooneyMaxLamprecht[m]: sure. i'll see if i can do it before my next meeting at the top of the hour but i also need to prepare some content for that 13:32
sean-k-mooneyit looks like dan is now happy with it and there has only been 1 revision since i last looked so i suspect this will be good to proceed13:32
dansmithI wasn't ever unhappy, I just thought we should wait until after the release13:37
sean-k-mooneyack, i just finished reviewing it, i think it's ok to proceed with, although i won't have time to look at the other 2 related reproducers/fixes today13:42
sean-k-mooneyi'm not against fixing the correct handling of build, reschedule and migrate separately so that they are independently backportable, but i am wondering if we need a better long term fix for this in general. i have not spent any time thinking about what that could look like however. i'm just concerned that unshelve or resize will be next13:45
MaxLamprecht[m]thx, as far as I can tell, these fixes are pretty good for the nova side. We have had all of them running since the beginning of this year and our cinder/nova out-of-sync problems reduced a lot. At least I can't report a single new case that is nova related currently. I have 2 more race conditions on the cinder side that I will submit there.13:53
sean-k-mooneyMaxLamprecht[m]: that's good context. so you do not currently expect to need to address similar failures in other nova codepaths outside of the 2 others you have already submitted bugs/fixes for?14:01
sean-k-mooneyat least based on your current workloads/usage patterns14:01
dansmithsean-k-mooney: oh I was talking about this: 14:11
dansmithhttps://review.opendev.org/c/openstack/nova/+/958556  14:11
dansmiththis ^ should be a slam dunk early in the cycle I think14:12
sean-k-mooneydansmith: oh ok and ya i think we can proceed with that, i'll loop back to it when i'm finished with the cyborg team meeting14:13
MaxLamprecht[m]sean-k-mooney: Exactly!14:34
melwittsomeone has reported a bug related to the tpool.Proxy wrapping of virNodeDevice objects for PCI devices during update_available_resource() https://bugs.launchpad.net/nova/+bug/2147424 saying that it causes update_available_resource() to take progressively longer to complete over time. does anyone have any idea why this would be? if so, comments on the bug would be appreciated15:45
sean-k-mooneydid we fix that already15:48
sean-k-mooneymelwitt: that feels like something we did before15:48
melwittsean-k-mooney: they are saying that fix caused a different problem, that now update_available_resource() run time progressively increases over time. whereas the problem before the fix was that update_available_resource() could block other nova-compute eventlet threads. they are proposing a revert but I don't see how it could be that simple?15:50
melwitt(i.e. wouldn't that return things to the blocked threads problem)15:52
gibiclearly reverting until we know a bit more about the root cause of the increasing runtime feels premature to me15:53
melwittI don't really get how tpool.Proxy wrapping could result in a run time increase for the periodic task15:53
dansmith...if it increases contention15:53
gibiif it increases for each periodic then we are probably leaking something somewhere that accumulates15:53
melwittyeah. they did test with and without the wrapping and said without the wrapping the run time does not increase15:55
gibiI think we can try to reproduce the degradation. We just need to set the periodic delay to a short enough number and wait 15:56
melwittyeah. I wonder would we need real PCI devices or can this be faked somehow?15:58
dansmithI doubt it requires real PCI devices.. I don't think anything is actually touching the devices there, just getting the list result from libvirt15:59
gibiwe do filter for certain devices in that call16:00
gibihttps://github.com/openstack/nova/blob/b84864d9781ece8387daa967be5208b859e20d97/nova/virt/libvirt/driver.py#L8905-L8913  16:01
gibiif the number of devs is not increasing over time in their env then that call should not increase its runtime over time either 16:02
gibias crawlers are killing our infra I cannot set up a new devstack at the moment to poke at this16:07
sean-k-mooneyit should in theory be reproducible in the functional tests16:08
gibibut I added a comment to the bug after couple of retries16:08
* sean-k-mooney is not on a call anymore16:08
sean-k-mooneygibi: so we cache the actual devices from libvirt in that code path16:08
sean-k-mooneyso after the first call to libvirt to get the pci devices16:08
sean-k-mooneywe should be using it from memory16:08
gibiI don't see the cache, but it is getting late here so...16:10
sean-k-mooneybut we might want to double check that we are not creating new cached copies due to self or similar breaking the caching16:10
melwitthm I wonder could the cache be making the tpool.Proxy threads stay around and not be cleaned up or something like that?16:11
sean-k-mooneyperhaps. i'm trying to find where we do it today to confirm exactly how we are doing that16:12
dansmithgibi: s/our/all/16:16
sean-k-mooneymelwitt: i'm wondering if i am misremembering and perhaps wrote a caching patch that we have not merged?16:17
gibidansmith: indeed16:17
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/922855/6/nova/virt/libvirt/driver.py  16:19
melwittoh ok so not yet a factor16:21
sean-k-mooneyya16:21
sean-k-mooneythat is not the first time we have looked at that, for what it's worth16:21
sean-k-mooneythere are some other optimisation patches like https://review.opendev.org/c/openstack/nova/+/925382 but i thought we had merged at least one of the caching ones before16:22
sean-k-mooneyfor now that's not a factor i guess16:22
melwittI think the thing I keep focusing on is, why with tpool.Proxy only? seems the only way would be contention like Dan said16:23
sean-k-mooneymelwitt: do you know if they are seeing an increase in memory usage as well as time16:23
melwittI don't really know the details about how tpool.Proxy works. I pulled up the eventlet doc about it but it doesn't say much16:23
melwittsean-k-mooney: they did not mention memory usage but that would be good to know16:24
sean-k-mooneyif we were keeping more and more info in memory over time we should see that increase.16:24
dansmith"how it works" is "terribly"16:24
dansmithI mean it does what it says but it's really a gross hack we should avoid whenever we can16:24
melwittwell, that's not reassuring16:25
melwitt(haha)16:25
dansmithI can also see it just straight up hurting performance if we wrap an object and then have some really terrible code that hits a bunch of its properties/functions a thousand times, like in a sort routine or something16:26
sean-k-mooneyi did look at the impl a long time ago but i don't remember the details16:26
gibi... OK it seems I got lucky and have at least an aio devstack built with PCI devices now. So I can start experimenting with the periodic 16:26
dansmitheach of those would be a few instructions for regular code and with the proxy would be a spawn..wait..collect..return instead16:26
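
To make the overhead being described concrete: eventlet's tpool.Proxy forwards each method call on the wrapped object to an OS thread pool and waits for the result. A minimal standalone sketch (not nova code; Dev stands in for libvirt's virNodeDevice, and the counts are arbitrary):

    import time
    from eventlet import tpool

    class Dev:
        def listCaps(self):
            # stand-in for libvirt's virNodeDevice.listCaps()
            return ['pci']

    proxied = [tpool.Proxy(Dev()) for _ in range(1000)]
    start = time.monotonic()
    for dev in proxied:
        dev.listCaps()  # each call: queue to a worker thread, wait, marshal the result back
    print('proxied:', time.monotonic() - start)

    plain = [Dev() for _ in range(1000)]
    start = time.monotonic()
    for dev in plain:
        dev.listCaps()  # a plain call, a few instructions
    print('unwrapped:', time.monotonic() - start)
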
sean-k-mooneyya so we are actually collecting the pci devices 3 times per call today16:27
dansmithunless we need the actual object to be wrapped, we should wrap the call to libvirt and return the objects, or wrap a wrapper that does the call and copies the objects into like NamedTuples or something16:27
sean-k-mooneythat is what https://review.opendev.org/c/openstack/nova/+/925382 fixes16:27
sean-k-mooneybut we also don't support adding/removing devices at runtime anyway without an agent restart16:28
sean-k-mooneyso we can cache this once on the first invocation safely today16:28
melwittit's that the call of virNodeDevice.listCaps() needs to be wrapped somehow to not block stuff. so my thinking was wrap the object and thus get the call wrapped?16:28
sean-k-mooneythats in host.py right16:29
dansmithmelwitt: so we need to call listCaps() later and that makes another call into libvirt?16:30
melwittyes16:30
dansmith...and do we do that for all the returned devices?16:30
melwittall of the devices call their listCaps() method so I had thought so. I might be wrong 16:31
dansmithso that means each one of those listCaps() calls will be a thread spawn, wait, return with moving some data around for communication16:31
sean-k-mooneywe normally convert them into dictionaries for pci devices https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L1593  16:31
melwittit gets a list of [dev, dev, dev, ...] and then for each dev: dev.listCaps()16:31
dansmithwithout, those will just be sequential calls, so blocking but way faster16:32
sean-k-mooneywe try not to return the libvirt dev object out of host.py16:32
dansmithsean-k-mooney: right converting them to dicts or NamedTuple or something all at once in a single tpool'd function would be the more efficient thing16:32
sean-k-mooney_get_device_capabilities takes the virNodeDevice list and converts them today for the pci devices in general16:33
sean-k-mooneybut i don't know if that pattern is applied everywhere16:33
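
A hedged sketch of the pattern dansmith is describing: do the listing and every per-device call inside one tpool.execute() and hand plain dicts back, so no proxied libvirt object ever escapes. The helper name is hypothetical, not nova's actual API; conn is a libvirt virConnect:

    from eventlet import tpool

    def _list_pci_dev_dicts(conn):
        def _do_list():
            out = []
            for dev in conn.listAllDevices():
                # name()/listCaps()/XMLDesc() all run in the same worker
                # thread, so there is one spawn/wait round trip in total
                out.append({
                    'name': dev.name(),
                    'caps': dev.listCaps(),
                    'xml': dev.XMLDesc(0),
                })
            return out
        return tpool.execute(_do_list)
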
melwittyeah I mean maybe it's just that, that the calls are too many and too long and then the threads pile up?16:33
dansmithnot the threads piling up so much as just way more work doing it with threads16:34
sean-k-mooneywell this is 2025.116:34
melwittyeah I was thinking about the progressive increase over time16:34
sean-k-mooneyso that's not using our threaded implementation16:34
sean-k-mooneythis is just for the eventlet flow right16:34
melwittsomething has to be accumulating I thought, just not sure exactly what16:35
melwittyeah this should just be for eventlet16:35
sean-k-mooneyso we can create fake pci devices in the functional tests16:35
sean-k-mooneyand we can run the periodics explicitly in those16:35
sean-k-mooneyso it might be worth calling it like 100 times and timing each one to see if we can reproduce16:35
sean-k-mooneysort of like in my ironic simulator patch16:36
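
Roughly what such a functional test could look like: drive the periodic explicitly N times and check the per-run durations for an upward trend. A sketch only; _run_periodics() exists in nova's functional test helpers, but the iteration count and threshold here are made up. (In a live environment the equivalent knob is shortening update_resources_interval in nova.conf, as gibi mentions above.)

    import time

    def test_update_available_resource_runtime_is_stable(self):
        durations = []
        for _ in range(100):
            start = time.monotonic()
            self._run_periodics()  # runs update_available_resource explicitly
            durations.append(time.monotonic() - start)
        # crude trend check: the last ten runs should not be much slower
        # than the first ten if nothing is accumulating
        self.assertLess(sum(durations[-10:]), 2 * sum(durations[:10]))
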
melwittgibi said he just got an env set up and will run some experiments16:36
sean-k-mooneyif we can, we can then debug that16:36
sean-k-mooneyack i have an env live with pci devices too although i currently have the pci grouping service deployed on that, but i can at least check my logs i guess and see what the time looks like16:36
sean-k-mooneymelwitt: the other thing of note was the fact they said more instances make it degrade faster16:37
sean-k-mooneythe number of instances doesn't affect the libvirt retrieval path16:37
sean-k-mooneyso i wonder if it's related to the instance objects instead, or the instance part of this16:37
melwittI guess I assumed more instance = more PCI devices but I guess you're saying not?16:38
sean-k-mooneywell if its the same hosts in both cases16:38
sean-k-mooneyor same server sku16:38
dansmithsean-k-mooney: in eventlet with tpool.proxy you're creating worker threads for every property invocation on a wrapped object16:38
dansmithnothing to do with our threaded implementation16:38
sean-k-mooneydansmith: ack yes, you are hooking __getattr__ or similar and dispatching the callables to a thread pool16:39
melwittor yeah I guess I see. so if more instances = more other unrelated tpool.Proxy things maybe16:39
sean-k-mooneymelwitt: i think we pull all of the domains for example in this periodic16:39
sean-k-mooneyso i'm wondering if it's related to the number of vms with passthrough devices in some way16:39
sean-k-mooneyi.e. if on an empty host it's fixed/stable but on a host with instances it degrades over time16:40
sean-k-mooneyit is odd however that it was reported against _get_pci_passthrough_devices if that was the case16:41
melwittI feel like it must be something less common ... like otherwise this complaint would have come up a long time ago I would think. and more people affected16:41
dansmithif we iterate all the host pci devices once (or more) per instance, then that could be the increase that scales with instances16:41
sean-k-mooneyhttps://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8921-L8925  16:42
dansmithmelwitt: this change happened just a year ago right? so real production people are probably just starting to run it16:42
sean-k-mooneywe are iterating over raw nodedev devices from libvirt here16:42
melwittdansmith: yeah a little over a year ago16:43
sean-k-mooneylist_all_devices here https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8913  16:43
melwittbut it got backported everywhere also16:43
dansmithack about the backports, that's fair16:43
sean-k-mooneywell no, that's wrapped into a proxy in list_all_devices16:44
sean-k-mooneyhttps://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L1744  16:44
melwittI had seen a lot of positive feedback about the change (since before that people were getting nova-compute lockups) but I hadn't seen negative until today16:44
dansmithyeah, I mean.. it improves one thing at the expense of another, so not surprising I think16:44
dansmithand if you have fewer PCI devices or slower periodic config you probably appreciate the change :)16:45
melwittyeah. I just meant as far as, the conditions under which it happens being less common or more common16:45
MaxLamprecht[m] I remember a similar problem. See my patch and comment https://review.opendev.org/c/openstack/nova/+/922855/comment/a0361266_9c58663c/. There was a bug/performance regression in libvirt < 10.5.16:45
melwittoh interesting16:46
sean-k-mooneydansmith: so we are only calling it once16:46
sean-k-mooneynot once per instance, as we are calling it here https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L10506-L10560  16:46
dansmithsean-k-mooney: ack, I hadn't looked I was just saying _if_ it scales with instances and _if_ we are calling once per instance...16:47
sean-k-mooney yep i was just double-checking that we were not16:47
melwittMaxLamprecht[m]: I guess though, the bug reporter did say they tested with and without tpool.Proxy wrap (assuming same libvirt version) and said they didn't get the slowdown as long as tpool.Proxy wrap was not used16:48
melwittI'm not sure if that means it could not be related. might be worth asking them on the bug what their libvirt version is regardless16:49
dansmiththat's expected because of the overhead it introduces.. but MaxLamprecht[m]'s change just makes us never call it again which is also an improvement regardless of the proxy16:49
sean-k-mooneyyes we could also just improve this to convert the data to dicts earlier16:50
MaxLamprecht[m]ack, for us upgrading libvirt to 10.5 (which includes https://gitlab.com/libvirt/libvirt/-/commit/98f1cf88fa7e0f992d93f376418fbfb3996a9690). I would start to check this :)16:51
MaxLamprecht[m]*solved it16:51
sean-k-mooneylist_all_devices is only used in 2 places16:51
sean-k-mooneyit would not be hard to replace it with 2 functions, one that returns the list of mdev names and a second that returns the capabilities as a dict, and just remove the tpool wrapper entirely16:52
dansmithyou had me at "remove the tpool wrapper entirely"16:53
sean-k-mooneyhum well not quite https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L1225-L1242 but that call can be moved to host.py and it does not return the object so there is no long lived usage16:54
sean-k-mooney_start_inactive_mediated_devices is only called once in init_host16:54
dansmithjust converting to dicts when we do it, in a single worker thread and returning (probably cached if that's appropriate) would solve all the things.. then callers can iterate over the dicts as they iterate over objects today without the penalty16:55
sean-k-mooney+116:55
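
Putting those two ideas together, a hypothetical host-side replacement could flatten everything inside a single tpool.execute() and cache the result on first use, which is safe today given devices cannot be added or removed at runtime without an agent restart (as noted above). All names here are illustrative, not nova's actual API:

    from eventlet import tpool

    class Host:
        def __init__(self, conn):
            self._conn = conn
            self._dev_cache = None  # filled on first use

        def _build_dev_dicts(self):
            # runs entirely inside one worker thread
            return [{'name': d.name(), 'caps': d.listCaps()}
                    for d in self._conn.listAllDevices()]

        def list_device_dicts(self):
            if self._dev_cache is None:
                self._dev_cache = tpool.execute(self._build_dev_dicts)
            return self._dev_cache  # plain dicts; no proxy on the hot path
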
melwittyeah, let's not remove the wrapper entirely please and return to the lockups16:55
sean-k-mooneywell we can do it in phases16:56
dansmithwe want to remove the wrappER, just not remove the wrappING (of the initial call)16:56
melwittO_O16:56
melwittok16:57
dansmithI can cook up the start of a change if it's still not clear16:57
melwittlet him cook16:58
melwittI think I understand the basic idea, I was not thinking of the distinction between wrappER and wrappING so I'm good :)17:00
dansmithack17:00
melwittregardless if anyone wants to cook a change, please do. I didn't come up with any of the ideas17:00
opendevreviewTakashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-SNP support  https://review.opendev.org/c/openstack/nova-specs/+/983376  17:02
-opendevstatus- NOTICE: Load on the opendev.org Gitea backends is under control again for now, if any Zuul jobs failed with SSL errors or disconnects reaching the service prior to 16:15 UTC they can be safely rechecked17:04
opendevreviewTakashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-SNP support  https://review.opendev.org/c/openstack/nova-specs/+/983376  17:07
melwittjust to keep things rolling with problems, I noticed it seems like the test_resize_reschedule_uses_host_lists_3_fails unit test is still nondeterministically failing somehow https://f1460b24809dc27f2808-70ad550016ca78bab2d4ae8445294bcc.ssl.cf2.rackcdn.com/openstack/243619dc5a054b6cbca4265b38ce41ed/testr_results.html  17:07
melwittI don't understand why yet17:07
sean-k-mooneyhum i have not seen that but have not looked lately. is it always the 3 variant or is it also failing on test_resize_reschedule_uses_host_lists_1_fail17:10
sean-k-mooneyall the logic is common bar the parameters, so the instability must be in https://github.com/openstack/nova/blob/b84864d9781ece8387daa967be5208b859e20d97/nova/tests/functional/test_servers.py#L3453-L3490  17:10
sean-k-mooneyoh that keeps going https://github.com/openstack/nova/blob/b84864d9781ece8387daa967be5208b859e20d97/nova/tests/functional/test_servers.py#L3453-L3558  17:11
melwittI have only seen the 3 variant. and I think gmaan did something to improve it but I don't remember. I need to look at the git blame17:11
gibiinitial results: 142 devs returned from libvirt (10.0), no instances on the host. Over 10 periodic runs I don't see any trends. But I confirm that eventlet mode is slower overall.17:14
sean-k-mooneyis that on master17:14
gibino proxy: _get_pci_passthrough_devices took 0.137605510999947 secs for 142 libvirt devices17:14
gibiwith proxy: periodic update_available_resource took 3.2481384640000215 secs17:14
gibisorry17:14
gibiwith proxy:  _get_pci_passthrough_devices took 0.5433623499998248 secs for 142 libvirt devices17:15
sean-k-mooneyi assume you are just setting the env flag on master?17:15
gibiyepp17:15
sean-k-mooneyits nice we can do that :)17:15
gibi:)17:15
gibiI will start adding VMs17:15
gibiand taking a longer measurement window17:16
gibithis is the current data I have https://paste.openstack.org/show/bNCNkXgjaGe1EXOjAxNf/  17:16
sean-k-mooneythey said 60 reproduces it but if we could get 4-8 it should still be visible, it just might take longer to notice, but running the periodic like every 10 seconds should offset that17:16
gibiyepp17:17
sean-k-mooneyi guess those are new logs17:18
sean-k-mooneylet me check my sriov dev env for the overall periodic time quickly17:18
gibiyeah I needed new logs otherwise it is not clear how long things take17:18
sean-k-mooneyi think in debug mode we print the total periodic time17:19
sean-k-mooney Lock "compute_resources" "released" by "nova.compute.resource_tracker.ResourceTracker._update_available_resource" :: held 2.483s17:21
sean-k-mooneyso we have those logs17:21
sean-k-mooneybut i'm not sure if the call is in that lock17:22
sean-k-mooneyit ideally is not17:22
sean-k-mooneyok no we are computing the resource view well before that17:24
gibiI'm wondering whether the "scaling with the number of VMs" is an artifact of the compute node being more and more under load from those VMs. It is definitely the case for my compute by default. I will measure this with VMs in shutoff state to limit that effect17:25
sean-k-mooneyeven so the pull periodic time for me locally is 12 seconds17:25
sean-k-mooney*full17:25
sean-k-mooneywell there will be more contention for the compute_resources lock17:26
sean-k-mooneyin the resource tracker from other periodics17:26
sean-k-mooneywe do print the wait time: Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._update_available_resource" :: waited 0.001s so maybe that's a factor17:27
sean-k-mooneyalthough locally the other thing that uses it is Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker.clean_compute_node_cache"17:28
sean-k-mooneyi think we acquire it when booting vms as well17:29
sean-k-mooneybut in terms of periodics that contention does not seem very high17:29
sean-k-mooneyActive: active (running) since Wed 2026-03-11 17:16:44 UTC; 3 weeks 6 days ago17:31
sean-k-mooneyso i do not currently have instances on this env17:31
sean-k-mooneybut i have had them running with pci devices in the past so it does seem to return to a sane baseline at least on an empty host17:32
sean-k-mooneygibi: i guess the next best thing to do is leave it running overnight in your env and see if it reproduces when there are instances17:34
gibi10 VMs that are not using PCI devs on the host and are in shutoff state do not increase the time of the _get_pci_passthrough_devices call. It does increase the time of the whole periodic obviously. I don't yet see any trends in either of the times17:34
gibisean-k-mooney: yeah I will, I'm just trying to see what setup makes sense to use17:35
gibinext I will try with PCI user VMs17:35
gibiI have 6+6 VFs (2 PFs) so each VM can take a VF17:37
sean-k-mooneyack, i'm currently comparing in my cyborg/pci grouping env. i'll leave it alone for now but if you don't see anything i can maybe run ~50-80 vms 17:37
sean-k-mooneyi believe i have 2 sriov capable nics on this host and i should barely have enough disk/ram, but i haven't really pushed this host to find its limits17:39
gibiI don't see any timing data difference between 10 VMs without PCI usage and 10 VMs with one VF usage each (in both cases the VMs are in shutoff)17:44
gibiso what I will do is keep the 10 VMs with one VF each, take a GMR, and then let the periodic run every 10 seconds during the night17:45
sean-k-mooneyhum, i'm not sure what running vs not would cause in this context17:47
sean-k-mooneytechnically that does change the retrieval of mac addresses for VFs17:48
sean-k-mooneyi.e. if libvirt is restarted after the vf is passed to the vm it may not see the mac17:49
sean-k-mooneybut did the bug say it was related to running/active vms or are you just leaving them stopped to reduce load17:49
gibiI simply want to avoid the cpu usage noise of the 10 VMs running17:49
sean-k-mooneyah ok17:49
sean-k-mooneyyou could start them and then pause them17:49
gibiit is non 0 even with cirros17:49
gibisure17:50
gibithat is better17:50
sean-k-mooneybut i suspect we wont see much of a delta on the perodic althoug hif we do it would be an interesting datapoint i guess17:50
gibia quick look at two GMRs ~5 minutes apart: I don't see an increasing number of Green Threads or Threads reported.18:17
gibiOK. This is the starting state of my measurement setup for the night https://paste.openstack.org/show/b96mgBQpguw6KUseBAgZ/  18:25
melwittawesome gibi18:34
melwittdansmith: fyi potential cross-project PTG topic re: barbican secrets https://etherpad.opendev.org/p/nova-2026.2-ptg#L90 cc Uggla sorry if I put the text in the wrong location in the etherpad, feel free to move it20:10
dansmithmelwitt: ack, not sure that needs to be a cross-project session really, but cool.. I also didn't realize you were worried about the deployment side of the ownership thing, more the user one20:28
dansmithbut good point that neutron would have access to those keys20:28
dansmithunless nova could change the ACL itself20:28
melwittdansmith: my brain started on user and then the realization about service hit. but yeah nova can change the ACL itself as far as I know20:29
melwittI thought you had suggested we talk to ppl who know more about barbican so I put it cross-project20:30
melwittit does suck that there's no way to do ACLs through castellan. not sure if that's by design 20:30
melwittI guess maybe not all key managers support ACLs20:31
dansmithack20:33
sean-k-mooneymelwitt: by the way i believe there was a specific commit, possibly as part of their srbac work, that changed the secret to be owned by the project20:56
sean-k-mooneymelwitt: i don't exactly know why i feel that way but i think we talked about it before20:57
melwittsean-k-mooney: I wondered if that could be a possibility. I'll take another gander20:58
sean-k-mooneyi.e. the fact that it's not only accessible by the user is something that i was vaguely aware of, and that it was not always that way20:58
melwittas far as I could find it goes down to the model and object level, so always been that way21:00
melwitt(this was just me looking in the code)21:01
dansmithyeah I really thought we were told that it was specific to the user and not the project21:01
sean-k-mooneymelwitt: maybe https://opendev.org/openstack/barbican/commit/cd9b0ba9f0b4606193858d3cf05acd413a66ba1e  21:02
sean-k-mooneythey used to enforce _SECRET_CREATOR = "user_id:%(target.secret.creator_id)s"21:02
sean-k-mooney... storyboard 2010235 has more info 21:03
sean-k-mooneyhuh it's actually working https://storyboard.openstack.org/#!/story/2010235  21:03
sean-k-mooneyi guess the bots don't know about storyboard21:03
sean-k-mooneyya so in https://opendev.org/openstack/barbican/commit/cd9b0ba9f0b4606193858d3cf05acd413a66ba1e they allowed project members, not just the original secret creator, to decrypt the secret and basically manage it21:05
melwittlol storyboard21:06
sean-k-mooneywithout the need for acls, so that probably should have at least had a release note... or a doc update....21:06
sean-k-mooneyhttps://review.opendev.org/c/openstack/barbican/+/853680 apparently that was also cherry-picked, which is questionable21:07
melwittyeah that's a pretty major change21:07
sean-k-mooneyalthough i could be misreading those policies, but i don't think i am21:07
dansmithbut these are API policies - doesn't barbican also have a more robust internal ACL system?21:08
dansmithI guess I thought that21:08
dansmithis what we were talking about here and not just API policy21:08
melwittI guess the old rule had rule:secret_project_admin which allows project admin also21:08
sean-k-mooneyhum actually on gerrit i think i might be https://review.opendev.org/c/openstack/barbican/+/853680/9/barbican/common/policies/secrets.py#72  21:08
melwittsecrets are not as secret as I had thought :P21:09
sean-k-mooneyi'm not familiar with rule:secret_is_not_private 21:09
melwittthat's the project_access thing. project_access=false=private. at least that's how it is today21:09
sean-k-mooneyah21:10
sean-k-mooneyso separately from the acl you can mark it as private to a user or public to the project21:10
sean-k-mooneyi assume acls were an evolution of that with more granularity or something in a later release21:11
melwittno that is the ACL. i.e. if you want to change it, it's a PUT to the /acl API21:11
dansmithso the api policy defines the default but I can still ACL individual secrets however I want?21:11
melwittas far as I can tell the default is hard-coded, I don't think you can change it21:12
melwitt*don't think you can change the default21:12
melwittand you can't create a secret to have project_access=false from the get go21:12
sean-k-mooneyhttps://docs.openstack.org/barbican/latest/api/reference/acls.html  21:12
sean-k-mooneyso apparently the acl is not only defined on individual secrets but also on the container the secrets are contained in21:12
sean-k-mooneyit feels like this is not fully baked based on the documentation warning21:13
melwittyeah... kinda also why I was thinking cross-project at the ptg, to just get a sanity check from barbican experts if they are out there21:14
sean-k-mooneyhttps://docs.openstack.org/api-guide/key-manager/acls.html#default-acl  21:14
sean-k-mooney{21:15
sean-k-mooney  "read":{21:15
sean-k-mooney    "project-access": true21:15
sean-k-mooney  }21:15
sean-k-mooney}21:15
sean-k-mooneyso that's apparently the default, so i guess when they added acls it changed the behavior21:15
melwittok so you are saying before ACLs even existed, it was like private secrets?21:15
melwittvtpm was added in the victoria release which was in 202021:17
sean-k-mooneyim not sure21:17
sean-k-mooneybut just skimming the api ref21:17
sean-k-mooneyi think if we wanted to change the behavior we would need 3 calls minimum21:17
sean-k-mooney1 to create a metadata-only secret21:18
sean-k-mooneythen to apply the acl to it21:18
sean-k-mooneyand finally to upload the data21:18
sean-k-mooneyi would assume we are not doing that today21:18
melwittyeah that's exactly what I thought we would have to do and no we are not doing that today21:19
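
For reference, that three-call sequence sketched with raw barbican REST calls (nova really goes through castellan/barbicanclient, and castellan has no ACL support as noted earlier; the endpoint, token and ids below are placeholders, and the ACL shown makes the secret private to the creator plus nova's service user):

    import requests

    BARBICAN = 'http://controller:9311/v1'
    HEADERS = {'X-Auth-Token': 'TOKEN'}

    # 1) create a metadata-only secret (no payload yet)
    resp = requests.post(BARBICAN + '/secrets', headers=HEADERS,
                         json={'name': 'vtpm passphrase'})
    secret_ref = resp.json()['secret_ref']

    # 2) set the ACL while the secret is still empty
    requests.put(secret_ref + '/acl', headers=HEADERS,
                 json={'read': {'users': ['NOVA_SERVICE_USER_ID'],
                                'project-access': False}})

    # 3) upload the actual secret data
    requests.put(secret_ref, data=b'passphrase-bytes',
                 headers=dict(HEADERS,
                              **{'Content-Type': 'application/octet-stream'}))
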
sean-k-mooneyso when we are creating the "user" secret it's likely project scoped21:19
melwittsecrets in nova have always been created with the default acl, so have always been open to the project21:19
melwittI don't know why I thought they were private to a user :(21:20
sean-k-mooneyso it's the secret:decrypt21:21
sean-k-mooneypolicy that we care about rather than get21:22
melwittcorrect21:22
sean-k-mooneyget is just the metadata21:22
melwittyes21:22
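
That split maps to two separate API calls guarded by two separate policies, roughly (placeholder ref and token again):

    import requests

    HEADERS = {'X-Auth-Token': 'TOKEN'}
    secret_ref = 'http://controller:9311/v1/secrets/SECRET_UUID'

    # secret:get guards the metadata document
    meta = requests.get(secret_ref, headers=HEADERS).json()

    # secret:decrypt guards the payload itself
    data = requests.get(secret_ref + '/payload',
                        headers=dict(HEADERS,
                                     Accept='application/octet-stream'))
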
melwittman the test_resize_reschedule_uses_host_lists_3_fails unit test is going crazy in CI21:23
sean-k-mooneyhttps://opendev.org/openstack/barbican/blame/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/secrets.py#L21  21:23
sean-k-mooneyso that acl rule has been there for at least 10 years21:24
sean-k-mooneyso in effect they have never been user scoped unless you change the acl21:24
melwittI don't think so right because of this https://opendev.org/openstack/barbican/src/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/base.py#L50-L52  21:24
melwittoh wait that's not decrypt21:25
sean-k-mooney right, if you look at https://opendev.org/openstack/barbican/blame/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/secrets.py#L21  21:25
melwitthttps://opendev.org/openstack/barbican/src/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/base.py#L53-L55  21:26
sean-k-mooneythat has the 'rule:secret_acl_read' rule already21:26
melwittif I read that right it was allow project admin, project observer, project creator21:26
melwittso it's true it was not project member21:26
sean-k-mooney right, so creator was the old role they mapped to member21:26
sean-k-mooneyand observer i guess was reader21:27
sean-k-mooneyso before srbac, to use barbican at all, admin had to assign you the creator role21:27
sean-k-mooneymelwitt: i think this is perhaps the relevant spec https://github.com/openstack/barbican-specs/blob/master/specs/kilo/add-creator-only-option.rst  21:30
sean-k-mooneywell and https://github.com/openstack/barbican-specs/blob/master/specs/kilo/add-per-secret-policy.rst  21:30
melwittok yeah that is showing that private secrets was a new feature at the time. wasn't even possible originally 21:31
sean-k-mooneyright, so the second spec is the one that introduced the acl api21:32
sean-k-mooneyand the first was specifically using it for providing different options for the creator role21:32
melwittinteresting. good to know21:32
sean-k-mooneyi hate to say this but.... this would allow a 4th mode of operation21:34
sean-k-mooneywe could have a "delegated" (for lack of a better term) mode where the secret is owned by the end user instead of the nova user, but we just add an acl for it when creating it so the nova user is able to read it when needed21:35
sean-k-mooneyand do that only for the tpm secrets that we are creating on their behalf21:35
sean-k-mooneythat's almost what the deployment mode is, with the only delta being the secret remains owned by the project that owns the vm21:35
melwittyeah .. that would work I think21:36
sean-k-mooneyyou could also upgrade to that by backfilling the ACL21:36
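
Backfilling would then be one ACL PUT per existing user-owned secret, something like the following (raw REST again, placeholder values; project-access is left at barbican's default of true so project members keep their current access):

    import requests

    def backfill_nova_acl(secret_ref, token, nova_user_id):
        # replace the secret's read ACL: keep project access and add
        # nova's service user explicitly
        requests.put(secret_ref + '/acl',
                     headers={'X-Auth-Token': token},
                     json={'read': {'users': [nova_user_id],
                                    'project-access': True}})
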
melwitthonestly that seems easier because then the service project thing doesn't come up21:36
sean-k-mooneyalthough, well, it kind of does, but only in the sense that whatever auth cred we are using to talk to it needs to be the one from our config21:37
sean-k-mooneybut just looking at master for a second 21:37
sean-k-mooneyhttps://opendev.org/openstack/barbican/src/branch/master/barbican/common/policies/secrets.py#L19-L26  21:37
sean-k-mooneyrule:secret_project_admin21:38
melwittyeah but I mean if we do it by giving the nova user access via acl, then it's never open to the service project right21:38
sean-k-mooneyi assume that is someone with the admin role that has a project scoped token for that project, right21:38
melwittI think that's what that means, yes21:38
sean-k-mooneyoh well there may or may not be a shared service project21:39
sean-k-mooneywe do that in devstack but in our downstream we don't, i believe21:39
melwitttrue, but the "common" deployment is done that way21:39
sean-k-mooneyit would allow us to restrict it to just the nova user however21:39
sean-k-mooneyif there was a shared service project, so that's still better21:39
sean-k-mooneywell, not necessarily "better" than owned by nova in general, just better than "accessible by anyone in the service project"21:41
melwittright21:41
melwittit's the open-to-the-service-project that seems a bit too out there21:41
sean-k-mooneyright. so looking at barbican base policy they do implement project scoped admin to some degree21:42
sean-k-mooneybut that's really global admin21:42
sean-k-mooneyso today if an admin added themselves to an end user's project they should be able to live migrate21:43
sean-k-mooneyprovided we just don't block it in the api and try to download the secret when needed21:43
sean-k-mooneyi.e. with the default "user" policy21:43
sean-k-mooneywhat "deployment" or "delegated" what ever we want to call the acl approch gives21:44
melwittyeah exactly. it is possible for project admin today21:44
sean-k-mooneyis the ability to skip adding yourself to the end users policy21:44
sean-k-mooney*end users project21:44
sean-k-mooneyjust to do the live migration21:44
sean-k-mooneyso it's still a net win, but the security posture is mostly the same as user or host21:45
melwittyeah. it feels like it makes more sense too in a way. like more intuitive21:45
sean-k-mooneyi think this also means that an admin could also just live migrate any vm with the user policy if they just add the acl to the relevant secret right?21:46
sean-k-mooneyfor their user?21:46
melwittI really don't want to have to migrate 'deployment' secrets though, ugh. 21:46
sean-k-mooneywell you don't have to move them21:47
sean-k-mooneythat's the nice thing about deployment, you just get them from barbican in pre live migrate21:47
sean-k-mooneythere is no copying, right.21:47
melwittyes you can with acl also. a bit more code is needed to enable 'user' live migration and I have them proposed21:47
sean-k-mooneyby migrate deployment secret did you mean add the acl?21:47
sean-k-mooneyor did you mean move them during live migrate?21:48
sean-k-mooneyor i guess copy them to a new secret21:48
sean-k-mooneybut that would be on resize right21:48
melwittyeah but I mean if we wanted to do the user + acl for nova user thing, taking the nova user secret and creating a user secret and copying the passphrase over and adding the acl, I dunno. mess21:48
sean-k-mooneyoh you're saying supporting the 4th policy21:49
sean-k-mooneyof adding the nova user by acl21:49
melwittfor potentially already existing 'deployment' 21:49
melwittsecrets21:49
sean-k-mooneyah well we don't have that yet do we?21:49
sean-k-mooneyi thought that was not merged21:49
melwittwe do have the secret generation but not the live migration21:49
sean-k-mooneyah ok21:50
sean-k-mooneywell if we wanted to support it, it feels like a 4th policy21:50
sean-k-mooneybut i think as you suggested talking to the barbican folks at the ptg would be a good next step21:50
melwittyeah.. that would be easier to not have to deal with it. and then let it stay open to the service project? I dunno. just rambling21:50
sean-k-mooneyto confirm that the acls are actually working as we think they do21:50
sean-k-mooneymelwitt: well if it's a new policy we can recommend that you resize to delegated for better security or something21:51
sean-k-mooneywhere deployment is "nova owned secret" and delegated is "user owned secret with nova acl"21:51
melwittyeah. that would be much less pain21:52
melwitt++21:52
sean-k-mooneyin any case i think we should finish the deployment live migration support first21:52
melwittyeah, I agree21:52
sean-k-mooneythen maybe tack this on if folks agree21:52
melwittthe way I have it is I put 'user' ahead of 'deployment' since 'user' support is a subset of 'deployment'21:52
sean-k-mooneyya host was the most invasive one21:53
melwittmuch smaller change than 'deployment' bc 'deployment' ultimately needs to be able to do secret conversions for resizes21:53
sean-k-mooneyas all the rest didn't need rpc/object changes21:53
sean-k-mooneyright ya i see21:54
sean-k-mooneyit's not a lot different for the acl approach as you would still need to add/remove it when you change from user21:54
sean-k-mooneybut that's a single api call21:54
sean-k-mooneyno need to copy the data and reupload it21:55
melwittyeah, I don't think it would be much to add 21:55
sean-k-mooneyall of this falls apart if acls don't work that way but that seems to be what those specs imply21:55
sean-k-mooneykilo era features with "This ACL documentation is work in progress and may change in near future."21:57
sean-k-mooneymake me cautiously optimistic21:57
melwittthey seem to work how we think, at least in my unexpert testing in devstack. I also used the acl in the 'user' live migration tempest tests I wrote and that works too21:57
sean-k-mooneyok im going to call it a night o/22:00
melwittseeya o/22:00
opendevreviewsean mooney proposed openstack/nova master: Add regression test for unified limits resource bug  https://review.opendev.org/c/openstack/nova/+/975859  22:01
opendevreviewsean mooney proposed openstack/nova master: Fix unified limits to include all resource types  https://review.opendev.org/c/openstack/nova/+/975872  22:01
sean-k-mooneyi don't know if i resolved all the comments on that but i just pushed what i had, i'll take a look at it again tomorrow22:03
melwittack22:08
