| opendevreview | melanie witt proposed openstack/nova master: TPM: fixup for bump service version documentation https://review.opendev.org/c/openstack/nova/+/978836 | 03:33 |
|---|---|---|
| opendevreview | melanie witt proposed openstack/nova master: TPM: support live migration of `user` secret security https://review.opendev.org/c/openstack/nova/+/983333 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: support live migration of `deployment` secret security https://review.opendev.org/c/openstack/nova/+/925771 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: test live migration between hosts with different security https://review.opendev.org/c/openstack/nova/+/952629 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: add late check for supported TPM secret security https://review.opendev.org/c/openstack/nova/+/956975 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: enable conversion of secret security modes via resize https://review.opendev.org/c/openstack/nova/+/962052 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: handle key manager Forbidden errors consistently https://review.opendev.org/c/openstack/nova/+/983504 | 03:33 |
| opendevreview | melanie witt proposed openstack/nova master: TPM: clean up orphaned libvirt secret on guest creation failure https://review.opendev.org/c/openstack/nova/+/983505 | 03:33 |
| opendevreview | Robbert Nijgh proposed openstack/nova master: Remove proxy wrapping in libvirt device listing, slowing down update_available_resource https://review.opendev.org/c/openstack/nova/+/983529 | 10:52 |
| MaxLamprecht[m] | sean-k-mooney: Can you maybe have another look at this https://review.opendev.org/c/openstack/nova/+/934984. We merged already the reproducer a while ago | 13:27 |
| MaxLamprecht[m] | Would be great to also get https://review.opendev.org/c/openstack/nova/+/958556 reviewed/merged. | 13:29 |
| sean-k-mooney | MaxLamprecht[m]: sure. i'll see if i can do it before my next meeting at the top of the hour, but i also need to prepare some content for that | 13:32 |
| sean-k-mooney | it looks like dan is now happy with it and there has only been 1 revision since i last looked so i suspect this will be good to proceed | 13:32 |
| dansmith | I wasn't ever unhappy, I just thought we should wait until after the release | 13:37 |
| sean-k-mooney | ack, i just finished reviewing it. i think it's ok to proceed with, although i won't have time to look at the other 2 related reproducers/fixes today | 13:42 |
| sean-k-mooney | i'm not against fixing the correct handling of build, reschedule and migrate separately so that they are independently backportable, but i am wondering if we need a better long term fix for this in general. i have not spent any time thinking about what that could look like however. i'm just concerned that unshelve or resize will be next | 13:45 |
| MaxLamprecht[m] | thx, as far as I can tell, these fixes are pretty good for the nova side. We have had all of them running since the beginning of this year and our cinder/nova out of sync problems reduced a lot. At least I can't even report a single new case that is nova related currently. I have 2 more race conditions on cinder side that I will submit there. | 13:53 |
| sean-k-mooney | MaxLamprecht[m]: that's good context. so you do not currently expect to need to address similar failures in other nova codepaths outside of the 2 others you have already submitted bugs/fixes for? | 14:01 |
| sean-k-mooney | at least based on your current workloads/usage patterns | 14:01 |
| dansmith | sean-k-mooney: oh I was talking about this: | 14:11 |
| dansmith | https://review.opendev.org/c/openstack/nova/+/958556 | 14:11 |
| dansmith | this ^ should be a slam dunk early in the cycle I think | 14:12 |
| sean-k-mooney | dansmith: oh ok, and ya i think we can proceed with that. i'll loop back to it when i'm finished with the cyborg team meeting | 14:13 |
| MaxLamprecht[m] | sean-k-mooney: Exactly! | 14:34 |
| melwitt | someone has reported a bug related to the tpool.Proxy wrapping of virNodeDevice of PCI devices during update_available_resource() https://bugs.launchpad.net/nova/+bug/2147424 saying that it causes update_available_resource() to increase time taken to complete over time. does anyone have any idea why this would be? if so, comments on the bug would be appreciated | 15:45 |
| sean-k-mooney | did we fix that already | 15:48 |
| sean-k-mooney | melwitt: that feels like something we did before | 15:48 |
| melwitt | sean-k-mooney: they are saying that fix caused a different problem, that now update_available_resource() run time progressively increases over time. whereas the problem before the fix was that update_available_resource() could block other nova-compute eventlet threads. they are proposing a revert but I don't see how it could be that simple? | 15:50 |
| melwitt | (i.e. wouldn't that return things to the blocked threads problem) | 15:52 |
| gibi | clearly, reverting until we know a bit more about the root cause of the increasing runtime feels premature to me | 15:53 |
| melwitt | I don't really get how tpool.Proxy wrapping could result in a run time increase for the periodic task | 15:53 |
| dansmith | ...if it increases contention | 15:53 |
| gibi | if it increases for each periodic then we are probably leaking something somewhere that accumulates | 15:53 |
| melwitt | yeah. they did test with and without the wrapping and said without the wrapping the run time does not increase | 15:55 |
| gibi | I think we can try to reproduce the degradation. We just need to set the periodic delay to a short enough number and wait | 15:56 |
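For reference, the knob gibi describes could look like this in nova.conf, assuming the standard `update_resources_interval` option; the 10-second value is only for the experiment:

```ini
# nova.conf on the compute node: run the update_available_resource periodic
# every 10 seconds instead of the default cadence, so any per-run growth in
# its runtime becomes visible within minutes rather than days
[DEFAULT]
update_resources_interval = 10
```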
| melwitt | yeah. I wonder would we need real PCI devices or can this be faked somehow? | 15:58 |
| dansmith | I doubt it requires real PCI devices.. I don't think anything is actually touching the devices there, just getting the list result from libvirt | 15:59 |
| gibi | we do filter for certain devices in that call | 16:00 |
| gibi | https://github.com/openstack/nova/blob/b84864d9781ece8387daa967be5208b859e20d97/nova/virt/libvirt/driver.py#L8905-L8913 | 16:01 |
| gibi | if the number of devs are not increasing over time in their env then that call should not increase its runtime over time either | 16:02 |
| gibi | as crawlers are killing our infra I cannot set up a new devstack at the moment to poke this | 16:07 |
| sean-k-mooney | it should in theory be reproducible in the functional tests | 16:08 |
| gibi | but I added a comment to the bug after a couple of retries | 16:08 |
| * | sean-k-mooney is not on a call anymore | 16:08 |
| sean-k-mooney | gibi: so we cache the actual devices from libvirt in that code path | 16:08 |
| sean-k-mooney | so after the first call to libvirt to get the pci devices | 16:08 |
| sean-k-mooney | we should be using it from memory | 16:08 |
| gibi | I don't see the cache, but it is getting late here so... | 16:10 |
| sean-k-mooney | but we might want to double check that we are not creating new cached copies due to self or similar breaking the caching | 16:10 |
| melwitt | hm I wonder could the cache be making the tpool.Proxy threads stay around and not be cleaned up or something like that? | 16:11 |
| sean-k-mooney | perhaps. i'm trying to find where we do it today to confirm exactly how we are doing that | 16:12 |
| dansmith | gibi: s/our/all/ | 16:16 |
| sean-k-mooney | melwitt: i'm wondering if i am misremembering and perhaps wrote a caching patch that we have not merged? | 16:17 |
| gibi | dansmith: indeed | 16:17 |
| sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/922855/6/nova/virt/libvirt/driver.py | 16:19 |
| melwitt | oh ok so not yet a factor | 16:21 |
| sean-k-mooney | ya | 16:21 |
| sean-k-mooney | that is not the first time we have looked at that, for what it's worth | 16:21 |
| sean-k-mooney | there are some other optimisation patches like https://review.opendev.org/c/openstack/nova/+/925382 but i thought we had merged at least one of the caching ones before | 16:22 |
| sean-k-mooney | for now that's not a factor i guess | 16:22 |
| melwitt | I think the thing I keep focusing on is, why with tpool.Proxy only? seems the only way would be contention like Dan said | 16:23 |
| sean-k-mooney | melwitt: do you know if they are seeing an increase in memory usage as well as time | 16:23 |
| melwitt | I don't really know the details about how tpool.Proxy works. I pulled up the eventlet doc about it but it doesn't say much | 16:23 |
| melwitt | sean-k-mooney: they did not mention memory usage but that would be good to know | 16:24 |
| sean-k-mooney | if we were keeping more and more info in memory over time we should see that increase. | 16:24 |
| dansmith | "how it works" is "terribly" | 16:24 |
| dansmith | I mean it does what it says but it's really a gross hack we should avoid whenever we can | 16:24 |
| melwitt | well, that's not reassuring | 16:25 |
| melwitt | (haha) | 16:25 |
| dansmith | I can also see it just straight up hurting performance if we wrap an object and then have some really terrible code that hits a bunch of its properties/functions a thousand times, like in a sort routine or something | 16:26 |
| sean-k-mooney | i did look at the impl a long time ago but i don't remember the details | 16:26 |
| gibi | ... OK it seems I got lucky and have at least an aio devstack built with PCI devices now. So I can start experimenting with the periodic | 16:26 |
| dansmith | each of those would be a few instructions for regular code and with the proxy would be a spawn..wait..collect..return instead | 16:26 |
| sean-k-mooney | ya so we are actually collecting the pci devices 3 times per call today | 16:27 |
| dansmith | unless we need the actual object to be wrapped, we should wrap the call to libvirt and return the objects, or wrap a wrapper that does the call and copies the objects into like NamedTuples or something | 16:27 |
| sean-k-mooney | that is what https://review.opendev.org/c/openstack/nova/+/925382 fixes | 16:27 |
| sean-k-mooney | but we also don't support adding/removing devices at runtime anyway without an agent restart | 16:28 |
| sean-k-mooney | so we can cache this once on the first invocation safely today | 16:28 |
| melwitt | it's that the call of virNodeDevice.listCaps() needs to be wrapped somehow to not block stuff. so my thinking was wrap the object and thus get the call wrapped? | 16:28 |
| sean-k-mooney | thats in host.py right | 16:29 |
| dansmith | melwitt: so we need to call listCaps() later and that makes another call into libvirt? | 16:30 |
| melwitt | yes | 16:30 |
| dansmith | ...and do we do that for all the returned devices? | 16:30 |
| melwitt | all of the devices call their listCaps() method so I had thought so. I might be wrong | 16:31 |
| dansmith | so that means each one of those listCaps() calls will be a thread spawn, wait, return with moving some data around for communication | 16:31 |
| sean-k-mooney | we normally convert them into dictionaries for pci devices https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L1593 | 16:31 |
| melwitt | it's get a list of [dev, dev, dev, ...] and then for each dev: dev.listCaps() | 16:31 |
| dansmith | without, those will just be sequential calls, so blocking but way faster | 16:32 |
| sean-k-mooney | we try not to return the libvirt dev object out of host.py | 16:32 |
| dansmith | sean-k-mooney: right converting them to dicts or NamedTuple or something all at once in a single tpool'd function would be the more efficient thing | 16:32 |
| sean-k-mooney | _get_device_capabilities takes the virNodeDevice list and converts them today for the pci devices in general | 16:33 |
| sean-k-mooney | but i don't know if that pattern is applied everywhere | 16:33 |
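For concreteness, a hedged sketch of the shape dansmith is proposing above: do all the libvirt calls in one tpool'd function and hand plain dicts back, so no libvirt object (and no proxy) escapes host.py. The function names and dict keys here are illustrative, not actual nova code:

```python
from eventlet import tpool


def _fetch_device_dicts(conn):
    # Runs in a single native worker thread: every libvirt call here still
    # blocks that thread, but there is only one thread hop for the whole batch.
    result = []
    for dev in conn.listAllDevices():
        result.append({
            "name": dev.name(),
            "capabilities": dev.listCaps(),  # no per-device proxy round-trip
            "xml": dev.XMLDesc(0),
        })
    return result


def list_all_device_dicts(conn):
    # One spawn/wait round-trip total, instead of one per attribute access
    # on a tpool.Proxy-wrapped virNodeDevice.
    return tpool.execute(_fetch_device_dicts, conn)
```

Callers would then iterate plain dicts, and caching the result (as suggested later in the discussion) becomes a one-line change.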
| melwitt | yeah I mean maybe it's just that, that the calls are too many and too long and then the threads pile up? | 16:33 |
| dansmith | not the threads piling up so much as just way more work doing it with threads | 16:34 |
| sean-k-mooney | well this is 2025.1 | 16:34 |
| melwitt | yeah I was thinking about the progressive increase over time | 16:34 |
| sean-k-mooney | so that's not using our threaded implementation | 16:34 |
| sean-k-mooney | this is just for the eventlet flow right | 16:34 |
| melwitt | something has to be accumulating I thought, just not sure exactly what | 16:35 |
| melwitt | yeah this should just be for eventlet | 16:35 |
| sean-k-mooney | so we can create fake pci devices in the functional tests | 16:35 |
| sean-k-mooney | and we can run the periodics explicitly in those | 16:35 |
| sean-k-mooney | so it might be worth calling it like 100 times and timing each one to see if we can reproduce | 16:35 |
| sean-k-mooney | sort of like in my ironic simulator patch | 16:36 |
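A rough sketch of that timing loop in the style of nova's functional tests; `_run_periodics()` is the existing helper, while the iteration count and the 2x threshold are arbitrary choices for illustration:

```python
import time


def test_update_available_resource_runtime_is_stable(self):
    # meant to live on a nova functional test class that provides the
    # existing _run_periodics() helper and a started compute service
    durations = []
    for _ in range(100):
        start = time.monotonic()
        self._run_periodics()  # runs update_available_resource among others
        durations.append(time.monotonic() - start)
    # if something accumulates per run, the last iterations should be
    # visibly slower than the first; the 2x bound is a judgement call
    head = sum(durations[:10]) / 10
    tail = sum(durations[-10:]) / 10
    self.assertLess(tail, head * 2)
```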
| melwitt | gibi said he just got an env set up and will run some experiments | 16:36 |
| sean-k-mooney | if we can we can then debug that | 16:36 |
| sean-k-mooney | ack, i have an env live with pci devices too, although i currently have the pci grouping service deployed on that. but i can at least check my logs i guess and see what the time looks like | 16:36 |
| sean-k-mooney | melwitt: the other thing of note was the fact they said more instances make it degrade faster | 16:37 |
| sean-k-mooney | the number of instances doesn't affect the libvirt retrieval path | 16:37 |
| sean-k-mooney | so i wonder if it's related to the instance objects instead, or the instance part of this | 16:37 |
| melwitt | I guess I assumed more instances = more PCI devices but I guess you're saying not? | 16:38 |
| sean-k-mooney | well if its the same hosts in both cases | 16:38 |
| sean-k-mooney | or same server sku | 16:38 |
| dansmith | sean-k-mooney: in eventlet with tpool.proxy you're creating worker threads for every property invocation on a wrapped object | 16:38 |
| dansmith | nothing to do with our threaded implementation | 16:38 |
| sean-k-mooney | dansmith: ack yes, you are hooking getattr or similar and dispatching the callables to a thread pool | 16:39 |
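A toy illustration of that mechanism (deliberately simplified, not eventlet's actual implementation): trap attribute access, and run any resulting call in a native thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


class ToyProxy:
    """Dispatch every method call on the wrapped object to a worker thread."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        def call_in_thread(*args, **kwargs):
            # every single call pays a submit + wait round-trip, which is
            # why a loop of cheap listCaps()-style calls gets expensive
            return _pool.submit(attr, *args, **kwargs).result()

        return call_in_thread
```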
| melwitt | or yeah I guess I see. so if more instances = more other unrelated tpool.Proxy things maybe | 16:39 |
| sean-k-mooney | melwitt: i think we pull all of the domains for example in this periodic | 16:39 |
| sean-k-mooney | so i'm wondering if it's related to the number of vms with passthrough devices in some way | 16:39 |
| sean-k-mooney | i.e. if on an empty host it's fixed/stable but on a host with instances it degrades over time | 16:40 |
| sean-k-mooney | it is odd however that it was reported against _get_pci_passthrough_devices if that was the case | 16:41 |
| melwitt | I feel like it must be something less common ... like otherwise this complaint would have come up a long time ago I would think. and more people affected | 16:41 |
| dansmith | if we iterate all the host pci devices once (or more) per instance, then that could be the increase that scales with instances | 16:41 |
| sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8921-L8925 | 16:42 |
| dansmith | melwitt: this change happened just a year ago right? so real production people are probably just starting to run it | 16:42 |
| sean-k-mooney | we are iterating over raw nodedev devices from libvirt here | 16:42 |
| melwitt | dansmith: yeah a little over a year ago | 16:43 |
| sean-k-mooney | list_all_devices here https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8913 | 16:43 |
| melwitt | but it got backported everywhere also | 16:43 |
| dansmith | ack about the backports, that's fair | 16:43 |
| sean-k-mooney | well no, that's wrapped into a proxy in list_all_devices | 16:44 |
| sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L1744 | 16:44 |
| melwitt | I had seen a lot of positive feedback about the change (since before that people were getting nova-compute lockups) but I hadn't seen negative until today | 16:44 |
| dansmith | yeah, I mean.. it improves one thing at the expense of another, so not surprising I think | 16:44 |
| dansmith | and if you have fewer PCI devices or slower periodic config you probably appreciate the change :) | 16:45 |
| melwitt | yeah. I just meant as far as, the conditions under which it happens being less common or more common | 16:45 |
| MaxLamprecht[m] | I remember about a similar problem. See my patch and comment https://review.opendev.org/c/openstack/nova/+/922855/comment/a0361266_9c58663c/. There was a bug/performance regression in libvirt < 10.5. | 16:45 |
| melwitt | oh interesting | 16:46 |
| sean-k-mooney | dansmith: so we are only calling it once | 16:46 |
| sean-k-mooney | not once per instance, as we are calling it here https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L10506-L10560 | 16:46 |
| dansmith | sean-k-mooney: ack, I hadn't looked I was just saying _if_ it scales with instances and _if_ we are calling once per instance... | 16:47 |
| sean-k-mooney | yep i was just double-checking that we were not | 16:47 |
| melwitt | MaxLamprecht[m]: I guess though, the bug reporter did say they tested with and without tpool.Proxy wrap (assuming same libvirt version) and said they didn't get the slowdown as long as tpool.Proxy wrap was not used | 16:48 |
| melwitt | I'm not sure if that means it could not be related. might be worth asking them on the bug what their libvirt version is regardless | 16:49 |
| dansmith | that's expected because of the overhead it introduces.. but MaxLamprecht[m]'s change just makes us never call it again which is also an improvement regardless of the proxy | 16:49 |
| sean-k-mooney | yes we could also just improve this to convert the data to dicts earlier | 16:50 |
| MaxLamprecht[m] | ack, for us upgrading libvirt to 10.5 (which includes https://gitlab.com/libvirt/libvirt/-/commit/98f1cf88fa7e0f992d93f376418fbfb3996a9690) solved it. I would start to check this :) | 16:51 |
| sean-k-mooney | list_all_devices is only used in 2 places | 16:51 |
| sean-k-mooney | it would not be hard to replace it with 2 functions, one that returns the list of mdev names and a second that returns the capabilities as a dict, and just remove the tpool wrapper entirely | 16:52 |
| dansmith | you had me at "remove the tpool wrapper entirely" | 16:53 |
| sean-k-mooney | hum well not quite https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L1225-L1242 but that call can be moved to host.py and it does not return the object so there is no long-lived usage | 16:54 |
| sean-k-mooney | _start_inactive_mediated_devices is only called once in init_host | 16:54 |
| dansmith | just converting to dicts when we do it, in a single worker thread and returning (probably cached if that's appropriate) would solve all the things.. then callers can iterate over the dicts as they iterate over objects today without the penalty | 16:55 |
| sean-k-mooney | +1 | 16:55 |
| melwitt | yeah, let's not remove the wrapper entirely please and return to the lockups | 16:55 |
| sean-k-mooney | well we can do it in phases | 16:56 |
| dansmith | we want to remove the wrappER, just not remove the wrappING (of the initial call) | 16:56 |
| melwitt | O_O | 16:56 |
| melwitt | ok | 16:57 |
| dansmith | I can cook up the start of a change if it's still not clear | 16:57 |
| melwitt | let him cook | 16:58 |
| melwitt | I think I understand the basic idea, I was not thinking of the distinction between wrappER and wrappING so I'm good :) | 17:00 |
| dansmith | ack | 17:00 |
| melwitt | regardless if anyone wants to cook a change, please do. I didn't come up with any of the ideas | 17:00 |
| opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-SNP support https://review.opendev.org/c/openstack/nova-specs/+/983376 | 17:02 |
| -opendevstatus- | NOTICE: Load on the opendev.org Gitea backends is under control again for now, if any Zuul jobs failed with SSL errors or disconnects reaching the service prior to 16:15 UTC they can be safely rechecked | 17:04 |
| opendevreview | Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-SNP support https://review.opendev.org/c/openstack/nova-specs/+/983376 | 17:07 |
| melwitt | just to keep things rolling with problems, I noticed it seems like the test_resize_reschedule_uses_host_lists_3_fails unit test is still nondeterministically failing somehow https://f1460b24809dc27f2808-70ad550016ca78bab2d4ae8445294bcc.ssl.cf2.rackcdn.com/openstack/243619dc5a054b6cbca4265b38ce41ed/testr_results.html | 17:07 |
| melwitt | I don't understand why yet | 17:07 |
| sean-k-mooney | hum i have not seen that, but have not looked lately. is it always the 3 variant or is it also failing on test_resize_reschedule_uses_host_lists_1_fail | 17:10 |
| sean-k-mooney | all the logic is common bar the parameters so the instability must be in https://github.com/openstack/nova/blob/b84864d9781ece8387daa967be5208b859e20d97/nova/tests/functional/test_servers.py#L3453-L3490 | 17:10 |
| sean-k-mooney | oh that keeps going https://github.com/openstack/nova/blob/b84864d9781ece8387daa967be5208b859e20d97/nova/tests/functional/test_servers.py#L3453-L3558 | 17:11 |
| melwitt | I have only seen the 3 variant. and I think gmaan did something to improve it but I don't remember. I need to look at the git blame | 17:11 |
| gibi | initial results: 142 devs returned from libvirt (10.0), no instances on the host. Over 10 periodic runs I don't see any trends. But I confirm that eventlet mode is slower overall. | 17:14 |
| sean-k-mooney | is that on master | 17:14 |
| gibi | no proxy: _get_pci_passthrough_devices took 0.137605510999947 secs for 142 libvirt devices | 17:14 |
| gibi | with proxy: periodic update_available_resource took 3.2481384640000215 secs | 17:14 |
| gibi | sorry | 17:14 |
| gibi | with proxy: _get_pci_passthrough_devices took 0.5433623499998248 secs for 142 libvirt devices | 17:15 |
| sean-k-mooney | i assume you're just setting the env flag on master? | 17:15 |
| gibi | yepp | 17:15 |
| sean-k-mooney | its nice we can do that :) | 17:15 |
| gibi | :) | 17:15 |
| gibi | I will start adding VMs | 17:15 |
| gibi | and taking a longer measurement window | 17:16 |
| gibi | this is the current data I have https://paste.openstack.org/show/bNCNkXgjaGe1EXOjAxNf/ | 17:16 |
| sean-k-mooney | they said 60 reproduces it, but if we could get 4-8 it should still be visible. it just might take longer to notice, but running the periodic like every 10 seconds should offset that | 17:16 |
| gibi | yepp | 17:17 |
| sean-k-mooney | i guess those are new logs | 17:18 |
| sean-k-mooney | let me check my sriov dev env for the overall periodic time quickly | 17:18 |
| gibi | yeah I needed new logs otherwise it is not clear how long things take | 17:18 |
| sean-k-mooney | i think in debug mode we print the total periodic time | 17:19 |
| sean-k-mooney | Lock "compute_resources" "released" by "nova.compute.resource_tracker.ResourceTracker._update_available_resource" :: held 2.483s | 17:21 |
| sean-k-mooney | so we have those logs | 17:21 |
| sean-k-mooney | but i'm not sure if the call is in that lock | 17:22 |
| sean-k-mooney | it ideally is not | 17:22 |
| sean-k-mooney | ok no, we are computing the resource view well before that | 17:24 |
| gibi | I'm wondering whether the "scaling with the number of VMs" is an artifact of the compute node being more and more under load by those VMs. It is definitely the case for my compute by default. I will measure this with VMs in shutoff state to limit that effect | 17:25 |
| sean-k-mooney | even so the full periodic time for me locally is 12 seconds | 17:25 |
| sean-k-mooney | well there will be more contention for the compute_resources lock | 17:26 |
| sean-k-mooney | in the resource tracker from other periodics | 17:26 |
| sean-k-mooney | we do print the wait time: Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._update_available_resource" :: waited 0.001s, so maybe that's a factor | 17:27 |
| sean-k-mooney | although locally the other thing that uses it is Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker.clean_compute_node_cache" | 17:28 |
| sean-k-mooney | i think we acquire it when booting vms as well | 17:29 |
| sean-k-mooney | but in terms of periodics that contention does not seem very high | 17:29 |
| sean-k-mooney | Active: active (running) since Wed 2026-03-11 17:16:44 UTC; 3 weeks 6 days ago | 17:31 |
| sean-k-mooney | so i do not currently have instances on this env | 17:31 |
| sean-k-mooney | but i have had them running with pci devices in the past, so it does seem to return to a sane baseline at least on an empty host | 17:32 |
| sean-k-mooney | gibi: i guess the next best thing to do is leave it run overnight in your env and see if it reproduces when there are instances | 17:34 |
| gibi | 10 VMs that are not using PCI devs on the host and are in shutoff state do not increase the time of the _get_pci_passthrough_devices call. It does increase the time of the whole periodic obviously. I don't yet see any trends in either of the times over time | 17:34 |
| gibi | sean-k-mooney: yeah I will, I'm just trying to see what setup makes sense to use | 17:35 |
| gibi | next I will try with PCI user VMs | 17:35 |
| gibi | I have 6+6 VFs (2 PFs) so each VM can take a VF | 17:37 |
| sean-k-mooney | ack, i'm currently comparing in my cyborg/pci grouping env. i'll leave it alone for now, but if you don't see anything i can maybe run ~50-80 vms | 17:37 |
| sean-k-mooney | i believe i have 2 sriov capable nics on this host and i should barely have enough disk/ram, but i haven't really pushed this host to find its limits | 17:39 |
| gibi | I don't see any timing data difference between 10VMs without PCI usage and 10 VMs with one VF usage each (both cases the VMs are in shutoff) | 17:44 |
| gibi | so what I will do is keep the 10 VMs with one VF each, take a GMR, and then let the periodic run every 10 seconds during the night | 17:45 |
| sean-k-mooney | hum, i'm not sure what running vs not would cause in this context | 17:47 |
| sean-k-mooney | technically that does change the retrieval of mac addresses for VFs | 17:48 |
| sean-k-mooney | i.e. if libvirt is restarted after the vf is passed to the vm it may not see the mac | 17:49 |
| sean-k-mooney | but did the bug say it was related to running/active vms or are you just leaving them stopped to reduce load | 17:49 |
| gibi | I simply want to avoid the cpu usage noise of the 10 VMs running | 17:49 |
| sean-k-mooney | ah ok | 17:49 |
| sean-k-mooney | you could start them and then pause them | 17:49 |
| gibi | it is non 0 even with cirros | 17:49 |
| gibi | sure | 17:50 |
| gibi | that is better | 17:50 |
| sean-k-mooney | but i suspect we won't see much of a delta on the periodic, although if we do it would be an interesting datapoint i guess | 17:50 |
| gibi | a quick look at two GMRs ~5 minutes apart shows no increasing number of Green Threads or Threads reported. | 18:17 |
| gibi | OK. This is the starting state of my measurement setup for the night https://paste.openstack.org/show/b96mgBQpguw6KUseBAgZ/ | 18:25 |
| melwitt | awesome gibi | 18:34 |
| melwitt | dansmith: fyi potential cross-project PTG topic re: barbican secrets https://etherpad.opendev.org/p/nova-2026.2-ptg#L90 cc Uggla sorry if I put the text in the wrong location in the etherpad, feel free to move it | 20:10 |
| dansmith | melwitt: ack, not sure that needs to be a cross-project session really, but cool.. I also didn't realize you were worried about the deployment side of the ownership thing, more the user one | 20:28 |
| dansmith | but good point that neutron would have access to those keys | 20:28 |
| dansmith | unless nova could change the ACL itself | 20:28 |
| melwitt | dansmith: my brain started on user and then the realization about service hit. but yeah nova can change the ACL itself as far as I know | 20:29 |
| melwitt | I thought you had suggested we talk to ppl who know more about barbican so I put it cross-project | 20:30 |
| melwitt | it does suck that there's no way to do ACLs through castellan. not sure if that's by design | 20:30 |
| melwitt | I guess maybe not all key managers support ACLs | 20:31 |
| dansmith | ack | 20:33 |
| sean-k-mooney | melwitt: by the way i believe there was a specific commit, possibly as part of their srbac work, that changed the secret to be owned by the project | 20:56 |
| sean-k-mooney | melwitt: i don't exactly know why i feel that way but i think we talked about it before | 20:57 |
| melwitt | sean-k-mooney: I wondered if that could be a possibility. I'll take another gander | 20:58 |
| sean-k-mooney | i.e. the fact that it is not only accessible by the user is something that i was vaguely aware of, and that it was not always that way | 20:58 |
| melwitt | as far as I could find it goes down to the model and object level, so always been that way | 21:00 |
| melwitt | (this was just me looking in the code) | 21:01 |
| dansmith | yeah I really thought we were told that it was specific to the user and not the project | 21:01 |
| sean-k-mooney | melwitt: maybe https://opendev.org/openstack/barbican/commit/cd9b0ba9f0b4606193858d3cf05acd413a66ba1e | 21:02 |
| sean-k-mooney | they use to enforce _SECRET_CREATOR = "user_id:%(target.secret.creator_id)s" | 21:02 |
| sean-k-mooney | ... storyboard 2010235 has more info | 21:03 |
| sean-k-mooney | huh, it's actually working https://storyboard.openstack.org/#!/story/2010235 | 21:03 |
| sean-k-mooney | i guess the bots don't know about storyboard | 21:03 |
| sean-k-mooney | ya so in https://opendev.org/openstack/barbican/commit/cd9b0ba9f0b4606193858d3cf05acd413a66ba1e they allowed project members, not just the original secret creator, to decrypt the secret and basically manage it | 21:05 |
| melwitt | lol storyboard | 21:06 |
| sean-k-mooney | without the need for acls, so that probably should have at least had a release note... or a doc update.... | 21:06 |
| sean-k-mooney | https://review.opendev.org/c/openstack/barbican/+/853680 apparently that was also cherry-picked, which is questionable | 21:07 |
| melwitt | yeah that's a pretty major change | 21:07 |
| sean-k-mooney | although i could be misreading those policies, but i don't think i am | 21:07 |
| dansmith | but these are API policies - doesn't barbican also have a more robust internal ACL system? | 21:08 |
| dansmith | I guess I thought that | 21:08 |
| dansmith | is what we were talking about here and not just API policy | 21:08 |
| melwitt | I guess the old rule had rule:secret_project_admin which allows project admin also | 21:08 |
| sean-k-mooney | hum, actually on gerrit i think i might be https://review.opendev.org/c/openstack/barbican/+/853680/9/barbican/common/policies/secrets.py#72 | 21:08 |
| melwitt | secrets are not as secret as I had thought :P | 21:09 |
| sean-k-mooney | i'm not familiar with rule:secret_is_not_private | 21:09 |
| melwitt | that's the project_access thing. project_access=false=private. at least that's how it is today | 21:09 |
| sean-k-mooney | ah | 21:10 |
| sean-k-mooney | so separately from the acl you can mark it as private to a user or public to the project | 21:10 |
| sean-k-mooney | i assume acls were an evolution of that, with more granularity or something, in a later release | 21:11 |
| melwitt | no that is the ACL. i.e. if you want to change it, it's a PUT to the /acl API | 21:11 |
| dansmith | so the api policy defines the default but I can still ACL individual secrets however I want? | 21:11 |
| melwitt | as far as I can tell the default is hard-coded, I don't think you can change it | 21:12 |
| melwitt | *don't think you can change the default | 21:12 |
| melwitt | and you can't create a secret to have project_access=false from the get go | 21:12 |
| sean-k-mooney | https://docs.openstack.org/barbican/latest/api/reference/acls.html | 21:12 |
| sean-k-mooney | so apparently the acl isn't defined on individual secrets but on the container the secrets are contained in | 21:12 |
| sean-k-mooney | it feels like this is not fully baked based on the documentation warning | 21:13 |
| melwitt | yeah... kinda also why I was thinking cross-project at the ptg, to just get a sanity check from barbican experts if they are out there | 21:14 |
| sean-k-mooney | https://docs.openstack.org/api-guide/key-manager/acls.html#default-acl | 21:14 |
| sean-k-mooney | { "read": { "project-access": true } } | 21:15 |
| sean-k-mooney | so that's apparently the default. i guess when they added acls it changed the behavior | 21:15 |
| melwitt | ok so you are saying before ACLs even existed, it was like private secrets? | 21:15 |
| melwitt | vtpm was added in the victoria release which was in 2020 | 21:17 |
| sean-k-mooney | i'm not sure | 21:17 |
| sean-k-mooney | but just skimming the api ref | 21:17 |
| sean-k-mooney | i think if we wanted to change the behavior we would need 3 calls minimum | 21:17 |
| sean-k-mooney | 1 to create a metadata-only secret | 21:18 |
| sean-k-mooney | then to apply the acl to it | 21:18 |
| sean-k-mooney | and finally to upload the data | 21:18 |
| sean-k-mooney | i would assume we are not doing that today | 21:18 |
| melwitt | yeah that's exactly what I thought we would have to do and no we are not doing that today | 21:19 |
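For concreteness, a hedged sketch of that 3-call flow against the documented barbican REST API (per the acls api-guide linked above); auth setup, error handling, and all concrete values here are illustrative, not nova code:

```python
import requests

BARBICAN = "http://controller:9311/v1"
HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

# 1. create a metadata-only secret (no payload yet)
resp = requests.post(BARBICAN + "/secrets", headers=HEADERS,
                     json={"name": "vtpm passphrase"})
secret_ref = resp.json()["secret_ref"]  # full URL of the new secret

# 2. apply the ACL before any data exists: close project access and
#    grant read to the nova service user
requests.put(secret_ref + "/acl", headers=HEADERS,
             json={"read": {"users": ["<nova service user id>"],
                            "project-access": False}})

# 3. only then upload the actual payload
requests.put(secret_ref, data="s3cr3t-passphrase",
             headers={"X-Auth-Token": "<token>",
                      "Content-Type": "text/plain"})
```

The same PUT on /acl is what the backfill idea discussed below would use on already-existing secrets.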
| sean-k-mooney | so when we are creating the "user" secret it's likely project scoped | 21:19 |
| melwitt | secrets in nova have always been created with the default acl, so have always been open to the project | 21:19 |
| melwitt | I don't know why I thought they were private to a user :( | 21:20 |
| sean-k-mooney | so it's the secret:decrypt | 21:21 |
| sean-k-mooney | policy that we care about rather than get | 21:22 |
| melwitt | correct | 21:22 |
| sean-k-mooney | get is just the metadata | 21:22 |
| melwitt | yes | 21:22 |
| melwitt | man the test_resize_reschedule_uses_host_lists_3_fails unit test is going crazy in CI | 21:23 |
| sean-k-mooney | https://opendev.org/openstack/barbican/blame/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/secrets.py#L21 | 21:23 |
| sean-k-mooney | so that acl rule has been there for at least 10 years | 21:24 |
| sean-k-mooney | so in effect they have never been user scoped unless you change the acl | 21:24 |
| melwitt | I don't think so right because of this https://opendev.org/openstack/barbican/src/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/base.py#L50-L52 | 21:24 |
| melwitt | oh wait that's not decrypt | 21:25 |
| sean-k-mooney | right, if you look at https://opendev.org/openstack/barbican/blame/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/secrets.py#L21 | 21:25 |
| melwitt | https://opendev.org/openstack/barbican/src/commit/661386a517fac2f81d6b35706dd360b045955c3f/barbican/common/policies/base.py#L53-L55 | 21:26 |
| sean-k-mooney | that has the 'rule:secret_acl_read' rule already | 21:26 |
| melwitt | if I read that right it was allow project admin, project observer, project creator | 21:26 |
| melwitt | so it's true it was not project member | 21:26 |
| sean-k-mooney | right, so creator was the old role they mapped to member | 21:26 |
| sean-k-mooney | and observer i guess was reader | 21:27 |
| sean-k-mooney | so before srbac, to use barbican at all an admin had to assign you the creator role | 21:27 |
| sean-k-mooney | melwitt: i think this is perhaps the relevant spec https://github.com/openstack/barbican-specs/blob/master/specs/kilo/add-creator-only-option.rst | 21:30 |
| sean-k-mooney | well and https://github.com/openstack/barbican-specs/blob/master/specs/kilo/add-per-secret-policy.rst | 21:30 |
| melwitt | ok yeah that is showing that private secrets was a new feature at the time. wasn't even possible originally | 21:31 |
| sean-k-mooney | right, so the second spec is the one that introduced the acl api | 21:32 |
| sean-k-mooney | and the first was specifically using it for providing different options for the creator role | 21:32 |
| melwitt | interesting. good to know | 21:32 |
| sean-k-mooney | i hate to say this but.... this would allow a 4th mode of operation | 21:34 |
| sean-k-mooney | we could have a "delegated" (for lack of a better term) mode where the secret is owned by the end user instead of the nova user, but we add an acl for it when creating it so the nova user is able to read it when needed | 21:35 |
| sean-k-mooney | and do that only for the tpm secrets that we are creating on their behalf | 21:35 |
| sean-k-mooney | that's almost what the deployment mode is, with the only delta being the secret remains owned by the project that owns the vm | 21:35 |
| melwitt | yeah .. that would work I think | 21:36 |
| sean-k-mooney | you could also upgrade to that by backfilling the ACL | 21:36 |
| melwitt | honestly that seems easier because then the service project thing doesn't come up | 21:36 |
| sean-k-mooney | although well, it kind of does, but only in the sense that whatever auth cred we are using to talk to it needs to be one from our config | 21:37 |
| sean-k-mooney | but just looking at master for a second | 21:37 |
| sean-k-mooney | https://opendev.org/openstack/barbican/src/branch/master/barbican/common/policies/secrets.py#L19-L26 | 21:37 |
| sean-k-mooney | rule:secret_project_admin | 21:38 |
| melwitt | yeah but I mean if we do it by giving the nova user access via acl, then it's never open to the service project right | 21:38 |
| sean-k-mooney | i assume that is someone with the admin role that has a project-scoped token for that project, right | 21:38 |
| melwitt | I think that's what that means, yes | 21:38 |
| sean-k-mooney | oh well there may or may not be a shared service project | 21:39 |
| sean-k-mooney | we do that in devstack but in our downstream we don't, i believe | 21:39 |
| melwitt | true, but the "common" deployment is done that way | 21:39 |
| sean-k-mooney | it would allow us to restrict it to just the nova user however | 21:39 |
| sean-k-mooney | if there was a shared service project, so that's still better | 21:39 |
| sean-k-mooney | well, not necessarily "better" than owned by nova in general, just better than "accessible by anyone in the service project" | 21:41 |
| melwitt | right | 21:41 |
| melwitt | it's the open-to-the-service-project that seems a bit too out there | 21:41 |
| sean-k-mooney | right. so looking at barbican base policy they do implement project-scoped admin to some degree | 21:42 |
| sean-k-mooney | but that's really global admin | 21:42 |
| sean-k-mooney | so today if an admin added themselves to an end user's project they should be able to live migrate | 21:43 |
| sean-k-mooney | provided we just don't block it in the api and try to download the secret when needed | 21:43 |
| sean-k-mooney | i.e. with the default "user" policy | 21:43 |
| sean-k-mooney | what "deployment" or "delegated" or whatever we want to call the acl approach gives | 21:44 |
| melwitt | yeah exactly. it is possible for project admin today | 21:44 |
| sean-k-mooney | is the ability to skip adding yourself to the end user's project | 21:44 |
| sean-k-mooney | just to do the live migration | 21:44 |
| sean-k-mooney | so it's still a net win, but the security posture is mostly the same as user or host | 21:45 |
| melwitt | yeah. it feels like it makes more sense too in a way. like more intuitive | 21:45 |
| sean-k-mooney | i think this also means that an admin could also just live migrate any vm with the user policy if they just add the acl to the relevant secret, right? | 21:46 |
| sean-k-mooney | for their user? | 21:46 |
| melwitt | I really don't want to have to migrate 'deployment' secrets though, ugh. | 21:46 |
| sean-k-mooney | well you don't have to move them | 21:47 |
| sean-k-mooney | that's the nice thing about deployment, you just get them from barbican in pre-live-migrate | 21:47 |
| sean-k-mooney | there is no copying, right. | 21:47 |
| melwitt | yes you can with acl also. a bit more code is needed to enable 'user' live migration and I have them proposed | 21:47 |
| sean-k-mooney | by migrate deployment secret did you mean add the acl? | 21:47 |
| sean-k-mooney | or did you mean move them during live migrate? | 21:48 |
| sean-k-mooney | or i guess copy them to a new secret | 21:48 |
| sean-k-mooney | but that would be on resize right | 21:48 |
| melwitt | yeah but I mean if we wanted to do the user + acl for nova user thing, taking the nova user secret and creating a user secret and copying the passphrase over and adding the acl, I dunno. mess | 21:48 |
| sean-k-mooney | oh, you're saying supporting the 4th policy | 21:49 |
| sean-k-mooney | of adding the nova user by acl | 21:49 |
| melwitt | for potentially already existing 'deployment' | 21:49 |
| melwitt | secrets | 21:49 |
| sean-k-mooney | ah well we don't have that yet, do we? | 21:49 |
| sean-k-mooney | i thought that was not merged | 21:49 |
| melwitt | we do have the secret generation but not the live migration | 21:49 |
| sean-k-mooney | ah ok | 21:50 |
| sean-k-mooney | well if we wanted to support it, it feels like a 4th policy | 21:50 |
| sean-k-mooney | but i think as you suggested talking to the barbican folks at the ptg would be a good next step | 21:50 |
| melwitt | yeah.. that would be easier to not have to deal with it. and then let it stay open to the service project? I dunno. just rambling | 21:50 |
| sean-k-mooney | to confirm that the acls are actually working as we think they do | 21:50 |
| sean-k-mooney | melwitt: well if it's a new policy we can recommend that you resize to delegated for better security or something | 21:51 |
| sean-k-mooney | where deployment is "nova-owned secret" and delegated is "user-owned secret with nova acl" | 21:51 |
| melwitt | yeah. that would be much less pain | 21:52 |
| melwitt | ++ | 21:52 |
| sean-k-mooney | in any case i think we should finish the deployment live migration support first | 21:52 |
| melwitt | yeah, I agree | 21:52 |
| sean-k-mooney | then maybe tack this on if folks agree | 21:52 |
| melwitt | the way I have it is I put 'user' ahead of 'deployment' since 'user' support is a subset of 'deployment' | 21:52 |
| sean-k-mooney | ya host was the most invasive one | 21:53 |
| melwitt | much smaller change than 'deployment' bc 'deployment' ultimately needs to be able to do secret conversions for resizes | 21:53 |
| sean-k-mooney | as all the rest didn't need rpc/object changes | 21:53 |
| sean-k-mooney | right ya i see | 21:54 |
| sean-k-mooney | it's not a lot different for the acl approach, as you would still need to add/remove it when you change from user | 21:54 |
| sean-k-mooney | but that's a single api call | 21:54 |
| sean-k-mooney | no need to copy the data and reupload it | 21:55 |
| melwitt | yeah, I don't think it would be much to add | 21:55 |
| sean-k-mooney | all of this falls apart if acls don't work that way, but that seems to be what those specs imply | 21:55 |
| sean-k-mooney | kilo era features with "This ACL documentation is work in progress and may change in near future." | 21:57 |
| sean-k-mooney | makes me cautiously optimistic | 21:57 |
| melwitt | they seem to work how we think, at least in my unexpert testing in devstack. I also used the acl in the 'user' live migration tempest tests I wrote and that works too | 21:57 |
| sean-k-mooney | ok i'm going to call it a night o/ | 22:00 |
| melwitt | seeya o/ | 22:00 |
| opendevreview | sean mooney proposed openstack/nova master: Add regression test for unified limits resource bug https://review.opendev.org/c/openstack/nova/+/975859 | 22:01 |
| opendevreview | sean mooney proposed openstack/nova master: Fix unified limits to include all resource types https://review.opendev.org/c/openstack/nova/+/975872 | 22:01 |
| sean-k-mooney | i don't know if i resolved all the comments on that but i just pushed what i had. i'll take a look at it again tomorrow | 22:03 |
| melwitt | ack | 22:08 |