Monday, 2024-10-21

opendevreviewTakashi Kajinami proposed openstack/os-resource-classes master: Drop unnecessary 'x' bit from doc config file  https://review.opendev.org/c/openstack/os-resource-classes/+/93287811:25
opendevreviewTakashi Kajinami proposed openstack/os-traits master: Drop unnecessary 'x' bit from doc config file  https://review.opendev.org/c/openstack/os-traits/+/93287911:25
opendevreviewTakashi Kajinami proposed openstack/osc-placement master: Drop unnecessary 'x' bit from doc config file  https://review.opendev.org/c/openstack/osc-placement/+/93288111:26
sean-k-mooneyHTTP exception thrown: Flavor has hw:virtio_packed_ring extra spec explicitly set to True, conflicting with image which has hw_virtio_packed_ring explicitly set to true. 12:40
sean-k-mooney... well that's a bug12:40
sean-k-mooneywe are doing a string compare somewhere instead of comparing the boolean values12:40
tkajinamhttps://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L112012:43
sean-k-mooneyhttps://bugs.launchpad.net/nova/+bug/208512412:43
tkajinamthis is used not only for boolean properties but also for string properties. We probably have to add a flag to indicate whether the value should be converted before comparison12:44
sean-k-mooneyperhaps we have other helpers too12:44
sean-k-mooneyi normally expect us to call _get_flavor_image_meta12:44
sean-k-mooneyand then have a separate function per extra spec to check it12:45
sean-k-mooneylike get_mem_encryption_constraint12:45
sean-k-mooneyso we probably used _get_unique_flavor_image_meta to make it simpler but didn't think about this use case12:45
sean-k-mooneyrather than a flag for conversion we probably need to pass an expected type12:45
sean-k-mooneyor a conversion function for the flavor value. the image ones are already typed12:45
tkajinamyeah12:46
tkajinamok the logic for packed_ring is specifically implemented in get_packed_virtqueue_constraint12:46
sean-k-mooneyi.e. pass in bool or oslo.utils.str_to_bool, whatever that is called12:46
sean-k-mooneyya so the issue is probably here https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L198512:47
sean-k-mooneyin that path we are not converting it to the correct type12:47
sean-k-mooneywe convert the flavor to the correct type here https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L1990-L199212:48
sean-k-mooneybut in the dict case the image_value is just a string12:48
sean-k-mooneyso we need to call strutils.bool_from_string here  https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L198512:49
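
A minimal sketch of the mismatch described above, assuming both values reach the comparison as raw strings in this code path (the variable names are illustrative, not nova's actual internals):

```python
# Sketch of bug 2085124: a naive string compare where a boolean compare is meant.
from oslo_utils import strutils

flavor_value = "True"   # flavor extra spec hw:virtio_packed_ring
image_value = "true"    # image property hw_virtio_packed_ring (dict path)

# String comparison: "True" != "true", so a bogus conflict is reported.
print(flavor_value == image_value)  # False

# Normalizing both sides with strutils.bool_from_string gives the intended result.
print(strutils.bool_from_string(flavor_value, strict=True) ==
      strutils.bool_from_string(image_value, strict=True))  # True
```
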
jlejeunehello, I have a question regarding host_mapping: is it "normal" that when we delete a nova-compute service, the 'mapped' field in the nova_cell1.compute_nodes table stays at 1 instead of 0 while the related nova_api.host_mappings entry is removed?13:54
jlejeuneleaving that 'mapped' field at 1 prevents the nova-manage cell_v2 discover_hosts command from creating the missing mapping13:55
gibijlejeune: it sounds like bug #1 in scenario 1 in https://bugs.launchpad.net/nova/+bug/2077070/comments/713:56
gibijlejeune: could you check if adding --by-service to the discover_hosts command helps you or not?13:57
jlejeuneI've already tested it and indeed, it "fixes" the issue by creating the missing mapping13:57
gibiOK. Then I think it is the same issue.13:58
gibiso yeah, nova should set mapped to 0 when the host mapping is deleted.13:58
jlejeunebut not sure I want to do it by service, given the number of computes we have in production13:58
jlejeuneagree13:59
gibiif you have time to propose a fix then I'm happy to review it. If not, then eventually I'll get to it myself and propose the fix.13:59
jlejeuneI will have a look !14:01
gibijlejeune: thanks!14:02
jlejeunegibi: maybe we should split the bug in two, one for pci devices and another one for the mapping issue ?14:27
gibijlejeune: I don't think there is anything else on the pci side, but if you want a separate bug to fix the mapping issue then sure, go ahead and file one. I agree that not mentioning pci devices in the separate bug will help the reviewers not get bogged down on the pci stuff.14:31
jlejeuneyes exactly :)14:34
gibiping me with the separate bug and I will quickly do the triage of it14:35
jlejeuneok14:37
jlejeunegibi: https://bugs.launchpad.net/nova/+bug/208513514:55
gibijlejeune: thanks15:09
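
A hypothetical sketch of the fix gibi describes above: when the host mapping is deleted, also reset the `mapped` flag on the matching compute nodes so discover_hosts can find them again. The integration point and helper names follow nova's object layer, but this is an assumption, not the final patch; a real fix would also need a context targeted at the right cell since compute_nodes lives in the cell DB:

```python
# Hypothetical sketch only: reset compute_nodes.mapped when the host
# mapping goes away, so "nova-manage cell_v2 discover_hosts" (without
# --by-service) can recreate the mapping later.
from nova import objects

def delete_host_mapping(context, host_mapping):
    host_mapping.destroy()
    # Without this, the mapped column stays at 1 in the cell DB and the
    # unmapped-computes query used by discover_hosts skips the host.
    # Assumes 'context' is already targeted at the host's cell.
    for node in objects.ComputeNodeList.get_all_by_host(
            context, host_mapping.host):
        node.mapped = 0
        node.save()
```
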
johnthetubaguydansmith: are you thinking threads for each task being better for what nova-compute is generally doing? I hadn’t really thought about that. I could see that limiting the code changes, I suppose?15:44
dansmithjohnthetubaguy: I want us to at least consider that yeah15:44
bauzasjohnthetubaguy: that was something we were considering from the last PTG15:44
dansmithjohnthetubaguy: I suspect we have some places where we think we're threadsafe, but aren't15:45
dansmithso we'll need to fix that up, which we should have fixed anyway15:45
dansmithbut I think much of nova (potentially conductor as well) could just move straight to real threads, ops can control how much parallelism they want (which they've asked for before) to scale performance, and then we can move on15:46
johnthetubaguyYeah, I had not thought of it like that. It's a good point.15:47
clarkbjust as a point of info zuul runs largely with real threads (I think the websocket lib we use may involve asyncio but I can't remember)15:49
JayF^ is what dtantsur has been pitching for IPA, although we've written no code in that direction15:49
clarkbit works reasonably well and in theory will work better when we can drop the gil and sort out all of the places we missed extra locking (but we already coordinate multiple processes with zookeeper so don't really need that scaling approach with dropping gil)15:49
dtantsurYep. Not all of our projects need to handle extremely high concurrency.15:50
dtantsurEspecially Ironic, which has "I'm busy, come back later" as an inherent part of its API :)15:50
johnthetubaguyHonestly, most of the bugs I find are where we attempt high concurrency badly; hard blocking on some of the API calls and tasks would totally make my operator self happier :)15:51
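
A minimal sketch of the "real threads" direction dansmith describes above, assuming an operator-tunable pool size (the config option name is made up for illustration):

```python
# Sketch only: a real OS thread pool with operator-controlled parallelism
# in place of an eventlet greenthread pool.
import concurrent.futures

MAX_WORKERS = 8  # would come from e.g. a CONF.compute option in practice

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS)

def spawn(task, *args, **kwargs):
    # Up to MAX_WORKERS tasks run concurrently on real threads; any shared
    # state they touch needs real locks, hence the thread-safety audit
    # mentioned above.
    return _executor.submit(task, *args, **kwargs)
```
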
opendevreviewArtom Lifshitz proposed openstack/nova master: libvirt pwr mgmt: reproduce bug 2085122  https://review.opendev.org/c/openstack/nova/+/93292215:51
sean-k-mooneygibi: I'm not actually sold on using asyncio for the timeout stuff by the way16:05
sean-k-mooneygibi: it was just going to be my next poc: wrap the existing function in awaitlet and dispatch to asyncio on a separate thread16:05
sean-k-mooneyspecifically for the nova-api scatter-gather use case16:05
dansmithI'm quite sure that doing that will not be a magic bullet unless we're actually running the asyncio event loop16:06
sean-k-mooneyright, that would be the idea: run the asyncio event loop in a separate thread and dispatch the async task into it16:06
sean-k-mooneyagain this is just one of the many alternatives I was considering16:06
sean-k-mooneyI'm not convinced it is a good approach but it's what was suggested in the wiki16:07
dansmithyeah, but unless we're the parent, I think that the thing that spawned us may receive the SIGALRM and not deliver it to our event loop, meaning it doesn't really work like we think if it's blocked in a C call16:07
dansmithanyway, we should dive into a poc and confirm16:08
sean-k-mooneyhttps://wiki.openstack.org/wiki/Eventlet-removal#3._Timeout_and_Deadline_Management_216:08
gibisean-k-mooney: for me more moving pieces sounds worse, e.g. awaitlet + asyncio + a worker thread running the eventloop is just too much complexity.16:09
sean-k-mooneyi did a poc of this before https://review.opendev.org/c/openstack/nova/+/904425/10/nova/context.py#42216:09
sean-k-mooneywith a different approach16:09
sean-k-mooneygibi: ack which is valid16:09
sean-k-mooneygibi: I don't think the event approach you suggest really works as we can't assume we can generically check the event within the function we are dispatching16:10
dansmithgibi: ++16:10
sean-k-mooneygibi: sqlalchemy does support a statement_timeout16:10
sean-k-mooneywhich I think is going to be the answer for db queries16:10
sean-k-mooneyI think this is currently set per session or connection16:10
gibisean-k-mooney: my point is that we need to make assumptions about the functions we run as a task, or we need to be prepared to abandon them from the gather thread16:11
dansmithsean-k-mooney: "To handle signals the event loop must be run in the main thread."16:11
sean-k-mooneydansmith: sure but we can't use signals for this16:11
dansmithI think that unless you're the main thread, you can't delegate things to executors and expect things like timeouts to preempt16:11
sean-k-mooneywhen run under mod_wsgi some of those are unconditionally handled by apache16:11
sean-k-mooneyso I don't think we can rely on SIGALRM in general16:12
dansmithright, but I think that's how asyncio handles preemptable timeouts right?16:12
sean-k-mooneyi do not think so16:12
sean-k-mooneyasyncio also works on windows where it can't16:12
dansmithsure, but it doesn't have to implement the primitives in the same way16:13
johnthetubaguyWhat is the argument against brute forcing it with extra threads (for the scatter/gather)? Apache just says no? Or just worried about the system resources/latency of the extra thread?16:13
sean-k-mooneyjohnthetubaguy: it's not really the scatter-gather that's the concern16:13
dansmithI've just run into some weirdness when trying to run asyncio compatible stuff from deep inside "regular" code and it seems like it was related to some of the caveats around the event loop being in the main thread16:13
sean-k-mooneywe could remove that but we currently use eventlet to time out queries that don't return16:14
gibijohnthetubaguy: threads are not cancellable so we cannot easily port the today's timeout handling16:14
dansmithright ^16:14
sean-k-mooneyso I think the real fix for that is the sqlalchemy statement timeout16:14
johnthetubaguyAh… got you.16:14
dansmithsean-k-mooney: right16:14
sean-k-mooneyprocesses can be killed but that's not reasonable to do in the api16:14
gibisean-k-mooney: sure, if all the blocking calls can be made non-blocking with a timeout then we can just push down our timeout handling to the bottom of the call stack16:15
sean-k-mooneyok so I think I'll pause looking at other "timeout methods" that work for arbitrary functions16:15
sean-k-mooneyand instead look at how we can pass timeouts to existing libs directly, i.e. sqlalchemy16:15
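
For reference, a sketch of what pushing the timeout into the database layer could look like. The exact mechanism is backend specific; the connection URLs here are placeholders, not nova's actual configuration:

```python
# Sketch only: backend-native statement timeouts instead of eventlet.Timeout.
from sqlalchemy import create_engine, text

# PostgreSQL: cap every statement on this engine at 5000 ms via libpq options.
pg_engine = create_engine(
    "postgresql+psycopg2://nova:secret@dbhost/nova",
    connect_args={"options": "-c statement_timeout=5000"},
)

# MySQL (5.7+): cap a single SELECT with an optimizer hint instead.
mysql_engine = create_engine("mysql+pymysql://nova:secret@dbhost/nova")
with mysql_engine.connect() as conn:
    rows = conn.execute(
        text("SELECT /*+ MAX_EXECUTION_TIME(5000) */ uuid FROM instances"))
```
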
gibiyeah, I think we should not chase the "timeout for arbitrary function" direction with native threads, but accept that we need a per-task-type solution16:16
dansmithagree16:16
sean-k-mooneythat is fine with me.16:17
sean-k-mooneyI was hoping to limit the scope of the refactors but let's explore that instead then16:17
sean-k-mooneyto be clear, changing from eventlet.Timeout to native timeouts in sqlalchemy should work with or without eventlet monkeypatching16:18
dansmithyeah16:18
sean-k-mooneydansmith: gibi: would you consider this in or out of scope for this cycle? perhaps we should discuss it in a ptg session16:18
sean-k-mooneyI'm not sure if that is high or low risk16:19
gibisean-k-mooney: do we have a timeout today both on the gather greenthread and on the task greenthreads doing the work?16:19
dansmithI think we could get an eventlet.Timeout or a sqla.timeout at any point so we should already be handling those16:19
gibisean-k-mooney: I consider the work in scope for E, landing it for E might not happen but if I want to learn the devils in the details I need to work on an example, and this is a good example.16:20
sean-k-mooneyack16:20
dansmithalso, AFAIK eventlet will let the timeout happen to unblock the waiter, but the worker thread actually blocked remains there to be GC'd until it frees up,16:20
dansmithso using real timeouts should be better anyway to unblock and return threads to the pool quicker16:20
sean-k-mooneywhen the timeout fires I believe eventlet calls greenthread.throw on the worker greenthread16:21
dansmithright but not in the actual worker thread which is blocked on a C call16:21
sean-k-mooneybut to dan's point it may very well block forever or until the os kills the tcp socket16:21
dansmiththere's no way to generically interrupt the latter16:21
dansmiththe greenthread.throw is only handled as an exception when the blocking call returns16:22
sean-k-mooneyright16:22
sean-k-mooneyso until the os returns an interrupt that resumes the worker thread it won't die16:22
dansmithright16:23
dansmiththat's why moving to real library timeouts will benefit both the real threads and green threads approaches16:23
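
A small illustration of this point, with nothing nova-specific assumed: a timeout on the waiter side returns control to the caller, but the worker thread stays blocked until the underlying call returns.

```python
# Demonstrates why waiter-side timeouts alone don't free worker threads.
import concurrent.futures
import time

def blocking_call():
    time.sleep(60)  # stand-in for a call blocked in C (hung DB, ceph, ...)
    return "done"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(blocking_call)
try:
    future.result(timeout=1)  # only the *waiter* gets a TimeoutError
except concurrent.futures.TimeoutError:
    print("gave up waiting; worker thread is still blocked")
# The pool slot stays occupied (and interpreter exit is delayed) until
# blocking_call() actually returns, which is why pushing the timeout into
# the library call itself is the better fix.
```
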
sean-k-mooneyack, for now the synchronous ceph lib does not support that, but I think their async (not asyncio) functions that return futures can; still need to confirm that16:25
dansmithack16:29
sean-k-mooneythey do support timeouts on the connect method and the "cli command" wrappers16:34
sean-k-mooneyhttps://docs.ceph.com/en/reef/rados/api/python/#rados.Rados.connect https://docs.ceph.com/en/reef/rados/api/python/#rados.Rados.mon_command16:34
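
Per the linked docs, both calls accept a timeout in seconds; a short sketch, with the cluster config path as a placeholder:

```python
# Sketch based on the linked Ceph docs: Rados.connect() and mon_command()
# both take a timeout in seconds.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect(timeout=5)  # fail fast if the monitors are unreachable

cmd = json.dumps({"prefix": "df", "format": "json"})
ret, outbuf, outs = cluster.mon_command(cmd, b"", timeout=5)
```
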
bauzasgouthamr: hey, what exactly do you want to discuss about virtiofs during nova sessions ? https://etherpad.opendev.org/p/nova-2025.1-ptg#L23116:37
opendevreviewArtom Lifshitz proposed openstack/nova master: libvirt pwr mgmt: reproduce bug 2085122  https://review.opendev.org/c/openstack/nova/+/93292216:37
opendevreviewArtom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init  https://review.opendev.org/c/openstack/nova/+/93292616:37
gouthamrbauzas: i'd like to know the current status of the development / focus of the team going into Epoxy16:39
bauzasokay, noted16:39
opendevreviewArtom Lifshitz proposed openstack/nova master: libvirt pwr mgmt: reproduce bug 2085122  https://review.opendev.org/c/openstack/nova/+/93292216:39
opendevreviewArtom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init  https://review.opendev.org/c/openstack/nova/+/93292616:39
bauzasI'll provide a raw agenda tomorrow morning European time16:39
johnthetubaguybauzas: I believe it's MIG vGPU not working with volume detach any more, Caracal onwards? Came up in the scientific SIG16:56
bauzasjohnthetubaguy: I may have said that MIG isn't officially supported in Nova16:56
bauzasbecause that's nvidia specific and can't be documented upstream16:57
johnthetubaguybauzas: could you link me to where we say that? It's in the PCI docs I guess?16:58
johnthetubaguy(It might be vGPU not MIG directly, but I could well be confused)16:59
sean-k-mooneythe gap is the write/read commands to add data to the volume17:01
sean-k-mooneycontext of ^ is our current ability to timeout ceph commands17:01
sean-k-mooneyjohnthetubaguy: I see, I thought we fixed that?17:02
johnthetubaguy@bauzas https://review.opendev.org/c/openstack/nova/+/925037 Maybe it got fixed another way?17:02
sean-k-mooneyjohnthetubaguy: there was a breakage caused by the alias feature I think, I don't recall the full details, I would have to find the review17:02
bauzasjohnthetubaguy: we don't exactly mention that nvidia bits are forbidden, but the vgpu upstream docs somehow show it https://docs.openstack.org/nova/latest/admin/virtual-gpu.html17:03
bauzasoh ok, now I get the picture17:04
sean-k-mooneyjohnthetubaguy: you don't happen to have the review do you? I'll see if I can find it17:04
sean-k-mooneyjohnthetubaguy: if I recall correctly we asked them to reproduce the failure in a functional test and then demonstrate their patch fixed it17:05
sean-k-mooneybut I don't think we asked for third party ci17:05
sean-k-mooneyjohnthetubaguy: was it related to https://review.opendev.org/c/openstack/nova/+/925037 https://bugs.launchpad.net/nova/+bug/194234517:06
bauzasguys, I need to drop for my sanity17:06
bauzaswe'll have long days of nova sessions so today is a hard stop17:07
sean-k-mooneywe fixed that with https://review.opendev.org/c/openstack/nova/+/80694317:07
fungii also linked it in the dalmatian retrospective section of the ptg pad since bauzas asked17:08
sean-k-mooneyack17:09
fungiand yes, the proposer seemed to think he was being told 1. we don't support your use case, and 2. if you want this merged you need to set up your own ci to provide testing for it17:10
johnthetubaguyTo be clear, my motive is I know people using vGPU on Antelope, and we are about to SLURP them to Caracal, but no one can afford GPUs in their staging environments these days. So people raising the flag and saying it’s all broken make me worry, and want to know how I can help fix things. A chat in the retrospective sounds like a good idea, regardless. 17:10
fungiwhich i don't get when i read the review comments, so they were reading a lot more into it for unknown reasons17:10
johnthetubaguyAh… I had missed that crucial detail, probably a misunderstanding of “add a functional test” talking about a different tox target.17:14
fungiyeah, as for their unpleasant experience, some people come into this expecting to have a bad time and then read between the lines in whatever way is needed to reinforce their expectation, so aside from being more clear in communicating (maybe? it was pretty clear to me already), i'm not sure what else might have helped the situation17:20
sean-k-mooneyjohnthetubaguy: yep tox not third party ci17:30
sean-k-mooneythis is a valid bug to fix, but the fact we had two almost identical issues suggests our existing test coverage is lacking17:31
sean-k-mooneythe original fix did not fix the underlying problem, just the symptom17:31
sean-k-mooneybut that's a separate matter.17:31
sean-k-mooneyjohnthetubaguy: for what it's worth we are trying to get mdev testing into first and third party ci currently17:32
sean-k-mooneyjohnthetubaguy: for the first party ci we have a poc using the mtty sample module which we hope to merge sooner rather than later into nova-next17:33
sean-k-mooneyfor third party ci we are hoping to eventually have a limited capability to trigger a third party job using redhat resources17:33
sean-k-mooneyi.e. against changes to master before they merge17:33
sean-k-mooneyit's a lot of work but something we are trying to line up. the big problem as you noted is hardware availability but that is not the only one17:34
sean-k-mooneywe hope the mtty support will give a baseline of coverage for things like this17:34
sean-k-mooneyjohnthetubaguy: https://review.opendev.org/q/topic:%22mtty_support%22 17:35
sean-k-mooneywe need https://review.opendev.org/c/openstack/nova/+/898100/4 first to allow using virtual devices17:36
opendevreviewArtom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init  https://review.opendev.org/c/openstack/nova/+/93292617:42
sean-k-mooneyfungi: the original confusion around "we don't support your use case" came from the fact they fixed the bug by adding the infra required to be able to detach an mdev via the alias, without stating that they were actually trying to detach a volume and this just so happened to fail when the vm also had an mdev17:42
sean-k-mooneythey had a limited explanation of that in the bug but not in the commit message17:43
fungisean-k-mooney: thanks for the clarification, makes sense18:15
fungialso worth noting, there was a chameleon cloud representative at oid-na who was talking about the possibility of making vgpu resources available for upstream testing. not sure where that will go, but something to consider18:16
sean-k-mooneythe issue is we can't do that with nested virt18:16
sean-k-mooneyi.e. we can only test it properly with baremetal compute nodes18:17
sean-k-mooneyif you passthrough a full gpu to a vm you can't then use vgpus18:17
JayFdarn, too bad we don't know how to provision baremetal /s 18:17
JayFseriously though, having to provision baremetal is very tough in CI, it's part of why Ironic 3rd party CIs died out18:18
JayFjust a little high touch18:18
fungiwell, yeah they were talking about nova baremetal api to boot machines with vgpu-capable resources18:18
fungi(i.e. ironic)18:18
fungibut i agree with you about the whole safety challenge with providing root access to physical machines for untrusted job payloads18:19
JayFit's not just a safety challenge, it's a challenge-challenge18:19
JayFmany real-world failure modes which need human hands, or at least non-automated intervention, to fix18:19
sean-k-mooneythe tldr is when you passthrough a full pci device in qemu it removes all the sriov capabilities because qemu did not have a way to make that work18:20
sean-k-mooneyfor what it's worth we are looking to see if we can use our internal hardware in the future to test specific upstream changes18:20
sean-k-mooneythat would do a baremetal deployment of a nova compute (using ironic) with a manual trigger to basically say "this is safe to test"18:21
sean-k-mooneyas a third party ci18:21
JayFI think there's a lot of untapped value in that idea of on-demand testing18:22
JayFwe can *usually* pinpoint when we're getting close to an edge case that might need revalidation18:23
sean-k-mooneybut that is why we are also trying to figure out how to test without real hardware using the sample kernel drivers for a baseline of coverage18:23
sean-k-mooneyfor example we are enabling the igb vif model in nova so that in the future we can use that to test sriov18:24
sean-k-mooneyif our providers can provide neutron ports with igb vif model in the future we can fully test sriov in the 1st party ci18:24
sean-k-mooneyunfortunately it needs a new enough nova/qemu but by 2026.1 we might be in a position to do that18:25
opendevreviewArtom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init  https://review.opendev.org/c/openstack/nova/+/93292619:04
opendevreviewsean mooney proposed openstack/nova master: Add repoducer test for bug 2074219  https://review.opendev.org/c/openstack/nova/+/93293019:54
opendevreviewsean mooney proposed openstack/nova master: Fix detaching devices by alias with mdevs  https://review.opendev.org/c/openstack/nova/+/93293119:54
sean-k-mooneyJayF: johnthetubaguy: fungi: ^ that is the actual correct fix and what I was asking for in that review19:55
fungiawesome!20:13
