Monday, 2024-10-21

opendevreviewTakashi Kajinami proposed openstack/os-resource-classes master: Drop unnecessary 'x' bit from doc config file  https://review.opendev.org/c/openstack/os-resource-classes/+/93287811:25
opendevreviewTakashi Kajinami proposed openstack/os-traits master: Drop unnecessary 'x' bit from doc config file  https://review.opendev.org/c/openstack/os-traits/+/93287911:25
opendevreviewTakashi Kajinami proposed openstack/osc-placement master: Drop unnecessary 'x' bit from doc config file  https://review.opendev.org/c/openstack/osc-placement/+/93288111:26
sean-k-mooneyHTTP exception thrown: Flavor has hw:virtio_packed_ring extra spec explicitly set to True, conflicting with image which has hw_virtio_packed_ring explicitly set to true. 12:40
sean-k-mooney... well that's a bug12:40
sean-k-mooneywe are doing a string compare somewhere instead of comparing the boolean values12:40
tkajinamhttps://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L112012:43
sean-k-mooneyhttps://bugs.launchpad.net/nova/+bug/208512412:43
tkajinamthis is used not only for boolean properties but also for string properties. We probably have to add a flag to indicate whether the value should be converted before comparison12:44
sean-k-mooneyperhaps we have other helpers too12:44
sean-k-mooneyi normally expect us to call _get_flavor_image_meta12:44
sean-k-mooneyand then have a separate function per extra spec to check it12:45
sean-k-mooneylike get_mem_encryption_constraint12:45
sean-k-mooneyso we probably used _get_unique_flavor_image_meta to make it simpler but didn't think about this use case12:45
sean-k-mooneyrather than a flag for conversion we probably need to pass an expected type12:45
sean-k-mooneyor a conversion function for the flavor value. the image ones are already typed12:45
tkajinamyeah12:46
tkajinamok the logic for packed_ring is specifically implemented in get_packed_virtqueue_constraint12:46
sean-k-mooneyi.e. pass in bool or oslo.utils.str_to_bool, whatever that is called12:46
sean-k-mooneyya so the issue is probably here https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L198512:47
sean-k-mooneyin that path we are not converting it to the correct type12:47
sean-k-mooneywe convert the flavor to the correct type here https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L1990-L199212:48
sean-k-mooneybut in the dict case the image_value is just a string12:48
sean-k-mooneyso we need to call strutils.bool_from_string here  https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L198512:49
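
A minimal sketch of the mismatch described above, assuming both values reach the comparison as raw strings in this code path (the variable names are illustrative, not nova's actual internals):

```python
# Sketch of bug 2085124: a naive string compare where a boolean compare is meant.
from oslo_utils import strutils

flavor_value = "True"   # flavor extra spec hw:virtio_packed_ring
image_value = "true"    # image property hw_virtio_packed_ring (dict path)

# String comparison: "True" != "true", so a bogus conflict is reported.
print(flavor_value == image_value)  # False

# Normalizing both sides with strutils.bool_from_string gives the intended result.
print(strutils.bool_from_string(flavor_value, strict=True) ==
      strutils.bool_from_string(image_value, strict=True))  # True
```
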
jlejeunehello, I have a question regarding host_mapping: is it "normal" that when we delete a nova-compute service, the 'mapped' field in the nova_cell1.compute_nodes table stays at 1 instead of 0 while the related nova_api.host_mappings entry is removed?13:54
jlejeuneleaving that 'mapped' field at 1 prevents the nova-manage cell_v2 discover_hosts command from creating the missing mapping13:55
gibijlejeune: it sounds like bug #1 in scenario 1 in https://bugs.launchpad.net/nova/+bug/2077070/comments/713:56
gibijlejeune: could you check if adding --by-service to the discover_hosts command helps you or not?13:57
jlejeuneI've already tested it and indeed, it "fixes" the issue by creating the missing mapping13:57
gibiOK. Then I think it is the same issue.13:58
gibiso yeah, nova should set mapped to 0 when the host mapping is deleted.13:58
jlejeunebut not sure I want to do it by service, given the number of computes we have in production13:58
jlejeuneagree13:59
gibiif you have time to propose a fix then I'm happy to review it. If not, then eventually I'll get to it myself and propose the fix.13:59
jlejeuneI will have a look !14:01
gibijlejeune: thanks!14:02
jlejeunegibi: maybe we should split the bug in two, one for pci devices and another one for the mapping issue ?14:27
gibijlejeune: I don't think there is anything else on the pci side, but if you want a separate bug to fix the mapping issue then sure, go ahead and file one. I agree that not mentioning pci devices in the separate bug will help the reviewers not get bogged down on the pci stuff.14:31
jlejeuneyes exactly :)14:34
gibiping me with the separate bug and I will quickly do the triage of it14:35
jlejeuneok14:37
jlejeunegibi: https://bugs.launchpad.net/nova/+bug/208513514:55
gibijlejeune: thanks15:09
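
A hypothetical sketch of the fix gibi describes above: when the host mapping is deleted, also reset the `mapped` flag on the matching compute nodes so discover_hosts can find them again. The integration point and helper names follow nova's object layer, but this is an assumption, not the final patch; a real fix would also need a context targeted at the right cell since compute_nodes lives in the cell DB:

```python
# Hypothetical sketch only: reset compute_nodes.mapped when the host
# mapping goes away, so "nova-manage cell_v2 discover_hosts" (without
# --by-service) can recreate the mapping later.
from nova import objects

def delete_host_mapping(context, host_mapping):
    host_mapping.destroy()
    # Without this, the mapped column stays at 1 in the cell DB and the
    # unmapped-computes query used by discover_hosts skips the host.
    # Assumes 'context' is already targeted at the host's cell.
    for node in objects.ComputeNodeList.get_all_by_host(
            context, host_mapping.host):
        node.mapped = 0
        node.save()
```
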
johnthetubaguydansmith: are you thinking threads for each task being better for what nova-compute is generally doing? I hadn’t really thought about that. I could see that limiting the code changes, I suppose?15:44
dansmithjohnthetubaguy: I want us to at least consider that yeah15:44
bauzasjohnthetubaguy: that was something we were considering from the last PTG15:44
dansmithjohnthetubaguy: I suspect we have some places where we think we're threadsafe, but aren't15:45
dansmithso we'll need to fix that up, which we should have fixed anyway15:45
dansmithbut I think much of nova (potentially conductor as well) could just move straight to real threads, ops can control how much parallelism they want (which they've asked for before) to scale performance, and then we can move on15:46
johnthetubaguyYeah, I had not thought of it like that. It's a good point.15:47
clarkbjust as a point of info zuul runs largely with real threads (I think the websocket lib we use may involve asyncio but I can't remember)15:49
JayF^ is what dtantsur has been pitching for IPA, although we've written no code in that direction15:49
clarkbit works reasonably well and in theory will work better when we can drop the gil and sort out all of the places we missed extra locking (but we already coordinate multiple processes with zookeeper so don't really need that scaling approach with dropping gil)15:49
dtantsurYep. Not all of our projects need to handle extremely high concurrency.15:50
dtantsurEspecially Ironic, which has "I'm busy, come back later" as an inherent part of its API :)15:50
johnthetubaguyHonestly, most of the bugs I find are where we attempt high concurrency badly; hard blocking on some of the API calls and tasks would totally make my operator self happier :)15:51
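
A minimal sketch of the "real threads" direction dansmith describes above, assuming an operator-tunable pool size (the config option name is made up for illustration):

```python
# Sketch only: a real OS thread pool with operator-controlled parallelism
# in place of an eventlet greenthread pool.
import concurrent.futures

MAX_WORKERS = 8  # would come from e.g. a CONF.compute option in practice

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS)

def spawn(task, *args, **kwargs):
    # Up to MAX_WORKERS tasks run concurrently on real threads; any shared
    # state they touch needs real locks, hence the thread-safety audit
    # mentioned above.
    return _executor.submit(task, *args, **kwargs)
```
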
opendevreviewArtom Lifshitz proposed openstack/nova master: libvirt pwr mgmt: reproduce bug 2085122  https://review.opendev.org/c/openstack/nova/+/93292215:51
sean-k-mooneygibi: I'm not actually sold on using asyncio for the timeout stuff by the way16:05
sean-k-mooneygibi: it was just going to be my next poc: wrap the existing function in awaitlet and dispatch to asyncio on a separate thread16:05
sean-k-mooneyspecifically for the nova-api scatter-gather use case16:05
dansmithI'm quite sure that doing that will not be a magic bullet unless we're actually running the asyncio event loop16:06
sean-k-mooneyright, that would be the idea: run the asyncio event loop in a separate thread and dispatch the async task into it16:06
sean-k-mooneyagain this is just one of the many alternatives I was considering16:06
sean-k-mooneyI'm not convinced it is a good approach but it's what was suggested in the wiki16:07
dansmithyeah, but unless we're the parent, I think that the thing that spawned us may receive the SIGALRM and not deliver it to our event loop, meaning it doesn't really work like we think if it's blocked in a C call16:07
dansmithanyway, we should dive into a poc and confirm16:08
sean-k-mooneyhttps://wiki.openstack.org/wiki/Eventlet-removal#3._Timeout_and_Deadline_Management_216:08
gibisean-k-mooney: for me more moving pieces sounds worse, e.g. awaitlet + asyncio + a worker thread running the eventloop is just too much complexity.16:09
sean-k-mooneyi did a poc of this before https://review.opendev.org/c/openstack/nova/+/904425/10/nova/context.py#42216:09
sean-k-mooneywith a different approach16:09
sean-k-mooneygibi: ack which is valid16:09
sean-k-mooneygibi: I don't think the event approach you suggest really works as we can't assume we can generically check the event within the function we are dispatching16:10
dansmithgibi: ++16:10
sean-k-mooneygibi: sqlalchemy does support a statement_timeout16:10
sean-k-mooneywhich I think is going to be the answer for db queries16:10
sean-k-mooneyI think this is currently set per session or connection16:10
gibisean-k-mooney: my point is that we need to make assumptions about the functions we run as a task, or we need to be prepared to abandon them from the gather thread16:11
dansmithsean-k-mooney: "To handle signals the event loop must be run in the main thread."16:11
sean-k-mooneydansmith: sure but we can't use signals for this16:11
dansmithI think that unless you're the main thread, you can't delegate things to executors and expect things like timeouts to preempt16:11
sean-k-mooneywhen run under mod_wsgi some of those are unconditionally handled by apache16:11
sean-k-mooneyso I don't think we can rely on SIGALRM in general16:12
dansmithright, but I think that's how asyncio handles preemptable timeouts right?16:12
sean-k-mooneyi do not think so16:12
sean-k-mooneyasyncio also works on windows where it can't16:12
dansmithsure, but it doesn't have to implement the primitives in the same way16:13
johnthetubaguyWhat is the argument against brute forcing it with extra threads (for the scatter/gather)? Apache just says no? Or just worried about the system resources/latency of the extra thread?16:13
sean-k-mooneyjohnthetubaguy: it's not really the scatter-gather that's the concern16:13
dansmithI've just run into some weirdness when trying to run asyncio compatible stuff from deep inside "regular" code and it seems like it was related to some of the caveats around the event loop being in the main thread16:13
sean-k-mooneywe could remove that but we currently use eventlet to time out queries that don't return16:14
gibijohnthetubaguy: threads are not cancellable so we cannot easily port the today's timeout handling16:14
dansmithright ^16:14
sean-k-mooneyso I think the real fix for that is the sqlalchemy statement timeout16:14
johnthetubaguyAh… got you.16:14
dansmithsean-k-mooney: right16:14
sean-k-mooneyprocesses can be killed but that's not reasonable to do in the api16:14
gibisean-k-mooney: sure, if all the blocking calls can be made non-blocking with a timeout then we can just push down our timeout handling to the bottom of the call stack16:15
sean-k-mooneyok so I think I'll pause looking at other "timeout methods" that work for arbitrary functions16:15
sean-k-mooneyand instead look at how we can pass timeouts to existing libs directly, i.e. sqlalchemy16:15
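
For reference, a sketch of what pushing the timeout into the database layer could look like. The exact mechanism is backend specific; the connection URLs here are placeholders, not nova's actual configuration:

```python
# Sketch only: backend-native statement timeouts instead of eventlet.Timeout.
from sqlalchemy import create_engine, text

# PostgreSQL: cap every statement on this engine at 5000 ms via libpq options.
pg_engine = create_engine(
    "postgresql+psycopg2://nova:secret@dbhost/nova",
    connect_args={"options": "-c statement_timeout=5000"},
)

# MySQL (5.7+): cap a single SELECT with an optimizer hint instead.
mysql_engine = create_engine("mysql+pymysql://nova:secret@dbhost/nova")
with mysql_engine.connect() as conn:
    rows = conn.execute(
        text("SELECT /*+ MAX_EXECUTION_TIME(5000) */ uuid FROM instances"))
```
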
gibiyeah, I think we should not chase the "timeout for arbitrary function" direction with native threads, but accept that we need a per-task-type solution16:16
dansmithagree16:16
sean-k-mooneythat is fine with me.16:17
sean-k-mooneyI was hoping to limit the scope of the refactors but let's explore that instead then16:17
sean-k-mooneyto be clear, changing from eventlet.Timeout to native timeouts in sqlalchemy should work with or without eventlet monkeypatching16:18
dansmithyeah16:18
sean-k-mooneydansmith: gibi: would you consider this in or out of scope for this cycle? perhaps we should discuss it in a ptg session16:18
sean-k-mooneyI'm not sure if that is high or low risk16:19
gibisean-k-mooney: do we have a timeout today both on the gather greenthread and on the task greenthreads doing the work?16:19
dansmithI think we could get an eventlet.Timeout or a sqla.timeout at any point so we should already be handling those16:19
gibisean-k-mooney: I consider the work in scope for E, landing it for E might not happen but if I want to learn the devils in the details I need to work on an example, and this is a good example.16:20
sean-k-mooneyack16:20
dansmithalso, AFAIK eventlet will let the timeout happen to unblock the waiter, but the worker thread actually blocked remains there to be GC'd until it frees up,16:20
dansmithso using real timeouts should be better anyway to unblock and return threads to the pool quicker16:20
sean-k-mooneywhen the timeout fires I believe eventlet calls greenthread.throw on the worker greenthread16:21
dansmithright but not in the actual worker thread which is blocked on a C call16:21
sean-k-mooneybut to dan's point it may very well block forever or until the os kills the tcp socket16:21
dansmiththere's no way to generically interrupt the latter16:21
dansmiththe greenthread.throw is only handled as an exception when the blocking call returns16:22
sean-k-mooneyright16:22
sean-k-mooneyso until the os returns an interrupt that resumes the worker thread it won't die16:22
dansmithright16:23
dansmiththat's why moving to real library timeouts will benefit both the real threads and green threads approaches16:23
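
A small illustration of this point, with nothing nova-specific assumed: a timeout on the waiter side returns control to the caller, but the worker thread stays blocked until the underlying call returns.

```python
# Demonstrates why waiter-side timeouts alone don't free worker threads.
import concurrent.futures
import time

def blocking_call():
    time.sleep(60)  # stand-in for a call blocked in C (hung DB, ceph, ...)
    return "done"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(blocking_call)
try:
    future.result(timeout=1)  # only the *waiter* gets a TimeoutError
except concurrent.futures.TimeoutError:
    print("gave up waiting; worker thread is still blocked")
# The pool slot stays occupied (and interpreter exit is delayed) until
# blocking_call() actually returns, which is why pushing the timeout into
# the library call itself is the better fix.
```
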
sean-k-mooneyack, for now the synchronous ceph lib does not support that, but I think their async (not asyncio) functions that return futures can; still need to confirm that16:25
dansmithack16:29
sean-k-mooneythey do support timeouts on the connect method and the "cli command" wrappers16:34
sean-k-mooneyhttps://docs.ceph.com/en/reef/rados/api/python/#rados.Rados.connect https://docs.ceph.com/en/reef/rados/api/python/#rados.Rados.mon_command16:34
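
Per the linked docs, both calls accept a timeout in seconds; a short sketch, with the cluster config path as a placeholder:

```python
# Sketch based on the linked Ceph docs: Rados.connect() and mon_command()
# both take a timeout in seconds.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect(timeout=5)  # fail fast if the monitors are unreachable

cmd = json.dumps({"prefix": "df", "format": "json"})
ret, outbuf, outs = cluster.mon_command(cmd, b"", timeout=5)
```
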
bauzasgouthamr: hey, what exactly do you want to discuss about virtiofs during nova sessions ? https://etherpad.opendev.org/p/nova-2025.1-ptg#L23116:37
opendevreviewArtom Lifshitz proposed openstack/nova master: libvirt pwr mgmt: reproduce bug 2085122  https://review.opendev.org/c/openstack/nova/+/93292216:37
opendevreviewArtom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init  https://review.opendev.org/c/openstack/nova/+/93292616:37
gouthamrbauzas: i'd like to know the current status of the development / focus of the team going into Epoxy16:39
bauzasokay, noted16:39
opendevreviewArtom Lifshitz proposed openstack/nova master: libvirt pwr mgmt: reproduce bug 2085122  https://review.opendev.org/c/openstack/nova/+/93292216:39
opendevreviewArtom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init  https://review.opendev.org/c/openstack/nova/+/93292616:39
bauzasI'll provide a raw agenda tomorrow morning European time16:39
johnthetubaguybauzas: I believe it's MIG vGPU not working with volume detach any more, Caracal onwards? Came up in the scientific SIG16:56
bauzasjohnthetubaguy: I may have said that MIG isn't officially supported in Nova16:56
bauzasbecause that's nvidia specific and can't be documented upstream16:57
johnthetubaguybauzas: could you link me to where we say that? It's in the PCI docs I guess?16:58
johnthetubaguy(It might be vGPU not MIG directly, but I could well be confused)16:59
sean-k-mooneythe gap is the write/read commands to add data to the volume17:01
sean-k-mooneycontext of ^ is our current ability to timeout ceph commands17:01
sean-k-mooneyjohnthetubaguy: I see, I thought we fixed that?17:02
johnthetubaguy@bauzas https://review.opendev.org/c/openstack/nova/+/925037 Maybe it got fixed another way?17:02
sean-k-mooneyjohnthetubaguy: there was a breakage caused by the alias feature I think, I don't recall the full details, I would have to find the review17:02
bauzasjohnthetubaguy: we don't exactly mention that nvidia bits are forbidden, but the vgpu upstream docs somehow show it https://docs.openstack.org/nova/latest/admin/virtual-gpu.html17:03
bauzasoh ok, now I get the picture17:04
sean-k-mooneyjohnthetubaguy: you don't happen to have the review do you? I'll see if I can find it17:04
sean-k-mooneyjohnthetubaguy: if I recall correctly we asked them to reproduce the failure in a functional test and then demonstrate their patch fixed it17:05
sean-k-mooneybut I don't think we asked for third party ci17:05
sean-k-mooneyjohnthetubaguy: was it related to https://review.opendev.org/c/openstack/nova/+/925037 https://bugs.launchpad.net/nova/+bug/194234517:06
bauzasguys, I need to drop for my sanity17:06
bauzaswe'll have long days of nova sessions so today is a hard stop17:07
sean-k-mooneywe fixed that with https://review.opendev.org/c/openstack/nova/+/80694317:07
fungii also linked it in the dalmatian retrospective section of the ptg pad since bauzas asked17:08
sean-k-mooneyack17:09
fungiand yes, the proposer seemed to think he was being told 1. we don't support your use case, and 2. if you want this merged you need to set up your own ci to provide testing for it17:10
johnthetubaguyTo be clear, my motive is I know people using vGPU on Antelope, and we are about to SLURP them to Caracal, but no one can afford GPUs in their staging environments these days. So people raising the flag and saying it’s all broken make me worry, and want to know how I can help fix things. A chat in the retrospective sounds like a good idea, regardless. 17:10
fungiwhich i don't get when i read the review comments, so they were reading a lot more into it for unknown reasons17:10
johnthetubaguyAh… I had missed that crucial detail, probably a misunderstanding of “add a functional test” talking about a different tox target.17:14
fungiyeah, as for their unpleasant experience, some people come into this expecting to have a bad time and then read between the lines in whatever way is needed to reinforce their expectation, so aside from being more clear in communicating (maybe? it was pretty clear to me already), i'm not sure what else might have helped the situation17:20
sean-k-mooneyjohnthetubaguy: yep tox not third party ci17:30
sean-k-mooneythis is a valid bug to fix, but the fact we had two almost identical issues suggests our existing test coverage is lacking17:31
sean-k-mooneythe original fix did not fix the underlying problem, just the symptom17:31
sean-k-mooneybut that's a separate matter.17:31
sean-k-mooneyjohnthetubaguy: for what it's worth we are trying to get mdev testing into first and third party ci currently17:32
sean-k-mooneyjohnthetubaguy: for the first party ci we have a poc using the mtty sample module which we hope to merge sooner rather than later into nova-next17:33
sean-k-mooneyfor third party ci we are hoping to eventually have a limited capability to trigger a third party job using redhat resources17:33
sean-k-mooneyi.e. against changes to master before they merge17:33
sean-k-mooneyit's a lot of work but something we are trying to line up. the big problem as you noted is hardware availability but that is not the only one17:34
sean-k-mooneywe hope the mtty support will give a baseline of coverage for things like this17:34
sean-k-mooneyjohnthetubaguy: https://review.opendev.org/q/topic:%22mtty_support%22 17:35
sean-k-mooneywe need https://review.opendev.org/c/openstack/nova/+/898100/4 first to allow using virtual devices17:36
opendevreviewArtom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init  https://review.opendev.org/c/openstack/nova/+/93292617:42
sean-k-mooneyfungi: the original confusion around "we don't support your use case" came from the fact they fixed the bug by adding the infra required to be able to detach an mdev via the alias, without stating that they were actually trying to detach a volume and this just so happened to fail when the vm also had an mdev17:42
sean-k-mooneythey had a limited explanation of that in the bug but not in the commit message17:43
fungisean-k-mooney: thanks for the clarification, makes sense18:15
fungialso worth noting, there was a chameleon cloud representative at oid-na who was talking about the possibility of making vgpu resources available for upstream testing. not sure where that will go, but something to consider18:16
sean-k-mooneythe issue is we can't do that with nested virt18:16
sean-k-mooneyi.e. we can only test it properly with baremetal compute nodes18:17
sean-k-mooneyif you passthrough a full gpu to a vm you can't then use vgpus18:17
JayFdarn, too bad we don't know how to provision baremetal /s 18:17
JayFseriously though, having to provision baremetal is very tough in CI, it's part of why Ironic 3rd party CIs died out18:18
JayFjust a little high touch18:18
fungiwell, yeah they were talking about nova baremetal api to boot machines with vgpu-capable resources18:18
fungi(i.e. ironic)18:18
fungibut i agree with you about the whole safety challenge with providing root access to physical machines for untrusted job payloads18:19
JayFit's not just a safety challenge, it's a challenge-challenge18:19
JayFmany real-world failure modes which need human hands, or at least non-automated intervention, to fix18:19
sean-k-mooneythe tldr is when you passthrough a full pci device in qemu it removes all the sriov capabilities because qemu did not have a way to make that work18:20
sean-k-mooneyfor what it's worth we are looking to see if we can use our internal hardware in the future to test specific upstream changes18:20
sean-k-mooneythat would do a baremetal deployment of a nova compute (using ironic) with a manual trigger to basically say "this is safe to test"18:21
sean-k-mooneyas a third party ci18:21
JayFI think there's a lot of untapped value in that idea of on-demand testing18:22
JayFwe can *usually* pinpoint when we're getting close to an edge case that might need revalidation18:23
sean-k-mooneybut that is why we are also trying to figure out how to test without real hardware using the sample kernel drivers for a baseline of coverage18:23
sean-k-mooneyfor example we are enabling the igb vif model in nova so that in the future we can use that to test sriov18:24
sean-k-mooneyif our providers can provide neutron ports with igb vif model in the future we can fully test sriov in the 1st party ci18:24
sean-k-mooneyunfortunately it needs a new enough nova/qemu but by 2026.1 we might be in a position to do that18:25
opendevreviewArtom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init  https://review.opendev.org/c/openstack/nova/+/93292619:04
opendevreviewsean mooney proposed openstack/nova master: Add repoducer test for bug 2074219  https://review.opendev.org/c/openstack/nova/+/93293019:54
opendevreviewsean mooney proposed openstack/nova master: Fix detaching devices by alias with mdevs  https://review.opendev.org/c/openstack/nova/+/93293119:54
sean-k-mooneyJayF: johnthetubaguy: fungi: ^ that is the actual correct fix and what I was asking for in that review19:55
fungiawesome!20:13
