opendevreview | Takashi Kajinami proposed openstack/os-resource-classes master: Drop unnecessary 'x' bit from doc config file https://review.opendev.org/c/openstack/os-resource-classes/+/932878 | 11:25 |
opendevreview | Takashi Kajinami proposed openstack/os-traits master: Drop unnecessary 'x' bit from doc config file https://review.opendev.org/c/openstack/os-traits/+/932879 | 11:25 |
opendevreview | Takashi Kajinami proposed openstack/osc-placement master: Drop unnecessary 'x' bit from doc config file https://review.opendev.org/c/openstack/osc-placement/+/932881 | 11:26 |
sean-k-mooney | HTTP exception thrown: Flavor has hw:virtio_packed_ring extra spec explicitly set to True, conflicting with image which has hw_virtio_packed_ring explicitly set to true. | 12:40 |
sean-k-mooney | ... well that's a bug | 12:40 |
sean-k-mooney | we are doing a string compare somewhere instead of comparing the boolean values | 12:40 |
tkajinam | https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L1120 | 12:43 |
sean-k-mooney | https://bugs.launchpad.net/nova/+bug/2085124 | 12:43 |
tkajinam | this is used not only for boolean properties but also for string properties. We probably have to add a flag to indicate whether the value should be converted before comparison | 12:44 |
sean-k-mooney | perhaps we have other helpers too | 12:44 |
sean-k-mooney | i normally expect us to call _get_flavor_image_meta | 12:44 |
sean-k-mooney | and then have a separate function per extra spec to check it | 12:45 |
sean-k-mooney | like get_mem_encryption_constraint | 12:45 |
sean-k-mooney | so we probably used _get_unique_flavor_image_meta to make it simpler but didn't think about this use case | 12:45 |
sean-k-mooney | rather than a flag for conversion we probably need to pass an expected type | 12:45 |
sean-k-mooney | or a conversion function for the flavor value. the image ones are already typed | 12:45 |
tkajinam | yeah | 12:46 |
tkajinam | ok the logic for packed_ring is specifically implemented in get_packed_virtqueue_constraint | 12:46 |
sean-k-mooney | i.e. pass in bool or oslo.utils' str_to_bool, whatever that is called | 12:46 |
sean-k-mooney | ya so the issue is probably here https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L1985 | 12:47 |
sean-k-mooney | in that path we are not converting it to the correct type | 12:47 |
sean-k-mooney | we convert the flavor to the correct type here https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L1990-L1992 | 12:48 |
sean-k-mooney | but in the dict case the image_value is just a string | 12:48 |
sean-k-mooney | so we need to call strutils.bool_from_string here https://github.com/openstack/nova/blob/master/nova/virt/hardware.py#L1985 | 12:49 |
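For context, a minimal sketch of the fix being discussed here, assuming the comparison happens on the raw flavor extra-spec string: normalize both sides with oslo.utils before comparing. The helper name below is hypothetical; the real code lives in nova/virt/hardware.py and the actual patch may differ.

```python
# Hypothetical helper illustrating the fix: flavor extra specs are plain
# strings ("True") while image properties are already typed (True), so a
# naive comparison reports a conflict even when both mean the same thing.
from oslo_utils import strutils

def _bool_values_conflict(flavor_value, image_value):
    # bool_from_string accepts both real booleans and strings like
    # "true"/"True"/"1"; strict=True raises ValueError on junk input.
    return (strutils.bool_from_string(flavor_value, strict=True)
            != strutils.bool_from_string(image_value, strict=True))
```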
jlejeune | hello, I have a question regarding host_mapping: is it "normal" that when we delete a nova-compute service, the 'mapped' field in the nova_cell1.compute_nodes table stays at 1 instead of 0 while the related nova_api.host_mappings entry is removed? | 13:54 |
jlejeune | leaving that 'mapped' field at 1 prevents the nova-manage cell_v2 discover_hosts command from creating the missing mapping | 13:55 |
gibi | jlejeune: it sounds like bug #1 in scenario 1 in https://bugs.launchpad.net/nova/+bug/2077070/comments/7 | 13:56 |
gibi | jlejeune: could you check if adding --by-service to the discover_hosts command helps you or not? | 13:57 |
jlejeune | I've already tested it and indeed, it "fixes" the issue by creating the missing mapping | 13:57 |
gibi | OK. Then I think it is the same issue. | 13:58 |
gibi | so yeah, nova should set mapped to 0 when the host mapping is deleted. | 13:58 |
jlejeune | but not sure I want to do it by service, given the number of computes we have in production | 13:58 |
jlejeune | agree | 13:59 |
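A hedged sketch of the fix direction gibi outlines: when the compute service is deleted and its HostMapping destroyed, also reset the 'mapped' flag on the cell's compute_nodes rows so discover_hosts can recreate the mapping later. The wrapper function is hypothetical; the object APIs mirror nova's, but where exactly this hook would live is up to the actual fix.

```python
# Hypothetical cleanup hook; objects.ComputeNodeList and target_cell are
# real nova APIs, but the placement of this logic is an assumption.
from nova import context as nova_context
from nova import objects

def _unmap_compute_nodes(ctxt, host_mapping):
    with nova_context.target_cell(ctxt, host_mapping.cell_mapping) as cctxt:
        for node in objects.ComputeNodeList.get_all_by_host(
                cctxt, host_mapping.host):
            node.mapped = 0  # lets discover_hosts pick this node up again
            node.save()
    host_mapping.destroy()
```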
gibi | if you have time to propose a fix then I'm happy to review it. If not, then eventually I'll get to it myself and propose the fix. | 13:59 |
jlejeune | I will have a look ! | 14:01 |
gibi | jlejeune: thanks! | 14:02 |
jlejeune | gibi: maybe we should split the bug in two, one for pci devices and another one for the mapping issue ? | 14:27 |
gibi | jlejeune: I don't think there is anything else on the pci side, but if you want a separate bug to fix the mapping issue then sure, go ahead and file one. I agree that not mentioning pci devices in the separate bug will help the reviewers not get bogged down on the pci stuff. | 14:31 |
jlejeune | yes exactly :) | 14:34 |
gibi | ping me with the separate bug and I will quickly do the triage of it | 14:35 |
jlejeune | ok | 14:37 |
jlejeune | gibi: https://bugs.launchpad.net/nova/+bug/2085135 | 14:55 |
gibi | jlejeune: thanks | 15:09 |
johnthetubaguy | dansmith: are you thinking threads for each task would be better for what nova-compute is generally doing? I hadn't really thought about that. I could see that limiting the code changes, I suppose? | 15:44 |
dansmith | johnthetubaguy: I want us to at least consider that yeah | 15:44 |
bauzas | johnthetubaguy: that was something we were considering from the last PTG | 15:44 |
dansmith | johnthetubaguy: I suspect we have some places where we think we're threadsafe, but aren't | 15:45 |
dansmith | so we'll need to fix that up, which we should have fixed anyway | 15:45 |
dansmith | but I think much of nova (potentially conductor as well) could just move straight to real threads, ops can control how much parallelism they want (which they've asked for before) to scale performance, and then we can move on | 15:46 |
johnthetubaguy | Yeah, I had not thought of it like that. It's a good point. | 15:47 |
clarkb | just as a point of info zuul runs largely with real threads (I think the websocket lib we use may involve asyncio but I can't remember) | 15:49 |
JayF | ^ is what dtantsur has been pitching for IPA, although we've written no code in that direction | 15:49 |
clarkb | it works reasonably well and in theory will work better when we can drop the gil and sort out all of the places we missed extra locking (but we already coordinate multiple processes with zookeeper so we don't really need that scaling approach from dropping the gil) | 15:49 |
dtantsur | Yep. Not all of our projects need to handle extremely high concurrency. | 15:50 |
dtantsur | Especially Ironic, which has "I'm busy, come back later" as an inherent part of its API :) | 15:50 |
johnthetubaguy | Honestly, most of the bugs I find are where we attempt high concurrency badly, hard blocking on some of the API calls, and tasks would totally make my operator self happier :) | 15:51 |
opendevreview | Artom Lifshitz proposed openstack/nova master: libvirt pwr mgmt: reproduce bug 2085122 https://review.opendev.org/c/openstack/nova/+/932922 | 15:51 |
sean-k-mooney | gibi: i'm not actually sold on using asyncio for the timeout stuff by the way | 16:05 |
sean-k-mooney | gibi: it was just going to be my next poc: wrap the existing function in awaitlet and dispatch to asyncio on a separate thread | 16:05 |
sean-k-mooney | specifically for the nova-api scatter-gather use case | 16:05 |
dansmith | I'm quite sure that doing that will not be a magic bullet unless we're actually running the asyncio event loop | 16:06 |
sean-k-mooney | right, that would be the idea: run the asyncio event loop in a separate thread and dispatch the async tasks into it | 16:06 |
sean-k-mooney | again this is just one of the many alternatives i was considering | 16:06 |
sean-k-mooney | i'm not convinced it is a good approach but it's what was suggested in the wiki | 16:07 |
dansmith | yeah, but unless we're the parent, I think the thing that spawned us may receive the SIGALRM and not deliver it to our event loop; as a result it doesn't really work like we think if it's blocked in a C call | 16:07 |
dansmith | anyway, we should dive into a poc and confirm | 16:08 |
sean-k-mooney | https://wiki.openstack.org/wiki/Eventlet-removal#3._Timeout_and_Deadline_Management_2 | 16:08 |
gibi | sean-k-mooney: for me, more moving pieces sounds worse; e.g. awaitlet + asyncio + a worker thread running the event loop is just too much complexity. | 16:09 |
sean-k-mooney | i did a poc of this before https://review.opendev.org/c/openstack/nova/+/904425/10/nova/context.py#422 | 16:09 |
sean-k-mooney | with a different approach | 16:09 |
sean-k-mooney | gibi: ack which is valid | 16:09 |
sean-k-mooney | gibi: i don't think the event approach you suggest really works, as we can't assume we can generically check the event within the function we are dispatching | 16:10 |
dansmith | gibi: ++ | 16:10 |
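For reference, a minimal sketch of the pattern under debate (an asyncio event loop in a dedicated thread, with work dispatched into it under a preemptible timeout), with the big caveat from this discussion that cancellation only happens at await points, never inside a blocking C call. All names are illustrative, not nova code.

```python
# Illustrative only: run one asyncio loop in a helper thread and
# dispatch awaitable tasks into it with asyncio.wait_for.
import asyncio
import threading

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def query_cell(cell_name):
    await asyncio.sleep(0.1)  # stand-in for an awaitable DB call
    return cell_name

def scatter_gather(cells, timeout=1.0):
    futures = {
        cell: asyncio.run_coroutine_threadsafe(
            asyncio.wait_for(query_cell(cell), timeout), loop)
        for cell in cells
    }
    # result() re-raises asyncio.TimeoutError for cells that ran long;
    # a coroutine stuck in a blocking (non-await) call is NOT interrupted.
    return {cell: fut.result() for cell, fut in futures.items()}
```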
sean-k-mooney | gibi: sqlalchemy does support a statement_timeout | 16:10 |
sean-k-mooney | which i think is going to be the answer for db queries | 16:10 |
sean-k-mooney | i think this is currently set per session or connection | 16:10 |
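To make the statement-timeout idea concrete, a hedged sketch: the timeout is enforced server-side and configured through SQLAlchemy's connect args, so a stuck query errors out instead of blocking a thread forever. Option names are backend-specific and the URL is illustrative.

```python
from sqlalchemy import create_engine, text

# PostgreSQL: per-connection statement_timeout, in milliseconds.
engine = create_engine(
    "postgresql+psycopg2://nova:secret@dbhost/nova",  # illustrative URL
    connect_args={"options": "-c statement_timeout=5000"},
)

with engine.connect() as conn:
    # A query running past 5s is cancelled by the server and surfaces
    # here as an OperationalError, no eventlet.Timeout needed.
    conn.execute(text("SELECT pg_sleep(10)"))

# MySQL >= 5.7.8 equivalent: SET SESSION max_execution_time = 5000
```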
gibi | sean-k-mooney: my point is that we need to make assumptions about the functions we run as a task, or we need to be prepared to abandon them from the gather thread | 16:11 |
dansmith | sean-k-mooney: "To handle signals the event loop must be run in the main thread." | 16:11 |
sean-k-mooney | dansmith: sure but we can't use signals for this | 16:11 |
dansmith | I think that unless you're the main thread, you can't delegate things to executors and expect things like timeouts to preempt | 16:11 |
sean-k-mooney | when run under mod_wsgi some of those are unconditionally handled by apache | 16:11 |
sean-k-mooney | so i don't think we can rely on SIGALRM in general | 16:12 |
dansmith | right, but I think that's how asyncio handles preemptable timeouts right? | 16:12 |
sean-k-mooney | i do not think so | 16:12 |
sean-k-mooney | asyncio also works on windows where it can't | 16:12 |
dansmith | sure, but it doesn't have to implement the primitives in the same way | 16:13 |
johnthetubaguy | What is the argument against brute forcing it with extra threads (for the scatter/gather)? Apache just says no? Or just worried about the system resources/latency of the extra threads? | 16:13 |
sean-k-mooney | johnthetubaguy: it's not really the scatter-gather that's the concern | 16:13 |
dansmith | I've just run into some weirdness when trying to run asyncio compatible stuff from deep inside "regular" code and it seems like it was related to some of the caveats around the event loop being in the main thread | 16:13 |
sean-k-mooney | we could remove that but we currently use eventlet to time out queries that don't return | 16:14 |
gibi | johnthetubaguy: threads are not cancellable so we cannot easily port today's timeout handling | 16:14 |
dansmith | right ^ | 16:14 |
sean-k-mooney | so i think the real fix for that is the sqlalchemy statement timeout | 16:14 |
johnthetubaguy | Ah… got you. | 16:14 |
dansmith | sean-k-mooney: right | 16:14 |
sean-k-mooney | processes can be killed but that's not reasonable to do in the api | 16:14 |
gibi | sean-k-mooney: sure, if all the blocking calls can be made non-blocking with a timeout then we can just push our timeout handling down to the bottom of the call stack | 16:15 |
sean-k-mooney | ok so i think i'll pause looking at other "timeout methods" that work for arbitrary functions | 16:15 |
sean-k-mooney | and instead look at how we can pass timeouts to existing libs directly, i.e. sqlalchemy | 16:15 |
gibi | yeah, I think we should not chase the "timeout for arbitrary function" direction with native threads, but accept that we need a per-task-type solution | 16:16 |
dansmith | agree | 16:16 |
sean-k-mooney | that is fine with me. | 16:17 |
sean-k-mooney | i was hoping to limit the scope of the refactors but let's explore that instead then | 16:17 |
sean-k-mooney | to be clear, changing from eventlet.Timeout to native timeouts in sqlalchemy should work with or without eventlet monkeypatching | 16:18 |
dansmith | yeah | 16:18 |
sean-k-mooney | dansmith: gibi: would you consider this in or out of scope for this cycle? perhaps we should discuss it in a ptg session | 16:18 |
sean-k-mooney | i'm not sure if that is high or low risk | 16:19 |
gibi | sean-k-mooney: do we have a timeout today both on the gather greenthread and on the task greenthreads doing the work? | 16:19 |
dansmith | I think we could get an eventlet.Timeout or a sqla.timeout at any point so we should already be handling those | 16:19 |
gibi | sean-k-mooney: I consider the work in scope for E; landing it in E might not happen, but if I want to learn the devil in the details I need to work on an example, and this is a good one. | 16:20 |
sean-k-mooney | ack | 16:20 |
dansmith | also, AFAIK eventlet will let the timeout happen to unblock the waiter, but the actually-blocked worker thread remains there to be GC'd until it frees up, | 16:20 |
dansmith | so using real timeouts should be better anyway to unblock and return threads to the pool quicker | 16:20 |
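A small illustration of the semantics being described, assuming plain concurrent.futures: the waiter's timeout fires, but the blocked worker thread cannot be interrupted and keeps occupying its pool slot, which is exactly why pushing timeouts into the libraries themselves is attractive. Purely illustrative, not nova code.

```python
# Illustrative only. The waiter unblocks on timeout; the worker does not.
import concurrent.futures
import time

def blocking_task():
    time.sleep(60)  # stands in for a stuck DB/RPC/C-level call
    return "done"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(blocking_task)
try:
    future.result(timeout=1.0)
except concurrent.futures.TimeoutError:
    # We move on, but blocking_task still runs to completion in the
    # background, holding its thread for the full 60 seconds.
    print("waiter timed out; worker thread is still blocked")
pool.shutdown(wait=False)
```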
sean-k-mooney | when the timeout fires i believe eventlet calls greenthread.throw on the worker greenthread | 16:21 |
dansmith | right but not in the actual worker thread which is blocked on a C call | 16:21 |
sean-k-mooney | but to dan's point it may very well block forever or until the os kills the tcp socket | 16:21 |
dansmith | there's no way to generically interrupt the latter | 16:21 |
dansmith | the greenthread.throw is only handled as an exception when the blocking call returns | 16:22 |
sean-k-mooney | right | 16:22 |
sean-k-mooney | so until the os returns an interrupt that resumes the worker thread it won't die | 16:22 |
dansmith | right | 16:23 |
dansmith | that's why moving to real library timeouts will benefit both the real threads and green threads approaches | 16:23 |
sean-k-mooney | ack. for now the synchronous ceph lib does not support that, but i think their async (not asyncio) functions that return futures can; still need to confirm that | 16:25 |
dansmith | ack | 16:29 |
sean-k-mooney | they do support timeouts on the connect method and the "cli command" wrappers | 16:34 |
sean-k-mooney | https://docs.ceph.com/en/reef/rados/api/python/#rados.Rados.connect https://docs.ceph.com/en/reef/rados/api/python/#rados.Rados.mon_command | 16:34 |
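From the linked docs, a hedged sketch of those rados timeouts: Rados.connect() and mon_command() both accept a timeout parameter, so those paths can already fail fast without eventlet. The cluster details below are illustrative.

```python
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect(timeout=5)  # seconds; raises if the cluster is unreachable

# the "cli command" style wrapper takes its own timeout
ret, outbuf, outs = cluster.mon_command(
    json.dumps({"prefix": "status", "format": "json"}), b"", timeout=5)
cluster.shutdown()
```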
bauzas | gouthamr: hey, what exactly do you want to discuss about virtiofs during nova sessions ? https://etherpad.opendev.org/p/nova-2025.1-ptg#L231 | 16:37 |
opendevreview | Artom Lifshitz proposed openstack/nova master: libvirt pwr mgmt: reproduce bug 2085122 https://review.opendev.org/c/openstack/nova/+/932922 | 16:37 |
opendevreview | Artom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init https://review.opendev.org/c/openstack/nova/+/932926 | 16:37 |
gouthamr | bauzas: i'd like to know the current status of the development / focus of the team going into Epoxy | 16:39 |
bauzas | okay, noted | 16:39 |
opendevreview | Artom Lifshitz proposed openstack/nova master: libvirt pwr mgmt: reproduce bug 2085122 https://review.opendev.org/c/openstack/nova/+/932922 | 16:39 |
opendevreview | Artom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init https://review.opendev.org/c/openstack/nova/+/932926 | 16:39 |
bauzas | I'll provide a raw agenda tomorrow morning European time | 16:39 |
johnthetubaguy | bauzas: I believe it's MIG vGPU not working with volume detach any more, Caracal onwards? Came up in the scientific SIG | 16:56 |
bauzas | johnthetubaguy: I may have said that MIG isn't officially supported in Nova | 16:56 |
bauzas | because that's nvidia specific and can't be documented upstream | 16:57 |
johnthetubaguy | bauzas: could you link me to where we say that? It's in the PCI docs I guess? | 16:58 |
johnthetubaguy | (It might be vGPU not MIG directly, but I could well be confused) | 16:59 |
sean-k-mooney | the gap then is the write/read commands to add data to the volume | 17:01 |
sean-k-mooney | context of ^ is our current ability to time out ceph commands | 17:01 |
sean-k-mooney | johnthetubaguy: i see, i thought we fixed that? | 17:02 |
johnthetubaguy | @bauzas https://review.opendev.org/c/openstack/nova/+/925037 Maybe it got fixed another way? | 17:02 |
sean-k-mooney | johnthetubaguy: there was a breakage caused by the alias feature i think, i don't recall the full details, i would have to find the review | 17:02 |
bauzas | johnthetubaguy: we don't exactly mention that nvidia bits are forbidden, but the upstream vgpu docs somehow show it https://docs.openstack.org/nova/latest/admin/virtual-gpu.html | 17:03 |
bauzas | oh ok, now I get the picture | 17:04 |
sean-k-mooney | johnthetubaguy: you don't happen to have the review, do you? i'll see if i can find it | 17:04 |
sean-k-mooney | johnthetubaguy: if i recall correctly we asked them to reproduce the failure in a functional test and then demonstrate their patch fixed it | 17:05 |
sean-k-mooney | but i don't think we asked for third party ci | 17:05 |
sean-k-mooney | johnthetubaguy: was it related to https://review.opendev.org/c/openstack/nova/+/925037 https://bugs.launchpad.net/nova/+bug/1942345 | 17:06 |
bauzas | guys, I need to drop for my sanity | 17:06 |
bauzas | we'll have long days of nova sessions so today is a hard stop | 17:07 |
sean-k-mooney | we fixed that with https://review.opendev.org/c/openstack/nova/+/806943 | 17:07 |
fungi | i also linked it in the dalmatian retrospective section of the ptg pad since bauzas asked | 17:08 |
sean-k-mooney | ack | 17:09 |
fungi | and yes, the proposer seemed to think he was being told 1. we don't support your use case, and 2. if you want this merged you need to set up your own ci to provide testing for it | 17:10 |
johnthetubaguy | To be clear, my motive is I know people using vGPU on Antelope, and we are about to SLURP them to Caracal, but no one can afford GPUs in their staging environments these days. So people raising the flag and saying it’s all broken make me worry, and want to know how I can help fix things. A chat in the retrospective sounds like a good idea, regardless. | 17:10 |
fungi | which i don't get when i read the review comments, so they were reading a lot more into it for unknown reasons | 17:10 |
johnthetubaguy | Ah… I had missed that crucial detail, probably a misunderstanding of “add a functional test” talking about a different tox target. | 17:14 |
fungi | yeah, as for their unpleasant experience, some people come into this expecting to have a bad time and then read between the lines in whatever way is needed to reinforce their expectation, so aside from being more clear in communicating (maybe? it was pretty clear to me already), i'm not sure what else might have helped the situation | 17:20 |
sean-k-mooney | johnthetubaguy: yep tox not third party ci | 17:30 |
sean-k-mooney | this is a valid bug to fix. but the fact we had two almost identical issues suggests our existing test coverage is lacking | 17:31 |
sean-k-mooney | the original fix did not fix the underlying problem, just the symptom | 17:31 |
sean-k-mooney | but that's a separate matter. | 17:31 |
sean-k-mooney | johnthetubaguy: for what it's worth we are currently trying to get mdev testing into first and third party ci | 17:32 |
sean-k-mooney | johnthetubaguy: for the first party ci we have a poc using the mtty sample module which we hope to merge sooner rather than later into nova-next | 17:33 |
sean-k-mooney | for third party ci we are hoping to eventually have a limited capability to trigger a third party job using redhat resources | 17:33 |
sean-k-mooney | i.e. against changes to master before they merge | 17:33 |
sean-k-mooney | it's a lot of work but something we are trying to line up. the big problem as you noted is hardware availability but that is not the only one | 17:34 |
sean-k-mooney | we hope the mtty support will give a baseline of coverage for things like this | 17:34 |
sean-k-mooney | johnthetubaguy: https://review.opendev.org/q/topic:%22mtty_support%22 | 17:35 |
sean-k-mooney | we need https://review.opendev.org/c/openstack/nova/+/898100/4 first to allow using virtual devices | 17:36 |
opendevreview | Artom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init https://review.opendev.org/c/openstack/nova/+/932926 | 17:42 |
sean-k-mooney | fungi: the original confusion around "we don't support your use case" came from the fact they fixed the bug by adding the infra required to be able to detach an mdev via the alias, without stating that they were actually trying to detach a volume and this just so happened to fail when the vm also had an mdev | 17:42 |
sean-k-mooney | they had a limited explanation of that in the bug but not in the commit message | 17:43 |
fungi | sean-k-mooney: thanks for the clarification, makes sense | 18:15 |
fungi | also worth noting, there was a chameleon cloud representative at oid-na who was talking about the possibility of making vgpu resources available for upstream testing. not sure where that will go, but something to consider | 18:16 |
sean-k-mooney | the issue is we can't do that with nested virt | 18:16 |
sean-k-mooney | i.e. we can only test it properly with baremetal compute nodes | 18:17 |
sean-k-mooney | if you passthrough a full gpu to a vm you can't then use vgpus | 18:17 |
JayF | darn, too bad we don't know how to provision baremetal /s | 18:17 |
JayF | seriously though, having to provision baremetal is very tough in CI, it's part of why Ironic 3rd party CIs died out | 18:18 |
JayF | just a little high touch | 18:18 |
fungi | well, yeah they were talking about nova baremetal api to boot machines with vgpu-capable resources | 18:18 |
fungi | (i.e. ironic) | 18:18 |
fungi | but i agree with you about the whole safety challenge with providing root access to physical machines for untrusted job payloads | 18:19 |
JayF | it's not just a safety challenge, it's a challenge-challenge | 18:19 |
JayF | many real-world failure modes need human hands, or at least non-automated intervention, to fix | 18:19 |
sean-k-mooney | the tldr is when you passthrough a full pci device in qemu it removes all the sriov capabilities because qemu did not have a way to make that work | 18:20 |
sean-k-mooney | for what it's worth we are looking to see if we can use our internal hardware in the future to test specific upstream changes | 18:20 |
sean-k-mooney | that would do a baremetal deployment of a nova compute (using ironic) with a manual trigger to basically say "this is safe to test" | 18:21 |
sean-k-mooney | as a third party ci | 18:21 |
JayF | I think there's a lot of untapped value in that idea of on-demand testing | 18:22 |
JayF | we can *usually* pinpoint when we're getting close to an edge case that might need revalidation | 18:23 |
sean-k-mooney | but that is why we are also trying to figure out how to test without real hardware, using the sample kernel drivers for a baseline of coverage | 18:23 |
sean-k-mooney | for example we are enabling the igb vif model in nova so that in the future we can use that to test sriov | 18:24 |
sean-k-mooney | if our providers can provide neutron ports with the igb vif model in the future we can fully test sriov in the 1st party ci | 18:24 |
sean-k-mooney | unfortunately it needs a new enough nova/qemu but by 2026.1 we might be in a position to do that | 18:25 |
opendevreview | Artom Lifshitz proposed openstack/nova master: pwr mgmt: power up instances's CPUs upon init https://review.opendev.org/c/openstack/nova/+/932926 | 19:04 |
opendevreview | sean mooney proposed openstack/nova master: Add repoducer test for bug 2074219 https://review.opendev.org/c/openstack/nova/+/932930 | 19:54 |
opendevreview | sean mooney proposed openstack/nova master: Fix detaching devices by alias with mdevs https://review.opendev.org/c/openstack/nova/+/932931 | 19:54 |
sean-k-mooney | JayF: johnthetubaguy: fungi: ^ that is the actual correct fix and what i was asking for in that review | 19:55 |
fungi | awesome! | 20:13 |