Wednesday, 2025-08-20

jandersTheJulia ACK. So I remove the diagram from my doco change?00:06
janders(new diagram looks good)00:06
TheJuliaeither, I mean it's all programmatically generated00:08
TheJuliaso we should ideally do it off master branch state periodically but don't.00:08
opendevreviewJacob Anders proposed openstack/ironic master: WIP: update documentation to include servicing abort.  https://review.opendev.org/c/openstack/ironic/+/95782500:15
opendevreviewJacob Anders proposed openstack/ironic master: WIP: update documentation to include servicing abort.  https://review.opendev.org/c/openstack/ironic/+/95782500:19
janders^^ change to the svg file removed from my WIP doco update00:20
opendevreviewOpenStack Proposal Bot proposed openstack/ironic-ui master: Imported Translations from Zanata  https://review.opendev.org/c/openstack/ironic-ui/+/95782902:33
rpittaugood morning ironic! o/06:59
jandershey rpittau o/07:01
rpittauhey janders :)07:08
queensly[m]Good morning o/07:53
abongalegood morning 08:16
opendevreviewluocanhui proposed openstack/ironic master: Add UT for metrics modules  https://review.opendev.org/c/openstack/ironic/+/95802808:27
kubajjGood morning Ironic! o/09:33
kubajjI am very confused about https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L2041-L204309:33
kubajjjanders: if you are around, any clue why you checked if the function exists?09:34
janderskubajj not off the top of my head (been a while) but will try to check for you after dinner09:41
janderskubajj this looks like a straight up typo bug with brackets missing10:43
jandersI don't think there is any deeper explanation than that10:44
janderspart of this change was moving the logic in question to a helper function; the typo must have been introduced as part of this, and since it's not too disruptive in most circumstances (the change still works well on nodes with NVMe drives, doesn't break CI) it went unnoticed up till now10:46
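A minimal sketch of the bug pattern janders describes: with the brackets missing, Python evaluates the truthiness of the bound method object itself instead of calling it. The class and method names here are hypothetical, not the actual IPA hardware.py code.

```python
class Disk:
    def is_nvme(self):
        return False

disk = Disk()

# Buggy check: a bound method object is always truthy, so this is True
# even though is_nvme() would return False -- the brackets are missing.
buggy = bool(disk.is_nvme)

# Intended check: actually call the method.
fixed = bool(disk.is_nvme())
```

This kind of typo passes CI easily because the branch still "works" on hardware where the call would have returned True anyway.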
jandersIIRC you're working on swraid related features, correct? I see how it can be disruptive in this scenario10:48
opendevreviewJacob Anders proposed openstack/ironic master: Fix servicing abort to respect abortable flag  https://review.opendev.org/c/openstack/ironic/+/95718911:12
dtantsurTheJulia: hey, how is indirection API going to work in the hash ring code, given that the hash ring is needed to get the RPC peers?11:20
opendevreviewTakashi Kajinami proposed openstack/ironic-inspector master: Add WSGI module entrypoint  https://review.opendev.org/c/openstack/ironic-inspector/+/95806812:36
opendevreviewTakashi Kajinami proposed openstack/ironic-inspector master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/ironic-inspector/+/95806912:36
tkajinamquick question: do we expect a 2025.2 release of ironic-inspector deliverables?12:36
kubajjjanders: yes, indeed, we are trying to debug some swraid issues12:36
dtantsurtkajinam: I assume we still release it for 2025.2, but I don't remember the final agreement tbh12:37
tkajinamdtantsur, ok12:37
tkajinamI'm updating all puppet modules to replace the wsgi script and am wondering if we can add a wsgi module to ironic-inspector; I want to know if it'll be part of the next release12:38
tkajinamif not then I can just drop all ironic-inspector support code12:38
opendevreviewDmitry Tantsur proposed openstack/ironic master: JSON-RPC: disable server-side logging with rpc_transport=none  https://review.opendev.org/c/openstack/ironic/+/95807212:43
opendevreviewTakashi Kajinami proposed openstack/ironic-inspector master: Add WSGI module entrypoint  https://review.opendev.org/c/openstack/ironic-inspector/+/95806812:51
opendevreviewTakashi Kajinami proposed openstack/ironic-inspector master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/ironic-inspector/+/95806912:51
opendevreviewBaptiste Jonglez proposed openstack/networking-generic-switch master: Cast boolean Netmiko kwargs to native types  https://review.opendev.org/c/openstack/networking-generic-switch/+/95807913:03
opendevreviewBaptiste Jonglez proposed openstack/networking-generic-switch master: Document advanced Netmiko parameters  https://review.opendev.org/c/openstack/networking-generic-switch/+/95808013:07
TheJuliaI *think* the idea was we would release it for 2025.2 and that would be it13:16
dtantsurTheJulia: good morning. Now to the bad news: Ironic seems pretty broken in Metal3, and the eventlet migration may be to blame.13:20
TheJuliadtantsur: you've reminded me I need to set one of the classic jobs to use it. I think it ends up working by routing to ironic.conductor_manager directly13:20
TheJuliadtantsur: logs?13:20
TheJuliametal3-integration really needs to be fixed to use an entire stack even though we did the squash test13:21
TheJulia... which makes me question its job overall, but hey13:21
opendevreviewJohn Garbutt proposed openstack/ironic master: WIP: Choose prefered vmedia boot device  https://review.opendev.org/c/openstack/ironic/+/95808213:21
dtantsurTheJulia: metal3-integration is, unfortunately, the least useful thing we could run on Ironic.. we have consistent failures in the ironic-standalone-operator job. Which, among other things, creates and reads 100 nodes in a quick sequence on each test.13:22
TheJuliaWell, had no way to know about that, and I can only work with logs13:23
TheJuliaso... logs?13:23
dtantsurYeah, before I bother you with that, I need to understand if it fails entirely or is just slow13:23
dtantsurso13:23
dtantsurhttps://github.com/metal3-io/ironic-image/actions/runs/17098403587/job/48488270259 is the entry point13:23
dtantsurif you click "Summary", you can find the ZIP file with logs to download. Each directory is one test, so a separate Ironic installation.13:24
TheJuliaack, caffeinating now()13:24
dtantsur(Yeah, github UX for you... I wish they spent time on that rather than AI..)13:24
dtantsurI'll try to raise all the timeouts to see if it's slowness or something gets stuck13:24
TheJuliait wouldn't super surprise me if something got stuck, the basic testing I was running was pretty sporty performance wise, but... as with all things, it is also a massive change13:25
dtantsurhttps://github.com/metal3-io/ironic-standalone-operator/pull/349 will hopefully tell13:26
opendevreviewJohn Garbutt proposed openstack/ironic master: WIP: Choose prefered vmedia boot device  https://review.opendev.org/c/openstack/ironic/+/95808213:29
TheJuliacurious13:43
TheJulianot a single error, looks like it is generating responses13:44
TheJuliaat least, it thinks it is13:44
TheJuliadtantsur: is the config intentionally all over the place, some jobs with json-rpc, some appearing to be sqlite, but then others with a configuration for an ironic-database.mariadb.svc?13:56
TheJuliaI guess so, an ha job13:56
dtantsurTheJulia: yep, it's testing different configurations.13:57
* dtantsur updated the wrong timeout, retrying..13:58
TheJuliadoh13:58
TheJuliaInteresting14:08
stephenfinTheJulia: JayF: Did you come to a conclusion yet on whether to make the redfish-vmedia-4k job n-v? https://review.opendev.org/c/openstack/ironic/+/95796214:13
TheJuliamy hope was mirrors would heal since the central mirrors are supposed to be good now aiui, but maybe my understanding was limited by my EOD14:14
stephenfinack14:15
TheJuliadtantsur: crazy question, could ironic be ooming?14:19
TheJuliadtantsur: or to be more precise, could the runtime environment be killing it?14:20
dtantsurThat's possible but we should see signs of it14:20
TheJuliadtantsur: I do think you're on the right path with the timeout, inconsistent stop points while just seemingly chugging right along14:20
dtantsurIf that's the timeout, on one hand it's a relief (we're not broken in the end), but it's also a significant slow-down.14:21
TheJuliaor the hosts are swapping and get inconsistent performance14:22
TheJuliawhich would greatly reduce overall performance14:22
dtantsurTheJulia: it's passing, but taking 4 minutes instead of < 2 minutes for now14:22
TheJuliaokay, a start point!14:22
TheJuliaWe need to get request logging back somehow14:23
dtantsurdefinitely14:26
TheJuliado you have a link for your successful run logs? I want to see if I can somehow extract a pattern14:29
TheJuliaor at least, some level of understanding14:29
TheJuliaAlso, how much ram do those nodes have?14:30
dtantsurThat's something we ourselves are trying to figure out14:30
dtantsurit's minikube14:30
TheJuliaat least, we know it's not a 5 alarm fire, more like a 2.5 alarm fire :)14:32
dtantsur(side conversation: how crazy would be to mount the RPC application under the rest of the API?)14:32
dtantsurhehe, true14:32
TheJuliacould you elaborate on your idea and what you're trying to achieve?14:33
dtantsurDon't worry too much about it. But I really want to stop inventing new port numbers (8089 conflicts in OpenShift)14:33
TheJuliaoh, heh14:33
TheJuliaRegarding logging, it feels like we need to either see if we can toggle any additional logging in cheroot/cotyledon, or we need to instrument our REST API code14:34
dtantsurLooks like it's not implemented in cheroot, so I've filed a bug for us14:37
JayFdtantsur: link me the bug?14:38
JayFI'll at least get it on the GR-OSS backlog14:38
JayFif nobody gets it by the time they can schedule it, woo a freebie :D 14:38
TheJuliaLooking at a single delete action towards when one of the jobs stopped, at *least* when it started, the rpc call got generated at 13:05:20.167 and the client logged the response at 13:05:20.257 which is not bad.  Basically 90ms for the entire conductor side14:40
dtantsurJayF: https://bugs.launchpad.net/ironic/+bug/212105814:42
JayFaha so not something we can do in cheroot14:42
JayFthat can't be that hard14:42
JayFI'm going to point claude at it and see if I can bang it out today14:42
dtantsurJayF: maybe we can convince them to accept a PR? It's not clear if they don't want it at all or just don't want to work on it.14:43
JayFI am... skeptical given how that reads to me, and also, given the option between "Ironic solves our own problem in a day-ish" vs "solve problem in cheroot -> release -> requirements -> update -> test" I'll take the first one :)14:44
TheJuliaI think in our own code path makes sense, since we also need to tie things with a request-id14:45
dtantsurGood point14:45
TheJuliathe logs dmitry has provided *are* nice and sequential14:46
dtantsurOnly because the test is sequential ;)14:46
TheJuliathere is definitely a feeling I'm getting looking at them that we're dealing with some sort of unknown issue on the API side, outside our RPC code path14:46
TheJuliawhich means, minimally, we need to instrument that in middleware most likely14:47
TheJuliaat least, we need to see once we consider ourselves entirely done14:47
TheJuliaand log that14:47
TheJuliaand see how long it is before the next request comes in14:47
TheJulia(and of course, with all multiprocess logging, some of the entries are a little out of sequential order since we have two processes operating here)14:48
TheJuliaso, we actually need to sort of work both but I don't think we have any baseline on the conductor/rpc side aside from "that feels reasonable"14:48
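A minimal sketch of the kind of middleware instrumentation being discussed: a WSGI wrapper that wall-clocks each request and logs method, path, and duration. The class name is hypothetical; Ironic's real API hooks live under ironic/api/.

```python
import logging
import time

LOG = logging.getLogger(__name__)


class TimingMiddleware:
    """Wrap a WSGI app and log wall-clock time per request."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            return self.app(environ, start_response)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            LOG.info("%s %s took %.1f ms",
                     environ.get("REQUEST_METHOD"),
                     environ.get("PATH_INFO"),
                     elapsed_ms)
```

Note this times only until the app returns its iterable, which is close enough for spotting multi-hundred-millisecond sinks like the one discussed below in this log.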
TheJuliaI almost want thread count numbers too14:49
dtantsurMeanwhile, the tests passed with an increased timeout14:49
TheJuliaso, 2.5 alarms ;)14:49
TheJuliait's not the end of the world, but it's not great14:49
TheJuliaWhat is super interesting to me is that in opendev it has felt like a wash timewise14:50
opendevreviewTakashi Kajinami proposed openstack/ironic-inspector master: Add WSGI module entrypoint  https://review.opendev.org/c/openstack/ironic-inspector/+/95806814:50
dtantsurTheJulia: here is the data point for you. As far as I can tell, the normal RPC case (where we have 3 Ironic's and a MariaDB) has not regressed. Still around 1 minute (even less) for creating/reading/deleting nodes.14:52
TheJuliaInteresting!14:53
TheJuliaThe log I'm looking at looks like it is doing that and it looks like it got to node 95 before the job stopped, so not entirely there before your timeout job but close14:54
TheJuliaat least, on deletions14:55
TheJuliaI suspect we're sort of in a death by a thousand cuts situation; one thing is for sure though, the amount of logging we do when we're doing indirection is likely just way too much text and a bit more overhead14:55
TheJuliaI guess the first step might be to put a knob on that as well14:56
dtantsurI wonder if we can get rid of the API/conductor split entirely for the rpc_transport==none case14:57
dtantsurLike, run the entire API in a conductor thread or something14:57
dtantsur(pondering if indirection is the cause of slowness)14:57
TheJuliaWell, we could single process it, but we would still need a hardish delineation14:57
TheJuliawithout the systemd signaling stuff, systemd on bifrost will think the process never actually started14:58
dtantsurI'm thinking: we're already able to start the RPC WSGI server as part of the conductor. What is preventing doing the same with the API?14:59
* dtantsur -> 1:114:59
TheJuliaI dunno, but let's try to make this a somewhat data-driven approach14:59
dtantsurfair14:59
TheJuliaso like obvious thing on my end is the logging which results when indirection is in use15:00
TheJuliawe can dial that back big time15:00
TheJuliawe also need to instrument15:00
dtantsurhttps://review.opendev.org/c/openstack/ironic/+/958072?usp=dashboard related15:00
TheJuliagood enough for me15:01
TheJuliaI was going to make it configurable15:01
opendevreviewJohn Garbutt proposed openstack/ironic master: WIP: Choose prefered vmedia boot device  https://review.opendev.org/c/openstack/ironic/+/95808215:39
iurygregory<eyes> ^ 16:38
JayFiurygregory: if you have questions about that or need context I have some of it (the vmedia boot device change)16:41
iurygregoryI added a comment to the patch16:42
iurygregoryjust some thought I had while looking at it16:42
JayFI'm assuming we're all onboard to add request logging to cheroot in IPA as well?16:56
JayFThe Ironic patch is getting close, and it occurred to me I should port it to IPA16:57
iurygregory++16:58
opendevreviewJay Faulkner proposed openstack/ironic master: Add request logging middleware for API requests  https://review.opendev.org/c/openstack/ironic/+/95810317:00
TheJuliathe regex worries me, I suspect we should always have a request-id by that point though, but I don't remember if it's in the context that gets created or if it's the middleware establishing context17:07
TheJulia(doesn't help I'm on a long running meeting)17:07
TheJuliaI'm looking at changing the jsonrpc logging further, to make it *way* more lean on the client side17:10
TheJuliaif we can get to "we sent request_id to rpc and got it back" then that should help a lot and also make the logs a bit cleaner to follow overall17:15
JayFTheJulia: that is exactly how nova does it fwiw17:16
JayFTheJulia: they validate the request-id via regexp every time, that's why I did it17:16
JayFseemed like a good sanity check and protection from people injecting weird crap into logs via http header17:16
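A minimal sketch of the nova-style validation JayF describes: check the request-id header against a strict pattern before writing it to logs, so attacker-controlled header values can't inject arbitrary content. The regex and helper name are illustrative assumptions, not the actual Ironic patch.

```python
import re

# OpenStack request IDs look like "req-<uuid>". Validating before
# logging keeps weird crap from HTTP headers out of the logs.
_REQUEST_ID_RE = re.compile(
    r"^req-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}"
    r"-[0-9a-f]{4}-[0-9a-f]{12}$")


def safe_request_id(value):
    """Return the request ID if well-formed, else a placeholder."""
    if value and _REQUEST_ID_RE.match(value):
        return value
    return "-"
```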
JayFI have a meeting very shortly, am going to check CI output or test manually after if it's not done17:17
TheJuliaokay then17:18
TheJuliaI just approved the 4k job change, since it seems like it's still happening, likely just slowly syncing mirrors :(17:20
clifapparently there's been an open issue for at least a week: https://issues.redhat.com/browse/CS-298317:21
JayFgiven the datestamp on that and a fix is attached and declared working without comments, I wonder if there's a newer ticket and/or if they realize that one isn't solved17:25
TheJuliaclif: yeah, although I think I glanced at that yesterday and it was resolved.17:32
* TheJulia goes and uses her employee access to see what the change log says17:32
TheJuliastill being worked, fwiw17:34
TheJuliaI must have been thinking it was resolved by looking at a different item in jira, that one has never changed state to done/resolved17:35
opendevreviewVerification of a change to openstack/ironic master failed: Make ironic-tempest-uefi-redfish-vmedia-4k non-voting  https://review.opendev.org/c/openstack/ironic/+/95796218:30
opendevreviewJulia Kreger proposed openstack/ironic master: Minimize json-rpc client logging  https://review.opendev.org/c/openstack/ironic/+/95811918:44
opendevreviewJulia Kreger proposed openstack/ironic master: Make ironic-tempest-uefi-redfish-vmedia-4k non-voting  https://review.opendev.org/c/openstack/ironic/+/95796218:46
TheJuliaclif: ^^ you had to disable the gate job as well :)18:46
TheJuliathe wall clocking of the request handling seems to point to the indirection interaction being the huge sink. I think we should get patches around dialing back logging and see what we come out at19:36
TheJuliaIf not, we might want to dial out indirection for metal3 and just see if we're lock chaos free, or lock ridden19:37
TheJuliaLooks like the post to the conductor is chewing about 300 ms20:42
JayFis there any part of that interaction we could speed up? Thinking about the moving parts that seems a little long but probably not more than double-digit ms to save20:48
TheJuliaI'm not sure yet20:50
TheJuliahttps://github.com/cherrypy/cheroot/blob/main/cheroot/wsgi.py#L148 <-- say hello to our 300ms21:02
JayFshould that be obvious to me?21:04
TheJulianah21:07
TheJulianope21:07
TheJuliaso, what's interesting here is wsgi_app is the wsgiservice _application call for json_rpc (ironic/common/json_rpc/server.py)21:08
TheJuliawhich makes no sense because if I measure it internally, it's super fast21:08
JayFso the difference is probably imports?21:14
JayFyeah?21:14
JayFif you're measuring inside ironic and it's fast, but from inside cheroot it's slow, that makes me wonder if there's something inherently slow about calling into there21:14
TheJulianot terribly. So, the reason the local-service metal3 stuff is slower is that it's basically death by a thousand cuts21:14
JayFimports is unlikely the actual answer, but something along those lines21:14
TheJuliabut... each jsonrpc pass through indirection gets 300+ ms on my local laptop through the cheroot wsgi app handler back to our own app21:15
JayFyeah but if there's something weird here, we could speed up the whole service if we could improve it21:15
TheJuliaI don't understand why, yet21:15
JayFbecause I'd assume we're paying those 300ms, just less often, in all cases with json-rpc, yeah?21:15
TheJuliaoh yeah21:15
TheJuliawe just now pay them on the object calls in local mode21:16
TheJuliawhich slows down the reads/writes21:16
JayF</topic><topic> it's weird to me that ironic uses https://github.com/openstack/ironic/blob/master/ironic/api/hooks.py#L102 whereas other stuff uses https://github.com/openstack/oslo.middleware/blob/master/oslo_middleware/request_id.py#L40 ... I'm really perplexed as to how our request_id handling works21:20
JayFI thought at first it was just a "we did it an old way, there's a new way", but the shape seems different21:21
JayFI'm going to table this and try to get back on the yellow log road21:26
JayFholy cow I'm pretty sure I just unraveled this and figured out a thing21:31
JayFI'm 99% sure Ironic is a terminus in a case where we're given a global_request_id 21:31
TheJuliafound our 300ms21:32
JayFnope just kidding, this code is just misdirected as hell21:32
JayFTheJulia: forgot to call sys.go_fast()?21:32
JayF:D 21:32
TheJulialol no21:32
TheJuliagetting a link21:32
JayFno, I am right21:33
JayFhttps://github.com/openstack/ironic/blob/master/ironic/api/hooks.py#L12821:33
JayFwe never pass global-request-id back through the response21:33
JayFonly our newly-created request-id21:33
JayFso it's in the context but I can't find where we ever take it back outta the context on the way back out21:34
JayFand readd it to a header21:34
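A minimal sketch of the missing step JayF describes: after the request is handled, echo the inbound global request ID back onto the response headers. The header name is the standard OpenStack one; the helper structure here is a hypothetical illustration, not Ironic's actual pecan hook.

```python
GLOBAL_REQ_HEADER = "X-OpenStack-Global-Request-ID"


def add_global_request_id(inbound_headers, response_headers):
    """If the client sent a global request ID, echo it on the response.

    inbound_headers: dict of request headers.
    response_headers: list of (name, value) tuples, mutated in place.
    """
    global_id = inbound_headers.get(GLOBAL_REQ_HEADER)
    if global_id:
        response_headers.append((GLOBAL_REQ_HEADER, global_id))
    return response_headers
```

Without something like this, Ironic acts as a terminus: the global ID lands in the context on the way in but is never re-added to a header on the way out.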
TheJuliahttps://github.com/openstack/ironic/blob/master/ironic/common/auth_basic.py#L5321:35
TheJuliathat line takes 300ms21:35
JayFwe don't cache there?!21:35
JayFwe're opening that file 9000 times21:35
TheJuliaheh21:35
JayFeven without caching there has to be a more efficient implementation21:35
TheJuliayeah, its authenticate()21:37
JayFI already asked claude to see if it has ideas on perf improvements with no tradeoffs21:37
JayFoh, it's bcrypt21:38
JayFthat's intentional21:38
JayFbcrypt is intentionally slow to prevent brute force attacks, right21:38
JayFI bet bcrypt.checkpw() is 99% of that time sink21:38
TheJulialikely21:39
JayFthe file would be in ram all day long anyway21:39
TheJuliaI *suspect* the best thing to do is to cache known good in ram21:39
TheJuliaand do a quick direct match21:39
JayFthat is exactly what my first thought was21:39
JayFI'm trying to think if there's any way I could use that to do Bad Things(tm)21:39
JayFwe likely would just need some kinda awareness of when the file changes to clear the cache, and/or document that you must restart ironic services if you change the auth_file contents21:40
TheJulia... I bet we're hitting this issue twice actually21:41
TheJuliaboth on the rest api21:41
* JayF wonders if this performance bug is so severe we can backport it21:41
TheJuliaand the internal json-rpc endpoint21:41
JayFyes exactly21:41
JayFyes21:41
TheJuliathat would be basically half to 2/3rds of all the wall clock time for the requests metal3 is hitting21:42
TheJuliaif not 3/4ths of the time21:42
JayFgreat job21:43
JayFnot fixed yet but this is a super good find21:43
TheJuliayeah21:43
TheJulia(I quite literally started adding prints to wall clock everything in the wsgi app invocation)21:43
JayFcid: ^ this is relevant to our 1:1 chat, fwiw21:44
JayFwe were talking about the sorta things you could find performance testing, and this is a great story of how to do it \o/21:44
TheJulia(prints aren't great in general, since python does buffered IO, but when I printed out "in basic auth -- called authenticate %s" % (time.time()-start_time))21:45
TheJuliaactually, in that one I did logging instead of print; in cheroot I had to do print21:46
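A minimal sketch of the print-based wall-clocking technique TheJulia describes, with `flush=True` to counter Python's buffered stdout (without it the timing lines can appear long after the call they describe). The helper name is illustrative.

```python
import time


def timed_call(label, func, *args, **kwargs):
    """Wall-clock a call and print the elapsed time immediately."""
    start = time.time()
    result = func(*args, **kwargs)
    # flush=True forces the line out now, not whenever the buffer fills.
    print("%s took %.3fs" % (label, time.time() - start), flush=True)
    return result
```

Wrapping a suspect call site like this is how the 300 ms in the basic-auth path was isolated: the printed delta points at the one call that dominates the request.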
TheJuliaAnyway, I'm about to go vanish for the day and try to get some exercise in21:47
JayFif you can spare 5 minutes21:49
JayFcan you sanity check my global-request-id discovery above21:49
* TheJulia presses brain rewind21:49
JayFone of those I'm pretty sure I'm right but I wonder if there's some code somewhere I'm missing21:49
TheJuliahmmmmm21:51
TheJuliaI suspect you're right because we have a single place where we capture context.global_id and send it21:52
TheJuliawhich is in the nova callback stuffs21:52
JayFOK. I'm going to write the request logging stuff assuming that bug doesn't exist, then go fix that bug too.21:52
JayFProbably by actually using oslo_middleware to handle global_request_id, unless I find some reason not to21:53
TheJuliadid you check into oslo.context invocations at all?21:53
JayF.from_environ() loads it into the context21:53
JayFwe have no code that takes it from the context and puts it on the outgoing response21:53
TheJuliacorrect21:54
TheJuliawe also have a wrapper around it, but our wrapper just adds a custom field or two21:54
JayFyep, I will keep api compat in mind21:54
JayFbut I'm correct that we should have that header on response21:54
JayF*and* on any outgoing requests?21:55
TheJuliaI don't remember if that was a requirement or not21:55
JayFI'm going to use nova as a model for the behavior21:55
TheJuliaits beeeeen soooooooooooooo long21:55
JayFI'll file a bug about it21:55
JayFTheJulia: thinking about the caching idea, it exposes us to zero additional risk, given if you can read the ram of ironic you can read auth_file21:56
JayFTheJulia: so I'm +1 to that being a solution21:56
* TheJulia may not be an El-Aurian from Star Trek and thinks "it's been five hundred years!"21:56
JayFI can't acknowledge our centuries-old friendship until midway through the run of TNG, sorry21:57
TheJuliasilly time travel21:57
TheJuliaI've prodded steve since he wrote a bunch of that code around basic auth, to go ahead and chime in if he sees any issues with us caching "known good" in ram.22:00
JayFgood stuff22:00
JayFreally, really nice find22:00
JayFI am 100% serious we should try to make that backportable22:00
JayFthat's a huge perf increase, even if it doesn't hit the majority of our (stable) userbase 22:01
TheJuliaI'm 95% sure we can do that easily22:01
stevebaker[m]I don't see an issue with memoization of the bcrypt response22:05
stevebaker[m]actually, memoizing the auth_entry function would mean the file is always read but bcrypt calls are cached22:08
JayFit's not ridiculous to say "restart the service" if they update auth22:13
JayFin fact, given most auth_basic users are metal3 (right?) wouldn't that already be what's done by default in a k8s world22:14
stevebaker[m]yeah, it would be a regression in behaviour though. Reading the file every time was an intentional design choice, and likely not the cause of this poor performance22:17
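A minimal sketch of the approach converged on above: keep reading the htpasswd file on every request (so edits still take effect, preserving the intentional design stevebaker mentions), but cache "known good" results of the expensive bcrypt comparison. The verifier is passed in so this sketch runs without the bcrypt dependency; the real change would live in ironic/common/auth_basic.py.

```python
import hashlib

# Successful (hash, password-digest) pairs seen so far. Caching only
# positives means a changed hash in the file immediately invalidates:
# the old hash simply never appears in a lookup key again.
_known_good = set()


def cached_checkpw(password, hashed, slow_check):
    """Verify password against hashed, caching successful matches.

    slow_check is the real verifier (e.g. bcrypt.checkpw), which is
    intentionally slow to resist brute force -- hence the cache.
    """
    # Store a digest of the password, not the password itself.
    key = (hashed, hashlib.sha256(password).hexdigest())
    if key in _known_good:
        return True
    ok = slow_check(password, hashed)
    if ok:
        _known_good.add(key)
    return ok
```

As noted above, this exposes no additional risk: anyone who can read Ironic's RAM can already read the auth_file.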
stevebaker[m]TheJulia, JayF : Shall I whip up a change?22:30
TheJuliastevebaker[m]: if you have time and capacity, sure22:30
TheJuliaif not, I'll be taking a look tomorrow22:31
JayFthat means we are free to review it22:31
TheJulia(I also have.... ?4? hours of meetings tomorrow... soooo yeah.)22:31
stevebaker[m]yep I can take a look22:32
opendevreviewJay Faulkner proposed openstack/ironic master: Add request logging middleware for API requests  https://review.opendev.org/c/openstack/ironic/+/95810323:10
stevebaker[m]I have filed this bug https://bugs.launchpad.net/ironic/+bug/212110523:31

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!