Thursday, 2025-07-24

TheJulia: The threads are going to drive me crazy [01:19]
cardoe: Ooo. Jacob got working what I’ve wanted to? [01:50]
opendevreview: Steve Baker proposed openstack/networking-generic-switch master: [DNM] dump flows after port plug  https://review.opendev.org/c/openstack/networking-generic-switch/+/955739 [01:51]
opendevreview: Kaifeng Wang proposed openstack/ironic-python-agent master: Support transport type as a root device hint  https://review.opendev.org/c/openstack/ironic-python-agent/+/955742 [02:32]
janders: cardoe: working on it - happy to compare notes if you're interested in this area [03:44]
janders: I have a few more ideas for further optimisation but will probably pause once finished with the current scope to attend to the downstream part of this feature [03:45]
janders: sorry I've been quiet upstream, will try to do a better job keeping up to date with discussions [03:46]
opendevreview: Steve Baker proposed openstack/networking-generic-switch master: [DNM] dump flows after port plug  https://review.opendev.org/c/openstack/networking-generic-switch/+/955739 [04:35]
opendevreview: Verification of a change to openstack/ironic master failed: Switch from local RPC to automated JSON RPC on localhost  https://review.opendev.org/c/openstack/ironic/+/954755 [06:32]
rpittau: good morning ironic! o/ [06:57]
queensly[m]: Good morning o/ [07:03]
rpittau: TheJulia, dtantsur: we had a look at the grenade job with masghar and queensly[m] and we found that it's been failing for 2 days! [08:24]
rpittau: the issue seems somewhat related to nova, at least looking at https://e33cfc4e5ae5ade32027-ad9677b8f3b079990c4951ac9cbbd797.ssl.cf2.rackcdn.com/openstack/2f8dfc2059784e6691e13e08876ef108/controller/logs/grenade.sh_log.txt [08:24]
rpittau: you can see there are some errors there related to compute, for example: 2025-07-23 21:16:59.494 | ERROR nova.objects.instance [None req-c824579b-82f9-499c-8b59-382bdf74f47b None None] [instance: 00000000-0000-0000-0000-000000000000] Unable to migrate instance because host None with node None not found: nova.exception.ComputeHostNotFound: Compute host None could not be found. [08:24]
rpittau: not sure if that's the root cause, more eyes needed there [08:24]
rpittau: and sorry for the wall of text! :) [08:24]
dtantsur: Ugh, thanks for looking into it. Had a chance to mention it to #openstack-nova already? [08:24]
dtantsur: TheJulia: an observation: SimpleQueue is written in pure C: https://docs.python.org/3/library/queue.html#simplequeue-objects. I wonder if it makes any difference for us. [08:29]
abongale: good morning ironic! [08:30]
rpittau: dtantsur: I haven't brought that up in the nova channel yet [08:30]
opendevreview: Kaifeng Wang proposed openstack/ironic-python-agent master: Support transport type as a root device hint  https://review.opendev.org/c/openstack/ironic-python-agent/+/955742 [08:37]
opendevreview: Dmitry Tantsur proposed openstack/ironic master: Log how long power sync and sensor collections take  https://review.opendev.org/c/openstack/ironic/+/955762 [09:32]
dtantsur: TheJulia: sorry if you already have something like ^^ planned, but looks helpful to me [09:32]
TheJulia: dtantsur: no worries, I was actually thinking one minor step further, specifically if we exceed 3x the preferred interval to explicitly log "hey, you likely need to increase these settings *OR* add conductors" [13:10]
dtantsur: mmm, yeah, self-diagnostics is a great idea [13:13]
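
The self-diagnostic idea discussed above (warn when a periodic run exceeds 3x its preferred interval) could look roughly like the sketch below. This is only an illustration, not the code from any of the reviews in this log: the names `overran`, `run_and_check` and `WARN_FACTOR` are hypothetical, and the wording of the warning is paraphrased from the chat.

```python
import logging
import time

LOG = logging.getLogger(__name__)

# Hypothetical threshold taken from the chat: warn at 3x the preferred interval.
WARN_FACTOR = 3


def overran(elapsed, interval, factor=WARN_FACTOR):
    """Return True when a periodic run took longer than factor * interval."""
    return interval > 0 and elapsed > factor * interval


def run_and_check(task, interval, name="power sync"):
    """Run task(), time it, and log a hint if it badly overran its interval."""
    start = time.monotonic()
    task()
    elapsed = time.monotonic() - start
    if overran(elapsed, interval):
        LOG.warning(
            "%s took %.1fs, more than %dx the %.0fs interval; consider "
            "increasing the interval or adding conductors.",
            name, elapsed, WARN_FACTOR, interval)
    return elapsed
```

Keeping the comparison in a small pure function (`overran`) makes the threshold easy to test without running a real periodic task.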
opendevreview: Jacob Anders proposed openstack/ironic master: Skip initial reboot to IPA when updating firmware out-of-band  https://review.opendev.org/c/openstack/ironic/+/954311 [13:14]
TheJulia: The act of getting from the queue is not the bottleneck based upon my testing. It's actually surprisingly quick to get entries out. I'm *really* starting to think it's just the db thundering herd issues we've long had around startup or occasionally colliding. [13:14]
TheJulia: I did do some additional measurement, and part of the issue seems to be the shutdown lock, checking it typically takes 0.00s, and is most often less than 0.20 seconds, but in some cases took up to 1.54 seconds. [13:15]
TheJulia: and, think of it as like 95% of the time, it is less than 0.2s, but that also begins to add up as well [13:16]
TheJulia: the overall loop itself over 12 hours beyond the checks was way more consistent [13:16]
dtantsur: TheJulia: the shutdown lock in futurist, right? [13:17]
TheJulia: multiprocessing since the entity which sets it is the parent process [13:17]
dtantsur: uhmm, so some different lock? [13:17]
TheJulia: we could make it multi-step, but really it's not a huge issue [13:17]
TheJulia: yeah [13:17]
TheJulia: again, looking at the spread, it's not the source of what I perceive as pain/inconsistency [13:18]
* TheJulia raises an eyebrow at "no more conductor workers" [13:19]
dtantsur: huh, something is leaking? [13:20]
TheJulia: sure looks that way [13:21]
dtantsur: TheJulia: the recent version of my futurist change should provide you with helpful logging on what gets created/deleted [13:21]
TheJulia: my thread count logging is consistent, it's the call to create the thread it looks like [13:22]
TheJulia: okay [13:22]
TheJulia: okay, so we're not leaking, we're just not operating on all cylinders with coffee [13:22]
dtantsur: Btw, to close the topic on Queue: I'm now quite confident it does release GIL when waiting. It's still interesting to try SimpleQueue instead. But if you say the queue is not the problem, then it may not be worth the effort. [13:25]
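
For anyone wanting to try the Queue versus SimpleQueue comparison raised above, a minimal single-threaded micro-benchmark might look like the following. It is a rough sketch only: `drain_time` is a hypothetical helper, and it exercises neither thread contention nor GIL release while waiting, so it says nothing definitive about conductor behaviour.

```python
import queue
import time


def drain_time(q, n=100_000):
    """Put n integers on q, get them all back, return (seconds, items)."""
    start = time.perf_counter()
    for i in range(n):
        q.put(i)
    items = [q.get() for _ in range(n)]
    return time.perf_counter() - start, items


if __name__ == "__main__":
    # Both classes are in the standard library; SimpleQueue needs Python 3.7+.
    for cls in (queue.Queue, queue.SimpleQueue):
        seconds, _ = drain_time(cls())
        print(f"{cls.__name__}: {seconds:.3f}s")
```

Both classes deliver items in FIFO order, so a swap would mainly affect per-operation overhead; `SimpleQueue` has no `task_done()` or `join()`, which matters if the calling code relies on them.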
TheJulia: Yeah, I thought I checked for exceptions/errors on thread creation previously but now getting them. I did crash this machine so maybe something else changed beyond the node count [13:28]
TheJulia: I need to get showered and all that stuff here in a bit so I'll dig in a little bit, but the queue seems pretty solid. And truthfully, if I set the delay to 0, then the entire set wraps really quite cleanly. Yay for "fun" problems [13:29]
dtantsur: yay [13:30]
dtantsur: what's the delay now, by the way? [13:30]
TheJulia: 1 second for power sync ops [13:48]
TheJulia: oh, you mean total spread? [13:49]
TheJulia: funny enough, last night I was seeing like ~8 minutes after getting back from a late dinner before going to bed [13:49]
dtantsur: No, I meant that, thx [13:49]
TheJulia: and it had been consistent [13:49]
TheJulia: and sometime after, sadness [13:49]
TheJulia: ugh [13:50]
dtantsur: I need to redo my math, but it sounds like 1 second, times 5000 nodes, on 8 threads, in 8 minutes is not bad at all [13:50]
TheJulia: I did end up increasing my thread count last night as well, it's been a pile of weirdness since I OOM-ed the machine [13:51]
TheJulia: I should just reboot [13:51]
TheJulia: <-- stubborn [13:52]
dtantsur: :D [13:52]
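
The back-of-the-envelope estimate mentioned above (1 second, 5000 nodes, 8 threads) can be redone as below. This is a sketch under a loud assumption that is not confirmed in the chat: that the 1 second delay is paid once per node, serially within each worker thread, and that everything else costs nothing. `min_pass_seconds` is a hypothetical helper.

```python
def min_pass_seconds(nodes, delay_s, threads):
    """Lower bound on one full pass if each node costs delay_s on one thread."""
    # Ceiling division: the busiest thread handles ceil(nodes / threads) nodes.
    per_thread = -(-nodes // threads)
    return per_thread * delay_s


seconds = min_pass_seconds(nodes=5000, delay_s=1, threads=8)
print(f"{seconds} s, about {seconds / 60:.1f} min")  # 625 s, about 10.4 min
```

Under that assumption the floor is roughly 10.4 minutes, so an observed spread of about 8 minutes suggests the assumption does not hold exactly, for instance because the thread count had been raised above 8, as mentioned in the chat.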
opendevreview: Nahian Pathan proposed openstack/ironic master: Reduce API calls when collecting sensor data with redfish  https://review.opendev.org/c/openstack/ironic/+/955484 [14:32]
opendevreview: Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Autogenerate list of NGS compatible devices  https://review.opendev.org/c/openstack/networking-generic-switch/+/955798 [15:10]
opendevreview: Clif Houck proposed openstack/ironic-tempest-plugin master: Change Portgroup minimum microversion to 1.26  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/955799 [15:12]
opendevreview: Clif Houck proposed openstack/ironic master: Add a new 'physical_network' field to the Portgroup object  https://review.opendev.org/c/openstack/ironic/+/955625 [15:15]
opendevreview: Clif Houck proposed openstack/ironic master: Add a new 'category' field to the Portgroup object  https://review.opendev.org/c/openstack/ironic/+/955713 [15:16]
clif: Question: if I Depends-On an ironic-tempest-plugin review will that change be pulled into the tempest plugin tests that Zuul runs? [15:18]
dtantsur: clif: if we have it in required_projects of that job, it should [15:19]
clif: sweet ty [15:27]
opendevreview: John Garbutt proposed openstack/ironic master: Fix inspection IB port client-id  https://review.opendev.org/c/openstack/ironic/+/955806 [16:07]
opendevreview: Nahian Pathan proposed openstack/sushy master: Support expanded Chassis and Storage for redfish  https://review.opendev.org/c/openstack/sushy/+/955211 [16:25]
dtantsur: Vacation time \o/ I'll be back on Aug 5th. Please behave well and don't upset TheJulia! [16:27]
queensly[m]: Enjoy your vacation dtantsur :) [16:46]
TheJulia: lol [16:49]
opendevreview: Julia Kreger proposed openstack/ironic master: Replace GreenThreadPoolExecutor in conductor  https://review.opendev.org/c/openstack/ironic/+/952939 [17:53]
opendevreview: Julia Kreger proposed openstack/ironic master: Set the backend to threading.  https://review.opendev.org/c/openstack/ironic/+/953683 [17:53]
opendevreview: Julia Kreger proposed openstack/ironic master: Clean-up misc eventlet references  https://review.opendev.org/c/openstack/ironic/+/955632 [17:53]
TheJulia: cid: I updated your thread pool change based upon findings I've finally grown to understand, I'm going to pull it all down to my lab machine and will re-verify shortly. [17:54]
cid: TheJulia, ack'd o/ [18:00]
* cid will take a look once he has a laptop handy. Currently on a short interstate trip :) [18:00]
opendevreview: Julia Kreger proposed openstack/ironic master: Replace GreenThreadPoolExecutor in conductor  https://review.opendev.org/c/openstack/ironic/+/952939 [18:14]
TheJulia: cid: ack, fixing a minor mistake in ^, but otherwise looks really good so far. [18:15]
opendevreview: Julia Kreger proposed openstack/ironic master: Set the backend to threading.  https://review.opendev.org/c/openstack/ironic/+/953683 [18:19]
cid: +++ [18:19]
opendevreview: Julia Kreger proposed openstack/ironic master: Clean-up misc eventlet references  https://review.opendev.org/c/openstack/ironic/+/955632 [18:19]
opendevreview: Julia Kreger proposed openstack/ironic master: Add a suggestive warning around power and sensor syncs  https://review.opendev.org/c/openstack/ironic/+/955821 [18:56]
opendevreview: Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Autogenerate list of NGS compatible devices  https://review.opendev.org/c/openstack/networking-generic-switch/+/955798 [19:10]
opendevreview: Nahian Pathan proposed openstack/sushy master: Support expanded Chassis and Storage for redfish  https://review.opendev.org/c/openstack/sushy/+/955211 [19:46]
TheJulia: I see what is going on with grenade: https://551b5d3d5ab1e9429cee-63329dcd236a98c48b3784d0d458e269.ssl.cf5.rackcdn.com/openstack/a4a4e91242434212aaaf067deecc5292/controller/logs/screen-q-l3.txt [21:20]
TheJulia: I'll spend some cycles on it tomorrow, I suspect something got merged and shouldn't have been yet or needs to be pinned [21:20]
iurygregory: yay for unsupportedversion lol [23:02]
