TheJulia | The threads are going to drive me crazy | 01:19 |
---|---|---|
cardoe | Ooo. Jacob got working what I’ve wanted to? | 01:50 |
opendevreview | Steve Baker proposed openstack/networking-generic-switch master: [DNM] dump flows after port plug https://review.opendev.org/c/openstack/networking-generic-switch/+/955739 | 01:51 |
opendevreview | Kaifeng Wang proposed openstack/ironic-python-agent master: Support transport type as a root device hint https://review.opendev.org/c/openstack/ironic-python-agent/+/955742 | 02:32 |
janders | cardoe working on it - happy to compare notes if you're interested in this area | 03:44 |
janders | I have a few more ideas for further optimisation but will probably pause once finished with the current scope to attend to the downstream part of this feature | 03:45 |
janders | sorry I've been quiet upstream, will try to do a better job keeping up to date with discussions | 03:46 |
opendevreview | Steve Baker proposed openstack/networking-generic-switch master: [DNM] dump flows after port plug https://review.opendev.org/c/openstack/networking-generic-switch/+/955739 | 04:35 |
opendevreview | Verification of a change to openstack/ironic master failed: Switch from local RPC to automated JSON RPC on localhost https://review.opendev.org/c/openstack/ironic/+/954755 | 06:32 |
rpittau | good morning ironic! o/ | 06:57 |
queensly[m] | Good morning o/ | 07:03 |
rpittau | TheJulia, dtantsur: we had a look at the grenade job with masghar and queensly[m] and we found that it has been failing for 2 days! | 08:24 |
rpittau | the issue seems somewhat related to nova, at least looking at https://e33cfc4e5ae5ade32027-ad9677b8f3b079990c4951ac9cbbd797.ssl.cf2.rackcdn.com/openstack/2f8dfc2059784e6691e13e08876ef108/controller/logs/grenade.sh_log.txt | 08:24 |
rpittau | you can see there are some errors there related to compute, for example: 2025-07-23 21:16:59.494 | ERROR nova.objects.instance [None req-c824579b-82f9-499c-8b59-382bdf74f47b None None] [instance: 00000000-0000-0000-0000-000000000000] Unable to migrate instance because host None with node None not found: nova.exception.ComputeHostNotFound: Compute host None could not be found. | 08:24 |
rpittau | not sure if that's the root cause, more eyes needed there | 08:24 |
rpittau | and sorry for the wall of text! :) | 08:24 |
dtantsur | Ugh, thanks for looking into it. Had a chance to mention it to #openstack-nova already? | 08:24 |
dtantsur | TheJulia: an observation: SimpleQueue is written in pure C: https://docs.python.org/3/library/queue.html#simplequeue-objects. I wonder if it makes any difference for us. | 08:29 |
abongale | good morning ironic! | 08:30 |
rpittau | dtantsur: I haven't brought that up in the nova channel yet | 08:30 |
opendevreview | Kaifeng Wang proposed openstack/ironic-python-agent master: Support transport type as a root device hint https://review.opendev.org/c/openstack/ironic-python-agent/+/955742 | 08:37 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Log how long power sync and sensor collections take https://review.opendev.org/c/openstack/ironic/+/955762 | 09:32 |
dtantsur | TheJulia: sorry if you already have something like ^^ planned, but looks helpful to me | 09:32 |
TheJulia | dtantsur: no worries, I was actually thinking one minor step further, specifically if we exceed 3x the preferred interval to explicitly log "hey, you likely need to increase these settings *OR* add conductors" | 13:10 |
dtantsur | mmm, yeah, self-diagnostics is a great idea | 13:13 |
opendevreview | Jacob Anders proposed openstack/ironic master: Skip initial reboot to IPA when updating firmware out-of-band https://review.opendev.org/c/openstack/ironic/+/954311 | 13:14 |
TheJulia | The act of getting from the queue is not the bottleneck based upon my testing. It's actually surprisingly quick to get entries out. I'm *really* starting to think it's just the db thundering herd issues we've long had around startup or occasionally colliding. | 13:14 |
TheJulia | I did do some additional measurement, and part of the issue seems to be the shutdown lock, checking it typically takes 0.00s, and is most often less than 0.20 seconds, but in some cases took up to 1.54 seconds. | 13:15 |
TheJulia | and, think of it as like 95% of the time, it is less than 0.2s, but that also begins to add up as well | 13:16 |
TheJulia | the overall loop itself, beyond the checks, was way more consistent over 12 hours | 13:16 |
dtantsur | TheJulia: the shutdown lock in futurist, right? | 13:17 |
TheJulia | multiprocessing since the entity which sets it is the parent process | 13:17 |
dtantsur | uhmm, so some different lock? | 13:17 |
TheJulia | we could make it multi-step, but really it's not a huge issue | 13:17 |
TheJulia | yeah | 13:17 |
TheJulia | again, looking at the spread, it's not the source of what I perceive as pain/inconsistency | 13:18 |
* TheJulia raises an eyebrow at "no more conductor workers" | 13:19 | |
dtantsur | huh, something is leaking? | 13:20 |
TheJulia | sure looks that way | 13:21 |
dtantsur | TheJulia: the recent version of my futurist change should provide you with helpful logging on what gets created/deleted | 13:21 |
TheJulia | my thread count logging is consistent, it's the call to create the thread, it looks like | 13:22 |
TheJulia | okkay | 13:22 |
TheJulia | okay, so we're not leaking, we're just not operating on all cylinders with coffee | 13:22 |
dtantsur | Btw, to close the topic on Queue: I'm now quite confident it does release GIL when waiting. It's still interesting to try SimpleQueue instead. But if you say the queue is not the problem, then it may not be worth the effort. | 13:25 |
TheJulia | Yeah, I thought I checked for exceptions/errors on thread creation previously but now I'm getting them. I did crash this machine so maybe something else changed beyond the node count | 13:28 |
TheJulia | I need to get showered and all that stuff here in a bit so I'll dig in a little bit, but the queue seems pretty solid. And truthfully, if I set the delay to 0, then the entire set wraps really quite cleanly. Yay for "fun" problems | 13:29 |
dtantsur | yay | 13:30 |
dtantsur | what's the delay now, by the way? | 13:30 |
TheJulia | 1 second for power sync ops | 13:48 |
TheJulia | oh, you mean total spread? | 13:49 |
TheJulia | funny enough, last night I was seeing like ~8 minutes after getting back from a late dinner before going to bed | 13:49 |
dtantsur | No, I meant that, thx | 13:49 |
TheJulia | and it had been consistent | 13:49 |
TheJulia | and sometime after, sadness | 13:49 |
TheJulia | ugh | 13:50 |
dtantsur | I need to redo my math, but it sounds like 1 second, times 5000 nodes, on 8 threads, in 8 minutes is not bad at all | 13:50 |
TheJulia | I did end up increasing my thread count last night as well, it's been a pile of weirdness since I OOM-ed the machine | 13:51 |
TheJulia | I should just reboot | 13:51 |
TheJulia | <-- stubborn | 13:52 |
dtantsur | :D | 13:52 |
opendevreview | Nahian Pathan proposed openstack/ironic master: Reduce API calls when collecting sensor data with redfish https://review.opendev.org/c/openstack/ironic/+/955484 | 14:32 |
opendevreview | Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Autogenerate list of NGS compatible devices https://review.opendev.org/c/openstack/networking-generic-switch/+/955798 | 15:10 |
opendevreview | Clif Houck proposed openstack/ironic-tempest-plugin master: Change Portgroup minimum microversion to 1.26 https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/955799 | 15:12 |
opendevreview | Clif Houck proposed openstack/ironic master: Add a new 'physical_network' field to the Portgroup object https://review.opendev.org/c/openstack/ironic/+/955625 | 15:15 |
opendevreview | Clif Houck proposed openstack/ironic master: Add a new 'category' field to the Portgroup object https://review.opendev.org/c/openstack/ironic/+/955713 | 15:16 |
clif | Question: if I Depends-On an ironic-tempest-plugin review will that change be pulled into the tempest plugin tests that Zuul runs? | 15:18 |
dtantsur | clif: if we have it in required_projects of that job, it should | 15:19 |
clif | sweet ty | 15:27 |
opendevreview | John Garbutt proposed openstack/ironic master: Fix inspection IB port client-id https://review.opendev.org/c/openstack/ironic/+/955806 | 16:07 |
opendevreview | Nahian Pathan proposed openstack/sushy master: Support expanded Chassis and Storage for redfish https://review.opendev.org/c/openstack/sushy/+/955211 | 16:25 |
dtantsur | Vacation time \o/ I'll be back on Aug 5th. Please behave well and don't upset TheJulia! | 16:27 |
queensly[m] | Enjoy your vacation dtantsur :) | 16:46 |
TheJulia | lol | 16:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: Replace GreenThreadPoolExecutor in conductor https://review.opendev.org/c/openstack/ironic/+/952939 | 17:53 |
opendevreview | Julia Kreger proposed openstack/ironic master: Set the backend to threading. https://review.opendev.org/c/openstack/ironic/+/953683 | 17:53 |
opendevreview | Julia Kreger proposed openstack/ironic master: Clean-up misc eventlet references https://review.opendev.org/c/openstack/ironic/+/955632 | 17:53 |
TheJulia | cid: I updated your thread pool change based upon findings I've finally grown to understand, I'm going to pull it all down to my lab machine and will re-verify shortly. | 17:54 |
cid | TheJulia, ack'd o/ | 18:00 |
* cid will take a look once he has a laptop handy. Currently on a short interstate trip :) | 18:00 | |
opendevreview | Julia Kreger proposed openstack/ironic master: Replace GreenThreadPoolExecutor in conductor https://review.opendev.org/c/openstack/ironic/+/952939 | 18:14 |
TheJulia | cid: ack, fixing a minor mistake in ^, but otherwise looks really good so far. | 18:15 |
opendevreview | Julia Kreger proposed openstack/ironic master: Set the backend to threading. https://review.opendev.org/c/openstack/ironic/+/953683 | 18:19 |
cid | +++ | 18:19 |
opendevreview | Julia Kreger proposed openstack/ironic master: Clean-up misc eventlet references https://review.opendev.org/c/openstack/ironic/+/955632 | 18:19 |
opendevreview | Julia Kreger proposed openstack/ironic master: Add a suggestive warning around power and sensor syncs https://review.opendev.org/c/openstack/ironic/+/955821 | 18:56 |
opendevreview | Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Autogenerate list of NGS compatible devices https://review.opendev.org/c/openstack/networking-generic-switch/+/955798 | 19:10 |
opendevreview | Nahian Pathan proposed openstack/sushy master: Support expanded Chassis and Storage for redfish https://review.opendev.org/c/openstack/sushy/+/955211 | 19:46 |
TheJulia | I see what is going on with grenade: https://551b5d3d5ab1e9429cee-63329dcd236a98c48b3784d0d458e269.ssl.cf5.rackcdn.com/openstack/a4a4e91242434212aaaf067deecc5292/controller/logs/screen-q-l3.txt | 21:20 |
TheJulia | I'll spend some cycles on it tomorrow, I suspect something got merged and shouldn't have been yet or needs to be pinned | 21:20 |
iurygregory | yay for unsupportedversion lol | 23:02 |
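
A rough sketch of two ideas from the discussion above: the back-of-the-envelope estimate for one full power sync pass (1 second per node, 5000 nodes, 8 threads) and the suggested warning when a pass exceeds 3x the preferred interval. The function names, the even-split model, and the message wording are assumptions made for illustration; this is not Ironic's actual code.

```python
import math
from typing import Optional


def estimated_pass_seconds(node_count: int, per_node_seconds: float,
                           workers: int) -> float:
    """Estimate one full sync pass, assuming nodes split evenly across workers.

    Assumption: every node costs roughly per_node_seconds and workers never
    sit idle, so the busiest worker handles ceil(node_count / workers) nodes.
    """
    nodes_per_worker = math.ceil(node_count / workers)
    return nodes_per_worker * per_node_seconds


def sync_overrun_warning(elapsed_seconds: float, preferred_interval: float,
                         factor: float = 3.0) -> Optional[str]:
    """Return a hint when a pass took longer than factor x the interval.

    The 3x threshold comes from the chat; the message text is hypothetical.
    """
    if elapsed_seconds > factor * preferred_interval:
        return (
            f"Sync pass took {elapsed_seconds:.0f}s, more than {factor:g}x the "
            f"preferred interval of {preferred_interval:.0f}s. Consider "
            "increasing the interval settings or adding conductors."
        )
    return None


if __name__ == "__main__":
    estimate = estimated_pass_seconds(5000, 1.0, 8)
    print(f"estimated pass: {estimate:.0f}s ({estimate / 60:.1f} min)")
    print(sync_overrun_warning(elapsed_seconds=200, preferred_interval=60))
```

Under these assumptions a full pass comes out at 625 seconds, a bit over 10 minutes, so the ~8 minutes reported in the channel is in the same range.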