tonyb | True. | 00:32 |
opendevreview | Tony Breeds proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later https://review.opendev.org/c/opendev/system-config/+/921786 | 03:14 |
opendevreview | Tony Breeds proposed opendev/system-config master: Test mirror services on noble https://review.opendev.org/c/opendev/system-config/+/921771 | 03:14 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: wip: Add ensure-dib role https://review.opendev.org/c/zuul/zuul-jobs/+/922910 | 07:42 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: wip: Add build-diskimage role https://review.opendev.org/c/zuul/zuul-jobs/+/922911 | 07:42 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: wip: Add example role for converting images https://review.opendev.org/c/zuul/zuul-jobs/+/922912 | 07:42 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: wip: Add example role for uploading diskimages https://review.opendev.org/c/zuul/zuul-jobs/+/922913 | 07:42 |
fungi | popping out momentarily to run quick errands, back in ~30 | 14:16 |
corvus | there are a bunch of nodes in deleting state starting yesterday | 14:53 |
corvus | at least, the graph shows a pretty constant value of ~200 deleting nodes, but the node listing says they've mostly been deleting for 0-10 minutes | 14:55 |
corvus | that suggests that they are not stuck, just a constant slow turnover | 14:55 |
corvus | oh i think the timer gets reset in the node listing | 14:56 |
corvus | so these may indeed be stuck deletes that we're retrying | 14:56 |
fungi | #status log Pruned backups on backup02.ca-ymq-1.vexxhost.opendev.org reducing volume utilization from 93% to 72% | 15:07 |
opendevstatus | fungi: finished logging | 15:07 |
frickler | corvus: ack, if you look at https://grafana.opendev.org/d/a8667d6647/nodepool3a-rackspace?orgId=1&from=now-7d&to=now it seems to be an issue with rad-ord since 01:00 yesterday | 15:18 |
clarkb | fungi: seems like we just had to prune things before the gerrit upgrade. But maybe it was the other server that prompted that pruning | 15:30 |
clarkb | I wonder if we've just accumulated enough in the backlog there that the time between prunings is shrinking | 15:36 |
frickler | I don't see any obvious issues in the launcher log for rax-ord, trying to spot check some of the servers now, but rax api is slooow | 15:37 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: bootloader: Strip prior console settings https://review.opendev.org/c/openstack/diskimage-builder/+/922961 | 15:47 |
clarkb | hrm the centos-8 mirror dir isn't empty. I need to reboot for local updates then will dig into that | 15:52 |
clarkb | I think https://review.opendev.org/c/opendev/system-config/+/922750 merged removing the cronjob and script before it could fire on its every 6 hour cadence | 15:59 |
corvus | do we just wait for the rax issue to clear up? or have we opened a ticket in the past and that nudged them to fix something? | 15:59 |
clarkb | that's not a big deal since the mirror was basically empty already and rm'ing things by hand when we remove the volume shouldn't be too bad | 15:59 |
clarkb | corvus: usually my first step is to try manually issuing deletes just to confirm that it isn't something to do with nodepool (or how nodepool uses openstacksdk). But 95% of the time there is no difference. Then we can file a ticket with uuids for stuck nodes and usually they get things cleared out | 16:00 |
fungi | frickler: in the past we've seen issues with api slowness in a rackspace region implying slowness of communication between different openstack services in that region, leading to leaked error nodes that accumulate and are undeletable by us | 16:09 |
fungi | corvus: in the past i think we've needed to open a ticket to let support know we want the nodes deleted | 16:11 |
frickler | well finally a "server list" completed and does not show the nodes that nodepool is trying to delete | 16:35 |
clarkb | could be slowness in api responses affecting the caching in nodepool? (like maybe we never get a response within the timeout period so we're just using a really old cached info?) | 16:40 |
corvus | assuming that nodepool ever sees an updated server list, then it should eventually decide the server was deleted and then just delete the zk record. | 16:41 |
corvus | if nodepool never gets an updated server list, then everything will just sit there forever including no new ready nodes | 16:41 |
corvus | (and by that i mean the case where all it ever does is time out on the server list) | 16:42 |
corvus | frickler: if it's handy, can you pastebin your server list? | 16:42 |
frickler | https://paste.opendev.org/show/b2HjX0OBikUYjgQEUyeV/ | 16:46 |
corvus | frickler: do you know about how long the api response took? | 16:46 |
frickler | no, sorry, just ran it and looked at the window again after maybe 20 minutes | 16:47 |
corvus | okay, the delete server calls are taking about 50 seconds | 16:48 |
corvus | there is a single list servers call: | 16:48 |
corvus | 2024-06-27 16:07:09,756 DEBUG nodepool.OpenStackAdapter.rax-ord: API call detailed server list in 6.466826658695936 | 16:49 |
corvus | i suspect that all the others are taking more than 60 seconds because we have a 60 second timeout configured in clouds.yaml | 16:49 |
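[editor's note: the 60 second timeout corvus refers to would live in the launcher's clouds.yaml. A minimal illustrative excerpt follows; the cloud name, region, and values here are assumptions, not the actual opendev config:]

```yaml
# Illustrative clouds.yaml excerpt, not the real opendev configuration.
# api_timeout caps how long openstacksdk waits on any single API call.
clouds:
  rax:
    profile: rackspace
    regions:
      - ORD
    api_timeout: 60
```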
clarkb | server listing has always been one of the slower requests iirc | 16:50 |
corvus | (i'm trying to narrow down/prove whether we're stuck there or just looping in timeouts) | 16:50 |
frickler | I'm running some timed server show+delete now | 16:53 |
frickler | hmm, server delete returns within 3s, server show needs ~ 30s to find out that the server doesn't exist. retrying the server list next | 16:56 |
corvus | most of the executor threads for rax-ord are sitting in sock.connect creating a connection | 16:59 |
corvus | okay, they have made progress, having looked at a second snapshot. so they aren't just stuck there. but they are all doing network things, so this still supports a slow remote side. | 17:01 |
corvus | a possible pathology here is that delete api calls and server list api calls go into the same api call executor queue. so we could be looking at an ever-increasing number of server delete calls that effectively (mostly) starve out the server list calls. | 17:05 |
corvus | (the idea here is that api calls are typically millisecond fast) | 17:05 |
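[editor's note: the starvation corvus describes can be reproduced with a toy model in plain Python — this is not nodepool's actual code, just a sketch of the failure mode: one shared thread pool serving both call types, where slow deletes queued first make a fast list call finish last:]

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy model of a single shared API-call executor (not nodepool code).
shared_pool = ThreadPoolExecutor(max_workers=2)
completed = []  # call names, in order of completion

def api_call(name, duration):
    """Pretend API call: sleep for its duration, then record completion."""
    time.sleep(duration)
    completed.append(name)

# Slow delete calls are enqueued first, then one fast list call.
for i in range(6):
    shared_pool.submit(api_call, f"delete-{i}", 0.05)
shared_pool.submit(api_call, "server-list", 0.01)
shared_pool.shutdown(wait=True)

# Despite being the cheapest call, the list call completes last:
# every worker slot stays busy with deletes until the queue drains.
```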
corvus | unfortunately, we can't start the repl with the launcher already running, so i can't confirm that | 17:07 |
opendevreview | Clark Boylan proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later https://review.opendev.org/c/opendev/system-config/+/921786 | 17:08 |
opendevreview | Clark Boylan proposed opendev/system-config master: Test mirror services on noble https://review.opendev.org/c/opendev/system-config/+/921771 | 17:08 |
clarkb | tonyb: ^ I think there was just a minor yaml indentation thing there so I went ahead and made a quick update | 17:08 |
clarkb | corvus: I suppose we could manually remove records for nodes we know that are deleted as a way to mitigate that. Would probably have to restart the launcher too to reset the executor state | 17:09 |
clarkb | definitely a hacky workaround and not ideal that idea is | 17:09 |
corvus | i'm concerned that if the delete server calls are taking 50 seconds (the last several were "only" 30 seconds), that even if we did that we'd end up here again quickly | 17:11 |
corvus | weird that the launcher logs don't agree with frickler's cmdline test (3s) | 17:11 |
clarkb | good point, the next round of booted nodes would be liable to spiral into this too | 17:11 |
*** atmark_ is now known as atmark | 17:11 |
frickler | the server list took 08:16 now | 17:16 |
corvus | that is some variation :) | 17:16 |
corvus | i think the sequence is something like: 1) submit 200 delete api requests; 2) submit 1 server list request; 3) our internal timeout happens; we submit another 200 delete api requests; 4) repeat 3. | 17:17 |
corvus | i think that's how we end up having one server list request; we will probably eventually execute another, but probably only after working through thousands of delete calls that no one is even listening for anymore. | 17:18 |
corvus | one solution would be to move the list api calls to a different executor, so they don't queue behind the creates and deletes | 17:19 |
corvus | that doesn't stop the deletes from piling up directly, but it does mean that if a 50 second delete call eventually succeeds, then we will stop re-adding it. | 17:19 |
corvus | other ideas to consider would be to remove unused delete calls from the queue; or set a max size on the queue so the provider stops creating state machines if something is going wrong. | 17:21 |
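[editor's note: a sketch of the first idea — giving list calls their own executor — in the same toy model as above. This is an assumed design, simplified from whatever the eventual nodepool patch actually does:]

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Two pools: creates/deletes share one, list calls get a dedicated lane
# so they can never queue behind a pile of slow deletes.
mutate_pool = ThreadPoolExecutor(max_workers=2)
list_pool = ThreadPoolExecutor(max_workers=1)
completed = []  # call names, in order of completion

def api_call(name, duration):
    """Pretend API call: sleep for its duration, then record completion."""
    time.sleep(duration)
    completed.append(name)

for i in range(6):
    mutate_pool.submit(api_call, f"delete-{i}", 0.05)
list_pool.submit(api_call, "server-list", 0.01)

mutate_pool.shutdown(wait=True)
list_pool.shutdown(wait=True)

# The fast list call now finishes first, even with deletes backed up.
```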
clarkb | we might be able to spread out extra deletes more. I think it's like a 5 minute interval now? Maybe twice an hour is plenty after the initial attempt? | 17:22 |
corvus | i think i have some ideas, i'll work on a patch | 17:27 |
corvus | in the mean time, we might be able to dislodge things with a launcher restart; if someone wants to do that i think it's worthwhile. we may get stuck again, or maybe not. we'll at least know. | 17:28 |
corvus | i don't think there's any more interactive debugging to do, so feel free to do that | 17:28 |
clarkb | I'll do that | 17:34 |
clarkb | only the launcher for rax should be necessary (thats nl01) | 17:34 |
clarkb | that is done. The nodepool launcher on nl01 was restarted about 20 seconds ago | 17:36 |
clarkb | the restart does seem to have changed its behavior. Some nodes are building now and the deleting count is falling | 17:44 |
corvus | it looks like things are fast enough that we're not going to tip over into the bad place immediately. but i think it could still happen; creates and deletes are still close to a minute. | 18:03 |
clarkb | It might also be fine until we mix in deletes later | 18:31 |
fungi | when we saw similar timeout issues with image deletes (mostly in iad i think?) it seemed like our list calls were piling up on the openstack end, so stopping the launcher for a while allowed it to catch its breath and start responding in a more timely fashion again for a while | 18:36 |
corvus | nodepool destroyer of clouds | 19:03 |
clarkb | https://zuul.opendev.org/t/openstack/build/c6afb3dfdc9146d7b19e3c4939c8a571/log/job-output.txt#386-387 jammy installs the ppa (then ignores the package). Noble does not install the ppa at all: https://zuul.opendev.org/t/openstack/build/b84f040fcb6248f99a3b191c7eb8307e/log/job-output.txt#385-386 testing confirms the change works as we expect. tonyb I'll +2 it but won't approve as | 19:54 |
clarkb | fungi had a small nit. I'll let you decide if you want to update for that or just approve it as is | 19:54 |
fungi | yeah, i didn't have strong feelings about it, just thought it might be an oversight | 20:02 |
fungi | python 3.13.0b3 is now available | 20:11 |
fungi | one beta release to go before it enters the rc phase | 20:12 |
opendevreview | Jay Faulkner proposed openstack/project-config master: Ironic no longer uses storyboard https://review.opendev.org/c/openstack/project-config/+/922864 | 20:31 |
clarkb | it is a cool afternoon here. I'm going to take advantage and go outside for a bike ride | 20:41 |
mordred | fungi: anything noteworthy in 13? | 21:00 |
fungi | mordred: they remove lots of "dead batteries" from the stdlib | 21:03 |
mordred | also, interesting - an experimental GIL-less option - and an experimental JIT compiler | 21:03 |
fungi | yep | 21:03 |
fungi | the nogil work is definitely interesting but projects that want to take advantage of it will probably need to do a lot more work to handle their own locking needs | 21:04 |
mordred | yah | 21:05 |
mordred | the dead batteries don't look like they'd be anything I'm using | 21:05 |
fungi | removal of the cgi and smtplib modules hit some of my projects pretty hard, but i worked around them | 21:06 |
corvus | batteries no longer included :( | 21:19 |
corvus | fungi: smtplib? | 21:20 |
corvus | not seeing that on the list | 21:20 |
corvus | smtpd maybe? | 21:23 |
fungi | corvus: ah no you're right, smtplib is not slated for removal in pep 594 (smtpd is). i was thinking of the mail header parsing from cgi.parse_header recommending email.message instead | 21:34 |
fungi | telnetlib was the other one i ended up having to work around | 21:34 |
fungi | also smtpd already got removed in 3.12 anyway | 21:35 |
fungi | oh, and crypt's removal is causing some downstream pain in passlib | 21:36 |
corvus | there is some irony there | 21:38 |
corvus | given that passlib is cited as a replacement for crypt | 21:38 |
fungi | yeah, i opened https://foss.heptapod.net/python-libs/passlib/-/issues/148 about it in 2022 | 21:39 |
fungi | since some of my projects depend on passlib as well | 21:41 |
opendevreview | Merged openstack/project-config master: Ironic no longer uses storyboard https://review.opendev.org/c/openstack/project-config/+/922864 | 21:57 |
opendevreview | Tony Breeds proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later https://review.opendev.org/c/opendev/system-config/+/921786 | 22:15 |
opendevreview | Tony Breeds proposed opendev/system-config master: Test mirror services on noble https://review.opendev.org/c/opendev/system-config/+/921771 | 22:15 |
tonyb | ^^ should be good to go | 22:17 |
tonyb | So the ansible-devel test job. Due to ansible now (current git master branch) needing > 3.10 on the controller and > 3.8 on any hosts I think that means that going forward we can only support/test on >= focal. So any reason I shouldn't remove https://opendev.org/opendev/system-config/src/branch/master/zuul.d/system-config-run.yaml#L100-L103 ? | 22:21 |
tonyb | and add Noble while trying to get that job passing again? | 22:22 |
tonyb | Oh and while I'm thinking about it should we also test ara master? In the same or different job | 22:29 |
opendevreview | Tony Breeds proposed opendev/system-config master: [DNM] Run ansible-devel under python-3.11 https://review.opendev.org/c/opendev/system-config/+/922704 | 22:30 |
Clark[m] | tonyb: the run base jobs should keep the older node types but they are just target nodes like any one of our servers. You should be able to bump the bridge node to jammy and that gets you 3.10 | 22:56 |
Clark[m] | Oh bridge is already jammy so it should be working. Unless Ansible isn't able to run on remote targets that are old but historically they have tried to keep that working | 22:57 |
Clark[m] | Or does it need 3.11? Maybe that is what 922704 means | 22:57 |
tonyb | Clark[m]: Yeah as of ansible/ansible current master the controller has to be >= 3.11 | 22:58 |
tonyb | Clark[m]: and any/all target nodes must be > 3.8 | 22:59 |
tonyb | Oh sorry any/all target hosts must be >= 3.8 | 23:01 |
tonyb | https://github.com/ansible/ansible/commit/b2a289dcbb702003377221e25f62c8a3608f0e89 ; and https://github.com/ansible/ansible/commit/1c17fe2d53c7409dc02780eb29430bee99fb42ad | 23:02 |
clarkb | hrm I think that means we're sticking to old ansible | 23:17 |
clarkb | corvus: ^ fyi this likely has impacts for zuul as well | 23:17 |
clarkb | a bit sad that the commit message doesn't really indicate anything more than what they did | 23:17 |
clarkb | https://github.com/ansible/ansible/issues/83095 the related issue doesn't really say more either | 23:20 |
clarkb | mostly just curious if 3.8 is expected to be a longer term python version due to its use in rhel 8 (pretty sure rhel8 is 3.8 python) or if the idea is to remove them on a regular cadence regardless of older LTS distros. But it really reduces ansible utility if you can't have a broad set of target runtimes | 23:21 |
clarkb | and historically ansible was really good about that so this change is :/ | 23:21 |
clarkb | I suppose one workaround for this is the stow type work that has been happening | 23:26 |
clarkb | start compiling our own pythons on targets just for ansible. Not ideal but doable | 23:26 |
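[editor's note: if the stow route were taken, the wiring on the Ansible side would presumably just be ansible_python_interpreter. A hypothetical inventory snippet follows; the hostname and interpreter path are made up for illustration:]

```yaml
# Hypothetical inventory fragment: point Ansible at a stow-built
# interpreter on targets whose system python is too old.
all:
  hosts:
    old-target.example.org:
      ansible_python_interpreter: /opt/stow/python3.11/bin/python3
```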
corvus | what targets do we have < 3.8? | 23:30 |
clarkb | corvus: xenial, bionic, and bullseye are the current set I think | 23:32 |
clarkb | 3.5, 3.6, and 3.7 respectively | 23:32 |
clarkb | and yes some of those are quite old, but canonical is doing 10 years (or more, I think some stuff is 12 now?) of support for LTS if you pay | 23:37 |
opendevreview | Merged opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later https://review.opendev.org/c/opendev/system-config/+/921786 | 23:42 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!