Thursday, 2024-06-27

tonybTrue.00:32
opendevreviewTony Breeds proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later  https://review.opendev.org/c/opendev/system-config/+/92178603:14
opendevreviewTony Breeds proposed opendev/system-config master: Test mirror services on noble  https://review.opendev.org/c/opendev/system-config/+/92177103:14
opendevreviewSimon Westphahl proposed zuul/zuul-jobs master: wip: Add ensure-dib role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291007:42
opendevreviewSimon Westphahl proposed zuul/zuul-jobs master: wip: Add build-diskimage role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291107:42
opendevreviewSimon Westphahl proposed zuul/zuul-jobs master: wip: Add example role for converting images  https://review.opendev.org/c/zuul/zuul-jobs/+/92291207:42
opendevreviewSimon Westphahl proposed zuul/zuul-jobs master: wip: Add example role for uploading diskimages  https://review.opendev.org/c/zuul/zuul-jobs/+/92291307:42
fungipopping out momentarily to run quick errands, back in ~3014:16
corvusthere are a bunch of nodes in deleting state starting yesterday14:53
corvusat least, the graph shows a pretty constant value of ~200 deleting nodes, but the node listing says they've mostly been deleting for 0-10 minutes14:55
corvusthat suggests that they are not stuck, just a constant slow turnover14:55
corvusoh i think the timer gets reset in the node listing14:56
corvusso these may indeed be stuck deletes that we're retrying14:56
fungi#status log Pruned backups on backup02.ca-ymq-1.vexxhost.opendev.org reducing volume utilization from 93% to 72%15:07
opendevstatusfungi: finished logging15:07
fricklercorvus: ack, if you look at https://grafana.opendev.org/d/a8667d6647/nodepool3a-rackspace?orgId=1&from=now-7d&to=now it seems to be an issue with rad-ord since 01:00 yesterday15:18
clarkbfungi: seems like we just had to prune things before the gerrit upgrade. But maybe it was the other server that prompted that pruning15:30
clarkbI wonder if we've just accumulated enough in the backlog there that the time between prunings is shrinking15:36
fricklerI don't see any obvious issues in the launcher log for rax-ord, trying to spot check some of the servers now, but rax api is slooow15:37
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: bootloader: Strip prior console settings  https://review.opendev.org/c/openstack/diskimage-builder/+/92296115:47
clarkbhrm the centos-8 mirror dir isn't empty. I need to reboot for local updates then will dig into that15:52
clarkbI think https://review.opendev.org/c/opendev/system-config/+/922750 merged removing the cronjob and script before it could fire on its every 6 hour cadence15:59
corvusdo we just wait for the rax issue to clear up?  or have we opened a ticket in the past and that nudged them to fix something?15:59
clarkbthats not a big deal since the mirror was basically empty already and rm'ing things by hand when we remove the volume shouldn't be too bad15:59
clarkbcorvus: usually my first step is to try manually issuing deletes just to confirm that it isn't something to do with nodepool (or how nodepool uses openstacksdk). But 95% of the time there is no difference. Then we can file a ticket with uuids for stuck nodes and usually they get things cleared out16:00
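A minimal sketch of that kind of manual delete-and-verify check, assuming openstacksdk; the cloud name and server UUID below are placeholders rather than values taken from this incident:

```python
import time

import openstack

# Placeholder cloud name and server UUID; substitute real values.
conn = openstack.connect(cloud="rax-ord")
server_id = "00000000-0000-0000-0000-000000000000"

start = time.monotonic()
conn.compute.delete_server(server_id)  # issue the delete directly, outside nodepool
print(f"delete call returned after {time.monotonic() - start:.1f}s")

# Poll until the server disappears (or give up), to see whether the delete
# actually completes on the provider side.
for _ in range(30):
    if conn.compute.find_server(server_id) is None:
        print("server is gone")
        break
    time.sleep(10)
else:
    print("server still present after ~5 minutes; likely stuck on the provider side")
```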
fungifrickler: in the past we've seen issues with api slowness in a rackspace region implying slowness of communication between different openstack services in that region, leading to leaked error nodes that accumulate and are undeletable by us16:09
fungicorvus: in the past i think we've needed to open a ticket to let support know we want the nodes deleted16:11
fricklerwell finally a "server list" completed and does not show the nodes that nodepool is trying to delete16:35
clarkbcould be slowness in api responses affecting the caching in nodepool? (like maybe we never get a response within the timeout period so we're just using a really old cached info?)16:40
corvusassuming that nodepool ever sees an updated server list, then it should eventually decide the server was deleted and then just delete the zk record.16:41
corvusif nodepool never gets an updated server list, then everything will just sit there forever including no new ready nodes16:41
corvus(and by that i also mean the case where all it ever does is time out on the server list)16:42
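The reconciliation idea described above, reduced to a rough illustration; this is not nodepool's actual code, and the node helpers are hypothetical:

```python
def reconcile_deleting_nodes(zk_nodes, cloud_servers):
    """If a fresh server list no longer contains a node we are deleting,
    drop its ZooKeeper record; otherwise keep retrying the delete."""
    cloud_ids = {server.id for server in cloud_servers}
    for node in zk_nodes:
        if node.state != "deleting":
            continue
        if node.external_id not in cloud_ids:
            node.delete_record()   # hypothetical helper: remove the ZK record
        else:
            node.retry_delete()    # hypothetical helper: queue another delete call
```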
corvusfrickler: if it's handy, can you pastebin your server list?16:42
fricklerhttps://paste.opendev.org/show/b2HjX0OBikUYjgQEUyeV/16:46
corvusfrickler: do you know about how long the api response took?16:46
fricklerno, sorry, just ran it and looked at the window again after maybe 20 minutes16:47
corvusokay, the delete server calls are taking about 50 seconds16:48
corvusthere is a single list servers call:16:48
corvus2024-06-27 16:07:09,756 DEBUG nodepool.OpenStackAdapter.rax-ord: API call detailed server list in 6.46682665869593616:49
corvusi suspect that all the others are taking more than 60 seconds because we have a 60 second timeout configured in clouds.yaml16:49
clarkbserver listing has always been one of the slower requests iirc16:50
corvus(i'm trying to narrow down/prove whether we're stuck there or just looping in timeouts)16:50
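One way to time the same calls outside the launcher, assuming openstacksdk and that api_timeout can be passed directly here (it would normally be the 60 second setting in clouds.yaml; the cloud name is a placeholder):

```python
import time

import openstack

conn = openstack.connect(cloud="rax-ord", api_timeout=60)

start = time.monotonic()
try:
    servers = list(conn.compute.servers(details=True))
    print(f"detailed server list: {len(servers)} servers in {time.monotonic() - start:.1f}s")
except Exception as exc:
    print(f"server list failed after {time.monotonic() - start:.1f}s: {exc}")
```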
fricklerI'm running some timed server show+delete now16:53
fricklerhmm, server delete returns within 3s, server show needs ~ 30s to find out that the server doesn't exist. retrying the server list next16:56
corvusmost of the executor threads for rax-ord are sitting at creating a connection (sock.connect)16:59
corvusokay, they have made progress, having looked at a second snapshot.  so they aren't just stuck there.  but they are all doing network things, so this still supports a slow remote side.17:01
corvusa possible pathology here is that delete api calls and server api calls go into the same api call executor queue.  so we could be looking at an ever increasing amount of server delete calls that effectively (mostly) starve out the server list calls.17:05
corvus(the idea here is that api calls are typically millisecond fast)17:05
corvusunfortunately, we can't start the repl with the launcher already running, so i can't confirm that17:07
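A toy illustration of that starvation pattern with one shared executor; the sleeps stand in for slow API calls, and none of this is the launcher's real code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_delete(i):
    time.sleep(0.2)   # stands in for a ~50 second delete API call
    return f"delete {i}"

def list_servers():
    time.sleep(0.1)   # stands in for the server list API call
    return "list"

with ThreadPoolExecutor(max_workers=4) as pool:
    deletes = [pool.submit(slow_delete, i) for i in range(100)]
    listing = pool.submit(list_servers)   # queued behind all the deletes
    start = time.monotonic()
    listing.result()
    print(f"list finished after {time.monotonic() - start:.1f}s of waiting in the queue")
```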
opendevreviewClark Boylan proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later  https://review.opendev.org/c/opendev/system-config/+/92178617:08
opendevreviewClark Boylan proposed opendev/system-config master: Test mirror services on noble  https://review.opendev.org/c/opendev/system-config/+/92177117:08
clarkbtonyb: ^ I think there was just a minor yaml indentation thing there so I went ahead and made a quick update17:08
clarkbcorvus: I suppose we could manually remove records for nodes we know are deleted as a way to mitigate that. Would probably have to restart the launcher too to reset the executor state17:09
clarkbdefinitely a hacky workaround, and that idea is not ideal17:09
corvusi'm concerned that if the delete server calls are taking 50 seconds (the last several were "only" 30 seconds), that even if we did that we'd end up here again quickly17:11
corvusweird that the launcher logs don't agree with frickler's cmdline test (3s)17:11
clarkbgood point, the next round of booted nodes would be liable to spiral into this too17:11
*** atmark_ is now known as atmark17:11
fricklerthe server list took 08:16 now17:16
corvusthat is some variation :)17:16
corvusi think the sequence is something like: 1) submit 200 delete api requests; 2) submit 1 server list request; 3) our internal timeout happens; we submit another 200 delete api requests; 4) repeat 3.17:17
corvusi think that's how we end up having one server list request; we will probably eventually execute another, but probably only after working through thousands of delete calls that no one is even listening for anymore.17:18
corvusone solution would be to move the list api calls to a different executor, so they don't queue behind the creates and deletes17:19
corvusthat doesn't stop the deletes from piling up directly, but it does mean that if a 50 second delete call eventually succeeds, then we will stop re-adding it.17:19
corvusother ideas to consider would be to remove unused delete calls from the queue; or set a max size on the queue so the provider stops creating state machines if something is going wrong.17:21
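A rough sketch of what those mitigations could look like combined, splitting read and write calls across executors and deduplicating pending deletes; the names and structure are hypothetical, not the eventual nodepool patch:

```python
from concurrent.futures import ThreadPoolExecutor

# Read-only list calls get their own pool so they never queue behind
# slow create/delete calls.
list_executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="api-list")
write_executor = ThreadPoolExecutor(max_workers=8, thread_name_prefix="api-write")

pending_deletes = set()  # at most one outstanding delete per server UUID

def submit_delete(conn, server_id):
    if server_id in pending_deletes:
        return None  # a delete for this server is already queued or running
    pending_deletes.add(server_id)

    def _delete():
        try:
            conn.compute.delete_server(server_id)
        finally:
            pending_deletes.discard(server_id)

    return write_executor.submit(_delete)

def submit_list(conn):
    return list_executor.submit(lambda: list(conn.compute.servers(details=True)))
```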
clarkbwe might be able to spread out extra deletes more. I think it's like a 5 minute interval now? Maybe twice an hour is plenty after the initial attempt?17:22
corvusi think i have some ideas, i'll work on a patch17:27
corvusin the meantime, we might be able to dislodge things with a launcher restart; if someone wants to do that i think it's worthwhile.  we may get stuck again, or maybe not.  we'll at least know.17:28
corvusi don't think there's any more interactive debugging to do, so feel free to do that17:28
clarkbI'll do that17:34
clarkbonly the launcher for rax should be necessary (that's nl01)17:34
clarkbthat is done. The nodepool launcher on nl01 was restarted about 20 seconds ago17:36
clarkbthe restart does seem to have changed its behavior. Some nodes are building now and the deleting count is falling17:44
corvusit looks like things are fast enough that we're not going to tip over into the bad place immediately.  but i think it could still happen; creates and deletes are still close to a minute.18:03
clarkbIt might also be fine until we mix in deletes later18:31
fungiwhen we saw similar timeout issues with image deletes (mostly in iad i think?) it seemed like our list calls were piling up on the openstack end, so stopping the launcher for a while allowed it to catch its breath and start responding in a more timely fashion again for a while18:36
corvusnodepool destroyer of clouds19:03
clarkbhttps://zuul.opendev.org/t/openstack/build/c6afb3dfdc9146d7b19e3c4939c8a571/log/job-output.txt#386-387 jammy installs the ppa (then ignores the package). Noble does not install the ppa at all: https://zuul.opendev.org/t/openstack/build/b84f040fcb6248f99a3b191c7eb8307e/log/job-output.txt#385-386 testing confirms the change works as we expect. tonyb I'll +2 it but won't approve as19:54
clarkbfungi had a small nit. I'll let you decide if you want to update for that or just approve it as is19:54
fungiyeah, i didn't have strong feelings about it, just thought it might be an oversight20:02
fungipython 3.13.0b3 is now available20:11
fungione beta release to go before it enters the rc phase20:12
opendevreviewJay Faulkner proposed openstack/project-config master: Ironic no longer uses storyboard  https://review.opendev.org/c/openstack/project-config/+/92286420:31
clarkbit is a cool afternoon here. I'm going to take advantage and go outside for a bike ride20:41
mordredfungi: anything noteworthy in 13?21:00
fungimordred: they remove lots of "dead batteries" from the stdlib21:03
mordredalso, interesting - an experimental GIL-less option - and an experimental JIT compiler21:03
fungiyep21:03
fungithe nogil work is definitely interesting but projects that want to take advantage of it will probably need to do a lot more work to handle their own locking needs21:04
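As a small illustration of the kind of explicit locking that becomes the application's responsibility under free-threaded builds (plain threading module usage, nothing nogil-specific):

```python
import threading

count = 0
lock = threading.Lock()

def increment(times):
    global count
    for _ in range(times):
        with lock:          # read-modify-write on shared state needs a lock
            count += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count)  # 40000
```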
mordredyah21:05
mordredthe dead batteries don't look like they'd be anything I'm using21:05
fungiremoval of the cgi and smtplib modules hit some of my projects pretty hard, but i worked around them21:06
corvusbatteries no longer included :(21:19
corvusfungi: smtplib?21:20
corvusnot seeing that on the list21:20
corvussmtpd maybe?21:23
fungicorvus: ah no you're right, smtplib is not slated for removal in pep 594 (smtpd is). i was thinking of the mail header parsing from cgi.parse_header recommending email.message instead21:34
fungitelnetlib was the other one i ended up having to work around21:34
fungialso smtpd already got removed in 3.12 anyway21:35
fungioh, and crypt's removal is causing some downstream pain in passlib21:36
corvusthere is some irony there21:38
corvusgiven that passlib is cited as a replacement for crypt21:38
fungiyeah, i opened https://foss.heptapod.net/python-libs/passlib/-/issues/148 about it in 202221:39
fungisince some of my projects depend on passlib as well21:41
opendevreviewMerged openstack/project-config master: Ironic no longer uses storyboard  https://review.opendev.org/c/openstack/project-config/+/92286421:57
opendevreviewTony Breeds proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later  https://review.opendev.org/c/opendev/system-config/+/92178622:15
opendevreviewTony Breeds proposed opendev/system-config master: Test mirror services on noble  https://review.opendev.org/c/opendev/system-config/+/92177122:15
tonyb^^ should be good to go22:17
tonybSo the ansible-devel test job.  Due to ansible (current git master branch) now needing > 3.10 on the controller and > 3.8 on any hosts, I think that means going forward we can only support/test on >= focal.  So any reason I shouldn't remove https://opendev.org/opendev/system-config/src/branch/master/zuul.d/system-config-run.yaml#L100-L103 ?22:21
tonyband add Noble while trying to get that job passing again?22:22
tonybOh and while I'm thinking about it should we also test ara master?  In the same or different job22:29
opendevreviewTony Breeds proposed opendev/system-config master: [DNM] Run ansible-devel under python-3.11  https://review.opendev.org/c/opendev/system-config/+/92270422:30
Clark[m]tonyb: the run base jobs should keep the older node types but they are just target nodes like any one of our servers. You should be able to bump the bridge node to jammy and that gets you 3.1022:56
Clark[m]Oh bridge is already jammy so it should be working. Unless Ansible isn't able to run on remote targets that are old but historically they have tried to keep that working22:57
Clark[m]Or does it need 3.11? Maybe that is what 922704 means 22:57
tonybClark[m]: Yeah as of ansible/ansible current master the controller has to be >= 3.1122:58
tonybClark[m]: and any/all target nodes must be > 3.822:59
tonybOh sorry any/all target hosts must be >= 3.823:01
tonybhttps://github.com/ansible/ansible/commit/b2a289dcbb702003377221e25f62c8a3608f0e89 ; and https://github.com/ansible/ansible/commit/1c17fe2d53c7409dc02780eb29430bee99fb42ad23:02
clarkbhrm I think that means we're sticking to old ansible23:17
clarkbcorvus: ^ fyi this likely has impacts for zuul as well23:17
clarkba bit sad that the commit message doesn't really indicate anything more than what they did23:17
clarkbhttps://github.com/ansible/ansible/issues/83095 the related issue doesn't really say more either23:20
clarkbmostly just curious if 3.8 is expected to be a longer term python version due to its use in rhel 8 (pretty sure rhel8 is 3.8 python) or if the idea is to remove them on a regular cadence regardless of older LTS distros. But it really reduces ansible's utility if you can't have a broad set of target runtimes23:21
clarkband historically ansible was really good about that so this change is :/23:21
clarkbI suppose one workaround for this is the stow-type work that has been happening23:26
clarkbstart compiling our own pythons on targets just for ansible. Not ideal but doable23:26
corvuswhat targets do we have < 3.8?23:30
clarkbcorvus: bionic and xenial and bullseye are the current set I think23:32
clarkb3.5, 3.6, and 3.7 respectively23:32
clarkber I have bionic and xenial backwards but ya23:32
clarkband yes some of those are quite old, but canonical is doing 10 years (or more, I think some stuff is 12 now?) of LTS support if you pay23:37
opendevreviewMerged opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later  https://review.opendev.org/c/opendev/system-config/+/92178623:42
