Thursday, 2024-06-27

tonybTrue.00:32
opendevreviewTony Breeds proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later  https://review.opendev.org/c/opendev/system-config/+/92178603:14
opendevreviewTony Breeds proposed opendev/system-config master: Test mirror services on noble  https://review.opendev.org/c/opendev/system-config/+/92177103:14
opendevreviewSimon Westphahl proposed zuul/zuul-jobs master: wip: Add ensure-dib role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291007:42
opendevreviewSimon Westphahl proposed zuul/zuul-jobs master: wip: Add build-diskimage role  https://review.opendev.org/c/zuul/zuul-jobs/+/92291107:42
opendevreviewSimon Westphahl proposed zuul/zuul-jobs master: wip: Add example role for converting images  https://review.opendev.org/c/zuul/zuul-jobs/+/92291207:42
opendevreviewSimon Westphahl proposed zuul/zuul-jobs master: wip: Add example role for uploading diskimages  https://review.opendev.org/c/zuul/zuul-jobs/+/92291307:42
fungipopping out momentarily to run quick errands, back in ~3014:16
corvusthere are a bunch of nodes in deleting state starting yesterday14:53
corvusat least, the graph shows a pretty constant value of ~200 deleting nodes, but the node listing says they've mostly been deleting for 0-10 minutes14:55
corvusthat suggests that they are not stuck, just a constant slow turnover14:55
corvusoh i think the timer gets reset in the node listing14:56
corvusso these may indeed be stuck deletes that we're retrying14:56
fungi#status log Pruned backups on backup02.ca-ymq-1.vexxhost.opendev.org reducing volume utilization from 93% to 72%15:07
opendevstatusfungi: finished logging15:07
fricklercorvus: ack, if you look at https://grafana.opendev.org/d/a8667d6647/nodepool3a-rackspace?orgId=1&from=now-7d&to=now it seems to be an issue with rad-ord since 01:00 yesterday15:18
clarkbfungi: seems like we just had to prune things before the gerrit upgrade. But maybe it was the other server that prompted that pruning15:30
clarkbI wonder if we've just accumulated enough in the backlog there that the time between prunings is shrinking15:36
fricklerI don't see any obvious issues in the launcher log for rax-ord, trying to spot check some of the servers now, but rax api is slooow15:37
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: bootloader: Strip prior console settings  https://review.opendev.org/c/openstack/diskimage-builder/+/92296115:47
clarkbhrm the centos-8 mirror dir isn't empty. I need to reboot for local updates then will dig into that15:52
clarkbI think https://review.opendev.org/c/opendev/system-config/+/922750 merged removing the cronjob and script before it could fire on its every 6 hour cadence15:59
corvusdo we just wait for the rax issue to clear up?  or have we opened a ticket in the past and that nudged them to fix something?15:59
clarkbthats not a big deal since the mirror was basically empty already and rm'ing things by hand when we remove the volume shouldn't be too bad15:59
clarkbcorvus: usually my first step is to try manually issuing deletes just to confirm that it isn't something to do with nodepool (or how nodepool uses openstacksdk). But 95% of the time there is no difference. Then we can file a ticket with uuids for stuck nodes and usually they get things cleared out16:00
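A minimal sketch of that kind of manual delete-and-verify check, assuming openstacksdk; the cloud name and server UUID below are placeholders rather than values taken from this incident:

```python
import time

import openstack

# Placeholder cloud name and server UUID; substitute real values.
conn = openstack.connect(cloud="rax-ord")
server_id = "00000000-0000-0000-0000-000000000000"

start = time.monotonic()
conn.compute.delete_server(server_id)  # issue the delete directly, outside nodepool
print(f"delete call returned after {time.monotonic() - start:.1f}s")

# Poll until the server disappears (or give up), to see whether the delete
# actually completes on the provider side.
for _ in range(30):
    if conn.compute.find_server(server_id) is None:
        print("server is gone")
        break
    time.sleep(10)
else:
    print("server still present after ~5 minutes; likely stuck on the provider side")
```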
fungifrickler: in the past we've seen issues with api slowness in a rackspace region implying slowness of communication between different openstack services in that region, leading to leaked error nodes that accumulate and are undeletable by us16:09
fungicorvus: in the past i think we've needed to open a ticket to let support know we want the nodes deleted16:11
fricklerwell finally a "server list" completed and does not show the nodes that nodepool is trying to delete16:35
clarkbcould be slowness in api responses affecting the caching in nodepool? (like maybe we never get a response within the timeout period so we're just using a really old cached info?)16:40
corvusassuming that nodepool ever sees an updated server list, then it should eventually decide the server was deleted and then just delete the zk record.16:41
corvusif nodepool never gets an updated server list, then everything will just sit there forever including no new ready nodes16:41
corvus(and by that i also mean the case where all it ever does is time out on the server list)16:42
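The reconciliation idea described above, reduced to a rough illustration; this is not nodepool's actual code, and the node helpers are hypothetical:

```python
def reconcile_deleting_nodes(zk_nodes, cloud_servers):
    """If a fresh server list no longer contains a node we are deleting,
    drop its ZooKeeper record; otherwise keep retrying the delete."""
    cloud_ids = {server.id for server in cloud_servers}
    for node in zk_nodes:
        if node.state != "deleting":
            continue
        if node.external_id not in cloud_ids:
            node.delete_record()   # hypothetical helper: remove the ZK record
        else:
            node.retry_delete()    # hypothetical helper: queue another delete call
```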
corvusfrickler: if it's handy, can you pastebin your server list?16:42
fricklerhttps://paste.opendev.org/show/b2HjX0OBikUYjgQEUyeV/16:46
corvusfrickler: do you know about how long the api response took?16:46
fricklerno, sorry, just ran it and looked at the window again after maybe 20 minutes16:47
corvusokay, the delete server calls are taking about 50 seconds16:48
corvusthere is a single list servers call:16:48
corvus2024-06-27 16:07:09,756 DEBUG nodepool.OpenStackAdapter.rax-ord: API call detailed server list in 6.46682665869593616:49
corvusi suspect that all the others are taking more than 60 seconds because we have a 60 second timeout configured in clouds.yaml16:49
clarkbserver listing has always been one of the slower requests iirc16:50
corvus(i'm trying to narrow down/prove whether we're stuck there or just looping in timeouts)16:50
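One way to time the same calls outside the launcher, assuming openstacksdk and that api_timeout can be passed directly here (it would normally be the 60 second setting in clouds.yaml; the cloud name is a placeholder):

```python
import time

import openstack

conn = openstack.connect(cloud="rax-ord", api_timeout=60)

start = time.monotonic()
try:
    servers = list(conn.compute.servers(details=True))
    print(f"detailed server list: {len(servers)} servers in {time.monotonic() - start:.1f}s")
except Exception as exc:
    print(f"server list failed after {time.monotonic() - start:.1f}s: {exc}")
```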
fricklerI'm running some timed server show+delete now16:53
fricklerhmm, server delete returns within 3s, server show needs ~ 30s to find out that the server doesn't exist. retrying the server list next16:56
corvusmost of the executor threads for rax-ord are sitting at creating a connection (sock.connect)16:59
corvusokay, they have made progress, having looked at a second snapshot.  so they aren't just stuck there.  but they are all doing network things, so this still supports a slow remote side.17:01
corvusa possible pathology here is that delete api calls and server api calls go into the same api call executor queue.  so we could be looking at an ever increasing amount of server delete calls that effectively (mostly) starve out the server list calls.17:05
corvus(the idea here is that api calls are typically millisecond fast)17:05
corvusunfortunately, we can't start the repl with the launcher already running, so i can't confirm that17:07
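A toy illustration of that starvation pattern with one shared executor; the sleeps stand in for slow API calls, and none of this is the launcher's real code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_delete(i):
    time.sleep(0.2)   # stands in for a ~50 second delete API call
    return f"delete {i}"

def list_servers():
    time.sleep(0.1)   # stands in for the server list API call
    return "list"

with ThreadPoolExecutor(max_workers=4) as pool:
    deletes = [pool.submit(slow_delete, i) for i in range(100)]
    listing = pool.submit(list_servers)   # queued behind all the deletes
    start = time.monotonic()
    listing.result()
    print(f"list finished after {time.monotonic() - start:.1f}s of waiting in the queue")
```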
opendevreviewClark Boylan proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later  https://review.opendev.org/c/opendev/system-config/+/92178617:08
opendevreviewClark Boylan proposed opendev/system-config master: Test mirror services on noble  https://review.opendev.org/c/opendev/system-config/+/92177117:08
clarkbtonyb: ^ I think there was just a minor yaml indentation thing there so I went ahead and made a quick update17:08
clarkbcorvus: I suppose we could manually remove records for nodes we know are deleted as a way to mitigate that. Would probably have to restart the launcher too to reset the executor state17:09
clarkbdefinitely a hacky workaround, and that idea is not ideal17:09
corvusi'm concerned that if the delete server calls are taking 50 seconds (the last several were "only" 30 seconds), that even if we did that we'd end up here again quickly17:11
corvusweird that the launcher logs don't agree with frickler's cmdline test (3s)17:11
clarkbgood point, the next round of booted nodes would be liable to spiral into this too17:11
*** atmark_ is now known as atmark17:11
fricklerthe server list took 08:16 now17:16
corvusthat is some variation :)17:16
corvusi think the sequence is something like: 1) submit 200 delete api requests; 2) submit 1 server list request; 3) our internal timeout happens; we submit another 200 delete api requests; 4) repeat 3.17:17
corvusi think that's how we end up having one server list request; we will probably eventually execute another, but probably only after working through thousands of delete calls that no one is even listening for anymore.17:18
corvusone solution would be to move the list api calls to a different executor, so they don't queue behind the creates and deletes17:19
corvusthat doesn't stop the deletes from piling up directly, but it does mean that if a 50 second delete call eventually succeeds, then we will stop re-adding it.17:19
corvusother ideas to consider would be to remove unused delete calls from the queue; or set a max size on the queue so the provider stops creating state machines if something is going wrong.17:21
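A rough sketch of what those mitigations could look like combined, splitting read and write calls across executors and deduplicating pending deletes; the names and structure are hypothetical, not the eventual nodepool patch:

```python
from concurrent.futures import ThreadPoolExecutor

# Read-only list calls get their own pool so they never queue behind
# slow create/delete calls.
list_executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="api-list")
write_executor = ThreadPoolExecutor(max_workers=8, thread_name_prefix="api-write")

pending_deletes = set()  # at most one outstanding delete per server UUID

def submit_delete(conn, server_id):
    if server_id in pending_deletes:
        return None  # a delete for this server is already queued or running
    pending_deletes.add(server_id)

    def _delete():
        try:
            conn.compute.delete_server(server_id)
        finally:
            pending_deletes.discard(server_id)

    return write_executor.submit(_delete)

def submit_list(conn):
    return list_executor.submit(lambda: list(conn.compute.servers(details=True)))
```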
clarkbwe might be able to spread out extra deletes more. I think it's like a 5 minute interval now? Maybe twice an hour is plenty after the initial attempt?17:22
corvusi think i have some ideas, i'll work on a patch17:27
corvusin the meantime, we might be able to dislodge things with a launcher restart; if someone wants to do that i think it's worthwhile.  we may get stuck again, or maybe not.  we'll at least know.17:28
corvusi don't think there's any more interactive debugging to do, so feel free to do that17:28
clarkbI'll do that17:34
clarkbonly the launcher for rax should be necessary (that's nl01)17:34
clarkbthat is done. The nodepool launcher on nl01 was restarted about 20 seconds ago17:36
clarkbthe restart does seem to have changed its behavior. Some nodes are building now and the deleting count is falling17:44
corvusit looks like things are fast enough that we're not going to tip over into the bad place immediately.  but i think it could still happen; creates and deletes are still close to a minute.18:03
clarkbIt might also be fine until we mix in deletes later18:31
fungiwhen we saw similar timeout issues with image deletes (mostly in iad i think?) it seemed like our list calls were piling up on the openstack end, so stopping the launcher for a while allowed it to catch its breath and start responding in a more timely fashion again for a while18:36
corvusnodepool destroyer of clouds19:03
clarkbhttps://zuul.opendev.org/t/openstack/build/c6afb3dfdc9146d7b19e3c4939c8a571/log/job-output.txt#386-387 jammy installs the ppa (then ignores the package). Noble does not install the ppa at all: https://zuul.opendev.org/t/openstack/build/b84f040fcb6248f99a3b191c7eb8307e/log/job-output.txt#385-386 testing confirms the change works as we expect. tonyb I'll +2 it but won't approve as19:54
clarkbfungi had a small nit. I'll let you decide if you want to update for that or just approve it as is19:54
fungiyeah, i didn't have strong feelings about it, just thought it might be an oversight20:02
fungipython 3.13.0b3 is now available20:11
fungione beta release to go before it enters the rc phase20:12
opendevreviewJay Faulkner proposed openstack/project-config master: Ironic no longer uses storyboard  https://review.opendev.org/c/openstack/project-config/+/92286420:31
clarkbit is a cool afternoon here. I'm going to take advantage and go outside for a bike ride20:41
mordredfungi: anything noteworthy in 13?21:00
fungimordred: they remove lots of "dead batteries" from the stdlib21:03
mordredalso, interesting - an experimental GIL-less option - and an experimental JIT compiler21:03
fungiyep21:03
fungithe nogil work is definitely interesting but projects that want to take advantage of it will probably need to do a lot more work to handle their own locking needs21:04
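As a small illustration of the kind of explicit locking that becomes the application's responsibility under free-threaded builds (plain threading module usage, nothing nogil-specific):

```python
import threading

count = 0
lock = threading.Lock()

def increment(times):
    global count
    for _ in range(times):
        with lock:          # read-modify-write on shared state needs a lock
            count += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count)  # 40000
```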
mordredyah21:05
mordredthe dead batteries don't look like they'd be anything I'm using21:05
fungiremoval of the cgi and smtplib modules hit some of my projects pretty hard, but i worked around them21:06
corvusbatteries no longer included :(21:19
corvusfungi: smtplib?21:20
corvusnot seeing that on the list21:20
corvussmtpd maybe?21:23
fungicorvus: ah no you're right, smtplib is not slated for removal in pep 594 (smtpd is). i was thinking of the mail header parsing from cgi.parse_header recommending email.message instead21:34
fungitelnetlib was the other one i ended up having to work around21:34
fungialso smtpd already got removed in 3.12 anyway21:35
fungioh, and crypt's removal is causing some downstream pain in passlib21:36
corvusthere is some irony there21:38
corvusgiven that passlib is cited as a replacement for crypt21:38
fungiyeah, i opened https://foss.heptapod.net/python-libs/passlib/-/issues/148 about it in 202221:39
fungisince some of my projects depend on passlib as well21:41
opendevreviewMerged openstack/project-config master: Ironic no longer uses storyboard  https://review.opendev.org/c/openstack/project-config/+/92286421:57
opendevreviewTony Breeds proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later  https://review.opendev.org/c/opendev/system-config/+/92178622:15
opendevreviewTony Breeds proposed opendev/system-config master: Test mirror services on noble  https://review.opendev.org/c/opendev/system-config/+/92177122:15
tonyb^^ should be good to go22:17
tonybSo the ansible-devel test job.  Due to ansible (current git master branch) now needing > 3.10 on the controller and > 3.8 on any hosts, I think that means going forward we can only support/test on >= focal.  So any reason I shouldn't remove https://opendev.org/opendev/system-config/src/branch/master/zuul.d/system-config-run.yaml#L100-L103 ?22:21
tonyband add Noble while trying to get that job passing again?22:22
tonybOh and while I'm thinking about it should we also test ara master?  In the same or different job22:29
opendevreviewTony Breeds proposed opendev/system-config master: [DNM] Run ansible-devel under python-3.11  https://review.opendev.org/c/opendev/system-config/+/92270422:30
Clark[m]tonyb: the run base jobs should keep the older node types but they are just target nodes like any one of our servers. You should be able to bump the bridge node to jammy and that gets you 3.1022:56
Clark[m]Oh bridge is already jammy so it should be working. Unless Ansible isn't able to run on remote targets that are old but historically they have tried to keep that working22:57
Clark[m]Or does it need 3.11? Maybe that is what 922704 means 22:57
tonybClark[m]: Yeah as of ansible/ansible current master the controller has to be >= 3.1122:58
tonybClark[m]: and any/all target nodes must be > 3.822:59
tonybOh sorry any/all target hosts must be >= 3.823:01
tonybhttps://github.com/ansible/ansible/commit/b2a289dcbb702003377221e25f62c8a3608f0e89 ; and https://github.com/ansible/ansible/commit/1c17fe2d53c7409dc02780eb29430bee99fb42ad23:02
clarkbhrm I think that means we're sticking to old ansible23:17
clarkbcorvus: ^ fyi this likely has impacts for zuul as well23:17
clarkba bit sad that the commit message doesn't really indicate anything more than what they did23:17
clarkbhttps://github.com/ansible/ansible/issues/83095 the related issue doesn't really say more either23:20
clarkbmostly just curious if 3.8 is expected to be a longer term python version due to its use in rhel 8 (pretty sure rhel8 is 3.8 python) or if the idea is to remove them on a regular cadence regardless of older LTS distros. But it really reduces ansible's utility if you can't have a broad set of target runtimes23:21
clarkband historically ansible was really good about that so this change is :/23:21
clarkbI suppose one workaround for this is the stow-type work that has been happening23:26
clarkbstart compiling our own pythons on targets just for ansible. Not ideal but doable23:26
corvuswhat targets do we have < 3.8?23:30
clarkbcorvus: bionic and xenial and bullseye are the current set I think23:32
clarkb3.5, 3.6, and 3.7 respectively23:32
clarkber I have bionic and xenial backwards but ya23:32
clarkband yes some of those are quite old, but canonical is doing 10 years (or more, I think some stuff is 12 now?) of LTS support if you pay23:37
opendevreviewMerged opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later  https://review.opendev.org/c/opendev/system-config/+/92178623:42
