corvus | i'm going to restart zuul now | 01:18 |
---|---|---|
corvus | with just one scheduler | 01:18 |
Alex_Gaynor | Is zuul experiencing some sort of known outage ATM? | 01:44 |
corvus | Alex_Gaynor: i'm restarting it | 01:45 |
Alex_Gaynor | corvus: 👍 | 01:45 |
Alex_Gaynor | How long does that typically take? | 01:46 |
fungi | up to now, something around 15-20 minutes, longer depending on what else we might be doing as part of the restart | 01:47 |
corvus | Alex_Gaynor: something like 20 minutes or so for a full restart | 01:47 |
fungi | soon though, restarts should go unnoticed | 01:47 |
corvus | (i think i'm going to have to clear state and start again) | 01:47 |
Alex_Gaynor | Got it. Thanks. (Is there a better place to follow along than here?) | 01:48 |
corvus | this is the place | 01:48 |
corvus | Alex_Gaynor: did you notice an issue before about 20m ago? | 01:48 |
corvus | (i'm wondering if this is prompted by the restart, or if there was another issue before the restart that i didn't notice) | 01:49 |
Alex_Gaynor | Looks like I started seeing issues about 30 minutes ago | 01:51 |
Alex_Gaynor | In the form of various errors loading https://zuul.opendev.org/t/pyca/status/ | 01:51 |
corvus | okay. that's as expected then :) | 01:52 |
Alex_Gaynor | 👍 | 01:52 |
corvus | i've cleared state and am starting the scheduler again | 01:59 |
ianw_pto | Alex_Gaynor: while talking about zuul and pyca things, let me know if you have any thoughts on https://github.com/pyca/pynacl/issues/601 for arm64 wheels for pynacl; i imagine it could be very similar to what we have | 02:09 |
ianw_pto | what prompted me to think about it again was recent work we were doing to upgrade our containers to bullseye; the buildx process cross-compiles the dependencies and pynacl was one of the more painful bits | 02:11 |
Alex_Gaynor | ianw_pto: 👍 I ping'd reaperhulk, pynacl makes me sad these days so I don't think about it much. | 02:11 |
ianw_pto | ok, can do, thanks :) don't want to make anyone sad | 02:11 |
Alex_Gaynor | Hehe, not remotely your fault. (Once upon a time pynacl was a library with misuse resistant cryptography, and now it's mostly cryptography that's too hipster to be in openssl) | 02:13 |
corvus | apparently we're waiting on github rate limits again | 02:22 |
corvus | i think that may mean it could be a few hours before zuul is able to start | 02:23 |
corvus | ah, i think the timeout may have expired, it's moving along now | 02:28 |
Alex_Gaynor | Is zuul having a problem ATM? I'm seeing the events queue not going down | 13:25 |
fungi | Alex_Gaynor: i'm taking a look | 13:41 |
Alex_Gaynor | 🙇‍♂️ | 13:41 |
fungi | it's at 0 for most tenants, but yes it looks like the pyca tenant is reporting an event queue length of 20 at the moment | 13:43 |
fungi | the vexxhost queue length is also 20 | 13:43 |
fungi | event queue length | 13:44 |
fungi | and the zuul tenant's event queue is at 9 | 13:44 |
fungi | all the others are at 0 | 13:44 |
fungi | i'll see if there's any clues in the scheduler logs | 13:45 |
Alex_Gaynor | FWIW the pyca one has been at 20 for many minutes, it's not just transitory | 13:45 |
fungi | the vexxhost tenant only has a gerrit source connection, so seems unlikely to be limited to github events | 13:48 |
fungi | the most recent event logged by the scheduler for the pyca tenant seems to be this one: | 13:50 |
fungi | 2021-10-31 13:17:38,143 DEBUG zuul.GithubConnection: [e: ea4f4296-3a4c-11ec-9cac-f345a58a0adc] Scheduling event from github: <GithubTriggerEvent 0x7fd7d80c0c70 pull_request pyca/cryptography refs/pull/6504/head status github.com/pyca/cryptography 6504,6f45c6d2e0978d1521718ed5e97eda6a4d97d763 delivery: ea4f4296-3a4c-11ec-9cac-f345a58a0adc> | 13:50 |
fungi | and no mention of that event id past that log entry | 13:51 |
fungi | so that does seem to support the impression that it's still hanging out in the event queue | 13:52 |
fungi | the most recent builds to start in that tenant were at 09:04:37 utc, so it was clearly processing the event queue at least up to that time | 13:53 |
fungi | implying it wasn't immediately stuck when the scheduler was restarted | 13:54 |
fungi | looks like the vexxhost tenant processed a potential trigger event as recently as 10:09:06 utc | 13:58 |
corvus | fungi: i see the same change cache error. i think we should revert to 4.10.4 | 14:10 |
fungi | okay, i saw it too but it's not the only exception so i was trying to perform some rudimentary statistical analysis to see which ones were more common before and after updating | 14:11 |
corvus | that one will stop queue processing at least | 14:11 |
fungi | AttributeError: 'NoneType' object has no attribute 'cache_key' | 14:12 |
fungi | that one? | 14:12 |
corvus | yep | 14:12 |
fungi | yeah, 5300 of those today since the debug log rotated | 14:12 |
fungi | 3135 in yesterday's log | 14:12 |
fungi | what was the trick you did for the mass downgrade last time? ansible playbook to locally tag 4.10.4 as "latest" on all the servers? | 14:14 |
corvus | i'll work on a manual revert | 14:14 |
corvus | yep | 14:14 |
corvus | stopping zuul | 14:17 |
fungi | confirmed, before yesterday we did not seem to log that exception | 14:17 |
fungi | the other one i see starting yesterday and continuing today is... | 14:18 |
fungi | AttributeError: 'str' object has no attribute 'change' | 14:18 |
corvus | i'm deleting the zk state | 14:18 |
fungi | from the rpc listener | 14:18 |
fungi | though that may have been related to manual reenqueuing of changes | 14:19 |
corvus | starting zuul | 14:20 |
corvus | https://zuul.opendev.org/api/components looks like good versions | 14:21 |
fungi | nevermind, those exceptions don't look like they were clustered at times when reenqueuing was underway | 14:21 |
Alex_Gaynor | Question: Does the restart mean the even queue was lost, or will those jobs still happen? | 14:23 |
fungi | since the state in zookeeper was cleared, the queued trigger events will be lost i think | 14:24 |
corvus | right (though items already processed and running jobs are saved, but there weren't any in pyca) | 14:24 |
fungi | note these upgrades/restarts are working toward persistent state for purposes of being able to run multiple schedulers, so restarts in the (hopefully near) future will be hitless | 14:25 |
corvus | we're almost there. unfortunately this was a bug in our state persistence :/ | 14:26 |
fungi | i need to run out on a couple of quick errands, but should return within the hour hopefully | 14:37 |
corvus | fungi: i'll be leaving for the day soon | 14:37 |
fungi | thanks, i'll keep an eye on things once i'm back, but we ran smoothly enough on that release so i don't expect it will give us trouble | 14:38 |
corvus | ++ | 14:38 |
corvus | it's back up; i'm re-enqueing items | 14:41 |
corvus | Alex_Gaynor: and i see something running in pyca | 14:41 |
Alex_Gaynor | corvus: Yeah, I kicked the job | 14:42 |
corvus | re-enqueue is done | 14:42 |
corvus | #status log restarted zuul on 4.10.4 due to bugs in master | 14:43 |
opendevstatus | corvus: finished logging | 14:43 |
fungi | i'm around again and will keep an eye to irc/mailing lists in case anyone notices something still awry | 15:49 |
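The per-tenant event queue lengths fungi reads off at 13:43 come from Zuul's web API. A minimal sketch of that check is below, assuming the status JSON for this Zuul version exposes a `trigger_event_queue` object with a `length` field; the endpoint paths are the public zuul.opendev.org API.

```python
#!/usr/bin/env python3
# Rough sketch of the event-queue check done by hand in the log: poll each
# tenant's status endpoint and print its trigger event queue length.
# Assumes the status JSON exposes "trigger_event_queue": {"length": N},
# as the Zuul version discussed here did.
import requests

ZUUL = "https://zuul.opendev.org/api"


def queue_lengths():
    tenants = requests.get(f"{ZUUL}/tenants", timeout=30).json()
    for tenant in tenants:
        name = tenant["name"]
        status = requests.get(f"{ZUUL}/tenant/{name}/status", timeout=30).json()
        length = status.get("trigger_event_queue", {}).get("length", 0)
        print(f"{name}: {length} queued trigger events")


if __name__ == "__main__":
    queue_lengths()
```

A queue that stays non-zero across several polls, as the pyca tenant did, is the signal that event processing has stalled rather than just lagging.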
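The "rudimentary statistical analysis" fungi mentions at 14:11 amounts to counting how often each suspect traceback appears in the scheduler debug log before and after the upgrade. A small sketch follows; the log path is an assumption and should be adjusted to wherever the scheduler actually writes its debug log.

```python
#!/usr/bin/env python3
# Tally occurrences of the two suspect exceptions seen after the upgrade.
# The log path is an assumption; point it at the scheduler's debug log
# (and at the rotated copy to compare yesterday vs. today).
from collections import Counter

SUSPECTS = (
    "AttributeError: 'NoneType' object has no attribute 'cache_key'",
    "AttributeError: 'str' object has no attribute 'change'",
)


def count_exceptions(path="/var/log/zuul/debug.log"):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            for suspect in SUSPECTS:
                if suspect in line:
                    counts[suspect] += 1
    return counts


if __name__ == "__main__":
    for exc, n in count_exceptions().items():
        print(f"{n:6d}  {exc}")
```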
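The "mass downgrade" trick fungi recalls at 14:14 is to pull the known-good image tag on every host and retag it locally as `latest`, so the existing deployment keeps starting the pinned version. OpenDev drove this with an Ansible playbook; the sketch below shows the per-host equivalent, with the image names and tag as assumptions rather than a record of exactly what was run.

```python
#!/usr/bin/env python3
# Sketch of the "pin by retag" rollback: pull the known-good tag and retag it
# as "latest" locally so the normal container startup picks it up.
# Image names and tag are assumptions for illustration; the real rollback was
# an Ansible playbook doing the equivalent on every Zuul host.
import subprocess

GOOD_TAG = "4.10.4"
IMAGES = [
    "zuul/zuul-scheduler",
    "zuul/zuul-web",
    "zuul/zuul-executor",
    "zuul/zuul-merger",
]


def pin_to_known_good():
    for image in IMAGES:
        subprocess.run(["docker", "pull", f"{image}:{GOOD_TAG}"], check=True)
        subprocess.run(
            ["docker", "tag", f"{image}:{GOOD_TAG}", f"{image}:latest"], check=True
        )


if __name__ == "__main__":
    pin_to_known_good()
```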