frickler | found a non-ansible job that also shouldn't have been retried according to my understanding https://zuul.opendev.org/t/openstack/build/ffa3720c82aa45e2a41b4714b3a8a163 | 04:43 |
frickler | and a "normal" tempest failure https://zuul.opendev.org/t/openstack/build/f1eac61ca5be4029b68fdafccec0abe0 | 04:51 |
*** ChanServ changes topic to "OpenDev is a space for collaborative Open Source software development | https://opendev.org/ | channel logs https://meetings.opendev.org/irclogs/%23opendev/" | 04:52 | |
frickler | fungi: corvus: the unreachable for "get df disk usage" seems to be happening because remove-build-sshkey is running before that, so the executor rightfully cannot log into the nodes any more? so this looks like some side effect of https://review.opendev.org/c/zuul/zuul/+/923903 possibly? | 05:47 |
frickler | actually, maybe this has never worked because of the earlier remove-build-sshkey task? https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/cleanup.yaml is that what is triggering the retry, since it would need "ignore_errors: yes" otherwise? | 06:06 |
frickler | I do see in the executor log that the invocation of the cleanup playbook still has "-e zuul_will_retry=False", so that currently sounds plausible to me | 06:11 |
frickler | I think the issue is this block, which iiuc wasn't run for cleanup playbooks before; it should likely check for playbook.cleanup=False now https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L2007-L2013 | 06:30 |
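(A minimal sketch of the check frickler is describing, assuming hypothetical names like `playbook.cleanup` and `result.unreachable_hosts`; this is not Zuul's actual executor code, for which see https://review.opendev.org/c/zuul/zuul/+/924198.)

```python
# Hypothetical illustration only; attribute names are placeholders, not
# Zuul's real internals.

def should_retry(playbook, result):
    """Decide whether an unreachable host should turn the build into a retry."""
    if playbook.cleanup:
        # The cleanup playbook runs after remove-build-sshkey has revoked the
        # executor's access, so unreachable hosts here are expected and must
        # not trigger a retry of the whole build.
        return False
    # For regular pre/run/post playbooks an unreachable host usually means the
    # node died, so retrying the build on a fresh node is the right call.
    return bool(result.unreachable_hosts)
```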
frickler | should be fixed by https://review.opendev.org/c/zuul/zuul/+/924198 but zuul CI seems to be mildly unstable | 07:31 |
frickler | hmm, I do see cleanup.yml runs working in older executor logs (7d ago), so maybe my patch only fixes the symptom, but not the real cause (doesn't make it invalid though I think) | 10:41 |
opendevreview | Jens Harbott proposed opendev/system-config master: DNM: test zuul run job https://review.opendev.org/c/opendev/system-config/+/924219 | 11:30 |
fungi | frickler: presumably a result of this recent security fix: https://review.opendev.org/c/zuul/zuul/+/923903 | 12:13 |
fungi | er, rather its parent https://review.opendev.org/c/zuul/zuul/+/923874 | 12:14 |
fungi | may need to adjust the nesting order | 12:17 |
fungi | of our playbooks | 12:17 |
frickler | fungi: actually the order looks the same in the older passing logs, so maybe rather some change in the execution environment? | 12:45 |
fungi | i think we upgraded and restarted onto the security fix i linked on wednesday (6 days ago), so sounds like the same timing | 12:47 |
opendevreview | Jens Harbott proposed opendev/system-config master: DNM: test zuul run job https://review.opendev.org/c/opendev/system-config/+/924219 | 13:17 |
corvus | frickler: i made a followup to your change at https://review.opendev.org/924236 | 15:25 |
*** dxld_ is now known as dxld | 15:40 | |
frickler | corvus: thx, that confirms my assumption that there were more issues in hiding | 15:59 |
*** ykarel is now known as ykarel|away | 16:13 | |
corvus | fixes merged; i'd like to early restart zuul. should i do a hard restart of the executors or should we try to make it graceful? | 18:20 |
corvus | i don't see any release activity, so i think a hard restart would be okay now; the additional runtime of some changes may be offset by the fact that some jobs may only need to be run once now instead of retried. | 18:21 |
fungi | yeah, presumably any in-progress builds will simply be rerun | 18:23 |
fungi | seems fine | 18:23 |
fungi | unrelated (hopefully), noonedeadpunk and frickler just mentioned this odd zuul api timeout event during the openstack tc meeting: https://zuul.opendev.org/t/openstack/build/6caf416c14c44ca6b20bc834556d1d3a | 18:24 |
fungi | the scheduler/web servers have low (sub-1.0) load at the moment, but maybe something was going on earlier | 18:26 |
noonedeadpunk | well... | 18:26 |
fungi | looks like it was trying to hit https://zuul.opendev.org/api/tenant/openstack/builds?change=924217&patchset=1&pipeline=gate&job_name=openstack-tox-docs which is also taking a very long time for me right now | 18:27 |
noonedeadpunk | yeah, I've just run curl with time to see how long it would take | 18:27 |
fungi | so maybe another non-performant query planner result related to the straight join change | 18:27 |
fungi | did these seem to start over the weekend? or more recently? | 18:28 |
noonedeadpunk | I've spotted that only for docs promote just today | 18:28 |
fungi | got it | 18:28 |
noonedeadpunk | but no idea about earlier, as I rarely check promote jobs health | 18:28 |
noonedeadpunk | it timed out for me after 2m just now | 18:29 |
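(A small sketch of the timing check described above, using Python's requests instead of curl; the URL is the one fungi linked and the 150-second client timeout is an arbitrary choice for illustration.)

```python
# Time the builds API query that the docs promote job hits.
import time
import requests

URL = ("https://zuul.opendev.org/api/tenant/openstack/builds"
       "?change=924217&patchset=1&pipeline=gate&job_name=openstack-tox-docs")

start = time.monotonic()
try:
    resp = requests.get(URL, timeout=150)
    print(f"HTTP {resp.status_code} after {time.monotonic() - start:.1f}s")
except requests.exceptions.Timeout:
    print(f"timed out after {time.monotonic() - start:.1f}s")
```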
noonedeadpunk | but I've also spotted some weird timing issues opening gitea over the weekend.. Not sure if that could be related or not though... | 18:29 |
noonedeadpunk | could be my connection totally | 18:29 |
noonedeadpunk | (in case of gitea, not zuul api) | 18:30 |
fungi | well, also the gitea servers do get blasted by inconsiderate distributed crawlers from time to time | 18:33 |
fungi | we have a crazy list of user agent filters used by an apache frontend to block the ones we know about, but they keep coming up with new ones | 18:34 |
corvus | yeah that's another case where we shouldn't use straight_join, i'll patch it | 19:01 |
fungi | thanks! | 19:02 |
fungi | python 3.13.0b4 release is being pushed back to tomorrow, looks like | 19:10 |
corvus | restarting zuul now; expect a short web outage | 19:34 |
corvus | #status log restarted zuul for retry fixes | 19:51 |
opendevstatus | corvus: finished logging | 19:52 |
fungi | thanks again! | 19:53 |
corvus | [zuul/zuul] 924292: Remove straight_join in favor of index hints | 22:36 |
corvus | for the other thing ^ | 22:36 |
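(An illustrative sketch of what "remove straight_join in favor of index hints" can look like in SQLAlchemy; the table and index names below are invented for the example and are not taken from Zuul's schema or from change 924292.)

```python
# Illustrative only: STRAIGHT_JOIN vs. a MySQL index hint in SQLAlchemy.
import sqlalchemy as sa

metadata = sa.MetaData()
buildset = sa.Table("buildset", metadata,
                    sa.Column("id", sa.Integer, primary_key=True),
                    sa.Column("change", sa.Integer))
build = sa.Table("build", metadata,
                 sa.Column("id", sa.Integer, primary_key=True),
                 sa.Column("buildset_id", sa.Integer),
                 sa.Column("job_name", sa.String(255)))

join = build.join(buildset, build.c.buildset_id == buildset.c.id)

# Old approach: force MySQL to join tables in the FROM-clause order.
q_straight = sa.select(build).select_from(join).prefix_with("STRAIGHT_JOIN")

# New approach: leave join ordering to the optimizer, but hint which index
# to use for the filtered column (a MySQL-only table hint, ignored elsewhere).
q_hinted = (sa.select(build)
            .select_from(join)
            .where(build.c.job_name == "openstack-tox-docs")
            .with_hint(build, "USE INDEX (build_job_name_idx)", "mysql"))
```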