Tuesday, 2024-07-16

04:43 <frickler> found a non-ansible job that also shouldn't have been retried according to my understanding https://zuul.opendev.org/t/openstack/build/ffa3720c82aa45e2a41b4714b3a8a163
04:51 <frickler> and a "normal" tempest failure https://zuul.opendev.org/t/openstack/build/f1eac61ca5be4029b68fdafccec0abe0
04:52 *** ChanServ changes topic to "OpenDev is a space for collaborative Open Source software development | https://opendev.org/ | channel logs https://meetings.opendev.org/irclogs/%23opendev/"
05:47 <frickler> fungi: corvus: the unreachable for "get df disk usage" seems to be happening because remove-build-sshkey runs before it, so the executor rightfully cannot log into the nodes any more? so this looks like a possible side effect of https://review.opendev.org/c/zuul/zuul/+/923903?
06:06 <frickler> actually, maybe this has never been working because of the earlier remove-build-sshkey task? https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/cleanup.yaml is that what is triggering the retry, since it would need "ignore_errors: yes" otherwise?
06:11 <frickler> I do see in the executor log that the invocation of the cleanup playbook still has "-e zuul_will_retry=False", so that currently sounds plausible to me
06:30 <frickler> I think the issue is this block, which iiuc wasn't run for cleanup playbooks before; it should likely check for playbook.cleanup=False now https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L2007-L2013
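A loose Python sketch of the check frickler describes above; the function and argument names here are illustrative only, not the actual code in zuul/executor/server.py:

    # Illustrative only: an unreachable host during a cleanup playbook (which
    # runs after remove-build-sshkey has revoked the executor's access) should
    # not count as a reason to retry the whole build.
    def should_retry_build(playbook_is_cleanup: bool, hosts_unreachable: bool) -> bool:
        if playbook_is_cleanup:
            # cleanup playbooks are expected to lose access to the nodes
            return False
        return hosts_unreachable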
07:31 <frickler> should be fixed by https://review.opendev.org/c/zuul/zuul/+/924198, but zuul CI seems to be mildly unstable
10:41 <frickler> hmm, I do see cleanup.yml runs working in older executor logs (7d ago), so maybe my patch only fixes the symptom, not the real cause (doesn't make it invalid though, I think)
11:30 <opendevreview> Jens Harbott proposed opendev/system-config master: DNM: test zuul run job  https://review.opendev.org/c/opendev/system-config/+/924219
12:13 <fungi> frickler: presumably a result of this recent security fix: https://review.opendev.org/c/zuul/zuul/+/923903
12:14 <fungi> er, rather its parent https://review.opendev.org/c/zuul/zuul/+/923874
12:17 <fungi> may need to adjust the nesting order
12:17 <fungi> of our playbooks
12:45 <frickler> fungi: actually the order looks the same in the older passing logs, so maybe it's rather some change in the execution environment?
12:47 <fungi> i think we upgraded and restarted onto the security fix i linked on wednesday (6 days ago), so that sounds like the same timing
13:17 <opendevreview> Jens Harbott proposed opendev/system-config master: DNM: test zuul run job  https://review.opendev.org/c/opendev/system-config/+/924219
15:25 <corvus> frickler: i made a followup to your change at https://review.opendev.org/924236
15:40 *** dxld_ is now known as dxld
15:59 <frickler> corvus: thx, that confirms my assumption that there were more issues in hiding
16:13 *** ykarel is now known as ykarel|away
18:20 <corvus> fixes merged; i'd like to restart zuul early.  should i do a hard restart of the executors or should we try to make it graceful?
18:21 <corvus> i don't see any release activity, so i think a hard restart would be okay now; the additional runtime of some changes may be offset by the fact that some jobs may only need to run once now instead of being retried.
18:23 <fungi> yeah, presumably any in-progress builds will simply be rerun
18:23 <fungi> seems fine
18:24 <fungi> unrelated (hopefully), noonedeadpunk and frickler just mentioned this odd zuul api timeout event during the openstack tc meeting: https://zuul.opendev.org/t/openstack/build/6caf416c14c44ca6b20bc834556d1d3a
18:26 <fungi> the scheduler/web servers have low (sub-1.0) load at the moment, but maybe something was going on earlier
18:26 <noonedeadpunk> well...
18:27 <fungi> looks like it was trying to hit https://zuul.opendev.org/api/tenant/openstack/builds?change=924217&patchset=1&pipeline=gate&job_name=openstack-tox-docs which is also taking a very long time for me right now
18:27 <noonedeadpunk> yeah, I just ran curl with time to see how long it would take
18:27 <fungi> so maybe another non-performant query planner result related to the straight join change
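For reference, a small Python equivalent of the curl timing mentioned above; the URL is the one pasted in the channel, and the 180-second timeout is an arbitrary choice:

    # Rough equivalent of timing the request with curl.
    import time
    import requests

    url = ("https://zuul.opendev.org/api/tenant/openstack/builds"
           "?change=924217&patchset=1&pipeline=gate&job_name=openstack-tox-docs")

    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=180)
        print(f"HTTP {resp.status_code} after {time.monotonic() - start:.1f}s")
    except requests.exceptions.Timeout:
        print(f"timed out after {time.monotonic() - start:.1f}s")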
18:28 <fungi> did these seem to start over the weekend? or more recently?
18:28 <noonedeadpunk> I've spotted that only for docs promote, just today
18:28 <fungi> got it
18:28 <noonedeadpunk> but no idea about earlier, as I rarely check promote jobs' health
18:29 <noonedeadpunk> it timed out for me after 2m just now
18:29 <noonedeadpunk> but I've also spotted some weird timing issues opening gitea over the weekend as well.. not sure if that could be related or not though...
18:29 <noonedeadpunk> could totally be my connection
18:30 <noonedeadpunk> (in the case of gitea, not the zuul api)
18:33 <fungi> well, also the gitea servers do get blasted by inconsiderate distributed crawlers from time to time
18:34 <fungi> we have a crazy list of user agent filters used by an apache frontend to block the ones we know about, but they keep coming up with new ones
19:01 <corvus> yeah, that's another case where we shouldn't use straight_join, i'll patch it
19:02 <fungi> thanks!
19:10 <fungi> python 3.13.0b4 release is being pushed back to tomorrow, looks like
19:34 <corvus> restarting zuul now; expect a short web outage
19:51 <corvus> #status log restarted zuul for retry fixes
19:52 <opendevstatus> corvus: finished logging
19:53 <fungi> thanks again!
22:36 <corvus> [zuul/zuul] 924292: Remove straight_join in favor of index hints
22:36 <corvus> for the other thing ^
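A minimal SQLAlchemy sketch of the idea behind that change, with hypothetical table and index names (not Zuul's actual schema or query): drop the MySQL STRAIGHT_JOIN prefix, which forces the join order, in favor of a per-table index hint that leaves join planning to the optimizer.

    # Hypothetical tables standing in for the real schema.
    from sqlalchemy import Column, Integer, MetaData, String, Table, select

    metadata = MetaData()
    buildset = Table(
        "buildset", metadata,
        Column("id", Integer, primary_key=True),
        Column("change", Integer),
    )
    build = Table(
        "build", metadata,
        Column("id", Integer, primary_key=True),
        Column("buildset_id", Integer),
        Column("job_name", String(255)),
    )

    base = select(build).join(buildset, build.c.buildset_id == buildset.c.id)

    # Before: force MySQL to join the tables in the order written in the query.
    forced = base.prefix_with("STRAIGHT_JOIN", dialect="mysql")

    # After: let the planner pick the join order, but hint which index to use.
    hinted = base.with_hint(build, "USE INDEX (build_job_name_idx)", "mysql")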
