frickler | found a non-ansible job that also shouldn't have been retried according to my understanding https://zuul.opendev.org/t/openstack/build/ffa3720c82aa45e2a41b4714b3a8a163 | 04:43 |
frickler | and a "normal" tempest failure https://zuul.opendev.org/t/openstack/build/f1eac61ca5be4029b68fdafccec0abe0 | 04:51 |
*** ChanServ changes topic to "OpenDev is a space for collaborative Open Source software development | https://opendev.org/ | channel logs https://meetings.opendev.org/irclogs/%23opendev/" | 04:52 | |
frickler | fungi: corvus: the unreachable for "get df disk usage" seems to be happening because remove-build-sshkey is running before that, so the executor rightfully cannot log into the nodes any more? so this looks like some side effect of https://review.opendev.org/c/zuul/zuul/+/923903 possibly? | 05:47 |
frickler | actually, maybe this has never worked because of the earlier remove-build-sshkey task? https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/cleanup.yaml is that what is triggering the retry, since it would need "ignore_errors: yes" otherwise? | 06:06 |
frickler | I do see in the executor log that the invocation of the cleanup playbook still has "-e zuul_will_retry=False", so that currently sounds plausible to me | 06:11 |
frickler | I think the issue is this block, which iiuc wasn't run for cleanup playbooks before; it should likely check for playbook.cleanup=False now https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L2007-L2013 | 06:30 |
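(A minimal sketch of the check frickler is describing, assuming hypothetical names like `playbook.cleanup` and `result.unreachable_hosts`; this is not Zuul's actual executor code, for which see https://review.opendev.org/c/zuul/zuul/+/924198.)

```python
# Hypothetical illustration only; attribute names are placeholders, not
# Zuul's real internals.

def should_retry(playbook, result):
    """Decide whether an unreachable host should turn the build into a retry."""
    if playbook.cleanup:
        # The cleanup playbook runs after remove-build-sshkey has revoked the
        # executor's access, so unreachable hosts here are expected and must
        # not trigger a retry of the whole build.
        return False
    # For regular pre/run/post playbooks an unreachable host usually means the
    # node died, so retrying the build on a fresh node is the right call.
    return bool(result.unreachable_hosts)
```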
frickler | should be fixed by https://review.opendev.org/c/zuul/zuul/+/924198 but zuul CI seems to be mildly unstable | 07:31 |
frickler | hmm, I do see cleanup.yml runs working in older executor logs (7d ago), so maybe my patch only fixes the symptom, but not the real cause (doesn't make it invalid though I think) | 10:41 |
opendevreview | Jens Harbott proposed opendev/system-config master: DNM: test zuul run job https://review.opendev.org/c/opendev/system-config/+/924219 | 11:30 |
fungi | frickler: presumably a result of this recent security fix: https://review.opendev.org/c/zuul/zuul/+/923903 | 12:13 |
fungi | er, rather its parent https://review.opendev.org/c/zuul/zuul/+/923874 | 12:14 |
fungi | may need to adjust the nesting order | 12:17 |
fungi | of our playbooks | 12:17 |
frickler | fungi: actually the order looks the same in the older passing logs, so maybe rather some change in the execution environment? | 12:45 |
fungi | i think we upgraded and restarted onto the security fix i linked on wednesday (6 days ago), so sounds like the same timing | 12:47 |
opendevreview | Jens Harbott proposed opendev/system-config master: DNM: test zuul run job https://review.opendev.org/c/opendev/system-config/+/924219 | 13:17 |
corvus | frickler: i made a followup to your change at https://review.opendev.org/924236 | 15:25 |
*** dxld_ is now known as dxld | 15:40 | |
frickler | corvus: thx, that confirms my assumption that there were more issues in hiding | 15:59 |
*** ykarel is now known as ykarel|away | 16:13 | |
corvus | fixes merged; i'd like to early restart zuul. should i do a hard restart of the executors or should we try to make it graceful? | 18:20 |
corvus | i don't see any release activity, so i think a hard restart would be okay now; the additional runtime of some changes may be offset by the fact that some jobs may only need to be run once now instead of retried. | 18:21 |
fungi | yeah, presumably any in-progress builds will simply be rerun | 18:23 |
fungi | seems fine | 18:23 |
fungi | unrelated (hopefully), noonedeadpunk and frickler just mentioned this odd zuul api timeout event during the openstack tc meeting: https://zuul.opendev.org/t/openstack/build/6caf416c14c44ca6b20bc834556d1d3a | 18:24 |
fungi | the scheduler/web servers have low (sub-1.0) load at the moment, but maybe something was going on earlier | 18:26 |
noonedeadpunk | well... | 18:26 |
fungi | looks like it was trying to hit https://zuul.opendev.org/api/tenant/openstack/builds?change=924217&patchset=1&pipeline=gate&job_name=openstack-tox-docs which is also taking a very long time for me right now | 18:27 |
noonedeadpunk | yeah, I've just run curl with time to see how long it would take | 18:27 |
fungi | so maybe another non-performant query planner result related to the straight join change | 18:27 |
fungi | did these seem to start over the weekend? or more recently? | 18:28 |
noonedeadpunk | I've spotted that only for docs promote just today | 18:28 |
fungi | got it | 18:28 |
noonedeadpunk | but no idea about earlier, as I rarely check promote jobs health | 18:28 |
noonedeadpunk | it timed out for me after 2m just now | 18:29 |
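(A small sketch of the timing check described above, using Python's requests instead of curl; the URL is the one fungi linked and the 150-second client timeout is an arbitrary choice for illustration.)

```python
# Time the builds API query that the docs promote job hits.
import time
import requests

URL = ("https://zuul.opendev.org/api/tenant/openstack/builds"
       "?change=924217&patchset=1&pipeline=gate&job_name=openstack-tox-docs")

start = time.monotonic()
try:
    resp = requests.get(URL, timeout=150)
    print(f"HTTP {resp.status_code} after {time.monotonic() - start:.1f}s")
except requests.exceptions.Timeout:
    print(f"timed out after {time.monotonic() - start:.1f}s")
```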
noonedeadpunk | but I've also spotted some weird timing issues opening gitea over the weekend.. Not sure if that could be related or not though... | 18:29 |
noonedeadpunk | could be my connection totally | 18:29 |
noonedeadpunk | (in case of gitea, not zuul api) | 18:30 |
fungi | well, also the gitea servers do get blasted by inconsiderate distributed crawlers from time to time | 18:33 |
fungi | we have a crazy list of user agent filters used by an apache frontend to block the ones we know about, but they keep coming up with new ones | 18:34 |
corvus | yeah that's another case where we shouldn't use straight_join, i'll patch it | 19:01 |
fungi | thanks! | 19:02 |
fungi | python 3.13.0b4 release is being pushed back to tomorrow, looks like | 19:10 |
corvus | restarting zuul now; expect a short web outage | 19:34 |
corvus | #status log restarted zuul for retry fixes | 19:51 |
opendevstatus | corvus: finished logging | 19:52 |
fungi | thanks again! | 19:53 |
corvus | [zuul/zuul] 924292: Remove straight_join in favor of index hints | 22:36 |
corvus | for the other thing ^ | 22:36 |
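(An illustrative sketch of what "remove straight_join in favor of index hints" can look like in SQLAlchemy; the table and index names below are invented for the example and are not taken from Zuul's schema or from change 924292.)

```python
# Illustrative only: STRAIGHT_JOIN vs. a MySQL index hint in SQLAlchemy.
import sqlalchemy as sa

metadata = sa.MetaData()
buildset = sa.Table("buildset", metadata,
                    sa.Column("id", sa.Integer, primary_key=True),
                    sa.Column("change", sa.Integer))
build = sa.Table("build", metadata,
                 sa.Column("id", sa.Integer, primary_key=True),
                 sa.Column("buildset_id", sa.Integer),
                 sa.Column("job_name", sa.String(255)))

join = build.join(buildset, build.c.buildset_id == buildset.c.id)

# Old approach: force MySQL to join tables in the FROM-clause order.
q_straight = sa.select(build).select_from(join).prefix_with("STRAIGHT_JOIN")

# New approach: leave join ordering to the optimizer, but hint which index
# to use for the filtered column (a MySQL-only table hint, ignored elsewhere).
q_hinted = (sa.select(build)
            .select_from(join)
            .where(build.c.job_name == "openstack-tox-docs")
            .with_hint(build, "USE INDEX (build_job_name_idx)", "mysql"))
```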