stevebaker | Hey I've got a change which has been "running" a CI job for 72 hours, is there any way it can be killed manually? https://zuul.openstack.org/status#804000 dib-functests-bionic-python3-extras | 20:16 |
---|---|---|
fungi | stevebaker: i'll take a quick look to see why it might be stuck first, but sure | 21:19 |
stevebaker | fungi: much appreciated | 21:19 |
fungi | the dib-functests-bionic-python3-extras build for it seems to still have a functional console, but all it logged was: | 21:21 |
fungi | 2021-09-30 18:55:05.598946 | Job console starting... | 21:22 |
fungi | 2021-09-30 18:55:05.613329 | Updating repositories | 21:22 |
fungi | 2021-09-30 18:55:05.883662 | Preparing job workspace | 21:22 |
fungi | and it's just been sitting there ever since | 21:23 |
stevebaker | yeah I noticed that | 21:23 |
fungi | checking the executor log once i work out which one it got farmed out to | 21:23 |
fungi | this is the incomplete build page for it, for future reference: https://zuul.opendev.org/t/openstack/build/f51e9d26fcd3458b9da5fa3f934e4aa6 | 21:25 |
fungi | ze04 is the executor it ended up on | 21:28 |
fungi | it's been looping this over and over in its log: | 21:28 |
fungi | 2021-10-03 21:26:30,843 DEBUG zuul.ExecutorServer: Finishing Job: 76fc8aac5f444f3e89998ad6697caae9, queue(5): {'f51e9d26fcd3458b9da5fa3f934e4aa6': <zuul.executor.server.AnsibleJob object at 0x7f48706bfdf0>, '08bf701e3811444f9945626d92210591': <zuul.executor.server.AnsibleJob object at 0x7f48507251c0>, '7c7b642ba1004a49988226f29bd3a9f5': <zuul.executor.server.AnsibleJob object at | 21:28 |
fungi | 0x7f487036dfd0>, '5aeadf4a4c9e4738a0a6bdb266b09b6c': <zuul.executor.server.AnsibleJob object at 0x7f48706d6670>, 'a451017c33674dba909a0cdfc3b2d473': <zuul.executor.server.AnsibleJob object at 0x7f4810798ca0>} | 21:28 |
fungi | the enqueue event id was 840d69f7aacc4f59bdff85271e8dfdb3 | 21:31 |
fungi | the last thing about it in the debug executor log seems to be this: | 21:35 |
fungi | Cloning gerrit/openstack/diskimage-builder | 21:36 |
fungi | 2021-09-30 18:55:05,884 DEBUG zuul.AnsibleJob: [e: 840d69f7aacc4f59bdff85271e8dfdb3] [build: f51e9d26fcd3458b9da5fa3f934e4aa6] Cloning gerrit/openstack/diskimage-builder | 21:36 |
fungi | this is everything it logged for that combination of build id and event id: https://paste.opendev.org/show/809751 | 21:38 |
stevebaker | so it's just stalled on cloning | 21:39 |
stevebaker | or whatever happens after that | 21:39 |
fungi | yeah, i suspect if i can find the node it's using there will be a hung git process, but what i can't figure out is why the playbook timeout didn't kick in | 21:40 |
stevebaker | maybe the timeout isn't applied this early in the build | 21:43 |
fungi | ahh, yeah the node which got assigned is still in a ready state, i guess that git clone was in the workspace on the executor prior to being synced to the node, though i'm surprised we don't mark the node in-use before then too | 21:47 |
fungi | according to https://wiki.openstack.org/wiki/Infrastructure_Status the networking issues in vexxhost impacting the gerrit server started around 19:10 utc that day, but maybe they were impacting things a few minutes prior to that? | 21:50 |
fungi | would explain how git got hung | 21:51 |
stevebaker | I see | 21:54 |
fungi | looks like /var/lib/zuul/builds/f51e9d26fcd3458b9da5fa3f934e4aa6/work/logs/job-output.txt has open file descriptors from a zuul-executor fork on ze04, but i don't see any child processes of that | 21:56 |
fungi | anyway, i've probably extracted about as much info as i can about the situation, i'll go ahead and dequeue that change | 21:57 |
stevebaker | fungi: ok, thanks | 21:57 |
fungi | stevebaker: it's gone now, if you want to recheck | 21:58 |
stevebaker | sweet | 21:58 |
fungi | i'll confer with other zuulfolk on that and see if there are ways we could catch similar cases in the future | 21:59 |
stevebaker | fungi: ok, thanks for your help | 21:59 |
fungi | stevebaker: any time, and thanks for pointing out the issue | 22:17 |
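
The step in the log where everything recorded for one build id and event id is pulled out of the executor debug log can be sketched roughly as below. This is a minimal illustration, not the tooling actually used in the channel: the regular expression is modeled only on the single `zuul.AnsibleJob` line quoted above, and the names `LINE_RE`, `filter_build_lines`, and `last_entry` are hypothetical helpers, not part of Zuul.

```python
import re
from typing import Dict, Iterable, Iterator, Optional

# Assumption: this pattern is modeled on the one executor log line quoted in
# the chat; other Zuul log lines may be formatted differently and are skipped.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<logger>[\w.]+): "
    r"\[e: (?P<event>[0-9a-f]+)\] \[build: (?P<build>[0-9a-f]+)\] "
    r"(?P<message>.*)$"
)


def filter_build_lines(
    lines: Iterable[str], build_id: str, event_id: Optional[str] = None
) -> Iterator[Dict[str, str]]:
    """Yield parsed entries for one build, optionally narrowed to one event."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # line carries no [e: ...] [build: ...] annotation
        if match["build"] != build_id:
            continue
        if event_id is not None and match["event"] != event_id:
            continue
        yield match.groupdict()


def last_entry(
    lines: Iterable[str], build_id: str, event_id: Optional[str] = None
) -> Optional[Dict[str, str]]:
    """Return the final matching entry, i.e. the last thing the build logged."""
    result = None
    for entry in filter_build_lines(lines, build_id, event_id):
        result = entry
    return result


# Sample input: two lines taken from the chat above (the first abbreviated).
sample = [
    "2021-10-03 21:26:30,843 DEBUG zuul.ExecutorServer: Finishing Job: "
    "76fc8aac5f444f3e89998ad6697caae9",
    "2021-09-30 18:55:05,884 DEBUG zuul.AnsibleJob: "
    "[e: 840d69f7aacc4f59bdff85271e8dfdb3] "
    "[build: f51e9d26fcd3458b9da5fa3f934e4aa6] "
    "Cloning gerrit/openstack/diskimage-builder",
]

entry = last_entry(
    sample,
    build_id="f51e9d26fcd3458b9da5fa3f934e4aa6",
    event_id="840d69f7aacc4f59bdff85271e8dfdb3",
)
if entry is not None:
    print(entry["timestamp"], entry["message"])
```

On a real executor the `lines` argument would presumably be an open file handle on the debug log; its path is not given in the conversation, so it is left out here.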