-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 00:16 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 00:25 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 00:53 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 01:13 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 01:30 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 02:15 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 05:06 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 05:22 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 05:42 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 06:00 | |
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul] 940500: web: Use a select filter for pipelines and queues on status page https://review.opendev.org/c/zuul/zuul/+/940500 | 08:45 | |
@bridgefan:matrix.org | Hi, you may remember the "Resource temporarily unavailable" issue Ted reported a few days ago. He and I work on the same Zuul instance. | 15:14 |
@bridgefan:matrix.org | We believe we've narrowed down the problem to an issue where git processes don't get cleared when git fails to authenticate | 15:15 |
@fungicide:matrix.org | bridgefan: have you tried to strace one of the stuck git processes? | 15:15 |
@bridgefan:matrix.org | Something similar to what gitea seems to have observed here: https://github.com/go-gitea/gitea/issues/3242 | 15:15 |
@bridgefan:matrix.org | Gitea seemed to solve it using: https://github.com/go-gitea/gitea/pull/19865 | 15:16 |
@fungicide:matrix.org | i guess something similar could be done in the gitpython library | 15:20 |
@fungicide:matrix.org | i don't see it doing an os.setpgid() on any subprocesses currently | 15:22 |
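The approach alluded to here (and what the gitea fix linked above amounts to) is: start the child in its own process group, then signal the whole group on timeout so helpers spawned by git don't linger. A minimal Python sketch of that pattern, independent of GitPython; the command and timeout are placeholders:

```python
import os
import signal
import subprocess

# Start git in its own session/process group (the effect of calling
# os.setpgid()/setsid() in the child), so git and anything it spawns
# (ssh, credential helpers) share one process group id.
proc = subprocess.Popen(
    ["git", "fetch", "origin"],   # placeholder command
    start_new_session=True,
)

try:
    proc.wait(timeout=300)        # placeholder timeout
except subprocess.TimeoutExpired:
    # Signal the whole group, not just the direct child, so no
    # grandchild is left behind waiting on an auth prompt.
    os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    proc.wait()                   # reap the child; avoids a zombie
```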
@fungicide:matrix.org | https://github.com/gitpython-developers/GitPython/issues/1756 | 15:23 |
@fungicide:matrix.org | looks like there's a killaftertimeout in gitpython 3.1.41 and later | 15:25 |
@fungicide:matrix.org | so for just over a year now | 15:26 |
@bridgefan:matrix.org | Do you happen to know the version of gitpython in the 11.2.0 containers? | 15:26 |
@bridgefan:matrix.org | Also we tried running strace and get "strace: attach: ptrace(PTRACE_SEIZE, 52): Operation not permitted" | 15:28 |
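That ptrace error is typical when the container lacks the SYS_PTRACE capability or kernel.yama.ptrace_scope disallows attaching; a process that is already defunct has no code left to trace anyway. An alternative that needs no extra privileges is to read /proc directly. A rough sketch for listing zombies and their parent PIDs (Linux-only, field layout per proc(5)):

```python
import os

# Walk /proc and report defunct (zombie) processes with their parents.
# /proc/<pid>/stat looks like: "<pid> (<comm>) <state> <ppid> ...".
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        continue  # the process exited between listdir() and open()
    comm = stat[stat.index("(") + 1:stat.rindex(")")]
    state, ppid = stat[stat.rindex(")") + 2:].split()[:2]
    if state == "Z":
        print(f"zombie pid={pid} comm={comm!r} ppid={ppid}")
```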
@fungicide:matrix.org | looks like we don't pin a maximum version in requirements.txt, so presumably gitpython 3.1.43 which was the latest at the time zuul 11.2.0 was tagged (there's a gitpython 3.1.44 which was uploaded about a month later) | 15:29 |
@fungicide:matrix.org | actually that pr was for improvements to an existing kill_after_timeout feature to take care of child processes | 15:34 |
@fungicide:matrix.org | bridgefan: looks like zuul.merger.merger.Repo._git_clone() and ._git_fetchInner() have been using the kill_after_timeout option all the way back to 3.0.0 | 15:35 |
@fungicide:matrix.org | but maybe these git processes are getting initiated somewhere else where that's not being set, or it's insufficient on its own | 15:35 |
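For anyone reading along: kill_after_timeout is a keyword argument understood by GitPython's command-execution layer, so it can be supplied on individual git invocations; per the merger code referenced above, this is the parameter Zuul's git_timeout value is passed through as. A hedged sketch of the usage pattern (the repo path, remote name, and timeout are placeholders):

```python
import git

repo = git.Repo("/var/lib/zuul/example-repo")  # placeholder path

try:
    # If `git fetch` has not finished within the timeout (seconds),
    # GitPython kills the process -- and, since 3.1.41, its child
    # processes too -- instead of hanging forever on e.g. a stuck
    # authentication prompt.
    repo.git.fetch("origin", kill_after_timeout=300)
except git.exc.GitCommandError as exc:
    print(f"fetch failed or was killed after timeout: {exc.status}")
```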
@bridgefan:matrix.org | Do we know what the default timeout used by kill_after_timeout is? | 15:37 |
@fungicide:matrix.org | https://zuul-ci.org/docs/zuul/latest/configuration.html#attr-merger.git_timeout defaults to 5 minutes already, unless you overwrite it | 15:37 |
@bridgefan:matrix.org | ok, we still see these defunct processes many hours later | 15:37 |
@bridgefan:matrix.org | Oh you may have found something: we have zuul.conf:git_timeout=14400 | 15:38 |
@flaper87:matrix.org | I know it's been a while, but I'm here to give the promised report. Unfortunately, it seems this was not our problem. We have split streams enabled now, but we are still seeing random MODULE_FAILURE errors. I say random because: | 15:39 |
- They happen on multiple different jobs | ||
- They don't happen at the same time, place, or under specific circumstances | ||
- The MODULE_FAILURE comes without an exit code; it just reports failure, with no error in stdout/stderr | ||
- It happens on multiple machine types (Jobs are running on Kubernetes) | ||
I've enabled ansible debug using `zuul-executor verbose` and there's nothing useful there that would indicate why these failures are happening. | ||
Has this happened to others? | ||
@fungicide:matrix.org | bridgefan: also potentially related, https://github.com/gitpython-developers/GitPython/issues/1762 | 15:39 |
@fungicide:matrix.org | no, never mind, zuul doesn't appear to set output_stream | 15:41 |
@fungicide:matrix.org | flaper87: any indication in, say, dmesg indicating that there was a process ended by the oom killer or something? | 15:42 |
@fungicide:matrix.org | and this is failures occurring on the executors, or on remote nodes? | 15:43 |
@bridgefan:matrix.org | We see this on schedulers | 15:44 |
@bridgefan:matrix.org | We are 99% sure this is due to adding a git repo that needs auth which isn't set up yet | 15:44 |
@bridgefan:matrix.org | So we get a bunch of defunct processes from failed auth cases | 15:44 |
@fungicide:matrix.org | yeah, sorry, i was trying to get clarification on the separate thing flaper87 mentioned just now | 15:44 |
@bridgefan:matrix.org | got it | 15:45 |
@fungicide:matrix.org | flaper87: reading up on potential reasons for ansible to return "module failure" it looks like it can happen in pathological situations like when your filesystem is full or stops being writeable | 15:50 |
@fungicide:matrix.org | also lack of sudo access with become tasks, missing/too old python, et cetera | 15:52 |
@bridgefan:matrix.org | fungi: we're going to investigate further. Thanks for the tip about git_timeout. It explains a lot. | 16:05 |
@fungicide:matrix.org | my pleasure! | 16:08 |
@flaper87:matrix.org | fungi: thanks for the info! Trying to figure this out. Getting logs from the nodes, see if I can get kubelet logs or something else (this is running on GKE so, not full access to the cluster nodes :D ) | 16:19 |
@fungicide:matrix.org | yeah, unfortunately ansible is often less than forthcoming about why it decided to declare a failure in those kinds of situations. for example, we have extra debugging implemented to log back into nodes and run a df after some kinds of failures just to see if we can spot full filesystem situations | 16:29 |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 939724: Import nodescan framework from Nodepool https://review.opendev.org/c/zuul/zuul/+/939724 | 17:02 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 940483: Retry nodescan connections during the key phase https://review.opendev.org/c/zuul/zuul/+/940483 | 17:04 | |
-@gerrit:opendev.org- Brian Haley proposed: [zuul/zuul-jobs] 940074: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074 | 17:09 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 938988: Reduce parallelization in tests https://review.opendev.org/c/zuul/zuul/+/938988 | 17:17 | |
-@gerrit:opendev.org- Brian Haley proposed: [zuul/zuul-jobs] 940074: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074 | 17:23 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 940484: Integrate nodescan worker into launcher https://review.opendev.org/c/zuul/zuul/+/940484 | 17:24 | |
-@gerrit:opendev.org- Zuul merged on behalf of Brian Haley: [zuul/zuul-jobs] 940074: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074 | 18:10 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 940484: Integrate nodescan worker into launcher https://review.opendev.org/c/zuul/zuul/+/940484 | 19:08 | |
@bridgefan:matrix.org | Just an update on our out-of-processes issue. We have found that git_timeout doesn't help clear the zombie processes. However, we were overriding the starting command for our containers in the k8s spec to directly call things like zuul-scheduler. | 21:14 |
@bridgefan:matrix.org | We plan to introduce a proper init process which can clean these zombie processes | 21:15 |
@bridgefan:matrix.org | So it appears to be solved | 21:15 |
@jim:acmegating.com | for the record, the images produced by the zuul project do include an init process | 21:19 |
@clarkb:matrix.org | via `ENTRYPOINT ["/usr/bin/dumb-init", "--"]` | 21:20 |
@jim:acmegating.com | Clark: yes, they include and run the process | 21:20 |
@amdted:matrix.org | Yeah, we noticed. It was our k8s setup that wasn't configured correctly: the zuul-scheduler container had Python as PID 1 instead of dumb-init, so zombie reaping wasn't being done. | 21:20 |
@jim:acmegating.com | mostly wanted to make it clear to folks that this is a matter of doing "less" rather than something that might need addressing generally. :) | 21:21 |
@clarkb:matrix.org | ++ | 21:21 |
@amdted:matrix.org | No worries. 😁 | 21:24 |
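To spell out the Kubernetes detail for anyone who hits this later: setting `command:` in a pod spec replaces the image's ENTRYPOINT, so the dumb-init shipped in the Zuul images is bypassed and the service itself becomes PID 1, which never wait()s for orphaned children; overriding `args:` instead (or putting dumb-init back at the front of `command:`) keeps the reaper in place. What that reaping amounts to, as a simplified Python illustration rather than dumb-init's actual code:

```python
import os
import signal
import subprocess
import sys

def reap_children(signum, frame):
    # Reap every child that has exited, including orphans the kernel
    # reparented to us because we are PID 1, so none of them linger
    # as <defunct> zombies.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap_children)

# Run the real service as a child rather than exec()ing it, so this
# process stays PID 1 and keeps reaping. The command is a placeholder.
service = subprocess.Popen(sys.argv[1:] or ["sleep", "60"])
sys.exit(service.wait())
```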