Friday, 2025-01-31

-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352  00:16
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352  00:25
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352  00:53
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352  01:13
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352  01:30
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352  02:15
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352  05:06
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352  05:22
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352  05:42
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352  06:00
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul] 940500: web: Use a select filter for pipelines and queues on status page https://review.opendev.org/c/zuul/zuul/+/940500  08:45
<@bridgefan:matrix.org> Hi, you may remember the "Resource temporarily unavailable" issue from a few days ago that Ted reported. He and I work on the same Zuul instance.  15:14
<@bridgefan:matrix.org> We believe we've narrowed the problem down to git processes not getting cleaned up when git fails to authenticate.  15:15
<@fungicide:matrix.org> bridgefan: have you tried to strace one of the stuck git processes?  15:15
<@bridgefan:matrix.org> Something similar to what Gitea seems to have observed here: https://github.com/go-gitea/gitea/issues/3242  15:15
<@bridgefan:matrix.org> Gitea seemed to solve it using: https://github.com/go-gitea/gitea/pull/19865  15:16
<@fungicide:matrix.org> i guess something similar could be done in the gitpython library  15:20
<@fungicide:matrix.org> i don't see it doing an os.setpgid() on any subprocesses currently  15:22
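The Gitea fix linked above starts git in its own process group so a timeout can kill the whole group at once. A minimal Python sketch of the same idea (not Zuul's or GitPython's actual code; the command and timeout are illustrative):

```python
import os
import signal
import subprocess

# Start git in its own session (and thus its own process group), as the
# Gitea fix does, so a timeout can kill git together with any ssh or
# credential-helper children it spawned.
proc = subprocess.Popen(
    ["git", "fetch", "origin"],
    start_new_session=True,  # runs setsid() in the child
)
try:
    proc.wait(timeout=300)
except subprocess.TimeoutExpired:
    # Killing only proc.pid would orphan git's children; signal the
    # whole group so nothing lingers waiting on a dead auth prompt.
    os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    proc.wait()  # reap the zombie
```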
<@fungicide:matrix.org> https://github.com/gitpython-developers/GitPython/issues/1756  15:23
<@fungicide:matrix.org> looks like there's a kill_after_timeout in gitpython 3.1.41 and later  15:25
<@fungicide:matrix.org> so for just over a year now  15:26
<@bridgefan:matrix.org> Do you happen to know the version of gitpython in the 11.2.0 containers?  15:26
<@bridgefan:matrix.org> Also, we tried running strace and get "strace: attach: ptrace(PTRACE_SEIZE, 52): Operation not permitted"  15:28
<@fungicide:matrix.org> looks like we don't pin a maximum version in requirements.txt, so presumably gitpython 3.1.43, which was the latest at the time zuul 11.2.0 was tagged (there's a gitpython 3.1.44 which was uploaded about a month later)  15:29
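For anyone wanting to confirm the installed version rather than infer it from requirements.txt, GitPython exposes it directly (running this inside the container is an assumption about access):

```python
import git

print(git.__version__)  # e.g. "3.1.43" in the zuul 11.2.0 images, per the above
```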
<@fungicide:matrix.org> actually that pr was for improvements to an existing kill_after_timeout feature, to take care of child processes  15:34
<@fungicide:matrix.org> bridgefan: looks like zuul.merger.merger.Repo._git_clone() and ._git_fetchInner() have been using the kill_after_timeout option all the way back to 3.0.0  15:35
<@fungicide:matrix.org> but maybe these git processes are getting initiated somewhere else where that's not being set, or it's insufficient on its own  15:35
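For reference, a minimal sketch of how kill_after_timeout is passed through GitPython's command layer (the repo path and timeout are illustrative, not Zuul's actual values):

```python
import git

repo = git.Repo("/var/lib/zuul/git/example")  # hypothetical path
try:
    # Dynamically dispatched git commands accept kill_after_timeout
    # (in seconds); GitPython kills the git process if it runs longer.
    repo.git.fetch("origin", kill_after_timeout=300)
except git.GitCommandError as exc:
    print("fetch failed or timed out:", exc)
```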
<@bridgefan:matrix.org> Do we know what the default timeout used by kill_after_timeout is?  15:37
<@fungicide:matrix.org> https://zuul-ci.org/docs/zuul/latest/configuration.html#attr-merger.git_timeout defaults to 5 minutes already, unless you override it  15:37
<@bridgefan:matrix.org> ok, we see these defunct processes still many hours later  15:37
<@bridgefan:matrix.org> Oh, you may have found something: we have zuul.conf:git_timeout=14400  15:38
<@flaper87:matrix.org> I know it's been a while, but I'm here to give the promised report. Unfortunately, it seems like this was not our problem. We have split streams enabled now but we are still seeing random MODULE_FAILURE errors. I say random because:  15:39
- They happen on multiple different jobs
- They don't happen at the same time, place, or under specific circumstances
- The MODULE_FAILURE comes without an exit code. It just reports failure, no exit code, no error in stdout/stderr
- It happens on multiple machine types (jobs are running on Kubernetes)
I've enabled ansible debug using `zuul-executor verbose` and there's nothing useful there that would indicate why these failures are happening.
Has this happened to others?
<@fungicide:matrix.org> bridgefan: also potentially related, https://github.com/gitpython-developers/GitPython/issues/1762  15:39
<@fungicide:matrix.org> no, never mind, zuul doesn't appear to set output_stream  15:41
<@fungicide:matrix.org> flaper87: any indication in, say, dmesg that a process was ended by the oom killer or something?  15:42
<@fungicide:matrix.org> and are these failures occurring on the executors, or on remote nodes?  15:43
<@bridgefan:matrix.org> We see this on schedulers  15:44
<@bridgefan:matrix.org> We are 99% sure this is due to adding a git repo that needs auth that isn't yet set up  15:44
<@bridgefan:matrix.org> So we get a bunch of defunct processes from failed auth cases  15:44
<@fungicide:matrix.org> yeah, sorry, i was trying to get clarification on the separate thing flaper87 mentioned just now  15:44
<@bridgefan:matrix.org> got it  15:45
<@fungicide:matrix.org> flaper87: reading up on potential reasons for ansible to return "module failure", it looks like it can happen in pathological situations like when your filesystem is full or stops being writeable  15:50
<@fungicide:matrix.org> also lack of sudo access with become tasks, missing/too old python, et cetera  15:52
<@bridgefan:matrix.org> fungi: we're going to investigate further. Thanks for the tip about git_timeout. It explains a lot.  16:05
<@fungicide:matrix.org> my pleasure!  16:08
<@flaper87:matrix.org> fungi: thanks for the info! Trying to figure this out. Getting logs from the nodes to see if I can get kubelet logs or something else (this is running on GKE, so no full access to the cluster nodes :D )  16:19
<@fungicide:matrix.org> yeah, unfortunately ansible is often less than forthcoming about why it decided to declare a failure in those kinds of situations. for example, we have extra debugging implemented to log back into nodes and run a df after some kinds of failures, just to see if we can spot full-filesystem situations  16:29
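A hedged sketch of the kind of post-failure debugging described here: ssh back into the node and capture df output to spot a full filesystem. The host, user, and key path are hypothetical, not the actual implementation:

```python
import subprocess

def check_disk(host, user="zuul", key="/var/lib/zuul/ssh/id_rsa"):
    """Run df -h on a remote node after a suspicious failure."""
    result = subprocess.run(
        ["ssh", "-i", key, "-o", "StrictHostKeyChecking=no",
         f"{user}@{host}", "df", "-h"],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout

# e.g. called from a failure handler:
# print(check_disk("node-0001.example.net"))
```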
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 939724: Import nodescan framework from Nodepool https://review.opendev.org/c/zuul/zuul/+/939724  17:02
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 940483: Retry nodescan connections during the key phase https://review.opendev.org/c/zuul/zuul/+/940483  17:04
-@gerrit:opendev.org- Brian Haley proposed: [zuul/zuul-jobs] 940074: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074  17:09
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 938988: Reduce parallelization in tests https://review.opendev.org/c/zuul/zuul/+/938988  17:17
-@gerrit:opendev.org- Brian Haley proposed: [zuul/zuul-jobs] 940074: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074  17:23
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 940484: Integrate nodescan worker into launcher https://review.opendev.org/c/zuul/zuul/+/940484  17:24
-@gerrit:opendev.org- Zuul merged on behalf of Brian Haley: [zuul/zuul-jobs] 940074: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074  18:10
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 940484: Integrate nodescan worker into launcher https://review.opendev.org/c/zuul/zuul/+/940484  19:08
<@bridgefan:matrix.org> Just an update on our "out of processes" issue. We have found that git_timeout doesn't help clear the zombie processes. However, we were overriding the starting command for our containers in the k8s spec to directly call things like zuul-scheduler.  21:14
<@bridgefan:matrix.org> We plan to introduce a proper init process which can clean up these zombie processes  21:15
<@bridgefan:matrix.org> So it appears to be solved  21:15
<@jim:acmegating.com> for the record, the images produced by the zuul project do include an init process  21:19
<@clarkb:matrix.org> via `ENTRYPOINT ["/usr/bin/dumb-init", "--"]`  21:20
<@jim:acmegating.com> Clark: yes, they include and run the process  21:20
<@amdted:matrix.org> Yeah, we noticed. It was our k8s setup that wasn't configured correctly. So the zuul-scheduler container had Python as PID 1 instead of dumb-init, and zombie reaping wasn't being done.  21:20
<@jim:acmegating.com> mostly wanted to make it clear to folks that this is a matter of doing "less" rather than something that might need addressing generally.  :)  21:21
<@clarkb:matrix.org> ++  21:21
<@amdted:matrix.org> No worries. 😁  21:24
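To illustrate why PID 1 matters here: a toy Python sketch of the zombie reaping an init like dumb-init performs. Real inits also forward signals to the child; that part is omitted, and the example command is illustrative:

```python
import os
import sys

# Toy PID-1 reaper: exec the real service as a child, then wait() in a
# loop so every exited descendant (including reparented orphans, such
# as git processes from failed auth) is reaped rather than left a zombie.
main = os.fork()
if main == 0:
    os.execvp(sys.argv[1], sys.argv[1:])  # e.g. zuul-scheduler -f

while True:
    pid, status = os.wait()  # blocks until any child exits; reaps it
    if pid == main:          # exit when the main service ends
        sys.exit(os.waitstatus_to_exitcode(status))
```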

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!