-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 00:16 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 00:25 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 00:53 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 01:13 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 01:30 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 02:15 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 05:06 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 05:22 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 05:42 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 06:00 | |
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul] 940500: web: Use a select filter for pipelines and queues on status page https://review.opendev.org/c/zuul/zuul/+/940500 | 08:45 | |
@bridgefan:matrix.org | Hi, you may remember the "Resource temporarily unavailable" issue Ted reported a few days ago. He and I work on the same Zuul instance. | 15:14 |
@bridgefan:matrix.org | We believe we've narrowed down the problem to an issue where git processes don't get cleared when git fails to authenticate | 15:15 |
@fungicide:matrix.org | bridgefan: have you tried to strace one of the stuck git processes? | 15:15 |
@bridgefan:matrix.org | Something similar to what gitea seems to have observed here: https://github.com/go-gitea/gitea/issues/3242 | 15:15 |
@bridgefan:matrix.org | Gitea seemed to solve it using: https://github.com/go-gitea/gitea/pull/19865 | 15:16 |
@fungicide:matrix.org | i guess something similar could be done in the gitpython library | 15:20 |
@fungicide:matrix.org | i don't see it doing an os.setpgid() on any subprocesses currently | 15:22 |
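The approach alluded to here (and what the gitea fix linked above amounts to) is: start the child in its own process group, then signal the whole group on timeout so helpers spawned by git don't linger. A minimal Python sketch of that pattern, independent of GitPython; the command and timeout are placeholders:

```python
import os
import signal
import subprocess

# Start git in its own session/process group (the effect of calling
# os.setpgid()/setsid() in the child), so git and anything it spawns
# (ssh, credential helpers) share one process group id.
proc = subprocess.Popen(
    ["git", "fetch", "origin"],   # placeholder command
    start_new_session=True,
)

try:
    proc.wait(timeout=300)        # placeholder timeout
except subprocess.TimeoutExpired:
    # Signal the whole group, not just the direct child, so no
    # grandchild is left behind waiting on an auth prompt.
    os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    proc.wait()                   # reap the child; avoids a zombie
```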
@fungicide:matrix.org | https://github.com/gitpython-developers/GitPython/issues/1756 | 15:23 |
@fungicide:matrix.org | looks like there's a killaftertimeout in gitpython 3.1.41 and later | 15:25 |
@fungicide:matrix.org | so for just over a year now | 15:26 |
@bridgefan:matrix.org | Do you happen to know the version of gitpython in the 11.2.0 containers? | 15:26 |
@bridgefan:matrix.org | Also we tried running strace and get "strace: attach: ptrace(PTRACE_SEIZE, 52): Operation not permitted" | 15:28 |
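That ptrace error is typical when the container lacks the SYS_PTRACE capability or kernel.yama.ptrace_scope disallows attaching; a process that is already defunct has no code left to trace anyway. An alternative that needs no extra privileges is to read /proc directly. A rough sketch for listing zombies and their parent PIDs (Linux-only, field layout per proc(5)):

```python
import os

# Walk /proc and report defunct (zombie) processes with their parents.
# /proc/<pid>/stat looks like: "<pid> (<comm>) <state> <ppid> ...".
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        continue  # the process exited between listdir() and open()
    comm = stat[stat.index("(") + 1:stat.rindex(")")]
    state, ppid = stat[stat.rindex(")") + 2:].split()[:2]
    if state == "Z":
        print(f"zombie pid={pid} comm={comm!r} ppid={ppid}")
```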
@fungicide:matrix.org | looks like we don't pin a maximum version in requirements.txt, so presumably gitpython 3.1.43 which was the latest at the time zuul 11.2.0 was tagged (there's a gitpython 3.1.44 which was uploaded about a month later) | 15:29 |
@fungicide:matrix.org | actually that pr was for improvements to an existing kill_after_timeout feature to take care of child processes | 15:34 |
@fungicide:matrix.org | bridgefan: looks like zuul.merger.merger.Repo._git_clone() and ._git_fetchInner() have been using the kill_after_timeout option all the way back to 3.0.0 | 15:35 |
@fungicide:matrix.org | but maybe these git processes are getting initiated somewhere else where that's not being set, or it's insufficient on its own | 15:35 |
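For anyone reading along: kill_after_timeout is a keyword argument understood by GitPython's command-execution layer, so it can be supplied on individual git invocations; per the merger code referenced above, this is the parameter Zuul's git_timeout value is passed through as. A hedged sketch of the usage pattern (the repo path, remote name, and timeout are placeholders):

```python
import git

repo = git.Repo("/var/lib/zuul/example-repo")  # placeholder path

try:
    # If `git fetch` has not finished within the timeout (seconds),
    # GitPython kills the process -- and, since 3.1.41, its child
    # processes too -- instead of hanging forever on e.g. a stuck
    # authentication prompt.
    repo.git.fetch("origin", kill_after_timeout=300)
except git.exc.GitCommandError as exc:
    print(f"fetch failed or was killed after timeout: {exc.status}")
```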
@bridgefan:matrix.org | Do we know what the default timeout used by kill_after_timeout is? | 15:37 |
@fungicide:matrix.org | https://zuul-ci.org/docs/zuul/latest/configuration.html#attr-merger.git_timeout defaults to 5 minutes already, unless you overwrite it | 15:37 |
@bridgefan:matrix.org | ok, we still see these defunct processes many hours later | 15:37 |
@bridgefan:matrix.org | Oh you may have found something: we have zuul.conf:git_timeout=14400 | 15:38 |
@flaper87:matrix.org | I know it's been a while, but I'm here to give the promised report. Unfortunately, it seems this was not our problem. We have split streams enabled now, but we are still seeing random MODULE_FAILURE errors. I say random because: | 15:39 |
- They happen on multiple different jobs | ||
- They don't happen at the same time, place, or under specific circumstances | ||
- The MODULE_FAILURE comes without an exit code; it just reports failure, with no error in stdout/stderr | ||
- It happens on multiple machine types (Jobs are running on Kubernetes) | ||
I've enabled ansible debug using `zuul-executor verbose` and there's nothing useful there that would indicate why these failures are happening. | ||
Has this happened to others? | ||
@fungicide:matrix.org | bridgefan: also potentially related, https://github.com/gitpython-developers/GitPython/issues/1762 | 15:39 |
@fungicide:matrix.org | no, never mind, zuul doesn't appear to set output_stream | 15:41 |
@fungicide:matrix.org | flaper87: any indication in, say, dmesg indicating that there was a process ended by the oom killer or something? | 15:42 |
@fungicide:matrix.org | and this is failures occurring on the executors, or on remote nodes? | 15:43 |
@bridgefan:matrix.org | We see this on schedulers | 15:44 |
@bridgefan:matrix.org | We are 99% sure this is due to adding a git repo that needs auth which isn't set up yet | 15:44 |
@bridgefan:matrix.org | So we get a bunch of defunct processes from failed auth cases | 15:44 |
@fungicide:matrix.org | yeah, sorry, i was trying to get clarification on the separate thing flaper87 mentioned just now | 15:44 |
@bridgefan:matrix.org | got it | 15:45 |
@fungicide:matrix.org | flaper87: reading up on potential reasons for ansible to return "module failure" it looks like it can happen in pathological situations like when your filesystem is full or stops being writeable | 15:50 |
@fungicide:matrix.org | also lack of sudo access with become tasks, missing/too old python, et cetera | 15:52 |
@bridgefan:matrix.org | fungi: we're going to investigate further. Thanks for the tip about git_timeout. It explains a lot. | 16:05 |
@fungicide:matrix.org | my pleasure! | 16:08 |
@flaper87:matrix.org | fungi: thanks for the info! Trying to figure this out. Getting logs from the nodes, see if I can get kubelet logs or something else (this is running on GKE so, not full access to the cluster nodes :D ) | 16:19 |
@fungicide:matrix.org | yeah, unfortunately ansible is often less than forthcoming about why it decided to declare a failure in those kinds of situations. for example, we have extra debugging implemented to log back into nodes and run a df after some kinds of failures just to see if we can spot full filesystem situations | 16:29 |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 939724: Import nodescan framework from Nodepool https://review.opendev.org/c/zuul/zuul/+/939724 | 17:02 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 940483: Retry nodescan connections during the key phase https://review.opendev.org/c/zuul/zuul/+/940483 | 17:04 | |
-@gerrit:opendev.org- Brian Haley proposed: [zuul/zuul-jobs] 940074: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074 | 17:09 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 938988: Reduce parallelization in tests https://review.opendev.org/c/zuul/zuul/+/938988 | 17:17 | |
-@gerrit:opendev.org- Brian Haley proposed: [zuul/zuul-jobs] 940074: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074 | 17:23 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 940484: Integrate nodescan worker into launcher https://review.opendev.org/c/zuul/zuul/+/940484 | 17:24 | |
-@gerrit:opendev.org- Zuul merged on behalf of Brian Haley: [zuul/zuul-jobs] 940074: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074 | 18:10 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 940484: Integrate nodescan worker into launcher https://review.opendev.org/c/zuul/zuul/+/940484 | 19:08 | |
@bridgefan:matrix.org | Just an update on our out-of-processes issue. We have found that git_timeout doesn't help clear the zombie processes. However, we were overriding the starting command for our containers in the k8s spec to directly call things like zuul-scheduler. | 21:14 |
@bridgefan:matrix.org | We plan to introduce a proper init process which can clean these zombie processes | 21:15 |
@bridgefan:matrix.org | So it appears to be solved | 21:15 |
@jim:acmegating.com | for the record, the images produced by the zuul project do include an init process | 21:19 |
@clarkb:matrix.org | via `ENTRYPOINT ["/usr/bin/dumb-init", "--"]` | 21:20 |
@jim:acmegating.com | Clark: yes, they include and run the process | 21:20 |
@amdted:matrix.org | Yeah, we noticed. It was our k8s setup that wasn't configured correctly: the zuul-scheduler container had Python as PID 1 instead of dumb-init, so zombie reaping wasn't being done. | 21:20 |
@jim:acmegating.com | mostly wanted to make it clear to folks that this is a matter of doing "less" rather than something that might need addressing generally. :) | 21:21 |
@clarkb:matrix.org | ++ | 21:21 |
@amdted:matrix.org | No worries. 😁 | 21:24 |
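To spell out the Kubernetes detail for anyone who hits this later: setting `command:` in a pod spec replaces the image's ENTRYPOINT, so the dumb-init shipped in the Zuul images is bypassed and the service itself becomes PID 1, which never wait()s for orphaned children; overriding `args:` instead (or putting dumb-init back at the front of `command:`) keeps the reaper in place. What that reaping amounts to, as a simplified Python illustration rather than dumb-init's actual code:

```python
import os
import signal
import subprocess
import sys

def reap_children(signum, frame):
    # Reap every child that has exited, including orphans the kernel
    # reparented to us because we are PID 1, so none of them linger
    # as <defunct> zombies.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap_children)

# Run the real service as a child rather than exec()ing it, so this
# process stays PID 1 and keeps reaping. The command is a placeholder.
service = subprocess.Popen(sys.argv[1:] or ["sleep", "60"])
sys.exit(service.wait())
```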