-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 00:03 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 00:08 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 02:50 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 02:53 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 03:17 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 03:28 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 03:33 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 03:47 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 04:08 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 06:42 | |
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 940361: Pin boto3 Ansible requirement https://review.opendev.org/c/zuul/zuul/+/940361 | 08:29 | |
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 938011: Enforce required length check in config https://review.opendev.org/c/zuul/zuul/+/938011 | 11:23 | |
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 938463: sqlreporter: ensure build end data is stored even if log URL is too long https://review.opendev.org/c/zuul/zuul/+/938463 | 11:24 | |
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 938463: sqlreporter: ensure build end data is stored even if log URL is too long https://review.opendev.org/c/zuul/zuul/+/938463 | 11:28 | |
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 938128: autohold REST API: add ref filter validation https://review.opendev.org/c/zuul/zuul/+/938128 | 11:29 | |
@joao15130:matrix.org | Hello, sorry for asking so many questions but I'm learning. | 13:01 |
---|---|---|
@joao15130:matrix.org | * Hello, sorry for asking so many questions but I'm learning. | 13:02 |
I'm trying to reinstall zuul using the same procedure that comes with the quick start and I couldn't make it | ||
I see multiple errors when I compose the containers from the logs: | ||
`[launcher] | 2025-01-29 12:58:47,402 WARNING kazoo.client: Cannot resolve zk: [Errno -3] Temporary failure in name resolution` | ||
`[scheduler] | /var/playbooks/wait-to-start.sh: line 9: mysql: Temporary failure in name resolution` | ||
`[scheduler] | /var/playbooks/wait-to-start.sh: line 9: /dev/tcp/mysql/3306: Invalid argument` | ||
Any idea on what can cause this? | ||
@fungicide:matrix.org | joao15130: i think docker does some internal magic to make container names resolve through dns from other containers, i remember a related problem in jobs recently, trying to recall the details now... | 13:10 |
@fungicide:matrix.org | https://docs.docker.com/config/containers/container-networking/#dns-services is docker's own documentation on that feature, if curious. i'm still digging to try to find where it came up in discussion recently | 13:16 |
@joao15130:matrix.org | Am I supposed to use podman instead of docker as we run podman-compose in the tutorial? | 13:20 |
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul-jobs] 940373: js-package-manager: Allow setting additional env vars for build command https://review.opendev.org/c/zuul/zuul-jobs/+/940373 | 13:45 | |
@joao15130:matrix.org | fungi: weird but it works fine on ubuntu 24.04 without modifying anything | 13:49 |
@fungicide:matrix.org | joao15130: oh! were you trying on ubuntu 22.04 previously? if so, there's a bug in the version of podman packaged there | 13:51 |
@fungicide:matrix.org | https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2024394 | 13:51 |
@fungicide:matrix.org | that's probably what you hit | 13:51 |
@joao15130:matrix.org | yes, was on 22.04. I guess the previous OS was 24.04 and this is why I didn't encounter the issue | 13:51 |
@fungicide:matrix.org | there are workarounds to get a usable podman on 22.04 but 24.04 is a cleaner option | 13:52 |
@joao15130:matrix.org | ok good, so I didn't waste my time upgrading to 24.04 then | 13:53 |
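A quick way to confirm this class of failure from inside one of the containers is to ask the resolver directly. The sketch below is only an illustration, not part of the quick-start: `check_names` is a hypothetical helper, and the names `zk` and `mysql` are taken from the log lines above. It assumes a Python interpreter is available in the container.

```python
import socket


def check_names(names, port=None):
    """Map each name to its sorted addresses, or None if it does not resolve."""
    results = {}
    for name in names:
        try:
            infos = socket.getaddrinfo(name, port)
        except socket.gaierror:
            # Covers "Temporary failure in name resolution" (EAI_AGAIN)
            # as well as "Name or service not known" (EAI_NONAME).
            results[name] = None
        else:
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string for both IPv4 and IPv6.
            results[name] = sorted({info[4][0] for info in infos})
    return results
```

Calling `check_names(["zk", "mysql"])` in the scheduler or launcher container should return addresses for both names when the compose network's name resolution is working, and `None` for a name that cannot be resolved.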
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul] 940379: web: Upgrade nodejs to latest v23 https://review.opendev.org/c/zuul/zuul/+/940379 | 14:03 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 17:17 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 18:31 | |
@boha:matrix.org | Hi, we are running zuul 11.1.0 with a few static nodes. A few times a week it seems a node gets stuck in "deleted"-state after being released after a build. I have used "zuul delete-state" to get it back on track, but it's a rather time-consuming way to reset. Is there a better way to recover? | 19:02 |
@clarkb:matrix.org | boha: the nodes are going to be managed by nodepool. Are you running an up to date nodepool? | 19:05 |
@clarkb:matrix.org | you may also be able to use nodepool node management commands to unstick specific nodes without resorting to full deletion of things in zuul | 19:06 |
@boha:matrix.org | Everything 11.1.0, i tried the nodepool delete command, but it gives me an error due to the node being locked | 19:07 |
@clarkb:matrix.org | but I think ensuring nodepool is up to date then inspecting the nodepool logs for the stuck node is where I would start | 19:07 |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 19:17 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 19:24 | |
@boha:matrix.org | Thanks, i noticed a NodeDeleter exception and a stacktrace for connection loss and a missed heartbeat. The build node seems to be running fine though. | 19:25 |
@fungicide:matrix.org | boha: are you running a redundant zk cluster or just a single zk server? | 19:26 |
@boha:matrix.org | Single zk, all zuul components in the same vm, but builder nodes are separate machines | 19:28 |
@fungicide:matrix.org | possible the nodepool process is somehow getting disconnected from zookeeper | 19:30 |
@boha:matrix.org | Everything is within same network (docker compose). I guess problems here would be performance related 🤔 | 19:35 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 939672: Track variable sources https://review.opendev.org/c/zuul/zuul/+/939672 | 19:40 | |
@clarkb:matrix.org | yes I think those heartbeats can fail if both sides aren't given enough cpu time to process them | 19:45 |
@clarkb:matrix.org | that said a failed heartbeat should lead to nodepool attempting to reconnect then resolving inconsistencies | 19:46 |
@clarkb:matrix.org | a copy of that traceback may help point to bug(s) preventing that from happening | 19:46 |
@boha:matrix.org | Cl fungi: Thanks for the quick response and tips. I'll see tomorrow if I can extract the traceback and allocate more cpu. | 19:55 |
@amdted:matrix.org | Hello. We are having an intermittent issue where our Zuul setup (11.2.0, running in k8s with 5 scheduler, 4 executor and 2 merger nodes) seems to go catatonic where no new jobs will start regardless of any trigger event (e.g. pushing a change to Gerrit). When I check the executor logs, they are all idle, but when I check the scheduler logs they appear to be stuck in an error loop, with the following sequence repeating endlessly: | 20:16 |
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: Unexpected issue in _run loop: | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: Traceback (most recent call last): | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 158, in run | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: self.watcher_election.run(self._run) | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/zk/election.py", line 28, in run | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: return super().run(func, *args, **kwargs) | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/kazoo/recipe/election.py", line 54, in run | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: func(*args, **kwargs) | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 149, in _run | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: self._poll() | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 127, in _poll | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: refs = self.lsRemote(project) | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^ | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 72, in lsRemote | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: output = client.ls_remote( | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^ | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 986, in <lambda> | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: return lambda *args, **kwargs: self._call_process(name, *args, **kwargs) | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 1598, in _call_process | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: return self.execute(call, **exec_kwargs) | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 1388, in execute | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: raise GitCommandError(redacted_command, status, stderr_value, stdout_value) | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: git.exc.GitCommandError: Cmd('git') failed due to: exit code(255) | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: cmdline: git ls-remote --heads --tags https://opendev.org/zuul/zuul-jobs | ||
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: stderr: 'error: cannot fork() for remote-https: Resource temporarily unavailable' | ||
with that ls-remote command randomly alternating between opendev.org/zuul/zuul-jobs and other source repos we use with each loop. One of the scheduler nodes, however, shows a slightly different error loop (see the last 9 lines): | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: Unexpected issue in _run loop: | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: Traceback (most recent call last): | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 158, in run | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: self.watcher_election.run(self._run) | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/zk/election.py", line 28, in run | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: return super().run(func, *args, **kwargs) | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/kazoo/recipe/election.py", line 54, in run | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: func(*args, **kwargs) | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 149, in _run | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: self._poll() | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 127, in _poll | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: refs = self.lsRemote(project) | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^ | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 72, in lsRemote | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: output = client.ls_remote( | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^ | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 986, in <lambda> | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: return lambda *args, **kwargs: self._call_process(name, *args, **kwargs) | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 1598, in _call_process | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: return self.execute(call, **exec_kwargs) | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
**2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 1262, in execute | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: proc = safer_popen( | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^ | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/subprocess.py", line 1026, in __init__ | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: self._execute_child(args, executable, preexec_fn, close_fds, | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/subprocess.py", line 1885, in _execute_child | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: self.pid = _fork_exec( | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^ | ||
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: BlockingIOError: [Errno 11] Resource temporarily unavailable** | ||
Does anybody have any ideas as to what may be going on here? It almost appears that the error loop has exhausted the process table like a pseudo-fork bomb. Right now we work around it by doing a full restart of Zuul but that's just a stopgap solution. | ||
@clarkb:matrix.org | The errors appear to be related to forking. You may have hit process limits or run out of memory? | 20:37 |
@clarkb:matrix.org | I would check both things and work back from there | 20:37 |
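For the process-limit side of that check, a minimal sketch follows. It assumes Linux and a Python interpreter inside the scheduler container; the function names are hypothetical, and the cgroup v2 path `/sys/fs/cgroup` is an assumption about the deployment (cgroup v1 keeps these files under a `pids` subdirectory instead). On Linux, errno 11 is `EAGAIN`, which `fork(2)` documents as hitting a process or thread limit, so these are the numbers worth comparing first.

```python
import os
import resource
from pathlib import Path


def read_cgroup_value(filename, root="/sys/fs/cgroup"):
    """Return the stripped content of a cgroup file, or None if unreadable."""
    try:
        return (Path(root) / filename).read_text().strip()
    except OSError:
        return None


def count_visible_processes(proc_root="/proc"):
    """Count numeric /proc entries (processes, not individual threads)."""
    try:
        return sum(1 for entry in os.listdir(proc_root) if entry.isdigit())
    except OSError:
        return None


def process_limits():
    """Collect the limits most likely to make fork() fail with EAGAIN."""
    # RLIMIT_NPROC is per user and counts threads too; -1 means unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "rlimit_nproc_soft": soft,
        "rlimit_nproc_hard": hard,
        # pids.max is "max" when unlimited; pids.current counts tasks (threads).
        "cgroup_pids_max": read_cgroup_value("pids.max"),
        "cgroup_pids_current": read_cgroup_value("pids.current"),
        "visible_processes": count_visible_processes(),
    }
```

If `cgroup_pids_current` sits at or near `cgroup_pids_max` while the error loop is running, the container's pid limit is the likely ceiling; if both are far apart, memory or the per-user limit would be the next things to look at.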
@amdted:matrix.org | I realize that. I'm mostly wondering if anybody else has encountered this issue before or if this is something new. I work with Kenny Ho and we're still digging into it on our end. He suggested that I post the above here in case anybody might recognize the symptoms. | 21:06 |
@clarkb:matrix.org | I don't see us setting any special ulimits for zuul. But we do set file limits quite high for gitea and gerrit because they both open many files for git operations. It's possible that you're in a similar situation depending on how your zuul is deployed | 21:08 |
@fungicide:matrix.org | looks like that "BlockingIOError: [Errno 11] Resource temporarily unavailable" is often a sign of multiple processes or threads fighting over the same network socket, e.g. defunct grpc worker threads | 21:15 |
@fungicide:matrix.org | or defunct git-remote-* processes? | 21:16 |
@clarkb:matrix.org | I suppose it is possible our weekly upgrade restarts clear out any memory/pid/process leaks | 21:18 |
@fungicide:matrix.org | i guess that would also explain the "cannot fork()" error | 21:19 |
@fungicide:matrix.org | basically too many stuck processes | 21:19 |
@clarkb:matrix.org | oh also I don't think we use the git driver? | 21:19 |
@clarkb:matrix.org | confirmed we don't so if it is specific to git in the git driver we wouldn't see it | 21:20 |
@fungicide:matrix.org | right, this seems to be happening deep in the gitwatcher module | 21:22 |
@fungicide:matrix.org | or at least it's the most observable/vocal victim of whatever's going on | 21:22 |
@fungicide:matrix.org | like the remote is suddenly blocking/throttling the client or there's some network device overloaded sending packets into never-never land | 21:25 |
@amdted:matrix.org | Okay, that gives us something to look into. We've had network-related issues in the past so maybe we're running into some corner case here. Note that we only upgraded from 11.1.0 to 11.2.0 last week and we didn't encounter this problem previously. | 21:32 |
@fungicide:matrix.org | there were no changes to gitwatcher.py between 11.1.0 and 11.2.0, but this seems more like a symptom of something broader | 21:40 |
@amdted:matrix.org | Okay, so something else up the food chain may be causing gitwatcher to squeal in pain. | 21:44 |
@fungicide:matrix.org | that's my suspicion, if it were me i'd start with a systemic approach looking at overall resource usage on the server(s)/container(s), potential for network disruptions to leave stuck git processes accumulating until everything falls over for example | 21:48 |
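One way to test the "stuck git processes accumulating" theory is to take a census of processes by name and state from `/proc`. This is a sketch under the assumption of a Linux host or container; `parse_stat` and `process_census` are hypothetical helpers, not Zuul code. Note that the kernel truncates command names to 15 characters, so `git-remote-https` would show up as `git-remote-http`.

```python
import os
from collections import Counter


def parse_stat(text):
    """Split one /proc/<pid>/stat line into (comm, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so search for the last ")" on the line.
    start = text.index("(")
    end = text.rindex(")")
    comm = text[start + 1:end]
    state = text[end + 1:].split()[0]
    return comm, state


def process_census(proc_root="/proc"):
    """Count processes per (comm, state); state "Z" marks a zombie."""
    counts = Counter()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return counts
    for entry in entries:
        if not entry.isdigit():
            continue
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as handle:
                counts[parse_stat(handle.read())] += 1
        except (OSError, ValueError, IndexError):
            # The process exited while we were reading it; skip it.
            continue
    return counts
```

Printing `process_census().most_common(10)` at intervals would show whether entries such as `("git", "Z")` or `("git-remote-http", "S")` keep growing between restarts, which is the pattern suggested above.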
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 22:26 | |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 940303: Fix bulk repo state restore https://review.opendev.org/c/zuul/zuul/+/940303 | 23:02 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 23:03 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 23:09 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 23:37 | |
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 | 23:41 |