Wednesday, 2025-01-29

-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 00:03
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 00:08
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 02:50
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 02:53
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 03:17
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 03:28
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 03:33
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 03:47
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 04:08
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 06:42
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 940361: Pin boto3 Ansible requirement https://review.opendev.org/c/zuul/zuul/+/940361 08:29
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 938011: Enforce required length check in config https://review.opendev.org/c/zuul/zuul/+/938011 11:23
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 938463: sqlreporter: ensure build end data is stored even if log URL is too long https://review.opendev.org/c/zuul/zuul/+/938463 11:24
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 938463: sqlreporter: ensure build end data is stored even if log URL is too long https://review.opendev.org/c/zuul/zuul/+/938463 11:28
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] 938128: autohold REST API: add ref filter validation https://review.opendev.org/c/zuul/zuul/+/938128 11:29
@joao15130:matrix.org Hello, sorry for asking so many questions but I'm learning. 13:02
I'm trying to reinstall zuul using the same procedure as the quick start and I couldn't get it to work.
I see multiple errors in the logs when I bring the containers up with compose:
`[launcher] | 2025-01-29 12:58:47,402 WARNING kazoo.client: Cannot resolve zk: [Errno -3] Temporary failure in name resolution`
`[scheduler] | /var/playbooks/wait-to-start.sh: line 9: mysql: Temporary failure in name resolution`
`[scheduler] | /var/playbooks/wait-to-start.sh: line 9: /dev/tcp/mysql/3306: Invalid argument`
Any idea what can cause this?
@fungicide:matrix.org joao15130: i think docker does some internal magic to make container names resolve through dns from other containers, i remember a related problem in jobs recently, trying to recall the details now... 13:10
@fungicide:matrix.org https://docs.docker.com/config/containers/container-networking/#dns-services is docker's own documentation on that feature, if curious. i'm still digging to try to find where it came up in discussion recently 13:16
@joao15130:matrix.org Am I supposed to use podman instead of docker as we run podman-compose in the tutorial? 13:20
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul-jobs] 940373: js-package-manager: Allow setting additional env vars for build command https://review.opendev.org/c/zuul/zuul-jobs/+/940373 13:45
@joao15130:matrix.org fungi: weird but it works fine on ubuntu 24.04 without modifying anything 13:49
@fungicide:matrix.org joao15130: oh! were you trying on ubuntu 22.04 previously? if so, there's a bug in the version of podman packaged there 13:51
@fungicide:matrix.org https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2024394 13:51
@fungicide:matrix.org that's probably what you hit 13:51
@joao15130:matrix.org yes, was on 22.04. I guess previous OS was 24.04 and this is why I didn't encounter the issue 13:51
@fungicide:matrix.org there are workarounds to get a usable podman on 22.04 but 24.04 is a cleaner option 13:52
@joao15130:matrix.org ok good, so I didn't waste my time upgrading to 24.04 then 13:53
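[Editor's sketch] The two log lines quoted above both come down to the service names `zk` and `mysql` not resolving from inside the containers. One way to confirm that from within a container is a small resolution check like the one below. This is only an illustrative sketch: the function name `unresolved_hosts` and the injectable `resolver` parameter are made up for this example and are not part of Zuul or the quick-start tooling.

```python
import socket


def unresolved_hosts(hosts, resolver=socket.getaddrinfo):
    """Return {host: error text} for every host that fails name resolution.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so the
    check can be exercised without real DNS (hypothetical design choice).
    """
    failures = {}
    for host in hosts:
        try:
            # A port of None asks only for address resolution.
            resolver(host, None)
        except socket.gaierror as exc:
            failures[host] = str(exc)
    return failures
```

Calling `unresolved_hosts(["zk", "mysql"])` inside the scheduler container should return an empty dict when the compose network's DNS is working, and otherwise an entry per failing name with the same kind of `[Errno -3]` text that kazoo reported.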
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul] 940379: web: Upgrade nodejs to latest v23 https://review.opendev.org/c/zuul/zuul/+/940379 14:03
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 17:17
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 18:31
@boha:matrix.org Hi, we are running zuul 11.1.0 with a few static nodes. A few times a week a node seems to get stuck in the "deleted" state after being released after a build. I have used "zuul delete-state" to get it back on track, but it's a rather time-consuming way to reset. Is there a better way to recover? 19:02
@clarkb:matrix.org boha: the nodes are going to be managed by nodepool. Are you running an up to date nodepool? 19:05
@clarkb:matrix.org you may also be able to use nodepool node management commands to unstick specific nodes without resorting to full deletion of things in zuul 19:06
@boha:matrix.org Everything 11.1.0, i tried the nodepool delete command, but it gives me an error due to the node being locked 19:07
@clarkb:matrix.org but I think ensuring nodepool is up to date then inspecting the nodepool logs for the stuck node is where I would start 19:07
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 19:17
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 19:24
@boha:matrix.org Thanks, i noticed a NodeDeleter exception and a stacktrace for connection loss and a missed heartbeat. The build node seems to be running fine though. 19:25
@fungicide:matrix.org boha: are you running a redundant zk cluster or just a single zk server? 19:26
@boha:matrix.org Single zk, all zuul components in the same vm, but builder nodes are separate machines 19:28
@fungicide:matrix.org possible the nodepool process is somehow getting disconnected from zookeeper 19:30
@boha:matrix.org Everything is within same network (docker compose). I guess problems here would be performance related 🤔 19:35
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 939672: Track variable sources https://review.opendev.org/c/zuul/zuul/+/939672 19:40
@clarkb:matrix.org yes I think those heartbeats can fail if both sides aren't given enough cpu time to process them 19:45
@clarkb:matrix.org that said a failed heartbeat should lead to nodepool attempting to reconnect then resolving inconsistencies 19:46
@clarkb:matrix.org a copy of that traceback may help point to bug(s) preventing that from happening 19:46
@boha:matrix.org clarkb, fungi: Thanks for the quick response and tips. I'll see tomorrow if I can extract the traceback and allocate more cpu. 19:55
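[Editor's sketch] Since the suspicion above is that the nodepool process intermittently loses contact with ZooKeeper, possibly because of CPU starvation, one cheap first check is to time repeated TCP connects from the nodepool container to the ZooKeeper port and look for stalls or failures. The sketch below does only that. It does not speak the ZooKeeper protocol or test session heartbeats, and the name `probe_connect` and its parameters are hypothetical.

```python
import socket
import time


def probe_connect(host, port, attempts=5, timeout=2.0, pause=0.0):
    """Time repeated TCP connects to host:port.

    Returns one entry per attempt: the connect time in seconds, or None if
    the attempt failed or timed out.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:
            # Covers refused connections, timeouts and resolution errors.
            results.append(None)
        time.sleep(pause)
    return results
```

For example, `probe_connect("zk", 2281, attempts=60, pause=1.0)` would sample once a second for a minute; the host name and port here are placeholders and need to match the actual deployment. Occasional `None` entries or multi-second connect times on a single-host docker compose setup would point toward the resource pressure discussed above rather than a network fault.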
@amdted:matrix.org Hello. We are having an intermittent issue where our Zuul setup (11.2.0, running in k8s with 5 scheduler, 4 executor and 2 merger nodes) seems to go catatonic where no new jobs will start regardless of any trigger event (e.g. pushing a change to Gerrit). When I check the executor logs, they are all idle, but when I check the scheduler logs they appear to be stuck in an error loop, with the following sequence repeating endlessly: 20:16
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: Unexpected issue in _run loop:
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: Traceback (most recent call last):
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 158, in run
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: self.watcher_election.run(self._run)
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/zk/election.py", line 28, in run
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: return super().run(func, *args, **kwargs)
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/kazoo/recipe/election.py", line 54, in run
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: func(*args, **kwargs)
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 149, in _run
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: self._poll()
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 127, in _poll
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: refs = self.lsRemote(project)
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 72, in lsRemote
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: output = client.ls_remote(
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 986, in <lambda>
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 1598, in _call_process
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: return self.execute(call, **exec_kwargs)
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 1388, in execute
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: git.exc.GitCommandError: Cmd('git') failed due to: exit code(255)
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: cmdline: git ls-remote --heads --tags https://opendev.org/zuul/zuul-jobs
2025-01-29 13:49:55,335 ERROR zuul.connection.git.watcher: stderr: 'error: cannot fork() for remote-https: Resource temporarily unavailable'
with that ls-remote command randomly alternating between opendev.org/zuul/zuul-jobs and other source repos we use with each loop. One of the scheduler nodes, however, shows a slightly different error loop (see the last 9 lines):
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: Unexpected issue in _run loop:
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: Traceback (most recent call last):
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 158, in run
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: self.watcher_election.run(self._run)
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/zk/election.py", line 28, in run
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: return super().run(func, *args, **kwargs)
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/kazoo/recipe/election.py", line 54, in run
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: func(*args, **kwargs)
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 149, in _run
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: self._poll()
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 127, in _poll
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: refs = self.lsRemote(project)
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/zuul/driver/git/gitwatcher.py", line 72, in lsRemote
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: output = client.ls_remote(
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 986, in <lambda>
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 1598, in _call_process
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: return self.execute(call, **exec_kwargs)
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/site-packages/git/cmd.py", line 1262, in execute
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: proc = safer_popen(
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^^
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/subprocess.py", line 1026, in __init__
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: self._execute_child(args, executable, preexec_fn, close_fds,
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: File "/usr/local/lib/python3.11/subprocess.py", line 1885, in _execute_child
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: self.pid = _fork_exec(
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: ^^^^^^^^^^^
2025-01-29 13:49:55,324 ERROR zuul.connection.git.watcher: BlockingIOError: [Errno 11] Resource temporarily unavailable**
Does anybody have any ideas as to what may be going on here? It almost appears that the error loop has exhausted the process table like a pseudo-fork bomb. Right now we work around it by doing a full restart of Zuul but that's just a stopgap solution.
@clarkb:matrix.org The errors appear to be related to forking. You may have hit process limits or run out of memory? 20:37
@clarkb:matrix.org I would check both things and work back from there 20:37
@amdted:matrix.org I realize that. I'm mostly wondering if anybody else has encountered this issue before or if this is something new. I work with Kenny Ho and we're still digging into it on our end. He suggested that I post the above here in case anybody might recognize the symptoms. 21:06
@clarkb:matrix.org I don't see us setting any special ulimits for zuul. But we do set file limits quite high for gitea and gerrit because they both open many files for git operations. It's possible that you're in a similar situation depending on how your zuul is deployed 21:08
@fungicide:matrix.org looks like that "BlockingIOError: [Errno 11] Resource temporarily unavailable" is often a sign of multiple processes or threads fighting over the same network socket, e.g. defunct grpc worker threads 21:15
@fungicide:matrix.org or defunct git-remote-* processes? 21:16
@clarkb:matrix.org I suppose it is possible our weekly upgrade restarts clear out any memory/pid/process leaks 21:18
@fungicide:matrix.org i guess that would also explain the "cannot fork()" error 21:19
@fungicide:matrix.org basically too many stuck processes 21:19
@clarkb:matrix.org oh also I don't think we use the git driver? 21:19
@clarkb:matrix.org confirmed we don't so if it is specific to git in the git driver we wouldn't see it 21:20
@fungicide:matrix.org right, this seems to be happening deep in the gitwatcher module 21:22
@fungicide:matrix.org or at least it's the most observable/vocal victim of whatever's going on 21:22
@fungicide:matrix.org like the remote is suddenly blocking/throttling the client or there's some network device overloaded sending packets into never-never land 21:25
@amdted:matrix.org Okay, that gives us something to look into. We've had network-related issues in the past so maybe we're running into some corner case here. Note that we only upgraded from 11.1.0 to 11.2.0 last week and we didn't encounter this problem previously. 21:32
@fungicide:matrix.org there were no changes to gitwatcher.py between 11.1.0 and 11.2.0, but this seems more like a symptom of something broader 21:40
@amdted:matrix.org Okay, so something else up the food chain may be causing gitwatcher to squeal in pain. 21:44
@fungicide:matrix.org that's my suspicion. if it were me i'd start with a systemic approach, looking at overall resource usage on the server(s)/container(s) and, for example, the potential for network disruptions to leave stuck git processes accumulating until everything falls over 21:48
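[Editor's sketch] The two suggestions above (check process limits, and look for stuck git processes piling up) can both be approximated from inside a scheduler container by reading `/proc` and the cgroup pids files. The sketch below assumes a Linux container with cgroup v2 mounted at `/sys/fs/cgroup`; the function names are invented for this example, and whether a pids limit is actually what this deployment is hitting is not established by the log.

```python
from collections import Counter
from pathlib import Path


def count_process_states(proc_root="/proc"):
    """Count (command, state) pairs by parsing /proc/<pid>/stat.

    State "Z" is a zombie (defunct) process; "D" is uninterruptible sleep.
    """
    counts = Counter()
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            stat = (entry / "stat").read_text(errors="replace")
        except OSError:
            continue  # the process exited while we were scanning
        # Format is "pid (comm) state ..."; comm may itself contain ")".
        start, end = stat.find("("), stat.rfind(")")
        if start == -1 or end == -1:
            continue
        fields = stat[end + 1:].split()
        if fields:
            counts[(stat[start + 1:end], fields[0])] += 1
    return counts


def read_pids_limit(cgroup_root="/sys/fs/cgroup"):
    """Return (current, limit) from cgroup v2 pids files, or None if absent.

    A limit of None means the file contains "max" (no limit set).
    """
    root = Path(cgroup_root)
    try:
        current = int((root / "pids.current").read_text())
        raw = (root / "pids.max").read_text().strip()
        limit = None if raw == "max" else int(raw)
    except (OSError, ValueError):
        return None
    return current, limit
```

A large and growing count for entries such as `("git", "Z")` or `("git-remote-http", "D")`, together with `pids.current` approaching `pids.max`, would be consistent with the "too many stuck processes" theory; flat counts would suggest looking at memory or other limits instead.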
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 22:26
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 940303: Fix bulk repo state restore https://review.opendev.org/c/zuul/zuul/+/940303 23:02
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 23:03
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 23:09
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 23:37
-@gerrit:opendev.org- Ruisi Jian proposed: [zuul/zuul] 940352: Report failure messages on non-enqueued builds https://review.opendev.org/c/zuul/zuul/+/940352 23:41
