-@gerrit:opendev.org- Peter Strunk proposed: [zuul/zuul] 903808: zuul_stream: add FQCN for windows command and shell https://review.opendev.org/c/zuul/zuul/+/903808 | 01:42 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 907256: Add --keep-config-cache option to delete-state command https://review.opendev.org/c/zuul/zuul/+/907256 | 09:25 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-client] 907627: Handle forward compatability with cycle refactor https://review.opendev.org/c/zuul/zuul-client/+/907627 | 09:34 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 906765: Fix 10 second delay after skipped command task https://review.opendev.org/c/zuul/zuul/+/906765 | 10:10 | |
@vonschultz:matrix.org | I restarted Zuul, and now my scheduler won't start. I'm confused, because I haven't upgraded the Zuul Docker images (still at 8.2.0), and the zuul.conf on the scheduler machine hasn't changed as far as I know. I'm getting the error | 18:01 |
---|---|---|
```
  File "/usr/local/lib/python3.11/site-packages/zuul/configloader.py", line 2206, in _cacheTenantYAMLBranch
    self._updateUnparsedBranchCache(
  File "/usr/local/lib/python3.11/site-packages/zuul/configloader.py", line 2321, in _updateUnparsedBranchCache
    min_ltimes[source_context.project_canonical_name][
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'opendev.org/zuul/zuul-jobs'
```
@vonschultz:matrix.org | Does anyone have any idea what this is or how I might recover from it? | 18:01 |
@jim:acmegating.com | Christian von Schultz: the cause would take quite a bit of effort to determine (and may not be relevant in current versions), but the remedy should be to stop all zuul components, then run `zuul-admin delete-state` on the scheduler to completely reset the ephemeral zookeeper state (that includes current pipeline contents), then start the system up again. | 18:09 |
@jim:acmegating.com | that's the universal fix for "something in zk is wrong" | 18:10 |
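For anyone following along, a rough sketch of the recovery sequence corvus describes, assuming a docker-compose based deployment (the service names here are hypothetical):

```shell
# Stop every Zuul component first (scheduler, web, executors, mergers);
# ZooKeeper itself stays up so its contents can be reset.
# (Service names are hypothetical.)
docker-compose stop scheduler web executor merger

# Reset Zuul's ephemeral ZooKeeper state, including current pipeline
# contents. Newer releases also offer --keep-config-cache (see 907256
# above) to avoid re-reading every project's configuration afterwards.
docker-compose run --rm scheduler zuul-admin delete-state

# Start everything back up; the scheduler rebuilds its state from scratch.
docker-compose up -d
```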
@vonschultz:matrix.org | It looks like it worked. Thanks. 👍️ | 18:26 |
@fungicide:matrix.org | Christian von Schultz: as to the cause, is it possible your zookeeper service got interrupted mid-write or something? | 18:41 |
@fungicide:matrix.org | having a 3-way zk cluster helps mitigate the risk that a spontaneous crash/reboot corrupts your state | 18:42 |
@vonschultz:matrix.org | I do have a 3-way zk cluster. | 18:42 |
@fungicide:matrix.org | cool, so probably not that at least | 18:43 |
@vonschultz:matrix.org | But I suppose it _is_ possible that it got interrupted mid-write anyway. I ran an Ansible playbook, and in the upgrade task, a new version of `docker-ce` got rolled out. Since Ansible likes to run many hosts in parallel, it's entirely possible that the Docker containers for Zookeeper got pulled down at approximately the same time. | 18:44 |
@clarkb:matrix.org | when you upgrade docker it will restart your containers | 18:45 |
@clarkb:matrix.org | you will want to put `serial: 1` (or whatever the equivalent Ansible setting is) on playbooks that impact your zk members | 18:46 |
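A minimal sketch of what Clark suggests, assuming the ZooKeeper members live in an inventory group called `zookeeper` (the group, play, and task names here are hypothetical):

```yaml
# Touch only one ZooKeeper member per batch so the 3-way cluster keeps
# quorum while docker restarts its containers during the upgrade.
- name: Upgrade docker on ZooKeeper members
  hosts: zookeeper        # hypothetical inventory group
  serial: 1               # one host at a time
  tasks:
    - name: Upgrade docker-ce
      ansible.builtin.package:
        name: docker-ce
        state: latest
```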
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 907509: Use git native commit with squash merge https://review.opendev.org/c/zuul/zuul/+/907509 | 20:52 | |
@jkkadgar:matrix.org | When 3 changes are in a gate queue (merge-mode: cherry-pick) and the top change has a set of 10 jobs in which 1 retries due to a failure in pre-run, Zuul then dequeues and restarts all the builds in the other 2 changes in the gate queue as the first change is marked failed. Zuul then dequeues and restarts all the builds in the other 2 changes in the gate queue a second time as the first change is reinserted to retry its 1 job. Is this expected behavior? Is there a way to disable or enable this? Ideally I would like to see the dequeue only happen after the retries have been exhausted on the top change. | 20:53 |
@fungicide:matrix.org | jkkadgar: i agree, it doesn't make sense for an automatically rerun build to mark the entire buildset as failing unless the build is failing on its final try. i've never noticed it doing that, but maybe i've just not been observant enough. perhaps this is an oversight related to the early failure detection from 9.0.0... what version of zuul are you running? | 21:00 |
@jkkadgar:matrix.org | 9.3.0 | 21:00 |
@fungicide:matrix.org | so it's possible this is a relatively recent behavior change, yeah | 21:00 |
@fungicide:matrix.org | though also, i can think of another corner case... zuul will automatically retry a build if it detects what looks like a network connection failure reaching a node, even in the run phase. if zuul marks the change as failing due to a failed ansible task in a run phase playbook, maybe it also needs to turn off the auto-rerun behavior for detected connection failures | 21:03 |
@fungicide:matrix.org | corvus: ^ is any of this already known problems with early failure detection? | 21:04 |
@clarkb:matrix.org | when a job is retried for network issues it should not dequeue the change. It will just rerun the job | 21:06 |
@clarkb:matrix.org | the indication that the change is dequeued makes me think something else may be happening | 21:07 |
@clarkb:matrix.org | because yes, dequeuing a change causes its children to reset as they need to evict their parent from the speculative git trees and test with the new git state | 21:07 |
@clarkb:matrix.org | is it hitting the retry limit? By default it will only attempt the job three times | 21:08 |
@jkkadgar:matrix.org | Just one retry causes this to occur | 21:09 |
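For context, the retry cap Clark refers to is the job's `attempts` attribute; a hedged sketch of setting it explicitly (the job name is made up):

```yaml
# "attempts" caps how many times Zuul will re-launch a build after a
# pre-run failure or a lost node connection before reporting RETRY_LIMIT.
- job:
    name: example-integration-job   # hypothetical job
    attempts: 3                     # Zuul's default
```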
@fungicide:matrix.org | Clark: is it possible that zuul is noticing a failed task, early failure detection is moving the change out of the main sequence for the dependent pipeline at that time, and then moving it back in again when it decides it will retry the build? just since the introduction of early failure detection in 9.0.0 i mean | 21:09 |
@fungicide:matrix.org | the observed case was for retries due to pre-run phase playbook failures | 21:10 |
@fungicide:matrix.org | not network connection errors | 21:10 |
@fungicide:matrix.org | i was merely speculating that network connection error retries might similarly expose the same behavior | 21:10 |
@jim:acmegating.com | fungi: early failure detection does not take place in pre-run playbooks | 21:11 |
@fungicide:matrix.org | corvus: got it, so that shouldn't be influencing the queue | 21:12 |
@clarkb:matrix.org | I would look at debug logs on the scheduler to see its decision making process for this | 21:13 |
@fungicide:matrix.org | then i agree, i'd be surprised if i'd failed to notice pre-run phase failure retries causing a build reset for changes behind the change with the retried build | 21:13 |
@fungicide:matrix.org | mmm, that's a mouthful | 21:14 |
@jkkadgar:matrix.org | Ok thanks, I'll try to see if I can figure out what is happening in the logs. | 21:15 |
@fungicide:matrix.org | jkkadgar: if you can identify the sequence of events, we should be able to add a regression test for it to double-check the behavior is or isn't as expected | 21:16 |
@fungicide:matrix.org | but yes, the scheduler debug log is the best place to find that | 21:17 |
@clarkb:matrix.org | one useful tip for doing that is to grep on the event id for the event enqueuing that change | 21:20 |
@clarkb:matrix.org | it may not give you everything but should give you a pretty good overview that you can then dig into from there | 21:20 |
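A rough illustration of that tip, with a made-up event id and log path:

```shell
# Scheduler log lines related to a change carry the event id that enqueued
# it; grep for that id to reconstruct the scheduler's decision making.
# (Both the id and the log location below are hypothetical.)
grep 'e21b2f0c94d24e6f' /var/log/zuul/scheduler-debug.log
```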
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 23:40 | |
- [zuul/zuul] 907506: Update web ui for dependency refactor https://review.opendev.org/c/zuul/zuul/+/907506 | ||
- [zuul/zuul] 906320: Finish circular dependency refactor https://review.opendev.org/c/zuul/zuul/+/906320 | ||
- [zuul/zuul] 907628: Change status json to use "refs" instead of "changes" https://review.opendev.org/c/zuul/zuul/+/907628 |