-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 920690: Pre-filter gerrit events based on triggers https://review.opendev.org/c/zuul/zuul/+/920690 | 00:46 | |
-@gerrit:opendev.org- Simon Westphahl proposed: | 06:31 | |
- [zuul/zuul] 921949: Add basic zuul-launcher client/server skeleton https://review.opendev.org/c/zuul/zuul/+/921949 | ||
- [zuul/zuul] 922019: Properly sort imports in model.py https://review.opendev.org/c/zuul/zuul/+/922019 | ||
- [zuul/zuul] 922310: Add ZKObject-based launcher API https://review.opendev.org/c/zuul/zuul/+/922310 | ||
- [zuul/zuul] 922469: Support provider node subclasses in launcher API https://review.opendev.org/c/zuul/zuul/+/922469 | ||
-@gerrit:opendev.org- Simon Westphahl proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: | 06:31 | |
- [zuul/zuul] 922982: Add **kw processing to zkobject subclass fromZK https://review.opendev.org/c/zuul/zuul/+/922982 | ||
- [zuul/zuul] 922258: Serialize Provider objects to ZK https://review.opendev.org/c/zuul/zuul/+/922258 | ||
- [zuul/zuul] 922133: Add some developer docs for some nodepool-in-zuul zk objects https://review.opendev.org/c/zuul/zuul/+/922133 | ||
-@gerrit:opendev.org- Simon Westphahl proposed: | 09:57 | |
- [zuul/zuul] 922802: wip: Add launcher client/server implementation https://review.opendev.org/c/zuul/zuul/+/922802 | ||
- [zuul/zuul] 923038: Configure connection registry for launcher https://review.opendev.org/c/zuul/zuul/+/923038 | ||
- [zuul/zuul] 923039: Add AWS provider node class https://review.opendev.org/c/zuul/zuul/+/923039 | ||
@clarkb:matrix.org | corvus: I think OpenStack may have found that retry state jobs in gate pipelines may improperly trigger gate resets that don't fully complete leading to "orphaned" changes in the pipeline that eventually report cancelled jobs. | 17:40 |
---|---|---|
@clarkb:matrix.org | The two changes in questions are https://review.opendev.org/c/openstack/nova/+/915735 and https://review.opendev.org/c/openstack/nova/+/915734. Based on timing in logs on zuul01 and the build history here: https://zuul.opendev.org/t/openstack/builds?change=915734&change=915735&skip=0 it seems that zuul detects a retryable failure state in nova-tox-validate-backport. This retry state seems to coincide with zuul cancelling jobs for the child (915735) in the gate presumably because it thinks there needs to be a gate reset due to the failure in 915734. Problem is that failure isn't a "true" failure and the job will be retried. Which it is retried and 915734 eventually merges. Then 915735 reports but since no new jobs have run it reports the cancelled failure state and you get a -2. The upside to this failure mode is that the speculative git tree state assumptions are not violated for the change that merges so we're failing safe. | 17:44 |
@clarkb:matrix.org | Sounds like this may have been happening for some time but very infrequently. I'm not aware off the top of my head what sorts of chagnes to zuul might be causing this and I haven't been able to keep track of zuul changes recently as well as I'd like. Curious if anyone has any ideas before we deep dive the code | 17:46 |
@clarkb:matrix.org | I think this is related to pre_fail handling | 17:52 |
@clarkb:matrix.org | The reason is the short circuit check in https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L4507-L4513 | 17:52 |
@clarkb:matrix.org | we skip waiting to determine if the job would be in a retry state | 17:52 |
@clarkb:matrix.org | but this must be a corner case as we have tests that check we don't set pre_fail when we would retry | 17:53 |
@clarkb:matrix.org | I have confirmed via logs on ze05 that early failure detection did occur for this job | 18:00 |
@clarkb:matrix.org | `2024-06-28 14:22:53,303 INFO zuul.AnsibleJob: [e: 967b8cbde1cb4649bc4b7f5d7ded6be8] [build: a51b05d89221490cb51d6c777dd42f8f] Early failure in job` | 18:00 |
@jim:acmegating.com | Clark: what's the test you reckon covers this? | 18:00 |
@clarkb:matrix.org | corvus: test_pre_run_failure_retry | 18:01 |
@clarkb:matrix.org | or at least it attempts to cover this? | 18:01 |
@clarkb:matrix.org | we do appear to be failing in a pre-run playbook | 18:02 |
@clarkb:matrix.org | corvus: reading the executor code I half suspect that we're properly saying don't allow prefail when running the pre playbook, but then when we go to run the post playbook to finish up the job we allow setting pre fail and it is that point that we read the logs for the pre playbook and set pre_fail. But since we in the post phase we set the flag | 18:05 |
@jim:acmegating.com | i was noting that an interesting characteristic is that this job apparently also has a post failure | 18:07 |
@clarkb:matrix.org | corvus: oh yup and the Early failure in job is detected during the post failure | 18:07 |
@clarkb:matrix.org | so maybe pre is totally fine and it is the mixing of a retry from a pre failure with a failure in post run | 18:07 |
@clarkb:matrix.org | ya I think that is it | 18:08 |
@clarkb:matrix.org | I think to be safe we need to carry forward whether or not we had a retryable pre failure into subsequent runAnsiblePlaybook() calls so that we don't try and set pre_fail if so | 18:08 |
@jim:acmegating.com | Clark: how do we read the logs for the pre-run playbook during post-run? | 18:11 |
@jim:acmegating.com | that should only be from the current playbook execution | 18:11 |
@clarkb:matrix.org | corvus: I don't think we do. As noted above the logging for early failure detection occured during the post-run playbook which made me think that could happen. But then you pointed out post-run is failing too. I think we're detecting the post-run failure instead | 18:12 |
@jim:acmegating.com | oh, we're getting an early failure detection in a post-run playbook? | 18:12 |
@clarkb:matrix.org | corvus: yes I think that is what happens. Pre-run fails and we hanlde that properly. Then we proceed to run post-run and that fails and we detect early failure there which sends the pre-fail flag up | 18:13 |
@clarkb:matrix.org | and before we can set the job status to RETRY we've already detected pre-fail which is treated as a failure and we start resetting stuff | 18:13 |
@jim:acmegating.com | so edge case is: pre-run task failure and post-run task failure. | 18:13 |
@clarkb:matrix.org | I think in zuul executor we can pass the should_retry flag into the post-run call and only set pre fail if should_retry is false | 18:13 |
@sean-k-mooney:matrix.org | this is the job we think cuase the cancelation https://zuul.opendev.org/t/openstack/build/a51b05d89221490cb51d6c777dd42f8f/log/job-output.txt#201 | 18:14 |
@clarkb:matrix.org | corvus: yup | 18:14 |
@sean-k-mooney:matrix.org | the first failure was in pre-run with a failure to lookup opendev.org | 18:14 |
@jim:acmegating.com | Clark: i agree with the fix | 18:15 |
@sean-k-mooney:matrix.org | then log collection failed in post-run here https://zuul.opendev.org/t/openstack/build/a51b05d89221490cb51d6c777dd42f8f/log/job-output.txt#251 | 18:15 |
@sean-k-mooney:matrix.org | becuase we didn insll tokx | 18:15 |
@jim:acmegating.com | Clark: you want to do the typing or shall i? | 18:15 |
@clarkb:matrix.org | corvus: I can make an attempt at it | 18:15 |
@jim:acmegating.com | cool thx; i will review it then :) | 18:16 |
@clarkb:matrix.org | most of the fun in this change is going to be adding a new test. | 18:24 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/923066 | 18:41 | |
@clarkb:matrix.org | Something like that maybe | 18:41 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/923066 | 19:10 | |
@clarkb:matrix.org | corvus: hrm the test failed in ^ and the log shows "Early failure in job" which I don't understand since we should be in retry mode there | 20:30 |
@clarkb:matrix.org | I wonder if there is something else we're missing here | 20:30 |
@sylvass:albinvass.se | I got an issue with packaging the zuul-executor on nixos. the short story is that the initial bubblewrap check fails because /nix/store has to be mounted into the chroot. | 20:31 |
I already had to do some local patches to make zuul check all default --ro-binds so it doesn't make assumptions about which directories exists. Basically I moved the --ro-binds defined here: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/bubblewrap/__init__.py#L297 | ||
to here: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/bubblewrap/__init__.py#L320 | ||
However executables on nixos are stored in `/nix/store` (`/bin/sh` and `/usr/bin/env` exist by default but they're symlinks to the store) so if that isn't mounted then the `id` executable won't be available for the bwrap check: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/bubblewrap/__init__.py#L217 | ||
I could add `/nix/store` as well to the default `--ro-bind` list, but I don't really like the idea of having hard-coded os specific paths. On the other hand the constructor doesn't take any ro_bind arguments and it isn't aware of the zuul.conf, so adding an argument and piping that data through to the constructor everywhere seems like quite a bit of work. (but maybe I should just try to do that?) | ||
@clarkb:matrix.org | oh wait the Early failure in job is on the second run which is correct. We are aborting the first wait job because post-run fails which means we can't retry? | 20:32 |
@clarkb:matrix.org | except None is the return value which should still be retryable so I'm confused | 20:32 |
@clarkb:matrix.org | corvus: ok I think I get it change 1 cannot merge | 20:35 |
@clarkb:matrix.org | so once zuul knows that (after the retries all complete and fail it aborts the job on change 2 and reenquees it without change 1 in the speculative state | 20:36 |
@clarkb:matrix.org | now to figure out where we've got the test case doing the wrong thing and fix it | 20:36 |
@jim:acmegating.com | Albin Vass: make it conditional? if '/nix/store' exists then add it to default list? | 20:37 |
@jim:acmegating.com | Albin Vass: i will say though that i'm hesitant to suggest that we think nixos is supported though | 20:38 |
@clarkb:matrix.org | you also probably don't want to mount the entire nix store just the subset that represents the things you need? | 20:39 |
@jim:acmegating.com | Albin Vass: at any rate, that should make for a compact local patch for now. :) | 20:39 |
@clarkb:matrix.org | (which I'm not sure how you resolve that) | 20:39 |
@sylvass:albinvass.se | yeah I'm leaning towards that, but it would be better if it didn't require a zuul change if for example someone wanted to package it for guix | 20:39 |
@sylvass:albinvass.se | Clark: that should be possible but there's no reason not to mount the entire store | 20:39 |
@clarkb:matrix.org | Albin Vass: reducing the security risk of old versions being exposed is why | 20:40 |
@jim:acmegating.com | a terminology quibble: this is not "packaging" this is "adding support for" :) | 20:40 |
@clarkb:matrix.org | aiui nix will happily retain old versions that could potentially be abused? | 20:40 |
@sylvass:albinvass.se | imo then they should just be garbage collected | 20:40 |
@sylvass:albinvass.se | I'm not thinking of giving jobs access to the nix daemon to build packages on executors, I just want my executables available :) | 20:41 |
@sylvass:albinvass.se | Clark: `nix path-info --recursive /run/current-system` to get all paths needed for the current system :) | 20:43 |
@sylvass:albinvass.se | (but that would be a ton of bind mounts) | 20:43 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/923066 | 20:43 | |
@clarkb:matrix.org | corvus: ^ the comment in the test delta between last ps and current ps explains my understanding of why it failed and why I needed to change it | 20:44 |
@clarkb:matrix.org | corvus: I guess let me know if you think that is inaccurate | 20:44 |
@jim:acmegating.com | Clark: that makes sense because that determination is made executor-side | 20:45 |
@clarkb:matrix.org | Albin Vass: ya i think the main risk is that you end up with some setuid/gid executable hanging around that you don't want to expose. This is also technically an issue with any other distro except most other distros are pretty aggressive about pruning and only having the one version around | 20:46 |
@clarkb:matrix.org | * Albin Vass: ya i think the main risk is that you end up with some buggy setuid/gid executable hanging around that you don't want to expose. This is also technically an issue with any other distro except most other distros are pretty aggressive about pruning and only having the one version around | 20:46 |
@jim:acmegating.com | i think this conversation is a good illustration of why it's good for the zuul project to be clear about it's expectations on runtime environment. i don't think most of us even know what questions to ask about running securely on nixos. it's developed for linux and tested on debian... and beyond that <shrug>. | 20:48 |
@jim:acmegating.com | * i think this conversation is a good illustration of why it's good for the zuul project to be clear about it's expectations on runtime environment. i don't think most of us even know what questions to ask about running securely on nixos. it's developed for gnu/linux and tested on debian... and beyond that \<shrug>. | 20:48 |
@jim:acmegating.com | (a rare time i have to actually break out the "gnu/linux" thing to be specific :) | 20:49 |
@sylvass:albinvass.se | Sure. It's quite easy to get the exact paths you need. But to do that I need to be able to tell bubbwrap what to mount :) | 20:51 |
@clarkb:matrix.org | more generally testing that paths exist before adding them to the ro or rw paths makes sense. And maybe its a warning if they don't exist and are in the default list and it is an error if they aer in the extra list? | 20:57 |
@clarkb:matrix.org | another appraoch (that is maybe more hacky) would be to wrap bwrap with something that adds appropriate nix paths for your platform | 20:59 |
@clarkb:matrix.org | but that might make future debugging annoying when we're all looking at the code and wondering why you get different behavuior | 21:00 |
@sylvass:albinvass.se | I tried wrapping zuul-executor with bubblewrap. But yeah I'd rather have a simpler setup | 21:06 |
@sylvass:albinvass.se | Clark: oh and nix won't create executables with setuid in /nix/store. | 21:23 |
@sylvass:albinvass.se | so that should generally be fine. instead nixos sets up wrappers in `/run/current-system/sw/bin` whenever that's needed | 21:23 |
@aureliojargas:matrix.org | Hi there, just to let you know that I've updated the `ensure-poetry` role proposal, addressing all the code review comments: https://review.opendev.org/c/zuul/zuul-jobs/+/922286 | 21:41 |
-@gerrit:opendev.org- Albin Vass proposed: | 21:50 | |
- [zuul/zuul] 923076: Print stdout on bwrap error https://review.opendev.org/c/zuul/zuul/+/923076 | ||
- [zuul/zuul] 923077: bwrap: check all ro-bind directories before adding flags https://review.opendev.org/c/zuul/zuul/+/923077 | ||
- [zuul/zuul] 923078: bwrap: parse args before checking bubblewrap https://review.opendev.org/c/zuul/zuul/+/923078 | ||
- [zuul/zuul] 923079: Add /nix/store to default ro-paths https://review.opendev.org/c/zuul/zuul/+/923079 | ||
@sylvass:albinvass.se | anyway that's what I got so far ^ | 21:51 |
@clarkb:matrix.org | arg the order of those jobs being processes by the scheduler is actually racy | 22:00 |
@clarkb:matrix.org | maybe its ok to simply check the jobs have the correct status and not worry about the order | 22:00 |
@jim:acmegating.com | Clark: set ordered=False in assertHistory | 22:01 |
@clarkb:matrix.org | yup I'm on it. I just want to leave a comment there that captures why | 22:04 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/923066 | 22:05 | |
@jim:acmegating.com | Clark: fwiw, i don't think a comment is necessary there -- *most* of the tests need ordered=False. :) | 22:10 |
@jim:acmegating.com | (doesn't hurt of course; just indicating it's pretty normal) | 22:11 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 22:12 | |
- [zuul/zuul] 923080: Update providers on launcher https://review.opendev.org/c/zuul/zuul/+/923080 | ||
- [zuul/zuul] 923081: Add image build triggers https://review.opendev.org/c/zuul/zuul/+/923081 | ||
@clarkb:matrix.org | corvus: ack the default appears to be true so I figured I'd explain it | 22:14 |
-@gerrit:opendev.org- Albin Vass proposed: | 22:27 | |
- [zuul/zuul] 923076: Print stdout on bwrap error https://review.opendev.org/c/zuul/zuul/+/923076 | ||
- [zuul/zuul] 923077: bwrap: check all ro-bind directories before adding flags https://review.opendev.org/c/zuul/zuul/+/923077 | ||
- [zuul/zuul] 923078: bwrap: parse args before checking bubblewrap https://review.opendev.org/c/zuul/zuul/+/923078 | ||
-@gerrit:opendev.org- Albin Vass proposed: | 22:32 | |
- [zuul/zuul] 923076: Print stdout on bwrap error https://review.opendev.org/c/zuul/zuul/+/923076 | ||
- [zuul/zuul] 923078: bwrap: parse args before checking bubblewrap https://review.opendev.org/c/zuul/zuul/+/923078 | ||
- [zuul/zuul] 923077: bwrap: check all ro-bind directories before adding flags https://review.opendev.org/c/zuul/zuul/+/923077 | ||
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/923066 | 22:36 | |
@clarkb:matrix.org | corvus: the linters were mad at my comment so I removed it | 22:36 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 923083: Install the docker-compose-plugin when installing upstream docker https://review.opendev.org/c/zuul/zuul-jobs/+/923083 | 23:26 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 923084: WIP: Try docker compose with podman https://review.opendev.org/c/zuul/zuul/+/923084 | 23:43 | |
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/923066 | 23:53 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!