Friday, 2024-06-28

-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 920690: Pre-filter gerrit events based on triggers https://review.opendev.org/c/zuul/zuul/+/92069000:46
-@gerrit:opendev.org- Simon Westphahl proposed:06:31
- [zuul/zuul] 921949: Add basic zuul-launcher client/server skeleton https://review.opendev.org/c/zuul/zuul/+/921949
- [zuul/zuul] 922019: Properly sort imports in model.py https://review.opendev.org/c/zuul/zuul/+/922019
- [zuul/zuul] 922310: Add ZKObject-based launcher API https://review.opendev.org/c/zuul/zuul/+/922310
- [zuul/zuul] 922469: Support provider node subclasses in launcher API https://review.opendev.org/c/zuul/zuul/+/922469
-@gerrit:opendev.org- Simon Westphahl proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com:06:31
- [zuul/zuul] 922982: Add **kw processing to zkobject subclass fromZK https://review.opendev.org/c/zuul/zuul/+/922982
- [zuul/zuul] 922258: Serialize Provider objects to ZK https://review.opendev.org/c/zuul/zuul/+/922258
- [zuul/zuul] 922133: Add some developer docs for some nodepool-in-zuul zk objects https://review.opendev.org/c/zuul/zuul/+/922133
-@gerrit:opendev.org- Simon Westphahl proposed:09:57
- [zuul/zuul] 922802: wip: Add launcher client/server implementation https://review.opendev.org/c/zuul/zuul/+/922802
- [zuul/zuul] 923038: Configure connection registry for launcher https://review.opendev.org/c/zuul/zuul/+/923038
- [zuul/zuul] 923039: Add AWS provider node class https://review.opendev.org/c/zuul/zuul/+/923039
@clarkb:matrix.orgcorvus: I think OpenStack may have found that retry state jobs in gate pipelines may improperly trigger gate resets that don't fully complete leading to "orphaned" changes in the pipeline that eventually report cancelled jobs.17:40
@clarkb:matrix.orgThe two changes in questions are https://review.opendev.org/c/openstack/nova/+/915735 and https://review.opendev.org/c/openstack/nova/+/915734. Based on timing in logs on zuul01 and the build history here: https://zuul.opendev.org/t/openstack/builds?change=915734&change=915735&skip=0 it seems that zuul detects a retryable failure state in nova-tox-validate-backport. This retry state seems to coincide with zuul cancelling jobs for the child (915735) in the gate presumably because it thinks there needs to be a gate reset due to the failure in 915734. Problem is that failure isn't a "true" failure and the job will be retried. Which it is retried and 915734 eventually merges. Then 915735 reports but since no new jobs have run it reports the cancelled failure state and you get a -2. The upside to this failure mode is that the speculative git tree state assumptions are not violated for the change that merges so we're failing safe.17:44
@clarkb:matrix.orgSounds like this may have been happening for some time but very infrequently. I'm not aware off the top of my head what sorts of chagnes to zuul might be causing this and I haven't been able to keep track of zuul changes recently as well as I'd like. Curious if anyone has any ideas before we deep dive the code17:46
@clarkb:matrix.orgI think this is related to pre_fail handling17:52
@clarkb:matrix.orgThe reason is the short circuit check in https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L4507-L451317:52
@clarkb:matrix.orgwe skip waiting to determine if the job would be in a retry state17:52
@clarkb:matrix.orgbut this must be a corner case as we have tests that check we don't set pre_fail when we would retry17:53
@clarkb:matrix.orgI have confirmed via logs on ze05 that early failure detection did occur for this job18:00
@clarkb:matrix.org`2024-06-28 14:22:53,303 INFO zuul.AnsibleJob: [e: 967b8cbde1cb4649bc4b7f5d7ded6be8] [build: a51b05d89221490cb51d6c777dd42f8f] Early failure in job`18:00
@jim:acmegating.comClark: what's the test you reckon covers this?18:00
@clarkb:matrix.orgcorvus: test_pre_run_failure_retry18:01
@clarkb:matrix.orgor at least it attempts to cover this?18:01
@clarkb:matrix.orgwe do appear to be failing in a pre-run playbook18:02
@clarkb:matrix.orgcorvus: reading the executor code I half suspect that we're properly saying don't allow prefail when running the pre playbook, but then when we go to run the post playbook to finish up the job we allow setting pre fail and it is that point that we read the logs for the pre playbook and set pre_fail. But since we in the post phase we set the flag18:05
@jim:acmegating.comi was noting that an interesting characteristic is that this job apparently also has a post failure18:07
@clarkb:matrix.orgcorvus: oh yup and the Early failure in job is detected during the post failure18:07
@clarkb:matrix.orgso maybe pre is totally fine and it is the mixing of a retry from a pre failure with a failure in post run18:07
@clarkb:matrix.orgya I think that is it18:08
@clarkb:matrix.orgI think to be safe we need to carry forward whether or not we had a retryable pre failure into subsequent runAnsiblePlaybook() calls so that we don't try and set pre_fail if so18:08
@jim:acmegating.comClark: how do we read the logs for the pre-run playbook during post-run?18:11
@jim:acmegating.comthat should only be from the current playbook execution18:11
@clarkb:matrix.orgcorvus: I don't think we do. As noted above the logging for early failure detection occured during the post-run playbook which made me think that could happen. But then you pointed out post-run is failing too. I think we're detecting the post-run failure instead18:12
@jim:acmegating.comoh, we're getting an early failure detection in a post-run playbook?18:12
@clarkb:matrix.orgcorvus: yes I think that is what happens. Pre-run fails and we hanlde that properly. Then we proceed to run post-run and that fails and we detect early failure there which sends the pre-fail flag up18:13
@clarkb:matrix.organd before we can set the job status to RETRY we've already detected pre-fail which is treated as a failure and we start resetting stuff18:13
@jim:acmegating.comso edge case is: pre-run task failure and post-run task failure.18:13
@clarkb:matrix.orgI think in zuul executor we can pass the should_retry flag into the post-run call and only set pre fail if should_retry is false18:13
@sean-k-mooney:matrix.orgthis is the job we think cuase the cancelation https://zuul.opendev.org/t/openstack/build/a51b05d89221490cb51d6c777dd42f8f/log/job-output.txt#20118:14
@clarkb:matrix.orgcorvus: yup18:14
@sean-k-mooney:matrix.orgthe first failure was in pre-run with a failure to lookup opendev.org18:14
@jim:acmegating.comClark: i agree with the fix18:15
@sean-k-mooney:matrix.orgthen log collection failed in post-run here https://zuul.opendev.org/t/openstack/build/a51b05d89221490cb51d6c777dd42f8f/log/job-output.txt#25118:15
@sean-k-mooney:matrix.orgbecuase we didn insll tokx 18:15
@jim:acmegating.comClark: you want to do the typing or shall i?18:15
@clarkb:matrix.orgcorvus: I can make an attempt at it18:15
@jim:acmegating.comcool thx; i will review it then :)18:16
@clarkb:matrix.orgmost of the fun in this change is going to be adding a new test.18:24
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/92306618:41
@clarkb:matrix.orgSomething like that maybe18:41
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/92306619:10
@clarkb:matrix.orgcorvus: hrm the test failed in ^ and the log shows "Early failure in job" which I don't understand since we should be in retry mode there20:30
@clarkb:matrix.orgI wonder if there is something else we're missing here20:30
@sylvass:albinvass.seI got an issue with packaging the zuul-executor on nixos. the short story is that the initial bubblewrap check fails because /nix/store has to be mounted into the chroot.20:31
I already had to do some local patches to make zuul check all default --ro-binds so it doesn't make assumptions about which directories exists. Basically I moved the --ro-binds defined here: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/bubblewrap/__init__.py#L297
to here: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/bubblewrap/__init__.py#L320
However executables on nixos are stored in `/nix/store` (`/bin/sh` and `/usr/bin/env` exist by default but they're symlinks to the store) so if that isn't mounted then the `id` executable won't be available for the bwrap check: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/bubblewrap/__init__.py#L217
I could add `/nix/store` as well to the default `--ro-bind` list, but I don't really like the idea of having hard-coded os specific paths. On the other hand the constructor doesn't take any ro_bind arguments and it isn't aware of the zuul.conf, so adding an argument and piping that data through to the constructor everywhere seems like quite a bit of work. (but maybe I should just try to do that?)
@clarkb:matrix.orgoh wait the Early failure in job is on the second run which is correct. We are aborting the first wait job because post-run fails which means we can't retry?20:32
@clarkb:matrix.orgexcept None is the return value which should still be retryable so I'm confused20:32
@clarkb:matrix.orgcorvus: ok I think I get it change 1 cannot merge20:35
@clarkb:matrix.orgso once zuul knows that (after the retries all complete and fail it aborts the job on change 2 and reenquees it without change 1 in the speculative state20:36
@clarkb:matrix.orgnow to figure out where we've got the test case doing the wrong thing and fix it20:36
@jim:acmegating.comAlbin Vass: make it conditional?  if '/nix/store' exists then add it to default list?20:37
@jim:acmegating.comAlbin Vass: i will say though that i'm hesitant to suggest that we think nixos is supported though20:38
@clarkb:matrix.orgyou also probably don't want to mount the entire nix store just the subset that represents the things you need?20:39
@jim:acmegating.comAlbin Vass: at any rate, that should make for a compact local patch for now.  :)20:39
@clarkb:matrix.org(which I'm not sure how you resolve that)20:39
@sylvass:albinvass.seyeah I'm leaning towards that, but it would be better if it didn't require a zuul change if for example someone wanted to package it for guix20:39
@sylvass:albinvass.seClark: that should be possible but there's no reason not to mount the entire store20:39
@clarkb:matrix.orgAlbin Vass: reducing the security risk of old versions being exposed is why20:40
@jim:acmegating.coma terminology quibble: this is not "packaging" this is "adding support for"  :)20:40
@clarkb:matrix.orgaiui nix will happily retain old versions that could potentially be abused?20:40
@sylvass:albinvass.seimo then they should just be garbage collected20:40
@sylvass:albinvass.seI'm not thinking of giving jobs access to the nix daemon to build packages on executors, I just want my executables available :)20:41
@sylvass:albinvass.seClark: `nix path-info --recursive /run/current-system` to get all paths needed for the current system :)20:43
@sylvass:albinvass.se(but that would be a ton of bind mounts)20:43
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/92306620:43
@clarkb:matrix.orgcorvus:  ^ the comment in the test delta between last ps and current ps explains my understanding of why it failed and why I needed to change it20:44
@clarkb:matrix.orgcorvus: I guess let me know if you think that is inaccurate20:44
@jim:acmegating.comClark: that makes sense because that determination is made executor-side20:45
@clarkb:matrix.orgAlbin Vass:  ya i think the main risk is that you end up with some setuid/gid executable hanging around that you don't want to expose. This is also technically an issue with any other distro except most other distros are pretty aggressive about pruning and only having the one version around20:46
@clarkb:matrix.org * Albin Vass:  ya i think the main risk is that you end up with some buggy setuid/gid executable hanging around that you don't want to expose. This is also technically an issue with any other distro except most other distros are pretty aggressive about pruning and only having the one version around20:46
@jim:acmegating.comi think this conversation is a good illustration of why it's good for the zuul project to be clear about it's expectations on runtime environment.  i don't think most of us even know what questions to ask about running securely on nixos.  it's developed for linux and tested on debian... and beyond that <shrug>.20:48
@jim:acmegating.com * i think this conversation is a good illustration of why it's good for the zuul project to be clear about it's expectations on runtime environment.  i don't think most of us even know what questions to ask about running securely on nixos.  it's developed for gnu/linux and tested on debian... and beyond that \<shrug>.20:48
@jim:acmegating.com(a rare time i have to actually break out the "gnu/linux" thing to be specific :)20:49
@sylvass:albinvass.seSure. It's quite easy to get the exact paths you need. But to do that I need to be able to tell bubbwrap what to mount :)20:51
@clarkb:matrix.orgmore generally testing that paths exist before adding them to the ro or rw paths makes sense. And maybe its a warning if they don't exist and are in the default list and it is an error if they aer in the extra list?20:57
@clarkb:matrix.organother appraoch (that is maybe more hacky) would be to wrap bwrap with something that adds appropriate nix paths for your platform20:59
@clarkb:matrix.orgbut that might make future debugging annoying when we're all looking at the code and wondering why you get different behavuior21:00
@sylvass:albinvass.seI tried wrapping zuul-executor with bubblewrap. But yeah I'd rather have a simpler setup 21:06
@sylvass:albinvass.seClark: oh and nix won't create executables with setuid in /nix/store.21:23
@sylvass:albinvass.seso that should generally be fine. instead nixos sets up wrappers in `/run/current-system/sw/bin` whenever that's needed21:23
@aureliojargas:matrix.orgHi there, just to let you know that I've updated the `ensure-poetry` role proposal, addressing all the code review comments: https://review.opendev.org/c/zuul/zuul-jobs/+/922286 21:41
-@gerrit:opendev.org- Albin Vass proposed:21:50
- [zuul/zuul] 923076: Print stdout on bwrap error https://review.opendev.org/c/zuul/zuul/+/923076
- [zuul/zuul] 923077: bwrap: check all ro-bind directories before adding flags https://review.opendev.org/c/zuul/zuul/+/923077
- [zuul/zuul] 923078: bwrap: parse args before checking bubblewrap https://review.opendev.org/c/zuul/zuul/+/923078
- [zuul/zuul] 923079: Add /nix/store to default ro-paths https://review.opendev.org/c/zuul/zuul/+/923079
@sylvass:albinvass.seanyway that's what I got so far ^21:51
@clarkb:matrix.orgarg the order of those jobs being processes by the scheduler is actually racy22:00
@clarkb:matrix.orgmaybe its ok to simply check the jobs have the correct status and not worry about the order22:00
@jim:acmegating.comClark: set ordered=False in assertHistory22:01
@clarkb:matrix.orgyup I'm on it. I just want to leave a comment there that captures why22:04
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/92306622:05
@jim:acmegating.comClark: fwiw, i don't think a comment is necessary there -- *most* of the tests need ordered=False.  :)22:10
@jim:acmegating.com(doesn't hurt of course; just indicating it's pretty normal)22:11
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:22:12
- [zuul/zuul] 923080: Update providers on launcher https://review.opendev.org/c/zuul/zuul/+/923080
- [zuul/zuul] 923081: Add image build triggers https://review.opendev.org/c/zuul/zuul/+/923081
@clarkb:matrix.orgcorvus: ack the default appears to be true so I figured I'd explain it22:14
-@gerrit:opendev.org- Albin Vass proposed:22:27
- [zuul/zuul] 923076: Print stdout on bwrap error https://review.opendev.org/c/zuul/zuul/+/923076
- [zuul/zuul] 923077: bwrap: check all ro-bind directories before adding flags https://review.opendev.org/c/zuul/zuul/+/923077
- [zuul/zuul] 923078: bwrap: parse args before checking bubblewrap https://review.opendev.org/c/zuul/zuul/+/923078
-@gerrit:opendev.org- Albin Vass proposed:22:32
- [zuul/zuul] 923076: Print stdout on bwrap error https://review.opendev.org/c/zuul/zuul/+/923076
- [zuul/zuul] 923078: bwrap: parse args before checking bubblewrap https://review.opendev.org/c/zuul/zuul/+/923078
- [zuul/zuul] 923077: bwrap: check all ro-bind directories before adding flags https://review.opendev.org/c/zuul/zuul/+/923077
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/92306622:36
@clarkb:matrix.orgcorvus: the linters were mad at my comment so I removed it22:36
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 923083: Install the docker-compose-plugin when installing upstream docker https://review.opendev.org/c/zuul/zuul-jobs/+/92308323:26
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 923084: WIP: Try docker compose with podman https://review.opendev.org/c/zuul/zuul/+/92308423:43
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul] 923066: Don't pre fail jobs if they will retry https://review.opendev.org/c/zuul/zuul/+/92306623:53

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!