openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: Use kopf operator framework https://review.opendev.org/c/zuul/zuul-operator/+/785039 | 00:33 |
---|---|---|
*** josefwells has joined #zuul | 01:10 | |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: Use kopf operator framework https://review.opendev.org/c/zuul/zuul-operator/+/785039 | 01:14 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Bump API version to v1alpha2 https://review.opendev.org/c/zuul/zuul-operator/+/785047 | 01:14 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Support externally managed Zookeeper and DB https://review.opendev.org/c/zuul/zuul-operator/+/785273 | 01:14 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Pass through extra scheduler config options https://review.opendev.org/c/zuul/zuul-operator/+/785277 | 01:14 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Add merger support https://review.opendev.org/c/zuul/zuul-operator/+/785278 | 01:14 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Support imagePrefix and versions https://review.opendev.org/c/zuul/zuul-operator/+/785279 | 01:14 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: docs https://review.opendev.org/c/zuul/zuul-operator/+/785083 | 01:14 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Support fingergw https://review.opendev.org/c/zuul/zuul-operator/+/785300 | 01:14 |
*** hamalq has quit IRC | 01:21 | |
*** spotz has quit IRC | 01:32 | |
*** josefwells has quit IRC | 01:58 | |
*** rlandy|rover|bbl is now known as rlandy|rover | 02:17 | |
*** avass has quit IRC | 02:22 | |
*** rlandy|rover has quit IRC | 02:32 | |
*** evrardjp has quit IRC | 02:33 | |
*** evrardjp has joined #zuul | 02:33 | |
*** sam_wan has joined #zuul | 02:55 | |
*** sam_wan has quit IRC | 02:56 | |
*** sam_wan has joined #zuul | 03:24 | |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: Use kopf operator framework https://review.opendev.org/c/zuul/zuul-operator/+/785039 | 03:48 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Bump API version to v1alpha2 https://review.opendev.org/c/zuul/zuul-operator/+/785047 | 03:48 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Support externally managed Zookeeper and DB https://review.opendev.org/c/zuul/zuul-operator/+/785273 | 03:48 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Pass through extra scheduler config options https://review.opendev.org/c/zuul/zuul-operator/+/785277 | 03:48 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Add merger support https://review.opendev.org/c/zuul/zuul-operator/+/785278 | 03:48 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Support imagePrefix and versions https://review.opendev.org/c/zuul/zuul-operator/+/785279 | 03:48 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: Support fingergw https://review.opendev.org/c/zuul/zuul-operator/+/785300 | 03:48 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: docs https://review.opendev.org/c/zuul/zuul-operator/+/785083 | 03:48 |
*** ykarel|away has joined #zuul | 03:54 | |
*** vishalmanchanda has joined #zuul | 04:15 | |
*** saneax has joined #zuul | 04:20 | |
*** paladox has quit IRC | 04:27 | |
*** saneax has quit IRC | 04:31 | |
*** paladox has joined #zuul | 04:33 | |
*** paladox has quit IRC | 04:43 | |
*** paladox has joined #zuul | 04:45 | |
*** saneax has joined #zuul | 04:51 | |
*** jfoufas1 has joined #zuul | 04:57 | |
openstackgerrit | Simon Westphahl proposed zuul/zuul master: Stop active event gathering on connection loss https://review.opendev.org/c/zuul/zuul/+/785100 | 06:16 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Fix missing repo state restore https://review.opendev.org/c/zuul/zuul/+/785310 | 06:42 |
*** ykarel|away is now known as ykarel | 06:43 | |
tobiash | clarkb: this is a more generic attempt to fix the repo state restore ^ | 06:43 |
*** tosky has joined #zuul | 06:45 | |
*** reiterative has quit IRC | 06:49 | |
*** reiterative has joined #zuul | 06:49 | |
*** jpena|off is now known as jpena | 06:50 | |
*** avass has joined #zuul | 06:57 | |
*** rpittau|afk is now known as rpittau | 07:01 | |
avass | Are there still problems with arm nodes? zuul/zuul-jobs gate is currently blocked because of it | 07:03 |
*** jcapitao has joined #zuul | 07:04 | |
*** saneax has quit IRC | 07:17 | |
*** jcapitao has quit IRC | 07:19 | |
*** jcapitao has joined #zuul | 07:21 | |
*** saneax has joined #zuul | 07:38 | |
*** ykarel_ has joined #zuul | 08:00 | |
*** ykarel has quit IRC | 08:02 | |
*** ykarel_ is now known as ykarel | 08:02 | |
openstackgerrit | Merged zuul/zuul master: Gitlab: raise MergeFailure exception to retry a failing merge https://review.opendev.org/c/zuul/zuul/+/777169 | 08:11 |
openstackgerrit | Merged zuul/zuul master: Add messages to make the job setup more transparent https://review.opendev.org/c/zuul/zuul/+/777885 | 08:13 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Fix missing repo state restore https://review.opendev.org/c/zuul/zuul/+/785310 | 08:13 |
openstackgerrit | Merged zuul/nodepool master: Add simple load testing script https://review.opendev.org/c/zuul/nodepool/+/775843 | 08:18 |
*** ykarel is now known as ykarel|lunch | 08:23 | |
openstackgerrit | Ian Wienand proposed zuul/nodepool master: Require dib 3.9.0 https://review.opendev.org/c/zuul/nodepool/+/785347 | 09:12 |
*** ykarel|lunch is now known as ykarel | 09:22 | |
*** sshnaidm|afk is now known as sshnaidm | 09:31 | |
*** jcapitao has quit IRC | 09:46 | |
icey | n | 10:31 |
*** sshnaidm has quit IRC | 11:01 | |
*** sshnaidm has joined #zuul | 11:04 | |
*** sam_wan has quit IRC | 11:08 | |
*** jpena is now known as jpena|lunch | 11:30 | |
*** sshnaidm has quit IRC | 11:37 | |
*** rlandy has joined #zuul | 11:43 | |
*** rlandy is now known as rlandy|rover | 11:43 | |
*** pots has quit IRC | 11:49 | |
*** pots has joined #zuul | 11:50 | |
*** sshnaidm has joined #zuul | 11:50 | |
*** jcapitao has joined #zuul | 12:07 | |
*** jpena|lunch is now known as jpena | 12:31 | |
zbr | tobiash: tristanC: fungi: https://review.opendev.org/c/zuul/zuul/+/766460 is finally green, improved dev guide page: https://d1c94900ec6853cf329a-2ce6583bcfee959f0e7ee40d82e3f479.ssl.cf2.rackcdn.com/766460/26/check/zuul-tox-docs/6ed9304/docs/reference/developer/index.html?highlight=developer | 12:58 |
*** sanjayu_ has joined #zuul | 13:01 | |
tristanC | zbr: thanks! | 13:03 |
*** saneax has quit IRC | 13:03 | |
*** sanjayu__ has joined #zuul | 13:16 | |
*** Goneri has joined #zuul | 13:18 | |
*** sanjayu_ has quit IRC | 13:18 | |
zbr | tristanC: it took 10x more time than I was expecting for such a minor docs improvement. | 13:26 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: Use kopf operator framework https://review.opendev.org/c/zuul/zuul-operator/+/785039 | 13:43 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Add a checkpoint release note https://review.opendev.org/c/zuul/zuul/+/785054 | 13:53 |
tobiash | clarkb, corvus: the repo state fix revealed a conceptual conflict between buildset-global repo state and job-refreezing during tenant reconfigurations | 14:00 |
corvus | tobiash: because a reconfig could add a project to required-projects? | 14:01 |
tobiash | yes, or just a job that needs a new playbook | 14:02 |
tobiash | so the goal of buildset-global repo state is to make all jobs within a buildset consistent so all use the same repo states | 14:03 |
tobiash | a refreeze during reconfig basically can alter the job config of all jobs within a buildset that have not yet been started | 14:05 |
tobiash | thinking about this I think the refreeze during a reconfig also breaks the consistency within a buildset | 14:06 |
corvus | tobiash: that's correct; we accepted it as a sort of best effort. unlike a strict gating sequence, there's really no synthetic point in time where we can say that a new configuration should apply. so we just apply it asap, on new jobs, and we don't try to apply it retroactively on already running or complete jobs (ie, we don't abort/re-run them) | 14:08 |
tobiash | corvus: I think one way to fix this conflict and keep the consistency could be to only refreeze buildsets which have no running jobs yet | 14:09 |
tobiash | if we don't accept the less asap way another option could be to re-process the merge and repo state generation and use the new repo state for the new jobs, but that breaks the goal of global repo state in case of reconfigs | 14:12 |
tobiash | a side effect of the no-refreeze-started-buildsets approach would be a much faster reconfig | 14:14 |
corvus | tobiash: i don't think there's a clearly right/wrong answer here and would be comfortable with either. i think option 1 is probably a little better because it does try to maintain the global state. it might be good to store a flag on items which are running with a prior configuration so if we wanted, we could display that in the ui. the biggest advantage of the asap approach is being able to fix something | 14:16 |
corvus | external to the gate chain and have it apply quickly. like, if zuul ran tempest but isn't in tempest's gate queue, a fix to tempest would take effect sooner. having that flag in the ui could help people decide to dequeue buildsets running on the old config. | 14:16 |
corvus | tobiash: iow, probably picking one, documenting it, and making it visible/discoverable is the best thing we can do in an ambiguous situation like this :| | 14:17 |
zbr | corvus: re TESTING.rst, afaik that page is not part of the built docs. not sure how to fix that. | 14:26 |
zbr | that one is static, the other one is dynamic. we could move that one inside the development one and remove this file. | 14:28 |
corvus | zbr: that works for me, thanks | 14:29 |
corvus | i think the main thing to avoid is having two sources of truth | 14:29 |
zbr | i agree with that, i will also make a few small fixes like replacing py35 with just py on it | 14:30 |
corvus | ++ | 14:30 |
corvus | tobiash: so is this the repo update problem that clarkb saw? | 14:30 |
tobiash | corvus: well, the current global repo state implementation is broken which creates that problem as a side effect | 14:31 |
*** ykarel has quit IRC | 14:32 | |
corvus | tobiash: 785310 fixes that? | 14:32 |
*** ykarel has joined #zuul | 14:32 | |
tobiash | corvus: yes | 14:33 |
tobiash | the problem was that the global repo state was created, then the isupdateof thinks no update is needed but then the restore of the repo state is missing | 14:34 |
tobiash | which basically leads to a potentially wrong commit checked out | 14:34 |
tobiash | but 785310 doesn't pass tests that test the reconfig | 14:35 |
corvus | tobiash: okay, so 2 related issues: a regression which is theoretically fixed by 785310, and the newly discovered/discussed inconsistency in reconfiguration, which doesn't have a patchset yet (which isn't so much a regression as a slight undermining of the intent of global state). we probably need the first in 4.2.0, but the second can wait a bit if necessary? | 14:35 |
corvus | oh | 14:37 |
tobiash | the problem is that 785310 requires a fix for the inconsistency in reconfig | 14:37 |
corvus | tobiash: we might need the second to fix the tests on the first :) | 14:37 |
corvus | ya that :) | 14:37 |
corvus | i'm waking up :) | 14:37 |
tobiash | another option would be to revert global repo state and re-merge it when we have a fix for the reconfig inconsistency | 14:38 |
tobiash | that wouldn't block the release then | 14:38 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Revert "Make repo state buildset global" https://review.opendev.org/c/zuul/zuul/+/785427 | 14:43 |
tobiash | corvus: revert of the repo state in case we want to revert until we have the reconfig change ^ | 14:43 |
zbr | I observed an annoying behavior on ubuntu where it prompts me with: correct 'docs' to 'doc' [nyae]? n -- when I run tox -e docs | 14:45 |
zbr | this is because the docs folder is called doc but the tox command is docs. | 14:45 |
zbr | ideally we should be consistent and avoid confusing the tooling | 14:46 |
fungi | avass: we identified a recent regression in diskimage-builder 3.7.1 related to the introduction of secure boot support for centos 8.x, which broke the efi config and hence the ability of us to boot our centos-8-arm64 and centos-stream-8-arm64 images for the past ~week | 14:52 |
fungi | diskimage-builder 3.9.0 has the fix and we're in the process of trying to get working images built in nodepool again for those | 14:52 |
fungi | unfortunately nobody spotted that the nodes weren't booting at the end of last week, so by the time it was brought to our attention on saturday we no longer had any bootable images for those | 14:53 |
fungi | as i discovered after trying to roll back to the previous image, which was broken in the same way | 14:54 |
*** ykarel has quit IRC | 14:57 | |
*** spotz has joined #zuul | 14:58 | |
openstackgerrit | Sorin Sbârnea proposed zuul/zuul master: Move testing doc into sphinx doc https://review.opendev.org/c/zuul/zuul/+/785430 | 15:00 |
corvus | avass, fungi: we don't use arm nodes in the zuul-jobs gate do we? | 15:01 |
corvus | oh apparently we do | 15:02 |
corvus | but we shouldn't | 15:02 |
clarkb | I think there are jobs that try to cover a couple of arm specific things? | 15:03 |
corvus | clarkb: one job for one specific thing: https://review.opendev.org/746245 | 15:03 |
clarkb | tobiash: did you have a chance to see my comments on https://review.opendev.org/c/zuul/zuul/+/785152 ? I'll look at the alternative shortly | 15:03 |
corvus | avass, fungi, clarkb, ianw: the arm resource pool isn't really robust enough for us to add jobs to check/gate; if we really want to run that, we should probably add a second check pipeline | 15:04 |
corvus | avass: i'd be in favor of disabling that job for now (and when/if we re-enable it, put it in another pipeline) | 15:05 |
clarkb | ++ the second check pipeline seems to work well for system-config | 15:05 |
fungi | i concur | 15:05 |
fungi | right now we literally have only one provider for those nodes | 15:05 |
fungi | a second would be most welcome | 15:05 |
clarkb | we have a potential second provider that reached out just as my day ended yesterday | 15:06 |
fungi | the recent dib regression is somewhat related to that lack of robustness. we don't want to gate dib changes on arm jobs so don't test that they're able to boot | 15:07 |
clarkb | kevinz seems to have sent them to ianw and myself, will say more when I know more :) | 15:07 |
corvus | cool; to be clear (since i'm about to propose a change to remove them) i think the arm64 jobs are worthwhile and would love to have them either in a second pipeline, or when things are robust enough (2 providers: yay!) to have them back in check :) | 15:08 |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Remove arm64 jobs (temporarily) https://review.opendev.org/c/zuul/zuul-jobs/+/785432 | 15:09 |
corvus | avass, fungi, clarkb: ^ | 15:10 |
corvus | tobiash: i'll see if i can cobble together a change to the reconfiguration as we discussed; if we can get that together by the end of the day, then it's probably worth doing that; otherwise, let's take the revert. so hopefully either way we can restart tomorrow. | 15:11 |
corvus | clarkb: ^ for when you're caught up :) | 15:11 |
clarkb | corvus: that plan makes sense to me. The reconfiguration change would be in addition to https://review.opendev.org/c/zuul/zuul/+/785310 ? Note this change isn't passing tests yet | 15:12 |
clarkb | Once I've got some tea and breakfast I'm going to try and review that one | 15:14 |
corvus | clarkb: it'll have to be included -- the tests are failing because they need the reconfiguration change | 15:15 |
clarkb | got it | 15:18 |
openstackgerrit | Merged zuul/zuul master: Stop active event gathering on connection loss https://review.opendev.org/c/zuul/zuul/+/785100 | 15:34 |
*** sanjayu__ has quit IRC | 15:52 | |
*** rpittau is now known as rpittau|bbl | 15:57 | |
*** jpena is now known as jpena|off | 15:59 | |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: Use kopf operator framework https://review.opendev.org/c/zuul/zuul-operator/+/785039 | 16:31 |
*** hamalq has joined #zuul | 16:33 | |
*** vishalmanchanda has quit IRC | 16:34 | |
clarkb | corvus: tobiash: if I'm reading this change correctly essentially the way we're proposing things work is that isUpdateNeeded() will always ensure the necessary revs are present in the repo then later we run _restoreRepoState() with repo_state that says project foo branch bar is at rev xyz | 16:39 |
*** jcapitao has quit IRC | 16:39 | |
clarkb | and that step will update branch bar to point at rev xyz if it doesn't already | 16:39 |
clarkb | and I guess that repo_state will always know about all the tags and heads and such? | 16:39 |
tobiash | yes | 16:39 |
clarkb | thanks, in that case these changes seem like they should work (once updated to fix the other issue that was exposed) | 16:41 |
*** sshnaidm has quit IRC | 16:43 | |
corvus | i'm a little confused about the two different repo states in the executor | 16:44 |
*** sshnaidm has joined #zuul | 16:44 | |
openstackgerrit | Merged zuul/nodepool master: Require dib 3.9.0 https://review.opendev.org/c/zuul/nodepool/+/785347 | 16:50 |
*** sshnaidm is now known as sshnaidm|afk | 17:10 | |
corvus | tobiash: can you look at my comment on that change? (cc clarkb) | 17:15 |
corvus | if i'm right, then that bit of code is unnecessary (but extra safe so maybe we want to keep it anyway but change the comment?) if i'm wrong, then i'm missing something and would like to fully understand. :) | 17:17 |
tobiash | corvus: as far as I've understood the mergeItem it updates the merged branch within the repo state | 17:19 |
tobiash | that's why I wanted to be sure that we use an original repo state for trusted repos later | 17:19 |
corvus | tobiash: i don't see how it can do that; nothing touches repo_state in _mergeItem after the merge | 17:21 |
tobiash | then I'm confused why it should update it | 17:21 |
corvus | tobiash: you saw that happen? | 17:21 |
corvus | or you're confused about why we call saverepostate? | 17:22 |
tobiash | https://opendev.org/zuul/zuul/src/branch/master/zuul/merger/merger.py#L954 | 17:23 |
tobiash | I'm confused about that | 17:23 |
tobiash | that clearly pretends to mutate the repo state | 17:23 |
tobiash | or is that a noop within the executor? | 17:23 |
corvus | tobiash: the process is slightly different depending on whether it's happening on enqueue or execution | 17:24 |
corvus | tobiash: yep that's the idea | 17:24 |
tobiash | ah ok | 17:24 |
tobiash | then I've misunderstood that | 17:24 |
corvus | tobiash: on enqueue, it starts empty and populates as we merge each item | 17:24 |
tobiash | based on that line I wanted to be safe ;) | 17:24 |
corvus | tobiash: on the executor, it starts fully populated and shouldn't change | 17:24 |
tobiash | k, then you're right and we don't have to copy | 17:24 |
corvus | ok. you could talk me into keeping the copy as extra protection against a future bug that could cause us to update the trusted repos, as long as we change the comment to say that :) | 17:25 |
corvus | tobiash: and thank you very much for putting the comment you did put there, that helped me realize the discrepancy :) | 17:26 |
corvus | this is tricky stuff | 17:27 |
tobiash | actually I don't mind whether I remove that or change the comment, I leave the decision to you | 17:27 |
corvus | okay, i'm starting work on the refreeze part of this, i'll pick one :) | 17:28 |
tobiash | ++ | 17:30 |
avass | corvus, fungi, clarkb: wanna promote 785432 in zuul-jobs gate so the changes can start merging? :) | 17:33 |
fungi | avass: sure, on it | 17:35 |
avass | thanks! | 17:36 |
fungi | and done | 17:37 |
fungi | #status log Promoted 785432,1 in the zuul tenant's gate pipeline due to indefinitely waiting builds ahead of it | 17:40 |
openstackstatus | fungi: finished logging | 17:40 |
*** rpittau|bbl is now known as rpittau | 17:44 | |
openstackgerrit | Merged zuul/zuul-jobs master: Remove arm64 jobs (temporarily) https://review.opendev.org/c/zuul/zuul-jobs/+/785432 | 17:46 |
fungi | yeah, basically we have the following nodes in linaro-us: two debian-buster-arm64 nodes in-use for over an hour, two ubuntu-focal-arm64 nodes in use for half an hour, five debian-buster-arm64 nodes deleting for the past few minutes, and a ready ubuntu-focal-arm64-xxxlarge node nothing's used for 8.5 hours yet | 17:49 |
fungi | er, sorry, that was for #opendev | 17:49 |
openstackgerrit | Merged zuul/zuul-jobs master: Add upload-logs-azure role https://review.opendev.org/c/zuul/zuul-jobs/+/782004 | 17:50 |
corvus | tobiash: if we go with the approach we talked about, we may lose the ability to remove broken jobs from running queue items (a feature we literally just took advantage of 30 minutes ago ^) | 18:11 |
tobiash | corvus: we still have dequeue and abandon/restore | 18:12 |
corvus | tobiash: right, but so much of the reconfiguration code is explicitly designed to support this behavior | 18:13 |
tobiash | corvus: we could also retain removing jobs as a special case | 18:13 |
tobiash | like temporary refreeze and remove non existing jobs from the original buildset | 18:14 |
tobiash | although this wouldn't give us the benefit of faster reconfigs | 18:15 |
corvus | i'm concerned about inconsistent configurations -- we could have a change ahead that removes a job that produces an artifact, then a change behind that depends on that and it never arrives | 18:16 |
corvus | i'm having trouble seeing how we can have consistent global configuration with updates, at the same time we have consistent buildset contents | 18:17 |
corvus | tobiash: i think the only way we can have both would be to discard already completed builds if the repo state has changed. | 18:19 |
corvus | (which is not great for, say, a publishing pipeline) | 18:20 |
corvus | (i mean, i guess it would be discard already completed builds if both the layout and the repo state has changed... something like that) | 18:21 |
tobiash | discarding in a gate pipeline would be effectively a gate reset | 18:23 |
corvus | yeah | 18:23 |
tobiash | that can mean throwing away two hours of build time in our more crowded gates | 18:24 |
corvus | or, perhaps we weaken the buildset repo state guarantee so that we reset it for jobs that haven't started yet, but only if necessary (because a job config change)? | 18:24 |
corvus | i just thought of another issue -- when we add new items to a gate pipeline, we use the layout from the item ahead if it's been speculatively updated; if we don't update that after a reconfig, then we could keep using an old layout perpetually as we keep enqueuing items | 18:28 |
tobiash | I think that could be updated without refreezing | 18:29 |
tobiash | so the already frozen jobs run as is and later items start with an updated layout? | 18:30 |
corvus | yeah, if that's what we wanted to do, i suppose it's possible. it would mean that we accommodate having frozen jobs that don't match their layout. | 18:31 |
corvus | i don't love that idea; this is getting really complicated. | 18:31 |
tobiash | so you'd do the refreezing combined with repo state refresh if the job config changed? | 18:35 |
corvus | i think that's one option. it supports most of our goals, except the buildset repo state in some circumstances | 18:36 |
tobiash | I guess we basically have two choices, either ditch updates on buildsets (and thus don't remove jobs) or do the weakened repo state | 18:36 |
corvus | trying to take some notes: https://etherpad.opendev.org/p/xtM3zRa7-xe5RkNejGOq | 18:37 |
tobiash | what do you mean with configuration consistency? | 18:39 |
mordred | corvus, tobiash: "on configuration change" - I'm playing catch up - which type of config change do we mean here - like landing a non-speculative change? | 18:40 |
tobiash | mordred: any tenant reconfig, e.g. removing/adding jobs from a pipeline | 18:41 |
mordred | nod | 18:41 |
mordred | thanks | 18:41 |
corvus | tobiash: two things: that a queue item's layout is based on the layout ahead of it (or the pipeline layout), and that a user can know what zuul is running by inspecting the repos and changes. | 18:41 |
corvus | tobiash: i'm not worried about reconfiguration time; i'm much more worried about correctness | 18:42 |
tobiash | I think both options can be correct (if we don't care about real time job removal/addition) | 18:43 |
corvus | i'm just saying i don't think it's a goal here and listing it isn't helping me evaluate the options | 18:44 |
tobiash | just wanted to mention that since reconfig times are still one of our biggest problems | 18:45 |
corvus | understood | 18:46 |
corvus | "configuration consistency needs updating layout without refreezing (can add complexity)" what does that mean? | 18:46 |
corvus | oh, the thing about carrying around layouts on queue items without refreezing them | 18:46 |
corvus | i think i got it | 18:46 |
tobiash | yes | 18:46 |
tobiash | feel free to rephrase it :) | 18:47 |
tobiash | somehow that feels like a decision of real-time job updates vs repo state guarantee | 18:51 |
corvus | i'd like to be really clear that real-time updates have been an explicit goal we have done a huge amount of work across many years to enable | 18:52 |
corvus | it's not an accident. and the fact that a user can actually know what zuul is running at any point in time by inspecting the repos and changes is important | 18:53 |
tobiash | then I think the best compromise is #1 | 18:53 |
corvus | maybe; but i haven't lost hope on #2 :) | 18:55 |
corvus | i'm exploring what configuration consistency means | 18:55 |
corvus | what are the ways changes can affect each other: provides/requires, hold-following-changes, semaphores? | 18:56 |
clarkb | I haven't quite followed along, but wouldn't a git change to a parent changing jobs (to say remove or add one) imply a gate reset anyway? | 18:58 |
clarkb | I'm trying to think of a situation where you wouldn't end up with a reset due to git changes | 18:58 |
tobiash | it doesn't atm and looking at base job changes gate-resetting all gates is not viable | 18:59 |
clarkb | oh I see, it is dependencies outside of the gate dag | 18:59 |
tobiash | yes | 18:59 |
corvus | clarkb: nope, we could remove the tox-docs job from all repos in project-config, and as soon as that merges, poof it disappears from all running items without disruption. | 18:59 |
corvus | right | 18:59 |
corvus | being able to make those changes without resetting a day-long pipeline is really the driver for how reconfiguration works in zuul | 19:01 |
clarkb | ya I was missing that that was something that already happens TIL | 19:01 |
corvus | and i agree, i don't think we want to add more gate resets :) | 19:01 |
tobiash | corvus: how important is the live-adding of new jobs compared to the live-removal? | 19:03 |
tobiash | I think answering this question could help us to judge between #1 and #3 | 19:03 |
corvus | tobiash: i am certain we have used both in the past in openstack. it may be that it's more important when a project is growing and less important when things have stabilized. it may also be more important with centralized config and less important with distributed config. | 19:04 |
corvus | i suspect that my personal experience has drifted to the later part of both of those continuums, so if you asked me to weigh those right now, i would probably say that adding is not as important as removing, and that even removing may not be critical. | 19:06 |
corvus | but it's hard for me to say in general, because that's just based on my own experience; another zuul user might be going through a growth phase in a centrally controlled environment where both of those are really important | 19:07 |
corvus | or maybe no one even understands that's how it works and they'd be surprised the buildset repo state could be inconsistent :) | 19:07 |
corvus | am i being helpful? :) don't answer that. | 19:08 |
tobiash | our users were surprised that the repo state could be inconsistent at least ;) | 19:08 |
clarkb | thinking out loud: could keep the set of jobs (and other related items) consistent with enqueue time unless a reset happens then update | 19:09 |
clarkb | that gives users an out (though not the easiest one to take advantage of) | 19:09 |
corvus | clarkb: i think that's basically #2? | 19:09 |
clarkb | oh there is an etherpad /me opens | 19:10 |
tobiash | yeah, that's #2 | 19:10 |
tobiash | related to that I think we should write the repo state into a file in the job logs dir | 19:10 |
corvus | the reason i was asking about provides/requires and other things is to try to figure out how important configuration consistency is. like, if we go with #2, is there ever a case where we could refreeze an item and have that adversely affect another item which was not refrozen? | 19:11 |
corvus | tobiash: i agree; i actually think that's the right way to reproduce a build. | 19:11 |
tobiash | regarding provides/requires I'm not sure but I think that would be racy now already depending if the jobs have already been started or not? | 19:12 |
tobiash | regarding semaphores I don't think there is an issue since they are locked right before job startup and unlocked after finish in any case | 19:13 |
corvus | tobiash: well, if you remove a provides job, and refreeze a requires job, then the refrozen job will no longer wait on the provides; i don't think there's a race there. if the requires has started already, then that means the provides has finished, so no problem. | 19:14 |
corvus | in a gate pipeline, if you removed a provides and it refroze, that would only happen if it reset and the provides would be behind it, and so it would get reset too and refreeze (this is still exploring option #2) | 19:15 |
corvus | and in a check pipeline under #2, none of the items would ever refreeze, right? | 19:16 |
tobiash | yes | 19:16 |
corvus | regarding hold-following-changes, if you removed a hold-following-changes job, it would continue to hold following changes until the item that was holding was refrozen without the job. likewise adding that feature. i think that means if you added it to a gate pipeline, it may start showing up randomly. like, each time an item in gate reset, it would get the hold-following-changes flag and start holding | 19:19 |
corvus | things behind it. and then later if a change ahead of it reset, it would get the flag and start holding things behind it. | 19:19 |
corvus | that might be a little weird, but maybe that behavior is okay in that circumstance? | 19:20 |
tobiash | I think that would be ok | 19:21 |
corvus | regarding semaphores -- that could be a little more problematic, in that adding a semaphore to a job wouldn't take effect until all existing runs of that job had completed. that could be a little problematic depending on exactly what the job was doing (but if you're only adding use of a contentious resource in the same change you're adding the semaphore, then the old versions of the job wouldn't be using | 19:23 |
corvus | the resource anyway). anyway, potentially problematic depending on details. removing a semaphore is probably not a big deal (you'd have extra locking for a little while until the old jobs finished) | 19:23 |
corvus | tobiash: i think those are the sort of things i had in mind with configuration consistency; but i think those are potentially acceptable behaviors in the case of #2 | 19:28 |
corvus | tobiash: i do have a slight concern that since #2 is a fundamental change to how reconfiguration works, there may be a large number of tests that need to be updated. | 19:36 |
*** jfoufas1 has quit IRC | 19:37 | |
tobiash | maybe we could also go with 1 or 3 and evaluate 2 later in more depth | 19:38 |
corvus | tobiash, clarkb, mordred: how about we proceed thusly: 1) revert buildset repo state; 2) attempt to implement option #2 in a change; if it appears viable without a rewrite of zuul and/or its test suite, 3) we ask zuul-discuss if anyone objects to the behavior change (the things we noted above plus obviously the lack of real-time job add/remove). if either the technical challenges are too complex or there | 19:39 |
corvus | are people dependent on current behavior, we look at options 1 or 3? | 19:39 |
corvus | sorry i was almost done typing that when tobiash said his thing :) | 19:39 |
corvus | i think i'm on board with tobiash's "neither of these is incorrect" idea. i mostly don't want to burn time on option 1/3 if everyone really wants #2 :) | 19:41 |
tobiash | the plan sounds viable | 20:02 |
clarkb | that seems like a reasonable plan to me | 20:03 |
avass | what actually happens with a requires right now if all buildsets providing the artifacts have already completed before the requires is enqueued? it just doesn't get any artifacts? | 20:08 |
corvus | avass: they get them; zuul looks them up in the database | 20:08 |
avass | cool | 20:09 |
corvus | clarkb: want to +3 https://review.opendev.org/785427 ? | 20:10 |
avass | it would be cool if provides/requires, files matchers and probably other things could be configurable in a project template somehow | 20:11 |
clarkb | corvus: ya let me just confirm the revert looks about right | 20:11 |
clarkb | done | 20:12 |
corvus | avass: i don't see why they wouldn't be (or maybe i don't understand) | 20:12 |
corvus | (almost everything you can do in a job you can do in a project, and anything you can do in a project you can do in a project-template) | 20:12 |
corvus | clarkb: do you think you might have time to review https://review.opendev.org/783726 today? | 20:13 |
avass | corvus: can you modify a job in a project template to "require: X" or "provide: Y" when using it? like the provides/requires is not part of the template itself but the jobs are | 20:14 |
corvus | avass: i believe so | 20:14 |
*** rlandy|rover is now known as rlandy|rover|afk | 20:14 | |
clarkb | corvus: yes, I'm about to declare that gerrit account stuff will have to wait for tomorrow (I don't like doing big things like that in the afternoon and then disappearing for dinner, and doing those updates is not fast) | 20:15 |
corvus | avass: i think that will even work as expected with the sql query :) | 20:16 |
corvus | clarkb: cool, i think if we can merge the revert, that change, and then this trivial reno: https://review.opendev.org/785054 , then we can restart opendev tomorrow and release next week | 20:17 |
avass | I suppose you just override the job in the project stanza, I wonder what would happen if the project template does that as well to set variables. like the project template has something like "jobs: [a{vars: {my_var: 1}}]" and the project stanza does "jobs: [a{requires: [X]}]". would it still have the variable set in that case? | 20:19 |
*** dmsimard6 has joined #zuul | 20:24 | |
*** tosky_ has joined #zuul | 20:25 | |
*** dmsimard has quit IRC | 20:26 | |
*** dmsimard6 is now known as dmsimard | 20:26 | |
*** bschanzel_ has quit IRC | 20:26 | |
*** tosky has quit IRC | 20:26 | |
*** bschanzel has joined #zuul | 20:27 | |
corvus | avass: every level of the system adds to the inheritance path; so what really happens is that the job defined in the project inherits from the one in the project-template which inherits from the one in the job definition, etc... so in your example, both things end up there. and if you added two different 'provides' to both then it will end up providing both things. | 20:28 |
*** tosky_ is now known as tosky | 20:28 | |
clarkb | corvus: one question on the zk race change | 20:31 |
avass | corvus: cool, I wasn't sure how it resolved that. :) | 20:33 |
corvus | clarkb: replied | 20:40 |
clarkb | corvus: thanks I have approved the change | 20:42 |
corvus | clarkb: cool, i also just left an addendum comment with additional info | 20:42 |
clarkb | I've also approved the release note | 20:42 |
corvus | thanks! | 20:45 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: Use kopf operator framework https://review.opendev.org/c/zuul/zuul-operator/+/785039 | 20:56 |
*** rpittau is now known as rpittau|afk | 21:17 | |
openstackgerrit | Merged zuul/zuul master: Revert "Make repo state buildset global" https://review.opendev.org/c/zuul/zuul/+/785427 | 21:23 |
*** sshnaidm|afk is now known as sshnaidm|off | 21:24 | |
*** fsvsbs has joined #zuul | 21:36 | |
fsvsbs | Hi, quick question: I have zuul_console streaming into /tmp/uuid.log, but it is not appearing in the Zuul web UI console. Am I missing something that I need to configure? The nodepool node is static at the moment | 21:39 |
fsvsbs | It is late in the UK so hope I can catch up with you in the morning on this | 21:40 |
corvus | fsvsbs: first thing to check is probably whether the log streaming port is open on the static node | 21:48 |
corvus | fsvsbs: tcp 19885 is the port number | 21:49 |
openstackgerrit | Merged zuul/zuul master: Fix ZK-related race condition in github driver https://review.opendev.org/c/zuul/zuul/+/783726 | 21:54 |
*** y2kenny has joined #zuul | 21:57 | |
openstackgerrit | Merged zuul/zuul master: Add a checkpoint release note https://review.opendev.org/c/zuul/zuul/+/785054 | 22:00 |
y2kenny | I have been trying to run a list of shell commands with the shell module using "with_items", with some of the items using sudo/become. All the commands seem to run, but for some reason, for the sudo commands, I see "PermissionError: [Errno 13] Permission denied" and "ValueError: No start of json char found" | 22:00 |
y2kenny | http://paste.openstack.org/show/oZQBVtt5UMdUE5COGlvQ/ | 22:00 |
y2kenny | does anyone know what could cause this? | 22:01 |
*** hamalq has quit IRC | 22:07 | |
*** hamalq has joined #zuul | 22:07 | |
corvus | y2kenny: i haven't seen that, but i can tell you that /tmp/console-0a91b40d-bdbf-0d23-f891-0000000000a5-baremetal.log is part of the zuul log streaming which relies on custom ansible plugins. also, the ValueError exception doesn't come from zuul. and finally, i think there is no testing of the ansible async module with zuul. | 22:14 |
corvus | tobiash: i've done an initial triage of failing tests with option 2; it's about 13, and at first glance, i'd say about half of those we would probably just remove because they confirm the live reconfiguration behavior. i'll do a more in-depth triage next. | 22:15 |
y2kenny | corvus: I am suspecting it's an ansible thing but I just want to double check. | 22:16 |
corvus | tobiash: list in etherpad | 22:16 |
corvus | y2kenny: very well could be, but there's definitely an intersection with zuul there, so also could be zuul, or zuul+ansible; hopefully those bits of info add context | 22:17 |
corvus | y2kenny: if it's at all possible to try something similar without async, i'd give that a shot to narrow things down | 22:18 |
corvus | y2kenny: an immediate thought occurs to me: what if the async task starts by executing a command with elevated privs, and the zuul console log gets created as root, and the later async calls that "check in" on the process don't run as root and don't have access? | 22:18 |
corvus | no idea if that is a logically consistent theory -- just brainstorming :) | 22:19 |
y2kenny | corvus: that's an interesting idea... | 22:19 |
y2kenny | I will try to disable the async and see what happens | 22:20 |
y2kenny | corvus: disabling async appears to have eliminated the ValueError. The log persists. | 22:42 |
y2kenny | the error related to the tmp/log I mean | 22:43 |
y2kenny | corvus: I have an idea to test your theory... I will get back to you | 22:45 |
y2kenny | corvus: you are probably right about the stream log. If I fix the entire task to "become: True" instead of switching between true and false depending on the item (with_items), the stream log doesn't give an error | 22:57 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Fix missing repo state restore https://review.opendev.org/c/zuul/zuul/+/785310 | 23:29 |
openstackgerrit | James E. Blair proposed zuul/zuul master: WIP: Revert "Revert "Make repo state buildset global"" https://review.opendev.org/c/zuul/zuul/+/785535 | 23:29 |
openstackgerrit | James E. Blair proposed zuul/zuul master: WIP: Keep jobgraphs frozen across reconfiguration https://review.opendev.org/c/zuul/zuul/+/785536 | 23:29 |
corvus | y2kenny: cool -- i mean, bummer about the error, but i'm glad we could make progress :) | 23:29 |
y2kenny | corvus: so is zuul stream log the third party plugin or did you mean something else? | 23:30 |
y2kenny | corvus: I am wondering where I can look to try to find the bug | 23:30 |
y2kenny | corvus: I am wondering if this is just a matter of setting the right mode for the stream log file | 23:32 |
y2kenny | (like a+rw) | 23:32 |
corvus | tobiash, clarkb, mordred: ^ i did a more in-depth triage of the tests which would be affected; i think the #2 approach is technically feasible, and i didn't see any major gotchas during my triage. i think the main new thing is we need to address behavior when a project is removed from a tenant. but i think that's a tractable problem. i uploaded a change with my triage notes inline in the tests as | 23:33 |
corvus | comments. so if you want to take a look at that and you agree, then i think next step is email to zuul-discuss to raise the behavior changes. | 23:33 |
corvus | y2kenny: the zuul log stream is an ansible plugin that zuul automatically installs; it's in the zuul/ansible dir in the zuul repo. i'm unsure if there is a reason the file is not created world-readable, or if it was just left to the default and it sometimes ends up that way due to a restricted umask. | 23:36 |
corvus | y2kenny: to be specific, i think it's the custom command action module we're talking about | 23:36 |
y2kenny | ok | 23:37 |
*** tosky has quit IRC | 23:38 | |
*** ajitha has joined #zuul | 23:44 | |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: Use kopf operator framework https://review.opendev.org/c/zuul/zuul-operator/+/785039 | 23:50 |
*** Goneri has quit IRC | 23:59 |
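
Editor's note on the 20:28 exchange: corvus describes a job in a project stanza inheriting from the same job in a project-template, which in turn inherits from the job definition, with variables and provides/requires from every level ending up on the final job. The toy model below only illustrates that described outcome. It is not Zuul's implementation; the function name `apply_variants` and the dictionary layout are hypothetical, and real Zuul merging has more rules than shown here.

```python
def apply_variants(variants):
    """Fold job variants in inheritance order (job definition first,
    project stanza last) into one effective job description.

    Hypothetical toy model: vars are merged key by key, while
    provides/requires accumulate as sets.
    """
    effective = {"vars": {}, "provides": set(), "requires": set()}
    for variant in variants:
        # Later variants add to (or override) variables from earlier ones.
        effective["vars"].update(variant.get("vars", {}))
        # provides/requires from every level are kept.
        effective["provides"] |= set(variant.get("provides", []))
        effective["requires"] |= set(variant.get("requires", []))
    return effective


# avass's example: the template sets a variable, the project adds a requires.
job_definition = {}
template_variant = {"vars": {"my_var": 1}}
project_variant = {"requires": ["X"]}

effective = apply_variants([job_definition, template_variant, project_variant])
print(effective["vars"], sorted(effective["requires"]))
```

Under this model the variable set in the template survives alongside the `requires` added in the project stanza, which matches corvus's answer that "both things end up there".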
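
Editor's note on the 22:18 and 23:36 messages: corvus suggests the console log file may be created by root and then be unreadable to a later non-root task, possibly because of a restricted umask. The sketch below does not reproduce Zuul's plugin; it only shows, on a POSIX system, how the umask in effect at creation time decides whether other users can read a new file. The helper names and the file name `console-example.log` are hypothetical.

```python
import os
import stat
import tempfile


def mode_of_new_file(umask):
    """Create a file under the given umask and return its permission bits."""
    previous = os.umask(umask)
    try:
        with tempfile.TemporaryDirectory() as tmpdir:
            path = os.path.join(tmpdir, "console-example.log")
            # open() requests mode 0o666; the bits set in the umask are cleared.
            with open(path, "w") as handle:
                handle.write("example\n")
            return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        # Always restore the caller's umask.
        os.umask(previous)


def others_can_read(mode):
    """True if users other than the owner and group may read the file."""
    return bool(mode & stat.S_IROTH)


for umask in (0o022, 0o077):
    mode = mode_of_new_file(umask)
    print(f"umask {umask:03o} -> mode {mode:03o}, "
          f"world-readable: {others_can_read(mode)}")
```

If the log file were created by root under a umask like 077, a later task running as a different user would get a permission error when reading it, which is consistent with y2kenny's observation that the error disappears when every item runs with the same `become` setting.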