ianw | i'll get back to the debian-stable removal and maybe we can free up space there | 00:00 |
---|---|---|
prometheanfire | can periodic jobs email the core reviewer teams (or be otherwise configurable for notifications)? | 01:30 |
ianw | yes they can; iirc zuul-jobs might be an example | 01:39 |
ianw | https://review.opendev.org/c/zuul/zuul-jobs/+/748682 | 01:41 |
fungi | ianw: prometheanfire: i think the only inbuilt mailing feature in zuul is the smtp exporter, and the recipient address for that is configured on a per pipeline basis | 01:58 |
fungi | https://zuul-ci.org/docs/zuul/reference/drivers/smtp.html | 01:59 |
* prometheanfire does like zuul | 02:11 | |
fungi | it's not got enough insight into gerrit's data structures to work out core reviewer addresses or anything, nor can it configure notification addresses on a per-job or per-project basis | 02:19 |
fungi | the periodic-stable pipeline, for example, is configured to send failure reports to the stable-maint ml: https://opendev.org/openstack/project-config/src/branch/master/zuul.d/pipelines.yaml#L293-L297 | 02:21 |
fungi | similarly, failures for the release-post pipeline are reported to the release-job-failures ml: from: zuul@openstack.org | 02:22 |
fungi | er, meant to paste https://opendev.org/openstack/project-config/src/branch/master/zuul.d/pipelines.yaml#L262-L266 | 02:22 |
fungi | also the pre-release, release, and tag pipelines get failure reports sent there | 02:23 |
prometheanfire | well, for zuul, could be a useful feature request | 03:01 |
*** bhagyashris__ is now known as bhagyashris | 04:11 | |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Remove debian-stable nodeset https://review.opendev.org/c/opendev/base-jobs/+/802639 | 04:25 |
ianw | fungi: ^ i think with that list of dependencies that is finally ready ... | 05:27 |
*** gibi is now known as gibi_back_15UTC | 06:37 | |
*** ysandeep is now known as ysandeep|trng | 06:50 | |
*** jpena|off is now known as jpena | 07:31 | |
frickler | those deps all seem to fail on broken c7 jobs. more skeletons in the closet ... | 07:38 |
*** ykarel is now known as ykarel|lunch | 08:41 | |
*** ykarel|lunch is now known as ykarel | 09:02 | |
opendevreview | Thierry Carrez proposed openstack/project-config master: Add ttx as OP to #openinfra-events https://review.opendev.org/c/openstack/project-config/+/814381 | 09:07 |
ttx | Zuul's release-approval queue has been blocked since 2021-10-15 15:42:42 | 09:44 |
frickler | openstack-zuul-jobs-linters is also failing with pyyaml6, should be an easy fix I hope, but lunch is first | 10:00 |
opendevreview | daniel.pawlik proposed opendev/system-config master: DNM Add zuul-log-scrapper role https://review.opendev.org/c/opendev/system-config/+/814391 | 10:01 |
*** mnasiadka_ is now known as mnasiadka | 10:10 | |
frickler | for the release-approval, iiuc that's what fungi was looking at earlier, but no resolution yet? | 10:18 |
frickler | zuul is also complaining about a lot of config errors, a significant amount seems to be rename-related | 10:19 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Fix use of yaml.load() https://review.opendev.org/c/openstack/project-config/+/814401 | 10:26 |
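For context: PyYAML 6 made the `Loader` argument to `yaml.load()` mandatory, so bare `yaml.load(f)` calls that relied on the old default now fail. A minimal sketch of the kind of fix the change above makes; the file path is only illustrative:

```python
import yaml

# Before (breaks on PyYAML 6: "load() missing 1 required positional argument: 'Loader'"):
#   data = yaml.load(open("zuul.d/pipelines.yaml"))

# Either pass a Loader explicitly...
with open("zuul.d/pipelines.yaml") as f:
    data = yaml.load(f, Loader=yaml.SafeLoader)

# ...or use the safe_load() shorthand, which behaves the same on old and new PyYAML.
with open("zuul.d/pipelines.yaml") as f:
    data = yaml.safe_load(f)
```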
frickler | ERROR: Project openinfra/ansible-role-refstack-client has non existing acl_config line | 11:01 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Fix use of yaml.load() https://review.opendev.org/c/openstack/project-config/+/814401 | 11:10 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Fix the renaming of ansible-role-refstack-client https://review.opendev.org/c/openstack/project-config/+/814409 | 11:10 |
*** dviroel is now known as dviroel|rover | 11:10 | |
* frickler hopes that this order will work, otherwise will merge them | 11:11 | |
frickler | of course murphy strikes again | 11:21 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Fix project-config testing https://review.opendev.org/c/openstack/project-config/+/814401 | 11:23 |
*** jpena is now known as jpena|lunch | 11:31 | |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Add ttx as OP to #openinfra-events https://review.opendev.org/c/openstack/project-config/+/814381 | 11:36 |
opendevreview | Merged openstack/project-config master: Fix project-config testing https://review.opendev.org/c/openstack/project-config/+/814401 | 11:48 |
opendevreview | Merged openstack/project-config master: Add ttx as OP to #openinfra-events https://review.opendev.org/c/openstack/project-config/+/814381 | 12:08 |
*** jpena|lunch is now known as jpena | 12:23 | |
*** ysandeep|trng is now known as ysandeep | 12:57 | |
opendevreview | Dong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size https://review.opendev.org/c/zuul/zuul-jobs/+/813034 | 13:06 |
*** hjensas is now known as hjensas|afk | 13:19 | |
ttx | infra-root: Not super urgent, but Zuul's release-approval queue (the one used to trigger the PTL-approval test on release changes) seems to be stuck since Friday (478 changes added up) | 13:20 |
fungi | ttx: yep, known since saturday, i'm hoping to get some more eyes on the traceback/exception i found related to the top change there, figured if it was stuck into today that wouldn't be the end of the world so i didn't aggressively ping folks on their weekend | 13:23 |
fungi | brought it up in the zuul matrix channel though for input | 13:23 |
ttx | yeah, it's not super critical, I'm more concerned that it might add up to the point of slowing down other things | 13:25 |
fungi | same here, so hopefully we can decide whether we need to keep it in that state much longer or can try to clear it | 13:25 |
fungi | if it were the check or gate pipeline i'd have just collected as much data as i could and worked on getting it unstuck (probably a scheduler restart) | 13:26 |
Clark[m] | I suspect a dequeue may be sufficient to get this type of thing moving again. However the cache entry for the specific change might be stale/bad and even a restart won't fix that | 13:32 |
*** akahat is now known as akahat|afk | 13:57 | |
clarkb | fungi: also I haven't forgotten that we need to do a gear release and looks like maybe we should consider a bindep release? I can push tags for those after school run if we want to do that | 13:59 |
clarkb | Then I'd also like to land the gerrit 3.3.7 change and the gerrit.config cleanup change with plans to restart gerrit on those today/tomorrow | 14:00 |
fungi | yeah, probably need to add release notes for bindep | 14:00 |
*** gibi_back_15UTC is now known as gibi | 14:06 | |
clarkb | fungi: is there a change up to default gitea_always_update to true? then we can consider if we want to just do that. | 14:11 |
corvus | i can look at the zuul queue shortly | 14:13 |
clarkb | corvus: thanks! | 14:14 |
opendevreview | Clark Boylan proposed opendev/bindep master: Add release note for rocky and manjaro https://review.opendev.org/c/opendev/bindep/+/814431 | 14:20 |
clarkb | fungi: ^ bindep release note | 14:20 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/814048 and https://review.opendev.org/c/opendev/system-config/+/813716 are the two gerrit related changes I mentioned above. I'll approve the second one after the school run since it has the necessary +2's. Landing the first to update to 3.3.7 would be nice too though | 14:23 |
clarkb | oh I guess https://review.opendev.org/c/opendev/system-config/+/813675 is still out there too. Should probably land that before a gerrit restart too | 14:28 |
clarkb | it's hard to know if we are using an ansible group anywhere though. Maybe reviewers have a better way of checking for that than my bad grepping | 14:29 |
*** timburke_ is now known as timburke | 14:39 | |
*** akahat|afk is now known as akahat | 14:48 | |
fungi | clarkb: thanks for the bindep reno, was that all you spotted since the last tag? | 14:59 |
opendevreview | daniel.pawlik proposed opendev/system-config master: DNM Add zuul-log-scrapper role https://review.opendev.org/c/opendev/system-config/+/814391 | 15:00 |
fungi | clarkb: yep, those look like it from `git log 2.9.0..origin/master` | 15:04 |
fungi | i guess we'll want to make it 2.10.0, probably can skip making release candidates given the low impact of the other changes (the most iffy one is where we stop using distutils.version.LooseVersion) | 15:05 |
Clark[m] | ++ | 15:07 |
*** ykarel is now known as ykarel|away | 15:08 | |
clarkb | and for gear I think we're looking at 0.16.0 due to the tls changes, randomized server connection and modified connection timeout process | 15:17 |
clarkb | fungi: ^ if that sounds right to you I can start with the gear release nowish since that doesn't use reno. Then do the bindep 2.10.0 once the release notes land | 15:17 |
clarkb | I've approved the gerrit config cleanup just now as well | 15:18 |
fungi | yeah, pretty sure 0.16.0 is what we talked about previously, and the pin zuul added was for gear<0.16 | 15:22 |
fungi | double-checking now | 15:22 |
fungi | gear>=0.13.0,<0.16.0,!=0.15.0 | 15:22 |
clarkb | fungi: ok I'll load up a gpg key and remember how to push a tag to gerrit :) | 15:23 |
fungi | i'm happy to if we want to wait until after 17:00z | 15:24 |
clarkb | nah I've done it. Good to exercise this memory. Does commit aa21a0c61b1b665714f5b6e55ec202db9ddc22f1 (HEAD -> master, tag: 0.16.0, origin/master, origin/HEAD) look right to you? | 15:27 |
fungi | clarkb: yep, that looks like current origin/master and the new version we discussed | 15:28 |
clarkb | pushed | 15:29 |
fungi | we should probably send a release announcement to service-announce about it as well | 15:29 |
clarkb | hrm I don't know that it ran any jobs | 15:30 |
clarkb | I wonder if that wasn't ported when we moved it to the opendev tenant | 15:30 |
clarkb | no it lists a couple of release jobs | 15:31 |
clarkb | Project opendev/gear not in pipeline <Pipeline release> for change <Tag 0x7f9c346d9d30 opendev/gear creates refs/tags/0.16.0 on fac493c11ec7319a724ed4b29ff2766e1862f643> | 15:36 |
fungi | umm | 15:37 |
clarkb | oh wait it is there I think it redrew the status page and moved the "card" | 15:41 |
clarkb | it's just too early in the morning for me to process that the release queue moved from the right side of the screen to the left side ... | 15:41 |
clarkb | ya it is on pypi now and it is building its docker image currently. Ok all is well. I'll send an email once the docker image is done getting processed | 15:41 |
clarkb | and the message I pasted above must be from the openstack release pipeline not the opendev release pipeline | 15:42 |
clarkb | and bindep is waiting for one of those fedora images that refuse to boot in most clouds :/ | 15:42 |
fungi | if you want to crib a release announcement, http://lists.opendev.org/pipermail/service-announce/2021-April/000018.html was my most recent one for git-review | 15:46 |
*** marios is now known as marios|out | 15:48 | |
clarkb | thanks email sent | 15:55 |
*** frenzy_friday is now known as frenzyfriday|pto | 15:55 | |
fungi | thanks for tackling that! | 15:56 |
fungi | i've been sucked into ptg sessions since 13:00 and still have another hour to go | 15:56 |
opendevreview | Clark Boylan proposed opendev/system-config master: Always update gitea repo meta data https://review.opendev.org/c/opendev/system-config/+/814443 | 15:58 |
clarkb | there's a change to discuss simply updating all the projects all the time. Testing should double check the cost in runtime for us too | 15:59 |
*** ysandeep is now known as ysandeep|dinner | 16:06 | |
reed | FYI https://github.com/MetaMask/eth-phishing-detect/issues/5643 | 16:08 |
clarkb | reed: their interactive checker says we aren't blocked either | 16:16 |
clarkb | I wonder if there is some bug on their end | 16:16 |
reed | could be anything 🙂 I don't understand how half of this stuff works | 16:17 |
opendevreview | Merged opendev/system-config master: Clean up our gerrit config https://review.opendev.org/c/opendev/system-config/+/813716 | 16:18 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/814048 since it is largely mechanical (it has fungi's +2) | 16:32 |
*** jpena is now known as jpena|off | 16:34 | |
clarkb | Looks like the mina update to 2.7.0 isn't a drop in update with gerrit | 16:55 |
fungi | :/ | 17:08 |
*** ysandeep|dinner is now known as ysandeep | 17:14 | |
corvus | clarkb: fungi: ttx: i've identified the zuul bug with the release-approval queue. i think a dequeue command should get things moving again, so i'll go ahead and issue that now. | 17:42 |
*** ysandeep is now known as ysandeep|out | 17:42 | |
corvus | (details on the bug in #zuul) | 17:42 |
opendevreview | Merged opendev/system-config master: Build Gerrit 3.3.7 images https://review.opendev.org/c/opendev/system-config/+/814048 | 17:43 |
clarkb | corvus: thanks | 17:44 |
fungi | appreciated! i'll work on the dequeue now | 17:47 |
corvus | fungi: i'm on it | 17:47 |
clarkb | this mina thing is fine. I've fixed basically all the issues except for a place where we have to define a new abstract method and ... an slf4j logger import that doesn't work because I don't understand bazel | 17:48 |
clarkb | s/fine/fun/ | 17:48 |
fungi | oh, thanks corvus! | 17:48 |
fungi | sorry, i missed where you said "i'll go ahead and issue that now" | 17:49 |
corvus | clarkb: naturally i just assumed you were referencing the "This is fine." meme | 17:49 |
fungi | mina [breaks ssh for everyone]: "this is fine" | 17:49 |
clarkb | what is really curious about the logger import is that the code path that hits this doesn't appear to have changed between mina 2.4.0 and 2.7.0 | 17:50 |
clarkb | aha I think maybe I need to update the version of slf4j too | 17:52 |
corvus | the number of items in release-approval is decreasing. it may take a while to zero out. | 17:52 |
clarkb | ok it wasn't the version, but I've updated that anyway. Turns out you have to both depend on a thing and ensure its visibility allows it to be seen from where you declare the dependency | 18:00 |
fungi | clarkb: you're trying to port mina-sshd's negotiation implementation to the embedded copy in gerrit? | 18:01 |
fungi | is that one slightly forked then, not just as simple as pulling in a new version i guess... | 18:02 |
clarkb | fungi: no I'm trying to make gerrit 3.3 build against mina 2.7.0 expecting that the effort to build against 2.7.1 when it becomes available will be minimal | 18:03 |
fungi | ahh, okay | 18:03 |
opendevreview | Merged opendev/bindep master: Add release note for rocky and manjaro https://review.opendev.org/c/opendev/bindep/+/814431 | 18:10 |
corvus | release-approval is at 0 now | 18:12 |
opendevreview | Clark Boylan proposed opendev/system-config master: Push a patch to test MINA 2.7 with Gerrit https://review.opendev.org/c/opendev/system-config/+/814230 | 18:26 |
clarkb | That patch builds locally for me. I've put a hold on it so that we can interact with it after zuul does its thing and see if ssh is generally working | 18:28 |
clarkb | fungi: fwiw I think backporting the kex handler stuff to our use of mina on top of mina 2.4 is possible. But I suspect that the best thing here is simply to get to an up to date version instead | 18:29 |
clarkb | infra-root I'm thinking that https://review.opendev.org/c/opendev/system-config/+/813675 is probably the riskiest change of the bunch that I've written (just because it is hard to tell if the gerrit group is used somewhere unexpected) | 18:40 |
clarkb | For that reason I'm thinking maybe I'll restart gerrit today with the 3.3.7 update and the config cleanup to make sure that is all happy. Then we can do the gerrit group cleanup and another restart in the future (to help isolate any potential fallout) | 18:40 |
clarkb | If that seems reasonable I'll plan to do the restart after lunch today | 18:40 |
clarkb | fungi: for bindep does tagging 36e28c76fa1d9370e967d08f4edf18a023c2aff7 as 2.10.0 look good to you? Do you want to make that release or should I? | 18:42 |
fungi | clarkb: that matches my origin/master ref and the version we discussed, feel free to tag and push, or i can get around to it in a bit | 18:54 |
clarkb | I can do it | 18:54 |
fungi | thanks! | 18:55 |
clarkb | pushed | 18:56 |
fungi | awesome | 18:58 |
clarkb | https://pypi.org/project/bindep/ | 19:04 |
clarkb | Should I send a service-announce email for this one too? I guess so | 19:05 |
fungi | i have in the past, yeah | 19:05 |
clarkb | sent | 19:10 |
fungi | perfect, thanks again! | 19:12 |
clarkb | we've got ~59 nodepool nodes that are locked and in-use for 4 and a half days | 19:13 |
clarkb | one is in a locked but used state and the last one is a held node | 19:14 |
clarkb | one of the jobs associated with these nodes is still trying to get finished on ze02 (99c5b5eff43c404f8e2d11221944cd65 is the job uuid) | 19:20 |
clarkb | corvus: ^ fyi that zuul seems to actually be holding those locks and failing to process the job finish | 19:20 |
fungi | clarkb: yeah, i looked at that over the weekend too, seems to coincide with the same zk disconnect late last week which led to the scheduler getting restarted | 19:21 |
fungi | they're basically all from ~3 hours before the scheduler restart | 19:21 |
clarkb | ya I guess we didn't restart executors too which would've killed the ephemeral znodes | 19:22 |
fungi | refrained from trying to manually clean them up since we weren't pressed for quota and thought they might be useful for identifying the problem | 19:22 |
clarkb | but zuul should handle this regardless unless we updated the executor in the process and it was no longer compatible with the executors? | 19:22 |
clarkb | the queue indicated by the "Finishing job" log entries on ze02 is growing it seems | 19:23 |
clarkb | Node 0026928358 in nodepool was assigned to build 99c5b5eff43c404f8e2d11221944cd65 which ran on ze02 and has yet to successfully finish and unlock since ~Friday? | 19:24 |
fungi | i wonder if those were all for builds which finished in the problem timeframe prior to the scheduler restart | 19:25 |
frickler | that sounds plausible to me. we also still need to clean up the held nodes that zuul lost track of during the earlier upgrade, I think? | 19:29 |
clarkb | frickler: ya if any of those are still held on the nodepool side but not recorded on the zuul side we should clean them up in nodepool when we are done with them | 19:36 |
clarkb | looking at the code I think we only try the once to call finishJob | 19:53 |
clarkb | and if that doesn't succeed then the job worker remains present forever? | 19:53 |
clarkb | and grepping for Finishing Job: 99c5b5eff43c404f8e2d11221944cd65 returns no results on ze02 implying we never called that method? | 19:55 |
clarkb | The last thing we seem to have done is pause the job | 19:57 |
clarkb | `zgrep 99c5b5eff43c404f8e2d11221944cd65 /var/log/zuul/executor-debug.log.4.gz | grep -v 'Finishing Job' | less` on ze02 shows this. I'll double check the other files really quickly too | 19:58 |
clarkb | ya pausing is the last thing logged if we excluding the Finishing Job logging | 19:59 |
clarkb | I think the next step is to do a thread dump to see if the thread is still running (I don't think it will be) or if we've just got a reference to the build in job_workers because the thread died before calling finishJob | 20:01 |
clarkb | but I need to eat lunch now. Back in a bit | 20:01 |
clarkb | also note that a graceful stop won't work on these executors because those jobs will never leave the job_workers dict, and an empty dict is what graceful stop waits for | 20:02 |
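For context: the thread dump clarkb is about to take comes from Zuul's SIGUSR2 handler, which logs a stack trace for every live thread and toggles yappi profiling. The sketch below shows the general stdlib pattern for such a handler; it is an assumption about its shape, not Zuul's actual code:

```python
import signal
import sys
import threading
import traceback

def dump_threads(signum, frame):
    # Map thread ids back to names so the dump is readable.
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, stack in sys._current_frames().items():
        print(f"Thread: {ident} {names.get(ident, '?')}")
        print("".join(traceback.format_stack(stack)))

# After this, `kill -USR2 <pid>` prints every thread's stack, e.g. a build
# thread blocked in self._resume_event.wait() as seen later in the log.
signal.signal(signal.SIGUSR2, dump_threads)
```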
*** dviroel|rover is now known as dviroel|rover|afk | 21:01 | |
clarkb | https://paste.opendev.org/show/b1fOfN46k9aW0DsKYkz6/ I don't think that is related to the issue we have here but potentially interesting | 21:08 |
clarkb | do we know approximately when the zk issue occurred? I'm not seeing it in the ze02 logs yet | 21:12 |
clarkb | corvus: ^ fyi this is a fun one too. | 21:12 |
clarkb | That exception in the paste is the only one that I think could be related. The others are git updates that failed and log streamers closing connections | 21:15 |
corvus | does casting to a list help avoid that? | 21:16 |
corvus | clarkb: i don't think that would cause a significant error; i think that's the loop that gets the next job, so failing there means "just start over and try again" | 21:16 |
clarkb | corvus: ya I don't think it's related to the issue we're seeing with execute() never calling finishJob(), but I can't see anything else that might cause that | 21:17 |
clarkb | corvus: I think you are meant to make a copy and iterate over that rather than iterate the live data to avoid that error | 21:17 |
clarkb | I'm almost ready to do the sigusr2 on ze02 and see if the threads have completely gone away or if they are still present | 21:17 |
corvus | clarkb: you mean copy the dict first? | 21:18 |
clarkb | corvus: or the list of values/keys/items that is being iterated over (I haven't looked at the exact code yet as i'm still trying to make sense of the locked nodes belonging to unfinished jobs issue) | 21:20 |
corvus | clarkb: yeah, just wondering if you do list(dict.values()) if the list call is atomic enough to avoid that issue or whether it is subject to the same thing... | 21:23 |
corvus | i guess it probably holds the GIL for the duration of list(), so it's effectively mutexed... | 21:23 |
corvus | so i think it's probably sufficient. :) | 21:23 |
clarkb | ya I think that is the case | 21:24 |
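For context: the paste appears to show the RuntimeError Python raises when a dict is resized while another thread is iterating over it. A minimal illustration of the racy loop and the `list()` snapshot discussed above; the `job_workers` name mirrors the executor attribute mentioned later, but this is not the actual Zuul code:

```python
import threading

# Shared state touched by multiple threads, e.g. a map of running build workers.
job_workers = {}

def racy_scan():
    # If another thread adds or removes an entry mid-loop, this raises
    # "RuntimeError: dictionary changed size during iteration".
    for worker in job_workers.values():
        pass  # inspect the worker

def snapshot_scan():
    # list() copies the view while holding the GIL, so the loop below
    # iterates a private snapshot that other threads cannot resize.
    for worker in list(job_workers.values()):
        pass  # inspect the worker
```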
clarkb | Thread: 139991145465600 build-99c5b5eff43c404f8e2d11221944cd65 d: False <- that thread still exists | 21:24 |
clarkb | File "/usr/local/lib/python3.8/site-packages/zuul/executor/server.py", line 999, in pause\n self._resume_event.wait() | 21:24 |
clarkb | so it's waiting for the unpause condition to occur but that isn't going to happen? | 21:25 |
clarkb | that gets triggered by a JobRequestEvent.RESUMED event | 21:27 |
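For context: the quoted frame shows the build thread parked in `pause()` on a `threading.Event` that only a resume (or stop) request will set. A stripped-down sketch of that pattern, not the real AnsibleJob class:

```python
import threading

class PausableJob:
    def __init__(self):
        self._resume_event = threading.Event()

    def pause(self):
        # Blocks the job thread indefinitely; if no resume/stop request ever
        # arrives, the thread sits here forever, which is what the stack dump
        # above shows.
        self._resume_event.wait()

    def resume(self):
        # Called from the event-handling side when a RESUMED (or delete)
        # request is seen; wakes pause() back up.
        self._resume_event.set()
```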
corvus | clarkb: did it log this? "Received %s event for build %s" | 21:28 |
corvus | looking for that to get a deleted event | 21:28 |
corvus | clarkb: because the build request delete should cause that, which should cause the jobworker stop() method to be called which should resume the job then stop it | 21:29 |
corvus | clarkb: are we talking about the zk issue where we lost contact with the server, and it appeared that we also lost all the watches? | 21:30 |
corvus | because that depends on watches too. | 21:30 |
clarkb | corvus: yes the incident that frickler restarted the scheduler for. And ya I'm seeing that we do a child watch for delete and resume znodes | 21:30 |
corvus | in which case, we may actually have more debugging information than i realized, though we don't have a repl on the executor to help with it | 21:31 |
corvus | but i would like to examine zk and see what sessions it thinks are active | 21:31 |
corvus | (because so far we have no idea how it's possible to resume a zk session and lose the watches; that's not supposed to happen and i can't reproduce it locally) | 21:32 |
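For context: the "child watch" is ZooKeeper's change-notification mechanism; with kazoo, the client library Zuul uses, a watch on a build request's children looks roughly like the sketch below. The host and znode path are illustrative, and if a session somehow loses its watches the callback simply never fires again, which is the behaviour being discussed:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk01.opendev.org:2181")  # illustrative host
zk.start()

# Illustrative path for the stuck build request.
build_path = "/zuul/build-requests/99c5b5eff43c404f8e2d11221944cd65"

def on_children(children):
    # Called once immediately and again whenever the child list changes,
    # e.g. when a "resume" or "delete" request znode appears underneath.
    for child in children:
        print("request child:", child)

zk.ChildrenWatch(build_path, on_children)
```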
clarkb | corvus: ze02 is the one where 99c5b5eff43c404f8e2d11221944cd65 is still running in a paused state if you need specifics to focus on | 21:32 |
corvus | thx | 21:32 |
clarkb | I hit sigusr2 twice so yappi isn't running anymore but there may be some info in the logs from when it collected brief data? I'll check for that log message | 21:32 |
clarkb | it almost seems like the scheduler restart caused it to forget about all these builds even though they should still be in zk? | 21:33 |
clarkb | and that caused us to not send the resume event | 21:33 |
corvus | clarkb: occam's razor says to me that if the scheduler was stuck because it lost the watches on its zk session, then it's very likely that the executor was in the same situation. | 21:34 |
corvus | i mean, there could be something else going on, but that seems like a perfectly satisfactory explanation that i think we'd need to falsify first | 21:34 |
clarkb | `zgrep 'Received .* event for build' /var/log/zuul/executor-debug.log.4.gz | grep 99c5b5eff43c404f8e2d11221944cd65` returns no results | 21:34 |
clarkb | I guess the next step is to look at the zk state and see if there is a resume or delete event sitting in the child znode listing that a watch would've seen? | 21:35 |
corvus | thx. that's consistent with the lost watches hypothesis | 21:35 |
corvus | clarkb: the delete event should literally be the znode deleted; so yeah, we can check for the 99c5b5eff43c404f8e2d11221944cd65 build request and it should not be there | 21:36 |
clarkb | corvus: will you do that or should I? | 21:36 |
corvus | i will, i have a zk shell already | 21:36 |
corvus | get 99c5b5eff43c404f8e2d11221944cd65 | 21:37 |
corvus | {"uuid": "99c5b5eff43c404f8e2d11221944cd65", "state": "paused", "precedence": 300, "resultpath": null, "zone": null, "buildsetuuid": "e2a6a56c35d94d78bcb378deb11d9696", "jobname": "tripleo-ci-centos-8-content-provider-ussuri", "tenantname": "openstack", "pipelinename": "periodic", "eventid": "724e0983166c4458a789c520989e86a8", "workerinfo": {"hostname": "ze02.opendev.org", "log_port": 7900}} | 21:37 |
corvus | well, that's that theory falsified :) | 21:37 |
corvus | the scheduler should have deleted that build request after restarting | 21:38 |
clarkb | is there a resume child znode that we might have missed? | 21:38 |
corvus | no children | 21:38 |
clarkb | interesting so as far as zuul is concerned the job has been running for 4 days? :) | 21:38 |
corvus | well, is it in the pipeline? | 21:39 |
corvus | (and it has been running for 4 days, right?) | 21:39 |
clarkb | no its not in the status.json rendered pipeline at least | 21:39 |
clarkb | but the job_worker thread is present and the job's last activity was to pause | 21:40 |
clarkb | I guess the scheduler forgot about it though | 21:40 |
clarkb | which is weird because the build request is there showing a paused state | 21:40 |
corvus | was it before the restart? | 21:40 |
clarkb | the pause? I'm not sure, since I don't know exactly when the restart happened. | 21:40 |
clarkb | 2021-10-14 07:54:49,630 INFO zuul.AnsibleJob: [e: 724e0983166c4458a789c520989e86a8] [build: 99c5b5eff43c404f8e2d11221944cd65] Pausing job tripleo-ci-centos-8-content-provider-ussuri for ref refs/heads/master (change https://opendev.org/openstack/tripleo-ci/commit/None) | 21:41 |
clarkb | that is when the pause for the job occurred | 21:41 |
corvus | okay, i thought for some reason we were debugging fallout from the scheduler restart | 21:41 |
clarkb | corvus: well fungi mentioned he thought it was related, but I've not yet tracked down when exactly the restart happened | 21:42 |
corvus | 2021-10-14 10:03:50 UTC zuul was stuck processing jobs and has been restarted. pending jobs will be re-enqueued | 21:42 |
corvus | maybe that? | 21:42 |
clarkb | 10:01:02* frickler | #status notice zuul was stuck processing jobs and has been restarted. pending jobs will be re-enqueued | 21:43 |
clarkb | ya so the job was paused before the restart, but those tripleo jobs that run after the pause can run for a significant amount of time so it is likely the job was still paused when that happened | 21:43 |
corvus | then the question is why didn't the scheduler delete the build request | 21:44 |
corvus | hrm. we may simply not do that. | 21:46 |
corvus | we cleanup requests from executors that crash while the scheduler is up, but i think we may not do the other way around. | 21:47 |
corvus | so, short-term fix: if you restart the scheduler, do the whole system. | 21:48 |
corvus | long-term fix: pipeline state in zk | 21:48 |
corvus | given that this started being an issue like 3 months ago, and we're 90% of the way to it not being an issue, i think i'd lean toward not doing a medium-term fix and just sticking with the short-term fix for now. we could send an announcement to that effect... | 21:49 |
corvus | i relayed that to #zuul since that's a zuul-project discussion | 21:52 |
clarkb | ya I think if we communicate "restart executors and mergers when restarting the scheduler" that is probably reasonable for now | 21:52 |
corvus | infra-root: ^ for the time being, if you restart the zuul scheduler, go ahead and restart the mergers and executors too (the zuul_restart playbook), because of that bug in zuul | 21:57 |
clarkb | corvus: I'm thinking maybe we wait until your fix for the other issue is ready then restart everything at that point? | 21:57 |
corvus | sgtm | 21:57 |
corvus | s/ready/merged -- i'm pretty sure the fix is ready (i ran tests locally) | 21:57 |
clarkb | ya I'm starting my review of it now | 21:58 |
fungi | oof, that's a lot of scrollback for just wandering off for dinner | 22:05 |
fungi | okay, luckily i already read the tl;dr in zuul matrix, so i think skimming it was good enough ;) | 22:07 |
clarkb | should we hold off on a gerrit restart to do that when we restart zuul? | 22:13 |
clarkb | (I want to restart gerrit for the 3.3.7 image update and to ensure the config file cleanup doesn't cause problems) | 22:13 |
clarkb | neat my MINA 2.7.0 held node seems to be working with stuff like ls-projects and show-queue | 22:19 |
fungi | i can help with a combined zuul+gerrit restart when we're ready for that | 22:21 |
fungi | i guess the idea is to restart all of zuul on 814493? | 22:22 |
clarkb | ya | 22:24 |
fungi | not sure if i can legitimately claim to be reviewing that, but i'm staring at it really, really hard | 22:25 |
fungi | based on the earlier discussions in matrix, i think i follow it | 22:26 |
clarkb | https://172.99.67.89/c/x/test-project/+/21 I was able to push that via local ssh as well. I think that generally means newer mina is working. The other big mina thing is replication though | 22:29 |
clarkb | testing that is a bit more of an involved process so probably won't get into that until a 2.7.1 shows up and we can argue for doing the update | 22:29 |
clarkb | fungi: are you able to review 814493? I'll ping tristanC about it too since he expressed interest | 22:30 |
fungi | clarkb: i just did, and didn't approve specifically because tristan had wanted to review it | 22:31 |
clarkb | thanks | 22:32 |
fungi | but yeah, i think i followed the solution and it looked okay | 22:33 |
clarkb | Finally remembered to approve the prometheus spec. I got distracted last week by the rename improvement stuff | 22:45 |
fungi | good call | 22:46 |
clarkb | fungi: frickler: what was the story with the zuul complaints about old venus secrets in zk? | 22:48 |
clarkb | The rename process should've renamed them to openstack/venus paths in zk then deleted the venus/venus paths | 22:48 |
fungi | the backup seemed to want to back up the old paths | 22:50 |
fungi | and then complained when it couldn't find them in zk | 22:50 |
clarkb | that's interesting since I thought it only operated off of what it saw in zk | 22:51 |
clarkb | corvus: ^ fyi | 22:51 |
clarkb | fungi: was it root email where it complained? | 22:51 |
fungi | that's a great question, i hadn't had time to look, just repeating what frickler said | 22:51 |
clarkb | ya seems to have occurred 23 hours ago and daily before that after the renames | 22:52 |
fungi | yep, root cronspam | 22:52 |
fungi | just found a copy in my cronspam inbox | 22:52 |
clarkb | ok adding to the meeting agenda. I think the export is supposed to be greedy and export as much as it can so this likely isn't a fatal issue. But something we should clean up in our rename process or the zuul export-keys command | 22:52 |
fungi | ERROR:zuul.KeyStorage:Unable to load keys at /keystorage/gerrit/osf/osf%2Fopenstackid | 22:53 |
fungi | and so on | 22:53 |
opendevreview | Merged opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement https://review.opendev.org/c/opendev/infra-specs/+/804122 | 22:53 |
fungi | maybe we don't correctly remove the parent znode? | 22:54 |
clarkb | fungi: oh that could be | 22:54 |
fungi | and so the tool is finding an empty one there | 22:54 |
fungi | because, yeah, i thought it was effectively stateless | 22:54 |
clarkb | fungi: ya looking directly at the db I think this is exactly the issue | 23:02 |
clarkb | and ya checking the backups I think we're still dumping everything else | 23:03 |
clarkb | I'm working on a fix | 23:14 |
clarkb | remote: https://review.opendev.org/c/zuul/zuul/+/814504 Cleanup empty secrets dirs when deleting secrets | 23:35 |
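For context: the patch targets the failure mode discussed above, where deleting a renamed project's keys leaves an empty parent znode (e.g. /keystorage/gerrit/osf) that later key exports trip over. A rough sketch of the cleanup idea using kazoo, with an illustrative host and simplified layout, not the actual change:

```python
from urllib.parse import quote_plus

from kazoo.client import KazooClient
from kazoo.exceptions import NoNodeError, NotEmptyError

zk = KazooClient(hosts="zk01.opendev.org:2181")  # illustrative host
zk.start()

def delete_project_keys(connection, project):
    """Delete a project's key znodes and prune the now-empty parent."""
    org = project.split("/", 1)[0]
    # Layout mirrors the error above: /keystorage/gerrit/osf/osf%2Fopenstackid
    path = f"/keystorage/{connection}/{org}/{quote_plus(project)}"
    zk.delete(path, recursive=True)

    # Drop the org-level znode too if this was its last project, so the
    # periodic export no longer finds an empty directory with no keys in it.
    try:
        zk.delete(f"/keystorage/{connection}/{org}")
    except (NotEmptyError, NoNodeError):
        pass

delete_project_keys("gerrit", "osf/openstackid")
```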
clarkb | I'm going to need to do the zuul restart and gerrit restart tomorrow if I'm helping. Have dinner to help with now. Happy for others to do it this evening if they have time, but I don't think we are in a rush; tomorrow should be fine | 23:57 |