frickler | I think I missed something, why does zuul need a downtime? | 05:53
*** ralonsoh_away is now known as ralonsoh | 07:09
fungi | frickler: the default branches are cached in zk. they'll get refreshed in time if configs for those repos are updated, but to clear the cache in zk and force it sooner we'd need both schedulers offline first | 11:56 |
frickler | fungi: that answers one question and raises the next one: what is changing about default branches? and sorry if I missed that, sometimes I skip things when there is too much backlog in the morning | 12:03 |
fungi | frickler: the patch in zuul for the bug you pointed out with refs/heads getting prefixed | 12:43 |
fungi | frickler: clearing the cache so that it gets repopulated with the https://review.opendev.org/893925 fix in place | 12:44 |
frickler | ah, that error is cached, o.k. | 12:45 |
fungi | yes, the cache will correct itself over time, but for projects that don't get frequent updates it could take a while | 12:48 |
frickler | ack, then the restart does make sense I guess | 13:03 |
TheJulia | hey guys, you can clear out the autohold I have. It shed some light on the issue, but I'm still sort of chasing it, just deferring for the moment. | 15:39
fungi | TheJulia: thanks for letting us know, i've cleaned it up now. happy hunting, and let us know if you need more help | 15:41 |
opendevreview | Bernhard Berg proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced projects https://review.opendev.org/c/zuul/zuul-jobs/+/887917 | 16:08
clarkb | looks like min-ready: 0 for fedora and our ready-node timeout have resulted in no fedora nodes in nodepool | 16:46
clarkb | I think that puts us in a good spot for Monday to merge the removal changes | 16:47 |
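(For context on clarkb's remark: nodepool's per-label `min-ready` setting controls how many pre-booted "ready" nodes are kept waiting for a label. A minimal sketch of that configuration and a quick check of it, assuming nodepool's documented label schema; the label name below is a hypothetical example, not necessarily the real OpenDev label.)

```python
#!/usr/bin/env python3
"""Illustrative only: with min-ready set to 0, nodepool keeps no pre-booted
("ready") nodes for a label, so once the ready-node timeout reaps the
existing ones, none remain. The YAML fragment follows nodepool's documented
label schema; the label name is a hypothetical example."""
import yaml  # PyYAML

config = yaml.safe_load("""
labels:
  - name: fedora-latest
    min-ready: 0
""")

for label in config["labels"]:
    print(label["name"], "min-ready:", label.get("min-ready", 0))
```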
fungi | agreed | 17:02 |
fungi | mm3 migration notifications have been sent to the airship-discuss and kata-dev mailing lists | 17:51 |
clarkb | fungi: I think both are small enough that we don't have to worry about copying significant amounts of data right? It's only openstack that will pose a problem for that | 17:53
fungi | correct. if you look at the todo list at the bottom of https://etherpad.opendev.org/p/mm3migration i've approximated a 4-hour migration window for openstack (the migration script itself takes around 2.5 hours to complete for the site in my test runs) | 17:54 |
fungi | i'll still make a warm rsync copy immediately prior to the window so that we spend as little time as possible copying data during the outage | 17:55 |
fungi | for all of them | 17:55 |
clarkb | ++ | 17:56 |
fungi | because it's just running the same command a couple of times before the maintenance that i'll also run during it, so it's not any extra work and can shave minutes off the maintenance window | 17:56
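(A rough sketch of the warm-copy approach fungi describes: run the same rsync ahead of the maintenance window, then once more during the outage so only the delta is transferred. The hosts and paths below are hypothetical placeholders, not the actual migration commands.)

```python
#!/usr/bin/env python3
"""Sketch only: the same rsync invocation before and during the outage, so
the final pass copies just the changes. Source and destination are
hypothetical placeholders."""
import subprocess

SRC = "old-lists.example.org:/var/lib/mailman/"  # hypothetical source host/path
DST = "/var/lib/mailman/"                        # hypothetical destination path

def sync():
    # --archive preserves ownership/permissions/timestamps; --delete keeps
    # the destination an exact mirror of the source
    subprocess.run(["rsync", "--archive", "--delete", SRC, DST], check=True)

sync()  # warm copy ahead of the window (data may still change afterwards)
sync()  # repeated during the outage, now transferring only the delta
```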
opendevreview | James E. Blair proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced projects https://review.opendev.org/c/zuul/zuul-jobs/+/887917 | 18:16
frickler | corvus: the horizon stable/stein job was deleted on August 16, but periodic jobs are still running and there still is a stable/stein tab on https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/horizon | 18:50 |
frickler | other branches detected at that date do not seem affected https://paste.opendev.org/show/bKk2BIRN1I4sNFpFlA4N/ | 18:51 |
frickler | *deleted | 18:51 |
fungi | branch was deleted | 18:52 |
frickler | periodic-stable pipeline to be exact, two jobs still listed in the "View Job Graph" https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/horizon?branch=stable%2Fstein&pipeline=periodic-stable | 18:53 |
frickler | I guess if you do the zuul maint on saturday, it will do a full-reconfigure and clean this up anyway? | 18:54 |
frickler | so maybe we can either ignore now and see if it goes away then, or use the time to possibly do some debugging | 18:54 |
frickler | ah, branch deleted, not job, thx fungi, I should delete myself for today, too, I guess ;) | 18:56 |
corvus | frickler: horizon issue due to gerrit disconnect at time of event: https://paste.opendev.org/show/bHIpyhR39QxPz9KZG087/ (possibly gerrit had a bunch of work going on at the time and wasn't very responsive?) | 19:31 |
corvus | that will be corrected on the next branch change, or we could force it ahead of time with a full-reconfigure, or it sounds like it's probably just fine to let it be corrected during the restart | 19:32 |
fungi | aha, yeah i wondered if the event had simply gone missing | 19:40 |
fungi | the new event bus work in gerrit ought to solve this sort of case longer term | 19:41 |
corvus | in this case the event was processed, it's just gerrit chose shortly after that moment to go out to lunch which interrupted processing. we discard events in that case since we're already halfway through processing. so i don't think the pubsub stuff would have changed that. | 19:47 |
corvus | (arguably we could push the event back on the stack, but it's not a simple decision -- there could be negative ramifications from that) | 19:48 |
fungi | aha | 19:49 |
frickler | corvus: thx for digging in the logs. I think waiting for the restart is fine in this case | 19:51 |
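(For reference, the forced option corvus mentions is Zuul's documented full-reconfigure command, which rebuilds each tenant's layout from scratch. A minimal sketch, assuming the standard zuul-scheduler entry point is directly invocable on the scheduler host; OpenDev runs its schedulers in containers, so the real invocation would be wrapped accordingly.)

```python
#!/usr/bin/env python3
"""Minimal sketch: ask a running Zuul scheduler to perform a full
reconfiguration, which re-reads configuration for all tenants and would also
drop the stale stable/stein branch data discussed above. Assumes
zuul-scheduler is on PATH on the scheduler host."""
import subprocess

subprocess.run(["zuul-scheduler", "full-reconfigure"], check=True)
```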
frickler | elodilles: do you run those branch deletions directly next to each other? I wonder if some sleep in between might be helpful | 19:52 |
frickler | also if you could add timestamps to your log that might help possible debugging in the future | 19:55 |
fungi | i suppose a barrage of branch deletions could have fired events that caused pipelines in zuul to be triggered, leading to mergers fetching refs from gerrit and unintentionally knocking it offline briefly? | 19:57 |
clarkb | though we have a limited number of mergers which should mitigate that but maybe not limited enough | 19:58 |
corvus | should actually mostly be the schedulers doing this op (unusually) | 19:59 |
fungi | just speculating. the cause could have been just about anything, and was just as likely unrelated to anything going on for branch deletion | 19:59 |
corvus | i think the branch lookup is currently a git op; we may be able to save some cpu cycles by making it a gerrit api call. honestly haven't benchmarked them to figure out which is faster | 19:59 |
corvus | (that code predates gerrit having an http api :) | 20:00 |
corvus | one of those lines may make it all the way back to zuul v0 | 20:00 |
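(The two lookup styles corvus contrasts, sketched side by side: a git operation against the remote versus Gerrit's REST branches endpoint. The endpoint shape is from Gerrit's public REST API documentation; this is not Zuul's actual implementation, just an illustration for comparison.)

```python
#!/usr/bin/env python3
"""Sketch: two ways to list a project's branches. The REST endpoint
(GET /projects/{project}/branches/) is documented Gerrit API; neither
function is Zuul's real code."""
import json
import subprocess
import urllib.parse
import urllib.request

PROJECT = "openstack/horizon"

def branches_via_git():
    # git-based lookup: ask the remote for everything under refs/heads/
    out = subprocess.run(
        ["git", "ls-remote", "--heads", f"https://opendev.org/{PROJECT}"],
        capture_output=True, text=True, check=True).stdout
    return [line.split("refs/heads/", 1)[1] for line in out.splitlines()]

def branches_via_rest():
    # Gerrit REST lookup; responses are prefixed with )]}' to prevent XSSI,
    # so strip the first line before parsing the JSON
    url = ("https://review.opendev.org/projects/"
           + urllib.parse.quote(PROJECT, safe="")
           + "/branches/")
    body = urllib.request.urlopen(url).read().decode()
    data = json.loads(body.split("\n", 1)[1])
    return [b["ref"][len("refs/heads/"):] for b in data
            if b["ref"].startswith("refs/heads/")]

print(sorted(branches_via_git()) == sorted(branches_via_rest()))
```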
fungi | also i wonder if the gerrit driver could special-case ref-updated changes resulting from branch deletion. if memory serves, the newref in them is 0x0 so should be pretty identifiable | 20:00 |
fungi | at least in our case, i don't think we currently have any reason to want to enqueue those into pipelines (and they've resulted in some confusion in the past) | 20:01 |
corvus | oh it's definitely special cased, it knows it's a branch deletion. the special case is: branches changed, see what they are now. that way it's self-healing. | 20:01 |
fungi | oh, got it | 20:02 |
fungi | i know at one point we were seeing something like zuul enqueuing git ref 0x0 into the post pipeline and then running builds which ultimately failed, but maybe that hasn't been for a while now | 20:03 |
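(What fungi's "0x0" shorthand refers to: in Gerrit's stream-events output, a branch deletion arrives as a ref-updated event whose newRev is the all-zero SHA. The sample event below is constructed for illustration rather than captured from a log, and the detection helper is hypothetical; as corvus notes, Zuul's actual special case is to re-query the project's branch list.)

```python
#!/usr/bin/env python3
"""Illustration: recognizing a branch-deletion ref-updated event by its
all-zero newRev. The sample event is made up for the example; field names
follow Gerrit's stream-events format."""
import json

ZERO_SHA = "0" * 40

sample_event = json.loads("""
{
  "type": "ref-updated",
  "refUpdate": {
    "project": "openstack/horizon",
    "refName": "stable/stein",
    "oldRev": "1111111111111111111111111111111111111111",
    "newRev": "0000000000000000000000000000000000000000"
  }
}
""")

def looks_like_branch_deletion(event):
    ref = event.get("refUpdate", {})
    return event.get("type") == "ref-updated" and ref.get("newRev") == ZERO_SHA

print(looks_like_branch_deletion(sample_event))  # True
```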
frickler | ah, that's why any branch operation would fix it. so creating the 2023.2 branch would also solve the issue | 20:17 |