frickler | I think I missed something, why does zuul need a downtime? | 05:53
*** ralonsoh_away is now known as ralonsoh | 07:09
fungi | frickler: the default branches are cached in zk. they'll get refreshed in time if configs for those repos are updated, but to clear the cache in zk and force it sooner we'd need both schedulers offline first | 11:56 |
frickler | fungi: that answers one question and raises the next one: what is changing about default branches? and sorry if I missed that, sometimes I skip things when there is too much backlog in the morning | 12:03 |
fungi | frickler: the patch in zuul for the bug you pointed out with refs/heads getting prefixed | 12:43 |
fungi | frickler: clearing the cache so that it gets repopulated with the https://review.opendev.org/893925 fix in place | 12:44 |
frickler | ah, that error is cached, o.k. | 12:45 |
fungi | yes, the cache will correct itself over time, but for projects that don't get frequent updates it could take a while | 12:48 |
frickler | ack, then the restart does make sense I guess | 13:03 |
TheJulia | hey guys, you can clear out the autohold I have. It shed some light on the issue, but I'm still sort of chasing it, just deferring for the moment. | 15:39
fungi | TheJulia: thanks for letting us know, i've cleaned it up now. happy hunting, and let us know if you need more help | 15:41 |
opendevreview | Bernhard Berg proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced projects https://review.opendev.org/c/zuul/zuul-jobs/+/887917 | 16:08
clarkb | looks like min-ready: 0 for fedora and our ready-node timeout have resulted in no fedora nodes in nodepool | 16:46
clarkb | I think that puts us in a good spot for Monday to merge the removal changes | 16:47 |
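(For context on clarkb's remark: nodepool's per-label `min-ready` setting controls how many pre-booted "ready" nodes are kept waiting for a label. A minimal sketch of that configuration and a quick check of it, assuming nodepool's documented label schema; the label name below is a hypothetical example, not necessarily the real OpenDev label.)

```python
#!/usr/bin/env python3
"""Illustrative only: with min-ready set to 0, nodepool keeps no pre-booted
("ready") nodes for a label, so once the ready-node timeout reaps the
existing ones, none remain. The YAML fragment follows nodepool's documented
label schema; the label name is a hypothetical example."""
import yaml  # PyYAML

config = yaml.safe_load("""
labels:
  - name: fedora-latest
    min-ready: 0
""")

for label in config["labels"]:
    print(label["name"], "min-ready:", label.get("min-ready", 0))
```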
fungi | agreed | 17:02 |
fungi | mm3 migration notifications have been sent to the airship-discuss and kata-dev mailing lists | 17:51 |
clarkb | fungi: I think both are small enough that we don't have to worry about copying significant amounts of data right? It's only openstack that will pose a problem for that | 17:53
fungi | correct. if you look at the todo list at the bottom of https://etherpad.opendev.org/p/mm3migration i've approximated a 4-hour migration window for openstack (the migration script itself takes around 2.5 hours to complete for the site in my test runs) | 17:54 |
fungi | i'll still make a warm rsync copy immediately prior to the window so that we spend as little time as possible copying data during the outage | 17:55 |
fungi | for all of them | 17:55 |
clarkb | ++ | 17:56 |
fungi | because it's just running the same command a couple of times before the maintenance that i'll also run during it, so it's not any extra work and can shave minutes off the maintenance window | 17:56
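(A rough sketch of the warm-copy approach fungi describes: run the same rsync ahead of the maintenance window, then once more during the outage so only the delta is transferred. The hosts and paths below are hypothetical placeholders, not the actual migration commands.)

```python
#!/usr/bin/env python3
"""Sketch only: the same rsync invocation before and during the outage, so
the final pass copies just the changes. Source and destination are
hypothetical placeholders."""
import subprocess

SRC = "old-lists.example.org:/var/lib/mailman/"  # hypothetical source host/path
DST = "/var/lib/mailman/"                        # hypothetical destination path

def sync():
    # --archive preserves ownership/permissions/timestamps; --delete keeps
    # the destination an exact mirror of the source
    subprocess.run(["rsync", "--archive", "--delete", SRC, DST], check=True)

sync()  # warm copy ahead of the window (data may still change afterwards)
sync()  # repeated during the outage, now transferring only the delta
```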
opendevreview | James E. Blair proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced projects https://review.opendev.org/c/zuul/zuul-jobs/+/887917 | 18:16
frickler | corvus: the horizon stable/stein job was deleted on August 16, but periodic jobs are still running and there still is a stable/stein tab on https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/horizon | 18:50 |
frickler | other branches detected at that date do not seem affected https://paste.opendev.org/show/bKk2BIRN1I4sNFpFlA4N/ | 18:51 |
frickler | *deleted | 18:51 |
fungi | branch was deleted | 18:52 |
frickler | periodic-stable pipeline to be exact, two jobs still listed in the "View Job Graph" https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/horizon?branch=stable%2Fstein&pipeline=periodic-stable | 18:53 |
frickler | I guess if you do the zuul maint on saturday, it will do a full-reconfigure and clean this up anyway? | 18:54 |
frickler | so maybe we can either ignore now and see if it goes away then, or use the time to possibly do some debugging | 18:54 |
frickler | ah, branch deleted, not job, thx fungi, I should delete myself for today, too, I guess ;) | 18:56 |
corvus | frickler: horizon issue due to gerrit disconnect at time of event: https://paste.opendev.org/show/bHIpyhR39QxPz9KZG087/ (possibly gerrit had a bunch of work going on at the time and wasn't very responsive?) | 19:31 |
corvus | that will be corrected on the next branch change, or we could force it ahead of time with a full-reconfigure, or it sounds like it's probably just fine to let it be corrected during the restart | 19:32 |
fungi | aha, yeah i wondered if the event had simply gone missing | 19:40 |
fungi | the new event bus work in gerrit ought to solve this sort of case longer term | 19:41 |
corvus | in this case the event was processed, it's just gerrit chose shortly after that moment to go out to lunch which interrupted processing. we discard events in that case since we're already halfway through processing. so i don't think the pubsub stuff would have changed that. | 19:47 |
corvus | (arguably we could push the event back on the stack, but it's not a simple decision -- there could be negative ramifications from that) | 19:48 |
fungi | aha | 19:49 |
frickler | corvus: thx for digging in the logs. I think waiting for the restart is fine in this case | 19:51 |
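(For reference, the forced option corvus mentions is Zuul's documented full-reconfigure command, which rebuilds each tenant's layout from scratch. A minimal sketch, assuming the standard zuul-scheduler entry point is directly invocable on the scheduler host; OpenDev runs its schedulers in containers, so the real invocation would be wrapped accordingly.)

```python
#!/usr/bin/env python3
"""Minimal sketch: ask a running Zuul scheduler to perform a full
reconfiguration, which re-reads configuration for all tenants and would also
drop the stale stable/stein branch data discussed above. Assumes
zuul-scheduler is on PATH on the scheduler host."""
import subprocess

subprocess.run(["zuul-scheduler", "full-reconfigure"], check=True)
```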
frickler | elodilles: do you run those branch deletions directly next to each other? I wonder if some sleep in between might be helpful | 19:52 |
frickler | also if you could add timestamps to your log that might help possible debugging in the future | 19:55 |
fungi | i suppose a barrage of branch deletions could have fired events that caused pipelines in zuul to be triggered, leading to mergers fetching refs from gerrit and unintentionally knocking it offline briefly? | 19:57 |
clarkb | though we have a limited number of mergers which should mitigate that but maybe not limited enough | 19:58 |
corvus | should actually mostly be the schedulers doing this op (unusually) | 19:59 |
fungi | just speculating. the cause could have been just about anything, and was just as likely unrelated to anything going on for branch deletion | 19:59 |
corvus | i think the branch lookup is currently a git op; we may be able to save some cpu cycles by making it a gerrit api call. honestly haven't benchmarked them to figure out which is faster | 19:59 |
corvus | (that code predates gerrit having an http api :) | 20:00 |
corvus | one of those lines may make it all the way back to zuul v0 | 20:00 |
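(The two lookup styles corvus contrasts, sketched side by side: a git operation against the remote versus Gerrit's REST branches endpoint. The endpoint shape is from Gerrit's public REST API documentation; this is not Zuul's actual implementation, just an illustration for comparison.)

```python
#!/usr/bin/env python3
"""Sketch: two ways to list a project's branches. The REST endpoint
(GET /projects/{project}/branches/) is documented Gerrit API; neither
function is Zuul's real code."""
import json
import subprocess
import urllib.parse
import urllib.request

PROJECT = "openstack/horizon"

def branches_via_git():
    # git-based lookup: ask the remote for everything under refs/heads/
    out = subprocess.run(
        ["git", "ls-remote", "--heads", f"https://opendev.org/{PROJECT}"],
        capture_output=True, text=True, check=True).stdout
    return [line.split("refs/heads/", 1)[1] for line in out.splitlines()]

def branches_via_rest():
    # Gerrit REST lookup; responses are prefixed with )]}' to prevent XSSI,
    # so strip the first line before parsing the JSON
    url = ("https://review.opendev.org/projects/"
           + urllib.parse.quote(PROJECT, safe="")
           + "/branches/")
    body = urllib.request.urlopen(url).read().decode()
    data = json.loads(body.split("\n", 1)[1])
    return [b["ref"][len("refs/heads/"):] for b in data
            if b["ref"].startswith("refs/heads/")]

print(sorted(branches_via_git()) == sorted(branches_via_rest()))
```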
fungi | also i wonder if the gerrit driver could special-case ref-updated changes resulting from branch deletion. if memory serves, the newref in them is 0x0 so should be pretty identifiable | 20:00 |
fungi | at least in our case, i don't think we currently have any reason to want to enqueue those into pipelines (and they've resulted in some confusion in the past) | 20:01 |
corvus | oh it's definitely special cased, it knows it's a branch deletion. the special case is: branches changed, see what they are now. that way it's self-healing. | 20:01 |
fungi | oh, got it | 20:02 |
fungi | i know at one point we were seeing something like zuul enqueuing git ref 0x0 into the post pipeline and then running builds which ultimately failed, but maybe that hasn't been for a while now | 20:03 |
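(What fungi's "0x0" shorthand refers to: in Gerrit's stream-events output, a branch deletion arrives as a ref-updated event whose newRev is the all-zero SHA. The sample event below is constructed for illustration rather than captured from a log, and the detection helper is hypothetical; as corvus notes, Zuul's actual special case is to re-query the project's branch list.)

```python
#!/usr/bin/env python3
"""Illustration: recognizing a branch-deletion ref-updated event by its
all-zero newRev. The sample event is made up for the example; field names
follow Gerrit's stream-events format."""
import json

ZERO_SHA = "0" * 40

sample_event = json.loads("""
{
  "type": "ref-updated",
  "refUpdate": {
    "project": "openstack/horizon",
    "refName": "stable/stein",
    "oldRev": "1111111111111111111111111111111111111111",
    "newRev": "0000000000000000000000000000000000000000"
  }
}
""")

def looks_like_branch_deletion(event):
    ref = event.get("refUpdate", {})
    return event.get("type") == "ref-updated" and ref.get("newRev") == ZERO_SHA

print(looks_like_branch_deletion(sample_event))  # True
```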
frickler | ah, that's why any branch operation would fix it. so creating the 2023.2 branch would also solve the issue | 20:17 |