ianw | i'll get back to the debian-stable removal and maybe we can free up space there | 00:00 |
---|---|---|
prometheanfire | can periodic jobs email the core reviewer teams (or be otherwise configurable for notifications)? | 01:30 |
ianw | yes they can; iirc zuul-jobs might be an example | 01:39 |
ianw | https://review.opendev.org/c/zuul/zuul-jobs/+/748682 | 01:41 |
fungi | ianw: prometheanfire: i think the only inbuilt mailing feature in zuul is the smtp exporter, and the recipient address for that is configured on a per pipeline basis | 01:58 |
fungi | https://zuul-ci.org/docs/zuul/reference/drivers/smtp.html | 01:59 |
* prometheanfire does like zuul | 02:11 | |
fungi | it's not got enough insight into gerrit's data structures to work out core reviewer addresses or anything, nor can it configure notification addresses on a per-job or per-project basis | 02:19 |
fungi | the periodic-stable pipeline, for example, is configured to send failure reports to the stable-maint ml: https://opendev.org/openstack/project-config/src/branch/master/zuul.d/pipelines.yaml#L293-L297 | 02:21 |
fungi | similarly, failures for the release-post pipeline are reported to the release-job-failures ml: from: zuul@openstack.org | 02:22 |
fungi | er, meant to paste https://opendev.org/openstack/project-config/src/branch/master/zuul.d/pipelines.yaml#L262-L266 | 02:22 |
fungi | also the pre-release, release, and tag pipelines get failure reports sent there | 02:23 |
prometheanfire | well, for zuul, could be a useful feature request | 03:01 |
*** bhagyashris__ is now known as bhagyashris | 04:11 | |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Remove debian-stable nodeset https://review.opendev.org/c/opendev/base-jobs/+/802639 | 04:25 |
ianw | fungi: ^ i think with that list of dependencies that is finally ready ... | 05:27 |
*** gibi is now known as gibi_back_15UTC | 06:37 | |
*** ysandeep is now known as ysandeep|trng | 06:50 | |
*** jpena|off is now known as jpena | 07:31 | |
frickler | those deps all seem to fail on broken c7 jobs. more skeletons in the closet ... | 07:38 |
*** ykarel is now known as ykarel|lunch | 08:41 | |
*** ykarel|lunch is now known as ykarel | 09:02 | |
opendevreview | Thierry Carrez proposed openstack/project-config master: Add ttx as OP to #openinfra-events https://review.opendev.org/c/openstack/project-config/+/814381 | 09:07 |
ttx | Zuul's release-approval queue has been blocked since 2021-10-15 15:42:42 | 09:44 |
frickler | openstack-zuul-jobs-linters is also failing with pyyaml6, should be an easy fix I hope, but lunch is first | 10:00 |
opendevreview | daniel.pawlik proposed opendev/system-config master: DNM Add zuul-log-scrapper role https://review.opendev.org/c/opendev/system-config/+/814391 | 10:01 |
*** mnasiadka_ is now known as mnasiadka | 10:10 | |
frickler | for the release-approval, iiuc that's what fungi was looking at earlier, but no resolution yet? | 10:18 |
frickler | zuul is also complaining about a lot of config errors, a significant amount seems to be rename-related | 10:19 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Fix use of yaml.load() https://review.opendev.org/c/openstack/project-config/+/814401 | 10:26 |
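For context: PyYAML 6 made the `Loader` argument to `yaml.load()` mandatory, so bare `yaml.load(f)` calls that relied on the old default now fail. A minimal sketch of the kind of fix the change above makes; the file path is only illustrative:

```python
import yaml

# Before (breaks on PyYAML 6: "load() missing 1 required positional argument: 'Loader'"):
#   data = yaml.load(open("zuul.d/pipelines.yaml"))

# Either pass a Loader explicitly...
with open("zuul.d/pipelines.yaml") as f:
    data = yaml.load(f, Loader=yaml.SafeLoader)

# ...or use the safe_load() shorthand, which behaves the same on old and new PyYAML.
with open("zuul.d/pipelines.yaml") as f:
    data = yaml.safe_load(f)
```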
frickler | ERROR: Project openinfra/ansible-role-refstack-client has non existing acl_config line | 11:01 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Fix use of yaml.load() https://review.opendev.org/c/openstack/project-config/+/814401 | 11:10 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Fix the renaming of ansible-role-refstack-client https://review.opendev.org/c/openstack/project-config/+/814409 | 11:10 |
*** dviroel is now known as dviroel|rover | 11:10 | |
* frickler hopes that this order will work, otherwise will merge them | 11:11 | |
frickler | of course murphy strikes again | 11:21 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Fix project-config testing https://review.opendev.org/c/openstack/project-config/+/814401 | 11:23 |
*** jpena is now known as jpena|lunch | 11:31 | |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Add ttx as OP to #openinfra-events https://review.opendev.org/c/openstack/project-config/+/814381 | 11:36 |
opendevreview | Merged openstack/project-config master: Fix project-config testing https://review.opendev.org/c/openstack/project-config/+/814401 | 11:48 |
opendevreview | Merged openstack/project-config master: Add ttx as OP to #openinfra-events https://review.opendev.org/c/openstack/project-config/+/814381 | 12:08 |
*** jpena|lunch is now known as jpena | 12:23 | |
*** ysandeep|trng is now known as ysandeep | 12:57 | |
opendevreview | Dong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size https://review.opendev.org/c/zuul/zuul-jobs/+/813034 | 13:06 |
*** hjensas is now known as hjensas|afk | 13:19 | |
ttx | infra-root: Not super urgent, but Zuul's release-approval queue (the one used to trigger the PTL-approval test on release changes) seems to be stuck since Friday (478 changes added up) | 13:20 |
fungi | ttx: yep, known since saturday, i'm hoping to get some more eyes on the traceback/exception i found related to the top change there, figured if it was stuck into today that wouldn't be the end of the world so i didn't aggressively ping folks on their weekend | 13:23 |
fungi | brought it up in the zuul matrix channel though for input | 13:23 |
ttx | yeah, it's not super critical, I'm more concerned that it might add up to the point of slowing down other things | 13:25 |
fungi | same here, so hopefully we can decide whether we need to keep it in that state much longer or can try to clear it | 13:25 |
fungi | if it were the check or gate pipeline i'd have just collected as much data as i could and worked on getting it unstuck (probably a scheduler restart) | 13:26 |
Clark[m] | I suspect a dequeue may be sufficient to get this type of thing moving again. However the cache entry for the specific change might be stale/bad and even a restart won't fix that | 13:32 |
*** akahat is now known as akahat|afk | 13:57 | |
clarkb | fungi: also I haven't forgotten that we need to do a gear release and looks like maybe we should consider a bindep release? I can push tags for those after school run if we want to do that | 13:59 |
clarkb | Then I'd also like to land the gerrit 3.3.7 change and the gerrit.config cleanup change with plans to restart gerrit on those today/tomorrow | 14:00 |
fungi | yeah, probably need to add release notes for bindep | 14:00 |
*** gibi_back_15UTC is now known as gibi | 14:06 | |
clarkb | fungi: is there a change up to default gitea_always_update to true? then we can consider if we want to just do that. | 14:11 |
corvus | i can look at the zuul queue shortly | 14:13 |
clarkb | corvus: thanks! | 14:14 |
opendevreview | Clark Boylan proposed opendev/bindep master: Add release note for rocky and manjaro https://review.opendev.org/c/opendev/bindep/+/814431 | 14:20 |
clarkb | fungi: ^ bindep release note | 14:20 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/814048 and https://review.opendev.org/c/opendev/system-config/+/813716 are the two gerrit related changes I mentioned above. I'll approve the second one after the school run since it has the necessary +2's. Landing the first to update to 3.3.7 would be nice too though | 14:23 |
clarkb | oh I guess https://review.opendev.org/c/opendev/system-config/+/813675 is still out there too. Should probably land that before a gerrit restart too | 14:28 |
clarkb | it's hard to know if we are using an ansible group anywhere though. Maybe reviewers have a better way of checking for that than my bad grepping | 14:29 |
*** timburke_ is now known as timburke | 14:39 | |
*** akahat|afk is now known as akahat | 14:48 | |
fungi | clarkb: thanks for the bindep reno, was that all you spotted since the last tag? | 14:59 |
opendevreview | daniel.pawlik proposed opendev/system-config master: DNM Add zuul-log-scrapper role https://review.opendev.org/c/opendev/system-config/+/814391 | 15:00 |
fungi | clarkb: yep, those look like it from `git log 2.9.0..origin/master` | 15:04 |
fungi | i guess we'll want to make it 2.10.0, probably can skip making release candidates given the low impact of the other changes (the most iffy one is where we stop using distutils.version.LooseVersion) | 15:05 |
Clark[m] | ++ | 15:07 |
*** ykarel is now known as ykarel|away | 15:08 | |
clarkb | and for gear I think we're looking at 0.16.0 due to the tls changes, randomized server connection and modified connection timeout process | 15:17 |
clarkb | fungi: ^ if that sounds right to you I can start with the gear release nowish since that doesn't use reno. Then do the bindep 2.10.0 once the release notes land | 15:17 |
clarkb | I've approved the gerrit config cleanup just now as well | 15:18 |
fungi | yeah, pretty sure 0.16.0 is what we talked about previously, and the pin zuul added was for gear<0.16 | 15:22 |
fungi | double-checking now | 15:22 |
fungi | gear>=0.13.0,<0.16.0,!=0.15.0 | 15:22 |
clarkb | fungi: ok I'll load up a gpg key and remember how to push a tag to gerrit :) | 15:23 |
fungi | i'm happy to if we want to wait until after 17:00z | 15:24 |
clarkb | nah I've done it. Good to exercise this memory. Does commit aa21a0c61b1b665714f5b6e55ec202db9ddc22f1 (HEAD -> master, tag: 0.16.0, origin/master, origin/HEAD) look right to you? | 15:27 |
fungi | clarkb: yep, that looks like current origin/master and the new version we discussed | 15:28 |
clarkb | pushed | 15:29 |
fungi | we should probably send a release announcement to service-announce about it as well | 15:29 |
clarkb | hrm I don't know that it ran any jobs | 15:30 |
clarkb | I wonder if that wasn't ported when we moved it to the opendev tenant | 15:30 |
clarkb | no it lists a couple of release jobs | 15:31 |
clarkb | Project opendev/gear not in pipeline <Pipeline release> for change <Tag 0x7f9c346d9d30 opendev/gear creates refs/tags/0.16.0 on fac493c11ec7319a724ed4b29ff2766e1862f643> | 15:36 |
fungi | umm | 15:37 |
clarkb | oh wait it is there I think it redrew the status page and moved the "card" | 15:41 |
clarkb | it's just too early in the morning for me to process that the release queue moved from the right side of the screen to the left side ... | 15:41 |
clarkb | ya it is on pypi now and it is building its docker image currently. Ok all is well. I'll send an email once the docker image is done getting processed | 15:41 |
clarkb | and the message I pasted above must be from the openstack release pipeline not the opendev release pipeline | 15:42 |
clarkb | and bindep is waiting for one of those fedora images that refuse to boot in most clouds :/ | 15:42 |
fungi | if you want to crib a release announcement, http://lists.opendev.org/pipermail/service-announce/2021-April/000018.html was my most recent one for git-review | 15:46 |
*** marios is now known as marios|out | 15:48 | |
clarkb | thanks email sent | 15:55 |
*** frenzy_friday is now known as frenzyfriday|pto | 15:55 | |
fungi | thanks for tackling that! | 15:56 |
fungi | i've been sucked into ptg sessions since 13:00 and still have another hour to go | 15:56 |
opendevreview | Clark Boylan proposed opendev/system-config master: Always update gitea repo meta data https://review.opendev.org/c/opendev/system-config/+/814443 | 15:58 |
clarkb | there's a change to discuss simply updating all the projects all the time. Testing should double check the cost in runtime for us too | 15:59 |
*** ysandeep is now known as ysandeep|dinner | 16:06 | |
reed | FYI https://github.com/MetaMask/eth-phishing-detect/issues/5643 | 16:08 |
clarkb | reed: their interactive checker says we aren't blocked either | 16:16 |
clarkb | I wonder if there is some bug on their end | 16:16 |
reed | could be anything 🙂 I don't understand how half of this stuff works | 16:17 |
opendevreview | Merged opendev/system-config master: Clean up our gerrit config https://review.opendev.org/c/opendev/system-config/+/813716 | 16:18 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/814048 since it is largely mechanical (it has fungi's +2) | 16:32 |
*** jpena is now known as jpena|off | 16:34 | |
clarkb | Looks like the mina update to 2.7.0 isn't a drop in update with gerrit | 16:55 |
fungi | :/ | 17:08 |
*** ysandeep|dinner is now known as ysandeep | 17:14 | |
corvus | clarkb: fungi: ttx: i've identified the zuul bug with the release-approval queue. i think a dequeue command should get things moving again, so i'll go ahead and issue that now. | 17:42 |
*** ysandeep is now known as ysandeep|out | 17:42 | |
corvus | (details on the bug in #zuul) | 17:42 |
opendevreview | Merged opendev/system-config master: Build Gerrit 3.3.7 images https://review.opendev.org/c/opendev/system-config/+/814048 | 17:43 |
clarkb | corvus: thanks | 17:44 |
fungi | appreciated! i'll work on the dequeue now | 17:47 |
corvus | fungi: i'm on it | 17:47 |
clarkb | this mina thing is fine. I've fixed basically all the issues except for a place where we have to define a new abstract method and ... an slf4j logger import that doesn't work because I don't understand bazel | 17:48 |
clarkb | s/fine/fun/ | 17:48 |
fungi | oh, thanks corvus! | 17:48 |
fungi | sorry, i missed where you said "i'll go ahead and issue that now" | 17:49 |
corvus | clarkb: naturally i just assumed you were referencing the "This is fine." meme | 17:49 |
fungi | mina [breaks ssh for everyone]: "this is fine" | 17:49 |
clarkb | what is really curious about the logger import is that the code path that hits this doesn't appear to have changed between mina 2.4.0 and 2.7.0 | 17:50 |
clarkb | aha I think maybe I need to update the version of slf4j too | 17:52 |
corvus | the number of items in release-approval is decreasing. it may take a while to zero out. | 17:52 |
clarkb | ok it wasn't the version, but I've updated that anyway. Turns out you have to both depend on a thing and ensure its visibility allows it to be seen from where you declare the dependency | 18:00 |
fungi | clarkb: you're trying to port mina-sshd's negotiation implementation to the embedded copy in gerrit? | 18:01 |
fungi | is that one slightly forked then, not just as simple as pulling in a new version i guess... | 18:02 |
clarkb | fungi: no I'm trying to make gerrit 3.3 build against mina 2.7.0 expecting that the effort to build against 2.7.1 when it becomes available will be minimal | 18:03 |
fungi | ahh, okay | 18:03 |
opendevreview | Merged opendev/bindep master: Add release note for rocky and manjaro https://review.opendev.org/c/opendev/bindep/+/814431 | 18:10 |
corvus | release-approval is at 0 now | 18:12 |
opendevreview | Clark Boylan proposed opendev/system-config master: Push a patch to test MINA 2.7 with Gerrit https://review.opendev.org/c/opendev/system-config/+/814230 | 18:26 |
clarkb | That patch builds locally for me. I've put a hold on it so that we can interact with it after zuul does its thing and see if ssh is generally working | 18:28 |
clarkb | fungi: fwiw I think backporting the kex handler stuff to our use of mina on top of mina 2.4 is possible. But I suspect that the best thing here is simply to get to an up to date version instead | 18:29 |
clarkb | infra-root I'm thinking that https://review.opendev.org/c/opendev/system-config/+/813675 is probably the riskiest change of the bunch that I've written (just because it is hard to tell if the gerrit group is used somewhere unexpected) | 18:40 |
clarkb | For that reason I'm thinking maybe I'll restart gerrit today with the 3.3.7 update and the config cleanup to make sure that is all happy. Then we can do the gerrit group cleanup and another restart in the future (to help isolate any potential fallout) | 18:40 |
clarkb | If that seems reasonable I'll plan to do the restart after lunch today | 18:40 |
clarkb | fungi: for bindep does tagging 36e28c76fa1d9370e967d08f4edf18a023c2aff7 as 2.10.0 look good to you? Do you want to make that release or should I? | 18:42 |
fungi | clarkb: that matches my origin/master ref and the version we discussed, feel free to tag and push, or i can get around to it in a bit | 18:54 |
clarkb | I can do it | 18:54 |
fungi | thanks! | 18:55 |
clarkb | pushed | 18:56 |
fungi | awesome | 18:58 |
clarkb | https://pypi.org/project/bindep/ | 19:04 |
clarkb | Should I send a service-announce email for this one too? I guess so | 19:05 |
fungi | i have in the past, yeah | 19:05 |
clarkb | sent | 19:10 |
fungi | perfect, thanks again! | 19:12 |
clarkb | we've got ~59 nodepool nodes that are locked and in-use for 4 and a half days | 19:13 |
clarkb | one is in a locked but used state and the last one is a held node | 19:14 |
clarkb | one of the jobs associated with these nodes is still trying to get finished on ze02 (99c5b5eff43c404f8e2d11221944cd65 is the job uuid) | 19:20 |
clarkb | corvus: ^ fyi that zuul seems to actually be holding those locks and failing to process the job finish | 19:20 |
fungi | clarkb: yeah, i looked at that over the weekend too, seems to coincide with the same zk disconnect late last week which led to the scheduler getting restarted | 19:21 |
fungi | they're basically all from ~3 hours before the scheduler restart | 19:21 |
clarkb | ya I guess we didn't restart executors too which would've killed the ephemeral znodes | 19:22 |
fungi | refrained from trying to manually clean them up since we weren't pressed for quota and thought they might be useful for identifying the problem | 19:22 |
clarkb | but zuul should handle this regardless unless we updated the executor in the process and it was no longer compatible with the executors? | 19:22 |
clarkb | the queue indicated by the "Finishing job" log entries on ze02 is growing it seems | 19:23 |
clarkb | Node 0026928358 in nodepool was assigned to build 99c5b5eff43c404f8e2d11221944cd65 which ran on ze02 and has yet to successfully finish and unlock since ~Friday? | 19:24 |
fungi | i wonder if those were all for builds which finished in the problem timeframe prior to the scheduler restart | 19:25 |
frickler | that sounds plausible to me. we also still need to clean up the held nodes that zuul lost track of during the earlier upgrade, I think? | 19:29 |
clarkb | frickler: ya if any of those are still held on the nodepool side but not recorded on the zuul side we should clean them up in nodepool when we are done with them | 19:36 |
clarkb | looking at the code I think we only try the once to call finishJob | 19:53 |
clarkb | and if that doesn't succeed then the job worker remains present forever? | 19:53 |
clarkb | and grepping for Finishing Job: 99c5b5eff43c404f8e2d11221944cd65 returns no results on ze02 implying we never called that method? | 19:55 |
clarkb | The last thing we seem to have done is pause the job | 19:57 |
clarkb | `zgrep 99c5b5eff43c404f8e2d11221944cd65 /var/log/zuul/executor-debug.log.4.gz | grep -v 'Finishing Job' | less` on ze02 shows this. I'll double check the other files really quickly too | 19:58 |
clarkb | ya pausing is the last thing logged if we excluding the Finishing Job logging | 19:59 |
clarkb | I think the next step is to do a thread dump to see if the thread is still running (I don't think it will be) or if we've just got a reference to the build in job_workers because the thread died before calling finishJob | 20:01 |
clarkb | but I need to eat lunch now. Back in a bit | 20:01 |
clarkb | also note that a graceful stop won't work on these executors because those jobs will never leave the job_workers dict, and an empty dict is what graceful stop waits for | 20:02 |
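For context: the thread dump clarkb is about to take comes from Zuul's SIGUSR2 handler, which logs a stack trace for every live thread and toggles yappi profiling. The sketch below shows the general stdlib pattern for such a handler; it is an assumption about its shape, not Zuul's actual code:

```python
import signal
import sys
import threading
import traceback

def dump_threads(signum, frame):
    # Map thread ids back to names so the dump is readable.
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, stack in sys._current_frames().items():
        print(f"Thread: {ident} {names.get(ident, '?')}")
        print("".join(traceback.format_stack(stack)))

# After this, `kill -USR2 <pid>` prints every thread's stack, e.g. a build
# thread blocked in self._resume_event.wait() as seen later in the log.
signal.signal(signal.SIGUSR2, dump_threads)
```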
*** dviroel|rover is now known as dviroel|rover|afk | 21:01 | |
clarkb | https://paste.opendev.org/show/b1fOfN46k9aW0DsKYkz6/ I don't think that is related to the issue we have here but potentially interesting | 21:08 |
clarkb | do we know approximately when the zk issue occurred? I'm not seeing it in the ze02 logs yet | 21:12 |
clarkb | corvus: ^ fyi this is a fun one too. | 21:12 |
clarkb | That exception in the paste is the only one that I think could be related. The others are git updates that failed and log streamers closing connections | 21:15 |
corvus | does casting to a list help avoid that? | 21:16 |
corvus | clarkb: i don't think that would cause a significant error; i think that's the loop that gets the next job, so failing there means "just start over and try again" | 21:16 |
clarkb | corvus: ya I don't think it's related to the issue we're seeing with execute() never calling finishJob(), but I can't see anything else that might cause that | 21:17 |
clarkb | corvus: I think you are meant to make a copy and iterate over that rather than iterate the live data to avoid that error | 21:17 |
clarkb | I'm almost ready to do the sigusr2 on ze02 and see if the threads have completely gone away or if they are still present | 21:17 |
corvus | clarkb: you mean copy the dict first? | 21:18 |
clarkb | corvus: or the list of values/keys/items that is being iterated over (I haven't looked at the exact code yet as i'm still trying to make sense of the locked nodes belonging to unfinished jobs issue) | 21:20 |
corvus | clarkb: yeah, just wondering if you do list(dict.values()) if the list call is atomic enough to avoid that issue or whether it is subject to the same thing... | 21:23 |
corvus | i guess it probably holds the GIL for the duration of list(), so it's effectively mutexed... | 21:23 |
corvus | so i think it's probably sufficient. :) | 21:23 |
clarkb | ya I think that is the case | 21:24 |
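For context: the paste appears to show the RuntimeError Python raises when a dict is resized while another thread is iterating over it. A minimal illustration of the racy loop and the `list()` snapshot discussed above; the `job_workers` name mirrors the executor attribute mentioned later, but this is not the actual Zuul code:

```python
import threading

# Shared state touched by multiple threads, e.g. a map of running build workers.
job_workers = {}

def racy_scan():
    # If another thread adds or removes an entry mid-loop, this raises
    # "RuntimeError: dictionary changed size during iteration".
    for worker in job_workers.values():
        pass  # inspect the worker

def snapshot_scan():
    # list() copies the view while holding the GIL, so the loop below
    # iterates a private snapshot that other threads cannot resize.
    for worker in list(job_workers.values()):
        pass  # inspect the worker
```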
clarkb | Thread: 139991145465600 build-99c5b5eff43c404f8e2d11221944cd65 d: False <- that thread still exists | 21:24 |
clarkb | File "/usr/local/lib/python3.8/site-packages/zuul/executor/server.py", line 999, in pause\n self._resume_event.wait() | 21:24 |
clarkb | so it's waiting for the unpause condition to occur but that isn't going to happen? | 21:25 |
clarkb | that gets triggered by a JobRequestEvent.RESUMED event | 21:27 |
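For context: the quoted frame shows the build thread parked in `pause()` on a `threading.Event` that only a resume (or stop) request will set. A stripped-down sketch of that pattern, not the real AnsibleJob class:

```python
import threading

class PausableJob:
    def __init__(self):
        self._resume_event = threading.Event()

    def pause(self):
        # Blocks the job thread indefinitely; if no resume/stop request ever
        # arrives, the thread sits here forever, which is what the stack dump
        # above shows.
        self._resume_event.wait()

    def resume(self):
        # Called from the event-handling side when a RESUMED (or delete)
        # request is seen; wakes pause() back up.
        self._resume_event.set()
```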
corvus | clarkb: did it log this? "Received %s event for build %s" | 21:28 |
corvus | looking for that to get a deleted event | 21:28 |
corvus | clarkb: because the build request delete should cause that, which should cause the jobworker stop() method to be called which should resume the job then stop it | 21:29 |
corvus | clarkb: are we talking about the zk issue where we lost contact with the server, and it appeared that we also lost all the watches? | 21:30 |
corvus | because that depends on watches too. | 21:30 |
clarkb | corvus: yes the incident that frickler restarted the scheduler for. And ya I'm seeing that we do a child watch for delete and resume znodes | 21:30 |
corvus | in which case, we may actually have more debugging information than i realized, though we don't have a repl on the executor to help with it | 21:31 |
corvus | but i would like to examine zk and see what sessions it thinks are active | 21:31 |
corvus | (because so far we have no idea how it's possible to resume a zk session and lose the watches; that's not supposed to happen and i can't reproduce it locally) | 21:32 |
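For context: the "child watch" is ZooKeeper's change-notification mechanism; with kazoo, the client library Zuul uses, a watch on a build request's children looks roughly like the sketch below. The host and znode path are illustrative, and if a session somehow loses its watches the callback simply never fires again, which is the behaviour being discussed:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk01.opendev.org:2181")  # illustrative host
zk.start()

# Illustrative path for the stuck build request.
build_path = "/zuul/build-requests/99c5b5eff43c404f8e2d11221944cd65"

def on_children(children):
    # Called once immediately and again whenever the child list changes,
    # e.g. when a "resume" or "delete" request znode appears underneath.
    for child in children:
        print("request child:", child)

zk.ChildrenWatch(build_path, on_children)
```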
clarkb | corvus: ze02 is the one where 99c5b5eff43c404f8e2d11221944cd65 is still running in a paused state if you need specifics to focus on | 21:32 |
corvus | thx | 21:32 |
clarkb | I hit sigusr2 twice so yappi isn't running anymore but there may be some info in the logs from when it collected brief data? I'll check for that log message | 21:32 |
clarkb | it almost seems like the scheduler restart caused it to forget about all these builds even though they should still be in zk? | 21:33 |
clarkb | and that caused us to not send the resume event | 21:33 |
corvus | clarkb: occam's razor says to me that if the scheduler was stuck because it lost the watches on its zk session, then it's very likely that the executor was in the same situation. | 21:34 |
corvus | i mean, there could be something else going on, but that seems like a perfectly satisfactory explanation that i think we'd need to falsify first | 21:34 |
clarkb | `zgrep 'Received .* event for build' /var/log/zuul/executor-debug.log.4.gz | grep 99c5b5eff43c404f8e2d11221944cd65` returns no results | 21:34 |
clarkb | I guess the next step is to look at the zk state and see if there is a resume or delete event sitting in the child znode listing that a watch would've seen? | 21:35 |
corvus | thx. that's consistent with the lost watches hypothesis | 21:35 |
corvus | clarkb: the delete event should literally be the znode deleted; so yeah, we can check for the 99c5b5eff43c404f8e2d11221944cd65 build request and it should not be there | 21:36 |
clarkb | corvus: will you do that or should I? | 21:36 |
corvus | i will, i have a zk shell already | 21:36 |
corvus | get 99c5b5eff43c404f8e2d11221944cd65 | 21:37 |
corvus | {"uuid": "99c5b5eff43c404f8e2d11221944cd65", "state": "paused", "precedence": 300, "resultpath": null, "zone": null, "buildsetuuid": "e2a6a56c35d94d78bcb378deb11d9696", "jobname": "tripleo-ci-centos-8-content-provider-ussuri", "tenantname": "openstack", "pipelinename": "periodic", "eventid": "724e0983166c4458a789c520989e86a8", "workerinfo": {"hostname": "ze02.opendev.org", "log_port": 7900}} | 21:37 |
corvus | well, that's that theory falsified :) | 21:37 |
corvus | the scheduler should have deleted that build request after restarting | 21:38 |
clarkb | is there a resume child znode that we might have missed? | 21:38 |
corvus | no children | 21:38 |
clarkb | interesting so as far as zuul is concerned the job has been running for 4 days? :) | 21:38 |
corvus | well, is it in the pipeline? | 21:39 |
corvus | (and it has been running for 4 days, right?) | 21:39 |
clarkb | no its not in the status.json rendered pipeline at least | 21:39 |
clarkb | but the job_worker thread is present and the job's last activity was to pause | 21:40 |
clarkb | I guess the scheduler forgot about it though | 21:40 |
clarkb | which is weird because the build request is there showing a paused state | 21:40 |
corvus | was it before the restart? | 21:40 |
clarkb | the pause? I'm not sure, since I don't know exactly when the restart happened. | 21:40 |
clarkb | 2021-10-14 07:54:49,630 INFO zuul.AnsibleJob: [e: 724e0983166c4458a789c520989e86a8] [build: 99c5b5eff43c404f8e2d11221944cd65] Pausing job tripleo-ci-centos-8-content-provider-ussuri for ref refs/heads/master (change https://opendev.org/openstack/tripleo-ci/commit/None) | 21:41 |
clarkb | that is when the pause for the job occurred | 21:41 |
corvus | okay, i thought for some reason we were debugging fallout from the scheduler restart | 21:41 |
clarkb | corvus: well fungi mentioned he thought it was related, but I've not yet tracked down when exactly the restart happened | 21:42 |
corvus | 2021-10-14 10:03:50 UTC zuul was stuck processing jobs and has been restarted. pending jobs will be re-enqueued | 21:42 |
corvus | maybe that? | 21:42 |
clarkb | 10:01:02* frickler | #status notice zuul was stuck processing jobs and has been restarted. pending jobs will be re-enqueued | 21:43 |
clarkb | ya so the job was paused before the restart, but those tripleo jobs that run after the pause can run for a significant amount of time so it is likely the job was still paused when that happened | 21:43 |
corvus | then the question is why didn't the scheduler delete the build request | 21:44 |
corvus | hrm. we may simply not do that. | 21:46 |
corvus | we cleanup requests from executors that crash while the scheduler is up, but i think we may not do the other way around. | 21:47 |
corvus | so, short-term fix: if you restart the scheduler, do the whole system. | 21:48 |
corvus | long-term fix: pipeline state in zk | 21:48 |
corvus | given that this started being an issue like 3 months ago, and we're 90% of the way to it not being an issue, i think i'd lean toward not doing a medium-term fix and just sticking with the short-term fix for now. we could send an announcement to that effect... | 21:49 |
corvus | i relayed that to #zuul since that's a zuul-project discussion | 21:52 |
clarkb | ya I think if we communicate "restart executors and mergers when restarting the scheduler" that is probably reasonable for now | 21:52 |
corvus | infra-root: ^ for the time being, if you restart the zuul scheduler, go ahead and restart the mergers and executors too (the zuul_restart playbook), because of that bug in zuul | 21:57 |
clarkb | corvus: I'm thinking maybe we wait until your fix for the other issue is ready then restart everything at that point? | 21:57 |
corvus | sgtm | 21:57 |
corvus | s/ready/merged -- i'm pretty sure the fix is ready (i ran tests locally) | 21:57 |
clarkb | ya I'm starting my review of it now | 21:58 |
fungi | oof, that's a lot of scrollback for just wandering off for dinner | 22:05 |
fungi | okay, luckily i already read the tl;dr in zuul matrix, so i think skimming it was good enough ;) | 22:07 |
clarkb | should we hold off on a gerrit restart to do that when we restart zuul? | 22:13 |
clarkb | (I want to restart gerrit for the 3.3.7 image update and to ensure the config file cleanup doesn't cause problems) | 22:13 |
clarkb | neat my MINA 2.7.0 held node seems to be working with stuff like ls-projects and show-queue | 22:19 |
fungi | i can help with a combined zuul+gerrit restart when we're ready for that | 22:21 |
fungi | i guess the idea is to restart all of zuul on 814493? | 22:22 |
clarkb | ya | 22:24 |
fungi | not sure if i can legitimately claim to be reviewing that, but i'm staring at it really, really hard | 22:25 |
fungi | based on the earlier discussions in matrix, i think i follow it | 22:26 |
clarkb | https://172.99.67.89/c/x/test-project/+/21 I was able to push that via local ssh as well. I think that generally means newer mina is working. The other big mina thing is replication though | 22:29 |
clarkb | testing that is a bit more of an involved process so probably won't get into that until a 2.7.1 shows up and we can argue for doing the update | 22:29 |
clarkb | fungi: are you able to review 814493? I'll ping tristanC about it too since he expressed interest | 22:30 |
fungi | clarkb: i just did, and didn't approve specifically because tristan had wanted to review it | 22:31 |
clarkb | thanks | 22:32 |
fungi | but yeah, i think i followed the solution and it looked okay | 22:33 |
clarkb | Finally remembered to approve the prometheus spec. I got distracted last week by the rename improvement stuff | 22:45 |
fungi | good call | 22:46 |
clarkb | fungi: frickler: what was the story with the zuul complaints about old venus secrets in zk? | 22:48 |
clarkb | The rename process should've renamed them to openstack/venus paths in zk then deleted the venus/venus paths | 22:48 |
fungi | the backup seemed to want to back up the old paths | 22:50 |
fungi | and then complained when it couldn't find them in zk | 22:50 |
clarkb | that's interesting since I thought it only operated off of what it saw in zk | 22:51 |
clarkb | corvus: ^ fyi | 22:51 |
clarkb | fungi: was it root email where it complained? | 22:51 |
fungi | that's a great question, i hadn't had time to look, just repeating what frickler said | 22:51 |
clarkb | ya seems to have occurred 23 hours ago and daily before that after the renames | 22:52 |
fungi | yep, root cronspam | 22:52 |
fungi | just found a copy in my cronspam inbox | 22:52 |
clarkb | ok adding to the meeting agenda. I think the export is supposed to be greedy and export as much as it can so this likely isn't a fatal issue. But something we should clean up in our rename process or the zuul export-keys command | 22:52 |
fungi | ERROR:zuul.KeyStorage:Unable to load keys at /keystorage/gerrit/osf/osf%2Fopenstackid | 22:53 |
fungi | and so on | 22:53 |
opendevreview | Merged opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement https://review.opendev.org/c/opendev/infra-specs/+/804122 | 22:53 |
fungi | maybe we don't correctly remove the parent znode? | 22:54 |
clarkb | fungi: oh that could be | 22:54 |
fungi | and so the tool is finding an empty one there | 22:54 |
fungi | because, yeah, i thought it was effectively stateless | 22:54 |
clarkb | fungi: ya looking directly at the db I think this is exactly the issue | 23:02 |
clarkb | and ya checking the backups I think we're still dumping everything else | 23:03 |
clarkb | I'm working on a fix | 23:14 |
clarkb | remote: https://review.opendev.org/c/zuul/zuul/+/814504 Cleanup empty secrets dirs when deleting secrets | 23:35 |
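For context: the patch targets the failure mode discussed above, where deleting a renamed project's keys leaves an empty parent znode (e.g. /keystorage/gerrit/osf) that later key exports trip over. A rough sketch of the cleanup idea using kazoo, with an illustrative host and simplified layout, not the actual change:

```python
from urllib.parse import quote_plus

from kazoo.client import KazooClient
from kazoo.exceptions import NoNodeError, NotEmptyError

zk = KazooClient(hosts="zk01.opendev.org:2181")  # illustrative host
zk.start()

def delete_project_keys(connection, project):
    """Delete a project's key znodes and prune the now-empty parent."""
    org = project.split("/", 1)[0]
    # Layout mirrors the error above: /keystorage/gerrit/osf/osf%2Fopenstackid
    path = f"/keystorage/{connection}/{org}/{quote_plus(project)}"
    zk.delete(path, recursive=True)

    # Drop the org-level znode too if this was its last project, so the
    # periodic export no longer finds an empty directory with no keys in it.
    try:
        zk.delete(f"/keystorage/{connection}/{org}")
    except (NotEmptyError, NoNodeError):
        pass

delete_project_keys("gerrit", "osf/openstackid")
```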
clarkb | I'm going to need to do the zuul restart and gerrit restart tomorrow if I'm helping. Have dinner to help with now. Happy for others to do it this evening if they have time, but I don't think we are in a rush; tomorrow should be fine | 23:57 |