clarkb | fungi: it calls log = get_annotated_logger(self.log, task.zuul_event_id, build=task.build) | 00:00 |
clarkb | corvus: ya you could probably put task back on the queue | 00:01 |
clarkb | aborting from that separate thread would be a bit weird though I think. If we're going to abort we should maybe do it via _update() checking the result of innerUpdateLoop activity | 00:02 |
fungi | yeah, for reference this is how it winds up in the log: http://paste.openstack.org/show/803050/ | 00:02 |
fungi | i get lost quickly when trying to reason through cross-thread interactions | 00:03 |
corvus | oh, so the idea is to abort the build when that happens? | 00:03 |
fungi | we do abort under similar circumstances in AnsibleJob.execute() | 00:04 |
fungi | but if it's caught in ExecutorServer._innerUpdateLoop() the build returns ERROR instead | 00:05 |
fungi | https://zuul.opendev.org/t/openstack/build/999f096bccbe43c286b7c6ed3c5deeb5 | 00:06 |
fungi | "Error: Failed to update project None" | 00:06 |
corvus | fungi: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L1046 | 00:06 |
fungi | so that's where we should send abort? | 00:07 |
corvus | that's what translates the task failure into an ExecutorError exception, and ExecutorError means "non-transient" | 00:07 |
corvus | fungi: i think you're suggesting a nuance, in that a brokenprocesspool on the executor should be considered a transient error | 00:08 |
fungi | ahh | 00:08 |
clarkb | that could be implemented by updating UpdateTask to have a transient error vs non-transient error flag | 00:09 |
corvus | so you'll want to distinguish this condition from others, and do something other than ExecutorError. | 00:09 |
corvus | yeah like clarkb said | 00:09 |
clarkb | one tricky thing is that the regular merger server shares some of those code paths | 00:09 |
fungi | i was mostly wondering if it should exhibit behavior consistent with https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L948-L953 | 00:09 |
corvus | if you do clarkb's suggestion, then it'd probably look like that -- "if transient: self._send_aborted()" | 00:11 |
corvus | i think you could do that without even dealing with any new exceptions | 00:12 |
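A minimal sketch of the flag idea discussed above (class and function names here are hypothetical stand-ins, not Zuul's actual API): the update task records whether its failure was transient, and the code that currently raises ExecutorError instead aborts the build so the scheduler retries it.

```python
# Hypothetical sketch of the transient-error flag; names are
# illustrative and do not match Zuul's real classes.

class FakeUpdateTask:
    def __init__(self, project, success=False, transient_error=False):
        self.project = project
        self.success = success
        # Set by the update thread when it catches e.g. BrokenProcessPool.
        self.transient_error = transient_error

def handle_update_result(task, send_aborted, fail):
    """Decide what to do after an update task completes."""
    if task.success:
        return 'ok'
    if task.transient_error:
        # Transient failure (e.g. OOM-killed pool worker): abort the
        # build so it gets retried instead of reported as ERROR.
        send_aborted()
        return 'aborted'
    # Non-transient failure: permanent ERROR result.
    fail("Failed to update project %s" % task.project)
    return 'error'
```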
fungi | or is the problem that the ExecutorError is masking the BrokenProcessPool there? | 00:12 |
corvus | it's not masking it since they're in different threads | 00:12 |
fungi | okay, i think i get it | 00:13 |
fungi | but we can apparently raise ExecutorError and BrokenProcessPool in the execute thread too | 00:13 |
corvus | fungi: but -- if we did transfer the exception between threads, then yeah, that would be caught by https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L948 | 00:13 |
corvus | yeah, i'm not sure where else it could come from | 00:14 |
fungi | or could it be that the exception handling at 948 and 954 are dead code because those aren't raised in that thread? | 00:15 |
corvus | could be | 00:16 |
corvus | well not 954, that's definitely live | 00:16 |
corvus | but 948 could be dead | 00:16 |
fungi | wondering if someone added the one at 948 as an attempt to deal with this exact condition | 00:16 |
corvus | (ExecutorError has subclasses which i know are raised) | 00:17 |
corvus | worth a look in the history | 00:17 |
fungi | added a year ago by https://review.opendev.org/709307 | 00:18 |
fungi | and yeah the commit message describes exactly what i think we experienced | 00:18 |
fungi | oom killer sniped the python process, leading to a broken process pool | 00:19 |
corvus | oh nope, it's not dead code. https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L1095 | 00:20 |
corvus | that will hit the abort path | 00:20 |
clarkb | ya it's only used by merger paths but apparently there is at least one outside of the separate thread | 00:22 |
clarkb | what are state_items? | 00:22 |
corvus | fungi: so yeah, i think we're back at the same place; the existing code isn't defective, it's just not as nuanced as you'd like, and i think clarkb's suggestion is the best way to add that. | 00:22 |
fungi | cool, i'll give it a shot. thanks! | 00:23 |
corvus | clarkb: yeesh, that could sure use a comment :) i believe they are queue items which are present only for the purpose of fixing repo state, but they don't have associated changes. i think if a project is in required-projects but there isn't a change ahead in that project, then a synthetic item will appear there so that all of the jobs running for a change use the same repo state for all the repos. | 00:26 |
openstackgerrit | Merged zuul/zuul master: Delay getting many files to mergers https://review.opendev.org/c/zuul/zuul/+/756837 | 00:37 |
openstackgerrit | Merged zuul/zuul master: Defer setting build pause to event queue https://review.opendev.org/c/zuul/zuul/+/712939 | 00:37 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Update zoned executor stats https://review.opendev.org/c/zuul/zuul/+/777852 | 01:11 |
corvus | clarkb, fungi: can you check my comment on https://review.opendev.org/761756 ? | 01:17 |
clarkb | corvus: I think that is correct, basically every downstream has to manually update to get new tag state (either removal or move) | 01:23 |
clarkb | note that git fetch --prune-tags will not help with a moved tag if I read the manpage correctly | 01:24 |
clarkb | it will also remove any local tags that were never on the remote (shouldn't be an issue for zuul) | 01:24 |
corvus | moved tag? oh | 01:24 |
corvus | oy | 01:24 |
corvus | like same name different commit | 01:25 |
clarkb | correct | 01:25 |
clarkb | zuul will continue to see the wrong value in that case I think (basically changing tags or removing them is bad) | 01:25 |
corvus | i guess the downside to implementing this change is that it might lead people to believe that this works when it really doesn't | 01:25 |
clarkb | ya, also I think it may be worth having a test of these cases | 01:26 |
clarkb | because it's tricky stuff and if we've got a test we can at least assert zuul does what we expect it to | 01:26 |
corvus | ie, you delete a tag, the next job prunes it on executor 01, then you add it back with a new value, and it runs on executor02 and silently does the wrong thing | 01:26 |
clarkb | yup, or even just force push the new tag value so there is no intermediate delete | 01:27 |
clarkb | (I think that is allowed when pushing) | 01:27 |
clarkb | *force push the same tag name but with new commit value | 01:27 |
corvus | so basically, nothing good ever happens with deleting or changing tags; and there's an argument to be made that the status quo at least breaks in a way that causes people to have to do work | 01:28 |
corvus | otoh, a force-push tag update is not going to trigger any zuul breakage | 01:28 |
corvus | so an argument could be made that this is an incremental improvement so we can at least handle 1 of the 3 possibilities semi-transparently. | 01:29 |
clarkb | zuul would see the force push of the tag change as a refs/tags/ update and rerun jobs against the old commit | 01:29 |
clarkb | and ya this should handle deletes just fine as long as all executors/mergers see the delete eventually | 01:29 |
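The moved-tag behavior is easy to demonstrate with plain git, independent of Zuul (a sketch using temporary repos; nothing here is Zuul code): a default fetch refuses to move an existing local tag, and `--prune-tags` doesn't help because the tag still exists upstream under the same name.

```python
import subprocess
import tempfile

def run(args, cwd, check=True):
    # Small helper so the demo reads cleanly.
    return subprocess.run(args, cwd=cwd, check=check,
                          capture_output=True, text=True)

ident = ['-c', 'user.email=a@b', '-c', 'user.name=a']
work = tempfile.mkdtemp()

# Upstream repo with tag v1 on the first commit
run(['git', 'init', '-q', 'upstream'], work)
up = work + '/upstream'
run(['git', *ident, 'commit', '-q', '--allow-empty', '-m', 'one'], up)
run(['git', 'tag', 'v1'], up)

run(['git', 'clone', '-q', up, 'clone'], work)
clone = work + '/clone'

# Move v1 upstream to a new commit (same name, different sha)
run(['git', *ident, 'commit', '-q', '--allow-empty', '-m', 'two'], up)
run(['git', 'tag', '-f', 'v1'], up)
new = run(['git', 'rev-parse', 'v1'], up).stdout.strip()

# A default fetch rejects the moved tag ("would clobber existing tag")
run(['git', 'fetch', 'origin'], clone, check=False)
stale = run(['git', 'rev-parse', 'v1'], clone).stdout.strip()
assert stale != new

# --prune-tags doesn't help: v1 still exists upstream under the same
# name, so it isn't pruned, and the non-forced update is still rejected
run(['git', 'fetch', '--prune', '--prune-tags', 'origin'],
    clone, check=False)
assert run(['git', 'rev-parse', 'v1'], clone).stdout.strip() != new
print('moved tag stays stale without --force')
```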
corvus | i guess i'm on the fence on this; still slightly leaning toward saying "yes". cause it probably doesn't make things any worse, even though it mostly doesn't make them better. | 01:30 |
clarkb | I guess the only race there is if a new tag with the same name is pushed after a delete, because any other event would trigger the pruning | 01:30 |
corvus | yeah | 01:30 |
corvus | clarkb: you want to leave some notes on that change? (or if you need to eod, i can) | 01:30 |
clarkb | I can do it | 01:30 |
corvus | kk, thx | 01:31 |
clarkb | corvus: you may want to double check what I wrote, but its posted now | 01:33 |
clarkb | but now I do need to eod, I smell food | 01:37 |
openstackgerrit | Merged zuul/zuul master: Enhance logging of status_get requests https://review.opendev.org/c/zuul/zuul/+/771669 | 02:24 |
tobiash | corvus: I've responded on 767084 | 07:47 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Allow refreshing volatile data in canMerge check https://review.opendev.org/c/zuul/zuul/+/767084 | 08:14 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Check cycle items are mergeable before reporting https://review.opendev.org/c/zuul/zuul/+/743450 | 08:14 |
tobiash | corvus: I changed 767084 so that it does what we need but avoids updating the changes completely ^ | 08:15 |
corvus | tobiash: can you see my reply on https://review.opendev.org/743450 ? i'm still confused | 14:25 |
tobiash | corvus: I think I was wrong, generally an event is needed (e.g. gerrit). Only in the github case the event is not needed. | 14:28 |
tobiash | because of the lacking events by github | 14:29 |
tobiash | I confused this with the parent change | 14:29 |
tobiash | generally this change was intended to work also without the parent change | 14:30 |
tobiash | maybe it's easier to understand if we'd switch the order of those two | 14:30 |
corvus | tobiash: yeah, i think it is closely related to the parent | 14:30 |
tobiash | actually 743450 was older than the parent which has been added in between at some point in time to fix the lack of github events | 14:32 |
corvus | tobiash: i guess my thought is that the point of the late canMerge check is to catch the case where github has changed but our model of it hasn't, which is most likely to happen if github changes something without emitting an event (though there are other ways like delayed event processing). so that's the new test case we should add. | 14:32 |
corvus | i certainly won't argue with adding 2 test cases :) but i still think if we had to pick one, it would be the "did not get event" case | 14:32 |
tobiash | I agree, the 'did not get event' case covers both cases | 14:34 |
tobiash | so yes, we could have either both cases or just the 'did not get event' case | 14:35 |
corvus | tobiash: cool | 14:35 |
tobiash | corvus: in that case I'd change the 'make it draft but don't send event' to 'add some additional required check to branch protection and don't send event' | 14:36 |
tobiash | that fits the real world better since in that case github indeed doesn't send an event and the parent change gets automatically tested | 14:37 |
tobiash | oh well, draft is part of the refetching | 14:37 |
tobiash | so I think just removing the event will do fine | 14:38 |
corvus | i love -1 line fixes | 14:38 |
tobiash | :) | 14:38 |
tobiash | corvus: I've replied with a summary of what we just discussed | 14:42 |
corvus | tobiash: lgtm. do you want you or i to update that or wait for swest_ to chime in? | 14:43 |
tobiash | already on it | 14:43 |
corvus | k | 14:44 |
tobiash | test case is green locally without the event :) | 14:45 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Check cycle items are mergeable before reporting https://review.opendev.org/c/zuul/zuul/+/743450 | 14:45 |
tobiash | corvus: 777843 has a small typo and a minor linting issue, lgtm otherwise | 14:49 |
corvus | tobiash: ah yep, i have a few -1Vs to go through :) | 14:51 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Update zoned executor stats https://review.opendev.org/c/zuul/zuul/+/777852 | 14:53 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Support enqueuing behind circular dependencies https://review.opendev.org/c/zuul/zuul/+/777843 | 14:54 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Use ZooKeeper TLS in tests https://review.opendev.org/c/zuul/zuul/+/777489 | 14:59 |
tobiash | corvus: I think we should rebase 673840 (allow unzoned option) on 777852 since it would be affected by it | 14:59 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Emit config warnings if shared queues on pipelines https://review.opendev.org/c/zuul/zuul/+/777479 | 15:00 |
corvus | tobiash: oh that's going to make the 'online' calculation difficult | 15:02 |
tobiash | corvus: I think I'd do the same with online then (register unzoned and zoned online function) | 15:03 |
corvus | since that's an online zoned executor, and online unzoned executor -- but for counting "all" is it 1 or 2 executors? :) | 15:04 |
tobiash | corvus: maybe we should remove the 'all' then | 15:04 |
corvus | my intuition says it's still just 1 executor for "all" which means we need to detect that the zoned and unzoned are the same | 15:05 |
tobiash | if we want the 'all' to be correct, we could add a uuid to the online function and deduplicate then | 15:05 |
corvus | tobiash: yeah (though i'd go with hostname, we already do that for other gearman functions) | 15:06 |
corvus | ("hostname" -- it's really more than the hostname and more like a uuid) | 15:06 |
corvus | (that just happens to have the hostname in it :) | 15:06 |
tobiash | k | 15:08 |
corvus | tobiash: it's generally not too hard to make dashboards that sum gauges -- so we could remove all and someone who wants that can sum(zoned, unzoned). they'd get 2 as the answer in that case though. so if we want to be able to have a dashboard with 1 executor in the dual zoned/unzoned case, then i think our only option is to keep all and do that extra work with the hostname | 15:08 |
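The dedupe-by-hostname idea can be sketched in a few lines (the gearman function-name format here is an assumption for illustration, not Zuul's confirmed naming): count distinct hosts, not distinct registrations, when computing "all".

```python
def executor_counts(registrations):
    """registrations: iterable of (hostname, function_name) pairs.

    Assumed naming for the sketch: a zoned executor registers
    'executor:execute:<zone>', an unzoned one 'executor:execute'.
    A host may register both, but should count once in 'all'.
    """
    zoned = {h for h, f in registrations
             if f.startswith('executor:execute:')}
    unzoned = {h for h, f in registrations
               if f == 'executor:execute'}
    return {
        'zoned': len(zoned),
        'unzoned': len(unzoned),
        'all': len(zoned | unzoned),  # dedupe by hostname
    }
```

Summing the zoned and unzoned gauges in a dashboard would count a dual-registered host twice; the set union is what keeps "all" at one per host.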
tobiash | speaking for us we just list online/accepting executors in different graphs per zone plus unzoned so we don't sum them | 15:09 |
tobiash | but I can imagine that someone might want to keep the 'all' category as well | 15:10 |
corvus | tobiash: yeah. i agree that zoned / unzoned is more useful from an ops pov. | 15:11 |
corvus | doesn't matter if you have 5 executors running if all of them are zoned and you need an unzoned one. | 15:11 |
tobiash | yeah | 15:12 |
corvus | tobiash: then maybe we should eliminate the "all" stats as well as the "unzoned" stats hierarchy, make the top level mean unzoned, but keep the zoned ones under "zoned" to avoid name collisions? | 15:14 |
corvus | that sort of forces users to use that view of the system, so they don't inadvertently create a dashboard that says everything's okay | 15:15 |
tobiash | from hierarchy data model long term I think zoned/unzoned should be the same level | 15:16 |
tobiash | like unzoned is just another or the default zone | 15:16 |
corvus | tobiash: that would be a breaking change for existing deployments | 15:16 |
tobiash | yeah, so maybe we could do both for a period of time and deprecate the old one | 15:17 |
corvus | yeah | 15:17 |
tobiash | and remove the old maybe with 5.0 | 15:17 |
corvus | okay, i left a note on the unzoned option change, and i'm leaving a note on the stats change now; i'll rework the stats change in a little bit | 15:18 |
tobiash | k | 15:18 |
corvus | tobiash: ok comment left on https://review.opendev.org/777852 i think that covers it | 15:20 |
tobiash | lgtm | 15:23 |
sassyn | Hi everyone | 19:21 |
sassyn | thank u avass | 19:21 |
sassyn | thank u also tobiash | 19:21 |
sassyn | The issue I'm having is still there. Commit code to gerrit, go to zuul console, open the gate job, I see the commit. Clicking on the job shows, for example: 16:21 Job console starting... and the next line is 16:31 Running Ansible setup... | 19:24 |
sassyn | 10 min waiting time for nothing. This is the only job i have in my system | 19:25 |
avass | sassyn: that's probably the time it takes for the executor to clone/checkout any repos it needs to run the jobs. are the repos large or are you running jobs with a lot of repos? | 19:46 |
avass | sassyn: in any way I don't think it's "nothing" it's just not showing what's actually happening during that time in a transparent way :) | 19:48 |
tobiash | sassyn: you may want to have a look into the executor logs to see what it's doing | 20:07 |
tobiash | sassyn: how large are the repos involved in that job and what storage type is it using? | 20:08 |
tobiash | Also the bandwidth to gerrit may matter | 20:08 |
avass | tobiash: would it be possible to log something like "Setting up workspace" or do the executor build logs only start when the job actually starts? | 20:09 |
tobiash | sassyn: also make sure that the git_dir and the job_dir are on the same volume (https://zuul-ci.org/docs/zuul/discussion/components.html#attr-executor.git_dir) | 20:10 |
tobiash | having those on separate mount points will dramatically slow down the job startup times | 20:10 |
tobiash | because this prevents hard linking during cloning into the job workspace | 20:11 |
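The hard-link effect is plain git behavior, not Zuul-specific, and easy to verify (a sketch using a throwaway repo): a local clone on the same filesystem hard-links files under `.git/objects`, so the source object's link count rises above 1; a cross-device clone has to copy instead.

```python
import os
import subprocess
import tempfile

work = tempfile.mkdtemp()
subprocess.run(['git', 'init', '-q', 'repo'], cwd=work, check=True)
repo = os.path.join(work, 'repo')
subprocess.run(['git', '-c', 'user.email=a@b', '-c', 'user.name=a',
                'commit', '-q', '--allow-empty', '-m', 'init'],
               cwd=repo, check=True)

# Same-filesystem clone: git hard-links the object files
subprocess.run(['git', 'clone', '-q', repo, 'clone'],
               cwd=work, check=True)

objdir = os.path.join(repo, '.git', 'objects')
links = [os.stat(os.path.join(root, f)).st_nlink
         for root, _, files in os.walk(objdir) for f in files]
print(max(links))  # 2: each loose object is shared with the clone
```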
tobiash | avass: it's writing this when it creates the job dir: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L568 | 20:13 |
avass | tobiash: ah | 20:14 |
tobiash | avass: which is right before fetching and preparing the repos | 20:14 |
avass | maybe updating that message and adding a "setting up workspace" then so users know that it's working on something :) | 20:15 |
*** ikhan has joined #zuul | 20:16 | |
tobiash | avass: I think it should be easy to add further messages there | 20:16 |
tobiash | the first could be something like updating repos | 20:17 |
tobiash | and here we could add something like preparing workspace: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L1061 | 20:17 |
avass | thanks. looking at that in a moment. I think it would be good to make that a bit more transparent to the user | 20:20 |
tobiash | fungi, corvus: regarding the processpool topic. I haven't read the full conversation so just ignore me if that's superfluous. I don't think we should do a _send_aborted in this case since this sets the build result to ABORTED, which should only be done if the scheduler requested the abort. If you want to trigger a retry, a build result of None is the right thing, like for a pre failure or in the unreachable case: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L1431 | 20:22 |
fungi | tobiash: does that mean you feel like the current behavior at https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L948-L953 is also wrong? | 20:26 |
tobiash | fungi: I mean the job result | 20:33 |
tobiash | fungi: ah yeah, I think exactly that's where you want to change the behavior | 20:34 |
fungi | tobiash: i'm not sure i follow. _send_aborted() causes the build to be aborted and retried, right? at least that's what the code comment says | 20:35 |
tobiash | I think the code mismatches the comment | 20:35 |
tobiash | send_aborted sends a job result of ABORTED back to the scheduler | 20:36 |
tobiash | a retry needs to be triggered exactly like https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L955-L958 just with result=None instead | 20:37 |
fungi | interestingly, what we observed was that the result reported for the build was ERROR with detail "Error: Failed to update project None" | 20:37 |
tobiash | a result of None will tell the scheduler to retry | 20:37 |
fungi | we didn't get an ABORTED result reported | 20:38 |
fungi | https://zuul.opendev.org/t/openstack/build/999f096bccbe43c286b7c6ed3c5deeb5 | 20:38 |
tobiash | fungi: can you dig up the stack trace in the executor log? | 20:38 |
fungi | tobiash: yep, this was what i found http://paste.openstack.org/show/803050/ | 20:39 |
tobiash | need to check if we re-raise that exception | 20:39 |
fungi | as you guessed earlier, it was the result of the process being terminated by the kernel's oom killer | 20:39 |
tobiash | yeah, that's pretty much always the case, either oom or a crashing python interpreter | 20:40 |
tobiash | and oom is much more likely these days | 20:40 |
tobiash | fungi: ok, so your stack trace shows that the exception is caught here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L2962 | 20:42 |
tobiash | and the exception information is not passed to the caller | 20:42 |
tobiash | which is why you end up here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L955 | 20:43 |
tobiash | with result error | 20:43 |
tobiash | what we could do is to attach some error information to the task here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L2963 | 20:44 |
tobiash | and evaluate them here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L1046 | 20:44 |
tobiash | e.g. if the error is transient return None there | 20:45 |
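That variant can be sketched as follows (hypothetical names throughout; only `BrokenProcessPool` is real, from `concurrent.futures.process`): the worker records why the update failed on the task, and the check that currently raises ExecutorError returns None for transient failures, which per the discussion above tells the scheduler to retry.

```python
from concurrent.futures.process import BrokenProcessPool

class FakeTask:
    """Stand-in for Zuul's update task; attribute names illustrative."""
    def __init__(self):
        self.success = False
        self.transient_error = False

def record_failure(task, exc):
    # Where the worker catches the exception today, also note its kind.
    task.success = False
    task.transient_error = isinstance(exc, BrokenProcessPool)

def build_result(task, project):
    if task.success:
        return 'SUCCESS'
    if task.transient_error:
        return None  # None asks the scheduler to retry the build
    # Non-transient: keep today's permanent-error behavior
    raise RuntimeError("Failed to update project %s" % project)
```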
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Retry job on broken process pool https://review.opendev.org/c/zuul/zuul/+/777874 | 20:48 |
tobiash | fungi: like ^ | 20:48 |
fungi | aha, thanks, think that's basically what we discussed at http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2021-02-27.log.html#t2021-02-27T00:11:50 | 20:55 |
tobiash | oh wait, that misses something | 20:55 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Retry job on broken process pool https://review.opendev.org/c/zuul/zuul/+/777874 | 20:59 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Retry job on broken process pool https://review.opendev.org/c/zuul/zuul/+/777874 | 21:02 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Retry job on failed repo updates https://review.opendev.org/c/zuul/zuul/+/777875 | 21:03 |
tobiash | fungi, corvus: as an alternative we could also retry jobs on every update failure ^ | 21:03 |
fungi | i guess i'd be slightly more worried about the possibility for some sorts of update failures to be persistent, leading to retry loops. but we do at least still have a retry limit serving as a backstop in that case | 21:20 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!