ianw | clarkb: out of interest i ran a dnf update which pulled in firefox 89.0.2 and i agree that the console highlighting seems to work better on the current production with that | 00:07 |
ianw | also everything now has rounded corners | 00:08 |
fungi | for safety | 00:36 |
ianw | it also appears to have replaced a previous feature that i occasionally enjoyed, "playing any sound", with long timeouts and "firefox is not responding" windows | 00:50 |
*** marios is now known as marios|ruck | 05:12 |
opendevreview | Felix Edel proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 05:59 |
*** jpena|off is now known as jpena | 07:00 |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: Add a meta log upload role with a failover mechanism https://review.opendev.org/c/zuul/zuul-jobs/+/795336 | 08:15 |
*** bhagyashris_ is now known as bhagyashris | 09:15 |
opendevreview | Merged zuul/zuul master: Configure unique command socket path per scheduler https://review.opendev.org/c/zuul/zuul/+/771462 | 09:20 |
opendevreview | Merged zuul/zuul master: Refactor config/tenant (re-)loading https://review.opendev.org/c/zuul/zuul/+/795263 | 09:23 |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: Add a meta log upload role with a failover mechanism https://review.opendev.org/c/zuul/zuul-jobs/+/795336 | 09:46 |
opendevreview | Merged zuul/zuul master: Replace TreeCache in component registry https://review.opendev.org/c/zuul/zuul/+/796582 | 10:44 |
*** jpena is now known as jpena|lunch | 11:34 |
opendevreview | Felix Edel proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 11:38 |
*** jpena|lunch is now known as jpena | 12:36 |
pabelanger[m] | Is anyone using the jwt tokens to manage autoholds for jobs? | 12:54 |
pabelanger[m] | I'm wanting to set them up, but was curious to see if anyone else has | 12:54 |
pabelanger[m] | also, are there plans to expose that functionality in the web UI, if a token is found? | 12:54 |
tristanC | pabelanger[m]: yes we do, mostly to enable the qa team to debug their jobs without our assistance. | 13:07 |
tristanC | jwt is configured by default in sf-config with https://softwarefactory-project.io/cgit/software-factory/sf-config/tree/ansible/roles/sf-zuul/templates/zuul.conf.j2#n79 | 13:08 |
tristanC | then tokens can be issued with `zuul create-auth-token --auth-config zuul_operator --tenant <tenant> --user <audit-info> --expires-in 7776000` | 13:09 |
pabelanger[m] | so, each time you want QE to set up an autohold, you manually generate the token? | 13:12 |
pabelanger[m] | or maybe every 90 days | 13:15 |
pabelanger[m] | when it expires | 13:15 |
tristanC | pabelanger[m]: yes, those are per-user tokens | 13:16 |
pabelanger[m] | maybe I'll set up a demo with you downstream to see how it all works | 13:17 |
tristanC | pabelanger[m]: the documentation has been proposed in https://review.opendev.org/c/zuul/zuul/+/727785 | 13:20 |
tristanC | pabelanger[m]: along with https://zuul-ci.org/docs/zuul/reference/client.html#create-auth-token | 13:20 |
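For context, a minimal sketch of the kind of `[auth]` section the sf-config template linked above renders, following the example shape in the Zuul authentication docs; all values here are placeholders, not the production configuration:

```ini
[auth zuul_operator]
driver=HS256
allow_authz_override=true
realm=zuul.example.com
client_id=zuul.example.com
issuer_id=zuul_operator
secret=exampleSharedSecret
```

Tokens minted with the `zuul create-auth-token` invocation quoted earlier are signed against a section like this.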
pabelanger[m] | tristanC: do you have any logic for self-serve tokens? | 13:37 |
pabelanger[m] | would love to get another review on https://review.opendev.org/c/zuul/zuul/+/798131 | 13:48 |
pabelanger[m] | adds a few more check_run status for github | 13:48 |
opendevreview | Felix Edel proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 13:53 |
opendevreview | Simon Westphahl proposed zuul/zuul master: wip: Items/ahead behind wrong after re-enqueue https://review.opendev.org/c/zuul/zuul/+/798691 | 14:04 |
opendevreview | Simon Westphahl proposed zuul/zuul master: wip: Items ahead/behind wrong after re-enqueue https://review.opendev.org/c/zuul/zuul/+/798691 | 14:05 |
tristanC | pabelanger[m]: what do you mean by self-serve tokens? | 14:07 |
pabelanger[m] | tristanC: if a token expires, an infra-root needs to regenerate said token and give it to QE right? | 14:52 |
pabelanger[m] | Rather than the QE person generating it themselves? | 14:53 |
tristanC | pabelanger[m]: for that i think you need a service like keycloak to let the user authenticate themselves | 14:54 |
pabelanger[m] | ack | 14:55 |
pabelanger[m] | let me go and read up about it, see if I can hook it up to existing SSO at redhat | 14:55 |
apevec[m] | pabelanger: mhu has web UI WIP, lemme find topic | 15:22 |
apevec[m] | https://review.opendev.org/q/topic:"fffaff" | 15:26 |
*** jpena is now known as jpena|off | 15:59 |
*** marios|ruck is now known as marios|out | 16:17 |
clarkb | corvus: looks like https://review.opendev.org/c/zuul/zuul/+/774610 is an important zuul change? I'll try to review that between meetings this morning | 16:54 |
corvus | clarkb: yes, but don't review it yet; i'm currently rebasing it on the rest of the executor api stack (there are significant conflicts) | 16:57 |
corvus | clarkb: i think the next step is https://review.opendev.org/770902 | 16:57 |
clarkb | corvus: got it | 16:58 |
* corvus < https://matrix.org/_matrix/media/r0/download/matrix.org/IvpviwJtxhrFhEOilheDGlly/message.txt > | 16:58 |
corvus | i'm putting the lock/unlock onto the end of that | 16:58 |
clarkb | ok so start at the bottom of that stack and work forward | 16:59 |
clarkb | top of the list in your paste | 16:59 |
corvus | yep; and most of that stack is one change exploded out into hopefully smaller parts; so there's a certain amount of "introduce an api; make some changes to the api; then use the api" in that progression | 17:00 |
clarkb | great I'll start there | 17:00 |
opendevreview | Merged zuul/zuul master: Add skipped / neutral statuses to the github driver https://review.opendev.org/c/zuul/zuul/+/798131 | 17:12 |
opendevreview | James E. Blair proposed zuul/zuul master: Execute builds via ZooKeeper https://review.opendev.org/c/zuul/zuul/+/788988 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Move build request cleanup from executor to scheduler https://review.opendev.org/c/zuul/zuul/+/794687 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Handle errors in the executor main loop https://review.opendev.org/c/zuul/zuul/+/796583 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Shard BuildRequest parameters https://review.opendev.org/c/zuul/zuul/+/797149 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Compress sharded ZK data https://review.opendev.org/c/zuul/zuul/+/797156 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Switch to ZooKeeper backed merge result events https://review.opendev.org/c/zuul/zuul/+/784195 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Switch from global to tenant event queues https://review.opendev.org/c/zuul/zuul/+/797440 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Remove unused addResultEvent method https://review.opendev.org/c/zuul/zuul/+/797542 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 17:39 |
corvus | (i spotted an error in the middle of the stack, so i went ahead and fixed it) | 17:39 |
corvus | that was just a 2 line change in 788988; the rest is rebasing on that. i expect the stack is stable now and remains ready for review. | 17:43 |
tobiash[m] | corvus: regarding https://review.opendev.org/c/zuul/zuul/+/774610/21/zuul/executor/server.py#1371 could prepareVars already have an effect on the nodes? | 19:19 |
tobiash[m] | I wonder if we need to set the nodeset into use before prepareVars | 19:20 |
corvus | tobiash: yes to both | 19:25 |
opendevreview | James E. Blair proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 20:42 |
opendevreview | James E. Blair proposed zuul/nodepool master: Add requestor_data field to node request https://review.opendev.org/c/zuul/nodepool/+/798746 | 20:51 |
clarkb | corvus: I left some comments/questions/notes on https://review.opendev.org/c/zuul/zuul/+/770902 if you or felixedel or tobiash[m] get a chance to look those over that would be great | 20:52 |
clarkb | I suspect some may be clarified by the rest of the stack which I'm going to try and review as well | 20:52 |
clarkb | aha yup the next change answers my questions about "default-zone" | 20:54 |
corvus | clarkb: yeah, sorry about that -- it's not my favorite thing to have changes not self-contained like that, but it also sort of seemed like leaving changes like the default-zone in their own change made some things easier to understand | 20:55 |
clarkb | corvus: I'm through the first 5 in the stack now. Left some comments on a few of them. Would be good if you can check those | 21:31 |
clarkb | The change to execute jobs via zk is not small. I'm going to take a break here and do reviews of some other stuff to ensure they get attention today then come back to this | 21:32 |
corvus | clarkb: thanks! i'm already well into working on your comments :) | 21:36 |
opendevreview | James E. Blair proposed zuul/zuul master: Execute builds via ZooKeeper https://review.opendev.org/c/zuul/zuul/+/788988 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Move build request cleanup from executor to scheduler https://review.opendev.org/c/zuul/zuul/+/794687 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Handle errors in the executor main loop https://review.opendev.org/c/zuul/zuul/+/796583 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Shard BuildRequest parameters https://review.opendev.org/c/zuul/zuul/+/797149 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Compress sharded ZK data https://review.opendev.org/c/zuul/zuul/+/797156 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Switch to ZooKeeper backed merge result events https://review.opendev.org/c/zuul/zuul/+/784195 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Switch from global to tenant event queues https://review.opendev.org/c/zuul/zuul/+/797440 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Remove unused addResultEvent method https://review.opendev.org/c/zuul/zuul/+/797542 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Only call one build callback in executor api https://review.opendev.org/c/zuul/zuul/+/798750 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Remove exception handling in executor api isLocked https://review.opendev.org/c/zuul/zuul/+/798751 | 21:38 |
corvus | clarkb: i don't want to steal your attention, but would you mind looking at https://review.opendev.org/798750 and https://review.opendev.org/798751 ? they are small and should address your comments in the first change | 21:42 |
clarkb | sure I'm between changes right now | 21:43 |
clarkb | corvus: for https://review.opendev.org/c/zuul/zuul/+/798751/1/zuul/zk/executor.py does the path lostBuildRequests() takes (hunting for lost requests) not trigger the case where there is no lock node? | 21:48 |
clarkb | corvus: I think if those two functions weren't right next to each other I wouldn't have thought about that at all, but seeing them together it occurred to me that the lock may not exist (for whatever reason) and that is what lostBuildRequests() is looking for | 21:49 |
corvus | creating a Lock object creates a lock node | 21:49 |
clarkb | right but listBuildRequests() iterates over all BuildRequests and checks their lock state. If they have never been locked before (or have been unlocked and still exist) then we could hit that? | 21:49 |
corvus | and any parent nodes | 21:49 |
clarkb | this might be more test specific where we manipulate the requests and locks more directly, but I think the first test case that is added may even run over that (though I didn't look in the logs of the job to confirm) | 21:50 |
corvus | hit what? there is nothing in isLocked that could raise a NoNodeError under normal circumstances | 21:50 |
corvus | and by normal circumstances, i mean anything short of a malfunctioning connection or server | 21:51 |
clarkb | corvus: if a BuildRequest exists but has never been locked before then won't isLocked hit the NoNodeError() because a lock has never been created for that BuildRequest yet? | 21:51 |
clarkb | If BuildRequests and locks shared the same tree then this would probably be avoided but since they are in distinct trees I think we can hit this? | 21:51 |
corvus | clarkb: no -- because creating a kazoo.recipe.Lock() object causes the lock node and all parent nodes to be created | 21:52 |
clarkb | aha ok | 21:52 |
clarkb | then we check the contenders count to actually determine if it is locked (and it would be initialized to 0 in the case of no lock existing previously) | 21:53 |
corvus | so either ExecutorApi.lock() or ExecutorApi.isLocked() will cause the initial node to be created. | 21:53 |
corvus | clarkb: correct; contenders lists the children of the lock node | 21:53 |
corvus | (because contenders are ephemeral znodes under the lock node) | 21:53 |
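A minimal standalone sketch (not Zuul's code) of the kazoo behaviour just described; the ZooKeeper address and lock path are assumptions for illustration:

```python
from kazoo.client import KazooClient

client = KazooClient(hosts="127.0.0.1:2181")  # assumes a local ZooKeeper
client.start()

# Instantiating the lock recipe ties it to a path; using it (acquire() or
# contenders()) ensures the lock znode and its parents exist, which is why
# an isLocked()-style check never sees NoNodeError in normal operation.
lock = client.Lock("/locks/builds/build-0001")  # hypothetical path

# Contenders are ephemeral child znodes created by would-be holders, so
# "locked" reduces to "does the lock node have any children?"
print(len(lock.contenders()) > 0)  # False: node exists, nobody holds it

client.stop()
```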
clarkb | now I understand why you never expect that exception to happen at all | 21:53 |
clarkb | in which case the new change is the correct cleanup | 21:54 |
corvus | (also, btw, i ran this through a unit test to double check, but didn't think it was necessary to add a test for it) | 21:55 |
clarkb | ya I think I was confused by accepting the old code as correct (even though I had called it out as being odd). I'm not sure you need a specific test for this. | 21:56 |
clarkb | The first test that is added does cover this as a side effect I think | 21:56 |
corvus | yeah, if isLocked could raise that error, then so could lock | 21:57 |
corvus | felixedel, tobiash: would you please (tomorrow) take a look at clarkb's comments on https://review.opendev.org/770902 and my changes in response: https://review.opendev.org/798750 and https://review.opendev.org/798751 | 22:03 |
clarkb | corvus: one thing I'm noticing as I pick up the execute-builds-via-zk change is that the tense on the BuildRequestEvent is maybe a bit odd. They are all past tense but we still have work to do to fulfill that action. A minor thing but may make it slightly more clear that we are actively canceling/resuming/deleting jobs if we name them cancel/resume/delete instead of | 23:11 |
clarkb | canceled/resumed/deleted | 23:11 |
corvus | clarkb: ++ | 23:12 |
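A hypothetical sketch of the rename clarkb suggests, with imperative names for events that request work still to be done; the member set shown is illustrative, not Zuul's exact enum:

```python
import enum

class BuildRequestEvent(enum.Enum):
    CANCEL = enum.auto()  # was CANCELED: the cancel is requested, not done
    RESUME = enum.auto()  # was RESUMED
    DELETE = enum.auto()  # was DELETED
```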
clarkb | corvus: in https://review.opendev.org/c/zuul/zuul/+/788988/29/zuul/executor/server.py#3438 completeBuild only seems to be called in the error or cancel case. Where do we mark the build request as completed under normal operation? | 23:16 |
clarkb | the function is completeBuild at the end of that file fwiw | 23:16 |
clarkb | hrm maybe the gerrit diff is simply hiding the call in the normal case since it hasn't changed? | 23:17 |
clarkb | the parameters to the call do change though so it should show up if that is the case | 23:18 |
corvus | clarkb: line 1315 | 23:25 |
clarkb | thanks | 23:25 |
clarkb | corvus: pauseBuild() resumeBuild() and completeBuild() all call self.executor_api.update() directly without explicitly holding the lock. In _runBuildWorker() we do that after grabbing the lock and explicitly make note of why it is safe to call update() directly and that we do it to reduce lock contention. Is it safe to do that in pause/resume/completeBuild() ? | 23:29 |
clarkb | I think pause and resume are safe because we do those off to the side in child nodes | 23:30 |
clarkb | but is complete? | 23:30 |
clarkb | also all three update the build request state | 23:30 |
corvus | clarkb: thinking | 23:33 |
corvus | clarkb: ah, i think the first thing to note is that this is a slightly evolved use of zk compared to what we've done elsewhere -- basically, the .update() call is designed to potentially fail. previously, we've wrapped ZK updates with locks: lock, write, unlock. but this is designed to use the zstat attribute (the zk state counter read previously) to cause the write to fail if someone else has written in the interim. so basically, when calling | 23:39 |
corvus | update() we always assume we're the only actor, and if we're wrong, update() will raise an exception. instead, the executor api lock means that an executor is running the build (it more or less only exists so the scheduler can detect if a build's executor crashes) | 23:39 |
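A minimal sketch of the optimistic-write pattern corvus describes, assuming a kazoo client and JSON-serialized znode data; the path and field names are hypothetical:

```python
import json

from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

client = KazooClient(hosts="127.0.0.1:2181")
client.start()

path = "/zuul/build-requests/build-0001"  # hypothetical path
data, zstat = client.get(path)            # zstat carries the znode version
request = json.loads(data)
request["state"] = "paused"

try:
    # Passing version= turns set() into a compare-and-set: it raises
    # BadVersionError if anyone wrote the znode after our get() above.
    client.set(path, json.dumps(request).encode(), version=zstat.version)
except BadVersionError:
    # Someone else updated the request in the interim; under the scheme
    # above this "shouldn't happen" while this executor holds the lock.
    raise

client.stop()
```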
clarkb | Got it, the lock here is much higher level than a write mutex | 23:40 |
clarkb | more of a signal flare than a mutex. What happens when the zstat attribute is larger than expected and we try to resume a build? do we try again? | 23:41 |
corvus | clarkb: yep. then, secondarily, i think pauseBuild, resumeBuild, and completeBuild will all actually be called by the executor while it has the lock, so no other executor is going to write to it anyway; the scheduler might, however, if it were to forcibly remove the build; exceptions in that case would be okay i think | 23:42 |
corvus | clarkb: that case seems potentially problematic; perhaps we should set the threading event before calling executorapi.update ? | 23:44 |
clarkb | where does the retry occur? for example when pausing the pause() method calls executor_server.pauseBuild() which calls executor_api.update() then pause() waits on the resume event to be set | 23:44 |
corvus | clarkb: (basically swap the last 2 lines of resume()) | 23:44 |
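A rough sketch (all class and attribute names hypothetical, not Zuul's actual code) of the two-line swap corvus suggests: set the threading event before the ZooKeeper write so a failed update() cannot strand a paused build:

```python
import threading

class JobWorker:
    """Hypothetical stand-in for the executor's per-build worker."""

    def __init__(self, executor_api, build_request):
        self.executor_api = executor_api
        self.build_request = build_request
        self.resume_event = threading.Event()

    def resume(self):
        # Wake the paused job first; if update() then raises (e.g. a zstat
        # mismatch), the build still resumes instead of waiting forever.
        self.resume_event.set()
        self.executor_api.update(self.build_request)
```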
clarkb | I think the only place we could retry and have that work is in update()? | 23:44 |
corvus | clarkb: no retry, our irc messages raced | 23:44 |
clarkb | corvus: ok /me rereads | 23:45 |
clarkb | aha we assume it will work because we're the only high level signal (lock) holder | 23:45 |
corvus | clarkb: "yep, then secondarily" was a continuation of the first thought; "that case seems" is the beginning of the reply to "What happens when the zstat" | 23:46 |
clarkb | I'm following now | 23:46 |
corvus | clarkb: yeah, but since you mention it, i'm concerned with what could happen if the scheduler deletes the build request immediately after unpausing; i think that could trip your edge case | 23:47 |
corvus | it's not exactly the same thing, because it's not going to be a zstat error | 23:47 |
corvus | but basically, since looking at that bit of code, i see the potential for failing to resume | 23:48 |
clarkb | and the reason would be that we would raise in resume() prior to calling set() on the resume event | 23:48 |
corvus | oh, actually that case is taken care of | 23:48 |
corvus | we specifically handle the request not being found during the resume | 23:48 |
clarkb | is that what the BuildRequestNotFound exception handling in resumeBuild is handling? | 23:48 |
clarkb | yup ok | 23:48 |
corvus | yeah | 23:49 |
corvus | okay, so i think i'm not seeing a way to hit that case | 23:49 |
clarkb | and that is distinct to zstat being too new as it doesn't exist at all | 23:49 |
corvus | right | 23:49 |
corvus | basically, the zstat change detection shouldn't fail, because we hold the lock and no one else should be updating it. but if someone did, we would raise an exception, and in the case of resuming a build, it probably wouldn't resume. | 23:50 |
corvus | it doesn't strike me as super robust, but also, it shouldn't happen | 23:51 |
clarkb | But for there to be a job to resume an executor must be running that job and to run that job it must hold the lock | 23:51 |
corvus | yeah | 23:51 |
clarkb | (similar case with pause) | 23:51 |
corvus | i think i'd be willing to give this a try, and if we find some way that it can happen, then we'd probably have to put in a periodic state check (like, on the executor, check every build in zk and see if it has a resume subnode). but if the event processing works as we expect, i don't think it should be an issue. | 23:52 |
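A hedged sketch of the periodic fallback corvus floats here; the znode layout, function, and parameter names are assumptions for illustration, not Zuul's actual schema:

```python
def check_pending_resumes(client, job_workers):
    """Scan this executor's builds and fire any resume that was missed.

    client: a connected kazoo KazooClient; job_workers: a mapping of build
    UUID to per-build worker objects like the JobWorker sketched earlier.
    """
    for build_uuid, worker in job_workers.items():
        # A "resume" child znode under the build request signals a resume.
        resume_path = f"/zuul/build-requests/{build_uuid}/resume"
        if client.exists(resume_path):
            worker.resume_event.set()
```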
clarkb | * for there to be a job to resume a unique executor must be running that job and it must hold the lock uniquely | 23:53 |
corvus | yep | 23:54 |
clarkb | great I'm going to finish up this review (going through the test changes now) but I think this is looking ok | 23:56 |
clarkb | have a couple of other minor notes but I've already asked about the big stuff that jumped out at me here | 23:56 |
corvus | cool, then if things look good to felixedel and tobiash tomorrow, we may be able to get these merged and resume speed on v5 :) | 23:58 |