ianw | clarkb: out of interest i ran a dnf update which pulled in firefox 89.0.2 and i agree that the console highlighting seems to work better on the current production with that | 00:07 |
ianw | also everything now has rounded corners | 00:08 |
fungi | for safety | 00:36 |
ianw | it also appears to have replaced a previous feature that i occasionally enjoyed, "playing any sound", with long timeouts and "firefox is not responding" windows | 00:50 |
*** marios is now known as marios|ruck | 05:12 |
opendevreview | Felix Edel proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 05:59 |
*** jpena|off is now known as jpena | 07:00 |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: Add a meta log upload role with a failover mechanism https://review.opendev.org/c/zuul/zuul-jobs/+/795336 | 08:15 |
*** bhagyashris_ is now known as bhagyashris | 09:15 |
opendevreview | Merged zuul/zuul master: Configure unique command socket path per scheduler https://review.opendev.org/c/zuul/zuul/+/771462 | 09:20 |
opendevreview | Merged zuul/zuul master: Refactor config/tenant (re-)loading https://review.opendev.org/c/zuul/zuul/+/795263 | 09:23 |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: Add a meta log upload role with a failover mechanism https://review.opendev.org/c/zuul/zuul-jobs/+/795336 | 09:46 |
opendevreview | Merged zuul/zuul master: Replace TreeCache in component registry https://review.opendev.org/c/zuul/zuul/+/796582 | 10:44 |
*** jpena is now known as jpena|lunch | 11:34 |
opendevreview | Felix Edel proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 11:38 |
*** jpena|lunch is now known as jpena | 12:36 |
pabelanger[m] | Is anyone using the jwt tokens to manage autoholds for jobs? | 12:54 |
pabelanger[m] | I'm wanting to set them up, but was curious to see if anyone else has | 12:54 |
pabelanger[m] | also, are there plans to expose that functionality in the web UI, if a token is found? | 12:54 |
tristanC | pabelanger[m]: yes we do, mostly to enable the qa team to debug their jobs without our assistance. | 13:07 |
tristanC | jwt is configured by default in sf-config with https://softwarefactory-project.io/cgit/software-factory/sf-config/tree/ansible/roles/sf-zuul/templates/zuul.conf.j2#n79 | 13:08 |
tristanC | then tokens can be issued with `zuul create-auth-token --auth-config zuul_operator --tenant <tenant> --user <audit-info> --expires-in 7776000` | 13:09 |
pabelanger[m] | so, each time you want QE to set up an autohold, you manually generate the token? | 13:12 |
pabelanger[m] | or maybe every 90 days | 13:15 |
pabelanger[m] | when it expires | 13:15 |
tristanC | pabelanger[m]: yes, those are per-user tokens | 13:16 |
pabelanger[m] | maybe I'll set up a demo with you downstream to see how it all works | 13:17 |
tristanC | pabelanger[m]: the documentation has been proposed in https://review.opendev.org/c/zuul/zuul/+/727785 | 13:20 |
tristanC | pabelanger[m]: along with https://zuul-ci.org/docs/zuul/reference/client.html#create-auth-token | 13:20 |
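For context, a minimal sketch of the kind of `[auth]` section the sf-config template linked above renders, following the example shape in the Zuul authentication docs; all values here are placeholders, not the production configuration:

```ini
[auth zuul_operator]
driver=HS256
allow_authz_override=true
realm=zuul.example.com
client_id=zuul.example.com
issuer_id=zuul_operator
secret=exampleSharedSecret
```

Tokens minted with the `zuul create-auth-token` invocation quoted earlier are signed against a section like this.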
pabelanger[m] | tristanC: do you have any logic for self-serve tokens? | 13:37 |
pabelanger[m] | would love to get another review on https://review.opendev.org/c/zuul/zuul/+/798131 | 13:48 |
pabelanger[m] | adds a few more check_run status for github | 13:48 |
opendevreview | Felix Edel proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 13:53 |
opendevreview | Simon Westphahl proposed zuul/zuul master: wip: Items/ahead behind wrong after re-enqueue https://review.opendev.org/c/zuul/zuul/+/798691 | 14:04 |
opendevreview | Simon Westphahl proposed zuul/zuul master: wip: Items ahead/behind wrong after re-enqueue https://review.opendev.org/c/zuul/zuul/+/798691 | 14:05 |
tristanC | pabelanger[m]: what do you mean by self-serve tokens? | 14:07 |
pabelanger[m] | tristanC: if a token expires, an infra-root needs to regenerate said token and give it to QE right? | 14:52 |
pabelanger[m] | Rather than the QE person generating it themselves? | 14:53 |
tristanC | pabelanger[m]: for that i think you need a service like keycloak to let the user authenticate themselves | 14:54 |
pabelanger[m] | ack | 14:55 |
pabelanger[m] | let me go and read up about it, see if I can hook it up to existing SSO at redhat | 14:55 |
apevec[m] | pabelanger: mhu has web UI WIP, lemme find topic | 15:22 |
apevec[m] | https://review.opendev.org/q/topic:"fffaff" | 15:26 |
*** jpena is now known as jpena|off | 15:59 |
*** marios|ruck is now known as marios|out | 16:17 |
clarkb | corvus: looks like https://review.opendev.org/c/zuul/zuul/+/774610 is an important zuul change? I'll try to review that between meetings this morning | 16:54 |
corvus | clarkb: yes, but don't review it yet; i'm currently rebasing it on the rest of the executor api stack (there are significant conflicts) | 16:57 |
corvus | clarkb: i think the next step is https://review.opendev.org/770902 | 16:57 |
clarkb | corvus: got it | 16:58 |
* corvus < https://matrix.org/_matrix/media/r0/download/matrix.org/IvpviwJtxhrFhEOilheDGlly/message.txt > | 16:58 |
corvus | i'm putting the lock/unlock onto the end of that | 16:58 |
clarkb | ok so start at the bottom of that stack and work forward | 16:59 |
clarkb | top of the list in your paste | 16:59 |
corvus | yep; and most of that stack is one change exploded out into hopefully smaller parts; so there's a certain amount of "introduce an api; make some changes to the api; then use the api" in that progression | 17:00 |
clarkb | great I'll start there | 17:00 |
opendevreview | Merged zuul/zuul master: Add skipped / neutral statuses to the github driver https://review.opendev.org/c/zuul/zuul/+/798131 | 17:12 |
opendevreview | James E. Blair proposed zuul/zuul master: Execute builds via ZooKeeper https://review.opendev.org/c/zuul/zuul/+/788988 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Move build request cleanup from executor to scheduler https://review.opendev.org/c/zuul/zuul/+/794687 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Handle errors in the executor main loop https://review.opendev.org/c/zuul/zuul/+/796583 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Shard BuildRequest parameters https://review.opendev.org/c/zuul/zuul/+/797149 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Compress sharded ZK data https://review.opendev.org/c/zuul/zuul/+/797156 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Switch to ZooKeeper backed merge result events https://review.opendev.org/c/zuul/zuul/+/784195 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Switch from global to tenant event queues https://review.opendev.org/c/zuul/zuul/+/797440 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Remove unused addResultEvent method https://review.opendev.org/c/zuul/zuul/+/797542 | 17:39 |
opendevreview | James E. Blair proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 17:39 |
corvus | (i spotted an error in the middle of the stack, so i went ahead and fixed it) | 17:39 |
corvus | that was just a 2 line change in 788988; the rest is rebasing on that. i expect the stack is stable now and remains ready for review. | 17:43 |
tobiash[m] | corvus: regarding https://review.opendev.org/c/zuul/zuul/+/774610/21/zuul/executor/server.py#1371 could prepareVars already have an effect on the nodes? | 19:19 |
tobiash[m] | I wonder if we need to set the nodeset into use before prepareVars | 19:20 |
corvus | tobiash: yes to both | 19:25 |
opendevreview | James E. Blair proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 20:42 |
opendevreview | James E. Blair proposed zuul/nodepool master: Add requestor_data field to node request https://review.opendev.org/c/zuul/nodepool/+/798746 | 20:51 |
clarkb | corvus: I left some comments/questions/notes on https://review.opendev.org/c/zuul/zuul/+/770902 if you or felixedel or tobiash[m] get a chance to look those over that would be great | 20:52 |
clarkb | I suspect some may be clarified by the rest of the stack which I'm going to try and review as well | 20:52 |
clarkb | aha yup the next change answers my questions about "default-zone" | 20:54 |
corvus | clarkb: yeah, sorry about that -- it's not my favorite thing to have changes not self-contained like that, but it also sort of seemed like leaving changes like the default-zone in their own change made some things easier to understand | 20:55 |
clarkb | corvus: I'm through the first 5 in the stack now. Left some comments on a few of them. Would be good if you can check those | 21:31 |
clarkb | The change to execute jobs via zk is not small. I'm going to take a break here and do reviews of some other stuff to ensure they get attention today then come back to this | 21:32 |
corvus | clarkb: thanks! i'm already well into working on your comments :) | 21:36 |
opendevreview | James E. Blair proposed zuul/zuul master: Execute builds via ZooKeeper https://review.opendev.org/c/zuul/zuul/+/788988 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Move build request cleanup from executor to scheduler https://review.opendev.org/c/zuul/zuul/+/794687 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Handle errors in the executor main loop https://review.opendev.org/c/zuul/zuul/+/796583 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Shard BuildRequest parameters https://review.opendev.org/c/zuul/zuul/+/797149 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Compress sharded ZK data https://review.opendev.org/c/zuul/zuul/+/797156 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Switch to ZooKeeper backed merge result events https://review.opendev.org/c/zuul/zuul/+/784195 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Switch from global to tenant event queues https://review.opendev.org/c/zuul/zuul/+/797440 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Remove unused addResultEvent method https://review.opendev.org/c/zuul/zuul/+/797542 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Lock/unlock nodes on executor server https://review.opendev.org/c/zuul/zuul/+/774610 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Only call one build callback in executor api https://review.opendev.org/c/zuul/zuul/+/798750 | 21:38 |
opendevreview | James E. Blair proposed zuul/zuul master: Remove exception handling in executor api isLocked https://review.opendev.org/c/zuul/zuul/+/798751 | 21:38 |
corvus | clarkb: i don't want to steal your attention, but would you mind looking at https://review.opendev.org/798750 and https://review.opendev.org/798751 ? they are small and should address your comments in the first change | 21:42 |
clarkb | sure I'm between changes right now | 21:43 |
clarkb | corvus: for https://review.opendev.org/c/zuul/zuul/+/798751/1/zuul/zk/executor.py does the path lostBuildRequests() takes (hunting for lost requests) not trigger the case where there is no lock node? | 21:48 |
clarkb | corvus: I think if those two functions weren't right next to each other I wouldn't have thought about that at all, but seeing them together it occurred to me that the lock may not exist (for whatever reason) and that is what lostBuildRequests() is looking for | 21:49 |
corvus | creating a Lock object creates a lock node | 21:49 |
clarkb | right but listBuildRequests() iterates over all BuildRequests and checks their lock state. If they have never been locked before (or have been unlocked and still exist) then we could hit that? | 21:49 |
corvus | and any parent nodes | 21:49 |
clarkb | this might be more test specific where we manipulate the requests and locks more directly, but I think the first test case that is added may even run over that (though I didn't look in the logs of the job to confirm) | 21:50 |
corvus | hit what? there is nothing in isLocked that could raise a NoNodeError under normal circumstances | 21:50 |
corvus | and by normal circumstances, i mean anything short of a malfunctioning connection or server | 21:51 |
clarkb | corvus: if a BuildRequest exists but has never been locked before then won't isLocked hit the NoNodeError() because a lock has never been created for that BuildRequest yet? | 21:51 |
clarkb | If BuildRequests and locks shared the same tree then this would probably be avoided but since they are in distinct trees I think we can hit this? | 21:51 |
corvus | clarkb: no -- because creating a kazoo.recipe.Lock() object causes the lock node and all parent nodes to be created | 21:52 |
clarkb | aha ok | 21:52 |
clarkb | then we check the contenders count to actually determine if it is locked (and it would be initialized to 0 in the case of no lock existing previously) | 21:53 |
corvus | so either ExecutorApi.lock() or ExecutorApi.isLocked() will cause the initial node to be created. | 21:53 |
corvus | clarkb: correct; contenders lists the children of the lock node | 21:53 |
corvus | (because contenders are ephemeral znodes under the lock node) | 21:53 |
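A minimal standalone sketch (not Zuul's code) of the kazoo behaviour just described; the ZooKeeper address and lock path are assumptions for illustration:

```python
from kazoo.client import KazooClient

client = KazooClient(hosts="127.0.0.1:2181")  # assumes a local ZooKeeper
client.start()

# Instantiating the lock recipe ties it to a path; using it (acquire() or
# contenders()) ensures the lock znode and its parents exist, which is why
# an isLocked()-style check never sees NoNodeError in normal operation.
lock = client.Lock("/locks/builds/build-0001")  # hypothetical path

# Contenders are ephemeral child znodes created by would-be holders, so
# "locked" reduces to "does the lock node have any children?"
print(len(lock.contenders()) > 0)  # False: node exists, nobody holds it

client.stop()
```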
clarkb | now I understand why you never expect that exception to happen at all | 21:53 |
clarkb | in which case the new change is the correct cleanup | 21:54 |
corvus | (also, btw, i ran this through a unit test to double check, but didn't think it was necessary to add a test for it) | 21:55 |
clarkb | ya I think I was confused by accepting the old code as correct (even though I had called it out as being odd). I'm not sure you need a specific test for this. | 21:56 |
clarkb | The first test that is added does cover this as a side effect I think | 21:56 |
corvus | yeah, if isLocked could raise that error, then so could lock | 21:57 |
corvus | felixedel, tobiash: would you please (tomorrow) take a look at clarkb's comments on https://review.opendev.org/770902 and my changes in response: https://review.opendev.org/798750 and https://review.opendev.org/798751 | 22:03 |
clarkb | corvus: one thing I'm noticing as I pick up the execute-builds-via-zk change is that the tense on the BuildRequestEvent is maybe a bit odd. They are all past tense but we still have work to do to fulfill that action. A minor thing but may make it slightly more clear that we are actively canceling/resuming/deleting jobs if we name them cancel/resume/delete instead of | 23:11 |
clarkb | canceled/resumed/deleted | 23:11 |
corvus | clarkb: ++ | 23:12 |
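A hypothetical sketch of the rename clarkb suggests, with imperative names for events that request work still to be done; the member set shown is illustrative, not Zuul's exact enum:

```python
import enum

class BuildRequestEvent(enum.Enum):
    CANCEL = enum.auto()  # was CANCELED: the cancel is requested, not done
    RESUME = enum.auto()  # was RESUMED
    DELETE = enum.auto()  # was DELETED
```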
clarkb | corvus: in https://review.opendev.org/c/zuul/zuul/+/788988/29/zuul/executor/server.py#3438 completeBuild only seems to be called in the error or cancel case. Where do we mark the build request as completed under normal operation? | 23:16 |
clarkb | the function is completeBuild at the end of that file fwiw | 23:16 |
clarkb | hrm maybe the gerrit diff is simply hiding the call in the normal case since it hasn't changed? | 23:17 |
clarkb | the parameters to the call do change though so it should show up if that is the case | 23:18 |
corvus | clarkb: line 1315 | 23:25 |
clarkb | thanks | 23:25 |
clarkb | corvus: pauseBuild() resumeBuild() and completeBuild() all call self.executor_api.update() directly without explicitly holding the lock. In _runBuildWorker() we do that after grabbing the lock and explicitly make note of why it is safe to call update() directly and that we do it to reduce lock contention. Is it safe to do that in pause/resume/completeBuild() ? | 23:29 |
clarkb | I think pause and resume are safe because we do those off to the side in child nodes | 23:30 |
clarkb | but is complete? | 23:30 |
clarkb | also all three update the build request state | 23:30 |
corvus | clarkb: thinking | 23:33 |
corvus | clarkb: ah, i think the first thing to note is that this is a slightly evolved use of zk compared to what we've done elsewhere -- basically, the .update() call is designed to potentially fail. previously, we've wrapped ZK updates with locks: lock, write, unlock. but this is designed to use the zstat attribute (the zk state counter read previously) to cause the write to fail if someone else has written in the interim. so basically, when calling | 23:39 |
corvus | update() we always assume we're the only actor, and if we're wrong, update() will raise an exception. instead, the executor api lock means that an executor is running the build (it more or less only exists so the scheduler can detect if a build's executor crashes) | 23:39 |
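A minimal sketch of the optimistic-write pattern corvus describes, assuming a kazoo client and JSON-serialized znode data; the path and field names are hypothetical:

```python
import json

from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

client = KazooClient(hosts="127.0.0.1:2181")
client.start()

path = "/zuul/build-requests/build-0001"  # hypothetical path
data, zstat = client.get(path)            # zstat carries the znode version
request = json.loads(data)
request["state"] = "paused"

try:
    # Passing version= turns set() into a compare-and-set: it raises
    # BadVersionError if anyone wrote the znode after our get() above.
    client.set(path, json.dumps(request).encode(), version=zstat.version)
except BadVersionError:
    # Someone else updated the request in the interim; under the scheme
    # above this "shouldn't happen" while this executor holds the lock.
    raise

client.stop()
```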
clarkb | Got it, the lock here is much higher level than a write mutex | 23:40 |
clarkb | more of a signal flare than a mutex. What happens when the zstat attribute is larger than expected and we try to resume a build? do we try again? | 23:41 |
corvus | clarkb: yep. then, secondarily, i think pauseBuild, resumeBuild, and completeBuild will all actually be called by the executor while it has the lock, so no other executor is going to write to it anyway; the scheduler might, however, if it were to forcibly remove the build; exceptions in that case would be okay i think | 23:42 |
corvus | clarkb: that case seems potentially problematic; perhaps we should set the threading event before calling executorapi.update ? | 23:44 |
clarkb | where does the retry occur? for example when pausing the pause() method calls executor_server.pauseBuild() which calls executor_api.update() then pause() waits on the resume event to be set | 23:44 |
corvus | clarkb: (basically swap the last 2 lines of resume()) | 23:44 |
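A rough sketch (all class and attribute names hypothetical, not Zuul's actual code) of the two-line swap corvus suggests: set the threading event before the ZooKeeper write so a failed update() cannot strand a paused build:

```python
import threading

class JobWorker:
    """Hypothetical stand-in for the executor's per-build worker."""

    def __init__(self, executor_api, build_request):
        self.executor_api = executor_api
        self.build_request = build_request
        self.resume_event = threading.Event()

    def resume(self):
        # Wake the paused job first; if update() then raises (e.g. a zstat
        # mismatch), the build still resumes instead of waiting forever.
        self.resume_event.set()
        self.executor_api.update(self.build_request)
```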
clarkb | I think the only place we could retry and have that work is in update()? | 23:44 |
corvus | clarkb: no retry, our irc messages raced | 23:44 |
clarkb | corvus: ok /me rereads | 23:45 |
clarkb | aha we assume it will work because we're the only high level signal (lock) holder | 23:45 |
corvus | clarkb: "yep, then secondarily" was a continuation of the first thought; "that case seems" is the beginning of the reply to "What happens when the zstat" | 23:46 |
clarkb | I'm following now | 23:46 |
corvus | clarkb: yeah, but since you mention it, i'm concerned with what could happen if the scheduler deletes the build request immediately after unpausing; i think that could trip your edge case | 23:47 |
corvus | it's not exactly the same thing, because it's not going to be a zstat error | 23:47 |
corvus | but basically, since looking at that bit of code, i see the potential for failing to resume | 23:48 |
clarkb | and the reason would be that we would raise in resume() prior to calling set() on the resume event | 23:48 |
corvus | oh, actually that case is taken care of | 23:48 |
corvus | we specifically handle the request not being found during the resume | 23:48 |
clarkb | is that what the BuildRequestNotFound exception handling in resumeBuild is handling? | 23:48 |
clarkb | yup ok | 23:48 |
corvus | yeah | 23:49 |
corvus | okay, so i think i'm not seeing a way to hit that case | 23:49 |
clarkb | and that is distinct to zstat being too new as it doesn't exist at all | 23:49 |
corvus | right | 23:49 |
corvus | basically, the zstat change detection shouldn't fail, because we hold the lock and no one else should be updating it. but if someone did, we would raise an exception, and in the case of resuming a build, it probably wouldn't resume. | 23:50 |
corvus | it doesn't strike me as super robust, but also, it shouldn't happen | 23:51 |
clarkb | But for there to be a job to resume an executor must be running that job and to run that job it must hold the lock | 23:51 |
corvus | yeah | 23:51 |
clarkb | (similar case with pause) | 23:51 |
corvus | i think i'd be willing to give this a try, and if we find some way that it can happen, then we'd probably have to put in a periodic state check (like, on the executor, check every build in zk and see if it has a resume subnode). but if the event processing works as we expect, i don't think it should be an issue. | 23:52 |
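A hedged sketch of the periodic fallback corvus floats here; the znode layout, function, and parameter names are assumptions for illustration, not Zuul's actual schema:

```python
def check_pending_resumes(client, job_workers):
    """Scan this executor's builds and fire any resume that was missed.

    client: a connected kazoo KazooClient; job_workers: a mapping of build
    UUID to per-build worker objects like the JobWorker sketched earlier.
    """
    for build_uuid, worker in job_workers.items():
        # A "resume" child znode under the build request signals a resume.
        resume_path = f"/zuul/build-requests/{build_uuid}/resume"
        if client.exists(resume_path):
            worker.resume_event.set()
```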
clarkb | * for there to be a job to resume a unique executor must be running that job and it must hold the lock uniquely | 23:53 |
corvus | yep | 23:54 |
clarkb | great I'm going to finish up this review (going through the test changes now) but I think this is looking ok | 23:56 |
clarkb | have a couple of other minor notes but I've already asked about the big stuff that jumped out at me here | 23:56 |
corvus | cool, then if things look good to felixedel and tobiash tomorrow, we may be able to get these merged and resume speed on v5 :) | 23:58 |