clarkb | fungi: it calls log = get_annotated_logger(self.log, task.zuul_event_id, build=task.build) | 00:00 |
clarkb | corvus: ya you could probably put task back on the queue | 00:01 |
clarkb | aborting from that separate thread would be a bit weird though I think. If we're going to abort we should maybe do it via _update() checking the result of innerUpdateLoop activity | 00:02 |
fungi | yeah, for reference this is how it winds up in the log: http://paste.openstack.org/show/803050/ | 00:02 |
fungi | i get lost quickly when trying to reason through cross-thread interactions | 00:03 |
corvus | oh, so the idea is to abort the build when that happens? | 00:03 |
fungi | we do abort under similar circumstances in AnsibleJob.execute() | 00:04 |
fungi | but if it's caught in ExecutorServer._innerUpdateLoop() the build returns ERROR instead | 00:05 |
fungi | https://zuul.opendev.org/t/openstack/build/999f096bccbe43c286b7c6ed3c5deeb5 | 00:06 |
fungi | "Error: Failed to update project None" | 00:06 |
corvus | fungi: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L1046 | 00:06 |
fungi | so that's where we should send abort? | 00:07 |
corvus | that's what translates the task failure into an ExecutorError exception, and ExecutorError means "non-transient" | 00:07 |
corvus | fungi: i think you're suggesting a nuance, in that a brokenprocesspool on the executor should be considered a transient error | 00:08 |
fungi | ahh | 00:08 |
clarkb | that could be implemented by updating UpdateTask to have a transient error vs non-transient error flag | 00:09 |
corvus | so you'll want to distinguish this condition from others, and do something other than ExecutorError. | 00:09 |
corvus | yeah like clarkb said | 00:09 |
clarkb | one tricky thing is that the regular merger server shares some of those code paths | 00:09 |
fungi | i was mostly wondering if it should exhibit behavior consistent with https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L948-L953 | 00:09 |
corvus | if you do clarkb's suggestion, then it'd probably look like that -- "if transient: self._send_aborted()" | 00:11 |
corvus | i think you could do that without even dealing with any new exceptions | 00:12 |
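A minimal sketch of the flag idea discussed above (class and function names here are hypothetical stand-ins, not Zuul's actual API): the update task records whether its failure was transient, and the code that currently raises ExecutorError instead aborts the build so the scheduler retries it.

```python
# Hypothetical sketch of the transient-error flag; names are
# illustrative and do not match Zuul's real classes.

class FakeUpdateTask:
    def __init__(self, project, success=False, transient_error=False):
        self.project = project
        self.success = success
        # Set by the update thread when it catches e.g. BrokenProcessPool.
        self.transient_error = transient_error

def handle_update_result(task, send_aborted, fail):
    """Decide what to do after an update task completes."""
    if task.success:
        return 'ok'
    if task.transient_error:
        # Transient failure (e.g. OOM-killed pool worker): abort the
        # build so it gets retried instead of reported as ERROR.
        send_aborted()
        return 'aborted'
    # Non-transient failure: permanent ERROR result.
    fail("Failed to update project %s" % task.project)
    return 'error'
```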
fungi | or is the problem that the ExecutorError is masking the BrokenProcessPool there? | 00:12 |
corvus | it's not masking it since they're in different threads | 00:12 |
fungi | okay, i think i get it | 00:13 |
fungi | but we can apparently raise ExecutorError and BrokenProcessPool in the execute thread too | 00:13 |
corvus | fungi: but -- if we did transfer the exception between threads, then yeah, that would be caught by https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L948 | 00:13 |
corvus | yeah, i'm not sure where else it could come from | 00:14 |
fungi | or could it be that the exception handling at 948 and 954 are dead code because those aren't raised in that thread? | 00:15 |
corvus | could be | 00:16 |
corvus | well not 954, that's definitely live | 00:16 |
corvus | but 948 could be dead | 00:16 |
fungi | wondering if someone added the one at 948 as an attempt to deal with this exact condition | 00:16 |
corvus | (ExecutorError has subclasses which i know are raised) | 00:17 |
corvus | worth a look in the history | 00:17 |
fungi | added a year ago by https://review.opendev.org/709307 | 00:18 |
fungi | and yeah the commit message describes exactly what i think we experienced | 00:18 |
fungi | oom killer sniped the python process, leading to a broken process pool | 00:19 |
corvus | oh nope, it's not dead code. https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L1095 | 00:20 |
corvus | that will hit the abort path | 00:20 |
clarkb | ya it's only used by merger paths but apparently there is at least one outside of the separate thread | 00:22 |
clarkb | what are state_items? | 00:22 |
corvus | fungi: so yeah, i think we're back at the same place; the existing code isn't defective, it's just not as nuanced as you'd like, and i think clarkb's suggestion is the best way to add that. | 00:22 |
fungi | cool, i'll give it a shot. thanks! | 00:23 |
corvus | clarkb: yeesh, that could sure use a comment :) i believe they are queue items which are present only for the purpose of fixing repo state, but they don't have associated changes. i think if a project is in required-projects but there isn't a change ahead in that project, then a synthetic item will appear there so that all of the jobs running for a change use the same repo state for all the repos. | 00:26 |
openstackgerrit | Merged zuul/zuul master: Delay getting many files to mergers https://review.opendev.org/c/zuul/zuul/+/756837 | 00:37 |
openstackgerrit | Merged zuul/zuul master: Defer setting build pause to event queue https://review.opendev.org/c/zuul/zuul/+/712939 | 00:37 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Update zoned executor stats https://review.opendev.org/c/zuul/zuul/+/777852 | 01:11 |
corvus | clarkb, fungi: can you check my comment on https://review.opendev.org/761756 ? | 01:17 |
clarkb | corvus: I think that is correct, basically every downstream has to manually update to get new tag state (either removal or move) | 01:23 |
clarkb | note that git fetch --prune-tags will not help with a moved tag if I read the manpage correctly | 01:24 |
clarkb | it will also remove any local tags that were never on the remote (shouldn't be an issue for zuul) | 01:24 |
corvus | moved tag? oh | 01:24 |
corvus | oy | 01:24 |
corvus | like same name different commit | 01:25 |
clarkb | correct | 01:25 |
clarkb | zuul will continue to see the wrong value in that case I think (basically changing tags or removing them is bad) | 01:25 |
corvus | i guess the downside to implementing this change is that it might lead people to believe that this works when it really doesn't | 01:25 |
clarkb | ya, also I think it may be worth having a test of these cases | 01:26 |
clarkb | because it's tricky stuff and if we've got a test we can at least assert zuul does what we expect it to | 01:26 |
corvus | ie, you delete a tag, the next job prunes it on executor 01, then you add it back with a new value, and it runs on executor02 and silently does the wrong thing | 01:26 |
clarkb | yup, or even just force push the new tag value so there is no intermediate delete | 01:27 |
clarkb | (I think that is allowed when pushing) | 01:27 |
clarkb | *force push the same tag name but with new commit value | 01:27 |
corvus | so basically, nothing good ever happens with deleting or changing tags; and there's an argument to be made that the status quo at least breaks in a way that causes people to have to do work | 01:28 |
corvus | otoh, a force-push tag update is not going to trigger any zuul breakage | 01:28 |
corvus | so an argument could be made that this is an incremental improvement so we can at least handle 1 of the 3 possibilities semi-transparently. | 01:29 |
clarkb | zuul would see the force push of the tag change as a refs/tags/ update and rerun jobs against the old commit | 01:29 |
clarkb | and ya this should handle deletes just fine as long as all executors/mergers see the delete eventually | 01:29 |
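The moved-tag behavior is easy to demonstrate with plain git, independent of Zuul (a sketch using temporary repos; nothing here is Zuul code): a default fetch refuses to move an existing local tag, and `--prune-tags` doesn't help because the tag still exists upstream under the same name.

```python
import subprocess
import tempfile

def run(args, cwd, check=True):
    # Small helper so the demo reads cleanly.
    return subprocess.run(args, cwd=cwd, check=check,
                          capture_output=True, text=True)

ident = ['-c', 'user.email=a@b', '-c', 'user.name=a']
work = tempfile.mkdtemp()

# Upstream repo with tag v1 on the first commit
run(['git', 'init', '-q', 'upstream'], work)
up = work + '/upstream'
run(['git', *ident, 'commit', '-q', '--allow-empty', '-m', 'one'], up)
run(['git', 'tag', 'v1'], up)

run(['git', 'clone', '-q', up, 'clone'], work)
clone = work + '/clone'

# Move v1 upstream to a new commit (same name, different sha)
run(['git', *ident, 'commit', '-q', '--allow-empty', '-m', 'two'], up)
run(['git', 'tag', '-f', 'v1'], up)
new = run(['git', 'rev-parse', 'v1'], up).stdout.strip()

# A default fetch rejects the moved tag ("would clobber existing tag")
run(['git', 'fetch', 'origin'], clone, check=False)
stale = run(['git', 'rev-parse', 'v1'], clone).stdout.strip()
assert stale != new

# --prune-tags doesn't help: v1 still exists upstream under the same
# name, so it isn't pruned, and the non-forced update is still rejected
run(['git', 'fetch', '--prune', '--prune-tags', 'origin'],
    clone, check=False)
assert run(['git', 'rev-parse', 'v1'], clone).stdout.strip() != new
print('moved tag stays stale without --force')
```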
corvus | i guess i'm on the fence on this; still slightly leaning toward saying "yes". cause it probably doesn't make things any worse, even though it mostly doesn't make them better. | 01:30 |
clarkb | I guess the only race there is if a new tag with the same name is pushed after a delete, because any other event would trigger the pruning | 01:30 |
corvus | yeah | 01:30 |
corvus | clarkb: you want to leave some notes on that change? (or if you need to eod, i can) | 01:30 |
clarkb | I can do it | 01:30 |
corvus | kk, thx | 01:31 |
clarkb | corvus: you may want to double check what I wrote, but its posted now | 01:33 |
clarkb | but now I do need to eod, I smell food | 01:37 |
openstackgerrit | Merged zuul/zuul master: Enhance logging of status_get requests https://review.opendev.org/c/zuul/zuul/+/771669 | 02:24 |
tobiash | corvus: I've responded on 767084 | 07:47 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Allow refreshing volatile data in canMerge check https://review.opendev.org/c/zuul/zuul/+/767084 | 08:14 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Check cycle items are mergeable before reporting https://review.opendev.org/c/zuul/zuul/+/743450 | 08:14 |
tobiash | corvus: I changed 767084 so that it does what we need but avoids updating the changes completely ^ | 08:15 |
corvus | tobiash: can you see my reply on https://review.opendev.org/743450 ? i'm still confused | 14:25 |
tobiash | corvus: I think I was wrong, generally an event is needed (e.g. gerrit). Only in the github case the event is not needed. | 14:28 |
tobiash | because of the lacking events by github | 14:29 |
tobiash | I confused this with the parent change | 14:29 |
tobiash | generally this change was intended to work also without the parent change | 14:30 |
tobiash | maybe it's easier to understand if we'd switch the order of those two | 14:30 |
corvus | tobiash: yeah, i think it is closely related to the parent | 14:30 |
tobiash | actually 743450 was older than the parent which has been added in between at some point in time to fix the lack of github events | 14:32 |
corvus | tobiash: i guess my thought is that the point of the late canMerge check is to catch the case where github has changed but our model of it hasn't, which is most likely to happen if github changes something without emitting an event (though there are other ways like delayed event processing). so that's the new test case we should add. | 14:32 |
corvus | i certainly won't argue with adding 2 test cases :) but i still think if we had to pick one, it would be the "did not get event" case | 14:32 |
tobiash | I agree, the 'did not get event' case covers both cases | 14:34 |
tobiash | so yes, we could have either both cases or just the 'did not get event' case | 14:35 |
corvus | tobiash: cool | 14:35 |
tobiash | corvus: in that case I'd change the 'make it draft but don't send event' to 'add some additional required check to branch protection and don't send event' | 14:36 |
tobiash | that fits the real world better since in that case github indeed doesn't send an event and the parent change gets automatically tested | 14:37 |
tobiash | oh well, draft is part of the refetching | 14:37 |
tobiash | so I think just removing the event will do fine | 14:38 |
corvus | i love -1 line fixes | 14:38 |
tobiash | :) | 14:38 |
tobiash | corvus: I've replied with a summary of what we just discussed | 14:42 |
corvus | tobiash: lgtm. do you want you or i to update that or wait for swest_ to chime in? | 14:43 |
tobiash | already on it | 14:43 |
corvus | k | 14:44 |
tobiash | test case is green locally without the event :) | 14:45 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Check cycle items are mergeable before reporting https://review.opendev.org/c/zuul/zuul/+/743450 | 14:45 |
tobiash | corvus: 777843 has a small typo and a minor linting issue, lgtm otherwise | 14:49 |
corvus | tobiash: ah yep, i have a few -1Vs to go through :) | 14:51 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Update zoned executor stats https://review.opendev.org/c/zuul/zuul/+/777852 | 14:53 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Support enqueuing behind circular dependencies https://review.opendev.org/c/zuul/zuul/+/777843 | 14:54 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Use ZooKeeper TLS in tests https://review.opendev.org/c/zuul/zuul/+/777489 | 14:59 |
tobiash | corvus: I think we should rebase 673840 (allow unzoned option) on 777852 since it would be affected by it | 14:59 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Emit config warnings if shared queues on pipelines https://review.opendev.org/c/zuul/zuul/+/777479 | 15:00 |
corvus | tobiash: oh that's going to make the 'online' calculation difficult | 15:02 |
tobiash | corvus: I think I'd do the same with online then (register unzoned and zoned online function) | 15:03 |
corvus | since that's an online zoned executor, and online unzoned executor -- but for counting "all" is it 1 or 2 executors? :) | 15:04 |
tobiash | corvus: maybe we should remove the 'all' then | 15:04 |
corvus | my intuition says it's still just 1 executor for "all" which means we need to detect that the zoned and unzoned are the same | 15:05 |
tobiash | if we want the 'all' to be correct, we could add a uuid to the online function and deduplicate then | 15:05 |
corvus | tobiash: yeah (though i'd go with hostname, we already do that for other gearman functions) | 15:06 |
corvus | ("hostname" -- it's really more than the hostname and more like a uuid) | 15:06 |
corvus | (that just happens to have the hostname in it :) | 15:06 |
tobiash | k | 15:08 |
corvus | tobiash: it's generally not too hard to make dashboards that sum gauges -- so we could remove all and someone who wants that can sum(zoned, unzoned). they'd get 2 as the answer in that case though. so if we want to be able to have a dashboard with 1 executor in the dual zoned/unzoned case, then i think our only option is to keep all and do that extra work with the hostname | 15:08 |
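The dedupe-by-hostname idea can be sketched in a few lines (the gearman function-name format here is an assumption for illustration, not Zuul's confirmed naming): count distinct hosts, not distinct registrations, when computing "all".

```python
def executor_counts(registrations):
    """registrations: iterable of (hostname, function_name) pairs.

    Assumed naming for the sketch: a zoned executor registers
    'executor:execute:<zone>', an unzoned one 'executor:execute'.
    A host may register both, but should count once in 'all'.
    """
    zoned = {h for h, f in registrations
             if f.startswith('executor:execute:')}
    unzoned = {h for h, f in registrations
               if f == 'executor:execute'}
    return {
        'zoned': len(zoned),
        'unzoned': len(unzoned),
        'all': len(zoned | unzoned),  # dedupe by hostname
    }
```

Summing the zoned and unzoned gauges in a dashboard would count a dual-registered host twice; the set union is what keeps "all" at one per host.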
tobiash | speaking for us we just list online/accepting executors in different graphs per zone plus unzoned so we don't sum them | 15:09 |
tobiash | but I can imagine that someone might want to keep the 'all' category as well | 15:10 |
corvus | tobiash: yeah. i agree that zoned / unzoned is more useful from an ops pov. | 15:11 |
corvus | doesn't matter if you have 5 executors running if all of them are zoned and you need an unzoned one. | 15:11 |
tobiash | yeah | 15:12 |
corvus | tobiash: then maybe we should eliminate the "all" stats as well as the "unzoned" stats hierarchy, make the top level mean unzoned, but keep the zoned ones under "zoned" to avoid name collisions? | 15:14 |
corvus | that sort of forces users to use that view of the system, so they don't inadvertently create a dashboard that says everything's okay | 15:15 |
tobiash | from hierarchy data model long term I think zoned/unzoned should be the same level | 15:16 |
tobiash | like unzoned is just another or the default zone | 15:16 |
corvus | tobiash: that would be a breaking change for existing deployments | 15:16 |
tobiash | yeah, so maybe we could do both for a period of time and deprecate the old one | 15:17 |
corvus | yeah | 15:17 |
tobiash | and remove the old maybe with 5.0 | 15:17 |
corvus | okay, i left a note on the unzoned option change, and i'm leaving a note on the stats change now; i'll rework the stats change in a little bit | 15:18 |
tobiash | k | 15:18 |
corvus | tobiash: ok comment left on https://review.opendev.org/777852 i think that covers it | 15:20 |
tobiash | lgtm | 15:23 |
sassyn | Hi everyone | 19:21 |
sassyn | thank u avass | 19:21 |
sassyn | thank u also tobiash | 19:21 |
sassyn | The issue I'm having is still there. Commit code to gerrit, go to zuul console, open the gate job, I see the commit. Clicking on the job shows, for example: 16:21 Job console starting... and the next line is 16:31 Running Ansible setup... | 19:24 |
sassyn | 10 min waiting time for nothing. This is the only job i have in my system | 19:25 |
avass | sassyn: that's probably the time it takes for the executor to clone/checkout any repos it needs to run the jobs. are the repos large or are you running jobs with a lot of repos? | 19:46 |
avass | sassyn: in any way I don't think it's "nothing" it's just not showing what's actually happening during that time in a transparent way :) | 19:48 |
tobiash | sassyn: you may want to have a look into the executor logs to see what it's doing | 20:07 |
tobiash | sassyn: how large are the repos involved in that job and what storage type is it using? | 20:08 |
tobiash | Also the bandwidth to gerrit may matter | 20:08 |
avass | tobiash: would it be possible to log something like "Setting up workspace" or do the executor build logs only start when the job actually starts? | 20:09 |
tobiash | sassyn: also make sure that the git_dir and the job_dir are on the same volume (https://zuul-ci.org/docs/zuul/discussion/components.html#attr-executor.git_dir) | 20:10 |
tobiash | having those on separate mount points will dramatically slow down the job startup times | 20:10 |
tobiash | because this prevents hard linking during cloning into the job workspace | 20:11 |
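The hard-link effect is plain git behavior, not Zuul-specific, and easy to verify (a sketch using a throwaway repo): a local clone on the same filesystem hard-links files under `.git/objects`, so the source object's link count rises above 1; a cross-device clone has to copy instead.

```python
import os
import subprocess
import tempfile

work = tempfile.mkdtemp()
subprocess.run(['git', 'init', '-q', 'repo'], cwd=work, check=True)
repo = os.path.join(work, 'repo')
subprocess.run(['git', '-c', 'user.email=a@b', '-c', 'user.name=a',
                'commit', '-q', '--allow-empty', '-m', 'init'],
               cwd=repo, check=True)

# Same-filesystem clone: git hard-links the object files
subprocess.run(['git', 'clone', '-q', repo, 'clone'],
               cwd=work, check=True)

objdir = os.path.join(repo, '.git', 'objects')
links = [os.stat(os.path.join(root, f)).st_nlink
         for root, _, files in os.walk(objdir) for f in files]
print(max(links))  # 2: each loose object is shared with the clone
```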
tobiash | avass: it's writing this when it creates the job dir: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L568 | 20:13 |
avass | tobiash: ah | 20:14 |
tobiash | avass: which is right before fetching and preparing the repos | 20:14 |
avass | maybe updating that message and adding a "setting up workspace" then so users know that it's working on something :) | 20:15 |
*** ikhan has joined #zuul | 20:16 | |
tobiash | avass: I think it should be easy to add further messages there | 20:16 |
tobiash | the first could be something like updating repos | 20:17 |
tobiash | and here we could add something like preparing workspace: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L1061 | 20:17 |
avass | thanks. looking at that in a moment. I think it would be good to make that a bit more transparent to the user | 20:20 |
tobiash | fungi, corvus: regarding the processpool topic. I haven't read the full conversation so just ignore me if that's superfluous. I don't think we should do a _send_aborted in this case since this sets the build result to ABORTED, which should only be done if the scheduler requested the abort. If you want to trigger a retry, a build result of None is the right thing, like for a pre failure or in the unreachable case: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L1431 | 20:22 |
fungi | tobiash: does that mean you feel like the current behavior at https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L948-L953 is also wrong? | 20:26 |
tobiash | fungi: I mean the job result | 20:33 |
tobiash | fungi: ah yeah, I think exactly that's where you want to change the behavior | 20:34 |
fungi | tobiash: i'm not sure i follow. _send_aborted() causes the build to be aborted and retried, right? at least that's what the code comment says | 20:35 |
tobiash | I think the code mismatches the comment | 20:35 |
tobiash | send_aborted sends a job result of ABORTED back to the scheduler | 20:36 |
tobiash | a retry needs to be triggered exactly like https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L955-L958 just with result=None instead | 20:37 |
fungi | interestingly, what we observed was that the result reported for the build was ERROR with detail "Error: Failed to update project None" | 20:37 |
tobiash | a result of None will tell the scheduler to retry | 20:37 |
fungi | we didn't get an ABORTED result reported | 20:38 |
fungi | https://zuul.opendev.org/t/openstack/build/999f096bccbe43c286b7c6ed3c5deeb5 | 20:38 |
tobiash | fungi: can you dig up the stack trace in the executor log? | 20:38 |
fungi | tobiash: yep, this was what i found http://paste.openstack.org/show/803050/ | 20:39 |
tobiash | need to check if we re-raise that exception | 20:39 |
fungi | as you guessed earlier, it was the result of the process being terminated by the kernel's oom killer | 20:39 |
tobiash | yeah, that's pretty much always the case, either oom or a crashing python interpreter | 20:40 |
tobiash | and oom is much more likely these days | 20:40 |
tobiash | fungi: ok, so your stack trace shows that the exception is caught here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L2962 | 20:42 |
tobiash | and the exception information is not passed to the caller | 20:42 |
tobiash | which is why you end up here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L955 | 20:43 |
tobiash | with result error | 20:43 |
tobiash | what we could do is to attach some error information to the task here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L2963 | 20:44 |
tobiash | and evaluate them here: https://opendev.org/zuul/zuul/src/branch/master/zuul/executor/server.py#L1046 | 20:44 |
tobiash | e.g. if the error is transient return None there | 20:45 |
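That variant can be sketched as follows (hypothetical names throughout; only `BrokenProcessPool` is real, from `concurrent.futures.process`): the worker records why the update failed on the task, and the check that currently raises ExecutorError returns None for transient failures, which per the discussion above tells the scheduler to retry.

```python
from concurrent.futures.process import BrokenProcessPool

class FakeTask:
    """Stand-in for Zuul's update task; attribute names illustrative."""
    def __init__(self):
        self.success = False
        self.transient_error = False

def record_failure(task, exc):
    # Where the worker catches the exception today, also note its kind.
    task.success = False
    task.transient_error = isinstance(exc, BrokenProcessPool)

def build_result(task, project):
    if task.success:
        return 'SUCCESS'
    if task.transient_error:
        return None  # None asks the scheduler to retry the build
    # Non-transient: keep today's permanent-error behavior
    raise RuntimeError("Failed to update project %s" % project)
```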
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Retry job on broken process pool https://review.opendev.org/c/zuul/zuul/+/777874 | 20:48 |
tobiash | fungi: like ^ | 20:48 |
fungi | aha, thanks, think that's basically what we discussed at http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2021-02-27.log.html#t2021-02-27T00:11:50 | 20:55 |
tobiash | oh wait, that misses something | 20:55 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Retry job on broken process pool https://review.opendev.org/c/zuul/zuul/+/777874 | 20:59 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Retry job on broken process pool https://review.opendev.org/c/zuul/zuul/+/777874 | 21:02 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Retry job on failed repo updates https://review.opendev.org/c/zuul/zuul/+/777875 | 21:03 |
tobiash | fungi, corvus: as an alternative we could also retry jobs on every update failure ^ | 21:03 |
fungi | i guess i'd be slightly more worried about the possibility for some sorts of update failures to be persistent, leading to retry loops. but we do at least still have a retry limit serving as a backstop in that case | 21:20 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!