*** bhavikdbavishi has joined #zuul | 02:35 | |
*** rlandy|rover|bbl is now known as rlandy|rover | 02:46 | |
*** rlandy|rover has quit IRC | 03:20 | |
*** bhavikdbavishi has quit IRC | 03:46 | |
*** bhavikdbavishi has joined #zuul | 03:46 | |
*** bhavikdbavishi has quit IRC | 03:48 | |
*** bhavikdbavishi has joined #zuul | 03:48 | |
*** bjackman has joined #zuul | 04:40 | |
*** quiquell|off is now known as quiquell | 07:18 | |
*** bjackman has quit IRC | 08:52 | |
*** bjackman has joined #zuul | 08:54 | |
bjackman | Could anyone guess how I might have ended up with stuff in my Zuul queue that appears to be "running" in the web UI, but that has nothing in its logs, and all I can see in the executor logs is stuff from the merger? | 09:48 |
---|---|---|
bjackman | None of my jobs seem to start _really_ running. I can see that nodepool satisfied some node requests but then seems to have freed them up again | 09:49 |
*** sshnaidm|afk is now known as sshnaidm | 09:52 | |
*** bhavikdbavishi has quit IRC | 10:13 | |
tristanC | bjackman: iirc this can happens when executor host are busy, are they accepting jobs? | 10:23 |
bjackman | tristanC, the scheduler was printing the "Received handle " messages, so I guess so? | 10:52 |
bjackman | The blockage just suddenly cleared up so I suspect that zuul wasn't actually locked up, my infra was just very slow.. | 10:52 |
bjackman | I don't know what could cause such long delays though | 10:53 |
*** dkehn has quit IRC | 11:04 | |
*** bjackman has quit IRC | 12:01 | |
*** bhavikdbavishi has joined #zuul | 12:29 | |
*** rlandy has joined #zuul | 12:42 | |
*** rlandy is now known as rlandy|rover | 12:43 | |
*** bjackman has joined #zuul | 12:50 | |
*** bhavikdbavishi has quit IRC | 12:54 | |
*** bhavikdbavishi has joined #zuul | 13:01 | |
*** quiquell is now known as quiquell|brb | 13:15 | |
*** ianychoi has joined #zuul | 13:17 | |
*** quiquell|brb is now known as quiquell | 13:35 | |
*** bhavikdbavishi has quit IRC | 13:44 | |
*** bjackman has quit IRC | 13:44 | |
Shrews | tristanC: do you think this is a transient error, or something more concerning? http://logs.openstack.org/42/621642/3/check/nodepool-functional-k8s/b9aef79/job-output.txt.gz#_2019-01-02_20_49_34_020284 | 13:56 |
sshnaidm | do you know if releasing external resources feature is planned (that was discussed here http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-December/000653.html) | 13:58 |
sshnaidm | or worth to create a task/story/whatever to track it? | 13:59 |
fungi | sshnaidm: bjackman seems to have disappeared in the last few minutes, but was the initiator of that discussion. i'm not sure whether he's working on it. a story in sb would be good if there isn't one already | 13:59 |
sshnaidm | I didn't find a story and created this one: https://storyboard.openstack.org/#!/story/2004694 | 14:05 |
sshnaidm | fungi, ^^ | 14:05 |
sshnaidm | please feel free to comment/correct it if something is wrong | 14:05 |
fungi | thanks sshnaidm! | 14:06 |
*** dkehn has joined #zuul | 14:16 | |
*** bhavikdbavishi has joined #zuul | 15:10 | |
*** sshnaidm has quit IRC | 15:15 | |
*** bhavikdbavishi has quit IRC | 15:16 | |
*** sshnaidm has joined #zuul | 15:26 | |
*** quiquell is now known as quiquell|off | 15:28 | |
*** bhavikdbavishi has joined #zuul | 15:41 | |
*** bjackman has joined #zuul | 15:54 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Extract out common config parsing for ConfigPool https://review.openstack.org/621642 | 16:05 |
*** sshnaidm has quit IRC | 16:15 | |
corvus | i'm looking at two potential zuul bugs this morning -- 1) a bug that fungi and Shrews found where a change is stuck in the queue because zuul doesn't know what jobs are attached to it. and 2) we seem to be performing tenant reconfiguration events very often in response to actions on ansible pull requests. i haven't started looking into that one at all. | 16:22 |
Shrews | corvus: i'm very interested in the outcome of #1 and how we get to that state. it's obviously something we protect against for some reason | 16:25 |
*** sshnaidm has joined #zuul | 16:28 | |
Shrews | that last sentence should read: "it's obviously something we expect since we protect against it" | 16:29 |
*** sshnaidm_ has joined #zuul | 16:32 | |
*** bjackman has quit IRC | 16:33 | |
*** sshnaidm has quit IRC | 16:35 | |
*** sshnaidm__ has joined #zuul | 16:35 | |
*** sshnaidm has joined #zuul | 16:37 | |
*** sshnaidm_ has quit IRC | 16:38 | |
*** sshnaidm__ has quit IRC | 16:40 | |
*** sshnaidm_ has joined #zuul | 16:40 | |
*** sshnaidm has quit IRC | 16:43 | |
corvus | and #3 -- there's an unhandled exception in the reviseRequest method; it's falling back on the general scheduler exception handler. | 16:54 |
openstackgerrit | Merged openstack-infra/zuul master: Switch back to three columns for mid sized screens https://review.openstack.org/627997 | 17:02 |
*** bjackman has joined #zuul | 17:09 | |
corvus | Shrews: any chance you have a few mins to track down the reconfiguration events in the logs and find out what the actual events are? (i know they are mostly about pull requests -- but are they pr's merged, opened, commented?) | 17:19 |
corvus | Shrews: the current description of the problem is "something about ansible pull requests are causing reconfigs"; i'm hoping if we refine that into "ansible pull requests cause reconfigs when <blank>" and it makes the problem obvious :) | 17:20 |
*** bjackman has quit IRC | 17:22 | |
Shrews | corvus: will see what i can dig up for you | 17:22 |
corvus | Shrews: "reconfiguration" is the best thing to grep for | 17:23 |
Shrews | ossum | 17:23 |
corvus | ?reconfiguration | 17:23 |
*** panda is now known as panda|off | 17:27 | |
*** sshnaidm__ has joined #zuul | 17:31 | |
*** sshnaidm_ has quit IRC | 17:34 | |
Shrews | corvus: omg | 17:35 |
corvus | yeah, it's a few, huh? :) | 17:35 |
Shrews | umm, it would be pretty bad if every PR change to ansible/ansible (even if it doesn't touch openstack files) triggered reconfigs, eh? | 17:35 |
corvus | Shrews: yeah, it would be a little bit of wasted work. :) | 17:36 |
Shrews | b/c this change triggers the first reconfig in the current log file: https://github.com/ansible/ansible/pull/50435/commits/503561a54bdf3c551683278615f1e6e198b74744 | 17:36 |
corvus | considering it takes about 40-60 seconds of dedicated computation to do a reconfig. | 17:37 |
Shrews | that touches nothing of concern to us | 17:37 |
Shrews | pardon the paste | 17:38 |
Shrews | 2019-01-03 06:27:10,348 INFO zuul.Pipeline.openstack.check: Adding change <Change 0x7f5674666b70 ansible/ansible 50435,503561a54bdf3c551683278615f1e6e198b74744> to queue <ChangeQueue check: ansible/ansible> in <Pipeline check> | 17:38 |
Shrews | 2019-01-03 06:27:10,350 INFO zuul.Pipeline.openstack.third-party-check: Adding change <Change 0x7f5674666b70 ansible/ansible 50435,503561a54bdf3c551683278615f1e6e198b74744> to queue <ChangeQueue third-party-check: ansible/ansible> in <Pipeline third-party-check> | 17:38 |
Shrews | 2019-01-03 06:27:10,599 INFO zuul.Scheduler: Tenant reconfiguration beginning for openstack due to projects {(<Project ansible/ansible>, 'devel')} | 17:38 |
corvus | on the other bug -- i'm developing a theory that we're seeing the change in question move in and out of the active window during reconfiguration; if the change moves out of the active window during a reconfiguration, it may no longer have any jobs attached to it, and if there were builds running for those jobs, we'd reject their results; however i'm not sure why we wouldn't restart them when they move into | 17:40 |
corvus | the active window | 17:40 |
corvus | Shrews: so was that a PR open event? | 17:40 |
Shrews | corvus: not sure how to tell | 17:42 |
Shrews | i think so, based on the first ansibot comment on that pr | 17:43 |
corvus | Shrews: maybe line up timestamps? if all else fails, you can backtrack to the event delivery and look up the event in github. | 17:43 |
*** panda|off has quit IRC | 17:44 | |
Shrews | the "about a day ago" timestamp on the PR does not help, but it does reference that commit # | 17:45 |
corvus | Shrews: if you mouseover you'll get a real time | 17:46 |
*** panda has joined #zuul | 17:48 | |
SpamapS | pabelanger: are you still looking for nginx configs that work btw? I'm revisiting ours to fix some problems. | 17:49 |
pabelanger | SpamapS: Yah, I'd be happy to also try them out | 17:52 |
pabelanger | I haven't had much time to dig into it just yet | 17:52 |
Shrews | corvus: our logs indicate it was a PR changed event | 17:57 |
Shrews | i could not line up the PR open to our timestamp | 17:57 |
corvus | Shrews: so adding commits to the pr? | 17:57 |
Shrews | 2019-01-03 06:27:10,347 DEBUG zuul.Scheduler: Submitting tenant reconfiguration event for openstack due to event <GithubTriggerEvent 0x7f5674666a90 pull_request ansible/ansible refs/pull/50435/head changed github.com/ansible/ansible 50435,503561a54bdf3c551683278615f1e6e198b74744 delivery: 97dffdc0-0f20-11e9-80c1-aefc073e53ea> in project ansible/ansible | 17:58 |
corvus | Shrews: are they all like that? | 17:58 |
Shrews | that commit was to fix the ansible lint errors, so that seems correct | 17:59 |
Shrews | corvus: will look at the next one | 17:59 |
Shrews | corvus: and that timestamp does line up, so yes, a PR change event | 18:01 |
Shrews | ok, now on to the next | 18:01 |
Shrews | corvus: hrm, looks like a status change (possibly labels being removed/added?) on that same change triggers another reconfigure | 18:05 |
Shrews | seconds after the previous | 18:06 |
corvus | ok not just the one type of event then | 18:06 |
Shrews | and for irrelevant PRs | 18:07 |
Shrews | so multiple issues i guess | 18:07 |
corvus | Shrews: based on that, i wonder if this is what is causing it? http://git.zuul-ci.org/cgit/zuul/tree/zuul/scheduler.py#n1058 | 18:11 |
corvus | fungi, Shrews: i think i've traced bug #1 -- a change has to be in the active window, make node requests, leave the active window, be subject to a tenant reconfiguration and re-enqueue, then the earlier node requests complete. at that point zuul has lost track of the jobs for the change because it is no longer active, so it returns the nodes, but does not clear the record of the node request. when the | 18:37 |
openstack | bug 1 in Ubuntu Malaysia LoCo Team "Microsoft has a majority market share" [Critical,In progress] https://launchpad.net/bugs/1 - Assigned to MFauzilkamil Zainuddin (apogee) | 18:37 |
corvus | change enters the active window again, it still thinks there's an outstanding node request and does nothing. | 18:37 |
corvus | always helpful that one :) | 18:37 |
corvus | it's going to take me a little bit to make up the test case for that | 18:39 |
fungi | hrm, so how does a change go from within the active window to outside it? | 18:40 |
corvus | (and i think bug number 3 is related here -- the sheer number of reconfigurations we are undergoing has served to increase the odds of hitting that case) | 18:40 |
fungi | window shrinkage? | 18:40 |
corvus | fungi: the active window shrinks. yeah | 18:40 |
Shrews | corvus: oh how fun | 18:40 |
corvus | it shrinks by half on a failure in our config | 18:40 |
fungi | cool, that's an obscure set of preconditions | 18:41 |
corvus | fungi: yeah, most of which we test individually, but not together :) | 18:41 |
clarkb | openstack is a really good beta user | 18:41 |
clarkb | :P | 18:41 |
corvus | also, worth noting that you have to have an active window > the minimum for this to happen. also a rare condition for us. | 18:42 |
clarkb | corvus: something about not merging a bunch of broken code over the holidays helping with that :) | 18:45 |
corvus | \o/ i have a reproducing test case | 19:02 |
corvus | the exception i noted as bug 2 is directly related to bug 1, and i don't think will happen once bug 1 is fixed; so i think i'm going to decline to "fix" that for now, as it might just mask other error conditions later. | 19:13 |
openstack | bug 1 in Ubuntu Malaysia LoCo Team "Microsoft has a majority market share" [Critical,In progress] https://launchpad.net/bugs/1 - Assigned to MFauzilkamil Zainuddin (apogee) | 19:13 |
corvus | i'm going to have to call these bug A,B,C aren't i? | 19:13 |
*** bhavikdbavishi has quit IRC | 19:25 | |
Shrews | i prefer alpha, beta, charlie... | 19:26 |
Shrews | corvus: so that scheduler code you pointed to... not quite sure how that would be triggered. would be nice if we had some logging to indicate which condition actually triggered the reconfigure there | 20:00 |
Shrews | grr, gotta afk for a bit | 20:19 |
corvus | Shrews: we don't load any config from the ansible repo, so the second clause (line 1063-1064) should always be true. so it sounds like 1061-1064 reduces to "if an event has a branch, reconfigure". especially since i don't think we set "exclude unprotected branches" which means the carveout on lines 1069-1071 won't apply either. | 20:35 |
*** hashar has joined #zuul | 20:38 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Remove unecessary finally clauses https://review.openstack.org/628299 | 21:08 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix items stuck in queue pending node requests https://review.openstack.org/628300 | 21:08 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Be more aggressive in canceling node requests https://review.openstack.org/628301 | 21:08 |
corvus | okay, that's the outcome of looking into bug alpha (and bravo) | 21:08 |
corvus | Shrews, fungi, clarkb: ^ | 21:09 |
clarkb | the first one is an easy review. I'll have to dig into the other two after I eat | 21:11 |
corvus | i'm going to try to work on a test case for bug charlie now | 21:15 |
mordred | corvus: I like the assertTrue(len(self.builds), 4) | 21:18 |
corvus | mordred: it's why my tests are so reliable | 21:18 |
mordred | ++ | 21:19 |
mordred | oh - I think in that case true is for len(self.builds) - and 4 becomes the message it will display if len(self.builds) is false right? | 21:19 |
corvus | yeah i think so | 21:19 |
mordred | assertion failed: 4 | 21:19 |
corvus | so it would still catch len(self.builds)==0. but that's not an expected failure case... more likely is 2 instead of 4. | 21:20 |
mordred | yah | 21:20 |
mordred | which would happily still be true | 21:21 |
mordred | type inferance is great until it isn't | 21:21 |
*** hashar has quit IRC | 21:37 | |
Shrews | corvus: oh, i see. i clearly did not understand the 2nd clause | 21:48 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Cache empty branch config to prevent spurious reconfig https://review.openstack.org/628307 | 21:48 |
corvus | Shrews, tobiash: ^ i think that's the fix for the reconfiguration bug | 21:49 |
corvus | fbo: ^ | 21:49 |
mordred | corvus: you have angered the pep8 gods: http://zuul.openstack.org/build/bfd4140565b84c57af7e75494c9faba8 ... but the zuul dashboard helped me see the error more quickly than before | 22:03 |
clarkb | the first change in the stack failed py36 tests | 22:05 |
mordred | sad panda | 22:07 |
corvus | false negative; possibly due to limestone perf issues | 22:07 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Be more aggressive in canceling node requests https://review.openstack.org/628301 | 22:08 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Cache empty branch config to prevent spurious reconfig https://review.openstack.org/628307 | 22:08 |
corvus | that should take care of the pep8 thing | 22:08 |
clarkb | corvus: for deleting the node request, nodepool will notice the request has gone away and then stick the booted node for that into its pool for fulfilling the next request? | 22:19 |
corvus | yep | 22:22 |
corvus | (we do currently delete node requests if we're resetting an item (eg, a gate reset) so that code is exercised) | 22:23 |
Shrews | corvus: with 628307 in place, would we still catch the github edge case of going from protected to unprotected? | 22:38 |
Shrews | or vice versa i guess | 22:39 |
corvus | Shrews: i believe so -- i think the branch won't show up in the list of branches we get from the driver in that case | 22:39 |
corvus | (so when it does show up (because the bit has flipped), it really will return None in the call to get the branch config in the scheduler -- that second clause we were looking at earlier) | 22:40 |
corvus | but we might want to double check that with tobiash :) | 22:41 |
Shrews | yeah, i'm a bit lost on that portion of it, but the rest looks sane (and i learned things!) | 22:41 |
clarkb | Zuul ignores the unprotected branches? | 22:45 |
clarkb | for configuration that is | 22:45 |
corvus | clarkb: it's an option | 22:45 |
Shrews | https://zuul-ci.org/docs/zuul/admin/tenants.html#attr-tenant.untrusted-projects.<project>.exclude-unprotected-branches , i think | 22:46 |
clarkb | I guess this situation only matters if we are ignoring unprotected branches otherwise we'd be tracking the config the entire time | 22:47 |
clarkb | also that is scoped to only when we are ignoring all config anyway | 22:49 |
clarkb | so I think this is ok? | 22:49 |
Shrews | corvus: how does one differentiate between protected and unprotected branches in github? | 22:52 |
Shrews | is that a github setting? | 22:53 |
corvus | Shrews: we've reached the limits of my knowledge here, i just know it's a github thing. | 22:53 |
Shrews | :) | 22:55 |
pabelanger | Shrews: yup, there are branch settings in github where you apply branch protection rules | 22:55 |
Shrews | https://help.github.com/articles/defining-the-mergeability-of-pull-requests/ | 22:55 |
Shrews | that ^ describes it | 22:55 |
Shrews | pabelanger: thx | 22:55 |
pabelanger | we just started to enable it on all ansible-network roles before the break | 22:56 |
Shrews | corvus: mainly wanted to verify that i wasn't missing a zuul-side config for it | 22:56 |
clarkb | pabelanger: are those rules viewable by non admins somewhere? | 22:56 |
pabelanger | clarkb: I don't believe so | 22:57 |
pabelanger | clarkb: I think, if you are logged into github you will just see branch checks are required on a PR | 22:58 |
pabelanger | eg: https://github.com/ansible-network/cloud_vpn/pull/55 | 22:58 |
clarkb | I don't see naything there indicating that | 22:58 |
clarkb | so maybe we can't see that | 22:59 |
pabelanger | Required statuses must pass before merging, is the section I see | 22:59 |
pabelanger | but it is my PR | 22:59 |
openstackgerrit | Merged openstack-infra/zuul master: Remove unecessary finally clauses https://review.openstack.org/628299 | 23:00 |
openstackgerrit | Merged openstack-infra/zuul master: Fix items stuck in queue pending node requests https://review.openstack.org/628300 | 23:00 |
clarkb | I don't see that (^F doesn't show it either) | 23:00 |
pabelanger | clarkb: okay, so possible only owners / collaborators / teams see it in a PR, since they can only merge code | 23:01 |
openstackgerrit | Merged openstack-infra/zuul master: Be more aggressive in canceling node requests https://review.openstack.org/628301 | 23:03 |
openstackgerrit | Merged openstack-infra/zuul master: Cache empty branch config to prevent spurious reconfig https://review.openstack.org/628307 | 23:03 |
Shrews | our tenant config docs show the use of exclude-unprotected-branches with a gerrit source. that's not correct, is it? | 23:06 |
Shrews | the subsequent docs for that option only mention github | 23:07 |
Shrews | tobiash: maybe you know? ^^ | 23:07 |
mordred | Shrews: I believe you are right - that is a github option - branch protections are a github concept generally | 23:08 |
Shrews | mordred: i'm allowed to be right once a year, so it's possible :) | 23:09 |
Shrews | oh wow, i need to start dinner | 23:10 |
tobiash | Shrews: exclude unprotected branches is a tenant or project option in main.yaml and not directly related to a connection | 23:33 |
tobiash | But yes it only has an effect in the github driver | 23:34 |
tobiash | Shrews: do you have a link to that page? Maybe it needs a correction. | 23:35 |
Shrews | tobiash: https://zuul-ci.org/docs/zuul/admin/tenants.html#tenant | 23:37 |
tobiash | Shrews: ok, so that's an option on tenant or project level and has no effect on projects of a gerrit source | 23:38 |
tobiash | And I'd suggest to set this to true on tenant level | 23:39 |
corvus | tobiash: if you could retro-review https://review.openstack.org/628307 i'd appreciate it :) | 23:52 |
tobiash | corvus: sure, I'll have a look tomorrow :) | 23:53 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!