pabelanger | looks like we didn't properly dequeue the previous version of the PR for some reason | 00:00 |
pabelanger | looking at https://github.com/ansible/ansible/pull/55199 it seems to have ~14 commits pushed close together, wonder if that has something to do with it | 00:01 |
pabelanger | http://paste.openstack.org/show/752908/ | 00:02 |
pabelanger | I'd only expect 1 change there, not 3 | 00:03 |
*** ianychoi has joined #zuul | 00:19 | |
*** mattw4 has quit IRC | 01:01 | |
*** swest has quit IRC | 01:09 | |
*** swest has joined #zuul | 01:24 | |
*** igordc has quit IRC | 01:35 | |
*** bhavikdbavishi has joined #zuul | 01:58 | |
*** bhavikdbavishi1 has joined #zuul | 02:01 | |
*** bhavikdbavishi has quit IRC | 02:02 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 02:02 | |
pabelanger | okay, looking at https://github.com/ansible/ansible/pull/55199 more, I _think_ they somehow used the web ui to make code changes to the PR | 02:03 |
pabelanger | which resulted in zuul enqueuing the same change 6 times into check | 02:03 |
pabelanger | http://paste.openstack.org/show/752910/ was the event from github | 02:17 |
pabelanger | however, I am unsure why zuul triggered on it, the action was synchronize | 02:17 |
pabelanger | oh, that seems to map to 'changed' for zuul | 02:19 |
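Editor's note: the webhook's `synchronize` action (new commits pushed to a PR) is what zuul's github driver exposes as the `changed` pull_request action. A purely illustrative sketch of that kind of translation follows; the dict and function names are assumptions, not the actual driver code.

```python
# Illustrative mapping from GitHub webhook "action" values to the action
# names a zuul pipeline trigger matches on.  The real translation lives in
# zuul's github driver; this dict and function are assumptions for the example.
WEBHOOK_TO_TRIGGER_ACTION = {
    "opened": "opened",
    "synchronize": "changed",   # commits (re)pushed to the pull request
    "edited": "edited",
    "reopened": "reopened",
}

def trigger_action_for(webhook_action: str) -> str:
    # Fall back to the raw webhook action if no explicit mapping exists.
    return WEBHOOK_TO_TRIGGER_ACTION.get(webhook_action, webhook_action)
```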
*** rlandy|ruck has quit IRC | 02:24 | |
*** bhavikdbavishi has quit IRC | 02:26 | |
pabelanger | okay, will have to continue this in the morning, I think it is possible we are getting more than 1 Change ID for a PR, which seems to not dequeue the previous Change, because they are different | 02:34 |
*** michael-beaver has quit IRC | 02:43 | |
*** dmellado_ has joined #zuul | 03:08 | |
*** dmellado has quit IRC | 03:08 | |
*** dmellado_ is now known as dmellado | 03:08 | |
*** mugsie has quit IRC | 03:22 | |
*** EmilienM has quit IRC | 03:22 | |
*** kklimonda has quit IRC | 03:22 | |
*** mgagne has quit IRC | 03:22 | |
*** persia has quit IRC | 03:22 | |
*** sean-k-mooney has quit IRC | 03:22 | |
*** bhavikdbavishi has joined #zuul | 03:31 | |
*** mugsie has joined #zuul | 03:38 | |
*** EmilienM has joined #zuul | 03:38 | |
*** kklimonda has joined #zuul | 03:38 | |
*** mgagne has joined #zuul | 03:38 | |
*** persia has joined #zuul | 03:38 | |
*** sean-k-mooney has joined #zuul | 03:38 | |
*** irclogbot_3 has quit IRC | 03:41 | |
*** irclogbot_1 has joined #zuul | 03:43 | |
*** igordc has joined #zuul | 04:12 | |
*** igordc has quit IRC | 04:19 | |
*** swest has quit IRC | 04:33 | |
*** pcaruana has joined #zuul | 05:09 | |
*** swest has joined #zuul | 05:14 | |
*** swest has quit IRC | 05:17 | |
*** swest has joined #zuul | 05:18 | |
*** raukadah is now known as chandankumar | 05:21 | |
*** bhavikdbavishi has quit IRC | 05:21 | |
*** bhavikdbavishi has joined #zuul | 05:22 | |
*** gtema has joined #zuul | 06:24 | |
*** tobiash has joined #zuul | 06:45 | |
tobiash | zuul-maint: looks like zuul-web got a memleak here: https://paste.pics/9f7cf9a259c39876361b49b1ebebcd29 | 06:45 |
tobiash | does anyone experience the same? | 06:46 |
tobiash | unfortunately it's hard to debug because sigusr2 seems to print the stack trace and objects but also kills it :/ | 06:46 |
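Editor's note: one way around that problem, sketched below as a standalone idea rather than zuul's actual signal handling, is to hook a different signal to log a summary of live objects and return, so the process survives repeated dumps. The objgraph library is third-party and its availability here is an assumption.

```python
import gc
import logging
import signal

import objgraph  # third-party; availability is an assumption here

log = logging.getLogger("memdebug")

def dump_objects(signum, frame):
    # Log a summary of live objects and return, so the process keeps running
    # and repeated dumps can be compared while the leak grows.
    gc.collect()
    log.warning("live objects: %d", len(gc.get_objects()))
    for name, count in objgraph.most_common_types(limit=20):
        log.warning("%-40s %d", name, count)

# SIGUSR1 is chosen only to avoid colliding with zuul's own SIGUSR2 handler.
signal.signal(signal.SIGUSR1, dump_objects)
```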
tobiash | those should be the changes that went into this update: https://etherpad.openstack.org/p/ipW8nmqqPn | 06:59 |
tristanC | tobiash: perhaps one of the requirements also changed? | 07:20 |
tobiash | maybe | 07:21 |
tristanC | tobiash: fwiw we are also running with the commits that are bolded in the etherpad without issue | 07:24 |
tristanC | tobiash: e.g. SQLAlchemy 1.3.4 released May 28 | 07:31 |
*** hashar has joined #zuul | 07:39 | |
*** jpena|off is now known as jpena | 07:50 | |
*** zbr|ooo is now known as zbr|ruck | 08:01 | |
*** sanjayu_ has joined #zuul | 08:16 | |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Zuul Web: add /api/user/authorizations endpoint https://review.opendev.org/641099 | 08:16 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: authentication config: add optional token_expiry https://review.opendev.org/642408 | 08:17 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: [WIP] admin REST API: zuul-web integration https://review.opendev.org/643536 | 08:17 |
*** jangutter has joined #zuul | 08:18 | |
*** hashar has quit IRC | 08:33 | |
*** hashar has joined #zuul | 08:40 | |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Set same timeout for py35/36 jobs in check and gate https://review.opendev.org/665331 | 09:01 |
fbo | Any reason the timeout was different between check and gate py35/36 jobs? The pagure driver patch failed in the gate due to a timeout. In check we have 1 hour but in gate 45 minutes. ^ | 09:04 |
tobiash | fbo: it used to be 45 minutes and the bump to 1 hour probably just forgot the gate | 09:09 |
fbo | tobiash: alright, yes it seems to be a good reason :) | 09:19 |
fbo | I'll recheck the pagure driver patch after we this .zuul.yaml change then | 09:20 |
fbo | *merge* | 09:21 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: [WIP] admin REST API: zuul-web integration https://review.opendev.org/643536 | 09:44 |
*** bhavikdbavishi has quit IRC | 09:56 | |
*** hashar has quit IRC | 09:59 | |
*** sanjayu_ has quit IRC | 10:11 | |
*** sanjayu_ has joined #zuul | 10:12 | |
*** rfolco has quit IRC | 10:13 | |
*** rfolco has joined #zuul | 10:14 | |
*** gtema has quit IRC | 10:24 | |
*** gtema_ has joined #zuul | 10:24 | |
*** gtema_ is now known as gtema | 10:24 | |
*** sanjayu_ has quit IRC | 10:36 | |
*** saneax has joined #zuul | 10:38 | |
*** panda is now known as panda|lunch | 11:28 | |
*** jpena is now known as jpena|lunch | 11:36 | |
*** EmilienM is now known as EvilienM | 11:45 | |
*** rf0lc0 has joined #zuul | 12:21 | |
*** rfolco has quit IRC | 12:21 | |
*** tobias-urdin has joined #zuul | 12:25 | |
tobias-urdin | https://review.opendev.org/#/c/637339/ yay looking forward to the next zuul and nodepool releases :) | 12:25 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: [WIP] admin REST API: zuul-web integration https://review.opendev.org/643536 | 12:37 |
*** jpena|lunch is now known as jpena | 12:37 | |
*** rlandy has joined #zuul | 12:38 | |
*** rlandy is now known as rlandy|ruck | 12:38 | |
*** panda|lunch is now known as panda | 12:55 | |
*** rf0lc0 has quit IRC | 13:35 | |
*** rfolco has joined #zuul | 13:42 | |
*** rfolco has quit IRC | 13:46 | |
*** rfolco has joined #zuul | 13:46 | |
*** chandankumar is now known as raukadah | 13:59 | |
*** saneax has quit IRC | 14:06 | |
*** rfolco has quit IRC | 14:20 | |
*** rfolco has joined #zuul | 14:21 | |
corvus | tobias-urdin: yeah, we were planning on releasing them today, but we found an issue with github interaction that we want to clear up first | 14:24 |
*** rfolco is now known as rfolco|off | 14:25 | |
corvus | tobiash: i don't see radical memory use changes on the server where we're running zuul-web: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all (though that reporting is not as detailed) | 14:28 |
tobiash | corvus: k, thanks, I'll look next week more in depth into this | 14:29 |
tobiash | for now I just increased the resource limit and let openshift handle the restarts :) | 14:29 |
corvus | yay cloud | 14:29 |
Shrews | corvus: mordred: It looks like next Wednesday is going to be my last full day for a while. Will be gone until at least July 8th. I'll let you two decide what to do with the autohold changes (which I consider complete at this point) in relation to the web admin stuff. | 14:29 |
Shrews | certainly not urgent to merge, but will require changes if merged after the web admin stuff | 14:31 |
corvus | Shrews: yeah, i'm expecting it to be relatively straightforward api changes rather than more complicated structural changes, so i expect we can update the patches | 14:33 |
Shrews | k k | 14:33 |
corvus | i'm restarting opendev's zuul to pick up the new github fixes | 14:41 |
corvus | we could release nodepool now, but i'd like to release them at the same time because of the node python path change (it's not strictly necessary, but it seems weird not to) | 14:42 |
*** mgariepy has joined #zuul | 14:46 | |
corvus | pabelanger: 2019-06-14 14:44:39,652 ERROR zuul.GithubEventProcessor: [e: c657f89e-8eb2-11e9-908a-1f05e5124b4f] Failed handling status event; remote said retry after 60,will retry attempt 1/5 in 60.0 seconds | 14:46 |
corvus | pabelanger: that was 403 as a first response (ie, it wasn't after a 502) | 14:46 |
pabelanger | neat, lets see what happens | 14:46 |
corvus | pabelanger, tobiash: that makes me think that in addition to the 502 errors causing retries, we're *also* simply running into rate limits probably due to the parallelization change | 14:47 |
tobiash | corvus: yes, probably | 14:47 |
tobiash | before we just queued on our side with no chance reaching the limit because of serialization | 14:47 |
pabelanger | would agree | 14:47 |
tobiash | I think there is still some potential to reduce api calls (e.g. some calls due to github3.py object model) | 14:49 |
tobiash | and the getpullbysha can be optimized using graphql with one request | 14:50 |
corvus | pabelanger: 2019-06-14 14:51:08,371 ERROR zuul.GithubEventProcessor: [e: c657f89e-8eb2-11e9-908a-1f05e5124b4f] Failed handling status event; remote said retry after 60,will retry attempt 5/5 in 60.0 seconds | 14:51 |
corvus | pabelanger: perhaps the retry-after should be +5 seconds? | 14:52 |
pabelanger | corvus: yah, that was my fear, we needed to add a buffer | 14:52 |
pabelanger | sounds like a good idea | 14:52 |
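Editor's note: a minimal sketch of the buffering idea being discussed, assuming the Retry-After value is read off the HTTP response; this is illustrative, not the exact change proposed in 665415.

```python
import time

import requests

RETRY_PAD = 5.0             # buffer on top of what GitHub asks for
DEFAULT_RETRY_AFTER = 60.0  # fallback when no Retry-After header is present

def wait_before_retry(response: requests.Response) -> None:
    # Abuse-detection 403s carry a Retry-After header; 502s from the backend
    # generally do not, so fall back to a fixed delay in that case.
    try:
        delay = float(response.headers.get("Retry-After", DEFAULT_RETRY_AFTER))
    except (TypeError, ValueError):
        delay = DEFAULT_RETRY_AFTER
    time.sleep(delay + RETRY_PAD)
```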
corvus | tobiash: do you think we also need to calculate our internal rate and slow down? we do that in openstacksdk (it's code inherited from nodepool). we don't necessarily have to ask github for the rate after each request, we could just do it once and count our own calls. we could probably avoid hitting the limit most of the time that way. | 14:56 |
corvus | tobiash: or do you think we should just try to honor the retry-after header and see if that will take care of it? | 14:56 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Pad retry_after logic for github when we raise an exception https://review.opendev.org/665415 | 15:00 |
pabelanger | tobiash: corvus: ^ but I also wonder if somehow multiple retries for different events are also going to trigger the 403 abuse api | 15:03 |
pabelanger | do we need to wait 'retry_after' for any API call, or just to the specific api? | 15:03 |
tobiash | corvus: we don't have a mechanism yet to globally throttle the requests | 15:06 |
*** hashar has joined #zuul | 15:15 | |
pabelanger | tobiash: I wonder if github3.py exposes anything, IIRC they are using python requests | 15:17 |
fungi | score! http://logs.openstack.org/19/663119/1/check/zuul-jobs-test-registry/6109606/ara-report/result/35f53315-15d6-4f5a-8e24-aa07a2796913/ | 15:17 |
fungi | looks like https://review.opendev.org/663119 is ready to be reviewed | 15:17 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Pad retry_after logic for github when we raise an exception https://review.opendev.org/665415 | 15:22 |
pabelanger | ^also cleaned up traceback message, to add space after comma :) | 15:22 |
corvus | thanks, that would have annoyed me to no end :) | 15:23 |
pabelanger | Ooo | 15:24 |
pabelanger | 2019-06-14 14:55:33,759 ERROR zuul.GithubEventProcessor: [e: 75a44c70-8eb4-11e9-94c5-7a90e0072678] Failed handling pull_request event; remote said retry after 60,will retry attempt 1/5 in 60.0 seconds | 15:24 |
pabelanger | however, the retry worked | 15:24 |
*** rfolco has joined #zuul | 15:24 | |
pabelanger | I can see zuul processing that event | 15:24 |
pabelanger | interesting enough | 15:24 |
pabelanger | it looks like all the 5/5 retries are from the status event | 15:24 |
pabelanger | but that pull_request did work as expected | 15:25 |
*** rfolco|off has quit IRC | 15:25 | |
*** mgariepy has left #zuul | 15:26 | |
pabelanger | http://paste.openstack.org/show/752933/ | 15:26 |
pabelanger | that is a snip of logic working | 15:26 |
*** guilhermesp has joined #zuul | 15:30 | |
openstackgerrit | Merged zuul/zuul master: Set same timeout for py35/36 jobs in check and gate https://review.opendev.org/665331 | 15:32 |
pabelanger | fbo: thanks, I didn't think to check that the other day^ | 15:33 |
pabelanger | also seen a timeout in gate | 15:33 |
corvus | pabelanger: is the zuul.ansible.com instance watching ansible/ansible? | 15:38 |
pabelanger | corvus: yes, we are reporting results back too | 15:38 |
corvus | pabelanger: you're seeing the same errors there, right? | 15:39 |
pabelanger | corvus: when I looked last week yes, I can check now, give me a moment | 15:39 |
corvus | pabelanger: if this is going to be an iterative process, perhaps it would be easier for you to test out changes like 665415 on that instance, and then, once you're satisfied with the result, we could restart opendev as a final verification? | 15:40 |
pabelanger | corvus: c657f89e-8eb2-11e9-908a-1f05e5124b4f we also see that failure today | 15:40 |
corvus | it's... not easy to restart opendev, and we lose about 1500 compute-hours worth of work each time we do it... :/ | 15:41 |
corvus | probably more, really | 15:41 |
pabelanger | corvus: sure, that seems reasonable, I'd apply it now however we are not running the new multi processing patch tobiash landed. That is not to say we cannot upgrade to latest master | 15:42 |
corvus | pabelanger: yeah, i think you'll need master + your new patch | 15:42 |
pabelanger | corvus: agree | 15:42 |
corvus | that is interesting that you see that error even without the multiprocessing patch | 15:43 |
pabelanger | looking at zuul.o.o logs, I am concerned that if we group 2 or more retries together, they will both not wait the required '60' seconds for a retry, we just keep fighting each other until 5/5 tries. | 15:43 |
clarkb | pabelanger: yes that is the multiprocessing change | 15:44 |
clarkb | that should be fine if the retry is due to a backend error | 15:44 |
clarkb | but likely less good if being throttled by rate limiter | 15:44 |
pabelanger | I wonder if we should consider backing that out, until we can determine a better way to deal with rate limits | 15:44 |
pabelanger | (multiprocessing change, I mean) | 15:45 |
corvus | pabelanger: yes, i think that's quite likely going to be a problem. maybe you and tobiash can come up with a determination of whether it is, and a recommendation for how we should proceed. | 15:45 |
pabelanger | corvus: okay, I think what we can do on zuul.a.c, is apply the retry-after logic (minus multiprocessing) to even see if 665415 is needed | 15:46 |
corvus | pabelanger: good idea, that will give us another data point | 15:46 |
corvus | pabelanger: then you could also apply 665415 (still without MP) and see if that changes things | 15:47 |
pabelanger | okay, let me work on that. However, I am not sure I will be able to complete that today, have to #dadops early today. But should have time over the weekend to confirm | 15:47 |
pabelanger | corvus: +1 | 15:47 |
corvus | pabelanger: looking at the ...b4f event -- it's interesting that some of the timestamps between the attempts are ~2 minutes apart, suggesting that the request itself still took almost a minute to return a 403. | 15:52 |
*** swest has quit IRC | 15:55 | |
*** jamesmcarthur has joined #zuul | 15:56 | |
*** bhavikdbavishi has joined #zuul | 15:57 | |
pabelanger | okay, I've applied: https://review.opendev.org/665220/ | 15:58 |
pabelanger | and confirmed zuul.a.c is able to enqueue a PR | 15:58 |
pabelanger | so, now we wait for an exception | 15:58 |
pabelanger | 2019-06-14 15:59:43,328 ERROR zuul.GithubEventConnector: [delivery: 281d19ce-8ebd-11e9-8513-c2e73b0c5da4] Failed handling status event; remote said retry after None,will retry attempt 1/5 in 60.0 seconds | 16:00 |
pabelanger | now we wait | 16:00 |
pabelanger | also | 16:00 |
pabelanger | http://paste.openstack.org/show/752935/ | 16:00 |
pabelanger | that is new | 16:00 |
pabelanger | okay, a retry worked | 16:01 |
pabelanger | http://paste.openstack.org/show/752936/ | 16:01 |
pabelanger | this is without any padding | 16:02 |
pabelanger | let me find another instances or 2 | 16:02 |
tobiash | corvus, pabelanger: re rate limit, do you still see frequent 'get all prs of ansible' in getPullBySha? | 16:04 |
tobiash | if yes, that would be likely one reason to hit the rate limit | 16:05 |
tobiash | basically this is a quick stab I got from a github engineer to get a pr by sha: http://paste.openstack.org/show/752937/ | 16:07 |
tobiash | using graphql and this only costs one request | 16:07 |
tobiash | so if getpullbysha is the reason we have our first graphql use case | 16:08 |
tobiash | (and those events listed above *are* status events) | 16:08 |
*** bhavikdbavishi has quit IRC | 16:12 | |
*** mattw4 has joined #zuul | 16:13 | |
pabelanger | 2019-06-14 16:04:21,511 ERROR zuul.GithubEventConnector: [delivery: f506a3a6-8ebd-11e9-935d-2fe311e1498f] Failed handling status event; remote said retry after None,will retry attempt 1/5 in 60.0 seconds | 16:17 |
pabelanger | that was 502 Server Error | 16:17 |
pabelanger | and retry was successful | 16:17 |
pabelanger | \o/ | 16:17 |
pabelanger | so, that at least fixes the original issue | 16:17 |
pabelanger | so far 2 for 2 on retry | 16:17 |
pabelanger | also, we didn't get a 'Retry-After' header on the 502 exception | 16:18 |
*** bhavikdbavishi has joined #zuul | 16:18 | |
pabelanger | tobiash: yah, each one has always been in getPullBySha(), that I have seen on zuul.a.c | 16:20 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Revert "Parallelize github event processing" https://review.opendev.org/665429 | 16:28 |
pabelanger | tobiash: clarkb: corvus: ^revert of parallel github, if we decided to go that route before next zuul release | 16:28 |
pabelanger | I marked it WIP for now | 16:28 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Revert "Parallelize github event processing" https://review.opendev.org/665429 | 16:29 |
*** jangutter has quit IRC | 16:47 | |
*** bhavikdbavishi has quit IRC | 16:51 | |
pabelanger | tobiash: corvus: clarkb: while waiting for zuul.a.c to get more exceptions, last night I noticed the following: http://paste.openstack.org/show/752908/ we had more the 1 PR (same ID) in our check pipeline. Which resulted in zuul multiple test results on the PR: https://github.com/ansible/ansible/pull/55199 I am unsure why we didn't dequeue the previous changes in the check pipeline, but have a feeling because | 16:57 |
pabelanger | they were different change ids | 16:57 |
*** jpena is now known as jpena|off | 16:57 | |
corvus | pabelanger: i saw you mentioned that. are you looking into it? | 16:58 |
mnaser | is there a way to lock down zuul's UI behind any kind of auth or should i go back to basics (i.e. whitelist ips or something on apache layer) | 16:59 |
mnaser | im thinking if there isnt anything native i can try using mod_auth_openidc or something like that | 17:00 |
mnaser | just not sure how many things that might break :) | 17:00 |
corvus | mnaser: no, i think apache should be fine | 17:00 |
mnaser | going to run mod_auth_openidc and see how that works out then :) | 17:01 |
corvus | mnaser: that should certainly work for whitelabeled tenants, or the entire system non-whitelabeled. if you wanted to restrict access to individual tenants without whitelabeling, i think it would mostly work because of the way almost all the urls are tenant-scoped, except for the tenant index itself. | 17:03 |
corvus | not sure which of the 3 situations you're dealing with | 17:03 |
pabelanger | corvus: I am, but needed a little help understanding how we decided to dequeue a previous version of a PR / review. If I understand, it is tied to the Change? | 17:04 |
mnaser | corvus: yeah it's more of a single-tenant deployment in this case so its just protecting the whole thing behind a wall | 17:04 |
corvus | mnaser: cool, i don't anticipate problems; let us know how it goes :) | 17:04 |
corvus | pabelanger: first -- you left this set to the default, right? https://zuul-ci.org/docs/zuul/user/config.html#attr-pipeline.dequeue-on-new-patchset | 17:05 |
pabelanger | corvus: correct, also I didn't know that was an option :) | 17:06 |
corvus | pabelanger: good, forget you ever heard about it :) | 17:06 |
pabelanger | done | 17:07 |
corvus | pabelanger: this is the key bit: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/github/githubmodel.py#L41 | 17:07 |
mnaser | wee | 17:07 |
mnaser | worked fine (initially) | 17:07 |
mnaser | ill see if other things might break | 17:07 |
corvus | pabelanger: when a new item is enqueued, zuul should iterate over all the other changes in the pipeline and if that's true for any of them, remove those | 17:07 |
pabelanger | great | 17:08 |
pabelanger | that gives me a starting point | 17:08 |
corvus | pabelanger: first thing that comes to mind is that the "updated_at" looks like a weak spot -- what if something raced and that ended up wrong? | 17:09 |
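Editor's note: the check behind the githubmodel.py link above amounts to comparing the queued PR's metadata with the newly enqueued one. A simplified sketch follows; attribute names approximate zuul's GitHub change model rather than the verbatim source, and if GitHub hands back a stale updated_at, the final comparison is exactly where the race corvus mentions would bite.

```python
# Simplified sketch of "is new_change a newer version of old_change?".
# Attribute names are approximations of zuul's GitHub change model.
def is_update_of(new_change, old_change) -> bool:
    return (
        new_change.project == old_change.project
        and new_change.number == old_change.number          # same pull request
        and new_change.patchset != old_change.patchset      # different head sha
        and new_change.updated_at > old_change.updated_at   # strictly newer
    )
```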
pabelanger | corvus: yah, my thoughts too. looking at the PR, I see a quick push of multiple commits | 17:09 |
corvus | pabelanger: yeah, so it might be worth collecting all the log lines for those and see if you can determine what the updated_at values would be each time it ran through | 17:10 |
pabelanger | ack, will do | 17:10 |
pabelanger | thanks | 17:11 |
corvus | pabelanger: the other key part of this system is https://opendev.org/zuul/zuul/src/branch/master/zuul/manager/__init__.py#L213 | 17:11 |
corvus | pabelanger: you should be able to grep for those methods and see their call locations, etc | 17:12 |
pabelanger | k | 17:12 |
clarkb | I guess PRs don't have the nice iteration number that changes in gerrit come with | 17:13 |
corvus | no, that's why we have to check both the sha and the timestamp | 17:13 |
tobiash | pabelanger, corvus: the parallel event processing is crucial for bigger environments like ours. Without it we'll have hours of event queueing. And I'm not sure if that's better than occasionally hitting the rate limit. So I think we should try the graphql for getpullbysha and see if that reduces the number of requests. | 17:16 |
clarkb | tobiash: we might also be able to throttle it so that there is always at least X time between requests or something | 17:17 |
corvus | tobiash: that sounds reasonable. i still think it would be swell if you or pabelanger or someone would investigate whether an overall throttle is necessary. | 17:17 |
clarkb | then you could tune that for your install (which I'm guessing has fewer rate limit concerns than public github) | 17:17 |
corvus | clarkb: you lost me at "tune" | 17:17 |
corvus | github tells us rate limits, no tuning should be necessary | 17:18 |
tobiash | yes, that makes sense | 17:18 |
tobiash | I think we might be able to hook into the requests session | 17:18 |
tobiash | and delay that if necessary | 17:18 |
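Editor's note: a sketch of that kind of hook, assuming all GitHub API calls go through one requests.Session subclass; this wrapper is illustrative and not existing zuul code.

```python
import threading
import time

import requests

class ThrottledSession(requests.Session):
    """Session that enforces a minimum spacing between outgoing requests."""

    def __init__(self, min_interval: float = 0.5):
        super().__init__()
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._last_request = 0.0

    def request(self, method, url, **kwargs):
        # Serialize the spacing check so concurrent threads cannot all fire
        # at once; then defer to the normal requests machinery.
        with self._lock:
            wait = self._min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
            self._last_request = time.monotonic()
        return super().request(method, url, **kwargs)
```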
clarkb | corvus: but it seems we are hitting errors on startup well before the rate limits? | 17:19 |
pabelanger | tobiash: Yup, my main concern, is the abuse message we get back from github.com, I'd hate to see github apps be banned (not that I expect them to do that for first offense) | 17:19 |
clarkb | the rate limit is like 5k requests per hour on a regular account | 17:19 |
clarkb | (I guess we could be making many thousands of requests all at once) | 17:19 |
corvus | clarkb: if we're hitting some kind of unpublished rate limit, then i think that would dramatically change how we would approach it. | 17:20 |
pabelanger | so far, without the parallel patch, zuul.a.c has worked as expected with the retry logic. If we release with parallel without fixing getpullbysha, that will be something new for us to debug | 17:20 |
corvus | it sounds like we need *a lot* more information. | 17:20 |
pabelanger | tobiash: I agree, graphql sounds like the fix, is that something we need to add to github3.py? | 17:20 |
tobiash | pabelanger: are you already running this: https://review.opendev.org/660800 ? | 17:20 |
tobiash | no | 17:20 |
tobiash | just a post request with the session object | 17:21 |
tobiash | I'll try it later or during the weekend | 17:21 |
pabelanger | tobiash: no, 660800, I just cherry-picked the retry-after changes | 17:21 |
tobiash | should be fairly easy | 17:21 |
clarkb | ya its 5k/hour on a regular authenticated account | 17:21 |
pabelanger | does getpullbysha hit search API? | 17:22 |
*** igordc has joined #zuul | 17:22 | |
clarkb | pabelanger: no | 17:22 |
pabelanger | k | 17:22 |
clarkb | it lists all the PRs instead (because the search api does have a lower limit) | 17:22 |
tobiash | no it does not according to github engineer | 17:23 |
tobiash | it's a graphql query and he showed me that this one is very cheap | 17:23 |
pabelanger | clarkb: right, that's what google was saying, the search API seems to have a more aggressive rate-limit | 17:23 |
clarkb | tobiash: https://developer.github.com/v3/search/#rate-limit depends on the api you use | 17:23 |
clarkb | that may be true for graphql but search api via rest has the lower rate limit | 17:24 |
pabelanger | tobiash: what is the path forward for graphql in zuul? Do you have thoughts | 17:24 |
corvus | i think that before we can say something is a solution, we need to know what the problem is; and it sounds like there is some doubt about whether this is a regular api limit, or something new. so before we engineer solutions which may or may not address the underlying problem, can we try to identify what it is first? | 17:25 |
pabelanger | +1 | 17:26 |
clarkb | https://developer.github.com/v3/guides/best-practices-for-integrators/#dealing-with-abuse-rate-limits has clues | 17:26 |
clarkb | "make requests serially" | 17:26 |
clarkb | I think it is likely we are hitting their abuse traps rather than regular rate limits on startup and that may be due to concurrent requests | 17:27 |
corvus | (perhaps the problem is that we're calling getpullbysha on the same repo simultaneously in 10 different threads with an empty cache?) | 17:27 |
corvus | (perhaps an analysis of the logs would reveal information like that ^) | 17:27 |
tobiash | ah, so the abuse rate limit is something different | 17:29 |
tobiash | so we should probably reduce the number of threads | 17:29 |
corvus | that is one thing that we could do | 17:30 |
corvus | but again, we need data | 17:30 |
tobiash | and implement rate limiting in a central place that evaluates the retry-after header | 17:30 |
*** hashar has quit IRC | 17:30 | |
corvus | that may not be necessary if we avoid the behavior that causes it in the first place | 17:30 |
tobiash | yeah, I'm just thinking as I have no data ;) | 17:30 |
corvus | yep, thinking is good :) | 17:31 |
corvus | so far we've had more patches than analysis | 17:31 |
corvus | mostly i'm trying to say it's time for the opposite :) | 17:31 |
tobiash | pabelanger: can you confirm in the logs that you hit the abuse rate limit and not the normal rate limit? I thought I read something like this in your snippets but cannot find it anymore. | 17:32 |
pabelanger | tobiash: yah, it was abuse, I need a moment to post it from zuul.o.o | 17:33 |
tobiash | ok, then I think we have the following options: | 17:33 |
tobiash | * queue per installation id | 17:33 |
tobiash | * reduce threads (but to which number?) Github doesn't document this | 17:34 |
tobiash | * central limit with retry-after inspection | 17:34 |
corvus | tobiash: the docs clarkb linked suggest one thread per install id | 17:36 |
corvus | well, it says "Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently." | 17:36 |
corvus | what's a client id? | 17:36 |
tobiash | yes, that's the safe side, but that imposes scalability problems :/ | 17:36 |
tobiash | client id is either an installation or user | 17:37 |
tobiash | so when using app auth you have a rate limit per installation (most of the time github organization) | 17:37 |
corvus | so for most of us, we'd just end up back at 'one therad' | 17:37 |
corvus | one thread | 17:37 |
corvus | oh wait | 17:37 |
tobiash | which is basically back to queuing | 17:38 |
corvus | no an installation is app-in-a-project | 17:38 |
corvus | so wouldn't that be one thread per project? | 17:38 |
tobiash | most users install the app on the org | 17:38 |
tobiash | as best practice | 17:38 |
tobiash | you can install it on org or individual repos | 17:38 |
tobiash | so in our case we have a large org which would fall back to be single threaded, which would break us again | 17:39 |
tobiash | maybe we can try out the grey area of few but > 1 threads | 17:39 |
tobiash | pabelanger: it would be great to see all get requests in the log around that incident | 17:40 |
tobiash | to see if we polled 1000 pull requests from ansible (which for sure would quickly trigger the abuse limit when multithreaded) | 17:41 |
corvus | i think "grep zuul.Github debug.log" would be safe to share? | 17:41 |
pabelanger | tobiash: yah, we have 1.9k open PRs on ansible/ansible | 17:42 |
corvus | pabelanger, tobiash: it would be also interesting to know if we see 403 errors after successfully populating the LRU caches | 17:42 |
pabelanger | corvus: I think do, I haven't seen any sensitive data | 17:42 |
pabelanger | so* | 17:42 |
tobiash | maybe we didn't even successfully populate the lru caches | 17:43 |
openstackgerrit | Merged zuul/zuul master: cli: prevent exception for the show command https://review.opendev.org/664235 | 17:44 |
corvus | tobiash, pabelanger: http://paste.openstack.org/show/752945/ http://paste.openstack.org/show/752946/ http://paste.openstack.org/show/752947/ http://paste.openstack.org/show/752948/ | 17:47 |
corvus | that's the first 45 minutes after the most recent startup | 17:47 |
corvus | (in 500 line chunks in order) | 17:48 |
corvus | oh oops, that's still too long, they got cut off | 17:48 |
tobiash | maybe you can also further grep for 'GET' to just see the requests (otherwise I'll do it locally) | 17:49 |
corvus | tobiash, pabelanger: http://paste.openstack.org/show/752949/ through http://paste.openstack.org/show/752957/ inclusive | 17:49 |
corvus | that's 200 line chunks | 17:50 |
corvus | there are raw links, you can fetch http://paste.openstack.org/raw/752957/ etc and recombine them | 17:50 |
corvus | okay, i'm going to leave you to it | 17:51 |
tobiash | already on it | 17:51 |
tobiash | ftr, this does the job: for i in `seq 752949 752957`; do wget http://paste.openstack.org/raw/$i ; done | 17:52 |
mnaser | https://zuul-ci.org/docs/zuul-jobs/install.html | 17:55 |
mnaser | is this wrong? should the source be "opendev" and not "zuul-git" ? | 17:55 |
pabelanger | mnaser: yah, looks wrong | 17:56 |
mnaser | and i assume it should be zuul/zuul-jobs ? | 17:56 |
pabelanger | maybe fall out from mass rename | 17:56 |
pabelanger | yup | 17:56 |
tobiash | I see the 403 only during/after sequences like those: http://paste.openstack.org/show/752960/ | 17:57 |
mnaser | guilhermesp: wanna push up a patch to zuul/zuul-jobs to fix that so it has 'opendev' under sources and zuul/zuul-jobs instead ? | 17:57 |
guilhermesp | sure mnaser let me see | 17:58 |
tobiash | and there are many sequences like those | 17:58 |
tobiash | so I think that lru cache is also less effective than expected initially (I think we saw this also when introducing it) | 17:58 |
tobiash | and on each of those occasions we're doing multiple sweeps in parallel | 18:00 |
tobiash | so we could either serialize status events or try out graphql for this which would eliminate those sweeps completely | 18:00 |
pabelanger | tobiash: so, do you think we need to address that before the next zuul release? Or revert parallel, release, then work on your suggestions | 18:08 |
tobiash | hrm, I'd have to re-add parallel then to my deployment, as without it I'm doomed | 18:09 |
openstackgerrit | Guilherme Steinmuller Pimentel proposed zuul/zuul-jobs master: Fix installation docs https://review.opendev.org/665443 | 18:10 |
tobiash | I just tried out the getpull by sha via graphql and rate limit api tells me that I have 5000 bucks per hour and the request costs 1 | 18:10 |
pabelanger | I can still upgrade zuul.a.c to latest master, and see how often we see the 403 exceptions, but I would imagine it will be the same frequency as zuul.o.o. But I won't be able to do that now, I am AFK in 30mins or so | 18:11 |
pabelanger | but, I do think it will introduce a new issue for us to debug | 18:11 |
tobiash | example graphql: http://paste.openstack.org/show/752961/ | 18:11 |
mnaser | thanks guilhermesp ^ | 18:11 |
tobiash | ok so how about revert parallel, release, re-add parallel with fewer threads (4-8? during regular requests in the logs the abuse detection didn't kick in) + getpullbysha via graphql? | 18:13 |
pabelanger | I'm game for what everybody else wants :) | 18:14 |
tobiash | but yeah we definitely need to address this before a release as at least getpullbysha on large repos kind of guarantees to trigger the abuse detection | 18:16 |
*** jamesmcarthur_ has joined #zuul | 18:17 | |
pabelanger | agree, sounds like this is something we should address. I'm not sure what is involved to move to graphql honestly. | 18:18 |
tobiash | or if a release is not that urgent we could skip the revert and directly change thread count and graphql | 18:18 |
pabelanger | I believe SpamapS was looking for a new release, but in our case if we did the revert of parallel, we could install latest master for all the fixes we need in zuul.a.c | 18:21 |
*** jamesmcarthur has quit IRC | 18:21 | |
pabelanger | then work to address parallel issues | 18:21 |
pabelanger | so we don't 'need | 18:21 |
pabelanger | err | 18:21 |
pabelanger | so we don't 'need' a release, we can pick it up from pre-release | 18:21 |
tobiash | this is a graphql example that contains exactly the information we need in getpullbysha: http://paste.openstack.org/show/752962/ | 18:22 |
tobiash | just has to be templated with the sha, send to github as post using the github client session and do some json parsing | 18:22 |
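Editor's note: a rough sketch of that flow. The query shape below (commit lookup by oid with associatedPullRequests) is an assumption and may differ from the pastes above, and the token handling is simplified.

```python
import requests

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Assumed query shape: look the commit up by oid and ask for the pull
# requests associated with it.  The query tobiash pasted may differ.
QUERY = """
query($owner: String!, $repo: String!, $sha: GitObjectID!) {
  repository(owner: $owner, name: $repo) {
    object(oid: $sha) {
      ... on Commit {
        associatedPullRequests(first: 5) {
          nodes { number state headRefOid }
        }
      }
    }
  }
}
"""

def get_pull_by_sha(session: requests.Session, token: str,
                    owner: str, repo: str, sha: str):
    resp = session.post(
        GITHUB_GRAPHQL_URL,
        headers={"Authorization": f"Bearer {token}"},
        json={"query": QUERY,
              "variables": {"owner": owner, "repo": repo, "sha": sha}},
    )
    resp.raise_for_status()
    commit = resp.json()["data"]["repository"]["object"]
    if not commit:
        return None
    return commit["associatedPullRequests"]["nodes"]
```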
*** jamesmcarthur_ has quit IRC | 18:24 | |
corvus | i lean toward revert parallel, release, then spend a week or so getting it right before we release again :) | 18:29 |
pabelanger | okay, I've also removed my wip on https://review.opendev.org/665429/ (the revert) | 18:29 |
pabelanger | looking at zuul.a.c logs again, I've only had 2 instances of retries today. Github is working well it seems :D | 18:30 |
tobiash | then it's decided | 18:30 |
pabelanger | but did confirm a 502 Server Error will not have a 'Retry-After' header | 18:30 |
fungi | well, or at least some 502 server errors won't | 18:31 |
fungi | can't say for sure that they all don't | 18:31 |
pabelanger | true | 18:31 |
corvus | pabelanger: oh, i missed the link to a 502 without a retry, can you paste that again? | 18:31 |
pabelanger | let me find it | 18:32 |
pabelanger | http://paste.openstack.org/show/752964/ | 18:33 |
pabelanger | then | 18:34 |
pabelanger | 2019-06-14 16:07:12,081 DEBUG zuul.GithubEventConnector: [delivery: f506a3a6-8ebd-11e9-935d-2fe311e1498f] Got PR on project ansible/ansible for sha ba7b2f053d487e533c82b17cb194619b33cbde4f | 18:34 |
pabelanger | so it worked as expected | 18:34 |
pabelanger | okay, I have to wrap up for now. Thanks again everybody for help on github driver this week, much appreciated. I believe this will go a long way to making zuul better for ansible org! | 18:35 |
pabelanger | will check back later this evening | 18:36 |
openstackgerrit | Merged zuul/zuul-jobs master: Fix installation docs https://review.opendev.org/665443 | 18:58 |
*** jamesmcarthur has joined #zuul | 19:02 | |
tobiash | grr, github apps are not allowed to use search in graphql :/ | 19:16 |
tobiash | so that option is gone | 19:16 |
*** gtema has quit IRC | 19:30 | |
*** pcaruana has quit IRC | 19:39 | |
*** jamesmcarthur has quit IRC | 19:41 | |
*** jamesmcarthur has joined #zuul | 19:42 | |
*** jamesmcarthur has quit IRC | 19:44 | |
*** jamesmcarthur has joined #zuul | 19:44 | |
*** jamesmcarthur has quit IRC | 20:00 | |
*** jamesmcarthur has joined #zuul | 20:04 | |
tobiash | pabelanger, corvus: I'm currently trying out some things using the search api. And I think we can and should go with the search api instead. It has a rate limit of 30 per minute which is one every two seconds on average per installation id. It's unlikely that we hit that. I'd combine this with search related rate limit handling. With that we should be on the safe side and can remove this lru cache entirely. | 20:05 |
*** jamesmcarthur has quit IRC | 20:05 | |
corvus | tobiash: have we determined whether it's merely the startup of the lru cache that's at issue? | 20:09 |
tobiash | the lru cache in this case is suboptimal and a workaround in any case, so I think we should use search+rate limit handling anyway | 20:10 |
corvus | (i thought the lru cache was a pretty good idea -- after all, it may be able to eliminate some queries entirely?) | 20:10 |
corvus | iirc, the idea was we get an event for a sha, we do one query on it, then as different people and bots leave comments and updates on that sha for the next hour or so, we don't have to look it up again | 20:12 |
tobiash | well the initial idea was not to have to iterate over all pull requests on every status event. For this use case the cache was a (complicated) workaround which I think I remember worked somehow but was not as effective as we hoped. | 20:14 |
corvus | basically, "find a pr by sha" is expensive either way -- now it takes many requests (and many seconds) out of a large limit, if we switch to search, it will take one request out of a much lower limit. so it seems like either way, the cache would be helpful. | 20:14 |
clarkb | it does work (it also populates the cache when it does the scan for all known shas) | 20:14 |
clarkb | so it isn't just the current PR that it iwll remember | 20:14 |
clarkb | where it doesn't work is when you have other CI systems doing post merge testing | 20:14 |
clarkb | because those commit shas cannot be known | 20:14 |
tobiash | the limit is high enough that it's really hard to hit, essentially only by trying to hit it in my local testing | 20:15 |
corvus | clarkb: what do you mean by "cannot be known?" | 20:15 |
tobiash | the sha changes on merge | 20:15 |
tobiash | that was the problem that invalidates the cache in ansible on every post job | 20:15 |
clarkb | corvus: when ansible merges say 3 changes in a row. Only one of the three post merge shas can be calculated before they merge | 20:15 |
corvus | sure, but if you search for it, you will find it, right? | 20:16 |
clarkb | corvus: yes | 20:16 |
clarkb | the cache can't cache it though | 20:16 |
corvus | well, wait, why can't it cache it? | 20:16 |
clarkb | so for that particular case the cache isn't helpful | 20:16 |
clarkb | because when PR1 triggers a cache update the sha for PR2 post merge isn't known | 20:17 |
clarkb | so you basically always hit the search case when you get post merge test results | 20:17 |
corvus | why are there multiple PR's in the example? | 20:17 |
clarkb | corvus: because for the first PR it is cacheable due to github telling you what the sha will be if this change merges (and nothing else merges first) | 20:18 |
clarkb | its only a problem when you start merging multiple PRs in succession | 20:18 |
corvus | i think we're looking at this in very different ways | 20:18 |
clarkb | basically when those events come in a large percentage of the time the data is not present in the cache | 20:19 |
clarkb | so for that particular set of events the cache is not helpful. For all of the pre merge events its super helpful | 20:19 |
corvus | all i see is a cache which maps sha -> PR. sometimes a PR gets new shas. that can be because of a new commit being added, or it can be cause of a merge. in any case, when a PR gets a new sha, it's a cache miss, and a cache miss means we (currently) query the whole list of PRs. | 20:19 |
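Editor's note: a rough model of what corvus is describing, for readers who have not looked at the driver; this is illustrative, and listPullRequests() is a hypothetical helper standing in for the expensive sweep.

```python
# Rough model of the sha -> PR cache being discussed; not zuul's actual code,
# and listPullRequests() is a hypothetical helper standing in for the sweep.
class PullRequestCache:
    def __init__(self, connection):
        self.connection = connection
        self._by_sha = {}        # sha -> PR metadata

    def get(self, project, sha):
        pr = self._by_sha.get(sha)
        if pr is not None:
            return pr            # hit: later events for a known sha are free
        # Miss: today this means listing every open PR in the project
        # (expensive on ansible/ansible); with search it would be one query.
        for pr in self.connection.listPullRequests(project):
            self._by_sha[pr["head_sha"]] = pr
        return self._by_sha.get(sha)
```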
corvus | i understand you're saying that generating lots of events on PRs post-merge means we see a lot of new shas which are cache misses. | 20:20 |
clarkb | yes and that causes backup problems | 20:20 |
clarkb | (which is the sort of thing tobiash is trying to fix) | 20:20 |
corvus | tobiash: is this the main culprit in terms of time we're spending on api queries? ie, is this the main thing the parallelization patch was designed to address? | 20:21 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Switch getPullBySha to using the search api https://review.opendev.org/665469 | 20:22 |
tobiash | corvus, clarkb: this uses search and handles the rate limit gracefully ^ | 20:23 |
tobiash | in local testing with an endless loop it takes around 45 seconds to reach the limit (so worst case scenario). The code handles this gracefully, waits for the time github tells us to wait (a few seconds in testing) and after that it right away has another 30 requests left | 20:24 |
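Editor's note: for reference, a minimal sketch of a search-based lookup that waits for the documented reset time when the search limit is hit; the query string is one plausible shape and the structure is illustrative rather than the code in 665469.

```python
import time

import requests

def search_pull_by_sha(session: requests.Session, sha: str, repo: str):
    # One search call replaces iterating over every open PR.  The search API
    # allows roughly 30 requests per minute, so honor its reset time on 403.
    url = "https://api.github.com/search/issues"
    params = {"q": f"{sha} type:pr repo:{repo}"}
    while True:
        resp = session.get(url, params=params)
        if (resp.status_code == 403
                and resp.headers.get("X-RateLimit-Remaining") == "0"):
            reset = float(resp.headers.get("X-RateLimit-Reset",
                                           time.time() + 60))
            time.sleep(max(reset - time.time(), 1))
            continue
        resp.raise_for_status()
        items = resp.json().get("items", [])
        return items[0] if items else None
```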
corvus | tobiash: see my question about the parallization patch? | 20:26 |
tobiash | still reading backlog | 20:26 |
corvus | ack. i'm trying to understand whether making getpullbysha faster would mean we don't need that. | 20:27 |
tobiash | corvus: regarding the parallelization, this is not only about getpullbysha, even without it we're running into queueing because of serialization | 20:28 |
tobiash | this is the thing the parallelization patch addresses | 20:28 |
tobiash | and basically the current implementation of getpullbysha abuses this and triggers the abuse protection | 20:29 |
corvus | well, maybe -- i mean, github won't promise that any amount of parallelization is acceptable. | 20:29 |
tobiash | so switching to search should resolve this mostly and we should maybe combine this with limiting the threads to like 4 or 8 | 20:29 |
tobiash | per installation, but I think we should try out a little bit of the grey area | 20:30 |
tobiash | otherwise we should at least parallelize and lock per instance id | 20:31 |
tobiash | and from the log you posted the only problematic occurrences were due to getpullbysha | 20:33 |
tobiash | so I'd suggest limiting the threads and using search. If then there are still issues, limit parallelization per installation id | 20:34 |
corvus | okay, let me propose some things: 1) we use the search api like in your patch, but we *also* keep the cache. that should still help in the cases where multiple events arrive for a sha in succession (still fairly common). 2) we parallelize per installation id. 3) we add an undocumented option to increase the number of threads per instance id (and maybe even an RPC call to change it if you want to get | 20:34 |
corvus | fancy) so we can observe the grey area. if we think it's necessary/useful, we document the option (but leave the default as 1) | 20:34 |
corvus | oh, and if we find it's not useful or dangerous, we remove the option | 20:35 |
tobiash | sounds like a good plan | 20:36 |
corvus | clarkb, pabelanger: ^ | 20:36 |
clarkb | corvus: sounds good to me | 20:36 |
clarkb | in this case an installation id is the app on the github side? | 20:37 |
clarkb | so opendev has 1? | 20:37 |
corvus | clarkb: it's an app installed in an org or repo | 20:37 |
corvus | clarkb: so opendev has a handful of them | 20:37 |
clarkb | ah ok so its the application of our app into the projects | 20:37 |
corvus | yep. it's kind of annoying that there's an incentive to install the app for each repo rather than at the org level (in order to dodge the abuse limit). | 20:41 |
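Editor's note: a sketch of points 2 and 3 of corvus's proposal above: one semaphore per installation id, with the (assumed, undocumented) knob to allow more than one concurrent request while exploring the grey area. Names are illustrative.

```python
import threading

class InstallationThrottle:
    """Limit concurrent GitHub requests per installation id.

    max_concurrency=1 gives the serial behaviour GitHub's abuse docs ask
    for; a higher value explores the grey area discussed above.
    """

    def __init__(self, max_concurrency: int = 1):
        self._max = max_concurrency
        self._lock = threading.Lock()
        self._semaphores = {}

    def slot(self, installation_id: int) -> threading.Semaphore:
        with self._lock:
            if installation_id not in self._semaphores:
                self._semaphores[installation_id] = threading.Semaphore(self._max)
            return self._semaphores[installation_id]

# usage sketch:
#   with throttle.slot(install_id):
#       session.get(...)
```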
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Switch getPullBySha to using the search api https://review.opendev.org/665469 | 20:46 |
tobiash | that retains the cache ^ | 20:47 |
tobiash | tested locally against github.com | 20:48 |
tobiash | so 1) check, 2) is a little bit more complicated, so I won't do this today ;) | 20:49 |
tobiash | corvus: when parallelizing per installation id, do we still want to keep global ordering or do you think ordering per installation id is sufficient? | 20:50 |
corvus | tobiash: i'm inclined to say that global ordering would be best. it probably would be mostly okay to order per installation id, but maybe there would be times where cross-project dependencies might end up wrong if we did that? | 20:53 |
tobiash | k, I'm also unsure about that, I just wanted to think about whether this is a compromise we can make in favor of simpler implementation and scalability | 20:54 |
corvus | tobiash: yeah, the one thing i know is that sequencing per install-id will require more thought. :) | 20:55 |
clarkb | corvus: I think that may be the case considering Depends-On doesn't specify a sha but instead specifies the container object (change/PR) | 20:55 |
tobiash | yepp | 20:55 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Switch getPullBySha to using the search api https://review.opendev.org/665469 | 20:56 |
corvus | clarkb: yeah, i'm coming up with similar thoughts. that, basically, we could enqueue A1->B1 even after B2 has been pushed, then later, we see B2 and abort the job because it's not current. so in the end, zuul won't test the wrong thing, but users would be confused because from their POV, A1 should have tested with B2 to start with. | 20:57 |
corvus | (and they'd see one of those annoying cryptic 'dependency unable to merge' messages) | 20:57 |
tobiash | yeah, that's one of those edge cases | 20:57 |
tobiash | ok, I'll keep the global orderung | 20:58 |
corvus | ++ | 20:58 |
tobiash | *ordering | 20:58 |
*** zbr|ruck has quit IRC | 21:13 | |
corvus | ofosos: i left comments on the bitbucket series -- i think it's really close. mostly a couple of subtle things which i think our abbreviated developer docs don't make very clear, sorry. | 21:20 |
*** zbr has joined #zuul | 21:30 | |
ofosos | corvus: woohoo, thank you very much | 21:36 |
*** EvilienM is now known as EmilienM | 22:03 | |
*** rlandy|ruck has quit IRC | 22:09 | |
pabelanger | corvus: tobiash: clarkb: +1 plan looks good. | 22:25 |
pabelanger | tobiash: looks like test case failed on your search api update | 22:25 |
*** rfolco has quit IRC | 22:44 | |
*** igordc has quit IRC | 22:45 | |
SpamapS | ofosos: I'll try and find some time next week to look at your next pushes. Thanks for hanging in there, it's really awesome to add another flower to the zuul bouquet of drivers. :-D | 23:02 |
clarkb | I managed to do a quick skim but didn't leave much useful feedback (just one code cleanup suggestion) | 23:03 |
*** igordc has joined #zuul | 23:13 | |
*** mattw4 has quit IRC | 23:25 | |
*** mattw4 has joined #zuul | 23:25 | |
openstackgerrit | Merged zuul/zuul master: Revert "Parallelize github event processing" https://review.opendev.org/665429 | 23:37 |