pabelanger | looks like we didn't properly dequeue the previous version of the PR for some reason | 00:00 |
pabelanger | looking at https://github.com/ansible/ansible/pull/55199 it seems to have ~14 commits pushed close together, wonder if that has something to do with it | 00:01 |
pabelanger | http://paste.openstack.org/show/752908/ | 00:02 |
pabelanger | I'd only expect 1 change there, not 3 | 00:03 |
*** ianychoi has joined #zuul | 00:19 | |
*** mattw4 has quit IRC | 01:01 | |
*** swest has quit IRC | 01:09 | |
*** swest has joined #zuul | 01:24 | |
*** igordc has quit IRC | 01:35 | |
*** bhavikdbavishi has joined #zuul | 01:58 | |
*** bhavikdbavishi1 has joined #zuul | 02:01 | |
*** bhavikdbavishi has quit IRC | 02:02 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 02:02 | |
pabelanger | okay, looking at https://github.com/ansible/ansible/pull/55199 more, I _think_ they somehow used the web ui to make code changes to the PR | 02:03 |
pabelanger | which resulted in zuul enqueuing the same change 6 times into check | 02:03 |
pabelanger | http://paste.openstack.org/show/752910/ was the event from github | 02:17 |
pabelanger | however, I am unsure why zuul triggered on it, the action was synchronize | 02:17 |
pabelanger | oh, that seems to map to 'changed' for zuul | 02:19 |
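Editor's note: the webhook's `synchronize` action (new commits pushed to a PR) is what zuul's github driver exposes as the `changed` pull_request action. A purely illustrative sketch of that kind of translation follows; the dict and function names are assumptions, not the actual driver code.

```python
# Illustrative mapping from GitHub webhook "action" values to the action
# names a zuul pipeline trigger matches on.  The real translation lives in
# zuul's github driver; this dict and function are assumptions for the example.
WEBHOOK_TO_TRIGGER_ACTION = {
    "opened": "opened",
    "synchronize": "changed",   # commits (re)pushed to the pull request
    "edited": "edited",
    "reopened": "reopened",
}

def trigger_action_for(webhook_action: str) -> str:
    # Fall back to the raw webhook action if no explicit mapping exists.
    return WEBHOOK_TO_TRIGGER_ACTION.get(webhook_action, webhook_action)
```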
*** rlandy|ruck has quit IRC | 02:24 | |
*** bhavikdbavishi has quit IRC | 02:26 | |
pabelanger | okay, will have to continue this in the morning, I think it is possible we are getting more than 1 Change ID for a PR, which seems to not dequeue the previous Change, because they are different | 02:34 |
*** michael-beaver has quit IRC | 02:43 | |
*** dmellado_ has joined #zuul | 03:08 | |
*** dmellado has quit IRC | 03:08 | |
*** dmellado_ is now known as dmellado | 03:08 | |
*** mugsie has quit IRC | 03:22 | |
*** EmilienM has quit IRC | 03:22 | |
*** kklimonda has quit IRC | 03:22 | |
*** mgagne has quit IRC | 03:22 | |
*** persia has quit IRC | 03:22 | |
*** sean-k-mooney has quit IRC | 03:22 | |
*** bhavikdbavishi has joined #zuul | 03:31 | |
*** mugsie has joined #zuul | 03:38 | |
*** EmilienM has joined #zuul | 03:38 | |
*** kklimonda has joined #zuul | 03:38 | |
*** mgagne has joined #zuul | 03:38 | |
*** persia has joined #zuul | 03:38 | |
*** sean-k-mooney has joined #zuul | 03:38 | |
*** irclogbot_3 has quit IRC | 03:41 | |
*** irclogbot_1 has joined #zuul | 03:43 | |
*** igordc has joined #zuul | 04:12 | |
*** igordc has quit IRC | 04:19 | |
*** swest has quit IRC | 04:33 | |
*** pcaruana has joined #zuul | 05:09 | |
*** swest has joined #zuul | 05:14 | |
*** swest has quit IRC | 05:17 | |
*** swest has joined #zuul | 05:18 | |
*** raukadah is now known as chandankumar | 05:21 | |
*** bhavikdbavishi has quit IRC | 05:21 | |
*** bhavikdbavishi has joined #zuul | 05:22 | |
*** gtema has joined #zuul | 06:24 | |
*** tobiash has joined #zuul | 06:45 | |
tobiash | zuul-maint: looks like zuul-web got a memleak here: https://paste.pics/9f7cf9a259c39876361b49b1ebebcd29 | 06:45 |
tobiash | does anyone experience the same? | 06:46 |
tobiash | unfortunately it's hard to debug because sigusr2 seems to print the stack trace and objects but also kills it :/ | 06:46 |
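Editor's note: one way around that problem, sketched below as a standalone idea rather than zuul's actual signal handling, is to hook a different signal to log a summary of live objects and return, so the process survives repeated dumps. The objgraph library is third-party and its availability here is an assumption.

```python
import gc
import logging
import signal

import objgraph  # third-party; availability is an assumption here

log = logging.getLogger("memdebug")

def dump_objects(signum, frame):
    # Log a summary of live objects and return, so the process keeps running
    # and repeated dumps can be compared while the leak grows.
    gc.collect()
    log.warning("live objects: %d", len(gc.get_objects()))
    for name, count in objgraph.most_common_types(limit=20):
        log.warning("%-40s %d", name, count)

# SIGUSR1 is chosen only to avoid colliding with zuul's own SIGUSR2 handler.
signal.signal(signal.SIGUSR1, dump_objects)
```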
tobiash | those should be the changes that went into this update: https://etherpad.openstack.org/p/ipW8nmqqPn | 06:59 |
tristanC | tobiash: perhaps one of the requirements also changed? | 07:20 |
tobiash | maybe | 07:21 |
tristanC | tobiash: fwiw we are also running with the commits that are bolded in the etherpad without issue | 07:24 |
tristanC | tobiash: e.g. SQLAlchemy 1.3.4 released May 28 | 07:31 |
*** hashar has joined #zuul | 07:39 | |
*** jpena|off is now known as jpena | 07:50 | |
*** zbr|ooo is now known as zbr|ruck | 08:01 | |
*** sanjayu_ has joined #zuul | 08:16 | |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Zuul Web: add /api/user/authorizations endpoint https://review.opendev.org/641099 | 08:16 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: authentication config: add optional token_expiry https://review.opendev.org/642408 | 08:17 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: [WIP] admin REST API: zuul-web integration https://review.opendev.org/643536 | 08:17 |
*** jangutter has joined #zuul | 08:18 | |
*** hashar has quit IRC | 08:33 | |
*** hashar has joined #zuul | 08:40 | |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Set same timeout for py35/36 jobs in check and gate https://review.opendev.org/665331 | 09:01 |
fbo | Any reason the timeout was different between check and gate py35/36 jobs? The pagure driver patch failed in the gate due to a timeout. In check we have 1 hour but in gate 45 minutes. ^ | 09:04 |
tobiash | fbo: it used to be 45 minutes and the bump to 1 hour probably just forgot the gate | 09:09 |
fbo | tobiash: alright, yes it seems to be a good reason :) | 09:19 |
fbo | I'll recheck the pagure driver patch after we this .zuul.yaml change then | 09:20 |
fbo | *merge* | 09:21 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: [WIP] admin REST API: zuul-web integration https://review.opendev.org/643536 | 09:44 |
*** bhavikdbavishi has quit IRC | 09:56 | |
*** hashar has quit IRC | 09:59 | |
*** sanjayu_ has quit IRC | 10:11 | |
*** sanjayu_ has joined #zuul | 10:12 | |
*** rfolco has quit IRC | 10:13 | |
*** rfolco has joined #zuul | 10:14 | |
*** gtema has quit IRC | 10:24 | |
*** gtema_ has joined #zuul | 10:24 | |
*** gtema_ is now known as gtema | 10:24 | |
*** sanjayu_ has quit IRC | 10:36 | |
*** saneax has joined #zuul | 10:38 | |
*** panda is now known as panda|lunch | 11:28 | |
*** jpena is now known as jpena|lunch | 11:36 | |
*** EmilienM is now known as EvilienM | 11:45 | |
*** rf0lc0 has joined #zuul | 12:21 | |
*** rfolco has quit IRC | 12:21 | |
*** tobias-urdin has joined #zuul | 12:25 | |
tobias-urdin | https://review.opendev.org/#/c/637339/ yay looking forward to the next zuul and nodepool releases :) | 12:25 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: [WIP] admin REST API: zuul-web integration https://review.opendev.org/643536 | 12:37 |
*** jpena|lunch is now known as jpena | 12:37 | |
*** rlandy has joined #zuul | 12:38 | |
*** rlandy is now known as rlandy|ruck | 12:38 | |
*** panda|lunch is now known as panda | 12:55 | |
*** rf0lc0 has quit IRC | 13:35 | |
*** rfolco has joined #zuul | 13:42 | |
*** rfolco has quit IRC | 13:46 | |
*** rfolco has joined #zuul | 13:46 | |
*** chandankumar is now known as raukadah | 13:59 | |
*** saneax has quit IRC | 14:06 | |
*** rfolco has quit IRC | 14:20 | |
*** rfolco has joined #zuul | 14:21 | |
corvus | tobias-urdin: yeah, we were planning on releasing them today, but we found an issue with github interaction that we want to clear up first | 14:24 |
*** rfolco is now known as rfolco|off | 14:25 | |
corvus | tobiash: i don't see radical memory use changes on the server where we're running zuul-web: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all (though that reporting is not as detailed) | 14:28 |
tobiash | corvus: k, thanks, I'll look next week more in depth into this | 14:29 |
tobiash | for now I just increased the resource limit and let openshift handle the restarts :) | 14:29 |
corvus | yay cloud | 14:29 |
Shrews | corvus: mordred: It looks like next Wednesday is going to be my last full day for a while. Will be gone until at least July 8th. I'll let you two decide what to do with the autohold changes (which I consider complete at this point) in relation to the web admin stuff. | 14:29 |
Shrews | certainly not urgent to merge, but will require changes if merged after the web admin stuff | 14:31 |
corvus | Shrews: yeah, i'm expecting it to be relatively straightforward api changes rather than more complicated structural changes, so i expect we can update the patches | 14:33 |
Shrews | k k | 14:33 |
corvus | i'm restarting opendev's zuul to pick up the new github fixes | 14:41 |
corvus | we could release nodepool now, but i'd like to release them at the same time because of the node python path change (it's not strictly necessary, but it seems weird not to) | 14:42 |
*** mgariepy has joined #zuul | 14:46 | |
corvus | pabelanger: 2019-06-14 14:44:39,652 ERROR zuul.GithubEventProcessor: [e: c657f89e-8eb2-11e9-908a-1f05e5124b4f] Failed handling status event; remote said retry after 60,will retry attempt 1/5 in 60.0 seconds | 14:46 |
corvus | pabelanger: that was 403 as a first response (ie, it wasn't after a 502) | 14:46 |
pabelanger | neat, lets see what happens | 14:46 |
corvus | pabelanger, tobiash: that makes me think that in addition to the 502 errors causing retries, we're *also* simply running into rate limits probably due to the parallelization change | 14:47 |
tobiash | corvus: yes, probably | 14:47 |
tobiash | before we just queued on our side with no chance reaching the limit because of serialization | 14:47 |
pabelanger | would agree | 14:47 |
tobiash | I think there is still some potential to reduce api calls (e.g. some calls due to github3.py object model) | 14:49 |
tobiash | and the getpullbysha can be optimized using graphql with one request | 14:50 |
corvus | pabelanger: 2019-06-14 14:51:08,371 ERROR zuul.GithubEventProcessor: [e: c657f89e-8eb2-11e9-908a-1f05e5124b4f] Failed handling status event; remote said retry after 60,will retry attempt 5/5 in 60.0 seconds | 14:51 |
corvus | pabelanger: perhaps the retry-after should be +5 seconds? | 14:52 |
pabelanger | corvus: yah, that was my fear, we needed to add a buffer | 14:52 |
pabelanger | sounds like a good idea | 14:52 |
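Editor's note: a minimal sketch of the buffering idea being discussed, assuming the Retry-After value is read off the HTTP response; this is illustrative, not the exact change proposed in 665415.

```python
import time

import requests

RETRY_PAD = 5.0             # buffer on top of what GitHub asks for
DEFAULT_RETRY_AFTER = 60.0  # fallback when no Retry-After header is present

def wait_before_retry(response: requests.Response) -> None:
    # Abuse-detection 403s carry a Retry-After header; 502s from the backend
    # generally do not, so fall back to a fixed delay in that case.
    try:
        delay = float(response.headers.get("Retry-After", DEFAULT_RETRY_AFTER))
    except (TypeError, ValueError):
        delay = DEFAULT_RETRY_AFTER
    time.sleep(delay + RETRY_PAD)
```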
corvus | tobiash: do you think we also need to calculate our internal rate and slow down? we do that in openstacksdk (it's code inherited from nodepool). we don't necessarily have to ask github for the rate after each request, we could just do it once and count our own calls. we could probably avoid hitting the limit most of the time that way. | 14:56 |
corvus | tobiash: or do you think we should just try to honor the retry-after header and see if that will take care of it? | 14:56 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Pad retry_after logic for github when we raise an exception https://review.opendev.org/665415 | 15:00 |
pabelanger | tobiash: corvus: ^ but I also wonder if somehow multiple retries for different events are also going to trigger the 403 abuse api | 15:03 |
pabelanger | do we need to wait 'retry_after' for any API call, or just to the specific api? | 15:03 |
tobiash | corvus: we don't have a mechanism yet to globally throttle the requests | 15:06 |
*** hashar has joined #zuul | 15:15 | |
pabelanger | tobiash: I wonder if github3.py exposes anything, IIRC they are using python requests | 15:17 |
fungi | score! http://logs.openstack.org/19/663119/1/check/zuul-jobs-test-registry/6109606/ara-report/result/35f53315-15d6-4f5a-8e24-aa07a2796913/ | 15:17 |
fungi | looks like https://review.opendev.org/663119 is ready to be reviewed | 15:17 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Pad retry_after logic for github when we raise an exception https://review.opendev.org/665415 | 15:22 |
pabelanger | ^also cleaned up traceback message, to add space after comma :) | 15:22 |
corvus | thanks, that would have annoyed me to no end :) | 15:23 |
pabelanger | Ooo | 15:24 |
pabelanger | 2019-06-14 14:55:33,759 ERROR zuul.GithubEventProcessor: [e: 75a44c70-8eb4-11e9-94c5-7a90e0072678] Failed handling pull_request event; remote said retry after 60,will retry attempt 1/5 in 60.0 seconds | 15:24 |
pabelanger | however, the retry worked | 15:24 |
*** rfolco has joined #zuul | 15:24 | |
pabelanger | I can see zuul processing that event | 15:24 |
pabelanger | interesting enough | 15:24 |
pabelanger | it looks like all the 5/5 retries are from the status event | 15:24 |
pabelanger | but that pull_request did work as expected | 15:25 |
*** rfolco|off has quit IRC | 15:25 | |
*** mgariepy has left #zuul | 15:26 | |
pabelanger | http://paste.openstack.org/show/752933/ | 15:26 |
pabelanger | that is a snip of logic working | 15:26 |
*** guilhermesp has joined #zuul | 15:30 | |
openstackgerrit | Merged zuul/zuul master: Set same timeout for py35/36 jobs in check and gate https://review.opendev.org/665331 | 15:32 |
pabelanger | fbo: thanks, I didn't think to check that the other day^ | 15:33 |
pabelanger | also seen a timeout in gate | 15:33 |
corvus | pabelanger: is the zuul.ansible.com instance watching ansible/ansible? | 15:38 |
pabelanger | corvus: yes, we are reporting results back too | 15:38 |
corvus | pabelanger: you're seeing the same errors there, right? | 15:39 |
pabelanger | corvus: when I looked last week yes, I can check now, give me a moment | 15:39 |
corvus | pabelanger: if this is going to be an iterative process, perhaps it would be easier for you to test out changes like 665415 on that instance, and then, once you're satisfied with the result, we could restart opendev as a final verification? | 15:40 |
pabelanger | corvus: c657f89e-8eb2-11e9-908a-1f05e5124b4f we also see that failure today | 15:40 |
corvus | it's... not easy to restart opendev, and we lose about 1500 compute-hours worth of work each time we do it... :/ | 15:41 |
corvus | probably more, really | 15:41 |
pabelanger | corvus: sure, that seems reasonable, I'd apply it now however we are not running the new multi processing patch tobiash landed. That is not to say we cannot upgrade to latest master | 15:42 |
corvus | pabelanger: yeah, i think you'll need master + your new patch | 15:42 |
pabelanger | corvus: agree | 15:42 |
corvus | that is interesting that you see that error even without the multiprocessing patch | 15:43 |
pabelanger | looking at zuul.o.o logs, I am concerned that if we group 2 or more retries together, they will both not wait the required '60' seconds for a retry, we just keep fighting each other until 5/5 tries. | 15:43 |
clarkb | pabelanger: yes that is the multiprocessing change | 15:44 |
clarkb | that should be fine if the retry is due to a backend error | 15:44 |
clarkb | but likely less good if being throttled by rate limiter | 15:44 |
pabelanger | I wonder if we should consider backing that out, until we can determine a better way to deal with rate limits | 15:44 |
pabelanger | (multiprocessing change, I mean) | 15:45 |
corvus | pabelanger: yes, i think that's quite likely going to be a problem. maybe you and tobiash can come up with a determination of whether it is, and a recommendation for how we should proceed. | 15:45 |
pabelanger | corvus: okay, I think what we can do on zuul.a.c, is apply the retry-after logic (minus multiprocessing) to even see if 665415 is needed | 15:46 |
corvus | pabelanger: good idea, that will give us another data point | 15:46 |
corvus | pabelanger: then you could also apply 665415 (still without MP) and see if that changes things | 15:47 |
pabelanger | okay, let me work on that. However, I am not sure I will be able to complete that today, have to #dadops early today. But should have time over the weekend to confirm | 15:47 |
pabelanger | corvus: +1 | 15:47 |
corvus | pabelanger: looking at the ...b4f event -- it's interesting that some of the timestamps between the attempts are ~2 minutes apart, suggesting that the request itself still took almost a minute to return a 403. | 15:52 |
*** swest has quit IRC | 15:55 | |
*** jamesmcarthur has joined #zuul | 15:56 | |
*** bhavikdbavishi has joined #zuul | 15:57 | |
pabelanger | okay, I've applied: https://review.opendev.org/665220/ | 15:58 |
pabelanger | and confirmed zuul.a.c is able to enqueue a PR | 15:58 |
pabelanger | so, now we wait for an exception | 15:58 |
pabelanger | 2019-06-14 15:59:43,328 ERROR zuul.GithubEventConnector: [delivery: 281d19ce-8ebd-11e9-8513-c2e73b0c5da4] Failed handling status event; remote said retry after None,will retry attempt 1/5 in 60.0 seconds | 16:00 |
pabelanger | now we wait | 16:00 |
pabelanger | also | 16:00 |
pabelanger | http://paste.openstack.org/show/752935/ | 16:00 |
pabelanger | that is new | 16:00 |
pabelanger | okay, a retry worked | 16:01 |
pabelanger | http://paste.openstack.org/show/752936/ | 16:01 |
pabelanger | this is without any padding | 16:02 |
pabelanger | let me find another instances or 2 | 16:02 |
tobiash | corvus, pabelanger: re rate limit, do you still see frequent 'get all prs of ansible' in getPullBySha? | 16:04 |
tobiash | if yes, that would be likely one reason to hit the rate limit | 16:05 |
tobiash | basically this is a quick stab I got from a github engineer to get a pr by sha: http://paste.openstack.org/show/752937/ | 16:07 |
tobiash | using graphql and this only costs one request | 16:07 |
tobiash | so if getpullbysha is the reason we have our first graphql use case | 16:08 |
tobiash | (and those events listed above *are* status events) | 16:08 |
*** bhavikdbavishi has quit IRC | 16:12 | |
*** mattw4 has joined #zuul | 16:13 | |
pabelanger | 2019-06-14 16:04:21,511 ERROR zuul.GithubEventConnector: [delivery: f506a3a6-8ebd-11e9-935d-2fe311e1498f] Failed handling status event; remote said retry after None,will retry attempt 1/5 in 60.0 seconds | 16:17 |
pabelanger | that was 502 Server Error | 16:17 |
pabelanger | and retry was successful | 16:17 |
pabelanger | \o/ | 16:17 |
pabelanger | so, that at least fixes the original issue | 16:17 |
pabelanger | so far 2 for 2 on retry | 16:17 |
pabelanger | also, we didn't get a 'Retry-After' header on the 502 exception | 16:18 |
*** bhavikdbavishi has joined #zuul | 16:18 | |
pabelanger | tobiash: yah, each one has always been in getPullBySha(), that I have seen on zuul.a.c | 16:20 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Revert "Parallelize github event processing" https://review.opendev.org/665429 | 16:28 |
pabelanger | tobiash: clarkb: corvus: ^revert of parallel github, if we decided to go that route before next zuul release | 16:28 |
pabelanger | I marked it WIP for now | 16:28 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Revert "Parallelize github event processing" https://review.opendev.org/665429 | 16:29 |
*** jangutter has quit IRC | 16:47 | |
*** bhavikdbavishi has quit IRC | 16:51 | |
pabelanger | tobiash: corvus: clarkb: while waiting for zuul.a.c to get more exceptions, last night I noticed the following: http://paste.openstack.org/show/752908/ we had more the 1 PR (same ID) in our check pipeline. Which resulted in zuul multiple test results on the PR: https://github.com/ansible/ansible/pull/55199 I am unsure why we didn't dequeue the previous changes in the check pipeline, but have a feeling because | 16:57 |
pabelanger | they were different change ids | 16:57 |
*** jpena is now known as jpena|off | 16:57 | |
corvus | pabelanger: i saw you mentioned that. are you looking into it? | 16:58 |
mnaser | is there a way to lock down zuul's UI behind any kind of auth or should i go back to basics (i.e. whitelist ips or something on apache layer) | 16:59 |
mnaser | im thinking if there isnt anything native i can try using mod_auth_openidc or something like that | 17:00 |
mnaser | just not sure how many things that might break :) | 17:00 |
corvus | mnaser: no, i think apache should be fine | 17:00 |
mnaser | going to run mod_auth_openidc and see how that works out then :) | 17:01 |
corvus | mnaser: that should certainly work for whitelabeled tenants, or the entire system non-whitelabeled. if you wanted to restrict access to individual tenants without whitelabeling, i think it would mostly work because of the way almost all the urls are tenant-scoped, except for the tenant index itself. | 17:03 |
corvus | not sure which of the 3 situations you're dealing with | 17:03 |
pabelanger | corvus: I am, but needed a little help understanding how we decided to dequeue a previous version of a PR / review. If I understand, it is tied to the Change? | 17:04 |
mnaser | corvus: yeah it's more of a single-tenant deployment in this case so its just protecting the whole thing behind a wall | 17:04 |
corvus | mnaser: cool, i don't anticipate problems; let us know how it goes :) | 17:04 |
corvus | pabelanger: first -- you left this set to the default, right? https://zuul-ci.org/docs/zuul/user/config.html#attr-pipeline.dequeue-on-new-patchset | 17:05 |
pabelanger | corvus: correct, also I didn't know that was an option :) | 17:06 |
corvus | pabelanger: good, forget you ever heard about it :) | 17:06 |
pabelanger | done | 17:07 |
corvus | pabelanger: this is the key bit: https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/github/githubmodel.py#L41 | 17:07 |
mnaser | wee | 17:07 |
mnaser | worked fine (initially) | 17:07 |
mnaser | ill see if other things might break | 17:07 |
corvus | pabelanger: when a new item is enqueued, zuul should iterate over all the other changes in the pipeline and if that's true for any of them, remove those | 17:07 |
pabelanger | great | 17:08 |
pabelanger | that gives me a starting point | 17:08 |
corvus | pabelanger: first thing that comes to mind is that the "updated_at" looks like a weak spot -- what if something raced and that ended up wrong? | 17:09 |
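Editor's note: the check behind the githubmodel.py link above amounts to comparing the queued PR's metadata with the newly enqueued one. A simplified sketch follows; attribute names approximate zuul's GitHub change model rather than the verbatim source, and if GitHub hands back a stale updated_at, the final comparison is exactly where the race corvus mentions would bite.

```python
# Simplified sketch of "is new_change a newer version of old_change?".
# Attribute names are approximations of zuul's GitHub change model.
def is_update_of(new_change, old_change) -> bool:
    return (
        new_change.project == old_change.project
        and new_change.number == old_change.number          # same pull request
        and new_change.patchset != old_change.patchset      # different head sha
        and new_change.updated_at > old_change.updated_at   # strictly newer
    )
```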
pabelanger | corvus: yah, my thoughts too. looking at the PR, I see a quick push of multiple commits | 17:09 |
corvus | pabelanger: yeah, so it might be worth collecting all the log lines for those and see if you can determine what the updated_at values would be each time it ran through | 17:10 |
pabelanger | ack, will do | 17:10 |
pabelanger | thanks | 17:11 |
corvus | pabelanger: the other key part of this system is https://opendev.org/zuul/zuul/src/branch/master/zuul/manager/__init__.py#L213 | 17:11 |
corvus | pabelanger: you should be able to grep for those methods and see their call locations, etc | 17:12 |
pabelanger | k | 17:12 |
clarkb | I guess PRs don't have the nice iteration number that changes in gerrit come with | 17:13 |
corvus | no, that's why we have to check both the sha and the timestamp | 17:13 |
tobiash | pabelanger, corvus: the parallel event processing is crucial for bigger environments like ours. Without it we'll have hours of event queueing. And I'm not sure if that's better than occasionally hitting the rate limit. So I think we should try the graphql for getpullbysha and see if that reduces the number of requests. | 17:16 |
clarkb | tobiash: we might also be able to throttle it so that there is always at least X time between requests or something | 17:17 |
corvus | tobiash: that sounds reasonable. i still think it would be swell if you or pabelanger or someone would investigate whether an overall throttle is necessary. | 17:17 |
clarkb | then you could tune that for your install (which I'm guessing has fewer rate limit concerns than public github) | 17:17 |
corvus | clarkb: you lost me at "tune" | 17:17 |
corvus | github tells us rate limits, no tuning should be necessary | 17:18 |
tobiash | yes, that makes sense | 17:18 |
tobiash | I think we might be able to hook into the requests session | 17:18 |
tobiash | and delay that if necessary | 17:18 |
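Editor's note: a sketch of that kind of hook, assuming all GitHub API calls go through one requests.Session subclass; this wrapper is illustrative and not existing zuul code.

```python
import threading
import time

import requests

class ThrottledSession(requests.Session):
    """Session that enforces a minimum spacing between outgoing requests."""

    def __init__(self, min_interval: float = 0.5):
        super().__init__()
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._last_request = 0.0

    def request(self, method, url, **kwargs):
        # Serialize the spacing check so concurrent threads cannot all fire
        # at once; then defer to the normal requests machinery.
        with self._lock:
            wait = self._min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
            self._last_request = time.monotonic()
        return super().request(method, url, **kwargs)
```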
clarkb | corvus: but it seems we are hitting errors on startup well before the rate limits? | 17:19 |
pabelanger | tobiash: Yup, my main concern, is the abuse message we get back from github.com, I'd hate to see github apps be banned (not that I expect them to do that for first offense) | 17:19 |
clarkb | the rate limit is like 5k requests per hour on a regular account | 17:19 |
clarkb | (I guess we could be making many thousands of requests all at once) | 17:19 |
corvus | clarkb: if we're hitting some kind of unpublished rate limit, then i think that would dramatically change how we would approach it. | 17:20 |
pabelanger | so far, without the parallel patch, zuul.a.c has worked as expected with the retry logic. If we release with parallel without fixing getpullbysha, that will be something new for us to debug | 17:20 |
corvus | it sounds like we need *a lot* more information. | 17:20 |
pabelanger | tobiash: I agree, graphql sounds like the fix, is that something we need to add to github3.py? | 17:20 |
tobiash | pabelanger: are you already running this: https://review.opendev.org/660800 ? | 17:20 |
tobiash | no | 17:20 |
tobiash | just a post request with the session object | 17:21 |
tobiash | I'll try it later or during the weekend | 17:21 |
pabelanger | tobiash: no, 660800, I just cherry-picked the retry-after changes | 17:21 |
tobiash | should be fairly easy | 17:21 |
clarkb | ya its 5k/hour on a regular authenticated account | 17:21 |
pabelanger | does getpullbysha hit search API? | 17:22 |
*** igordc has joined #zuul | 17:22 | |
clarkb | pabelanger: no | 17:22 |
pabelanger | k | 17:22 |
clarkb | it lists all the PRs instead (because the search api does have a lower limit) | 17:22 |
tobiash | no it does not according to github engineer | 17:23 |
tobiash | it's a graphql query and he showed me that this one is very cheap | 17:23 |
pabelanger | clarkb: right, that's what google was saying, the search API seems to have a more aggressive rate-limit | 17:23 |
clarkb | tobiash: https://developer.github.com/v3/search/#rate-limit depends on the api you use | 17:23 |
clarkb | that may be true for graphql but search api via rest has the lower rate limit | 17:24 |
pabelanger | tobiash: what is the path forward for graphql in zuul? Do you have thoughts | 17:24 |
corvus | i think that before we can say something is a solution, we need to know what the problem is; and it sounds like there is some doubt about whether this is a regular api limit, or something new. so before we engineer solutions which may or may not address the underlying problem, can we try to identify what it is first? | 17:25 |
pabelanger | +1 | 17:26 |
clarkb | https://developer.github.com/v3/guides/best-practices-for-integrators/#dealing-with-abuse-rate-limits has clues | 17:26 |
clarkb | "make requests serially" | 17:26 |
clarkb | I think it is likely we are hitting their abuse traps rather than regular rate limits on startup and that may be due to concurrent requests | 17:27 |
corvus | (perhaps the problem is that we're calling getpullbysha on the same repo simultaneously in 10 different threads with an empty cache?) | 17:27 |
corvus | (perhaps an analysis of the logs would reveal information like that ^) | 17:27 |
tobiash | ah, so the abuse rate limit is something different | 17:29 |
tobiash | so we should probably reduce the number of threads | 17:29 |
corvus | that is one thing that we could do | 17:30 |
corvus | but again, we need data | 17:30 |
tobiash | and implement rate limiting in a central place that evaluates the retry-after header | 17:30 |
*** hashar has quit IRC | 17:30 | |
corvus | that may not be necessary if we avoid the behavior that causes it in the first place | 17:30 |
tobiash | yeah, I'm just thinking as I have no data ;) | 17:30 |
corvus | yep, thinking is good :) | 17:31 |
corvus | so far we've had more patches than analysis | 17:31 |
corvus | mostly i'm trying to say it's time for the opposite :) | 17:31 |
tobiash | pabelanger: can you confirm in the logs that you hit the abuse rate limit and not the normal rate limit? I thought I read something like this in your snippets but cannot find it anymore. | 17:32 |
pabelanger | tobiash: yah, it was abuse, I need a moment to post it from zuul.o.o | 17:33 |
tobiash | ok, then I think we have the following options: | 17:33 |
tobiash | * queue per installation id | 17:33 |
tobiash | * reduce threads (but to which number?) Github doesn't document this | 17:34 |
tobiash | * central limit with retry-after inspection | 17:34 |
corvus | tobiash: the docs clarkb linked suggest one thread per install id | 17:36 |
corvus | well, it says "Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently." | 17:36 |
corvus | what's a client id? | 17:36 |
tobiash | yes, that's the safe side, but that imposes scalability problems :/ | 17:36 |
tobiash | client id is either an installation or user | 17:37 |
tobiash | so when using app auth you have a rate limit per installation (most of the time github organization) | 17:37 |
corvus | so for most of us, we'd just end up back at 'one therad' | 17:37 |
corvus | one thread | 17:37 |
corvus | oh wait | 17:37 |
tobiash | which is basically back to queuing | 17:38 |
corvus | no an installation is app-in-a-project | 17:38 |
corvus | so wouldn't that be one thread per project? | 17:38 |
tobiash | most users install the app on the org | 17:38 |
tobiash | as best practice | 17:38 |
tobiash | you can install it on org or individual repos | 17:38 |
tobiash | so in our case we have a large org which would fall back to be single threaded, which would break us again | 17:39 |
tobiash | maybe we can try out the grey area of few but > 1 threads | 17:39 |
tobiash | pabelanger: it would be great to see all get requests in the log around that incident | 17:40 |
tobiash | to see if we polled 1000 pull requests from ansible (which for sure would quickly trigger the abuse limit when multithreaded) | 17:41 |
corvus | i think "grep zuul.Github debug.log" would be safe to share? | 17:41 |
pabelanger | tobiash: yah, we have 1.9k open PRs on ansible/ansible | 17:42 |
corvus | pabelanger, tobiash: it would be also interesting to know if we see 403 errors after successfully populating the LRU caches | 17:42 |
pabelanger | corvus: I think do, I haven't seen any sensitive data | 17:42 |
pabelanger | so* | 17:42 |
tobiash | maybe we didn't even successfully populate the lru caches | 17:43 |
openstackgerrit | Merged zuul/zuul master: cli: prevent exception for the show command https://review.opendev.org/664235 | 17:44 |
corvus | tobiash, pabelanger: http://paste.openstack.org/show/752945/ http://paste.openstack.org/show/752946/ http://paste.openstack.org/show/752947/ http://paste.openstack.org/show/752948/ | 17:47 |
corvus | that's the first 45 minutes after the most recent startup | 17:47 |
corvus | (in 500 line chunks in order) | 17:48 |
corvus | oh oops, that's still too long, they got cut off | 17:48 |
tobiash | maybe you can also further grep for 'GET' to just see the requests (otherwise I'll do it locally) | 17:49 |
corvus | tobiash, pabelanger: http://paste.openstack.org/show/752949/ through http://paste.openstack.org/show/752957/ inclusive | 17:49 |
corvus | that's 200 line chunks | 17:50 |
corvus | there are raw links, you can fetch http://paste.openstack.org/raw/752957/ etc and recombine them | 17:50 |
corvus | okay, i'm going to leave you to it | 17:51 |
tobiash | already on it | 17:51 |
tobiash | ftr, this does the job: for i in `seq 752949 752957`; do wget http://paste.openstack.org/raw/$i ; done | 17:52 |
mnaser | https://zuul-ci.org/docs/zuul-jobs/install.html | 17:55 |
mnaser | is this wrong? should the source be "opendev" and not "zuul-git" ? | 17:55 |
pabelanger | mnaser: yah, looks wrong | 17:56 |
mnaser | and i assume it should be zuul/zuul-jobs ? | 17:56 |
pabelanger | maybe fall out from mass rename | 17:56 |
pabelanger | yup | 17:56 |
tobiash | I see the 403 only during/after sequences like those: http://paste.openstack.org/show/752960/ | 17:57 |
mnaser | guilhermesp: wanna push up a patch to zuul/zuul-jobs to fix that so it has 'opendev' under sources and zuul/zuul-jobs instead ? | 17:57 |
guilhermesp | sure mnaser let me see | 17:58 |
tobiash | and there are many sequences like those | 17:58 |
tobiash | so I think that lru cache is also less effective than expected initially (I think we saw this also when introducing it) | 17:58 |
tobiash | and on each of those occasions we're doing multiple sweeps in parallel | 18:00 |
tobiash | so we could either serialize status events or try out graphql for this which would eliminate those sweeps completely | 18:00 |
pabelanger | tobiash: so, do you think we need to address that before the next zuul release? Or revert parallel, release, then work on your suggestions | 18:08 |
tobiash | hrm, I'd have to re-add parallel then to my deployment, as without it I'm doomed | 18:09 |
openstackgerrit | Guilherme Steinmuller Pimentel proposed zuul/zuul-jobs master: Fix installation docs https://review.opendev.org/665443 | 18:10 |
tobiash | I just tried out the getpull by sha via graphql and rate limit api tells me that I have 5000 bucks per hour and the request costs 1 | 18:10 |
pabelanger | I can still upgrade zuul.a.c to latest master, and see how often we see the 403 exceptions, but I would imagine it will be the same frequency as zuul.o.o. But I won't be able to do that now, I am AFK in 30mins or so | 18:11 |
pabelanger | but, I do think it will introduce a new issue for us to debug | 18:11 |
tobiash | example graphql: http://paste.openstack.org/show/752961/ | 18:11 |
mnaser | thanks guilhermesp ^ | 18:11 |
tobiash | ok so how about revert parallel, release, re-add parallel with fewer threads (4-8? during regular requests in the logs the abuse detection didn't kick in) + getpullbysha via graphql? | 18:13 |
pabelanger | I'm game for what everybody else wants :) | 18:14 |
tobiash | but yeah we definitely need to address this before a release as at least getpullbysha on large repos kind of guarantees to trigger the abuse detection | 18:16 |
*** jamesmcarthur_ has joined #zuul | 18:17 | |
pabelanger | agree, sounds like this is something we should address. I'm not sure what is involved to move to graphql honestly. | 18:18 |
tobiash | or if a release is not that urgent we could skip the revert and directly change thread count and graphql | 18:18 |
pabelanger | I believe SpamapS was looking for a new release, but in our case if we did the revert of parallel, we could install latest master for all the fixes we need in zuul.a.c | 18:21 |
*** jamesmcarthur has quit IRC | 18:21 | |
pabelanger | then work to address parallel issues | 18:21 |
pabelanger | so we don't 'need | 18:21 |
pabelanger | err | 18:21 |
pabelanger | so we don't 'need' a release, we can pick it up from pre-release | 18:21 |
tobiash | this is a graphql example that contains exactly the information we need in getpullbysha: http://paste.openstack.org/show/752962/ | 18:22 |
tobiash | just has to be templated with the sha, send to github as post using the github client session and do some json parsing | 18:22 |
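Editor's note: a rough sketch of that flow. The query shape below (commit lookup by oid with associatedPullRequests) is an assumption and may differ from the pastes above, and the token handling is simplified.

```python
import requests

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Assumed query shape: look the commit up by oid and ask for the pull
# requests associated with it.  The query tobiash pasted may differ.
QUERY = """
query($owner: String!, $repo: String!, $sha: GitObjectID!) {
  repository(owner: $owner, name: $repo) {
    object(oid: $sha) {
      ... on Commit {
        associatedPullRequests(first: 5) {
          nodes { number state headRefOid }
        }
      }
    }
  }
}
"""

def get_pull_by_sha(session: requests.Session, token: str,
                    owner: str, repo: str, sha: str):
    resp = session.post(
        GITHUB_GRAPHQL_URL,
        headers={"Authorization": f"Bearer {token}"},
        json={"query": QUERY,
              "variables": {"owner": owner, "repo": repo, "sha": sha}},
    )
    resp.raise_for_status()
    commit = resp.json()["data"]["repository"]["object"]
    if not commit:
        return None
    return commit["associatedPullRequests"]["nodes"]
```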
*** jamesmcarthur_ has quit IRC | 18:24 | |
corvus | i lean toward revert parallel, release, then spend a week or so getting it right before we release again :) | 18:29 |
pabelanger | okay, I've also removed my wip on https://review.opendev.org/665429/ (the revert) | 18:29 |
pabelanger | looking at zuul.a.c logs again, I've only had 2 instances of retries today. Github is working well it seems :D | 18:30 |
tobiash | then it's decided | 18:30 |
pabelanger | but did confirm a 502 Server Error will not have a 'Retry-After' header | 18:30 |
fungi | well, or at least some 502 server errors won't | 18:31 |
fungi | can't say for sure that they all don't | 18:31 |
pabelanger | true | 18:31 |
corvus | pabelanger: oh, i missed the link to a 502 without a retry, can you paste that again? | 18:31 |
pabelanger | let me find it | 18:32 |
pabelanger | http://paste.openstack.org/show/752964/ | 18:33 |
pabelanger | then | 18:34 |
pabelanger | 2019-06-14 16:07:12,081 DEBUG zuul.GithubEventConnector: [delivery: f506a3a6-8ebd-11e9-935d-2fe311e1498f] Got PR on project ansible/ansible for sha ba7b2f053d487e533c82b17cb194619b33cbde4f | 18:34 |
pabelanger | so it worked as expected | 18:34 |
pabelanger | okay, I have to wrap up for now. Thanks again everybody for help on github driver this week, much appreciated. I believe this will go a long way to making zuul better for ansible org! | 18:35 |
pabelanger | will check back later this evening | 18:36 |
openstackgerrit | Merged zuul/zuul-jobs master: Fix installation docs https://review.opendev.org/665443 | 18:58 |
*** jamesmcarthur has joined #zuul | 19:02 | |
tobiash | grr, github apps are not allowed to use search in graphql :/ | 19:16 |
tobiash | so that option is gone | 19:16 |
*** gtema has quit IRC | 19:30 | |
*** pcaruana has quit IRC | 19:39 | |
*** jamesmcarthur has quit IRC | 19:41 | |
*** jamesmcarthur has joined #zuul | 19:42 | |
*** jamesmcarthur has quit IRC | 19:44 | |
*** jamesmcarthur has joined #zuul | 19:44 | |
*** jamesmcarthur has quit IRC | 20:00 | |
*** jamesmcarthur has joined #zuul | 20:04 | |
tobiash | pabelanger, corvus: I'm currently trying out some things using the search api. And I think we can and should go with the search api instead. It has a rate limit of 30 per minute which is one every two seconds on average per installation id. It's unlikely that we hit that. I'd combine this with search related rate limit handling. With that we should be on the safe side and can remove this lru cache entirely. | 20:05 |
*** jamesmcarthur has quit IRC | 20:05 | |
corvus | tobiash: have we determined whether it's merely the startup of the lru cache that's at issue? | 20:09 |
tobiash | the lru cache in this case is suboptimal and a workaround in any case, so I think we should use search+rate limit handling anyway | 20:10 |
corvus | (i thought the lru cache was a pretty good idea -- after all, it may be able to eliminate some queries entirely?) | 20:10 |
corvus | iirc, the idea was we get an event for a sha, we do one query on it, then as different people and bots leave comments and updates on that sha for the next hour or so, we don't have to look it up again | 20:12 |
tobiash | well the initial idea was not to have to iterate over all pull requests on every status event. For this use case the cache was a (complicated) workaround which I think I remember worked somehow but was not as effective as we hoped. | 20:14 |
corvus | basically, "find a pr by sha" is expensive either way -- now it takes many requests (and many seconds) out of a large limit, if we switch to search, it will take one request out of a much lower limit. so it seems like either way, the cache would be helpful. | 20:14 |
clarkb | it does work (it also populates the cache when it does the scan for all known shas) | 20:14 |
clarkb | so it isn't just the current PR that it iwll remember | 20:14 |
clarkb | where it doesn't work is when you have other CI systems doing post merge testing | 20:14 |
clarkb | because those commit shas cannot be known | 20:14 |
tobiash | the limit is high enough that it's really hard to hit, essentially only by trying to hit it in my local testing | 20:15 |
corvus | clarkb: what do you mean by "cannot be known?" | 20:15 |
tobiash | the sha changes on merge | 20:15 |
tobiash | that was the problem that invalidates the cache in ansible on every post job | 20:15 |
clarkb | corvus: when ansible merges say 3 changes in a row. Only one of the three post merge shas can be calculated before they merge | 20:15 |
corvus | sure, but if you search for it, you will find it, right? | 20:16 |
clarkb | corvus: yes | 20:16 |
clarkb | the cache can't cache it though | 20:16 |
corvus | well, wait, why can't it cache it? | 20:16 |
clarkb | so for that particular case the cache isn't helpful | 20:16 |
clarkb | because when PR1 triggers a cache update the sha for PR2 post merge isn't known | 20:17 |
clarkb | so you basically always hit the search case when you get post merge test results | 20:17 |
corvus | why are there multiple PR's in the example? | 20:17 |
clarkb | corvus: because for the first PR it is cacheable due to github telling you what the sha will be if this change merges (and nothing else merges first) | 20:18 |
clarkb | its only a problem when you start merging multiple PRs in succession | 20:18 |
corvus | i think we're looking at this in very different ways | 20:18 |
clarkb | basically when those events come in a large percentage of the time the data is not present in the cache | 20:19 |
clarkb | so for that particular set of events the cache is not helpful. For all of the pre merge events its super helpful | 20:19 |
corvus | all i see is a cache which maps sha -> PR. sometimes a PR gets new shas. that can be because of a new commit being added, or it can be cause of a merge. in any case, when a PR gets a new sha, it's a cache miss, and a cache miss means we (currently) query the whole list of PRs. | 20:19 |
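Editor's note: a rough model of what corvus is describing, for readers who have not looked at the driver; this is illustrative, and listPullRequests() is a hypothetical helper standing in for the expensive sweep.

```python
# Rough model of the sha -> PR cache being discussed; not zuul's actual code,
# and listPullRequests() is a hypothetical helper standing in for the sweep.
class PullRequestCache:
    def __init__(self, connection):
        self.connection = connection
        self._by_sha = {}        # sha -> PR metadata

    def get(self, project, sha):
        pr = self._by_sha.get(sha)
        if pr is not None:
            return pr            # hit: later events for a known sha are free
        # Miss: today this means listing every open PR in the project
        # (expensive on ansible/ansible); with search it would be one query.
        for pr in self.connection.listPullRequests(project):
            self._by_sha[pr["head_sha"]] = pr
        return self._by_sha.get(sha)
```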
corvus | i understand you're saying that generating lots of events on PRs post-merge means we see a lot of new shas which are cache misses. | 20:20 |
clarkb | yes and that causes backup problems | 20:20 |
clarkb | (which is the sort of thing tobiash is trying to fix) | 20:20 |
corvus | tobiash: is this the main culprit in terms of time we're spending on api queries? ie, is this the main thing the parallelization patch was designed to address? | 20:21 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Switch getPullBySha to using the search api https://review.opendev.org/665469 | 20:22 |
tobiash | corvus, clarkb: this uses search and handles the rate limit gracefully ^ | 20:23 |
tobiash | in local testing with an endless loop it takes around 45 seconds to reach the limit (so worst case scenario). The code handles this gracefully, waits for the time github tells us to wait (a few seconds in testing) and after that it right away has another 30 requests left | 20:24 |
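Editor's note: for reference, a minimal sketch of a search-based lookup that waits for the documented reset time when the search limit is hit; the query string is one plausible shape and the structure is illustrative rather than the code in 665469.

```python
import time

import requests

def search_pull_by_sha(session: requests.Session, sha: str, repo: str):
    # One search call replaces iterating over every open PR.  The search API
    # allows roughly 30 requests per minute, so honor its reset time on 403.
    url = "https://api.github.com/search/issues"
    params = {"q": f"{sha} type:pr repo:{repo}"}
    while True:
        resp = session.get(url, params=params)
        if (resp.status_code == 403
                and resp.headers.get("X-RateLimit-Remaining") == "0"):
            reset = float(resp.headers.get("X-RateLimit-Reset",
                                           time.time() + 60))
            time.sleep(max(reset - time.time(), 1))
            continue
        resp.raise_for_status()
        items = resp.json().get("items", [])
        return items[0] if items else None
```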
corvus | tobiash: see my question about the parallization patch? | 20:26 |
tobiash | still reading backlog | 20:26 |
corvus | ack. i'm trying to understand whether making getpullbysha faster would mean we don't need that. | 20:27 |
tobiash | corvus: regarding the parallelization, this is not only about getpullbysha, even without it we're running into queueing because of serialization | 20:28 |
tobiash | this is the thing the parallelization patch addresses | 20:28 |
tobiash | and basically the current implementation of getpullbysha abuses this and triggers the abuse protection | 20:29 |
corvus | well, maybe -- i mean, github won't promise that any amount of parallelization is acceptable. | 20:29 |
tobiash | so switching to search should resolve this mostly and we should maybe combine this with limiting the threads to like 4 or 8 | 20:29 |
tobiash | per installation, but I think we should try out a little bit of the grey area | 20:30 |
tobiash | otherwise we should at least parallelize and lock per instance id | 20:31 |
tobiash | and from the log you posted the only problematic occurrences were due to getpullbysha | 20:33 |
tobiash | so I'd suggest limiting the threads and using search. If then there are still issues, limit parallelization per installation id | 20:34 |
corvus | okay, let me propose some things: 1) we use the search api like in your patch, but we *also* keep the cache. that should still help in the cases where multiple events arrive for a sha in succession (still fairly common). 2) we parallelize per installation id. 3) we add an undocumented option to increase the number of threads per instance id (and maybe even an RPC call to change it if you want to get | 20:34 |
corvus | fancy) so we can observe the grey area. if we think it's necessary/useful, we document the option (but leave the default as 1) | 20:34 |
corvus | oh, and if we find it's not useful or dangerous, we remove the option | 20:35 |
tobiash | sounds like a good plan | 20:36 |
corvus | clarkb, pabelanger: ^ | 20:36 |
clarkb | corvus: sounds good to me | 20:36 |
clarkb | in this case an installation id is the app on the github side? | 20:37 |
clarkb | so opendev has 1? | 20:37 |
corvus | clarkb: it's an app installed in an org or repo | 20:37 |
corvus | clarkb: so opendev has a handful of them | 20:37 |
clarkb | ah ok so its the application of our app into the projects | 20:37 |
corvus | yep. it's kind of annoying that there's an incentive to install the app for each repo rather than at the org level (in order to dodge the abuse limit). | 20:41 |
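Editor's note: a sketch of points 2 and 3 of corvus's proposal above: one semaphore per installation id, with the (assumed, undocumented) knob to allow more than one concurrent request while exploring the grey area. Names are illustrative.

```python
import threading

class InstallationThrottle:
    """Limit concurrent GitHub requests per installation id.

    max_concurrency=1 gives the serial behaviour GitHub's abuse docs ask
    for; a higher value explores the grey area discussed above.
    """

    def __init__(self, max_concurrency: int = 1):
        self._max = max_concurrency
        self._lock = threading.Lock()
        self._semaphores = {}

    def slot(self, installation_id: int) -> threading.Semaphore:
        with self._lock:
            if installation_id not in self._semaphores:
                self._semaphores[installation_id] = threading.Semaphore(self._max)
            return self._semaphores[installation_id]

# usage sketch:
#   with throttle.slot(install_id):
#       session.get(...)
```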
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Switch getPullBySha to using the search api https://review.opendev.org/665469 | 20:46 |
tobiash | that retains the cache ^ | 20:47 |
tobiash | tested locally against github.com | 20:48 |
tobiash | so 1) check, 2) is a little bit more complicated, so I won't do this today ;) | 20:49 |
tobiash | corvus: when parallelizing per installation id, do we still want to keep global ordering or do you think ordering per installation id is sufficient? | 20:50 |
corvus | tobiash: i'm inclined to say that global ordering would be best. it probably would be mostly okay to order per installation id, but maybe there would be times where cross-project dependencies might end up wrong if we did that? | 20:53 |
tobiash | k, I'm also unsure about that, I just wanted to think about whether this is a compromise we can make in favor of simpler implementation and scalability | 20:54 |
corvus | tobiash: yeah, the one thing i know is that sequencing per install-id will require more thought. :) | 20:55 |
clarkb | corvus: I think that may be the case considering Depends-On doesn't specify a sha but instead specifies the container object (change/PR) | 20:55 |
tobiash | yepp | 20:55 |
openstackgerrit | Tobias Henkel proposed zuul/zuul master: Switch getPullBySha to using the search api https://review.opendev.org/665469 | 20:56 |
corvus | clarkb: yeah, i'm coming up with similar thoughts. that, basically, we could enqueue A1->B1 even after B2 has been pushed, then later, we see B2 and abort the job because it's not current. so in the end, zuul won't test the wrong thing, but users would be confused because from their POV, A1 should have tested with B2 to start with. | 20:57 |
corvus | (and they'd see one of those annoying cryptic 'dependency unable to merge' messages) | 20:57 |
tobiash | yeah, that's one of those edge cases | 20:57 |
tobiash | ok, I'll keep the global orderung | 20:58 |
corvus | ++ | 20:58 |
tobiash | *ordering | 20:58 |
*** zbr|ruck has quit IRC | 21:13 | |
corvus | ofosos: i left comments on the bitbucket series -- i think it's really close. mostly a couple of subtle things which i think our abbreviated developer docs don't make very clear, sorry. | 21:20 |
*** zbr has joined #zuul | 21:30 | |
ofosos | corvus: woohoo, thank you very much | 21:36 |
*** EvilienM is now known as EmilienM | 22:03 | |
*** rlandy|ruck has quit IRC | 22:09 | |
pabelanger | corvus: tobiash: clarkb: +1 plan looks good. | 22:25 |
pabelanger | tobiash: looks like test case failed on your search api update | 22:25 |
*** rfolco has quit IRC | 22:44 | |
*** igordc has quit IRC | 22:45 | |
SpamapS | ofosos: I'll try and find some time next week to look at your next pushes. Thanks for hanging in there, it's really awesome to add another flower to the zuul bouquet of drivers. :-D | 23:02 |
clarkb | I managed to do a quick skim but didn't leave much useful feedback (just one code cleanup suggestion) | 23:03 |
*** igordc has joined #zuul | 23:13 | |
*** mattw4 has quit IRC | 23:25 | |
*** mattw4 has joined #zuul | 23:25 | |
openstackgerrit | Merged zuul/zuul master: Revert "Parallelize github event processing" https://review.opendev.org/665429 | 23:37 |