tristanC | zuul-maint, looking for review on https://review.openstack.org/632620 to be able to use add-build-sshkey on static node. thanks :) | 00:00 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: executor: enable zuul_return to update Ansible inventory https://review.openstack.org/590092 | 00:20 |
*** rlandy is now known as rlandy|bbl | 00:36 | |
*** TheJulia has joined #zuul | 00:59 | |
*** mrhillsman has joined #zuul | 00:59 | |
*** jtanner has joined #zuul | 00:59 | |
*** sdake has joined #zuul | 01:01 | |
openstackgerrit | Merged openstack-infra/zuul master: Add a note about sqlalchemy metadata https://review.openstack.org/636189 | 01:18 |
*** sdake has quit IRC | 01:37 | |
*** sdake has joined #zuul | 01:39 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul-jobs master: Add intermediate registry push/pull roles https://review.openstack.org/634829 | 01:41 |
*** sdake has quit IRC | 01:43 | |
*** sdake has joined #zuul | 01:50 | |
*** sdake has quit IRC | 01:51 | |
openstackgerrit | Merged openstack-infra/zuul master: web: add /{tenant}/buildsets route https://review.openstack.org/630035 | 02:12 |
*** saneax has joined #zuul | 02:50 | |
*** saneax has quit IRC | 02:52 | |
*** saneax has joined #zuul | 02:53 | |
*** rlandy|bbl is now known as rlandy | 03:12 | |
*** rlandy has quit IRC | 03:19 | |
*** sdake has joined #zuul | 03:39 | |
*** sdake has quit IRC | 03:59 | |
*** manjeets has quit IRC | 04:41 | |
*** saneax has quit IRC | 04:42 | |
*** bjackman has joined #zuul | 04:55 | |
*** shanemcd has quit IRC | 05:03 | |
*** shanemcd has joined #zuul | 05:04 | |
*** greg-g has quit IRC | 05:13 | |
*** chandankumar is now known as chkumar|ruck | 05:47 | |
openstackgerrit | Merged openstack-infra/zuul master: Update SQLAlchemy url in docs https://review.openstack.org/635148 | 05:59 |
*** sshnaidm is now known as sshnaidm|afk | 06:39 | |
*** saneax has joined #zuul | 06:45 | |
openstackgerrit | Felix Schmidt proposed openstack-infra/zuul master: Retrieve full list of jobs with details per tenant via API https://review.openstack.org/635714 | 06:56 |
*** quiquell|off is now known as quiquell|rover | 07:02 | |
*** snapiri has joined #zuul | 07:24 | |
*** quiquell|rover is now known as quique|rover|brb | 07:53 | |
*** themroc has joined #zuul | 08:10 | |
*** quique|rover|brb is now known as quiquell|rover | 08:12 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Escape jinja2 template from commit message https://review.openstack.org/633930 | 08:21 |
*** gtema has joined #zuul | 08:24 | |
*** jpena|off is now known as jpena | 08:53 | |
*** panda|off is now known as panda | 09:04 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Escape jinja2 template from commit message https://review.openstack.org/633930 | 09:43 |
quiquell|rover | tristanC: different approach ^ | 09:45 |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory https://review.openstack.org/633930 | 09:45 |
*** sshnaidm|afk is now known as sshnaidm | 09:52 | |
SpamapS | tristanC: heh, your python, on first test, comes out about 10x slower than my rust. Trying the C++ now. ;) | 10:22 |
SpamapS | The C++ is still faster. :-P | 10:24 |
SpamapS | oh interesting, with more URLs, the rust gets faster | 10:27 |
SpamapS | suggesting that the C++ LRU cache maybe isn't so awesome. | 10:28 |
ttx | Hey all, I'm trying to get to the bottom of delays in the GitHub triggers... it seems to take anywhere between 30 sec and 10 min for a GitHub event to enter Zuul pipelines. I was wondering whether that comes from the processing on the Zuul side, or some form of webhook batching on the GitHub side. How much of that is expected behavior vs. bug? | 10:32 |
tristanC | SpamapS: how are you testing? | 10:33 |
SpamapS | tristanC: I spun up an nginx with 10000 pretend builds | 10:34 |
SpamapS | tristanC: and fed them all to each program | 10:35 |
SpamapS | ttx: GitHub triggers have been known to happen asynchronously, but you can see them on the app's detail page if you have access.. and then compare with the zuul scheduler logs to see who was slower. | 10:35 |
SpamapS | tristanC: I'm pretty sure the problem with your python is that it doesn't use a persistent client | 10:36 |
ttx | SpamapS: ok, I don't have access to that, but i know a few people who have | 10:36 |
tristanC | SpamapS: I mean, how did you measure the performance? | 10:36 |
SpamapS | tristanC: oh, just /usr/bin/time | 10:37 |
SpamapS | tristanC: I'm about to push what I did into the git repo I made, so you can repeat if you like. :) | 10:39 |
tristanC | SpamapS: sure, i'm curious how this would affect overall user experience. Initial build lookup only happens once per visit, then the rewrite should be served by the cache | 10:40 |
SpamapS | tristanC: yeah I would like to try some different schemes, like repeat builds | 10:41 |
SpamapS | But I'd say log URLs are probably not going to have a fantastic cache hit rate. | 10:42 |
tobiash | ttx: normally this is a matter of less than 30s | 10:42 |
SpamapS | tristanC: I tend to agree with you that this probably isn't a big deal though. | 10:43 |
ttx | tobiash: I'm testing right now and I'm at 17min so far | 10:43 |
tobiash | ttx: is this the time until a change shows up in the pipeline without or with jobs? | 10:43 |
ttx | change shows up in pipeline | 10:44 |
tobiash | Job freezing can take a while because that requires a merger call | 10:44 |
tobiash | ttx: maybe you're running into api request limit? | 10:44 |
tobiash | ttx: do you use an api token or github app? | 10:45 |
ttx | tobiash: maybe, but the activity is not that significant. GitHub app | 10:45 |
tobiash | Hrm | 10:45 |
tobiash | Then you should probably first check webhook history in gh and then zuul web logs | 10:46 |
SpamapS | tristanC: using a persistent session object improved by about 15%. Still 7x slower than the C++/Rust. | 10:46 |
ttx | tobiash: if you confirm that is a bit unexpected behavior, I'll try to dig up more data points | 10:46 |
tobiash | Just to verify that there is everything ok | 10:46 |
tobiash | The webhooks should arrive within a couple of seconds | 10:47 |
SpamapS | and uses 20MB more memory than the C++/Rust, which only use about 8MB... but honestly, this is just late-night insomnia tinkering on my part... not sure how performance sensitive this actually is. | 10:47 |
tristanC | SpamapS: well it depends what is the overhead relative to the total latency of the browser going through that service | 10:49 |
SpamapS | tristanC: the python responds to cache misses in 1ms. ;) | 10:54 |
tristanC | SpamapS: e.g., a request to logs.openstack.org takes about 300ms, thus if the difference between the implementations is less than 10ms, then it shouldn't matter much, right | 10:54 |
SpamapS | Yeah, 1ms vs. 100ns. :-P | 10:55 |
SpamapS | I mean it's nice if things are fast. :) | 10:55 |
SpamapS | and honestly, the C++ and Rust are pretty darn readable. | 10:55 |
SpamapS | But, there's also about 4x the code. :) | 10:56 |
SpamapS | so, yeah, in terms of overall cost to the project, while I wish we could rewrite all of Zuul in Rust (please god not C++), this isn't a part that I think needs to be in !python | 10:57 |
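A rough, self-contained sketch of the two optimisations discussed above — a persistent requests.Session plus an LRU cache in front of the build lookup. The endpoint and parameter names are invented for illustration, not the actual service's API:

```python
import functools
import requests

session = requests.Session()  # reuse one HTTP connection across lookups

@functools.lru_cache(maxsize=4096)
def lookup_build(log_url):
    # Hypothetical lookup endpoint and query parameter; a cache hit on a
    # repeated log URL skips the API round trip entirely.
    resp = session.get('https://zuul.example.com/api/builds',
                       params={'log_url': log_url}, timeout=10)
    resp.raise_for_status()
    builds = resp.json()
    return builds[0]['uuid'] if builds else None
```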
* SpamapS should probably try that sleep thing again | 11:02 | |
*** iurygregory has joined #zuul | 11:34 | |
iurygregory | Hi everyone, while moving jobs to Zuul v3 in ironic we noticed that ironic-tempest-plugin is not running all the jobs we have set in project.yaml https://github.com/openstack/ironic-tempest-plugin/blob/master/zuul.d/project.yaml. I checked that the parent jobs really exist, and they do. Anyone have ideas what it could be? Some jobs that aren't running are defined in stable-jobs.yaml https://github.com/openstack/ironic-tempest-plugin/blob/master/zuul.d/stable-jobs.yaml | 11:42 |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: [WIP] Zuul CLI: allow access via REST https://review.openstack.org/636315 | 11:50 |
*** jpena is now known as jpena|lunch | 12:26 | |
*** abregman has joined #zuul | 12:32 | |
*** panda is now known as panda|lunch | 12:32 | |
abregman | hey. does zuul have an API? | 12:34 |
pabelanger | abregman: yes, zuul-web is the service that allows that | 12:46 |
abregman | pabelanger: how exactly do I access it? for example if my address is https://my-zuul.com | 12:47 |
jkt | abregman: there's this pending work, https://review.openstack.org/#/c/535541/ | 12:50 |
jkt | abregman: are you looking for Gerrit integration by any chance? | 12:50 |
abregman | jkt: thanks, I'll look into it. what do you mean by Gerrit integration? | 12:51 |
pabelanger | abregman: there are some docs around deployment options: https://zuul-ci.org/docs/zuul/admin/installation.html?highlight=dashboard#web-deployment-options but it should be https://my-zuul.com/api | 12:51 |
abregman | basically I'm just looking for a way to pull some information using some scripts I want to write | 12:51 |
abregman | pabelanger: thanks, I'll go over it. using /api just redirects me to /tenants | 12:52 |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory https://review.openstack.org/633930 | 12:53 |
jkt | abregman: ok; I just wanted to see if you were looking for https://gerrit.googlesource.com/plugins/zuul-status/ | 12:54 |
pabelanger | abregman: yup, depends on how your dashboard is setup, here is a list of entrypoints: http://git.zuul-ci.org/cgit/zuul/tree/zuul/web/__init__.py#n661 | 12:55 |
pabelanger | you can also get data from statsd, if setup too | 12:55 |
abregman | pabelanger: awesome! exactly what I was looking for. thanks :) | 12:57 |
abregman | pabelanger++ | 12:57 |
*** bjackman has quit IRC | 12:59 | |
dmsimard | pabelanger: http://logs.openstack.org/41/535541/9/check/zuul-build-dashboard/b99f517/npm/html click API at the top right (from https://review.openstack.org/#/c/535541/ ) | 13:00 |
jkt | wow! | 13:05 |
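A minimal sketch of pulling data out of zuul-web's REST API as discussed above — the base URL is abregman's example address, and the /api/tenants and per-tenant /builds endpoints are assumed from the entry points pabelanger linked:

```python
import requests

ZUUL = 'https://my-zuul.com'

# List tenants, then pull the most recent builds for each one.
tenants = requests.get(ZUUL + '/api/tenants', timeout=30).json()
for tenant in tenants:
    name = tenant['name']
    builds = requests.get('%s/api/tenant/%s/builds' % (ZUUL, name),
                          params={'limit': 10}, timeout=30).json()
    for build in builds:
        print(name, build.get('job_name'), build.get('result'))
```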
*** chkumar|ruck is now known as chkumar|pto | 13:10 | |
*** panda|lunch is now known as panda | 13:13 | |
*** quiquell|rover is now known as quique|rover|eat | 13:17 | |
*** rlandy has joined #zuul | 13:17 | |
*** jpena|lunch is now known as jpena | 13:30 | |
*** sdake has joined #zuul | 13:32 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory https://review.openstack.org/633930 | 13:37 |
*** openstackgerrit has quit IRC | 13:37 | |
*** iurygregory has left #zuul | 13:38 | |
*** quique|rover|eat is now known as quiquell|rover | 13:39 | |
*** abregman has quit IRC | 14:01 | |
*** sdake has quit IRC | 14:03 | |
*** openstackgerrit has joined #zuul | 14:15 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: prevent status update loop in background https://review.openstack.org/636343 | 14:15 |
pabelanger | mordred: with the openstack client and openstack server show, does the properties field map to metadata in openstacksdk? | 14:18 |
pabelanger | mordred: eg: line 25 @ https://softwarefactory-project.io/paste/show/1425/ | 14:19 |
mordred | pabelanger: yes, I think so? | 14:19 |
pabelanger | ack | 14:20 |
pabelanger | trying to figure out why nodepool might be leaking nodes when an exception is raised while creating a node | 14:21 |
pabelanger | https://softwarefactory-project.io/paste/show/1424/ from nodepool, note line 5 | 14:22 |
pabelanger | but https://softwarefactory-project.io/paste/show/1425/ shows it actually exists | 14:22 |
pabelanger | but, then trying to figure out why cleanup handler isn't seeing it, and trying to delete it | 14:22 |
*** sshnaidm has quit IRC | 14:30 | |
*** sshnaidm has joined #zuul | 14:30 | |
openstackgerrit | Felix Schmidt proposed openstack-infra/zuul master: Retrieve full list of jobs with details per tenant via API https://review.openstack.org/635714 | 14:30 |
*** sdake_ has joined #zuul | 14:37 | |
*** ParsectiX has joined #zuul | 14:44 | |
*** swest has quit IRC | 14:47 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: [WIP] Zuul CLI: allow access via REST https://review.openstack.org/636315 | 14:47 |
ParsectiX | Hi guys. I'm trying to add an ldap group as a member of another group via the CLI and it fails. When I do the same operation from the WebUI it works | 14:49 |
ParsectiX | ssh -p 2244 localhost gerrit set-members Administrators --include ldap/admins | 14:50 |
ParsectiX | fatal: Group Not Found: ldap%3Acn%3Dadmins%2Cou%3Dgroups%2Cdc%3Dexample%2Cdc%3Dio | 14:50 |
ParsectiX | Can you please advise how to debug this? | 14:50 |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: [WIP] Zuul CLI: allow access via REST https://review.openstack.org/636315 | 14:54 |
*** snapiri has quit IRC | 14:56 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory https://review.openstack.org/633930 | 14:58 |
*** Parsecti_ has joined #zuul | 15:05 | |
*** ParsectiX has quit IRC | 15:07 | |
*** ianychoi has quit IRC | 15:07 | |
*** ianychoi has joined #zuul | 15:07 | |
fungi | Parsecti_: you may want to try asking in #gerrit | 15:21 |
*** Parsecti_ has quit IRC | 15:21 | |
pabelanger | is anybody able to see why https://softwarefactory-project.io/paste/show/1425/ isn't hitting line 493 in http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py?h=3.4.0#n458 | 15:23 |
pabelanger | if my zk data is correct, the nodepool_node_id doesn't exist | 15:24 |
pabelanger | but, for some reason, nodepool isn't trying to delete the node | 15:24 |
quiquell|rover | pabelanger, tobiash, tristanC: The little fix for jinja2 in the gerrit commit message is all good now https://review.openstack.org/#/c/633930/ | 15:26 |
quiquell|rover | Used !unsafe from Ansible YAML to skip templating it | 15:26 |
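A rough illustration of the idea behind https://review.openstack.org/633930 as described above — emit the commit message with Ansible's !unsafe YAML tag so Jinja2 in it is never templated. The UnsafeText class and dumper wiring here are made up for the example; this is not the actual patch:

```python
import yaml

class UnsafeText(str):
    """String that should be written out with the !unsafe YAML tag."""

# Teach the dumper to emit UnsafeText values with the !unsafe tag.
yaml.SafeDumper.add_representer(
    UnsafeText,
    lambda dumper, value: dumper.represent_scalar('!unsafe', str(value)))

inventory_vars = {'zuul': {'message': UnsafeText('Do not expand {{ this }}')}}
print(yaml.dump(inventory_vars, Dumper=yaml.SafeDumper, default_flow_style=False))
# prints roughly:
#   zuul:
#     message: !unsafe 'Do not expand {{ this }}'
```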
*** gtema has quit IRC | 15:32 | |
*** gtema has joined #zuul | 15:35 | |
clarkb | pabelanger: see line 462 | 15:38 |
clarkb | pabelanger: that method is specifically for cleaning up servers not in zk, that server appears to be in zk | 15:39 |
pabelanger | clarkb: based on the info i am getting from fbo / jpena we don't see the server in zk | 15:39 |
pabelanger | https://softwarefactory-project.io/paste/show/1424/ | 15:40 |
clarkb | pabelanger: can you double check that? 0001603003 would be the identifier in zk | 15:40 |
pabelanger | that is also a log from nodepool | 15:40 |
clarkb | oh that logs says the cloud didn't have the node. Seems like a bug in your cloud then | 15:41 |
*** saneax has quit IRC | 15:41 | |
pabelanger | no, server create failed with 500 no more resources | 15:41 |
pabelanger | but, couldn't delete because in error state I think | 15:42 |
pabelanger | however | 15:42 |
pabelanger | I can't see why cleanup worker isn't trying to delete again | 15:42 |
pabelanger | interesting | 15:43 |
pabelanger | zk id=0001603336 but nodepool_node_id='0001603003' | 15:43 |
pabelanger | should they be the same? | 15:43 |
*** ParsectiX has joined #zuul | 15:44 | |
clarkb | pabelanger: nodepool will assign a new id to nodes it didn't know about in zk | 15:44 |
clarkb | that log could be from the cleanup method? | 15:45 |
pabelanger | k, I think I see the issue | 15:46 |
pabelanger | with help from fbo | 15:46 |
pabelanger | /nodepool/nodes/0001603003/lock/780341bd82e9489c991ac7e972d2b5d0__lock__0000000034 | 15:46 |
pabelanger | so, data does exist in zk | 15:47 |
pabelanger | which is instance https://softwarefactory-project.io/paste/show/1425/ | 15:47 |
pabelanger | however | 15:47 |
pabelanger | https://softwarefactory-project.io/paste/show/1424/ is the error when nodepool tried to create server | 15:47 |
pabelanger | but, we've deleted the wrong node id for some reason | 15:47 |
pabelanger | leading to data being leaked in zk | 15:48 |
clarkb | I'm not sure its the wrong id | 15:48 |
clarkb | the way nodepool accounts for this stuff, it creates a new node to track leaked things | 15:48 |
clarkb | so the id changes, but that is expected iirc | 15:48 |
clarkb | it is possible that it doesn't cleanup details for the original id properly though | 15:50 |
pabelanger | okay, but in that case, I can't see how we'd actually delete the data in zk, based on http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py?h=3.4.0#n492 | 15:50 |
clarkb | pabelanger: ah ya that might be buggy then | 15:50 |
clarkb | though what it is checking there is "if I don't have an id for this id in zk then make a new node with a new id to track it" | 15:51 |
clarkb | which should work | 15:51 |
clarkb | it is giving it a dummy node in zk so that it can track that it needs to delete the leaked instance | 15:51 |
clarkb | where this might be buggy is if it fails to delete the first time through, it could create a second dummy node and so on | 15:52 |
clarkb | (so we'd leak dummy nodes until the real instance deleted and then all the dummy nodes could be cleaned up?) | 15:52 |
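A simplified, self-contained model of the leaked-instance cleanup logic referenced above (provider.py, line 458 onward) — the data shapes and names are invented for the example, but the key check is the one under discussion: instances whose nodepool_node_id still exists in ZooKeeper are skipped, everything else gets a placeholder record marked for deletion:

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    id: str
    metadata: dict = field(default_factory=dict)

def cleanup_leaked_instances(provider_name, servers, zk_nodes):
    """zk_nodes maps node id -> record; returns placeholder records for
    instances that have no matching ZooKeeper node."""
    placeholders = []
    for server in servers:
        meta = server.metadata
        if meta.get('nodepool_provider_name') != provider_name:
            continue  # not one of this provider's instances
        if meta.get('nodepool_node_id') in zk_nodes:
            # A ZooKeeper record still claims this instance, so it is left
            # alone -- this is the check that skips the leaked ERROR servers
            # above, because their reused node id still exists in ZooKeeper.
            continue
        # Unknown to ZooKeeper: record a dummy node marked for deletion so
        # the regular delete worker reaps the instance.
        placeholders.append({'external_id': server.id,
                             'provider': provider_name,
                             'state': 'deleting'})
    return placeholders

servers = [
    Server('uuid-1', {'nodepool_provider_name': 'rdo',
                      'nodepool_node_id': '0001603003'}),  # still in zk: kept
    Server('uuid-2', {'nodepool_provider_name': 'rdo',
                      'nodepool_node_id': '0000000042'}),  # leaked: deleted
]
print(cleanup_leaked_instances('rdo', servers, {'0001603003': {}}))
```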
fungi | i've observed an odd behavior with a github connection in our zuul deployment. we see trigger events appear in the scheduler's debug log immediately (zuul.GithubGearmanWorker: Github Webhook Received) but then there's a lengthy delay sometimes on the order of an hour before it gets processed (zuul.GithubConnection: Scheduling event from github, zuul.Scheduler: Processing trigger event <GithubTriggerEvent ...>). | 15:54 |
fungi | any idea what could cause that delay? when it happens i don't see any events hanging out in the events queue according to the status api, and there's nothing identifiable in the debug logs i can interpret as an explanation for the wait either | 15:54 |
pabelanger | I'm just looking at code now, I know we changed this a little recently | 15:54 |
clarkb | pabelanger: http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py?h=3.4.0#n247 I think that is the area that changed semi recently | 15:57 |
clarkb | http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py?h=3.4.0#n255 is where we create the new node to stand in | 15:57 |
pabelanger | ah, right | 15:58 |
pabelanger | clarkb: so, I don't think we are updating the actual instance properly with that new zk ID | 15:59 |
pabelanger | which leads to the leak | 15:59 |
clarkb | pabelanger: I agree we don't update the existing instance with the new zk id. I don't know that that causes a leak | 16:00 |
pabelanger | let me get some logs on new zk id | 16:00 |
clarkb | why do you think that is causing the leak? nodepool should delete the error'd instance under the new id and it will go away from the cloud | 16:00 |
clarkb | it is possible we leak the old node in zk, but that shouldn't affect the leak cloud side | 16:01 |
clarkb | fungi: what happens after the webhook is received by the scheduler? I wonder if there is anything obvious that could block | 16:02 |
pabelanger | clarkb: I am confirming we actually do a delete on new zk ID | 16:02 |
clarkb | fungi: like maybe construction of the trigger eevnt requires info back from github? | 16:02 |
clarkb | fungi: I think the flow there is github POSTs to zuul-web, zuul-web does a gearman event, zuul scheduler processes gearman event as worker, then I don't know | 16:05 |
fungi | i thought that happened concurrent with the "processing trigger" activity in the scheduler | 16:05 |
clarkb | fungi: I think the event has to be converted to a trigger first | 16:06 |
fungi | but yeah, when my current meeting ends i'll see if i can trace the flow of functions between GithubGearmanWorker and GithubConnection | 16:06 |
clarkb | the event is the raw github activity and triggers are a bit more filtered down to things zuul can take action on | 16:06 |
pabelanger | clarkb: https://softwarefactory-project.io/paste/show/1428/ | 16:07 |
pabelanger | we seem to be creating multiple servers using the same zk id | 16:07 |
pabelanger | but you can see bae8dfbb-d165-47db-9e08-fc8111ba36b8 there | 16:07 |
pabelanger | eventually the zk id comes online | 16:07 |
pabelanger | but the error instance, also has that nodepool_node_id | 16:08 |
pabelanger | https://softwarefactory-project.io/paste/show/1429/ looks to be the delete | 16:09 |
pabelanger | but, didn't delete cloud side | 16:09 |
clarkb | ya so it's likely failing to delete in the cloud? and nodepool trying over and over again? | 16:09 |
pabelanger | no, the delete only happens once | 16:13 |
pabelanger | I'd like it to keep trying to delete over and over | 16:13 |
pabelanger | but, cleanup worker doesn't because nodepool_node_id is in zk | 16:13 |
pabelanger | being used to boot the next server | 16:13 |
pabelanger | so, I am unsure how to proceed in this case, if we don't properly delete the node, and a new node is booted with same zk id | 16:14 |
pabelanger | tobiash: might be of interest to you, since you helped here in the past | 16:16 |
pabelanger | ^ | 16:16 |
clarkb | pabelanger: so the original node_id is being reused? | 16:16 |
clarkb | pabelanger: can you show the logs of that? | 16:16 |
*** quiquell|rover is now known as quiquell|off | 16:17 | |
pabelanger | clarkb: https://softwarefactory-project.io/paste/show/1428/ | 16:17 |
pabelanger | 0001603003 is the id in question | 16:18 |
pabelanger | each of those attempts, fail with 500 (lack of resources) | 16:18 |
pabelanger | https://softwarefactory-project.io/paste/show/1425/ is actually bae8dfbb-d165-47db-9e08-fc8111ba36b8 attempt | 16:19 |
clarkb | right ok so the problem is the logical request's id doesn't change but we've already written properties to the host that we check against for cleanup | 16:19 |
clarkb | so cleanup won't happen until the reused original request id is deleted then at that point we can cleanup the original failures. That does indeed seem to be a bug | 16:19 |
pabelanger | yah, I am just trying to confirm we have more than 1 leaked (error) node with the same nodepool_node_id | 16:20 |
*** manjeets has joined #zuul | 16:27 | |
*** themroc has quit IRC | 16:28 | |
pabelanger | clarkb: confirmed, found multiple error'd instances with same nodepool_node_id: https://softwarefactory-project.io/paste/show/1430/ 0001615360 | 16:47 |
corvus | pabelanger, clarkb: especially since Shrews is out, can you make sure to collect all of this in a storyboard bug report? | 16:49 |
pabelanger | corvus: yup! | 16:50 |
clarkb | pabelanger: ya the final thing we'll want to confirm is if things do cleanup once the successfully booted node with that id is deleted | 16:56 |
clarkb | that should remove the zk record for it, allowing everything else to be properly processed as a leaked node (mostly this is just to confirm observed behavior vs expected behavior) | 16:56 |
pabelanger | clarkb: agree | 16:57 |
*** ParsectiX has quit IRC | 16:58 | |
*** sdake_ has quit IRC | 17:01 | |
*** sdake has joined #zuul | 17:02 | |
fungi | just to follow up, current theory on our github trigger processing delay is growing backlog in GithubConnection.event_queue so looking to stuff a statsd report call somewhere for that | 17:06 |
fungi | (we have some very, very active gh projects associated with our connection) | 17:06 |
*** gtema has quit IRC | 17:06 | |
fungi | presumably event_queue.qsize() is what we should report on? | 17:07 |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: [WIP] URLTrigger driver time based - artifact change jobs triggering driver https://review.openstack.org/635567 | 17:15 |
clarkb | fungi: I think so | 17:16 |
fungi | cool, i'm just learning me up on how to add a new stat now | 17:17 |
fungi | or more like reverse-engineering from the existing uses | 17:17 |
AJaeger | zuul-jobs maintainer, could you review https://review.openstack.org/#/c/635941/ for the updated puppet-forge upload, please? | 17:27 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool master: Properly handle TaskManagerStopped exception https://review.openstack.org/636393 | 17:29 |
pabelanger | clarkb: mordred: corvus: do you mind confirming ^ is the correct approach for handling when task managers die? In the case of sf.io (rdo-cloud) we end up spamming logs a fair bit when it happens. | 17:30 |
pabelanger | nhicher: fbo: ^ | 17:30 |
clarkb | pabelanger: do you know why the task manager dies? | 17:33 |
clarkb | we should be careful we aren't papering over another bug | 17:33 |
*** tobiash has quit IRC | 17:34 | |
*** pwhalen has quit IRC | 17:34 | |
pabelanger | clarkb: yah, haven't figured out why it dies, I don't think we have proper logging enabled. But I do see the new manager being created | 17:35 |
*** tobiash has joined #zuul | 17:35 | |
pabelanger | just the previous requests are using original, and failing | 17:35 |
pabelanger | let me get logs | 17:35 |
pabelanger | clarkb: http://paste.openstack.org/show/744958/ | 17:37 |
pabelanger | is the best we have right now | 17:37 |
pabelanger | line 23 is when new TaskManager is created | 17:38 |
pabelanger | and believe line 30 is still using the original, that just died | 17:38 |
clarkb | new provider managers are created when your config is changed iirc | 17:38 |
pabelanger | Oh, hmm | 17:39 |
clarkb | if the exception is due to ^ then your fix is probably appropriate | 17:39 |
clarkb | since we can safely short circuit in that case | 17:39 |
pabelanger | I should be able to confirm if that was the case | 17:39 |
*** spsurya has quit IRC | 17:40 | |
pabelanger | clarkb: all managers would be created, right, not just a specific one? | 17:41 |
clarkb | pabelanger: would be specific to the providers that had their config change | 17:41 |
pabelanger | k | 17:42 |
clarkb | there is an if statement that checks the old config against the new config and if they differ creates a new manager under the new config iirc | 17:42 |
pabelanger | ack | 17:42 |
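A self-contained sketch of the reconfiguration pattern clarkb describes — only providers whose config actually changed get a new manager, and stopping the old one is when in-flight requests can hit a "task manager stopped" style error. The class and attribute names are illustrative, not Nodepool's real ones:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderConfig:
    name: str
    cloud: str
    max_servers: int

class ProviderManager:
    def __init__(self, config):
        self.config = config
        self.running = True

    def stop(self):
        self.running = False

def reconfigure(managers, old_providers, new_providers):
    for name, new_cfg in new_providers.items():
        if old_providers.get(name) == new_cfg and name in managers:
            continue  # unchanged: keep the existing manager
        if name in managers:
            managers[name].stop()  # in-flight tasks on the old manager fail
        managers[name] = ProviderManager(new_cfg)
    for name in set(managers) - set(new_providers):
        managers.pop(name).stop()  # provider removed from the config
    return managers

old = {'rdo': ProviderConfig('rdo', 'rdo-cloud', 10)}
new = {'rdo': ProviderConfig('rdo', 'rdo-cloud', 20)}  # max-servers changed
managers = reconfigure({}, {}, old)
managers = reconfigure(managers, old, new)  # 'rdo' gets a fresh manager
```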
*** pwhalen has joined #zuul | 17:44 | |
*** sdake has quit IRC | 17:44 | |
*** sdake has joined #zuul | 17:46 | |
fungi | grr, having a hard time working out where to plumb in the statsd configuration/options in the github driver. i'm guessing i could reference it from the scheduler object? presumably by the time GithubConnection.addEvent() and eventDone() are called, self.sched is not None so i could just reference self.sched.statsd directly there? | 17:50 |
fungi | or is there a better way? | 17:51 |
clarkb | fungi: zuul.cmd.scheduler calls registerConnections which registers the scheduler with the connections. That should happen quite early. | 17:54 |
clarkb | You probably still want to guard against self.sched being None but otherwise should be safe to do what you describe | 17:54 |
clarkb | fungi: and with the guard we just won't send any of that data until the scheduler is registered, which should be fine | 17:56 |
tobiash | fungi: yes, you can just use self.sched (and guard against none) | 18:02 |
fungi | yep, i already had the conditional in there. just running tox against it now before i push it up | 18:02 |
fungi | should this come with a regression test too? | 18:02 |
fungi | apparently i now go down the rabbit hole of making sure i have the necessary libraries to build the c extensions for some of zuul's dependencies | 18:05 |
clarkb | fungi: I don't know how much testing of the statsd events we do. I don't think its much (but that may have changed) | 18:05 |
clarkb | fungi: bindep should capture them last I tried | 18:05 |
clarkb | re2 being the big "recent" one | 18:05 |
fungi | oh, neat, i also need a mysql server? | 18:06 |
tobiash | fungi: yes, for some tests | 18:06 |
*** panda is now known as panda|off | 18:06 | |
fungi | and a postgresql server and a zookeeper server | 18:06 |
fungi | will it skip those tests if i don't have those servers installed and configured? | 18:07 |
tobiash | fungi: there is the tools/test-setup.sh that configures all of that stuff | 18:07 |
fungi | guess i'm about to find out | 18:07 |
tobiash | fungi: I guess they will fail | 18:07 |
tobiash | and zookeeper is needed for all tests ;) | 18:07 |
tobiash | if there is interest, I can push up a docker-compose based test setup for this stuff | 18:08 |
clarkb | fungi: ya it will fail the db tests. I usually just run with a zookeeper and ignore the db tests unless I'm looking at code in that region | 18:08 |
tobiash | that might be useful for folks to just fire up correctly configured mysql,postgres and zookeeper on tmpfs in docker | 18:08 |
fungi | looks like i'll be running pep8 locally and then pushing this up to run unit tests under ci instead, since i have a backlog of other stuff to get to | 18:09 |
pabelanger | clarkb: looks like you are right, nodepool.yaml was modified | 18:09 |
clarkb | tobiash: or even document a regex to skip the mysql and postgres tests | 18:09 |
clarkb | I always have to figure out that regex again (shame on me for not writing it down) | 18:09 |
tobiash | I never used such a regex | 18:10 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/zuul master: Track GitHub connection event queue length stats https://review.openstack.org/636404 | 18:10 |
fungi | there we go ^ | 18:10 |
tobiash | fungi: thanks, wanna have this ^ :) | 18:11 |
fungi | statsd key name there was chosen arbitrarily | 18:11 |
fungi | please recommend improvements! ;) | 18:11 |
clarkb | fungi: you probably want to do github.$connectionname.event_queue | 18:12 |
clarkb | since you can potentially have multiple github connections | 18:12 |
fungi | aha, thanks | 18:12 |
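A hedged sketch of the kind of change being discussed — report the GithubConnection event queue depth as a gauge keyed per connection, guarded so nothing is sent before the scheduler (and its optional statsd client) is registered. The class and key name here mirror the conversation, not necessarily the final patch:

```python
import queue

class GithubConnectionSketch:
    """Minimal stand-in for the real GithubConnection, just enough to show
    the guarded statsd call."""

    def __init__(self, connection_name):
        self.connection_name = connection_name
        self.event_queue = queue.Queue()
        self.sched = None  # set later via registerConnections()

    def _report_queue_depth(self):
        # The scheduler may not be registered yet, and statsd is optional,
        # so silently skip until both are available.
        statsd = getattr(self.sched, 'statsd', None) if self.sched else None
        if statsd is None:
            return
        statsd.gauge('github.%s.event_queue' % self.connection_name,
                     self.event_queue.qsize())
```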
fungi | on a related note, i see where it reports the github api rate limit count remainder and reset epoch timestamp but we seem to be tracking several different rate limits there and the logs don't do anything to differentiate them. any idea what the distinction between them might be? | 18:14 |
clarkb | jlk: ^ | 18:15 |
corvus | fungi: it might be per-app-install? | 18:15 |
fungi | oh, do we have several of those on the openstack/opendev zuul deployment? | 18:16 |
fungi | if so, that would make sense | 18:16 |
fungi | are those treated as different github "connections"? | 18:16 |
jlk | There is a per-app rate, and anonymous rate, and non-app API rate | 18:16 |
jlk | I believe the rate is per installation of your app | 18:16 |
corvus | fungi: we have only one connection (which maps to one github app), but that connection has many installations of the app. | 18:17 |
jlk | so if you have one Zuul app, and 30 places have installed it (that would be 30 connections in Zuul config), you'll get a rate per installation | 18:17 |
clarkb | re event queue lengths: the openstack zuul takes a significant amount of time to process events from github. reading logs with fungi it seems that we process events serially in a fifo and those events individually may take some seconds to fully process. Because this is serialized, over time we accumulate a backlog. The change above should help track the size of the queue to see if it does indeed grow a backlog (we | 18:17 |
fungi | ahh, so basically per project (org?) using the app? | 18:17 |
clarkb | just don't have enough logging/data to know for sure currently) | 18:17 |
jlk | oh wait, I think corvus just contradicted me, and he would know better. | 18:17 |
tobiash | corvus is right ;) | 18:17 |
jlk | I forgot how each installation mapped to Zuul config | 18:18 |
tobiash | jlk: each connection can have many installations | 18:18 |
fungi | wondering what additional detail we should include in the debug log (and perhaps in an associated statsd key name pattern) for tracking api rate limits | 18:18 |
tobiash | otherwise we would not be able to handle that | 18:18 |
jlk | https://developer.github.com/apps/building-github-apps/understanding-rate-limits-for-github-apps/ | 18:19 |
corvus | fungi: there's an installation number, and we know what repo/org it is associated with. maybe that? | 18:19 |
corvus | (and we know what installation we're using for any given operation) | 18:19 |
tobiash | fungi: for tracking api rate limits you probably want to have that per installation (if we have the installation id) | 18:19 |
fungi | if the rate limit is per org (or project?) which has installed the app, then i guess we're at least not at risk of one user of the app starving the api rate limit for another? | 18:19 |
corvus | fungi: that is the idea, yes | 18:19 |
jlk | yes. It's part of why app based is so compelling | 18:20 |
fungi | that helps. was trying to wrap my mind around the model there. thanks! | 18:20 |
jlk | you get a rate limit per consumer of your app, rather than the old method of a single rate limit shared by everybody sending you web hooks | 18:20 |
jlk | that said, I think not every call we make to the API is via an app | 18:20 |
clarkb | tobiash: have you seen a pile-up of github driver events leading to slow processing? I'm curious if this is something others have dealt with | 18:21 |
jlk | https://developer.github.com/v3/#rate-limiting | 18:21 |
jlk | the additional rate limiting is ^^ | 18:21 |
jlk | 5K per hour for things we do as our own user, 60 an hour for anonymous calls | 18:21 |
jlk | AAAND | 18:21 |
jlk | https://developer.github.com/v3/search/#rate-limit | 18:21 |
tobiash | clarkb: not that dramatically | 18:21 |
tobiash | but our github enterprise has no rate limit set (yet) | 18:21 |
jlk | when we hit search (for finding DependsOn and such), 30 per minute | 18:22 |
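For reference, a small example of inspecting those budgets via the /rate_limit endpoint from the GitHub API v3 docs jlk links above; the token is a placeholder, and app/installation tokens report their own separate budgets:

```python
import requests

resp = requests.get('https://api.github.com/rate_limit',
                    headers={'Authorization': 'token <YOUR_TOKEN>'},
                    timeout=10)
data = resp.json()
core = data['resources']['core']      # 5000/hour for authenticated calls
search = data['resources']['search']  # 30/minute for the search API
print('core remaining:', core['remaining'], 'resets at', core['reset'])
print('search remaining:', search['remaining'])
print('header view:', resp.headers.get('X-RateLimit-Remaining'))
```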
jlk | clarkb: I haven't looked at the change, but I thought we did work to parallelize GitHub event processing a while back, sending it into a worker queue | 18:22 |
jlk | or is it that CONSUMPTION of the queue is serial? | 18:22 |
fungi | jlk: that's what it seems like, yes (the latter) | 18:23 |
jlk | I think the work I did was to make it so the hook receiver didn't hang until the event was processed. | 18:23 |
jlk | nod | 18:23 |
fungi | we see log entries immediately about the trigger event getting queued, but then a lengthy delay (sometimes hours) before the log entry saying it's being scheduled and processed | 18:23 |
*** jpena is now known as jpena|off | 18:23 | |
jlk | gotcha. | 18:24 |
fungi | so basically hanging out in the connection driver's own event queue (we think) | 18:24 |
jlk | that makes sense given what I know | 18:24 |
jlk | I had /assumed/ that processing the queue would be expedient. I didn't think it would be serial. | 18:25 |
jlk | Whatever thing processes the queue should grow threading/forking capability | 18:25 |
fungi | so i suppose burning down that queue is each trigger multiplied by the time it takes for each query that trigger implies (plus the occasional github api socket timeouts) | 18:26 |
jlk | I wonder if one needs to create buckets per installation, and then serially process each bucket, but allow parallel bucket processing | 18:26 |
jlk | that way each bucket is going to FIFO the events w/ some guarantee, but one noisy client shouldn't slow down every client | 18:27 |
corvus | jlk: i don't know what the problem fungi and clarkb are dealing with is, but we receive a tiny percentage of github events compared to gerrit, and our scheduler has plenty of idle time, so i don't think this is a structural problem. | 18:28 |
jlk | Does the system process gerrit events serially, or are there more than one worker working the queue? | 18:29 |
jlk | also I think in the gerrit case all the info is at hand, whereas in the GitHub case I _think_ we have to go query the API for more info. | 18:29 |
corvus | jlk: serially -- both gerrit and github accept events from the remote system, do whatever querying needs to be done (yes, it's more in github than gerrit, but sometimes in gerrit too), then hand off the fully populated event to the scheduler, which is itself single-threaded. | 18:30 |
jlk | Define 'idle time' for the scheduler. Could it be working just one task and is otherwise idle (and that one task is the serial burn down of the queue), or is it completely idle, no tasks waiting (which would seem counter to what fungi and clarkb are seeing) | 18:30 |
fungi | corvus: current problem observed is that when an action is taken on a github project which should trigger a job, it takes our zuul deployment hours (at peak) to enqueue the builds | 18:31 |
*** ParsectiX has joined #zuul | 18:33 | |
corvus | jlk: the scheduler's event queue (which is where the gerrit/github events are deposited), is often empty. it's possible the github driver is spending too much time performing queries to prepare the events for the scheduler. but those events still need to be delivered in order. it's possible that paralellizing the queries there, and then re-serializing before sending them to the scheduler could be more | 18:33 |
corvus | efficient -- maybe that's what you were suggesting. but i have my doubts as to whether we're seeing a volume where that's necessary. | 18:33 |
fungi | reviewing the scheduler's debug log and correlating to the gh app's "advanced" log entries, the trigger event is seen by the githubgearmanworker immediately but then there are no subsequent log entries for it until the githubconnection schedules it much, much later | 18:33 |
corvus | fungi: have some log entries you can paste? | 18:33 |
fungi | sure, just a sec i'll get one up | 18:34 |
*** panda|off has quit IRC | 18:34 | |
fungi | here are the log entries containing the triggering event's github uuid: http://paste.openstack.org/show/744965/ | 18:35 |
fungi | in that case there was a 3-hour delay between receipt and scheduling | 18:36 |
tobiash | fungi: do you have a stream from zuul.GithubConnection at hand? | 18:36 |
fungi | how long of one would help? | 18:37 |
tobiash | fungi: maybe that could help to identify where the bottleneck is | 18:37 |
tobiash | fungi: maybe starting with that event and a few minutes length | 18:37 |
fungi | we logged 14764 entries for zuul.GithubConnection in the past ~12 hours | 18:37 |
fungi | sure, i can get that | 18:37 |
*** gcerami_ has joined #zuul | 18:37 | |
tobiash | fungi: maybe just the part between the first two lines of your paste | 18:38 |
clarkb | jlk: ya my current theory is each individual event is costly enough that we accumulate a backlog | 18:38 |
clarkb | we have little insight via logging of that though hence fungi's change | 18:38 |
tobiash | do you have numbers about the event inflow? | 18:39 |
clarkb | we can probably wc the event received log messages to get that | 18:39 |
fungi | 4657 log entries for zuul.GithubGearmanWorker and zuul.GithubConnection between 15:32:45,739 and 18:33:32,243 | 18:40 |
fungi | so i don't think i can fit them in a paste | 18:40 |
clarkb | I've got to prep for the infra meeting now though | 18:40 |
fungi | i can certainly do the few minutes following 15:32:45,739 though as a start | 18:40 |
tobiash | fungi: oops, I overlooked the hours part of the timestamp ;) | 18:41 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add intermediate registry push/pull roles https://review.openstack.org/634829 | 18:41 |
fungi | yeah, three hours between trigger and scheduling in that instance, which is... why i'm trying to look into it | 18:41 |
tobiash | so that was queued for three hours? | 18:42 |
tobiash | that is definitely not normal | 18:42 |
fungi | i think it was in the githubconnection object's event_queue for three hours, yes | 18:42 |
fungi | not the zuul scheduler's event queue | 18:42 |
tobiash | yes, pretty sure | 18:42 |
fungi | because during that time the scheduler just about always showed an event queue of 0 for that tenant | 18:43 |
*** sdake has quit IRC | 18:44 | |
mordred | fungi: so is that 4657 log entries 4657 events from github, or just log lines? | 18:44 |
jlk | hrm, the first log entry is when we are in handle_payload | 18:45 |
fungi | debug level or higher log lines for GithubConnection and GithubGearmanWorker | 18:45 |
jlk | which then sends it off to self.connection.addEvent | 18:45 |
clarkb | jlk: ya it gets converted into an internal event there and added to the connection.event_queue | 18:45 |
jlk | yeah | 18:45 |
clarkb | jlk: and then it seems to just sit there for hours. Which is why I think the queue is long and we aren't flushing it as quickly as we get new events | 18:45 |
jlk | Probably need more logging/metrics in _handleEvent() | 18:46 |
fungi | here's a sample of the GithubConnection and GithubGearmanWorker log entries starting with when 717fd7b0-2edb-11e9-89ef-7895d9174016 was received and stretching for roughly 8 minutes http://paste.openstack.org/show/744967 | 18:47 |
jlk | hrm, no real way to tell what event caused line 5 | 18:49 |
mordred | feels like it might be nice to log the webhook id with more things to make it easier to trace the actions of an individual webhook | 18:49 |
jlk | yup | 18:49 |
jlk | I was just thinking the same | 18:49 |
fungi | would filtering out the repo permissions and rate limit entries help? | 18:49 |
jlk | maybe, but I think we lack enough context around log events to truly trace the event | 18:50 |
jlk | It's likely we need to make the webhook ID (headers.get('x-github-delivery')) part of the event object that carries through | 18:51 |
*** sdake has joined #zuul | 18:51 | |
jlk | oh we do | 18:51 |
jlk | we add it in addEvent | 18:51 |
jlk | We just don't log it again until _after_ much of the processing has been done | 18:52 |
jlk | We have the data at https://github.com/openstack-infra/zuul/blob/master/zuul/driver/github/githubconnection.py#L190 | 18:52 |
corvus | we received 278 events from github during the 17:00 hour. that gives us a budget of about 13 seconds per event. | 18:53 |
jlk | We should log/emit a metric at that point before the work begins to process it | 18:53 |
pabelanger | just happen to be looking at openstack-discuss, and looks like no official room for zuul at PTG, is that right? http://lists.openstack.org/pipermail/openstack-discuss/2019-February/002618.html | 18:56 |
corvus | pabelanger: that's right | 18:57 |
pabelanger | ack, thanks! | 18:57 |
*** krasmussen has joined #zuul | 18:58 | |
corvus | jlk: agreed, we know when it's finished, but not when it starts | 18:58 |
jlk | That whole function and the method calls are ripe for metrics | 18:59 |
corvus | 2019-02-12 15:34:22,636 DEBUG zuul.GithubConnection: Scheduling event from github: <GithubTriggerEvent 0x7f6a05daf9e8 pull_request ansible/ansible refs/pull/50854/head reopened github.com/ansible/ansible 50854,b8048fe130722c11a42eb817aae08f628fc3af53 delivery: 5ea967b0-2ecb-11e9-9d2b-4586291aacb1> | 18:59 |
corvus | 2019-02-12 15:37:05,593 DEBUG zuul.GithubConnection: Got PR on project ansible for sha b8048fe130722c11a42eb817aae08f628fc3af53 | 18:59 |
corvus | for instance, what happened in those 3 minutes? | 18:59 |
corvus | the first line is the end of processing of one event, right? the next line is 3 minutes later... was it idle, or was it doing something for b804 that whole time | 19:00 |
clarkb | or perhaps dealing with intermediate events that don't get processed to the point of logging? | 19:01 |
corvus | also, why is it the same sha in both? probably we started a new event for the same PR (happens a lot). but would be good to be certain. | 19:01 |
corvus | fungi: are you working on sprinkling more log events in there? | 19:02 |
jlk | a re-open would be a new event on the same sha | 19:02 |
jlk | but we actually have no idea if the second line is in any way related to the event on the first line | 19:02 |
jlk | the second line could have been from an event BEFORE the first line | 19:02 |
*** tflink_ has joined #zuul | 19:03 | |
jlk | corvus: that 'Got PR on' line is in a couple different functions, and is used by a variety of events | 19:04 |
jlk | it's just saying "I asked the GitHub API for the details on a pull request object" | 19:04 |
*** mugsie has quit IRC | 19:04 | |
*** tflink has quit IRC | 19:04 | |
*** fdegir has quit IRC | 19:04 | |
jlk | well, I take that back | 19:05 |
*** mugsie has joined #zuul | 19:05 | |
jlk | that PARTICULAR line is called via _event_status(), it got a status event on that PR for that hash | 19:05 |
*** fdegir has joined #zuul | 19:05 | |
jlk | which could be "pending" status, or "success" or "failure" | 19:05 |
jlk | well no, we return early in pending. | 19:06 |
jlk | anyway a lot of this logging stuff was written before I really had any idea what we'd need to trace :D | 19:06 |
fungi | corvus: i was going to try to understand enough github to help differentiate the various rate limit loglines. can try to improve other logging as well if i get enough understanding of what it means | 19:09 |
*** tflink_ is now known as tflink | 19:10 | |
jlk | fungi: basically at the start of https://github.com/openstack-infra/zuul/blob/6d6c69f93e9755b3b812c85ffceb1b830bd75d6f/zuul/driver/github/githubconnection.py#L189 we get a 'delivery' item, which is the webhook ID. We need to log reception of it there, to match the line when we added it to the queue. From there we need to log the processing of it. Specifically starting to handle the event at https://github.com/openstack-infra/zuul/blob/6d6c69f93e9755b3b812c85ffceb1b830bd75d6f/zuul/driver/github/githubconnection.py#L219 | 19:12 |
jlk | There's some amount of processing in the method() call that causes outbound requests to GitHub's API. Once method() is done, there are further calls that are made in the "if event:" block. All the logging within this class should be called with the 'delivery' string. | 19:14 |
*** gcerami_ is now known as panda | 19:14 | |
corvus | jlk, fungi: yeah, i think i'll take a stab at reworking that to use a loggingadaptor real quick | 19:17 |
corvus | so we can easily get the string everywhere | 19:17 |
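The general stdlib pattern behind the LoggerAdapter idea — prefix every message with the webhook delivery id so a single grep traces one event end to end. This is a generic sketch, not the exact code in the review above:

```python
import logging

class DeliveryLoggerAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        # Prepend the delivery id carried in the adapter's extra dict.
        return '[delivery: %s] %s' % (self.extra['delivery'], msg), kwargs

logging.basicConfig(level=logging.DEBUG)
base = logging.getLogger('zuul.GithubConnection')
log = DeliveryLoggerAdapter(
    base, {'delivery': '717fd7b0-2edb-11e9-89ef-7895d9174016'})
log.debug('Scheduling event from github')
# -> DEBUG:zuul.GithubConnection:[delivery: 717fd7b0-...] Scheduling event ...
```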
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Add dockerized test setup https://review.openstack.org/636424 | 19:18 |
tobiash | corvus, mordred, fungi: is there interest in this? ^ | 19:18 |
tobiash | I find it quite useful for setting up the test environment on a dev machine | 19:19 |
*** dmsimard has quit IRC | 19:22 | |
fungi | getting familiar with using docker locally is still on my to do list. presumably useful but i don't yet have enough context to know whether i'll find it so | 19:24 |
*** jamesmcarthur has joined #zuul | 19:26 | |
tobiash | fungi: in that case, just fire up that script, wait for completion and then you can run tox | 19:26 |
jamesmcarthur | hi all - not sure who to address this towards, but we looked into some solutions for anonymous analytics for the zuul.org site. | 19:28 |
jamesmcarthur | Our fantastic intern put together some details: https://docs.google.com/document/d/1xli_xW5Fp5xI-FYtEzJV2ijn6cxUfEczk0pVdO69I7w/edit?usp=sharing | 19:28 |
jamesmcarthur | corvus: fungi: if this seems like it would work for zuulci.org, please let me know | 19:28 |
*** jamesmcarthur has quit IRC | 19:30 | |
*** aprice has joined #zuul | 19:30 | |
dkehn | clarkb: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n9, when setting a parent job's abstract attribute, what is the implication of that? | 19:32 |
*** dmsimard has joined #zuul | 19:32 | |
clarkb | dkehn: an abstract job cannot be run. It is there only to be part of the inheritance tree | 19:32 |
clarkb | dkehn: useful if you have some incomplete set of job config that needs to be inherited from and filled in | 19:32 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Improve event logging in githubconnection https://review.openstack.org/636429 | 19:33 |
corvus | tobiash, jlk, fungi: ^ | 19:34 |
* jlk looks | 19:34 | |
corvus | from a unit test run, that looks like: http://paste.openstack.org/show/744972/ | 19:36 |
jlk | I think it's a good start, but I would also sprinkle in logging into the _whatever() methods that are called by method(). Could be a follow up change. | 19:37 |
corvus | jlk: good idea, and i just saw a bug, so new ps incoming | 19:38 |
jlk | neat | 19:38 |
tobiash | corvus: commented | 19:39 |
*** sdake has quit IRC | 19:42 | |
corvus | yeah, that's the bug :) | 19:47 |
*** jamesmcarthur has joined #zuul | 19:59 | |
*** ParsectiX has quit IRC | 20:12 | |
fungi | the delivery uuid is indeed most helpful there | 20:13 |
fungi | that'll make it easy to track with one grep | 20:13 |
tobiash | corvus, jlk: given that the event processing really is too slow for the single threaded event queue, what do you think about doing that multi-threaded while maintaining strictly ordered forwarding to the scheduler loop? | 20:13 |
corvus | tobiash: you are way ahead of where i am in understanding this. | 20:14 |
corvus | i would really like to avoid dramatic redesigns until we have more data | 20:15 |
tobiash | sorry, I'm thinking if it's possible to do the github event queue processing multi threaded and still maintain the correct order in which these events are forwarded into the scheduler run loop | 20:15 |
corvus | tobiash: yes, i believe jlk and i also said as much earlier | 20:15 |
tobiash | corvus: sure, I'm just already thinking about it ;) | 20:16 |
corvus | tobiash: but let's not even think about doing that until we understand the problem | 20:16 |
SpamapS | poppycock, run off into the weeds and write code, I say, data is for the unbold. | 20:17 |
jlk | hehe | 20:17 |
tobiash | looks like I missed parts of the backlog, sorry for the noise ;) | 20:17 |
corvus | there are so many ways that approach could cause us problems too, especially if we don't know what's actually causing the issue | 20:17 |
jlk | I'm with corvus on this though. We're making guesses w/out data. I'd like to see logging output from the changes we're making to really trace through what's happening | 20:17 |
jlk | We could throw more threads at the problem but if the real issue is that our API calls are really slow for some reason we'd want to fix that as well / instead. | 20:17 |
corvus | as soon as i finish this sandwich, i will finish that patch | 20:18 |
clarkb | I should go find lunch too, then return to review the data collecting change | 20:18 |
jlk | lunch sounds good | 20:19 |
tobiash | maybe there is also potential to reduce the number of api requests | 20:20 |
clarkb | jlk: it is that time of day in this part of the world. Did your precipitation switch to rain? we never got any snow here but pdx airport got ~5". Now we have inches of rain | 20:20 |
jlk | yeah we've been getting rained on since last night. | 20:20 |
jlk | We had likely over a foot of accumulated snow in the yard/roofs/etc. | 20:21 |
jlk | I'm very much done w/ shoveling my patio/driveway/sidewalks | 20:21 |
mordred | jlk: a foot of snow is not what your neck of the woods is designed for | 20:21 |
clarkb | jlk: it was crazy how different being a few miles west changed everything here | 20:21 |
clarkb | jlk: on saturday we had a bunch of ice then everything went to melty rain and no snow :( | 20:22 |
jlk | mordred: no, indeed not. That's why the kids have been on snow days for... a while now | 20:22 |
jlk | I'm also very much ready for them to go back. | 20:22 |
jlk | I'm honestly surprised I didn't lose power. Most our power is above ground lines, and all the trees had HEAVY snow on them. I could hear trees breaking all over the neighborhood. None took out our lines though. Seattle and the east side are not as lucky. | 20:23 |
mordred | jlk: I saw photos that looked very inappropriate for seattle | 20:27 |
*** edmondsw has joined #zuul | 20:27 | |
mordred | they looked more like photos of the side of mt. baker | 20:28 |
jlk | Here's a set of photos from yesterday, before ANOTHER 5~ inches fell. (or rather WHILE it was falling) https://www.instagram.com/p/BtwjHNhBlw7/ | 20:28 |
pabelanger | shoveling wet snow stinks, but makes great snowballs | 20:28 |
*** jamesmcarthur has quit IRC | 20:31 | |
SpamapS | Trust the Canadian about snow and ice. | 20:32 |
SpamapS | They know. | 20:32 |
fungi | yeah, i don't want to just assume we know why github event processing is slow on our deployment. i'd rather collect data first and see where it leads us. i was mostly trying to infer enough from the data we have to know where strategic gathering of additional data is likely to pay off in subsequent diagnosis | 20:34 |
fungi | which has resulted in needing to learn a bit more about github than i anticipated, but i suppose i shouldn't be surprised by it | 20:35 |
tobiash | this may not be the github event processing, I just looked at your paste and every event from 'Updating <Change...' until 'Scheduling event from github' is below 8s as far as I can see | 20:35 |
tobiash | I'm stumbling across this: http://paste.openstack.org/show/744974/ | 20:36 |
tobiash | if the queue length is > 0 I'd expect that after 'scheduling event from github' it immediately resumes with the next 'updating <change' | 20:37 |
tobiash | but there is a 6min gap in between | 20:37 |
clarkb | we don't log the updating change until after the requests though right? | 20:39 |
clarkb | so it could be blocking somewhere between? | 20:39 |
tobiash | hrm, I don't see much code in between that could be blocking | 20:42 |
tobiash | just looks like it is blocked here: https://git.zuul-ci.org/cgit/zuul/tree/zuul/driver/github/githubconnection.py#n190 | 20:44 |
tobiash | but I also think we need the logging improvement for a deeper analysis | 20:45 |
clarkb | ya | 20:45 |
corvus | i'm about to push up a new ps | 20:45 |
corvus | make sure i have logging there :) | 20:45 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Improve event logging in githubconnection https://review.openstack.org/636429 | 20:47 |
corvus | tobiash, jlk, fungi, clarkb: ^ sorry it's a more significant change, but it means less passing around of data and loggers. | 20:47 |
corvus | oh cool, the diff on that in gertty at least looks really good | 20:48 |
corvus | gerrit too. cool. should be easy to follow and review then. | 20:48 |
*** jamesmcarthur has joined #zuul | 20:52 | |
*** jamesmcarthur has quit IRC | 20:54 | |
tobiash | corvus: found a bug ^ | 21:00 |
corvus | tobiash: same thing a few lines down as well | 21:01 |
tobiash | yah | 21:01 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Improve event logging in githubconnection https://review.openstack.org/636429 | 21:01 |
corvus | i'm runinng tests.unit.test_github_driver.TestGithubDriver.test_pull_event locally, but that's it | 21:02 |
corvus | so i'm catching the big errors, but that doesn't test everything | 21:02 |
*** fdegir has quit IRC | 21:06 | |
*** mugsie has quit IRC | 21:06 | |
*** mugsie has joined #zuul | 21:06 | |
tobiash | corvus: found some more, do you mind if I push an update? | 21:15 |
corvus | tobiash: go for it | 21:15 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Improve event logging in githubconnection https://review.openstack.org/636429 | 21:15 |
tobiash | ran the complete test suite locally ^ | 21:15 |
corvus | clever | 21:15 |
corvus | tobiash: those all look good, thx | 21:16 |
tobiash | yw :) | 21:16 |
tobiash | corvus: so our +1 add up to +2? | 21:17 |
corvus | tobiash: i think so. :) let's +3 it unless jlk wants in on this...? | 21:18 |
jlk | I'm running a meeting, don't wait on me | 21:18 |
tobiash | k | 21:18 |
tobiash | that logging also should reveal if there are blocking problems on that queue | 21:19 |
corvus | tobiash, clarkb, fungi, jlk: i ran the sigusr debug handler, which means i got two stack traces at random points in time. for the githubconnection event handler thread, they both had the same stack traces: | 21:21 |
corvus | File "/usr/local/lib/python3.5/dist-packages/zuul/driver/github/githubconnection.py", line 1133, in getPullBySha | 21:21 |
corvus | for pr in repo.pull_requests(state='open'): | 21:21 |
corvus | which tells me that, statistically speaking, we probably spend a lot of time in that bit of code. | 21:21 |
corvus | in particular, i can think of one repo that we monitor where "iterate over every open pull request" may not be the best idea. | 21:21 |
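For context, the stdlib pattern behind a "dump all thread stacks on a signal" debug handler like the one used above — Zuul's real handler differs in detail, and the signal choice here is an assumption:

```python
import signal
import sys
import threading
import traceback

def dump_threads(signum, frame):
    # Print the current stack of every thread; a thread that keeps showing
    # up in the same spot (e.g. iterating open pull requests) is where the
    # time is going.
    for thread in threading.enumerate():
        print('Thread: %s' % thread.name)
        stack = sys._current_frames().get(thread.ident)
        if stack is not None:
            traceback.print_stack(stack)

signal.signal(signal.SIGUSR2, dump_threads)
```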
clarkb | we must list all open pull requests too? I wonder if that is expensive for ansible | 21:22 |
clarkb | ya | 21:22 |
tobiash | hrm, that sounds not very optimal | 21:22 |
tobiash | hrm, that code path is used on a status event | 21:23 |
tobiash | we should check if there is really no pr number in the status event that forces us to do that | 21:24 |
tobiash | but I guess that explains what you observe | 21:25 |
mordred | oy. yeah | 21:25 |
clarkb | the unlabel event I have pulled up has a pull_request dict | 21:26 |
*** sdake has joined #zuul | 21:26 | |
tobiash | hrm, the docs tell me that there is really no reference to the pr: https://developer.github.com/v3/activity/events/types/#statusevent | 21:27 |
corvus | can we do a search instead of iterating? | 21:27 |
clarkb | oh I see, it's a commit being updated and we try to reference that back to a PR | 21:28 |
jlk | I'll pay attention here in a moment. | 21:28 |
tobiash | checking the api docs | 21:28 |
tobiash | I think a search could work: https://help.github.com/articles/searching-issues-and-pull-requests/#search-by-commit-sha | 21:30 |
tobiash | together with https://developer.github.com/v3/search/#search-issues-and-pull-requests | 21:30 |
corvus | i'm looking far back in the logs today when things were less busy. 8a98cf82-2e8f-11e9-921a-f68fa09c6c2a was received by the gearman worker at 6:29:25 and was delivered to the scheduler at 6:30:49. so it took 84 seconds. | 21:31 |
jlk | okay trying to get caught up on backscroll | 21:32 |
corvus | (assuming there was no event processor backlog, which i am assuming because there was no activity for the next 4 minutes after the delivery to the scheduler) | 21:32 |
jlk | wait | 21:33 |
SpamapS | oh wow, if you have debug logging on the scheduler, zuul-git makes a Read Refs log entry that is basically "everything that has ever landed in any of these repos" | 21:33 |
jlk | we're not iterating in the way you think | 21:33 |
corvus | jlk: whew | 21:33 |
jlk | At least I don't think. let met get to that line of code | 21:33 |
tobiash | corvus: search by sha works: curl https://api.github.com/search/issues\?q\=6891aacd27e28676bab8ec3e957cde26c1616e1c | 21:33 |
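A hedged sketch of the same search from Python, mirroring the curl above but scoped to one repository and to pull requests only (illustrative, not how zuul currently resolves the PR):

    import requests

    def search_pull_by_sha(org, project, sha):
        # One request against the search endpoint instead of paging
        # through every open PR; note the search API has its own,
        # much lower, rate limit (discussed below).
        resp = requests.get(
            'https://api.github.com/search/issues',
            params={'q': '%s type:pr repo:%s/%s' % (sha, org, project)})
        resp.raise_for_status()
        return [item['number'] for item in resp.json()['items']]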
SpamapS | $ kubectl logs zuul-scheduler-0|grep 'Read refs'|wc | 21:33 |
SpamapS | 8 88944 3200902 | 21:33 |
SpamapS | all of the refs/changes/foo's and every single commit hash | 21:34 |
corvus | SpamapS: i think you can filter out that logging with a logging config file | 21:34 |
SpamapS | corvus: I can, I just wonder if that's a necessary level of logging, even at debug. :) | 21:34 |
jlk | So it's one API call to get all the pull requests | 21:34 |
jlk | then WITHIN zuul code we iterate over what we got as a result in that API call | 21:34 |
corvus | SpamapS: it's a good question. i've certainly used it in the past, but we do not currently have it enabled in openstack-infra | 21:35 |
jlk | it's python loop iteration, not a new API call for each PR | 21:35 |
SpamapS | I find the debug logs to be quite useful. I dump them all into elasticsearch and filter them out usually, but it's nice to have them there. | 21:35 |
jlk | So an API call to get all the open pull requests of a repo vs. an API call to search for a PR w/ a hash is likely to be roughly the same | 21:35 |
tobiash | jlk: well, that probably does pagination | 21:35 |
mordred | jlk: it might be doing pagination calls on the backend in github3.py though | 21:35 |
jlk | tobiash: that's a fair point. | 21:35 |
corvus | SpamapS: i could see removing it from the "default" debug logging config, and next time i need it, i can make a manual logging config | 21:35 |
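A minimal sketch of that kind of manual tweak; the logger name here is an assumption, so check the logger shown on the "Read refs" lines in your own debug log before using it.

    import logging

    # Assumption: the noisy "Read refs ..." entries come from a logger
    # under the zuul merger hierarchy (e.g. "zuul.Merger"); verify the
    # name in your own logs. In a real deployment this would go in the
    # scheduler's logging config file rather than in code.
    logging.getLogger('zuul.Merger').setLevel(logging.INFO)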
mordred | especially since we're talking about the ansible/ansible repo here :) | 21:36 |
tobiash | jlk: so on a busy project you might pull down a lot of data while looping | 21:36 |
jlk | let me dig further | 21:36 |
jlk | Because we could tell it to get more at once | 21:36 |
SpamapS | Is there a trace level? | 21:36 |
corvus | apparently there are 1713 open prs | 21:36 |
corvus | SpamapS: not in python | 21:36 |
SpamapS | shame | 21:36 |
SpamapS | anyway, I think I found a bug in github that is unrelated to what you all are looking at | 21:37 |
jlk | https://github.com/sigmavirus24/github3.py/blob/master/src/github3/repos/repo.py#L2355-L2412 | 21:37 |
SpamapS | but it may be all zuul just not working well with github.... | 21:37 |
corvus | SpamapS: yeah. openstack was so disappointed they made one, but it turned out to be a bad idea because none of the tooling deals with it right, so i don't think we should follow suit. but i'd totally use it if it really existed. | 21:37 |
SpamapS | I'd like to have the "require gate status check" branch protection checked on my repos, so only Zuul can merge things, but the status report comes *after* the merge. | 21:38 |
tobiash | jlk: well, we only want that single matching pull request so I think we really want to change this to a search | 21:38 |
jlk | Search is a different API limit | 21:38 |
jlk | a much shorter one | 21:38 |
tobiash | how much? | 21:39 |
jlk | 30 per minute | 21:39 |
tobiash | hrm | 21:39 |
SpamapS | oh wait, no, the driver actually does set statuses before calling merge | 21:40 |
tobiash | but I guess we can live with that limit given that it takes 2-10s to process one event | 21:40 |
jlk | tobiash: On a busy zuul w/ lots of github clients you could possibly overrun that limit quickly | 21:41 |
tobiash | SpamapS: we've been using that for ages and I've had no complaints about it so far, is this ghe? | 21:41 |
SpamapS | tobiash: no, github.com | 21:41 |
SpamapS | tobiash: I wonder if it's just async | 21:41 |
SpamapS | 'bit of a race condition | 21:41 |
tobiash | jlk: with the rate I saw in the logs zuul is not able to process more than 30 events per minute | 21:41 |
SpamapS | Like, need to confirm the status if you get a 405 | 21:42 |
tobiash | SpamapS: maybe, I haven't connected my zuul to github.com | 21:42 |
SpamapS | I've used the setting with GHE in the past too, at GoDaddy. | 21:42 |
SpamapS | so I was surprised when my change failed to merge. | 21:42 |
* SpamapS tries again | 21:43 | |
tobiash | SpamapS: then async on github.com seems likely | 21:43 |
tobiash | jlk: granted, there are other places in zuul that perform a search too | 21:43 |
jlk | oh I know why this code looks familiar | 21:44 |
jlk | and whY I hate this part of GitHub | 21:44 |
jlk | You can have multiple PRs open with the same head SHA. | 21:44 |
jlk | it's unlikely, but possible. | 21:44 |
corvus | yeah, if it happens in zuul, we just punt. | 21:44 |
jlk | and since there's no way to match the event with the specific PR it came from, we can't just pick the first and continue | 21:45 |
jlk | which is why we dump them all | 21:45 |
jlk | we DO punt if we find more than 1 | 21:45 |
SpamapS | mebbe some day github will adopt Gerrit's model and deprecate "pull" requests. :-P | 21:45 |
* SpamapS can dream | 21:45 | |
jlk | SpamapS: I don't think that would improve this | 21:45 |
jlk | see what corvus just said | 21:45 |
corvus | i think the new logging will tell us with more certainty how long the 'getpullbysha' method takes. i'm not sure it will tell us how many underlying requests we made... | 21:45 |
jlk | corvus: yeah. We might be able to get some logging from github3.py for that | 21:46 |
jlk | anyway... | 21:46 |
jlk | I _think_ this problem set of code could go away if we fully switched over to the checks API | 21:46 |
tobiash | corvus: but the github3.py logging will tell you | 21:46 |
jlk | except, we still want to be able to trigger stuff on events outside of zuul events, so maybe not | 21:46 |
corvus | jlk: hrm, yeah, i guess it probably can't go away until nothing uses status anymore... | 21:47 |
jlk | sadly. | 21:48 |
clarkb | out of curiosity what do the status events give us? | 21:48 |
corvus | however, the fewer status changes, the less this code is hit. so any movement in that direction is an improvement. | 21:48 |
tobiash | so what should we do, performa search or try to make the iteration faster? | 21:48 |
jlk | and Ansible has a LOT of status users. | 21:48 |
tobiash | clarkb: we need this to trigger the gate | 21:48 |
clarkb | most of the trigger stuff is oriented around PR changes not commit changes right? | 21:48 |
jlk | clarkb: they're how we currently set a head of a PR as passing CI or not | 21:48 |
tobiash | in case review and merge label are already there, the status event from the check pipeline will trigger the gate | 21:48 |
jlk | clarkb: and how we learn if another CI system has passed the change or not | 21:49 |
*** sdake has quit IRC | 21:50 | |
jlk | tobiash: We should get some visibility into whether or not this bit of code is a significant slow down, and if so, how much faster a search would be, or if there are ways we can improve the API calls github3.py makes | 21:50 |
* clarkb reviews the new logging change since that seems like a good next step | 21:50 | |
corvus | jlk: ++ clarkb so i think openstack-infra should restart the scheduler with the logging change, and probably also increase the logging level for github3.py so we can see those requests. | 21:51 |
corvus | that latter will be a logging config change on our scheduler... and should be temporary because boy is it going to be big. | 21:51 |
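A sketch of such a temporary change, assuming github3.py emits its request logging under the 'github3' logger hierarchy (worth verifying before relying on it; the same caveat about preferring the logging config file applies):

    import logging

    # Assumption: github3.py logs under the 'github3' logger name.
    # DEBUG here is very verbose, so treat it as a temporary setting.
    logging.getLogger('github3').setLevel(logging.DEBUG)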
clarkb | looks like everyone has already reviewed it :) | 21:51 |
tobiash | corvus: we have 71M of github3.py logging in three days | 21:53 |
jlk | hrm. | 21:54 |
jlk | I'm curious | 21:55 |
tobiash | while having 1G of zuul debug logs in 10 hours... | 21:55 |
jlk | GithubEventProcessor() class uses the GithubEventLogAdapter, which logs the 'delivery' key. | 21:55 |
* corvus stands by to -2 his own patch.... | 21:55 | |
jlk | oh wait, okay. I see we hand it a 'delivery' thing in __init__ | 21:56 |
* corvus stands down | 21:56 | |
jlk | I was worried that run() wouldn't have the 'delivery'. | 21:56 |
jlk | so now we'll have one of these running for EACH event that comes in? | 21:56 |
corvus | yeah. it should always be there. it may be None when we shut down. | 21:56 |
corvus | jlk: yes, but it's not a thread. it's really just a data structure. | 21:57 |
jlk | Yeah, I'm just thinking we could really move things along if "GithubEventProcessor(self, data).run()" was an async fire and forget call | 21:58 |
corvus | (however, if we did decide to have some parallelism in the future, you could hand these things to, say, a pool of workers) | 21:58 |
jlk | GithubEventConnector uses threading, how many threads does it start? | 21:58 |
clarkb | jlk: one aiui | 21:58 |
jlk | like how quickly will it go through its run() loop? | 21:59 |
corvus | certainly O(1). | 21:59 |
corvus | might be 2. | 21:59 |
clarkb | jlk one event at a time | 22:00 |
jlk | I wonder if we could relieve some back pressure by spawning threads for each event | 22:00 |
clarkb | corvus: where does kwargs['extra']['delivery'] make its way into that method? | 22:00 |
jlk | but... threading. Now you have O(N) problems. | 22:00 |
jlk | clarkb: I believe it comes from the 'data' sent in on line 458 | 22:01 |
corvus | jlk: yeah if we did have to parallelize, i'd want to keep a tight lid on the number of threads here | 22:01 |
jlk | We'll just re-write it all in Go, use Goroutines, and everything will be fine, right? | 22:01 |
clarkb | oh I see, ya it's reading it directly from the event data | 22:02 |
openstackgerrit | Merged openstack-infra/zuul master: Improve event logging in githubconnection https://review.openstack.org/636429 | 22:02 |
corvus | (because, right now, we have a running system with a (presumed) backlog. which is better than a non-running system with 2000 stuck threads) | 22:02 |
jlk | if we were to make threads there, we could totally get out of sync too, where some threads would finish faster than others and we'd deliver events out of order. | 22:02 |
clarkb | the reason it isn't already parallelized is that for state correctness we want to update zuul's view of the world serially right? | 22:02 |
clarkb | so ya we could churn through events quicker, but we'd have to carefully lock around updates to the internal state model | 22:02 |
corvus | correct | 22:03 |
jlk | my brain just went all melty | 22:03 |
corvus | just another day in #zuul | 22:03 |
jlk | thinking about creating an ordered list of events in a data store, allowing parallel processing to get the events processed up to the point of adding them into the scheduler queue, and then going back to serial to dump them INTO the queue | 22:03 |
jlk | er into the Scheduler queue | 22:03 |
jlk | I can see it in my mind, but no way I could interpret that into code | 22:04 |
tobiash | jlk: getchange has side effects... | 22:04 |
jlk | yeh I haven't even thought through the side effects | 22:04 |
corvus | jlk: it's so easy in napkin-language | 22:04 |
jlk | none of this was written with thread safety in mind | 22:05 |
corvus | well, that's not quite true :) | 22:05 |
corvus | tobiash: in general, getchange's side effects should be okay | 22:05 |
jlk | ahem. None of what _I_ wrote was written with thread safety in mind... | 22:05 |
clarkb | is the extra kwarg something that python logging inserts? | 22:06 |
clarkb | I'm assuming so if this all works in tests | 22:06 |
tobiash | corvus: I guess you're right | 22:07 |
corvus | we allow a certain amount of mutation of data on the Ref objects. they're generally expected to be as up to date as possible and can change without locking. | 22:07 |
corvus | clarkb: yes, and the adapter is copied from the executor | 22:07 |
corvus | (i'd like to use them in more and more places, they're really cool) | 22:08 |
clarkb | ya it's a neat way to apply a mutation to consistent types of log data | 22:08 |
clarkb | it just has a slightly magic interface with kwargs mutations | 22:08 |
corvus | clarkb: it's also extra metadata on the log entry itself, so it's queryable via journald, etc. | 22:08 |
clarkb | oh thats an extra neat feature | 22:09 |
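For readers unfamiliar with the pattern, a minimal LoggerAdapter that injects a delivery id looks roughly like the sketch below; the logger name is illustrative and this is not zuul's exact adapter.

    import logging

    logging.basicConfig(level=logging.DEBUG, format='%(name)s: %(message)s')

    class DeliveryLogAdapter(logging.LoggerAdapter):
        # Prefix every message with the GitHub delivery id and also
        # attach it to the record via 'extra', so structured backends
        # (journald, elasticsearch) can query it as a field.
        def process(self, msg, kwargs):
            kwargs['extra'] = self.extra
            return '[delivery: %s] %s' % (self.extra['delivery'], msg), kwargs

    log = DeliveryLogAdapter(
        logging.getLogger('zuul.GithubEventProcessor'),
        {'delivery': '8a98cf82-2e8f-11e9-921a-f68fa09c6c2a'})
    log.debug('Starting event processing')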
jlk | Oh I remember something else. We can't even ask GitHub to add a reference to the PR in that webhook, because the status is SET on a hash ( https://developer.github.com/v3/repos/statuses/#create-a-status ) it's not set on any specific pull request. | 22:23 |
clarkb | ya this seems to be a mismatch between pull request workflows and commit based api | 22:23 |
clarkb | so you have to jump through hoops to match them up in places | 22:23 |
corvus | given that, i wonder if an endpoint to return all the prs for a hash wouldn't be a terrible idea? | 22:24 |
corvus | especially one that counted against the normal api limit, not the search limit :) | 22:24 |
jlk | I'm sure you could cook something like that up w/ GraphQL | 22:27 |
jlk | however. GraphQL. | 22:27 |
*** jamesmcarthur has joined #zuul | 22:27 | |
corvus | tristanC: i've restarted zuul-web with the buildsets api endpoint patch, and the buildsets tab on http://logs.openstack.org/41/630041/5/check/zuul-build-dashboard/2290fd5/npm/html/ is behaving strangely. it seems to return a small set of buildsets at random. | 22:29 |
jlk | Digging through _iter() some more. The repo.pull_requests() API call defaults to asking for "-1" number of pulls, which is "all". That -1 is passed through to _iter(), which goes to GitHubIterator, which sets a page limit at 100 in that case. | 22:32 |
jlk | so, without direction it's going to fetch 100 at a time and page through them until done. We could... fetch more? Ask for 500? | 22:32 |
corvus | so that's 18 requests for ansible by default | 22:33 |
jlk | A good candidate for why that takes a while. | 22:33 |
jlk | I'm looking to see if there is an upper bound for how many we can request at once | 22:34 |
jlk | "For some resources, you can also set a custom page size up to 100 with the ?per_page parameter" | 22:34 |
jlk | so... 100 :/ | 22:34 |
corvus | maybe we could do an lru cache of sha->pr mapping | 22:35 |
jlk | Honestly, it wouldn't be the worst idea ever to just work 100 at a time to find ONE that matches and just end there. | 22:35 |
jlk | it's just perpetuating the oddity | 22:36 |
clarkb | corvus: even at a few thousand entries that should be a relatively small amount of memory | 22:36 |
clarkb | corvus: that seems like a not bad idea to me | 22:36 |
jlk | Would need some way to expire things from the cache as they close. Handle a PR close event and remove the PR from the cache | 22:39 |
jlk | otherwise you wouldn't recover from "multiple PRs w/ this hash" problem fast enough. | 22:39 |
corvus | good point | 22:40 |
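A sketch of that idea under the stated assumptions: an in-process map from head sha to PR number with LRU eviction, plus explicit invalidation when a pull_request "closed" event arrives. This is a hypothetical helper, not existing zuul code.

    from collections import OrderedDict

    class ShaToPullCache:
        def __init__(self, max_entries=2048):
            self.max_entries = max_entries
            self._entries = OrderedDict()  # sha -> PR number, in LRU order

        def get(self, sha):
            if sha not in self._entries:
                return None
            self._entries.move_to_end(sha)  # mark as recently used
            return self._entries[sha]

        def put(self, sha, pr_number):
            self._entries[sha] = pr_number
            self._entries.move_to_end(sha)
            if len(self._entries) > self.max_entries:
                self._entries.popitem(last=False)  # evict the oldest entry

        def invalidate_pr(self, pr_number):
            # Call this from the pull_request "closed" event handler so a
            # stale or newly ambiguous sha -> PR mapping does not linger.
            for sha in [s for s, n in self._entries.items() if n == pr_number]:
                del self._entries[sha]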
clarkb | (mostly personal curiosity) how do you end up with PRs that have the same hash? | 22:41 |
clarkb | is the common case trying to merge the same commit to multiple branches? | 22:41 |
clarkb | and the branches have the same commit history? | 22:41 |
jlk | I think it's very uncommon, it's just possible? | 22:41 |
clarkb | (if we can describe the case we might be able to say something like "zuul doesn't work with this workflow") | 22:41 |
jlk | like you've made a release branch and then quickly had to fix a thing, the fix goes to both master and the release branch, and master hadn't moved yet | 22:41 |
jlk | or if somebody accidentally opens 2x PRs from the same branch to the same target | 22:42 |
jlk | there's lots of tooling out there to open PRs, a hiccup could cause multiple. | 22:42 |
*** tflink has quit IRC | 23:02 | |
*** tflink has joined #zuul | 23:02 | |
*** jamesmcarthur has quit IRC | 23:11 | |
SpamapS | clarkb: you force push | 23:24 |
SpamapS | I have accidentally done it ;) | 23:24 |
jlk | heh. Yeah there's nothing GitHub (or gerrit) side that would prevent a variety of scenarios from happening | 23:25 |
clarkb | Gerrit actually enforces a lot more rules around this | 23:25 |
clarkb | like force pushing can be disabled pretty forcefully. And you can't push the same commit to multiple changes iirc, you'd have to change the changeid? though maybe that works with the stable/master example above | 23:27 |
clarkb | since the unique tuple is sha, changeid, branch | 23:27 |
jlk | a force push can be disabled by GitHub too, if the repo owner chooses to do so | 23:27 |
jlk | except that often PRs come from forks, and the rules don't necessarily apply | 23:27 |
*** jamesmcarthur has joined #zuul | 23:32 | |
SpamapS | ya, it's not only possible, but I'd say it's a valid scenario that should result in a clear error message. | 23:32 |
*** jamesmcarthur has quit IRC | 23:32 | |
*** jamesmcarthur has joined #zuul | 23:33 | |
SpamapS | Even if it means spamming both PRs. ;) | 23:33 |
clarkb | ya I don't necessarily think it's a bad thing, except that if we sacrifice some corner case functionality in exchange for working at all, that might be a good compromise | 23:33 |
jlk | That means finding all the matches again :/ | 23:33 |
clarkb | The other thing that gerrit does is more explicitly couple the workflow and api. So you have the commit info in the events | 23:34 |
clarkb | (I have no idea what it is like to implement that in github's api btu a query for list of PRs that map to this commit could be useful, this was suggested earlier I think) | 23:34 |
jlk | our checks API is definitely an attempt at improving this | 23:35 |
jlk | because check runs and suites are attached to specific PRs | 23:35 |
clarkb | jlk: in that case the entire CI workflow runs through that different API? | 23:35 |
jlk | it'll just take a while for the whole world to move over | 23:35 |
jlk | clarkb: yeah. As an app you register as a check suite, and on PR open a check run is requested, which Zuul would get and start interacting with. We send results into the check run | 23:36 |
clarkb | would we still need to process status updates to deal with gating things? | 23:37 |
jlk | yes, until all the other systems move over into check runs | 23:37 |
*** dkehn has quit IRC | 23:37 | |
jlk | so instead of setting status, they would be updating their own specific check suite/runs. We can iterate over them to see if they're complete or not | 23:37 |
jlk | hrm I wonder | 23:37 |
*** dkehn has joined #zuul | 23:38 | |
jlk | I wonder if we would get updates about a check run for another app | 23:38 |
jlk | maybe. Hard to tell from the docs | 23:39 |
jlk | "GitHub Apps that have the checks:read permission and subscribe to the check_suite webhook event receive the completed action payload for all check suites in the app's repository. " | 23:42 |
jlk | I think we'd receive hooks as each check suite completes, which is analogous to a status success. | 23:43 |
*** sdake has joined #zuul | 23:43 | |
clarkb | so the limitation might become "gating only works properly if all check suites use the checks api" and then switch zuul over to that (I'm hand waving though as I doubt I'd write that conversion or have all the details anticipated) | 23:44 |
jlk | heh | 23:44 |
jlk | maybe. I think it'd be more phased | 23:44 |
jlk | 1) move Zuul over to checks API. 2) sometime later deprecate reacting to status events. 3) stop responding to status events, claim that gating (on systems other than Zuul) requires those systems to be using the checks API | 23:45 |
jlk | a bunch of work is done for 1, in github3.py. It's probably time I poke at Zuul again and try to make use of it | 23:45 |
*** jamesmcarthur has quit IRC | 23:56 |