tristanC | zuul-maint, looking for review on https://review.openstack.org/632620 to be able to use add-build-sshkey on static node. thanks :) | 00:00 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: executor: enable zuul_return to update Ansible inventory https://review.openstack.org/590092 | 00:20 |
*** rlandy is now known as rlandy|bbl | 00:36 | |
*** TheJulia has joined #zuul | 00:59 | |
*** mrhillsman has joined #zuul | 00:59 | |
*** jtanner has joined #zuul | 00:59 | |
*** sdake has joined #zuul | 01:01 | |
openstackgerrit | Merged openstack-infra/zuul master: Add a note about sqlalchemy metadata https://review.openstack.org/636189 | 01:18 |
*** sdake has quit IRC | 01:37 | |
*** sdake has joined #zuul | 01:39 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul-jobs master: Add intermediate registry push/pull roles https://review.openstack.org/634829 | 01:41 |
*** sdake has quit IRC | 01:43 | |
*** sdake has joined #zuul | 01:50 | |
*** sdake has quit IRC | 01:51 | |
openstackgerrit | Merged openstack-infra/zuul master: web: add /{tenant}/buildsets route https://review.openstack.org/630035 | 02:12 |
*** saneax has joined #zuul | 02:50 | |
*** saneax has quit IRC | 02:52 | |
*** saneax has joined #zuul | 02:53 | |
*** rlandy|bbl is now known as rlandy | 03:12 | |
*** rlandy has quit IRC | 03:19 | |
*** sdake has joined #zuul | 03:39 | |
*** sdake has quit IRC | 03:59 | |
*** manjeets has quit IRC | 04:41 | |
*** saneax has quit IRC | 04:42 | |
*** bjackman has joined #zuul | 04:55 | |
*** shanemcd has quit IRC | 05:03 | |
*** shanemcd has joined #zuul | 05:04 | |
*** greg-g has quit IRC | 05:13 | |
*** chandankumar is now known as chkumar|ruck | 05:47 | |
openstackgerrit | Merged openstack-infra/zuul master: Update SQLAlchemy url in docs https://review.openstack.org/635148 | 05:59 |
*** sshnaidm is now known as sshnaidm|afk | 06:39 | |
*** saneax has joined #zuul | 06:45 | |
openstackgerrit | Felix Schmidt proposed openstack-infra/zuul master: Retrieve full list of jobs with details per tenant via API https://review.openstack.org/635714 | 06:56 |
*** quiquell|off is now known as quiquell|rover | 07:02 | |
*** snapiri has joined #zuul | 07:24 | |
*** quiquell|rover is now known as quique|rover|brb | 07:53 | |
*** themroc has joined #zuul | 08:10 | |
*** quique|rover|brb is now known as quiquell|rover | 08:12 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Escape jinja2 template from commit message https://review.openstack.org/633930 | 08:21 |
*** gtema has joined #zuul | 08:24 | |
*** jpena|off is now known as jpena | 08:53 | |
*** panda|off is now known as panda | 09:04 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Escape jinja2 template from commit message https://review.openstack.org/633930 | 09:43 |
quiquell|rover | tristanC: different approach ^ | 09:45 |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory https://review.openstack.org/633930 | 09:45 |
*** sshnaidm|afk is now known as sshnaidm | 09:52 | |
SpamapS | tristanC: heh, your python, on first test, comes out about 10x slower than my rust. Trying the C++ now. ;) | 10:22 |
SpamapS | The C++ is still faster. :-P | 10:24 |
SpamapS | oh interesting, with more URLs, the rust gets faster | 10:27 |
SpamapS | suggesting that the C++ LRU cache maybe isn't so awesome. | 10:28 |
ttx | Hey all, I'm trying to get to the bottom of delays in the GitHub triggers... it seems to take anywhere between 30 sec and 10 min for a GitHub event to enter Zuul pipelines. I was wondering whether that comes from the processing on the Zuul side, or some form of webhook batching on the GitHub side. How much of that is expected behavior vs. bug? | 10:32 |
tristanC | SpamapS: how are you testing? | 10:33 |
SpamapS | tristanC: I spun up an nginx with 10000 pretend builds | 10:34 |
SpamapS | tristanC: and fed them all to each program | 10:35 |
SpamapS | ttx: GitHub triggers have been known to happen asynchronously, but you can see them on the app's detail page if you have access.. and then compare with the zuul scheduler logs to see who was slower. | 10:35 |
SpamapS | tristanC: I'm pretty sure the problem with your python is that it doesn't use a persistent client | 10:36 |
ttx | SpamapS: ok, I don't have access to that, but i know a few people who have | 10:36 |
tristanC | SpamapS: I mean, how did you measure the performance? | 10:36 |
SpamapS | tristanC: oh, just /usr/bin/time | 10:37 |
SpamapS | tristanC: I'm about to push what I did into the git repo I made, so you can repeat if you like. :) | 10:39 |
tristanC | SpamapS: sure, i'm curious how this would affect overall user experience. Initial build lookup only happens once per visit, then the rewrite should be served by the cache | 10:40 |
SpamapS | tristanC: yeah I would like to try some different schemes, like repeat builds | 10:41 |
SpamapS | But I'd say log URLs are probably not going to have a fantastic cache hit rate. | 10:42 |
tobiash | ttx: normally this is a matter of less than 30s | 10:42 |
SpamapS | tristanC: I tend to agree with you that this probably isn't a big deal though. | 10:43 |
ttx | tobiash: I'm testing right now and I'm at 17min so far | 10:43 |
tobiash | ttx: is this the time until a change shows up in the pipeline without or with jobs? | 10:43 |
ttx | change shows up in pipeline | 10:44 |
tobiash | Job freezing can take a while because that requires a merger call | 10:44 |
tobiash | ttx: maybe you're running into api request limit? | 10:44 |
tobiash | ttx: do you use an api token or github app? | 10:45 |
ttx | tobiash: maybe, but the activity is not that significant. GitHub app | 10:45 |
tobiash | Hrm | 10:45 |
tobiash | Then you should probably first check webhook history in gh and then zuul web logs | 10:46 |
SpamapS | tristanC: using a persistent session object improved by about 15%. Still 7x slower than the C++/Rust. | 10:46 |
ttx | tobiash: if you confirm that is a bit unexpected behavior, I'll try to dig up more data points | 10:46 |
tobiash | Just to verify that there is everything ok | 10:46 |
tobiash | The webhooks should arrive within a couple of seconds | 10:47 |
SpamapS | and uses 20MB more memory than the C++/Rust, which only use about 8MB... but honestly, this is just late-night insomnia tinkering on my part... not sure how performance sensitive this actually is. | 10:47 |
tristanC | SpamapS: well it depends what is the overhead relative to the total latency of the browser going through that service | 10:49 |
SpamapS | tristanC: the python responds to cache misses in 1ms. ;) | 10:54 |
tristanC | SpamapS: e.g., a request to logs.openstack.org takes about 300ms, thus if the difference between the implementations is less than 10ms, then it shouldn't matter much, right | 10:54 |
SpamapS | Yeah, 1ms vs. 100ns. :-P | 10:55 |
SpamapS | I mean it's nice if things are fast. :) | 10:55 |
SpamapS | and honestly, the C++ and Rust are pretty darn readable. | 10:55 |
SpamapS | But, there's also about 4x the code. :) | 10:56 |
SpamapS | so, yeah, in terms of overall cost to the project, while I wish we could rewrite all of Zuul in Rust (please god not C++), this isn't a part that I think needs to be in !python | 10:57 |
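A rough, self-contained sketch of the two optimisations discussed above — a persistent requests.Session plus an LRU cache in front of the build lookup. The endpoint and parameter names are invented for illustration, not the actual service's API:

```python
import functools
import requests

session = requests.Session()  # reuse one HTTP connection across lookups

@functools.lru_cache(maxsize=4096)
def lookup_build(log_url):
    # Hypothetical lookup endpoint and query parameter; a cache hit on a
    # repeated log URL skips the API round trip entirely.
    resp = session.get('https://zuul.example.com/api/builds',
                       params={'log_url': log_url}, timeout=10)
    resp.raise_for_status()
    builds = resp.json()
    return builds[0]['uuid'] if builds else None
```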
* SpamapS should probably try that sleep thing again | 11:02 | |
*** iurygregory has joined #zuul | 11:34 | |
iurygregory | Hi everyone, while moving jobs to Zuul v3 in ironic we noticed that ironic-tempest-plugin is not running all the jobs we have set in project.yaml https://github.com/openstack/ironic-tempest-plugin/blob/master/zuul.d/project.yaml. I checked that the parent jobs really exist, and they do. Anyone have ideas what it could be? Some jobs that aren't running are defined in stable-jobs.yaml https://github.com/openstack/ironic-tempest-plugin/blob/master/zuul.d/stable-jobs.yaml | 11:42 |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: [WIP] Zuul CLI: allow access via REST https://review.openstack.org/636315 | 11:50 |
*** jpena is now known as jpena|lunch | 12:26 | |
*** abregman has joined #zuul | 12:32 | |
*** panda is now known as panda|lunch | 12:32 | |
abregman | hey. does zuul have an API? | 12:34 |
pabelanger | abregman: yes, zuul-web is the service that allows that | 12:46 |
abregman | pabelanger: how exactly do I access it? for example if my address is https://my-zuul.com | 12:47 |
jkt | abregman: there's this pending work, https://review.openstack.org/#/c/535541/ | 12:50 |
jkt | abregman: are you looking for Gerrit integration by any chance? | 12:50 |
abregman | jkt: thanks, I'll look into it. what do you mean by Gerrit integration? | 12:51 |
pabelanger | abregman: there are some docs around deployment options: https://zuul-ci.org/docs/zuul/admin/installation.html?highlight=dashboard#web-deployment-options but it should be https://my-zuul.com/api | 12:51 |
abregman | basically I'm just looking for a way to pull some information using some scripts I want to write | 12:51 |
abregman | pabelanger: thanks, I'll go over it. using /api just redirects me to /tenants | 12:52 |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory https://review.openstack.org/633930 | 12:53 |
jkt | abregman: ok; I just wanted to see if you were looking for https://gerrit.googlesource.com/plugins/zuul-status/ | 12:54 |
pabelanger | abregman: yup, depends on how your dashboard is setup, here is a list of entrypoints: http://git.zuul-ci.org/cgit/zuul/tree/zuul/web/__init__.py#n661 | 12:55 |
pabelanger | you can also get data from statsd, if setup too | 12:55 |
abregman | pabelanger: awesome! exactly what I was looking for. thanks :) | 12:57 |
abregman | pabelanger++ | 12:57 |
*** bjackman has quit IRC | 12:59 | |
dmsimard | pabelanger: http://logs.openstack.org/41/535541/9/check/zuul-build-dashboard/b99f517/npm/html click API at the top right (from https://review.openstack.org/#/c/535541/ ) | 13:00 |
jkt | wow! | 13:05 |
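A minimal sketch of pulling data out of zuul-web's REST API as discussed above — the base URL is abregman's example address, and the /api/tenants and per-tenant /builds endpoints are assumed from the entry points pabelanger linked:

```python
import requests

ZUUL = 'https://my-zuul.com'

# List tenants, then pull the most recent builds for each one.
tenants = requests.get(ZUUL + '/api/tenants', timeout=30).json()
for tenant in tenants:
    name = tenant['name']
    builds = requests.get('%s/api/tenant/%s/builds' % (ZUUL, name),
                          params={'limit': 10}, timeout=30).json()
    for build in builds:
        print(name, build.get('job_name'), build.get('result'))
```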
*** chkumar|ruck is now known as chkumar|pto | 13:10 | |
*** panda|lunch is now known as panda | 13:13 | |
*** quiquell|rover is now known as quique|rover|eat | 13:17 | |
*** rlandy has joined #zuul | 13:17 | |
*** jpena|lunch is now known as jpena | 13:30 | |
*** sdake has joined #zuul | 13:32 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory https://review.openstack.org/633930 | 13:37 |
*** openstackgerrit has quit IRC | 13:37 | |
*** iurygregory has left #zuul | 13:38 | |
*** quique|rover|eat is now known as quiquell|rover | 13:39 | |
*** abregman has quit IRC | 14:01 | |
*** sdake has quit IRC | 14:03 | |
*** openstackgerrit has joined #zuul | 14:15 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: prevent status update loop in background https://review.openstack.org/636343 | 14:15 |
pabelanger | mordred: with the openstack client and openstack server show, does the properties field map to metadata in openstacksdk? | 14:18 |
pabelanger | mordred: eg: line 25 @ https://softwarefactory-project.io/paste/show/1425/ | 14:19 |
mordred | pabelanger: yes, I think so? | 14:19 |
pabelanger | ack | 14:20 |
pabelanger | trying to figure out why nodepool might be leaking nodes when an exception is raised while creating a node | 14:21 |
pabelanger | https://softwarefactory-project.io/paste/show/1424/ from nodepool, note line 5 | 14:22 |
pabelanger | but https://softwarefactory-project.io/paste/show/1425/ shows it actually exists | 14:22 |
pabelanger | but, then trying to figure out why cleanup handler isn't seeing it, and trying to delete it | 14:22 |
*** sshnaidm has quit IRC | 14:30 | |
*** sshnaidm has joined #zuul | 14:30 | |
openstackgerrit | Felix Schmidt proposed openstack-infra/zuul master: Retrieve full list of jobs with details per tenant via API https://review.openstack.org/635714 | 14:30 |
*** sdake_ has joined #zuul | 14:37 | |
*** ParsectiX has joined #zuul | 14:44 | |
*** swest has quit IRC | 14:47 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: [WIP] Zuul CLI: allow access via REST https://review.openstack.org/636315 | 14:47 |
ParsectiX | Hi guys. I'm trying to add an ldap group as a member of another group via the CLI and it fails. When I do the same operation from the WebUI it works | 14:49 |
ParsectiX | ssh -p 2244 localhost gerrit set-members Administrators --include ldap/admins | 14:50 |
ParsectiX | fatal: Group Not Found: ldap%3Acn%3Dadmins%2Cou%3Dgroups%2Cdc%3Dexample%2Cdc%3Dio | 14:50 |
ParsectiX | Can you please advise how to debug this? | 14:50 |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: [WIP] Zuul CLI: allow access via REST https://review.openstack.org/636315 | 14:54 |
*** snapiri has quit IRC | 14:56 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Mark as unsafe commit message at inventory https://review.openstack.org/633930 | 14:58 |
*** Parsecti_ has joined #zuul | 15:05 | |
*** ParsectiX has quit IRC | 15:07 | |
*** ianychoi has quit IRC | 15:07 | |
*** ianychoi has joined #zuul | 15:07 | |
fungi | Parsecti_: you may want to try asking in #gerrit | 15:21 |
*** Parsecti_ has quit IRC | 15:21 | |
pabelanger | is anybody able to see why https://softwarefactory-project.io/paste/show/1425/ isn't hitting line 493 in http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py?h=3.4.0#n458 | 15:23 |
pabelanger | if my zk data is correct, the nodepool_node_id doesn't exist | 15:24 |
pabelanger | but, for some reason, nodepool isn't trying to delete the node | 15:24 |
quiquell|rover | pabelanger, tobiash, tristanC: The little fix for jinja2 in the gerrit commit message is all good now https://review.openstack.org/#/c/633930/ | 15:26 |
quiquell|rover | Used !unsafe from Ansible YAML to skip templating it | 15:26 |
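A rough illustration of the idea behind https://review.openstack.org/633930 as described above — emit the commit message with Ansible's !unsafe YAML tag so Jinja2 in it is never templated. The UnsafeText class and dumper wiring here are made up for the example; this is not the actual patch:

```python
import yaml

class UnsafeText(str):
    """String that should be written out with the !unsafe YAML tag."""

# Teach the dumper to emit UnsafeText values with the !unsafe tag.
yaml.SafeDumper.add_representer(
    UnsafeText,
    lambda dumper, value: dumper.represent_scalar('!unsafe', str(value)))

inventory_vars = {'zuul': {'message': UnsafeText('Do not expand {{ this }}')}}
print(yaml.dump(inventory_vars, Dumper=yaml.SafeDumper, default_flow_style=False))
# prints roughly:
#   zuul:
#     message: !unsafe 'Do not expand {{ this }}'
```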
*** gtema has quit IRC | 15:32 | |
*** gtema has joined #zuul | 15:35 | |
clarkb | pabelanger: see line 462 | 15:38 |
clarkb | pabelanger: that method is specifically for cleaning up servers not in zk, that server appears to be in zk | 15:39 |
pabelanger | clarkb: based on the info i am getting from fbo / jpena we don't see the server in zk | 15:39 |
pabelanger | https://softwarefactory-project.io/paste/show/1424/ | 15:40 |
clarkb | pabelanger: can you double check that? 0001603003 would be the identifier in zk | 15:40 |
pabelanger | that is also a log from nodepool | 15:40 |
clarkb | oh that logs says the cloud didn't have the node. Seems like a bug in your cloud then | 15:41 |
*** saneax has quit IRC | 15:41 | |
pabelanger | no, server create failed with 500 no more resources | 15:41 |
pabelanger | but, couldn't delete because in error state I think | 15:42 |
pabelanger | however | 15:42 |
pabelanger | I can't see why cleanup worker isn't trying to delete again | 15:42 |
pabelanger | interesting | 15:43 |
pabelanger | zk id=0001603336 but nodepool_node_id='0001603003' | 15:43 |
pabelanger | should they be the same? | 15:43 |
*** ParsectiX has joined #zuul | 15:44 | |
clarkb | pabelanger: nodepool will assign a new id to nodes it didn't know about in zk | 15:44 |
clarkb | that log could be from the cleanup method? | 15:45 |
pabelanger | k, I think I see the issue | 15:46 |
pabelanger | with help from fbo | 15:46 |
pabelanger | /nodepool/nodes/0001603003/lock/780341bd82e9489c991ac7e972d2b5d0__lock__0000000034 | 15:46 |
pabelanger | so, data does exist in zk | 15:47 |
pabelanger | which is instance https://softwarefactory-project.io/paste/show/1425/ | 15:47 |
pabelanger | however | 15:47 |
pabelanger | https://softwarefactory-project.io/paste/show/1424/ is the error when nodepool tried to create server | 15:47 |
pabelanger | but, we've deleted the wrong node id for some reason | 15:47 |
pabelanger | leading to data being leaked in zk | 15:48 |
clarkb | I'm not sure its the wrong id | 15:48 |
clarkb | the way nodepool accounts for this stuff, it creates a new node to track leaked things | 15:48 |
clarkb | so the id changes, but that is expected iirc | 15:48 |
clarkb | it is possible that it doesn't cleanup details for the original id properly though | 15:50 |
pabelanger | okay, but in that case, I can't see how we'd actually delete the data in zk, based on http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py?h=3.4.0#n492 | 15:50 |
clarkb | pabelanger: ah ya that might be buggy then | 15:50 |
clarkb | though what it is checking there is "if I don't have an id for this id in zk then make a new node with a new id to track it" | 15:51 |
clarkb | which should work | 15:51 |
clarkb | it is giving it a dummy node in zk so that it can track that it needs to delete the leaked instance | 15:51 |
clarkb | where this might be buggy is if it fails to delete the first time through, it could create a second dummy node and so on | 15:52 |
clarkb | (so we'd leak dummy nodes until the real instance deleted and then all the dummy nodes could be cleaned up?) | 15:52 |
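A simplified, self-contained model of the leaked-instance cleanup logic referenced above (provider.py, line 458 onward) — the data shapes and names are invented for the example, but the key check is the one under discussion: instances whose nodepool_node_id still exists in ZooKeeper are skipped, everything else gets a placeholder record marked for deletion:

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    id: str
    metadata: dict = field(default_factory=dict)

def cleanup_leaked_instances(provider_name, servers, zk_nodes):
    """zk_nodes maps node id -> record; returns placeholder records for
    instances that have no matching ZooKeeper node."""
    placeholders = []
    for server in servers:
        meta = server.metadata
        if meta.get('nodepool_provider_name') != provider_name:
            continue  # not one of this provider's instances
        if meta.get('nodepool_node_id') in zk_nodes:
            # A ZooKeeper record still claims this instance, so it is left
            # alone -- this is the check that skips the leaked ERROR servers
            # above, because their reused node id still exists in ZooKeeper.
            continue
        # Unknown to ZooKeeper: record a dummy node marked for deletion so
        # the regular delete worker reaps the instance.
        placeholders.append({'external_id': server.id,
                             'provider': provider_name,
                             'state': 'deleting'})
    return placeholders

servers = [
    Server('uuid-1', {'nodepool_provider_name': 'rdo',
                      'nodepool_node_id': '0001603003'}),  # still in zk: kept
    Server('uuid-2', {'nodepool_provider_name': 'rdo',
                      'nodepool_node_id': '0000000042'}),  # leaked: deleted
]
print(cleanup_leaked_instances('rdo', servers, {'0001603003': {}}))
```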
fungi | i've observed an odd behavior with a github connection in our zuul deployment. we see trigger events appear in the scheduler's debug log immediately (zuul.GithubGearmanWorker: Github Webhook Received) but then there's a lengthy delay sometimes on the order of an hour before it gets processed (zuul.GithubConnection: Scheduling event from github, zuul.Scheduler: Processing trigger event <GithubTriggerEvent ...>). | 15:54 |
fungi | any idea what could cause that delay? when it happens i don't see any events hanging out in the events queue according to the status api, and there's nothing identifiable in the debug logs i can interpret as an explanation for the wait either | 15:54 |
pabelanger | I'm just looking at code now, I know we changed this a little recently | 15:54 |
clarkb | pabelanger: http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py?h=3.4.0#n247 I think that is the area that changed semi recently | 15:57 |
clarkb | http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py?h=3.4.0#n255 is where we create the new node to stand in | 15:57 |
pabelanger | ah, right | 15:58 |
pabelanger | clarkb: so, I don't think we are updating the actual instance properly with that new zk ID | 15:59 |
pabelanger | which leads to the leak | 15:59 |
clarkb | pabelanger: I agree we don't update the existing instance with the new zk id. I don't know that that causes a leak | 16:00 |
pabelanger | let me get some logs on new zk id | 16:00 |
clarkb | why do you think that is causing the leak? nodepool should delete the error'd instance under the new id and it will go away from the cloud | 16:00 |
clarkb | it is possible we leak the old node in zk, but that shouldn't affect the leak cloud side | 16:01 |
clarkb | fungi: what happens after the webhook is received by the scheduler? I wonder if there is anything obvious that could block | 16:02 |
pabelanger | clarkb: I am confirming we actually do a delete on new zk ID | 16:02 |
clarkb | fungi: like maybe construction of the trigger eevnt requires info back from github? | 16:02 |
clarkb | fungi: I think the flow there is github POSTs to zuul-web, zuul-web does a gearman event, zuul scheduler processes gearman event as worker, then I don't know | 16:05 |
fungi | i thought that happened concurrent with the "processing trigger" activity in the scheduler | 16:05 |
clarkb | fungi: I think the event has to be converted to a trigger first | 16:06 |
fungi | but yeah, when my current meeting ends i'll see if i can trace the flow of functions between GithubGearmanWorker and GithubConnection | 16:06 |
clarkb | the event is the raw github activity and triggers are a bit more filtered down to things zuul can take action on | 16:06 |
pabelanger | clarkb: https://softwarefactory-project.io/paste/show/1428/ | 16:07 |
pabelanger | we seem to be creating multiple servers using the same zk id | 16:07 |
pabelanger | but you can see bae8dfbb-d165-47db-9e08-fc8111ba36b8 there | 16:07 |
pabelanger | eventually the zk id comes online | 16:07 |
pabelanger | but the error instance, also has that nodepool_node_id | 16:08 |
pabelanger | https://softwarefactory-project.io/paste/show/1429/ looks to be the delete | 16:09 |
pabelanger | but, didn't delete cloud side | 16:09 |
clarkb | ya so it's likely failing to delete in the cloud? and nodepool trying over and over again? | 16:09 |
pabelanger | no, the delete only happens once | 16:13 |
pabelanger | I'd like it to keep trying to delete over and over | 16:13 |
pabelanger | but, cleanup worker doesn't because nodepool_node_id is in zk | 16:13 |
pabelanger | being used to boot the next server | 16:13 |
pabelanger | so, I am unsure how to proceed in this case, if we don't properly delete the node, and a new node is booted with same zk id | 16:14 |
pabelanger | tobiash: might be of interest to you, since you helped here in the past | 16:16 |
pabelanger | ^ | 16:16 |
clarkb | pabelanger: so the original node_id is being reused? | 16:16 |
clarkb | pabelanger: can you show the logs of that? | 16:16 |
*** quiquell|rover is now known as quiquell|off | 16:17 | |
pabelanger | clarkb: https://softwarefactory-project.io/paste/show/1428/ | 16:17 |
pabelanger | 0001603003 is the id in question | 16:18 |
pabelanger | each of those attempts, fail with 500 (lack of resources) | 16:18 |
pabelanger | https://softwarefactory-project.io/paste/show/1425/ is actually bae8dfbb-d165-47db-9e08-fc8111ba36b8 attempt | 16:19 |
clarkb | right ok so the problem is the logical request's id doesn't change but we've already written properties to the host that we check against for cleanup | 16:19 |
clarkb | so cleanup won't happen until the reused original request id is deleted then at that point we can cleanup the original failures. That does indeed seem to be a bug | 16:19 |
pabelanger | yah, I am just trying to confirm we have more than 1 leaked (error) node with the same nodepool_node_id | 16:20 |
*** manjeets has joined #zuul | 16:27 | |
*** themroc has quit IRC | 16:28 | |
pabelanger | clarkb: confirmed, found multiple error'd instances with same nodepool_node_id: https://softwarefactory-project.io/paste/show/1430/ 0001615360 | 16:47 |
corvus | pabelanger, clarkb: especially since Shrews is out, can you make sure to collect all of this in a storyboard bug report? | 16:49 |
pabelanger | corvus: yup! | 16:50 |
clarkb | pabelanger: ya the final thing we'll want to confirm is if things do cleanup once the successfully booted node with that id is deleted | 16:56 |
clarkb | that should remove the zk record for it, allowing everything else to be properly processed as a leaked node (mostly this is just to confirm observed behavior vs expected behavior) | 16:56 |
pabelanger | clarkb: agree | 16:57 |
*** ParsectiX has quit IRC | 16:58 | |
*** sdake_ has quit IRC | 17:01 | |
*** sdake has joined #zuul | 17:02 | |
fungi | just to follow up, current theory on our github trigger processing delay is growing backlog in GithubConnection.event_queue so looking to stuff a statsd report call somewhere for that | 17:06 |
fungi | (we have some very, very active gh projects associated with our connection) | 17:06 |
*** gtema has quit IRC | 17:06 | |
fungi | presumably event_queue.qsize() is what we should report on? | 17:07 |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: [WIP] URLTrigger driver time based - artifact change jobs triggering driver https://review.openstack.org/635567 | 17:15 |
clarkb | fungi: I think so | 17:16 |
fungi | cool, i'm just learning me up on how to add a new stat now | 17:17 |
fungi | or more like reverse-engineering from the existing uses | 17:17 |
AJaeger | zuul-jobs maintainer, could you review https://review.openstack.org/#/c/635941/ for the updated puppet-forge upload, please? | 17:27 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool master: Properly handle TaskManagerStopped exception https://review.openstack.org/636393 | 17:29 |
pabelanger | clarkb: mordred: corvus: do you mind confirming ^ is the correct approach for handling when task managers die? In the case of sf.io (rdo-cloud) we end up spamming logs a fair bit when it happens. | 17:30 |
pabelanger | nhicher: fbo: ^ | 17:30 |
clarkb | pabelanger: do you know why the task manager dies? | 17:33 |
clarkb | we should be careful we aren't papering over another bug | 17:33 |
*** tobiash has quit IRC | 17:34 | |
*** pwhalen has quit IRC | 17:34 | |
pabelanger | clarkb: yah, haven't figured out why it dies, I don't think we have proper logging enabled. But I do see the new manager being created | 17:35 |
*** tobiash has joined #zuul | 17:35 | |
pabelanger | just the previous requests are using original, and failing | 17:35 |
pabelanger | let me get logs | 17:35 |
pabelanger | clarkb: http://paste.openstack.org/show/744958/ | 17:37 |
pabelanger | is the best we have right now | 17:37 |
pabelanger | line 23 is when new TaskManager is created | 17:38 |
pabelanger | and believe line 30 is still using the original, that just died | 17:38 |
clarkb | new provider managers are created when your config is changed iirc | 17:38 |
pabelanger | Oh, hmm | 17:39 |
clarkb | if the exception is due to ^ then your fix is probably appropriate | 17:39 |
clarkb | since we can safely short circuit in that case | 17:39 |
pabelanger | I should be able to confirm if that was the case | 17:39 |
*** spsurya has quit IRC | 17:40 | |
pabelanger | clarkb: all managers would be created, right, not just a specific one? | 17:41 |
clarkb | pabelanger: would be specific to the providers that had their config change | 17:41 |
pabelanger | k | 17:42 |
clarkb | there is an if statement that checks the old config against the new config and if they differ creates a new manager under the new config iirc | 17:42 |
pabelanger | ack | 17:42 |
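A self-contained sketch of the reconfiguration pattern clarkb describes — only providers whose config actually changed get a new manager, and stopping the old one is when in-flight requests can hit a "task manager stopped" style error. The class and attribute names are illustrative, not Nodepool's real ones:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderConfig:
    name: str
    cloud: str
    max_servers: int

class ProviderManager:
    def __init__(self, config):
        self.config = config
        self.running = True

    def stop(self):
        self.running = False

def reconfigure(managers, old_providers, new_providers):
    for name, new_cfg in new_providers.items():
        if old_providers.get(name) == new_cfg and name in managers:
            continue  # unchanged: keep the existing manager
        if name in managers:
            managers[name].stop()  # in-flight tasks on the old manager fail
        managers[name] = ProviderManager(new_cfg)
    for name in set(managers) - set(new_providers):
        managers.pop(name).stop()  # provider removed from the config
    return managers

old = {'rdo': ProviderConfig('rdo', 'rdo-cloud', 10)}
new = {'rdo': ProviderConfig('rdo', 'rdo-cloud', 20)}  # max-servers changed
managers = reconfigure({}, {}, old)
managers = reconfigure(managers, old, new)  # 'rdo' gets a fresh manager
```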
*** pwhalen has joined #zuul | 17:44 | |
*** sdake has quit IRC | 17:44 | |
*** sdake has joined #zuul | 17:46 | |
fungi | grr, having a hard time working out where to plumb in the statsd configuration/options in the github driver. i'm guessing i could reference it from the scheduler object? presumably by the time GithubConnection.addEvent() and eventDone() are called, self.sched is not None so i could just reference self.sched.statsd directly there? | 17:50 |
fungi | or is there a better way? | 17:51 |
clarkb | fungi: zuul.cmd.scheduler calls registerConnections which registers the scheduler with the connections. That should happen quite early. | 17:54 |
clarkb | You probably still want to guard against self.sched being None but otherwise should be safe to do what you describe | 17:54 |
clarkb | fungi: and with the guard we just won't send any of that data until the scheduler is registered, which should be fine | 17:56 |
tobiash | fungi: yes, you can just use self.sched (and guard against none) | 18:02 |
fungi | yep, i already had the conditional in there. just running tox against it now before i push it up | 18:02 |
fungi | should this come with a regression test too? | 18:02 |
fungi | apparently i now go down the rabbit hole of making sure i have the necessary libraries to build the c extensions for some of zuul's dependencies | 18:05 |
clarkb | fungi: I don't know how much testing of the statsd events we do. I don't think its much (but that may have changed) | 18:05 |
clarkb | fungi: bindep should capture them last I tried | 18:05 |
clarkb | re2 being the big "recent" one | 18:05 |
fungi | oh, neat, i also need a mysql server? | 18:06 |
tobiash | fungi: yes, for some tests | 18:06 |
*** panda is now known as panda|off | 18:06 | |
fungi | and a postgresql server and a zookeeper server | 18:06 |
fungi | will it skip those tests if i don't have those servers installed and configured? | 18:07 |
tobiash | fungi: there is the tools/test-setup.sh that configures all of that stuff | 18:07 |
fungi | guess i'm about to find out | 18:07 |
tobiash | fungi: I guess they will fail | 18:07 |
tobiash | and zookeeper is needed for all tests ;) | 18:07 |
tobiash | if there is interest, I can push up a docker-compose based test setup for this stuff | 18:08 |
clarkb | fungi: ya it will fail the db tests. I usually just run with a zookeeper and ignore the db tests unless I'm looking at code in that region | 18:08 |
tobiash | that might be useful for folks to just fire up correctly configured mysql,postgres and zookeeper on tmpfs in docker | 18:08 |
fungi | looks like i'll be running pep8 locally and then pushing this up to run unit tests under ci instead, since i have a backlog of other stuff to get to | 18:09 |
pabelanger | clarkb: looks like you are right, nodepool.yaml was modified | 18:09 |
clarkb | tobiash: or even document a regex to skip the mysql and postgres tests | 18:09 |
clarkb | I always have to figure out that regex again (shame on me for not writing it down) | 18:09 |
tobiash | I never used such a regex | 18:10 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/zuul master: Track GitHub connection event queue length stats https://review.openstack.org/636404 | 18:10 |
fungi | there we go ^ | 18:10 |
tobiash | fungi: thanks, wanna have this ^ :) | 18:11 |
fungi | statsd key name there was chosen arbitrarily | 18:11 |
fungi | please recommend improvements! ;) | 18:11 |
clarkb | fungi: you probably want to do github.$connectionname.event_queue | 18:12 |
clarkb | since you can potentially have multiple github connections | 18:12 |
fungi | aha, thanks | 18:12 |
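A hedged sketch of the kind of change being discussed — report the GithubConnection event queue depth as a gauge keyed per connection, guarded so nothing is sent before the scheduler (and its optional statsd client) is registered. The class and key name here mirror the conversation, not necessarily the final patch:

```python
import queue

class GithubConnectionSketch:
    """Minimal stand-in for the real GithubConnection, just enough to show
    the guarded statsd call."""

    def __init__(self, connection_name):
        self.connection_name = connection_name
        self.event_queue = queue.Queue()
        self.sched = None  # set later via registerConnections()

    def _report_queue_depth(self):
        # The scheduler may not be registered yet, and statsd is optional,
        # so silently skip until both are available.
        statsd = getattr(self.sched, 'statsd', None) if self.sched else None
        if statsd is None:
            return
        statsd.gauge('github.%s.event_queue' % self.connection_name,
                     self.event_queue.qsize())
```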
fungi | on a related note, i see where it reports the github api rate limit count remainder and reset epoch timestamp but we seem to be tracking several different rate limits there and the logs don't do anything to differentiate them. any idea what the distinction between them might be? | 18:14 |
clarkb | jlk: ^ | 18:15 |
corvus | fungi: it might be per-app-install? | 18:15 |
fungi | oh, do we have several of those on the openstack/opendev zuul deployment? | 18:16 |
fungi | if so, that would make sense | 18:16 |
fungi | are those treated as different github "connections"? | 18:16 |
jlk | There is a per-app rate, and anonymous rate, and non-app API rate | 18:16 |
jlk | I believe the rate is per installation of your app | 18:16 |
corvus | fungi: we have only one connection (which maps to one github app), but that connection has many installations of the app. | 18:17 |
jlk | so if you have one Zuul app, and 30 places have installed it (that would be 30 connections in Zuul config), you'll get a rate per installation | 18:17 |
clarkb | re event queue lengths: the openstack zuul takes a significant amount of time to process events from github. reading logs with fungi it seems that we process events serially in a fifo and those events individually may take some seconds to fully process. Because this is serialized, over time we accumulate a backlog. The change above should help track the size of the queue to see if it does indeed grow a backlog (we | 18:17 |
fungi | ahh, so basically per project (org?) using the app? | 18:17 |
clarkb | just don't have enough logging/data to know for sure currently) | 18:17 |
jlk | oh wait, I think corvus just contradicted me, and he would know better. | 18:17 |
tobiash | corvus is right ;) | 18:17 |
jlk | I forgot how each installation mapped to Zuul config | 18:18 |
tobiash | jlk: each connection can have many installations | 18:18 |
fungi | wondering what additional detail we should include in the debug log (and perhaps in an associated statsd key name pattern) for tracking api rate limits | 18:18 |
tobiash | otherwise we would not be able to handle that | 18:18 |
jlk | https://developer.github.com/apps/building-github-apps/understanding-rate-limits-for-github-apps/ | 18:19 |
corvus | fungi: there's an installation number, and we know what repo/org it is associated with. maybe that? | 18:19 |
corvus | (and we know what installation we're using for any given operation) | 18:19 |
tobiash | fungi: for tracking api rate limits you probably want to have that per installation (if we have the installation id) | 18:19 |
fungi | if the rate limit is per org (or project?) which has installed the app, then i guess we're at least not at risk of one user of the app starving the api rate limit for another? | 18:19 |
corvus | fungi: that is the idea, yes | 18:19 |
jlk | yes. It's part of why app based is so compelling | 18:20 |
fungi | that helps. was trying to wrap my mind around the model there. thanks! | 18:20 |
jlk | you get a rate limit per consumer of your app, rather than the old method of a single rate limit shared by everybody sending you web hooks | 18:20 |
jlk | that said, I think not every call we make to the API is via an app | 18:20 |
clarkb | tobiash: have you seen a pile-up of github driver events leading to slow processing? I'm curious if this is something others have dealt with | 18:21 |
jlk | https://developer.github.com/v3/#rate-limiting | 18:21 |
jlk | the additional rate limiting is ^^ | 18:21 |
jlk | 5K per hour for things we do as our own user, 60 an hour for anonymous calls | 18:21 |
jlk | AAAND | 18:21 |
jlk | https://developer.github.com/v3/search/#rate-limit | 18:21 |
tobiash | clarkb: not that dramatically | 18:21 |
tobiash | but our github enterprise has no rate limit set (yet) | 18:21 |
jlk | when we hit search (for finding DependsOn and such), 30 per minute | 18:22 |
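For reference, a small example of inspecting those budgets via the /rate_limit endpoint from the GitHub API v3 docs jlk links above; the token is a placeholder, and app/installation tokens report their own separate budgets:

```python
import requests

resp = requests.get('https://api.github.com/rate_limit',
                    headers={'Authorization': 'token <YOUR_TOKEN>'},
                    timeout=10)
data = resp.json()
core = data['resources']['core']      # 5000/hour for authenticated calls
search = data['resources']['search']  # 30/minute for the search API
print('core remaining:', core['remaining'], 'resets at', core['reset'])
print('search remaining:', search['remaining'])
print('header view:', resp.headers.get('X-RateLimit-Remaining'))
```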
jlk | clarkb: I haven't looked at the change, but I thought we did work to parallelize GitHub event processing a while back, sending it into a worker queue | 18:22 |
jlk | or is it that CONSUMPTION of the queue is serial? | 18:22 |
fungi | jlk: that's what it seems like, yes (the latter) | 18:23 |
jlk | I think the work I did was to make it so the hook receiver didn't hang until the event was processed. | 18:23 |
jlk | nod | 18:23 |
fungi | we see log entries immediately about the trigger event getting queued, but then a lengthy delay (sometimes hours) before the log entry saying it's being scheduled and processed | 18:23 |
*** jpena is now known as jpena|off | 18:23 | |
jlk | gotcha. | 18:24 |
fungi | so basically hanging out in the connection driver's own event queue (we think) | 18:24 |
jlk | that makes sense given what I know | 18:24 |
jlk | I had /assumed/ that processing the queue would be expedient. I didn't think it would be serial. | 18:25 |
jlk | Whatever thing processes the queue should grow threading/forking capability | 18:25 |
fungi | so i suppose burning down that queue is each trigger multiplied by the time it takes for each query that trigger implies (plus the occasional github api socket timeouts) | 18:26 |
jlk | I wonder if one needs to create buckets per installation, and then serially process each bucket, but allow parallel bucket processing | 18:26 |
jlk | that way each bucket is going to FIFO the events w/ some guarantee, but one noisy client shouldn't slow down every client | 18:27 |
corvus | jlk: i don't know what the problem fungi and clarkb are dealing with is, but we receive a tiny percentage of github events compared to gerrit, and our scheduler has plenty of idle time, so i don't think this is a structural problem. | 18:28 |
jlk | Does the system process gerrit events serially, or are there more than one worker working the queue? | 18:29 |
jlk | also I think in the gerrit case all the info is at hand, whereas in the GitHub case I _think_ we have to go query the API for more info. | 18:29 |
corvus | jlk: serially -- both gerrit and github accept events from the remote system, do whatever querying needs to be done (yes, it's more in github than gerrit, but sometimes in gerrit too), then hand off the fully populated event to the scheduler, which is itself single-threaded. | 18:30 |
jlk | Define 'idle time' for the scheduler. Could it be working just one task and is otherwise idle (and that one task is the serial burn down of the queue), or is it completely idle, no tasks waiting (which would seem counter to what fungi and clarkb are seeing) | 18:30 |
fungi | corvus: current problem observed is that when an action is taken on a github project which should trigger a job, it takes our zuul deployment hours (at peak) to enqueue the builds | 18:31 |
*** ParsectiX has joined #zuul | 18:33 | |
corvus | jlk: the scheduler's event queue (which is where the gerrit/github events are deposited), is often empty. it's possible the github driver is spending too much time performing queries to prepare the events for the scheduler. but those events still need to be delivered in order. it's possible that paralellizing the queries there, and then re-serializing before sending them to the scheduler could be more | 18:33 |
corvus | efficient -- maybe that's what you were suggesting. but i have my doubts as to whether we're seeing a volume where that's necessary. | 18:33 |
fungi | reviewing the scheduler's debug log and correlating to the gh app's "advanced" log entries, the trigger event is seen by the githubgearmanworker immediately but then there are no subsequent log entries for it until the githubconnection schedules it much, much later | 18:33 |
corvus | fungi: have some log entries you can paste? | 18:33 |
fungi | sure, just a sec i'll get one up | 18:34 |
*** panda|off has quit IRC | 18:34 | |
fungi | here are the log entries containing the triggering event's github uuid: http://paste.openstack.org/show/744965/ | 18:35 |
fungi | in that case there was a 3-hour delay between receipt and scheduling | 18:36 |
tobiash | fungi: do you have a stream from zuul.GithubConnection at hand? | 18:36 |
fungi | how long of one would help? | 18:37 |
tobiash | fungi: maybe that could help to identify where the bottleneck is | 18:37 |
tobiash | fungi: maybe starting with that event and a few minutes length | 18:37 |
fungi | we logged 14764 entries for zuul.GithubConnection in the past ~12 hours | 18:37 |
fungi | sure, i can get that | 18:37 |
*** gcerami_ has joined #zuul | 18:37 | |
tobiash | fungi: maybe just the part between the first two lines of your paste | 18:38 |
clarkb | jlk: ya my current theory is each individual event is costly enough that we accumulate a backlog | 18:38 |
clarkb | we have little insight via logging of that though hence fungi's change | 18:38 |
tobiash | do you have numbers about the event inflow? | 18:39 |
clarkb | we can probably wc the event received log messages to get that | 18:39 |
fungi | 4657 log entries for zuul.GithubGearmanWorker and zuul.GithubConnection between 15:32:45,739 and 18:33:32,243 | 18:40 |
fungi | so i don't think i can fit them in a paste | 18:40 |
clarkb | I've got to prep for the infra meeting now though | 18:40 |
fungi | i can certainly do the few minutes following 15:32:45,739 though as a start | 18:40 |
tobiash | fungi: oops, I overlooked the hours part of the timestamp ;) | 18:41 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add intermediate registry push/pull roles https://review.openstack.org/634829 | 18:41 |
fungi | yeah, three hours between trigger and scheduling in that instance, which is... why i'm trying to look into it | 18:41 |
tobiash | so that was queued for three hours? | 18:42 |
tobiash | that is definitely not normal | 18:42 |
fungi | i think it was in the githubconnection object's event_queue for three hours, yes | 18:42 |
fungi | not the zuul scheduler's event queue | 18:42 |
tobiash | yes, pretty sure | 18:42 |
fungi | because during that time the scheduler just about always showed an event queue of 0 for that tenant | 18:43 |
*** sdake has quit IRC | 18:44 | |
mordred | fungi: so is that 4657 log entries 4657 events from github, or just log lines? | 18:44 |
jlk | hrm, the first log entry is when we are in handle_payload | 18:45 |
fungi | debug level or higher log lines for GithubConnection and GithubGearmanWorker | 18:45 |
jlk | which then sends it off to self.connection.addEvent | 18:45 |
clarkb | jlk: ya it gets converted into an internal event there and added to the connection.event_queue | 18:45 |
jlk | yeah | 18:45 |
clarkb | jlk: and then it seems to just sit there for hours. Which is why I think the queue is long and we aren't flushing it as quickly as we get new events | 18:45 |
jlk | Probably need more logging/metrics in _handleEvent() | 18:46 |
fungi | here's a sample of the GithubConnection and GithubGearmanWorker log entries starting with when 717fd7b0-2edb-11e9-89ef-7895d9174016 was received and stretching for roughly 8 minutes http://paste.openstack.org/show/744967 | 18:47 |
jlk | hrm, no real way to tell what event caused line 5 | 18:49 |
mordred | feels like it might be nice to log the webhook id with more things to make it easier to trace the actions of an individual webhook | 18:49 |
jlk | yup | 18:49 |
jlk | I was just thinking the same | 18:49 |
fungi | would filtering out the repo permissions and rate limit entries help? | 18:49 |
jlk | maybe, but I think we lack enough context around log events to truly trace the event | 18:50 |
jlk | It's likely we need to make the webhook ID (headers.get('x-github-delivery')) part of the event object that carries through | 18:51 |
*** sdake has joined #zuul | 18:51 | |
jlk | oh we do | 18:51 |
jlk | we add it in addEvent | 18:51 |
jlk | We just don't log it again until _after_ much of the processing has been done | 18:52 |
jlk | We have the data at https://github.com/openstack-infra/zuul/blob/master/zuul/driver/github/githubconnection.py#L190 | 18:52 |
corvus | we received 278 events from github during the 17:00 hour. that gives us a budget of about 13 seconds per event. | 18:53 |
jlk | We should log/emit a metric at that point before the work begins to process it | 18:53 |
pabelanger | just happen to be looking at openstack-discuss, and looks like no official room for zuul at PTG, is that right? http://lists.openstack.org/pipermail/openstack-discuss/2019-February/002618.html | 18:56 |
corvus | pabelanger: that's right | 18:57 |
pabelanger | ack, thanks! | 18:57 |
*** krasmussen has joined #zuul | 18:58 | |
corvus | jlk: agreed, we know when it's finished, but not when it starts | 18:58 |
jlk | That whole function and the method calls are ripe for metrics | 18:59 |
corvus | 2019-02-12 15:34:22,636 DEBUG zuul.GithubConnection: Scheduling event from github: <GithubTriggerEvent 0x7f6a05daf9e8 pull_request ansible/ansible refs/pull/50854/head reopened github.com/ansible/ansible 50854,b8048fe130722c11a42eb817aae08f628fc3af53 delivery: 5ea967b0-2ecb-11e9-9d2b-4586291aacb1> | 18:59 |
corvus | 2019-02-12 15:37:05,593 DEBUG zuul.GithubConnection: Got PR on project ansible for sha b8048fe130722c11a42eb817aae08f628fc3af53 | 18:59 |
corvus | for instance, what happened in those 3 minutes? | 18:59 |
corvus | the first line is the end of processing of one event, right? the next line is 3 minutes later... was it idle, or was it doing something for b804 that whole time | 19:00 |
clarkb | or perhaps dealing with intermediate events that don't get processed to the point of logging? | 19:01 |
corvus | also, why is it the same sha in both? probably we started a new event for the same PR (happens a lot). but would be good to be certain. | 19:01 |
corvus | fungi: are you working on sprinkling more log events in there? | 19:02 |
jlk | a re-open would be a new event on the same sha | 19:02 |
jlk | but we actually have no idea if the second line is in any way related to the event on the first line | 19:02 |
jlk | the second line could have been from an event BEFORE the first line | 19:02 |
*** tflink_ has joined #zuul | 19:03 | |
jlk | corvus: that 'Got PR on' line is in a couple different functions, and is used by a variety of events | 19:04 |
jlk | it's just saying "I asked the GitHub API for the details on a pull request object" | 19:04 |
*** mugsie has quit IRC | 19:04 | |
*** tflink has quit IRC | 19:04 | |
*** fdegir has quit IRC | 19:04 | |
jlk | well, I take that back | 19:05 |
*** mugsie has joined #zuul | 19:05 | |
jlk | that PARTICULAR line is called via _event_status(), it got a status event on that PR for that hash | 19:05 |
*** fdegir has joined #zuul | 19:05 | |
jlk | which could be "pending" status, or "success" or "failure" | 19:05 |
jlk | well no, we return early in pending. | 19:06 |
jlk | anyway a lot of this logging stuff was written before I really had any idea what we'd need to trace :D | 19:06 |
fungi | corvus: i was going to try to understand enough github to help differentiate the various rate limit loglines. can try to improve other logging as well if i get enough understanding of what it means | 19:09 |
*** tflink_ is now known as tflink | 19:10 | |
jlk | fungi: basically at the start of https://github.com/openstack-infra/zuul/blob/6d6c69f93e9755b3b812c85ffceb1b830bd75d6f/zuul/driver/github/githubconnection.py#L189 we get a 'delivery' item, which is the webhook ID. We need to log reception of it there, to match the line when we added it to the queue. From there we need to log the processing of it. Specifically starting to handle the event at https://github.com/openstack-infra/zuul/blob/6d6c69f93e9755b3b812c85ffceb1b830bd75d6f/zuul/driver/github/githubconnection.py#L219 | 19:12 |
jlk | There's some amount of processing in the method() call that causes outbound requests to GitHub's API. Once method() is done, there are further calls that are made in the "if event:" block. All the logging within this class should be called with the 'delivery' string. | 19:14 |
*** gcerami_ is now known as panda | 19:14 | |
corvus | jlk, fungi: yeah, i think i'll take a stab at reworking that to use a loggingadaptor real quick | 19:17 |
corvus | so we can easily get the string everywhere | 19:17 |
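The general stdlib pattern behind the LoggerAdapter idea — prefix every message with the webhook delivery id so a single grep traces one event end to end. This is a generic sketch, not the exact code in the review above:

```python
import logging

class DeliveryLoggerAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        # Prepend the delivery id carried in the adapter's extra dict.
        return '[delivery: %s] %s' % (self.extra['delivery'], msg), kwargs

logging.basicConfig(level=logging.DEBUG)
base = logging.getLogger('zuul.GithubConnection')
log = DeliveryLoggerAdapter(
    base, {'delivery': '717fd7b0-2edb-11e9-89ef-7895d9174016'})
log.debug('Scheduling event from github')
# -> DEBUG:zuul.GithubConnection:[delivery: 717fd7b0-...] Scheduling event ...
```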
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Add dockerized test setup https://review.openstack.org/636424 | 19:18 |
tobiash | corvus, mordred, fungi: is there interest in this? ^ | 19:18 |
tobiash | I find it quite useful for setting up the test environment on a dev machine | 19:19 |
*** dmsimard has quit IRC | 19:22 | |
fungi | getting familiar with using docker locally is still on my to do list. presumably useful but i don't yet have enough context to know whether i'll find it so | 19:24 |
*** jamesmcarthur has joined #zuul | 19:26 | |
tobiash | fungi: in that case, just fire up that script, wait for completion and then you can run tox | 19:26 |
jamesmcarthur | hi all - not sure who to address this towards, but we looked into some solutions for anonymous analytics for the zuul.org site. | 19:28 |
jamesmcarthur | Our fantastic intern put together some details: https://docs.google.com/document/d/1xli_xW5Fp5xI-FYtEzJV2ijn6cxUfEczk0pVdO69I7w/edit?usp=sharing | 19:28 |
jamesmcarthur | corvus: fungi: if this seems like it would work for zuulci.org, please let me know | 19:28 |
*** jamesmcarthur has quit IRC | 19:30 | |
*** aprice has joined #zuul | 19:30 | |
dkehn | clarkb: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n9, when setting a parent job's abstract attribute, what is the implication of that? | 19:32 |
*** dmsimard has joined #zuul | 19:32 | |
clarkb | dkehn: an abstract job cannot be run. It is there only to be part of the inheritance tree | 19:32 |
clarkb | dkehn: useful if you have some incomplete set of job config that needs to be inherited from and filled in | 19:32 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Improve event logging in githubconnection https://review.openstack.org/636429 | 19:33 |
corvus | tobiash, jlk, fungi: ^ | 19:34 |
* jlk looks | 19:34 | |
corvus | from a unit test run, that looks like: http://paste.openstack.org/show/744972/ | 19:36 |
jlk | I think it's a good start, but I would also sprinkle in logging into the _whatever() methods that are called by method(). Could be a follow up change. | 19:37 |
corvus | jlk: good idea, and i just saw a bug, so new ps incoming | 19:38 |
jlk | neat | 19:38 |
tobiash | corvus: commented | 19:39 |
*** sdake has quit IRC | 19:42 | |
corvus | yeah, that's the bug :) | 19:47 |
*** jamesmcarthur has joined #zuul | 19:59 | |
*** ParsectiX has quit IRC | 20:12 | |
fungi | the delivery uuid is indeed most helpful there | 20:13 |
fungi | that'll make it easy to track with one grep | 20:13 |
tobiash | corvus, jlk: given that the event processing really is too slow for the single threaded event queue, what do you think about doing that multi-threaded while maintaining strictly ordered forwarding to the scheduler loop? | 20:13 |
corvus | tobiash: you are way ahead of where i am in understanding this. | 20:14 |
corvus | i would really like to avoid dramatic redesigns until we have more data | 20:15 |
tobiash | sorry, I'm thinking if it's possible to do the github event queue processing multi threaded and still maintain the correct order in which these events are forwarded into the scheduler run loop | 20:15 |
corvus | tobiash: yes, i believe jlk and i also said as much earlier | 20:15 |
tobiash | corvus: sure, I'm just already thinking about it ;) | 20:16 |
corvus | tobiash: but let's not even think about doing that until we understand the problem | 20:16 |
SpamapS | poppycock, run off into the weeds and write code, I say, data is for the unbold. | 20:17 |
jlk | hehe | 20:17 |
tobiash | looks like I missed parts of the backlog, sorry for the noise ;) | 20:17 |
corvus | there are so many ways that approach could cause us problems too, especially if we don't know what's actually causing the issue | 20:17 |
jlk | I'm with corvus on this though. We're making guesses w/out data. I'd like to see logging output from the changes we're making to really trace through what's happening | 20:17 |
jlk | We could throw more threads at the problem but if the real issue is that our API calls are really slow for some reason we'd want to fix that as well / instead. | 20:17 |
corvus | as soon as i finish this sandwich, i will finish that patch | 20:18 |
clarkb | I should go find lunch too, then return to review the data collecting change | 20:18 |
jlk | lunch sounds good | 20:19 |
tobiash | maybe there is also potential to reduce the number of api requests | 20:20 |
clarkb | jlk: it is that time of day in this part of the world. Did your precipitation switch to rain? we never got any snow here but pdx airport got ~5". Now we have inches of rain | 20:20 |
jlk | yeah we've been getting rained on since last night. | 20:20 |
jlk | We had likely over a foot of accumulated snow in the yard/roofs/etc. | 20:21 |
jlk | I'm very much done w/ shoveling my patio/driveway/sidewalks | 20:21 |
mordred | jlk: a foot of snow is not what your neck of the woods is designed for | 20:21 |
clarkb | jlk: it was crazy how different being a few miles west changed everything here | 20:21 |
clarkb | jlk: on saturday we had a bunch of ice then everything went to melty rain and no snow :( | 20:22 |
jlk | mordred: no, indeed not. That's why the kids have been on snow days for... a while now | 20:22 |
jlk | I'm also very much ready for them to go back. | 20:22 |
jlk | I'm honestly surprised I didn't lose power. Most our power is above ground lines, and all the trees had HEAVY snow on them. I could hear trees breaking all over the neighborhood. None took out our lines though. Seattle and the east side are not as lucky. | 20:23 |
mordred | jlk: I saw photos that looked very inappropriate for seattle | 20:27 |
*** edmondsw has joined #zuul | 20:27 | |
mordred | they looked more like photos of the side of mt. baker | 20:28 |
jlk | Here's a set of photos from yesterday, before ANOTHER 5~ inches fell. (or rather WHILE it was falling) https://www.instagram.com/p/BtwjHNhBlw7/ | 20:28 |
pabelanger | shoveling wet snow stinks, but makes great snowballs | 20:28 |
*** jamesmcarthur has quit IRC | 20:31 | |
SpamapS | Trust the Canadian about snow and ice. | 20:32 |
SpamapS | They know. | 20:32 |
fungi | yeah, i don't want to just assume we know why github event processing is slow on our deployment. i'd rather collect data first and see where it leads us. i was mostly trying to infer enough from the data we have to know where strategic gathering of additional data is likely to pay off in subsequent diagnosis | 20:34 |
fungi | which has resulted in needing to learn a bit more about github than i anticipated, but i suppose i shouldn't be surprised by it | 20:35 |
tobiash | this may not be the github event processing, I just looked at your paste and every event from 'Updating <Change...' until 'Scheduling event from github' is below 8s as far as I can see | 20:35 |
tobiash | I'm stumbling across this: http://paste.openstack.org/show/744974/ | 20:36 |
tobiash | if the queue length is > 0 I'd expect that after 'scheduling event from github' it immediately resumes with the next 'updating <change' | 20:37 |
tobiash | but there is a 6min gap in between | 20:37 |
clarkb | we don't log the updating change until after the requests though right? | 20:39 |
clarkb | so it could be blocking somewhere between? | 20:39 |
tobiash | hrm, I don't see much code in between that could be blocking | 20:42 |
tobiash | just looks like it is blocked here: https://git.zuul-ci.org/cgit/zuul/tree/zuul/driver/github/githubconnection.py#n190 | 20:44 |
tobiash | but I also think we need the logging improvement for a deeper analysis | 20:45 |
clarkb | ya | 20:45 |
corvus | i'm about to push up a new ps | 20:45 |
corvus | make sure i have logging there :) | 20:45 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Improve event logging in githubconnection https://review.openstack.org/636429 | 20:47 |
corvus | tobiash, jlk, fungi, clarkb: ^ sorry it's a more significant change, but it means less passing around of data and loggers. | 20:47 |
corvus | oh cool, the diff on that in gertty at least looks really good | 20:48 |
corvus | gerrit too. cool. should be easy to follow and review then. | 20:48 |
*** jamesmcarthur has joined #zuul | 20:52 | |
*** jamesmcarthur has quit IRC | 20:54 | |
tobiash | corvus: found a bug ^ | 21:00 |
corvus | tobiash: same thing a few lines down as well | 21:01 |
tobiash | yah | 21:01 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Improve event logging in githubconnection https://review.openstack.org/636429 | 21:01 |
corvus | i'm runinng tests.unit.test_github_driver.TestGithubDriver.test_pull_event locally, but that's it | 21:02 |
corvus | so i'm catching the big errors, but that doesn't test everything | 21:02 |
*** fdegir has quit IRC | 21:06 | |
*** mugsie has quit IRC | 21:06 | |
*** mugsie has joined #zuul | 21:06 | |
tobiash | corvus: found some more, do you mind if I push an update? | 21:15 |
corvus | tobiash: go for it | 21:15 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Improve event logging in githubconnection https://review.openstack.org/636429 | 21:15 |
tobiash | ran the complete test suite locally ^ | 21:15 |
corvus | clever | 21:15 |
corvus | tobiash: those all look good, thx | 21:16 |
tobiash | yw :) | 21:16 |
tobiash | corvus: so our +1 add up to +2? | 21:17 |
corvus | tobiash: i think so. :) let's +3 it unless jlk wants in on this...? | 21:18 |
jlk | I'm running a meeting, don't wait on me | 21:18 |
tobiash | k | 21:18 |
tobiash | that logging also should reveal if there are blocking problems on that queue | 21:19 |
corvus | tobiash, clarkb, fungi, jlk: i ran the sigusr debug handler, which means i got two stack traces at random points in time. for the githubconnection event handler thread, they both had the same stack traces: | 21:21 |
corvus | File "/usr/local/lib/python3.5/dist-packages/zuul/driver/github/githubconnection.py", line 1133, in getPullBySha | 21:21 |
corvus | for pr in repo.pull_requests(state='open'): | 21:21 |
corvus | which tells me that, statistically speaking, we probably spend a lot of time in that bit of code. | 21:21 |
corvus | in particular, i can think of one repo that we monitor where "iterate over every open pull request" may not be the best idea. | 21:21 |
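For context, the stdlib pattern behind a "dump all thread stacks on a signal" debug handler like the one used above — Zuul's real handler differs in detail, and the signal choice here is an assumption:

```python
import signal
import sys
import threading
import traceback

def dump_threads(signum, frame):
    # Print the current stack of every thread; a thread that keeps showing
    # up in the same spot (e.g. iterating open pull requests) is where the
    # time is going.
    for thread in threading.enumerate():
        print('Thread: %s' % thread.name)
        stack = sys._current_frames().get(thread.ident)
        if stack is not None:
            traceback.print_stack(stack)

signal.signal(signal.SIGUSR2, dump_threads)
```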
clarkb | we must list all open pull requests too? I wonder if that is expensive for ansible | 21:22 |
clarkb | ya | 21:22 |
tobiash | hrm, that sounds not very optimal | 21:22 |
tobiash | hrm, that code path is used on a status event | 21:23 |
tobiash | we should check if there is really no pr number in the status event that forces us to do that | 21:24 |
tobiash | but I guess that explains what you observe | 21:25 |
mordred | oy. yeah | 21:25 |
clarkb | the unlabel event I have pulled up has a pull_request dict | 21:26 |
*** sdake has joined #zuul | 21:26 | |
tobiash | hrm, the docs tell me that there is really no reference to the pr: https://developer.github.com/v3/activity/events/types/#statusevent | 21:27 |
corvus | can we do a search instead of iterating? | 21:27 |
clarkb | oh I see, it's a commit being updated and we try to reference that back to a PR | 21:28 |
jlk | I'll pay attention here in a moment. | 21:28 |
tobiash | checking the api docs | 21:28 |
tobiash | I think a search could work: https://help.github.com/articles/searching-issues-and-pull-requests/#search-by-commit-sha | 21:30 |
tobiash | together with https://developer.github.com/v3/search/#search-issues-and-pull-requests | 21:30 |
corvus | i'm looking far back in the logs today when things were less busy. 8a98cf82-2e8f-11e9-921a-f68fa09c6c2a was received by the gearman worker at 6:29:25 and was delivered to the scheduler at 6:30:49. so it took 84 seconds. | 21:31 |
jlk | okay trying to get caught up on backscroll | 21:32 |
corvus | (assuming there was no event processor backlog, which i am assuming because there was no activity for the next 4 minutes after the delivery to the scheduler) | 21:32 |
jlk | wait | 21:33 |
SpamapS | oh wow, if you have debug logging on the scheduler, zuul-git makes a Read Refs log entry that is basically "everything that has ever landed in any of these repos" | 21:33 |
jlk | we're not iterating in the way you think | 21:33 |
corvus | jlk: whew | 21:33 |
jlk | At least I don't think. let met get to that line of code | 21:33 |
tobiash | corvus: search by sha works: curl https://api.github.com/search/issues\?q\=6891aacd27e28676bab8ec3e957cde26c1616e1c | 21:33 |
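A hedged sketch of the same search from Python, mirroring the curl above but scoped to one repository and to pull requests only (illustrative, not how zuul currently resolves the PR):

    import requests

    def search_pull_by_sha(org, project, sha):
        # One request against the search endpoint instead of paging
        # through every open PR; note the search API has its own,
        # much lower, rate limit (discussed below).
        resp = requests.get(
            'https://api.github.com/search/issues',
            params={'q': '%s type:pr repo:%s/%s' % (sha, org, project)})
        resp.raise_for_status()
        return [item['number'] for item in resp.json()['items']]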
SpamapS | $ kubectl logs zuul-scheduler-0|grep 'Read refs'|wc | 21:33 |
SpamapS | 8 88944 3200902 | 21:33 |
SpamapS | all of the refs/changes/foo's and every single commit hash | 21:34 |
corvus | SpamapS: i think you can filter out that logging with a logging config file | 21:34 |
SpamapS | corvus: I can, I just wonder if that's a necessary level of logging, even at debug. :) | 21:34 |
jlk | So it's one API call to get all the pull requests | 21:34 |
jlk | then WITHIN zuul code we iterate over what we got as a result in that API call | 21:34 |
corvus | SpamapS: it's a good question. i've certainly used it in the past, but we do not currently have it enabled in openstack-infra | 21:35 |
jlk | it's python loop iteration, not a new API call for each PR | 21:35 |
SpamapS | I find the debug logs to be quite useful. I dump them all into elasticsearch and filter them out usually, but it's nice to have them there. | 21:35 |
jlk | So an API call to get all the open pull requests of a repo vs. an API call to search for a PR w/ a hash is likely to be roughly the same | 21:35 |
tobiash | jlk: well, that probably does pagination | 21:35 |
mordred | jlk: it might be doing pagination calls on the backend in github3.py though | 21:35 |
jlk | tobiash: that's a fair point. | 21:35 |
corvus | SpamapS: i could see removing it from the "default" debug logging config, and next time i need it, i can make a manual logging config | 21:35 |
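A minimal sketch of that kind of manual tweak; the logger name here is an assumption, so check the logger shown on the "Read refs" lines in your own debug log before using it.

    import logging

    # Assumption: the noisy "Read refs ..." entries come from a logger
    # under the zuul merger hierarchy (e.g. "zuul.Merger"); verify the
    # name in your own logs. In a real deployment this would go in the
    # scheduler's logging config file rather than in code.
    logging.getLogger('zuul.Merger').setLevel(logging.INFO)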
mordred | especially since we're talking about the ansible/ansible repo here :) | 21:36 |
tobiash | jlk: so on a busy project you might pull down a lot of data while looping | 21:36 |
jlk | let me dig further | 21:36 |
jlk | Because we could tell it to get more at once | 21:36 |
SpamapS | Is there a trace level? | 21:36 |
corvus | apparently there are 1713 open prs | 21:36 |
corvus | SpamapS: not in python | 21:36 |
SpamapS | shame | 21:36 |
SpamapS | anyway, I think I found a bug in github that is unrelated to what you all are looking at | 21:37 |
jlk | https://github.com/sigmavirus24/github3.py/blob/master/src/github3/repos/repo.py#L2355-L2412 | 21:37 |
SpamapS | but it may be all zuul just not working well with github.... | 21:37 |
corvus | SpamapS: yeah. openstack was so disappointed they made one, but it turned out to be a bad idea because none of the tooling deals with it right, so i don't think we should follow suit. but i'd totally use it if it really existed. | 21:37 |
SpamapS | I'd like to have the "require gate status check" branch protection checked on my repos, so only Zuul can merge things, but the status report comes *after* the merge. | 21:38 |
tobiash | jlk: well, we only want that single matching pull request so I think we really want to change this to a search | 21:38 |
jlk | Search is a different API limit | 21:38 |
jlk | a much shorter one | 21:38 |
tobiash | how much? | 21:39 |
jlk | 30 per minute | 21:39 |
tobiash | hrm | 21:39 |
SpamapS | oh wait, no, the driver actually does set statuses before calling merge | 21:40 |
tobiash | but I guess we can live with that limit given that it takes 2-10s to process one event | 21:40 |
jlk | tobiash: On a busy zuul w/ lots of github clients you could possibly overrun that limit quickly | 21:41 |
tobiash | SpamapS: we've been using that for ages and I've had no complaints about it so far, is this ghe? | 21:41 |
SpamapS | tobiash: no, github.com | 21:41 |
SpamapS | tobiash: I wonder if it's just async | 21:41 |
SpamapS | 'bit of a race condition | 21:41 |
tobiash | jlk: with the rate I saw in the logs zuul is not able to process more than 30 events per minute | 21:41 |
SpamapS | Like, need to confirm the status if you get a 405 | 21:42 |
tobiash | SpamapS: maybe, I haven't connected my zuul to github.com | 21:42 |
SpamapS | I've used the setting with GHE in the past too, at GoDaddy. | 21:42 |
SpamapS | so I was surprised when my change failed to merge. | 21:42 |
* SpamapS tries again | 21:43 | |
tobiash | SpamapS: then async on github.com seems likely | 21:43 |
tobiash | jlk: granted, there are other places in zuul that perform a search too | 21:43 |
jlk | oh I know why this code looks familiar | 21:44 |
jlk | and whY I hate this part of GitHub | 21:44 |
jlk | You can have multiple PRs open with the same head SHA. | 21:44 |
jlk | it's unlikely, but possible. | 21:44 |
corvus | yeah, if it happens in zuul, we just punt. | 21:44 |
jlk | and since there's no way to match the event with the specific PR it came from, we can't just pick the first and continue | 21:45 |
jlk | which is why we dump them all | 21:45 |
jlk | we DO punt if we find more than 1 | 21:45 |
SpamapS | mebbe some day github will adopt Gerrit's model and deprecate "pull" requests. :-P | 21:45 |
* SpamapS can dream | 21:45 | |
jlk | SpamapS: I don't think that would improve this | 21:45 |
jlk | see what corvus just said | 21:45 |
corvus | i think the new logging will tell us with more certainty how long the 'getpullbysha' method takes. i'm not sure it will tell us how many underlying requests we made... | 21:45 |
jlk | corvus: yeah. We might be able to get some logging from github3.py for that | 21:46 |
jlk | anyway... | 21:46 |
jlk | I _think_ this problem set of code could go away if we fully switched over to the checks API | 21:46 |
tobiash | corvus: but the github3.py logging will tell you | 21:46 |
jlk | except, we still want to be able to trigger stuff on events outside of zuul events, so maybe not | 21:46 |
corvus | jlk: hrm, yeah, i guess it probably can't go away until nothing uses status anymore... | 21:47 |
jlk | sadly. | 21:48 |
clarkb | out of curiosity what do the status events give us? | 21:48 |
corvus | however, the fewer status changes, the less this code is hit. so any movement in that direction is an improvement. | 21:48 |
tobiash | so what should we do, performa search or try to make the iteration faster? | 21:48 |
jlk | and Ansible has a LOT of status users. | 21:48 |
tobiash | clarkb: we need this to trigger the gate | 21:48 |
clarkb | most of the trigger stuff is oriented around PR changes not commit changes right? | 21:48 |
jlk | clarkb: they're how we currently set a head of a PR as passing CI or not | 21:48 |
tobiash | in case review and merge label are already there, the status event from the check pipeline will trigger the gate | 21:48 |
jlk | clarkb: and how we learn if another CI system has passed the change or not | 21:49 |
*** sdake has quit IRC | 21:50 | |
jlk | tobiash: We should get some visibility into whether or not this bit of code is a significant slow down, and if so, how much faster a search would be, or if there are ways we can improve the API calls github3.py makes | 21:50 |
* clarkb reviews the new logging change since that seems like a good next step | 21:50 | |
corvus | jlk: ++ clarkb so i think openstack-infra should restart the scheduler with the logging change, and probably also increase the logging level for github3.py so we can see those requests. | 21:51 |
corvus | that latter will be a logging config change on our scheduler... and should be temporary because boy is it going to be big. | 21:51 |
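A sketch of such a temporary change, assuming github3.py emits its request logging under the 'github3' logger hierarchy (worth verifying before relying on it; the same caveat about preferring the logging config file applies):

    import logging

    # Assumption: github3.py logs under the 'github3' logger name.
    # DEBUG here is very verbose, so treat it as a temporary setting.
    logging.getLogger('github3').setLevel(logging.DEBUG)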
clarkb | looks like everyone has already reviewed it :) | 21:51 |
tobiash | corvus: we have 71M of github3.py logging in three days | 21:53 |
jlk | hrm. | 21:54 |
jlk | I'm curious | 21:55 |
tobiash | while having 1G of zuul debug logs in 10 hours... | 21:55 |
jlk | GithubEventProcessor() class uses the GithubEventLogAdapter, which logs the 'delivery' key. | 21:55 |
* corvus stands by to -2 his own patch.... | 21:55 | |
jlk | oh wait, okay. I see we hand it a 'delivery' thing in __init__ | 21:56 |
* corvus stands down | 21:56 | |
jlk | I was worried that run() wouldn't have the 'delivery'. | 21:56 |
jlk | so now we'll have one of these running for EACH event that comes in? | 21:56 |
corvus | yeah. it should always be there. it may be None when we shut down. | 21:56 |
corvus | jlk: yes, but it's not a thread. it's really just a data structure. | 21:57 |
jlk | Yeah, I'm just thinking we could really move things along if "GithubEventProcessor(self, data).run()" was an async fire and forget call | 21:58 |
corvus | (however, if we did decide to have some parallelism in the future, you could hand these things to, say, a pool of workers) | 21:58 |
jlk | GithubEventConnector uses threading, how many threads does it start? | 21:58 |
clarkb | jlk: one aiui | 21:58 |
jlk | like how quickly will it go through its run() loop? | 21:59 |
corvus | certainly O(1). | 21:59 |
corvus | might be 2. | 21:59 |
clarkb | jlk one event at a time | 22:00 |
jlk | I wonder if we could relieve some back pressure by spawning threads for each event | 22:00 |
clarkb | corvus: where does kwargs['extra']['delivery'] make its way into that method? | 22:00 |
jlk | but... threading. Now you have O(N) problems. | 22:00 |
jlk | clarkb: I believe it comes from the 'data' sent in on line 458 | 22:01 |
corvus | jlk: yeah if we did have to parallelize, i'd want to keep a tight lid on the number of threads here | 22:01 |
jlk | We'll just re-write it all in Go, use Goroutines, and everything will be fine, right? | 22:01 |
clarkb | oh I see, ya it's reading it directly from the event data | 22:02 |
openstackgerrit | Merged openstack-infra/zuul master: Improve event logging in githubconnection https://review.openstack.org/636429 | 22:02 |
corvus | (because, right now, we have a running system with a (presumed) backlog. which is better than a non-running system with 2000 stuck threads) | 22:02 |
jlk | if we were to make threads there, we could totally get out of sync too, where some threads would finish faster than others and we'd deliver events out of order. | 22:02 |
clarkb | the reason it isn't already parallelized is that for state correctness we want to update zuul's view of the world serially right? | 22:02 |
clarkb | so ya we could churn through events quicker, but we'd have to carefully lock around updates to the internal state model | 22:02 |
corvus | correct | 22:03 |
jlk | my brain just went all melty | 22:03 |
corvus | just another day in #zuul | 22:03 |
jlk | thinking about creating an ordered list of events in a data store, allowing parallel processing to get the events processed up to the point of adding them into the scheduler queue, and then going back to serial to dump them INTO the queue | 22:03 |
jlk | er into the Scheduler queue | 22:03 |
jlk | I can see it in my mind, but no way I could interpret that into code | 22:04 |
tobiash | jlk: getchange has side effects... | 22:04 |
jlk | yeh I haven't even thought through the side effects | 22:04 |
corvus | jlk: it's so easy in napkin-language | 22:04 |
jlk | none of this was written with thread safety in mind | 22:05 |
corvus | well, that's not quite true :) | 22:05 |
corvus | tobiash: in general, getchange's side effects should be okay | 22:05 |
jlk | ahem. None of what _I_ wrote was written with thread safety in mind... | 22:05 |
clarkb | is the extra kwarg something that python logging inserts? | 22:06 |
clarkb | I'm assuming so if this all works in tests | 22:06 |
tobiash | corvus: I guess you're right | 22:07 |
corvus | we allow a certain amount of mutation of data on the Ref objects. they're generally expected to be as up to date as possible and can change without locking. | 22:07 |
corvus | clarkb: yes, and the adapter is copied from the executor | 22:07 |
corvus | (i'd like to use them in more and more places, they're really cool) | 22:08 |
clarkb | ya it's a neat way to apply a mutation to consistent types of log data | 22:08 |
clarkb | it just has a slightly magic interface with kwargs mutations | 22:08 |
corvus | clarkb: it's also extra metadata on the log entry itself, so it's queryable via journald, etc. | 22:08 |
clarkb | oh thats an extra neat feature | 22:09 |
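For readers unfamiliar with the pattern, a minimal LoggerAdapter that injects a delivery id looks roughly like the sketch below; the logger name is illustrative and this is not zuul's exact adapter.

    import logging

    logging.basicConfig(level=logging.DEBUG, format='%(name)s: %(message)s')

    class DeliveryLogAdapter(logging.LoggerAdapter):
        # Prefix every message with the GitHub delivery id and also
        # attach it to the record via 'extra', so structured backends
        # (journald, elasticsearch) can query it as a field.
        def process(self, msg, kwargs):
            kwargs['extra'] = self.extra
            return '[delivery: %s] %s' % (self.extra['delivery'], msg), kwargs

    log = DeliveryLogAdapter(
        logging.getLogger('zuul.GithubEventProcessor'),
        {'delivery': '8a98cf82-2e8f-11e9-921a-f68fa09c6c2a'})
    log.debug('Starting event processing')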
jlk | Oh I remember something else. We can't even ask GitHub to add a reference to the PR in that webhook, because the status is SET on a hash ( https://developer.github.com/v3/repos/statuses/#create-a-status ) it's not set on any specific pull request. | 22:23 |
clarkb | ya this seems to be a mismatch between pull request workflows and commit based api | 22:23 |
clarkb | so you have to jump through hoops to match them up in places | 22:23 |
corvus | given that, i wonder if an endpoint to return all the prs for a hash wouldn't be a terrible idea? | 22:24 |
corvus | especially one that counted against the normal api limit, not the search limit :) | 22:24 |
jlk | I'm sure you could cook something like that up w/ GraphQL | 22:27 |
jlk | however. GraphQL. | 22:27 |
*** jamesmcarthur has joined #zuul | 22:27 | |
corvus | tristanC: i've restarted zuul-web with the buildsets api endpoint patch, and the buildsets tab on http://logs.openstack.org/41/630041/5/check/zuul-build-dashboard/2290fd5/npm/html/ is behaving strangely. it seems to return a small set of buildsets at random. | 22:29 |
jlk | Digging through _iter() some more. The repo.pull_requests() API call defaults to asking for "-1" number of pulls, which is "all". That -1 is passed through to _iter(), which goes to GitHubIterator, which sets a page limit at 100 in that case. | 22:32 |
jlk | so, without direction it's going to fetch 100 at a time and page through them until done. We could... fetch more? Ask for 500? | 22:32 |
corvus | so that's 18 requests for ansible by default | 22:33 |
jlk | A good candidate for why that takes a while. | 22:33 |
jlk | I'm looking to see if there is an upper bound for how many we can request at once | 22:34 |
jlk | "For some resources, you can also set a custom page size up to 100 with the ?per_page parameter" | 22:34 |
jlk | so... 100 :/ | 22:34 |
corvus | maybe we could do an lru cache of sha->pr mapping | 22:35 |
jlk | Honestly, it wouldn't be the worst idea ever to just work 100 at a time to find ONE that matches and just end there. | 22:35 |
jlk | it's just perpetuating the oddity | 22:36 |
clarkb | corvus: even at a few thousand entries that should be a relatively small amount of memory | 22:36 |
clarkb | corvus: that seems like a not bad idea to me | 22:36 |
jlk | Would need some way to expire things from the cache as they close. Handle a PR close event and remove the PR from the cache | 22:39 |
jlk | otherwise you wouldn't recover from "multiple PRs w/ this hash" problem fast enough. | 22:39 |
corvus | good point | 22:40 |
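A sketch of that idea under the stated assumptions: an in-process map from head sha to PR number with LRU eviction, plus explicit invalidation when a pull_request "closed" event arrives. This is a hypothetical helper, not existing zuul code.

    from collections import OrderedDict

    class ShaToPullCache:
        def __init__(self, max_entries=2048):
            self.max_entries = max_entries
            self._entries = OrderedDict()  # sha -> PR number, in LRU order

        def get(self, sha):
            if sha not in self._entries:
                return None
            self._entries.move_to_end(sha)  # mark as recently used
            return self._entries[sha]

        def put(self, sha, pr_number):
            self._entries[sha] = pr_number
            self._entries.move_to_end(sha)
            if len(self._entries) > self.max_entries:
                self._entries.popitem(last=False)  # evict the oldest entry

        def invalidate_pr(self, pr_number):
            # Call this from the pull_request "closed" event handler so a
            # stale or newly ambiguous sha -> PR mapping does not linger.
            for sha in [s for s, n in self._entries.items() if n == pr_number]:
                del self._entries[sha]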
clarkb | (mostly personal curiosity) how do you end up with PRs that have the same hash? | 22:41 |
clarkb | is the common case trying to merge the same commit to multiple branches? | 22:41 |
clarkb | and the branches have the same commit history? | 22:41 |
jlk | I think it's very uncommon, it's just possible? | 22:41 |
clarkb | (if we can describe the case we might be able to say something like "zuul doesn't work with this workflow") | 22:41 |
jlk | like you've made a release branch and then quickly had to fix a thing, the fix goes to both master and the release branch, and master hadn't moved yet | 22:41 |
jlk | or if somebody accidentally opens 2x PRs from the same branch to the same target | 22:42 |
jlk | there's lots of tooling out there to open PRs, a hiccup could cause multiple. | 22:42 |
*** tflink has quit IRC | 23:02 | |
*** tflink has joined #zuul | 23:02 | |
*** jamesmcarthur has quit IRC | 23:11 | |
SpamapS | clarkb: you force push | 23:24 |
SpamapS | I have accidentally done it ;) | 23:24 |
jlk | heh. Yeah there's nothing GitHub (or gerrit) side that would prevent a variety of scenarios from happening | 23:25 |
clarkb | Gerrit actually enforces a lot more rules around this | 23:25 |
clarkb | like force pushing can be disabled pretty forcefully. And you can't push the same commit to multiple changes iirc, you'd have to change the changeid? though maybe that works with the stable/master example above | 23:27 |
clarkb | since the unique tuple is sha, changeid, branch | 23:27 |
jlk | a force push can be disabled by GitHub too, if the repo owner chooses to do so | 23:27 |
jlk | except that often PRs come from forks, and the rules don't necessarily apply | 23:27 |
*** jamesmcarthur has joined #zuul | 23:32 | |
SpamapS | ya, it's not only possible, but I'd say it's a valid scenario that should result in a clear error message. | 23:32 |
*** jamesmcarthur has quit IRC | 23:32 | |
*** jamesmcarthur has joined #zuul | 23:33 | |
SpamapS | Even if it means spamming both PRs. ;) | 23:33 |
clarkb | ya I don't necessarily think it's a bad thing, except that if we sacrifice some corner case functionality in exchange for working at all, that might be a good compromise | 23:33 |
jlk | That means finding all the matches again :/ | 23:33 |
clarkb | The other thing that gerrit does is more explicitly couple the workflow and api. So you have the commit info in the events | 23:34 |
clarkb | (I have no idea what it is like to implement that in github's api btu a query for list of PRs that map to this commit could be useful, this was suggested earlier I think) | 23:34 |
jlk | our checks API is definitely an attempt at improving this | 23:35 |
jlk | because check runs and suites are attached to specific PRs | 23:35 |
clarkb | jlk: in that case the entire CI workflow runs through that different API? | 23:35 |
jlk | it'll just take a while for the whole world to move over | 23:35 |
jlk | clarkb: yeah. As an app you register as a check suite, and on PR open a check run is requested, which Zuul would get and start interacting with. We send results into the check run | 23:36 |
clarkb | would we still need to process status updates to deal with gating things? | 23:37 |
jlk | yes, until all the other systems move over into check runs | 23:37 |
*** dkehn has quit IRC | 23:37 | |
jlk | so instead of setting status, they would be updating their own specific check suite/runs. We can iterate over them to see if they're complete or not | 23:37 |
jlk | hrm I wonder | 23:37 |
*** dkehn has joined #zuul | 23:38 | |
jlk | I wonder if we would get updates about a check run for another app | 23:38 |
jlk | maybe. Hard to tell from the docs | 23:39 |
jlk | "GitHub Apps that have the checks:read permission and subscribe to the check_suite webhook event receive the completed action payload for all check suites in the app's repository. " | 23:42 |
jlk | I think we'd receive hooks as each check suite completes, which is analogous to a status success. | 23:43 |
*** sdake has joined #zuul | 23:43 | |
clarkb | so the limitation might become "gating only works properly if all check suites use the checks api" and then switch zuul over to that (I'm hand waving though as I doubt I'd write that conversion or have all the details anticipated) | 23:44 |
jlk | heh | 23:44 |
jlk | maybe. I think it'd be more phased | 23:44 |
jlk | 1) move Zuul over to checks API. 2) sometime later deprecate reacting to status events. 3) stop responding to status events, claim that gating (on systems other than Zuul) requires those systems to be using the checks API | 23:45 |
jlk | a bunch of work is done for 1, in github3.py. It's probably time I poke at Zuul again and try to make use of it | 23:45 |
*** jamesmcarthur has quit IRC | 23:56 |