corvus | hrm, i'm not sure that fixed it | 00:00 |
---|---|---|
corvus | i still see the 121061 error | 00:01 |
clarkb | https://docs.github.com/en/rest/reference/apps#get-an-installation-for-the-authenticated-app though I wonder if that will work if we are getting 403s. Another approach might be to cross check the list of github repos we've got against whatever the app sees on github if that is a thing? | 00:02 |
corvus | if you or ianw know how to do that, please feel free | 00:03 |
corvus | meanwhile, i will attempt a binary search by elimination with repeated restarts | 00:04 |
ianw | hrm, i can only think of logging in interactively | 00:04 |
clarkb | our links in our docs are 404s | 00:08 |
clarkb | looks like maybe I'm not a member of opendevorg? | 00:09 |
clarkb | oh right we used a central account | 00:09 |
clarkb | my brain is catching up | 00:09 |
ianw | well i can go in and see the app, all its permissions, etc | 00:10 |
ianw | but i can't seem to see where it's installed | 00:10 |
corvus | i can't seem to run edit-secrets on bridge, so i don't know how to log into github and look | 00:11 |
clarkb | corvus: I think ianw has the file open and we've got a lock around gpg because of gpg-agent :/ | 00:11 |
clarkb | corvus: if ianw closes the file then it should be openable | 00:11 |
ianw | oh sorry let me close | 00:11 |
ianw | out | 00:11 |
ianw | I can see "Recent Deliveries" that shows where it ran | 00:11 |
corvus | which org owns the app? | 00:14 |
ianw | https://github.com/organizations/opendevorg/settings/apps should get you there | 00:15 |
ianw | you have to go to "Developer Settings -> GitHub apps" | 00:15 |
corvus | i don't see a list of installations | 00:17 |
corvus | removing all the projects from the zuul tenant config is still not sufficient to avoid the error | 00:17 |
ianw | agree, i can not find a list of where it is installed at all either | 00:18 |
corvus | just as a sanity check, i restarted on 4.9.0 and this problem persists (so it's unrelated to zuul changes) | 00:20 |
corvus | normally at this point, i would start hacking zuul code in place to get more debug info, but that's a bit difficult with containerized deployment | 00:20 |
clarkb | I think you can use stop and start instead of down and up to preserve the container contents. Still not really easy though | 00:21 |
corvus | well, it's difficult to edit the root install as the zuul user | 00:21 |
corvus | i do not know how to haxor our deployment :| | 00:22 |
ianw | i think you can just exec with --user root though? | 00:22 |
clarkb | corvus: in your manual edits of the tenant config are you removing the github repos in all tenants? I think a few of them have github entries. Also fungi has a way to edit the container contents from the host | 00:23 |
corvus | oh hey that works | 00:23 |
ianw | i feel like i've done that, with a quick apt-get install vim at points | 00:23 |
clarkb | you dig into /var/run/docker or /var/lib/docker something something | 00:23 |
clarkb | ah looks like ianw had the workaround | 00:23 |
corvus | clarkb: yes, i completely removed all github projects | 00:23 |
clarkb | k | 00:23 |
ianw | start/stop of container probably works, but i've also had some luck just HUPing 1 to reload changes | 00:24 |
fungi | sorry, stepped away for a few, but haven't we seen this happen in the past if we restart zuul too frequently and exceed the github api rate limits? | 00:26 |
corvus | this is a single installation that's failing | 00:26 |
clarkb | ok looking at the code we ask github for the list of installations and then attempt to auth them | 00:27 |
corvus | and it's not the first one | 00:27 |
clarkb | we don't seem to use the config on disk for this. it's like github's api is eventually consistent here | 00:27 |
clarkb | _prime_installation_map() in particular is where this happens | 00:28 |
clarkb | https://docs.github.com/en/rest/reference/apps#list-installations-for-the-authenticated-app is the api request we make | 00:28 |
clarkb | maybe we need to cross check the permissions returned in that list of dicts and filter out any that don't have sufficient perms | 00:29 |
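A rough sketch of the cross-check clarkb describes above, using plain `requests` against the documented list-installations endpoint rather than Zuul's actual `_prime_installation_map()`; the required-permission names, `APP_JWT`, and the skip logic are illustrative assumptions:

```python
# Hypothetical sketch only: list the app's installations and skip any whose
# granted permissions look insufficient. Not Zuul's real implementation;
# REQUIRED_PERMS and the permission names are assumed for illustration.
import requests

REQUIRED_PERMS = {"contents": "read", "statuses": "write"}  # assumed example
APP_JWT = "..."  # short-lived GitHub App JWT, generated elsewhere

resp = requests.get(
    "https://api.github.com/app/installations",
    headers={
        "Authorization": f"Bearer {APP_JWT}",
        "Accept": "application/vnd.github.v3+json",
    },
)
resp.raise_for_status()

usable = []
for inst in resp.json():          # note: the real endpoint is paginated
    granted = inst.get("permissions", {})
    ok = all(granted.get(k) in (v, "write") for k, v in REQUIRED_PERMS.items())
    if ok:
        usable.append(inst["id"])
    else:
        print(f"skipping installation {inst['id']} ({inst['account']['login']})")
```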
clarkb | I suspect that the initial hunch for the repo that is causing this is probably true, but they likely only reduced perms and didn't remove the installation | 00:31 |
clarkb | if they removed the installation we'd be ok? | 00:31 |
corvus | that's my understanding | 00:32 |
ianw | i feel like you have to accept the permissions that the app requests? | 00:32 |
clarkb | ianw: ya but you can also unaccept them. I suspect that is what happened | 00:32 |
ianw | istr we had to update the permissions, or had discussions around it, and it would require everyone re-authing | 00:32 |
clarkb | if we can uninstall the app as the app owner that might be one route here? | 00:33 |
corvus | okay i think i have a list of every repo that isn't the problem | 00:39 |
corvus | sigmavirus24/github3 does not appear in that list | 00:40 |
corvus | also, just fyi, there's some rando stuff in that list | 00:40 |
corvus | infra-root: how would you like to proceed? | 00:41 |
ianw | we can't actually uninstall the app from our side afaict? | 00:42 |
corvus | i'm unaware of a way to do that | 00:42 |
fungi | i'm fine with removing that entry from the tenant config and project list | 00:43 |
corvus | fungi: that is not sufficient | 00:43 |
corvus | the only quick fix i have to offer is to completely remove the github connection from opendev's zuul | 00:43 |
fungi | oh, sorry, i guess i missed the list of options, rereading | 00:43 |
clarkb | corvus: could we put a filter for that particular installation in the zuul code? | 00:43 |
clarkb | then also remove it from our config? | 00:43 |
corvus | clarkb: remove what from our config? | 00:44 |
fungi | i'm fine disconnecting from github if the current state of the github driver in zuul is that any project on github with our app installed can cause zuul to be unable to start | 00:44 |
clarkb | corvus: the change I made to remove the github3.py repo from our config. Basically ensure our config aligns with the filter in the prime function | 00:44 |
clarkb | https://docs.github.com/en/rest/reference/apps#delete-an-installation-for-the-authenticated-app seems to be a possibility too but you need a jwt token and that seems quite involved | 00:45 |
clarkb | Another option is to reach out to sigmavirus and ask them to uninstall it I suppose (though one that may not happen quickly) | 00:45 |
clarkb | corvus: re the filter idea, maybe exclude any installs that 403? | 00:46 |
fungi | we're only providing advisory testing to projects on github, if we stop connecting to github for a little while we're not blocking any of them from being able to merge code | 00:46 |
clarkb | hrm the concurrency code there makes that weird | 00:47 |
corvus | clarkb: yeah, i mean there's obviously code fixes that should be done | 00:47 |
clarkb | fungi: that is a good point and I guess aligns with the idea of disabling the github connection entirely. Though I have no idea what other things (job side) might break | 00:48 |
clarkb | I suppose starting there is straightforward and we can figure it out from there? | 00:48 |
corvus | i'm just not sure we should rely on a code fix in an ephemeral container right now | 00:48 |
ianw | doing the delete might be good, i can see a few things that make a jwt in bash | 00:48 |
corvus | i may be able to get a jwt | 00:48 |
corvus | like, print out the one that zuul is using | 00:48 |
corvus | okay, i have a jwt | 00:48 |
corvus | gimme a sec to try to make a curl | 00:49 |
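Roughly what that curl amounts to, sketched in Python with PyJWT; the app id, installation id, and key path are placeholders, and this is a reconstruction of the documented delete-installation flow, not the exact command corvus ran:

```python
# Hypothetical reconstruction: mint a short-lived GitHub App JWT (RS256,
# signed with the app's private key) and DELETE the offending installation.
# APP_ID, INSTALLATION_ID and the key path are placeholders.
import time

import jwt        # PyJWT
import requests

APP_ID = "12345"            # placeholder app id
INSTALLATION_ID = 67890     # placeholder installation id
with open("app-key.pem", "rb") as f:
    private_key = f.read()

now = int(time.time())
token = jwt.encode(
    {"iat": now - 60, "exp": now + 540, "iss": APP_ID},  # max 10 min lifetime
    private_key,
    algorithm="RS256",
)

resp = requests.delete(
    f"https://api.github.com/app/installations/{INSTALLATION_ID}",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github.v3+json",
    },
)
resp.raise_for_status()   # success is a 204 with no body, i.e. "no output"
```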
clarkb | k | 00:49 |
corvus | i ran it; no output | 00:51 |
corvus | restarting to see if it worked | 00:51 |
corvus | yes i think it's gone | 00:52 |
corvus | i will pull the clean docker image and restart now | 00:52 |
clarkb | ok | 00:52 |
ianw | \o/ | 00:52 |
corvus | 2021-09-23 00:53:40,343 INFO zuul.Scheduler: Starting scheduler | 00:55 |
corvus | that means it got past the crash-at-startup phase and is actually in its exception handler | 00:55 |
corvus | and it's loading configs now | 00:55 |
corvus | ianw, clarkb, fungi: thanks, that was some teamwork :) | 00:56 |
Clark[m] | No worries. I do need to pop out now in order to prep for tomorrow | 00:57 |
corvus | ianw: zuul master is running now; i'm still planning on checking back in a bit to collect data | 00:57 |
ianw | excellent | 00:59 |
ianw | i just installed the app to see what options i have | 00:59 |
ianw | i have a "suspend installation" and an "uninstall" | 01:00 |
ianw | but not fine-grained controls | 01:00 |
corvus | i wonder if it was suspended then | 01:00 |
ianw | i suspect maybe we ended up in "suspended" state? | 01:00 |
ianw | well i won't suspend the app, but it's a clue at least :) | 01:01 |
ianw | corvus: looks like there's a "suspended_at" "suspended_by" in the /app/installations endpoint | 01:06 |
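A minimal check for that, assuming only the documented `suspended_at` / `suspended_by` fields in the `/app/installations` response (the helper name is hypothetical):

```python
# Hypothetical helper: report installations that GitHub marks as suspended.
# Field names come from the /app/installations documentation; everything
# else here is an illustrative sketch.
def suspended_installations(installations):
    """Return (id, account, suspended_at, suspended_by) for suspended installs."""
    out = []
    for inst in installations:
        if inst.get("suspended_at"):
            who = (inst.get("suspended_by") or {}).get("login", "unknown")
            out.append((inst["id"], inst["account"]["login"],
                        inst["suspended_at"], who))
    return out
```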
corvus | once again, i'm not really sure what's happening | 01:09 |
corvus | 2021-09-23 00:59:39,327 WARNING zuul.GithubRateLimitHandler: API rate limit reached, need to wait for 2706 seconds | 01:11 |
corvus | wow | 01:11 |
corvus | are we actually just going to sit for 45 minutes before we complete startup because of that? | 01:11 |
corvus | yes, that's exactly what's going on | 01:12 |
corvus | we can't finish loading the openstack tenant because there are github projects in it, and we can't fetch the github branch lists because of api rate limiting, so we're going to sit in the startup routine for 45 minutes until that timeout is up | 01:12 |
corvus | i mean, i guess it's only 30m from now, so there's that | 01:14 |
ianw | hrm, would that have been hit due to the restarts being more api request heavy? | 01:15 |
corvus | possibly -- though it's also related to the number of github repos that we access that aren't in an installation | 01:16 |
corvus | we get more api requests via an app installation; without it, we are limited to a fairly small pool | 01:16 |
corvus | so the more repos we add "for dependencies" the less margin we have | 01:16 |
corvus | i guess one way to look at it is that with a pool of X requests per hour, we can start up 1 time per hour with X repos; or 2 times per hour with X/2 repos, etc. | 01:17 |
corvus | apparently 13 times in 2 hours is too many | 01:19 |
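A back-of-the-envelope version of that budget; the numbers are purely illustrative and assume roughly one branch-listing request per repo at startup:

```python
# Illustrative arithmetic only: with a pool of R API requests per hour and
# about one request per repo per startup, restarts/hour is bounded by R/repos.
def max_restarts_per_hour(pool_per_hour: int, repos: int,
                          requests_per_repo: int = 1) -> int:
    return pool_per_hour // (repos * requests_per_repo)

# e.g. a hypothetical pool of 1000 requests/hour and 100 repos:
print(max_restarts_per_hour(1000, 100))   # -> 10 restarts per hour
```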
corvus | ianw, Clark, fungi: the options i can offer are: wait 25 more minutes, or remove github from the config | 01:20 |
ianw | removing it, and then presumably at another point having to stop and start queues again to put it back, seems like more pain than just waiting ~20 minutes | 01:22 |
fungi | yes, at this point i'd be inclined to just wait another 20 minutes, as the additional restart probably means an equivalent amount of downtime | 01:22 |
corvus | sounds good, i'll see about food | 01:23 |
corvus | zuul is up; i'm re-enqueuing | 02:03 |
yuriys | what a day/night you guys are having | 02:05 |
corvus | re-enqueue done | 02:11 |
corvus | good news, processing is slow already, so i will collect data now | 02:16 |
yuriys | Question, when a 'Create Node' event is triggered for a job request that requires more than 1 instance for CI testing, does the executor processing the call issue the create instance api call with a 'count' variable, like "give me ubuntu node, a count of 3" | 02:41 |
yuriys | Or does it make 3 separate calls to the chosen provider. | 02:42 |
corvus | yuriys: it's the nodepool-launcher that creates, and it's 3 create calls | 02:43 |
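For contrast, a sketch of what "3 separate calls" looks like with openstacksdk; the cloud name, image, flavor and network are placeholders, and this is not nodepool's actual launcher code:

```python
# Illustration of the answer above: one create call per node rather than a
# single boot request with a count. All names are placeholders.
import openstack

conn = openstack.connect(cloud="example-provider")   # assumed clouds.yaml entry

servers = []
for i in range(3):
    servers.append(conn.create_server(
        name=f"ubuntu-node-{i}",
        image="ubuntu-focal",    # placeholder image
        flavor="m1.large",       # placeholder flavor
        network="private",       # placeholder network
        wait=True,
    ))
# Nova's API does also accept min_count/max_count in a single boot request,
# but per the discussion above that is not how nodepool launches nodes.
```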
yuriys | ty | 02:44 |
corvus | ianw: okay, i think i have all necessary info and can restart zuul on 4.9.0 now | 02:53 |
corvus | i'll start working on that | 02:53 |
ianw | ++, thanks | 02:53 |
corvus | re-enqueuing | 03:06 |
corvus | ianw: re-enqueue is complete, jobs seem to be running. i'm going to go pass out now. you have the conn. :) | 03:22 |
ianw | haha thanks. redshirt is on | 03:23 |
*** bhagyashris is now known as bhagyashris|rover | 04:08 | |
*** ysandeep|away is now known as ysandeep | 05:51 | |
*** anbanerj|ruck is now known as frenzy_friday | 07:08 | |
*** jpena|off is now known as jpena | 07:30 | |
*** rpittau|afk is now known as rpittau | 07:30 | |
opendevreview | Merged openstack/project-config master: grafana: further path fixes https://review.opendev.org/c/openstack/project-config/+/810338 | 07:47 |
opendevreview | Merged opendev/irc-meetings master: Changed Interop WG meetings time and date. https://review.opendev.org/c/opendev/irc-meetings/+/809903 | 08:32 |
*** ysandeep is now known as ysandeep|lunch | 09:24 | |
*** ysandeep|lunch is now known as ysandeep | 10:24 | |
*** ysandeep is now known as ysandeep|afk | 11:19 | |
*** ysandeep|afk is now known as ysandeep | 11:29 | |
*** jpena is now known as jpena|lunch | 11:31 | |
*** jpena|lunch is now known as jpena | 12:33 | |
opendevreview | Kendall Nelson proposed opendev/system-config master: Setup Letsencrypt for ptgbot site https://review.opendev.org/c/opendev/system-config/+/804791 | 13:50 |
opendevreview | Kendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot https://review.opendev.org/c/opendev/system-config/+/803190 | 13:57 |
mnaser | hi infra-root, did https://review.opendev.org/c/openstack/openstack-helm/+/810142 get caught in a very weird/bad timing during a zuul restart or is something wrong before i recheck? | 15:07 |
mnaser | oooh, maybe we got caught in that 45 minute github timeout ::p | 15:08 |
mnaser | 9:16 pm is when it was wf+1 which confirms the timeline | 15:09 |
mnaser | i just recheck'd for now, sorry for the noise :) | 15:09 |
fungi | mnaser: yeah, i think your math is right | 15:10 |
fungi | there are probably at least a few other changes in a similar state | 15:10 |
mnaser | fungi: https://review.opendev.org/q/label:Workflow%253D1+label:Verified%252B1+is:open small search shows we might have been the only lucky ones :p | 15:13 |
fungi | neat! good timing on our part i guess (maybe not so much on yours!) | 15:14 |
fungi | ;) | 15:14 |
opendevreview | Merged openstack/project-config master: Retire puppet-freezer - Step 1: End project Gating https://review.opendev.org/c/openstack/project-config/+/808072 | 15:50 |
*** rpittau is now known as rpittau|afk | 16:02 | |
*** marios is now known as marios|out | 16:06 | |
*** ysandeep is now known as ysandeep|dinner | 16:06 | |
opendevreview | Merged openstack/project-config master: Retire puppet-freezer - Step 3: Remove Project https://review.opendev.org/c/openstack/project-config/+/808675 | 16:31 |
*** jpena is now known as jpena|off | 16:33 | |
*** ysandeep|dinner is now known as ysandeep | 17:09 | |
fungi | pushing a change for review is taking minutes on my system, not sure whether it's my network | 17:41 |
fungi | Received disconnect from 2604:e100:1:0:f816:3eff:fe52:22de port 29418:2: Detected AuthTimeout after 120848/120000 ms. | 17:41 |
fungi | fatal: Could not read from remote repository. | 17:41 |
fungi | yeah, that's not good. worked on a retry though still sluggish | 17:41 |
opendevreview | Jeremy Stanley proposed zuul/zuul-jobs master: Deprecate EOL Python releases and OS versions https://review.opendev.org/c/zuul/zuul-jobs/+/810299 | 17:41 |
fungi | my local gertty also keeps switching to offline mode, so it too is having trouble querying gerrit | 17:42 |
fungi | infra-root: i have a root screen session on afs01.dfw with a pvmove from the main04 volume to the new main05 volume i just created/attached | 19:28 |
fungi | it'll probably take a few hours to complete, and then i'll proceed with cleaning up and removing main04 | 19:29 |
fungi | and just after i said that, the workstation i was ssh'd into it from crashed hard. so glad i did it in screen! | 19:32 |
*** ysandeep is now known as ysandeep|out | 19:33 | |
*** dviroel is now known as dviroel|out | 20:56 | |
fungi | okay, pvmove just completed, things are still looking sane on the server so i'll start cleaning up the old volume | 22:22 |
fungi | old main04 volume detached and deleted now, i'll close out the maintenance ticket with the provider | 22:36 |
fungi | #status log Replaced main04 volume on afs01.dfw with a new main05 volume in order to avoid impact from upcoming provider maintenance activity | 22:40 |
opendevstatus | fungi: finished logging | 22:40 |