Thursday, 2021-09-23

00:00 <corvus> hrm, i'm not sure that fixed it
00:01 <corvus> i still see the 121061 error
00:02 <clarkb> though I wonder if that will work if we are getting 403s. Another approach might be to cross check the list of github repos we've got against whatever the app sees on github, if that is a thing?
00:03 <corvus> if you or ianw know how to do that, please feel free
00:04 <corvus> meanwhile, i will attempt a binary search by elimination with repeated restarts
00:04 <ianw> hrm, i can only think of logging in interactively
00:08 <clarkb> our links in our docs are 404s
00:09 <clarkb> looks like maybe I'm not a member of opendevorg?
00:09 <clarkb> oh right, we used a central account
00:09 <clarkb> my brain is catching up
00:10 <ianw> well i can go in and see the app, all its permissions, etc
00:10 <ianw> but i can't seem to see where it's installed
00:11 <corvus> i can't seem to run edit-secrets on bridge, so i don't know how to log into github and look
00:11 <clarkb> corvus: I think ianw has the file open and we've locked around gpg because of gpg-agent :/
00:11 <clarkb> corvus: if ianw closes the file then it should be openable
00:11 <ianw> oh sorry, let me close
00:11 <ianw> I can see "Recent Deliveries" that shows where it ran
00:14 <corvus> which org owns the app?
00:15 <ianw> should get you there
00:15 <ianw> you have to go "Developer Settings -> GitHub apps"
00:17 <corvus> i don't see a list of installations
00:17 <corvus> removing all the projects from the zuul tenant config is still not sufficient to avoid the error
00:18 <ianw> agree, i can not find a list of where it is installed at all either
00:20 <corvus> just as a sanity check, i restarted on 4.9.0 and this problem persists (so it's unrelated to zuul changes)
00:20 <corvus> normally at this point, i would start hacking zuul code in place to get more debug info, but that's a bit difficult with a containerized deployment
00:21 <clarkb> I think you can use stop and start instead of down and up to preserve the container contents. Still not really easy though
00:21 <corvus> well, it's difficult to edit the root install as the zuul user
00:22 <corvus> i do not know how to haxor our deployment :|
00:22 <ianw> i think you can just exec with --user root though?
00:23 <clarkb> corvus: in your manual edits of the tenant config are you removing the github repos in all tenants? I think a few of them have github entries. Also fungi has a way to edit the container contents from the host
00:23 <corvus> oh hey, that works
00:23 <ianw> i feel like i've done that, with a quick apt-get install vim at points
00:23 <clarkb> you dig into /var/run/docker or /var/lib/docker something something
00:23 <clarkb> ah, looks like ianw had the workaround
00:23 <corvus> clarkb: yes, i completely removed all github projects
00:24 <ianw> start/stop of the container probably works, but i've also had some luck just HUPing pid 1 to reload changes
00:26 <fungi> sorry, stepped away for a few, but haven't we seen this happen in the past if we restart zuul too frequently and exceed the github api rate limits?
00:26 <corvus> this is a single installation that's failing
00:27 <clarkb> ok, looking at the code, we ask github for the list of installations and then attempt to auth them
00:27 <corvus> and it's not the first one
00:27 <clarkb> we don't seem to use the config on disk for this. it's like github's api is eventually consistent here
00:28 <clarkb> _prime_installation_map() in particular is where this happens
00:28 <clarkb> is the api request we make
00:29 <clarkb> maybe we need to cross check the permissions returned in that list of dicts and filter out any that don't have sufficient perms
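[Editor's note: the permission cross-check clarkb suggests could be sketched roughly as follows. This is hypothetical, not Zuul's actual code; the `permissions` field is real in GitHub's GET /app/installations response, but the required-permission set below is an illustrative assumption.]

```python
# Hypothetical filter over installation dicts as returned by
# GET /app/installations; REQUIRED_PERMS values are illustrative.
REQUIRED_PERMS = {"contents": "read", "statuses": "write"}

def has_sufficient_perms(installation):
    # "permissions" maps permission name -> "read" or "write"
    perms = installation.get("permissions", {})
    for name, needed in REQUIRED_PERMS.items():
        granted = perms.get(name)
        if granted == needed:
            continue
        if needed == "read" and granted == "write":
            continue  # write access implies read access
        return False
    return True
```

Installations failing this check would be skipped rather than allowed to break startup.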
00:31 <clarkb> I suspect that the initial hunch for the repo that is causing this is probably true, but they likely only reduced perms and didn't remove the installation
00:31 <clarkb> if they removed the installation we'd be ok?
00:32 <corvus> that's my understanding
00:32 <ianw> i feel like you have to accept the permissions that the app requests?
00:32 <clarkb> ianw: ya, but you can also unaccept them. I suspect that is what happened
00:32 <ianw> istr we had to update the permissions, or had discussions around it, and it would require everyone re-authing
00:33 <clarkb> if we can uninstall the app as the app owner, that might be one route here?
00:39 <corvus> okay, i think i have a list of every repo that isn't the problem
00:40 <corvus> sigmavirus24/github3 does not appear in that list
00:40 <corvus> also, just fyi, there's some rando stuff in that list
00:41 <corvus> infra-root: how would you like to proceed?
00:42 <ianw> we can't actually uninstall the app from our side afaict?
00:42 <corvus> i'm unaware of a way to do that
00:43 <fungi> i'm fine with removing that entry from the tenant config and project list
00:43 <corvus> fungi: that is not sufficient
00:43 <corvus> the only quick fix i have to offer is to completely remove the github connection from opendev's zuul
00:43 <fungi> oh, sorry, i guess i missed the list of options, rereading
00:43 <clarkb> corvus: could we put a filter for that particular installation in the zuul code?
00:43 <clarkb> then also remove it from our config?
00:44 <corvus> clarkb: remove what from our config?
00:44 <fungi> i'm find disconnecting from github if the current state of the github driver in zuul is that any project on github with our app installed can cause zuul to be unable to start
00:44 <clarkb> corvus: the change I made to remove the repo from our config. Basically ensure our config aligns with the filter in the prime function
00:44 <fungi> er, fine
00:45 <clarkb> seems to be a possibility too, but you need a jwt token and that seems quite involved
00:45 <clarkb> Another option is to reach out to sigmavirus and ask them to uninstall it, I suppose (though one that may not happen quickly)
00:46 <clarkb> corvus: re the filter idea, maybe exclude any installs that 403?
00:46 <fungi> we're only providing advisory testing to projects on github; if we stop connecting to github for a little while we're not blocking any of them from being able to merge code
00:47 <clarkb> hrm, the concurrency code there makes that weird
00:47 <corvus> clarkb: yeah, i mean there's obviously code fixes that should be done
00:48 <clarkb> fungi: that is a good point and I guess aligns with the idea of disabling the github connection entirely. Though I have no idea what other things (job side) that might break
00:48 <clarkb> I suppose starting there is straightforward and we can figure it out from there?
00:48 <corvus> i'm just not sure we should rely on a code fix in an ephemeral container right now
00:48 <ianw> doing the delete might be good; i can see a few things that make a jwt in bash
00:48 <corvus> i may be able to get a jwt
00:48 <corvus> like, print out the one that zuul is using
00:48 <corvus> okay, i have a jwt
00:49 <corvus> gimme a sec to try to make a curl
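[Editor's note: the request corvus assembled probably amounted to the pieces below, shown as a Python sketch. The claim names and the DELETE endpoint follow GitHub's App API; the payload still has to be RS256-signed with the app's private key (omitted here) before it works as a bearer token.]

```python
import time

def app_jwt_claims(app_id, now=None, ttl=540):
    # Claim set for a GitHub App JWT (GitHub caps exp at 10 minutes).
    # This must be RS256-signed with the app's private key, e.g. via
    # PyJWT, before use; signing is intentionally omitted here.
    now = int(time.time() if now is None else now)
    return {"iat": now - 60, "exp": now + ttl, "iss": str(app_id)}

def uninstall_request(installation_id, jwt):
    # An HTTP DELETE on this URL with the app JWT removes the
    # installation; success is a 204 with no response body, which
    # matches the "no output" seen from curl below.
    return ("DELETE",
            f"https://api.github.com/app/installations/{installation_id}",
            {"Authorization": f"Bearer {jwt}",
             "Accept": "application/vnd.github.v3+json"})
```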
00:51 <corvus> i ran it; no output
00:51 <corvus> restarting to see if it worked
00:52 <corvus> yes, i think it's gone
00:52 <corvus> i will pull the clean docker image and restart now
00:55 <corvus> 2021-09-23 00:53:40,343 INFO zuul.Scheduler: Starting scheduler
00:55 <corvus> that means it got past the crash-at-startup phase and is actually in its exception handler
00:55 <corvus> and it's loading configs now
00:56 <corvus> ianw, clarkb, fungi: thanks, that was some teamwork :)
00:57 <Clark[m]> No worries. I do need to pop out now in order to prep for tomorrow
00:57 <corvus> ianw: zuul master is running now; i'm still planning on checking back in a bit to collect data
00:59 <ianw> i just installed the app to see what options i have
01:00 <ianw> i have a "suspend installation" and an "uninstall"
01:00 <ianw> but not fine-grained controls
01:00 <corvus> i wonder if it was suspended then
01:00 <ianw> i suspect maybe we ended up in a "suspended" state?
01:01 <ianw> well i won't suspend the app, but it's a clue at least :)
01:06 <ianw> corvus: looks like there's a "suspended_at" / "suspended_by" in the /app/installations endpoint
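[Editor's note: that field makes a cheap up-front check possible when priming the installation map. A minimal sketch (hypothetical; Zuul's actual `_prime_installation_map()` may differ):]

```python
def skip_suspended(installations):
    # Entries from GET /app/installations carry "suspended_at" and
    # "suspended_by"; a non-null suspended_at means token requests for
    # that installation will fail, so it's safer to drop it up front.
    return [i for i in installations if i.get("suspended_at") is None]
```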
01:09 <corvus> once again, i'm not really sure what's happening
01:11 <corvus> 2021-09-23 00:59:39,327 WARNING zuul.GithubRateLimitHandler: API rate limit reached, need to wait for 2706 seconds
01:11 <corvus> are we actually just going to sit for 45 minutes before we complete startup because of that?
01:12 <corvus> yes, that's exactly what's going on
01:12 <corvus> we can't finish loading the openstack tenant because there are github projects in it, and we can't fetch the github branch lists because of api rate limiting, so we're going to sit in the startup routine for 45 minutes until that timeout is up
01:14 <corvus> i mean, i guess it's only 30m from now, so there's that
01:15 <ianw> hrm, would that have been hit due to the restarts being more api-request heavy?
01:16 <corvus> possibly -- though it's also related to the number of github repos that we access that aren't in an installation
01:16 <corvus> we get more api requests via an app installation; without it, we are limited to a fairly small pool
01:16 <corvus> so the more repos we add "for dependencies" the less margin we have
01:17 <corvus> i guess one way to look at it is that with a pool of X requests per hour, we can start up 1 time per hour with X repos; or 2 times per hour with X/2 repos, etc.
01:19 <corvus> apparently 13 times in 2 hours is too many
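[Editor's note: the back-of-the-envelope above can be written down directly. Numbers here are illustrative, not our actual quota or repo count; the one-request-per-repo cost is also an assumption.]

```python
def restarts_per_hour(request_pool, repo_count, requests_per_repo=1):
    # Each startup fetches branch data for every github repo not
    # covered by an app installation, so each restart costs roughly
    # repo_count * requests_per_repo out of the hourly request pool.
    cost = repo_count * requests_per_repo
    return request_pool // cost
```

With a 5000-request pool and 250 such repos, that budgets 20 restarts per hour; double the repos and the budget halves.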
01:20 <corvus> ianw, Clark, fungi: the options i can offer are: wait 25 more minutes, or remove github from the config
01:22 <ianw> removing it, and then presumably at another point having to stop and start queues again to put it back, seems like more pain than just waiting ~20 minutes
01:22 <fungi> yes, at this point i'd be inclined to just wait another 20 minutes, as the additional restart probably means an equivalent amount of downtime
01:23 <corvus> sounds good, i'll see about food
02:03 <corvus> zuul is up; i'm re-enqueuing
02:05 <yuriys> what a day/night you guys are having
02:11 <corvus> re-enqueue done
02:16 <corvus> good news, processing is slow already, so i will collect data now
02:41 <yuriys> Question: when a 'Create Node' event is triggered for a job request that requires more than 1 instance for CI testing, does the executor processing the call issue the create instance api call with a 'count' variable, like "give me ubuntu node, a count of 3"?
02:42 <yuriys> Or does it make 3 separate calls to the chosen provider.
02:43 <corvus> yuriys: it's the nodepool-launcher that creates, and it's 3 create calls
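[Editor's note: in other words, the launcher loops rather than batching. Roughly, using a hypothetical provider interface rather than nodepool's real driver API:]

```python
def launch_nodes(provider, label, count):
    # One create-server API call per requested node, not a single call
    # with a count parameter; "provider" is a stand-in for a nodepool
    # provider driver.
    return [provider.create_server(label) for _ in range(count)]
```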
02:53 <corvus> ianw: okay, i think i have all necessary info and can restart zuul on 4.9.0 now
02:53 <corvus> i'll start working on that
02:53 <ianw> ++, thanks
03:22 <corvus> ianw: re-enqueue is complete, jobs seem to be running.  i'm going to go pass out now.  you have the conn.  :)
03:23 <ianw> haha thanks.  redshirt is on
04:08 *** bhagyashris is now known as bhagyashris|rover
05:51 *** ysandeep|away is now known as ysandeep
07:08 *** anbanerj|ruck is now known as frenzy_friday
07:30 *** jpena|off is now known as jpena
07:30 *** rpittau|afk is now known as rpittau
<opendevreview> Merged openstack/project-config master: grafana: further path fixes
<opendevreview> Merged opendev/irc-meetings master: Changed Interop WG meetings time and date.
09:24 *** ysandeep is now known as ysandeep|lunch
10:24 *** ysandeep|lunch is now known as ysandeep
11:19 *** ysandeep is now known as ysandeep|afk
11:29 *** ysandeep|afk is now known as ysandeep
11:31 *** jpena is now known as jpena|lunch
12:33 *** jpena|lunch is now known as jpena
<opendevreview> Kendall Nelson proposed opendev/system-config master: Setup Letsencrypt for ptgbot site
<opendevreview> Kendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot
15:07 <mnaser> hi infra-root, did get caught in a very weird/bad timing during a zuul restart or is something wrong before i recheck?
15:08 <mnaser> oooh, maybe we got caught in that 45 minute github timeout :p
15:09 <mnaser> 9:16 pm is when it was wf+1, which confirms the timeline
15:09 <mnaser> i just recheck'd for now, sorry for the noise :)
15:10 <fungi> mnaser: yeah, i think your math is right
15:10 <fungi> there are probably at least a few other changes in a similar state
15:13 <mnaser> fungi: small search shows we might have been the only lucky ones :p
15:14 <fungi> neat! good timing on our part i guess (maybe not so much on yours!)
<opendevreview> Merged openstack/project-config master: Retire puppet-freezer - Step 1: End project Gating
16:02 *** rpittau is now known as rpittau|afk
16:06 *** marios is now known as marios|out
16:06 *** ysandeep is now known as ysandeep|dinner
<opendevreview> Merged openstack/project-config master: Retire puppet-freezer - Step 3: Remove Project
16:33 *** jpena is now known as jpena|off
17:09 *** ysandeep|dinner is now known as ysandeep
17:41 <fungi> pushing a change for review is taking minutes on my system, not sure whether it's my network
17:41 <fungi> Received disconnect from 2604:e100:1:0:f816:3eff:fe52:22de port 29418:2: Detected AuthTimeout after 120848/120000 ms.
17:41 <fungi> fatal: Could not read from remote repository.
17:41 <fungi> yeah, that's not good. worked on a retry, though still sluggish
<opendevreview> Jeremy Stanley proposed zuul/zuul-jobs master: Deprecate EOL Python releases and OS versions
17:42 <fungi> my local gertty also keeps switching to offline mode, so it too is having trouble querying gerrit
19:28 <fungi> infra-root: i have a root screen session on afs01.dfw with a pvmove from the main04 volume to the new main05 volume i just created/attached
19:29 <fungi> it'll probably take a few hours to complete, and then i'll proceed with cleaning up and removing main04
19:32 <fungi> and just after i said that, the workstation i was ssh'd into it with crashed hard. so glad it did it in screen!
19:33 *** ysandeep is now known as ysandeep|out
20:56 *** dviroel is now known as dviroel|out
22:22 <fungi> okay, pvmove just completed, things are still looking sane on the server so i'll start cleaning up the old volume
22:36 <fungi> old main04 volume detached and deleted now, i'll close out the maintenance ticket with the provider
22:40 <fungi> #status log Replaced main04 volume on afs01.dfw with a new main05 volume in order to avoid impact from upcoming provider maintenance activity
22:40 <opendevstatus> fungi: finished logging

Generated by irclog2html 2.17.2 by Marius Gedminas