Friday, 2023-02-10

clarkbianw: seems like that occurs early enough in bootstrapping that it should work well00:27
ianw#status log all production hosts updated to docker 2300:44
opendevstatusianw: finished logging00:44
ianwshows the flag being turned up00:48
clarkbianw: does that map to your logging changes? I don't see channelOpen or channelClosed etc01:00
ianwi don't think "debug" is the right level01:00
ianwperhaps TRACE is01:00
ianwit logs atFine()01:01
ianwbut 01:01
ianwfatal: "FINE" is not a valid value for "LEVEL"01:01
clarkb"FINE is a message level providing tracing information."01:02
clarkbtrace seems right given ^01:02
clarkbthats likely to be extremely chatty in prod though01:03
clarkbwe might want to up the priority of these log lines to avoid a flood?01:03
ianwit may be a couple of extra lines, but i feel like that's probably ok.  when i enable it, i'll watch closely01:05
ianwgerrit2 has 697G free atm, so not too worried about overflowing that :)01:05
clarkbwell mostly thinking the rest of gerrit might have extensive logging atFine but I don't know for sure01:05
clarkbheh ok :)01:05
ianwoh you only turn up that particular logger01:06
clarkboh I see that now01:06
opendevreviewIan Wienand proposed opendev/system-config master: gerrit: increase ssh channel debugging
ianw^ hopefully that shows the messages, that would be great to confirm it working01:06
ianw(and better testing than it will get upstream too :)01:07
clarkbya it's really cool that we can do that sort of thing. We can also use depends-on, which is easier than the patch method but forces upstream to merge first01:11
ianwdoh i didn't even think of depends-on!  01:13
clarkbin this case if you want to move ahead of upstream then depends-on won't work, but otherwise it should work. it's a really great way to have our CI prove things for upstream changes01:16
ianwhrm, i still don't see any message ->
opendevreviewIan Wienand proposed opendev/system-config master: gerrit: increase ssh channel debugging
ianw^ i was using atFine() but that changes it to atFinest() ... maybe that makes a difference?02:30
ianwit seems that there's differences between flogger and j4git and i wonder if there's some disconnect02:30
ianwhrm, i wonder if we restart gerrit or something?02:53
opendevreviewIan Wienand proposed opendev/system-config master: gerrit: increase ssh channel debugging
*** yadnesh|away is now known as yadnesh04:28
ianwi don't quite know how to make it work :/ ... I've sent a message
*** chandankumar is now known as chkumar|rover05:07
*** Tengu9 is now known as Tengu08:21
*** jpena|off is now known as jpena08:29
opendevreviewAlex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects
opendevreviewAlex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects
opendevreviewAlex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects
opendevreviewAlex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects
*** dasm|off is now known as dasm13:49
*** yadnesh is now known as yadnesh|away15:06
fungiclarkb: expanding on my comment in the zuul channel, that's also why we got notified about a bunch of expiring ssl certs. nothing is getting enqueued into deploy15:10
fungibut additionally, all openstack releases which got approved this week have not tagged or published anything15:12
corvusfungi: do you have a list of tenants+pipelines throwing the error?15:22
fungicorvus: well, like i mentioned in matrix, there may be more than one cause so it's hard to disentangle. the vast majority (by orders of magnitude) appear for the release-post and deploy pipelines in the openstack tenant15:26
fungilooks like it might be approximately 2023-02-06 17:15:19,025 utc in the logs where they start up15:27
corvusfungi: yeah, i think we just want the current high-frequency ones.  basically if it only happens occasionally, it's probably harmless, but if it happens every time we go through the processing loop, that's a problem.  it is possible to tell by the traceback, but it's tedious.  easier to just say we're interested in the ones with >1000/day.15:28
fungiit's possible that correlates to activity too15:28
corvusfungi: so the high-frequency ones are release-post and deploy in openstack?15:28
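[Editor's note] The ">1000/day" cutoff corvus describes can be sketched as a simple tally. This is only an illustration: the (day, tenant, pipeline) tuple shape and the sample counts are assumptions, and extracting such tuples from the real scheduler logs is left out entirely.

```python
from collections import Counter

def high_frequency(errors, threshold=1000):
    """Return the (day, tenant, pipeline) keys seen more than `threshold` times.

    `errors` is any iterable of (day, tenant, pipeline) tuples; this shape is
    a hypothetical stand-in for whatever is parsed out of the logs.
    """
    counts = Counter(errors)
    return {key: n for key, n in counts.items() if n > threshold}

# Made-up sample: one noisy pipeline and one that only errors occasionally.
sample = (
    [("2023-02-07", "openstack", "deploy")] * 1500
    + [("2023-02-07", "openstack", "check")] * 3
)
noisy = high_frequency(sample)
print(noisy)
```

Only the pipeline that errors on every pass through the processing loop clears the threshold; the occasional one is filtered out as probably harmless.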
corvusfungi: to recover, we should delete the pipeline state from zk for those pipelines.  there is a command for that; i will run it and echo it here.15:29
fungii guess that's safe since there's nothing successfully enqueuing into them anyway15:30
fungiso nothing to lose15:30
corvusyep.  if there were, we'd want to grab the queue for later re-enqueue15:30
fungithat does leave me wondering what transpired to corrupt things across multiple pipelines in there though15:31
Clark[m]I've got a school run nowish but will review the suspected fix as soon as that is done15:31
corvusfungi: it's described in the commit message of
Clark[m]If we can land it by this afternoon auto upgrades will deploy it15:31
fungicorvus: oh, got it. so the race you were talking about then15:32
corvusyep.  triggered by a reconfiguration.15:32
corvusi ran zuul-admin delete-pipeline-state openstack release-post -- it's reloading the openstack tenant from scratch now.... which is not what i was expecting but let's see where this goes.15:33
fungiand looking at the logs more closely, the corruption in release-post may have started about a day earlier than deploy, so i guess they didn't happen at the same time and we might expect more pipelines to be impacted until the restart15:33
corvusthat is plausible15:33
dasmo/ are you aware of > Something went wrong. on page?15:39
corvusdasm: yep, repair in progress15:40
dasmack, thx15:42
corvusassuming this does work, there is a chance we'll need to go through it again after the second pipeline is removed.15:46
corvusokay reconfig is complete15:53
fungiwas that for the first or second pipeline cleanup?15:56
corvusfungi: the first... i'm going to look into whether we need to do anything for the second or if it got cleaned up automatically....15:57
fungioh, cool. thanks!15:57
corvusthere appear to be a large number of release-post jobs running now15:58
fungioh excellent! i guess the scheduler was saving them up?15:59
corvusso if there really was nothing already in those 2 pipelines there may actually be no data loss.15:59
fungithat's a rather neat side effect15:59
corvusthe deploy pipeline also has 1130 events queued up for it.15:59
corvuswe are still seeing the errors for the deploy pipeline, so we will need to repeat the process.15:59
fungihopefully most of those are just waiting to get filtered out16:00
corvusyeah, should go pretty quick16:01
corvusthere are now 10 items in the release pipeline16:02
corvusi think we're ready to proceed with the deploy pipeline delete.  given what we saw earlier, i think it's safe to proceed with that now.  but if you'd prefer, we could wait until the release jobs settle down.16:04
corvusmy preference would be to proceed -- all of these things are pretty slow, so better to do them at the same time16:04
corvus#status log deleted pipeline state for openstack/release-post and openstack/deploy due to data corruption in zk16:07
opendevstatuscorvus: finished logging16:08
corvusdone.  it's reloading config again.16:08
clarkbwfm. I'm just sitting down now and will try to look at that change momentarily16:08
corvusthis will suspend processing for another 20-30 minutes while it reconfigures again16:09
fungiyep, sorry stepped away for a sec to grab a snack, nut sounds good16:10
fungier, but sounds good16:11
fungi(snack was popcorn, not nuts)16:11
corvuswhy not both? have some cracker jacks!16:11
corvusit's back16:29
corvusi think we may need a bit more repair, as i think the schedulers may think they're out of date.  i'm going to try a tenant reconfiguration event to clear it.16:39
corvus2023-02-10 16:38:32,864 DEBUG zuul.Scheduler: [e: 5655ab38bd674308a46db2e9323d1935] Trigger event minimum reconfigure ltime of 1666484481791 newer than current reconfigure ltime of -1, aborting early16:39
corvusthat's the error16:39
fungiout of date as in they think the cache is stale? or what?16:40
fungithey think they have a stale view of the config?16:40
corvusyeah the second thing16:40
fungigot it16:41
corvusi think it probably happened because of the unusual way in which they were prompted to get their current config16:41
corvusi've issued zuul-scheduler tenant-reconfigure openstack16:42
corvusi'm not sure how long this will take to complete (up to 30m but could be faster depending on if it decides to use any caches).  at least the web ui should continue to function this time though.16:43
corvus2023-02-10 16:44:41,991 WARNING zuul.GithubRateLimitHandler: API rate limit reached, need to wait for 3573 seconds16:48
corvusi believe that means the openstack tenant is suspending processing for one hour waiting to go under the github api limit.16:49
fungioh, right, too many rescans of gh branches i guess16:50
corvusnext expected action at 17:44 utc16:54
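[Editor's note] The 17:44 estimate follows from adding the 3573 second wait to the timestamp on the rate-limit warning. A minimal sketch of that arithmetic, assuming the timestamp format shown in the log line above (the function name is made up for illustration):

```python
from datetime import datetime, timedelta

def resume_time(logged_at: str, wait_seconds: int) -> datetime:
    """Add the reported wait to the timestamp of the warning line."""
    start = datetime.strptime(logged_at, "%Y-%m-%d %H:%M:%S,%f")
    return start + timedelta(seconds=wait_seconds)

resume = resume_time("2023-02-10 16:44:41,991", 3573)
print(resume.strftime("%H:%M:%S"))  # prints 17:44:14
```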
*** jpena is now known as jpena|off17:05
clarkbI've approved the fix for the race condition in zuul that led to this17:36
fungithanks, i had started going over it but it's pretty involved and i hadn't managed to page it all in yet, so i figure others have a better handle on those parts of the codebase17:38
clarkbyup it's the sort of change where I wish that we could have "review maps": a set of directions for logically moving through the change in a way that leads to the most understanding17:39
fungiyou are here17:40
clarkbfor this change you really want to have _postConfig() and PipelineState.create() adjacent to each other I think17:40
clarkband then fan out from there to lookup things like _internalCreate()17:40
corvusand also have read the commit message for the change that broke it too17:42
corvusi like that idea though.  i wrote a novel of a commit message to try to help me and others understand this -- maybe i could have written the "review map" in the commit message.17:43
clarkbcorvus: the commit message helped a bunch. But ya there is definitely the mechanical side of figuring out where the stuff discussed in the commit message maps into the code17:44
corvusthings are moving again17:44
clarkbit would be neat if gerrit supported some tool where people pushing code could somehow spatially organize things or give them an order17:44
corvuscat jobs are being submitted17:44
clarkbof course it may turn out that the order I find works in my brain is different than yours and this isn't something that can be expressed from one person to another. Would be a really neat avenue for research i bet17:45
corvusso we're probably 20 minutes out from the config updating.  and hopefully it sets the ltime.17:45
fungiagreed, i started with reading the commit message obviously, but was trying to match up the code changes with my evolving understanding of what was written in the message17:45
corvusthe deploy pipeline is being processed now17:59
corvusi've seen a few errors related to as well.  i don't think they are critical failures, but just fyi for anyone else scanning the logs.  that change has merged so we'll pick it up in the weekend restart too.18:07
clarkbit just occurred to me that the LE refreshes in the daily pipeline. So we may have other pipelines with this problem18:12
clarkbI'm wondering if we should wait for the regular weekly zuul update to fix the issue then run through this sort of thing again?18:13
corvusthe exceptions are pretty frequent and verbose, so i think if we're not seeing them we're okay18:13
corvusi do see the projectcontext serialize exception happening for 863252 in the gate.  i think we should dequeue that change.18:14
corvus#status log manually dequeued 863252 due to zuul serialization error18:15
opendevstatuscorvus: finished logging18:15
clarkbthat makes sense18:16
corvusokay, i'm not seeing any more errors, i think we can call this done18:18
clarkbfungi: infra-prod-base is failing which I think prevents the LE jobs from running18:18
opendevreviewClark Boylan proposed opendev/system-config master: gerrit: increase ssh channel debugging
clarkbianw: ^ enjoy your weekend, but I'm testing if dropping the classname works (it will make the entire sshd more verbose, but maybe we'll learn something)18:24
fungiclarkb: that makes sense, and is probably a better explanation for the pending ssl cert expiration alerts18:42
fungiclarkb: apparently it's a package download error on mirror01.regionone.osuosl18:46
fungii'll see if i can correct it and get the jobs unstuck18:46
fungihopefully it's not a persistent error with one of ubuntu's arm64 packages/indices18:48
fungiThe following packages will be upgraded: openafs-modules-source18:49
fungiahh, this was an index file hash mismatch, not a package specifically18:50
fungiFailed to update apt cache: E:Failed to fetch  File has unexpected size (46604 != 45960). Mirror sync in progress? [IP: 80]18:50
fungithat was encountered during TASK [base/server : Ensure required build packages for non-wheel architectures]18:51
fungii forced an index update (had to try a couple of times because the first run reported a timeout), so hopefully it's all set now18:51
fungii guess we just need to merge something that will trigger another deploy18:52
fungii'm going to self-approve "Better diag for Gerrit server connection limit" and keep an eye on it as it rolls out. shouldn't cause any disruption but if it does... meh, friday18:53
fungihourly just started up, but it doesn't run the infra-prod-base job19:02
rosmaitawhen someone has a minute ... this one line change to a .zuul.yaml is giving an unknown configuration error, but I can't see what it is:
clarkbrosmaita: its hitting a zuul bug that should fix19:39
clarkbthat change is merged and will be deployed later today/tomorrow19:39
rosmaitaclarkb: thanks for the quick response!  have a good weekend19:40
fungi"this will be fixed over the weekend" is an excellent reason to call it a week19:40
clarkbb63f601787284072a0d015ef4ddb7c74 is the internal zuul event if any other admins want to double check the scheduler logs19:40
clarkbbut it's complaining about the lack of the serialize method on that object which that change should fix19:41
opendevreviewMerged opendev/system-config master: Better diag for Gerrit server connection limit
fungiinfra-prod-base is running now19:56
clarkbfungi: I guess we should double check the gerrit firewall rules afterwards and that the tools do what you expect them to?20:02
fungifailed again :/20:02
clarkbI see the log rule in place at least20:03
clarkbit likely did not fail on review0220:03
fungii don't immediately find the failure in /var/log/ansible/base.yaml.log this time, but ansible exited nonzero20:03
fungithe log doesn't mention any failed tasks20:04 is unreachable20:04
fungithat's different from failed. would explain it20:05
clarkbya the summary separates failed and unreachable things20:05
clarkbyou have to look for both20:05
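[Editor's note] The point about checking both counters can be sketched against Ansible's PLAY RECAP block, where each host line carries separate `failed=` and `unreachable=` counts. The hostnames and numbers below are invented for illustration; they are not taken from base.yaml.log.

```python
import re

# One PLAY RECAP host line: "<host> : ok=N changed=N unreachable=N failed=N ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")

def problem_hosts(recap_text):
    """Map each host to its non-zero failed/unreachable counters."""
    problems = {}
    for line in recap_text.splitlines():
        match = RECAP_LINE.match(line.strip())
        if not match:
            continue
        counters = {
            key: int(value)
            for key, value in (pair.split("=") for pair in match.group("counters").split())
        }
        bad = {
            key: counters[key]
            for key in ("failed", "unreachable")
            if counters.get(key, 0) > 0
        }
        if bad:
            problems[match.group("host")] = bad
    return problems

# Hypothetical recap text with made-up hosts.
SAMPLE = """\
PLAY RECAP *********************************************************************
good.example.org    : ok=42 changed=1 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
mirror.example.org  : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
broken.example.org  : ok=10 changed=0 unreachable=0 failed=2 skipped=1 rescued=0 ignored=0
"""
result = problem_hosts(SAMPLE)
print(result)
```

Searching the log only for failed tasks would miss a host like the second one, which is how a run can exit nonzero without any task being reported as failed.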
fungiwhen i ssh'd into it earlier, took a minute or more to let me in. then the next time i tried it was immediate20:05
fungimaybe network issues there?20:05
clarkblimestone or osuosl?20:06
clarkbthe previous thing was on osuosl iirc. This is limestone20:06
fungioh! limestone20:06
fungii didn't even realize that was still up20:06
fungiwe haven't been running jobs there for ages, have we?20:06
clarkbya I don't think so, but we've kept the mirror going20:06
fungiprobably time to take it out of our inventory and turn it off (unless someone has already beaten us to the second step)20:07
fungii'll get a cleanup patch up now20:07
clarkbya the other option would be attempting to resurrect it but not sure that's viable right now?20:07
fungii see we've still got references to packethost kicking around as well20:10
opendevreviewJeremy Stanley proposed opendev/system-config master: Farewell limestone
opendevreviewJeremy Stanley proposed opendev/system-config master: Finish cleaning up packethost references
fungiclarkb: should we clean up configuration relevant to internap/inap/iweb too?20:18
fungii can push that up while i'm thinking about it20:18
Clark[m]Sorry switched to lunchmode. I think half of that got done but kept bridge things for cleanup? But ya can probably go now too20:19
fungii'll send it up for review, don't have to decide now. the limestone one is the most pressing20:19
opendevreviewJeremy Stanley proposed opendev/system-config master: Final cleanup of internap/inap/iweb references
fungiprobably should also clean up any lingering bits of the old linaro environment soon too, if there are any left20:29
*** dasm is now known as dasm|off22:38
