clarkb | ianw: seems like that occurs early enough in bootstrapping that it should work well | 00:27 |
ianw | #status log all production hosts updated to docker 23 | 00:44 |
opendevstatus | ianw: finished logging | 00:44 |
ianw | https://zuul.opendev.org/t/openstack/build/3c28d10bd0954321b9bc0d6f5ee09c28/log/review99.opendev.org/logs/sshd_log#3 | 00:48 |
ianw | shows the flag being turned up | 00:48 |
clarkb | ianw: does that map to your logging changes? I don't see channelOpen or channelClosed etc | 01:00 |
ianw | i don't think "debug" is the right level | 01:00 |
ianw | perhaps TRACE is | 01:00 |
ianw | it logs atFine() | 01:01 |
ianw | but | 01:01 |
ianw | fatal: "FINE" is not a valid value for "LEVEL" | 01:01 |
clarkb | "FINE is a message level providing tracing information. | 01:02 |
clarkb | trace seems right given ^ | 01:02 |
clarkb | that's likely to be extremely chatty in prod though | 01:03 |
clarkb | we might want to up the priority of these log lines to avoid a flood? | 01:03 |
ianw | it may be a couple of extra lines, but i feel like that's probably ok. when i enable it, i'll watch closely | 01:05 |
ianw | gerrit2 has 697G free atm, so not too worried about overflowing that :) | 01:05 |
clarkb | well mostly thinking the rest of gerrit might have extensive logging atFine but I don't know for sure | 01:05 |
clarkb | heh ok :) | 01:05 |
ianw | oh you only turn up that particular logger | 01:06 |
clarkb | oh I see that now | 01:06 |
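(As an aside, a minimal sketch of what scoping the override to one logger looks like: with Gerrit's usual log4j backend, flogger's atFine()/atFinest() records typically only appear once the corresponding logger is raised to DEBUG/TRACE, and the override can name a single logger so the rest of the server stays at its default level. The logger name below is a placeholder, not the one used in the actual change.)

```
# hypothetical log4j-style override; the logger name is an assumption
log4j.logger.org.apache.sshd.common.channel=TRACE
```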
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit: increase ssh channel debugging https://review.opendev.org/c/opendev/system-config/+/873214 | 01:06 |
ianw | ^ hopefully that shows the messages, that would be great to confirm it working | 01:06 |
ianw | (and better testing than it will get upstream too :) | 01:07 |
clarkb | ya it's really cool that we can do that sort of thing. We can also use Depends-On, which is easier than the patch method but forces upstream to merge first | 01:11 |
ianw | doh i didn't even think of depends-on! | 01:13 |
clarkb | in this case if you want to move ahead of upstream then Depends-On won't work, but otherwise it's a really great way to have our CI prove things for upstream changes | 01:16 |
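(For reference, the Depends-On mechanism mentioned above is just a footer in the commit message pointing at the upstream change; the URL here is a placeholder.)

```
Depends-On: https://review.opendev.org/c/zuul/zuul/+/999999
```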
ianw | hrm, i still don't see any message -> https://811ccd123619f6b4f552-72986f9d77858f09d25729af74ad2ea1.ssl.cf5.rackcdn.com/873214/4/check/system-config-run-review-3.6/649d5e2/review99.opendev.org/logs/sshd_log | 02:15 |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit: increase ssh channel debugging https://review.opendev.org/c/opendev/system-config/+/873214 | 02:29 |
ianw | ^ i was using atFine() but that changes it to atFinest() ... maybe that makes a difference? | 02:30 |
ianw | it seems that there are differences between flogger and log4j and i wonder if there's some disconnect | 02:30 |
ianw | hrm, i wonder if we need to restart gerrit or something? | 02:53 |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit: increase ssh channel debugging https://review.opendev.org/c/opendev/system-config/+/873214 | 03:00 |
*** yadnesh|away is now known as yadnesh | 04:28 | |
ianw | i don't quite know how to make it work :/ ... I've sent a message https://groups.google.com/g/repo-discuss/c/DdrCnGJC16g | 04:49 |
*** chandankumar is now known as chkumar|rover | 05:07 | |
*** Tengu9 is now known as Tengu | 08:21 | |
*** jpena|off is now known as jpena | 08:29 | |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects https://review.opendev.org/c/openstack/project-config/+/872406 | 10:50 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects https://review.opendev.org/c/openstack/project-config/+/872406 | 11:19 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects https://review.opendev.org/c/openstack/project-config/+/872406 | 12:00 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects https://review.opendev.org/c/openstack/project-config/+/872406 | 13:12 |
*** dasm|off is now known as dasm | 13:49 | |
*** yadnesh is now known as yadnesh|away | 15:06 | |
fungi | clarkb: expanding on my comment in the zuul channel, that's also why we got notified about a bunch of expiring ssl certs. nothing is getting enqueued into deploy | 15:10 |
Clark[m] | Neat | 15:11 |
fungi | but additionally, all openstack releases which got approved this week have not tagged or published anything | 15:12 |
corvus | fungi: do you have a list of tenants+pipelines throwing the error? | 15:22 |
fungi | corvus: well, like i mentioned in matrix, there may be more than one cause so it's hard to disentangle. the vast majority (by orders of magnitude) appear for the release-post and deploy pipelines in the openstack tenant | 15:26 |
fungi | looks like it might be approximately 2023-02-06 17:15:19,025 utc in the logs where they start showing up | 15:27 |
corvus | fungi: yeah, i think we just want the current high-frequency ones. basically if it only happens occasionally, it's probably harmless, but if it happens every time we go through the processing loop, that's a problem. it is possible to tell by the traceback, but it's tedious. easier to just say we're interested in the ones with >1000/day. | 15:28 |
fungi | it's possible that correlates to activity too | 15:28 |
corvus | fungi: so the high-frequency ones are release-post and deploy in openstack? | 15:28 |
fungi | correct | 15:28 |
corvus | fungi: to recover, we should delete the pipeline state from zk for those pipelines. there is a command for that; i will run it and echo it here. | 15:29 |
fungi | thanks! | 15:29 |
fungi | i guess that's safe since there's nothing successfully enqueuing into them anyway | 15:30 |
fungi | so nothing to lose | 15:30 |
corvus | yep. if there were, we'd want to grab the queue for later re-enqueue | 15:30 |
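(The recovery being described amounts to roughly the following, run wherever the zuul-admin CLI has access to the scheduler config; both commands show up later in this log.)

```shell
# drop the corrupted pipeline state from ZooKeeper; safe only because nothing
# is successfully enqueued -- otherwise capture the queues first for re-enqueue
zuul-admin delete-pipeline-state openstack release-post
zuul-admin delete-pipeline-state openstack deploy
```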
fungi | that does leave me wondering what transpired to corrupt things across multiple pipelines in there though | 15:31 |
Clark[m] | I've got a school run nowish but will review the suspected fix as soon as that is done | 15:31 |
corvus | fungi: it's described in the commit message of https://review.opendev.org/872482 | 15:31 |
Clark[m] | If we can land it by this afternoon auto upgrades will deploy it | 15:31 |
fungi | corvus: oh, got it. so the race you were talking about then | 15:32 |
corvus | yep. triggered by a reconfiguration. | 15:32 |
corvus | i ran zuul-admin delete-pipeline-state openstack release-post -- it's reloading the openstack tenant from scratch now.... which is not what i was expecting but let's see where this goes. | 15:33 |
fungi | and looking at the logs more closely, the corruption in release-post may have started about a day earlier than deploy, so i guess they didn't happen at the same time and we might expect more pipelines to be impacted until the restart | 15:33 |
corvus | that is plausible | 15:33 |
dasm | o/ are you aware of > Something went wrong. on https://zuul.openstack.org/status page? | 15:39 |
corvus | dasm: yep, repair in progress | 15:40 |
dasm | ack, thx | 15:42 |
corvus | assuming this does work, there is a chance we'll need to go through it again after the second pipeline is removed. | 15:46 |
corvus | okay reconfig is complete | 15:53 |
fungi | was that for the first or second pipeline cleanup? | 15:56 |
corvus | fungi: the first... i'm going to look into whether we need to do anything for the second or if it got cleaned up automatically.... | 15:57 |
fungi | oh, cool. thanks! | 15:57 |
corvus | there appear to be a large number of release-post jobs running now | 15:58 |
fungi | oh excellent! i guess the scheduler was saving them up? | 15:59 |
corvus | so if there really was nothing already in those 2 pipelines there may actually be no data loss. | 15:59 |
corvus | yep | 15:59 |
fungi | that's a rather neat side effect | 15:59 |
corvus | the deploy pipeline also has 1130 events queued up for it. | 15:59 |
fungi | oof | 15:59 |
corvus | we are still seeing the errors for the deploy pipeline, so we will need to repeat the process. | 15:59 |
fungi | hopefully most of those are just waiting to get filtered out | 16:00 |
corvus | yeah, should go pretty quick | 16:01 |
corvus | there are now 10 items in the release pipeline | 16:02 |
corvus | i think we're ready to proceed with the deploy pipeline delete. given what we saw earlier, i think it's safe to proceed with that now. but if you'd prefer, we could wait until the release jobs settle down. | 16:04 |
corvus | my preference would be to proceed -- all of these things are pretty slow, so better to do them at the same time | 16:04 |
corvus | #status log deleted pipeline state for openstack/release-post and openstack/deploy due to data corruption in zk | 16:07 |
opendevstatus | corvus: finished logging | 16:08 |
corvus | done. it's releading config again. | 16:08 |
corvus | reloading | 16:08 |
clarkb | wfm. I'm just sitting down now and will try to look at that change momentarily | 16:08 |
corvus | this will suspend processing for another 20-30 minutes while it reconfigures again | 16:09 |
fungi | yep, sorry stepped away for a sec to grab a snack, nut sounds good | 16:10 |
fungi | er, but sounds good | 16:11 |
fungi | (snack was popcorn, not nuts) | 16:11 |
corvus | why not both? have some cracker jacks! | 16:11 |
fungi | touché | 16:12 |
corvus | it's back | 16:29 |
fungi | thanks!!! | 16:29 |
corvus | i think we may need a bit more repair, as i think the schedulers may think they're out of date. i'm going to try a tenant reconfiguration event to clear it. | 16:39 |
corvus | 2023-02-10 16:38:32,864 DEBUG zuul.Scheduler: [e: 5655ab38bd674308a46db2e9323d1935] Trigger event minimum reconfigure ltime of 1666484481791 newer than current reconfigure ltime of -1, aborting early | 16:39 |
corvus | that's the error | 16:39 |
fungi | out of date as in they think the cache is stale? or what? | 16:40 |
fungi | they think they have a stale view of the config? | 16:40 |
corvus | yeah the second thing | 16:40 |
fungi | got it | 16:41 |
corvus | i think it probably happened because of the unusual way in which they were prompted to get their current config | 16:41 |
corvus | i've issued zuul-scheduler tenant-reconfigure openstack | 16:42 |
fungi | thanks | 16:42 |
corvus | i'm not sure how long this will take to complete (up to 30m but could be faster depending on if it decides to use any caches). at least the web ui should continue to function this time though. | 16:43 |
corvus | 2023-02-10 16:44:41,991 WARNING zuul.GithubRateLimitHandler: API rate limit reached, need to wait for 3573 seconds | 16:48 |
corvus | i believe that means the openstack tenant is suspending processing for one hour waiting to go under the github api limit. | 16:49 |
fungi | oh, right, too many rescans of gh branches i guess | 16:50 |
corvus | next expected action at 17:44 utc | 16:54 |
*** jpena is now known as jpena|off | 17:05 | |
clarkb | I've approved the fix for the race condition in zuul that led to this | 17:36 |
fungi | thanks, i had started going over it but it's pretty involved and i hadn't managed to page it all in yet, so i figure others have a better handle on those parts of the codebase | 17:38 |
clarkb | yup it's the sort of change where I wish that we could have "review maps": a set of directions for logically moving through the change in a way that leads to the most understanding | 17:39 |
fungi | you are here | 17:40 |
clarkb | for this change you really want to have _postConfig() and PipelineState.create() adjacent to each other I think | 17:40 |
clarkb | and then fan out from there to lookup things like _internalCreate() | 17:40 |
corvus | and also have read the commit message for the change that broke it too | 17:42 |
corvus | i like that idea though. i wrote a novel of a commit message to try to help me and others understand this -- maybe i could have written the "review map" in the commit message. | 17:43 |
clarkb | corvus: the commit message helped a bunch. But ya there is definitely the mechanical side of figuring out where the stuff discussed in the commit message maps into the code | 17:44 |
corvus | things are moving again | 17:44 |
clarkb | it would be neat if gerrit supported some tool where people pushing code could somehow spatially organize things or give them an order | 17:44 |
corvus | cat jobs are being submitted | 17:44 |
clarkb | of course it may turn out that the order I find works in my brain is different than yours and this isn't something that can be expressed from one person to another. Would be a really neat avenue for research i bet | 17:45 |
corvus | so we're probably 20 minutes out from the config updating. and hopefully it sets the ltime. | 17:45 |
fungi | agreed, i started with reading the commit message obviously, but was trying to match up the code changes with my evolving understanding of what was written in the message | 17:45 |
corvus | the deploy pipeline is being processed now | 17:59 |
corvus | i've seen a few errors related to https://review.opendev.org/872519 as well. i don't think they are critical failures, but just fyi for anyone else scanning the logs. that change has merged so we'll pick it up in the weekend restart too. | 18:07 |
clarkb | it just occurred to me that the LE refreshes happen in the daily pipeline. So we may have other pipelines with this problem | 18:12 |
clarkb | I'm wondering if we should wait for the regular weekly zuul update to fix the issue then run through this sort of thing again? | 18:13 |
corvus | the exceptions are pretty frequent and verbose, so i think if we're not seeing them we're okay | 18:13 |
corvus | i do see the projectcontext serialize exception happening for 863252 in the gate. i think we should dequeue that change. | 18:14 |
corvus | #status log manually dequeued 863252 due to zuul serialization error | 18:15 |
opendevstatus | corvus: finished logging | 18:15 |
clarkb | that makes sense | 18:16 |
corvus | okay, i'm not seeing any more errors, i think we can call this done | 18:18 |
clarkb | fungi: infra-prod-base is failing which I think prevents the LE jobs from running | 18:18 |
opendevreview | Clark Boylan proposed opendev/system-config master: gerrit: increase ssh channel debugging https://review.opendev.org/c/opendev/system-config/+/873214 | 18:24 |
clarkb | ianw: ^ enjoy your weekend, but I'm testing if dropping the classname works (it will make the entire sshd more verbose, but maybe we'll learn something) | 18:24 |
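(Sketching what "dropping the classname" means here: the same kind of override, but scoped to the whole sshd package rather than one class, so everything under it becomes more verbose. Again the logger name is illustrative only.)

```
# package-level override instead of a single class; name is an assumption
log4j.logger.org.apache.sshd=TRACE
```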
fungi | clarkb: that makes sense, and is probably a better explanation for the pending ssl cert expiration alerts | 18:42 |
fungi | clarkb: apparently it's a package download error on mirror01.regionone.osuosl | 18:46 |
fungi | i'll see if i can correct it and get the jobs unstuck | 18:46 |
clarkb | thanks! | 18:47 |
fungi | hopefully it's not a persistent error with one of ubuntu's arm64 packages/indices | 18:48 |
fungi | The following packages will be upgraded: openafs-modules-source | 18:49 |
fungi | ahh, this was an index file hash mismatch, not a package specifically | 18:50 |
fungi | Failed to update apt cache: E:Failed to fetch http://ddebs.ubuntu.com/dists/focal-proposed/main/binary-arm64/Packages.xz File has unexpected size (46604 != 45960). Mirror sync in progress? [IP: 185.125.190.18 80] | 18:50 |
fungi | that was encountered during TASK [base/server : Ensure required build packages for non-wheel architectures] | 18:51 |
fungi | i forced an index update (had to try a couple of times because the first run reported a timeout), so hopefully it's all set now | 18:51 |
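(The fix here is essentially just re-running the index refresh on the mirror until the partially-synced index clears; the exact invocation is assumed.)

```shell
# on mirror01.regionone.osuosl: retry the apt index refresh; a hash mismatch
# during an upstream mirror sync usually clears on a later attempt
sudo apt-get update
```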
fungi | i guess we just need to merge something that will trigger another deploy | 18:52 |
fungi | i'm going to self-approve https://review.opendev.org/872989 "Better diag for Gerrit server connection limit" and keep an eye on it as it rolls out. shouldn't cause any disruption but if it does... meh, friday | 18:53 |
fungi | hourly just started up, but it doesn't run the infra-prod-base job | 19:02 |
rosmaita | when someone has a minute ... this one line change to a .zuul.yaml is giving an unknown configuration error, but I can't see what it is: https://review.opendev.org/c/openstack/cinder-tempest-plugin/+/873407 | 19:31 |
clarkb | rosmaita: its hitting a zuul bug that https://review.opendev.org/c/zuul/zuul/+/872519 should fix | 19:39 |
clarkb | that change is merged and will be deployed later today/tomorrow | 19:39 |
rosmaita | clarkb: thanks for the quick response! have a good weekend | 19:40 |
fungi | "this will be fixed over the weekend" is an excellent reason to call it a week | 19:40 |
rosmaita | :) | 19:40 |
clarkb | b63f601787284072a0d015ef4ddb7c74 is the internal zuul event if any other admins want to double check the scheduler logs | 19:40 |
clarkb | but it's complaining about the lack of the serialize method on that object which that change should fix | 19:41 |
opendevreview | Merged opendev/system-config master: Better diag for Gerrit server connection limit https://review.opendev.org/c/opendev/system-config/+/872989 | 19:48 |
fungi | infra-prod-base is running now | 19:56 |
clarkb | fungi: I guess we should double check the gerrit firewall rules afterwards and that the tools do what you expect them to? | 20:02 |
fungi | failed again :/ | 20:02 |
clarkb | I see the log rule in place at least | 20:03 |
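(One way to spot-check that a firewall LOG rule like the one from 872989 is in place might be something along these lines; the chain and rule details are assumptions, only the general idea is implied by the discussion.)

```shell
# list the filter table with counters and look for LOG targets
sudo iptables -v -n -L | grep -i 'LOG'
```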
clarkb | it likely did not fail on review02 | 20:03 |
fungi | i don't immediately find the failure in /var/log/ansible/base.yaml.log this time, but ansible exited nonzero | 20:03 |
fungi | the log doesn't mention any failed tasks | 20:04 |
clarkb | mirror01.regionone.limestone.opendev.org is unreachable | 20:04 |
fungi | that's different from failed. would explain it | 20:05 |
clarkb | ya the summary separates failed and unreachable things | 20:05 |
clarkb | you have to look for both | 20:05 |
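(Illustrating the distinction: in an Ansible PLAY RECAP, unreachable hosts are counted separately from failed ones, so looking only for failed tasks misses them. The values below are made up.)

```
mirror01.regionone.limestone.opendev.org : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
```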
fungi | when i ssh'd into it earlier, took a minute or more to let me in. then the next time i tried it was immediate | 20:05 |
fungi | maybe network issues there? | 20:05 |
clarkb | limestone or osuosl? | 20:06 |
clarkb | the previous thing was on osuosl iirc. This is limestone | 20:06 |
fungi | oh! limestone | 20:06 |
fungi | i didn't even realize that was still up | 20:06 |
fungi | we haven't been running jobs there for ages, have we? | 20:06 |
clarkb | ya I don't think so, but we've kept the mirror going | 20:06 |
fungi | probably time to take it out of our inventory and turn it off (unless someone has already beaten us to the second step) | 20:07 |
fungi | i'll get a cleanup patch up now | 20:07 |
clarkb | ya the other option would be attempting to resurrect it but not sure that's viable right now? | 20:07 |
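(A rough way to find what a cleanup change like the ones below needs to touch, assuming the usual system-config layout; the paths are assumptions.)

```shell
# find remaining references to the cloud/mirror being retired
git grep -il limestone -- inventory/ playbooks/ zuul.d/
```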
fungi | i see we've still got references to packethost kicking around as well | 20:10 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Farewell limestone https://review.opendev.org/c/opendev/system-config/+/873428 | 20:14 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Finish cleaning up packethost references https://review.opendev.org/c/opendev/system-config/+/873429 | 20:17 |
fungi | clarkb: should we clean up configuration relevant to internap/inap/iweb too? | 20:18 |
fungi | i can push that up while i'm thinking about it | 20:18 |
Clark[m] | Sorry, switched to lunch mode. I think half of that got done but we kept bridge things for cleanup? But ya can probably go now too | 20:19 |
fungi | i'll send it up for review, don't have to decide now. the limestone one is the most pressing | 20:19 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Final cleanup of internap/inap/iweb references https://review.opendev.org/c/opendev/system-config/+/873430 | 20:23 |
fungi | probably should also clean up any lingering bits of the old linaro environment soon too, if there are any left | 20:29 |
*** dasm is now known as dasm|off | 22:38 |