clarkb | ianw: seems like that occurs early enough in bootstrapping that it should work well | 00:27 |
ianw | #status log all production hosts updated to docker 23 | 00:44 |
opendevstatus | ianw: finished logging | 00:44 |
ianw | https://zuul.opendev.org/t/openstack/build/3c28d10bd0954321b9bc0d6f5ee09c28/log/review99.opendev.org/logs/sshd_log#3 | 00:48 |
ianw | shows the flag being turned up | 00:48 |
clarkb | ianw: does that map to your logging changes? I don't see channelOpen or channelClosed etc | 01:00 |
ianw | i don't think "debug" is the right level | 01:00 |
ianw | perhaps TRACE is | 01:00 |
ianw | it logs atFine() | 01:01 |
ianw | but | 01:01 |
ianw | fatal: "FINE" is not a valid value for "LEVEL" | 01:01 |
clarkb | "FINE is a message level providing tracing information. | 01:02 |
clarkb | trace seems right given ^ | 01:02 |
clarkb | that's likely to be extremely chatty in prod though | 01:03 |
clarkb | we might want to up the priority of these log lines to avoid a flood? | 01:03 |
ianw | it may be a couple of extra lines, but i feel like that's probably ok. when i enable it, i'll watch closely | 01:05 |
ianw | gerrit2 has 697G free atm, so not too worried about overflowing that :) | 01:05 |
clarkb | well mostly thinking the rest of gerrit might have extensive logging atFine but I don't know for sure | 01:05 |
clarkb | heh ok :) | 01:05 |
ianw | oh you only turn up that particular logger | 01:06 |
clarkb | oh I see that now | 01:06 |
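(As an aside, a minimal sketch of what scoping the override to one logger looks like: with Gerrit's usual log4j backend, flogger's atFine()/atFinest() records typically only appear once the corresponding logger is raised to DEBUG/TRACE, and the override can name a single logger so the rest of the server stays at its default level. The logger name below is a placeholder, not the one used in the actual change.)

```
# hypothetical log4j-style override; the logger name is an assumption
log4j.logger.org.apache.sshd.common.channel=TRACE
```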
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit: increase ssh channel debugging https://review.opendev.org/c/opendev/system-config/+/873214 | 01:06 |
ianw | ^ hopefully that shows the messages, that would be great to confirm it working | 01:06 |
ianw | (and better testing than it will get upstream too :) | 01:07 |
clarkb | ya it's really cool that we can do that sort of thing. We can also use Depends-On, which is easier than the patch method but forces upstream to merge first | 01:11 |
ianw | doh i didn't even think of depends-on! | 01:13 |
clarkb | in this case if you want to move ahead of upstream then Depends-On won't work, but otherwise it's a really great way to have our CI prove things for upstream changes | 01:16 |
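(For reference, the Depends-On mechanism mentioned above is just a footer in the commit message pointing at the upstream change; the URL here is a placeholder.)

```
Depends-On: https://review.opendev.org/c/zuul/zuul/+/999999
```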
ianw | hrm, i still don't see any message -> https://811ccd123619f6b4f552-72986f9d77858f09d25729af74ad2ea1.ssl.cf5.rackcdn.com/873214/4/check/system-config-run-review-3.6/649d5e2/review99.opendev.org/logs/sshd_log | 02:15 |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit: increase ssh channel debugging https://review.opendev.org/c/opendev/system-config/+/873214 | 02:29 |
ianw | ^ i was using atFine() but that changes it to atFinest() ... maybe that makes a difference? | 02:30 |
ianw | it seems that there are differences between flogger and log4j and i wonder if there's some disconnect | 02:30 |
ianw | hrm, i wonder if we need to restart gerrit or something? | 02:53 |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit: increase ssh channel debugging https://review.opendev.org/c/opendev/system-config/+/873214 | 03:00 |
*** yadnesh|away is now known as yadnesh | 04:28 | |
ianw | i don't quite know how to make it work :/ ... I've sent a message https://groups.google.com/g/repo-discuss/c/DdrCnGJC16g | 04:49 |
*** chandankumar is now known as chkumar|rover | 05:07 | |
*** Tengu9 is now known as Tengu | 08:21 | |
*** jpena|off is now known as jpena | 08:29 | |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects https://review.opendev.org/c/openstack/project-config/+/872406 | 10:50 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects https://review.opendev.org/c/openstack/project-config/+/872406 | 11:19 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects https://review.opendev.org/c/openstack/project-config/+/872406 | 12:00 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add charms-stable-maint group for charms projects https://review.opendev.org/c/openstack/project-config/+/872406 | 13:12 |
*** dasm|off is now known as dasm | 13:49 | |
*** yadnesh is now known as yadnesh|away | 15:06 | |
fungi | clarkb: expanding on my comment in the zuul channel, that's also why we got notified about a bunch of expiring ssl certs. nothing is getting enqueued into deploy | 15:10 |
Clark[m] | Neat | 15:11 |
fungi | but additionally, all openstack releases which got approved this week have not tagged or published anything | 15:12 |
corvus | fungi: do you have a list of tenants+pipelines throwing the error? | 15:22 |
fungi | corvus: well, like i mentioned in matrix, there may be more than one cause so it's hard to disentangle. the vast majority (by orders of magnitude) appear for the release-post and deploy pipelines in the openstack tenant | 15:26 |
fungi | looks like it might be approximately 2023-02-06 17:15:19,025 utc in the logs where they start showing up | 15:27 |
corvus | fungi: yeah, i think we just want the current high-frequency ones. basically if it only happens occasionally, it's probably harmless, but if it happens every time we go through the processing loop, that's a problem. it is possible to tell by the traceback, but it's tedious. easier to just say we're interested in the ones with >1000/day. | 15:28 |
fungi | it's possible that correlates to activity too | 15:28 |
corvus | fungi: so the high-frequency ones are release-post and deploy in openstack? | 15:28 |
fungi | correct | 15:28 |
corvus | fungi: to recover, we should delete the pipeline state from zk for those pipelines. there is a command for that; i will run it and echo it here. | 15:29 |
fungi | thanks! | 15:29 |
fungi | i guess that's safe since there's nothing successfully enqueuing into them anyway | 15:30 |
fungi | so nothing to lose | 15:30 |
corvus | yep. if there were, we'd want to grab the queue for later re-enqueue | 15:30 |
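(The recovery being described amounts to roughly the following, run wherever the zuul-admin CLI has access to the scheduler config; both commands show up later in this log.)

```shell
# drop the corrupted pipeline state from ZooKeeper; safe only because nothing
# is successfully enqueued -- otherwise capture the queues first for re-enqueue
zuul-admin delete-pipeline-state openstack release-post
zuul-admin delete-pipeline-state openstack deploy
```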
fungi | that does leave me wondering what transpired to corrupt things across multiple pipelines in there though | 15:31 |
Clark[m] | I've got a school run nowish but will review the suspected fix as soon as that is done | 15:31 |
corvus | fungi: it's described in the commit message of https://review.opendev.org/872482 | 15:31 |
Clark[m] | If we can land it by this afternoon auto upgrades will deploy it | 15:31 |
fungi | corvus: oh, got it. so the race you were talking about then | 15:32 |
corvus | yep. triggered by a reconfiguration. | 15:32 |
corvus | i ran zuul-admin delete-pipeline-state openstack release-post -- it's reloading the openstack tenant from scratch now.... which is not what i was expecting but let's see where this goes. | 15:33 |
fungi | and looking at the logs more closely, the corruption in release-post may have started about a day earlier than deploy, so i guess they didn't happen at the same time and we might expect more pipelines to be impacted until the restart | 15:33 |
corvus | that is plausible | 15:33 |
dasm | o/ are you aware of > Something went wrong. on https://zuul.openstack.org/status page? | 15:39 |
corvus | dasm: yep, repair in progress | 15:40 |
dasm | ack, thx | 15:42 |
corvus | assuming this does work, there is a chance we'll need to go through it again after the second pipeline is removed. | 15:46 |
corvus | okay reconfig is complete | 15:53 |
fungi | was that for the first or second pipeline cleanup? | 15:56 |
corvus | fungi: the first... i'm going to look into whether we need to do anything for the second or if it got cleaned up automatically.... | 15:57 |
fungi | oh, cool. thanks! | 15:57 |
corvus | there appear to be a large number of release-post jobs running now | 15:58 |
fungi | oh excellent! i guess the scheduler was saving them up? | 15:59 |
corvus | so if there really was nothing already in those 2 pipelines there may actually be no data loss. | 15:59 |
corvus | yep | 15:59 |
fungi | that's a rather neat side effect | 15:59 |
corvus | the deploy pipeline also has 1130 events queued up for it. | 15:59 |
fungi | oof | 15:59 |
corvus | we are still seeing the errors for the deploy pipeline, so we will need to repeat the process. | 15:59 |
fungi | hopefully most of those are just waiting to get filtered out | 16:00 |
corvus | yeah, should go pretty quick | 16:01 |
corvus | there are now 10 items in the release pipeline | 16:02 |
corvus | i think we're ready to proceed with the deploy pipeline delete. given what we saw earlier, i think it's safe to proceed with that now. but if you'd prefer, we could wait until the release jobs settle down. | 16:04 |
corvus | my preference would be to proceed -- all of these things are pretty slow, so better to do them at the same time | 16:04 |
corvus | #status log deleted pipeline state for openstack/release-post and openstack/deploy due to data corruption in zk | 16:07 |
opendevstatus | corvus: finished logging | 16:08 |
corvus | done. it's releading config again. | 16:08 |
corvus | reloading | 16:08 |
clarkb | wfm. I'm just sitting down now and will try to look at that change momentarily | 16:08 |
corvus | this will suspend processing for another 20-30 minutes while it reconfigures again | 16:09 |
fungi | yep, sorry stepped away for a sec to grab a snack, nut sounds good | 16:10 |
fungi | er, but sounds good | 16:11 |
fungi | (snack was popcorn, not nuts) | 16:11 |
corvus | why not both? have some cracker jacks! | 16:11 |
fungi | touché | 16:12 |
corvus | it's back | 16:29 |
fungi | thanks!!! | 16:29 |
corvus | i think we may need a bit more repair, as i think the schedulers may think they're out of date. i'm going to try a tenant reconfiguration event to clear it. | 16:39 |
corvus | 2023-02-10 16:38:32,864 DEBUG zuul.Scheduler: [e: 5655ab38bd674308a46db2e9323d1935] Trigger event minimum reconfigure ltime of 1666484481791 newer than current reconfigure ltime of -1, aborting early | 16:39 |
corvus | that's the error | 16:39 |
fungi | out of date as in they think the cache is stale? or what? | 16:40 |
fungi | they think they have a stale view of the config? | 16:40 |
corvus | yeah the second thing | 16:40 |
fungi | got it | 16:41 |
corvus | i think it probably happened because of the unusual way in which they were prompted to get their current config | 16:41 |
corvus | i've issued zuul-scheduler tenant-reconfigure openstack | 16:42 |
fungi | thanks | 16:42 |
corvus | i'm not sure how long this will take to complete (up to 30m but could be faster depending on if it decides to use any caches). at least the web ui should continue to function this time though. | 16:43 |
corvus | 2023-02-10 16:44:41,991 WARNING zuul.GithubRateLimitHandler: API rate limit reached, need to wait for 3573 seconds | 16:48 |
corvus | i believe that means the openstack tenant is suspending processing for one hour waiting to go under the github api limit. | 16:49 |
fungi | oh, right, too many rescans of gh branches i guess | 16:50 |
corvus | next expected action at 17:44 utc | 16:54 |
*** jpena is now known as jpena|off | 17:05 | |
clarkb | I've approved the fix for the race condition in zuul that led to this | 17:36 |
fungi | thanks, i had started going over it but it's pretty involved and i hadn't managed to page it all in yet, so i figure others have a better handle on those parts of the codebase | 17:38 |
clarkb | yup it's the sort of change where I wish that we could have "review maps": a set of directions for logically moving through the change in a way that leads to the most understanding | 17:39 |
fungi | you are here | 17:40 |
clarkb | for this change you really want to have _postConfig() and PipelineState.create() adjacent to each other I think | 17:40 |
clarkb | and then fan out from there to lookup things like _internalCreate() | 17:40 |
corvus | and also have read the commit message for the change that broke it too | 17:42 |
corvus | i like that idea though. i wrote a novel of a commit message to try to help me and others understand this -- maybe i could have written the "review map" in the commit message. | 17:43 |
clarkb | corvus: the commit message helped a bunch. But ya there is definitely the mechanical side of figuring out where the stuff discussed in the commit message maps into the code | 17:44 |
corvus | things are moving again | 17:44 |
clarkb | it would be neat if gerrit supported some tool where people pushing code could somehow spatially organize things or give them an order | 17:44 |
corvus | cat jobs are being submitted | 17:44 |
clarkb | of course it may turn out that the order I find works in my brain is different than yours and this isn't something that can be expressed from one person to another. Would be a really neat avenue for research i bet | 17:45 |
corvus | so we're probably 20 minutes out from the config updating. and hopefully it sets the ltime. | 17:45 |
fungi | agreed, i started with reading the commit message obviously, but was trying to match up the code changes with my evolving understanding of what was written in the message | 17:45 |
corvus | the deploy pipeline is being processed now | 17:59 |
corvus | i've seen a few errors related to https://review.opendev.org/872519 as well. i don't think they are critical failures, but just fyi for anyone else scanning the logs. that change has merged so we'll pick it up in the weekend restart too. | 18:07 |
clarkb | it just occurred to me that the LE refreshes happen in the daily pipeline. So we may have other pipelines with this problem | 18:12 |
clarkb | I'm wondering if we should wait for the regular weekly zuul update to fix the issue then run through this sort of thing again? | 18:13 |
corvus | the exceptions are pretty frequent and verbose, so i think if we're not seeing them we're okay | 18:13 |
corvus | i do see the projectcontext serialize exception happening for 863252 in the gate. i think we should dequeue that change. | 18:14 |
corvus | #status log manually dequeued 863252 due to zuul serialization error | 18:15 |
opendevstatus | corvus: finished logging | 18:15 |
clarkb | that makes sense | 18:16 |
corvus | okay, i'm not seeing any more errors, i think we can call this done | 18:18 |
clarkb | fungi: infra-prod-base is failing which I think prevents the LE jobs from running | 18:18 |
opendevreview | Clark Boylan proposed opendev/system-config master: gerrit: increase ssh channel debugging https://review.opendev.org/c/opendev/system-config/+/873214 | 18:24 |
clarkb | ianw: ^ enjoy your weekend, but I'm testing if dropping the classname works (it will make the entire sshd more verbose, but maybe we'll learn something) | 18:24 |
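(Sketching what "dropping the classname" means here: the same kind of override, but scoped to the whole sshd package rather than one class, so everything under it becomes more verbose. Again the logger name is illustrative only.)

```
# package-level override instead of a single class; name is an assumption
log4j.logger.org.apache.sshd=TRACE
```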
fungi | clarkb: that makes sense, and is probably a better explanation for the pending ssl cert expiration alerts | 18:42 |
fungi | clarkb: apparently it's a package download error on mirror01.regionone.osuosl | 18:46 |
fungi | i'll see if i can correct it and get the jobs unstuck | 18:46 |
clarkb | thanks! | 18:47 |
fungi | hopefully it's not a persistent error with one of ubuntu's arm64 packages/indices | 18:48 |
fungi | The following packages will be upgraded: openafs-modules-source | 18:49 |
fungi | ahh, this was an index file hash mismatch, not a package specifically | 18:50 |
fungi | Failed to update apt cache: E:Failed to fetch http://ddebs.ubuntu.com/dists/focal-proposed/main/binary-arm64/Packages.xz File has unexpected size (46604 != 45960). Mirror sync in progress? [IP: 185.125.190.18 80] | 18:50 |
fungi | that was encountered during TASK [base/server : Ensure required build packages for non-wheel architectures] | 18:51 |
fungi | i forced an index update (had to try a couple of times because the first run reported a timeout), so hopefully it's all set now | 18:51 |
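(The fix here is essentially just re-running the index refresh on the mirror until the partially-synced index clears; the exact invocation is assumed.)

```shell
# on mirror01.regionone.osuosl: retry the apt index refresh; a hash mismatch
# during an upstream mirror sync usually clears on a later attempt
sudo apt-get update
```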
fungi | i guess we just need to merge something that will trigger another deploy | 18:52 |
fungi | i'm going to self-approve https://review.opendev.org/872989 "Better diag for Gerrit server connection limit" and keep an eye on it as it rolls out. shouldn't cause any disruption but if it does... meh, friday | 18:53 |
fungi | hourly just started up, but it doesn't run the infra-prod-base job | 19:02 |
rosmaita | when someone has a minute ... this one line change to a .zuul.yaml is giving an unknown configuration error, but I can't see what it is: https://review.opendev.org/c/openstack/cinder-tempest-plugin/+/873407 | 19:31 |
clarkb | rosmaita: its hitting a zuul bug that https://review.opendev.org/c/zuul/zuul/+/872519 should fix | 19:39 |
clarkb | that change is merged and will be deployed later today/tomorrow | 19:39 |
rosmaita | clarkb: thanks for the quick response! have a good weekend | 19:40 |
fungi | "this will be fixed over the weekend" is an excellent reason to call it a week | 19:40 |
rosmaita | :) | 19:40 |
clarkb | b63f601787284072a0d015ef4ddb7c74 is the internal zuul event if any other admins want to double check the scheduler logs | 19:40 |
clarkb | but it's complaining about the lack of the serialize method on that object which that change should fix | 19:41 |
opendevreview | Merged opendev/system-config master: Better diag for Gerrit server connection limit https://review.opendev.org/c/opendev/system-config/+/872989 | 19:48 |
fungi | infra-prod-base is running now | 19:56 |
clarkb | fungi: I guess we should double check the gerrit firewall rules afterwards and that the tools do what you expect them to? | 20:02 |
fungi | failed again :/ | 20:02 |
clarkb | I see the log rule in place at least | 20:03 |
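(One way to spot-check that a firewall LOG rule like the one from 872989 is in place might be something along these lines; the chain and rule details are assumptions, only the general idea is implied by the discussion.)

```shell
# list the filter table with counters and look for LOG targets
sudo iptables -v -n -L | grep -i 'LOG'
```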
clarkb | it likely did not fail on review02 | 20:03 |
fungi | i don't immediately find the failure in /var/log/ansible/base.yaml.log this time, but ansible exited nonzero | 20:03 |
fungi | the log doesn't mention any failed tasks | 20:04 |
clarkb | mirror01.regionone.limestone.opendev.org is unreachable | 20:04 |
fungi | that's different from failed. would explain it | 20:05 |
clarkb | ya the summary separates failed and unreachable things | 20:05 |
clarkb | you have to look for both | 20:05 |
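(Illustrating the distinction: in an Ansible PLAY RECAP, unreachable hosts are counted separately from failed ones, so looking only for failed tasks misses them. The values below are made up.)

```
mirror01.regionone.limestone.opendev.org : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
```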
fungi | when i ssh'd into it earlier, took a minute or more to let me in. then the next time i tried it was immediate | 20:05 |
fungi | maybe network issues there? | 20:05 |
clarkb | limestone or osuosl? | 20:06 |
clarkb | the previous thing was on osuosl iirc. This is limestone | 20:06 |
fungi | oh! limestone | 20:06 |
fungi | i didn't even realize that was still up | 20:06 |
fungi | we haven't been running jobs there for ages, have we? | 20:06 |
clarkb | ya I don't think so, but we've kept the mirror going | 20:06 |
fungi | probably time to take it out of our inventory and turn it off (unless someone has already beaten us to the second step) | 20:07 |
fungi | i'll get a cleanup patch up now | 20:07 |
clarkb | ya the other option would be attempting to resurrect it but not sure that's viable right now? | 20:07 |
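(A rough way to find what a cleanup change like the ones below needs to touch, assuming the usual system-config layout; the paths are assumptions.)

```shell
# find remaining references to the cloud/mirror being retired
git grep -il limestone -- inventory/ playbooks/ zuul.d/
```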
fungi | i see we've still got references to packethost kicking around as well | 20:10 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Farewell limestone https://review.opendev.org/c/opendev/system-config/+/873428 | 20:14 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Finish cleaning up packethost references https://review.opendev.org/c/opendev/system-config/+/873429 | 20:17 |
fungi | clarkb: should we clean up configuration relevant to internap/inap/iweb too? | 20:18 |
fungi | i can push that up while i'm thinking about it | 20:18 |
Clark[m] | Sorry, switched to lunch mode. I think half of that got done but we kept bridge things for cleanup? But ya can probably go now too | 20:19 |
fungi | i'll send it up for review, don't have to decide now. the limestone one is the most pressing | 20:19 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Final cleanup of internap/inap/iweb references https://review.opendev.org/c/opendev/system-config/+/873430 | 20:23 |
fungi | probably should also clean up any lingering bits of the old linaro environment soon too, if there are any left | 20:29 |
*** dasm is now known as dasm|off | 22:38 |