clarkb | fungi: completely unrelated to ^ at 00:00 I always get exim panic log emails from lists. I assume those are actually old panics and we might be able to clear them out? But I'm not clued into the dark arts of email enough to know for sure | 00:05 |
---|---|---|
*** rlandy|ruck|biab is now known as rlandy|ruck | 00:06 | |
*** rlandy|ruck is now known as rlandy|out | 00:30 | |
fungi | clarkb: the one from lists.o.o looks like something was going on around 07:05:52-07:07:34 tuesday and again at 00:01:59 wednesday which caused contention for access to /var/spool/exim4/db/retry.lockfile, possibly just collisions between deliveries for different mailman sites? | 00:38 |
fungi | er, no would have to be between mailman processes i suppose | 00:39 |
fungi | maybe something else was locking it temporarily, but i can't imagine what | 00:39 |
fungi | s/mailman processes/exim processes/ i meant | 00:39 |
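For reference, a hedged sketch of the usual first checks for this kind of report on a Debian/Ubuntu exim4 host (standard paths assumed; the contended lockfile lives in the hints database directory fungi mentions):

```bash
sudo cat /var/log/exim4/paniclog                 # the daily cron mails root whenever this file is non-empty
sudo exim_dumpdb /var/spool/exim4 retry | head   # inspect the retry hints db guarded by retry.lockfile
sudo exim_tidydb /var/spool/exim4 retry          # prune stale retry entries if the db has grown
sudo truncate -s 0 /var/log/exim4/paniclog       # once understood, clear the paniclog so the nightly mail stops
```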
fungi | as for new debugging info in the zuul build inventories, is that the playbook context stuff? | 00:40 |
Clark[m] | Yup the new playbook context | 00:43 |
corvus | yeah, that's a bunch of info that was only in the executor logs earlier; should help advanced users figure out what zuul did in complex situations | 00:45 |
fungi | awesome, thanks! | 00:48 |
corvus | we should be good to do a rolling restart of schedulers+web whenever convenient to pick up the bugfix | 01:57 |
corvus | i'll start on that now | 02:38 |
corvus | zuul02 scheduler is restarting | 02:41 |
corvus | this time i'm just doing: docker-compose down; docker-compose up -d | 02:41 |
corvus | that seems to be working well so far | 02:41 |
corvus | 02 is done; restarting 01 now | 02:52 |
corvus | ah, this time zuul01 took too long to shut down and docker killed it; so i think we still need to tune that. | 02:54 |
corvus | i think that means i'm a chaos monkey and we just tested "kill a scheduler while it's in the middle of re-enqueuing all changes in a pipeline". that appears to have worked fine. | 02:58 |
ianw | haha i've been called worse | 03:01 |
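A minimal sketch of the shutdown-timeout tuning corvus refers to, assuming a standard docker-compose setup (the real zuul scheduler compose file may spell this differently):

```bash
# Give the scheduler more than the default 10s to shut down cleanly before docker kills it
docker-compose down --timeout 300
docker-compose up -d
# Or persistently, in docker-compose.yaml under the scheduler service:
#   stop_grace_period: 5m
```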
corvus | i'm going to restart the web server now; so expect status page outage | 03:03 |
corvus | looks like everything is up now | 03:12 |
opendevreview | Merged opendev/system-config master: Upgrade to gerrit 3.3.8 https://review.opendev.org/c/opendev/system-config/+/819733 | 03:14 |
ianw | i was going to sneak ^ in but you beat me to it :) | 03:16 |
corvus | oh sorry... | 03:21 |
corvus | it looks like there's a problem with the periodic-stable pipeline; it may be a result of my chaos-monkey action | 03:22 |
corvus | i'm going to see if i can manually correct it; otherwise we may need a full shutdown/start | 03:22 |
opendevreview | Merged openstack/diskimage-builder master: Fix BLS based bootloader installation https://review.opendev.org/c/openstack/diskimage-builder/+/818851 | 03:26 |
corvus | okay, i performed zk surgery to completely empty the periodic-stable pipeline and am now re-enqueuing it. i'll try to figure out what went wrong from the log files tomorrow | 03:30 |
corvus | there are a lot of failures in that pipeline now; i can't tell if they're legitimate, or if it has something to do with the 00000 commit sha they are all enqueued with | 03:34 |
corvus | i think it's too uncertain and we should just drop the queue | 03:35 |
corvus | which is unfortunate since we have no way to restore it | 03:35 |
corvus | i've done that now. | 03:36 |
corvus | status summary: everything is up and running, but we won't have periodic-stable results for today | 03:37 |
corvus | i'm out for the night | 03:44 |
ianw | thanks for looking after it! i'm sure i would have got it helplessly tangled up :) | 03:45 |
*** ysandeep|out is now known as ysandeep|ruck | 04:33 | |
*** pojadhav- is now known as pojadhav | 05:22 | |
*** ysandeep|ruck is now known as ysandeep|afk | 05:52 | |
*** ysandeep|afk is now known as ysandeep|ruck | 06:15 | |
*** raukadah is now known as chandankumar | 06:51 | |
*** ykarel__ is now known as ykarel | 07:08 | |
frickler | clarkb: fungi: I'm still trying to clean up exim paniclogs on other servers, but I didn't get mail from lists.o.o, likely because the aliases there were never updated. also ianw is missing from those aliases, not sure if that's intentional or not | 07:14 |
frickler | most of the locking errors seem to be happening at logrotate time, which with the focal upgrade seems to have moved from 06:25 to 00:00? | 07:17 |
frickler | I'll see whether one can tune the timeout | 07:18 |
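A hedged sketch of checking and tuning this: on focal, logrotate is driven by a systemd timer that fires at midnight rather than the old 06:25 cron.daily slot, so the schedule can be confirmed and staggered with a drop-in:

```bash
systemctl list-timers logrotate.timer   # confirm when it last fired / will next fire
systemctl cat logrotate.timer           # show the shipped OnCalendar=daily definition
sudo systemctl edit logrotate.timer     # add a drop-in, e.g.:
#   [Timer]
#   RandomizedDelaySec=30m              # stagger the midnight run to reduce lock contention
```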
*** ysandeep|ruck is now known as ysandeep|lunch | 07:23 | |
*** ysandeep|lunch is now known as ysandeep | 08:28 | |
*** ysandeep is now known as ysandeep|ruck | 08:28 | |
ianw | frickler: i didn't intentionally not update, i think just never got around to it! | 09:05 |
Unit193 | fungi: Well, not quite what we were hoping for, but at least https://launchpad.net/ubuntu/+source/pastebinit/1.5.1-1ubuntu1 is a start... | 10:05 |
ykarel | Is there some issue with zuul.openstack.org? it's not loading | 10:06 |
ykarel | https://zuul.opendev.org/t/openstack/status working though | 10:07 |
ykarel | inspecting returns: TypeError: "r is undefined" | 10:08 |
frickler | ianw: not sure if we were talking about the same thing. I meant to say that you are missing in the list of aliases to send root mail to on lists.o.o | 10:19 |
frickler | ykarel: I can confirm that, best use zuul.opendev.org for now. will need to wait for corvus to dig deeper I guess | 10:23 |
ykarel | frickler, ack and thanks for checking | 10:24 |
opendevreview | Arx Cruz proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 10:38 |
*** rlandy|out is now known as rlandy|ruck | 11:12 | |
opendevreview | Marios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream https://review.opendev.org/c/opendev/base-jobs/+/820018 | 11:33 |
marios | fungi: whenever you next have some review time please add to your queue ^^^ i updated to use bash instead of jinja per comment thanks for looking | 11:34 |
opendevreview | Marios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream https://review.opendev.org/c/opendev/base-jobs/+/820018 | 11:58 |
opendevreview | Marios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream https://review.opendev.org/c/opendev/base-jobs/+/820018 | 12:01 |
*** pojadhav is now known as pojadhav|afk | 12:01 | |
*** pojadhav|afk is now known as pojadhav | 12:54 | |
*** pojadhav is now known as pojadhav|afk | 13:48 | |
*** ysandeep|ruck is now known as ysandeep|afk | 14:14 | |
dtantsur | can confirm the zuul.o.o problem | 14:17 |
*** ysandeep|afk is now known as ysandeep | 14:20 | |
*** ysandeep is now known as ysandeep|afk | 14:27 | |
*** ysandeep|afk is now known as ysandeep | 15:12 | |
corvus | that should be fixed by https://review.opendev.org/820184 | 15:27 |
*** ysandeep is now known as ysandeep|out | 15:34 | |
clarkb | as far as we know the system-config deploy jobs are running again right? I'll plan to approve the matrix-gerritbot update after gerrit user summit if so | 15:41 |
fungi | i believe so, yes. i haven't approved the lists.openinfra.dev addition yet though | 16:08 |
fungi | want to wait until i'm less distracted by meetings | 16:08 |
*** chandankumar is now known as raukadah | 16:12 | |
*** tosky_ is now known as tosky | 16:17 | |
*** marios is now known as marios|out | 16:35 | |
*** priteau is now known as Guest7388 | 16:38 | |
*** priteau_ is now known as priteau | 16:38 | |
clarkb | making this note here so I don't forget. Gerrit 3.4 (or is it 3.5?) allows usernames to be case insensitive. Existing installations remain case sensitive by default. We should check in our 3.3 to 3.4 test jobs that we don't break usernames | 16:45 |
clarkb | we can create a zuul and a Zuul user or similar and then going forward we should catch problems automatically | 16:45 |
clarkb | except we may need to toggle the config explicitly to avoid the default on new installs being insensitive. Anyway the testing we've got should cover this well, just need to update the system a bit | 16:47 |
fungi | we could in theory check for collisions, but i expect there are many | 16:48 |
clarkb | yes I know we have collisions just from the user cleanups I've done for the conflicting external ids problem | 16:53 |
clarkb | when people end up with a second user they often make their username a variant of the original | 16:54 |
clarkb | often by changing case of a character or three | 16:54 |
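A sketch of the test clarkb describes, run against a held/test Gerrit rather than production; the host, credentials and the exact config option name are assumptions to verify against the Gerrit release notes:

```bash
# Create two accounts whose usernames differ only in case via the create-account REST endpoint
curl -s -u admin:secret -X PUT -H 'Content-Type: application/json' \
  -d '{"name": "Zuul (lowercase username)"}' https://gerrit.test.example/a/accounts/zuul
curl -s -u admin:secret -X PUT -H 'Content-Type: application/json' \
  -d '{"name": "Zuul (uppercase username)"}' https://gerrit.test.example/a/accounts/Zuul
# Pin the case-sensitivity behaviour explicitly so a new-install default can't mask a regression
# (option name from memory, may differ):
git config -f etc/gerrit.config auth.userNameCaseInsensitive false
```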
fungi | clarkb: any opinion on whether we should be using base-test to vet https://review.opendev.org/820018 before approving? | 17:26 |
clarkb | fungi: it's probably sufficient to run the script locally if you want to avoid that dance | 17:28 |
fungi | i've approved 818826 to create lists.openinfra.dev and will keep an eye on it | 17:28 |
clarkb | but I think we should test it since the mirror config affects a lot of jobs | 17:28 |
fungi | yes, i looked at it very closely in order to spot obvious syntax or logic issues which could have broader fallout, but i'm not confident in my skills as a shell parser | 17:28 |
clarkb | But also, that config is long since deprecated iirc | 17:28 |
fungi | yes | 17:29 |
clarkb | we might suggest that starting with centos stream people use the proper mirror configuration tooling | 17:29 |
clarkb | but I'm indifferent to that, as shell script vars are useful in various contexts | 17:29 |
fungi | that's not a bad idea, it would be starting with stream 9 specifically though | 17:30 |
fungi | stream 8 didn't need changes to the mirroring | 17:30 |
clarkb | ah | 17:30 |
clarkb | ya the -ge 9 | 17:30 |
fungi | centos changed up their mirror path for stream 9 | 17:30 |
clarkb | tl;dr if the script as proposed runs locally I think we can approve it | 17:33 |
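Roughly the kind of local smoke test being suggested; the variable names and paths below are stand-ins, not the actual contents of 820018:

```bash
NODEPOOL_MIRROR_HOST=mirror.example.opendev.org
for CENTOS_RELEASE in 8 9; do
    if [ "${CENTOS_RELEASE}" -ge 9 ]; then
        # stream 9 moved to a different mirror path
        NODEPOOL_CENTOS_MIRROR="http://${NODEPOOL_MIRROR_HOST}/centos-stream"
    else
        NODEPOOL_CENTOS_MIRROR="http://${NODEPOOL_MIRROR_HOST}/centos"
    fi
    echo "${CENTOS_RELEASE}: ${NODEPOOL_CENTOS_MIRROR}"   # eyeball both results before approving
done
```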
opendevreview | Merged opendev/system-config master: Create a new lists.openinfra.dev mailing list site https://review.opendev.org/c/opendev/system-config/+/818826 | 17:57 |
clarkb | one thing I notice is that the order of jobs isn't quite what I expected but that must be an artifact of actually writing down our dependencies :) | 18:29 |
fungi | our dependencies aren't quite what we expected | 18:31 |
fungi | fwiw, looks like the periodic puppet-else job ran again, but /var/lib/storyboard/www/js/templates.js on storyboard.o.o did not get updated | 18:34 |
clarkb | I think the source isn't updated the way we think it is | 18:35 |
clarkb | git log -1 in /home/zuul/src/opendev.org/opendev/system-config shows Merge "Cache Ansible Galaxy on CI mirror servers" | 18:36 |
clarkb | we should probably hold off on making updates otherwise we'll have a giant pile of them that all apply at once when we fix that | 18:36 |
clarkb | also before manage-projects runs do we need to stop ansible? | 18:36 |
clarkb | (I don't know if we've changed projects.yaml in the last few days) | 18:37 |
clarkb | https://zuul.opendev.org/t/openstack/build/d72175e06a8c4b5999b058a77e984755 is the build that should've updated the source and looking at the logs I think we did | 18:38 |
clarkb | but then later jobs must've reset it or something? | 18:38 |
clarkb | I'm confused, and can't really debug right now as I'm trying to pay attention to gerrit user summit | 18:38 |
fungi | yeah, should i disable ansible on bridge for now? | 18:39 |
clarkb | probably? | 18:40 |
clarkb | the problem with the ansible disable is that we retry every job 3 times :/ | 18:40 |
clarkb | but I haven't come up with a better idea than that other than adding everything to the emergency file, but that is problematic for other reasons. I think the ansible disable is probably warranted until we can understand this better | 18:40 |
clarkb | https://zuul.opendev.org/t/openstack/build/d72175e06a8c4b5999b058a77e984755/log/job-output.txt#225-229 is not what is reflected on the system | 18:41 |
fungi | #status log Temporarily disabled ansible deployment through bridge.o.o while we troubleshoot system-config state there | 18:41 |
opendevstatus | fungi: finished logging | 18:41 |
clarkb | it synced to a different host | 18:42 |
clarkb | https://zuul.opendev.org/t/openstack/build/d72175e06a8c4b5999b058a77e984755/log/job-output.txt#214 is not bridge | 18:42 |
clarkb | I think it was a single use test node? | 18:42 |
fungi | oho | 18:43 |
clarkb | so basically we're not updating system-config on bridge before running things. I think that we're likely ok except for potentially recreating an old project on gerrit if we had done renames, but we haven't done renames so should be fine | 18:43 |
clarkb | anyway back to gerrit user summit now that I've largely convinced myself we aren't breaking anything, just not updating the way we expected | 18:44 |
fungi | yeah, our deployments are basically just being deferred | 18:44 |
clarkb | ianw's day should be starting soon and may understand this | 18:44 |
clarkb | this is almost certainly a result of the switch to a single job to update system-config at the beginning of a buildset | 18:45 |
fungi | yeah, the zuul inventory for that build indicates there's an ubuntu-focal node | 18:45 |
clarkb | fungi: note that infra-prod-service-lists is running now (it must've started before you put the prevention in place, or we've broken the prevention in the CD refactor) but as mentioned previously I think this will just apply tuesday's state and we should be ok | 18:48 |
clarkb | (side note: the thing that tipped me off that we were updating a different host was that I checked the reflog on system-config and didn't see the refs shown in the job log) | 18:48 |
fungi | yeah | 18:51 |
fungi | 6bcf28b from 21:03:56 tuesday was the last update to ~zuul/src/opendev.org/opendev/system-config on bridge | 18:52 |
fungi | c663d9b from 00:50:45 wednesday was the next change which should have been updated there | 18:54 |
fungi | so the breakage started in that ~3.75hr timespan | 18:54 |
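For the record, the kind of check that confirmed the stale checkout (paths from the discussion above):

```bash
cd /home/zuul/src/opendev.org/opendev/system-config
git log -1 --format='%h %ci %s'   # last commit actually present on bridge
git reflog -5                     # recent ref updates; the refs from the job log never show up here
```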
clarkb | DISABLE-ANSIBLE is only evaluated in the setup src job | 18:55 |
clarkb | since we put the file in place after that job the other jobs are free to continue | 18:55 |
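Roughly what the DISABLE-ANSIBLE guard amounts to when a job does evaluate it (the flag path is from memory and may differ):

```bash
if [ -f /home/zuul/DISABLE-ANSIBLE ]; then
    echo "DISABLE-ANSIBLE flag present on bridge; refusing to run deploy playbooks" >&2
    exit 1
fi
```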
clarkb | fungi: it was almost certainly 9cccb02bb09671fc98e42b335e649589610b33cf/42df57b545d6f8dd314678174c281c249171c1d0 | 18:57 |
fungi | in theory 42df57b from 13:48:44 wednesday would have switched to running the correct job | 18:58 |
fungi | and that much it seems to have done | 18:58 |
fungi | but the job itself is not yet doing the right thing | 18:58 |
clarkb | well the key is we stopped updating system-config in the other jobs | 18:58 |
clarkb | and then started running a job that wasn't updating properly | 18:59 |
fungi | yep | 18:59 |
clarkb | We might get away with a simple revert for now. Then reevaluate from there | 18:59 |
clarkb | but might be good to see if ianw has an opinion first | 19:00 |
clarkb | Its still a bit early there though | 19:00 |
fungi | yeah, once he's around he may already have a clearer picture of what it was supposed to be doing vs what it's actually doing | 19:00 |
clarkb | opendev-infra-prod-base <- that job still seems to exist and the changes linked above switched us off of that. I think if we revert we'll go back to using this job and it should work? maybe? I hope? | 19:02 |
clarkb | heh | 19:02 |
clarkb | the hourly job runs are not running the source update job | 19:03 |
clarkb | so we've got another layer of problem: once we get things working, if we reenqueue stuff we'll apply updates and then hourly will undo them | 19:03 |
clarkb | I'm wondering if we shouldn't consider disabling ssh access since DISABLE-ANSIBLE is non functional | 19:03 |
clarkb | ya I think we need to revert for that reason either way | 19:04 |
clarkb | we can't safely roll forward without adding pipeline edits in addition to fixing the setup-src job | 19:05 |
fungi | so squash a revert of 9cccb02+42df57b i guess | 19:05 |
fungi | i can push that up | 19:05 |
clarkb | yes I think so. But I'm leaning towards lets disable ssh access, push the revert then wait for ianw to help untangle | 19:06 |
fungi | how do we globally disable ssh access to our servers? | 19:07 |
fungi | or do you mean just disable ssh access for zuul@bridge | 19:07 |
clarkb | fungi: you only need to disable it for zuul@bridge | 19:07 |
clarkb | move the authorized_keys file aside? | 19:07 |
fungi | we have a zuul-zone-zuul-ci.org-20200401 key and a zuul-opendev.org-20200401 authorized, i guess it's the latter? | 19:09 |
fungi | ahh, yeah the first is for dns i guess | 19:09 |
fungi | okay, i've commented out the zuul-opendev.org-20200401 key | 19:09 |
clarkb | ok I think I'm understanding what the setup-src job is doing that is wrong. Because it has a regular node (no nodes: []) we run the normal repo setup against the remote host | 19:12 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Revert "infra-prod: clone source once" https://review.opendev.org/c/opendev/system-config/+/820250 | 19:12 |
clarkb | Then our tasks that run against bridge.openstack.org are completely skipped because it isn't in the inventory | 19:12 |
fungi | 820250 is a squashing of reverts for commits 42df57b and 9cccb02 | 19:12 |
clarkb | 70827542adfaf5816fdf396e61c5d021b0fa3769 is a flawed change | 19:14 |
clarkb | the assertion in the commit message is only half true | 19:15 |
clarkb | fungi: we need to revert ^ as well | 19:15 |
clarkb | because the inventory add in setup-keys is what was allowing setup-src.yaml to find bridge and update the system-config repo | 19:16 |
fungi | okay | 19:16 |
clarkb | when we dropped the inventory add from setup-keys we dropped the ability to update system-config | 19:16 |
fungi | i can't find that commit | 19:17 |
clarkb | fungi: it is in opendev/base-jobs | 19:17 |
fungi | oh, got it | 19:17 |
clarkb | I think the order is revert 70827542adfaf5816fdf396e61c5d021b0fa3769 then do 820250 | 19:17 |
clarkb | if we do it in the other order we'll still be broken | 19:17 |
opendevreview | Jeremy Stanley proposed opendev/base-jobs master: Revert "infra-prod-setup-keys: drop inventory add" https://review.opendev.org/c/opendev/base-jobs/+/820251 | 19:18 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Revert "infra-prod: clone source once" https://review.opendev.org/c/opendev/system-config/+/820250 | 19:19 |
fungi | depends-on added | 19:19 |
clarkb | Once we're reverted I think the plan forward is to update the setup-src job to not run with nodes first, then update our pipeline config updates as before but ensure the src update job is in all the pipelines and that all the jobs hard depend on that setup src job. We want them to fail if setup src fails. | 19:20 |
clarkb | But maybe we get back to where each job is updating system-config today so that we can reenqueue stuff (we have to be careful doing this because reenqueuing to deploy will use the exact change state, which means if we reenqueue out of order or whatever we can have problems) | 19:21 |
clarkb | then pick up the break out again next week? | 19:21 |
fungi | wfm | 19:22 |
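A hedged sketch of the shape clarkb's plan implies, written as a heredoc just to show the YAML; the job names are illustrative, not the exact opendev definitions:

```bash
cat <<'EOF'
- project:
    deploy:
      jobs:
        - infra-prod-bootstrap-bridge       # empty nodeset; updates system-config on bridge once per buildset
        - infra-prod-service-lists:
            dependencies:
              - name: infra-prod-bootstrap-bridge
                soft: false                 # hard dependency: this job is skipped if the bootstrap fails
EOF
```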
clarkb | re reenqueuing stuff, a safer approach may be to let something update system-config (the hourly deploy jobs most likely) then manually run other playbooks that we want to pick up that stuff | 19:23 |
fungi | sure. the daily will also kick off in a few hours | 19:23 |
fungi | well, ~6.5 i think | 19:23 |
clarkb | the last thing we need to sort out is where DISABLE-ANSIBLE got broken. That might also need a (partial) revert | 19:24 |
clarkb | ok I think the existing revert to go back to the old base job will return DISABLE-ANSIBLE before | 19:27 |
clarkb | s/before/behavior | 19:27 |
clarkb | I've +2'd both changes and left notes about what I've found in my debugging. I guess we can wait another hour or so to see what ianw thinks? | 19:29 |
clarkb | in the meantime infra-root please do not approve any system-config changes | 19:29 |
clarkb | fungi: we should make a list of changes to system-config and project-config to audit and rerun as necessary once happy again | 19:30 |
clarkb | for system-config f29aa2da1688ab445d78d3c6596467bae9281f48 3c993c317b79640c2f86d91559f6d2b7ec83d17a 4285b4092839daea4bb7d2574f2a8923310d8278 33fc2a4d4e0628f1580893579c275f0095ce7eec | 19:31 |
clarkb | of those the lists update is probably the most scary one. I think the gerrit image update wouldn't have really affected prod since all we'd do is pull the image maybe | 19:31 |
clarkb | the haproxy changes might update haproxy in production. | 19:32 |
fungi | i've got to step away to cook dinner (christine has something pressing at 21:00) but i can take a look once we eat | 19:32 |
clarkb | for project-config 9d2f65a663df801beae4385368c86a21fca83c8e is the only one we need to check but I think it landed early enough to not be a problem | 19:33 |
fungi | i can probably scrape a list of changes reported in here by gerritbot as a cross-check | 19:33 |
clarkb | so really just the system-config commits above and of those only the lists one is concerning. I think once we think we're fixed we manually update system-config and manually run the gitea load balancer, lists and gerrit playbooks | 19:34 |
clarkb | Then we can fix ssh for zuul on bridge and see if gerrit does the right thing? I guess the fear there is it might revert our checkout somehow but I think the risk of that is low | 19:34 |
clarkb | ya I'm going to need lunch soon so this is probbaly all fine to pause a bit until ianw is awake and can review what we've found and decide if the plan is good | 19:36 |
*** artom__ is now known as artom | 19:36 | |
Clark[m] | I've switched to lunch mode but just realized that maybe landing the system-config revert will trigger all the things to run? And maybe that is better than trying to manually run stuff? If we choose to manually run stuff we should do that before approving the revert I guess | 19:47 |
fungi | yeah, might make the most sense to put ssh key back and enable ansible when approving the system-config revert? | 19:48 |
Clark[m] | Ya possibly | 19:51 |
fungi | for visibility, should the disable-ansible check be its own role even? easier to see when and where we include it in each job that way | 20:03 |
Clark[m] | ++ | 20:05 |
opendevreview | Jeremy Stanley proposed opendev/base-jobs master: Make the disable-ansible check into its own role https://review.opendev.org/c/opendev/base-jobs/+/820258 | 20:31 |
fungi | that's the role, we can switch to it where convenient i guess | 20:31 |
corvus | i'd like to rolling restart zuul scheduler and web... any thoughts on timing? | 20:46 |
corvus | i mean, should be non-disruptive, but also non-zero-risk | 20:47 |
clarkb | corvus: well we're hoping to untangle the system-config breakage when ianw's day starts. Might be good to get through that first just so that we're not debugging zuul and system-config? | 20:50 |
clarkb | I think we've got the two changes necessary to do that proposed above https://review.opendev.org/c/opendev/base-jobs/+/820251 https://review.opendev.org/c/opendev/system-config/+/820250 but was hoping ianw could weigh in as he was driving that work | 20:50 |
corvus | yep, can wait. | 20:51 |
clarkb | I'm not sure how long we should wait on the off chance that ianw isn't around today. The base-jobs change should be super straightforward to land. It is the system-config change that is a bit more intertwined, but from what I can see that change is safe too | 21:13 |
opendevreview | James E. Blair proposed opendev/system-config master: Add a keycloak server https://review.opendev.org/c/opendev/system-config/+/819923 | 21:14 |
corvus | i expect that to pass tests and ready for review now | 21:15 |
clarkb | that is unexpected, the hourly jobs are still managing to run somehow | 21:18 |
fungi | could they be authenticating with one of the other keys? | 21:19 |
clarkb | or a # doesn't do what we think it does in that file? | 21:20 |
clarkb | oh yup its the wrong key | 21:20 |
clarkb | the system-config jobs use the system-config key | 21:20 |
clarkb | the key you commented out is for the opendev.org zone I think | 21:20 |
fungi | oh, is that entry misnamed? | 21:21 |
clarkb | no it isn't misnamed, we just misinterpreted what it meant | 21:21 |
fungi | the zuul-ci.org one has a comment of zuul-zone-zuul-ci.org-20200401 | 21:21 |
fungi | the zuul-opendev.org-20200401 doesn't say "zone" in it | 21:22 |
clarkb | system-config/inventory/base/group_vars/all.yaml sets the value. I think it was just recorded that way | 21:22 |
fungi | i guess we should have called it zuul-zone-opendev.org-20200401 for consistency | 21:22 |
clarkb | yes. But also maybe we should move the file aside as we don't really want anything running until we're happy with the fixups? | 21:22 |
fungi | done, moved it temporarily to ~zuul/.ssh/disabled.authorized_keys | 21:23 |
clarkb | in the meantime should we go ahead and approve the base-jobs revert? | 21:24 |
clarkb | I'm going to rereview the system-config revert now with some fresh eyes to make sure we aren't missing anything | 21:24 |
fungi | yeah, i can approve the one for base-jobs | 21:24 |
clarkb | https://review.opendev.org/c/opendev/base-jobs/+/807807 was the last change to opendev-infra-prod-base. Which means we ran with that in place for about a week and it seemed to be working. The system-config revert switches us back to using that job | 21:26 |
clarkb | now to double check the contents of that job for changes | 21:27 |
clarkb | the two changes to the playbooks that job runs are the one to remove the inventory entry, which we are reverting, and another that renames a playbook, which I think is fine because it appears to have been just a 1:1 file name change for consistency with job names | 21:29 |
clarkb | and ya the git log for the rename shows no delta in the file itself | 21:29 |
clarkb | so ya I think the system-config revert is also safe. | 21:29 |
clarkb | fungi: once base-jobs lands should we approve the system-config revert and plan to move ssh authorized_keys back and also remove DISABLE-ANSIBLE? | 21:30 |
clarkb | then figure out if we need to run any playbooks by hand after it runs its jobs? | 21:30 |
clarkb | basically in my rereview I can't find anything that would indicate going back to the old situation of running the repo update for each job would be a problem | 21:31 |
opendevreview | Merged opendev/base-jobs master: Revert "infra-prod-setup-keys: drop inventory add" https://review.opendev.org/c/opendev/base-jobs/+/820251 | 21:34 |
clarkb | I guess give it a little longer in case ianw's day is still booting up and then plan to approve the other revert at the top of the hour otherwise? | 21:36 |
clarkb | unrelated gitea just made a new 1.15 release with a bunch of bugfixes | 21:40 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update gitea to 1.15.7 https://review.opendev.org/c/opendev/system-config/+/820267 | 21:47 |
clarkb | unlikely to land that today, but we can start the CI process on it | 21:47 |
fungi | yeah, top of the hour wfm... put ssh keys back, undo the disable-ansible, approve the change | 21:53 |
clarkb | I think we might want to approve first so that the hourly jobs can quickly cycle out | 21:54 |
clarkb | but then ya reenable things with the plan being the change will land and have a go | 21:55 |
ianw | sorry, here now! | 21:56 |
ianw | just reading | 21:57 |
clarkb | ianw: oh hi! so basically there are a few issues we discovered with the CD refactors that landed most recently. The main issue is system-config on bridge isn't being updated by the -src job | 21:57 |
clarkb | ianw: the reason for this is that we aren't adding bridge to the inventory anymore since we removed that from the keys playbook. But even if we fix that we also noticed that we aren't running the update job on hourly deploy or the daily periodic pipeline | 21:58 |
clarkb | separately we also found that only the -src job was checking DISABLE-ANSIBLE, which means you can't really get ahead of the next job, only the next buildset | 21:58 |
clarkb | fungi pushed up two revert changes, the first of which has landed and restores the inventory stuff to the setup-keys playbook. The other revert has us going back to the every-job-updates-system-config state so that we can roll forward addressing the whole set of issues | 21:59 |
ianw | ok, i thought it all seemed to be going too easily :) | 22:00 |
clarkb | ianw: I tried to leave comments on the revert changes to serve as hints for the future fixups but right now the priority is getting things working again as we are building up a delta (gitea haproxy, gerrit image update, and lists.openinfra.dev changes) that hasn't applied fully | 22:00 |
clarkb | We suspect that if we land the system-config revert then a bunch of those jobs will run, so we can reenable zuul access to bridge and approve that if you are happy with that plan | 22:01 |
clarkb | we disabled ansible so that we could figure out what was going on. I think at this point I'm reasonably well convinced it wasn't doing anything bad, just not doing anything new. We can probably reenable whenever I suppose | 22:02 |
clarkb | fungi: in https://review.opendev.org/c/opendev/base-jobs/+/820258 I think you can go ahead and add that role to the base job playbooks? | 22:03 |
fungi | is that safe? i suppose it is | 22:04 |
clarkb | fungi: ya it should be | 22:04 |
clarkb | with the usual caveats that updating base jobs is tricky and we should monitor | 22:04 |
fungi | does it need to be scoped to a specific inventory host? | 22:04 |
clarkb | fungi: yes it needs to only check on bridge | 22:04 |
clarkb | fungi: I think you can put that in the setup-keys playbook that adds bridge to the inventory | 22:05 |
clarkb | something like that should work well. And we can land it later when we are able to monitor and out of the unhappy current state | 22:05 |
ianw | thanks, 820250 is approved so we can get things moving | 22:05 |
fungi | ahh, okay, i assumed we'd want to explicitly add it to other jobs, but i guess if it's in base then it's implicitly added to all jobs without us needing to do anything | 22:05 |
clarkb | fungi: exactly | 22:06 |
fungi | with 820250 approved i should put back the ssh keys and undo the disable-ansible now? | 22:06 |
clarkb | fungi: if you do that the hourly jobs will run which will delay when the 820250 jobs start. I think if we can wait for hourly to finish and then reenable that would be best | 22:06 |
clarkb | but that only works if 820250 doesn't merge first :) | 22:06 |
fungi | got it | 22:07 |
fungi | i'll try to keep an eye on the screen | 22:07 |
clarkb | I think the hourly jobs need about 4-5 more minutes to cycle out. 820250 hasn't started all jobs yet so we should have some time | 22:07 |
clarkb | oh it just started and zuul says 26 minutes so ya we should be good to wait on the hourlies to finish first | 22:07 |
ianw | i do wonder if we want every job checking DISABLE-ANSIBLE | 22:10 |
ianw | i did totally overlook the other pipelines | 22:11 |
clarkb | for me at least its nice to be able to recognize there is an issue and then hit the off switch. I suppose if we want to keep things more fine grained we could say the ssh keys are the big red button and DISABLE-ANSIBLE is more graceful | 22:12 |
ianw | i guess you're saying you might want to stop things between the end of the src job and the other jobs starting? | 22:13 |
clarkb | yes or between some other job in the list and the next one if we realize something is off | 22:13 |
ianw | i was mostly thinking that cloning the source would be the place it stops; i don't have a problem with the flag as such | 22:14 |
ianw | hmm, fair enough. does the new zuul authentication bits give the option to cancel a buildset too? | 22:14 |
clarkb | maybe? we can dequeue with gearman as long as that still exists too | 22:15 |
clarkb | fungi the last job in the hourly buildset is about to timeout once that is done I think we can restore the ssh keys and remove DISABLE-ANSIBLE | 22:16 |
clarkb | fungi: it's done, we can reenable now. Were you going to do that or should I? | 22:17 |
corvus | yes you can dequeue an item | 22:18 |
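For reference, roughly what the dequeue invocation looks like with the gearman-based CLI (run wherever the zuul RPC client is available, e.g. in the scheduler container; the project/ref values are just examples):

```bash
zuul dequeue --tenant openstack --pipeline periodic-stable \
    --project opendev.org/openstack/nova --ref refs/heads/stable/xena
```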
clarkb | I went ahead and removed DISABLE-ANSIBLE and put the authorized_keys file back | 22:19 |
clarkb | we're making CD omelets | 22:20 |
opendevreview | Merged opendev/system-config master: Revert "infra-prod: clone source once" https://review.opendev.org/c/opendev/system-config/+/820250 | 22:23 |
clarkb | re Gerrit User Summit I did try to take a bunch of notes which I'll try to curate and post up somewhere. I think the big thing for us to think about is case sensitive username settings in 3.4 before we upgrade. Just to be sure that doesn't bite us later | 22:23 |
clarkb | but I also understand how the new check stuff works | 22:24 |
clarkb | For the new checks stuff you write a plugin that queries some CI endpoint for a change (in our case it would hit the zuul rest api I think). Then the plugin emits data in their standard format to the central checks UI system | 22:25 |
clarkb | then they handle all the rendering for you | 22:25 |
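Illustrative only: the kind of Zuul REST queries such a plugin would issue for a change (these endpoints exist in Zuul's web API; the change number is just an example):

```bash
curl -s 'https://zuul.opendev.org/api/tenant/openstack/status/change/820250,1' | python3 -m json.tool
curl -s 'https://zuul.opendev.org/api/tenant/openstack/builds?change=820250' | python3 -m json.tool
```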
clarkb | that was anti climactic it decided to not run any jobs | 22:27 |
clarkb | I guess because no jobs trigger on the base job updating? | 22:27 |
clarkb | just when you think you understand how computers work they remind you that no no you do not :) | 22:27 |
clarkb | Should we just wait for the hourly runs to happen then we can manually run the gitea-lb playbook and the lists playbook? | 22:27 |
fungi | thanks, i got back to the keyboard too late | 22:28 |
clarkb | My one concern with manually updating the system-config checkout is that we won't know that the jobs are doing it properly | 22:28 |
clarkb | I think I've decided we don't need to do the review playbook as all we did was update the image and those did promote to docker hub properly | 22:28 |
clarkb | or we can enqueue the lists change to deploy | 22:29 |
clarkb | that was the last system-config change to land. I don't think we should enqueue any older changes as that will create confusion | 22:29 |
fungi | i'm okay waiting for the hourly deploy | 22:30 |
clarkb | cool that wfm too then | 22:30 |
fungi | slightly worried that we've picked apart our deploy jobs enough that reenqueuing a particular change may not run everything anyway | 22:30 |
clarkb | fungi: ya it would only run whatever jobs it enqueued previously | 22:32 |
clarkb | though will it use the old state of the jobs too? I don't think so | 22:32 |
ianw | fungi: why do you think the deploy jobs won't run? | 22:34 |
clarkb | ianw: well the lists addition change won't run jobs for haproxy on gitea for example | 22:35 |
clarkb | but it will run some jobs related to lists | 22:35 |
ianw | oh right, yes i see what you mean | 22:36 |
clarkb | but we can manually run those playbooks once we're happy the automated jobs are updating commits properly | 22:37 |
fungi | hence the list of missing commits from the dark time | 22:38 |
fungi | so we know what needs to be rerun | 22:38 |
ianw | do we need infra-prod-setup-src, or should it just be part of infra-prod-install-ansible? | 22:53 |
clarkb | ianw: hrm thats a good question. I think if we're hard depending on the source update job and there is another job we always want to run it could pull double duty | 22:54 |
clarkb | call it prep-bridge or similar? | 22:54 |
ianw | maybe bootstrap-bridge? | 22:57 |
clarkb | ++ | 22:59 |
clarkb | hourly jobs are starting now | 23:00 |
fungi | good, i'm mostly back around again now | 23:01 |
clarkb | woot it just updated system-config | 23:02 |
clarkb | I think we're good. And can proceed with running the lists and gitea haproxy playbooks when we like (I don't think either of those playbooks conflicts with the jobs that hourly runs) | 23:02 |
clarkb | service-gitea-lb.yaml <- that is the playbook we run for the gitea lb. I'll go ahead and run it now | 23:04 |
clarkb | that is done. It updated the docker compose file to set the ro flag on the config bind mount and restarted the container | 23:06 |
clarkb | I can still reach https://opendev.org | 23:07 |
fungi | same | 23:07 |
clarkb | I think we're good | 23:07 |
ianw | thanks! | 23:07 |
clarkb | service-lists.yaml is the lists playbook. Fungi did you want to run that one? | 23:07 |
clarkb | `sudo ansible-playbook -f 20 -v /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-gitea-lb.yaml` is the command I used for the gitealb | 23:07 |
fungi | i can, just a sec | 23:08 |
clarkb | just need to wap out the playbook name | 23:08 |
clarkb | I'm happy to join a screen if you want to run it in screen too | 23:08 |
fungi | cueued up in a root screen session now | 23:08 |
fungi | er, cued | 23:09 |
clarkb | I'm in the screen and that command looks right to me | 23:09 |
fungi | well, also queued | 23:09 |
fungi | okay, running | 23:09 |
clarkb | interestingly infra-prod-service-bridge needs to be retried? | 23:10 |
clarkb | there doesn't appear to be a new playbook log file from that job in our ansible log dir | 23:11 |
fungi | it's working on adding the new site now | 23:11 |
clarkb | corvus: is there a good way to see those logs from a failed but will be retried job somewhere? | 23:11 |
clarkb | fungi: considering how long this command is taking I wonder if it is stuck on a read like we had before | 23:12 |
fungi | yeah, looking | 23:12 |
fungi | it's in an epoll loop | 23:13 |
fungi | epoll_wait(5, [], 2, 1000) = 0 | 23:13 |
fungi | wait4(2537565, 0x7ffdcda5976c, WNOHANG, NULL) = 0 | 23:13 |
fungi | clock_gettime(CLOCK_MONOTONIC, {tv_sec=1398764, tv_nsec=118518927}) = 0 | 23:13 |
fungi | i think | 23:13 |
clarkb | that task is the one we fixed for the read | 23:14 |
clarkb | by setting stdin: '' | 23:14 |
fungi | i don't see any child processes of that AnsiballZ_command.py anyway | 23:14 |
clarkb | fungi: ps shows it `ps -elf | grep newlist` | 23:15 |
fungi | oh, yup, my ps afuxww wrapped at an inconvenient column | 23:15 |
fungi | that newlist command looked like it wasn't a child so i skimmed past | 23:16 |
fungi | so i wonder why newlist would hang | 23:16 |
clarkb | strace says a read on fp 0 | 23:17 |
fungi | yes, it does | 23:17 |
clarkb | which seems like the same issue as before | 23:17 |
fungi | so waiting on a pipe | 23:17 |
* fungi sighs | 23:17 | |
clarkb | well fd 0 is stdin | 23:17 |
fungi | right, waiting on something to pipe into it i meant | 23:17 |
clarkb | what's weird is we fixed this and made sure the fix worked, I thought | 23:17 |
fungi | i thought so too | 23:18 |
clarkb | is there something special about newlisting the mailman list? | 23:18 |
fungi | it was prompting for confirmation last time, right? | 23:18 |
clarkb | fungi: prompting to send confirmation emails iirc ya | 23:18 |
clarkb | to the list admin | 23:18 |
fungi | ansible was making it look like a tty which caused it to go interactive | 23:18 |
clarkb | we didn't catch it in testing because testing sets the flag to not send notifications | 23:19 |
clarkb | but we do want those notifications in production :/ | 23:19 |
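A sketch of the trade-off being described with Mailman 2's newlist (list/owner values here are hypothetical): without -q it stops at the "Hit enter to notify ... owner" prompt and hangs under Ansible, while -q skips both the prompt and the notification mail that is wanted in production:

```bash
newlist mailman listadmin@example.org somepassword              # blocks waiting for Enter under Ansible
echo '' | newlist mailman listadmin@example.org somepassword    # empty stdin satisfies the prompt, still notifies
newlist -q mailman listadmin@example.org somepassword           # no prompt, but also no owner notification email
```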
fungi | i guess we should kill the hanging newlist or wait for the task to timeout | 23:19 |
clarkb | ya I think killing the newlist is probably best. Then we can put lists.o.o and lists.kc.io in the emergency file and go try and reproduce in testing? | 23:20 |
fungi | well, emergency file shouldn't be necessary for lists.k.i unless we try to add a list to it | 23:20 |
fungi | but may as well | 23:20 |
clarkb | good point | 23:21 |
fungi | i'll add them both and then kill the newlist process | 23:21 |
clarkb | sounds like a plan. We may need to kill a few more newlists if it continues to try after the failed attempt (I think it will short circuit though and we should have a half-configured site that we can ignore?) | 23:21 |
clarkb | ya appears to have short circuited | 23:22 |
fungi | i didn't initially add any other lists so it was only trying to create the default metalist | 23:22 |
clarkb | yup and I'm wondering if that metalist has additional prompts from newlist? | 23:22 |
clarkb | since we know that adding a normal list seems to work fine we have done a few of those iirc | 23:22 |
fungi | i suppose it might | 23:23 |
clarkb | but we should be able to work through it via held test nodes | 23:23 |
fungi | anyway, it's in emergency disable now, i can probably try to debug more tomorrow | 23:23 |
clarkb | yup thanks | 23:23 |
clarkb | I need to take a break to get some stuff done while the sun is still up | 23:23 |
clarkb | The other thing on my list was to restart gerrit on the new image. But will see where we are at later and if I've got brain space for that | 23:24 |
ianw | i'll be happy to do that when it's a bit quieter in a few hours | 23:28 |
corvus | clarkb: yes, you can find logs of retried builds by going to the buildset, and you can get to the buildset by clicking on any completed job in the buildset to get the build page for that build, then click the buildset link. example: Bearer | 23:29 |
corvus | eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpYXQiOjE2Mzg0ODc0MDEuNjMwOTM5MiwiZXhwIjoxNjM4NDg4MDAxLjYzMDkzOTIsImlzcyI6Inp1dWxfb3BlcmF0b3IiLCJhdWQiOiJ6dXVsLmV4YW1wbGUuY29tIiwic3ViIjoicm9vdCIsInp1dWwiOnsiYWRtaW4iOlsibm9uZSJdfX0.ONXqLWPTlGEUa-rKkjYHnclbtsS2sxsD9FIPY7kjV3M | 23:29 |
corvus | oh dear that's not the right example :) | 23:30 |
ianw | i'll rework the parallel changes into another series of "noop" jobs | 23:30 |
corvus | example: https://zuul.opendev.org/t/openstack/buildset/6cb2b00359e349ba954be34c2f06904a | 23:30 |
corvus | (that is not an important token, ftr) | 23:30 |
ianw | hopefully that meets the definition of noop this time | 23:31 |
opendevreview | Merged opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream https://review.opendev.org/c/opendev/base-jobs/+/820018 | 23:38 |
opendevreview | James E. Blair proposed opendev/system-config master: Add local auth provider to zuul https://review.opendev.org/c/opendev/system-config/+/820276 | 23:39 |
ianw | i'm keeping an eye on ^^. it's a very quick revert, but it was only an if conditional | 23:40 |
ianw | (i mean, if it does go wrong, it can be a quick revert) | 23:40 |
opendevreview | James E. Blair proposed openstack/project-config master: Add REST api auth rules https://review.opendev.org/c/openstack/project-config/+/820277 | 23:43 |
corvus | infra-root: the ansible hostvars file group_vars/grafana_opendev.yaml is not checked into git. should it be? | 23:44 |
corvus | infra-root: (also there are several *.old files which seems redundant for content that's in a git repo, should they be deleted?) | 23:45 |
fungi | ianw: ^ is that something you were working on? | 23:45 |
ianw | yeah, looking, it might be something i've left behind | 23:45 |
fungi | corvus: i'd delete old/backup copies yes | 23:45 |
corvus | i'll wait for ianw to clear before i do anything | 23:45 |
ianw | yeah it was from the swizzle time; that group went with https://review.opendev.org/c/opendev/system-config/+/739625 | 23:46 |
ianw | i'll rm it | 23:46 |
ianw | .. done | 23:47 |
corvus | thx. i'm going to rm emergency.yaml.old groups.yaml.old openstack.yaml.old | 23:48 |
ianw | ++ | 23:52 |
opendevreview | James E. Blair proposed openstack/project-config master: Add REST api auth rules https://review.opendev.org/c/openstack/project-config/+/820277 | 23:54 |
clarkb | thanks for doing that cleanup. I'm back at the computer and will try to be useful again | 23:58 |
clarkb | first up understanding why the bridge job retried | 23:58 |
corvus | at this point in the day, i don't think i have time to do the rolling zuul restart i asked about earlier... if someone wants to do that once things settle down, feel free, otherwise i'll ask again tomorrow. meanwhile, https://review.opendev.org/819923 https://review.opendev.org/820276 and https://review.opendev.org/820277 are all ready to merge. we should merge the latter two soon. like, before the gearman removal happens. | 23:58 |
clarkb | https://zuul.opendev.org/t/openstack/build/317db45bca0a45ba8d79e491b74b1f5c it hit the exact time the haproxy was not working | 23:58 |
clarkb | I can review those. I've already reviewed the keycloak change, but really the other two seem urgent and worth a check | 23:59 |