fungi | if tonyb doesn't beat me to it, i'll send the one-hour status notice at 14:30 utc, add the indicated hosts to the disable list and pre-create the root screen session on review02 per https://etherpad.opendev.org/p/gerrit-upgrade-3.8 | 13:08 |
---|---|---|
fungi | i'm headed out now but have all that prepped so i can do it from my phone if i'm not home by then | 13:08 |
Clark[m] | Thanks! I'm working on being awake now. If possible a larger than 80x24 screen window would be good :) | 13:44 |
tonyb | lies! all terminals are 80x24 ;P | 13:45 |
tonyb | disabled list updated | 13:45 |
tonyb | And the correct nodes are in groups["disabled"] | 13:48 |
fungi | thanks! | 13:52 |
fungi | turns out the cell signal in this parking lot is nearly nonexistent | 13:53 |
tonyb | eeek | 13:54 |
fungi | my phone claims to be doing cellular data with a single bar of 2g signal | 13:55 |
fungi | occasionally switches to 4g for a minute and then drops back to 2g again | 13:56 |
tonyb | screen session created in a 100x50 terminal | 13:57 |
tonyb | That's super frustrating. | 13:57 |
fungi | attached | 13:57 |
tonyb | Hopefully the geometry works out okay. | 13:59 |
tonyb | Is there some way to verify I can use statusbot ahead of time? | 14:00 |
fungi | yeah, american cell providers have gotten really terrible about dropping roaming agreements, which makes it extra bad if you live in a remote location with spotty coverage | 14:00 |
fungi | tonyb: i suppose you could #status log something, but might as well just try to do the notice and see what happens | 14:01 |
fungi | we're 90 minutes from start anyway | 14:02 |
tonyb | #status notice Gerrit will be unavailable for a short time starting at 15:30 UTC as it is upgraded to the 3.8 release. https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/ | 14:05 |
opendevstatus | tonyb: sending notice | 14:05 |
-opendevstatus- NOTICE: Gerrit will be unavailable for a short time starting at 15:30 UTC as it is upgraded to the 3.8 release. https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/ | 14:05 | |
opendevstatus | tonyb: finished sending notice | 14:08 |
tonyb | Okay, looks like we're good for clarkb to carry on from https://etherpad.opendev.org/p/gerrit-upgrade-3.8#L109 | 14:09 |
fungi | awesome | 14:10 |
tonyb | I verified that the screen logging is working ... mostly to check I did it right ;P | 14:12 |
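A rough sketch of how a shared, logged root screen session like the one described above might be created; the session name and log path are assumptions, not what was actually typed on review02:

```bash
# start a named root screen session with logging enabled
# (session name and log file are placeholders)
sudo -s
screen -S gerrit-upgrade -L -Logfile /root/gerrit-upgrade-screen.log

# a second admin can attach to the same running session with:
sudo screen -x gerrit-upgrade
```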
* corvus yawns | 14:15 | |
clarkb | I've realized that we run gerrit init with the mysql db stopped. In this upgrade that isn't a problem because there are no schema changes, but to avoid problems in the future I'm going to add a command to step 11 to start the db before running init | 14:27 |
clarkb | and once I've got tea I'll load ssh keys and hop into screen and get ready for the rest of the fun | 14:28 |
fungi | good call | 14:28 |
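A very rough sketch of what the step 11 addition mentioned above might look like; the compose path and service name are assumptions based on a typical docker-compose Gerrit deployment, not the actual etherpad contents:

```bash
# bring only the database container up before gerrit init is run
# (path and service name are placeholders)
cd /etc/gerrit-compose
docker-compose up -d mariadb
docker-compose ps    # confirm the db is running before proceeding with init
```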
*** blarnath is now known as d34dh0r53 | 14:38 | |
clarkb | I confirm that screen logging appears to be working | 14:41 |
clarkb | and am attached to the screen | 14:41 |
clarkb | and emergency file looks good. Thank you for taking care of that | 14:42 |
fungi | home again with time to spare | 14:50 |
fungi | yeah, i checked the emergency list from the car and it looked right | 14:50 |
tonyb | Nice. | 14:51 |
fungi | also notified the openstack release team during their meeting | 14:51 |
tonyb | I really like the ansible that corvus shared. | 14:51 |
corvus | me too! hope it's accurate! :) | 14:52 |
tonyb | fungi: awesome | 14:52 |
corvus | https://etherpad.opendev.org/p/Av9otg2ML-52q2Nxiyi9 is my plan for zuul | 15:01 |
clarkb | corvus: do you need an inline gzip on the mysql dump to keep file sizes down? (not sure how big that will be and if zuul01's disk is large enough) | 15:03 |
corvus | clarkb: good point; as of 1 month ago it was 18g uncompressed. i'll probably dump it into /opt uncompressed which has plenty of space then compress it later | 15:04 |
clarkb | sounds good | 15:05 |
corvus | a month ago it was 2.6G compressed | 15:06 |
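For illustration, the two approaches being weighed above as a hedged sketch; the database name and paths are placeholders, not the real backup commands:

```bash
# inline compression keeps the on-disk footprint small from the start
mysqldump --single-transaction zuul | gzip > /opt/zuul-pre-migration.sql.gz

# or dump uncompressed to a filesystem with plenty of room, compress later
mysqldump --single-transaction zuul > /opt/zuul-pre-migration.sql
gzip /opt/zuul-pre-migration.sql
```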
corvus | i'm running the docker-compose pulls now | 15:07 |
corvus | pulls complete | 15:17 |
corvus | hrm, we don't have a root mysql user for zuul do we? | 15:23 |
clarkb | heh now both backup servers are at or above 90% | 15:23 |
corvus | (it's not important, but the zuul user lacks some privileges to inspect what the innodb engine is doing, which can be useful for monitoring progress) | 15:23 |
clarkb | that shouldn't affect our backups for the gerrit upgrade but we'll want to address that soon | 15:23 |
corvus | (yet another reason to just run our own db server) | 15:24 |
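A hedged illustration of the kind of inspection corvus is referring to; SHOW ENGINE INNODB STATUS requires the PROCESS privilege, which a restricted application account typically lacks, and without it SHOW PROCESSLIST only shows the account's own threads:

```bash
# both of these need the PROCESS privilege (or root) to show the full picture
mysql -e "SHOW ENGINE INNODB STATUS\G"
mysql -e "SHOW FULL PROCESSLIST;"
```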
tonyb | Oh yeah. I was going to prune some older backups. | 15:24 |
clarkb | 5 minutes until we start | 15:25 |
tonyb | ++ | 15:25 |
clarkb | and so it's clear I plan to "drive" | 15:27 |
clarkb | I'm awake and here so may as well :) | 15:27 |
fungi | great. i'm standing by to help test and troubleshoot | 15:28 |
* tonyb is watching from the cheap seats | 15:28 | |
tonyb | and is happy to help as directed | 15:28 |
clarkb | yup I'll be sure to mention things in here if I need help | 15:29 |
clarkb | My clock has ticked over to 1530. I'm proceeding now | 15:30 |
corvus | stopping zuul | 15:30 |
clarkb | giving mariadb a few seconds to start up before I proceed with backups which talk to it | 15:31 |
clarkb | fs backups are complete and exited 0 | 15:32 |
clarkb | db backups are in progress | 15:32 |
clarkb | db backups also report rc 0 so I'm proceeding | 15:33 |
clarkb | I'm up to the point where I pull images. The edits to the docker-compose.yaml file lgtm so I am proceeding | 15:36 |
clarkb | The hash under RepoDigests near the top of the screen window seems to match the one I've got in the etherpad (and I checked that version was up to date yesterday) | 15:38 |
clarkb | fungi: tonyb: next step is the actual upgrade. Any reason to not proceed? | 15:38 |
tonyb | clarkb: Not that I can see. | 15:38 |
fungi | none i'm aware of | 15:38 |
clarkb | ok proceeding | 15:38 |
clarkb | [2023-11-17T15:41:00.681Z] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.8.2-53-gfbcfb3e1e5-dirty ready | 15:41 |
clarkb | I'm going to check reindexing maybe yall can look at web stuff? | 15:41 |
fungi | yep, looking | 15:42 |
clarkb | I see reindexing is in progress according to `gerrit show-queue` | 15:42 |
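For reference, a hedged sketch of checking reindex progress over Gerrit's admin SSH interface; the host and account here are assumptions:

```bash
# list the task queue, which includes the online reindex tasks
ssh -p 29418 admin@review.opendev.org gerrit show-queue --wide --by-queue
```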
fungi | Powered by Gerrit Code Review (3.8.2-53-gfbcfb3e1e5-dirty) | 15:42 |
clarkb | web is up for me and I see the expected version | 15:42 |
fungi | poking around the webui and i don't see anything out of the ordinary yet | 15:43 |
tonyb | UI looks good, logout login seem to work | 15:43 |
clarkb | thanks. The config diff has the one email template that updated as expected so that's good | 15:44 |
clarkb | pushing a change/patchset and then checking it replicates is probably the last remaining functionality thing we can do until zuul is with us again? | 15:44 |
clarkb | maybe one of you can do that? | 15:44 |
fungi | yeah, i have a dnm change i think | 15:45 |
clarkb | task queue has dropped by ~600 items since I first checked | 15:45 |
corvus | zuul db backup is complete and migration has started | 15:45 |
SvenKieske | +1 reviews work as well :) | 15:45 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: DNM: Test bindep with PYTHONWARNINGS=error https://review.opendev.org/c/opendev/bindep/+/818672 | 15:46 |
fungi | that's commit 2a9a8b0751f27e97820c2444aa2df6df6fb8f54b | 15:46 |
clarkb | https://opendev.org/opendev/bindep/commit/2a9a8b0751f27e97820c2444aa2df6df6fb8f54b I see it | 15:47 |
clarkb | push and replication seem good. I think the next major item then is waiting for reindexing to complete and then verifying zuul interactions once zuul is back | 15:47 |
tonyb | You guys are so quick :) | 15:47 |
fungi | yeah, https://opendev.org/opendev/bindep/commit/2a9a8b07 comes up for me too | 15:47 |
fungi | tonyb: it's not the first time we've done this ;) | 15:47 |
clarkb | tonyb: we are practiced :) | 15:47 |
tonyb | :) | 15:48 |
clarkb | gerrit's error log doesn't show anything unexpected. I see the expected exception from the plugin manager and reindexing updates in the log are at 33% complete | 15:49 |
clarkb | I do note that at least one user has invalid project watches. I believe this is a known thing and not new | 15:49 |
corvus | is it me? | 15:50 |
clarkb | corvus: no | 15:50 |
clarkb | it's a relatively new account, which surprises me. | 15:50 |
corvus | we're at the "copy the build table" portion of the migration; i want to say that's like 8 minutes... | 15:51 |
corvus | no, 11 minutes locally according to my notes | 15:52 |
clarkb | just over halfway done on the reindex according to the log file | 15:53 |
fungi | we'll probably also want a status notice at the end to remind people some changes may need to be rechecked? maybe something like... | 15:55 |
fungi | status notice The Gerrit upgrade is now complete; be aware we had Zuul offline in parallel for a lengthy schema migration, so any events occurring prior to XX:XX UTC may need a recheck to trigger jobs. | 15:55 |
clarkb | ++ | 15:56 |
clarkb | side note: I think we upgraded to 3.7 about a week before the 3.8 release. This upgrade is about a week before the 3.9 release. It's cool to see we're keeping up. Also they are crazy for releasing over thanksgiving | 15:58 |
tonyb | The release should be fine, it's the consumers of the release that may disagree ;P | 16:01 |
corvus | oh cool, i didn't quite catch the end, but i'm pretty sure the table copy took no longer than my local run, meaning my time/performance estimate should be pretty close to prod | 16:03 |
clarkb | corvus: nice | 16:04 |
corvus | we are 2 steps away from the point of no return on the db migration | 16:04 |
clarkb | reindexing completed and gerrit reports it is using the new gerrit index version. There were three errors against two changes being reindexed. Both changes have id's <20k | 16:04 |
* fungi grabs holds onto his seat | 16:04 | |
clarkb | I think we've seen that before and these are problems with old changes that we've basically accepted because what are you going to do | 16:05 |
fungi | yes, we have a handful of "bad" changes that can't be indexed | 16:05 |
clarkb | (if we want those to go away we could possibly try deleting the changes) | 16:05 |
fungi | all very old | 16:05 |
clarkb | show queue output looks good too. I'm marking that step done now | 16:05 |
fungi | i can't remember now, but vaguely recall they're unreachable in the ui too | 16:05 |
clarkb | ya | 16:06 |
corvus | we are past the point of no return on the zuul migration (if there is an error, we will need to fix it or restore from backup) | 16:07 |
clarkb | ack | 16:07 |
fungi | noted | 16:08 |
clarkb | fwiw on the gerrit side if infra-prod-service-review runs before we want it to at this point that's mostly safe. It will only update the docker-compose.yaml file to use the 3.7 image but on gerrit we don't let it manage service restarts | 16:10 |
clarkb | so as long as we get the change to update to 3.8 landed quickly we should be fine. All that to say I think we'll defer to corvus on when he is ready to clean up the emergency file and take it from there | 16:10 |
fungi | wfm | 16:11 |
tonyb | Sounds good. | 16:11 |
corvus | there was an error, i'm trying to sort out the logs | 16:13 |
corvus | 2023-11-17 15:43:47,859 DEBUG zuul.SQLConnection: Current migration revision: 151893067f91 | 16:19 |
corvus | 2023-11-17 16:10:03,931 DEBUG zuul.Scheduler: Configured logging: 9.2.1.dev47 | 16:19 |
corvus | it appears the scheduler restarted at that time; i don't see any indication why | 16:20 |
clarkb | agreed `docker ps -a` shows the container running for 10 minutes | 16:20 |
clarkb | dmesg doesn't report OOMKiller kicking in | 16:21 |
corvus | i wonder if there's a way to get the docker output from the previous container | 16:21 |
fungi | did we have it copying to syslog? | 16:21 |
clarkb | https://paste.opendev.org/show/bPTXJSq4qhmM81MofoBU/ I see this from docker in syslog | 16:23 |
clarkb | I don't see any ansible tasks | 16:24 |
clarkb | (so no unexpected ansible triggered this as far as I can tell) | 16:24 |
corvus | so it could be a zuul crash where the logs only go to stderr | 16:24 |
clarkb | I don't see anything in /var/log/containers which is where we've done our other docker log redirect to syslog output | 16:25 |
fungi | yeah, directory is entirely empty | 16:26 |
corvus | /var/lib/docker/containers only has the current container with no log from a previous run | 16:26 |
clarkb | we probably just haven't set it up for the zuul services | 16:26 |
fungi | right, nothing for log settings in /etc/zuul-scheduler/docker-compose.yaml | 16:26 |
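A hedged way to confirm which logging driver the scheduler container uses (the container name is a guess); with the default json-file driver the log lives under /var/lib/docker/containers/<id>/ and disappears when the container is removed and recreated, which matches what was observed above:

```bash
# show the logging driver configured for the running container
docker inspect --format '{{.HostConfig.LogConfig.Type}}' zuul-scheduler_scheduler_1

# any docker daemon messages that did make it to syslog
grep dockerd /var/log/syslog | tail -n 50
```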
corvus | okay i think the best thing we can do now is shut down the scheduler and then i'll try to figure out where in the migration it was and see if i can reconstruct the error, assuming there was one | 16:27 |
fungi | that sounds reasonable to me | 16:28 |
corvus | any objections to shutting down the running scheduler (which is in a loop trying to redo the migration but it can't)? | 16:28 |
clarkb | no objection from me | 16:28 |
tonyb | none | 16:28 |
fungi | please do | 16:28 |
corvus | in case it's useful in the future, the current (reconstituted) container is b6d98a4420b035c1eab11088d2764849afc6f36d8096ef91525f1a83b1346380 | 16:28 |
corvus | | 13869276 | zuul | 10.223.160.47:42130 | zuul | Query | 166 | altering table | CREATE INDEX zuul_build_uuid_buildset_id_idx ON zuul_build (uuid, buildset_id) | | 16:29 |
corvus | that's the last thing i saw; trying to see if it proceeded past that | 16:29 |
clarkb | should we do something like #status notice The Gerrit upgrade to 3.8 is complete but Zuul remains offline due to a problem with database migrations in a Zuul upgrade that was being performed in the same outage window. We will notify when both Gerrit and Zuul are happy again. | 16:29 |
fungi | status notice The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete. | 16:30 |
corvus | yes, but you might want to indicate whether or not you think ppl should use gerrit | 16:30 |
fungi | hah, i was just typing something similar | 16:30 |
corvus | i have updated the zuul etherpad with the dump of all the sql statements i'm working from | 16:31 |
clarkb | corvus: at this point I think it is fine to use gerrit to post reviews. The only major item we haven't confirmed is working is the zuul integration | 16:31 |
clarkb | but good point. Maybe something along the lines of "it should be safe to post changes and reviews to Gerrit but you will not get CI results" | 16:31 |
clarkb | I think fungi's message covers that actually | 16:32 |
fungi | feel free to reword mine if you prefer | 16:32 |
clarkb | fungi: no I think yours is good if you want to send it | 16:32 |
fungi | #status notice The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete. | 16:32 |
opendevstatus | fungi: sending notice | 16:32 |
-opendevstatus- NOTICE: The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete. | 16:33 | |
corvus | i think we're at line 192 in the etherpad | 16:34 |
clarkb | corvus: are you thinking roll forward from there and see if you can reproduce an error? | 16:35 |
opendevstatus | fungi: finished sending notice | 16:35 |
corvus | yes; and also, if able to proceed without an error, then just finish the migration manually | 16:36 |
clarkb | ok | 16:36 |
fungi | makes sense to me | 16:36 |
corvus | the statement that presumably failed involves fk constraints; we should have disabled them already, but i wonder if this old database/server behaves differently | 16:36 |
corvus | i'm going to set fk checks off and then run line 192 | 16:37 |
corvus | btw i do have a screen session on zuul02 (second window) if anyone wants to join | 16:38 |
corvus | okay, we're running mysql 5.7 and it does not support "alter table drop constraint" | 16:39 |
fungi | zuul01? | 16:40 |
corvus | yep | 16:40 |
corvus | sorry | 16:40 |
fungi | np, attached now | 16:40 |
fungi | we're still using a trove instance in rax for this, right? | 16:41 |
corvus | yep | 16:41 |
fungi | i guess we should be thinking about upgrading that instance and/or setting up our own db cluster instead | 16:44 |
corvus | yep. i'd like to try the statement on line 203 | 16:45 |
clarkb | that looks fine to me (but I'm not monty sql wizard) | 16:46 |
fungi | looks the same as the one at line 192. does it differ in ways i'm not spotting? | 16:46 |
clarkb | fungi: it moves from dropping constraint to dropping foreign key | 16:46 |
corvus | s/constraint/foreign key/ | 16:46 |
clarkb | seems to be a syntax thing | 16:46 |
fungi | oh! right | 16:47 |
corvus | Query OK, 0 rows affected (0.06 sec) | 16:47 |
fungi | but yeah, i'm a bit out of my depth when it comes to foreign key constraints | 16:47 |
corvus | show create table on that lgtm now | 16:48 |
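To illustrate the syntax difference just worked around (table and constraint names here are placeholders, not the real migration statements): ALTER TABLE ... DROP CONSTRAINT only exists in MySQL 8.0+, while DROP FOREIGN KEY is accepted by 5.7 as well.

```bash
# fails on MySQL 5.7; DROP CONSTRAINT is only understood by MySQL 8.0+
mysql -e "ALTER TABLE zuul.zuul_build DROP CONSTRAINT zuul_build_buildset_id_fk;"

# equivalent statement that MySQL 5.7 accepts
mysql -e "ALTER TABLE zuul.zuul_build DROP FOREIGN KEY zuul_build_buildset_id_fk;"

# verify the foreign key is gone
mysql -e "SHOW CREATE TABLE zuul.zuul_build\G"
```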
corvus | shall i continue running the statements in the etherpad manually? | 16:48 |
fungi | yes please | 16:48 |
clarkb | corvus: I think if you are comfortable doing that (you wrote the migration so should be pretty knowledgeable about what is needed from here) I would say go for it | 16:48 |
clarkb | I think I would be more concerned if this was software we weren't super familiar with just because the chance of missing something is high | 16:49 |
corvus | the scheduler in its attempt to re-run the migration failed at step1, so i don't think it did any damage | 16:49 |
clarkb | ack | 16:50 |
fungi | stroke of luck there, i suppose | 16:50 |
corvus | (it's sort of symmetrical; same table is at the beginning and end of the migration) | 16:52 |
corvus | i'm double checking the etherpad statements with the python code | 16:52 |
corvus | (just to make sure nothing changed) | 16:52 |
corvus | while we're waiting on this alter table; i think to fix zuul we can try just making this change and let testing tell us if that works in current mysql/postgres | 16:54 |
clarkb | ++ | 16:54 |
corvus | okay migration is complete | 16:56 |
corvus | shall we startup the executor again now? | 16:56 |
fungi | i think so | 16:56 |
fungi | scheduler you mean? | 16:56 |
corvus | ha yes lets do that one :) | 16:56 |
clarkb | wfm | 16:56 |
fungi | cool, yes then ;) | 16:56 |
corvus | seems to be happy and not doing any sql things | 16:57 |
fungi | yay! | 16:57 |
fungi | thanks!!! | 16:57 |
corvus | i will proceed with restarting the rest | 16:57 |
fungi | once it's up i can recheck that dnm change from earlier and see if it gets enqueued | 16:57 |
clarkb | I guess let us know when you think we should recheck fungi's bindep DNM change to check the zuul + gerrit intercommunication | 16:58 |
clarkb | ++ | 16:58 |
corvus | rebooting all other hosts | 16:58 |
corvus | starting zuul-web on zuul01 | 16:59 |
corvus | clarkb: looks like we *are* going through the github branch listing | 17:00 |
corvus | but it was relatively fast this time | 17:00 |
clarkb | huh | 17:00 |
fungi | hopefully won't trigger any api rate limits | 17:00 |
clarkb | the updates we made should avoid that now | 17:01 |
clarkb | fingers crossed anyway | 17:01 |
corvus | yeah it's done | 17:01 |
corvus | starting up zuul02 | 17:01 |
corvus | starting mergers and executors | 17:02 |
fungi | dashboard is returning content now | 17:03 |
fungi | i guess all of the queue items will end up retrying their prior builds? | 17:04 |
corvus | the builds and buildsets tabs produce data in a reasonable amount of time | 17:04 |
corvus | yep, it's firing them off now | 17:04 |
fungi | oh, nice, it's just the builds which were in progress that are being retried, all the ones which had completed remain so | 17:04 |
corvus | yep | 17:05 |
corvus | a neat side effect of that is that we immediately have new build database records (for the "RETRY" results) | 17:05 |
clarkb | should we recheck the bindep chagne now? | 17:05 |
fungi | should i go ahead and recheck our test change? | 17:05 |
corvus | yep i think it's gtg | 17:06 |
fungi | done | 17:06 |
clarkb | I see it enqueued | 17:06 |
clarkb | one reason we explicitly test rechecks is that they have changed the stream event format around comment data before | 17:06 |
clarkb | but that seems good | 17:06 |
clarkb | and jobs are starting | 17:06 |
clarkb | I've marked the two zuul related items on step 14 as done based on the bindep change | 17:07 |
clarkb | that takes us to step 17 which is to quit the screen and save the log. Are we ready for that? | 17:07 |
fungi | yeah, https://zuul.opendev.org/t/opendev/status lgtm | 17:07 |
fungi | i suppose we'll want to see it successfully comment in gerrit too, those jobs should hopefully be relatively fats | 17:08 |
fungi | er, fast | 17:08 |
clarkb | I've also removed my WIP on https://review.opendev.org/c/opendev/system-config/+/899609 and think we can approve that whenever corvus is ready | 17:08 |
clarkb | fungi: ++ | 17:08 |
fungi | some builds for 818672 have already succeeded | 17:08 |
corvus | ready for 899609 | 17:09 |
corvus | all zuul components are running now | 17:09 |
clarkb | tonyb: fungi: you cool with me closing the screen now and saving the log? | 17:09 |
corvus | and i think it's fine to remove zuul from emergency now | 17:09 |
tonyb | clarkb: Yup. | 17:09 |
clarkb | corvus: do you want to do the emergency file cleanup? I think you can remove review related stuff as well | 17:09 |
corvus | con do | 17:10 |
corvus | can do | 17:10 |
fungi | clarkb: go for it | 17:10 |
fungi | status notice Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs. | 17:10 |
clarkb | step 17 to stop screen and move the log file is done | 17:11 |
fungi | does that cover what we want folks to know? | 17:11 |
clarkb | fungi: lgtm | 17:11 |
corvus | i have removed today's maintenance entries from emergency. | 17:11 |
fungi | thanks! | 17:11 |
corvus | do we want to also remove the unrelated things we think we can clean up from that file, or leave that for another day? | 17:11 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/899609 | 17:11 |
fungi | #status notice Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs. | 17:11 |
opendevstatus | fungi: sending notice | 17:11 |
clarkb | corvus: I'm happy to do that another day :) | 17:11 |
-opendevstatus- NOTICE: Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs. | 17:11 | |
clarkb | corvus: I can make a note to myself to do that monday | 17:11 |
corvus | ack | 17:11 |
fungi | yeah, i'm beginning to get a smidge peckish | 17:12 |
clarkb | ok to recap where we are on the gerrit side of things: The upgrade is done, the checks we have performed have all checked out. Nothing crazy or unexpected in the gerrit error_log and we got things we expected which is double good. We have since removed services from the emergency file and approved the change to set the 3.8 image in docker-compose.yaml on review02. We need to confirm | 17:13 |
clarkb | that the file looks good after infra-prod-service-review runs | 17:13 |
clarkb | infra-prod-service-review does not start and stop gerrit though so it should be very safe even if we got something wrong there | 17:13 |
opendevstatus | fungi: finished sending notice | 17:14 |
tonyb | clarkb: I didn't know that infra-prod-service-review does not stop/start gerrit but everything else matches my understanding | 17:16 |
clarkb | tonyb: ya, many of our services we let ansible automatically do that stuff. Gerrit is special enough and has lots of rules that change between versions about whether or not you need to init or reindex or both, and it's also disruptive to restart even when we do updates within a single version. All that means we let ansible write the configs and then we manually restart things | 17:17 |
tonyb | clarkb: makes sense. | 17:19 |
clarkb | corvus: fwiw the builds search feature in zuul works for me. As does buildsets. | 17:20 |
clarkb | https://review.opendev.org/c/starlingx/update/+/898850 is a post upgrade zuul comment against gerrit | 17:20 |
clarkb | it lgtm | 17:21 |
* clarkb takes a break while waiting for job results | 17:22 | |
fungi | the test change is still waiting on two nodes | 17:28 |
fungi | and is in a failing state (build log shows the reason) | 17:29 |
fungi | so looks like it's working the way it should so far | 17:29 |
* tonyb goes afk for a bit | 17:30 | |
fungi | yeah, christine's being very patient waiting for me to take her to lunch, but i may have to assume this will work and check back on it after i return | 17:31 |
fungi | node request backlog is down to around 65 now | 17:32 |
clarkb | zuul is busy today. I didn't expect the friday before a major holiday (for at least some contributors) to be so busy | 17:33 |
fungi | are those "Will not fetch project branches as read-only is set" errors for the opendev tenant expected? | 17:33 |
clarkb | fungi: yes/no They are not new. But we do need to debug them | 17:34 |
fungi | ah, okay. thanks | 17:34 |
fungi | i have a feeling the node request for that remaining opendev-nox-docs build got accepted by a provider that's repeatedly timing out booting an instance for it | 17:35 |
clarkb | https://review.opendev.org/c/starlingx/stx-puppet/+/900806 is a merged change after the upgrade fwiw | 17:36 |
fungi | we're at about 30 minutes since the change was enqueued, and all its builds have completed except that one | 17:36 |
clarkb | fungi: really we have enough other changes that have done stuff that I think its fine | 17:36 |
corvus | ftr, no that alter table syntax does not work with postgres, but it does work with mysql 8.x so i've updated the patch with a conditional. i do expect it to pass tests now. | 17:36 |
fungi | cool, i'm going to step out for an hour, back soon | 17:36 |
clarkb | corvus: the fix gets a +2 from me | 17:37 |
clarkb | oh the tooz thing is causing the openstack gate to thrash which could explain why it is busy | 17:40 |
clarkb | more coincidence than anything else I think | 17:40 |
corvus | i wonder if we should start promoting the regex-based early failure detection | 17:41 |
corvus | seems to be working pretty well in the zuul-project jobs | 17:42 |
corvus | maybe i should send an email next week | 17:42 |
clarkb | ++ | 17:43 |
clarkb | https://opendev.org/starlingx/stx-puppet/commits/branch/master shows the above merged change replicated and the master branch updated properly | 17:47 |
clarkb | just more sanity checks of replication | 17:47 |
corvus | i added another zuul change to address the missing error log problem | 17:51 |
clarkb | good idea +2 there as well | 17:51 |
clarkb | as a heads up the opendev hourly jobs have enqueued. They will run against zuul but not review | 18:02 |
clarkb | the hourly jobs should wrap up in just a couple of minutes. Then a few minutes after that the change to set the image version in the docker-compose file should merge and apply (which should noop) | 18:20 |
opendevreview | Merged opendev/system-config master: Upgrade Gerrit to Gerrit 3.8 https://review.opendev.org/c/opendev/system-config/+/899609 | 18:32 |
tonyb | gerrit-compose on review02 looks good to me (still contains 3.8) | 18:34 |
clarkb | yup the job is still running though according to zuul | 18:34 |
clarkb | not sure when ansible will try to modify it so we should double check after the job completes | 18:34 |
clarkb | job is complete now and the file wasn't modified according to the timestamp | 18:35 |
clarkb | yup looks good too. I think we are basically done at this point | 18:35 |
tonyb | Oh so it is. I waited for it to merge, but of course that isn't the "important" run; I needed to wait for the deploy pipeline | 18:35 |
clarkb | I made a note to myself to do the autohold cleanup Monday | 18:35 |
clarkb | and then in a few days / a week we can work to clean up the 3.7 image stuff if we don't revert between now and then | 18:36 |
clarkb | tonyb: the other things I've got on my list for today are to swap in the new mirror and to do db pruning. I think fungi mentioned he would help with the db pruning. Do you have a change up for the dns update to swap in the new mirror yet? | 18:37 |
tonyb | I don't. Gimme 5 and I will ;P | 18:37 |
clarkb | ok I can review it :) | 18:37 |
opendevreview | Tony Breeds proposed opendev/zone-opendev.org master: Switch CNAME for mirror.ord.rax to new mirror02 node https://review.opendev.org/c/opendev/zone-opendev.org/+/901332 | 18:46 |
tonyb | clarkb: I'd be keen to shadow you as you do the db purging. | 18:46 |
clarkb | er sorry not db pruning. Backup pruning | 18:48 |
clarkb | tonyb: I was going to defer that to you and fungi since fungi mentioned the other day being willing to do that with you | 18:48 |
tonyb | Sorry that's what I meant | 18:48 |
tonyb | clarkb: Okay cool | 18:48 |
clarkb | ya I typoed backups as db earlier :) just wanted to be clear | 18:48 |
clarkb | tonyb: https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#managing-backup-storage is the relevant documentation | 18:49 |
tonyb | Thanks | 18:51 |
clarkb | when fungi returns from lunch he can double check the dns update and then let us know if backup pruning isn't in the cards for today | 18:53 |
clarkb | if not I can refresh on it | 18:53 |
tonyb | Okay. Sounds good. I have a few things to do in a couple of hours but if that conflicts I'll shadow next time. | 18:55 |
clarkb | fyi it has been reported in #openstack-infra that editing files in the web UI isn't working. Specifically you can enter edit mode, but when you open a file to edit a specific file it never loads the file contents in the editor. You can then exit edit mode successfully | 18:59 |
fungi | okay, backl | 19:07 |
fungi | back too | 19:07 |
fungi | tonyb: i can work with you on that now if you like | 19:07 |
fungi | or later, either works | 19:07 |
tonyb | fungi: If now works for you it works for me | 19:08 |
fungi | starlingx folks are asking about errors from pip install... i'm looking into the logs now | 19:08 |
tonyb | fungi: where? I suspect that's another venue I should hang out/monitor | 19:09 |
fungi | pinged me directly in the starlingx general matrix room | 19:09 |
tonyb | Ah okay | 19:10 |
clarkb | fungi: can you weigh in on the editor being broken first? | 19:12 |
clarkb | I want to make sure we're comfortable with that not working for now | 19:12 |
fungi | oh, i missed the broken editor | 19:13 |
clarkb | I'm putting notes in the etherpad. I don't think we need to rollback for this. Its annoying but not vital | 19:13 |
tonyb | Shoot it's working for me now | 19:14 |
clarkb | wut | 19:15 |
clarkb | tonyb: it == editor in web ui? | 19:15 |
tonyb | clarkb: Yup. | 19:15 |
* clarkb retests | 19:15 | |
tonyb | clarkb: I reproduced exactly what was seen on 900435. I was poking around in the console/developer tools | 19:16 |
clarkb | tonyb: it still doesn't work for me. | 19:16 |
tonyb | clarkb: I closed the window by mistake and now I have a functional editor | 19:16 |
clarkb | maybe we need to hard refresh because something is cached? | 19:16 |
clarkb | ya maybe that is it /me tries | 19:16 |
clarkb | yup that did it | 19:16 |
clarkb | I think the plugin html must not be versioned like the main gerrit js/html/css is so we don't get the auto eviction stuff | 19:18 |
fungi | strangely, starlingx is seeing the reverse of https://review.opendev.org/c/openstack/project-config/+/897545 in this job, i think: https://zuul.opendev.org/t/openstack/build/7b5008b924e247c7a1f3eb76fe96151f | 19:18 |
* clarkb updates the etherpad. That gives us something to send upstream | 19:18 |
fungi | they're trying to access wheels under debian-11.8 instead of debian-11 | 19:18 |
clarkb | fungi: we updated zuul; maybe ansible reverted that behavior / swapped it around | 19:19 |
fungi | well, i think it's that we ended up with newer libs in the ansible venvs | 19:19 |
clarkb | fungi: yes we changed the mirror stuff because ansible updated and changed the behavior of those vars. I'm wondering if they swapped back to the old behavior | 19:20 |
clarkb | and did it since last Friday because we last upgraded zuul around then and just upgraded it a few hours ago | 19:20 |
fungi | what we fixed in 897545 was the jobs that build the wheel mirrors | 19:21 |
fungi | the problem they're seeing is that jobs are now looking for wheels in the location that the broken wheel mirror jobs wanted to publish to | 19:21 |
fungi | so i don't think this is a revert | 19:21 |
fungi | it looks more like a delayed reaction, where the playbook setting up the mirror urls in jobs is now exhibiting similar behavior to how we saw wheel publication break before we fixed it | 19:22 |
clarkb | they use the same ansible versions though | 19:22 |
clarkb | it should all be the ansible version in zuul's venv for ansible | 19:22 |
clarkb | unless maybe they are doing nested ansible? | 19:23 |
fungi | it's happening in that job when opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml is invoked, doesn't look nested | 19:24 |
fungi | it's parented to tox-py39 | 19:25 |
clarkb | maybe we needed to update the client side and didn't realize it was broken after we fixed the generation side? | 19:26 |
clarkb | and it's just now getting bubbled up? | 19:26 |
fungi | that's what it seems like, i'm just trying to find where that happens | 19:26 |
fungi | it's possible all jobs using debian nodes are exhibiting this now | 19:26 |
opendevreview | Clark Boylan proposed opendev/system-config master: A file with tab delimited useful utf8 chars https://review.opendev.org/c/opendev/system-config/+/900379 | 19:27 |
fungi | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/defaults/main.yaml#L12 is where it's coming from, i think, we want ansible_distribution_major_version instead of ansible_distribution_version on debian now | 19:30 |
fungi | odd that centos is working though | 19:32 |
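A quick hedged way to see the two facts side by side; on a Debian 11 node ansible_distribution_version reports something like "11.8" while ansible_distribution_major_version is just "11", which matches the wheel path mismatch described above:

```bash
# dump the distribution version facts from a node
ansible localhost -m setup -a 'filter=ansible_distribution_*version'
```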
corvus | any idea why it's only showing up now? is it maybe the case that this mirror doesn't get used often? | 19:32 |
corvus | (it's unfortunate that job doesn't save tox logs so we can compare to successful runs) | 19:32 |
clarkb | corvus: I'm beginning to suspect its been broken for a while and noone noticed | 19:32 |
clarkb | we fixed the mirror generation side and the consumers didn't test it or if they did didn't check back in with us to say it doesn't work | 19:33 |
clarkb | but maybe zuul's build history can confirm | 19:33 |
corvus | the most recent runs succeeded | 19:33 |
corvus | https://zuul.opendev.org/t/openstack/builds?job_name=sysinv-tox-py39&project=starlingx/config | 19:33 |
fungi | yeah, i just noticed that too | 19:33 |
corvus | failure we're looking at is 3rd newest | 19:33 |
fungi | all for the same change | 19:34 |
corvus | that's what made me think that the "use mirror" path may not be used often? but we can't tell on the successful jobs | 19:34 |
tonyb | The successful one seemed to use a mirror: https://zuul.opendev.org/t/openstack/build/fe0825aa756945208f228b6c52c273d5/log/job-output.txt#747 | 19:36 |
tonyb | a non RAX mirror | 19:36 |
fungi | https://zuul.opendev.org/t/openstack/build/fe0825aa756945208f228b6c52c273d5/log/job-output.txt#747 is in a build of that job that succeeded | 19:36 |
tonyb | snap | 19:36 |
corvus | that mirror url is constructed with major.minor | 19:37 |
fungi | yeah, that's what i'm saying, not sure why it didn't cause a problem in that build | 19:37 |
corvus | so it seems like we are getting consistent behavior from the jobs in using that variable. | 19:37 |
corvus | maybe it's not using that particular wheel mirror; either getting it from somewhere else or building it? | 19:38 |
clarkb | https://etherpad.opendev.org/p/tvAyWLRV07MNayX3Bbc3 this is my draft to the gerrit repo-discuss list about cache stuff | 19:39 |
fungi | right, the wheel mirror may be unavailable/incorrect in both builds, the error for it in the failing build might be secondary and the real problem could be elsewhere | 19:39 |
clarkb | pip is supposed to fallback when it can't find additional indexes to the indexes it does fine | 19:39 |
clarkb | *it does find | 19:39 |
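Roughly how the job's pip index configuration is laid out, as a hedged sketch (the URLs are illustrative): the wheel mirror is configured as an extra index, so a package missing there should fall back to the primary index.

```bash
# the region mirror serves the primary index, with the per-distro wheel
# mirror as an extra index (paths are illustrative)
pip install \
    --index-url https://mirror.ord.rax.opendev.org/pypi/simple/ \
    --extra-index-url https://mirror.ord.rax.opendev.org/wheel/debian-11-x86_64/ \
    pygments
```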
corvus | anyway, it seems like that probably means we can exclude all of todays/yesterday's changes from the list of suspects; the fix is to update the variable; and the remaining mystery is why it only sometimes manifests (and i'm pretty sure saving the tox logs would help with that too, so i'd recommend that) | 19:40 |
clarkb | ++ | 19:40 |
clarkb | I'm going to need to stop and find food soon. I skipped breakfast and it is now almost lunch time and my body is protesting. I think remaining todos are to fire off that email to repo-discuss and then other tasks unrelated to gerrit like backup pruning and ord mirror swap out via dns | 19:46 |
corvus | clarkb: etherpad lgtm | 19:48 |
fungi | okay, it looks like this was the actual cause of the starlingx job failure: https://zuul.opendev.org/t/openstack/build/7b5008b924e247c7a1f3eb76fe96151f/log/job-output.txt#61245-61286 | 20:07 |
fungi | or immediately above there anyway... Could not fetch URL https://mirror-int.ord.rax.opendev.org/pypi/simple/pygments/: connection error: HTTPSConnectionPool(host='mirror-int.ord.rax.opendev.org', port=443): Read timed out. - skipping | 20:09 |
clarkb | I'm going to send that email to repo-discuss now. Thank you all for reading it | 20:23 |
fungi | sorry, just read it now but lgtm | 20:25 |
clarkb | fungi: I think we can land https://review.opendev.org/c/opendev/zone-opendev.org/+/901332 when you are ready | 20:27 |
clarkb | this will swap in the new ord mirror | 20:28 |
clarkb | fungi: and were you still planning to do backup pruning with tonyb today? | 20:28 |
clarkb | I'm about to eat lunch and expect to be afk for a bit | 20:28 |
clarkb | https://groups.google.com/g/repo-discuss/c/DTrYQtY0j1k/m/7riBbIa5BwAJ | 20:32 |
tonyb | heading to the gym now. back in about 90mins | 20:34 |
fungi | oh, yep. i'll be around when tonyb is back from the gym | 21:05 |
fungi | also approved 901332 | 21:06 |
opendevreview | Merged opendev/zone-opendev.org master: Switch CNAME for mirror.ord.rax to new mirror02 node https://review.opendev.org/c/opendev/zone-opendev.org/+/901332 | 21:12 |
fungi | deploy succeeded | 21:54 |
fungi | $ host mirror.ord.rax.opendev.org | 21:55 |
fungi | mirror.ord.rax.opendev.org is an alias for mirror02.ord.rax.opendev.org. | 21:55 |
fungi | https://mirror.ord.rax.opendev.org/ has a working ssl cert and i can browse it as expected | 21:57 |
tonyb | Awesome. | 22:05 |
fungi | tonyb: so when you have a moment, take a look at https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#backups if you haven't already | 22:31 |
fungi | currently the two backup servers are backup01.ord.rax.opendev.org and backup02.ca-ymq-1.vexxhost.opendev.org | 22:32 |
fungi | the active backup volume on both of those is /opt/backups-202010 | 22:32 |
fungi | when the active backup volume exceeds 90% used, we try to prune it | 22:33 |
fungi | which basically boils down to running /usr/local/bin/prune-borg-backups as root on the relevant backup server | 22:34 |
tonyb | fungi: Just got back | 22:34 |
tonyb | Ahh got it. I misread the docs and thought it was on the backup client. | 22:35 |
fungi | it will take a while (upwards of an hour or two) so safest to run it in a screen session in case your ssh connection gets interrupted | 22:35 |
tonyb | Okay | 22:35 |
fungi | you'll probably also want to crack open the prune-borg-backups script to see the goodness within | 22:35 |
tonyb | Okay so I'll start 2 80x24 terminals (one on each server) with sudo -s ; su - ; screen <CTRL>-H | 22:36 |
fungi | it does log the output, so hardcopy of the screen session, while it doesn't hurt, isn't all that necessary | 22:36 |
tonyb | Ah okay | 22:36 |
fungi | if you look in the script, you'll see it logs to a file in /opt/backups/ called prune-<timestamp>.log | 22:37 |
fungi | ultimately, the real command is a few lines from the end of the script, it's calling `/opt/borg/bin/borg prune ...` | 22:38 |
fungi | the rest is so much window dressing to save us from having to fiddle options | 22:39 |
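For a sense of what the wrapper wraps, a hedged sketch of a borg prune invocation of that shape; the retention policy and repo path are placeholders, not the script's actual settings:

```bash
# prune one backup repo, keeping a sliding window of archives
# (retention values and repo path are placeholders)
/opt/borg/bin/borg prune --verbose --list \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 12 \
    /opt/backups-202010/borg-review02/backup
```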
tonyb | Okay screen sessions created | 22:39 |
tonyb | I'm looking at the prune-borg-backup script on backup02.ca-ymq-1.vexxhost.opendev.org | 22:39 |
fungi | oh, and you might also want to take a peek at one of the prune logs just so you know what they include | 22:40 |
tonyb | Will do | 22:41 |
fungi | (they're extremely verbose, you won't really get much on stdout in your terminal other than whether it succeeded or, rarely and hopefully not, failed) | 22:41 |
fungi | you'll see it records the exit code of each prune command it runs too | 22:42 |
fungi | ideally rc 0 obviously | 22:42 |
fungi | there is a dry run option you might want to try first | 22:42 |
tonyb | Sounds good. | 22:43 |
fungi | `/usr/local/bin/prune-borg-backups noop` is the dry run syntax | 22:44 |
fungi | `/usr/local/bin/prune-borg-backups prune` is the actually do it syntax | 22:44 |
tonyb | Okay, that's going to take a small amount of time to digest, as the script does a 'read' so I was expecting it to prompt and wait etc, but you're passing the mode as an argument | 22:47 |
fungi | oh, actually yes | 22:47 |
fungi | you're right | 22:47 |
fungi | `/usr/local/bin/prune-borg-backups` and then enter "noop" | 22:48 |
tonyb | Okay | 22:48 |
fungi | if you enter anything other than "noop" or "prune" it will lol at you | 22:48 |
tonyb | I thought it was some mode of read I didn't know about | 22:48 |
fungi | yes, the special kind that i forget about when it's something i only run every few months ;) | 22:48 |
tonyb | LOL | 22:49 |
tonyb | So I understand the process as is; I need to do a little more reading to really get the way (and reasons) borg is set up, but so far it looks super neat | 22:53 |
tonyb | fungi: Should I serialize the servers or do them in parallel? It seems like parallel is "safe" | 22:54 |
fungi | yes, completely safe | 22:56 |
fungi | clients push similar backups to two different servers every day, we have them in different service providers just as an insurance policy | 22:57 |
fungi | they're purely for redundancy, not performance reasons | 22:58 |
fungi | usually we don't, simply because they tend not to fill up in the same week | 22:58 |
fungi | so there's almost only ever one that needs pruning at any given point in time anyway | 22:58 |
tonyb | https://borgbackup.readthedocs.io/en/stable/usage/prune.html says "Important: Repository disk space is not freed until you run borg compact." I don't see anywhere there is a compact run | 22:59 |
fungi | possible it's implied somehow | 23:00 |
tonyb | Okay. | 23:02 |
fungi | i mean, it visibly reduces disk space on the volume when we run prune, so it's happening somehow i guess | 23:03 |
tonyb | Yeah, Just trying to understand as much as I can. | 23:04 |
Clark[m] | We are still borg 1.6 iirc and those docs may be for 2.0 and maybe it changes? | 23:18 |
tonyb | Oh okay. I'll check that | 23:18 |
tonyb | These are the 1.2.6 docs | 23:23 |
tonyb | Okay so even the noop run takes $some_time | 23:29 |
tonyb | It looks like both servers are at > 90% so I'll prune them both | 23:30 |
fungi | yep, thanks! | 23:31 |
tonyb | The noop on backup01 returned success so starting the actual prune in 5 mins unless someone says "NO!" | 23:32 |
fungi | when i do it, i just leave it running and then check back on it after a few hours or the next morning | 23:32 |
fungi | i say go for it | 23:32 |
tonyb | That's my plan. | 23:32 |
tonyb | It's in a screen session as described | 23:33 |
tonyb | backup01 is pruning; screen:0 is where it's running, screen:1 is a tail -f of the log ... in case anyone wants to check in | 23:34 |
tonyb | ditto for backup02 | 23:35 |
tonyb | So WRT the mirror updates, mirror02.ord.rax is now the mirror for that region | 23:38 |
Clark[m] | I trust the script | 23:39 |
tonyb | IIUC Assuming there are no issues I need to remove mirror01.ord.rax from DNS, and then from system-config inventory and LE handlers and then delete the server | 23:39 |
tonyb | Once I've done that I can do the OVH ones (doing them in serial is just about caution, nothing technical) | 23:40 |
tonyb | as discussed mirror.dfw.rax will be last because extra careful | 23:41 |
tonyb | Sound about right? | 23:41 |
Clark[m] | Yup | 23:41 |
Clark[m] | We will also need to delete the volume attached to it | 23:42 |
Clark[m] | Otherwise I think that is complete and correct | 23:42 |
tonyb | <emote character="Mr Burns"> Excellent </emote> | 23:42 |
tonyb | That's a good point. | 23:43 |
* fungi nods in agreement | 23:43 | |
tonyb | Cool. I can do that next week. | 23:44 |
tonyb | I'll also reach out to rosmaita and the i18n SIG about the future of translate | 23:46 |
tonyb | I was looking at the wiki, I think with a little work we can use the mediawiki container images in the same way we do for many other services; that would at least decouple the OS upgrades and give us some testing. There is an image for the version we're running which *hopefully* will make it easy to switch to. From there we can look at rolling updates forward to get to a newer version. | 23:48 |
Clark[m] | I think the main thing is getting all the plugins and stuff going but maybe we don't care about the theming so much anymore | 23:50 |
Clark[m] | Re prune vs compact I'm confused. Maybe compact does extra cleanup beyond what prune does? In particular prune cleans up incomplete archives automatically; maybe that's all we clean up? | 23:52 |
tonyb | It is pruning other things | 23:52 |
tonyb | eg Pruning archive: review02-filesystem-2022-10-31T05:46:02 Mon, 2022-10-31 05:46:02 [SNIP] (39/39) | 23:53 |
tonyb | so I think it's more than incomplete archives | 23:53 |
tonyb | https://borgbackup.readthedocs.io/en/stable/usage/compact.html indicates it's useful after a prune. | 23:55 |
tonyb | So I'm going to suggest that early next week (perhaps right after the team meeting), we try running a compact on a repo and see what happens to the disk utilisation. | 23:56 |
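If the installed borg turns out to be 1.2 or newer, the follow-up step being proposed would look roughly like this (the repo path is a placeholder); on 1.2, prune only marks segments for deletion and compact is what actually frees the space:

```bash
# reclaim the space freed by an earlier prune (borg >= 1.2)
/opt/borg/bin/borg compact --progress /opt/backups-202010/borg-review02/backup

# then check the effect on the volume
df -h /opt/backups-202010
```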
tonyb | I don't think that Friday afternoon is a good time to play with that kind of thing ;P | 23:57 |
fungi | right there with you | 23:59 |