corvus1 | clarkb: fungi morning | 14:08 |
---|---|---|
corvus1 | looks like the restart has finished | 14:09 |
corvus1 | i've started a db dump from the old server | 14:10 |
corvus1 | that means this is the start time for the data loss window (jobs completed between now and when we finish won't be in the db) | 14:10 |
corvus1 | i've approved the zuul change to add the index hints (and also one other simple db related fix) | 14:14 |
Clark[m] | Ack | 14:22 |
corvus1 | dump is finished, copying now | 14:31 |
corvus1 | restore in progress on screen on zuul-db01. | 14:34 |
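For reference, the dump-and-restore being described roughly follows the shape below; the hostnames, paths, and flags are illustrative assumptions, not the exact commands that were run.

```shell
# On the old server: dump the zuul database (flags and paths are assumptions).
mysqldump --single-transaction --routines zuul > /root/zuul-dump.sql

# Copy the dump over to the new database server.
scp /root/zuul-dump.sql zuul-db01:/root/

# On zuul-db01: run the restore inside screen so it survives a disconnect.
screen -S import
mysql zuul < /root/zuul-dump.sql
```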
corvus1 | i think we still need the dburi change, yeah? | 14:35 |
corvus1 | oh heh it's actually just a secret var | 14:37 |
Clark[m] | Yes I think so | 14:37 |
corvus1 | so i'm going to edit secrets and update it there. | 14:37 |
Clark[m] | And we run zuul updates hourly so it should go into effect quickly if you don't do it by hand | 14:38 |
Clark[m] | Just keep that in mind I guess as it may go in before you're ready too? | 14:39 |
corvus1 | okay, the dburi is updated in secret hostvars. for now, let's see if the ansible deploy gets it before the import is done? and otherwise we can edit by hand. | 14:40 |
corvus1 | i don't think we're going to automatically restart zuul based on that file changing, right? so since all the zuul restarts are done now, i think we're good to let it go in any time? | 14:40 |
Clark[m] | Oh yes that is correct | 14:41 |
corvus1 | i suppose it's possible we might do a full-reconfigure? and that might reload the connections? i don't know | 14:41 |
Clark[m] | We trigger a zuul config reload but that applies to the tenant config not the ini config | 14:41 |
corvus1 | yeah, we only trigger a smart-reconfigure; i don't think that touches connections | 14:41 |
fungi | morning! | 14:45 |
fungi | looks like you started just as i was getting ready to grab a shower | 14:46 |
corvus1 | perfect timing! | 14:47 |
corvus1 | i think we're idle for another 90m or so right now. btw, import is running in screen. | 14:47 |
Clark[m] | The zuul hourly job will likely run in about a half an hour and again an hour after that | 14:48 |
Clark[m] | Just for timing on when the config might update. | 14:48 |
Clark[m] | The docs explicitly mention tenant config and don't say anything about connections configs in relation to smart reconfiguration | 14:50 |
corvus1 | ++ if i'm wrong about what zuul will do with the update, then i would expect it to simply fail inserts due to missing tables; or, if we happen to have all the tables a query needs, it might insert rows anyway (that's okay because the schema has the insert id in it, so they would go at the end); or it might hit a table lock and just freeze until that table is done. | 14:50 |
corvus1 | i went through the code to double check, and it does reload the zuul.conf file, and it does tell the connections that a reconfiguration has happened, but the sql driver ignores that event since it doesn't typically need to do anything. | 14:51 |
corvus1 | my takeaway is that it shouldn't affect us, but also, if we wanted to make it easy to update the dburi without a restart, that would be a very simple change. :) | 14:52 |
Clark[m] | It seems like keeping the database update explicit is useful for reasoning about data integrity. But I can see how reloading without a restart would be useful in some situations | 14:53 |
Clark[m] | (like with redundant active standby DB setups) | 14:54 |
corvus1 | yup | 14:54 |
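A minimal sketch of the reconfigure commands discussed above, assuming a containerized scheduler (the compose service name is an assumption). Per the code check, neither command re-establishes the SQL connection, so a new dburi still requires a scheduler restart to take effect.

```shell
# Re-reads the tenant config (and zuul.conf), but the SQL driver ignores the event:
docker-compose exec scheduler zuul-scheduler smart-reconfigure

# A full reconfiguration behaves the same way with respect to connections:
docker-compose exec scheduler zuul-scheduler full-reconfigure
```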
corvus1 | artifacts done; halfway through builds | 15:31 |
fungi | thinking either my terminal is too small or i'm on the wrong window of the screen session | 15:32 |
fungi | this is on zuul-db01 yeah? | 15:32 |
fungi | there are like 4 windows but the first one seems blank, i'm assuming that's where the progress output is showing up for the rest of you | 15:33 |
corvus1 | yeah, it's a big window sorry | 15:34 |
corvus1 | i've resized mine to 80x24 now | 15:34 |
corvus1 | when it's done, something should show up in the first blank one; that's the import | 15:34 |
corvus1 | i'm using the second one to show processlist to check progress | 15:35 |
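One way that progress check in the second window might look; the exact invocation is a guess.

```shell
# Poll the running import from another screen window (the interval is arbitrary):
watch -n 10 'mysql -e "SHOW FULL PROCESSLIST"'
```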
corvus1 | zuul changes merged | 15:52 |
fungi | no worries, i don't think screen handles dynamic resizing anyway. maybe someday we can talk about using tmux for similar purposes, but it's not a big deal | 15:54 |
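For anyone joining the shared session, a quick sketch of the screen basics involved here; this is generic usage, not the exact keystrokes anyone used.

```shell
# Attach to the shared root screen session without detaching other viewers:
sudo screen -x

# Inside screen:
#   Ctrl-a "   list windows and switch to the import or processlist one
#   Ctrl-a F   fit the current window to your own terminal size
```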
corvus1 | looks like it's done | 16:24 |
corvus1 | i'm going to docker-compose pull on the schedulers, then check the image tags and also dburi and make sure everything is staged | 16:26 |
Clark[m] | ++ | 16:28 |
corvus1 | the image log looks correct, has the latest change id | 16:29 |
corvus1 | dburi looks correct | 16:29 |
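The pull-verify-restart sequence being worked through looks roughly like this; the compose directory and service names are assumptions for illustration.

```shell
# On each scheduler host (zuul01, then zuul02):
cd /etc/zuul-scheduler        # assumed location of the compose file
docker-compose pull           # stage the new image

# After checking the image tag and dburi, restart one component at a time:
docker-compose up -d web        # zuul-web first, host by host
docker-compose up -d scheduler  # then the schedulers, host by host
```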
corvus1 | i'm going to restart zuul-web on zuul01 now | 16:30 |
fungi | cool | 16:32 |
corvus1 | we're going from dev82 to dev85 | 16:34 |
corvus1 | zuul01 web is up, restarting zuul02 web now | 16:35 |
corvus1 | https://zuul.opendev.org/t/openstack/builds works now and is instantaneous for me | 16:36 |
corvus1 | restarting zuul01 scheduler now | 16:37 |
fungi | ~instantneous for me as well | 16:37 |
Clark[m] | And me | 16:37 |
fungi | instantaneous | 16:37 |
corvus1 | faster than you can spell instantaneous for sure | 16:37 |
fungi | faster than i can type it at the very least | 16:39 |
corvus1 | both web servers are up now | 16:41 |
corvus1 | scheduler 1 is up, restarting scheduler 2 now | 16:42 |
corvus1 | that means we should be at the end of the data loss period | 16:42 |
fungi | right | 16:42 |
fungi | writing all further results to the new (temporary) database now | 16:43 |
corvus1 | so job results will be missing from the database and webui from about 14:10 to 16:43 on 2024-04-06 | 16:43 |
fungi | thanks! i was about to calculate that as well, saved me the bother | 16:44 |
corvus1 | how about: status notice Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI. Recheck changes affected by this if necessary. | 16:45 |
fungi | lgtm | 16:45 |
corvus1 | #status notice Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI. Recheck changes affected by this if necessary. | 16:46 |
corvus1 | maybe i'm not authed? | 16:47 |
fungi | checking | 16:47 |
fungi | it's in channel at least | 16:47 |
fungi | #status notice Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI. Recheck changes affected by this if necessary. | 16:48 |
opendevstatus | fungi: sending notice | 16:48 |
-opendevstatus- NOTICE: Zuul job results from 14:10 to 16:43 on 2024-04-06 may be unavailable in Zuul's web UI. Recheck changes affected by this if necessary. | 16:48 | |
*** corvus is now known as Guest269 | 16:48 | |
*** corvus1 is now known as corvus | 16:48 | |
fungi | yeah, corvus1 isn't logged in according to /whois | 16:48 |
corvus | i am now :) | 16:48 |
corvus | https://zuul.opendev.org/t/openstack/build/7cdf35ccdddb46fd804638d4c8305516 is post-migration | 16:49 |
fungi | agreed | 16:49 |
Clark[m] | And now we can see if general responsiveness improves as well | 16:49 |
corvus | all components are up now | 16:49 |
fungi | awesome! heating up the wok to make lunch, but will be around to keep an eye out for any problem reports | 16:49 |
opendevstatus | fungi: finished sending notice | 16:50 |
corvus | i dequeued the sandbox post items | 16:51 |
corvus | there's a periodic item in there now that's stuck on a node request? | 16:51 |
corvus | doesn't look like i can dequeue it | 16:52 |
Clark[m] | That's the change vlotorev got in to reproduce that problem... | 16:52 |
Clark[m] | Or at least I think it is /me double checks | 16:53 |
corvus | the ones i dequeued, yes; not the periodic one though | 16:53 |
Clark[m] | Ya the sandbox change | 16:53 |
corvus | there were 3 queue items. 2 in "post" one in "periodic". i dequeued the 2 in "post". the one in "periodic" remains because i am unable to dequeue it through the web ui. | 16:54 |
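The CLI equivalent of that dequeue would look something like the following; the ref is a guess, since periodic items are enqueued by ref rather than by change, and an authenticated zuul-client configuration is assumed.

```shell
# Hypothetical invocation against the openstack tenant:
zuul-client dequeue --tenant openstack --pipeline periodic \
    --project opendev/sandbox --ref refs/heads/master
```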
Clark[m] | And it will return if we don't revert the change in sandbox first. We should probably do that then figure out the dequeue | 16:55 |
corvus | they will return the next time someone merges 2 changes in sandbox in rapid succession. but yes, it should all be reverted. | 16:55 |
Clark[m] | Ya if we revert then the problem goes away for new trigger events | 16:56 |
Clark[m] | And I plan to update the change to address this monday | 16:57 |
Clark[m] | Which may also allow them to run normally | 16:57 |
corvus | it's a real travesty what happened to that exquisite corpse | 17:06 |
corvus | remote: https://review.opendev.org/c/opendev/sandbox/+/915197 Revert to an earlier state of the repo [NEW] | 17:13 |
corvus | remote: https://review.opendev.org/c/opendev/sandbox/+/915198 Update test.py for python3 [NEW] | 17:13 |
corvus | Clark: fungi ^ some cleanup | 17:13 |
Clark[m] | corvus: I guess there wasn't one specific change we could revert? I don't see a periodic job in the zuul.yaml diff | 17:18 |
corvus | Clark: oh i'm sure that could be smaller, but the repo was a mess. just doing 2 things at once. | 17:19 |
Clark[m] | I see | 17:19 |
fungi | also worth consideration: a dedicated sandbox tenant in zuul for further isolation | 17:22 |
corvus | maybe with only check and gate pipelines | 17:23 |
fungi | yeah, that would make sense. simplicity helps the demonstration | 17:26 |
mnaser | https://releases.openstack.org/ is not responding to me | 21:30 |
mnaser | cc infra-root ^ | 21:32 |
mnaser | it seems like it's responding very slowly now :X | 21:33 |
fungi | looking | 22:55 |
fungi | it's responding instantly to me now, but i'll check the system resource graphs for static.o.o | 22:56 |
fungi | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=71254&rra_id=all | 22:57 |
fungi | load average spiked right when mnaser was noticing issues | 22:58 |
fungi | there was also a connection count spike, but that could be a symptom of connections waiting much longer than usual for a response | 22:58 |
fungi | server started complaining about losing contact with afs01.dfw (in the same cloud provider and region) at 21:12:12 utc, continuing with some frequency until 21:23:46 utc | 23:01 |
fungi | [Sat Apr 6 21:31:06 2024] INFO: task jbd2/dm-0-8:549 blocked for more than 120 seconds. | 23:03 |
fungi | that's the start of errors in dmesg on afs01.dfw | 23:03 |
fungi | "This message is to inform you that our monitoring systems have detected a problem with the server which hosts your Cloud Block Storage device, afs01.dfw.opendev.org/main04, [...] at 2024-04-06T21:34:09.152730." | 23:06 |
fungi | we also got a "returned to service" update on that same ticket at 21:59:21 utc | 23:07 |
fungi | i guess i'll reboot afs01.dfw and then, if lvm comes back up clean, see what might be stuck | 23:09 |
mnaser | yeah looked very io-y :\ | 23:09 |
fungi | i/o errors in dmesg ceased at 21:45:10 utc, looks like | 23:09 |
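A quick way to scan for the hung-task and i/o errors mentioned above, with human-readable timestamps; illustrative, not necessarily the exact check used.

```shell
# Run on afs01.dfw:
dmesg -T | grep -Ei 'blocked for more than|i/o error'
```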
fungi | server's back up now | 23:20 |
fungi | reboot took about 10 minutes | 23:20 |
fungi | looks like /vicepa came up mounted and writeable | 23:22 |
fungi | [ OK ] Finished File System Check on /dev/main/vicepa. | 23:24 |
fungi | boot.log confirms it did get checked | 23:24 |
fungi | `vos status` says "No active transactions on 104.130.138.161" | 23:26 |
fungi | same for the other two fileservers, so nothing to abort | 23:27 |
fungi | i'll start a root screen session and work through manually releasing the list of volumes | 23:29 |
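The manual release pass might look roughly like the loop below; the volume listing, filter, and auth flag are assumptions, and as noted later, volumes without read-only replicas are skipped.

```shell
# Rough sketch only, run on an AFS admin host:
for vol in $(vos listvldb | awk '/^[a-z]/ {print $1}'); do
    # only release volumes that actually have read-only replica sites
    if vos listvldb -name "$vol" | grep -q 'RO Site'; then
        vos release "$vol" -localauth
    fi
done
```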
fungi | about 1/3 of the way through now | 23:40 |
fungi | about 2/3 of the way through now | 23:48 |
fungi | and done | 23:56 |
fungi | except for the volumes without any read-only replicas configured, of course, which i skipped | 23:56 |