*** dmellado_ is now known as dmellado | 00:36 | |
clarkb | I'm still able to ssh into servers like codesearch and nl01 and eavesdrop01 | 00:56 |
clarkb | they all have a single port 22 entry now | 00:56 |
clarkb | (I think it was infra-prod-base that applied the updates) | 00:56 |
clarkb | I don't know that I'll still be functional when this deploy finishes in order to clean up old ansible stuff on bridge. Might have to just be very very careful tomorrow morning and clean up stuff that is a day old? | 00:58 |
clarkb | also the matrix oftc bridge seems to have died | 00:58 |
fungi | yep, the servers lgtm | 02:06 |
fungi | matrix bridge less so :/ | 02:07 |
Unit193 | fungi: Thanks for fixing things, btw! | 05:14 |
*** corvus is now known as Guest4101 | 09:45 | |
*** dviroel|ruck|out is now known as dviroel|ruck | 11:13 | |
*** diablo_rojo is now known as Guest4115 | 11:36 | |
fungi | Unit193: it's not fixed yet, is it? we're still working on debugging the problem afaik | 12:06 |
frickler | if mordred happens to show up again later, would be great if someone can point him to today's backlog in #openstack-sdks where he might be able to help (regression in sdk's caching caused by major release of decorator lib) | 12:38 |
frickler | of course anyone else who might be able to help is also welcome, the issue seems to be out of reach for my mediocre python skills | 12:39 |
opendevreview | Merged opendev/system-config master: Upgrade etherpad to 1.8.14 https://review.opendev.org/c/opendev/system-config/+/804136 | 14:31 |
Clark[m] | fungi: I'm making tea now but will be at a proper keyboard soon | 14:35 |
fungi | no worries, i think we've still got a while before the deployment happens | 14:35 |
fungi | etherpad just restarted | 14:39 |
fungi | i'm able to load an existing pad just fine | 14:40 |
clarkb | oh that was quicker than I expected but I'm at a keyboard now | 14:46 |
clarkb | let me load ssh keys and all that | 14:46 |
fungi | there's no rush, i doubt anyone's using meetpad right this moment | 14:47 |
fungi | and the infra-prod-service-etherpad job is still running | 14:47 |
clarkb | It loads in meetpad, but I'm not sure this is using the newer version as the colors don't touch between lines | 14:48 |
fungi | we haven't cleaned up the ansible processes on bridge yet, eh? load average there is ~12 | 14:48 |
clarkb | ya ansible was still going last night when I needed to call it a day | 14:49 |
fungi | i guess we can disable-ansible after this and clean up | 14:49 |
clarkb | fungi: ya the image we are running is not updated on prod | 14:50 |
fungi | wonder why it restarted in that case... checking the deploy log on bridge | 14:50 |
clarkb | I wonder if that is a race with the dockerhub indexes' eventual consistency | 14:51 |
fungi | it's at the docker-compose pull phase just now | 14:51 |
clarkb | oh weird are we going to restart twice? | 14:52 |
fungi | i wonder | 14:52 |
clarkb | fungi: looks like the pull started almost 15 minutes ago | 14:52 |
clarkb | related to the high system load maybe? | 14:52 |
fungi | that's what i suspect, yeah | 14:52 |
clarkb | I think the pull may have completed and then the restart just hasn't been properly logged yet? Because the timing lines up for the restart I think | 14:54 |
clarkb | I half suspect that we restarted after the pull because the mariadb image updated but the pull of the etherpad image didn't update due to docker hub races | 14:54 |
clarkb | if that is the case we should be able to safely pull and up -d manually once ansible is out of the way | 14:54 |
clarkb | fungi: it seems like ansible is making no progress at all | 14:59 |
clarkb | I'm going to start looking at process listings on bridge | 14:59 |
clarkb | fungi: `ps -elf | grep ansible | grep Aug12 | grep remote_puppet_else` maybe we start by killing those processes? | 15:00 |
clarkb | I'm going to start there. There are no remote puppet else jobs running and those are all from yesterday | 15:02 |
fungi | yeah, that should be plenty safe | 15:02 |
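A rough sketch of the cleanup being discussed; the grep pattern is the one quoted above, and the pkill step is illustrative (only sensible after confirming nothing current matches):

```bash
# Leftover ansible-playbook processes from Aug 12 still tied to the
# remote_puppet_else playbook (the pattern quoted above).
ps -elf | grep ansible | grep Aug12 | grep remote_puppet_else

# After confirming no remote_puppet_else job is actually running in zuul,
# one way to clear them is by pattern; double-check the ps output first.
pkill -f remote_puppet_else
```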
clarkb | fungi: also we should look for ssh connectivity problems as these tend to start from that | 15:03 |
*** Guest4101 is now known as notcorvus | 15:03 | |
*** notcorvus is now known as corvus | 15:03 | |
clarkb | I think if you ps and grep for the controlmaster processes you can find old ones that might indicate bad connectivity | 15:03 |
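A sketch of the kind of check being described, assuming ansible's default control path on bridge:

```bash
# ansible keeps its persistent ssh control sockets under ~/.ansible/cp/ by
# default; stale master processes that all point at one host suggest that
# host has connectivity trouble.
ps -ef | grep '[s]sh' | grep '\.ansible/cp'
```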
fungi | i also see a couple from Aug11 | 15:03 |
*** corvus is now known as reallynotcorvus | 15:04 | |
*** reallynotcorvus is now known as corvus | 15:04 | |
fungi | some of these are showing up as defunct too, so not sure if they'll be killable | 15:04 |
clarkb | logstash-worker11 and elasticsearch06 are maybe sad hosts | 15:05 |
clarkb | fungi: do you think you can check on those and reboot them while I dig through processes that we might be able to clean up? | 15:05 |
fungi | yeah, looking into them | 15:06 |
clarkb | elasticsearch02 maybe as well | 15:06 |
fungi | Connection closed by 2001:4800:7819:103:be76:4eff:fe04:b9d7 port 22 | 15:06 |
clarkb | next I'll clean out the base.yaml playbooks from august 12. That playbook doesn't seem to be running in zuul either | 15:07 |
fungi | yeah, all three of them are resetting connections on 22/tcp | 15:07 |
fungi | i'll check their oob consoles | 15:07 |
corvus | tristanC: http://eavesdrop01.opendev.org:9001/ is answering ... what's the URI for prometheus stats? and are you monitoring it now? do you have enough data to see if the connection issue is resolved? | 15:08 |
clarkb | All of the august 12 ansible processes seem to be cleaned up and load has fallen significantly | 15:11 |
clarkb | Looking at etherpad the job finished and it is still running the old image. I think we should manually pull and up -d as soon as we are happy with bridge | 15:11 |
fungi | all three of the servers you mentioned were showing hung kernel tasks reported on their consoles, i've hard rebooted them | 15:12 |
clarkb | thanks | 15:12 |
clarkb | those were the three IPs I saw with stale ssh control processes | 15:12 |
fungi | i can ssh into all three of them now, though i expect the elasticsearch data is shot | 15:13 |
clarkb | fungi: I would give it a bit to try and recover on its own (but check if the processes need to start) | 15:14 |
clarkb | then we can delete any corrupted indexes once it has had a chance to recover | 15:14 |
fungi | #status log Hard rebooted elasticsearch02, elasticsearch06, and logstash-worker11 as all three seemed to be hung | 15:14 |
opendevstatus | fungi: finished logging | 15:14 |
clarkb | ansible is busy now but all of the processes related to ansible on bridge seem to be current | 15:15 |
clarkb | ya elasticsearch doesn't seem to be running on 02 | 15:16 |
clarkb | fungi: should I start those processes? | 15:16 |
fungi | oh, yeah i forgot it doesn't start them automatically | 15:16 |
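For reference, bringing elasticsearch back by hand is roughly the following (the stock service unit name is assumed; the log doesn't show the exact command used):

```bash
# Start the service, then give the cluster a chance to recover on its own
# before deciding any indexes are corrupted.
sudo systemctl start elasticsearch
curl -s 'http://localhost:9200/_cluster/health?pretty'
```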
fungi | i guess once the prod-hourly builds complete, we can see if there are any lingering ansible processes | 15:17 |
fungi | fatal: [etherpad01.opendev.org]: FAILED! => { ... "cmd": "docker-compose up -d" | 15:18 |
clarkb | there are a set of base.yaml playbooks running with current timestamps however I see no associated job. I half wonder if we unstuck those processes by rebooting the servers | 15:19 |
fungi | maybe | 15:19 |
clarkb | fungi: ya but it definitely restarted the containers. I think from ansible's perspective it sees it as a failure but it did restart | 15:19 |
clarkb | fungi: that said I think our next step is to rerun pull on etherpad and up -d to get the image | 15:19 |
clarkb | fungi: do you want to do that or should I? | 15:19 |
fungi | i'll do that now | 15:19 |
clarkb | thanks | 15:19 |
fungi | ERROR: readlink /var/lib/docker/overlay2/l: invalid argument | 15:20 |
clarkb | you get that when trying to up the service? | 15:21 |
fungi | yes | 15:21 |
fungi | the compose file looks fine though, not truncated | 15:21 |
clarkb | stackoverflow says that is a corrupted image. https://stackoverflow.com/questions/55334380/error-readlink-var-lib-docker-overlay2-invalid-argument | 15:21 |
fungi | argh | 15:22 |
clarkb | fungi: can you up just the mariadb container and see if that starts? | 15:22 |
clarkb | if that starts then we can delete and repull the etherpad image | 15:22 |
fungi | yeah, that works | 15:22 |
clarkb | `sudo docker image rm 5dbd5f4908bd` then docker-compose pull again? | 15:23 |
fungi | already done, almost finished pulling | 15:23 |
fungi | that's better | 15:23 |
fungi | that looks like newer etherpad now | 15:24 |
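Roughly the recovery sequence described above, run on the etherpad host from the directory holding its compose file (the service name and image ID are the ones mentioned in the conversation):

```bash
sudo docker-compose up -d mariadb        # confirm the db container is fine on its own
sudo docker image rm 5dbd5f4908bd        # drop the corrupted etherpad image
sudo docker-compose pull                 # re-fetch a clean copy
sudo docker-compose up -d                # recreate the containers
```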
clarkb | https://etherpad.opendev.org/p/project-renames-2021-07-30 loads for me now and ya looks newer | 15:24 |
fungi | i loaded the same one | 15:25 |
fungi | i see you active on it | 15:25 |
clarkb | https://meetpad.opendev.org/isitbroken loads that etherpad for me and I can add text | 15:25 |
clarkb | I'm not too worried about the actual call as long as the pad loads there and it seems to | 15:25 |
fungi | joining | 15:25 |
fungi | also looks like recent improvements in jitsi-meet or etherpad (or both) have made the window embedding a bit more serviceable | 15:32 |
clarkb | meetpad and etherpad both seem happy. If you notice anything feel free to mention it | 15:32 |
fungi | clarkb: for the kata listserv, should i go ahead and start trying to create a server snapshot? | 15:33 |
clarkb | fungi: if you want to. The thing I'm always confused about is what do we need to do on the server to make it safe to boot the resulting snapshot? Do we disable and stop exim and mailman? | 15:34 |
clarkb | that is my biggest concern and I'm not completely up to speed on how all the file spooling works there to feel confident in doing it myself | 15:34 |
fungi | clarkb: i guess it's a question of what we want to do with the snapshot. if we just keep it as insurance in case the in-place upgrade goes sideways, we shouldn't need to disable anything because we wouldn't boot them both at the same time | 15:34 |
clarkb | oh ya I was thinking we would boot the snapshot and run through an upgrade on the booted snapshot | 15:35 |
clarkb | then do the upgrade on the actual server and it will serve as both a fallback and a test system | 15:35 |
fungi | in that case we could stop and disable the exim and mailman services while snapshotting, i guess | 15:36 |
clarkb | I figured doing that sort of thing with the lower traffic lists.kc.io would be less impactful | 15:37 |
clarkb | but then we'd get basically the same confidence out of upgrading it vs the prod snapshot | 15:37 |
fungi | sure, versions would be the same, though our multi-site setup wouldn't | 15:38 |
fungi | i also need to work out what to tweak in a dnm change to break lodgeit testing so i get a held paste equivalent for further troubleshooting the pastebinit regression | 15:39 |
clarkb | fungi: put an assert False in system-config testinfra/test_paste.py | 15:40 |
fungi | oh, yeah that'd do it | 15:40 |
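A minimal sketch of that suggestion; the test name is made up, and the point is only to make the system-config paste job fail so zuul's autohold keeps the node:

```bash
# From a system-config checkout, append an always-failing test, then push
# the result as a DNM change with an autohold set for the job.
cat >> testinfra/test_paste.py <<'EOF'

def test_dnm_force_failure(host):
    # deliberately fail so the held node stays around for debugging
    assert False
EOF
```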
clarkb | I'm going to go find something to eat now that etherpad seems happy, but then after will look at the lists.kc.io stuff if you haven't already done it | 15:40 |
tristanC | corvus: `curl http://eavesdrop01.opendev.org:9001/metrics | grep ssh_errors` shows no errors | 15:40 |
fungi | clarkb: sounds good, i'll get started temporarily disabling things there shortly | 15:41 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: Break paste for an autohold https://review.opendev.org/c/opendev/system-config/+/804535 | 15:42 |
corvus | tristanC: great, thanks! i'll work on an email / timeline for moving #zuul :) | 15:43 |
fungi | clarkb: i guess i can clear your autohold for the etherpad upgrade testing? | 15:45 |
fungi | mnaser: do you still need held nodes for multi-arch container debugging in node-labeler or uwsgi build errors in loci-keystone? | 15:47 |
Clark[m] | fungi yes you can clear my etherpad autohold | 15:50 |
fungi | thanks, done | 15:51 |
fungi | it's fun that autohold and autohold-list need --tenant but autohold-delete errors if you supply --tenant | 15:51 |
fungi | i should probably be using the standalone zuul-client instead of the rpc client anyway | 15:52 |
fungi | prod-hourly jobs are almost done, it's on the last one now. though it likely won't complete before the top of the hour | 15:53 |
fungi | regardless, there are no ansible processes on bridge older than a minute, so looks like cleanup was thorough | 15:54 |
fungi | and load average has dropped from 12 to around 1, so lots better | 15:55 |
fungi | the last job did wrap up its deployment tasks before the top of the hour, and i caught bridge with 0 ansible processes | 15:59 |
fungi | squeaky clean | 15:59 |
clarkb | just in time for the next hourly run | 16:00 |
fungi | indeed | 16:00 |
fungi | i've put lists.katacontainers.io into the deployment disable list, disabled and stopped the exim4.service and mailman.service units, and initiated image creation in the provider for the server now | 16:07 |
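The preparation steps described above, sketched with the service unit names from the log; the snapshot itself was taken through the provider's web UI, so the API equivalent is only a hedged illustration:

```bash
# Keep ansible away from the host during the snapshot (the disable list
# lives on bridge; its exact path isn't shown in this log), then quiesce
# mail handling so the image doesn't capture half-delivered messages.
sudo systemctl disable --now exim4.service mailman.service

# An API equivalent of the web UI snapshot would look something like:
#   openstack server image create --name lists-kc-pre-upgrade lists.katacontainers.io
```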
clarkb | fungi: oh the other question I had about that was what client do you use to talk to the rax snapshot api? | 16:08 |
clarkb | does it work with a modern osc? | 16:08 |
clarkb | also thank you! | 16:08 |
fungi | i just used their webui | 16:08 |
clarkb | ah | 16:08 |
fungi | since i already had it up for the oob console stuff on the hung servers a few minutes ago | 16:08 |
fungi | it's currently still "saving" | 16:14 |
fungi | imaging is complete, putting services back in place now | 16:19 |
clarkb | another benefit to using that server for this is much quicker snapshotting | 16:19 |
fungi | yes | 16:19 |
fungi | and it's back out of the disable list again | 16:20 |
clarkb | I need to do a bunch of paperwork type stuff today, but hopefully monday we can boot that and test an upgrade | 16:20 |
fungi | i also double-checked that services were running on it after starting | 16:20 |
fungi | wfm | 16:20 |
fungi | time to see if my well-laid trap caught a paste server | 16:20 |
fungi | we got one | 16:21 |
fungi | this is going to get tricky, pastebinit hard-codes server names, and also verifies ssl certs | 16:27 |
fungi | i'm starting to wonder if it's the server rename or redirect to https confusing it | 16:30 |
clarkb | fungi: try it with your browser to see? | 16:31 |
clarkb | with etherpad we had to set up /etc/hosts because of the redirect | 16:31 |
fungi | the browser's fine, and yeah i'm doing it with /etc/hosts entries to work around it | 16:31 |
clarkb | maybe use curl instead of pastebinit? | 16:33 |
clarkb | then you can control cert verification | 16:33 |
fungi | yeah, but i'll need to work out what pastebinit is passing to the method | 16:33 |
fungi | yep, i think i've confirmed it's the redirects | 16:34 |
fungi | i was able to use pastebinit with the held server by making the vhost no longer redirect from http to https | 16:34 |
fungi | thing is, pastebinit has a list of allowed hostnames, one of which is paste.openstack.org | 16:35 |
fungi | trying to use it with the name paste.opendev.org throws an error | 16:35 |
fungi | oh, though i think it may be due to the way the redirect was constructed | 16:36 |
fungi | we didn't redirect to https://paste.opendev.org/$1 we're just redirecting to the root url | 16:36 |
fungi | i'll see if that works with a more thorough redirect | 16:37 |
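One quick way to check what the redirect actually does to a pasted path, roughly as described above (the path used here is just an example):

```bash
curl -sI http://paste.opendev.org/show/800000/ | grep -i '^location:'
# A response of "Location: https://paste.opendev.org/" (path dropped) is the
# behaviour being described; a path-preserving redirect would echo the
# original path back in the Location header.
```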
fungi | yeah, no luck getting the redirect to work with pastebinit, but if i get rid of the redirect it's fine. just "tested" on the production server by editing its apache vhost config and that got pastebinit working | 16:49 |
fungi | also since we don't allow search engines to index the content there, and we don't support "secretly" pasting to it really, there's no real need to redirect from http to https | 16:51 |
fungi | i'll propose a change | 16:51 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Stop redirecting for the paste site https://review.opendev.org/c/opendev/system-config/+/804539 | 17:01 |
fungi | Unit193: ianw: clarkb: ^ that seems to be the fix | 17:01 |
clarkb | fungi: does pastebinit work with https:// too? | 17:01 |
fungi | it would i think, but we'd need to update the site entry at https://phab.lubuntu.me/source/pastebinit/browse/master/pastebin.d/paste.openstack.org.conf | 17:02 |
fungi | regexp = http://paste.openstack.org | 17:02 |
fungi | right now trying it results in the following error: | 17:03 |
fungi | Unknown website, please post a bugreport to request this pastebin to be added (https://paste.openstack.org) | 17:03 |
opendevreview | Jeremy Stanley proposed opendev/lodgeit master: Properly handle paste exceptions https://review.opendev.org/c/opendev/lodgeit/+/804540 | 17:09 |
fungi | and that's ^ the other bug i discovered in digging into the problem | 17:09 |
fungi | lest upstream just starts smacking down every bug report from someone using a distro package | 17:20 |
fungi | (which happens in lots of projects) | 17:20 |
clarkb | wrong window? :) | 17:21 |
fungi | hah, yep | 17:21 |
clarkb | looks like refstack had a backup failure. I'm hoping that it like lists is a one off internet is weird situation | 17:21 |
Unit193 | fungi: https://github.com/lubuntu-team/pastebinit/issues/6 isn't reassuring about the state of things. | 17:25 |
fungi | Unit193: well, regardless we'll strive to keep backward compatibility with old pastebinit versions, so once 804539 merges and deployed things should hopefully stay working | 17:26 |
fungi | (and it's temporarily working now, since i directly applied that change to the apache config to make sure it's sane) | 17:27 |
fungi | but thanks for the pointer to that github issue, i didn't know about arch using a fork... if i get a moment i'll file a bug in debian to suggest switching to the same fork | 17:28 |
Unit193 | The maintainer in Debian is the Lubuntu team guy... I may go poking around to see what I can find. | 17:29 |
fungi | ahh, yeah. also that fork on gh arch is using doesn't seem to actually differ from the revision history in the lubuntu phabricator | 17:31 |
fungi | Unit193: please let me know what you find, and thanks again for alerting us to the issue before i ran into it myself! | 17:33 |
Unit193 | Hah, sure thing. | 17:33 |
Unit193 | And thanks for taking errors over IRC too. | 17:33 |
fungi | my preference, really ;) | 17:34 |
fungi | clarkb: ianw: looks like lance has e-mailed us asking if we've seen new issues with leaked/stuck images in osuosl | 17:42 |
clarkb | | 0000041150 | 0000000001 | osuosl-regionone | ubuntu-focal-arm64 | ubuntu-focal-arm64-1628601595 | 7e23243b-aee2-4100-b702-d7e05f456606 | deleting | 01:00:50:58 | that might be a leak | 17:44 |
clarkb | looking at a cloud side image list there may be a few leaks there too | 17:45 |
clarkb | debian-bullseye-arm64-1627056483 for example | 17:46 |
clarkb | created_at | 2021-07-23T16:08:06Z for that bullseye image but I don't see it in nodepool | 17:47 |
clarkb | I can compile a list and see what others think of it | 17:47 |
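A sketch of how such a list might be compiled; the cloud name is a placeholder for whatever entry bridge's clouds.yaml uses for osuosl:

```bash
# Glance's view of the region...
openstack --os-cloud <osuosl-cloud> image list

# ...versus nodepool's view of its uploads there; provider image names that
# appear in glance but not here are candidate leaks.
nodepool image-list | grep osuosl-regionone
```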
clarkb | ubuntu-focal-arm64-1628601595 cannot be deleted because it is in use through the backend store outside of glance | 17:49 |
clarkb | server list shows no results though | 17:49 |
fungi | we're not doing bfv for the mirror or builder are we? | 17:57 |
clarkb | we might be, but those should use images we don't build | 17:59 |
clarkb | the builder is in linaro not osuosl. The osuosl mirror is booted from Ubuntu 20.04 (7ffbb2e7-d2f4-467a-9512-313a1c6b6afd) | 18:00 |
clarkb | I've got an email just about ready to send to Lance | 18:00 |
clarkb | sent | 18:02 |
fungi | thanks! | 18:16 |
*** dviroel|ruck is now known as dviroel|out | 19:42 | |
clarkb | fungi: looks like a bunch of hosts had backup failures? | 19:58 |
clarkb | both servers report they have disk space | 19:59 |
clarkb | looking at kdc03 the main backup failed but then the stream succeeded | 20:00 |
clarkb | Connection closed by remote host. Is borg working on the server? was the error | 20:00 |
clarkb | I'm going to try rerunning in screen on kdc03 | 20:02 |
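Roughly what the manual rerun looks like; the wrapper script name and log path are guesses at the convention rather than the exact files on kdc03:

```bash
# Rerun the host's backup inside screen so a dropped ssh session doesn't
# interrupt a long run, then check its log for the error mentioned below.
screen -S borg-rerun
sudo /usr/local/bin/borg-backup-<backup-server>.sh   # wrapper name assumed
sudo tail -n 100 /var/log/borg-backup-*.log          # log path also assumed
```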
clarkb | looking at the log more closely they all started about 2 hours before they errored | 20:03 |
fungi | mm, yeah refstack, storyboard, kdc03, translate, review | 20:04 |
fungi | also gitea01 twice (i guess one was the usual db backup failure?) | 20:04 |
fungi | all of those except kdc03 have mysql databases | 20:04 |
clarkb | kdc03 does do a stream backup of something though | 20:05 |
clarkb | that said running it manually succeeded | 20:05 |
clarkb | I suspect there was a network blip of some sort | 20:05 |
clarkb | fungi: note all of those started around 17:12 then timed out after 2 hours and reported failure around 19:12 | 20:05 |
clarkb | I suspect this isn't a persistent issue given the kdc03 rerun succeeded | 20:06 |
fungi | yeah, makes sense | 20:06 |
clarkb | fungi: but you can check the log on kdc03 to see what it did. It was failure on normal backup, success on stream, then a bit later (nowish) I reran and you get success for both | 20:06 |
clarkb | I guess check it tomorrow and see if things persist | 20:08 |
fungi | yeah, missing one day is unlikely to be catastrophic | 20:09 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/804460 reviewing that one would be good before the memory of what the renames were like becomes too stale :) | 20:11 |
fungi | sure, i should be able to take a look now, thanks | 20:26 |
fungi | clarkb: left one question on it, otherwise lgtm | 20:30 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update our project rename docs https://review.opendev.org/c/opendev/system-config/+/804460 | 20:33 |
clarkb | nice catch that was indeed meant to be rooted | 20:34 |
fungi | debian bullseye releasing this weekend, probably | 20:34 |
clarkb | exciting | 20:35 |
fungi | yeah, scheduled for tomorrow | 20:36 |