corvus | some post_failures showing up. 1st one i checked failed on upload to ovh_gra | 15:26 |
---|---|---|
corvus | yeah 3/3 failed there | 15:28 |
opendevreview | James E. Blair proposed opendev/base-jobs master: Temporarily disable uploads to ovh_gra https://review.opendev.org/c/opendev/base-jobs/+/903351 | 15:30 |
SvenKieske | corvus: I guess this is also related? https://zuul.opendev.org/t/openstack/build/322e58959af645229a7e387686c6cab8 | 15:35 |
fungi | https://public-cloud.status-ovhcloud.com/incidents/ggsd08k3wlzn | 15:36 |
fungi | does it look auth related | 15:36 |
fungi | ? | 15:36 |
corvus | SvenKieske: yes it is | 15:36 |
corvus | fungi: can't tell, no_log=true | 15:36 |
fungi | i'm late for an errand, but can take a closer look in about an hour. also i approved 903351 but someone may need to bypass gating to merge it | 15:36 |
opendevreview | Merged opendev/base-jobs master: Temporarily disable uploads to ovh_gra https://review.opendev.org/c/opendev/base-jobs/+/903351 | 15:40 |
frickler | seems to have been lucky | 15:41 |
corvus | #status notice Zuul jobs reporting POST_FAILURE were due to an incident with one of our cloud providers; this provider has been temporarily disabled and changes can be rechecked. | 15:43 |
opendevstatus | corvus: sending notice | 15:43 |
-opendevstatus- NOTICE: Zuul jobs reporting POST_FAILURE were due to an incident with one of our cloud providers; this provider has been temporarily disabled and changes can be rechecked. | 15:43 | |
opendevstatus | corvus: finished sending notice | 15:46 |
clarkb | I've been thinking about the best way to force gitea09 to use ipv4 to talk to the vexxhost backup server. I think .ssh/config AddressFamily inet is the proper configuration but then I wonder if we should apply that to /root/.ssh/config on all servers running backups? Or should I constrain it to gitea09? If we want to do that on all backed up servers I could see that one day we might | 16:17 |
clarkb | have conflicting /root/.ssh/config configs for different needs/services though that isn't the case today | 16:17 |
clarkb | anyone else have an opinion or ideas on how best to tackle this? | 16:17 |
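A minimal sketch of the OpenSSH client option being weighed here; the host pattern and exact file layout are illustrative assumptions, not the actual opendev configuration (which, as noted below, is managed via ansible):

```
# /root/.ssh/config -- illustrative sketch only
Host backup*                      # assumed host pattern matching the borg backup target
    AddressFamily inet            # only use IPv4; the default is "any", "inet6" would force IPv6
```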
clarkb | separately but related: https://review.opendev.org/c/opendev/system-config/+/902842 should remove the old gerrit replication key from gitea. Do we want to go ahead and approve/land that now or fix backups first? | 16:19 |
fungi | https://public-cloud.status-ovhcloud.com/incidents/ggsd08k3wlzn now indicates they believe the problem in gra was resolved 14 minutes after 903351 merged | 16:23 |
fungi | looks like they believe it happened from 15:05 to 15:54 utc | 16:24 |
clarkb | oh looks like we already manage .ssh/config for backups | 16:29 |
corvus | fungi: i haven't checked to see if they're ovh, but i see post_failures going back to 13:xx utc. (more before that, but those jobs all have "docker-image" in their names so i suspect they are unrelated) | 16:30 |
fungi | is that the time the jobs started, or when the tasks failed? | 16:31 |
fungi | but yeah, some projects also have perpetually broken image uploads | 16:31 |
corvus | fungi: oh, start time, good point | 16:31 |
corvus | yeah, and spot checking the end times of some of the 13:xx jobs, they ended at 15:xx | 16:32 |
corvus | fungi: then i think we have high correlation with their outage times :) | 16:32 |
fungi | agreed | 16:33 |
corvus | clarkb: sgtm. i haven't followed 100%, but i take it the issue is something like streaming big stuff on these hosts over ipv6 is bad? | 16:33 |
opendevreview | Clark Boylan proposed opendev/system-config master: Force borg backups to run over ipv4 https://review.opendev.org/c/opendev/system-config/+/903356 | 16:33 |
clarkb | corvus: yup ipv6 connectivity seems to be having problems between vexxhost sjc1 gitea09 and the mtl01 backup server | 16:34 |
corvus | maybe 2024 will be the year of ipv6 and the linux desktop | 16:34 |
clarkb | ha. I finally got around to trying to RMA my laptop but lenovo said the turnaround time isn't quick enough for this trip I'm taking, so I'm delaying. In the meantime I discovered that if I boot with nomodeset that basically disables fancy gpu things and rendering "works". I just get a lower resolution than native with a different aspect ratio so things look weird and I can't lower the | 16:36 |
clarkb | (full) brightness. Oh and suspending doesn't actually save as much battery as it should | 16:36 |
clarkb | but I can limp along on that for a little bit longer | 16:36 |
clarkb | but my brother has the same laptop, and while I can reproduce the problem on an ubuntu live image, his laptop cannot. So I'm fairly certain the problem is device specific. | 16:37 |
clarkb | fungi: corvus: it is pretty easy to test if that region is working again by forcing that region in base-test | 16:38 |
frickler | might not be related, but I'm also having no route to vexxhost via IPv6 from my local DSL provider again (had that some years ago and took a long time to resolve) | 16:42 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add hints to borg backup error logging https://review.opendev.org/c/opendev/system-config/+/903357 | 16:43 |
opendevreview | James E. Blair proposed opendev/base-jobs master: Force base-test to upload to ovh_gra https://review.opendev.org/c/opendev/base-jobs/+/903358 | 16:45 |
opendevreview | Merged opendev/base-jobs master: Force base-test to upload to ovh_gra https://review.opendev.org/c/opendev/base-jobs/+/903358 | 16:51 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: DNM: exercise base-test https://review.opendev.org/c/zuul/zuul-jobs/+/903362 | 16:56 |
opendevreview | James E. Blair proposed opendev/base-jobs master: Revert "Temporarily disable uploads to ovh_gra" https://review.opendev.org/c/opendev/base-jobs/+/903365 | 17:11 |
corvus | all but one job in the test set has completed successfully; i think we can return to standard condition now. | 17:11 |
corvus | update: all jobs completed successfully | 17:12 |
clarkb | +2 from me. I won't fast approve this one though and hopefully someone else can check it too | 17:13 |
clarkb | frickler: fyi a fix for the "WIP is shown as merge conflicted in change listings" gerrit issue has merged against stable-3.9 | 18:10 |
frickler | clarkb: oh, nice, I wasn't aware that they agreed about this being a bug | 18:16 |
clarkb | frickler: https://gerrit-review.googlesource.com/c/gerrit/+/396899/ is the change | 18:17 |
clarkb | I'm going to take advantage of the lack of pineapple express rain to go on a bike ride in a bit. but still happy to be around to watch any of those changes linked above (restore ovh gra logs, remove ssh key from gitea, force backups on ipv4) either before or after that happens | 19:06 |
*** elodilles is now known as elodilles_pto | 21:01 | |
JayF | review.opendev.org is unreachable for me locally | 22:09 |
JayF | resolves to review02.opendev.org (199.204.45.33) | 22:09 |
NeilHanlon | same here (and via v6) | 22:10 |
JayF | routing inside level3 according to this MTR looks bananas | 22:10 |
JayF | https://home.jvf.cc/~jay/review-mtr-20231211.png | 22:11 |
JayF | I wonder if there's some kind of weird BGP thing going on | 22:12 |
JayF | because that feels like I'm being routed to the wrong area of the internet | 22:12 |
JayF | infra-root: FYI seemingly non-actionable incident appears to be ongoing with review.opendev.org at least with a portion of the internet, | 22:12 |
NeilHanlon | i'm checking my ripe atlas, but agree | 22:13 |
JayF | I'm confirming with other folks around the world review.opendev.org is down but generally other internet things aren't; I don't know what network that is on tho | 22:17 |
NeilHanlon | https://atlas.ripe.net/measurements/64730794#probes | 22:18 |
clarkb | I'm not sure it's a network thing yet. The mirror in that cloud region which has an IP addr (ipv4 anyway) in the same /24 range is reachable | 22:19 |
clarkb | the server reports it is shutoff according to the nova api | 22:20 |
JayF | clarkb: I can tell you generally my route to this server doesn't go through bell canada :) but maybe that's just something else weird happening simultaneously | 22:20 |
clarkb | fungi: corvus frickler tonyb should I try to start it again via the nova api? or do we want to investigate further first? | 22:21 |
clarkb | server show against the server doesn't indicate any in progress tasks | 22:21 |
clarkb | OS-EXT-STS:task_state | None <- I think that would tell us if they were doing a live migration for example | 22:22 |
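For reference, a sketch of how those fields can be queried with python-openstackclient; the server name used here and the comment values are illustrative (the actual UUID is quoted further down in the log):

```
# Pull just the state fields for the suspect instance
openstack server show review02.opendev.org \
    -f value -c status -c OS-EXT-STS:vm_state -c OS-EXT-STS:task_state
# A live migration in progress would normally report task_state "migrating";
# here it showed None alongside status SHUTOFF / vm_state stopped.
```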
NeilHanlon | yeah, looking again it appears traffic arrives where it needs to | 22:22 |
fungi | updated_at=2023-12-11T21:28:20Z status=SHUTOFF vm_state=stopped | 22:22 |
fungi | i'm guessing that's when it went down | 22:22 |
clarkb | that seems to align with cacti losing connectivity too | 22:23 |
fungi | yeah, i don't see anything else to investigate without trying to reboot it | 22:23 |
fungi | expect a lengthy wait for fsck to run | 22:23 |
clarkb | fwiw I don't see anything in cacti indicating that something snowballed out of control prior | 22:24 |
clarkb | fungi: should I server start it or do you want to do it? | 22:24 |
fungi | and we may need to connect to the console if fsck requires manual intervention | 22:24 |
fungi | go for it | 22:24 |
clarkb | the server came right up | 22:25 |
fungi | guilhermesp: mnaser: ^ heads up we found 16acb0cb-ead1-43b2-8be7-ab4a310b4e0a (review02.opendev.org) spontaneously shutdown in ca-ymq-1 at 21:28:20 utc according to the nova api | 22:26 |
fungi | reboot system boot Mon Dec 11 22:25 still running 0.0.0.0 | 22:26 |
corvus | o/ sorry i missed excitement | 22:27 |
fungi | the previous entry from last was me logging in for 43 minutes on 2023-12-06 | 22:27 |
fungi | so looks like there was nobody logged in at the time that occurred | 22:27 |
clarkb | corvus: well there may still be excitement | 22:27 |
NeilHanlon | JayF: fwiw, it appears that Level3 junk you're seeing is 'normal' -- fsvo normal. | 22:28 |
clarkb | docker reports gerrit failed to start around when I booted the server if I'm reading the docker ps -a output correctly but the last logs recorded by docker appear to be from when the server went down | 22:28 |
clarkb | fungi: corvus: should I docker-compose down && docker-compose up -d? | 22:28 |
corvus | couldn't a live migration have "finished" and not shown up in the task state? | 22:28 |
clarkb | corvus: maybe? | 22:28 |
corvus | hrm lemme look at docker | 22:28 |
JayF | NeilHanlon: interesting; I'm a relatively newish centurylink fiber customer so maybe I'm just not so used to that particularly quirky route | 22:28 |
clarkb | corvus: k | 22:28 |
fungi | last entries in syslog are from snmpd at 21:25:03, which was a few minutes before the shutdown | 22:29 |
fungi | skimming syslog leading up to the outage, i don't see anything amiss | 22:29 |
clarkb | I guess the other thing is whether or not we want to force a fsck | 22:30 |
clarkb | since it seems that no fsck was done | 22:30 |
ianw | is it possible it shutdown cleanly? | 22:30 |
corvus | i would generally trust ext4 on that... unless our paranoia level is 11? | 22:30 |
fungi | ianw: i would have expected a clean shutdown to leave some trace in syslog | 22:31 |
clarkb | corvus: ack | 22:31 |
NeilHanlon | JayF: yeah, the thing to look for is whether the packet loss is consistent between ASes -- i.e., if you have loss which continues from hop N all the way to the destination (or close to it) with no loss-free hops in between. in short: if the loss isn't consistent from point A to B, it's likely noise from that network's devices not liking to respond. you can | 22:31 |
NeilHanlon | sometimes get them to treat your traceroute traffic better if you send over tcp (mtr -P 443 --tcp review.opendev.org ) | 22:31 |
clarkb | I wonder if docker the container manager is recording that the containers failed when it came back up, hence the timestamp confusion, but it didn't actually try to restart them at that time | 22:32 |
fungi | the only fsck message in syslog is this one (about the configdrive i think?): | 22:32 |
fungi | Dec 11 22:25:19 review02 kernel: [ 6.735306] FAT-fs (vda15): Volume was not properly unmounted. Some data may be corrupt. Please run fsck. | 22:32 |
fungi | aha, it's /boot/efi | 22:33 |
corvus | clarkb: oh good theory. i don't have any better ideas. i don't see any docker logs suggesting it tried to start any containers. | 22:33 |
corvus | "StartedAt": "2023-12-06T21:39:27.061492779Z", | 22:34 |
corvus | "FinishedAt": "2023-12-11T22:25:24.402771369Z" | 22:34 |
fungi | can anyone confirm that /etc/fstab is set to not fsck any of our filesystems? | 22:34 |
corvus | clarkb: ^ i think those timestamps from `docker inspect ac1d7b309848` support your theory | 22:34 |
fungi | 2021-06-22 was the last modified date for /etc/fstab btw | 22:35 |
fungi | so it's been like this for 2.5 years | 22:35 |
clarkb | fungi: ya it seems the last field is 9 | 22:35 |
clarkb | s/9/0/ | 22:35 |
corvus | clarkb: i release my debugging hold, and i think it's okay to down/up (or maybe even just up; it will probably dtrt) once the fsck question is resolved. | 22:35 |
JayF | NeilHanlon: how, in context of an mtr/tracert, do you know where you swap AS | 22:35 |
clarkb | corvus: ack thanks for looking | 22:35 |
ianw | fungi: it is defaults 0 0 | 22:35 |
ianw | on the gerrit partition | 22:35 |
fungi | spot checking other servers, we do set fsck passno to 1 or 2 for non-swap filesystems | 22:36 |
clarkb | fungi: so ya should we set 1 on cloudimg-rootfs and /boot/efi and then 2 on /home/gerrit2? | 22:36 |
clarkb | and then reboot? | 22:36 |
fungi | clarkb: i think so, yes | 22:36 |
corvus | ++ to the pass fstab change | 22:36 |
NeilHanlon | JayF: DNS (if it's available), and modernish mtr has a `-z` flag which will do AS lookups | 22:36 |
fungi | ianw: right, "default" for the fsck passno field is 0, which means "don't fsck at boot" | 22:36 |
JayF | NeilHanlon: oh, nice :) I'm on gentoo so I better have the flag or else I can go bump the ebuild :D | 22:37 |
clarkb | fungi: I'll let you drive that | 22:37 |
NeilHanlon | :D | 22:37 |
clarkb | I suppose we could manually fsck /home/gerrit2 first without a reboot if we wanted | 22:37 |
NeilHanlon | JayF: this is a good listen (or read w/ linked slides) https://youtu.be/L0RUI5kHzEQ that taught me everything I've now forgotten about traceroute :D | 22:38 |
clarkb | the updated /etc/fstab looks correct to me | 22:38 |
fungi | infra-root: i've edited /etc/fstab on review02 now so that non-swap filesystems will get a fsck at boot | 22:38 |
fungi | rootfs and efi on passno 1, gerrit home volume on passno 2 | 22:39 |
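A sketch of roughly what the edited fstab lines would look like after that change; the device identifiers and mount options are assumptions for illustration, only the final pass-number column is the point:

```
# /etc/fstab: <source>     <mount point>   <type>  <options>  <dump>  <pass>
LABEL=cloudimg-rootfs      /               ext4    defaults   0       1
LABEL=UEFI                 /boot/efi       vfat    defaults   0       1
/dev/main/gerrit           /home/gerrit2   ext4    defaults   0       2
```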
fungi | shall i reboot the server now? | 22:39 |
clarkb | fungi: corvus: should we down the containers before we reboot? just to avoid any unexpected interactions? | 22:39 |
fungi | probably, yes | 22:39 |
clarkb | that is the only other thought I have before rebooting | 22:39 |
ianw | (i likely as not added that entry manually for the gerrit home ~ 2021-07 when we upgraded the host and afaik it wasn't an explicit choice to turn off fsck, i probably just typed 0 0 out of habit) | 22:40 |
clarkb | fungi: I think you should docker-compose down the containers to prevent them from trying to start until we are ready, then do a reboot | 22:41 |
fungi | ianw: well, the rootfs was also set to not fsck at boot | 22:41 |
corvus | clarkb: yes to downing | 22:41 |
fungi | downed now | 22:41 |
fungi | rebooting | 22:41 |
fungi | i also have the vnc console connected | 22:42 |
fungi | so i can watch the boot progress | 22:42 |
fungi | it's already up | 22:42 |
clarkb | yup, is there a good way to check if it fscked? I guess your boot console would tell you? | 22:43 |
fungi | Dec 11 22:42:23 review02 systemd-fsck[816]: gerrit: clean, 494405/67108864 files, 113090725/268434432 blocks | 22:43 |
fungi | from syslog | 22:43 |
clarkb | it did not fsck /; is the implication that the fs was not dirty and thus could be skipped? | 22:44 |
fungi | i can't tell, still looking | 22:45 |
clarkb | I guess that isn't too surprising since most of the server state is on the gerrit volume. One exception is syslog/journald though | 22:45 |
fungi | openstack console log show | 22:45 |
fungi | Begin: Will now check root file system ... fsck from util-linux 2.34 | 22:45 |
fungi | [/usr/sbin/fsck.ext4 (1) -- /dev/vda1] fsck.ext4 -a -C0 /dev/vda1 | 22:46 |
clarkb | oh I wonder if systemd-fsck can only fsck non / | 22:46 |
clarkb | and you need fsck before systemd for / | 22:46 |
clarkb | that could explain the logging being missing for / | 22:46 |
fungi | yep | 22:46 |
clarkb | fungi: does the console log show any complaints from fsck for / if not I think we're ok? | 22:47 |
tonyb | I think you can do something like tune2fs -l /dev/$device to see when it was last fsck'd | 22:47 |
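Roughly what tonyb is suggesting; the device path and the output lines shown as comments are illustrative, not captured from review02:

```
# List ext4 superblock metadata and pick out the check-related fields
tune2fs -l /dev/mapper/main-gerrit | grep -Ei 'last checked|mount count'
# Mount count:              1
# Maximum mount count:      -1
# Last checked:             Mon Dec 11 22:42:23 2023
```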
fungi | i did not find any errors in the console log from fsck | 22:47 |
fungi | just messages about systemd creating fsck.slice and listening on the fsckd communication socket | 22:47 |
clarkb | in that case I guess we can proceed with a docker-compose up -d? | 22:47 |
fungi | agreed, i'll do that now | 22:48 |
fungi | it's on its way up now | 22:48 |
clarkb | we didn't move the waiting/ queue dir aside so those exceptions are "expected" | 22:48 |
clarkb | there is also a very persistent ssh client that is failing to connect. But otherwise I think that startup log in error_log looked good | 22:49 |
clarkb | the web ui is up for me and reports the version we were running prior | 22:49 |
clarkb | so we didn't update gerrit (expected since we haven't made any image updates iirc) | 22:49 |
clarkb | maybe we should approve https://review.opendev.org/c/opendev/base-jobs/+/903365 and use that as a good canary of the whole approval -> CI -> merge/submit process? | 22:50 |
JayF | something is def. wrong | 22:52 |
JayF | https://review.opendev.org/c/openstack/governance/+/902585 does not have any comments loaded, for example | 22:53 |
NeilHanlon | maybe they're just in invisible ink now? | 22:53 |
JayF | it looks weirdly spooky, all the comment spots are there but empty | 22:53 |
tonyb | JayF: I found review.o.o to be very slow after the reboot | 22:53 |
clarkb | yes it has to reload caches | 22:53 |
clarkb | if it persists after 5 or 10 minutes then we should check again. Unfortunately this is "normal" which makes it hard to say if something is wrong | 22:54 |
tonyb | JayF: and I see many comments on that change FWIW | 22:54 |
JayF | ack; makes sense. First time I've been here to see when it first gets restarted, I think :D | 22:54 |
clarkb | you'll see it struggle to load diffs as well | 22:54 |
tonyb | clarkb: I'm happy to +2+A 903365 | 22:54 |
fungi | i've approved 903365 now | 22:54 |
tonyb | LOL | 22:54 |
NeilHanlon | comments do load, but takes a few seconds | 22:54 |
NeilHanlon | https://drop1.neilhanlon.me/irc/uploads/b238bf77f8924b48/image.png | 22:55 |
fungi | 903365 is showing on https://zuul.opendev.org/t/opendev/status with builds in progress | 22:55 |
fungi | eta 3 minutes | 22:55 |
clarkb | did we lose the bot? | 22:58 |
clarkb | the change merged but the bot doesn't seem to be connected (not surprising I guess) | 22:59 |
fungi | and its already replicated to https://opendev.org/opendev/base-jobs/commit/ddb3137 | 22:59 |
fungi | i'll restart the bot | 22:59 |
clarkb | thanks! | 22:59 |
fungi | yeah, container log indicates the last event the bot saw was at 21:27:13 utc, right when the server probably died | 23:03 |
clarkb | the ip spamming us with ssh connection attempts belongs to IBM according to whois | 23:04 |
clarkb | anyone know anyone at IBM? | 23:04 |
tonyb | Not that could help with that :/ | 23:05 |
clarkb | it's probably some ancient jenkins that everyone forgot about | 23:05 |
clarkb | I suspect the errors are due to its age | 23:05 |
tonyb | fungi: If you get a moment can you share a redacted mutt.conf for accessing the infra-root mail? | 23:05 |
fungi | status log Started the review.opendev.org server which appeared to have spontaneously shut down at 21:28 UTC, also corrected the fsck passno in its fstab, and restarted the Gerrit IRC/Matrix bot so they would start seeing change events again | 23:06 |
fungi | kinda wordy, look okay? | 23:06 |
tonyb | Sure, I think you can drop the passno text to shrink it a little | 23:07 |
fungi | would rather not forget that we fixed it to actually fsck on boot | 23:08 |
clarkb | lgtm | 23:08 |
fungi | #status log Started the review.opendev.org server which spontaneously shut down at 21:28 UTC, corrected the fsck passno in its fstab, and restarted the Gerrit IRC/Matrix bots so they'll start seeing change events again | 23:08 |
opendevstatus | fungi: finished logging | 23:08 |
fungi | whittled it down a smidge | 23:09 |
tonyb | Ummm I didn't actually get that message anywhere | 23:10 |
tonyb | and it finished logging very quickly | 23:11 |
fungi | tonyb: that's what status log does | 23:11 |
fungi | as opposed to notice or alert or okay which notify irc channels | 23:12 |
ianw | it appeared on mastodon which i was scrolling getting a tea :) | 23:12 |
tonyb | ooooo my mistake | 23:12 |
fungi | we usually try to avoid pestering every irc channel if there's no action required | 23:12 |
JayF | my hilight bar in weechat appreciates you :) | 23:12 |
fungi | and yeah, following https://fosstodon.org/@opendevinfra/ will still get them | 23:13 |
fungi | unrelated, all pypi.org accounts will require 2fa (and so also upload tokens) starting on 2024-01-01 | 23:16 |
*** dmellado2 is now known as dmellado | 23:16 | |
fungi | https://discuss.python.org/t/announcement-2fa-requirement-for-pypi-2024-01-01/40906 | 23:20 |
clarkb | github is like 2024-01-28 ish | 23:25 |
tonyb | I get that Jan-1st is a really nice line in the sand, but it really sucks because holiday season :/ | 23:27 |
NeilHanlon | new year, same problems 🙃 | 23:27 |