*** rchurch has quit IRC | 00:25 | |
*** rchurch has joined #opendev | 00:27 | |
*** DSpider has quit IRC | 04:02 | |
*** sgw has joined #opendev | 04:06 | |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: update grub cmdline to current kernel parameters https://review.opendev.org/735445 | 05:41 |
AJaeger | infra-root, seems we lost opensuse and centos mirrors, there's no such directory at https://mirror.bhs1.ovh.opendev.org/ | 05:51 |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: add more python selections to gentoo https://review.opendev.org/735448 | 06:16 |
yoctozepto | AJaeger: ay-yay, I was about to report that | 07:34 |
yoctozepto | I've got another question too: is it possible to get nodes with nested virtualization? as in being able to run kvm in them? | 07:37 |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:03 | |
AJaeger | yoctozepto: see https://review.opendev.org/#/c/683431/ - but we have only a limited number of these | 08:33 |
AJaeger | #status notice The opendev-specific CentOS and openSUSE mirrors disappeared and thus CentOS and openSUSE jobs are all broken. | 08:35 |
openstackstatus | AJaeger: sending notice | 08:35 |
-openstackstatus- NOTICE: The opendev-specific CentOS and openSUSE mirrors disappeared and thus CentOS and openSUSE jobs are all broken. | 08:35 | |
openstackstatus | AJaeger: finished sending notice | 08:38 |
yoctozepto | AJaeger: thanks, that will do - any reason for debian being missing? | 08:54 |
*** DSpider has joined #opendev | 09:09 | |
*** calcmandan has quit IRC | 09:36 | |
*** calcmandan has joined #opendev | 09:37 | |
*** iurygregory has joined #opendev | 10:07 | |
yoctozepto | AJaeger: same for c8 - if I just proposed a patch adding them, would it work? or does it need more setting up on the providers' side? | 10:10 |
*** tosky has joined #opendev | 10:59 | |
*** sshnaidm_ has joined #opendev | 12:34 | |
*** sshnaidm|afk has quit IRC | 12:34 | |
fungi | looks like afs01.dfw.openstack.org is down for some reason | 12:50 |
fungi | or at least entirely unreachable | 12:51 |
fungi | responds to ping but not ssh | 12:51 |
fungi | actually it gets as far as the key exchange, so i suspect something's up with its rootfs or process count | 12:52 |
fungi | we stopped getting snmp responses from it just before 20:00z | 12:53 |
fungi | cpu utilization and load average show a significant spike just before we lost contact too | 12:54 |
fungi | i'll check its oob console first for any sign of what's wrong, then i guess start following our steps in https://docs.opendev.org/opendev/system-config/latest/afs.html#recovering-a-failed-fileserver | 12:59 |
fungi | it's too bad our fileserver outages always seem to be of the pathological sort where afs failover doesn't actually kick in and clients continue trying to contact the unresponsive server instead of the other one | 13:00 |
fungi | hard to tell for sure from the console, but looks like it may have experienced an unexpected reboot since i see the end of fsck cleaning orphaned inodes from /dev/xvda1 | 13:10 |
fungi | though it's followed by a bunch of "task ... blocked for more than 120 seconds" kmesgs | 13:11 |
fungi | i'm going to try to reboot it as gracefully as possible, but in order to reduce the risk of additional write activity to the rw volumes i'm going to shut down the mirror-update servers first | 13:12 |
fungi | that's easier than trying to hold individual flocks for a bunch of volumes, commenting out cronjobs, adding hosts to the emergency disable list... | 13:13 |
fungi | #status log temporarily powered off mirror-update.opendev.org and mirror-update.openstack.org while working on afs01.dfw.openstack.org recovery process | 13:14 |
openstackstatus | fungi: finished logging | 13:14 |
fungi | the "send ctrl-alt-del" button in the oob console has no effect (unsurprisingly) so i'm trying a soft reboot via cli. odds are i'll have to resort to --hard though | 13:16 |
*** tosky has quit IRC | 13:17 | |
*** tosky has joined #opendev | 13:17 | |
fungi | yeah, doesn't seem to be having any effect. trying to openstack server reboot --hard now | 13:18 |
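A rough sketch of the sort of commands involved here, using standard openstackclient invocations (the exact server name and credentials are whatever the nova account for this host uses):

    openstack server reboot --soft afs01.dfw.openstack.org   # graceful attempt first
    openstack server reboot --hard afs01.dfw.openstack.org   # fall back when the soft reboot has no effect
    openstack console log show afs01.dfw.openstack.org | tail -n 50   # watch fsck progress on the console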
fungi | console says fsck was able to proceed normally | 13:19 |
fungi | i can ssh into it again | 13:20 |
fungi | #status log performed hard reboot of afs01.dfw.openstack.org | 13:20 |
openstackstatus | fungi: finished logging | 13:20 |
fungi | yeah, syslog indicates the server was either completely hung shortly after 19:35z or lost the ability to write to its rootfs | 13:23 |
yoctozepto | fungi: hi; do we really need to mirror the whole repos? would not it make more sense to focus on caching proxies? | 13:23 |
fungi | yoctozepto: how about we debate mirror redesign another time | 13:24 |
fungi | i've got a lot of cleanup to do here | 13:24 |
yoctozepto | fungi: sure, no problem, I feel you ;-) | 13:24 |
fungi | though in short, the reason for creating mirrors rather than using caching proxies is that distro package repositories need indices which match the packages served, and we had endless pain trying to rely on other mirrors of debian/ubuntu packages because they often served mismatched indices causing jobs to break (and proxying would just proxy that same problem) | 13:25 |
fungi | afs and generating indices from the packages present in the mirror was the solution we found to keep package and index updates atomic | 13:26 |
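A minimal sketch of that pattern, assuming a Debian-style volume named mirror.debian and reprepro as the index generator (the real update scripts live in opendev/system-config):

    # update packages and regenerate indices on the read-write path
    # (the dotted cell name is the conventional AFS read-write mount)
    reprepro --basedir /afs/.openstack.org/mirror/debian update
    # publish packages and indices together; clients only ever see the released read-only replicas
    vos release mirror.debian -localauth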
fungi | now that afs01.dfw has been rebooted, i am once again able to browse the centos and opensuse mirrors | 13:28 |
fungi | if anyone wants to double-check the stuff which they saw failing before is back to normal, that would help | 13:29 |
fungi | i'll work on making sure all the rw volumes are back to working order now | 13:29 |
fungi | `bos getlog -server afs01.dfw.openstack.org -file SalvageLog` tells me "Fetching log file 'SalvageLog'... bos: no such entity (while reading log)" | 13:36 |
yoctozepto | fungi: thanks for insights, I feared that may have been the cause | 13:36 |
fungi | looks like FileLog contains mention of some salvage operations though | 13:37 |
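For reference, that check is roughly the same bos getlog invocation that failed for SalvageLog above, just pointed at FileLog:

    bos getlog -server afs01.dfw.openstack.org -file FileLog | grep -i salvage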
fungi | #status notice Package mirrors should be back in working order; any jobs which logged package retrieval failures between 19:35 UTC yesterday and 13:20 UTC today can be safely rechecked | 13:39 |
openstackstatus | fungi: sending notice | 13:39 |
-openstackstatus- NOTICE: Package mirrors should be back in working order; any jobs which logged package retrieval failures between 19:35 UTC yesterday and 13:20 UTC today can be safely rechecked | 13:40 | |
fungi | so the FileLog for afs01.dfw says it scheduled salvage for the following volumes: 536870915, 536871029, 536870921, 536871065, 536870994, 536870937 | 13:41 |
openstackstatus | fungi: finished sending notice | 13:43 |
fungi | vos status says there are no active transactions on afs01.dfw.openstack.org so that's a good sign | 13:46 |
fungi | the volumes mentioned as getting a salvage scheduled are (in corresponding order): root.cell, project, service, mirror.logs, docs-old, mirror.git | 13:49 |
fungi | i'm not super worried about those as their file counts should be low (several are entirely unused and could even stand to be deleted) | 13:49 |
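Mapping the numeric IDs back to names can be done per volume with vos examine, e.g. for the first one listed:

    vos examine 536870915   # the output header shows the volume name (root.cell here), its sites and status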
fungi | i'll move on to performing manual releases of all the volumes to make sure they're releaseable | 13:50 |
fungi | 55 rw volumes | 13:53 |
fungi | oh, only 50 with replicas though | 13:56 |
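A rough sketch of how one might walk those volumes and release each in turn (volumes without read-only sites will just make vos release error, which is harmless here; -localauth assumes running as root on a server holding the AFS KeyFile):

    vos listvldb -server afs01.dfw.openstack.org | awk '/^[a-z]/ {print $1}' |
    while read -r vol; do
        vos release "$vol" -localauth
    done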
fungi | unfortunately the outage seems to have caught mirror.centos mid-release, so it's getting a full release now which will likely take a long time. i'll try to knock out the rest in parallel | 13:59 |
fungi | same for mirror.epel | 14:03 |
fungi | and mirror.fedora | 14:04 |
fungi | mirror.gem is taking a while to release, but that may be due to mirror.centos, mirror.epel and mirror.fedora being simultaneously underway | 14:10 |
fungi | worth noting, it seems afs01.dfw spends basically all its time at max bandwidth utilization these days: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=2362&rra_id=all | 14:13 |
auristor | fungi: it would be worth checking one or more of clients reading from afs01.dfw logging anything to dmesg during the outage | 14:13 |
fungi | auristor: good idea, will take a look now, thanks! | 14:14 |
auristor | sorry, the brain isn't functioning yet. there are some missing words in that sentence that didn't make it to the fingers. | 14:14 |
fungi | nah, i understood ;) | 14:15 |
fungi | [Sat Jun 13 19:56:49 2020] afs: Lost contact with file server 104.130.138.161 in cell openstack.org (code -1) (all multi-homed ip addresses down for the server) | 14:15 |
auristor | salvaging will be scheduled when the volume is first attached by the fileserver, so it would also be worth attempting to read from each volume | 14:15 |
fungi | [Sat Jun 13 19:57:00 2020] afs: file server 104.130.138.161 in cell openstack.org is back up (code 0) (multi-homed address; other same-host interfaces may still be down) | 14:16 |
mordred | fungi: oh good morning. anything I can help with? | 14:16 |
fungi | mordred: probably not at this point, just double-checking that all the volumes with replicas get back in sync | 14:17 |
auristor | is there any periodic logging of the "calls waiting for a thread" count from rxdebug for afs01.dfw.openstack.org ? | 14:18 |
fungi | unfortunately the fact that every release of the centos/epel/fedora mirrors seems to require hours to complete basically guarantees they're caught mid-release if the server with the rw volume goes offline | 14:18 |
fungi | auristor: none that i see in dmesg (this is using the lkm from openafs 1.8.5 with linux kernel 4.15, if it makes a difference) | 14:19 |
mordred | fungi: you know - when there is time next to breathe - it's possible the difference in how yum repos work compared to apt repos might make yoctozepto's suggestion of considering caching proxies for those instead of full mirrors reasonable (there's less of an apt-get update ; wait ; apt-get install pattern with that toolchain) | 14:20 |
fungi | oh, rxdebug... that's one of the cli tools. just a sec | 14:20 |
mordred | but - I agree - let's circle back around to that later :) | 14:20 |
auristor | someone might have set up a periodic run of "rxdebug <host> 7000 -noconn" and logged the "<n> calls waiting for a thread" number or used it as an alarm trigger. | 14:22 |
fungi | auristor: "0 calls waiting for a thread; 244 threads are idle; 0 calls have waited for a thread" so i think that's a no | 14:22 |
mordred | sounds like a good thing to add as a periodic thing | 14:23 |
auristor | when that number is greater than 0 it means that all of the worker threads have been scheduled an RPC to process. | 14:23 |
mordred | could grep out the 0 and send it to a graphite gauge | 14:23 |
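A quick sketch of that periodic check, assuming a statsd-style listener in front of graphite (host, port and metric name here are made up):

    waiting=$(rxdebug afs01.dfw.openstack.org 7000 -noconn | awk '/calls waiting for a thread/ {print $1; exit}')
    echo "stats.afs.afs01dfw.calls_waiting:${waiting}|g" | nc -u -w1 graphite.opendev.org 8125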
fungi | in this case i think it's just bandwidth-bound... the service provider caps throughput to 400mbps on this server instance | 14:23 |
auristor | what I suspect happened is that the vice partition disk failed in such a way that I/O syscalls from the fileserver never completed. | 14:23 |
auristor | Eventually all ~250 worker threads are scheduled an RPC that never completes and then incoming RPCs get placed onto the "waiting for a thread" queue. Since no workers complete, the waiting number goes up and up. | 14:24 |
fungi | oh, when it was offline, yes probably. i noticed that the state the server was in it responded to ping and i could complete ssh key exchange but my login process either never forked or hung indefinitely | 14:24 |
fungi | our snmp graphs indicate that in the minutes leading up to the server becoming unresponsive it maxed out cpu utilization and load average had started spiking way up | 14:25 |
fungi | also the out of band console did not produce a login prompt. there were kernel messages present on the console complaining about hung tasks, but no clue how old those were since they were timestamped by seconds since boot (and i didn't think to jot them down so i could try to calculate the offset from logs later) | 14:27 |
fungi | unfortunately there was nothing interesting in syslog immediately before it went silent. i expect either the logger got stuck or it ceased to be able to write to its rootfs | 14:29 |
fungi | anyway, i'm going to leave these volume releases in progress for now, i don't feel comfortable starting any more in parallel until at least one of them completes (hopefully mirror.gem won't take too much longer to finish) | 14:36 |
AJaeger | thanks, fungi! | 14:40 |
smcginnis | Could these AFS server issues be the root of the release job POST_FAILURES I posted about earlier? | 14:57 |
smcginnis | I did another one this morning without thinking to check on the status of that, and got another failure. | 14:58 |
auristor | using mirror.centos.readonly as an example: it has two replicas, one on afs01 and one on afs02. During a release the one on afs01 is available and the one on afs02 is offline. If afs01 dies, there are no copies available for clients to use. | 15:01 |
fungi | smcginnis: yes, looking at the timestamps, i expect rsync failed to write to the rw volume on afs01.dfw.o.o because the server was hung at that point | 15:31 |
fungi | unfortunately all we got out of rsync was a nonzero exit code and no helpful errors | 15:31 |
*** icarusfactor has joined #opendev | 16:08 | |
*** factor has quit IRC | 16:10 | |
*** icarusfactor has quit IRC | 16:27 | |
openstackgerrit | Mohammed Naser proposed opendev/system-config master: uwsgi-base: drop packages.txt https://review.opendev.org/735473 | 17:31 |
mnaser | mordred: ^ of your interest | 17:32 |
mordred | mnaser: ++ | 20:00 |
*** sgw has quit IRC | 20:00 | |
openstackgerrit | Mohammed Naser proposed openstack/project-config master: Add vexxhost/atmosphere https://review.opendev.org/735478 | 20:06 |
*** rchurch has quit IRC | 21:00 | |
*** rchurch has joined #opendev | 21:02 | |
*** sgw has joined #opendev | 22:48 | |
*** tkajinam has joined #opendev | 22:57 | |
*** iurygregory has quit IRC | 22:59 | |
ianw | fungi: thanks for looking in on that | 23:02 |
ianw | in news that seems unlikely to be unrelated, graphite is down too | 23:03 |
ianw | it has task hung messages and is non-responsive on the console | 23:06 |
ianw | it also drops out of cacti at 9am on the 13th | 23:07 |
fungi | hrm, i wonder if it could be related to the trove db outage, though that was back on, like, the 9th | 23:10 |
fungi | and yeah, nearly done with afs volume manual releases. the only ones running now are mirror.fedora, mirror.opensuse and mirror.yum-puppetlabs | 23:11 |
fungi | the mirror-update servers are still powered down | 23:11 |
ianw | #status log rebooted graphite.openstack.org as it was unresponsive | 23:12 |
openstackstatus | ianw: finished logging | 23:12 |
ianw | fungi: we've never gotten to the bottom of why the rsync mirrors take so long to release; i think we can probably group that in with this | 23:14 |
ianw | this bit on sleeping and clock skew was an attempt: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L126 | 23:16 |
ianw | note i have scripts for setting up logging @ https://opendev.org/opendev/system-config/src/branch/master/tools/afs-server-restart.sh | 23:17 |
ianw | but as auristor says, if the volume is releasing there's no redundancy | 23:18 |
ianw | and when you look at http://grafana.openstack.org/d/ACtl1JSmz/afs?viewPanel=12&orgId=1&from=now-30d&to=now | 23:18 |
ianw | basically each mirror takes about as long to release as the interval until its next pulse; i.e. they're basically always in the release process | 23:19 |
ianw | which also probably results in the network being flat out 100% of the time | 23:19 |
*** DSpider has quit IRC | 23:22 | |
*** tosky has quit IRC | 23:26 | |
fungi | and i guess we've already ruled out simple causes like rsync updating atime on every file or something | 23:48 |
fungi | looks like we mount /vicepa with the relatime option, i suppose we could set it to noatime (openafs doesn't utilize the atime info from the store anyway, from what i'm reading), though no idea if that will make any difference for the phantom content changes rsync seems to cause | 23:56 |
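The remount itself would be something like the following (the device name in the fstab line is illustrative):

    mount -o remount,noatime /vicepa
    # and to persist it across reboots, in /etc/fstab:
    # /dev/xvdb1  /vicepa  ext4  defaults,noatime  0  2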
fungi | mirror.yum-puppetlabs release finished, so we're down to just fedora and opensuse now | 23:58 |
ianw | fungi: from my notes; https://lists.openafs.org/pipermail/openafs-info/2019-September/042864.html | 23:59 |