*** rchurch has quit IRC | 00:25 | |
*** rchurch has joined #opendev | 00:27 | |
*** DSpider has quit IRC | 04:02 | |
*** sgw has joined #opendev | 04:06 | |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: update grub cmdline to current kernel parameters https://review.opendev.org/735445 | 05:41 |
AJaeger | infra-root, seems we lost opensuse and centos mirrors, there's no such directory at https://mirror.bhs1.ovh.opendev.org/ | 05:51 |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: add more python selections to gentoo https://review.opendev.org/735448 | 06:16 |
yoctozepto | AJaeger: ay-yay, I was about to report that | 07:34 |
yoctozepto | I've got another question too: is it possible to get nodes with nested virtualization? as in being able to run kvm in them? | 07:37 |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:03 | |
AJaeger | yoctozepto: see https://review.opendev.org/#/c/683431/ - but we have only a limited number of these | 08:33 |
AJaeger | #status notice The opendev-specific CentOS and openSUSE mirrors disappeared and thus CentOS and openSUSE jobs are all broken. | 08:35 |
openstackstatus | AJaeger: sending notice | 08:35 |
-openstackstatus- NOTICE: The opendev-specific CentOS and openSUSE mirrors disappeared and thus CentOS and openSUSE jobs are all broken. | 08:35 | |
openstackstatus | AJaeger: finished sending notice | 08:38 |
yoctozepto | AJaeger: thanks, that will do - any reason for debian being missing? | 08:54 |
*** DSpider has joined #opendev | 09:09 | |
*** calcmandan has quit IRC | 09:36 | |
*** calcmandan has joined #opendev | 09:37 | |
*** iurygregory has joined #opendev | 10:07 | |
yoctozepto | AJaeger: same for c8 - if I just proposed a patch adding them, would it work? or does it need more setting up on the providers' side? | 10:10 |
*** tosky has joined #opendev | 10:59 | |
*** sshnaidm_ has joined #opendev | 12:34 | |
*** sshnaidm|afk has quit IRC | 12:34 | |
fungi | looks like afs01.dfw.openstack.org is down for some reason | 12:50 |
fungi | or at least entirely unreachable | 12:51 |
fungi | responds to ping but not ssh | 12:51 |
fungi | actually it gets as far as the key exchange, so i suspect something's up with its rootfs or process count | 12:52 |
fungi | we stopped getting snmp responses from it just before 20:00z | 12:53 |
fungi | cpu utilization and load average show a significant spike just before we lost contact too | 12:54 |
fungi | i'll check its oob console first for any sign of what's wrong, then i guess start following our steps in https://docs.opendev.org/opendev/system-config/latest/afs.html#recovering-a-failed-fileserver | 12:59 |
fungi | it's too bad our fileserver outages always seem to be of the pathological sort where afs failover doesn't actually kick in and clients continue trying to contact the unresponsive server instead of the other one | 13:00 |
fungi | hard to tell for sure from the console, but looks like it may have experienced an unexpected reboot since i see the end of fsck cleaning orphaned inodes from /dev/xvda1 | 13:10 |
fungi | though it's followed by a bunch of "task ... blocked for more than 120 seconds" kmesgs | 13:11 |
fungi | i'm going to try to reboot it as gracefully as possible, but in order to reduce the risk of additional write activity to the rw volumes i'm going to shut down the mirror-update servers first | 13:12 |
fungi | that's easier than trying to hold individual flocks for a bunch of volumes, commenting out cronjobs, adding hosts to the emergency disable list... | 13:13 |
fungi | #status log temporarily powered off mirror-update.opendev.org and mirror-update.openstack.org while working on afs01.dfw.openstack.org recovery process | 13:14 |
openstackstatus | fungi: finished logging | 13:14 |
fungi | the "send ctrl-alt-del" button in the oob console has no effect (unsurprisingly) so i'm trying a soft reboot via cli. odds are i'll have to resort to --hard though | 13:16 |
*** tosky has quit IRC | 13:17 | |
*** tosky has joined #opendev | 13:17 | |
fungi | yeah, doesn't seem to be having any effect. trying to openstack server reboot --hard now | 13:18 |
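A rough sketch of the sort of commands involved here, using standard openstackclient invocations (the exact server name and credentials are whatever the nova account for this host uses):

    openstack server reboot --soft afs01.dfw.openstack.org   # graceful attempt first
    openstack server reboot --hard afs01.dfw.openstack.org   # fall back when the soft reboot has no effect
    openstack console log show afs01.dfw.openstack.org | tail -n 50   # watch fsck progress on the console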
fungi | console says fsck was able to proceed normally | 13:19 |
fungi | i can ssh into it again | 13:20 |
fungi | #status log performed hard reboot of afs01.dfw.openstack.org | 13:20 |
openstackstatus | fungi: finished logging | 13:20 |
fungi | yeah, syslog indicates the server was either completely hung shortly after 19:35z or lost the ability to write to its rootfs | 13:23 |
yoctozepto | fungi: hi; do we really need to mirror the whole repos? would not it make more sense to focus on caching proxies? | 13:23 |
fungi | yoctozepto: how about we debate mirror redesign another time | 13:24 |
fungi | i've got a lot of cleanup to do here | 13:24 |
yoctozepto | fungi: sure, no problem, I feel you ;-) | 13:24 |
fungi | though in short, the reason for creating mirrors rather than using caching proxies is that distro package repositories need indices which match the packages served, and we had endless pain trying to rely on other mirrors of debian/ubuntu packages because they often served mismatched indices causing jobs to break (and proxying would just proxy that same problem) | 13:25 |
fungi | afs and generating indices from the packages present in the mirror was the solution we found to keep package and index updates atomic | 13:26 |
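A minimal sketch of that pattern, assuming a Debian-style volume named mirror.debian and reprepro as the index generator (the real update scripts live in opendev/system-config):

    # update packages and regenerate indices on the read-write path
    # (the dotted cell name is the conventional AFS read-write mount)
    reprepro --basedir /afs/.openstack.org/mirror/debian update
    # publish packages and indices together; clients only ever see the released read-only replicas
    vos release mirror.debian -localauth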
fungi | now that afs01.dfw has been rebooted, i am once again able to browse the centos and opensuse mirrors | 13:28 |
fungi | if anyone wants to double-check the stuff which they saw failing before is back to normal, that would help | 13:29 |
fungi | i'll work on making sure all the rw volumes are back to working order now | 13:29 |
fungi | `bos getlog -server afs01.dfw.openstack.org -file SalvageLog` tells me "Fetching log file 'SalvageLog'... bos: no such entity (while reading log)" | 13:36 |
yoctozepto | fungi: thanks for insights, I feared that may have been the cause | 13:36 |
fungi | looks like FileLog contains mention of some salvage operations though | 13:37 |
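For reference, that check is roughly the same bos getlog invocation that failed for SalvageLog above, just pointed at FileLog:

    bos getlog -server afs01.dfw.openstack.org -file FileLog | grep -i salvage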
fungi | #status notice Package mirrors should be back in working order; any jobs which logged package retrieval failures between 19:35 UTC yesterday and 13:20 UTC today can be safely rechecked | 13:39 |
openstackstatus | fungi: sending notice | 13:39 |
-openstackstatus- NOTICE: Package mirrors should be back in working order; any jobs which logged package retrieval failures between 19:35 UTC yesterday and 13:20 UTC today can be safely rechecked | 13:40 | |
fungi | so the FileLog for afs01.dfw says it scheduled salvage for the following volumes: 536870915, 536871029, 536870921, 536871065, 536870994, 536870937 | 13:41 |
openstackstatus | fungi: finished sending notice | 13:43 |
fungi | vos status says there are no active transactions on afs01.dfw.openstack.org so that's a good sign | 13:46 |
fungi | the volumes mentioned as getting a salvage scheduled are (in corresponding order): root.cell, project, service, mirror.logs, docs-old, mirror.git | 13:49 |
fungi | i'm not super worried about those as their file counts should be low (several are entirely unused and could even stand to be deleted) | 13:49 |
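Mapping the numeric IDs back to names can be done per volume with vos examine, e.g. for the first one listed:

    vos examine 536870915   # the output header shows the volume name (root.cell here), its sites and status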
fungi | i'll move on to performing manual releases of all the volumes to make sure they're releaseable | 13:50 |
fungi | 55 rw volumes | 13:53 |
fungi | oh, only 50 with replicas though | 13:56 |
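A rough sketch of how one might walk those volumes and release each in turn (volumes without read-only sites will just make vos release error, which is harmless here; -localauth assumes running as root on a server holding the AFS KeyFile):

    vos listvldb -server afs01.dfw.openstack.org | awk '/^[a-z]/ {print $1}' |
    while read -r vol; do
        vos release "$vol" -localauth
    done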
fungi | unfortunately the outage seems to have caught mirror.centos mid-release, so it's getting a full release now which will likely take a long time. i'll try to knock out the rest in parallel | 13:59 |
fungi | same for mirror.epel | 14:03 |
fungi | and mirror.fedora | 14:04 |
fungi | mirror.gem is taking a while to release, but that may be due to mirror.centos, mirror.epel and mirror.fedora being simultaneously underway | 14:10 |
fungi | worth noting, it seems afs01.dfw spends basically all its time at max bandwidth utilization these days: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=2362&rra_id=all | 14:13 |
auristor | fungi: it would be worth checking one or more of clients reading from afs01.dfw logging anything to dmesg during the outage | 14:13 |
fungi | auristor: good idea, will take a look now, thanks! | 14:14 |
auristor | sorry, the brain isn't functioning yet. there are some missing words in that sentence that didn't make it to the fingers. | 14:14 |
fungi | nah, i understood ;) | 14:15 |
fungi | [Sat Jun 13 19:56:49 2020] afs: Lost contact with file server 104.130.138.161 in cell openstack.org (code -1) (all multi-homed ip addresses down for the server) | 14:15 |
auristor | salvaging will be scheduled when the volume is first attached by the fileserver, so it would also be worth attempting to read from each volume | 14:15 |
fungi | [Sat Jun 13 19:57:00 2020] afs: file server 104.130.138.161 in cell openstack.org is back up (code 0) (multi-homed address; other same-host interfaces may still be down) | 14:16 |
mordred | fungi: oh good morning. anything I can help with? | 14:16 |
fungi | mordred: probably not at this point, just double-checking that all the volumes with replicas get back in sync | 14:17 |
auristor | is there any periodic logging of the "calls waiting for a thread" count from rxdebug for afs01.dfw.openstack.org ? | 14:18 |
fungi | unfortunately the fact that every release of the centos/epel/fedora mirrors seems to require hours to complete basically guarantees they're caught mid-release if the server with the rw volume goes offline | 14:18 |
fungi | auristor: none that i see in dmesg (this is using the lkm from openafs 1.8.5 with linux kernel 4.15, if it makes a difference) | 14:19 |
mordred | fungi: you know - when there is time next to breathe - it's possible the difference in how yum repos work compared to apt repos might make yoctozepto's suggestion of considering caching proxies for those instead of full mirrors reasonable (there's less of an apt-get update ; wait ; apt-get install pattern with that toolchain) | 14:20 |
fungi | oh, rxdebug... that's one of the cli tools. just a sec | 14:20 |
mordred | but - I agree - let's circle back around to that later :) | 14:20 |
auristor | someone might have set up a periodic run of "rxdebug <host> 7000 -noconn" and logged the "<n> calls waiting for a thread" number or used it as an alarm trigger. | 14:22 |
fungi | auristor: "0 calls waiting for a thread; 244 threads are idle; 0 calls have waited for a thread" so i think that's a no | 14:22 |
mordred | sounds like a good thing to add as a periodic thing | 14:23 |
auristor | when that number is greater than 0 it means that all of the worker threads have been scheduled an RPC to process. | 14:23 |
mordred | could grep out the 0 and send it to a graphite gauge | 14:23 |
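A quick sketch of that periodic check, assuming a statsd-style listener in front of graphite (host, port and metric name here are made up):

    waiting=$(rxdebug afs01.dfw.openstack.org 7000 -noconn | awk '/calls waiting for a thread/ {print $1; exit}')
    echo "stats.afs.afs01dfw.calls_waiting:${waiting}|g" | nc -u -w1 graphite.opendev.org 8125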
fungi | in this case i think it's just bandwidth-bound... the service provider caps throughput to 400mbps on this server instance | 14:23 |
auristor | what I suspect happened is that the vice partition disk failed in such a way that I/O syscalls from the fileserver never completed. | 14:23 |
auristor | Eventually all ~250 worker threads are scheduled an RPC that never completes and then incoming RPCs get placed onto the "waiting for a thread" queue. Since no workers complete, the waiting number goes up and up. | 14:24 |
fungi | oh, when it was offline, yes probably. i noticed that the state the server was in it responded to ping and i could complete ssh key exchange but my login process either never forked or hung indefinitely | 14:24 |
fungi | our snmp graphs indicate that in the minutes leading up to the server becoming unresponsive it maxed out cpu utilization and load average had started spiking way up | 14:25 |
fungi | also the out of band console did not produce a login prompt. there were kernel messages present on the console complaining about hung tasks, but no clue how old those were since they were timestamped by seconds since boot (and i didn't think to jot them down so i could try to calculate the offset from logs later) | 14:27 |
fungi | unfortunately there was nothing interesting in syslog immediately before it went silent. i expect either the logger got stuck or it ceased to be able to write to its rootfs | 14:29 |
fungi | anyway, i'm going to leave these volume releases in progress for now, i don't feel comfortable starting any more in parallel until at least one of them completes (hopefully mirror.gem won't take too much longer to finish) | 14:36 |
AJaeger | thanks, fungi! | 14:40 |
smcginnis | Could these AFS server issues be the root of the release job POST_FAILURES I posted about earlier? | 14:57 |
smcginnis | I did another one this morning without thinking to check on the status of that, and got another failure. | 14:58 |
auristor | using mirror.centos.readonly as an example: it has two replicas, one on afs01 and one on afs02. During a release the one on afs01 is available and the one on afs02 is offline. If afs01 dies, there are no copies available for clients to use. | 15:01 |
fungi | smcginnis: yes, looking at the timestamps, i expect rsync failed to write to the rw volume on afs01.dfw.o.o because the server was hung at that point | 15:31 |
fungi | unfortunately all we got out of rsync was a nonzero exit code and no helpful errors | 15:31 |
*** icarusfactor has joined #opendev | 16:08 | |
*** factor has quit IRC | 16:10 | |
*** icarusfactor has quit IRC | 16:27 | |
openstackgerrit | Mohammed Naser proposed opendev/system-config master: uwsgi-base: drop packages.txt https://review.opendev.org/735473 | 17:31 |
mnaser | mordred: ^ of your interest | 17:32 |
mordred | mnaser: ++ | 20:00 |
*** sgw has quit IRC | 20:00 | |
openstackgerrit | Mohammed Naser proposed openstack/project-config master: Add vexxhost/atmosphere https://review.opendev.org/735478 | 20:06 |
*** rchurch has quit IRC | 21:00 | |
*** rchurch has joined #opendev | 21:02 | |
*** sgw has joined #opendev | 22:48 | |
*** tkajinam has joined #opendev | 22:57 | |
*** iurygregory has quit IRC | 22:59 | |
ianw | fungi: thanks for looking in on that | 23:02 |
ianw | in news that seems unlikely to be unrelated, graphite is down too | 23:03 |
ianw | it has task hung messages and is non-responsive on the console | 23:06 |
ianw | it also drops out of cacti at 9am on the 13th | 23:07 |
fungi | hrm, i wonder if it could be related to the trove db outage, though that was back on, like, the 9th | 23:10 |
fungi | and yeah, nearly done with afs volume manual releases. the only ones running now are mirror.fedora, mirror.opensuse and mirror.yum-puppetlabs | 23:11 |
fungi | the mirror-update servers are still powered down | 23:11 |
ianw | #status log rebooted graphite.openstack.org as it was unresponsive | 23:12 |
openstackstatus | ianw: finished logging | 23:12 |
ianw | fungi: we've never gotten to the bottom of why the rsync mirrors take so long to release; i think we can probably group that in with this | 23:14 |
ianw | this bit on sleeping and clock skew was an attempt: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L126 | 23:16 |
ianw | note i have scripts for setting up logging @ https://opendev.org/opendev/system-config/src/branch/master/tools/afs-server-restart.sh | 23:17 |
ianw | but as auristor says, if the volume is releasing there's no redundancy | 23:18 |
ianw | and when you look at http://grafana.openstack.org/d/ACtl1JSmz/afs?viewPanel=12&orgId=1&from=now-30d&to=now | 23:18 |
ianw | basically each mirror takes about as long to release as the interval until its next pulse; i.e. they're basically always in the release process | 23:19 |
ianw | which also probably results in the network being flat out 100% of the time | 23:19 |
*** DSpider has quit IRC | 23:22 | |
*** tosky has quit IRC | 23:26 | |
fungi | and i guess we've already ruled out simple causes like rsync updating atime on every file or something | 23:48 |
fungi | looks like we mount /vicepa with the relatime option, i suppose we could set it to noatime (openafs doesn't utilize the atime info from the store anyway, from what i'm reading), though no idea if that will make any difference for the phantom content changes rsync seems to cause | 23:56 |
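The remount itself would be something like the following (the device name in the fstab line is illustrative):

    mount -o remount,noatime /vicepa
    # and to persist it across reboots, in /etc/fstab:
    # /dev/xvdb1  /vicepa  ext4  defaults,noatime  0  2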
fungi | mirror.yum-puppetlabs release finished, so we're down to just fedora and opensuse now | 23:58 |
ianw | fungi: from my notes; https://lists.openafs.org/pipermail/openafs-info/2019-September/042864.html | 23:59 |