corvus | some post_failures showing up. 1st one i checked failed on upload to ovh_gra | 15:26 |
---|---|---|
corvus | yeah 3/3 failed there | 15:28 |
opendevreview | James E. Blair proposed opendev/base-jobs master: Temporarily disable uploads to ovh_gra https://review.opendev.org/c/opendev/base-jobs/+/903351 | 15:30 |
SvenKieske | corvus: I guess this is also related? https://zuul.opendev.org/t/openstack/build/322e58959af645229a7e387686c6cab8 | 15:35 |
fungi | https://public-cloud.status-ovhcloud.com/incidents/ggsd08k3wlzn | 15:36 |
fungi | does it look auth related | 15:36 |
fungi | ? | 15:36 |
corvus | SvenKieske: yes it is | 15:36 |
corvus | fungi: can't tell, no_log=true | 15:36 |
fungi | i'm late for an errand, but can take a closer look in about an hour. also i approved 903351 but someone may need to bypass gating to merge it | 15:36 |
opendevreview | Merged opendev/base-jobs master: Temporarily disable uploads to ovh_gra https://review.opendev.org/c/opendev/base-jobs/+/903351 | 15:40 |
frickler | seems to have been lucky | 15:41 |
corvus | #status notice Zuul jobs reporting POST_FAILURE were due to an incident with one of our cloud providers; this provider has been temporarily disabled and changes can be rechecked. | 15:43 |
opendevstatus | corvus: sending notice | 15:43 |
-opendevstatus- NOTICE: Zuul jobs reporting POST_FAILURE were due to an incident with one of our cloud providers; this provider has been temporarily disabled and changes can be rechecked. | 15:43 | |
opendevstatus | corvus: finished sending notice | 15:46 |
clarkb | I've been thinking about the best way to force gitea09 to use ipv4 to talk to the vexxhost backup server. I think .ssh/config AddressFamily inet is the proper configuration but then I wonder if we should apply that to /root/.ssh/config on all servers running backups? Or should I constrain it to gitea09? If we want to do that on all backed up servers I could see that one day we might | 16:17 |
clarkb | have conflicting /root/.ssh/config configs for different needs/services though that isn't the case today | 16:17 |
clarkb | anyone else have an opinion or ideas on how best to tackle this? | 16:17 |
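A minimal sketch of the OpenSSH client option being weighed here; the host pattern and exact file layout are illustrative assumptions, not the actual opendev configuration (which, as noted below, is managed via ansible):

```
# /root/.ssh/config -- illustrative sketch only
Host backup*                      # assumed host pattern matching the borg backup target
    AddressFamily inet            # only use IPv4; the default is "any", "inet6" would force IPv6
```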
clarkb | separately but related: https://review.opendev.org/c/opendev/system-config/+/902842 should remove the old gerrit replication key from gitea. Do we want to go ahead and approve/land that now or fix backups first? | 16:19 |
fungi | https://public-cloud.status-ovhcloud.com/incidents/ggsd08k3wlzn now indicates they believe the problem in gra was resolved 14 minutes after 903351 merged | 16:23 |
fungi | looks like they believe it happened from 15:05 to 15:54 utc | 16:24 |
clarkb | oh looks like we already manage .ssh/config for backups | 16:29 |
corvus | fungi: i haven't checked to see if they're ovh, but i see post_failures going back to 13:xx utc. (more before that, but those jobs all have "docker-image" in their names so i suspect they are unrelated) | 16:30 |
fungi | is that the time the jobs started, or when the tasks failed? | 16:31 |
fungi | but yeah, some projects also have perpetually broken image uploads | 16:31 |
corvus | fungi: oh, start time, good point | 16:31 |
corvus | yeah, and spot checking the end times of some of the 13:xx jobs, they ended at 15:xx | 16:32 |
corvus | fungi: then i think we have high correlation with their outage times :) | 16:32 |
fungi | agreed | 16:33 |
corvus | clarkb: sgtm. i haven't followed 100%, but i take it the issue is something like streaming big stuff on these hosts over ipv6 is bad? | 16:33 |
opendevreview | Clark Boylan proposed opendev/system-config master: Force borg backups to run over ipv4 https://review.opendev.org/c/opendev/system-config/+/903356 | 16:33 |
clarkb | corvus: yup ipv6 connectivity seems to be having problems between vexxhost sjc1 gitea09 and the mtl01 backup server | 16:34 |
corvus | maybe 2024 will be the year of ipv6 and the linux desktop | 16:34 |
clarkb | ha. I finally got around to trying to RMA my laptop but lenovo said the turnaround time isn't quick enough for this trip I'm taking, so I'm delaying. In the meantime I discovered that if I boot with nomodeset that basically disables fancy gpu things and rendering "works". I just get a lower resolution than native with a different aspect ratio so things look weird and I can't lower the | 16:36 |
clarkb | (full) brightness. Oh and suspending doesn't actually save as much battery as it should | 16:36 |
clarkb | but I can limp along on that for a little bit longer | 16:36 |
clarkb | but my brother has the same laptop, and while I can reproduce the problem on an ubuntu live image, his laptop cannot. So I'm fairly certain the problem is device specific. | 16:37 |
clarkb | fungi: corvus: it is pretty easy to test if that region is working again by forcing that region in base-test | 16:38 |
frickler | might not be related, but I'm also having no route to vexxhost via IPv6 from my local DSL provider again (had that some years ago and took a long time to resolve) | 16:42 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add hints to borg backup error logging https://review.opendev.org/c/opendev/system-config/+/903357 | 16:43 |
opendevreview | James E. Blair proposed opendev/base-jobs master: Force base-test to upload to ovh_gra https://review.opendev.org/c/opendev/base-jobs/+/903358 | 16:45 |
opendevreview | Merged opendev/base-jobs master: Force base-test to upload to ovh_gra https://review.opendev.org/c/opendev/base-jobs/+/903358 | 16:51 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: DNM: exercise base-test https://review.opendev.org/c/zuul/zuul-jobs/+/903362 | 16:56 |
opendevreview | James E. Blair proposed opendev/base-jobs master: Revert "Temporarily disable uploads to ovh_gra" https://review.opendev.org/c/opendev/base-jobs/+/903365 | 17:11 |
corvus | all but one job in the test set has completed successfully; i think we can return to standard condition now. | 17:11 |
corvus | update: all jobs completed successfully | 17:12 |
clarkb | +2 from me. I won't fast approve this one though and hopefully someone else can check it too | 17:13 |
clarkb | frickler: fyi a fix for the "WIP is shown as merge conflicted in change listings" gerrit issue has merged against stable-3.9 | 18:10 |
frickler | clarkb: oh, nice, I wasn't aware that they agreed about this being a bug | 18:16 |
clarkb | frickler: https://gerrit-review.googlesource.com/c/gerrit/+/396899/ is the change | 18:17 |
clarkb | I'm going to take advantage of the lack of pineapple express rain to go on a bike ride in a bit. but still happy to be around to watch any of those changes linked above (restore ovh gra logs, remove ssh key from gitea, force backups on ipv4) either before or after that happens | 19:06 |
*** elodilles is now known as elodilles_pto | 21:01 | |
JayF | review.opendev.org is unreachable for me locally | 22:09 |
JayF | resolves to review02.opendev.org (199.204.45.33) | 22:09 |
NeilHanlon | same here (and via v6) | 22:10 |
JayF | routing inside level3 according to this MTR looks bananas | 22:10 |
JayF | https://home.jvf.cc/~jay/review-mtr-20231211.png | 22:11 |
JayF | I wonder if there's some kind of weird BGP thing going on | 22:12 |
JayF | because that feels like I'm being routed to the wrong area of the internet | 22:12 |
JayF | infra-root: FYI seemingly non-actionable incident appears to be ongoing with review.opendev.org at least with a portion of the internet, | 22:12 |
NeilHanlon | i'm checking my ripe atlas, but agree | 22:13 |
JayF | I'm confirming with other folks around the world review.opendev.org is down but generally other internet things aren't; I don't know what network that is on tho | 22:17 |
NeilHanlon | https://atlas.ripe.net/measurements/64730794#probes | 22:18 |
clarkb | I'm not sure it's a network thing yet. The mirror in that cloud region which has an IP addr (ipv4 anyway) in the same /24 range is reachable | 22:19 |
clarkb | the server reports it is shutoff according to the nova api | 22:20 |
JayF | clarkb: I can tell you generally my route to this server doesn't go through bell canada :) but maybe that's just something else weird happening simultaneously | 22:20 |
clarkb | fungi: corvus frickler tonyb should I try to start it again via the nova api? or do we want to investigate further first? | 22:21 |
clarkb | server show against the server doesn't indicate any in progress tasks | 22:21 |
clarkb | OS-EXT-STS:task_state | None <- I think that would tell us if they were doing a live migration for example | 22:22 |
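For reference, a sketch of how those fields can be queried with python-openstackclient; the server name used here and the comment values are illustrative (the actual UUID is quoted further down in the log):

```
# Pull just the state fields for the suspect instance
openstack server show review02.opendev.org \
    -f value -c status -c OS-EXT-STS:vm_state -c OS-EXT-STS:task_state
# A live migration in progress would normally report task_state "migrating";
# here it showed None alongside status SHUTOFF / vm_state stopped.
```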
NeilHanlon | yeah, looking again it appears traffic arrives where it needs to | 22:22 |
fungi | updated_at=2023-12-11T21:28:20Z status=SHUTOFF vm_state=stopped | 22:22 |
fungi | i'm guessing that's when it went down | 22:22 |
clarkb | that seems to align with cacti losing connectivity too | 22:23 |
fungi | yeah, i don't see anything else to investigate without trying to reboot it | 22:23 |
fungi | expect a lengthy wait for fsck to run | 22:23 |
clarkb | fwiw I don't see anything in cacti indicating that something snowballed out of control prior | 22:24 |
clarkb | fungi: should I server start it or do you want to do it? | 22:24 |
fungi | and we may need to connect to the console if fsck requires manual intervention | 22:24 |
fungi | go for it | 22:24 |
clarkb | the server came right up | 22:25 |
fungi | guilhermesp: mnaser: ^ heads up we found 16acb0cb-ead1-43b2-8be7-ab4a310b4e0a (review02.opendev.org) spontaneously shutdown in ca-ymq-1 at 21:28:20 utc according to the nova api | 22:26 |
fungi | reboot system boot Mon Dec 11 22:25 still running 0.0.0.0 | 22:26 |
corvus | o/ sorry i missed excitement | 22:27 |
fungi | the previous entry from last was me logging in for 43 minutes on 2023-12-06 | 22:27 |
fungi | so looks like there was nobody logged in at the time that occurred | 22:27 |
clarkb | corvus: well there may still be excitement | 22:27 |
NeilHanlon | JayF: fwiw, it appears that Level3 junk you're seeing is 'normal' -- fsvo normal. | 22:28 |
clarkb | docker reports gerrit failed to start around when I booted the server if I'm reading the docker ps -a output correctly but the last logs recorded by docker appear to be from when the server went down | 22:28 |
clarkb | fungi: corvus: should I docker-compose down && docker-compose up -d? | 22:28 |
corvus | couldn't a live migration have "finished" and not shown up in the task state? | 22:28 |
clarkb | corvus: maybe? | 22:28 |
corvus | hrm lemme look at docker | 22:28 |
JayF | NeilHanlon: interesting; I'm a relatively newish centurylink fiber customer so maybe I'm just not so used to that particularly quirky route | 22:28 |
clarkb | corvus: k | 22:28 |
fungi | last entries in syslog are from snmpd at 21:25:03, which was a few minutes before the shutdown | 22:29 |
fungi | skimming syslog leading up to the outage, i don't see anything amiss | 22:29 |
clarkb | I guess the other thing is whether or not we want to force a fsck | 22:30 |
clarkb | since it seems that no fsck was done | 22:30 |
ianw | is it possible it shutdown cleanly? | 22:30 |
corvus | i would generally trust ext4 on that... unless our paranoia level is 11? | 22:30 |
fungi | ianw: i would have expected a clean shutdown to leave some trace in syslog | 22:31 |
clarkb | corvus: ack | 22:31 |
NeilHanlon | JayF: yeah, the thing to look for is whether the packet loss is consistent between ASes -- i.e., if you have loss which continues from hop N all the way to the destination (or close to it) with no loss-free hops in between. in short: if the loss isn't consistent from point A to B, it's likely noise from that network's devices not liking to respond. you can | 22:31 |
NeilHanlon | sometimes get them to treat your traceroute traffic better if you send over tcp (mtr -P 443 --tcp review.opendev.org ) | 22:31 |
clarkb | I wonder if docker the container manager is recording that the containers failed when it came back up, hence the timestamp confusion, but it didn't actually try to restart them at that time | 22:32 |
fungi | the only fsck message in syslog is this one (about the configdrive i think?): | 22:32 |
fungi | Dec 11 22:25:19 review02 kernel: [ 6.735306] FAT-fs (vda15): Volume was not properly unmounted. Some data may be corrupt. Please run fsck. | 22:32 |
fungi | aha, it's /boot/efi | 22:33 |
corvus | clarkb: oh good theory. i don't have any better ideas. i don't see any docker logs suggesting it tried to start any containers. | 22:33 |
corvus | "StartedAt": "2023-12-06T21:39:27.061492779Z", | 22:34 |
corvus | "FinishedAt": "2023-12-11T22:25:24.402771369Z" | 22:34 |
fungi | can anyone confirm that /etc/fstab is set to not fsck any of our filesystems? | 22:34 |
corvus | clarkb: ^ i think those timestamps from `docker inspect ac1d7b309848` support your theory | 22:34 |
fungi | 2021-06-22 was the last modified date for /etc/fstab btw | 22:35 |
fungi | so it's been like this for 2.5 years | 22:35 |
clarkb | fungi: ya it seems the last field is 9 | 22:35 |
clarkb | s/9/0/ | 22:35 |
corvus | clarkb: i release my debugging hold, and i think it's okay to down/up (or maybe even just up; it will probably dtrt) once the fsck question is resolved. | 22:35 |
JayF | NeilHanlon: how, in context of an mtr/tracert, do you know where you swap AS | 22:35 |
clarkb | corvus: ack thanks for looking | 22:35 |
ianw | fungi: it is defaults 0 0 | 22:35 |
ianw | on the gerrit partition | 22:35 |
fungi | spot checking other servers, we do set fsck passno to 1 or 2 for non-swap filesystems | 22:36 |
clarkb | fungi: so ya should we set 1 on cloudimg-rootfs and /boot/efi and then 2 on /home/gerrit2? | 22:36 |
clarkb | and then reboot? | 22:36 |
fungi | clarkb: i think so, yes | 22:36 |
corvus | ++ to the pass fstab change | 22:36 |
NeilHanlon | JayF: DNS (if it's available), and modernish mtr has a `-z` flag which will do AS lookups | 22:36 |
fungi | ianw: right, "default" for the fsck passno field is 0, which means "don't fsck at boot" | 22:36 |
JayF | NeilHanlon: oh, nice :) I'm on gentoo so I better have the flag or else I can go bump the ebuild :D | 22:37 |
clarkb | fungi: I'll let you drive that | 22:37 |
NeilHanlon | :D | 22:37 |
clarkb | I suppose we could manually fsck /home/gerrit2 first without a reboot if we wanted | 22:37 |
NeilHanlon | JayF: this is a good listen (or read w/ linked slides) https://youtu.be/L0RUI5kHzEQ that taught me everything I've now forgotten about traceroute :D | 22:38 |
clarkb | the updated /etc/fstab looks correct to me | 22:38 |
fungi | infra-root: i've edited /etc/fstab on review02 now so that non-swap filesystems will get a fsck at boot | 22:38 |
fungi | rootfs and efi on passno 1, gerrit home volume on passno 2 | 22:39 |
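A sketch of roughly what the edited fstab lines would look like after that change; the device identifiers and mount options are assumptions for illustration, only the final pass-number column is the point:

```
# /etc/fstab: <source>     <mount point>   <type>  <options>  <dump>  <pass>
LABEL=cloudimg-rootfs      /               ext4    defaults   0       1
LABEL=UEFI                 /boot/efi       vfat    defaults   0       1
/dev/main/gerrit           /home/gerrit2   ext4    defaults   0       2
```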
fungi | shall i reboot the server now? | 22:39 |
clarkb | fungi: corvus: should we down the containers before we reboot? just to avoid any unexpected interactions? | 22:39 |
fungi | probably, yes | 22:39 |
clarkb | that is the only other thought I have before rebooting | 22:39 |
ianw | (i likely as not added that entry manually for the gerrit home ~ 2021-07 when we upgraded the host and afaik it wasn't an explicit choice to turn off fsck, i probably just typed 0 0 out of habit) | 22:40 |
clarkb | fungi: I think you should docker-compose down the containers to prevent them from trying to start until we are ready, then do a reboot | 22:41 |
fungi | ianw: well, the rootfs was also set to not fsck at boot | 22:41 |
corvus | clarkb: yes to downing | 22:41 |
fungi | downed now | 22:41 |
fungi | rebooting | 22:41 |
fungi | i also have the vnc console connected | 22:42 |
fungi | so i can watch the boot progress | 22:42 |
fungi | it's already up | 22:42 |
clarkb | yup, is there a good way to check if it fscked? I guess your boot console would tell you? | 22:43 |
fungi | Dec 11 22:42:23 review02 systemd-fsck[816]: gerrit: clean, 494405/67108864 files, 113090725/268434432 blocks | 22:43 |
fungi | from syslog | 22:43 |
clarkb | it did not fsck /; is the implication that the fs was not dirty and thus could be skipped? | 22:44 |
fungi | i can't tell, still looking | 22:45 |
clarkb | I guess that isn't too surprising since most of the server state is on the gerrit volume. One exception is syslog/journald though | 22:45 |
fungi | openstack console log show | 22:45 |
fungi | Begin: Will now check root file system ... fsck from util-linux 2.34 | 22:45 |
fungi | [/usr/sbin/fsck.ext4 (1) -- /dev/vda1] fsck.ext4 -a -C0 /dev/vda1 | 22:46 |
clarkb | oh I wonder if systemd-fsck can only fsck non / | 22:46 |
clarkb | and you need fsck before systemd for / | 22:46 |
clarkb | that could explain the logging being missing for / | 22:46 |
fungi | yep | 22:46 |
clarkb | fungi: does the console log show any complaints from fsck for / if not I think we're ok? | 22:47 |
tonyb | I think you can do something like tune2fs -l /dev/$device to see when it was last fsck'd | 22:47 |
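Roughly what tonyb is suggesting; the device path and the output lines shown as comments are illustrative, not captured from review02:

```
# List ext4 superblock metadata and pick out the check-related fields
tune2fs -l /dev/mapper/main-gerrit | grep -Ei 'last checked|mount count'
# Mount count:              1
# Maximum mount count:      -1
# Last checked:             Mon Dec 11 22:42:23 2023
```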
fungi | i did not find any errors in the console log from fsck | 22:47 |
fungi | just messages about systemd creating fsck.slice and listening on the fsckd communication socket | 22:47 |
clarkb | in that case I guess we can proceed with a docker-compose up -d? | 22:47 |
fungi | agreed, i'll do that now | 22:48 |
fungi | it's on its way up now | 22:48 |
clarkb | we didn't move the waiting/ queue dir aside so those exceptions are "expected" | 22:48 |
clarkb | there is also a very persistent ssh client that is failing to connect. But otherwise I think that startup log in error_log looked good | 22:49 |
clarkb | the web ui is up for me and reports the version we were running prior | 22:49 |
clarkb | so we didn't update gerrit (expected since we haven't made any image updates iirc) | 22:49 |
clarkb | maybe we should approve https://review.opendev.org/c/opendev/base-jobs/+/903365 and use that as a good canary of the whole approval -> CI -> merge/submit process? | 22:50 |
JayF | something is def. wrong | 22:52 |
JayF | https://review.opendev.org/c/openstack/governance/+/902585 does not have any comments loaded, for example | 22:53 |
NeilHanlon | maybe they're just in invisible ink now? | 22:53 |
JayF | it looks weirdly spooky, all the comment spots are there but empty | 22:53 |
tonyb | JayF: I found review.o.o to be very slow after the reboot | 22:53 |
clarkb | yes it has to reload caches | 22:53 |
clarkb | if it persists after 5 or 10 minutes then we should check again. Unfortunately this is "normal" which makes it hard to say if something is wrong | 22:54 |
tonyb | JayF: and I see many comments on that change FWIW | 22:54 |
JayF | ack; makes sense. First time I've been here to see when it first gets restarted, I think :D | 22:54 |
clarkb | you'll see it struggle to load diffs as well | 22:54 |
tonyb | clarkb: I'm happy to +2+A 903365 | 22:54 |
fungi | i've approved 903365 now | 22:54 |
tonyb | LOL | 22:54 |
NeilHanlon | comments do load, but takes a few seconds | 22:54 |
NeilHanlon | https://drop1.neilhanlon.me/irc/uploads/b238bf77f8924b48/image.png | 22:55 |
fungi | 903365 is showing on https://zuul.opendev.org/t/opendev/status with builds in progress | 22:55 |
fungi | eta 3 minutes | 22:55 |
clarkb | did we lose the bot? | 22:58 |
clarkb | the change merged but the bot doesn't seem to be connected (not surprising I guess) | 22:59 |
fungi | and its already replicated to https://opendev.org/opendev/base-jobs/commit/ddb3137 | 22:59 |
fungi | i'll restart the bot | 22:59 |
clarkb | thanks! | 22:59 |
fungi | yeah, container log indicates the last event the bot saw was at 21:27:13 utc, right when the server probably died | 23:03 |
clarkb | the ip spamming us with ssh connection attempts belongs to IBM according to whois | 23:04 |
clarkb | anyone know anyone at IBM? | 23:04 |
tonyb | Not that could help with that :/ | 23:05 |
clarkb | it's probably some ancient jenkins that everyone forgot about | 23:05 |
clarkb | I suspect the errors are due to its age | 23:05 |
tonyb | fungi: If you get a moment can you share a redacted mutt.conf for accessing the infra-root mail? | 23:05 |
fungi | status log Started the review.opendev.org server which appeared to have spontaneously shut down at 21:28 UTC, also corrected the fsck passno in its fstab, and restarted the Gerrit IRC/Matrix bot so they would start seeing change events again | 23:06 |
fungi | kinda wordy, look okay? | 23:06 |
tonyb | Sure, I think you can drop the passno text to shrink it a little | 23:07 |
fungi | would rather not forget that we fixed it to actually fsck on boot | 23:08 |
clarkb | lgtm | 23:08 |
fungi | #status log Started the review.opendev.org server which spontaneously shut down at 21:28 UTC, corrected the fsck passno in its fstab, and restarted the Gerrit IRC/Matrix bots so they'll start seeing change events again | 23:08 |
opendevstatus | fungi: finished logging | 23:08 |
fungi | whittled it down a smidge | 23:09 |
tonyb | Ummm I didn't actually get that message anywhere | 23:10 |
tonyb | and it finished logging very quickly | 23:11 |
fungi | tonyb: that's what status log does | 23:11 |
fungi | as opposed to notice or alert or okay which notify irc channels | 23:12 |
ianw | it appeared on mastodon which i was scrolling getting a tea :) | 23:12 |
tonyb | ooooo my mistake | 23:12 |
fungi | we usually try to avoid pestering every irc channel if there's no action required | 23:12 |
JayF | my hilight bar in weechat appreciates you :) | 23:12 |
fungi | and yeah, following https://fosstodon.org/@opendevinfra/ will still get them | 23:13 |
fungi | unrelated, all pypi.org accounts will require 2fa (and so also upload tokens) starting on 2024-01-01 | 23:16 |
*** dmellado2 is now known as dmellado | 23:16 | |
fungi | https://discuss.python.org/t/announcement-2fa-requirement-for-pypi-2024-01-01/40906 | 23:20 |
clarkb | github is like 2024-01-28 ish | 23:25 |
tonyb | I get that Jan-1st is a really nice line in the sand, but it really sucks because holiday season :/ | 23:27 |
NeilHanlon | new year, same problems 🙃 | 23:27 |