Saturday, 2024-07-20

[04:04] <opendevreview> Takashi Kajinami proposed zuul/zuul-jobs master: Explicitly install public_suffix 5.1.1  https://review.opendev.org/c/zuul/zuul-jobs/+/924057
[11:23] <frickler> I missed the gitea upgrade fun yesterday, but it looks like this could be related? https://paste.opendev.org/show/bGa36Fx6uUrPJFWgKkbl/
[12:08] <fungi> that seems odd to be related, but maybe something changed in gitea's ssh implementation?
[12:08] <fungi> which led to corrupt git repositories on disk
[12:24] <fungi> looks like that was the only server which complained?
[12:27] <fungi> i wonder if that was a collision with a writer, because right now at least... ls: cannot access '/var/gitea/data/git/repositories/openstack-ansible.git/objects/9d/0ebaa4447381a66f0be8fea5aa69a17e50454a': No such file or directory
[12:27] <fungi> maybe we just keep an eye out for more errors tomorrow
[12:28] <fungi> never mind, i had the wrong path
[12:28] <fungi> does exist, and file confirms... /var/gitea/data/git/repositories/openstack/openstack-ansible.git/objects/9d/0ebaa4447381a66f0be8fea5aa69a17e50454a: empty
[12:29] <fungi> however, stat says it was last modified 2024-07-15 13:56:07.685539790 +0000
[12:30] <fungi> so it was presumably that way for several days prior to the upgrade
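
The manual check above (a zero-byte loose object plus its modification time) could be repeated across a whole repository with something like the following. This is only a sketch, not a tool used in this channel: the function name find_empty_loose_objects is made up for illustration, and it assumes a SHA-1 repository where loose objects are stored as objects/<2 hex chars>/<38 hex chars>.

```python
import os
import re
from datetime import datetime, timezone

# Assumed layout for SHA-1 repositories: objects/<2 hex>/<38 hex>.
_FANOUT = re.compile(r"^[0-9a-f]{2}$")
_REST = re.compile(r"^[0-9a-f]{38}$")


def find_empty_loose_objects(git_dir):
    """Return (object_id, path, mtime_utc) for every zero-byte loose object."""
    results = []
    objects_dir = os.path.join(git_dir, "objects")
    for fanout in sorted(os.listdir(objects_dir)):
        fanout_dir = os.path.join(objects_dir, fanout)
        # Skip non-fanout entries such as "pack" and "info".
        if not _FANOUT.match(fanout) or not os.path.isdir(fanout_dir):
            continue
        for name in sorted(os.listdir(fanout_dir)):
            if not _REST.match(name):
                continue  # ignore temporary or unrelated files
            path = os.path.join(fanout_dir, name)
            info = os.stat(path)
            if info.st_size == 0:
                # A valid loose object is never empty, so this one is corrupt.
                mtime = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                results.append((fanout + name, path, mtime))
    return results
```

Pointing it at a bare repository directory returns one tuple per empty object. A modification time that predates a maintenance window, as it did here, suggests the damage was not caused by that maintenance.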
[12:31] <fungi> frickler: yep, if you look back through the cron e-mail from gitea12, i see it has been sending that same message every day starting on 2024-07-16
[12:31] <fungi> so whatever caused this happened on tuesday
[12:32] <fungi> from last... reboot   system boot  Mon Jul 15 14:20   still running      0.0.0.0
[12:32] <fungi> so my best guess is we had a spontaneous reboot that caught the file mid-write
[12:33] <fungi> unfortunately our oldest syslog is now from Jun 16 00:00:42
[12:34] <fungi> so we probably don't have a good record of whatever led up to that event
[12:34] <fungi> oh, wait, that's june. i guess we're rotating weekly. just a sec...
[12:35] <fungi> but yeah, syslog is silent between recording an innocuous cron start at 13:17:01 and the start of the boot messages at 14:20:41
[12:37] <fungi> so seems like either the event took the kernel entirely by surprise or by the time it knew something was wrong syslog was already unable to append to the log on disk
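
Spotting a silent stretch like that by eye gets tedious on a busy log. A rough sketch of automating it follows. It assumes traditional syslog lines that start with a "Mon DD HH:MM:SS" timestamp (which carries no year, so the caller supplies one) and English month abbreviations. The name largest_gap is hypothetical.

```python
from datetime import datetime


def largest_gap(lines, year):
    """Return (earlier, later) for the widest gap between consecutive
    timestamped lines, or None if fewer than two lines parse."""
    stamps = []
    for line in lines:
        try:
            # The first 15 characters hold "Mon DD HH:MM:SS" in this format.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        stamps.append(stamp)
    best = None
    for earlier, later in zip(stamps, stamps[1:]):
        if best is None or later - earlier > best[1] - best[0]:
            best = (earlier, later)
    return best
```

Run over the syslog for the day in question, the returned pair would be expected to bracket the silent period, here 13:17:01 to 14:20:41. A long gap that ends in boot messages with no shutdown messages before it is consistent with an abrupt reset.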
[12:38] <fungi> i suppose we could ask vexxhost if there was any host outage/reboot between 13:56-14:20 utc on 2024-07-15
[12:39] <fungi> specifically affecting server instance 653bdba6-185e-4856-a39e-74c8bbe11c6c in their sjc1 region
[12:40] <fungi> but anyway, this looks very much unrelated to yesterday's gitea upgrade
[12:42] <fungi> as for cleanup, i guess we could disable gitea12 in haproxy temporarily, delete the offending file, force replicate that repository from gerrit, and then manually re-run the command from the gc cron?
[14:01] *** tobias-urdin is now known as tobias-urdin|pto
[22:13] <clarkb> yes that seems reasonable. You may need to move the git repo content fully aside? not sure what state the repo is in without the file (though it is corrupt anyway I guess)
[22:14] <clarkb> also this isn't the first time this has happened
[22:15] <clarkb> I think the write through with the way ceph is set up is such that git thinks the data is persisted to disk and safe, but then in a hard reboot scenario bits that are cached in memory or ceph or somewhere not in the final destination get yeeted into the ether never to be seen again
[22:15] <clarkb> the last time this happened we had a little discussion with mnaser about it and he didn't seem to think it was possible, but now after a second occurrence I'm even more suspicious that ceph is effectively lying to git
[22:16] <clarkb> git is the last thing I would expect to lose data on its own. But it relies on writes occurring when writes report they are done aiui
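
To illustrate the dependency clarkb describes, here is a minimal sketch of the usual write-then-sync-then-rename pattern on POSIX systems. It is a generic example, not git's actual implementation, and durable_write is a made-up name. The guarantee only holds if every layer beneath os.fsync (hypervisor, network block device, storage cluster) reports completion truthfully. If any of them acknowledges a write that is still only cached, a hard reset can leave an empty or truncated file even though this code did everything right.

```python
import os


def durable_write(path, data):
    """Write bytes to path so that a reader sees the old content or the new
    content, never a partial file, assuming fsync is honoured end to end."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        handle.write(data)
        handle.flush()             # push Python's buffer to the kernel
        os.fsync(handle.fileno())  # ask the kernel to push it to storage
    os.replace(tmp_path, path)     # atomic rename over the destination
    # Sync the directory too, so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Application code cannot detect a storage layer that reports success early. That is why a corrupt file after a sequence like this one points to the layers underneath rather than to the writer.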
