opendevreview | Takashi Kajinami proposed zuul/zuul-jobs master: Explicitly install public_suffix 5.1.1 https://review.opendev.org/c/zuul/zuul-jobs/+/924057 | 04:04 |
frickler | I missed the gitea upgrade fun yesterday, but it looks like this could be related? https://paste.opendev.org/show/bGa36Fx6uUrPJFWgKkbl/ | 11:23 |
fungi | that seems odd to be related, but maybe something changed in gitea's ssh implementation? | 12:08 |
fungi | which led to corrupt git repositories on disk | 12:08 |
fungi | looks like that was the only server which complained? | 12:24 |
fungi | i wonder if that was a collision with a writer, because right now at least... ls: cannot access '/var/gitea/data/git/repositories/openstack-ansible.git/objects/9d/0ebaa4447381a66f0be8fea5aa69a17e50454a': No such file or directory | 12:27 |
fungi | maybe we just keep an eye out for more errors tomorrow | 12:27 |
fungi | never mind, i had the wrong path | 12:28 |
fungi | does exist, and file confirms... /var/gitea/data/git/repositories/openstack/openstack-ansible.git/objects/9d/0ebaa4447381a66f0be8fea5aa69a17e50454a: empty | 12:28 |
fungi | however, stat says it was last modified 2024-07-15 13:56:07.685539790 +0000 | 12:29 |
fungi | so it was presumably that way for several days prior to the upgrade | 12:30 |
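For reference, a minimal sketch of that check, assuming GNU find and shell access on the gitea backend; the repository path is the one quoted above, and the commands are illustrative rather than taken from the log:

```shell
REPO=/var/gitea/data/git/repositories/openstack/openstack-ansible.git

# Loose object files are never legitimately 0 bytes, so any empty file
# under objects/ is corrupt; print its path and modification time.
find "$REPO/objects" -type f -empty -printf '%p (mtime %TY-%Tm-%Td %TH:%TM)\n'

# git fsck should then flag the same object as missing or corrupt.
git --git-dir="$REPO" fsck --full 2>&1 | head
```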
fungi | frickler: yep, if you look back through the cron e-mail from gitea12, i see it has been sending that same message every day starting on 2024-07-16 | 12:31 |
fungi | so whatever caused this happened on tuesday | 12:31 |
fungi | from last... reboot system boot Mon Jul 15 14:20 still running 0.0.0.0 | 12:32 |
fungi | so my best guess is we had a spontaneous reboot that caught the file mid-write | 12:32 |
fungi | unfortunately our oldest syslog is now from Jun 16 00:00:42 | 12:33 |
fungi | so we probably don't have a good record of whatever led up to that event | 12:34 |
fungi | oh, wait, that's june. i guess we're rotating weekly. just a sec... | 12:34 |
fungi | but yeah, syslog is silent between recording an innocuous cron start at 13:17:01 and the start of the boot messages at 14:20:41 | 12:35 |
fungi | so seems like either the event took the kernel entirely by surprise or by the time it knew something was wrong syslog was already unable to append to the log on disk | 12:37 |
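A rough sketch of the correlation being done here: compare the object's mtime against the reboot record from last, then search the rotated syslogs for anything in the gap. The log filenames assume a stock Debian/Ubuntu rsyslog layout and are not taken from the log itself:

```shell
# mtime of the corrupt object (13:56 UTC per the stat output above)
stat -c '%y' /var/gitea/data/git/repositories/openstack/openstack-ansible.git/objects/9d/0ebaa4447381a66f0be8fea5aa69a17e50454a

# reboot records, most recent first
last -x reboot | head

# anything logged between 13:00 and 14:59 UTC on Jul 15, across rotated logs
zgrep -h 'Jul 15 1[34]:' /var/log/syslog* 2>/dev/null | sort
```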
fungi | i suppose we could ask vexxhost if there was any host outage/reboot between 13:56-14:20 utc on 2024-07-15 | 12:38 |
fungi | specifically affecting server instance 653bdba6-185e-4856-a39e-74c8bbe11c6c in their sjc1 region | 12:39 |
fungi | but anyway, this looks very much unrelated to yesterday's gitea upgrade | 12:40 |
fungi | as for cleanup, i guess we could disable gitea12 in haproxy temporarily, delete the offending file, force replicate that repository from gerrit, and then manually re-run the command from the gc cron? | 12:42 |
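A sketch of that procedure, with heavy caveats: the haproxy backend/server names and admin socket path below are placeholders that would need checking against our actual config, and the last step assumes the gc cron simply runs git gc. The gerrit step uses the replication plugin's "replication start" command, which can optionally be restricted to a single replica with --url:

```shell
REPO=/var/gitea/data/git/repositories/openstack/openstack-ansible.git

# 1. Drain gitea12 in haproxy (backend/server names and socket path are guesses).
echo "disable server balance_git_https/gitea12" | socat stdio /var/haproxy/run/stats

# 2. Move the corrupt repo content fully aside rather than deleting just the one object.
mv "$REPO" "${REPO}.corrupt-$(date +%F)"

# 3. Have gerrit's replication plugin push the project back out
#    (optionally limited to gitea12 with --url).
ssh -p 29418 review.opendev.org replication start openstack/openstack-ansible

# 4. Re-run the gc by hand, then re-enable the backend.
git --git-dir="$REPO" gc
echo "enable server balance_git_https/gitea12" | socat stdio /var/haproxy/run/stats
```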
*** tobias-urdin is now known as tobias-urdin|pto | 14:01 |
clarkb | yes that seems reasonable. You may need to move the git repo content fully aside? not sure what state the repo is in without the file (though it is corrupt anyway I guess) | 22:13 |
clarkb | also this isn't the first time this has happened | 22:14 |
clarkb | I think the write-through with the way ceph is set up is such that git thinks the data is persisted to disk and safe, but then in a hard reboot scenario the bits that are still cached in memory or in ceph or somewhere short of the final destination get yeeted into the ether, never to be seen again | 22:15 |
clarkb | the last time this happened we had a little discussion with mnaser about it and he didn't seem to think it was possible but now after a second occurrence I'm even more suspicious that ceph is effectively lying to git | 22:15 |
clarkb | git is the last thing I would expect to lose data on its own. But it relies on writes actually having occurred when they report they are done, aiui | 22:16 |
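As an aside on that last point: newer git (2.36+) exposes core.fsync and core.fsyncMethod to control which files it syncs and how, though none of that helps if the storage layer acknowledges a sync before the data is actually durable, which is the suspicion here:

```shell
# Illustrative only; these settings exist in git 2.36 and later.
git config --system core.fsync all          # sync loose objects, refs, packs, index, ...
git config --system core.fsyncMethod fsync  # full fsync rather than writeout-only
```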