opendevreview | Takashi Kajinami proposed zuul/zuul-jobs master: Explicitly install public_suffix 5.1.1 https://review.opendev.org/c/zuul/zuul-jobs/+/924057 | 04:04 |
frickler | I missed the gitea upgrade fun yesterday, but it looks like this could be related? https://paste.opendev.org/show/bGa36Fx6uUrPJFWgKkbl/ | 11:23 |
fungi | that seems odd to be related, but maybe something changed in gitea's ssh implementation? | 12:08 |
fungi | which led to corrupt git repositories on disk | 12:08 |
fungi | looks like that was the only server which complained? | 12:24 |
fungi | i wonder if that was a collision with a writer, because right now at least... ls: cannot access '/var/gitea/data/git/repositories/openstack-ansible.git/objects/9d/0ebaa4447381a66f0be8fea5aa69a17e50454a': No such file or directory | 12:27 |
fungi | maybe we just keep an eye out for more errors tomorrow | 12:27 |
fungi | never mind, i had the wrong path | 12:28 |
fungi | does exist, and file confirms... /var/gitea/data/git/repositories/openstack/openstack-ansible.git/objects/9d/0ebaa4447381a66f0be8fea5aa69a17e50454a: empty | 12:28 |
fungi | however, stat says it was last modified 2024-07-15 13:56:07.685539790 +0000 | 12:29 |
fungi | so it was presumably that way for several days prior to the upgrade | 12:30 |
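For reference, a minimal sketch of that check, assuming GNU find and shell access on the gitea backend; the repository path is the one quoted above, and the commands are illustrative rather than taken from the log:

```shell
REPO=/var/gitea/data/git/repositories/openstack/openstack-ansible.git

# Loose object files are never legitimately 0 bytes, so any empty file
# under objects/ is corrupt; print its path and modification time.
find "$REPO/objects" -type f -empty -printf '%p (mtime %TY-%Tm-%Td %TH:%TM)\n'

# git fsck should then flag the same object as missing or corrupt.
git --git-dir="$REPO" fsck --full 2>&1 | head
```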
fungi | frickler: yep, if you look back through the cron e-mail from gitea12, i see it has been sending that same message every day starting on 2024-07-16 | 12:31 |
fungi | so whatever caused this happened on tuesday | 12:31 |
fungi | from last... reboot system boot Mon Jul 15 14:20 still running 0.0.0.0 | 12:32 |
fungi | so my best guess is we had a spontaneous reboot that caught the file mid-write | 12:32 |
fungi | unfortunately our oldest syslog is now from Jun 16 00:00:42 | 12:33 |
fungi | so we probably don't have a good record of whatever led up to that event | 12:34 |
fungi | oh, wait, that's june. i guess we're rotating weekly. just a sec... | 12:34 |
fungi | but yeah, syslog is silent between recording an innocuous cron start at 13:17:01 and the start of the boot messages at 14:20:41 | 12:35 |
fungi | so seems like either the event took the kernel entirely by surprise or by the time it knew something was wrong syslog was already unable to append to the log on disk | 12:37 |
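A rough sketch of the correlation being done here: compare the object's mtime against the reboot record from last, then search the rotated syslogs for anything in the gap. The log filenames assume a stock Debian/Ubuntu rsyslog layout and are not taken from the log itself:

```shell
# mtime of the corrupt object (13:56 UTC per the stat output above)
stat -c '%y' /var/gitea/data/git/repositories/openstack/openstack-ansible.git/objects/9d/0ebaa4447381a66f0be8fea5aa69a17e50454a

# reboot records, most recent first
last -x reboot | head

# anything logged between 13:00 and 14:59 UTC on Jul 15, across rotated logs
zgrep -h 'Jul 15 1[34]:' /var/log/syslog* 2>/dev/null | sort
```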
fungi | i suppose we could ask vexxhost if there was any host outage/reboot between 13:56-14:20 utc on 2024-07-15 | 12:38 |
fungi | specifically affecting server instance 653bdba6-185e-4856-a39e-74c8bbe11c6c in their sjc1 region | 12:39 |
fungi | but anyway, this looks very much unrelated to yesterday's gitea upgrade | 12:40 |
fungi | as for cleanup, i guess we could disable gitea12 in haproxy temporarily, delete the offending file, force replicate that repository from gerrit, and then manually re-run the command from the gc cron? | 12:42 |
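A sketch of that procedure, with heavy caveats: the haproxy backend/server names and admin socket path below are placeholders that would need checking against our actual config, and the last step assumes the gc cron simply runs git gc. The gerrit step uses the replication plugin's "replication start" command, which can optionally be restricted to a single replica with --url:

```shell
REPO=/var/gitea/data/git/repositories/openstack/openstack-ansible.git

# 1. Drain gitea12 in haproxy (backend/server names and socket path are guesses).
echo "disable server balance_git_https/gitea12" | socat stdio /var/haproxy/run/stats

# 2. Move the corrupt repo content fully aside rather than deleting just the one object.
mv "$REPO" "${REPO}.corrupt-$(date +%F)"

# 3. Have gerrit's replication plugin push the project back out
#    (optionally limited to gitea12 with --url).
ssh -p 29418 review.opendev.org replication start openstack/openstack-ansible

# 4. Re-run the gc by hand, then re-enable the backend.
git --git-dir="$REPO" gc
echo "enable server balance_git_https/gitea12" | socat stdio /var/haproxy/run/stats
```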
*** tobias-urdin is now known as tobias-urdin|pto | 14:01 |
clarkb | yes that seems reasonable. You may need to move the git repo content fully aside? not sure what state the repo is in without the file (though it is corrupt anyway I guess) | 22:13 |
clarkb | also this isn't the first time this has happened | 22:14 |
clarkb | I think the write-through with the way ceph is set up is such that git thinks the data is persisted to disk and safe, but then in a hard reboot scenario the bits that are still cached in memory or in ceph or somewhere short of the final destination get yeeted into the ether, never to be seen again | 22:15 |
clarkb | the last time this happened we had a little discussion with mnaser about it and he didn't seem to think it was possible but now after a second occurrence I'm even more suspicious that ceph is effectively lying to git | 22:15 |
clarkb | git is the last thing I would expect to lose data on its own. But it relies on writes actually having occurred when they report they are done, aiui | 22:16 |
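As an aside on that last point: newer git (2.36+) exposes core.fsync and core.fsyncMethod to control which files it syncs and how, though none of that helps if the storage layer acknowledges a sync before the data is actually durable, which is the suspicion here:

```shell
# Illustrative only; these settings exist in git 2.36 and later.
git config --system core.fsync all          # sync loose objects, refs, packs, index, ...
git config --system core.fsyncMethod fsync  # full fsync rather than writeout-only
```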