opendevreview | Yan Xiao proposed openstack/python-swiftclient master: Add transaction id to errors to help troubleshoot, including: - error when downloading object with truncated/missing segment - error when downloading or stat'ing non-existent object - error when stat'ing or listing non-existent container https://review.opendev.org/c/openstack/python-swiftclient/+/903770 | 02:55 |
Steals_From_Dragons | Is this the right place to ask some swift operations questions? | 16:40 |
Steals_From_Dragons | /msg jamesdenton_ howdy! It's intern (David Alfano). How are things going? | 16:42 |
Steals_From_Dragons | Sorry, it's been a while since I've done IRC | 16:44 |
timburke | no worries! and yeah, this is a great spot to ask swift ops questions :-) | 18:34 |
Steals_From_Dragons | I've been seeing some errors in my container-replicator logs raising a "ValueError('object_count cannot be < 0')" in the get_own_shard_range function. I can also get the same error message when running `swift-manage-shard-ranges <path_to_container_db> info`. This doesn't happen for all of the container databases, just a small number of them. My guess was that the containers associated with those databases no longer exist, but I'm not | 18:48 |
Steals_From_Dragons | entirely sure how to verify that. | 18:48 |
timburke | Steals_From_Dragons, i'd start by using sqlite3 to look at the db directly; something like sqlite3 $DB_FILE 'SELECT * FROM shard_range WHERE object_count <= 0;' | 19:01 |
timburke | seems like there might be some database corruption; hopefully from that query, you can identify the shard DB name, then find that with swift-get-nodes, and see what *that* DB looks like | 19:02 |
timburke | it's not clear yet whether the corruption is in the root DB or the shard; if it's the root, we should be able to reset the reported column in the shard to have it re-send the update | 19:08 |
timburke | if it's the shard, hopefully it only impacts one replica of the DB (so we can fix it by deleting/quarantining the affected DB and letting replication bring it back up to full durability) | 19:08 |
timburke | though we might still need to reset the reported column, too | 19:09 |
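Pulling timburke's suggestion together, a first inspection pass might look like the sketch below; the DB path and the shard account/container are placeholders, not values from this log:

```sh
# Placeholder: the root container DB that raises the ValueError.
ROOT_DB=/path/to/root_container.db

# Look for shard ranges whose stats have gone non-positive.
sqlite3 "$ROOT_DB" 'SELECT * FROM shard_range WHERE object_count <= 0;'

# The path-looking value in a matching row is the shard's account/container;
# hand it to swift-get-nodes to find where that shard DB lives on disk.
# (The account and container below are hypothetical examples.)
swift-get-nodes /etc/swift/container.ring.gz '.shards_AUTH_example' 'example-container-hash-0'
```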
Steals_From_Dragons | Ok, let me try that sqlite cmd | 19:29 |
Steals_From_Dragons | Looks like it returned one row. I would imagine that it wouldn't return at all if it was corrupt? | 19:36 |
timburke | we've occasionally seen bit flips in dbs, as in https://bugs.launchpad.net/swift/+bug/1823785 | 19:39 |
timburke | Steals_From_Dragons, oh! i should have had you include a -header so we'd know what those results meant :P | 19:40 |
timburke | but i suspect there's a negative number somewhere in there, and it's the object_count -- and then somewhere else there's something that looks kinda like a path, which would be the account/container name for the shard | 19:41 |
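A sketch of a re-run with labeled output (placeholder DB path again); sqlite3's -line mode, which timburke uses just below, tags every value with its column name, so it's obvious which field is object_count and which is the shard's account/container:

```sh
sqlite3 -line /path/to/root_container.db \
    'SELECT * FROM shard_range WHERE object_count <= 0;'
```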
Steals_From_Dragons | Yea, I see the thing that looks like a path. No negative numbers though. Interestingly both object_count and bytes_used are 0 | 19:44 |
timburke | huh. but swift-container-info still raises the error? maybe check the container_stat table, like sqlite3 -line $DB_PATH 'SELECT * FROM container_stat;' | 19:46 |
timburke | (my thinking is that it's tripping the error trying to generate its own_shard_range) | 19:47 |
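A sketch of that container_stat check, narrowed to the columns the rest of this conversation touches; the DB path is a placeholder and the column list is an assumption about the schema, so check it with .schema first:

```sh
DB=/path/to/root_container.db   # placeholder

# Confirm the column names before trusting the SELECT below.
sqlite3 "$DB" '.schema container_stat'

# Assumed column names: these are the stats the own shard range is built from.
sqlite3 -line "$DB" 'SELECT account, container, object_count, bytes_used,
                            delete_timestamp,
                            x_container_sync_point1, x_container_sync_point2
                     FROM container_stat;'
```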
Steals_From_Dragons | Ah, there's the negative number | 19:48 |
Steals_From_Dragons | Object_count is -1 | 19:48 |
Steals_From_Dragons | Actually, quite a few in here | 19:49 |
Steals_From_Dragons | Bytes_used, object_count, and both container_sync_points are negative | 19:49 |
timburke | the sync points being negative are fine; that just means it's never successfully replicated to the other primaries | 19:52 |
timburke | speaking of other primaries, have you checked the other replicas yet? | 19:53 |
Steals_From_Dragons | I just pulled them up with swift-get-nodes | 19:53 |
timburke | might also want to peek at the object table (though i expect it's empty) | 19:53 |
Steals_From_Dragons | Just want to make sure, since we are dealing with the container database, I should be using the container ring with swift-get-nodes, correct? | 19:56 |
timburke | yup! | 19:57 |
Steals_From_Dragons | Hm, getting a lot of 404s from the curls | 20:01 |
Steals_From_Dragons | But I think it's probably my cmd and not the actual object | 20:02 |
Steals_From_Dragons | `swift-exec swift-get-nodes /etc/swift/container.ring.gz -a <container from the sqlite -line cmd> ` look ok? | 20:03 |
Steals_From_Dragons | Ah! Got it! It was my cmd. Needed to put the account in there | 20:11 |
timburke | Steals_From_Dragons, that seems right -- as long as you've got the account in there, too, i suppose... | 20:15 |
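For reference, a sketch of the corrected invocation (minus any site-specific wrapper like swift-exec); the account and container here are hypothetical stand-ins for the values read out of container_stat:

```sh
# -a also lists handoff nodes; the output includes ready-made curl commands
# for each node, which is what produced the responses discussed below.
swift-get-nodes -a /etc/swift/container.ring.gz AUTH_d00d1edeadbeef problem-container
```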
Steals_From_Dragons | Now the curls respond with a 204 No content. | 20:15 |
Steals_From_Dragons | Looks like they all have the same negative values for bytes_used and object_count | 20:16 |
timburke | huh. any of the handoffs have anything? and do any of them have a delete timestamp? | 20:17 |
Steals_From_Dragons | The handoffs don't have anything, and the delete timestamp is all 0s | 20:18 |
Steals_From_Dragons | correction: one of the handoffs has the same values as the first 3 | 20:23 |
timburke | have you already checked the object table and verified that it's empty on all of them? the simplest thing might be to run something like 'UPDATE policy_stat SET object_count=0, bytes_used=0;' if it really should be empty... though exactly how we got here is still a mystery... | 20:25 |
Steals_From_Dragons | Object table in the container db? | 20:26 |
timburke | yup -- run something like 'SELECT * FROM object WHERE deleted=0' (since we don't mind there being tombstone rows) | 20:28 |
Steals_From_Dragons | It doesn't return anything. Makes me think there might not have been anything in there to begin with? | 20:31 |
Steals_From_Dragons | Checking without the 'WHERE deleted=0' doesn't return anything either | 20:32 |
timburke | in that case, i feel pretty good about running that UPDATE query -- best to hit all nodes | 20:34 |
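Putting those two steps together, a sketch of the check-then-fix pass to run against each replica (the DB path is a placeholder for whatever swift-get-nodes reports on that node):

```sh
DB=/path/to/root_container.db   # placeholder; repeat on every node holding the DB

# The container should be genuinely empty: no undeleted object rows at all.
sqlite3 "$DB" 'SELECT COUNT(*) FROM object WHERE deleted = 0;'

# Only if that prints 0: reset the corrupted counters, per the suggestion above.
sqlite3 "$DB" 'UPDATE policy_stat SET object_count = 0, bytes_used = 0;'
```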
Steals_From_Dragons | Before I do that... I decided to check the OpenStack project that "owns" this container (via the KEY uuid). The container doesn't show up in their container listing. Is it possible they deleted it and something went wrong with the tombstone process? I feel like that would explain why there are no objects. | 20:39 |
timburke | could be... makes me think the container got deleted, and even eventually reclaimed, but then the one corrupt handoff started trying to rsync itself back to primaries or something | 20:40 |
Steals_From_Dragons | So, if we fix the value, you think the delete process will reactivate? | 20:41 |
timburke | since it doesn't have any delete timestamp, no, not really. might be best to just delete the hashdir on all primaries & the handoff | 21:02 |
timburke | might want to stop the replicators first, issue the deletes throughout the cluster, then start them again | 21:04 |
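A sketch of that cleanup (the hashdir layout is spelled out a few lines further down); the path is a placeholder, and the ordering follows timburke's advice of quiescing replication before any deletes:

```sh
# Step 1, on the container nodes involved: stop replication so the DB
# isn't rsynced straight back while you work.
swift-init container-replicator stop

# Step 2, on each primary and the affected handoff: remove the hashdir
# (placeholder path; use the one swift-get-nodes reports for that node).
rm -rf /srv/node1/sdb1/containers/450/afd/7089ab48d955ab0851fc51cc17a34afd

# Step 3, once the deletes are done everywhere: bring replication back.
swift-init container-replicator start
```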
Steals_From_Dragons | The hashdir is the directory that's a number after /<drive>/container/, correct? | 21:30 |
Steals_From_Dragons | I just confirmed that this container was deleted a year and a half ago, kinda weird it's showing up now after all that time | 21:30 |
Steals_From_Dragons | I'll do all of the deletes on Monday so the weekend is safe. Thank you very much for your help today timburke | 21:35 |
timburke | sure thing! good luck, Steals_From_Dragons | 21:36 |
timburke | oh, but the hashdir is down deeper -- like on my dev vm, i've got a db file at /srv/node1/sdb1/containers/450/afd/7089ab48d955ab0851fc51cc17a34afd/7089ab48d955ab0851fc51cc17a34afd.db, the hashdir is /srv/node1/sdb1/containers/450/afd/7089ab48d955ab0851fc51cc17a34afd/ | 21:37 |
timburke | then /srv/node1/sdb1/containers/450/ is the partition; there are likely many DBs in that partition | 21:39 |
timburke | and /srv/node1/sdb1/containers/450/afd/ is the suffix; it's used to keep from having too many subdirectories directly under the partition | 21:40 |
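Laid out as a tree, that dev-VM example reads:

```
/srv/node1/sdb1/containers/450/                                       <- partition (holds many DBs)
/srv/node1/sdb1/containers/450/afd/                                   <- suffix (fans out the partition)
/srv/node1/sdb1/containers/450/afd/7089ab48d955ab0851fc51cc17a34afd/  <- hashdir (the unit to remove)
    7089ab48d955ab0851fc51cc17a34afd.db
```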
Steals_From_Dragons | Ah ok, thank you for explaining it. | 21:41 |