*** mrjk has quit IRC | 00:05 | |
*** mrjk has joined #openstack-swift | 00:06 | |
*** mikecmpbll has quit IRC | 00:45 | |
*** irclogbot_0 has quit IRC | 02:19 | |
*** psachin has joined #openstack-swift | 03:17 | |
*** psachin has quit IRC | 05:09 | |
*** psachin has joined #openstack-swift | 05:15 | |
*** ianychoi has quit IRC | 05:23 | |
*** ianychoi has joined #openstack-swift | 05:24 | |
*** psachin has quit IRC | 05:57 | |
*** psachin has joined #openstack-swift | 06:05 | |
*** ccamacho has quit IRC | 06:53 | |
*** rcernin has quit IRC | 06:58 | |
*** pcaruana has joined #openstack-swift | 06:58 | |
*** pcaruana has quit IRC | 07:02 | |
*** pcaruana has joined #openstack-swift | 07:02 | |
*** psachin has quit IRC | 07:08 | |
*** renich has joined #openstack-swift | 07:08 | |
*** rdejoux has joined #openstack-swift | 07:12 | |
*** psachin has joined #openstack-swift | 07:16 | |
*** psachin has quit IRC | 07:16 | |
*** ccamacho has joined #openstack-swift | 07:22 | |
*** e0ne has joined #openstack-swift | 07:40 | |
*** mikecmpbll has joined #openstack-swift | 07:43 | |
*** gkadam has joined #openstack-swift | 08:02 | |
*** e0ne has quit IRC | 08:11 | |
*** e0ne has joined #openstack-swift | 08:17 | |
*** renich has quit IRC | 08:19 | |
*** gkadam has quit IRC | 08:32 | |
*** tkajinam has quit IRC | 08:34 | |
*** mikecmpbll has quit IRC | 08:38 | |
*** mikecmpb_ has joined #openstack-swift | 08:39 | |
*** e0ne has quit IRC | 08:50 | |
*** e0ne has joined #openstack-swift | 08:59 | |
*** e0ne has quit IRC | 09:16 | |
*** gkadam has joined #openstack-swift | 09:28 | |
*** e0ne has joined #openstack-swift | 09:35 | |
*** e0ne has quit IRC | 10:45 | |
*** e0ne has joined #openstack-swift | 11:03 | |
*** e0ne has quit IRC | 11:10 | |
*** ybunker has joined #openstack-swift | 11:27 | |
*** e0ne has joined #openstack-swift | 12:11 | |
*** rcernin has joined #openstack-swift | 12:19 | |
zigo | Hi there ! | 12:24 |
zigo | I was wondering, how can I simulate a broken HDD, so that the drive audit does its job of commenting in /etc/fstab, etc. ? | 12:24 |
*** e0ne has quit IRC | 12:38 | |
*** mvkr has quit IRC | 12:49 | |
zigo | I could remove a drive with qemu's console, now how do I force Swift to audit the drive and remove it from fstab like with broken hdds? | 13:02 |
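For context on zigo's question: swift-drive-audit works by scanning kernel logs for I/O error patterns and, when a device accumulates enough matches, unmounting it and commenting it out of /etc/fstab. So one hedged way to simulate a failure without pulling hardware is to append a synthetic error line to the log file it watches. The host/device names and the regex below are illustrative only, not drive-audit's actual defaults (the real patterns come from the error_re option in the [drive-audit] config section):

```python
import re

# Synthetic kernel log line of the sort drive-audit scans for
# (hypothetical hostname and device):
fake_kmsg = ("Apr  2 12:00:00 storage01 kernel: [1234.5678] "
             "end_request: I/O error, dev sdb, sector 123456")

# Illustrative pattern only -- drive-audit's real patterns are configurable.
error_re = re.compile(r"I/O error.*dev\s+(sd[a-z]+)")
match = error_re.search(fake_kmsg)
print(match.group(1) if match else None)  # -> sdb
```

Appending a line like that to the configured log_file and running swift-drive-audit should exercise the unmount/fstab path against a test device.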
*** mvkr has joined #openstack-swift | 13:16 | |
*** irclogbot_3 has joined #openstack-swift | 13:27 | |
*** rcernin has quit IRC | 13:29 | |
*** mrjk has quit IRC | 13:45 | |
*** mrjk has joined #openstack-swift | 13:48 | |
*** e0ne has joined #openstack-swift | 14:06 | |
*** rchurch has joined #openstack-swift | 14:27 | |
ybunker | hi all, quick question: is there a way to manually delete partitions when drives hit the 100%-full condition? we can't run the object-replicator during the day because of excessive latency, so we run it in a daily maintenance window (4h); we already set the handoff_ parameters but the drives are still at 100% used space | 14:43
ybunker | so i was wondering if it's possible to manually make some space there | 14:43
ybunker | has anyone run into this kind of situation? | 14:44
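One hedged answer to ybunker's question: the space that is usually safe to reclaim by hand is handoff partitions, i.e. partition directories on a disk that the ring no longer assigns to that device. The sketch below only lists candidates against a throwaway directory; in a real cluster the assigned partitions would come from the object ring (e.g. via swift-get-nodes), and actually deleting them is better left to the replicator's handoff handling whenever possible:

```python
import os
import tempfile

# Build a fake objects/ directory with some partition dirs.
objects_dir = tempfile.mkdtemp()
for part in ("1234", "5678", "9999"):
    os.mkdir(os.path.join(objects_dir, part))

# Hard-coded for illustration; in reality this set comes from the ring.
ring_assigned = {"1234", "5678"}

on_disk = {d for d in os.listdir(objects_dir) if d.isdigit()}
handoff_candidates = sorted(on_disk - ring_assigned)
print(handoff_candidates)  # partitions that are only handoffs on this disk
```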
*** itlinux_ has quit IRC | 14:50 | |
*** e0ne has quit IRC | 15:24 | |
*** manous has joined #openstack-swift | 15:30 | |
manous | hi All | 15:30 |
manous | how can i solve this issue https://paste.fedoraproject.org/paste/4xm~fOEhTLOMqdxmIecMBQ | 15:31 |
*** e0ne has joined #openstack-swift | 15:44 | |
*** itlinux has joined #openstack-swift | 15:46 | |
*** e0ne has quit IRC | 15:47 | |
openstackgerrit | Tim Burke proposed openstack/swift master: WIP: s3api: Make multi-deletes async https://review.openstack.org/648263 | 15:50 |
timburke | good morning | 15:50 |
*** renich has joined #openstack-swift | 16:29 | |
*** mikecmpb_ has quit IRC | 16:34 | |
ybunker | anyone? | 16:39 |
*** e0ne has joined #openstack-swift | 16:45 | |
*** gkadam has quit IRC | 17:05 | |
*** e0ne has quit IRC | 17:35 | |
*** zigo has quit IRC | 17:37 | |
clayg | so with p 571906 once you realize that unquoted symlinks work poorly, and mostly by accident - and that quoted symlinks always work on purpose - it becomes easy to start to think "well, let's just get rid of unquoted symlinks with normalization on the way in, and all the unquoted symlinks we already have on disk that currently work will continue to work" | 17:50
patchbot | https://review.openstack.org/#/c/571906/ - swift - Make symlink work with Unicode account names - 4 patch sets | 17:50 |
clayg | i think it's a good bug fix personally | 17:50 |
clayg | all credit to timburke | 17:51 |
*** e0ne has joined #openstack-swift | 17:57 | |
*** mvkr has quit IRC | 18:01 | |
*** rdejoux has quit IRC | 18:02 | |
timburke | fwiw, i think we'll almost certainly want to get that in before trying to port symlink to py3, too | 18:07 |
timburke | once you've got that loaded in your head, you might want to look at https://review.openstack.org/#/c/571907/ too | 18:10 |
patchbot | patch 571907 - swift - Make staticweb return URL-encoded Location headers - 2 patch sets | 18:10 |
timburke | those were both part of a long chain leading toward https://review.openstack.org/#/c/571908/ -- i don't actually remember what the failures on that were now, though... | 18:11 |
patchbot | patch 571908 - swift - Support Unicode in account and user names during f... - 1 patch set | 18:11 |
*** klamath has joined #openstack-swift | 18:18 | |
klamath | Howdy, wondering if anyone is around to look at a weird container error I'm seeing | 18:18
klamath | seeing this error on liberty when trying to stat a container: | 18:20
klamath | Apr 1 17:55:21 908172-r2-z2-swiftstorage008 container-server: ERROR __call__ error with GET /disk48/67379/AUTH_XXXXXX/XXXXXX : | 18:20
klamath | Traceback (most recent call last): | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/container/server.py", line 582, in __call__ | 18:20
klamath |     res = method(req) | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/common/utils.py", line 2693, in wrapped | 18:20
klamath |     return func(*a, **kw) | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/common/utils.py", line 1230, in _timing_stats | 18:20
klamath |     resp = func(ctrl, *args, **kwargs) | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/container/server.py", line 469, in GET | 18:20
klamath |     resp_headers = gen_resp_headers(info, is_deleted=is_deleted) | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/container/server.py", line 54, in gen_resp_headers | 18:20
klamath |     'X-Backend-Timestamp': Timestamp(info.get('created_at', 0)).internal, | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/common/utils.py", line 756, in __init__ | 18:20
klamath |     self.timestamp = float(parts.pop(0)) | 18:20
klamath | ValueError: invalid literal for float(): 14870&4222&3r980 | 18:20
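That ValueError can be reproduced with a rough stand-in for what Timestamp does with created_at (simplified sketch, not Swift's actual implementation): strip any "+offset" parts of a composite timestamp, then float() the remainder, which is exactly where a corrupted string blows up:

```python
def parse_created_at(created_at):
    # Simplified stand-in for swift.common.utils.Timestamp.__init__:
    # drop any "+offset" parts of a composite timestamp, float() the rest.
    return float(created_at.split('+')[0])

ok = parse_created_at('1487064222.32110')   # a healthy created_at
try:
    parse_created_at('14870&4222&3r980')    # the corrupt value from the log
    corrupt_raises = False
except ValueError:
    corrupt_raises = True
print(ok, corrupt_raises)
```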
*** e0ne has quit IRC | 18:28 | |
*** itlinux has quit IRC | 18:44 | |
clayg | that looks like maybe a fast-POST-related timestamp encoding issue? | 18:44
klamath | can you explain more clayg? | 18:45 |
clayg | well, i need to eat some lunch... i was gonna search launchpad for that ValueError though.. isn't liberty kinda old-ish? | 18:47
klamath | yes liberty is oldish, the problem just started a few days ago | 18:47 |
*** mvkr has joined #openstack-swift | 18:48 | |
clayg | ORLY! what changed!? | 18:49 |
clayg | so you'll probably want to find the sqlite databases and compare the object rows in the problem db to a container that doesn't seem to have this problem. | 18:49 |
clayg | you might also verify whether all three copies of the sqlite database have weird rows... let me see if I can look at a raw composite timestamp real quick | 18:50
klamath | nothing changed on the 29th, all 4 container servers are reporting the same weird container listing times | 18:51 |
clayg | can you tell if it's for more than one object? | 18:52 |
klamath | it is affecting the container itself, can't pull any stat listings from those bad containers with the timestamps; looking at around 21 total containers having this problem | 18:53
clayg | ok, can you find the sqlite db's on disk? maybe using swift-get-nodes - i see you redacted the path AUTH_XXXXXX/XXXXXX | 18:54 |
clayg | it's on disk48 in partition 67379 *somewhere* | 18:54 |
*** itlinux has joined #openstack-swift | 18:55 | |
*** renich has quit IRC | 18:55 | |
klamath | yes we found the db on disk and looked at the container in question | 18:57 |
*** BjoernT has joined #openstack-swift | 18:57 | |
klamath | INSERT INTO "container_info" VALUES('AUTH_XXX','XXXXX','14870&4222&3r980','1487064222.32110','0','1487064222.32110','0',63819,2453015870,'e7d93b85101b026fad7275a0d8927e3b','75b35dfa-9081-468c-aa3a-eb3399a96768','','1487064222.32110','',-1,-1,0,889692); | 18:57 |
klamath | 18:57 | |
klamath | CREATE TABLE container_info ( | 18:57 |
klamath | account TEXT, | 18:57 |
klamath | container TEXT, | 18:57 |
klamath | created_at TEXT, | 18:57 |
klamath | put_timestamp TEXT DEFAULT '0', | 18:57 |
klamath | delete_timestamp TEXT DEFAULT '0', | 18:57 |
klamath | reported_put_timestamp TEXT DEFAULT '0', | 18:57 |
klamath | reported_delete_timestamp TEXT DEFAULT '0', | 18:57 |
klamath | reported_object_count INTEGER DEFAULT 0, | 18:57 |
klamath | reported_bytes_used INTEGER DEFAULT 0, | 18:58 |
klamath | hash TEXT default '00000000000000000000000000000000', | 18:58 |
klamath | id TEXT, | 18:58 |
klamath | status TEXT DEFAULT '', | 18:58 |
klamath | status_changed_at TEXT DEFAULT '0', | 18:58 |
klamath | metadata TEXT DEFAULT '', | 18:58 |
klamath | x_container_sync_point1 INTEGER DEFAULT -1, | 18:58 |
klamath | x_container_sync_point2 INTEGER DEFAULT -1, | 18:58 |
klamath | storage_policy_index INTEGER DEFAULT 0, | 18:58 |
klamath | reconciler_sync_point INTEGER DEFAULT -1 | 18:58 |
klamath | ); | 18:58 |
klamath | the problem is with the delete_timestamp and the non-numeric values stored in it | 18:58
timburke | i'm guessing it's some db corruption -- the string length is right for a timestamp, and '&' and '6' or '&' and '.' are just a few bitflips away from each other | 19:08
timburke | even '2' and 'r' are just one bitflip away... | 19:09 |
klamath | anyway to update these? | 19:09 |
klamath | would posting to the container update the delete_timestamp? | 19:10
timburke | was it the delete, or the create timestamp that was causing trouble? i thought create... | 19:12 |
timburke | from a sqlite3 prompt, something like `UPDATE container_info SET created_at='1487064222.32980';` would probably do | 19:12 |
timburke | do all replicas have that, or was it at least limited to just one db? | 19:12 |
klamath | would you need to acquire a lock on all dbs, or just one and have it propagate out? | 19:12
klamath | all dbs are showing the same bad timestamp | 19:13 |
timburke | might be worth looking around to see if you can establish a consensus about what it *should* be | 19:13 |
timburke | :-( | 19:13 |
clayg | timburke: you must have guessed db corruption and then played with the bytes? You don't just intuitively KNOW that '&' and '6' are near each other in an ascii table!? DO YOU!? | 19:13 |
clayg | timburke: object will have an x-timestamp - might not even be corrupted since we checksum object metadata | 19:13 |
* timburke shrugs innocently | 19:14 | |
klamath | we haven't made any changes to the db at this point, just read-only | 19:14
klamath | yea it appears in this case created_at is corrupt | 19:14 |
clayg | if it's at all useful, composite timestamps look like: created_at = 1554146043.88703+991803+0 | 19:16 |
clayg | so... my guess was wrong. | 19:16 |
timburke | so, a thing worth noting: now that we've identified at least *one* corrupt db... and likely seen that corruption spreading to *other* dbs... i'm more than a little worried about what *else* might be corrupted | 19:16 |
clayg | klamath: yeah post to the object with some bs metadata might be good enough... i don't know liberty... | 19:16 |
clayg | timburke: try not to stress about that and just put "more checksumming of sqlite data" on the todo list somewhere | 19:17 |
clayg | sqlite has some internal checksumming - it might be interesting to dig into how it managed to fail in this case | 19:18
clayg | klamath: what version of sqlite are you running!? | 19:18 |
clayg | timburke: I think the newest version of the replicator might have to be a bit smarter about having to parse rows (you were looking at merge_items recently) - it's possible it wouldn't have been able to propagate the corruption | 19:19 |
klamath | 2.8.17 | 19:20 |
klamath | any pointers on a metadata update that would trigger a container update? | 19:23 |
BjoernT | we run sqlite3, which introduced new locking: "SQLite Version 3.0.0 introduced a new locking and journaling mechanism designed to improve concurrency over SQLite version 2 and to reduce the writer starvation problem. The new mechanism also allows atomic commits of transactions involving multiple database files. This document describes the new locking mechanism. The intended audience is programmers who want to understand and/or modify the pager | 19:24
BjoernT | code and reviewers working to verify the design of SQLite version 3." which has a topic around "How To Corrupt Your Database Files" | 19:24
BjoernT | libsqlite3-0:amd64 3.8.2-1ubuntu2.1 amd64 SQLite 3 shared library | 19:25 |
*** e0ne has joined #openstack-swift | 19:27 | |
BjoernT | perhaps nobarrier screwed us over here | 19:27 |
timburke | clayg, i guess it'd probably be worth pulling https://github.com/openstack/swift/blob/2.21.0/swift/common/db.py#L566-L570 out of SQL -- parse the values as actual timestamps, make the comparisons in python and store the greater... | 19:29 |
timburke | klamath, it's particularly tricky because it's created_at that's corrupted -- and that only gets set (as i recall) during the broker's _initialize, so only when you don't already have a db file on disk | 19:30 |
clayg | Oh, it's the container info | 19:31
timburke | if it were put_timestamp instead, you could probably just issue a new PUT for the container, but as it is... not sure there's a good way to fix this via the swift API | 19:31
klamath | can we manually update that created_at on the sqlite level and have it replicate out? | 19:31 |
*** ybunker has quit IRC | 19:31 | |
clayg | That actually explains how the corruption spread a little better. But not between dbs. Common disk maybe? | 19:34
timburke | might be safest to stop the container replicators on the affected nodes, manually run the update, then restart replicators. the trouble is that the '&' is going to compare less than '6', so the corrupt timestamp will win out during replication | 19:34 |
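timburke's replication concern is easy to check directly: these columns are TEXT, so SQL's MIN()/MAX() and Python's < compare them lexicographically, and '&' (0x26) sorts before '6' (0x36), making the corrupt value look "older":

```python
corrupt = '14870&4222&3r980'
good = '1487064222.32980'

# Lexicographic comparison, as a TEXT column would be compared in SQL:
print(corrupt < good)         # True: the corrupt string sorts first
print(min(corrupt, good))     # so it wins any "keep the oldest" merge
```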
clayg | Haha | 19:34 |
timburke | maybe if you don't mind an inaccurate created_at, you could set it to 1486964222.32980 instead of 1487064222.32980? | 19:35 |
*** spsurya has quit IRC | 19:36 | |
timburke | and be very very happy that the corruption didn't occur in that leading digit ;-) | 19:36 |
*** manous has quit IRC | 19:40 | |
klamath | any risk in increasing the timestamp timburke? | 19:41 |
klamath | i just tried using swiftly to put a file into that bad container and it uploaded, but i still can't pull a container listing or any info from that container | 19:43
timburke | klamath, i think the risks to using an earlier timestamp are fairly low -- fortunately, the position of the corruption means that it'll only change the created_at by a couple days or so | 19:48 |
timburke | `UPDATE container_info SET created_at='1486964222.32980';` is seeming better and better | 19:49 |
timburke | don't have to stop the replicators, should be able to do it on just one affected db... | 19:50 |
timburke | still might take a bit to have the replicators propagate it out to all replicas, though | 19:50 |
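The repair timburke describes can be sketched against a throwaway database (a single-column stand-in for container_info, which only ever holds one row, so an unqualified UPDATE is safe here):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), 'container.db')
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE container_info (created_at TEXT)")
conn.execute("INSERT INTO container_info VALUES ('14870&4222&3r980')")

# The one-off fix: overwrite the corrupt created_at with a slightly earlier
# value, which still sorts below the corrupt string on the other replicas.
conn.execute("UPDATE container_info SET created_at = '1486964222.32980'")
conn.commit()

fixed = conn.execute("SELECT created_at FROM container_info").fetchone()[0]
print(fixed)
```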
BjoernT | how sure are you that only one bit flipped and not multiple ? | 19:51 |
BjoernT | looking at this timestamp puts us in 1974 | 19:51 |
timburke | eh? i'm seeing feb 2017... | 19:52 |
BjoernT | its milli seconds ? | 19:52 |
timburke | so my thinking is that '14870&4222&3r980' was *supposed* to be '1487064222.32980' -- which required a total of four bitflips | 19:53 |
BjoernT | oh I was looking at r is the . | 19:53 |
timburke | oh! maybe only 3 flips... for some reason i thought one of them required two flips... | 19:56 |
timburke | oh, i was trying to go . -> 6 when i needed to be going & -> . and & -> 6 | 19:58 |
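Counting the flips directly confirms the three-bitflip reading: XOR each byte of the corrupt string against the presumed-good one and popcount the result:

```python
corrupt = '14870&4222&3r980'
presumed = '1487064222.32980'

flips = sum(bin(ord(a) ^ ord(b)).count('1')
            for a, b in zip(corrupt, presumed))
diffs = [(i, a, b) for i, (a, b) in enumerate(zip(corrupt, presumed))
         if a != b]
print(flips, diffs)  # 3 single-bit flips: '&' vs '6', '&' vs '.', 'r' vs '2'
```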
klamath | that fixed the problem on this one container | 20:23 |
BjoernT | got a new date, lol 1478352919.<9729 | 20:25 |
BjoernT | < = 4 ? | 20:25 |
BjoernT | or 8 | 20:26 |
BjoernT | probably doesn't matter as it's subseconds | 20:26
BjoernT | interestingly it is always just container_info | 20:38 |
*** itlinux has quit IRC | 20:58 | |
*** itlinux has joined #openstack-swift | 20:59 | |
*** e0ne has quit IRC | 21:03 | |
*** pcaruana has quit IRC | 21:17 | |
*** samueldmq has joined #openstack-swift | 21:36 | |
*** itlinux has quit IRC | 21:43 | |
*** ccamacho has quit IRC | 21:43 | |
clayg | klamath: WTFG!!! | 21:43 |
clayg | tell your boss you get a raise | 21:44 |
*** itlinux has joined #openstack-swift | 21:44 | |
*** itlinux has quit IRC | 21:44 | |
*** BjoernT has quit IRC | 22:01 | |
*** renich has joined #openstack-swift | 22:31 | |
*** rcernin has joined #openstack-swift | 22:40 | |
*** tkajinam has joined #openstack-swift | 22:56 | |
*** mikecmpbll has joined #openstack-swift | 23:15 | |
*** itlinux has joined #openstack-swift | 23:24 | |
*** renich has quit IRC | 23:24 | |
*** renich has joined #openstack-swift | 23:38 | |
*** openstackgerrit has quit IRC | 23:56 | |
*** BjoernT has joined #openstack-swift | 23:57 | |
*** timburke has quit IRC | 23:58 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!