*** mrjk has quit IRC | 00:05 | |
*** mrjk has joined #openstack-swift | 00:06 | |
*** mikecmpbll has quit IRC | 00:45 | |
*** irclogbot_0 has quit IRC | 02:19 | |
*** psachin has joined #openstack-swift | 03:17 | |
*** psachin has quit IRC | 05:09 | |
*** psachin has joined #openstack-swift | 05:15 | |
*** ianychoi has quit IRC | 05:23 | |
*** ianychoi has joined #openstack-swift | 05:24 | |
*** psachin has quit IRC | 05:57 | |
*** psachin has joined #openstack-swift | 06:05 | |
*** ccamacho has quit IRC | 06:53 | |
*** rcernin has quit IRC | 06:58 | |
*** pcaruana has joined #openstack-swift | 06:58 | |
*** pcaruana has quit IRC | 07:02 | |
*** pcaruana has joined #openstack-swift | 07:02 | |
*** psachin has quit IRC | 07:08 | |
*** renich has joined #openstack-swift | 07:08 | |
*** rdejoux has joined #openstack-swift | 07:12 | |
*** psachin has joined #openstack-swift | 07:16 | |
*** psachin has quit IRC | 07:16 | |
*** ccamacho has joined #openstack-swift | 07:22 | |
*** e0ne has joined #openstack-swift | 07:40 | |
*** mikecmpbll has joined #openstack-swift | 07:43 | |
*** gkadam has joined #openstack-swift | 08:02 | |
*** e0ne has quit IRC | 08:11 | |
*** e0ne has joined #openstack-swift | 08:17 | |
*** renich has quit IRC | 08:19 | |
*** gkadam has quit IRC | 08:32 | |
*** tkajinam has quit IRC | 08:34 | |
*** mikecmpbll has quit IRC | 08:38 | |
*** mikecmpb_ has joined #openstack-swift | 08:39 | |
*** e0ne has quit IRC | 08:50 | |
*** e0ne has joined #openstack-swift | 08:59 | |
*** e0ne has quit IRC | 09:16 | |
*** gkadam has joined #openstack-swift | 09:28 | |
*** e0ne has joined #openstack-swift | 09:35 | |
*** e0ne has quit IRC | 10:45 | |
*** e0ne has joined #openstack-swift | 11:03 | |
*** e0ne has quit IRC | 11:10 | |
*** ybunker has joined #openstack-swift | 11:27 | |
*** e0ne has joined #openstack-swift | 12:11 | |
*** rcernin has joined #openstack-swift | 12:19 | |
zigo | Hi there ! | 12:24 |
zigo | I was wondering, how can I simulate a broken HDD, so that the drive audit does its job of commenting in /etc/fstab, etc. ? | 12:24 |
*** e0ne has quit IRC | 12:38 | |
*** mvkr has quit IRC | 12:49 | |
zigo | I could remove a drive with qemu's console, now how do I force Swift to audit the drive and remove it from fstab like with broken hdds? | 13:02 |
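For context on zigo's question: swift-drive-audit works by scanning kernel logs for I/O error patterns and, when a device accumulates enough matches, unmounting it and commenting it out of /etc/fstab. So one hedged way to simulate a failure without pulling hardware is to append a synthetic error line to the log file it watches. The host/device names and the regex below are illustrative only, not drive-audit's actual defaults (the real patterns come from the error_re option in the [drive-audit] config section):

```python
import re

# Synthetic kernel log line of the sort drive-audit scans for
# (hypothetical hostname and device):
fake_kmsg = ("Apr  2 12:00:00 storage01 kernel: [1234.5678] "
             "end_request: I/O error, dev sdb, sector 123456")

# Illustrative pattern only -- drive-audit's real patterns are configurable.
error_re = re.compile(r"I/O error.*dev\s+(sd[a-z]+)")
match = error_re.search(fake_kmsg)
print(match.group(1) if match else None)  # -> sdb
```

Appending a line like that to the configured log_file and running swift-drive-audit should exercise the unmount/fstab path against a test device.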
*** mvkr has joined #openstack-swift | 13:16 | |
*** irclogbot_3 has joined #openstack-swift | 13:27 | |
*** rcernin has quit IRC | 13:29 | |
*** mrjk has quit IRC | 13:45 | |
*** mrjk has joined #openstack-swift | 13:48 | |
*** e0ne has joined #openstack-swift | 14:06 | |
*** rchurch has joined #openstack-swift | 14:27 | |
ybunker | hi all, quick question: is there a way to manually delete partitions when drives hit the 100%-full condition? we can't run the object-replicator during the day because of excessive latency, so we run it in a daily maintenance window (4h); we already set the handoff_ parameters but the drives are still at 100% used space | 14:43
ybunker | so i was wondering if it's possible to manually make some space there | 14:43
ybunker | has anyone run into this kind of situation? | 14:44
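One hedged answer to ybunker's question: the space that is usually safe to reclaim by hand is handoff partitions, i.e. partition directories on a disk that the ring no longer assigns to that device. The sketch below only lists candidates against a throwaway directory; in a real cluster the assigned partitions would come from the object ring (e.g. via swift-get-nodes), and actually deleting them is better left to the replicator's handoff handling whenever possible:

```python
import os
import tempfile

# Build a fake objects/ directory with some partition dirs.
objects_dir = tempfile.mkdtemp()
for part in ("1234", "5678", "9999"):
    os.mkdir(os.path.join(objects_dir, part))

# Hard-coded for illustration; in reality this set comes from the ring.
ring_assigned = {"1234", "5678"}

on_disk = {d for d in os.listdir(objects_dir) if d.isdigit()}
handoff_candidates = sorted(on_disk - ring_assigned)
print(handoff_candidates)  # partitions that are only handoffs on this disk
```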
*** itlinux_ has quit IRC | 14:50 | |
*** e0ne has quit IRC | 15:24 | |
*** manous has joined #openstack-swift | 15:30 | |
manous | hi All | 15:30 |
manous | how can i solve this issue https://paste.fedoraproject.org/paste/4xm~fOEhTLOMqdxmIecMBQ | 15:31 |
*** e0ne has joined #openstack-swift | 15:44 | |
*** itlinux has joined #openstack-swift | 15:46 | |
*** e0ne has quit IRC | 15:47 | |
openstackgerrit | Tim Burke proposed openstack/swift master: WIP: s3api: Make multi-deletes async https://review.openstack.org/648263 | 15:50 |
timburke | good morning | 15:50 |
*** renich has joined #openstack-swift | 16:29 | |
*** mikecmpb_ has quit IRC | 16:34 | |
ybunker | anyone? | 16:39 |
*** e0ne has joined #openstack-swift | 16:45 | |
*** gkadam has quit IRC | 17:05 | |
*** e0ne has quit IRC | 17:35 | |
*** zigo has quit IRC | 17:37 | |
clayg | so with p 571906 once you realize that unquoted symlinks work poorly, and mostly by accident - and that quoted symlinks always work on purpose - it becomes easy to start to think "well, let's just get rid of unquoted symlinks with normalization on the way in, and all the unquoted symlinks we already have on disk that currently work will continue to work" | 17:50
patchbot | https://review.openstack.org/#/c/571906/ - swift - Make symlink work with Unicode account names - 4 patch sets | 17:50 |
clayg | i think it's a good bug fix personally | 17:50 |
clayg | all credit to timburke | 17:51 |
*** e0ne has joined #openstack-swift | 17:57 | |
*** mvkr has quit IRC | 18:01 | |
*** rdejoux has quit IRC | 18:02 | |
timburke | fwiw, i think we'll almost certainly want to get that in before trying to port symlink to py3, too | 18:07 |
timburke | once you've got that loaded in your head, you might want to look at https://review.openstack.org/#/c/571907/ too | 18:10 |
patchbot | patch 571907 - swift - Make staticweb return URL-encoded Location headers - 2 patch sets | 18:10 |
timburke | those were both part of a long chain leading toward https://review.openstack.org/#/c/571908/ -- i don't actually remember what the failures on that were now, though... | 18:11 |
patchbot | patch 571908 - swift - Support Unicode in account and user names during f... - 1 patch set | 18:11 |
*** klamath has joined #openstack-swift | 18:18 | |
klamath | Howdy, wondering if anyone is around to look at a weird container error I'm seeing | 18:18
klamath | seeing this error on liberty when trying to stat a container: | 18:20
klamath | Apr 1 17:55:21 908172-r2-z2-swiftstorage008 container-server: ERROR __call__ error with GET /disk48/67379/AUTH_XXXXXX/XXXXXX : | 18:20
klamath | Traceback (most recent call last): | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/container/server.py", line 582, in __call__ | 18:20
klamath |     res = method(req) | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/common/utils.py", line 2693, in wrapped | 18:20
klamath |     return func(*a, **kw) | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/common/utils.py", line 1230, in _timing_stats | 18:20
klamath |     resp = func(ctrl, *args, **kwargs) | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/container/server.py", line 469, in GET | 18:20
klamath |     resp_headers = gen_resp_headers(info, is_deleted=is_deleted) | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/container/server.py", line 54, in gen_resp_headers | 18:20
klamath |     'X-Backend-Timestamp': Timestamp(info.get('created_at', 0)).internal, | 18:20
klamath |   File "/openstack/venvs/swift-12.0.13/lib/python2.7/site-packages/swift/common/utils.py", line 756, in __init__ | 18:20
klamath |     self.timestamp = float(parts.pop(0)) | 18:20
klamath | ValueError: invalid literal for float(): 14870&4222&3r980 | 18:20
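That ValueError can be reproduced with a rough stand-in for what Timestamp does with created_at (simplified sketch, not Swift's actual implementation): strip any "+offset" parts of a composite timestamp, then float() the remainder, which is exactly where a corrupted string blows up:

```python
def parse_created_at(created_at):
    # Simplified stand-in for swift.common.utils.Timestamp.__init__:
    # drop any "+offset" parts of a composite timestamp, float() the rest.
    return float(created_at.split('+')[0])

ok = parse_created_at('1487064222.32110')   # a healthy created_at
try:
    parse_created_at('14870&4222&3r980')    # the corrupt value from the log
    corrupt_raises = False
except ValueError:
    corrupt_raises = True
print(ok, corrupt_raises)
```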
*** e0ne has quit IRC | 18:28 | |
*** itlinux has quit IRC | 18:44 | |
clayg | that looks like maybe a fast-POST-related timestamp encoding issue? | 18:44
klamath | can you explain more clayg? | 18:45 |
clayg | well, i need to eat some lunch... i was gonna search launchpad for that ValueError though.. isn't liberty kinda old-ish? | 18:47
klamath | yes liberty is oldish, the problem just started a few days ago | 18:47 |
*** mvkr has joined #openstack-swift | 18:48 | |
clayg | ORLY! what changed!? | 18:49 |
clayg | so you'll probably want to find the sqlite databases and compare the object rows in the problem db to a container that doesn't seem to have this problem. | 18:49 |
clayg | you might also verify whether all three copies of the sqlite database have weird rows... let me see if I can look at a raw composite timestamp real quick | 18:50
klamath | nothing changed on the 29th, all 4 container servers are reporting the same weird container listing times | 18:51 |
clayg | can you tell if it's for more than one object? | 18:52 |
klamath | it is affecting the container itself, can't pull any stat listings from those bad containers with the timestamps; looking at around 21 total containers having this problem | 18:53
clayg | ok, can you find the sqlite db's on disk? maybe using swift-get-nodes - i see you redacted the path AUTH_XXXXXX/XXXXXX | 18:54 |
clayg | it's on disk48 in partition 67379 *somewhere* | 18:54 |
*** itlinux has joined #openstack-swift | 18:55 | |
*** renich has quit IRC | 18:55 | |
klamath | yes we found the db on disk and looked at the container in question | 18:57 |
*** BjoernT has joined #openstack-swift | 18:57 | |
klamath | INSERT INTO "container_info" VALUES('AUTH_XXX','XXXXX','14870&4222&3r980','1487064222.32110','0','1487064222.32110','0',63819,2453015870,'e7d93b85101b026fad7275a0d8927e3b','75b35dfa-9081-468c-aa3a-eb3399a96768','','1487064222.32110','',-1,-1,0,889692); | 18:57 |
klamath | 18:57 | |
klamath | CREATE TABLE container_info ( | 18:57 |
klamath | account TEXT, | 18:57 |
klamath | container TEXT, | 18:57 |
klamath | created_at TEXT, | 18:57 |
klamath | put_timestamp TEXT DEFAULT '0', | 18:57 |
klamath | delete_timestamp TEXT DEFAULT '0', | 18:57 |
klamath | reported_put_timestamp TEXT DEFAULT '0', | 18:57 |
klamath | reported_delete_timestamp TEXT DEFAULT '0', | 18:57 |
klamath | reported_object_count INTEGER DEFAULT 0, | 18:57 |
klamath | reported_bytes_used INTEGER DEFAULT 0, | 18:58 |
klamath | hash TEXT default '00000000000000000000000000000000', | 18:58 |
klamath | id TEXT, | 18:58 |
klamath | status TEXT DEFAULT '', | 18:58 |
klamath | status_changed_at TEXT DEFAULT '0', | 18:58 |
klamath | metadata TEXT DEFAULT '', | 18:58 |
klamath | x_container_sync_point1 INTEGER DEFAULT -1, | 18:58 |
klamath | x_container_sync_point2 INTEGER DEFAULT -1, | 18:58 |
klamath | storage_policy_index INTEGER DEFAULT 0, | 18:58 |
klamath | reconciler_sync_point INTEGER DEFAULT -1 | 18:58 |
klamath | ); | 18:58 |
klamath | the problem is with the delete_timestamp and the non-numeric values stored in it | 18:58
timburke | i'm guessing it's some db corruption -- the string length is right for a timestamp, and '&' and '6' or '&' and '.' are just a few bitflips away from each other | 19:08
timburke | even '2' and 'r' are just one bitflip away... | 19:09 |
klamath | anyway to update these? | 19:09 |
klamath | would posting to the container update the delete_timestamp? | 19:10
timburke | was it the delete, or the create timestamp that was causing trouble? i thought create... | 19:12 |
timburke | from a sqlite3 prompt, something like `UPDATE container_info SET created_at='1487064222.32980';` would probably do | 19:12 |
timburke | do all replicas have that, or was it at least limited to just one db? | 19:12 |
klamath | would you need to acquire a lock on all dbs, or just one and have it propagate out? | 19:12
klamath | all dbs are showing the same bad timestamp | 19:13 |
timburke | might be worth looking around to see if you can establish a consensus about what it *should* be | 19:13 |
timburke | :-( | 19:13 |
clayg | timburke: you must have guessed db corruption and then played with the bytes? You don't just intuitively KNOW that '&' and '6' are near each other in an ascii table!? DO YOU!? | 19:13 |
clayg | timburke: object will have an x-timestamp - might not even be corrupted since we checksum object metadata | 19:13 |
* timburke shrugs innocently | 19:14 | |
klamath | we haven't made any changes to the db at this point, just read-only | 19:14
klamath | yea it appears in this case created_at is corrupt | 19:14 |
clayg | if it's at all useful, composite timestamps look like: created_at = 1554146043.88703+991803+0 | 19:16 |
clayg | so... my guess was wrong. | 19:16 |
timburke | so, a thing worth noting: now that we've identified at least *one* corrupt db... and likely seen that corruption spreading to *other* dbs... i'm more than a little worried about what *else* might be corrupted | 19:16 |
clayg | klamath: yeah post to the object with some bs metadata might be good enough... i don't know liberty... | 19:16 |
clayg | timburke: try not to stress about that and just put "more checksumming of sqlite data" on the todo list somewhere | 19:17 |
clayg | sqlite has some internal checksumming - it might be interesting to dig into how it managed to fail in this case | 19:18
clayg | klamath: what version of sqlite are you running!? | 19:18 |
clayg | timburke: I think the newest version of the replicator might have to be a bit smarter about having to parse rows (you were looking at merge_items recently) - it's possible it wouldn't have been able to propagate the corruption | 19:19 |
klamath | 2.8.17 | 19:20 |
klamath | any pointers on a metadata update that would trigger a container update? | 19:23 |
BjoernT | we run sqlite3, which introduced new locking: "SQLite Version 3.0.0 introduced a new locking and journaling mechanism designed to improve concurrency over SQLite version 2 and to reduce the writer starvation problem. The new mechanism also allows atomic commits of transactions involving multiple database files. This document describes the new locking mechanism. The intended audience is programmers who want to understand and/or modify the pager | 19:24
BjoernT | code and reviewers working to verify the design of SQLite version 3." which has a topic around "How To Corrupt Your Database Files" | 19:24
BjoernT | libsqlite3-0:amd64 3.8.2-1ubuntu2.1 amd64 SQLite 3 shared library | 19:25 |
*** e0ne has joined #openstack-swift | 19:27 | |
BjoernT | perhaps nobarrier screwed us over here | 19:27 |
timburke | clayg, i guess it'd probably be worth pulling https://github.com/openstack/swift/blob/2.21.0/swift/common/db.py#L566-L570 out of SQL -- parse the values as actual timestamps, make the comparisons in python and store the greater... | 19:29 |
timburke | klamath, it's particularly tricky because it's created_at that's corrupted -- and that only gets set (as i recall) during the broker's _initialize, so only when you don't already have a db file on disk | 19:30 |
clayg | Oh, it's the container info | 19:31
timburke | if it were put_timestamp instead, you could probably just issue a new PUT for the container, but as it is... not sure there's a good way to fix this via the swift API | 19:31
klamath | can we manually update that created_at on the sqlite level and have it replicate out? | 19:31 |
*** ybunker has quit IRC | 19:31 | |
clayg | That actually explains how the corruption spread a little better. But not between dbs. Common disk maybe? | 19:34
timburke | might be safest to stop the container replicators on the affected nodes, manually run the update, then restart replicators. the trouble is that the '&' is going to compare less than '6', so the corrupt timestamp will win out during replication | 19:34 |
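timburke's replication concern is easy to check directly: these columns are TEXT, so SQL's MIN()/MAX() and Python's < compare them lexicographically, and '&' (0x26) sorts before '6' (0x36), making the corrupt value look "older":

```python
corrupt = '14870&4222&3r980'
good = '1487064222.32980'

# Lexicographic comparison, as a TEXT column would be compared in SQL:
print(corrupt < good)         # True: the corrupt string sorts first
print(min(corrupt, good))     # so it wins any "keep the oldest" merge
```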
clayg | Haha | 19:34 |
timburke | maybe if you don't mind an inaccurate created_at, you could set it to 1486964222.32980 instead of 1487064222.32980? | 19:35 |
*** spsurya has quit IRC | 19:36 | |
timburke | and be very very happy that the corruption didn't occur in that leading digit ;-) | 19:36 |
*** manous has quit IRC | 19:40 | |
klamath | any risk in increasing the timestamp timburke? | 19:41 |
klamath | i just tried using swiftly to put a file into that bad container and it uploaded, but i still can't pull a container listing or any info from that container | 19:43
timburke | klamath, i think the risks to using an earlier timestamp are fairly low -- fortunately, the position of the corruption means that it'll only change the created_at by a couple days or so | 19:48 |
timburke | `UPDATE container_info SET created_at='1486964222.32980';` is seeming better and better | 19:49 |
timburke | don't have to stop the replicators, should be able to do it on just one affected db... | 19:50 |
timburke | still might take a bit to have the replicators propagate it out to all replicas, though | 19:50 |
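The repair timburke describes can be sketched against a throwaway database (a single-column stand-in for container_info, which only ever holds one row, so an unqualified UPDATE is safe here):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), 'container.db')
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE container_info (created_at TEXT)")
conn.execute("INSERT INTO container_info VALUES ('14870&4222&3r980')")

# The one-off fix: overwrite the corrupt created_at with a slightly earlier
# value, which still sorts below the corrupt string on the other replicas.
conn.execute("UPDATE container_info SET created_at = '1486964222.32980'")
conn.commit()

fixed = conn.execute("SELECT created_at FROM container_info").fetchone()[0]
print(fixed)
```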
BjoernT | how sure are you that only one bit flipped and not multiple ? | 19:51 |
BjoernT | looking at this timestamp puts us in 1974 | 19:51 |
timburke | eh? i'm seeing feb 2017... | 19:52 |
BjoernT | its milli seconds ? | 19:52 |
timburke | so my thinking is that '14870&4222&3r980' was *supposed* to be '1487064222.32980' -- which required a total of four bitflips | 19:53 |
BjoernT | oh I was looking at r is the . | 19:53 |
timburke | oh! maybe only 3 flips... for some reason i thought one of them required two flips... | 19:56 |
timburke | oh, i was trying to go . -> 6 when i needed to be going & -> . and & -> 6 | 19:58 |
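Counting the flips directly confirms the three-bitflip reading: XOR each byte of the corrupt string against the presumed-good one and popcount the result:

```python
corrupt = '14870&4222&3r980'
presumed = '1487064222.32980'

flips = sum(bin(ord(a) ^ ord(b)).count('1')
            for a, b in zip(corrupt, presumed))
diffs = [(i, a, b) for i, (a, b) in enumerate(zip(corrupt, presumed))
         if a != b]
print(flips, diffs)  # 3 single-bit flips: '&' vs '6', '&' vs '.', 'r' vs '2'
```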
klamath | that fixed the problem on this one container | 20:23 |
BjoernT | got a new date, lol 1478352919.<9729 | 20:25 |
BjoernT | < = 4 ? | 20:25 |
BjoernT | or 8 | 20:26 |
BjoernT | probably doesn't matter as it's subseconds | 20:26
BjoernT | interestingly it is always just container_info | 20:38 |
*** itlinux has quit IRC | 20:58 | |
*** itlinux has joined #openstack-swift | 20:59 | |
*** e0ne has quit IRC | 21:03 | |
*** pcaruana has quit IRC | 21:17 | |
*** samueldmq has joined #openstack-swift | 21:36 | |
*** itlinux has quit IRC | 21:43 | |
*** ccamacho has quit IRC | 21:43 | |
clayg | klamath: WTFG!!! | 21:43 |
clayg | tell your boss you get a raise | 21:44 |
*** itlinux has joined #openstack-swift | 21:44 | |
*** itlinux has quit IRC | 21:44 | |
*** BjoernT has quit IRC | 22:01 | |
*** renich has joined #openstack-swift | 22:31 | |
*** rcernin has joined #openstack-swift | 22:40 | |
*** tkajinam has joined #openstack-swift | 22:56 | |
*** mikecmpbll has joined #openstack-swift | 23:15 | |
*** itlinux has joined #openstack-swift | 23:24 | |
*** renich has quit IRC | 23:24 | |
*** renich has joined #openstack-swift | 23:38 | |
*** openstackgerrit has quit IRC | 23:56 | |
*** BjoernT has joined #openstack-swift | 23:57 | |
*** timburke has quit IRC | 23:58 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!