21:00:11 <timburke_> #startmeeting swift
21:00:12 <openstack> Meeting started Wed Jan 13 21:00:11 2021 UTC and is due to finish in 60 minutes. The chair is timburke_. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:15 <openstack> The meeting name has been set to 'swift'
21:00:19 <timburke_> who's here for the swift meeting?
21:00:27 <mattoliverau> o/
21:00:29 <seongsoocho> o/
21:00:44 <rledisez> o/
21:01:03 <acoles> o/
21:01:12 <dsariel> o/
21:01:25 <kota_> o/
21:01:54 <timburke_> as usual, the agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:14 <timburke_> first up
21:02:18 <timburke_> #topic reconciler/ec/encryption
21:02:28 <timburke_> #link https://bugs.launchpad.net/swift/+bug/1910804
21:02:30 <openstack> Launchpad bug 1910804 in OpenStack Object Storage (swift) "Encryption doesn't play well with processes that copy cleartext data while preserving timestamps" [Undecided,New]
21:03:07 <timburke_> so i had a customer report an issue with an object that would consistently 503
21:03:49 <clayg> ohai
21:04:06 <timburke_> digging in more, we found that they had 11 frags of it for an 8+4 policy... but those had 3 separate sets of crypto meta between them
21:04:28 <timburke_> ...and no set of crypto meta had more than 7 frags
21:04:34 <clayg> lawl
21:05:16 <acoles> (I had to think about this at first) meaning frags have been encrypted with three different body keys...for same object!!!
21:06:47 <timburke_> root cause was traced back to a couple issues: (1) we deploy with encryption in the reconciler pipeline and (2) we have every (container?) node running a reconciler
21:07:55 <timburke_> (well, that and the fact that it was moved to an EC policy.
if it were going to a replicated policy, any replica regardless of crypto meta would be capable of generating a client response)
21:08:48 <timburke_> i've got a fix up to pull encryption out of the reconciler pipeline if it was misconfigured -- https://review.opendev.org/c/openstack/swift/+/770522
21:09:23 <timburke_> but i wanted to raise awareness of the issue so no one else finds themselves in this situation
21:10:56 <timburke_> also worth noting: i think you could run into a similar issue *without encryption* if your EC backend is non-deterministic
21:12:50 <timburke_> the open source backends are deterministic as i recall (that is, the frag outputs only depend on the EC params from swift.conf and the input data), but i don't know the details of shss, for example
21:13:39 <timburke_> does anyone have any questions about the bug or its impact?
21:14:54 <timburke_> all right
21:14:57 <mattoliverau> Nice investigation!
21:15:00 <timburke_> #topic SSYNC and non-durable frags
21:15:14 <timburke_> #link https://bugs.launchpad.net/swift/+bug/1778002
21:15:16 <openstack> Launchpad bug 1778002 in OpenStack Object Storage (swift) "EC non-durable fragment won't be deleted by reconstructor. " [High,Confirmed]
21:15:49 <timburke_> i know acoles (and clayg?) has been working on this problem a bit lately, though i'm not sure where things stand
21:15:49 <kota_> shss might be impacted. i'll check it.
21:16:03 <acoles> I just got my probe test working!
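Going back to the fragment counts reported for bug 1910804 above, the arithmetic can be sketched roughly as follows. This is an illustration only, not Swift code: it assumes that fragments carrying different crypto meta cannot be combined, so an 8+4 policy needs 8 fragments from a single set. The function name, the set labels, and the 7/3/1 split are hypothetical; the log only states 11 frags, 3 sets, and no set larger than 7.

```python
from collections import Counter


def decodable_sets(frag_crypto_ids, ndata):
    """Return the crypto-meta ids that have at least ndata fragments."""
    counts = Counter(frag_crypto_ids)
    return [crypto_id for crypto_id, n in counts.items() if n >= ndata]


# Hypothetical distribution of the 11 fragments across 3 crypto-meta sets.
frags = ["set-a"] * 7 + ["set-b"] * 3 + ["set-c"] * 1

# 8+4 policy: 8 data fragments are needed to rebuild the object.
print(len(frags), decodable_sets(frags, ndata=8))  # 11 []
```

Under that assumption, 11 fragments exist but no single set reaches 8, which would be consistent with the object consistently returning 503.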
21:16:11 <timburke_> \o/
21:16:24 <acoles> background: we noticed some partitions were never cleaned up on handoffs
21:16:50 <acoles> turned out they had non-durable data frags on them, so the dir would not be deleted
21:17:06 <acoles> but reconstructor/ssync does not sync non-durable frags
21:17:24 <acoles> :(
21:17:35 <acoles> so https://review.opendev.org/c/openstack/swift/+/770047 should fix that
21:18:10 <acoles> by (a) sync'ing non-durables (they could still be useful data) and (b) then removing non-durables on the handoff
21:18:52 <clayg> https://bugs.launchpad.net/swift/+bug/1778002 has been around for awhile - anyone doing EC rebalances has probably noticed it
21:18:53 <openstack> Launchpad bug 1778002 in OpenStack Object Storage (swift) "EC non-durable fragment won't be deleted by reconstructor. " [High,Confirmed]
21:20:47 <zaitcev> Hrm. I never noticed because I have excess space.
21:21:51 <timburke_> i think we mainly noticed because we monitor handoffs as part of our rebalances
21:22:05 <acoles> the commit message on the patch details the various changes needed to get the non-durables yielded to ssync and then have ssync sync them
21:22:11 <timburke_> acoles, are there any questions that might need answering, or is this something that everyone should just anticipate getting better Real Soon Now?
21:23:27 <acoles> review always welcome, but there's no specific issue I have in mind for feedback
21:23:49 <timburke_> excellent
21:23:57 <acoles> I'm about to push a new patchset - and I have one more test to write
21:25:18 <timburke_> #topic cleaning up shards when root DB is deleted and reclaimed
21:25:34 <timburke_> meanwhile, mattoliverau has picked up
21:25:37 <timburke_> #link https://bugs.launchpad.net/swift/+bug/1911232
21:25:38 <openstack> Launchpad bug 1911232 in OpenStack Object Storage (swift) "empty shards fail audit with reclaimed root db " [Undecided,Confirmed] - Assigned to Matthew Oliver (matt-0)
21:26:02 <timburke_> how's that going?
21:26:32 <mattoliverau> Yeah things are moving along. I have https://review.opendev.org/c/openstack/swift/+/770529
21:26:53 <mattoliverau> it's not fixed yet, just worked on a probe test that shows the problem.
21:27:09 <acoles> a very good place to start :)
21:27:40 <mattoliverau> In an ideal world we'd have shrinking and autosharding so shards with nothing in them were supposed to collapse into the root before reclaim_age
21:28:03 <mattoliverau> but we don't have that, and there is still an edge case where they're not getting cleaned up.
21:28:59 <mattoliverau> I'll have another patchset up today that should have an initial version of a fix. Currently still on my laptop as it needs some debugging and tests
21:29:34 <mattoliverau> keep an eye out for that and then please review and we can make sure we don't leave any pesky shards around :)
21:29:46 <timburke_> sounds good
21:30:23 <timburke_> #topic s3api and allowable clock skew
21:31:10 <timburke_> i've had some clients getting back RequestTimeTooSkewed errors for a while -- not real common, but it's a fairly persistent problem
21:31:58 <timburke_> i'm fairly certain it's that they retry a failed request verbatim, rather than re-signing with the new request time
21:32:51 <timburke_> eventually, given the right retry/backoff options, the retry goes longer than 5mins and they get back a 403
21:33:35 <zaitcev> so, there's nothing we can do, right?
21:33:54 <timburke_> i *think* AWS has an allowable skew of more like 15mins (though can't remember whether i read it somewhere or determined it experimentally)
21:34:11 <zaitcev> That's what I remember, too.
21:34:24 <timburke_> so i proposed a patch to make it configurable, with a default of (what i recall as being) AWS's limit
21:34:41 <timburke_> #link https://review.opendev.org/c/openstack/swift/+/770005
21:34:48 <zaitcev> It was mentioned in the old Developer's Guide. But that document is gone, replaced with API Reference.
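The retry behaviour described above can be sketched as a simple timestamp comparison. This is a minimal illustration, not the s3api implementation: the function name `is_too_skewed` and the parameter `allowable_skew` are hypothetical, and the 5 and 15 minute values are the ones mentioned in the discussion.

```python
from datetime import datetime, timedelta, timezone


def is_too_skewed(signed_at, now, allowable_skew=timedelta(minutes=15)):
    """Return True if the signed request time is outside the allowed window."""
    return abs(now - signed_at) > allowable_skew


# A request signed once, then retried verbatim (same signed time) 6 minutes later.
signed_at = datetime(2021, 1, 13, 21, 0, 0, tzinfo=timezone.utc)
retry_at = signed_at + timedelta(minutes=6)

print(is_too_skewed(signed_at, retry_at, timedelta(minutes=5)))   # True
print(is_too_skewed(signed_at, retry_at, timedelta(minutes=15)))  # False
```

A client that re-signs on each retry would reset `signed_at` and stay inside either window; a wider window only helps clients that retry verbatim.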
21:35:30 <timburke_> i wanted to check if anyone had concerns about increasing this default value (it would of course be called out in release notes later)
21:35:35 <kota_> should we extend the default value too?
21:36:24 * kota_ said same thing :P
21:36:29 <timburke_> kota_, yeah, the patch as written increases the timeout from 5mins to 15mins (if you don't explicitly set a value)
21:37:55 <timburke_> ok, seems like we're generally ok with it :-)
21:38:06 <timburke_> #topic relinker
21:39:17 <timburke_> i found a couple issues recently that might be good to know about if anyone's planning a part-power increase (or two) soon
21:39:24 <timburke_> #link https://bugs.launchpad.net/swift/+bug/1910589
21:39:25 <openstack> Launchpad bug 1910589 in OpenStack Object Storage (swift) "Multiple part power increases leads to misplaced data" [Undecided,New]
21:39:59 <timburke_> ^^^ characterizes something i think i mentioned last week, but hadn't gotten a clean repro for
21:40:24 <timburke_> rledisez, do you think you might have time to review https://review.opendev.org/c/openstack/swift/+/769855 (which should address it)?
21:40:45 <zaitcev> Christian is no longer around essentially, we have to do without.
21:41:05 <clayg> 😢 I hope he's doing well tho! 😁
21:41:47 <rledisez> timburke_: absolutely, I'll do that this week
21:42:36 <timburke_> thanks! only thing worth calling out (i think) is that the state file format changed in such a way that any old state files will just be discarded
21:43:08 <rledisez> not a big deal.
don't upgrade if you're relinking, and worst case scenario, it restarts from zero
21:43:15 <timburke_> but that should only really be a concern if someone is doing a swift upgrade mid-part-power-increase, which doesn't seem like a great plan anyway
21:43:41 <clayg> hahaha
21:43:59 <timburke_> the other one i noticed is a little thornier
21:44:01 <timburke_> #link https://bugs.launchpad.net/swift/+bug/1910470
21:44:02 <openstack> Launchpad bug 1910470 in OpenStack Object Storage (swift) "swift-object-relinker does not handle unmounted disks well" [Undecided,New]
21:44:49 <timburke_> essentially, on master, if the relinker hits an unmounted disk, you get no feedback about it at all
21:45:23 <timburke_> i've got a patch that at least has us log the fact that the disk is getting skipped -- https://review.opendev.org/c/openstack/swift/+/769632
21:45:45 <timburke_> but it doesn't exit with a non-zero status code or anything
21:46:12 <seongsoocho> So now, is it safe to increase the partition power only once until the patch is applied?
21:47:00 <rledisez> seongsoocho: from production experience, it is. we did it on multiple clusters with the current status of the relinker
21:47:05 <timburke_> seongsoocho, yes, increasing it once will definitely be fine. once it's been increased, you could go manually clear the state files -- then it would be safe to do it again
21:48:04 <rledisez> but you should care about the last bug mentioned by timburke_, ensure your rights are ok (root:root) on unmounted disk to avoid bad surprises
21:48:10 <timburke_> they'd be named something like /srv/node/*/.relink.*.json
21:48:47 <seongsoocho> aha. ok thanks :)
21:49:17 <rledisez> at some point, it would be useful to have a recon option that return the values of the relink.json and tells when one is missing (eg: because unmounted)
21:49:52 <timburke_> good thought!
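Until something like that recon option exists, a rough manual check could look like the sketch below. It is not part of Swift: the function name `relink_state_report` is hypothetical, and the state-file pattern is taken from the "something like /srv/node/*/.relink.*.json" description above, so the real file names may differ.

```python
import glob
import os


def relink_state_report(devices_root="/srv/node"):
    """Map each device directory to its mount status and relink state files."""
    report = {}
    if not os.path.isdir(devices_root):
        return report
    for name in sorted(os.listdir(devices_root)):
        dev_path = os.path.join(devices_root, name)
        if not os.path.isdir(dev_path):
            continue
        # Pattern assumed from the discussion; the leading dot is explicit,
        # so glob will match these hidden files.
        pattern = os.path.join(glob.escape(dev_path), ".relink.*.json")
        report[name] = {
            "mounted": os.path.ismount(dev_path),
            "state_files": sorted(glob.glob(pattern)),
        }
    return report


if __name__ == "__main__":
    for dev, info in relink_state_report().items():
        flag = "" if info["state_files"] else "  <-- no state file"
        print(dev, "mounted" if info["mounted"] else "NOT mounted", flag)
```

A device that is not mounted, or that has no state file after a relink run, is a candidate for having been skipped.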
21:50:49 <timburke_> all right, i mostly wanted to raise awareness on those -- i'll let you know if i get a good idea on a better resolution for that second one
21:50:55 <timburke_> #topic open discussion
21:51:03 <timburke_> what else should we talk about this week?
21:52:46 <acoles> OMM I'm seeing this test fail in virtualenvs (e.g. tox -e py36) but not outside virtualenv: 'nosetests ./test/unit/common/test_manager.py:TestManagerModule.test_verify_server' - anyone else noticed that? I'm baffled
21:53:35 <acoles> AFAICT the test is asserting that swift-Object-server is not on my PATH
21:53:48 <acoles> note the capital 'O'
21:53:49 <clayg> does it always fail?
21:54:09 <acoles> inside virtualenv yes - I mean, I just noticed in last 20mins
21:54:09 <clayg> are you on a case insensitive file system 🤣
21:54:33 <acoles> vsaio and macos both the same
21:55:02 <acoles> apart from it failing, I don't like that a unit test is making assertions about what I might have on my PATH
21:55:34 <clayg> oh, i just tried venv in my vsaio and it worked 🤷‍♂️
21:55:39 <acoles> if no-one else has noticed I'll dig some more
21:55:42 <clayg> py3.8 tho
21:56:42 <zaitcev> acoles: so, if you comment verify_server, does it fail?
21:56:51 <zaitcev> or, well
21:56:59 <zaitcev> it's a test, so it's a little artificial
21:57:52 <acoles> py2.7 fails
21:58:08 <zaitcev> se
21:58:13 <zaitcev> a second
21:58:34 <timburke_> maybe related to https://review.opendev.org/c/openstack/swift/+/769848 ?
21:58:42 <zaitcev> yes, that one
21:59:04 <timburke_> i should go review that... or maybe acoles should ;-)
21:59:22 <zaitcev> Yeah
21:59:50 <timburke_> all right
21:59:59 <acoles> don't think so, in my virtualenvs, '$ which swift-Object-server' actually finds a match
22:00:10 <acoles> :/
22:00:16 <timburke_> O.o
22:00:17 <zaitcev> Maybe just back out the whole thing. It's an option. But I hoped that just backing out the effects in the decorator, and _only_ screw with the exit code, would let us preserve it.
22:00:40 <zaitcev> Oh
22:00:45 <timburke_> zaitcev, seems likely to be reasonable
22:01:08 <timburke_> acoles, well where did it come from!?
22:01:18 <timburke_> anyway, we're at time
22:01:19 <acoles> I think there's a case-insensitivity thing going on in my virtualenvs ?!? v weird
22:01:27 <timburke_> thank you all for coming, and thank you for working on swift!
22:01:44 <timburke_> there's a lot going on, and i'm excited to see it all happening
22:01:50 <timburke_> #endmeeting