21:00:11 <timburke_> #startmeeting swift
21:00:12 <openstack> Meeting started Wed Jan 13 21:00:11 2021 UTC and is due to finish in 60 minutes.  The chair is timburke_. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:15 <openstack> The meeting name has been set to 'swift'
21:00:19 <timburke_> who's here for the swift meeting?
21:00:27 <mattoliverau> o/
21:00:29 <seongsoocho> o/
21:00:44 <rledisez> o/
21:01:03 <acoles> o/
21:01:12 <dsariel> o/
21:01:25 <kota_> o/
21:01:54 <timburke_> as usual, the agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:14 <timburke_> first up
21:02:18 <timburke_> #topic reconciler/ec/encryption
21:02:28 <timburke_> #link https://bugs.launchpad.net/swift/+bug/1910804
21:02:30 <openstack> Launchpad bug 1910804 in OpenStack Object Storage (swift) "Encryption doesn't play well with processes that copy cleartext data while preserving timestamps" [Undecided,New]
21:03:07 <timburke_> so i had a customer report an issue with an object that would consistently 503
21:03:49 <clayg> ohai
21:04:06 <timburke_> digging in more, we found that they had 11 frags of it for an 8+4 policy... but those had 3 separate sets of crypto meta between them
21:04:28 <timburke_> ...and no set of crypto meta had more than 7 frags
21:04:34 <clayg> lawl
21:05:16 <acoles> (I had to think about this at first) meaning frags have been encrypted with three different body keys... for the same object!!!
21:06:47 <timburke_> root cause was traced back to a couple issues: (1) we deploy with encryption in the reconciler pipeline and (2) we have every (container?) node running a reconciler
21:07:55 <timburke_> (well, that and the fact that it was moved to an EC policy. if it were going to a replicated policy, any replica regardless of crypto meta would be capable of generating a client response)
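To make the failure mode concrete: with an 8+4 policy the proxy needs 8 fragments that were all encrypted with the same body key (i.e. matching crypto meta) to decode a response, and the largest matching set here was only 7. A rough illustration of that arithmetic (not the proxy's actual code; the values are made up):

    from collections import Counter

    ndata = 8  # data fragments needed to decode an 8+4 EC policy

    # crypto-meta identity observed on each of the 11 frags (illustrative)
    frag_crypto_meta = ['key-A'] * 7 + ['key-B'] * 3 + ['key-C']

    best = max(Counter(frag_crypto_meta).values())
    if best < ndata:
        # no decodable set of frags exists, so the GET can only 503
        print('cannot reconstruct: best set has %d of %d needed frags'
              % (best, ndata))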
21:08:48 <timburke_> i've got a fix up to pull encryption out of the reconciler pipeline if it was misconfigured -- https://review.opendev.org/c/openstack/swift/+/770522
21:09:23 <timburke_> but i wanted to raise awareness of the issue so no one else finds themselves in this situation
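For anyone wanting to check their own deployment: the problem configuration is having keymaster/encryption in the container-reconciler's pipeline. A pipeline along the lines of the sample config avoids it (a sketch only; adapt to your own conf):

    [pipeline:main]
    # no keymaster or encryption here -- the reconciler should move
    # already-encrypted data as-is rather than re-encrypting it
    pipeline = catch_errors proxy-logging cache proxy-server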
21:10:56 <timburke_> also worth noting: i think you could run into a similar issue *without encryption* if your EC backend is non-deterministic
21:12:50 <timburke_> the open source backends are deterministic as i recall (that is, the frag outputs only depend on the EC params from swift.conf and the input data), but i don't know the details of shss, for example
21:13:39 <timburke_> does anyone have any questions about the bug or its impact?
21:14:54 <timburke_> all right
21:14:57 <mattoliverau> Nice investigation!
21:15:00 <timburke_> #topic SSYNC and non-durable frags
21:15:14 <timburke_> #link https://bugs.launchpad.net/swift/+bug/1778002
21:15:16 <openstack> Launchpad bug 1778002 in OpenStack Object Storage (swift) "EC non-durable fragment won't be deleted by reconstructor. " [High,Confirmed]
21:15:49 <timburke_> i know acoles (and clayg?) has been working on this problem a bit lately, though i'm not sure where things stand
21:15:49 <kota_> shss might be impacted. i'll check it.
21:16:03 <acoles> I just got my probe test working!
21:16:11 <timburke_> \o/
21:16:24 <acoles> background: we noticed some partitions were never cleaned up on handoffs
21:16:50 <acoles> turned out they had non-durable data frags on them, so the dir would not be deleted
21:17:06 <acoles> but reconstructor/ssync does not sync non-durable frags
21:17:24 <acoles> :(
21:17:35 <acoles> so https://review.opendev.org/c/openstack/swift/+/770047 should fix that
21:18:10 <acoles> by (a) sync'ing non-durables (they could still be useful data) and (b) then removing non-durables on the handoff
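For background, "durable" vs "non-durable" is visible in the EC .data filenames on disk: a fragment only gets the trailing #d marker once the write has been made durable, which is why a handoff can end up stuck holding the second kind. A minimal illustration (filenames invented):

    DURABLE = '1609855992.12345#3#d.data'      # timestamp#frag_index#d
    NON_DURABLE = '1609855992.12345#3.data'    # same, but never made durable

    def is_durable(filename):
        # crude check, just for illustration
        return filename.endswith('#d.data')

    print(is_durable(DURABLE), is_durable(NON_DURABLE))  # True False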
21:18:52 <clayg> https://bugs.launchpad.net/swift/+bug/1778002 has been around for awhile - anyone doing EC rebalances has probably noticed it
21:18:53 <openstack> Launchpad bug 1778002 in OpenStack Object Storage (swift) "EC non-durable fragment won't be deleted by reconstructor. " [High,Confirmed]
21:20:47 <zaitcev> Hrm. I never noticed because I have excess space.
21:21:51 <timburke_> i think we mainly noticed because we monitor handoffs as part of our rebalances
21:22:05 <acoles> the commit message on the patch details the various changes needed to get the non-durables yielded to ssync and then have ssync sync them
21:22:11 <timburke_> acoles, are there any questions that might need answering, or is this something that everyone should just anticipate getting better Real Soon Now?
21:23:27 <acoles> review always welcome, but there's no specific issue I  have in mind for feedback
21:23:49 <timburke_> excellent
21:23:57 <acoles> I'm about to push a new patchset - and I have one more test to write
21:25:18 <timburke_> #topic cleaning up shards when root DB is deleted and reclaimed
21:25:34 <timburke_> meanwhile, mattoliverau has picked up
21:25:37 <timburke_> #link https://bugs.launchpad.net/swift/+bug/1911232
21:25:38 <openstack> Launchpad bug 1911232 in OpenStack Object Storage (swift) "empty shards fail audit with reclaimed root db " [Undecided,Confirmed] - Assigned to Matthew Oliver (matt-0)
21:26:02 <timburke_> how's that going?
21:26:32 <mattoliverau> Yeah things are moving along. I have https://review.opendev.org/c/openstack/swift/+/770529
21:26:53 <mattoliverau> it's not fixed yet, I've just worked on a probe test that shows the problem.
21:27:09 <acoles> a very good place to start :)
21:27:40 <mattoliverau> In an ideal world we'd have shrinking and autosharding, so shards with nothing in them would collapse back into the root before reclaim_age
21:28:03 <mattoliverau> but we don't have that, and there is still an edge case where they're not getting cleaned up.
21:28:59 <mattoliverau> I'll have another patchset up today that should have an initial version of a fix. Currently still on my laptop as it needs some debugging and tests
21:29:34 <mattoliverau> keep an eye out for that and then please review and we can make sure we don't leave any pesky shards around :)
21:29:46 <timburke_> sounds good
21:30:23 <timburke_> #topic s3api and allowable clock skew
21:31:10 <timburke_> i've had some clients getting back RequestTimeTooSkewed errors for a while -- not real common, but it's a fairly persistent problem
21:31:58 <timburke_> i'm fairly certain it's that they retry a failed request verbatim, rather than re-signing with the new request time
21:32:51 <timburke_> eventually, given the right retry/backoff options, the retry goes longer than 5mins and they get back a 403
21:33:35 <zaitcev> so, there's nothing we can do, right?
21:33:54 <timburke_> i *think* AWS has an allowable skew of more like 15mins (though can't remember whether i read it somewhere or determined it experimentally)
21:34:11 <zaitcev> That's what I remember, too.
21:34:24 <timburke_> so i proposed a patch to make it configurable, with a default of (what i recall as being) AWS's limit
21:34:41 <timburke_> #link https://review.opendev.org/c/openstack/swift/+/770005
21:34:48 <zaitcev> It was mentioned in the old Developer's Guide. But that document is gone, replaced with API Reference.
21:35:30 <timburke_> i wanted to check if anyone had concerns about increasing this default value (it would of course be called out in release notes later)
21:35:35 <kota_> should we extend the default value too?
21:36:24 * kota_ said same thing :P
21:36:29 <timburke_> kota_, yeah, the patch as written increases the timeout from 5mins to 15mins (if you don't explicitly set a value)
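For reference, the patch as described would turn the skew tolerance into an s3api filter option; something along these lines (the option name and 900-second default are taken from the patch under review, so treat them as tentative):

    [filter:s3api]
    use = egg:swift#s3api
    # max difference allowed between the request's signed Date/X-Amz-Date
    # and the proxy's clock before returning RequestTimeTooSkewed (seconds)
    allowable_clock_skew = 900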
21:37:55 <timburke_> ok, seems like we're generally ok with it :-)
21:38:06 <timburke_> #topic relinker
21:39:17 <timburke_> i found a couple issues recently that might be good to know about if anyone's planning a part-power increase (or two) soon
21:39:24 <timburke_> #link https://bugs.launchpad.net/swift/+bug/1910589
21:39:25 <openstack> Launchpad bug 1910589 in OpenStack Object Storage (swift) "Multiple part power increases leads to misplaced data" [Undecided,New]
21:39:59 <timburke_> ^^^ characterizes something i think i mentioned last week, but hadn't gotten a clean repro for
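As a quick refresher on why back-to-back increases are delicate: the partition is the top part_power bits of the object's hash, so each +1 increase maps old partition P to 2P or 2P+1, and the relinker's saved state has to agree with which increase it belongs to. A rough sketch of that relationship (not the relinker's code; the hash prefix/suffix handling is simplified):

    import hashlib

    def partition(obj_path, part_power, suffix=b'changeme'):
        # Swift takes the top part_power bits of the md5 of the hashed path
        digest = hashlib.md5(obj_path.encode() + suffix).digest()
        return int.from_bytes(digest[:4], 'big') >> (32 - part_power)

    path = '/AUTH_test/container/object'
    old_part = partition(path, 10)
    new_part = partition(path, 11)
    assert new_part in (2 * old_part, 2 * old_part + 1)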
21:40:24 <timburke_> rledisez, do you think you might have time to review https://review.opendev.org/c/openstack/swift/+/769855 (which should address it)?
21:40:45 <zaitcev> Christian is essentially no longer around; we have to do without.
21:41:05 <clayg> 😢 I hope he's doing well tho!  😁
21:41:47 <rledisez> timburke_: absolutely, I'll do that this week
21:42:36 <timburke_> thanks! only thing worth calling out (i think) is that the state file format changed in such a way that any old state files will just be discarded
21:43:08 <rledisez> not a big deal. don't upgrade if you're relinking, and worst case scenario, it restarts from zero
21:43:15 <timburke_> but that should only really be a concern if someone is doing a swift upgrade mid-part-power-increase, which doesn't seem like a great plan anyway
21:43:41 <clayg> hahaha
21:43:59 <timburke_> the other one i noticed is a little thornier
21:44:01 <timburke_> #link https://bugs.launchpad.net/swift/+bug/1910470
21:44:02 <openstack> Launchpad bug 1910470 in OpenStack Object Storage (swift) "swift-object-relinker does not handle unmounted disks well" [Undecided,New]
21:44:49 <timburke_> essentially, on master, if the relinker hits an unmounted disk, you get no feedback about it at all
21:45:23 <timburke_> i've got a patch that at least has us log the fact that the disk is getting skipped -- https://review.opendev.org/c/openstack/swift/+/769632
21:45:45 <timburke_> but it doesn't exit with a non-zero status code or anything
21:46:12 <seongsoocho> So now, is it safe to increase the partition power only once until the patch is applied?
21:47:00 <rledisez> seongsoocho: from production experience, it is. we did it on multiple clusters with the current status of the relinker
21:47:05 <timburke_> seongsoocho, yes, increasing it once will definitely be fine. once it's been increased, you could go manually clear the state files -- then it would be safe to do it again
21:48:04 <rledisez> but you should watch out for the last bug mentioned by timburke_: make sure the permissions (root:root) on unmounted disks are ok to avoid bad surprises
21:48:10 <timburke_> they'd be named something like /srv/node/*/.relink.*.json
21:48:47 <seongsoocho> aha. ok thanks :)
21:49:17 <rledisez> at some point, it would be useful to have a recon option that returns the values of the relink.json files and tells when one is missing (e.g. because unmounted)
21:49:52 <timburke_> good thought!
21:50:49 <timburke_> all right, i mostly wanted to raise awareness on those -- i'll let you know if i get a good idea on a better resolution for that second one
21:50:55 <timburke_> #topic open discussion
21:51:03 <timburke_> what else should we talk about this week?
21:52:46 <acoles> On my machine I'm seeing this test fail in virtualenvs (e.g. tox -e py36) but not outside a virtualenv: 'nosetests ./test/unit/common/test_manager.py:TestManagerModule.test_verify_server' - anyone else noticed that? I'm baffled
21:53:35 <acoles> AFAICT the test is asserting that swift-Object-server is not on my PATH
21:53:48 <acoles> note the capital 'O'
21:53:49 <clayg> does it always fail?
21:54:09 <acoles> inside virtualenv yes - I mean, I just noticed in last 20mins
21:54:09 <clayg> are you on a case insensitive file system 🤣
21:54:33 <acoles> vsaio and macos both the same
21:55:02 <acoles> apart from it failing, I don't like that a unit test is making assertions about what I might have on my PATH
21:55:34 <clayg> oh, i just tried venv in my vsaio and it worked 🤷‍♂️
21:55:39 <acoles> if no-one else has noticed I'll dig some more
21:55:42 <clayg> py3.8 tho
21:56:42 <zaitcev> acoles: so, if you comment out verify_server, does it fail?
21:56:51 <zaitcev> or, well
21:56:59 <zaitcev> it's a test, so it's a little artificial
21:57:52 <acoles> py2.7 fails
21:58:08 <zaitcev> sec
21:58:13 <zaitcev> a second
21:58:34 <timburke_> maybe related to https://review.opendev.org/c/openstack/swift/+/769848 ?
21:58:42 <zaitcev> yes, that one
21:59:04 <timburke_> i should go review that... or maybe acoles should ;-)
21:59:22 <zaitcev> Yeah
21:59:50 <timburke_> all right
21:59:59 <acoles> don't think so, in my virtualenvs, '$ which swift-Object-server' actually finds a match
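A quick way to reproduce what the test (and `which`) are seeing, from inside and outside the virtualenv (Python 3); a case-insensitive filesystem under the venv's bin dir would explain the capital-O name resolving:

    import shutil

    # the unit test effectively assumes this returns None
    print(shutil.which('swift-Object-server'))
    # compare against the correctly-cased entry point
    print(shutil.which('swift-object-server'))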
22:00:10 <acoles> :/
22:00:16 <timburke_> O.o
22:00:17 <zaitcev> Maybe just back out the whole thing. It's an option. But I hoped that just backing out the effects in the decorator, and _only_ screwing with the exit code, would let us preserve it.
22:00:40 <zaitcev> Oh
22:00:45 <timburke_> zaitcev, seems likely to be reasonable
22:01:08 <timburke_> acoles, well where did it come from!?
22:01:18 <timburke_> anyway, we're at time
22:01:19 <acoles> I think there's a case-insensitivity thing going on in my virtualenvs ?!? v weird
22:01:27 <timburke_> thank you all for coming, and thank you for working on swift!
22:01:44 <timburke_> there's a lot going on, and i'm excited to see it all happening
22:01:50 <timburke_> #endmeeting