21:03:42 <mattoliverau> #startmeeting swift
21:03:42 <openstack> Meeting started Wed Dec  9 21:03:42 2020 UTC and is due to finish in 60 minutes.  The chair is mattoliverau. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:03:43 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:03:45 <openstack> The meeting name has been set to 'swift'
21:03:46 <openstack> acoles: Error: Can't start another meeting, one is in progress.  Use #endmeeting first.
21:03:50 <mattoliverau> I beat ya
21:03:54 <rledisez> :D
21:03:59 <acoles> oh thanks mattoliverau for starting the meeting
21:04:14 <mattoliverau> I'll start chairing, I guess, until Tim comes online
21:04:14 <acoles> does that mean you will chair as well hahah!
21:04:32 <acoles> so, apologies from timburke, he has unexpected childcare duties
21:04:44 <mattoliverau> #topic Audit watchers
21:04:56 <acoles> thanks mattoliverau
21:05:05 <mattoliverau> acoles: cool thanks for letting us know.
21:05:10 <acoles> BTW the agenda is here https://wiki.openstack.org/wiki/Meetings/Swift
21:05:15 <mattoliverau> oh thanks
21:05:18 <zaitcev> But it's old
21:05:27 <mattoliverau> #link  https://wiki.openstack.org/wiki/Meetings/Swift
21:05:53 <mattoliverau> Any updates on audit watchers?
21:06:06 <mattoliverau> I know I reviewed it again last night
21:06:13 <mattoliverau> and it's looking really great
21:06:16 <acoles> I know the final agenda topic is intended for today at least
21:06:56 <mattoliverau> I think we need to get some documentation in place, but that is a follow-up patch, I feel.
21:07:50 <zaitcev> I am going to write it.
21:08:00 <zaitcev> By "it" I mean the doc for watchers.
21:08:26 <mattoliverau> Cool, thanks zaitcev, I'll review it and land it when you're done.
21:09:04 <mattoliverau> The PR in question is: https://review.opendev.org/c/openstack/swift/+/706653
21:09:24 <mattoliverau> I was tempted to put a +A on it, but knew Tim said he planned to review it.
21:10:09 <mattoliverau> If no more questions on audit watchers shall we move on?
21:10:15 <zaitcev> Move on.
21:10:20 <acoles> great work guys, thanks
21:10:33 <zaitcev> It's dsariel's debut, BTW
21:10:43 <mattoliverau> \o/
21:11:09 <zaitcev> So I wasn't touching it on purpose, to let him get the lumps :-)
21:11:25 <dsariel> with zaitcev's great help
21:11:34 <mattoliverau> #topic s3api, +segments container, and ACLs
21:11:50 <mattoliverau> I know this is an old agenda item, so do we have any update on it?
21:12:25 <mattoliverau> #link https://review.opendev.org/763106
21:12:27 <zaitcev> I looked at it and it seemed fine
21:12:47 <zaitcev> But I didn't +2
21:12:48 <zaitcev> Oh
21:13:00 <zaitcev> I know. Clay sniped me on it.
21:13:09 <mattoliverau> looks like it's +Aed
21:13:18 <mattoliverau> tho not merged
21:13:30 <mattoliverau> So I guess it just needs handholding through the gate.
21:13:30 <zaitcev> I think we can move on from that particular thing onto s3api in general if anyone knows what's up with it.
21:14:06 <acoles> move on - IIRC last week it was just to nudge it for a +A
21:14:13 <acoles> since then it has been in recheck-land
21:14:18 <mattoliverau> kk
21:14:34 <zaitcev> Yea. I knew it had no chance, so didn't recheck until Tim's "retry" patch.
21:14:42 <mattoliverau> we can come back to s3api if anyone has anything at the end
21:15:10 <mattoliverau> #topic what still has to be done in order to enable automatic sharding
21:15:29 <acoles> that's a great question
21:15:35 <mattoliverau> it is :)
21:15:39 <zaitcev> before that, mattoliverau, are you working on that off the tip we have right now?
21:15:55 <mattoliverau> thanks dsariel for adding it :)
21:16:05 <dsariel> I will be happy to help with this. Anything I can do?
21:16:24 <zaitcev> During PTG someone (Tim or Clay) mentioned that nVidia has some patches in production that are necessary for the current sharding.
21:16:45 <zaitcev> So I was wondering where the development is occurring.
21:16:56 <acoles> there's a few patches on gerrit around sharding, some of which we have shipped
21:17:04 <zaitcev> Oh
21:17:14 <zaitcev> I thought they weren't in gerrit.
21:17:20 <mattoliverau> So I obviously can't speak for nvidia.
21:17:29 <zaitcev> OK, I can find them. Well, David can find them heh.
21:17:29 <mattoliverau> but I believe they're not using auto-sharding.
21:17:51 <zaitcev> yes yes, just the existing sharding
21:17:58 <mattoliverau> they have some smarts in their controller that identify things that need sharding, and then the sharding management tool is used.
21:18:23 <acoles> but first, at a high level, my personal view is that we need to (a) put in place all we think we need to recover from split-brain autosharding and (b) convince ourselves that we have done the best we can to avoid split-brain auto-sharding
21:18:39 <mattoliverau> Auto-sharding, and where I want to get to with it, is captured in some upstream WIP patches I have, I believe.
21:18:44 <mattoliverau> what acoles said ^
21:18:54 <acoles> we have a proprietary approach to avoiding split-brain sharding, and we do not enable autosharding
21:19:13 <acoles> we use swift-manage-shard-ranges
21:19:37 <zaitcev> Got it.
21:19:42 <mattoliverau> Turns out the main problem with the current auto-sharding approach is that there are ways this split brain can occur.
21:20:03 <acoles> oh, and one final piece, we need to have autoshrinking sorted too
21:20:43 <mattoliverau> +1
21:20:49 <mattoliverau> I have one POC/WIP patch that improves the leader election; after playing with it, I found it minimises these edge cases but doesn't remove them.
21:21:41 <mattoliverau> So, moving on to what acoles mentioned: a way to recover from split brains and gaps is what needs to come first.
21:22:48 <zaitcev> Guys
21:22:59 <zaitcev> Our quorum is 1/2 or greater, right?
21:23:06 <mattoliverau> then we might find we're happy to keep the same simple leader election approach we have now, or decide to improve leader election.
21:23:14 <mattoliverau> we have 2 quorums
21:23:35 <zaitcev> Can it be used productively, so that the minority always agrees with majority (which presumably has a leader selected)?
21:23:45 <dsariel> can I ask a noob question: what is split-brain sharding?
21:24:08 <mattoliverau> dsariel: when more than one node thinks it is the leader and each makes a different set of shard ranges
21:24:21 <zaitcev> dsariel: it's a network partition, so now you have 20 nodes doing one thing and 15 nodes doing another thing.
21:24:48 <dsariel> got it. thanks
21:24:56 <mattoliverau> we have a ceil(replica/2) quorum and a majority quorum (replica/2 + 1)
21:25:24 <zaitcev> replica // 2 or
21:25:35 <mattoliverau> so yeah we'd use a majority quorum for making leader election decisions if we went and asked.
21:25:35 <zaitcev> py3 world is harsh
21:25:41 <mattoliverau> lol
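For reference, the two quorums described above work out like this (a minimal sketch; the function names are illustrative rather than taken from a specific Swift module):

    def quorum_size(n):
        # ceil(n / 2): e.g. 2 of 3 replicas, 2 of 4
        return (n + 1) // 2

    def majority_size(n):
        # n // 2 + 1: e.g. 2 of 3 replicas, 3 of 4
        return n // 2 + 1

    # The two agree for odd replica counts and differ for even ones:
    for n in (3, 4, 5):
        print(n, quorum_size(n), majority_size(n))  # 3 2 2, 4 2 3, 5 3 3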
21:25:48 <acoles> yes, so in the auto-sharding mode the node that thinks it is node index 0 in the ring picks shard ranges and replicates them to the other nodes. the problem is if another node also thinks it is 0, as it is likely to pick a different set of shards
21:26:54 <mattoliverau> Yes, the WIP PR I have adds some majority quorum on who actually is index 0 and what the version of the ring is, to get rid of old nodes who may not agree because they have an old ring
21:27:18 <zaitcev> Oh, I see.
21:27:50 <mattoliverau> but that's a lot of extra requests, and while it minimises the split-brain edge case window, it doesn't completely eradicate it.
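A rough sketch of that majority-quorum leader check (purely illustrative: the names and structure are assumptions, not code from the WIP patch):

    from dataclasses import dataclass

    @dataclass
    class PeerView:
        leader: str        # the node this peer believes holds ring index 0
        ring_version: int  # version/serial of the ring the peer is using

    def confirmed_leader(me, my_ring_version, peer_views, majority):
        """True if a majority agrees 'me' is index 0 on the current ring."""
        agreeing = sum(1 for v in peer_views
                       if v.leader == me and v.ring_version == my_ring_version)
        return agreeing + 1 >= majority  # +1 counts this node's own vote

    # e.g. one of two peers agrees (the other has an old ring), so with
    # our own vote that's 2 of 3:
    views = [PeerView('node-a', 7), PeerView('node-b', 6)]
    print(confirmed_leader('node-a', 7, views, majority=2))  # True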
21:28:05 <zaitcev> Even if the network is split, the administrator is not split, the human maintains the rings, and that is the source of truth even if not used directly by sharding.
21:29:16 <acoles> so mattoliverau 's work is towards my step (b) above - reduce the chance of a mistake in choosing the leader
21:29:26 <mattoliverau> yup
21:29:34 <acoles> I've been working on recovering from mistakes
21:29:34 <mattoliverau> but turns out step (a)
21:29:45 <mattoliverau> is what we need to solve.
21:30:02 <acoles> so https://review.opendev.org/c/openstack/swift/+/765624 is a WIP, and i think mattoliverau may also have some ideas
21:30:16 <acoles> ^^ building on some discussion at the PTG
21:31:03 <mattoliverau> well, apparently gerrit is not loading for me atm...
21:31:52 <zaitcev> same here
21:31:54 <acoles> I'm deliberately keeping it simple at first - the cases we have seen are 'simple' duplicate paths that you would expect if two nodes had acted as leaders but with different local sets of objects, so choosing slightly different shard ranges
21:31:57 <mattoliverau> Anyway step (a) is what we want to solve first, when we do, leader election edge cases become less of an issue.
21:32:27 <acoles> hmmm, I just pushed it to gerrit but now it's also not loading for me
21:32:59 <acoles> anyway, that patch adds a 'repair' command to swift-manage-shard-ranges that will find all paths, choose one and shrink all others into it
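To make "find all paths, choose one, shrink the others into it" concrete, here is a toy sketch (an illustration only, not code from the patch; real shard ranges also carry states, epochs and timestamps):

    from collections import namedtuple

    ShardRange = namedtuple('ShardRange', 'name lower upper object_count')
    MIN, MAX = '', '\xff'  # illustrative namespace bounds

    def find_paths(ranges, start=MIN):
        """All chains of contiguous ranges running from start to MAX."""
        if start == MAX:
            return [[]]
        paths = []
        for r in ranges:
            if r.lower == start and r.upper != r.lower:  # skip zero-width loops
                paths.extend([r] + tail for tail in find_paths(ranges, r.upper))
        return paths

    def choose_path(ranges):
        """Pick the path holding the most objects; the rest would be shrunk."""
        return max(find_paths(ranges), key=lambda p: sum(r.object_count for r in p))

    # Two 'leaders' created overlapping paths; repair keeps the fuller one:
    ranges = [ShardRange('a', MIN, 'm', 100), ShardRange('b', 'm', MAX, 120),
              ShardRange('c', MIN, 'q', 10), ShardRange('d', 'q', MAX, 15)]
    print([r.name for r in choose_path(ranges)])  # ['a', 'b']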
21:34:13 <acoles> mattoliverau: did you have some graph visualisation stuff? IIRC you did some work before I returned to swift-land? it would be cool to see that too
21:34:22 <mattoliverau> And I've been playing with a RangeScanner that can rebuild and/or choose best paths. Its latest addition is testing out a new gap-filler approach that uses the weighting algorithm to choose the best path (acoles' spider suggestion).
21:35:29 <acoles> cool. so mattoliverau checkout https://review.opendev.org/c/openstack/swift/+/765624 - we may have some overlaps :)
21:35:37 <mattoliverau> acoles: yeah, the patch adds some graphviz output to the manage-shard-ranges show command that turns shard ranges into a graph.
21:35:49 <mattoliverau> nice :)
21:35:52 <acoles> but I have dodged gap-filling for now
21:36:47 <mattoliverau> I definitely will!
21:36:58 <acoles> so one answer to dsariel's question - it would be great to have review of the patches we have in progress :) and review might include getting a container sharded with split-brain and checking out the new repair command etc
21:37:09 <mattoliverau> and so should dsariel :)
21:37:27 <mattoliverau> +1
21:37:35 <acoles> dsariel: the probe test in the patch may be a good starting point to understand the problem space
21:37:48 <mattoliverau> the code isn't finished but reviewing and testing would be a huge help
21:38:01 <acoles> (the probe test uses ridiculously small numbers of objects vs real life)
21:39:14 <acoles> I also have https://review.opendev.org/c/openstack/swift/+/765623/4 which adds a 'compact' command to swift-manage-shard-ranges; it's a precursor to the other because it uses similar functionality (i.e. shrinking unwanted shards)
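Usage of the two new subcommands would look something like this (the subcommand names come from the patch descriptions above; the path placeholder and exact invocation are assumptions while the patches are still WIP):

    swift-manage-shard-ranges <container-db-path> compact   # shrink small, unwanted shards
    swift-manage-shard-ranges <container-db-path> repair    # collapse overlapping paths into one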
21:39:40 <acoles> sorry, I feel like this is shameless promotion of my patches, don't mean it to be
21:40:08 <acoles> it's quite likely that reviewing those and mattoliverau's patches will generate further work to help move things along
21:40:16 <mattoliverau> lol, no, thanks for all the work
21:41:33 <acoles> dsariel: is that helpful?
21:42:10 <mattoliverau> seeing as I can't access gerrit atm, I think this is my rangescanner (plus graphviz) POC/WIP: https://review.opendev.org/#/c/749614/
21:42:19 <dsariel> thanks, the probe tests were where I started looking. I'll take a look at the patches. I guess I'll have many questions; apologies in advance.
21:42:36 <acoles> please ask questions
21:42:56 <mattoliverau> you probably will, and that's fine, great even. You'll see things afresh, which I think will be great!
21:43:12 <dsariel> Adding more objects to probe tests will increase the time they take. Is it possible to run them in a separate job?
21:43:16 <acoles> BTW those patches I linked are on a chain that starts with a fix to shard audit that we found we needed in order to shrink some overlaps
21:43:29 <acoles> so dig down through the patch dependency
21:44:45 <acoles> you can run an individual probetest with something like 'nosetests ./test/probe/test_sharder.py:TestManagedContainerSharding.test_manage_shard_ranges_repair_shard'
21:45:11 <mattoliverau> Anything else on this topic? seems dsariel has a bunch of code to read and test :)
21:45:24 <acoles> the small object count isn't necessarily a problem, I was just explaining that the tests aren't run at real world scale :)
21:45:40 <dsariel> :-)
21:45:59 <acoles> one other thing
21:46:28 <acoles> I rediscovered this tool 'python swift/cli/shard-info.py'
21:47:28 <acoles> it dumps all the root and shard container state after a probe test. its use is really limited to probe test analysis, and it could definitely be improved, but it is a lot better than nothing
21:48:11 <acoles> dsariel: reach out to us in #openstack-swift with any questions
21:48:21 <mattoliverau> +100
21:48:21 <dsariel> awesome, thanks! will try it
21:48:41 <mattoliverau> let's move on to open floor then.
21:48:47 <mattoliverau> #topic open floor
21:49:11 <mattoliverau> is there anything else anyone wants to bring up and discuss?
21:49:26 <dsariel> thanks a lot for the directions
21:49:36 <zaitcev> Yes
21:49:44 <zaitcev> not on the topic of sharding though
21:49:46 * timburke sneaks in finally
21:49:56 <acoles> phew timburke will rescue us
21:49:58 <mattoliverau> lol, hey timburke :)
21:50:01 <zaitcev> timburke: 11 minutes left, come on
21:50:05 <zaitcev> kid okay?
21:50:17 <timburke> yup, just got overdue for his nap
21:51:47 <zaitcev> so, I was looking at the failure of Romain's patch on py27 and so far I've been unsuccessful.
21:52:37 <timburke> oh yeah -- the queuing patch, i think, is that right?
21:52:43 <zaitcev> I pulled all the remotely relevant patches from eventlet 0.29 into the 0.25 that's locked in tox, but no dice.
21:53:10 <zaitcev> I think I'll need to find just where the exceptions get stuck.
21:53:36 <zaitcev> Most of the time it's ChunkWriteTimeout, although not always.
21:54:03 <zaitcev> I'm going to dump every ChunkWriteTimeout as it's instantiated and trawl through them.
21:54:25 <zaitcev> I'm not asking for help so far, but it looks grim
21:54:55 <timburke> fwiw i suspect the ChunkWriteTimeout may be from an old watchdog for an already-passed test
21:55:10 <zaitcev> Yeah, something like that.
21:55:36 <timburke> it reminds me in some ways of the trouble i've seen in prod where a ChunkWriteTimeout pops and logs a path *but it has the wrong txn id*
21:56:02 <timburke> i've grown worried about eventlet's (green)threadlocal behavior...
21:56:15 <zaitcev> But it works fine on py3, right?
21:57:00 <timburke> ...i guess? seems to be better, anyway
21:57:27 <mattoliverau> sounds... tedious. thanks zaitcev for going down this particular rabbit hole.
21:58:26 <mattoliverau> we have 3 minutes before we reach time. Anything else or shall we move any discussions into #openstack-swift ?
21:58:53 <zaitcev> I'm all set.
21:59:20 <timburke> there are some py3 patches and some s3api patches i'd appreciate eyes on, but i can drop those in -swift
21:59:32 <mattoliverau> kk thanks timburke
21:59:51 <mattoliverau> timburke: maybe you could update priority reviews if you get the chance :)
21:59:55 <mattoliverau> I'll call it
21:59:56 <timburke> thank *you* mattoliverau! sorry i hadn't gotten to updating the agenda
22:00:00 <timburke> that's a great idea!
22:00:03 <acoles> thanks mattoliverau for jumping in to chair, great job!
22:00:10 <mattoliverau> Thanks for all your hard work and thanks for working on swift!
22:00:15 <mattoliverau> #endmeeting