21:03:42 <mattoliverau> #startmeeting swift
21:03:42 <openstack> Meeting started Wed Dec 9 21:03:42 2020 UTC and is due to finish in 60 minutes. The chair is mattoliverau. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:03:43 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:03:45 <openstack> The meeting name has been set to 'swift'
21:03:46 <openstack> acoles: Error: Can't start another meeting, one is in progress. Use #endmeeting first.
21:03:50 <mattoliverau> I beat ya
21:03:54 <rledisez> :D
21:03:59 <acoles> oh thanks mattoliverau for starting the meeting
21:04:14 <mattoliverau> I'll start chairing I guess until tim comes online
21:04:14 <acoles> does that mean you will chair as well hahah!
21:04:32 <acoles> so, apologies form timburke, he has unexpected childcare duties
21:04:44 <mattoliverau> #topic Audit watchers
21:04:47 <acoles> from*
21:04:56 <acoles> thanks mattoliverau
21:05:05 <mattoliverau> acoles: cool thanks for letting us know.
21:05:10 <acoles> BTW the agenda is here https://wiki.openstack.org/wiki/Meetings/Swift
21:05:15 <mattoliverau> oh thanks
21:05:18 <zaitcev> But it's old
21:05:27 <mattoliverau> #link https://wiki.openstack.org/wiki/Meetings/Swift
21:05:53 <mattoliverau> Any updates from audit watchers?
21:06:06 <mattoliverau> I know I reviewed it again last night
21:06:13 <mattoliverau> and it's looking really great
21:06:16 <acoles> I know the final agenda topic is intended for today at least
21:06:56 <mattoliverau> I think we need to get some documentation in place, but that is a follow-up patch, I feel.
21:07:50 <zaitcev> I am going to write it.
21:08:00 <zaitcev> By "it" I mean the doc for watchers.
21:08:26 <mattoliverau> Cool, thanks zaitcev, I'll review it and land it when you're done.
21:09:04 <mattoliverau> The PR in question is: https://review.opendev.org/c/openstack/swift/+/706653
21:09:24 <mattoliverau> I was tempted to put a +A on it, but knew Tim said we planned to review it.
21:09:31 <mattoliverau> *he
21:10:09 <mattoliverau> If there are no more questions on audit watchers, shall we move on?
21:10:15 <zaitcev> Move on.
21:10:20 <acoles> great work guys, thanks
21:10:33 <zaitcev> It's dsariel's debut, BTW
21:10:43 <mattoliverau> \o/
21:11:09 <zaitcev> So I wasn't touching it on purpose, to let him get the lumps :-)
21:11:25 <dsariel> with zaitcev's great help
21:11:34 <mattoliverau> #topic s3api, +segments container, and ACLs
21:11:50 <mattoliverau> I know this is an old agenda, so do we have any update on this?
21:12:25 <mattoliverau> #link https://review.opendev.org/763106
21:12:27 <zaitcev> I looked at it and it seemed fine
21:12:47 <zaitcev> But I didn't +2
21:12:48 <zaitcev> Oh
21:13:00 <zaitcev> I know. Clay sniped me on it.
21:13:09 <mattoliverau> looks like it's +Aed
21:13:18 <mattoliverau> tho not merged
21:13:30 <mattoliverau> So I guess it just needs handholding through the gate.
21:13:30 <zaitcev> I think we can move on from that particular thing onto s3api in general if anyone knows what's up with it.
21:14:06 <acoles> move on - IIRC last week it was just to nudge it for a +A
21:14:13 <acoles> since then it has been in recheck-land
21:14:18 <mattoliverau> kk
21:14:34 <zaitcev> Yea. I knew it had no chance, so didn't recheck until Tim's "retry" patch.
21:14:42 <mattoliverau> we can come back to s3api if anyone has anything at the end
21:15:10 <mattoliverau> #topic what still has to be done in order to enable automatic sharding
21:15:29 <acoles> that's a great question
21:15:35 <mattoliverau> it is :)
21:15:39 <zaitcev> before that, mattoliverau, are you working on that off the tip we have right now?
21:15:55 <mattoliverau> thanks dsariel for adding it :)
21:16:05 <dsariel> I will be happy to help with this. Anything I can do?
21:16:24 <zaitcev> During PTG someone (Tim or Clay) mentioned that nVidia has some patches in production that are necessary for the current sharding.
21:16:45 <zaitcev> So I was wondering where the development is occurring.
21:16:56 <acoles> there's a few patches on gerrit around sharding, some of which we have shipped
21:17:04 <zaitcev> Oh
21:17:14 <zaitcev> I thought they weren't in gerrit.
21:17:20 <mattoliverau> So I obviously can't speak for nvidia.
21:17:29 <zaitcev> OK, I can find them. Well, David can find them heh.
21:17:29 <mattoliverau> but I believe they're not using auto sharding.
21:17:51 <zaitcev> yes yes, just the existing sharding
21:17:58 <mattoliverau> they have some smarts in their controller that identify things that need sharding, and then the sharding management tool is used.
21:18:23 <acoles> but first, at a high level, my personal view is that we need to (a) put in place all we think we need to recover from split-brain autosharding and (b) convince ourselves that we have done the best we can to avoid split-brain auto-sharding
21:18:39 <mattoliverau> For auto sharding, which is where I want to get to, I believe there are some upstream WIP patches I have.
21:18:44 <mattoliverau> what acoles said ^
21:18:54 <acoles> we have a proprietary approach to avoiding split-brain sharding, and we do not enable autosharding
21:19:13 <acoles> we use swift-manage-shard-ranges
21:19:37 <zaitcev> Got it.
21:19:42 <mattoliverau> Turns out the main problem with the current auto-sharding approach is that there are ways this split brain can occur.
21:20:03 <acoles> oh, and one final piece, we need to have autoshrinking sorted too
21:20:43 <mattoliverau> +1
21:20:49 <mattoliverau> I have one POC/WIP patch that improves the leader election, but after playing with it, I found it minimises these edge cases rather than removing them.
21:21:41 <mattoliverau> So I moved on to what acoles mentioned: having a way to recover from split brains and gaps needs to come first.
21:22:48 <zaitcev> Guys
21:22:59 <zaitcev> Our quorum is 1/2 or greater, right?
21:23:06 <mattoliverau> then we might find we're happy to have the simple "sam" leader election approach we have now... or decide to improve leader election.
21:23:14 <mattoliverau> we have 2 quorums
21:23:35 <zaitcev> Can it be used productively, so that the minority always agrees with the majority (which presumably has a leader selected)?
21:23:45 <dsariel> can I ask a noob question: what is split-brain sharding?
21:24:08 <mattoliverau> dsariel: when more than one node thinks it is the leader and each makes a different set of shard ranges
21:24:21 <zaitcev> dsariel: it's a network partition, so now you have 20 nodes doing one thing and 15 nodes doing another thing.
21:24:48 <dsariel> got it. thanks
21:24:56 <mattoliverau> we have a ceil[replica/2] and a majority quorum ( replica / 2 + 1 )
21:25:24 <zaitcev> replica // 2 or
21:25:35 <mattoliverau> so yeah we'd use a majority quorum for making leader election decisions if we went and asked.
21:25:35 <zaitcev> py3 world is harsh
21:25:41 <mattoliverau> lol
21:25:48 <acoles> yes, so in the auto-sharding mode the node that thinks it is node index 0 in the ring picks shard ranges and replicates them to other nodes. the problem is if another node also thinks it is 0: it is likely to pick a different set of shards
21:26:54 <mattoliverau> Yes, the WIP PR I have adds some majority quorum on who actually is index 0 and what the version of the ring is, to get rid of old nodes who may not agree because they have an old ring
21:27:18 <zaitcev> Oh, I see.
21:27:50 <mattoliverau> but that's a lot of extra requests, and while it does minimise the split-brain edge case window, it doesn't completely eradicate it.
21:28:05 <zaitcev> Even if the network is split, the administrator is not split: the human maintains the rings, and that is the source of truth even if not used directly by sharding.
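The two quorum sizes discussed above, and the idea of trusting a leader claim only when a strict majority of replicas agree on it, can be sketched in Python. This is a hypothetical illustration, not the code in the WIP patch: the helper names `half_quorum`, `majority_quorum` and `agreed_leader`, and the representation of each replica's answer as a `(ring_version, leader_index)` tuple, are assumptions made here. The sketch uses `//` because under Python 3 `replica / 2 + 1` is true division and yields a float.

```python
from collections import Counter
from math import ceil


def half_quorum(replicas: int) -> int:
    """ceil(replicas / 2), computed with integer arithmetic."""
    return (replicas + 1) // 2


def majority_quorum(replicas: int) -> int:
    """Strict majority: replicas // 2 + 1 (floor division, safe on py3)."""
    return replicas // 2 + 1


def agreed_leader(votes, replicas):
    """Return the (ring_version, leader_index) pair a strict majority reports.

    ``votes`` holds one (ring_version, leader_index) tuple per replica that
    answered (hypothetical format). A replica with an older ring simply votes
    for a different tuple. Returns None when no tuple reaches a strict
    majority, in which case a cautious node would not start sharding.
    """
    if not votes:
        return None
    choice, count = Counter(votes).most_common(1)[0]
    return choice if count >= majority_quorum(replicas) else None


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, half_quorum(n), majority_quorum(n), ceil(n / 2))
    # Two of three replicas agree on ring version 7 with node 0 as leader.
    print(agreed_leader([(7, 0), (7, 0), (6, 1)], replicas=3))
    # A 2/2 split of four replicas never yields a leader.
    print(agreed_leader([(7, 0), (7, 0), (6, 1), (6, 1)], replicas=4))
```

With an even replica count, both halves of a partition can reach `half_quorum`, which is exactly the split-brain situation described above; a strict majority can be reached by at most one side, at the cost of the extra requests mattoliverau mentions.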
21:29:16 <acoles> so mattoliverau's work is towards my step (b) above - reduce the chance of a mistake in choosing the leader
21:29:26 <mattoliverau> yup
21:29:34 <acoles> I've been working on recovering from mistakes
21:29:34 <mattoliverau> but turns out step (a)
21:29:45 <mattoliverau> is what we need to solve.
21:30:02 <acoles> so https://review.opendev.org/c/openstack/swift/+/765624 is a WIP, and i think mattoliverau may also have some ideas
21:30:16 <acoles> ^^ building on some discussion at the PTG
21:31:03 <mattoliverau> we apparently gerrit is not loading for me atm...
21:31:18 <mattoliverau> *well
21:31:52 <zaitcev> same here
21:31:54 <acoles> I'm deliberately keeping it simple at first - the cases we have seen are 'simple' duplicate paths that you would expect if two nodes had acted as leaders but with different local sets of objects, so choosing slightly different shard ranges
21:31:57 <mattoliverau> Anyway, step (a) is what we want to solve first; once we do, leader election edge cases become less of an issue.
21:32:27 <acoles> hmmm, I just pushed it to gerrit but now it's also not loading for me
21:32:59 <acoles> anyway, that patch adds a 'repair' command to swift-manage-shard-ranges that will find all paths, choose one and shrink all others into it
21:34:13 <acoles> mattoliverau: did you have some graph visualisation stuff? IIRC you did some work before I returned to swift-land? it would be cool to see that too
21:34:22 <mattoliverau> And I've been playing with a RangeScanner that can rebuild and/or choose best paths. Its latest addition is testing out a new gap-filler approach that uses the weighting algorithm to choose the best path (acoles' spider suggestion).
21:35:29 <acoles> cool. so mattoliverau check out https://review.opendev.org/c/openstack/swift/+/765624 - we may have some overlaps :)
21:35:37 <mattoliverau> acoles: yeah the patch adds some graphviz to the manage-shard-ranges show command that turns shard ranges into a graph.
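A toy model may make the "duplicate paths" and the repair idea above more concrete. The sketch assumes shard ranges are `(lower, upper]` string pairs, with `""` meaning the start of the namespace as a lower bound and the end as an upper bound, and that every range has `lower < upper` unless its upper is `""`. The helpers `pick_ranges`, `find_paths`, `overlaps` and `plan_repair` are hypothetical and are not the swift-manage-shard-ranges implementation; in particular, keeping the path with the fewest ranges is an arbitrary assumption standing in for whatever criterion the real repair command or the RangeScanner weighting uses.

```python
from typing import Dict, List, Tuple

# (lower, upper]: "" as lower = start of namespace, "" as upper = end.
Range = Tuple[str, str]


def pick_ranges(object_names: List[str], rows_per_shard: int) -> List[Range]:
    """Cut one node's local listing into contiguous ranges (toy leader)."""
    names = sorted(object_names)
    uppers = [names[i] for i in range(rows_per_shard - 1, len(names) - 1,
                                      rows_per_shard)]
    return list(zip([""] + uppers, uppers + [""]))


def find_paths(ranges: List[Range]) -> List[List[Range]]:
    """Return every chain of abutting ranges running from start to end."""
    by_lower: Dict[str, List[Range]] = {}
    for r in sorted(set(ranges)):
        by_lower.setdefault(r[0], []).append(r)
    paths: List[List[Range]] = []

    def walk(path: List[Range]) -> None:
        upper = path[-1][1]
        if upper == "":  # reached the end of the namespace
            paths.append(path)
            return
        for nxt in by_lower.get(upper, []):
            walk(path + [nxt])

    for first in by_lower.get("", []):
        walk([first])
    return paths


def _before(lower: str, upper: str) -> bool:
    """lower < upper, treating an upper bound of "" as +infinity."""
    return upper == "" or lower < upper


def overlaps(a: Range, b: Range) -> bool:
    return _before(a[0], b[1]) and _before(b[0], a[1])


def plan_repair(paths: List[List[Range]]
                ) -> Tuple[List[Range], Dict[Range, List[Range]]]:
    """Keep one path; map every other range to the kept ranges it overlaps.

    ASSUMPTION: keep the path with the fewest ranges, ties broken
    lexicographically. Each other range would be shrunk into its acceptors.
    """
    chosen = min(paths, key=lambda p: (len(p), p))
    keep = set(chosen)
    plan: Dict[Range, List[Range]] = {}
    for path in paths:
        for r in path:
            if r not in keep and r not in plan:
                plan[r] = [a for a in chosen if overlaps(r, a)]
    return chosen, plan


if __name__ == "__main__":
    node_a = list("abcdefghij")
    node_b = list("abdefghijk")  # missing "c", has an extra "k"
    paths = find_paths(pick_ranges(node_a, 4) + pick_ranges(node_b, 4))
    chosen, plan = plan_repair(paths)
    print("paths:", paths)
    print("keep:", chosen)
    for donor, acceptors in plan.items():
        print("shrink", donor, "into", acceptors)
```

Running it shows two complete paths through the namespace, one per "leader", and a plan that keeps the first path and maps each range on the other path to the overlapping ranges it would be shrunk into. Gap filling is deliberately left out, mirroring the scope acoles describes.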
21:35:49 <mattoliverau> nice :)
21:35:52 <acoles> but I have dodged gap-filling for now
21:36:47 <mattoliverau> I definitely will!
21:36:58 <acoles> so one answer to dsariel's question - it would be great to have reviews of the patches we have in progress :) and review might include getting a container sharded with split-brain and checking out the new repair command etc
21:37:09 <mattoliverau> and so should dsariel :)
21:37:27 <mattoliverau> +1
21:37:35 <acoles> dsariel: the probe test in the patch may be a good starting point to understand the problem space
21:37:48 <mattoliverau> the code isn't finished but reviewing and testing would be a huge help
21:38:01 <acoles> (the probe test uses ridiculously small numbers of objects vs real life)
21:39:14 <acoles> I also have https://review.opendev.org/c/openstack/swift/+/765623/4 which adds a 'compact' command to swift-manage-shard-ranges; it's a precursor to the other because it uses similar functionality (i.e. shrinking unwanted shards)
21:39:40 <acoles> sorry, I feel like this is shameless promotion of my patches, I don't mean it to be
21:40:08 <acoles> it's quite likely that reviewing those and mattoliverau's patches will generate further work to help move things along
21:40:16 <mattoliverau> lol, not that's for all the work
21:41:02 <mattoliverau> *thanks
21:41:15 <mattoliverau> apparently I can't type this morning
21:41:33 <acoles> dsariel: is that helpful?
21:42:10 <mattoliverau> seeing as I can't access gerrit atm, I think this is my rangescanner (plus graphviz) POC/WIP: https://review.opendev.org/#/c/749614/
21:42:19 <dsariel> thanks, probe tests were the place I started to look at. I'll take a look at the patches. I guess I will have many questions; apologies in advance for that.
21:42:36 <acoles> please ask questions
21:42:56 <mattoliverau> you probably will, and thanks fine, great even. you'll see things afresh, which I think will be great!
21:43:10 <mattoliverau> *that's fine
21:43:12 <dsariel> Adding more objects to probe tests will increase the time they take. Is it possible to run them in a separate job?
21:43:16 <acoles> BTW those patches I linked are on a chain that starts with a fix to shard audit that we found we needed in order to shrink some overlaps
21:43:27 <mattoliverau> man, I need to read before I press <enter> :P
21:43:29 <acoles> so dig down through the patch dependency
21:44:45 <acoles> you can run an individual probetest with something like 'nosetests ./test/probe/test_sharder.py:TestManagedContainerSharding.test_manage_shard_ranges_repair_shard'
21:45:11 <mattoliverau> Anything else on this topic? seems dsariel has a bunch of code to read and test :)
21:45:24 <acoles> the small object count isn't necessarily a problem, I was just explaining that the tests aren't run at real world scale :)
21:45:40 <dsariel> :-)
21:45:59 <acoles> one other thing
21:46:28 <acoles> I rediscovered this tool 'python swift/cli/shard-info.py'
21:47:28 <acoles> it dumps all the root and shard container state after a probe test. its use is really limited to probe test analysis, and it could definitely be improved, but it is a lot better than nothing
21:48:11 <acoles> dsariel: reach out to us in #openstack-swift with any questions
21:48:21 <mattoliverau> +100
21:48:21 <dsariel> awesome, thanks! will try it
21:48:41 <mattoliverau> let's move on to open floor then.
21:48:47 <mattoliverau> #topic open floor
21:49:11 <mattoliverau> is there anything else anyone wants to bring up and discuss?
21:49:26 <dsariel> thanks a lot for the directions
21:49:36 <zaitcev> Yes
21:49:44 <zaitcev> not on the topic of sharding though
21:49:46 * timburke sneaks in finally
21:49:56 <acoles> phew timburke will rescue us
21:49:58 <mattoliverau> lol, hey timburke :)
21:50:01 <zaitcev> timburke: 11 minutes left, come on
21:50:05 <zaitcev> kid okay?
21:50:17 <timburke> yup, just got overdue for his nap
21:51:47 <zaitcev> so, I was looking at the failure of Romain's patch on py27 and so far I've been unsuccessful.
21:52:37 <timburke> oh yeah -- the queuing patch, i think, is that right?
21:52:43 <zaitcev> I pulled all the remotely relevant patches from eventlet 2.29 into the 2.25 that's locked in tox, but no dice.
21:53:10 <zaitcev> I think I'll need to find just where the exceptions get stuck.
21:53:36 <zaitcev> Most of the time it's ChunkWriteTimeout, although not always.
21:54:03 <zaitcev> I'm going to dump every ChunkWriteTimeout as it's instantiated and trawl through them.
21:54:25 <zaitcev> I'm not asking for help so far, but it looks grim
21:54:55 <timburke> fwiw i suspect the ChunkWriteTimeout may be from an old watchdog for an already-passed test
21:55:10 <zaitcev> Yeah, something like that.
21:55:36 <timburke> it reminds me in some ways of the trouble i've seen in prod where a ChunkWriteTimeout pops and logs a path *but it has the wrong txn id*
21:56:02 <timburke> i've grown worried about eventlet's (green)threadlocal behavior...
21:56:15 <zaitcev> But it works fine on py3, right?
21:57:00 <timburke> ...i guess? seems to be better, anyway
21:57:27 <mattoliverau> sounds... tedious, thanks zaitcev for going down this particular rabbit hole.
21:58:26 <mattoliverau> we have 3 minutes before we reach time. Anything else, or shall we move any discussions into #openstack-swift?
21:58:53 <zaitcev> I'm all set.
21:59:20 <timburke> there are some py3 patches and some s3api patches i'd appreciate eyes on, but i can drop those in -swift
21:59:32 <mattoliverau> kk thanks timburke
21:59:51 <mattoliverau> timburke: maybe you could update priority reviews if you get the chance :)
21:59:55 <mattoliverau> I'll call it
21:59:56 <timburke> thank *you* mattoliverau! sorry i hadn't gotten to updating the agenda
22:00:00 <timburke> that's a great idea!
22:00:03 <acoles> thanks mattoliverau for jumping in to chair, great job!
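timburke's hypothesis, that a timeout armed during an earlier test is never cancelled and later fires while a different test is running, can be illustrated with a deterministic toy. This is not eventlet and not Swift's watchdog: `ManualWatchdog`, `run_test` and `simulate` are hypothetical names, the 10 s timeout is an arbitrary assumption, and the clock is advanced by hand so the outcome is reproducible. The point is only that an uncancelled timer carries the transaction id of the context that armed it, so when it eventually fires it reports a stale id.

```python
import heapq
import itertools
from typing import List, Set, Tuple


class ManualWatchdog:
    """Toy timeout scheduler driven by a hand-advanced clock (hypothetical).

    Each armed timeout remembers the txn id current when it was armed. If it
    is never cancelled, it fires later in whatever test is running then.
    """

    def __init__(self) -> None:
        self.now = 0.0
        self.current = ""  # name of the test running right now
        self.fired: List[Tuple[str, str]] = []  # (running test, timer's txn id)
        self._heap: List[Tuple[float, int, str]] = []
        self._cancelled: Set[int] = set()
        self._ids = itertools.count()

    def arm(self, delay: float, txn_id: str) -> int:
        timer_id = next(self._ids)
        heapq.heappush(self._heap, (self.now + delay, timer_id, txn_id))
        return timer_id

    def cancel(self, timer_id: int) -> None:
        self._cancelled.add(timer_id)

    def advance(self, seconds: float) -> None:
        """Move the clock forward and fire every uncancelled due timer."""
        self.now += seconds
        while self._heap and self._heap[0][0] <= self.now:
            _, timer_id, txn_id = heapq.heappop(self._heap)
            if timer_id not in self._cancelled:
                self.fired.append((self.current, txn_id))


def run_test(wd: ManualWatchdog, name: str, txn_id: str,
             duration: float, cancel: bool) -> None:
    wd.current = name
    timer = wd.arm(10.0, txn_id)  # hypothetical 10 s write timeout
    wd.advance(duration)          # the test body takes `duration` seconds
    if cancel:
        wd.cancel(timer)


def simulate(cancel: bool) -> List[Tuple[str, str]]:
    wd = ManualWatchdog()
    run_test(wd, "test_one", "tx-one", duration=1.0, cancel=cancel)
    run_test(wd, "test_two", "tx-two", duration=9.5, cancel=cancel)
    return wd.fired


if __name__ == "__main__":
    print("without cancel:", simulate(cancel=False))
    print("with cancel:   ", simulate(cancel=True))
```

With `cancel=False` the first test's timer fires during the second test and reports `tx-one`; with `cancel=True` nothing fires. In real eventlet code, using `eventlet.Timeout` as a context manager cancels it on exit, which avoids leaving a timer armed after the block it was meant to guard.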
22:00:10 <mattoliverau> Thanks for all your hard work and thanks for working on swift!
22:00:15 <mattoliverau> #endmeeting