21:00:52 <notmyname> #startmeeting swift
21:00:53 <openstack> Meeting started Wed Jan 16 21:00:52 2019 UTC and is due to finish in 60 minutes.  The chair is notmyname. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:54 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:57 <openstack> The meeting name has been set to 'swift'
21:00:58 <notmyname> who's here for the swift team meeting?
21:01:01 <clayg> o/
21:01:04 <kota_> o.
21:01:07 <mattoliverau> o/
21:01:08 <m_kazuhiro> o/
21:01:09 <rledisez> hi o/
21:01:13 <kota_> o/
21:01:29 <tdasilva> hi
21:02:19 <notmyname> welcome
21:02:42 <notmyname> two quick logistical points to address before we get intot he main meeting topic
21:03:30 <notmyname> first, FYI, I'll be tagging the stable branches soon. I know I've been saying that, but the last patch just landed, and I realized that I need to do changelog updates. so I'll get that done asap (but it may be while I'm on a plane this weekend
21:04:02 <mattoliverau> well you'll have a long enough flight :)
21:04:08 <clayg> hahaha!
21:04:10 <notmyname> second, I'll be traveling for the next two weeks (LCA next week and holiday the week after), so for the next two meetings, it's up to the rest of you to figure out if  there is a meeting and who's leading it
21:04:34 <mattoliverau> if it makes you feel better, I'll be flying much longer this weekend :P
21:04:36 <clayg> 👃 👈
21:04:45 <notmyname> mattoliverau: it doesn't ;-)
21:04:51 <mattoliverau> lol
21:04:53 <clayg> Hahah
21:05:15 <notmyname> any questions or comments about those two things?
21:05:39 <notmyname> ok, no questions, let's move on
21:05:42 <notmyname> #topic Reconstructor can rebuild to too many primaries
21:05:59 <notmyname> https://review.openstack.org/#/c/629056 has been proposed by clayg, and he'd like to discuss it this week
21:06:00 <patchbot> patch 629056 - swift - NEED HELP: Rebuild frags for unmounted disks - 3 patch sets
21:06:08 <notmyname> clayg: you've got the floor.
21:06:15 <clayg> it's related to https://bugs.launchpad.net/swift/+bug/1510342
21:06:16 <openstack> Launchpad bug 1510342 in OpenStack Object Storage (swift) "Reconstructor does not restore a fragment to a handoff node" [Medium,Confirmed]
21:06:21 <clayg> which maybe some people have been aware of...
21:06:36 <clayg> basically our EC design doesn't support fail-in-place (i.e. unmount a primary disk and worry about ring changes later)
21:07:11 <clayg> replication does support this strategy for dealing with disk failure and it's been pointed out for awhile that difference is maybe annoying or frustrating or confusing for operators...
21:07:28 <clayg> for awhile it seemed like there was other more important stuff to work on with improving EC - but eventually we fixed that stuff :D
21:07:44 <notmyname> "we ran out of easier stuff to fix"
21:07:55 <clayg> I got to thinking about it and couldn't really come up with a good excuse NOT to do this - so I started to look at if that was worth doing and decided it might be pretty good!
21:08:23 <clayg> Does anyone have any questions about the bug or priority before I talk about implementation/design challanges that I'd like some help with before I continue coding on it?
21:08:58 <kota_> the challenge sounds so nice
21:09:39 <clayg> ok, great!  so assuming we're onboard with maybe it's worth spending some braincells on it...
21:09:57 <clayg> (rledisez i know you do EC rebuild sorta different?  you may have some different perspective here?)
21:10:15 <clayg> one issue that was interesting is I had to "index" handoff nodes
21:10:58 <clayg> we don't want two unmounted causing partner nodes to try and rebuild different fragments to the first handoff
21:11:02 <notmyname> how'd you do that? enumerate handoffs and assign them frag indexes based on a MOD or something?
21:11:30 <clayg> so instead if node index 0 is unmounted node index -1 and 1 will rebuild to first handoff, index 1 rebuilds to "second" handoff etc
21:12:12 <clayg> notmyname: exactly right - minor change in ring code - but it worked out fine... biggest issue seems to be finding all our "fake ring" implementations and teaching them to assign index to handoffs
21:12:23 <clayg> that part is just worth knowing - i think it's fine
21:12:28 <notmyname> kk
21:12:45 <clayg> my challange right now is that the "current" design for primary suffix syncing is... bananas
21:12:53 <clayg> go look t this code please -> https://review.openstack.org/#/c/629056/3/swift/obj/reconstructor.py
21:12:54 <patchbot> patch 629056 - swift - NEED HELP: Rebuild frags for unmounted disks - 3 patch sets
21:13:13 <zaitcev> I'm too scared to look.
21:13:33 <clayg> when we tell a primary to check it's left and right hand partners we stuff that in job['sync_to'] - a list of two nodes then we ... extend the list of nodes we check... with ALL the other nodes!?
21:14:23 <clayg> so the idea is that if our partners respond OK, we're done - but if there's *ANY* kind of error (timeout, rejected for concurrency, or even 507) we'll keep going until we do 2 successful syncs ... or run out of primaries to try
21:14:51 <clayg> that's not right... I think me or paul was just like... we have to do SOMETHING on failure - this can't be "wrong" although it's probably not optimal
21:15:01 <clayg> then it sat like that for... whatever 3-4 years?
21:15:38 <clayg> Now that I'm changing it so that if our left partner responds unmounted we'll actually sync to a handoff instead - it seems like *just* doing our left and right partner might be sufficient!?
21:16:12 <clayg> or maybe we could also check with maybe one more "partner" like maybe on the other side of the ring?  then we have a total of three nodes checking on every given fragment...
21:16:25 <clayg> 3 is still < ALL OF THEM so I guess I was leaning that way
21:16:33 <clayg> but I could see arguments being made for all kinds of other ideas
21:16:45 <clayg> so I was hoping we could have some of that debate now - instead of after I wrote it
21:16:59 <clayg> Anyone have an opinons/ideas/concerns/questions (that's kinda all I got)
21:17:17 <notmyname> good summary, thanks
21:17:22 <notmyname> and it sounds like a hard problem
21:17:31 <notmyname> if the fragment handoff is MOD the fragment index, doesn't that mean that you "only" need to check every N+M nodes, not all of them? eg with 8+4, you only need to check every 12th one (+frag_index)? hmm, but that means handoffs likely wouldn't be found after a rebalance
21:18:34 <notmyname> clayg: there's a step you described that I could use more detail on
21:18:44 <clayg> i don't know that putting frags further out in the handoff chain makes them any more or less likely to be reachable in the handoff chain after a rebalance
21:18:46 <notmyname> Now that I'm changing it so that if our left partner responds unmounted we'll actually sync to a handoff instead - it seems like *just* doing our left and right partner might be sufficient!?"
21:19:18 <notmyname> good point re handoffs. it's randomish rebalancing anyway
21:19:49 <notmyname> clayg: can you explain the "just doing our left and right partner" part? what is "doing" in this case?
21:19:53 <clayg> yes... the big change is that if we get a 507 specifically when syncing with a partner node we'll rebuild/sync to a handoff (and keep the handoff in-sync as long as it's mirroed primary is unmounted until a ring change)
21:21:01 <mattoliverau> Sounds good.. but to be honest, I'd need to go read the code and think on it. I might be a little jetlagged and may also may have had a few beers with dinner as I forgot about this meeting.
21:21:09 <clayg> "doing" means making sure that A SET of our left partners frags are rebuild *somewhere* - either on the primary where it belongs, or if it's unmounted we'll actively find another node and rebuild fragments there... that's what enables fail-in-place
21:22:01 <notmyname> so after a rebalance, the handoff frag will attempt to put to the primary, which may still be unmounted. then the primary neighbor will choose a new handoff location which will again be kept in sync until a rebalance? sounds like a risk for "replication bumbs"
21:22:14 <clayg> mattoliverau: right... it'd be nice if we could go over something like this around a white board.  Except I'd not like a TON of work regardless of what we decide - and we could change it later - so I was hoping to have a direction settled in the next week or so and code it up for review with a couple of weeks tops
21:23:09 <clayg> notmyname: with a fail-in-place strategy you *remove* unmounted drives from the ring before rebalance
21:23:31 <clayg> notmyname: also, with EC after a rebalance you almost certainly should be in handoffs_only mode (no rebuilds) until you finish shuffling data around
21:24:04 <notmyname> I like that second point. not sure I agree with the first
21:24:13 <clayg> so... those handoff nodes with the rebuild fragments will *probably* be the guy that ships the frags to their new primary
21:24:29 <notmyname> yeah, that makes sense
21:25:07 <clayg> notmyname: there is zero value in having a "dead" drive in the ring... you might leave the failed drive unmounted in the chassis - but having it assigned parts is kinda mad
21:25:57 <notmyname> so with replicas, drive fails and gets 507s internally and swift works around it. if you remove it from the ring, things get faster, but leaving it doesn't stop any durability things. and with EC you're *required* to remove dead drives before rebalance?
21:26:24 <clayg> yeah... so the closest thing i can think of we have to a "durability config option" is handoffs_delete
21:26:41 <notmyname> if that's true, I'm not sure it's bad, but I just want to make sure I clearly understand :-)
21:27:15 <clayg> so i'm not sure having "reconstructor extra sync nodes" option is a good idea - and I'm leaning towards 3... but it's a bit different than what we're doing now... and I don't think what we're doing now is very smart
21:28:32 <clayg> uhh... no i don't guess you're "required" to remove dead drives... but I guess I hadn't considered throughly what happens if you DON'T because it seemed odd not to
21:29:07 <notmyname> gut-reaction, it seems like more "special knowledge" that should be more built-in. makes the "how to run swift" question harder (and it's already hard).
21:29:17 <clayg> on replicated - yes any part-replicas still assigned to the unmounted drive (which could change but also might be largely the same) would have to be re-replicated to a new handoff
21:29:38 <notmyname> or rather, I wonder what sort of automation could be more built in (eg only running handoffs-only for ec policies after a rebalance)
21:30:02 <clayg> the handoff from the last ring will try to sync back to the primaries... and might fail to reap unless you have handoffs_delete set to two
21:30:23 <rledisez> notmyname: on that last point, we do that by distributing the builder (it's dirty, but efficient): https://review.openstack.org/#/c/389676/
21:30:24 <patchbot> patch 389676 - swift - WIP: reconstructor: do not sync recently moved parts - 2 patch sets
21:30:38 <clayg> yeah having to manage your ec rebalance modes is super annoying 😡
21:31:21 <notmyname> I think my questions/comments/thoughts are a bit of a slight distraction from your original question. you're basically asking if the proposed strategy has obvious correctness issues
21:33:11 <clayg> well, I don't think it has "obvious correctness" issues - I think the existing code is "obviously superfluous" - i was hoping maybe we might agree "yeah that code should change, go ahead" and then if I'm lucky I'd hear "yeah 3 nodes checking on each frag is probably about right" or "shouldn't it work like XYZ"
21:33:38 <notmyname> to the correctness question, as long as neighboring fragments are in different durability zones, then it's probably fine. (right?) for example, you don't want a rack to fail and then have a big enough gap in the available frag set so that they gap doesn't get reconstructed
21:35:01 <clayg> so handoffs go out of their way to bounce around failure domains - but I can't say for sure there wouldn't be a bad sequence for any given part of any given ring - we're just not that regimented in how we structure handoff sequences - and skipping around the handoffs via modulo might exacerbate the issue for a given frag_index in a given part
21:35:14 <clayg> but I definitely don't have a better idea there
21:35:44 <notmyname> that makes sense for handoffs. not sure we can do any better there (and choosing something modulo or random likely won't be any better than neighbors)
21:35:52 <clayg> ... and even imagining failure domain distribution is imperfect - having a copy of the frag rebuild in ANY failure domain is better than not having the fragment because you're using fail-in-place and didn't know about lp bug #1510342
21:35:53 <openstack> Launchpad bug 1510342 in OpenStack Object Storage (swift) "Reconstructor does not restore a fragment to a handoff node" [Medium,Confirmed] https://launchpad.net/bugs/1510342
21:36:19 <mattoliverau> Well, disks will fail, and duribiltiy is most important, so yes I agree once we unmount we should do something. We do it in Replication so we should also do it in EC. So I think it's the right move.
21:36:23 <notmyname> but if frags are only checking neighbors, what if primaries 2, 3, and 4 are down? what does 1 check? what does 5 check? does 3 ever get reconstructed?
21:36:54 <clayg> @notmyname right - I think that thing (broken chains) is what prompted the weird existing behavior
21:37:14 <notmyname> existing == check everything?
21:37:29 <clayg> (yes)
21:37:36 <clayg> 3 is in a tough spot because TWO of the nodes that might check on him are down/unmounted otherwise not checking on him
21:37:43 <mattoliverau> Getting more nodes from a ring tends to hit the same nodes, so left + right, or L + R + middle sounds good. But seeing as get more nodes should get the same, can't we just use them.. and trust the algorithm to keep them in different domains.
21:38:02 <clayg> ... so that lead me to maybe we should have *three* nodes check on each frag (i have a picture in my notebook) 😁
21:38:29 <clayg> notmyname: also I think you're up-to-speed - you're thinking about the problem I'm worried about - how can I not check everything but still be "good enough"
21:38:30 <notmyname> clayg: if I understand correctly, then yeah, I like the idea of your 3-way checks. 2 neighbors + one "across" the ring
21:39:28 <clayg> mattoliverau: yeah I think we can just trust the ring rebalance abstraction from this context until/unless we see a bad behavior (and then we should fix the ring - not the code that's expect it to do it's job)
21:39:29 <notmyname> and eg with 12 fragments, it's likely good enough without being too expensive. probably should do some mathy things to find the exact numbers
21:40:18 <clayg> notmyname: right mathy things... kota_ you're good at math!?  timburke is too... is he around?
21:40:22 <notmyname> rledisez: kota_: tdasilva: m_kazuhiro: does all this make sense?
21:40:31 <clayg> s/all/any of/
21:40:33 <notmyname> timburke got sick kids and is on PTO today :-(
21:40:42 <clayg> we're doomed
21:41:09 <kota_> still catching up... now
21:41:48 <m_kazuhiro> me too
21:41:49 <mattoliverau> anyway may thoughts are, yup we need to do something. Go make it better. It sounds better then what we have.. well actually a pretty good idea. I'll checkout the code tomorrow and see if I get anymore questions, or bad suggestions ;)
21:41:50 <rledisez> i *think* i get the idea of the 3-way checks, but I need to think more of it. it's way too complex for my timezone :)
21:41:57 <notmyname> yeah, that went pretty quickly. I know it's hard to follow fast-moving, technical discussions in your non-native language :-)
21:42:09 <kota_> got stomach ache and back here, long logs :P
21:42:13 <mattoliverau> rledisez: +1 :P
21:43:02 <clayg> ok good - so maybe this was the sanity check - and I'll keep going - folks can think more about it and maybe next week we'll have something in good shape to discuss... that would work ok with my timelines
21:43:06 <notmyname> clayg: ok, so assuming kota_ and m_kazuhiro and rledisez and mattoliverau will be able to sleep/wake-up and think more about the code and ideas, did you get answers you can use?
21:43:17 <notmyname> clayg: good :-)
21:43:18 <clayg> notmyname: 👍
21:43:22 <kota_> yup
21:43:25 <m_kazuhiro> ok
21:43:39 <notmyname> clayg: thanks for working on this. it's a hard problem that's important to solve, and I'm glad you're on it :-)
21:43:45 <mattoliverau> +1
21:43:47 <clayg> no worries
21:43:59 <notmyname> #topic s3 versioning
21:44:07 <clayg> zohno, no tim!
21:44:21 <notmyname> on behalf of tim, I wanted to mention something, but only for "Read this". not to discuss right now
21:44:40 <notmyname> #link https://gist.github.com/tipabu/4d49516050f6762ce9cf6b2ebb49e545
21:45:23 <notmyname> basically, tim is working on getting s3 versioning support. he's run in to a few interesting problems to solve with it. some of which have API implications. so he wrote up that doc to cover the issues and his thoughts on solutions
21:45:46 <notmyname> I know he'd appreciate people reading over it to understand what's going on so it can be talked about in more detail later
21:45:57 <notmyname> #topic open discussion
21:46:03 <notmyname> any other topics to bring up this week?
21:46:38 <notmyname> the combined ptg/summit in denver is currently accepting presentation proposals. the deadline is next week
21:46:39 <notmyname> #link https://www.openstack.org/summit/denver-2019/call-for-presentations/
21:48:33 <notmyname> doesn't sound like anything else, so let's close the meeting
21:48:45 <notmyname> thanks for coming this week
21:48:51 <notmyname> thanks for your work on swift
21:48:54 <notmyname> #endmeeting