19:00:38 <notmyname> #startmeeting swift
19:00:39 <openstack> Meeting started Wed Jan  7 19:00:38 2015 UTC and is due to finish in 60 minutes.  The chair is notmyname. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:42 <kota_> yey:)
19:00:43 <openstack> The meeting name has been set to 'swift'
19:00:51 <notmyname> hello everyone
19:00:59 <mahatic> hello
19:01:05 <mattoliverau> o/
19:01:08 <peluse> hola
19:01:09 <notmyname> welcome back. I hope you had a good christmas/holiday/new year/solstice/etc
19:01:41 <notmyname> I've been catching up this week since I was out of town for a while
19:01:59 <acoles> hi
19:02:12 <notmyname> there's a few things to cover today, but nothing too big, I think
19:02:17 <notmyname> #link https://wiki.openstack.org/wiki/Meetings/Swift
19:02:24 * torgomatic says words to indicate presence
19:02:47 <notmyname> #topic general stuff
19:03:19 <notmyname> first up, as an FYI there is an openstack ops meetup (sort of their mid-cycle things) in Philadelphia on March 9-10
19:03:45 <notmyname> it's a good place to talk to other people running different openstack services
19:03:51 <cutforth> hello - sorry late for roll call
19:04:30 <notmyname> it's normally kind of light on swift things (which is good), but it's IMO a good thing for ops to be aware of
19:04:40 <notmyname> I'll be there representing swift and listening
19:05:10 <notmyname> so if you or others who are running swift (and/or other openstack things) want to attend, there's the info
19:05:20 <notmyname> I think there'll be more info coming soon from fifieldt_
19:06:16 <notmyname> other things...
19:06:52 <notmyname> I want to have a new release for swift as soon as a few outstanding patches land (I'll cover them later with priority reviews)
19:07:21 <notmyname> but the point is, expect a 2.2.2 or 2.3 soon-ish. I hope for around the end of January. just depends on reviews
19:07:52 <notmyname> That's the boilerplate stuff I have. anyone else have announcements?
19:07:56 <zaitcev> I'm trying, I'm trying.
19:08:02 <zaitcev> I mean, reviews.
19:08:07 <notmyname> I'll be visiting mattoliverau next week at LCA in New Zealand
19:08:12 <mattoliverau> woo
19:08:33 <clayg> notmyname: nice!  tell him "keep up the good work"
19:08:35 <notmyname> today's task is to make progress on my talk....
19:08:47 <clayg> where "good work" ~= "pointing out all of sam's and my bugs"
19:08:49 <notmyname> mattoliverau: clayg says "keep up the good work!"
19:09:05 <mattoliverau> notmyname: tell clayg "thanks man" :)
19:09:40 <notmyname> if there's no other announcements, then let's move to some priority reviews
19:09:40 <tdasilva> notmyname: I think the first review on "ring placement changes" was merged earlier today, so making progress there :)
19:09:47 <notmyname> #topic priority reviews
19:09:50 <notmyname> tdasilva: great!!
19:10:00 <notmyname> #link https://wiki.openstack.org/wiki/Swift/PriorityReviews
19:10:09 <notmyname> there's a category there for ring placement changes
19:10:41 <lpabon> o/ (i'm late) but here
19:10:42 <notmyname> those are the things that fix issues that swiftstack and red hat found over the last month with data placement in small and unbalanceable clusters
19:11:19 <zaitcev> I'd like Clay to re-publish 144432 if he's got a moment
19:11:38 <cschwede> notmyname: as tdasilva mentioned - the first one for the ring placement has been merged. i remove it from the list
19:11:39 <torgomatic> if people would just have enough failure domains, this would all be so much simpler
19:11:41 <clayg> zaitcev: which one is that?
19:11:44 <notmyname> we used to place data solely on failure domain. which doesn't work for failure domains that don't have a similar size (think adding a zone or region gradually)
19:11:52 <notmyname> cschwede: thanks
19:11:54 <zaitcev> clayg: "dispersion" command
19:11:55 <clayg> i have some moments today, reviews and rebases are on the list
19:12:15 <notmyname> now we take into account weight. which doesn't work if there are some unbalanceable rings with not enough failure domains
19:13:01 <notmyname> so these patches give a really nice midpoint. allowing a ring to have an "overload" parameter which means that partitions can be placed in a failure domain, even if it's "full" by weight
19:13:08 <notmyname> read the full commit message
19:13:11 <clayg> torgomatic: well if you have 6 failure domains in a tier but they're sized 10000 10 10 10 10 10 - you're still sorta screwed
19:13:34 <notmyname> clayg: not that we'd _ever_ see anything like that on a production cluster /s
19:13:39 <clayg> notmyname: I like to think of it as trading balance for dispersion
19:13:48 <notmyname> clayg: ya, that's a great way to put it
19:13:49 <cschwede> clayg: well, just increase the overload factor?
19:13:57 <torgomatic> clayg: if people would just not do that, this would all be much simpler ;)
19:14:15 <cschwede> torgomatic: people like complicated deployments ;)
19:14:22 <clayg> cschwede: no in that case I think you needed to see your dispersion is fucked and manually reduce the weights in the 10000 tier
19:14:30 <clayg> torgomatic: lol!
19:14:33 <notmyname> and the final patch in the chain is clayg's really great visualization (reporting) of dispersion vs balance. that's really important to (1) understand torgomatic's patch and (2) tell people how to fix or set the overload factor
19:14:48 <zaitcev> You know, guys, my head is really stupid and I'm having trouble thinking about what happens if we add that overload factor. Not sure how representative I am of an operator.
19:15:22 <cschwede> zaitcev: if you don’t set it - nothing changes
19:15:42 <cschwede> that said this is most likely not what you want
19:15:56 * peluse read that at first as "if you don't get it..."  :)
19:16:12 <notmyname> zaitcev: basically, it lets you add eg 10% more partitions to a server so that eg if a server's drive fails the same server (failure domain) can pick up those parts without trying to move to a different server that already has another replica of the same partition
19:16:16 <clayg> zaitcev: i honestly couldn't "see" what overload does without the dispersion report - and overload was sorta my idea (mostly I was just there when cschwede had the idea, which is like at least partial credit)
19:16:18 <mattoliverau> lol, and that's true too :P
19:16:54 <zaitcev> okay, thanks... I copy-pasted it for future reference.
19:17:13 <notmyname> so if you have 3 servers (1 zone, 1 region), you don't want to put 2 replicas on different drives in the same server if there is still space on different drives in the server that had the drive failure
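(Rough back-of-the-envelope sketch in Python of what the overload factor described above means. The server names, partition count, and the standalone calculation are made up for illustration; this is not the ring-builder code from the patches under review.)

    # A ring with 3 replicas and three equally weighted servers. "Fair
    # share" is the weight-proportional number of partition-replicas each
    # server gets; the overload cap is how far above that share the
    # builder may go, so a failed drive's partitions can stay on the same
    # server instead of doubling up replicas on another server.
    replicas = 3
    part_count = 1024                 # 2**part_power, hypothetical
    weights = {'server-a': 100.0, 'server-b': 100.0, 'server-c': 100.0}
    overload = 0.10                   # the "10%" from the example above

    total_weight = sum(weights.values())
    for server, weight in sorted(weights.items()):
        fair_share = part_count * replicas * weight / total_weight
        cap = fair_share * (1 + overload)
        print('%s: fair share %.0f partition-replicas, overload cap %.0f'
              % (server, fair_share, cap))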
19:17:15 <clayg> /* You are not expected to understand this */
19:17:19 <notmyname> lol
19:18:04 <notmyname> so I really like the idea on paper (commit message), but ya it's tricky to get my mind wrapped around the actual placement algo
19:18:06 <zaitcev> actually that original code wasn't hard to understand if you knew how PDP-11 worked
19:18:56 <clayg> zaitcev: lol you mean the original unix source where that comment/meme came from
19:19:07 <clayg> zaitcev: I was like... I've never even *heard* of PDP-11
19:19:32 <notmyname> clayg: that's why it's hard for you to understand the ring. obviously.
19:19:32 <clayg> ANYWAY - yeah trading balance for dispersion
19:19:34 <notmyname> ;-)
19:20:04 <torgomatic> Spinal Tap used those for audio processing, right?
19:20:04 <clayg> it's like a slider to tell the ring how much you want swift < 2.0 or > 2.0
19:20:20 <notmyname> ok, so the other patch I want in the next release is the one to fix large out of sync containers
19:20:27 <clayg> do you want unbalanced rings that screw you with full disks, or undispersed rings that screw you with failure
19:20:49 <notmyname> so once those 4 land (3 for ring balance-including the one that landed today) and the 1 for container replication, I'll look at cutting a release
19:20:57 <clayg> OR do you want to stop and pull your head out of your cluster-frankenstein and actually think about a reasonable deployment for a minute
19:21:48 <cschwede> clayg: too easy.
19:22:13 <notmyname> there's obviously some other stuff going on (other patches, other bugs), but those are the priority things so we can remove the foot-gun from users
19:22:42 <notmyname> I'll come back to other patches in a bit
19:22:48 <notmyname> but moving on
19:22:58 <notmyname> #topic EC status
19:23:30 <notmyname> monday and tuesday peluse came out to SF to work with torgomatic clayg and me on some of the outstanding work (thanks peluse!)
19:23:44 <notmyname> so the current status is good (but there's still a lot to do)
19:23:45 <peluse> yup, few big items to hit...
19:24:02 <notmyname> reconstruction is shaping up nicely
19:24:04 <peluse> reconstructor:  functional in basic use case, WIP to make it complete but we all have good line of sight on next steps!
19:24:10 <notmyname> yay
19:24:20 <peluse> PUT:  tsg and yuanz are wrapping up a proposal to finalize PUT (the durable stuff)
19:24:22 <notmyname> tushar has a little work to finish up the PUT path
19:24:28 * notmyname lets peluse talk ;-)
19:24:29 <peluse> GET:  clayg is the man!
19:24:46 <peluse> once those 3 big items are done we can focus on testing...
19:24:58 <peluse> probe tests need some groundwork that I think torgomatic will start tackling soon
19:25:16 <peluse> and the functional tests also need a small overhaul but nothing super intrusive I hope
19:25:22 <peluse> trello is fully up to date
19:25:34 <peluse> https://trello.com/b/LlvIFIQs/swift-erasure-codes
19:25:39 <notmyname> #link https://trello.com/b/LlvIFIQs/swift-erasure-codes
19:25:41 <peluse> and priority reviews are also up to date....
19:25:45 <peluse> questions?
19:26:03 <peluse> or additional comments from sam/clay/john?
19:26:18 <notmyname> I'm hoping that in the next few weeks we'll have basic PUT/GET functionality and ready for the final testing and polishing
19:26:31 <peluse> heck yeah!
19:26:41 <mattoliverau> I know clayg is the man... but sounds like he will be handling every EC GET personally :P
19:26:49 <notmyname> as I told some others as to the status of EC, there's light at the end of the tunnel. and it's likely not an oncoming train. so we're good!
19:27:04 <notmyname> mattoliverau: basically, ya. just send all your data to him ;-)
19:27:27 <peluse> also, reminder, there are doc and spec patches out there for anyone looking for latest/greatest tech info
19:27:32 <peluse> links coming...
19:27:53 <peluse> https://review.openstack.org/#/c/142146/ https://review.openstack.org/#/c/144895/
19:28:13 * eren greets all the great people here and silently watches the meeting
19:28:15 <peluse> feedback welcome for sure!! (on both)
19:29:01 <notmyname> yes. now is a great time to start grokking all the EC stuff. soon(ish) we'll be talking about the merge to master
19:29:09 <mattoliverau> #link https://review.openstack.org/#/c/142146/
19:29:15 <mattoliverau> #link https://review.openstack.org/#/c/144895/
19:29:23 <notmyname> mattoliverau: thanks
19:29:32 <clayg> notmyname: EC GET is one of my piorities over the next few weeks, I'd love to see it working and passing tests, and covering some first form failures (handoffs) - but there's a bunch of complexity down the road that won't be ready in the first version of EC GET that we merge
19:29:46 <clayg> notmyname: e.g. taking advantage of the .durables :P
19:30:08 <peluse> clayg, good point.  we are talking basic functionality on all this stuff
19:30:17 <peluse> bells and whistles will cost extra :)
19:30:35 <clayg> nice
19:31:01 <notmyname> right. so eg the first version of EC shipped to customers might not have the ability to resume GETs from a different server if a drive fails in the middle of a read
19:31:07 <notmyname> clayg: ^^ that's one you talked about right?
19:31:13 <notmyname> (just wanted a practical example)
19:31:19 <clayg> sure good practical example
19:31:46 <peluse> that's a good one - I think the .durable stuff/cleanup will come very quickly after basic functionality whereas resume might be further away
19:32:44 <clayg> notmyname: or we might get it in, might not, i'm just saying the patch in my head for the ECObjController will basically barely work, and there's tons of other smarter things that we'll have to add like dealing with reads from partially complete PUTs
19:33:01 <notmyname> right right
19:33:06 <notmyname> and that's a good thing
19:33:38 <notmyname> anything else on EC? questions? explanation?
19:34:01 <notmyname> ok
19:34:05 <peluse> sweet
19:34:19 <notmyname> #topic interest in undelete/delayed delete [cschwede]
19:34:21 <notmyname> #link https://review.openstack.org/#/c/143799/
19:34:24 <notmyname> cschwede: you're up
19:34:31 <cschwede> Swift currently has no protection from defective or incorrect usage. Think for example buggy external applications that delete a lot of objects within a short timeframe or misused credentials.
19:34:36 <cschwede> I propose a delayed deletion of object files to add some protection for operators. The idea is quite simple: the actual .data/.meta deletion is replaced with a rename, and the "deleted" object is stored in a second subdirectory "deleted_objects" (instead of "objects"). Only the last version is kept, and there is no replication for these files.
19:34:40 <cschwede> Either an external (running find by cron) or internal process (object-reaper?) removes these files periodically (for example after 24 hours).
19:34:40 <notmyname> torgomatic: ping (you might be interested in this)
19:34:43 <cschwede> My questions on this are: 1. Is there general interest for this inside Swift? 2. Does the approach sound reasonable? In this case I would continue my work on the upstream patch
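(A minimal sketch of the idea as cschwede describes it above - rename instead of unlink, keep only the last version, purge after a grace period. The paths, function names, and the /srv/node layout assumption are mine for illustration; this is not the code from the patch under review.)

    import os
    import time

    GRACE_PERIOD = 24 * 60 * 60  # seconds; "for example after 24 hours"

    def soft_delete(data_path):
        """Rename an object file into a parallel deleted_objects tree on
        the same device instead of unlinking it."""
        # e.g. /srv/node/sdb1/objects/123/abc/<hash>/<ts>.data
        #   -> /srv/node/sdb1/deleted_objects/123/abc/<hash>/<ts>.data
        new_path = data_path.replace('/objects/', '/deleted_objects/', 1)
        dest_dir = os.path.dirname(new_path)
        if not os.path.isdir(dest_dir):
            os.makedirs(dest_dir)
        for old in os.listdir(dest_dir):  # only the last version is kept
            os.unlink(os.path.join(dest_dir, old))
        os.rename(data_path, new_path)

    def reap_deleted(device_root, grace=GRACE_PERIOD):
        """The cron/object-reaper half: remove soft-deleted files older
        than the grace period."""
        cutoff = time.time() - grace
        deleted_root = os.path.join(device_root, 'deleted_objects')
        for dirpath, _dirs, files in os.walk(deleted_root):
            for name in files:
                full = os.path.join(dirpath, name)
                if os.path.getmtime(full) < cutoff:
                    os.unlink(full)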
19:35:10 <peluse> cschwede, FYI we're doing something along those lines with EC for different reasons
19:35:30 <peluse> cschwede, would be a good topic for hackathon whiteboard sessions, are you coming?
19:35:39 <cschwede> peluse: 99.9% yes :)
19:35:46 <peluse> fantastic!
19:35:55 <notmyname> cschwede: we've seen interest from customers for similar functionality
19:36:02 <zaitcev> cschwede: did anyone ask for it
19:36:04 <clayg> cschwede: "only the latest version is kept" means - only one "last" copy - that can happen either on an overwrite (PUT) or delete?
19:36:31 <zaitcev> also, we have the object expirer, can that be re-used
19:36:34 <cschwede> notmyname: yes, this is a customer request as well
19:36:37 <cschwede> zaitcev: yes
19:36:38 <notmyname> cschwede: what's your use case? where are you seeing this need (to zaitcev's point)
19:36:40 <clayg> zaitcev: yeah i've seen people ask
19:36:40 <notmyname> ah, cool
19:36:43 <cschwede> clayg: yes
19:36:54 <peluse> clayg, FYI I'm thinking about maybe leveraging some of the .durable type stuff however using a different trigger than the .durable file.  Just brainstorming
19:37:05 <cschwede> notmyname: use case: protection from external application bugs and credential abuse
19:37:13 <clayg> cschwede: yeah the part i don't like about what you threw up so far was all the animosity I currently feel towards quarantines
19:37:59 <clayg> cschwede: like we keep that shit around "just in case" but in reality I feel like I don't really have a good grok on what's there
19:38:14 <torgomatic> as always, the question "does an object with this name currently exist" is a very hard question to answer in an eventually-consistent system, so let's not build anything that relies on answering that question
19:38:31 <clayg> cschwede: if someone wanted something out of these deleted_objects dirs - it's like a swift-get-nodes call, and then hope you haven't rebalanced since the delete, and then manually pull it out :\
19:38:33 <zaitcev> quarantine is for postmortem I thought
19:38:55 <zaitcev> like the always-on diags, you want to see what happened. what if it's a bug in pickling or elsewhere
19:38:58 <cschwede> clayg: good point. hmm.
19:39:05 <clayg> cschwede: no one is paying for the storage, and some people really don't need that kind of protection
19:39:41 <cschwede> clayg: right, but some people really need it - so it would be an option, disabled by default
19:39:42 <zaitcev> so quarantine is not like what cschwede proposes, in my mind anyway. That is more like object versioning maybe.
19:39:43 <notmyname> cschwede: the .durable idea for EC might have interesting applications here. or maybe not. it's not fully implemented/fleshed-out in EC land yet
19:40:11 <clayg> cschwede: well i'm just suggesting that maybe it's not a cluster-wide on or off switch
19:40:19 * tdasilva wonders if it would make sense to implement obj. versioning where deletes are supported
19:40:21 <peluse> yup, too many unknowns to make IRC discussion feasible but a great thing to talk about in person
19:40:26 <notmyname> cschwede: one other idea I've seen is to move objects to a system-level account (eg .deleted) on a DELETE. then reap that later, maybe with the expirer
19:40:32 <clayg> cschwede: i feel like if could be like versioned objects 2.0
19:40:50 <cschwede> notmyname: that’s too slow in this case - because it is basically a COPY with a lot of data shuffling
19:41:00 <notmyname> ya, that'd be cool too. (versioned objects with delete support)
19:41:05 <notmyname> cschwede: ya
19:41:06 <clayg> cschwede: but if it's something the user can either choose to do or not i guess it doesn't really protect against credential abuse :\
19:41:27 <cschwede> clayg: exactly. it should be enabled by the operator
19:41:28 <notmyname> cschwede: I think you've got the answer to the basic question: yes there's interest :-)
19:41:44 <notmyname> and no shortage of ideas
19:41:55 <cschwede> notmyname: yep, thanks, so I'll continue work on this and propose something during the hackathon :)
19:41:59 <notmyname> cool
19:42:25 <cschwede> thx all for the feedback/ideas!
19:42:35 <notmyname> thanks for bringing it up!
19:42:37 <notmyname> #topic open discussion
19:42:43 <notmyname> what else do you have?
19:43:07 <clayg> zaitcev: my comment on quarantines was just that it's this other directory that's not really managed by swift daemons anymore - it is postmortem, but a deleted objects dir is also postmortem (hey I just accidentally the whole thing - do you have a backup?)
19:43:22 <cutforth> speaking of the hackathon, are there any details yet?
19:43:37 <notmyname> cutforth: I'll be working on that at 1pm today
19:43:46 <cutforth> notmyname: k, thx
19:43:55 <notmyname> (ie in about an hour)
19:44:37 <notmyname> if there's nothing else, let's be done
19:44:49 <notmyname> thanks everyone for coming today. and thanks for working on swift
19:45:11 <notmyname> lots of people are using what you're making. at large scale, in prod, all over the world
19:45:24 <peluse> rock n roll
19:45:24 <notmyname> #endmeeting