19:00:38 <notmyname> #startmeeting swift
19:00:39 <openstack> Meeting started Wed Jan 7 19:00:38 2015 UTC and is due to finish in 60 minutes. The chair is notmyname. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:42 <kota_> yey:)
19:00:43 <openstack> The meeting name has been set to 'swift'
19:00:51 <notmyname> hello everyone
19:00:59 <mahatic> hello
19:01:05 <mattoliverau> o/
19:01:08 <peluse> hola
19:01:09 <notmyname> welcome back. I hope you had a good christmas/holiday/new year/solstice/etc
19:01:41 <notmyname> I've been catching up this week since I was out of town for a while
19:01:59 <acoles> hi
19:02:12 <notmyname> there's a few things to cover today, but nothing too big, I think
19:02:17 <notmyname> #link https://wiki.openstack.org/wiki/Meetings/Swift
19:02:24 * torgomatic says words to indicate presence
19:02:47 <notmyname> #topic general stuff
19:03:19 <notmyname> first up, as an FYI there is an openstack ops meetup (sort of their mid-cycle thing) in Philadelphia on March 9-10
19:03:45 <notmyname> it's a good place to talk to other people running different openstack services
19:03:51 <cutforth> hello - sorry late for roll call
19:04:30 <notmyname> it's normally kind of light on swift things (which is good), but it's IMO a good thing for ops to be aware of
19:04:40 <notmyname> I'll be there representing swift and listening
19:05:10 <notmyname> so if you or others who are running swift (and/or other openstack things) want to attend, there's the info
19:05:20 <notmyname> I think there'll be more info coming soon from fifieldt_
19:06:16 <notmyname> other things...
19:06:52 <notmyname> I want to have a new release for swift as soon as a few outstanding patches land (I'll cover them later with priority reviews)
19:07:21 <notmyname> but the point is, expect a 2.2.2 or 2.3 soonish. I hope for around the end of January. just depends on reviews
19:07:52 <notmyname> That's the boilerplate stuff I have. anyone else have announcements?
19:07:56 <zaitcev> I'm trying, I'm trying.
19:08:02 <zaitcev> I mean, reviews.
19:08:07 <notmyname> I'll be visiting mattoliverau next week at LCA in New Zealand
19:08:12 <mattoliverau> woo
19:08:33 <clayg> notmyname: nice! tell him "keep up the good work"
19:08:35 <notmyname> today's task is to make progress on my talk....
19:08:47 <clayg> where "good work" ~= "pointing out all of sam's and my bugs"
19:08:49 <notmyname> mattoliverau: clayg says "keep up the good work!"
19:09:05 <mattoliverau> notmyname: tell clayg "thanks man" :)
19:09:40 <notmyname> if there's no other announcements, then let's move to some priority reviews
19:09:40 <tdasilva> notmyname: I think the first review on "ring placement changes" was merged earlier today, so making progress there :)
19:09:47 <notmyname> #topic priority reviews
19:09:50 <notmyname> tdasilva: great!!
19:10:00 <notmyname> #link https://wiki.openstack.org/wiki/Swift/PriorityReviews
19:10:09 <notmyname> there's a category there for ring placement changes
19:10:41 <lpabon> o/ (i'm late) but here
19:10:42 <notmyname> those are the things that fix issues that swiftstack and red hat found over the last month based on some data placement in small and unbalanceable clusters
19:11:19 <zaitcev> I'd like Clay to re-publish 144432 if he's got a moment
19:11:38 <cschwede> notmyname: as tdasilva mentioned - the first one for the ring placement has been merged. I removed it from the list
19:11:39 <torgomatic> if people would just have enough failure domains, this would all be so much simpler
19:11:41 <clayg> zaitcev: which one is that?
19:11:44 <notmyname> we used to place data solely on failure domain. which doesn't work for failure domains that don't have a similar size (think adding a zone or region gradually)
19:11:52 <notmyname> cschwede: thanks
19:11:54 <zaitcev> clayg: "dispersion" command
19:11:55 <clayg> i have some moments today, reviews and rebases are on the list
19:12:15 <notmyname> now we take into account weight. which doesn't work if there are some unbalanceable rings with not enough failure domains
19:13:01 <notmyname> so these patches give a really nice midpoint. allowing a ring to have an "overload" parameter which means that partitions can be placed in a failure domain, even if it's "full" by weight
19:13:08 <notmyname> read the full commit message
19:13:11 <clayg> torgomatic: well if you have 6 failure domains in a tier but they're sized 10000 10 10 10 10 10 - you're still sorta screwed
19:13:34 <notmyname> clayg: not that we'd _ever_ see anything like that on a production cluster /s
19:13:39 <clayg> notmyname: I like to think of it as trading balance for dispersion
19:13:48 <notmyname> clayg: ya, that's a great way to put it
19:13:49 <cschwede> clayg: well, just increase the overload factor?
19:13:57 <torgomatic> clayg: if people would just not do that, this would all be much simpler ;)
19:14:15 <cschwede> torgomatic: people like complicated deployments ;)
19:14:22 <clayg> cschwede: no in that case I think you needed to see your dispersion is fucked and manually reduce the weights in the 10000 tier
19:14:30 <clayg> torgomatic: lol!
19:14:33 <notmyname> and the final patch in the chain is clayg's really great visualization (reporting) of dispersion vs balance. that's really important to (1) understand torgomatic's patch and (2) tell people how to fix or set the overload factor
19:14:48 <zaitcev> You know, guys, my head is really stupid and I'm having trouble thinking about what happens if we add that overload factor. Not sure how representative I am of an operator.
19:15:22 <cschwede> zaitcev: if you don't set it - nothing changes
19:15:42 <cschwede> that said, this is most likely not what you want
19:15:56 * peluse read that at first as "if you don't get it..." :)
19:16:12 <notmyname> zaitcev: basically, it lets you add eg 10% more partitions to a server so that eg if a server's drive fails the same server (failure domain) can pick up those parts without trying to move to a different server that already has another replica of the same partition
19:16:16 <clayg> zaitcev: i honestly couldn't "see" what overload does without the dispersion report - and overload was sorta my idea (mostly I was just there when cschwede had the idea, which is like at least partial credit)
19:16:18 <mattoliverau> lol, and that's true too :P
19:16:54 <zaitcev> okay, thanks... I copy-pasted it for future reference.
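(For reference, here is a minimal sketch of the overload idea described above, trading a little balance for better dispersion. This is not the ring-builder code under review; the helper name and the exact formula are assumptions made only to illustrate the discussion.)

```python
# Simplified model of the "overload" trade-off -- NOT the actual
# swift ring-builder code. Names and formula are illustrative assumptions.

REPLICAS = 3
TOTAL_PARTS = 2 ** 10  # e.g. a ring with partition power 10


def max_parts_allowed(weight, total_weight, overload=0.0):
    """Upper bound on replica-partitions a failure domain may hold.

    Without overload the bound is strictly proportional to weight; with
    overload it may exceed that fair share by the given fraction.
    """
    fair_share = TOTAL_PARTS * REPLICAS * weight / total_weight
    return fair_share * (1 + overload)


# Three servers; a drive failure shrinks server-3's weight. With
# overload=0.1 the surviving drives on server-3 may absorb ~10% extra
# partitions instead of pushing a second replica of the same partition
# onto a server that already holds one.
weights = {'server-1': 100, 'server-2': 100, 'server-3': 90}
total = sum(weights.values())
for name, w in weights.items():
    print(name, round(max_parts_allowed(w, total, overload=0.1), 1))
```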
19:17:13 <notmyname> so if you have 3 servers (1 zone, 1 region), you don't want to put 2 replicas on different drives in the same server if there is still space on different drives in the server that had the drive failure
19:17:15 <clayg> /* You are not expected to understand this */
19:17:19 <notmyname> lol
19:18:04 <notmyname> so I really like the idea on paper (commit message), but ya it's tricky to get my mind wrapped around the actual placement algo
19:18:06 <zaitcev> actually that original code wasn't hard to understand if you knew how PDP-11 worked
19:18:56 <clayg> zaitcev: lol you mean the original unix source where that comment/meme came from
19:19:07 <clayg> zaitcev: I was like... I've never even *heard* of PDP-11
19:19:32 <notmyname> clayg: that's why it's hard for you to understand the ring. obviously.
19:19:32 <clayg> ANYWAY - yeah, trading balance for dispersion
19:19:34 <notmyname> ;-)
19:20:04 <torgomatic> Spinal Tap used those for audio processing, right?
19:20:04 <clayg> it's like a slider to tell the ring how much you want swift < 2.0 or > 2.0
19:20:20 <notmyname> ok, so the other patch I want in the next release is the one to fix large out-of-sync containers
19:20:27 <clayg> do you want unbalanced rings that screw you with full disks, or undispersed rings that screw you with failures
19:20:49 <notmyname> so once those 4 land (3 for ring balance, including the one that landed today, and the 1 for container replication), I'll look at cutting a release
19:20:57 <clayg> OR do you want to stop and pull your head out of your cluster-frankenstein and actually think about a reasonable deployment for a minute
19:21:48 <cschwede> clayg: too easy.
19:22:13 <notmyname> there's obviously some other stuff going on (other patches, other bugs), but those are the priority things so we can remove the foot-gun from users
19:22:42 <notmyname> I'll come back to other patches in a bit
19:22:48 <notmyname> but moving on
19:22:58 <notmyname> #topic EC status
19:23:30 <notmyname> monday and tuesday peluse came out to SF to work with torgomatic, clayg, and me on some of the outstanding work (thanks peluse!)
19:23:44 <notmyname> so the current status is good (but there's still a lot to do)
19:23:45 <peluse> yup, few big items to hit...
19:24:02 <notmyname> reconstruction is shaping up nicely
19:24:04 <peluse> reconstructor: functional in the basic use case, WIP to make it complete, but we all have good line of sight on next steps!
19:24:10 <notmyname> yay
19:24:20 <peluse> PUT: tsg and yuanz are wrapping up a proposal to finalize PUT (the durable stuff)
19:24:22 <notmyname> tushar has a little work to finish up the PUT path
19:24:28 * notmyname let's peluse talk ;-)
19:24:29 <peluse> GET: clayg is the man!
19:24:37 <notmyname> s/'/
19:24:46 <peluse> once those 3 big items are done we can focus on testing...
19:24:58 <peluse> probe tests need some groundwork that I think torgomatic will start tackling soon
19:25:16 <peluse> and the functional tests also need a small overhaul, but nothing super intrusive I hope
19:25:22 <peluse> trello is fully up to date
19:25:34 <peluse> https://trello.com/b/LlvIFIQs/swift-erasure-codes
19:25:39 <notmyname> #link https://trello.com/b/LlvIFIQs/swift-erasure-codes
19:25:41 <peluse> and priority reviews are also up to date....
19:25:45 <peluse> questions?
19:26:03 <peluse> or additional comments from sam/clay/john?
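(peluse's PUT item above mentions "the durable stuff", and the ".durables" come up again just below. A rough sketch of that concept, for readers following along: a fragment archive only counts toward a GET once a durable marker says the PUT reached quorum. The file-naming scheme used here is an assumption for illustration, not necessarily Swift's actual on-disk format.)

```python
# Sketch only: serve a fragment on GET only if a ".durable" marker with
# the same timestamp exists, signalling that the PUT reached quorum.
# Naming (<ts>#<frag_index>.data, <ts>.durable) is an illustrative
# assumption, not a definitive description of the EC on-disk layout.
import os


def frag_is_durable(datadir, timestamp):
    """True if a fragment written at `timestamp` may be served."""
    return os.path.exists(os.path.join(datadir, '%s.durable' % timestamp))


def servable_fragments(datadir):
    """Yield fragment archives whose PUT has been marked durable."""
    for name in sorted(os.listdir(datadir)):
        if not name.endswith('.data'):
            continue
        timestamp = name.split('#', 1)[0]
        if frag_is_durable(datadir, timestamp):
            yield name
```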
19:26:18 <notmyname> I'm hoping that in the next few weeks we'll have basic PUT/GET functionality and be ready for the final testing and polishing
19:26:31 <peluse> heck yeah!
19:26:41 <mattoliverau> I know clayg is the man... but sounds like he will be handling every EC GET personally :P
19:26:49 <notmyname> as I told some others as to the status of EC, there's light at the end of the tunnel. and it's likely not an oncoming train. so we're good!
19:27:04 <notmyname> mattoliverau: basically, ya. just send all your data to him ;-)
19:27:27 <peluse> also, reminder, there are doc and spec patches out there for anyone looking for the latest/greatest tech info
19:27:32 <peluse> links coming...
19:27:53 <peluse> https://review.openstack.org/#/c/142146/ https://review.openstack.org/#/c/144895/
19:28:13 * eren greets all the great people here and silently watches the meeting
19:28:15 <peluse> feedback welcome for sure!! (on both)
19:29:01 <notmyname> yes. now is a great time to start grokking all the EC stuff. soon(ish) we'll be talking about the merge to master
19:29:09 <mattoliverau> #link https://review.openstack.org/#/c/142146/
19:29:15 <mattoliverau> #link https://review.openstack.org/#/c/144895/
19:29:23 <notmyname> mattoliverau: thanks
19:29:32 <clayg> notmyname: EC GET is one of my priorities over the next few weeks, I'd love to see it working and passing tests, and covering some first forms of failures (handoffs) - but there's a bunch of complexity down the road that won't be ready in the first version of EC GET that we merge
19:29:46 <clayg> notmyname: e.g. taking advantage of the .durables :P
19:30:08 <peluse> clayg, good point. we are talking basic functionality on all this stuff
19:30:17 <peluse> bells and whistles will cost extra :)
19:30:35 <clayg> nice
19:31:01 <notmyname> right. so eg the first version of EC shipped to customers might not have the ability to resume GETs from a different server if a drive fails in the middle of a read
19:31:07 <notmyname> clayg: ^^ that's one you talked about right?
19:31:13 <notmyname> (just wanted a practical example)
19:31:19 <clayg> sure, good practical example
19:31:46 <peluse> that's a good one - I think the .durable stuff/cleanup will come very quickly after basic functionality whereas resume might be further away
19:32:44 <clayg> notmyname: or we might get it in, might not, i'm just saying the patch in my head for the ECObjController will basically barely work, and there's tons of other smarter things that we'll have to add, like dealing with reads from partially complete PUTs
19:33:01 <notmyname> right right
19:33:06 <notmyname> and that's a good thing
19:33:38 <notmyname> anything else on EC? questions? explanation?
19:34:01 <notmyname> ok
19:34:05 <peluse> sweet
19:34:19 <notmyname> #topic interest in undelete/delayed delete [cschwede]
19:34:21 <notmyname> #link https://review.openstack.org/#/c/143799/
19:34:24 <notmyname> cschwede: you're up
19:34:31 <cschwede> Swift currently has no protection from defective or incorrect usage. Think for example buggy external applications that delete a lot of objects within a short timeframe, or misused credentials.
19:34:36 <cschwede> I propose a delayed deletion of object files to add some protection for operators. The idea is quite simple: the actual .data/.meta deletion is replaced with a rename, and the "deleted" object is stored in a second subdirectory "deleted_objects" (instead of "objects"). Only the last version is kept, and there is no replication for these files.
19:34:40 <cschwede> Either an external process (running find by cron) or an internal one (object-reaper?) removes these files periodically (for example after 24 hours).
19:34:40 <notmyname> torgomatic: ping (you might be interested in this)
19:34:43 <cschwede> My questions on this are: 1. Is there general interest for this inside Swift? 2. Does the approach sound reasonable? In that case I would continue my work on the upstream patch
19:35:10 <peluse> cschwede, FYI we're doing something along those lines with EC for different reasons
19:35:30 <peluse> cschwede, would be a good topic for hackathon whiteboard sessions, are you coming?
19:35:39 <cschwede> peluse: 99.9% yes :)
19:35:46 <peluse> fantastic!
19:35:55 <notmyname> cschwede: we've seen interest from customers for similar functionality
19:36:02 <zaitcev> cschwede: did anyone ask for it
19:36:04 <clayg> cschwede: "only the latest version is kept" means - only one "last" copy - that can happen either on an overwrite (PUT) or delete?
19:36:31 <zaitcev> also, we have the object expirer, can that be re-used
19:36:34 <cschwede> notmyname: yes, this is a customer request as well
19:36:37 <cschwede> zaitcev: yes
19:36:38 <notmyname> cschwede: what's your use case? where are you seeing this need (to zaitcev's point)
19:36:40 <clayg> zaitcev: yeah i've seen people ask
19:36:40 <notmyname> ah, cool
19:36:43 <cschwede> clayg: yes
19:36:54 <peluse> clayg, FYI I'm thinking about maybe leveraging some of the .durable type stuff, however using a different trigger than the .durable file. Just brainstorming
19:37:05 <cschwede> notmyname: use case: protection from external application bugs and credential abuse
19:37:13 <clayg> cschwede: yeah the part i don't like about what you threw up so far was all the animosity I currently feel towards quarantines
19:37:59 <clayg> cschwede: like we keep that shit around "just in case" but in reality I feel like I don't really have a good grok on what's there
19:38:14 <torgomatic> as always, the question "does an object with this name currently exist" is a very hard question to answer in an eventually-consistent system, so let's not build anything that relies on answering that question
19:38:31 <clayg> cschwede: if someone wanted something out of these deleted_objects dirs - it's like a swift-get-nodes call, and then hope you haven't rebalanced since the delete, and then manually pull it out :\
19:38:33 <zaitcev> quarantine is for postmortem I thought
19:38:55 <zaitcev> like the always-on diags, you want to see what happened. what if it's a bug in pickling or elsewhere
19:38:58 <cschwede> clayg: good point. hmm.
19:39:05 <clayg> cschwede: no one is paying for the storage, and some people really don't need that kind of protection
19:39:41 <cschwede> clayg: right, but some people really need it - so it would be an option, disabled by default
19:39:42 <zaitcev> so quarantine is not like what cschwede proposes, in my mind anyway. That is more like object versioning maybe.
19:39:43 <notmyname> cschwede: the .durable idea for EC might have interesting applications here. or maybe not. it's not fully implemented/fleshed-out in EC land yet
19:40:11 <clayg> cschwede: well i'm just suggesting that maybe it's not a cluster-wide on or off switch
19:40:19 * tdasilva wonders if it would make sense to implement obj. versioning where deletes are supported
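(As a rough illustration of the delayed-delete proposal cschwede describes above -- rename instead of unlink, plus a periodic purge -- here is a minimal sketch. Paths, helper names, and the grace period are assumptions for illustration; this is not the code in review 143799.)

```python
# Sketch of the delayed-delete idea: instead of unlinking a .data/.meta
# file, rename it into a parallel "deleted_objects" tree, then let a
# periodic job purge anything older than a grace period.
import os
import time

GRACE_PERIOD = 24 * 60 * 60  # e.g. purge after 24 hours


def delay_delete(path):
    """Rename an object file into the parallel deleted_objects tree.

    os.replace overwrites any earlier entry, so only the most recent
    "deleted" version is kept, as in the proposal.
    """
    dest = path.replace('/objects/', '/deleted_objects/', 1)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    os.replace(path, dest)


def purge_old(deleted_root, now=None):
    """What a cron 'find'-style reaper would do: drop expired files."""
    now = now or time.time()
    for dirpath, _dirs, files in os.walk(deleted_root):
        for name in files:
            full = os.path.join(dirpath, name)
            if now - os.path.getmtime(full) > GRACE_PERIOD:
                os.unlink(full)
```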
19:40:21 <peluse> yup, too many unknowns to make IRC discussion feasible, but a great thing to talk about in person
19:40:26 <notmyname> cschwede: one other idea I've seen is to move objects to a system-level account (eg .deleted) on a DELETE. then reap that maybe with the expirer later
19:40:32 <clayg> cschwede: i feel like it could be like versioned objects 2.0
19:40:50 <cschwede> notmyname: that's too slow in this case - because it is basically a COPY with a lot of data shuffling
19:41:00 <notmyname> ya, that'd be cool too. (versioned objects with delete support)
19:41:05 <notmyname> cschwede: ya
19:41:06 <clayg> cschwede: but if it's something the user can either choose to do or not, i guess it doesn't really protect against credential abuse :\
19:41:27 <cschwede> clayg: exactly. it should be enabled by the operator
19:41:28 <notmyname> cschwede: I think you've got the answer to the basic question: yes, there's interest :-)
19:41:44 <notmyname> and no shortage of ideas
19:41:55 <cschwede> notmyname: yep, thanks, so i'll continue work on this and propose something during the hackathon :)
19:41:59 <notmyname> cool
19:42:25 <cschwede> thx all for the feedback/ideas!
19:42:35 <notmyname> thanks for bringing it up!
19:42:37 <notmyname> #topic open discussion
19:42:43 <notmyname> what else do you have?
19:43:07 <clayg> zaitcev: my comment on quarantines was just that it's this other directory that's not really managed by swift daemons anymore - it is postmortem, but a deleted objects dir is also postmortem (hey, I just accidentally the whole thing - do you have a backup?)
19:43:22 <cutforth> speaking of the hackathon, are there any details yet?
19:43:37 <notmyname> cutforth: I'll be working on that at 1pm today
19:43:46 <cutforth> notmyname: k, thx
19:43:55 <notmyname> (ie in about an hour)
19:44:37 <notmyname> if there's nothing else, let's be done
19:44:49 <notmyname> thanks everyone for coming today. and thanks for working on swift
19:45:11 <notmyname> lots of people are using what you're making. at large scale, in prod, all over the world
19:45:24 <peluse> rock n roll
19:45:24 <notmyname> #endmeeting