21:00:11 <timburke> #startmeeting swift
21:00:13 <openstack> Meeting started Wed Feb 19 21:00:11 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:16 <openstack> The meeting name has been set to 'swift'
21:00:22 <timburke> who's here for the swift meeting?
21:00:31 <rledisez> hi o/
21:00:34 <seongsoocho> o/
21:00:34 <kota_> o/
21:01:30 <mattoliverau> o/
21:02:05 <timburke> as always, agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:18 <timburke> #topic vancouver
21:02:39 <timburke> so first off, a reminder that the PTG in vancouver is coming up
21:02:45 <timburke> June 8-11
21:02:53 <timburke> #link https://www.openstack.org/events/opendev-ptg-2020/
21:03:57 <timburke> the foundation's looking for some attendance estimations, so if you know you're going to be going, that'd be nice for me to know!
21:04:17 <zaitcev> Well, I plan to go.
21:04:21 <zaitcev> FWIW
21:04:28 <seongsoocho> I will go there.
21:04:35 <zaitcev> Never a given at my age.
21:05:00 <kota_> planning (not yet got approval)
21:05:06 <timburke> it looks like i *won't* be able to go (my wife has her own conference that week, and i figure i owe her one) but i'm happy to make sure that there's space for swifters and help with any planning
21:05:42 <rledisez> For me and alecuyer, no approval yet
21:06:04 <clayg> 🎉???
21:06:30 <clayg> if seongsoocho is gunna go I wanna go
21:06:49 <seongsoocho> wow.. 🙂
21:07:22 <mattoliverau> There might be a meeting in NUE at the suse office around that time. If it doesn't clash, then I might see if I can apply for travel support :)
21:07:43 <clayg> kota_: everyone in the swiftstack office is all about https://events.static.linuxfound.org/sites/events/files/slides/linuxcon15_bando.pdf
21:07:56 <timburke> should we plan on requesting an ops feedback session again? i feel like that was nice in shanghai
21:07:58 <clayg> kota_: so if you could look up Yuichi Bando in the company directory and send him a virtual high five that'd be awesome!
21:08:30 <kota_> OH
21:08:43 <clayg> timburke: I feel like I remember prior-to-shanghai ops feedback sessions way better than shanghai
21:08:50 <timburke> fair enough
21:08:58 <clayg> ... we were all just in that one room? were there ops people that came by at some point?
21:09:03 <kota_> IIRC... he left...
21:09:27 <clayg> kota_: well... his legacy lives on!
21:09:45 <clayg> timburke: or no, there WAS an ops session in one of the speaking rooms - separate from the PTG
21:09:46 <kota_> that sounds awesome. good to know
21:09:47 <timburke> clayg, maybe it could just be done as some set-aside time in the working sessions, then
21:09:56 <timburke> *shrug*
21:09:57 <clayg> only it was mostly like just us
21:10:26 <timburke> yeah, we wrote some stuff down at least: https://etherpad.openstack.org/p/PVG-swift-ops-feedback
21:10:32 <clayg> no no, I do remember that now - there was a couple of people in there...
21:11:25 <timburke> i think sorrison came by and wouldn't have necessarily done so for just the working sessions
21:11:53 <timburke> anyway, something to think about: do you guys want one, and if so, who would like to lead it
21:12:05 <clayg> timburke: yes I 100% agree, I just forgot - the ops sessions are great and we should definitely have another one not just the working sessions
21:12:18 <timburke> 👍
21:12:58 <clayg> seongsoocho: could be the MC like "I run swift, it's not my LEAST favorite piece of software - how about y'all?"
21:14:15 <timburke> anyway, i think that's all i've got for vancouver -- thanks zaitcev, seongsoocho for the solid yes, and kota_, rledisez, alecuyer, clayg, mattoliverau for the tentative maybes ;-)
21:14:52 <timburke> i figure by the time i need to actually get a response in, we'll all know a bit more about what's approved
21:15:03 <timburke> #topic swiftclient release
21:15:09 <timburke> (sorry, a little out of order)
21:15:21 <timburke> so, we had one! there's a 3.9.0 now
21:15:35 <diablo_rojo> timburke, we will miss you
21:15:56 <timburke> this was actually kinda unplanned; milestone 2 came up and i forgot that we need to get a client library out by then
21:16:25 <timburke> but i *would* like to get another out in the not-too-distant future!
21:16:52 <timburke> in particular to pick up versioning support
21:16:55 <timburke> #link https://review.opendev.org/#/c/691877/
21:17:05 <timburke> and symlink support
21:17:07 <timburke> #link https://review.opendev.org/#/c/694211/
21:17:32 <zaitcev> Well poke tdasilva with a physical stick
21:18:04 <timburke> charz has apparently been rather busy :-) i'm also pretty excited about the ideas in https://review.opendev.org/#/c/707409/
21:19:19 <timburke> the filtering in https://review.opendev.org/#/c/708074/ is kind of interesting, too, and probably closer to mergeable
21:19:26 <clayg> zaitcev: can you make a summary of the debate about the interface for symlink support?
21:20:03 <timburke> and i remember people specifically asking for keystone credential support in shanghai, so https://review.opendev.org/#/c/699457/ might be good
21:20:21 <clayg> zaitcev: I remember thinking when I looked at it that the interface did not at a glance resemble the existing copy command (which I thought it reasonably MIGHT)
21:20:26 <zaitcev> clayg: swift link [--cont2=c2] c a b versus swift link c1 a c2 b
21:20:51 <clayg> so i'm not sure if things have moved to "no it's just like copy now" or "copy was wrong; and this is better"
21:21:02 <clayg> ... or maybe even "why would creating a link be like copying?!"
21:21:10 <zaitcev> A pinnacle of byteshedding. Although, I recall that Tim told me that we already have something like it, like "swift copy" so maybe we should just be consistent with that.
21:23:11 <kota_> sounds reasonable
21:24:30 <clayg> ok so maybe that one isn't done yet
21:24:41 <clayg> the versioning one is GTG I think, and I'd love to have it
21:25:17 <timburke> i'll find some time to review it this week
21:25:47 <timburke> anything else to bring up with regard to swiftclient?
21:27:05 <timburke> #topic EC gets and long tail latency
21:27:44 <timburke> so we here at swiftstack have a customer very interested in bringing down their 99.9%tile latencies
21:28:35 <timburke> we realized that even though they had concurrent gets turned on, it wasn't actually helping, because all the data was EC'ed
21:29:34 <timburke> and the way that works is we spin up ndata connections, then wait up to node_timeout for them to indicate they're ready before ever spinning up any alternate connections
21:30:40 <timburke> so we had this real interesting distribution of request timings where you'd see a big peak close to zero, then a pretty smooth drop-off until you got to node_timeout
21:30:40 <clayg> and so if you have a "slow" frag it slows down the whole mess
21:31:47 <clayg> timburke: were we able to isolate any of those specific requests up there around node_timeout and look at logs to validate there was a XXXms fragment GET after a Xs node_timeout?
21:31:49 <timburke> at which point you got another peak and drop-off until 2x node_timeout, at which point you've got one more little peak and only a smattering of longer requests
21:32:15 <timburke> clayg, haven't yet; probably a good idea to validate assumptions
21:32:33 <timburke> so i guess i've got two main questions for people
21:33:00 <timburke> 1) has anyone else noticed this sort of behavior? got any users complaining about latencies?
21:34:38 <timburke> and 2) how *should* it behave? spin up ndata+nparity connections immediately and wait for ndata? spin up ndata then have the concurrency_timeout behavior we currently have with replicated? should there be any special logic around handoff nodes?
21:35:36 <rledisez> 1/ not really. yes we clearly see the ttfb is higher on EC than replica, but we didn't really investigate because by design our EC policy will be slower (and we didn't get complaints from our users)
21:35:36 <rledisez> 2/ i don't want the proxy to reach all nodes immediately because it will have an impact in terms of IO. it would be ok if the IO cost was nearly zero (aka storing meta on a faster device)
21:36:14 <zaitcev> Fascinating.
21:36:17 <rledisez> but as we are considering dropping replica and moving all to EC, this is clearly a topic i'm interested in
21:36:22 <mattoliverau> How long does it normally take for a node to become "ready"
21:36:30 <mattoliverau> does it take node_timeout
21:36:46 <mattoliverau> or can we use conn_timeout like in default concurrency
21:37:00 <mattoliverau> *default concurrency timeout
21:37:21 <mattoliverau> So they're staggered, but not waiting too long.
21:37:54 <clayg> mattoliverau: I think defaulting concurrent gets to conn_timeout has proven to be absolutely brilliant and has undoubtedly smoothed out latency tails on replicated GETs
21:38:59 <clayg> rledisez: I 100% agree that use-case expectation setting for "replicated is faster ttfb" proves out for the data - and I don't believe this deployment has done an A/B comparison to quantify that
21:39:56 <clayg> rledisez: OTOH, if we can do a good job translating what mattoliverau taught us when implementing concurrent gets for replicated it's possible that difference could be diminished
21:40:50 <timburke> fwiw, i've got a smallish log sample where 50% are <100ms, 90% are <600ms, and 99% are <2s
21:41:05 <rledisez> i agree it looks like the right way to do it, especially since it's configurable, operators can choose to be aggressive or not on the timeout
21:41:24 <timburke> (though i maybe need to clean that up, i don't think i filtered it to only look at object gets)
21:42:31 <timburke> one thought that i'd had was to wait until we have max(0, ndata-nparity) connections in hand before switching to the concurrent-get behavior, but maybe that's just over-complicating things
21:42:51 <mattoliverau> Sounds like conn_timeout is a first thing to try and see how it goes. worst case is we get less of a tail and more connections.. which isn't great, but might be ok. Conn timeout defaults at 0.5 so might catch 90% of cases if we're lucky.
21:44:40 <timburke> and i *think* https://review.opendev.org/#/c/706361/ might help with the extra connections -- we were definitely seeing connections on the proxy that would hang around well past the client response on the same cluster...
21:46:22 <timburke> anyway, just wanted to let people know about some of the stuff we've been poking at lately
21:46:28 <timburke> #topic open discussion
21:46:39 <timburke> what else would people like to bring up?
21:46:58 <clayg> 👍 we can trivially demonstrate in unittests and development that those greenthreads spawned and abandoned from the proxy after the client response goes through are still running and will finish
21:47:05 <zaitcev> So, anyone wants to help out with that dark data thing? Romain, you had an opinion about the separation.
21:47:06 <clayg> we leverage this behavior on purpose in the object-server with the container-update-timeout
21:48:58 <zaitcev> Unfortunately, Sam left for Google, so I cannot debate him. He typically knew what he was doing about this, so I'm not very happy to conclude that he was wrong. https://review.opendev.org/706653
21:49:22 <clayg> mattoliverau: so for waterfall-ec your suggestion is spawn ec_ndata connections like we currently do, then use concurrent_gets to wait up to concurrency_timeout (defaults to connection timeout) before spawning more requests
21:49:27 <rledisez> on the replace-md5-for-checksum side, I'm planning to put as a requirement, before the operator decides to use something other than md5 for checksum, that he must ensure that the whole cluster is upgraded to the right version. is that ok for everybody? i don't think it's possible to just let the proxy do its best to guess if it should use md5 or something else
21:49:48 <rledisez> zaitcev: do you have a pointer on the dark-data thing? (bugreport, review, …)
21:50:44 <clayg> mattoliverau: I assume like replicated concurrent gets we'd stop spawning once we get into handoffs until we get a non-success (maybe timeout) response from a primary
21:50:50 <timburke> zaitcev, i think he had a goal of allowing marginally-trusted code to run; if we're content to say "you want to run this? make sure it's stable!" i think i'd probably be ok with the simpler approach
21:51:06 <mattoliverau> clayg: yeah, then we also dog food the same settings on EC and REPL. But of course, only brain storming atm
21:51:21 <zaitcev> rledisez: I had a couple of cases where the loss of capacity was significant. It always was associated with some kind of catastrophic mismanagement by the operator: a botched restore from a backup, or having half of the drives go down.
21:51:31 <timburke> rledisez, https://review.opendev.org/#/c/706653/ i believe
21:52:08 <clayg> so if we have a 4+2, we spawn 4, then 500ms later if we're still waiting on at least 2 of the 4 responses - we spawn 2 more
21:52:21 <mattoliverau> yup
21:52:56 <mattoliverau> then the ndata get a chance to give us the goods, and if not we can fall back to a rebuild. but not go searching handoffs.. though I guess we could
21:52:59 <zaitcev> rledisez: I am comfortable to think that OVH or RAX aren't going to have any dark data even if clusters are heavily loaded. It's just that Red Hat sells this private cloud solution to people who don't care about running their clusters, and so this happens regularly.
21:53:18 <mattoliverau> worst case, other than connections, we may get a frag back.
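The waterfall idea discussed above for EC GETs (spawn ndata connections, wait a short concurrency timeout, then spawn extras) can be sketched as a small timing model. This is only an illustration under stated assumptions, not Swift's proxy code: the function name `time_to_ndata`, the rule of one extra connection per still-pending response, and all latency numbers are hypothetical, and only a single extra round is modeled.

```python
def time_to_ndata(latencies, ndata, nparity, wait_before_extra):
    """Return the time (seconds) at which ndata fragment responses are in hand.

    latencies: hypothetical per-node response latencies; the first ndata
        entries are the nodes contacted at t=0, the next nparity entries
        are the nodes that may be contacted later.
    wait_before_extra: how long to wait before opening extra connections.
    """
    first = latencies[:ndata]
    spare = latencies[ndata:ndata + nparity]
    # Count initial connections that have not answered by the deadline.
    pending = sum(1 for t in first if t > wait_before_extra)
    # Assumption: open one extra connection per pending response,
    # capped at nparity by the length of `spare`.
    extras = [wait_before_extra + t for t in spare[:pending]]
    arrivals = sorted(first + extras)
    return arrivals[ndata - 1]


# 4+2 policy with one slow fragment (3 s); all numbers are made up.
latencies = [0.05, 0.04, 3.0, 0.06, 0.05, 0.07]
# Waiting a long node_timeout-style interval (10 s, illustrative).
wait_node_timeout = time_to_ndata(latencies, 4, 2, wait_before_extra=10.0)
# Waiting a short conn_timeout-style interval (0.5 s, as mentioned above).
wait_conn_timeout = time_to_ndata(latencies, 4, 2, wait_before_extra=0.5)
print(wait_node_timeout, wait_conn_timeout)
```

With these made-up latencies, the long wait leaves the request blocked on the 3 s fragment, while the short wait gets a fourth fragment from a spare node at roughly 0.55 s, at the cost of one extra connection.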
21:53:38 <timburke> rledisez, on the hashing, yeah, that seems reasonable -- rather like what we documented for turning on encryption in an existing cluster
21:53:56 <clayg> mattoliverau: I think the code is going to be fairly amenable to that change, thanks for thinking through it
21:55:59 <rledisez> zaitcev: I'm pretty confident we have dark data hanging around :) but probably nothing massive. don't forget that container may not be up2date immediately, so you should allow a delay before checking the container server
21:56:54 <rledisez> I don't really like that because I can't imagine how long it would take, but I don't have any better option. if there are async_pendings somewhere, you might end up deleting valid objects
21:56:55 <clayg> zaitcev: same here, it's unfortunately all too human to have "some kind of catastrophic mismanagement by the operator" 😁
21:57:03 <zaitcev> Yeah... It can be stuck in an update somewhere
21:57:43 <clayg> rledisez: on the md5 thing - i'm not sure I follow how you're suggesting we *enforce* "make sure your whole cluster is upgraded"
21:57:52 <seongsoocho> rledisez: In my cluster, the swift proxy-server supports multiple checksums. Some containers use md5 for checksum and some containers use SHA256 for checksum. The customer wants to choose the checksum..
21:58:22 <zaitcev> The main problem I'm having with this isolation thing is the throttling
21:58:36 <rledisez> clayg: we can't enforce it, but we can write in the changelog/doc "if you don't do it, you'll be in trouble"
21:58:42 <clayg> seongsoocho: !!! that's awesome! It's possible you avoided some of the hard problems rledisez is trying to solve - but I think generally that's the idea we'd like to support
21:58:44 <zaitcev> My container servers are on rotating storage and they are slower than object auditors at times.
21:58:59 <timburke> seongsoocho, whoa, cool! is that stored in the container db, too, or just used for etags, or something else? i'm sad i won't be in vancouver to ask you more about it in person!
21:59:05 <clayg> seongsoocho: i'm guessing there's at least proxy middleware to handle setting sysmeta? How much can you share?
21:59:12 <zaitcev> So, very soon you start skipping on the checks. No big deal, just recheck on next pass, right
21:59:27 <seongsoocho> timburke: It's just used for etags.
22:00:17 <seongsoocho> clayg: I'll try to write up the details on etherpad as soon as possible.
22:00:26 <zaitcev> Well, but now suppose it says that you have 0.03% capacity used up by dark data. Is that trustworthy? But one other thing is, with these separate processes everything is harder to analyze.
22:01:04 <clayg> zaitcev: oh, ok, so but the "audit_dark_data" example watcher is new - sam's original design didn't include a concrete example?
22:01:27 <timburke> all right, looks like we're about out of time -- but we can keep chatting in -swift for sure!
22:01:45 <timburke> (sorry, not sure if there's anyone who'd be waiting on the room)
22:01:55 <timburke> thank you all for coming, and thank you for working on swift!
22:01:59 <timburke> #endmeeting