21:00:33 <timburke> #startmeeting swift
21:00:34 <openstack> Meeting started Wed Sep 2 21:00:33 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:35 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:37 <openstack> The meeting name has been set to 'swift'
21:00:42 <timburke> who's here for the swift meeting?
21:00:48 <mattoliverau> o/
21:03:31 <timburke> i guess everyone else is lurking or late ;-)
21:03:38 <timburke> agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:03:51 <timburke> first up, PTG planning!
21:03:54 <timburke> #topic PTG
21:04:29 <timburke> i see clayg's been busy adding topics to the etherpad :D
21:04:32 <timburke> #link https://etherpad.opendev.org/p/swift-ptg-wallaby
21:04:43 <mattoliverau> timburke: if they too late then we can decide everything ;)
21:04:48 <clayg> yup, gunna be a great PTG
21:05:23 <timburke> if you haven't already, please respond to the doodle i made to figure out meeting times that week
21:05:23 <clayg> mattoliverau & timburke BDsFL!
21:05:25 <timburke> #link https://doodle.com/poll/ukx6r9mxugfn7sed
21:06:08 <timburke> i just need to respond by next week, iirc
21:06:09 <mattoliverau> clayg: thanks for adding the topic to etherpad
21:06:46 <kota_> hi
21:06:48 <timburke> any questions on ptg logistics?
21:07:05 <clayg> I had no idea there was a doodle. I will put it in my list.
21:07:55 <timburke> all right, let's look at some patches then!
21:08:03 <timburke> #topic waterfall ec
21:08:40 <timburke> this is starting to become table stakes for a lot of the work clayg and i have ben doing lately
21:09:00 <timburke> we packaged and put it in prod recently
21:09:34 <mattoliverau> How's it going in prod?
21:10:05 <timburke> that said, it's a fairly involved change -- having an extra set of eyes take a look would likely be beneficial
21:10:29 <timburke> mattoliverau, really well! dramatic improvement to ttfb metrics
21:10:39 <clayg> mattoliverau: GREAT!
21:10:39 <kota_> nice
21:10:58 <clayg> is there a paste bin but for graphs?
21:11:19 <timburke> imgur?
21:11:45 <mattoliverau> Ive been meaning to look, but been stuck with an upcoming product milestone at work. But will definitely try.
21:12:33 <clayg> oh acctually it's not that exciting cause we moved metric names around :\
21:13:49 <clayg> but our ttfb 99.9% used to be like 5s - then it was ~500ms and with the extra ec request it's more like 30ms 💪
21:14:16 <clayg> there's a ton of flexibility in there - no issues with regressions or 3R or anything like that - worked basically exactly how we wanted it to 🎉
21:14:37 <timburke> no worries -- i get it -- do you think you'd be able to get a look in this week? i'm tempted to say we just merge it and address further reviews as new patches
21:15:19 <clayg> i'm pretty sure anyone running EC wants this change - as long as everyone understands "the gist" and can find the option name...
21:15:29 <clayg> pretty sure it ended up as "just better by default"
21:15:44 <timburke> helps that clayg did all the plumbing you'd need to make sure you *can* configure concurrent gets just for replicated if you wanted
21:15:50 <clayg> the per-policy concurrency options are something to be aware of - but the defaults all fall through - there's no "upgrade impact" or anything
21:16:20 <timburke> UpgradeImpact: your EC GETs just got better ;-)
21:16:40 <mattoliverau> The fact that it's tested and run in prod helps. I'll make sure I give it some kind of review this week
21:16:46 <clayg> it's difficult to continue carrying it as patch just because we're doing other things to improve other stuff unrelated to waterfall ec - but the change moved a lot of code around
21:17:19 <clayg> basically for us it's the new master - we're not merging anything else ahead of it - it's blocking a bunch of unrelated stuff (like Tim's trans-id fix)
21:17:34 <mattoliverau> Now that v0.1 shard audit is out of my head I have more room :p
21:18:38 <clayg> ok, one week doesn't make a big difference to me if you want to look at it before it merged I guess...
21:19:09 <clayg> it's just a headache to keep managing it off master - but i'm already going through that again at least one more time
21:19:20 <mattoliverau> I'll have at least second high level look today and tomorrow.
21:19:29 <timburke> thanks!
21:20:01 <timburke> all right then
21:20:06 <timburke> #topic worker management
21:21:24 <timburke> so after looking at how many connections per worker, i started playing with signaling workers to exit gracefully
21:22:08 <timburke> ...and wound up bringing back some of the interfaces i removed in the first patch (in particular, do_bind_ports)
21:23:04 <timburke> i'm wondering if it'd be better/easier to review if i keep them as separate patches or squash them all together into a giant "manage workers better patch"
21:24:07 <timburke> for reference, https://review.opendev.org/#/c/745603/ improves connection distribution
21:24:07 <patchbot> patch 745603 - swift - Bind a new socket per-worker - 4 patch sets
21:24:35 <timburke> https://review.opendev.org/#/c/747332/ lets workers drain connections in response to a HUP or USR1
21:24:36 <patchbot> patch 747332 - swift - wsgi: Allow workers to gracefully exit - 3 patch sets
21:25:41 <timburke> and https://review.opendev.org/#/c/748721/ has the workers just stop listening (rather than close their socket) to make the cycling less disruptive (since any connected-but-not-yet-accepted clients should still get a response)
21:25:42 <patchbot> patch 748721 - swift - wsgi: stop closing listen sockets when workers die - 1 patch set
21:25:55 <clayg> timburke: fwiw the "bind a new socket per-worker" patch is going to be much more important for us to get out ASAP than the HUP/USR1 handling for workers (the rss killer will have to learn to "try to kill gracefully, but still timeout and murder with recklessabandon" (i.e. we can't "just" change TERM to HUP)
21:26:57 <timburke> i've also got another patch locally to make it so workers can *either* stop accepting *or* shutdown the listen socket, and then the parent opens fresh sockets as needed
21:26:59 <mattoliverau> If one is more important to get out then the other, then split em I say
21:26:59 <clayg> but if you have some vision for how it all goes back together and want to refactor the bind a new socket per worker patch that's maybe ok? I know you'd already tested the code as is - so I'd like to package that change w/o modification and we can keep making it better in the interum
21:27:11 <clayg> ... including squash everything to one if that's the way it makes the most sense
21:27:15 <mattoliverau> Then we can do our best at getting the one you need in
21:28:02 <clayg> yeah I think the worker spread is going to be a big win for everyone (I also just happen to like some of the refactors that tim did) - the graceful worker is a nice-to-have
21:28:27 <clayg> timburke: make respin everything squashed into a separate patch just for everyone to look at?
21:28:37 <timburke> i mean, we've already got +2 on the first two...
21:28:46 <timburke> clayg, yeah, i can do that
21:28:54 <clayg> maybe when it's in gerrit in green and red it'll be easier to see if/how it should be split
21:29:27 <clayg> yeah I'd be fine merging the "bind a new socket per-worker" - the only reason I think I held off was because you made some reference to wanting to squash or refactor something?
21:29:45 <clayg> but it sounds like mattoliverau and I are both pushing back on that... should we just merge bind a new socket per-worker and let you work from there?
21:30:17 <clayg> i have most of that machinery in my head - if we want to keep noodling there I think I can keep up - i am interested in making worker killing better
21:30:18 <timburke> fwiw, i ran into some weird deadlock running probe tests on my un-pushed work, where the client's sent the rest of its body and is waiting on a response, but the server is waiting on the client to send all the bytes...
21:30:23 <clayg> i miss rledisez :'(
21:30:41 <clayg> zohno!? i blame eventlet
21:31:06 <timburke> idk -- i might just need to learn more about sockets ;-)
21:31:45 <timburke> tho come to think of it -- i think it may have been waiting down in eventlet's discard()...
21:31:58 <clayg> SEE! eventlet 😡
21:32:27 <clayg> it's like "learn sockets" ok, now forget everyhting you knew about sockets 🤣
21:32:53 <clayg> except for the part you glanced over about non-blocking - THAT part you need to know really well 😁
21:33:19 <timburke> idk. i'll keep noodling with it -- but i'm starting to think we probably should go ahead and merge what's got +2s already
21:33:41 <clayg> DEAL!
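Editor's sketch (not part of the meeting log): the "bind a new socket per-worker" idea from the worker management topic might look roughly like the following. It assumes the listen address is shared via SO_REUSEPORT, which the log itself does not confirm, and that the platform supports it (for example Linux). The helper names bind_worker_socket and bind_worker_sockets are hypothetical and are not Swift APIs.

```python
import socket


def bind_worker_socket(host, port, backlog=128):
    """Return a listening TCP socket that may share (host, port) with siblings.

    Assumption: SO_REUSEPORT is the sharing mechanism; the meeting log does
    not say how the actual patch achieves this.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Every socket bound to the same address must set this option.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock


def bind_worker_sockets(host, port, workers):
    """Bind one independent listen socket per (hypothetical) worker."""
    first = bind_worker_socket(host, port)
    # If port was 0, reuse whichever port the kernel picked for the first socket.
    bound_port = first.getsockname()[1]
    rest = [bind_worker_socket(host, bound_port) for _ in range(workers - 1)]
    return [first] + rest


if __name__ == "__main__":
    socks = bind_worker_sockets("127.0.0.1", 0, 3)
    print([s.getsockname() for s in socks])
    for s in socks:
        s.close()
```

With one socket per worker, each worker has its own accept queue instead of all workers competing on a single inherited socket, which is one way connection distribution could improve.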
21:34:02 <timburke> #topic storage policies, logging, and s3api
21:34:47 <timburke> if anyone's looking for a quicker review, here's something a little smaller and more targetted ;-)
21:35:30 <clayg> 😍
21:35:32 <timburke> we noticed that the normal {policy_index} field from https://docs.openstack.org/swift/latest/logs.html doesn't get populated properly when using s3api
21:38:47 <timburke> the main problem was that proxy-logging relies on the x-backend-storage-policy-index header getting populated in the wsgi env, but s3api pretty frequently just makes up new environs
21:40:04 <timburke> the kind of funny thing is that proxy-logging looks at the response headers, too, but we don't actually populate them (on master; see https://launchpad.net/bugs/1634382)
21:40:05 <openstack> Launchpad bug 1634382 in OpenStack Object Storage (swift) "return x-backend-storage-policy-index header from object server" [Wishlist,Confirmed] - Assigned to Christian Hugo (christianhugo)
21:41:09 <timburke> https://review.opendev.org/#/c/693893/ tried to address that in the object-server, but we found there was a problem during rolling upgrades
21:41:10 <patchbot> patch 693893 - swift - Put storage policy index in object-server responses - 1 patch set
21:41:14 <clayg> timburke: I think maybe some of the container requests do populate that field - but yeah, not object
21:42:20 <timburke> so https://review.opendev.org/#/c/749400/ has the proxy add it locally shortly after determining the best response from backends
21:42:21 <patchbot> patch 749400 - swift - proxy: Put storage policy index in object responses - 3 patch sets
21:43:07 <timburke> https://review.opendev.org/#/c/749401/ then just makes sure s3api passes it through (since it wouldn't previously)
21:43:07 <patchbot> patch 749401 - swift - s3api: Ensure backend headers make it through s3api - 3 patch sets
21:43:51 <timburke> that's all i had planned to talk about
21:43:58 <timburke> #topic open discussion
21:44:08 <timburke> anything else we should talk about today?
21:44:39 <clayg> timburke: what's the deal with the s3api quota error patch?
21:44:54 <timburke> oh, yeah!
21:45:07 <clayg> does aws api have an over quota response?
21:45:12 <clayg> does it like suck or something?
21:45:38 <timburke> nope! no concept of object quotas (afaik)
21:46:03 <timburke> it's like they prefer to just charge you more until you change your behavior or something ;-)
21:46:40 <timburke> so the first patchset was very targetted and tactical: https://review.opendev.org/#/c/749382/1
21:46:41 <patchbot> patch 749382 - swift - s3api: Make quota-exceeded errors more obvious - 2 patch sets
21:46:42 <mattoliverau> Just a random thought I had while cooking. Seems quic protocol has been decided to be http/3 I wonder if it'll speed up internal swift communication (if we trued to implement it there). Quic is http over UDP but still has some acks. Just made to go faster. Could be an interesting poc (if anyone ever has any time) :p
21:46:57 <mattoliverau> *tried
21:48:39 <timburke> second patch took on a decent bit of refactoring; i'd be interested in kota_'s opinion
21:48:54 <timburke> mattoliverau, for sure interesting -- i'd love to have the bandwidth to look at reworking things from the protocol up
21:49:32 <kota_> timburke: will look. the p 749382 ?
21:49:32 <patchbot> https://review.opendev.org/#/c/749382/ - swift - s3api: Make quota-exceeded errors more obvious - 2 patch sets
21:49:44 <timburke> yup -- thanks!
21:50:55 <kota_> oh, it's just a couple of lines changed...
21:51:45 <timburke> kota_, is that the first patchset, maybe?
21:51:50 <kota_> ah, not. it was the first.
21:51:59 <kota_> timburke: correct. thx.
21:54:41 <clayg> timburke: did you get what you needed on libec
21:54:53 <clayg> i know you said we're super down to the wire on needing to cut releases for upstream
21:55:36 <timburke> i'm still a little torn on how to handle the env var in https://review.opendev.org/#/c/738959/
21:55:37 <patchbot> patch 738959 - liberasurecode - Be willing to write fragments with legacy crc - 2 patch sets
21:57:06 <timburke> i don't really want to log anything on unexpected values -- that seems like it'd be *incredibly* noisy on a busy proxy :-/
21:58:55 <kota_> hmm
21:59:10 <timburke> i think there's still time for us to revisit. we've had this bug a couple years already; what's another cycle?
21:59:29 <timburke> i *should* get client patches in order though :-(
21:59:41 <timburke> i'll make sure the priority reviews page is updated for next week
22:01:05 <timburke> all right, we're at time
22:01:17 <timburke> thank you all for coming, and thank you for working on swift!
22:01:23 <timburke> #endmeeting