21:00:06 <timburke> #startmeeting swift
21:00:07 <openstack> Meeting started Wed Jul 1 21:00:06 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:10 <openstack> The meeting name has been set to 'swift'
21:00:15 <timburke> who's here for the swift meeting?
21:00:28 <kota_> o/
21:00:29 <alecuyer> o/
21:00:40 <seongsoocho> o/
21:00:59 <clayg> howdy!
21:02:02 <timburke> agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:07 <timburke> first up
21:02:11 <timburke> #topic Berlin
21:02:25 <timburke> the Request for Presentations recently opened up
21:02:27 <timburke> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-July/015730.html
21:02:52 <timburke> deadline's Aug 4
21:03:19 <timburke> summit will be Oct 19-23
21:03:48 <timburke> (assuming the pandemic is more or less under control, so... who knows?)
21:04:03 <kota_> :/
21:04:20 <timburke> but i wanted to at least bring the call for presentations to everyone's attention
21:05:04 <timburke> that's all i've got in terms of announcements
21:05:22 <timburke> #topic replication servers
21:05:54 <timburke> so clayg made a very important observation on p 735751
21:05:55 <patchbot> https://review.opendev.org/#/c/735751/ - swift - Allow direct and internal clients to use the repli... - 4 patch sets
21:06:09 <clayg> go team!
21:06:21 <timburke> the replication servers currently only respond to SSYNC and REPLICATE requests!
21:06:33 <timburke> #link https://bugs.launchpad.net/swift/+bug/1446873
21:06:33 <openstack> Launchpad bug 1446873 in OpenStack Object Storage (swift) "ssync doesn't work with replication_server = true" [Medium,Confirmed]
21:07:13 <timburke> it won't even respond to OPTIONS, which is supposed to tell you what request methods *are* allowed
21:07:43 <clayg> ^ that's like @timburke 's favorite joke 🤣
21:08:28 <mattoliverau> Sorry I'm late, woke up this morning and had IRC troubles, but finally connected.
21:08:40 <timburke> it looks like it won't be too hard to allow replication servers to service all requests, and make `replication_server = False` basically mean "don't do SSYNC or REPLICATE"
21:08:59 <timburke> does that seem like a reasonable thing to everyone, though?
21:10:22 <timburke> i feel like the work involved in making it so i can properly *test* with separate replication servers is going to be way bigger than making it work :-/
21:12:24 <clayg> I think the conclusion on the lp bug was to drop the semantic of `replication_server = None` - which I think is what you HAVE to run on your replication servers now if you want to do a replication network and have EC rebuild still work
21:13:05 <timburke> seems related to https://review.opendev.org/#/c/337861/ 🤔
21:13:05 <patchbot> patch 337861 - swift - Permit to bind object-server on replication_port - 7 patch sets
21:13:32 <clayg> yeah probably 😬
21:13:50 <clayg> i feel like if this was easy we wouldn't have put it off so long
21:14:08 <clayg> it would be helpful to have sanity checks as we try to merge shit
21:15:24 <timburke> it's probably worth me putting in some work to make probe tests runnable with separate replication servers regardless. seems like a lot of people run that way (we certainly do!) and anything i can do to make my dev environment more representative of prod seems like a good idea
21:15:51 <timburke> but, it'll probably be a bit
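For context, a rough sketch of the setup under discussion. A dedicated replication-network instance of the object-server is typically configured with `replication_server = true` (the addresses and port below are illustrative, not prescriptive), and per the bug above, such an instance currently refuses everything except REPLICATE/SSYNC -- including the OPTIONS request that would normally report which methods are allowed:

```
# object-server config for a replication-network instance (illustrative)
[DEFAULT]
bind_ip = 10.0.1.1     # hypothetical replication-network address
bind_port = 6200       # adjust for your deployment

[app:object-server]
use = egg:swift#object
replication_server = true   # today: serve ONLY REPLICATE/SSYNC
```

And a quick way to see the joke for yourself, assuming a node at the hypothetical address above (stdlib only, no Swift imports):

```python
# Probe a backend server's advertised methods; per the bug discussed
# above, a replication_server = true instance 405s even this OPTIONS.
import http.client

conn = http.client.HTTPConnection('10.0.1.1', 6200)  # hypothetical node
conn.request('OPTIONS', '/')
resp = conn.getresponse()
print(resp.status, resp.reason, resp.getheader('Allow'))
```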
21:15:58 <timburke> #topic waterfall EC
21:16:05 <timburke> clayg, how's it going?
21:16:06 <clayg> timburke: it's a non-zero cost to have to run a second set of backend *-servers, but could be nice
21:16:18 <clayg> timburke: you're the only one who's commented 😠
21:16:35 <timburke> hehe
21:16:43 <timburke> fair enough
21:17:09 <clayg> i'm really proud of my little feeder system tho - I've started to conceptualize breaking them up as like a polyrhythm sort of situation
21:17:46 <clayg> where we have one flow of code that's popping ticks predictably (start ndata, then remaining nparity at a predictable pattern)
21:17:56 <clayg> then there's this *other* beat that's all random just based on when stuff responds
21:18:06 <clayg> doing that all in one loop was madness - two loops is better
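To make the "two loops" idea concrete, here's a toy sketch of the shape clayg is describing -- not Swift's actual code, all names made up: one loop feeds backend requests on a predictable tick (all ndata up front, then one parity request per concurrency timeout), while a second loop consumes responses on their own unpredictable beat:

```python
# Toy polyrhythm: a predictable feeder loop plus a response-driven
# collector loop -- an illustration, NOT Swift's EC GET path.
import queue
import random
import threading
import time

NDATA, NPARITY = 4, 2
CONCURRENCY_TIMEOUT = 0.5      # seconds between extra parity requests

responses = queue.Queue()

def backend_get(frag_index):
    # stand-in for a fragment GET with unpredictable latency
    time.sleep(random.uniform(0.1, 2.0))
    responses.put(frag_index)

def feeder():
    # the predictable beat: start ndata requests immediately...
    for i in range(NDATA):
        threading.Thread(target=backend_get, args=(i,), daemon=True).start()
    # ...then trickle out parity requests on a steady tick
    for i in range(NDATA, NDATA + NPARITY):
        time.sleep(CONCURRENCY_TIMEOUT)
        threading.Thread(target=backend_get, args=(i,), daemon=True).start()

def collector():
    # the random beat: handle responses whenever they land
    got = []
    while len(got) < NDATA:    # any ndata fragments suffice to decode
        got.append(responses.get())
        print('got frag %d (%d/%d)' % (got[-1], len(got), NDATA))
    print('enough fragments to decode!')

threading.Thread(target=feeder, daemon=True).start()
collector()
```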
21:18:35 <clayg> and then I'm also proud of the logic simplifications that I'd managed to get so far - but it was super helpful to have another brain load it up and WTF at it
21:18:59 <clayg> some bits are still confusing - but I was desensitized
21:19:26 <timburke> so, anyone else have some bandwidth to take a look in the next week or two?
21:19:43 <clayg> you were talking about good buckets and bad buckets and 416 as success ... there might be more cleanup, but I'd need some sort of "bad behavior" to really motivate me to "get back in there"
21:20:00 <clayg> otherwise it's just pushing symbols around for subjective qualities
21:20:40 <alecuyer> I think I can at least try the patch next week, I should have more time than these past weeks
21:20:44 <clayg> I'm happy with the tradeoffs for "the non-durable problem" - I know we never did that zoom or w/e - but it's fine (i'm still happy to answer questions as needed)
21:20:51 <timburke> alecuyer, thank you
21:21:03 <clayg> the final patch - the "make timeouts configurable per replica"
21:21:28 <clayg> it's... a little "much" by some estimations. I like the expressiveness; but worry it's the same as the pipeline problem 😞
21:22:04 <clayg> the more people who would be willing to put on their operator hat and try to grok what those options even *mean* THE BETTER
21:22:32 <clayg> if I can explain it to YOU guys trivially there's no hope I'll ever be able to write clear docs
21:22:47 <alecuyer> ok
21:22:58 <timburke> would it be worth us documenting them as experimental/subject to change (or even removal)? just as soon as we get a chance to have a better option
21:23:54 <clayg> Then the last gotcha is the final fate of ECFragGetter - do we trim it down lean and mean and try to pull all remains of EC out of GETorHEADHandler - or is there still some hope we can unify them some way that's sane?
21:24:19 <clayg> It's really not clear to me; and I guess I'm avoiding trying to write code that I'm not confident is a good idea 👎
21:24:54 <clayg> timburke: I think that'd be reasonable if after everyone looks at them we have some different ideas about directions we might go; or the best we can come up with seems like a lot of work
21:25:23 <clayg> if we look at them and say "yeah, per policy; per replica - makes sense" then we just write that down with some common examples/patterns and move on
21:26:02 <clayg> so I think I really could use some feedback right now - from anyone who has cycles - you don't have to grok all the code; or even check out the change
21:26:27 <clayg> reading the example config in the final patch (based on the conversations we've been having) and providing feedback there is a great start
21:26:44 <timburke> one thing i'm definitely digging is the fact that it's available as a per-policy option immediately -- i think it'll be really handy for a lab cluster to be able to define, say, five EC policies that are identical except for those tunings
21:27:18 <mattoliverau> I'll make sure I have a look this week, especially big picture so I can comment on config options. If I have time I'll try and go a little deeper.
21:27:20 <clayg> if you can glance through some of the weird corners of how ECFragGetter is starting to diverge from GETorHEADHandler and give me a gut check "yeah probably different enough" or "these smell like minor differences" would also be helpful
21:27:44 <timburke> #link https://review.opendev.org/#/c/737096
21:27:44 <patchbot> patch 737096 - swift - Make concurrency timeout per policy and replica - 3 patch sets
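For readers following along: the flavor of config being debated is per-policy tuning of the proxy's concurrent-GET knobs. A hypothetical sketch of what that might look like (the option placement below is a guess based on the discussion, not necessarily what patch 737096 actually implements; per-policy proxy-server config sections do already exist in Swift):

```
[app:proxy-server]
use = egg:swift#proxy
concurrent_gets = true
concurrency_timeout = 0.5        # cluster-wide default

[proxy-server:policy:2]
# hypothetical per-policy override for an EC policy: feed extra
# fragment requests more aggressively than the default above
concurrency_timeout = 0.1
```

This is what makes timburke's lab-cluster idea attractive: several EC policies identical except for these tunings, compared side by side.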
21:28:06 <clayg> if you can grok the main ec-waterfall loops - even setting aside weird bucket stuff like shortfall and durable - that'd be like above and beyond
21:28:24 <timburke> (just realized i had a few comments i forgot to publish)
21:28:33 <clayg> at that point you may as well check it out and try to configure it - it probably wouldn't take much work to see that it DOES do what it says on the tin
21:29:02 <clayg> THANK YOU ALL SO MUCH!!! feedback is what I need - couldn't do this without y'all - GO TEAM!
21:29:52 <timburke> all right
21:29:57 <timburke> #topic open discussion
21:30:06 <timburke> what else do we need to bring up today?
21:30:25 <clayg> libec !!!
21:30:40 <clayg> so we definitely have the old style checksum frags...
21:30:59 <clayg> everyone run this on your EC .data files -> https://gist.github.com/clayg/df7c276a43c3618d7897ba50ae87ea9d
21:31:21 <clayg> if your hash matches zlib you're golden! upgrade to >1.16 and rejoice!
21:31:57 <clayg> if you've got those stupid stupid stupid libec inline checksums - you're probably gunna be stuck with them for awhile (and also don't upgrade - you'll die)
21:32:15 <alecuyer> thanks for that snippet clay
21:32:24 <clayg> well... you might die anyway because 1.15 was like... somehow indeterminate?
21:32:41 <timburke> well, until everything upgrades ;-)
21:32:45 <clayg> so we have to do better; but 1.16 isn't quite good enough
21:32:52 <clayg> alecuyer: timburke wrote it
21:33:31 <timburke> so, it seems like we need a way to tell libec to continue writing old frags
21:33:58 <mattoliverau> wow, nice useful bit of coding there!
21:34:00 <timburke> we also almost certainly need to write a detailed bug ;-)
21:34:18 <clayg> right, for us to upgrade we need new code to not start writing in the new format yet because old nodes won't know what to do with that shit until the upgrade
21:34:33 <clayg> I think we could probably just like... "upgrade libec real fast and restart everything!"
21:34:44 <clayg> but... we'll probably like "be careful" or whatever 🙄
21:35:33 <clayg> I'm gunna try to talk timburke into a *build* option that's just like "always write the old stupid inline crc value forever"
21:35:53 <clayg> that way I can just make it WOMM and then ship it and never think about how much i hate C again
21:36:01 <alecuyer> hehe
21:36:42 <clayg> but... it might not work - timburke is pretty sure we should actually use the zlib version as soon as we can - so an env var with a portable build might be "better"
21:37:21 <clayg> if by "better" you mean - you'd rather pay some operational cost to roll out with the latch; then after upgrade remove the latch; and then be on the path of righteousness forever fighting back the forces of kludge and evil
21:38:10 <timburke> yeah, i'm still fairly nervous that our funky crc doesn't have the same kinds of guarantees that we're expecting from zlib...
21:39:04 <clayg> timburke: well i'm not sure I can write controller code that can do the upgrade properly! like... we'd have to do a checkpoint release or something 🤮
21:39:25 <clayg> I'm sure I could build a package with an option that's like CLAYG_IS_TOO_LAZY_TO_DO_THIS_RIGHT=True
21:39:37 <timburke> side note on all of this: thank you alecuyer for noticing this and bringing it up last week! much better to be arguing about how best to navigate this now than mid-upgrade ;-)
21:39:43 <clayg> and then if at some point you're sure all legacy swiftstack customers have upgraded you turn that off ;)
21:40:02 <clayg> yeah FOR SURE - alecuyer is a godsend ❤️
21:40:49 <alecuyer> thanks but I wish i could do more - and be faster.. :) but thanks
21:40:50 <timburke> clayg, we could totally have a controller that always says "write legacy crcs" -- we've done controller checkpoint releases before; after the next one, we tell it to switch over
21:41:50 <clayg> yeah, again - if we can couple it to an upstream swift change that makes it just an ini/config var I'm totally down
21:42:17 <clayg> if I have to update our systemd units in the package to turn it on/off i'm pretty sure I'm too dumb to get it right
21:42:52 <clayg> alecuyer: kota_: mattoliverau: if you have access to any EC data in any cluster anywhere please see if you can get the crc thing to run on it and report back w/i a couple of weeks
21:43:10 <clayg> we'll probably kick this can down the road for awhile (1.15 is working fine for me!)
21:43:25 <clayg> that's all I got on libec
21:43:31 <alecuyer> yep, will do
21:43:59 <timburke> anybody think they'll have a chance to look at https://review.opendev.org/#/c/737856/ ? it seems to be working well for ormandj
21:44:00 <patchbot> patch 737856 - swift - py3: Stop munging RAW_PATH_INFO - 2 patch sets
21:44:28 <timburke> i'd like to backport it to ussuri and train, then ideally do a round of releases
21:45:09 <clayg> timburke: oh neat - is the test new?
21:45:14 <timburke> (probably should have done that earlier, but now that there's an additional known issue...)
21:45:16 <timburke> yeah
21:45:42 <clayg> b"GET /oh\xffboy%what$now%E2%80%bd HTTP/1.0\r\n" 🤣
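Those test bytes are less silly than they look: under PEP 3333, a py3 WSGI server hands the app the request path as a str decoded as latin-1, so every raw byte should survive a round trip back through .encode('latin-1') -- which is exactly what munging the path breaks. A minimal illustration of the invariant (not the patch itself):

```python
# PEP 3333: py3 WSGI environ strings are latin-1-decoded bytes, so the
# raw request path should round-trip losslessly if nobody munges it.
raw = b'/oh\xffboy%what$now%E2%80%bd'       # path from the test's request line
path_info = raw.decode('latin-1')           # what the WSGI server hands the app
assert path_info.encode('latin-1') == raw   # recovers the raw bytes intact
print('round-trips cleanly')
```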
21:46:41 <timburke> oh, and as a heads-up: it looks like there might be some movement on an eventlet fix for that py37/ssl bug
21:46:43 <timburke> #link https://github.com/eventlet/eventlet/pull/621
21:47:26 <alecuyer> sounds good!
21:49:27 <timburke> all right, last call
21:50:27 <timburke> thank you all for coming, and thank you for working on swift!
21:50:33 <timburke> #endmeeting