21:00:06 <timburke> #startmeeting swift
21:00:07 <openstack> Meeting started Wed Jul 1 21:00:06 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:10 <openstack> The meeting name has been set to 'swift'
21:00:15 <timburke> who's here for the swift meeting?
21:00:28 <kota_> o/
21:00:29 <alecuyer> o/
21:00:40 <seongsoocho> o/
21:00:59 <clayg> howdy!
21:02:02 <timburke> agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:07 <timburke> first up
21:02:11 <timburke> #topic Berlin
21:02:25 <timburke> the Request for Presentations recently opened up
21:02:27 <timburke> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-July/015730.html
21:02:52 <timburke> deadline's Aug 4
21:03:19 <timburke> summit will be Oct 19-23
21:03:48 <timburke> (assuming the pandemic is more or less under control, so... who knows?)
21:04:03 <kota_> :/
21:04:20 <timburke> but i wanted to at least bring the call for presentations to everyone's attention
21:05:04 <timburke> that's all i've got in terms of announcements
21:05:22 <timburke> #topic replication servers
21:05:54 <timburke> so clayg made a very important observation on p 735751
21:05:55 <patchbot> https://review.opendev.org/#/c/735751/ - swift - Allow direct and internal clients to use the repli... - 4 patch sets
21:06:09 <clayg> go team!
21:06:21 <timburke> the replication servers currently only respond to SSYNC and REPLICATE requests!
21:06:33 <timburke> #link https://bugs.launchpad.net/swift/+bug/1446873
21:06:33 <openstack> Launchpad bug 1446873 in OpenStack Object Storage (swift) "ssync doesn't work with replication_server = true" [Medium,Confirmed]
21:07:13 <timburke> it won't even respond to OPTIONS, which is supposed to tell you what request methods *are* allowed
21:07:43 <clayg> ^ that's like @timburke 's favorite joke 🤣
21:08:28 <mattoliverau> Sorry I'm late, woke up this morning and had IRC troubles, but finally connected.
21:08:40 <timburke> it looks like it won't be too hard to allow replication servers to service all requests, and make `replication_server = False` basically mean "don't do SSYNC or REPLICATE"
21:08:59 <timburke> does that seem like a reasonable thing to everyone, though?
21:10:22 <timburke> i feel like the work involved in making it so i can properly *test* with separate replication servers is going to be way bigger than making it work :-/
21:12:24 <clayg> I think the conclusion on the lp bug was to drop the semantic of `replication_server = None` - which I think is what you HAVE to run on your replication servers now if you want to do a replication network and have EC rebuild still work
21:13:05 <timburke> seems related to https://review.opendev.org/#/c/337861/ 🤔
21:13:05 <patchbot> patch 337861 - swift - Permit to bind object-server on replication_port - 7 patch sets
21:13:32 <clayg> yeah probably 😬
21:13:50 <clayg> i feel like if this was easy we wouldn't have put it off so long
21:14:08 <clayg> it would be helpful to have sanity checks as we try to merge shit
21:15:24 <timburke> it's probably worth me putting in some work to make probe tests runnable with separate replication servers regardless. seems like a lot of people run that way (we certainly do!) and anything i can do to make my dev environment more representative of prod seems like a good idea
21:15:51 <timburke> but, it'll probably be a bit
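For context, a rough sketch of the setup under discussion. A dedicated replication-network instance of the object-server is typically configured with `replication_server = true` (the addresses and port below are illustrative, not prescriptive), and per the bug above, such an instance currently refuses everything except REPLICATE/SSYNC -- including the OPTIONS request that would normally report which methods are allowed:

```
# object-server config for a replication-network instance (illustrative)
[DEFAULT]
bind_ip = 10.0.1.1     # hypothetical replication-network address
bind_port = 6200       # adjust for your deployment

[app:object-server]
use = egg:swift#object
replication_server = true   # today: serve ONLY REPLICATE/SSYNC
```

And a quick way to see the joke for yourself, assuming a node at the hypothetical address above (stdlib only, no Swift imports):

```python
# Probe a backend server's advertised methods; per the bug discussed
# above, a replication_server = true instance 405s even this OPTIONS.
import http.client

conn = http.client.HTTPConnection('10.0.1.1', 6200)  # hypothetical node
conn.request('OPTIONS', '/')
resp = conn.getresponse()
print(resp.status, resp.reason, resp.getheader('Allow'))
```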
21:15:58 <timburke> #topic waterfall EC
21:16:05 <timburke> clayg, how's it going?
21:16:06 <clayg> timburke: it's a non-zero cost to have to run a second set of backend *-servers, but could be nice
21:16:18 <clayg> timburke: you're the only one who's commented 😠
21:16:35 <timburke> hehe
21:16:43 <timburke> fair enough
21:17:09 <clayg> i'm really proud of my little feeder system tho - I've started to conceptualize breaking them up as like a polyrhythm sort of situation
21:17:46 <clayg> where we have one flow of code that's popping ticks predictably (start ndata, then remaining nparity at a predictable pattern)
21:17:56 <clayg> then there's this *other* beat that's all random just based on when stuff responds
21:18:06 <clayg> doing that all in one loop was madness - two loops is better
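To make the "two loops" idea concrete, here's a toy sketch of the shape clayg is describing -- not Swift's actual code, all names made up: one loop feeds backend requests on a predictable tick (all ndata up front, then one parity request per concurrency timeout), while a second loop consumes responses on their own unpredictable beat:

```python
# Toy polyrhythm: a predictable feeder loop plus a response-driven
# collector loop -- an illustration, NOT Swift's EC GET path.
import queue
import random
import threading
import time

NDATA, NPARITY = 4, 2
CONCURRENCY_TIMEOUT = 0.5      # seconds between extra parity requests

responses = queue.Queue()

def backend_get(frag_index):
    # stand-in for a fragment GET with unpredictable latency
    time.sleep(random.uniform(0.1, 2.0))
    responses.put(frag_index)

def feeder():
    # the predictable beat: start ndata requests immediately...
    for i in range(NDATA):
        threading.Thread(target=backend_get, args=(i,), daemon=True).start()
    # ...then trickle out parity requests on a steady tick
    for i in range(NDATA, NDATA + NPARITY):
        time.sleep(CONCURRENCY_TIMEOUT)
        threading.Thread(target=backend_get, args=(i,), daemon=True).start()

def collector():
    # the random beat: handle responses whenever they land
    got = []
    while len(got) < NDATA:    # any ndata fragments suffice to decode
        got.append(responses.get())
        print('got frag %d (%d/%d)' % (got[-1], len(got), NDATA))
    print('enough fragments to decode!')

threading.Thread(target=feeder, daemon=True).start()
collector()
```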
21:18:35 <clayg> and then I'm also proud of the logic simplifications that I'd managed to get so far - but it was super helpful to have another brain load it up and WTF at it
21:18:59 <clayg> some bits are still confusing - but I was desensitized
21:19:26 <timburke> so, anyone else have some bandwidth to take a look in the next week or two?
21:19:43 <clayg> you were talking about good buckets and bad buckets and 416 as success ... there might be more cleanup, but I'd need some sort of "bad behavior" to really motivate me to "get back in there"
21:20:00 <clayg> otherwise it's just pushing symbols around for subjective qualities
21:20:40 <alecuyer> I think I can at least try the patch next week, I should have more time than these past weeks
21:20:44 <clayg> I'm happy with the tradeoffs for "the non-durable problem" - I know we never did that zoom or w/e - but it's fine (i'm still happy to answer questions as needed)
21:20:51 <timburke> alecuyer, thank you
21:21:03 <clayg> the final patch - the "make timeouts configurable per replica"
21:21:28 <clayg> it's... a little "much" by some estimations. I like the expressiveness; but worry it's the same as the pipeline problem 😞
21:22:04 <clayg> the more people who would be willing to put on their operator hat and try to grok what those options even *mean* THE BETTER
21:22:32 <clayg> if I can explain it to YOU guys trivially there's no hope I'll ever be able to write clear docs
21:22:47 <alecuyer> ok
21:22:58 <timburke> would it be worth us documenting them as experimental/subject to change (or even removal)? just as soon as we get a chance to have a better option
21:23:54 <clayg> Then the last gotcha is the final fate of ECFragGetter - do we trim it down lean and mean and try to pull all remains of EC out of GETorHEADHandler - or is there still some hope we can unify them some way that's sane?
21:24:19 <clayg> It's really not clear to me; and I guess I'm avoiding trying to write code that I'm not confident is a good idea 👎
21:24:54 <clayg> timburke: I think that'd be reasonable if after everyone looks at them we have some different ideas about directions we might go; or the best we can come up with seems like a lot of work
21:25:23 <clayg> if we look at them and say "yeah, per policy; per replica - makes sense" then we just write that down with some common examples/patterns and move on
21:26:02 <clayg> so I think I really could use some feedback right now - from anyone who has cycles - you don't have to grok all the code; or even check out the change
21:26:27 <clayg> reading the example config in the final patch (based on the conversations we've been having) and providing feedback there is a great start
21:26:44 <timburke> one thing i'm definitely digging is the fact that it's available as a per-policy option immediately -- i think it'll be really handy for a lab cluster to be able to define, say, five EC policies that are identical except for those tunings
21:27:18 <mattoliverau> I'll make sure I have a look this week, especially big picture so I can comment on config options. If I have time I'll try and go a little deeper.
21:27:20 <clayg> if you can glance through some of the weird corners of how ECFragGetter is starting to diverge from GETorHEADHandler and give me a gut check "yeah probably different enough" or "these smell like minor differences" would also be helpful
21:27:44 <timburke> #link https://review.opendev.org/#/c/737096
21:27:44 <patchbot> patch 737096 - swift - Make concurrency timeout per policy and replica - 3 patch sets
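For readers following along: the flavor of config being debated is per-policy tuning of the proxy's concurrent-GET knobs. A hypothetical sketch of what that might look like (the option placement below is a guess based on the discussion, not necessarily what patch 737096 actually implements; per-policy proxy-server config sections do already exist in Swift):

```
[app:proxy-server]
use = egg:swift#proxy
concurrent_gets = true
concurrency_timeout = 0.5        # cluster-wide default

[proxy-server:policy:2]
# hypothetical per-policy override for an EC policy: feed extra
# fragment requests more aggressively than the default above
concurrency_timeout = 0.1
```

This is what makes timburke's lab-cluster idea attractive: several EC policies identical except for these tunings, compared side by side.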
21:28:06 <clayg> if you can grok the main ec-waterfall loops - even setting aside weird bucket stuff like shortfall and durable - that'd be like above and beyond
21:28:24 <timburke> (just realized i had a few comments i forgot to publish)
21:28:33 <clayg> at that point you may as well check it out and try to configure it - it probably wouldn't take much work to see that it DOES do what it says on the tin
21:29:02 <clayg> THANK YOU ALL SO MUCH!!! feedback is what I need - couldn't do this without y'all - GO TEAM!
21:29:52 <timburke> all right
21:29:57 <timburke> #topic open discussion
21:30:06 <timburke> what else do we need to bring up today?
21:30:25 <clayg> libec !!!
21:30:40 <clayg> so we definitely have the old style checksum frags...
21:30:59 <clayg> everyone run this on your EC .data files -> https://gist.github.com/clayg/df7c276a43c3618d7897ba50ae87ea9d
21:31:21 <clayg> if your hash matches zlib you're golden! upgrade to >1.16 and rejoice!
21:31:57 <clayg> if you've got those stupid stupid stupid libec inline checksums - you're probably gunna be stuck with them for awhile (and also don't upgrade - you'll die)
21:32:15 <alecuyer> thanks for that snippet clay
21:32:24 <clayg> well... you might die anyway because 1.15 was like... somehow indeterminate?
21:32:41 <timburke> well, until everything upgrades ;-)
21:32:45 <clayg> so we have to do better; but 1.16 isn't quite good enough
21:32:52 <clayg> alecuyer: timburke wrote it
21:33:31 <timburke> so, it seems like we need a way to tell libec to continue writing old frags
21:33:58 <mattoliverau> wow, nice useful bit of coding there!
21:34:00 <timburke> we also almost certainly need to write a detailed bug ;-)
21:34:18 <clayg> right, for us to upgrade we need new code to not start writing in the new format yet because old nodes won't know what to do with that shit until the upgrade
21:34:33 <clayg> I think we could probably just like... "upgrade libec real fast and restart everything!"
21:34:44 <clayg> but... we'll probably like "be careful" or whatever 🙄
21:35:33 <clayg> I'm gunna try to talk timburke into a *build* option that's just like "always write the old stupid inline crc value forever"
21:35:53 <clayg> that way I can just make it WOMM and then ship it and never think about how much i hate C again
21:36:01 <alecuyer> hehe
21:36:42 <clayg> but... it might not work - timburke is pretty sure we should actually use the zlib version as soon as we can - so an env var with a portable build might be "better"
21:37:21 <clayg> if by "better" you mean - you'd rather pay some operational cost to roll out with the latch; then after upgrade remove the latch; and then be on the path of righteousness forever fighting back the forces of kludge and evil
21:38:10 <timburke> yeah, i'm still fairly nervous that our funky crc doesn't have the same kinds of guarantees that we're expecting from zlib...
21:39:04 <clayg> timburke: well i'm not sure I can write controller code that can do the upgrade properly! like... we'd have to do a checkpoint release or something 🤮
21:39:25 <clayg> I'm sure I could build a package with an option that's like CLAYG_IS_TOO_LAZY_TO_DO_THIS_RIGHT=True
21:39:37 <timburke> side note on all of this: thank you alecuyer for noticing this and bringing it up last week! much better to be arguing about how best to navigate this now than mid-upgrade ;-)
21:39:43 <clayg> and then if at some point you're sure all legacy swiftstack customers have upgraded you turn that off ;)
21:40:02 <clayg> yeah FOR SURE - alecuyer is a godsend ❤️
21:40:49 <alecuyer> thanks but I wish i could do more - and be faster.. :) but thanks
21:40:50 <timburke> clayg, we could totally have a controller that always says "write legacy crcs" -- we've done controller checkpoint releases before; after the next one, we tell it to switch over
21:41:50 <clayg> yeah, again - if we can couple it to an upstream swift change that makes it just an ini/config var I'm totally down
21:42:17 <clayg> if I have to update our systemd units in the package to turn it on/off i'm pretty sure I'm too dumb to get it right
21:42:52 <clayg> alecuyer: kota_: mattoliverau: if you have access to any EC data in any cluster anywhere please see if you can get the crc thing to run on it and report back w/i a couple of weeks
21:43:10 <clayg> we'll probably kick this can down the road for awhile (1.15 is working fine for me!)
21:43:25 <clayg> that's all I got on libec
21:43:31 <alecuyer> yep, will do
21:43:59 <timburke> anybody think they'll have a chance to look at https://review.opendev.org/#/c/737856/ ? it seems to be working well for ormandj
21:44:00 <patchbot> patch 737856 - swift - py3: Stop munging RAW_PATH_INFO - 2 patch sets
21:44:28 <timburke> i'd like to backport it to ussuri and train, then ideally do a round of releases
21:45:09 <clayg> timburke: oh neat - is the test new?
21:45:14 <timburke> (probably should have done that earlier, but now that there's an additional known issue...)
21:45:16 <timburke> yeah
21:45:42 <clayg> b"GET /oh\xffboy%what$now%E2%80%bd HTTP/1.0\r\n" 🤣
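Those test bytes are less silly than they look: under PEP 3333, a py3 WSGI server hands the app the request path as a str decoded as latin-1, so every raw byte should survive a round trip back through .encode('latin-1') -- which is exactly what munging the path breaks. A minimal illustration of the invariant (not the patch itself):

```python
# PEP 3333: py3 WSGI environ strings are latin-1-decoded bytes, so the
# raw request path should round-trip losslessly if nobody munges it.
raw = b'/oh\xffboy%what$now%E2%80%bd'       # path from the test's request line
path_info = raw.decode('latin-1')           # what the WSGI server hands the app
assert path_info.encode('latin-1') == raw   # recovers the raw bytes intact
print('round-trips cleanly')
```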
21:46:41 <timburke> oh, and as a heads-up: it looks like there might be some movement on an eventlet fix for that py37/ssl bug
21:46:43 <timburke> #link https://github.com/eventlet/eventlet/pull/621
21:47:26 <alecuyer> sounds good!
21:49:27 <timburke> all right, last call
21:50:27 <timburke> thank you all for coming, and thank you for working on swift!
21:50:33 <timburke> #endmeeting