21:00:02 <timburke> #startmeeting swift
21:00:03 <openstack> Meeting started Wed Jun 24 21:00:02 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:04 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:06 <openstack> The meeting name has been set to 'swift'
21:00:10 <timburke> who's here for the swift meeting?
21:00:19 <seongsoocho> o/
21:00:32 <alecuyer> o/
21:01:03 <clayg> o/
21:01:52 <zaitcev> o7
21:02:10 <timburke> agenda's at https://wiki.openstack.org/wiki/Meetings/Swift -- i don't have much to bring up, so it may be a fairly short meeting ;-)
21:02:37 <timburke> in fact, it's just updates
21:02:40 <timburke> #topic replication network and background daemons
21:03:03 <timburke> after the feedback last week, i updated https://review.opendev.org/#/c/735751/ to touch more daemons
21:03:03 <patchbot> patch 735751 - swift - Allow direct and internal clients to use the repli... - 2 patch sets
21:04:22 <timburke> there were a couple (reconciler and expirer) where i still provided an option of using the client-data-path network
21:04:25 <clayg> is that a filter xprofile? what is even going on with this change set?! 🤯
21:04:50 <clayg> oh, that's unrelated - use_legacy_network 👍
21:05:07 <clayg> timburke: so is that one ready to go then!? 🤗
21:05:13 <timburke> wait, what about xprofile?
21:05:20 <timburke> yeah, pretty sure it's good to go
21:05:31 <clayg> i just clicked on the file with the biggest diff
21:06:35 <timburke> my logic for whether to provide an option was basically: if it walks disks (which presumably would be exposed on the replication network), don't bother to provide an option. if it just gets work by querying swift, include it
21:07:23 <timburke> oh, yeah -- i was just moving the filter to what seemed like a more natural location. i could back that out; it's unrelated
21:07:39 <clayg> dude, this change looks good
21:08:32 <clayg> i hate the option, i'm sure all of these places would have used the replication network from the get go if these interfaces already existed
21:09:29 <timburke> does anybody know of deployments where a node that might be running reconciler or expirer *wouldn't* have access to the replication network? maybe i don't even need to include the option
21:09:40 <clayg> but the way you did it it's super clear; not a blocker for me
21:09:44 <alecuyer> at least we don't have that
21:10:04 <alecuyer> (i mean they do have access to the replication network)
21:10:24 <clayg> well the reconciler is interesting because it uses internal client and direct client - it's not clear to me that the direct client requests honor the config option?
21:11:22 <clayg> i haven't looked at this since you pushed it up last thursday - i thought it was still wip 🤷♂️
21:11:28 <clayg> I'm sure it'll be great
21:11:31 <timburke> heh -- whoops. yeah, they don't atm -- feel free to leave a -1
21:11:54 <timburke> anyway, i just wanted to raise attention on it
21:11:57 <clayg> -1 let's just drop the option!? 😍
21:12:17 <clayg> or do you mean -1 if we have the option it has to work and be tested and 🤮
21:12:36 <timburke> either one, up to you ;-)
21:12:44 <timburke> #topic waterfall EC
21:13:03 <clayg> i totally get the argument "if ONE deployment wants the option we'll wish we had that code ready to go" 🤷♂️
21:13:06 <mattoliverau> Sorry I'm late, slept in a bit o/
21:13:11 <timburke> clayg, i saw you get some more patchsets up; how's it going?
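(Going back to the first topic for a moment: purely for orientation, the knob being debated for the daemons that only get work by querying swift -- the reconciler and expirer -- would look something like the sketch below. The option name is the use_legacy_network flag clayg spotted in the diff; both the spelling and the default are placeholders and may change before the patch merges.)

    [container-reconciler]
    # hypothetical sketch: keep internal/direct-client requests on the
    # client-data-path network instead of the replication network
    use_legacy_network = false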
21:13:14 <clayg> i just don't know if such a deployment exists and tend to YAGNI
21:13:24 <clayg> timburke: it's so great, i'm really happy
21:13:29 <timburke> mattoliverau, no worries -- we'll probably let you get breakfast before long ;-)
21:14:07 <alecuyer> thanks for the ping about the concurrent frag fetcher clay. Have started reading but I think I need to read more before I start asking questions :)
21:14:23 <clayg> ok, great, np!
21:14:56 <timburke> you also wrote up an etherpad about non-durables
21:14:58 <timburke> #link https://etherpad.opendev.org/p/the-non-durable-problem
21:15:32 <alecuyer> didn't catch that, thanks
21:16:41 <clayg> the implication was some sort of dumb overly complicated looking code to define/override what's the "shortfall" for a non-durable bucket
21:17:15 <clayg> but it makes a clean way to put non-durable responses into a gradient of "probably fine" until either we get a durable response or get down to ~parity outstanding requests
21:18:12 <timburke> are there any open questions you'd like us to discuss, clayg? or should we just get on with reading the patches?
21:18:13 <clayg> it could be debated/researched/proved where on the slider is "most correct" - but I think it's a good sliding scale, so the code is "correct" in some sense
21:18:24 <clayg> uh, I'd be happy to answer questions
21:18:51 <clayg> but I would like some "early" feedback on gross over-expressiveness of https://review.opendev.org/#/c/737096/
21:18:52 <patchbot> patch 737096 - swift - Make concurrency timeout per policy and replica - 3 patch sets
21:19:09 <clayg> having concurrent_gets/timeout be *per policy* is obviously a move in a positive direction
21:19:58 <clayg> but crazy implementation of alecuyer's idea that waterfall-ec should be able to express "start ndata+N concurrent requests before feeding in the remaining primaries"
21:20:09 <clayg> ... could probably be expressed different ways than how I wrote it up
21:21:07 <clayg> the underlying structure (the per-primary-count timeout) might be a good implementation; but a smart group of folks like we have here may have some better ways to configure it
21:21:25 <clayg> i.e. is there a "do_what_i_want = true" option that would be better
21:21:55 <clayg> ... than the `concurrency_timeout<N>` that I plumbed through
21:22:38 <clayg> I mean it's great! I'm happy with it - it's *completely* sufficient for anything I might want to test in a lab
21:22:55 <clayg> ... and I'm sure different clusters would have good reasons to want to do different things; so it doesn't bother me to have it exposed
21:23:14 <kota_> morning
21:23:16 <alecuyer> We'll have to test it - still swamped with romain about unrelated things but I hope we can test it soon
21:23:33 <clayg> oh wow, that'd be cool!
21:23:43 <timburke> sounds like i need to find time to review p 711342 and p 737096 this week :-)
21:23:44 <patchbot> https://review.opendev.org/#/c/711342/ - swift - Add concurrent_gets to EC GET requests - 12 patch sets
21:23:46 <patchbot> https://review.opendev.org/#/c/737096/ - swift - Make concurrency timeout per policy and replica - 3 patch sets
21:23:49 <clayg> I'll probably be testing it in our lab like... in another week or two?
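(To make the per-policy, per-replica idea concrete: the knobs clayg describes might end up looking roughly like the sketch below in proxy-server.conf. The per-policy section syntax already exists in Swift; whether concurrent_gets/concurrency_timeout are honoured there, and the concurrency_timeout<N> per-replica spelling, are exactly what patch 737096 is still working out, so treat all of this as illustrative.)

    [proxy-server:policy:1]
    # hypothetical sketch for an EC policy
    concurrent_gets = true
    # how long to wait before feeding the next primary into the request pool
    concurrency_timeout = 0.5
    # hypothetical per-replica overrides from the WIP patch: zero timeouts
    # for the first couple of extra requests approximate "start ndata+2
    # concurrent requests up front"
    concurrency_timeout1 = 0
    concurrency_timeout2 = 0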
21:23:56 <alecuyer> I think we want to use EC more and we have things like 12+3 so im quite sure it would help :)
21:24:20 <kota_> alarm didn't ring
21:25:11 <timburke> alecuyer, oh for sure -- we're already grumpy with 8+4; needing to get another 4 live connections would surely make our problems worse ;-)
21:25:23 <timburke> that's it for the agenda
21:25:30 <timburke> #topic open discussion
21:25:41 <timburke> anything else we should talk about this week?
21:25:53 <alecuyer> I want to mention quickly an issue we found with romain during some tests,
21:26:15 <clayg> 😱
21:26:17 <alecuyer> the test setup had a ubuntu 20.04 proxy, and 18.04 object servers
21:26:42 <alecuyer> we used an EC policy, and were unable to reread data,
21:26:47 <timburke> py2 still, or trying out py3?
21:26:49 <timburke> eep!
21:26:54 <alecuyer> still py2 sorry :/
21:27:11 <timburke> cool, just making sure 👍
21:27:16 <alecuyer> it _seems_ liberasurecode has a problem in ubuntu between these two versions
21:27:22 <clayg> i blame 20.04 - that focal 😠
21:27:38 <timburke> which versions of liberasurecode?
21:27:39 <alecuyer> the object server will do get_metadata (i think, sorry i don't recall the function name)
21:27:42 <clayg> wasn't there a thing with CRC libs that was $%^&ing INSANE (cc zaitcev)
21:27:56 <timburke> were they coming from ubuntu's repos?
21:28:00 <alecuyer> 1.6.1 vs 1.5.0 i think
21:28:01 <alecuyer> yes
21:28:06 <alecuyer> recompiling appeared to fix it,
21:28:15 <zaitcev> timburke fixed most of it, I only looked at that. I think you're thinking about zlib.
21:28:19 <alecuyer> and it seems to be related to linker flags (??) if i used ubuntu flags it broke again
21:28:22 <timburke> https://github.com/openstack/liberasurecode/commit/a9b20ae6a
21:28:23 <alecuyer> but well
21:28:36 <clayg> LINKER FLAGS
21:28:51 <alecuyer> just wanted to say we saw that, and maybe i can say more outside the meeting, or once i have tested this properly
21:28:57 <alecuyer> (this was not the goal of the test erm..)
21:29:26 <zaitcev> Yes. It was related to linking order. If system zlib got routines ahead of liberasurecode, then they're used instead of ours.
21:29:26 <clayg> to find bugs is always the goal of testing - you are a winner
21:29:39 <alecuyer> zaitcev: thanks, i didn't figure it out
21:30:45 <zaitcev> alecuyer: so, does this mean that the fix we committed (linked above) was incorrect?
21:31:03 <clayg> so i guess probably someone would be glad if we could produce an actionable bug for ubuntu 20.04's liberasurecode package
21:31:04 <zaitcev> I reviewed it and it seemed watertight to me, buuuuut
21:31:17 <alecuyer> Sorry to say im not sure, haven't had time to dig further, but I'll check it and post detailed version and a test script we have to check outside of swift
21:31:32 <clayg> ... but "always compile/distribute your own liberasurecode" also seems like reasonable advice 🤔
21:31:32 <timburke> i think the issue must be new frags getting read by old libec :-(
21:31:46 <zaitcev> ah, okay
21:31:47 <clayg> trolorlololo
21:32:23 <zaitcev> No, wait. What if you have a cluster that's halfway updated and rsync moves fragment archives from new to old
21:32:31 <timburke> i think if it was the other way around, with old proxy talking to new object, it'd probably be fine? until the reconstructor actually had work to do :-(
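(The standalone check alecuyer promises to post presumably looks something like the sketch below -- the names, parameters, and backend choice here are guesses, not his actual script. The idea: encode with pyeclib on the host running the newer liberasurecode, copy a fragment to a host running the older one, and see whether the metadata/decode step still accepts it.)

    # Hypothetical round-trip check for mixed liberasurecode versions.
    # Run the encode step on the host with the newer library, ship the
    # fragments to the host with the older library, and run the verify
    # step there.
    from pyeclib.ec_iface import ECDriver

    driver = ECDriver(k=12, m=3, ec_type='liberasurecode_rs_vand')

    data = b'\xde\xad\xbe\xef' * 256
    # each fragment carries a small liberasurecode header, including a checksum
    frags = driver.encode(data)

    # verify step: an older library that resolves a different crc32 symbol
    # may refuse fragments written by the newer one
    for i, frag in enumerate(frags):
        print(i, driver.get_metadata(frag, formatted=1))
    assert driver.decode(frags) == data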
21:32:32 <clayg> timburke: but it's only: new ec compiled with "wrong" flags breaks
21:34:02 <timburke> maybe we could offer an env var or something to have the new lib write old crcs (that could then still be read by old code)?
21:34:19 <timburke> and once everything's upgraded, take out the env flag
21:34:23 <clayg> alecuyer: can you share the werx and borked flags?
21:34:43 <alecuyer> yes I will send it on #openstack-swift and etherpad
21:34:52 <clayg> FABULOUS!!!
21:34:55 <clayg> alecuyer: well done
21:35:08 <timburke> so last night i ran down an s3api issue ormandj saw upgrading to py3
21:35:13 <timburke> #link https://bugs.launchpad.net/swift/+bug/1884991
21:35:13 <openstack> Launchpad bug 1884991 in OpenStack Object Storage (swift) "s3api on py3 doesn't use the bytes-on-the-wire when calculating string to sign" [Undecided,In progress]
21:35:41 <timburke> just a heads-up -- the long tail of py3 bugs continues
21:35:57 <clayg> glad someone is testing py3 😬
21:36:22 <zaitcev> They have no choice. We, for example, aren't shipping py26 anymore, at all.
21:36:43 <clayg> zaitcev: you're a hero!!! 🤗
21:36:44 <timburke> have a fix at https://review.opendev.org/#/c/737856/, but it needs tests
21:36:44 <patchbot> patch 737856 - swift - py3: Stop munging RAW_PATH_INFO - 1 patch set
21:37:12 <clayg> i'm very anti munging - +1 on principle alone
21:38:06 <timburke> *especially* when it's for something with "raw" in the name :P
21:38:32 <clayg> timburke: ❤️
21:39:00 <clayg> i'm not having a great time with s3 tests in https://review.opendev.org/#/c/735738/ right now
21:39:00 <patchbot> patch 735738 - swift - s3api: Don't do naive HEAD request for auth - 1 patch set
21:39:42 <zaitcev> BTW, I have the same 2 things on my plate: Dark Data with dsariel and account server crashing. Not much change on either... I know Tim looked at the account server thing, but I'm going to write a probe test that solidly reproduces it.
21:40:14 <clayg> seongsoocho: how are things going for you?
21:40:50 <timburke> zaitcev, oh, yeah yeah -- p 704435 -- i've got a head-start on a probe test for you at p 737117
21:40:50 <patchbot> https://review.opendev.org/#/c/704435/ - swift - Mark a container reported if account was reclaimed - 2 patch sets
21:40:52 <patchbot> https://review.opendev.org/#/c/737117/ - swift - probe: Explore reaping with async pendings - 1 patch set
21:41:22 <seongsoocho> clayg: just have a normal day. everything's good
21:42:07 <timburke> something i've learned the last few months: every day your cluster isn't on fire is a good day :D
21:42:45 <alecuyer> definitely :)
21:43:21 <zaitcev> timburke: thank you
21:44:04 <timburke> zaitcev, how is the watcher going? should i find time to look at it again soon, or wait a bit?
21:44:46 <zaitcev> timburke: wait a bit please. We're putting together a switch for selected action.
21:44:56 <timburke> 👍
21:45:00 <zaitcev> basically
21:45:19 <zaitcev> Sam's design didn't allow for configuration options specific for watchers.
21:45:41 <zaitcev> So, there's no way to express "The DD watcher should do X"
21:46:05 <zaitcev> David wants something crammed into the paste line
21:46:58 <zaitcev> so it can turn a little unwieldy, like watchers=watcher_a,watcher_b#do_this=yes,watcher_c
21:47:06 <zaitcev> I'll let you know.
21:47:06 <clayg> 🤯
21:47:26 <timburke> makes me think of the config changes sam was thinking about in p 504472 ...
21:47:26 <patchbot> https://review.opendev.org/#/c/504472/ - swift - Shorten typical proxy pipeline. - 4 patch sets
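(Back on the RAW_PATH_INFO fix mentioned above: one way to see the class of problem is that any unquote/re-quote round trip can normalize percent-encodings, so the proxy may compute the string to sign over different bytes than the client actually sent and signed. A toy illustration with a made-up path -- the real bug lives in the py3 WSGI plumbing rather than this exact call:)

    from urllib.parse import quote, unquote

    on_the_wire = '/bucket/obj%7Ename'     # the path bytes the client signed
    munged = quote(unquote(on_the_wire))   # what a re-quoting proxy sees
    print(munged)                          # '/bucket/obj~name' -- signature mismatch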
21:48:31 <timburke> i feel like it should be fair for the DD watcher to claim the dark_data_* config namespace within the object-auditor
21:48:33 <zaitcev> I'd prefer something like [object-auditor:watcher_b] \n do_this=yes
21:48:37 <zaitcev> But dunno
21:48:43 <zaitcev> Seems like overkill.
21:48:56 <timburke> or that! also seems good :-)
21:49:08 <zaitcev> ok
21:50:33 <clayg> noice
21:50:33 <timburke> all right, let's let kota_, mattoliverau, and seongsoocho get on with their day ;-)
21:50:46 <timburke> thank you all for coming, and thank you for working on swift!
21:50:50 <timburke> #endmeeting
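(For reference, the two shapes floated above for per-watcher options would look roughly like the sketches below; watcher_a/watcher_b and the do_this option are just the placeholders used in the meeting, not real watchers or settings.)

    # everything crammed into the watchers= line, as David suggests:
    [object-auditor]
    watchers = watcher_a, watcher_b#do_this=yes, watcher_c

    # vs. zaitcev's preference for a dedicated section per watcher:
    [object-auditor]
    watchers = watcher_a, watcher_b, watcher_c

    [object-auditor:watcher_b]
    do_this = yes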