21:00:02 <timburke> #startmeeting swift
21:00:03 <openstack> Meeting started Wed Jun 24 21:00:02 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:04 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:06 <openstack> The meeting name has been set to 'swift'
21:00:10 <timburke> who's here for the swift meeting?
21:00:19 <seongsoocho> o/
21:00:32 <alecuyer> o/
21:01:03 <clayg> o/
21:01:52 <zaitcev> o7
21:02:10 <timburke> agenda's at https://wiki.openstack.org/wiki/Meetings/Swift -- i don't have much to bring up, so it may be a fairly short meeting ;-)
21:02:37 <timburke> in fact, it's just updates
21:02:40 <timburke> #topic replication network and background daemons
21:03:03 <timburke> after the feedback last week, i updated https://review.opendev.org/#/c/735751/ to touch more daemons
21:03:03 <patchbot> patch 735751 - swift - Allow direct and internal clients to use the repli... - 2 patch sets
21:04:22 <timburke> there were a couple (reconciler and expirer) where i still provided an option of using the client-data-path network
21:04:25 <clayg> is that a filter xprofile? what is even going on with this change set?! 🤯
21:04:50 <clayg> oh, that's unrelated - use_legacy_network 👍
21:05:07 <clayg> timburke: so is that one ready to go then!? 🤗
21:05:13 <timburke> wait, what about xprofile?
21:05:20 <timburke> yeah, pretty sure it's good to go
21:05:31 <clayg> i just clicked on the file with the biggest diff
21:06:35 <timburke> my logic for whether to provide an option was basically: if it walks disks (which presumably would be exposed on the replication network), don't bother to provide an option. if it just gets work by querying swift, include it
21:07:23 <timburke> oh, yeah -- i was just moving the filter to what seemed like a more natural location. i could back that out; it's unrelated
21:07:39 <clayg> dude, this change looks good
21:08:32 <clayg> i hate the option, i'm sure all of these places would have used the replication network from the get go if these interfaces already existed
21:09:29 <timburke> does anybody know of deployments where a node that might be running reconciler or expirer *wouldn't* have access to the replication network? maybe i don't even need to include the option
21:09:40 <clayg> but the way you did it it's super clear; not a blocker for me
21:09:44 <alecuyer> at least we don't have that
21:10:04 <alecuyer> (i mean they do have access to the replication network)
21:10:24 <clayg> well the reconciler is interesting because it uses internal client and direct client - it's not clear to me that the direct client requests honor the config option?
21:11:22 <clayg> i haven't looked at this since you pushed it up last thursday - i thought it was still wip 🤷♂️
21:11:28 <clayg> I'm sure it'll be great
21:11:31 <timburke> heh -- whoops. yeah, they don't atm -- feel free to leave a -1
21:11:54 <timburke> anyway, i just wanted to raise attention on it
21:11:57 <clayg> -1 let's just drop the option!? 😍
21:12:17 <clayg> or do you mean -1 if we have the option it has to work and be tested and 🤮
21:12:36 <timburke> either one, up to you ;-)
21:12:44 <timburke> #topic waterfall EC
21:13:03 <clayg> i totally get the argument "if ONE deployment wants the option we'll wish we had that code ready to go" 🤷♂️
21:13:06 <mattoliverau> Sorry I'm late, slept in a bit o/
21:13:11 <timburke> clayg, i saw you get some more patchsets up; how's it going?
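(Going back to the first topic for a moment: purely for orientation, the knob being debated for the daemons that only get work by querying swift -- the reconciler and expirer -- would look something like the sketch below. The option name is the use_legacy_network flag clayg spotted in the diff; both the spelling and the default are placeholders and may change before the patch merges.)

    [container-reconciler]
    # hypothetical sketch: keep internal/direct-client requests on the
    # client-data-path network instead of the replication network
    use_legacy_network = false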
21:13:14 <clayg> i just don't know if such a deployment exists and tend to YAGNI
21:13:24 <clayg> timburke: it's so great, i'm really happy
21:13:29 <timburke> mattoliverau, no worries -- we'll probably let you get breakfast before long ;-)
21:14:07 <alecuyer> thanks for the ping about the concurrent frag fetcher clay. Have started reading but I think I need to read more before I start asking questions :)
21:14:23 <clayg> ok, great, np!
21:14:56 <timburke> you also wrote up an etherpad about non-durables
21:14:58 <timburke> #link https://etherpad.opendev.org/p/the-non-durable-problem
21:15:32 <alecuyer> didn't catch that, thanks
21:16:41 <clayg> the implication was some sort of dumb overly complicated looking code to define/override what's the "shortfall" for a non-durable bucket
21:17:15 <clayg> but it makes a clean way to put non-durable responses into a gradient of "probably fine" until either we get a durable response or get down to ~parity outstanding requests
21:18:12 <timburke> are there any open questions you'd like us to discuss, clayg? or should we just get on with reading the patches?
21:18:13 <clayg> it could be debated/researched/proved where on the slider is "most correct" - but I think it's a good sliding scale, so the code is "correct" in some sense
21:18:24 <clayg> uh, I'd be happy to answer questions
21:18:51 <clayg> but I would like some "early" feedback on gross over-expressiveness of https://review.opendev.org/#/c/737096/
21:18:52 <patchbot> patch 737096 - swift - Make concurrency timeout per policy and replica - 3 patch sets
21:19:09 <clayg> having concurrent_gets/timeout be *per policy* is obviously a move in a positive direction
21:19:58 <clayg> but crazy implementation of alecuyer's idea that waterfall-ec should be able to express "start ndata+N concurrent requests before feeding in the remaining primaries"
21:20:09 <clayg> ... could probably be expressed different ways than how I wrote it up
21:21:07 <clayg> the underlying structure (the per-primary-count timeout) might be a good implementation; but a smart group of folks like we have here may have some better ways to configure it
21:21:25 <clayg> i.e. is there a "do_what_i_want = true" option that would be better
21:21:55 <clayg> ... than the `concurrency_timeout<N>` that I plumbed through
21:22:38 <clayg> I mean it's great! I'm happy with it - it's *completely* sufficient for anything I might want to test in a lab
21:22:55 <clayg> ... and I'm sure different clusters would have good reasons to want to do different things; so it doesn't bother me to have it exposed
21:23:14 <kota_> morning
21:23:16 <alecuyer> We'll have to test it - still swamped with romain about unrelated things but I hope we can test it soon
21:23:33 <clayg> oh wow, that'd be cool!
21:23:43 <timburke> sounds like i need to find time to review p 711342 and p 737096 this week :-)
21:23:44 <patchbot> https://review.opendev.org/#/c/711342/ - swift - Add concurrent_gets to EC GET requests - 12 patch sets
21:23:46 <patchbot> https://review.opendev.org/#/c/737096/ - swift - Make concurrency timeout per policy and replica - 3 patch sets
21:23:49 <clayg> I'll probably be testing it in our lab like... in another week or two?
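(To make the per-policy, per-replica idea concrete: the knobs clayg describes might end up looking roughly like the sketch below in proxy-server.conf. The per-policy section syntax already exists in Swift; whether concurrent_gets/concurrency_timeout are honoured there, and the concurrency_timeout<N> per-replica spelling, are exactly what patch 737096 is still working out, so treat all of this as illustrative.)

    [proxy-server:policy:1]
    # hypothetical sketch for an EC policy
    concurrent_gets = true
    # how long to wait before feeding the next primary into the request pool
    concurrency_timeout = 0.5
    # hypothetical per-replica overrides from the WIP patch: zero timeouts
    # for the first couple of extra requests approximate "start ndata+2
    # concurrent requests up front"
    concurrency_timeout1 = 0
    concurrency_timeout2 = 0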
21:23:56 <alecuyer> I think we want to use EC more and we have things like 12+3 so im quite sure it would help :)
21:24:20 <kota_> alarm didn't ring
21:25:11 <timburke> alecuyer, oh for sure -- we're already grumpy with 8+4; needing to get another 4 live connections would surely make our problems worse ;-)
21:25:23 <timburke> that's it for the agenda
21:25:30 <timburke> #topic open discussion
21:25:41 <timburke> anything else we should talk about this week?
21:25:53 <alecuyer> I want to mention quickly an issue we found with romain during some tests,
21:26:15 <clayg> 😱
21:26:17 <alecuyer> the test setup had a ubuntu 20.04 proxy, and 18.04 object servers
21:26:42 <alecuyer> we used an EC policy, and were unable to reread data,
21:26:47 <timburke> py2 still, or trying out py3?
21:26:49 <timburke> eep!
21:26:54 <alecuyer> still py2 sorry :/
21:27:11 <timburke> cool, just making sure 👍
21:27:16 <alecuyer> it _seems_ liberasurecode has a problem in ubuntu between these two versions
21:27:22 <clayg> i blame 20.04 - that focal 😠
21:27:38 <timburke> which versions of liberasurecode?
21:27:39 <alecuyer> the object server will do get_metadata (i think, sorry i don't recall the function name)
21:27:42 <clayg> wasn't there a thing with CRC libs that was $%^&ing INSANE (cc zaitcev)
21:27:56 <timburke> were they coming from ubuntu's repos?
21:28:00 <alecuyer> 1.6.1 vs 1.5.0 i think
21:28:01 <alecuyer> yes
21:28:06 <alecuyer> recompiling appeared to fix it,
21:28:15 <zaitcev> timburke fixed most of it, I only looked at that. I think you're thinking about zlib.
21:28:19 <alecuyer> and it seems to be related to linker flags (??) if i used ubuntu flags it broke again
21:28:22 <timburke> https://github.com/openstack/liberasurecode/commit/a9b20ae6a
21:28:23 <alecuyer> but well
21:28:36 <clayg> LINKER FLAGS
21:28:51 <alecuyer> just wanted to say we saw that, and maybe i can say more outside the meeting, or once i have tested this properly
21:28:57 <alecuyer> (this was not the goal of the test erm..)
21:29:26 <zaitcev> Yes. It was related to linking order. If system zlib got routines ahead of liberasurecode, then they're used instead of ours.
21:29:26 <clayg> to find bugs is always the goal of testing - you are a winner
21:29:39 <alecuyer> zaitcev: thanks, i didn't figure it out
21:30:45 <zaitcev> alecuyer: so, does this mean that the fix we committed (linked above) was incorrect?
21:31:03 <clayg> so i guess probably someone would be glad if we could produce an actionable bug for ubuntu 20.04's liberasurecode package
21:31:04 <zaitcev> I reviewed it and it seemed watertight to me, buuuuut
21:31:17 <alecuyer> Sorry to say im not sure, haven't had time to dig further, but I'll check it and post detailed version and a test script we have to check outside of swift
21:31:32 <clayg> ... but "always compile/distribute your own liberasurecode" also seems like reasonable advice 🤔
21:31:32 <timburke> i think the issue must be new frags getting read by old libec :-(
21:31:46 <zaitcev> ah, okay
21:31:47 <clayg> trolorlololo
21:32:23 <zaitcev> No, wait. What if you have a cluster that's halfway updated and rsync moves fragment archives from new to old
21:32:31 <timburke> i think if it was the other way around, with old proxy talking to new object, it'd probably be fine? until the reconstructor actually had work to do :-(
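(The standalone check alecuyer promises to post presumably looks something like the sketch below -- the names, parameters, and backend choice here are guesses, not his actual script. The idea: encode with pyeclib on the host running the newer liberasurecode, copy a fragment to a host running the older one, and see whether the metadata/decode step still accepts it.)

    # Hypothetical round-trip check for mixed liberasurecode versions.
    # Run the encode step on the host with the newer library, ship the
    # fragments to the host with the older library, and run the verify
    # step there.
    from pyeclib.ec_iface import ECDriver

    driver = ECDriver(k=12, m=3, ec_type='liberasurecode_rs_vand')

    data = b'\xde\xad\xbe\xef' * 256
    # each fragment carries a small liberasurecode header, including a checksum
    frags = driver.encode(data)

    # verify step: an older library that resolves a different crc32 symbol
    # may refuse fragments written by the newer one
    for i, frag in enumerate(frags):
        print(i, driver.get_metadata(frag, formatted=1))
    assert driver.decode(frags) == data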
21:32:32 <clayg> timburke: but it's only: new ec compiled with "wrong" flags breaks
21:34:02 <timburke> maybe we could offer an env var or something to have the new lib write old crcs (that could then still be read by old code)?
21:34:19 <timburke> and once everything's upgraded, take out the env flag
21:34:23 <clayg> alecuyer: can you share the werx and borked flags?
21:34:43 <alecuyer> yes I will send it on #openstack-swift and etherpad
21:34:52 <clayg> FABULOUS!!!
21:34:55 <clayg> alecuyer: well done
21:35:08 <timburke> so last night i ran down an s3api issue ormandj saw upgrading to py3
21:35:13 <timburke> #link https://bugs.launchpad.net/swift/+bug/1884991
21:35:13 <openstack> Launchpad bug 1884991 in OpenStack Object Storage (swift) "s3api on py3 doesn't use the bytes-on-the-wire when calculating string to sign" [Undecided,In progress]
21:35:41 <timburke> just a heads-up -- the long tail of py3 bugs continues
21:35:57 <clayg> glad someone is testing py3 😬
21:36:22 <zaitcev> They have no choice. We, for example, aren't shipping py26 anymore, at all.
21:36:43 <clayg> zaitcev: you're a hero!!! 🤗
21:36:44 <timburke> have a fix at https://review.opendev.org/#/c/737856/, but it needs tests
21:36:44 <patchbot> patch 737856 - swift - py3: Stop munging RAW_PATH_INFO - 1 patch set
21:37:12 <clayg> i'm very anti munging - +1 on principle alone
21:38:06 <timburke> *especially* when it's for something with "raw" in the name :P
21:38:32 <clayg> timburke: ❤️
21:39:00 <clayg> i'm not having a great time with s3 tests in https://review.opendev.org/#/c/735738/ right now
21:39:00 <patchbot> patch 735738 - swift - s3api: Don't do naive HEAD request for auth - 1 patch set
21:39:42 <zaitcev> BTW, I have the same 2 things on my plate: Dark Data with dsariel and account server crashing. Not much change on either... I know Tim looked at the account server thing, but I'm going to write a probe test that solidly reproduces it.
21:40:14 <clayg> seongsoocho: how are things going for you?
21:40:50 <timburke> zaitcev, oh, yeah yeah -- p 704435 -- i've got a head-start on a probe test for you at p 737117
21:40:50 <patchbot> https://review.opendev.org/#/c/704435/ - swift - Mark a container reported if account was reclaimed - 2 patch sets
21:40:52 <patchbot> https://review.opendev.org/#/c/737117/ - swift - probe: Explore reaping with async pendings - 1 patch set
21:41:22 <seongsoocho> clayg: just have a normal day. everything's good
21:42:07 <timburke> something i've learned the last few months: every day your cluster isn't on fire is a good day :D
21:42:45 <alecuyer> definitely :)
21:43:21 <zaitcev> timburke: thank you
21:44:04 <timburke> zaitcev, how is the watcher going? should i find time to look at it again soon, or wait a bit?
21:44:46 <zaitcev> timburke: wait a bit please. We're putting together a switch for selected action.
21:44:56 <timburke> 👍
21:45:00 <zaitcev> basically
21:45:19 <zaitcev> Sam's design didn't allow for configuration options specific for watchers.
21:45:41 <zaitcev> So, there's no way to express "The DD watcher should do X"
21:46:05 <zaitcev> David wants something crammed into the paste line
21:46:58 <zaitcev> so it can turn a little unwieldy, like watchers=watcher_a,watcher_b#do_this=yes,watcher_c
21:47:06 <zaitcev> I'll let you know.
21:47:06 <clayg> 🤯
21:47:26 <timburke> makes me think of the config changes sam was thinking about in p 504472 ...
21:47:26 <patchbot> https://review.opendev.org/#/c/504472/ - swift - Shorten typical proxy pipeline. - 4 patch sets
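(Back on the RAW_PATH_INFO fix mentioned above: one way to see the class of problem is that any unquote/re-quote round trip can normalize percent-encodings, so the proxy may compute the string to sign over different bytes than the client actually sent and signed. A toy illustration with a made-up path -- the real bug lives in the py3 WSGI plumbing rather than this exact call:)

    from urllib.parse import quote, unquote

    on_the_wire = '/bucket/obj%7Ename'     # the path bytes the client signed
    munged = quote(unquote(on_the_wire))   # what a re-quoting proxy sees
    print(munged)                          # '/bucket/obj~name' -- signature mismatch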
21:48:31 <timburke> i feel like it should be fair for the DD watcher to claim the dark_data_* config namespace within the object-auditor
21:48:33 <zaitcev> I'd prefer something like [object-auditor:watcher_b] \n do_this=yes
21:48:37 <zaitcev> But dunno
21:48:43 <zaitcev> Seems like overkill.
21:48:56 <timburke> or that! also seems good :-)
21:49:08 <zaitcev> ok
21:50:33 <clayg> noice
21:50:33 <timburke> all right, let's let kota_, mattoliverau, and seongsoocho get on with their day ;-)
21:50:46 <timburke> thank you all for coming, and thank you for working on swift!
21:50:50 <timburke> #endmeeting
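(For reference, the two shapes floated above for per-watcher options would look roughly like the sketches below; watcher_a/watcher_b and the do_this option are just the placeholders used in the meeting, not real watchers or settings.)

    # everything crammed into the watchers= line, as David suggests:
    [object-auditor]
    watchers = watcher_a, watcher_b#do_this=yes, watcher_c

    # vs. zaitcev's preference for a dedicated section per watcher:
    [object-auditor]
    watchers = watcher_a, watcher_b, watcher_c

    [object-auditor:watcher_b]
    do_this = yes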