21:00:47 <timburke> #startmeeting swift
21:00:48 <openstack> Meeting started Wed Apr 28 21:00:47 2021 UTC and is due to finish in 60 minutes.  The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:51 <openstack> The meeting name has been set to 'swift'
21:00:55 <timburke> who's here for the swift meeting?
21:01:19 <acoles> o/
21:01:59 <mattoliverau> o/
21:02:08 <clayg> o/
21:04:02 <timburke> as usual, the agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:04:06 <timburke> first up
21:04:10 <timburke> #topic PTG
21:04:51 <timburke> i wanted to thank everyone who came out to the PTG last week -- i feel like we had some good, productive discussions
21:05:33 <timburke> and that needing to explain some of our ideas to devs we don't necessarily get to talk to super-regularly helped firm up a lot of them
21:06:02 <mattoliverau> +1
21:06:13 <timburke> i don't know that i've got a lot more to say, other than thanks again!
21:06:35 <timburke> #topic rolling upgrade job failures
21:06:44 <acoles> I'm grateful to those who were up in their night time - thank you!
21:07:39 <timburke> i don't know how much other people have noticed, but i've seen a fair few failures lately
21:08:11 <clayg> yes, I HAVE seen some rolling upgrade job failures - "grenade" too!
21:08:19 <acoles> I noticed a few in last 24 hours
21:08:40 <acoles> but func-cors is rock solid :D
21:08:46 <timburke> i suspect they've been flaky for a while (seem to be related to listing consistency issues), but we've had them disabled/non-voting a decent bit lately and hadn't noticed
21:09:43 <timburke> i also think (but haven't yet verified) there's a chance they'll improve the next time we cut a tag, since i added the ability to retry failed func tests
21:10:18 <timburke> just wanted to keep people updated; nothing really you all need to do
21:10:35 <acoles> at least one rolling upgrade fail was a timeout https://zuul.opendev.org/t/openstack/build/79a7ae5a3cc649d0a556a29e76dc0800
21:11:31 <timburke> on to updates!
21:11:52 <timburke> we've got a lot of things in-flight these days; i think that was another nice benefit of the PTG :-)
21:11:59 <timburke> #topic sharding
21:12:15 <timburke> so, current patches:
21:12:18 <timburke> https://review.opendev.org/c/openstack/swift/+/784617 - Add sharding to swift-recon (already approved)
21:12:28 <timburke> https://review.opendev.org/c/openstack/swift/+/785628 - swift-manage-shard-ranges: fix exit codes
21:12:32 <timburke> https://review.opendev.org/c/openstack/swift/+/774002 - Fix shrinking making acceptors prematurely active
21:12:43 <timburke> https://review.opendev.org/c/openstack/swift/+/777585 - stall cleaving at shard range gaps (already approved, but waiting on pre-req ^^^)
21:12:49 <timburke> https://review.opendev.org/c/openstack/swift/+/782832 - Consider tombstone count before shrinking a shard
21:12:56 <timburke> https://review.opendev.org/c/openstack/swift/+/787637 - Don't consider responses generated from cache as "already visited"
21:13:29 <timburke> do we have any upgrade concerns about the exit code changes? 'cause if not, i'm happy to +A :-)
21:15:31 <acoles> re the exit codes (patch 785628) IIRC from ~3 years back, there was maybe a thought to differentiate warnings from errors using codes 1 and 2 (or vice-versa) but it's slipped since then
21:15:42 <clayg> exit code changes on which patch?
21:15:54 <timburke> second one
21:15:59 <acoles> and I discovered recently that argparse exits with 2 on invalid args
21:17:28 <acoles> so my thinking with the patch is to line up all invalid CLI usage to return 2 and any other non-success to be 1.
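[Editor's note: the convention acoles describes -- argparse's own 2-on-invalid-args, 1 for other failures -- can be seen in a minimal sketch. The tool name, option, and constants here are hypothetical, not swift-manage-shard-ranges' actual code.]

```python
import argparse
import sys

EXIT_SUCCESS = 0
EXIT_ERROR = 1          # ran, but some non-success outcome
EXIT_INVALID_ARGS = 2   # matches argparse's own convention

def main(args=None):
    parser = argparse.ArgumentParser(prog='example-tool')
    parser.add_argument('--count', type=int, required=True)
    # argparse itself calls sys.exit(2) on unparseable/missing args,
    # so returning 2 for our own validation failures keeps all
    # "invalid CLI" cases consistent
    opts = parser.parse_args(args)
    if opts.count < 0:
        print('count must be non-negative', file=sys.stderr)
        return EXIT_INVALID_ARGS
    if opts.count == 0:
        return EXIT_ERROR
    return EXIT_SUCCESS

if __name__ == '__main__':
    sys.exit(main())
```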
21:18:20 <timburke> seems reasonable, approving
21:18:21 <timburke> i feel like we ought to prioritize the "prematurely active" patch since it's blocking the "stall cleaving" patch which is otherwise good to go
21:18:37 <acoles> thanks
21:19:06 <timburke> how are we feeling about the tombstone counting? just waiting on review?
21:19:34 <acoles> I think I attracted some interest in tombstones from clayg
21:19:37 <mattoliverau> yup looks good (exit code). I'll look at prematurely active today to unstick it.
21:19:47 <timburke> thanks!
21:19:53 <acoles> thanks mattoliverau
21:21:46 <timburke> i wouldn't mind talking through the "already visited" patch a bit, but maybe that'd be better next week
21:22:23 <timburke> any other sharding topics i'm forgetting?
21:22:49 <mattoliverau> Maybe put the rest on priority review (if they aren't already) so I don't forget about them.. it's early here and my brain isn't working yet.
21:22:58 <acoles> I need to be convinced on not including cached responses in the loop-detection
21:23:39 <mattoliverau> I haven't really looped back around to active_age post PTG, so not much to say there yet. But want to get back to it soon.
21:23:52 <timburke> my main thought is that *we haven't gone to disk yet*
21:24:20 <timburke> ooh -- yeah -- it'll be interesting to see if my idea pans out :-)
21:24:32 <acoles> timburke: it may be that we need a way to provoke a backend request without mandating that it is for objects only
21:25:32 <acoles> but also retain the break if that just results in the same loop, somehow
21:26:23 <timburke> #topic relinker
21:26:48 <timburke> another wall of patches:
21:26:55 <timburke> https://review.opendev.org/c/openstack/swift/+/783731 - Rehash the parts actually touched when relinking
21:27:01 <timburke> https://review.opendev.org/c/openstack/swift/+/788089 - Only mark partitions "done" if there were no (new) errors
21:27:05 <timburke> https://review.opendev.org/c/openstack/swift/+/779655 - Add /recon/relinker endpoint and drop progress stats
21:27:09 <timburke> https://review.opendev.org/c/openstack/swift/+/788413 - Log and recon on SIGTERM signal
21:27:14 <timburke> https://review.opendev.org/c/openstack/swift/+/788177 - add aggregate data to recon drop
21:28:12 <acoles> re 788089 - when did we ignore errors?
21:28:17 <timburke> so the first two seem pretty useful for correctness and clear ops-signalling
21:28:51 <timburke> they weren't *ignored* exactly... i mean, we logged them and everything, and we'll exit non-zero
21:29:14 <timburke> it's just that we mark the partition as having been relinked
21:29:23 <acoles> but we set state to True?
21:29:30 <timburke> yup
21:29:33 <acoles> eek
21:30:01 <timburke> so a subsequent relink either skips the partition that had errors, or ops need to manually go clear the state file
21:30:20 <acoles> yeah, we should fix that
21:30:34 <mattoliverau> yup
21:30:43 <mattoliverau> the last 3 are based around the new recon patch, split out, plus one I wrote up to trap signals and dump the error_code as appropriate to recon, the CLI return, and the log.
21:31:59 <timburke> the "rehash parts touched" strikes my interest since we've had instances where we had hashes in partitions from part-power 17 when relinking into 19 (for instance)
21:32:00 <mattoliverau> 788414 took longer than expected when trying to write tests because of differences between sys.exit and os._exit.. which made the signals kill my test suite run.. fun times :P
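[Editor's note: a minimal sketch of the kind of SIGTERM trap being discussed, and of the sys.exit/os._exit distinction that bit the test suite. The `state_writer` hook is hypothetical, not the relinker's actual code.]

```python
import signal
import sys

def install_sigterm_handler(state_writer):
    """Persist progress state before exiting on SIGTERM.

    state_writer is any callable that records progress (e.g. to a
    recon cache file) -- a hypothetical stand-in for the real hook.
    """
    def handler(signum, frame):
        state_writer('terminated by signal %d' % signum)
        # sys.exit() raises SystemExit, which unwinds the stack and can
        # be caught (e.g. by a test runner); os._exit(code) ends the
        # process immediately with no cleanup -- the difference that
        # made signals kill the whole test suite run.
        sys.exit(1)
    signal.signal(signal.SIGTERM, handler)
```

Because the handler raises SystemExit rather than calling os._exit, a test can invoke it directly and catch the exit instead of dying.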
21:32:07 <acoles> https://review.opendev.org/c/openstack/swift/+/779655 (first in relinker recon chain) has been coming along - should we focus on getting that merged? I think there's still a few things to resolve like the option name, but hopefully it is close
21:33:20 <clayg> @mattoliverau fun time!!! 🤣
21:33:35 <acoles> i.e. should recon_interval actually be stats_interval like the replicator has?
21:33:37 <timburke> yeah, i definitely like the recon idea -- haven't had a chance to look at it since it was split up, unfortunately
21:33:58 <timburke> i'll try to take another pass at it this afternoon
21:35:07 <mattoliverau> yeah, if there is already a stats_interval elsewhere I'm all for using it. keeps things consistent.
21:36:42 <timburke> so mattoliverau -- do you have a preference on which fork people look at next after the first recon change?
21:36:50 <clayg> acoles: did you have other concerns that got dropped?  the name change is a good idea, and easy enough to fix 💪
21:37:12 <acoles> I like that you've broken things out into some follow on patches
21:38:25 <mattoliverau> timburke: not really. the trap just makes sure we write out what's done and let people know if the process is killed by something like (ahem) ansible timeouts.
21:38:38 <mattoliverau> the aggregator we might need to discuss some more.
21:39:15 <timburke> sounds like maybe i should look at signals next, then ;-)
21:39:39 <timburke> i know acoles had some comments on the base patch -- did those ever get addressed?
21:40:04 <acoles> clayg: mattoliverau I think the two non-nits were recon_interval to stats_interval and duplicated start_time, although the latter isn't a blocker. But we must straighten out the option name.
21:40:30 <mattoliverau> I might go poke an op to take a look at the existing recon and the aggregator followup to see what they'd like to see, or rather if /what they can use.
21:40:56 <mattoliverau> acoles: ahh yeah the start time, somehow I missed that again in yesterday's rework.
21:41:27 <timburke> 👍
21:42:08 <timburke> #topic stale EC frags
21:42:29 <mattoliverau> I haven't looked at the patch this morning so don't know what's there. but will push a new patchset today. maybe I'll wait until timburke has a look (if he gets to it this arvo his time). no pressure tho
21:42:29 <timburke> we've got a couple patches currently working their way through the gate (thanks clayg and acoles!)
21:42:41 <timburke> https://review.opendev.org/c/openstack/swift/+/787279 - reconstructor: log more details when rebuild fails (already approved)
21:42:47 <timburke> https://review.opendev.org/c/openstack/swift/+/788540 - reconstructor: extract closure for handle_response (already approved)
21:42:54 <timburke> so how are we feeling on
21:42:58 <timburke> https://review.opendev.org/c/openstack/swift/+/786084 - Quarantine stale EC fragments
21:43:56 <mattoliverau> I started a review on it last night.. but ran out of time. Planning on continuing it today. So don't have too much to say atm myself.
21:45:28 <timburke> acoles, any known rough edges to watch out for?
21:45:38 <timburke> or just waiting on review?
21:46:47 <timburke> probably will need a rebase once the other two land...
21:47:07 <timburke> we'll see what it looks like next week
21:47:18 <timburke> #topic dark data watcher
21:47:44 <timburke> so we've got a couple patches for some known deficiencies
21:48:00 <timburke> https://review.opendev.org/c/openstack/swift/+/788398 - Make dark data watcher ignore the newly updated objects
21:48:11 <timburke> https://review.opendev.org/c/openstack/swift/+/787656 - Work with sharded containers
21:49:04 <timburke> i don't think either is quite ready yet (zaitcev's patch has a WIP in the commit message, and mine probably should, too)
21:49:16 <timburke> but i wanted to keep them on people's radars
21:49:44 <zaitcev> Mine needs tests.
21:50:07 <timburke> i think those are the main major efforts in-flight right now
21:50:10 <acoles> timburke: zaitcev thanks for those patches
21:50:12 <timburke> #topic open discussion
21:50:23 <timburke> anything else we ought to bring up this week?
21:52:40 <mattoliverau> nothing comes to mind
21:53:02 <timburke> so i had a thought that feels like a good idea, but idk if it presents some backwards compat issues
21:53:04 <timburke> https://review.opendev.org/c/openstack/swift/+/787905 - proxy: Downgrade some client problems to info
21:54:04 <timburke> basically, stop logging client disconnects and timeouts at warning -- they're client behaviors, so it is (or can be) way too noisy at that level
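[Editor's note: the change being discussed is a log-level policy, which a short sketch can illustrate -- route client-caused conditions to INFO and keep WARNING for server-side faults. The helper names are hypothetical, not swift's proxy code.]

```python
import logging

logger = logging.getLogger('proxy-server')

def log_client_problem(exc):
    # Disconnects and client read timeouts are client behaviors, not
    # server faults; at WARNING they can drown out actionable messages,
    # so emit them at INFO instead.
    logger.info('Client disconnected or timed out: %s', exc)

def log_server_problem(exc):
    # Keep WARNING for conditions an operator should investigate.
    logger.warning('Unexpected backend error: %s', exc)
```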
21:56:28 <timburke> well, something to think about, anyway
21:56:50 <timburke> that's all i've got
21:57:05 <timburke> thank you all for coming, and thank you for working on swift!
21:57:15 <timburke> and thanks for coming to the PTG :-)
21:57:20 <timburke> #endmeeting