21:00:47 <timburke> #startmeeting swift
21:00:48 <openstack> Meeting started Wed Apr 28 21:00:47 2021 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:51 <openstack> The meeting name has been set to 'swift'
21:00:55 <timburke> who's here for the swift meeting?
21:01:19 <acoles> o/
21:01:59 <mattoliverau> o/
21:02:08 <clayg> o/
21:04:02 <timburke> as usual, the agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:04:06 <timburke> first up
21:04:10 <timburke> #topic PTG
21:04:51 <timburke> i wanted to thank everyone who came out to the PTG last week -- i feel like we had some good, productive discussions
21:05:33 <timburke> and that needing to explain some of our ideas to devs we don't necessarily get to talk to super-regularly helped firm up a lot of them
21:06:02 <mattoliverau> +1
21:06:13 <timburke> i don't know that i've got a lot more to say, other than thanks again!
21:06:35 <timburke> #topic rolling upgrade job failures
21:06:44 <acoles> I'm grateful to those who were up in their night time - thank you!
21:07:39 <timburke> i don't know how much other people have noticed, but i've seen a fair few failures lately
21:08:11 <clayg> yes, I HAVE seen some rolling upgrade job failures - "grenade" too!
21:08:19 <acoles> I noticed a few in last 24 hours
21:08:40 <acoles> but func-cors is rock solid :D
21:08:46 <timburke> i suspect they've been flaky for a while (seem to be related to listing consistency issues), but we've had them disabled/non-voting a decent bit lately and hadn't noticed
21:09:43 <timburke> i also think (but haven't yet verified) there's a chance they'll improve the next time we cut a tag, since i added the ability to retry failed func tests
21:10:18 <timburke> just wanted to keep people updated; nothing really you all need to do
21:10:35 <acoles> at least one rolling upgrade fail was a timeout https://zuul.opendev.org/t/openstack/build/79a7ae5a3cc649d0a556a29e76dc0800
21:11:31 <timburke> on to updates!
21:11:52 <timburke> we've got a lot of things in-flight these days; i think that was another nice benefit of the PTG :-)
21:11:59 <timburke> #topic sharding
21:12:15 <timburke> so, current patches:
21:12:18 <timburke> https://review.opendev.org/c/openstack/swift/+/784617 - Add sharding to swift-recon (already approved)
21:12:28 <timburke> https://review.opendev.org/c/openstack/swift/+/785628 - swift-manage-shard-ranges: fix exit codes
21:12:32 <timburke> https://review.opendev.org/c/openstack/swift/+/774002 - Fix shrinking making acceptors prematurely active
21:12:43 <timburke> https://review.opendev.org/c/openstack/swift/+/777585 - stall cleaving at shard range gaps (already approved, but waiting on pre-req ^^^)
21:12:49 <timburke> https://review.opendev.org/c/openstack/swift/+/782832 - Consider tombstone count before shrinking a shard
21:12:56 <timburke> https://review.opendev.org/c/openstack/swift/+/787637 - Don't consider responses generated from cache as "already visited"
21:13:29 <timburke> do we have any upgrade concerns about the exit code changes? 'cause if not, i'm happy to +A :-)
21:15:31 <acoles> re the exit codes (patch 785628) IIRC back 3 years, there was maybe a thought to differentiate warnings from errors using codes 1 and 2 (or vice-versa) but it's slipped since then
21:15:42 <clayg> exit code changes on which patch?
21:15:54 <timburke> second one
21:15:59 <acoles> and I discovered recently that argparse exits with 2 on invalid args
21:17:28 <acoles> so my thinking with the patch is to try to line up all invalid cli to return 2 and any other non-success to be 1.
21:18:20 <timburke> seems reasonable, approving
21:18:21 <timburke> i feel like we ought to prioritize the "prematurely active" patch since it's blocking the "stall cleaving" patch which is otherwise good to go
21:18:37 <acoles> thanks
21:19:06 <timburke> how are we feeling about the tombstone counting? just waiting on review?
21:19:34 <acoles> I think I attracted some interest in tombstones from clayg
21:19:37 <mattoliverau> yup looks good (exit code). I'll look at prematurely active today to unstick it.
21:19:47 <timburke> thanks!
21:19:53 <acoles> thanks mattoliverau
21:21:46 <timburke> i wouldn't mind talking through the "already visited" patch a bit, but maybe that'd be better next week
21:22:23 <timburke> any other sharding topics i'm forgetting?
21:22:49 <mattoliverau> Maybe put the rest on priority review (if they aren't already) so I don't forget about them... it's early here and my brain isn't working yet.
21:22:58 <acoles> I need to be convinced on not including cached responses in the loop-detection
21:23:39 <mattoliverau> I haven't really looped back around to active_age post PTG, so not much to say there yet. But want to get back to it soon.
21:23:52 <timburke> my main thought is that *we haven't gone to disk yet*
21:24:20 <timburke> ooh -- yeah -- it'll be interesting to see if my idea pans out :-)
21:24:32 <acoles> timburke: it may be that we need a way to provoke a backend request without mandating that it is for objects only
21:25:32 <acoles> but also retain the break if that just results in the same loop, somehow
21:26:23 <timburke> #topic relinker
21:26:48 <timburke> another wall of patches:
21:26:55 <timburke> https://review.opendev.org/c/openstack/swift/+/783731 - Rehash the parts actually touched when relinking
21:27:01 <timburke> https://review.opendev.org/c/openstack/swift/+/788089 - Only mark partitions "done" if there were no (new) errors
21:27:05 <timburke> https://review.opendev.org/c/openstack/swift/+/779655 - Add /recon/relinker endpoint and drop progress stats
21:27:09 <timburke> https://review.opendev.org/c/openstack/swift/+/788413 - Log and recon on SIGTERM signal
21:27:14 <timburke> https://review.opendev.org/c/openstack/swift/+/788177 - add aggregate data to recon drop
21:28:12 <acoles> re 788089 - when did we ignore errors?
21:28:17 <timburke> so the first two seem pretty useful for correctness and clear ops-signalling
21:28:51 <timburke> they weren't *ignored* exactly... i mean, we logged them and everything, and we'll exit non-zero
21:29:14 <timburke> it's just that we mark the partition as having been relinked
21:29:23 <acoles> but we set state to True?
21:29:30 <timburke> yup
21:29:33 <acoles> eek
21:30:01 <timburke> so a subsequent relink either skips the partition that had errors, or ops need to manually go clear the state file
21:30:20 <acoles> yeah, we should fix that
21:30:34 <mattoliverau> yup
21:30:43 <mattoliverau> the last 3 are based around the new recon patch, split out, plus one I wrote up to trap signals and dump the error_code as appropriate to recon, cli return and log it.
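(On the exit-code convention acoles lays out above: success is 0, invalid CLI usage is 2 to match what argparse itself does, and any other failure is 1. A minimal sketch of that pattern follows; the tool name and helper are hypothetical stand-ins, not the actual swift-manage-shard-ranges code.)

```python
import argparse
import sys

EXIT_SUCCESS = 0       # operation completed
EXIT_ERROR = 1         # valid invocation, but the operation failed
EXIT_INVALID_ARGS = 2  # bad usage; argparse itself exits 2 on parse errors


def do_the_work(path):
    # hypothetical stand-in for the real operation
    with open(path) as f:
        return f.read()


def main(args=None):
    parser = argparse.ArgumentParser(prog='example-tool')
    parser.add_argument('path')
    # on invalid args, parse_args() calls sys.exit(2) for us, so bad
    # usage already lines up with EXIT_INVALID_ARGS
    options = parser.parse_args(args)
    try:
        do_the_work(options.path)
    except ValueError:
        # input problems we detect ourselves after parsing: stay
        # consistent with argparse's choice of 2
        return EXIT_INVALID_ARGS
    except Exception:
        # any other non-success
        return EXIT_ERROR
    return EXIT_SUCCESS


if __name__ == '__main__':
    sys.exit(main())
```

(Keeping self-detected input problems on the same code argparse uses means callers can treat exit status 2 uniformly as "fix your invocation" and 1 as "something went wrong".)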
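(And on the relinker state-file exchange just above, patch 788089: the problem described is that a partition got marked done in the state file even when relinking it hit errors, so a rerun would skip it. Below is a sketch of the intended behavior, with a hypothetical state-file shape and helper names rather than the real relinker internals.)

```python
import json


def relink_partition(part):
    # hypothetical worker: relink one partition's files and return
    # how many errors were encountered
    return 0


def process_device(parts, state_path):
    try:
        with open(state_path) as f:
            state = json.load(f)
    except (IOError, ValueError):
        state = {}

    for part in parts:
        if state.get(part):
            continue  # relinked cleanly on a previous run; skip
        errors = relink_partition(part)
        # the bug: marking the partition done unconditionally meant a
        # rerun would skip it even though some files were missed; only
        # record success when the partition relinked with no errors
        state[part] = (errors == 0)
        with open(state_path, 'w') as f:
            json.dump(state, f)

    # non-zero exit status if anything still needs a retry
    return 0 if all(state.get(p) for p in parts) else 1
```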
21:31:59 <timburke> the "rehash parts touched" strikes my interest since we've had instances where we had hashes in partitions from part-power 17 when relinking into 19 (for instance)
21:32:00 <mattoliverau> 788414 took longer than expected when trying to write tests because of differences in sys.exit and os._exit... which made the signals kill my test suite run... fun times :P
21:32:07 <acoles> https://review.opendev.org/c/openstack/swift/+/779655 (first in relinker recon chain) has been coming along - should we focus on getting that merged? I think there's still a few things to resolve like the option name, but hopefully it is close
21:33:20 <clayg> @mattoliverau fun time!!! 🤣
21:33:35 <acoles> i.e. should recon_interval actually be stats_interval like the replicator has?
21:33:37 <timburke> yeah, i definitely like the recon idea -- haven't had a chance to look at it since it was split up, unfortunately
21:33:58 <timburke> i'll try to take another pass at it this afternoon
21:35:07 <mattoliverau> yeah, if there is already a stats_interval elsewhere I'm all for using it. keeps things consistent.
21:36:42 <timburke> so mattoliverau -- do you have a preference on which fork people look at next after the first recon change?
21:36:50 <clayg> acoles: did you have other concerns that got dropped? the name change is a good idea, and easy enough to fix 💪
21:37:12 <acoles> I like that you've broken things out into some follow-on patches
21:38:25 <mattoliverau> timburke: not really. the trap just makes sure we write out what got done and let people know, if the process is killed by something like (ahem) ansible timeouts.
21:38:38 <mattoliverau> the aggregator we might need to discuss some more.
21:39:15 <timburke> sounds like maybe i should look at signals next, then ;-)
21:39:39 <timburke> i know acoles had some comments on the base patch -- did those ever get addressed?
21:40:04 <acoles> clayg: mattoliverau I think the two non-nits were recon_interval to stats_interval and duplicated start_time, although the latter isn't a blocker. But we must straighten out the option name.
21:40:30 <mattoliverau> I might go poke an op to take a look at the existing recon and the aggregator followup to see what they'd like to see, or rather if/what they can use.
21:40:56 <mattoliverau> acoles: ahh yeah the start time, somehow I missed that again in yesterday's rework.
21:41:27 <timburke> 👍
21:42:08 <timburke> #topic stale EC frags
21:42:29 <mattoliverau> I haven't looked at the patch this morning so don't know what's there. but will push a new patchset today. maybe I'll wait until timburke has a look (if he gets to it this arvo his time). no pressure tho
21:42:29 <timburke> we've got a couple patches currently working their way through the gate (thanks clayg and acoles!)
21:42:41 <timburke> https://review.opendev.org/c/openstack/swift/+/787279 - reconstructor: log more details when rebuild fails (already approved)
21:42:47 <timburke> https://review.opendev.org/c/openstack/swift/+/788540 - reconstructor: extract closure for handle_response (already approved)
21:42:54 <timburke> so how are we feeling on
21:42:58 <timburke> https://review.opendev.org/c/openstack/swift/+/786084 - Quarantine stale EC fragments
21:43:56 <mattoliverau> I started a review on it last night... but ran out of time. Planning on continuing it today. So don't have too much to say atm myself.
21:45:28 <timburke> acoles, any known rough edges to watch out for?
21:45:38 <timburke> or just waiting on review?
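(A note on the signal-trapping work discussed earlier in this topic, 788413: the general technique is to install a SIGTERM handler that flushes progress to the recon-style drop file before exiting. The sketch below is illustrative, not the relinker's actual code; the file name and state dict are made up. It also shows the sys.exit/os._exit distinction mattoliverau hit in tests: sys.exit raises SystemExit so cleanup still runs, while os._exit terminates immediately.)

```python
import json
import signal
import sys
import time

STATE_FILE = '/tmp/example_recon.json'  # hypothetical recon-style drop
state = {'partitions_done': 0, 'terminated': None}


def dump_state():
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f)


def on_sigterm(signum, frame):
    # record how far we got and that we were killed, then exit
    # non-zero. sys.exit raises SystemExit, so finally blocks and
    # atexit hooks still run; os._exit(1) would skip all of that,
    # which matters a lot when a test suite is the thing running you
    state['terminated'] = 'SIGTERM'
    dump_state()
    sys.exit(1)


if __name__ == '__main__':
    signal.signal(signal.SIGTERM, on_sigterm)
    for _ in range(100):
        time.sleep(1)  # stand-in for relinking one partition
        state['partitions_done'] += 1
        dump_state()
```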
21:46:47 <timburke> probably will need a rebase once the other two land...
21:47:07 <timburke> we'll see what it looks like next week
21:47:18 <timburke> #topic dark data watcher
21:47:44 <timburke> so we've got a couple patches for some known deficiencies
21:48:00 <timburke> https://review.opendev.org/c/openstack/swift/+/788398 - Make dark data watcher ignore the newly updated objects
21:48:11 <timburke> https://review.opendev.org/c/openstack/swift/+/787656 - Work with sharded containers
21:49:04 <timburke> i don't think either is quite ready yet (zaitcev's patch has a WIP in the commit message, and mine probably should, too)
21:49:16 <timburke> but i wanted to keep them on people's radars
21:49:44 <zaitcev> Mine needs tests.
21:50:07 <timburke> i think those are the main major efforts in-flight right now
21:50:10 <acoles> timburke: zaitcev thanks for those patches
21:50:12 <timburke> #topic open discussion
21:50:23 <timburke> anything else we ought to bring up this week?
21:52:40 <mattoliverau> nothing comes to mind
21:53:02 <timburke> so i had a thought that feels like a good idea, but idk if it presents some backwards compat issues
21:53:04 <timburke> https://review.opendev.org/c/openstack/swift/+/787905 - proxy: Downgrade some client problems to info
21:54:04 <timburke> basically, stop logging client disconnects and timeouts at warning -- they're client behaviors, so it is (or can be) way too noisy at that level
21:56:28 <timburke> well, something to think about, anyway
21:56:50 <timburke> that's all i've got
21:57:05 <timburke> thank you all for coming, and thank you for working on swift!
21:57:15 <timburke> and thanks for coming to the PTG :-)
21:57:20 <timburke> #endmeeting
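(One footnote on the last item discussed, 787905: the change is about the log level for client-caused events, not about swallowing them. A hedged sketch of that pattern, using a hypothetical exception type and handler; the proxy's real handling is more involved.)

```python
import logging

logger = logging.getLogger('proxy-server')


class ClientDisconnect(Exception):
    """Hypothetical stand-in for the client dropping mid-transfer."""


def handle_upload(chunk_iter):
    try:
        for chunk in chunk_iter:
            pass  # forward each chunk to the backends (elided)
    except (ClientDisconnect, TimeoutError):
        # the client misbehaved, not the cluster: info keeps it
        # visible without paging anyone who alerts on warnings
        logger.info('client disconnected or timed out mid-request')
        return 499  # "client closed request", a common convention
    return 201


def flaky_client():
    yield b'some data'
    raise ClientDisconnect()


print(handle_upload(flaky_client()))  # logs at info, returns 499
```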