21:00:30 <timburke> #startmeeting swift
21:00:30 <opendevmeet> Meeting started Wed Feb 23 21:00:30 2022 UTC and is due to finish in 60 minutes.  The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:30 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:30 <opendevmeet> The meeting name has been set to 'swift'
21:00:36 <timburke> who's here for the swift meeting?
21:00:58 <mattoliver> o/
21:01:13 <kota> o/
21:02:19 <acoles> o/
21:02:34 <timburke> as usual, the agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:54 <timburke> (though i've forgotten to update it :P)
21:03:01 <timburke> #topic PTG
21:03:17 <timburke> quick reminder to fill out the doodle poll to pick meeting times
21:03:24 <timburke> #link https://doodle.com/poll/qs2pysgyb8nb36c2
21:03:34 <kota> oh ok. will do soon
21:04:02 <timburke> i'll get an etherpad up to collect development topics, too
21:04:53 <timburke> #topic priority reviews
21:05:00 <timburke> i updated the page at https://wiki.openstack.org/wiki/Swift/PriorityReviews
21:05:32 <timburke> mostly to call out some patches i know we're running in prod
21:07:17 <timburke> some seem about ready to go -- expirer: Only try to delete empty containers (https://review.opendev.org/c/openstack/swift/+/825883) did just what we hoped it would, and we saw a precipitous drop in container deletes and listing shard range cache misses
21:08:09 <acoles> yes that was a great improvement
21:08:28 <timburke> others had somewhat more mixed results -- container-server: plumb includes down into _get_shard_range_rows (https://review.opendev.org/c/openstack/swift/+/569847) *maybe* had some impact on updater timings, but it was hard to say conclusively
21:08:53 <timburke> there was one that i wanted to check in on in particular
21:08:55 <timburke> #link https://review.opendev.org/c/openstack/swift/+/809969
21:09:01 <timburke> Sharding: a remote SR without an epoch can't replicate over one with an epoch
21:09:58 <timburke> mattoliver, am i remembering right that the idea was to get the no-epoch SR to stick around so we could hunt down how it happened?
21:10:03 <mattoliver> That stops the reset, but I think it currently confines the problem to the problem node.
21:10:24 <mattoliver> But if that problem node is a handoff then it might be fine.
21:10:42 <mattoliver> Interestingly, we haven't seen the problem again since we started running it.
21:11:21 <timburke> what do we think about merging it sooner rather than later, and calling the problem fixed until we get new information?
21:11:58 <mattoliver> Yeah, kk, it does log when there is an issue, so it'll let people know.
21:13:43 <acoles> might be worth adding broker.db_path to the warning?
21:14:56 <mattoliver> oh yeah, good idea.
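[editor's note: a minimal sketch of what the extended warning could look like; the message text and broker attributes are assumptions taken from the discussion above, not the actual patch]

    def warn_no_epoch(logger, broker):
        # hypothetical helper: the existing no-epoch warning, extended with
        # the db path so the problem replica can be tracked down
        logger.warning(
            'Ignoring remote shard range without an epoch for %s (db: %s)',
            broker.path, broker.db_path)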
21:15:03 <timburke> all right, that's about all i've got then
21:15:11 <timburke> #topic open discussion
21:15:15 <mattoliver> I haven't looked at the patch so will look today
21:15:22 <timburke> what else should we bring up this week?
21:16:27 <mattoliver> I added handoff_delete to the db replicators https://review.opendev.org/c/openstack/swift/+/828637
21:16:53 <mattoliver> which helps when you need to drain nodes, and gets the db replicators closer to parity with the obj replicator
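[editor's note: handoff_delete already exists for the object replicator; a rough sketch of how the option might look in a container-replicator config section, assuming the patch mirrors that behaviour -- the value shown is purely illustrative]

    [container-replicator]
    # with 3 replicas, delete a handoff DB once it has been successfully
    # pushed to 2 primaries instead of waiting for all 3 to sync
    handoff_delete = 2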
21:18:23 <mattoliver> Also been playing with concurrent object PUTs to the same container, trying to understand the problems involved and attempting to improve things some more.
21:19:02 <timburke> nice! along the same lines, i wrote up https://review.opendev.org/c/openstack/swift/+/830535 to clean up part dirs more quickly when you're rebalancing DBs
21:19:25 <mattoliver> cool
21:19:49 <mattoliver> In initial testing, moving the container directory lock, sharding out the pending file, and locking only the pending file you're updating seems really promising. Getting far fewer directory lock timeouts
21:20:58 <mattoliver> It just improves concurrent access to the server, so it helps when running multiple workers
21:21:51 <mattoliver> current POC WIP is https://review.opendev.org/c/openstack/swift/+/830551
21:22:10 <timburke> yeah, that looked promising -- anything to get a few more reqs/s out of the container-server
21:22:19 <mattoliver> That still has debugging and q statements in it. Just wanted to get it backed up off my laptop.
21:22:24 <mattoliver> +1
21:24:17 <timburke> one thing i'm still curious about is what the curve looks like for number of container-server workers vs. max concurrent requests before clients start hitting timeouts
21:25:05 <mattoliver> yeah, on my VSAIO it won't be as high as a real server :P
21:26:21 <timburke> still, hopefully the curve would still look somewhat similar -- start off at some level, and as you add a *ton* of workers it drops pretty low because of all the contention -- but what happens in the middle?
21:26:33 <timburke> i feel like that may push us toward something like a servers-per-port strategy
21:26:43 <mattoliver> yup, can have a play.
21:27:28 <mattoliver> currently I'm randomly choosing a pending file shard when a put comes in. I wonder if I could just have a shard per worker, or maybe it's shards per worker.
21:27:42 <mattoliver> some of the timeouts could also be due to the randomness of choosing a shard.
21:28:25 <acoles> mattoliver: are you no longer locking the parent directory when appending to the pending file?
21:29:01 <mattoliver> nope, not unless it's a _commit_puts and we actually update the DB
21:29:15 <mattoliver> but I'm not sure yet what effect that has on other things like replication
21:29:49 <mattoliver> but I do lock the pending file being updated so we don't lose pending data.
21:30:10 <acoles> but not locking the pending file when flushing it?
21:30:28 <acoles> does the parent dir lock also take a lock on all the pending files?
21:30:31 <mattoliver> I do lock them then too, because we use a truncate on it
21:31:18 <timburke> yeah, i'd imagine you'd want to lock all the pending files (and the parent dir) when flushing
21:31:31 <acoles> OIC down in commit_puts
21:31:52 <mattoliver> but I take a lock on a pending file while flushing it, and only while dealing with that one so a concurrent put could go use it again.
21:32:04 <mattoliver> timburke: yup
21:32:11 <timburke> nice
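[editor's note: to make the scheme a bit more concrete, here is a rough sketch of the idea being described, with hypothetical names throughout -- the shard count, file layout, and lock helper are all illustrative; the actual POC at the review above will differ]

    import fcntl
    import os
    import random
    from contextlib import contextmanager

    PENDING_SHARDS = 4  # illustrative; the POC may pick this differently


    @contextmanager
    def locked(path):
        # flock the pending file itself rather than its parent directory
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            yield fd
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)


    def put_record(db_dir, record):
        # append to one randomly chosen shard, holding only that shard's lock
        shard = os.path.join(
            db_dir, 'pending.%d' % random.randrange(PENDING_SHARDS))
        with locked(shard) as fd:
            os.lseek(fd, 0, os.SEEK_END)
            os.write(fd, record)


    def commit_puts(db_dir, apply_to_db):
        # flush each shard in turn, locking shards one at a time so concurrent
        # PUTs can keep using the others (per the discussion, the POC also
        # takes the parent directory lock around the actual DB update;
        # omitted here for brevity)
        for i in range(PENDING_SHARDS):
            shard = os.path.join(db_dir, 'pending.%d' % i)
            with locked(shard) as fd:
                os.lseek(fd, 0, os.SEEK_SET)
                data = b''
                while True:
                    chunk = os.read(fd, 65536)
                    if not chunk:
                        break
                    data += chunk
                if data:
                    apply_to_db(data)
                os.ftruncate(fd, 0)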
21:32:13 <timburke> if anyone has some spare time to think about a client-facing api change, i've got some users that'd appreciate something like https://review.opendev.org/c/openstack/swift/+/829605 - container: Add delimiter-depth query param
21:33:36 <acoles> I was wondering if it would be possible to direct updates to a pending file that isn't being flushed?
21:34:07 <mattoliver> oh interesting!
21:34:13 <timburke> that'd be fancy! do it as a ring ;-)
21:34:19 <acoles> e.g. if the pending files could be pinned to workers
21:34:46 <acoles> or some kind of rotation
21:35:12 <mattoliver> I like it!
21:35:44 <acoles> maybe just try 'em all til you get a lock, a bit like how we do multiple lock files
21:36:15 <mattoliver> yeah can borrow that code as a start at least :)
21:36:53 <mattoliver> also like the ring like approach.
21:37:08 <mattoliver> Will have a play. thanks for the awesome ideas
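[editor's note: and a tiny sketch of the "try 'em all til you get a lock" idea, again with hypothetical naming -- a non-blocking lock attempt per shard, falling through to the next one]

    import fcntl
    import os


    def lock_free_pending_shard(db_dir, shard_count):
        # walk the shards and grab the first one whose lock isn't currently
        # held, so a PUT never queues behind a shard that is mid-flush
        for i in range(shard_count):
            path = os.path.join(db_dir, 'pending.%d' % i)
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return fd  # caller appends, then unlocks and closes the fd
            except BlockingIOError:
                os.close(fd)
        return None  # every shard busy; caller could block on one or retry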
21:38:18 <timburke> all right, i think i'll call it
21:38:30 <timburke> thank you all for coming, and thank you for working on swift!
21:38:34 <timburke> #endmeeting