21:00:08 <timburke> #startmeeting swift
21:00:09 <opendevmeet> Meeting started Wed Jun  2 21:00:08 2021 UTC and is due to finish in 60 minutes.  The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:10 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:12 <opendevmeet> The meeting name has been set to 'swift'
21:00:17 <timburke> who's here for the swift meeting?
21:00:44 <kota_> hi
21:01:07 <acoles> o/
21:02:33 <mattoliver> o/
21:03:00 <timburke> i'm glad to see most everybody's migrated over to OFTC, and that the meeting bot's working well for us here :-)
21:03:16 <timburke> as usual, the agenda's at https://wiki.openstack.org/wiki/Swift
21:03:38 <timburke> er, not that. https://wiki.openstack.org/wiki/Meetings/Swift
21:03:41 <timburke> that's the one
21:03:58 <timburke> first up
21:04:04 <timburke> #topic testing on ARM
21:04:58 <timburke> i wanted to see what opinions we might have about ARM jobs now that we've (1) got more jobs proposed (thanks mattoliver!) and (2) we've had a bit more time to think about it
21:05:34 <timburke> the good news, by the way, is that everything seems to Just Work -- libec, pyeclib, swift all have passing ARM jobs proposed
21:06:20 <timburke> they're taking a bit longer than the other jobs (~2x or so?) but at least for swift, they aren't the limiting factor
21:06:57 <mattoliver> yeah, and I added func, func encryption and a probe. So pretty good coverage I think
21:07:14 <timburke> i've got two main questions, and i'm not sure whether they're connected or not
21:07:36 <mattoliver> #link https://review.opendev.org/c/openstack/swift/+/793280
21:07:53 <mattoliver> #link https://review.opendev.org/c/openstack/pyeclib/+/793281
21:08:31 <timburke> #link https://review.opendev.org/c/openstack/swift/+/792867
21:08:45 <timburke> #link https://review.opendev.org/c/openstack/liberasurecode/+/793511
21:09:22 <timburke> first, should we have them in the main check queue or a separate check-arm64 queue? ricolin proposed it as a separate queue, but trying it out on the libec patch, a single queue seems to work fine
21:10:34 <timburke> second, should they be voting or not? they all seem to pass and if i saw one fail (*especially* if it was on a patch touching ctypes or something) i'd be inclined to figure out the failure before approving, personally
21:11:26 <mattoliver> well now that we know they seem to pass, I'm happy to have them voting, we can always turn them off again.
21:11:52 <acoles> +1
21:12:53 <mattoliver> the extra check pipeline, I'm not sure.. I thought I'd read somewhere it might have something to do with managing the arm64 resources.. but can't seem to figure out where I read it.. so might have been dreaming :P
21:14:59 <timburke> i seem to remember seeing something about that, too -- an ML thread, maybe?
21:16:31 <timburke> that also brings me to why i'm not sure whether the questions are connected or not: with two queues, we get two zuul responses -- if the arm jobs are voting, can the second response change the vote from the first? i can ask in -infra, i suppose...
21:17:18 <zaitcev> If the CI machines for ARM are reliable enough, then I think we want them voting. We don't want to get stuck just because something keeps crashing, but that risk balances against the upside of guarding against a breakage that is specific to ARM.
21:18:34 <timburke> i'm inclined to merge them non-voting to start, then revisit later (maybe a topic for the next PTG?)
21:19:42 <mattoliver> sure, sounds reasonable. point is we get to test on arm, which is pretty cool.
21:20:08 <timburke> for sure!
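[For reference, a rough sketch of the two options being weighed, as they might look in a project's .zuul.yaml; the job name below is a placeholder, not one of the jobs actually proposed.]

    - project:
        check:
          jobs:
            # option 1: run the ARM jobs in the main check queue, non-voting to start
            - swift-tox-func-py38-arm64:
                voting: false
        # option 2: keep them in the separate check-arm64 pipeline instead
        check-arm64:
          jobs:
            - swift-tox-func-py38-arm64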
21:20:33 <timburke> #topic train-em
21:21:34 <timburke> so at the end of this week, openstack as a whole is moving train to extended maintenance. i'm going to work on getting a release tagged before then. just a heads-up
21:21:35 <zaitcev> So... What is here to discuss?
21:21:42 <timburke> that was all :-)
21:21:51 <timburke> on to updates!
21:22:00 <timburke> #topic sharding and shrinking
21:22:15 <timburke> how's it going?
21:23:15 <timburke> we merged https://review.opendev.org/c/openstack/swift/+/792182 - Add absolute values for shard shrinking config options
21:23:53 <acoles> we noticed some intermittent gappy listings from sharded containers last week, turned out we had some shard range data stuck in memcache
21:24:37 <acoles> the root problem is memcache-related, but it caused us to realise that perhaps we should not be so tolerant of bad listing responses from shard containers
21:25:26 <timburke> leading to https://review.opendev.org/c/openstack/swift/+/793492 - Return 503 for container listings when shards are deleted
21:25:27 <acoles> so https://review.opendev.org/c/openstack/swift/+/793492 proposes to 503 if a shard listing does not succeed
21:26:55 <acoles> IIRC we originally thought a gappy listing was equivalent to eventual consistency, but with hindsight they are more like 'something isn't working'
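[A minimal sketch of the behaviour 793492 proposes, using made-up helper names rather than the actual proxy code: if any shard's listing can't be fetched, fail the whole request instead of returning a listing with a gap.]

    from swift.common.swob import HTTPServiceUnavailable

    def build_sharded_listing(req, shard_ranges, get_shard_listing):
        # get_shard_listing stands in for the proxy's per-shard GET and is
        # assumed to return None when a shard listing does not succeed
        listing = []
        for shard_range in shard_ranges:
            objs = get_shard_listing(req, shard_range)
            if objs is None:
                # before: skip the shard and return a gappy listing
                # proposed: treat it as an error the client should retry
                return HTTPServiceUnavailable(request=req)
            listing.extend(objs)
        return listing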
21:27:01 <mattoliver> And acoles has a patch for invalidating the shard listing cache which will hopefully make things much better
21:27:22 <acoles> mattoliver: actually I abandoned that :)
21:27:40 <mattoliver> oh, then I take that back.. he hasn't got one :P
21:28:02 <acoles> I decided that if the cause of the bad response was backend server workload then flushing the cache could just escalate the problem
21:28:14 <acoles> so not worth the risk
21:28:38 <mattoliver> oh fair enough, it was hard enough to find as it was
21:28:46 <acoles> given that memcache should expire entries, we just had an anomaly
21:29:32 <timburke> i think we need some more investigation into why the entry didn't expire properly, anyway
21:29:35 <acoles> I prefer the idea of including expiry time with the cached data, but I expect that's a bigger piece of work
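[A rough illustration of that idea, assuming a Swift-style memcache client with a set(key, value, time=ttl) interface; the key handling, TTL value and JSON layout here are made up.]

    import json
    import time

    SHARD_LISTING_TTL = 600  # illustrative only

    def cache_shard_ranges(memcache, key, shard_ranges):
        # store the intended expiry alongside the data itself
        memcache.set(key, json.dumps({
            'expires': time.time() + SHARD_LISTING_TTL,
            'shard_ranges': shard_ranges,
        }), time=SHARD_LISTING_TTL)

    def get_cached_shard_ranges(memcache, key):
        cached = memcache.get(key)
        if not cached:
            return None
        cached = json.loads(cached)
        if cached['expires'] < time.time():
            # treat an overdue entry as a miss, even if memcache never expired it
            return None
        return cached['shard_ranges']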
21:30:19 <acoles> anyway, that was the background to https://review.opendev.org/c/openstack/swift/+/793492
21:30:59 <timburke> once landed, do we think it's the sort of thing we ought to backport?
21:31:06 <mattoliver> I've learnt a lot about memcache (and mcrouter, which we use at NVIDIA). memcache should be able to supply the TTL with a 'me <key>' or something like that. I'll investigate that
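[A small sketch of checking a key's remaining TTL with memcached's meta debug command ('me <key>', available in newer memcached releases); whether mcrouter forwards meta commands is a separate question, and the response parsing here is approximate.]

    import socket

    def memcached_ttl(host, port, key):
        with socket.create_connection((host, port)) as sock:
            sock.sendall(b'me ' + key.encode() + b'\r\n')
            resp = sock.recv(4096).decode()
        # response looks roughly like: ME <key> exp=42 la=3 cas=7 fetch=yes ...
        for token in resp.split():
            if token.startswith('exp='):
                return int(token[4:])
        return None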
21:31:24 <timburke> (in light of all the sharding backports zaitcev has already done)
21:32:15 <acoles> timburke: maybe. shall we see how it goes in production first (just in case we uncover a can of worms)
21:32:48 <acoles> although, we've no reason to expect a can of worms :)
21:32:56 <mattoliver> We're slowly making progress on small shard tails. Have a new, simpler approach where we actually track the scan progress of the scanner to make it more reliable, and from that can make smarter decisions, without delving into efficient db queries or adding rows_per_shard = auto
21:33:37 <mattoliver> #link https://review.opendev.org/c/openstack/swift/+/793543
21:33:37 <timburke> nice
21:34:03 <acoles> mattoliver: I like the context idea in https://review.opendev.org/c/openstack/swift/+/793543
21:34:47 <mattoliver> I see acoles reviewed it, thanks! Will look at that again today.. yeah storing the upper might actually simplify the method.. that and/or the index.
21:34:54 <acoles> I was just a bit unsure about where we do the 'tiny-shard-squashing'
21:36:01 <timburke> anything else we ought to bring up for sharding? i'll be sure to add those two patches to the priority reviews page
21:36:05 <acoles> I also wondered if having per-db-replica context for scanning might help avoid split brain scanning??? but that's *another topic*
21:36:26 <mattoliver> it's the progress + shard_size + minimum > object_count line. Because that returns the end upper. but maybe I misunderstand.
21:36:48 <mattoliver> yeah! interesting, maybe it could.. but yeah, need to think about it more before we discuss that :P
21:38:25 <timburke> all right, i'll assume those are the two main tracks right now :-)
21:38:33 <timburke> #topic dark data watcher
21:38:56 <timburke> zaitcev, i saw some more updates on https://review.opendev.org/c/openstack/swift/+/788398 -- how's it going?
21:39:30 <zaitcev> timburke: I'm addressing comments by acoles
21:39:59 <zaitcev> Give me a day or two
21:40:11 <timburke> 👍
21:40:29 <zaitcev> Could we get this landed instead? https://review.opendev.org/c/openstack/swift/+/792713
21:40:33 <zaitcev> I mean in the meanwhile
21:40:36 <zaitcev> Not instead.
21:41:21 <timburke> i'll take a look, see about writing a test for it to demonstrate the difference
21:41:23 <zaitcev> Although ironically enough I was going to slip it through with no change in testing coverage.
21:41:39 <timburke> :P
21:41:48 <timburke> #topic open discussion
21:41:59 <timburke> anything else we ought to bring up this week?
21:42:27 <zaitcev> I was just about to type that the other change has better tests. However, it only emulates listings that miss the objects, not errors.
21:44:12 <acoles> we successfully quarantined a large number of isolated durable EC fragments in the last week using https://review.opendev.org/c/openstack/swift/+/788833
21:44:47 <timburke> our log-ingest pipeline seems much happier for it :-)
21:44:51 <acoles> and as a consequence eliminated a large number of error log messages :)
21:44:56 <zaitcev> Note that it's not the dark data plugin but the built-in replicator code that does that.
21:45:28 <timburke> oh -- i noticed that unlike with the object-updater and container-updater (which can use request path), the container-sharder doesn't give any indication what shard an update came from in container server logs -- so i proposed https://review.opendev.org/c/openstack/swift/+/793485 to stick the shard account/container in Referer
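[The gist of that change, sketched with made-up helper names rather than the actual patch: have the sharder include the shard's own path in Referer, since container servers already log that header for updater traffic.]

    def shard_update_headers(shard_account, shard_container, timestamp):
        # hypothetical helper; the point is just that the shard identifies
        # itself via Referer, which shows up in container-server request logs
        return {
            'X-Timestamp': timestamp,
            'Referer': 'PUT /%s/%s' % (shard_account, shard_container),
        }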
21:45:48 <zaitcev> Why do you guys quarantine them instead of deleting?
21:46:32 <mattoliver> timburke: nice
21:46:38 <zaitcev> Is there any doubt about the decision-making in that code? Looked pretty watertight to me. Just a general caution?
21:46:53 <acoles> timburke: I'll review that again
21:47:01 <timburke> thanks
21:47:08 <acoles> zaitcev: yes, caution
21:47:09 <mattoliver> just seemed better to quarantine than to just delete
21:47:20 <acoles> I'm averse to deleting things
21:47:39 <zaitcev> This contrasts with Alistair wanting to run object watcher with action=delete, which clearly has more avenues to fail and start deleting everything.
21:47:44 <timburke> zaitcev, yeah, general caution. our ops team will still need some tooling to wade through quarantines, though :-(
21:48:29 <acoles> zaitcev: I don't want to run dark data watcher! I'm worried for anyone that does (before these fixes get merged)
21:49:44 <mattoliver> I have been playing with some potential reconstructor improvements; the more interesting end of the chain is https://review.opendev.org/c/openstack/swift/+/793888 which, if it finds the fragment on the last known primary, will leave it for the handoff to push.. kinda a built-in handoffs_first if we're talking post rebalance.
21:51:18 <mattoliver> the last patch (that's linked) in the chain is skipping a partition if there have been a bunch already found on said partition. In some basic testing in my SAIO it sped up the post-rebalance reconstructor cycle quite a bit.
21:51:35 <mattoliver> but just playing around, scratching an itch.
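[A very loose sketch of the skip heuristic being described; every name and the threshold are invented for illustration, and the real patches are of course more involved.]

    def process_partition(frags, on_last_primary, reconstruct, skip_threshold=4):
        # if enough fragments in this partition turn out to already be on their
        # last known primary, assume we're in a post-rebalance state and let the
        # handoff node push the rest, rather than reconstructing them here
        already_in_place = 0
        for frag in frags:
            if on_last_primary(frag):
                already_in_place += 1
                if already_in_place >= skip_threshold:
                    return  # skip the rest of this partition
                continue  # leave this one for the handoff to push
            reconstruct(frag)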
21:52:41 <timburke> very cool -- it'd be interesting to play with that in a lab environment (and for that matter, to have some notion of "rebalance scenarios" for labs...)
21:53:54 <timburke> all right, i think we're about done then
21:54:03 <timburke> thank you all for coming, and thank you for working on swift!
21:54:07 <timburke> #endmeeting