21:00:08 <timburke> #startmeeting swift
21:00:09 <opendevmeet> Meeting started Wed Jun 2 21:00:08 2021 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:10 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:12 <opendevmeet> The meeting name has been set to 'swift'
21:00:17 <timburke> who's here for the swift meeting?
21:00:44 <kota_> hi
21:01:07 <acoles> o/
21:02:33 <mattoliver> o/
21:03:00 <timburke> i'm glad to see most everybody's migrated over to OFTC, and that the meeting bot's working well for us here :-)
21:03:16 <timburke> as usual, the agenda's at https://wiki.openstack.org/wiki/Swift
21:03:38 <timburke> er, not that. https://wiki.openstack.org/wiki/Meetings/Swift
21:03:41 <timburke> that's the one
21:03:58 <timburke> first up
21:04:04 <timburke> #topic testing on ARM
21:04:58 <timburke> i wanted to see what opinions we might have about ARM jobs now that we've (1) got more jobs proposed (thanks mattoliver!) and (2) had a bit more time to think about it
21:05:34 <timburke> the good news, by the way, is that everything seems to Just Work -- libec, pyeclib, swift all have passing ARM jobs proposed
21:06:20 <timburke> they're taking a bit longer than the other jobs (~2x or so?) but at least for swift, they aren't the limiting factor
21:06:57 <mattoliver> yeah, and I added func, func encryption and a probe job. So pretty good coverage I think
21:07:14 <timburke> i've got two main questions, and i'm not sure whether they're connected or not
21:07:36 <mattoliver> #link https://review.opendev.org/c/openstack/swift/+/793280
21:07:53 <mattoliver> #link https://review.opendev.org/c/openstack/pyeclib/+/793281
21:08:31 <timburke> #link https://review.opendev.org/c/openstack/swift/+/792867
21:08:45 <timburke> #link https://review.opendev.org/c/openstack/liberasurecode/+/793511
21:09:22 <timburke> first, should we have them in the main check queue or a separate check-arm64 queue? ricolin proposed it as a separate queue, but trying it out on the libec patch, a single queue seems to work fine
21:10:34 <timburke> second, should they be voting or not? they all seem to pass, and if i saw one fail (*especially* if it was on a patch touching ctypes or something) i'd be inclined to figure out the failure before approving, personally
21:11:26 <mattoliver> well now that we know they seem to pass, I'm happy to have them voting; we can always turn them off again.
21:11:52 <acoles> +1
21:12:53 <mattoliver> the extra check pipeline, I'm not sure.. I thought I'd read somewhere it might have something to do with managing the arm64 resources.. but can't seem to figure out where I read it.. so might have been dreaming :P
21:14:59 <timburke> i seem to remember seeing something about that, too -- an ML thread, maybe?
21:16:31 <timburke> that also brings me to why i'm not sure whether the questions are connected or not: with two queues, we get two zuul responses -- if the arm jobs are voting, can the second response change the vote from the first? i can ask in -infra, i suppose...
21:17:18 <zaitcev> If the CI machine set for ARM is reliable enough, then I think we want them voting. We don't want to get stuck just because something keeps crashing. That balances against the upside of guarding against a breakage that is specific to ARM.
21:18:34 <timburke> i'm inclined to merge them non-voting to start, then revisit later (maybe a topic for the next PTG?)
21:19:42 <mattoliver> sure, sounds reasonable. point is we get to test on arm, which is pretty cool.
21:20:08 <timburke> for sure!
21:20:33 <timburke> #topic train-em
21:21:34 <timburke> so at the end of this week, openstack as a whole is moving train to extended maintenance. i'm going to work on getting a release tagged before then. just a heads-up
21:21:35 <zaitcev> So... What is there to discuss?
21:21:42 <timburke> that was all :-)
21:21:51 <timburke> on to updates!
21:22:00 <timburke> #topic sharding and shrinking
21:22:15 <timburke> how's it going?
21:23:15 <timburke> we merged https://review.opendev.org/c/openstack/swift/+/792182 - Add absolute values for shard shrinking config options
21:23:53 <acoles> we noticed some intermittent gappy listings from sharded containers last week; turned out we had some shard range data stuck in memcache
21:24:37 <acoles> the root problem is memcache related, but it caused us to realise that perhaps we should not be so tolerant of bad listing responses from shard containers
21:25:26 <timburke> leading to https://review.opendev.org/c/openstack/swift/+/793492 - Return 503 for container listings when shards are deleted
21:25:27 <acoles> so https://review.opendev.org/c/openstack/swift/+/793492 proposes to 503 if a shard listing does not succeed
21:26:55 <acoles> IIRC we originally thought a gappy listing was equivalent to eventual consistency, but with hindsight it's more like 'something isn't working'
21:27:01 <mattoliver> And acoles has a patch for invalidating the shard listing cache which will hopefully make things much better
21:27:22 <acoles> mattoliver: actually I abandoned that :)
21:27:40 <mattoliver> oh, then I take that back.. he hasn't got one :P
21:28:02 <acoles> I decided that if the cause of the bad response was backend server workload then flushing the cache could just escalate the problem
21:28:14 <acoles> so not worth the risk
21:28:38 <mattoliver> oh fair enough, it was hard enough to find as it was
21:28:46 <acoles> given that memcache should expire entries, we just had an anomaly
21:29:32 <timburke> i think we need some more investigation into why the entry didn't expire properly, anyway
21:29:35 <acoles> I prefer the idea of including expiry time with the cached data, but I expect that's a bigger piece of work
21:30:19 <acoles> anyway, that was the background to https://review.opendev.org/c/openstack/swift/+/793492
21:30:59 <timburke> once landed, do we think it's the sort of thing we ought to backport?
21:31:06 <mattoliver> I've learnt a lot about memcache (and mcrouter, which we use at NVIDIA). memcache should be able to supply the TTL with a 'me <key>' command or something like that. I'll investigate that
21:31:24 <timburke> (in light of all the sharding backports zaitcev has already done)
21:32:15 <acoles> timburke: maybe. shall we see how it goes in production first (just in case we uncover a can of worms)
21:32:48 <acoles> although, we've no reason to expect a an of worms :)
21:32:56 <mattoliver> We're slowly making progress on small shard tails. Have a new, simpler approach where we actually track the scan progress of the scanner to make it more reliable, and from that can make smarter decisions. And not delve into efficient db queries or adding rows_per_shard = auto
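(A minimal sketch of the "include expiry time with the cached data" idea acoles mentions above; it is not code from any of the patches under discussion. The absolute expiry travels with the shard range payload, so a reader can treat anything past its expiry as a miss even if memcache failed to evict it. The cache client's set(key, value, time=...)/get(key) interface and the TTL value are assumptions for illustration.)

```python
import json
import time

SHARD_CACHE_TTL = 600  # illustrative TTL in seconds, not a real Swift default


def cache_shard_ranges(cache, key, shard_ranges, ttl=SHARD_CACHE_TTL):
    # Store the data together with its intended absolute expiry time.
    payload = {
        'expires_at': time.time() + ttl,
        'shard_ranges': shard_ranges,
    }
    cache.set(key, json.dumps(payload), time=ttl)


def get_cached_shard_ranges(cache, key):
    # Return cached shard ranges, or None if missing or past expiry.
    raw = cache.get(key)
    if raw is None:
        return None
    payload = json.loads(raw)
    if time.time() >= payload['expires_at']:
        # memcache should have evicted this already; treat a lingering
        # entry (the "stuck" case discussed above) as a miss.
        return None
    return payload['shard_ranges']
```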
21:33:00 <acoles> s/an/can/
21:33:37 <mattoliver> #link https://review.opendev.org/c/openstack/swift/+/793543
21:33:37 <timburke> nice
21:34:03 <acoles> mattoliver: I like the context idea in https://review.opendev.org/c/openstack/swift/+/793543
21:34:47 <mattoliver> I see acoles reviewed it, thanks! Will look at that again today.. yeah storing the upper might actually simplify the method.. that and/or the index.
21:34:54 <acoles> I was just a bit unsure about where we do the 'tiny-shard-squashing'
21:36:01 <timburke> anything else we ought to bring up for sharding? i'll be sure to add those two patches to the priority reviews page
21:36:05 <acoles> I also wondered if having per-db-replica context for scanning might help avoid split-brain scanning??? but that's *another topic*
21:36:26 <mattoliver> it's the progress + shard_size + minimum > object_count line. Because that returns the end upper. but maybe I misunderstand.
21:36:48 <mattoliver> yeah! interesting, maybe it could.. but yeah, need to think about it more before we discuss that :P
21:38:25 <timburke> all right, i'll assume those are the two main tracks right now :-)
21:38:33 <timburke> #topic dark data watcher
21:38:56 <timburke> zaitcev, i saw some more updates on https://review.opendev.org/c/openstack/swift/+/788398 -- how's it going?
21:39:30 <zaitcev> timburke: I'm addressing comments by acoles
21:39:59 <zaitcev> Give me a day or two
21:40:11 <timburke> 👍
21:40:29 <zaitcev> Could we get this landed instead? https://review.opendev.org/c/openstack/swift/+/792713
21:40:33 <zaitcev> I mean in the meanwhile
21:40:36 <zaitcev> Not instead.
21:41:21 <timburke> i'll take a look, see about writing a test for it to demonstrate the difference
21:41:23 <zaitcev> Although ironically enough I was going to slip it through with no change in testing coverage.
21:41:39 <timburke> :P
21:41:48 <timburke> #topic open discussion
21:41:59 <timburke> anything else we ought to bring up this week?
21:42:27 <zaitcev> I was just about to type that the other change has better tests. However, it only emulates listings that miss the objects, not errors.
21:44:12 <acoles> we successfully quarantined a large number of isolated durable EC fragments in the last week using https://review.opendev.org/c/openstack/swift/+/788833
21:44:47 <timburke> our log-ingest pipeline seems much happier for it :-)
21:44:51 <acoles> and as a consequence eliminated a large number of error log messages :)
21:44:56 <zaitcev> Note that it's not the dark data plugin but the built-in replicator code that does that.
21:45:28 <timburke> oh -- i noticed that unlike with the object-updater and container-updater (which can use the request path), the container-sharder doesn't give any indication what shard an update came from in container server logs -- so i proposed https://review.opendev.org/c/openstack/swift/+/793485 to stick the shard account/container in Referer
21:45:48 <zaitcev> Why are you guys quarantine them instead of deleting?
21:45:57 <zaitcev> s/ are / do /
21:46:32 <mattoliver> timburke: nice
21:46:38 <zaitcev> Is there any doubt about the decision-making in that code? Looked pretty watertight to me. Just a general caution?
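(A rough illustration of the Referer idea timburke describes above; it is not the actual change in https://review.opendev.org/c/openstack/swift/+/793485, and the helper name is made up. The point is simply that putting the shard's account/container in the Referer header of the update request makes the originating shard visible in container-server request logs.)

```python
from urllib.parse import quote


def shard_update_headers(shard_account, shard_container, extra=None):
    # Hypothetical helper: build headers for a shard's update to its root
    # container, recording where the update came from in Referer.
    headers = dict(extra or {})
    headers['Referer'] = quote('/%s/%s' % (shard_account, shard_container))
    return headers


# e.g. shard_update_headers('.shards_AUTH_test', 'c-hash-0123')
# -> {'Referer': '/.shards_AUTH_test/c-hash-0123'}
```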
21:46:53 <acoles> timburke: I'll review that again
21:47:01 <timburke> thanks
21:47:08 <acoles> zaitcev: yes, caution
21:47:09 <mattoliver> just seemed better to quarantine than to just delete
21:47:20 <acoles> I'm averse to deleting things
21:47:39 <zaitcev> This contrasts with Alistair wanting to run the object watcher with action=delete, which clearly has more avenues to fail and start deleting everything.
21:47:44 <timburke> zaitcev, yeah, general caution. our ops team will still need some tooling to wade through quarantines, though :-(
21:48:29 <acoles> zaitcev: I don't want to run the dark data watcher! I'm worried for anyone that does (before these fixes get merged)
21:49:44 <mattoliver> I have been playing with some potential reconstructor improvements; the more interesting end of the chain is https://review.opendev.org/c/openstack/swift/+/793888, which, if it finds the fragment on the last known primary, will leave it for the handoff to push.. kinda a built-in handoffs_first if we're talking post-rebalance.
21:51:18 <mattoliver> the last patch (that's linked) in the chain skips a partition if a bunch have already been found on said partition. In some basic testing in my SAIO it sped up the post-rebalance reconstructor cycle quite a bit.
21:51:35 <mattoliver> but just playing around, scratching an itch.
21:52:41 <timburke> very cool -- it'd be interesting to play with that in a lab environment (and for that matter, to have some notion of "rebalance scenarios" for labs...)
21:53:54 <timburke> all right, i think we're about done then
21:54:03 <timburke> thank you all for coming, and thank you for working on swift!
21:54:07 <timburke> #endmeeting
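(A self-contained sketch of the two reconstructor heuristics mattoliver describes near the end of the meeting, not the actual patch chain ending at https://review.opendev.org/c/openstack/swift/+/793888: if the fragment is found on the last known primary, leave it for the handoff to push instead of reconstructing it, and skip the rest of a partition once a bunch of such fragments have been found. The data shapes, names, and threshold are all invented for illustration and do not match Swift's reconstructor internals.)

```python
SKIP_THRESHOLD = 10  # invented cutoff for "a bunch already found on said partition"


def plan_partition(missing_frags, last_known_primary, threshold=SKIP_THRESHOLD):
    # missing_frags: iterable of (frag_index, node_found_on) pairs -- an
    # invented shape, just enough to show the decision-making.
    found_on_old_primary = 0
    plan = []
    for frag_index, node in missing_frags:
        if node == last_known_primary:
            # The fragment still lives on its old primary (now a handoff);
            # leave it for that node to push here rather than rebuilding it
            # from the other fragments -- a built-in handoffs_first.
            found_on_old_primary += 1
            plan.append((frag_index, 'leave-for-handoff'))
            if found_on_old_primary >= threshold:
                # Looks like plain post-rebalance movement; skip the rest
                # of this partition this cycle and revisit later.
                plan.append(('rest-of-partition', 'skip'))
                break
        else:
            plan.append((frag_index, 'reconstruct'))
    return plan
```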