opendevreview | Merged openstack/swift master: container-server: plumb includes down into _get_shard_range_rows https://review.opendev.org/c/openstack/swift/+/569847 | 01:28 |
opendevreview | Matthew Oliver proposed openstack/swift master: Repair Gaps: Add a probe test https://review.opendev.org/c/openstack/swift/+/841372 | 07:37 |
opendevreview | Matthew Oliver proposed openstack/swift master: Repair Gaps: Add a probe test https://review.opendev.org/c/openstack/swift/+/841372 | 07:40 |
opendevreview | Alistair Coles proposed openstack/swift master: manage-shard-ranges: add gap repair option https://review.opendev.org/c/openstack/swift/+/841143 | 14:31 |
opendevreview | Alistair Coles proposed openstack/swift master: manage-shard-ranges: add gap repair option https://review.opendev.org/c/openstack/swift/+/841143 | 15:51 |
opendevreview | Tim Burke proposed openstack/swift master: manage-shard-ranges: add gap repair option https://review.opendev.org/c/openstack/swift/+/841143 | 19:49 |
kota | morning | 21:00 |
timburke__ | #startmeeting swift | 21:00 |
opendevmeet | Meeting started Wed May 11 21:00:28 2022 UTC and is due to finish in 60 minutes. The chair is timburke__. Information about MeetBot at http://wiki.debian.org/MeetBot. | 21:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 21:00 |
opendevmeet | The meeting name has been set to 'swift' | 21:00 |
timburke__ | who's here for the swift meeting? | 21:00 |
*** | timburke__ is now known as timburke | 21:00 |
kota | hi | 21:00 |
mattoliver | o/ | 21:00 |
timburke | as usual, the agenda's at | 21:01 |
timburke | #link https://wiki.openstack.org/wiki/Meetings/Swift | 21:01 |
acoles | o/ | 21:02 |
timburke | but i mostly just want to follow up on the various threads of work i mentioned last week | 21:02 |
timburke | #topic ring v2 | 21:02 |
timburke | i don't think we've done further work on this yet -- can anyone spare some review cycles for the head of the chain? | 21:02 |
mattoliver | I can go back and look again, but probably need fresh eyes too. | 21:03 |
timburke | thanks 👍 | 21:04 |
timburke | #topic memcache errors | 21:04 |
timburke | clayg took a look and seemed to like it -- iirc we're going to try it out in prod this week and see how it goes | 21:05 |
timburke | i think it'll be of tremendous use to callers to be able to tell the difference between a cache miss and an error talking to memcached | 21:05 |
mattoliver | +1 | 21:06 |
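The miss-vs-error distinction above is the crux of the memcache errors work. Here is a minimal sketch of the idea, assuming a hypothetical wrapper around whatever memcache client is in use; none of the names below are Swift's actual API:

```python
# Sketch only: distinguishing a genuine cache miss from an error talking to
# memcached, so callers can react differently. All names are illustrative.
class CacheError(Exception):
    """Raised when the memcached backend could not be reached."""


MISS = object()   # sentinel distinct from any cached value (even None)


def cache_get(client, key):
    try:
        value = client.get(key)
    except (ConnectionError, OSError) as err:
        # Trouble talking to memcached: the caller should fall back to the
        # backend but probably skip re-populating the cache.
        raise CacheError(str(err))
    # A plain None from the client means the key simply wasn't there.
    return MISS if value is None else value
```

A caller then has three distinct outcomes to handle: a value (hit), MISS (go to the backend and set the cache), or CacheError (go to the backend, but treat the cache as unhealthy).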
timburke | #topic eventlet and logging handler locks | 21:06 |
mattoliver | I took a look at this in a vsaio. Threw a bunch of load at it, upped the workers. And seemed to all work fine. | 21:08 |
timburke | we're also going to try this out in prod. it may take a little longer to get data on it, though, since it seems riskier (and so requires ops to opt-in to the new behavior) | 21:08 |
timburke | good proof point! thanks mattoliver | 21:08 |
timburke | it'll be real interesting if we can notice a performance difference when we try it in prod | 21:09 |
mattoliver | I then wanted to go to an ELK environment to better tell me if there were logging issues, as Filebeat tries to break things into JSON | 21:09 |
mattoliver | But stalled getting the env set up | 21:10 |
mattoliver | Might have a play with it now that we'll have it in our staging cluster. | 21:10 |
timburke | 👍 | 21:11 |
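For context on the eventlet/logging-lock concern: each logging handler holds a lock created when the handler is constructed, so handlers built before eventlet monkey-patching keep native threading locks. One commonly cited mitigation, sketched here purely for illustration (this is not necessarily what the patch under review does), is to recreate the handler locks after patching:

```python
# Illustrative sketch only; logging._handlerList is a private attribute and is
# used here just to show the idea of refreshing pre-existing handler locks.
import logging

import eventlet

eventlet.monkey_patch(thread=True)

for ref in logging._handlerList:     # weak references to registered handlers
    handler = ref()
    if handler is not None:
        handler.createLock()         # new lock built from the patched module
```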
timburke | #topic backend rate limiting | 21:11 |
timburke | sorry, acoles, i'm pretty sure i promised reviews and haven't delivered | 21:11 |
acoles | hehe, no worries, I've not circled back to this yet (in terms of deploying at least) | 21:12 |
acoles | I think since last week we've had some thoughts about how the proxy should deal with 529s... | 21:12 |
acoles | originally I felt the proxy should not error limit. I still think that is the case given the difference in time scales (per-second rate limiting vs 60-second error limiting), BUT it could be mitigated by increasing the backend ratelimit buffer to match | 21:14 |
acoles | i.e. average backend ratelimiting over ~60secs | 21:14 |
acoles | also, we need to consider that there are N proxies making requests to each backend ratelimiter | 21:15 |
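To make the rate-buffer mitigation concrete: in a running-time style limiter, the buffer bounds how far the limiter's virtual clock may lag real time, so a larger buffer (say ~60s) lets the per-second limit be satisfied on average over a longer window. A minimal sketch, not Swift's backend ratelimit middleware:

```python
import time


class RateLimiter:
    """Toy running-time limiter; names and defaults are illustrative."""

    def __init__(self, max_rate, rate_buffer=5):
        self.max_rate = float(max_rate)        # allowed requests per second
        self.rate_buffer = float(rate_buffer)  # seconds of credit for bursts
        self.running_time = time.time()        # virtual time already allocated

    def is_allowed(self):
        now = time.time()
        # Cap accumulated credit at rate_buffer seconds, so an idle spell
        # can't be spent as one unbounded burst.
        self.running_time = max(self.running_time, now - self.rate_buffer)
        if self.running_time > now:
            return False    # over the limit; a backend might answer with 529
        self.running_time += 1.0 / self.max_rate
        return True
```

With rate_buffer around 60, roughly max_rate * 60 requests can be admitted before throttling kicks in, which averages the limit over about a minute and makes occasional 529s less likely to trip the proxy's 60-second error limiting.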
acoles | in the meantime, since last week I think, I added a patch to the chain to load the ratelimit config from a separate file, and to periodically reload | 21:17 |
timburke | do we already have plumbing to make the buffer time configurable? | 21:17 |
mattoliver | If only we had global error limiting 😜 | 21:17 |
mattoliver | Oh nice | 21:17 |
acoles | timburke: yes, rate_buffer can be configured in the upstream patch | 21:17 |
timburke | and do we want to make it an option in the proxy for whether to error-limit 529s or not? then we've put in a whole bunch of knobs that ops can try out, and hopefully they can run some experiments and tell us how they like it | 21:19 |
acoles | that's an idea I have considered, but not actioned yet | 21:20 |
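On the separate-file ratelimit config that acoles mentioned reloading periodically: a common pattern is to re-read the file only when its mtime changes, checked at most every so often. A hypothetical sketch; the section name, option handling, and defaults here are invented for illustration and are not the patch's actual interface:

```python
import os
import time
from configparser import ConfigParser


class ReloadingConf:
    """Re-reads a config file when its mtime changes; illustrative only."""

    def __init__(self, path, check_interval=60):
        self.path = path
        self.check_interval = check_interval
        self._mtime = None
        self._next_check = 0
        self._conf = {}

    def get(self):
        now = time.time()
        if now >= self._next_check:
            self._next_check = now + self.check_interval
            try:
                mtime = os.path.getmtime(self.path)
            except OSError:
                return self._conf    # file missing/unreadable: keep last good
            if mtime != self._mtime:
                parser = ConfigParser()
                parser.read(self.path)
                if parser.has_section('backend_ratelimit'):
                    self._conf = dict(parser.items('backend_ratelimit'))
                self._mtime = mtime
        return self._conf
```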
timburke | all right, that's it for old business | 21:20 |
timburke | #topic shard range gaps | 21:20 |
timburke | part of why i didn't get to reviewing the ratelimiting stuff -- i slipped up last week and broke some shard ranges in prod | 21:22 |
timburke | we were having trouble keeping shard ranges in cache -- memcache would often complain about being out of memory | 21:23 |
timburke | i wanted to relieve some of the memory pressure and knew there were shards getting cached that didn't have any objects in them, so i wanted to shrink them away | 21:23 |
* zaitcev | is getting excited at a possibility of dark data in production. | 21:23 |
timburke | unfortunately, there was also a transient shard-range overlap which blocked me from trying to compact | 21:24 |
timburke | i thought, "overlaps seem bad," and ran repair | 21:24 |
zaitcev | Not to wish evil on Tim and crew but, I kinda wish my work on watchers were helpful to someone. | 21:25 |
zaitcev | So what did repair do? | 21:25 |
timburke | only, the trouble was that (1) a shard had sharded, (2) at least one replica fully split up, and (3) at least one other replica reported back to the root. everybody was currently marked active, but it wasn't going to stay that way | 21:26 |
acoles | @zaitcev IIRC we try very hard to send object updates *somewhere* ultimately falling back to the root container 🤞 | 21:27 |
timburke | iirc, the child shards all got marked shrinking. so now the parent was trying to kick rows out to the children and delete itself, while the children were trying to send rows back to the parent and delete *themselves* | 21:27 |
timburke | eventually some combination succeeded well enough to where the children got marked shrunk and the parent got marked sharded. and now nobody's covering the range | 21:28 |
mattoliver | So the "overlap" was a normal shard event, just everything hadn't replicated. And after repair the shrink acceptor disappeared (deleted itself). | 21:29 |
timburke | acoles was quick to get some new code into swift-manage-shard-ranges to help us fix it | 21:29 |
timburke | #link https://review.opendev.org/c/openstack/swift/+/841143 | 21:29 |
timburke | and immediately following this meeting, i'm going to try it out :-) | 21:29 |
timburke | any questions or comments on any of that? | 21:31 |
mattoliver | Nope, but I've been living it 😀 interested to see how it goes | 21:32 |
zaitcev | I haven't even enabled shrinking, so none. | 21:32 |
mattoliver | Also it seems fixing the early ACTIVE/CLEAVED issue would've helped here. | 21:33 |
timburke | i was just going to mention that :-) | 21:34 |
mattoliver | So maybe we're seeing the early ACTIVE edge case in real life, rather than in theory. | 21:34 |
mattoliver | So we weren't crazy to work on it Al, that's a silver lining. | 21:35 |
timburke | the chain ending at https://review.opendev.org/c/openstack/swift/+/789651 likely would have prevented me getting into trouble, since i would've seen some mix of active and cleaved/created and no overlap would've been detected | 21:35 |
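For readers following along: the "nobody's covering the range" failure is a gap in the shard namespace. A toy model of how such a gap can be spotted by walking the ranges in order; swift-manage-shard-ranges does considerably more than this, and treating bounds as plain strings is a simplification:

```python
def find_gaps(shard_ranges):
    """shard_ranges: [(lower, upper), ...] sorted by lower bound, where the
    first lower is '' (start of the namespace). Toy model only."""
    gaps = []
    last_upper = ''
    for lower, upper in shard_ranges:
        # A lower bound beyond everything seen so far means nothing covers
        # the span in between.
        if last_upper and lower > last_upper:
            gaps.append((last_upper, lower))
        last_upper = max(last_upper, upper)
    return gaps
```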
timburke | all right, that's all i've got | 21:36 |
timburke | #topic open discussion | 21:36 |
timburke | what else should we bring up this week? | 21:36 |
mattoliver | As mentioned at the PTG, we think we have a better way of solving it without new states, but yeah, it would've helped definitely | 21:36 |
timburke | mattoliver, do we have a patch for that yet, or is it still mostly hypothetical? | 21:37 |
mattoliver | I've done the prereq patch, and I'm playing with a timing-out algorithm. But no, I still need to find time to write the rest of the code.. maybe once the gaps are filled ;) | 21:39 |
acoles | there's a few other improvements we have identified, like making the repair overlaps tool check for any obvious parent-child relationship between the overlapping ranges | 21:39 |
acoles | and also being wary of fixing recently created overlaps (that may be transient) | 21:40 |
acoles | but yeah, ideally we would eliminate the transient overlaps that are a feature of normal sharding | 21:40 |
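One way to picture the "be wary of recently created overlaps" idea: refuse to repair an overlap whose newest range is younger than some threshold, since it may simply be sharding that hasn't finished replicating. The threshold and field name below are invented for illustration, not real Swift options:

```python
import time

MIN_OVERLAP_AGE = 4 * 3600   # seconds; illustrative threshold only


def overlap_looks_repairable(overlapping_ranges, now=None):
    """overlapping_ranges: dicts with a 'timestamp' (epoch seconds) field."""
    now = time.time() if now is None else now
    newest = max(sr['timestamp'] for sr in overlapping_ranges)
    return (now - newest) >= MIN_OVERLAP_AGE
```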
timburke | all right, i think i'll get on with that repair then :-) | 21:43 |
timburke | thank you all for coming, and thank you for working on swift | 21:43 |
timburke | #endmeeting | 21:43 |