21:00:28 #startmeeting swift
21:00:28 Meeting started Wed May 11 21:00:28 2022 UTC and is due to finish in 60 minutes. The chair is timburke__. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:28 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:28 The meeting name has been set to 'swift'
21:00:35 who's here for the swift meeting?
21:00:46 hi
21:00:52 o/
21:01:53 as usual, the agenda's at
21:01:59 #link https://wiki.openstack.org/wiki/Meetings/Swift
21:02:08 o/
21:02:19 but i mostly just want to follow up on the various threads of work i mentioned last week
21:02:28 #topic ring v2
21:02:55 i don't think we've done further work on this yet -- can anyone spare some review cycles for the head of the chain?
21:03:42 I can go back and look again, but probably need fresh eyes too.
21:04:03 thanks 👍
21:04:24 #topic memcache errors
21:05:04 clayg took a look and seemed to like it -- iirc we're going to try it out in prod this week and see how it goes
21:05:42 i think it'll be of tremendous use to callers to be able to tell the difference between a cache miss and an error talking to memcached
21:06:10 +1
21:06:51 #topic eventlet and logging handler locks
21:08:24 I took a look at this in a vsaio. Threw a bunch of load at it, upped the workers. And it all seemed to work fine.
21:08:29 we're also going to try this out in prod. it may take a little longer to get data on it, though, since it seems riskier (and so requires ops to opt in to the new behavior)
21:08:46 good proof point! thanks mattoliver
21:09:34 it'll be real interesting if we can notice a performance difference when we try it in prod
21:09:51 I then wanted to go to an ELK environment to better tell me if there were any logging issues, as Filebeat tries to break things into JSON
21:10:02 But stalled getting the env set up
21:10:32 Might have a play with it now that we'll have it in our staging cluster.
21:11:04 👍
21:11:20 #topic backend rate limiting
21:11:38 sorry, acoles, i'm pretty sure i promised reviews and haven't delivered
21:12:21 hehe, no worries, I've not circled back to this yet (in terms of deploying at least)
21:12:50 I think since last week we've had some thoughts about how the proxy should deal with 529s...
21:14:11 originally I felt the proxy should not error limit. I still think that is the case given the difference in time scales (per-second rate limiting vs the 60-second error limit), BUT it could be mitigated by increasing the backend ratelimit buffer to match
21:14:37 i.e. average backend ratelimiting over ~60secs
21:15:29 also, we need to consider there are N proxies sending requests to each backend ratelimiter
21:17:05 in the meantime, since last week I think, I added a patch to the chain to load the ratelimit config from a separate file, and to periodically reload it
21:17:06 do we already have plumbing to make the buffer time configurable?
21:17:16 If only we had global error limiting 😜
21:17:31 Oh nice
21:17:58 timburke: yes, rate_buffer can be configured in the upstream patch
21:19:01 and do we want to make it an option in the proxy for whether to error-limit 529s or not? then we've put out a whole bunch of knobs that ops can try, and hopefully they can run some experiments and tell us how they like it
21:20:09 that's an idea I have considered, but not actioned yet
21:20:47 all right, that's it for old business
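
As context for the rate_buffer discussion above: the idea is that a per-device limiter spaces requests out to a target per-second rate, and the buffer controls how far ahead of that schedule callers may burst, which is what would let the limit be averaged over a longer window such as ~60 seconds. The sketch below is illustrative only -- it is not Swift's actual AbstractRateLimiter or backend ratelimit middleware, and the class and parameter names are assumptions.

    import time


    class SimpleRateLimiter(object):
        """Illustrative only: allow up to max_rate calls per second, with a
        rate_buffer (in seconds) of burst allowance."""

        def __init__(self, max_rate, rate_buffer=5):
            self.interval = 1.0 / max_rate   # notional seconds charged per call
            self.rate_buffer = rate_buffer   # how far behind "now" we may lag
            self.running_time = 0.0          # notional clock of work admitted

        def is_allowed(self, now=None):
            now = time.time() if now is None else now
            # never let the notional clock lag more than rate_buffer behind,
            # so at most rate_buffer * max_rate calls can burst through
            self.running_time = max(self.running_time, now - self.rate_buffer)
            if self.running_time > now:
                return False                 # over budget right now
            self.running_time += self.interval
            return True


    # e.g. 10/s with a 60s buffer admits roughly 600 back-to-back calls,
    # then throttles to 10/s -- averaging the limit over about a minute
    limiter = SimpleRateLimiter(max_rate=10, rate_buffer=60)
    print(sum(limiter.is_allowed() for _ in range(1000)))  # ~600
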
21:20:54 #topic shard range gaps
21:22:01 part of why i didn't get to reviewing the ratelimiting stuff -- i slipped up last week and broke some shard ranges in prod
21:23:24 we were having trouble keeping shard ranges in cache -- memcache would often complain about being out of memory
21:23:44 i wanted to relieve some of the memory pressure, and knew there were shards getting cached that didn't have any objects in them, so i wanted to shrink them away
21:23:57 * zaitcev is getting excited at the possibility of dark data in production.
21:24:26 unfortunately, there was also a transient shard-range overlap which blocked me from trying to compact
21:24:46 i thought, "overlaps seem bad," and ran repair
21:25:33 Not to wish evil on Tim and crew, but I kinda wish my work on watchers were helpful to someone.
21:25:58 So what did repair do?
21:26:30 only, the trouble was that (1) a shard had sharded, (2) at least one replica fully split up, and (3) at least one other replica reported back to the root. everybody was currently marked active, but it wasn't going to stay that way
21:27:21 @zaitcev IIRC we try very hard to send object updates *somewhere*, ultimately falling back to the root container 🤞
21:27:51 iirc, the child shards all got marked shrinking. so now the parent was trying to kick rows out to the children and delete itself, while the children were trying to send rows back to the parent and delete *themselves*
21:28:45 eventually some combination succeeded well enough that the children got marked shrunk and the parent got marked sharded. and now nobody's covering the range
21:29:22 So the "overlap" was a normal shard event, just everything hadn't replicated. And after repair the shrink acceptor disappeared (deleted itself).
21:29:33 acoles was quick to get some new code into swift-manage-shard-ranges to help us fix it
21:29:35 #link https://review.opendev.org/c/openstack/swift/+/841143
21:29:52 and immediately following this meeting, i'm going to try it out :-)
21:31:46 any questions or comments on any of that?
21:32:28 Nope, but I've been living it 😀 interested to see how it goes
21:32:39 I haven't even enabled shrinking, so none.
21:33:59 Also seems fixing the early ACTIVE/CLEAVED would've helped here.
21:34:18 i was just going to mention that :-)
21:34:45 So maybe we're seeing the early ACTIVE edge case in real life, rather than in theory.
21:35:14 So we weren't crazy to work on it, Al, that's a silver lining.
21:35:37 the chain ending at https://review.opendev.org/c/openstack/swift/+/789651 likely would have prevented me getting into trouble, since i would've seen some mix of active and cleaved/created and no overlap would've been detected
21:36:28 all right, that's all i've got
21:36:32 #topic open discussion
21:36:40 what else should we bring up this week?
21:36:57 As mentioned at the PTG we think we have a better way of solving it without new states, but yeah, it would've helped definitely
21:37:58 mattoliver, do we have a patch for that yet, or is it still mostly hypothetical?
21:39:11 I've done the prereq patch, and am playing with a timing-out algorithm. But no, still need to find time to write the rest of the code.. maybe once the gaps are filled ;)
21:39:39 there's a few other improvements we have identified, like making the repair overlaps tool check for any obvious parent-child relationship between the overlapping ranges
21:40:27 and also being wary of fixing recently created overlaps (that may be transient)
21:40:58 but yeah, ideally we would eliminate the transient overlaps that are a feature of normal sharding
21:43:37 all right, i think i'll get on with that repair then :-)
21:43:54 thank you all for coming, and thank you for working on swift
21:43:58 #endmeeting
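
For anyone trying to picture the gap described in the meeting: once the parent was marked sharded and the children had shrunk themselves away, no active shard range covered part of the namespace. Detecting that is essentially a walk over the sorted shard-range bounds looking for spans nobody covers (gaps) or spans covered more than once (overlaps). The helper below is a simplified, hypothetical sketch of that check -- not the swift-manage-shard-ranges implementation, which works on ShardRange objects and their states.

    def find_gaps_and_overlaps(shard_ranges):
        """Report namespace nobody covers (gaps) and namespace covered more
        than once (overlaps).  shard_ranges is a list of (lower, upper) name
        bounds, with '' meaning the very start of the namespace."""
        gaps, overlaps = [], []
        expected_lower = ''   # where the next range ought to start
        for lower, upper in sorted(shard_ranges):
            if lower > expected_lower:
                gaps.append((expected_lower, lower))
            elif lower < expected_lower:
                overlaps.append((lower, min(upper, expected_lower)))
            expected_lower = max(expected_lower, upper)
        return gaps, overlaps


    # the situation described above: the parent sharded and its children
    # shrunk away, so nothing is left covering ('cat', 'dog']
    ranges = [('', 'cat'), ('dog', 'giraffe'), ('giraffe', 'zebra')]
    print(find_gaps_and_overlaps(ranges))   # ([('cat', 'dog')], [])
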
06:59:27 Hi, I am working on enabling Keystone audit middleware in Swift. Since swift_proxy_server supports middleware, I am trying to add the audit filter to the pipeline and enable auditing for the Swift service. But audit events are not getting generated. As per my analysis, events are not getting notified. Is this a known issue, or is Keystone audit middleware not supported for Swift?
13:38:26 Alistair Coles proposed openstack/swift master: trivial: add comment re sharder misplaced found stat https://review.opendev.org/c/openstack/swift/+/841592
13:38:49 ^^ easy review :)
13:39:31 trivial change, but I lost time confused by the code
15:31:01 Alistair Coles proposed openstack/swift master: sharder: ensure that misplaced tombstone rows are moved https://review.opendev.org/c/openstack/swift/+/841612
18:13:43 Merged openstack/swift master: trivial: add comment re sharder misplaced found stat https://review.opendev.org/c/openstack/swift/+/841592
02:48:45 Takashi Kajinami proposed openstack/swift master: Add missing services to sample rsyslog.conf https://review.opendev.org/c/openstack/swift/+/841673
03:57:49 Tim Burke proposed openstack/swift master: Add --test-config option to WSGI servers https://review.opendev.org/c/openstack/swift/+/833124
03:57:50 Tim Burke proposed openstack/swift master: Add a swift-reload command https://review.opendev.org/c/openstack/swift/+/833174
03:57:50 Tim Burke proposed openstack/swift master: systemd: Send STOPPING/RELOADING notifications https://review.opendev.org/c/openstack/swift/+/837633
03:57:51 Tim Burke proposed openstack/swift master: Add abstract sockets for process notifications https://review.opendev.org/c/openstack/swift/+/837641
07:38:28 Alistair Coles proposed openstack/swift master: sharder: ensure that misplaced tombstone rows are moved https://review.opendev.org/c/openstack/swift/+/841612
07:42:50 mattoliver: ^^^ do you have any recollection if there was a reason to process misplaced object rows in order undeleted followed by deleted?
08:19:41 I was going to say, in case a delete was issued after a put but the delete hits first. But that's only an issue if a delete isn't logged when the object isn't already there? Surely it is. Maybe it was to make sure objects that were there got to their destinations first, to minimise missing objects from listings.
08:20:39 Maybe we should treat it more like a journal, and move objects in row order (including deleted) from the start.
08:21:57 If there were a lot of deletes, like in an expired-objects container, then maybe it was so it looks like it's making progress if you move the deleted=0 rows first 🤷‍♂️
08:51:52 mattoliver, hey, I see that the tempurl patches are in good shape, thanks for that https://review.opendev.org/c/openstack/swift/+/525771 are we currently just waiting for reviews?
08:51:54 do you need any help with the patches?
08:57:00 Yeah, I'll double back to them and check, but yeah, it's waiting on reviews I believe. I'll poke people about it in the next meeting if it isn't reviewed by then.
08:57:38 If you want to double check and review them too (if you haven't already) that'll be great! (On my phone so don't have them handy to check).
08:58:10 thanks, I'll do that, and ask my colleague
16:44:58 Merged openstack/swift master: memcached: Give callers the option to accept errors https://review.opendev.org/c/openstack/swift/+/839448
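
The memcached patch that just merged relates to the "memcache errors" topic from the meeting: letting callers tell a genuine cache miss apart from a failure to talk to memcached at all. The sketch below shows the calling pattern in general terms only; the FakeCache class, the raise_on_error flag, and the MemcacheConnectionError name are stand-ins and assumptions here, not necessarily Swift's exact API.

    class MemcacheConnectionError(Exception):
        """Error talking to memcached (name assumed for this sketch)."""


    class FakeCache(object):
        """Tiny in-memory stand-in for a memcache client, for demo only."""
        def __init__(self):
            self.store = {}
            self.broken = False

        def get(self, key, raise_on_error=False):
            if self.broken:
                if raise_on_error:
                    raise MemcacheConnectionError('cannot reach memcached')
                return None   # old behaviour: an error looks just like a miss
            return self.store.get(key)

        def set(self, key, value):
            self.store[key] = value


    def get_cached(cache, key, backend_lookup):
        """Treat a miss and an error differently: a miss repopulates the
        cache, while an error skips the cache entirely (no point in set())."""
        try:
            value = cache.get(key, raise_on_error=True)
        except MemcacheConnectionError:
            return backend_lookup()          # cache unavailable: fall back
        if value is None:
            value = backend_lookup()         # genuine miss: fetch and cache
            cache.set(key, value)
        return value


    cache = FakeCache()
    print(get_cached(cache, 'shard-ranges', lambda: 'from-backend'))  # miss path
    cache.broken = True
    print(get_cached(cache, 'shard-ranges', lambda: 'from-backend'))  # error path
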
23:54:57 Tim Burke proposed openstack/swift master: container: Add delimiter-depth query param https://review.opendev.org/c/openstack/swift/+/829605
05:28:39 Tim Burke proposed openstack/swift master: container: Add delimiter-depth query param https://review.opendev.org/c/openstack/swift/+/829605
04:30:16 Hi, I am working on enabling Keystone audit middleware in Swift. Since swift_proxy_server supports middleware, I am trying to add the audit filter to the pipeline and enable auditing for the Swift service. But audit events are not getting generated. As per my analysis, events are not getting notified. Is this a known issue, or is Keystone audit middleware not supported for Swift?
18:41:27 Tim Burke proposed openstack/swift master: Distinguish workers by their args https://review.opendev.org/c/openstack/swift/+/841989
20:51:28 Merged openstack/swift master: Refactor rate-limiting helper into a class https://review.opendev.org/c/openstack/swift/+/834960
06:24:29 Matthew Oliver proposed openstack/swift master: ring v2 serialization: more test coverage follow up https://review.opendev.org/c/openstack/swift/+/842040
08:15:19 Merged openstack/swift master: AbstractRateLimiter: add option to burst on start-up https://review.opendev.org/c/openstack/swift/+/835122
19:30:37 timburke__: apologies, I won't be able to make today's meeting
20:54:58 Merged openstack/swift master: Add missing services to sample rsyslog.conf https://review.opendev.org/c/openstack/swift/+/841673
21:03:35 meeting?
21:04:30 I'll poke tim
21:05:05 thx mattoliver
21:05:47 I didn't forget this time, meaning it's probably going to be jinxed with Tim's sick child or something.
21:06:14 :/
21:06:20 lol
21:08:32 Still no response from him
21:11:08 sorry, got distracted by an issue at home
21:11:14 timburke__: Error: Can't start another meeting, one is in progress. Use #endmeeting first.
21:11:41 that's a weird error
21:11:41 wha...
21:11:45 #endmeeting