21:00:51 #startmeeting swift
21:00:58 who's here for the swift meeting?
21:01:10 o/
21:01:14 hello
21:01:54 huh. the meetbot message didn't trigger...
21:02:10 :(
21:03:24 o/ sorry I'm late
21:03:29 i'll save a transcript myself, then find someone to talk to about uploading it to eavesdrop, i guess
21:04:31 sounds nice
21:04:42 as usual, the agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:05:01 first up, i wanted to bring up https://bugs.launchpad.net/swift/+bug/1928494
21:05:11 `Deleted locks can still be acquired`
21:06:28 i noticed while doing some partition cleanup for the relinker that if you delete a lock file, anybody that was waiting on it becoming available can later acquire the deleted lock
21:06:35 ...which seems not-great
21:07:36 i've got a fix up at https://review.opendev.org/c/openstack/swift/+/791022, but acoles did some benchmarking and noticed that it came with a 20-25% performance hit in acquiring a lock
21:07:38 yeah, seems like a race condition that is hard to solve.
21:09:05 so i recently pushed up a new patchset that provides an escape hatch by means of a new config option
21:10:51 nice how you managed to plumb it into constraints
21:11:11 it's a bit of a kludge; i don't really expect we'll want it to merge as is, but it should give us (nvidia) a way to test it out in prod and see whether that performance hit is the sort of thing that noticeably impacts clients
21:12:30 acoles, yeah -- i started out plumbing it through as a new arg to lock_path, and trying to trace all of the callers and plumb it through *there*, too... it got pretty ugly
21:12:45 this seemed more expedient
21:13:02 so a couple of questions i've got:
21:13:24 1) does it seem reasonable that we'd only really need the escape hatch for wsgi servers?
21:13:54 expedient seems ok given that we may never merge the conf option
21:15:31 2) if our testing at nvidia makes it seem like there's minimal client-observable impact, are we comfortable rolling back to the previous patchset (which didn't have the config option and just always made locks more robust/expensive), or do we worry that other deployments may still want an escape hatch?
21:16:23 kota_, zaitcev that second question might mainly be directed at you guys ;-)
21:16:46 timburke_: I joined late so I don't know what review we're discussing.
21:17:10 https://review.opendev.org/c/openstack/swift/+/791022
21:17:20 looking
21:17:21 `lock_path: Prevent aquiring deleted/moved locks`
21:18:19 at the moment, it's got a config option to fall back to the old behavior without the extra stat calls, but i don't actually *like* adding the new option
21:20:41 it's also the sort of thing that you guys can think about a bit and voice an opinion later. we're probably going to package that for next week, which means it'll be something like two weeks before we've had it running in prod long enough to know whether we needed the escape hatch ;-)
21:21:22 re Q 1, the stats will make it slower for non-wsgi services to release the lock, which could impact the wsgi servers waiting on the lock. Which others are there? relinker I know.
21:22:34 but relinker can be ratelimited with files-per-second
21:22:37 replicator (A, C, & O), reconstructor, sharder at least
21:24:29 how would the stats in lock_path() make it any slower to release the lock?
21:25:17 it does them after acquiring it but before doing the Real Work for which it grabbed the lock
21:25:32 acoles, ok, i'll plan on respinning with a change in daemon.py similar to what i did in wsgi.py, and moving the config option up into [DEFAULT]
21:26:15 mattoliverau: slower to yield from lock_path so therefore slower til it is released
21:26:32 timburke_: 👍
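The fix under discussion boils down to re-checking the lock file after the flock succeeds: stat the path and fstat the descriptor, and if they no longer name the same inode (or the path is gone), the lock file was deleted or replaced while waiting, so drop it and retry. Below is a minimal sketch of that technique, not the code in the review; `acquire_robust_lock` is a made-up name, and Swift's real `lock_path` is a context manager that also handles timeouts and lock-file naming, which this omits.

```python
import errno
import fcntl
import os


def acquire_robust_lock(path):
    """Open and flock ``path``, retrying if the lock file was deleted or
    replaced while we were blocked waiting for the lock.

    Returns the locked file descriptor; closing it releases the lock.
    """
    while True:
        fd = os.open(path, os.O_CREAT | os.O_WRONLY)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            # The extra stat calls under discussion: make sure the path
            # still refers to the file we just locked.
            try:
                path_stat = os.stat(path)
            except OSError as err:
                if err.errno != errno.ENOENT:
                    raise
                path_stat = None  # lock file was deleted out from under us
            fd_stat = os.fstat(fd)
            if path_stat is not None and (
                    (path_stat.st_dev, path_stat.st_ino)
                    == (fd_stat.st_dev, fd_stat.st_ino)):
                return fd
        except Exception:
            os.close(fd)
            raise
        # Stale lock: drop it and try again against the recreated file.
        os.close(fd)
```

The extra os.stat/os.fstat pair on every acquisition is where the reported 20-25% overhead comes from, hence the config option to fall back to the old behavior.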
21:27:37 ok, i suggest we bring this up again next week (or even in two weeks, when we hope to have more results)
21:27:39 on to updates
21:27:51 #topic sharding
21:28:06 acoles, i saw that there's a new patch chain going about sharder options
21:28:52 yes!
21:29:59 I was addressing all the very useful review comments on my original attempt to have a single function to load sharder and s-m-s-r options, and realised it could all be simpler
21:30:14 So I now have an alternative patchset https://review.opendev.org/c/openstack/swift/+/792177
21:30:31 (just the sharder.py changes differ from the original)
21:31:29 I made a separate patchset in case it turned out bad so the original is still there, but I think https://review.opendev.org/c/openstack/swift/+/792177 is better, and removes some of the repetition
21:31:49 IDK why I didn't see it the first time I wrote it
21:32:35 i'll try to take a look this afternoon, now that i've mostly got the original take loaded into my head ;-)
21:33:39 me too
21:33:44 any other sharding developments we ought to be looking at?
21:33:49 BTW, I addressed comments on the old patch https://review.opendev.org/c/openstack/swift/+/778989, and then revised it for the new, and as I said it is just sharder.py that changed
21:34:00 timburke_: mattoliverau sorry for the churn :(
21:34:23 hey if it's better then nothing to apologise for :)
21:34:57 an FYI that a bug fix landed for overlaps that were not being reported https://review.opendev.org/c/openstack/swift/+/791485
21:36:51 oh yeah, and some swift-manage-shard-ranges patches have landed since last week
21:37:22 I still don't understand a thing about any of it, but I'm trying to keep an eye on what you guys are doing.
21:37:55 all right, moving on
21:38:01 #topic relinker
21:38:54 main thing to point out is that after inserting the locking fix ahead of https://review.opendev.org/c/openstack/swift/+/790305, i rebased that patch off of master
21:39:08 i think we can address them as orthogonal issues
21:40:06 kk
21:40:43 aside from that, i still want to do another review on the SIGTERM handling that mattoliverau proposed
21:41:33 yeah, I've got it using the process group like in Daemon but really need to test to see what happens with multiple workers.. I should get around to doing that. thanks for the reminder :)
21:43:04 #topic dark data watcher
21:43:45 zaitcev, sorry, i still haven't gotten around to reviewing your fix :-(
21:44:18 https://review.opendev.org/c/openstack/swift/+/788398
21:44:42 It's okay, all in good time
21:45:05 I'm more concerned about https://review.opendev.org/c/openstack/swift/+/787656
21:45:16 zaitcev: I'll try to look too
21:45:28 I need Alistair to explain some of his objections.
21:46:10 Because without understanding that I cannot fix it up. I think the basic approach should work, maybe with adding some options to GET of containers.
21:47:21 Also, there's a fundamental question there. Basically, what do we do if any failure occurs when asking a container server?
21:47:55 IIRC my concern was that data would be removed if primary container server(s) failed to respond
21:47:57 Originally, I thought it was okay, as long as any one server replies. We just take that reply as authoritative.
21:48:44 Right. So, the scenario is that one of the failed nodes has the record for the object, but all of the nodes that replied do not.
21:49:02 with the recent patch to quarantine stale EC frags (a kind of dark data) we require all servers to respond 404 https://review.opendev.org/c/openstack/swift/+/788833
21:49:17 zaitcev: right
21:50:03 I think it helps if the object is known to be 'older' (your patch) because we'd hope that all replicas would have been moved to primaries
21:50:15 Alright. I can change it to the absolute criteria. The worst that can happen is that the dark data watcher is effectively disabled if any one of the container servers is offline.
21:51:16 The only thing that bothers me is that an administrator may not know it, and sometimes they keep a server down for years -- right until the time another one fails and there's no quorum anymore.
21:51:43 Well in that case they probably have worse problems
21:53:02 yeah, we've seen that sort of problem too :-(
21:53:15 ok. I'm going to update Tim's patch 787656 and ask you to review it.
21:53:27 thanks!
21:53:29 ok
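The "absolute criteria" agreed here amounts to: an object only counts as dark data when every primary container server answered and none of them has a record for it; if any primary is unreachable, leave the object alone. A minimal sketch of that decision follows, assuming a hypothetical `is_dark_data` helper fed one result per primary; the real watcher would build these results from its container requests.

```python
# Sketch of the "all primaries must answer, and all must say 404"
# criterion discussed above; is_dark_data() is a hypothetical helper,
# not the actual watcher code.

def is_dark_data(record_per_primary):
    """Decide whether an on-disk object should be treated as dark data.

    ``record_per_primary`` has one entry per primary container node:
    True if that node has a listing record for the object, False if it
    answered 404, None if it failed to respond at all.
    """
    if not record_per_primary:
        return False
    if any(result is None for result in record_per_primary):
        # A primary is down or timed out: err on the side of keeping
        # the object.
        return False
    # Only dark if every primary answered and none of them knows it.
    return not any(record_per_primary)
```

So `is_dark_data([False, False, False])` is True, while `is_dark_data([False, None, False])` and `is_dark_data([False, True, False])` are both False; the None case is exactly the trade-off noted above, where one permanently-down container server effectively disables the watcher.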
21:53:35 all right, last few minutes
21:53:43 #topic open discussion
21:53:51 anything else we ought to bring up this week?
21:53:54 oh, it's nasty. The rest of the cluster already deleted tombstones many times over.
21:54:31 I don't think I have anything.
21:54:44 nor me
21:55:29 oh, hey -- looks like there may be some question about whether we'll continue meeting on freenode: http://lists.openstack.org/pipermail/openstack-discuss/2021-May/022539.html
21:56:11 I wonder if we should periodically write down a time stat, and then fail to come up and start replicating or answering if it's outside of reclaim_age or something. Not sure how it's done.. just a thought.
21:56:51 oh yeah, freenode.. there might be a lot of churn for openstack soon.
21:57:01 The split is the talk of the town, no doubt. Now the leaders of the two organizations are entering mortal combat and a pissing contest over who's more diverse and inclusive and thus gets that sweet, sweet sponsorship money.
21:57:26 lol
21:58:42 all right, i'm calling it
21:58:54 thank you all for coming, and thank you for working on swift!
21:59:00 #endmeeting