21:01:01 <timburke> #startmeeting swift
21:01:02 <openstack> Meeting started Wed Oct  7 21:01:01 2020 UTC and is due to finish in 60 minutes.  The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:01:03 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:01:05 <openstack> The meeting name has been set to 'swift'
21:01:15 <timburke> who's here for the swift meeting?
21:01:20 <mattoliverau> o/ (mostly)
21:01:22 <rledisez> o/
21:01:23 <kota_> o/
21:01:29 <alecuyer> o/
21:01:57 <timburke> agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:05 <timburke> #topic TC election
21:02:34 <timburke> just a reminder that there's an election currently being held! vote!
21:03:04 <timburke> there are 7 candidates for (i believe) 4 seats
21:03:07 <timburke> #link https://governance.openstack.org/election/
21:03:44 <timburke> #topic ptg
21:04:20 <timburke> we're also just a couple weeks out from the virtual PTG!
21:04:36 <timburke> clayg's done a good job of seeding the etherpad
21:04:39 <timburke> #link https://etherpad.opendev.org/p/swift-ptg-wallaby
21:05:03 <timburke> i know *i* need to add some words about ALOs
21:05:30 <timburke> and probably some other topics
21:06:11 <timburke> #topic stable releases
21:07:09 <timburke> there are a couple patches currently working their way through the gate to get changelogs for 2.25.1 and 2.23.2
21:07:22 <timburke> p 756166 and p 756167
21:07:23 <patchbot> https://review.opendev.org/#/c/756166/ - swift (stable/ussuri) - Authors/ChangeLog for 2.25.1 - 1 patch set
21:07:25 <patchbot> https://review.opendev.org/#/c/756167/ - swift (stable/train) - ChangeLog for 2.23.2 - 1 patch set
21:08:07 <timburke> one of the big motivations for them (i feel like) is the backported fix for the py3 crypto bug
21:09:18 <timburke> just making sure people are aware of them and some of the great fixes they include :-)
21:09:36 <timburke> that's about it for announcements; any questions or comments?
21:10:10 <zaitcev> none on my side.
21:11:25 <timburke> all right, let's talk about some patches!
21:11:36 <clayg> o/
21:11:58 <timburke> #topic replication and ring version skew
21:12:01 <clayg> yeah, ALOs like s3 MPU!!! 👍
21:12:06 <timburke> #link https://review.opendev.org/#/c/754242/
21:12:07 <patchbot> patch 754242 - swift - Fix a race condition in case of cross-replication - 5 patch sets
21:12:37 <timburke> rledisez, i said i'd work on getting a dev env where i could actually repro the problem -- sorry, i haven't done that yet :-(
21:12:49 <rledisez> so, as a reminder, this patch only fixes the issue for ssync. i'm working on a patch for rsync but have nothing to propose yet
21:13:12 <clayg> tests look good - I don't think I had any specific concerns left over from last week - it might be ready to go?
21:13:16 <rledisez> i've been testing the patch in prod, where we were seeing the issue on a very regular basis. so far, so good, but i'll still monitor it closely
21:13:31 <rledisez> i'd like to share this
21:13:39 <rledisez> #link https://dl.plik.ovh/file/OIouMcSnLK2W2kwX/v8teIwe3Gapjjl2p/handoff-lock.png
21:13:56 <rledisez> it's a rebalance of an EC policy. we can see the patch avoided the issue many times
21:14:21 <rledisez> the ring distribution started at 18:00 and took 30 minutes
21:14:55 <rledisez> what bothers me is that even after the ring is distributed everywhere, we still get occurrences of the reconstructor failing to lock the partition before reverting it. I don't understand it
21:15:26 <timburke> cool! i'm loving that visual
21:16:36 <rledisez> what i want to try is to use a short timeout value for the reconstructor to see if it can improve perf (the rebalance without the patch was taking 1h, with the patch it's 3h)
21:16:53 <rledisez> other than that, i'm done with it, i think
21:17:11 <rledisez> waiting for reviews :)
21:17:15 <timburke> maybe previous-and-still primaries are doing actual rebuilds to the new primary, so it needs to grab the lock, too?
21:17:51 <timburke> that reminds me -- clayg, we need to make sure tsync is on the etherpad, too ;-)
21:17:53 <rledisez> it would be a perfect explanation, except that in our clusters we don't rebuild partitions that were recently moved (to avoid reconstructing something that's about to be moved anyway)
21:18:10 <clayg> timburke: oh... yeah...
21:19:30 <rledisez> timburke: i think you're right, it's the rebuild, though not of the moving partitions but of all the other partitions. so yeah, it should be fine. i'll check that, but I think you got it
21:20:10 <timburke> alternatively, maybe there's new data landing in the partition, and the new primary's reconstructor locks it while checking in with neighbors? though you'd think that should be pretty fast
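A minimal sketch (not the actual patch 754242) of the locking idea under discussion: the reverting side takes a per-partition lock with a short timeout and simply skips the partition if someone else holds it. It assumes swift.common.utils.lock_path and swift.common.exceptions.LockTimeout; the revert_lock_timeout parameter is hypothetical.

    import os

    from swift.common.exceptions import LockTimeout
    from swift.common.utils import lock_path


    def revert_partition(devices, device, policy_dir, partition,
                         revert_lock_timeout=2):
        # e.g. /srv/node/d1/objects-1/12345
        part_path = os.path.join(devices, device, policy_dir, str(partition))
        try:
            with lock_path(part_path, timeout=revert_lock_timeout):
                pass  # ssync/revert the handoff partition under the lock
        except LockTimeout:
            return False  # a primary is busy with it; retry on the next pass
        return True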
21:20:17 <clayg> tim doesn't even see the code anymore it's just 'primary partition, revert hand-off, orphaned slo segment...'
21:21:39 <timburke> all right, sounds like i need to do some reviews and rledisez is awesome
21:21:54 <timburke> #topic async cleanup of slo segments
21:22:34 <timburke> i think that's about ready to go -- thanks for the review, mattoliverau, i'll check to make sure i cover the more-than-one-container case
21:23:14 <timburke> in large part i left it on here as a segue to talking about ALOs ;-)
21:24:25 <zaitcev> Another Large Objects?
21:24:45 <timburke> i've been thinking "atomic", but yeah, that could work, too :P
21:25:35 <timburke> the gist of it is, i want to have something like SLOs, but where: the client API mirrors (basically exactly) S3's, the segments are all in the reserved namespace that we introduced for object versioning, and the segments get cleaned up asynchronously after delete/overwrite
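Purely to make that concrete, a hypothetical sketch of how ALO segment names could sit in the null-byte reserved namespace that object versioning introduced; the container/object layout shown here is an assumption, not a decided design.

    # Hypothetical naming only: the real layout is a PTG discussion topic.
    RESERVED = '\x00'

    def alo_segment_container(user_container):
        # e.g. user container 'docs' -> hidden container '\x00segments\x00docs'
        return RESERVED + 'segments' + RESERVED + user_container

    def alo_segment_name(user_object, upload_id, part_number):
        # e.g. '\x00report.pdf\x00<upload-id>\x00000001'
        return RESERVED + RESERVED.join(
            [user_object, upload_id, '%06d' % part_number])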
21:25:55 <clayg> YES!!  zaitcev ❤️ 🤣
21:27:06 <clayg> timburke: can we even make it so it's not racy? since we control the manifest names and segments, even if an overwrite thinks it's just a create, we still destroy the inaccessible segments at some point?
21:27:09 <timburke> that last part may get interesting -- i'm assuming it'll need to get plumbed clear down to diskfile's cleanup_ondisk_files -- basically, before unlinking the old version, scatter a whole bunch of async_pendings to schedule the deletes
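A rough sketch of that last idea, re-purposing the async_pending machinery (which today only carries container updates) to persist segment deletes before the old manifest is unlinked. DiskFileManager.pickle_async_update is called the same way the object server calls it; the hook point, the DELETE op payload, and the segment dicts are all assumptions.

    def schedule_segment_deletes(diskfile_mgr, device, segments, timestamp,
                                 policy):
        # segments: [{'account': ..., 'container': ..., 'obj': ...}, ...]
        for seg in segments:
            data = {'op': 'DELETE',
                    'account': seg['account'],
                    'container': seg['container'],
                    'obj': seg['obj'],
                    'headers': {'X-Timestamp': timestamp.internal}}
            diskfile_mgr.pickle_async_update(
                device, seg['account'], seg['container'], seg['obj'],
                data, timestamp, policy)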
21:27:37 <zaitcev> hopefully not an auditor plugin to clean segments hehe
21:27:38 <clayg> interesting!  I was thinking it'd mostly happen the container layer!  Can't wait to discuss!
21:27:48 <clayg> auditor plugins 😢
21:28:24 <timburke> hey, those are making progress! i keep seeing new patchests from david
21:29:13 <timburke> all right, that's all i've got on the agenda
21:29:18 <timburke> #topic open discussion
21:29:29 <timburke> anything else we ought to bring up today?
21:30:23 <zaitcev> I already engaged your attention about the dark data, so I have nothing. I'm going to look at Romain's patch seriously.
21:30:31 <clayg> timburke: i was late, so i couldn't say it when you mentioned it - but KUDOS on taking care of all those backports and stable branches man - that's really great
21:31:22 <timburke> i may have found a new lead on https://bugs.launchpad.net/swift/+bug/1710328 -- i suspect some of the iterable/iterator cleanup in https://review.opendev.org/#/c/755639/ may address it
21:31:24 <openstack> Launchpad bug 1710328 in OpenStack Object Storage (swift) "object server deadlocks when a worker thread logs something" [High,Fix released] - Assigned to Samuel Merritt (torgomatic)
21:31:24 <patchbot> patch 755639 - swift - New proxy logging field for wire status - 5 patch sets
21:31:50 <timburke> er, not *that* deadlock patch... https://bugs.launchpad.net/swift/+bug/1895739
21:31:50 <openstack> Launchpad bug 1895739 in OpenStack Object Storage (swift) "Proxy server sometimes deadlocks while logging client disconnect" [Undecided,In progress]
21:33:34 <zaitcev> wow
21:35:05 <zaitcev> BTW... I don't know if it's going to be helpful to you, but I generally found re-entrant locks to be rather harmful in the kernel arena.
21:35:25 <zaitcev> They don't save you from deadlocks if 2 entities are involved.
21:35:46 <zaitcev> And some people clearly have trouble reasoning about how locks work.
21:36:32 <zaitcev> Obviously that was the case 10 years ago, and I'm sure you can reason about them. I'm just saying that re-entrant locks tend to make the code harder to comprehend for mere mortals.
21:37:21 <zaitcev> I saw it happen when someone tried to port AFS to Linux
21:37:43 <timburke> yeah -- my hope is that if i can clean up the close() calls, we won't be punting to gc to handle the generators and i won't need to touch the _active_limbo_lock thing *at all*
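A generic illustration of why those close() calls matter: if a generator that owns a resource is only ever finalized by the garbage collector, its cleanup runs at some arbitrary later point (and, under eventlet, possibly in an awkward context); closing it explicitly makes the cleanup deterministic. Self-contained toy example, nothing swift-specific.

    class FakeResp(object):
        def __iter__(self):
            return iter([b'a', b'b', b'c'])

        def close(self):
            print('resp closed')

    def body_iter(resp):
        try:
            for chunk in resp:
                yield chunk
        finally:
            resp.close()  # runs when the generator is exhausted or closed

    it = body_iter(FakeResp())
    print(next(it))
    it.close()  # finalize now, instead of whenever gc gets around to it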
21:38:03 <zaitcev> And in drivers, if an rwlock is used for its re-entrancy property, it's a signal that the code is out of control and someone is band-aiding around the baggage.
21:38:23 <zaitcev> Okay.
21:39:24 <timburke> even when i *did* try swapping out the lock, it didn't actually fix the issue -- it'd still come up now and then, i think because of some craziness in eventlet
21:39:29 <timburke> it was gross.
21:40:34 <timburke> but now i've got that patch applied in the cluster where i first characterized the problem, and i'll try running the same sorts of workloads -- i guess we'll see in a month or whatever if it actually fixed it 🤮
21:40:54 <clayg> timburke: can you post the link to the jira again (i don't see it from the bug)
21:41:06 <clayg> lp bug #1895739
21:41:07 <openstack> Launchpad bug 1895739 in OpenStack Object Storage (swift) "Proxy server sometimes deadlocks while logging client disconnect" [Undecided,In progress] https://launchpad.net/bugs/1895739
21:41:07 <timburke> i really wish i had a reliable repro
21:41:14 <zaitcev> You know, I cannot see the emoji all that well. Is that face puking?
21:42:49 <timburke> yup -- i think that was one of the first emojis where i thought, "well that *is* difficult to express without emojis (at least, while remaining as succinct)"
21:43:49 <timburke> kota_, clayg, rledisez, mattoliverau how are we feeling about https://review.opendev.org/#/c/739164/ ?
21:43:49 <patchbot> patch 739164 - swift - ec: Add an option to write fragments with legacy crc - 3 patch sets
21:44:09 <timburke> https://review.opendev.org/#/c/738959/ landed -- i should cut a libec release
21:44:09 <patchbot> patch 738959 - liberasurecode - Be willing to write fragments with legacy crc (MERGED) - 4 patch sets
21:44:40 <timburke> (step 1: remember how we do that)
21:45:07 <clayg> hahah
21:45:29 <zaitcev> Yeah. I only know that it ends in tarballs.opendev.org eventually.
21:45:49 <rledisez> timburke: it seems right
21:45:51 <kota_> dependencies are always a hard problem, i feel
21:45:52 <timburke> i think we can just push a (signed) tag, but i bet tdasilva remembers
21:47:03 <zaitcev> If we're done, I need to go.
21:47:17 <timburke> kota_, the good news is that despite the Depends-On, the new swift code is perfectly happy working with old libec -- it'll set the env var, and nothing will actually look at it
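Roughly why the ordering is harmless, as a sketch: the swift side just exports an environment variable that an older liberasurecode never reads. The variable name here is recalled from the patches and should be treated as an assumption.

    import os

    def configure_legacy_crc(write_legacy_ec_crc):
        # new libec reads this and writes fragments with the legacy crc;
        # old libec never looks at it, so setting it is a no-op there
        if write_legacy_ec_crc:
            os.environ['LIBERASURECODE_WRITE_LEGACY_CRC'] = '1'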
21:48:08 <kota_> :P
21:50:36 <timburke> all right, seems like we're winding down
21:50:48 <timburke> thank you all for coming, and thank you for working on swift!
21:50:51 <timburke> #endmeeting