21:01:01 <timburke> #startmeeting swift
21:01:02 <openstack> Meeting started Wed Oct 7 21:01:01 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:01:03 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:01:05 <openstack> The meeting name has been set to 'swift'
21:01:15 <timburke> who's here for the swift meeting?
21:01:20 <mattoliverau> o/ (mostly)
21:01:22 <rledisez> o/
21:01:23 <kota_> o/
21:01:29 <alecuyer> o/
21:01:57 <timburke> agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:05 <timburke> #topic TC election
21:02:34 <timburke> just a reminder that there's an election currently being held! vote!
21:03:04 <timburke> there are 7 candidates for (i believe) 4 seats
21:03:07 <timburke> #link https://governance.openstack.org/election/
21:03:44 <timburke> #topic ptg
21:04:20 <timburke> we're also just a couple weeks out from the virtual PTG!
21:04:36 <timburke> clayg's done a good job of seeding the etherpad
21:04:39 <timburke> #link https://etherpad.opendev.org/p/swift-ptg-wallaby
21:05:03 <timburke> i know *i* need to add some words about ALOs
21:05:30 <timburke> and probably some other topics
21:06:11 <timburke> #topic stable releases
21:07:09 <timburke> there are a couple patches currently working their way through the gate to get changelogs for 2.25.1 and 2.23.2
21:07:22 <timburke> p 756166 and p 756167
21:07:23 <patchbot> https://review.opendev.org/#/c/756166/ - swift (stable/ussuri) - Authors/ChangeLog for 2.25.1 - 1 patch set
21:07:25 <patchbot> https://review.opendev.org/#/c/756167/ - swift (stable/train) - ChangeLog for 2.23.2 - 1 patch set
21:08:07 <timburke> one of the big motivations for them (i feel like) is the backported fix for the py3 crypto bug
21:09:18 <timburke> just making sure people are aware of them and some of the great fixes they include :-)
21:09:36 <timburke> that's about it for announcements; any questions or comments?
21:10:10 <zaitcev> none on my side.
21:11:25 <timburke> all right, let's talk about some patches!
21:11:36 <clayg> o/
21:11:58 <timburke> #topic replication and ring version skew
21:12:01 <clayg> yeah, ALOs like s3 MPU!!! 👍
21:12:06 <timburke> #link https://review.opendev.org/#/c/754242/
21:12:07 <patchbot> patch 754242 - swift - Fix a race condition in case of cross-replication - 5 patch sets
21:12:37 <timburke> rledisez, i said i'd work on getting a dev env where i could actually repro the problem -- sorry, i haven't done that yet :-(
21:12:49 <rledisez> so, as a reminder, this patch only fix the issue for ssync. i'm working on a patch for rsync but nothing to propose yet
21:13:12 <clayg> tests look good - I don't think I had any specific concerns left over from last week - it might be ready to go?
21:13:16 <rledisez> i've testing the patch on prod where we we're seing the issue on a very regular basis. so far, so good, but i'll still monitore it closely
21:13:31 <rledisez> i'd like to share this
21:13:39 <rledisez> #link https://dl.plik.ovh/file/OIouMcSnLK2W2kwX/v8teIwe3Gapjjl2p/handoff-lock.png
21:13:56 <rledisez> it's a rebalance of an EC policy. we can see the patch avoided the issue many time
21:14:21 <rledisez> the ring distribution started at 18:00 and took 30 minutes
21:14:55 <rledisez> what bothers me is that even after the ring is distributed everywhere, we still get occurences of the reconstructor failing to lock the partition befor reverting it. I don't understand it
21:15:26 <timburke> cool! i'm loving that visual
21:16:36 <rledisez> what i want to try is to use a short timeout value for the reconstructor to see if it can improve perf (the rebalance without patch was taking 1h, with the patch it's 3h)
21:16:53 <rledisez> except than that, i'm done with it i think
21:17:11 <rledisez> waiting for reviews :)
21:17:15 <timburke> maybe previous-and-still primaries are doing actual rebuilds to the new primary, so it needs to grab the lock, too?
21:17:51 <timburke> that reminds me -- clayg, we need to make sure tsync is on the etherpad, too ;-)
21:17:53 <rledisez> it should be a perfect explanation, except that in our clusters we don't rebuild partition recently moved (to avoid to reconstruct something that should be moved)
21:18:10 <clayg> timburke: oh... yeah...
21:19:30 <rledisez> timburke: i think you're right, it's the rebuild, but not of moving partitions, of all the other partitions. so yeah, it should be fine. i'll check that but I think you got it
21:20:10 <timburke> alternatively, maybe there's new data landing in the partition, and the new primary's reconstructor locks it while checking in with neighbors? though you'd think that should be pretty fast
21:20:17 <clayg> tim doesn't even see the code anymore it's just 'primary partition, revert hand-off, orphaned slo segment...'
21:21:39 <timburke> all right, sounds like i need to do some reviews and rledisez is awesome
21:21:54 <timburke> #topic async cleanup of slo segments
21:22:34 <timburke> i think that's about ready to go -- thanks for the review, mattoliverau, i'll check to make sure i cover the more-than-one-container case
21:23:14 <timburke> in large part i left it on here as a segue to talking about ALOs ;-)
21:24:25 <zaitcev> Another Large Objects?
21:24:45 <timburke> i've been thinking "atomic", but yeah, that could work, too :P
21:25:35 <timburke> the gist of it is, i want to have something like SLOs, but where: the client API mirrors (basically exactly) S3's, the segments are all in the reserved namespace that we introduced for object versioning, and the segments get cleaned up asynchronously after delete/overwrite
21:25:55 <clayg> YES!! zaitcev ❤️ 🤣
21:27:06 <clayg> timburke: can we even make it so it's not racy and because we control the manifest names and segments even if an overwrite thinks it's just a create we still destroy the inaccessable segments at some point?
21:27:09 <timburke> that last part may get interesting -- i'm assuming it'll need to get plumbed clear down to diskfile's cleanup_ondisk_files -- basically, before unlinking the old version, scatter a whole bunch of async_pendings to schedule the deletes
21:27:37 <zaitcev> hopefully not an auditor plugin to clean segments hehe
21:27:38 <clayg> interesting! I was thinking it'd mostly happen the container layer! Can't wait to discuss!
21:27:48 <clayg> auditor plugins 😢
21:28:24 <timburke> hey, those are making progress! i keep seeing new patchests from david
21:29:13 <timburke> all right, that's all i've got on the agenda
21:29:18 <timburke> #topic open discussion
21:29:29 <timburke> anything else we ought to bring up today?
21:30:23 <zaitcev> I already engaged your attention about the dark data, so I have nothing. I'm going to look at Romain's patch seriously.
21:30:31 <clayg> timburke: i was late, so i couldn't say it when you mentioned it - but KUDOS on taking care of all those backports and stable branches man - that's really great
21:31:22 <timburke> i may have found a new lead on https://bugs.launchpad.net/swift/+bug/1710328 -- i suspect some of the iterable/iterator cleanup in https://review.opendev.org/#/c/755639/ may address it
21:31:24 <openstack> Launchpad bug 1710328 in OpenStack Object Storage (swift) "object server deadlocks when a worker thread logs something" [High,Fix released] - Assigned to Samuel Merritt (torgomatic)
21:31:24 <patchbot> patch 755639 - swift - New proxy logging field for wire status - 5 patch sets
21:31:50 <timburke> er, not *that* deadlock patch... https://bugs.launchpad.net/swift/+bug/1895739
21:31:50 <openstack> Launchpad bug 1895739 in OpenStack Object Storage (swift) "Proxy server sometimes deadlocks while logging client disconnect" [Undecided,In progress]
21:33:34 <zaitcev> wow
21:35:05 <zaitcev> BTW... I don't know if it's going to be helpful to you, but I generally found re-entrant locks to be rather harmful in kernel arena.
21:35:25 <zaitcev> They aren't saving from deadlocks if 2 entities are present.
21:35:46 <zaitcev> And, some people clearly have trouble considering how locks work.
21:36:32 <zaitcev> Obviously it was the case 10 years ago and I'm sure you can think about them. I'm just saying that re-entrant locks tend to make the code harder to comprehend for mere mortals.
21:37:21 <zaitcev> I saw it happen when someone tried to port AFS to Linux
21:37:43 <timburke> yeah -- my hope is that if i can clean up the close() calls, we won't be punting to gc to handle the generators and i won't need to touch the _active_limbo_lock thing *at all*
21:38:03 <zaitcev> And in drivers if an rwlock is used for re-entrancy property, it's a signal that the code is out of control and someone is band-aiding around the baggage.
21:38:23 <zaitcev> Okay.
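Editor's sketch of zaitcev's point that re-entrant locks do not save you from deadlock once two entities are involved. This is a generic Python illustration, not code from swift, eventlet, or either bug report; the function and variable names are made up for the example. Each thread re-enters its own RLock without trouble, yet neither can take the other's lock. Timeouts are used only so the demo reports the stall instead of hanging.

```python
import threading


def opposite_order_demo(timeout=0.2):
    """Two threads take two RLocks in opposite order and report the result."""
    lock_a, lock_b = threading.RLock(), threading.RLock()
    both_hold_first = threading.Barrier(2)  # both threads own their first lock
    done_trying = threading.Barrier(2)      # both attempts on the second lock are over
    outcome = {}

    def worker(name, first, second):
        with first:
            with first:  # re-entry by the owning thread always succeeds
                both_hold_first.wait()
                # The other thread owns `second`; re-entrancy does not help here.
                got = second.acquire(timeout=timeout)
                outcome[name] = got
                if got:
                    second.release()
                # Keep holding `first` until the peer has finished its attempt,
                # so the result does not depend on timing.
                done_trying.wait()

    threads = [
        threading.Thread(target=worker, args=("one", lock_a, lock_b)),
        threading.Thread(target=worker, args=("two", lock_b, lock_a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print(opposite_order_demo())
```

Without the timeouts, both threads would wait forever, which is the classic lock-ordering deadlock; swapping a plain lock for a re-entrant one only removes the single-owner case.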
21:39:24 <timburke> even when i *did* try swapping out the lock, it didn't actually fix the issue -- it'd still come up now and then, i think because of some craziness in eventlet
21:39:29 <timburke> it was gross.
21:40:34 <timburke> but now i've got that patch applied in the cluster where i first characterized the problem, and i'll try running the same sorts of workloads -- i guess we'll see in a month or whatever if it actually fixed it 🤮
21:40:54 <clayg> timburke: can you post the link to the jira again (i don't see it from the bug)
21:41:06 <clayg> lp bug #1895739
21:41:07 <openstack> Launchpad bug 1895739 in OpenStack Object Storage (swift) "Proxy server sometimes deadlocks while logging client disconnect" [Undecided,In progress] https://launchpad.net/bugs/1895739
21:41:07 <timburke> i really wish i had a reliable repro
21:41:14 <zaitcev> You know, I cannot see the emoji all that well. Is that face puking?
21:42:49 <timburke> yup -- i think that was one of the first emojis where i thought, "well that *is* difficult to express without emojis (at least, while remaining as succinct)"
21:43:49 <timburke> kota_, clayg, rledisez, mattoliverau how are we feeling about https://review.opendev.org/#/c/739164/ ?
21:43:49 <patchbot> patch 739164 - swift - ec: Add an option to write fragments with legacy crc - 3 patch sets
21:44:09 <timburke> https://review.opendev.org/#/c/738959/ landed -- i should cut a libec release
21:44:09 <patchbot> patch 738959 - liberasurecode - Be willing to write fragments with legacy crc (MERGED) - 4 patch sets
21:44:40 <timburke> (step 1: remember how we do that)
21:45:07 <clayg> hahah
21:45:29 <zaitcev> Yeah. I only know that it ends in tarballs.opendev.org eventually.
21:45:49 <rledisez> timburke: it seems right
21:45:51 <kota_> dependency is always hard problem i'm feeling
21:45:52 <timburke> i think we can just push a (signed) tag, but i bet tdasilva remembers
21:47:03 <zaitcev> If we're done, I need to go.
21:47:17 <timburke> kota_, the good news is that despite the Depends-On, the new swift code is perfectly happy working with old libec -- it'll set the env var, and nothing will actually look at it
21:48:08 <kota_> :P
21:50:36 <timburke> all right, seems like we're winding down
21:50:48 <timburke> thank you all for coming, and thank you for working on swift!
21:50:51 <timburke> #endmeeting
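Editor's sketch of the compatibility argument timburke makes about the legacy-crc option: the caller exports an environment variable, a newer library reads it, and an older library never looks at it, so setting it is harmless. The log does not name the variable or the option, so EXAMPLE_WRITE_LEGACY_CRC and every function below are hypothetical stand-ins, not swift's or liberasurecode's actual interface.

```python
import os

# Hypothetical variable name; the meeting log does not spell out the real one.
LEGACY_CRC_ENV_VAR = "EXAMPLE_WRITE_LEGACY_CRC"


def apply_legacy_crc_option(write_legacy_crc, environ=os.environ):
    """Export (or clear) the flag before the EC library is used."""
    if write_legacy_crc:
        environ[LEGACY_CRC_ENV_VAR] = "1"
    else:
        environ.pop(LEGACY_CRC_ENV_VAR, None)


def old_library_checksum_kind(environ=os.environ):
    """Stand-in for an older library: it never consults the variable."""
    return "legacy-crc"


def new_library_checksum_kind(environ=os.environ):
    """Stand-in for a newer library: it honours the variable when set."""
    if environ.get(LEGACY_CRC_ENV_VAR) == "1":
        return "legacy-crc"
    return "new-crc"
```

In this model the old library always writes the legacy checksum regardless of the flag, and the new one writes it only when asked, which is why the caller can set the variable unconditionally without checking the installed library version.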