21:00:05 <timburke> #startmeeting swift
21:00:06 <openstack> Meeting started Wed May 6 21:00:05 2020 UTC and is due to finish in 60 minutes. The chair is timburke. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:09 <openstack> The meeting name has been set to 'swift'
21:00:12 <timburke> who's here for the swift meeting?
21:00:18 <seongsoocho> o/
21:00:36 <rledisez> hi o/
21:00:54 <kota_> hi
21:00:57 <mattoliverau> o/
21:01:13 <alecuyer> o/
21:01:50 <clayg> o/
21:02:04 <timburke> agenda's at https://wiki.openstack.org/wiki/Meetings/Swift
21:02:16 <timburke> #topic PTG
21:02:53 <timburke> mattoliverau very graciously offered to help with PTG planning/organization
21:03:04 <clayg> ❤️ mattoliverau
21:03:21 <mattoliverau> hey cool.
21:03:46 <timburke> mattoliverau, what have you learned?
21:03:51 <mattoliverau> So we need to get the registration in and start putting sessions into times
21:04:01 <mattoliverau> sorry, irc is lagging
21:04:33 <mattoliverau> So step 1, to get the registration in, I need to know if there are any other projects you're interested in so we can try and avoid overlap
21:04:47 <mattoliverau> I already have storlets and first contact sig
21:05:30 <rledisez> I would be interested in Keystone (things like operator feedback etc…)
21:05:47 <mattoliverau> Next we need to know what time suits people because of timezones. For this I'll come up with a doodle poll and post the link in our channel.
21:06:08 <mattoliverau> rledisez: great!
21:07:20 <mattoliverau> I'll hold the doodle poll, once I've created it, open until the end of Friday (my time, though I can wait a bit longer) just so people have a chance to pick times.
21:07:54 <mattoliverau> I'm guessing meeting time isn't bad, but I'm happy to get up early or stay up late if need be.
21:08:31 <timburke> i'm feeling the same way -- my current plan is to just show up in irc as much as possible that week :-D
21:08:43 <kota_> yup. in the day time, it might be hard because the kids always interrupt me.
21:09:11 <mattoliverau> But step 3 is to make sure all the topics you want to talk about are in the etherpad! So when we have times we can decide the number of blocks and get rooms booked.
21:09:30 <kota_> oic
21:09:56 <mattoliverau> most of this I think happens by the 10th, though this could just be the registration side.
21:09:57 <timburke> #link https://etherpad.opendev.org/p/swift-ptg-victoria
21:10:21 <clayg> so there IS an in-person presence?
21:10:23 <mattoliverau> but I'd like to get some virtual rooms booked before our best times get taken by other projects
21:10:32 <clayg> oh, virtual rooms
21:10:41 <mattoliverau> *needs to happen by
21:11:14 <mattoliverau> room booking spreadsheet
21:11:20 <mattoliverau> #link https://ethercalc.openstack.org/126u8ek25noy
21:11:30 <mattoliverau> if you want to see what it currently looks like.
21:11:36 <timburke> kota_, otoh, we'll get to know each other's families so much better than we typically do just from pictures :-)
21:12:05 <kota_> timburke: :)
21:12:18 <mattoliverau> lol
21:12:19 <kota_> good idea
21:12:32 <mattoliverau> Anyway, sorry for the brain dump
21:13:05 <timburke> don't apologize! that was just the sort of overview i was hoping for and never got around to putting together myself
21:13:21 <timburke> again, thank you so much for taking that on, and sorry i didn't ask for help earlier
21:13:59 <mattoliverau> basically, 1. if there is any project you're interested in, let me know; 2. fill out the doodle poll once I get a link up later today; 3. update the etherpad
21:14:05 <mattoliverau> timburke: nps
21:14:29 <timburke> #topic object updater
21:14:46 <clayg> updater 😡
21:14:59 <timburke> rledisez, thanks for putting this on the agenda! i think you may have noticed that we've been interested in this lately, too ;-)
21:15:22 <rledisez> Yeah, I thought I'd bring the point up here cause some of us are having issues with it (at least we do, at ovh :))
21:15:37 <rledisez> there are mostly 2 issues in my mind:
21:16:17 <rledisez> the first one is that async-pendings can quickly pile up on disks for many reasons (unsharded big container, network issue, hung process, …)
21:16:56 <rledisez> and while some of them will never be able to be handled by the updater (at least without operator intervention), some of them can be handled because it was a really transient situation (like a switch reboot…)
21:17:14 <rledisez> the first review is about that: p 571917
21:17:15 <patchbot> https://review.opendev.org/#/c/571917/ - swift - Manage async_pendings priority per containers - 5 patch sets
21:17:29 <rledisez> the blocking point seems to be that it changes the way async pendings are stored
21:18:02 <rledisez> the second issue is that the way we communicate with the container-server is by sending one request per async-pending instead of batching them
21:18:10 <rledisez> that's what p 724943 is about
21:18:11 <patchbot> https://review.opendev.org/#/c/724943/ - swift - WIP: Batched updates for object-updater - 2 patch sets
21:18:27 <rledisez> I thought I would bring this here to get some other points of view
21:19:22 <rledisez> (I'm done with the summary. any questions, remarks?)
21:19:51 <timburke> so a bit of perspective from clayg, tdasilva and i: we've got a cluster that's filling up, leading to quotas being implemented, leading to users wanting to delete a good bit of data, often in fairly large containers
21:20:46 <clayg> it sounds like the issue we're having might be slightly different, then - we accepted ~350M deletes into some sharded containers with billions of objects and the updaters keep dos-ing the container dbs 🤷♂️
21:21:04 <timburke> we're currently sitting at like 450+M async pendings across ~250 nodes, and that's still going up by ~2.5M/hr
21:21:05 <alecuyer> ouch
21:21:36 <clayg> there's just no flow control across the nodes to try and put db updates in at "the correct rate"
21:21:41 <rledisez> clayg: yeah, I'm more looking into treating quickly what can be treated while still trying for the problematic containers
21:22:02 <clayg> we're also learning there's still lots of OTHER updates going into these same containers so we're trying to break up the work and prioritize stuff
21:22:57 <clayg> rledisez: yeah on a "per node" basis we need some way to have "bad containers" somehow get... I guess "error limited" or something like what you've done in the top-of-stack patch where it just "moves on"
21:23:47 <clayg> I'd really like it if AT LEAST per-node we could have a per-container rate limit
21:23:48 <timburke> fwiw, "treating quickly what can be treated" is actually *exactly* what clayg did earlier this week -- run a filtered updater that ignores certain containers, then try running foreground updaters for the remaining ones
21:25:16 <clayg> yeah I don't guess I have that gist up just now, one sec
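To make the "filtered updater" approach above concrete, here is a minimal Python sketch (not clayg's actual gist), assuming the usual Swift layout of async_pending[-<policy>]/<suffix>/<hash>-<timestamp> files holding pickled dicts with op/account/container/obj/headers keys. The deny-list contents and the handle_update hook are hypothetical.

    # Hypothetical sketch: sweep one device's async_pending tree, unpickle
    # each update, and skip a deny-list of "problem" containers so the
    # transient ones get handled first.
    import os
    import pickle

    SKIP = {('AUTH_example', 'huge-container')}  # hypothetical problem containers

    def iter_async_files(device_path):
        for entry in os.listdir(device_path):
            if not entry.startswith('async_pending'):
                continue  # also picks up per-policy dirs like async_pending-1
            async_dir = os.path.join(device_path, entry)
            for suffix in os.listdir(async_dir):
                suffix_dir = os.path.join(async_dir, suffix)
                for name in os.listdir(suffix_dir):
                    yield os.path.join(suffix_dir, name)

    def sweep(device_path, handle_update):
        kept = skipped = 0
        for path in iter_async_files(device_path):
            with open(path, 'rb') as fp:
                update = pickle.load(fp)  # the open/read cost discussed below
            if (update['account'], update['container']) in SKIP:
                skipped += 1
                continue
            handle_update(update, path)  # e.g. hand off to a real updater
            kept += 1
        return kept, skipped

Note that even skipped entries still have to be opened and unpickled, which is exactly the wasted-I/O concern raised next.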
21:25:58 <rledisez> timburke: I saw that tool, there is a link in the review. my issue with it is that it has to open all async-pendings to filter them. I really want to avoid wasting I/O on that (are you running on SSDs, guys?)
21:26:02 <clayg> oh, no ... i did, just lost it -> https://gist.github.com/clayg/c3d31a62eba590eebd5f5d257c24a297
21:26:18 <clayg> anyways - this is *useful* but not very scalable in terms of operator friendliness
21:27:51 <timburke> no, we just took the io hit -- worst case, it slows down object servers which might apply some backpressure to the clients :P
21:28:10 <timburke> fwiw, another thought i'd had recently was to make the number of successes required to unlink configurable -- if we can get the update to 2/3 replicas, that's *probably* good enough, right?
21:28:14 <clayg> we're not on SSDs but we do have SOME headroom on iops
21:28:18 <timburke> let container replication square it
21:28:52 <clayg> I hadn't really considered the overhead of opening an async to parse it only to find out that container is ratelimited 🤔
21:28:57 <rledisez> how does replication work on a big unsharded container? I'm not sure it would do a better job
21:29:25 <clayg> @timburke wants to just put all the async updates in a database - I don't want to have to deal with more .pending file lock timeouts
21:29:48 <timburke> i mean, they're big... but not *that* big. ~20M rows or so, i think?
21:30:02 <clayg> timburke: I think rledisez was asking about *un*sharded
21:30:23 <clayg> yeah, sharding was primarily in my mind to fix container replication
21:30:35 <timburke> right, but i mean, we could shard that big shard -- we just haven't
21:30:39 <clayg> replication works great on the shards!
21:30:52 <clayg> timburke: yeah, we have more sharding to do
21:31:16 <timburke> we've been pretty good about sharding the *biggest* guys, we've only got like 2 containers over 50M
21:31:39 <timburke> one of them is actually itself a shard 🤔
21:31:54 <mattoliverau> then shard the shard :)
21:32:16 <alecuyer> I'm wondering if this is "fixable" without throttling DELETEs? Unless you have excess IO capacity in your container servers, something is always going to be lagging in your situation, no? (it would be nice of course to prioritize some things but still)
21:32:47 <clayg> we DO have IO headroom in the container dbs tho
21:33:10 <alecuyer> ok so it's sqlite contention?
21:33:31 <clayg> yeah it's some kind of locking - either something we're doing or sqlite
21:34:11 <clayg> yesterday it looked like the replication UPDATE request locked up the db for 25s - then FAILED
21:34:37 <clayg> so we're leaving throughput on the floor and it's unclear we're making progress - we have more investigating to do
21:34:58 <clayg> I'll think more on ordering asyncs by containers - I'm generally pretty happy with the filesystem layout
21:36:05 <rledisez> with a new layout, it would be pretty easy to batch updates to send to the container as they are all grouped together
21:36:41 <clayg> it hadn't been obviously terrible to me that a cycle of the updater would read and open the 125K asyncs per-disk
21:36:51 <rledisez> it was actually a followup I wanted on the first patch (and also to move "legacy" async-pendings)
21:36:59 <mattoliverau> re container replication, if the 2 containers are generally close in size, we use usync (ie sync a bunch of rows), so maybe only needing quorum successes and letting the containers possibly usync (batch with other updates) is better, rather than waiting for all replicas. Of course large unsharded will always be an issue.. we just need to shard the buggers.
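To illustrate the batching idea rledisez raises, a rough Python sketch of just the grouping step: collect pending updates per (account, container) so a single request could carry many object rows. The send_batch call is hypothetical -- today's container server takes one PUT/DELETE per object, and p 724943 is the real work-in-progress for batching.

    # Sketch of the grouping step behind the batched-updates idea; the
    # batched container API shown in the comment does not exist today.
    from collections import defaultdict

    def group_by_container(updates):
        """Group pending updates (dicts with op/account/container/obj/headers)
        so each (account, container) pair gets one batched request."""
        batches = defaultdict(list)
        for update in updates:
            key = (update['account'], update['container'])
            batches[key].append({
                'op': update['op'],
                'obj': update['obj'],
                'headers': update['headers'],
            })
        return batches

    # for (account, container), rows in group_by_container(pending).items():
    #     send_batch(account, container, rows)  # hypothetical batched container API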
21:37:10 <timburke> and i need to work on getting a good process going for manual shrinking -- part of why i'm hesitant to shard is that i think a lot of those shards are going to end up mostly-empty once everything settles, and we haven't really invested in shrinking yet :-/
21:37:33 <clayg> yeah, the batch updates may indeed be useful ... rledisez you're winning me over on the layout change
21:38:20 <rledisez> yay! can I offer you a virtual-beer during the virtual-ptg?
21:38:31 <mattoliverau> I like the idea of splitting the asyncs by container in the sense that at a glance you can see if containers are struggling, but is that too much directory-walking i/o, ie listdir? Maybe one per partition? which would map to container replicas
21:39:01 <clayg> at this point my biggest complaint about changing it is probably just reservation about changing on-disk layouts and legacy migrations etc
21:39:18 <mattoliverau> yeah
21:39:19 <clayg> it's a bunch of work - but maybe it's worth it - thanks for bringing this up
21:39:40 <clayg> I don't think I had a good picture of where your thinking was coming from - it's clearer now
21:39:51 <rledisez> so, the current patch is compatible with the current layout, so no break during an upgrade. in case of downgrade, some moving of files would be required
21:40:06 <clayg> mattoliverau: yeah some workloads I'd seen had a BUNCH of containers in their cluster
21:40:25 <clayg> I think of all the dirs we create if a node is offline for a while
21:40:57 <clayg> instead of "a handful" of problematic containers we get one TLD for each container on a node - which... might still be less than 1M - but in the 100Ks
21:41:42 <rledisez> the cost of listing a directory of 100K entries is not big, but the cost of inserting a new one is not negligible
21:41:51 <rledisez> (i made some measurements during my tests)
21:42:04 <mattoliverau> timburke: yeah, mark as SHRINKING and maybe get a tool to search for the donor etc.
21:42:17 <timburke> hmm... i wonder if a db could still be a good idea -- have the object-server continue dropping files all over the fs, then have the updater walk that tree and load into a db before fanning out workers to read from the db...
21:42:56 <alecuyer> sounds good, I'm afraid of having too many files on disk, wonder why :)
21:43:01 <rledisez> why drop a file then? with the WAL, insertion should be quite fast, no?
21:43:26 <mattoliverau> use the general task queue.. and just hope that container gets updates so we don't have to deal with asyncs.. damn :P
21:43:38 <clayg> mattoliverau: 😆
21:43:40 <rledisez> I would then even move to appending to a file, just to avoid too many fsyncs
21:45:09 <timburke> ok, this has been a good discussion. are there any decisions or action items we can take away from it?
21:46:25 <rledisez> should we investigate the DB idea?
21:46:44 <rledisez> I'm pretty sure it would be better, but it's also more work so won't be ready soon
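As a strawman for the "walk the tree and load into a db" idea timburke floats above, a rough sketch using a throwaway local sqlite db with WAL enabled (per rledisez's point about insert speed). The schema and function names are made up for illustration; nothing in Swift does this today.

    # Rough sketch of loading async updates into a local sqlite db so that
    # updater workers could pull per-container batches; schema is invented.
    import sqlite3

    def load_asyncs(db_path, updates):
        conn = sqlite3.connect(db_path)
        conn.execute('PRAGMA journal_mode=WAL')  # cheaper inserts, as noted above
        conn.execute('''CREATE TABLE IF NOT EXISTS asyncs (
                            account TEXT, container TEXT, obj TEXT,
                            op TEXT, timestamp TEXT)''')
        with conn:  # commit the batch as one transaction
            conn.executemany(
                'INSERT INTO asyncs VALUES (?, ?, ?, ?, ?)',
                ((u['account'], u['container'], u['obj'], u['op'],
                  u['headers'].get('X-Timestamp', '')) for u in updates))
        return conn

    # Workers could then do e.g.
    #   SELECT obj, op FROM asyncs WHERE account = ? AND container = ?
    # which also makes per-container rate limiting or prioritization easy.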
21:47:07 <alecuyer> (I have to say I have no updates on LOSF, having had no time to work on it this week. rledisez left it on the agenda as I should get some time next week. So, we can have more time for the object updater or other topics)
21:47:26 <timburke> alecuyer, thanks for the heads-up
21:47:34 <clayg> i need to investigate the problem in our cluster - that's ahead of me making a decision on the suitability of rledisez's proposed layout change - but I'd like to review that more seriously given the new perspective and anything I learn trying to fix our mess
21:48:57 <timburke> ok, one more crazy idea, that might be a somewhat cheaper way to investigate the db idea: what about a db per container? we could put it in the disk's containers/ directory...
21:50:17 <clayg> I was seriously attracted to this idea because of leverage
21:50:31 <rledisez> yeah, we can just run the container-replicator then
21:50:33 <clayg> "can't stuff it in the primary container - just put it in a local handoff!"
21:50:35 <timburke> i mean, we've already got this db schema for tracking exactly the info that's in these updates...
21:50:48 <clayg> hehhe
21:51:40 <clayg> i was concerned that for AC/O clusters there might be some distaste for adding a container-replicator to your object layer
21:51:48 <timburke> *especially* for the shards -- then you run no risk of proxies getting bad acls from the handoff that got popped into existence when the primaries are overloaded
21:51:56 <clayg> we could just import the container-replicator into the updater and ... well, do something
21:52:39 <timburke> true enough! we've already got the sharder doing something not so dissimilar
21:52:41 <clayg> timburke: yeah, vivifying containers in the read path is probably not ideal - i was thinking out of band
21:53:22 <mattoliverau> as handoff containers? if so, just make sure you get the rowids close to their parent. otherwise it might cause an rsync_then_merge and that wouldn't go well on really large containers.
21:54:25 <clayg> mattoliverau: good point
21:55:37 <rledisez> we can avoid that by tuning the limits on the async updater/replicator I guess. but yes, something to take care of for sure
21:56:18 <timburke> i should read the rsync-then-merge code again...
21:56:40 <timburke> all right
21:56:44 <timburke> #topic open discussion
21:56:56 <timburke> anything else we should talk about in these last few minutes of meeting time?
21:58:21 <mattoliverau> nope, /me wants breakfast ;)
21:58:31 <clayg> nom nom
21:58:41 <clayg> i might cook eggs and bacon for dinner 🤔
21:59:02 <mattoliverau> you totally should! :)
21:59:14 <kota_> it seems the kids woke up.
21:59:33 <timburke> :-) then this is as good a time as any to
21:59:41 <timburke> #endmeeting