opendevreview | Merged openstack/swift-bench master: Migrate from testr to stestr https://review.opendev.org/c/openstack/swift-bench/+/798941 | 16:54 |
reid_g | Hello again! It seems that handoffs_only is a bit risky, and mgmt doesn't want to use it for adding capacity because of the risk of losing durability. We saw we were able to speed up rebalances when we increased the reconstructor workers... but this seems to lead to an increase in various timeout errors being logged. Do you have recommendations for tuning these, or is it just whack-a-mole? | 18:57 |
clayg | reid_g: correct, do NOT leave handoffs_only turned on after the EC rebalance finishes | 20:28 |
clayg | reid_g: process workers are great! You can definitely make a rebalance go quite fast... maybe TOO fast! | 20:29 |
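For context, both of the knobs discussed above live in the `[object-reconstructor]` section of `object-server.conf`. A minimal sketch, with illustrative values only (not recommendations):

```ini
[object-reconstructor]
# Fan reconstruction out across multiple worker processes; reid_g went 1 -> 12.
reconstructor_workers = 12
# Skip rebuilds and only revert partitions off handoff locations.
# Per clayg's warning: turn this OFF again as soon as the rebalance finishes.
handoffs_only = true
```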
reid_g | Right. I was asked not to use it at all | 20:29 |
reid_g | So maybe my answer is to scale down the workers? Went from 1 --> 12 | 20:31 |
clayg | reid_g: unfortunately there's a lot of unhelpful i/o contention during a rebalance if you allow primaries to attempt rebuilds during a rebalance - if the capacity increase is of sufficient size I would say it's "not possible" to do an EC rebalance w/o handoffs_only mode. This level of operational complexity/hand-holding is considered a bug and an area of ongoing investigation at Nvidia. | 20:31 |
clayg | reid_g: there are a number of other knobs to tune besides workers - depending on what kind of timeouts you're seeing, you may need to increase the concurrency settings for the object server's ssync receivers so there's enough capacity to absorb all the parts that are trying to get pushed off | 20:34 |
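The ssync receiver capacity clayg refers to is controlled by the replication settings in the `[object-server]` section. The option names below match the sample config; the values are an assumed starting point, not advice:

```ini
[object-server]
# Max concurrent incoming SSYNC/REPLICATE requests per server (0 = unlimited).
replication_concurrency = 8
# Max concurrent incoming replication requests per disk.
replication_concurrency_per_device = 2
# How long an incoming request waits on the per-device replication
# lock before being rejected.
replication_lock_timeout = 15
```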
opendevreview | Tim Burke proposed openstack/swift master: ring: Introduce a v2 ring format https://review.opendev.org/c/openstack/swift/+/808530 | 20:42 |
opendevreview | Tim Burke proposed openstack/swift master: ring: Allow RingData to vary dev_id_bytes https://review.opendev.org/c/openstack/swift/+/808531 | 20:42 |
opendevreview | Tim Burke proposed openstack/swift master: Allow ring-builder CLI users to specify device ID https://review.opendev.org/c/openstack/swift/+/808532 | 20:42 |
opendevreview | Tim Burke proposed openstack/swift master: ring: Allow builder to vary dev_id_bytes https://review.opendev.org/c/openstack/swift/+/808533 | 20:42 |
opendevreview | Tim Burke proposed openstack/swift master: ring: Keep track of last primary nodes from last rebalance https://review.opendev.org/c/openstack/swift/+/790550 | 20:42 |
timburke_ | reid_g, fwiw, the hope is that getting https://review.opendev.org/c/openstack/swift/+/792075 stacked on top of all that ^^^ will make it so we can avoid needing to switch on handoffs_only at all during a rebalance. i still need to take a closer look at it, though; i had an idea about doing a HEAD *first* (before fanning out to get frags) that i wanted to try out | 20:46 |
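A rough illustration of the HEAD-first idea timburke_ describes: before paying for a full fan-out of fragment GETs, the reconstructor could first ask cheaply whether the missing fragment already exists elsewhere. This is only a sketch with hypothetical `head_fragment`/`fetch_fragments`/`rebuild` callables, not the actual patch under review:

```python
def rebuild_or_skip(policy, partition, frag_index, nodes,
                    head_fragment, fetch_fragments, rebuild):
    """HEAD-first fragment rebuild sketch (hypothetical helpers).

    A HEAD per node is far cheaper than fetching ec_ndata fragment
    bodies, so check for the fragment before fanning out.
    """
    for node in nodes:
        if head_fragment(node, partition, frag_index):
            # Fragment is already durable elsewhere; no rebuild needed.
            return None
    # Only fan out for fragment bodies once the HEADs come back empty.
    frags = fetch_fragments(nodes, partition, ndata=policy.ec_ndata)
    return rebuild(frags, frag_index)
```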
reid_g | That sounds like an interesting change | 20:50 |
reid_g | We started down the path of trying to tackle the timeouts that occurred with the increased workers. Each time we changed one timeout, another timeout error would appear. | 20:51 |
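The whack-a-mole reid_g describes usually cycles through the reconstructor's own timeout knobs. The names below are from the `[object-reconstructor]` sample config, with their defaults; raising them trades slower failure detection for fewer logged errors:

```ini
[object-reconstructor]
node_timeout = 10      # per-request response timeout to storage nodes
http_timeout = 60      # max time to wait for an HTTP response
lockup_timeout = 1800  # consider a worker hung after this long with no progress
```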
timburke_ | oh! i'm glad i looked -- mattoliver already added the HEAD-first idea! | 20:53 |
reid_g | Ultimately I don't think it is affecting the application too much, since it can detect the failed uploads, but it would be nice to add capacity without the danger of reducing durability and without taking a long time between iterations. | 20:53 |
opendevreview | Tim Burke proposed openstack/swift master: WIP: Reconstructor: Use past node and abort to handoff https://review.opendev.org/c/openstack/swift/+/792075 | 20:54 |
timburke_ | (just a rebase) | 20:55 |
opendevreview | Timur Alperovich proposed openstack/swift master: Fix multipart upload listings https://review.opendev.org/c/openstack/swift/+/813715 | 23:37 |