| *** tkajinam_ is now known as tkajinam | 01:04 | |
| *** rcernin has quit IRC | 01:14 | |
| *** rcernin has joined #openstack-swift | 01:16 | |
| *** baojg has joined #openstack-swift | 01:36 | |
| *** gyee has quit IRC | 01:48 | |
| *** josephillips has quit IRC | 03:31 | |
| *** josephillips has joined #openstack-swift | 03:52 | |
| *** rcernin has quit IRC | 04:32 | |
| *** evrardjp has quit IRC | 04:33 | |
| *** evrardjp has joined #openstack-swift | 04:35 | |
| *** m75abrams has joined #openstack-swift | 04:51 | |
| *** dsariel has joined #openstack-swift | 04:58 | |
| *** rcernin has joined #openstack-swift | 05:03 | |
| *** rcernin has quit IRC | 05:41 | |
| openstackgerrit | Tim Burke proposed openstack/swift master: wsgi: Handle multiple USR1 signals in quick succession https://review.opendev.org/747496 | 06:05 |
| *** rcernin has joined #openstack-swift | 06:57 | |
| *** rcernin has quit IRC | 07:03 | |
| *** rcernin has joined #openstack-swift | 07:06 | |
| *** rcernin has quit IRC | 07:14 | |
| *** baojg has quit IRC | 07:58 | |
| *** baojg has joined #openstack-swift | 07:59 | |
| *** dosaboy has quit IRC | 08:59 | |
| *** dosaboy has joined #openstack-swift | 08:59 | |
| *** ianychoi__ has joined #openstack-swift | 09:21 | |
| *** ianychoi_ has quit IRC | 09:24 | |
| *** abelur has quit IRC | 10:49 | |
| *** lxkong has quit IRC | 10:51 | |
| *** rcernin has joined #openstack-swift | 10:57 | |
| *** abelur has joined #openstack-swift | 11:02 | |
| *** lxkong has joined #openstack-swift | 11:02 | |
| *** hoonetorg has quit IRC | 12:08 | |
| *** hoonetorg has joined #openstack-swift | 12:21 | |
| *** hoonetorg has quit IRC | 12:41 | |
| *** hoonetorg has joined #openstack-swift | 12:54 | |
| *** gyee has joined #openstack-swift | 14:03 | |
| *** rcernin has quit IRC | 14:45 | |
| *** baojg has quit IRC | 15:41 | |
| *** baojg has joined #openstack-swift | 15:42 | |
| *** m75abrams has quit IRC | 16:17 | |
| timburke | good morning | 16:42 |
| *** baojg has quit IRC | 17:59 | |
| *** abelur has quit IRC | 18:52 | |
| *** abelur has joined #openstack-swift | 18:53 | |
| timburke | so i noticed the proxy server in my home swift going a little squirrelly on occasion -- running https://github.com/swiftstack/python-stack-xray/blob/master/python-stack-xray against it, i found a particularly strange stack: http://paste.openstack.org/show/797139/ | 19:49 |
| timburke | what in the world is up with that _get_response_parts_iter frame?? i guess maybe there's a generator exit getting raised? | 19:49 |
| timburke | fwiw, i'm fairly certain my troubles are the two frames in enumerate() -- most/all of the other stacks are also waiting on _active_limbo_lock :-/ | 19:54 |
| timburke | the reference to threading._active reminds me of https://github.com/eventlet/eventlet/pull/611 ... i need to double check whether i applied that fix here or not... | 19:57 |
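For context on what a tool like python-stack-xray reports: a rough in-process approximation (purely illustrative, not the actual tool) can be built from the stdlib by snapshotting every thread's current frame. Note that this version itself calls threading.enumerate(), which takes the same _active_limbo_lock the pasted stacks are blocked on, so tracing the process from outside is the safer way to inspect a wedged proxy.

```python
import sys
import threading
import traceback

def dump_all_stacks(out=sys.stderr):
    """Write a traceback for every OS thread in this process."""
    # Best-effort mapping of thread ids to names; threading.enumerate()
    # acquires threading's internal _active_limbo_lock, the very lock the
    # pasted stacks are waiting on.
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, frame in sys._current_frames().items():
        out.write("Thread %s (%s):\n" % (ident, names.get(ident, "unknown")))
        out.write("".join(traceback.format_stack(frame)))
        out.write("\n")
```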
| openstackgerrit | Tim Burke proposed openstack/swift master: wsgi: Handle multiple USR1 signals in quick succession https://review.opendev.org/747496 | 19:58 |
| openstackgerrit | Tim Burke proposed openstack/swift master: ssync: Tolerate more hang-ups https://review.opendev.org/744270 | 20:27 |
| ormandj | something interesting we've noticed with swift - when we load a single drive on a server (let's assume these servers have 56 drives for swift, which they do) - such as by running a patrolread on it | 20:58 |
| ormandj | our throughput for the entire cluster goes waaaaaay down | 20:58 |
| ormandj | we have dedicated SSDs for container/account dbs, and those are not being touched | 20:58 |
| ormandj | we've also noticed when taking a server down (but not removing from ring) the same degradation happens | 21:00 |
| ormandj | is that a function of a three server cluster w/ triple replication? | 21:00 |
| timburke | "load a drive" as in, there's one drive that seems especially hot, or there's one drive that's particularly full? | 21:01 |
| timburke | could be part of it. i'd expect the down server to get error-limited fairly quickly, though | 21:01 |
| timburke | and then traffic should shed to the remaining servers | 21:02 |
| timburke | are you seeing performance tank for reads, writes, or both? | 21:03 |
| timburke | when you're taking a node down, how quickly can you get the proxy out of rotation for your load balancer? | 21:04 |
| ormandj | timburke: load as in cause it to slow down | 21:09 |
| ormandj | timburke: no proxies go out of rotation, we're taking down storage nodes, not swift proxies. lb -> swift proxies -> storage nodes. swift proxies scale independently of storage nodes | 21:10 |
| ormandj | it looks like they continue to try and contact the 'down' node the whole time it's down | 21:11 |
| ormandj | timburke: so on the disk load thing, for example, we kick off a patrolread which 'invisibly' hits the disk with enough IOPs to increase await significantly for that one drive | 21:12 |
| timburke | might want to look at https://github.com/openstack/swift/blob/master/etc/proxy-server.conf-sample#L161-L166 values | 21:12 |
| ormandj | i think we have those at defaults (verifying) | 21:12 |
| ormandj | it'll keep trying the entire time they are down | 21:12 |
| timburke | :-/ | 21:13 |
| ormandj | yep, both commented out in the config | 21:13 |
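For reference, the error-limiting settings being discussed live in the proxy's [app:proxy-server] section; a minimal sketch with the values the sample config documents as defaults (leaving them commented out, as here, means those defaults are in effect):

```ini
[app:proxy-server]
use = egg:swift#proxy
# A node is considered error-limited once it accumulates this many errors...
error_suppression_limit = 10
# ...and stays error-limited until this many seconds have passed since its
# last error, during which the proxy stops sending it requests.
error_suppression_interval = 60
```

Once a node trips the limit, the proxy should shed its traffic to the remaining nodes for the rest of the interval, which is the behavior timburke describes above.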
| timburke | separately from proxy configs, how's the object server configured? servers per port, or all disks going over the one port? how many workers? | 21:13 |
| ormandj | workers is set default (auto according to config) and servers per port is default (0 according to config) | 21:15 |
| timburke | so auto should give you a worker per core -- how many cores do the nodes have? i'd worry a bit about all workers for that node getting hung up trying to service requests for that disk and getting stuck in some uninterruptible sleep | 21:27 |
| ormandj | 48 cores | 21:27 |
| ormandj | 56 data drives | 21:28 |
| ormandj | 256 gigs ram | 21:28 |
| ormandj | 4xssd for account/container db | 21:28 |
| ormandj | what you're describing would make sense based on what we see | 21:29 |
| timburke | i'd think about giving each disk its own port in the ring and setting servers_per_port to 2 or so | 21:29 |
| ormandj | safe to do without blowing up existing data? | 21:30 |
| timburke | yeah, it's a matter of updating the ring with swift-ring-builder's set_info command. the transition may still be a bit disruptive, though; you'd likely want to announce a maintenance window | 21:32 |
| timburke | let me see if i can find some docs on it... | 21:32 |
| ormandj | thanks tim. the docs have been a bit... not clear on some of the implications in the past | 21:36 |
| ormandj | so we try to be careful when it comes to ring operations heh | 21:36 |
| ormandj | manpage had info on that option | 21:37 |
| ormandj | set_info <search-value> <ip>:<port>/<device_name>_<meta> | 21:37 |
| timburke | https://docs.openstack.org/swift/latest/deployment_guide.html#running-object-servers-per-disk | 21:38 |
| timburke | (though it doesn't give an example of how to switch between modes :-/) | 21:38 |
| ormandj | awesome, we'll look into implementing that, we'll test in our dev cluster first in case we hose all the data, which is likely :p | 21:38 |
| ormandj | looks like just updating the port fields, which is straightforward enough | 21:39 |
| ormandj | "When migrating from normal to servers_per_port, perform these steps in order: | 21:39 |
| ormandj | " | 21:39 |
| ormandj | it has that section below the output | 21:39 |
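A hedged sketch of what the ring side of that migration looks like (builder file name, device ids, IP, device names, and ports below are all made up; check your own ring with the search command first). set_info only edits device info, not partition placement, so a write_ring rather than a full rebalance should be enough to publish the change:

```sh
# Hypothetical: give every device on node 10.0.0.1 its own unique port.
swift-ring-builder object.builder search d0           # confirm which disk is d0
swift-ring-builder object.builder set_info d0 10.0.0.1:6201/sda
swift-ring-builder object.builder set_info d1 10.0.0.1:6202/sdb
swift-ring-builder object.builder set_info d2 10.0.0.1:6203/sdc
# ...and so on for the remaining devices on the node...
swift-ring-builder object.builder write_ring
```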
| timburke | oh good -- i clearly didn't skim well enough! | 21:41 |
| ormandj | it doesn't give guidance on servers_per_port for the hypothetical, but looking at the default options, it seems to suggest '4' as giving complete i/o isolation | 21:42 |
| ormandj | so we'd end up at 56*4 processes, effectively if we did that | 21:43 |
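On the object-server side the knob is a single [DEFAULT] setting; a hedged sketch (the value 4 is just the number being discussed, not a recommendation):

```ini
[DEFAULT]
# Fork this many workers for each distinct object-ring port on this node;
# each worker serves only the disk(s) behind its port. When this is
# non-zero, the normal "workers" setting is ignored.
servers_per_port = 4
```

With 56 data drives that is the 56*4 (224) object-server processes mentioned above, so memory and file-descriptor headroom are worth checking before turning it up that far.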
| timburke | fwiw, the commit that introduced it had some nice benchmarks referenced: https://github.com/openstack/swift/commit/df134df901a13c2261a8205826ea1aa8d75dc283 | 21:43 |
| timburke | https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md in particular seems relevant | 21:43 |
| ormandj | wonder how this got overlooked when people were setting up this cluster | 21:44 |
| ormandj | seems like it's a best practice kind of thing | 21:44 |
| ormandj | also, curious that it's not default | 21:44 |
| timburke | yeah, i was just about to say that we should probably look at updating docs/deployment guides to default to servers per port | 21:45 |
| ormandj | those benchmarks are pretty telling, you weren't kidding | 21:45 |
| ormandj | bit of a struggle to understand the chart though, haha | 21:46 |
| *** rcernin has joined #openstack-swift | 22:01 | |
| *** rcernin has quit IRC | 22:01 | |
| *** rcernin has joined #openstack-swift | 22:02 | |
| openstackgerrit | Tim Burke proposed openstack/swift master: docs: Clean up some formatting around using servers_per_port https://review.opendev.org/748043 | 22:33 |
| clayg | there's a lot to digest in p 745603 - but I feel like I'm getting the hang of it! | 22:38 |
| patchbot | https://review.opendev.org/#/c/745603/ - swift - Bind a new socket per-worker - 4 patch sets | 22:38 |
| timburke | sorry; i maybe shouldn't have moved to get rid of PortPidState in the same patch | 22:39 |
| clayg | fwiw, i'll probably spend some time with the graceful worker shutdown patch before I loop back around to the per-worker-socket | 22:41 |
| clayg | it'll be nice to gather informal feedback in the meeting tomorrow as well | 22:42 |
| clayg | but it looks great timburke - incredible work | 22:42 |
| timburke | clayg, we may want to make it somewhat configurable -- someone (timur i think?) pointed out https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/ that had some more thoughts on the matter | 22:44 |
| clayg | that might be a good reason to keep the sockets in the parent 🤔 | 22:45 |
| timburke | might be worth trying to do something like 4 listen sockets each with 6 workers or something like that | 22:45 |
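For anyone following along, the "a few listen sockets, each shared by several workers" idea timburke floats can be sketched roughly like this (illustrative Python only, not the patch under review; the port and the 4x6 split are just the numbers from the discussion). SO_REUSEPORT gives each socket its own accept queue, which is the behavior the Cloudflare post weighs against a single shared queue:

```python
import os
import socket

def make_listener(port, backlog=1024):
    # Each call yields an independent listen socket on the same port;
    # SO_REUSEPORT (Linux 3.9+) lets them coexist, each with its own
    # accept queue that the kernel fills.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", port))
    sock.listen(backlog)
    return sock

NUM_SOCKETS, WORKERS_PER_SOCKET = 4, 6

def serve(listener):
    # Worker loop: accept only from the one inherited listener.
    while True:
        conn, _addr = listener.accept()
        conn.close()  # placeholder for real request handling

if __name__ == "__main__":
    for _ in range(NUM_SOCKETS):
        listener = make_listener(8080)
        for _ in range(WORKERS_PER_SOCKET):
            if os.fork() == 0:
                serve(listener)   # child never returns
    while True:                   # parent just reaps workers
        os.wait()
```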