*** tkajinam_ is now known as tkajinam | 01:04 | |
*** rcernin has quit IRC | 01:14 | |
*** rcernin has joined #openstack-swift | 01:16 | |
*** baojg has joined #openstack-swift | 01:36 | |
*** gyee has quit IRC | 01:48 | |
*** josephillips has quit IRC | 03:31 | |
*** josephillips has joined #openstack-swift | 03:52 | |
*** rcernin has quit IRC | 04:32 | |
*** evrardjp has quit IRC | 04:33 | |
*** evrardjp has joined #openstack-swift | 04:35 | |
*** m75abrams has joined #openstack-swift | 04:51 | |
*** dsariel has joined #openstack-swift | 04:58 | |
*** rcernin has joined #openstack-swift | 05:03 | |
*** rcernin has quit IRC | 05:41 | |
openstackgerrit | Tim Burke proposed openstack/swift master: wsgi: Handle multiple USR1 signals in quick succession https://review.opendev.org/747496 | 06:05 |
*** rcernin has joined #openstack-swift | 06:57 | |
*** rcernin has quit IRC | 07:03 | |
*** rcernin has joined #openstack-swift | 07:06 | |
*** rcernin has quit IRC | 07:14 | |
*** baojg has quit IRC | 07:58 | |
*** baojg has joined #openstack-swift | 07:59 | |
*** dosaboy has quit IRC | 08:59 | |
*** dosaboy has joined #openstack-swift | 08:59 | |
*** ianychoi__ has joined #openstack-swift | 09:21 | |
*** ianychoi_ has quit IRC | 09:24 | |
*** abelur has quit IRC | 10:49 | |
*** lxkong has quit IRC | 10:51 | |
*** rcernin has joined #openstack-swift | 10:57 | |
*** abelur has joined #openstack-swift | 11:02 | |
*** lxkong has joined #openstack-swift | 11:02 | |
*** hoonetorg has quit IRC | 12:08 | |
*** hoonetorg has joined #openstack-swift | 12:21 | |
*** hoonetorg has quit IRC | 12:41 | |
*** hoonetorg has joined #openstack-swift | 12:54 | |
*** gyee has joined #openstack-swift | 14:03 | |
*** rcernin has quit IRC | 14:45 | |
*** baojg has quit IRC | 15:41 | |
*** baojg has joined #openstack-swift | 15:42 | |
*** m75abrams has quit IRC | 16:17 | |
timburke | good morning | 16:42 |
*** baojg has quit IRC | 17:59 | |
*** abelur has quit IRC | 18:52 | |
*** abelur has joined #openstack-swift | 18:53 | |
timburke | so i noticed the proxy server in my home swift going a little squirrelly on occasion -- running https://github.com/swiftstack/python-stack-xray/blob/master/python-stack-xray against it, i found a particularly strange stack: http://paste.openstack.org/show/797139/ | 19:49 |
timburke | what in the world is up with that _get_response_parts_iter frame?? i guess maybe there's a generator exit getting raised? | 19:49 |
timburke | fwiw, i'm fairly certain my troubles are the two frames in enumerate() -- most/all of the other stacks are also waiting on _active_limbo_lock :-/ | 19:54 |
timburke | the reference to threading._active reminds me of https://github.com/eventlet/eventlet/pull/611 ... i need to double check whether i applied that fix here or not... | 19:57 |
openstackgerrit | Tim Burke proposed openstack/swift master: wsgi: Handle multiple USR1 signals in quick succession https://review.opendev.org/747496 | 19:58 |
openstackgerrit | Tim Burke proposed openstack/swift master: ssync: Tolerate more hang-ups https://review.opendev.org/744270 | 20:27 |
ormandj | something interesting we've noticed with swift - when we put load on a single drive on a server (these servers have 56 drives for swift each), such as by running a patrolread on it | 20:58 |
ormandj | our throughput for the entire cluster goes waaaaaay down | 20:58 |
ormandj | we have dedicated SSDs for container/account dbs, and those are not being touched | 20:58 |
ormandj | we've also noticed when taking a server down (but not removing from ring) the same degradation happens | 21:00 |
ormandj | is that a function of a three server cluster w/ triple replication? | 21:00 |
timburke | "load a drive" as in, there's one drive that seems especially hot, or there's one drive that's particularly full? | 21:01 |
timburke | could be part of it. i'd expect the down server to get error-limited fairly quickly, though | 21:01 |
timburke | and then traffic should shed to the remaining servers | 21:02 |
timburke | are you seeing performance tank for reads, writes, or both? | 21:03 |
timburke | when you're taking a node down, how quickly can you get the proxy out of rotation for your load balancer? | 21:04 |
ormandj | timburke: load as in cause it to slow down | 21:09 |
ormandj | timburke: no proxies go out of rotation, we're taking down storage nodes, not swift proxies. lb -> swift proxies -> storage nodes. swift proxies scale independently of storage nodes | 21:10 |
ormandj | it looks like they continue to try and contact the 'down' node the whole time it's down | 21:11 |
ormandj | timburke: so on the disk load thing, for example, we kick off a patrolread which 'invisibly' hits the disk with enough IOPs to increase await significantly for that one drive | 21:12 |
timburke | might want to look at https://github.com/openstack/swift/blob/master/etc/proxy-server.conf-sample#L161-L166 values | 21:12 |
ormandj | i think we have those at defaults (verifying) | 21:12 |
ormandj | it'll keep trying the entire time they are down | 21:12 |
timburke | :-/ | 21:13 |
ormandj | yep, both commented out in the config | 21:13 |
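For reference, the error-limiting knobs timburke is pointing at live in the [app:proxy-server] section of proxy-server.conf. A minimal sketch with what I recall as the upstream defaults (the exact lines linked above may differ; check your own sample config):

    [app:proxy-server]
    use = egg:swift#proxy
    # how many errors a node can rack up before the proxy temporarily ignores it
    error_suppression_limit = 10
    # how long (seconds) without an error before the count resets and the node
    # is eligible for traffic again
    error_suppression_interval = 60
    # timeouts that bound how long a slow or dead node can hold up a request
    conn_timeout = 0.5
    node_timeout = 10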
timburke | separately from proxy configs, how's the object server configured? servers per port, or all disks going over the one port? how many workers? | 21:13 |
ormandj | workers is set default (auto according to config) and servers per port is default (0 according to config) | 21:15 |
timburke | so auto should give you a worker per core -- how many cores do the nodes have? i'd worry a bit about all workers for that node getting hung up trying to service requests for that disk and getting stuck in some uninterruptible sleep | 21:27 |
ormandj | 48 cores | 21:27 |
ormandj | 56 data drives | 21:28 |
ormandj | 256 gigs ram | 21:28 |
ormandj | 4xssd for account/container db | 21:28 |
ormandj | what you're describing would make sense based on what we see | 21:29 |
timburke | i'd think about giving each disk its own port in the ring and setting servers_per_port to 2 or so | 21:29 |
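On the object-server side that suggestion would look roughly like the sketch below (IP and values are hypothetical; as I understand it, the ports actually bound come from the ring once servers_per_port is enabled, and the workers option is then ignored):

    [DEFAULT]
    # storage node's IP; with servers_per_port set, the object-server forks a
    # small group of servers for every ring port assigned to this IP
    bind_ip = 10.0.0.1
    bind_port = 6200
    # hypothetical value from the discussion above: 2 servers per disk/port
    servers_per_port = 2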
ormandj | safe to do without blowing up existing data? | 21:30 |
timburke | yeah, it's a matter of updating the ring with swift-ring-builder's set_info command. the transition may still be a bit disruptive, though; you'd likely want to announce a maintenance window | 21:32 |
timburke | let me see if i can find some docs on it... | 21:32 |
ormandj | thanks tim. the docs have been a bit... not clear on some of the implications in the past | 21:36 |
ormandj | so we try to be careful when it comes to ring operations heh | 21:36 |
ormandj | manpage had info on that option | 21:37 |
ormandj | set_info <search-value> <ip>:<port>/<device_name>_<meta> | 21:37 |
timburke | https://docs.openstack.org/swift/latest/deployment_guide.html#running-object-servers-per-disk | 21:38 |
timburke | (though it doesn't give an example of how to switch between modes :-/) | 21:38 |
ormandj | awesome, we'll look into implementing that, we'll test in our dev cluster first in case we hose all the data, which is likely :p | 21:38 |
ormandj | looks like just updating the port fields, which is straight forward enough | 21:39 |
ormandj | "When migrating from normal to servers_per_port, perform these steps in order: | 21:39 |
ormandj | " | 21:39 |
ormandj | it has that section below the output | 21:39 |
timburke | oh good -- i clearly didn't skim well enough! | 21:41 |
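Giving a disk its own port with set_info might look something like this (builder file, IP, device, and ports are all hypothetical; the flag-based form shown here is an alternative to the positional search-value syntax quoted from the man page above - check set_info's help on your version):

    # move device sdb on 10.0.0.1 from the shared port 6200 to its own port
    swift-ring-builder object.builder set_info \
        --ip 10.0.0.1 --port 6200 --device sdb \
        --change-port 6201
    # set_info only updates device info - no partitions move, so no rebalance;
    # write out the updated ring and distribute it as usual
    swift-ring-builder object.builder write_ring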
ormandj | it doesn't give guidance on what servers_per_port value to use, but looking at the default options, it seems to suggest '4' as giving complete i/o isolation | 21:42 |
ormandj | so we'd end up at 56*4 processes, effectively if we did that | 21:43 |
timburke | fwiw, the commit that introduced it had some nice benchmarks referenced: https://github.com/openstack/swift/commit/df134df901a13c2261a8205826ea1aa8d75dc283 | 21:43 |
timburke | https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md in particular seems relevant | 21:43 |
ormandj | wonder how this got overlooked when people were setting up this cluster | 21:44 |
ormandj | seems like it's a best practice kind of thing | 21:44 |
ormandj | also, curious that it's not default | 21:44 |
timburke | yeah, i was just about to say that we should probably look at updating docs/deployment guides to default to servers per port | 21:45 |
ormandj | those benchmarks are pretty telling, you weren't kidding | 21:45 |
ormandj | little struggle to understand the chart though, haha | 21:46 |
*** rcernin has joined #openstack-swift | 22:01 | |
*** rcernin has quit IRC | 22:01 | |
*** rcernin has joined #openstack-swift | 22:02 | |
openstackgerrit | Tim Burke proposed openstack/swift master: docs: Clean up some formatting around using servers_per_port https://review.opendev.org/748043 | 22:33 |
clayg | there's a lot to digest in p 745603 - but I feel like I'm getting the hang of it! | 22:38 |
patchbot | https://review.opendev.org/#/c/745603/ - swift - Bind a new socket per-worker - 4 patch sets | 22:38 |
timburke | sorry; i maybe shouldn't have moved to get rid of PortPidState in the same patch | 22:39 |
clayg | fwiw, i'll probably spend some time with the graceful worker shutdown patch before I loop back around to the per-worker-socket | 22:41 |
clayg | it'll be nice to gather informal feedback in the meeting tomorrow as well | 22:42 |
clayg | but it looks great timburke - incredible work | 22:42 |
timburke | clayg, we may want to make it somewhat configurable -- someone (timur i think?) pointed out https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/ that had some more thoughts on the matter | 22:44 |
clayg | that might be a good reason to keep the sockets in the parent 🤔 | 22:45 |
timburke | might be worth trying to do something like 4 listen sockets each with 6 workers or something like that | 22:45 |
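Not Swift code, but a toy sketch of the idea being kicked around: several listen sockets on the same port via SO_REUSEPORT, each shared by a handful of forked workers (all numbers and the trivial handler are made up for illustration; Linux only):

    import os
    import socket

    ADDR = ("0.0.0.0", 8080)    # hypothetical bind address
    NUM_SOCKETS = 4             # e.g. 4 listen sockets...
    WORKERS_PER_SOCKET = 6      # ...each shared by 6 workers

    def make_listener():
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # SO_REUSEPORT lets several sockets bind the same ip:port; the kernel
        # then spreads incoming connections across the sockets
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind(ADDR)
        s.listen(1024)
        return s

    listeners = [make_listener() for _ in range(NUM_SOCKETS)]
    children = []
    for sock in listeners:
        for _ in range(WORKERS_PER_SOCKET):
            pid = os.fork()
            if pid == 0:
                # child: pull connections off this one listen socket forever
                while True:
                    conn, _ = sock.accept()
                    conn.sendall(b"HTTP/1.1 204 No Content\r\n\r\n")
                    conn.close()
            children.append(pid)
    for pid in children:
        os.waitpid(pid, 0)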