Thursday, 2016-08-25

*** mgagne_ is now known as mgagne		12:54
*** ChanServ changes topic to "gerrit tuning"		19:02
jeblair	ssh review gerrit show-caches --show-threads	19:03
jeblair	Threads: 16 CPUs available, 377 threads	19:03
jeblair	NEW RUNNABLE BLOCKED WAITING TIMED_WAITING TERMINATED	19:03
jeblair	SSH git-upload-pack 0 0 0 14 0 0	19:03
jeblair	SSH-Stream-Worker 0 0 0 17 0 0	19:03
jeblair	HTTP 0 5 0 0 20 0	19:03
jeblair	SSH-Interactive-Worker 0 0 0 182 0 0	19:03
jeblair	Other 0 26 0 66 29 0	19:03
fungi	aha	19:03
jeblair	ReceiveCommits 0 0 0 16 0 0	19:03
jeblair	SshCommandStart 0 0 0 2 0 0	19:03
jeblair	i get that	19:03
*** zaro has joined #openstack-infra-incident		19:04
fungi	ahh, yeah so it's gerrit show-caches --show-threads	19:04
jeblair	i'm not quite certain how to read that yet.	19:04
jeblair	and i need to get lunch.	19:04
fungi	if you add up all the numbers in the http row, they come out to 25 which is what the documentation says the max threads default for httpd is	19:05
fungi	i've polled it a few times and the numbers in RUNNABLE and TIMED_WAITING vary a bit, but always seem to add up to 25	19:06
fungi	i caught it dipping down to 24 once	19:07
fungi	so i take this as confirmation that the default max mentioned in the configuration docs is actually being enforced here	19:08
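[Editor's sketch, not part of the log] The check fungi describes, adding up every state column in the HTTP row, can be reproduced from the pasted output. This is a minimal sketch, assuming the rows are whitespace-separated with the six thread-state counts last, as in the paste above. The function names `parse_thread_rows` and `total` are hypothetical helpers, not part of Gerrit.

```python
# Rows copied from the `gerrit show-caches --show-threads` paste in this log.
SAMPLE = """\
NEW RUNNABLE BLOCKED WAITING TIMED_WAITING TERMINATED
SSH git-upload-pack 0 0 0 14 0 0
SSH-Stream-Worker 0 0 0 17 0 0
HTTP 0 5 0 0 20 0
SSH-Interactive-Worker 0 0 0 182 0 0
Other 0 26 0 66 29 0
ReceiveCommits 0 0 0 16 0 0
SshCommandStart 0 0 0 2 0 0
"""

STATES = ["NEW", "RUNNABLE", "BLOCKED", "WAITING", "TIMED_WAITING", "TERMINATED"]


def parse_thread_rows(text):
    """Map each thread-pool name to a {state: count} dict (hypothetical helper)."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # A data row is a name (one or more words) followed by six counts.
        if len(parts) <= len(STATES):
            continue
        counts = parts[-len(STATES):]
        if not all(c.isdigit() for c in counts):
            continue
        name = " ".join(parts[:-len(STATES)])
        rows[name] = dict(zip(STATES, map(int, counts)))
    return rows


def total(rows, name):
    """Sum all thread states for one pool."""
    return sum(rows[name].values())


rows = parse_thread_rows(SAMPLE)
print("HTTP threads:", total(rows, "HTTP"))
print("all threads:", sum(total(rows, name) for name in rows))
```

For the sample above the HTTP row sums to 25 (5 RUNNABLE plus 20 TIMED_WAITING), and all rows together sum to the 377 threads reported in the summary line.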
jeblair	http://help.collab.net/topic/teamforge80-git-gerrit210x/reference/Gerrit-Performance-Tuning-Cheat-Sheet.pdf	19:08
jeblair	that may also be helpful	19:08
jeblair	i think some of that information is not entirely correct, but it may help fill in some missing gaps.	19:09
jeblair	now lunch for real	19:09
fungi	that's an interesting document	19:09
fungi	zaro: you have a feel for any of this?	19:10
zaro	i can't tell from the documentation what the correct number should be. but probably higher than the default.	19:11
zaro	higher than default would be good.	19:11
fungi	yeah, that's where i am too at this point ;)	19:12
fungi	it's likely going to involve some trial and error, but performance is also at this point being impacted by the elevation in git gc activity again	19:12
zaro	i guess it depends on a lot of factors so maybe just pick one and try it?	19:13
fungi	so we are unlikely to be able to effectively iterate on it	19:13
fungi	or iterate on it quickly anyway	19:13
zaro	yeah, i'm guessing it's something that takes a few tries and may require time to know what the correct number is.	19:14
fungi	that cheatsheet is suggesting 100 is a reasonable "large site" value for httpd.maxThreads	19:14
clarkb	well, tweaking those numbers will require a gerrit restart anyway, which will avoid the GC issue temporarily. Then we should compare to see if GC happens quicker than normal (I think it's like once every couple weeks now)	19:14
fungi	and that 25 (the default) is "small"	19:14
fungi	clarkb: yeah, that's basically what i wanted to try	19:15
clarkb	I think we should bump min httpd threads too just to avoid delays when things spike	19:15
clarkb	we could 4x the defaults and do 5 -> 20 and 25 -> 100	19:15
zaro	well at least it's already setup in puppet	19:16
fungi	clarkb: okay, so you're in favor of upping the base and max values then, not just the max?	19:16
fungi	i guess that may make ramp-up a little more snappy	19:16
zaro	++	19:16
clarkb	fungi: ya I think we should do both	19:16
fungi	wfm	19:16
clarkb	fungi: yup for when things spike	19:16
clarkb	we also need to increase the db threads as described in review.opp	19:17
zaro	how about acceptorthreads?	19:17
clarkb	er review.pp. basically the sshd threads + httpd threads must be less than db threads	19:17
fungi	ahh, yep looks like we're at database.poolLimit=150 right now	19:17
clarkb	zaro: the docs say that 2 acceptor threads should be sufficient for most high traffic sites	19:18
fungi	so should probably bump it to 250 to give some breathing room? (that's 125% of sshd+httpd max)	19:18
clarkb	fungi: 225 would maintain the same headroom	19:18
clarkb	right now it's 100 + 25 = 125	19:18
fungi	fair enough--i'm fine with 225	19:19
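[Editor's sketch, not part of the log] The pool-limit arithmetic above can be checked in a few lines. This is a sketch under stated assumptions: the 100 in clarkb's "100 + 25" is taken to be the sshd thread count, and "headroom" is read as the absolute number of spare database connections. `pool_headroom` is a hypothetical helper, not a Gerrit setting.

```python
def pool_headroom(sshd_threads, httpd_max_threads, pool_limit):
    """Spare DB connections once every sshd and httpd thread holds one."""
    worker_threads = sshd_threads + httpd_max_threads
    # The rule quoted from review.pp: worker threads must stay below the pool limit.
    if worker_threads >= pool_limit:
        raise ValueError("sshd + httpd threads must be less than database.poolLimit")
    return pool_limit - worker_threads


SSHD_THREADS = 100  # assumption: the "100" in "100 + 25 = 125"

current = pool_headroom(SSHD_THREADS, 25, 150)    # before: httpd max 25, pool 150
proposed = pool_headroom(SSHD_THREADS, 100, 225)  # after: httpd max 100, pool 225
print(current, proposed)

# fungi's alternative of 250 expressed as a ratio of sshd + httpd max.
print(250 / (SSHD_THREADS + 100))
```

Both the current and the proposed settings leave 25 spare connections, which is the sense in which 225 keeps the same headroom; 250 would instead be 1.25 times the combined thread maximum.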
zaro	++	19:19
fungi	we've apparently already tuned httpd/maxqueued to 3x the default of 50	19:20
fungi	er, 4x	19:20
clarkb	apparently 200 is the new default for maxqueued	19:20
zaro	it's from this https://review.openstack.org/#/c/285588/	19:20
fungi	in 2.12+?	19:20
clarkb	so maybe we want to increase that a bit too? that one is the one I really don't have ideas for	19:20
zaro	200 is the new default	19:21
fungi	it's likely fine to leave as-is	19:21
clarkb	wfm to leave as is	19:21
fungi	i guess these are enough values i should propose the change first	19:22
* mordred joins the party ...		19:22
fungi	on the way	19:22
zaro	wonder what luca means by this 'If you have over 200 incoming requests queued, possibly there is	19:22
zaro	something more serious to investigate..'	19:22
clarkb	zaro: probably that you are under attack of some sort	19:22
zaro	ahh yeah, that's completely possible	19:23
fungi	yeah, like you're not handling requests fast enough (either because you've tuned the other values poorly, your system is under-sized, or you're in the middle of a denial of service attack)	19:23
fungi	okay, as zaro pointed out (and i just confirmed), the parameters are already all plumbed through	19:28
fungi	https://review.openstack.org/360744	19:28
fungi	clarkb: zaro: mordred: jeblair: ^ does that make sense then?	19:28
clarkb	looking	19:28
fungi	if you approve, i'll hand-patch the result into gerrit.config and restart the service	19:30
fungi	just making sure we're on the same page with the suggested values	19:30
zaro	didn't we agree on 100 for maxthreads?	19:32
clarkb	ya I think that should be 100 not 200	19:34
mordred	lgtm - other than the 100/200 from zaro clarkb	19:35
fungi	gah, yep	19:35
fungi	that was a typo	19:35
fungi	okay, correction is up as patchset 2	19:36
fungi	i got thrown off by copying and editing the httpd_maxqueued line and neglected to switch the 2 to a 1	19:37
fungi	clarkb: zaro: mordred: jeblair: ^	19:39
zaro	lgtm	19:40
mordred	fungi: +2	19:41
clarkb	trying to get it to load	19:41
fungi	oh the irony ;)	19:42
clarkb	I keep getting proxy errors	19:43
clarkb	I am just going to trust you replaced the 200 with 100 and everything else stayed the same	19:43
fungi	yep, i did	19:44
fungi	#status notice The Gerrit service on review.openstack.org is restarting to implement some performance tuning adjustments, and should return to working order momentarily.	19:44
openstackstatus	fungi: sending notice	19:44
fungi	cool, on its way back up with the new values applied	19:46
-openstackstatus- NOTICE: The Gerrit service on review.openstack.org is restarting to implement some performance tuning adjustments, and should return to working order momentarily.		19:46
fungi	i'm keeping an eye on javamelody	19:46
openstackstatus	fungi: finished sending notice	19:47
fungi	the threadcount graph dropped significantly, but not for long. it's already climbing back up almost to where it left off	19:52
clarkb	and will likely go past it	19:53
jeblair	o/	20:07
fungi	yeah, it's just now gotten back to the old level	20:11
fungi	unfortunately we're only around 20 httpd threads in use according to show-caches	20:11
fungi	i'm waiting to see that go over 25	20:12
fungi	now i'm worried that i mistyped max in there twice, but puppet has already reverted the config so i can't tell	20:13
fungi	so particularly eager to see it go over 20	20:13
fungi	though i guess unless demand increases past 20 it's just going to have 20 threads regardless	20:14
clarkb	and we probably have to wait for one of those spikes we were seeing to see it really push up	20:16
clarkb	since under the normal load it seemed happy with the old params	20:17
fungi	later on this evening after 360744 merges and is reflected in the config on disk i'll do another quick gerrit restart just to be doubly certain it's applied as written	20:19
clarkb	fungi: you can also see the threads in the java melody thread listing; it expands in the page with a little + button	20:20
* jeblair helps by enqueuing those changes from earlier		20:21
fungi	clarkb: yeah, though the ssh api is a little easier to get counts from	20:23
clarkb	https://review.openstack.org/monitoring?part=graph&graph=httpSystemErrors shows that the errors have dropped off. I think there is always sort of a baseline error count with gerrit since it throws exceptions for things that are relatively normal too	20:23
fungi	the threads count graph shows it's flatlined right about where it was before the restart	20:30
fungi	and still only totalling 20 httpd threads	20:30
clarkb	huh	20:32
jeblair	did puppet restart it?	20:32
jeblair	(does not look like it; current proc is from 19:45)	20:34
fungi	yay! fears abated... up to 23 httpd threads now	20:45
fungi	i'll check again after dinner	20:46
fungi	wrong time of day i guess. back down to 20 httpd threads	22:39
*** ChanServ changes topic to "situation normal"		22:39
