*** mgagne_ is now known as mgagne | 12:54 | |
*** ChanServ changes topic to "gerrit tuning" | 19:02 | |
jeblair | ssh review gerrit show-caches --show-threads | 19:03 |
---|---|---|
jeblair | Threads: 16 CPUs available, 377 threads | 19:03 |
jeblair | NEW RUNNABLE BLOCKED WAITING TIMED_WAITING TERMINATED | 19:03 |
jeblair | SSH git-upload-pack 0 0 0 14 0 0 | 19:03 |
jeblair | SSH-Stream-Worker 0 0 0 17 0 0 | 19:03 |
jeblair | HTTP 0 5 0 0 20 0 | 19:03 |
jeblair | SSH-Interactive-Worker 0 0 0 182 0 0 | 19:03 |
jeblair | Other 0 26 0 66 29 0 | 19:03 |
fungi | aha | 19:03 |
jeblair | ReceiveCommits 0 0 0 16 0 0 | 19:03 |
jeblair | SshCommandStart 0 0 0 2 0 0 | 19:03 |
jeblair | i get that | 19:03 |
*** zaro has joined #openstack-infra-incident | 19:04 | |
fungi | ahh, yeah so it's gerrit show-caches --show-threads | 19:04 |
jeblair | i'm not quite certain how to read that yet. | 19:04 |
jeblair | and i need to get lunch. | 19:04 |
fungi | if you add up all the numbers in the http row, they come out to 25 which is what the documentation says the max threads default for httpd is | 19:05 |
fungi | i've polled it a few times and the numbers in RUNNABLE and TIMED_WAITING vary a bit, but always seem to add up to 25 | 19:06 |
fungi | i caught it dipping down to 24 once | 19:07 |
fungi | so i take this as confirmation that the default max mentioned in the configuration docs is actually being enforced here | 19:08 |
jeblair | http://help.collab.net/topic/teamforge80-git-gerrit210x/reference/Gerrit-Performance-Tuning-Cheat-Sheet.pdf | 19:08 |
jeblair | that may also be helpful | 19:08 |
jeblair | i think some of that information is not entirely correct, but it may help fill in some missing gaps. | 19:09 |
jeblair | now lunch for real | 19:09 |
fungi | that's an interesting document | 19:09 |
fungi | zaro: you have a feel for any of this? | 19:10 |
zaro | i can't tell from the documentation what the correct number should be. but probably higher than the default. | 19:11 |
zaro | higher than default would be good. | 19:11 |
fungi | yeah, that's where i am too at this point ;) | 19:12 |
fungi | it's likely going to involve some trial and error, but performance is also at this point being impacted by the elevation in git gc activity again | 19:12 |
zaro | i guess it depends on a lot of factors so maybe just pick one and try it? | 19:13 |
fungi | so we are unlikely to be able to effectively iterate on it | 19:13 |
fungi | or iterate on it quickly anyway | 19:13 |
zaro | yeah, i'm guessing it's something that takes a few tries and may require time to know what the correct number is. | 19:14 |
fungi | that cheatsheet is suggesting 100 is a reasonable "large site" value for httpd.maxThreads | 19:14 |
clarkb | well, tweaking those numbers will require a gerrit restart anyway, which will avoid the GC issue temporarily. Then we should compare to see if GC happens quicker than normal (I think it's like once every couple weeks now) | 19:14 |
fungi | and that 25 (the default) is "small" | 19:14 |
fungi | clarkb: yeah, that's basically what i wanted to try | 19:15 |
clarkb | I think we should bump min httpd threads too just to avoid delays when things spike | 19:15 |
clarkb | we could 4x the defaults and do 5 -> 20 and 25 -> 100 | 19:15 |
zaro | well at least it's already setup in puppet | 19:16 |
fungi | clarkb: okay, so you're in favor of upping the base and max values then, not just the max? | 19:16 |
fungi | i guess that may make ramp-up a little more snappy | 19:16 |
zaro | ++ | 19:16 |
clarkb | fungi: ya I think we should do both | 19:16 |
fungi | wfm | 19:16 |
clarkb | fungi: yup for when things spike | 19:16 |
clarkb | we also need to increase the db threads as described in review.opp | 19:17 |
zaro | how about acceptorthreads? | 19:17 |
clarkb | er review.pp. basically the sshd threads + httpd threads must be less than the db threads | 19:17 |
fungi | ahh, yep looks like we're at database.poolLimit=150 right now | 19:17 |
clarkb | zaro: the docs say that 2 acceptor threads should be sufficient for most high traffic sites | 19:18 |
fungi | so should probably bump it to 250 to give some breathing room? (that's 125% of sshd+httpd max) | 19:18 |
clarkb | fungi: 225 would maintain the same headroom | 19:18 |
clarkb | right now it's 100 + 25 = 125 | 19:18 |
fungi | fair enough--i'm fine with 225 | 19:19 |
zaro | ++ | 19:19 |
fungi | we've apparently already tuned httpd/maxqueued to 3x the default of 50 | 19:20 |
fungi | er, 4x | 19:20 |
clarkb | apparently 200 is the new default for maxqueued | 19:20 |
zaro | it's from this https://review.openstack.org/#/c/285588/ | 19:20 |
fungi | in 2.12+? | 19:20 |
clarkb | so maybe we want to increase that a bit too? that one is the one I really don't have ideas for | 19:20 |
zaro | 200 is the new default | 19:21 |
fungi | it's likely fine to leave as-is | 19:21 |
clarkb | wfm to leave as is | 19:21 |
fungi | i guess these are enough values i should propose the change first | 19:22 |
* mordred joins the party ... | 19:22 | |
fungi | on the way | 19:22 |
zaro | wonder what luca means by this 'If you have over 200 incoming requests queued, possibly there is | 19:22 |
zaro | something more serious to investigate..' | 19:22 |
clarkb | zaro: probably that you are under attack of some sort | 19:22 |
zaro | ahh yeah, that's completely possible | 19:23 |
fungi | yeah, like you're not handling requests fast enough (either because you've tuned the other values poorly, your system is under-sized, or you're in the middle of a denial of service attack) | 19:23 |
fungi | okay, as zaro pointed out (and i just confirmed), the parameters are already all plumbed through | 19:28 |
fungi | https://review.openstack.org/360744 | 19:28 |
fungi | clarkb: zaro: mordred: jeblair: ^ does that make sense then? | 19:28 |
clarkb | looking | 19:28 |
fungi | if you approve, i'll hand-patch the result into gerrit.config and restart the service | 19:30 |
fungi | just making sure we're on the same page with the suggested values | 19:30 |
zaro | didn't we agree on 100 for maxthreads? | 19:32 |
clarkb | ya I think that should be 100 not 200 | 19:34 |
mordred | lgtm - other than the 100/200 from zaro clarkb | 19:35 |
fungi | gah, yep | 19:35 |
fungi | that was a typo | 19:35 |
fungi | okay, correction is up as patchset 2 | 19:36 |
fungi | i got thrown off by copying and editing the httpd_maxqueued line and neglected to switch the 2 to a 1 | 19:37 |
fungi | clarkb: zaro: mordred: jeblair: ^ | 19:39 |
zaro | lgtm | 19:40 |
mordred | fungi: +2 | 19:41 |
clarkb | trying to get it to load | 19:41 |
fungi | oh the irony ;) | 19:42 |
clarkb | I keep getting proxy errors | 19:43 |
clarkb | I am just going to trust you replaced the 200 with 100 and everything else stayed the same | 19:43 |
fungi | yep, i did | 19:44 |
fungi | #status notice The Gerrit service on review.openstack.org is restarting to implement some performance tuning adjustments, and should return to working order momentarily. | 19:44 |
openstackstatus | fungi: sending notice | 19:44 |
fungi | cool, on its way back up with the new values applied | 19:46 |
-openstackstatus- NOTICE: The Gerrit service on review.openstack.org is restarting to implement some performance tuning adjustments, and should return to working order momentarily. | 19:46 | |
fungi | i'm keeping an eye on javamelody | 19:46 |
openstackstatus | fungi: finished sending notice | 19:47 |
fungi | the threadcount graph dropped significantly, but not for long. it's already climbing back up almost to where it left off | 19:52 |
clarkb | and will likely go past it | 19:53 |
jeblair | o/ | 20:07 |
fungi | yeah, it's just now gotten back to the old level | 20:11 |
fungi | unfortunately we're only around 20 httpd threads in use according to show-caches | 20:11 |
fungi | i'm waiting to see that go over 25 | 20:12 |
fungi | now i'm worried that i mistyped max in there twice, but puppet has already reverted the config so i can't tell | 20:13 |
fungi | so particularly eager to see it go over 20 | 20:13 |
fungi | though i guess unless demand increases past 20 it's just going to have 20 threads regardless | 20:14 |
clarkb | and we probably have to wait for one of those spikes we were seeing to see it really push up | 20:16 |
clarkb | since under the normal load it seemed happy with the old params | 20:17 |
fungi | later on this evening after 360744 merges and is reflected in the config on disk i'll do another quick gerrit restart just to be doubly certain it's applied as written | 20:19 |
clarkb | fungi: you can also see the threads in the java melody thread listing; it expands in the page with a little + button | 20:20 |
* jeblair helps by enqueuing those changes from earlier | 20:21 | |
fungi | clarkb: yeah, though the ssh api is a little easier to get counts from | 20:23 |
clarkb | https://review.openstack.org/monitoring?part=graph&graph=httpSystemErrors shows that the errors have dropped off. I think there is always sort of a baseline error count with gerrit since it throws exceptions for things that are relatively normal too | 20:23 |
fungi | the threads count graph shows it's flatlined right about where it was before the restart | 20:30 |
fungi | and still only totalling 20 httpd threads | 20:30 |
clarkb | huh | 20:32 |
jeblair | did puppet restart it? | 20:32 |
jeblair | (does not look like it; current proc is from 19:45) | 20:34 |
fungi | yay! fears abated... up to 23 httpd threads now | 20:45 |
fungi | i'll check again after dinner | 20:46 |
fungi | wrong time of day i guess. back down to 20 httpd threads | 22:39 |
*** ChanServ changes topic to "situation normal" | 22:39 |
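
The two bits of arithmetic in this log (adding up a row of the `show-caches --show-threads` table, and sizing `database.poolLimit` from the sshd and httpd thread maximums) can be sketched in Python. This is only an illustration: the sample rows are copied from the 19:03 output above, the helper names `parse_thread_table`, `pool_total` and `pool_limit` are made up here, and the row layout (a pool name followed by six counts) is assumed from this one paste rather than from any documented output format.

```python
import re

# Column order as it appears in the output pasted at 19:03 in this log.
STATES = ["NEW", "RUNNABLE", "BLOCKED", "WAITING", "TIMED_WAITING", "TERMINATED"]

# Rows copied from the log above (not a live query against Gerrit).
SAMPLE = """\
NEW RUNNABLE BLOCKED WAITING TIMED_WAITING TERMINATED
SSH git-upload-pack 0 0 0 14 0 0
SSH-Stream-Worker 0 0 0 17 0 0
HTTP 0 5 0 0 20 0
SSH-Interactive-Worker 0 0 0 182 0 0
Other 0 26 0 66 29 0
ReceiveCommits 0 0 0 16 0 0
SshCommandStart 0 0 0 2 0 0
"""

# Assumed row shape: a pool name (may contain spaces) followed by six integers.
ROW = re.compile(r"^(?P<name>.+?)\s+(?P<counts>(?:\d+\s+){5}\d+)\s*$")


def parse_thread_table(text):
    """Return {pool name: {state: count}} for every line that looks like a row."""
    pools = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header line or anything else that is not a data row
        counts = [int(n) for n in match.group("counts").split()]
        pools[match.group("name")] = dict(zip(STATES, counts))
    return pools


def pool_total(pools, name):
    """Sum the threads of one pool across all states."""
    return sum(pools[name].values())


def pool_limit(sshd_threads, httpd_max_threads, headroom):
    """Hypothetical sizing rule from the discussion: sshd + httpd max + headroom."""
    return sshd_threads + httpd_max_threads + headroom


if __name__ == "__main__":
    pools = parse_thread_table(SAMPLE)
    print("HTTP threads:", pool_total(pools, "HTTP"))
    print("all threads:", sum(pool_total(pools, name) for name in pools))
    # Old settings in the log: poolLimit 150 against 100 + 25 threads.
    headroom = 150 - (100 + 25)
    print("new poolLimit:", pool_limit(100, 100, headroom))
```

Keeping the headroom as an explicit argument mirrors the choice discussed at 19:18: carrying over the existing gap gives one candidate value, while a percentage-based margin would need a different rule.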