Wednesday, 2020-12-09

*** tosky has quit IRC00:00
openstackgerritMerged opendev/system-config master: Put jgit pack settings in jgit.config  https://review.opendev.org/c/opendev/system-config/+/76586700:01
clarkbfungi: I think before restarting with ^ and the heap change we check that jgit.config, gerrit.config and docker-compose.yaml all look good?00:03
fungiyup00:04
fungideploy hasn't finished yet though00:05
clarkboh right need to wait for that too00:05
*** slaweq has quit IRC00:05
fungiyeesh the heap change runs 28 jobs in deploy, all serialized00:06
clarkboh because we changed host var stuff?00:07
clarkb:(00:07
fungithat's (at least part of) why the second change hasn't even started its deploy jobs yet00:07
*** slaweq has joined #opendev00:07
clarkbya I think it is because of the path of the host vars00:08
clarkbwe don't distinguish on a per service basis for those we just glob on the higher paths?00:08
fungianyway, it's fine... not like i have anywhere else to be00:08
*** DSpider has quit IRC00:11
*** ysandeep|away is now known as ysandeep00:13
*** tkajinam has quit IRC00:16
*** tkajinam has joined #opendev00:16
clarkbfungi: I guess the docker compose update should be in place now? but we still have to wait on the jgit.config changes?00:34
fungiyup00:35
fungiconfirmed, JAVA_OPTIONS: "-Xmx44g"00:37
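For context, the heap setting fungi is confirming lives in the JAVA_OPTIONS passed to the gerrit container; a minimal sketch of the relevant docker-compose.yaml fragment, assuming the service is named gerrit (the service name and surrounding layout are assumptions, only the -Xmx44g value is quoted above):

    services:
      gerrit:
        environment:
          # only the heap value is confirmed above; the rest of this
          # fragment is an illustrative assumption
          JAVA_OPTIONS: "-Xmx44g"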
*** guillaumec has quit IRC00:52
*** guillaumec has joined #opendev00:52
*** zigo has quit IRC00:54
clarkbfungi: I think the jgit change is deploying now01:01
clarkband done? (I don't have ssh keys loaded anymore so can't easily check directly myself)01:03
fungiyeah looking01:03
fungipackedGitOpenFiles = 409601:04
fungiet cetera01:04
fungiit's in there now01:04
clarkband gerrit.config still looks happy?01:04
clarkb(I mean it should, just as a good sanity check)01:04
*** mlavalle has quit IRC01:04
fungiyep, those settings are also still in gerrit.config01:05
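The jgit pack settings being checked are plain git-config style keys under [core]; a sketch of what that section plausibly contains (packedGitOpenFiles = 4096 is quoted above and the 400m packedGitLimit is the value referenced later in the discussion; anything beyond that would be an assumption):

    [core]
      # quoted above
      packedGitOpenFiles = 4096
      # referenced later in the discussion as "our 400m limit"
      packedGitLimit = 400m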
fungii'll get ready to restart01:05
fungi#status notice The Gerrit service on review.opendev.org is being restarted quickly to make heap memory and jgit config adjustments, downtime should be less than 5 minutes01:07
openstackstatusfungi: sending notice01:07
-openstackstatus- NOTICE: The Gerrit service on review.opendev.org is being restarted quickly to make heap memory and jgit config adjustments, downtime should be less than 5 minutes01:08
fungistopping01:08
fungiand starting again01:08
fungilooks like it's up again01:09
clarkbya I can get the web dashboard01:10
openstackstatusfungi: finished sending notice01:10
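The quick restart just performed is a plain compose down/up cycle on the review server; a sketch, with the deployment directory and service name being assumptions:

    cd /etc/gerrit-compose          # assumed location of docker-compose.yaml
    docker-compose down
    docker-compose up -d
    docker-compose logs -f gerrit   # watch it come back up; service name assumed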
*** amotoki has quit IRC02:01
*** amotoki has joined #opendev02:02
*** hamalq has quit IRC02:12
*** ysandeep is now known as ysandeep|session02:36
*** cloudnull has quit IRC03:18
*** cloudnull has joined #opendev03:18
*** zbr has quit IRC06:06
ianwif gerrit's feeling a little slow, it's running a fairly inefficient backup process currently.  i'm not going to stop it but i'll look at cleaning it up tomorrow06:08
ianwi feel like it's highly likely this caused the issues we saw at a similar time yesterday06:08
*** ysandeep|session is now known as ysandeep|afk06:09
chkumar|ruckgerrit seems to be pretty slow06:25
*** marios has joined #opendev06:44
*** marios is now known as marios|rover06:45
*** ysandeep|afk is now known as ysandeep|session07:01
*** lpetrut has joined #opendev07:12
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596307:16
*** ralonsoh has joined #opendev07:19
*** dmellado has quit IRC07:26
*** dmellado has joined #opendev07:26
*** eolivare has joined #opendev07:31
*** dmellado has quit IRC07:34
*** dmellado has joined #opendev07:36
*** sboyron has joined #opendev07:51
*** tosky has joined #opendev08:02
*** elod_pto is now known as elod08:05
*** hashar has joined #opendev08:06
*** andrewbonney has joined #opendev08:13
*** rpittau|afk is now known as rpittau08:14
openstackgerritStig Telfer proposed openstack/diskimage-builder master: Update handling of repo files for CentOS 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76616408:33
*** ysandeep|session is now known as ysandeep|lunch09:03
*** eolivare_ has joined #opendev10:05
*** eolivare has quit IRC10:08
*** ssbarnea has joined #opendev10:18
*** zbr has joined #opendev10:23
*** DSpider has joined #opendev10:27
*** brinzhang_ has quit IRC10:34
*** ssbarnea has quit IRC10:36
*** ysandeep|lunch is now known as ysandeep10:38
*** brinzhang has joined #opendev10:42
*** dtantsur|afk is now known as dtantsur10:52
*** hashar is now known as hasharLunch10:56
*** zbr has quit IRC11:25
*** zbr has joined #opendev11:28
*** zbr has quit IRC11:46
*** zbr has joined #opendev11:48
openstackgerritTobias Henkel proposed zuul/zuul-jobs master: DNM: Test ensure-docker due to new docker release  https://review.opendev.org/c/zuul/zuul-jobs/+/76620711:57
*** fressi has joined #opendev12:28
*** hasharLunch has quit IRC12:48
*** d34dh0r53 has quit IRC12:56
*** zbr has quit IRC13:24
*** zbr has joined #opendev13:27
*** zbr has quit IRC13:45
*** zbr has joined #opendev13:47
*** fressi has quit IRC13:47
*** zbr has quit IRC14:03
*** zbr has joined #opendev14:04
*** bodgix has joined #opendev14:12
*** fdegir0 has joined #opendev14:24
*** fdegir has quit IRC14:24
*** noonedeadpunk has quit IRC14:30
openstackgerritchandan kumar proposed openstack/diskimage-builder master: [WIP] dracut list installed modules  https://review.opendev.org/c/openstack/diskimage-builder/+/76623214:35
sshnaidmhi, any reason that override-branch and override-checkout don't work here? https://opendev.org/openstack/ansible-collections-openstack/src/branch/master/.zuul.yaml#L153-L15914:41
sshnaidmI have master instead of rocky in https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1b5/766013/2/check/ansible-collections-openstack-functional-devstack-rocky-ansible-2.10/1b50b64/14:41
sshnaidmI think it was working fine before14:42
openstackgerritchandan kumar proposed openstack/diskimage-builder master: Enable dracut list installed modules  https://review.opendev.org/c/openstack/diskimage-builder/+/76623214:48
*** pabelanger has left #opendev14:48
*** diablo_rojo_phon has joined #opendev15:04
*** fdegir0 is now known as fdegir15:05
mnasersilly question: is there a way with gerrit to change the topic without having to refresh the change?15:17
mnaseri've noticed that to remove the topic, i have to hit the x but then the "add topic" is missing, so i have to refresh for it to appear15:17
mnaserits not a big deal, but im just curious if i'm missing something15:18
avassmnaser: it takes a second to reappear for me15:26
*** ralonsoh has quit IRC15:27
*** ralonsoh has joined #opendev15:27
*** zbr has quit IRC15:29
*** zbr has joined #opendev15:30
*** ralonsoh has quit IRC15:34
*** ralonsoh has joined #opendev15:38
*** ralonsoh_ has joined #opendev15:45
*** ysandeep is now known as ysandeep|away15:46
*** ralonsoh has quit IRC15:46
fungimnaser: you should be able to edit the topic, you could in 2.13 anyway. and there's a rest api for it as well15:46
mnaseravass, fungi: maybe patience is what i needed.  usually in 2.13 before, i'd be able to click and edit, this time around, i cant seem to edit so i click on the 'x' to remove the current one but the 'add topic' button doesnt show up until i refresh (but maybe i should be a bit more patient)15:47
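For reference, the REST API fungi mentions can set or clear a topic without touching the UI at all; a sketch using curl against an authenticated endpoint (the change number and topic value are placeholders, and the credentials are the account's HTTP password):

    # set a topic on a change (change number and topic are placeholders)
    curl -u USERNAME:HTTP_PASSWORD -X PUT \
      -H 'Content-Type: application/json' \
      -d '{"topic": "example-topic"}' \
      https://review.opendev.org/a/changes/766283/topic

    # clear it again
    curl -u USERNAME:HTTP_PASSWORD -X DELETE \
      https://review.opendev.org/a/changes/766283/topic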
*** zbr has quit IRC15:48
*** zbr has joined #opendev15:51
sshnaidmfungi, do you know maybe answer to my question above, about not working override-branch/checkout in https://opendev.org/openstack/ansible-collections-openstack/src/branch/master/.zuul.yaml#L153-L15915:55
sshnaidmfungi, I still have master cloned:  https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1b5/766013/2/check/ansible-collections-openstack-functional-devstack-rocky-ansible-2.10/1b50b64/15:56
mnaserwould it be possible for the openstack/governance project to get a pipeline defined for it called 'governance-check' which listens to reviews/topic changes/comments and runs a 0 node job with pass/fail showing if the patch is ready to merge or not?16:10
mnaserwe have the essential tooling which does all the checks right now16:11
mnaserid like to automate more of this chair-y things16:11
*** mlavalle has joined #opendev16:13
fungisshnaidm: i don't know, i can try to work it out but i'll need some time to familiarize myself with your job, how it's running, and what those options are supposed to do16:13
fungimnaser: i think we already have a pipeline like that used for checking release requests, maybe we can reuse it?16:13
fungimnaser: yeah, the release-approval pipeline... maybe it could also be renamed if that helps16:14
mnaserfungi: yeah maybe it could be a better name16:16
fungimnaser: want to sync up with the release team and see if they'd be up for a more general name and more generalized success/failure labels? https://opendev.org/openstack/project-config/src/branch/master/zuul.d/pipelines.yaml#L256-L28116:17
fungiotherwise i suspect the pipeline you want will look identical to that one except for the name, description, and success/failure labels16:18
fungibut yes we could also add a separate pipeline, i'm just hesitant to continue to set a precedent for project-specific pipelines which are nearly identical16:19
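If a separate pipeline were added rather than renaming release-approval, it would presumably look much like that one; the sketch below is purely illustrative (the pipeline name, label, and trigger events are all assumptions, not the actual project-config definition fungi links):

    - pipeline:
        name: governance-check
        description: Advisory pass/fail checks on governance changes.
        manager: independent
        trigger:
          gerrit:
            - event: patchset-created
            - event: comment-added
              comment: (?i)^\s*recheck\s*$
        success:
          gerrit:
            # label name is an assumption; release-approval uses its own label
            Verified: 1
        failure:
          gerrit:
            Verified: -1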
*** brinzhang has quit IRC16:27
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596316:40
fungisshnaidm: okay, i'm freed up kinda and seeing if i can make heads or tails of your branch overrides now16:42
*** marios|rover is now known as marios|out16:45
*** marios|out has quit IRC16:46
*** lpetrut has quit IRC16:46
*** hamalq has joined #opendev16:53
*** hamalq_ has joined #opendev16:55
*** hashar has joined #opendev16:56
*** hamalq has quit IRC16:59
openstackgerritdaniel.pawlik proposed openstack/diskimage-builder master: Remove centos-repos package for Centos 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76596317:11
*** rpittau is now known as rpittau|afk17:12
*** zbr has quit IRC17:27
*** eolivare_ has quit IRC17:28
*** zbr has joined #opendev17:29
TheJuliaGerrit seems to be returning Proxy 502 errors17:38
clarkbthis looks similar to the thing that happened the other night (relative to me)17:39
clarkbjava is up and running but busy17:39
TheJuliaGC'ing?17:40
clarkbya that is what we seem to think is happening17:40
TheJuliais it doing parallel gc'ing already?17:40
TheJuliathere used to be a background gc option, where it could run in the background to keep the whole-process-locking master gc from firing17:41
clarkbyes I think we have 16 gc threads (we have 16 cores and so that is automatically set up that way?)17:41
TheJuliasadly the master one does exactly this when it occurs and can occur under heap pressure17:41
TheJuliaI don't remember, I think that changed at some point17:42
TheJuliabut don't remember. it has been a long time since I admined java webapps17:42
clarkbits "funny" because the gerrit maintainers say that java 11 largely fixed GC lock the world problems for gerrit. But we were running on 8 happily without that problem and now have it on java 11 :/17:43
clarkbya ok just confirmed lots of busy gc threads according to top -H17:44
* zbr wonders why we do not just give it more memory?17:45
TheJuliamaybe they meant java 11 defaults?17:45
TheJuliaand our current defaults might change that behavior?17:45
TheJuliazbr: more ram can make this even worse17:45
TheJuliaand elongate the GC run times17:45
TheJuliaat least, where the big lock is needed :(17:45
fungizbr: we're already using a vm flavor with 60gb ram17:45
fungiand we need to leave some headroom for processes external to the jvm, for example backups17:46
TheJuliaalso filesystem buffer io17:47
TheJuliawell, buffer and cache17:47
fungiyep17:47
*** zbr has quit IRC17:47
*** zbr has joined #opendev17:49
yoctozeptohere it is, gerrit 50317:49
zbralso being a VM does not help it deliver best performance either, but using baremetal may be harder.17:50
* clarkb is trying to figure out jmap on the test node to then run it on prod to figure out which gc algorithm is used17:53
clarkbfound a wikimedia bug that reports using the g1 gc helped them with similar problems17:53
fungisshnaidm: looking closely at https://opendev.org/openstack/ansible-collections-openstack/src/branch/master/.zuul.yaml#L153-L159 is the "rocky" there expected to be a branch name? did you mean "stable/rocky" instead?17:54
clarkbbut for whatever reason jmap doesn't want to talk to the test server17:55
*** zbr has quit IRC17:56
*** zbr has joined #opendev17:58
clarkbthe internet indicates this is a common problem with about a million reasons for why it may happen (the jmap thing) yay18:02
sshnaidmfungi, should it be stable/rocky?18:06
sshnaidmfungi, btw, should it be override-branch or override-checkout? I see both are used there18:07
fungisshnaidm: that was my question to you, did you mean stable/rocky (an actual branch in those projects)? rocky by itself doesn't exist as any sort of git ref in them that i can see18:08
sshnaidmfungi, yeah, of course I meant stable/rocky18:08
fungisshnaidm: override-branch and override-checkout mean different things18:08
sshnaidmI see18:08
fungialso i suspect override-branch is intentionally undocumented in zuul because it was only kept for backward compatibility, but i need to dig deeper to confirm18:10
corvusyeah, we need to remove it18:11
fungii saw a todo in the merger source about deprecating it18:11
mnaseri don't see a status message but yeah gerrit seems ded ;(18:12
clarkbthe current issue with jmap is we need capadd ptrace18:12
clarkbI'm going to modify the docker-compose on review-test to do that18:12
clarkb(and restart hte container)18:12
avasscan't you just download more ram when the old ram runs out? ;)18:12
sshnaidmfungi, so better to use override-checkout in required-projects18:13
fungistatus notice The Gerrit service on review.opendev.org is currently responding slowly or timing out due to resource starvation, investigation is underway18:13
fungiinfra-root: how about i send that for now?18:13
clarkbwfm18:14
fungisshnaidm: yes i expect so, but it depends on what you were trying to accomplish with it18:14
fungi#status notice The Gerrit service on review.opendev.org is currently responding slowly or timing out due to resource starvation, investigation is underway18:14
openstackstatusfungi: sending notice18:14
-openstackstatus- NOTICE: The Gerrit service on review.opendev.org is currently responding slowly or timing out due to resource starvation, investigation is underway18:14
clarkbok if I add SYS_PTRACE to the container capabilities, restart the containers, then exec into the container as root using exec -u root -it containerid bash I can run `jhsdb jmap --heap --pid 7` on review-test18:16
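Putting the pieces together, the review-test experiment amounts to granting the container the ptrace capability and then attaching jhsdb from inside it; a sketch (the service and container names are assumptions, the jhsdb invocation is the one quoted above):

    # docker-compose.yaml fragment: let jhsdb attach to the jvm
    services:
      gerrit:
        cap_add:
          - SYS_PTRACE

    # after restarting the container:
    docker exec -u root -it <container-id> bash
    jhsdb jmap --heap --pid 7      # pid 7 is the jvm inside this container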
fungiclarkb: remember our second borg backup of the day starts at 17:12 utc, could that coincide with a downward spiral?18:16
fungijust over an hour ago18:16
clarkbthis shows we are running the G1 garbage collector with 13 threads on -test18:16
clarkb I expect we're in a similar setup on prod18:16
mnaserfungi: thank you for the status notice :)18:16
fungiyw18:16
openstackstatusfungi: finished sending notice18:17
clarkband confirmed that java 11 docs say the g1 collector is the default18:18
clarkb"If this assumption doesn’t hold, then G1 will eventually schedule a Full GC. This type of collection performs in-place compaction of the entire heap. This might be very slow."18:19
clarkbI expect this is what we are currently experiencing?18:19
clarkbjust thinking out loud here: We can try a different GC (though G1 is the one expected to work best in the scenario we are in aiui), we can try going back to java 8 (is this safe after install has been running on java 11? I expect so but dunno), We can try and tune G1 (I think we can also add better gc logging?)18:21
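For reference, the collector choices being weighed here are all selected through JAVA_OPTIONS flags on java 11 (listed for illustration only, not something applied at this point):

    -XX:+UseG1GC          # java 11 default, what gerrit is running now
    -XX:+UseParallelGC    # the java 8 default throughput collector
    -XX:+UnlockExperimentalVMOptions -XX:+UseZGC   # ZGC, still experimental on 11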
clarkbhttps://gerrit.wikimedia.org/r/c/operations/puppet/+/504073/4/modules/gerrit/manifests/jetty.pp18:23
fungiahh, okay, so basically java 11 gc is less resource-intensive because it's less thorough, but falls back to being more resource-intensive if it turns out to have been insufficient?18:23
clarkbinfra-root ^ thoughts on restarting the currently running container, then working to get something like ^ landed and restarted again?18:23
clarkbfungi: ya I'm suspecting that may be the case.18:23
clarkbI'm going to put that config in place on review-test if someone else wants to do the restarting18:24
fungido we want to add SYS_PTRACE in production so we can inspect it the next time this crops up?18:24
clarkb(on prod I mean)18:24
clarkbfungi: that may be another good addition. Also we could manually add these bits to prod and do one restart? but test on review-test first?18:25
clarkbI'll work on review-test to make that possible18:25
corvusclarkb: ++ restart now then add logging, then make decision on gc changes18:26
fungiand yeah, it does seem like the main thing which changed before we started hitting this is the openjdk 8->11 upgrade18:27
*** dtantsur is now known as dtantsur|afk18:28
*** fbo is now known as fbo|lunch18:28
*** fbo|lunch is now known as fbo|off18:28
fungiis freenode struggling too? i'm getting lots of messages in bursts18:28
fungiso this discussion is probably more async (or out of sync) than normal18:29
clarkbthe wikimedia logging config is for an older jvm so I'm reading docs now18:30
clarkbnot sure how long it will take to get a working config for java 1118:30
mnaserfungi: my timestamps don't seem very bursty *shrug*18:30
fungii had like 5 messages come in from clarkb all at the same moment18:31
fungii can prepare to restart the service, is the plan to restart asap to get things running again and then restart a second time to add the additional jetty logging, or try to work out the logging config now before we restart and avoid another restart later for that?18:31
corvusfungi: i read clarkb's comment as restart now to fix restart later to change18:33
corvus(which i agree with)18:34
clarkbya I think it's gonna be a while to sort out how to get logging going on review-test (I am working on this now and if someone else wants to restart prod that would be good)18:34
fungiokay, restarting it now18:35
fungiit should be on its way back up again18:35
*** hashar has quit IRC18:38
*** hashar has joined #opendev18:38
clarkbok I think I sorted out a working config on review-test. It will write to review_site/logs/jvm_gc.log and rotate 10 copies there at 20M each18:55
clarkbfungi: corvus maybe you can take a quick look at the log it is writing currently then I can push up a change to modify prod if we are happy with that?18:56
fungichecking18:57
clarkbhttps://phabricator.wikimedia.org/T221026 also has a bunch of interesting info (also interesting that we've already done some of the changes they did, independently discovering similar problems. I wonder if that points to gerrit needing better defaults)18:58
fungiwow, that sure does create a ton of different log files18:59
fungils -l /home/gerrit2/review_site/logs/jvm_gc.*18:59
clarkbfungi: well that's because i was fiddling with different rotation schemes18:59
fungiahh, okay18:59
clarkbwikimedia puts the pid in there, other people suggest the jvm startup time19:00
clarkbbut each of those may leak over time and then we need external rotation so I ended up with simple mode where the jvm does all the necessary rotation (we just have to look at timestamps for the files more carefully)19:00
clarkbthe current config should just do jvm_gc.log*19:00
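On java 11 this logging goes through the unified -Xlog option rather than the older PrintGC* flags in the wikimedia config; a sketch matching the rotation and decorators described here, with the in-container log path being an assumption:

    # 10 rotated files of 20M each, decorated with time, uptime, thread id,
    # log level and tags; /var/gerrit is an assumed container path for
    # review_site
    -Xlog:gc*:file=/var/gerrit/logs/jvm_gc.log:time,uptime,tid,level,tags:filecount=10,filesize=20M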
fungitimes when my dyslexia amuses me: i read "metaspace" in the log as "meatspace"19:00
clarkbhttps://gerrit.wikimedia.org/r/c/operations/puppet/+/504448 also talks about tuning the packfile stuff more19:01
*** sboyron has quit IRC19:02
fungiand yeah, without knowing quite what we're looking for, that log seems reasonable... it's got some gc activity metrics at least19:02
*** andrewbonney has quit IRC19:03
fungiif we're going to put that in place in production as well, we might wait for a less disruptive time later today to restart with it19:03
clarkbya I'll work on a change for that now. One thing to note is that this config is java 11 specific which means if we roll back we'll want to use the config that looks like wikimedias that I linked earlier19:04
clarkbinfra-root do we want to add the ptrace cap to the docker compose file?19:06
clarkbI can do that in one change if so19:06
clarkbfungi: thinking about https://gerrit.wikimedia.org/r/c/operations/puppet/+/504448 I wonder if that helps gc'ing because it packs the packfiles into more contiguous areas (assuming that gets preallocated)19:08
clarkbfungi: and maybe with our smaller 400m limit things get fragmented all over and then we hit these problems?19:08
fungii wonder if we see nearly as much fetch activity in our deployment, but considering zuul and other ci systems fetch refs from gerrit i suppose that's where it could be hitting us19:10
fungiwhat sorts of additional inspection opportunities does allowing SYS_PTRACE cap give us?19:12
openstackgerritClark Boylan proposed opendev/system-config master: Add jvm gc logging to gerrit and traceability perms  https://review.opendev.org/c/opendev/system-config/+/76628319:14
clarkbfungi: ^ the commit message in there hints at it. But we can run the jhsdb jmap commands against the running pid19:14
clarkbI think it can do other debugging too19:14
clarkbthinking out loud some more: maybe instead of reverting to java 8 we can try the parallel collector on java 1119:31
clarkbsince that is the java 8 default19:31
clarkbre the git packfile size thing nova has a packfile on review-test that is 810MB large. Which is too large to fit into our 400m limit (which may cause problems?)19:36
clarkbbut ya maybe we should consider bumping that value up like wikimedia did. They went with 1/6th of their heap (though this isn't how they calculated the value)19:37
clarkbbut maybe that points to it being a good idea to allocate a good portion of heap to the packfiles19:37
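If that value were bumped, it would be a one-line jgit.config/gerrit.config change; an illustrative sketch only, with 1g as an assumed value large enough for the ~810MB nova packfile (not something applied at this point):

    [core]
      # assumed value; wikimedia sized theirs as a fraction of heap
      packedGitLimit = 1g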
clarkbfungi: oh also we should cross check against melody I think19:38
clarkbI need to eat lunch now though19:38
clarkbhrm I checked top before actually making a sandwich and are we already doing it again?19:47
corvusclarkb: sorry i was deep in another terminal; looking19:47
clarkbwe == gerrit gc'ing in this case19:48
*** lbragstad has joined #opendev19:48
clarkbI think that maybe rules out backups somehow triggering it19:48
clarkbthinking out loud here maybe lowering heap size has made this worse? and we should consider going back to 48g then work with spreading out other system activities to reduce their overlapping memory needs19:50
clarkband on the crazy idea front we could try the Z GC19:50
corvusclarkb: why is ptrace cap needed in docker-compose file?19:50
clarkbcorvus: to run jhsdb jmap -heap --pid $pid against the jvm19:51
clarkbits another avenue for digging into what the jvm is doing19:51
* TheJulia guesses more unhappiness?19:51
corvusoh because you want to run that from inside a container19:51
clarkbcorvus: ya19:51
clarkbwe don't have the same jdk/jre outside of the container so using host tools is clunky19:51
corvus(vs on the outside where presumably no d-c change would be necessary, but we probably don't have the jhsdb tool outside)19:51
corvuscool, i'm caught up then :)19:52
clarkbTheJulia: yes top reports we're in a similar situation with garbage collecting19:52
corvusclarkb: if things are bad now, i think we should restart with 766283 in place19:52
clarkbcorvus: ok I'll manually make those edits now19:52
corvusthe logs on review-test suggest that if we had the same info in prod we might have more clues20:52
* TheJulia wonders if a beer and maybe a stream for SN8 might be good for the day19:53
corvusTheJulia: any update on a time for sn8?19:53
TheJulianone19:53
TheJuliaSpaceX says they won't go live until T-5 apparently... there is tank farm venting19:54
TheJuliaso... could be close19:54
clarkbcorvus: want to double check docker-compose.yaml on prod and if it looks good to you I can restart19:54
* corvus hopes that raptor engines don't have java gc issues19:54
TheJuliacorvus: that being said there are multiple T38s in the air19:54
corvusclarkb: lgtm19:55
clarkbok restarting now19:56
fungisorry, stepped away for a moment figuring we had more time before the problem resurfaced19:57
fungicaught up now19:57
clarkbjvm_gc.log is there and is logging now19:57
fungiinteresting that it cropped back up so quickly this time. have to wonder what's almost immediately driving it into that state19:58
TheJuliacorvus: looks like the rocket is starting to vent... so you may want to fire up any streaming19:58
corvusTheJulia: roger, wilco!  :)19:59
clarkbin the log file the format is roughly timestamps, jvm uptime, thread id, log level, tags, then the info20:01
*** hashar has quit IRC20:03
* corvus is watching the gc log and the starship for changes in either20:04
clarkbcorvus: If I'm reading it correctly it is frequently freeing about 8GB of memory20:06
clarkbmaybe even more now20:06
TheJuliaoff, does the browser immediately submit comments when I click save or is it async?20:07
clarkbthat's the 9876M->1234M(XYZAB) lines20:07
clarkbTheJulia: I think I've seen it register the posted message after I've navigated to a different page20:07
TheJuliaInteresting...20:08
* TheJulia re-types a comment20:08
corvusclarkb: 18... -> 5... is pretty common too20:08
clarkbcorvus: that does make me wonder if maybe the packfile limit and possibly jgit strong refs would help (strong refs scare me as we can't gc those). Assuming those values are too small we may be allocating a small buffer, using it, then allocating a new small buffer to read the next chunk which deallocates the previous small buffer?20:09
clarkband do that frequently enough and we may see this sort of behavior?20:09
clarkbalright I've been yelled at that my lunch is ready so I better go eat20:11
ianwo/20:24
corvusianw: the action is at: tail -f /home/gerrit2/review_site/logs/jvm_gc.log20:26
fungii'll be heading into dinner prep zone soon myself, but should be free later to discuss possible additional tunings20:26
ianwahh excellent, yes i was going to get into suggesting we fiddle at that level today after the issues yesterday20:26
ianwwas it the same thing with all the GC threads just spinning and nothing else happening?20:27
fungiseems so20:27
fungitwice20:27
*** ralonsoh_ has quit IRC20:28
fungiwe restarted and it came right back almost immediately20:28
fungisuggesting some external trigger20:28
ianwwe have definitely seen that behaviour with java 8, i remember debugging it and can try finding it in my notes too20:28
ianwthat is why i was thinking perhaps we have some magic trigger to a pathological case for that particular gc20:28
clarkbianw: note that I think different GC systems are in play between 8 and 1120:29
ianwclarkb: that's about where i got to :)  trying to figure out which one and the logging, so great minds i guess :)20:30
clarkbit just did a 35GB -> 18GB pass (and is now back up to ~24GB)20:30
corvusyeah, that after a bunch of 35->31 passes20:31
clarkband more recently it looks like we bumped up against our heap limit and then it did a 3s full pause to bring that back down again20:33
clarkbPause Full (GCLocker Initiated GC) <- grep for that to see the full pass20:34
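That is, something along the lines of the following, using the log location set up earlier:

    grep "Pause Full" /home/gerrit2/review_site/logs/jvm_gc.log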
fungithat gives some credence to the 48->44gb reduction in max heap being involved20:36
clarkbya or at least making it worse20:37
openstackgerritIan Wienand proposed opendev/system-config master: bup: Remove from hosts  https://review.opendev.org/c/opendev/system-config/+/76630020:41
clarkbit just did another full one20:41
clarkb6s this time20:41
clarkbalso note that the GC number is important because they aren't always in order as it can do things in parallel20:41
ianwnot to derail, but i'm happy to babysit that ^^ hopefully before 5pm my time, when it certainly doesn't help the gerrit situation :)20:42
ianwboth clarkb and myself have mounted the backups from both servers on review particularly, and can navigate and grab various bits20:43
clarkbwow after the last full run 2022 was the number for it. I think it went from 45g to 9g20:44
clarkbI wonder if this is the issue with the soft jgit cache refs that matthias has pointed out20:44
clarkbwhere you basically thrash the memory because you are constantly unloading and reloading into memory20:44
clarkbthe problem is a proper lru system would be nicer that way we don't need a ton of memory to have everything cached all the time :/20:44
clarkbbut I mean maybe if we see this persisting increasing packedGitLimit and setting the strong refs option are things to consider20:45
clarkbprobably packedGitLimit first?20:45
clarkbthere isn't a change for that yet, should I write one? or do others think that is a bad idea (mostly going off of https://phabricator.wikimedia.org/T221026 and repo discuss threads)20:46
fungiif i follow the argument, it sounds like it could help20:54
fungibasically nova (and probably other repos) have packfiles larger than the limit so they're not retained in memory?20:54
clarkbya that is what I'm thinking20:55
clarkband so we're just allocating and allocating and allocating memory to serve those things20:55
clarkbrather than allocating and hopefully reusing a buffer20:55
clarkbit is possible that we could just allocate more memory by bumping up that limit and things will get worse though :/20:55
corvusclarkb: why packedgitlimit and strong refs together?  as opposed to two separate things20:56
clarkbcorvus: matthias wrote an email that makes them sound like they go together, let me find a link20:56
clarkbcorvus https://groups.google.com/g/repo-discuss/c/35-9sXmdtEQ/m/Ok1cOpBmBQAJ20:57
clarkbcorvus: basically he is saying that the garbage collection has a habit of evicting all that jgit pack data only for gerrit to reload it again20:57
clarkbhowever, as I've noted before using strong refs for that memory will prevent it from being garbage collcted so we may end up with more problems20:57
corvusclarkb, TheJulia: https://www.youtube.com/watch?v=ap-BkkrRg-o is live20:58
clarkbyup t minus ~5 minutes now20:58
clarkbcorvus: maybe its better to do the packfile limit first20:58
clarkbobserve and continue to refine our observations and thoughts on updates20:59
* fungi turns on, tunes in, and drops out20:59
* TheJulia opens numerous streams21:00
clarkband maybe before we change anything we want to catch the problem in the jvm_gc.log?21:00
corvusyeah21:01
fungithis might be the longest two minutes and six seconds i've experienced in quite a while21:10
corvusfungi: time is an illusion? :)21:10
* fungi checks to see if it's lunchtime again21:10
*** slaweq has quit IRC21:12
*** slaweq has joined #opendev21:13
ianwi guess the people at spacex are probably furiously discussing the possibility of a whole other type of garbage collection21:14
clarkbfungi: corvus ianw did you want to approve https://review.opendev.org/c/opendev/system-config/+/766283 since we are now running with that in prod?21:18
fungidone21:19
clarkbianw: https://review.opendev.org/c/opendev/system-config/+/766300 is failing CI jobs21:19
ianwyeah just realised i'll need to pull that out of testinfra too21:19
openstackgerritIan Wienand proposed opendev/system-config master: bup: Remove from hosts  https://review.opendev.org/c/opendev/system-config/+/76630021:20
openstackgerritIan Wienand proposed opendev/system-config master: bup: Remove from hosts  https://review.opendev.org/c/opendev/system-config/+/76630021:22
clarkbgerrit is working really hard right now to gc21:28
TheJulia:(21:29
clarkbits bouncing up against the heap limit looks like, it will get 1gb free then back up again21:29
fungimaybe it just needs encouragement21:29
clarkbdo we want to try the jmap thing21:33
fungiif we wait until it's completely thrashing we may not be able to get any response from the jvm?21:34
clarkbfungi: well I think it is already there21:34
TheJuliaare there tons of sockets?21:35
fungiahh, i didn't realize, still finishing dinner21:35
TheJuliamentally thinking maybe cut traffic to it21:35
clarkbya there are a few hundred but a number are in close wait21:39
TheJuliaare connections being kept alive or being forced closed?21:39
clarkbwell close wait is a force close that the server is waiting to make sure the client acks isn't it?21:40
clarkbthe sockets are still there though21:41
johnsom502 proxy error here.21:41
johnsomclose wait for a socket is the TCP reuse timer21:41
TheJuliaclarkb: I'm thinking http socket closure21:42
TheJuliabrowsers like to hang connections open for a while in keep-alive mode and socket buffers can sit in the middle21:42
clarkbI ran the jmap it mostly just confirms what gc logs show. That we're full up21:42
corvusthe mark/sweep is taking a long time now21:43
corvus5 seconds to mark21:43
clarkbTheJulia: those should be cheap though right? I mean a few hundred sockets is maybe a few hundred mb ?21:44
TheJuliaclarkb: and threads they are living on... which consumes more of the heap21:44
TheJuliaclarkb: in a past life, where we had similar issues, we had to let the frontend proxy handle http connection keep-alive and had the back-end force connection closure21:45
clarkbiirc java threads don't go on the heap (this was a major problem with jenkins because we would oom with plenty of heap remaining; instead there was no more stack space for threads)21:46
corvusit looks like we're just maxed out now; it's not actually releasing any memory, which makes me think we're not looking at a situation where we keep clearing and reloading the cache21:46
clarkbcorvus: ya21:46
TheJuliadynamic caching inside the jvm?21:46
corvusno difference in outcome between the different gc types21:47
TheJuliaerr, well, for gerrit21:47
clarkbTheJulia: the caching of jgit data21:47
clarkbTheJulia: one of the things a gerrit maintainer has described that we thought may be similar is thrashing of the jgit caches21:47
TheJuliaugh21:48
clarkbTheJulia: gc will unload the jgit cache then jgit will reload the data immediately after in some situations according to them, however we'd expect that the gc'ing would actually unload it again if that was the case here21:48
clarkbI don't have any great ideas. Lots of less good ideas. java 8 seemed happier and would have been using parallel gc which we can try. we also reduced heap space by 4gb to give the system more room, we could give gerrit that 4gb back21:49
clarkbwe can try enabling git protocol v2 which is supposed to be more efficient21:50
clarkbwe can revert to java 821:50
clarkbwe can make apache close conections more aggressively as TheJulia suggests21:50
TheJuliaI'd only change one thing at a time, fwiw21:51
clarkbyup thats what we've been doing since the upgrade21:51
TheJuliaawesome21:51
clarkbone other thing I hadn't considered is that the java 11 update also pulled in a newer gerrit I think21:51
clarkbpossible there is a regression between the gerrit we had on java 8 and the one we're running on java 1121:52
corvuswhen was the 8->11 upgrade?21:52
clarkbcorvus: I want to say friday /me checks status updates21:53
clarkb2020-12-03 03:46:29 UTC restarted the gerrit service on review.o.o for the openjdk 11 upgrade from https://review.opendev.org/76365621:53
clarkband ianw first observed this problem early monday morning (relative to my local time)21:54
TheJuliaclarkb: fwiw, I don't remember how we were doing it anymore. we had 2 proxies, one was technically an authentication gateway for requests21:54
TheJuliaI just remember, it became a factor where it was fine with the authentication gateway, but not the actual high memory usage webapp21:55
clarkbcorvus: https://bugs.eclipse.org/bugs/show_bug.cgi?id=569349 could we be hitting that21:56
openstackbugs.eclipse.org bug 569349 in JGit "PackInvalidException when fetch and repack run concurrently" [Normal,Resolved: fixed] - Assigned to jgit.core-inbox21:56
corvusclarkb: http://cacti.openstack.org/cacti/graph_image.php?action=view&local_graph_id=25&rra_id=321:56
corvusdefinite behavior difference this week vs last 221:57
*** otherwiseguy has joined #opendev21:58
clarkbthe way gerrit does merges makes history really hard to read21:58
clarkbI think the gc in the context of that jgit bug is git gc not jvm gc21:59
corvusagree21:59
fungido we want to push the heap limit back up to 48gb and restart now, or also try with packedGitLimit increased to something large enough to include the present nova packfile?22:02
clarkbI think setting it up to 48g is likely a good idea. due to what corvus said I'm not sure the packedGitLimit thing will help. My other thought for immediate ideas is use parallel gc since java 8 would have done that22:03
clarkband java 8 had different behavior22:03
corvusthose sound like 2 good next changes22:03
fungiboth at once, or one we prefer doing first?22:04
clarkbI think both as that's most like the java 8 setup before without reverting to java 822:04
clarkbI'll edit docker-compose.yaml now and ya'll can check it when done22:05
clarkbfungi: corvus ianw ok docker-compose.yaml has been edited22:06
clarkbif that looks good let me know and I can do a down up -d22:06
fungi-XX:+UseParallelGC and -Xmx48g22:08
fungilgtm22:08
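At this point the hand-edited options on production amount to roughly the following JAVA_OPTIONS (the gc logging flag added earlier is elided here; the exact formatting in the compose file is an assumption):

    JAVA_OPTIONS: "-Xmx48g -XX:+UseParallelGC"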
clarkbI'll give corvus another minute or two but then I'll proceed22:09
ianw++22:10
corvuslooking22:11
corvuslgtm22:12
clarkbalright proceeding22:12
fungialso the logging changes still seem to be in place22:12
corvus(sorry, was weighing in on https://bugs.chromium.org/p/gerrit/issues/detail?id=13800 -- the stream-events recheck thing for 3.3.0)22:12
clarkbfungi: yup and it shows the GC implementation switched22:14
clarkb766283 is in the gate and so may end up merging and then undoing the parallelgc + 48GB bump22:15
clarkbI'll write a change for ^22:16
openstackgerritClark Boylan proposed opendev/system-config master: Bump gerrit heap back to 48g and use parallel gc  https://review.opendev.org/c/opendev/system-config/+/76631722:20
fungithe 48gb "bump" could just be a revert of us lowering it to 44gb? i think that was stand-alone22:20
fungioh, or that, sure22:20
fungiwhatever's easiest22:21
clarkbmostly I don't want to lose track and since we did them together pairing them is easy22:21
fungiyup22:21
openstackgerritIan Wienand proposed opendev/system-config master: graphite: also deny account page  https://review.opendev.org/c/opendev/system-config/+/76631822:23
openstackgerritStig Telfer proposed openstack/diskimage-builder master: Update handling of repo files for CentOS 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76616422:23
clarkbI'm not sure this is any happier :(22:26
clarkbwhich makes me think again to maybe changes in gerrit itself22:26
clarkbor java 1122:27
clarkbunfortunately I have basically zero good ideas22:29
fungiwe could try rolling back to the patched 3.2 image we started with during the upgrade?22:29
clarkbor even the one before java 1122:30
fungibut my suspicion is the timing has more to do with activity actually ramping back up after the major holiday in the usa22:30
fungiand lots of extra patches/rechecks due to pip and virtualenv updates22:31
clarkbshould we restart without parallel gc but keep 48g? I think parallelgc is just spinning its wheels22:32
fungihmm, yeah maybe. i just tried the webui and it's not loading for me22:33
clarkbya if you watch the jvm_gc.log you can see when it devolves22:33
clarkband when I posted earlier that it wasn't happier it had basically already reached that point22:33
corvusyes; meanwhile, what's the deal with rolling back the gerrit+jvm upgrade?22:34
fungistill have to wonder what it was about today that things have gotten so much worse, just the the past few hours22:34
clarkbwe can try the ZGC which is experimental on java 1122:34
clarkbcorvus: we build gerrit against the stable-3.2 branch. If we can manage to land a change to revert java 11 we could rebuild on java 822:34
clarkbcorvus: separately we may need to decide if we want to checkout a specific stable-3.2 ref or tag?22:34
clarkbpossibly we've got the image cached somewhere but every time I try to look at docker image ids I get immediately lost because what the docker hub ui shows you is not what docker image list shows you and so on22:35
clarkbI'm undoing parallel gc now22:35
corvusi mean, i'm pretty much at 'if we can get back to friday do it'22:35
clarkbcorvus: ya22:36
fungiinsecure-ci-registry.opendev.org:5000/opendevorg/gerrit   f76ab6a8900f40718c6cd8a57596e3fc_3.2   fad1ccad836a        2 weeks ago         681MB22:36
clarkbcorvus: maybe b12bec20e800 or 712c96672bbb22:37
fungii think that's the one we used at the end of the upgrade?22:37
ianwfor reference, this is the same sort of gc death spiral from the last time i remember debugging http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-10-16.log.html#t2020-10-16T00:11:3422:37
clarkbI've updated docker-compose and will restart again?22:37
fungii confirm the parallel gc option is now out of the compose file22:38
corvusianw: you suspect this is not gerrit/jvm related?22:38
clarkbI've not yet restarted and will wait for at least some other acks that that is a reasonable thing to do22:41
corvuswfm22:41
ianwcorvus: don't think so, just going back over my notes.  not sure what it means, other than we saw similar in the before times as well, but with less frequency22:42
clarkbok restart is done22:42
clarkbwhere we are now is logging + g1 gc + 48g22:43
clarkbliftoff22:45
openstackgerritMerged opendev/system-config master: Add jvm gc logging to gerrit and traceability perms  https://review.opendev.org/c/opendev/system-config/+/76628322:45
clarkb^ will undo the 48g change22:47
ianwone engine stopped?  is that supposed to happen?22:47
clarkbno idea22:48
clarkbto summarize everything that happened today so far: We noticed that GC'ing had gone crazy again. In response to that we added gc logging to the service and the ability to run jhsdb jmap commands against the running process in the container. When this happened again we were able to confirm it appeared we were at the limit of memory22:51
clarkbfrom there we compared to what things looked like last week and tried to mimic java 8 settings on java 11. But that went poorly as well (we enabled parallel gc instead of g1 gc) and bumped heap back up to 48g22:53
clarkbwe are now back to the default java 11 gc which is g1 gc and 48g heap22:53
clarkbthinking about TheJulia's connection limits idea maybe we can tune that back in gerrit itself. sshd.threads is the limit for git fetches over http and ssh. We could dial that back a bit maybe?22:54
ianwwould that just make things slightly slower, or reject connections leading to ci issues?22:59
clarkbianw: it is supposed to queue things, but ya if things aren't serviced quickly enough maybe it would start failing ci systems and the like23:00
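Dialing sshd.threads back would be another small gerrit.config tweak; an illustrative sketch only, with assumed values (not something that was applied):

    [sshd]
      # assumed values; threads bounds the interactive git/ssh worker pool,
      # batchThreads carves out a separate pool for batch/CI accounts
      threads = 16
      batchThreads = 2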
clarkbdocker image id 3391de1cd0b2 is tagged 3.2 on review-test and is on review.o.o from 2 weeks ago23:16
clarkbI think it may be the promoted version of fad1ccad836a23:17
clarkbwhich is what we did the upgrade with23:17
openstackgerritStig Telfer proposed openstack/diskimage-builder master: Update handling of repo files for CentOS 8.3  https://review.opendev.org/c/openstack/diskimage-builder/+/76616423:18
ianwthere seems to be a lot of jobs maybe not getting nodes?23:18
ianwi guess we're just busy23:21
*** bodgix has quit IRC23:22
*** slaweq has quit IRC23:23
*** bodgix has joined #opendev23:23
openstackgerritClark Boylan proposed opendev/system-config master: Revert "Reduce gerrit heap limit to 44g"  https://review.opendev.org/c/opendev/system-config/+/76631723:27
clarkbfungi: ^ did your revert idea23:27
clarkbbut kept the change number23:27
fungiwfm23:27
ianwi don't know what's up here but https://grafana.opendev.org/d/9XCNuphGk/zuul-status?orgId=1 shows ze12 going a bit nuts23:45
clarkbianw: disk looks good on it at least (one thing we've seen cause problems with executors)23:46
*** zbr has quit IRC23:46
clarkbseems like there was another case I'm not remembering, maybe something to do with gearman?23:46
clarkboh I'm thinking of the finger port/process dying23:47
clarkbin that case it continues to run jobs happily but you don't get the console logs from them on the dashboard23:47
ianwit seems to want to do all the work; perhaps the others are just busy23:47
*** zbr has joined #opendev23:48
clarkbya they are supposed to self throttle23:48
ianwMem:           7.8G        7.1G        149M         34M        589M        200M23:48
ianwthat's a lot of used memory and not much in caches23:48
openstackgerritPoornima Y N proposed openstack/project-config master: Integration of Secure Device onboard Rendezvous service on STX  https://review.opendev.org/c/openstack/project-config/+/76549423:51
openstackgerritClark Boylan proposed opendev/system-config master: Enable git protocol v2 on gerrit  https://review.opendev.org/c/opendev/system-config/+/76636523:51
clarkbinfra-root ^ fyi, that is enabled on review-test and I've tested it. Compared against prod currently and modified -test config and it does enable it as far as I can tell. Harder to determine if it will have a big impact without seeing clients try it I guess? something to think about23:52
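One way a client can be seen trying it: force protocol v2 on the git side and look for the version 2 advertisement in the packet trace (the project name below is just an example):

    GIT_TRACE_PACKET=1 git -c protocol.version=2 \
      ls-remote https://review.opendev.org/opendev/system-config 2>&1 | head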
