Friday, 2025-02-14

Clark[m]That was fast. I'm out for snow v200:04
fungienjoy!00:04
fungi/opt/backups-202010 on backup02.ca-ymq-1.vexxhost has reached 90%, i'll get a prune going00:06
clarkbmy keys have aged out for the day. codesearch02 says hound is not ready. I expect this initial startup is going to be slow cloning rather than updating everything00:39
clarkbcan check it in the morning and debug from there if it doesn't reach a happy state00:39
clarkbthanks again!00:39
ianw2025/02/14 00:48:17 All indexes built!00:49
ianw2025/02/14 00:48:17 running server at http://localhost:6080 00:49
ianw(i just logged in to check just in case the "not ready" was something to do with the scripts on a totally new host :)00:49
fungi#status log Pruned backups on backup02.ca-ymq-1.vexxhost reducing volume usage from 90% to 61%01:13
opendevstatusfungi: finished logging01:13
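For context, the prune here is a borg operation against the repositories under /opt/backups-202010. A minimal sketch of the idea follows; the repository path, retention values, and the compact step are illustrative assumptions, not the actual backup02 configuration, which is likely wrapped in a site-specific prune script.

    # minimal sketch, assuming borg-based backups; paths and retention are illustrative
    borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
        /opt/backups-202010/borg-someserver/backup     # repository path is hypothetical
    borg compact /opt/backups-202010/borg-someserver/backup   # reclaim freed space (borg >= 1.2)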
cloudnullevenings all :D02:13
*** cloudnull9 is now known as cloudnull04:16
opendevreviewMatthieu Huin proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url  https://review.opendev.org/c/zuul/zuul-jobs/+/92758213:28
*** tkajinam is now known as Guest912913:39
opendevreviewMatthieu Huin proposed zuul/zuul-jobs master: Fix the upload-logs-s3 test playbook  https://review.opendev.org/c/zuul/zuul-jobs/+/92760013:54
opendevreviewKarolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles  https://review.opendev.org/c/opendev/glean/+/94167215:06
opendevreviewClark Boylan proposed opendev/system-config master: Remove zuul-lb01 from inventory  https://review.opendev.org/c/opendev/system-config/+/94167715:43
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Cleanup zuul-lb01 and reset zuul ttl  https://review.opendev.org/c/opendev/zone-opendev.org/+/94116815:43
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Switch codesearch over to codesearch02  https://review.opendev.org/c/opendev/zone-opendev.org/+/94167815:43
opendevreviewClark Boylan proposed opendev/system-config master: Remove codesearch01  https://review.opendev.org/c/opendev/system-config/+/94167915:54
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Remove codesearch01 from DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/94168115:55
clarkbinfra-root ^ I'm fairly confident the zuul-lb changes are all ready to go now. For codesearch02 I haven't seen anything amiss on the server after a quick check and the service seems to be working for me at https://codesearch02.opendev.org so I think we can proceed there too15:56
fungilgtm, i've approved the codesearch dns switch and should be around to see it take effect before i need to disappear for lunch in a few minutes16:06
opendevreviewMerged opendev/zone-opendev.org master: Switch codesearch over to codesearch02  https://review.opendev.org/c/opendev/zone-opendev.org/+/94167816:08
clarkbthanks!16:09
fungideploy succeeded16:22
clarkbthe new name resolves for me now16:22
fungisame16:23
clarkbinspecting the cert in firefox confirms I'm talking to the new server and it continues to work16:23
fungiworking for me16:23
fungisearches return expected content16:23
fungiheaded to lunch, bbiab16:30
fungiokay, back17:29
mnaserusing opendev has been beyond frustrating lately :(17:51
mnaserspecifically gitea17:51
mnaserPowered by Gitea Version: v1.23.3 Page: 21401ms Template: 84ms17:51
mnaserI can't be the only one constantly hitting these17:51
fungiwe're probably overrun with ai scraping crawlers again, i'll take another look at the apache logs17:52
mnaserfwiw i am seeing this while browsing zuul/zuul-jobs17:52
fungiload average on all the gitea backends is pretty nominal at the moment, looking for other possible causes17:54
fungifwiw, browsing https://opendev.org/zuul/zuul-jobs from here is really snappy. clicking around different files i'm getting like "Page: 58ms Template: 4ms"17:55
fungilooks like i'm going through zayo to reach our gitea servers in vexxhost17:57
fungimnaser: also a heads up, suddenly openstackclient is complaining and failing when i try to communicate with vexxhost: "https://vexxhost.com is a remote profile that could not be fetched: 502 Bad Gateway"17:58
fungihappening from multiple locations (a vm in rackspace as well as my home workstation)17:59
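A direct probe of the endpoint helps show whether the 502 is coming from the service itself rather than anything in openstackclient; this is just a generic curl check, not tied to how the SDK fetches vendor profiles.

    # print only the HTTP status code; a 502 here points at the endpoint, not the client
    curl -sS -o /dev/null -w '%{http_code}\n' https://vexxhost.com/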
fungitesting all the gitea backends for a single file from zuul-jobs like https://gitea09.opendev.org:3000/zuul/zuul-jobs/src/branch/master/tests/ansible.cfg the slowest render i got was "Page: 154ms" from gitea13 18:02
fungiokay, pulling https://gitea13.opendev.org:3000/zuul/zuul-jobs/ just now i got "Page: 16820ms"18:03
fungithe rest of them were in the 50-80ms range, but so was gitea13 when i refreshed, which suggests it's something to do with cold cache hits, maybe on that specific backend18:04
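The per-backend comparison described above can be scripted; the full backend list below is an assumption (only gitea09 and gitea13 are named in this log), and the URL is the one fungi used.

    # time the same file from each backend directly, bypassing the load balancer
    for n in 09 10 11 12 13 14; do    # backend numbers are assumed
        printf 'gitea%s: ' "$n"
        curl -so /dev/null -w '%{time_total}s\n' \
            "https://gitea${n}.opendev.org:3000/zuul/zuul-jobs/src/branch/master/tests/ansible.cfg"
    done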
Clark[m]Yes gitea is very cache sensitive because git isn't all that fast. I suspect the markdown/rst renders may not be quick either depending on the content 18:08
fungii picked a random old zuul-jobs commit from 7 years ago and loaded it directly from every backend, all of them came back quite quickly18:08
Clark[m]If system load is fine (so we're just not generally slow) then cold caches could explain it. I think new commits invalidate the cache but afterwards things should generally stick around18:08
fungie.g. https://gitea13.opendev.org:3000/zuul/zuul-jobs/commit/af0d8eb25b63f320ed63500a51b287ed45bbbb1b was "Page: 247ms"18:09
Clark[m]The top level page renders readme but random commits won't. Do other rendered pages (for eg docs) load slowly?18:12
Clark[m]The top level page also loads repo stats that could be slow18:12
fungimnaser: we're using the local rootfs on the gitea backends, so storage i/o is going to be whatever the hypervisor hosts put the default rootfs on. i'd get you the server instance uuid for gitea13, but as noted vexxhost isn't working with openstackclient at the moment18:12
Clark[m]Oh ya disk io could explain it especially if we have cold cache then hit an io slowdown18:13
mnaseroh, well TIL about the remote profile issue18:13
mnaseri need to remember to figure out what the well-known url is18:13
fungiClark[m]: i'm getting quick response on random rst files from zuul-jobs, e.g. https://gitea09.opendev.org:3000/zuul/zuul-jobs/src/branch/master/doc/source/roles.rst18:14
mnaserusually the first load is slow, then it's fine after18:14
mnaseralso i don't think it's network because i always get really high page render time in the footer18:15
fungiyeah, subsequent loads for the same url are going to come from a cache rather than from shelling out to git18:15
fungiand i agree it's probably not network-related, unless it's to do with an internal storage network18:15
mnaserare those boot from volume vms?  also i feel like it's quite unlikely we're taking 21s for io requests18:16
mnaserwhat are the load averages for iowait on those systems?18:17
fungipretty sure they're bfv, but hard to double-check at the moment. and no, load average is low but also top doesn't show a bunch of iowait either18:18
fungimajority of cpu is falling under user18:18
fungiload average on gitea13 has crept up to around 3, so it could maybe be crawlers, but i'd expect worse if so18:19
fungii'll do some user agent analysis on gitea13 just to see18:20
fungitons of hits from git/2.39.218:21
fungimaybe this is a runaway openstack deployment cloning from source on a bazillion hosts behind a nat18:21
fungi7539 requests from "git/2.39.2" persisted to gitea13 between 18:10-18:20, so ~12.5/sec18:22
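The count behind that rate can be reproduced roughly as follows, assuming the apache combined log format (user agent in the last quoted field) and the gitea-ssl-access.log path mentioned later in the discussion.

    # requests from the git/2.39.2 user agent in the 18:10-18:19 window
    grep ':18:1[0-9]:' /var/log/apache2/gitea-ssl-access.log | grep -c '"git/2.39.2"'
    # 7539 requests over 600 seconds works out to ~12.5 requests/sec
    echo 'scale=1; 7539 / 600' | bc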
fungii'll try to do some client port matching to work out the source addresses hitting haproxy for those and see if it gives me anything useful18:25
Clark[m]Yes I think they are bfv but I can't say for certain 18:27
fungilooking at the counts, i have a feeling these git/2.39.2 requests may be from a ci system. there were ~600 requests in 600 seconds for each of 1323 different repositories18:32
fungiso something hitting every one of our git repositories once a second persisted to gitea13 over that 10-minute slice of logs18:33
Clark[m]I think code search does that18:34
Clark[m]But it just checks for updates and shouldn't be a huge impact (it hasn't been in the past)18:34
Clark[m]We did deploy a new server but the container image didn't change so the version of git and hound should be the same with the same behaviors 18:35
fungii'll take a look there first18:36
fungianecdotally, that new codesearch02 server has git 2.43.0 installed, so seems unlikely18:37
Clark[m]Oh the old server is still up so it could be both of them hitting the same server at once? That seems unlikely (~1/6 chance) but maybe we stop the containers on codesearch0118:37
Clark[m]fungi: the version of git used is the one in the hound container18:38
fungigood point! git 2.39.2 in the container on codesearch0218:39
fungidocker-compose exec is taking a really long time for me on codesearch01. i've done a docker-compose down on codesearch01 just to be sure18:41
Clark[m]This seems unlikely to be the cause but if it is then it's a quick solution 18:42
fungidoing source port comparisons to map into the haproxy log, those git/2.39.2 requests do in fact seem to be originating from the ipv6 address of codesearch0218:49
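The correlation amounts to taking the source port apache recorded for a slow request on the backend and finding the matching connection in the load balancer's haproxy log; a very rough illustration, with the port value and both log locations being assumptions about this deployment.

    # hypothetical: port pulled from gitea13's apache log for one git/2.39.2 request
    port=51234
    grep -- ":${port}" /var/log/haproxy.log | tail -n 5   # shows which client haproxy was forwarding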
Clark[m]There are many issues on the Internet related to gitea slowness. Some pages (like the org dashboard) are db bound and slowness there could be db related. Seems unlikely to be the cause here on repo pages.18:52
Clark[m]May need to look at logs for the slow requests and see if there is any indication of where the time was spent 18:52
fungithough my math was off, that was 7539 (~12.5/sec total) requests over 10 minutes, but ~600 for each of 1323 repositories since the log was rotated at midnight utc18:52
clarkblooking at /var/log/containers/docker-gitea.log and /var/log/apache2/gitea-ssl-access.log there does appear to be some interesting crawling going on19:00
clarkbI'm not sure that is the problem yet though19:00
clarkbrequests against /api/v1/repositories subpaths that are unauthorized and lots of fetches for commit specific paths that you wouldn't expect normal activity to generate19:01
clarkbtheory time: all this crawling is overwhelming the cache19:02
clarkbso while it isn't directly overloading things the cache may not be large enough to cache all of the requested data and sacrifices are made (I believe this is done in an LRU manner)19:03
clarkbso we notice because the cache isn't effective19:03
clarkbgrabbing samples from /var/log/containers/docker-gitea.log that match grep 'router: slow' and then looking in apache logs for the shas in the request; so far 100% of these that I have looked up have been from amazonbot19:05
clarkbthat's my best working theory right now. The gitea cache is simply not large enough and things are being evicted but I'm not sure that is the case19:05
fungigot it, so cache proves useless in the face of constant crawling hitting random content19:11
fungiand cold cache performance in gitea is often poor19:11
fungior cache miss performance i guess19:12
clarkbya cache miss performance has always been pretty bad which is why they added the cache. I'm still trying to track down how the memory adapter works (pretty sure this is the one we are using)19:14
clarkbhttps://gitea.com/go-chi/cache/src/branch/main/memory.go this appears to be the implementation and there doesn't seem to be any maximum record count so maybe this theory is bad? It's just going to add as many things to memory as it can?19:18
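The adapter actually configured can be confirmed from gitea's app.ini instead of guessed; the compose service name and config path below are the usual ones for the gitea container image and are assumptions about this deployment.

    # show the [cache] section of the running gitea's config
    docker-compose exec gitea-web grep -A5 '^\[cache\]' /data/gitea/conf/app.ini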
clarkblooks like the facebook meta-externalagent is also generating a lot of requests that get logged as slow19:20
clarkbI'm half tempted to temporarily block these crawlers and see if that improves things then work backward from there. Bit of a 20lb sledgehammer though19:21
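If that sledgehammer is ever swung, one common approach is a user-agent deny in the apache vhost in front of gitea. The crawler names come from the log analysis above; the exact placement in the vhost configuration is an assumption.

    # hypothetical apache 2.4 snippet: refuse requests from the two crawlers named above
    SetEnvIfNoCase User-Agent "Amazonbot|meta-externalagent" badbot
    <Location "/">
        <RequireAll>
            Require all granted
            Require not env badbot
        </RequireAll>
    </Location>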
clarkbthere is mailing list info on golang maps indicating that large maps with many keys can negatively impact the garbage collector in go causing pauses19:24
clarkbso I'm adjusting my theory now to say maybe it is golang gc running against a very large in memory cache map made worse by crawlers19:24
clarkbto test ^ I think what we can do is manually stop then restart gitea on gitea13 and see if things improve in the immediate term19:25
clarkbeventually they should degrade but in the short term if they improve that would lead to some evidence my theory is on the right track19:25
clarkbI need to take a shower and catch up on some other morning things that haven't happened yet but then I can do that19:25
clarkbif anyone else beats me to it remember the startup process for gitea is a bit odd to ensure gerrit replication doesn't fail (basically do a down, then only up gitea web and mariadb, wait for gitea web to respond then up gitea ssh)19:26
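That ordered restart has roughly the following shape; the compose directory and service names here are assumptions rather than a copy of the real deployment files.

    # rough sketch of the ordered restart described above
    cd /etc/gitea-docker                       # assumed compose directory
    docker-compose down
    docker-compose up -d gitea-web mariadb     # web service and database first
    # wait for the web service before starting ssh, so gerrit replication doesn't fail
    until curl -skfo /dev/null https://localhost:3000/; do sleep 5; done
    docker-compose up -d gitea-ssh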
fungithe problem with that plan is i still don't have a consistent enough reproducer to be able to measure whether the restart improves anything19:36
fungibut maybe you have some idea what we should be measuring/counting19:37
clarkbfungi: the router: slow messages in the gitea log are the canary I think19:45
clarkbfungi: post restart we should probably expect those to become more common because nothing is cached, but then they should fall away and over time come back again (assuming my theory is correct)19:45
clarkbI'll hold off on restarting things since there is concern that may be a bit too cowboy19:46
fungimeh, "concern" is a bit strong to describe my feelings on it. besides, friday is for cowboys19:49
fungii don't object to the proposed experiment19:49
clarkbok proceeding with the experiment19:51
clarkbrestart is done and I'm running `sudo tail -f /var/log/containers/docker-gitea.log | grep 'router: slow'` to get a sense for whether or not this has helped. So far only two slow requests both for nova commits19:54
clarkbnova is our largest repo so likely where we'd see slowness?19:54
fungii'd expect so, yes19:54
clarkbrough anecdata is the incidence rate has slowed19:55
clarkbof course I don't know for sure that 'router: slow' is a good canary but it is related to long requests19:56
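One way to make the canary measurable is to bucket the slow-router lines per minute, assuming gitea's own 'YYYY/MM/DD HH:MM:SS' timestamp appears in each line of the container log.

    # count 'router: slow' log lines per minute to watch the trend after the restart
    grep 'router: slow' /var/log/containers/docker-gitea.log \
        | grep -o '2025/02/14 [0-9][0-9]:[0-9][0-9]' | sort | uniq -c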
clarkbmnaser: ^ fyi we did a thing that I've hypothesized will help. Would be curious if you notice the problem occurring persistently over the next little bit19:56
clarkband now lunch19:56
clarkbloading neutron commit pages does seem to consistently take a couple seconds but that is much quicker than the ~16s people were having with zuul-jobs previously19:57
fungialso, it would be good to know if the https cert details in your browser indicate you're being persisted to gitea13 or one of the other backends, and if so, which one19:57
clarkb++19:58
clarkbbut really now food19:58
fungiyou should totally food19:58
clarkbfungi: any objection to https://review.opendev.org/c/opendev/system-config/+/941677 to remove zuul-lb01 from inventory at this point?21:18
clarkblooks like the codesearch01 removal change failed on docker rate limits. I'll recheck that one in a bit21:18
clarkbit occurred to me that all these people writing crawlers for llms grabbing git repo contents would be better off teaching their tools to git clone21:26
clarkbeven for nova they could grab the entire commit history in just a few minutes and then crawl it from the comfort of their own datacenter21:27
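In other words, the entire history is one clone away, and everything the crawlers fetch page by page can then be walked locally, for example:

    # one clone instead of millions of web requests, then browse the history offline
    git clone https://opendev.org/openstack/nova
    cd nova
    git log --oneline | wc -l      # every commit, no web UI involved
    git show --stat HEAD~100       # the same data the per-commit pages render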
clarkbanyone know anyone at openai, anthropic, google, meta, or amazon? My rates are reasonable21:28
clarkbI pulled up the same tail command on gitea09 and the rate there definitely seems lower than on 1321:29
clarkbbut that could be luck of the draw based on load balancing. Hard to make any conclusions at this point21:38
fungiapproved the zuul-lb01 inventory cleanup21:55
opendevreviewMerged opendev/system-config master: Remove zuul-lb01 from inventory  https://review.opendev.org/c/opendev/system-config/+/94167722:15
clarkboh right that's going to run all the jobs because it edits the hosts file22:16
fungiyup22:28
fungithe whole shebang22:28
fungii suppose we could create a modular inventory in order to avoid that22:28
clarkbit would be neat if we could detect only deletions and then short circuit but then I remembered we manage firewall rule membership via inventory and groups so that won't work22:33
clarkbthe codesearch01 removal change should pass in the next few minutes22:44
clarkbhttps://review.opendev.org/c/opendev/system-config/+/941679 yup this passes now22:57
ianychoiHi, the Zuul job (openstack-tox-pep8) on the openstack/i18n repo is failing - could you help me figure out how to move forward? Links are: https://zuul.opendev.org/t/openstack/build/009def09355743528e8211913d25cd5e and https://zuul.opendev.org/t/openstack/build/f08739d7fc9442b3a769f073f95fa92123:01
Clark[m]ianychoi: the problem is you are running the linter tools (pylint/hacking/flake8) with a pinned version that is too old for python3.12. You need to update the version of those tools23:03
ianychoiClark: thank you! Then, fixing *requirements.txt in the i18n repo would solve the issues - will do it!23:05
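What such an update typically looks like: the exact pins the i18n repo needs are not in this log, so the versions below are illustrative assumptions of releases that support python3.12.

    # test-requirements.txt (illustrative versions only)
    hacking>=6.1.0    # pulls in a flake8 release new enough for python3.12
    pylint>=3.0       # 3.x adds python3.12 support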
