Clark[m] | That was fast. I'm out for snow v2 | 00:04 |
---|---|---|
fungi | enjoy! | 00:04 |
fungi | /opt/backups-202010 on backup02.ca-ymq-1.vexxhost has reached 90%, i'll get a prune going | 00:06 |
clarkb | my keys have aged out for the day. codesearch02 says hound is not ready. I expect this initial startup is going to be slow since it's cloning everything rather than just updating | 00:39 |
clarkb | can check it in the morning and debug from there if it doesn't reach a happy state | 00:39 |
clarkb | thanks again! | 00:39 |
ianw | 2025/02/14 00:48:17 All indexes built! | 00:49 |
ianw | 2025/02/14 00:48:17 running server at http://localhost:6080 | 00:49 |
ianw | (i just logged in to check just in case the "not ready" was something to do with the scripts on a totally new host :) | 00:49 |
fungi | #status log Pruned backups on backup02.ca-ymq-1.vexxhost reducing volume usage from 90% to 61% | 01:13 |
opendevstatus | fungi: finished logging | 01:13 |
cloudnull | evenings all :D | 02:13 |
*** cloudnull9 is now known as cloudnull | 04:16 | |
opendevreview | Matthieu Huin proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url https://review.opendev.org/c/zuul/zuul-jobs/+/927582 | 13:28 |
*** tkajinam is now known as Guest9129 | 13:39 | |
opendevreview | Matthieu Huin proposed zuul/zuul-jobs master: Fix the upload-logs-s3 test playbook https://review.opendev.org/c/zuul/zuul-jobs/+/927600 | 13:54 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 15:06 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove zuul-lb01 from inventory https://review.opendev.org/c/opendev/system-config/+/941677 | 15:43 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Cleanup zuul-lb01 and reset zuul ttl https://review.opendev.org/c/opendev/zone-opendev.org/+/941168 | 15:43 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Switch codesearch over to codesearch02 https://review.opendev.org/c/opendev/zone-opendev.org/+/941678 | 15:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove codesearch01 https://review.opendev.org/c/opendev/system-config/+/941679 | 15:54 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Remove codesearch01 from DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/941681 | 15:55 |
clarkb | infra-root ^ I'm fairly confident the zuul-lb changes are all ready to go now. For codesearch02 I haven't seen anything amiss on the server after a quick check and the service seems to be working for me at https://codesearch02.opendev.org so I think we can proceed there too | 15:56 |
fungi | lgtm, i've approved the codesearch dns switch and should be around to see it take effect before i need to disappear for lunch in a few minutes | 16:06 |
opendevreview | Merged opendev/zone-opendev.org master: Switch codesearch over to codesearch02 https://review.opendev.org/c/opendev/zone-opendev.org/+/941678 | 16:08 |
clarkb | thanks! | 16:09 |
fungi | deploy succeeded | 16:22 |
clarkb | the new name resolves for me now | 16:22 |
fungi | same | 16:23 |
clarkb | inspecting the cert in firefox confirms I'm talking to the new server and it continues to work | 16:23 |
fungi | working for me | 16:23 |
fungi | searches return expected content | 16:23 |
fungi | headed to lunch, bbiab | 16:30 |
fungi | okay, back | 17:29 |
mnaser | using opendev has been beyond frustrating lately :( | 17:51 |
mnaser | specifically gitea | 17:51 |
mnaser | Powered by Gitea Version: v1.23.3 Page: 21401ms Template: 84ms | 17:51 |
mnaser | I can't be the only one constantly hitting these | 17:51 |
fungi | we're probably overrun with ai scraping crawlers again, i'll take another look at the apache logs | 17:52 |
mnaser | fwiw i am seeing this while browsing zuul/zuul-jobs | 17:52 |
fungi | load average on all the gitea backends is pretty nominal at the moment, looking for other possible causes | 17:54 |
fungi | fwiw, browsing https://opendev.org/zuul/zuul-jobs from here is really snappy. clicking around different files i'm getting like "Page: 58ms Template: 4ms" | 17:55 |
fungi | looks like i'm going through zayo to reach our gitea servers in vexxhost | 17:57 |
fungi | mnaser: also a heads up, suddenly openstackclient is complaining and failing when i try to communicate with vexxhost: "https://vexxhost.com is a remote profile that could not be fetched: 502 Bad Gateway" | 17:58 |
fungi | happening from multiple locations (a vm in rackspace as well as my home workstation) | 17:59 |
fungi | testing all the gitea backends for a single file from zuul-jobs like https://gitea09.opendev.org:3000/zuul/zuul-jobs/src/branch/master/tests/ansible.cfg the slowest render i got was "Page: 154ms" from gitea13 | 18:02 |
fungi | okay, pulling https://gitea13.opendev.org:3000/zuul/zuul-jobs/ just now i got "Page: 16820ms" | 18:03 |
fungi | the rest of them were in the 50-80ms range, but also so was gitea13 when i refreshed which suggests it's something to do with cold cache hits, maybe on that specific backend | 18:04 |
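A rough sketch of that per-backend spot check, for reference; the backend list (gitea09 through gitea14) and the footer markup are assumptions, since only gitea09 and gitea13 are named in this log:

```bash
# Fetch the same file from each backend and pull out the "Page: NNNms" render
# time Gitea prints in its footer. Backend list is an assumption.
for n in 09 10 11 12 13 14; do
  url="https://gitea${n}.opendev.org:3000/zuul/zuul-jobs/src/branch/master/tests/ansible.cfg"
  # Strip HTML tags, then grab the plain-text footer timing.
  ms=$(curl -s "$url" | sed -e 's/<[^>]*>/ /g' | grep -oE 'Page: *[0-9]+ *ms' | head -1)
  echo "gitea${n}: ${ms:-no footer found}"
done
```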
Clark[m] | Yes gitea is very cache sensitive because git isn't all that fast. I suspect the markdown/rst renders may not be quick either depending on the content | 18:08 |
fungi | i picked a random old zuul-jobs commit from 7 years ago and loaded it directly from every backend, all of them came back quite quickly | 18:08 |
Clark[m] | If system load is fine (so we're not just generally slow) then cold caches could explain it. I think new commits invalidate the cache but afterwards things should generally stick around | 18:08 |
fungi | e.g. https://gitea13.opendev.org:3000/zuul/zuul-jobs/commit/af0d8eb25b63f320ed63500a51b287ed45bbbb1b was "Page: 247ms" | 18:09 |
Clark[m] | The top level page renders readme but random commits won't. Do other rendered pages (for eg docs) load slowly? | 18:12 |
Clark[m] | The top level page also loads repo stats that could be slow | 18:12 |
fungi | mnaser: we're using the local rootfs on the gitea backends, so storage i/o is going to be whatever the hypervisor hosts put the default rootfs on. i'd get you the server instance uuid for gitea13, but as noted vexxhost isn't working with openstackclient at the moment | 18:12 |
Clark[m] | Oh ya disk io could explain it especially if we have cold cache then hit an io slowdown | 18:13 |
mnaser | oh, well TIL about the remote profile issue | 18:13 |
mnaser | i need to remember to figure out what the well-known url is | 18:13 |
fungi | Clark[m]: i'm getting quick response on random rst files from zuul-jobs, e.g. https://gitea09.opendev.org:3000/zuul/zuul-jobs/src/branch/master/doc/source/roles.rst | 18:14 |
mnaser | usually the first load is slow, then it's fine after | 18:14 |
mnaser | also i don't think it's network because i always get really high page render time in the footer | 18:15 |
fungi | yeah, subsequent loads for the same url are going to come from a cache rather than from shelling out to git | 18:15 |
fungi | and i agree it's probably not network-related, unless it's to do with an internal storage network | 18:15 |
mnaser | are those boot from volume vms? also i feel like it's quite unlikely we're taking 21s for io requests | 18:16 |
mnaser | what are the load averages for iowait on those systems? | 18:17 |
fungi | pretty sure they're bfv, but hard to double-check at the moment. and no, load average is low but also top doesn't show a bunch of iowait either | 18:18 |
fungi | majority of cpu is falling under user | 18:18 |
fungi | load average on gitea13 has crept up to around 3, so it could maybe be crawlers, but i'd expect worse if so | 18:19 |
fungi | i'll do some user agent analysis on gitea13 just to see | 18:20 |
fungi | tons of hits from git/2.39.2 | 18:21 |
fungi | maybe this is a runaway openstack deployment cloning from source on a bazillion hosts behind a nat | 18:21 |
fungi | 7539 requests from "git/2.39.2" persisted to gitea13 between 18:10-18:20, so ~12.5/sec | 18:22 |
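Roughly the kind of tally being described; the log path comes up later in the discussion, while the combined-log field position is an assumption about the Apache config on the backends:

```bash
# Count requests per user agent (UA is the sixth quote-delimited field in a
# combined-format Apache log).
sudo awk -F'"' '{print $6}' /var/log/apache2/gitea-ssl-access.log \
  | sort | uniq -c | sort -rn | head

# Count just the git/2.39.2 requests in the 18:10-18:19 window discussed here.
sudo grep ':18:1[0-9]:' /var/log/apache2/gitea-ssl-access.log \
  | awk -F'"' '$6 == "git/2.39.2"' | wc -l
```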
fungi | i'll try to do some client port matching to work out the source addresses hitting haproxy for those and see if it gives me anything useful | 18:25 |
Clark[m] | Yes I think they are bfv but I can't say for certain | 18:27 |
fungi | looking at the counts, i have a feeling these git/2.39.2 requests may be from a ci system. there were ~600 requests in 600 seconds for each of 1323 different repositories | 18:32 |
fungi | so something hitting every one of our git repositories once a second persisted to gitea13 over that 10-minute slice of logs | 18:33 |
Clark[m] | I think code search does that | 18:34 |
Clark[m] | But it just checks for updates and shouldn't be a huge impact (it hasn't been in the past) | 18:34 |
Clark[m] | We did deploy a new server but the container image didn't change so the version of git and hound should be the same with the same behaviors | 18:35 |
fungi | i'll take a look there first | 18:36 |
fungi | anecdotally, that new codesearch02 server has git 2.43.0 installed, so seems unlikely | 18:37 |
Clark[m] | Oh the old server is still up so it could be both of them hitting the same server at once? That seems unlikely (~1/6 chance) but maybe we should stop the containers on codesearch01 | 18:37 |
Clark[m] | fungi: the version of git used is the one in the hound container | 18:38 |
fungi | good point! git 2.39.2 in the container on codesearch02 | 18:39 |
fungi | docker-compose exec is taking a really long time for me on codesearch01. i've done a docker-compose down on codesearch01 just to be sure | 18:41 |
Clark[m] | This seems unlikely to be the cause but if it is then it's a quick solution | 18:42 |
fungi | doing source port comparisons to map into the haproxy log, those git/2.39.2 requests do in fact seem to be originating from the ipv6 address of codesearch02 | 18:49 |
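A very rough sketch of that correlation; it assumes both the backend Apache log and the haproxy log on the load balancer record the client port for the proxied connection, which the exchange above implies, but the port value, log path, and format here are placeholders:

```bash
# Hypothetical example: take one source port observed for a git/2.39.2 request
# on the backend and look for the matching session on the load balancer.
port=54321   # placeholder, not a real value from this incident
sudo grep -E "[:.]${port}([^0-9]|$)" /var/log/haproxy.log | tail -5
```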
Clark[m] | There are many issues on the Internet related to gitea slowness. Some pages (like the org dashboard) are db bound and slowness there could be db related. Seems unlikely to be the cause here on repo pages. | 18:52 |
Clark[m] | May need to look at logs for the slow requests and see if there is any indication of where the time was spent | 18:52 |
fungi | though my math was off, that was 7539 (~12.5/sec total) requests over 10 minutes, but ~600 for each of 1323 repositories since the log was rotated at midnight utc | 18:52 |
clarkb | looking at /var/log/containers/docker-gitea.log and /var/log/apache2/gitea-ssl-access.log there does appear to be some interesting crawling going on | 19:00 |
clarkb | I'm not sure that is the problem yet though | 19:00 |
clarkb | requests against /api/v1/repositories subpaths that are unauthorized and lots of fetches for commit specific paths that you wouldn't expect normal activity to generate | 19:01 |
clarkb | theory time: all this crawling is overwhelming the cache | 19:02 |
clarkb | so while it isn't directly overloading things the cache may not be large enough to cache all of the requested data and sacrifices are made (I believe this is done in an LRU manner) | 19:03 |
clarkb | so we notice because the cache isn't effective | 19:03 |
clarkb | grabbing samples from /var/log/containers/docker-gitea.log that match 'router: slow' and then looking in apache logs for the shas in the request, so far 100% of these that I have looked up have been from amazonbot | 19:05 |
clarkb | that's my best working theory right now. The gitea cache is simply not large enough and things are being evicted, but I'm not sure that is the case | 19:05 |
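Roughly what that lookup looks like as a pipeline; it assumes the slow entries are commit URLs, as observed here, and the path pattern is a sketch rather than the exact log layout:

```bash
# Pull recent slow-request commit paths from the gitea container log, then list
# the user agents that requested the same sha in the Apache access log.
sudo grep 'router: slow' /var/log/containers/docker-gitea.log | tail -5 \
  | grep -oE '/[A-Za-z0-9._-]+/[A-Za-z0-9._-]+/commit/[0-9a-f]{40}' \
  | while read -r path; do
      sha=${path##*/}
      echo "=== ${path}"
      sudo grep "$sha" /var/log/apache2/gitea-ssl-access.log \
        | awk -F'"' '{print $6}' | sort | uniq -c
    done
```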
fungi | got it, so cache proves useless in the face of constant crawling hitting random content | 19:11 |
fungi | and cold cache performance in gitea is often poor | 19:11 |
fungi | or cache miss performance i guess | 19:12 |
clarkb | ya cache miss performance has always been pretty bad which is why they added the cache. I'm still trying to track down how the memory adapter works (pretty sure this is the one we are using) | 19:14 |
clarkb | https://gitea.com/go-chi/cache/src/branch/main/memory.go this appears to be the implementation and there doesn't seem to be any maximum record count, so maybe this theory is bad? It's just going to add as many things to memory as it can? | 19:18 |
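If it helps to confirm which cache adapter the deployment is actually configured with, something along these lines should work; the compose service name and the app.ini path (the upstream container default) are assumptions about this deployment:

```bash
# Run from the compose directory on a backend; "gitea-web" and the config path
# are assumptions, not verified here.
sudo docker-compose exec gitea-web grep -A4 '^\[cache\]' /data/gitea/conf/app.ini
```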
clarkb | looks like the facebook meta-externalagent is also generating a lot of requests that get logged as slow | 19:20 |
clarkb | I'm half tempted to temporarily block these crawlers and see if that improves things then work backward from there. Bit of a 20lb sledgehammer though | 19:21 |
clarkb | there is mailing list info on golang maps indicating that large maps with many keys can negatively impact the garbage collector in go causing pauses | 19:24 |
clarkb | so I'm adjusting my theory now to say maybe it is golang gc running against a very large in memory cache map made worse by crawlers | 19:24 |
clarkb | to test ^ I think what we can do is manually stop then restart gitea on gitea13 and see if things improve in the immediate term | 19:25 |
clarkb | eventually they should degrade but in the short term if they improve that would lead to some evidence my theory is on the right track | 19:25 |
clarkb | I need to take a shower and catch up on some other morning things that haven't happened yet but then I can do that | 19:25 |
clarkb | if anyone else beats me to it remember the startup process for gitea is a bit odd to ensure gerrit replication doesn't fail (basically do a down, then only up gitea web and mariadb, wait for gitea web to respond, then up gitea ssh) | 19:26 |
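A hedged sketch of that ordering; the compose directory, compose service names, and local port are assumptions based on the description above:

```bash
# Restart sequence sketch: bring everything down, start the database and web
# service, wait until the web service answers, then start the ssh service so
# Gerrit replication never talks to a half-started Gitea.
cd /etc/gitea-docker            # hypothetical compose directory
sudo docker-compose down
sudo docker-compose up -d mariadb gitea-web
until curl -skfo /dev/null https://localhost:3000/; do sleep 5; done
sudo docker-compose up -d gitea-ssh
```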
fungi | the problem with that plan is i still don't have a consistent enough reproducer to be able to measure whether the restart improves anything | 19:36 |
fungi | but maybe you have some idea what we should be measuring/counting | 19:37 |
clarkb | fungi: the router: slow messages in the gitea log are the canary I think | 19:45 |
clarkb | fungi: post restart we should probably expect those to become more common because nothing is cached, but then they should fall away and over time come back again (assuming my theory is correct) | 19:45 |
clarkb | I'll hold off on restarting things since there is concern that may be a bit too cowboy | 19:46 |
fungi | meh, "concern" is a bit strong to describe my feelings on it. besides, friday is for cowboys | 19:49 |
fungi | i don't object to the proposed experiment | 19:49 |
clarkb | ok proceeding with the experiment | 19:51 |
clarkb | restart is done and I'm running `sudo tail -f /var/log/containers/docker-gitea.log | grep 'router: slow'` to get a sense for whether or not this has helped. So far only two slow requests, both for nova commits | 19:54 |
clarkb | nova is our largest repo so likely where we'd see slowness? | 19:54 |
fungi | i'd expect so, yes | 19:54 |
clarkb | rough anecdata is the incidence rate has slowed | 19:55 |
clarkb | of course I don't know for sure that 'router: slow' is a good canary but it is related to long requests | 19:56 |
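One way to make that canary a bit more quantitative; the timestamp format inside the gitea log lines is an assumption:

```bash
# Count 'router: slow' entries per minute so the before/after restart trend is
# visible at a glance.
sudo grep 'router: slow' /var/log/containers/docker-gitea.log \
  | grep -oE '[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}' \
  | sort | uniq -c | tail -20
```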
clarkb | mnaser: ^ fyi we did a thing that I've hypothesized will help. Would be curious if you notice the problem occurring persistently over the next little bit | 19:56 |
clarkb | and now lunch | 19:56 |
clarkb | loading neutron commit pages does seem to consistently take a couple seconds but that is much quicker than the ~16s people were seeing with zuul-jobs previously | 19:57 |
fungi | also, it would be good to know if the https cert details in your browser indicate you're being persisted to gitea13 or one of the other backends, and if so, which one | 19:57 |
clarkb | ++ | 19:58 |
clarkb | but really now food | 19:58 |
fungi | you should totally food | 19:58 |
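From a shell, the same check looks roughly like this; it assumes, as the discussion implies, that each backend serves a certificate carrying its own hostname as an extra SAN alongside opendev.org:

```bash
# Show which backend this balanced connection landed on by printing the SANs
# of the certificate it serves (-ext needs a reasonably recent openssl).
echo | openssl s_client -connect opendev.org:443 -servername opendev.org 2>/dev/null \
  | openssl x509 -noout -subject -ext subjectAltName
```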
clarkb | fungi: any objection to https://review.opendev.org/c/opendev/system-config/+/941677 to remove zuul-lb01 from inventory at this point? | 21:18 |
clarkb | looks like the codesearch01 removal change failed on docker rate limits. I'll recheck that one in a bit | 21:18 |
clarkb | it occurred to me that all these people writing crawlers for llms grabbing git repo contents would be better off teaching their tools to git clone | 21:26 |
clarkb | even for nova they could grab the entire commit history in just a few minutes and then crawl it from the comfort of their own datacenter | 21:27 |
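Concretely, the clone-then-crawl-locally approach amounts to the following; the repository URL is real, the local analysis step is only illustrative:

```bash
# One clone transfers the full history; everything after that is local and
# costs the server nothing.
git clone https://opendev.org/openstack/nova
cd nova
git log --oneline | wc -l    # walk every commit without another request
```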
clarkb | anyone know anyone at openai, anthropic, google, meta, or amazon? My rates are reasonable | 21:28 |
clarkb | I pulled up the same tail command on gitea09 and the rate there definitely seems lower than on 13 | 21:29 |
clarkb | but that could be luck of the draw based on load balancing. Hard to make any conclusions at this point | 21:38 |
fungi | approved the zuul-lb01 inventory cleanup | 21:55 |
opendevreview | Merged opendev/system-config master: Remove zuul-lb01 from inventory https://review.opendev.org/c/opendev/system-config/+/941677 | 22:15 |
clarkb | oh right that's going to run all the jobs because it edits the hosts file | 22:16 |
fungi | yup | 22:28 |
fungi | the whole shebang | 22:28 |
fungi | i suppose we could create a modular inventory in order to avoid that | 22:28 |
clarkb | it would be neat if we could detect only deletions and then short circuit but then I remembered we manage firewall rule membership via inventory and groups so that won't work | 22:33 |
clarkb | the codesearch01 removal change should pass in the next few minutes | 22:44 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/941679 yup this passes now | 22:57 |
ianychoi | Hi, the Zuul job (openstack-tox-pep8) on the openstack/i18n repo is failing - would you help me figure out how I can move forward? Links are: https://zuul.opendev.org/t/openstack/build/009def09355743528e8211913d25cd5e and https://zuul.opendev.org/t/openstack/build/f08739d7fc9442b3a769f073f95fa921 | 23:01 |
Clark[m] | ianychoi: the problem is you are running the linter tools (pylint/hacking/flake8) with pinned versions that are too old for python3.12. You need to update the versions of those tools | 23:03 |
ianychoi | Clark: thank you! Then, fixing *requirements.txt in the i18n repo would solve the issues - will do it! | 23:05 |
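For reference, a sketch of that fix; the file name and version numbers below are placeholders, not the actual pins in openstack/i18n:

```bash
# See what is pinned today, then raise the pins to linter releases that
# support Python 3.12.
grep -E '^(flake8|pylint|hacking)' test-requirements.txt
# Replace old exact or upper-bound pins with newer minimums, for example:
#   hacking>=6.1.0
#   flake8>=6.1.0
#   pylint>=3.0.0
```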