Clark[m] | That was fast. I'm out for snow v2 | 00:04 |
---|---|---|
fungi | enjoy! | 00:04 |
fungi | /opt/backups-202010 on backup02.ca-ymq-1.vexxhost has reached 90%, i'll get a prune going | 00:06 |
clarkb | my keys have aged out for the day. codesearch02 says hound is not ready. I expect this initial startup is going to be slow since it's cloning everything rather than just updating | 00:39 |
clarkb | can check it in the morning and debug from there if it doesn't reach a happy state | 00:39 |
clarkb | thanks again! | 00:39 |
ianw | 2025/02/14 00:48:17 All indexes built! | 00:49 |
ianw | 2025/02/14 00:48:17 running server at http://localhost:6080 | 00:49 |
ianw | (i just logged in to check just in case the "not ready" was something to do with the scripts on a totally new host :) | 00:49 |
fungi | #status log Pruned backups on backup02.ca-ymq-1.vexxhost reducing volume usage from 90% to 61% | 01:13 |
opendevstatus | fungi: finished logging | 01:13 |
cloudnull | evenings all :D | 02:13 |
*** cloudnull9 is now known as cloudnull | 04:16 | |
opendevreview | Matthieu Huin proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url https://review.opendev.org/c/zuul/zuul-jobs/+/927582 | 13:28 |
*** tkajinam is now known as Guest9129 | 13:39 | |
opendevreview | Matthieu Huin proposed zuul/zuul-jobs master: Fix the upload-logs-s3 test playbook https://review.opendev.org/c/zuul/zuul-jobs/+/927600 | 13:54 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 15:06 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove zuul-lb01 from inventory https://review.opendev.org/c/opendev/system-config/+/941677 | 15:43 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Cleanup zuul-lb01 and reset zuul ttl https://review.opendev.org/c/opendev/zone-opendev.org/+/941168 | 15:43 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Switch codesearch over to codesearch02 https://review.opendev.org/c/opendev/zone-opendev.org/+/941678 | 15:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove codesearch01 https://review.opendev.org/c/opendev/system-config/+/941679 | 15:54 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Remove codesearch01 from DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/941681 | 15:55 |
clarkb | infra-root ^ I'm fairly confident the zuul-lb changes are all ready to go now. For codesearch02 I haven't seen anything amiss on the server after a quick check and the service seems to be working for me at https://codesearch02.opendev.org so I think we can proceed there too | 15:56 |
fungi | lgtm, i've approved the codesearch dns switch and should be around to see it take effect before i need to disappear for lunch in a few minutes | 16:06 |
opendevreview | Merged opendev/zone-opendev.org master: Switch codesearch over to codesearch02 https://review.opendev.org/c/opendev/zone-opendev.org/+/941678 | 16:08 |
clarkb | thanks! | 16:09 |
fungi | deploy succeeded | 16:22 |
clarkb | the new name resolves for me now | 16:22 |
fungi | same | 16:23 |
clarkb | inspecting the cert in firefox confirms I'm talking to the new server and it continues to work | 16:23 |
fungi | working for me | 16:23 |
fungi | searches return expected content | 16:23 |
fungi | headed to lunch, bbiab | 16:30 |
fungi | okay, back | 17:29 |
mnaser | using opendev has been beyond frustrating lately :( | 17:51 |
mnaser | specifically gitea | 17:51 |
mnaser | Powered by Gitea Version: v1.23.3 Page: 21401ms Template: 84ms | 17:51 |
mnaser | I can't be the only one constantly hitting these | 17:51 |
fungi | we're probably overrun with ai scraping crawlers again, i'll take another look at the apache logs | 17:52 |
mnaser | fwiw i am seeing this while browsing zuul/zuul-jobs | 17:52 |
fungi | load average on all the gitea backends is pretty nominal at the moment, looking for other possible causes | 17:54 |
fungi | fwiw, browsing https://opendev.org/zuul/zuul-jobs from here is really snappy. clicking around different files i'm getting like "Page: 58ms Template: 4ms" | 17:55 |
fungi | looks like i'm going through zayo to reach our gitea servers in vexxhost | 17:57 |
fungi | mnaser: also a heads up, suddenly openstackclient is complaining and failing when i try to communicate with vexxhost: "https://vexxhost.com is a remote profile that could not be fetched: 502 Bad Gateway" | 17:58 |
fungi | happening from multiple locations (a vm in rackspace as well as my home workstation) | 17:59 |
fungi | testing all the gitea backends for a single file from zuul-jobs like https://gitea09.opendev.org:3000/zuul/zuul-jobs/src/branch/master/tests/ansible.cfg the slowest render i got was "Page: 154ms" from gitea13 | 18:02 |
fungi | okay, pulling https://gitea13.opendev.org:3000/zuul/zuul-jobs/ just now i got "Page: 16820ms" | 18:03 |
fungi | the rest of them were in the 50-80ms range, but also so was gitea13 when i refreshed which suggests it's something to do with cold cache hits, maybe on that specific backend | 18:04 |
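A rough sketch of that per-backend spot check, for reference; the backend list (gitea09 through gitea14) and the footer markup are assumptions, since only gitea09 and gitea13 are named in this log:

```bash
# Fetch the same file from each backend and pull out the "Page: NNNms" render
# time Gitea prints in its footer. Backend list is an assumption.
for n in 09 10 11 12 13 14; do
  url="https://gitea${n}.opendev.org:3000/zuul/zuul-jobs/src/branch/master/tests/ansible.cfg"
  # Strip HTML tags, then grab the plain-text footer timing.
  ms=$(curl -s "$url" | sed -e 's/<[^>]*>/ /g' | grep -oE 'Page: *[0-9]+ *ms' | head -1)
  echo "gitea${n}: ${ms:-no footer found}"
done
```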
Clark[m] | Yes gitea is very cache sensitive because git isn't all that fast. I suspect the markdown/rst renders may not be quick either depending on the content | 18:08 |
fungi | i picked a random old zuul-jobs commit from 7 years ago and loaded it directly from every backend, all of them came back quite quickly | 18:08 |
Clark[m] | If system load is fine (so we're not just generally slow) then cold caches could explain it. I think new commits invalidate the cache but afterwards things should generally stick around | 18:08 |
fungi | e.g. https://gitea13.opendev.org:3000/zuul/zuul-jobs/commit/af0d8eb25b63f320ed63500a51b287ed45bbbb1b was "Page: 247ms" | 18:09 |
Clark[m] | The top level page renders readme but random commits won't. Do other rendered pages (for eg docs) load slowly? | 18:12 |
Clark[m] | The top level page also loads repo stats that could be slow | 18:12 |
fungi | mnaser: we're using the local rootfs on the gitea backends, so storage i/o is going to be whatever the hypervisor hosts put the default rootfs on. i'd get you the server instance uuid for gitea13, but as noted vexxhost isn't working with openstackclient at the moment | 18:12 |
Clark[m] | Oh ya disk io could explain it especially if we have cold cache then hit an io slowdown | 18:13 |
mnaser | oh, well TIL about the remote profile issue | 18:13 |
mnaser | i need to remember to figure out what the well-known url is | 18:13 |
fungi | Clark[m]: i'm getting quick response on random rst files from zuul-jobs, e.g. https://gitea09.opendev.org:3000/zuul/zuul-jobs/src/branch/master/doc/source/roles.rst | 18:14 |
mnaser | usually the first load is slow, then it's fine after | 18:14 |
mnaser | also i don't think it's network because i always get really high page render time in the footer | 18:15 |
fungi | yeah, subsequent loads for the same url are going to come from a cache rather than from shelling out to git | 18:15 |
fungi | and i agree it's probably not network-related, unless it's to do with an internal storage network | 18:15 |
mnaser | are those boot from volume vms? also i feel like it's quite unlikely we're taking 21s for io requests | 18:16 |
mnaser | what are the load averages for iowait on those systems? | 18:17 |
fungi | pretty sure they're bfv, but hard to double-check at the moment. and no, load average is low but also top doesn't show a bunch of iowait either | 18:18 |
fungi | majority of cpu is falling under user | 18:18 |
fungi | load average on gitea13 has crept up to around 3, so it could maybe be crawlers, but i'd expect worse if so | 18:19 |
fungi | i'll do some user agent analysis on gitea13 just to see | 18:20 |
fungi | tons of hits from git/2.39.2 | 18:21 |
fungi | maybe this is a runaway openstack deployment cloning from source on a bazillion hosts behind a nat | 18:21 |
fungi | 7539 requests from "git/2.39.2" persisted to gitea13 between 18:10-18:20, so ~12.5/sec | 18:22 |
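Roughly the kind of tally being described; the log path comes up later in the discussion, while the combined-log field position is an assumption about the Apache config on the backends:

```bash
# Count requests per user agent (UA is the sixth quote-delimited field in a
# combined-format Apache log).
sudo awk -F'"' '{print $6}' /var/log/apache2/gitea-ssl-access.log \
  | sort | uniq -c | sort -rn | head

# Count just the git/2.39.2 requests in the 18:10-18:19 window discussed here.
sudo grep ':18:1[0-9]:' /var/log/apache2/gitea-ssl-access.log \
  | awk -F'"' '$6 == "git/2.39.2"' | wc -l
```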
fungi | i'll try to do some client port matching to work out the source addresses hitting haproxy for those and see if it gives me anything useful | 18:25 |
Clark[m] | Yes I think they are bfv but I can't say for certain | 18:27 |
fungi | looking at the counts, i have a feeling these git/2.39.2 requests may be from a ci system. there were ~600 requests in 600 seconds for each of 1323 different repositories | 18:32 |
fungi | so something hitting every one of our git repositories once a second persisted to gitea13 over that 10-minute slice of logs | 18:33 |
Clark[m] | I think code search does that | 18:34 |
Clark[m] | But it just checks for updates and shouldn't be a huge impact (it hasn't been in the past) | 18:34 |
Clark[m] | We did deploy a new server but the container image didn't change so the version of git and hound should be the same with the same behaviors | 18:35 |
fungi | i'll take a look there first | 18:36 |
fungi | anecdotally, that new codesearch02 server has git 2.43.0 installed, so seems unlikely | 18:37 |
Clark[m] | Oh the old server is still up so it could be both of them hitting the same server at once? That seems unlikely (~1/6 chance) but maybe we should stop the containers on codesearch01 | 18:37 |
Clark[m] | fungi: the version of git used is the one in the hound container | 18:38 |
fungi | good point! git 2.39.2 in the container on codesearch02 | 18:39 |
fungi | docker-compose exec is taking a really long time for me on codesearch01. i've done a docker-compose down on codesearch01 just to be sure | 18:41 |
Clark[m] | This seems unlikely to be the cause but if it is then it's a quick solution | 18:42 |
fungi | doing source port comparisons to map into the haproxy log, those git/2.39.2 requests do in fact seem to be originating from the ipv6 address of codesearch02 | 18:49 |
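A very rough sketch of that correlation; it assumes both the backend Apache log and the haproxy log on the load balancer record the client port for the proxied connection, which the exchange above implies, but the port value, log path, and format here are placeholders:

```bash
# Hypothetical example: take one source port observed for a git/2.39.2 request
# on the backend and look for the matching session on the load balancer.
port=54321   # placeholder, not a real value from this incident
sudo grep -E "[:.]${port}([^0-9]|$)" /var/log/haproxy.log | tail -5
```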
Clark[m] | There are many issues on the Internet related to gitea slowness. Some pages (like the org dashboard) are db bound and slowness there could be db related. Seems unlikely to be the cause here on repo pages. | 18:52 |
Clark[m] | May need to look at logs for the slow requests and see if there is any indication of where the time was spent | 18:52 |
fungi | though my math was off, that was 7539 (~12.5/sec total) requests over 10 minutes, but ~600 for each of 1323 repositories since the log was rotated at midnight utc | 18:52 |
clarkb | looking at /var/log/containers/docker-gitea.log and /var/log/apache2/gitea-ssl-access.log there does appear to be some interesting crawling going on | 19:00 |
clarkb | I'm not sure that is the problem yet though | 19:00 |
clarkb | requests against /api/v1/repositories subpaths that are unauthorized and lots of fetches for commit specific paths that you wouldn't expect normal activity to generate | 19:01 |
clarkb | theory time: all this crawling is overwhelming the cache | 19:02 |
clarkb | so while it isn't directly overloading things the cache may not be large enough to cache all of the requested data and sacrifices are made (I believe this is done in an LRU manner) | 19:03 |
clarkb | so we notice because the cache isn't effective | 19:03 |
clarkb | grabbing samples from /var/log/containers/docker-gitea.log that match 'router: slow' and then looking in apache logs for the shas in the request, so far 100% of these that I have looked up have been from amazonbot | 19:05 |
clarkb | that's my best working theory right now. The gitea cache is simply not large enough and things are being evicted, but I'm not sure that is the case | 19:05 |
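Roughly what that lookup looks like as a pipeline; it assumes the slow entries are commit URLs, as observed here, and the path pattern is a sketch rather than the exact log layout:

```bash
# Pull recent slow-request commit paths from the gitea container log, then list
# the user agents that requested the same sha in the Apache access log.
sudo grep 'router: slow' /var/log/containers/docker-gitea.log | tail -5 \
  | grep -oE '/[A-Za-z0-9._-]+/[A-Za-z0-9._-]+/commit/[0-9a-f]{40}' \
  | while read -r path; do
      sha=${path##*/}
      echo "=== ${path}"
      sudo grep "$sha" /var/log/apache2/gitea-ssl-access.log \
        | awk -F'"' '{print $6}' | sort | uniq -c
    done
```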
fungi | got it, so cache proves useless in the face of constant crawling hitting random content | 19:11 |
fungi | and cold cache performance in gitea is often poor | 19:11 |
fungi | or cache miss performance i guess | 19:12 |
clarkb | ya cache miss performance has always been pretty bad which is why they added the cache. I'm still trying to track down how the memory adapter works (pretty sure this is the one we are using) | 19:14 |
clarkb | https://gitea.com/go-chi/cache/src/branch/main/memory.go this appears to be the implementation and there doesn't seem to be any maximum record count, so maybe this theory is bad? It's just going to add as many things to memory as it can? | 19:18 |
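If it helps to confirm which cache adapter the deployment is actually configured with, something along these lines should work; the compose service name and the app.ini path (the upstream container default) are assumptions about this deployment:

```bash
# Run from the compose directory on a backend; "gitea-web" and the config path
# are assumptions, not verified here.
sudo docker-compose exec gitea-web grep -A4 '^\[cache\]' /data/gitea/conf/app.ini
```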
clarkb | looks like the facebook meta-externalagent is also generating a lot of requests that get logged as slow | 19:20 |
clarkb | I'm half tempted to temporarily block these crawlers and see if that improves things then work backward from there. Bit of a 20lb sledgehammer though | 19:21 |
clarkb | there is mailing list info on golang maps indicating that large maps with many keys can negatively impact the garbage collector in go causing pauses | 19:24 |
clarkb | so I'm adjusting my theory now to say maybe it is golang gc running against a very large in memory cache map made worse by crawlers | 19:24 |
clarkb | to test ^ I think what we can do is manually stop then restart gitea on gitea13 and see if things improve in the immediate term | 19:25 |
clarkb | eventually they should degrade but in the short term if they improve that would lead to some evidence my theory is on the right track | 19:25 |
clarkb | I need to take a shower and catch up on some other morning things that haven't happened yet but then I can do that | 19:25 |
clarkb | if anyone else beats me to it remember the startup process for gitea is a bit odd to ensure gerrit replication doesn't fail (basically do a down, then only up gitea web and mariadb, wait for gitea web to respond, then up gitea ssh) | 19:26 |
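A hedged sketch of that ordering; the compose directory, compose service names, and local port are assumptions based on the description above:

```bash
# Restart sequence sketch: bring everything down, start the database and web
# service, wait until the web service answers, then start the ssh service so
# Gerrit replication never talks to a half-started Gitea.
cd /etc/gitea-docker            # hypothetical compose directory
sudo docker-compose down
sudo docker-compose up -d mariadb gitea-web
until curl -skfo /dev/null https://localhost:3000/; do sleep 5; done
sudo docker-compose up -d gitea-ssh
```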
fungi | the problem with that plan is i still don't have a consistent enough reproducer to be able to measure whether the restart improves anything | 19:36 |
fungi | but maybe you have some idea what we should be measuring/counting | 19:37 |
clarkb | fungi: the router: slow messages in the gitea log are the canary I think | 19:45 |
clarkb | fungi: post restart we should probably expect those to become more common because nothing is cached, but then they should fall away and over time come back again (assuming my theory is correct) | 19:45 |
clarkb | I'll hold off on restarting things since there is concern that may be a bit too cowboy | 19:46 |
fungi | meh, "concern" is a bit strong to describe my feelings on it. besides, friday is for cowboys | 19:49 |
fungi | i don't object to the proposed experiment | 19:49 |
clarkb | ok proceeding with the experiment | 19:51 |
clarkb | restart is done and I'm running `sudo tail -f /var/log/containers/docker-gitea.log | grep 'router: slow'` to get a sense for whether or not this has helped. So far only two slow requests, both for nova commits | 19:54 |
clarkb | nova is our largest repo so likely where we'd see slowness? | 19:54 |
fungi | i'd expect so, yes | 19:54 |
clarkb | rough anecdata is the incidence rate has slowed | 19:55 |
clarkb | of course I don't know for sure that 'router: slow' is a good canary but it is related to long requests | 19:56 |
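One way to make that canary a bit more quantitative; the timestamp format inside the gitea log lines is an assumption:

```bash
# Count 'router: slow' entries per minute so the before/after restart trend is
# visible at a glance.
sudo grep 'router: slow' /var/log/containers/docker-gitea.log \
  | grep -oE '[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}' \
  | sort | uniq -c | tail -20
```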
clarkb | mnaser: ^ fyi we did a thing that I've hypothesized will help. Would be curious if you notice the problem occurring persistently over the next little bit | 19:56 |
clarkb | and now lunch | 19:56 |
clarkb | loading neutron commit pages does seem to consistently take a couple seconds but that is much quicker than the ~16s people were seeing with zuul-jobs previously | 19:57 |
fungi | also, it would be good to know if the https cert details in your browser indicate you're being persisted to gitea13 or one of the other backends, and if so, which one | 19:57 |
clarkb | ++ | 19:58 |
clarkb | but really now food | 19:58 |
fungi | you should totally food | 19:58 |
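From a shell, the same check looks roughly like this; it assumes, as the discussion implies, that each backend serves a certificate carrying its own hostname as an extra SAN alongside opendev.org:

```bash
# Show which backend this balanced connection landed on by printing the SANs
# of the certificate it serves (-ext needs a reasonably recent openssl).
echo | openssl s_client -connect opendev.org:443 -servername opendev.org 2>/dev/null \
  | openssl x509 -noout -subject -ext subjectAltName
```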
clarkb | fungi: any objection to https://review.opendev.org/c/opendev/system-config/+/941677 to remove zuul-lb01 from inventory at this point? | 21:18 |
clarkb | looks like the codesearch01 removal change failed on docker rate limits. I'll recheck that one in a bit | 21:18 |
clarkb | it occurred to me that all these people writing crawlers for llms grabbing git repo contents would be better off teaching their tools to git clone | 21:26 |
clarkb | even for nova they could grab the entire commit history in just a few minutes and then crawl it from the comfort of their own datacenter | 21:27 |
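Concretely, the clone-then-crawl-locally approach amounts to the following; the repository URL is real, the local analysis step is only illustrative:

```bash
# One clone transfers the full history; everything after that is local and
# costs the server nothing.
git clone https://opendev.org/openstack/nova
cd nova
git log --oneline | wc -l    # walk every commit without another request
```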
clarkb | anyone know anyone at openai, anthropic, google, meta, or amazon? My rates are reasonable | 21:28 |
clarkb | I pulled up the same tail command on gitea09 and the rate there definitely seems lower than on 13 | 21:29 |
clarkb | but that could be luck of the draw based on load balancing. Hard to make any conclusions at this point | 21:38 |
fungi | approved the zuul-lb01 inventory cleanup | 21:55 |
opendevreview | Merged opendev/system-config master: Remove zuul-lb01 from inventory https://review.opendev.org/c/opendev/system-config/+/941677 | 22:15 |
clarkb | oh right that's going to run all the jobs because it edits the hosts file | 22:16 |
fungi | yup | 22:28 |
fungi | the whole shebang | 22:28 |
fungi | i suppose we could create a modular inventory in order to avoid that | 22:28 |
clarkb | it would be neat if we could detect only deletions and then short circuit but then I remembered we manage firewall rule membership via inventory and groups so that won't work | 22:33 |
clarkb | the codesearch01 removal change should pass in the next few minutes | 22:44 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/941679 yup this passes now | 22:57 |
ianychoi | Hi, the Zuul job (openstack-tox-pep8) on the openstack/i18n repo is failing - would you help me figure out how I can move forward? Links are: https://zuul.opendev.org/t/openstack/build/009def09355743528e8211913d25cd5e and https://zuul.opendev.org/t/openstack/build/f08739d7fc9442b3a769f073f95fa921 | 23:01 |
Clark[m] | ianychoi: the problem is you are running the linter tools (pylint/hacking/flake8) with pinned versions that are too old for python3.12. You need to update the versions of those tools | 23:03 |
ianychoi | Clark: thank you! Then, fixing *requirements.txt in the i18n repo would solve the issues - will do it! | 23:05 |
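For reference, a sketch of that fix; the file name and version numbers below are placeholders, not the actual pins in openstack/i18n:

```bash
# See what is pinned today, then raise the pins to linter releases that
# support Python 3.12.
grep -E '^(flake8|pylint|hacking)' test-requirements.txt
# Replace old exact or upper-bound pins with newer minimums, for example:
#   hacking>=6.1.0
#   flake8>=6.1.0
#   pylint>=3.0.0
```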