| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:11 |
|---|---|---|
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:12 |
| *** mrunge_ is now known as mrunge | 07:14 | |
| *** ralonsoh_ is now known as ralonsoh | 07:58 | |
| opendevreview | Jakob Unterwurzacher proposed opendev/lodgeit master: Add docker-compose.yml https://review.opendev.org/c/opendev/lodgeit/+/971311 | 10:18 |
| opendevreview | Jakob Unterwurzacher proposed opendev/lodgeit master: Add docker-compose.yml https://review.opendev.org/c/opendev/lodgeit/+/971311 | 10:23 |
| noonedeadpunk | gitea feels really bad today :( | 13:14 |
| sercan | hi what happen to website is it down | 13:18 |
| noonedeadpunk | and it looks like most if not all workers are affected | 13:18 |
| *** ykarel__ is now known as ykarel | 13:24 | |
| fungi | which website? | 14:35 |
| fungi | sercan: noonedeadpunk: ^ we have a lot of web sites, a url would help so i can confirm whether it's still a problem | 14:36 |
| noonedeadpunk | fungi: opendev.org was quite unresponsive today | 14:48 |
| noonedeadpunk | some backends were just timing out, some taking really long to load | 14:48 |
| noonedeadpunk | I was thrown between backends by the balancer | 14:49 |
| fungi | noted, must have been overrun by something. i'll see if i can spot any obvious pattern in the logs | 14:50 |
| noonedeadpunk | was thrown between gitea11,12,13 with little to no performance benefit. | 14:52 |
| noonedeadpunk | the ones that were timing out I could not spot, as they were failing to establish tls | 14:53 |
| noonedeadpunk | (at least for me) | 14:53 |
| noonedeadpunk | but yeah, looks better now | 14:53 |
| fungi | well, if we get overrun by high-cost requests then when they cause a backend to stop responding it's taken out of the pool by the load balancer and those same problem requests just get distributed to another backend which gets similarly knocked offline | 14:54 |
| mhu | hey there, we have some CI that depends on repos hosted on opendev, I've started to see a few SSL errors when attempting to fetch some repos with git, here are some examples:... (full message at <https://matrix.org/oftc/media/v1/media/download/AXZdw1sWVDeWjMskZsPJ3Kl_gAA8JNu1HiyZyFl6dNlEtGFVE92eNhLT8GXAFTnCvK-ZnydI_hJd4H2NEohoWzVCebdPZVwgAG1hdHJpeC5vcmcveXpqaVRUbmNsVUp6dXZhVWZtaFF1QW1B>) | 15:10 |
| fungi | mhu: looks like your message was too long for irc and got truncated, but we did get reports around 13:15 utc that our git servers were being knocked offline for a little while | 15:17 |
| fungi | i expect we're going to need to do another pass over the current filters we use to try to block llm training crawlers | 15:17 |
| mhu | let me slice it up | 15:18 |
| mhu | hey there, we have some CI that depends on repos hosted on opendev, I've started to see a few SSL errors when attempting to fetch some repos with git, here are some examples: | 15:20 |
| mhu | Cmd('git') failed due to: exit code(128)\n cmdline: git fetch -f --tags --prune --prune-tags origin\n stderr: 'fatal: unable to access 'https://opendev.org/zuul/zuul-jobs/': OpenSSL SSL_connect: SSL_ERROR_ZERO_RETURN in connection to opendev.org:443 | 15:20 |
| mhu | git fetch -f --tags --prune --prune-tags origin\n stderr: 'fatal: unable to access 'https://opendev.org/x/browbeat/': OpenSSL SSL_connect: SSL_ERROR_ZERO_RETURN in connection to opendev.org:443 | 15:20 |
| mhu | It's infrequent (I can provide timestamps) but we've started to see this occur more often in the last 24 hours | 15:20 |
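One common client-side mitigation for transient errors like SSL_ERROR_ZERO_RETURN from an overloaded server is to retry the fetch with backoff. A minimal sketch, assuming a plain subprocess call; the git command mirrors the one in the error messages above, and nothing here is mhu's actual CI code:

```python
import subprocess
import time

def run_with_retry(cmd, attempts=3, base_delay=2.0, cwd=None):
    """Run a command, retrying with exponential backoff on failure --
    useful for transient errors such as SSL_ERROR_ZERO_RETURN during
    a git fetch against an overloaded server."""
    for i in range(attempts):
        result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if i < attempts - 1:
            time.sleep(base_delay * (2 ** i))  # back off: 2s, 4s, 8s, ...
    result.check_returncode()  # raise CalledProcessError on final failure

# Hypothetical CI usage:
# run_with_retry(["git", "fetch", "-f", "--tags", "--prune", "origin"],
#                cwd="/path/to/repo")
```

This only papers over the symptom on the consumer side; the server-side fixes discussed below are the real remedy.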
| fungi | mhu: were you seeing it specifically around 13:15 utc? if so that could be the same problem others were reporting in here earlier | 15:21 |
| mhu | @fungi: what did you set up to protect against LLMs? we had to do it too for rdoproject.org and softwarefactory-project.io, and we went with Anubis. | 15:21 |
| mhu | It started about a day ago, but it really depends on when our CI jobs were running. I mean it's mostly a check pipeline, not periodic, so it's kind of random | 15:22 |
| fungi | at the moment it's based on the fact that most of the misbehaving crawlers use semi-nonsense user agent strings (old browser versions, mobile browser versions, combos that don't make sense, obvious typos...) | 15:22 |
| mhu | (i just joined the chan so unfortunately I don't have the history) | 15:22 |
| fungi | but we're playing around with a honeypot to try auto-blocking clients that hit unadvertised urls and don't respect robots.txt next | 15:23 |
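The honeypot idea can be sketched in a few lines: list a path as disallowed in robots.txt, never link it anywhere, and block any client that requests it anyway. The trap paths and log format below are hypothetical, not OpenDev's actual setup:

```python
import re

# Hypothetical trap paths: disallowed in robots.txt and never linked,
# so only crawlers that ignore robots.txt should ever request them.
TRAP_PATHS = ("/.hidden-trap/", "/robots-trap/")

# Minimal combined-log-style prefix: 'IP - - [ts] "GET /path ...'
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (\S+)')

def trap_hits(log_lines):
    """Return the set of client IPs that requested a trap path."""
    bad = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group(2).startswith(TRAP_PATHS):
            bad.add(m.group(1))
    return bad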
| mhu | I think git sets its own user agent, so it shouldn't hit your rules | 15:24 |
| fungi | yeah | 15:24 |
| mhu | and I can't really tell what percentage of git calls actually fail on the SSL issue compared to how many go through | 15:25 |
| mhu | anyhow I suggest having a look at Anubis, it doesn't have a honeypot feature AFAICT but it comes with predefined rules and it's fairly easy to customize | 15:26 |
| fungi | right, i think what you're seeing is that some client(s) overload one of our backends, the load balancer takes it offline because it's stopped responding, and then distributes those same problem requests to another backend that gets similarly knocked offline, and so on | 15:27 |
| mhu | Sounds sensible, and sadly familiar, yeah | 15:27 |
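The cascade fungi describes falls out of ordinary health-check configuration. OpenDev fronts gitea with haproxy; a minimal sketch of the relevant backend stanza (the server names echo the gitea11-13 backends mentioned above, but the ports and thresholds are illustrative, not the real config):

```
# Illustrative haproxy backend: after "fall" consecutive failed checks a
# server is marked down and its traffic redistributed to the remaining
# servers -- which is exactly how one batch of expensive crawler requests
# can knock backends over in sequence.
backend gitea
    balance source
    default-server check inter 2s fall 3 rise 2
    server gitea11 gitea11.opendev.org:3081 check
    server gitea12 gitea12.opendev.org:3081 check
    server gitea13 gitea13.opendev.org:3081 check
```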
| fungi | our concerns with anubis are that it could disrupt legitimate requests from automated systems that lack a javascript engine, and that it shows a possibly cringe-worthy cartoon character that may not be appropriate in some work environments | 15:28 |
| mhu | eh, we jokingly call it "the waifu" | 15:29 |
| mhu | (the challenge message is customizable, FWIW) | 15:30 |
| mhu | the legitimate requests come from automation that presents a browser's user agent? | 15:31 |
| fungi | no, but also illegitimate requests can come from crawlers that lie about what user agent they are | 15:32 |
| fungi | anyway, https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/apache-ua-filter/files/ua-filter.conf is what we're relying on at the moment | 15:32 |
| fungi | https://review.opendev.org/970674 is the upcoming work in progress | 15:33 |
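The linked ua-filter.conf is the authoritative rule set; the general shape of such an Apache filter, with hypothetical patterns standing in for the real ones, is roughly:

```
# Illustrative only -- see the linked ua-filter.conf for the real rules.
# Reject requests whose User-Agent matches known-bogus patterns, e.g.
# ancient browser versions no real user runs, or typo'd vendor strings.
RewriteEngine On
RewriteCond "%{HTTP_USER_AGENT}" "Chrome/[1-9]\." [OR]
RewriteCond "%{HTTP_USER_AGENT}" "Mozlila" [NC]
RewriteRule "." "-" [F]
```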
| TheJulia | Any chance I can get someone on opendev who might have an opinion regarding glean to glance at https://storyboard.openstack.org/#!/story/2011631 ? | 15:33 |
| fungi | TheJulia: that looks related to https://review.opendev.org/963010 | 15:34 |
| TheJulia | yup | 15:35 |
| fungi | i've already +2'd it but would like another core reviewer's input just to be sure i'm not overlooking something | 15:35 |
| TheJulia | fair, I'll take a look at that patch in a minute myself since I'm definitely amongst the few who have dug through the glean code | 15:36 |
| mhu | thanks, I'll have a look and see if I can help. Just as a last note (I swear I don't own shares of Anubis) basically triggering a JS challenge whenever a User-Agent contains "Mozilla" was enough to reduce our gateway's load average by a factor of a thousand, without significant impact on legit traffic | 15:36 |
| fungi | mhu: yes, i expect we'll end up with a multi-pronged approach if the ai crawler incentive doesn't collapse under its own weight soon, and anubis or something like it will probably end up incorporated | 15:38 |
| opendevreview | Merged openstack/project-config master: [neutron] update dashboard with new jobs https://review.opendev.org/c/openstack/project-config/+/971155 | 15:38 |
| mhu | anyway good luck with that, I've been through the pain of fighting the Robot Wars of 2025 | 15:39 |
| fungi | same. i can't wait for the ai hype to be over | 15:39 |
| fungi | (not just the crawler madness, but also for people to finally stop telling me how i should be using ai for everything) | 15:40 |
| fungi | at some point the people running these crawlers are going to realize that ai-generated content has already poisoned so much of the web that training an llm on it only makes their models worse | 15:41 |
| TheJulia | fungi: oh, heh, I already did look at it | 15:42 |
| mhu | dead internet theory on steroids | 15:42 |
| fungi | TheJulia: oh, yes you +1'd it two days ago | 15:42 |
| TheJulia | mhu: the idea of AI crawlers being the new robot wars might explain the lacking continuation of battlebots. | 15:46 |
| clarkb | fwiw I'm actually really skeptical anubis is the right choice | 15:46 |
| fungi | mhu: oh, https://zadzmo.org/code/nepenthes/ was the other one we talked about, though some people have raised concerns that it might be illegal to cause "harm" to the systems running the crawlers by chewing up their resources. granted it's not too dissimilar from the labyrinth technique cloudflare ended up implementing | 15:46 |
| clarkb | specifically because it is trivially defeated by either using a git user agent (or something that isn't a mozilla compatible browser) or by running a js engine on the bot side | 15:47 |
| clarkb | also it creates a pretty terrible user experience | 15:47 |
| clarkb | (but a terrible user experience is better than a down user experience so maybe that is a compromise we have to make, I've just been trying to do what we can to avoid that compromise) | 15:47 |
| TheJulia | fungi: so, regarding that bug, it looks like it points to a different issue than the code changes, fwiw. | 15:48 |
| fungi | but if glean stops installing that udev rule triggered for glean@.service when systemd-networkd is present, that race condition won't have a chance to occur right? or am i missing some nuance in the bug? | 15:49 |
| clarkb | fungi: I don't think systemd-networkd is the default for centos, network manager is | 15:50 |
| TheJulia | I think that changes on 10... I think | 15:51 |
| clarkb | no 10 made network manager native config required | 15:51 |
| TheJulia | ahh | 15:51 |
| clarkb | 9 and older supported the old /etc/system/config files, or whatever path they were, via a network manager plugin | 15:51 |
| fungi | clarkb: you're right, the bug report mentions nm | 15:51 |
| mhu | clarkb: I believe the user agent of these bots is always set to "Mozilla something" because the point is to slurp delicious content that is aimed at humans. But I agree Anubis and the like only work as long as bots don't switch to no user-agent, and then we'll just be in for a lousy Red Queen Race | 15:52 |
| clarkb | fungi: yes, the change you linked is just a way of forcing systemd-networkd on non default systemd-networkd systems. So you could potentially force centos to use it instead of network manager. But I don't think that change would fix things for users who use centos and network manager | 15:52 |
| clarkb | mhu: or they can run a headless js engine and cache some calculated results and share them amongst the botnet | 15:52 |
| clarkb | mhu: anubis works because the crawlers are currently happy with the situation and not because it really solves anything | 15:53 |
| clarkb | and as a user I close almost any tab that flashes the anubis logo, about 90% of the time | 15:53 |
| mhu | clarkb: for a human it's a one time cost though, well provided you accept the cookie storing your proof of work | 15:54 |
| mhu | but it's a bleak future indeed, I expect too that Anubis and the like will only work until the bots get better | 15:55 |
| clarkb | mhu: right, but for many of us users that means every time I visit your site I'm needing to recalculate | 15:55 |
| clarkb | (because cookies don't persist forever) | 15:55 |
| clarkb | and again it would be trivial to defeat by the crawlers if they tried I think | 15:56 |
| mhu | well they'd still have to do the calculation at least once per site which would slow down their slurping, and cost more to run as bot farms | 15:57 |
| clarkb | mhu: they are literally crawling billions (trillions?) of pages. Once per site isn't a big deal | 15:57 |
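Anubis-style challenges are essentially hashcash: the client must find a nonce whose SHA-256 over the challenge has N leading zero bits, which is moderately expensive to compute and a single hash to verify. A minimal sketch of the idea only, not Anubis's actual protocol:

```python
import hashlib

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()  # zeros at the top of this byte
        break
    return bits

def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce (the expensive part)."""
    nonce = 0
    while True:
        d = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(d) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash, so verification is cheap."""
    d = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(d) >= difficulty
```

This is also why clarkb's objection holds: a crawler can amortize one solve per site (or cache solutions across a botnet) while every cookie-expired human pays the cost again.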
| TheJulia | fungi: yeah, looking at it, I think the right course is to actually move the file install outside of the individual invocations, which is what the bug raises the question of, or quite literally make the file addition conditional on an os.path.exists check | 15:58 |
| clarkb | looking at requests on gitea13 there are definitely a bunch of bogus UAs we could filter | 15:59 |
| clarkb | chrome 5? thats ancient | 15:59 |
| mhu | well, some points-of-sale are still running WinXP ... | 16:00 |
| TheJulia | .... they shouldn't be on the internet then. | 16:00 |
| mhu | I hope they aren't, but my point was more that you'd be surprised by old stuff still being run | 16:01 |
| clarkb | found some chrome 4 | 16:01 |
| clarkb | mhu: yes but they are not our user base | 16:01 |
| TheJulia | My wife comments "Better than OS/2 Warp!" | 16:01 |
| fungi | if a point of sale system tries to browse our git repositories and gets blocked for running a super old chrome full of known exploitable security vulnerabilities, i won't lose sleep over it | 16:01 |
| clarkb | mhu: and when users have complained that our rules have blocked their old browsers I have encouraged them to upgrade (quickly) | 16:01 |
| clarkb | I'm going to eat some breakfast, but I can look into updating the UA list unless someone else wants to do it | 16:02 |
| mhu | that lone dude at costco trying to contribute to opendev during downtime on his PoS: ;_; | 16:02 |
| mhu | anyway I don't think the problem I initially reported was due to a problematic UA on our side, but more about hitting your servers as they were under heavy load | 16:03 |
| clarkb | correct | 16:04 |
| clarkb | and our solution is to block requests with bogus agents so that valid requests can be processed | 16:04 |
| clarkb | also not perfect but much simpler than anubis and transparent for the vast majority of valid users | 16:05 |
| mhu | yeah from my experience that should be sufficient to mitigate | 16:06 |
| mhu | I'll monitor the SSL issues some more and will keep you updated if I notice a frequency increase. Thanks! | 16:08 |
| fungi | right, it just means us playing whack-a-mole with new user agents that crop up in the logs which can be safely blocked | 16:08 |
| mhu | hence the honeypot idea | 16:09 |
| opendevreview | Julia Kreger proposed opendev/glean master: Move networkmanager dns installation into install https://review.opendev.org/c/opendev/glean/+/971350 | 16:11 |
| clarkb | early math shows about 2/3 of a million bad requests from a consistent pattern (with 566 entries) of different bogus UAs | 16:14 |
| clarkb | now I just have to figure out how to represent these in a way that we don't all go crazy | 16:14 |
| clarkb | mhu: the really fun thing is we've found typos in these user agents | 16:17 |
| mhu | because to err is human! | 16:17 |
| clarkb | I actually see one in the current data set too (they forgot the Mozilla/5.0 prefix) | 16:18 |
| mhu | I have a feeling these crawlers have been cobbled together hastily just to jump on the hype train, sell the data to whoever thinks they need it | 16:18 |
| fungi | right, in particular some of the big name llm operations have been exposed buying data from no-name third parties who are obtaining it in disreputable (or sometimes even illegal) ways, so that the buyers can feign ignorance and continue to claim they're doing nothing wrong | 16:23 |
| mhu | story as old as time | 16:25 |
| fungi | but i really do expect that market to dry up soon. the popular models have long since run out of novel human-created content to train on, and all attempts to train on synthetic content are leading to model collapse | 16:25 |
| mhu | it's either that or a lot of the internet shuts down because of the massive DoS on just about anything that serves text or code | 16:26 |
| clarkb | ok I'm through the windows UAs... that represents about half of them. Now to figure out OS X and Linux | 16:38 |
| mnasiadka | clarkb: I’ve updated the zuul-jobs patch, updated pip.conf template - I’d assume it’s fine now | 16:39 |
| clarkb | mnasiadka: ack I'll try to take another look | 16:39 |
| mnasiadka | But I see you’re having crawler fun | 16:39 |
| fungi | the fun never ends | 16:40 |
| mnasiadka | Just enable captcha on all pages :-) | 16:40 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Expand our UA filter set https://review.opendev.org/c/opendev/system-config/+/971357 | 16:59 |
| clarkb | infra-root ^ I think that covers about 90% of the current set of crawlers based on logs | 16:59 |
| clarkb | essentially I'm expanding the platform and webkit/safari matches | 17:00 |
| clarkb | as there is a lot more variance on those now | 17:00 |
| clarkb | oh wait I may be able to delete some of the old regexes that I haven't already cleaned up. Give me a minute to double check | 17:01 |
| clarkb | nope nevermind I think they are distinct and not worth trying to make a more complicated regex for imo | 17:02 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Expand our UA filter set https://review.opendev.org/c/opendev/system-config/+/971357 | 17:04 |
| clarkb | that does fix a comment but shouldn't affect the functionality of the change if you've already started reviewing the first patchset | 17:04 |
| clarkb | I think they must round robin the agents because there is a very consistent count across all of them. Made it easy to identify the batch after some sort and uniq -c filtering :) | 17:05 |
| fungi | very cool observation! | 17:05 |
| opendevreview | Jeremy Stanley proposed opendev/system-config master: Use raw string for regex in release-volumes.py https://review.opendev.org/c/opendev/system-config/+/971358 | 17:23 |
| clarkb | https://104.130.74.13/ is the load balancer for gitea on the UA change. It is currently being updated so not all projects are there but I can browse via firefox and chrome so I'm reasonably happy I didn't break anything for up to date browsers with the new rules | 18:10 |
| clarkb | there is no node hold so that is a temporary check but good enough for me | 18:11 |
| clarkb | I think we can proceed with deploying this if the job passes and try to push back on the crawler flood | 18:11 |
| fungi | sgtm | 18:25 |
| fungi | though i'm on my way out to an eye exam so will probably be afk for roughly an hour | 18:25 |
| fungi | maybe longer if they're backed up | 18:26 |
| clarkb | ack. I can be around if anyone else wants to review it or I can self approve | 18:26 |
| clarkb | the disaster case would be mitigated by manually dropping the use of the UA ruleset from vhosts and reloading apache if it comes to that | 18:27 |
| clarkb | but as said testing looks good so I'm not too worried | 18:27 |
| fungi | cool with me, bbiaw | 18:27 |
| clarkb | I'll wait for the check results to post and if they are +1 and no one else has chimed in I'll self approve | 18:28 |
| clarkb | infra-root ^ fyi if you want to review the change | 18:28 |
| clarkb | ok it posted +1 so I have approved it. We have about an hour or so to unapprove it if we wish | 18:33 |
| clarkb | it does look like things are trending back towards the background normal though | 19:00 |
| mnasiadka | It’s interesting that crawlers are identifying as such old macos versions | 19:39 |
| clarkb | mnasiadka: yes they've been doing this for years too. I think the idea is to have many different agent strings to make it potentially harder to block them all? | 19:43 |
| clarkb | while still being realistic enough to generate data that is useful | 19:43 |
| fungi | okay, back, that was fairly quick | 19:45 |
| clarkb | and you still have both eyeballs? | 19:45 |
| fungi | hard to tell | 19:45 |
| clarkb | the filter list should land soon then I'm going to eat lunch and run some errands myself | 19:46 |
| fungi | my pupils are still dilated so the screen is a bit wonky but readable at least | 19:46 |
| clarkb | fwiw when I look at UAs to filter my process is to first grep for requests that indicate crawling activity (with gitea this is looking for requests against specific commits), then I remove all 403 responses as those we've already blocked, then grab all the user agents, sort and count then sort again. Then you get a pretty decent picture of what is going on | 19:47 |
| clarkb | the good bots are at the top of that list with total request counts on the order of a few thousand or few 10k requests. Past that you get all the clearly bogus stuff looking like they are used in a round robin so they all sort together by total count | 19:48 |
| fungi | yeah, that's basically how i do it, though i filter up front for 200 response | 19:48 |
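The triage recipe clarkb describes (grep for crawl-shaped requests, drop already-blocked 403s, then the sort | uniq -c | sort on user agents) fits in a few lines of Python; the combined-log format and the /commit/ crawl marker here are assumptions about gitea's access logs:

```python
import re
from collections import Counter

# Combined-log-format tail: '" 200 123 "referer" "user agent"'
LINE_RE = re.compile(r'" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')

def top_user_agents(log_lines, crawl_marker="/commit/"):
    """Count user agents on crawl-like requests that were NOT already
    blocked (403), most frequent first. Round-robined bogus UAs tend to
    cluster at nearly identical counts, which makes them easy to spot."""
    counts = Counter()
    for line in log_lines:
        if crawl_marker not in line:   # keep only crawl-shaped requests
            continue
        m = LINE_RE.search(line)
        if not m or m.group(1) == "403":
            continue                   # unparsable or already blocked
        counts[m.group(2)] += 1
    return counts.most_common()
```

The good bots surface at the top with modest totals; the round-robined bogus agents sort together just below them, exactly as described above.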
| mnasiadka | clarkb: I guess so | 19:52 |
| clarkb | and then on the other end of the bad crawling traffic you have a smaller amount of good traffic. | 19:52 |
| opendevreview | Merged opendev/system-config master: Expand our UA filter set https://review.opendev.org/c/opendev/system-config/+/971357 | 19:55 |
| clarkb | the really sad thing with the gitea servers getting crawled is that a smarter crawler could git clone each of the repos and process them locally much more efficiently | 19:55 |
| clarkb | I suspect, but have no actual data for this, that groups like sourcegraph/amp do this (because I don't recall seeing them, which means they aren't as active as those from google and openai and facebook and amazon) | 19:56 |
| clarkb | but if you do that then you can git clone all the repos stick them in some database that can do incremental updates and also output in a way that your training system understands | 19:59 |
| clarkb | gitea09's apache has been reloaded | 20:00 |
| clarkb | https://docs.opendev.org/opendev/system-config/latest/roles.html should be behind the new filter set now too | 20:00 |
| clarkb | and that url loads for me in all three browsers I run | 20:01 |
| clarkb | I'll check gitea when all backends are updated so I don't have to SOCKS proxy | 20:01 |
| JayF | clarkb: 100% professional git scrapers do it via clones for one very important reason: unredacted email addresses | 20:04 |
| JayF | clarkb: at least that's true for git scrapers used by recruiters in the tech field I've talked to | 20:05 |
| clarkb | JayF: I suspect llm training sets care less about that info (as long as they can identify authors uniquely the actual email probably doesn't matter). But git contains a ton of useful info which I suspect being able to look up more dynamically is useful | 20:06 |
| clarkb | the tree structure itself is usually not well represented in the web page renderings | 20:06 |
| clarkb | but ya I think there is the clearly competent group over here, the competent but also generic group over there, and then everyone else | 20:07 |
| clarkb | and the problems are largely caused by the everyone else | 20:07 |
| clarkb | deploy succeeded. Let us know if you have problems accessing sites like opendev.org or zuul.opendev.org or docs.openstack.org due to 403 responses. Otherwise I think this should help push back against the flood decently well (for now) | 20:08 |
| fungi | i expect everyone trying to train source code models does git clone operations, scrapers looking for web content just happen to not be able to tell they're uselessly crawling a code hosting site | 20:09 |
| clarkb | fungi: openai and google both scrape gitea via web. It is possible they also git clone if they are interested in the source code as source code | 20:11 |
| clarkb | they both have source focused llm product offerings aiui so its a bit ambiguous | 20:11 |
| fungi | deploy succeeded for the ua filter list update about 15 minutes ago | 20:24 |
| fungi | and yeah, everything seems to be working for me | 20:25 |
| clarkb | gitea load averages are much lower too. I haven't done log analysis to see if that is due to lack of requests or us responding 403 to an ongoing crawl | 23:15 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!