*** bauzas_ is now known as bauzas | 00:32 | |
*** bauzas_ is now known as bauzas | 01:42 | |
*** bauzas_ is now known as bauzas | 03:45 | |
*** bauzas_ is now known as bauzas | 04:01 | |
*** bauzas- is now known as bauzas | 04:34 | |
*** bauzas_ is now known as bauzas | 05:34 | |
*** bauzas_ is now known as bauzas | 05:59 | |
*** bauzas_ is now known as bauzas | 06:15 | |
*** bauzas_ is now known as bauzas | 06:31 | |
*** bauzas_ is now known as bauzas | 07:00 | |
*** bauzas_ is now known as bauzas | 07:41 | |
*** bauzas_ is now known as bauzas | 08:24 | |
*** bauzas_ is now known as bauzas | 08:55 | |
*** bauzas_ is now known as bauzas | 09:23 | |
opendevreview | Artem Goncharov proposed openstack/project-config master: Add publish openapi specs job https://review.opendev.org/c/openstack/project-config/+/921934 | 10:17 |
---|---|---|
*** tosky_ is now known as tosky | 12:28 | |
JayF | Wiki is behaving a little weird; saved an edit but it didn't ever redirect anywhere, just kinda hung. When I reloaded the original page, the edit was there | 16:42 |
tonyb | JayF: a one off or several times? | 16:43 |
JayF | three times in a row, all in like a 10 minute period | 16:43 |
tonyb | Okay I'll have a look at the server logs etc | 16:44 |
JayF | it almost acted like it was being redirected to some URL that didn't exist / timed out | 16:47 |
tonyb | Okay. The server is quite "busy" but not overloaded | 16:49 |
tonyb | Ahh it *looks* like someone is crawling the site | 16:55 |
clarkb | that is something we've periodically had to deal with elsewhere too | 17:04 |
clarkb | the well behaved bots respect robots.txt rules that tell them to slow down (and honestly have never really been a problem). it's the bots that ignore robots.txt that we often have to block | 17:05 |
tonyb | Yeah, this one seems to be coming from many IPs :( | 17:05 |
*** bauzas_ is now known as bauzas | 17:07 | |
tonyb | all AWS EC2 :( | 17:07 |
tonyb | https://www.reddit.com/r/singularity/comments/1cdm97j/anthropics_claudebot_is_aggressively_scraping_the/ | 17:08 |
JayF | tonyb: I know someone who works at anthropic, if you can directly tie it to them I can let the person know the feedback. | 17:09 |
fungi | yeah, when we have the new wiki server in place we can more easily reuse the ua filter we have on some other services | 17:09 |
fungi | JayF: it happened to me earlier today too, and crops up from time to time | 17:09 |
JayF | are the UAs on that claudebot? | 17:09 |
tonyb | Yup | 17:09 |
JayF | and I assume we have Crawl-Delay set and it's not being honored? | 17:09 |
fungi | i don't assume anything with the old unmaintained wiki server | 17:10 |
tonyb | They've done ~500k requests in 24 hours | 17:10 |
JayF | https://wiki.openstack.org/robots.txt does not appear to work as I'd expect | 17:10 |
fungi | i'm inclined to just hope for the best and focus our energies on the new wiki server we'll actually be able to manage | 17:10 |
tonyb | They explicitly ignore robots.txt anyway | 17:10 |
JayF | tonyb: I was hoping to assemble a full complaint with "your bot caused service disruption for these reasons" | 17:11 |
fungi | yeah, robots.txt is mostly only useful these days for informing legitimate search engines that some content is not worth crawling. abusive crawlers can easily just not care you have one at all | 17:11 |
JayF | tonyb: but if we don't publish a robots.txt, it's a less strong argument -- especially since https://www.openstack.org/robots.txt doesn't set Crawl-Delay either | 17:11 |
JayF | fungi: The people I know who work over there at least *claim* to care about not being abusive in how they train. That's part of why I wanted to close the loop | 17:12 |
tonyb | Fair. | 17:13 |
tonyb | Any objections to me adding: | 17:18 |
tonyb | root@wiki-upgrade-test:/etc/apache2/sites-enabled# diff -U0 50-wiki.openstack.org.conf~tonyb_before_crawler_change 50-wiki.openstack.org.conf | 17:18 |
tonyb | --- 50-wiki.openstack.org.conf~tonyb_before_crawler_change 2024-06-13 17:14:55.218894454 +0000 | 17:18 |
tonyb | +++ 50-wiki.openstack.org.conf 2024-06-13 17:17:15.226120795 +0000 | 17:18 |
tonyb | @@ -45,0 +46,3 @@ | 17:18 |
tonyb | + RewriteEngine on | 17:18 |
tonyb | + RewriteCond %{HTTP_USER_AGENT} ^.*ClaudeBot.*$ | 17:18 |
tonyb | + RewriteRule . - [R=403,L] | 17:18 |
tonyb | to the appropriate place? | 17:18 |
tonyb | to see if we can slow it down at least | 17:18 |
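The three Apache lines in the diff above amount to a case-sensitive substring test on the User-Agent header, answered with a 403. A minimal Python sketch of the same check follows. It is not the server's configuration: the names `BLOCKED_UA_PATTERNS` and `status_for` are hypothetical, and only the `ClaudeBot` pattern comes from the log.

```python
import re

# Hypothetical block list; only "ClaudeBot" is taken from the quoted rule.
BLOCKED_UA_PATTERNS = [re.compile(r"ClaudeBot")]


def status_for(user_agent: str) -> int:
    """Return 403 for a blocked User-Agent header, 200 otherwise."""
    # re.search matches anywhere in the header, which is what the
    # ^.*ClaudeBot.*$ condition does. Like a RewriteCond without [NC],
    # the match is case-sensitive.
    if any(pattern.search(user_agent) for pattern in BLOCKED_UA_PATTERNS):
        return 403
    return 200
```

With this sketch, an illustrative header such as `Mozilla/5.0 (compatible; ClaudeBot/1.0)` maps to 403, while an ordinary browser header maps to 200.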
fungi | tonyb: no objection if that's how you choose to spend your time ;) | 17:19 |
fungi | i wouldn't want to set a precedent that we're continuing to try to fix things on the old server, but seems like we can see the light at the end of that tunnel anyway thanks to your efforts | 17:20 |
clarkb | ya I don't object | 17:22 |
tonyb | Yeah I don't really want to set that precedent either but also I'd like the service to be usable | 17:23 |
clarkb | JayF: fwiw I reject the idea that people hosting content have to explicitly say "please don't dos us" otherwise it's fair game | 17:23 |
clarkb | robots.txt should provide helpful hints, bots should still be well behaved | 17:23 |
fungi | i.e. not every site is hosted on github-pages or geocities | 17:24 |
JayF | I agree in spirit but generally have seen in practice that most scripts/users/bots/whatever will take up as much performance slack as they are given, so tend to go the route of actual-rate-limits if it's a dynamic site that I care a lot about. | 17:25 |
JayF | But that's basically saying "hey overworked people, do more stuff" in a roundabout way, which isn't really reasonable either | 17:25 |
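One common way to build the "actual-rate-limits" JayF mentions is a token bucket per client. The sketch below is only an illustration under assumptions: the class name `TokenBucket`, its parameters, and the injectable `clock` are hypothetical, and nothing in the log says any OpenDev service uses this.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may proceed."""
        now = self.clock()
        # Refill for the elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would keep one bucket per client key (for example per IP or per User-Agent) and answer with an error such as 429 whenever `allow()` returns `False`.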
fungi | JayF: in my opinion it's also fair game to just block them without notice | 17:26 |
JayF | Sorta like my feelings on if you take an MIT licensed software and make it a proprietary product with very minimal value add. Allowed? Yes. Against the spirit? Yes. | 17:26 |
JayF | fungi: yep :D | 17:26 |
tonyb | Okay the load has dropped from 3+ to <1 | 17:26 |
fungi | thanks tonyb! | 17:26 |
tonyb | I added a second UA to the block list | 17:26 |
JayF | fungi: and I'll make sure that gets communicated to the right place when I find out what that is (your badly behaved bot will no longer be able to crawl the *root source* for some data on an OSS project because it was aggro) | 17:26 |
tonyb | and with that I think I'll relocate to a brewery | 17:27 |
fungi | one way wide-reaching scrapers deal with that is to parallelize across lots of sites rather than having a worker narrowly scrape content from one site | 17:28 |
fungi | so they can still achieve suitable throughput in aggregate while not hammering any one site with tons of requests in a short span of time | 17:29 |
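The scheduling idea fungi describes can be sketched as a round-robin over per-host queues, so that consecutive fetches go to different sites. This is a hypothetical illustration (the function name `interleave_by_host` is an assumption), not a description of how any particular crawler works.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Order URLs round-robin across hosts instead of draining one host at a time."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).netloc].append(url)

    ordered = []
    while queues:
        # Take one URL from each host per pass; drop hosts that run dry.
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered
```

A real crawler would also enforce a minimum delay between two requests to the same host; the ordering alone only spreads the load when many hosts are queued.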
JayF | I was pointed at this: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler which indicates it should support Crawl-Delay; if that's something you all wanted to implement across hosted sites | 18:10 |
tonyb | we can look at that, but probably not for the current wiki server | 18:14 |
clarkb | most of our robots.txt files do set a crawl delay iirc | 18:14 |
JayF | there is no robots.txt currently for wiki.openstack.org, and www.openstack.org/robots.txt does not contain a crawl-delay | 18:14 |
clarkb | it's just that the wiki is the wiki and not in proper config management so making any changes to it requires far more effort and care | 18:14 |
fungi | https://review.opendev.org/robots.txt does | 18:14 |
fungi | as does https://opendev.org/robots.txt | 18:15 |
JayF | only for MSNBot it appears? | 18:15 |
fungi | oh, good catch | 18:15 |
fungi | we did it proper like on gitea tho | 18:15 |
JayF | yeah, opendev.org/robots.txt looks OK to me even if review.opendev.org might not have the crawl-delay properly scoped | 18:16 |
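The scoping point above (a Crawl-delay listed under one named bot does not apply to others) can be checked with Python's standard `urllib.robotparser`. The two robots.txt bodies below are hypothetical examples written for this sketch, not the contents of the real files, and `crawl_delay_for` is an assumed helper name.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay that applies to user_agent, or None if there is none."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


# Hypothetical file: the delay is scoped to a single named bot.
SCOPED = """\
User-agent: msnbot
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""

# Hypothetical file: the delay sits in the wildcard group, so it applies to any bot.
WILDCARD = """\
User-agent: *
Crawl-delay: 10
"""
```

Calling `crawl_delay_for(SCOPED, "ClaudeBot")` yields `None`, while `crawl_delay_for(WILDCARD, "ClaudeBot")` yields the wildcard delay. Whether a given crawler honors the value is a separate question.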
*** bauzas_ is now known as bauzas | 18:29 | |
*** bauzas_ is now known as bauzas | 18:46 | |
*** bauzas_ is now known as bauzas | 19:51 | |
*** bauzas_ is now known as bauzas | 20:21 | |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Add OnMetal to Nodepool and Grafana https://review.opendev.org/c/openstack/project-config/+/921987 | 20:29 |
JayF | fungi: Rackspace OnMetal is going to be hosting opendev?! | 20:30 |
JayF | talk about full circle! | 20:30 |
fungi | JayF: good catch, i've confused the name | 20:30 |
fungi | JayF: was supposed to be openmetal | 20:30 |
JayF | ah, still Ironic users :D | 20:31 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Add OpenMetal to Nodepool and Grafana https://review.opendev.org/c/openstack/project-config/+/921987 | 20:34 |
*** bauzas_ is now known as bauzas | 20:42 | |
*** haleyb is now known as haleyb|out | 20:56 | |
opendevreview | Merged openstack/project-config master: Add OpenMetal to Nodepool and Grafana https://review.opendev.org/c/openstack/project-config/+/921987 | 22:33 |
*** bauzas_ is now known as bauzas | 22:52 | |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Correct OpenMetal region from iad3 to IAD3 https://review.opendev.org/c/openstack/project-config/+/921991 | 23:09 |
opendevreview | Merged openstack/project-config master: Correct OpenMetal region from iad3 to IAD3 https://review.opendev.org/c/openstack/project-config/+/921991 | 23:55 |