Thursday, 2024-06-13

*** bauzas_ is now known as bauzas00:32
*** bauzas_ is now known as bauzas01:42
*** bauzas_ is now known as bauzas03:45
*** bauzas_ is now known as bauzas04:01
*** bauzas- is now known as bauzas04:34
*** bauzas_ is now known as bauzas05:34
*** bauzas_ is now known as bauzas05:59
*** bauzas_ is now known as bauzas06:15
*** bauzas_ is now known as bauzas06:31
*** bauzas_ is now known as bauzas07:00
*** bauzas_ is now known as bauzas07:41
*** bauzas_ is now known as bauzas08:24
*** bauzas_ is now known as bauzas08:55
*** bauzas_ is now known as bauzas09:23
opendevreviewArtem Goncharov proposed openstack/project-config master: Add publish openapi specs job  https://review.opendev.org/c/openstack/project-config/+/92193410:17
*** tosky_ is now known as tosky12:28
JayFWiki is behaving a little weird; saved an edit but it didn't ever redirect anywhere, just kinda hung. When I reloaded the original page, the edit was there16:42
tonybJayF: a one off or several times?16:43
JayFthree times in a row, all in like a 10 minute period16:43
tonybOkay I'll have a look at the server logs etc16:44
JayFit almost acted like it was being redirected to some URL that didn't exist / timed out16:47
tonybOkay.   The server is quite "busy"  but not overloaded16:49
tonybAhh it *looks* like someone is crawling the site16:55
clarkbthat is something we've periodically had to deal with elsewhere too17:04
clarkbthe well behaved bots respect robots.txt rules that tell them to slow down (and honestly have never really been a problem). it's the bots that ignore robots.txt that we often have to block17:05
tonybYeah, this one seems to be coming from many IPs :(17:05
*** bauzas_ is now known as bauzas17:07
tonyball AWS EC2 :(17:07
tonybhttps://www.reddit.com/r/singularity/comments/1cdm97j/anthropics_claudebot_is_aggressively_scraping_the/17:08
JayFtonyb: I know someone who works at anthropic, if you can directly tie it to them I can let the person know the feedback.17:09
fungiyeah, when we have the new wiki server in place we can more easily reuse the ua filter we have on some other services17:09
fungiJayF: it happened to me earlier today too, and crops up from time to time17:09
JayFare the UAs on that claudebot?17:09
tonybYup17:09
JayFand I assume we have Crawl-Delay set and it's not being honored?17:09
fungii don't assume anything with the old unmaintained wiki server17:10
tonybThey've done ~500k requests in 24 hours17:10
JayFhttps://wiki.openstack.org/robots.txt does not appear to work as I'd expect17:10
fungii'm inclined to just hope for the best and focus our energies on the new wiki server we'll actually be able to manage17:10
tonybThey explicitly ignore robots.txt anyway17:10
JayFtonyb: I was hoping to assemble a full complaint with "your bot caused service disruption for these reasons" 17:11
fungiyeah, robots.txt is mostly only useful these days for informing legitimate search engines that some content is not worth crawling. abusive crawlers can easily just not care you have one at all17:11
JayFtonyb: but if we don't publish a robots.txt, it's a less strong argument -- especially since https://www.openstack.org/robots.txt doesn't set Crawl-Delay either17:11
JayFfungi: The people I know who work over there at least *claim* to care about not being abusive in how they train. That's part of why I wanted to close the loop17:12
tonybFair.17:13
tonybAny objections to me adding:17:18
tonybroot@wiki-upgrade-test:/etc/apache2/sites-enabled# diff -U0 50-wiki.openstack.org.conf~tonyb_before_crawler_change 50-wiki.openstack.org.conf17:18
tonyb--- 50-wiki.openstack.org.conf~tonyb_before_crawler_change      2024-06-13 17:14:55.218894454 +000017:18
tonyb+++ 50-wiki.openstack.org.conf  2024-06-13 17:17:15.226120795 +000017:18
tonyb@@ -45,0 +46,3 @@17:18
tonyb+        RewriteEngine on17:18
tonyb+       RewriteCond %{HTTP_USER_AGENT}  ^.*ClaudeBot.*$17:18
tonyb+       RewriteRule . - [R=403,L]17:18
tonybto the appropriate place?17:18
tonybto see if we can slow it down at least17:18
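[Editor's note] The rewrite rule proposed above returns 403 for any request whose User-Agent header contains "ClaudeBot". A minimal Python sketch of the same matching logic follows, for readers who want to see what the condition accepts and rejects. The pattern is copied from the RewriteCond; the names BLOCKED_UA_PATTERNS and status_for are hypothetical, and the sample User-Agent strings in the usage note are illustrative assumptions, not values taken from the server logs. As with the Apache rule (which has no NC flag), the match is case-sensitive.

```python
import re

# Pattern copied from the RewriteCond in the log above.
# BLOCKED_UA_PATTERNS and status_for are hypothetical names for this sketch.
BLOCKED_UA_PATTERNS = [re.compile(r"^.*ClaudeBot.*$")]


def status_for(user_agent: str) -> int:
    """Return 403 if the User-Agent matches a blocked pattern, else 200."""
    if any(pattern.search(user_agent) for pattern in BLOCKED_UA_PATTERNS):
        return 403
    return 200
```

For example, status_for("Mozilla/5.0 (compatible; ClaudeBot/1.0)") gives 403, while status_for("Mozilla/5.0 Firefox/126.0") gives 200. Adding a second blocked agent would mean appending another compiled pattern to the list.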
fungitonyb: no objection if that's how you choose to spend your time ;)17:19
fungii wouldn't want to set a precedent that we're continuing to try to fix things on the old server, but seems like we can see the light at the end of that tunnel anyway thanks to your efforts17:20
clarkbya I don't object17:22
tonybYeah I don't really want to set that precedent either but also I'd like the service to be usable17:23
clarkbJayF: fwiw I reject the idea that people hosting content have to explicitly say "please don't dos us" otherwise it's fair game17:23
clarkbrobots.txt should provide helpful hints, bots should still be well behaved17:23
fungii.e. not every site is hosted on github-pages or geocities17:24
JayFI agree in spirit but generally have seen in practice that most scripts/users/bots/whatever will take up as much performance slack as they are given, so tend to go the route of actual-rate-limits if it's a dynamic site that I care a lot about.17:25
JayFBut that's basically saying "hey overworked people, do more stuff" in a roundabout way, which isn't really reasonable either17:25
fungiJayF: in my opinion it's also fair game to just block them without notice17:26
JayFSorta like my feelings on if you take an MIT licensed software and make it a proprietary product with very minimal value add. Allowed? Yes. Against the spirit? Yes.17:26
JayFfungi: yep :D 17:26
tonybOkay the load has dropped from 3+ to <117:26
fungithanks tonyb!17:26
tonybI added a second UA to the block list17:26
JayFfungi: and I'll make sure that gets communicated to the right place when I find out what that is (your badly behaved bot will no longer be able to crawl the *root source* for some data on an OSS project because it was aggro)17:26
tonyband with that I think I'll relocate to a brewery17:27
fungione way wide-reaching scrapers deal with that is to parallelize across lots of sites rather than having a worker narrowly scrape content from one site17:28
fungiso they can still achieve suitable throughput in aggregate while not hammering any one site with tons of requests in a short span of time17:29
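[Editor's note] The approach fungi describes, spreading work across many sites instead of hammering one, can be sketched as a round-robin reordering of a crawl queue. This is only an illustration under assumptions: the function name interleave_by_host and the example.* URLs are hypothetical, and nothing here reflects how any particular crawler is actually built.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts.

    Consecutive requests then go to different sites wherever possible,
    so aggregate throughput stays high without a burst against one host.
    """
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)

    ordered = []
    # Take the first URL of every host, then the second of every host, etc.
    for batch in zip_longest(*by_host.values()):
        ordered.extend(url for url in batch if url is not None)
    return ordered
```

Once the other hosts are exhausted, the remaining URLs for a large site still come back to back, so a real crawler would presumably pair this with a per-host delay as well.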
JayFI was pointed at this: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler which indicates it should support Crawl-Delay; if that's something you all wanted to implement across hosted sites18:10
tonybwe can look at that, but probably not for the current wiki server 18:14
clarkbmost of our robots.txt files do set a crawl delay iirc18:14
JayFthere is no robots.txt currently for wiki.openstack.org, and www.openstack.org/robots.txt does not contain a crawl-delay18:14
clarkbit's just that the wiki is the wiki and not in proper config management so making any changes to it requires far more effort and care18:14
fungihttps://review.opendev.org/robots.txt does18:14
fungias does https://opendev.org/robots.txt18:15
JayFonly for MSNBot it appears?18:15
fungioh, good catch18:15
fungiwe did it proper like on gitea tho18:15
JayFyeah, opendev.org/robots.txt looks OK to me even if review.opendev.org might not have the crawl-delay properly scoped18:16
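[Editor's note] The scoping issue JayF spotted (a Crawl-delay that sits only under one named User-agent group does not apply to other bots) can be checked with Python's standard urllib.robotparser. The two robots.txt samples below are hypothetical stand-ins written for this sketch, not the actual contents of any opendev.org file, and crawl_delay_for is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample: the delay is scoped to a single named bot.
SCOPED_TO_ONE_BOT = """\
User-agent: msnbot
Crawl-delay: 2

User-agent: *
Disallow: /private/
"""

# Hypothetical sample: the delay sits in the catch-all group.
APPLIES_TO_ALL = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay that applies to user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)
```

With the first sample, crawl_delay_for(SCOPED_TO_ONE_BOT, "ClaudeBot") returns None while "msnbot" gets 2; with the second, every agent gets 2. Whether a given crawler honors the value is a separate question from whether the file declares it.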
*** bauzas_ is now known as bauzas18:29
*** bauzas_ is now known as bauzas18:46
*** bauzas_ is now known as bauzas19:51
*** bauzas_ is now known as bauzas20:21
opendevreviewJeremy Stanley proposed openstack/project-config master: Add OnMetal to Nodepool and Grafana  https://review.opendev.org/c/openstack/project-config/+/92198720:29
JayFfungi: Rackspace OnMetal is going to be hosting opendev?!20:30
JayFtalk about full circle!20:30
fungiJayF: good catch, i've confused the name20:30
fungiJayF: was supposed to be openmetal20:30
JayFah, still Ironic users :D 20:31
opendevreviewJeremy Stanley proposed openstack/project-config master: Add OpenMetal to Nodepool and Grafana  https://review.opendev.org/c/openstack/project-config/+/92198720:34
*** bauzas_ is now known as bauzas20:42
*** haleyb is now known as haleyb|out20:56
opendevreviewMerged openstack/project-config master: Add OpenMetal to Nodepool and Grafana  https://review.opendev.org/c/openstack/project-config/+/92198722:33
*** bauzas_ is now known as bauzas22:52
opendevreviewJeremy Stanley proposed openstack/project-config master: Correct OpenMetal region from iad3 to IAD3  https://review.opendev.org/c/openstack/project-config/+/92199123:09
opendevreviewMerged openstack/project-config master: Correct OpenMetal region from iad3 to IAD3  https://review.opendev.org/c/openstack/project-config/+/92199123:55
