Tuesday, 2026-04-07

opendevreview	Michal Nasiadka proposed openstack/project-config master: propose-updates: Add pcu target https://review.opendev.org/c/openstack/project-config/+/978566	07:58
croelandt	Did the SSL issues come back? I'm seeing the same issue as last week in the CI: https://zuul.opendev.org/t/openstack/build/3d41b53f81b5493c9d94a5b7270b2c81	14:59
croelandt	EOF occurred in violation of protocol (_ssl.c:1000)'))': /openstack/requirements/raw/branch/master/upper-constraints.txt	15:00
stephenfin	I was just about to ask the same thing	15:00
clarkb	yes, it's another round of crawler load overwhelming the cluster	15:00
clarkb	per usual with a bunch of spoofed user agents presumably from a large set of IPs but I haven't checked the IP side yet	15:00
stephenfin	oh, joy /o\	15:01
croelandt	do we know who is running a DDOS^W^W^Wcrawling our resources?	15:05
croelandt	AI companies that believe our requirements might change every second? :p	15:05
clarkb	croelandt: no, they spoof their User Agents and use botnet-looking source addresses, so each request is from a different IP from just about anywhere in the world	15:07
clarkb	but if you look at the requests themselves it's clear that there are crawler patterns at play	15:08
croelandt	but what is the purpose? Is this a deliberate attack on our resources?	15:09
fungi	it seems to be organized crime in control of vast botnets across the globe trying to scrape everything on the internet to resell to "legitimate" AI companies	15:09
clarkb	the "big" shops all identify their crawlers properly, google, meta, anthropic, openai, etc. Whether or not they buy the datasets being generated here I'm not sure (but I strongly suspect they do if you look in the gemma release info they have to run their own filters over the data sets implying they aren't generating them themselves)	15:09
clarkb	croelandt: no I think the purpose is that groups like google will pay money for large sets of data to train their LLMs	15:10
clarkb	croelandt: google then has to filter the data for legally problematic data. Then they feed the llm	15:10
fungi	...and don't care where it comes from/how it was obtained (would probably prefer not to know)	15:10
clarkb	but I suspect there is enough money in this to make it a business for certain folks	15:10
croelandt	fascinating times	15:10
fungi	which has created a goldrush for organized crime	15:10
croelandt	and it's all related to the absurd prices of hardware, amazing	15:11
clarkb	fungi: yup exactly. Otherwise why does google need to run an entire legal filtration process	15:11
clarkb	https://ai.google.dev/gemma/docs/core/model_card_4#data_preprocessing this is what I refer to	15:15
clarkb	basically that to me strongly implies that they are not curating the data set themselves, or at least not doing so for the entire dataset	15:15
clarkb	and that requires them to apply preprocessing to avoid these problematic sources of data. It also implies that there is a market for collecting the data without regard to fallout because google is going to pay for it anyway	15:16
clarkb	including the far more problematic content that they are attempting to filter out	15:16
-opendevstatus- NOTICE: Load on the opendev.org Gitea backends is under control again for now, if any Zuul jobs failed with SSL errors or disconnects reaching the service prior to 16:15 UTC they can be safely rechecked		17:03
*** skandix4263990 is now known as skandix426399		22:20
