| Nick | Message | Time |
|---|---|---|
| opendevreview | Michal Nasiadka proposed openstack/project-config master: propose-updates: Add pcu target https://review.opendev.org/c/openstack/project-config/+/978566 | 07:58 |
| croelandt | Did the SSL issues come back? I'm seeing the same issue as last week in the CI: https://zuul.opendev.org/t/openstack/build/3d41b53f81b5493c9d94a5b7270b2c81 | 14:59 |
| croelandt | EOF occurred in violation of protocol (_ssl.c:1000)'))': /openstack/requirements/raw/branch/master/upper-constraints.txt | 15:00 |
| stephenfin | I was just about to ask the same thing | 15:00 |
| clarkb | yes it's another round of crawler load overwhelming the cluster | 15:00 |
| clarkb | per usual with a bunch of spoofed user agents presumably from a large set of IPs but I haven't checked the IP side yet | 15:00 |
| stephenfin | oh, joy /o\ | 15:01 |
| croelandt | do we know who is running a DDOS^W^W^Wcrawling our resources? | 15:05 |
| croelandt | AI companies that believe our requirements might change every second? :p | 15:05 |
| clarkb | croelandt: no, they spoof their User Agents and use botnet looking source addresses so each request is from a different IP from just about anywhere in the world | 15:07 |
| clarkb | but if you look at the requests themselves it's clear that there are crawler patterns at play | 15:08 |
| croelandt | but what is the purpose? Is this a deliberate attack on our resources? | 15:09 |
| fungi | it seems to be organized crime in control of vast botnets across the globe trying to scrape everything on the internet to resell to "legitimate" ai companies | 15:09 |
| clarkb | the "big" shops all identify their crawlers properly, google, meta, anthropic, openai, etc. Whether or not they buy the datasets being generated here I'm not sure (but I strongly suspect they do if you look in the gemma release info they have to run their own filters over the data sets implying they aren't generating them themselves) | 15:09 |
| clarkb | croelandt: no I think the purpose is that groups like google will pay money for large sets of data to train their LLMs | 15:10 |
| clarkb | croelandt: google then has to filter the data for legally problematic data. Then they feed the llm | 15:10 |
| fungi | ...and don't care where it comes from/how it was obtained (would probably prefer not to know) | 15:10 |
| clarkb | but I suspect there is enough money in this to make it a business for certain folks | 15:10 |
| croelandt | fascinating times | 15:10 |
| fungi | which has created a goldrush for organized crime | 15:10 |
| croelandt | and it's all related to the absurd prices of hardware, amazing | 15:11 |
| clarkb | fungi: yup exactly. Otherwise why does google need to run an entire legal filtration process | 15:11 |
| clarkb | https://ai.google.dev/gemma/docs/core/model_card_4#data_preprocessing this is what I refer to | 15:15 |
| clarkb | basically that to me strongly implies that they are not curating the data set themselves, or at least not doing so for the entire dataset | 15:15 |
| clarkb | and that requires them to apply preprocessing to avoid these problematic sources of data. It also implies that there is a market for collecting the data without regard to fallout because google is going to pay for it anyway | 15:16 |
| clarkb | including the far more problematic content that they are attempting to filter out | 15:16 |
| -opendevstatus- | NOTICE: Load on the opendev.org Gitea backends is under control again for now, if any Zuul jobs failed with SSL errors or disconnects reaching the service prior to 16:15 UTC they can be safely rechecked | 17:03 |
| *** | skandix4263990 is now known as skandix426399 | 22:20 |