By now, it should be pretty clear that this is no coincidence. AI scrapers are getting more and more aggressive, and since FOSS projects rely on public collaboration in a way that private companies don't, this extra burden falls disproportionately on Open Source communities.

So let's go back to Drew's blog post for more detail. According to Drew, LLM crawlers don't respect robots.txt and crawl expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.
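For context, "respecting robots.txt" is something a crawler has to opt into; Python's standard library even ships a parser for it. Here is a minimal sketch of the check a well-behaved crawler would perform before fetching an expensive endpoint (the forge URL, paths, and bot name below are made up for illustration):

```python
from urllib import robotparser

# Hypothetical forge; a real robots.txt there might disallow expensive
# endpoints such as blame, per-commit, and per-log pages.
ROBOTS_URL = "https://git.example.org/robots.txt"

parser = robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse robots.txt

# A well-behaved crawler asks this question before every request;
# the crawlers Drew describes simply skip the step.
allowed = parser.can_fetch(
    "ExampleBot/1.0",
    "https://git.example.org/some-user/some-repo/blame/master/main.c",
)
print("allowed to fetch:", allowed)
```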

Because of this, it's hard to come up with a good set of mitigations. Drew says that several high-priority tasks have been delayed by weeks or months due to these interruptions, that users are occasionally affected (because it's hard to distinguish bots from humans), and that, of course, SourceHut suffers occasional outages as a result.
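To see why the usual mitigations fall flat, consider per-IP rate limiting. If each IP address makes only a single request, a per-IP threshold never fires. The following sketch, which assumes a Combined Log Format access log at a hypothetical path, just measures how many clients fall into that one-request bucket:

```python
from collections import Counter

# Hypothetical log location; the Combined Log Format assumption means the
# client IP is the first whitespace-separated field on each line.
LOG_PATH = "/var/log/nginx/access.log"

def requests_per_ip(path: str) -> Counter:
    """Count requests per client IP address."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return counts

if __name__ == "__main__":
    counts = requests_per_ip(LOG_PATH)
    total = sum(counts.values())
    single = sum(1 for c in counts.values() if c == 1)
    print(f"{len(counts)} distinct IPs, {total} requests")
    print(f"{single} IPs made exactly one request "
          f"({single / len(counts):.0%} of all IPs)")
    # A per-IP rate limit (say, 60 requests per minute) never triggers for
    # these clients, which is exactly how the one-request-per-IP pattern
    # blends in with ordinary user traffic.
```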

Drew doesn't say here which AI companies are more or less respectful of robots.txt files, or more accurate in their User-Agent reporting; we'll be able to look into that later.

Finally, Drew points out that this is not some isolated issue. He says,

All of my sysadmin friends are dealing with the same problems, [and] every time I sit down for beers or dinner to socialize with sysadmin friends it’s not long before we’re complaining about the bots. […] The desperation in these conversations is palpable.