By now, it should be pretty clear that this is no coincidence. AI scrapers are getting more and more aggressive, and since FOSS projects rely on public collaboration in a way that private companies don't, this extra burden falls disproportionately on Open Source communities.

So let's go back to Drew's blog post for more detail. According to Drew, LLM crawlers don't respect robots.txt and crawl expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.
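For context, "respecting robots.txt" is something a crawler has to opt into; Python's standard library even ships a parser for it. Here is a minimal sketch of the check a well-behaved crawler would perform before fetching an expensive endpoint (the forge URL, paths, and bot name below are made up for illustration):

```python
from urllib import robotparser

# Hypothetical forge; a real robots.txt there might disallow expensive
# endpoints such as blame, per-commit, and per-log pages.
ROBOTS_URL = "https://git.example.org/robots.txt"

parser = robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse robots.txt

# A well-behaved crawler asks this question before every request;
# the crawlers Drew describes simply skip the step.
allowed = parser.can_fetch(
    "ExampleBot/1.0",
    "https://git.example.org/some-user/some-repo/blame/master/main.c",
)
print("allowed to fetch:", allowed)
```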

Because of this, it's hard to come up with a good set of mitigations. Drew says that several high-priority tasks have been delayed by weeks or months due to these interruptions, that users are occasionally affected (because it's hard to distinguish bots from humans), and that, of course, SourceHut suffers occasional outages as a result.
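To see why the usual mitigations fall flat, consider per-IP rate limiting. If each IP address makes only a single request, a per-IP threshold never fires. The following sketch, which assumes a Combined Log Format access log at a hypothetical path, just measures how many clients fall into that one-request bucket:

```python
from collections import Counter

# Hypothetical log location; the Combined Log Format assumption means the
# client IP is the first whitespace-separated field on each line.
LOG_PATH = "/var/log/nginx/access.log"

def requests_per_ip(path: str) -> Counter:
    """Count requests per client IP address."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return counts

if __name__ == "__main__":
    counts = requests_per_ip(LOG_PATH)
    total = sum(counts.values())
    single = sum(1 for c in counts.values() if c == 1)
    print(f"{len(counts)} distinct IPs, {total} requests")
    print(f"{single} IPs made exactly one request "
          f"({single / len(counts):.0%} of all IPs)")
    # A per-IP rate limit (say, 60 requests per minute) never triggers for
    # these clients, which is exactly how the one-request-per-IP pattern
    # blends in with ordinary user traffic.
```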

Drew doesn't say here which AI companies are more or less respectful of robots.txt files, or more accurate in their User-Agent reporting; we'll be able to look into that later.

Finally, Drew points out that this is not some isolated issue. He says,

All of my sysadmin friends are dealing with the same problems, [and] every time I sit down for beers or dinner to socialize with sysadmin friends it’s not long before we’re complaining about the bots. […] The desperation in these conversations is palpable.