• The company says the content served to bots is deliberately irrelevant to the website being crawled, but it is carefully sourced or generated using real scientific facts—such as neutral information about biology, physics, or mathematics—to avoid spreading misinformation (whether this approach effectively prevents misinformation, however, remains unproven).

    You cowards. Make it all Hitler fan stuff and wild Elon Musk porno slash fiction. Make it a bunch of source code examples with malicious bugs. Make it instructions for how to make nuclear weapons. They want to ignore the blocking directives and lie about their user agent? Dude, fuck ‘em up. Today’s society has made people way too nice.

  •  Pete Hahnloser   ( @Powderhorn@beehaw.org ) 
    21 days ago

    Interesting approach. But of course it’s another black box, because otherwise it wouldn’t be effective. So now we’re going to be wasting even more electricity on processes we don’t understand.

    As a writer, I dislike that much of my professional corpus (and of course everything on Reddit) has been ingested into LLMs. So there’s stuff to like here for things going forward. The question remains: At what cost?

  • Great, just one issue.

    “The company says the content served to bots is deliberately irrelevant to the website being crawled, but it is carefully sourced or generated using real scientific facts”

    Nah, screw that, actively sabotage the training data if they’re going to keep scraping data after being told not to. Poison it with gibberish bad info. Otherwise you’re just giving them irrelevant but not unuseful training data, so no real incentive to only scrape pages that have allowed it.

  • Recently, I have also been seeing people talking about Anubis (GitHub) to block bots.

    Weigh the soul of incoming HTTP requests using proof-of-work to stop AI crawlers.

    In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can’t or won’t use Cloudflare, Anubis is there for you.
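For anyone wondering what "proof-of-work" means here, this is a rough hashcash-style sketch of the general idea, not Anubis's actual code. The function names, the challenge string, and the "leading zero hex digits" difficulty rule are all assumptions made for illustration: the server hands out a challenge, the client has to burn CPU finding a nonce, and the server can check the answer with a single hash.

```python
import hashlib
from itertools import count


def _digest(challenge: str, nonce: int) -> str:
    # Hash the challenge together with the candidate nonce.
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side (hypothetical): brute-force the smallest nonce whose
    SHA-256 hex digest starts with `difficulty` zero digits."""
    prefix = "0" * difficulty
    for nonce in count():
        if _digest(challenge, nonce).startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side (hypothetical): one hash is enough to check the work."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


if __name__ == "__main__":
    challenge = "example-challenge"  # assumed; a real server would randomize this
    nonce = solve(challenge, difficulty=3)
    print(nonce, verify(challenge, nonce, difficulty=3))
```

The asymmetry is the point: solving takes thousands of hashes on average at this difficulty, while verifying takes one. That is cheap for a single human visitor and expensive for a crawler hitting millions of pages.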