The beef between Microsoft and Reddit came to light after I published a story revealing that Reddit is currently blocking every crawler from every search engine except Google, which earlier this year agreed to pay Reddit $60 million a year to scrap the site for its generative AI products.
I know the author meant “scrape”, but sometimes it really does feel like AI is just scrapping the old internet for parts.
cybermass ( @cybermass@lemmy.ca ) 15•8 months agoYeah, aren’t like over half of reddit comments/posts by bots these days?
originalucifer ( @originalucifer@moist.catsweat.com ) 13•8 months agoyep, and the longer that happens the less value to the dataset. it’s becoming aged.
KeriKitty (They(/It)) ( @RiikkaTheIcePrincess@pawb.social ) English13•8 months ago[Joke] See, Reddit’s doing a nice thing here! They’re making sure nobody ends up toxifying their own dataset by using Reddit’s garbage heap of bot posts!
originalucifer ( @originalucifer@moist.catsweat.com ) 5•8 months agogoogle needs a checkbox of ‘ignore reddit’ im sick of having to manually add -reddit
The Cuuuuube ( @Cube6392@beehaw.org ) English13•8 months agoHey good news. Turns out you can use bing and not get back Reddit results
originalucifer ( @originalucifer@moist.catsweat.com ) 3•8 months agoyeah but then i get back bing results. no one needs that
i_am_not_a_robot ( @i_am_not_a_robot@discuss.tchncs.de ) English3•8 months agoThere’s a browser extension for that. It also works on Pinterest and other useless sites. https://iorate.github.io/ublacklist/docs
doctortofu ( @doctortofu@reddthat.com ) 44•8 months agoI can see why spez is upset about scrappers and search engines - imagine a company profiting from people creating lots of data, just hoarding it and using it for free, and not paying those people a cent, preposterous, right? :)
Moonrise2473 ( @Moonrise2473@feddit.it ) 28•8 months agoA search engine can’t pay a website for having the honor of bringing them visits and ad views.
Fuck reddit, get delisted, no problem.
Weird that google is ignoring their robots.txt though.
Even if they pay them for being able to say that glue is perfect on pizza, having
User-agent: *
Disallow: /
should block googlebot too. That means google programmed an exception on googlebot to ignore robots.txt on that domain and that shouldn’t be done. What’s the purpose of that file then?
Because robots.txt works entirely on the honor system (a bot has no need to pretend to be another bot, it could just ignore the file), it should be
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
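To illustrate the difference between the two files above, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It only shows how a compliant parser reads the rules; it says nothing about what Googlebot actually does on Reddit. The URL `https://example.com/r/some-thread` and the helper name `allowed` are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule quoted in the thread: every crawler is disallowed everywhere.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

# The variant suggested in the thread: an empty Disallow for Googlebot
# (meaning "nothing is disallowed"), and a full block for everyone else.
GOOGLEBOT_EXCEPTION = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str,
            url: str = "https://example.com/r/some-thread") -> bool:
    """Return True if a rule-following crawler may fetch `url` (hypothetical URL)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name, rules in [("blanket", BLANKET_BLOCK), ("exception", GOOGLEBOT_EXCEPTION)]:
        for agent in ("Googlebot", "bingbot"):
            print(name, agent, allowed(rules, agent))
```

Under the blanket file a compliant Googlebot is blocked like everyone else; only the second file gives it an explicit pass. Either way, nothing enforces the file: a crawler that chooses to ignore it is not stopped by it.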
MrSoup ( @MrSoup@lemmy.zip ) 28•8 months agoI doubt Google respects any robots.txt
DaGeek247 ( @DaGeek247@fedia.io ) 27•8 months agoMy robots.txt has been respected by every bot that visited it in the past three months. I know this because I wrote a page that IP bans anything that visits it, and I also put it as a not allowed spot in the robots.txt file.
I’ve only gotten like, 20 visits in the past three months though, so, very small sample size.
mozz ( @mozz@mbin.grits.dev ) 14•8 months agoI know this because I wrote a page that IP bans anything that visits it, and I also put it as a not allowed spot in the robots.txt file.
This is fuckin GENIUS
Moonrise2473 ( @Moonrise2473@feddit.it ) 8•8 months agoonly if you don’t want any visits except from yourself, because this removes your site from any search engine
should write a “disallow: /juicy-content” and then block anything that tries to access that page (only bad bots would follow that path)
Miaou ( @Miaou@jlai.lu ) 24•8 months agoThat’s exactly what was described…?
Moonrise2473 ( @Moonrise2473@feddit.it ) 3•8 months agoOops. As a non-native English speaker I misunderstood what he meant. I understood wrongly that he set the server to ban everything that asked for robots.txt
Zoop ( @Zoop@beehaw.org ) 2•8 months agoJust in case it makes you feel any better: I’m a native English speaker who always aced the reading comprehension tests back in school, and I read it the exact same way. Lol! I’m glad I wasn’t the only one. :)
mozz ( @mozz@mbin.grits.dev ) 5•8 months agoYou need to read again the thing that was described, more carefully. Imagine for example that by “a page,” the person means a page called /juicy-content or something.
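The trap discussed in this sub-thread (a path that robots.txt disallows, where any client requesting it anyway gets IP banned) could be sketched roughly like this. It is a toy in-memory model, not DaGeek247's actual setup: the class name `Honeypot`, the status codes, and the in-memory ban set are assumptions, and `/juicy-content` is just the example path from the comments.

```python
# Hypothetical trap path, borrowed from the example in the thread.
TRAP_PATH = "/juicy-content"

# robots.txt served to everyone: well-behaved crawlers will never request the trap.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"


class Honeypot:
    """Toy request filter: ban any IP that requests the disallowed trap path."""

    def __init__(self) -> None:
        self.banned = set()  # IPs that have hit the trap (in memory only)

    def handle(self, ip: str, path: str) -> int:
        """Return an HTTP-style status code for a request from `ip` to `path`."""
        if ip in self.banned:
            return 403  # already banned: refuse everything
        if path == TRAP_PATH:
            self.banned.add(ip)  # ignored robots.txt, so ban from now on
            return 403
        return 200  # normal pages, including /robots.txt itself, stay reachable


if __name__ == "__main__":
    hp = Honeypot()
    print(hp.handle("203.0.113.5", "/robots.txt"))
    print(hp.handle("203.0.113.5", TRAP_PATH))
    print(hp.handle("203.0.113.5", "/"))
```

Note that fetching robots.txt itself is never punished here, which is the misreading cleared up above: only a request to the disallowed path triggers the ban, so search engines that honor the file are unaffected.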
thingsiplay ( @thingsiplay@beehaw.org ) 2•8 months agoInteresting way of testing this. Another would be to search the search engines by adding site:your.domain (Edit: Typo corrected. Of course without the - in -site:, otherwise you will exclude it, not limit to it.) to show results from your site only. Not an exhaustive check, but another tool to test this behavior.
MrSoup ( @MrSoup@lemmy.zip ) 2•8 months agoThank you for sharing
Moonrise2473 ( @Moonrise2473@feddit.it ) 10•8 months agoFor ordinary sites they do respect it, and they even warn a webmaster who submits a sitemap containing paths that are disallowed in robots.txt
tal ( @tal@lemmy.today ) English4•8 months agoI guessed in a previous comment that given their new partnership, Reddit is probably feeding their comment database to Google directly, which reduces load for both of them and permits Google to have real-time updates of the whole kit and caboodle rather than polling individual pages. Both Google and Reddit are better off doing that, and for Google it’d make sense for any site that’s large enough and valuable enough to warrant the effort of special-casing that site.
I know that Reddit built functionality for that before, used it for pushshift.io and I believe bots.
I doubt that Google is actually using Googlebot on Reddit at all today.
I would bet against either Google violating robots.txt or Reddit serving different robots.txt files to different clients (why? It’s just unnecessary complication).
jarfil ( @jarfil@beehaw.org ) 3•8 months agoGoogle is paying for the use of Reddit’s API, not for scraping the site.
That’s Reddit’s new business model: want “their” (users’) content, then pay for API access.
TehPers ( @TehPers@beehaw.org ) English11•8 months agoJoke’s on Reddit. I’ve been blocking their results in the search engine I use for months!
I wonder if this will end up being pursued as an antitrust case. If anything, it’ll reduce traffic to Reddit from non-Google users, so hopefully that kills them off just a little faster.
AVincentInSpace ( @AVincentInSpace@pawb.social ) English10•8 months agoCome on. Be realistic. Chrome has 70% browser market share and people are already used to tacking “Reddit” onto the end of their search queries to find useful information. If anything this will have no effect besides steering people towards Google.
TehPers ( @TehPers@beehaw.org ) English5•8 months agoPeople on Chrome adding Reddit to their Google searches already use Google. People not using Google who don’t search “Reddit” are going to see fewer Reddit results.
No, this won’t kill Reddit, but it certainly isn’t helping them get more traffic.
The Cuuuuube ( @Cube6392@beehaw.org ) English2•8 months agoThey don’t care about traffic. They care about the existing barrel of data for the data models
lemmyvore ( @lemmyvore@feddit.nl ) English2•8 months ago…I thought that was the whole point of Spez blocking other spiders.