•  The Doctor   ( @drwho@beehaw.org ) 
    link
    fedilink
    English
    311 months ago

    Very early on, at least, their spiders respected robots.txt.

    I know there are folks that have all of the Big G in their robots.txt files on principle, might want to ask them if it works or not.

    • I do and I can confirm there are no requests (except for robots.txt and the odd /favicon.ico). Google sorta respects robots.txt. They do have a weird gotcha though: they still put the URLs in search, they just appear with an useless description. Their suggestion to avoid that can be summarized as: don’t block us, let us crawl and just tell us not to use the result, just trust us! when they could very easily change that behavior to make more sense. Not a single damn person with Google blocked in robots.txt wants to be indexed, and their logic on password protecting kind of makes sense but my concern isn’t security, it’s that I don’t like them (or Bing or Yandex).

      Another gotcha I’ve seen linked is that their ad targeting bot for Google AdSense (different crawler) doesn’t respect a * exclusion, but that kind of makes sense since it will only ever visit your site if you place AdSense ads on it.

      And I suppose they’ll train Bard on all data they scraped because of course. Probably no way to opt out of that without opting out of Google Search as well.