Google Says It'll Scrape Everything You Post Online for AI

misk ( @misk@lemm.ee ) · 2 years ago

Pete Hahnloser ( @Powderhorn@beehaw.org ) · 2 years ago

People who are alive can have a company steal their entire corpus without recompense, while the descendants of people who died decades ago can still get paid for content created by their ancestors.

Right.

Peanut ( @Peanutbjelly@sopuli.xyz ) · 2 years ago

But how else could Disney afford to own everyone else’s rights and properties? Why not think about the little guy! (Mickey Mouse is little, right?)

That being said, I find it weird that people are going after training data for LLMs after completely ignoring the models built specifically to compete with and take advantage of people’s unconscious habits and lifestyles.

AI in general will be very important to comfortably survive the near future as a species. Data is an important part of that.

We absolutely need to do something about the megacorps funneling every new gain as a society into increasing the already absurd wealth divide. The technology is good. The general web scraping isn’t bad if the tool is not specifically evil in function. As a global community, we just need to demand that the technology be used to benefit everyone equally as it continues to be developed.

alcasa ( @alcasa@lemmy.sdf.org ) · 2 years ago

Glad that I can contribute to making the next Google Bard even dumber

Zapp ( @Zapp@beehaw.org ) · edit-2 2 years ago

Yeah. Now the stupidity I post online has a purpose.

Someday a T-800 will be closing in on a freedom fighter, but will have an intrusive thought interrupt it at a key vulnerable moment. And that intrusive thought will be some random pun we posted to DadJokes. You’re welcome, future freedom fighters.

SmallAlmond ( @SmallAlmond@lemmy.dbzer0.com ) · 2 years ago

They have probably been doing this for ages

mayooooo ( @MayonnaiseArch@beehaw.org ) · 2 years ago

Exactly, they don’t give a fuck. Counting on being too big for anyone to handle

Rentlar ( @Rentlar@beehaw.org ) · 2 years ago

I, as the proprietor of my comments, condone Google AI scraping my publicly shared content for their own use, on the condition that they condone scraping of their publicly accessible content including YouTube videos. :P

Thomas Gray ( @deCorp0@lemmy.dbzer0.com ) · 2 years ago

Google is going to continue boiling the frog until everyone using gmail, YT, drive, etc… is paying subscriptions for access to these services. It’s going to be interesting to see how much people are willing to pay to hold on to a gmail account they’ve been using for 20 years. I should buy Alphabet stock now.

CreativeTensors ( @CreativeTensors@beehaw.org ) · 2 years ago

I just kind of assumed that they, as well as anyone else in the space, were doing that already.

Whether that means that we all collectively have ownership over the outputs of these models if they’re trained on content that we produced over the years is another thing. As someone who uses AI tools a fair bit I would be totally fine with generated content being public domain unless a threshold for human intervention is met.

That threshold is where the messy legal work lies.

YuzuDrink ( @YuzuDrink@beehaw.org ) · 2 years ago

Would maybe be funny if a law were passed saying that you could only charge people for access to your AI content if you can prove that their own content wasn’t used to help train the AI…

MagicShel ( @MagicShel@programming.dev ) · 2 years ago

I agree with this. Human knowledge grows on the shoulders of others, and should collectively belong to all of us.

millie ( @millie@beehaw.org ) · 2 years ago

Crazy that Google feeds on all our data and has for years, but when OpenAI puts the benefit of that data back into the hands of users it catches flak.

Rentlar ( @Rentlar@beehaw.org ) · 2 years ago

Perhaps we lived in blissful ignorance all this time. Before AI language models became what they are today, Google Translate was where most of the data was going, and it was mainly about getting an adequate translation. Now it’s being used to answer questions on all different subjects using parts of real people’s answers, which could be more frightening to people.

shanghaibebop ( @shanghaibebop@beehaw.org ) · edit-2 2 years ago

I think it’s a problem of value capture.

People had no problem posting on reddit and wasting tons of hours helping strangers solve their problems. But now that reddit puts that information behind a paywall, people will have massive issues with that.

Similarly, Google scraped data, but didn’t APPEAR (and I can’t emphasize that enough) to use that data to deliver value that cannot be shared by the people who created that data. Most of the time your value is aligned so that you give up your “data” to Google so that Google can either provide you with better traffic through its search engine, or better ads to generate revenue for you.

OpenAI does not benefit the original publisher of that information whatsoever.

millie ( @millie@beehaw.org ) · 2 years ago

I don’t know about that. When’s the last time you looked something up on Google and the first link was driving traffic to a website rather than scraping one and presenting it in-engine?

jcarax ( @jcarax@beehaw.org ) · 2 years ago

I choose to take this as an admission that they should be paying into a global UBI fund.

AndrewZabar ( @AndrewZabar@beehaw.org ) · 2 years ago

Google does what Google wants. Lawsuits are the only remedy to any of their indulgent transgressions. And not everyone can sue.

Years ago I had to have a lawyer file a motion in court in order to get Google to erase private medical documents they had inadvertently gotten access to and then cached. It’s one thing to index everything, and another if they temporarily have access to restricted data because of a security lapse. But to COPY data as cache is something that should be absolutely illegal.

But as I said, Google does what Google wants.

trekz ( @trekz@beehaw.org ) · edit-2 2 years ago

Is this even new though? Google has always had a stronghold over any public data on the internet. It’s a search engine 😄. Its sole purpose is to scrape and store everything it possibly can on the web.

that_one_guy ( @that_one_guy@beehaw.org ) · 2 years ago

It’s not

Previously, Google said the data would be used “for language models,” rather than “AI models,” and where the older policy just mentioned Google Translate, Bard and Cloud AI now make an appearance.

This is mainly just an update to more modern terms; it doesn’t really seem like they’re adding anything new to their policies.

Lost_Wanderer ( @Lost_Wanderer@beehaw.org ) · 2 years ago

Just feels icky knowing that.