Source: https://front-end.social/@fox/110846484782705013

Text in the screenshot from Grammarly says:

We develop data sets to train our algorithms so that we can improve the services we provide to customers like you. We have devoted significant time and resources to developing methods to ensure that these data sets are anonymized and de-identified.

To develop these data sets, we sample snippets of text at random, disassociate them from a user’s account, and then use a variety of different methods to strip the text of identifying information (such as identifiers, contact details, addresses, etc.). Only then do we use the snippets to train our algorithms, and the original text is deleted. In other words, we don’t store any text in a manner that can be associated with your account or used to identify you or anyone else.

We currently offer a feature that permits customers to opt out of this use for Grammarly Business teams of 500 users or more. Please let me know if you might be interested in a license of this size, and I’ll forward your request to the corresponding team.
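
The steps the statement describes (sample snippets at random, detach them from the account, strip identifying information) can be sketched roughly as below. This is only an illustration under stated assumptions, not Grammarly's actual pipeline: the function names, the `(account_id, text)` input shape, and the two regular expressions are all hypothetical, and real de-identification would need far more than a pair of patterns.

```python
import random
import re

# Hypothetical patterns for illustration only; they catch simple e-mail
# addresses and phone-like digit runs, nothing more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def strip_identifiers(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def build_training_snippets(documents, sample_size, seed=None):
    """Sample texts at random, drop the account id, and scrub each text.

    `documents` is assumed to be a list of (account_id, text) pairs.
    """
    rng = random.Random(seed)
    sampled = rng.sample(documents, min(sample_size, len(documents)))
    # Only the scrubbed text is returned; the account id is discarded here.
    return [strip_identifiers(text) for _account_id, text in sampled]
```

Note that a sketch like this only removes what its patterns recognize; names, addresses, or distinctive phrasing inside the text would pass through untouched.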

  • Models need vast amounts of data. Paying individual users isn’t feasible, and like you said, most of it can be scraped.

    The only way I see this working is if scraped content is a no-go and you pay the website, publishing house, record company, etc., which kills any open source solution and doesn’t really help any of the users or creators that much. It also paves the way for certain companies owning a lot of our economy as we move towards an AI-driven society.

    It’s definitely a hot mess but the way I see it, the more restrictive we are with it, the more gross monopolies we create for no real gains.

    • I mean they’re not even giving credit or asking permission, which both cost nothing. Make a site where people can volunteer their own work, program the ai to generate a list of citations of all the works it used data from when it provides output (I know that this might be lengthy, that’s fine), if you implement it into any sites or software make it so that people can opt out of having their data used, etc. It’s not that hard.

      • Most of the data is scraped, so it’s not up to the website. You can’t give a list of citations since it isn’t a search engine: it doesn’t know where the information comes from, and it’s highly transformative, melding information from hundreds if not thousands of different sources.

        If it worked only with volunteer work, there would simply be not enough data.

        Any law restricting data use in AI is only going to benefit corporations; there isn’t a solution for individual content creators. You can’t pay them for the drop in the bucket they add, the logistics are insane. You can let them opt out, but then you need to do the same for whole websites, which leads to a corporate hellscape where three companies own our whole economy since they are the only ones who can train AIs.

        • “Most of the data is scraped, it’s not up to the website.”

          It is up to whoever runs the ai, and those are the people I’m addressing for the most part, though plenty of websites do have control over what data is fed to the ai they’re using. In Grammarly’s case it’s absolutely up to them what data is used and whether there’s an option provided to opt out of having your work used for training the ai, as shown by the fact that they offer it to the business license. They just choose not to offer that option to other users.

          “You can’t give a list of citations since it isn’t a search engine, it doesn’t know where the information comes from and it’s highly transformative, it melds information from hundreds if not thousands of different sources.”

          It’s all code, the people coding it are 100% capable of programming it to keep track of where the information comes from. Even if it’s transformative, that doesn’t prevent it from keeping track of what was transformed. I’m aware that the number of citations would be extensive, I’m fine with that.

          “If it worked only with volunteer work, there would simply be not enough data.”

          According to who? There are plenty of ways to get data from voluntary sources just like we get for any number of studies. It’s just up to the one who runs the ai to put in the legwork to get enough data that way, and there are lots of methods. You don’t have to just sit and wait for people to come to you and sign up, though based on the ai frenzy I bet they could have gotten plenty of data that way from people who are curious and want to contribute to ai training as a novel concept. Making ai data gathering on websites something people can opt in to or out of is just one way of making it more ethical than forcibly taking that data without permission.

          “Any law restricting data use in AI is only going to benefit corporations”

          I fail to see how requiring permission and offering the option to opt out of having your data used would benefit corporations. That just sounds like an excuse to not even try to regulate them.

          “You can let them opt out, but then you need to do the same for whole websites which leads to a corporate hellscape where three companies own our whole economy since they are the only ones who can train ais.”

          I don’t understand how part A leads to part B here. Why would those corporations have an advantage just because everyone with ais, including them, has to offer the option to opt out? Also, it’s entirely possible to restrict the scope of an ai or regulate ai monopolies alongside regulating stuff like basic consent. Historically a lack of regulation is what causes corporate hellscapes, because without something keeping them in check the larger companies will take advantage of their reach to do whatever they want on a larger scale, pushing out or merging with competitors. It’s not like requiring permission and providing opt-out would give them more of an advantage than they already have.

          • It depends on what kind of AI, but no, giving sources and building with just volunteer data is just not possible at our current technological level. I’m mostly talking about large LLMs because that’s what’s really at stake, and they train on huge amounts of data. Like ALL of Stack Overflow, GitHub, Reddit, etc. Just fine-tuning them on a consumer level takes more than 50,000 question-and-answer pairs, and that’s just one tiny superficial layer that’s added on top.

            Grammarly should absolutely add an opt-out option to gain consumers’ trust, but forcing the whole industry to do so is a disaster.

            If individuals can opt out, so will websites to “protect their users”. Then we get data hoarding, where Stack Overflow and GitHub opt out of all open source options but sell the data to the only ones that can now afford to build AIs, Microsoft and Google. It won’t include data of certain individuals, the few that opt out, but I’m guessing eventually the opt-in will be written directly into the terms of service of websites: you opt in or you fuck off.

            How does anyone except corporations benefit from this kind of circus? In 10 years, AI will be doing most office work. Google isn’t dumb and wants that profit. They and OpenAI have all the data, and they can strong-arm or buy what they are missing. Restricting and legislating only widens their moat.

    • I don’t see why those are the only two options.

      We could update GPL, CC, etc. licensing so that it specifies whether the author intends to allow their work to be used for LLM training. And you could still put a non-commercial or share-alike constraint on it.

      Hooray, open source is saved while greedy grubby hands are thwarted.

      • What happens when every corporation and website closes their doors to AI? There isn’t any open source if we can’t use scraped information from Stack Overflow, GitHub, Reddit, etc.

        Sure, some users will opt out, but most won’t. Every single website will restrict, though, and then they will sell it to Google and Microsoft, who will be the only companies able to build AIs.

        • If I could predict what happens to the tech market when XYZ policy is enacted, I wouldn’t be posting on Lemmy during my tea breaks. Whatever policies end up sticking around, success is gonna require a lot of us having ideas, trying them out, and recombining them.

          But I’ll claim this about my personal metric of “success”: If the future of open source looks like copying the extractive data-mining model of big tech and hoping we can shove the entire history of human thought into a blender faster than them, I think we’ve failed.