There is plenty of free data available to train an English language model. Once you need a multilingual chatbot, however, the internet needs to be scraped.
I’m genuinely surprised that megacorps like Disney and the music labels haven’t pounced on OpenAI like they did with, say, Napster. Not saying they’d win, but I’m surprised they haven’t tried.
I don’t think OpenAI have broken any copyright laws, since they don’t store any copyrighted material in the model as such. Therefore, they don’t distribute any copyrighted material. The LLMs may, however, come up with something very similar to the copyrighted material that inspired it. Google may have broken copyright law by distributing the dataset. How that affects OpenAI’s ability to continue to use that data will be interesting to see, but I don’t think it’s going to break the current copyright model. If it’s illegal to learn from copyrighted material and make works inspired by it, everyone is guilty.
After all this piracy, isn’t it ironic that it’s the megacorps that ultimately break our current copyright model?