very upsetting

Arthur Besse ( @cypherpunks@lemmy.ml ) · edit-2 9 months ago

Rivalarrival ( @Rivalarrival@lemmy.today ) · 9 months ago

Copyright protects against creating and distributing copies. Copyright does not protect against reading and understanding a work.

What LLMs and other models are doing is analogous to reading a book and writing a book report. They are not regurgitating a copy of the book to users. They are not creating or distributing a copy.

The purpose of copyright law is to promote the progress of Science and the Useful Arts. The purpose is to expand the depth and breadth of human knowledge and technology. “Fair Use” is not an exception: “Fair Use” is the purpose. “Copyright” is the exception.

If technology is fundamentally incompatible with copyright law, that technology has the right-of-way, and copyright must yield.

OmnipotentEntity ( @OmnipotentEntity@beehaw.org ) · 9 months ago

What LLMs and other models are doing is analogous to reading a book and writing a book report.

It is purported to be analogous to that. But given that in practice it can also reproduce nearly entire articles word for word from a short prompt, it’s clear that the analogy you are drawing is flawed. Encoded in the weights and biases of the network is that article and many others: it has been copied into the network, encoded, and can be referenced.

The Pile is 825GiB of text. ChatGPT-4 is about 400 billion parameters, and each of those parameters is 2 bytes, which is 800GB (roughly 745GiB) of data. There’s certainly enough redundancy in whatever corpus they’re using to just memorize the entire thing and still have sufficient network space left over to actually make some sense of it.
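
For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. It takes the 400 billion parameter count and the 2 bytes per parameter from the comment as given (neither is a confirmed figure, and the function name is just made up for illustration). It also keeps decimal gigabytes (GB) separate from binary gibibytes (GiB), since the two differ by about 7% at this scale.

```python
# Back-of-the-envelope comparison of model size vs. corpus size.
# ASSUMPTIONS (taken from the comment, not confirmed figures):
#   - 400 billion parameters
#   - 2 bytes per parameter (e.g. 16-bit weights)
#   - The Pile is 825 GiB of text

GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte


def model_size_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Raw storage needed for the weights alone (hypothetical helper)."""
    return n_params * bytes_per_param


pile_bytes = 825 * GIB
model_bytes = model_size_bytes(400 * 10**9)

model_gb = model_bytes / GB        # size in decimal gigabytes
model_gib = model_bytes / GIB      # size in binary gibibytes
ratio = model_bytes / pile_bytes   # model size relative to the corpus

print(f"model: {model_gb:.0f} GB = {model_gib:.1f} GiB")
print(f"corpus: {pile_bytes / GIB:.0f} GiB")
print(f"model / corpus: {ratio:.2f}")
```

Under those assumptions the weights come out at around 90% of the raw size of the corpus, so the two are the same order of magnitude either way. This says nothing by itself about how much text is actually memorized; it only shows the raw capacity is comparable.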