I half-agree.
I do think that companies should clarify how they’re training their models and on what datasets. For one thing, this will allow outside researchers to gauge the risks of particular models. (For example, is this AI trained on “the whole Internet,” including unfiltered hate-group safe-havens? Does the training procedure adequately compensate for the bias that the model might learn from those sources?)
However, knowing that a model was trained on copyrighted sources is not enough to prevent the model from reproducing copyrighted material.
There’s no good way to sidestep the issue, either. The amount of text that is verifiably in the public domain is relatively small. It’s probably not enough to train a large language model on, and even if it were, the resulting model probably wouldn’t be a very useful one by 2023 standards.