Using model-generated content in training causes irreversible defects, a team of researchers says. “The tails of the original content distribution disappear,” writes co-author Ross Anderson from the University of Cambridge in a blog post. “Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions.”

Here’s the study: http://web.archive.org/web/20230614184632/https://arxiv.org/abs/2305.17493
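
For anyone who wants to see the "Gaussians converge and may even become delta functions" effect in a toy setting, here is a rough sketch. It is my own illustration, not code from the paper: it repeatedly fits a one-dimensional Gaussian to a small sample drawn from the previous generation's fit, so each generation is "trained" only on the previous generation's output. The function name `simulate_collapse` and the parameters `generations`, `sample_size`, and `seed` are made up for this example.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Toy model: each generation fits a Gaussian to samples from the last one.

    Returns the fitted standard deviation per generation, starting with the
    "real data" distribution N(0, 1). This is an assumed simplification, not
    the experimental setup used in the study.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the original distribution
    history = [sigma]
    for _ in range(generations):
        # "Generate content" from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train" the next model on that generated content only.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    print("initial std:", history[0])
    print("final std:  ", history[-1])
```

With a small sample per generation, the fitted spread tends to shrink toward zero because every refit loses a little of the tails and nothing ever puts them back. That is the "delta function" end state in miniature. Real language models are far more complicated, so treat this only as intuition for the quoted claim.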

  • So the only Reddit data that’s really valuable to AI companies is the stuff that’s already been archived, and the stuff that will be gated behind their paid API is less useful now that bots are becoming prevalent?

    Good move, Reddit.