• There have been ongoing suspicions that pirated content was used to train popular LLMs, simply because popular datasets used for training LLMs do include such content. The Washington Post published an article about it.

    Google’s C4 dataset, used for research, included illegal websites. What remains to be seen is whether it was cleaned up before training Bard as we know it today. OpenAI has revealed nothing about its dataset.