Oxford University scholars have issued a warning about the potential pitfalls of using synthetic data to train generative artificial intelligence (gen AI) models. In a recent publication in the science journal Nature, lead author Ilia Shumailov and his team describe a phenomenon called "model collapse," which can degrade the accuracy of AI models to the point of rendering them useless.
The researchers ran experiments with Meta's open-source OPT-125m language model and found that feeding the output of large language models back into the training data of successive models triggers a degenerative process. This process, which they call model collapse, occurs when model-generated data pollutes the training set of the next generation, causing each new model to misperceive the true data distribution.
The study shows that as models are trained on this polluted data over successive generations, they progressively forget less-common facts, the "tails" of the original distribution, and their output becomes more generic. Eventually, the answers they produce bear little relation to the questions asked, degenerating into gibberish.
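The dynamic can be illustrated with a minimal toy simulation (an assumed stand-in, not the paper's actual experiment): each "model" is simply a Gaussian fitted to data produced by the previous generation, and the tendency of generative models to favor their own high-probability outputs is approximated by discarding the most extreme samples each round. The distribution narrows generation after generation, and rare "tail" events vanish.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: the "human-written" corpus, drawn from a wide distribution.
data = rng.normal(loc=0.0, scale=1.0, size=5_000)

for gen in range(1, 11):
    # "Train" this generation's model: here, just fit a Gaussian to the data.
    mu, sigma = data.mean(), data.std()

    # The model generates the next generation's training data from itself.
    samples = rng.normal(mu, sigma, size=5_000)

    # Mimic the bias toward typical, high-probability outputs by keeping
    # only the central 95% of what the model produces.
    lo, hi = np.quantile(samples, [0.025, 0.975])
    data = samples[(samples >= lo) & (samples <= hi)]

    # Measure how much of the ORIGINAL distribution's tail (|x| > 2) survives.
    tail_frac = np.mean(np.abs(data) > 2.0)
    print(f"gen {gen:2d}: std={data.std():.3f}  tail fraction={tail_frac:.4f}")
```

Run repeatedly, the spread of the data shrinks each generation and the tail fraction collapses toward zero, a simplified analogue of how recursive training on synthetic data erases rare information.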
The authors emphasize the need to take these findings seriously: as large language models are increasingly used to publish content on the internet, that content will further pollute the training data available to future models. They caution that the internet may become flooded with AI-generated output, making it difficult to train newer models without access to data collected before gen AI's mass adoption, or to fresh human-generated data at scale.
The implications of model collapse extend beyond a gradual loss of model accuracy. The authors argue that growing reliance on synthetic gen AI output could effectively turn the pre-AI internet into a lost archive, making it harder to preserve and access original human-created training data.
The warning from the Oxford researchers is a reminder that the quality of training data is crucial to building reliable and accurate AI models. As the field of gen AI continues to evolve, researchers and developers must address the challenges posed by synthetic data and ensure that high-quality, human-generated training data remains available.