AIs Need People: Study Finds AI Models Using AI-Generated Data Degrade Quickly

Insider Brief

  • Researchers report in Nature that AI models trained on data generated by previous AI systems experience rapid degradation, termed “model collapse.”
  • Statistical approximation, functional expressivity, and functional approximation errors compound over generations, leading to loss of data quality.
  • Preserving access to original, human-generated data is crucial to maintain AI model accuracy and prevent long-term degradation.

A recent study published in Nature reveals a concerning phenomenon affecting the future of generative AI models. Researchers have found that AI models trained on data generated by previous versions of AI systems deteriorate rapidly, resulting in what they term “model collapse.”

This process leads to AI models losing the ability to accurately reflect real-world data, thereby reducing their effectiveness and reliability over successive generations.

Key Findings

The study, conducted by a team of researchers from institutions including the University of Oxford and the University of Cambridge, highlights the implications of using AI-generated content to train future AI models. The researchers used successive versions of a large language model (LLM) to analyze the impact of training AI systems on data produced by their predecessors. They observed that this practice leads to irreversible defects: the tails of the original content distribution disappear, significantly impairing the models’ performance.
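To make the recursive setup concrete, here is a minimal sketch in Python that uses a character-level bigram model as a crude stand-in for an LLM. This is not the authors’ pipeline or code, only an illustration of the generational loop: each “model” is trained solely on text generated by its predecessor, and any bigram that fails to appear in one generation’s output can never be produced by the next, a miniature version of the vanishing tails the paper describes.

```python
# Sketch of the recursive "train on your predecessor's output" loop.
# A character-level bigram model stands in for an LLM; this is an
# illustration of the setup, not the study's actual experiment.
import random
from collections import defaultdict

def train(text):
    """Count character-bigram transitions; the counts table is the 'model'."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(model, length=2000, seed="t"):
    """Sample a new string from the bigram table."""
    out = [seed]
    for _ in range(length):
        nxt = model.get(out[-1])
        if not nxt:  # dead end: restart from a random known character
            out.append(random.choice(list(model)))
            continue
        chars, weights = zip(*nxt.items())
        out.append(random.choices(chars, weights=weights)[0])
    return "".join(out)

human_text = "the quick brown fox jumps over the lazy dog " * 50
data = human_text
for gen in range(1, 6):
    model = train(data)     # generation n is trained only on data
    data = generate(model)  # produced by generation n - 1
    survivors = {(a, b) for a, b in zip(data, data[1:])}
    print(f"generation {gen}: {len(survivors)} distinct bigrams survive")
```

Because sampling is finite, rare transitions can drop out of a generation’s output, and once gone they never return in later generations.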

The concept of “model collapse” described in the study refers to a degenerative process in which AI models gradually forget the true underlying data distribution, even when that distribution does not change over time. In practical terms, model collapse means these AIs produce gibberish when asked to complete tasks they typically excel at, such as producing an image or writing a story.

The researchers demonstrated this effect in various models, including large language models (LLMs), variational autoencoders (VAEs), and Gaussian mixture models (GMMs).

Mechanisms Behind Model Collapse

The study identifies three main sources of error contributing to model collapse:

  1. Statistical Approximation Error: This occurs due to the finite number of samples available for training. With each resampling step, there is a non-zero probability of information loss, which compounds over generations (see the sketch after this list).
  2. Functional Expressivity Error: Neural networks, though universal approximators, have limitations when their size is finite. This error arises from the restricted expressiveness of the function approximator, leading to inaccuracies in representing the original data distribution.
  3. Functional Approximation Error: This error stems from the limitations of the learning procedures themselves, such as biases in stochastic gradient descent or choice of objectives. Even with infinite data and perfect expressivity, this error persists.
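The first of these errors is the easiest to see in isolation. The toy simulation below (an illustration, not one of the paper’s experiments) repeatedly fits a Gaussian to a finite sample drawn from the previous generation’s fit; because each fit sees only what its predecessor happened to generate, the estimated parameters wander, and over many generations the fitted distribution tends to drift away from the original one.

```python
# Toy illustration of statistical approximation error (not the paper's
# experiment): each generation refits a Gaussian to a finite sample drawn
# from the previous generation's fit, so sampling noise compounds.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0      # original "human" data distribution: N(0, 1)
n_samples = 100           # finite training set per generation

for gen in range(1, 31):
    sample = rng.normal(mu, sigma, size=n_samples)  # data from the previous model
    mu, sigma = sample.mean(), sample.std()         # the next model is this refit
    if gen % 10 == 0:
        print(f"generation {gen:2d}: mean = {mu:+.3f}, std = {sigma:.3f}")
```

With only 100 samples per generation, the refit parameters follow a random walk rather than staying at the original values of 0 and 1, which is the compounding information loss described in the first error source.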

Empirical Evidence and Examples

To illustrate model collapse, the researchers provided examples of text outputs from a progressively degraded language model, showing how the quality of generated text deteriorates over generations. Initially coherent outputs devolve into nonsensical repetitions and increasingly irrelevant content. This degradation is evident in successive outputs, where meaningful context is replaced by garbled phrases and disconnected ideas.

The study also discusses related concepts such as catastrophic forgetting and data poisoning. However, these do not fully explain model collapse, which the researchers argue is a distinct phenomenon that requires new approaches to understand and, eventually, to mitigate.

Broader Implications

The researchers stress the importance of maintaining access to original, human-generated data for training AI models. As AI-generated content proliferates online, it poses a risk of contaminating future training datasets. This contamination can lead to widespread degradation in the quality and reliability of AI models.

Moreover, the study suggests that preserving the ability of LLMs to model low-probability events is crucial for ensuring fairness in AI predictions. These low-probability events are often significant for understanding complex systems and are particularly relevant for marginalized groups.

The findings point to the need for community-wide coordination to track and distinguish AI-generated content from human-generated data. Without such measures, the study warns, it may become increasingly challenging to develop new AI models that remain accurate and useful.

Limitations?

It is always worth pointing out a study’s limitations, both as part of the research process and as a way to identify future lines of research. In the case of the model collapse study, the experiments were conducted in controlled environments, which may not fully capture the complexity of the real-world settings where AI models are deployed. The researchers also focused on LLMs, VAEs and GMMs, which means the findings may not directly apply to other types of AI models or applications. Nor does the study fully address how temporal changes in data, such as evolving language usage or new societal trends, might influence model collapse over longer periods.

It is also worth noting that the mitigations the researchers propose, such as maintaining access to human-generated data, may not scale easily as the volume of AI-generated content increases exponentially.

The study was conducted by Ilia Shumailov from the University of Oxford, Zakhar Shumaylov from the University of Cambridge, Yiren Zhao from Imperial College London, Nicolas Papernot from the University of Toronto, Ross Anderson from the University of Cambridge, and Yarin Gal from the University of Oxford.