AIs Need People: Study Finds AI Models Using AI-Generated Data Degrade Quickly

Insider Brief

  • Researchers report in Nature that AI models trained on data generated by previous AI systems experience rapid degradation, termed “model collapse.”
  • Statistical approximation, functional expressivity, and functional approximation errors compound over generations, leading to loss of data quality.
  • Preserving access to original, human-generated data is crucial to maintain AI model accuracy and prevent long-term degradation.

A recent study published in Nature reveals a concerning phenomenon affecting the future of generative AI models. Researchers have found that AI models trained on data generated by previous versions of AI systems deteriorate rapidly, resulting in what they term “model collapse.”

This process leads to AI models losing the ability to accurately reflect real-world data, thereby reducing their effectiveness and reliability over successive generations.

Key Findings

The study, conducted by a team of researchers from institutions including the University of Oxford and the University of Cambridge, highlights the implications of using AI-generated content in training future AI models. The researchers used successive versions of a large language model (LLM) to analyze the impact of training AI systems on data produced by their predecessors. They observed that this practice leads to irreversible defects in the models, where the original content distribution’s tails disappear, significantly impairing the models’ performance.

The “model collapse” described in the study is a degenerative process in which AI models gradually forget the true underlying data distribution, even when that distribution does not change over time. In practical terms, model collapse means these AIs produce gibberish, or nonsense, when asked to complete tasks they typically excel at, such as producing an image or writing a story.

The researchers demonstrated this effect in various models, including large language models (LLMs), variational autoencoders (VAEs), and Gaussian mixture models (GMMs).
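The tail-loss dynamic can be sketched with a toy simulation. This is a minimal illustration, not the paper’s actual experimental setup: a categorical “data distribution” (the categories, probabilities, sample size, and generation count below are arbitrary choices for the sketch) is repeatedly re-estimated by maximum likelihood from a finite sample of its own output. Rare events tend to vanish within a few generations, because once a category is missed in one sample it can never reappear.

```python
import random
from collections import Counter

random.seed(0)

def fit(samples):
    """Estimate a categorical distribution from finite samples (maximum likelihood)."""
    counts = Counter(samples)
    total = len(samples)
    return {token: c / total for token, c in counts.items()}

def sample(dist, n):
    """Draw n samples from an estimated distribution."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=n)

# Hypothetical "true" distribution: one common event and a few rare (tail) events.
true_dist = {"common": 0.96, "rare_a": 0.02, "rare_b": 0.01, "rare_c": 0.01}

data = sample(true_dist, 200)
for generation in range(10):
    model = fit(data)           # train generation g on the previous generation's output
    data = sample(model, 200)   # that output becomes generation g+1's training data

# After several generations, rare events are likely absent from the learned model.
print(sorted(model))
```

Because each generation can only reproduce categories it actually saw, the support of the learned distribution can shrink but never grow, which is one intuition behind the irreversibility the study describes.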

Mechanisms Behind Model Collapse

The study identifies three main sources of error contributing to model collapse:

  1. Statistical Approximation Error: This occurs due to the finite number of samples available for training. With each resampling step, there’s a non-zero probability of information loss, which compounds over generations.
  2. Functional Expressivity Error: Neural networks, though universal approximators, have limitations when their size is finite. This error arises from the restricted expressiveness of the function approximator, leading to inaccuracies in representing the original data distribution.
  3. Functional Approximation Error: This error stems from the limitations of the learning procedures themselves, such as biases in stochastic gradient descent or choice of objectives. Even with infinite data and perfect expressivity, this error persists.
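The first of these sources, statistical approximation error, shows up even in the simplest possible setting: repeatedly fitting a one-dimensional Gaussian to a finite sample drawn from the previous generation’s fit. The sketch below is illustrative only; the sample size and generation count are arbitrary choices, not parameters from the study. The fitted mean and standard deviation perform a random walk, and the estimated variance tends to shrink over generations, eroding the tails first.

```python
import random
import statistics

random.seed(1)

def fit_gaussian(samples):
    """Maximum-likelihood fit of a 1-D Gaussian: sample mean and population std."""
    return statistics.mean(samples), statistics.pstdev(samples)

n = 100                 # finite sample size per generation
mu, sigma = 0.0, 1.0    # the "true" human-data distribution
stds = []
for generation in range(50):
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    mu, sigma = fit_gaussian(samples)   # the next generation trains on its own output
    stds.append(sigma)

# Estimation noise compounds generation over generation: the fitted parameters
# drift, and in expectation the variance contracts, so tail mass disappears.
print(f"std after generation 1: {stds[0]:.3f}, after generation 50: {stds[-1]:.3f}")
```

With every resampling step there is estimation error that the next generation inherits as ground truth, which is why the study finds these errors compound rather than average out.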

Empirical Evidence and Examples

To illustrate model collapse, the researchers provided examples of text outputs from a progressively degraded language model, showing how the quality of generated text deteriorates over generations. Initially coherent outputs devolve into nonsensical repetitions and increasingly irrelevant content. This degradation is evident in successive outputs, where meaningful context is replaced by garbled phrases and disconnected ideas.

The study also discusses related concepts such as catastrophic forgetting and data poisoning. The researchers argue, however, that these do not fully explain model collapse, which they treat as a distinct phenomenon requiring new theoretical understanding and, eventually, new mitigation methods.

Broader Implications

The researchers stress the importance of maintaining access to original, human-generated data for training AI models. As AI-generated content proliferates online, it poses a risk of contaminating future training datasets. This contamination can lead to widespread degradation in the quality and reliability of AI models.

Moreover, the study suggests that preserving the ability of LLMs to model low-probability events is crucial for ensuring fairness in AI predictions. These low-probability events are often significant for understanding complex systems and are particularly relevant for marginalized groups.

The findings point to the need for community-wide coordination to track and distinguish AI-generated content from human-generated data. Without such measures, the study warns, it may become increasingly challenging to develop new AI models that remain accurate and useful.

Limitations

As with any study, it is worth noting limitations, both as part of the research process and as a way to identify future lines of research. In the case of the model collapse study, the experiments were conducted in controlled environments, which may not fully capture the complexity of the real-world settings where AI models are deployed. The researchers also focused on large language models (LLMs), variational autoencoders (VAEs), and Gaussian mixture models (GMMs), so the findings may not transfer directly to other types of AI models or applications. Finally, the study does not fully address how temporal changes in data, such as evolving language use or new societal trends, might influence model collapse over longer periods.

It is also worth noting that the mitigations the researchers propose, such as maintaining access to human-generated data, may not scale easily as the volume of AI-generated content online continues to grow.

The study was conducted by Ilia Shumailov from the University of Oxford, Zakhar Shumaylov from the University of Cambridge, Yiren Zhao from Imperial College London, Nicolas Papernot from the University of Toronto, Ross Anderson from the University of Cambridge, and Yarin Gal from the University of Oxford.
