AIs Need People: Study Finds AI Models Using AI-Generated Data Degrade Quickly

Insider Brief

  • Researchers report in Nature that AI models trained on data generated by previous AI systems experience rapid degradation, termed “model collapse.”
  • Statistical approximation, functional expressivity, and functional approximation errors compound over generations, leading to loss of data quality.
  • Preserving access to original, human-generated data is crucial to maintain AI model accuracy and prevent long-term degradation.

A recent study published in Nature reveals a concerning phenomenon affecting the future of generative AI models. Researchers have found that AI models trained on data generated by previous versions of AI systems deteriorate rapidly, resulting in what they term “model collapse.”

This process leads to AI models losing the ability to accurately reflect real-world data, thereby reducing their effectiveness and reliability over successive generations.

Key Findings

The study, conducted by a team of researchers from institutions including the University of Oxford and the University of Cambridge, highlights the implications of using AI-generated content in training future AI models. The researchers used successive versions of a large language model (LLM) to analyze the impact of training AI systems on data produced by their predecessors. They observed that this practice leads to irreversible defects in the models, where the original content distribution’s tails disappear, significantly impairing the models’ performance.

The “model collapse” described in the study is a degenerative process in which AI models gradually forget the true underlying data distribution, even when that distribution does not change over time. In practical terms, model collapse means these AIs produce gibberish, or nonsense, when asked to complete tasks they typically excel at, such as producing an image or writing a story.

The researchers demonstrated this effect in various models, including large language models (LLMs), variational autoencoders (VAEs), and Gaussian mixture models (GMMs).
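The tail-loss dynamic can be sketched with a toy simulation. This is a minimal illustration, not the paper’s actual experimental setup: a categorical “data distribution” (the categories, probabilities, sample size, and generation count below are arbitrary choices for the sketch) is repeatedly re-estimated by maximum likelihood from a finite sample of its own output. Rare events tend to vanish within a few generations, because once a category is missed in one sample it can never reappear.

```python
import random
from collections import Counter

random.seed(0)

def fit(samples):
    """Estimate a categorical distribution from finite samples (maximum likelihood)."""
    counts = Counter(samples)
    total = len(samples)
    return {token: c / total for token, c in counts.items()}

def sample(dist, n):
    """Draw n samples from an estimated distribution."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=n)

# Hypothetical "true" distribution: one common event and a few rare (tail) events.
true_dist = {"common": 0.96, "rare_a": 0.02, "rare_b": 0.01, "rare_c": 0.01}

data = sample(true_dist, 200)
for generation in range(10):
    model = fit(data)           # train generation g on the previous generation's output
    data = sample(model, 200)   # that output becomes generation g+1's training data

# After several generations, rare events are likely absent from the learned model.
print(sorted(model))
```

Because each generation can only reproduce categories it actually saw, the support of the learned distribution can shrink but never grow, which is one intuition behind the irreversibility the study describes.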

Mechanisms Behind Model Collapse

The study identifies three main sources of error contributing to model collapse:

  1. Statistical Approximation Error: This occurs due to the finite number of samples available for training. With each resampling step, there’s a non-zero probability of information loss, which compounds over generations.
  2. Functional Expressivity Error: Neural networks, though universal approximators, have limitations when their size is finite. This error arises from the restricted expressiveness of the function approximator, leading to inaccuracies in representing the original data distribution.
  3. Functional Approximation Error: This error stems from the limitations of the learning procedures themselves, such as biases in stochastic gradient descent or choice of objectives. Even with infinite data and perfect expressivity, this error persists.
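The first of these sources, statistical approximation error, shows up even in the simplest possible setting: repeatedly fitting a one-dimensional Gaussian to a finite sample drawn from the previous generation’s fit. The sketch below is illustrative only; the sample size and generation count are arbitrary choices, not parameters from the study. The fitted mean and standard deviation perform a random walk, and the estimated variance tends to shrink over generations, eroding the tails first.

```python
import random
import statistics

random.seed(1)

def fit_gaussian(samples):
    """Maximum-likelihood fit of a 1-D Gaussian: sample mean and population std."""
    return statistics.mean(samples), statistics.pstdev(samples)

n = 100                 # finite sample size per generation
mu, sigma = 0.0, 1.0    # the "true" human-data distribution
stds = []
for generation in range(50):
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    mu, sigma = fit_gaussian(samples)   # the next generation trains on its own output
    stds.append(sigma)

# Estimation noise compounds generation over generation: the fitted parameters
# drift, and in expectation the variance contracts, so tail mass disappears.
print(f"std after generation 1: {stds[0]:.3f}, after generation 50: {stds[-1]:.3f}")
```

With every resampling step there is estimation error that the next generation inherits as ground truth, which is why the study finds these errors compound rather than average out.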

Empirical Evidence and Examples

To illustrate model collapse, the researchers provided examples of text outputs from a progressively degraded language model, showing how the quality of generated text deteriorates over generations. Initially coherent outputs devolve into nonsensical repetitions and increasingly irrelevant content. This degradation is evident in successive outputs, where meaningful context is replaced by garbled phrases and disconnected ideas.

The study also discusses related concepts such as catastrophic forgetting and data poisoning. The researchers argue, however, that these do not fully explain model collapse, which they treat as a distinct phenomenon requiring new theoretical understanding and, eventually, new mitigation methods.

Broader Implications

The researchers stress the importance of maintaining access to original, human-generated data for training AI models. As AI-generated content proliferates online, it poses a risk of contaminating future training datasets. This contamination can lead to widespread degradation in the quality and reliability of AI models.

Moreover, the study suggests that preserving the ability of LLMs to model low-probability events is crucial for ensuring fairness in AI predictions. These low-probability events are often significant for understanding complex systems and are particularly relevant for marginalized groups.

The findings point to the need for community-wide coordination to track and distinguish AI-generated content from human-generated data. Without such measures, the study warns, it may become increasingly challenging to develop new AI models that remain accurate and useful.

Limitations

As with any study, it is worth noting limitations, both as part of the research process and as a way to identify future lines of research. In the case of the model collapse study, the experiments were conducted in controlled environments, which may not fully capture the complexity of the real-world settings where AI models are deployed. The researchers also focused on large language models (LLMs), variational autoencoders (VAEs), and Gaussian mixture models (GMMs), so the findings may not transfer directly to other types of AI models or applications. Finally, the study does not fully address how temporal changes in data, such as evolving language use or new societal trends, might influence model collapse over longer periods.

It is also worth noting that the mitigations the researchers propose, such as maintaining access to human-generated data, may not scale easily as the volume of AI-generated content online continues to grow.

The study was conducted by Ilia Shumailov from the University of Oxford, Zakhar Shumaylov from the University of Cambridge, Yiren Zhao from Imperial College London, Nicolas Papernot from the University of Toronto, Ross Anderson from the University of Cambridge, and Yarin Gal from the University of Oxford.
