> Since we have been talking about AIs recursively self-improving on the lists, I thought this was a pertinent article that could affect the timeline to AGI. There is a phenomenon called AI model collapse, which occurs when AIs are trained on their own output. This produces an echo-chamber effect that reinforces hallucinations, biases, and misinformation, resulting in degraded output quality. Since much of the output of AI now gets posted on the Internet, and new AI models are then trained on Internet data, their training data becomes contaminated with AI output, which can lead to model collapse. This is the informatic equivalent of biological inbreeding, in which deleterious mutations get amplified and reinforced in a genetic lineage, resulting in all sorts of pathologies.
https://www.nature.com/articles/s41586-024-07566-y
Recent Articles Contradicting Risks of Synthetic Data in AI Models
Recent work shows synthetic text can help rather than irreparably harm models. Chen et al. (2024) introduce a diversity metric for synthetic corpora and find that richer, more varied synthetic text correlates with higher LLM performance in both pretraining and fine-tuning. Gan and Liu (2024) develop a theoretical framework linking synthetic data to information gain: they prove that when the synthetic generator provides sufficient new information, model generalization improves. These analyses imply that diverse, informative synthetic text can augment training data positively rather than guarantee collapse.
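Chen et al.'s actual metric is not reproduced here, but the idea of scoring a synthetic corpus for diversity can be sketched with a simple distinct-n proxy (a hypothetical stand-in, using only the standard library): the fraction of n-grams in the corpus that are unique.

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Fraction of n-grams that are unique across a corpus.

    A crude diversity proxy: 1.0 means every n-gram appears exactly once;
    values near 0 indicate heavy repetition (low diversity).
    """
    ngrams = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    if not ngrams:
        return 0.0
    return len(ngrams) / sum(ngrams.values())

varied = ["the cat sat on the mat",
          "a dog ran in the park",
          "birds fly south in winter"]
repetitive = ["the cat sat on the mat"] * 3

print(distinct_n(varied))      # higher: little n-gram overlap between texts
print(distinct_n(repetitive))  # lower: the same bigrams repeat three times
```

A real pipeline would use embedding-based or cluster-based diversity, but even this proxy captures the intuition that degenerate synthetic corpora collapse onto a small set of repeated patterns.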
Empirical studies echo this optimism. Zhang & Pavlick (2025) fine-tuned a Llama-2 model on synthetic NLI examples and tested on a challenging “blindspot” set (HANS). They report that adding synthetic data did not worsen the model’s heuristic biases: the performance gap on HANS remained unchanged by synthetic augmentation. In another example, Zhezherau & Yanockin (2024) create a domain-specific dialog dataset by mixing real counseling transcripts with AI-generated synthetic sessions. Their “hybrid” LLM outperformed both the real-only and synthetic-only models across all metrics.
In short, these results show that targeted synthetic text (especially when blended with real data) can improve robustness and task accuracy, countering the claim that degradation is inevitable.
Still, uncontrolled self-training on only AI-generated text can degrade models, but even here remedies exist. Zhu et al. (2024) confirm that naively pretraining on high proportions of synthetic text leads to a performance drop, but they propose a token-editing method (ToEdit) that theoretically bounds the error and empirically prevents collapse. Crucially, Gerstgrasser et al. (2024) show that data strategy matters: if each new generation of synthetic text is added to (rather than replaces) the original real data, model collapse is avoided entirely. In their experiments and proofs, accumulating synthetic data alongside real data keeps the test error finite, whereas pure replacement causes unbounded error growth.
Taken together, these findings indicate that synthetic text need not cause “irreversible defects” if used under controlled schemes (e.g. mixing with real data, ensuring diversity, or editing strategies).
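Gerstgrasser et al.'s accumulate-versus-replace distinction can be illustrated with a toy simulation (an illustrative stand-in for retraining on model output, not their actual experimental setup): repeatedly fit a Gaussian to the training pool and draw the next "generation" of data from the fit. Under replacement, estimation error compounds generation after generation; under accumulation, the pool stays anchored to the original real data.

```python
import random
import statistics

def generation_loop(n_gens=200, n_samples=20, accumulate=False):
    """Toy model-collapse experiment: fit a Gaussian (mean, stdev) to the
    current pool, then sample the next generation's 'synthetic' set from
    the fit. A small n_samples makes the estimation error, and hence the
    generational drift, easy to see."""
    random.seed(0)  # deterministic, so the two strategies are comparable
    data = [random.gauss(0.0, 1.0) for _ in range(n_samples)]  # "real" data
    for _ in range(n_gens):
        mu, sigma = statistics.fmean(data), statistics.stdev(data)
        synthetic = [random.gauss(mu, sigma) for _ in range(n_samples)]
        if accumulate:
            data.extend(synthetic)  # keep real data plus every prior generation
        else:
            data = synthetic        # replace: train only on latest model output
    return statistics.stdev(data)

# Replacement tends to let the fitted variance drift toward zero (collapse);
# accumulation keeps it near the true value of 1.0.
print(generation_loop(accumulate=False))
print(generation_loop(accumulate=True))
```

The `n_gens` and `n_samples` values are illustrative choices that exaggerate the effect; the qualitative contrast between the two strategies is the point.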
Surveys of synthetic tabular data note it as a practical remedy for data scarcity and privacy constraints. Liu et al. (2025) summarize that generative models (GANs, VAEs, diffusion models, even LLMs) can learn real tabular distributions and output high-fidelity, privacy-preserving samples. Importantly, several new methods show improved model performance with synthetic tables. For example, Li et al. (2024) propose TAEGAN, a GAN with a masked-autoencoder backbone tailored for tabular data. On benchmarks, TAEGAN outperforms nine other deep-synthesis methods on 9/10 datasets for predictive accuracy and boosts data augmentation in most small-data cases. Likewise, D’souza et al. (2025) address class imbalance by adding a synthetic “overlap” class; this yields much higher-quality minority samples and significantly increases the final classifier’s accuracy.
These results demonstrate that well-designed synthetic tables can enhance learning.
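D'souza et al.'s overlap-class method is not reproduced here, but the general idea of synthetic minority-class augmentation for imbalanced tabular data can be sketched with a SMOTE-style interpolation (a generic technique, named here as an assumption, not the cited paper's algorithm): new minority rows are generated on line segments between random pairs of real minority rows.

```python
import random

random.seed(0)

def smote_like(minority, n_new):
    """Generate synthetic minority-class rows by linearly interpolating
    between random pairs of real minority samples (a SMOTE-style sketch)."""
    synthetic = []
    for _ in range(n_new):
        a, b = random.sample(minority, 2)   # pick two distinct real rows
        t = random.random()                 # interpolation weight in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# A toy imbalanced dataset: minority class with two numeric features per row.
minority = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.2), (1.1, 2.1)]
augmented = minority + smote_like(minority, n_new=8)
print(len(augmented))  # 12 rows: 4 real + 8 synthetic
```

Because each synthetic row lies between two real minority rows, it stays inside the observed minority region, which is why this kind of augmentation tends to rebalance a classifier without injecting out-of-distribution points.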
In context, the tabular survey emphasizes that synthetic data enables sharing and analysis that would otherwise be restricted. It notes that state-of-the-art generative methods now learn tabular schemas so well that they can produce realistic datasets resembling the originals. The empirical studies above confirm that using such synthetic tables often improves outcomes (better-balanced learning, higher accuracy) rather than introducing hidden defects. In short, synthetic tabular data is shown to be useful and benign under the right conditions, contradicting any blanket claim of irreversible harm.
Recent work on vision models echoes the text and tabular conclusions: synthetic visual data can match or even surpass real data if crafted carefully. Chan et al. (2024) (“Synth2”) train a vision-language model using only synthetic image-caption pairs generated by a text-to-image pipeline. They find the resulting VLM performs as well as one trained on human-annotated images, despite using much less data. Similarly, Liu et al. (2024) (“SynthVLM”) create 100K AI-generated caption-image pairs via diffusion models and report achieving state-of-the-art results on vision-QA tasks with only this synthetic set (just 18% the size of a typical dataset).
These successes show that high-quality synthetic images and labels can effectively substitute large real datasets in training.
That said, some analyses urge caution rather than doom. Geng et al. (2024) compare fine-tuning on task-relevant synthetic images versus simply retrieving similar real images from a large collection. They observe that retrieved real images consistently match or exceed the performance of the synthetic images; models using the real-image baseline never fell below those using generated images. The authors attribute this gap to artifacts and missing fine details in synthetic images, not to any fundamental model flaw. This suggests synthetic vision data must be high-fidelity, but it does not imply an irreversible defect. Crucially, the Synth2 and SynthVLM results above show that when synthetic images are high-quality and well aligned, models achieve excellent outcomes.
In summary, synthetic image data can be highly effective, but its utility should be measured against strong baselines.
Summary: Across text, tabular, and vision domains, recent studies find synthetic data to be useful or harmless when used with care. Empirical and theoretical analyses show that data diversity, real-data mixing, or task-specific design can prevent the runaway “collapse” seen in uncontrolled experiments.
None of the cited works reports irreversible model damage from moderate synthetic augmentation; instead, they highlight the conditions (e.g. pure self-training, low diversity) under which problems emerge. When those conditions are avoided, for example by blending synthetic with real data, verifying diversity, or using advanced generation techniques, models remain robust or even improve. These counterexamples argue that the Nature article’s extreme outcome is not inevitable in practice; synthetic data can be a valuable, controllable resource rather than a doom-laden poison.
Sources: Representative peer-reviewed and preprint studies (and industry reports) cited above provide empirical evidence and theory supporting the controlled use of synthetic data across modalities. Each reference contains methodology details and experiments illustrating the points summarized.