AI Is Almost Out of Data

In the era of generative AI, models from giants like OpenAI, Google, and Anthropic have consumed nearly all publicly available data on the web. Research from the University of Oxford and several other institutions suggests that between 2026 and 2028, the supply of high-quality public data that humans can feed to AI will run dry. Once the internet is flooded with AI-generated content, new models will inevitably train on data produced by other AIs. For AI, this self-referential process is akin to inbreeding among close relatives.

By 2026, AI will have thoroughly learned all human-generated data

The paper “The Curse of Recursion: Training on Generated Data Makes Models Forget”, first released in May 2023 by researchers at the University of Oxford, the University of Cambridge, and several other institutions, documents this phenomenon.

They found that when generative models are repeatedly trained on the data they themselves produce, even under ideal conditions, the models gradually forget reality and ultimately degenerate. After experimenting across several architectures, including language models, variational autoencoders (VAE), and Gaussian mixture models (GMM), the team likened each round of retraining to photocopying a photocopy: details fade, and rare events are the first to be forgotten. After a few generations, the model retains only the average, mainstream picture, becoming mediocre, homogeneous, and eventually outright wrong.

The process amounts to data poisoning initiated by the model itself (self-poisoning). In the end, the model no longer understands language or reality, and its output degenerates into repetitive nonsense.
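To make the photocopy intuition concrete, here is a minimal, hedged sketch in a toy setting (a one-dimensional Gaussian fit by maximum likelihood, standing in for the paper's GMM/VAE/language-model experiments): each generation fits a distribution to the previous generation's samples, then generates the next generation's training data.

```python
# Toy sketch of recursive training on self-generated data.
# Assumption: the "model" is a 1-D Gaussian fit by maximum likelihood;
# this is an illustration, not the paper's actual experiments.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=50)  # generation 0: real data

for gen in range(1, 101):
    mu, sigma = data.mean(), data.std()    # "train": fit the model to current data
    data = rng.normal(mu, sigma, size=50)  # next generation sees only synthetic data
    if gen % 20 == 0:
        print(f"gen {gen:3d}: fitted std = {sigma:.3f}")

# The MLE estimate of sigma is biased low, so across generations the
# fitted spread drifts (in expectation) toward zero: the tails, i.e. the
# rare events, vanish first -- the photocopy-of-a-photocopy effect.
```

Individual runs vary, but the fitted standard deviation typically decays well below 1.0 after a few dozen generations, while the mean wanders away from its true value.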

Stanford paper: as long as real data stays in the loop, AI will not collapse

However, a paper published in April 2024 by Stanford University and the Constellation team, “Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data”, offers a more optimistic answer. The authors replicated the Oxford team's experiments but proposed a different training strategy: accumulate data rather than replace it. In other words, each new generation of AI does not discard the old human data; it keeps layering human and AI-generated content together.

The results show that if synthetic data replaces the old data at each round of training, model performance degrades roughly linearly. But if the original data is retained and the pool keeps accumulating, the model's error gradually levels off and the degradation stops. They verified this across language models (GPT-2, Llama 2), image generation (VAE), and molecular generation (diffusion models), reaching the same conclusion every time: as long as real data stays involved, AI will not collapse.
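The contrast between the two strategies can be sketched in the same toy Gaussian setting as above (again a hedged illustration, not the paper's setup): “replace” refits each generation only on the latest synthetic batch, while “accumulate” keeps every earlier batch, real data included.

```python
# Toy comparison of "replace" vs "accumulate" training strategies,
# in the same 1-D Gaussian setting (an illustration, not the paper's setup).
import numpy as np

rng = np.random.default_rng(1)

def run(strategy: str, generations: int = 100, batch: int = 50) -> float:
    pool = rng.normal(0.0, 1.0, size=batch)          # generation 0: real data
    for _ in range(generations):
        mu, sigma = pool.mean(), pool.std()          # fit the current model
        synthetic = rng.normal(mu, sigma, size=batch)
        # replace: train only on the newest synthetic batch;
        # accumulate: keep real + all earlier synthetic data as well.
        pool = synthetic if strategy == "replace" else np.concatenate([pool, synthetic])
    return pool.std()                                # true value is 1.0

for strategy in ("replace", "accumulate"):
    print(f"{strategy:>10}: final fitted std = {run(strategy):.3f}")
```

With replacement the fitted spread typically collapses toward zero; with accumulation it stays close to the true value of 1.0, mirroring the papers' finding that error stabilizes when the real data is never thrown away.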

The researchers also proved theoretically that as data accumulates, model error has a finite upper bound and does not blow up indefinitely. In other words, AI “inbreeding” is not preordained, provided we never sever the connection to real human data.
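A heuristic way to see why the bound is finite (a back-of-the-envelope sketch, not the paper's actual derivation): under accumulation, generation $t$'s synthetic batch makes up only about $1/t$ of the training pool, so if each generation's added error shrinks like the square of that share, the total converges.

```latex
% Heuristic only: assume each generation contributes error proportional
% to the squared share of new synthetic data in the pool.
\text{replace:}\quad \sum_{t=1}^{T} c \;=\; cT \;\longrightarrow\; \infty,
\qquad
\text{accumulate:}\quad \sum_{t=1}^{T} \frac{c}{t^{2}} \;\le\; c\,\frac{\pi^{2}}{6} \;<\; \infty.
```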

AI has its own Habsburg phenomenon: self-reference is like inbreeding

Cheng Shijia, founder of iKala and a former Google software engineer, used the famous Habsburg family to describe the phenomenon. To keep its bloodline pure, the Habsburg dynasty of European history locked wealth and power inside the family through consanguineous marriage. The result was the infamous “Habsburg jaw”, and that was only the tip of the iceberg: hereditary diseases, epilepsy, intellectual disabilities, and high mortality plagued the family, until its last Spanish king, Carlos II, suffered from multiple illnesses and died without an heir.

Cheng Shijia explains with a more concrete example: start with a landscape painting full of detail, even small flaws. The painter's style, details, brushstrokes, and flaws represent genetic diversity. The first photocopy is the AI-generated copy (synthetic data), 99.9% faithful to the original. But the AI model averages over what it sees, smoothing out the flaws (the rare knowledge) and slightly amplifying the most common features (the mainstream views). The next generation learns from that copy and averages again, and that is the self-referential loop, as sketched below.
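As a hedged illustration of that averaging loop (my own toy example, not Cheng's): treat the painting as a one-dimensional signal whose rare spike is the “flaw”, and let each generation of copying be a small averaging (smoothing) step.

```python
# Toy sketch of the "photocopy that averages": repeated smoothing of a
# signal. The spike stands in for rare knowledge / a flaw in the painting.
import numpy as np

signal = np.zeros(100)
signal[::10] = 1.0                    # mainstream features: regular bumps
signal[37] = 5.0                      # one rare detail (the "flaw")

kernel = np.array([0.25, 0.5, 0.25])  # each copy slightly averages neighbors
copy = signal.copy()
for generation in range(50):
    copy = np.convolve(copy, kernel, mode="same")

print("original max:", signal.max())                    # 5.0 -- the rare detail
print("after 50 copies, max:", round(copy.max(), 3))    # spike smeared away
print("after 50 copies, spread:", round(copy.std(), 3)) # everything flattens
# The rare spike disappears first; eventually the whole signal tends
# toward a bland average -- the "Habsburg" loop in miniature.
```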

This article, AI Is Almost Out of Data, first appeared in Chain News ABMedia.
