Kymata Labs
AI TRAINING DATA · MODEL COLLAPSE · SYNTHETIC DATA · AI CAPABILITY

The Internet Is Eating Itself

Model Collapse, Synthetic Sludge, and the Slow Poisoning of the Data That Made AI Smart

Kymata Labs Research · June 2, 2026 · ~12 min read

The best AI models were built on a one-time gift: a vast, messy, mostly human internet, written before anyone thought a machine would read it. That internet is gone. The new one is filling, by the day, with text and images that AI wrote, and when you feed a model its own output, it gets worse.

This is not a metaphor. It has a name in the literature: model collapse. The frightening part has nothing to do with the machines turning on us. It is quieter than that. The well that made them smart is filling with their own runoff, and they have to keep drinking from it.

Figure: The measured web. Share of newly published web articles that are human- vs AI-written, 2024 → 2026.

Human-written: 50%
AI-generated: 50%

The freshest layer of the web is now ~50% machine-made.
Source: Graphite (2024–2026). AI- and human-written articles reached rough parity in late 2024, then plateaued near 50/50. This is a share of newly published articles, not of the web overall.

Published by Kymata Labs · Independent Research Institution.

Does this affect you?

If you trust what a model tells you, this is your problem too.

You asked a chatbot a question this morning and took the answer at face value. You read an article, looked at a product image, skimmed a translated page, and you couldn’t say for certain whether a person or a machine made any of it. That uncertainty is the leading edge of the problem. The web you read is also the web the next model trains on. The two are the same pile.

So the question isn’t academic. What happens to the quality of the answers you depend on when the system has spent years learning, increasingly, from itself? The early data says the answer is not nothing, and it bends in only one direction.

A model trained on its own output is a photocopy of a photocopy. Run the machine long enough and the picture fades.

“We may be living through the smartest these machines will ever be.”

Kymata Labs

The evidence

This is not a worry. It’s a result, on the cover of Nature.

The case rests on five things that don’t depend on intuition: a controlled experiment, a measured web, a linguistic forensic audit, a price list, and a forecast. Each is doing different work. Together they describe a single system quietly poisoning its own supply.

  • The experiment: feed a model its own output and the tails fall off

    In July 2024, Nature ran a study as its cover story. The finding, in the authors’ own words: “indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear,” an effect they named “model collapse.” They demonstrated it across large language models, variational autoencoders, and Gaussian mixture models: three very different model families, the same failure. The canonical illustration is brutal in its simplicity. An OPT-125m model, retrained over and over on its own writing, began at Generation 0 with coherent prose about medieval architecture and had degraded by Generation 9 into repetitive nonsense about multicoloured “jackrabbits.” The rare, the specific, the long tail, the parts of reality that make a model accurate rather than merely plausible, washed out first.

    Shumailov, Shumaylov, Zhao, Papernot, Anderson & Gal, Nature 631:755–759, July 2024. Co-author Ross Anderson, the Cambridge security researcher, died in 2024.
  • The honest caveat: it’s replacement, not synthetic data as such

    Read carefully, the Nature result is about a specific, dangerous recipe: indiscriminate, replacement training, where each generation learns chiefly from the previous one’s output. A companion line of work sharpens the warning. Gerstgrasser and colleagues showed that when synthetic data accumulates alongside the original human data instead of replacing it, collapse is bounded, because the human floor holds. So synthetic data is not poison on its own. The poison is the trajectory: a web where verified human text thins out, model output thickens, and training proceeds indiscriminately because nobody can reliably tell the two apart. That is precisely the trajectory we are on.

    Gerstgrasser et al., arXiv:2404.01413 (2024).
  • The measured web: roughly half of new articles are now machine-made

    How close is that trajectory? Closer than comfortable. Graphite’s analysis found that AI-generated articles reached rough parity with human-written ones by late 2024, then plateaued, with its analysis through Q1 2026 landing near a 50/50 split. Be precise about the claim: this is roughly half of newly published web articles, not “the majority of the web,” and not the whole internet. But it is the freshest stratum, the layer a crawler scrapes next. The clean human web is now a finite seam under a rising tide of synthetic sediment.

    Graphite, 2024–2026. Stated as a share of newly published articles, not the web overall.
  • The forensic audit: a shocking amount of the web is machine-translated

    Contamination isn’t only fresh AI text. AWS AI Labs researchers built a multi-way parallel corpus from the web and found that 57.1% of sentences sat in multi-way-parallel (mass machine-translated) clusters, the same low-quality content pumped through translation into many languages at once. Their conclusion is pointed: low-quality material is disproportionately machine-translated, skewing the multilingual web toward exactly the kind of garbled, hallucination-prone text you least want in a training set. The poison was already in the groundwater before the chatbots arrived; the chatbots are raising the level.

    Thompson et al., AWS AI Labs, Findings of ACL 2024.
  • The price list: clean human data is now a scarce, contested commodity

    Watch where the money goes and you learn what’s actually scarce. Verified human content is now priced and fought over. Reddit licensed its human conversation to Google for a reported ~$60 million a year (February 2024). OpenAI signed News Corp for a reported more than $250 million over five years (May 2024). And in the most telling episode of all, when Stack Overflow struck its OpenAI deal in May 2024, some users sabotaged or deleted their own answers in protest, a small and furious referendum on who owns the human knowledge the machines were trained on. You don’t pay nine figures for an abundant resource.

    Reuters (Feb 2024); WSJ (May 2024); reporting on the Stack Overflow–OpenAI deal (May 2024).
  • The forecast: “peak data” arrives around 2028

    Epoch AI put a date on the scarcity. Their updated 2024 projection estimates the stock of high-quality public human text will be fully utilized roughly between 2026 and 2032, central estimate around 2028. Their own hedge matters and we keep it: they place only about a 20% chance that scaling slows significantly by 2040, because synthetic data and efficiency gains may yet relax the limit. Read together with model collapse, the forecast is sobering: the human well runs low right as the synthetic runoff runs high.

    Epoch AI (Villalobos et al.), arXiv:2211.04325, updated June 2024.
  • It’s not just text: the images go MAD too

    Lest this seem a quirk of language, the same disease shows up in pictures. Researchers at Rice and Stanford coined MAD, for Model Autophagy Disorder: a self-consuming loop in which image generators trained on their own (and each other’s) outputs lose quality and diversity, generation after generation, unless a sufficient stream of fresh real images keeps flowing in. Autophagy: a system eating itself. The name is not subtle, and it isn’t meant to be.

    Alemohammad et al., Rice & Stanford, ICLR 2024.
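The mechanism behind “the tails fall off” can be seen in a toy far smaller than any of the systems above. The sketch below is an illustration under stated assumptions, not a reproduction of the Nature experiments: a hypothetical 1,000-token vocabulary with Zipf-like frequencies stands in for human text, and each “generation” simply re-estimates token frequencies from a finite sample drawn from the previous generation. The function names, vocabulary size, sample size, and seed are all invented for the example.

```python
import random
from collections import Counter


def zipf_distribution(vocab_size):
    """Zipf-like token probabilities: a few common tokens, a long rare tail."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def next_generation(probs, sample_size, rng):
    """'Train' on a finite sample of the previous generation's output.

    The new model is just the empirical token frequencies of that sample,
    so any token that draws zero samples gets probability zero.
    """
    tokens = rng.choices(range(len(probs)), weights=probs, k=sample_size)
    counts = Counter(tokens)
    return [counts.get(i, 0) / sample_size for i in range(len(probs))]


def support_sizes(vocab_size=1000, sample_size=2000, generations=10, seed=0):
    """Count how many tokens still have non-zero probability per generation."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    probs = zipf_distribution(vocab_size)
    sizes = [sum(p > 0 for p in probs)]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        sizes.append(sum(p > 0 for p in probs))
    return sizes


if __name__ == "__main__":
    for generation, size in enumerate(support_sizes()):
        print(f"generation {generation}: {size} tokens still reachable")
```

A token that draws zero samples in one generation has probability zero in the next, so it never comes back. Rare tokens are the ones most likely to draw zero samples, which is why the tail goes first while the common tokens, and the fluency they carry, survive.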

How contaminated is the well?

Figure: The web, by the numbers. Measured synthetic and machine-made content in what a crawler scrapes next.

AI-generated share of new web articles (Graphite): ~50%
Machine-translated share of web sentences in mass-MT clusters (AWS): 57.1%

Epoch AI still puts only ~20% odds on scaling slowing by 2040; the image equivalent of collapse has its own name, MAD. Sources: Graphite (2024–26); Thompson et al., AWS AI Labs, ACL 2024.

~50% of newly published web articles are now AI-generated. AI- and human-written content reached rough parity in late 2024, then plateaued near a 50/50 split through Q1 2026 (Graphite). The freshest layer of the web is half machine-made.

~2028: “peak data,” when the human text stock runs low. Epoch AI projects high-quality public human text is fully utilized roughly between 2026 and 2032, central estimate ~2028. They still put only ~20% odds on scaling slowing by 2040.

How we got here

The gift was free. That was the whole problem.

Today’s best models were trained on a corpus no one will ever assemble again: decades of human writing produced before the writers knew a machine would harvest it. Forum arguments, Wikipedia edits, recipe blogs, code comments, half-finished novels: billions of people, writing for each other, with no thought of training data. That naïveté is what made the data clean. It was a snapshot of human language taken while no one was performing for the camera.

Figure: The economics. Clean human data now has a price: what labs pay for verified human text.

Reddit → Google (human conversation, Feb 2024): ~$60M / yr
News Corp → OpenAI (news archive, May 2024): $250M+ over 5 years (roughly $50M+ / yr annualized)

When Stack Overflow signed with OpenAI, some users deleted their own answers in protest. You don’t pay nine figures for an abundant resource. Sources: Reuters (Feb 2024); WSJ (May 2024).

Then the cameras turned on. The moment models could write fluently, the cheapest way to fill a web page became a prompt. And because that text is free, fast, and good enough, it floods in, onto the same open web the next model will scrape. The supply chain quietly closed into a loop: the model’s output becomes the model’s input. The internet stopped being a record of human thought and started becoming a mirror the machines hold up to themselves.

Nothing about this was a decision. No one chose to poison the well. It is an emergent property of cheap generation meeting an open commons, a tragedy of the commons where the grass is verified human data and everyone, including the machines, is grazing.

The web that once carried humans talking to each other now increasingly carries the models talking to themselves.

No single synthetic article is the danger. The ratio is, drifting one upload at a time. Each generation of models trains on a web a little more written by the last generation of models, and the rare, true, long-tail facts that made the early models sharp are the first things to wash away.
On the structure of model collapse

The divide

Two kinds of model-builders are forming. Only one is drinking clean.

Model collapse doesn’t fall evenly. It sorts the field into two camps. On one side, the labs that own or can buy verified human data, whether the Reddit license, the News Corp archive, or a proprietary stream of real human interaction, paired with the discipline to keep synthetic data accumulating alongside the real, never replacing it. They keep the human floor under their models, and the floor holds.

On the other side, everyone scraping the open web in good faith, training indiscriminately on a corpus they can no longer clean, because reliable AI-text detection at web scale doesn’t exist and gets harder as the models improve. The ambition is identical; the trajectory runs the opposite way. The first group’s models stay sharp. The second group’s models slowly forget the tails, and, fluent to the last, won’t show the damage until something specific and true is asked of them and the answer comes back confident and wrong. The scarce resource was never compute. It is clean water.
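The difference between the two camps can be sketched numerically. The toy below is an illustration under heavy assumptions, not the setup of Shumailov et al. or Gerstgrasser et al.: a one-dimensional Gaussian stands in for a model, 25 draws from a standard normal stand in for human data, and the names `fit_gaussian` and `simulate`, the sample size, the generation count, and the seed are all hypothetical. In "replace" mode each generation is fitted only to the previous generation’s samples; in "accumulate" mode the synthetic samples are added to a pool that still contains the original data.

```python
import math
import random


def fit_gaussian(samples):
    """Maximum-likelihood mean and standard deviation of a sample."""
    n = len(samples)
    mean = math.fsum(samples) / n
    variance = math.fsum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(variance)


def simulate(mode, n=25, generations=400, seed=0):
    """Return the fitted standard deviation at each generation.

    mode="replace":    train only on the previous generation's output.
    mode="accumulate": train on the original data plus all synthetic data so far.
    """
    if mode not in ("replace", "accumulate"):
        raise ValueError("mode must be 'replace' or 'accumulate'")
    rng = random.Random(seed)
    # Stand-in for the original human data (an assumption of this toy).
    pool = [rng.gauss(0.0, 1.0) for _ in range(n)]
    fitted_stds = []
    for _ in range(generations):
        mean, std = fit_gaussian(pool)
        fitted_stds.append(std)
        synthetic = [rng.gauss(mean, std) for _ in range(n)]
        if mode == "replace":
            pool = synthetic          # the human data is discarded
        else:
            pool.extend(synthetic)    # the human data stays in the pool
    return fitted_stds


if __name__ == "__main__":
    for mode in ("replace", "accumulate"):
        stds = simulate(mode)
        print(f"{mode}: first fitted std {stds[0]:.3f}, last fitted std {stds[-1]:.3g}")
```

With these settings the fitted spread in replace mode should shrink by many orders of magnitude, because each refit loses a little variance and nothing puts it back. In accumulate mode it should stay near its starting value, because the original samples never leave the pool. That is the “human floor” in miniature.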

What it means

The same evidence, read by three different readers.

Collapse is not destiny. The research that diagnoses it also names the cure: keep real human data in the mix, and keep the corpus clean. What that demands depends on who you are.

For individuals

Treat fluent answers as claims, not facts.

  • The long tail, meaning the rare, specific, verifiable detail, is exactly what collapse erodes first. On anything that matters, check the model against a human-made primary source.
  • Your original writing, photos, and answers are now a scarce resource. Be deliberate about what you pour into open scraping pipelines for free.
  • Prize provenance. “Who actually made this, and how would I know?” is becoming the most useful question on the internet.

For employers & builders

Clean data is now strategy, not plumbing.

  • Don’t train indiscriminately on a scraped web you can no longer vet. Where you use synthetic data, accumulate it alongside verified human data; never replace.
  • Treat a stream of real human interaction (support logs, expert review, licensed corpora) as a durable competitive moat, because it is one.
  • Track data provenance like you track a supply chain. The contamination you can’t see is the one that collapses your model.
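One way to make “accumulate, never replace” and provenance tracking concrete is to carry a provenance tag on every document and enforce the mixing rule in code. The sketch below is a minimal illustration, not a recommended pipeline: the `Provenance` labels, the `Document` fields, the `build_training_mix` function, and the default cap of one synthetic document per verified human document are all assumptions made for the example. Nothing in the research cited above prescribes a specific ratio.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    HUMAN_VERIFIED = "human_verified"  # licensed or otherwise vetted human text
    SYNTHETIC = "synthetic"            # known model output
    UNKNOWN = "unknown"                # scraped, origin cannot be established


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    provenance: Provenance
    source: str  # hypothetical field: a licence or crawl identifier


def build_training_mix(documents, max_synthetic_per_human=1.0):
    """Return (mix, held_out).

    Every verified human document is kept. Synthetic documents are added
    alongside them, in input order, up to the cap. Unknown-provenance
    documents and any synthetic overflow are held out, not trained on.
    """
    human = [d for d in documents if d.provenance is Provenance.HUMAN_VERIFIED]
    synthetic = [d for d in documents if d.provenance is Provenance.SYNTHETIC]
    unknown = [d for d in documents if d.provenance is Provenance.UNKNOWN]

    budget = int(len(human) * max_synthetic_per_human)
    mix = human + synthetic[:budget]
    held_out = unknown + synthetic[budget:]
    return mix, held_out


if __name__ == "__main__":
    corpus = [
        Document("h1", "an expert-reviewed answer", Provenance.HUMAN_VERIFIED, "licensed"),
        Document("s1", "a model-written summary", Provenance.SYNTHETIC, "model"),
        Document("s2", "another model-written summary", Provenance.SYNTHETIC, "model"),
        Document("u1", "a scraped page", Provenance.UNKNOWN, "crawl"),
    ]
    mix, held_out = build_training_mix(corpus)
    print("train on:", [d.doc_id for d in mix])
    print("held out:", [d.doc_id for d in held_out])
```

The design choice worth noting is that the synthetic budget is derived from the count of verified human documents, so synthetic data can only grow alongside the human floor, and anything of unknown origin is excluded by default rather than trusted by default.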

For policymakers

The commons that trained AI is depleting. Govern it as such.

  • Provenance and disclosure standards (clear labelling of AI-generated and machine-translated content) make a cleanable web possible. Without them, contamination is invisible by default.
  • Verified human data is becoming a concentrated, priced asset. Watch who corners it; access to clean data is becoming access to capable AI.
  • Fund independent measurement of web contamination and model collapse. Today’s strongest signals come from a handful of papers. A risk this structural deserves rigour at national scale.

Questions worth asking

FAQ

Both things are true, and the distinction is the whole paper. The Nature result is about indiscriminate, replacement training: each generation learns mostly from the last one's output, and the model collapses. A separate study (Gerstgrasser et al.) showed that when synthetic data accumulates alongside the original human data rather than replacing it, collapse is bounded, because the human floor holds. So synthetic data isn't poison by itself. The danger is in the recipe the open web is drifting toward: less and less verified human data, more and more model output, trained on indiscriminately because nobody can tell which is which anymore.

No, and we're careful with that number. Graphite's analysis found that roughly half of newly published web articles are AI-generated, reaching rough parity with human-written content in late 2024 and then plateauing near 50/50. That's newly published articles, not the web as a whole, and not 'the majority.' The point isn't that humans have stopped writing. It's that the freshest layer of the web, the part a crawler scrapes tomorrow, is now about half machine-made, and rising AI content is harder to label and filter than it looks.

In theory. In practice, reliable detection of AI text at web scale is an unsolved problem, and it gets harder as models get better, since the whole point of a good model is that its output is indistinguishable from human writing. Worse, contamination hides in places filters don't look: machine-translated pages dressed up as native content, AI text edited lightly by a human, synthetic data laundered through a dozen reposts. You can't cleanly remove what you can't reliably detect.

Unsure, and we won't pretend otherwise. Epoch AI, which projects the high-quality human text stock is fully utilized somewhere between 2026 and 2032 (central estimate ~2028), still puts only about a 20% chance on scaling significantly slowing by 2040. Synthetic data done carefully, better data efficiency, and new modalities may relax the constraint. The honest claim is narrower and harder to dodge: the cheap, clean, human-made data that produced today's best models is finite and contaminating, and the strategies that work around that are more expensive and more fragile than the free buffet that got us here.

Clean human data is now the scarce input, and it's being diluted by the very systems that depend on it. The models didn't run out of compute or ideas. What they're running short of is uncontaminated water at the well. Whoever controls verified human data, and whoever can keep training corpora clean, holds the real leverage in the next phase of AI.

The machines didn’t run out of compute. They’re running out of us.

This is not a prophecy of doom. Collapse is avoidable: accumulate real data, keep the corpus clean, prize provenance. But the free, abundant, accidentally-perfect human internet that made today’s models smart was a one-time gift, and it is curdling into a mirror. The labs that understand the water is the scarce thing will keep building. The ones that mistake the runoff for the well will ship fluent models that quietly forget the truth.

Protect the human web. It’s the only well there is.

References

Sources

Every figure in this paper is drawn from the primary sources below. Where a result is specific to indiscriminate training, or a projection carries the authors’ own hedge, we have said so in the text.

Published by Kymata Labs · Independent Research Institution · kymatalabs.com

This paper is provided for free. Cite the underlying primary sources directly.