On February 6, 2026, three engineers at StrongDM published a manifesto with two rules: "Code must not be written by humans" and "Code must not be reviewed by humans." Agents write the code. Agents test it. An LLM decides whether it worked. No human reads any of it.
Stanford Law School responded two days later with a question that I haven't been able to shake: "Built by Agents, Tested by Agents, Trusted by Whom?"
StrongDM is an extreme example. But the pattern underneath it — AI systems whose outputs feed into other AI systems, with less human involvement at each step — is everywhere right now. This post is about what that pattern produces.
AI is already training on AI
In June 2025, TechCrunch reported that an updated DeepSeek R1 model appeared to have been trained on Gemini outputs. The evidence: its chain-of-thought traces read like Gemini's — same phrasing, same cadence. The claim is unconfirmed and disputed, but Nathan Lambert from AI2 wasn't surprised: "If I was DeepSeek, I would definitely create a ton of synthetic data from the best API model out there."
This is the basic loop: a model produces outputs, those outputs get into training pipelines, the next model inherits characteristics of the previous one. Whether it's happening covertly between competing labs is contested. That it's happening somewhere is not.
The web has the same problem
Ahrefs studied 900,000 newly created web pages in 2025 and found 74.2% contained AI-generated content. AI-detection tools aren't perfect, but 74% is hard to ignore. AI labs train on web data, so future models will train heavily on content that previous models wrote.
The proposed fix was retrieval-augmented generation — let models look things up at query time instead of relying on stale training data. But recent research found that retrieval quality falls as more of the web is AI-written. The web your model is retrieving from is increasingly written by models.
These are different failure modes — one corrupts what the model knows, the other corrupts what it can look up. But the root cause is the same: AI-generated content flowing back into AI systems with no way to filter it out.
What actually happens when models train on their own outputs
Researchers have been studying this since 2023. The short version: it's bad, and it compounds.
Shumailov and colleagues from Oxford, Cambridge, Imperial College London, and the University of Toronto published "The Curse of Recursion", later updated in Nature. Their finding, in controlled experiments: when each generation of model trains on the previous one's outputs, rare patterns disappear. Unusual knowledge, niche expertise, edge cases — all averaged out. The model gets blander and more generic with each pass. The caveat: frontier labs filter training data aggressively, and real contamination is partial rather than total, so whether these lab findings translate to frontier-scale training is contested.
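To make the dynamic concrete, here's a deliberately minimal sketch (mine, not Shumailov et al.'s setup): the "model" is reduced to the empirical distribution of its own training data, and each generation trains on samples drawn from the previous one. Real models generalize rather than memorize, but the sampling half of the loop behaves the same way: values that are rare get drawn less often, then never.
```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Generation 0: "real" data, with a heavy tail of rare, extreme values.
data = rng.standard_t(df=3, size=N)

for gen in range(1, 31):
    # Stand-in for "train on the previous generation's outputs": the next
    # generation's training set is sampled entirely from the previous one.
    data = rng.choice(data, size=N, replace=True)

    if gen % 5 == 0:
        distinct = np.unique(data).size   # how much of the original variety survives
        extreme = np.abs(data).max()      # the most unusual value still present
        print(f"gen {gen:2d}  distinct={distinct:5d}  max|x|={extreme:.2f}")
```
Diversity only moves in one direction here: once a rare value fails to be sampled into a generation, no later generation can recover it.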
Bohacek and Farid (arXiv 2311.12202) showed the same degradation in image generation: even a small amount of self-generated training data causes quality loss, and retraining on real data doesn't fully reverse it. This is image models, not language — the pattern likely holds for LLMs, but it hasn't been directly tested.
A 2025 theoretical analysis identified the key variable: whether synthetic data accumulates alongside real data or replaces it. Add synthetic on top of a growing pool of real data, and models stay stable. Let synthetic crowd out real data, and collapse follows. No lab publishes which approach they're taking.
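The same toy loop illustrates that distinction (again a sketch, not the paper's model). In the "replace" regime each generation trains only on the previous generation's samples; in the "accumulate" regime synthetic samples pile up on top of a pool that still contains the original real data.
```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
real = rng.standard_t(df=3, size=N)

def run(mode: str, generations: int = 30) -> int:
    pool = real.copy()
    for _ in range(generations):
        synthetic = rng.choice(pool, size=N, replace=True)    # "model outputs"
        if mode == "replace":
            pool = synthetic                                  # synthetic crowds out real
        else:
            pool = np.concatenate([pool, synthetic])          # real data never leaves the pool
    return np.unique(pool).size

for mode in ("replace", "accumulate"):
    print(f"{mode:10s} distinct values after 30 generations: {run(mode)}")
```
Replace collapses diversity as before; accumulate preserves it, because the real data is always still in the mix.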
One counterpoint worth taking seriously: benchmark performance has kept improving through this same period. GPT-4 to o3, Claude 2 to Claude 3.5, Llama 2 to Llama 3 — measurable gains on standard evals. The optimistic read is that data curation is working and the worst conditions haven't hit yet. The pessimistic read is that benchmarks measure what we built them to measure, not everything that matters. Both can be true.
The loops that are intentional
Not all self-referential loops are accidents. Some are the actual design.
Anthropic's Constitutional AI — in its published form — had models critique their own outputs, then trained on those critiques. The model's judgment of itself shapes the next version of itself.
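Schematically, the supervised half of that loop looks something like the sketch below. The model() call is a placeholder rather than any real API, and the actual pipeline samples far more principles and prompts; what matters is the shape of the loop.
```python
import random

def model(prompt: str) -> str:
    """Placeholder for a real model call; everything here is schematic."""
    return f"<model output for: {prompt[:40]}...>"

PRINCIPLES = [
    "Choose the response that is least likely to be harmful.",
    "Choose the response that is most honest about uncertainty.",
]

def constitutional_pass(user_prompt: str) -> str:
    draft = model(user_prompt)                      # 1. the model answers normally
    principle = random.choice(PRINCIPLES)
    critique = model(                               # 2. it critiques its own answer
        f"Critique this response against the principle '{principle}':\n{draft}"
    )
    return model(                                   # 3. it revises in light of its own critique
        f"Rewrite the response to address this critique:\n{critique}\n\nOriginal:\n{draft}"
    )

# The revised outputs become fine-tuning data for the next model: the model's
# judgment of itself shapes the next version of itself.
training_examples = [(p, constitutional_pass(p)) for p in ["prompt A", "prompt B"]]
```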
OpenAI's o3 and o4 used reinforcement learning that produced unexpected verification behaviors — for example, writing a brute-force solution to double-check its own answer. Nobody wrote that rule; it emerged from training. The thing checking the answers is as much a product of the training process as the thing generating them.
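The pattern itself is one any engineer would recognize: write the fast, clever solution, then check it against a slow, obviously-correct one on small inputs. A hypothetical example of that pattern, mine rather than anything from OpenAI's training traces:
```python
import random

def max_subarray_fast(xs: list[int]) -> int:
    """Kadane's algorithm: O(n) maximum subarray sum."""
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def max_subarray_bruteforce(xs: list[int]) -> int:
    """O(n^2) check: try every non-empty contiguous subarray."""
    return max(sum(xs[i:j]) for i in range(len(xs)) for j in range(i + 1, len(xs) + 1))

# Self-check: before trusting the fast version, confirm it agrees with the
# brute-force version on a pile of small random inputs.
random.seed(0)
for _ in range(1000):
    xs = [random.randint(-5, 5) for _ in range(random.randint(1, 12))]
    assert max_subarray_fast(xs) == max_subarray_bruteforce(xs), xs
print("fast solution agrees with brute force on 1000 random cases")
```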
These are the alignment techniques. The safety layer is also a loop. Research suggests this can be stable — but the question remains: when a model's behavior is shaped by its own prior outputs across generations, what gets quietly lost?
The code layer closes the loop
Individual engineers at Anthropic and OpenAI report writing 100% of their personal code with AI assistance. Anthropic's 2026 Agentic Coding Trends Report documents the industry-wide shift. StrongDM is the extreme end of a real trend.
Here's why this matters: code isn't separate from training data. Commit messages, documentation, Stack Overflow answers, GitHub discussions — all in training corpora. AI-generated code produces AI-generated artifacts that future models learn from. The loop runs through software too.
Back to the Stanford Law question — "trusted by whom?" — there's a technical version: verified by what? In StrongDM's system, an LLM decides whether the agents' code worked. They have digital twins and scenario holdout tests to catch known failure modes. But the evaluation system was built by the same class of model it's evaluating. It can be confidently wrong about things outside its test scenarios, and nothing in the loop catches what it doesn't know to test for.
I'm building one of these
I'm working on an agentic development pipeline — agents writing code, agents running tests, agents deciding whether the output meets the spec. The Stanford Law question is not abstract to me.
When an agent's tests pass, what's actually been verified? The tests were written by the same class of system being tested. The evaluation criteria came from humans, but the evaluation itself is probabilistic. You watch the metrics. You ship. And you're never fully certain what "works" means at the edges.
This is manageable — but it requires deliberately keeping humans in the loop at the points that matter, not just at the points that are convenient.
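Concretely, "the points that matter" means a routing policy rather than a blanket rule. A minimal sketch of the kind of gate I mean, with field names and a threshold that are illustrative rather than my actual pipeline:
```python
from dataclasses import dataclass

@dataclass
class ChangeSet:
    files: list[str]
    tests_passed: bool
    judge_confidence: float   # the LLM evaluator's self-reported confidence, 0..1
    touches_security: bool    # e.g. auth, secrets, permission checks

CONFIDENCE_FLOOR = 0.9        # hypothetical threshold

def route(change: ChangeSet) -> str:
    if not change.tests_passed:
        return "back-to-agent"        # cheap failures never need a human
    if change.touches_security:
        return "human-review"         # the points that matter, reviewed regardless of confidence
    if change.judge_confidence < CONFIDENCE_FLOOR:
        return "human-review"         # the judge's own uncertainty is a routing signal
    return "auto-merge"

print(route(ChangeSet(["auth/token.py"], True, 0.97, True)))    # human-review
print(route(ChangeSet(["docs/readme.md"], True, 0.95, False)))  # auto-merge
```
The obvious weakness is that the confidence signal comes from the same class of model being evaluated, which is the problem from the earlier sections all over again; the security-path rule is there precisely because it doesn't depend on the judge at all.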
The shape of the problem
The research is clear on what to do: keep real data alongside synthetic data, don't let synthetic replace real, maintain human verification at critical points. Labs that are serious about this treat data curation as core infrastructure. The majority of Meta's Llama 3.1 technical report covers data curation — a notable shift from the architecture-first reports of earlier generations.
The structural risk isn't that any one system breaks. It's that the degradation is gradual and diffuse. No single output looks wrong. The distribution shifts quietly across generations. The humans who would notice are increasingly outside the process.
The StrongDM factory has monitoring. Someone watches the metrics. But metrics measure behavior on known test scenarios — not the shape of what the system has actually learned. Those aren't the same thing.
By the time the reflection looks wrong, you're several generations from the original. And you can't tell what it's lost.
Sources: StrongDM Software Factory · Stanford Law · TechCrunch — DeepSeek/Gemini · Nature — Shumailov et al. 2024 · The Curse of Recursion — arXiv 2305.17493 · Nepotistically Trained Models — arXiv 2311.12202 · Retrieval Collapses When AI Pollutes the Web — arXiv 2602.16136 · Self-consuming loops — arXiv 2502.18865 · Anthropic Constitutional AI · OpenAI o3/o4 · Ahrefs AI content study · Fortune — AI code · Anthropic Agentic Coding Report 2026