AI agents don't crash.
They drift.
Over long tasks, an AI's goal quietly shifts, its memory contradicts itself, its reasoning breaks — while the output stays confident. CRepair is a benchmark and metric for the question that follows: when an agent's internal state goes incoherent, can it notice and repair it on its own?
Silent structural decay
Standard AI evaluation asks one thing: did the agent produce the right answer? That misses the failure mode that matters most in long-horizon agents — the one that accumulates invisibly while the surface output still looks plausible.
"Long-horizon AI agents fail not with a visible crash but with silent structural decay."
By the time the damage shows up in outputs, it has already propagated through many reasoning steps. So CRepair stops scoring outcomes and starts scoring recovery: detection, repair, verification, and stability — the full loop an agent must close to actually fix itself.
The repair loop, and why it's multiplicative
A real self-repair runs detect → repair → verify → stay stable. CRepair scores each step from 0 to 1 and multiplies them:
Multiplication is the whole point. If any one step is zero, the entire score collapses — because an agent that repairs but never verifies looks recovered while it isn't, which is just as dangerous as never noticing at all. Move the sliders and watch one weak link take the whole loop to zero.
Zero-propagation: an agent can't repair what it didn't detect, so D = 0 forces R and V to 0; R = 0 forces V to 0. Preset values are the mean component scores measured on Claude Sonnet across 13 scenarios; the live score shows what those means multiply to.
What each gate actually checks
Six ways coherence fails
CRepair injects one failure silently — the agent is never told — then watches whether it recovers. The six types are ordered by how badly a botched repair can cascade.
The sixth type is the novel contribution — partial repairs that mask the original problem while creating a new one are worse than no repair, because they're harder to detect.
Verification is the binding constraint
Running Claude Sonnet across 52 scenario-condition pairs, detection, repair and verification turn out to be dissociable — each responds to a different intervention, and none comes for free. Reflection lifts detection but not verification. Memory maxes detection but suppresses repair. Only an explicit repair loop moves verification at all — and even then only to 38%.
"Verification — not detection — is the binding constraint on AI self-repair."
Because the score is multiplicative, V near zero collapses everything regardless of how good detection and repair are. The repair-loop condition was also the only one to reach 5 / 13 perfect scores and eliminate cascade failures entirely.
A thin wrapper closes the loop
The CRepair Wrapper intercepts each response, scores it on D×R×V×S, and — when the loop isn't closed — sends a targeted instruction ("you made a correction but didn't verify it; explicitly check it"). The control condition uses the same detection but just says "try again".
"Structured repair loops, not mere re-prompting, drive the improvement."
Same detection, same number of retries — the only difference is whether the instruction is structured. Generic retry added +0.051; the structured loop added +0.333, and produced an explicit verification step in every wrapped response. That rules out the simplest explanation — "any second chance helps" — and points at the structure itself.
Honest limits
This is reported as a pilot, not a verdict. The core pattern has since been replicated across three LLM families (linked below) — but naming what's still open is part of the method:
A high V score means the agent stated a verification step — not that the verification was correct. The metric measures loop-closure behaviour, and makes no claim about consciousness or understanding.