SK Sergey Kaminov
← back to site
Research · AI self-repair

AI agents don't crash.
They drift.

Over long tasks, an AI's goal quietly shifts, its memory contradicts itself, its reasoning breaks — while the output stays confident. CRepair is a benchmark and metric for the question that follows: when an agent's internal state goes incoherent, can it notice and repair it on its own?

the problem

Silent structural decay

Standard AI evaluation asks one thing: did the agent produce the right answer? That misses the failure mode that matters most in long-horizon agents — the one that accumulates invisibly while the surface output still looks plausible.

"Long-horizon AI agents fail not with a visible crash but with silent structural decay."

By the time the damage shows up in outputs, it has already propagated through many reasoning steps. So CRepair stops scoring outcomes and starts scoring recovery: detection, repair, verification, and stability — the full loop an agent must close to actually fix itself.

the mechanism · try it

The repair loop, and why it's multiplicative

A real self-repair runs detect → repair → verify → stay stable. CRepair scores each step from 0 to 1 and multiplies them:

Crepair = D × R × V × S

Multiplication is the whole point. If any one step is zero, the entire score collapses — because an agent that repairs but never verifies looks recovered while it isn't, which is just as dangerous as never noticing at all. Move the sliders and watch one weak link take the whole loop to zero.

Load a real condition →
D
Detect
R
Repair
V
Verify
S
Stable
Crepair
0.00

Zero-propagation: an agent can't repair what it didn't detect, so D = 0 forces R and V to 0; R = 0 forces V to 0. Preset values are the mean component scores measured on Claude Sonnet across 13 scenarios; the live score shows what those means multiply to.

the four components

What each gate actually checks

DDetection — autonomous, unprompted flag of the inconsistency (vs. missing it entirely)notices
RRepair — the inconsistency is fully resolved (vs. acknowledged but not fixed)fixes
VVerification — the repair is explicitly validated (vs. declared done without checking)confirms
SStability — no new failure introduced across memory, goal, causal, identity, planning, stateholds
what can break

Six ways coherence fails

CRepair injects one failure silently — the agent is never told — then watches whether it recovers. The six types are ordered by how badly a botched repair can cascade.

1Memory conflict — holds contradictory facts about prior contextlow
2Goal drift — operative objective diverges from the original instructionmedium
3False premise — proceeds on an assumption already invalidatedmedium
4Identity inconsistency — self-model contradicts its prior self-descriptionhigh
5Causal contradiction — predicted effects conflict with stated causeshigh
6Repair-induced cascade — a fix to one failure introduces a new one elsewherecritical

The sixth type is the novel contribution — partial repairs that mask the original problem while creating a new one are worse than no repair, because they're harder to detect.

the finding

Verification is the binding constraint

Running Claude Sonnet across 52 scenario-condition pairs, detection, repair and verification turn out to be dissociable — each responds to a different intervention, and none comes for free. Reflection lifts detection but not verification. Memory maxes detection but suppresses repair. Only an explicit repair loop moves verification at all — and even then only to 38%.

Detection Repair Verification Stability

"Verification — not detection — is the binding constraint on AI self-repair."

Because the score is multiplicative, V near zero collapses everything regardless of how good detection and repair are. The repair-loop condition was also the only one to reach 5 / 13 perfect scores and eliminate cascade failures entirely.

the intervention that works

A thin wrapper closes the loop

The CRepair Wrapper intercepts each response, scores it on D×R×V×S, and — when the loop isn't closed — sends a targeted instruction ("you made a correction but didn't verify it; explicitly check it"). The control condition uses the same detection but just says "try again".

"Structured repair loops, not mere re-prompting, drive the improvement."

Same detection, same number of retries — the only difference is whether the instruction is structured. Generic retry added +0.051; the structured loop added +0.333, and produced an explicit verification step in every wrapped response. That rules out the simplest explanation — "any second chance helps" — and points at the structure itself.

what this does not prove

Honest limits

This is reported as a pilot, not a verdict. The core pattern has since been replicated across three LLM families (linked below) — but naming what's still open is part of the method:

Self-judgingClaude judges Claude. Cross-judge experiments are planned to estimate judge bias.
Score ceilingThe 0.808 ceiling is a property of the judge's V-scoring, not proof of true reasoning verification.
No human spot-check yetJudge accuracy isn't manually validated — a human spot-check study is planned.
Pilot scaleThirteen structured scenarios; broader, more naturalistic task coverage is the next step.

A high V score means the agent stated a verification step — not that the verification was correct. The metric measures loop-closure behaviour, and makes no claim about consciousness or understanding.