Claude Code Verification: Why 'Done' Isn't Done [2026]

Listen to this article (5 min)

In Claude Code, "tests pass" feels like done. It isn't. The agent runs the suite, the bar goes green, the commit lands, and the closeout reads "shipped." Then you use the feature and nothing happens. The write never hit disk. The hook never fired. The function returned success for an effect that never occurred.

That gap has a name in my system: Claude Code verification. The principle is one sentence. Evidence that an action happened is not evidence that the outcome happened. A file got written is an action. The system behaving the way the claim says, observed under real conditions, is an outcome. They are not the same thing, and almost every false "done" lives in the space between them.

TL;DR

Claude Code verification means proving the outcome, not the action. "Tests pass," "committed," and "registered" are action evidence. A task is only done when it fires under real conditions, changes real system state, and a downstream consumer can use that state. Three legs. No exceptions.

Tests pass is not done

A passing test is real, but it only proves one narrow thing: the code did what the test asked, inside the test's own assumptions. Claude Code verification starts where the test ends. The test ran in isolation on data the agent chose. Production runs on data you did not choose, in an order you did not predict, with neighbors the test never simulated.

So when Claude Code says "done," ask which kind of evidence backs it. Most of the time the answer is action evidence: a commit hash, a green check, a line in a config. None of those is the system behaving correctly. This is the same blind spot I wrote about in self-correcting Claude Code workflows - the agent that writes the code also writes the test, so the test encodes the same assumptions as the bug.

Evidence of action (insufficient)

-Tests pass
-Code committed and pushed
-Hook registered in settings
-Doc says it works
-I followed the plan

Evidence of outcome (required)

+Behaves correctly on real input
+Real artifact shows the change
+Event actually fires the handler
+A consumer reads it and works
+Acceptance criteria met empirically

Why does Claude Code report done when it isn't?

Claude Code reports done because it infers the outcome from the inputs. It sees that the action it intended succeeded, and it concludes the effect it intended must have followed. "I wrote the fix, so the fix works." That inference is wrong often enough to be dangerous, and it gets worse with autonomy. The more steps an agent runs without you watching each one, the more unverified "done" claims stack on top of each other.

The trap is structural, not a model flaw. An agent grading its own work has every incentive to call it finished. I measured this directly when I built a verifier that judged the system's own output, and the self-preference was real until I moved the judge into a separate context prompted to refute. The same dynamic shows up in my score-based auto-delegation setup: a delegated agent reports success, and unless something independent checks the outcome, that success is taken on faith.

The 3-leg verification proof

Here is the proof I require before any completion claim counts. Three legs, each backed by pasted evidence, not assertion. If I can only show one or two, the work is "deferred and untested," and I say so out loud instead of calling it done.

The 3-leg verification proof

CLAIM: agent says the task is done

LEG 1 - TRIGGER: it fires under real conditions (timestamp)

LEG 2 - EFFECT: real system state changed (read the artifact)

LEG 3 - CONSUMPTION: a downstream consumer uses it and works

Leg one is the trigger. The thing actually runs under realistic conditions, not in mock isolation. Show the trigger and the timestamp. Leg two is the effect. The trigger changed real system state - a log line, a file, a response, a row. Read the real artifact and look at it. Leg three is consumption. The next step, the dependent process, or the human reading the doc can take that effect and produce a sensible result. If the consumer chokes, the pipeline is broken, not deferred.

This lives in primeline-ai/evolving-lite - the self-evolving Claude Code plugin. Free, MIT, no build step.

This is not bureaucracy. It is the cheapest insurance you can buy against a confident wrong answer. I run these checks as a standing rule, so the verify step happens whether or not I remember it, on top of the session hooks and memory that already fire every time I open Claude Code. That foundation is open source in primeline-ai/evolving-lite. Anthropic's own guidance on demystifying evals for AI agents lands in the same place: outcome-based evaluation beats trusting the agent's self-report.

The bug my unit tests could never catch

Here is the story that made me stop trusting green checks. I built a small tool that surfaces the handful of decisions that actually need me, ranked by urgency, so nothing important rots in a backlog. It shipped with 22 passing tests. Green across the board. Done, by every action-level measure.

Then I ran it on real data, and a component that had been dead for 137 days sat at the top of the list, above a reminder that was due that same day. The ranking logic read file-age as urgency. A long-dead item looked more urgent than a live one. The unit tests never caught it because they ran on synthetic data I had made up - data where that exact collision could not occur.

22unit tests passing (action)

1ranking bug only real data caught

137 daysdead item ranked above due-today

A separate review found a second one. A write helper returned "success" for a row that never reached disk, because the lock released before the failed write propagated. The function reported done. The data was not there. That is the entire thesis of Claude Code verification in one bug: the claim was "saved," the outcome was "nothing saved," and only reading the real artifact told them apart.

A test proves the code did what you asked. It cannot prove you asked the right thing.

- the lesson, after 22 green tests lied to me

How do you verify a Claude Code task is done?

You verify a Claude Code task by running it against real conditions and reading the real result, not the agent's summary of it. Trigger it for real and note the time. Open the actual file, log, or response it should have changed. Then hand that result to whatever consumes it and confirm the consumer works. If you skipped any leg, the honest status is "untested," not "done."

The habit that pays off most: when an autonomous run hands you a closeout, do not read the summary, read the artifacts it points at. I learned this the hard way enough times that I now publish my own failures with the data attached, like the negative result where my agent fix made things slightly worse. The discipline is the same one I use when auditing my own correction-rate measurement: trust the artifact, distrust the claim about the artifact.

What I changed in my workflow

Three changes, all cheap. First, every completion claim now carries the three legs or it carries the word "untested." Second, verification runs against real system state and, where it matters, a separate reviewer in its own context, because an agent grading its own homework grades generously. Third, my closeouts include a short "deferred and untested" section that lists exactly which legs are unproven and what would close them.

None of this slows me down the way you would expect. It moves the slow part to the front, where a wrong answer is cheap, instead of the back, where it is expensive. Anthropic frames the broader version of this as trustworthy agents in practice - oversight and verification baked in, not bolted on. My version fits in one rule and a checklist, running on the open-source Claude Code foundation I keep at primeline-ai/evolving-lite. Done isn't done until the outcome says so.

FAQ

What is Claude Code verification?+

Claude Code verification is the practice of proving a task's outcome instead of its action. A passing test or a commit shows an action happened. Verification confirms the system actually behaves as the claim says, observed under real conditions by a real consumer.

Why does Claude Code say a task is done when it isn't?+

Claude Code infers the outcome from the inputs. It sees its intended action succeed and concludes the effect followed. That inference is often wrong, and it compounds under autonomy when no human checks each step's real result.

Are passing tests enough to call a Claude Code task done?+

No. Passing tests prove the code did what the test asked inside the test's own assumptions. They do not prove behavior on real input, real state changes, or that a downstream consumer can use the result. Tests are one leg, not the whole proof.

What are the three legs of the verification proof?+

Trigger, effect, and consumption. Trigger: the thing fires under real conditions with a timestamp. Effect: real system state changed, confirmed by reading the artifact. Consumption: a downstream consumer uses that state and produces a sensible result.

How do I verify an autonomous Claude Code run?+

Do not read the run's summary, read the artifacts it points at. Open the real files, logs, and responses it claims to have changed, then confirm whatever consumes them still works. If any leg is unproven, mark the work untested rather than done.

Why did unit tests miss the ranking bug in the example?+

The tests ran on synthetic data designed by the same logic as the feature, so the real-world collision could not occur in them. Only real data, where a 137-day-dead item met a due-today reminder, surfaced the bug. Real data beats synthetic for verification.

Does verification slow down Claude Code work?+

It moves the slow part to the front, where a wrong answer is cheap to fix, instead of the back, where it is expensive. Most checks take seconds: read the artifact, confirm the consumer works. The cost is far lower than shipping a confident wrong answer.

Claude Code Verification: Why 'Done' Isn't Done [2026]

Tests pass is not done

Why does Claude Code report done when it isn't?

The 3-leg verification proof

The bug my unit tests could never catch

How do you verify a Claude Code task is done?

What I changed in my workflow

FAQ

>_ Related

Autonomous Claude Code Agent: 8 Layers That Stay Safe [2026]

Claude Code Planning Framework: UPF + DSV Reasoning in One System

Claude Code vs Cursor: Which AI Coding Tool Fits You? [2026]