In Claude Code, "tests pass" feels like done. It isn't. The agent runs the suite, the bar goes green, the commit lands, and the closeout reads "shipped." Then you use the feature and nothing happens. The write never hit disk. The hook never fired. The function returned success for an effect that never occurred.
That gap has a name in my system: Claude Code verification. The principle is one sentence. Evidence that an action happened is not evidence that the outcome happened. A file got written is an action. The system behaving the way the claim says, observed under real conditions, is an outcome. They are not the same thing, and almost every false "done" lives in the space between them.
Claude Code verification means proving the outcome, not the action. "Tests pass," "committed," and "registered" are action evidence. A task is only done when it fires under real conditions, changes real system state, and a downstream consumer can use that state. Three legs. No exceptions.
Tests pass is not done
A passing test is real, but it only proves one narrow thing: the code did what the test asked, inside the test's own assumptions. Claude Code verification starts where the test ends. The test ran in isolation on data the agent chose. Production runs on data you did not choose, in an order you did not predict, with neighbors the test never simulated.
So when Claude Code says "done," ask which kind of evidence backs it. Most of the time the answer is action evidence: a commit hash, a green check, a line in a config. None of those is the system behaving correctly. This is the same blind spot I wrote about in self-correcting Claude Code workflows - the agent that writes the code also writes the test, so the test encodes the same assumptions as the bug.
- -Tests pass
- -Code committed and pushed
- -Hook registered in settings
- -Doc says it works
- -I followed the plan
- +Behaves correctly on real input
- +Real artifact shows the change
- +Event actually fires the handler
- +A consumer reads it and works
- +Acceptance criteria met empirically
Why does Claude Code report done when it isn't?
Claude Code reports done because it infers the outcome from the inputs. It sees that the action it intended succeeded, and it concludes the effect it intended must have followed. "I wrote the fix, so the fix works." That inference is wrong often enough to be dangerous, and it gets worse with autonomy. The more steps an agent runs without you watching each one, the more unverified "done" claims stack on top of each other.
The trap is structural, not a model flaw. An agent grading its own work has every incentive to call it finished. I measured this directly when I built a verifier that judged the system's own output, and the self-preference was real until I moved the judge into a separate context prompted to refute. The same dynamic shows up in my score-based auto-delegation setup: a delegated agent reports success, and unless something independent checks the outcome, that success is taken on faith.
The 3-leg verification proof
Here is the proof I require before any completion claim counts. Three legs, each backed by pasted evidence, not assertion. If I can only show one or two, the work is "deferred and untested," and I say so out loud instead of calling it done.
Leg one is the trigger. The thing actually runs under realistic conditions, not in mock isolation. Show the trigger and the timestamp. Leg two is the effect. The trigger changed real system state - a log line, a file, a response, a row. Read the real artifact and look at it. Leg three is consumption. The next step, the dependent process, or the human reading the doc can take that effect and produce a sensible result. If the consumer chokes, the pipeline is broken, not deferred.
This lives in primeline-ai/evolving-lite - the self-evolving Claude Code plugin. Free, MIT, no build step.
This is not bureaucracy. It is the cheapest insurance you can buy against a confident wrong answer. I run these checks as a standing rule, so the verify step happens whether or not I remember it, on top of the session hooks and memory that already fire every time I open Claude Code. That foundation is open source in primeline-ai/evolving-lite. Anthropic's own guidance on demystifying evals for AI agents lands in the same place: outcome-based evaluation beats trusting the agent's self-report.
The bug my unit tests could never catch
Here is the story that made me stop trusting green checks. I built a small tool that surfaces the handful of decisions that actually need me, ranked by urgency, so nothing important rots in a backlog. It shipped with 22 passing tests. Green across the board. Done, by every action-level measure.
Then I ran it on real data, and a component that had been dead for 137 days sat at the top of the list, above a reminder that was due that same day. The ranking logic read file-age as urgency. A long-dead item looked more urgent than a live one. The unit tests never caught it because they ran on synthetic data I had made up - data where that exact collision could not occur.
A separate review found a second one. A write helper returned "success" for a row that never reached disk, because the lock released before the failed write propagated. The function reported done. The data was not there. That is the entire thesis of Claude Code verification in one bug: the claim was "saved," the outcome was "nothing saved," and only reading the real artifact told them apart.
A test proves the code did what you asked. It cannot prove you asked the right thing.
- the lesson, after 22 green tests lied to me
How do you verify a Claude Code task is done?
You verify a Claude Code task by running it against real conditions and reading the real result, not the agent's summary of it. Trigger it for real and note the time. Open the actual file, log, or response it should have changed. Then hand that result to whatever consumes it and confirm the consumer works. If you skipped any leg, the honest status is "untested," not "done."
The habit that pays off most: when an autonomous run hands you a closeout, do not read the summary, read the artifacts it points at. I learned this the hard way enough times that I now publish my own failures with the data attached, like the negative result where my agent fix made things slightly worse. The discipline is the same one I use when auditing my own correction-rate measurement: trust the artifact, distrust the claim about the artifact.
What I changed in my workflow
Three changes, all cheap. First, every completion claim now carries the three legs or it carries the word "untested." Second, verification runs against real system state and, where it matters, a separate reviewer in its own context, because an agent grading its own homework grades generously. Third, my closeouts include a short "deferred and untested" section that lists exactly which legs are unproven and what would close them.
None of this slows me down the way you would expect. It moves the slow part to the front, where a wrong answer is cheap, instead of the back, where it is expensive. Anthropic frames the broader version of this as trustworthy agents in practice - oversight and verification baked in, not bolted on. My version fits in one rule and a checklist, running on the open-source Claude Code foundation I keep at primeline-ai/evolving-lite. Done isn't done until the outcome says so.
![Claude Code Verification: Why 'Done' Isn't Done [2026]](/_next/image?url=%2Fblog%2Fclaude-code-verification-hero.webp&w=3840&q=75)
![Autonomous Claude Code Agent: 8 Layers That Stay Safe [2026]](/_next/image?url=%2Fblog%2Fautonomous-claude-code-agent-hero.webp&w=3840&q=75)

![Claude Code vs Cursor: Which AI Coding Tool Fits You? [2026]](/_next/image?url=%2Fblog%2Fclaude-code-vs-cursor-hero.webp&w=3840&q=75)