Auditing My Claude Code Correction Rate Measurement [2026]

A few weeks ago I finished migrating my Claude Code setup from Opus 4.6 to Opus 4.7. Ten phases, several weeks of work, one critical question at the end: did the model actually behave more in line with my instructions like Anthropic claimed it would?

I had a metric for that. The rate at which I correct Claude during a session. The hypothesis was simple. If 4.7 follows instructions more literally, the rate at which I have to push back should drop. I set a success criteria of "at least a 25% reduction" before declaring the migration validated.

Then I ran the measurement and the number came back at +60 percentage points. The correction rate had supposedly gone UP, from 17% pre-migration to 76.9% post-migration. By the rule I set myself, that's not just a miss, it's a regression so big the entire migration looked broken.

The migration plan had flagged this scenario in advance. Either the hypothesis was wrong, or the methodology was. So I sat down to figure out which one before drawing any conclusion. Two bugs surfaced. Both in the measurement, not the model.

Want the foundational patterns first? The free 3-pattern guide covers memory, delegation, and knowledge graphs at concept level.

94.5%pre-4.7 baseline (real)

49.7%post-4.7 rate

-47.4%relative reduction

2methodology bugs found

Bug 1: The Date Filter Never Filtered

The script that computes the rate, sc9-correction-rate.py, has a function compute_sc9(since_date) that takes a starting date and is supposed to count corrections only after that date.

It does not.

code

experiences = data.get("experiences", [])
total_count = len(experiences)        # counts ENTIRE index, no window filter

The since_date parameter only affects a different counter, sessions since that date. It never gets applied to the actual experience set. So the "since 2026-04-23" number I'd been reading for weeks was secretly "since the beginning of recorded time".

This is the kind of bug that looks completely fine in code review and runs cleanly for months. The function signature reads as if the date filter is doing its job. The output looks plausible. No error, no warning, no anomaly. Just a number that's quietly answering a question you didn't ask.

I noticed it because the absolute count of "corrections in the last 24 days" was suspiciously close to the absolute count of "corrections this year". One should be a fraction of the other. They were the same.

Bug 2: The Baseline Was From a Different Tape

The bigger problem was the baseline.

The plan said: "Baseline = 17% of corrections over March-April 2026, pre auto-monitoring". Sensible enough. But that baseline number came from before I'd shipped a script called correction-detector.py v2. The v2 release on March 2nd added two things. A new pattern called repeated_mistake, and automatic logging of any detected correction to my experience store.

So pre-v2 was hand-counted, manually entered. Post-v2 was automated, with broader pattern matching, and auto-amplified by the writer hook. The "17%" number was measured on the old tape. The "76.9%" was measured on the new tape. Same metric name, completely different ruler.

Comparing them was like comparing a stopwatch reading from one race to a different stopwatch from another race, where the second one started counting earlier and recorded more events. The difference between the two numbers had almost nothing to do with the model. It was an instrument upgrade hiding inside a measurement.

What the Apples-to-Apples Comparison Actually Shows

To do this honestly, I needed two windows measured by the same v2 instrument. I went into the experience store and pulled timestamps from each individual file, then applied identical filters across two non-overlapping date windows.

The filter: type==gotcha AND tag IN (misunderstanding, repeated-mistake) with self-monitoring auto-gotchas excluded on both sides. In the post-window that filter pulled out 50 self-monitoring entries (the v2 detector firing on its own delegation-gap audit log, not real user corrections), leaving 149 user-driven entries. The pre-window had zero such self-monitoring entries to exclude.

Window	Filter-eligible entries	User corrections	Rate
V2-era pre-4.7 (Mar 2 to Apr 22)	128	121	94.5%
4.7-migration (Apr 23 to May 17)	149	74	49.7%

Both windows use the v2 detector. Both apply the same filter. The pre-4.7 number is 94.5%, not 17%.

That makes the actual delta -44.8 percentage points absolute, or -47.4% relative. Almost twice the success threshold I'd set in the plan. The migration didn't fail. The metric was just measuring two different things and pretending they were one.

What the Broken Measurement Said

-Pre-4.7 baseline: 17%
-Post-4.7 rate: 76.9%
-Delta: +60 percentage points (regression)
-Verdict: FAIL_RATE_INCREASED, hypothesis invalidated

What Apples-to-Apples Actually Shows

+Pre-4.7 baseline: 94.5% (same v2 instrument)
+Post-4.7 rate: 49.7%
+Delta: -44.8 percentage points (-47.4% relative)
+Verdict: PASS, hypothesis empirically supported

What I'm Not Claiming

I'm a sample of one running a single project. So the result here is real for my own workflow, not a generalized benchmark of Opus 4.7. A few honest caveats.

The denominators are small. 128 entries pre, 149 post (after the self-monitoring filter). Big enough that the delta is unlikely to be noise, but I'm not running statistical tests on a single-user dataset and calling it science.

My project mix shifted during the window. Pre-April 23 I was mostly building infrastructure: hooks, the delegation system, internal tooling. Post-April 23 was dominated by writing, deck prep, and content work, which is more well-trodden territory. A lower correction rate there could partly be "later work is on more familiar ground", not pure model effect. My guess is the model effect accounts for the majority of the delta but a single-digit-percentage slice is probably project-mix confound.

Window lengths are asymmetric. 52 days pre, 24 days post. Both are large enough that the rate should stabilize over the window, but it's worth flagging.

This lives in primeline-ai/evolving-lite - the self-evolving Claude Code plugin. Free, MIT, no build step.

The Lesson That Surprised Me

I run a fairly disciplined measurement setup. Plan documents the metric. Hook auto-logs the data. Script computes the rate. Closeout doc records the verdict. The whole loop is supposed to make this kind of mistake hard.

It wasn't hard. Two bugs, one trivial and one subtle, slipped through every checkpoint for weeks. The trivial one was a function parameter that wasn't actually being used. The subtle one was an instrument upgrade I forgot to mark as a methodology change.

The fix in both cases wasn't reading the code more carefully. It was looking at the absolute numbers and asking whether they made physical sense. "How can 24 days of corrections equal the total of every correction in the system?" was the question that broke Bug 1. "How can my correction rate jump from 17% to 95% on the same model without any other change?" should have been the question that broke Bug 2, but I'd been treating the 17% as a fixed truth instead of a measurement that could be wrong.

Whenever a number changes shape dramatically, before drawing any conclusion, the first move should be: did I change anything about the measurement? Not the model, not the workflow, not the prompts. The measurement itself. Because the instrument is software, and software changes silently.

I validated the migration. My correction rate dropped 47.4%. And I had to audit my own audit to find that out.

FAQ

What is the correction rate metric for Claude Code?+

It is the rate at which I push back on Claude Code mid-session: telling it to stop, redo, or take a different approach. Counted as user-correction experiences over the eligible filter window. Used as a proxy for instruction-following quality on a personal workflow.

Did Opus 4.7 actually reduce corrections in Claude Code?+

Yes, by 47.4 percent relative on an apples-to-apples comparison. The pre-4.7 baseline was 94.5 percent (V2-instrumented window, March 2 to April 22). The post-4.7 rate was 49.7 percent (April 23 to May 17). Both windows used the same v2 correction detector and the same filter.

What were the two methodology bugs?+

Bug 1: the date-filter parameter was never applied to the experience set, so the rolling window silently became 'since beginning of recorded time'. Bug 2: the baseline was measured pre-V2 instrument upgrade (hand-counted), while the post-migration rate used V2 auto-detection. Same metric name, different ruler.

Why was the original number 60 pp regression wrong?+

The pre-baseline of 17 percent came from a different measurement instrument than the post-rate of 76.9 percent. Comparing them was comparing a stopwatch from one race to a different stopwatch from another. After re-measuring both windows with the same v2 detector and the same filter, the real delta was minus 47.4 percent relative.

How can I avoid this kind of bug in my own Claude Code measurements?+

When a number changes shape dramatically, first ask: did I change the measurement? Not the model or workflow, the instrument. Software changes silently. Also: check that absolute counts make physical sense (24 days of data should not equal lifetime total of the same metric).

Is this a generalised benchmark of Opus 4.7?+

No. It is a single-user, single-project workflow comparison. The 47.4 percent relative reduction is real for my Claude Code setup but is not a population-level benchmark. Denominators are small (128 pre, 149 post), project mix shifted across windows, and window lengths are asymmetric (52 days pre, 24 days post).

Why publish the audit instead of just fixing the number quietly?+

The migration plan recorded a fixed pre-baseline. When the post-measurement contradicted it dramatically, the choice was either to revise silently or to publish the audit. Publishing the audit keeps the record honest and the methodology reusable. It also lets the reader weight future correction-rate claims accordingly.

Auditing My Claude Code Correction Rate Measurement [2026]

Bug 1: The Date Filter Never Filtered

Bug 2: The Baseline Was From a Different Tape

What the Apples-to-Apples Comparison Actually Shows

What I'm Not Claiming

The Lesson That Surprised Me

FAQ

>_ Related

How to Build a Claude Code Plugin: Real Example [2026]

How I Mapped My Self-Improving Claude Code System [2026]

Autonomous Claude Code Agent: 8 Layers That Stay Safe [2026]