A few weeks ago I finished migrating my Claude Code setup from Opus 4.6 to Opus 4.7. Ten phases, several weeks of work, one critical question at the end: did the model actually behave more in line with my instructions like Anthropic claimed it would?
I had a metric for that. The rate at which I correct Claude during a session. The hypothesis was simple. If 4.7 follows instructions more literally, the rate at which I have to push back should drop. I set a success criteria of "at least a 25% reduction" before declaring the migration validated.
Then I ran the measurement and the number came back at +60 percentage points. The correction rate had supposedly gone UP, from 17% pre-migration to 76.9% post-migration. By the rule I set myself, that's not just a miss, it's a regression so big the entire migration looked broken.
The migration plan had flagged this scenario in advance. Either the hypothesis was wrong, or the methodology was. So I sat down to figure out which one before drawing any conclusion. Two bugs surfaced. Both in the measurement, not the model.
Want the foundational patterns first? The free 3-pattern guide covers memory, delegation, and knowledge graphs at concept level.
Bug 1: The Date Filter Never Filtered
The script that computes the rate, sc9-correction-rate.py, has a function compute_sc9(since_date) that takes a starting date and is supposed to count corrections only after that date.
It does not.
experiences = data.get("experiences", [])
total_count = len(experiences) # counts ENTIRE index, no window filter
The since_date parameter only affects a different counter, sessions since that date. It never gets applied to the actual experience set. So the "since 2026-04-23" number I'd been reading for weeks was secretly "since the beginning of recorded time".
This is the kind of bug that looks completely fine in code review and runs cleanly for months. The function signature reads as if the date filter is doing its job. The output looks plausible. No error, no warning, no anomaly. Just a number that's quietly answering a question you didn't ask.
I noticed it because the absolute count of "corrections in the last 24 days" was suspiciously close to the absolute count of "corrections this year". One should be a fraction of the other. They were the same.
Bug 2: The Baseline Was From a Different Tape
The bigger problem was the baseline.
The plan said: "Baseline = 17% of corrections over March-April 2026, pre auto-monitoring". Sensible enough. But that baseline number came from before I'd shipped a script called correction-detector.py v2. The v2 release on March 2nd added two things. A new pattern called repeated_mistake, and automatic logging of any detected correction to my experience store.
So pre-v2 was hand-counted, manually entered. Post-v2 was automated, with broader pattern matching, and auto-amplified by the writer hook. The "17%" number was measured on the old tape. The "76.9%" was measured on the new tape. Same metric name, completely different ruler.
Comparing them was like comparing a stopwatch reading from one race to a different stopwatch from another race, where the second one started counting earlier and recorded more events. The difference between the two numbers had almost nothing to do with the model. It was an instrument upgrade hiding inside a measurement.
What the Apples-to-Apples Comparison Actually Shows
To do this honestly, I needed two windows measured by the same v2 instrument. I went into the experience store and pulled timestamps from each individual file, then applied identical filters across two non-overlapping date windows.
The filter: type==gotcha AND tag IN (misunderstanding, repeated-mistake) with self-monitoring auto-gotchas excluded on both sides. In the post-window that filter pulled out 50 self-monitoring entries (the v2 detector firing on its own delegation-gap audit log, not real user corrections), leaving 149 user-driven entries. The pre-window had zero such self-monitoring entries to exclude.
| Window | Filter-eligible entries | User corrections | Rate |
|---|---|---|---|
| V2-era pre-4.7 (Mar 2 to Apr 22) | 128 | 121 | 94.5% |
| 4.7-migration (Apr 23 to May 17) | 149 | 74 | 49.7% |
Both windows use the v2 detector. Both apply the same filter. The pre-4.7 number is 94.5%, not 17%.
That makes the actual delta -44.8 percentage points absolute, or -47.4% relative. Almost twice the success threshold I'd set in the plan. The migration didn't fail. The metric was just measuring two different things and pretending they were one.
- -Pre-4.7 baseline: 17%
- -Post-4.7 rate: 76.9%
- -Delta: +60 percentage points (regression)
- -Verdict: FAIL_RATE_INCREASED, hypothesis invalidated
- +Pre-4.7 baseline: 94.5% (same v2 instrument)
- +Post-4.7 rate: 49.7%
- +Delta: -44.8 percentage points (-47.4% relative)
- +Verdict: PASS, hypothesis empirically supported
What I'm Not Claiming
I'm a sample of one running a single project. So the result here is real for my own workflow, not a generalized benchmark of Opus 4.7. A few honest caveats.
The denominators are small. 128 entries pre, 149 post (after the self-monitoring filter). Big enough that the delta is unlikely to be noise, but I'm not running statistical tests on a single-user dataset and calling it science.
My project mix shifted during the window. Pre-April 23 I was mostly building infrastructure: hooks, the delegation system, internal tooling. Post-April 23 was dominated by writing, deck prep, and content work, which is more well-trodden territory. A lower correction rate there could partly be "later work is on more familiar ground", not pure model effect. My guess is the model effect accounts for the majority of the delta but a single-digit-percentage slice is probably project-mix confound.
Window lengths are asymmetric. 52 days pre, 24 days post. Both are large enough that the rate should stabilize over the window, but it's worth flagging.
This lives in primeline-ai/evolving-lite - the self-evolving Claude Code plugin. Free, MIT, no build step.
The Lesson That Surprised Me
I run a fairly disciplined measurement setup. Plan documents the metric. Hook auto-logs the data. Script computes the rate. Closeout doc records the verdict. The whole loop is supposed to make this kind of mistake hard.
It wasn't hard. Two bugs, one trivial and one subtle, slipped through every checkpoint for weeks. The trivial one was a function parameter that wasn't actually being used. The subtle one was an instrument upgrade I forgot to mark as a methodology change.
The fix in both cases wasn't reading the code more carefully. It was looking at the absolute numbers and asking whether they made physical sense. "How can 24 days of corrections equal the total of every correction in the system?" was the question that broke Bug 1. "How can my correction rate jump from 17% to 95% on the same model without any other change?" should have been the question that broke Bug 2, but I'd been treating the 17% as a fixed truth instead of a measurement that could be wrong.
Whenever a number changes shape dramatically, before drawing any conclusion, the first move should be: did I change anything about the measurement? Not the model, not the workflow, not the prompts. The measurement itself. Because the instrument is software, and software changes silently.
I validated the migration. My correction rate dropped 47.4%. And I had to audit my own audit to find that out.
![Auditing My Claude Code Correction Rate Measurement [2026]](/_next/image?url=%2Fblog%2Fauditing-my-correction-rate-measurement-hero.webp&w=3840&q=75)
![A Negative Result on Claude Code Agent Self-Regulation [2026]](/_next/image?url=%2Fblog%2Fcsra-negative-result-hero.webp&w=3840&q=75)

