A Negative Result on Claude Code Agent Self-Regulation [2026]

In March 2026 I designed a Cognitive Self-Regulation Architecture (CSRA) for autonomous Claude Code agents. The idea was simple: instead of letting agents apply the same generic "try again with more reasoning" response to every failure, diagnose the failure type first, then apply a matched intervention. Suspension State for premature commitment. Shadow Buffer for rejected alternatives. Einstellung Breaker for strategy lock-in. Plus a closed Learning Loop that updates the intervention policy over time. The design is grounded in Zimmerman's Self-Regulated Learning theory and was motivated by recent Anthropic work on activation vectors driving reward-hacking behaviour.

I ran two head-to-head evaluations against an unmodified baseline. Same model, same harness, only the system prompt changed. The numbers said no.

-0.8 pp: SWE-bench Pro delta (n approx. 27)

-2.0 pp: SWE-EVO delta (n = 48)

40%: baseline failures from IMPORT_ERROR

~0%: baseline failures from drift (what CSRA targets)

What CSRA was supposed to do

The motivating analogy was a physician who prescribes aspirin for every complaint instead of diagnosing first. Modern coding agents do roughly the same thing. They plan, they act, they reflect, but the reflection is undifferentiated. CSRA added four components on top of the standard loop:

A self-model with the agent's strengths, known weaknesses, and OCEAN-style traits.
A diagnostic monitor that maps observed behaviour onto eight typed failure categories.
A matched intervention repertoire, one intervention per failure type, drawn from cognitive psychology.
A learning loop that observed which interventions resolved which failure types and updated the selection policy over time.
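
CSRA itself was prompt text, not code, but the diagnose-then-match idea behind the four components can be illustrated with a small sketch. Everything below is an assumption made for illustration: the enum, the fallback string, and the LearningLog class are hypothetical names, and only the three failure-to-intervention pairings named in this post are filled in (the full eight-category taxonomy is not listed here).

```python
from collections import defaultdict
from enum import Enum


class FailureType(Enum):
    # Only three of the eight categories are named in the post.
    PREMATURE_COMMITMENT = "premature_commitment"
    REJECTED_ALTERNATIVES = "rejected_alternatives"
    STRATEGY_LOCK_IN = "strategy_lock_in"
    OTHER = "other"  # hypothetical placeholder for the unnamed categories


# Matched repertoire: one intervention per named failure type.
INTERVENTIONS = {
    FailureType.PREMATURE_COMMITMENT: "Suspension State",
    FailureType.REJECTED_ALTERNATIVES: "Shadow Buffer",
    FailureType.STRATEGY_LOCK_IN: "Einstellung Breaker",
}
GENERIC_FALLBACK = "generic reflection"  # the undifferentiated default


def select_intervention(failure: FailureType) -> str:
    """Return the matched intervention, or the generic fallback."""
    return INTERVENTIONS.get(failure, GENERIC_FALLBACK)


class LearningLog:
    """Counts how often each (failure, intervention) pair resolved a failure."""

    def __init__(self) -> None:
        self._stats = defaultdict(lambda: [0, 0])  # [resolved, attempts]

    def record(self, failure: FailureType, intervention: str, resolved: bool) -> None:
        entry = self._stats[(failure, intervention)]
        entry[0] += int(resolved)
        entry[1] += 1

    def success_rate(self, failure: FailureType, intervention: str) -> float:
        resolved, attempts = self._stats.get((failure, intervention), (0, 0))
        return resolved / attempts if attempts else 0.0
```

A selection policy could then prefer the intervention with the highest recorded success rate for a diagnosed failure type. That policy step is left out because the post does not describe it.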

The hypothesis (H1): augmenting the system prompt with this scaffold raises task resolve rate by at least 5 percentage points on multi-file long-horizon coding benchmarks. Pre-registered falsification threshold: if the delta is less than 5 pp or has the wrong sign, the project is closed without publication of a positive paper.

Methodology

SWE-bench Pro pilot (Apple Silicon, n approximately 27)

Dataset: ScaleAI/SWE-bench_Pro, single-file bug-fix subset, public split. Tasks were restricted to instances whose Docker images pulled successfully on the available infrastructure. The agent was Claude Code CLI in headless mode against my authenticated Max subscription, running Sonnet 4.x as of March 2026; the exact build was not pinned in the run JSON. Scoring used the standard SWE-bench Pro evaluator with fail-to-pass and pass-to-pass test execution against pre-built per-instance Docker images.

Two prompt conditions, identical otherwise:

Baseline: a generic software-engineering system prompt.
CSRA-Full: identical baseline plus the CSRA scaffold (Suspension State, Shadow Buffer, Einstellung Breaker, Learning Log).

The Apple Silicon part matters. SWE-bench Pro images are linux/amd64. Running them under Rosetta on the M-series Mac caused container crashes, 5x to 10x slower test execution, and silent Docker disk fill at any meaningful evaluation volume. About 40 to 80 percent of evaluation attempts failed for infrastructure reasons, not for code reasons. After this surfaced I migrated the main run to native x86. The pilot stayed on the Mac.

SWE-EVO main run (native x86 Hetzner, n = 48)

Dataset: Fsoft-AIC/SWE-EVO, paper arXiv:2512.18470, loaded via Dataset.from_file() from the Arrow split. Average 21 source files per task, 2390-word task specifications. GPT-5 scores around 21 percent on this benchmark versus 65 percent on SWE-bench Verified, which is what motivated me to use it as the more discriminating benchmark for self-regulation interventions.

Hardware: Hetzner CPX42, 8 vCPU AMD, 16 GB RAM, 320 GB SSD, Helsinki, Ubuntu 24.04, EUR 19.99 per month. Native x86 Docker, pre-built SWE-EVO images under xingyaoww/ on Docker Hub and ghcr.io/epoch-research/. Working directory inside containers is /testbed/.

Both conditions ran the full 48-task test split. No infrastructure attrition. Each condition got the same set of task instances. Both ran end-to-end before any analysis.

Isolation

To prevent contamination from my own Evolving system (about ten thousand tokens of CLAUDE.md, hooks, MCP servers, skills), every run executed in a clean working directory with no .claude/rules/, no active hooks, no MCP servers beyond what the harness required, and no CLAUDE.md in the working directory or any ancestor. The only independent variable between conditions was the system-prompt content.

One isolation bug worth flagging: overriding the HOME env var when launching Claude Code as a subprocess breaks authentication, because the CLI looks for credentials in ~/.config and ~/.anthropic. Isolation had to be achieved via a clean cwd plus an empty .claude/settings.json instead of rewriting HOME.
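
A minimal sketch of that isolation approach, under stated assumptions: prepare_isolated_run is a hypothetical helper, not the actual runner script, and it assumes the system temp directory has no CLAUDE.md in any ancestor. It creates a fresh working directory with an empty .claude/settings.json and returns subprocess keyword arguments that leave HOME untouched.

```python
import json
import os
import tempfile
from pathlib import Path


def prepare_isolated_run(base_dir=None):
    """Create a clean working directory and subprocess kwargs for one run.

    Hypothetical helper: isolation comes from the working directory,
    not from rewriting HOME.
    """
    workdir = Path(tempfile.mkdtemp(prefix="csra_run_", dir=base_dir))

    # Empty project-level settings, so no hooks or rules are picked up here.
    settings = workdir / ".claude" / "settings.json"
    settings.parent.mkdir(parents=True)
    settings.write_text(json.dumps({}))

    # Copy the environment unchanged: HOME stays as-is so CLI auth keeps working.
    env = dict(os.environ)
    return {"cwd": str(workdir), "env": env}
```

The returned dictionary can be splatted into subprocess.run(cmd, **kwargs), where cmd is whatever CLI invocation the harness uses. The command itself is deliberately not shown here.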

Results

SWE-bench Pro

Condition	Resolve rate	Delta vs baseline
Baseline (standard agent)	22.2 percent	reference
CSRA-Full	21.4 percent	minus 0.8 pp

22.2 percent of approximately 27 tasks corresponds to about 6 resolved. 21.4 percent of approximately 28 tasks corresponds to about 6 resolved. Both conditions resolved roughly the same raw count; the percentage difference reflects different total-evaluated counts caused by differing Docker pull failures across conditions. The delta is below the falsification threshold and has the wrong sign. The pilot was treated as suggestive, not conclusive, because of the Apple Silicon attrition.
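
The denominator effect described above can be checked with a few lines of arithmetic. This is a sketch that assumes the approximate counts from the text (6 resolved out of 27 for the baseline, 6 out of 28 for CSRA-Full); the exact counts are not stated.

```python
def resolve_rate_pct(resolved: int, total: int) -> float:
    """Resolve rate in percent."""
    return 100.0 * resolved / total


# Assumed counts, taken from the approximate figures in the text.
baseline_pct = resolve_rate_pct(6, 27)
csra_pct = resolve_rate_pct(6, 28)

# Same raw count of resolved tasks; only the denominators differ.
delta_pp = csra_pct - baseline_pct

print(f"baseline {baseline_pct:.1f}%, CSRA-Full {csra_pct:.1f}%, delta {delta_pp:+.1f} pp")
```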

SWE-EVO

Condition	Resolved	Total	Resolve rate	Delta vs baseline
Baseline (standard agent)	10	48	20.8 percent	reference
CSRA-Full	9	48	18.8 percent	minus 2.0 pp

10 divided by 48 is 0.20833, which rounds to 20.8 percent. 9 divided by 48 is 0.1875, which rounds to 18.8 percent under half-up rounding. The unrounded difference is 20.83 minus 18.75, or 2.08 pp; the reported minus 2.0 pp is the difference of the two rounded rates (20.8 minus 18.8). Either way, the delta is below the falsification threshold and has the wrong sign. With the Mac pilot results corroborating, the project closed.
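
The pre-registered decision rule applied to these counts can be written out as a short sketch. The function names are hypothetical; the counts and the 5 pp threshold come from the text.

```python
THRESHOLD_PP = 5.0  # pre-registered minimum improvement, in percentage points


def delta_pp(baseline_resolved: int, treatment_resolved: int, total: int) -> float:
    """Treatment minus baseline resolve rate, in percentage points."""
    return 100.0 * (treatment_resolved - baseline_resolved) / total


def h1_supported(delta: float, threshold: float = THRESHOLD_PP) -> bool:
    """H1 holds only if the delta is positive and reaches the threshold."""
    return delta >= threshold


# SWE-EVO main run: 10 of 48 resolved (baseline) versus 9 of 48 (CSRA-Full).
swe_evo_delta = delta_pp(10, 9, 48)  # unrounded: about -2.08 pp
verdict = h1_supported(swe_evo_delta)  # False: wrong sign and below threshold
```

Because the threshold is positive, a single comparison covers both falsification conditions: a negative delta fails it automatically.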

Raw per-task prediction files and full evaluation logs were preserved offline for replication and audit purposes. The public SWE-EVO test split plus the prompts in the appendix are sufficient to reproduce the run end-to-end on independent infrastructure.

The failure taxonomy is where the result becomes interesting

On the SWE-EVO baseline run (n = 48), I classified each failed task by mode:

Failure mode	Share of failures
IMPORT_ERROR (broke imports by editing wrong modules)	40 percent
NO_PATCH (capability limit, could not generate any code)	25 percent
TYPE_ERROR (wrong argument types)	10 percent
WRONG_API (used wrong attribute or method)	6 percent
WRONG_LOGIC (code runs but wrong result)	6 percent
RESOLVED	6 percent
Over-scoping or drift (the failure mode CSRA targets)	near 0 percent

The dominant failure modes (IMPORT_ERROR plus NO_PATCH plus TYPE_ERROR, together 75 percent) are all code-structure errors or capability limits. The failure mode CSRA was specifically designed to prevent (strategy lock-in and goal drift on long-horizon tasks) accounted for approximately zero baseline failures on this benchmark.
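
The comparison between the dominant modes and the targeted mode is simple to reproduce from the table. In this sketch the shares are copied from the table, the RESOLVED row is left out because it is not a failure mode, "near 0 percent" is entered as 0, and the DRIFT key and the grouping names are assumptions.

```python
# Shares of baseline failures in percent, copied from the table above.
BASELINE_FAILURE_SHARES = {
    "IMPORT_ERROR": 40,
    "NO_PATCH": 25,
    "TYPE_ERROR": 10,
    "WRONG_API": 6,
    "WRONG_LOGIC": 6,
    "DRIFT": 0,  # reported as "near 0 percent"
}

DOMINANT_MODES = ("IMPORT_ERROR", "NO_PATCH", "TYPE_ERROR")
CSRA_TARGETED_MODES = ("DRIFT",)


def combined_share(shares: dict, modes: tuple) -> int:
    """Sum the percentage shares of the given failure modes."""
    return sum(shares[mode] for mode in modes)


dominant_share = combined_share(BASELINE_FAILURE_SHARES, DOMINANT_MODES)
targeted_share = combined_share(BASELINE_FAILURE_SHARES, CSRA_TARGETED_MODES)
```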

That is the load-bearing piece of secondary evidence. CSRA did not just fail to lift the resolve rate. It targeted a failure mode that was not happening.

Why the hypothesis failed

CSRA was designed for one specific failure type: long-horizon drift, where an agent loses track of the original goal across many steps, fixates on a partial solution, or cannot recover from a wrong early commitment. SWE-bench Pro single-file bug fixes and SWE-EVO multi-file evolution tasks both turned out to be dominated by failure modes that this design does not address.

A prompt scaffold telling the agent to "monitor for strategy lock-in every five actions" does nothing for an agent that does not understand which module to edit in the first place.

A blunt line from the internal notes captured the outcome:

Like testing a helmet while walking.

The helmet was designed for falls. The agent did not fall. It walked into the wrong room.

Two earlier failures of the CSRA prompt that sharpen the result

Worth surfacing because they make the negative finding stronger, not weaker.

v1: the prompt collapsed the output channel

CSRA v1 was 3918 characters. On a 5-task pilot it produced 0 patches versus baseline's 3, because the agent spent its output-token budget on monitoring and reporting bookkeeping rather than on emitting diffs. The fix was a single explicit reframing:

CRITICAL: Your primary output MUST be a working code patch. Use the self-regulation system below INTERNALLY to guide your work; do NOT spend output tokens on CSRA bookkeeping.

After this, v2 no longer collapsed the output channel. The final n = 48 main run used the corrected version.

Early 50-task run: the harness was wrong

An early 50-task dev-set run produced baseline 50 percent versus CSRA 20 percent, a net difference of 15 tasks (25 versus 10 resolved). The root cause was not that CSRA was worse, but that the harness ran claude -p in single-shot print mode, which prevents interactive file reading and tool use. SWE-bench tasks structurally require agentic tool use. After switching to an interactive harness with full tool access, both conditions converged to roughly 20 percent, as reported above.

The final negative result is with both the prompt bug and the harness bug fixed. CSRA still showed no measurable improvement.

Contrast: PsychAgent was a positive result on a different task class

Before running CSRA on coding benchmarks I had data showing one paragraph of personality text could measurably change agent behaviour on ambiguous tasks. The PsychAgent benchmark (59 runs, 6 personalities, 5 stress scenarios, Claude Opus 4.6, all raw outputs preserved) found:

On scenario S4, a Curious personality discovered approximately 6x more security issues than a Control with no personality.
A Perfectionist personality never engaged in reward hacking on impossible tasks. It redefined the success criterion instead of accepting it.
A Composed personality scored 0.83 overall and was adopted as the default in subsequent production work.

The contrast is not "prompts work" versus "prompts do not work". The contrast is task-class shaped.

PsychAgent tasks have high latitude. Many plausible actions. No benchmark-scored ground truth from one specific patch. SWE-bench Pro and SWE-EVO tasks have low latitude. One correct patch. Scored by automated tests. The same prompt-level lever moves behaviour on the former and does not move resolve rate on the latter.

This is consistent with a simple reading: prompt-level instruction shapes how an agent fills available degrees of freedom. When the task offers little freedom (write the correct patch or fail the test), there is no degrees-of-freedom dial to turn. When the task offers high freedom (decide what to look at next in a security audit), the dial turns.

Implications for prompt-engineering-as-safety

Some of the contemporary literature on metacognitive safety implicitly treats prompt-level scaffolding as a candidate intervention for sycophancy, reward hacking, and goal drift. Liu and van der Schaar (ICML 2025, arXiv:2506.05109) argue for intrinsic metacognitive mechanisms at the learning level. The negative result here suggests two narrower implications:

Prompt-level scaffolding can shift behaviour on tasks where the agent has latitude (ambiguous, multi-step, judgment-laden). For these tasks, one-paragraph personality and metacognitive prompts may be a low-cost lever.
Prompt-level scaffolding does not move resolve rate on tasks where the agent has little latitude (single-correct-patch coding benchmarks). For these tasks, the lever is somewhere else (architecture, retrieval, tool-use scaffolding, or model capability), not in the system prompt.

This is a sharper version of the older "scaffolding versus model" debate, refocused on the task class.

Honest scope of the negative result

The negative result is specific:

It is a result about prompt-level metacognitive scaffolding for one model family running one harness on two specific coding benchmarks. It does not rule out architecture-level metacognition (hooks, external monitors, real-time tool-call inspection), only the prompt-level form.
It is a result about coding benchmarks dominated by single-correct-patch tasks. It is consistent with, and does not rule out, the positive result on PsychAgent for ambiguous tasks.
It is a single set of experiments, not a meta-study. A second independent attempt with a different harness or a different benchmark suite (Hell or High Water, Recovery-bench, MiRA) might still find an effect on a task class better matched to the architecture's design intent.
It does not rule out CSRA's value as a design framework for the underlying psychology mappings (Shadow, Suspension, Einstellung). The mappings are useful even when the prompt-level packaging does not move the resolve dial.

Why I am publishing this

The research community implicitly defaults toward publishing positive results from prompt-level interventions. The selection bias has known consequences for the apparent strength of these methods. A clean negative result on a well-defined task class with verified numbers and a failure-mode taxonomy is a small counterweight.

The taxonomy in section "The failure taxonomy is where the result becomes interesting" is also reusable on its own. It shows where coding agents actually fail in March 2026 on a multi-file long-horizon benchmark, and the answer is "code-structure understanding", not "metacognition".

The contrast with PsychAgent is the most useful single take-away. It points researchers toward task-class-stratified evaluation when measuring prompt-level interventions. The right unit of analysis is not "does this prompt help" but "does this prompt help on this class of task".

This pairs with my earlier publication of a methodology audit on my own correction-rate measurement, which corrected a previous claim of mine when I found two bugs in the measurement. Same principle. Publishing the audit and the negative result is the same job as publishing the positive one.

What remains open

CSRA at the architecture level (hooks, real-time interventions, external monitors) on the same benchmarks. Untested.
CSRA scaffolding on benchmarks designed around recovery, stuck-state escape, or strategy switching (Hell or High Water, MiRA, AgentErrorTaxonomy). Untested.
The Learning Loop component (C4 in the architecture). It was never reached on these benchmarks because the per-task resolve rate stayed too low for the loop to have meaningful data to learn from. Read the result above as a falsification of the first three components, not of C4. C4 remains genuinely untested.
Whether a hybrid (chemotactic optimisation, validated as positive on game-of-24, plus CSRA components as failure-recovery extensions) outperforms either alone.

If you are working on agent self-regulation and want to skip the helmet-while-walking trap, the suggestion is: classify your benchmark's failure modes before designing the intervention. If drift and lock-in are not in the top three, the prompt scaffold is not what you need.
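
That check can be run before any intervention is designed. The sketch below is hypothetical: it assumes each failed baseline task has already been given one failure-mode label, and the sample labels are made up for illustration and are not the study's data.

```python
from collections import Counter


def targeted_modes_in_top_k(labels, targeted, k=3):
    """Return True if any targeted failure mode ranks among the k most common."""
    top_modes = [mode for mode, _count in Counter(labels).most_common(k)]
    return any(mode in targeted for mode in top_modes)


# Made-up labels for illustration only.
example_labels = (
    ["IMPORT_ERROR"] * 8
    + ["NO_PATCH"] * 5
    + ["TYPE_ERROR"] * 2
    + ["DRIFT"] * 1
)

worth_building = targeted_modes_in_top_k(example_labels, {"DRIFT", "LOCK_IN"})
```

With these labels the result is False, so a drift-focused scaffold would be aimed at a failure mode outside the top three.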

Sources and reproducibility notes

Every empirical claim in this post is traced to a source in the underlying paper-source document, with arithmetic verification. The CSRA-full prompt text and the runner script live in the private research repository and are available on request for independent replication. The public SWE-EVO test split plus the prompts are sufficient to rerun the harness on independent infrastructure.

No inferential statistic is computed. With approximately 6 resolves per condition on the pilot and 9 to 10 per condition on the main run, no significance test would be informative at this sample size. The pre-registered falsification threshold (an absolute delta of at least plus 5 percentage points) is the rule the result is evaluated against.
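
As an illustration only, and not part of the original analysis, a Wilson score interval shows how wide the uncertainty is at this sample size. The sketch assumes a 95 percent level (z = 1.96) and uses the SWE-EVO counts from the results table; the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Baseline on SWE-EVO: 10 resolved out of 48.
low, high = wilson_interval(10, 48)

# The CSRA-Full rate (9 of 48) sits well inside the baseline interval.
csra_rate = 9 / 48
inside = low < csra_rate < high
```

The interval for the baseline alone spans more than 20 percentage points, which is why a one-task difference between conditions cannot be separated from noise, and why the pre-registered threshold is the criterion used instead.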

FAQ

What is CSRA?

CSRA (Cognitive Self-Regulation Architecture) is a prompt-level scaffold for autonomous Claude Code agents. It adds Suspension State, Shadow Buffer, Einstellung Breaker, and a Learning Loop to the standard agent prompt. The hypothesis was that diagnosing the failure type and applying a matched intervention would beat undifferentiated reflection.

Did CSRA improve coding-agent performance on SWE-bench?

No. Two head-to-head evaluations on Claude Code agents showed deltas of minus 0.8 percentage points on SWE-bench Pro (n approximately 27) and minus 2.0 percentage points on SWE-EVO (n = 48). Both deltas fail the pre-registered falsification threshold of plus 5 percentage points.

Why did the hypothesis fail?

The failure-mode taxonomy on the baseline showed 40 percent IMPORT_ERROR, 25 percent NO_PATCH, 10 percent TYPE_ERROR. These are code-structure and capability failures. The failure mode CSRA was designed to prevent (drift, strategy lock-in on long-horizon tasks) accounted for approximately zero baseline failures. A prompt scaffold for drift cannot help with import errors.

Does this rule out prompt-level scaffolding entirely?

No. The same author has a positive result on the PsychAgent benchmark for ambiguous tasks (Curious personality found approximately 6x more security issues than Control). Prompt-level scaffolding moves behaviour on high-latitude tasks. It does not move resolve rate on low-latitude single-correct-patch tasks. The right unit of analysis is task class, not Claude Code prompts in general.

What does this mean for AI safety prompt engineering?

Prompt-level safety interventions (anti-sycophancy, anti-reward-hacking, anti-drift) likely work where the task gives the agent latitude (judgment-laden, ambiguous, multi-step), and likely do not work where the task is one-correct-answer. Stratify your benchmark before designing the prompt intervention.

Was the CSRA Learning Loop tested?

No. Read this paper as a falsification of the first three CSRA components (Suspension, Shadow, Einstellung at the prompt level). The Learning Loop (C4) was never reached because the per-task resolve rate stayed too low for the policy-update loop to have meaningful data. C4 remains genuinely untested.

Why publish a negative result?

Selection bias toward positive results in prompt-engineering literature has known consequences. A clean negative result with verified numbers and a failure-mode taxonomy is a small counterweight. The contrast with PsychAgent points researchers toward task-class-stratified evaluation, which is the most reusable take-away from this work.