Every time Claude Code needed to search, review, or explore something, I had to make a call: handle it myself, or spin up a sub-agent? If sub-agent - which one? Which model? How complex is this task really?
That's five decisions before any actual work happens. Multiplied across a full session, the mental overhead adds up fast. And the decisions aren't free - getting them wrong either burns expensive Opus tokens on tasks Haiku could handle, or accidentally routes critical operations to an underpowered model.
I decided to make this a math problem instead of a judgment call.
Want the foundational patterns first? The free 3-pattern guide covers memory, delegation, and knowledge graphs at concept level.
The Problem: Manual Delegation Doesn't Scale
When I built out my first few agent types, manual routing was manageable. I knew the agents, I could roughly estimate complexity, the overhead was tolerable.
Then I had 50+ agent types.
At that scale, the decision tree becomes impossible to hold in working memory. Most developers hit one of two failure modes. The first: never delegating. Every task stays with the main agent, which means paying Opus rates for searches, explorations, and reviews that Haiku could handle at a fraction of the cost. The second: always delegating. Everything gets routed out, including the tasks where you actually need the main agent's full context and reasoning - critical deployments, sensitive configuration changes, complex architectural decisions that require careful judgment.
Neither extreme is right. What I needed was a system that could reliably distinguish between the two - without my involvement.
The Solution: A Scoring Formula for Claude Code Delegation
The core idea is straightforward: task characteristics map to points, and points determine whether to delegate and which model to use.
Each incoming message gets analyzed for characteristics. Some characteristics add points - the task involves searching across a codebase, or it's clearly independent from the current context, or it's a research question. Some characteristics subtract points - the message contains critical operation keywords, or the user is asking for an explanation rather than execution.
When the score crosses a threshold, delegation happens automatically. When it doesn't, the task stays with the main agent. No judgment call. No five decisions.
The threshold sits at three points. Below three: stay with Opus. At or above three: delegate. The number isn't arbitrary - it's calibrated to let a single strong delegation signal trigger automatically (exploration keywords score high enough alone), while requiring multiple weaker signals to combine before delegation fires.
Safety penalties are the critical design choice here. Certain keywords - deploy, production, payment, password - carry penalties large enough to override almost any combination of positive factors. A task can have every delegation signal firing, but if it touches critical operations, the penalty drives the score deep into negative territory. The formula can't route those tasks out accidentally.
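To make the mechanics concrete, here is a minimal sketch of the scoring idea. The keyword lists and point values are illustrative assumptions, not the actual tables from the hook, and it uses naive substring matching where the real implementation is presumably more careful:

```python
# Illustrative sketch of the delegation scoring formula.
# Point values and keyword lists are assumptions for demonstration only.

DELEGATION_SIGNALS = {
    "search": 3, "explore": 3,   # exploration keywords: strong enough to delegate alone
    "research": 2, "review": 2,  # weaker signals that must combine with others
}

SAFETY_PENALTIES = {
    "deploy": -10, "production": -10,   # critical-operation keywords override
    "payment": -10, "password": -10,    # almost any combination of positives
    "explain": -2,                      # user wants explanation, not execution
}

THRESHOLD = 3  # at or above: delegate; below: stay with the main agent

def score(message: str) -> int:
    """Sum points for every signal keyword found in the message."""
    text = message.lower()
    # Naive substring matching, purely for illustration.
    return sum(points
               for keyword, points in {**DELEGATION_SIGNALS, **SAFETY_PENALTIES}.items()
               if keyword in text)

def should_delegate(message: str) -> bool:
    return score(message) >= THRESHOLD
```

With these values, a single strong signal like "search" scores exactly 3 and fires on its own, while any critical keyword drives the total deep into negative territory regardless of the positives alongside it.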
A Concrete Example
User message: "search the codebase for all authentication patterns."
The hook reads this and scores it. Exploration and search are high-value delegation signals - they score high enough to cross the threshold on their own. The task is clearly independent from whatever else is in the session. The score lands well above three.
Model selection happens next. Complexity falls in the mid range - this is a structured search, not an architectural decision. That maps to Sonnet. Delegation fires, Sonnet handles the search, result comes back.
Now compare that to: "deploy the payment system."
The hook scores this too. There's an independent task signal, which adds points. But "deploy" and "payment" are both critical operation keywords. The penalties are aggressive by design. The total score goes sharply negative. The task stays with the main Opus agent, which has full session context and appropriate caution for irreversible operations.
Same formula, opposite outcomes. The difference isn't a rule I wrote about deployment - it's the penalty system making the math work correctly.
The Result
The before state was five decisions per task. Operator overhead on top of every piece of actual work.
The after state is zero decisions. The formula runs on every message in the background. Tasks that should be delegated get delegated. Tasks that should stay with Opus stay with Opus. Model selection follows from complexity scoring. I never think about it.
The cost impact is real. Haiku costs a fraction of Opus per token. When the system automatically routes simple searches and exploration tasks to Haiku, those tasks are both faster and cheaper - without any quality tradeoff for that class of work. The savings compound across a full session.
There's also something no competing framework does here. CrewAI uses role-based routing - you assign roles to agents manually, and tasks go to whoever has the matching role. LangGraph uses explicit state machines - you build graphs of nodes and edges that define legal transitions. Both approaches require upfront design work and don't adapt based on what the task actually is.
This system uses quantitative scoring. The task itself determines where it goes. No predefined roles. No explicit graph. Just arithmetic on task characteristics.
Before:

- 5 decisions per task (delegate? which agent? which model?)
- Opus tokens burned on simple searches
- Critical tasks accidentally sent to Haiku
- No learning from delegation patterns

After:

- Zero manual decisions - formula handles routing
- Haiku handles searches, Opus handles architecture
- Safety keywords block delegation of critical tasks
- Gap tracking reveals missed delegation opportunities
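Gap tracking deserves a quick sketch of its own, since it's what lets the system surface missed delegation opportunities. The log path, format, and near-miss margin here are all assumptions for illustration:

```python
# Sketch of gap tracking: log near-miss scores (just below the threshold)
# so missed delegation opportunities can be reviewed and recalibrated later.
# File path, margin, and record format are assumptions.

import json
import time

THRESHOLD = 3
NEAR_MISS_MARGIN = 2  # scores within this margin below the threshold get logged

def track_gap(message: str, score: int,
              log_path: str = "delegation_gaps.jsonl") -> None:
    """Append a JSONL record when a task almost, but not quite, delegated."""
    if THRESHOLD - NEAR_MISS_MARGIN <= score < THRESHOLD:
        with open(log_path, "a") as f:
            f.write(json.dumps({
                "ts": time.time(),
                "score": score,
                "message": message[:120],  # truncate to keep the log compact
            }) + "\n")
```

Reviewing this log periodically shows which kinds of tasks keep scoring a point or two short, which is the signal for adjusting the point tables.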
This is different from the broader multi-agent architecture post, which covers how to build and orchestrate a system of agents. That post is about structure. This post is about the decision mechanism - how the system knows, for any given message, whether to delegate at all and where to send it if so. The architecture and the router are complementary layers. It also connects to hook-based automation patterns - both use the same UserPromptSubmit event to intercept and process messages before Claude acts on them.
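For readers unfamiliar with Claude Code hooks, registration looks roughly like the snippet below in `.claude/settings.json` - the script path is hypothetical, and the exact schema should be checked against the current Claude Code hooks documentation:

```json
{
  "hooks": {
    "UserPromptSubmit": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "python3 .claude/hooks/delegation_router.py"
          }
        ]
      }
    ]
  }
}
```

The command receives the submitted prompt before Claude acts on it, which is the interception point both this router and the hook-based automation patterns rely on.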
The full implementation - complete scoring tables with all point values, the 983-line Python hook, model routing configuration, and the gap tracking setup that logs missed delegation opportunities - is part of Claude Code Mastery.
Want the full system blueprint? Get the free 3-pattern guide.