
From 1 to 81 Agents: Building a Sub-Agent System in Claude Code

Robin · 5 min read
Last updated: March 9, 2026
agents · claude-code · delegation · architecture

The First Agent Was Terrible

My first Claude Code agent was terrible. It was supposed to analyze code structure, but it kept getting lost in irrelevant files, generated walls of text nobody asked for, and ignored the actual question half the time.

My 81st agent is great. It knows exactly what to do, when to stop, and how to hand off context cleanly. The difference was not luck or better prompts. The difference was a system.

This is the story of building a sub-agent delegation system in Claude Code - the architecture, the trade-offs, and the lessons learned at scale.

Want the blueprint? Get the free 3-pattern guide that covers memory, delegation, and knowledge graphs.

The Single-Agent Ceiling

Here is what happens when you scale Claude Code without delegation:

Every request goes to Opus, the most expensive model. Your context window fills up with conversation history, code snippets, and system instructions. By session 3, you are at 80% capacity. By session 5, you start forgetting things. You notice the AI suggesting the same fix it tried 20 minutes ago.

The problem is not the AI. The problem is treating Claude Code like a single-threaded system when it should be orchestrated.

The breakthrough was realizing that not every task needs the full context. Most tasks are atomic. Search for a file? That is a two-minute Haiku job. Review code quality? Sonnet can do it in fresh context, no history pollution. The main agent should be the conductor, not the performer.

Single Agent

  • Every task runs in Opus
  • Context fills up by session 3
  • AI forgets and repeats itself
  • Cost: $0.40+ per task

Multi-Agent System

  • Tasks route to the right model
  • Fresh context per delegation
  • Clean results, no pollution
  • Cost: $0.02 for simple tasks

Score-Based Delegation: The Core Idea

I built a score-based delegation system. Every incoming request gets scored automatically. Score above the threshold? Delegate. Below? Handle it yourself.

Here is the scoring table I use:

| Factor | Points |
|--------|--------|
| Scope > 2 files | +2 |
| Bulk operation | +2 |
| Research/learning task | +2 |
| Code review | +2 |
| Exploration/search | +3 |
| Independent task | +2 |

And the deductions that keep things safe:

| Factor | Points |
|--------|--------|
| Critical keywords (production, deploy, password) | -10 |
| User wants to observe ("show me", "explain") | -5 |
| Complexity > 6 | -3 |

CORE PRINCIPLE

The threshold sits at 3 points. Below 3: stay with Opus. At or above 3: delegate. Safety keywords carry a -10 penalty - large enough to override any combination of positive factors.

Some tasks always delegate regardless of score. File searches go to a fast Explore agent, debugging goes to a specialized debugger, planning goes to a dedicated planner. These are hard rules. When I type "find all API calls", the Explore agent spins up before I finish the sentence.
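The scoring tables, the threshold, and the hard rules above can be sketched in a few lines. This is a minimal illustration, not the production config: the `Task` shape, the keyword lists, and the task-type names are assumptions made for the example.

```python
# Sketch of the score-based delegation check. Weights mirror the
# tables in this post; everything else is illustrative.
from dataclasses import dataclass

CRITICAL_KEYWORDS = {"production", "deploy", "password"}
OBSERVE_PHRASES = {"show me", "explain"}
# Hard rules: these task kinds always delegate, regardless of score.
ALWAYS_DELEGATE = {"file_search": "explore", "debugging": "debugger", "planning": "planner"}
THRESHOLD = 3

@dataclass
class Task:
    text: str
    kind: str             # e.g. "code_review", "file_search"
    file_count: int = 1
    complexity: int = 1   # 1-10 scale
    is_bulk: bool = False
    is_independent: bool = False

def delegation_score(task: Task) -> int:
    score = 0
    if task.file_count > 2:
        score += 2
    if task.is_bulk:
        score += 2
    if task.kind in ("research", "learning"):
        score += 2
    if task.kind == "code_review":
        score += 2
    if task.kind in ("exploration", "file_search"):
        score += 3
    if task.is_independent:
        score += 2
    # Deductions that keep things safe
    text = task.text.lower()
    if any(kw in text for kw in CRITICAL_KEYWORDS):
        score -= 10   # large enough to override any positive combination
    if any(p in text for p in OBSERVE_PHRASES):
        score -= 5
    if task.complexity > 6:
        score -= 3
    return score

def should_delegate(task: Task) -> bool:
    if task.kind in ALWAYS_DELEGATE:   # hard rules skip scoring entirely
        return True
    return delegation_score(task) >= THRESHOLD
```

Note how the -10 safety penalty works in practice: a bulk, multi-file code review scores 6 points, but one mention of "production" drops it to -4 and it stays in the main session.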

The production system adds 12 more factors, model-specific cost matrices, and team-based delegation for parallel workloads. The course includes the complete scoring config with all edge cases.

User Request → Score Calculation → Agent Selection → Model Routing → Execute + Verify

The Trait System: Composable Agent Behavior

Here is where it gets interesting. I have 81 agents, but I do not write 81 unique agent files. That would be unmaintainable. Instead, I use a trait system that generates agents on demand.

The system is built on three dimensions: expertise (what do you know?), personality (how do you communicate?), and approach (how do you work?). Here is how traits map to task types:

| Task Type | Expertise | Personality | Approach |
|-----------|-----------|-------------|----------|
| Bug fix | engineer | precise | iterative |
| Security review | security | cautious | adversarial |
| Research | researcher | skeptical | systematic |
| Code review | engineer | thorough | systematic |
| Architecture | architect | cautious | consultative |

When a task comes in, the system picks traits and compiles them into a dynamic agent profile. The agent does not just know what to do - it knows how to think. A security reviewer approaches code with adversarial thinking, actively trying to break it. An engineer doing a bug fix iterates methodically, testing each hypothesis before moving on.
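The trait-to-profile compilation can be sketched as a lookup plus string assembly. The trait snippets and the `TRAIT_MAP` entries here are illustrative assumptions; the full taxonomy is much larger.

```python
# Sketch: a task type maps to three trait dimensions, which compile
# into a system-prompt fragment for a dynamically generated agent.
TRAIT_MAP = {
    "bug_fix":         ("engineer", "precise", "iterative"),
    "security_review": ("security", "cautious", "adversarial"),
    "research":        ("researcher", "skeptical", "systematic"),
    "code_review":     ("engineer", "thorough", "systematic"),
    "architecture":    ("architect", "cautious", "consultative"),
}

# Behavioral text per trait (illustrative; one snippet per trait in practice).
TRAIT_TEXT = {
    "adversarial": "Actively try to break the code under review.",
    "iterative": "Test each hypothesis before moving on.",
}

def compile_agent(task_type: str) -> str:
    expertise, personality, approach = TRAIT_MAP[task_type]
    lines = [f"Expertise: {expertise}. Personality: {personality}. Approach: {approach}."]
    for trait in (expertise, personality, approach):
        if trait in TRAIT_TEXT:
            lines.append(TRAIT_TEXT[trait])
    return "\n".join(lines)
```

Five task types and a handful of trait snippets generate every agent profile on demand, which is why 81 agents do not require 81 hand-written files.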

The full trait taxonomy covers all combinations with production-ready profiles for every task type.

Model Selection: Route by Complexity

Not all tasks need Opus. Most do not. I route tasks to models based on complexity:

  • Simple tasks (file search, quick lookup) go to Haiku - fast and cheap
  • Medium tasks (code review, debugging, refactoring) go to Sonnet - balanced
  • Complex tasks (architecture, multi-system decisions) stay in Opus

When I delegate a file search to Haiku instead of running it in my main Opus session, I save 20x on cost and get a cleaner result because the agent starts with zero context pollution.
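The routing rule is simple enough to fit in one function. The task-kind names and the complexity cutoffs below are illustrative assumptions, not the production matrix:

```python
# Sketch of complexity-based model routing, following the bullets above:
# simple → Haiku, medium → Sonnet, complex → Opus.
def pick_model(task_kind: str, complexity: int) -> str:
    SIMPLE = {"file_search", "lookup"}
    COMPLEX = {"architecture", "multi_system"}
    if task_kind in SIMPLE or complexity <= 2:
        return "haiku"    # fast and cheap
    if task_kind in COMPLEX or complexity > 6:
        return "opus"     # stays in the main session
    return "sonnet"       # balanced default for review/debug/refactor
```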

For the persistent memory setup, I use Haiku exclusively. For the context management architecture, I use Sonnet. For architectural decisions? I keep those in Opus.

The Structured Prompt: Why Most Delegations Fail

Every delegated task uses a structured prompt with 6 sections. This is non-negotiable. The structure forces clarity and prevents scope creep.

```
## 1. TASK
Atomic, specific goal (one sentence)

## 2. EXPECTED OUTCOME
Concrete deliverables the agent must produce

## 3. REQUIRED TOOLS
Explicit tool whitelist (Read, Edit, Grep, etc.)

## 4. MUST DO
Exhaustive requirements - everything the agent must accomplish

## 5. MUST NOT DO
Banned actions - explicit boundaries the agent cannot cross

## 6. CONTEXT
File paths, patterns, constraints, relevant background
```

If I cannot fill out every section, the task is not well-defined. That is a signal to clarify before delegating, not after. I tried looser structures. They all failed. Scope creep killed every delegation that did not have explicit boundaries.

The MUST NOT DO section is the secret weapon. Without it, agents optimize for completion and cut corners. With it, they respect constraints. "Do NOT modify files outside src/components" is more powerful than "focus on components."
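The "fill out every section or do not delegate" rule is easy to enforce mechanically. A minimal sketch (the function name and keyword-argument convention are assumptions for the example):

```python
# Sketch of a builder for the 6-section delegation prompt.
# It refuses to emit a prompt with any section left empty,
# enforcing "clarify before delegating, not after".
SECTIONS = ["TASK", "EXPECTED OUTCOME", "REQUIRED TOOLS",
            "MUST DO", "MUST NOT DO", "CONTEXT"]

def build_prompt(**fields: str) -> str:
    parts = []
    for i, name in enumerate(SECTIONS, start=1):
        key = name.lower().replace(" ", "_")   # "MUST NOT DO" -> "must_not_do"
        value = fields.get(key, "").strip()
        if not value:
            raise ValueError(f"Section '{name}' is empty - clarify before delegating")
        parts.append(f"## {i}. {name}\n{value}")
    return "\n\n".join(parts)
```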

Verification: The Step Everyone Skips

Here is the mistake I made early on: delegating and trusting the result.

Bad idea.

After every delegated task, I verify: Does it work? Did it meet requirements? Did it respect constraints? No verification means the task is not complete. With automated hooks, most of these checks happen without manual effort.

I have caught agents that fixed the bug but broke the tests, completed the refactor but ignored a dependency constraint, generated the report but used outdated data. Verification adds 30 seconds per delegation. Skipping it costs 30 minutes when you realize the agent silently violated a constraint.
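The post mentions automated hooks doing most of this; as a minimal manual sketch, verification is just a set of named checks run against the result, with the task held open until all pass. The check functions here are illustrative stand-ins for real test runs and constraint scans:

```python
# Sketch of a post-delegation verification pass: run every check,
# return the names of the ones that failed. Empty list = verified.
from typing import Callable

def verify(result: str, checks: dict[str, Callable[[str], bool]]) -> list[str]:
    return [name for name, check in checks.items() if not check(result)]

# Illustrative checks for one delegation.
checks = {
    "meets_requirements": lambda r: "report" in r,
    "respects_constraints": lambda r: "node_modules" not in r,
}
```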

The System Today

Current stats from my evolving system (as of February 2026):

  • 81 agents: Covering delegation, debugging, research, security, and content workflows
  • 113 commands: Covering delegation, planning, debugging, and content workflows
  • Hundreds of trait combinations: Composable, not hardcoded
  • 12 active hooks: Delegation enforcer, context monitor, security tier check
  • Cost savings: ~65% reduction vs single-agent baseline

The system handles everything from quick file searches (Haiku, 2 seconds) to comprehensive security audits (Sonnet team, 8 minutes). The key metric is not how many agents I have. It is how often I delegate without thinking about it.

That is the difference between session-based work and system-based work.

Getting Started

You do not need 81 agents to get value from delegation. Start small:

  1. Start noticing which tasks could run independently
  2. Build one agent - start with Explore for file searches
  3. Add structure - define what the agent must do and must not do
  4. Route by complexity - simple tasks to fast models, complex tasks stay local
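Step 2 above can be as small as one file. Here is a minimal Explore agent sketch using Claude Code's sub-agent format (a markdown file with YAML frontmatter, typically under `.claude/agents/`); the field values and prompt text are illustrative, so check the frontmatter fields against your Claude Code version's documentation:

```markdown
---
name: explore
description: Fast read-only file and code search. Use for "find", "where is", and "list all" requests.
tools: Read, Grep, Glob
model: haiku
---

You are a fast exploration agent. Locate the requested files and code
patterns, then return a concise list of paths with one-line summaries.

MUST NOT: modify any file, run shell commands, or expand beyond the
requested search scope.
```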

The system compounds. Every delegated task is one less context pollution event. Every clean result is proof that the system works.

You now have the scoring table, the trait mapping, and the prompt structure. That is enough to build your first delegation system today. Start with the Explore agent for file searches, add the structured prompt, and work up from there.

The free 3-pattern guide covers memory, delegation, and knowledge graphs at concept level - a good next step if you want the bigger picture.

FAQ

How do you decide what tasks to delegate?

The system scores every request automatically based on scope, task type, and safety factors. Above the threshold, it delegates without asking. Critical operations like production deploys never auto-delegate.

How do you prevent agents from going rogue?

Every delegation uses a structured prompt with explicit boundaries - required actions and banned actions. Plus verification after every task. Agents cannot do what they are not equipped to do.

Can I use this with Claude Desktop instead of Claude Code?

The architecture works with any Claude interface that supports custom instructions. Claude Code has built-in agent support which makes delegation cleaner, but the core patterns - scoring, traits, structured prompts - work anywhere.

Do I need to know Python or scripting to build this?

No. The delegation system is mostly markdown files with structured prompts. The scoring logic is a small Python hook you can adapt or skip entirely if you prefer manual scoring initially.
