# Why Anthropic Treats Cache Misses Like Outages
The Claude Code team monitors prompt cache hit rate like infrastructure uptime. If it drops too low, they open an incident. Not a bug ticket - an incident.
That surprised me. But once I understood why, a dozen confusing Claude Code behaviors clicked into place. Why CLAUDE.md loads before your conversation. Why plan mode is a tool call. Why /clear feels expensive. Every single one traces back to one constraint: prompt caching.
Anthropic's prompt caching documentation explains the API-level mechanics. This post goes deeper - into how those mechanics shape the tool you use every day.
## How Prompt Caching Actually Works
Prompt caching works by prefix matching. Claude can cache a prefix of the conversation and reuse it on the next request - but only if that prefix is byte-for-byte identical. Change one character early in the prompt and the entire cache after that point is invalidated.
That one rule drives every UX decision in Claude Code.
### Prefix Matching in Practice
If static content sits at the front of the prompt and dynamic content sits at the back, the static part stays cached across requests. If you flip that order - dynamic first, static after - you rebuild the cache on every message.
Same tokens. Wildly different cost.
The cache has a 5-minute TTL (time to live). If you don't send another request within 5 minutes, the cached prefix expires and rebuilds from scratch on your next message. This is why long pauses between messages cost more than continuous work.
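A toy model makes the ordering effect concrete. This is illustrative only: it bills cached tokens at the 0.1x read rate and everything after the first mismatch at full price, whereas real caching works on designated breakpoints, not arbitrary byte offsets.

```python
# Toy model: a request reuses the longest byte-for-byte prefix seen
# before; cached tokens bill at 0.1x, everything after the first
# mismatch bills at full price.

def request_cost(tokens: list[str], cached: list[str]) -> float:
    hit = 0
    for a, b in zip(tokens, cached):
        if a != b:
            break
        hit += 1
    return hit * 0.1 + (len(tokens) - hit) * 1.0  # relative cost units

STATIC = ["ctx"] * 800              # system prompt, tools, CLAUDE.md

# Static-first: only the new message misses the cache.
print(request_cost(STATIC + ["msg2"], cached=STATIC + ["msg1"]))  # 81.0

# Dynamic-first: one changed token at the front invalidates everything.
print(request_cost(["msg2"] + STATIC, cached=["msg1"] + STATIC))  # 801.0
```

Same 801 tokens either way; the ordering alone changes the relative bill by roughly 10x.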
## The Numbers That Matter
| Metric | Value | Source |
|---|---|---|
| Cache write cost | 25% more than base input tokens | Anthropic pricing |
| Cache read cost | 90% discount vs uncached input | Anthropic pricing |
| Cache TTL | 5 minutes | Anthropic docs |
| Minimum cacheable prefix | 1,024 tokens (Opus/Sonnet), 2,048 (Haiku) | Anthropic docs |
At a 90% discount on cached reads, the difference between a cached and uncached session is dramatic. A 100K-token context with an 80% cache hit rate costs roughly a third as much per request as the same context fully uncached (0.8 × 0.10 + 0.2 × 1.25 ≈ 0.33 of the base input rate, counting the 25% write premium on misses), and a fully cached prefix is 10x cheaper.
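Plugging the table's numbers into a quick back-of-envelope, with misses assumed to be written to cache at the 1.25x rate:

```python
# Effective per-token cost relative to uncached input, using the
# rates from the table: cache reads at 0.10x, cache writes at 1.25x.

def effective_multiplier(hit_rate: float) -> float:
    """hit_rate: fraction of prompt tokens served from cache."""
    return hit_rate * 0.10 + (1.0 - hit_rate) * 1.25

print(effective_multiplier(0.0))             # 1.25 -- every token is a cache write
print(round(effective_multiplier(0.8), 2))   # 0.33 -- roughly 3x cheaper
print(effective_multiplier(1.0))             # 0.1 -- fully cached, 10x cheaper
```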
## The Prompt Assembly Order
Claude Code assembles your prompt in a very specific order. This is not arbitrary - it is deliberately structured so the most stable content sits earliest in the prefix.
### Layer 1: System Prompt and Tools
The system prompt and all tool definitions sit at the very front. These are identical across requests within a session, so they cache on the first message and stay cached. This is why ToolSearch uses lightweight stubs instead of full schemas - the stubs are always present in the prefix. Full schemas load only when needed, and the prefix stays stable.
### Layer 2: CLAUDE.md and Static Context
Your CLAUDE.md file, rules files, and any static project context load next. These are more stable than your conversation but less stable than the system prompt (they can vary across projects). This layer caches after the first turn and stays cached for the session.
This is why changes to CLAUDE.md do not take effect mid-session without a restart - the file loaded at session start is baked into the cached prefix.
### Layer 3: Conversation Messages
Your actual messages sit at the back. They change every turn, so they are never cached as input (though they become part of the prefix for the next request). This is the dynamic layer where caching ends and full-price token processing begins.
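If you were assembling such a request against the Anthropic Messages API yourself, the layering might look like this sketch. The `system` and `cache_control` fields are real API fields; the contents and model id are placeholders, and Claude Code's actual assembly is internal.

```python
# Sketch of the three-layer layout as a Messages API request body.
# Placeholder strings stand in for Claude Code's real internals.

SYSTEM_AND_TOOL_STUBS = "<system prompt + tool stubs>"            # Layer 1
STATIC_PROJECT_CONTEXT = "<CLAUDE.md + rules files>"              # Layer 2
messages = [{"role": "user", "content": "fix the failing test"}]  # Layer 3

request = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "system": [
        # Layer 1: identical on every request -> cache hit after turn 1.
        {"type": "text", "text": SYSTEM_AND_TOOL_STUBS,
         "cache_control": {"type": "ephemeral"}},
        # Layer 2: stable for the session; the breakpoint here lets the
        # whole static prefix be reused while messages keep changing.
        {"type": "text", "text": STATIC_PROJECT_CONTEXT,
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": messages,  # Layer 3: never part of the cached prefix
}
```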
Now watch what happens when something changes in the static layers - a tool gets added, the system prompt gets edited, or the model switches. Everything after that point rebuilds from scratch.
That cascade is the cost nobody sees. One change at the wrong layer, and thousands of cached tokens become full-price tokens on every subsequent request.
## What Breaks the Cache
Understanding what invalidates the cache is more useful than understanding what preserves it. Here are the four most common cache breakers:
### 1. Switching Models Mid-Session
Each model has its own cache. When you switch from Opus to Haiku mid-session, the Haiku model has no cached prefix - it builds from scratch at full token prices.
At high context volumes, this is counterintuitive: the full-price Haiku cost can exceed what Opus would charge at cached rates.
```
# What you think happens
Opus (expensive) → Haiku (cheap) = savings

# What actually happens at 100K of cached context
Opus @ cached rate:    ~$0.30/request
Haiku @ uncached rate: ~$0.25/request (first request), then cached
# ...plus you pay the cache write cost again
```
The Anthropic team confirmed this directly. Choose your model at session start and stay with it.
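The per-model isolation is easy to picture as separate cache entries. This is a toy sketch of the behavior, not the real implementation:

```python
# Toy sketch: prefix caches are keyed per model, so a warm cache for
# one model is useless to another.
cache: dict[str, int] = {}      # model id -> cached prefix length

cache["opus"] = 100_000         # warm after the first Opus request

# Mid-session switch: Haiku consults its own entry and finds nothing.
print(cache.get("haiku", 0))    # 0 -> full-price rebuild + new cache write
```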
### 2. Adding or Removing Tools
Tools are part of the prefix. Adding a tool partway through invalidates everything cached after that point. This is why ToolSearch exists as an architecture pattern - deferred tool loading that keeps the prefix stable.
### 3. Using /clear Instead of Compaction
/clear destroys the session and forces a full cache rebuild. Compaction uses what the team calls "cache-safe forking" - same system prompt, same tools, same prefix, with a summary appended at the end. The cached prefix survives.
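A toy before/after shows why the fork is cache-safe. The strings are placeholders; the real summary is model-generated:

```python
# Toy sketch of cache-safe forking: compaction keeps the cached
# prefix intact and appends a summary where the old turns were.
PREFIX = ["system prompt", "tool definitions", "CLAUDE.md"]  # warm cache

before = PREFIX + ["turn 1", "turn 2", "turn 3"]
after = PREFIX + ["summary of turns 1-3"]   # compacted session

# The next request still matches the warm prefix byte-for-byte...
print(after[:len(PREFIX)] == PREFIX)        # True
# ...whereas /clear starts a session with no warm cache at all, so the
# same prefix must be re-written at the 1.25x write rate.
```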
### 4. Long Pauses Between Messages
The 5-minute TTL means a coffee break costs a cache rebuild. If your workflow involves long thinking pauses between messages, consider using shorter, more frequent interactions or accepting the occasional rebuild as a cost of your work style.
## Six Things This Changes About Your Workflow
The caching architecture has direct implications for how you should work:
CLAUDE.md loads before your conversation. Not because it is special - because it is more static than your messages. It gets cached after the first turn and stays cached for the session.
The <system-reminder> tags in your transcript are not noise. When Claude Code needs to inject updated instructions, it appends them to messages instead of modifying the system prompt. Modifying the system prompt would break the cached prefix. Those tags are saving cache hits on every subsequent request.
Plan mode is a tool call, not a mode swap. EnterPlanMode and ExitPlanMode appear as tool calls. If plan mode were a separate system prompt state, switching in and out would invalidate the cache. As a tool, it leaves the prefix intact.
Never add or remove tools mid-session. Tools are part of the prefix. Adding one partway through invalidates everything cached after that point. Plan your tool set upfront.
Model switching costs more than you think. At 100K tokens of cached context, switching from Opus to Haiku does not save money - it rebuilds the cache from zero. Choose at session start.
Long sessions benefit from cache stability. A 2-hour focused session with a warm cache is dramatically cheaper per-request than 12 short sessions that each rebuild from zero.
## Cache-Safe Compaction vs /clear
This deserves its own section because the wrong choice here is the most expensive daily mistake.
| /clear | Compaction |
|---|---|
| Cache rebuilt from zero each time | Cached prefix preserved |
| Full token cost on every request | Cache hit rates stay high |
| Session history gone | Summary bridges old and new context |
| Faster perceived start, slower actual performance | Consistent cost across the session |
I used to /clear after every task. Now I only use it when I genuinely need a fresh context, not just a clean conversation view.
### When to Actually Use /clear
- Starting a completely unrelated task (different project, different domain)
- The session has accumulated too much irrelevant context that compaction can't shrink
- You suspect the model is stuck in a behavior loop and need a true reset
For everything else, let compaction handle it. The session management patterns I use are built around this principle.
## How to Audit Your Own Cache Behavior
If your Claude Code sessions feel inconsistent - sharp early, degraded later - check whether you are accidentally breaking the cached prefix.
### Symptoms of Cache Degradation
- Sessions that suddenly feel slower after a model switch or tool change
- Unexpectedly high token usage reported in the billing dashboard
- Claude Code taking longer to respond after you've been away for a few minutes (TTL expiry)
### Quick Self-Check
- Are you switching models mid-session? Stop. Pick one at the start.
- Are you using /clear between tasks? Switch to compaction.
- Are you adding tools mid-session? Plan your tool set upfront in your CLAUDE.md configuration.
- Are you taking 5+ minute breaks between messages? Accept the rebuild cost or send a quick follow-up before the break.
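If you call the API directly, the `usage` block on each Messages API response splits input tokens into fresh, cache-write, and cache-read counts - enough to compute your own hit rate. The field names below are real API fields; the sample numbers are invented:

```python
# Compute a cache hit rate from a Messages API response's usage block.
# Field names are real; the sample usage dicts are invented numbers.

def cache_hit_rate(usage: dict) -> float:
    read = usage.get("cache_read_input_tokens", 0)
    write = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + write + fresh
    return read / total if total else 0.0

healthy = {"input_tokens": 500, "cache_read_input_tokens": 90_000,
           "cache_creation_input_tokens": 500}
print(f"{cache_hit_rate(healthy):.0%}")   # 99%

broken = {"input_tokens": 500, "cache_read_input_tokens": 0,
          "cache_creation_input_tokens": 90_000}
print(f"{cache_hit_rate(broken):.0%}")    # 0%
```

A hit rate that collapses mid-session is the signature of one of the four cache breakers above.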
## Why This Architecture Exists
Prompt caching is not a nice-to-have. At the token volumes Claude Code sessions generate - long CLAUDE.md files, large tool sets, extended conversations - the tool would be impractical for daily use without it.
The architecture is a product constraint turned into a design principle. And once you see it, the entire tool becomes more legible:
- The context management system I built is partly a cache optimization strategy in disguise - keeping stable content stable and dynamic content minimal.
- The delegation system benefits too - sub-agents start with a small, focused prefix instead of inheriting a bloated 100K-token conversation. Less prefix, less cost, regardless of cache state.
- Hooks preserve the prefix by design - they inject behavior through tool-use events, not system prompt modifications.
Every pattern I use connects back here.
This lives in primeline-ai/evolving-lite - the self-evolving Claude Code plugin. Free, MIT, no build step.