Claude Code Prompt Caching: The One Rule Behind It [2026]

Robin · 9 min read
Last updated: April 12, 2026
claude-code · prompt-caching · optimization · context · cost

Why Anthropic Treats Cache Misses Like Outages

The Claude Code team monitors prompt cache hit rate like infrastructure uptime. If it drops too low, they open an incident. Not a bug ticket - an incident.

That surprised me. But once I understood why, a dozen confusing Claude Code behaviors clicked into place. Why CLAUDE.md loads before your conversation. Why plan mode is a tool call. Why /clear feels expensive. Every single one traces back to one constraint: prompt caching.

Anthropic's prompt caching documentation explains the API-level mechanics. This post goes deeper - into how those mechanics shape the tool you use every day.

How Prompt Caching Actually Works

Prompt caching works by prefix matching. Claude can cache a prefix of the conversation and reuse it on the next request - but only if that prefix is byte-for-byte identical. Change one character early in the prompt and the entire cache after that point is invalidated.

That one rule drives every UX decision in Claude Code.

Prefix Matching in Practice

If static content sits at the front of the prompt and dynamic content sits at the back, the static part stays cached across requests. If you flip that order - dynamic first, static after - you rebuild the cache on every message.

Same tokens. Wildly different cost.
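Claude Code handles this ordering for you, but the mechanics are visible in the raw Messages API. Here is a minimal sketch using the Python SDK - `static_instructions` is a placeholder for your own stable content, while `cache_control` and the usage fields are the actual API surface:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder for your stable instructions. In practice this must exceed
# the minimum cacheable prefix length, or no cache entry is written.
static_instructions = "You are a code reviewer for the acme-api repo. ..."

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": static_instructions,             # identical every request
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    # Only this part changes between turns - it sits after the breakpoint.
    messages=[{"role": "user", "content": "Review src/auth.py"}],
)

# The usage block tells you whether the prefix was written or read from cache.
print(response.usage.cache_creation_input_tokens)  # > 0 on the first call
print(response.usage.cache_read_input_tokens)      # > 0 on later calls
```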

The cache has a 5-minute TTL (time to live). If you don't send another request within 5 minutes, the cached prefix expires and rebuilds from scratch on your next message. This is why long pauses between messages cost more than continuous work.

The Numbers That Matter

| Metric | Value | Source |
|---|---|---|
| Cache write cost | 25% more than base input tokens | Anthropic pricing |
| Cache read cost | 90% discount vs uncached input | Anthropic pricing |
| Cache TTL | 5 minutes | Anthropic docs |
| Minimum cacheable prefix | 1,024 tokens (Opus/Sonnet), 2,048 (Haiku) | Anthropic docs |

At a 90% discount on cached reads, the difference between a cached and uncached session is dramatic. A 100K-token context with an 80% cache hit rate costs roughly 3.5x less per request than the same context with 0% hits, and the gap approaches 10x as the hit rate climbs.
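A quick back-of-envelope check (assuming $3.00 per million input tokens, a Sonnet-class rate - substitute your model's pricing):

```python
# Cost of a single request with a 100K-token context, cached vs uncached.
# Ignores the one-time 25% cache-write premium for simplicity.

BASE = 3.00 / 1_000_000   # dollars per uncached input token
CACHED = BASE * 0.10      # cache reads: 90% discount

def request_cost(context_tokens: int, hit_rate: float) -> float:
    hits = context_tokens * hit_rate
    misses = context_tokens - hits
    return hits * CACHED + misses * BASE

cold = request_cost(100_000, 0.0)   # $0.300 - everything at full price
warm = request_cost(100_000, 0.8)   # $0.084 - 80% read from cache
print(f"cold ${cold:.3f} / warm ${warm:.3f} = {cold / warm:.1f}x")  # 3.6x
```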

The Prompt Assembly Order

Claude Code assembles your prompt in a very specific order. This is not arbitrary - it is deliberately structured so the most stable content sits earliest in the prefix.

Claude Code Prompt Order (Static Before Dynamic)
  1. Static System Prompt + Tools - cached across the entire session
  2. CLAUDE.md + Session Context - cached after first load
  3. Conversation Messages - changes every turn, always dynamic

Layer 1: System Prompt and Tools

The system prompt and all tool definitions sit at the very front. These are identical across requests within a session, so they cache on the first message and stay cached. This is why ToolSearch uses lightweight stubs instead of full schemas - the stubs are always present in the prefix. Full schemas load only when needed, and the prefix stays stable.
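The shape of that pattern, sketched below - ToolSearch's actual stub format is not public, so these dicts are illustrative only:

```python
# The stub lives in the always-cached prefix: small and never changing.
DEPLOY_STUB = {
    "name": "deploy",
    "description": "Deploy a service. Full schema loaded on demand.",
}

# The full schema is pulled in only when the model actually reaches for
# the tool, so the cached prefix never changes to accommodate it.
DEPLOY_FULL = {
    "name": "deploy",
    "description": "Deploy a service to a target environment.",
    "input_schema": {
        "type": "object",
        "properties": {
            "service": {"type": "string"},
            "environment": {"type": "string", "enum": ["staging", "prod"]},
        },
        "required": ["service", "environment"],
    },
}
```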

Layer 2: CLAUDE.md and Static Context

Your CLAUDE.md file, rules files, and any static project context load next. These are more stable than your conversation but less stable than the system prompt (they can vary across projects). This layer caches after the first turn and stays cached for the session.

This is why changes to CLAUDE.md do not take effect mid-session without a restart - the file loaded at session start is baked into the cached prefix.

Layer 3: Conversation Messages

Your actual messages sit at the back. They change every turn, so they are never cached as input (though they become part of the prefix for the next request). This is the dynamic layer where caching ends and full-price token processing begins.
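Putting the three layers together, the request shape looks roughly like this - a conceptual sketch of the ordering principle, not Claude Code's actual internals:

```python
# Conceptual sketch of the three-layer assembly. Variable names are
# illustrative; only the static-before-dynamic ordering is the point.

def build_request(system_prompt, tool_definitions, claude_md, conversation):
    return {
        # Layer 1: byte-identical for every request in the session
        "tools": tool_definitions,
        "system": [
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            # Layer 2: stable after session start (CLAUDE.md, rules files)
            {"type": "text", "text": claude_md,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Layer 3: grows every turn - the only part never served from cache
        "messages": conversation,
    }
```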

Now watch what happens when something changes early in the stack - a tool gets added, the model switches, or the system prompt gets edited. Everything cached after that point rebuilds from scratch.

[Diagram: the prompt cache invalidation cascade]

That cascade is the cost nobody sees. One change at the wrong layer, and thousands of cached tokens become full-price tokens on every subsequent request.

What Breaks the Cache

Understanding what invalidates the cache is more useful than understanding what preserves it. Here are the four most common cache breakers:

1. Switching Models Mid-Session

Each model has its own cache. When you switch from Opus to Haiku mid-session, the Haiku model has no cached prefix - it builds from scratch at full token prices.

At high context volumes, this is counterintuitive: the full-price Haiku cost can exceed what Opus would charge at cached rates.

```
# What you think happens
Opus (expensive) → Haiku (cheap) = savings

# What actually happens at 100K cached context
Opus @ cached rate: ~$0.30/request
Haiku @ uncached rate: ~$0.25/request (first), then cached
# Plus you pay the cache write cost again
```

The Anthropic team confirmed this directly. Choose your model at session start and stay with it.

2. Adding or Removing Tools

Tools are part of the prefix. Adding a tool partway through invalidates everything cached after that point. This is why ToolSearch exists as an architecture pattern - deferred tool loading that keeps the prefix stable.

3. Using /clear Instead of Compaction

/clear destroys the session and forces a full cache rebuild. Compaction uses what the team calls "cache-safe forking" - same system prompt, same tools, same prefix, with a summary appended at the end. The cached prefix survives.
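In pseudocode terms - `Session` and `summarize` here are illustrative stand-ins, not Claude Code internals:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    system: str                                         # cached prefix
    tools: list[dict]                                   # cached prefix
    messages: list[str] = field(default_factory=list)   # dynamic tail

def summarize(messages: list[str]) -> str:
    # Stand-in for the model-generated summary Claude Code produces.
    return f"[summary of {len(messages)} earlier messages]"

def compact(session: Session) -> Session:
    # Cache-safe fork: system and tools are reused byte-for-byte, so the
    # cached prefix survives; only the message tail is replaced.
    return Session(
        system=session.system,
        tools=session.tools,
        messages=[summarize(session.messages)],
    )
```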

4. Long Pauses Between Messages

The 5-minute TTL means a coffee break costs a cache rebuild. If your workflow involves long thinking pauses between messages, either keep the cache warm with shorter, more frequent interactions or accept the occasional rebuild as the price of your working style.

Six Things This Changes About Your Workflow

The caching architecture has direct implications for how you should work:

CLAUDE.md loads before your conversation. Not because it is special - because it is more static than your messages. It gets cached after the first turn and stays cached for the session.

The <system-reminder> tags in your transcript are not noise. When Claude Code needs to inject updated instructions, it appends them to messages instead of modifying the system prompt. Modifying the system prompt would break the cached prefix. Those tags are saving cache hits on every subsequent request.

Plan mode is a tool call, not a mode swap. EnterPlanMode and ExitPlanMode appear as tool calls. If plan mode were a separate system prompt state, switching in and out would invalidate the cache. As a tool, it leaves the prefix intact.

Never add or remove tools mid-session. Tools are part of the prefix. Adding one partway through invalidates everything cached after that point. Plan your tool set upfront.

Model switching costs more than you think. At 100K tokens of cached context, switching from Opus to Haiku does not save money - it rebuilds the cache from zero. Choose at session start.

Long sessions benefit from cache stability. A 2-hour focused session with a warm cache is dramatically cheaper per-request than 12 short sessions that each rebuild from zero.

Cache-Safe Compaction vs /clear

This deserves its own section because the wrong choice here is the most expensive daily mistake.

/clear after every task
  • Cache rebuilt from zero each time
  • Full token cost on every request
  • Session history gone
  • Faster perceived start, slower actual performance

Compaction over /clear
  • Cached prefix preserved
  • Cache hit rates stay high
  • Summary bridges old and new context
  • Consistent cost across the session

I used to /clear after every task. Now I only use it when I genuinely need a fresh context, not just a clean conversation view.

When to Actually Use /clear

  • Starting a completely unrelated task (different project, different domain)
  • The session has accumulated too much irrelevant context that compaction can't shrink
  • You suspect the model is stuck in a behavior loop and need a true reset

For everything else, let compaction handle it. The session management patterns I use are built around this principle.

  • SEV - alert level when cache hit rate drops
  • 90% - discount on cached vs uncached input tokens
  • 0 - cache remaining after /clear

How to Audit Your Own Cache Behavior

If your Claude Code sessions feel inconsistent - sharp early, degraded later - check whether you are accidentally breaking the cached prefix.

Symptoms of Cache Degradation

  • Sessions that suddenly feel slower after a model switch or tool change
  • Unexpectedly high token usage reported in the billing dashboard
  • Claude Code taking longer to respond after you've been away for a few minutes (TTL expiry)

Quick Self-Check

  1. Are you switching models mid-session? Stop. Pick one at the start.
  2. Are you using /clear between tasks? Switch to compaction.
  3. Are you adding tools mid-session? Plan your tool and MCP server set upfront, before the session starts.
  4. Are you taking 5+ minute breaks between messages? Accept the rebuild cost or send a quick follow-up before the break.
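Claude Code reports spend via /cost, but if you drive the API directly you can compute the hit rate from the usage block on each response - the three token fields below are part of the public Messages API:

```python
def cache_hit_rate(usage) -> float:
    """Fraction of input tokens served from cache for one response.

    `usage` is the usage object on an Anthropic Messages API response.
    """
    cached = usage.cache_read_input_tokens or 0       # read at 10% price
    written = usage.cache_creation_input_tokens or 0  # written at 125% price
    fresh = usage.input_tokens                        # uncached input
    total = cached + written + fresh
    return cached / total if total else 0.0

# A healthy long session should trend toward 0.9+; a sudden drop to zero
# usually means the prefix changed (tool added, model switched, /clear).
```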

Why This Architecture Exists

Prompt caching is not a nice-to-have. At the token volumes Claude Code sessions generate - long CLAUDE.md files, large tool sets, extended conversations - the cost without caching would make the tool impractical for daily use.

The architecture is a product constraint turned into a design principle. And once you see it, the entire tool becomes more legible:

  • The context management system I built is partly a cache optimization strategy in disguise - keeping stable content stable and dynamic content minimal.
  • The delegation system benefits too - sub-agents start with a small, focused prefix instead of inheriting a bloated 100K-token conversation. Less prefix, less cost, regardless of cache state.
  • Hooks preserve the prefix by design - they inject behavior through tool-use events, not system prompt modifications.

Every pattern I use connects back here.

This lives in primeline-ai/evolving-lite - the self-evolving Claude Code plugin. Free, MIT, no build step.

FAQ

What is Claude Code prompt caching and why does it matter?
Prompt caching lets Claude reuse a computed prefix of your conversation across multiple requests, paying a fraction of the normal token cost on the cached portion. In Claude Code, the system is structured so your static configuration (system prompt, tools, CLAUDE.md) sits at the front of the prefix and gets cached. The more stable your session setup, the higher your cache hit rate, and the lower your per-message cost.
Why does switching models mid-session cost more?
Each model has its own cache. When you switch from Opus to Haiku mid-session, the Haiku model has no cached prefix - it builds from scratch at full token prices. At high context volumes, the full-price Haiku cost can exceed what Opus would charge at cached rates. The advice is to choose your model at session start and stay with it.
What is the difference between /clear and compaction in Claude Code?
/clear destroys the session and forces a full cache rebuild on the next request. Compaction summarizes the conversation but keeps the same system prompt, tools, and prefix structure - so the cached prefix survives. For long working sessions, compaction is almost always the better choice.
Why does plan mode appear as a tool call instead of a separate mode?
Tools are part of the cached prefix, so EnterPlanMode and ExitPlanMode as tool calls leave the prefix intact. If plan mode were a separate system prompt state, switching in and out would invalidate the cached prefix on every use. The tool design is a caching decision.
How do system-reminder tags relate to prompt caching?
When Claude Code needs to inject updated context or instructions mid-session, it appends them to messages rather than modifying the system prompt. Modifying the system prompt would change the static prefix and invalidate the cache. System-reminder tags let Claude inject updates while keeping the prefix stable and the cache hit rate high.
How long does the Claude Code prompt cache last?
The prompt cache has a 5-minute TTL (time to live). If you don't send another request within 5 minutes, the cached prefix expires and rebuilds from scratch on your next message. This means continuous work sessions are cheaper than sessions with long gaps between messages.
What is the cost difference between cached and uncached Claude Code requests?
Cached input tokens cost 90% less than uncached input tokens. The initial cache write costs 25% more than the base input price. After that first write, every subsequent request that hits the cache pays only 10% of the normal input cost for the cached portion. Over a long session where the cached prefix dominates each request, this adds up to roughly an 8x reduction in input cost.
How do I know if my Claude Code cache is working properly?
Watch for sessions that feel sharp early but degrade later, unexpectedly high token usage, or Claude Code taking longer to respond after brief pauses. These are symptoms of cache invalidation. Check if you are switching models, adding tools, using /clear, or taking 5+ minute breaks between messages.
