
Claude Code Prompt Caching: One Rule, Every Feature

Robin · 5 min read
claude-code · prompt-caching · optimization · context

Anthropic Declares Incidents When This Metric Drops

The Claude Code team monitors prompt cache hit rate like infrastructure uptime. If it drops too low, they open an incident. Not a bug ticket. An incident.

That surprised me. But once I understood why, a dozen confusing Claude Code behaviors clicked into place. Why CLAUDE.md loads before your conversation. Why plan mode is a tool call. Why /clear feels expensive. Every single one traces back to one constraint: prompt caching.

The free 3-pattern guide covers memory, delegation, and knowledge graphs. This post goes one layer deeper - the infrastructure those patterns run on.

The Problem Nobody Explains

Prompt caching works by prefix matching. Claude can cache a prefix of the conversation and reuse it on the next request - but only if that prefix is byte-for-byte identical. Change one character early in the prompt and the entire cache after that point is invalidated.

That one rule drives every UX decision in Claude Code.

If static content sits at the front of the prompt and dynamic content sits at the back, the static part stays cached across requests. If you flip that order - dynamic first, static after - you rebuild the cache on every message. Same tokens. Wildly different cost.
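A toy cost model makes the asymmetry concrete. The 90% cache-read discount is the headline figure from Anthropic's pricing; the token counts and unit rates below are made up for illustration:

```python
# Toy cost model for prefix caching. Rates are illustrative units,
# not real pricing; cached reads cost 10% of uncached input.
FULL_RATE = 1.0     # cost per token, uncached
CACHED_RATE = 0.1   # cost per token read from cache (90% discount)

def request_cost(prompt_tokens, cached_prefix_tokens):
    """Cost of one request given how many leading tokens hit the cache."""
    cached = min(cached_prefix_tokens, prompt_tokens)
    return cached * CACHED_RATE + (prompt_tokens - cached) * FULL_RATE

# 10,000 static tokens (system prompt, tools, CLAUDE.md) + 500 new tokens.
# Static-first: the 10,000-token leading run matches the previous request.
static_first = request_cost(10_500, cached_prefix_tokens=10_000)

# Dynamic-first: the new tokens sit at the front, so nothing matches.
dynamic_first = request_cost(10_500, cached_prefix_tokens=0)

print(static_first)   # 1500.0 -> 10,000 cached + 500 full-price
print(dynamic_first)  # 10500.0 -> every token at full price
```

Same 10,500 tokens either way; ordering alone is a 7x cost difference in this toy.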

Claude Code Prompt Order (Static Before Dynamic)
  1. Static System Prompt + Tools - cached across the entire session
  2. CLAUDE.md + Session Context - cached after first load
  3. Conversation Messages - changes every turn, always dynamic

That is the actual order Claude Code assembles your prompt. Not arbitrary. Deliberately structured so the most stable content sits earliest in the prefix.
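At the API level, this layering corresponds to where cache breakpoints go. The sketch below builds a Messages API-style request body with `cache_control` markers on the static blocks - the field names follow Anthropic's public API, but treat the exact shape and model id as assumptions, not Claude Code's actual internals:

```python
# Sketch of a cache-friendly request body in the style of the Anthropic
# Messages API. `cache_control` breakpoints mark the end of stable
# content; everything up to a breakpoint is eligible for caching.

def build_request(system_text, claude_md, tools, messages):
    return {
        "model": "claude-opus-4-1",  # illustrative model id
        "max_tokens": 1024,
        # Layer 1: static system prompt - stable for the whole session.
        # Layer 2: CLAUDE.md - stable after first load.
        "system": [
            {"type": "text", "text": system_text,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": claude_md,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Tools are part of the cached prefix: fixed for the session.
        "tools": tools,
        # Layer 3: conversation messages - changes every turn.
        "messages": messages,
    }

req = build_request(
    "You are Claude Code...",
    "# CLAUDE.md\n...",
    tools=[],
    messages=[{"role": "user", "content": "Refactor main.py"}],
)
```

Nothing above the breakpoints should change between requests; new instructions ride in `messages`, below them.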

Now watch what happens when something changes in the static layers - a tool gets added, a model switches, or the system prompt gets edited. Everything after that point in the prefix rebuilds from scratch.

(Diagram: the prompt-cache invalidation cascade)

That cascade is the cost nobody sees. One change at the wrong layer, and thousands of cached tokens become full-price tokens on every subsequent request.
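A few lines of Python show why the cascade is total: the cache can only reuse the longest byte-identical leading run, so one inserted tool resets everything after it. The token strings below are purely illustrative:

```python
# Toy model of cache invalidation: the reusable prefix is the longest
# leading run that is identical to the previous request.
def cached_prefix_len(prev_tokens, next_tokens):
    n = 0
    for a, b in zip(prev_tokens, next_tokens):
        if a != b:
            break
        n += 1
    return n

prev = ["<system>", "<tool:Read>", "<tool:Bash>", "<claude_md>", "msg1"]

# Adding one tool in the middle of the static layers...
nxt = ["<system>", "<tool:Read>", "<tool:Bash>", "<tool:Grep>",
       "<claude_md>", "msg1", "msg2"]

# ...means only the first 3 entries still match. CLAUDE.md and every
# message after it are re-processed at full price.
print(cached_prefix_len(prev, nxt))  # 3
```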

What This Means for Your Daily Workflow

The architecture has six direct implications for how I work - and probably how you work too.

CLAUDE.md loads before your conversation. Not because it is special. Because it is more static than your messages. It gets cached after the first turn and stays cached for the session. That is why changes to CLAUDE.md do not take effect mid-session without a restart.

The <system-reminder> tags in your transcript are not noise. When Claude Code needs to inject updated instructions, it appends them to messages instead of modifying the system prompt. Modifying the system prompt would break the cached prefix; system-reminder preserves it. Those tags are earning cache hits on every subsequent request.
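The difference is easy to see in a toy model - the message shapes below are illustrative, not Claude Code's actual wire format:

```python
# Two ways to inject an updated instruction mid-session (toy model).
def edit_system(system, messages, note):
    # Changes the very front of the prompt -> cached prefix invalidated.
    return system + "\n" + note, messages

def append_reminder(system, messages, note):
    # Appends at the back -> everything before it is byte-identical.
    reminder = {"role": "user",
                "content": f"<system-reminder>{note}</system-reminder>"}
    return system, messages + [reminder]

system = "You are Claude Code."
messages = [{"role": "user", "content": "hi"}]

s1, _ = edit_system(system, messages, "Plan mode is active.")
s2, m2 = append_reminder(system, messages, "Plan mode is active.")

print(s1 == system)                 # False: prefix changed, cache miss
print(s2 == system, m2[0] is messages[0])  # True True: prefix intact
```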

Plan mode is a tool call, not a mode swap. EnterPlanMode and ExitPlanMode appear in the transcript as tool calls. I used to find this weird. Now it makes sense. Tools are part of the cached prefix. If plan mode were a separate system prompt state, switching in and out would invalidate the cache. As a tool, it leaves the prefix intact.

Switching models mid-session costs more than you think. Each model has its own cache, so at 100K tokens of context, switching from Opus to Haiku does not save money - the new model rebuilds the cache from zero. Continuing on Opus at cached rates is cheaper than paying full price on Haiku for the same context. The Anthropic team confirmed this directly. My /haiku toggles were occasionally making sessions more expensive.
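The arithmetic is worth sketching. The rates below are placeholders, not real Anthropic pricing - plug in current numbers yourself - but the structure of the comparison holds: the switched-to model starts with an empty cache:

```python
# Back-of-envelope model-switch comparison. Rates are hypothetical
# placeholders in $/MTok, NOT real pricing.
OPUS_CACHED_READ = 1.50   # expensive model, but the prefix is cached
HAIKU_UNCACHED   = 3.00   # cheap-looking model paying full price here

def next_request_cost(context_tokens, rate_per_mtok):
    return context_tokens / 1_000_000 * rate_per_mtok

ctx = 100_000  # 100K tokens of accumulated session context

stay_on_opus = next_request_cost(ctx, OPUS_CACHED_READ)  # prefix cached
switch_haiku = next_request_cost(ctx, HAIKU_UNCACHED)    # cache is empty

print(f"{stay_on_opus:.2f} vs {switch_haiku:.2f}")  # 0.15 vs 0.30
```

Whenever the cheaper model's uncached rate exceeds the expensive model's cached-read rate, the "downgrade" is a price hike - and that is before the switched-to model pays cache-write surcharges to warm its own cache.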

Never add or remove tools mid-session. Tools are part of the prefix. Adding one partway through invalidates everything cached after that point. This is why ToolSearch uses lightweight stubs instead of full schemas - the stubs are always present in the prefix, full schemas load only when needed, and the prefix stays stable.
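A minimal sketch of the stub idea, with invented tool names and schemas (not ToolSearch's actual format): the stubs stay fixed in the prefix, and a full schema arrives as message content only when it is actually needed:

```python
# Toy ToolSearch-style stubs. All names and shapes are illustrative.
# Stubs live in the stable prefix and never change mid-session.
STUBS = [
    {"name": "deploy", "summary": "Deploy a service"},
    {"name": "rollback", "summary": "Roll back a deployment"},
]

# Full schemas are kept out of the prefix entirely.
FULL_SCHEMAS = {
    "deploy": {
        "name": "deploy",
        "input_schema": {
            "type": "object",
            "properties": {"service": {"type": "string"},
                           "version": {"type": "string"}},
            "required": ["service"],
        },
    },
}

def expand_tool(name):
    # Delivered as message content, so the tool list in the prefix
    # is untouched and the cache stays warm.
    return {"role": "user",
            "content": f"<tool-schema>{FULL_SCHEMAS[name]}</tool-schema>"}

msg = expand_tool("deploy")
```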

/clear is more destructive than compaction. Compaction uses what the team calls "cache-safe forking" - same system prompt, same tools, same prefix, with a summary appended at the end. The cached prefix survives. /clear destroys it. I used to /clear after every task. Now I only use it when I genuinely need a fresh context, not just a clean conversation view.

/clear after every task:
  • Cache rebuilt from zero each time
  • Full token cost on every request
  • Session history gone
  • Faster perceived start, slower actual performance

Compaction over /clear:
  • Cached prefix preserved
  • Cache hit rates stay high
  • Summary bridges old and new context
  • Consistent cost across the session

The Metric They Monitor Like Uptime

The Claude Code team runs alerts on prompt cache hit rate. They declare an incident if it drops too low. That is not a footnote - that is a signal about how central caching is to the product's economics.

A high cache hit rate means users pay cached token prices on the bulk of each request. A low rate means they pay full price. The difference compounds fast across a long session.
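To see the compounding, here is a rough session-cost model assuming cached reads at 10% of the uncached rate and ignoring cache-write surcharges - the token counts are arbitrary:

```python
# Cost of a multi-turn session with a warm vs broken cache.
# Units are arbitrary; cached reads cost 10% of uncached input.
def session_cost(turns, static_tokens, per_turn_tokens, cache_hit):
    total = 0.0
    for t in range(1, turns + 1):
        prompt = static_tokens + t * per_turn_tokens
        # With a warm cache, everything up to the previous turn is reused.
        cached = static_tokens + (t - 1) * per_turn_tokens if cache_hit else 0
        total += cached * 0.1 + (prompt - cached) * 1.0
    return total

hot = session_cost(50, 20_000, 1_000, cache_hit=True)
cold = session_cost(50, 20_000, 1_000, cache_hit=False)

# With these numbers a dead cache multiplies session cost roughly 8x.
print(round(cold / hot, 1))
```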

SEV - alert level when cache hit rate drops
90% - discount on cached vs uncached input tokens
0 - cache remaining after /clear

The implication: behaviors that feel like minor UX choices - message ordering, tool loading strategy, plan mode as a tool - are actually cache optimization decisions. They exist to protect that hit rate.

The Pattern: Work With the Prefix

Once I understood prefix matching, I changed how I structure long sessions.

I stopped rotating CLAUDE.md sections mid-session. I stopped adding new tools based on what I needed in the moment. I stopped treating /clear as a quick reset. And I stopped switching models when a task felt too small for Opus.

Instead, I plan my session's tool set upfront. I put stable instructions in CLAUDE.md and let dynamic updates ride in messages. I let compaction handle context compression. And I watch for the patterns that signal cache degradation - usually a session that suddenly feels slower and more expensive than normal.

Quick audit for your sessions

If your Claude Code sessions feel inconsistent - sharp early, degraded later - check whether you are adding tools, switching models, or using /clear mid-session. Any of those breaks the cached prefix and forces a rebuild.

The full session design patterns, including how to structure CLAUDE.md for maximum cache stability and how hooks automation interacts with the prefix, are part of the course implementation kit. But the concept is simple: static at the front, dynamic at the back, and protect the prefix.

Why This Architecture Exists

Prompt caching is not a nice-to-have. At the token volumes Claude Code sessions generate - long CLAUDE.md files, large tool sets, extended conversations - the cost without caching would make the tool impractical for daily use.

The architecture is a product constraint turned into a design principle. And once you see it, the entire tool becomes more legible. The context management system I built is partly a cache optimization strategy in disguise - keeping stable content stable and dynamic content minimal. The delegation system benefits too - sub-agents start with a small, focused prefix instead of inheriting a bloated 100K-token conversation. Less prefix, less cost, regardless of cache state.

Every pattern I use connects back here.

Want the full system blueprint? Get the free 3-pattern guide.

FAQ

What is Claude Code prompt caching and why does it matter?
Prompt caching lets Claude reuse a computed prefix of your conversation across multiple requests, paying a fraction of the normal token cost on the cached portion. In Claude Code, the system is structured so your static configuration (system prompt, tools, CLAUDE.md) sits at the front of the prefix and gets cached. The more stable your session setup, the higher your cache hit rate, and the lower your per-message cost.

Why does switching models mid-session cost more?
Each model has its own cache. When you switch from Opus to Haiku mid-session, the Haiku model has no cached prefix - it builds from scratch at full token prices. At high context volumes, the full-price Haiku cost can exceed what Opus would charge at cached rates. The advice is to choose your model at session start and stay with it.

What is the difference between /clear and compaction in Claude Code?
/clear destroys the session and forces a full cache rebuild on the next request. Compaction summarizes the conversation but keeps the same system prompt, tools, and prefix structure - so the cached prefix survives. For long working sessions, compaction is almost always the better choice.

Why does plan mode appear as a tool call instead of a separate mode?
Tools are part of the cached prefix, so EnterPlanMode and ExitPlanMode as tool calls leave the prefix intact. If plan mode were a separate system prompt state, switching in and out would invalidate the cached prefix on every use. The tool design is a caching decision.

How do system-reminder tags relate to prompt caching?
When Claude Code needs to inject updated context or instructions mid-session, it appends them to messages rather than modifying the system prompt. Modifying the system prompt would change the static prefix and invalidate the cache. System-reminder tags let Claude inject updates while keeping the prefix stable and the cache hit rate high.
