Anthropic Declares Incidents When This Metric Drops
The Claude Code team monitors prompt cache hit rate like infrastructure uptime. If it drops too low, they open an incident. Not a bug ticket. An incident.
That surprised me. But once I understood why, a dozen confusing Claude Code behaviors clicked into place. Why CLAUDE.md loads before your conversation. Why plan mode is a tool call. Why /clear feels expensive. Every single one traces back to one constraint: prompt caching.
The free 3-pattern guide covers memory, delegation, and knowledge graphs. This post goes one layer deeper - the infrastructure those patterns run on.
The Problem Nobody Explains
Prompt caching works by prefix matching. Claude can cache a prefix of the conversation and reuse it on the next request - but only if that prefix is byte-for-byte identical. Change one character early in the prompt and the entire cache after that point is invalidated.
That one rule drives every UX decision in Claude Code.
If static content sits at the front of the prompt and dynamic content sits at the back, the static part stays cached across requests. If you flip that order - dynamic first, static after - you rebuild the cache on every message. Same tokens. Wildly different cost.
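Here is that rule in a toy model - not Claude Code internals, just a sketch of longest-identical-prefix reuse with made-up segment names:

```python
def cached_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count leading segments that match byte-for-byte - the reusable prefix."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n

STATIC = ["system prompt", "tool schemas", "CLAUDE.md"]

# Static first: a new user message only extends the prompt,
# so the entire static prefix stays cached across turns.
turn1 = STATIC + ["user: fix the bug"]
turn2 = STATIC + ["user: fix the bug", "assistant: done", "user: now add tests"]
assert cached_prefix_len(turn1, turn2) == 4  # all of turn1 reused

# Dynamic first: a timestamp at the front differs on every request,
# so nothing after it can ever be reused.
turn1 = ["time: 09:00"] + STATIC
turn2 = ["time: 09:01"] + STATIC
assert cached_prefix_len(turn1, turn2) == 0
```

Same static segments in both cases. Only their position decides whether they are ever reused.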
Static first, dynamic last is exactly how Claude Code assembles your prompt: system prompt, then tool definitions, then CLAUDE.md, then your conversation. Not arbitrary. Deliberately structured so the most stable content sits earliest in the prefix.
Now watch what happens when something changes early in that stack - a tool gets added, a model switches, or the system prompt gets edited. Everything after the change rebuilds from scratch.
That cascade is the cost nobody sees. One change at the wrong layer, and thousands of cached tokens become full-price tokens on every subsequent request.
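A back-of-envelope sketch of that cascade, with invented layer sizes:

```python
# Illustrative layer sizes in tokens - made up for the example,
# not measured from a real session.
layers = [
    ("system prompt", 3_000),
    ("tools", 12_000),
    ("CLAUDE.md", 5_000),
    ("conversation", 80_000),
]

def full_price_tokens(layers, changed_index):
    """Everything from the first changed layer onward is re-billed at full price."""
    return sum(size for _, size in layers[changed_index:])

# Appending to the conversation (the last layer) re-bills only that layer...
assert full_price_tokens(layers, 3) == 80_000
# ...but adding a tool at layer two drags CLAUDE.md and the conversation with it.
assert full_price_tokens(layers, 1) == 97_000
```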
What This Means for Your Daily Workflow
The architecture has six direct implications for how I work - and probably how you work too.
CLAUDE.md loads before your conversation. Not because it is special. Because it is more static than your messages. It gets cached after the first turn and stays cached for the session. That is why changes to CLAUDE.md do not take effect mid-session without a restart.
The <system-reminder> tags in your transcript are not noise. When Claude Code needs to inject updated instructions, it appends them to messages instead of modifying the system prompt. Modifying the system prompt would break the cached prefix. System-reminder preserves it. Those tags are saving cache hits on every subsequent request.
Plan mode is a tool call, not a mode swap. EnterPlanMode and ExitPlanMode appear in the transcript as tool calls. I used to find this weird. Now it makes sense. Tools are part of the cached prefix. If plan mode were a separate system prompt state, switching in and out would invalidate the cache. As a tool, it leaves the prefix intact.
Switching models mid-session costs more than you think. At 100K tokens of cached context, switching from Opus to Haiku does not save money - it rebuilds the cache from zero. The cache rebuild on Opus at cached rates is cheaper than full-price Haiku. The Anthropic team confirmed this directly. My /haiku toggles were occasionally making sessions more expensive.
Never add or remove tools mid-session. Tools are part of the prefix. Adding one partway through invalidates everything cached after that point. This is why ToolSearch uses lightweight stubs instead of full schemas - the stubs are always present in the prefix, full schemas load only when needed, and the prefix stays stable.
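A sketch of that stub pattern - the names and schemas here are invented, not ToolSearch's real data structures:

```python
# Stubs live in the prefix from turn one, so the prefix never changes.
TOOL_STUBS = {
    "grep": "search file contents",
    "deploy": "deploy a service",
}

# Full schemas sit outside the cached prefix and load on demand.
FULL_SCHEMAS = {
    "grep": {"name": "grep", "input_schema": {"pattern": "string", "path": "string"}},
    "deploy": {"name": "deploy", "input_schema": {"service": "string", "env": "string"}},
}

def expand_tool(name: str) -> dict:
    """Fetch the full schema only when the model actually selects the tool."""
    return FULL_SCHEMAS[name]

prefix_turn1 = list(TOOL_STUBS)   # what sits in the cached prefix
schema = expand_tool("grep")      # full schema loaded mid-session...
prefix_turn2 = list(TOOL_STUBS)   # ...without touching the prefix
assert prefix_turn1 == prefix_turn2
```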
/clear is more destructive than compaction. Compaction uses what the team calls "cache-safe forking" - same system prompt, same tools, same prefix, with a summary appended at the end. The cached prefix survives. /clear destroys it. I used to /clear after every task. Now I only use it when I genuinely need a fresh context, not just a clean conversation view.
/clear:
- Cache rebuilt from zero each time
- Full token cost on every request
- Session history gone
- Faster perceived start, slower actual performance

Compaction:
- Cached prefix preserved
- Cache hit rates stay high
- Summary bridges old and new context
- Consistent cost across the session
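In toy form, cache-safe forking looks like this - the STABLE layers and summary format are illustrative, not the actual implementation:

```python
STABLE = ["system prompt", "tools", "CLAUDE.md"]

def compact(summary: str) -> list[str]:
    """Cache-safe fork: same stable prefix, history replaced by a summary."""
    return STABLE + [f"summary: {summary}"]

def shares_cached_prefix(old: list[str], new: list[str]) -> bool:
    """The cache survives only if the new prompt begins with the old stable prefix."""
    return new[:len(STABLE)] == old[:len(STABLE)]

before = STABLE + ["user: refactor auth", "assistant: done"]
after_compact = compact("refactored the auth module")
assert shares_cached_prefix(before, after_compact)  # prefix intact, cache reused

# /clear, by contrast, starts a fresh prompt history: the first request
# afterwards pays full (cache-write) prices to rebuild the prefix.
```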
The Metric They Monitor Like Uptime
The Claude Code team runs alerts on prompt cache hit rate. They declare an incident if it drops too low. That is not a footnote - that is a signal about how central caching is to the product's economics.
A high cache hit rate means users pay cached token prices on the bulk of each request. A low rate means they pay full price. The difference compounds fast across a long session.
The implication: behaviors that feel like minor UX choices - message ordering, tool loading strategy, plan mode as a tool - are actually cache optimization decisions. They exist to protect that hit rate.
The Pattern: Work With the Prefix
Once I understood prefix matching, I changed how I structure long sessions.
I stopped rotating CLAUDE.md sections mid-session. I stopped adding new tools based on what I needed in the moment. I stopped treating /clear as a quick reset. And I stopped switching models when a task felt too small for Opus.
Instead, I plan my session's tool set upfront. I put stable instructions in CLAUDE.md and let dynamic updates ride in messages. I let compaction handle context compression. And I watch for the patterns that signal cache degradation - usually a session that suddenly feels slower and more expensive than normal.
If your Claude Code sessions feel inconsistent - sharp early, degraded later - check whether you are adding tools, switching models, or using /clear mid-session. Any of those breaks the cached prefix and forces a rebuild.
The full session design patterns, including how to structure CLAUDE.md for maximum cache stability and how hooks automation interacts with the prefix, are part of the course implementation kit. But the concept is simple: static at the front, dynamic at the back, and protect the prefix.
Why This Architecture Exists
Prompt caching is not a nice-to-have. At the token volumes Claude Code sessions generate - long CLAUDE.md files, large tool sets, extended conversations - the cost without caching would make the tool impractical for daily use.
The architecture is a product constraint turned into a design principle. And once you see it, the entire tool becomes more legible. The context management system I built is partly a cache optimization strategy in disguise - keeping stable content stable and dynamic content minimal. The delegation system benefits too - sub-agents start with a small, focused prefix instead of inheriting a bloated 100K-token conversation. Less prefix, less cost, regardless of cache state.
Every pattern I use connects back here.
Want the full system blueprint? Get the free 3-pattern guide.


