Claude Code Shares Prompt Caching Secrets for Faster, Cheaper AI
Written by Mango
Drafted with AI; edited and reviewed by a human.
TL;DR
- Prompt caching is a critical technology enabling long-running AI agentic products like Claude Code, significantly reducing latency and cost.
- To maximize cache hits, structure prompts with static content first and dynamic content last, using a hierarchical approach for caching.
- Avoid making changes mid-session, such as altering the system prompt, switching models, or adding/removing tools, as these actions invalidate the cache.
- Strategies like using <system-reminder> tags for updates and deferring tool loading with a tool search tool help maintain cache integrity and performance.
Anthropic has shared key insights into the engineering behind Claude Code, emphasizing the paramount importance of prompt caching for building efficient and cost-effective AI applications. This technology is the bedrock that makes long-running agentic products such as Claude Code feasible, since it allows computation from previous interactions to be reused rather than recomputed. By maximizing prompt cache hit rates, developers can lower costs and offer more generous rate limits to their users. Notably, many of the optimizations that matter most turn out to be counter-intuitive.
A core principle of effective prompt caching is careful prompt structure. The strategy is to place static content, such as the system prompt and tool definitions, at the very beginning of the request, followed by project-specific configuration (CLAUDE.md), then session context, and finally the conversational messages. This layered ordering keeps the largest possible prefix of the prompt identical across requests, increasing the likelihood of cache hits. The structure is surprisingly fragile, however: even minor variations, such as a timestamp in the system prompt or non-deterministic ordering of tools, can trigger costly cache misses.
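To make the layering concrete, here is a minimal sketch using the Anthropic Python SDK's cache_control breakpoints. The model id, system prompt text, and file path are placeholders of our own, not values from the source article.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Static layer: identical for every session, so it caches across all users.
BASE_SYSTEM_PROMPT = "You are a coding agent operating on the user's repository."

# Project layer: stable within one project (e.g. CLAUDE.md), cached per repo.
with open("CLAUDE.md") as f:
    project_config = f.read()

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumption: any cache-capable model id works here
    max_tokens=1024,
    system=[
        # Breakpoint 1: everything up to and including the static system
        # prompt is cached and reused verbatim on subsequent requests.
        {
            "type": "text",
            "text": BASE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        },
        # Breakpoint 2: the project-specific layer gets its own breakpoint,
        # so a different project invalidates only this segment, not the base.
        {
            "type": "text",
            "text": project_config,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # Conversation messages come last: the only part that changes every turn.
    messages=[{"role": "user", "content": "Remove dead code from utils.py."}],
)
print(response.content[0].text)
```

Note the ordering: anything volatile (timestamps, per-turn state) must stay below the last breakpoint, or every request rebuilds the cache from scratch.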
Managing updates and dynamic information requires care to preserve cache integrity. Instead of directly modifying the prompt when data becomes outdated, a change that would inevitably cause a cache miss, Anthropic recommends delivering the update through subsequent messages. For instance, fresh information can be passed via a <system-reminder> tag within the next user message or tool result. The model receives the updated data without the cached prefix being invalidated, preserving performance and cost savings. This practice is detailed further in their Prompt Compaction Guide.
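A sketch of this pattern follows. The with_reminder helper is hypothetical; only the <system-reminder> tag itself comes from the source.

```python
def with_reminder(user_text: str, reminder: str) -> dict:
    """Hypothetical helper: attach updated state to the NEXT user message
    inside a <system-reminder> tag, instead of editing the cached system
    prompt. The cached prefix stays byte-identical, so the hit is preserved."""
    return {
        "role": "user",
        "content": (
            f"<system-reminder>\n{reminder}\n</system-reminder>\n\n{user_text}"
        ),
    }

messages: list[dict] = []  # the running conversation sent with each request
messages.append(
    with_reminder(
        "Now run the test suite.",
        "CLAUDE.md changed on disk after this session started; "
        "treat the copy in context as stale.",
    )
)
```

The design choice is simply that appending to the end of the message list is always cache-safe, while editing anything above the cached breakpoints never is.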
Furthermore, the article stresses the negative impact of model switching mid-session. Prompt caches are model-specific, meaning that switching from a powerful model like Opus to a lighter one like Haiku, even for a simple task, would necessitate rebuilding the entire prompt cache. This can incur higher latency and costs than simply having the original model handle the task. For tool management, rather than adding or removing tools during a conversation—which also breaks the cache—strategies like using a tool search tool with defer_loading can be employed. This involves sending lightweight stubs for tools, with the full schema only being loaded when the model explicitly discovers and requires them via tool search, as described in the documentation for Tool Use and Tool Search.
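A sketch of deferred tool loading under those docs follows. The tool-search type string, tool names, and schema are assumptions on our part (the exact identifiers and any required beta header should be verified against Anthropic's current tool-search documentation); only the defer_loading idea comes from the source.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",  # caches are model-specific: stay on one model per session
    max_tokens=1024,
    tools=[
        # The search tool the model calls to discover other tools on demand.
        # Assumption: the exact type string may differ; check the docs. This
        # may also require a beta header or the beta client.
        {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
        # A deferred tool: only a lightweight stub enters the prompt until the
        # model finds it via search, so registering many tools here does not
        # bloat or reshuffle the cached prefix.
        {
            "name": "query_billing_db",
            "description": "Run a read-only SQL query against the billing database.",
            "input_schema": {
                "type": "object",
                "properties": {"sql": {"type": "string"}},
                "required": ["sql"],
            },
            "defer_loading": True,
        },
    ],
    messages=[{"role": "user", "content": "How many invoices failed last week?"}],
)
```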
Summary
- Prompt caching is fundamental for the performance and cost-efficiency of agentic AI applications like Claude Code.
- Optimizing prompt structure with static elements preceding dynamic ones is crucial for maximizing cache hit rates.
- Avoid disruptive mid-session changes, including model switching and altering tool sets, to prevent cache invalidation.
- Employ techniques like message-based updates and deferred tool loading to maintain cache integrity and improve user experience.
Source: Lessons from building Claude Code: Prompt caching is everything | Claude