AI · Claude API · Performance · Cost Optimization · Anthropic

Prompt Caching With Claude: What Actually Saves Money in Production

Umut Korkmaz · 2026-03-25 · 8 min read

Prompt caching is one of the most underused features in the Anthropic API, and also one of the most misunderstood. The documentation makes it sound like a simple lever: mark a chunk of content as cacheable, pay less on subsequent calls. The reality in production is more interesting, because the gains only show up when the shape of your requests matches the shape of the cache. I have shipped it across a few workloads now, and the patterns that actually save money are more specific than the marketing suggests.

The Cache Model in One Paragraph

The cache operates on prefixes. You mark a position in your prompt with a cache_control block, and everything before that marker becomes eligible for reuse. When a later request sends the same prefix, the server returns a cache hit, charges a fraction of the normal input token price, and you pay full price only for the uncached suffix. The cache lives for a few minutes by default, and the timer resets each time the prefix is reused: long enough for interactive sessions and short-lived batches, not enough for a low-volume background worker.
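Concretely, the marker is a cache_control field on a content block, per the Messages API docs. Here is a minimal sketch of a request payload with a cacheable system prompt; the model name and prompt text are placeholders, and the actual SDK call is left commented:

```python
# Minimal sketch of a cacheable prefix in a Messages API payload.
# The system prompt is a list of content blocks; cache_control on the
# last block marks everything up to and including it as reusable.
LONG_SYSTEM_PROMPT = "You are a support agent. ..."  # imagine 8k+ tokens here

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache boundary
            }
        ],
        # everything after the boundary is the uncached suffix
        "messages": [{"role": "user", "content": user_message}],
    }

# client = anthropic.Anthropic()
# client.messages.create(**build_request("Where is my order?"))
```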

The Silent Constraint: Byte-for-Byte Matching

The prefix has to match exactly. Not "semantically". Not "approximately". If a single token of your system prompt changes, the cache evaporates. This sounds obvious, but almost every bug I have seen in production comes from here. A feature flag that quietly flips a line of text, a timestamp injected into the system prompt for "freshness", a tool definition that gets re-serialized in a slightly different order by a minor SDK bump. Each of those invalidates the entire downstream cache for every user.

I treat the cached prefix like a compiled artifact. It has a deterministic build process. It gets logged. When hit rates drop, that log is the first place I look.
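One cheap way to make that log useful is to fingerprint the logical prefix on every request. This sketch (helper name and structure are my own, not an SDK feature) hashes a canonical serialization, so a flipped feature flag or re-ordered tool definition shows up as a changed fingerprint in the logs:

```python
import hashlib
import json

def prefix_fingerprint(system_blocks: list[dict], tools: list[dict]) -> str:
    """Deterministic hash of everything in front of the cache marker.

    Serializing with sorted keys and fixed separators means the same
    logical prefix always yields the same bytes, and the same hash.
    """
    canonical = json.dumps(
        {"system": system_blocks, "tools": tools},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Log this fingerprint with every request. When hit rates drop, a
# changed fingerprint points at the deploy that broke the prefix.
```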

Where the Money Actually Comes From

Three shapes pay off consistently.

The first is a long, stable system prompt. If you are running an agent with 8k or 16k tokens of instructions, tools, and examples, caching that block is almost free to implement and drops cost dramatically after the first call.

The second is RAG with a stable document context. When you retrieve the same set of chunks across a conversation turn, mark the retrieval block as cacheable and the follow-up turns become cheap. This only works if your retrieval is deterministic enough to produce the same bytes on replay.
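"Deterministic enough" usually comes down to how you serialize the retrieved chunks. A sketch, assuming each chunk carries a stable id: sorting by id before rendering trades ranking order for byte stability, which is the right trade when the same set of chunks can come back in jittery order.

```python
def render_context(chunks: list[dict]) -> str:
    """Render retrieved chunks into a byte-stable context block.

    Sorting by a stable chunk id means the same retrieval set always
    produces identical bytes, regardless of ranking-order jitter.
    """
    ordered = sorted(chunks, key=lambda c: c["id"])
    return "\n\n".join(f"[{c['id']}]\n{c['text'].strip()}" for c in ordered)
```

If ranking order genuinely matters to answer quality, make the ranker itself deterministic instead and keep the original order.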

The third is batched evaluation. When you run a benchmark suite against a fixed prompt with varying inputs, you can structure the prompt so the fixed part comes first, the variable input comes last, and every row after the first pays the cached rate.
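The layout discipline is the whole trick: the fixed part sits entirely before the cache_control marker, and only the per-row input comes after it. A sketch of the per-row request builder (payload shape as in the Messages API; model name is a placeholder):

```python
def build_eval_request(fixed_prompt: str, row_input: str) -> dict:
    """One benchmark row: cached fixed prefix, uncached variable suffix."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": fixed_prompt,  # identical bytes across all rows
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # only the per-row input sits after the cache boundary
        "messages": [{"role": "user", "content": row_input}],
    }
```

The first row pays the cache write; every row after it that lands inside the TTL pays the cached rate on the fixed prefix.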

Where It Does Not Help

If your workload is naturally cold — one request per user every few hours, with no repetition — the cache just adds complexity. The write is more expensive than a normal input token, so a cache that never hits costs you more than no cache at all.
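The break-even point is simple arithmetic. Assuming a write costs 1.25x the base input rate and a read 0.1x (check these multipliers against current Anthropic pricing), and simplifying so that every miss pays the write premium on the full prefix:

```python
def breakeven_hit_rate(write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Hit rate at which caching costs the same as not caching.

    Expected cost per call at hit rate h, relative to the 1.0x you
    would pay with no caching at all:

        h * read_mult + (1 - h) * write_mult

    Setting that equal to 1.0 and solving for h gives the break-even.
    Multipliers are assumptions; verify against current pricing.
    """
    return (write_mult - 1.0) / (write_mult - read_mult)
```

With those assumed multipliers the break-even lands around a 22% hit rate; below that, the cache is a net cost.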

If your prompt is small, the savings are real in percentage terms but trivial in absolute terms. Spending engineering time to cache a 500-token prompt that costs fractions of a cent per call rarely pays back.

Measurement Beats Intuition

I stopped trusting my intuition about which prompts would cache well. Every workload gets a dashboard showing cache reads versus cache writes versus uncached input tokens, broken down by call type. The first version of that dashboard surprised me in both directions: workloads I assumed were caching well were not, and workloads I thought were too dynamic were actually very cacheable after small changes to the prompt layout.

The rule I now follow: do not enable caching on a new call path without first landing the observability. The cache is invisible in your application logs. It is only visible in the response metadata, which you have to forward somewhere useful.
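The relevant metadata lives in the response usage object, which reports uncached input tokens, cache writes, and cache reads as separate counters. A sketch of the forwarding step, with field names per the Messages API usage object (the helper itself is mine):

```python
def cache_metrics(usage: dict) -> dict:
    """Summarize one response's usage block for a caching dashboard.

    Field names follow the Messages API usage object:
    input_tokens (uncached), cache_creation_input_tokens (writes),
    cache_read_input_tokens (hits).
    """
    read = usage.get("cache_read_input_tokens", 0)
    write = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + write + uncached
    return {
        "cached_fraction": read / total if total else 0.0,
        "reads": read,
        "writes": write,
        "uncached": uncached,
    }
```

Emit that dict to whatever metrics pipeline you already run, tagged by call type, and the dashboard falls out.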

Small Patterns That Compound

One pattern that paid off in several places: separate the truly stable system prompt from the per-request policy overrides. Put the stable part first, cache it, and let the dynamic overrides be small and uncached. That alone can take a session-based agent from 70% uncached spend to 10%.
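In payload terms, that split is just two system blocks with the cache marker on the first. A sketch, with placeholder prompt text:

```python
STABLE_SYSTEM = "Core agent instructions, tool guidance, examples..."  # identical every call

def build_system(policy_overrides: str) -> list[dict]:
    """Stable block first (cached), small dynamic block after the boundary."""
    return [
        {
            "type": "text",
            "text": STABLE_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # cache ends here
        },
        # everything after the marker is uncached and free to vary
        {"type": "text", "text": policy_overrides},
    ]
```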

Another pattern: when you tune an agent, avoid rewriting the system prompt mid-conversation. Each rewrite burns the cache for every user on that pathway. If you need to experiment, shadow the new prompt on a separate cache line instead of editing the live one.
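"Separate cache line" here just means a separate prefix: route a small, stable slice of traffic to the experimental prompt so the live prefix stays byte-identical for everyone else. A sketch using hash-based bucketing (the routing helper is my own, not anything from the SDK):

```python
import hashlib

def pick_system_prompt(user_id: str, live: str, shadow: str, shadow_pct: int = 5) -> str:
    """Pin a small, stable slice of users to the experimental prompt.

    Hash-based bucketing keeps each user on one variant across calls,
    so the live prefix stays byte-stable and its cache keeps hitting,
    and the shadow prompt warms its own cache line for its slice.
    """
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return shadow if bucket < shadow_pct else live
```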

When to Reach for It

Prompt caching is not a general-purpose optimization. It is a workload-specific one. It pays off when you have large, stable prefixes and a request volume high enough to keep the cache warm. For those workloads, it is easily the biggest cost lever available. For everything else, skip it.

The tools to use caching well are not in the API. They are in the discipline around your prompt: keeping it stable, logging its shape, and measuring what the cache actually does. Once those are in place, the rest is arithmetic.