Anthropic's ephemeral cache is not new. It has been available since 2024. But most teams using Claude in production have not implemented it correctly — and are paying 60-90% more than necessary on calls that should be cached.
This is not because caching is difficult. It is because placement is counterintuitive, and because the documentation is technically correct without being practically useful.
This article explains the mechanics, the placement pattern, and what can and cannot actually be cached.
What is ephemeral cache
Anthropic's ephemeral cache is a server-side caching mechanism that stores parts of your prompt on Anthropic's infrastructure for up to five minutes. When you send the cached content again within those five minutes, you do not pay for input tokens — only for output tokens and a small cache-read fee.
Concrete discount: cached input tokens are read at roughly 10% of the normal input price, a 90% saving on every hit. Writing the cache on the first call costs a one-time premium (typically 25% above the normal input price); every call after that within the TTL pays only the cache-read rate. If you have a 5,000-token system prompt sent on every call, and you cache it, you pay the cache-read price instead of the full input price.
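The arithmetic is worth making explicit. A sketch with illustrative per-million-token prices (assumed for the example; check Anthropic's current pricing for your model):

```typescript
// Illustrative prices per million input tokens (assumptions, not current list prices):
const INPUT_PRICE = 3.0        // normal input
const CACHE_WRITE_PRICE = 3.75 // first call in a window: ~1.25x write premium
const CACHE_READ_PRICE = 0.3   // later calls within the TTL: ~0.1x

const promptTokens = 5_000 // the static system prompt
const calls = 100          // calls that land inside the cache window

const withoutCache = (promptTokens / 1e6) * INPUT_PRICE * calls
const withCache =
  (promptTokens / 1e6) * CACHE_WRITE_PRICE +            // one cache write
  (promptTokens / 1e6) * CACHE_READ_PRICE * (calls - 1) // 99 cache reads

console.log(withoutCache.toFixed(2)) // → "1.50"
console.log(withCache.toFixed(2))    // → "0.17"
```

Under these assumed prices, the cached path costs about a ninth of the uncached one for that prompt; the exact ratio shifts with the model's price sheet, not with the pattern.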
For systems with long, static system prompts, this is transformative.
Caching is not an optimisation. It is a fundamental part of the cost model for AI systems in production. A system that sends 5,000-token instructions on every call without caching is not cost-optimised — it is not set up correctly.
The 1,024-token threshold
The most important technical detail is the minimum size: the segment you mark must be at least 1,024 tokens for caching to activate (2,048 tokens on the Haiku models). A block below the minimum is silently processed without caching: no error, no warning, no discount.
This means you cannot cache a short 200-token system prompt and expect a discount. You must either:
- Have sufficient content in your cached block (1,024+ tokens), or
- Structure your prompt so that the repeatable parts are consolidated into one block above the threshold.
In practice, this is rarely a problem for enterprise systems. System prompts with domain context, examples (few-shot), and instructions are typically well above 1,024 tokens.
For simpler systems with short instructions, this is a real limitation — and a signal that you should consider adding few-shot examples to your prompt anyway (they typically improve output consistency and give you caching as a side benefit).
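A quick way to sanity-check whether a block is anywhere near the threshold is a character-count heuristic (roughly four characters per token for English prose; that ratio is an assumption, so use Anthropic's token-counting endpoint for authoritative numbers):

```typescript
// Rough heuristic: ~4 characters per token for English text.
// An approximation only; verify real counts via Anthropic's token counting API.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

function isLikelyCacheable(text: string, minTokens = 1_024): boolean {
  return estimateTokens(text) >= minTokens
}

const shortPrompt = "You are a helpful assistant.".repeat(10)
const longPrompt = "Domain context and few-shot examples. ".repeat(200)

console.log(isLikelyCacheable(shortPrompt)) // → false (well below the threshold)
console.log(isLikelyCacheable(longPrompt))  // → true
```

Treat a borderline estimate as a prompt to measure properly, not as a green light.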
The cacheControl placement pattern
This is the part that confuses most teams implementing caching for the first time: you do not cache "your system prompt as a whole." You mark specific parts of your messages array as cacheable.
import { generateText } from "ai"

const result = await generateText({
  model: models.deep, // a Claude model from your model registry
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: STATIC_SYSTEM_INSTRUCTIONS, // 2,000+ tokens of static text
          providerOptions: {
            anthropic: {
              cacheControl: { type: "ephemeral" },
            },
          },
        },
        {
          type: "text",
          text: userInput, // dynamic — not cached
        },
      ],
    },
  ],
  experimental_telemetry: {
    isEnabled: true,
    functionId: "my-feature",
  },
})
The critical point: cacheControl is set on providerOptions on the specific TextPart — not on the entire messages array, not in the system field, and not in the text itself.
Setting it in the wrong location is a common mistake, and the result is that caching does not activate — you pay full price and believe you are caching.
What can and cannot be cached
Can be cached: Static content that is identical across calls. Instructions, frameworks, few-shot examples, domain context, role definitions. The more static, the better.
Cannot be cached effectively: User input, real-time data, timestamps, content with unique IDs. Caching only works when the cached content is bit-for-bit identical to what is in the cache.
Placement hierarchy: Caching works as a prefix cache: everything from the start of the messages array up to and including the block you mark is cached. Place static content first and set the marker at the end of it; you cannot cache an isolated fragment in the middle of a long conversation array.
TTL: Cache lives for up to five minutes. For systems with frequent calls (chat, real-time analysis), this is rarely a problem. For batch jobs with long pauses between calls, the cache expires and you pay full price on the next call.
Multi-turn conversations: You can cache the initial system context and pay for it once per session window. User turns (which change) cannot be cached.
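One way to keep this disciplined in a multi-turn setting is a helper that always pins the cache marker on the session context and appends the changing turns after it. A sketch using the same AI SDK message shape as the example above (buildCachedMessages is a hypothetical helper name; verify the exact part shape against your SDK version):

```typescript
type CacheableTextPart = {
  type: "text"
  text: string
  providerOptions?: { anthropic: { cacheControl: { type: "ephemeral" } } }
}

type ChatMessage = { role: "user" | "assistant"; content: string | CacheableTextPart[] }

// Build a message array where the static session context carries the cache
// marker and the conversation turns follow it, uncached.
function buildCachedMessages(sessionContext: string, turns: ChatMessage[]): ChatMessage[] {
  return [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: sessionContext,
          providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
        },
      ],
    },
    ...turns,
  ]
}

const sessionContext = "Static domain context, principles, and few-shot examples for this session."
const messages = buildCachedMessages(sessionContext, [
  { role: "user", content: "What changed in the portfolio this week?" },
])

console.log(messages.length) // → 2 (context block + one turn)
```

Because the context block is always first and always identical, every turn after the first in a session window reads it from cache.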
Verification: is it actually cached
The most critical mistake is implementing caching and assuming it works. Verification requires checking response metadata.
Via Langfuse or directly via the Anthropic SDK, you can see:
// From a Langfuse trace or the raw Anthropic API response
const usage = result.usage
// cache_creation_input_tokens: tokens used to create the cache (first call)
// cache_read_input_tokens: tokens read from the cache (subsequent calls)
// input_tokens: tokens not cached
// Note: via the AI SDK these surface as camelCase fields
// (cacheCreationInputTokens, cacheReadInputTokens) in providerMetadata.anthropic.
If cache_read_input_tokens is 0 on calls that should hit the cache, caching is not activated correctly. This can be due to incorrect placement of cacheControl, tokens below the 1,024 threshold, or the TTL window having expired.
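That check is easy to automate. A minimal sketch over the usage object, assuming the raw API's snake_case field names shown above:

```typescript
interface CacheUsage {
  input_tokens: number
  cache_creation_input_tokens?: number
  cache_read_input_tokens?: number
}

// Classify a call's cache behaviour from its usage metadata.
function cacheStatus(usage: CacheUsage): "hit" | "write" | "miss" {
  if ((usage.cache_read_input_tokens ?? 0) > 0) return "hit"
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return "write"
  // Neither field set: wrong cacheControl placement, below the token
  // threshold, or the TTL window expired.
  return "miss"
}

console.log(cacheStatus({ input_tokens: 120, cache_creation_input_tokens: 1900 })) // → "write"
console.log(cacheStatus({ input_tokens: 120, cache_read_input_tokens: 1900 }))     // → "hit"
console.log(cacheStatus({ input_tokens: 2020 }))                                   // → "miss"
```

Alerting on "miss" for calls that should hit the cache turns a silent cost leak into a visible failure.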
Add observability from day one. You cannot debug caching without seeing what is actually happening at the token level.
Strategy for system prompts with context
The most powerful use case is caching domain context that is shared across many calls — but specific enough to drive output quality.
Pattern 1 — Static + dynamic:
// PRINCIPLES (~500 tokens), EXAMPLES (~800), and METHODOLOGY (~600) are
// static strings. Keep comments outside the template literal, or they
// become part of the cached text.
const DOMAIN_CONTEXT = `
You are an enterprise architecture assistant. You help analyse
IT portfolios based on the following principles:
${PRINCIPLES}
${EXAMPLES}
${METHODOLOGY}
`
// Total: ~1,900 tokens — above the 1,024 threshold, can be cached

const result = await generateText({
  model: models.deep,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: DOMAIN_CONTEXT,
          providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
        },
        { type: "text", text: `Analyse this application: ${userApp}` },
      ],
    },
  ],
})
Pattern 2 — Few-shot caching:
Few-shot examples are ideal for caching. They are static, relatively long (200-300 tokens per example), and substantially improve output consistency. Three examples typically produce 600-900 tokens; combined with the instruction text in the same block, the cached prefix usually clears the 1,024-token minimum, since the threshold applies to the whole prefix up to your cache marker, not to each block in isolation.
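Consolidation is the practical move: join the instruction and the examples into a single block and put the cache marker on that block. A sketch (the example texts are illustrative):

```typescript
// Static instruction plus few-shot examples, consolidated into ONE block so
// the cached prefix clears the 1,024-token minimum together.
const INSTRUCTION =
  "Classify each application by lifecycle stage (invest, tolerate, migrate, eliminate)."

const EXAMPLES = [
  "Input: Legacy HR system, no vendor support. Output: eliminate.",
  "Input: Core ERP, active roadmap. Output: invest.",
  "Input: CRM with an overlapping successor. Output: migrate.",
]

const fewShotBlock = [
  INSTRUCTION,
  ...EXAMPLES.map((e, i) => `Example ${i + 1}:\n${e}`),
].join("\n\n")

// Mark the consolidated block, not the individual examples, as cacheable:
const cachedPart = {
  type: "text" as const,
  text: fewShotBlock,
  providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
}

console.log(fewShotBlock.includes("Example 3")) // → true (all examples in one block)
```

The dynamic user input then goes in a separate, unmarked text part after this block, exactly as in Pattern 1.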
What to do tomorrow
Caching has the lowest implementation cost of any AI cost optimisation — and one of the highest savings potentials. Three steps:
Week 1: Identify the calls in the product that send long, static system prompts. Measure the token volume on these calls. Calculate the potential savings.
Week 2: Implement cacheControl on the static parts of the identified calls. Verify via telemetry that cache_read_input_tokens increases.
Week 3: Measure the actual cost reduction in Langfuse or the Anthropic console. Establish caching as the default practice for all new features with static system context.
Prompt caching is not an advanced optimisation. It is a fundamental part of building AI systems that are cost-sustainable in production.