How Prompt Caching Works, and Why It Can Cut Your AI API Costs by Up to 90%


Every time you send a prompt to an LLM API, the model processes every single token from scratch. Your 10,000-token system prompt, your 50 few-shot examples, that entire PDF you stuffed into context — all re-processed on every call. That’s expensive and slow.

Prompt caching fixes this. It lets the API reuse the computed internal state from previous calls, skipping redundant work. The result: up to 90% cost reduction on cached tokens and noticeably faster responses.

Here’s how it actually works.

What gets cached: the KV cache

To understand prompt caching, you need to understand what happens inside a transformer during inference. As the model processes each token, it computes key-value (KV) pairs in every attention layer. These KV pairs are what the model uses to “attend” to previous tokens when generating the next one. For a deeper dive, see our KV cache explainer.

Normally, this KV cache is computed fresh for every request and discarded afterward. Prompt caching changes that — the provider stores the KV cache from your prompt’s prefix and reuses it when you send a request with the same prefix again.

This means the model doesn’t need to recompute attention for the cached portion. It picks up right where the cache left off, only processing the new tokens you’ve added.

How prefix matching works

The key constraint is that caching works on prefixes. The provider compares the beginning of your new request against cached prompts. If the first N tokens match exactly, the cached KV state for those N tokens is reused.

The matching is strict:

  • The tokens must be identical, in the same order, from the very start
  • A single changed token at position 5 invalidates everything from position 5 onward
  • Tokens after the cached prefix are processed normally
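The matching rule can be sketched in a few lines. This is a toy illustration, not any provider's implementation: the function name `shared_prefix_len` and the integer token IDs are made up for the example.

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Return how many leading tokens the two sequences have in common."""
    n = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break  # the first mismatch ends the reusable prefix
        n += 1
    return n


cached = [101, 7, 7, 42, 9, 13, 88]          # tokens from an earlier request
late_change = [101, 7, 7, 42, 9, 13, 55, 60]  # first difference at index 6
early_change = [101, 7, 7, 42, 0, 13, 88]     # first difference at index 4

print(shared_prefix_len(cached, late_change))   # 6
print(shared_prefix_len(cached, early_change))  # 4
```

The later the first difference appears, the more of the cached state can be reused, which is why variable content belongs at the end.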

This is why prompt structure matters. Put your stable content first (system prompt, instructions, reference documents) and variable content last (the user’s actual question). If you flip that order, you’ll never get cache hits.

For more on how this applies across different APIs, see prefix caching in LLM APIs.

The cost impact

The savings are significant. Here’s what the major providers offer on cache hits:

Provider         Cache hit pricing   Cache write cost   TTL
Anthropic        90% discount        25% surcharge      5 minutes
OpenAI           50% discount        No surcharge       5–10 minutes
Google (Gemini)  75% discount        No surcharge       Configurable

On a typical workload where 80% of your input tokens are in a stable prefix (system prompt + context), prompt caching can cut your effective input cost by 60–70%. If you’re making rapid successive calls with the same context, as in an agentic loop, the savings on input tokens approach 90%. Output tokens are billed at the normal rate either way. We covered more strategies in how to reduce LLM API costs.
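To see where numbers like these come from, here is a back-of-the-envelope sketch. It covers input tokens only, and the function name, the parameters, and the 90% hit rate in the second call are assumptions chosen for illustration rather than measured values.

```python
def effective_input_cost(prefix_fraction, hit_discount, hit_rate=1.0, write_surcharge=0.0):
    """Relative input cost per request (1.0 means no caching at all).

    prefix_fraction: share of input tokens that sit in the cacheable prefix
    hit_discount:    discount on prefix tokens when the cache hits (0.9 = 90% off)
    hit_rate:        fraction of requests that hit the cache
    write_surcharge: extra cost on prefix tokens when a miss writes the cache
    """
    hit_cost = prefix_fraction * (1 - hit_discount)
    miss_cost = prefix_fraction * (1 + write_surcharge)
    prefix_cost = hit_rate * hit_cost + (1 - hit_rate) * miss_cost
    return prefix_cost + (1 - prefix_fraction)  # the uncached tail pays full price


# 80% stable prefix, 90% discount, every call hits
print(round(effective_input_cost(0.8, 0.9), 3))  # 0.28
# Same workload, but 10% of calls miss and pay a 25% write surcharge
print(round(effective_input_cost(0.8, 0.9, hit_rate=0.9, write_surcharge=0.25), 3))  # 0.372
# 80% stable prefix with a 50% discount and no surcharge
print(round(effective_input_cost(0.8, 0.5), 3))  # 0.6
```

A relative cost of 0.372 corresponds to roughly a 63% saving on input tokens, which is in line with the range quoted above.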

Response latency also drops. Skipping KV computation for thousands of cached tokens means time-to-first-token improves substantially, often by 2–3x for long prompts.

Provider implementations

Each major provider handles caching differently.

Anthropic: explicit cache control

Anthropic gives you direct control over what gets cached using cache_control breakpoints in the Messages API. You mark specific content blocks, and Anthropic caches everything up to that point.

import anthropic

client = anthropic.Anthropic()
contract_text = "..."  # placeholder: load the full contract text here

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal assistant. Reference the following contract...\n\n" + contract_text,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Summarize the indemnification clause."}
    ]
)

# Check cache performance
print(response.usage.cache_creation_input_tokens)  # tokens written to cache
print(response.usage.cache_read_input_tokens)       # tokens read from cache

The first call writes to the cache (with a 25% surcharge on those tokens). Subsequent calls with the same prefix hit the cache at a 90% discount. The cache lives for 5 minutes and resets its TTL on each hit.

You can place multiple cache_control breakpoints (up to four per request) to cache at different granularities. For a practical example using Claude’s latest models, see the Claude Opus 4 / Sonnet 4 guide.
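As a rough sketch of what multiple breakpoints could look like, the snippet below builds a `system` list with two cached blocks, using the same block shape as the request above. The helper names `cached_block` and `build_system_blocks` are hypothetical, and the snippet only builds the payload without calling the API.

```python
def cached_block(text):
    """Build a text block that ends with a cache breakpoint."""
    return {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}


def build_system_blocks(instructions, reference_doc):
    """Put the most stable content first, with a breakpoint after each section."""
    return [
        cached_block(instructions),   # shared by every request
        cached_block(reference_doc),  # shared within one document session
    ]


system = build_system_blocks(
    "You are a legal assistant. Answer using only the contract provided.",
    "<full contract text goes here>",  # placeholder, not real content
)
print(len(system))                   # 2
print(system[0]["cache_control"])    # {'type': 'ephemeral'}
```

You would pass the result as `system=system` in `client.messages.create(...)`. With this layout, swapping in a different document should still allow a hit on the instructions block, provided that block is long enough to meet the provider's minimum cacheable size.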

OpenAI: automatic prefix caching

OpenAI’s approach requires zero code changes. Prefix caching is enabled by default on the API. If your request shares a prefix with a recent request, OpenAI automatically reuses the cached computation.

from openai import OpenAI

client = OpenAI()
long_system_prompt = "..."  # placeholder: your stable instructions (1,024+ tokens to be cacheable)

# Both calls automatically benefit from prefix caching
# if the system message + early messages are identical
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},
        {"role": "user", "content": "What are the key risks?"}
    ]
)

# Check cached tokens in the response
print(response.usage.prompt_tokens_details.cached_tokens)

The tradeoff: you get less control. You can’t force specific breakpoints, and the minimum cacheable prefix is 1,024 tokens. But it’s frictionless.
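Since you can’t set breakpoints, it helps to estimate what is eligible and to track what was actually served from cache. The sketch below is illustrative only: the helper names are made up, and the 128-token step is an assumption about how the cached prefix grows beyond the 1,024-token minimum, so verify it against the current documentation.

```python
def cacheable_tokens(prompt_tokens, minimum=1024, step=128):
    """Estimate the longest prefix eligible for caching.

    Assumes nothing is cached below `minimum` and that the cached prefix
    grows in `step`-sized increments above it (an assumption, see above).
    """
    if prompt_tokens < minimum:
        return 0
    return minimum + (prompt_tokens - minimum) // step * step


def hit_ratio(prompt_tokens, cached_tokens):
    """Share of prompt tokens served from cache, from a response's usage fields."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


print(cacheable_tokens(900))              # 0
print(cacheable_tokens(1500))             # 1408
print(round(hit_ratio(1500, 1408), 2))    # 0.94
```

In practice you would feed `hit_ratio` with `response.usage.prompt_tokens` and `response.usage.prompt_tokens_details.cached_tokens` and log it per endpoint.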

Google: context caching API

Google takes a different approach with an explicit caching API for Gemini models. You create a named cache object and reference it in subsequent requests.

from google import genai
from google.genai.types import Content, Part

client = genai.Client()
large_document = "..."  # placeholder: the document text you want to cache

# Create a cache with your static content.
# In the google-genai SDK, the contents and TTL are passed inside config.
cache = client.caches.create(
    model="gemini-2.0-flash",
    config={
        "display_name": "contract-analysis",
        "contents": [Content(role="user", parts=[Part(text=large_document)])],
        "ttl": "3600s",
    },
)

# Use the cache in requests
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Summarize the indemnification clause.",
    config={"cached_content": cache.name}
)

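Google’s approach gives you explicit TTL control and the ability to manage caches as named resources. The minimum cacheable size depends on the model: Gemini 1.5 models required 32,768 tokens, and newer models have lower minimums, so check the current documentation for the model you use.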

When prompt caching helps most

Prompt caching delivers the biggest wins in specific patterns:

  • Long system prompts — If your system prompt is 2,000+ tokens with detailed instructions, persona definitions, or output schemas, caching it across calls is an obvious win.
  • Few-shot examples — Stuffing 20–50 examples into your prompt for consistent formatting? That’s a perfect cache candidate. Put them in the system message or early in the conversation.
  • Document Q&A — When users ask multiple questions about the same document, the document tokens get cached after the first question. Every follow-up is dramatically cheaper.
  • Agentic loops — Agents that make dozens of sequential calls with growing context benefit enormously. The stable prefix (tools, instructions, earlier conversation) stays cached while only new messages get processed. See LLM inference explained for more on how this fits into the broader inference pipeline.
  • Multi-turn conversations — Each new turn in a conversation shares the entire previous conversation as a prefix. Caching makes long conversations much cheaper.
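The agentic and multi-turn cases are where the gap grows fastest, because the whole context is resent on every call. The sketch below counts how many prompt tokens have to be computed from scratch with and without caching. It is a simplified model with made-up numbers: it assumes every turn adds a fixed number of tokens and that every call after the first hits the cache for everything sent previously. Cached tokens are still billed, just at the discounted rate.

```python
def tokens_processed(prefix_tokens, tokens_per_turn, turns):
    """Return (without_cache, with_cache) counts of freshly computed prompt tokens."""
    without_cache = 0
    with_cache = 0
    context = prefix_tokens
    for turn in range(turns):
        context += tokens_per_turn
        without_cache += context  # the entire context is recomputed every call
        # With caching, only the first call computes the full context;
        # later calls compute just the newly appended tokens.
        with_cache += context if turn == 0 else tokens_per_turn
    return without_cache, with_cache


print(tokens_processed(prefix_tokens=5000, tokens_per_turn=500, turns=20))
# (205000, 15000)
```

Without caching, the work grows quadratically with the number of turns; with a warm cache it grows linearly.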

Limitations to know

Prompt caching isn’t magic. There are real constraints:

Exact prefix match required. Even a single token difference at the start of your prompt invalidates the cache. This means you need to be disciplined about prompt construction — don’t inject timestamps, random IDs, or variable content before your stable prefix.
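Here is a small illustration of how a timestamp in the wrong place destroys the shared prefix. It compares characters rather than tokens as a rough proxy, and the prompt text and helper names are invented for the example.

```python
import os

STABLE = "You are a support assistant. Follow the policy below.\n<policy text>\n"


def bad_prompt(timestamp, question):
    # Variable content first: the prefix changes on every call.
    return f"Current time: {timestamp}\n" + STABLE + question


def good_prompt(timestamp, question):
    # Stable content first, variable content at the end.
    return STABLE + f"Current time: {timestamp}\n" + question


def shared_chars(a, b):
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


bad = shared_chars(bad_prompt("09:00:01", "Q1"), bad_prompt("09:00:07", "Q1"))
good = shared_chars(good_prompt("09:00:01", "Q1"), good_prompt("09:00:07", "Q1"))

print(bad)                  # 21: the match ends inside the timestamp
print(good >= len(STABLE))  # True: the whole stable section is shared
```

Both versions send the same information to the model. Only the ordering differs, and only the second one leaves a reusable prefix.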

TTL expiry. Caches don’t last forever. Anthropic’s 5-minute TTL means low-traffic endpoints won’t benefit much. If you’re making one call per hour, the cache will always be cold.
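A toy simulation makes the traffic effect concrete. It assumes a single cached prefix, a 300-second TTL, and that every call either hits the cache and refreshes the TTL or misses and rewrites it. The function name and call schedules are made up for illustration.

```python
def count_hits(call_times, ttl_seconds=300):
    """Count cache hits for calls at the given times (seconds, ascending)."""
    hits = 0
    expires_at = None  # no cache entry exists before the first call
    for t in call_times:
        if expires_at is not None and t < expires_at:
            hits += 1
        # A hit refreshes the TTL; a miss writes a fresh entry.
        expires_at = t + ttl_seconds
    return hits


hourly = [0, 3600, 7200, 10800]
every_minute = list(range(0, 600, 60))  # 10 calls, one per minute

print(count_hits(hourly))        # 0
print(count_hits(every_minute))  # 9
```

The hourly endpoint pays for a cache write on every call and never gets a read, while the busy endpoint misses only once.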

Cache misses cost the same (or more). With Anthropic, tokens written to the cache cost 25% more than normal input tokens. If your hit rate is low, you’re paying more, not less.
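You can estimate the hit rate at which caching starts to pay off. If a hit costs (1 - discount) per prefix token and a miss costs (1 + surcharge), the expected cost drops below the uncached price once the hit rate exceeds surcharge / (surcharge + discount). The sketch below applies that formula. It is a simplified model that assumes every miss triggers a cache write, and the function name is made up.

```python
def break_even_hit_rate(write_surcharge, hit_discount):
    """Hit rate above which caching is cheaper than not caching.

    Derived from: h * (1 - d) + (1 - h) * (1 + s) < 1  =>  h > s / (s + d)
    """
    return write_surcharge / (write_surcharge + hit_discount)


# 25% write surcharge, 90% discount on hits
print(round(break_even_hit_rate(0.25, 0.9), 3))  # 0.217
# No surcharge: caching never costs extra
print(break_even_hit_rate(0.0, 0.5))             # 0.0
```

Under these assumptions, caching on a 25% surcharge and 90% discount pays off once a bit more than one request in five hits the cache.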

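Minimum size thresholds. Providers require minimum token counts for caching to kick in. Short prompts under 1,024 tokens (OpenAI, and Anthropic’s Claude Sonnet 4 and Opus 4) or 2,048 tokens (Anthropic’s Claude 3.5 Haiku) won’t be cached. Thresholds vary by model, so check the provider’s documentation for the one you use.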

No cross-model caching. A cache built with claude-sonnet-4-20250514 won’t work with claude-3-5-haiku-20241022. Each model has its own KV representation.

Structuring prompts for cache hits

The practical takeaway is simple: put stable content first, variable content last.

[system prompt]        ← cached (stable across all calls)
[few-shot examples]    ← cached (stable across all calls)
[reference document]   ← cached (stable within a session)
[conversation history] ← partially cached (grows each turn)
[new user message]     ← never cached (changes every call)

If you’re using Anthropic, place cache_control breakpoints after each stable section. If you’re using OpenAI, just make sure your prompt follows this structure and caching happens automatically.
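One way to keep that ordering consistent is to assemble every request through a single helper. The sketch below uses chat-style message dictionaries like the OpenAI example earlier. The function name, arguments, and sample content are hypothetical, so adapt them to your own prompt layout.

```python
def build_messages(system_prompt, few_shot, document, history, question):
    """Assemble chat messages with the most stable content first."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += few_shot  # stable across all calls
    messages.append({"role": "user", "content": f"Reference document:\n{document}"})
    messages += history   # grows each turn
    messages.append({"role": "user", "content": question})  # changes every call
    return messages


few_shot = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]
first = build_messages("You are a contract analyst.", few_shot, "<document text>", [], "What are the key risks?")
second = build_messages("You are a contract analyst.", few_shot, "<document text>", [], "Who are the parties?")

# Everything except the final message is identical, so it can be served from cache.
print(first[:-1] == second[:-1])  # True
```

Routing all calls through one builder also makes it harder for a stray timestamp or ID to slip in ahead of the stable sections.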

The difference between a well-structured and poorly-structured prompt can be the difference between 90% cache hits and 0%. For most production applications making repeated calls with shared context, prompt caching is the single highest-impact optimization you can make.