
Prompt Caching Explained: Save Up to 90% on LLM API Costs


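Prompt caching is one of the easiest ways to cut LLM API costs. If the beginning of your prompt matches a recent request, the provider charges 50-90% less for those cached tokens, depending on the provider. Where caching is automatic, no code changes are needed: just put the static part of your prompt first. Where it is opt-in, you also mark the cacheable prefix in the request.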

How it works

When you send a request, the API provider checks if the first N tokens match a recently cached prompt prefix. If they do, those tokens are served from cache at a massive discount.

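| Provider  | Cache discount | Cache lifetime | Min cacheable |
|-----------|----------------|----------------|---------------|
| Anthropic | 90% off input  | 5 minutes      | 1,024 tokens  |
| OpenAI    | 50% off input  | 5-10 minutes   | 1,024 tokens  |
| Google    | 75% off input  | Configurable   | 32,768 tokens |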
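To make the prefix check concrete, here is a toy model of it. It is only a sketch: the function name and the token-list representation are made up for illustration, and real providers match on internal block boundaries rather than token by token.

```python
from typing import Sequence


def cached_token_count(
    cached: Sequence[int],
    request: Sequence[int],
    min_cacheable: int = 1024,  # assumed minimum, taken from the table above
) -> int:
    """Return how many leading tokens of `request` match the cached prompt.

    Returns 0 when the shared prefix is shorter than the provider minimum,
    because nothing is served from cache in that case.
    """
    shared = 0
    for cached_token, request_token in zip(cached, request):
        if cached_token != request_token:
            break  # the first mismatch ends the reusable prefix
        shared += 1
    return shared if shared >= min_cacheable else 0


# Two requests that share a 2,000-token prefix and differ only at the end.
prefix = list(range(2000))
hit = cached_token_count(prefix + [7], prefix + [8])
```

The point of the toy: a single changed token early in the prompt ends the match, so anything dynamic (timestamps, user names, request IDs) belongs at the end.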

When it saves money

Prompt caching helps when you send the same prefix repeatedly:

System prompts: Your 2K-token system prompt is identical across all requests. With caching, you pay full price once, then up to 90% less for requests that arrive while the cache is still warm (about 5 minutes on Anthropic).

Few-shot examples: If you include the same 10 examples in every prompt, those examples get cached.

Large context documents: Sending the same codebase or documentation with every request? Cache it.

Conversation history: Each new message in a conversation shares the entire previous history as a prefix.
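The conversation case is easy to see with numbers. The sketch below uses a hypothetical helper and assumes the best case: every request resends the full history, and the previous request's input is still in cache. It ignores minimum sizes and cache expiry.

```python
def per_turn_cache_split(message_tokens: list[int]) -> list[tuple[int, int]]:
    """For each turn, return (tokens already seen, tokens that are new).

    `message_tokens` holds the token count of each message in order.
    Best-case assumption: the whole earlier history is a cache hit.
    """
    splits = []
    history = 0
    for new_tokens in message_tokens:
        splits.append((history, new_tokens))
        history += new_tokens  # this message becomes part of the next prefix
    return splits


# A 1,200-token opening message followed by two shorter follow-ups.
splits = per_turn_cache_split([1200, 300, 400])
```

By the third turn, 1,500 of the 1,900 input tokens are a repeated prefix, which is why long conversations benefit so much.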

When it doesn't help

  • Unique prompts: If every request is completely different, nothing gets cached
  • Low volume: If you send fewer than one request every 5 minutes, the cache expires between requests
  • Short prompts: Below the provider's minimum (1,024 tokens on Anthropic and OpenAI), nothing is cached
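These conditions can be turned into a rough pre-check. This is a sketch with a hypothetical function name; the default lifetime and minimum size are the Anthropic and OpenAI figures quoted in this article, not values read from any API.

```python
def cache_likely_to_hit(
    prefix_tokens: int,
    seconds_between_requests: float,
    cache_ttl_seconds: float = 300.0,   # assumed 5-minute lifetime
    min_cacheable_tokens: int = 1024,   # assumed provider minimum
) -> bool:
    """Rough check: is the shared prefix long enough and reused often enough?"""
    long_enough = prefix_tokens >= min_cacheable_tokens
    frequent_enough = seconds_between_requests < cache_ttl_seconds
    return long_enough and frequent_enough


busy_agent = cache_likely_to_hit(prefix_tokens=5000, seconds_between_requests=36)
short_prompt = cache_likely_to_hit(prefix_tokens=500, seconds_between_requests=36)
rare_job = cache_likely_to_hit(prefix_tokens=5000, seconds_between_requests=600)
```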

Implementation

Anthropic (Claude)

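Caching on Claude is opt-in per request: add a cache_control marker to the last block of the static prefix, and everything up to that marker becomes cacheable once it reaches 1,024+ tokens. Structure your prompt so the static part comes first: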

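import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,  # required by the Messages API
    system="You are a senior developer...",  # Part of the cached prefix
    messages=[{"role": "user", "content": [
        # cache_control marks the end of the static prefix (system + codebase)
        {"type": "text", "text": large_codebase, "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": "Fix the bug in auth.ts"},  # Only this is new
    ]}],
)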
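To confirm that caching is actually kicking in, you can look at the usage block on the response. The Messages API reports cache activity in the `cache_creation_input_tokens` and `cache_read_input_tokens` usage fields; the helper below is a hypothetical convenience wrapper, and the `SimpleNamespace` object merely stands in for a real `response.usage`.

```python
from types import SimpleNamespace


def describe_cache_usage(usage) -> str:
    """Summarize cache activity from a Messages API usage object."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return f"cache hit: {read} tokens read from cache"
    if written > 0:
        return f"cache write: {written} tokens stored for later requests"
    return "no caching on this request"


# Stand-in for response.usage on a second, warm request (illustrative numbers).
warm_usage = SimpleNamespace(
    input_tokens=12,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=5000,
)
summary = describe_cache_usage(warm_usage)
```

In real code you would call `describe_cache_usage(response.usage)` after each request. A first request should show a cache write and later ones a cache hit.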

OpenAI

Same principle: static content first, dynamic content last.

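from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": long_system_prompt},  # Static prefix: cached (50% off)
        {"role": "user", "content": specific_question},     # Dynamic suffix: full price
    ],
)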
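On the OpenAI side, the Chat Completions usage object exposes the number of cached prompt tokens under `prompt_tokens_details.cached_tokens`. The helper below is a hypothetical sketch that turns that into a hit ratio; the `SimpleNamespace` objects stand in for a real `response.usage`.

```python
from types import SimpleNamespace


def openai_cached_fraction(usage) -> float:
    """Return the share of prompt tokens that were served from cache."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    if not usage.prompt_tokens:
        return 0.0
    return cached / usage.prompt_tokens


# Stand-in for response.usage (illustrative numbers, not real output).
usage = SimpleNamespace(
    prompt_tokens=2048,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1024),
)
fraction = openai_cached_fraction(usage)
```

A fraction that stays at zero across repeated requests usually means the prefix is changing between calls or is shorter than the minimum.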

Real savings example

An AI coding agent sending 100 requests/hour with a 5K-token system prompt:

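|                    | Without caching              | With caching                                    |
|--------------------|------------------------------|-------------------------------------------------|
| System prompt cost | 5K × 100 × $15/1M = $7.50/hr | 5K × 1 × $15/1M + 5K × 99 × $1.50/1M = $0.82/hr |
| Monthly savings    | -                            | ~$4,800/month                                   |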
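The arithmetic in the table can be reproduced in a few lines. This sketch uses a hypothetical function and makes the same simplifications as the table: one full-price request per hour, every other request a cache hit, no extra charge for writing to the cache. The monthly figure additionally assumes the agent runs 24 hours a day for 30 days.

```python
from typing import Optional


def hourly_prefix_cost(
    prefix_tokens: int,
    requests_per_hour: int,
    price_per_million: float,
    cached_price_per_million: Optional[float] = None,
) -> float:
    """Hourly cost in dollars of resending the same prefix on every request."""
    if cached_price_per_million is None:
        # No caching: every request pays full price for the prefix.
        return prefix_tokens * requests_per_hour * price_per_million / 1_000_000
    first = prefix_tokens * price_per_million / 1_000_000
    rest = prefix_tokens * (requests_per_hour - 1) * cached_price_per_million / 1_000_000
    return first + rest


without_cache = hourly_prefix_cost(5_000, 100, 15.0)
with_cache = hourly_prefix_cost(5_000, 100, 15.0, cached_price_per_million=1.50)
monthly_savings = (without_cache - with_cache) * 24 * 30  # assumes 24/7 operation

print(f"${without_cache:.2f}/hr vs ${with_cache:.2f}/hr, about ${monthly_savings:,.0f}/month saved")
```

Swap in your own prefix size, request rate, and prices to see whether caching is worth structuring your prompts around.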

That's why our AI race agents use structured prompts with static identity files loaded first: the system prompt and IDENTITY.md get cached across runs.

Combine with model routing

The biggest savings come from combining caching with model routing:

  1. Route simple tasks to DeepSeek ($0.27/1M): no caching needed, already cheap
  2. Route complex tasks to Claude with caching: 90% off the system prompt
  3. Use local models for autocomplete: free

This combination typically achieves 70-85% cost reduction vs naive Claude-for-everything.
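A router for the three tiers above can be very small. The sketch below is illustrative only: the task labels, the target names, and the `use_cache` flag are all made up for this example and do not correspond to any particular library or provider setting.

```python
def route(task_kind: str) -> dict:
    """Pick a backend and caching policy for a task (hypothetical labels)."""
    if task_kind == "autocomplete":
        # Local model: free, so caching is irrelevant.
        return {"target": "local", "use_cache": False}
    if task_kind == "simple":
        # Cheap hosted model: already inexpensive without caching.
        return {"target": "deepseek", "use_cache": False}
    # Everything else is treated as complex and goes to Claude with a cached prefix.
    return {"target": "claude", "use_cache": True}


decisions = {kind: route(kind) for kind in ("autocomplete", "simple", "refactor")}
```

How you classify a task as simple or complex is the hard part and is left out here; a keyword rule or a small classifier model are both common starting points.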

FAQ

What is prompt caching?

Prompt caching is when an API provider stores the processed representation of the beginning of your prompt so it doesn't need to be recomputed on subsequent requests. If your next request starts with the same tokens, those cached tokens are billed at 50-90% off the normal input price. On providers where caching is automatic, you only need to structure prompts with static content first; on the others, you also mark the cacheable prefix in the request.

Does prompt caching affect quality?

No. Prompt caching only skips redundant computation: the model sees exactly the same tokens whether they are cached or freshly processed. The output quality, reasoning, and behavior are the same. Caching is purely a cost and speed optimization with no impact on model performance.

Which providers support it?

Anthropic (Claude) offers 90% off cached tokens with a 5-minute lifetime. OpenAI (GPT) offers 50% off with a 5-10 minute lifetime. Google (Gemini) offers 75% off with a configurable lifetime. Anthropic and OpenAI require a minimum of 1,024 tokens for caching to activate; Google requires 32,768. Most providers available through OpenRouter also support pass-through caching.
