Prompt Caching Explained β Save Up to 90% on LLM API Costs
Prompt caching is the easiest way to cut LLM API costs. If the beginning of your prompt matches a recent request, the provider charges 75-90% less for those cached tokens. No code changes needed β just structure your prompts correctly.
How it works
When you send a request, the API provider checks if the first N tokens match a recently cached prompt prefix. If they do, those tokens are served from cache at a massive discount.
| Provider | Cache discount | Cache lifetime | Min cacheable |
|---|---|---|---|
| Anthropic | 90% off input | 5 minutes | 1,024 tokens |
| OpenAI | 50% off input | 5-10 minutes | 1,024 tokens |
| 75% off input | Configurable | 32,768 tokens |
When it saves money
Prompt caching helps when you send the same prefix repeatedly:
System prompts β Your 2K-token system prompt is identical across all requests. With caching, you pay full price once, then 90% less for the next 5 minutes of requests.
Few-shot examples β If you include 10 examples in every prompt, those examples get cached.
Large context documents β Sending the same codebase or documentation to every request? Cache it.
Conversation history β Each new message in a conversation shares the entire previous history as a prefix.
When it doesnβt help
- Unique prompts β If every request is completely different, nothing gets cached
- Low volume β If you send <1 request per 5 minutes, the cache expires between requests
- Short prompts β Under 1,024 tokens, thereβs nothing to cache
Implementation
Anthropic (Claude)
Caching is automatic for the first 1,024+ tokens. Structure your prompt so the static part comes first:
response = client.messages.create(
model="claude-sonnet-4.6",
system="You are a senior developer...", # This gets cached
messages=[
{"role": "user", "content": large_codebase}, # This gets cached too
{"role": "user", "content": "Fix the bug in auth.ts"} # Only this is new
]
)
OpenAI
Same principle β static content first, dynamic content last:
response = client.chat.completions.create(
model="gpt-5.4",
messages=[
{"role": "system", "content": long_system_prompt}, # Cached (50% off)
{"role": "user", "content": specific_question} # Full price
]
)
Real savings example
An AI coding agent sending 100 requests/hour with a 5K-token system prompt:
| Without caching | With caching | |
|---|---|---|
| System prompt cost | 5K Γ 100 Γ $15/1M = $7.50/hr | 5K Γ 1 Γ $15/1M + 5K Γ 99 Γ $1.50/1M = $0.82/hr |
| Monthly savings | β | ~$4,800/month |
Thatβs why our AI race agents use structured prompts with static identity files loaded first β the system prompt and IDENTITY.md get cached across runs.
Combine with model routing
The biggest savings come from combining caching with model routing:
- Route simple tasks to DeepSeek ($0.27/1M) β no caching needed, already cheap
- Route complex tasks to Claude with caching β 90% off the system prompt
- Use local models for autocomplete β free
This combination typically achieves 70-85% cost reduction vs naive Claude-for-everything.
FAQ
What is prompt caching?
Prompt caching is when an API provider stores the processed representation of the beginning of your prompt so it doesnβt need to be recomputed on subsequent requests. If your next request starts with the same tokens, those cached tokens are served at 50-90% off the normal input price. Itβs automatic on most providers β you just need to structure prompts with static content first.
Does prompt caching affect quality?
No. Prompt caching only skips redundant computation β the model produces identical results whether tokens are cached or freshly processed. The output quality, reasoning, and behavior are exactly the same. Caching is purely a cost and speed optimization with no impact on model performance.
Which providers support it?
Anthropic (Claude) offers 90% off cached tokens with a 5-minute lifetime. OpenAI (GPT) offers 50% off with a 5-10 minute lifetime. Google (Gemini) offers 75% off with configurable lifetime. All three require a minimum of 1,024 tokens (32,768 for Google) for caching to activate. Most providers via OpenRouter also support pass-through caching.
Related: How to Reduce LLM API Costs by 70% Β· OpenRouter Complete Guide Β· AI Coding Tools Pricing 2026