Prompt caching is the easiest way to cut LLM API costs. If the beginning of your prompt matches a recent request, the provider charges 50-90% less for those cached tokens. It requires few or no code changes: just structure your prompts so the static content comes first.
## How it works
When you send a request, the API provider checks if the first N tokens match a recently cached prompt prefix. If they do, those tokens are served from cache at a massive discount.
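Conceptually, the provider hashes prompt prefixes and checks new requests against recently seen ones. The toy sketch below illustrates the idea in pure Python; real providers match at the token level, in fixed-size blocks, with expiry, and the character-based `BLOCK` granularity here is purely illustrative:

```python
import hashlib

BLOCK = 256                # toy cache-block granularity, in characters
CACHE: set[str] = set()    # hashes of previously seen prompt prefixes

def lookup(prompt: str, min_prefix: int = 1024) -> int:
    """Return how many leading characters hit the cache, then record
    this prompt's prefixes so later requests can reuse them."""
    matched = 0
    for end in range(min_prefix, len(prompt) + 1, BLOCK):
        if hashlib.sha256(prompt[:end].encode()).hexdigest() in CACHE:
            matched = end  # longest cached prefix found so far
    for end in range(min_prefix, len(prompt) + 1, BLOCK):
        CACHE.add(hashlib.sha256(prompt[:end].encode()).hexdigest())
    return matched

system = "You are a senior developer. " * 100    # static ~2,800-char prefix
first = lookup(system + "Fix the bug in auth.ts")  # nothing cached yet
hit = lookup(system + "Now add unit tests")        # shared prefix is cached
print(first, hit)
```

The second request hits the cache for the shared system prompt even though its suffix differs, which is exactly why static content must come first.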
| Provider | Cache discount | Cache lifetime | Min cacheable |
|---|---|---|---|
| Anthropic | 90% off input | 5 minutes | 1,024 tokens |
| OpenAI | 50% off input | 5-10 minutes | 1,024 tokens |
| Google (Gemini) | 75% off input | Configurable | 32,768 tokens |
## When it saves money
Prompt caching helps when you send the same prefix repeatedly:
- **System prompts** — Your 2K-token system prompt is identical across all requests. With caching, you pay full price once, then 90% less for the next 5 minutes of requests.
- **Few-shot examples** — If you include the same 10 examples in every prompt, those examples get cached.
- **Large context documents** — Sending the same codebase or documentation with every request? Cache it.
- **Conversation history** — Each new message in a conversation shares the entire previous history as a prefix.
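The conversation case is worth spelling out: as long as each request appends to the previous message list without rewriting earlier turns, the whole history stays byte-identical and cacheable. A sketch, where `send` is a hypothetical stand-in for a real API call:

```python
history = [{"role": "system", "content": "You are a senior developer..."}]

def ask(question: str, reply: str) -> list[dict]:
    """Append a turn; everything before it is a stable, cacheable prefix."""
    history.append({"role": "user", "content": question})
    # response = send(history)  # hypothetical call; prior turns hit the cache
    history.append({"role": "assistant", "content": reply})
    return list(history)

turn1 = ask("Fix the bug in auth.ts", "Fixed: the token check was inverted.")
turn2 = ask("Now add a regression test", "Added a failing-then-passing test.")
# turn2 begins with turn1 verbatim, so those tokens are served from cache
print(turn2[: len(turn1)] == turn1)  # True
```

The corollary: anything that mutates earlier turns (summarizing old history, injecting timestamps) breaks the prefix and forfeits the discount.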
## When it doesn't help
- Unique prompts — If every request is completely different, nothing gets cached
- Low volume — If you send <1 request per 5 minutes, the cache expires between requests
- Short prompts — Under 1,024 tokens, there’s nothing to cache
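Two of those three failure modes reduce to arithmetic: compare your average request gap to the cache lifetime, and your prefix size to the provider minimum (defaults below taken from the table; unique prompts can't be screened this way):

```python
def caching_pays_off(requests_per_hour: float, prompt_tokens: int,
                     cache_lifetime_min: float = 5.0,
                     min_cacheable: int = 1024) -> bool:
    """Rough screen for the low-volume and short-prompt failure modes."""
    if prompt_tokens < min_cacheable:
        return False                     # too short: nothing to cache
    gap_min = 60.0 / requests_per_hour   # average minutes between requests
    return gap_min < cache_lifetime_min  # cache must survive the gap

print(caching_pays_off(100, 5_000))  # True: 0.6-min gaps, 5K-token prefix
print(caching_pays_off(6, 5_000))    # False: 10-min gaps outlive the cache
print(caching_pays_off(100, 800))    # False: under the 1,024-token minimum
```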
## Implementation
### Anthropic (Claude)
Anthropic requires an explicit opt-in: mark the end of each static section with `cache_control`, and the prefix up to that breakpoint (minimum 1,024 tokens) becomes cacheable. Structure your prompt so the static parts come first:

```python
response = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    system=[{"type": "text", "text": "You are a senior developer...",
             "cache_control": {"type": "ephemeral"}}],       # cached
    messages=[{"role": "user", "content": [
        {"type": "text", "text": large_codebase,
         "cache_control": {"type": "ephemeral"}},            # cached too
        {"type": "text", "text": "Fix the bug in auth.ts"},  # only this is new
    ]}],
)
```
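Anthropic's response reports cache activity in its `usage` object (`cache_read_input_tokens` and `cache_creation_input_tokens`), so you can verify the prefix is actually being reused. A small helper, shown here against a plain dict rather than a live response object:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache for one request."""
    cached = usage.get("cache_read_input_tokens", 0)
    total = (cached + usage.get("cache_creation_input_tokens", 0)
             + usage.get("input_tokens", 0))
    return cached / total if total else 0.0

# Illustrative numbers: a 5K-token cached prefix plus a 12-token question
usage = {"input_tokens": 12, "cache_read_input_tokens": 5000,
         "cache_creation_input_tokens": 0}
print(round(cache_hit_rate(usage), 3))  # 0.998
```

A rate near zero on repeat requests usually means the prefix changed between calls or the cache expired.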
### OpenAI
Same principle, but caching is automatic: prompts of 1,024+ tokens are eligible with no special markup. Put static content first, dynamic content last:

```python
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": long_system_prompt},  # cached (50% off)
        {"role": "user", "content": specific_question},     # full price
    ],
)
```
## Real savings example
An AI coding agent sending 100 requests/hour with a 5K-token system prompt:
| | Without caching | With caching |
|---|---|---|
| System prompt cost | 5K × 100 × $15/1M = $7.50/hr | 5K × 1 × $15/1M + 5K × 99 × $1.50/1M = $0.82/hr |
| Monthly savings | — | ~$4,800/month |
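The table's numbers can be reproduced directly (assuming $15/1M input pricing and a 90% cache discount; Anthropic's one-time 25% cache-write surcharge on the first request is ignored for simplicity):

```python
PRICE = 15 / 1_000_000         # dollars per input token (assumed)
TOKENS, REQUESTS = 5_000, 100  # system-prompt size, requests per hour

uncached = TOKENS * REQUESTS * PRICE  # every token at full price
cached = TOKENS * PRICE + TOKENS * (REQUESTS - 1) * PRICE * 0.10
monthly = (uncached - cached) * 24 * 30

print(f"${uncached:.2f}/hr vs ${cached:.2f}/hr, ~${monthly:,.0f}/mo saved")
```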
That’s why our AI race agents use structured prompts with static identity files loaded first — the system prompt and IDENTITY.md get cached across runs.
## Combine with model routing
The biggest savings come from combining caching with model routing:
- Route simple tasks to DeepSeek ($0.27/1M) — no caching needed, already cheap
- Route complex tasks to Claude with caching — 90% off the system prompt
- Use local models for autocomplete — free
This combination typically achieves a 70-85% cost reduction versus naively sending everything to Claude.
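A routing layer can be as simple as a lookup table. The model names and per-million-token prices below are assumptions for illustration, not a real routing library:

```python
ROUTES = {
    "autocomplete": ("local-model", 0.00),        # free, runs on-device
    "simple":       ("deepseek-chat", 0.27),      # cheap: no caching needed
    "complex":      ("claude-sonnet-4.6", 15.00)  # cache its system prompt
}

def pick_model(task_kind: str) -> str:
    """Default unknown tasks to the strongest (most expensive) model."""
    return ROUTES.get(task_kind, ROUTES["complex"])[0]

print(pick_model("simple"))    # deepseek-chat
print(pick_model("refactor"))  # claude-sonnet-4.6 (unknown -> complex)
```

Defaulting unknown tasks to the expensive model trades some cost for safety; the caching discount on its static system prompt softens that trade.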
Related: How to Reduce LLM API Costs by 70% · OpenRouter Complete Guide · AI Coding Tools Pricing 2026