
Prompt Caching Explained — Save Up to 90% on LLM API Costs


Prompt caching is the easiest way to cut LLM API costs. If the beginning of your prompt matches a recent request, the provider charges 50-90% less for those cached tokens. Often no code changes are needed: just structure your prompts so the static content comes first.

How it works

When you send a request, the API provider checks if the first N tokens match a recently cached prompt prefix. If they do, those tokens are served from cache at a massive discount.
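A toy model of that lookup, assuming the cache stores whole token prefixes (real providers hash prefixes in fixed-size blocks and enforce the per-provider minimums shown in the table below):

```python
MIN_CACHEABLE = 1024  # tokens; below this, providers don't cache at all

def longest_cached_prefix(tokens, cache):
    """Return the number of leading tokens found in the cache.

    `cache` is a set of token-tuple prefixes from recent requests.
    """
    for n in range(len(tokens), MIN_CACHEABLE - 1, -1):
        if tuple(tokens[:n]) in cache:
            return n  # these tokens are billed at the discounted rate
    return 0  # no usable prefix: the whole prompt is full price
```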

Provider     Cache discount    Cache lifetime    Min cacheable
Anthropic    90% off input     5 minutes         1,024 tokens
OpenAI       50% off input     5-10 minutes      1,024 tokens
Google       75% off input     Configurable      32,768 tokens

When it saves money

Prompt caching helps when you send the same prefix repeatedly:

System prompts — Your 2K-token system prompt is identical across all requests. With caching, you pay full price once, then 90% less for the next 5 minutes of requests.

Few-shot examples — If you include 10 examples in every prompt, those examples get cached.

Large context documents — Sending the same codebase or documentation to every request? Cache it.

Conversation history — Each new message in a conversation shares the entire previous history as a prefix.
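The conversation-history case works because each turn only appends to an otherwise unchanged message list. A minimal provider-agnostic sketch, where `call_llm` is a hypothetical stand-in for any chat client:

```python
# Sketch: multi-turn chat where every request shares the previous
# history as a cacheable prefix. `call_llm` is a hypothetical
# stand-in for a real chat-completions call.

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_text, call_llm):
    history.append({"role": "user", "content": user_text})
    # Everything before the new user message is byte-identical to the
    # previous request, so the provider can serve it from cache.
    reply = call_llm(history)
    history.append({"role": "assistant", "content": reply})
    return reply
```

Note the history is only ever appended to; editing or reordering earlier messages invalidates the cached prefix.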

When it doesn’t help

  • Unique prompts — If every request is completely different, nothing gets cached
  • Low volume — If you send <1 request per 5 minutes, the cache expires between requests
  • Short prompts — Under 1,024 tokens, there’s nothing to cache

Implementation

Anthropic (Claude)

Anthropic requires explicit cache breakpoints: mark the static content with cache_control, and prefixes of at least 1,024 tokens are cached. Structure your prompt so the static part comes first:

response = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "You are a senior developer...",
        "cache_control": {"type": "ephemeral"},  # cache breakpoint
    }],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": large_codebase,
             "cache_control": {"type": "ephemeral"}},  # cached too
            {"type": "text", "text": "Fix the bug in auth.ts"},  # only this is new
        ],
    }],
)
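To confirm caching is actually firing, Anthropic responses report cache_read_input_tokens and cache_creation_input_tokens in the usage block. A small helper, shown against a plain dict for illustration (the real SDK returns these as attributes on response.usage):

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of input tokens that were served from cache."""
    cached = usage.get("cache_read_input_tokens", 0)   # discounted tokens
    created = usage.get("cache_creation_input_tokens", 0)  # cache writes
    fresh = usage.get("input_tokens", 0)               # full-price tokens
    total = cached + created + fresh
    return cached / total if total else 0.0
```

A ratio near zero on repeated requests usually means the prefix is changing between calls or is under the 1,024-token minimum.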

OpenAI

Same principle — static content first, dynamic content last:

response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": long_system_prompt},  # Cached (50% off)
        {"role": "user", "content": specific_question}       # Full price
    ]
)

Real savings example

An AI coding agent sending 100 requests/hour with a 5K-token system prompt:

Cost without caching:  5K × 100 × $15/1M = $7.50/hr
Cost with caching:     5K × 1 × $15/1M + 5K × 99 × $1.50/1M = $0.82/hr
Monthly savings:       ~$4,800/month
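The figures above can be reproduced in a few lines, assuming Anthropic-style pricing ($15 per 1M input tokens, 90% cache discount) and one cache miss per hour:

```python
# Reproduce the savings table: 5K-token system prompt, 100 requests/hr,
# first request each hour is a cache miss, the other 99 are hits.

def hourly_cost(prefix_tokens, requests_per_hour, cached=True,
                price_per_mtok=15.0, discount=0.90):
    if not cached:
        return prefix_tokens * requests_per_hour * price_per_mtok / 1e6
    miss = prefix_tokens * price_per_mtok / 1e6                # full price once
    hits = (prefix_tokens * (requests_per_hour - 1)
            * price_per_mtok * (1 - discount) / 1e6)           # 90% off the rest
    return miss + hits

without = hourly_cost(5000, 100, cached=False)   # $7.50/hr
with_cache = hourly_cost(5000, 100)              # ~$0.82/hr
monthly = (without - with_cache) * 24 * 30       # ~$4,800/month
```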

That’s why our AI race agents use structured prompts with static identity files loaded first — the system prompt and IDENTITY.md get cached across runs.

Combine with model routing

The biggest savings come from combining caching with model routing:

  1. Route simple tasks to DeepSeek ($0.27/1M) — no caching needed, already cheap
  2. Route complex tasks to Claude with caching — 90% off the system prompt
  3. Use local models for autocomplete — free

This combination typically achieves a 70-85% cost reduction compared with naively sending everything to Claude.
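A minimal sketch of such a router, using a naive length-based complexity heuristic and made-up model labels (real routers classify tasks far more carefully):

```python
# Sketch: route each task to the cheapest model that can handle it.
# Model names and the length heuristic are illustrative assumptions.

def route(task: str) -> str:
    if task.startswith("autocomplete:"):
        return "local-model"          # free, latency-sensitive
    if len(task) < 200:
        return "deepseek-chat"        # cheap enough that caching barely matters
    return "claude-with-caching"      # complex task: cached system prompt
```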

Related: How to Reduce LLM API Costs by 70% · OpenRouter Complete Guide · AI Coding Tools Pricing 2026