
Prompt Caching Explained — Save Up to 90% on LLM API Costs


Prompt caching is the easiest way to cut LLM API costs. If the beginning of your prompt matches a recent request, the provider charges 50-90% less for those cached tokens. Often no code changes are needed: just structure your prompts so the static content comes first.

How it works

When you send a request, the API provider checks if the first N tokens match a recently cached prompt prefix. If they do, those tokens are served from cache at a massive discount.
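A toy model of that lookup, assuming the cache stores whole token prefixes (real providers hash prefixes in fixed-size blocks and enforce the per-provider minimums shown in the table below):

```python
MIN_CACHEABLE = 1024  # tokens; below this, providers don't cache at all

def longest_cached_prefix(tokens, cache):
    """Return the number of leading tokens found in the cache.

    `cache` is a set of token-tuple prefixes from recent requests.
    """
    for n in range(len(tokens), MIN_CACHEABLE - 1, -1):
        if tuple(tokens[:n]) in cache:
            return n  # these tokens are billed at the discounted rate
    return 0  # no usable prefix: the whole prompt is full price
```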

Provider     Cache discount    Cache lifetime    Min cacheable
Anthropic    90% off input     5 minutes         1,024 tokens
OpenAI       50% off input     5-10 minutes      1,024 tokens
Google       75% off input     Configurable      32,768 tokens

When it saves money

Prompt caching helps when you send the same prefix repeatedly:

System prompts — Your 2K-token system prompt is identical across all requests. With caching, you pay full price once, then 90% less for the next 5 minutes of requests.

Few-shot examples — If you include 10 examples in every prompt, those examples get cached.

Large context documents — Sending the same codebase or documentation to every request? Cache it.

Conversation history — Each new message in a conversation shares the entire previous history as a prefix.
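The conversation-history case works because each turn only appends to an otherwise unchanged message list. A minimal provider-agnostic sketch, where `call_llm` is a hypothetical stand-in for any chat client:

```python
# Sketch: multi-turn chat where every request shares the previous
# history as a cacheable prefix. `call_llm` is a hypothetical
# stand-in for a real chat-completions call.

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_text, call_llm):
    history.append({"role": "user", "content": user_text})
    # Everything before the new user message is byte-identical to the
    # previous request, so the provider can serve it from cache.
    reply = call_llm(history)
    history.append({"role": "assistant", "content": reply})
    return reply
```

Note the history is only ever appended to; editing or reordering earlier messages invalidates the cached prefix.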

When it doesn’t help

  • Unique prompts — If every request is completely different, nothing gets cached
  • Low volume — If you send <1 request per 5 minutes, the cache expires between requests
  • Short prompts — Under 1,024 tokens, there’s nothing to cache

Implementation

Anthropic (Claude)

Anthropic requires explicit cache breakpoints: mark the static content with cache_control, and prefixes of at least 1,024 tokens are cached. Structure your prompt so the static part comes first:

response = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "You are a senior developer...",
        "cache_control": {"type": "ephemeral"},  # cache breakpoint
    }],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": large_codebase,
             "cache_control": {"type": "ephemeral"}},  # cached too
            {"type": "text", "text": "Fix the bug in auth.ts"},  # only this is new
        ],
    }],
)
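To confirm caching is actually firing, Anthropic responses report cache_read_input_tokens and cache_creation_input_tokens in the usage block. A small helper, shown against a plain dict for illustration (the real SDK returns these as attributes on response.usage):

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of input tokens that were served from cache."""
    cached = usage.get("cache_read_input_tokens", 0)   # discounted tokens
    created = usage.get("cache_creation_input_tokens", 0)  # cache writes
    fresh = usage.get("input_tokens", 0)               # full-price tokens
    total = cached + created + fresh
    return cached / total if total else 0.0
```

A ratio near zero on repeated requests usually means the prefix is changing between calls or is under the 1,024-token minimum.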

OpenAI

Same principle — static content first, dynamic content last:

response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": long_system_prompt},  # Cached (50% off)
        {"role": "user", "content": specific_question}       # Full price
    ]
)

Real savings example

An AI coding agent sending 100 requests/hour with a 5K-token system prompt:

Cost without caching:  5K × 100 × $15/1M = $7.50/hr
Cost with caching:     5K × 1 × $15/1M + 5K × 99 × $1.50/1M = $0.82/hr
Monthly savings:       ~$4,800/month
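The figures above can be reproduced in a few lines, assuming Anthropic-style pricing ($15 per 1M input tokens, 90% cache discount) and one cache miss per hour:

```python
# Reproduce the savings table: 5K-token system prompt, 100 requests/hr,
# first request each hour is a cache miss, the other 99 are hits.

def hourly_cost(prefix_tokens, requests_per_hour, cached=True,
                price_per_mtok=15.0, discount=0.90):
    if not cached:
        return prefix_tokens * requests_per_hour * price_per_mtok / 1e6
    miss = prefix_tokens * price_per_mtok / 1e6                # full price once
    hits = (prefix_tokens * (requests_per_hour - 1)
            * price_per_mtok * (1 - discount) / 1e6)           # 90% off the rest
    return miss + hits

without = hourly_cost(5000, 100, cached=False)   # $7.50/hr
with_cache = hourly_cost(5000, 100)              # ~$0.82/hr
monthly = (without - with_cache) * 24 * 30       # ~$4,800/month
```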

That’s why our AI race agents use structured prompts with static identity files loaded first — the system prompt and IDENTITY.md get cached across runs.

Combine with model routing

The biggest savings come from combining caching with model routing:

  1. Route simple tasks to DeepSeek ($0.27/1M) — no caching needed, already cheap
  2. Route complex tasks to Claude with caching — 90% off the system prompt
  3. Use local models for autocomplete — free

This combination typically achieves a 70-85% cost reduction compared with naively sending everything to Claude.
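A minimal sketch of such a router, using a naive length-based complexity heuristic and made-up model labels (real routers classify tasks far more carefully):

```python
# Sketch: route each task to the cheapest model that can handle it.
# Model names and the length heuristic are illustrative assumptions.

def route(task: str) -> str:
    if task.startswith("autocomplete:"):
        return "local-model"          # free, latency-sensitive
    if len(task) < 200:
        return "deepseek-chat"        # cheap enough that caching barely matters
    return "claude-with-caching"      # complex task: cached system prompt
```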

Related: How to Reduce LLM API Costs by 70% · OpenRouter Complete Guide · AI Coding Tools Pricing 2026