
Prefix Caching for LLM APIs: How It Works and Why It Saves Money


Prefix caching is the inference-level optimization behind prompt caching in APIs. When multiple requests share the same beginning (system prompt, few-shot examples, shared documents), the KV cache for that prefix is computed once and reused.

How it works

Request 1: [System prompt] + [User question A]
           ↓ Compute KV cache for system prompt (expensive)
           ↓ Compute KV cache for question A (cheap)

Request 2: [System prompt] + [User question B]
           ↓ REUSE KV cache for system prompt (free!)
           ↓ Compute KV cache for question B (cheap)

The system prompt's KV cache is computed once and shared across all requests that start with it.
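
The reuse step above can be sketched as a toy simulation. This is not how any particular engine implements it: the `ToyPrefixCache` class, the 4-token `BLOCK` size, and the string "tokens" are all hypothetical stand-ins, and the cache only records which prefixes were seen rather than storing real key/value tensors.

```python
from typing import List, Set, Tuple

BLOCK = 4  # tokens per cache block (hypothetical; chosen small for readability)


class ToyPrefixCache:
    """Toy model that remembers which block-aligned prefixes were prefilled."""

    def __init__(self) -> None:
        self.seen: Set[Tuple[str, ...]] = set()

    def prefill(self, tokens: List[str]) -> Tuple[int, int]:
        """Return (reused_tokens, computed_tokens) for one request."""
        reused = 0
        matching = True
        # Walk the prompt one full block at a time. A block is identified by
        # the entire prefix up to and including it, so a hit is only possible
        # when every earlier token matches too.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = tuple(tokens[:end])
            if matching and key in self.seen:
                reused = end
            else:
                matching = False
                self.seen.add(key)
        # Tokens past the last reused block (including any partial trailing
        # block) have to be computed for this request.
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
system_prompt = [f"sys{i}" for i in range(8)]
print(cache.prefill(system_prompt + ["qa0", "qa1", "qa2", "qa3"]))  # (0, 12)
print(cache.prefill(system_prompt + ["qb0", "qb1", "qb2", "qb3"]))  # (8, 4)
```

The second request reuses the 8 system-prompt tokens and only computes its own 4 question tokens, which mirrors the Request 1 / Request 2 diagram.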

Savings

| Scenario | Without prefix caching | With prefix caching |
| --- | --- | --- |
| 5K system prompt, 100 req/hr | 500K tokens/hr prefill | 5K + 99 × 0 = 5K tokens/hr |
| RAG with shared docs | Full recompute per query | Docs cached, only query computed |
| Coding agent with project context | Reload codebase every call | Codebase cached across calls |
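
The first row's arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the helper name `prefix_prefill_tokens` is made up for illustration, it counts only the shared prefix (not each request's unique question), and the single-miss case assumes the cache stays warm for the whole hour.

```python
def prefix_prefill_tokens(prefix_tokens: int, requests: int, cache_misses: int) -> int:
    """Prefix tokens actually prefilled: only cache misses pay for the prefix."""
    if not 0 <= cache_misses <= requests:
        raise ValueError("cache_misses must be between 0 and requests")
    return prefix_tokens * cache_misses


PREFIX = 5_000   # system prompt length in tokens, from the table
REQUESTS = 100   # requests per hour, from the table

# Without caching, every request is effectively a miss.
without_cache = prefix_prefill_tokens(PREFIX, REQUESTS, cache_misses=REQUESTS)
# With caching, only the first request misses (assumes no eviction).
with_cache = prefix_prefill_tokens(PREFIX, REQUESTS, cache_misses=1)

print(without_cache, with_cache)  # 500000 5000
```

If the cache is evicted between bursts of traffic, raise `cache_misses` accordingly; the savings shrink in proportion.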

Implementation

In APIs (prompt caching)

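Anthropic, OpenAI, and Google all offer prompt caching, but the details differ: OpenAI caches sufficiently long prompts automatically, Anthropic expects you to mark cacheable blocks with cache_control, and Google's Gemini API offers both implicit and explicit context caching. Cached tokens are typically billed at a discount rather than being free. In every case, structure your prompts so the shared part comes first: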

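long_system_prompt = "..."  # static instructions, identical on every request
shared_context = "..."      # static documents, identical on every request
unique_question = "..."     # changes on every request

messages = [
    {"role": "system", "content": long_system_prompt},  # shared prefix: cacheable
    {"role": "user", "content": shared_context},        # shared prefix: cacheable
    {"role": "user", "content": unique_question},       # differs per request: not cached
]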

In self-hosted inference

vLLM: --enable-prefix-caching (already the default in recent releases)

SGLang: Automatic via RadixAttention (on by default; --disable-radix-cache turns it off)

Ollama: No dedicated prefix-caching option (it targets single-user use, so there is little to share across users)

Connection to context engineering

Prefix caching rewards good context engineering. If you structure your context so the static parts come first (system prompt → shared docs → user-specific query), caching is maximally effective.

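Bad context ordering (user query first, then docs) defeats caching because the prefix changes on every request. A cache hit requires an exact match from the first token onward, so everything after the first differing token has to be recomputed, even if it is identical across requests.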
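
To make the ordering effect concrete, here is a small sketch that measures how many leading tokens two requests share under each layout. The helper `shared_prefix_len` and the whitespace-style "tokens" are illustrative assumptions; a real tokenizer would produce different counts, but the comparison holds.

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = ["You", "are", "a", "helpful", "assistant."]
docs = ["Doc", "one.", "Doc", "two."]
question_a = ["What", "is", "X?"]
question_b = ["How", "does", "Y", "work?"]

# Static parts first: system prompt and docs are both reusable.
good = shared_prefix_len(system + docs + question_a,
                         system + docs + question_b)

# Query placed before the docs: reuse stops at the end of the system prompt.
bad = shared_prefix_len(system + question_a + docs,
                        system + question_b + docs)

print(good, bad)  # 9 5
```

With the query ahead of the docs, the four document tokens are identical in both requests but sit after the first difference, so they cannot be served from the cache.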
