Prefix caching is the inference-level optimization behind prompt caching in APIs. When multiple requests share the same beginning (system prompt, few-shot examples, shared documents), the KV cache for that prefix is computed once and reused.
How it works
Request 1: [System prompt] + [User question A]
→ Compute KV cache for system prompt (expensive)
→ Compute KV cache for question A (cheap)
Request 2: [System prompt] + [User question B]
→ REUSE KV cache for system prompt (free!)
→ Compute KV cache for question B (cheap)
The system prompt's KV cache is computed once and shared across all requests that start with it.
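The request flow above can be mimicked with a toy cache. This is only a sketch: `ToyPrefixCache` is a hypothetical class, words stand in for tokens, and a placeholder string stands in for real KV tensors. It is not how any particular inference engine stores its cache; it only shows that the second request pays for its unique suffix.

```python
from typing import Dict, List, Tuple


class ToyPrefixCache:
    """Toy stand-in for a KV cache keyed by token prefix (not a real engine)."""

    def __init__(self) -> None:
        # Maps a token prefix to a placeholder for its KV state.
        self._store: Dict[Tuple[str, ...], str] = {}
        self.tokens_computed = 0  # running count of tokens that needed prefill

    def prefill(self, tokens: List[str]) -> int:
        """Process one request; return how many tokens had to be computed."""
        # Find the longest prefix of this request that is already cached.
        cached_len = 0
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._store:
                cached_len = end
                break
        # "Compute" KV state only for the uncached suffix, caching each new prefix.
        for end in range(cached_len + 1, len(tokens) + 1):
            self._store[tuple(tokens[:end])] = f"kv[{end}]"
        computed = len(tokens) - cached_len
        self.tokens_computed += computed
        return computed


system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
cache = ToyPrefixCache()
first = cache.prefill(system_prompt + ["What", "is", "KV", "cache", "?"])
second = cache.prefill(system_prompt + ["Explain", "prefill", "."])
print(first, second)  # 11 3
```

The first request computes all 11 tokens; the second reuses the 6-token system prompt and computes only its 3 new tokens.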
Savings
| Scenario | Without prefix caching | With prefix caching |
|---|---|---|
| 5K system prompt, 100 req/hr | 500K tokens/hr prefill | 5K + 99×0 = 5K tokens/hr |
| RAG with shared docs | Full recompute per query | Docs cached, only query computed |
| Coding agent with project context | Reload codebase every call | Codebase cached across calls |
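The arithmetic in the first row can be checked with a few lines of Python. This is a rough sketch under two assumptions that are not in the table: the cache stays warm for the whole hour, and each request adds a hypothetical 50-token question of its own. The function name and parameters are illustrative.

```python
def prefill_tokens_per_hour(
    prefix_tokens: int, unique_tokens: int, requests: int, cached: bool
) -> int:
    """Tokens that must be prefilled per hour for one shared prefix."""
    if cached:
        # The prefix is computed once; every request still pays for its own suffix.
        return prefix_tokens + requests * unique_tokens
    # Without caching, every request recomputes the prefix and its suffix.
    return requests * (prefix_tokens + unique_tokens)


# The table's scenario, counting only the 5K system prompt.
table_without = prefill_tokens_per_hour(5_000, 0, 100, cached=False)
table_with = prefill_tokens_per_hour(5_000, 0, 100, cached=True)
print(table_without, table_with)  # 500000 5000

# Same scenario with an assumed 50-token question per request.
without = prefill_tokens_per_hour(5_000, 50, 100, cached=False)
with_cache = prefill_tokens_per_hour(5_000, 50, 100, cached=True)
print(without, with_cache)  # 505000 10000
```

Even after counting the per-request question, prefill work drops by roughly 98% in this example, because the shared prefix dominates the prompt.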
Implementation
In APIs (prompt caching)
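Anthropic, OpenAI, and Google all offer prompt caching, though the mechanics differ by provider: some apply it automatically to sufficiently long prompts, while others expect you to mark which part of the prompt to cache. Check each provider's documentation for the details. Either way, structure your prompts so the shared part comes first: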
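long_system_prompt = "<long static instructions>"       # placeholder
shared_context = "<documents shared by every request>"  # placeholder
unique_question = "<this request's question>"           # placeholder

messages = [
    {"role": "system", "content": long_system_prompt},  # Static: part of the cached prefix
    {"role": "user", "content": shared_context},        # Static: part of the cached prefix
    {"role": "user", "content": unique_question},       # Varies per request: not cached
]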
In self-hosted inference
vLLM: --enable-prefix-caching
SGLang: Automatic via RadixAttention (on by default)
Ollama: Not supported (single-user, no sharing)
Connection to context engineering
Prefix caching rewards good context engineering. If you structure your context so the static parts come first (system prompt → shared docs → user-specific query), caching is maximally effective.
Bad context ordering (user query first, then docs) defeats caching because the prefix changes every request.
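One way to see the effect of ordering is to measure how many leading tokens two requests share, since only that shared run can be served from a prefix cache. The sketch below uses made-up token lists and a hypothetical helper, `shared_prefix_len`; real systems compare tokenizer output, not words.

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens that two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Illustrative stand-ins: 50 system-prompt tokens and 500 shared-document tokens.
system = ["sys"] * 50
docs = ["doc"] * 500
question_a = ["What", "is", "X", "?"]
question_b = ["How", "does", "Y", "work", "?"]

# Static parts first: the whole system prompt and document block is reusable.
good = shared_prefix_len(system + docs + question_a, system + docs + question_b)

# Query first: the requests diverge at the very first token.
bad = shared_prefix_len(question_a + system + docs, question_b + system + docs)

print(good, bad)  # 550 0
```

With the static parts first, all 550 shared tokens are reusable; with the query first, nothing is, even though the two requests contain the same system prompt and documents.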
Related: Prompt Caching Explained · KV Cache Explained · SGLang vs vLLM · How to Reduce LLM API Costs