Apr 24, 2026 · 1 min read

Prefix Caching for LLM APIs — How It Works and Why It Saves Money

Prefix caching is the inference-level optimization behind prompt caching in APIs. When multiple requests share the same beginning (system prompt, few-shot examples, shared documents), the KV cache for that prefix is computed once and reused.

How it works

Request 1: [System prompt] + [User question A]
           ↓ Compute KV cache for system prompt (expensive)
           ↓ Compute KV cache for question A (cheap)

Request 2: [System prompt] + [User question B]
           ↓ REUSE KV cache for system prompt (no recompute)
           ↓ Compute KV cache for question B (cheap)

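The system prompt’s KV cache is computed once and shared across all requests that start with it, for as long as the cached entry is retained. Once it is evicted or expires, the next request pays the full prefill cost again.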
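The flow above can be sketched as a toy simulation. This is not how any particular engine stores its cache; `ToyPrefixCache` is a hypothetical class, the "tokens" are plain words, and the stored value is a placeholder rather than real key/value tensors. It only illustrates the longest-matching-prefix lookup.

```python
from typing import Dict, List, Tuple


class ToyPrefixCache:
    """Toy model of prefix caching: remembers which token prefixes were already processed."""

    def __init__(self) -> None:
        # Maps a token prefix (as a tuple) to a placeholder for its KV entries.
        self._store: Dict[Tuple[str, ...], str] = {}

    def prefill(self, tokens: List[str]) -> Tuple[int, int]:
        """Return (reused, computed) token counts for one request."""
        # Find the longest prefix of this request that is already cached.
        reused = 0
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._store:
                reused = end
                break
        # "Compute" the remaining tokens and record every new prefix.
        for end in range(reused + 1, len(tokens) + 1):
            self._store[tuple(tokens[:end])] = "kv"
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
print(cache.prefill(system_prompt + ["What", "is", "RAG", "?"]))   # (0, 10)
print(cache.prefill(system_prompt + ["Explain", "KV", "caches"]))  # (6, 3)
```

The second request reuses all six system-prompt tokens and only computes its own three.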

Savings

Scenario	Without prefix caching	With prefix caching
5K-token system prompt, 100 req/hr	500K prefix tokens/hr prefilled	~5K prefix tokens/hr (computed once, assuming the cache stays warm)
RAG with shared docs	Full recompute per query	Docs cached, only query computed
Coding agent with project context	Reload codebase every call	Codebase cached across calls
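The arithmetic in the first row can be checked with a few lines. This is a back-of-the-envelope sketch: `prefill_tokens` is a hypothetical helper, it assumes a single cold miss followed by cache hits for the rest of the hour, and the 50-token question size is made up for illustration.

```python
def prefill_tokens(prefix_tokens: int, unique_tokens: int, requests: int, cached: bool) -> int:
    """Tokens that must be prefilled across `requests` requests."""
    if not cached:
        # Every request recomputes the shared prefix and its own unique part.
        return requests * (prefix_tokens + unique_tokens)
    # Assumption: one cold miss, then every later request hits the cache.
    return prefix_tokens + requests * unique_tokens


# First table row: 5K-token system prompt, 100 requests, questions ignored.
print(prefill_tokens(5_000, 0, 100, cached=False))   # 500000
print(prefill_tokens(5_000, 0, 100, cached=True))    # 5000

# Same row with a hypothetical 50-token question per request.
print(prefill_tokens(5_000, 50, 100, cached=False))  # 505000
print(prefill_tokens(5_000, 50, 100, cached=True))   # 10000
```

The unique part of each request is always computed, so the saving grows with the ratio of shared prefix to per-request text.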

Implementation

In APIs (prompt caching)

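Hosted APIs expose this as prompt caching, but not all in the same way: OpenAI applies it automatically to sufficiently long prompts, Anthropic has you mark the end of the cacheable prefix with cache_control, and Google's Gemini API offers both implicit caching and explicitly created caches. Cached input tokens are typically billed at a discount rather than at zero, so check your provider's documentation for minimum lengths, lifetimes, and pricing. In every case, structure your prompts so the shared part comes first: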

messages = [
    {"role": "system", "content": long_system_prompt},  # Static: part of the cacheable prefix
    {"role": "user", "content": shared_context},        # Static: part of the cacheable prefix
    {"role": "user", "content": unique_question},       # Changes per request: not cached
]
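One way to keep the shared part stable is to build every request through a single helper and check that the static messages serialize identically. The sketch below makes no API calls; `build_messages` and `static_part` are hypothetical names, and the prompt strings are placeholders.

```python
import json
from typing import Dict, List

Message = Dict[str, str]


def build_messages(system_prompt: str, shared_context: str, question: str) -> List[Message]:
    """Static content first, per-request content last."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": shared_context},
        {"role": "user", "content": question},
    ]


def static_part(messages: List[Message]) -> str:
    """Serialize everything except the final, per-request message."""
    return json.dumps(messages[:-1], sort_keys=True)


SYSTEM = "You answer questions about the handbook."
DOCS = "Handbook text goes here."

a = build_messages(SYSTEM, DOCS, "What is the refund policy?")
b = build_messages(SYSTEM, DOCS, "Who approves travel?")
print(static_part(a) == static_part(b))  # True

# A per-request value inside the system prompt makes the static part differ.
c = build_messages(SYSTEM + " Request id: 1", DOCS, "What is the refund policy?")
d = build_messages(SYSTEM + " Request id: 2", DOCS, "Who approves travel?")
print(static_part(c) == static_part(d))  # False
```

Anything that varies between requests, such as an ID or a timestamp, belongs in the final message in this layout so that it does not change the shared prefix.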

In self-hosted inference

vLLM: --enable-prefix-caching (recent releases turn it on by default)

SGLang: Automatic via RadixAttention (on by default)

Ollama: No dedicated prefix-caching option (it targets single-user local use rather than sharing a cache across many clients)

Connection to context engineering

Prefix caching rewards good context engineering. If you structure your context so the static parts come first (system prompt → shared docs → user-specific query), caching is maximally effective.

Bad context ordering (user query first, then docs) defeats caching because the prefix changes on every request, so nothing can be reused.
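The effect of ordering can be measured directly as the length of the prefix two requests have in common. This is a rough sketch: whitespace splitting stands in for a real tokenizer, `shared_prefix_len` is a hypothetical helper, and the strings are invented for the example.

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens that two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = "You are a support assistant .".split()  # 6 tokens
docs = "Doc one text . Doc two text .".split()    # 8 tokens
q1 = "How do refunds work ?".split()
q2 = "Where is my invoice ?".split()

# Static parts first: the whole system prompt and all docs are reusable.
good = shared_prefix_len(system + docs + q1, system + docs + q2)
# Query first: the requests diverge at the very first token.
bad = shared_prefix_len(q1 + docs + system, q2 + docs + system)
print(good, bad)  # 14 0
```

Both orderings send the same tokens; only the first lets a cache skip the 14 shared ones.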
