
KV Cache Explained — Why LLM Inference Is Fast


KV cache is the optimization that makes LLM inference fast. Without it, generating 100 tokens would require reprocessing the entire prompt 100 times. Understanding how it works is essential if you want to plan GPU memory, serve multiple users, or understand why LLM inference behaves the way it does.

What KV cache stores

Every transformer layer has an attention mechanism. During attention, the model computes three matrices from the input: Query (Q), Key (K), and Value (V). These matrices allow the model to determine which tokens are relevant to each other — this is the core of how transformers actually work.

During the prefill phase (processing your prompt), the model computes K and V matrices for every token at every layer. The KV cache stores these computed Key and Value tensors so they don’t need to be recomputed.

During the decode phase (generating output tokens one at a time), each new token only needs to:

  1. Compute its own Q, K, V vectors
  2. Attend to all previously cached K and V vectors
  3. Append its own K and V to the cache

Without KV cache, generating token N would require recomputing attention across all N-1 previous tokens from scratch — an O(n²) operation for the full sequence. With KV cache, each new token generation is O(n) because it only computes attention against the cached states.

This is the difference between a model that generates 50 tokens per second and one that generates 2 tokens per second. The cache trades memory for compute.
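The three decode steps above can be sketched in a few lines of NumPy. This is a toy illustration, not a real model: `q`, `k`, and `v` are stand-ins for the learned Q/K/V projections, and only a single attention head is shown.

```python
import numpy as np

def attention(q, K, V):
    # Single-query attention: q is (d,), K and V are (n, d).
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 64
rng = np.random.default_rng(0)

# Decode loop with a KV cache: each step appends one K and V row,
# then attends over everything cached so far -- O(n) per token
# instead of recomputing all previous K/V from scratch.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(5):
    x = rng.normal(size=d)   # hidden state of the new token
    q, k, v = x, x, x        # stand-ins for the real Q/K/V projections
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attention(q, K_cache, V_cache)

print(out.shape)  # (64,)
```

The cache rows are written once and read on every subsequent step, which is exactly the memory-for-compute trade described above.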

The math behind KV cache size

KV cache size per token per layer is:

2 × num_kv_heads × head_dim × precision_bytes

For a full model, multiply by number of layers. For a full sequence, multiply by sequence length. For a batch, multiply by batch size.

Example for Llama 3 70B (80 layers, 8 KV heads, 128 head dim, FP16):

Per token: 2 × 8 × 128 × 2 bytes × 80 layers = 327,680 bytes ≈ 0.33 MB
Per 4K context: 327,680 bytes × 4096 ≈ 1.3 GB
Per 32K context: 327,680 bytes × 32768 ≈ 10.7 GB
Per 128K context: 327,680 bytes × 131072 ≈ 43 GB

This grows linearly with context length and is often the VRAM bottleneck — not the model weights themselves.
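The formula can be wrapped in a small helper for quick sizing estimates. The function name and parameters are illustrative, not any library's API:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   precision_bytes=2, batch_size=1):
    """KV cache size: 2 (K and V) x kv_heads x head_dim x precision,
    per token per layer, scaled by layers, sequence length, and batch."""
    per_token = 2 * num_kv_heads * head_dim * precision_bytes * num_layers
    return per_token * seq_len * batch_size

# Llama 3 70B: 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes)
print(kv_cache_bytes(80, 8, 128, seq_len=1))          # 327680 bytes/token
print(kv_cache_bytes(80, 8, 128, 32_768) / 1e9)       # ~10.7 GB at 32K
```

Swapping `precision_bytes=1` models an FP8/INT8 cache, which halves every figure.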

Why it matters for VRAM

Here’s the practical impact across common models:

| Model | Context | KV cache per request | Model weights | Total per user |
|---|---|---|---|---|
| 7B model | 4K | ~0.5 GB | ~14 GB (FP16) | ~14.5 GB |
| 7B model | 32K | ~4 GB | ~14 GB | ~18 GB |
| 27B model | 32K | ~10 GB | ~54 GB | ~64 GB |
| 70B model | 32K | ~10 GB | ~140 GB | ~150 GB |
| 70B model | 128K | ~40 GB | ~140 GB | ~180 GB |

When you’re serving multiple concurrent users, each user needs their own KV cache. Five users with 32K context on a 70B model means 50 GB just for KV cache. This is why VRAM planning is critical for production deployments.

PagedAttention — vLLM’s innovation

Traditional KV cache allocation is wasteful. When a request arrives, the system must pre-allocate memory for the maximum possible sequence length — even if the actual generation is much shorter. This leads to massive memory fragmentation.

PagedAttention, introduced by vLLM, solves this by managing KV cache like an operating system manages virtual memory:

  • KV cache is divided into fixed-size pages (blocks)
  • Pages are allocated on demand as the sequence grows
  • Pages can be non-contiguous in physical memory
  • When a request finishes, its pages are immediately freed

The benefits are substantial:

  1. Near-zero waste — no pre-allocation of maximum sequence length
  2. Higher throughput — more concurrent requests fit in the same VRAM
  3. Copy-on-write sharing — requests with shared prefixes (like system prompts) can share KV cache pages
  4. Dynamic memory — memory usage tracks actual sequence length, not maximum

In practice, PagedAttention allows vLLM to serve 2-4x more concurrent users than naive KV cache management on the same hardware. This is the primary reason vLLM dominates production LLM serving.

Other KV cache optimizations

Grouped-Query Attention (GQA) — Instead of each query head having its own KV head, multiple query heads share KV heads. Llama 3 uses 8 KV heads shared across 64 query heads — an 8x reduction in KV cache size. This is why modern models can handle long contexts without exploding memory.

Multi-Query Attention (MQA) — The extreme version where all query heads share a single KV head. Maximum memory savings but slightly lower quality. Used in some smaller models optimized for inference speed.
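The savings from head sharing fall straight out of the size formula. Using Llama 3 70B's dimensions as an example (the hypothetical MHA variant with 64 KV heads is for comparison only):

```python
def kv_bytes_per_token(num_kv_heads, head_dim=128, num_layers=80,
                       precision_bytes=2):
    # 2 accounts for storing both K and V
    return 2 * num_kv_heads * head_dim * precision_bytes * num_layers

mha = kv_bytes_per_token(num_kv_heads=64)  # one KV head per query head
gqa = kv_bytes_per_token(num_kv_heads=8)   # Llama 3 70B's actual grouping
mqa = kv_bytes_per_token(num_kv_heads=1)   # single shared KV head

print(mha // gqa)  # 8  -- GQA's reduction vs. full MHA
print(mha // mqa)  # 64 -- MQA's reduction vs. full MHA
```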

KV cache quantization — Quantizing the cached K and V tensors from FP16 to 8 bits (FP8 or INT8) halves KV cache memory with minimal quality loss (typically <0.5% on benchmarks). vLLM supports FP8 natively with --kv-cache-dtype fp8.

Sliding window attention — Models like Mistral use a fixed attention window (e.g., 4096 tokens) where only the most recent tokens are cached. Older tokens are evicted. This caps KV cache size regardless of context length, at the cost of losing access to distant context.
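The eviction policy amounts to a fixed-size FIFO buffer. A toy sketch with a 4-token window (Mistral's real window is 4096, and the real cache holds tensors, not strings):

```python
from collections import deque

kv = deque(maxlen=4)       # holds K/V for only the 4 most recent tokens
for token_id in range(10):
    # appending beyond maxlen silently evicts the oldest entry
    kv.append(("k%d" % token_id, "v%d" % token_id))

print(len(kv))             # 4: bounded regardless of sequence length
print(kv[0][0])            # "k6": tokens 0-5 have been evicted
```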

Prompt caching — Reuses KV cache across requests that share the same prefix (like a system prompt). Anthropic offers 90% discount on cached input tokens. For serving, this means the system prompt’s KV cache is computed once and shared across all users.

Practical impact on serving

If you’re running a local AI coding server for your team, KV cache determines how many concurrent users you can serve:

Available KV memory = Total VRAM - Model weights - Overhead (~1GB)
Max concurrent users = Available KV memory / KV cache per user

Example: RTX 4090 (24 GB) serving a 27B model (Q4, ~16 GB weights), with the per-user figures below assuming an FP8 KV cache (half the FP16 values in the table above):

Available: 24 - 16 - 1 = 7 GB for KV cache
At 8K context: ~1.2 GB per user → 5 concurrent users
At 32K context: ~5 GB per user → 1 concurrent user
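The capacity arithmetic as a small helper (the function and parameter names are illustrative, not a real tool):

```python
def max_concurrent_users(total_vram_gb, weights_gb, kv_per_user_gb,
                         overhead_gb=1.0):
    """Whole number of users whose KV caches fit in leftover VRAM."""
    available = total_vram_gb - weights_gb - overhead_gb
    return int(available // kv_per_user_gb)

# RTX 4090 (24 GB), 27B model quantized to ~16 GB of weights
print(max_concurrent_users(24, 16, kv_per_user_gb=1.2))  # 5  (8K context)
print(max_concurrent_users(24, 16, kv_per_user_gb=5.0))  # 1  (32K context)
```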

This is why context length is a trade-off. Longer context means fewer concurrent users on the same hardware. For multi-user serving, consider limiting max context or using continuous batching to manage resources efficiently.

FAQ

How much memory does KV cache use?

It depends on the model architecture and context length. A rough formula: for a 7B model at FP16, expect ~0.5 GB per 4K tokens of context. For a 70B model, expect ~10 GB per 32K tokens. The key variables are number of KV heads, head dimension, number of layers, and precision. GQA models (most modern ones) use significantly less than older MHA models. Use --kv-cache-dtype fp8 in vLLM to halve usage.

Does KV cache affect output quality?

No — KV cache is mathematically identical to recomputing attention from scratch. It’s a pure speed optimization with zero quality impact. The model produces exactly the same outputs whether KV cache is used or not. However, KV cache quantization (FP16 → FP8) introduces a tiny approximation that can affect quality by <0.5% on benchmarks.

What is PagedAttention?

PagedAttention is vLLM’s memory management system for KV cache, inspired by virtual memory in operating systems. Instead of pre-allocating a contiguous block for each request’s maximum possible length, it allocates small fixed-size pages on demand. This eliminates memory fragmentation, enables copy-on-write sharing for common prefixes, and allows 2-4x more concurrent requests on the same GPU. It’s the core innovation that makes vLLM the fastest open-source LLM serving engine.

Related: LLM Inference Explained · Continuous Batching Explained · How Much VRAM for AI? · How Transformers Actually Work