LLM Inference Explained for Developers — How AI Models Generate Text
Every time you call an AI API or run a local model, inference happens. Understanding how it works helps you optimize for speed, cost, and quality.
The two phases
LLM inference has two distinct phases:
1. Prefill (prompt processing)
The model reads your entire prompt at once, processing all input tokens in parallel. This is fast because it’s parallelizable across GPU cores.
Time: Proportional to prompt length. A 1,000-token prompt takes ~10x longer than a 100-token prompt.
2. Decode (token generation)
The model generates output tokens one at a time. Each new token depends on all previous tokens. This is sequential and slower.
Time: Proportional to output length. Each token takes roughly the same time regardless of prompt length.
This is why prompt caching saves money — the prefill phase is skipped for cached tokens.
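The two phases can be sketched as a simple latency model. The throughput numbers below are illustrative assumptions for the sake of the arithmetic, not benchmarks:

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tps=2000, decode_tps=50,
                     cached_tokens=0):
    """Toy model: prefill runs in parallel (high throughput),
    decode is sequential (low throughput). Cached prompt
    tokens skip prefill entirely."""
    prefill_time = (prompt_tokens - cached_tokens) / prefill_tps
    decode_time = output_tokens / decode_tps
    return prefill_time + decode_time

# Same request, cold vs. with a warm prompt cache:
cold = estimate_latency(4_000, 500)                       # 2.0 + 10.0 = 12.0 s
warm = estimate_latency(4_000, 500, cached_tokens=4_000)  # 0.0 + 10.0 = 10.0 s
```

Note that caching only removes the prefill term; the sequential decode time is unchanged, which is why cached requests are cheaper and faster to first token but not faster per output token.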
KV Cache — why inference is fast
During prefill, the model computes attention states (Key and Value matrices) for every token. The KV cache stores these so they don’t need to be recomputed during decode.
Without a KV cache, every decode step would recompute attention over the entire sequence so far, so generating 100 tokens would mean reprocessing the prompt roughly 100 times. With the KV cache, the prompt is processed once, and each new token computes attention only against the cached states.
The tradeoff: KV cache uses VRAM. A 200K context window on a large model can consume 10-20GB of KV cache alone. This is why VRAM planning matters.
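A rough sizing formula helps with that planning. The model config below (layer count, KV heads, head dimension) is a hypothetical 70B-class shape with grouped-query attention; real models vary:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2):
    """K and V each store one head_dim vector per layer,
    per KV head, per token (hence the factor of 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

# Hypothetical config: 80 layers, 8 KV heads (GQA),
# head dim 128, fp16 values, 128K-token context:
gib = kv_cache_bytes(80, 8, 128, 128_000) / 2**30  # ~39 GiB
```

The formula also shows the available levers: fewer KV heads (GQA/MQA), lower-precision KV values (fp8 halves `bytes_per_elem`), or a shorter context all shrink the cache linearly.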
Tokens per second
The key performance metric. Typical speeds:
| Setup | Tokens/sec | Experience |
|---|---|---|
| Cloud API (Claude, GPT) | 50-100 | Instant |
| vLLM on A100 | 30-80 | Fast |
| Ollama on RTX 4090 | 20-40 | Good |
| Ollama on Mac M4 32GB | 15-30 | Usable |
| CPU inference | 1-5 | Slow |
For interactive coding, you need >15 tok/s. For batch processing, per-request speed matters less than total throughput.
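To turn tok/s into wall-clock time, divide the answer length by the speed. A quick sketch using speeds in the ranges from the table above:

```python
def generation_time(output_tokens, tok_per_sec):
    """Wall-clock decode time for a fixed-length answer."""
    return output_tokens / tok_per_sec

# A 500-token answer at cloud-API, local-GPU, and CPU speeds:
for tps in (75, 30, 3):
    print(f"{tps:>3} tok/s -> {generation_time(500, tps):.1f} s")
```

At 3 tok/s a 500-token answer takes nearly three minutes, which is why CPU inference is reserved for batch jobs rather than interactive use.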
Batching — serving multiple users
When serving multiple users, the inference engine batches requests together:
Static batching: Wait for N requests, process together. Simple but adds latency.
Continuous batching: New requests start processing as soon as the GPU has free capacity, with no waiting for a batch to fill. This is what vLLM uses, and a key reason it achieves 3-24x higher throughput than naive serving.
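A toy simulation shows where the gap comes from. The scheduler below is a deliberate simplification of what engines like vLLM actually do, counting only decode steps for requests with known output lengths:

```python
def static_batch_steps(lengths, batch_size):
    """Fixed batches: the GPU is tied up until the longest
    request in each batch finishes decoding."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Refill a freed slot immediately; one step = one decode
    iteration across all active requests."""
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        active = [r - 1 for r in active if r > 1]  # drop finished requests
        steps += 1
    return steps

# One long request mixed with short ones, two GPU slots:
lengths = [100, 10, 10, 10]
static = static_batch_steps(lengths, 2)          # 110 decode steps
continuous = continuous_batch_steps(lengths, 2)  # 100 decode steps
```

In the static case the short requests' slots sit idle while the long one finishes; continuous batching reuses them immediately. The gap widens as request lengths get more uneven, which is the typical case in production traffic.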
Cost structure
| Component | What drives cost |
|---|---|
| Input tokens | Prefill compute (GPU time) |
| Output tokens | Decode compute (sequential, slower) |
| KV cache | VRAM usage (limits concurrent users) |
| Idle GPU | Fixed cost whether serving or not |
This is why output tokens cost 2-5x more than input tokens in API pricing. See our LLM cost calculator for estimates.
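A back-of-the-envelope estimate makes the asymmetry concrete. The per-million-token prices below are placeholder assumptions for illustration, not any provider's actual rates:

```python
def request_cost(input_tokens, output_tokens,
                 price_in_per_m=3.00, price_out_per_m=15.00):
    """Illustrative USD prices per million tokens; output is
    priced higher because decode is sequential GPU time."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# 10K-token prompt, 1K-token answer:
cost = request_cost(10_000, 1_000)  # 0.03 + 0.015 = $0.045
```

Even with a 10:1 input-to-output token ratio, output tokens account for a third of the bill here, so trimming verbose responses often saves more than trimming prompts.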
Choosing an inference engine
| Engine | Best for | Guide |
|---|---|---|
| Ollama | Local dev, prototyping | Setup guide |
| vLLM | Production serving | Comparison |
| llama.cpp | CPU inference, edge | Comparison |
| Cloud API | Zero ops | OpenRouter guide |
See our complete inference engine comparison for benchmarks.
FAQ
How do LLMs generate text?
LLMs generate text one token at a time in two phases. First, the prefill phase processes your entire prompt in parallel to build an internal representation. Then, the decode phase generates output tokens sequentially — each new token depends on all previous tokens. This is why output generation is the bottleneck, not prompt processing.
Why are LLMs slow?
The decode phase is inherently sequential — each token must be generated one at a time because it depends on all previous tokens. This can’t be fully parallelized. Cloud APIs add network latency on top. Local models are limited by available compute power and memory bandwidth. Larger models with more parameters take longer per token.
What is time to first token?
Time to first token (TTFT) is the delay between sending a request and receiving the first output token. It’s dominated by the prefill phase (processing your prompt) plus network latency for cloud APIs. Local models typically achieve 0.1-0.3s TTFT, while cloud APIs take 0.8-1.2s. TTFT matters most for interactive applications where perceived responsiveness is important.
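This decomposition can be written down directly. The prefill throughput and network round-trip figures below are illustrative assumptions:

```python
def ttft_seconds(prompt_tokens, prefill_tps=2000, network_s=0.0):
    """TTFT ~= network round-trip + prompt prefill time."""
    return network_s + prompt_tokens / prefill_tps

local = ttft_seconds(400)                  # 0.2 s, no network hop
cloud = ttft_seconds(400, network_s=0.3)   # 0.5 s
```

The formula also explains why prompt caching improves perceived responsiveness so much: cached tokens remove most of the `prompt_tokens / prefill_tps` term.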
Related: How to Reduce LLM API Costs · Best GPU for AI Locally · Quantization Trade-offs · How Much VRAM for AI? · LLM Inference on Apple Silicon