LLM Inference Explained for Developers — How AI Models Generate Text
Every time you call an AI API or run a local model, inference happens. Understanding how it works helps you optimize for speed, cost, and quality.
The two phases
LLM inference has two distinct phases:
1. Prefill (prompt processing)
The model reads your entire prompt at once, processing all input tokens in parallel. This is fast because it’s parallelizable across GPU cores.
Time: Proportional to prompt length. A 1,000-token prompt takes ~10x longer than a 100-token prompt.
2. Decode (token generation)
The model generates output tokens one at a time. Each new token depends on all previous tokens. This is sequential and slower.
Time: Proportional to output length. Each token takes roughly the same time regardless of prompt length.
This is why prompt caching saves money — the prefill phase is skipped for cached tokens.
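The two phases can be sketched as a simple latency model. The throughput numbers below are illustrative assumptions for the sake of the arithmetic, not benchmarks:

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tps=2000, decode_tps=50,
                     cached_tokens=0):
    """Toy model: prefill runs in parallel (high throughput),
    decode is sequential (low throughput). Cached prompt
    tokens skip prefill entirely."""
    prefill_time = (prompt_tokens - cached_tokens) / prefill_tps
    decode_time = output_tokens / decode_tps
    return prefill_time + decode_time

# Same request, cold vs. with a warm prompt cache:
cold = estimate_latency(4_000, 500)                       # 2.0 + 10.0 = 12.0 s
warm = estimate_latency(4_000, 500, cached_tokens=4_000)  # 0.0 + 10.0 = 10.0 s
```

Note that caching only removes the prefill term; the sequential decode time is unchanged, which is why cached requests are cheaper and faster to first token but not faster per output token.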
KV Cache — why inference is fast
During prefill, the model computes attention states (Key and Value matrices) for every token. The KV cache stores these so they don’t need to be recomputed during decode.
Without a KV cache, every decode step would recompute attention over the entire sequence so far, so generating 100 tokens would mean reprocessing the prompt roughly 100 times. With the KV cache, the prompt is processed once, and each new token computes attention only against the cached states.
The tradeoff: KV cache uses VRAM. A 200K context window on a large model can consume 10-20GB of KV cache alone. This is why VRAM planning matters.
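A rough sizing formula helps with that planning. The model config below (layer count, KV heads, head dimension) is a hypothetical 70B-class shape with grouped-query attention; real models vary:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2):
    """K and V each store one head_dim vector per layer,
    per KV head, per token (hence the factor of 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

# Hypothetical config: 80 layers, 8 KV heads (GQA),
# head dim 128, fp16 values, 128K-token context:
gib = kv_cache_bytes(80, 8, 128, 128_000) / 2**30  # ~39 GiB
```

The formula also shows the available levers: fewer KV heads (GQA/MQA), lower-precision KV values (fp8 halves `bytes_per_elem`), or a shorter context all shrink the cache linearly.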
Tokens per second
The key performance metric. Typical speeds:
| Setup | Tokens/sec | Experience |
|---|---|---|
| Cloud API (Claude, GPT) | 50-100 | Instant |
| vLLM on A100 | 30-80 | Fast |
| Ollama on RTX 4090 | 20-40 | Good |
| Ollama on Mac M4 32GB | 15-30 | Usable |
| CPU inference | 1-5 | Slow |
For interactive coding, you need >15 tok/s. For batch processing, per-request speed matters less than total throughput.
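To turn tok/s into wall-clock time, divide the answer length by the speed. A quick sketch using speeds in the ranges from the table above:

```python
def generation_time(output_tokens, tok_per_sec):
    """Wall-clock decode time for a fixed-length answer."""
    return output_tokens / tok_per_sec

# A 500-token answer at cloud-API, local-GPU, and CPU speeds:
for tps in (75, 30, 3):
    print(f"{tps:>3} tok/s -> {generation_time(500, tps):.1f} s")
```

At 3 tok/s a 500-token answer takes nearly three minutes, which is why CPU inference is reserved for batch jobs rather than interactive use.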
Batching — serving multiple users
When serving multiple users, the inference engine batches requests together:
Static batching: Wait for N requests, process together. Simple but adds latency.
Continuous batching: New requests start processing as soon as the GPU has free capacity, with no waiting for a batch to fill. This is what vLLM uses, and a key reason it achieves 3-24x higher throughput than naive serving.
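A toy simulation shows where the gap comes from. The scheduler below is a deliberate simplification of what engines like vLLM actually do, counting only decode steps for requests with known output lengths:

```python
def static_batch_steps(lengths, batch_size):
    """Fixed batches: the GPU is tied up until the longest
    request in each batch finishes decoding."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Refill a freed slot immediately; one step = one decode
    iteration across all active requests."""
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        active = [r - 1 for r in active if r > 1]  # drop finished requests
        steps += 1
    return steps

# One long request mixed with short ones, two GPU slots:
lengths = [100, 10, 10, 10]
static = static_batch_steps(lengths, 2)          # 110 decode steps
continuous = continuous_batch_steps(lengths, 2)  # 100 decode steps
```

In the static case the short requests' slots sit idle while the long one finishes; continuous batching reuses them immediately. The gap widens as request lengths get more uneven, which is the typical case in production traffic.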
Cost structure
| Component | What drives cost |
|---|---|
| Input tokens | Prefill compute (GPU time) |
| Output tokens | Decode compute (sequential, slower) |
| KV cache | VRAM usage (limits concurrent users) |
| Idle GPU | Fixed cost whether serving or not |
This is why output tokens cost 2-5x more than input tokens in API pricing. See our LLM cost calculator for estimates.
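A back-of-the-envelope estimate makes the asymmetry concrete. The per-million-token prices below are placeholder assumptions for illustration, not any provider's actual rates:

```python
def request_cost(input_tokens, output_tokens,
                 price_in_per_m=3.00, price_out_per_m=15.00):
    """Illustrative USD prices per million tokens; output is
    priced higher because decode is sequential GPU time."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# 10K-token prompt, 1K-token answer:
cost = request_cost(10_000, 1_000)  # 0.03 + 0.015 = $0.045
```

Even with a 10:1 input-to-output token ratio, output tokens account for a third of the bill here, so trimming verbose responses often saves more than trimming prompts.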
Choosing an inference engine
| Engine | Best for | Guide |
|---|---|---|
| Ollama | Local dev, prototyping | Setup guide |
| vLLM | Production serving | Comparison |
| llama.cpp | CPU inference, edge | Comparison |
| Cloud API | Zero ops | OpenRouter guide |
See our complete inference engine comparison for benchmarks.
FAQ
How do LLMs generate text?
LLMs generate text one token at a time in two phases. First, the prefill phase processes your entire prompt in parallel to build an internal representation. Then, the decode phase generates output tokens sequentially — each new token depends on all previous tokens. This is why output generation is the bottleneck, not prompt processing.
Why are LLMs slow?
The decode phase is inherently sequential — each token must be generated one at a time because it depends on all previous tokens. This can’t be fully parallelized. Cloud APIs add network latency on top. Local models are limited by available compute power and memory bandwidth. Larger models with more parameters take longer per token.
What is time to first token?
Time to first token (TTFT) is the delay between sending a request and receiving the first output token. It’s dominated by the prefill phase (processing your prompt) plus network latency for cloud APIs. Local models typically achieve 0.1-0.3s TTFT, while cloud APIs take 0.8-1.2s. TTFT matters most for interactive applications where perceived responsiveness is important.
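This decomposition can be written down directly. The prefill throughput and network round-trip figures below are illustrative assumptions:

```python
def ttft_seconds(prompt_tokens, prefill_tps=2000, network_s=0.0):
    """TTFT ~= network round-trip + prompt prefill time."""
    return network_s + prompt_tokens / prefill_tps

local = ttft_seconds(400)                  # 0.2 s, no network hop
cloud = ttft_seconds(400, network_s=0.3)   # 0.5 s
```

The formula also explains why prompt caching improves perceived responsiveness so much: cached tokens remove most of the `prompt_tokens / prefill_tps` term.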
Related: How to Reduce LLM API Costs · Best GPU for AI Locally · Quantization Trade-offs · How Much VRAM for AI? · LLM Inference on Apple Silicon