How to Benchmark LLM Inference Correctly — Metrics That Matter
Most LLM benchmarks are misleading. Here’s how to measure what actually matters.
Key metrics explained
Tokens per second (tok/s)
How fast the model generates output. This measures decode speed — the rate at which new tokens are produced after the first one. For interactive coding, you need >15 tok/s to feel responsive. Below 10 tok/s feels sluggish.
Important: tok/s varies with sequence length. Early tokens are faster; speed drops as the KV cache grows. Always report average tok/s over the full generation.
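Given per-token arrival timestamps from a streaming client, the average decode speed is just the number of tokens after the first divided by the time elapsed since the first. A minimal sketch (the timestamp list is assumed to come from your own streaming client):

```python
def avg_tok_per_sec(token_times):
    """Average decode speed over the full generation.

    token_times: monotonic arrival time (seconds) of each generated token,
    first token included. Only tokens produced after the first count,
    since the first token's latency is prefill (TTFT), not decode.
    """
    if len(token_times) < 2:
        return 0.0
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])
```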
Time to first token (TTFT)
How long until the first token appears. This includes:
- Network latency (if remote)
- Prompt processing (prefill phase)
- KV cache computation for the input
Critical for interactive use — users notice anything >500ms. Long prompts (10K+ tokens) can push TTFT to several seconds even on fast hardware.
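With streaming enabled, TTFT is simply the delay until the first chunk arrives. A sketch, where `measure_ttft` is a hypothetical helper and `stream` is any iterator of response chunks (e.g. SSE lines from an OpenAI-compatible endpoint called with "stream": true):

```python
import time

def measure_ttft(stream, start):
    """Return (ttft_seconds, chunks) for an iterable of streamed chunks.

    `start` should be time.monotonic() taken just before the request
    was sent, so network latency and prefill are both included.
    """
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token has arrived
        chunks.append(chunk)
    return ttft, chunks
```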
Throughput (requests/sec)
How many concurrent requests the server handles. This is where continuous batching shines — frameworks like vLLM can serve 10-50x more requests than naive sequential processing.
Throughput matters for production APIs and team servers.
Latency p50/p95/p99
Median and tail latencies under load. p99 matters more than average — one slow request ruins the experience. A system with 100ms average but 5s p99 is worse than one with 200ms average and 500ms p99.
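Percentiles are cheap to compute yourself rather than trusting a tool's summary. A sketch using the nearest-rank method:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latencies; p in (0, 100]."""
    xs = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(xs)))
    return xs[k - 1]

def latency_summary(latencies):
    """p50/p95/p99 summary for a batch of request latencies (seconds)."""
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
```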
Inter-token latency (ITL)
Time between consecutive tokens during generation. Should be consistent — spikes indicate scheduling issues or memory pressure. Measure the standard deviation, not just the mean.
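From the same per-token arrival timestamps, the ITL mean and standard deviation fall out in a few lines (a sketch; `token_times` is assumed to come from your streaming client):

```python
import statistics

def itl_stats(token_times):
    """Mean and std dev of inter-token latencies, computed from per-token
    arrival timestamps (seconds). A large std dev means scheduling jitter."""
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return statistics.mean(itls), statistics.pstdev(itls)
```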
Common mistakes
- Benchmarking with short prompts — Real workloads have 1-10K token prompts. Short prompts hide prefill costs and KV cache overhead.
- Single-user benchmarks — Production serves many users. Test with concurrent requests to see how continuous batching performs.
- Ignoring warmup — First requests are slower (model loading, JIT compilation). Discard the first 10 requests.
- Comparing different quantizations — Q4 is almost always faster than FP16 because it moves less memory per token. Compare apples to apples.
- Not testing your actual workload — Coding tasks have different token patterns than chat. Benchmark with representative prompts.
- Measuring only generation, not end-to-end — Include network overhead, tokenization, and detokenization in your measurements.
- Not reporting hardware specs — A benchmark without GPU model, VRAM, driver version, and framework version is useless for comparison.
Benchmarking tools
llm-benchmark (quick single-user)
pip install llm-benchmark
llm-benchmark --url http://localhost:8000/v1 --model qwen3.5:27b \
  --prompt "Write a fibonacci function in Python" \
  --num-requests 50
genai-perf (NVIDIA’s official tool)
pip install genai-perf
genai-perf profile \
  -m qwen3.5:27b \
  --endpoint-type chat \
  --url localhost:8000 \
  --concurrency 8 \
  --synthetic-input-tokens-mean 512 \
  --output-tokens-mean 256
Note that --concurrency takes a single value, so run the command once per concurrency level (e.g. 1, 4, 8, 16, 32). Each run reports TTFT, ITL, throughput, and latency percentiles, and comparing runs shows how they change with load.
locust (load testing with concurrent users)
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def generate(self):
        self.client.post("/v1/chat/completions", json={
            "model": "qwen3.5:27b",
            "messages": [{"role": "user", "content": "Explain quicksort in detail"}],
            "max_tokens": 200,
        })
Run with: locust -f bench.py --host http://localhost:8000 --users 32 --spawn-rate 4 --headless --run-time 2m
vegeta (raw HTTP load testing)
# vegeta targets reference request bodies from a file, so write it out first
echo '{"model":"qwen3.5:27b","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}' > body.json
echo 'POST http://localhost:8000/v1/chat/completions
Content-Type: application/json
@body.json' | \
vegeta attack -rate=10/s -duration=60s | vegeta report
How to run a proper benchmark
Step 1: Define your workload
Create a representative prompt set:
prompts = [
# Short prompts (chat-like)
"What is Python?",
# Medium prompts (coding tasks)
"Write a REST API in FastAPI with authentication, rate limiting, and database connection pooling.",
# Long prompts (document analysis)
open("long_document.txt").read() + "\nSummarize this document.",
]
Step 2: Warmup
# Send 10 throwaway requests
for i in $(seq 1 10); do
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5:27b","messages":[{"role":"user","content":"Hi"}],"max_tokens":10}' > /dev/null
done
Step 3: Measure single-user performance
# Measure TTFT + total generation time. "stream": true is required here:
# without streaming, the first byte arrives with the full response, so
# %{time_starttransfer} would just equal %{time_total}.
for i in $(seq 1 50); do
curl -s -w "TTFT: %{time_starttransfer}s Total: %{time_total}s\n" \
-o /dev/null \
http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5:27b","messages":[{"role":"user","content":"Write a fibonacci function in Python"}],"max_tokens":200,"stream":true}'
done
Step 4: Measure under load
Increase concurrency gradually (1, 4, 8, 16, 32) and record how metrics change. Good systems maintain low TTFT even at high concurrency thanks to continuous batching.
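One way to script the sweep is a thread pool per concurrency level. In this sketch, `send_request` is a hypothetical zero-argument callable that issues one blocking request with whatever HTTP client you use:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def sweep(send_request, levels=(1, 4, 8, 16, 32), n=32):
    """Fire n requests at each concurrency level; return sorted latencies
    per level so percentiles can be read off directly."""
    def timed(_):
        t0 = time.monotonic()
        send_request()
        return time.monotonic() - t0

    results = {}
    for c in levels:
        with ThreadPoolExecutor(max_workers=c) as pool:
            results[c] = sorted(pool.map(timed, range(n)))
    return results
```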
Step 5: Report results properly
Always include:
- Hardware (GPU model, VRAM, CPU, RAM)
- Software (framework version, CUDA version, quantization)
- Workload (prompt length distribution, max_tokens, temperature)
- Metrics (TTFT p50/p95/p99, tok/s, throughput at each concurrency level)
Interpreting results
What “good” looks like
| Metric | Interactive use | Batch processing |
|---|---|---|
| TTFT | <500ms | Doesn’t matter |
| Tok/s | >15 | >5 |
| p99 latency | <3s | <30s |
| Throughput | >10 req/s | >50 req/s |
| ITL std dev | <20ms | <50ms |
Red flags
- TTFT increases linearly with concurrency — batching isn’t working properly
- Tok/s drops >50% under load — memory pressure or scheduling issues
- p99 is 10x+ higher than p50 — indicates queuing or OOM-related stalls
- ITL has large spikes — possible memory swapping or garbage collection
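These checks are easy to automate once you have the summary numbers. A sketch using this article's rules of thumb (the parameter names `toks_idle` and `toks_loaded`, meaning tok/s at concurrency 1 vs. under load, are placeholders):

```python
def red_flags(p50, p99, toks_idle, toks_loaded):
    """Return human-readable warnings based on this article's thresholds."""
    flags = []
    if p99 > 10 * p50:
        flags.append("p99 is 10x+ p50: queuing or OOM stalls")
    if toks_loaded < 0.5 * toks_idle:
        flags.append("tok/s drops >50% under load: memory pressure")
    return flags
```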
Comparing inference frameworks
When comparing vLLM vs Ollama vs llama.cpp vs TGI, use identical:
- Model and quantization
- Hardware
- Prompt set
- Concurrency levels
- Max output tokens
See our AI Model Leaderboards guide for help interpreting public benchmark results.
Related: Serve LLMs with vLLM · vLLM vs Ollama vs llama.cpp · AI Model Leaderboards Explained · Continuous Batching · LLM Inference Explained · GPU Memory Planning