How to Benchmark LLM Inference Correctly: Metrics That Matter
Most LLM benchmarks are misleading. Here's how to measure what actually matters.
Key metrics explained
Tokens per second (tok/s)
How fast the model generates output. This measures decode speed: the rate at which new tokens are produced after the first one. For interactive coding, you need >15 tok/s to feel responsive. Below 10 tok/s feels sluggish.
Important: tok/s varies with sequence length. Early tokens are faster; speed drops as the KV cache grows. Always report average tok/s over the full generation.
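As a rough sketch of what "average tok/s over the full generation" means, the function below divides the number of decoded tokens by the time between the first and last token. The function name and the timestamps are illustrative assumptions; in practice the timestamps would come from a streaming client that records when each token arrives.

```python
def decode_tokens_per_second(token_times):
    """Average decode speed over a whole generation.

    token_times: arrival time in seconds of each output token, in order.
    The first token is excluded from the count because its delay is
    prefill (TTFT), not decode.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    decode_tokens = len(token_times) - 1
    decode_seconds = token_times[-1] - token_times[0]
    return decode_tokens / decode_seconds


# Hypothetical timestamps: first token at 0.40 s, then one token every 50 ms.
times = [0.40 + 0.05 * i for i in range(101)]
print(round(decode_tokens_per_second(times), 1))
```

Measuring only the first few dozen tokens with the same function would overstate the speed a user sees on a long response, which is why the full generation is used here.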
Time to first token (TTFT)
How long until the first token appears. This includes:
- Network latency (if remote)
- Prompt processing (prefill phase)
- KV cache computation for the input
Critical for interactive use: users notice anything >500ms. Long prompts (10K+ tokens) can push TTFT to several seconds even on fast hardware.
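One way to capture TTFT on the client side is to start a timer just before consuming a streamed response and stop it at the first non-empty chunk. The sketch below assumes `chunks` is any lazy iterable that sends the request when iteration begins (for example a streaming HTTP response); `time_to_first_token` and `fake_stream` are hypothetical names used only for illustration.

```python
import time


def time_to_first_token(chunks, clock=time.perf_counter):
    """Return (ttft_seconds, first_chunk) for an iterable of streamed chunks.

    The clock starts right before iteration begins, so network latency and
    prefill are both included, as they are for a real user.
    """
    start = clock()
    for chunk in chunks:
        if chunk:  # ignore empty keep-alive chunks
            return clock() - start, chunk
    raise RuntimeError("stream ended before any token arrived")


def fake_stream():
    """Stand-in for a streaming response: 20 ms of 'prefill', then tokens."""
    time.sleep(0.02)
    yield "Hello"
    yield " world"


ttft, first = time_to_first_token(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, first chunk: {first!r}")
```

The injectable `clock` argument is a design choice that makes the function testable without real delays.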
Throughput (requests/sec)
How many requests the server completes per second while handling concurrent users. This is where continuous batching shines: frameworks like vLLM can serve 10-50x more requests than naive sequential processing.
Throughput matters for production APIs and team servers.
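A minimal sketch of the calculation, assuming you have logged a finish time and an output token count for every request: throughput is the number of requests completed inside the measurement window divided by the window length. The function name and the sample data are illustrative, not measurements.

```python
def throughput(finish_times, output_tokens, window_start, window_end):
    """Requests/sec and output tokens/sec for requests finishing in the window."""
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    done = [
        tokens
        for finished, tokens in zip(finish_times, output_tokens)
        if window_start <= finished <= window_end
    ]
    seconds = window_end - window_start
    return len(done) / seconds, sum(done) / seconds


# Hypothetical run: 120 requests of 200 tokens each, finishing evenly over 60 s.
finish = [0.5 * (i + 1) for i in range(120)]
req_s, tok_s = throughput(finish, [200] * 120, window_start=0.0, window_end=60.0)
print(req_s, tok_s)
```

Reporting aggregate tokens per second next to requests per second helps when output lengths differ between runs.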
Latency p50/p95/p99
Median and tail latencies under load. p99 matters more than average: one slow request ruins the experience. A system with 100ms average but 5s p99 is worse than one with 200ms average and 500ms p99.
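To make the average-versus-tail point concrete, here is a small sketch using the nearest-rank percentile definition (other definitions interpolate and give slightly different values). The two latency lists are synthetic, built to roughly match the example above; they are not measurements.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in seconds, 1000 requests per system.
system_a = [0.045] * 989 + [5.0] * 11   # fast on average, terrible tail
system_b = [0.197] * 989 + [0.5] * 11   # slower on average, tight tail

for name, latencies in (("A", system_a), ("B", system_b)):
    print(
        name,
        f"mean={statistics.mean(latencies):.3f}s",
        f"p50={percentile(latencies, 50):.3f}s",
        f"p99={percentile(latencies, 99):.3f}s",
    )
```

System A wins on the mean but roughly one request in a hundred takes 5 seconds, which is what users remember.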
Inter-token latency (ITL)
Time between consecutive tokens during generation. Should be consistent: spikes indicate scheduling issues or memory pressure. Measure the standard deviation, not just the mean.
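The sketch below shows why the standard deviation is worth reporting: a single stall barely moves the mean but dominates the spread. It assumes per-token arrival timestamps are available; `itl_stats` and the sample data are hypothetical.

```python
import statistics


def itl_stats(token_times):
    """Mean, sample standard deviation, and max of the gaps between tokens."""
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three tokens")
    return {
        "mean_s": statistics.mean(gaps),
        "stdev_s": statistics.stdev(gaps),
        "max_s": max(gaps),
    }


# Hypothetical generation: 50 ms between tokens, with one 500 ms stall.
gaps = [0.05] * 99
gaps[60] = 0.5
token_times = [0.0]
for gap in gaps:
    token_times.append(token_times[-1] + gap)

stats = itl_stats(token_times)
print({key: round(value, 4) for key, value in stats.items()})
```

Keeping `max_s` alongside the standard deviation makes individual spikes easy to spot in a results table.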
Common mistakes
- Benchmarking with short prompts: Real workloads have 1-10K token prompts. Short prompts hide prefill costs and KV cache overhead.
- Single-user benchmarks: Production serves many users. Test with concurrent requests to see how continuous batching performs.
- Ignoring warmup: First requests are slower (model loading, JIT compilation). Discard the first 10 requests.
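- Comparing different quantizations: Q4 is usually faster than FP16 during decode because less weight data has to be read per token, so a Q4-vs-FP16 result says little about the framework or hardware. Compare apples to apples.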
- Not testing your actual workload: Coding tasks have different token patterns than chat. Benchmark with representative prompts.
- Measuring only generation, not end-to-end: Include network overhead, tokenization, and detokenization in your measurements.
- Not reporting hardware specs: A benchmark without GPU model, VRAM, driver version, and framework version is useless for comparison.
If you need specific GPU hardware for benchmarking that you don't own, cloud GPU providers let you spin up exactly the configuration you need for a few hours.
Benchmarking tools
llm-benchmark (quick single-user)
pip install llm-benchmark
llm-benchmark --url http://localhost:8000/v1 --model qwen3.5:27b \
  --prompt "Write a fibonacci function in Python" \
  --num-requests 50
genai-perf (NVIDIA's official tool)
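pip install genai-perf
# Flag names differ between genai-perf releases; confirm with `genai-perf profile --help`.
# --concurrency takes a single value, so sweep the levels in a loop.
for c in 1 4 8 16 32; do
  genai-perf profile \
    -m qwen3.5:27b \
    --endpoint-type chat \
    --url localhost:8000 \
    --streaming \
    --concurrency "$c" \
    --synthetic-input-tokens-mean 512 \
    --output-tokens-mean 256
done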
This gives you TTFT, ITL, throughput, and latency percentiles across different concurrency levels.
locust (load testing with concurrent users)
from locust import HttpUser, task, between


class LLMUser(HttpUser):
    # Each simulated user pauses 1-3 seconds between requests
    wait_time = between(1, 3)

    @task
    def generate(self):
        self.client.post("/v1/chat/completions", json={
            "model": "qwen3.5:27b",
            "messages": [{"role": "user", "content": "Explain quicksort in detail"}],
            "max_tokens": 200
        })
Run with: locust -f bench.py --host http://localhost:8000 --users 32 --spawn-rate 4
vegeta (raw HTTP load testing)
# vegeta's HTTP target format reads the request body from a file via @path
echo '{"model":"qwen3.5:27b","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}' > body.json
echo 'POST http://localhost:8000/v1/chat/completions
Content-Type: application/json
@body.json' | \
  vegeta attack -rate=10/s -duration=60s | vegeta report
How to run a proper benchmark
Step 1: Define your workload
Create a representative prompt set:
prompts = [
    # Short prompts (chat-like)
    "What is Python?",
    # Medium prompts (coding tasks)
    "Write a REST API in FastAPI with authentication, rate limiting, and database connection pooling.",
    # Long prompts (document analysis)
    open("long_document.txt").read() + "\nSummarize this document.",
]
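A prompt list alone does not say how often each kind of prompt occurs. One possible way to turn it into a repeatable request sequence is to sample from length buckets with fixed weights and a fixed seed, so every framework under test sees the same requests in the same order. The bucket names, weights, and `build_workload` helper below are assumptions for illustration; use the proportions seen in your own traffic.

```python
import random


def build_workload(prompts_by_bucket, weights, n_requests, seed=0):
    """Draw n_requests (bucket, prompt) pairs following the bucket weights."""
    rng = random.Random(seed)  # fixed seed: identical sequence on every run
    buckets = sorted(prompts_by_bucket)
    picks = rng.choices(buckets, weights=[weights[b] for b in buckets], k=n_requests)
    return [(bucket, rng.choice(prompts_by_bucket[bucket])) for bucket in picks]


prompts_by_bucket = {
    "short": ["What is Python?"],
    "medium": ["Write a REST API in FastAPI with authentication."],
    "long": ["(contents of a long document)\nSummarize this document."],
}
# Hypothetical mix; replace with the distribution of your real workload.
weights = {"short": 0.2, "medium": 0.5, "long": 0.3}

workload = build_workload(prompts_by_bucket, weights, n_requests=100, seed=42)
counts = {b: sum(1 for bucket, _ in workload if bucket == b) for b in weights}
print(counts)
```

Publishing the seed and weights along with the results lets others reproduce the exact request mix.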
Step 2: Warmup
# Send 10 throwaway requests
for i in $(seq 1 10); do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen3.5:27b","messages":[{"role":"user","content":"Hi"}],"max_tokens":10}' > /dev/null
done
Step 3: Measure single-user performance
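# Measure TTFT + total generation time.
# "stream": true is required: without it the server sends the body only after
# generation finishes, so time_starttransfer would be close to the total time.
# time_starttransfer is time to first response byte, an approximation of TTFT.
for i in $(seq 1 50); do
  curl -s -N -w "TTFT: %{time_starttransfer}s Total: %{time_total}s\n" \
    -o /dev/null \
    http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen3.5:27b","messages":[{"role":"user","content":"Write a fibonacci function in Python"}],"max_tokens":200,"stream":true}'
done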
Step 4: Measure under load
Increase concurrency gradually (1, 4, 8, 16, 32) and record how metrics change. Good systems maintain low TTFT even at high concurrency thanks to continuous batching.
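A minimal sketch of such a sweep, using a thread pool from the standard library: `send_request` is a placeholder for whatever function performs one blocking call to your server, and `fake_request` merely sleeps so the example runs without a server. Both names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_level(send_request, prompts, concurrency):
    """Send every prompt using `concurrency` worker threads."""
    def timed(prompt):
        start = time.perf_counter()
        send_request(prompt)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, prompts))
    wall_seconds = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "latencies_s": latencies,
        "req_per_s": len(prompts) / wall_seconds,
    }


def sweep(send_request, prompts, levels=(1, 4, 8, 16, 32)):
    """Run the same prompts at each concurrency level, lowest first."""
    return [run_level(send_request, prompts, level) for level in levels]


def fake_request(prompt):
    """Stand-in for a real HTTP call; replace with your client code."""
    time.sleep(0.01)


results = sweep(fake_request, ["hi"] * 16, levels=(1, 4))
for row in results:
    print(row["concurrency"], f"{row['req_per_s']:.0f} req/s")
```

Threads are sufficient here because each worker spends its time waiting on the network; an asyncio client would be a reasonable alternative at very high concurrency.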
Step 5: Report results properly
Always include:
- Hardware (GPU model, VRAM, CPU, RAM)
- Software (framework version, CUDA version, quantization)
- Workload (prompt length distribution, max_tokens, temperature)
- Metrics (TTFT p50/p95/p99, tok/s, throughput at each concurrency level)
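One way to make sure none of these fields is forgotten is to build the report programmatically and refuse to emit it when something is missing. The field names and the `build_report` helper below are an assumed schema, not a standard, and every value in the example is a placeholder rather than a real result.

```python
import json
import platform

# Assumed schema mirroring the checklist above.
REQUIRED = {
    "hardware": ("gpu_model", "vram_gb", "cpu", "ram_gb"),
    "software": ("framework_version", "cuda_version", "quantization"),
    "workload": ("prompt_length_distribution", "max_tokens", "temperature"),
    "metrics": ("ttft_s", "tok_per_s", "throughput_by_concurrency"),
}


def build_report(**sections):
    """Validate that every required field is present, then return the report."""
    for section, fields in REQUIRED.items():
        missing = [f for f in fields if f not in sections.get(section, {})]
        if missing:
            raise ValueError(f"{section} is missing: {', '.join(missing)}")
    report = {name: dict(sections[name]) for name in REQUIRED}
    report["software"]["python_version"] = platform.python_version()
    return report


# Placeholder values only; fill in from your own run.
report = build_report(
    hardware={"gpu_model": "EXAMPLE-GPU", "vram_gb": 0, "cpu": "EXAMPLE-CPU", "ram_gb": 0},
    software={"framework_version": "x.y.z", "cuda_version": "x.y", "quantization": "Q4"},
    workload={"prompt_length_distribution": {"short": 0.2, "medium": 0.5, "long": 0.3},
              "max_tokens": 200, "temperature": 0.0},
    metrics={"ttft_s": {"p50": 0.0, "p95": 0.0, "p99": 0.0},
             "tok_per_s": 0.0,
             "throughput_by_concurrency": {"1": 0.0, "4": 0.0}},
)
print(json.dumps(report, indent=2))
```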
Interpreting results
What "good" looks like
| Metric | Interactive use | Batch processing |
|---|---|---|
| TTFT | <500ms | Doesn't matter |
| Tok/s | >15 | >5 |
| p99 latency | <3s | <30s |
| Throughput | >10 req/s | >50 req/s |
| ITL std dev | <20ms | <50ms |
Red flags
- TTFT increases linearly with concurrency: batching isn't working properly
- Tok/s drops >50% under load: memory pressure or scheduling issues
- p99 is 10x+ higher than p50: indicates queuing or OOM-related stalls
- ITL has large spikes: possible memory swapping or garbage collection
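These checks can be automated over per-concurrency results. The sketch below is one possible encoding: the 50% and 10x thresholds come from the list above, while the 0.8 factor used to decide that TTFT growth is "roughly linear" and the 5x factor for an ITL "spike" are assumptions you should tune. The dictionary keys and sample numbers are hypothetical.

```python
def red_flags(levels, itl_spike_factor=5.0):
    """Return warning strings for a list of per-concurrency result dicts.

    Each dict needs: concurrency, ttft_p50_s, tok_per_s, latency_p50_s,
    latency_p99_s, itl_mean_s, itl_max_s.
    """
    levels = sorted(levels, key=lambda row: row["concurrency"])
    low, high = levels[0], levels[-1]
    flags = []

    # TTFT growing about as fast as concurrency suggests queuing, not batching.
    load_ratio = high["concurrency"] / low["concurrency"]
    ttft_ratio = high["ttft_p50_s"] / low["ttft_p50_s"]
    if load_ratio > 1 and ttft_ratio >= 0.8 * load_ratio:  # 0.8 is an assumption
        flags.append("TTFT scales roughly linearly with concurrency")

    if high["tok_per_s"] < 0.5 * low["tok_per_s"]:
        flags.append("tok/s drops more than 50% under load")

    for row in levels:
        if row["latency_p99_s"] >= 10 * row["latency_p50_s"]:
            flags.append(f"p99 is 10x+ p50 at concurrency {row['concurrency']}")
        if row["itl_max_s"] > itl_spike_factor * row["itl_mean_s"]:
            flags.append(f"ITL spike at concurrency {row['concurrency']}")
    return flags


# Hypothetical results, not measurements.
baseline = {"concurrency": 1, "ttft_p50_s": 0.2, "tok_per_s": 40,
            "latency_p50_s": 5.0, "latency_p99_s": 6.0,
            "itl_mean_s": 0.025, "itl_max_s": 0.04}
overloaded = {"concurrency": 16, "ttft_p50_s": 3.2, "tok_per_s": 15,
              "latency_p50_s": 7.0, "latency_p99_s": 80.0,
              "itl_mean_s": 0.06, "itl_max_s": 0.9}

for flag in red_flags([baseline, overloaded]):
    print("WARNING:", flag)
```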
Comparing inference frameworks
When comparing vLLM vs Ollama vs llama.cpp vs TGI, use identical:
- Model and quantization
- Hardware
- Prompt set
- Concurrency levels
- Max output tokens
See our AI model leaderboards guide for understanding public benchmark results.