How to Benchmark LLM Inference Correctly — Metrics That Matter
Most LLM benchmarks are misleading. Here’s how to measure what actually matters.
Key metrics explained
Tokens per second (tok/s)
How fast the model generates output. This measures decode speed — the rate at which new tokens are produced after the first one. For interactive coding, you need >15 tok/s to feel responsive. Below 10 tok/s feels sluggish.
Important: tok/s varies with sequence length. Early tokens are faster; speed drops as the KV cache grows. Always report average tok/s over the full generation.
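Given per-token arrival timestamps from a streaming client, the average decode speed is just the number of tokens after the first divided by the time elapsed since the first. A minimal sketch (the timestamp list is assumed to come from your own streaming client):

```python
def avg_tok_per_sec(token_times):
    """Average decode speed over the full generation.

    token_times: monotonic arrival time (seconds) of each generated token,
    first token included. Only tokens produced after the first count,
    since the first token's latency is prefill (TTFT), not decode.
    """
    if len(token_times) < 2:
        return 0.0
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])
```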
Time to first token (TTFT)
How long until the first token appears. This includes:
- Network latency (if remote)
- Prompt processing (prefill phase)
- KV cache computation for the input
Critical for interactive use — users notice anything >500ms. Long prompts (10K+ tokens) can push TTFT to several seconds even on fast hardware.
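With streaming enabled, TTFT is simply the delay until the first chunk arrives. A sketch, where `measure_ttft` is a hypothetical helper and `stream` is any iterator of response chunks (e.g. SSE lines from an OpenAI-compatible endpoint called with "stream": true):

```python
import time

def measure_ttft(stream, start):
    """Return (ttft_seconds, chunks) for an iterable of streamed chunks.

    `start` should be time.monotonic() taken just before the request
    was sent, so network latency and prefill are both included.
    """
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token has arrived
        chunks.append(chunk)
    return ttft, chunks
```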
Throughput (requests/sec)
How many concurrent requests the server handles. This is where continuous batching shines — frameworks like vLLM can serve 10-50x more requests than naive sequential processing.
Throughput matters for production APIs and team servers.
Latency p50/p95/p99
Median and tail latencies under load. p99 matters more than average — one slow request ruins the experience. A system with 100ms average but 5s p99 is worse than one with 200ms average and 500ms p99.
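Percentiles are cheap to compute yourself rather than trusting a tool's summary. A sketch using the nearest-rank method:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latencies; p in (0, 100]."""
    xs = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(xs)))
    return xs[k - 1]

def latency_summary(latencies):
    """p50/p95/p99 summary for a batch of request latencies (seconds)."""
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
```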
Inter-token latency (ITL)
Time between consecutive tokens during generation. Should be consistent — spikes indicate scheduling issues or memory pressure. Measure the standard deviation, not just the mean.
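From the same per-token arrival timestamps, the ITL mean and standard deviation fall out in a few lines (a sketch; `token_times` is assumed to come from your streaming client):

```python
import statistics

def itl_stats(token_times):
    """Mean and std dev of inter-token latencies, computed from per-token
    arrival timestamps (seconds). A large std dev means scheduling jitter."""
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return statistics.mean(itls), statistics.pstdev(itls)
```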
Common mistakes
- Benchmarking with short prompts — Real workloads have 1-10K token prompts. Short prompts hide prefill costs and KV cache overhead.
- Single-user benchmarks — Production serves many users. Test with concurrent requests to see how continuous batching performs.
- Ignoring warmup — First requests are slower (model loading, JIT compilation). Discard the first 10 requests.
- Comparing different quantizations — Q4 is almost always faster than FP16 because it moves less memory per token. Compare apples to apples.
- Not testing your actual workload — Coding tasks have different token patterns than chat. Benchmark with representative prompts.
- Measuring only generation, not end-to-end — Include network overhead, tokenization, and detokenization in your measurements.
- Not reporting hardware specs — A benchmark without GPU model, VRAM, driver version, and framework version is useless for comparison.
Benchmarking tools
llm-benchmark (quick single-user)
pip install llm-benchmark
llm-benchmark --url http://localhost:8000/v1 --model qwen3.5:27b \
  --prompt "Write a fibonacci function in Python" \
  --num-requests 50
genai-perf (NVIDIA’s official tool)
pip install genai-perf
genai-perf profile \
  -m qwen3.5:27b \
  --endpoint-type chat \
  --url localhost:8000 \
  --concurrency 8 \
  --synthetic-input-tokens-mean 512 \
  --output-tokens-mean 256
Note that --concurrency takes a single value, so run the command once per concurrency level (e.g. 1, 4, 8, 16, 32). Each run reports TTFT, ITL, throughput, and latency percentiles, and comparing runs shows how they change with load.
locust (load testing with concurrent users)
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def generate(self):
        self.client.post("/v1/chat/completions", json={
            "model": "qwen3.5:27b",
            "messages": [{"role": "user", "content": "Explain quicksort in detail"}],
            "max_tokens": 200,
        })
Run with: locust -f bench.py --host http://localhost:8000 --users 32 --spawn-rate 4 --headless --run-time 2m
vegeta (raw HTTP load testing)
# vegeta targets reference request bodies from a file, so write it out first
echo '{"model":"qwen3.5:27b","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}' > body.json
echo 'POST http://localhost:8000/v1/chat/completions
Content-Type: application/json
@body.json' | \
vegeta attack -rate=10/s -duration=60s | vegeta report
How to run a proper benchmark
Step 1: Define your workload
Create a representative prompt set:
prompts = [
# Short prompts (chat-like)
"What is Python?",
# Medium prompts (coding tasks)
"Write a REST API in FastAPI with authentication, rate limiting, and database connection pooling.",
# Long prompts (document analysis)
open("long_document.txt").read() + "\nSummarize this document.",
]
Step 2: Warmup
# Send 10 throwaway requests
for i in $(seq 1 10); do
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5:27b","messages":[{"role":"user","content":"Hi"}],"max_tokens":10}' > /dev/null
done
Step 3: Measure single-user performance
# Measure TTFT + total generation time. "stream": true is required here:
# without streaming, the first byte arrives with the full response, so
# %{time_starttransfer} would just equal %{time_total}.
for i in $(seq 1 50); do
curl -s -w "TTFT: %{time_starttransfer}s Total: %{time_total}s\n" \
-o /dev/null \
http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5:27b","messages":[{"role":"user","content":"Write a fibonacci function in Python"}],"max_tokens":200,"stream":true}'
done
Step 4: Measure under load
Increase concurrency gradually (1, 4, 8, 16, 32) and record how metrics change. Good systems maintain low TTFT even at high concurrency thanks to continuous batching.
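One way to script the sweep is a thread pool per concurrency level. In this sketch, `send_request` is a hypothetical zero-argument callable that issues one blocking request with whatever HTTP client you use:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def sweep(send_request, levels=(1, 4, 8, 16, 32), n=32):
    """Fire n requests at each concurrency level; return sorted latencies
    per level so percentiles can be read off directly."""
    def timed(_):
        t0 = time.monotonic()
        send_request()
        return time.monotonic() - t0

    results = {}
    for c in levels:
        with ThreadPoolExecutor(max_workers=c) as pool:
            results[c] = sorted(pool.map(timed, range(n)))
    return results
```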
Step 5: Report results properly
Always include:
- Hardware (GPU model, VRAM, CPU, RAM)
- Software (framework version, CUDA version, quantization)
- Workload (prompt length distribution, max_tokens, temperature)
- Metrics (TTFT p50/p95/p99, tok/s, throughput at each concurrency level)
Interpreting results
What “good” looks like
| Metric | Interactive use | Batch processing |
|---|---|---|
| TTFT | <500ms | Doesn’t matter |
| Tok/s | >15 | >5 |
| p99 latency | <3s | <30s |
| Throughput | >10 req/s | >50 req/s |
| ITL std dev | <20ms | <50ms |
Red flags
- TTFT increases linearly with concurrency — batching isn’t working properly
- Tok/s drops >50% under load — memory pressure or scheduling issues
- p99 is 10x+ higher than p50 — indicates queuing or OOM-related stalls
- ITL has large spikes — possible memory swapping or garbage collection
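These checks are easy to automate once you have the summary numbers. A sketch using this article's rules of thumb (the parameter names `toks_idle` and `toks_loaded`, meaning tok/s at concurrency 1 vs. under load, are placeholders):

```python
def red_flags(p50, p99, toks_idle, toks_loaded):
    """Return human-readable warnings based on this article's thresholds."""
    flags = []
    if p99 > 10 * p50:
        flags.append("p99 is 10x+ p50: queuing or OOM stalls")
    if toks_loaded < 0.5 * toks_idle:
        flags.append("tok/s drops >50% under load: memory pressure")
    return flags
```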
Comparing inference frameworks
When comparing vLLM vs Ollama vs llama.cpp vs TGI, use identical:
- Model and quantization
- Hardware
- Prompt set
- Concurrency levels
- Max output tokens
See our AI Model Leaderboards guide for help interpreting public benchmark results.
Related: Serve LLMs with vLLM · vLLM vs Ollama vs llama.cpp · AI Model Leaderboards Explained · Continuous Batching · LLM Inference Explained · GPU Memory Planning