
Continuous Batching Explained — How LLM Servers Handle Multiple Users


Continuous batching is why vLLM serves 3-24x more users than naive, one-request-at-a-time inference. If you’re serving an LLM to multiple users — whether a coding team or a production API — understanding batching is the difference between a responsive system and one that wastes most of your GPU.

The problem with static batching

Traditional (static) batching collects N requests, processes them as a group, and returns all results together. This approach has two fatal flaws:

Flaw 1: Waiting for the batch to fill. If your batch size is 8 and only 3 requests have arrived, you either wait (adding latency) or process a partial batch (wasting GPU capacity).

Flaw 2: Padding to the longest sequence. If one request in the batch generates 10 tokens and another generates 500, the short request must wait for the long one to finish. The GPU slot for the completed request sits idle, doing nothing.

In practice, static batching wastes 60-80% of available GPU compute. The GPU is either waiting for batches to fill or waiting for the longest request to finish.

How continuous batching works

Continuous batching (also called iteration-level scheduling or in-flight batching) operates at the granularity of individual decode steps rather than complete requests:

  1. Requests enter immediately — no waiting for a batch to fill
  2. Each iteration processes all active requests — one decode step per request
  3. When a request finishes (hits EOS or max tokens), its slot is freed instantly
  4. A waiting request fills the freed slot on the very next iteration

The key insight: scheduling happens at every single token generation step, not at the request level. This means the GPU is always processing the maximum number of active requests it can handle.
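The four steps above can be sketched as a small simulation. The `Request` class and fixed per-request token budgets below are illustrative stand-ins for real sampling until EOS; this is a toy model of the scheduling loop, not any engine's actual code:

```python
from collections import deque

class Request:
    """A generation request with a fixed token budget
    (a stand-in for sampling until EOS or max tokens)."""
    def __init__(self, name, total_tokens):
        self.name = name
        self.total_tokens = total_tokens
        self.generated = 0

    def done(self):
        return self.generated >= self.total_tokens

def continuous_batching(requests, max_slots):
    """Iteration-level scheduling: one decode step per active request per
    iteration; a finished request frees its slot for the next iteration."""
    waiting = deque(requests)
    active, finished, iterations = [], [], 0
    while waiting or active:
        # Admit new requests into any free slots -- no waiting for a full batch.
        while waiting and len(active) < max_slots:
            active.append(waiting.popleft())
        # One decode step for every active request (one batched GPU pass).
        for req in active:
            req.generated += 1
        iterations += 1
        # Free slots the instant a request hits its token budget.
        finished += [r for r in active if r.done()]
        active = [r for r in active if not r.done()]
    return iterations, [r.name for r in finished]

reqs = [Request("A", 3), Request("B", 15), Request("C", 10), Request("D", 4)]
iters, order = continuous_batching(reqs, max_slots=3)
# D is admitted the iteration after A finishes, so all four requests
# complete in 15 iterations -- the length of the longest request.
```

Note that total wall time equals the longest request's length: the short requests ride along in slots that would otherwise sit idle.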

Iteration-level scheduling in detail

Here’s what happens during a single iteration with continuous batching:

Iteration 1: [Req A (token 5), Req B (token 12), Req C (token 1)]
Iteration 2: [Req A (token 6), Req B (token 13), Req C (token 2)]
Iteration 3: [Req A (DONE),    Req B (token 14), Req C (token 3)]
Iteration 4: [Req D (token 1), Req B (token 15), Req C (token 4)]  ← D fills A's slot

Request A finishes at iteration 3. By iteration 4, Request D is already being processed. Zero GPU cycles wasted.

Compare with static batching where Request D would wait until both B and C finish before the next batch starts.
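The gap can be quantified with a toy count. The request lengths and two-slot setup below are illustrative, not benchmarks:

```python
lengths = [3, 15, 10, 4]   # output tokens per request (illustrative)
SLOTS = 2

# Static batching: requests are grouped into batches of SLOTS, and each batch
# runs (with padding) until its longest member finishes.
batches = [lengths[i:i + SLOTS] for i in range(0, len(lengths), SLOTS)]
static_iters = sum(max(b) for b in batches)        # 15 + 10 = 25

# Continuous batching: the same two slots, but a freed slot is refilled on
# the very next iteration.
def continuous_iters(lengths, slots):
    waiting, active, t = list(lengths), [], 0
    while waiting or active:
        while waiting and len(active) < slots:     # fill free slots
            active.append(waiting.pop(0))
        active = [n - 1 for n in active]           # one decode step each
        t += 1
        active = [n for n in active if n > 0]      # free finished slots
    return t

cont_iters = continuous_iters(lengths, SLOTS)      # 17
```

With 32 output tokens total on two slots, 16 iterations is the theoretical floor. Continuous batching lands at 17, while static batching burns 25 × 2 − 32 = 18 slot-iterations on padding versus 2 for continuous.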

Prefill vs decode phases

LLM inference has two distinct phases that continuous batching must handle:

Prefill phase — Processing the input prompt. This is compute-bound (matrix multiplications over all input tokens at once). It’s fast per-token but uses a burst of compute.

Decode phase — Generating output tokens one at a time. This is memory-bandwidth-bound (reading model weights and KV cache for each token). It’s slower per-token but uses less compute per step.
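A back-of-envelope roofline calculation makes the contrast concrete. All model and hardware figures below (a 7B-parameter model in fp16, ~1 TB/s of HBM bandwidth, ~300 TFLOP/s of fp16 compute) are assumptions for illustration only:

```python
# Assumed model and hardware figures -- illustrative, not measured.
params = 7e9                   # 7B-parameter model
bytes_per_param = 2            # fp16
weight_bytes = params * bytes_per_param           # ~14 GB read per decode step
hbm_bandwidth = 1e12           # ~1 TB/s HBM bandwidth (assumed)
gpu_flops = 300e12             # ~300 TFLOP/s fp16 (assumed)

# Decode: each new token must stream all weights from HBM, so memory
# bandwidth caps per-sequence speed regardless of spare compute.
decode_tok_per_s = hbm_bandwidth / weight_bytes   # ~71 tokens/s per sequence

# Prefill: a 1000-token prompt reuses each weight for all 1000 tokens in one
# pass, so the cost is compute-bound (~2 FLOPs per parameter per token).
prompt_len = 1000
prefill_seconds = (2 * params * prompt_len) / gpu_flops   # ~0.05 s
```

The decode ceiling is per sequence: batching many decode steps together amortizes the same 14 GB weight read across the whole batch, which is precisely the headroom continuous batching exploits.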

The challenge: prefill and decode have different resource profiles. Advanced engines handle this with chunked prefill — breaking long prompts into chunks and interleaving them with decode steps from other requests. This prevents a single long prompt from stalling all active generations.

vLLM and SGLang both implement chunked prefill. TGI uses a simpler approach where prefill takes priority but is bounded in size.
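Chunked prefill can be sketched as a per-iteration token budget. The chunk size and loop below are a simplified illustration, not any engine's actual scheduler:

```python
CHUNK = 256   # prefill token budget per iteration (assumed value)

def plan_prefill(prefill_remaining):
    """Tokens of the long prompt to prefill in this iteration."""
    return min(CHUNK, prefill_remaining)

prompt_tokens, chunks = 1000, []
while prompt_tokens > 0:
    chunk = plan_prefill(prompt_tokens)
    chunks.append(chunk)
    prompt_tokens -= chunk
    # ...in the same iteration, every active request still gets its decode
    # step, so other users keep streaming tokens while the prompt is absorbed.
# A 1000-token prompt is spread over 4 iterations (256 + 256 + 256 + 232)
# instead of stalling all decodes for one monolithic prefill pass.
```

The trade-off is a slightly longer time-to-first-token for the long prompt in exchange for stable inter-token latency for everyone else.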

Real throughput impact

The improvement from continuous batching is dramatic:

| Batching strategy | Throughput (tok/s) | Avg latency | GPU utilization |
|---|---|---|---|
| No batching | 30 | Low | ~10% |
| Static (batch=8) | 120 | High (waiting) | ~40% |
| Continuous | 200-400 | Low | ~85% |
| Continuous + PagedAttention | 300-600 | Low | ~90% |

The 2-10x throughput improvement comes from eliminating idle time. With continuous batching, the GPU processes tokens every single cycle. Combined with PagedAttention for efficient memory management, modern engines achieve near-theoretical-maximum GPU utilization.

How vLLM implements it

vLLM’s scheduler runs before every decode iteration:

  1. Check if any active requests have finished → free their KV cache pages
  2. Check the waiting queue for new requests
  3. If memory is available, admit new requests (run their prefill)
  4. Execute one decode step for all active requests in parallel
  5. Return completed outputs to clients immediately

The scheduler also handles preemption — if memory runs low, it can swap lower-priority requests’ KV cache to CPU RAM and resume them later. This prevents out-of-memory crashes under load.
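A toy version of steps 1-3, with the memory check made explicit. The dict-based requests and page counts are illustrative, and preemption is only noted in a comment; this is a simplified sketch, not vLLM's actual scheduler code:

```python
def schedule(active, waiting, free_pages):
    """One admit/free pass run before every decode iteration (toy model;
    each request dict carries its KV-cache page count and a finished flag)."""
    # Step 1: finished requests free their KV-cache pages instantly.
    for r in [r for r in active if r["finished"]]:
        free_pages += r["pages"]
        active.remove(r)
    # Steps 2-3: admit waiting requests while their KV cache fits in memory.
    while waiting and waiting[0]["pages"] <= free_pages:
        r = waiting.pop(0)
        free_pages -= r["pages"]
        active.append(r)   # its prefill runs before it joins the decode batch
    # Steps 4-5 (the batched decode pass and streaming finished outputs) run
    # outside this function; preemption (swapping a request's KV cache to CPU
    # RAM under memory pressure) is omitted from the sketch.
    return active, waiting, free_pages

active = [{"name": "A", "pages": 4, "finished": True},
          {"name": "B", "pages": 6, "finished": False}]
waiting = [{"name": "D", "pages": 5, "finished": False}]
active, waiting, free = schedule(active, waiting, free_pages=2)
# A's 4 pages are reclaimed, which is exactly what lets D be admitted
# (5 pages needed, 2 + 4 now free).
```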

How TGI implements it

HuggingFace’s Text Generation Inference uses a similar approach but with some differences:

  • Token streaming is built-in (each token is sent to the client as it’s generated)
  • Flash Attention 2 is used instead of PagedAttention for some models
  • Watermark-based scheduling controls admission
  • Supports speculative decoding for additional speedup

Both engines achieve similar throughput for standard workloads. vLLM tends to win on high-concurrency scenarios; TGI has better HuggingFace ecosystem integration.

Which engines support continuous batching

| Engine | Continuous batching | Notes |
|---|---|---|
| vLLM | ✅ Best implementation | PagedAttention + chunked prefill |
| TGI | ✅ Supported | Good HuggingFace integration |
| SGLang | ✅ Advanced | 29% faster on shared-prefix workloads |
| Ollama | ❌ Sequential | Fine for single user |
| llama.cpp server | ⚠️ Basic | Slot-based, limited scheduling |
| TensorRT-LLM | ✅ Supported | NVIDIA-optimized, complex setup |

For multi-user production serving, continuous batching is non-negotiable. For local single-user development, Ollama processes requests sequentially and that’s perfectly fine — there’s no batch to optimize when you’re the only user.

When continuous batching matters most

The benefit scales with concurrency. At 1 user, there’s nothing to batch. At 10+ concurrent users, continuous batching is the difference between a usable system and one that queues requests for seconds.

Use cases where it’s essential:

  • Team coding servers (5+ developers sharing a GPU)
  • API endpoints with variable-length responses
  • Chat applications with many concurrent sessions
  • Any scenario where requests arrive continuously

Use cases where it doesn’t matter:

  • Single-user local development (use Ollama)
  • Batch processing where all inputs are similar length
  • One-shot inference scripts

See our vLLM deployment guide for production setup instructions.

FAQ

Does continuous batching affect latency?

For individual requests, continuous batching slightly improves latency compared to static batching because requests don’t wait for a batch to fill. However, under heavy load, prefill of new requests can briefly delay decode steps for active requests. Advanced engines mitigate this with chunked prefill. Overall, continuous batching gives you both better throughput AND better average latency.

Which inference engines support continuous batching?

vLLM, TGI, SGLang, and TensorRT-LLM all support continuous batching. vLLM has the most mature implementation with PagedAttention. Ollama and basic llama.cpp do not support it — they process requests sequentially. If you need multi-user serving, use vLLM or TGI. See our comparison guide for detailed benchmarks.

Do I need continuous batching for single-user use?

No. If you’re the only user, there’s nothing to batch — requests are processed one at a time regardless. Continuous batching only helps when multiple requests need to be served concurrently. For local development, Ollama is the right choice. Save continuous batching for team servers and production APIs.

Related: KV Cache Explained · Serve LLMs with vLLM · LLM Inference Explained · vLLM vs Ollama vs llama.cpp vs TGI