
vLLM CUDA Out of Memory Fix: GPU Optimization for LLM Serving (2026)


You started a vLLM server and got:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.65 GiB of which 1.24 GiB is free.

vLLM needs GPU memory for three things: model weights, the KV cache (the attention keys and values stored for every token of every active request), and activation memory. When their combined total exceeds your available VRAM, you get OOM errors. Here's how to fix it.

Fix 1: Reduce GPU memory utilization

vLLM tries to use 90% of GPU memory by default. Lower it:

# Default: 0.9 (90% of VRAM)
# Reduce to give headroom
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --gpu-memory-utilization 0.8

# For tight VRAM situations
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --gpu-memory-utilization 0.7

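This shrinks the memory budget vLLM reserves, and the cut comes out of the KV cache: fewer concurrent requests fit, but the GPU keeps headroom for other processes and for allocations outside vLLM's budget. If the value is too low to hold the weights plus a minimal KV cache, vLLM will refuse to start instead.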
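To see how much KV cache each setting leaves, here is a rough back-of-the-envelope sketch. The weight and activation figures are assumptions for illustration (vLLM measures the real values at startup), and the function name is made up for this example.

```python
def kv_cache_budget_gib(total_gib, utilization, weights_gib, activations_gib):
    """Rough GiB left for the KV cache inside vLLM's memory budget."""
    return total_gib * utilization - weights_gib - activations_gib

# Illustrative numbers only: a card reporting 23.65 GiB, ~13 GiB of weights,
# and an assumed ~2 GiB of peak activation memory.
for util in (0.9, 0.8, 0.7):
    left = kv_cache_budget_gib(23.65, util, weights_gib=13.0, activations_gib=2.0)
    print(f"--gpu-memory-utilization {util}: ~{left:.2f} GiB for KV cache")
```

A result at or below zero means the model will not start at that setting, no matter how few requests you send.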

Fix 2: Use quantization

Load the model in lower precision:

# AWQ quantization (4-bit, good quality)
vllm serve TheBloke/model-AWQ --quantization awq

# GPTQ quantization (4-bit)
vllm serve TheBloke/model-GPTQ --quantization gptq

# BitsAndBytes (8-bit or 4-bit)
vllm serve model-name --quantization bitsandbytes --load-format bitsandbytes

Memory savings:

Precision         VRAM for 7B model   VRAM for 70B model
FP16              ~14 GB              ~140 GB
INT8              ~7 GB               ~70 GB
INT4 (AWQ/GPTQ)   ~4 GB               ~40 GB

For most serving use cases, AWQ quantization gives the best quality-per-VRAM ratio.
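The table's figures come from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them for weights only (KV cache and activations come on top). The helper name is hypothetical, and the INT4 results land slightly under the table's rounded values, since quantized checkpoints also store scales and usually keep a few layers in higher precision.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions, precision):
    """Approximate weight memory in GB (10^9 bytes): parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

for size in (7, 70):
    for precision in BYTES_PER_PARAM:
        print(f"{size}B @ {precision}: ~{weight_memory_gb(size, precision):.1f} GB")
```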

Fix 3: Tensor parallelism (multi-GPU)

Split the model across multiple GPUs:

# 2 GPUs
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct --tensor-parallel-size 2

# 4 GPUs
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct --tensor-parallel-size 4

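Each GPU holds a slice of the model, so weight memory per GPU is roughly total model size / number of GPUs. Every GPU still needs its own room for KV cache and activations, and the tensor-parallel size must evenly divide the model's number of attention heads.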
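A quick way to pick a tensor-parallel size is to compute the per-GPU share of the weights and check which sizes divide the attention head count. This is a sketch with hypothetical helper names; the 140 GB and 64-head figures are example inputs, not measurements of a specific model.

```python
def per_gpu_weights_gb(model_weights_gb, tensor_parallel_size):
    """Approximate weight memory each GPU holds when the model is split evenly."""
    return model_weights_gb / tensor_parallel_size

def valid_tp_sizes(num_attention_heads, gpu_count):
    """Tensor-parallel sizes up to gpu_count that evenly divide the head count."""
    return [n for n in range(1, gpu_count + 1) if num_attention_heads % n == 0]

# Example inputs: ~140 GB of FP16 weights, 64 attention heads, 8 GPUs available
for tp in valid_tp_sizes(64, 8):
    print(f"--tensor-parallel-size {tp}: ~{per_gpu_weights_gb(140, tp):.1f} GB of weights per GPU")
```

Compare the per-GPU figure against each card's VRAM, and leave space for the KV cache on top of it.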

Fix 4: Reduce max model length

The KV cache scales linearly with context length. Reduce it if you don't need the full context:

# Default: the model's own maximum context length (often 32768 or higher)
vllm serve model-name --max-model-len 4096

# For chat applications where context is short
vllm serve model-name --max-model-len 2048

This is the same principle as reducing Ollama's context window: less context = less memory.
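To put numbers on it, the KV cache for one sequence is 2 (keys and values) x layers x KV heads x head dimension x bytes per element, per token. The sketch below uses an assumed model shape (32 layers, 8 KV heads, head dimension 128, FP16 cache); read the real values from your model's config.json.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_cache_gib(max_model_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache in GiB for a single sequence that fills the whole context."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    return max_model_len * per_token / GIB

# Assumed shape, for illustration only
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
for length in (32768, 4096, 2048):
    print(f"--max-model-len {length}: {kv_cache_gib(length, **shape):.2f} GiB per full sequence")
```

With this shape, dropping from 32768 to 4096 tokens cuts the worst-case cache per sequence by a factor of eight.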

Fix 5: Limit concurrent requests

More concurrent requests = more KV cache entries = more VRAM:

# Limit max concurrent sequences
vllm serve model-name --max-num-seqs 8  # Default is often 256

# Limit max batched tokens
vllm serve model-name --max-num-batched-tokens 4096

This reduces throughput but prevents OOM under load.
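One way to choose a value for --max-num-seqs is to ask how many sequences fit if every one of them grows to the full context length. This is a sizing heuristic, not what vLLM computes internally (it allocates cache blocks on demand), and the budget and per-token figures below are assumptions.

```python
GIB = 1024 ** 3

def max_full_length_seqs(kv_budget_gib, max_model_len, kv_bytes_per_token):
    """How many sequences fit if every one grows to max_model_len tokens."""
    bytes_per_seq = max_model_len * kv_bytes_per_token
    return int(kv_budget_gib * GIB // bytes_per_seq)

# Assumed values: 4 GiB available for KV cache, 128 KiB of cache per token
print(max_full_length_seqs(4, max_model_len=4096, kv_bytes_per_token=128 * 1024))
```

Real traffic rarely fills every context, so treat the result as a conservative floor and raise it while watching memory under load.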

Fix 6: Use swap space (last resort)

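If you're slightly over VRAM under load, give vLLM some CPU RAM to swap KV cache blocks into:

vllm serve model-name --swap-space 4  # 4 GiB of CPU RAM per GPU for swapped-out KV cache

Requests whose cache is swapped out run significantly slower, but the server keeps running. Swap space only holds KV cache; it does not help if the model weights themselves don't fit.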

Production configuration example

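A balanced config for serving a 70B model on 2x A100 80GB. The model name is a placeholder: --quantization awq requires a checkpoint that was already quantized with AWQ.

vllm serve your-org/your-70b-model-AWQ \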
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --max-num-seqs 32 \
  --quantization awq
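If you launch vLLM from a deployment script, it can help to keep these flags in one dictionary and build the command from it. The sketch below is one possible approach, not part of vLLM: the helper name is made up, and model-name is a placeholder for your own checkpoint.

```python
import shlex

def build_vllm_command(model, options):
    """Turn a dict of option names (without leading dashes) into a `vllm serve` argv list."""
    argv = ["vllm", "serve", model]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv

config = {
    "tensor-parallel-size": 2,
    "gpu-memory-utilization": 0.85,
    "max-model-len": 8192,
    "max-num-seqs": 32,
    "quantization": "awq",
}

# Print the command for review; pass the list to subprocess.run to launch it.
print(shlex.join(build_vllm_command("model-name", config)))
```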

When vLLM isn't the right tool

If you're running into constant OOM issues, consider:

  • Ollama: simpler, handles quantization automatically, better for single-user setups
  • llama.cpp: more memory-efficient for CPU inference, supports GGUF quantization
  • Cloud APIs: OpenRouter or direct provider APIs if local serving isn't worth the hardware cost

See our Ollama vs llama.cpp vs vLLM comparison for when to use each.
