
vLLM CUDA Out of Memory Fix: GPU Optimization for LLM Serving (2026)


You started a vLLM server and got:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.65 GiB of which 1.24 GiB is free.

vLLM needs GPU memory for three things: model weights, the KV cache (conversation context), and activation memory. When their combined footprint exceeds your VRAM, you get OOM errors. Here’s how to fix it.

Fix 1: Reduce GPU memory utilization

vLLM tries to use 90% of GPU memory by default. Lower it:

# Default: 0.9 (90% of VRAM)
# Reduce to give headroom
vllm serve meta-llama/Llama-4-Scout --gpu-memory-utilization 0.8

# For tight VRAM situations
vllm serve meta-llama/Llama-4-Scout --gpu-memory-utilization 0.7

This reduces the KV cache size, which means fewer concurrent requests but no OOM errors.
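To see why this helps, here is a back-of-envelope sketch (all numbers are hypothetical, and vLLM's real accounting also reserves activation memory, so treat this as an approximation):

```python
# Rough model of vLLM's memory budget: it reserves utilization * total VRAM,
# loads the weights into that budget, and hands most of the remainder to the
# KV cache (activation overhead is ignored here for simplicity).
def kv_cache_budget_gib(total_vram_gib: float, utilization: float,
                        weights_gib: float) -> float:
    """Approximate VRAM left over for the KV cache."""
    return total_vram_gib * utilization - weights_gib

# A 7B model in FP16 (~14 GiB of weights) on a 24 GiB card:
print(kv_cache_budget_gib(24, 0.9, 14))  # ~7.6 GiB for KV cache
print(kv_cache_budget_gib(24, 0.7, 14))  # ~2.8 GiB: fewer concurrent requests
```

Lowering utilization also leaves headroom for anything else on the GPU (a desktop session, monitoring, a second model).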

Fix 2: Use quantization

Load the model in lower precision:

# AWQ quantization (4-bit, good quality)
vllm serve TheBloke/model-AWQ --quantization awq

# GPTQ quantization (4-bit)
vllm serve TheBloke/model-GPTQ --quantization gptq

# BitsAndBytes (8-bit or 4-bit)
vllm serve model-name --quantization bitsandbytes --load-format bitsandbytes

Memory savings:

Precision          VRAM for 7B model   VRAM for 70B model
FP16               ~14 GB              ~140 GB
INT8               ~7 GB               ~70 GB
INT4 (AWQ/GPTQ)    ~4 GB               ~40 GB

For most serving use cases, AWQ quantization gives the best quality-per-VRAM ratio.
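The table follows from a simple rule of thumb: one billion parameters costs about one GB per byte of precision. A quick sketch (quantized formats carry extra overhead for scales and zero points, which is why real INT4 70B checkpoints land closer to ~40 GB than the raw arithmetic suggests):

```python
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough VRAM for the weights alone (no KV cache, no activations)."""
    return params_billion * bits_per_param / 8  # 1B params * 1 byte ~= 1 GB

print(weight_vram_gb(7, 16))   # 14.0 -> FP16 7B
print(weight_vram_gb(70, 8))   # 70.0 -> INT8 70B
print(weight_vram_gb(70, 4))   # 35.0 -> INT4 70B, before quantization overhead
```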

Fix 3: Tensor parallelism (multi-GPU)

Split the model across multiple GPUs:

# 2 GPUs
vllm serve meta-llama/Llama-4-Maverick --tensor-parallel-size 2

# 4 GPUs
vllm serve meta-llama/Llama-4-Maverick --tensor-parallel-size 4

Each GPU holds a slice of the model, so weight VRAM per GPU is roughly total model size / number of GPUs (plus some per-GPU overhead for activations and communication buffers).
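As a quick sanity check on that division (weights only; each GPU also needs room for its share of the KV cache):

```python
def per_gpu_weights_gb(total_weights_gb: float, tp_size: int) -> float:
    """Approximate weight memory per GPU under tensor parallelism."""
    return total_weights_gb / tp_size

print(per_gpu_weights_gb(140, 2))  # 70.0 -- a 70B FP16 model split across 2 GPUs
print(per_gpu_weights_gb(140, 4))  # 35.0 -- fits 40-48 GB cards with headroom
```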

Fix 4: Reduce max model length

The KV cache scales with context length. Reduce it if you don’t need the full context:

# Default might be 32768 or higher
vllm serve model-name --max-model-len 4096

# For chat applications where context is short
vllm serve model-name --max-model-len 2048

This is the same principle as reducing Ollama’s context window — less context = less memory.
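The scaling is easy to quantify: per token, the KV cache stores one key and one value vector for every layer and every KV head. Using a Llama-2-7B-like shape as a hypothetical example (32 layers, 32 KV heads, head dimension 128, FP16):

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache cost per token: key + value, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(32, 32, 128)  # FP16, no grouped-query attention
print(per_token / 1024**2)           # 0.5 MiB per token
print(32768 * per_token / 1024**3)   # 16.0 GiB for one full 32k-token sequence
print(4096 * per_token / 1024**3)    # 2.0 GiB at 4k -- an 8x reduction
```

Models with grouped-query attention have fewer KV heads and pay proportionally less per token.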

Fix 5: Limit concurrent requests

More concurrent requests = more KV cache entries = more VRAM:

# Limit max concurrent sequences
vllm serve model-name --max-num-seqs 8  # Default is often 256

# Limit max batched tokens
vllm serve model-name --max-num-batched-tokens 4096

This reduces throughput but prevents OOM under load.
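Combining a per-token KV cost with the sequence limit shows the worst-case bound. vLLM's PagedAttention allocates KV blocks on demand, so this is an upper bound rather than a preallocation, and the model shape here is hypothetical:

```python
def max_kv_cache_gib(max_num_seqs: int, max_model_len: int,
                     bytes_per_token: int) -> float:
    """Worst case: every concurrent sequence grows to max_model_len."""
    return max_num_seqs * max_model_len * bytes_per_token / 1024**3

bpt = 2 * 32 * 32 * 128 * 2              # ~0.5 MiB/token, Llama-2-7B-like, FP16
print(max_kv_cache_gib(256, 4096, bpt))  # 512.0 GiB -- the high default can't all fit
print(max_kv_cache_gib(8, 4096, bpt))    # 16.0 GiB -- plausible next to ~14 GiB of weights
```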

Fix 6: Use swap space (last resort)

If you’re slightly over VRAM, give vLLM CPU swap space so preempted requests can move their KV cache out of VRAM:

vllm serve model-name --swap-space 4  # 4 GB of CPU RAM for overflow

Swapped requests run significantly slower, but the server stays up instead of crashing.

Production configuration example

A balanced config for serving a 70B-class model on 2x A100 80GB (note that --quantization awq assumes the checkpoint itself is AWQ-quantized):

vllm serve meta-llama/Llama-4-Maverick \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --max-num-seqs 32 \
  --quantization awq
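A quick arithmetic check on why this config fits, using assumed numbers: an AWQ 70B checkpoint at roughly 40 GB of weights, per the table in Fix 2:

```python
total_vram = 2 * 80             # 2x A100 80GB
reserved = 0.85 * total_vram    # --gpu-memory-utilization 0.85
weights = 40                    # ~AWQ/INT4 70B weights, split across both GPUs
kv_budget = reserved - weights

print(reserved)   # 136.0 GB reserved by vLLM
print(kv_budget)  # 96.0 GB left for the KV cache across both GPUs
```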

When vLLM isn’t the right tool

If you’re running into constant OOM issues, consider:

  • Ollama — simpler, handles quantization automatically, better for single-user setups
  • llama.cpp — more memory-efficient for CPU inference, supports GGUF quantization
  • Cloud APIs — OpenRouter or direct provider APIs if local serving isn’t worth the hardware cost

See our Ollama vs llama.cpp vs vLLM comparison for when to use each.

Related: Ollama vs llama.cpp vs vLLM · Ollama Out of Memory Fix · How Much VRAM for AI Models · Best GPU for AI Locally · Best Cloud GPU Providers · Serve LLMs with vLLM