You started a vLLM server and got:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.65 GiB of which 1.24 GiB is free.
vLLM needs GPU memory for three things: model weights, the KV cache (the attention keys and values stored for every token in context), and activation memory. When their combined size exceeds your VRAM, you get OOM errors. Here's how to fix it.
Fix 1: Reduce GPU memory utilization
vLLM tries to use 90% of GPU memory by default. Lower it:
# Default: 0.9 (90% of VRAM)
# Reduce to give headroom
vllm serve meta-llama/Llama-4-Scout --gpu-memory-utilization 0.8
# For tight VRAM situations
vllm serve meta-llama/Llama-4-Scout --gpu-memory-utilization 0.7
This shrinks the memory vLLM reserves for the KV cache, so fewer requests can run concurrently, but it leaves headroom for other processes and allocation spikes on the same GPU.
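To see why lowering the flag shrinks the KV cache, here is a rough sketch of the budget arithmetic. It assumes vLLM caps its total usage at utilization x VRAM and hands the KV cache whatever remains after weights and peak activations. The function name and the example figures (14 GiB of weights, 2 GiB of activations) are illustrative assumptions, not measurements.

```python
def kv_cache_budget_gib(total_vram_gib, utilization, weights_gib, activation_gib):
    """Rough KV cache budget: capped VRAM minus weights and peak activations."""
    capped = total_vram_gib * utilization
    return capped - weights_gib - activation_gib


# Hypothetical 24 GiB card with ~14 GiB of weights and ~2 GiB of peak activations.
for util in (0.9, 0.8, 0.7):
    budget = kv_cache_budget_gib(24, util, 14, 2)
    print(f"utilization={util}: ~{budget:.1f} GiB left for KV cache")
```

A result at or below zero would mean the model cannot start at that setting, so the flag only helps when something else on the GPU needs the freed memory.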
Fix 2: Use quantization
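Serve the model in lower precision. AWQ and GPTQ require a checkpoint that was already quantized in that format, while BitsAndBytes can quantize the weights at load time: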
# AWQ quantization (4-bit, good quality)
vllm serve TheBloke/model-AWQ --quantization awq
# GPTQ quantization (4-bit)
vllm serve TheBloke/model-GPTQ --quantization gptq
# BitsAndBytes (8-bit or 4-bit)
vllm serve model-name --quantization bitsandbytes --load-format bitsandbytes
Memory savings:
| Precision | VRAM for 7B model | VRAM for 70B model |
|---|---|---|
| FP16 | ~14 GB | ~140 GB |
| INT8 | ~7 GB | ~70 GB |
| INT4 (AWQ/GPTQ) | ~4 GB | ~40 GB |
For most serving use cases, AWQ quantization gives the best quality-per-VRAM ratio.
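The figures in the table follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming 1 GB = 1e9 bytes and ignoring quantization metadata (scales and zero points), which is why real INT4 checkpoints come out slightly larger than the raw estimate:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params_billions, precision):
    """Approximate weight size in GB, ignoring quantization overhead."""
    return num_params_billions * BYTES_PER_PARAM[precision]


for size in (7, 70):
    for precision in ("fp16", "int8", "int4"):
        gb = weight_memory_gb(size, precision)
        print(f"{size}B @ {precision}: ~{gb:.1f} GB")
```

This covers weights only; the KV cache and activations come on top.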
Fix 3: Tensor parallelism (multi-GPU)
Split the model across multiple GPUs:
# 2 GPUs
vllm serve meta-llama/Llama-4-Maverick --tensor-parallel-size 2
# 4 GPUs
vllm serve meta-llama/Llama-4-Maverick --tensor-parallel-size 4
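Each GPU holds a shard of the weights, so weight memory per GPU is roughly total model size / number of GPUs. Every GPU still needs its own room for KV cache and activations, and the tensor-parallel size has to divide the model's attention head count evenly.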
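As a rough illustration of the per-GPU split, the sketch below estimates how much of each GPU's capped memory is left after its weight shard is loaded. The function name is hypothetical, and the 140 GB figure is the FP16 70B estimate from the quantization table; activation memory and communication buffers are ignored.

```python
def per_gpu_headroom_gb(total_weights_gb, tensor_parallel_size, gpu_vram_gb,
                        utilization=0.9):
    """Capped VRAM per GPU minus that GPU's share of the weights."""
    weights_per_gpu = total_weights_gb / tensor_parallel_size
    return gpu_vram_gb * utilization - weights_per_gpu


# Hypothetical: ~140 GB of FP16 weights on 80 GB GPUs.
for tp in (2, 4):
    headroom = per_gpu_headroom_gb(140, tp, 80)
    print(f"tensor-parallel-size={tp}: ~{headroom:.0f} GB per GPU left for KV cache")
```

With these numbers, two GPUs barely hold the weights, while four leave a usable KV cache budget on each card.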
Fix 4: Reduce max model length
The KV cache scales with context length. Reduce it if you don't need the full context:
# Default might be 32768 or higher
vllm serve model-name --max-model-len 4096
# For chat applications where context is short
vllm serve model-name --max-model-len 2048
This is the same principle as reducing Ollama's context window: less context means less memory.
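The scaling can be made concrete with the standard per-token KV cache estimate: one key and one value vector per layer and KV head. The model shape below (32 layers, 32 KV heads, head dimension 128, FP16 cache) is a hypothetical 7B-class configuration chosen for illustration; check your model's config for the real values.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache for one token: a key and a value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(context_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache for a single sequence filled to context_len tokens, in GiB."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    return context_len * per_token / 2**30


# Hypothetical 7B-class shape; values are assumptions, not read from a real model.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)
for context_len in (32768, 4096, 2048):
    gib = kv_cache_gib(context_len, **shape)
    print(f"max-model-len={context_len}: ~{gib:.0f} GiB per full-length sequence")
```

Models with grouped-query attention have fewer KV heads than attention heads, which lowers the per-token cost accordingly.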
Fix 5: Limit concurrent requests
Each running sequence holds KV cache blocks, and larger batches need more activation memory, so capping both lowers peak VRAM:
# Limit max concurrent sequences
vllm serve model-name --max-num-seqs 8 # Default is often 256
# Limit max batched tokens
vllm serve model-name --max-num-batched-tokens 4096
This reduces throughput but prevents OOM under load.
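To get a feel for how concurrency drives KV cache demand, the sketch below computes the worst case in which every sequence grows to the full context length. The 0.5 MiB per token figure is an assumed value for a hypothetical model. vLLM sizes its KV cache pool at startup and queues or preempts requests when the pool fills, so this is an upper bound on demand, not what gets allocated.

```python
def worst_case_kv_gib(max_num_seqs, max_model_len, kv_bytes_per_token):
    """Upper bound on KV cache demand if every sequence reaches full length."""
    return max_num_seqs * max_model_len * kv_bytes_per_token / 2**30


# Hypothetical model that needs 0.5 MiB of KV cache per token.
KV_BYTES_PER_TOKEN = 512 * 1024

for seqs in (256, 32, 8):
    demand = worst_case_kv_gib(seqs, 4096, KV_BYTES_PER_TOKEN)
    print(f"max-num-seqs={seqs}: up to ~{demand:.0f} GiB of KV cache at 4096 tokens each")
```

Comparing this bound with the KV cache budget you actually have shows whether a given --max-num-seqs is realistic or will mostly lead to queued and preempted requests.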
Fix 6: Use swap space (last resort)
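If you're only slightly over VRAM, let vLLM swap the KV cache of preempted requests out to CPU RAM:
vllm serve model-name --swap-space 4 # GiB of CPU RAM per GPU for swapped KV cache
Swapped requests are significantly slower, and this only relieves KV cache pressure; it does not help if the weights themselves don't fit.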
Production configuration example
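A balanced starting point for serving a 70B-class model on 2x A100 80GB. Because --quantization awq expects pre-quantized weights, model-name here stands for an AWQ checkpoint of your model:
vllm serve model-name \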
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.85 \
--max-model-len 8192 \
--max-num-seqs 32 \
--quantization awq
When vLLM isn't the right tool
If you're running into constant OOM issues, consider:
- Ollama: simpler, handles quantization automatically, better for single-user setups
- llama.cpp: more memory-efficient for CPU inference, supports GGUF quantization
- Cloud APIs: OpenRouter or direct provider APIs if local serving isn't worth the hardware cost
See our Ollama vs llama.cpp vs vLLM comparison for when to use each.