You started a vLLM server and got:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.65 GiB of which 1.24 GiB is free.
vLLM needs GPU memory for three things: model weights, the KV cache (attention keys and values for every token of every in-flight request), and activation memory. When the total exceeds your VRAM, you get OOM errors. Here’s how to fix it.
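To see roughly where the memory goes, here is a back-of-envelope estimate. This is a sketch, not vLLM’s internal accounting; the model dimensions (32 layers, 32 KV heads, head dim 128 for a 7B-class model) are assumed for illustration:

```python
def weights_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Weight memory for a dense model (FP16/BF16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 2**30

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache cost per token: keys + values, summed over all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# A 7B-class model with assumed dimensions
print(f"weights: {weights_gib(7):.1f} GiB")              # ~13 GiB in FP16
per_tok = kv_bytes_per_token(32, 32, 128)
print(f"KV cache: {per_tok / 2**20:.2f} MiB per token")  # ~0.5 MiB/token
print(f"KV cache at 8192 tokens: {per_tok * 8192 / 2**30:.1f} GiB")
```

Whatever your utilization budget leaves after weights is what vLLM can spend on KV cache, which is why every fix below attacks one of those two terms.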
Fix 1: Reduce GPU memory utilization
vLLM tries to use 90% of GPU memory by default. Lower it:
# Default: 0.9 (90% of VRAM)
# Reduce to give headroom
vllm serve meta-llama/Llama-4-Scout --gpu-memory-utilization 0.8
# For tight VRAM situations
vllm serve meta-llama/Llama-4-Scout --gpu-memory-utilization 0.7
This reduces the KV cache size, which means fewer concurrent requests but no OOM errors.
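The trade-off is easy to quantify. With assumed numbers (a 24 GiB card and ~13 GiB of FP16 weights for a 7B model), the KV cache gets whatever is left of the utilization budget:

```python
total_vram_gib = 24.0
weights_gib = 13.0  # assumed: FP16 7B model

for util in (0.9, 0.8, 0.7):
    kv_budget = total_vram_gib * util - weights_gib
    print(f"--gpu-memory-utilization {util}: ~{kv_budget:.1f} GiB left for KV cache")
```

At 0.7 the KV budget shrinks to a few GiB, so expect noticeably fewer concurrent requests.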
Fix 2: Use quantization
Load the model in lower precision:
# AWQ quantization (4-bit, good quality)
vllm serve TheBloke/model-AWQ --quantization awq
# GPTQ quantization (4-bit)
vllm serve TheBloke/model-GPTQ --quantization gptq
# BitsAndBytes (8-bit or 4-bit)
vllm serve model-name --quantization bitsandbytes --load-format bitsandbytes
Memory savings:
| Precision | VRAM for 7B model | VRAM for 70B model |
|---|---|---|
| FP16 | ~14 GB | ~140 GB |
| INT8 | ~7 GB | ~70 GB |
| INT4 (AWQ/GPTQ) | ~4 GB | ~40 GB |
For most serving use cases, AWQ quantization gives the best quality-per-VRAM ratio.
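The table values follow directly from bits per parameter. This sketch counts weights only; real checkpoints add overhead for embeddings, quantization scales, and layers left in higher precision, which is why the table’s numbers run slightly higher:

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Weight-only memory in decimal GB for a given precision."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits, name in ((16, "FP16"), (8, "INT8"), (4, "INT4")):
    print(f"{name}: 7B = {weight_gb(7, bits):.1f} GB, 70B = {weight_gb(70, bits):.1f} GB")
```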
Fix 3: Tensor parallelism (multi-GPU)
Split the model across multiple GPUs:
# 2 GPUs
vllm serve meta-llama/Llama-4-Maverick --tensor-parallel-size 2
# 4 GPUs
vllm serve meta-llama/Llama-4-Maverick --tensor-parallel-size 4
Each GPU holds a shard of the weights, so weight memory per GPU ≈ total model size / number of GPUs. Activations and the (also sharded) KV cache add some per-GPU overhead on top.
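As a quick sanity check (assumed sizes, weights only):

```python
def per_gpu_weights_gb(model_gb: float, tp: int) -> float:
    # Tensor parallelism shards each weight matrix across GPUs, so
    # weight memory divides roughly evenly; KV cache and activations
    # add per-GPU overhead not counted here.
    return model_gb / tp

model_gb = 140.0  # assumed: 70B model in FP16
for tp in (1, 2, 4):
    print(f"TP={tp}: ~{per_gpu_weights_gb(model_gb, tp):.0f} GB of weights per GPU")
```

So a 140 GB FP16 model that cannot fit on one 80 GB card lands at ~70 GB per GPU with tensor-parallel-size 2, leaving a little room for KV cache.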
Fix 4: Reduce max model length
The KV cache scales with context length. Reduce it if you don’t need the full context:
# vLLM defaults to the model's full context window (often 32768 or higher)
vllm serve model-name --max-model-len 4096
# For chat applications where context is short
vllm serve model-name --max-model-len 2048
This is the same principle as reducing Ollama’s context window — less context = less memory.
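Per-sequence KV cache scales linearly with --max-model-len. Using the ~0.5 MiB/token figure typical of a 7B-class FP16 model (an assumed value, not measured):

```python
kv_mib_per_token = 0.5  # assumed: typical for a 7B-class FP16 model

for max_len in (32768, 8192, 4096, 2048):
    gib = max_len * kv_mib_per_token / 1024
    print(f"--max-model-len {max_len}: up to {gib:.1f} GiB of KV cache per sequence")
```

Dropping from 32768 to 4096 cuts the worst-case per-sequence KV footprint by 8x.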
Fix 5: Limit concurrent requests
More concurrent requests = more KV cache entries = more VRAM:
# Limit max concurrent sequences
vllm serve model-name --max-num-seqs 8 # Default is often 256
# Limit max batched tokens
vllm serve model-name --max-num-batched-tokens 4096
This reduces throughput but prevents OOM under load.
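You can size --max-num-seqs from your KV budget. A rough sketch with assumed values (6 GiB of KV budget left after weights, ~0.5 MiB/token, 2048-token typical context):

```python
kv_budget_gib = 6.0      # assumed: VRAM left for KV cache after weights
kv_mib_per_token = 0.5   # assumed: per-token cost for a 7B-class model
typical_context = 2048   # assumed: typical request length in tokens

per_seq_gib = typical_context * kv_mib_per_token / 1024
max_seqs = int(kv_budget_gib / per_seq_gib)
print(f"~{per_seq_gib:.1f} GiB per sequence -> about {max_seqs} concurrent sequences")
```

In practice vLLM's paged KV cache only allocates blocks as tokens are generated, so this worst-case estimate is conservative, but it gives a sane starting point for the flag.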
Fix 6: Use swap space (last resort)
If KV cache pressure under load is the problem, vLLM can swap preempted requests’ KV cache to CPU RAM instead of recomputing it:
vllm serve model-name --swap-space 8 # GiB of CPU RAM per GPU (default is 4)
Swapped requests stall while their cache moves over PCIe, so this path is significantly slower, but it prevents crashes under bursty load. It won’t help if the weights alone don’t fit.
Production configuration example
A balanced config for serving a 70B-class model on 2x A100 80GB (note that --quantization awq requires an AWQ-quantized checkpoint, so point vLLM at an AWQ variant of the model):
vllm serve meta-llama/Llama-4-Maverick \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.85 \
--max-model-len 8192 \
--max-num-seqs 32 \
--quantization awq
When vLLM isn’t the right tool
If you’re running into constant OOM issues, consider:
- Ollama — simpler, handles quantization automatically, better for single-user setups
- llama.cpp — more memory-efficient for CPU inference, supports GGUF quantization
- Cloud APIs — OpenRouter or direct provider APIs if local serving isn’t worth the hardware cost
See our Ollama vs llama.cpp vs vLLM comparison for when to use each.
Related: Ollama vs llama.cpp vs vLLM · Ollama Out of Memory Fix · How Much VRAM for AI Models · Best GPU for AI Locally · Best Cloud GPU Providers · Serve LLMs with vLLM