Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
You pulled a model, ran `ollama run`, and got:
```
Error: model requires more system memory (14.2 GiB) than is available (7.8 GiB)
```
Or the GPU variant:
```
CUDA out of memory. Tried to allocate 2.00 GiB
```
This means the model is too large for your available RAM or VRAM. Here are 5 fixes, from quickest to most involved.
Fix 1: Use a smaller quantization
The fastest fix. Quantized models use less memory with minimal quality loss:
```shell
# Instead of the full model
ollama pull llama4-scout        # ~60 GB

# Use a quantized version
ollama pull llama4-scout:q4_k_m # ~25 GB
ollama pull llama4-scout:q3_k_m # ~20 GB
```
Quantization guide:
| Quantization | Size reduction | Quality loss | When to use |
|---|---|---|---|
| Q8_0 | ~50% | Negligible | You have enough RAM |
| Q5_K_M | ~65% | Very small | Recommended default |
| Q4_K_M | ~75% | Small | Tight on RAM |
| Q3_K_M | ~80% | Noticeable | Last resort |
| Q2_K | ~85% | Significant | Don’t use for coding |
For coding tasks, don’t go below Q4_K_M — the quality drop affects code generation accuracy. See our VRAM guide for exact requirements per model.
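As a back-of-the-envelope check before pulling, model size is roughly parameters × bits-per-weight ÷ 8. The effective bits-per-weight figures below are rough assumptions (K-quants keep some tensors at higher precision), not exact GGUF numbers:

```python
# Rough size estimate for a quantized model.
# Effective bits-per-weight values are approximations, not exact GGUF figures.
EFFECTIVE_BITS = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q5_k_m": 5.7,
    "q4_k_m": 4.8,
    "q3_k_m": 3.9,
    "q2_k": 3.4,
}

def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Approximate model size in GiB: parameters x effective bits / 8."""
    bits = EFFECTIVE_BITS[quant.lower()]
    return params_billion * 1e9 * bits / 8 / 2**30

if __name__ == "__main__":
    for q in ("q8_0", "q5_k_m", "q4_k_m"):
        print(f"8B at {q}: ~{estimate_size_gb(8, q):.1f} GiB")
```

Add roughly 20-50% on top of the weights for the KV cache and runtime overhead when comparing against your free RAM.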
Fix 2: Reduce context window
Ollama allocates memory for the full context window upfront. Reducing it frees significant RAM:
```shell
# Default context is often 4096 or 8192.
# Reduce it for simple tasks inside the interactive session
# (ollama run has no --num-ctx flag):
ollama run llama3.2
>>> /set parameter num_ctx 2048

# Or set it permanently in a Modelfile
cat > Modelfile << 'EOF'
FROM qwen3.5:27b-q4_k_m
PARAMETER num_ctx 2048
EOF
ollama create qwen-small-ctx -f Modelfile
ollama run qwen-small-ctx
```
Context window memory usage scales linearly. Halving the context roughly halves the memory overhead beyond the model weights.
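The linear scaling comes from the KV cache: each token stores one key and one value vector per layer. A sketch of the arithmetic, using an assumed Llama-3-8B-style shape (32 layers, 8 KV heads, head dimension 128, fp16 cache):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: a key and a value vector per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed Llama-3-8B-style shape: 32 layers, 8 KV heads, head dim 128, fp16
full = kv_cache_bytes(32, 8, 128, 8192)
half = kv_cache_bytes(32, 8, 128, 4096)
print(full / 2**30)  # 1.0 (GiB at 8192 context)
print(half / 2**30)  # 0.5 (GiB at 4096 context)
```

So dropping this model from 8192 to 2048 context would free roughly 768 MiB on top of the weights.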
Fix 3: Use a smaller model
If quantization isn’t enough, switch to a smaller model:
| Your RAM | Best coding model | Command |
|---|---|---|
| 4 GB | Qwen3 1.7B | `ollama pull qwen3:1.7b` |
| 8 GB | Qwen3 8B or Phi-4 | `ollama pull qwen3:8b` |
| 16 GB | DeepSeek R1 14B | `ollama pull deepseek-r1:14b` |
| 32 GB | Qwen 3.5 27B | `ollama pull qwen3.5:27b` |
| 64 GB | Llama 4 Scout | `ollama pull llama4-scout` |
See our best models by RAM and best models under 16GB VRAM guides for detailed recommendations.
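If you script your machine setup, the table above reduces to a simple lookup. A minimal sketch (the thresholds and tags just mirror the table; adjust them for your own shortlist):

```python
# RAM thresholds (GiB) mapped to the table's suggested model tags.
RECOMMENDATIONS = [
    (4, "qwen3:1.7b"),
    (8, "qwen3:8b"),
    (16, "deepseek-r1:14b"),
    (32, "qwen3.5:27b"),
    (64, "llama4-scout"),
]

def pick_model(ram_gib: float) -> str:
    """Return the largest recommended model whose threshold fits the RAM."""
    choice = RECOMMENDATIONS[0][1]  # smallest model as the floor
    for threshold, model in RECOMMENDATIONS:
        if ram_gib >= threshold:
            choice = model
    return choice

print(pick_model(16))  # deepseek-r1:14b
```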
Fix 4: Free up memory
Other processes might be using your RAM/VRAM:
```shell
# Check what's using GPU memory
nvidia-smi

# Check system RAM usage
free -h   # Linux
vm_stat   # macOS

# Unload other Ollama models (only one loads at a time by default)
ollama stop llama3.2

# On macOS, also close memory-hungry apps (Chrome, Docker, VS Code)
```
Ollama keeps the last-used model in memory. If you ran a large model earlier, it might still be loaded:
```shell
# List running models
ollama ps

# Stop each loaded model (ollama ps has no -q flag, so pull
# the model names out of the first column)
ollama ps | awk 'NR>1 {print $1}' | xargs -n1 ollama stop
```
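The same information is available programmatically: Ollama's REST API exposes a `GET /api/ps` endpoint listing loaded models. A small sketch that parses the response (the sample payload is illustrative, abridged from the API docs):

```python
import json
import urllib.request

def loaded_models(ps_json: dict) -> list[str]:
    """Extract model names from an Ollama /api/ps response body."""
    return [m["name"] for m in ps_json.get("models", [])]

def fetch_ps(host: str = "http://localhost:11434") -> dict:
    # Requires a running Ollama server on the default port.
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        return json.load(resp)

# Illustrative response shape, abridged from the API docs:
sample = {"models": [{"name": "llama3.2:latest", "size": 4920000000}]}
print(loaded_models(sample))  # ['llama3.2:latest']
```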
Fix 5: Enable GPU/CPU split (partial offloading)
If your GPU doesn’t have enough VRAM for the full model, offload some layers to CPU RAM:
```shell
# Limit the number of layers offloaded to the GPU (lower = more on CPU)
# via the num_gpu parameter, inside an interactive session:
ollama run qwen3.5:27b
>>> /set parameter num_gpu 20

# Or persist it in a Modelfile:
# PARAMETER num_gpu 20
```
This is slower than full GPU but faster than full CPU. Experiment with the number — start at half the model’s layers and adjust.
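To pick a starting value more precisely than "half the layers", you can estimate how many layers fit in free VRAM. A rough sketch, assuming roughly uniform layer size and reserving headroom for the KV cache and CUDA overhead (this is not Ollama's own placement logic):

```python
import math

def gpu_layers(model_gib: float, n_layers: int, free_vram_gib: float,
               reserve_gib: float = 1.0) -> int:
    """Estimate how many transformer layers fit in free VRAM.

    Assumes layers are roughly equal in size and keeps reserve_gib
    free for the KV cache and driver overhead. A rough sketch only.
    """
    per_layer = model_gib / n_layers
    usable = max(free_vram_gib - reserve_gib, 0.0)
    return min(n_layers, math.floor(usable / per_layer))

# A ~25 GiB model with 64 layers and 12 GiB of free VRAM:
print(gpu_layers(25, 64, 12))  # 28
```

Use the result as the starting `num_gpu` value, then nudge it down if you still hit out-of-memory errors.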
Docker-specific fix
If running Ollama in Docker, the container might have a memory limit:
```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        limits:
          memory: 32G  # Increase this
    # For GPU access
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```
Still not working?
If none of these fixes help, the model genuinely doesn’t fit on your hardware. Options:
- Use a cloud GPU — RunPod or Vultr for on-demand GPU access
- Use an API instead — OpenRouter gives you access to any model without local hardware
- Upgrade RAM — if you’re on a Mac, the M-series unified memory is the most cost-effective way to run large models locally
Related: Ollama Complete Guide · Ollama Troubleshooting Guide · How Much VRAM for AI Models · Best AI Models Under 4GB RAM · Best AI Models Under 16GB VRAM · Best GPU for AI Locally · How to Run AI Without GPU