Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
You pulled a model, ran ollama run, and got:
Error: model requires more system memory (14.2 GiB) than is available (7.8 GiB)
Or the GPU variant:
CUDA out of memory. Tried to allocate 2.00 GiB
This means the model is too large for your available RAM or VRAM. Here are 5 fixes, from quickest to most involved.
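To see how far off you are, subtract the two figures in the error message. The sketch below assumes the message keeps the exact format shown above; `mem_shortfall` is a hypothetical helper name, not part of Ollama.

```shell
# mem_shortfall: hypothetical helper. Reads an Ollama memory error and
# prints how many GiB are missing (required minus available).
# Assumes the message format shown above.
mem_shortfall() {
  printf '%s\n' "$1" | awk '
    {
      n = 0
      # Collect every number that sits directly before a "GiB" token.
      for (i = 1; i < NF; i++) {
        if ($(i + 1) ~ /^GiB/) {
          v = $i
          gsub(/[^0-9.]/, "", v)   # drop the surrounding parenthesis
          vals[++n] = v
        }
      }
      if (n >= 2) printf "%.1f\n", vals[1] - vals[2]
    }'
}

mem_shortfall "Error: model requires more system memory (14.2 GiB) than is available (7.8 GiB)"   # 6.4
```

A gap of a gigabyte or two can often be closed by a shorter context or by freeing memory. A gap this large usually calls for a smaller quantization or a smaller model.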
Fix 1: Use a smaller quantization
The fastest fix. Quantized models use less memory with minimal quality loss:
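# Tag names and sizes here are illustrative; check the model's Tags page on ollama.com for the real ones
# Instead of the full model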
ollama pull llama4-scout # ~60 GB
# Use a quantized version
ollama pull llama4-scout:q4_k_m # ~25 GB
ollama pull llama4-scout:q3_k_m # ~20 GB
Quantization guide:
| Quantization | Size reduction (vs. FP16) | Quality loss | When to use |
|---|---|---|---|
| Q8_0 | ~50% | Negligible | You have enough RAM |
| Q5_K_M | ~65% | Very small | Recommended default |
| Q4_K_M | ~75% | Small | Tight on RAM |
| Q3_K_M | ~80% | Noticeable | Last resort |
| Q2_K | ~85% | Significant | Don't use for coding |
For coding tasks, don't go below Q4_K_M: the quality drop affects code generation accuracy. See our VRAM guide for exact requirements per model.
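The percentages in the table can be turned into a rough size estimate before you download anything. This sketch assumes the reductions are relative to FP16 weights (2 bytes per parameter) and uses the table's approximate figures; `est_size_gb` is a hypothetical helper, and real files will differ by a few GB.

```shell
# est_size_gb: hypothetical helper. Estimates weight size in GB from the
# parameter count (in billions) and a quantization level, using the
# approximate reductions from the table above.
# Assumption: the baseline is FP16, i.e. 2 bytes per parameter.
est_size_gb() {
  awk -v params="$1" -v quant="$2" 'BEGIN {
    r["Q8_0"] = 0.50; r["Q5_K_M"] = 0.65; r["Q4_K_M"] = 0.75
    r["Q3_K_M"] = 0.80; r["Q2_K"] = 0.85
    if (!(quant in r)) {
      print "unknown quantization: " quant > "/dev/stderr"
      exit 1
    }
    printf "%.1f\n", params * 2 * (1 - r[quant])
  }'
}

est_size_gb 27 Q4_K_M   # 13.5
est_size_gb 8 Q8_0      # 8.0
```

Add the context memory and some headroom for the operating system on top of this number before comparing it with your RAM or VRAM.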
Fix 2: Reduce context window
Ollama allocates memory for the full context window upfront. Reducing it frees significant RAM:
# The default context length depends on your Ollama version and settings
# Reduce to 2048 for simple tasks, from inside the interactive prompt
ollama run llama3.2
>>> /set parameter num_ctx 2048
# Or set in a Modelfile
cat > Modelfile << 'EOF'
FROM qwen3.5:27b-q4_k_m
PARAMETER num_ctx 2048
EOF
ollama create qwen-small-ctx -f Modelfile
ollama run qwen-small-ctx
The memory reserved for the context (the KV cache) scales linearly with its length, so halving the context roughly halves the memory needed on top of the model weights.
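The linear scaling follows from how the KV cache is sized: one key and one value vector per layer, per KV head, per token. A minimal sketch of that arithmetic is below. `kv_cache_mib` is a hypothetical helper, the architecture numbers in the calls are made-up examples rather than any specific model, and it assumes 2-byte (FP16) cache entries.

```shell
# kv_cache_mib: hypothetical helper. Approximate KV cache size in MiB:
#   2 (keys and values) x layers x KV heads x head dim x context x bytes per value
# Arguments: layers kv_heads head_dim context_length bytes_per_value
kv_cache_mib() {
  layers=$1; kv_heads=$2; head_dim=$3; ctx=$4; bytes=$5
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 ))
}

# Example architecture (illustrative numbers only): 32 layers, 8 KV heads, head dim 128
kv_cache_mib 32 8 128 8192 2   # 1024
kv_cache_mib 32 8 128 4096 2   # 512
kv_cache_mib 32 8 128 2048 2   # 256
```

Models with more layers or more KV heads pay proportionally more per token, which is why the same context length costs far more on a large model than on a small one.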
Fix 3: Use a smaller model
If quantization isn't enough, switch to a smaller model:
| Your RAM | Best coding model | Command |
|---|---|---|
| 4 GB | Qwen3 1.7B | ollama pull qwen3:1.7b |
| 8 GB | Qwen3 8B or Phi-4 | ollama pull qwen3:8b |
| 16 GB | DeepSeek R1 14B | ollama pull deepseek-r1:14b |
| 32 GB | Qwen 3.5 27B | ollama pull qwen3.5:27b |
| 64 GB | Llama 4 Scout | ollama pull llama4-scout |
See our best models by RAM and best models under 16GB VRAM guides for detailed recommendations.
Fix 4: Free up memory
Other processes might be using your RAM/VRAM:
# Check what's using GPU memory
nvidia-smi
# Check system RAM usage
free -h # Linux
vm_stat # macOS
# Unload a model you no longer need (Ollama can keep more than one model loaded)
ollama stop llama3.2
# On macOS, close memory-hungry apps (Chrome, Docker, VS Code)
Ollama keeps a model in memory for a while after its last request (five minutes by default). If you ran a large model earlier, it might still be loaded:
# List running models
ollama ps
# Stop all running models (ollama stop takes one model name at a time)
for m in $(ollama ps | awk 'NR > 1 { print $1 }'); do ollama stop "$m"; done
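After freeing memory, it helps to compare what is now available with the requirement from the error message before retrying. The sketch below is Linux-only and assumes the standard `MemAvailable` line in `/proc/meminfo`; `mem_available_gib` is a hypothetical helper name.

```shell
# mem_available_gib: hypothetical helper (Linux only). Prints the
# MemAvailable value from /proc/meminfo, converted from kB to GiB.
# The optional argument points it at a saved copy of the file.
mem_available_gib() {
  awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Usage on a Linux machine:
#   mem_available_gib
```

If the printed value is still below the "requires" figure in the error, stopping processes alone won't be enough.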
Fix 5: Enable GPU/CPU split (partial offloading)
If your GPU doesn't have enough VRAM for the full model, keep only some layers on the GPU and leave the rest in CPU RAM. Ollama normally picks the split itself; the num_gpu parameter overrides it:
# Set number of GPU layers (lower = more on CPU) from inside the interactive prompt
ollama run qwen3.5:27b
>>> /set parameter num_gpu 20
# Or make it permanent by adding this line to a Modelfile
# PARAMETER num_gpu 20
This is slower than full GPU but faster than full CPU. Experiment with the number: start at half the model's layers and adjust.
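To check how a loaded model is actually split, look at the PROCESSOR column of `ollama ps`. The sketch below extracts that column; `gpu_share` is a hypothetical helper, and it assumes the column layout NAME, ID, SIZE, PROCESSOR, UNTIL with SIZE printed as two fields (such as "17 GB"). The sample row is made up for illustration.

```shell
# gpu_share: hypothetical helper. Reads `ollama ps` output on stdin and
# prints each model name with its PROCESSOR column, for example
# "100% GPU" or "48%/52% CPU/GPU".
# Assumes the layout NAME ID SIZE PROCESSOR UNTIL, with SIZE as two fields.
gpu_share() {
  awk 'NR > 1 { print $1 ": " $5 " " $6 }'
}

# Sample input in the assumed layout (not real output)
printf '%s\n' \
  'NAME           ID              SIZE     PROCESSOR          UNTIL' \
  'qwen3.5:27b    0123456789ab    17 GB    48%/52% CPU/GPU    4 minutes from now' | gpu_share

# Usage with a running server:
#   ollama ps | gpu_share
```

A result of "100% GPU" means the whole model fits in VRAM. A split such as "48%/52% CPU/GPU" confirms that partial offloading is in effect.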
Docker-specific fix
If running Ollama in Docker, the container might have a memory limit:
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        limits:
          memory: 32G # Increase this
    # For GPU access
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
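To confirm which limit the container actually sees, you can read it from the cgroup filesystem inside the container. The sketch below assumes cgroup v2, where the limit lives in `/sys/fs/cgroup/memory.max`; on cgroup v1 hosts the file is `memory/memory.limit_in_bytes` instead. `container_mem_limit_gib` is a hypothetical helper name.

```shell
# container_mem_limit_gib: hypothetical helper, meant to run inside the
# container. Prints the cgroup v2 memory limit in GiB, or "unlimited"
# when no limit is set. The optional argument overrides the file path.
container_mem_limit_gib() {
  limit=$(cat "${1:-/sys/fs/cgroup/memory.max}")
  if [ "$limit" = "max" ]; then
    echo "unlimited"
  else
    awk -v b="$limit" 'BEGIN { printf "%.1f\n", b / 1024 / 1024 / 1024 }'
  fi
}

# Usage inside the container:
#   container_mem_limit_gib
```

If the value is lower than the memory the error message asks for, raise the limit in the compose file and recreate the container.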
Still not working?
If none of these fixes help, the model genuinely doesn't fit on your hardware. Options:
- Use a cloud GPU: RunPod or Vultr for on-demand GPU access
- Use an API instead: OpenRouter gives you access to any model without local hardware
- Upgrade RAM: if you're on a Mac, the M-series unified memory is the most cost-effective way to run large models locally
Related: Ollama Complete Guide · Ollama Troubleshooting Guide · How Much VRAM for AI Models · Best AI Models Under 4GB RAM · Best AI Models Under 16GB VRAM · Best GPU for AI Locally · How to Run AI Without GPU