Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
You pulled a model, ran `ollama run`, and got:
```
Error: model requires more system memory (14.2 GiB) than is available (7.8 GiB)
```
Or the GPU variant:
```
CUDA out of memory. Tried to allocate 2.00 GiB
```
This means the model is too large for your available RAM or VRAM. Here are 5 fixes, from quickest to most involved.
Fix 1: Use a smaller quantization
The fastest fix. Quantized models use less memory with minimal quality loss:
```shell
# Instead of the full model
ollama pull llama4-scout        # ~60 GB

# Use a quantized version
ollama pull llama4-scout:q4_k_m # ~25 GB
ollama pull llama4-scout:q3_k_m # ~20 GB
```
Quantization guide:
| Quantization | Size reduction | Quality loss | When to use |
|---|---|---|---|
| Q8_0 | ~50% | Negligible | You have enough RAM |
| Q5_K_M | ~65% | Very small | Recommended default |
| Q4_K_M | ~75% | Small | Tight on RAM |
| Q3_K_M | ~80% | Noticeable | Last resort |
| Q2_K | ~85% | Significant | Don’t use for coding |
For coding tasks, don’t go below Q4_K_M — the quality drop affects code generation accuracy. See our VRAM guide for exact requirements per model.
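As a back-of-the-envelope check before pulling, model size is roughly parameters × bits-per-weight ÷ 8. The effective bits-per-weight figures below are rough assumptions (K-quants keep some tensors at higher precision), not exact GGUF numbers:

```python
# Rough size estimate for a quantized model.
# Effective bits-per-weight values are approximations, not exact GGUF figures.
EFFECTIVE_BITS = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q5_k_m": 5.7,
    "q4_k_m": 4.8,
    "q3_k_m": 3.9,
    "q2_k": 3.4,
}

def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Approximate model size in GiB: parameters x effective bits / 8."""
    bits = EFFECTIVE_BITS[quant.lower()]
    return params_billion * 1e9 * bits / 8 / 2**30

if __name__ == "__main__":
    for q in ("q8_0", "q5_k_m", "q4_k_m"):
        print(f"8B at {q}: ~{estimate_size_gb(8, q):.1f} GiB")
```

Add roughly 20-50% on top of the weights for the KV cache and runtime overhead when comparing against your free RAM.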
Fix 2: Reduce context window
Ollama allocates memory for the full context window upfront. Reducing it frees significant RAM:
```shell
# Default context is often 4096 or 8192.
# Reduce it for simple tasks inside the interactive session
# (ollama run has no --num-ctx flag):
ollama run llama3.2
>>> /set parameter num_ctx 2048

# Or set it permanently in a Modelfile
cat > Modelfile << 'EOF'
FROM qwen3.5:27b-q4_k_m
PARAMETER num_ctx 2048
EOF
ollama create qwen-small-ctx -f Modelfile
ollama run qwen-small-ctx
```
Context window memory usage scales linearly. Halving the context roughly halves the memory overhead beyond the model weights.
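The linear scaling comes from the KV cache: each token stores one key and one value vector per layer. A sketch of the arithmetic, using an assumed Llama-3-8B-style shape (32 layers, 8 KV heads, head dimension 128, fp16 cache):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: a key and a value vector per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed Llama-3-8B-style shape: 32 layers, 8 KV heads, head dim 128, fp16
full = kv_cache_bytes(32, 8, 128, 8192)
half = kv_cache_bytes(32, 8, 128, 4096)
print(full / 2**30)  # 1.0 (GiB at 8192 context)
print(half / 2**30)  # 0.5 (GiB at 4096 context)
```

So dropping this model from 8192 to 2048 context would free roughly 768 MiB on top of the weights.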
Fix 3: Use a smaller model
If quantization isn’t enough, switch to a smaller model:
| Your RAM | Best coding model | Command |
|---|---|---|
| 4 GB | Qwen3 1.7B | `ollama pull qwen3:1.7b` |
| 8 GB | Qwen3 8B or Phi-4 | `ollama pull qwen3:8b` |
| 16 GB | DeepSeek R1 14B | `ollama pull deepseek-r1:14b` |
| 32 GB | Qwen 3.5 27B | `ollama pull qwen3.5:27b` |
| 64 GB | Llama 4 Scout | `ollama pull llama4-scout` |
See our best models by RAM and best models under 16GB VRAM guides for detailed recommendations.
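If you script your machine setup, the table above reduces to a simple lookup. A minimal sketch (the thresholds and tags just mirror the table; adjust them for your own shortlist):

```python
# RAM thresholds (GiB) mapped to the table's suggested model tags.
RECOMMENDATIONS = [
    (4, "qwen3:1.7b"),
    (8, "qwen3:8b"),
    (16, "deepseek-r1:14b"),
    (32, "qwen3.5:27b"),
    (64, "llama4-scout"),
]

def pick_model(ram_gib: float) -> str:
    """Return the largest recommended model whose threshold fits the RAM."""
    choice = RECOMMENDATIONS[0][1]  # smallest model as the floor
    for threshold, model in RECOMMENDATIONS:
        if ram_gib >= threshold:
            choice = model
    return choice

print(pick_model(16))  # deepseek-r1:14b
```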
Fix 4: Free up memory
Other processes might be using your RAM/VRAM:
```shell
# Check what's using GPU memory
nvidia-smi

# Check system RAM usage
free -h   # Linux
vm_stat   # macOS

# Unload other Ollama models (only one loads at a time by default)
ollama stop llama3.2

# On macOS, also close memory-hungry apps (Chrome, Docker, VS Code)
```
Ollama keeps the last-used model in memory. If you ran a large model earlier, it might still be loaded:
```shell
# List running models
ollama ps

# Stop each loaded model (ollama ps has no -q flag, so pull
# the model names out of the first column)
ollama ps | awk 'NR>1 {print $1}' | xargs -n1 ollama stop
```
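The same information is available programmatically: Ollama's REST API exposes a `GET /api/ps` endpoint listing loaded models. A small sketch that parses the response (the sample payload is illustrative, abridged from the API docs):

```python
import json
import urllib.request

def loaded_models(ps_json: dict) -> list[str]:
    """Extract model names from an Ollama /api/ps response body."""
    return [m["name"] for m in ps_json.get("models", [])]

def fetch_ps(host: str = "http://localhost:11434") -> dict:
    # Requires a running Ollama server on the default port.
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        return json.load(resp)

# Illustrative response shape, abridged from the API docs:
sample = {"models": [{"name": "llama3.2:latest", "size": 4920000000}]}
print(loaded_models(sample))  # ['llama3.2:latest']
```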
Fix 5: Enable GPU/CPU split (partial offloading)
If your GPU doesn’t have enough VRAM for the full model, offload some layers to CPU RAM:
```shell
# Limit the number of layers offloaded to the GPU (lower = more on CPU)
# via the num_gpu parameter, inside an interactive session:
ollama run qwen3.5:27b
>>> /set parameter num_gpu 20

# Or persist it in a Modelfile:
# PARAMETER num_gpu 20
```
This is slower than full GPU but faster than full CPU. Experiment with the number — start at half the model’s layers and adjust.
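To pick a starting value more precisely than "half the layers", you can estimate how many layers fit in free VRAM. A rough sketch, assuming roughly uniform layer size and reserving headroom for the KV cache and CUDA overhead (this is not Ollama's own placement logic):

```python
import math

def gpu_layers(model_gib: float, n_layers: int, free_vram_gib: float,
               reserve_gib: float = 1.0) -> int:
    """Estimate how many transformer layers fit in free VRAM.

    Assumes layers are roughly equal in size and keeps reserve_gib
    free for the KV cache and driver overhead. A rough sketch only.
    """
    per_layer = model_gib / n_layers
    usable = max(free_vram_gib - reserve_gib, 0.0)
    return min(n_layers, math.floor(usable / per_layer))

# A ~25 GiB model with 64 layers and 12 GiB of free VRAM:
print(gpu_layers(25, 64, 12))  # 28
```

Use the result as the starting `num_gpu` value, then nudge it down if you still hit out-of-memory errors.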
Docker-specific fix
If running Ollama in Docker, the container might have a memory limit:
```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        limits:
          memory: 32G  # Increase this
    # For GPU access
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```
Still not working?
If none of these fixes help, the model genuinely doesn’t fit on your hardware. Options:
- Use a cloud GPU — RunPod or Vultr for on-demand GPU access
- Use an API instead — OpenRouter gives you access to any model without local hardware
- Upgrade RAM — if you’re on a Mac, the M-series unified memory is the most cost-effective way to run large models locally
Related: Ollama Complete Guide · Ollama Troubleshooting Guide · How Much VRAM for AI Models · Best AI Models Under 4GB RAM · Best AI Models Under 16GB VRAM · Best GPU for AI Locally · How to Run AI Without GPU