
Ollama Out of Memory Fix: 5 Solutions That Actually Work (2026)


Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.

You pulled a model, ran ollama run, and got:

Error: model requires more system memory (14.2 GiB) than is available (7.8 GiB)

Or the GPU variant:

CUDA out of memory. Tried to allocate 2.00 GiB

This means the model is too large for your available RAM or VRAM. Here are 5 fixes, from quickest to most involved.
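Before picking a fix, it helps to know how far off you are. The sketch below pulls the two figures out of the system-memory error and prints the gap. It assumes the message keeps the "(X GiB) ... (Y GiB)" shape quoted above; memory_shortfall_gib is a hypothetical helper name, not part of Ollama.

```shell
#!/usr/bin/env bash
# Print how many GiB short the machine is, given Ollama's error text.
# Assumes the first "N GiB" in the message is the requirement and the
# second is what is available, as in the error quoted in this article.
memory_shortfall_gib() {
  local msg="$1"
  echo "$msg" | grep -oE '[0-9]+(\.[0-9]+)? GiB' | awk '
    NR == 1 { need = $1 }
    NR == 2 { have = $1 }
    END     { printf "%.1f\n", need - have }'
}

memory_shortfall_gib "Error: model requires more system memory (14.2 GiB) than is available (7.8 GiB)"
```

A gap of a few GiB is usually something quantization or a smaller context can close; a gap larger than the RAM you have points toward a smaller model.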

Fix 1: Use a smaller quantization

The fastest fix. Quantized models use less memory with minimal quality loss:

# Instead of the full model
ollama pull llama4-scout        # ~60 GB

# Use a quantized version
# (exact tag names vary by model; check the Tags list on the model's Ollama library page)
ollama pull llama4-scout:q4_k_m  # ~25 GB
ollama pull llama4-scout:q3_k_m  # ~20 GB

Quantization guide:

Quantization   Size reduction   Quality loss   When to use
Q8_0           ~50%             Negligible     You have enough RAM
Q5_K_M         ~65%             Very small     Recommended default
Q4_K_M         ~75%             Small          Tight on RAM
Q3_K_M         ~80%             Noticeable     Last resort
Q2_K           ~85%             Significant    Don't use for coding

For coding tasks, don't go below Q4_K_M: the quality drop affects code generation accuracy. See our VRAM guide for exact requirements per model.
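To turn the size-reduction column into a rough figure for a specific model, you can work from the parameter count. This is a back-of-the-envelope sketch: it assumes the percentages are measured against 16-bit weights (2 bytes per parameter), which the table does not state, and estimate_size_gb is a hypothetical helper.

```shell
# Estimate a quantized model's weight size in GB from its parameter count.
# Assumption: "size reduction" is relative to FP16, i.e. 2 bytes per parameter.
estimate_size_gb() {
  local params_billions="$1"   # e.g. 27 for a 27B model
  local reduction_pct="$2"     # e.g. 75 for Q4_K_M, from the table
  awk -v p="$params_billions" -v r="$reduction_pct" \
    'BEGIN { printf "%.1f\n", p * 2 * (100 - r) / 100 }'
}

estimate_size_gb 27 75   # 27B model at Q4_K_M
estimate_size_gb 27 50   # 27B model at Q8_0
```

The result covers weights only. The context window and runtime buffers add to it, so leave some headroom.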

Fix 2: Reduce context window

Ollama allocates memory for the full context window upfront. Reducing it frees significant RAM:

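# Default context is often 4096 or 8192, depending on your Ollama version
# Reduce to 2048 for simple tasks by setting it inside the interactive session
ollama run llama3.2
>>> /set parameter num_ctx 2048

# Or set a server-wide default when starting the server (recent Ollama releases)
OLLAMA_CONTEXT_LENGTH=2048 ollama serve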

# Or set in a Modelfile
cat > Modelfile << 'EOF'
FROM qwen3.5:27b-q4_k_m
PARAMETER num_ctx 2048
EOF
ollama create qwen-small-ctx -f Modelfile
ollama run qwen-small-ctx

Context memory (the KV cache) scales linearly with context length, so halving the context roughly halves the memory overhead beyond the model weights.
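The linear scaling can be checked with the standard KV-cache estimate: 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element. The sketch below is illustrative only. The architecture numbers in the example calls are hypothetical placeholders, not the specs of any model named in this article, and kv_cache_mib is not an Ollama command.

```shell
# Rough KV-cache size in MiB for a given context length.
# bytes_per_elem defaults to 2, assuming a 16-bit cache.
kv_cache_mib() {
  local ctx="$1" layers="$2" kv_heads="$3" head_dim="$4"
  local bytes_per_elem="${5:-2}"
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1024 / 1024 ))
}

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128
kv_cache_mib 4096 32 8 128   # prints 512
kv_cache_mib 2048 32 8 128   # prints 256
```

Ollama also reserves compute buffers on top of this, so treat the result as a lower bound.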

Fix 3: Use a smaller model

If quantization isn't enough, switch to a smaller model:

Your RAM   Best coding model    Command
4 GB       Qwen3 1.7B           ollama pull qwen3:1.7b
8 GB       Qwen3 8B or Phi-4    ollama pull qwen3:8b
16 GB      DeepSeek R1 14B      ollama pull deepseek-r1:14b
32 GB      Qwen 3.5 27B         ollama pull qwen3.5:27b
64 GB      Llama 4 Scout        ollama pull llama4-scout

See our best models by RAM and best models under 16GB VRAM guides for detailed recommendations.
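If you provision several machines, the table can be wrapped in a small lookup. This sketch simply encodes the rows above as thresholds; suggest_model is a hypothetical helper, and the commands it prints are the ones listed in the table.

```shell
# Map total RAM (whole GB) to the pull command from the table above.
suggest_model() {
  local ram_gb="$1"
  if   [ "$ram_gb" -ge 64 ]; then echo "ollama pull llama4-scout"
  elif [ "$ram_gb" -ge 32 ]; then echo "ollama pull qwen3.5:27b"
  elif [ "$ram_gb" -ge 16 ]; then echo "ollama pull deepseek-r1:14b"
  elif [ "$ram_gb" -ge 8 ];  then echo "ollama pull qwen3:8b"
  elif [ "$ram_gb" -ge 4 ];  then echo "ollama pull qwen3:1.7b"
  else echo "no model in the table fits ${ram_gb} GB"
  fi
}

suggest_model 16
suggest_model 24   # falls back to the 16 GB row
```

Pass the nominal RAM figure by hand. Tools such as free -g round down, so a 16 GB machine may report 15.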

Fix 4: Free up memory

Other processes might be using your RAM/VRAM:

# Check what's using GPU memory
nvidia-smi

# Check system RAM usage
free -h  # Linux
vm_stat  # macOS

# Unload an Ollama model you are no longer using
ollama stop llama3.2

# On macOS, close memory-hungry apps (Chrome, Docker, VS Code)

Ollama keeps a model in memory for a while after its last use (5 minutes by default). If you ran a large model earlier, it might still be loaded:

# List running models
ollama ps

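# Stop all running models (skip the header row, take the NAME column)
ollama ps | awk 'NR>1 {print $1}' | while read -r model; do ollama stop "$model"; done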
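On Linux, the number that matters is MemAvailable, not "free". A minimal sketch that reads it and checks it against a requirement is below. It is Linux-only because it relies on /proc/meminfo, and both function names are hypothetical.

```shell
# Report available memory in GiB from a meminfo-style file (Linux only).
# MemAvailable is the kernel's estimate of memory usable without swapping.
available_gib() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "$meminfo"
}

# Exit status 0 if at least need_gib GiB is available, 1 otherwise.
has_room_for() {
  local need_gib="$1" meminfo="${2:-/proc/meminfo}"
  awk -v need="$need_gib" '
    /^MemAvailable:/ { ok = ($2 / 1024 / 1024 >= need) }
    END              { exit (ok ? 0 : 1) }' "$meminfo"
}
```

For example, has_room_for 14.2 && ollama run llama3.2 only starts the model when the figure from the error message is actually available.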

Fix 5: Enable GPU/CPU split (partial offloading)

If your GPU doesn't have enough VRAM for the full model, offload some layers to CPU RAM:

# Set number of GPU layers (lower = more on CPU)
# num_gpu is a model parameter; set it inside the interactive session
ollama run qwen3.5:27b
>>> /set parameter num_gpu 20

# Or persist it in a Modelfile
# PARAMETER num_gpu 20

This is slower than full GPU but faster than full CPU. Experiment with the number: start at half the model's layers and adjust.
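For a better starting point than half the layers, you can estimate how many layers fit in free VRAM. This is a rough sketch that assumes every layer is the same size and reserves a fixed amount for the KV cache and runtime buffers. The 1024 MiB default reserve is an arbitrary assumption, and gpu_layers_that_fit is a hypothetical helper.

```shell
# Estimate how many layers fit in VRAM, assuming equally sized layers.
# All sizes are in MiB; reserve_mib is headroom for KV cache and buffers.
gpu_layers_that_fit() {
  local model_mib="$1" total_layers="$2" free_vram_mib="$3"
  local reserve_mib="${4:-1024}"
  local per_layer=$(( model_mib / total_layers ))
  local fit=$(( (free_vram_mib - reserve_mib) / per_layer ))
  if [ "$fit" -lt 0 ]; then fit=0; fi
  if [ "$fit" -gt "$total_layers" ]; then fit="$total_layers"; fi
  echo "$fit"
}

# Hypothetical: 16000 MiB model, 64 layers, 8192 MiB of free VRAM
gpu_layers_that_fit 16000 64 8192   # prints 28
```

Use the result as the first value to try, then lower it if the CUDA out-of-memory error comes back.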

Docker-specific fix

If running Ollama in Docker, the container might have a memory limit:

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        limits:
          memory: 32G  # Increase this
    # For GPU access
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
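To confirm which limit the container actually received, you can read it from inside the container. The sketch below assumes a cgroup v2 host, where the limit is exposed at /sys/fs/cgroup/memory.max (on cgroup v1 the file is /sys/fs/cgroup/memory/memory.limit_in_bytes). container_mem_limit is a hypothetical helper.

```shell
# Print the container's memory limit in GiB, or "unlimited".
# Assumes cgroup v2; pass another path as $1 for cgroup v1 hosts.
container_mem_limit() {
  local limit_file="${1:-/sys/fs/cgroup/memory.max}"
  local raw
  raw="$(cat "$limit_file")" || return 1
  if [ "$raw" = "max" ]; then
    echo "unlimited"
  else
    awk -v b="$raw" 'BEGIN { printf "%.1f GiB\n", b / 1024 / 1024 / 1024 }'
  fi
}
```

Run it in a shell inside the Ollama container. If it prints less than the model needs, raise the memory value in docker-compose.yml.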

Still not working?

If none of these fixes help, the model genuinely doesn't fit on your hardware. Options:

  1. Use a cloud GPU: RunPod or Vultr for on-demand GPU access
  2. Use an API instead: OpenRouter gives you access to any model without local hardware
  3. Upgrade RAM: if you're on a Mac, the M-series unified memory is the most cost-effective way to run large models locally
