
Ollama Out of Memory Fix: 5 Solutions That Actually Work (2026)


Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.

You pulled a model, ran ollama run, and got:

Error: model requires more system memory (14.2 GiB) than is available (7.8 GiB)

Or the GPU variant:

CUDA out of memory. Tried to allocate 2.00 GiB

This means the model is too large for your available RAM or VRAM. Here are 5 fixes, from quickest to most involved.

Fix 1: Use a smaller quantization

The fastest fix. Quantized models use less memory with minimal quality loss:

# Instead of the full model
ollama pull llama4-scout        # ~60 GB

# Use a quantized version
ollama pull llama4-scout:q4_k_m  # ~25 GB
ollama pull llama4-scout:q3_k_m  # ~20 GB

Quantization guide:

Quantization | Size reduction | Quality loss | When to use
Q8_0         | ~50%           | Negligible   | You have enough RAM
Q5_K_M       | ~65%           | Very small   | Recommended default
Q4_K_M       | ~75%           | Small        | Tight on RAM
Q3_K_M       | ~80%           | Noticeable   | Last resort
Q2_K         | ~85%           | Significant  | Don’t use for coding

For coding tasks, don’t go below Q4_K_M — the quality drop affects code generation accuracy. See our VRAM guide for exact requirements per model.
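As a sanity check before pulling, you can estimate what a quantization will cost you on disk and in RAM. This is a back-of-the-envelope sketch, assuming size ≈ parameters × bits per weight, plus roughly 10% overhead for embeddings and metadata; actual GGUF files will differ somewhat:

```shell
# Rough model size estimate: params (billions) x bits per weight / 8,
# plus ~10% overhead. Ballpark figures, not exact GGUF sizes.
estimate_gb() {
  params_b=$1; bits=$2
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.1 }'
}

estimate_gb 27 4   # a 27B model at Q4: ~14.9 GB
estimate_gb 27 8   # the same model at Q8: ~29.7 GB
```

Run it against your available RAM before downloading a 60 GB file you can't load.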

Fix 2: Reduce context window

Ollama allocates memory for the full context window upfront. Reducing it frees significant RAM:

# Default context is often 4096 or 8192.
# Set a smaller value from inside the interactive session:
ollama run llama3.2
>>> /set parameter num_ctx 2048

# Or set in a Modelfile
cat > Modelfile << 'EOF'
FROM qwen3.5:27b-q4_k_m
PARAMETER num_ctx 2048
EOF
ollama create qwen-small-ctx -f Modelfile
ollama run qwen-small-ctx

Context memory (the KV cache) scales linearly with context length: halving the context roughly halves the memory overhead beyond the model weights.
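To see why the savings are significant, you can estimate the KV cache yourself: it is roughly 2 (keys and values) × layers × KV heads × head dimension × context length × 2 bytes for an fp16 cache. The dimensions below are illustrative Llama-8B-style numbers, not values read from Ollama:

```shell
# KV cache estimate in MiB, assuming an fp16 cache and
# Llama-8B-style dimensions (32 layers, 8 KV heads, head dim 128).
kv_cache_mib() {
  ctx=$1; layers=32; kv_heads=8; head_dim=128
  awk -v c="$ctx" -v l="$layers" -v h="$kv_heads" -v d="$head_dim" \
    'BEGIN { printf "%.0f\n", 2 * l * h * d * c * 2 / 1048576 }'
}

kv_cache_mib 8192   # full default context: 1024 MiB
kv_cache_mib 2048   # quarter the context, quarter the cache: 256 MiB
```

A gigabyte of savings on an 8B-class model is often the difference between fitting in VRAM and spilling to CPU.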

Fix 3: Use a smaller model

If quantization isn’t enough, switch to a smaller model:

Your RAM | Best coding model | Command
4 GB     | Qwen3 1.7B        | ollama pull qwen3:1.7b
8 GB     | Qwen3 8B or Phi-4 | ollama pull qwen3:8b
16 GB    | DeepSeek R1 14B   | ollama pull deepseek-r1:14b
32 GB    | Qwen 3.5 27B      | ollama pull qwen3.5:27b
64 GB    | Llama 4 Scout     | ollama pull llama4-scout

See our best models by RAM and best models under 16GB VRAM guides for detailed recommendations.

Fix 4: Free up memory

Other processes might be using your RAM/VRAM:

# Check what's using GPU memory
nvidia-smi

# Check system RAM usage
free -h  # Linux
vm_stat  # macOS

# Kill other Ollama models (only one loads at a time by default)
ollama stop llama3.2

# On macOS, close memory-hungry apps (Chrome, Docker, VS Code)

Ollama keeps the last-used model in memory. If you ran a large model earlier, it might still be loaded:

# List running models
ollama ps

# Stop all running models
ollama ps | awk 'NR>1 {print $1}' | xargs -n1 ollama stop

Fix 5: Enable GPU/CPU split (partial offloading)

If your GPU doesn’t have enough VRAM for the full model, offload some layers to CPU RAM:

# Set the number of layers offloaded to the GPU (lower = more on CPU)
ollama run qwen3.5:27b
>>> /set parameter num_gpu 20

# Or persist it in a Modelfile
cat > Modelfile << 'EOF'
FROM qwen3.5:27b
PARAMETER num_gpu 20
EOF
ollama create qwen-split -f Modelfile
ollama run qwen-split

This is slower than full GPU but faster than full CPU. Experiment with the number — start at half the model’s layers and adjust.
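Rather than guessing, you can compute a starting point by dividing your free VRAM by the approximate per-layer weight size. This sketch assumes the weights split evenly across layers and ignores KV cache and embedding overhead, so treat the result as an upper bound and round down:

```shell
# Starting point for the GPU layer count:
# free VRAM / (model size / layer count).
# Assumes weights split evenly across layers; leaves no room
# for the KV cache, so round the result down in practice.
gpu_layers() {
  free_mib=$1; model_mib=$2; n_layers=$3
  awk -v f="$free_mib" -v m="$model_mib" -v n="$n_layers" \
    'BEGIN { print int(f / (m / n)) }'
}

# e.g. 8 GiB free, a ~16 GiB quantized model with 48 layers:
gpu_layers 8192 16384 48   # 24 -> try offloading ~20 layers and adjust
```

Get the free-VRAM figure from nvidia-smi before the model loads.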

Docker-specific fix

If running Ollama in Docker, the container might have a memory limit:

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        limits:
          memory: 32G  # Increase this
    # For GPU access
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all

Still not working?

If none of these fixes help, the model genuinely doesn’t fit on your hardware. Options:

  1. Use a cloud GPU: RunPod or Vultr for on-demand GPU access
  2. Use an API instead: OpenRouter gives you access to any model without local hardware
  3. Upgrade RAM — if you’re on a Mac, the M-series unified memory is the most cost-effective way to run large models locally

Related: Ollama Complete Guide · Ollama Troubleshooting Guide · How Much VRAM for AI Models · Best AI Models Under 4GB RAM · Best AI Models Under 16GB VRAM · Best GPU for AI Locally · How to Run AI Without GPU