
Ollama Slow Inference Fix: Speed Up Local AI Model Response Times (2026)


Your Ollama model runs but takes 30+ seconds to respond, or generates tokens painfully slowly. Local AI should feel snappy β€” here’s how to fix slow inference.

Check your current speed

# Run a quick benchmark
time ollama run qwen3:8b "Write a hello world in Python" --verbose

Look at the eval rate in verbose output. Acceptable speeds:

| Speed | Experience | Verdict |
|---|---|---|
| 30+ tok/s | Feels instant | βœ… Great |
| 15-30 tok/s | Comfortable | βœ… Good |
| 5-15 tok/s | Noticeable delay | ⚠️ Needs optimization |
| <5 tok/s | Painfully slow | ❌ Fix needed |
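If you'd rather measure than eyeball the verbose output, Ollama's REST API returns `eval_count` and `eval_duration` (in nanoseconds) on each non-streaming response. A minimal sketch, assuming a local server on the default port:

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_duration in nanoseconds; convert to tok/s."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Send one non-streaming generate request and return the eval rate."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return tokens_per_second(data["eval_count"], data["eval_duration"])

# Requires a running Ollama server:
# print(f"{benchmark('qwen3:8b', 'Write a hello world in Python'):.1f} tok/s")
```

Run it a couple of times: the first call includes model-load time, so only later calls reflect steady-state speed.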

Fix 1: Enable GPU acceleration

The #1 cause of slow inference β€” the model is running on CPU instead of GPU:

# Check if GPU is being used
ollama ps
# Look for "GPU" in the processor column

# NVIDIA: install CUDA drivers
nvidia-smi  # Should show your GPU

# If Ollama isn't using GPU, check CUDA
# Install/update NVIDIA drivers from nvidia.com

Apple Silicon (M1/M2/M3/M4): Ollama uses Metal GPU acceleration automatically. If it’s slow, check that you’re not running out of unified memory (see OOM fix).

No GPU? See our guide on running AI without a GPU. CPU inference will always be slower, but there are optimizations.
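You can also check GPU placement programmatically: the `/api/ps` endpoint reports `size` and `size_vram` for each loaded model, and their ratio tells you how much of the model is actually in VRAM. A sketch, assuming a local server on the default port:

```python
import json
import urllib.request

def gpu_share(model_info: dict) -> float:
    """Fraction of the model resident in VRAM: 1.0 = fully on GPU, 0.0 = CPU only."""
    size = model_info.get("size", 0)
    return model_info.get("size_vram", 0) / size if size else 0.0

def check_loaded_models(host: str = "http://localhost:11434") -> None:
    """Print where each loaded model is running, like `ollama ps` does."""
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        for m in json.load(resp).get("models", []):
            share = gpu_share(m)
            where = "fully on GPU" if share >= 0.99 else f"{share:.0%} on GPU, rest on CPU"
            print(f"{m['name']}: {where}")

# check_loaded_models()  # requires a running Ollama server
```

A partial split (say, 60% GPU / 40% CPU) is often worse than it sounds: the CPU layers bottleneck every token.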

Fix 2: Use a smaller model

Bigger models = slower inference. If speed matters more than quality:

| Current model | Faster alternative | Speed improvement |
|---|---|---|
| Qwen 3.5 27B | Qwen3 8B | ~3x faster |
| DeepSeek R1 14B | Phi-4 3.8B | ~4x faster |
| Llama 4 Scout | Gemma 4 9B | ~2x faster |
| Any 70B model | Any 8B model | ~8x faster |

# Switch to a faster model
ollama run qwen3:8b  # Instead of qwen3.5:27b

For coding specifically, see our best 8B models guide β€” modern 8B models are surprisingly capable.

Fix 3: Reduce context window

Larger context = slower generation (especially the first token):

# Reduce context for faster responses.
# ollama run has no context flag; set num_ctx inside the session:
ollama run qwen3:8b
>>> /set parameter num_ctx 2048   # instead of the default 4096-8192

If you’re not feeding long documents into the model, a 2048 context window is plenty for most coding tasks.
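Through the API, the context window is set per request via the `options` field. A sketch building the request body (assumes the default REST endpoint; `num_ctx` is a standard Ollama model parameter):

```python
import json

def generate_payload(model: str, prompt: str, num_ctx: int = 2048) -> bytes:
    """Build a /api/generate request body with a reduced context window."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # overrides the model's default context
    }).encode()

# import urllib.request
# req = urllib.request.Request("http://localhost:11434/api/generate",
#                              data=generate_payload("qwen3:8b", "Explain list comprehensions"),
#                              headers={"Content-Type": "application/json"})
```

Setting it per request means you can use a small window for chat and a large one only when you actually paste in a long document.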

Fix 4: Use Flash Attention

If your GPU supports it, Flash Attention significantly speeds up inference:

# Enable flash attention (if supported)
OLLAMA_FLASH_ATTENTION=1 ollama serve

Supported on: NVIDIA GPUs with compute capability 8.0+ (A100, RTX 3090, RTX 4090, etc.).
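To check whether your card clears that bar, recent NVIDIA drivers let `nvidia-smi` report compute capability directly (the `compute_cap` query field is an assumption about your driver version; older drivers may not support it). A sketch wrapping that call:

```python
import subprocess

def supports_flash_attention(compute_cap: str, minimum: float = 8.0) -> bool:
    """Compare a compute capability string like '8.6' against the cutoff."""
    return float(compute_cap.strip()) >= minimum

def check_gpus() -> None:
    """Query each GPU's compute capability via nvidia-smi and print a verdict."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        verdict = "supported" if supports_flash_attention(line) else "not supported"
        print(f"compute capability {line.strip()}: flash attention {verdict}")

# check_gpus()  # requires an NVIDIA GPU and driver
```

For reference: the RTX 3090 is 8.6, the RTX 4090 is 8.9, and the A100 is exactly 8.0, so all three qualify.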

Fix 5: Close competing processes

Other GPU-hungry applications steal VRAM and compute:

# Check GPU usage
nvidia-smi

# Common culprits:
# - Browser with hardware acceleration (Chrome, Firefox)
# - VS Code with GPU rendering
# - Docker containers using GPU
# - Other AI tools (LM Studio, vLLM)

On macOS, Activity Monitor β†’ Memory tab shows what’s consuming unified memory.

Fix 6: Use quantized models

Lower quantization = faster inference (less data to process per token):

# Q4 is faster than Q8
ollama pull qwen3.5:27b-q4_k_m  # Faster
ollama pull qwen3.5:27b-q8_0    # Slower but higher quality

The speed difference between Q4 and Q8 is typically 20-40%.
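The size difference is easy to estimate with back-of-envelope arithmetic: Q4_K_M stores roughly 4.8 bits per weight versus roughly 8.5 for Q8_0 (approximate figures for llama.cpp-style quantizations, plus some overhead this sketch ignores):

```python
def est_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough in-memory size of a quantized model, ignoring KV cache and overhead."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# For a 27B model:
# est_size_gb(27, 4.8)  -> ~16.2 GB at Q4_K_M
# est_size_gb(27, 8.5)  -> ~28.7 GB at Q8_0
```

That ~12 GB gap is often the difference between fitting entirely in VRAM (fast) and spilling layers to CPU (slow), which can matter more than the per-token quantization speedup itself.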

Fix 7: Keep the model loaded between requests

If you’re making multiple API calls, the biggest hidden cost is reloading the model:

import ollama

# Slow: each request may pay the model-load cost if the model was unloaded
for question in questions:
    response = ollama.chat(model="qwen3:8b", messages=[{"role": "user", "content": question}])

# Faster: keep_alive keeps the model in memory between requests
ollama.chat(model="qwen3:8b", messages=[...], keep_alive="10m")

The first request is always slower because Ollama loads the model into memory. Subsequent requests reuse the loaded model.

Expected speeds by hardware

| Hardware | 8B model | 14B model | 27B model |
|---|---|---|---|
| M1 MacBook Air (8GB) | 25 tok/s | ❌ Too large | ❌ |
| M2 Pro (16GB) | 40 tok/s | 25 tok/s | 15 tok/s |
| M3 Max (64GB) | 50 tok/s | 35 tok/s | 25 tok/s |
| RTX 3090 (24GB) | 60 tok/s | 40 tok/s | ❌ (needs quantization) |
| RTX 4090 (24GB) | 80 tok/s | 55 tok/s | 30 tok/s (Q4) |
| A100 (80GB) | 100+ tok/s | 70 tok/s | 50 tok/s |

If your speeds are significantly below these numbers, one of the fixes above should help.

Related: Ollama Complete Guide Β· Ollama Out of Memory Fix Β· Best GPU for AI Locally Β· How to Run AI Without GPU Β· CPU vs GPU for LLM Inference Β· Best 8B Parameter Models Β· How Much VRAM for AI Models