Ollama Slow Inference Fix: Speed Up Local AI Model Response Times (2026)
Your Ollama model runs but takes 30+ seconds to respond, or generates tokens painfully slowly. Local AI should feel snappy; here's how to fix slow inference.
Check your current speed
# Run a quick benchmark
time ollama run qwen3:8b "Write a hello world in Python" --verbose
Look at the eval rate line in the verbose output. Acceptable speeds:
| Speed | Experience | Verdict |
|---|---|---|
| 30+ tok/s | Feels instant | ✅ Great |
| 15-30 tok/s | Comfortable | ✅ Good |
| 5-15 tok/s | Noticeable delay | ⚠️ Needs optimization |
| <5 tok/s | Painfully slow | ❌ Fix needed |
Fix 1: Enable GPU acceleration
The #1 cause of slow inference is the model running on the CPU instead of the GPU:
# Check if GPU is being used
ollama ps
# Look for "GPU" in the processor column
# NVIDIA: install CUDA drivers
nvidia-smi # Should show your GPU
# If Ollama isn't using GPU, check CUDA
# Install/update NVIDIA drivers from nvidia.com
Apple Silicon (M1/M2/M3/M4): Ollama uses Metal GPU acceleration automatically. If it's slow, check that you're not running out of unified memory (see OOM fix).
No GPU? See our guide on running AI without a GPU. CPU inference will always be slower, but there are optimizations.
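If you want more detail than ollama ps gives, the server's /api/ps endpoint reports how much of each loaded model is sitting in VRAM. A minimal check, assuming the default server address (localhost:11434) and that jq is installed:
# List loaded models and how much of each is in VRAM
# size_vram much smaller than size means the model is partly (or fully) on the CPU
curl -s http://localhost:11434/api/ps | jq '.models[] | {name, size, size_vram}'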
Fix 2: Use a smaller model
Bigger models = slower inference. If speed matters more than quality:
| Current model | Faster alternative | Speed improvement |
|---|---|---|
| Qwen 3.5 27B | Qwen3 8B | ~3x faster |
| DeepSeek R1 14B | Phi-4 3.8B | ~4x faster |
| Llama 4 Scout | Gemma 4 9B | ~2x faster |
| Any 70B model | Any 8B model | ~8x faster |
# Switch to a faster model
ollama run qwen3:8b # Instead of qwen3.5:27b
For coding specifically, see our best 8B models guide; modern 8B models are surprisingly capable.
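If you're unsure how much you'd actually gain, a rough side-by-side timing on the same prompt makes the trade-off concrete. A quick sketch (the model tags are just examples; substitute whatever you have pulled):
# Compare two models on the same prompt
for model in qwen3:8b qwen3.5:27b-q4_k_m; do
  echo "== $model =="
  time ollama run "$model" "Summarize what a hash map is in two sentences."
done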
Fix 3: Reduce context window
Larger context = slower generation (especially the first token):
# Reduce context for faster responses: start an interactive session,
# then lower num_ctx (smaller than the default) before prompting
ollama run qwen3:8b
>>> /set parameter num_ctx 2048
If you're not feeding long documents into the model, a 2048-token context window is plenty for most coding tasks.
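If you call Ollama over its REST API instead of the CLI, num_ctx can be set per request in the options object. A minimal example, assuming the default server address:
# Request a smaller context window via the API
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Write a hello world in Python",
  "stream": false,
  "options": { "num_ctx": 2048 }
}'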
Fix 4: Use Flash Attention
If your GPU supports it, Flash Attention significantly speeds up inference:
# Enable flash attention (if supported)
OLLAMA_FLASH_ATTENTION=1 ollama serve
Supported on: NVIDIA GPUs with compute capability 8.0+ (A100, RTX 3090, RTX 4090, etc.).
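Setting the variable inline only affects that one ollama serve invocation. On Linux, where the installer runs Ollama as a systemd service, one way to make it persistent (this mirrors how Ollama's docs suggest setting server environment variables) is:
# Add the variable to the service's environment, then restart
sudo systemctl edit ollama.service
# In the override file, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama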
Fix 5: Close competing processes
Other GPU-hungry applications steal VRAM and compute:
# Check GPU usage
nvidia-smi
# Common culprits:
# - Browser with hardware acceleration (Chrome, Firefox)
# - VS Code with GPU rendering
# - Docker containers using GPU
# - Other AI tools (LM Studio, vLLM)
On macOS, Activity Monitor → Memory tab shows what's consuming unified memory.
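On NVIDIA, you can also ask for just the processes holding GPU memory instead of scanning the full nvidia-smi dashboard (field names per nvidia-smi's query-compute-apps help):
# PID, process name, and VRAM used per GPU process
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv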
Fix 6: Use quantized models
Lower-bit quantization = faster inference (less data to read per token):
# Q4 is faster than Q8
ollama pull qwen3.5:27b-q4_k_m # Faster
ollama pull qwen3.5:27b-q8_0 # Slower but higher quality
The speed difference between Q4 and Q8 is typically 20-40%.
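To see the difference on your own hardware, run the same prompt against both quantizations with --verbose and compare the reported eval rate (tags as in the pull commands above):
# Same prompt, two quantizations; compare the eval rate lines
ollama run qwen3.5:27b-q4_k_m "Explain binary search in one paragraph." --verbose
ollama run qwen3.5:27b-q8_0 "Explain binary search in one paragraph." --verbose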
Fix 7: Keep the model loaded between requests
If you're making multiple API calls, make sure the model stays in memory between them:
import ollama  # official Python client

# Slow: if the model unloads between calls, every request pays the load cost
for question in questions:
    response = ollama.chat(model="qwen3:8b",
                           messages=[{"role": "user", "content": question}])

# Faster: keep_alive keeps the model loaded, so only the first request is slow
ollama.chat(model="qwen3:8b", messages=[...], keep_alive="10m")
The first request is always slower because Ollama loads the model into memory. Subsequent requests reuse the loaded model.
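The same knob exists in the REST API: keep_alive on the request controls how long the server keeps the model in memory afterwards (the default is about five minutes). A minimal example, assuming the default server address:
# Keep the model resident for 10 minutes after this request
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "keep_alive": "10m"
}'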
Expected speeds by hardware
| Hardware | 8B model | 14B model | 27B model |
|---|---|---|---|
| M1 MacBook Air (8GB) | 25 tok/s | ❌ Too large | ❌ |
| M2 Pro (16GB) | 40 tok/s | 25 tok/s | 15 tok/s |
| M3 Max (64GB) | 50 tok/s | 35 tok/s | 25 tok/s |
| RTX 3090 (24GB) | 60 tok/s | 40 tok/s | ❌ (needs quantization) |
| RTX 4090 (24GB) | 80 tok/s | 55 tok/s | 30 tok/s (Q4) |
| A100 (80GB) | 100+ tok/s | 70 tok/s | 50 tok/s |
If your speeds are significantly below these numbers, one of the fixes above should help.
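To compare your own setup against this table, you can compute tokens per second directly from the API response: eval_count is the number of generated tokens and eval_duration is the generation time in nanoseconds. A small sketch, assuming jq is installed and the default server address:
# tok/s = eval_count / (eval_duration in seconds)
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Write a hello world in Python",
  "stream": false
}' | jq '{tokens: .eval_count, tok_per_s: (.eval_count / (.eval_duration / 1e9))}'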
Related: Ollama Complete Guide Β· Ollama Out of Memory Fix Β· Best GPU for AI Locally Β· How to Run AI Without GPU Β· CPU vs GPU for LLM Inference Β· Best 8B Parameter Models Β· How Much VRAM for AI Models