Your application calls the Ollama API and either times out or hangs for 30+ seconds before responding. The model works fine in ollama run, but the API is slow. Here's why, and how to fix it.
Why the first request is slow
Ollama loads models into memory on first use. This "cold start" takes 5-30 seconds depending on model size and hardware. Subsequent requests are fast because the model stays loaded.
# First request: slow (loading model)
time curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"hi"}'
# 15-25 seconds
# Second request: fast (model already loaded)
time curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"hi"}'
# 1-3 seconds
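You can confirm which models are currently loaded, and how long they will stay resident, with ollama ps or the /api/ps endpoint:
# Show loaded models and when they expire from memory
ollama ps
# Same information over the API
curl http://localhost:11434/api/ps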
Fix 1: Pre-load the model
Load the model before your app starts:
# Pre-load on startup
curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"","keep_alive":"24h"}'
Or in your application:
import httpx

# On app startup
async def preload_model():
    async with httpx.AsyncClient() as client:
        await client.post("http://localhost:11434/api/generate", json={
            "model": "qwen3:8b",
            "prompt": "",
            "keep_alive": "24h",
        })
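If your app uses a framework with a startup hook, call the preload there so the cold start happens before the first user request. A sketch assuming FastAPI (swap in your framework's equivalent):
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Assumes preload_model() from above; runs before traffic is served
    await preload_model()
    yield

app = FastAPI(lifespan=lifespan)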
Fix 2: Increase keep_alive
By default, Ollama unloads models after 5 minutes of inactivity. Increase it:
# Keep model loaded for 24 hours
OLLAMA_KEEP_ALIVE=24h ollama serve
# Or per-request
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "hello",
  "keep_alive": "24h"
}'
# In Python
import ollama

response = ollama.chat(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "hello"}],
    keep_alive="24h",
)
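If you would rather not pick a duration, a negative keep_alive keeps the model loaded until the Ollama server stops (and 0 unloads it immediately after the response):
# Keep models loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve
# Or per-request
curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"hi","keep_alive":-1}'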
Fix 3: Increase client timeout
Your HTTP client might time out before Ollama finishes loading the model:
# Python (httpx)
client = httpx.AsyncClient(timeout=120.0) # 2 minutes
# Python (requests)
response = requests.post(url, json=data, timeout=120)
# Node.js (fetch)
const response = await fetch(url, {
  method: "POST",
  body: JSON.stringify(data),
  signal: AbortSignal.timeout(120000), // 2 minutes
});
Set timeouts to at least 60 seconds for the first request (model loading) and 30 seconds for subsequent requests.
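One way to apply that without penalizing every call is a per-request timeout override. A minimal httpx sketch (the cold/warm split here is an assumption about your traffic, not an Ollama feature):
import httpx

# Shared client: 30 s default covers warm requests
client = httpx.AsyncClient(base_url="http://localhost:11434", timeout=30.0)

async def generate(prompt: str, first_request: bool = False) -> str:
    response = await client.post(
        "/api/generate",
        json={"model": "qwen3:8b", "prompt": prompt, "stream": False},
        # Allow extra time for model loading on the first call
        timeout=120.0 if first_request else 30.0,
    )
    response.raise_for_status()
    return response.json()["response"]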
Fix 4: Use streaming
Non-streaming requests wait for the full response. Streaming returns tokens as they're generated:
# Non-streaming: waits for complete response (slow perceived latency)
response = ollama.chat(model="qwen3:8b", messages=messages)
# Streaming: first token arrives quickly
for chunk in ollama.chat(model="qwen3:8b", messages=messages, stream=True):
    print(chunk["message"]["content"], end="", flush=True)
Streaming doesn't make generation faster, but the user sees output immediately instead of waiting for the full response.
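If you call the HTTP API directly instead of using the Python library, the stream arrives as one JSON object per line; a sketch with requests:
import json
import requests

with requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "qwen3:8b", "messages": [{"role": "user", "content": "hello"}]},
    stream=True,
    timeout=120,
) as response:
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk["message"]["content"], end="", flush=True)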
Fix 5: Reduce model size
Larger models take longer to load and generate more slowly:
# Slow to load (16 GB)
ollama run qwen3.5:27b
# Fast to load (5 GB)
ollama run qwen3:8b
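To compare the sizes of models you already have pulled before deciding:
# Show local models with their on-disk sizes
ollama list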
See our Ollama slow inference fix for more speed optimizations.
Fix 6: Connection pooling
If your application makes many API calls, reuse the HTTP connection instead of opening a new one for each request:
# Bad: new connection per request
for msg in messages:
    response = requests.post("http://localhost:11434/api/chat", json=data)

# Good: reuse connection
session = requests.Session()
for msg in messages:
    response = session.post("http://localhost:11434/api/chat", json=data)
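The ollama Python library's Client behaves like a session (it holds one underlying HTTP client), so create it once and reuse it. A minimal sketch:
import ollama

# One client keeps its HTTP connection open across calls
client = ollama.Client(host="http://localhost:11434")

messages = [{"role": "user", "content": "hello"}]
response = client.chat(model="qwen3:8b", messages=messages)
print(response["message"]["content"])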
Production configuration
For applications serving multiple users:
# Allow parallel requests
OLLAMA_NUM_PARALLEL=4 \
OLLAMA_KEEP_ALIVE=24h \
OLLAMA_MAX_LOADED_MODELS=2 \
ollama serve
This keeps models loaded for 24 hours of inactivity, lets each loaded model handle up to 4 concurrent requests, and allows 2 models in memory simultaneously.
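If Ollama runs as a systemd service (the default on Linux installs), set these variables in a service override instead of on the command line:
# Opens an override file for the ollama service
sudo systemctl edit ollama
# Add under [Service]:
#   Environment="OLLAMA_KEEP_ALIVE=24h"
#   Environment="OLLAMA_NUM_PARALLEL=4"
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Then apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama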
Related: Ollama Complete Guide · Ollama Cheat Sheet · Ollama Slow Inference Fix · Ollama Connection Refused Fix · Ollama Out of Memory Fix