Your application calls the Ollama API and either times out or hangs for 30+ seconds before responding. The model works fine in ollama run, but the API is slow. Here's why, and how to fix it.
Why the first request is slow
Ollama loads models into memory on first use. This "cold start" takes 5-30 seconds depending on model size and hardware. Subsequent requests are fast because the model stays loaded.
# First request: slow (loading model)
time curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"hi"}'
# 15-25 seconds
# Second request: fast (model already loaded)
time curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"hi"}'
# 1-3 seconds
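You can confirm which models are currently loaded, and how long they will stay resident, with ollama ps or the /api/ps endpoint:
# Show loaded models and when they expire from memory
ollama ps
# Same information over the API
curl http://localhost:11434/api/ps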
Fix 1: Pre-load the model
Load the model before your app starts:
# Pre-load on startup
curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"","keep_alive":"24h"}'
Or in your application:
import httpx

# On app startup
async def preload_model():
    async with httpx.AsyncClient() as client:
        await client.post("http://localhost:11434/api/generate", json={
            "model": "qwen3:8b",
            "prompt": "",
            "keep_alive": "24h",
        })
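If your app uses a framework with a startup hook, call the preload there so the cold start happens before the first user request. A sketch assuming FastAPI (swap in your framework's equivalent):
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Assumes preload_model() from above; runs before traffic is served
    await preload_model()
    yield

app = FastAPI(lifespan=lifespan)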
Fix 2: Increase keep_alive
By default, Ollama unloads models after 5 minutes of inactivity. Increase it:
# Keep model loaded for 24 hours
OLLAMA_KEEP_ALIVE=24h ollama serve
# Or per-request
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "hello",
  "keep_alive": "24h"
}'
# In Python
import ollama

response = ollama.chat(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "hello"}],
    keep_alive="24h",
)
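If you would rather not pick a duration, a negative keep_alive keeps the model loaded until the Ollama server stops (and 0 unloads it immediately after the response):
# Keep models loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve
# Or per-request
curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"hi","keep_alive":-1}'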
Fix 3: Increase client timeout
Your HTTP client might time out before Ollama finishes loading the model:
# Python (httpx)
client = httpx.AsyncClient(timeout=120.0) # 2 minutes
# Python (requests)
response = requests.post(url, json=data, timeout=120)
# Node.js (fetch)
const response = await fetch(url, {
  method: "POST",
  body: JSON.stringify(data),
  signal: AbortSignal.timeout(120000), // 2 minutes
});
Set timeouts to at least 60 seconds for the first request (model loading) and 30 seconds for subsequent requests.
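One way to apply that without penalizing every call is a per-request timeout override. A minimal httpx sketch (the cold/warm split here is an assumption about your traffic, not an Ollama feature):
import httpx

# Shared client: 30 s default covers warm requests
client = httpx.AsyncClient(base_url="http://localhost:11434", timeout=30.0)

async def generate(prompt: str, first_request: bool = False) -> str:
    response = await client.post(
        "/api/generate",
        json={"model": "qwen3:8b", "prompt": prompt, "stream": False},
        # Allow extra time for model loading on the first call
        timeout=120.0 if first_request else 30.0,
    )
    response.raise_for_status()
    return response.json()["response"]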
Fix 4: Use streaming
Non-streaming requests wait for the full response. Streaming returns tokens as they're generated:
# Non-streaming: waits for complete response (slow perceived latency)
response = ollama.chat(model="qwen3:8b", messages=messages)
# Streaming: first token arrives quickly
for chunk in ollama.chat(model="qwen3:8b", messages=messages, stream=True):
    print(chunk["message"]["content"], end="", flush=True)
Streaming doesn't make generation faster, but the user sees output immediately instead of waiting for the full response.
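If you call the HTTP API directly instead of using the Python library, the stream arrives as one JSON object per line; a sketch with requests:
import json
import requests

with requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "qwen3:8b", "messages": [{"role": "user", "content": "hello"}]},
    stream=True,
    timeout=120,
) as response:
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk["message"]["content"], end="", flush=True)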
Fix 5: Reduce model size
Larger models take longer to load and generate more slowly:
# Slow to load (16 GB)
ollama run qwen3.5:27b
# Fast to load (5 GB)
ollama run qwen3:8b
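To compare the sizes of models you already have pulled before deciding:
# Show local models with their on-disk sizes
ollama list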
See our Ollama slow inference fix for more speed optimizations.
Fix 6: Connection pooling
If your application makes many API calls, reuse the HTTP connection instead of opening a new one for each request:
# Bad: new connection per request
for msg in messages:
    response = requests.post("http://localhost:11434/api/chat", json=data)

# Good: reuse connection
session = requests.Session()
for msg in messages:
    response = session.post("http://localhost:11434/api/chat", json=data)
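The ollama Python library's Client behaves like a session (it holds one underlying HTTP client), so create it once and reuse it. A minimal sketch:
import ollama

# One client keeps its HTTP connection open across calls
client = ollama.Client(host="http://localhost:11434")

messages = [{"role": "user", "content": "hello"}]
response = client.chat(model="qwen3:8b", messages=messages)
print(response["message"]["content"])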
Production configuration
For applications serving multiple users:
# Allow parallel requests
OLLAMA_NUM_PARALLEL=4 \
OLLAMA_KEEP_ALIVE=24h \
OLLAMA_MAX_LOADED_MODELS=2 \
ollama serve
This keeps models loaded for 24 hours of inactivity, lets each loaded model handle up to 4 concurrent requests, and allows 2 models in memory simultaneously.
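If Ollama runs as a systemd service (the default on Linux installs), set these variables in a service override instead of on the command line:
# Opens an override file for the ollama service
sudo systemctl edit ollama
# Add under [Service]:
#   Environment="OLLAMA_KEEP_ALIVE=24h"
#   Environment="OLLAMA_NUM_PARALLEL=4"
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Then apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama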
Related: Ollama Complete Guide · Ollama Cheat Sheet · Ollama Slow Inference Fix · Ollama Connection Refused Fix · Ollama Out of Memory Fix