Ollama Complete Guide: Install, Pull Models, and Run AI Locally in 5 Minutes (2026)
Ollama is the easiest way to run AI models locally. One command to install, one command to run any model. No Python environments, no dependency hell, no configuration files. Here's everything you need to know.
What is Ollama?
Ollama is a local AI runtime that downloads, manages, and serves open-source language models. Think of it as Docker for AI models: you pull a model, run it, and interact via chat or API.
It supports hundreds of models including Gemma 4, Llama 4, Qwen 3.6, MiMo V2.5, DeepSeek, Mistral, and more.
Installation
macOS
curl -fsSL https://ollama.com/install.sh | sh
Or download the app from ollama.com; the macOS app runs in the menu bar.
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com. Ollama runs as a background service.
Docker
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
Verify installation
ollama --version
ollama list # Shows downloaded models
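The same check can be scripted for setup automation. The sketch below assumes the default port 11434 and that a running server answers a plain GET on its root URL with HTTP 200; the helper names are hypothetical.

```python
import shutil
import urllib.error
import urllib.request


def cli_installed() -> bool:
    """Return True if the `ollama` binary is on PATH."""
    return shutil.which("ollama") is not None


def server_reachable(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if something answers HTTP 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("CLI installed:", cli_installed())
    print("Server reachable:", server_reachable())
```

If the CLI is installed but the server is not reachable, starting the app or running ollama serve is the usual next step.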
Running your first model
# Start chatting with Gemma 4 26B
ollama run gemma4:26b
# Or try other models
ollama run llama4:scout
ollama run qwen3.5:plus
ollama run deepseek-v3
ollama run codestral
The first run downloads the model, which can take a few minutes depending on size and connection. Subsequent runs skip the download and start as soon as the model is loaded into memory.
Model management
Browse available models
Visit ollama.com/library to browse available models. From the terminal, you can inspect what is already on your machine:
ollama list # Downloaded models
ollama show gemma4 # Model details
ollama ps # Currently running models
Download without running
ollama pull gemma4:26b
ollama pull qwen2.5-coder:32b
Remove models
ollama rm gemma4:26b
Model tags and quantization
Most models have multiple tags for different sizes and quantizations:
ollama run gemma4:26b # Default quantization
ollama run gemma4:26b-q4_K_M # Specific quantization
ollama run gemma4:e2b # Smaller variant
ollama run gemma4:31b # Larger variant
Lower-bit quantizations (Q2, Q3) are smaller and faster but lose quality. Higher-bit formats (Q8, FP16) are larger and slower but more accurate. Q4_K_M is the sweet spot for most use cases.
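To see why quantization matters so much for download size and memory, a rough back-of-the-envelope estimate helps. The sketch below assumes weight size is simply parameters times bits per weight; the function name and the nominal bit widths are illustrative, and real model files are usually somewhat larger because they mix precisions and carry metadata.

```python
def estimated_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters * bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


# Nominal bit widths (an assumption for illustration, not exact file formats).
NOMINAL_BITS = {"Q2": 2, "Q4": 4, "Q8": 8, "FP16": 16}

if __name__ == "__main__":
    for name, bits in NOMINAL_BITS.items():
        print(f"26B parameters at {name}: ~{estimated_size_gb(26, bits):.1f} GB")
```

By this estimate a 26B model drops from roughly 52 GB at FP16 to roughly 13 GB at 4 bits, which is the difference between needing a workstation and fitting on a well-equipped laptop.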
The API
Ollama exposes an OpenAI-compatible API on port 11434. This works with any tool that supports the OpenAI format.
Chat completion
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:26b",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to validate email addresses"}
]
}'
Streaming
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:26b",
"messages": [{"role": "user", "content": "Explain MoE architecture"}],
"stream": true
}'
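With "stream": true, the OpenAI-compatible endpoint sends the reply as server-sent events: one data: line per chunk, ending with data: [DONE]. The sketch below shows one way to turn those lines back into text. The helper names are hypothetical, and it assumes the standard OpenAI chunk shape with the text under choices[0].delta.content.

```python
import json
from typing import Iterable, Optional


def parse_sse_line(line: str) -> Optional[str]:
    """Return the text fragment carried by one SSE line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator line or another SSE field
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content")


def collect_text(lines: Iterable[str]) -> str:
    """Join all text fragments from an iterable of SSE lines."""
    return "".join(t for t in map(parse_sse_line, lines) if t)


if __name__ == "__main__":
    sample = [
        'data: {"choices":[{"delta":{"role":"assistant","content":"Mix"}}]}',
        "",
        'data: {"choices":[{"delta":{"content":"ture"}}]}',
        "data: [DONE]",
    ]
    print(collect_text(sample))
```

In a real client you would feed parse_sse_line each line as it arrives and print fragments immediately instead of collecting them.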
Python
import openai
client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
response = client.chat.completions.create(
model="gemma4:26b",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
JavaScript/TypeScript
const response = await fetch('http://localhost:11434/v1/chat/completions', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: 'gemma4:26b',
messages: [{ role: 'user', content: 'Hello!' }]
})
});
const data = await response.json();
console.log(data.choices[0].message.content);
GPU configuration
NVIDIA GPUs
Ollama automatically detects and uses NVIDIA GPUs with CUDA. No configuration needed.
# Check GPU usage
nvidia-smi
# Force CPU-only mode (set this on the server process, which does the inference)
CUDA_VISIBLE_DEVICES=-1 ollama serve
Apple Silicon (M1/M2/M3/M4)
Ollama uses Metal acceleration automatically on Apple Silicon. Because memory is unified, the GPU can use most of your system RAM as "VRAM", with a share held back for macOS itself.
A MacBook with 16 GB unified memory can run models that would otherwise need a discrete GPU with a comparable amount of VRAM.
AMD GPUs
Ollama supports AMD GPUs via ROCm on Linux. Install ROCm first, then Ollama detects it automatically.
No GPU
Ollama runs on CPU too, just slower. For CPU-only use, stick to smaller models:
- Gemma 4 E2B (2 GB): 10-15 tok/s on CPU
- Phi-3.5 Mini (3 GB): 12-18 tok/s on CPU
- Qwen 3.6 (5 GB): 8-12 tok/s on CPU
See how to run AI without a GPU for more tips.
Advanced configuration
Custom Modelfiles
Create custom model configurations:
# Modelfile
FROM gemma4:26b
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER num_gpu 999
SYSTEM "You are a senior software engineer. Write clean, well-documented code."
ollama create my-coding-assistant -f Modelfile
ollama run my-coding-assistant
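If you maintain several custom models, generating the Modelfile from code keeps them consistent. This is a minimal sketch: render_modelfile is a hypothetical helper, it covers only the FROM, PARAMETER, and SYSTEM instructions used above, and it writes the system prompt in triple quotes, which Modelfiles accept for multi-line strings.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    if '"""' in system:
        raise ValueError('system prompt must not contain """')
    lines = [f"FROM {base}"]
    for key, value in params.items():
        lines.append(f"PARAMETER {key} {value}")
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "gemma4:26b",
        "You are a senior software engineer. Write clean, well-documented code.",
        {"temperature": 0.7, "num_ctx": 8192},
    )
    print(text)
```

Write the result to a file named Modelfile and pass it to ollama create as shown above.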
Environment variables
# Change the listen address and port (0.0.0.0 exposes Ollama to your whole network)
OLLAMA_HOST=0.0.0.0:8080 ollama serve
# Set model storage location
OLLAMA_MODELS=/path/to/models ollama serve
# Reserve VRAM per GPU for other applications (value in bytes, here 2 GiB)
OLLAMA_GPU_OVERHEAD=2147483648 ollama serve
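Once the server listens somewhere other than the default, every client needs a matching base URL. The sketch below derives one from an OLLAMA_HOST-style value. The function name is hypothetical, it handles only the plain host:port form (with an optional scheme, which it ignores), and mapping 0.0.0.0 to localhost is an assumption that fits clients running on the same machine.

```python
def client_base_url(ollama_host: str = "", default_port: int = 11434) -> str:
    """Turn a value like '0.0.0.0:8080' into an OpenAI-style base URL."""
    host, port = "localhost", default_port
    value = ollama_host.strip()
    if value:
        # Drop an optional scheme such as http://
        if "://" in value:
            value = value.split("://", 1)[1]
        name, sep, port_text = value.partition(":")
        if name:
            host = name
        if sep and port_text:
            port = int(port_text)
    # 0.0.0.0 is a bind address, not a destination; local clients use localhost.
    if host == "0.0.0.0":
        host = "localhost"
    return f"http://{host}:{port}/v1"


if __name__ == "__main__":
    import os
    print(client_base_url(os.environ.get("OLLAMA_HOST", "")))
```

The returned string can be passed as base_url to any OpenAI-compatible client.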
Running as a service
# Linux (systemd)
sudo systemctl enable ollama
sudo systemctl start ollama
# Check status
sudo systemctl status ollama
IDE integration
VS Code + Continue.dev
The most popular setup for local AI coding:
- Install Ollama and pull a coding model: ollama pull qwen2.5-coder:32b
- Install the Continue extension in VS Code
- Configure Continue to use Ollama at http://localhost:11434
You get tab completion, inline chat, and code actions, like GitHub Copilot but fully local and free.
Other integrations
Ollama's OpenAI-compatible API works with:
- Open WebUI: ChatGPT-like web interface for Ollama
- Cody: Sourcegraph's AI coding assistant
- LangChain / LlamaIndex: for building RAG applications
- Any OpenAI SDK: just change the base URL
Ollama vs alternatives
| | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Setup | 1 command | Build from source | pip install |
| GPU required | No | No | Yes |
| Model management | Built-in | Manual | Manual |
| API | OpenAI-compat | OpenAI-compat | OpenAI-compat |
| Quantization control | Via tags | Full control | Limited |
| Best for | Getting started | Power users | Production |
For a detailed comparison, see Ollama vs llama.cpp vs vLLM.
Troubleshooting
Model too slow
- Use a smaller model or lower quantization
- Check that the GPU is being used: ollama ps shows the CPU/GPU split for each loaded model
- Reduce context: add PARAMETER num_ctx 4096 to your Modelfile
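To tell whether a change actually helped, measure generation speed before and after. Ollama's native /api/generate endpoint reports eval_count (tokens generated) and eval_duration (in nanoseconds) in its final response. The sketch below computes tokens per second from those two fields; the function names are hypothetical, and benchmark needs a running server on the default port.

```python
import json
import urllib.request


def tokens_per_second(result: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    duration_ns = result.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return result.get("eval_count", 0) / (duration_ns / 1e9)


def benchmark(model: str, prompt: str, base_url: str = "http://localhost:11434") -> float:
    """Send one non-streaming request to /api/generate and return tokens per second."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return tokens_per_second(json.load(resp))
```

For example, benchmark("gemma4:26b", "Explain MoE architecture") returns a single number you can compare across quantizations or context sizes.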
Out of memory
- Use a smaller quantization: ollama run gemma4:26b-q3_K_M
- Reduce context window
- Close other applications
- Check our best AI models under 4GB RAM
Model not found
ollama pull model-name # Download first
ollama list # Check what's available
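Applications can run the same check before sending a request. The native GET /api/tags endpoint lists local models as a JSON object with a models array. The sketch below checks a name against that list; the helper names are hypothetical, and treating a bare name as the :latest tag is an assumption that mirrors how the CLI resolves names.

```python
import json
import urllib.request


def installed_models(tags: dict) -> list:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags.get("models", [])]


def is_installed(name: str, tags: dict) -> bool:
    """True if name matches an installed model; a bare name implies :latest."""
    wanted = name if ":" in name else f"{name}:latest"
    return wanted in installed_models(tags)


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return json.load(resp)
```

With a server running, is_installed("gemma4:26b", fetch_tags()) tells you whether a pull is needed first.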
What to run first
New to local AI? Start with ollama run gemma4:26b for the best quality-per-hardware ratio.
Need coding help? Try ollama run qwen2.5-coder:32b and see best AI models for coding locally.
Limited hardware? Try ollama run gemma4:e2b, which runs in 2 GB RAM. For more power, cloud GPU providers offer dedicated instances with A100s and H100s by the hour.
Want to compare? See our best local AI models by task and best free AI models rankings.
Ollama makes local AI accessible to everyone. Install it, pick a model, and start building. No cloud account, no API key, no monthly bill.
FAQ
Is Ollama free?
Yes, Ollama is completely free and open source. There are no subscription fees, API charges, or usage limits: you download it, run models on your own hardware, and pay nothing.
Does Ollama need a GPU?
No, Ollama runs on CPU as well, though it will be slower. For acceptable performance without a GPU, stick to smaller models like Gemma 4 E2B or Phi-3.5 Mini that can manage 10-18 tokens per second on CPU alone.
Which Ollama model is best for coding?
Qwen 2.5 Coder 32B is widely considered the best coding model you can run through Ollama. If your hardware can't handle 32B parameters, Qwen 2.5 Coder 7B or Codestral are strong alternatives at smaller sizes.
Can Ollama run on Windows?
Yes, Ollama has a native Windows installer available from ollama.com. It runs as a background service and works the same as on macOS and Linux, with full GPU acceleration support for NVIDIA cards.