
Ollama Complete Guide: Install, Pull Models, and Run AI Locally in 5 Minutes (2026)


Ollama is the easiest way to run AI models locally. One command to install, one command to run any model. No Python environments, no dependency hell, no configuration files. Here’s everything you need to know.

What is Ollama?

Ollama is a local AI runtime that downloads, manages, and serves open-source language models. Think of it as Docker for AI models: you pull a model, run it, and interact via chat or API.

It supports hundreds of models including Gemma 4, Llama 4, Qwen 3.6, MiMo V2.5, DeepSeek, Mistral, and more.

Installation

macOS

curl -fsSL https://ollama.com/install.sh | sh

Or download the app from ollama.com. On macOS it runs in the menu bar.

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com. Ollama runs as a background service.

Docker

docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

Verify installation

ollama --version
ollama list  # Shows downloaded models

Running your first model

# Start chatting with Gemma 4 26B
ollama run gemma4:26b

# Or try other models
ollama run llama4:scout
ollama run qwen3.5:plus
ollama run deepseek-v3
ollama run codestral

The first run downloads the model, which can take a few minutes depending on model size and connection speed. Subsequent runs skip the download and only need to load the model into memory.

Model management

Browse available models

Visit ollama.com/library to browse what is available. From the terminal you can inspect what you already have:

ollama list          # Downloaded models
ollama show gemma4   # Model details
ollama ps            # Currently running models
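
The same inventory is available over HTTP. A minimal Python sketch, assuming the native `/api/tags` endpoint on the default port and a response shaped like `{"models": [{"name": ...}]}`. The helper names `model_names` and `list_local_models` are illustrative and not part of Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if you changed OLLAMA_HOST


def model_names(tags_response: dict) -> list:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list:
    """Ask a running Ollama server which models are downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Usage (needs a running server):
#   print(list_local_models())
```

Keeping the parsing in its own function makes it easy to test without a server running.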

Download without running

ollama pull gemma4:26b
ollama pull qwen2.5-coder:32b

Remove models

ollama rm gemma4:26b

Model tags and quantization

Most models have multiple tags for different sizes and quantizations:

ollama run gemma4:26b          # Default quantization
ollama run gemma4:26b-q4_K_M   # Specific quantization
ollama run gemma4:e2b           # Smaller variant
ollama run gemma4:31b           # Larger variant

Lower-bit quantizations (Q2, Q3) are smaller and faster but lose quality. Higher-precision ones (Q8, FP16) are larger and slower but stay closer to the original model. Q4_K_M is a good default for most use cases.
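
To see why the quantization level matters so much for memory, a rough back-of-the-envelope estimate helps: weight size is approximately parameter count times bits per weight, divided by 8. The sketch below assumes approximate bits-per-weight figures for each format (these are ballpark assumptions, not official numbers) and ignores context cache and runtime overhead:

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumed, approximate bits per weight for common formats (illustrative only).
APPROX_BITS = {"Q4_K_M": 4.8, "Q8_0": 8.5, "FP16": 16.0}

for name, bits in APPROX_BITS.items():
    print(f"26B parameters at {name}: ~{estimate_weights_gb(26, bits):.1f} GB")
```

Actual downloads will differ somewhat, since real model files mix precisions across layers and include metadata.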

The API

Ollama exposes an OpenAI-compatible API on port 11434. This works with any tool that supports the OpenAI format.

Chat completion

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:26b",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a Python function to validate email addresses"}
    ]
  }'

Streaming

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Explain MoE architecture"}],
    "stream": true
  }'
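
With "stream": true, the OpenAI-compatible endpoint sends server-sent events: lines beginning with `data:` that carry a JSON chunk, ending with `data: [DONE]`. A minimal Python sketch of how a client might reassemble the text, assuming that chunk layout (`choices[0].delta.content`). The function name `iter_stream_text` and the sample events are made up for illustration:

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


# Hand-written sample events for illustration, not real model output.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": "Mixture"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " of experts"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In a real client, `lines` would come from the HTTP response body decoded line by line.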

Python

import openai

client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

response = client.chat.completions.create(
    model="gemma4:26b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
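
If you would rather not install the openai package, the same request can be made with the standard library. This is a sketch under the assumption that the server is on the default port. `build_chat_request` and `chat` are hypothetical helper names:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=120) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


# Usage (needs a running server and a downloaded model):
#   print(chat("gemma4:26b", "Hello!"))
```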

JavaScript/TypeScript

const response = await fetch('http://localhost:11434/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'gemma4:26b',
    messages: [{ role: 'user', content: 'Hello!' }]
  })
});
const data = await response.json();
console.log(data.choices[0].message.content);

GPU configuration

NVIDIA GPUs

Ollama automatically detects and uses NVIDIA GPUs with CUDA. No configuration needed.

# Check GPU usage
nvidia-smi

# Force CPU-only mode (set this on the server process, not on `ollama run`)
CUDA_VISIBLE_DEVICES=-1 ollama serve

Apple Silicon (M1/M2/M3/M4)

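Ollama uses Metal acceleration automatically on Apple Silicon. Because of the unified memory architecture, the GPU shares system RAM, so most of it can serve as “VRAM” (macOS keeps a portion for the system and other apps).

A MacBook with 16 GB of unified memory can run many models that would otherwise need a discrete GPU with a comparable amount of VRAM, minus whatever macOS and other apps are using.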

AMD GPUs

Ollama supports AMD GPUs via ROCm on Linux. Install ROCm first, then Ollama detects it automatically.

No GPU

Ollama runs on CPU too, just slower. For CPU-only use, stick to smaller models:

  • Gemma 4 E2B (2 GB): 10-15 tok/s on CPU
  • Phi-3.5 Mini (3 GB): 12-18 tok/s on CPU
  • Qwen 3.6 (5 GB): 8-12 tok/s on CPU
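
To translate those throughput figures into waiting time, divide the expected answer length by the token rate. A small illustrative calculation, where the 500-token answer length is an arbitrary assumption:

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady token rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# A 500-token answer (hypothetical length) across the range of rates listed above.
for rate in (8, 10, 18):
    print(f"{rate} tok/s: about {generation_seconds(500, rate):.0f} s")
```

This ignores prompt processing time, which adds to the delay before the first token appears.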

See how to run AI without a GPU for more tips.

Advanced configuration

Custom Modelfiles

Create custom model configurations:

# Modelfile
FROM gemma4:26b

PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER num_gpu 999

SYSTEM "You are a senior software engineer. Write clean, well-documented code."

Then build and run the custom model:

ollama create my-coding-assistant -f Modelfile
ollama run my-coding-assistant
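
If you maintain several variants, generating the Modelfile text from a script can be convenient. A minimal Python sketch, assuming only the FROM, PARAMETER, and SYSTEM instructions shown above and the triple-quoted form for the system prompt. `render_modelfile` is a hypothetical helper, not an Ollama API:

```python
def render_modelfile(base: str, params: dict, system: str = "") -> str:
    """Render a simple Modelfile with FROM, PARAMETER, and SYSTEM lines."""
    if '"""' in system:
        raise ValueError('system prompt must not contain """')
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    if system:
        # Triple quotes allow quotes and newlines inside the prompt.
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "gemma4:26b",
    {"temperature": 0.7, "num_ctx": 8192},
    "You are a senior software engineer.",
)
print(text)
```

Write the result to a file named Modelfile and pass it to ollama create with -f as usual.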

Environment variables

# Change the bind address and port (0.0.0.0 listens on all network interfaces)
OLLAMA_HOST=0.0.0.0:8080 ollama serve

# Set model storage location
OLLAMA_MODELS=/path/to/models ollama serve

# Reserve VRAM per GPU for other workloads (value in bytes; 2 GiB here)
OLLAMA_GPU_OVERHEAD=2147483648 ollama serve
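
Client scripts need to follow the server when its address changes. The sketch below derives a base URL from OLLAMA_HOST, assuming a simple `host:port` value with an optional scheme. It is a simplification (no IPv6 handling, for instance), and `base_url_from_env` is a hypothetical helper:

```python
import os
from typing import Mapping, Optional


def base_url_from_env(env: Optional[Mapping[str, str]] = None,
                      default_host: str = "127.0.0.1",
                      default_port: int = 11434) -> str:
    """Turn an OLLAMA_HOST-style value into a URL a client can connect to."""
    env = os.environ if env is None else env
    value = env.get("OLLAMA_HOST", "").strip().rstrip("/")
    if not value:
        return f"http://{default_host}:{default_port}"
    scheme = "http"
    if "://" in value:
        scheme, value = value.split("://", 1)
    host, sep, port = value.partition(":")
    host = host or default_host
    if host == "0.0.0.0":
        host = "127.0.0.1"  # 0.0.0.0 is a bind address; connect via loopback instead
    port = port if sep and port else str(default_port)
    return f"{scheme}://{host}:{port}"


print(base_url_from_env({"OLLAMA_HOST": "0.0.0.0:8080"}))
```

Pass the result as the base URL of your HTTP client, appending /v1 for the OpenAI-compatible routes.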

Running as a service

# Linux (systemd)
sudo systemctl enable ollama
sudo systemctl start ollama

# Check status
sudo systemctl status ollama

IDE integration

VS Code + Continue.dev

The most popular setup for local AI coding:

  1. Install Ollama and pull a coding model: ollama pull qwen2.5-coder:32b
  2. Install the Continue extension in VS Code
  3. Configure Continue to use Ollama at http://localhost:11434

You get tab completion, inline chat, and code actions, like GitHub Copilot but fully local and free.

Other integrations

Ollama’s OpenAI-compatible API works with:

  • Open WebUI: ChatGPT-like web interface for Ollama
  • Cody: Sourcegraph’s AI coding assistant
  • LangChain / LlamaIndex: for building RAG applications
  • Any OpenAI SDK β€” just change the base URL

Ollama vs alternatives

Feature               Ollama           llama.cpp          vLLM
Setup                 1 command        Build from source  pip install
GPU required          No               No                 Yes
Model management      Built-in         Manual             Manual
API                   OpenAI-compat    OpenAI-compat      OpenAI-compat
Quantization control  Via tags         Full control       Limited
Best for              Getting started  Power users        Production

For a detailed comparison, see Ollama vs llama.cpp vs vLLM.

Troubleshooting

Model too slow

  • Use a smaller model or lower quantization
  • Check that the GPU is being used: ollama ps reports the CPU/GPU split in its PROCESSOR column
  • Reduce context: add PARAMETER num_ctx 4096 to Modelfile
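
The context size can also be lowered per request rather than in a Modelfile. A minimal Python sketch, assuming Ollama's native `/api/chat` endpoint and its `options.num_ctx` field on the default port. `chat_payload` and `chat_small_context` are illustrative names:

```python
import json
import urllib.request


def chat_payload(model: str, prompt: str, num_ctx: int = 4096) -> bytes:
    """Build a native /api/chat request body with a reduced context window."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                  # ask for one JSON reply instead of a stream
        "options": {"num_ctx": num_ctx},  # context window for this request
    }).encode("utf-8")


def chat_small_context(model: str, prompt: str,
                       base_url: str = "http://localhost:11434") -> str:
    """Send the request to a running server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=chat_payload(model, prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["message"]["content"]


# Usage (needs a running server and a downloaded model):
#   print(chat_small_context("gemma4:26b", "Summarize MoE in one sentence"))
```

A smaller context window reduces memory use and can speed up generation, at the cost of the model seeing less of a long conversation.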

Out of memory

  • Use a smaller quantization: ollama run gemma4:26b-q3_K_M
  • Reduce context window
  • Close other applications
  • Check our best AI models under 4GB RAM

Model not found

ollama pull model-name  # Download first
ollama list             # Check what's already downloaded

What to run first

New to local AI? Start with ollama run gemma4:26b for the best quality-per-hardware ratio.

Need coding help? Try ollama run qwen2.5-coder:32b, and see best AI models for coding locally.

Limited hardware? Try ollama run gemma4:e2b, which runs in 2 GB of RAM. For more power, cloud GPU providers offer dedicated instances with A100s and H100s by the hour.

Want to compare? See our best local AI models by task and best free AI models rankings.

Ollama makes local AI accessible to everyone. Install it, pick a model, and start building. No cloud account, no API key, no monthly bill.

FAQ

Is Ollama free?

Yes, Ollama is completely free and open source. There are no subscription fees, API charges, or usage limits: you download it, run models on your own hardware, and pay nothing.

Does Ollama need a GPU?

No, Ollama runs on CPU as well, though it will be slower. For acceptable performance without a GPU, stick to smaller models like Gemma 4 E2B or Phi-3.5 Mini that can manage 10-18 tokens per second on CPU alone.

Which Ollama model is best for coding?

Qwen 2.5 Coder 32B is widely considered one of the best coding models you can run through Ollama. If your hardware can’t handle 32B parameters, Qwen 2.5 Coder 7B or Codestral are strong alternatives at smaller sizes.

Can Ollama run on Windows?

Yes, Ollama has a native Windows installer available from ollama.com. It runs as a background service and works the same as on macOS and Linux, with full GPU acceleration support for NVIDIA cards.
