Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
Quick reference for Ollama, the easiest way to run AI models locally.
Install
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
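After installing, a quick sanity check is to probe the server's default port (11434). This helper is a sketch of my own, not part of the Ollama CLI:

```python
import socket

def server_listening(host: str = "127.0.0.1", port: int = 11434) -> bool:
    """Return True if something accepts TCP connections on Ollama's default port."""
    try:
        with socket.create_connection((host, port), timeout=1):
            return True
    except OSError:
        return False

print(server_listening())  # True once `ollama serve` (or the desktop app) is running
```

Equivalently, `curl http://localhost:11434/api/version` should return the server's version as JSON.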
Model management
ollama pull qwen3:8b # Download a model
ollama run qwen3:8b # Run interactively (auto-pulls if needed)
ollama list # List downloaded models
ollama show qwen3:8b # Show model details
ollama rm qwen3:8b # Delete a model
ollama cp qwen3:8b my-model # Copy/rename a model
ollama ps # Show running models
ollama stop qwen3:8b # Stop a running model
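These commands script well. For example, you can pull model names out of `ollama list` by taking the first column and skipping the header; the helper below is illustrative, and the sample text mirrors a typical run:

```python
def installed_models(listing: str) -> list[str]:
    """Extract model names from `ollama list` output (first column, header skipped)."""
    lines = listing.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]

# Typically fed from: subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout
sample = """NAME              ID              SIZE      MODIFIED
qwen3:8b          a1b2c3d4e5f6    5.2 GB    2 days ago
nomic-embed-text  f6e5d4c3b2a1    274 MB    3 weeks ago"""
print(installed_models(sample))  # ['qwen3:8b', 'nomic-embed-text']
```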
Run options
ollama run qwen3:8b # Interactive chat
ollama run qwen3:8b "one-shot prompt" # Single response, then exit
ollama run qwen3:8b --verbose # Show token stats
ollama run qwen3:8b --format json # Force JSON output
# num_ctx and num_gpu are model parameters, not run flags.
# Set them inside an interactive session:
/set parameter num_ctx 2048 # Context window size
/set parameter num_gpu 999 # Offload all layers to GPU
Create custom models
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM qwen3:8b
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM You are a senior Python developer. Write clean, typed, tested code.
EOF
# Build the model
ollama create python-coder -f Modelfile
# Run it
ollama run python-coder
Import GGUF files
cat > Modelfile << 'EOF'
FROM ./model-file.gguf
EOF
ollama create my-model -f Modelfile
API endpoints
# Generate (streaming)
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Write hello world in Python"
}'
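By default `/api/generate` streams newline-delimited JSON, one object per line, each carrying a `response` fragment and a `done` flag. A minimal sketch of reassembling the text in Python (the function name is mine, and the sample lines are illustrative, not real model output):

```python
import json

def collect_stream(lines):
    """Concatenate the "response" fields from Ollama's NDJSON stream until done."""
    out = []
    for raw in lines:
        chunk = json.loads(raw)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Illustrative stream fragments in the documented shape:
sample = [
    '{"model":"qwen3:8b","response":"Hello","done":false}',
    '{"model":"qwen3:8b","response":" world","done":true}',
]
print(collect_stream(sample))  # Hello world
```

Pass `"stream": false` in the request body if you'd rather get a single JSON object back.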
# Chat (multi-turn)
curl http://localhost:11434/api/chat -d '{
"model": "qwen3:8b",
"messages": [{"role": "user", "content": "Explain REST APIs"}]
}'
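The chat endpoint is stateless: each request must resend the full conversation, so clients keep a growing `messages` list. A minimal sketch of building that history and the request body (the helper name is mine; the assistant reply is a placeholder):

```python
import json

def chat_body(model: str, messages: list, stream: bool = False) -> str:
    """JSON body for POST /api/chat, matching the fields in the curl example."""
    return json.dumps({"model": model, "messages": messages, "stream": stream})

history = [{"role": "user", "content": "Explain REST APIs"}]
# After the model replies, append its message and the next user turn:
history.append({"role": "assistant", "content": "REST is an architectural style..."})
history.append({"role": "user", "content": "Show an example request"})
body = chat_body("qwen3:8b", history)
print(json.loads(body)["messages"][-1]["content"])  # Show an example request
```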
# Embeddings
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Your text here"
}'
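Embedding vectors from `/api/embed` are usually compared with cosine similarity. A stdlib-only sketch; the toy 3-d vectors stand in for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy vectors; real nomic-embed-text vectors have 768 dimensions.
print(round(cosine([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]), 3))  # 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # 0.0
```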
# List models
curl http://localhost:11434/api/tags
# Model info
curl http://localhost:11434/api/show -d '{"model": "qwen3:8b"}'
Environment variables
OLLAMA_HOST=0.0.0.0:11434 # Listen address (default: 127.0.0.1:11434)
OLLAMA_MODELS=/path/to/models # Model storage location
OLLAMA_NUM_GPU=999 # GPU layers (999 = all)
OLLAMA_NUM_PARALLEL=4 # Concurrent requests
OLLAMA_MAX_LOADED_MODELS=2 # Models in memory simultaneously
OLLAMA_FLASH_ATTENTION=1 # Enable flash attention
OLLAMA_KEEP_ALIVE=5m # Keep model loaded after last request
OLLAMA_DEBUG=1 # Debug logging
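Clients that talk to the API typically honor `OLLAMA_HOST` too, falling back to the default address when it is unset. A sketch of that resolution logic (the helper name is mine):

```python
import os

def ollama_base_url() -> str:
    """Resolve the server URL, honoring OLLAMA_HOST with the documented default."""
    host = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
    if not host.startswith(("http://", "https://")):
        host = "http://" + host
    return host

print(ollama_base_url())  # http://127.0.0.1:11434 unless OLLAMA_HOST is set
```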
Best models for coding
| Model | Size | Command | Best for |
|---|---|---|---|
| Qwen3 8B | 5 GB | ollama run qwen3:8b | General coding |
| DeepSeek R1 14B | 9 GB | ollama run deepseek-r1:14b | Reasoning |
| Qwen 3.5 27B | 16 GB | ollama run qwen3.5:27b | Best quality |
| CodeStral | 13 GB | ollama run codestral | Code-specific |
| Phi-4 Mini 3.8B | 2.5 GB | ollama run phi4-mini | Low RAM |
See our best Ollama models for coding for the full list.
Use with AI coding tools
# With Aider
aider --model ollama/qwen3:8b
# With Continue.dev (VS Code)
# Add to ~/.continue/config.json:
# {"models": [{"provider": "ollama", "model": "qwen3:8b"}]}
# With OpenCode
opencode --model ollama/qwen3:8b
Troubleshooting
| Error | Fix |
|---|---|
| Out of memory | Use a smaller model or a more aggressive quantization |
| Model not found | Check the exact name and tag with ollama list (e.g. qwen3:8b, not qwen3) |
| Connection refused | Start the server with ollama serve (or launch the desktop app) |
| Slow responses | Enable GPU offload, reduce the context window |
Full troubleshooting: Ollama Troubleshooting Guide
Speed up your workflow: Raycast lets you trigger Ollama commands from a keyboard shortcut.
Related: Ollama Complete Guide · Best Ollama Models for Coding · Ollama vs LM Studio vs vLLM · How Much VRAM for AI Models · Aider with Ollama Setup