Ollama vs llama.cpp vs vLLM: Which Should You Use? (2026)
There are three main ways to run AI models on your own hardware: Ollama, llama.cpp, and vLLM. Each one is built for a different use case. Here's which one to use.
Quick comparison
| | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Best for | Personal use, prototyping | Maximum control, any hardware | Production serving |
| Setup time | 2 minutes | 15-30 minutes | 30-60 minutes |
| API | OpenAI-compatible | OpenAI-compatible (via llama-server) | OpenAI-compatible |
| GPU support | CUDA, Metal, ROCm | CUDA, Metal, ROCm, Vulkan, CPU | CUDA (primarily) |
| Concurrent users | 1-2 | 1-2 | Hundreds |
| Throughput | Good | Good | 3-5x better than Ollama |
| Model format | GGUF (auto-download) | GGUF (manual download) | HuggingFace (native) |
| Quantization | Automatic | Full control | Limited |
| Learning curve | Minimal | Moderate | Steep |
Ollama: "it just works"
Ollama is Docker for LLMs. One command to install, one command to run a model. It handles downloading, quantization, and serving automatically.
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3.5:9b
```
That's it. You now have a local AI running with an OpenAI-compatible API at localhost:11434.
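To sanity-check it, you can hit the OpenAI-compatible chat endpoint directly. A minimal sketch, using the model pulled above (the prompt is arbitrary):

```bash
# Ollama exposes OpenAI-style routes under /v1
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'
```

Any OpenAI client library works the same way: point its base URL at localhost:11434/v1 and you're done.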
Use Ollama when:
- You want to try a model quickly
- You're building a personal coding assistant
- You need a local API for development
- You don't want to manage model files manually
Skip Ollama when:
- You need to serve multiple concurrent users
- You need fine-grained control over quantization
- You're deploying to production
llama.cpp: maximum control
llama.cpp is the C++ inference engine that Ollama is built on. It runs on everything: NVIDIA GPUs, AMD GPUs, Apple Silicon, Intel GPUs, and even CPUs. It gives you full control over quantization, context size, batch size, and threading.
```bash
# Build from source with CUDA (llama.cpp now builds via CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Download a model manually
huggingface-cli download Qwen/Qwen3.5-9B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run
./build/bin/llama-server -m ./models/Qwen3.5-9B-Q4_K_M.gguf \
  --ctx-size 8192 --threads 8 --port 8080
```
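If none of the published quants fit your hardware, llama.cpp also ships a llama-quantize tool for producing your own. A minimal sketch, assuming you've already downloaded a full-precision GGUF (the F16 filename here is illustrative):

```bash
# Re-quantize a full-precision GGUF down to Q4_K_M
# (input filename is hypothetical; use whatever F16/BF16 GGUF you actually have)
./build/bin/llama-quantize \
  ./models/Qwen3.5-9B-F16.gguf \
  ./models/Qwen3.5-9B-Q4_K_M.gguf \
  Q4_K_M
```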
Use llama.cpp when:
- You need specific quantization formats (Q2, Q3, Q4, Q5, Q6, Q8)
- Youโre running on unusual hardware (AMD, Intel Arc, CPU-only)
- You want to squeeze maximum performance from limited hardware
- You're building a custom inference pipeline
Skip llama.cpp when:
- You just want to chat with a model
- You don't want to compile anything
- You need production-grade serving
vLLM: production throughput
vLLM uses PagedAttention and continuous batching to serve 3-5x more concurrent users on the same hardware compared to Ollama. It's designed for production deployments where multiple users hit the same model simultaneously.
```bash
pip install vllm

vllm serve Qwen/Qwen3.5-9B \
  --port 8000 \
  --max-model-len 8192
```
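You can watch continuous batching at work by firing several requests at the OpenAI-compatible endpoint at once. A rough sketch (the request count and prompt are arbitrary):

```bash
# Send 10 chat requests in parallel; vLLM batches them instead of queueing them
for i in $(seq 1 10); do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen3.5-9B",
      "messages": [{"role": "user", "content": "Write a haiku about GPUs."}]
    }' > /dev/null &
done
wait
```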
Use vLLM when:
- You're serving a model to multiple users
- You need maximum throughput on GPU hardware
- You're deploying to Kubernetes or a production cluster (see the Docker sketch after this list)
- Latency under concurrent load matters
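For cluster deployments, vLLM also publishes a prebuilt container image, so you don't have to manage a Python environment on each node. A minimal sketch, assuming the vllm/vllm-openai image and a host with the NVIDIA container toolkit installed:

```bash
# Containerized equivalent of the `vllm serve` command above
docker run --gpus all -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3.5-9B \
  --max-model-len 8192
```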
Skip vLLM when:
- You're the only user
- You're on Apple Silicon (limited support)
- You don't have NVIDIA GPUs
- You want a simple setup
Decision flowchart
- Just want to try a model? → Ollama
- Building a personal tool? → Ollama
- Need specific quantization or unusual hardware? → llama.cpp
- Serving to a team or production users? → vLLM
- On Apple Silicon? → Ollama or llama.cpp (vLLM has limited Mac support)
- On NVIDIA GPU for production? → vLLM
Performance comparison
On the same hardware (RTX 4090, Qwen3.5-9B Q4):
| | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Single user tok/s | ~45 | ~50 | ~40 |
| 10 concurrent users | ~5 tok/s each | ~5 tok/s each | ~15 tok/s each |
| 50 concurrent users | Crashes | Crashes | ~8 tok/s each |
For a single user, all three perform similarly. The difference shows up under concurrent load, where vLLM's batching gives it a massive advantage.
Can you combine them?
Yes. A common pattern:
- Development: Ollama on your laptop for quick testing
- Staging: llama.cpp on a shared server for team access
- Production: vLLM on GPU instances for user-facing features
The model weights stay in the same GGUF or HuggingFace formats; you're just changing the serving layer.
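Because each layer exposes an OpenAI-compatible endpoint (llama.cpp via llama-server), client code stays identical across environments; only the base URL and model name change. A minimal sketch using the ports from the examples above:

```bash
# Same request against any of the three servers; just swap the base URL
# Ollama: :11434, llama.cpp: :8080, vLLM: :8000 (model names differ per server)
BASE_URL="http://localhost:11434"
curl -s "$BASE_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```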
Related
- Best Self-Hosted AI Models in 2026
- How to Run Qwen 3.5 Locally
- Self-Hosted AI vs API โ When to Pay and When to Run Locally
- How Much VRAM Do You Need for AI?
- AI Coding Tools Pricing