🤖 AI Tools
· 3 min read

Ollama vs llama.cpp vs vLLM: Which Should You Use? (2026)


There are three main ways to run AI models on your own hardware: Ollama, llama.cpp, and vLLM. Each one is built for a different use case. Here's which one to use.

Quick comparison

| | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Best for | Personal use, prototyping | Maximum control, any hardware | Production serving |
| Setup time | 2 minutes | 15-30 minutes | 30-60 minutes |
| API | OpenAI-compatible | REST (manual) | OpenAI-compatible |
| GPU support | CUDA, Metal, ROCm | CUDA, Metal, ROCm, Vulkan, CPU | CUDA (primarily) |
| Concurrent users | 1-2 | 1-2 | Hundreds |
| Throughput | Good | Good | 3-5x better than Ollama |
| Model format | GGUF (auto-download) | GGUF (manual download) | HuggingFace (native) |
| Quantization | Automatic | Full control | Limited |
| Learning curve | Minimal | Moderate | Steep |

Ollama: "it just works"

Ollama is Docker for LLMs. One command to install, one command to run a model. It handles downloading, quantization, and serving automatically.

curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3.5:9b

That's it. You now have a local AI running with an OpenAI-compatible API at localhost:11434.
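
As a quick sanity check, you can hit that API directly with curl. A minimal sketch, assuming the qwen3.5:9b model pulled above; the prompt is just illustrative:

# Chat with the local model via Ollama's OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'

Any OpenAI-compatible client library works the same way; just point its base URL at http://localhost:11434/v1.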

Use Ollama when:

  • You want to try a model quickly
  • You're building a personal coding assistant
  • You need a local API for development
  • You don't want to manage model files manually

Skip Ollama when:

  • You need to serve multiple concurrent users
  • You need fine-grained control over quantization
  • You're deploying to production

llama.cpp: maximum control

llama.cpp is the C++ inference engine that Ollama is built on. It runs on everything: NVIDIA GPUs, AMD GPUs, Apple Silicon, Intel GPUs, and even CPUs. It gives you full control over quantization, context size, batch size, and threading.

# Build from source (current releases use CMake; GGML_CUDA enables NVIDIA support)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j

# Download a model manually
huggingface-cli download Qwen/Qwen3.5-9B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run
./build/bin/llama-server -m ./models/Qwen3.5-9B-Q4_K_M.gguf \
  --ctx-size 8192 --threads 8 --port 8080
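
Once the server is up, you talk to it over plain HTTP. A minimal sketch against llama-server's native /completion endpoint; the prompt and token count are just illustrative:

# Ask the running llama-server for a short completion
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain the KV cache in one sentence.", "n_predict": 64}'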

Use llama.cpp when:

  • You need specific quantization formats (Q2, Q3, Q4, Q5, Q6, Q8)
  • You're running on unusual hardware (AMD, Intel Arc, CPU-only)
  • You want to squeeze maximum performance from limited hardware
  • You're building a custom inference pipeline

Skip llama.cpp when:

  • You just want to chat with a model
  • You don't want to compile anything
  • You need production-grade serving

vLLM: production throughput

vLLM uses PagedAttention and continuous batching to serve 3-5x more concurrent users on the same hardware compared to Ollama. It's designed for production deployments where multiple users hit the same model simultaneously.

pip install vllm

vllm serve Qwen/Qwen3.5-9B \
  --port 8000 \
  --max-model-len 8192
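
Because the API is OpenAI-compatible, calling it looks just like the Ollama example, only on a different port and with the HuggingFace model name. A minimal sketch:

# Chat via vLLM's OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-9B",
    "messages": [{"role": "user", "content": "Hello"}]
  }'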

Use vLLM when:

  • You're serving a model to multiple users
  • You need maximum throughput on GPU hardware
  • You're deploying to Kubernetes or a production cluster
  • Latency under concurrent load matters

Skip vLLM when:

  • You're the only user
  • You're on Apple Silicon (limited support)
  • You don't have NVIDIA GPUs
  • You want a simple setup

Decision flowchart

  1. Just want to try a model? → Ollama
  2. Building a personal tool? → Ollama
  3. Need specific quantization or unusual hardware? → llama.cpp
  4. Serving to a team or production users? → vLLM
  5. On Apple Silicon? → Ollama or llama.cpp (vLLM has limited Mac support)
  6. On NVIDIA GPU for production? → vLLM

Performance comparison

On the same hardware (RTX 4090, Qwen3.5-9B Q4):

| | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Single user (tok/s) | ~45 | ~50 | ~40 |
| 10 concurrent users | ~5 tok/s each | ~5 tok/s each | ~15 tok/s each |
| 50 concurrent users | Crashes | Crashes | ~8 tok/s each |

For a single user, all three perform similarly. The difference shows up under concurrent load, where vLLM's batching gives it a massive advantage.

Can you combine them?

Yes. A common pattern:

  • Development: Ollama on your laptop for quick testing
  • Staging: llama.cpp on a shared server for team access
  • Production: vLLM on GPU instances for user-facing features

The model weights are the same (GGUF builds for Ollama and llama.cpp, HuggingFace weights for vLLM); you're just changing the serving layer.
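
In practice that means the client code barely changes between environments; you mostly swap a base URL and model name. A hedged sketch with placeholder hostnames (note that recent llama-server builds also expose an OpenAI-style /v1 route):

# Dev: Ollama on your laptop
BASE_URL=http://localhost:11434/v1
MODEL="qwen3.5:9b"

# Staging: llama-server on a shared box (hostname is a placeholder)
# BASE_URL=http://staging-box:8080/v1
# MODEL="Qwen3.5-9B-Q4_K_M"

# Production: vLLM on a GPU instance (hostname is a placeholder)
# BASE_URL=http://gpu-prod:8000/v1
# MODEL="Qwen/Qwen3.5-9B"

curl "$BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}"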

Related: AI Coding Tools Pricing