🤖 AI Tools
· 3 min read
Last updated on

Ollama vs llama.cpp vs vLLM — Which Should You Use? (2026)


There are three main ways to run AI models on your own hardware: Ollama, llama.cpp, and vLLM. Each one is built for a different use case. Here’s which one to use.

Quick comparison

Ollamallama.cppvLLM
Best forPersonal use, prototypingMaximum control, any hardwareProduction serving
Setup time2 minutes15-30 minutes30-60 minutes
APIOpenAI-compatibleREST (manual)OpenAI-compatible
GPU supportCUDA, Metal, ROCmCUDA, Metal, ROCm, Vulkan, CPUCUDA (primarily)
Concurrent users1-21-2Hundreds
ThroughputGoodGood3-5x better than Ollama
Model formatGGUF (auto-download)GGUF (manual download)HuggingFace (native)
QuantizationAutomaticFull controlLimited
Learning curveMinimalModerateSteep

Ollama: “it just works”

Ollama is Docker for LLMs. One command to install, one command to run a model. It handles downloading, quantization, and serving automatically.

curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3.5:9b

That’s it. You now have a local AI running with an OpenAI-compatible API at localhost:11434.

Use Ollama when:

  • You want to try a model quickly
  • You’re building a personal coding assistant
  • You need a local API for development
  • You don’t want to manage model files manually

Skip Ollama when:

  • You need to serve multiple concurrent users
  • You need fine-grained control over quantization
  • You’re deploying to production

llama.cpp: maximum control

llama.cpp is the C++ inference engine that Ollama is built on. It runs on everything — NVIDIA GPUs, AMD GPUs, Apple Silicon, Intel GPUs, and even CPUs. It gives you full control over quantization, context size, batch size, and threading.

# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make LLAMA_CUDA=1

# Download a model manually
huggingface-cli download Qwen/Qwen3.5-9B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run
./llama-server -m ./models/Qwen3.5-9B-Q4_K_M.gguf \
  --ctx-size 8192 --threads 8 --port 8080

Use llama.cpp when:

  • You need specific quantization formats (Q2, Q3, Q4, Q5, Q6, Q8)
  • You’re running on unusual hardware (AMD, Intel Arc, CPU-only)
  • You want to squeeze maximum performance from limited hardware
  • You’re building a custom inference pipeline

Skip llama.cpp when:

  • You just want to chat with a model
  • You don’t want to compile anything
  • You need production-grade serving

vLLM: production throughput

vLLM uses PagedAttention and continuous batching to serve 3-5x more concurrent users on the same hardware compared to Ollama. It’s designed for production deployments where multiple users hit the same model simultaneously.

pip install vllm

vllm serve Qwen/Qwen3.5-9B \
  --port 8000 \
  --max-model-len 8192

Use vLLM when:

  • You’re serving a model to multiple users
  • You need maximum throughput on GPU hardware
  • You’re deploying to Kubernetes or a production cluster
  • Latency under concurrent load matters

Skip vLLM when:

  • You’re the only user
  • You’re on Apple Silicon (limited support)
  • You don’t have NVIDIA GPUs
  • You want a simple setup

Decision flowchart

  1. Just want to try a model? → Ollama
  2. Building a personal tool? → Ollama
  3. Need specific quantization or unusual hardware? → llama.cpp
  4. Serving to a team or production users? → vLLM
  5. On Apple Silicon? → Ollama or llama.cpp (vLLM has limited Mac support)
  6. On NVIDIA GPU for production? → vLLM

Performance comparison

On the same hardware (RTX 4090, Qwen3.5-9B Q4):

Ollamallama.cppvLLM
Single user tok/s~45~50~40
10 concurrent users~5 tok/s each~5 tok/s each~15 tok/s each
50 concurrent usersCrashesCrashes~8 tok/s each

For a single user, all three perform similarly. The difference shows up under concurrent load, where vLLM’s batching gives it a massive advantage.

Can you combine them?

Yes. A common pattern:

  • Development: Ollama on your laptop for quick testing
  • Staging: llama.cpp on a shared server for team access
  • Production: vLLM on GPU instances for user-facing features

The models are the same GGUF/HuggingFace files — you’re just changing the serving layer.

Related: AI Coding Tools Pricing