Apr 18, 2026 · 3 min read

Last updated on Apr 20, 2026

How to Serve LLMs with vLLM — Production Deployment Guide

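vLLM is a widely used engine for production LLM serving. It combines continuous batching, which admits new requests into the running batch as earlier ones finish, with PagedAttention, which stores the KV cache in fixed-size blocks to reduce memory fragmentation. Together they raise throughput while keeping latency low. Here’s how to deploy it from scratch.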

Prerequisites

Before installing vLLM, make sure you have:

Python 3.9+ installed
NVIDIA GPU with CUDA 12.1+ (or AMD ROCm 6.0+)
At least 16 GB of VRAM for meaningful models
pip or conda for package management

Don’t have 16 GB of VRAM? Cloud GPU providers offer A100s by the hour, so you can run vLLM without buying hardware.
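To check the VRAM requirement on a machine you already have, you can ask nvidia-smi for the total memory of each GPU. The sketch below assumes an NVIDIA driver is installed; the helper names and the threshold constant are illustrative and not part of vLLM.

```python
import shutil
import subprocess

# Slightly under 16 GiB (16384 MiB): cards marketed as 16 GB often report a bit less.
# This threshold is an assumption for illustration, not a vLLM requirement.
MIN_VRAM_MIB = 15_000


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line, one line per GPU."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Return total memory per GPU in MiB, or [] if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_vram_mib(result.stdout)


def gpus_meeting_minimum(vram_mib: list[int], minimum: int = MIN_VRAM_MIB) -> list[int]:
    """Return the indices of GPUs that report at least `minimum` MiB."""
    return [index for index, mib in enumerate(vram_mib) if mib >= minimum]
```

Calling gpus_meeting_minimum(query_vram_mib()) returns the indices of GPUs that clear the threshold, or an empty list when no suitable GPU is found.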

Install

The simplest installation uses pip:

pip install vllm

To target a specific CUDA version, or to build from source:

# Install with specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# Or build from source for latest features
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

Verify the installation:

python -c "import vllm; print(vllm.__version__)"

Loading models

vLLM downloads models from HuggingFace automatically. The first run downloads the model weights, which can take a while depending on model size:

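# Load a model in Python and generate one completion
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3.5-27B-Instruct")
# Without explicit sampling params, generation stops after a short default token budget
params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["Hello, how are you?"], params)  # one RequestOutput per prompt
print(outputs[0].outputs[0].text)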

For gated models, set your HuggingFace token:

export HF_TOKEN="hf_your_token_here"

Basic server

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct \
  --port 8000

This starts an OpenAI-compatible API on port 8000. Most tools that let you set a custom OpenAI base URL, such as Aider, OpenCode, and Continue.dev, can point at it.

Test the server:

curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-27B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
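The same chat completion request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server from the previous step is listening on localhost:8000; the function names are illustrative, and api_key only matters if the server was started with --api-key.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=100, api_key=None):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url, model, prompt, **kwargs):
    """Send one prompt and return the reply text (requires a running server)."""
    request = build_chat_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_reply(json.load(response))
```

With the server running, chat("http://localhost:8000/v1", "Qwen/Qwen3.5-27B-Instruct", "Hello!") returns the reply text. The model argument has to match the name the server was started with.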

With quantization

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --port 8000

See our quantization guide for choosing between GPTQ, AWQ, and GGUF.

Multi-GPU

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Large-Instruct-2411 \
  --tensor-parallel-size 2 \
  --port 8000

See our GPU memory planning guide for sizing.
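A rough way to pick --tensor-parallel-size is to estimate how much weight memory lands on each GPU. The sketch below is back-of-the-envelope arithmetic, not a vLLM API: it counts weights only and assumes they split evenly across GPUs. The function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int, tensor_parallel: int = 1) -> float:
    """Approximate weight memory per GPU in GB (1 GB = 1e9 bytes).

    Covers weights only: KV cache, activations, and runtime overhead are extra,
    and quantized checkpoints carry some additional metadata (scales, zero points).
    """
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / tensor_parallel / 1e9


for label, params, bits, tp in [
    ("27B, 16-bit, 1 GPU", 27, 16, 1),
    ("27B, 4-bit, 1 GPU", 27, 4, 1),
    ("123B, 16-bit, 2 GPUs", 123, 16, 2),
]:
    print(f"{label}: ~{weight_memory_gb(params, bits, tp):.1f} GB of weights per GPU")
```

If the per-GPU figure is close to or above the card's VRAM, raise --tensor-parallel-size or switch to a quantized checkpoint before tuning anything else.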

GPU memory management

vLLM pre-allocates GPU memory for the KV cache at startup. You can control this behavior:

# Limit GPU memory utilization (default is 0.9 = 90%)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct \
  --gpu-memory-utilization 0.85 \
  --port 8000

# Limit max context length to reduce KV cache memory
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct \
  --max-model-len 16384 \
  --port 8000

Key memory flags:

Flag	Effect
`--gpu-memory-utilization 0.85`	Let vLLM use at most 85% of VRAM, leaving 15% for other processes
`--max-model-len 16384`	Cap context length, reducing the KV cache needed per sequence
`--enforce-eager`	Disable CUDA graphs (saves ~1-2 GB, slightly slower)
`--swap-space 16`	Allocate 16 GiB of CPU swap space per GPU for overflow

If you’re running into OOM errors, reduce --max-model-len first. The KV cache grows linearly with context length, so this setting usually has the biggest impact on memory usage.
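To see why context length dominates, it helps to estimate the KV cache for one full-length sequence. The sketch below uses the usual formula for standard attention with grouped KV heads. The architecture numbers are hypothetical placeholders, not the real values for any model named here; read the actual ones from the model's config.json, and expect models with other attention variants to differ.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(context_len, **arch):
    """KV cache in GiB for a single sequence that fills `context_len` tokens."""
    return context_len * kv_cache_bytes_per_token(**arch) / 1024**3


# Hypothetical architecture for illustration only (16-bit cache entries).
ARCH = dict(num_layers=64, num_kv_heads=8, head_dim=128, dtype_bytes=2)

for context_len in (16384, 32768):
    print(f"--max-model-len {context_len}: ~{kv_cache_gib(context_len, **ARCH):.1f} GiB per full-length sequence")
```

Under these assumed numbers, halving the context length halves the cache a single long request can claim, which leaves room for more concurrent sequences in the same VRAM budget.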

Benchmarks

Typical throughput numbers on an A100 80GB with continuous batching:

Model	Quantization	Tok/s (single user)	Throughput (32 concurrent)
Qwen3.5 27B	FP16	~45 tok/s	~800 tok/s total
Qwen3.5 27B	AWQ	~65 tok/s	~1200 tok/s total
Mistral Large 123B	FP16 (2×A100)	~25 tok/s	~400 tok/s total
Llama 4 Scout (109B MoE, 17B active)	AWQ	~35 tok/s	~600 tok/s total

For detailed comparisons, see vLLM vs Ollama vs llama.cpp vs TGI and SGLang vs vLLM.
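Numbers like these depend heavily on prompt length, output length, and hardware, so it is worth measuring on your own setup. Below is a minimal sketch of a concurrent load generator. The send_request callable is an assumption you supply yourself, and fake_request is a stand-in so the example runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(send_request, num_requests=32, concurrency=32):
    """Run `send_request` num_requests times across `concurrency` threads.

    `send_request` takes no arguments and returns the number of completion
    tokens it received. Returns (total_tokens, tokens_per_second).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send_request) for _ in range(num_requests)]
        total_tokens = sum(future.result() for future in futures)
    elapsed = time.perf_counter() - start
    return total_tokens, total_tokens / elapsed


def fake_request():
    """Stand-in for a real API call: sleeps briefly and reports 100 tokens."""
    time.sleep(0.01)
    return 100


total, tokens_per_second = measure_throughput(fake_request, num_requests=8, concurrency=4)
print(f"{total} tokens at {tokens_per_second:.0f} tok/s (simulated)")
```

For a real measurement, replace fake_request with a function that posts to /v1/chat/completions and returns usage.completion_tokens from the response.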

Connect your tools

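# Aider (the part after "openai/" must match the model name the server was started with)
export OPENAI_API_KEY="your-secret-key"  # any non-empty value if the server has no --api-key
aider --model openai/Qwen/Qwen3.5-27B-Instruct --openai-api-base http://localhost:8000/v1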

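# Claude Code (with GLM-5.1 or other models)
# Claude Code speaks the Anthropic Messages API rather than the OpenAI API, so this
# only works if the server (or a proxy in front of it) exposes an Anthropic-compatible
# endpoint. The client appends /v1/messages itself, so the base URL omits /v1.
export ANTHROPIC_BASE_URL="http://localhost:8000"
claude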

# Continue.dev
# Set provider to "openai" with baseURL "http://localhost:8000/v1"

Production tips

Set max model length — --max-model-len 32768 to limit KV cache memory
Enable prefix caching — --enable-prefix-caching for prompt caching
Monitor GPU — nvidia-smi for VRAM usage
Add Nginx — reverse proxy for load balancing and auth
Health check — curl http://localhost:8000/health
Use systemd — Run vLLM as a service for automatic restarts
Set --disable-log-requests — Reduce I/O overhead in production
API key auth — Use --api-key your-secret-key to protect the endpoint
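The health check tip is easiest to automate with a small poller, since the server can take a while to load the model before it answers. This is a sketch using only the standard library. The function names, retry counts, and the injectable opener and check parameters are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url="http://localhost:8000/health", timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False


def wait_until_healthy(check=is_healthy, attempts=30, delay=2.0):
    """Poll `check` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A deploy script can call wait_until_healthy() after starting the service and only route traffic to the instance once it returns True.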

Systemd service example

[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
Type=simple
User=vllm
ExecStart=/usr/bin/python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --api-key your-secret-key \
  --port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Docker deployment

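# --ipc=host gives the container the shared memory PyTorch needs;
# the volume mount reuses the host's Hugging Face cache so weights are not re-downloaded
docker run --gpus all \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3.5-27B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768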

For team setups, see our free AI coding server guide.
