How to Serve LLMs with vLLM: Production Deployment Guide
vLLM is a widely used engine for production LLM serving. It uses continuous batching and PagedAttention to maximize throughput while keeping latency low. Here's how to deploy it from scratch.
Prerequisites
Before installing vLLM, make sure you have:
- Python 3.9+ installed
- NVIDIA GPU with CUDA 12.1+ (or AMD ROCm 6.0+)
- At least 16 GB of VRAM for meaningful models
- pip or conda for package management
Don't have 16 GB of VRAM? Cloud GPU providers rent A100s by the hour, so you can run vLLM without buying hardware.
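To judge whether a model fits in the VRAM you have, a common rule of thumb is parameter count times bytes per parameter. The sketch below is only a rough estimate under stated assumptions: it counts weights alone (no KV cache, activations, or CUDA overhead), and `weight_memory_gb` is a helper name made up for this illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_param


# A 27B-parameter model at 16 bits vs. 4 bits per weight (weights only).
fp16_gb = weight_memory_gb(27, 16)
int4_gb = weight_memory_gb(27, 4)
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

By this estimate a 27B model needs about 54 GB for FP16 weights and about 13.5 GB at 4 bits, before any KV cache is counted. Real quantized checkpoints carry some extra metadata, so treat the 4-bit figure as a lower bound.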
Install
The simplest installation uses pip:
pip install vllm
For a specific CUDA version or if you need to build from source:
# Install with specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
# Or build from source for latest features
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
Verify the installation:
python -c "import vllm; print(vllm.__version__)"
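In a setup script you may prefer a check that reports a missing package instead of raising ImportError. One option, sketched here, is to read the package metadata rather than importing vLLM, which also avoids loading CUDA libraries just to print a version. `installed_version` is an illustrative helper, not part of vLLM.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("vllm")
print(f"vllm {version}" if version else "vllm is not installed in this environment")
```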
Loading models
vLLM downloads models from HuggingFace automatically. The first run downloads the model weights, which can take a while depending on model size:
# Load a model in Python
from vllm import LLM
llm = LLM(model="Qwen/Qwen3.5-27B-Instruct")
output = llm.generate("Hello, how are you?")
print(output[0].outputs[0].text)
For gated models, set your HuggingFace token:
export HF_TOKEN="hf_your_token_here"
Basic server
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3.5-27B-Instruct \
--port 8000
This starts an OpenAI-compatible API. Tools that let you set a custom OpenAI base URL can point at vLLM, including Aider, OpenCode, and Continue.dev.
Test the server:
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3.5-27B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
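The same chat request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from this section is listening on localhost:8000; `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000/v1", "Qwen/Qwen3.5-27B-Instruct", "Hello!")` should return the model's reply. The `model` value has to match the name the server was started with.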
With quantization
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3.5-27B-Instruct-AWQ \
--quantization awq \
--max-model-len 32768 \
--port 8000
See our quantization guide for choosing between GPTQ, AWQ, and GGUF.
Multi-GPU
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Large-Instruct-2411 \
--tensor-parallel-size 2 \
--port 8000
See our GPU memory planning guide for sizing.
GPU memory management
vLLM pre-allocates GPU memory for the KV cache at startup. You can control this behavior:
# Limit GPU memory utilization (default is 0.9 = 90%)
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3.5-27B-Instruct \
--gpu-memory-utilization 0.85 \
--port 8000
# Limit max context length to reduce KV cache memory
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3.5-27B-Instruct \
--max-model-len 16384 \
--port 8000
Key memory flags:
| Flag | Effect |
|---|---|
| --gpu-memory-utilization 0.85 | Reserve 15% VRAM for OS/other processes |
| --max-model-len 16384 | Cap context length, reducing KV cache size |
| --enforce-eager | Disable CUDA graphs (saves ~1-2 GB, slightly slower) |
| --swap-space 16 | Use CPU swap space for overflow (GB) |
If you're running into OOM errors, reduce --max-model-len first: it has the biggest impact on memory usage.
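To see why context length matters so much, it helps to estimate how the KV cache grows with the number of cached tokens. The sketch below uses the textbook formula for standard attention (one key and one value vector per layer per token) and assumes a 16-bit cache. The architecture numbers are hypothetical placeholders, not the real config of any model named in this guide, and vLLM's actual block-based allocation will differ in detail.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(num_tokens: int, **arch) -> float:
    """KV cache size in GiB for `num_tokens` cached tokens."""
    return num_tokens * kv_cache_bytes_per_token(**arch) / 2**30


# Hypothetical architecture values, chosen only to make the arithmetic concrete.
arch = dict(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
for context in (16384, 32768):
    print(f"{context} tokens -> {kv_cache_gib(context, **arch):.1f} GiB per sequence")
```

With these placeholder values, one full-length sequence needs 2 GiB of cache at 16384 tokens and 4 GiB at 32768. The cost is linear in context length and multiplies again by the number of sequences in flight. Read the real values from your model's config.json before relying on the numbers.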
Benchmarks
Typical throughput numbers on an A100 80GB with continuous batching:
| Model | Quantization | Tok/s (single user) | Throughput (32 concurrent) |
|---|---|---|---|
| Qwen3.5 27B | FP16 | ~45 tok/s | ~800 tok/s total |
| Qwen3.5 27B | AWQ | ~65 tok/s | ~1200 tok/s total |
| Mistral Large 123B | FP16 (2×A100) | ~25 tok/s | ~400 tok/s total |
| Llama 4 Scout 70B | AWQ | ~35 tok/s | ~600 tok/s total |
For detailed comparisons, see vLLM vs Ollama vs llama.cpp vs TGI and SGLang vs vLLM.
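The two throughput columns measure different things, and a little arithmetic makes the trade-off explicit. The sketch below assumes load is spread evenly across users and plugs in the approximate Qwen3.5 27B FP16 figures from the table above. Both function names are illustrative.

```python
def per_user_tok_s(total_tok_s: float, concurrent_users: int) -> float:
    """Average tokens/second each user sees when load is spread evenly."""
    return total_tok_s / concurrent_users


def batching_gain(total_tok_s: float, single_user_tok_s: float) -> float:
    """How many times more total output batching yields vs. one stream."""
    return total_tok_s / single_user_tok_s


# Approximate figures from the Qwen3.5 27B FP16 row of the table above.
print(per_user_tok_s(800, 32))           # average per-user rate at 32 concurrent
print(round(batching_gain(800, 45), 1))  # aggregate gain over a single stream
```

On those figures each of the 32 users sees about 25 tok/s instead of 45, while the GPU as a whole produces roughly 17.8 times the output of a single stream. That trade is what continuous batching buys you.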
Connect your tools
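# Aider (the part after openai/ must match the model name vLLM serves, i.e. the --model value)
aider --model openai/Qwen/Qwen3.5-27B-Instruct --openai-api-base http://localhost:8000/v1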
# Claude Code (with GLM-5.1 or other models)
export ANTHROPIC_BASE_URL="http://localhost:8000/v1"
claude
# Continue.dev
# Set provider to "openai" with baseURL "http://localhost:8000/v1"
Production tips
- Set max model length: --max-model-len 32768 to limit KV cache memory
- Enable prefix caching: --enable-prefix-caching for prompt caching
- Monitor GPU: nvidia-smi for VRAM usage
- Add Nginx: reverse proxy for load balancing and auth
- Health check: curl http://localhost:8000/health
- Use systemd: Run vLLM as a service for automatic restarts
- Set --disable-log-requests: Reduce I/O overhead in production
- API key auth: Use --api-key your-secret-key to protect the endpoint
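The health check above is also useful in deploy scripts, because vLLM can take a while to load weights before it accepts traffic. A minimal sketch of a readiness wait, assuming the /health route returns HTTP 200 once the server is up; `wait_until_healthy` and its default timings are illustrative choices, not vLLM settings.

```python
import time
import urllib.request


def wait_until_healthy(base_url: str, timeout_s: float = 300.0,
                       interval_s: float = 2.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    url = f"{base_url.rstrip('/')}/health"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error: server not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

A deploy script could call `wait_until_healthy("http://localhost:8000")` after starting the service and only add the node to the load balancer when it returns True.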
Systemd service example
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
Type=simple
User=vllm
ExecStart=/usr/bin/python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3.5-27B-Instruct-AWQ \
--quantization awq \
--max-model-len 32768 \
--enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--api-key your-secret-key \
--port 8000
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Docker deployment
docker run --gpus all \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen3.5-27B-Instruct-AWQ \
--quantization awq \
--max-model-len 32768
For team setups, see our free AI coding server guide.
Related: vLLM vs Ollama vs llama.cpp · SGLang vs vLLM · LLM Inference Explained · GPU Memory Planning · Continuous Batching · KV Cache Explained · Best Hosting for AI Projects · Reliable Data Extraction LLMs