
How to Serve LLMs with vLLM: Production Deployment Guide


vLLM is a widely used engine for production LLM serving. It combines continuous batching with PagedAttention to keep throughput high and latency low. Here’s how to deploy it from scratch.

Prerequisites

Before installing vLLM, make sure you have:

  • Python 3.9+ installed
  • NVIDIA GPU with CUDA 12.1+ (or AMD ROCm 6.0+)
  • At least 16 GB of VRAM for meaningful models
  • pip or conda for package management

Don’t have 16 GB of VRAM? Cloud GPU providers offer A100s by the hour, so you can run vLLM without buying hardware.

Install

The simplest installation uses pip:

pip install vllm

For a specific CUDA version or if you need to build from source:

# Install with specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# Or build from source for latest features
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

Verify the installation:

python -c "import vllm; print(vllm.__version__)"
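If you would rather run this check from a script, for example in a deployment pipeline, a minimal sketch using only the standard library might look like this. The helper name `installed_version` is hypothetical; it reads the version from package metadata instead of importing vLLM, so it does not initialize the GPU.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "vllm") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"vllm {version}" if version else "vllm is not installed")
```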

Loading models

vLLM pulls models from Hugging Face automatically. The first run downloads the weights, which can take a while for larger models:

# Load a model in Python
from vllm import LLM

llm = LLM(model="Qwen/Qwen3.5-27B-Instruct")
output = llm.generate("Hello, how are you?")
print(output[0].outputs[0].text)

For gated models, set your HuggingFace token:

export HF_TOKEN="hf_your_token_here"

Basic server

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct \
  --port 8000

This starts an OpenAI-compatible API. Any tool that speaks the OpenAI API can point at vLLM instead, including Aider, OpenCode, and Continue.dev.

Test the server:

curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-27B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
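The same chat request can be sent from Python without extra dependencies. The sketch below assumes the server from the previous step is listening on localhost:8000; `build_chat_request` and `chat` are illustrative helper names, not part of vLLM. The Authorization header is only needed when the server was started with --api-key.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(
    base_url: str,
    model: str,
    prompt: str,
    max_tokens: int = 100,
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server was started with --api-key.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def chat(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat(build_chat_request("http://localhost:8000/v1", "Qwen/Qwen3.5-27B-Instruct", "Hello!"))` should return the model's reply as a string.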

With quantization

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --port 8000

See our quantization guide for choosing between GPTQ, AWQ, and GGUF.
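To see why quantization matters for fitting a model on one GPU, a back-of-the-envelope estimate of weight memory helps. This is a rough sketch only: it counts the weights themselves and ignores quantization metadata (scales and zero points), the KV cache, and activations, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit (AWQ/GPTQ)", 4)]:
    print(f"27B @ {label}: ~{weight_memory_gb(27, bits):.1f} GB")
```

For a 27B model this gives about 54 GB at FP16 and about 13.5 GB at 4 bits.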

Multi-GPU

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Large-Instruct-2411 \
  --tensor-parallel-size 2 \
  --port 8000

See our GPU memory planning guide for sizing.

GPU memory management

vLLM pre-allocates GPU memory for the KV cache at startup. You can control this behavior:

# Limit GPU memory utilization (default is 0.9 = 90%)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct \
  --gpu-memory-utilization 0.85 \
  --port 8000

# Limit max context length to reduce KV cache memory
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct \
  --max-model-len 16384 \
  --port 8000

Key memory flags:

Flag                            Effect
--gpu-memory-utilization 0.85   Cap vLLM at 85% of VRAM, leaving 15% for the OS and other processes
--max-model-len 16384           Cap context length, reducing KV cache size
--enforce-eager                 Disable CUDA graphs (saves ~1-2 GB, slightly slower)
--swap-space 16                 Use CPU swap space for overflow (GB)

If you’re running into OOM errors, reduce --max-model-len first, since it has the biggest impact on memory usage.
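The link between context length and memory can be made concrete with the usual KV cache estimate for standard attention: every token stores one key and one value vector per KV head in every layer. The sketch below uses hypothetical model dimensions (not taken from any real model's config) and hypothetical function names; models with other attention layouts will differ.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(max_model_len: int, **dims) -> float:
    """KV cache needed for a single sequence of max_model_len tokens, in GiB."""
    return max_model_len * kv_cache_bytes_per_token(**dims) / 2**30


# Hypothetical dimensions, chosen only for illustration.
dims = dict(num_layers=64, num_kv_heads=8, head_dim=128, dtype_bytes=2)
for length in (16384, 32768):
    print(f"--max-model-len {length}: ~{kv_cache_gib(length, **dims):.1f} GiB per sequence")
```

With these illustrative dimensions, one full-length sequence needs about 4 GiB at 16384 tokens and 8 GiB at 32768, so halving --max-model-len halves the worst-case cache per sequence.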

Benchmarks

Typical throughput numbers on an A100 80GB with continuous batching:

Model                Quantization     Tok/s (single user)   Throughput (32 concurrent)
Qwen3.5 27B          FP16             ~45 tok/s             ~800 tok/s total
Qwen3.5 27B          AWQ              ~65 tok/s             ~1200 tok/s total
Mistral Large 123B   FP16 (2×A100)    ~25 tok/s             ~400 tok/s total
Llama 4 Scout 70B    AWQ              ~35 tok/s             ~600 tok/s total

For detailed comparisons, see vLLM vs Ollama vs llama.cpp vs TGI and SGLang vs vLLM.
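To reproduce this kind of measurement on your own hardware, one option is a small harness that fires requests concurrently and divides generated tokens by wall-clock time. This is a simplified sketch, not the methodology behind the table above: `measure_throughput` is a hypothetical helper, and `fake_request` is a stand-in you would replace with a real API call that returns the completion token count from the response's usage field.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(
    send_request: Callable[[], int], num_requests: int, concurrency: int
) -> float:
    """Run num_requests calls with `concurrency` workers; return tokens/second.

    send_request must perform one completion and return the number of
    generated tokens (for example, usage.completion_tokens from the response).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send_request) for _ in range(num_requests)]
        total_tokens = sum(f.result() for f in futures)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_request() -> int:
    """Stand-in for a real API call: pretend 100 tokens took 50 ms."""
    time.sleep(0.05)
    return 100


print(f"{measure_throughput(fake_request, num_requests=64, concurrency=32):.0f} tok/s total")
```

Running it once with concurrency 1 and once with concurrency 32 gives numbers comparable to the two throughput columns.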

Connect your tools

# Aider (the name after "openai/" must match the model name vLLM is serving)
aider --model openai/Qwen/Qwen3.5-27B-Instruct --openai-api-base http://localhost:8000/v1

# Claude Code (with GLM-5.1 or other models)
export ANTHROPIC_BASE_URL="http://localhost:8000/v1"
claude

# Continue.dev
# Set provider to "openai" with baseURL "http://localhost:8000/v1"

Production tips

  1. Set max model length: --max-model-len 32768 to limit KV cache memory
  2. Enable prefix caching: --enable-prefix-caching for prompt caching
  3. Monitor GPU: nvidia-smi for VRAM usage
  4. Add Nginx: reverse proxy for load balancing and auth
  5. Health check: curl http://localhost:8000/health
  6. Use systemd: run vLLM as a service for automatic restarts
  7. Set --disable-log-requests: reduce I/O overhead in production
  8. API key auth: use --api-key your-secret-key to protect the endpoint
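The health check in tip 5 is most useful when automated, since loading a large model can take minutes before the server accepts traffic. Below is a minimal polling sketch using only the standard library; the function names, timeout, and interval are illustrative choices, and the probe is injectable so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(url: str = "http://localhost:8000/health", timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    probe: Callable[[], bool] = is_healthy,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll `probe` until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A deploy script could call `wait_until_healthy()` after starting the service and only add the node to the load balancer once it returns True.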

Systemd service example

[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
Type=simple
User=vllm
ExecStart=/usr/bin/python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --api-key your-secret-key \
  --port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Docker deployment

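# --ipc=host gives PyTorch enough shared memory; the volume reuses downloaded weights
docker run --gpus all \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3.5-27B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768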

For team setups, see our free AI coding server guide.
