
GPU Memory Planning for LLM Serving: How Much VRAM You Actually Need


Before deploying an LLM, estimate how much VRAM you need. Underestimate and you hit out-of-memory (OOM) errors; overestimate and you pay for capacity you never use.

The formula

Total VRAM = Model weights + KV cache + Overhead (~15%)
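The formula can be sketched as a small helper. This is only an illustration: the function name and parameters are made up for this example, and it assumes the ~15% overhead applies to the sum of weights and KV cache, which the formula does not spell out.

```python
def total_vram_gb(weights_gb, kv_cache_gb_per_request, concurrent_requests,
                  overhead_fraction=0.15):
    """Estimate total VRAM in GB as weights + KV cache, plus overhead.

    Assumption: overhead is a fraction of (weights + KV cache).
    """
    if concurrent_requests < 0:
        raise ValueError("concurrent_requests must be non-negative")
    # Each concurrent request holds its own KV cache.
    kv_cache_gb = kv_cache_gb_per_request * concurrent_requests
    base_gb = weights_gb + kv_cache_gb
    return base_gb * (1.0 + overhead_fraction)


# Example inputs: 35 GB of weights, 0.5 GB of KV cache per request, 4 requests.
estimate = total_vram_gb(35.0, 0.5, 4)
print(f"Estimated total: {estimate:.1f} GB")
```

Treat the result as a planning figure, then confirm it against measured memory use on the target hardware.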

Model weights

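| Precision | Bytes/param | 7B | 27B | 70B | 123B |
|---|---|---|---|---|---|
| FP16 | 2 | 14GB | 54GB | 140GB | 246GB |
| INT8 | 1 | 7GB | 27GB | 70GB | 123GB |
| Q4 | 0.5 | 3.5GB | 13.5GB | 35GB | 62GB |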
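The table is just parameter count times bytes per parameter. A minimal sketch that reproduces it, with `BYTES_PER_PARAM` and `weights_gb` as illustrative names:

```python
# Bytes per parameter for each precision, as listed in the table.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "Q4": 0.5}


def weights_gb(params_billion, precision):
    """Weight memory in GB (decimal): billions of params times bytes per param."""
    return params_billion * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    sizes = [weights_gb(size, precision) for size in (7, 27, 70, 123)]
    print(precision, sizes)
```

Real 4-bit formats store scaling factors alongside the weights, so actual files tend to be somewhat larger than 0.5 bytes per parameter suggests.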

KV cache per concurrent request

The KV cache size depends on context length and model architecture. Rough estimates:

| Model size | 4K context | 32K context | 128K context |
|---|---|---|---|
| 7B | 0.1GB | 0.5GB | 2GB |
| 27B | 0.3GB | 2GB | 8GB |
| 70B | 0.5GB | 5GB | 20GB |
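For a specific model, the per-request KV cache can be computed from its architecture rather than read from a rough table. The sketch below assumes a standard transformer with one key and one value vector per layer per token, and an FP16 cache. The layer count, KV head count, and head dimension are hypothetical placeholders, not the values of any model named here, so the output will not necessarily match the rough figures above.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens, **arch):
    """KV cache in GiB for one request holding `context_tokens` tokens."""
    return context_tokens * kv_cache_bytes_per_token(**arch) / 1024**3


# Hypothetical architecture, for illustration only.
HYPOTHETICAL_ARCH = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for context_tokens in (4096, 32768, 131072):
    size = kv_cache_gib(context_tokens, **HYPOTHETICAL_ARCH)
    print(f"{context_tokens} tokens: {size} GiB")
```

The real values for a given model are usually listed in its published configuration. Models with fewer KV heads, or a quantized cache, need proportionally less.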

Practical examples

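| Model | Precision | 1 user | 4 users | GPU needed |
|---|---|---|---|---|
| Qwen 3.5 9B | Q4 | 6GB | 8GB | RTX 4070 (12GB) |
| Codestral 22B | Q4 | 13GB | 16GB | RTX 4090 (24GB) |
| Qwen 3.5 27B | Q4 | 16GB | 22GB | RTX 4090 (24GB) |
| Mistral Large 123B | Q4 | 68GB | 80GB | 1x H100 (80GB) |
| GLM-5.1 754B | Q4 | ~200GB | ~240GB | 4x A100 (320GB) |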

Multi-GPU

For models that don't fit on one GPU, use tensor parallelism:

python -m vllm.entrypoints.openai.api_server --model my-model --tensor-parallel-size 2

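Tensor parallelism shards the model across GPUs, so each card holds roughly an equal share of the weights. A 70B Q4 model (35GB) fits on 2x RTX 4090 (48GB total), at about 17.5GB of weights per card.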
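A quick way to sanity-check a multi-GPU plan is to look at what is left on each card after its share of the weights. This sketch assumes a perfectly even split and ignores per-GPU communication and activation buffers; the function names are illustrative.

```python
def per_gpu_weights_gb(weights_gb, tensor_parallel_size):
    """Weights held by each GPU, assuming an even split."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return weights_gb / tensor_parallel_size


def headroom_per_gpu_gb(weights_gb, gpu_vram_gb, tensor_parallel_size):
    """VRAM left on each GPU for KV cache and overhead after loading weights."""
    return gpu_vram_gb - per_gpu_weights_gb(weights_gb, tensor_parallel_size)


# 70B Q4 (35 GB of weights) on 2x RTX 4090 (24 GB each).
share = per_gpu_weights_gb(35.0, 2)
headroom = headroom_per_gpu_gb(35.0, 24.0, 2)
print(f"{share} GB of weights per GPU, {headroom} GB left per GPU")
```

The remaining headroom on each card has to cover that GPU's share of the KV cache plus runtime overhead, so a plan with little headroom supports few concurrent requests.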

Multi-GPU setups are expensive to build at home. Cloud GPU providers offer multi-A100 and H100 instances on demand if you need high-VRAM serving without the capital expense.

Apple Silicon

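Mac unified memory is shared between the CPU and GPU, so it serves as VRAM. macOS and other apps use part of it, so a 32GB Mac offers somewhat less than 32GB for the model. A 192GB Mac Studio Ultra can run Mistral Large 2 in Q4.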
