Apr 18, 2026 · 2 min read

GPU Memory Planning for LLM Serving — How Much VRAM You Actually Need

Before deploying an LLM, estimate how much VRAM it needs. Underestimate and you hit out-of-memory (OOM) errors; overestimate and you pay for hardware you never use.

The formula

Total VRAM = Model weights + (KV cache per request × concurrent requests) + Overhead (~15%)
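
The formula can be sketched as a small helper. This is a minimal sketch, not a definitive calculator: the function name and parameters are made up for illustration, and it assumes the ~15% overhead is taken on top of weights plus KV cache, which is one reasonable reading of the formula.

```python
def estimate_total_vram_gb(weights_gb, kv_cache_per_request_gb,
                           concurrent_requests=1, overhead_fraction=0.15):
    """Estimate total VRAM in GB as weights + KV cache + overhead.

    Assumption: overhead is a fraction of (weights + KV cache).
    """
    if concurrent_requests < 0:
        raise ValueError("concurrent_requests must be non-negative")
    kv_cache_gb = kv_cache_per_request_gb * concurrent_requests
    base_gb = weights_gb + kv_cache_gb
    return base_gb * (1 + overhead_fraction)


# Example inputs: 35 GB of weights and 5 GB of KV cache per request.
one_request = estimate_total_vram_gb(35, 5, concurrent_requests=1)
four_requests = estimate_total_vram_gb(35, 5, concurrent_requests=4)
print(f"1 request:  {one_request:.1f} GB")
print(f"4 requests: {four_requests:.1f} GB")
```

Only the KV cache term grows with concurrency; the weights are loaded once no matter how many requests are in flight.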

Model weights

Precision	Bytes/param	7B	27B	70B	123B
FP16	2	14GB	54GB	140GB	246GB
INT8	1	7GB	27GB	70GB	123GB
Q4	0.5	3.5GB	13.5GB	35GB	62GB
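
The table is just parameter count times bytes per parameter. A minimal sketch that rebuilds it, assuming decimal gigabytes and treating Q4 as exactly 0.5 bytes per parameter as the table does (the helper name is illustrative):

```python
# Bytes per parameter for each precision, as listed in the table.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "Q4": 0.5}


def weights_gb(params_billions, precision):
    """Weight memory in GB: 1 billion params at 1 byte each is 1 GB."""
    return params_billions * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    row = [weights_gb(size, precision) for size in (7, 27, 70, 123)]
    print(precision, row)
```

The 123B Q4 entry comes out at 61.5, which the table rounds to 62GB.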

KV cache per concurrent request

KV cache size depends on context length and model architecture (layer count, number of KV heads, head dimension). Rough estimates per request:

Model size	4K context	32K context	128K context
7B	0.1GB	0.5GB	2GB
27B	0.3GB	2GB	8GB
70B	0.5GB	5GB	20GB
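
For a specific model, the per-request KV cache can be computed from its architecture instead of read off a rough table. The sketch below uses the usual accounting for standard attention (one key and one value vector per layer, per KV head, per token). The architecture numbers are hypothetical placeholders, not taken from any model in this article, so the output will not necessarily match the estimates above.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored for a single token.

    The leading factor 2 covers the key and the value vector.
    bytes_per_value=2 assumes the cache is kept in FP16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens, **arch):
    """KV cache for one request at the given context length, in decimal GB."""
    return context_tokens * kv_cache_bytes_per_token(**arch) / 1e9


# Hypothetical architecture values, chosen only for illustration.
hypothetical_arch = dict(num_layers=28, num_kv_heads=4, head_dim=128)

for context in (4096, 32768, 131072):
    print(context, round(kv_cache_gb(context, **hypothetical_arch), 2), "GB")
```

The cache grows linearly with context length, so going from 4K to 128K multiplies it by 32.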

Practical examples

Model	Precision	VRAM (1 user)	VRAM (4 users)	GPU needed
Qwen 3.5 9B	Q4	6GB	8GB	RTX 4070 (12GB)
Codestral 22B	Q4	13GB	16GB	RTX 4090 (24GB)
Qwen 3.5 27B	Q4	16GB	22GB	RTX 4090 (24GB)
Mistral Large 123B	Q4	68GB	80GB	1x H100 (80GB)
GLM-5.1 754B	Q4	~200GB	~240GB	4x A100 (320GB)
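
A quick way to read a table like this is to look at the headroom left on the suggested GPU. The sketch below copies the 4-user figures for the single-GPU rows above; the helper name is illustrative.

```python
# (model, VRAM needed for 4 users in GB, GPU VRAM in GB), from the table above.
rows = [
    ("Qwen 3.5 9B", 8, 12),
    ("Codestral 22B", 16, 24),
    ("Qwen 3.5 27B", 22, 24),
    ("Mistral Large 123B", 80, 80),
]


def headroom_gb(needed_gb, gpu_gb):
    """Spare VRAM after the estimated requirement; negative means it will not fit."""
    return gpu_gb - needed_gb


for model, needed, gpu in rows:
    print(f"{model}: {headroom_gb(needed, gpu)} GB spare at 4 users")
```

A row with little or no headroom leaves no margin for longer contexts or extra requests, so treat it as the upper limit for that GPU, not a comfortable fit.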

Multi-GPU

For models that don’t fit on one GPU, use tensor parallelism:

python -m vllm.entrypoints.openai.api_server --model my-model --tensor-parallel-size 2

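Tensor parallelism shards the model weights across the GPUs, so each card holds only a slice of the model. A 70B Q4 model (35GB) fits on 2x RTX 4090 (48GB total), leaving about 13GB across both cards for KV cache and overhead.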
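
The per-GPU budget can be checked with simple arithmetic. This sketch assumes the weights are split evenly across GPUs and ignores any extra buffers the serving engine allocates for inter-GPU communication; the function name is illustrative.

```python
def per_gpu_headroom_gb(weights_gb, num_gpus, vram_per_gpu_gb):
    """VRAM left on each GPU after its share of the weights.

    Assumption: weights are sharded evenly across all GPUs.
    """
    weights_per_gpu = weights_gb / num_gpus
    return vram_per_gpu_gb - weights_per_gpu


# 70B at Q4 (35 GB of weights) on 2 GPUs with 24 GB each.
headroom = per_gpu_headroom_gb(35, 2, 24)
print(f"{headroom} GB left per GPU for KV cache and overhead")
```

Each card has to fit its own share, so the per-GPU number matters more than the combined total.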

Multi-GPU setups are expensive to build at home. Cloud GPU providers offer multi-A100 and H100 instances on demand if you need high-VRAM serving without the capital expense.

Apple Silicon

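Mac unified memory doubles as VRAM, but macOS and other apps share the same pool, so plan for somewhat less than the full amount: a 32GB Mac gives you under 32GB for the model. A 192GB Mac Studio Ultra can run Mistral Large 2 in Q4.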
