Before deploying an LLM, calculate exactly how much VRAM you need. Getting this wrong means OOM errors or wasted money.
The formula
Total VRAM = Model weights + KV cache + Overhead (~15%)
Model weights
| Precision | Bytes/param | 7B | 27B | 70B | 123B |
|---|---|---|---|---|---|
| FP16 | 2 | 14GB | 54GB | 140GB | 246GB |
| INT8 | 1 | 7GB | 27GB | 70GB | 123GB |
| Q4 | 0.5 | 3.5GB | 13.5GB | 35GB | 62GB |
KV cache per concurrent request
Depends on context length and model architecture. Rough estimates:
| Model size | 4K context | 32K context | 128K context |
|---|---|---|---|
| 7B | 0.1GB | 0.5GB | 2GB |
| 27B | 0.3GB | 2GB | 8GB |
| 70B | 0.5GB | 5GB | 20GB |
Practical examples
| Model | Precision | 1 user | 4 users | GPU needed |
|---|---|---|---|---|
| Qwen 3.5 9B | Q4 | 6GB | 8GB | RTX 4070 (12GB) |
| Codestral 22B | Q4 | 13GB | 16GB | RTX 4090 (24GB) |
| Qwen 3.5 27B | Q4 | 16GB | 22GB | RTX 4090 (24GB) |
| Mistral Large 123B | Q4 | 68GB | 80GB | 1x H100 (80GB) |
| GLM-5.1 754B | Q4 | ~200GB | ~240GB | 4x A100 (320GB) |
Multi-GPU
For models that donβt fit on one GPU, use tensor parallelism:
python -m vllm.entrypoints.openai.api_server --model my-model --tensor-parallel-size 2
VRAM is split across GPUs. A 70B Q4 model (35GB) fits on 2x RTX 4090 (48GB total).
Multi-GPU setups are expensive to build at home. Cloud GPU providers offer multi-A100 and H100 instances on demand if you need high-VRAM serving without the capital expense.
Apple Silicon
Mac unified memory counts as VRAM. A 32GB Mac = 32GB effective VRAM. A 192GB Mac Studio Ultra can run Mistral Large 2 in Q4.
Related: How Much VRAM for AI? Β· Best GPU for AI Β· KV Cache Explained Β· Best AI Models for Mac