The #1 question people ask about running AI locally: how much VRAM do I need? Here’s the simple answer.
The formula
At Q4 quantization (what most people use):
VRAM needed ≈ parameters (in billions) × 0.5-0.7 GB
Plus ~1-2GB overhead for the inference engine and context.
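If you'd rather let code do the arithmetic, here's a minimal sketch of the same estimate. The 0.6 GB-per-billion and 1.5GB overhead are just midpoints of the ranges above, not exact values:

```python
def estimate_vram_gb(params_billion: float,
                     gb_per_billion: float = 0.6,  # midpoint of the 0.5-0.7 range at Q4
                     overhead_gb: float = 1.5):    # midpoint of the 1-2GB engine/context overhead
    """Back-of-the-envelope VRAM estimate for a dense model at Q4."""
    return params_billion * gb_per_billion + overhead_gb

# A 9B model: 9 x 0.6 + 1.5 ≈ 7GB, which lands in the 6-8GB row of the chart below.
print(f"{estimate_vram_gb(9):.1f} GB")
```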
VRAM chart
| Model size | VRAM needed (Q4) | Example models | Example GPUs |
|---|---|---|---|
| 0.5-1B | 2GB | Qwen3.5-0.8B | Any device |
| 3-4B | 4-5GB | Qwen3.5-4B, Phi-3 Mini | 8GB laptop |
| 7-9B | 6-8GB | Qwen3.5-9B, Llama 3.2 7B, DeepSeek R1 7B | 8GB GPU, 16GB Mac |
| 14-15B | 10-12GB | DeepSeek Coder V2 Lite, MiMo-V2-Flash | RTX 3060 12GB |
| 22-24B | 14-16GB | Codestral, Mistral Small | RTX 4070 Ti 16GB |
| 27-32B | 18-24GB | Qwen3.5-27B, Qwen 2.5 Coder 32B | RTX 4090 24GB |
| 70B | 35-45GB | Llama 3.3 70B | 48GB Mac Pro, A6000 |
| 120-130B | 60-80GB | Qwen3.5-122B-A10B | 64GB+ Mac, multi-GPU |
| 400B+ | 150-214GB | Qwen3.5-397B, DeepSeek V3 | 192GB Mac Ultra, multi-A100 |
MoE models: size by total parameters, not active
Mixture-of-Experts models have a large total parameter count but only activate a fraction per token. However, you still need to load the full model into memory.
| Model | Total params | Active params | VRAM needed (Q4) |
|---|---|---|---|
| MiMo-V2-Flash | 309B | 15B | ~12-16GB |
| Qwen3.5-35B-A3B | 35B | 3B | ~8GB |
| Qwen3.5-397B | 397B | 17B | ~214GB |
| DeepSeek V3 | 671B | 37B | ~80-100GB |
| Llama 4 Maverick | 400B | 17B | ~60-80GB |
The VRAM requirement is based on total parameters, not active parameters. You need to fit the whole model in memory even though only a fraction runs per token.
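To make that concrete with the table's own numbers, here's the gap between sizing by total and by active parameters. The 0.54 GB-per-billion figure is just the ratio implied by the 397B → ~214GB row:

```python
total_b, active_b = 397, 17   # Qwen3.5-397B: total vs active parameters (from the table above)
gb_per_billion = 0.54         # ratio implied by 397B -> ~214GB at Q4

print(f"sized by total params:  {total_b * gb_per_billion:.0f} GB")   # ~214GB: what you actually need
print(f"sized by active params: {active_b * gb_per_billion:.0f} GB")  # ~9GB: wishful thinking
```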
Context window affects VRAM too
Longer context = more VRAM. The KV cache grows linearly with context length.
Approximate additional VRAM for context (on a 9B model):
- 4K context: +0.5GB
- 8K context: +1GB
- 32K context: +4GB
- 128K context: +16GB
This is why a model that “fits” in 8GB VRAM at 4K context might not fit at 32K context. Start with a smaller context window and increase it until you hit your VRAM limit.
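If you want to estimate the KV cache yourself, the per-token cost follows directly from the model's architecture. A minimal sketch, assuming a typical 9B-class layout (32 layers, 8 KV heads, head dimension 128, FP16 cache); those architecture numbers are assumptions for illustration, not something the list above specifies:

```python
def kv_cache_gb(context_len: int,
                n_layers: int = 32,       # assumed layer count for a 9B-class model
                n_kv_heads: int = 8,      # assumed KV heads (grouped-query attention)
                head_dim: int = 128,      # assumed head dimension
                bytes_per_elem: int = 2): # FP16 cache entries
    """KV cache ≈ 2 (K and V) x layers x KV heads x head dim x tokens x bytes."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1024**3

for ctx in (4096, 8192, 32768, 131072):
    print(f"{ctx // 1024}K context: +{kv_cache_gb(ctx):.1f} GB")
```

With these assumptions the numbers come out to +0.5GB, +1GB, +4GB, and +16GB, matching the list above.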
Quantization levels explained
| Level | Bits per param | Quality | VRAM savings |
|---|---|---|---|
| FP16 | 16 | Original | Baseline |
| Q8_0 | 8 | ~99% of original | 50% less |
| Q6_K | 6 | ~98% of original | 62% less |
| Q4_K_M | 4 | ~95% of original | 75% less |
| Q3_K_M | 3 | ~90% of original | 81% less |
| Q2_K | 2 | Noticeable degradation | 87% less |
Q4_K_M is the sweet spot. It preserves ~95% of model quality while using 75% less VRAM than full precision. Most people can’t tell the difference between Q4 and full precision in practice.
Drop to Q3 or Q2 only if the model you want just barely misses fitting at Q4; below Q4 the quality loss becomes noticeable.
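To see where the savings column comes from, here's the weight-size arithmetic for a 9B model at each level (weights only, before the 1-2GB engine/context overhead; the bit counts are the nominal ones from the table, so real quantized files will differ slightly):

```python
params_b = 9  # a 9B-class model
levels = {"FP16": 16, "Q8_0": 8, "Q6_K": 6, "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2}

fp16_gb = params_b * 16 / 8  # 16 bits = 2 bytes per parameter
for name, bits in levels.items():
    gb = params_b * bits / 8                 # bits per param -> bytes -> GB
    saving = (1 - gb / fp16_gb) * 100
    print(f"{name:7s} {gb:5.2f} GB  ({saving:.0f}% smaller than FP16)")
```

A 9B model at Q4_K_M is roughly 4.5GB of weights, which is why it fits comfortably on an 8GB GPU once you add overhead and a modest context.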
Quick recommendations
| "I have…" | "Run this…" |
|---|---|
| 8GB (laptop/Mac) | Qwen3.5-9B Q4 — best quality-per-GB |
| 12GB (RTX 3060) | DeepSeek Coder V2 Lite or Qwen3.5-9B with larger context |
| 16GB (RTX 4070 Ti) | Codestral or MiMo-V2-Flash |
| 24GB (RTX 4090) | Qwen 2.5 Coder 32B — best open-source coding model |
| 32GB (RTX 5090/Mac) | Qwen3.5-27B with generous context |
| 48GB (Mac Pro) | Llama 4 Maverick or Qwen3.5-122B-A10B |
| 192GB (Mac Ultra) | Anything. DeepSeek V3, Qwen 397B. |