The #1 question people ask about running AI locally: how much VRAM do I need? Here's the simple answer.
The formula
At Q4 quantization (what most people use):
VRAM needed ≈ model parameters × 0.5-0.7 GB per billion
Plus ~1-2GB overhead for the inference engine and context.
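The rule of thumb above can be sketched as a small calculator. This is only an illustration of the article's formula: the function name `estimate_vram_q4` and its parameters are made up for this example, and the result is a rough range, not a measurement.

```python
def estimate_vram_q4(params_billion, overhead_gb=(1.0, 2.0)):
    """Return a rough (low, high) VRAM estimate in GB for a Q4 model.

    Uses the rule of thumb from the text: 0.5-0.7 GB per billion
    parameters, plus ~1-2 GB for the inference engine and context.
    """
    gb_per_billion = (0.5, 0.7)
    low = params_billion * gb_per_billion[0] + overhead_gb[0]
    high = params_billion * gb_per_billion[1] + overhead_gb[1]
    return low, high


if __name__ == "__main__":
    low, high = estimate_vram_q4(9)
    print(f"9B model at Q4: roughly {low:.1f}-{high:.1f} GB")
```

For a 9B model this lands in roughly the same range as the 7-9B row in the chart.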
VRAM chart
| Model size | VRAM needed (Q4) | Example models | Example GPUs |
|---|---|---|---|
| 0.5-1B | 2GB | Qwen3.5-0.8B | Any device |
| 3-4B | 4-5GB | Qwen3.5-4B, Phi-3 Mini | 8GB laptop |
| 7-9B | 6-8GB | Qwen3.5-9B, Llama 3.1 8B, DeepSeek R1 7B | 8GB GPU, 16GB Mac |
| 14-15B | 10-12GB | DeepSeek Coder V2 Lite | RTX 3060 12GB |
| 22-24B | 14-16GB | Codestral, Mistral Small | RTX 4070 Ti Super 16GB |
| 27-32B | 18-24GB | Qwen3.5-27B, Qwen 2.5 Coder 32B | RTX 4090 24GB |
| 70B | 35-45GB | Llama 3.3 70B | 48GB Mac Pro, A6000 |
| 120-130B | 60-80GB | Qwen3.5-122B-A10B | 64GB+ Mac, multi-GPU |
| 400B+ | 200GB+ | Qwen3.5-397B, DeepSeek V3 | 512GB Mac Studio, multi-A100 |
MoE models need more VRAM than their active parameter count suggests
Mixture-of-Experts models have a large total parameter count but only activate a fraction per token. However, you still need to load the full model into memory.
| Model | Total params | Active params | VRAM needed (Q4) |
|---|---|---|---|
| MiMo-V2-Flash | 309B | 15B | ~155-216GB |
| Qwen3.5-35B-A3B | 35B | 3B | ~18-25GB |
| Qwen3.5-397B | 397B | 17B | ~214GB |
| DeepSeek V3 | 671B | 37B | ~335-470GB |
| Llama 4 Maverick | 400B | 17B | ~200-280GB |
The VRAM requirement is based on total parameters, not active parameters. You need to fit the whole model in memory even though only a fraction runs per token.
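To make the gap concrete, here is a minimal sketch that applies the article's 0.5-0.7 GB per billion rule to both the total and the active parameter count of one MoE model. The helper name `q4_weights_gb` is invented for this example, and engine overhead is left out to keep the comparison simple.

```python
def q4_weights_gb(params_billion):
    """Rough (low, high) size in GB of Q4 weights, using 0.5-0.7 GB per billion."""
    return params_billion * 0.5, params_billion * 0.7


# Parameter counts taken from the table above (Qwen3.5-397B: 397B total, 17B active).
total_b, active_b = 397, 17

total_low, total_high = q4_weights_gb(total_b)
active_low, active_high = q4_weights_gb(active_b)

if __name__ == "__main__":
    # What you actually have to load: every expert, not just the active ones.
    print(f"Sized by total params:  {total_low:.1f}-{total_high:.1f} GB")
    # What you might wrongly budget for if you only count active params.
    print(f"Sized by active params: {active_low:.1f}-{active_high:.1f} GB")
```

The active parameter count mostly determines speed per token. The total count determines whether the model loads at all.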
Context window affects VRAM too
Longer context = more VRAM. The KV cache grows with context length.
Approximate additional VRAM for context (on a 9B model):
- 4K context: +0.5GB
- 8K context: +1GB
- 32K context: +4GB
- 128K context: +16GB
This is why a model that "fits" in 8GB VRAM at 4K context might not fit at 32K context. Start with smaller context and increase until you hit your VRAM limit.
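The "start small and increase" advice can be sketched as a simple search. The figures above work out to about 0.125 GB per 1K tokens on a 9B model, and this example assumes that rate holds linearly. The real KV cache size depends on the model's architecture and the cache precision, so treat the constant as the article's approximation. The function names and the `model_gb` parameter (weights plus engine overhead) are hypothetical.

```python
# Derived from the list above: +0.5 GB at 4K context on a 9B model.
GB_PER_1K_TOKENS = 0.125


def kv_cache_gb(context_tokens):
    """Approximate extra VRAM for the KV cache, assuming linear growth."""
    return context_tokens / 1024 * GB_PER_1K_TOKENS


def max_context(vram_gb, model_gb, candidates=(4096, 8192, 32768, 131072)):
    """Return the largest candidate context that fits, or None if none do.

    model_gb is an assumed figure for weights plus engine overhead.
    """
    best = None
    for ctx in candidates:
        if model_gb + kv_cache_gb(ctx) <= vram_gb:
            best = ctx
    return best


if __name__ == "__main__":
    # Example: a model occupying an assumed 6.5 GB on an 8 GB and a 12 GB card.
    print(max_context(vram_gb=8, model_gb=6.5))
    print(max_context(vram_gb=12, model_gb=6.5))
```

With these assumed numbers, the 8GB card tops out at 8K context while the 12GB card reaches 32K.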
Quantization levels explained
| Level | Bits per param | Quality | VRAM savings |
|---|---|---|---|
| FP16 | 16 | Original | Baseline |
| Q8_0 | 8 | ~99% of original | 50% less |
| Q6_K | 6 | ~98% of original | 62% less |
| Q4_K_M | 4 | ~95% of original | 75% less |
| Q3_K_M | 3 | ~90% of original | 81% less |
| Q2_K | 2 | Noticeable degradation | 87% less |
Q4_K_M is the sweet spot. It preserves ~95% of model quality while using 75% less VRAM than full precision. Most people can't tell the difference between Q4 and full precision in practice.
Drop to Q3 or Q2 only if the model just misses fitting at Q4. Below Q4 the quality drop becomes noticeable.
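The savings column follows directly from the bit widths, and the same arithmetic can pick the highest-precision level that fits a given budget. The sketch below uses the nominal bits per parameter from the table. Real quantized files tend to be somewhat larger than the nominal figure, which is one reason the rule of thumb earlier uses 0.5-0.7 GB per billion rather than exactly 0.5. All function names are invented for this example.

```python
# Nominal bits per parameter, ordered from highest to lowest precision.
QUANT_BITS = {
    "FP16": 16,
    "Q8_0": 8,
    "Q6_K": 6,
    "Q4_K_M": 4,
    "Q3_K_M": 3,
    "Q2_K": 2,
}


def weights_gb(params_billion, bits):
    """Nominal weight size in GB: one billion params at 8 bits is 1 GB."""
    return params_billion * bits / 8


def savings_vs_fp16(bits):
    """Fraction of VRAM saved relative to 16-bit weights."""
    return 1 - bits / 16


def best_quant_that_fits(params_billion, budget_gb):
    """Return the highest-precision level whose weights fit, or None."""
    for name, bits in QUANT_BITS.items():  # dicts preserve insertion order
        if weights_gb(params_billion, bits) <= budget_gb:
            return name
    return None


if __name__ == "__main__":
    for name, bits in QUANT_BITS.items():
        print(f"{name}: {savings_vs_fp16(bits):.1%} less than FP16")
    print(best_quant_that_fits(params_billion=9, budget_gb=6))
```

The budget passed in should be the VRAM left over after engine overhead and context.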
Quick recommendations
| "I have…" | "Run this…" |
|---|---|
| 8GB (laptop/Mac) | Qwen3.5-9B Q4: best quality-per-GB |
| 12GB (RTX 3060) | DeepSeek Coder V2 Lite or Qwen3.5-9B with larger context |
| 16GB (RTX 4070 Ti Super) | Codestral or Mistral Small |
| 24GB (RTX 4090) | Qwen 2.5 Coder 32B: best open-source coding model |
| 32GB (RTX 5090/Mac) | Qwen3.5-27B with generous context |
| 48GB (Mac Pro) | Llama 3.3 70B |
| 192GB (Mac Ultra) | Qwen3.5-122B-A10B with very long context. 400B-class models such as Qwen3.5-397B (~214GB) and DeepSeek V3 need more. |
For models that need more VRAM than your local hardware provides, cloud GPU providers offer A100 and H100 instances on demand, which can be cheaper than upgrading your own setup if you only need the capacity occasionally.