Apr 9, 2026 · 3 min read

How Much VRAM Do You Need for AI? A Simple Guide (2026)

The #1 question people ask about running AI locally: how much VRAM do I need? Here’s the simple answer.

The formula

At Q4 quantization (what most people use):

VRAM needed (GB) ≈ model size (billions of parameters) × 0.5 to 0.7 GB per billion

Plus ~1-2GB overhead for the inference engine and context.
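As a rough illustration, the rule of thumb above can be written as a small helper. This is a minimal sketch, not a measurement tool: the function name `estimate_q4_vram_gb` and its `overhead_gb` parameter are made up for this example, and the constants are simply the 0.5 to 0.7 GB per billion and 1 to 2 GB overhead figures quoted above.

```python
def estimate_q4_vram_gb(params_billions, overhead_gb=(1.0, 2.0)):
    """Return a (low, high) VRAM estimate in GB for a Q4-quantized model.

    Uses the article's rule of thumb: 0.5-0.7 GB per billion parameters,
    plus roughly 1-2 GB for the inference engine and a short context.
    """
    gb_per_billion = (0.5, 0.7)  # range quoted in the formula above
    low = params_billions * gb_per_billion[0] + overhead_gb[0]
    high = params_billions * gb_per_billion[1] + overhead_gb[1]
    return low, high


if __name__ == "__main__":
    for size in (9, 32, 70):
        low, high = estimate_q4_vram_gb(size)
        print(f"{size}B: {low:.1f}-{high:.1f} GB")
```

Treat the result as a starting range. Actual usage depends on the exact quantization variant, the inference engine, and the context length you configure.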

VRAM chart

Model size	VRAM needed (Q4)	Example models	Example GPUs
0.5-1B	2GB	Qwen3.5-0.8B	Any device
3-4B	4-5GB	Qwen3.5-4B, Phi-3 Mini	8GB laptop
7-9B	6-8GB	Qwen3.5-9B, Llama 3.1 8B, DeepSeek R1 7B	8GB GPU, 16GB Mac
14-16B	10-12GB	DeepSeek Coder V2 Lite	RTX 3060 12GB
22-24B	14-16GB	Codestral, Mistral Small	RTX 4070 Ti Super 16GB
27-32B	18-24GB	Qwen3.5-27B, Qwen 2.5 Coder 32B	RTX 4090 24GB
70B	35-45GB	Llama 3.3 70B	48GB Mac Pro, A6000
120-130B	60-80GB	Qwen3.5-122B-A10B	64GB+ Mac, multi-GPU
400B+	200-470GB	Qwen3.5-397B, DeepSeek V3	Multi-A100 (more than a 192GB Mac Ultra holds at Q4)

MoE models need more VRAM than their active parameters suggest

Mixture-of-Experts models have a large total parameter count but only activate a fraction per token. However, you still need to load the full model into memory.

Model	Total params	Active params	VRAM needed (Q4)
MiMo-V2-Flash	309B	15B	~155-216GB
Qwen3.5-35B-A3B	35B	3B	~18-24GB
Qwen3.5-397B	397B	17B	~214GB
DeepSeek V3	671B	37B	~335-470GB
Llama 4 Maverick	400B	17B	~200-280GB

The VRAM requirement is based on total parameters, not active parameters. You need to fit the whole model in memory even though only a fraction runs per token.
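To make the total-versus-active distinction concrete, here is a minimal sketch that applies the same 0.5 to 0.7 GB per billion rule to both counts. The helper names `q4_range_gb` and `moe_footprint` are hypothetical, and the example figures (397B total, 17B active) are taken from the Qwen3.5-397B row of the table.

```python
GB_PER_BILLION_Q4 = (0.5, 0.7)  # rule-of-thumb range for Q4 weights


def q4_range_gb(params_billions):
    """Return (low, high) GB for Q4 weights of the given size."""
    low, high = GB_PER_BILLION_Q4
    return params_billions * low, params_billions * high


def moe_footprint(total_b, active_b):
    """Compare what must be resident in memory with what runs per token."""
    return {
        "must_fit": q4_range_gb(total_b),      # all experts are loaded
        "active_only": q4_range_gb(active_b),  # NOT the memory requirement
    }


if __name__ == "__main__":
    fp = moe_footprint(total_b=397, active_b=17)
    print("Weights that must fit in memory: %.1f-%.1f GB" % fp["must_fit"])
    print("Weights touched per token:       %.1f-%.1f GB" % fp["active_only"])
```

The active-parameter figure mostly tells you about speed per token. The memory you have to provision follows the total.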

Context window affects VRAM too

Longer context = more VRAM. The KV cache grows with context length.

Approximate additional VRAM for context (on a 9B model):

4K context: +0.5GB
8K context: +1GB
32K context: +4GB
128K context: +16GB

This is why a model that “fits” in 8GB VRAM at 4K context might not fit at 32K context. Start with smaller context and increase until you hit your VRAM limit.
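The linear growth in the list above follows from how a KV cache is sized: every token stores one key and one value vector per layer. Here is a minimal sketch of that arithmetic. The architecture numbers (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) are assumptions chosen for illustration, not the specification of any particular 9B model, and the function name `kv_cache_gib` is made up for this example.

```python
def kv_cache_gib(context_tokens, n_layers=32, n_kv_heads=8,
                 head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in GiB for a given context length.

    All architecture defaults are illustrative assumptions; check your
    model's config for the real values.
    """
    # Factor of 2: one key vector and one value vector per layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1024**3


if __name__ == "__main__":
    for ctx in (4096, 8192, 32768, 131072):
        print(f"{ctx // 1024}K context: +{kv_cache_gib(ctx):g} GiB")
```

With these assumed values the function gives 0.5, 1, 4, and 16 GiB for the four context lengths, which lines up with the list above. Models with more layers or more KV heads, or engines that quantize the cache, will land on different numbers.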

Quantization levels explained

Level	Bits per param	Quality	VRAM savings
FP16	16	Original	Baseline
Q8_0	8	~99% of original	50% less
Q6_K	6	~98% of original	62% less
Q4_K_M	4	~95% of original	75% less
Q3_K_M	3	~90% of original	81% less
Q2_K	2	Noticeable degradation	87% less

Q4_K_M is the sweet spot. It preserves ~95% of model quality while using 75% less VRAM than FP16. Most people can’t tell the difference between Q4 and FP16 in practice.

Go to Q3 or Q2 only if the model just barely fails to fit at Q4. Below Q4 the quality drop becomes noticeable.
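The savings column in the table is just the ratio of bit widths. The sketch below reproduces it from the nominal bits per parameter, with hypothetical helper names `weights_gb` and `savings_vs_fp16`. It assumes the nominal widths listed in the table; real quantized files tend to come out somewhat larger, which is consistent with the 0.5 to 0.7 GB per billion range used earlier rather than a flat 0.5.

```python
def weights_gb(params_billions, bits_per_param):
    """Weight size in GB: one billion params at 8 bits is 1 GB."""
    return params_billions * bits_per_param / 8


def savings_vs_fp16(bits_per_param):
    """Fraction of memory saved relative to 16-bit weights."""
    return 1 - bits_per_param / 16


if __name__ == "__main__":
    # Nominal bit widths from the table above.
    levels = {"FP16": 16, "Q8_0": 8, "Q6_K": 6,
              "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2}
    for name, bits in levels.items():
        size = weights_gb(9, bits)  # a 9B model as the running example
        saved = savings_vs_fp16(bits)
        print(f"{name}: {size:.2f} GB weights, {saved:.1%} less than FP16")
```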

Quick recommendations

"I have…"	"Run this…"
8GB (laptop/Mac)	Qwen3.5-9B Q4 — best quality-per-GB
12GB (RTX 3060)	DeepSeek Coder V2 Lite or Qwen3.5-9B with larger context
16GB (RTX 4070 Ti Super)	Codestral or Mistral Small
24GB (RTX 4090)	Qwen 2.5 Coder 32B — best open-source coding model
32GB (RTX 5090/Mac)	Qwen3.5-27B with generous context
48GB (Mac Pro)	Llama 3.3 70B
192GB (Mac Ultra)	Qwen3.5-122B-A10B with long context. 400B-class models such as Qwen3.5-397B (~214GB at Q4) need a lower quant or more memory.

For models that need more VRAM than your local hardware provides, cloud GPU providers offer A100 and H100 instances on demand — often cheaper than upgrading your own setup.
