
How Much VRAM Do You Need for AI? A Simple Guide (2026)


The #1 question people ask about running AI locally: how much VRAM do I need? Here’s the simple answer.

The formula

At Q4 quantization (what most people use):

VRAM needed ≈ model parameters × 0.5-0.7 GB per billion

Plus ~1-2GB overhead for the inference engine and context.
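The rule of thumb above can be sketched as a quick calculator. The 0.5-0.7 GB-per-billion factors and the 1-2GB overhead come straight from the formula; the helper name is just for illustration:

```python
def estimate_vram_q4(params_billion: float) -> tuple[float, float]:
    """Rough Q4 VRAM range in GB: params x 0.5-0.7 GB/B, plus 1-2GB overhead."""
    low = params_billion * 0.5 + 1.0   # optimistic: tight quant, small context
    high = params_billion * 0.7 + 2.0  # pessimistic: heavier quant + engine overhead
    return low, high

low, high = estimate_vram_q4(9)  # a 9B model: roughly 5.5-8.3 GB
print(f"{low:.1f}-{high:.1f} GB")
```

That lines up with the 6-8GB figure for 7-9B models in the chart below once you account for context.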

VRAM chart

| Model size | VRAM needed (Q4) | Example models | Example GPUs |
|---|---|---|---|
| 0.5-1B | 2GB | Qwen3.5-0.8B | Any device |
| 3-4B | 4-5GB | Qwen3.5-4B, Phi-3 Mini | 8GB laptop |
| 7-9B | 6-8GB | Qwen3.5-9B, Llama 3.2 7B, DeepSeek R1 7B | 8GB GPU, 16GB Mac |
| 14-15B | 10-12GB | DeepSeek Coder V2 Lite, MiMo-V2-Flash | RTX 3060 12GB |
| 22-24B | 14-16GB | Codestral, Mistral Small | RTX 4070 Ti 16GB |
| 27-32B | 18-24GB | Qwen3.5-27B, Qwen 2.5 Coder 32B | RTX 4090 24GB |
| 70B | 35-45GB | Llama 3.3 70B | 48GB Mac Pro, A6000 |
| 120-130B | 60-80GB | Qwen3.5-122B-A10B | 64GB+ Mac, multi-GPU |
| 400B+ | 150-214GB | Qwen3.5-397B, DeepSeek V3 | 192GB Mac Ultra, multi-A100 |

MoE models use less VRAM than you’d think

Mixture-of-Experts models have a large total parameter count but only activate a fraction per token. However, you still need to load the full model into memory.

| Model | Total params | Active params | VRAM needed (Q4) |
|---|---|---|---|
| MiMo-V2-Flash | 309B | 15B | ~12-16GB |
| Qwen3.5-35B-A3B | 35B | 3B | ~8GB |
| Qwen3.5-397B | 397B | 17B | ~214GB |
| DeepSeek V3 | 671B | 37B | ~80-100GB |
| Llama 4 Maverick | 400B | 17B | ~60-80GB |

The VRAM requirement is based on total parameters, not active parameters. You need to fit the whole model in memory even though only a fraction runs per token.

Context window affects VRAM too

Longer context = more VRAM. The KV cache grows with context length.

Approximate additional VRAM for context (on a 9B model):

  • 4K context: +0.5GB
  • 8K context: +1GB
  • 32K context: +4GB
  • 128K context: +16GB

This is why a model that “fits” in 8GB VRAM at 4K context might not fit at 32K context. Start with smaller context and increase until you hit your VRAM limit.
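The KV-cache growth can be estimated directly. A minimal sketch, assuming a Llama-3-8B-style attention configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache) — those config values are assumptions for illustration, not universal:

```python
def kv_cache_bytes(context_len: int,
                   n_layers: int = 32,     # assumed Llama-3-8B-style config
                   n_kv_heads: int = 8,    # grouped-query attention
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:  # FP16 cache
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

for ctx in (4096, 8192, 32768, 131072):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

With this config the numbers land almost exactly on the bullet list above: 0.5 GiB at 4K, 1 GiB at 8K, 4 GiB at 32K, 16 GiB at 128K. Models with more layers or without grouped-query attention will need proportionally more.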

Quantization levels explained

| Level | Bits per param | Quality | VRAM savings |
|---|---|---|---|
| FP16 | 16 | Original | Baseline |
| Q8_0 | 8 | ~99% of original | 50% less |
| Q6_K | 6 | ~98% of original | 62% less |
| Q4_K_M | 4 | ~95% of original | 75% less |
| Q3_K_M | 3 | ~90% of original | 81% less |
| Q2_K | 2 | Noticeable degradation | 87% less |

Q4_K_M is the sweet spot. It preserves ~95% of model quality while using 75% less VRAM than full precision. Most people can’t tell the difference between Q4 and full precision in practice.

Drop to Q3 or Q2 only if your model almost fits at Q4 but not quite. Below Q4 the quality loss becomes noticeable.
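The savings column follows from simple arithmetic: quantized weight size ≈ parameters × bits ÷ 8. A rough sketch (it ignores the small per-block overhead that K-quants actually carry, so real GGUF files run slightly larger):

```python
def weight_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate quantized weight size in GB: params x bits / 8.
    Ignores quantization block overhead (K-quant files are slightly larger)."""
    return params_billion * bits_per_param / 8

# Derive the table for a 9B model:
for level, bits in [("FP16", 16), ("Q8_0", 8), ("Q6_K", 6), ("Q4_K_M", 4), ("Q2_K", 2)]:
    print(f"{level:>7}: {weight_size_gb(9, bits):.1f} GB")
```

At Q4 a 9B model's weights shrink from 18 GB to about 4.5 GB; the inference engine, context, and quant overhead account for the rest of the 6-8GB figure in the chart.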

Quick recommendations

| "I have…" | "Run this…" |
|---|---|
| 8GB (laptop/Mac) | Qwen3.5-9B Q4 — best quality-per-GB |
| 12GB (RTX 3060) | DeepSeek Coder V2 Lite or Qwen3.5-9B with larger context |
| 16GB (RTX 4070 Ti) | Codestral or MiMo-V2-Flash |
| 24GB (RTX 4090) | Qwen 2.5 Coder 32B — best open-source coding model |
| 32GB (RTX 5090/Mac) | Qwen3.5-27B with generous context |
| 48GB (Mac Pro) | Llama 4 Maverick or Qwen3.5-122B-A10B |
| 192GB (Mac Ultra) | Anything. DeepSeek V3, Qwen 397B. |