How Much VRAM Do You Need for AI Models? (2026 Calculator)
Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
The #1 question before running a local model: will it fit? Here's the formula and a quick reference for every popular model.
The formula
VRAM needed ≈ (Parameters × Bytes per parameter) + Context overhead
| Quantization | Bytes per parameter | Example: 24B model |
|---|---|---|
| FP16 (full) | 2 bytes | 48 GB |
| Q8_0 | 1 byte | 24 GB |
| Q5_K_M | 0.65 bytes | 15.6 GB |
| Q4_K_M | 0.5 bytes | 12 GB |
| Q3_K_M | 0.4 bytes | 9.6 GB |
Context overhead: Add ~1-4 GB depending on context length. A 64K context window adds ~2 GB. A 128K context adds ~4 GB.
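The simple formula can be sketched in a few lines of Python. This is a rough estimator, not a measurement: the bytes-per-parameter values are the rounded figures from the table above, and `estimate_vram_gb` and its `context_overhead_gb` argument are names made up for this example.

```python
# Approximate bytes per parameter, copied from the table above (rounded values).
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.65,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.4,
}


def estimate_vram_gb(params_billions: float, quant: str,
                     context_overhead_gb: float = 0.0) -> float:
    """Rough VRAM estimate in GB: model weights plus a flat context allowance."""
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return weights_gb + context_overhead_gb


if __name__ == "__main__":
    # Reproduce the "Example: 24B model" column.
    for quant in BYTES_PER_PARAM:
        print(f"24B at {quant}: {estimate_vram_gb(24, quant):.1f} GB")
```

Pass `context_overhead_gb=2` for a 64K window or `4` for 128K to include the context allowance described above.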
Detailed VRAM calculation
For a more precise estimate, use this expanded formula:
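Total VRAM = Model weights + KV cache + Activation memory + Overhead
Model weights = Parameters × Bytes per parameter
KV cache = 2 × num_layers × hidden_size × context_length × batch_size × bytes_per_param
Activation memory ≈ 5-10% of model weights
Overhead ≈ 500 MB - 1 GB (CUDA context, framework buffers)
In the KV cache term, bytes_per_param is the size of each cached value (2 bytes for an FP16 cache), not the weight quantization. Models that use grouped-query attention keep fewer key/value heads, so their cache comes out smaller than this formula predicts.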
For example, Qwen3.5 27B at Q4_K_M with 32K context:
- Model weights: 27B × 0.5 = 13.5 GB
- KV cache (32K context): ~2.5 GB
- Activation + overhead: ~1.5 GB
- Total: ~17.5 GB
This is why a 24 GB GPU can run it comfortably but a 16 GB GPU struggles.
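As a sketch, the expanded formula translates to Python as follows. The architecture numbers in the first call (32 layers, hidden size 4096, 4K context) are purely illustrative and are not the real configuration of any model named in this article. The second call simply adds up the three figures quoted in the example above.

```python
def kv_cache_gb(num_layers: int, hidden_size: int, context_length: int,
                batch_size: int = 1, bytes_per_param: int = 2) -> float:
    """KV cache size: keys and values (the factor 2) for every layer and token.

    Returns the size in GB, counting 1 GB as 1024**3 bytes.
    """
    total_bytes = (2 * num_layers * hidden_size * context_length
                   * batch_size * bytes_per_param)
    return total_bytes / 1024**3


def total_vram_gb(weights_gb: float, kv_gb: float,
                  activation_and_overhead_gb: float) -> float:
    """Sum the components of the expanded formula."""
    return weights_gb + kv_gb + activation_and_overhead_gb


if __name__ == "__main__":
    # Hypothetical architecture, chosen only to make the arithmetic easy to follow.
    print(kv_cache_gb(num_layers=32, hidden_size=4096, context_length=4096))  # 2.0

    # The worked example: 27B at 0.5 bytes/param, ~2.5 GB cache, ~1.5 GB other.
    print(total_vram_gb(27 * 0.5, 2.5, 1.5))  # 17.5
```

Because the cache grows linearly with `context_length`, doubling the context doubles this term while the weights stay fixed.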
Quick reference
| Model | FP16 | Q5_K_M | Q4_K_M | Min hardware |
|---|---|---|---|---|
| Qwen3 4B | 8 GB | 3 GB | 2.5 GB | Any 8GB machine |
| Qwen3 8B | 16 GB | 6 GB | 5 GB | 8GB Mac/GPU |
| DeepSeek R1 14B | 28 GB | 10 GB | 8 GB | 16GB Mac, RTX 3080 |
| Devstral Small 24B | 48 GB | 16 GB | 12 GB | 16GB Mac, RTX 4090 |
| Qwen3.5 27B | 54 GB | 18 GB | 14 GB | 24GB Mac, RTX 4090 |
| Qwen3-Coder 32B | 64 GB | 21 GB | 16 GB | 24GB+ Mac |
| Llama 4 Scout 70B | 140 GB | 46 GB | 35 GB | 48GB+ Mac, A100 |
For a broader hardware buying guide, see How Much VRAM for AI and Best GPU for AI Locally in 2026.
Quantization impact on quality
Not all quantizations are equal. Here's how they affect output quality for coding tasks:
| Quantization | Size reduction | Quality impact | Best for |
|---|---|---|---|
| FP16 | None (baseline) | None | Research, max quality |
| Q8_0 | 50% | Negligible | When you have the VRAM |
| Q5_K_M | 67% | Minimal | Sweet spot for most users |
| Q4_K_M | 75% | Slight degradation | Fitting larger models |
| Q3_K_M | 80% | Noticeable on complex tasks | Last resort |
| Q2_K | 87% | Significant | Not recommended |
The difference between Q5_K_M and Q4_K_M is usually imperceptible in conversation. For coding, Q5_K_M preserves more accuracy in syntax and logic. Going below Q4 causes measurable drops in code correctness.
For a deep comparison of quantization formats, see our GGUF vs GPTQ vs AWQ guide.
What happens when you don't have enough VRAM
| Situation | What happens | Performance |
|---|---|---|
| Model fits in VRAM | Runs entirely on GPU | ✅ Full speed |
| Model slightly exceeds VRAM | Offloads layers to CPU (Apple Silicon: unified memory) | ⚠️ 30-50% slower |
| Model far exceeds VRAM | Heavy CPU offloading or crashes | ❌ 5-10x slower or fails |
On Apple Silicon Macs, "VRAM" is unified memory shared with the system. A 32GB Mac can run a 24GB model because the OS uses ~4-6GB, leaving ~26GB for the model.
On NVIDIA GPUs, VRAM is separate. A 24GB RTX 4090 can only fit models up to ~22GB (some VRAM used by the OS/driver).
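The three situations in the table can be expressed as a small check. This is a sketch under assumptions: the 2 GB `reserved_gb` default comes from the RTX 4090 figure above (24 GB total, ~22 GB usable), while the 25% `slight_margin` that separates "slightly" from "far" is an arbitrary threshold picked for this example, not a number from any tool.

```python
def classify_fit(model_gb: float, vram_gb: float,
                 reserved_gb: float = 2.0, slight_margin: float = 0.25) -> str:
    """Classify a model against a GPU using the three rows of the table.

    reserved_gb   -- VRAM assumed to be held by the OS/driver
    slight_margin -- assumed cutoff: up to 25% over budget counts as "slightly"
    """
    usable_gb = vram_gb - reserved_gb
    if model_gb <= usable_gb:
        return "fits"
    if model_gb <= usable_gb * (1 + slight_margin):
        return "slightly exceeds"
    return "far exceeds"


if __name__ == "__main__":
    # The ~17.5 GB example from earlier, on a 24 GB and a 16 GB card.
    print(classify_fit(17.5, 24))
    print(classify_fit(17.5, 16))
    # A 24B model at FP16 (48 GB) on a 24 GB card.
    print(classify_fit(48, 24))
```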
Apple Silicon unified memory
On Macs with Apple Silicon, GPU and CPU share the same memory pool. This changes the math:
- Total available for models = Total RAM - OS usage (~4-6 GB)
- No separate VRAM: the model uses whatever memory is free
- Slower than a discrete GPU: memory bandwidth is lower than the GDDR6X or HBM on high-end NVIDIA cards
- But no offloading penalty: unlike NVIDIA, where CPU offload crosses PCIe
Practical limits on Apple Silicon:
| Mac | Total RAM | Usable for models | Largest model (Q4_K_M) |
|---|---|---|---|
| M4 16GB | 16 GB | ~11 GB | 14B |
| M4 24GB | 24 GB | ~18 GB | 27B |
| M4 Pro 48GB | 48 GB | ~42 GB | 70B |
| M4 Max 96GB | 96 GB | ~88 GB | 120B+ |
| M4 Ultra 192GB | 192 GB | ~180 GB | 200B+ |
Use Ollama or llama.cpp with Metal acceleration for best performance on Mac.
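The unified-memory arithmetic from this section can be sketched as below. The function names are invented for this example, and the defaults are assumptions: 6 GB for the OS (the upper end of the ~4-6 GB range), 0.5 bytes per parameter for Q4_K_M, and a 2 GB context allowance.

```python
def usable_memory_gb(total_ram_gb: float, os_usage_gb: float = 6.0) -> float:
    """Memory left for models after the OS takes its share."""
    return total_ram_gb - os_usage_gb


def fits_on_mac(params_billions: float, total_ram_gb: float,
                bytes_per_param: float = 0.5,
                context_overhead_gb: float = 2.0,
                os_usage_gb: float = 6.0) -> bool:
    """True if weights plus a context allowance fit in the usable memory."""
    needed_gb = params_billions * bytes_per_param + context_overhead_gb
    return needed_gb <= usable_memory_gb(total_ram_gb, os_usage_gb)


if __name__ == "__main__":
    print(usable_memory_gb(24))   # 18.0, matching the M4 24GB row
    print(fits_on_mac(27, 24))    # 27B at Q4_K_M on a 24 GB Mac
    print(fits_on_mac(70, 24))    # 70B at Q4_K_M on a 24 GB Mac
    print(fits_on_mac(70, 48))    # 70B at Q4_K_M on a 48 GB Mac
```

The "usable" column in the table reserves slightly different amounts on larger machines, so treat `os_usage_gb` as a knob rather than a constant.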
Tips for fitting larger models
- Use quantization: Drop from FP16 to Q4_K_M to cut VRAM by 75%
- Reduce context length: Use `--ctx-size 4096` instead of 32K if you don't need long context
- Close other GPU apps: Browsers, video players, and displays use VRAM
- Use Flash Attention: Reduces memory usage for long contexts
- Try partial offloading: Put most layers on GPU, a few on CPU (`--n-gpu-layers 28` in llama.cpp)
- Consider a smaller model: A 14B at Q5 often beats a 27B at Q3 in quality
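For the context-length and partial-offloading tips, here is a rough way to pick a layer count and assemble the llama.cpp arguments. It assumes every layer is the same size, which is only approximately true, and the model path is a placeholder. The script only builds the argument list and does not launch anything.

```python
import math


def gpu_layers_that_fit(model_gb: float, num_layers: int,
                        vram_budget_gb: float) -> int:
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = model_gb / num_layers
    return min(num_layers, math.floor(vram_budget_gb / per_layer_gb))


def llama_cpp_args(model_path: str, ctx_size: int, n_gpu_layers: int) -> list[str]:
    """Build a llama-cli argument list using the two flags from the tips above."""
    return [
        "llama-cli",
        "-m", model_path,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
    ]


if __name__ == "__main__":
    # Hypothetical 16 GB model with 32 layers and 14 GB of VRAM to spend on weights.
    layers = gpu_layers_that_fit(model_gb=16, num_layers=32, vram_budget_gb=14)
    print(layers)  # 28
    print(" ".join(llama_cpp_args("model.gguf", 4096, layers)))
```

Leave some of the budget unspent in practice, since the KV cache and framework buffers also live in VRAM.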
Choosing the right quantization
Have plenty of VRAM? → Q8_0 (best quality)
Model barely fits? → Q5_K_M (sweet spot)
Model doesn't fit? → Q4_K_M (acceptable quality loss)
Still doesn't fit? → Smaller model (don't go below Q4)
Going below Q4 quantization causes noticeable quality degradation, especially for coding tasks. It's better to use a smaller model at Q5 than a larger model at Q3.
Hardware recommendations by budget
| Budget | Hardware | Models you can run |
|---|---|---|
| $0 | Existing laptop (8GB+) | Qwen3 4B-8B |
| $800 | Mac Mini M4 24GB | Up to 14B models |
| $2,000 | Mac Mini M4 Pro 48GB | Up to 32B models |
| $3,500 | Mac Studio M4 Max 96GB | Up to 70B models |
| $2,500 | RTX 4090 workstation | Up to 27B at Q4_K_M (24 GB VRAM, ~22 GB usable) |
| $1,200/mo | RunPod A100 80GB | Anything up to 70B |
See our GPU vs CPU guide for when you need a GPU and best AI models for Mac for Apple Silicon recommendations.