VRAM is the bottleneck for running AI models locally. The model has to fit in your GPU's memory, and if it doesn't, performance drops from usable to unusable. Here's which GPU to buy based on your budget and what models you want to run.
The rule of thumb
Roughly 2GB of VRAM per billion parameters at FP16 precision. With Q4 quantization (which most people use), that drops to about 0.5-0.7GB per billion parameters, plus a little headroom for the KV cache and runtime overhead. The sketch after the list below shows the arithmetic.
In practice:
- 8GB VRAM → models up to ~9B parameters
- 12GB VRAM → models up to ~14B parameters
- 16GB VRAM → models up to ~22B parameters
- 24GB VRAM → models up to ~32B parameters
- 48GB VRAM → models up to ~70B parameters
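A minimal sketch of that arithmetic in Python. The 0.55 bytes per parameter is the midpoint of the Q4 range above; the fixed overhead allowance is an assumption, not a measured number:

```python
# Back-of-envelope VRAM estimate from the rule of thumb above.
# bytes_per_param: ~2.0 for FP16, ~0.55 for Q4 (midpoint of 0.5-0.7GB/B).
def vram_needed_gb(params_billions: float,
                   bytes_per_param: float = 0.55,
                   overhead_gb: float = 1.5) -> float:
    """Weights plus a rough allowance for KV cache and runtime overhead."""
    return params_billions * bytes_per_param + overhead_gb

for p in (9, 14, 22, 32, 70):
    print(f"{p:>3}B @ Q4: ~{vram_needed_gb(p):.1f} GB")
# 9B -> ~6.5GB (fits in 8GB), 14B -> ~9.2GB (fits in 12GB),
# 22B -> ~13.6GB, 32B -> ~19.1GB, 70B -> ~40.0GB
```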
Best GPUs by budget
Under $400: RTX 3060 12GB
The budget king. 12GB of VRAM runs DeepSeek Coder V2 Lite (14B), Qwen3.5-9B, and most 7B models comfortably, and cards are available used for $200-300. A quick-start sketch follows the spec list below.
- VRAM: 12GB GDDR6
- Models: Up to ~14B (Q4)
- Speed: ~15-20 tok/s on 9B models
- Best for: Getting started, coding assistants
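Once the card is installed, a minimal smoke test with the ollama Python client confirms everything works (this applies to any GPU in this guide). The model tag below is an assumption; substitute whatever you have pulled:

```python
# Minimal local-inference smoke test via the ollama Python client.
# Requires `pip install ollama` and a running Ollama server with the
# model already pulled. The tag below is an assumption; any ~7-9B Q4
# model fits comfortably in 12GB.
import ollama

response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user",
               "content": "Write a Python function that reverses a string."}],
)
print(response["message"]["content"])
```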
$500-800: RTX 4070 Ti Super 16GB
The sweet spot for most developers. 16GB runs Codestral (22B), MiMo-V2-Flash (15B active), and medium-sized models.
- VRAM: 16GB GDDR6X
- Models: Up to ~22B (Q4)
- Speed: ~25-35 tok/s on 14B models
- Best for: Daily coding assistant, IDE autocomplete
$1,000-1,600: RTX 4090 24GB
The best consumer GPU for AI. 24GB runs Qwen 2.5 Coder 32B, Qwen3.5-27B, and any model up to ~32B at full speed.
- VRAM: 24GB GDDR6X
- Models: Up to ~32B (Q4)
- Speed: ~45 tok/s on 32B models
- Best for: Serious local AI, best open-source coding models
$2,500-3,000: RTX 5090 32GB
The new flagship. 32GB of GDDR7 with significantly higher memory bandwidth than the 4090, which for memory-bound inference translates almost directly into tokens per second (see the sketch after this list). It runs larger models and delivers faster inference.
- VRAM: 32GB GDDR7
- Models: Up to ~45B (Q4)
- Speed: ~60+ tok/s on 32B models, ~185 tok/s on 8B
- Best for: Future-proofing, larger models
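Why bandwidth matters that much: at batch size 1, decoding is memory-bound, so every generated token streams the full active weight set out of VRAM. A rough ceiling, using published bandwidth figures and the same ~0.55 bytes-per-parameter Q4 assumption as above (real throughput lands below this):

```python
# Decode-speed ceiling for memory-bound, single-stream inference:
# each token reads the full active weight set from VRAM, so
# tok/s <= memory_bandwidth / model_bytes.
def max_tok_per_sec(params_b: float, bandwidth_gb_s: float,
                    bytes_per_param: float = 0.55) -> float:
    """Theoretical upper bound; overhead pushes real numbers lower."""
    return bandwidth_gb_s / (params_b * bytes_per_param)

# Published bandwidths: RTX 4090 ~1008 GB/s, RTX 5090 ~1792 GB/s.
for name, bw in [("RTX 4090", 1008), ("RTX 5090", 1792)]:
    print(f"{name}: <= {max_tok_per_sec(32, bw):.0f} tok/s on a 32B Q4 model")
# RTX 4090: <= 57 tok/s, RTX 5090: <= 102 tok/s -- consistent with the
# measured ~45 and ~60+ figures quoted above sitting below the ceiling.
```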
$1,149-6,000: Apple Silicon Mac
Apple's unified memory architecture is uniquely suited to AI. The GPU can address nearly all of system RAM, so a 192GB Mac Studio can load models that would need multiple discrete GPUs.
| Mac | Memory | Models it runs | Price |
|---|---|---|---|
| Mac Mini M4 32GB | 32GB | Up to ~27B | $1,149 |
| Mac Mini M4 Pro 48GB | 48GB | Up to ~45B | $1,799 |
| Mac Studio M4 Ultra 192GB | 192GB | Up to ~130B (full quality) | ~$6,000 |
The Mac Mini M4 32GB is the best value for local AI. Silent, efficient, runs 7-14B models at 28-35 tok/s.
The Mac Studio M4 Ultra 192GB is the only consumer device that can run full DeepSeek V3 (671B total, 37B active) at usable speeds: only the 37B active parameters are read per token, though fitting all 671B of weights in 192GB takes an aggressive sub-4-bit quant (see the sanity check below).
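A quick sanity check using the rule of thumb from the top of the article. For a mixture-of-experts model, memory scales with total parameters while decode speed scales with active parameters, and the arithmetic shows why a ~2-bit quant is what makes 671B fit in 192GB:

```python
# MoE sizing: memory is driven by TOTAL parameters, decode speed
# by ACTIVE parameters per token. DeepSeek V3: 671B total, 37B active.
TOTAL_B, ACTIVE_B, UNIFIED_GB = 671, 37, 192

for name, bits_per_param in [("Q4 (~4.4 bits)", 4.4), ("~Q2 (2 bits)", 2.0)]:
    weights_gb = TOTAL_B * bits_per_param / 8
    verdict = "fits" if weights_gb <= UNIFIED_GB else "does not fit"
    print(f"{name}: ~{weights_gb:.0f} GB of weights -> {verdict} in {UNIFIED_GB} GB")
# Q4: ~369 GB -> does not fit; ~Q2: ~168 GB -> fits.
```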
Used enterprise: A100 40GB/80GB
If you can find used A100s ($2,000-4,000), they're excellent for AI. 80GB of HBM2e with massive bandwidth. Two 80GB A100s can run almost any model at full quality.
What NOT to buy
- AMD GPUs: NVIDIA's CUDA ecosystem is still better supported for AI. ROCm works but has more compatibility issues.
- Intel Arc: Improving but not ready for serious AI workloads.
- GPUs with less than 8GB VRAM: Too small for useful models.
- Multiple cheap GPUs: Splitting a model across cards adds inter-GPU transfer latency. One big GPU beats two small ones.
Which models run on which GPUs
| GPU (VRAM) | Best models |
|---|---|
| 8GB | Qwen3.5-4B, Qwen3.5-0.8B, DeepSeek R1 7B |
| 12GB | Qwen3.5-9B, DeepSeek Coder V2 Lite, MiMo-V2-Flash (tight) |
| 16GB | Codestral, MiMo-V2-Flash, Qwen3.5-35B-A3B |
| 24GB | Qwen 2.5 Coder 32B, Qwen3.5-27B, Llama 4 Scout |
| 32GB | All of the above, plus higher-precision quants and longer context |
| 48GB | Qwen3.5-122B-A10B, Llama 4 Maverick |
| 192GB (Mac Ultra) | DeepSeek V3, Qwen3.5-397B (Q4) |
The recommendation
- Tight budget: Used RTX 3060 12GB ($200-300). Runs Qwen3.5-9B, which beats models 13x its size.
- Most developers: RTX 4090 24GB ($1,000-1,600). Runs the best open-source coding models at full speed.
- Mac users: Mac Mini M4 32GB ($1,149). Silent, efficient, great for daily use.
- Go big: Mac Studio M4 Ultra 192GB (~$6,000). Runs nearly anything.