VRAM is the bottleneck for running AI models locally. The model has to fit in your GPU's memory, and if it doesn't, performance drops from usable to unusable. Here's which GPU to buy based on your budget and what models you want to run.
The rule of thumb
Budget roughly 2GB of VRAM per billion parameters at FP16 precision (2 bytes per parameter). With Q4 quantization (which most people use), that drops to about 0.5-0.7GB per billion parameters.
In practice:
- 8GB VRAM → models up to ~9B parameters
- 12GB VRAM → models up to ~14B parameters
- 16GB VRAM → models up to ~22B parameters
- 24GB VRAM → models up to ~32B parameters
- 48GB VRAM → models up to ~70B parameters
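The rule of thumb can be turned into a quick estimate. The sketch below is illustrative only: the function names (`estimate_vram_gb`, `q4_range_gb`, `fits`) are made up for this example, the per-billion figures are the ones quoted in this section, and it counts weights only. KV cache, context length, and runtime overhead add to the real footprint.

```python
FP16_GB_PER_B = 2.0               # ~2GB per billion parameters at FP16
Q4_GB_PER_B_RANGE = (0.5, 0.7)    # ~0.5-0.7GB per billion parameters at Q4


def estimate_vram_gb(params_billion: float, gb_per_billion: float) -> float:
    """Estimated memory for the weights alone, in GB."""
    return params_billion * gb_per_billion


def q4_range_gb(params_billion: float) -> tuple[float, float]:
    """Low and high Q4 estimates for a model of the given size."""
    low, high = Q4_GB_PER_B_RANGE
    return (estimate_vram_gb(params_billion, low),
            estimate_vram_gb(params_billion, high))


def fits(params_billion: float, vram_gb: float, gb_per_billion: float = 0.7) -> bool:
    """True if the weights fit; defaults to the conservative end of the Q4 range."""
    return estimate_vram_gb(params_billion, gb_per_billion) <= vram_gb


for size in (9, 14, 22, 32, 70):
    low, high = q4_range_gb(size)
    fp16 = estimate_vram_gb(size, FP16_GB_PER_B)
    print(f"{size}B: FP16 ~{fp16:.0f}GB, Q4 ~{low:.1f}-{high:.1f}GB")
```

At the conservative 0.7GB figure a 70B model slightly overshoots 48GB, so the 48GB tier only works toward the lower end of the Q4 range.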
Best GPUs by budget
Under $400: RTX 3060 12GB
The budget king. 12GB of VRAM runs DeepSeek Coder V2 Lite (16B total, 2.4B active), Qwen3.5-9B, and most 7B models comfortably. Used cards sell for $200-300.
- VRAM: 12GB GDDR6
- Models: Up to ~14B (Q4)
- Speed: ~15-20 tok/s on 9B models
- Best for: Getting started, coding assistants
$500-800: RTX 4070 Ti Super 16GB
The sweet spot for most developers. 16GB runs Codestral (22B), MiMo-V2-Flash (15B active), and medium-sized models.
- VRAM: 16GB GDDR6X
- Models: Up to ~22B (Q4)
- Speed: ~25-35 tok/s on 14B models
- Best for: Daily coding assistant, IDE autocomplete
$1,000-1,600: RTX 4090 24GB
The best consumer GPU for AI. 24GB runs Qwen 2.5 Coder 32B, Qwen3.5-27B, and any model up to ~32B at full speed.
- VRAM: 24GB GDDR6X
- Models: Up to ~32B (Q4)
- Speed: ~45 tok/s on 32B models
- Best for: Serious local AI, best open-source coding models
$2,500-3,000: RTX 5090 32GB
The new flagship. 32GB GDDR7 with significantly faster memory bandwidth. Runs larger models and faster inference than the 4090.
- VRAM: 32GB GDDR7
- Models: Up to ~45B (Q4)
- Speed: ~60+ tok/s on 32B models, 185 tok/s on 8B
- Best for: Future-proofing, larger models
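Memory bandwidth matters because single-stream token generation is usually limited by how fast the weights can be read from VRAM: each new token touches roughly all of the active weights once. A rough ceiling is therefore bandwidth divided by the model's size in memory. The sketch below assumes that approximation. The bandwidth figures (about 1,008 GB/s for the RTX 4090 and 1,792 GB/s for the RTX 5090) are spec-sheet numbers that do not come from this article, so verify them before relying on the output. The function name is hypothetical.

```python
def max_decode_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on single-stream tokens per second.

    Assumes every generated token requires reading all weights once,
    so throughput cannot exceed bandwidth / model size.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumed spec-sheet bandwidths in GB/s (not from this article; verify them).
ASSUMED_BANDWIDTH_GB_S = {"RTX 4090": 1008.0, "RTX 5090": 1792.0}

# 32B parameters at ~0.6GB per billion (middle of the Q4 rule-of-thumb range).
MODEL_SIZE_GB = 32 * 0.6

for gpu, bandwidth in ASSUMED_BANDWIDTH_GB_S.items():
    ceiling = max_decode_tok_per_s(bandwidth, MODEL_SIZE_GB)
    print(f"{gpu}: at most ~{ceiling:.0f} tok/s on a 32B Q4 model")
```

Real throughput lands below this ceiling because of compute, KV-cache reads, and software overhead. The speeds quoted in this guide sit under the ceilings the estimate produces.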
$1,149-6,000: Apple Silicon Mac
Apple's unified memory architecture is uniquely suited for AI. The GPU can use all system RAM, so a 192GB Mac Studio can load models that would need multiple discrete GPUs.
| Mac | Memory | Models it runs | Price |
|---|---|---|---|
| Mac Mini M4 32GB | 32GB | Up to ~27B | $1,149 |
| Mac Mini M4 Pro 48GB | 48GB | Up to ~45B | $1,799 |
| Mac Studio M4 Ultra 192GB | 192GB | Up to ~130B (full quality) | ~$6,000 |
The Mac Mini M4 32GB is the best value for local AI. Silent, efficient, runs 7-14B models at 28-35 tokens per second.
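The Mac Studio M4 Ultra 192GB is the only consumer device here positioned to run DeepSeek V3 (671B total, 37B active) at usable speeds. Note that the rule of thumb above puts a Q4 copy at roughly 335-470GB, so it fits in 192GB only with much more aggressive quantization.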
Used enterprise: A100 40GB/80GB
If you can find used A100s ($2,000-4,000), they're excellent for AI. 80GB of HBM2e memory with massive bandwidth. Two A100 80GBs can run almost any model at full quality.
What NOT to buy
- AMD GPUs: NVIDIA's CUDA ecosystem is still better supported for AI. ROCm works but has more compatibility issues.
- Intel Arc: Improving but not ready for serious AI workloads.
- GPUs with less than 8GB VRAM: Too small for useful models.
- Multiple cheap GPUs: Model splitting across GPUs adds latency. One big GPU beats two small ones.
Which models run on which GPUs
| GPU (VRAM) | Best models |
|---|---|
| 8GB | Qwen3.5-4B, Qwen3.5-0.8B, DeepSeek R1 7B |
| 12GB | Qwen3.5-9B, DeepSeek Coder V2 Lite, MiMo-V2-Flash (tight) |
| 16GB | Codestral, MiMo-V2-Flash, Qwen3.5-35B-A3B |
| 24GB | Qwen 2.5 Coder 32B, Qwen3.5-27B, Llama 4 Scout |
| 32GB | All of the above + larger quantizations |
| 48GB | Qwen3.5-122B-A10B, Llama 4 Maverick |
| 192GB (Mac Ultra) | DeepSeek V3, Qwen3.5-397B (Q4) |
The recommendation
- Tight budget: Used RTX 3060 12GB ($200-300). Runs Qwen3.5-9B, which beats models 13x its size.
- Most developers: RTX 4090 24GB ($1,000-1,600). Runs the best open-source coding models at full speed.
- Mac users: Mac Mini M4 32GB ($1,149). Silent, efficient, great for daily use.
- Go big: Mac Studio M4 Ultra 192GB (~$6,000). Runs nearly anything.
Related
- Best Self-Hosted AI Models in 2026
- How Much VRAM Do You Need for AI?
- Best AI Models for Mac in 2026
- Ollama vs llama.cpp vs vLLM: Which Should You Use?
FAQ
What's the best GPU for running AI locally in 2026?
The RTX 4090 (24GB) is the best consumer GPU for local AI: it runs 27B models at full speed and even handles some 70B models at lower quantization. For tighter budgets, the RTX 4070 Ti Super (16GB) runs most practical coding models, and a used RTX 3060 (12GB) is the cheapest useful entry point.
How much VRAM do I need for local AI?
12GB is the minimum for useful local AI (runs 7-14B models). 16GB is the sweet spot (runs 22-27B models). 24GB lets you run the best open-source models comfortably. For frontier-class models (70B+), you need 48GB+ or multiple GPUs.
Is NVIDIA or AMD better for local AI?
NVIDIA is significantly better due to CUDA ecosystem support. Ollama, vLLM, and most AI frameworks are optimized for NVIDIA first. AMD's ROCm has improved but still has compatibility issues with many models and tools.
Should I buy a GPU or use cloud GPUs?
Buy if you'll use it daily for more than 6 months: the RTX 4090 pays for itself vs cloud GPU rental in about 4-6 months of heavy use. Use cloud GPUs for occasional large jobs or if you need hardware you can't afford to buy (A100, H100).
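The break-even point is simple arithmetic you can rerun with your own numbers. The sketch below is a hedged illustration: the function name, the hourly cloud rate, and the hours of use per day are hypothetical inputs rather than figures from this article, and it ignores electricity, resale value, and the rest of the PC.

```python
def breakeven_months(gpu_price_usd: float,
                     cloud_rate_usd_per_hour: float,
                     hours_per_day: float,
                     days_per_month: int = 30) -> float:
    """Months of cloud rental that would cost as much as buying the GPU."""
    monthly_cloud_cost = cloud_rate_usd_per_hour * hours_per_day * days_per_month
    if monthly_cloud_cost <= 0:
        raise ValueError("cloud rate and hours per day must be positive")
    return gpu_price_usd / monthly_cloud_cost


# Hypothetical inputs: replace with a real quote for a comparable cloud GPU.
months = breakeven_months(gpu_price_usd=1600,
                          cloud_rate_usd_per_hour=0.70,
                          hours_per_day=16)
print(f"Break-even after about {months:.1f} months")
```

The result is very sensitive to hours per day: halving your usage doubles the time to break even, which is why occasional users are usually better off renting.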