# GPU vs CPU for AI Inference — When Do You Actually Need a GPU?
Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
GPUs are expensive. Before you spend $1,000+ on hardware or $1,000+/month on cloud GPUs, ask: do you actually need one?
## When CPU is good enough
CPU inference works for:
- Small models (< 8B parameters) — Qwen3 4B runs at 10-15 tok/s on a modern CPU
- Low-frequency usage — a few requests per hour, not per second
- Batch processing — latency doesn’t matter, you just need it done
- Embeddings — embedding models are small and CPU-efficient
- Development/testing — you’re iterating on prompts, not serving users
Typical performance with Ollama on a modern CPU:
| Model | Tokens/second | Usable? |
|---|---|---|
| Qwen3 4B | 10-15 | ✅ Good for chat |
| Qwen3 8B | 5-8 | ⚠️ Slow but works |
| Devstral 24B | 1-3 | ❌ Too slow |
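You can measure throughput on your own machine: Ollama's `/api/generate` endpoint reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds) in its response, and tokens per second follows directly. A minimal sketch (the example numbers are illustrative, not benchmarks):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput from the eval_count / eval_duration fields of an
    Ollama /api/generate response (duration is in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 180 tokens generated in 12 seconds of eval time
print(tokens_per_second(180, 12_000_000_000))  # → 15.0
```

Run a few representative prompts and average the result; short prompts skew the number because load time dominates.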
## When you need a GPU
GPU inference is necessary for:
- Large models (> 14B parameters) — need GPU memory and parallel compute
- Real-time serving — users expect responses in 2-5 seconds
- Concurrent users — multiple requests simultaneously
- Long context — 32K+ token contexts need GPU memory bandwidth
- Production APIs — consistent, fast responses
GPU performance comparison:
| Hardware | 8B model | 24B model | 70B model |
|---|---|---|---|
| CPU (modern) | 10 tok/s | 2 tok/s | ❌ |
| Apple M2 16GB | 25 tok/s | 15 tok/s | ❌ |
| Apple M4 Pro 48GB | 40 tok/s | 30 tok/s | 15 tok/s |
| RTX 4090 24GB | 80 tok/s | 40 tok/s | ❌ (not enough VRAM) |
| A100 80GB | 120 tok/s | 80 tok/s | 40 tok/s |
## Apple Silicon: the middle ground
Apple Silicon Macs use unified memory — the CPU and GPU share the same RAM. This means you can run larger models than a discrete GPU with limited VRAM allows:
| Mac | Unified memory | Max model | Performance |
|---|---|---|---|
| MacBook Air M2 | 16 GB | ~12B | Good for development |
| MacBook Pro M3 Pro | 36 GB | ~27B | Good for daily coding |
| Mac Mini M4 Pro | 48 GB | ~32B | Best value for local AI |
| Mac Studio M2 Ultra | 192 GB | ~120B | Run almost anything |
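The "max model" column follows a rough rule of thumb: parameter count × bytes per parameter at the chosen quantization, plus overhead for the KV cache and runtime. A back-of-envelope sketch (the 1.2× overhead factor is an assumption, not a measured value):

```python
def model_memory_gb(params_b: float, bits_per_param: int = 4,
                    overhead: float = 1.2) -> float:
    """Rough memory footprint in GB for a quantized model.
    params_b: parameter count in billions.
    bits_per_param: 4 for typical Q4 quantization, 16 for fp16.
    overhead: fudge factor for KV cache and runtime (assumed)."""
    weight_bytes = params_b * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

# A 32B model at 4-bit quantization needs roughly 19 GB,
# which fits comfortably in a 48 GB unified-memory Mac.
print(round(model_memory_gb(32), 1))  # → 19.2
```

The same arithmetic explains the RTX 4090 row above: a 70B model at 4-bit needs ~42 GB, well past 24 GB of VRAM.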
See our best AI models for Mac guide for specific recommendations.
## The decision framework
- **Model < 8B parameters?** → CPU is fine for development; GPU for production serving
- **Model 8-24B parameters?** → Apple Silicon Mac for development; GPU (RTX 4090 or cloud) for production
- **Model > 24B parameters?** → GPU required (A100 or Apple Silicon 48GB+)
- **Serving multiple users?** → GPU required regardless of model size
- **Just generating embeddings?** → CPU is fine
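The framework above is simple enough to encode as a function, which is handy if you're scripting capacity planning. Thresholds are taken from this article; the return strings are illustrative:

```python
def pick_hardware(params_b: float, serving_users: bool = False,
                  embeddings_only: bool = False) -> str:
    """Decision framework from the article, expressed as code.
    params_b: model size in billions of parameters."""
    if embeddings_only:
        return "CPU"                 # embedding models are small and CPU-efficient
    if serving_users:
        return "GPU"                 # required regardless of model size
    if params_b < 8:
        return "CPU for dev, GPU for production"
    if params_b <= 24:
        return "Apple Silicon for dev, GPU for production"
    return "GPU required (A100 or Apple Silicon 48GB+)"

print(pick_hardware(4))                      # → CPU for dev, GPU for production
print(pick_hardware(8, serving_users=True))  # → GPU
```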
## Cost comparison
| Setup | Monthly cost | Best for |
|---|---|---|
| CPU only (existing hardware) | $0 | Development, small models |
| Mac Mini M4 Pro 48GB | ~$85 (amortized) | Solo dev, daily coding |
| RTX 4090 workstation | ~$105 (amortized) | Single-user production |
| RunPod A100 | $1,180/mo | Multi-user production |
| Vultr GPU | $1,480/mo | Production with SLA |
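"Amortized" here spreads the purchase price over the hardware's useful life. As an illustration — the table doesn't state its exact assumptions — a roughly $2,000 Mac Mini written off over 24 months lands near the ~$85/month figure:

```python
def amortized_monthly(price_usd: float, lifetime_months: int,
                      power_usd_per_month: float = 0.0) -> float:
    """Hardware purchase price spread over its useful life,
    plus optional running costs (electricity)."""
    return price_usd / lifetime_months + power_usd_per_month

# ~$2,000 over 24 months ≈ $83/mo, close to the table's ~$85
print(round(amortized_monthly(2000, 24)))  # → 83
```

A longer write-off period or a resale value at the end would lower the effective monthly cost further.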
See our GPU providers comparison and VRAM guide for detailed hardware planning.
## The practical recommendation
1. **Start on CPU** with Ollama and a small model (Qwen3 8B)
2. **If too slow**, get a Mac with Apple Silicon (best value for local AI)
3. **If serving users**, use a cloud GPU (RunPod serverless for variable load)
4. **If high volume**, use a dedicated GPU server (Vultr or Hetzner)
Don’t buy GPU hardware until you’ve validated your use case with API calls first. See our when to switch from API to self-hosted guide.
Related: Best Cloud GPU Providers · GPU Memory Planning · Best AI Models for Mac · How to Run AI Without GPU · Ollama Complete Guide