The RTX 4090 (24GB) and RTX 5090 (32GB) are the most common high-end consumer GPUs for local AI. This guide ranks the best models that fit within 32GB VRAM β maximizing quality while staying within hardware limits.
For larger models (70-120B), you need 128GB+ unified memory (RTX Spark, Mac Studio). For smaller budgets (16GB or 4GB), see our under 16GB VRAM and under 4GB RAM guides.
What fits in 24-32GB VRAM?
| VRAM | Max model (Q4_K_M) | Max model (FP16) |
|---|---|---|
| 24GB (RTX 4090) | ~35B parameters | ~14B parameters |
| 32GB (RTX 5090) | ~50B parameters | ~17B parameters |
The rankings
#1: Qwen 3.6 27B β Best all-around (16GB at Q4)
ollama pull qwen3.6:27b
| Memory (Q4) | Speed | Coding quality | Context |
|---|---|---|---|
| ~16GB | 40-60 t/s | Excellent | 128K |
Qwen 3.6 27B is the default recommendation for 24-32GB GPUs. NVIDIA specifically optimized llama.cpp for this model (2Γ throughput with multi-token prediction). Fits comfortably on an RTX 4090 with room for context.
Best for: Coding, general tasks, daily driver. The model RTX Spark is optimized for.
#2: Qwen 3.7 27B β Latest Qwen (same footprint)
ollama pull qwen3.7:27b
| Memory (Q4) | Speed | Coding quality | Context |
|---|---|---|---|
| ~16GB | 40-60 t/s | Excellent (latest) | 128K |
Qwen 3.7 is the newest iteration with improved coding and reasoning over 3.6. Same memory footprint, same speed, slightly better output. If your hardware runs 3.6, it runs 3.7.
#3: Qwen 3.6 35B-A3B β MoE speed demon (20GB)
ollama pull qwen3.6:35b-a3b
| Memory (Q4) | Speed | Active params | Context |
|---|---|---|---|
| ~20GB | 80+ t/s | 3B per token | 128K |
Qwen 3.6 35B-A3B is a MoE model β 35B total but only 3B activate per token. This makes it absurdly fast (80+ t/s) while having the knowledge of a 35B model. Fits on any 24GB GPU.
Best for: Speed-critical tasks, autocomplete, when you need instant responses.
#4: Devstral 2 β Coding specialist (~28GB at Q4)
ollama pull devstral2
| Memory (Q4) | Speed | Specialty | Context |
|---|---|---|---|
| ~28GB | 30-50 t/s | Code-only | 128K |
Devstral 2 is Mistralβs purpose-built coding model. Larger than Qwen 27B but trained exclusively on code β better at code completion, refactoring, and explaining code patterns. Fits on RTX 5090 (32GB) at Q4.
Best for: Pure coding tasks. If you only use AI for code, this is optimized for it.
#5: Mistral Medium 3.5 (~24GB at Q4)
| Memory (Q4) | Speed | Quality | Context |
|---|---|---|---|
| ~24GB | 30-50 t/s | Strong (coding + general) | 128K |
Mistral Medium 3.5 is Mistralβs balanced model β good at both coding and general tasks. Fits on 24GB (tight) or 32GB (comfortable).
#6: Granite 4.1 34B β Enterprise tool calling (~20GB at Q4)
ollama pull granite4.1:34b
| Memory (Q4) | Speed | Specialty | Context |
|---|---|---|---|
| ~20GB | 30-45 t/s | Tool calling, enterprise | 128K |
Granite 4.1 from IBM excels at structured outputs and function calling. If youβre building agents that need reliable tool calling, Graniteβs training on enterprise tasks gives it an edge.
#7: Gemma 4 27B β Googleβs open model (~16GB at Q4)
ollama pull gemma4:27b
| Memory (Q4) | Speed | Quality | Multimodal |
|---|---|---|---|
| ~16GB | 40-60 t/s | Good | β (vision) |
Gemma 4 from Google has native vision support β parse images locally without any API. Good all-around model with the advantage of multimodal on consumer hardware.
Best for: Local multimodal (image understanding) without needing massive hardware.
Quick comparison
| Model | Memory (Q4) | Speed | Coding | Multimodal | GPU requirement |
|---|---|---|---|---|---|
| Qwen 3.6 27B | 16GB | 40-60 t/s | β β β | β | RTX 4090 β |
| Qwen 3.7 27B | 16GB | 40-60 t/s | β β β β | β | RTX 4090 β |
| Qwen 3.6 35B-A3B | 20GB | 80+ t/s | β β | β | RTX 4090 β |
| Devstral 2 | 28GB | 30-50 t/s | β β β β | β | RTX 5090 β |
| Mistral Medium 3.5 | 24GB | 30-50 t/s | β β β | β | RTX 4090 β οΈ tight |
| Granite 4.1 34B | 20GB | 30-45 t/s | β β β | β | RTX 4090 β |
| Gemma 4 27B | 16GB | 40-60 t/s | β β | β | RTX 4090 β |
Setup (2 minutes)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run (picks best quantization for your hardware)
ollama run qwen3.7:27b
For coding tool integration: Aider setup Β· Ollama guide Β· Continue setup
When 32GB isnβt enough
If you need larger models (70B+), your options are:
- NVIDIA RTX Spark (128GB, fall 2026) β Runs 120B models
- Mac Studio (128-192GB) β Available now
- API models β DeepSeek V4-Pro at $0.435/M is cheaper than buying hardware for occasional use
FAQ
RTX 4090 (24GB) or RTX 5090 (32GB)?
The extra 8GB on the 5090 lets you run Devstral 2 and Mistral Medium 3.5 comfortably. If you only plan to run 27B models, the 4090 is fine. If you want the largest models that fit on a single consumer GPU, get the 5090.
Can I run two models simultaneously?
On 32GB: two small models (7B + 14B = 12GB) yes. Two 27B models: no. Ollama handles model loading/unloading automatically.
Q4 vs Q6 vs FP16?
Q4_K_M: best balance of quality and memory. Q6_K: slightly better quality, 50% more memory. FP16: full quality but doubles memory. For 24-32GB GPUs, Q4_K_M is almost always the right choice. Quality difference is minimal. See quantization guide.
How does local compare to API quality?
Qwen 3.6 27B locally produces ~85% of API model quality (DeepSeek V4-Pro). For most coding tasks, the difference is minor. Where API models clearly win: complex multi-file refactoring and novel algorithms.
Should I wait for RTX Spark instead?
If you need models larger than 50B: yes, RTX Spark (128GB, fall 2026) opens up 120B models locally. If 27-35B models serve your needs: buy an RTX 4090/5090 now. See Best LLMs for RTX Spark.