Jun 10, 2026 · 5 min read

Best AI Models Under 32GB VRAM in 2026: What Fits on an RTX 4090/5090

The RTX 4090 (24GB) and RTX 5090 (32GB) are the most common high-end consumer GPUs for local AI. This guide ranks the best models that fit within 32GB VRAM — maximizing quality while staying within hardware limits.

For larger models (70-120B), you need 128GB+ unified memory (RTX Spark, Mac Studio). For smaller budgets (16GB or 4GB), see our under 16GB VRAM and under 4GB RAM guides.

What fits in 24-32GB VRAM?

VRAM	Max model (Q4_K_M)	Max model (FP16)
24GB (RTX 4090)	~35B parameters	~11B parameters
32GB (RTX 5090)	~50B parameters	~15B parameters
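
These limits come from simple arithmetic: weights occupy roughly parameters × bits per weight / 8 bytes. A minimal Python sketch, assuming about 4.85 bits per weight for Q4_K_M (an approximation that varies by model) and a hypothetical 2GB reserve for KV cache and runtime overhead:

```python
# Approximate bits stored per weight. Q4_K_M mixes block formats, so
# 4.85 is an assumed average, not an exact constant.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "FP16": 16.0}


def max_params_billions(vram_gb: float, quant: str, reserve_gb: float = 2.0) -> float:
    """Largest parameter count (in billions) whose weights fit in VRAM.

    reserve_gb is a hypothetical allowance for KV cache and runtime overhead.
    """
    usable_gb = vram_gb - reserve_gb
    # 1B parameters at b bits per weight occupy b / 8 GB.
    return usable_gb * 8 / BITS_PER_WEIGHT[quant]


for vram in (24, 32):
    for quant in BITS_PER_WEIGHT:
        print(f"{vram}GB {quant}: ~{max_params_billions(vram, quant):.0f}B parameters")
```

With the 2GB reserve this gives about 36B and 49B for Q4_K_M, and 11B and 15B for FP16, since FP16 needs a full 2 bytes per parameter. Change `reserve_gb` to see how much a longer context eats into the budget.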

The rankings

#1: Qwen 3.6 27B — Best all-around (16GB at Q4)

ollama pull qwen3.6:27b

Memory (Q4)	Speed	Coding quality	Context
~16GB	40-60 t/s	Excellent	128K

Qwen 3.6 27B is the default recommendation for 24-32GB GPUs. NVIDIA specifically optimized llama.cpp for this model (2× throughput with multi-token prediction). It fits comfortably on an RTX 4090, with room left over for context.

Best for: Coding, general tasks, daily driver. The model RTX Spark is optimized for.

#2: Qwen 3.7 27B — Latest Qwen (same footprint)

ollama pull qwen3.7:27b

Memory (Q4)	Speed	Coding quality	Context
~16GB	40-60 t/s	Excellent (latest)	128K

Qwen 3.7 is the newest iteration with improved coding and reasoning over 3.6. Same memory footprint, same speed, slightly better output. If your hardware runs 3.6, it runs 3.7.

#3: Qwen 3.6 35B-A3B — MoE speed demon (20GB)

ollama pull qwen3.6:35b-a3b

Memory (Q4)	Speed	Active params	Context
~20GB	80+ t/s	3B per token	128K

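Qwen 3.6 35B-A3B is a mixture-of-experts (MoE) model: 35B parameters in total, but only about 3B are active per token. That makes it very fast (80+ t/s) while keeping the knowledge of a 35B model. All 35B parameters still have to sit in memory, which is why it needs ~20GB. It fits on any 24GB GPU.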
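
One way to see where the speed comes from: single-stream generation is usually limited by how many bytes of weights must be read from VRAM for each token. A rough Python sketch of that ceiling, assuming the RTX 4090's memory bandwidth of roughly 1,008 GB/s and about 0.6 bytes per parameter at Q4 (both approximations):

```python
def decode_ceiling_tps(active_params_b: float, bytes_per_param: float, bandwidth_gbs: float) -> float:
    """Bandwidth-bound ceiling on tokens/second for single-stream decoding."""
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gbs / gb_read_per_token


RTX_4090_BANDWIDTH_GBS = 1008.0  # approximate spec-sheet figure
Q4_BYTES_PER_PARAM = 0.6         # approximation for Q4_K_M

dense = decode_ceiling_tps(27, Q4_BYTES_PER_PARAM, RTX_4090_BANDWIDTH_GBS)
moe = decode_ceiling_tps(3, Q4_BYTES_PER_PARAM, RTX_4090_BANDWIDTH_GBS)

print(f"dense 27B ceiling:     {dense:.0f} t/s")
print(f"MoE 3B-active ceiling: {moe:.0f} t/s")
print(f"ratio:                 {moe / dense:.0f}x")
```

This is an upper bound, not a prediction. Real MoE throughput lands far below the ceiling because expert routing, attention, and kernel launch overhead do not shrink with the active parameter count.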

Best for: Speed-critical tasks, autocomplete, when you need instant responses.

#4: Devstral 2 — Coding specialist (~28GB at Q4)

ollama pull devstral2

Memory (Q4)	Speed	Specialty	Context
~28GB	30-50 t/s	Code-only	128K

Devstral 2 is Mistral’s purpose-built coding model. It is larger than Qwen 27B but trained exclusively on code, which makes it better at code completion, refactoring, and explaining code patterns. At Q4 it fits on an RTX 5090 (32GB) but not on a 24GB RTX 4090.

Best for: Pure coding tasks. If you only use AI for code, this is optimized for it.

#5: Mistral Medium 3.5 (~24GB at Q4)

Memory (Q4)	Speed	Quality	Context
~24GB	30-50 t/s	Strong (coding + general)	128K

Mistral Medium 3.5 is Mistral’s balanced model, good at both coding and general tasks. It is comfortable on 32GB. On 24GB it is tight: the Q4 weights alone take about 24GB, leaving almost nothing for context.
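
How much memory does context actually need? The KV cache comes on top of the weights and grows linearly with the number of tokens. A back-of-the-envelope Python sketch; the layer count, KV head count, and head dimension below are hypothetical placeholders, not Mistral Medium 3.5's published architecture:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> float:
    """Memory for keys and values across all layers, in GB (1 GB = 1e9 bytes)."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 1e9


# Hypothetical architecture, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for context in (8_192, 32_768, 131_072):
    size = kv_cache_gb(context, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{context:>7} tokens: {size:.1f} GB")
```

Under these assumed dimensions an 8K context already costs about 1.6GB at FP16, and the full 128K window would need more memory than the weights themselves.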

#6: Granite 4.1 34B — Enterprise tool calling (~20GB at Q4)

ollama pull granite4.1:34b

Memory (Q4)	Speed	Specialty	Context
~20GB	30-45 t/s	Tool calling, enterprise	128K

Granite 4.1 from IBM excels at structured outputs and function calling. If you’re building agents that need reliable tool calling, Granite’s training on enterprise tasks gives it an edge.

#7: Gemma 4 27B — Google’s open model (~16GB at Q4)

ollama pull gemma4:27b

Memory (Q4)	Speed	Quality	Multimodal
~16GB	40-60 t/s	Good	✅ (vision)

Gemma 4 from Google has native vision support — parse images locally without any API. Good all-around model with the advantage of multimodal on consumer hardware.

Best for: Local multimodal (image understanding) without needing massive hardware.

Quick comparison

Model	Memory (Q4)	Speed	Coding	Multimodal	GPU requirement
Qwen 3.6 27B	16GB	40-60 t/s	✅✅✅	❌	RTX 4090 ✅
Qwen 3.7 27B	16GB	40-60 t/s	✅✅✅✅	❌	RTX 4090 ✅
Qwen 3.6 35B-A3B	20GB	80+ t/s	✅✅	❌	RTX 4090 ✅
Devstral 2	28GB	30-50 t/s	✅✅✅✅	❌	RTX 5090 ✅
Mistral Medium 3.5	24GB	30-50 t/s	✅✅✅	❌	RTX 4090 ⚠️ tight
Granite 4.1 34B	20GB	30-45 t/s	✅✅✅	❌	RTX 4090 ✅
Gemma 4 27B	16GB	40-60 t/s	✅✅	✅	RTX 4090 ✅

Setup (2 minutes)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run (the default tag is usually a 4-bit build; Ollama offloads as many layers to the GPU as fit)
ollama run qwen3.7:27b
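
Once the model is running, scripts can reach it through Ollama's local HTTP API. A minimal Python sketch using only the standard library, assuming the server is listening on its default port 11434; the helper names `build_request` and `generate` are illustrative, not part of Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Example, with the server running:
#   print(generate("qwen3.7:27b", "Write a Python function that reverses a string."))
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps the sketch short at the cost of waiting for the full completion.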

For coding tool integration: Aider setup · Ollama guide · Continue setup

When 32GB isn’t enough

If you need larger models (70B+), your options are:

NVIDIA RTX Spark (128GB, fall 2026): runs 120B models
Mac Studio (128-192GB): available now
API models: DeepSeek V4-Pro at $0.435 per million tokens is cheaper than buying hardware for occasional use
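
To make the "occasional use" point concrete, here is a rough break-even sketch in Python. The $0.435 per million tokens figure is the one quoted above; the hardware price is a hypothetical placeholder, and electricity is ignored:

```python
def break_even_tokens_millions(hardware_cost_usd: float, api_price_per_million_usd: float) -> float:
    """Millions of API tokens you could buy for the price of the hardware."""
    return hardware_cost_usd / api_price_per_million_usd


API_PRICE = 0.435        # USD per million tokens, as quoted in the article
HARDWARE_COST = 4000.0   # hypothetical price, for illustration only

millions = break_even_tokens_millions(HARDWARE_COST, API_PRICE)
print(f"Break-even at about {millions:,.0f} million tokens")
```

At the assumed $4,000 that is roughly 9.2 billion tokens before the hardware pays for itself, so the purchase only makes sense on cost grounds for sustained heavy use.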

FAQ

RTX 4090 (24GB) or RTX 5090 (32GB)?

The extra 8GB on the 5090 lets you run Devstral 2 and Mistral Medium 3.5 comfortably. If you only plan to run 27B models, the 4090 is fine. If you want the largest models that fit on a single consumer GPU, get the 5090.

Can I run two models simultaneously?

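On 32GB: two small models (7B + 14B, about 12GB combined at Q4), yes. Two 27B models, no: at ~16GB each they would fill the card with nothing left for context. Ollama loads and unloads models automatically, so you can still switch between them.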
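
The check behind that answer is a simple sum. A minimal Python sketch, assuming the 12GB total splits into roughly 4GB and 8GB, and using a hypothetical 4GB headroom for KV caches and runtime overhead:

```python
def fits_together(model_sizes_gb: list[float], vram_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if all models plus a headroom allowance fit in VRAM at once.

    headroom_gb is a hypothetical reserve for KV caches and runtime overhead.
    """
    return sum(model_sizes_gb) + headroom_gb <= vram_gb


small_pair = fits_together([4.0, 8.0], vram_gb=32)    # 7B + 14B at Q4, ~12GB total
large_pair = fits_together([16.0, 16.0], vram_gb=32)  # two 27B models at ~16GB each

print(f"7B + 14B on 32GB:  {small_pair}")
print(f"27B + 27B on 32GB: {large_pair}")
```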

Q4 vs Q6 vs FP16?

Q4_K_M: best balance of quality and memory. Q6_K: slightly better quality for about a third more memory. FP16: full quality, but more than three times the memory of Q4_K_M. For 24-32GB GPUs, Q4_K_M is almost always the right choice; the quality difference is minimal. See quantization guide.

How does local compare to API quality?

Run locally, Qwen 3.6 27B delivers roughly 85% of the quality of an API model such as DeepSeek V4-Pro. For most coding tasks, the difference is minor. Where API models clearly win: complex multi-file refactoring and novel algorithms.

Should I wait for RTX Spark instead?

If you need models larger than 50B: yes, RTX Spark (128GB, fall 2026) opens up 120B models locally. If 27-35B models serve your needs: buy an RTX 4090/5090 now. See Best LLMs for RTX Spark.