πŸ€– AI Tools
Β· 5 min read

Best AI Models Under 32GB VRAM in 2026: What Fits on an RTX 4090/5090


The RTX 4090 (24GB) and RTX 5090 (32GB) are the most common high-end consumer GPUs for local AI. This guide ranks the best models that fit within 32GB VRAM β€” maximizing quality while staying within hardware limits.

For larger models (70-120B), you need 128GB+ unified memory (RTX Spark, Mac Studio). For smaller budgets (16GB or 4GB), see our under 16GB VRAM and under 4GB RAM guides.

What fits in 24-32GB VRAM?

VRAMMax model (Q4_K_M)Max model (FP16)
24GB (RTX 4090)~35B parameters~14B parameters
32GB (RTX 5090)~50B parameters~17B parameters

The rankings

#1: Qwen 3.6 27B β€” Best all-around (16GB at Q4)

ollama pull qwen3.6:27b
Memory (Q4)SpeedCoding qualityContext
~16GB40-60 t/sExcellent128K

Qwen 3.6 27B is the default recommendation for 24-32GB GPUs. NVIDIA specifically optimized llama.cpp for this model (2Γ— throughput with multi-token prediction). Fits comfortably on an RTX 4090 with room for context.

Best for: Coding, general tasks, daily driver. The model RTX Spark is optimized for.

#2: Qwen 3.7 27B β€” Latest Qwen (same footprint)

ollama pull qwen3.7:27b
Memory (Q4)SpeedCoding qualityContext
~16GB40-60 t/sExcellent (latest)128K

Qwen 3.7 is the newest iteration with improved coding and reasoning over 3.6. Same memory footprint, same speed, slightly better output. If your hardware runs 3.6, it runs 3.7.

#3: Qwen 3.6 35B-A3B β€” MoE speed demon (20GB)

ollama pull qwen3.6:35b-a3b
Memory (Q4)SpeedActive paramsContext
~20GB80+ t/s3B per token128K

Qwen 3.6 35B-A3B is a MoE model β€” 35B total but only 3B activate per token. This makes it absurdly fast (80+ t/s) while having the knowledge of a 35B model. Fits on any 24GB GPU.

Best for: Speed-critical tasks, autocomplete, when you need instant responses.

#4: Devstral 2 β€” Coding specialist (~28GB at Q4)

ollama pull devstral2
Memory (Q4)SpeedSpecialtyContext
~28GB30-50 t/sCode-only128K

Devstral 2 is Mistral’s purpose-built coding model. Larger than Qwen 27B but trained exclusively on code β€” better at code completion, refactoring, and explaining code patterns. Fits on RTX 5090 (32GB) at Q4.

Best for: Pure coding tasks. If you only use AI for code, this is optimized for it.

#5: Mistral Medium 3.5 (~24GB at Q4)

Memory (Q4)SpeedQualityContext
~24GB30-50 t/sStrong (coding + general)128K

Mistral Medium 3.5 is Mistral’s balanced model β€” good at both coding and general tasks. Fits on 24GB (tight) or 32GB (comfortable).

#6: Granite 4.1 34B β€” Enterprise tool calling (~20GB at Q4)

ollama pull granite4.1:34b
Memory (Q4)SpeedSpecialtyContext
~20GB30-45 t/sTool calling, enterprise128K

Granite 4.1 from IBM excels at structured outputs and function calling. If you’re building agents that need reliable tool calling, Granite’s training on enterprise tasks gives it an edge.

#7: Gemma 4 27B β€” Google’s open model (~16GB at Q4)

ollama pull gemma4:27b
Memory (Q4)SpeedQualityMultimodal
~16GB40-60 t/sGoodβœ… (vision)

Gemma 4 from Google has native vision support β€” parse images locally without any API. Good all-around model with the advantage of multimodal on consumer hardware.

Best for: Local multimodal (image understanding) without needing massive hardware.

Quick comparison

ModelMemory (Q4)SpeedCodingMultimodalGPU requirement
Qwen 3.6 27B16GB40-60 t/sβœ…βœ…βœ…βŒRTX 4090 βœ…
Qwen 3.7 27B16GB40-60 t/sβœ…βœ…βœ…βœ…βŒRTX 4090 βœ…
Qwen 3.6 35B-A3B20GB80+ t/sβœ…βœ…βŒRTX 4090 βœ…
Devstral 228GB30-50 t/sβœ…βœ…βœ…βœ…βŒRTX 5090 βœ…
Mistral Medium 3.524GB30-50 t/sβœ…βœ…βœ…βŒRTX 4090 ⚠️ tight
Granite 4.1 34B20GB30-45 t/sβœ…βœ…βœ…βŒRTX 4090 βœ…
Gemma 4 27B16GB40-60 t/sβœ…βœ…βœ…RTX 4090 βœ…

Setup (2 minutes)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run (picks best quantization for your hardware)
ollama run qwen3.7:27b

For coding tool integration: Aider setup Β· Ollama guide Β· Continue setup

When 32GB isn’t enough

If you need larger models (70B+), your options are:

FAQ

RTX 4090 (24GB) or RTX 5090 (32GB)?

The extra 8GB on the 5090 lets you run Devstral 2 and Mistral Medium 3.5 comfortably. If you only plan to run 27B models, the 4090 is fine. If you want the largest models that fit on a single consumer GPU, get the 5090.

Can I run two models simultaneously?

On 32GB: two small models (7B + 14B = 12GB) yes. Two 27B models: no. Ollama handles model loading/unloading automatically.

Q4 vs Q6 vs FP16?

Q4_K_M: best balance of quality and memory. Q6_K: slightly better quality, 50% more memory. FP16: full quality but doubles memory. For 24-32GB GPUs, Q4_K_M is almost always the right choice. Quality difference is minimal. See quantization guide.

How does local compare to API quality?

Qwen 3.6 27B locally produces ~85% of API model quality (DeepSeek V4-Pro). For most coding tasks, the difference is minor. Where API models clearly win: complex multi-file refactoring and novel algorithms.

Should I wait for RTX Spark instead?

If you need models larger than 50B: yes, RTX Spark (128GB, fall 2026) opens up 120B models locally. If 27-35B models serve your needs: buy an RTX 4090/5090 now. See Best LLMs for RTX Spark.