Best LLMs to Run on NVIDIA RTX Spark: What Fits in 128GB (2026)


NVIDIA RTX Spark ships this fall with 128GB of unified memory and a Blackwell GPU delivering 1 petaflop of AI compute. The question every developer is asking: which models can I actually run on it?

The answer depends on model architecture (dense vs MoE), quantization level, and how much context you need. This guide ranks the best models for RTX Spark by use case, with realistic memory estimates and expected performance.

The memory budget

With 128GB unified memory, you need to account for:

  • Model weights: the main memory consumer
  • KV cache: grows with context length (typically 2-8GB for 32K context)
  • OS + applications: Windows needs ~8-12GB
  • Overhead: CUDA runtime, buffers (~4-8GB)

Practical budget for model weights: ~100-110GB
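The subtraction behind that figure can be sketched in a few lines. This is only an illustration using the ranges listed above; the function name `weight_budget_gb` is made up for this example.

```python
# Rough memory budget for model weights on a 128GB unified-memory machine.
# All figures are in GB and come from the ranges in the list above.
TOTAL_GB = 128

def weight_budget_gb(kv_cache_gb, os_gb, overhead_gb, total_gb=TOTAL_GB):
    """Memory left for model weights once the other consumers are subtracted."""
    return total_gb - kv_cache_gb - os_gb - overhead_gb

# Best case uses the low end of every range, worst case the high end.
best = weight_budget_gb(kv_cache_gb=2, os_gb=8, overhead_gb=4)
worst = weight_budget_gb(kv_cache_gb=8, os_gb=12, overhead_gb=8)
print(f"Weight budget: {worst}-{best} GB")
```

The two extremes come out at 100 and 114 GB, which brackets the ~100-110GB working figure.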

This means:

  • Models up to ~120B parameters at Q4_K_M quantization
  • Models up to ~50B at FP16 (16-bit weights, 2 bytes per parameter)
  • MoE models whose total size fits, regardless of active parameters
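As a rough check on these limits, weight size is approximately parameters times bits per weight divided by 8. The sketch below assumes an effective 4.8 bits per weight for Q4_K_M, a value chosen here only because it reproduces the ~16GB figure this guide uses for a 27B model; real GGUF files vary by model.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8

def fits(params_billion, bits_per_weight, budget_gb):
    """True if the weights alone fit inside the given budget."""
    return weights_gb(params_billion, bits_per_weight) <= budget_gb

FP16_BITS = 16.0
Q4_BITS = 4.8      # assumption: effective bits/weight, calibrated to 27B ~ 16GB
BUDGET_GB = 100.0  # low end of the weight budget above

print(weights_gb(27, FP16_BITS))           # 27B at FP16
print(round(weights_gb(27, Q4_BITS), 1))   # 27B at Q4
print(fits(120, Q4_BITS, BUDGET_GB))       # does a 120B model fit at Q4?
```

Keeping `bits_per_weight` as a parameter makes it easy to plug in the actual file size of a specific download once it is known.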

Tier 1: Models that run great (under 50GB)

These models load instantly, leave room for large context windows, and run at 30-60+ tokens/second on RTX Spark.

| Model | Params | Memory (Q4) | Memory (FP16) | Best for | Speed (est.) |
|---|---|---|---|---|---|
| Qwen 3.6 27B | 27B | ~16GB | ~54GB | Coding, general | 40-60 t/s |
| Qwen 3.6 35B-A3B | 35B (3B active) | ~20GB | ~70GB* | Lightweight tasks | 80+ t/s |
| Gemma 4 27B | 27B | ~16GB | ~54GB | General, multimodal | 40-60 t/s |
| Mistral Medium 3.5 | ~40B | ~24GB | ~80GB | Coding, reasoning | 30-50 t/s |
| Qwen 3.7 27B | 27B | ~16GB | ~54GB | Latest Qwen coding | 40-60 t/s |

*Qwen 3.6 35B-A3B is an MoE model. All 35B parameters must stay resident in memory (about 70GB at FP16), but only 3B (about 7GB at FP16) are active per token, so it runs extremely fast despite its total size.

Recommendation for this tier: Qwen 3.6 27B is the NVIDIA-optimized choice (2× throughput with multi-token prediction). It is the best all-around model for RTX Spark.

Tier 2: Models that fit well (50-90GB)

These models use most of the available memory but run at good speeds. Context window may be limited to 32-64K tokens.

| Model | Params | Memory (Q4 unless noted) | Best for | Speed (est.) |
|---|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | ~60GB | General, multilingual | 20-35 t/s |
| DeepSeek V4 Flash | ~70B active | ~40GB | Fast coding | 25-40 t/s |
| Qwen 3.6 27B (FP16) | 27B | ~54GB (FP16) | Maximum quality | 25-40 t/s |
| Granite 4.1 34B | 34B | ~20GB | Enterprise, tool use | 35-50 t/s |
| Devstral 2 | ~50B (est.) | ~30GB | Coding (Mistral) | 30-45 t/s |

Recommendation for this tier: Llama 4 Scout is remarkable on RTX Spark. It has 109B total parameters but only 17B active per token, so it carries the knowledge of a 109B model while generating at roughly the speed of a 17B model. The MoE architecture is well suited to 128GB unified memory: every expert fits in memory, and each token reads only a fraction of the weights.
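The speed argument can be illustrated with the common back-of-the-envelope model that token generation is memory-bandwidth-bound: each new token reads every active weight once. The sketch below is an estimate under that assumption only. `BANDWIDTH_GB_S` is a placeholder and not a published RTX Spark figure, and the 4.8 bits per weight is likewise assumed; the ratio it prints does not depend on either value.

```python
def decode_tokens_per_sec(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Rough upper bound on decode speed for a bandwidth-bound model.

    Assumes each generated token reads every active weight once, so
    tokens/s <= memory bandwidth / size of the active weights.
    """
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb

BANDWIDTH_GB_S = 250.0  # placeholder only, NOT a published RTX Spark figure
Q4_BITS = 4.8           # assumed effective bits per weight at Q4

moe = decode_tokens_per_sec(17, Q4_BITS, BANDWIDTH_GB_S)     # 17B active (Llama 4 Scout)
dense = decode_tokens_per_sec(109, Q4_BITS, BANDWIDTH_GB_S)  # hypothetical dense 109B
print(f"MoE advantage: ~{moe / dense:.1f}x tokens/s over a dense model of equal size")
```

Real throughput also depends on routing overhead, batch size, and kernel quality, so this should be read as a ceiling on the relative gain, not a benchmark.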

Tier 3: Models that fit tight (90-110GB)

These models work but leave minimal room for context or other processes. Best for batch processing or dedicated inference.

| Model | Params | Memory (Q4) | Best for | Speed (est.) |
|---|---|---|---|---|
| 120B dense model | 120B | ~70GB | Maximum local quality | 10-18 t/s |
| Nemotron 3 Super | 97B | ~56GB | NVIDIA-optimized | 12-20 t/s |
| Falcon H1R 7B + larger variants | Various | Various | Arabic + English | Various |

Recommendation for this tier: Unless you specifically need the largest possible model, the Tier 1-2 models offer much better speed and context length. A 120B model at 10 t/s is painfully slow for interactive use.

What does NOT fit

These popular models cannot run on RTX Spark's 128GB:

| Model | Why it doesn't fit | Alternative |
|---|---|---|
| DeepSeek V4-Pro | 1.6T total, needs 200GB+ even quantized | Use API at $0.435/M |
| MiMo V2.5 Pro | Dense, exceeds 128GB | Use API at $0.435/M |
| MiniMax M3 | Estimated 200-400B, too large | Use API at $0.60/M |
| Claude Opus 4.8 | Closed source, API only | Use API at $5/M |
| GPT-5.5 | Closed source, API only | Use API |

For these models, stick with APIs. At Chinese model prices ($0.435-0.87/M tokens), the API cost is often lower than the electricity cost of running a local machine 24/7.
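Whether the API really undercuts electricity depends on your own power draw, tariff, and token volume. The sketch below shows the comparison; the 150W average draw and $0.15/kWh rate are hypothetical placeholders, and only the $0.435/M price comes from the table above.

```python
def monthly_electricity_usd(avg_watts, usd_per_kwh, hours=24 * 30):
    """Cost of running a machine continuously for a 30-day month."""
    return avg_watts / 1000 * hours * usd_per_kwh

def monthly_api_usd(tokens_millions, usd_per_million):
    """API bill for a given monthly token volume."""
    return tokens_millions * usd_per_million

AVG_WATTS = 150.0      # hypothetical average draw of the local machine
USD_PER_KWH = 0.15     # hypothetical electricity tariff
API_PRICE_USD = 0.435  # per million tokens, from the table above

electricity = monthly_electricity_usd(AVG_WATTS, USD_PER_KWH)
break_even_tokens_m = electricity / API_PRICE_USD
print(f"Electricity: ${electricity:.2f}/month")
print(f"API is cheaper below ~{break_even_tokens_m:.0f}M tokens/month")
```

With these placeholder inputs, electricity comes to about $16 a month, which buys roughly 37 million tokens at $0.435/M. Below that volume the API is cheaper on running costs alone; above it, local inference wins.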

Best model for each use case

| Use case | Best model on RTX Spark | Why |
|---|---|---|
| General coding | Qwen 3.6 27B (Q4) | NVIDIA-optimized, 2× throughput, excellent coding |
| Maximum coding quality | Qwen 3.7 27B or Mistral Medium 3.5 | Latest models, strong SWE-bench scores |
| Fastest responses | Qwen 3.6 35B-A3B | 3B active params = 80+ t/s |
| Largest knowledge base | Llama 4 Scout (109B MoE) | 109B params worth of knowledge, 17B speed |
| Multimodal (images) | Gemma 4 27B | Native vision support |
| Privacy-sensitive | Any model above | All run 100% locally, no data leaves device |
| Long context (128K+) | Qwen 3.6 27B (Q4) | Small enough to leave room for KV cache |
| Enterprise/tool use | Granite 4.1 34B | IBM's tool-calling-optimized model |

Quantization guide for RTX Spark

| Quantization | Quality loss | Memory savings (vs FP16) | Recommended? |
|---|---|---|---|
| FP16 | None | Baseline | For models ≤50B |
| Q8 | Minimal | 50% | Good balance for 60-80B |
| Q6_K | Very low | 60% | Sweet spot for most users |
| Q4_K_M | Low | 75% | Best for 70-120B models |
| Q3_K | Moderate | 80% | Only if model doesn't fit otherwise |
| Q2_K | High | 87% | Not recommended (noticeable quality loss) |

For RTX Spark with 128GB, Q4_K_M is the practical choice for large models. It cuts memory by about 75% relative to FP16 with low quality loss, enough to fit models up to 120B parameters while leaving room for context.
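One way to apply the table is to walk it from highest to lowest quality and take the first level that fits. The sketch below uses the table's rounded savings figures, so sizes are approximate; the 80GB budget is a hypothetical value that reserves extra room for context, and `pick_quantization` is a name invented for this example.

```python
# Memory savings relative to FP16, copied from the table above (rounded).
SAVINGS = [
    ("FP16", 0.00),
    ("Q8", 0.50),
    ("Q6_K", 0.60),
    ("Q4_K_M", 0.75),
    ("Q3_K", 0.80),
    ("Q2_K", 0.87),
]

def pick_quantization(params_billion, budget_gb):
    """Return (name, size_gb) for the highest-quality level that fits, else None."""
    fp16_gb = params_billion * 2  # FP16 stores 2 bytes per parameter
    for name, saving in SAVINGS:
        size_gb = fp16_gb * (1 - saving)
        if size_gb <= budget_gb:
            return name, size_gb
    return None

BUDGET_GB = 80.0  # hypothetical: leaves headroom for KV cache below ~100-110GB
for params in (27, 70, 120):
    print(params, pick_quantization(params, BUDGET_GB))
```

Under that budget a 27B model stays at FP16, a 70B model lands on Q8, and a 120B model lands on Q4_K_M, which lines up with the recommendations in the table.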

Setup when RTX Spark launches

When RTX Spark ships this fall, setup will be straightforward:

# Option 1: Ollama (easiest)
winget install ollama
ollama pull qwen3.6:27b-q4_K_M
ollama run qwen3.6:27b-q4_K_M

# Option 2: LM Studio (GUI)
# Download from lmstudio.ai, search for Qwen 3.6 27B

# Option 3: llama.cpp (most control, NVIDIA-optimized)
# Will include multi-token prediction for 2x throughput

For detailed setup guides, see Ollama complete guide, LM Studio guide, and how to run models locally.

Should you wait for RTX Spark or use APIs now?

| Your situation | Recommendation |
|---|---|
| Spend >$200/month on AI APIs | Wait for RTX Spark; breaks even within a year |
| Spend <$50/month | Stick with APIs; hardware ROI too slow |
| Need models >120B | Stick with APIs; DeepSeek/MiMo at $0.435/M |
| Privacy requirements | Wait for RTX Spark; all local, no data leaves device |
| Need it today | Buy Mac Studio or use APIs |
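The break-even rows reduce to one division: hardware cost over the monthly API spend you would avoid. The sketch below is illustrative only. The $3,000 hardware cost is a placeholder, not a published price, and the $15 monthly running cost is likewise hypothetical.

```python
import math

def months_to_break_even(hardware_cost_usd, monthly_api_usd, monthly_running_cost_usd=0.0):
    """Months until avoided API spend covers the hardware; inf if it never does."""
    monthly_saving = monthly_api_usd - monthly_running_cost_usd
    if monthly_saving <= 0:
        return math.inf
    return hardware_cost_usd / monthly_saving

HARDWARE_COST_USD = 3000.0  # placeholder, not a published RTX Spark price
RUNNING_COST_USD = 15.0     # hypothetical monthly electricity

for api_spend in (50.0, 300.0):
    months = months_to_break_even(HARDWARE_COST_USD, api_spend, RUNNING_COST_USD)
    print(f"${api_spend:.0f}/month API spend -> {months:.1f} months to break even")
```

Substitute the real purchase price and your own API bill once pricing is announced; the conclusion is sensitive to both.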

FAQ

What's the single best model to run on RTX Spark?

Qwen 3.6 27B at Q4_K_M quantization. It is NVIDIA's optimization target (2× throughput demonstrated), fits easily in 128GB with room for large contexts, runs at 40-60 t/s, and scores competitively on coding benchmarks.

Can I run multiple models simultaneously?

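Yes, if combined memory stays under ~110GB. Example: Qwen 3.6 27B (16GB) + a small 7B model (4GB) = 20GB, leaving plenty of room. Two 70B models are a much tighter fit: at Q4 they need roughly 80GB for weights alone, which leaves little room for context, and at Q8 or FP16 they do not fit at all.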
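A quick way to check a combination is to add up the Q4 sizes from the tier tables and compare against the budget. A minimal sketch, with `fits_together` as a name made up for this example:

```python
def fits_together(model_sizes_gb, budget_gb=110.0):
    """Return (fits, headroom_gb) for a set of models loaded at the same time."""
    total = sum(model_sizes_gb)
    return total <= budget_gb, budget_gb - total

# Qwen 3.6 27B (16GB) plus a small 7B model (4GB), as in the example above.
print(fits_together([16, 4]))
# Llama 4 Scout (~60GB) plus Nemotron 3 Super (~56GB), sizes from the tier tables.
print(fits_together([60, 56]))
```

The headroom figure is what remains for KV cache across all loaded models, so a combination that only just fits will still struggle with long contexts.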

Will GGUF quantizations work on day one?

Yes. llama.cpp (which uses GGUF) will have NVIDIA-optimized builds ready at RTX Spark launch, with multi-token prediction and other Blackwell-specific optimizations.

How does the context window work with limited memory?

KV cache for context grows with sequence length. At Q4 quantization, a 27B model with 128K context uses ~16GB (model) + ~4GB (KV cache) = ~20GB. You have 90GB+ of headroom. For 1M token context, the KV cache grows significantly. NVIDIA claims 1M token support, but practical limits depend on the model.
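The linear growth follows from the standard KV cache formula: 2 (keys and values) × cached layers × KV heads × head dimension × sequence length × bytes per value. The configuration below is hypothetical, picked only so the arithmetic is easy to follow; it does not describe any specific model, and real architectures differ widely.

```python
GIB = 1024 ** 3

def kv_cache_bytes(cached_layers, kv_heads, head_dim, seq_len, bytes_per_value):
    """KV cache size: keys + values for every cached layer and position."""
    return 2 * cached_layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration, not taken from any real model card.
cfg = dict(cached_layers=16, kv_heads=4, head_dim=256, bytes_per_value=1)

at_128k = kv_cache_bytes(seq_len=128 * 1024, **cfg) / GIB
at_1m = kv_cache_bytes(seq_len=1024 * 1024, **cfg) / GIB
print(f"128K context: {at_128k:.0f} GiB, 1M context: {at_1m:.0f} GiB")
```

For this made-up configuration the cache is 4 GiB at 128K tokens and 32 GiB at 1M tokens: an 8× longer context costs 8× the memory. Plugging in a real model's layer count, KV heads, head dimension, and cache precision gives the figure that actually applies.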

Is RTX Spark better than a GeForce RTX 5090 for local AI?

Yes. The RTX 5090 has 32GB VRAM, only enough for 14-27B models. RTX Spark's 128GB unified memory handles models 4-5× larger. They are in completely different categories.

What about fine-tuning?

Fine-tuning requires 2-4× the memory of inference. On RTX Spark's 128GB, you can realistically fine-tune models up to ~14-27B parameters using QLoRA. For larger model fine-tuning, cloud GPUs remain necessary.
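The 2-4× rule of thumb can be applied directly to the inference footprints quoted earlier. The sketch below takes the conservative view that the high end of the range must fit in the ~110GB budget. It is a rough screen, not a substitute for measuring a real training run, where batch size and sequence length matter a great deal.

```python
def finetune_memory_range_gb(inference_gb, low_factor=2.0, high_factor=4.0):
    """Estimated fine-tuning memory as a (low, high) range in GB."""
    return inference_gb * low_factor, inference_gb * high_factor

def can_finetune(inference_gb, budget_gb=110.0):
    """Conservative check: the high end of the range must fit in the budget."""
    return finetune_memory_range_gb(inference_gb)[1] <= budget_gb

# 27B at Q4 (~16GB) versus a 120B model at Q4 (~70GB), sizes from the tier tables.
for name, size_gb in (("27B Q4", 16.0), ("120B Q4", 70.0)):
    print(name, finetune_memory_range_gb(size_gb), can_finetune(size_gb))
```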