Jun 3, 2026 · 6 min read

Best LLMs to Run on NVIDIA RTX Spark: What Fits in 128GB (2026)

NVIDIA RTX Spark ships this fall with 128GB of unified memory and a Blackwell GPU delivering 1 petaflop of AI compute. The question every developer is asking: which models can I actually run on it?

The answer depends on model architecture (dense vs MoE), quantization level, and how much context you need. This guide ranks the best models for RTX Spark by use case, with realistic memory estimates and expected performance.

The memory budget

With 128GB unified memory, you need to account for:

Model weights — The main memory consumer
KV cache — Grows with context length (typically 2-8GB for 32K context)
OS + applications — Windows needs ~8-12GB
Overhead — CUDA runtime, buffers (~4-8GB)

Practical budget for model weights: ~100-110GB

This means:

Models up to ~120B parameters at Q4_K_M quantization
Models up to ~50B at FP16 (unquantized, 2 bytes per parameter)
MoE models where total size fits, regardless of active parameters
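The budget arithmetic above can be sketched in a few lines of Python. This is a rough estimate rather than a measurement: the bits-per-weight values are assumptions based on typical GGUF averages, and the function names are illustrative only.

```python
# Rough weight-memory estimate: parameters x bits per weight.
# Bits-per-weight figures are assumptions (typical GGUF averages), not measurements.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}


def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits(params_billions: float, quant: str, budget_gb: float = 100.0) -> bool:
    """True if the weights alone fit in the practical budget from the text."""
    return weight_gb(params_billions, quant) <= budget_gb


if __name__ == "__main__":
    for params, quant in [(27, "Q4_K_M"), (27, "FP16"), (120, "Q4_K_M")]:
        size = weight_gb(params, quant)
        print(f"{params}B @ {quant}: ~{size:.0f} GB, fits={fits(params, quant)}")
```

For MoE models, pass the total parameter count, not the active count, since every expert has to be resident in memory.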

Tier 1: Models that run great (under 50GB)

These models load in seconds, leave room for large context windows, and run at 30-60+ tokens/second on RTX Spark.

Model	Params	Memory (Q4)	Memory (FP16)	Best for	Speed (est.)
Qwen 3.6 27B	27B	~16GB	~54GB	Coding, general	40-60 t/s
Qwen 3.6 35B-A3B*	35B (3B active)	~20GB	~70GB	Lightweight tasks	80+ t/s
Gemma 4 27B	27B	~16GB	~54GB	General, multimodal	40-60 t/s
Mistral Medium 3.5	~40B	~24GB	~80GB	Coding, reasoning	30-50 t/s
Qwen 3.7 27B	27B	~16GB	~54GB	Latest Qwen coding	40-60 t/s

*Qwen 3.6 35B-A3B is an MoE model: only 3B parameters are active per token, so it runs extremely fast despite 35B total parameters. All 35B parameters must still be resident in memory, so its footprint is sized by the total, not the active, parameter count.
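Why would a 3B-active MoE decode so much faster than a dense 27B model? A common first-order model treats token generation as memory-bandwidth bound: each new token requires reading the active weights once. The sketch below applies that assumption. The bandwidth value is a placeholder, not a published RTX Spark specification, and the 4.8 bits per weight for Q4_K_M is likewise an assumed average.

```python
def decode_tokens_per_s(active_params_b: float, bits_per_weight: float,
                        bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if each token reads all active weights once."""
    active_gb = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / active_gb


if __name__ == "__main__":
    BW = 500.0  # placeholder GB/s, NOT an RTX Spark figure
    dense = decode_tokens_per_s(27, 4.8, BW)  # dense 27B: all weights active
    moe = decode_tokens_per_s(3, 4.8, BW)     # MoE: only 3B active per token
    print(f"dense: ~{dense:.0f} t/s, MoE: ~{moe:.0f} t/s, ratio {moe / dense:.1f}x")
```

In practice the gap is smaller than this bound suggests (the table estimates 80+ t/s versus 40-60 t/s), because attention, expert routing, and other overheads do not shrink with the active parameter count.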

Recommendation for this tier: Qwen 3.6 27B is the NVIDIA-optimized choice (2× throughput with multi-token prediction). It is the best all-around model for RTX Spark.

Tier 2: Models that fit well (50-90GB)

These models use most of the available memory but run at good speeds. Context window may be limited to 32-64K tokens.

Model	Params	Memory (Q4)	Best for	Speed (est.)
Llama 4 Scout	109B (17B active)	~60GB	General, multilingual	20-35 t/s
DeepSeek V4 Flash	~70B active	~40GB	Fast coding	25-40 t/s
Qwen 3.6 27B (FP16)	27B	~54GB	Maximum quality	25-40 t/s
Granite 4.1 34B	34B	~20GB	Enterprise, tool use	35-50 t/s
Devstral 2	~50B (est.)	~30GB	Coding (Mistral)	30-45 t/s

Recommendation for this tier: Llama 4 Scout is remarkable on RTX Spark — 109B total parameters but only 17B active, meaning it has the knowledge of a 109B model with the speed of a 17B model. The MoE architecture is perfectly suited to 128GB unified memory.

Tier 3: Models that fit tight (90-110GB)

These models work but leave minimal room for context or other processes. Best for batch processing or dedicated inference.

Model	Params	Memory (Q4)	Best for	Speed (est.)
120B dense model	120B	~70GB	Maximum local quality	10-18 t/s
Nemotron 3 Super	97B	~56GB	NVIDIA-optimized	12-20 t/s
Falcon H1R 7B + larger variants	Various	Various	Arabic + English	Various

Recommendation for this tier: Unless you specifically need the largest possible model, the Tier 1-2 models offer much better speed and context length. A 120B model at 10 t/s is painfully slow for interactive use.

What does NOT fit

These popular models cannot run on RTX Spark’s 128GB:

Model	Why it doesn’t fit	Alternative
DeepSeek V4-Pro	1.6T total, needs 200GB+ even quantized	Use API at $0.435/M
MiMo V2.5 Pro	Dense, exceeds 128GB	Use API at $0.435/M
MiniMax M3	Estimated 200-400B, too large	Use API at $0.60/M
Claude Opus 4.8	Closed source, API only	Use API at $5/M
GPT-5.5	Closed source, API only	Use API

For these models, stick with APIs. At Chinese model prices ($0.435-0.87/M tokens), the API cost is often lower than the electricity cost of running a local machine 24/7.

Best model for each use case

Use case	Best model on RTX Spark	Why
General coding	Qwen 3.6 27B (Q4)	NVIDIA-optimized, 2× throughput, excellent coding
Maximum coding quality	Qwen 3.7 27B or Mistral Medium 3.5	Latest models, strong SWE-bench scores
Fastest responses	Qwen 3.6 35B-A3B	3B active params = 80+ t/s
Largest knowledge base	Llama 4 Scout (109B MoE)	109B params worth of knowledge, 17B speed
Multimodal (images)	Gemma 4 27B	Native vision support
Privacy-sensitive	Any model above	All run 100% locally, no data leaves device
Long context (128K+)	Qwen 3.6 27B (Q4)	Small enough to leave room for KV cache
Enterprise/tool use	Granite 4.1 34B	IBM’s tool-calling-optimized model

Quantization guide for RTX Spark

Quantization	Quality loss	Memory savings	Recommended?
FP16	None	Baseline	For models ≤50B
Q8	Minimal	50%	Good balance for 60-80B
Q6_K	Very low	60%	Good choice when memory allows
Q4_K_M	Low	75%	Best for 70-120B models
Q3_K	Moderate	80%	Only if model doesn’t fit otherwise
Q2_K	High	87%	Not recommended (noticeable quality loss)

For RTX Spark with 128GB, Q4_K_M is the sweet spot for most models. It cuts weight memory by roughly 70-75% relative to FP16 with low quality loss, enough to fit models up to 120B parameters while leaving room for context.
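To see how the quantization table translates into a choice for a given model, here is a small sketch that walks the levels from highest to lowest precision and returns the first one that fits. The savings figures are copied from the table and are nominal; real GGUF files are usually somewhat larger. The 100GB default budget and the function names are assumptions for illustration.

```python
# Savings relative to FP16, copied from the table above (nominal figures).
SAVINGS_VS_FP16 = {
    "FP16": 0.0, "Q8": 0.50, "Q6_K": 0.60,
    "Q4_K_M": 0.75, "Q3_K": 0.80, "Q2_K": 0.87,
}


def quantized_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size in GB at the given quantization level."""
    fp16_gb = params_billions * 2  # FP16 stores 2 bytes per parameter
    return fp16_gb * (1 - SAVINGS_VS_FP16[quant])


def largest_fitting_quant(params_billions: float, budget_gb: float = 100.0):
    """Return the highest-precision level whose weights fit, or None."""
    for quant in SAVINGS_VS_FP16:  # insertion order: highest precision first
        if quantized_gb(params_billions, quant) <= budget_gb:
            return quant
    return None


if __name__ == "__main__":
    for params in (27, 70, 120):
        print(f"{params}B -> {largest_fitting_quant(params)}")
```

This only checks the weights. A level that barely fits leaves little room for KV cache, so stepping down one level is often the more practical choice.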

Setup when RTX Spark launches

When RTX Spark ships this fall, setup should be straightforward:

# Option 1: Ollama (easiest)
winget install Ollama.Ollama
# Model tag shown for illustration; check the Ollama library for the exact tag
ollama pull qwen3.6:27b-q4_K_M
ollama run qwen3.6:27b-q4_K_M

# Option 2: LM Studio (GUI)
# Download from lmstudio.ai, search for Qwen 3.6 27B

# Option 3: llama.cpp (most control, NVIDIA-optimized)
# Will include multi-token prediction for 2x throughput

For detailed setup guides, see Ollama complete guide, LM Studio guide, and how to run models locally.
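Once a model is running under Ollama, it can also be called from code through Ollama's local REST API. The following is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default port (11434), and the model tag is the same illustrative tag used above, which may differ at launch.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Example (needs Ollama running and the model pulled):
# print(generate("qwen3.6:27b-q4_K_M", "Summarize what unified memory means."))
```

Setting `stream` to false returns one JSON object instead of a stream of chunks, which keeps the example short at the cost of waiting for the full response.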

Should you wait for RTX Spark or use APIs now?

Your situation	Recommendation
Spend >$200/month on AI APIs	Wait for RTX Spark — breaks even within a year
Spend <$50/month	Stick with APIs — hardware ROI too slow
Need models >120B	Stick with APIs — DeepSeek/MiMo at $0.435/M
Privacy requirements	Wait for RTX Spark — all local, no data leaves device
Need it today	Buy Mac Studio or use APIs
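The break-even reasoning in the table can be made explicit. The sketch below is illustrative only: the hardware price is a hypothetical placeholder, since no price is given here, and the optional power cost is likewise an assumed input.

```python
import math


def break_even_months(hardware_price_usd: float, monthly_api_spend_usd: float,
                      monthly_power_cost_usd: float = 0.0) -> float:
    """Months until the hardware cost equals cumulative API savings (inf if never)."""
    monthly_savings = monthly_api_spend_usd - monthly_power_cost_usd
    if monthly_savings <= 0:
        return math.inf
    return hardware_price_usd / monthly_savings


if __name__ == "__main__":
    PRICE = 2400.0  # hypothetical placeholder, not an announced price
    for spend in (200, 50):
        months = break_even_months(PRICE, spend)
        print(f"${spend}/month in API spend -> break-even in {months:.0f} months")
```

Plug in the actual price once it is announced; the conclusion is sensitive to it, and to how much of your API usage a local model can really replace.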

FAQ

What’s the single best model to run on RTX Spark?

Qwen 3.6 27B at Q4_K_M quantization. It is NVIDIA’s optimization target (2× throughput demonstrated), fits easily in 128GB with room for large contexts, runs at 40-60 t/s, and scores competitively on coding benchmarks.

Can I run multiple models simultaneously?

Yes, if combined memory stays under ~110GB. Example: Qwen 3.6 27B (16GB) + a small 7B model (4GB) = 20GB, leaving plenty of room. Two 70B models at Q4 (~40GB each) fit on paper, but once both KV caches are counted there is little headroom left; at Q8 or FP16 they do not fit.
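A quick way to check a combination is to sum the weights plus a per-model KV cache allowance and compare against the budget. This is a sketch with assumed defaults: the 4GB cache allowance per model is a placeholder within the 2-8GB range quoted earlier, and the function name is illustrative.

```python
def combined_fits(model_sizes_gb, kv_cache_gb_each: float = 4.0,
                  budget_gb: float = 110.0):
    """Return (fits, total_gb) for models loaded side by side."""
    total = sum(model_sizes_gb) + kv_cache_gb_each * len(model_sizes_gb)
    return total <= budget_gb, total


if __name__ == "__main__":
    ok, total = combined_fits([16, 4])  # 27B at Q4 plus a small 7B model
    print(f"total ~{total:.0f} GB, fits={ok}")
```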

Will GGUF quantizations work on day one?

Yes. llama.cpp (which uses GGUF) will have NVIDIA-optimized builds ready at RTX Spark launch, with multi-token prediction and other Blackwell-specific optimizations.

How does the context window work with limited memory?

KV cache for context grows linearly with sequence length. At Q4 quantization, a 27B model uses ~16GB for weights; scaling the 2-8GB per 32K tokens figure from the memory budget gives roughly 8-32GB of KV cache at 128K context, or ~24-48GB total. That still leaves 60GB+ of headroom. For 1M token context, the KV cache grows significantly; NVIDIA claims 1M token support, but practical limits depend on the model.
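The KV cache size for a standard transformer can be estimated from the model's attention configuration: two tensors (keys and values) per layer, each sized by KV heads × head dimension × tokens. The sketch below uses a hypothetical configuration (48 layers, 8 KV heads, head dimension 128, FP16 cache) purely for illustration; it is not the configuration of any model named in this article.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * bytes_per_value * context_tokens)
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical config, chosen only to show the scaling.
    for ctx in (32_768, 131_072):
        print(f"{ctx} tokens: ~{kv_cache_gb(48, 8, 128, ctx):.1f} GB")
```

The estimate scales linearly with context length. Models with fewer KV heads, hybrid or sliding-window attention, or a quantized KV cache can land well below these figures.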

Is RTX Spark better than a GeForce RTX 5090 for local AI?

Yes, for model capacity. The RTX 5090 has 32GB VRAM, only enough for 14-27B models. RTX Spark’s 128GB unified memory handles models 4-5× larger. They are in completely different categories.

What about fine-tuning?

Fine-tuning requires 2-4× the memory of inference. On RTX Spark’s 128GB, you can realistically fine-tune models up to ~14-27B parameters using QLoRA. For larger model fine-tuning, cloud GPUs remain necessary.