NVIDIA RTX Spark ships this fall with 128GB of unified memory and a Blackwell GPU delivering 1 petaflop of AI compute. The question every developer is asking: which models can I actually run on it?
The answer depends on model architecture (dense vs MoE), quantization level, and how much context you need. This guide ranks the best models for RTX Spark by use case, with realistic memory estimates and expected performance.
The memory budget
With 128GB unified memory, you need to account for:
- Model weights: the main memory consumer
- KV cache: grows with context length (typically 2-8GB for 32K context)
- OS + applications: Windows needs ~8-12GB
- Overhead: CUDA runtime, buffers (~4-8GB)
Practical budget for model weights: ~100-110GB
This means:
- Models up to ~120B parameters at Q4_K_M quantization
- Models up to ~50B at FP16 (unquantized, 2 bytes per parameter)
- MoE models where total size fits, regardless of active parameters
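The arithmetic behind these limits can be sketched in a few lines. This is a rough estimate that assumes weights dominate memory use. The reserve figures are midpoints of the ranges listed above, and the `BITS_PER_WEIGHT` values (in particular ~4.8 bits for Q4_K_M) are assumed approximations rather than measured numbers.

```python
# Rough memory-budget arithmetic for a 128GB unified-memory machine.
# Reserve figures are midpoints of the ranges listed above; the
# bits-per-weight values are assumed approximations, not measurements.

TOTAL_GB = 128

RESERVED_GB = {
    "os_and_apps": 10,       # midpoint of ~8-12GB
    "runtime_overhead": 6,   # midpoint of ~4-8GB
    "kv_cache_32k": 5,       # midpoint of ~2-8GB
}

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q4_K_M": 4.8,  # assumed effective rate, including quantization scales
}


def weight_budget_gb():
    """Memory left for model weights after the fixed reservations."""
    return TOTAL_GB - sum(RESERVED_GB.values())


def weights_gb(params_billion, quant):
    """Approximate weight size in GB, counting 1GB as 1e9 bytes."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion, quant):
    """True if the weights alone fit inside the weight budget."""
    return weights_gb(params_billion, quant) <= weight_budget_gb()


for params, quant in [(27, "Q4_K_M"), (27, "FP16"), (70, "FP16"), (120, "Q4_K_M")]:
    size = weights_gb(params, quant)
    verdict = "fits" if fits(params, quant) else "does not fit"
    print(f"{params}B at {quant}: ~{size:.0f}GB, {verdict}")
```

Swapping in your own reserve figures (for example, a larger KV cache allowance for long contexts) shifts the budget accordingly.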
Tier 1: Models that run great (under 50GB)
These models load quickly, leave room for large context windows, and run at 30-60+ tokens/second on RTX Spark.
| Model | Params | Memory (Q4) | Memory (FP16) | Best for | Speed (est.) |
|---|---|---|---|---|---|
| Qwen 3.6 27B | 27B | ~16GB | ~54GB | Coding, general | 40-60 t/s |
| Qwen 3.6 35B-A3B | 35B (3B active) | ~20GB | ~70GB* | Lightweight tasks | 80+ t/s |
| Gemma 4 27B | 27B | ~16GB | ~54GB | General, multimodal | 40-60 t/s |
| Mistral Medium 3.5 | ~40B | ~24GB | ~80GB | Coding, reasoning | 30-50 t/s |
| Qwen 3.7 27B | 27B | ~16GB | ~54GB | Latest Qwen coding | 40-60 t/s |
*Qwen 3.6 35B-A3B is an MoE model: only 3B parameters are active per token, so it runs extremely fast despite 35B total parameters. All 35B parameters must still be held in memory, so its footprint is that of a 35B model.
Recommendation for this tier: Qwen 3.6 27B is the NVIDIA-optimized choice (2× throughput with multi-token prediction). It is the best all-around model for RTX Spark.
Tier 2: Models that fit well (50-90GB)
These models use most of the available memory but run at good speeds. Context window may be limited to 32-64K tokens.
| Model | Params | Memory (Q4) | Best for | Speed (est.) |
|---|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | ~60GB | General, multilingual | 20-35 t/s |
| DeepSeek V4 Flash | ~70B active | ~40GB | Fast coding | 25-40 t/s |
| Qwen 3.6 27B (FP16) | 27B | ~54GB | Maximum quality | 25-40 t/s |
| Granite 4.1 34B | 34B | ~20GB | Enterprise, tool use | 35-50 t/s |
| Devstral 2 | ~50B (est.) | ~30GB | Coding (Mistral) | 30-45 t/s |
Recommendation for this tier: Llama 4 Scout is remarkable on RTX Spark, with 109B total parameters but only 17B active, meaning it has the knowledge of a 109B model with the speed of a 17B model. The MoE architecture is perfectly suited to 128GB unified memory.
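The MoE trade-off can be made concrete with a back-of-the-envelope sketch: memory is set by total parameters, while decode speed is bounded by how many weights are read per token. This assumes token generation is memory-bandwidth-bound and that Q4 averages ~4.8 bits per weight. The `BANDWIDTH_GBPS` value is a hypothetical placeholder, not an RTX Spark specification, so the output is a ceiling to compare between models and not a prediction.

```python
# Memory footprint vs. decode-speed ceiling for dense and MoE models.
# BANDWIDTH_GBPS is a hypothetical placeholder, not a published spec.

BITS_Q4 = 4.8           # assumed effective bits per weight at Q4
BANDWIDTH_GBPS = 250.0  # hypothetical memory bandwidth in GB/s


def resident_gb(total_params_billion, bits=BITS_Q4):
    """All experts must be loaded, so memory scales with total parameters."""
    return total_params_billion * bits / 8


def decode_ceiling_tps(active_params_billion, bandwidth_gbps=BANDWIDTH_GBPS, bits=BITS_Q4):
    """Upper bound on tokens/s if each token reads every active weight once."""
    gb_read_per_token = active_params_billion * bits / 8
    return bandwidth_gbps / gb_read_per_token


models = {
    "Llama 4 Scout (MoE)": (109, 17),    # (total, active) in billions
    "Dense 109B (hypothetical)": (109, 109),
}

for name, (total, active) in models.items():
    print(
        f"{name}: ~{resident_gb(total):.0f}GB resident, "
        f"ceiling ~{decode_ceiling_tps(active):.1f} t/s"
    )
```

Both models occupy the same memory, but the MoE ceiling is higher by the ratio of total to active parameters (109/17, about 6.4×), whatever the real bandwidth turns out to be.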
Tier 3: Models that fit tight (90-110GB)
These models work but leave minimal room for context or other processes. Best for batch processing or dedicated inference.
| Model | Params | Memory (Q4) | Best for | Speed (est.) |
|---|---|---|---|---|
| 120B dense model | 120B | ~70GB | Maximum local quality | 10-18 t/s |
| Nemotron 3 Super | 97B | ~56GB | NVIDIA-optimized | 12-20 t/s |
| Falcon H1R 7B + larger variants | Various | Various | Arabic + English | Various |
Recommendation for this tier: Unless you specifically need the largest possible model, the Tier 1-2 models offer much better speed and context length. A 120B model at 10 t/s is painfully slow for interactive use.
What does NOT fit
These popular models cannot run on RTX Spark's 128GB:
| Model | Why it doesn't fit | Alternative |
|---|---|---|
| DeepSeek V4-Pro | 1.6T total, needs 200GB+ even quantized | Use API at $0.435/M |
| MiMo V2.5 Pro | Dense, exceeds 128GB | Use API at $0.435/M |
| MiniMax M3 | Estimated 200-400B, too large | Use API at $0.60/M |
| Claude Opus 4.8 | Closed source, API only | Use API at $5/M |
| GPT-5.5 | Closed source, API only | Use API |
For these models, stick with APIs. At Chinese model prices ($0.435-0.87/M tokens), the API cost is often lower than the electricity cost of running a local machine 24/7.
Best model for each use case
| Use case | Best model on RTX Spark | Why |
|---|---|---|
| General coding | Qwen 3.6 27B (Q4) | NVIDIA-optimized, 2× throughput, excellent coding |
| Maximum coding quality | Qwen 3.7 27B or Mistral Medium 3.5 | Latest models, strong SWE-bench scores |
| Fastest responses | Qwen 3.6 35B-A3B | 3B active params = 80+ t/s |
| Largest knowledge base | Llama 4 Scout (109B MoE) | 109B params worth of knowledge, 17B speed |
| Multimodal (images) | Gemma 4 27B | Native vision support |
| Privacy-sensitive | Any model above | All run 100% locally, no data leaves device |
| Long context (128K+) | Qwen 3.6 27B (Q4) | Small enough to leave room for KV cache |
| Enterprise/tool use | Granite 4.1 34B | IBM's tool-calling-optimized model |
Quantization guide for RTX Spark
| Quantization | Quality loss | Memory savings | Recommended? |
|---|---|---|---|
| FP16 | None | Baseline | For models ≤50B |
| Q8 | Minimal | 50% | Good balance for 60-80B |
| Q6_K | Very low | 60% | Higher quality when memory allows |
| Q4_K_M | Low | 75% | Best for 70-120B models |
| Q3_K | Moderate | 80% | Only if model doesn't fit otherwise |
| Q2_K | High | 87% | Not recommended (noticeable quality loss) |
For RTX Spark with 128GB, Q4_K_M is the sweet spot for most models. It offers a 75% memory reduction relative to FP16 with low quality loss, enough to fit models up to 120B parameters while leaving room for context.
Setup when RTX Spark launches
When RTX Spark ships this fall, setup will be straightforward:
```shell
# Option 1: Ollama (easiest)
winget install ollama
ollama pull qwen3.6:27b-q4_K_M
ollama run qwen3.6:27b-q4_K_M

# Option 2: LM Studio (GUI)
# Download from lmstudio.ai, search for Qwen 3.6 27B

# Option 3: llama.cpp (most control, NVIDIA-optimized)
# Will include multi-token prediction for 2x throughput
```
For detailed setup guides, see Ollama complete guide, LM Studio guide, and how to run models locally.
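Once a model is pulled, Ollama also serves it over a local HTTP API, which is how most scripts and editor integrations talk to it. The sketch below assumes Ollama is running on its default port (11434) and that the model tag from the commands above exists in the registry at launch, which is not yet confirmed. The helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Model tag copied from the setup commands above; its availability is an assumption.
DEFAULT_MODEL = "qwen3.6:27b-q4_K_M"


def build_request(prompt, model=DEFAULT_MODEL):
    """Build a POST request asking Ollama for one non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model=DEFAULT_MODEL, timeout=300):
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(generate("Write a Python function that reverses a string."))` returns the completion as a plain string. Nothing leaves the machine, since the request goes to localhost.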
Should you wait for RTX Spark or use APIs now?
| Your situation | Recommendation |
|---|---|
| Spend >$200/month on AI APIs | Wait for RTX Spark: breaks even within a year |
| Spend <$50/month | Stick with APIs: hardware ROI too slow |
| Need models >120B | Stick with APIs: DeepSeek/MiMo at $0.435/M |
| Privacy requirements | Wait for RTX Spark: all local, no data leaves device |
| Need it today | Buy Mac Studio or use APIs |
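The break-even reasoning in the table reduces to dividing the hardware cost by the monthly savings. The sketch below uses a placeholder price (`HYPOTHETICAL_PRICE_USD`), chosen only to make the arithmetic easy to follow. RTX Spark pricing is not given in this guide, so substitute the real figure and your own electricity cost.

```python
# Break-even estimate for buying local hardware instead of paying for APIs.
# HYPOTHETICAL_PRICE_USD is a placeholder, not an announced RTX Spark price.

HYPOTHETICAL_PRICE_USD = 2400


def break_even_months(hardware_cost, monthly_api_spend, monthly_power_cost=0.0):
    """Months until hardware pays for itself, or None if it never does."""
    monthly_savings = monthly_api_spend - monthly_power_cost
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings


for spend in (50, 200):
    months = break_even_months(HYPOTHETICAL_PRICE_USD, spend)
    print(f"${spend}/month on APIs: break-even after {months:.0f} months")
```

The function returns `None` when electricity costs meet or exceed the API spend, which is the case where local hardware never pays for itself.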
FAQ
What's the single best model to run on RTX Spark?
Qwen 3.6 27B at Q4_K_M quantization. It is NVIDIA's optimization target (2× throughput demonstrated), fits easily in 128GB with room for large contexts, runs at 40-60 t/s, and scores competitively on coding benchmarks.
Can I run multiple models simultaneously?
Yes, if combined memory stays under ~110GB. Example: Qwen 3.6 27B (16GB) + a small 7B model (4GB) = 20GB, leaving plenty of room. Two 70B models at Q4 (~40GB each) fit on paper but leave little room for context, and at Q8 or FP16 they do not fit at all.
Will GGUF quantizations work on day one?
Yes. llama.cpp (which uses GGUF) will have NVIDIA-optimized builds ready at RTX Spark launch, with multi-token prediction and other Blackwell-specific optimizations.
How does the context window work with limited memory?
KV cache for context grows with sequence length. At Q4 quantization, a 27B model with 128K context uses ~16GB for the model plus roughly 8-32GB of KV cache (scaling the 2-8GB per 32K estimate linearly), or ~24-48GB in total. That still leaves 60GB+ of headroom. For 1M token context, the KV cache grows significantly. NVIDIA claims 1M token support, but practical limits depend on the model.
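The growth is linear in sequence length and follows directly from the attention layout: one key and one value vector per layer, per KV head, per token. The sketch below uses hypothetical model dimensions (48 layers, 8 KV heads, head size 128, 16-bit cache), which are not the published configuration of any model named here. Check the model card for real values.

```python
# KV cache size for a transformer with grouped-query attention.
# The default dimensions are hypothetical, chosen only for illustration.


def kv_cache_bytes(seq_len, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


for ctx in (32_768, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens: ~{kv_cache_bytes(ctx) / 1e9:.1f}GB")
```

Models with fewer KV heads, or runtimes that quantize the cache to 8 bits, shrink these figures proportionally, which is why practical long-context limits vary so much between models.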
Is RTX Spark better than a GeForce RTX 5090 for local AI?
Yes. The RTX 5090 has 32GB VRAM, only enough for 14-27B models. RTX Spark's 128GB unified memory handles models 4-5× larger. They are in completely different categories.
What about fine-tuning?
Fine-tuning requires 2-4× the memory of inference. On RTX Spark's 128GB, you can realistically fine-tune models up to ~14-27B parameters using QLoRA. For larger model fine-tuning, cloud GPUs remain necessary.