DiffusionGemma is here, it’s open-source under Apache 2.0, and it promises 1,000+ tokens per second on consumer hardware. But “consumer hardware” comes with some asterisks. Let me walk you through exactly what you need, how to set it up, and what performance to realistically expect on different GPUs.
The short version: you need an NVIDIA GPU with 18GB+ VRAM. This model is heavily optimized for NVIDIA’s ecosystem — RTX PRO, DGX Spark, and GeForce RTX cards. If you’re on AMD or Apple Silicon, the story is more complicated. Let’s dig in.
Hardware Requirements
DiffusionGemma uses the NVFP4 data format to compress its 26B parameters (3.8B active per token) into 18GB of VRAM. Here’s the minimum and recommended hardware:
Minimum Requirements
- GPU: NVIDIA GPU with 18GB+ VRAM
- System RAM: 32GB recommended (16GB minimum with swap)
- Storage: ~20GB for model weights
- CUDA: 12.4+
- Driver: 560.0+ (latest Game Ready or Studio driver)
Recommended Setups
| GPU | VRAM | Expected Speed | Notes |
|---|---|---|---|
| RTX 5090 | 32GB | 1200+ tok/s | Best consumer option |
| RTX 4090 | 24GB | 1000+ tok/s | Sweet spot price/performance |
| RTX PRO 6000 | 48GB | 1400+ tok/s | Professional workstation |
| RTX 4080 | 16GB | Not supported | Insufficient VRAM |
| RTX A6000 | 48GB | 1100+ tok/s | Data center option |
| DGX Spark | 128GB | 1500+ tok/s | NVIDIA’s local AI workstation |
The 18GB floor is firm because NVFP4 is already an aggressive quantization format — you can’t compress further without significant quality loss. If you’re planning a GPU purchase for local AI, our how much VRAM AI models need guide covers the broader landscape.
Installation on NVIDIA RTX (Linux/Windows)
Here’s the setup process for NVIDIA GPUs. This is the primary supported path.
Step 1: Verify Your Environment
# Check NVIDIA driver version (need 560.0+)
nvidia-smi
# Check CUDA version (need 12.4+)
nvcc --version
# Verify available VRAM
nvidia-smi --query-gpu=memory.total --format=csv
Step 2: Install Dependencies
# Create a fresh Python environment
python -m venv diffusiongemma-env
source diffusiongemma-env/bin/activate # Linux/Mac
# diffusiongemma-env\Scripts\activate # Windows
# Install PyTorch with CUDA 12.4 support
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# Install DiffusionGemma
pip install diffusiongemma
Step 3: Download the Model
# Using HuggingFace CLI
huggingface-cli download google/diffusiongemma-26b-nvfp4
# Or within Python
from diffusiongemma import DiffusionGemmaModel
model = DiffusionGemmaModel.from_pretrained("google/diffusiongemma-26b-nvfp4")
Step 4: Run Inference
from diffusiongemma import DiffusionGemmaModel, DiffusionConfig
# Load model with optimized settings
config = DiffusionConfig(
num_diffusion_steps=16,
device="cuda",
dtype="nvfp4"
)
model = DiffusionGemmaModel.from_pretrained(
"google/diffusiongemma-26b-nvfp4",
config=config
)
# Generate text
output = model.generate(
prompt="Write a Python function to calculate fibonacci numbers",
max_tokens=256
)
print(output)
For Windows-specific setup details, see our run AI locally on Windows guide.
RTX AI Garage Setup
NVIDIA’s RTX AI Garage provides an optimized runtime for DiffusionGemma. This is the path with the best performance and least friction:
# Install RTX AI Garage (requires NVIDIA account)
# Download from nvidia.com/rtx-ai-garage
# Launch DiffusionGemma through the AI Garage interface
rtx-ai-garage launch diffusiongemma
# Or use the API endpoint it exposes
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "diffusiongemma-26b",
"prompt": "Explain recursion simply",
"max_tokens": 200,
"diffusion_steps": 16
}'
The RTX AI Garage handles driver compatibility, CUDA optimization, and memory management automatically. It’s the recommended path for users who don’t want to manage Python environments manually.
For more on NVIDIA’s dedicated local AI hardware, check the NVIDIA RTX Spark complete guide.
Performance Tuning
DiffusionGemma’s speed depends heavily on configuration. Here are the key parameters:
Diffusion Steps
The num_diffusion_steps parameter is your primary speed/quality dial:
# Fast mode (8 steps) — ~1500 tok/s, lower quality
fast_output = model.generate(prompt=prompt, max_tokens=256, num_diffusion_steps=8)
# Balanced mode (16 steps) — ~1000 tok/s, good quality
balanced_output = model.generate(prompt=prompt, max_tokens=256, num_diffusion_steps=16)
# Quality mode (24 steps) — ~700 tok/s, near-autoregressive quality
quality_output = model.generate(prompt=prompt, max_tokens=256, num_diffusion_steps=24)
Batch Size
Because diffusion generates all tokens in parallel, generating longer sequences doesn’t proportionally slow things down:
# 100 tokens: ~1000 tok/s total throughput
# 500 tokens: ~1000 tok/s total throughput
# 1000 tokens: ~900 tok/s total throughput (slight degradation)
This is a key advantage over autoregressive models where speed scales linearly with output length.
Memory Optimization
If you’re right at the 18GB limit:
config = DiffusionConfig(
num_diffusion_steps=16,
device="cuda",
dtype="nvfp4",
attention_slicing=True, # Reduces peak VRAM at slight speed cost
cpu_offload_unused_experts=True # Offload inactive MoE experts
)
Expected Speeds by GPU
Based on early benchmarks with 16 diffusion steps and 256-token outputs:
| GPU | Tokens/Second | Notes |
|---|---|---|
| RTX 5090 (32GB) | 1,200-1,400 | Best consumer performance |
| RTX 4090 (24GB) | 1,000-1,200 | Meets Google’s headline claim |
| RTX PRO 6000 (48GB) | 1,300-1,500 | Professional card, extra headroom |
| RTX A6000 (48GB) | 1,000-1,200 | Ampere architecture, still fast |
| DGX Spark | 1,400-1,600 | Purpose-built for local AI |
| RTX 3090 (24GB) | 600-800 | Older architecture, still viable |
For context, autoregressive models at similar quality typically achieve 30-80 tok/s on these same GPUs. The 4x claim holds across the board, with newer architectures seeing even larger gaps.
Understanding the difference between GPU architectures for AI workloads is covered in our GPU vs CPU AI inference article.
Mac and Apple Silicon: Current Status
Let me be upfront: DiffusionGemma is not optimized for Apple Silicon at launch. The NVFP4 format is NVIDIA-specific, and the inference optimizations target CUDA.
That said, here’s the situation:
- Native Metal/MPS support: Not available at launch
- Community ports: Expected within weeks via llama.cpp or MLX adaptations
- Performance expectation: Likely significantly slower than NVIDIA (2-4x) due to lack of hardware-specific optimizations
- VRAM (unified memory): M4 Pro (24GB) and M4 Max (48GB+) have sufficient memory
If you’re on Apple Silicon and need fast local inference today, autoregressive models remain your better bet. Check our LLM inference on Apple Silicon and best AI models for Mac M4 guides for current top picks.
AMD GPUs: Also Limited
AMD ROCm support is not available at launch. The NVFP4 format and CUDA-specific kernels don’t have AMD equivalents yet. Community efforts may bring ROCm support eventually, but don’t count on it short-term.
Troubleshooting Common Issues
Out of Memory Errors
RuntimeError: CUDA out of memory
Solutions:
- Close other GPU-using applications
- Enable
attention_slicing=True - Enable
cpu_offload_unused_experts=True - Reduce
max_tokensfor the generation - Verify no other process is consuming VRAM with
nvidia-smi
Slow First Generation
The first generation after loading is always slower due to CUDA kernel compilation and warmup. Subsequent generations will be at full speed. This is normal for optimized inference libraries.
Quality Issues
If outputs seem incoherent:
- Increase
num_diffusion_steps(try 20-24) - Check that you’re using the correct NVFP4 weights (not incorrectly quantized variants)
- Ensure your prompt is well-formatted
Driver Compatibility
# If you get CUDA errors, update your driver
# Download latest from nvidia.com/drivers
# Minimum: 560.0 for NVFP4 support
Comparing Inference Options
DiffusionGemma currently has a more limited inference ecosystem than mature autoregressive models. Here’s the landscape:
| Method | Status | Notes |
|---|---|---|
| RTX AI Garage | ✅ Supported | Best performance, easiest setup |
| Python SDK | ✅ Supported | Direct API, most flexible |
| vLLM | 🔄 Coming soon | Batch inference optimization |
| Ollama | ❌ Not yet | Requires diffusion support |
| llama.cpp | ❌ Not yet | Autoregressive only currently |
| LM Studio | ❌ Not yet | Autoregressive only currently |
For the autoregressive model ecosystem comparison, see our vLLM vs Ollama vs llama.cpp vs TGI breakdown.
Practical Use Cases for Local Deployment
Once you have DiffusionGemma running, here are high-value local use cases:
- Real-time coding assistant: Sub-second response times for code completion
- Local chatbot: Instant responses without cloud API latency
- Content pipeline: Generate hundreds of summaries/descriptions per minute
- Document processing: Rapid extraction and reformatting at scale
- Interactive applications: Real-time translation, live summarization
Frequently Asked Questions
Can I run DiffusionGemma on an RTX 4080 (16GB)?
No. The NVFP4 model requires 18GB VRAM minimum. The RTX 4080’s 16GB is insufficient. Your options are RTX 4090 (24GB), RTX 5090 (32GB), or professional cards like A6000/RTX PRO. There’s no smaller quantized variant available that maintains the diffusion speed advantage.
Is DiffusionGemma available on Ollama or LM Studio?
Not at launch. These tools are built for autoregressive models and don’t support the diffusion generation paradigm yet. The primary interfaces are NVIDIA’s RTX AI Garage and the Python SDK. Community integration with popular tools will depend on those tools adding diffusion support.
How does DiffusionGemma compare to running Gemma 4 27B locally?
On the same hardware (RTX 4090), Gemma 4 27B generates ~40 tokens/second autoregressively. DiffusionGemma generates 1000+ tokens/second. That’s 25x faster raw throughput. However, Gemma 4 27B produces higher quality output on complex reasoning tasks. Choose based on whether speed or quality matters more for your use case.
Can I fine-tune DiffusionGemma locally?
The Apache 2.0 license allows it, but fine-tuning diffusion models requires more VRAM than inference (typically 2-3x). You’d need 48GB+ VRAM for fine-tuning. The techniques also differ from standard autoregressive fine-tuning — traditional LoRA may not directly apply. Watch for community-developed fine-tuning guides.
What’s the maximum output length?
DiffusionGemma can generate sequences up to 2048 tokens in a single diffusion pass. For longer outputs, multiple passes can be chained, though this adds complexity. Most use cases (chat responses, summaries, code snippets) fit well within this limit.
Does it support system prompts and chat templates?
Yes, DiffusionGemma supports standard prompt formatting including system prompts. The denoising process is conditioned on the full prompt context. Use the same prompt templates you’d use with other Gemma family models.
What’s Next
DiffusionGemma is day one of a new paradigm for local AI inference. Expect rapid ecosystem development: framework integrations, community quantization experiments, fine-tuned variants, and performance optimizations. The 18GB VRAM floor may drop as better compression techniques emerge.
If you have an RTX 4090 or better, there’s no reason not to try it today. The setup is straightforward, the speed is genuinely impressive, and the Apache 2.0 license means you can use it for anything. Just remember it’s experimental — validate outputs for critical use cases.