Jun 11, 2026 · 8 min read

How to Run DiffusionGemma Locally: RTX, Mac, and Hardware Guide (2026)

DiffusionGemma is here, it’s open-source under Apache 2.0, and it promises 1,000+ tokens per second on consumer hardware. But “consumer hardware” comes with some asterisks. Let me walk you through exactly what you need, how to set it up, and what performance to realistically expect on different GPUs.

The short version: you need an NVIDIA GPU with 18GB+ VRAM. This model is heavily optimized for NVIDIA’s ecosystem — RTX PRO, DGX Spark, and GeForce RTX cards. If you’re on AMD or Apple Silicon, the story is more complicated. Let’s dig in.

Hardware Requirements

DiffusionGemma uses the NVFP4 data format to compress its 26B parameters (3.8B active per token) into 18GB of VRAM. Here’s the minimum and recommended hardware:

Minimum Requirements

GPU: NVIDIA GPU with 18GB+ VRAM
System RAM: 32GB recommended (16GB minimum with swap)
Storage: ~20GB for model weights
CUDA: 12.4+
Driver: 560.0+ (latest Game Ready or Studio driver)

Recommended Setups

GPU	VRAM	Expected Speed	Notes
RTX 5090	32GB	1200+ tok/s	Best consumer option
RTX 4090	24GB	1000+ tok/s	Sweet spot price/performance
RTX PRO 6000	48GB	1300+ tok/s	Professional workstation
RTX 4080	16GB	Not supported	Insufficient VRAM
RTX A6000	48GB	1000+ tok/s	Workstation card (Ampere)
DGX Spark	128GB	1400+ tok/s	NVIDIA’s local AI workstation

The 18GB floor is firm because NVFP4 is already an aggressive quantization format — you can’t compress further without significant quality loss. If you’re planning a GPU purchase for local AI, our how much VRAM AI models need guide covers the broader landscape.
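
As a rough sanity check on the 18GB figure, the weight footprint can be estimated from the parameter count. The sketch below assumes every one of the 26B parameters is stored in NVFP4 as a 4-bit value plus one 8-bit scale shared by each 16-value block (about 4.5 bits per weight). Real checkpoints usually keep some layers at higher precision, so treat the result as a lower bound rather than a published number.

```python
def nvfp4_weight_gb(num_params: float, block_size: int = 16) -> float:
    """Estimate weight storage in GB (1e9 bytes) for NVFP4-quantized parameters."""
    bits_per_value = 4                      # FP4 payload per weight
    scale_bits_per_value = 8 / block_size   # one FP8 scale shared by each block
    total_bits = num_params * (bits_per_value + scale_bits_per_value)
    return total_bits / 8 / 1e9


weights_gb = nvfp4_weight_gb(26e9)
headroom_gb = 18 - weights_gb
print(f"weights: {weights_gb:.1f} GB, headroom in 18 GB: {headroom_gb:.1f} GB")
```

Under these assumptions the weights alone take about 14.6GB, which leaves roughly 3.4GB of the 18GB for activations and runtime buffers.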

Installation on NVIDIA RTX (Linux/Windows)

Here’s the setup process for NVIDIA GPUs. This is the primary supported path.

Step 1: Verify Your Environment

# Check NVIDIA driver version (need 560.0+)
nvidia-smi

# Check CUDA version (need 12.4+)
nvcc --version

# Verify available VRAM
nvidia-smi --query-gpu=memory.total --format=csv
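
If you would rather script these checks, the sketch below parses nvidia-smi output in Python. The --query-gpu fields and the csv,noheader,nounits format are standard nvidia-smi options; the helper names and thresholds (18GB read as 18 x 1024 MiB, driver 560.0) are assumptions taken from the requirements above.

```python
import subprocess

MIN_VRAM_MIB = 18 * 1024      # assumption: the 18GB floor read as 18 GiB
MIN_DRIVER = (560, 0)         # minimum driver from the requirements list


def parse_gpu_rows(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'driver_version, memory.total' rows into (driver, MiB) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        driver, mem = (field.strip() for field in line.split(","))
        rows.append((driver, int(mem)))
    return rows


def meets_requirements(driver: str, vram_mib: int) -> bool:
    """True if the driver and VRAM both clear the stated minimums."""
    version = tuple(int(part) for part in driver.split(".")[:2])
    return version >= MIN_DRIVER and vram_mib >= MIN_VRAM_MIB


def query_gpus() -> list[tuple[str, int]]:
    """Run nvidia-smi and return one (driver, MiB) tuple per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_rows(out)
```

On a machine with the NVIDIA driver installed, looping over query_gpus() and passing each tuple to meets_requirements() gives a quick pass/fail per GPU.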

Step 2: Install Dependencies

# Create a fresh Python environment
python -m venv diffusiongemma-env
source diffusiongemma-env/bin/activate  # Linux/Mac
# diffusiongemma-env\Scripts\activate   # Windows

# Install PyTorch with CUDA 12.4 support
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

# Install DiffusionGemma
pip install diffusiongemma

Step 3: Download the Model

# Using HuggingFace CLI
huggingface-cli download google/diffusiongemma-26b-nvfp4

# Or within Python
from diffusiongemma import DiffusionGemmaModel
model = DiffusionGemmaModel.from_pretrained("google/diffusiongemma-26b-nvfp4")

Step 4: Run Inference

from diffusiongemma import DiffusionGemmaModel, DiffusionConfig

# Load model with optimized settings
config = DiffusionConfig(
    num_diffusion_steps=16,
    device="cuda",
    dtype="nvfp4"
)

model = DiffusionGemmaModel.from_pretrained(
    "google/diffusiongemma-26b-nvfp4",
    config=config
)

# Generate text
output = model.generate(
    prompt="Write a Python function to calculate fibonacci numbers",
    max_tokens=256
)
print(output)

For Windows-specific setup details, see our run AI locally on Windows guide.

RTX AI Garage Setup

NVIDIA’s RTX AI Garage provides an optimized runtime for DiffusionGemma. This is the path with the best performance and least friction:

# Install RTX AI Garage (requires NVIDIA account)
# Download from nvidia.com/rtx-ai-garage

# Launch DiffusionGemma through the AI Garage interface
rtx-ai-garage launch diffusiongemma

# Or use the API endpoint it exposes
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "diffusiongemma-26b",
    "prompt": "Explain recursion simply",
    "max_tokens": 200,
    "diffusion_steps": 16
  }'

The RTX AI Garage handles driver compatibility, CUDA optimization, and memory management automatically. It’s the recommended path for users who don’t want to manage Python environments manually.
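
The same request can be sent from Python with only the standard library. This sketch reuses the endpoint and JSON fields from the curl example above; the helper names are made up for illustration, and the response is returned as parsed JSON without assuming anything about its schema.

```python
import json
import urllib.request

ENDPOINT = "http://localhost:8080/v1/completions"  # from the curl example


def build_request(prompt: str, max_tokens: int = 200,
                  diffusion_steps: int = 16) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {
        "model": "diffusiongemma-26b",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "diffusion_steps": diffusion_steps,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.load(resp)
```

Keeping build_request separate from complete makes it easy to inspect or log the payload before anything is sent to the local server.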

For more on NVIDIA’s dedicated local AI hardware, check the NVIDIA RTX Spark complete guide.

Performance Tuning

DiffusionGemma’s speed depends heavily on configuration. Here are the key parameters:

Diffusion Steps

The num_diffusion_steps parameter is your primary speed/quality dial:

# Example prompt, reused from Step 4
prompt = "Write a Python function to calculate fibonacci numbers"

# Fast mode (8 steps): ~1500 tok/s, lower quality
fast_output = model.generate(prompt=prompt, max_tokens=256, num_diffusion_steps=8)

# Balanced mode (16 steps): ~1000 tok/s, good quality
balanced_output = model.generate(prompt=prompt, max_tokens=256, num_diffusion_steps=16)

# Quality mode (24 steps): ~700 tok/s, near-autoregressive quality
quality_output = model.generate(prompt=prompt, max_tokens=256, num_diffusion_steps=24)
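
To make the trade-off concrete, the approximate throughput figures quoted in the comments above can be turned into wall-clock estimates. This is plain arithmetic on those rough numbers, not a benchmark; the lookup table and helper name are illustrative.

```python
# Approximate tok/s per step count, copied from the presets above.
THROUGHPUT_BY_STEPS = {8: 1500, 16: 1000, 24: 700}


def estimated_seconds(max_tokens: int, num_diffusion_steps: int) -> float:
    """Estimate generation time from the quoted throughput for a preset."""
    return max_tokens / THROUGHPUT_BY_STEPS[num_diffusion_steps]


for steps in sorted(THROUGHPUT_BY_STEPS):
    print(f"{steps} steps: ~{estimated_seconds(256, steps):.2f} s for 256 tokens")
```

For a 256-token reply that works out to roughly 0.17 s, 0.26 s, and 0.37 s, so even quality mode stays well under half a second on these figures.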

Output Length

Because diffusion generates all tokens in parallel, generating longer sequences doesn’t proportionally slow things down:

# 100 tokens: ~1000 tok/s total throughput
# 500 tokens: ~1000 tok/s total throughput  
# 1000 tokens: ~900 tok/s total throughput (slight degradation)

This is a key advantage over autoregressive models, where generation time scales linearly with output length.
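
One way to see why is to count sequential forward passes. In the simplified model below, an autoregressive decoder needs one pass per generated token, while a diffusion decoder needs one pass per denoising step regardless of length. It ignores the cost of each pass, which does grow with sequence length and would account for the slight slowdown at 1000 tokens.

```python
def autoregressive_passes(num_tokens: int) -> int:
    """One forward pass per generated token."""
    return num_tokens


def diffusion_passes(num_tokens: int, num_diffusion_steps: int = 16) -> int:
    """One forward pass per denoising step, independent of output length."""
    return num_diffusion_steps if num_tokens > 0 else 0


for n in (100, 500, 1000):
    print(n, autoregressive_passes(n), diffusion_passes(n))
```

The autoregressive count grows from 100 to 1000 while the diffusion count stays at 16, which is the intuition behind the flat throughput numbers.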

Memory Optimization

If you’re right at the 18GB limit:

config = DiffusionConfig(
    num_diffusion_steps=16,
    device="cuda",
    dtype="nvfp4",
    attention_slicing=True,  # Reduces peak VRAM at slight speed cost
    cpu_offload_unused_experts=True  # Offload inactive MoE experts
)

Expected Speeds by GPU

Based on early benchmarks with 16 diffusion steps and 256-token outputs:

GPU	Tokens/Second	Notes
RTX 5090 (32GB)	1,200-1,400	Best consumer performance
RTX 4090 (24GB)	1,000-1,200	Meets Google’s headline claim
RTX PRO 6000 (48GB)	1,300-1,500	Professional card, extra headroom
RTX A6000 (48GB)	1,000-1,200	Ampere architecture, still fast
DGX Spark	1,400-1,600	Purpose-built for local AI
RTX 3090 (24GB)	600-800	Older architecture, still viable

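For context, autoregressive models at similar quality typically achieve 30-80 tok/s on these same GPUs. Even the slowest card in the table (600 tok/s on the RTX 3090) is about 7.5x faster than the top of that range, so the 4x claim holds across the board, with newer architectures seeing even larger gaps.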

Understanding the difference between GPU architectures for AI workloads is covered in our GPU vs CPU AI inference article.

Mac and Apple Silicon: Current Status

Let me be upfront: DiffusionGemma is not optimized for Apple Silicon at launch. The NVFP4 format is NVIDIA-specific, and the inference optimizations target CUDA.

That said, here’s the situation:

Native Metal/MPS support: Not available at launch
Community ports: Expected within weeks via llama.cpp or MLX adaptations
Performance expectation: Likely 2-4x slower than NVIDIA due to lack of hardware-specific optimizations
VRAM (unified memory): M4 Pro (24GB) and M4 Max (48GB+) have sufficient memory

If you’re on Apple Silicon and need fast local inference today, autoregressive models remain your better bet. Check our LLM inference on Apple Silicon and best AI models for Mac M4 guides for current top picks.

AMD GPUs: Also Limited

AMD ROCm support is not available at launch. The NVFP4 format and CUDA-specific kernels don’t have AMD equivalents yet. Community efforts may bring ROCm support eventually, but don’t count on it short-term.

Troubleshooting Common Issues

Out of Memory Errors

RuntimeError: CUDA out of memory

Solutions:

Close other GPU-using applications
Enable attention_slicing=True
Enable cpu_offload_unused_experts=True
Reduce max_tokens for the generation
Verify no other process is consuming VRAM with nvidia-smi
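
The max_tokens suggestion can be automated with a small retry wrapper. This is a generic sketch: it takes any generate-style callable that accepts prompt and max_tokens keywords (as the model.generate calls earlier do), and it detects the failure by matching the message text shown above, which is an assumption about how the error surfaces.

```python
from typing import Callable


def generate_with_backoff(generate: Callable[..., str], prompt: str,
                          max_tokens: int = 256, min_tokens: int = 32) -> str:
    """Retry generation with half the token budget after each CUDA OOM."""
    budget = max_tokens
    while True:
        try:
            return generate(prompt=prompt, max_tokens=budget)
        except RuntimeError as err:
            if "out of memory" not in str(err) or budget // 2 < min_tokens:
                raise  # not an OOM, or nothing left to trim
            budget //= 2
```

Pass model.generate as the callable. Any error that is not an out-of-memory failure is re-raised unchanged, so real bugs are not hidden by the retries.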

Slow First Generation

The first generation after loading is always slower due to CUDA kernel compilation and warmup. Subsequent generations will be at full speed. This is normal for optimized inference libraries.
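
To keep the warmup cost out of your own measurements, run a throwaway generation before timing. The helper below is a generic sketch that works with any zero-argument callable; wrapping model.generate in a lambda is the assumed usage.

```python
import time
from typing import Callable


def time_after_warmup(run: Callable[[], object], warmup: int = 1,
                      repeats: int = 3) -> float:
    """Return the mean seconds per call, excluding warmup calls."""
    for _ in range(warmup):
        run()  # absorbs one-time setup cost such as kernel compilation
    start = time.perf_counter()
    for _ in range(repeats):
        run()
    return (time.perf_counter() - start) / repeats
```

For example, time_after_warmup(lambda: model.generate(prompt=prompt, max_tokens=256)) reports steady-state latency rather than first-call latency.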

Quality Issues

If outputs seem incoherent:

Increase num_diffusion_steps (try 20-24)
Check that you’re using the correct NVFP4 weights (not incorrectly quantized variants)
Ensure your prompt is well-formatted

Driver Compatibility

# If you get CUDA errors, update your driver
# Download latest from nvidia.com/drivers
# Minimum: 560.0 for NVFP4 support

Comparing Inference Options

DiffusionGemma currently has a more limited inference ecosystem than mature autoregressive models. Here’s the landscape:

Method	Status	Notes
RTX AI Garage	✅ Supported	Best performance, easiest setup
Python SDK	✅ Supported	Direct API, most flexible
vLLM	🔄 Coming soon	Batch inference optimization
Ollama	❌ Not yet	Requires diffusion support
llama.cpp	❌ Not yet	Autoregressive only currently
LM Studio	❌ Not yet	Autoregressive only currently

For the autoregressive model ecosystem comparison, see our vLLM vs Ollama vs llama.cpp vs TGI breakdown.

Practical Use Cases for Local Deployment

Once you have DiffusionGemma running, here are high-value local use cases:

Real-time coding assistant: Sub-second response times for code completion
Local chatbot: Instant responses without cloud API latency
Content pipeline: Generate hundreds of summaries/descriptions per minute
Document processing: Rapid extraction and reformatting at scale
Interactive applications: Real-time translation, live summarization
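
As a back-of-the-envelope check on the content-pipeline figure, sustained throughput converts to items per minute as follows. The 1,000 tok/s rate is the balanced-mode figure quoted earlier; the 200-token summary length is an assumed example, and prompt processing time is ignored.

```python
def items_per_minute(tokens_per_second: float, tokens_per_item: int) -> float:
    """Convert sustained generation throughput into items produced per minute."""
    return tokens_per_second * 60 / tokens_per_item


print(items_per_minute(1000, 200))  # 300.0
```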

Frequently Asked Questions

Can I run DiffusionGemma on an RTX 4080 (16GB)?

No. The NVFP4 model requires 18GB VRAM minimum. The RTX 4080’s 16GB is insufficient. Your options are RTX 4090 (24GB), RTX 5090 (32GB), or professional cards like A6000/RTX PRO. There’s no smaller quantized variant available that maintains the diffusion speed advantage.

Is DiffusionGemma available on Ollama or LM Studio?

Not at launch. These tools are built for autoregressive models and don’t support the diffusion generation paradigm yet. The primary interfaces are NVIDIA’s RTX AI Garage and the Python SDK. Community integration with popular tools will depend on those tools adding diffusion support.

How does DiffusionGemma compare to running Gemma 4 27B locally?

On the same hardware (RTX 4090), Gemma 4 27B generates ~40 tokens/second autoregressively. DiffusionGemma generates 1000+ tokens/second. That’s 25x faster raw throughput. However, Gemma 4 27B produces higher quality output on complex reasoning tasks. Choose based on whether speed or quality matters more for your use case.

Can I fine-tune DiffusionGemma locally?

The Apache 2.0 license allows it, but fine-tuning diffusion models requires more VRAM than inference (typically 2-3x). You’d need 48GB+ VRAM for fine-tuning. The techniques also differ from standard autoregressive fine-tuning — traditional LoRA may not directly apply. Watch for community-developed fine-tuning guides.

What’s the maximum output length?

DiffusionGemma can generate sequences up to 2048 tokens in a single diffusion pass. For longer outputs, multiple passes can be chained, though this adds complexity. Most use cases (chat responses, summaries, code snippets) fit well within this limit.
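
Chaining passes starts with splitting the token budget. Here is a minimal sketch, assuming the 2048-token limit above and a hypothetical helper name; how each pass is conditioned on the previous output is left out because it depends on the SDK.

```python
MAX_TOKENS_PER_PASS = 2048  # single-pass limit stated above


def plan_passes(total_tokens: int, limit: int = MAX_TOKENS_PER_PASS) -> list[int]:
    """Split a token budget into per-pass budgets no larger than the limit."""
    if total_tokens <= 0:
        return []
    full, remainder = divmod(total_tokens, limit)
    return [limit] * full + ([remainder] if remainder else [])


print(plan_passes(5000))  # [2048, 2048, 904]
```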

Does it support system prompts and chat templates?

Yes, DiffusionGemma supports standard prompt formatting including system prompts. The denoising process is conditioned on the full prompt context. Use the same prompt templates you’d use with other Gemma family models.

What’s Next

DiffusionGemma marks day one of a new paradigm for local AI inference. Expect rapid ecosystem development: framework integrations, community quantization experiments, fine-tuned variants, and performance optimizations. The 18GB VRAM floor may drop as better compression techniques emerge.

If you have an RTX 4090 or better, there’s no reason not to try it today. The setup is straightforward, the speed is genuinely impressive, and the Apache 2.0 license means you can use it for anything. Just remember it’s experimental — validate outputs for critical use cases.