Every large language model you’ve used (ChatGPT, Claude, Gemma, Llama, Qwen) generates text the same way: one token at a time, left to right, each token depending on all previous ones. This approach is called autoregressive generation, and it’s been the default for mainstream LLMs since GPT-2.
Until now.
On June 10, 2026, Google DeepMind released DiffusionGemma — an open-source model that generates text using diffusion, the same family of techniques that powers Stable Diffusion and DALL-E for images. The result? Over 1,000 tokens per second on consumer NVIDIA hardware. That’s roughly 4x faster than equivalent autoregressive models.
But how does it actually work? Why is it faster? And what are the tradeoffs? Let’s break it down.
How Autoregressive Models Generate Text
First, let’s understand what we’re replacing. When you prompt GPT-4, Gemma 4, or any autoregressive model, here’s what happens internally:
- The model sees your prompt tokens
- It predicts a probability distribution over the next token and picks one (the most likely, or a sample from the distribution)
- That token gets appended to the sequence
- It predicts the next token based on the full sequence so far
- Repeat until done
This is fundamentally sequential. Token 50 can’t be generated until tokens 1-49 exist. Even on the fastest GPU in the world, you’re bottlenecked by this chain of dependencies.
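The loop above can be sketched in a few lines of Python. This is a toy illustration, not a real model: `toy_next_token` is a hypothetical stand-in for a full forward pass, and `generate_autoregressive` is a name invented here. The point is only the shape of the loop, where each new token needs its own pass over everything generated so far.

```python
from typing import Callable, List, Tuple


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass; returns one token id."""
    return (sum(tokens) + len(tokens)) % 50


def generate_autoregressive(
    prompt: List[int],
    max_new_tokens: int,
    next_token: Callable[[List[int]], int] = toy_next_token,
    eos_id: int = -1,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time and count the forward passes used."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # depends on every token produced so far
        forward_passes += 1
        tokens.append(tok)
        if tok == eos_id:  # stop early if the model emits end-of-sequence
            break
    return tokens, forward_passes


tokens, passes = generate_autoregressive([1, 2, 3], max_new_tokens=5)
print(len(tokens), passes)
```

The number of forward passes grows one-for-one with the number of generated tokens, which is the dependency chain described above.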
On a typical autoregressive 27B model running locally, you might get 30-50 tokens per second. At roughly three-quarters of an English word per token, that’s about one to two sentences per second, which is noticeable latency for any interactive application.
For a deeper dive into how this inference pipeline works mechanically, check our LLM inference explained guide.
How Text Diffusion Works
Text diffusion flips the script entirely. Instead of generating tokens one by one, it generates ALL tokens simultaneously and refines them iteratively. Here’s the process:
Step 1: Initialize with Noise
The model starts with a sequence of random placeholder tokens — essentially gibberish. If you want a 200-token output, you begin with 200 random tokens.
Step 2: Parallel Denoising
In each “diffusion step,” the model looks at the entire noisy sequence and updates ALL tokens in parallel. Each token gets pushed slightly closer to what the final output should be. Because updates happen in parallel, you fully utilize GPU compute — thousands of CUDA cores working simultaneously rather than waiting in a sequential chain.
Step 3: Iterative Refinement
After each pass, the sequence is more coherent. Random gibberish becomes rough words, rough words become grammatical sentences, grammatical sentences become contextually appropriate responses. This typically takes 10-20 denoising steps.
Step 4: Final Output
After sufficient refinement passes, the output converges to fluent, coherent text.
The key insight: each denoising step processes all tokens in parallel, leveraging GPU architecture the way it was designed to be used. GPUs have thousands of cores that excel at parallel computation, and generating one token at a time gives them too little work per step to keep most of them busy.
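Here is a minimal sketch of the four steps as control flow. It is deliberately a toy: `toy_denoise_step` is a hypothetical denoiser that is handed the clean `target` sequence so the example stays runnable, whereas a real model would have to predict it. All names here are illustrative and are not DiffusionGemma's API.

```python
import random
from typing import List

VOCAB_SIZE = 100  # illustrative vocabulary size


def toy_denoise_step(noisy: List[int], target: List[int], step: int,
                     num_steps: int, rng: random.Random) -> List[int]:
    """Update every position in one pass; later steps fix more tokens."""
    fix_prob = (step + 1) / num_steps  # reaches 1.0 on the last step
    return [t if rng.random() < fix_prob else n for n, t in zip(noisy, target)]


def generate_diffusion(target: List[int], num_steps: int = 16,
                       seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    # Step 1: start from random tokens, one per output position.
    seq = [rng.randrange(VOCAB_SIZE) for _ in target]
    # Steps 2-3: refine the whole sequence a fixed number of times.
    for step in range(num_steps):
        seq = toy_denoise_step(seq, target, step, num_steps, rng)
    # Step 4: the final state is the output.
    return seq


print(generate_diffusion(list(range(20)), num_steps=8))
```

Note that the loop runs `num_steps` times regardless of how long the output is. That is the structural difference from the token-by-token loop.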
Why Text Diffusion is Faster
The speed advantage comes from three factors working together:
1. Parallelism
If you’re generating 500 tokens, autoregressive needs 500 sequential forward passes (one per token). Text diffusion needs maybe 16 forward passes (denoising steps), but each pass processes all 500 tokens simultaneously. With GPU parallelism, processing 500 tokens at once isn’t 500x slower than processing 1. Single-token decoding is usually limited by loading weights from memory rather than by arithmetic, so one 500-token pass costs far less than 500 single-token passes.
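The arithmetic behind this comparison can be made explicit. In the sketch below, `pass_cost_ratio` is a hypothetical parameter: how many times more expensive one full-sequence denoising pass is than one single-token pass. It is not a measured value, and the real number depends on hardware and sequence length.

```python
def pass_reduction(num_tokens: int, num_steps: int) -> float:
    """How many times fewer forward passes diffusion needs."""
    return num_tokens / num_steps


def estimated_speedup(num_tokens: int, num_steps: int,
                      pass_cost_ratio: float) -> float:
    """Speedup once the higher cost of each parallel pass is accounted for."""
    return num_tokens / (num_steps * pass_cost_ratio)


print(pass_reduction(500, 16))          # 31.25x fewer passes
print(estimated_speedup(500, 16, 2.0))  # 15.625x if each pass costs 2x
```

The reduction in pass count is an upper bound on the speedup. The end-to-end gain is smaller by whatever factor each parallel pass costs more than a single-token pass.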
2. MoE Efficiency
DiffusionGemma specifically uses a Mixture-of-Experts architecture: 26B total parameters, but only 3.8B active per token. This means each forward pass through the model is computationally cheap — comparable to a 4B model — while retaining the knowledge capacity of a much larger model.
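A quick back-of-the-envelope check on those numbers. The parameter counts come from the figures above; the "about 2 FLOPs per active parameter per token" rule is a common rough estimate for a forward pass, used here as an approximation rather than a DiffusionGemma measurement.

```python
TOTAL_PARAMS = 26e9    # total parameters, from the figures above
ACTIVE_PARAMS = 3.8e9  # parameters active per token, from the figures above


def active_fraction(active: float, total: float) -> float:
    """Share of the model's weights that participate for a given token."""
    return active / total


def approx_forward_flops_per_token(active_params: float) -> float:
    """Rough estimate: one multiply and one add per active parameter."""
    return 2.0 * active_params


print(f"{active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
print(f"{approx_forward_flops_per_token(ACTIVE_PARAMS):.2e}")
```

So under 15% of the weights do work for any one token, which is why per-token compute looks like a roughly 4B dense model even though all 26B parameters still have to fit in memory.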
3. NVFP4 Optimization
The NVFP4 data format reduces memory bandwidth requirements, keeping the model in 18GB VRAM and allowing faster data movement between GPU memory and compute units. Memory bandwidth is often the real bottleneck in LLM inference, and NVFP4 directly addresses this.
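To see why a 4-bit format matters, here is a rough estimate of weight storage at different precisions. The 4.5 bits-per-weight figure is an assumption: it reflects 4-bit values plus one 8-bit scale shared by each block of 16 weights, which is my understanding of how NVFP4 is laid out. The estimate covers weights only.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 26e9  # total parameters, from the figures above

for label, bits in [("16-bit", 16), ("4-bit, no scales", 4),
                    ("4-bit + block scales (assumed)", 4.5)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The gap between this weights-only estimate and the 18GB figure would be taken up by activations, caches, and runtime overhead, none of which this sketch models.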
Combined, these factors yield 1,000+ tokens/second on NVIDIA RTX GPUs — roughly 4x faster than autoregressive models in the same quality class.
If you want to understand how GPU architecture affects inference speed, our GPU vs CPU AI inference article covers the hardware fundamentals.
The Mathematical Intuition
Without getting too deep into the math, here’s the intuition for why this works at all:
In autoregressive generation, the model learns P(token_n | token_1, …, token_{n-1}), the probability of the next token given all previous tokens. This inherently requires sequential computation.
In diffusion, the model learns to reverse a noise process. Given a noisy version of the text, predict what the “clean” version should be. The model learns a denoising function that can be applied to the entire sequence at once. Through iterative application of this denoising function, random noise converges to samples from the learned text distribution.
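Written out, the two approaches look roughly like this. The notation is assumed for illustration: $x_n$ is the token at position $n$, $x^{(t)}$ is the entire sequence at noise level $t$, and $p_\theta$ is the learned denoiser. This is the generic diffusion setup, not DiffusionGemma's specific formulation.

```latex
% Autoregressive: factorize the sequence left to right
\[
P(x_1, \dots, x_N) = \prod_{n=1}^{N} P(x_n \mid x_1, \dots, x_{n-1})
\]

% Diffusion: start from noise and apply the learned denoiser T times
\[
x^{(T)} \sim \text{noise}, \qquad
x^{(t-1)} \sim p_\theta\!\left(x^{(t-1)} \mid x^{(t)}\right),
\quad t = T, \dots, 1
\]
```

The first expression needs $N$ sequential evaluations, one per token. The second needs $T$ evaluations, each over the whole sequence, and $T$ does not grow with $N$.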
This is analogous to how image diffusion models work (Stable Diffusion, Midjourney), but adapted for discrete token sequences rather than continuous pixel values. The key research contribution of DiffusionGemma is “Uniform State Diffusion” — a specific formulation that makes this work effectively for text.
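One common way to define "noise" for discrete tokens is to replace each token, with some probability, by one drawn uniformly from the vocabulary. The sketch below shows that corruption step. Treat it as an assumption about what a uniform noising process looks like in general, not as DiffusionGemma's documented method; `corrupt_uniform` and its parameters are names invented here.

```python
import random
from typing import List


def corrupt_uniform(tokens: List[int], noise_level: float, vocab_size: int,
                    rng: random.Random) -> List[int]:
    """Replace each token with a uniformly random one with prob. noise_level."""
    return [
        rng.randrange(vocab_size) if rng.random() < noise_level else tok
        for tok in tokens
    ]


rng = random.Random(0)
clean = [11, 22, 33, 44, 55, 66]
for level in (0.0, 0.5, 1.0):
    print(level, corrupt_uniform(clean, level, vocab_size=100, rng=rng))
```

At noise level 0 the text is untouched, and at level 1 every position is random. Training a denoiser means teaching the model to undo this corruption at every level in between.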
Quality Tradeoffs: The Honest Assessment
Here’s where I need to be straightforward with you: text diffusion currently trades some quality for speed. DiffusionGemma is experimental — the first major open text diffusion model — and Google acknowledges it may not match top autoregressive models on all tasks.
Why the quality gap exists:
Sequential Reasoning
Autoregressive models naturally “think step by step” because they literally generate one step at a time. Each token can attend to all previous reasoning. In diffusion, all tokens are generated simultaneously, which can make multi-step logical reasoning harder. The model needs to “plan ahead” during denoising rather than building reasoning incrementally.
Coherence Over Long Outputs
While short and medium outputs are generally coherent, very long generations (2000+ tokens) may show more repetition or logical inconsistency compared to autoregressive models. The parallel nature makes it harder to maintain long-range dependencies across many denoising steps.
Instruction Precision
Early testing suggests DiffusionGemma is slightly less precise at following complex, multi-constraint instructions compared to autoregressive models of similar size. Simple instructions work fine; highly specific formatting or constraint-heavy prompts may need more denoising steps.
The Speed-Quality Dial
The num_diffusion_steps parameter directly controls this tradeoff:
- 8 steps: Very fast, but noticeably lower quality
- 16 steps: Good balance for most tasks
- 24+ steps: Approaches autoregressive quality, but loses some speed advantage
This tunability is actually a feature — you can dial in exactly the speed/quality point you need for your use case.
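As a rough sketch of how you might wire this into an application: the presets below mirror the step counts listed above, while the profile names and helper functions are hypothetical. `relative_speed` assumes generation time scales linearly with the number of steps, which is a simplification.

```python
STEP_PRESETS = {
    "fast": 8,       # very fast, noticeably lower quality
    "balanced": 16,  # good balance for most tasks
    "quality": 24,   # closer to autoregressive quality, slower
}


def pick_num_diffusion_steps(profile: str = "balanced") -> int:
    """Map a named profile to a num_diffusion_steps value."""
    if profile not in STEP_PRESETS:
        raise ValueError(f"unknown profile: {profile!r}")
    return STEP_PRESETS[profile]


def relative_speed(steps: int, baseline_steps: int = 16) -> float:
    """Speed relative to the baseline, assuming cost is linear in steps."""
    return baseline_steps / steps


for name in STEP_PRESETS:
    steps = pick_num_diffusion_steps(name)
    print(name, steps, round(relative_speed(steps), 2))
```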
What Text Diffusion is Good At
Despite the tradeoffs, there are tasks where diffusion excels:
- Summarization: Producing concise summaries of longer texts
- Translation: Generating translations where the output length is roughly known
- Fill-in-the-middle: Since tokens aren’t generated left-to-right, bidirectional context is natural
- Batch content generation: Producing many outputs quickly for content pipelines
- Interactive chat: Where response latency matters more than peak accuracy
- Drafting: Fast first drafts that can be refined by a slower model
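The fill-in-the-middle case is easy to picture in code: keep the known tokens fixed and let the denoiser update only the gaps, with context available on both sides. The sketch below is a toy. `toy_propose` is a hypothetical denoiser that averages each position's neighbors, standing in for a real model.

```python
import random
from typing import Callable, List, Optional


def toy_propose(seq: List[int]) -> List[int]:
    """Hypothetical denoiser: each position becomes its neighbors' average."""
    out = []
    for i in range(len(seq)):
        left = seq[i - 1] if i > 0 else seq[i]
        right = seq[i + 1] if i < len(seq) - 1 else seq[i]
        out.append((left + right) // 2)
    return out


def infill(template: List[Optional[int]],
           propose: Callable[[List[int]], List[int]] = toy_propose,
           num_steps: int = 8, vocab_size: int = 100,
           seed: int = 0) -> List[int]:
    """Fill None slots by iterative refinement; known tokens stay clamped."""
    rng = random.Random(seed)
    seq = [t if t is not None else rng.randrange(vocab_size) for t in template]
    for _ in range(num_steps):
        proposal = propose(seq)  # every position sees left and right context
        seq = [t if t is not None else p for t, p in zip(template, proposal)]
    return seq


print(infill([10, None, None, 40]))
```

The known tokens on both ends constrain the middle from the first step. A left-to-right generator would only see the left side while filling the gap.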
For understanding how different models compare for specific tasks, see our best AI models for coding locally comparison.
How DiffusionGemma Compares to Autoregressive Gemma 4
Since both come from Google DeepMind, the comparison is instructive:
| Aspect | DiffusionGemma | Gemma 4 27B |
|---|---|---|
| Generation | Parallel diffusion | Autoregressive |
| Speed | 1000+ tok/s | ~40 tok/s |
| Parameters | 26B (3.8B active) | 27B (all active) |
| VRAM | 18GB | 20-24GB |
| Reasoning | Good | Excellent |
| Quality ceiling | High (improving) | Very high |
For a detailed head-to-head, see our DiffusionGemma vs Gemma 4 27B comparison.
The Future of Text Diffusion
DiffusionGemma is generation one. Here’s why the AI community is excited about where this is heading:
Quality will improve. Image diffusion models went from “interesting experiment” to “better than GANs” in about 2 years. Text diffusion is following a similar trajectory. Future models will likely close the quality gap with autoregressive approaches.
Hybrid approaches. Some researchers are exploring models that use diffusion for initial draft generation and autoregressive refinement for polishing. Best of both worlds.
Hardware co-evolution. As GPU architectures evolve, they’re increasingly optimized for the kind of parallel workloads diffusion models require. NVIDIA’s collaboration with Google on DiffusionGemma is just the start.
New capabilities. Text diffusion naturally supports capabilities that are awkward for autoregressive models: editing in-place, infilling, bidirectional context, and controlled generation with constraints.
The open-source Apache 2.0 license means the community will iterate on this rapidly. Expect quantized variants, fine-tuned versions, and framework integrations in the coming weeks.
Practical Implications for Developers
If you’re building applications with local LLMs, here’s what text diffusion means for you:
- Real-time applications become viable: at 1000+ tok/s, a complete response arrives with near-zero perceived latency, even without token-by-token streaming
- Cost reduction: Faster generation = less GPU time = lower costs for inference-heavy workloads
- New architectures: Applications that were impractical with slow generation (real-time translation, live summarization) become feasible
- Model ensembles: Use DiffusionGemma for fast drafts, autoregressive models for refinement
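The draft-then-refine idea from the last bullet could be structured like this. Everything here is a hypothetical interface: `draft_model`, `refine_model`, and `needs_refinement` are placeholders for whatever models and quality check you plug in, not real library calls.

```python
from typing import Callable


def draft_then_refine(
    prompt: str,
    draft_model: Callable[[str], str],
    refine_model: Callable[[str, str], str],
    needs_refinement: Callable[[str], bool],
) -> str:
    """Produce a fast draft, then pay for the slow model only when needed."""
    draft = draft_model(prompt)
    if needs_refinement(draft):
        return refine_model(prompt, draft)
    return draft


# Toy stand-ins so the sketch runs end to end.
result = draft_then_refine(
    "summarize: text diffusion",
    draft_model=lambda p: p.upper(),
    refine_model=lambda p, d: d + " (refined)",
    needs_refinement=lambda d: len(d) > 10,
)
print(result)
```

The design choice is to keep the slow model off the critical path unless the check fails, so the common case keeps the latency of the fast model.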
For running this locally, check our how to run DiffusionGemma locally tutorial and the NVIDIA RTX Spark complete guide for hardware options.
Frequently Asked Questions
Is text diffusion the same as image diffusion applied to text?
Same family of techniques, different implementation. Image diffusion works in continuous pixel space where you can smoothly add and remove Gaussian noise. Text is discrete (tokens), so you can’t directly add “a little noise” to a word. Uniform State Diffusion solves this by working with token distributions and placeholder tokens that get iteratively refined. The core principle — start with noise, iteratively denoise — is shared.
Will text diffusion replace autoregressive models?
Not immediately, and probably not entirely. More likely, we’ll see a spectrum: diffusion for speed-sensitive tasks, autoregressive for quality-sensitive tasks, and hybrid approaches combining both. Just as CNNs didn’t disappear when transformers arrived — they found their niche — autoregressive models will remain valuable for tasks requiring precise sequential reasoning.
Can I fine-tune DiffusionGemma?
The Apache 2.0 license allows it, and the community will likely develop fine-tuning recipes. However, fine-tuning diffusion models requires different techniques than autoregressive models. Standard LoRA adapters may not directly apply. Watch the developer community for emerging best practices.
Why does DiffusionGemma need NVIDIA specifically?
The NVFP4 data format and inference optimizations are NVIDIA-specific. The parallel denoising process maps extremely well to NVIDIA’s CUDA architecture and tensor cores. While the model weights could theoretically run on other hardware, the 4x speed advantage is specifically tied to NVIDIA’s ecosystem. For cross-platform options, see our Ollama complete guide.
How many denoising steps should I use?
For most tasks, 12-20 steps offer a good balance. Use fewer (8-10) for simple tasks where speed matters most — chatbot responses, simple Q&A. Use more (20-24) for tasks requiring high coherence — long-form writing, technical explanations, code generation. The quality improvement above 24 steps is marginal.
Does text diffusion support streaming?
Not in the traditional sense. Autoregressive models naturally stream because each token is final once generated. Diffusion generates all tokens simultaneously, refining the whole sequence each step. However, you can stream intermediate denoising states — showing the text “crystallize” from noise, which is its own interesting UX pattern.
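Streaming intermediate states maps naturally onto a generator that yields the whole sequence after each step. The sketch below is a toy: it is handed the final text and reveals words at random, whereas a real model would yield its actual intermediate predictions. `stream_denoising` is a name invented here.

```python
import random
from typing import Iterator, List

PLACEHOLDER = "?"


def stream_denoising(target: List[str], num_steps: int = 4,
                     seed: int = 0) -> Iterator[str]:
    """Yield one full-sequence snapshot per denoising step."""
    rng = random.Random(seed)
    seq = [PLACEHOLDER] * len(target)
    for step in range(num_steps):
        reveal_prob = (step + 1) / num_steps  # reaches 1.0 on the last step
        # Already-revealed words stay; hidden ones may be revealed this step.
        seq = [t if s == t or rng.random() < reveal_prob else PLACEHOLDER
               for s, t in zip(seq, target)]
        yield " ".join(seq)


for frame in stream_denoising("the text crystallizes from noise".split()):
    print(frame)
```

Unlike token streaming, every frame has the full output length, and a UI would redraw the whole response each step instead of appending to it.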
Wrapping Up
Text diffusion is real, it works, and DiffusionGemma proves it can be done at scale with open weights. The 4x speed improvement is not marketing — it’s a fundamental architectural advantage of parallel generation over sequential generation.
Is it ready to replace your autoregressive models today? For some use cases, absolutely. For others, the quality tradeoff isn’t worth it yet. But the trajectory is clear: text diffusion will be a major part of how we generate text going forward.
The genie is out of the bottle. Start experimenting.