Everyone's chasing tokens per second. Xiaomi claims 1,000+ tok/s for MiMo-V2.5-Pro-UltraSpeed. Cerebras does 2,000+ on smaller models. Groq hits 800+ on Llama variants. The numbers keep climbing.
But here's a question nobody seems to be answering: does faster generation actually make AI coding agents more productive?
Not "does it feel snappier in a chatbot." Does it make an autonomous agent (one that plans, writes code, runs tests, debugs, and iterates without human intervention) complete more real work in the same amount of time?
I ran 106 autonomous coding sessions to find out. 62 on standard MiMo-V2.5-Pro. 44 on MiMo-V2.5-Pro-UltraSpeed. Same agent, same codebase, same types of production tasks. Here's what the data says.
The setup
The test environment is an autonomous AI coding agent running as part of a long-running AI startup experiment on our VPS. The agent operates in fixed session windows (typically 30-35 minutes), during which it autonomously:
- Reads the codebase and identifies what to work on
- Plans multi-step changes
- Writes and modifies code across multiple files
- Runs builds and tests
- Debugs failures
- Commits working code
The agent uses the Anthropic-compatible API with tool use (file read/write, shell commands, web search). Cached input routinely adds up to 2-8 million tokens over the course of a session.
- Standard model: MiMo-V2.5-Pro (1.02T parameters, 42B active, MoE architecture)
- UltraSpeed model: MiMo-V2.5-Pro-UltraSpeed (same model weights, different inference stack)
Both use the same underlying model. The difference is purely in the inference optimization: UltraSpeed applies a three-layer speed stack (FP4 quantization, DFlash speculative decoding, TileRT persistent-core runtime) to deliver faster generation.
The results
| Metric | MiMo-V2.5-Pro | UltraSpeed | Difference |
|---|---|---|---|
| Sessions measured | 62 | 44 | |
| Avg session duration | 7.7 min | 4.8 min | 37% faster |
| Avg output tokens/run | 23,244 | 23,807 | Equivalent |
| Median effective tok/s | 51 | 95 | 86% faster |
| P90 effective tok/s | 63 | 147 | 133% faster |
| P10 effective tok/s | 39 | 42 | Similar floor |
| Runs per 30-min window | 3-4 | 5-6 | ~60% more work |
The headline: UltraSpeed completed comparable production tasks with similar output volume, while reducing average session time by 37%. In time-constrained environments, that translates to roughly 60% more completed agent runs per window.
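The percentages in the table can be re-derived from its raw figures. The sketch below does only that arithmetic; the dictionary keys and the `pct_change` helper are illustrative names, and the "extra runs" line assumes the number of runs that fit in a fixed window scales with the inverse of average run duration.

```python
# Figures copied from the results table above.
pro = {"avg_minutes": 7.7, "median_tps": 51, "p90_tps": 63}
ultra = {"avg_minutes": 4.8, "median_tps": 95, "p90_tps": 147}


def pct_change(old: float, new: float) -> float:
    """Percentage change going from `old` to `new`."""
    return (new - old) / old * 100


# Shorter sessions show up as a negative change, so flip the sign.
duration_cut = -pct_change(pro["avg_minutes"], ultra["avg_minutes"])
median_gain = pct_change(pro["median_tps"], ultra["median_tps"])
p90_gain = pct_change(pro["p90_tps"], ultra["p90_tps"])

# Assumption: runs per fixed window are proportional to 1 / run duration.
extra_runs = pct_change(1 / pro["avg_minutes"], 1 / ultra["avg_minutes"])

print(f"Session time reduced by {duration_cut:.1f}%")
print(f"Median effective tok/s up {median_gain:.1f}%")
print(f"P90 effective tok/s up {p90_gain:.1f}%")
print(f"Runs per window up {extra_runs:.1f}%")
```

The time saved and the extra runs are two views of the same ratio: a run that takes about 62% as long lets about 1.6x as many runs fit in a window.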
Why 1,000 tok/s becomes 95 tok/s in practice
This is the most important finding, and the one that pure benchmarks never show you.
MiMo UltraSpeed genuinely generates tokens at 1,000+ tok/s in isolation. Xiaomi's claim is real. But an autonomous coding agent doesn't just generate output. Each turn in an agent loop involves:
- Processing input context: reading the codebase, understanding previous actions, parsing tool results. With 2-8M cached tokens per session, this is significant.
- Reasoning and planning: the model's "thinking" before it starts outputting.
- Generating output: this is the part that hits 1,000 tok/s. But it's only one phase of each turn.
- Tool execution: file writes, shell commands, build processes. These happen between turns.
In a 60-turn coding session (typical for UltraSpeed), the model generates an average of 397 output tokens per turn. At 1,000 tok/s, that's 0.4 seconds of pure generation per turn. The rest of each turn (3-4 seconds average) is input processing, reasoning, and tool execution.
So the effective "session-level" throughput is 95 tok/s at the median, not because generation is slow, but because generation is only ~10% of what the agent spends time doing.
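A back-of-the-envelope version of that breakdown, using only the averages from the results table, looks roughly like this. It is a sketch: the variable names are illustrative, and because it works from session averages rather than per-run measurements, its numbers land near, not exactly on, the per-turn and median figures quoted above.

```python
output_tokens_per_run = 23_807   # UltraSpeed average from the results table
turns = 60                       # typical UltraSpeed session, per the text
session_seconds = 4.8 * 60       # average UltraSpeed session duration
peak_decode_tps = 1_000          # headline generation-phase speed

tokens_per_turn = output_tokens_per_run / turns
seconds_per_turn = session_seconds / turns

# Time spent purely decoding, if every token came out at the headline rate.
decode_seconds = tokens_per_turn / peak_decode_tps
decode_share = decode_seconds / seconds_per_turn

# Session-level throughput: all output tokens over all wall-clock time.
effective_tps = output_tokens_per_run / session_seconds

print(f"{tokens_per_turn:.0f} tokens per turn")
print(f"{decode_seconds:.2f} s decoding out of {seconds_per_turn:.1f} s per turn")
print(f"decode share of turn time: {decode_share:.1%}")
print(f"effective throughput: {effective_tps:.0f} tok/s")
```

The mean-based effective throughput comes out somewhat below the 95 tok/s median, which is expected when a few slow runs pull the average session time up.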
This matters because it reframes the entire "tokens per second" conversation. For chatbots where you wait for a streaming response, raw tok/s is everything. For coding agents that run autonomously, it's one factor among many, and not even the dominant one.
Where UltraSpeed actually helps
Despite the above, the 37% wall-clock improvement is real. Where does it come from if generation is only 10% of turn time?
1. Reduced time-to-first-token (TTFT)
UltraSpeed sessions show TTFT of 2-3 seconds on cached contexts, vs 3-5 seconds on standard Pro. When you have 60+ turns per session, saving 1-2 seconds per turn adds up to 1-2 minutes per session.
2. Reduced delay before useful output
UltraSpeed also appears to reduce the delay before useful output appears, especially in multi-turn sessions with heavily cached context. This manifests as shorter gaps between turns even when the output itself isn't long.
3. Compound effect over many turns
A 1-second improvement per turn is barely noticeable in a single interaction. Over 60 turns, it's a full minute. Over a 5-session day, that's 5 extra productive runs.
4. Higher burst throughput on code-heavy outputs
When the agent generates long code blocks (500+ tokens), UltraSpeed's P90 throughput of 147 tok/s kicks in. Standard Pro caps around 63 tok/s. For individual code generation steps, that's 2.3x faster.
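To see how much of the wall-clock gap the per-turn savings could plausibly account for, the arithmetic can be sketched as below. The helper name is illustrative, and the calculation assumes both models run about 60 turns per session; the text only states that figure for UltraSpeed.

```python
def minutes_saved(turns: int, seconds_saved_per_turn: float) -> float:
    """Total minutes saved in a session from a fixed per-turn saving."""
    return turns * seconds_saved_per_turn / 60


turns = 60  # assumption: same turn count on both models

low = minutes_saved(turns, 1.0)    # 1 s saved per turn
high = minutes_saved(turns, 2.0)   # 2 s saved per turn

observed_gap = 7.7 - 4.8           # average session minutes, Pro minus UltraSpeed
burst_ratio = 147 / 63             # P90 effective tok/s, UltraSpeed over Pro

print(f"TTFT savings: {low:.1f} to {high:.1f} min per session")
print(f"Share of the {observed_gap:.1f} min gap: "
      f"{low / observed_gap:.0%} to {high / observed_gap:.0%}")
print(f"Burst throughput ratio: {burst_ratio:.1f}x")
```

Under these assumptions, per-turn latency savings cover roughly a third to two thirds of the observed gap, with faster generation on long outputs making up part of the remainder.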
How UltraSpeed works (technical breakdown)
According to Xiaomi's technical documentation and their inference partner TileRT, the 1,000+ tok/s performance comes from a three-layer optimization stack:
Layer 1: MXFP4 Quantization
The expert layers in MiMo's MoE architecture are quantized from 16-bit to 4-bit (MXFP4 format). This cuts the memory footprint of those layers by roughly 4x, meaning less data needs to move per token. For a memory-bandwidth-bound workload like LLM decoding, moving less data directly translates to faster generation.
The key insight: MoE models are particularly well-suited to aggressive quantization because only a fraction of experts activate per token. Quality degradation is minimal because the routing mechanism naturally selects the most relevant experts, and 4-bit precision is sufficient for the activated subset.
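The "roughly 4x" figure can be checked with a little arithmetic. The sketch below uses the MXFP4 layout from the OCP Microscaling specification (blocks of 32 four-bit elements sharing one 8-bit scale). As a loud simplification, it treats all 42B active parameters as quantized, whereas the text says only the expert layers are, so the real saving would be somewhat smaller. Function and constant names are illustrative.

```python
BLOCK_SIZE = 32      # elements per MXFP4 block
ELEMENT_BITS = 4     # each element is a 4-bit float
SCALE_BITS = 8       # one shared scale per block


def mxfp4_bits_per_weight() -> float:
    """Effective storage per weight, including the amortized block scale."""
    return ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE


def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes) for a given precision."""
    return n_params * bits_per_weight / 8 / 1e9


active_params = 42e9  # active parameters per token, from the setup section

# Simplifying assumption: every active weight is quantized.
gb_16bit = weight_gigabytes(active_params, 16)
gb_mxfp4 = weight_gigabytes(active_params, mxfp4_bits_per_weight())
reduction = gb_16bit / gb_mxfp4

print(f"16-bit weights read per token: {gb_16bit:.1f} GB")
print(f"MXFP4 weights read per token:  {gb_mxfp4:.1f} GB")
print(f"Reduction: {reduction:.2f}x")
```

Once the shared scales are counted, the reduction is a little under 4x, which is why "roughly 4x" is the safer way to state it.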
Layer 2: DFlash Speculative Decoding
Traditional speculative decoding uses a small "draft" model to propose tokens, then verifies them with the full model. DFlash takes this further: instead of generating draft tokens one at a time (still sequential), it uses a block diffusion model that produces an entire block of K tokens in a single forward pass.
The full model then verifies the entire block in parallel. When acceptance rates are high (which they are for code generation, since code is highly predictable given context), multiple tokens get accepted per verification step. This compounds: high acceptance on code means fewer rejections, fewer re-drafts, and sustained high throughput.
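The draft-then-verify loop can be illustrated with a toy example. This is generic greedy speculative decoding, not DFlash's implementation: `target_next` and `draft_block` are invented stand-ins for the full model and the drafter, and the verification loop runs sequentially here, whereas a real system checks the whole block in one parallel forward pass.

```python
from typing import List, Tuple


def target_next(context: List[int]) -> int:
    """Toy 'full model': the next token is a fixed function of the last one."""
    return (context[-1] * 3 + 1) % 7


def draft_block(context: List[int], k: int) -> List[int]:
    """Toy drafter: proposes k tokens, deliberately wrong at every 4th position."""
    ctx = list(context)
    block = []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % 7  # injected drafting mistake
        block.append(tok)
        ctx.append(tok)
    return block


def verify(context: List[int], block: List[int]) -> List[int]:
    """Keep the longest matching prefix, then add one token from the target."""
    ctx = list(context)
    accepted = []
    for tok in block:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # whole block accepted: bonus token
    return accepted


def generate(prompt: List[int], n_tokens: int, k: int) -> Tuple[List[int], int]:
    """Generate n_tokens, returning the new tokens and the verification steps used."""
    seq = list(prompt)
    steps = 0
    while len(seq) - len(prompt) < n_tokens:
        seq.extend(verify(seq, draft_block(seq, k)))
        steps += 1
    return seq[len(prompt):len(prompt) + n_tokens], steps


tokens, steps = generate([1], n_tokens=20, k=4)
print(f"{len(tokens)} tokens in {steps} verification steps")
```

The output is identical to what the target would produce one token at a time; the only thing that changes is how many target passes are needed, which is where the acceptance rate matters.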
Layer 3: TileRT Persistent-Core Runtime
TileRT keeps GPU compute cores occupied continuously rather than cycling them on and off between operations. Traditional GPU runtimes launch kernels, wait for completion, launch the next kernel. TileRT's persistent-core approach pre-schedules the entire decode pipeline as a single fused operation.
The result: near-zero inter-operation latency. According to Xiaomi, each of the three optimizations provides roughly 2-3x improvement. Combined: ~10x over unoptimized inference, reaching 1,000+ tok/s on a single 8-GPU commodity node.
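As a quick sanity check on how three 2-3x gains relate to ~10x, the sketch below assumes the three layers multiply independently. That is an assumption for illustration; real optimizations overlap and rarely compose this cleanly.

```python
layers = 3
per_layer_low, per_layer_high = 2.0, 3.0  # "roughly 2-3x" per optimization

# Assumption: speedups compose multiplicatively.
combined_low = per_layer_low ** layers
combined_high = per_layer_high ** layers

# Per-layer gain that would produce exactly 10x across three layers.
implied_per_layer = 10 ** (1 / layers)

print(f"Combined range: {combined_low:.0f}x to {combined_high:.0f}x")
print(f"10x overall implies about {implied_per_layer:.2f}x per layer")
```

A combined ~10x sits near the bottom of that range, corresponding to a little over 2x per layer.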
What this means for coding agent builders
If you're building or using AI coding agents, here's my practical takeaway from 106 sessions:
1. Raw tok/s is a marketing metric, not a productivity metric.
Your agent's real throughput is determined by: context processing speed + reasoning time + generation speed + tool execution time. Generation is typically 10-15% of total agent turn time. Optimizing it helps, but 10x generation speed doesn't give you 10x productivity.
2. Session-level throughput matters more.
The number you should care about is "tasks completed per hour" or "useful commits per session," not "tokens per second." UltraSpeed gives 60% more completed tasks per time window. That's the metric that affects your development velocity.
3. Caching strategy dominates.
With 2-8M cached tokens per session, the KV-cache hit rate and prefill speed matter more than decode speed for agent workloads. MiMo's Hybrid Sliding Window Attention (compressing KV-cache to ~1/7 of full attention) is arguably more important for agent performance than the UltraSpeed decode optimizations.
4. The "speed moat" is real but narrow.
Today, MiMo UltraSpeed at 1,000 tok/s is ahead of most competitors on raw generation speed for a trillion-parameter model. But Cerebras, Groq, and others are closing in on smaller models. The long-term advantage isn't speed alone. It's the combination of frontier-scale model (1T parameters, 42B active) at frontier speed on commodity hardware. That's what makes UltraSpeed interesting for production agent deployments.
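The first takeaway, that 10x generation speed does not give 10x productivity, is an instance of Amdahl's law. A minimal sketch, using the 10-15% generation share quoted above (the function name is illustrative):

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: total speedup when only `fraction` of the time gets faster."""
    return 1 / ((1 - fraction) + fraction / local_speedup)


for share in (0.10, 0.15):
    ten_x = overall_speedup(share, 10)
    ceiling = 1 / (1 - share)  # limit as generation becomes instantaneous
    print(f"generation = {share:.0%} of turn time: "
          f"10x decode gives {ten_x:.2f}x overall (ceiling {ceiling:.2f}x)")
```

Faster decoding alone would explain only about a 1.1-1.2x gain under these shares. The measured improvement is larger than that, which fits the earlier observation that reduced time-to-first-token and shorter gaps between turns contribute as well.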
Market context
For reference, MiMo UltraSpeed at 1,000+ tok/s on a trillion-parameter model is currently unique. Other fast inference providers like Cerebras (2,100 tok/s) and Groq (800 tok/s) achieve their speeds on smaller models (70B-class). Closed frontier models like Claude Opus 4.8 and GPT-5.5 typically generate at 60-80 tok/s.
UltraSpeed's positioning is specifically: frontier-scale model quality at frontier speed on commodity hardware. Whether that speed advantage persists as competitors optimize their own stacks remains to be seen.
Methodology notes
What was controlled:
- Same agent framework (Claude Code-compatible CLI with MCP)
- Same codebase (a production website with 680+ pages)
- Same task types (feature development, bug fixes, conversion optimization, content generation)
- Same session window structure
What wasn't controlled:
- Task difficulty varied naturally (some sessions got harder problems)
- Cache state varied (first run of a session has less cache than sixth run)
- Network conditions (VPS in EU, API in Asia)
Sample sizes: 62 runs on Pro, 44 runs on UltraSpeed. Sufficient for meaningful averages but not statistically rigorous A/B testing. The results should be read as "strong indicative data" rather than "laboratory-grade proof."
The bottom line
Does faster inference make AI coding agents more productive? Yes, meaningfully. Not proportionally to the tok/s improvement (10x generation speed doesn't give you 10x productivity), but a consistent 37% wall-clock improvement that translates to 60% more completed work in time-constrained environments.
For developers running autonomous coding agents in production, UltraSpeed delivers a meaningful speed improvement that translates directly into more completed work per session. For interactive coding assistance (pair programming, code review, chatbot-style Q&A), the speed improvement is even more noticeable because generation is a larger fraction of the interaction time.
The bigger takeaway: the "tokens per second" arms race matters, but not in the way marketing materials suggest. In agent workflows, it's one factor among many. The real question isn't "how fast can you generate?" but "how many useful things can you ship per hour?" For MiMo UltraSpeed, the answer is: roughly 60% more than before.
FAQ
Is MiMo UltraSpeed worth the premium over standard Pro?
If your bottleneck is time (fixed session windows, CI/CD pipelines, real-time agents), yes. You get 37% faster sessions and roughly 60% more completed work per window. If you have unlimited time and just want to minimize spend, standard Pro gives comparable output, just slower.
Does UltraSpeed reduce output quality?
In 44 sessions, output volume remained equivalent (23.8K vs 23.2K tokens per run) and task types were the same (production code commits, feature development, bug fixes). We didn't observe quality degradation. The FP4 quantization appears well-calibrated for code generation tasks.
Why don't I see 1,000 tok/s when using UltraSpeed?
You do, during the generation phase. But if you're measuring total output tokens divided by end-to-end session time, you'll see 50-150 tok/s because most time is spent on context processing and reasoning, not generation. The 1,000 tok/s headline is a generation-phase metric, not a session-level metric.
How does UltraSpeed compare to Claude Code or Codex?
Different category. Claude Code and Codex are agent frameworks running on their respective models (Claude Opus, GPT). MiMo UltraSpeed is a model + inference optimization. You could theoretically run a Claude Code-style agent on MiMo UltraSpeed via the API (which is what we did). The comparison isn't model vs model, it's "does the same agent perform better with faster inference under it."
Can I self-host UltraSpeed?
No. UltraSpeed requires TileRT's persistent-core runtime, which is only available via the Xiaomi API. You can self-host standard MiMo-V2.5-Pro (MIT open weights on HuggingFace), but you won't get the 1,000 tok/s optimization without TileRT's proprietary inference stack.
Is this only useful for coding agents?
The principle applies to any agentic workload with many turns: research agents, data analysis pipelines, multi-step content generation, automated testing. Anywhere an AI iterates through plan-execute-verify loops, faster inference per turn compounds into meaningful productivity gains.
This analysis is based on real production data from our AI Startup Race experiment, where autonomous AI agents compete to build and grow websites. MiMo UltraSpeed access was provided by Xiaomi for testing purposes. The analysis, methodology, and conclusions are entirely our own.