
MiMo UltraSpeed for Agentic Coding: 106 Sessions Tested


Everyone’s chasing tokens per second. Xiaomi claims 1,000+ tok/s for MiMo-V2.5-Pro-UltraSpeed. Cerebras does 2,000+ on smaller models. Groq hits 800+ on Llama variants. The numbers keep climbing.

But here’s a question nobody seems to be answering: does faster generation actually make AI coding agents more productive?

Not “does it feel snappier in a chatbot.” Does it make an autonomous agent (one that plans, writes code, runs tests, debugs, and iterates without human intervention) complete more real work in the same amount of time?

I ran 106 autonomous coding sessions to find out. 62 on standard MiMo-V2.5-Pro. 44 on MiMo-V2.5-Pro-UltraSpeed. Same agent, same codebase, same types of production tasks. Here’s what the data says.

The setup

The test environment is an autonomous AI coding agent running as part of a long-running AI startup experiment on our VPS. The agent operates in fixed session windows (typically 30-35 minutes), during which it autonomously:

  • Reads the codebase and identifies what to work on
  • Plans multi-step changes
  • Writes and modifies code across multiple files
  • Runs builds and tests
  • Debugs failures
  • Commits working code

The agent uses the Anthropic-compatible API with tool use (file read/write, shell commands, web search). Cached input routinely adds up to 2-8 million tokens over the course of a session.

Standard model: MiMo-V2.5-Pro (1.02T parameters, 42B active, MoE architecture)

UltraSpeed model: MiMo-V2.5-Pro-UltraSpeed (same model weights, different inference stack)

Both use the same underlying model. The difference is purely in the inference optimization: UltraSpeed applies a three-layer speed stack (FP4 quantization, DFlash speculative decoding, TileRT persistent-core runtime) to deliver faster generation.

The results

| Metric | MiMo-V2.5-Pro | UltraSpeed | Difference |
| --- | --- | --- | --- |
| Sessions measured | 62 | 44 | |
| Avg session duration | 7.7 min | 4.8 min | 37% faster |
| Avg output tokens/run | 23,244 | 23,807 | Equivalent |
| Median effective tok/s | 51 | 95 | 86% faster |
| P90 effective tok/s | 63 | 147 | 133% faster |
| P10 effective tok/s | 39 | 42 | Similar floor |
| Runs per 30-min window | 3-4 | 5-6 | ~60% more work |

The headline: UltraSpeed completed comparable production tasks with similar output volume, while reducing average session time by 37%. In time-constrained environments, that translates to roughly 60% more completed agent runs per window.
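The percentages in the table can be re-derived from the raw figures. The sketch below does that arithmetic; the helper names are illustrative, and it assumes runs are packed back to back in a 30-minute window with no idle time between them.

```python
# Figures copied from the results table; helper names are illustrative.
PRO = {"duration_min": 7.7, "median_tok_s": 51, "p90_tok_s": 63}
ULTRA = {"duration_min": 4.8, "median_tok_s": 95, "p90_tok_s": 147}
WINDOW_MIN = 30


def pct_gain(new: float, old: float) -> float:
    """How much larger `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


def pct_time_saved(new: float, old: float) -> float:
    """Wall-clock reduction of `new` relative to `old`, in percent."""
    return (1.0 - new / old) * 100.0


def runs_per_window(window_min: float, duration_min: float) -> float:
    """Runs that fit in a window, assuming no gaps between runs."""
    return window_min / duration_min


time_saved = pct_time_saved(ULTRA["duration_min"], PRO["duration_min"])
median_gain = pct_gain(ULTRA["median_tok_s"], PRO["median_tok_s"])
p90_gain = pct_gain(ULTRA["p90_tok_s"], PRO["p90_tok_s"])
extra_runs = pct_gain(
    runs_per_window(WINDOW_MIN, ULTRA["duration_min"]),
    runs_per_window(WINDOW_MIN, PRO["duration_min"]),
)

print(f"run time saved:     {time_saved:.1f}%")
print(f"median tok/s gain:  {median_gain:.1f}%")
print(f"P90 tok/s gain:     {p90_gain:.1f}%")
print(f"extra runs/window:  {extra_runs:.1f}%")
```

A run that is about 37% shorter fits about 60% more often into the same window, because the gain in count is 1 / (1 - 0.377) rather than 1 + 0.377. That is why the two headline numbers differ.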

Why 1,000 tok/s becomes 95 tok/s in practice

This is the most important finding, and the one that pure benchmarks never show you.

MiMo UltraSpeed genuinely generates tokens at 1,000+ tok/s in isolation. Xiaomi’s claim is real. But an autonomous coding agent doesn’t just generate output. Each turn in an agent loop involves:

  1. Processing input context: reading the codebase, understanding previous actions, parsing tool results. With 2-8M cached tokens per session, this is significant.
  2. Reasoning and planning: the model’s “thinking” before it starts outputting.
  3. Generating output: this is the part that hits 1,000 tok/s. But it’s only one phase of each turn.
  4. Tool execution: file writes, shell commands, build processes. These happen between turns.

In a 60-turn coding session (typical for UltraSpeed), the model generates an average of 397 output tokens per turn. At 1,000 tok/s, that’s 0.4 seconds of pure generation per turn. The rest of each turn (3-4 seconds average) is input processing, reasoning, and tool execution.

So the effective “session-level” throughput is 95 tok/s at the median, not because generation is slow, but because generation is only ~10% of what the agent spends time doing.
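The per-turn arithmetic can be reproduced from the averages reported for UltraSpeed. This is a rough sketch: it uses mean values (23,807 output tokens, 60 turns, 4.8 minutes), so it gives a mean-based throughput rather than the median of 95 tok/s, and it treats the 1,000 tok/s headline as the decode rate for every turn.

```python
# Averages quoted in the article for UltraSpeed runs.
OUTPUT_TOKENS_PER_RUN = 23_807
TURNS_PER_RUN = 60
RUN_SECONDS = 4.8 * 60
PEAK_DECODE_TOK_S = 1_000  # vendor headline rate, assumed for every turn

tokens_per_turn = OUTPUT_TOKENS_PER_RUN / TURNS_PER_RUN
decode_s_per_turn = tokens_per_turn / PEAK_DECODE_TOK_S
wall_s_per_turn = RUN_SECONDS / TURNS_PER_RUN
decode_share = decode_s_per_turn / wall_s_per_turn
mean_effective_tok_s = OUTPUT_TOKENS_PER_RUN / RUN_SECONDS

print(f"tokens per turn:        {tokens_per_turn:.0f}")
print(f"decode time per turn:   {decode_s_per_turn:.2f} s")
print(f"wall time per turn:     {wall_s_per_turn:.1f} s")
print(f"decode share of turn:   {decode_share:.0%}")
print(f"mean effective tok/s:   {mean_effective_tok_s:.0f}")
```

On these averages a turn takes about 4.8 seconds, slightly above the 3-4 seconds quoted, which presumably reflects typical rather than mean runs. Either way, pure decoding comes out at roughly a tenth of the turn or less.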

This matters because it reframes the entire “tokens per second” conversation. For chatbots where you wait for a streaming response, raw tok/s is everything. For coding agents that run autonomously, it’s one factor among many, and not even the dominant one.

Where UltraSpeed actually helps

Despite the above, the 37% wall-clock improvement is real. Where does it come from if generation is only 10% of turn time?

1. Reduced time-to-first-token (TTFT)

UltraSpeed sessions show TTFT of 2-3 seconds on cached contexts, vs 3-5 seconds on standard Pro. When you have 60+ turns per session, saving 1-2 seconds per turn adds up to 1-2 minutes per session.
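To see how much of the observed gap TTFT could account for, the savings can be set against the 2.9-minute difference in average run time (7.7 vs 4.8 minutes). The sketch assumes 60 turns per run on both models, which is only stated as typical for UltraSpeed.

```python
TURNS = 60                       # assumed equal for both models
SAVED_S_PER_TURN = (1.0, 2.0)    # TTFT difference range quoted above
OBSERVED_GAP_S = (7.7 - 4.8) * 60  # average run-time difference, in seconds

saved_s = tuple(TURNS * s for s in SAVED_S_PER_TURN)
share_of_gap = tuple(s / OBSERVED_GAP_S for s in saved_s)

for seconds, share in zip(saved_s, share_of_gap):
    print(f"{seconds / 60:.0f} min saved per run = {share:.0%} of the observed gap")
```

Under these assumptions TTFT alone would explain somewhere between a third and two thirds of the wall-clock improvement, which leaves room for the other effects described in this section.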

2. Reduced delay before useful output

UltraSpeed also appears to reduce the delay before useful output appears, especially in multi-turn sessions with heavily cached context. This manifests as shorter gaps between turns even when the output itself isn’t long.

3. Compound effect over many turns

A 1-second improvement per turn is barely noticeable in a single interaction. Over 60 turns, it’s a full minute per run. Across a day of five session windows with roughly five runs each, that adds up to about 25 minutes, enough for around five extra runs.

4. Higher burst throughput on code-heavy outputs

When the agent generates long code blocks (500+ tokens), UltraSpeed’s burst throughput shows up at the top of the distribution: its P90 effective throughput is 147 tok/s, against about 63 tok/s for standard Pro. For generation-heavy steps, that’s roughly 2.3x faster.

How UltraSpeed works (technical breakdown)

According to Xiaomi’s technical documentation and their inference partner TileRT, the 1,000+ tok/s performance comes from a three-layer optimization stack:

Layer 1: MXFP4 Quantization

The expert layers in MiMo’s MoE architecture are quantized from 16-bit to 4-bit (MXFP4 format). This cuts the memory footprint of those layers by roughly 4x, meaning less data needs to move per token. For a memory-bandwidth-bound workload like LLM decoding, moving less data directly translates to faster generation.
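The "roughly 4x" figure can be made a little more precise. In the MX format family, MXFP4 stores 4-bit elements in blocks of 32 that share one 8-bit scale, so the effective cost is slightly above 4 bits per weight. The sketch below works that out. The per-token traffic estimate is only an upper-bound illustration: it assumes batch size 1 and that all 42B active parameters are quantized, whereas the article says only the expert layers are.

```python
def mx_bits_per_element(element_bits: int = 4,
                        scale_bits: int = 8,
                        block_size: int = 32) -> float:
    """Effective bits per weight once the shared block scale is amortized."""
    return element_bits + scale_bits / block_size


def weight_bytes(params: float, bits_per_element: float) -> float:
    """Bytes needed to store (and read once) `params` weights."""
    return params * bits_per_element / 8


ACTIVE_PARAMS = 42e9  # active parameters per token, from the model description

bits = mx_bits_per_element()
compression_vs_16bit = 16 / bits

# Illustrative upper bound: pretends every active weight is MXFP4.
gb_16bit = weight_bytes(ACTIVE_PARAMS, 16) / 1e9
gb_mxfp4 = weight_bytes(ACTIVE_PARAMS, bits) / 1e9

print(f"effective bits per weight: {bits}")
print(f"compression vs 16-bit:     {compression_vs_16bit:.2f}x")
print(f"weights read per token:    {gb_16bit:.1f} GB -> {gb_mxfp4:.1f} GB")
```

Since decoding is limited by how fast weights can be streamed from memory, the reduction in bytes read per token is a reasonable first-order proxy for the decode speedup from this layer alone.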

The key insight: MoE models are particularly well-suited to aggressive quantization because only a fraction of experts activate per token. Quality degradation is minimal because the routing mechanism naturally selects the most relevant experts, and 4-bit precision is sufficient for the activated subset.

Layer 2: DFlash Speculative Decoding

Traditional speculative decoding uses a small “draft” model to propose tokens, then verifies them with the full model. DFlash takes this further: instead of generating draft tokens one at a time (still sequential), it uses a block diffusion model that produces an entire block of K tokens in a single forward pass.

The full model then verifies the entire block in parallel. When acceptance rates are high (which they are for code generation, since code is highly predictable given context), multiple tokens get accepted per verification step. This compounds: high acceptance on code means fewer rejections, fewer re-drafts, and sustained high throughput.
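The accept-and-verify step can be sketched in a few lines. This is a generic greedy variant of speculative decoding, not DFlash itself: the draft block is simply given as a list, `toy_target` is a hypothetical stand-in for the full model, and the loop simulates what a real system would compute for all positions in one forward pass. The closed-form estimate assumes each draft token is accepted independently with the same probability.

```python
from typing import Callable, List, Sequence


def verify_block(prefix: List[int],
                 draft: Sequence[int],
                 target_next: Callable[[List[int]], int]) -> List[int]:
    """Keep draft tokens while they match the target model's greedy choice.

    On the first mismatch the target's token is emitted instead, so every
    verification step yields at least one token.
    """
    accepted: List[int] = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # correction replaces the bad draft token
            return accepted
        accepted.append(token)
    # Whole block accepted: the target contributes one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def expected_tokens_per_step(accept_prob: float, block_size: int) -> float:
    """Mean tokens per verification step under i.i.d. acceptance."""
    if accept_prob >= 1.0:
        return float(block_size + 1)
    return (1.0 - accept_prob ** (block_size + 1)) / (1.0 - accept_prob)


def toy_target(context: List[int]) -> int:
    """Hypothetical 'model' that always counts upward modulo 10."""
    return (context[-1] + 1) % 10


partial = verify_block([1, 2, 3], [4, 5, 9, 7], toy_target)
full = verify_block([1, 2, 3], [4, 5, 6, 7], toy_target)
print(partial, full)
print(f"{expected_tokens_per_step(0.8, 4):.2f} tokens/step at 80% acceptance, K=4")
```

The estimate shows why acceptance rate matters so much: with a block of four, going from 50% to 80% acceptance raises the yield per verification step from under two tokens to more than three.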

Layer 3: TileRT Persistent-Core Runtime

TileRT keeps GPU compute cores occupied continuously rather than cycling them on and off between operations. Traditional GPU runtimes launch kernels, wait for completion, launch the next kernel. TileRT’s persistent-core approach pre-schedules the entire decode pipeline as a single fused operation.

The result: near-zero inter-operation latency. According to Xiaomi, each of the three optimizations provides roughly 2-3x improvement. Combined: ~10x over unoptimized inference, reaching 1,000+ tok/s on a single 8-GPU commodity node.
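As a quick sanity check on how the three figures relate, the per-layer range can be compounded. This assumes the gains multiply cleanly, which is an idealisation; in practice the layers interact and the product is an upper bound.

```python
PER_LAYER_RANGE = (2.0, 3.0)  # per-optimization gain attributed to Xiaomi
LAYERS = 3
CLAIMED_TOTAL = 10.0          # combined speedup quoted above

low = PER_LAYER_RANGE[0] ** LAYERS
high = PER_LAYER_RANGE[1] ** LAYERS
implied_per_layer = CLAIMED_TOTAL ** (1 / LAYERS)

print(f"compounded range: {low:.0f}x to {high:.0f}x")
print(f"10x total implies about {implied_per_layer:.2f}x per layer")
```

The quoted ~10x sits near the low end of the compounded range, which is consistent with each layer delivering a little over 2x rather than 3x.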

What this means for coding agent builders

If you’re building or using AI coding agents, here’s my practical takeaway from 106 sessions:

1. Raw tok/s is a marketing metric, not a productivity metric.

Your agent’s real throughput is determined by: context processing speed + reasoning time + generation speed + tool execution time. Generation is typically 10-15% of total agent turn time. Optimizing it helps, but 10x generation speed doesn’t give you 10x productivity.
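This is Amdahl's law applied to an agent turn: if only a fraction of the time is sped up, the overall gain is capped by the part that is not. A minimal sketch follows; the fractions are illustrative, and the fraction refers to the share of time in the baseline, before the speedup.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of baseline time gets `factor`x faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


for share in (0.10, 0.15, 0.50):
    overall = amdahl_speedup(share, 10.0)
    print(f"generation = {share:.0%} of turn, 10x faster decode -> {overall:.2f}x overall")
```

With generation at 10-15% of a turn, a 10x faster decode yields only about 1.1x to 1.16x overall. The measured improvement (7.7 to 4.8 minutes, about 1.6x) is larger than that, which fits the observation that lower latency outside the decode phase contributes as well.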

2. Session-level throughput matters more.

The number you should care about is “tasks completed per hour” or “useful commits per session,” not “tokens per second.” In our data, UltraSpeed completed roughly 60% more runs per time window. That’s the metric that affects your development velocity.

3. Caching strategy dominates.

With 2-8M cached tokens per session, the KV-cache hit rate and prefill speed matter more than decode speed for agent workloads. MiMo’s Hybrid Sliding Window Attention (compressing KV-cache to ~1/7 of full attention) is arguably more important for agent performance than the UltraSpeed decode optimizations.
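To give a feel for why KV-cache size matters at these context lengths, here is the standard size formula for a full-attention cache, with the ~1/7 ratio from above applied on top. Every model dimension below is a hypothetical placeholder, not MiMo's actual configuration, and applying a flat 1/7 factor is a simplification of how a hybrid sliding-window layout would really behave.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Full-attention KV cache size: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical dimensions, chosen only to make the arithmetic concrete.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
CONTEXT_TOKENS = 200_000
COMPRESSION = 7  # "~1/7 of full attention", as quoted above

full_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT_TOKENS)
compressed_bytes = full_bytes / COMPRESSION

print(f"full attention cache: {full_bytes / 1e9:.1f} GB")
print(f"at ~1/7:              {compressed_bytes / 1e9:.1f} GB")
```

A smaller cache is cheaper to keep resident between turns and faster to read during prefill, which is where an agent with a long, mostly unchanged context spends much of its time.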

4. The “speed moat” is real but narrow.

Today, MiMo UltraSpeed at 1,000 tok/s is ahead of most competitors on raw generation speed for a trillion-parameter model. But Cerebras, Groq, and others are closing in on smaller models. The long-term advantage isn’t speed alone. It’s the combination of a frontier-scale model (1T parameters, 42B active) and frontier speed on commodity hardware. That’s what makes UltraSpeed interesting for production agent deployments.

Market context

For reference, MiMo UltraSpeed at 1,000+ tok/s on a trillion-parameter model is currently unique. Other fast inference providers like Cerebras (2,100 tok/s) and Groq (800 tok/s) achieve their speeds on smaller models (70B-class). Closed frontier models like Claude Opus 4.8 and GPT-5.5 typically generate at 60-80 tok/s.

UltraSpeed’s positioning is specifically: frontier-scale model quality at frontier speed on commodity hardware. Whether that speed advantage persists as competitors optimize their own stacks remains to be seen.

Methodology notes

What was controlled:

  • Same agent framework (Claude Code-compatible CLI with MCP)
  • Same codebase (a production website with 680+ pages)
  • Same task types (feature development, bug fixes, conversion optimization, content generation)
  • Same session window structure

What wasn’t controlled:

  • Task difficulty varied naturally (some sessions got harder problems)
  • Cache state varied (first run of a session has less cache than sixth run)
  • Network conditions (VPS in EU, API in Asia)

Sample sizes: 62 runs on Pro, 44 runs on UltraSpeed. Sufficient for meaningful averages but not statistically rigorous A/B testing. The results should be read as “strong indicative data” rather than “laboratory-grade proof.”

The bottom line

Does faster inference make AI coding agents more productive? Yes, meaningfully. Not proportionally to the tok/s improvement (10x generation speed doesn’t give you 10x productivity), but a consistent 37% wall-clock improvement that translates to 60% more completed work in time-constrained environments.

For developers running autonomous coding agents in production, that means more completed work per session. For interactive coding assistance (pair programming, code review, chatbot-style Q&A), the improvement should be even more noticeable, because generation is a larger fraction of the interaction time.

The bigger takeaway: the “tokens per second” arms race matters, but not in the way marketing materials suggest. In agent workflows, it’s one factor among many. The real question isn’t “how fast can you generate?” but “how many useful things can you ship per hour?” For MiMo UltraSpeed, the answer is: roughly 60% more than before.

FAQ

Is MiMo UltraSpeed worth the premium over standard Pro?

If your bottleneck is time (fixed session windows, CI/CD pipelines, real-time agents), yes. You get 37% faster sessions and roughly 60% more completed work per window. If you have unlimited time and just want to minimize spend, standard Pro gives comparable output, just slower.

Does UltraSpeed reduce output quality?

In 44 sessions, output volume remained equivalent (23.8K vs 23.2K tokens per run) and task types were the same (production code commits, feature development, bug fixes). We didn’t observe quality degradation. The FP4 quantization appears well-calibrated for code generation tasks.

Why don’t I see 1,000 tok/s when using UltraSpeed?

You do, during the generation phase. But if you’re measuring total output tokens divided by end-to-end session time, you’ll see 50-150 tok/s because most time is spent on context processing and reasoning, not generation. The 1,000 tok/s headline is a generation-phase metric, not a session-level metric.
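If you want to compute the session-level figure for your own agent, something along these lines should work. The `Run` record and the sample values are made up for illustration and are not taken from the experiment; the percentile method is also an assumption, since the article does not say how P10 and P90 were calculated.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Run:
    """One agent run: total output tokens and end-to-end wall time."""
    output_tokens: int
    wall_seconds: float

    @property
    def effective_tok_s(self) -> float:
        return self.output_tokens / self.wall_seconds


def summarize(runs: Sequence[Run]) -> Dict[str, float]:
    """P10, median and P90 of effective tok/s across runs."""
    rates = sorted(run.effective_tok_s for run in runs)
    deciles = statistics.quantiles(rates, n=10, method="inclusive")
    return {
        "p10": deciles[0],
        "median": statistics.median(rates),
        "p90": deciles[8],
    }


# Made-up sample data, for illustration only.
sample_runs = [
    Run(20_000, 400.0),
    Run(24_000, 300.0),
    Run(25_000, 250.0),
    Run(26_000, 200.0),
    Run(22_000, 500.0),
]
summary = summarize(sample_runs)
print({name: round(value, 1) for name, value in summary.items()})
```

Measuring per run and then taking percentiles, rather than dividing total tokens by total time, keeps a few very long runs from dominating the result.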

How does UltraSpeed compare to Claude Code or Codex?

Different category. Claude Code and Codex are agent frameworks running on their respective models (Claude Opus, GPT). MiMo UltraSpeed is a model + inference optimization. You could theoretically run a Claude Code-style agent on MiMo UltraSpeed via the API (which is what we did). The comparison isn’t model vs model, it’s “does the same agent perform better with faster inference under it.”

Can I self-host UltraSpeed?

No. UltraSpeed requires TileRT’s persistent-core runtime which is only available via the Xiaomi API. You can self-host standard MiMo-V2.5-Pro (MIT open weights on HuggingFace), but you won’t get the 1,000 tok/s optimization without TileRT’s proprietary inference stack.

Is this only useful for coding agents?

The principle applies to any agentic workload with many turns: research agents, data analysis pipelines, multi-step content generation, automated testing. Anywhere an AI iterates through plan-execute-verify loops, faster inference per turn compounds into meaningful productivity gains.


This analysis is based on real production data from our AI Startup Race experiment, where autonomous AI agents compete to build and grow websites. MiMo UltraSpeed access was provided by Xiaomi for testing purposes. The analysis, methodology, and conclusions are entirely our own.