Step 3.7 Flash and MiniMax M3 are the two newest multimodal Chinese models. Both handle text, images, and video. Both are open-weight. Both launched within days of each other. But they sit at different price-performance points.
Step 3.7 Flash optimizes for raw speed (400 t/s) and cost ($0.20/$0.80). MiniMax M3 optimizes for coding quality (59% SWE-bench Pro) and long-context performance (MSA). Here is how to choose between them.
Head-to-head
| Step 3.7 Flash | MiniMax M3 | |
|---|---|---|
| Developer | StepFun | MiniMax |
| Architecture | MoE (198B/11B active) | MSA (sparse attention) |
| Speed | 400 t/s | Fast (MSA) but <400 t/s |
| Input price | $0.20/M | $0.60/M |
| Output price | $0.80/M | $2.40/M |
| Cache hit | $0.04/M | $0.12/M |
| Context | 256K | 1M (512K guaranteed) |
| Vision | β | β |
| Video | β | β |
| Computer use | β (basic) | β (stronger) |
| SWE-bench Pro | β | 59.0% |
| BrowseComp | 75.82% | 83.5% |
| ClawEval (agent reliability) | 67.1 | β |
| Reasoning tiers | β (Low/Med/High) | β |
| Advisor Mode | β | β |
| Open weight | β | β (~June 10) |
| OpenRouter | β | β |
Pricing: Step is 3Γ cheaper
| Step 3.7 Flash | MiniMax M3 | Ratio | |
|---|---|---|---|
| Input | $0.20/M | $0.60/M | 3Γ |
| Output | $0.80/M | $2.40/M | 3Γ |
| Cache | $0.04/M | $0.12/M | 3Γ |
| 1hr session | ~$0.08 | ~$0.50 | 6Γ |
| Monthly (24/7) | ~$60 | ~$360 | 6Γ |
Step 3.7 Flash is dramatically cheaper. For high-volume multimodal workloads, this 3-6Γ gap matters.
Where Step 3.7 Flash wins
Speed (fastest available)
400 tokens/second. No multimodal model comes close. For real-time applications, interactive coding, autocomplete, or any workflow where latency matters, Step is the clear winner.
Price (3Γ cheaper)
At $0.20/$0.80, Step is one of the cheapest capable models available β cheaper than Gemini 3.5 Flash ($0.15/$0.60 input but comparable output). For budget multimodal workloads, Step wins.
Reasoning tiers
Three adjustable levels per request (Low/Medium/High). Use Low for simple visual tasks (cheapest), High for complex reasoning. M3 has one inference mode β you always pay the same regardless of task complexity.
Advisor Mode
Automatic escalation to a stronger model when stuck. Achieves 97% of Opus 4.6 coding quality at $0.19/task average. M3 has no equivalent β you manually switch models.
Agent reliability
67.1 on ClawEval-1.1 measures multi-step task execution under adversarial conditions. Step follows constraints and avoids traps in complex agent workflows.
Where MiniMax M3 wins
Coding quality
59.0% SWE-bench Pro is a proven frontier-level coding score. Step 3.7 Flash has no published SWE-bench score. For complex coding tasks (multi-file refactoring, architecture, debugging), M3 is the safer choice.
Larger context (4Γ)
1M tokens vs 256K. For entire-codebase analysis, long documents, or agent sessions that accumulate history, M3 provides 4Γ the capacity. If your context regularly exceeds 256K tokens, M3 is the only option.
Long-context speed (MSA)
While Step is faster overall (400 t/s for short contexts), M3βs MSA architecture provides 15.6Γ faster decoding specifically at million-token contexts. For very long prompts, M3βs advantage grows.
Browsing accuracy
83.5% vs 75.82% on BrowseComp. M3 is more accurate at web research and information extraction tasks.
Computer use (stronger)
Both can operate a desktop, but M3βs computer use capability is more developed β it demonstrated autonomous ICLR paper reproduction over 12 hours. Stepβs computer use is newer and less proven.
Available sooner for self-hosting
M3 weights expected ~June 10. Step GGUF quantizations are available now. Both are self-hostable but Step requires ~100GB RAM while M3 needs potentially 200GB+. See how to run Step 3.7 locally and how to run M3 locally.
Use case recommendations
| Workload | Best choice | Why |
|---|---|---|
| Real-time multimodal chat | Step 3.7 Flash | 400 t/s, lowest latency |
| Complex coding tasks | MiniMax M3 | 59% SWE-bench Pro |
| Budget multimodal pipeline | Step 3.7 Flash | 3Γ cheaper |
| Long-document analysis | MiniMax M3 | 1M context |
| Video processing (speed) | Step 3.7 Flash | Faster processing |
| Web research agent | MiniMax M3 | 83.5% BrowseComp |
| Simple visual tasks | Step 3.7 Flash | Low reasoning tier = cheapest |
| Autonomous coding agent | MiniMax M3 | Better coding quality |
| Interactive UI testing | Step 3.7 Flash | Speed for iterative loops |
| Codebase-scale analysis | MiniMax M3 | 1M context + MSA |
The broader multimodal landscape
Both compete in an increasingly crowded multimodal tier:
| Model | Input/M | Output/M | Vision | Video | Speed |
|---|---|---|---|---|---|
| Step 3.7 Flash | $0.20 | $0.80 | β | β | 400 t/s |
| MiniMax M3 | $0.60 | $2.40 | β | β | Fast (MSA) |
| Gemini 3.5 Flash | $0.15 | $0.60 | β | β | ~200 t/s |
| Claude Opus 4.8 | $5.00 | $25.00 | β | β | ~80 t/s |
| Qwen 3.7 Plus | $2.50 | $7.50 | β | β | Standard |
Step and M3 are the only models offering native video input at sub-$3/M output pricing. Gemini is cheapest for image-only tasks.
FAQ
Which should I default to?
Step 3.7 Flash for most multimodal tasks (3Γ cheaper, faster). Escalate to M3 only when you need complex coding quality, 1M context, or superior browsing accuracy.
Can I use both via OpenRouter?
Yes. Both on OpenRouter. Route visual-simple tasks to Step, coding-heavy multimodal to M3.
Is Stepβs coding good enough?
For routine tasks with Advisor Mode: achieves 97% of Opus 4.6 quality. For complex multi-file coding: M3 is more reliable. If your agent does mostly tool-calling and simple code edits, Step is fine.
Which is easier to self-host?
Step 3.7 Flash (198B MoE, 11B active) needs ~100GB at Q4. M3 (200-400B estimated) needs ~100-200GB. Step is likely easier to run on a single Mac Studio 128GB. See running Step locally.
What about Gemini 3.5 Flash as an alternative?
Gemini 3.5 Flash is cheaper ($0.15/$0.60) with 1M context and better tool-calling (83.6% MCP Atlas). But it lacks video input and is closed-source. If you need video: Step or M3. If images-only and you want cheapest: Gemini.