πŸ€– AI Tools
Β· 5 min read

Step 3.7 Flash vs MiniMax M3: Speed vs Depth in Multimodal AI (2026)


Step 3.7 Flash and MiniMax M3 are the two newest multimodal Chinese models. Both handle text, images, and video. Both are open-weight. Both launched within days of each other. But they sit at different price-performance points.

Step 3.7 Flash optimizes for raw speed (400 t/s) and cost ($0.20/$0.80). MiniMax M3 optimizes for coding quality (59% SWE-bench Pro) and long-context performance (MSA). Here is how to choose between them.

Head-to-head

Step 3.7 FlashMiniMax M3
DeveloperStepFunMiniMax
ArchitectureMoE (198B/11B active)MSA (sparse attention)
Speed400 t/sFast (MSA) but <400 t/s
Input price$0.20/M$0.60/M
Output price$0.80/M$2.40/M
Cache hit$0.04/M$0.12/M
Context256K1M (512K guaranteed)
Visionβœ…βœ…
Videoβœ…βœ…
Computer useβœ… (basic)βœ… (stronger)
SWE-bench Proβ€”59.0%
BrowseComp75.82%83.5%
ClawEval (agent reliability)67.1β€”
Reasoning tiersβœ… (Low/Med/High)❌
Advisor Modeβœ…βŒ
Open weightβœ…βœ… (~June 10)
OpenRouterβœ…βœ…

Pricing: Step is 3Γ— cheaper

Step 3.7 FlashMiniMax M3Ratio
Input$0.20/M$0.60/M3Γ—
Output$0.80/M$2.40/M3Γ—
Cache$0.04/M$0.12/M3Γ—
1hr session~$0.08~$0.506Γ—
Monthly (24/7)~$60~$3606Γ—

Step 3.7 Flash is dramatically cheaper. For high-volume multimodal workloads, this 3-6Γ— gap matters.

Where Step 3.7 Flash wins

Speed (fastest available)

400 tokens/second. No multimodal model comes close. For real-time applications, interactive coding, autocomplete, or any workflow where latency matters, Step is the clear winner.

Price (3Γ— cheaper)

At $0.20/$0.80, Step is one of the cheapest capable models available β€” cheaper than Gemini 3.5 Flash ($0.15/$0.60 input but comparable output). For budget multimodal workloads, Step wins.

Reasoning tiers

Three adjustable levels per request (Low/Medium/High). Use Low for simple visual tasks (cheapest), High for complex reasoning. M3 has one inference mode β€” you always pay the same regardless of task complexity.

Advisor Mode

Automatic escalation to a stronger model when stuck. Achieves 97% of Opus 4.6 coding quality at $0.19/task average. M3 has no equivalent β€” you manually switch models.

Agent reliability

67.1 on ClawEval-1.1 measures multi-step task execution under adversarial conditions. Step follows constraints and avoids traps in complex agent workflows.

Where MiniMax M3 wins

Coding quality

59.0% SWE-bench Pro is a proven frontier-level coding score. Step 3.7 Flash has no published SWE-bench score. For complex coding tasks (multi-file refactoring, architecture, debugging), M3 is the safer choice.

Larger context (4Γ—)

1M tokens vs 256K. For entire-codebase analysis, long documents, or agent sessions that accumulate history, M3 provides 4Γ— the capacity. If your context regularly exceeds 256K tokens, M3 is the only option.

Long-context speed (MSA)

While Step is faster overall (400 t/s for short contexts), M3’s MSA architecture provides 15.6Γ— faster decoding specifically at million-token contexts. For very long prompts, M3’s advantage grows.

Browsing accuracy

83.5% vs 75.82% on BrowseComp. M3 is more accurate at web research and information extraction tasks.

Computer use (stronger)

Both can operate a desktop, but M3’s computer use capability is more developed β€” it demonstrated autonomous ICLR paper reproduction over 12 hours. Step’s computer use is newer and less proven.

Available sooner for self-hosting

M3 weights expected ~June 10. Step GGUF quantizations are available now. Both are self-hostable but Step requires ~100GB RAM while M3 needs potentially 200GB+. See how to run Step 3.7 locally and how to run M3 locally.

Use case recommendations

WorkloadBest choiceWhy
Real-time multimodal chatStep 3.7 Flash400 t/s, lowest latency
Complex coding tasksMiniMax M359% SWE-bench Pro
Budget multimodal pipelineStep 3.7 Flash3Γ— cheaper
Long-document analysisMiniMax M31M context
Video processing (speed)Step 3.7 FlashFaster processing
Web research agentMiniMax M383.5% BrowseComp
Simple visual tasksStep 3.7 FlashLow reasoning tier = cheapest
Autonomous coding agentMiniMax M3Better coding quality
Interactive UI testingStep 3.7 FlashSpeed for iterative loops
Codebase-scale analysisMiniMax M31M context + MSA

The broader multimodal landscape

Both compete in an increasingly crowded multimodal tier:

ModelInput/MOutput/MVisionVideoSpeed
Step 3.7 Flash$0.20$0.80βœ…βœ…400 t/s
MiniMax M3$0.60$2.40βœ…βœ…Fast (MSA)
Gemini 3.5 Flash$0.15$0.60βœ…βŒ~200 t/s
Claude Opus 4.8$5.00$25.00βœ…βŒ~80 t/s
Qwen 3.7 Plus$2.50$7.50βœ…βŒStandard

Step and M3 are the only models offering native video input at sub-$3/M output pricing. Gemini is cheapest for image-only tasks.

FAQ

Which should I default to?

Step 3.7 Flash for most multimodal tasks (3Γ— cheaper, faster). Escalate to M3 only when you need complex coding quality, 1M context, or superior browsing accuracy.

Can I use both via OpenRouter?

Yes. Both on OpenRouter. Route visual-simple tasks to Step, coding-heavy multimodal to M3.

Is Step’s coding good enough?

For routine tasks with Advisor Mode: achieves 97% of Opus 4.6 quality. For complex multi-file coding: M3 is more reliable. If your agent does mostly tool-calling and simple code edits, Step is fine.

Which is easier to self-host?

Step 3.7 Flash (198B MoE, 11B active) needs ~100GB at Q4. M3 (200-400B estimated) needs ~100-200GB. Step is likely easier to run on a single Mac Studio 128GB. See running Step locally.

What about Gemini 3.5 Flash as an alternative?

Gemini 3.5 Flash is cheaper ($0.15/$0.60) with 1M context and better tool-calling (83.6% MCP Atlas). But it lacks video input and is closed-source. If you need video: Step or M3. If images-only and you want cheapest: Gemini.