Jun 5, 2026 · 5 min read

Step 3.7 Flash vs MiniMax M3: Speed vs Depth in Multimodal AI (2026)

Step 3.7 Flash and MiniMax M3 are the two newest multimodal Chinese models. Both handle text, images, and video. Both are open-weight. Both launched within days of each other. But they sit at different price-performance points.

Step 3.7 Flash optimizes for raw speed (400 t/s) and cost ($0.20/$0.80). MiniMax M3 optimizes for coding quality (59% SWE-bench Pro) and long-context performance (MSA). Here is how to choose between them.

Head-to-head

	Step 3.7 Flash	MiniMax M3
Developer	StepFun	MiniMax
Architecture	MoE (198B/11B active)	MSA (sparse attention)
Speed	400 t/s	Fast (MSA) but <400 t/s
Input price	$0.20/M	$0.60/M
Output price	$0.80/M	$2.40/M
Cache hit	$0.04/M	$0.12/M
Context	256K	1M (512K guaranteed)
Vision	✅	✅
Video	✅	✅
Computer use	✅ (basic)	✅ (stronger)
SWE-bench Pro	—	59.0%
BrowseComp	75.82%	83.5%
ClawEval (agent reliability)	67.1	—
Reasoning tiers	✅ (Low/Med/High)	❌
Advisor Mode	✅	❌
Open weight	✅	✅ (~June 10)
OpenRouter	✅	✅

Pricing: Step is 3× cheaper

	Step 3.7 Flash	MiniMax M3	Ratio
Input	$0.20/M	$0.60/M	3×
Output	$0.80/M	$2.40/M	3×
Cache	$0.04/M	$0.12/M	3×
1hr session	~$0.08	~$0.50	6×
Monthly (24/7)	~$60	~$360	6×

Step 3.7 Flash is dramatically cheaper. For high-volume multimodal workloads, this 3-6× gap matters.

Where Step 3.7 Flash wins

Speed (fastest available)

400 tokens/second. No multimodal model comes close. For real-time applications, interactive coding, autocomplete, or any workflow where latency matters, Step is the clear winner.

Price (3× cheaper)

At $0.20/$0.80, Step is one of the cheapest capable models available — cheaper than Gemini 3.5 Flash ($0.15/$0.60 input but comparable output). For budget multimodal workloads, Step wins.

Reasoning tiers

Three adjustable levels per request (Low/Medium/High). Use Low for simple visual tasks (cheapest), High for complex reasoning. M3 has one inference mode — you always pay the same regardless of task complexity.

Advisor Mode

Automatic escalation to a stronger model when stuck. Achieves 97% of Opus 4.6 coding quality at $0.19/task average. M3 has no equivalent — you manually switch models.

Agent reliability

67.1 on ClawEval-1.1 measures multi-step task execution under adversarial conditions. Step follows constraints and avoids traps in complex agent workflows.

Where MiniMax M3 wins

Coding quality

59.0% SWE-bench Pro is a proven frontier-level coding score. Step 3.7 Flash has no published SWE-bench score. For complex coding tasks (multi-file refactoring, architecture, debugging), M3 is the safer choice.

Larger context (4×)

1M tokens vs 256K. For entire-codebase analysis, long documents, or agent sessions that accumulate history, M3 provides 4× the capacity. If your context regularly exceeds 256K tokens, M3 is the only option.

Long-context speed (MSA)

While Step is faster overall (400 t/s for short contexts), M3’s MSA architecture provides 15.6× faster decoding specifically at million-token contexts. For very long prompts, M3’s advantage grows.

Browsing accuracy

83.5% vs 75.82% on BrowseComp. M3 is more accurate at web research and information extraction tasks.

Computer use (stronger)

Both can operate a desktop, but M3’s computer use capability is more developed — it demonstrated autonomous ICLR paper reproduction over 12 hours. Step’s computer use is newer and less proven.

Available sooner for self-hosting

M3 weights expected ~June 10. Step GGUF quantizations are available now. Both are self-hostable but Step requires ~100GB RAM while M3 needs potentially 200GB+. See how to run Step 3.7 locally and how to run M3 locally.

Use case recommendations

Workload	Best choice	Why
Real-time multimodal chat	Step 3.7 Flash	400 t/s, lowest latency
Complex coding tasks	MiniMax M3	59% SWE-bench Pro
Budget multimodal pipeline	Step 3.7 Flash	3× cheaper
Long-document analysis	MiniMax M3	1M context
Video processing (speed)	Step 3.7 Flash	Faster processing
Web research agent	MiniMax M3	83.5% BrowseComp
Simple visual tasks	Step 3.7 Flash	Low reasoning tier = cheapest
Autonomous coding agent	MiniMax M3	Better coding quality
Interactive UI testing	Step 3.7 Flash	Speed for iterative loops
Codebase-scale analysis	MiniMax M3	1M context + MSA

The broader multimodal landscape

Both compete in an increasingly crowded multimodal tier:

Model	Input/M	Output/M	Vision	Video	Speed
Step 3.7 Flash	$0.20	$0.80	✅	✅	400 t/s
MiniMax M3	$0.60	$2.40	✅	✅	Fast (MSA)
Gemini 3.5 Flash	$0.15	$0.60	✅	❌	~200 t/s
Claude Opus 4.8	$5.00	$25.00	✅	❌	~80 t/s
Qwen 3.7 Plus	$2.50	$7.50	✅	❌	Standard

Step and M3 are the only models offering native video input at sub-$3/M output pricing. Gemini is cheapest for image-only tasks.

FAQ

Which should I default to?

Step 3.7 Flash for most multimodal tasks (3× cheaper, faster). Escalate to M3 only when you need complex coding quality, 1M context, or superior browsing accuracy.

Can I use both via OpenRouter?

Yes. Both on OpenRouter. Route visual-simple tasks to Step, coding-heavy multimodal to M3.

Is Step’s coding good enough?

For routine tasks with Advisor Mode: achieves 97% of Opus 4.6 quality. For complex multi-file coding: M3 is more reliable. If your agent does mostly tool-calling and simple code edits, Step is fine.

Which is easier to self-host?

Step 3.7 Flash (198B MoE, 11B active) needs ~100GB at Q4. M3 (200-400B estimated) needs ~100-200GB. Step is likely easier to run on a single Mac Studio 128GB. See running Step locally.

What about Gemini 3.5 Flash as an alternative?

Gemini 3.5 Flash is cheaper ($0.15/$0.60) with 1M context and better tool-calling (83.6% MCP Atlas). But it lacks video input and is closed-source. If you need video: Step or M3. If images-only and you want cheapest: Gemini.