May 29, 2026 · 6 min read

StepFun Step 3.7 Flash: Complete Guide to the 198B Open-Weight MoE Model (2026)

StepFun released Step 3.7 Flash on May 29, 2026 — a 198B parameter Mixture-of-Experts vision-language model that activates only 11B parameters per token. It runs at 400 tokens per second, supports a 256K context window, handles text + images + video natively, and costs $0.20 per million input tokens on OpenRouter.

The standout feature is Advisor Mode: Step 3.7 Flash handles routine tasks autonomously and only escalates to a larger model when genuinely stuck. In this mode it reaches 97% of Claude Opus 4.6’s coding performance at $0.19 per task (vs $1.76 for Opus). The model is open-weight, runs locally on a Mac Studio with 128GB RAM, and supports vLLM, SGLang, llama.cpp, and Hugging Face Transformers.

Quick specs

Developer	StepFun (China)
Total parameters	198B (196B language + 1.8B vision encoder)
Active parameters	~11B per token (MoE routing)
Context window	256,000 tokens
Speed	400 tokens/second
Modalities	Text, images, video (native multimodal)
Reasoning tiers	Low, Medium, High (tunable per request)
OpenRouter price	~$0.20/M input, ~$0.80/M output
Cache hit price	$0.04/M tokens
Open weight	Yes (Hugging Face + GitHub)
Local deployment	GGUF quantized — Mac Studio 128GB, AMD 120GB, NVIDIA DGX

Architecture: 198B total, 11B active

Step 3.7 Flash uses a Mixture-of-Experts architecture similar to DeepSeek V4-Pro but with different trade-offs. The full model has 198B parameters, but only about 11B activate per token through expert routing. This gives it the knowledge capacity of a 198B model with per-token compute closer to that of an 11B model. Memory is a different matter: all 198B parameters must still be loaded, which is why local deployment needs so much RAM.
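
To make the "total versus active" distinction concrete, here is a toy top-k router in plain Python. It is only a sketch of the general MoE idea: the expert count, the value of k, and the gate scores are made up for illustration, and StepFun's actual routing configuration is not described in this article.

```python
import math

def top_k_route(gate_logits, k):
    """Select the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(gate_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: exps[i] / total for i in chosen}

# Hypothetical gate scores for one token across 8 experts (illustrative only).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
routes = top_k_route(logits, k=2)
print("experts used for this token:", sorted(routes))

# Parameter figures quoted in the article: 11B active out of 198B total.
active_fraction = 11 / 198
print(f"active share per token: {active_fraction:.1%}")
```

Only the selected experts run for a given token, so compute scales with the active share, while every expert still has to sit in memory.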

The vision encoder (1.8B parameters) handles images and video natively — no separate preprocessing pipeline needed. The model can directly parse UI interfaces, charts, documents, and video frames as part of its reasoning.

The 256K context window is large enough for most codebases and document processing tasks. Combined with 400 t/s throughput, it processes long contexts faster than most frontier models.

Advisor Mode: the cost breakthrough

Advisor Mode is Step 3.7 Flash’s most innovative feature. Instead of always using the most expensive model, it creates a two-tier system:

Step 3.7 Flash handles routine execution — tool calling, code iteration, standard tasks
A larger “Advisor” model only activates when Flash gets genuinely stuck (complex planning, repeated failures)

The result: 97% of Claude Opus 4.6’s coding performance at $0.19 per task instead of $1.76. That is roughly a 9× cost reduction ($1.76 / $0.19 ≈ 9.3) for a 3-point drop in relative performance.

This is similar to how developers use cheap models for 90% of work and expensive models for the hard 10% — but automated at the model level.
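
The escalation pattern described above can be sketched as a small control loop. This is a minimal illustration of the idea, not StepFun's implementation: `run_with_advisor`, `toy_worker`, `toy_advisor`, and the `max_failures` threshold are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Optional, Tuple

def run_with_advisor(
    task: str,
    worker: Callable[[str], Optional[str]],
    advisor: Callable[[str], str],
    max_failures: int = 2,
) -> Tuple[str, str]:
    """Try the cheap worker first; escalate to the advisor after repeated failures.

    `worker` returns None to signal failure. Returns (answer, who_answered).
    """
    for _ in range(max_failures):
        answer = worker(task)
        if answer is not None:
            return answer, "worker"
    return advisor(task), "advisor"

# Stand-in models for demonstration only; real calls would hit two different endpoints.
def toy_worker(task: str) -> Optional[str]:
    return None if "hard" in task else f"done: {task}"

def toy_advisor(task: str) -> str:
    return f"advised: {task}"

print(run_with_advisor("rename variable", toy_worker, toy_advisor))
print(run_with_advisor("hard planning", toy_worker, toy_advisor))

# Cost ratio implied by the per-task figures quoted in the article.
cost_ratio = 1.76 / 0.19
print(f"cost ratio: {cost_ratio:.1f}x")
```

The savings depend on how rarely the expensive path is taken, so a real deployment would want to log the escalation rate alongside task outcomes.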

Three reasoning tiers

Unlike most models that have a single inference mode, Step 3.7 Flash offers three reasoning levels per request:

Tier	Use case	Speed	Cost	Quality
Low	Simple queries, classification, extraction	Fastest	Cheapest	Good
Medium	Standard coding, analysis, generation	Balanced	Moderate	Better
High	Complex reasoning, multi-step planning	Slower	Higher	Best

Developers choose the tier per API call. This means you can use Low for autocomplete, Medium for standard coding, and High for architecture decisions — all with the same model, same endpoint, same API key.
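
One way to apply per-call tiers is a small lookup that maps task types to a tier and builds the request body. This is a sketch under assumptions: the `reasoning` / `effort` field follows the shape OpenRouter uses for reasoning control on some models, but whether Step 3.7 Flash's tiers are exposed through that exact field is an assumption here, and the task categories are hypothetical. Check the provider documentation before relying on it.

```python
# Hypothetical mapping from task type to reasoning tier.
TIER_BY_TASK = {
    "autocomplete": "low",
    "classification": "low",
    "coding": "medium",
    "analysis": "medium",
    "architecture": "high",
}

def build_request(task_type: str, prompt: str) -> dict:
    """Build a chat-completions request body with a per-call reasoning tier."""
    tier = TIER_BY_TASK.get(task_type, "medium")  # fall back to the balanced tier
    return {
        "model": "stepfun/step-3.7-flash",
        "messages": [{"role": "user", "content": prompt}],
        # ASSUMPTION: field name and shape for tier selection; verify with the provider.
        "reasoning": {"effort": tier},
    }

print(build_request("autocomplete", "complete: def add(a, b):")["reasoning"])
print(build_request("architecture", "Design a job queue")["reasoning"])
```

Keeping the mapping in one place makes it easy to tune later, for example by moving a task type up a tier if its outputs prove unreliable.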

Benchmarks

Benchmark	Step 3.7 Flash	What it measures
ClawEval-1.1	67.1	Agent reliability (multi-step task execution)
BrowseComp	75.82%	Web search accuracy
Coding (Advisor Mode)	97% of Opus 4.6	Software engineering tasks
GUI control	Emergent	Can write code, open browser, visually verify, fix

The ClawEval score of 67.1 is notable — it measures whether the model follows constraints and avoids adversarial traps during multi-step agent execution. This makes it suitable for production agent workflows where reliability matters.

Multimodal capabilities

Step 3.7 Flash does not just understand images — it can act on them:

Direct image manipulation: Crops, zooms, draws bounding boxes using Python tools
GUI interaction: Opens browsers, inspects rendered pages, modifies code based on visual output
Video understanding: Processes video frames for temporal reasoning
Document parsing: OCR, chart reading, structured data extraction

The emergent ability to combine visual and non-visual tools is particularly interesting. It can write frontend code, open a browser to check the rendering, spot a visual bug, and fix the code — all autonomously.
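
For image input, OpenAI-compatible chat APIs generally accept a message whose content mixes text and image parts. The helper below builds such a message from raw bytes. It assumes Step 3.7 Flash accepts this standard format through an OpenAI-compatible endpoint, which the article does not confirm; `image_message` is a hypothetical helper name.

```python
import base64

def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message with a text part and an inline base64 image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }

# Placeholder bytes stand in for a real screenshot file.
msg = image_message("What is wrong with this layout?", b"abc")
print(msg["content"][1]["image_url"]["url"])
```

The resulting dict can be passed in the `messages` list of a chat-completions call in place of a plain text message.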

Pricing comparison

Model	Input/M	Output/M	Speed	Context
Step 3.7 Flash	~$0.20	~$0.80	400 t/s	256K
Gemini 3.5 Flash	$0.15	$0.60	~200 t/s	1M
DeepSeek V4-Pro	$0.435	$0.87	~100 t/s	1M
MiMo V2.5 Pro	$0.435	$0.87	~80 t/s	1M
Claude Opus 4.8	$5.00	$25.00	~80 t/s	1M

Step 3.7 Flash sits in the same price tier as Gemini 3.5 Flash, slightly more expensive on both input and output but with 2× the speed and native multimodal capabilities. It is cheaper than DeepSeek/MiMo on both input ($0.20 vs $0.435) and output ($0.80 vs $0.87).
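
Per-million prices are easier to compare once applied to a concrete request. The sketch below uses the prices from the table above (the Step 3.7 Flash figures are approximate) and a hypothetical workload of 50K input tokens and 5K output tokens; it ignores cache discounts.

```python
# (input $/M tokens, output $/M tokens), taken from the pricing table.
PRICES = {
    "Step 3.7 Flash": (0.20, 0.80),
    "Gemini 3.5 Flash": (0.15, 0.60),
    "DeepSeek V4-Pro": (0.435, 0.87),
    "MiMo V2.5 Pro": (0.435, 0.87),
    "Claude Opus 4.8": (5.00, 25.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at list prices, without cache discounts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Hypothetical workload: 50K input tokens, 5K output tokens.
for model in PRICES:
    print(f"{model:18s} ${request_cost(model, 50_000, 5_000):.4f}")
```

Input-heavy workloads such as code review favor models with cheap input, while generation-heavy workloads shift the weight toward the output price.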

How to use Step 3.7 Flash

Via OpenRouter (easiest)

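import os

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API, so the standard client works.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # read the key from the environment
)

response = client.chat.completions.create(
    model="stepfun/step-3.7-flash",
    messages=[{"role": "user", "content": "Write a Redis connection pool in Python"}],
)

# Print the generated text
print(response.choices[0].message.content)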

Local deployment (GGUF)

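# Download quantized model into a local directory
huggingface-cli download stepfun-ai/Step-3.7-Flash-GGUF --local-dir ./Step-3.7-Flash-GGUF

# Run with llama.cpp (adjust the path if the file sits elsewhere in the repo)
# -c sets the context size in tokens; -ngl sets how many layers to offload to the GPU
./llama-server -m ./Step-3.7-Flash-GGUF/Step-3.7-Flash-Q4_K_M.gguf -c 65536 -ngl 99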

Requires 128GB unified memory (Mac Studio) or 120GB system RAM (AMD) for the full quantized model.
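
A rough back-of-envelope estimate shows why memory in this range is needed. Even though only about 11B parameters are active per token, all 198B must be resident. The bits-per-weight values below are illustrative assumptions: real GGUF quantizations mix precisions, and the KV cache and runtime overhead add to the total.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes); excludes KV cache and overhead."""
    return total_params_billion * bits_per_weight / 8

# 198B total parameters (from the article) at several assumed precisions.
for bits in (4.0, 4.5, 8.0, 16.0):
    print(f"{bits:>4} bits/weight -> ~{weight_memory_gb(198, bits):.0f} GB")
```

At around 4 to 4.5 bits per weight the weights alone land near 100 to 110 GB, which leaves limited headroom on a 128GB machine for long contexts.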

With coding tools

Step 3.7 Flash works with any tool that supports OpenAI-compatible endpoints:

Aider: Set --model openrouter/stepfun/step-3.7-flash
Continue: Add as custom model with OpenRouter base URL
Claude Code: Not directly supported (Anthropic models only)

Who should use Step 3.7 Flash

Agent builders needing reliable multi-step execution (67.1 ClawEval)
Multimodal workflows combining vision, code, and tool use
Cost-sensitive teams wanting near-frontier quality at ~$0.20 per million input tokens
Speed-critical applications needing 400 t/s throughput
Privacy-conscious enterprises wanting to self-host an open-weight model
Developers who want tunable reasoning (pay only for the depth you need)

Limitations

Newer model: Less community tooling and documentation than DeepSeek or Gemini
Smaller context: 256K vs 1M for DeepSeek/Gemini/Claude
Less proven: Limited real-world production data compared to established models
Local requirements: 128GB RAM minimum for self-hosting is still substantial
Coding benchmarks: No SWE-bench score published yet — hard to compare directly on coding

FAQ

How does Step 3.7 Flash compare to Gemini 3.5 Flash?

Similar price tier ($0.20 vs $0.15 input) but Step 3.7 Flash is 2× faster (400 vs ~200 t/s), has native video understanding, and offers tunable reasoning tiers. Gemini has a larger context window (1M vs 256K) and deeper Google ecosystem integration.

Can I use it for coding?

Yes. In Advisor Mode, it achieves 97% of Claude Opus 4.6’s coding performance at 9× lower cost. For routine coding tasks, it is more than capable. For the hardest problems, pair it with a stronger model via the Advisor pattern.

Is it truly open-weight?

Yes. Available on Hugging Face (stepfun-ai/Step-3.7-Flash) and GitHub. Supports vLLM, SGLang, llama.cpp, and Transformers. GGUF quantized versions available for local deployment.

What are the three reasoning tiers?

Low (fast, cheap), Medium (balanced), High (deep reasoning). You choose per API call. This lets you optimize cost vs quality for each specific task without switching models.

How does Advisor Mode work?

Step 3.7 Flash handles execution autonomously. Only when it encounters a genuine bottleneck (complex planning, repeated failures) does it escalate to a larger model. This happens automatically — you do not need to configure the routing manually.

Is it available on OpenRouter?

Yes. Model ID: stepfun/step-3.7-flash. Available immediately with standard OpenRouter billing.