StepFun Step 3.7 Flash: Complete Guide to the 198B Open-Weight MoE Model (2026)


StepFun released Step 3.7 Flash on May 29, 2026. It is a 198B-parameter Mixture-of-Experts vision-language model that activates only about 11B parameters per token. It runs at 400 tokens per second, supports a 256K context window, handles text, images, and video natively, and costs about $0.20 per million input tokens on OpenRouter.

The standout feature is Advisor Mode: Step 3.7 Flash handles routine tasks autonomously and only escalates to a larger model when genuinely stuck. This achieves 97% of Claude Opus 4.6’s coding performance at $0.19 per task (vs $1.76 for Opus). It is open-weight, runs locally on a Mac Studio with 128GB RAM, and supports vLLM, SGLang, llama.cpp, and Hugging Face Transformers.

Quick specs

Developer: StepFun (China)
Total parameters: 198B (196B language + 1.8B vision encoder)
Active parameters: ~11B per token (MoE routing)
Context window: 256,000 tokens
Speed: 400 tokens/second
Modalities: Text, images, video (native multimodal)
Reasoning tiers: Low, Medium, High (tunable per request)
OpenRouter price: ~$0.20/M input, ~$0.80/M output
Cache hit price: $0.04/M tokens
Open weight: Yes (Hugging Face + GitHub)
Local deployment: GGUF quantized (Mac Studio 128GB, AMD 120GB, NVIDIA DGX)

Architecture: 198B total, 11B active

Step 3.7 Flash uses a Mixture-of-Experts architecture similar to DeepSeek V4-Pro but with different trade-offs. The full model has 198B parameters, but only 11B activate per token through expert routing. This gives it the knowledge capacity of a 198B model with the inference cost of an 11B model.
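To make the "198B total, 11B active" idea concrete, here is a toy sketch of top-k expert routing in plain Python. The expert count and top-k value are hypothetical numbers chosen for illustration, not StepFun's published configuration, and real routers operate on learned tensors rather than random scores.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    # Indices of the k highest-probability experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weight_sum = sum(probs[i] for i in chosen)
    return {i: probs[i] / weight_sum for i in chosen}

# Hypothetical sizes for illustration only; not StepFun's real configuration.
NUM_EXPERTS = 64
TOP_K = 4

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits, TOP_K)

print(f"experts used for this token: {sorted(assignment)}")
print(f"share of experts active: {TOP_K / NUM_EXPERTS:.1%}")
print(f"share of Step 3.7 Flash parameters active: {11 / 198:.1%}")
```

Only the chosen experts run for a given token, which is why per-token compute tracks the active parameter count rather than the total.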

The vision encoder (1.8B parameters) handles images and video natively, with no separate preprocessing pipeline needed. The model can directly parse user interfaces, charts, documents, and video frames as part of its reasoning.

The 256K context window is large enough for most codebases and document processing tasks. Combined with 400 t/s throughput, it processes long contexts faster than most frontier models.
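A quick way to sanity-check whether a codebase fits is to estimate tokens from character counts. The sketch below assumes roughly 4 characters per token, which is a common rule of thumb rather than a property of StepFun's tokenizer, and it assumes the quoted 400 t/s applies to generated tokens. The function names and the output reserve are illustrative.

```python
CONTEXT_WINDOW = 256_000   # tokens, from the quick specs
CHARS_PER_TOKEN = 4        # assumption: rough heuristic, varies by tokenizer and language

def estimate_tokens(text: str) -> int:
    """Approximate token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)

def fits_in_context(documents, reserve_for_output=8_000):
    """Return (fits, total_tokens) for a list of strings, leaving room for the reply."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserve_for_output <= CONTEXT_WINDOW, total

def seconds_to_generate(output_tokens, tokens_per_second=400):
    """Time to stream a reply at the quoted 400 tokens/second."""
    return output_tokens / tokens_per_second
```

For a real budget, count tokens with the model's actual tokenizer instead of this heuristic.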

Advisor Mode: the cost breakthrough

Advisor Mode is Step 3.7 Flash’s most innovative feature. Instead of always using the most expensive model, it creates a two-tier system:

  1. Step 3.7 Flash handles routine execution: tool calling, code iteration, standard tasks
  2. A larger “Advisor” model only activates when Flash gets genuinely stuck (complex planning, repeated failures)

The result: 97% of Claude Opus 4.6’s coding performance at $0.19 per task instead of $1.76. That is roughly a 9× cost reduction ($1.76 / $0.19 ≈ 9.3) with minimal quality loss.

This is similar to how developers use cheap models for 90% of work and expensive models for the hard 10%, but automated at the model level.
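The escalation pattern can be approximated on the client side as well. The sketch below is not StepFun's implementation: `flash` and `advisor` are hypothetical callables standing in for the two models, and returning `None` to signal "stuck" is a convention invented for this example.

```python
from typing import Callable, Optional

def run_with_advisor(
    task: str,
    flash: Callable[[str], Optional[str]],
    advisor: Callable[[str], str],
    max_flash_attempts: int = 3,
) -> tuple[str, str]:
    """Try the cheap model first; escalate only after repeated failures.

    `flash` returns None to signal failure (hypothetical convention).
    Returns (answer, which_model_answered).
    """
    for _ in range(max_flash_attempts):
        answer = flash(task)
        if answer is not None:
            return answer, "flash"
    return advisor(task), "advisor"

# Per-task cost figures quoted in the article.
ADVISOR_MODE_COST_PER_TASK = 0.19
OPUS_COST_PER_TASK = 1.76
print(f"cost ratio: {OPUS_COST_PER_TASK / ADVISOR_MODE_COST_PER_TASK:.1f}x")
```

The attempt limit is the main knob: a lower value escalates sooner and costs more, while a higher value spends more cheap attempts before calling the larger model.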

Three reasoning tiers

Unlike most models that have a single inference mode, Step 3.7 Flash offers three reasoning levels per request:

Tier | Use case | Speed | Cost | Quality
--- | --- | --- | --- | ---
Low | Simple queries, classification, extraction | Fastest | Cheapest | Good
Medium | Standard coding, analysis, generation | Balanced | Moderate | Better
High | Complex reasoning, multi-step planning | Slower | Higher | Best

Developers choose the tier per API call. This means you can use Low for autocomplete, Medium for standard coding, and High for architecture decisions, all with the same model, same endpoint, same API key.
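Per-call tier selection might look like the sketch below. The article does not name the request parameter, so this assumes the tier is passed through OpenRouter's `reasoning` field with an `effort` value. Whether Step 3.7 Flash honors that field is unverified, and the task-to-tier mapping is hypothetical.

```python
# Hypothetical mapping from task type to reasoning tier.
TIER_BY_TASK = {
    "autocomplete": "low",
    "coding": "medium",
    "architecture": "high",
}

def build_request(task_type: str, prompt: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    tier = TIER_BY_TASK.get(task_type, "medium")
    return {
        "model": "stepfun/step-3.7-flash",
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the tier is passed through OpenRouter's `reasoning` field.
        "extra_body": {"reasoning": {"effort": tier}},
    }
```

With an OpenAI-compatible client you would call `client.chat.completions.create(**build_request("architecture", "..."))`. Check the provider documentation for the exact parameter before relying on it.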

Benchmarks

Benchmark | Step 3.7 Flash | What it measures
--- | --- | ---
ClawEval-1.1 | 67.1 | Agent reliability (multi-step task execution)
BrowseComp | 75.82% | Web search accuracy
Coding (Advisor Mode) | 97% of Opus 4.6 | Software engineering tasks
GUI control | Emergent | Can write code, open browser, visually verify, fix

The ClawEval score of 67.1 is notable because it measures whether the model follows constraints and avoids adversarial traps during multi-step agent execution. This makes it suitable for production agent workflows where reliability matters.

Multimodal capabilities

Step 3.7 Flash does not just understand images; it can act on them:

  • Direct image manipulation: Crops, zooms, draws bounding boxes using Python tools
  • GUI interaction: Opens browsers, inspects rendered pages, modifies code based on visual output
  • Video understanding: Processes video frames for temporal reasoning
  • Document parsing: OCR, chart reading, structured data extraction

The emergent ability to combine visual and non-visual tools is particularly interesting. It can write frontend code, open a browser to check the rendering, spot a visual bug, and fix the code, all autonomously.
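Sending an image alongside a prompt typically uses the OpenAI-style content-parts format. The helper below builds such a message with an inline base64 data URL. It is a sketch that assumes Step 3.7 Flash accepts this format through an OpenAI-compatible endpoint, and `image_message` is a name invented for this example.

```python
import base64

def image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build one user message combining text and an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }
```

The returned dict goes into the `messages` list of a chat completion request, for example with a screenshot of a rendered page and a prompt asking what looks wrong.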

Pricing comparison

Model | Input/M | Output/M | Speed | Context
--- | --- | --- | --- | ---
Step 3.7 Flash | ~$0.20 | ~$0.80 | 400 t/s | 256K
Gemini 3.5 Flash | $0.15 | $0.60 | ~200 t/s | 1M
DeepSeek V4-Pro | $0.435 | $0.87 | ~100 t/s | 1M
MiMo V2.5 Pro | $0.435 | $0.87 | ~80 t/s | 1M
Claude Opus 4.8 | $5.00 | $25.00 | ~80 t/s | 1M

Step 3.7 Flash sits in the same price tier as Gemini 3.5 Flash (slightly higher per token) but with roughly 2× the speed and native multimodal capabilities. It is cheaper than DeepSeek V4-Pro and MiMo V2.5 Pro on both input ($0.20 vs $0.435) and output ($0.80 vs $0.87).
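To compare models on your own workload, the per-million prices can be turned into a per-request cost. The sketch below uses the figures quoted in this article and assumes the $0.04/M cache hit price applies to input tokens served from cache. Treat all prices as approximate, since the article marks the Step 3.7 Flash rates with "~".

```python
# USD per million tokens (input, output), copied from the comparison table.
PRICES = {
    "Step 3.7 Flash": (0.20, 0.80),
    "Gemini 3.5 Flash": (0.15, 0.60),
    "DeepSeek V4-Pro": (0.435, 0.87),
    "MiMo V2.5 Pro": (0.435, 0.87),
    "Claude Opus 4.8": (5.00, 25.00),
}
STEP_CACHE_HIT_PRICE = 0.04  # USD per million cached input tokens (quick specs)

def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Cost in USD; cached_tokens only get the discount for Step 3.7 Flash."""
    in_price, out_price = PRICES[model]
    cached_price = STEP_CACHE_HIT_PRICE if model == "Step 3.7 Flash" else in_price
    fresh = input_tokens - cached_tokens
    total = fresh * in_price + cached_tokens * cached_price + output_tokens * out_price
    return total / 1_000_000

for name in PRICES:
    print(f"{name}: ${request_cost(name, 10_000, 2_000):.5f} per request")
```

The loop prices a hypothetical request of 10,000 input tokens and 2,000 output tokens; swap in your own token counts to see where the ranking changes.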

How to use Step 3.7 Flash

Via OpenRouter (easiest)

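import os

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API, so the standard client works.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="stepfun/step-3.7-flash",
    messages=[{"role": "user", "content": "Write a Redis connection pool in Python"}],
)

# Print the generated reply.
print(response.choices[0].message.content)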

Local deployment (GGUF)

# Download quantized model
huggingface-cli download stepfun-ai/Step-3.7-Flash-GGUF

# Run with llama.cpp
./llama-server -m Step-3.7-Flash-Q4_K_M.gguf -c 65536 -ngl 99

Requires 128GB unified memory (Mac Studio) or 120GB system RAM (AMD) for the full quantized model.
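The memory figure follows from simple arithmetic on the parameter count. The sketch below estimates the size of the weights alone at different precisions. The 4.8 bits-per-weight value for a Q4_K_M-style quantization is an assumed average, not a number published for this model, and the estimate ignores the KV cache and runtime overhead.

```python
TOTAL_PARAMS = 198e9  # total parameters, from the quick specs

def weights_size_gb(bits_per_weight: float, params: float = TOTAL_PARAMS) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

# Assumed average bits per weight; real GGUF files vary by tensor mix.
for name, bpw in [("FP16", 16.0), ("8-bit", 8.0), ("~4.8-bit (assumed for Q4_K_M)", 4.8)]:
    print(f"{name}: ~{weights_size_gb(bpw):.0f} GB")
```

Although only about 11B parameters are active per token, every expert generally has to be resident in memory, so the footprint is driven by the 198B total.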

With coding tools

Step 3.7 Flash works with any tool that supports OpenAI-compatible endpoints:

  • Aider: Set --model openrouter/stepfun/step-3.7-flash
  • Continue: Add as custom model with OpenRouter base URL
  • Claude Code: Not directly supported (Anthropic models only)
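The same OpenAI-compatible shape applies to a locally hosted model. The sketch below prepares a chat request for a local `llama-server` using only the standard library. It assumes the server is listening on port 8080 and serving `/v1/chat/completions`; the model name is a placeholder, and the helper names are invented for this example.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on port 8080.
LOCAL_BASE_URL = "http://localhost:8080/v1"

def build_chat_request(prompt: str, base_url: str = LOCAL_BASE_URL) -> urllib.request.Request:
    """Prepare (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "step-3.7-flash",  # placeholder; the name a local server expects may differ
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `send(build_chat_request("Explain this stack trace"))` would return the reply once a server is running. Tools such as Aider or Continue do the equivalent when you point their base URL at the same address.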

Who should use Step 3.7 Flash

  • Agent builders needing reliable multi-step execution (67.1 ClawEval)
  • Multimodal workflows combining vision, code, and tool use
  • Cost-sensitive teams wanting near-frontier quality at $0.20/M
  • Speed-critical applications needing 400 t/s throughput
  • Privacy-conscious enterprises wanting to self-host an open-weight model
  • Developers who want tunable reasoning (pay only for the depth you need)

Limitations

  • Newer model: Less community tooling and documentation than DeepSeek or Gemini
  • Smaller context: 256K vs 1M for DeepSeek/Gemini/Claude
  • Less proven: Limited real-world production data compared to established models
  • Local requirements: 128GB RAM minimum for self-hosting is still substantial
  • Coding benchmarks: No SWE-bench score published yet, so it is hard to compare directly on coding

FAQ

How does Step 3.7 Flash compare to Gemini 3.5 Flash?

Similar price tier ($0.20 vs $0.15 input) but Step 3.7 Flash is 2× faster (400 vs ~200 t/s), has native video understanding, and offers tunable reasoning tiers. Gemini has a larger context window (1M vs 256K) and deeper Google ecosystem integration. See our detailed comparison.

Can I use it for coding?

Yes. In Advisor Mode, it achieves 97% of Claude Opus 4.6’s coding performance at roughly 9× lower cost. For routine coding tasks, it is more than capable. For the hardest problems, pair it with a stronger model via the Advisor pattern.

Is it truly open-weight?

Yes. Available on Hugging Face (stepfun-ai/Step-3.7-Flash) and GitHub. Supports vLLM, SGLang, llama.cpp, and Transformers. GGUF quantized versions available for local deployment.

What are the three reasoning tiers?

Low (fast, cheap), Medium (balanced), High (deep reasoning). You choose per API call. This lets you optimize cost vs quality for each specific task without switching models.

How does Advisor Mode work?

Step 3.7 Flash handles execution autonomously. Only when it encounters a genuine bottleneck (complex planning, repeated failures) does it escalate to a larger model. This happens automatically, so you do not need to configure the routing manually.

Is it available on OpenRouter?

Yes. Model ID: stepfun/step-3.7-flash. Available immediately with standard OpenRouter billing.