Jun 5, 2026 · 6 min read

How to Run Step 3.7 Flash Locally: Hardware, Setup, and Performance Guide (2026)

Step 3.7 Flash is a 198B-parameter mixture-of-experts (MoE) model that activates only 11B parameters per token. It therefore has the knowledge capacity of a 198B model with roughly the per-token compute of an 11B model. It is fully open-weight, available on Hugging Face, and runs locally on consumer hardware, provided you have enough RAM.

This guide covers hardware requirements, quantization options, setup with llama.cpp/vLLM, and expected performance.

Hardware requirements

Step 3.7 Flash has 198B total parameters but only 11B activate per token. Memory requirements are still set by the full 198B: with MoE, all experts must be resident in memory, even though only a few activate per token.

Quantization	Memory needed	Hardware options	Speed (est.)
FP16	~400GB	Multi-GPU server (6-8× A100 80GB)	50-100 t/s
Q8	~200GB	3× A100 80GB (too large for a 192GB Mac Studio)	30-60 t/s
Q6_K	~150GB	Mac Studio 192GB, 2× A100	25-50 t/s
Q4_K_M	~100GB	RTX Spark 128GB, Mac Studio 128GB	15-30 t/s
Q3_K	~75GB	Mac Studio 128GB, high-RAM AMD	10-20 t/s

The sweet spot for consumer hardware is Q4_K_M at ~100GB — runs on a Mac Studio M4 Ultra 128GB or the upcoming NVIDIA RTX Spark (128GB unified memory, fall 2026).
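
The memory column above can be reproduced with simple arithmetic: total parameters times bits per weight. The sketch below assumes nominal bit widths (4 bits for Q4_K_M, and so on); real GGUF K-quants store extra scale data, so actual files will likely be somewhat larger, and the function name is only illustrative.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = total_params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Nominal bit widths (assumption): real GGUF files carry extra scale data,
# so treat these as lower bounds rather than exact file sizes.
NOMINAL_BITS = {"FP16": 16, "Q8": 8, "Q6_K": 6, "Q4_K_M": 4, "Q3_K": 3}

for name, bits in NOMINAL_BITS.items():
    print(f"{name}: ~{weight_memory_gb(198, bits):.0f} GB")
```

Whatever the result, leave headroom on top of it for the context (KV cache) and the operating system.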

Why MoE models are unique for local deployment

Unlike dense models where every parameter participates in every forward pass, MoE models activate a subset of experts per token. Step 3.7 Flash activates ~11B of its 198B per token. This means:

Memory: You need 100-200GB to store all experts (even dormant ones)
Compute: Only 11B worth of computation per token (fast inference)
Speed: Much faster than a 198B dense model, similar to an 11-14B model

This is why Step 3.7 Flash generates at 400 t/s on the API — the active compute is tiny despite the massive total parameter count.
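
To make the "all experts resident, few experts active" idea concrete, here is a toy top-k router. The expert count (16) and top-k (2) are hypothetical values picked for illustration; they are not Step 3.7 Flash's real configuration, and a real router is a learned layer rather than random scores.

```python
import random

def route_token(router_scores, top_k):
    """Return the indices of the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:top_k])

def active_fraction(active_params_b, total_params_b):
    """Share of the model's parameters used for one token."""
    return active_params_b / total_params_b

# HYPOTHETICAL layout for illustration only: 16 experts, 2 chosen per token.
random.seed(0)
scores = [random.random() for _ in range(16)]
chosen = route_token(scores, top_k=2)
print(f"experts used for this token: {chosen} of {len(scores)} resident")
print(f"Step 3.7 Flash active share: {active_fraction(11, 198):.1%}")
```

Every expert has to sit in memory because the router may pick any of them for the next token, but only the chosen ones cost compute.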

Setup with llama.cpp (recommended for consumer hardware)

Download the GGUF quantization

# Install huggingface-cli if needed
pip install huggingface_hub

# Download Q4_K_M (recommended, ~100GB)
huggingface-cli download stepfun-ai/Step-3.7-Flash-GGUF \
  Step-3.7-Flash-Q4_K_M.gguf \
  --local-dir ./models/

# Or Q6_K for better quality (~150GB)
huggingface-cli download stepfun-ai/Step-3.7-Flash-GGUF \
  Step-3.7-Flash-Q6_K.gguf \
  --local-dir ./models/
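
Before starting a download of this size, it may be worth checking free disk space. A minimal sketch using only the standard library; the sizes are the approximate figures from this guide, not measured file sizes, and `has_room` is an illustrative helper name.

```python
import shutil

def has_room(path: str, needed_gb: float) -> bool:
    """True if the filesystem holding `path` has at least needed_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb

# Approximate sizes from this guide (assumption: real files may differ).
for quant, size_gb in {"Q4_K_M": 100, "Q6_K": 150}.items():
    status = "ok" if has_room(".", size_gb) else "not enough free space"
    print(f"{quant} (~{size_gb} GB): {status}")
```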

Run the server

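# Clone and build llama.cpp (if not installed)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# llama.cpp builds with CMake; add -DGGML_CUDA=ON to the first command for NVIDIA GPUs
cmake -B build
cmake --build build --config Release -j

# Start the server
./build/bin/llama-server \
  -m ../models/Step-3.7-Flash-Q4_K_M.gguf \
  -c 65536 \
  -ngl 99 \
  --port 8080 \
  --host 0.0.0.0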

Options:

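-c 65536: 64K context (increase if you have spare RAM, up to 256K)
-ngl 99: Offload all layers to GPU (for CUDA/Metal)
--port 8080: API endpoint port
--host 0.0.0.0: Listen on all network interfaces; use 127.0.0.1 if the server should only be reachable from this machine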
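
How much "spare RAM" a larger -c value needs depends on the KV cache, which grows linearly with context length. The sketch below uses the standard grouped-query attention formula with HYPOTHETICAL architecture values (48 layers, 8 KV heads, head dimension 128); Step 3.7 Flash's real values are not given here and its attention design may differ, so treat the output as a scaling illustration only.

```python
def kv_cache_gb(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GB for standard grouped-query attention.

    The factor 2 covers the separate key and value tensors.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens
    return total_bytes / 1e9

# HYPOTHETICAL architecture values, chosen only to show the scaling.
# Read the real ones from the model's config before relying on the result.
ARCH = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (32768, 65536, 262144):
    print(f"-c {ctx}: ~{kv_cache_gb(ctx, **ARCH):.1f} GB of KV cache")
```

The takeaway is the linear growth: quadrupling the context from 64K to 256K quadruples the cache on top of the fixed weight memory.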

Connect your tools

Once running, the server exposes an OpenAI-compatible API at http://localhost:8080/v1:

# Aider
export OPENAI_API_BASE="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed"
aider --model openai/Step-3.7-Flash-Q4_K_M

# curl test
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Step-3.7-Flash", "messages": [{"role": "user", "content": "Hello"}]}'
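
The same request can be sent from Python without extra dependencies. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080/v1", model="Step-3.7-Flash"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply; pass `base_url="http://localhost:8000/v1"` to target a different port.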

Setup with vLLM (recommended for multi-GPU servers)

pip install vllm

# Run with tensor parallelism across 4 GPUs (unquantized weights need ~400GB of combined VRAM)
python -m vllm.entrypoints.openai.api_server \
  --model stepfun-ai/Step-3.7-Flash \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --port 8000

vLLM provides better throughput for concurrent requests and shards the model across the GPUs for you via tensor parallelism.
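
The throughput advantage only shows up when the client actually sends requests in parallel. A minimal sketch of that pattern with a thread pool; `fake_send` is a stand-in so the example runs without a server, and in practice you would swap in a function that posts to http://localhost:8000/v1/chat/completions.

```python
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(prompts, send, max_workers=8):
    """Send prompts in parallel and return replies in the original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))

def fake_send(prompt):
    """Stand-in for a real HTTP call so the sketch runs without a server."""
    return f"echo: {prompt}"

replies = run_concurrent(["a", "b", "c"], fake_send, max_workers=2)
print(replies)  # ['echo: a', 'echo: b', 'echo: c']
```

Threads are sufficient here because the work is waiting on the network; the server does the batching.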

Setup with LM Studio (GUI, easiest)

Download LM Studio
Search for “Step 3.7 Flash” in the model browser
Select the Q4_K_M quantization
Click Download
Load and chat

LM Studio handles all configuration automatically and provides a nice interface for testing.

Platform-specific notes

Mac Studio M4 Ultra (128-192GB)

# Metal acceleration (automatic on Mac)
./llama-server -m Step-3.7-Flash-Q4_K_M.gguf -c 32768 -ngl 99

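Expected performance: 15-30 t/s at Q4_K_M with 32-64K context. On the 128GB configuration, the ~100GB of weights leave ~28GB for context and the OS; macOS limits how much unified memory the GPU may use by default, so this fit may require raising that limit. The 192GB configuration has room for Q6_K quality.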

NVIDIA RTX Spark (fall 2026)

RTX Spark with 128GB unified memory + Blackwell GPU is purpose-built for this. NVIDIA has demonstrated 2× throughput improvements via multi-token prediction on llama.cpp. Expected: 25-40 t/s at Q4_K_M.

AMD high-RAM system (128-256GB)

CPU-only inference works but is slow (~5-10 t/s). If you have an AMD system with 128GB+ system RAM and no dedicated GPU, it runs — just not at GPU speeds.

Multi-GPU (2× RTX 5090 or similar)

2× RTX 5090 = 64GB total VRAM, which is not enough for the full model. Use llama.cpp’s -ngl to offload only as many layers as fit in VRAM (--tensor-split controls how those layers are divided between the two cards) and keep the rest in system RAM. This is not ideal; wait for RTX Spark if this is your setup.
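
Partial offload comes down to choosing how many layers go to the GPU, which llama.cpp sets with -ngl. The sketch below gives a rough starting value, assuming layers are about equal in size. The layer count (48) and the 8GB reserve are HYPOTHETICAL; llama.cpp reports the real layer count when it loads the model.

```python
def gpu_layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=8.0):
    """Estimate a value for -ngl: how many layers fit in the available VRAM.

    Assumes layers are roughly equal in size and that reserve_gb of VRAM is
    kept free for the KV cache and compute buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))

# n_layers=48 is a HYPOTHETICAL value; check the count llama.cpp prints at load.
ngl = gpu_layers_that_fit(model_gb=100, n_layers=48, vram_gb=64)
print(f"try -ngl {ngl}")  # try -ngl 26
```

Start from the estimate and lower it if the server runs out of VRAM at your chosen context size.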

Expected performance by hardware

Hardware	Quantization	Context	Speed (est.)	Practical?
Mac Studio M4 Ultra 192GB	Q6_K	64K	20-35 t/s	✅ Great
Mac Studio M4 Ultra 128GB	Q4_K_M	32-64K	15-30 t/s	✅ Good
RTX Spark 128GB (fall)	Q4_K_M	64K	25-40 t/s	✅ Great (predicted)
8× A100 80GB	FP16	256K	50-100 t/s	✅ Best
2× A100 80GB	Q6_K	128K	35-60 t/s	✅ Great
AMD 128GB RAM (CPU)	Q4_K_M	32K	5-10 t/s	⚠️ Slow

Multimodal locally?

Step 3.7 Flash supports image and video input. For local multimodal:

llama.cpp — Multimodal support for vision-language models is available but may require specific builds
vLLM — Vision-language model serving is supported
Check the Step 3.7 Flash repo README for the latest multimodal local inference instructions

Note: The 3 reasoning tiers (Low/Medium/High) and Advisor Mode are API features — they may not be available in local deployments. Standard inference works for all tasks.
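
If your local build does accept images, OpenAI-compatible servers generally take them as a base64 data URL inside the message content. The sketch below only builds such a payload; whether a given local server accepts it for Step 3.7 Flash is an assumption to verify against the repo README, and `image_message` is an illustrative helper name.

```python
import base64
import json

def image_message(prompt, image_bytes, mime="image/png"):
    """Build one user message carrying text plus an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }

# Placeholder bytes stand in for a real image read with open(path, "rb").
payload = {"model": "Step-3.7-Flash", "messages": [image_message("Describe this.", b"\x89PNG")]}
print(json.dumps(payload)[:80])
```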

Cost comparison: local vs API

Usage	API (OpenRouter)	Local (Mac Studio 128GB)
Hardware cost	$0	$4,000 (one-time)
Per-hour cost	~$0.08	~$0.02 (electricity)
Monthly (4hr/day)	~$10	~$2
Break-even	n/a	~500+ months

At Step 3.7 Flash’s API price ($0.20/$0.80 per million input/output tokens), the API is already so cheap that self-hosting only makes financial sense for:

24/7 high-volume workloads
Strict privacy requirements (no data leaves your machine)
Offline/air-gapped environments

For most users, the API is cheaper and simpler.
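
The break-even row in the table comes from dividing the hardware cost by the monthly saving. A small sketch using the table's per-hour figures, assuming a 30-day month and that the API cost scales with hours of use:

```python
def break_even_months(hardware_cost, api_per_hour, local_per_hour, hours_per_day, days_per_month=30):
    """Months until hardware cost is recovered by cheaper per-hour running cost."""
    monthly_saving = (api_per_hour - local_per_hour) * hours_per_day * days_per_month
    if monthly_saving <= 0:
        return float("inf")  # local never pays for itself
    return hardware_cost / monthly_saving

# Figures from the cost table: $4,000 hardware, ~$0.08/h API, ~$0.02/h electricity.
light = break_even_months(4000, 0.08, 0.02, hours_per_day=4)
heavy = break_even_months(4000, 0.08, 0.02, hours_per_day=24)
print(f"4 h/day: ~{light:.0f} months, 24 h/day: ~{heavy:.0f} months")
```

Even at 24 hours a day the payback under these figures is measured in years, which is why privacy and offline use are usually the stronger reasons to self-host.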

FAQ

Can I run Step 3.7 Flash on a laptop?

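Only if it has 128GB+ of unified memory or RAM. A few top-spec laptops (for example, 128GB MacBook Pro configurations) reach that, and future RTX Spark laptops may too; most laptops stop at 64GB or less, which is not enough for Q4_K_M (~100GB). A 16GB laptop cannot run this model at any quantization.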

How does it compare to running Qwen 3.6 27B locally?

Qwen 3.6 27B needs only ~16GB at Q4 and fits on a single 24GB GPU or a 32GB machine. Step 3.7 Flash needs 100GB+ and requires high-end hardware. Qwen loads faster and is far easier to fit, while Step has more knowledge (198B total params) and, once loaded, computes with only ~11B parameters per token against Qwen’s dense 27B. For most local use cases, Qwen 3.6 27B is more practical.

Is the GGUF version available?

Yes. StepFun published official GGUF files at stepfun-ai/Step-3.7-Flash-GGUF on Hugging Face. Multiple quantization levels available (Q3_K through Q8).

Can I use the 3 reasoning tiers locally?

The Low/Medium/High reasoning tiers are an API feature that routes to different inference configurations. Locally, you get standard inference which is roughly equivalent to “Medium.” For explicit reasoning control, use the API via OpenRouter.

Is local Step 3.7 Flash faster than the API?

No. The API runs on optimized NVIDIA hardware at 400 t/s. Local deployment on consumer hardware will be 15-40 t/s depending on your setup, and even the multi-GPU server estimates above (50-100 t/s) fall short of the API.
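
To see what that gap means in practice, convert speed into waiting time. The speeds below are this guide's estimates, and the 2,000-token reply length is an arbitrary example:

```python
def seconds_to_generate(n_tokens, tokens_per_second):
    """Wall-clock time to stream n_tokens at a steady decode speed."""
    return n_tokens / tokens_per_second

# Speeds are this guide's estimates; 2,000 tokens is an arbitrary reply length.
for label, tps in {"API": 400, "local (fast)": 40, "local (slow)": 15}.items():
    print(f"{label}: {seconds_to_generate(2000, tps):.0f} s for 2,000 tokens")
```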

What about the NVFP4 format?

StepFun also published NVFP4 quantized versions for NVIDIA GPUs. These use NVIDIA’s 4-bit floating-point format, which Blackwell-generation Tensor Cores accelerate natively. Use these if you have such a GPU, for slightly better performance than generic GGUF Q4.