How to Run Step 3.7 Flash Locally: Hardware, Setup, and Performance Guide (2026)
Step 3.7 Flash is a 198B-parameter mixture-of-experts (MoE) model that activates only 11B parameters per token. This gives it the knowledge capacity of a 198B model at roughly the per-token compute cost of an 11B model. It is fully open-weight, available on Hugging Face, and runs locally on consumer hardware, provided you have enough RAM.
This guide covers hardware requirements, quantization options, setup with llama.cpp/vLLM, and expected performance.
Hardware requirements
Step 3.7 Flash has 198B total parameters but only 11B activate per token. Memory requirements are set by the total count, not the active count: with MoE, all experts must be resident in memory, even though only a few activate per token.
| Quantization | Memory needed | Hardware options | Speed (est.) |
|---|---|---|---|
| FP16 | ~400GB | Multi-GPU server (6-8× A100 80GB) | 50-100 t/s |
| Q8 | ~200GB | 3× A100 80GB, Mac Studio 192GB (tight) | 30-60 t/s |
| Q6_K | ~150GB | Mac Studio 192GB, 2× A100 80GB (tight) | 25-50 t/s |
| Q4_K_M | ~100GB | RTX Spark 128GB, Mac Studio 128GB | 15-30 t/s |
| Q3_K | ~75GB | Mac Studio 128GB, high-RAM AMD | 10-20 t/s |
The sweet spot for consumer hardware is Q4_K_M at ~100GB, which runs on a Mac Studio M4 Ultra 128GB or the upcoming NVIDIA RTX Spark (128GB unified memory, fall 2026).
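The memory column above follows from simple arithmetic on the total parameter count. Here is a minimal sketch, assuming nominal bit widths (16, 8, 6, 4, and 3 bits per weight). Real GGUF files are usually somewhat larger because K-quants store extra scale data, and the KV cache and runtime buffers come on top. The function and dictionary names are illustrative.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 bits per byte.
# The bit widths are nominal assumptions; actual GGUF files typically come out larger.
NOMINAL_BITS = {"FP16": 16, "Q8": 8, "Q6_K": 6, "Q4_K_M": 4, "Q3_K": 3}


def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, bits in NOMINAL_BITS.items():
        print(f"{name:7s} ~{weight_memory_gb(198, bits):.0f} GB")
```

Treat the result as a lower bound when deciding whether a quantization fits your machine.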
Why MoE models are unique for local deployment
Unlike dense models where every parameter participates in every forward pass, MoE models activate a subset of experts per token. Step 3.7 Flash activates ~11B of its 198B per token. This means:
- Memory: You need 100-200GB to store all experts (even dormant ones)
- Compute: Only 11B worth of computation per token (fast inference)
- Speed: Much faster than a 198B dense model, similar to an 11-14B model
This is why Step 3.7 Flash generates at 400 t/s on the API: the active compute is tiny despite the massive total parameter count.
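The split between memory and compute can be put into numbers. This is a rough sketch, assuming the common rule of thumb of about 2 FLOPs per active parameter per generated token (an approximation that ignores attention over the context). The function names are illustrative.

```python
# Compare per-token compute for a MoE model against a dense model of the same total size.
# Assumption: ~2 FLOPs per *active* parameter per token (rule of thumb, not a measurement).
TOTAL_PARAMS_B = 198   # total parameters in billions (all must sit in memory)
ACTIVE_PARAMS_B = 11   # parameters used per token, in billions


def gflops_per_token(active_params_billion: float) -> float:
    """Billions of active parameters x 2 FLOPs each = GFLOPs per token."""
    return 2 * active_params_billion


def compute_savings(total_b: float, active_b: float) -> float:
    """How many times less compute the MoE needs than a dense model of equal size."""
    return gflops_per_token(total_b) / gflops_per_token(active_b)


if __name__ == "__main__":
    print(f"active fraction: {ACTIVE_PARAMS_B / TOTAL_PARAMS_B:.1%}")
    print(f"MoE:   ~{gflops_per_token(ACTIVE_PARAMS_B):.0f} GFLOPs/token")
    print(f"dense: ~{gflops_per_token(TOTAL_PARAMS_B):.0f} GFLOPs/token")
    print(f"ratio: {compute_savings(TOTAL_PARAMS_B, ACTIVE_PARAMS_B):.0f}x")
```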
Setup with llama.cpp (recommended for consumer hardware)
Download the GGUF quantization
# Install huggingface-cli if needed
pip install huggingface_hub
# Download Q4_K_M (recommended, ~100GB)
huggingface-cli download stepfun-ai/Step-3.7-Flash-GGUF \
Step-3.7-Flash-Q4_K_M.gguf \
--local-dir ./models/
# Or Q6_K for better quality (~150GB)
huggingface-cli download stepfun-ai/Step-3.7-Flash-GGUF \
Step-3.7-Flash-Q6_K.gguf \
--local-dir ./models/
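Since these files run to 100GB or more, it may be worth checking free disk space before starting a download. This is a small sketch using only the Python standard library; the helper name and the 5GB safety margin are illustrative assumptions.

```python
import shutil


def has_room(path: str, required_gb: float, margin_gb: float = 5.0) -> bool:
    """True if the filesystem holding `path` has required_gb + margin_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb + margin_gb


if __name__ == "__main__":
    # ~100 GB is this guide's estimate for Q4_K_M; "." stands in for your models directory,
    # which must already exist for shutil.disk_usage to inspect it.
    print("enough space for Q4_K_M:", has_room(".", 100))
```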
Run the server
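# Clone and build llama.cpp (if not installed)
# Recent llama.cpp versions build with CMake; add -DGGML_CUDA=ON to the first cmake call for NVIDIA GPUs
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# Start the server (the binary lands in build/bin/)
# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access
./build/bin/llama-server \
  -m ../models/Step-3.7-Flash-Q4_K_M.gguf \
  -c 65536 \
  -ngl 99 \
  --port 8080 \
  --host 0.0.0.0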
Options:
- -c 65536: 64K context (increase if you have spare RAM, up to 256K)
- -ngl 99: Offload all layers to GPU (for CUDA/Metal)
- --port 8080: API endpoint port
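Loading a ~100GB model can take a while, so a readiness check is handy before pointing tools at the server. This is a minimal sketch, assuming your llama-server build exposes its `/health` endpoint on the port chosen above. The helper name is illustrative.

```python
import urllib.error
import urllib.request


def server_is_ready(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status such as 503 while the model loads
        return False


if __name__ == "__main__":
    print("server ready:", server_is_ready())
```

Calling this in a loop with a short sleep gives a simple "wait until loaded" step for scripts.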
Connect your tools
Once running, the server exposes an OpenAI-compatible API at http://localhost:8080/v1:
# Aider
export OPENAI_API_BASE="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed"
aider --model openai/Step-3.7-Flash-Q4_K_M
# curl test
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Step-3.7-Flash", "messages": [{"role": "user", "content": "Hello"}]}'
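The same request can be sent from Python without extra dependencies. This is a sketch using the standard library, assuming the server returns the usual OpenAI chat-completions response shape (`choices[0].message.content`). The function names are illustrative.

```python
import json
import urllib.request


def build_chat_payload(prompt: str, model: str = "Step-3.7-Flash") -> dict:
    """Build the JSON body for a single-turn chat request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(prompt: str, base_url: str = "http://localhost:8080/v1") -> str:
    """POST the prompt to the local server and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Hello"))  # requires the server from the previous step to be running
```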
Setup with vLLM (recommended for multi-GPU servers)
pip install vllm
# Run with tensor parallelism across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
--model stepfun-ai/Step-3.7-Flash \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--port 8000
vLLM provides better throughput for concurrent requests and handles multi-GPU routing automatically.
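Before choosing `--tensor-parallel-size`, it can help to check that each GPU's share of the weights fits. This is a rough sketch, assuming the weights split evenly across GPUs and a fixed fraction of each card is left for the KV cache and activations. The 0.8 default and the function name are illustrative assumptions, not vLLM settings.

```python
def weights_fit(total_weight_gb: float, num_gpus: int, vram_per_gpu_gb: float,
                weight_budget_fraction: float = 0.8) -> bool:
    """True if an even shard of the weights fits in the per-GPU weight budget."""
    per_gpu_gb = total_weight_gb / num_gpus
    return per_gpu_gb <= vram_per_gpu_gb * weight_budget_fraction


if __name__ == "__main__":
    # ~400 GB is this guide's FP16 estimate; 80 GB matches an A100 80GB
    for gpus in (4, 8):
        print(f"{gpus} GPUs:", weights_fit(400, gpus, 80))
```

With the guide's ~400GB FP16 estimate, four 80GB cards come out short. If the checkpoint is stored in FP16/BF16, the 4-GPU command above would presumably need more GPUs or a lower-precision checkpoint.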
Setup with LM Studio (GUI, easiest)
- Download LM Studio
- Search for "Step 3.7 Flash" in the model browser
- Select the Q4_K_M quantization
- Click Download
- Load and chat
LM Studio handles all configuration automatically and provides a nice interface for testing.
Platform-specific notes
Mac Studio M4 Ultra (128-192GB)
# Metal acceleration (automatic on Mac)
./llama-server -m Step-3.7-Flash-Q4_K_M.gguf -c 32768 -ngl 99
Expected performance: 15-30 t/s at Q4_K_M with 32-64K context. On the 128GB configuration, the ~100GB model leaves ~28GB spare for context and the OS. The 192GB configuration gives you room for Q6_K quality.
NVIDIA RTX Spark (fall 2026)
RTX Spark with 128GB unified memory + Blackwell GPU is purpose-built for this. NVIDIA has demonstrated 2× throughput improvements via multi-token prediction on llama.cpp. Expected: 25-40 t/s at Q4_K_M.
AMD high-RAM system (128-256GB)
CPU-only inference works but is slow (~5-10 t/s). If you have an AMD system with 128GB+ system RAM and no dedicated GPU, it runs, just not at GPU speeds.
Multi-GPU (2× RTX 5090 or similar)
2× RTX 5090 = 64GB total VRAM, which is not enough for the full model. Use llama.cpp's -ngl to offload only as many layers as fit in VRAM (with --tensor-split to divide them between the two GPUs) and keep the rest in system RAM. Not ideal: wait for RTX Spark if this is your setup.
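A starting value for `-ngl` can be estimated from the model size and available VRAM. This is a rough sketch, assuming all layers are the same size. The layer count of 50 is a placeholder, not the model's real value, so substitute the number llama.cpp reports when it loads the file. The function name and the 4GB reserve are likewise illustrative.

```python
import math


def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 4.0) -> int:
    """Estimate how many equal-sized layers fit in VRAM after reserving headroom."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # 100 GB model (Q4_K_M estimate), 64 GB VRAM (2x RTX 5090), placeholder layer count
    print("suggested -ngl:", layers_that_fit(100, 50, 64))
```

Start a little below the estimate and raise it until loading fails or memory gets tight, since context length also consumes VRAM.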
Expected performance by hardware
| Hardware | Quantization | Context | Speed (est.) | Practical? |
|---|---|---|---|---|
| Mac Studio M4 Ultra 192GB | Q6_K | 64K | 20-35 t/s | ✅ Great |
| Mac Studio M4 Ultra 128GB | Q4_K_M | 32-64K | 15-30 t/s | ✅ Good |
| RTX Spark 128GB (fall) | Q4_K_M | 64K | 25-40 t/s | ✅ Great (predicted) |
| 8× A100 80GB | FP16 | 256K | 50-100 t/s | ✅ Best |
| 2× A100 80GB | Q6_K | 128K | 35-60 t/s | ✅ Great |
| AMD 128GB RAM (CPU) | Q4_K_M | 32K | 5-10 t/s | ⚠️ Slow |
Multimodal locally?
Step 3.7 Flash supports image and video input. For local multimodal:
- llama.cpp β Multimodal support for vision-language models is available but may require specific builds
- vLLM β Vision-language model serving is supported
- Check the Step 3.7 Flash repo README for the latest multimodal local inference instructions
Note: The 3 reasoning tiers (Low/Medium/High) and Advisor Mode are API features and may not be available in local deployments. Standard inference works for all tasks.
Cost comparison: local vs API
| Usage | API (OpenRouter) | Local (Mac Studio 128GB) |
|---|---|---|
| Hardware cost | $0 | $4,000 (one-time) |
| Per-hour cost | ~$0.08 | ~$0.02 (electricity) |
| Monthly (4hr/day) | ~$10 | ~$2 |
| Break-even | n/a | ~400+ months |
At Step 3.7 Flash's API price ($0.20/$0.80), the API is already so cheap that self-hosting only makes financial sense for:
- 24/7 high-volume workloads
- Strict privacy requirements (no data leaves your machine)
- Offline/air-gapped environments
For most users, the API is cheaper and simpler.
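The break-even figure can be reproduced from the table's rounded numbers. This is a minimal sketch: the function name is illustrative, and the inputs are the table's estimates rather than measured costs.

```python
from typing import Optional


def break_even_months(hardware_cost: float, api_monthly: float,
                      local_monthly: float) -> Optional[float]:
    """Months until the hardware pays for itself; None if local never becomes cheaper."""
    monthly_saving = api_monthly - local_monthly
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # 4 hours/day: ~$10/month API vs ~$2/month electricity, $4,000 hardware
    print("4 hr/day:", break_even_months(4000, 10, 2), "months")
    # 24/7 use at the table's hourly rates (720 hours/month)
    print("24/7:", break_even_months(4000, 0.08 * 720, 0.02 * 720), "months")
```

Heavier usage shortens the payback period considerably, which is why 24/7 workloads are the main financial case for self-hosting.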
FAQ
Can I run Step 3.7 Flash on a laptop?
Only if it has 128GB+ unified memory (future RTX Spark laptops). Most current laptops max out at 64GB RAM or less, which is not enough for Q4_K_M (~100GB). A 16GB laptop cannot run this model at any quantization.
How does it compare to running Qwen 3.6 27B locally?
Qwen 3.6 27B needs only ~16GB at Q4 and runs on basically anything. Step 3.7 Flash needs 100GB+ and requires high-end hardware. Qwen is easier to run locally (far less memory to load), but Step has more knowledge capacity (198B total parameters). For most local use cases, Qwen 3.6 27B is more practical.
Is the GGUF version available?
Yes. StepFun published official GGUF files at stepfun-ai/Step-3.7-Flash-GGUF on Hugging Face. Multiple quantization levels available (Q3_K through Q8).
Can I use the 3 reasoning tiers locally?
The Low/Medium/High reasoning tiers are an API feature that routes to different inference configurations. Locally, you get standard inference, which is roughly equivalent to "Medium." For explicit reasoning control, use the API via OpenRouter.
Is local Step 3.7 Flash faster than the API?
No. The API runs on optimized NVIDIA hardware at 400 t/s. Local deployment on consumer hardware lands at 15-40 t/s depending on your setup, and even the multi-GPU server estimates in this guide top out around 100 t/s.
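To see what these rates mean in practice, convert them into wait time for a response of a given length. This is a trivial sketch with an illustrative function name; the rates are this guide's estimates.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate (ignores prompt processing)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    for label, rate in (("API", 400), ("local, fast", 40), ("local, slow", 15)):
        print(f"{label}: {seconds_to_generate(1000, rate):.1f} s per 1,000 tokens")
```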
What about the NVFP4 format?
StepFun also published NVFP4 quantized versions for NVIDIA GPUs. These use NVIDIA's 4-bit format optimized for Tensor Cores. Use these if you have NVIDIA GPUs for slightly better performance than generic GGUF Q4.