# How to Run MiMo-V2-Flash Locally — Xiaomi's Open-Source Model on Your Hardware
MiMo-V2-Flash is Xiaomi’s open-source AI model — 309B total parameters, 15B active, Apache 2.0 licensed. It scores 73.4% on SWE-bench and runs at 150 tokens per second via API. Here’s how to run it on your own hardware.
## Hardware requirements
MiMo-V2-Flash uses a Mixture-of-Experts (MoE) architecture: 309B total parameters, but only 15B are active for any given token. Each token passes through only a small fraction of the weights, so inference is far faster than the total parameter count suggests.
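The MoE idea can be sketched in a few lines: a router scores every expert for each token, and only the top-k highest-scoring experts actually run. The expert count and top-k below are illustrative, not MiMo-V2-Flash's real configuration.

```python
import math
import random

random.seed(0)

N_EXPERTS = 64   # illustrative; not MiMo-V2-Flash's real expert count
TOP_K = 4        # experts that actually run per token

# Toy experts: each just scales its input by a fixed random factor.
experts = [lambda x, f=random.uniform(0.5, 2.0): f * x for _ in range(N_EXPERTS)]

calls = []  # track which experts actually ran

def moe_forward(x, router_scores):
    # Pick the TOP_K highest-scoring experts for this token...
    top = sorted(range(N_EXPERTS), key=lambda i: router_scores[i])[-TOP_K:]
    # ...softmax their scores into mixing weights...
    exps = [math.exp(router_scores[i]) for i in top]
    weights = [e / sum(exps) for e in exps]
    # ...and run ONLY those experts: this is the "15B active of 309B total" idea.
    calls.extend(top)
    return sum(w * experts[i](x) for w, i in zip(weights, top))

scores = [random.gauss(0, 1) for _ in range(N_EXPERTS)]
y = moe_forward(1.0, scores)
print(len(set(calls)), "of", N_EXPERTS, "experts ran")  # 4 of 64
```

The compute saving is exactly this: per token, 60 of the 64 toy experts are never touched.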
| Quantization | VRAM/RAM needed | Speed | Quality |
|---|---|---|---|
| Q4_K_M | ~12-16GB | Fast | Good for most tasks |
| Q6_K | ~18-22GB | Medium | Better quality |
| Q8_0 | ~24-30GB | Slower | Near-original quality |
- Minimum: 16GB VRAM (RTX 4080) or 16GB unified memory (M-series Mac)
- Recommended: 24GB VRAM (RTX 4090) or 32GB unified memory (Mac) for comfortable operation
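As a rough sanity check on the table above, weight memory scales with parameter count times bits per weight. The bits-per-weight figures below are typical values for these GGUF quant types (an assumption here, not measured for this model), and the calculation uses the 15B active-parameter count as the table implicitly does; KV cache and runtime overhead sit on top of the baseline, and keeping all 309B parameters resident would need roughly 20x more.

```python
# Typical bits-per-weight for common GGUF quant types (approximate).
BPW = {"Q4_K_M": 4.85, "Q6_K": 6.57, "Q8_0": 8.5}

def weight_gb(params_billion, quant):
    """Approximate weight-file size in GB: params * bits-per-weight / 8."""
    return params_billion * 1e9 * BPW[quant] / 8 / 1e9

for quant in BPW:
    print(f"{quant}: ~{weight_gb(15, quant):.1f} GB for 15B active parameters")
```

Q4_K_M lands around 9 GB of weights, which is consistent with the table's ~12-16GB once context and overhead are included.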
## Run with Ollama
```bash
# Install Ollama if you haven't already
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run MiMo-V2-Flash
ollama run mimo-v2-flash
```
Once the model is running, Ollama exposes an OpenAI-compatible API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "Write a Python function to parse CSV files"}],
)
print(response.choices[0].message.content)
```
## Run with llama.cpp
```bash
# Download the Q4_K_M quantization from Hugging Face
huggingface-cli download xiaomi/MiMo-V2-Flash-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models

# Start the server
llama-server \
  --model ./models/MiMo-V2-Flash-Q4_K_M.gguf \
  --ctx-size 16384 \
  --threads 8 \
  --port 8080
```
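The `--ctx-size` flag trades memory for context length: the KV cache grows linearly with it. A rough sizing sketch, where the layer, head, and dimension numbers are placeholders, since MiMo-V2-Flash's architecture details aren't given here:

```python
def kv_cache_gb(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GB: two tensors (K and V) per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# Placeholder architecture numbers -- NOT MiMo-V2-Flash's real config.
cache = kv_cache_gb(16384, n_layers=48, n_kv_heads=8, head_dim=128)
print(f"~{cache:.2f} GB of KV cache at 16384 context")
```

If you're short on memory, halving `--ctx-size` halves this cache; llama.cpp can also quantize the KV cache to shrink it further.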
## Why self-host Flash?
MiMo-V2-Flash via API costs $0.10/M input tokens — already dirt cheap. So why self-host?
- Privacy. Your code and data never leave your machine. No data sent to Xiaomi’s servers.
- Zero cost at volume. If you’re running thousands of requests per day, self-hosting is free after the hardware investment.
- No rate limits. Run as many requests as your hardware can handle.
- Offline access. Works without internet. Great for air-gapped environments or travel.
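The "zero cost at volume" point can be put in numbers. A sketch counting input tokens only, at the quoted $0.10/M; the hardware price and daily volumes below are assumptions for illustration:

```python
def break_even_days(hardware_cost_usd, tokens_per_day, price_per_m=0.10):
    """Days until self-hosting pays off versus per-token API pricing."""
    daily_api_cost = tokens_per_day / 1e6 * price_per_m
    return hardware_cost_usd / daily_api_cost

# Assumed GPU price; actual RTX 4090 pricing varies.
GPU_COST = 1600
for volume in (10e6, 100e6):  # tokens per day
    days = break_even_days(GPU_COST, volume)
    print(f"{volume / 1e6:.0f}M tokens/day -> break even in ~{days:.0f} days")
```

At light volume the API stays cheaper for years, which is why privacy, rate limits, and offline access lead the list; heavy sustained volume amortizes a GPU within months.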
## Flash vs other self-hosted models
| | MiMo-V2-Flash | Qwen3.5-9B | DeepSeek Coder V2 Lite |
|---|---|---|---|
| Active params | 15B | 9B | 14B |
| VRAM needed | ~12-16GB | ~8GB | ~10-12GB |
| SWE-bench | 73.4% | N/A (general model) | N/A |
| Speed | Very fast | Fast | Fast |
| Specialty | General + coding | General purpose | Coding |
Flash needs slightly more VRAM than the alternatives but offers the best coding performance in this weight class. If you have 16GB+ VRAM, it’s the strongest option. If you’re limited to 8-12GB, Qwen3.5-9B or DeepSeek Coder V2 Lite are better fits.