# How to Run MiMo-V2-Flash Locally — Xiaomi's Open-Source Model on Your Hardware
MiMo-V2-Flash is Xiaomi’s open-source AI model — 309B total parameters, 15B active, Apache 2.0 licensed. It scores 73.4% on SWE-bench and runs at 150 tokens per second via API. Here’s how to run it on your own hardware.
## Hardware requirements
MiMo-V2-Flash uses a Mixture-of-Experts (MoE) architecture: 309B total parameters, but only 15B are active for any given token. Each token passes through only a small fraction of the weights, so inference is far faster than the total parameter count suggests.
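The MoE idea can be sketched in a few lines: a router scores every expert for each token, and only the top-k highest-scoring experts actually run. The expert count and top-k below are illustrative, not MiMo-V2-Flash's real configuration.

```python
import math
import random

random.seed(0)

N_EXPERTS = 64   # illustrative; not MiMo-V2-Flash's real expert count
TOP_K = 4        # experts that actually run per token

# Toy experts: each just scales its input by a fixed random factor.
experts = [lambda x, f=random.uniform(0.5, 2.0): f * x for _ in range(N_EXPERTS)]

calls = []  # track which experts actually ran

def moe_forward(x, router_scores):
    # Pick the TOP_K highest-scoring experts for this token...
    top = sorted(range(N_EXPERTS), key=lambda i: router_scores[i])[-TOP_K:]
    # ...softmax their scores into mixing weights...
    exps = [math.exp(router_scores[i]) for i in top]
    weights = [e / sum(exps) for e in exps]
    # ...and run ONLY those experts: this is the "15B active of 309B total" idea.
    calls.extend(top)
    return sum(w * experts[i](x) for w, i in zip(weights, top))

scores = [random.gauss(0, 1) for _ in range(N_EXPERTS)]
y = moe_forward(1.0, scores)
print(len(set(calls)), "of", N_EXPERTS, "experts ran")  # 4 of 64
```

The compute saving is exactly this: per token, 60 of the 64 toy experts are never touched.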
| Quantization | VRAM/RAM needed | Speed | Quality |
|---|---|---|---|
| Q4_K_M | ~12-16GB | Fast | Good for most tasks |
| Q6_K | ~18-22GB | Medium | Better quality |
| Q8_0 | ~24-30GB | Slower | Near-original quality |
- Minimum: 16GB VRAM (RTX 4080) or 16GB unified memory (M-series Mac)
- Recommended: 24GB VRAM (RTX 4090) or 32GB unified memory (Mac) for comfortable operation
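As a rough sanity check on the table above, weight memory scales with parameter count times bits per weight. The bits-per-weight figures below are typical values for these GGUF quant types (an assumption here, not measured for this model), and the calculation uses the 15B active-parameter count as the table implicitly does; KV cache and runtime overhead sit on top of the baseline, and keeping all 309B parameters resident would need roughly 20x more.

```python
# Typical bits-per-weight for common GGUF quant types (approximate).
BPW = {"Q4_K_M": 4.85, "Q6_K": 6.57, "Q8_0": 8.5}

def weight_gb(params_billion, quant):
    """Approximate weight-file size in GB: params * bits-per-weight / 8."""
    return params_billion * 1e9 * BPW[quant] / 8 / 1e9

for quant in BPW:
    print(f"{quant}: ~{weight_gb(15, quant):.1f} GB for 15B active parameters")
```

Q4_K_M lands around 9 GB of weights, which is consistent with the table's ~12-16GB once context and overhead are included.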
## Run with Ollama
```bash
# Install Ollama if you haven't already
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run MiMo-V2-Flash
ollama run mimo-v2-flash
```
Once the model is running, Ollama exposes an OpenAI-compatible API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "Write a Python function to parse CSV files"}],
)
print(response.choices[0].message.content)
```
## Run with llama.cpp
```bash
# Download the Q4_K_M quantization from Hugging Face
huggingface-cli download xiaomi/MiMo-V2-Flash-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models

# Start the server
llama-server \
  --model ./models/MiMo-V2-Flash-Q4_K_M.gguf \
  --ctx-size 16384 \
  --threads 8 \
  --port 8080
```
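The `--ctx-size` flag trades memory for context length: the KV cache grows linearly with it. A rough sizing sketch, where the layer, head, and dimension numbers are placeholders, since MiMo-V2-Flash's architecture details aren't given here:

```python
def kv_cache_gb(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GB: two tensors (K and V) per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# Placeholder architecture numbers -- NOT MiMo-V2-Flash's real config.
cache = kv_cache_gb(16384, n_layers=48, n_kv_heads=8, head_dim=128)
print(f"~{cache:.2f} GB of KV cache at 16384 context")
```

If you're short on memory, halving `--ctx-size` halves this cache; llama.cpp can also quantize the KV cache to shrink it further.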
## Why self-host Flash?
MiMo-V2-Flash via API costs $0.10/M input tokens — already dirt cheap. So why self-host?
- Privacy. Your code and data never leave your machine. No data sent to Xiaomi’s servers.
- Zero cost at volume. If you’re running thousands of requests per day, self-hosting is free after the hardware investment.
- No rate limits. Run as many requests as your hardware can handle.
- Offline access. Works without internet. Great for air-gapped environments or travel.
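The "zero cost at volume" point can be put in numbers. A sketch counting input tokens only, at the quoted $0.10/M; the hardware price and daily volumes below are assumptions for illustration:

```python
def break_even_days(hardware_cost_usd, tokens_per_day, price_per_m=0.10):
    """Days until self-hosting pays off versus per-token API pricing."""
    daily_api_cost = tokens_per_day / 1e6 * price_per_m
    return hardware_cost_usd / daily_api_cost

# Assumed GPU price; actual RTX 4090 pricing varies.
GPU_COST = 1600
for volume in (10e6, 100e6):  # tokens per day
    days = break_even_days(GPU_COST, volume)
    print(f"{volume / 1e6:.0f}M tokens/day -> break even in ~{days:.0f} days")
```

At light volume the API stays cheaper for years, which is why privacy, rate limits, and offline access lead the list; heavy sustained volume amortizes a GPU within months.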
## Flash vs other self-hosted models
| | MiMo-V2-Flash | Qwen3.5-9B | DeepSeek Coder V2 Lite |
|---|---|---|---|
| Active params | 15B | 9B | 14B |
| VRAM needed | ~12-16GB | ~8GB | ~10-12GB |
| SWE-bench | 73.4% | N/A (general model) | N/A |
| Speed | Very fast | Fast | Fast |
| Specialty | General + coding | General purpose | Coding |
Flash needs slightly more VRAM than the alternatives but offers the best coding performance in this weight class. If you have 16GB+ VRAM, it’s the strongest option. If you’re limited to 8-12GB, Qwen3.5-9B or DeepSeek Coder V2 Lite are better fits.