
How to Run MiMo-V2-Flash Locally — Xiaomi's Open-Source Model on Your Hardware


MiMo-V2-Flash is Xiaomi’s open-source AI model — 309B total parameters, 15B active, Apache 2.0 licensed. It scores 73.4% on SWE-bench and runs at 150 tokens per second via API. Here’s how to run it on your own hardware.

Hardware requirements

MiMo-V2-Flash uses a Mixture-of-Experts architecture: 309B total parameters but only 15B active per token. This means it’s much lighter than its total parameter count suggests.
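To see why only the active parameters matter per token, here is a toy sketch of top-k expert routing, the core MoE trick. The sizes and gating scheme are illustrative only, not MiMo's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, d = 8, 16   # toy sizes, not MiMo's real config
top_k = 2              # experts activated per token

experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))

def moe_forward(x):
    """Route a token through only top_k of n_experts."""
    scores = x @ gate_w
    chosen = np.argsort(scores)[-top_k:]   # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()               # softmax over the chosen experts
    # Only top_k expert matrices are ever multiplied — the rest stay idle.
    return sum(w * (x @ experts[i]) for i, w in zip(chosen, weights))

token = rng.standard_normal(d)
out = moe_forward(token)
print(out.shape)  # (16,) — only 2 of 8 experts did any work
```

Every token touches the small gate plus two experts, so compute and activation memory scale with the 15B active parameters, not the 309B total.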

| Quantization | VRAM/RAM needed | Speed  | Quality               |
|--------------|-----------------|--------|-----------------------|
| Q4_K_M       | ~12-16GB        | Fast   | Good for most tasks   |
| Q6_K         | ~18-22GB        | Medium | Better quality        |
| Q8_0         | ~24-30GB        | Slower | Near-original quality |

Minimum: 16GB VRAM (RTX 4080) or 16GB unified memory (M-series Mac)
Recommended: 24GB VRAM (RTX 4090) or a 32GB Mac for comfortable operation

Run with Ollama

# Install Ollama if you haven't
curl -fsSL https://ollama.com/install.sh | sh

# Run MiMo-V2-Flash
ollama run mimo-v2-flash

Once running, it exposes an OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "Write a Python function to parse CSV files"}]
)
print(response.choices[0].message.content)

Run with llama.cpp

# Download from HuggingFace
huggingface-cli download xiaomi/MiMo-V2-Flash-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models

# Start server
llama-server \
  --model ./models/MiMo-V2-Flash-Q4_K_M.gguf \
  --ctx-size 16384 \
  --threads 8 \
  --port 8080
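llama-server also speaks the OpenAI chat-completions protocol, so you can query it with nothing but the standard library. A minimal sketch, assuming the server above is running on port 8080 (the model name is whatever you loaded; llama-server accepts any value here):

```python
import json
import urllib.request

def build_payload(prompt, model="MiMo-V2-Flash-Q4_K_M", max_tokens=256):
    """Build an OpenAI-style payload for /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt, host="http://localhost:8080"):
    """Send one chat request to a running llama-server instance."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Write a Python function to parse CSV files"))
```

Because the endpoint shape matches OpenAI's, the earlier `openai`-client snippet also works here — just point `base_url` at `http://localhost:8080/v1`.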

Why self-host Flash?

MiMo-V2-Flash via API costs $0.10/M input tokens — already dirt cheap. So why self-host?

  1. Privacy. Your code and data never leave your machine. No data sent to Xiaomi’s servers.
  2. Zero cost at volume. If you’re running thousands of requests per day, self-hosting is free after the hardware investment.
  3. No rate limits. Run as many requests as your hardware can handle.
  4. Offline access. Works without internet. Great for air-gapped environments or travel.
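The break-even math is easy to run for your own workload. All numbers below except the $0.10/M API price are assumptions — plug in your real volume, prompt sizes, and hardware price, and note this ignores output-token pricing and electricity:

```python
# Rough self-hosting break-even sketch (hypothetical volume and hardware cost).
api_price_per_m = 0.10        # $/M input tokens, Flash's API price
tokens_per_request = 2_000    # assumed average prompt size
requests_per_day = 50_000     # assumed volume

daily_api_cost = requests_per_day * tokens_per_request / 1e6 * api_price_per_m
hardware_cost = 1_600         # assumed, e.g. a single RTX 4090

days_to_break_even = hardware_cost / daily_api_cost
print(f"${daily_api_cost:.2f}/day via API; "
      f"hardware pays off in ~{days_to_break_even:.0f} days")
# → $10.00/day via API; hardware pays off in ~160 days
```

At modest volumes the API stays cheaper for years, which is why privacy, rate limits, and offline access are usually the stronger arguments than raw cost.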

Flash vs other self-hosted models

|               | MiMo-V2-Flash    | Qwen3.5-9B           | DeepSeek Coder V2 Lite |
|---------------|------------------|----------------------|------------------------|
| Active params | 15B              | 9B                   | 14B                    |
| VRAM needed   | ~12-16GB         | ~8GB                 | ~10-12GB               |
| SWE-bench     | 73.4%            | N/A (general model)  | N/A                    |
| Speed         | Very fast        | Fast                 | Fast                   |
| Specialty     | General + coding | General purpose      | Coding                 |

Flash needs slightly more VRAM than the alternatives but offers the best coding performance in this weight class. If you have 16GB+ VRAM, it’s the strongest option. If you’re limited to 8-12GB, Qwen3.5-9B or DeepSeek Coder V2 Lite are better fits.