How to Run MiniMax M3 Locally: Hardware, Setup, and Deployment Guide (2026)
MiniMax M3 is open-weight β the first frontier model combining 59% SWE-bench Pro, 1M context, and native multimodal to be fully downloadable. Weights and a technical report are expected within 10 days of the June 1 launch (around June 10-11).
This guide covers everything you need to prepare: hardware requirements, quantization options, inference frameworks, and whether self-hosting makes financial sense for your workload. When the weights drop, you will be ready to deploy immediately.
Hardware requirements (estimated)
MiniMax has not published the exact parameter count for M3. Based on the MSA architecture and performance characteristics, the community estimates it is in the 200-400B parameter range. Here are the hardware tiers:
Full precision (FP16/BF16)
| Setup | VRAM needed | Hardware | Cost |
|---|---|---|---|
| Full model (estimated) | 400-800GB | 4-8Γ A100 80GB or 4-8Γ H100 | $30K-80K |
| Multi-node | Distributed | 2+ servers with NVLink | Enterprise |
Quantized (practical for most users)
| Quantization | Memory needed | Hardware options | Quality loss |
|---|---|---|---|
| Q8 (8-bit) | ~200-400GB | 2-4Γ A100 80GB, Mac Studio 192GB | Minimal |
| Q6_K | ~150-300GB | 2-3Γ A100, Mac Studio 192GB | Very low |
| Q4_K_M | ~100-200GB | 1-2Γ A100, Mac Studio 128GB | Low |
| Q3_K | ~75-150GB | 1Γ A100 80GB, Mac Pro 192GB | Moderate |
Consumer hardware (when GGUF drops)
For the quantized GGUF versions (expected shortly after weight release):
- Mac Studio M4 Ultra 192GB β Should run Q4_K_M comfortably. Best consumer option.
- Mac Studio M4 Ultra 128GB β May run Q3_K or smaller quantizations.
- AMD system with 192GB RAM β CPU inference possible but slow (~5-10 t/s).
- NVIDIA RTX 6000 Ada (48GB) β Too small for full model, may work for aggressive quantizations.
Note: These are estimates based on the expected model size. Actual requirements will be confirmed when weights release.
Inference frameworks
M3 will support these frameworks from day one (confirmed by MiniMax):
vLLM (recommended for production)
pip install vllm
# When weights are available:
python -m vllm.entrypoints.openai.api_server \
--model minimax/minimax-m3 \
--tensor-parallel-size 4 \
--max-model-len 1048576 \
--port 8000
vLLM provides the best throughput for serving multiple concurrent requests. Use tensor parallelism across multiple GPUs.
SGLang (best for agentic workloads)
pip install sglang
python -m sglang.launch_server \
--model-path minimax/minimax-m3 \
--tp 4 \
--context-length 1048576
SGLang excels at multi-turn conversations and tool-calling patterns common in agentic coding.
llama.cpp (best for consumer hardware)
# Download GGUF quantization (when available)
huggingface-cli download minimax/minimax-m3-GGUF minimax-m3-Q4_K_M.gguf
# Run server
./llama-server \
-m minimax-m3-Q4_K_M.gguf \
-c 65536 \
-ngl 99 \
--port 8080
llama.cpp is the path for Mac Studio and consumer GPU deployments. Expect 10-30 t/s on Apple Silicon depending on quantization and context length.
Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"minimax/minimax-m3",
device_map="auto",
torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("minimax/minimax-m3")
Self-hosting vs API: cost comparison
When does self-hosting M3 make financial sense?
| Monthly API spend | Self-host hardware | Break-even | Recommendation |
|---|---|---|---|
| <$100/mo | Any | Never | Use API |
| $100-500/mo | Mac Studio 192GB ($6K) | 12-60 months | Probably API |
| $500-2000/mo | 2Γ A100 cloud ($3K/mo) | Immediately | Consider self-host |
| >$2000/mo | Dedicated server ($5-10K/mo) | Immediately | Self-host |
The API at $0.60/$2.40 per million tokens is cheap enough that most individual developers and small teams should just use it. Self-hosting makes sense for:
- High-volume production workloads (>$500/mo API spend)
- Data privacy requirements (no data leaves your infrastructure)
- Latency-sensitive applications (eliminate network round-trip)
- Fine-tuning needs (customize for your domain)
Preparing now (before weights drop)
While waiting for the weights release (~June 10-11), you can:
- Set up your inference framework β Install vLLM, SGLang, or llama.cpp and test with a smaller model
- Provision hardware β Reserve cloud GPUs or order Apple Silicon hardware
- Test with the API β Use the M3 API to validate your use case works well with M3
- Prepare your pipeline β Build your agent loop, tool definitions, and evaluation suite against the API, then swap to local when weights are available
Expected performance (local vs API)
| Deployment | Throughput | Latency (first token) | Context limit |
|---|---|---|---|
| MiniMax API | High (shared infra) | ~200-500ms | 1M tokens |
| vLLM (4Γ A100) | ~50-100 t/s | ~500ms | 1M tokens |
| SGLang (4Γ A100) | ~40-80 t/s | ~400ms | 1M tokens |
| llama.cpp (Mac Studio 192GB) | ~10-30 t/s | ~1-3s | 64-128K tokens |
| llama.cpp (Mac Studio 128GB) | ~5-15 t/s | ~2-5s | 32-64K tokens |
Note: Local deployments on consumer hardware will likely not support the full 1M context window due to memory constraints. You may be limited to 64-128K tokens locally while the API supports the full 1M.
Docker deployment (production)
FROM vllm/vllm-openai:latest
# When weights are available, download them
RUN huggingface-cli download minimax/minimax-m3
CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
"--model", "minimax/minimax-m3", \
"--tensor-parallel-size", "4", \
"--max-model-len", "524288"]
Integration with coding tools (local)
Once running locally, M3 exposes an OpenAI-compatible endpoint. Point your tools at it:
# Aider
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="not-needed"
aider --model openai/minimax-m3
# Continue (VS Code) - add to config.json
# "apiBase": "http://localhost:8000/v1"
For full tool integration details, see our MiniMax M3 API Setup Guide.
FAQ
When exactly will weights be available?
MiniMax said βwithin 10 daysβ of the June 1 launch. Expected around June 10-11, 2026. A full technical report will accompany the release.
Can I run M3 on a single GPU?
Unlikely at full precision. Even aggressive quantizations (Q3_K) will likely need 75-150GB of memory. A single A100 80GB might work for the smallest quantizations. For practical use, plan for 2+ GPUs or a high-memory Apple Silicon machine.
Will there be GGUF quantizations?
Almost certainly. The community typically produces GGUF quantizations within hours of weight release. MiniMax may also release official quantized versions.
Is self-hosting worth it for coding agents?
If you run agents 8+ hours per day, self-hosting saves money within months. A Mac Studio 192GB ($6K) running M3 locally costs nothing per token after the hardware investment. At $0.60/M input tokens, that is break-even at ~10M tokens/day for about 20 months.
How does local M3 compare to local DeepSeek V4?
DeepSeek V4-Pro is a 1.6T MoE model (49B active) β it requires similar or more hardware than M3. M3βs advantage locally is the MSA architecture which should provide better long-context performance. DeepSeekβs advantage is the larger knowledge base from 1.6T total parameters.
Can I fine-tune M3?
Yes, once weights are released. The open-weight license should permit fine-tuning. Expect community fine-tuning guides within days of the weight release. Hardware requirements for fine-tuning will be significantly higher than inference (typically 2-4Γ the inference memory).