MiniMax M3 was tested on autonomously reproducing an ICLR 2025 Outstanding Paper. It ran for nearly 12 hours and produced 18 commits and 23 experimental figures without human intervention. That is not a benchmark score. It is a demonstration of sustained autonomous execution over a complex, multi-step research task.
This makes M3 one of the most capable models for agentic coding workflows: tasks where an AI agent runs for hours, calling tools, writing code, running tests, and iterating without human input. At $0.60 per million input tokens (with cache reads at $0.12/M), it is also among the lower-cost options for this kind of work, though not the cheapest.
This guide covers how to set up M3 for agentic coding, what makes it different from other models in this space, and where it excels versus where you should still reach for Claude Opus 4.8.
Why M3 works for agentic coding
Three architectural features make M3 particularly suited for long-running autonomous tasks:
1. MSA enables sustained long sessions
MiniMax Sparse Attention (MSA) delivers 15.6× faster decoding at 1M context. In practice, this means an agent session that accumulates hundreds of tool call results slows down far less as context grows. With standard attention, inference gets progressively slower as the context fills up; with MSA, that speed penalty is dramatically reduced.
2. 1M context without truncation
Most agentic coding sessions fail when context overflows. The agent loses track of earlier decisions, repeats work, or contradicts itself. M3's 1M token context window means:
- A full day of agent work fits in context
- No need for summarization or context compression hacks
- The agent remembers decisions made hours ago
- Tool call results accumulate without truncation
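One way to confirm that an accumulating session still fits in the window is to keep a running token estimate. The following is a minimal sketch, not M3's own accounting: it assumes a crude four-characters-per-token heuristic rather than the real tokenizer, and `estimate_tokens` and `fits_in_context` are hypothetical helper names.

```python
CONTEXT_LIMIT = 1_000_000  # M3's context window, per the text above
CHARS_PER_TOKEN = 4        # assumption: crude heuristic, not M3's real tokenizer


def estimate_tokens(messages):
    """Roughly estimate the token count of a list of chat messages."""
    total_chars = sum(len(str(m.get("content") or "")) for m in messages)
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(messages, reserve=50_000, limit=CONTEXT_LIMIT):
    """Return True if the history plus a reserve for the reply fits in the window."""
    return estimate_tokens(messages) + reserve <= limit


history = [
    {"role": "system", "content": "You are an autonomous coding agent."},
    {"role": "tool", "content": "x" * 400_000},  # a large tool result
]
print(estimate_tokens(history), fits_in_context(history))
```

For real accounting, the token counts reported by the API are a better source than a character heuristic.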
3. Native tool calling at 74.2% MCP Atlas
M3 scores 74.2% on MCP Atlas, a benchmark measuring multi-step tool use reliability. This is behind Claude Opus 4.8 (82.2%) but ahead of most open-weight alternatives. The 74.2% figure is a benchmark score, not a per-step success rate: in workflows that chain dozens of tool calls, per-step errors compound, so production agent loops should still validate tool results and retry failed steps.
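A quick calculation shows why per-step reliability matters so much in long tool chains. This sketch assumes independent steps, and the per-step rates are illustrative values, not measurements of M3 or any other model.

```python
def chain_success(per_step, steps):
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


def with_retries(per_step, attempts):
    """Probability a step succeeds within `attempts` independent tries."""
    return 1 - (1 - per_step) ** attempts


for p in (0.90, 0.95, 0.99):
    print(f"per-step {p:.2f}: 30-step chain succeeds {chain_success(p, 30):.1%} of the time")

# Retrying a flaky step lifts its effective reliability before it compounds.
boosted = with_retries(0.90, 3)
print(f"0.90 with 3 attempts -> {boosted:.3f} per step, "
      f"{chain_success(boosted, 30):.1%} over 30 steps")
```

The retry term is the reason verification steps such as tests, linters, and exit codes matter more than small differences in raw tool-call accuracy.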
The ICLR paper reproduction
MiniMax demonstrated M3's agentic capabilities by having it autonomously reproduce an ICLR 2025 Outstanding Paper. The details:
- Duration: ~12 hours of continuous autonomous execution
- Output: 18 git commits, 23 experimental figures
- Process: Read the paper, understood the methodology, implemented the code, ran experiments, generated visualizations, iterated on results
- Human intervention: None
This is qualitatively different from benchmark scores. SWE-bench Pro measures whether a model can fix a single GitHub issue. The ICLR reproduction measures whether it can sustain coherent, goal-directed work over many hours across multiple phases of a complex project.
Setting up M3 for agentic workflows
With Aider (recommended for coding agents)
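```bash
export OPENAI_API_BASE="https://api.minimax.io/v1"
export OPENAI_API_KEY="your-minimax-key"

# Run Aider in auto mode with M3
aider --model openai/minimax-m3 --yes-always --auto-commits
```
Aider's --yes-always flag (called --yes in older releases) answers yes to confirmation prompts, which enables autonomous operation. M3 will edit files and commit changes without asking for confirmation. Whether suggested shell commands also run unprompted depends on the Aider version, so check that behavior before relying on it.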
With a custom agent loop
```python
from openai import OpenAI
import subprocess  # for the execute_tool helper, which this snippet does not define
import os

# Point the OpenAI SDK at MiniMax's endpoint.
client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key=os.environ["MINIMAX_API_KEY"],
)

# Tool schemas the model is allowed to call.
tools = [
    {"type": "function", "function": {
        "name": "run_command",
        "description": "Execute a shell command and return output",
        "parameters": {"type": "object", "properties": {
            "command": {"type": "string"}
        }, "required": ["command"]}
    }},
    {"type": "function", "function": {
        "name": "write_file",
        "description": "Write content to a file",
        "parameters": {"type": "object", "properties": {
            "path": {"type": "string"},
            "content": {"type": "string"}
        }, "required": ["path", "content"]}
    }},
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read a file's contents",
        "parameters": {"type": "object", "properties": {
            "path": {"type": "string"}
        }, "required": ["path"]}
    }},
]

# Conversation history, shared across the whole session.
messages = [{"role": "system", "content": "You are an autonomous coding agent. Complete the task using the available tools. Run tests after each change."}]


def agent_loop(task, max_iterations=50):
    messages.append({"role": "user", "content": task})
    for _ in range(max_iterations):
        response = client.chat.completions.create(
            model="minimax-m3",
            messages=messages,
            tools=tools,
        )
        msg = response.choices[0].message
        messages.append(msg)
        if msg.tool_calls:
            for call in msg.tool_calls:
                # execute_tool is not defined here: it must run the named tool
                # and return its output as a string.
                result = execute_tool(call)
                messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
        else:
            # No tool calls = agent is done
            return msg.content
    return "Max iterations reached"
```
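The loop calls `execute_tool`, which the snippet leaves undefined. The following is one possible sketch of that dispatcher, not MiniMax's implementation. It assumes tool calls arrive in the OpenAI SDK shape (`call.function.name` plus a JSON string in `call.function.arguments`), and the `timeout` parameter is a hypothetical addition.

```python
import json
import subprocess
from pathlib import Path


def execute_tool(call, timeout=120):
    """Dispatch one tool call and return its result as a string."""
    name = call.function.name
    try:
        args = json.loads(call.function.arguments or "{}")
    except json.JSONDecodeError as exc:
        return f"error: invalid JSON arguments: {exc}"
    try:
        if name == "run_command":
            # WARNING: runs arbitrary shell commands; use a sandbox or container.
            proc = subprocess.run(args["command"], shell=True, capture_output=True,
                                  text=True, timeout=timeout)
            return f"exit code {proc.returncode}\n{proc.stdout}{proc.stderr}"
        if name == "write_file":
            path = Path(args["path"])
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(args["content"])
            return f"wrote {len(args['content'])} characters to {path}"
        if name == "read_file":
            return Path(args["path"]).read_text()
        return f"error: unknown tool {name!r}"
    except Exception as exc:  # report failures to the model instead of crashing the loop
        return f"error: {type(exc).__name__}: {exc}"
```

Returning errors as strings, rather than raising, lets the model see what went wrong and try a different approach on the next iteration.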
With MiniMax Code
The dedicated coding interface at code.minimax.io provides a purpose-built environment for agentic coding. It handles the agent loop, tool execution, and file management automatically, similar to Claude Code but powered by M3.
Computer use for coding
M3 can operate a desktop computer: clicking, typing, navigating UIs. For coding workflows, this enables:
- Visual testing: Write frontend code, open a browser, visually verify the rendering, fix issues
- IDE interaction: Navigate complex IDEs, run debuggers, inspect variables
- Documentation browsing: Search docs, read API references, apply findings to code
- CI/CD interaction: Monitor build pipelines, read error logs, trigger deployments
This is the same capability that Claude Opus 4.8 offers via its computer use feature (87.1% OSWorld). M3's computer use is newer and less battle-tested, but the capability is there.
Cost comparison for agentic workloads
Running an agent for 8 hours per day:
| Model | Cost/hour (estimated) | Cost/day (8 hr) | Cost/month (22 days) |
|---|---|---|---|
| MiniMax M3 | ~$0.50 | ~$4.00 | ~$88 |
| MiniMax M3 (launch) | ~$0.25 | ~$2.00 | ~$44 |
| DeepSeek V4-Pro | ~$0.08 | ~$0.64 | ~$14 |
| MiMo V2.5 Pro | ~$0.06 | ~$0.48 | ~$11 |
| Claude Opus 4.8 | ~$2.25 | ~$18.00 | ~$396 |
By these estimates, M3 is 4.5× cheaper than Opus 4.8 for agentic work, but roughly 6× more expensive than DeepSeek V4-Pro and 8× more expensive than MiMo V2.5 Pro. The premium buys you multimodal capabilities, computer use, and the MSA speed advantage at long contexts.
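The table's arithmetic can be reproduced in a few lines. This sketch copies the estimated hourly rates from the table and assumes 22 working days per month, which is inferred from the table's own numbers ($4.00 per day against about $88 per month) rather than stated in the source.

```python
HOURS_PER_DAY = 8
WORKDAYS_PER_MONTH = 22  # assumption inferred from the table: $4.00/day -> ~$88/month

hourly = {  # estimated cost per hour, copied from the table above
    "MiniMax M3": 0.50,
    "DeepSeek V4-Pro": 0.08,
    "MiMo V2.5 Pro": 0.06,
    "Claude Opus 4.8": 2.25,
}


def monthly_cost(cost_per_hour):
    """Scale an hourly estimate to a month of 8-hour working days."""
    return cost_per_hour * HOURS_PER_DAY * WORKDAYS_PER_MONTH


for model, rate in hourly.items():
    ratio = rate / hourly["MiniMax M3"]
    print(f"{model:16s} ${monthly_cost(rate):7.2f}/month  {ratio:.2f}x the M3 rate")
```

Swapping in your own measured hourly spend is more reliable than these estimates, since actual cost depends heavily on cache hit rate and output volume.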
When to use M3 vs alternatives
| Scenario | Best model | Why |
|---|---|---|
| Complex multi-file coding (highest quality) | Claude Opus 4.8 | 69.2% SWE-bench Pro, dynamic workflows |
| Long-horizon autonomous tasks (budget) | MiniMax M3 | 1M context + MSA speed + $0.60/M |
| Maximum cost efficiency | DeepSeek V4-Pro | $0.435/$0.87, 80.6% SWE-bench Verified |
| Visual/multimodal agent tasks | MiniMax M3 | Native image/video + computer use |
| Speed-critical agent loops | Step 3.7 Flash | 400 t/s, Advisor Mode |
Tips for long-running M3 agent sessions
- Use cache reads aggressively: At $0.12/M, cached context is 5× cheaper than fresh input. Keep your system prompt and codebase context stable across calls.
- Set clear stopping criteria: M3 will keep working indefinitely. Define success conditions ("all tests pass", "build succeeds", "no lint errors") so the agent knows when to stop.
- Commit frequently: Have the agent commit after each meaningful change. If something goes wrong at hour 6, you can roll back without losing hours 1-5.
- Monitor token usage: A 12-hour session can consume 10-50M tokens. At $0.60/M input, that is $6-30. Set budget limits.
- Use the 512K tier when possible: Pricing doubles above 512K context. If your task fits in 512K, you save 50%.
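The cache, budget, and tier tips can be combined into a small spend tracker. This is a sketch under stated assumptions: the prices come from the text, but applying the 2× multiplier to both fresh and cached tokens is a guess about how the tier works, output tokens are ignored, and `call_cost` and `BudgetGuard` are hypothetical names.

```python
INPUT_PER_M = 0.60     # $ per million fresh input tokens (from the text)
CACHED_PER_M = 0.12    # $ per million cache-read tokens (from the text)
TIER_LIMIT = 512_000   # context size above which pricing doubles (from the text)


def call_cost(fresh_tokens, cached_tokens):
    """Estimate the input cost in dollars of one API call.

    Assumption: the 2x multiplier applies to both fresh and cached tokens
    once the total context of the call exceeds the 512K tier.
    """
    multiplier = 2 if fresh_tokens + cached_tokens > TIER_LIMIT else 1
    dollars = (fresh_tokens * INPUT_PER_M + cached_tokens * CACHED_PER_M) / 1_000_000
    return dollars * multiplier


class BudgetGuard:
    """Accumulates estimated spend and signals when a limit is reached."""

    def __init__(self, limit_dollars):
        self.limit = limit_dollars
        self.spent = 0.0

    def record(self, fresh_tokens, cached_tokens):
        """Add one call's cost; return False once the budget is used up."""
        self.spent += call_cost(fresh_tokens, cached_tokens)
        return self.spent < self.limit


guard = BudgetGuard(limit_dollars=30.0)
keep_going = guard.record(fresh_tokens=20_000, cached_tokens=300_000)
print(f"spent ${guard.spent:.4f}, continue: {keep_going}")
```

In an agent loop, the token counts would come from the usage data returned with each response. Whether M3's endpoint reports cached tokens separately is not confirmed by the source.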
FAQ
How does M3 compare to Claude Opus 4.8 for agentic coding?
Opus 4.8 is better on raw coding quality (69.2% vs 59.0% SWE-bench Pro) and has dynamic workflows for parallel agent orchestration. M3 is about 4.5× cheaper by the hourly estimates above, has native multimodal input, and uses MSA to keep long contexts fast. Use Opus for the hardest tasks, M3 for sustained autonomous work where cost matters.
Can M3 run for 12 hours like the ICLR demo?
Yes, if your infrastructure supports it. The 1M context window and MSA architecture are designed for exactly this. The limiting factor is your agent loop implementation, not the model.
Is 59% SWE-bench Pro good enough for production agents?
It depends on your tolerance for errors. 59% means M3 resolves about 6 in 10 of the benchmark's GitHub issues autonomously. For tasks with a test suite that catches failures, this is fine: the agent retries until tests pass. For tasks without automated verification, you may want Opus 4.8's higher reliability.
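The "retry until tests pass" pattern can be sketched as a small wrapper around the agent. This is an illustration, not a prescribed design: `retry_until_green` and `tests_pass` are hypothetical names, the pytest command is an assumption about the project's test runner, and `attempt_fix` stands in for one run of whatever agent loop you use.

```python
import subprocess


def tests_pass(command=("python", "-m", "pytest", "-q")):
    """Run the test command and return (passed, combined output)."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def retry_until_green(attempt_fix, check=tests_pass, max_attempts=5):
    """Call `attempt_fix(feedback)` until `check()` passes or attempts run out."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        attempt_fix(feedback)      # e.g. one agent run, given the last failure output
        passed, feedback = check()
        if passed:
            return attempt         # number of attempts it took
    return None                    # still failing: escalate to a human or another model
```

Capping the attempts keeps a task the model cannot solve from consuming the budget indefinitely.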
What about the computer use capability?
It works but is less mature than Claude's (87.1% OSWorld). Use it for visual verification and simple UI interactions. For complex browser automation, Claude Opus 4.8 is more reliable.
How does the cost compare to running my own agent on DeepSeek?
DeepSeek V4-Pro is cheaper per token ($0.435/$0.87 vs $0.60/$2.40 per million input/output tokens, roughly 1.4× on input and 2.8× on output) and about 6× cheaper by the hourly estimates above. If you do not need multimodal or computer use, DeepSeek is the better value for pure coding agents. M3's advantage is the combination of multimodal + long context + computer use in one model.
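Because input and output tokens are priced differently, the gap at list prices depends on the workload mix. The sketch below uses the per-million prices quoted in this answer with a hypothetical workload. It ignores caching, tier pricing, and any difference in how many tokens each model needs to finish a task.

```python
PRICES = {  # $ per million (input, output) tokens, from the answer above
    "MiniMax M3": (0.60, 2.40),
    "DeepSeek V4-Pro": (0.435, 0.87),
}


def session_cost(model, input_tokens, output_tokens):
    """List-price cost in dollars for a given token mix."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 10M input tokens, 1M output tokens, no caching.
m3 = session_cost("MiniMax M3", 10_000_000, 1_000_000)
ds = session_cost("DeepSeek V4-Pro", 10_000_000, 1_000_000)
print(f"M3 ${m3:.2f} vs DeepSeek ${ds:.2f}: {m3 / ds:.2f}x")
```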