Jun 1, 2026 · 6 min read

MiniMax M3 for Agentic Coding: Long-Horizon Autonomy at $0.60/M Tokens

🎉 Update (June 13, 2026): MiniMax M3 open weights are now available on Hugging Face (MiniMaxAI/MiniMax-M3), so you can download and self-host the 428B-parameter model.

In a test run, MiniMax M3 autonomously reproduced an ICLR 2025 Outstanding Paper, running for nearly 12 hours and producing 18 commits and 23 experimental figures without human intervention. That is not a benchmark score. It is a demonstration of sustained autonomous execution over a complex, multi-step research task.

This makes M3 a strong candidate for agentic coding workflows: tasks where an AI agent runs for hours, calling tools, writing code, running tests, and iterating without human input. At $0.60 per million input tokens (with cache reads at $0.12/M), it is also far cheaper than Claude Opus 4.8 for this kind of work, though not the cheapest option overall.

This guide covers how to set up M3 for agentic coding, what makes it different from other models in this space, and where it excels versus where you should still reach for Claude Opus 4.8.

Why M3 works for agentic coding

Three architectural features make M3 particularly suited for long-running autonomous tasks:

1. MSA enables sustained long sessions

MiniMax Sparse Attention (MSA) delivers 15.6× faster decoding at 1M context. In practice, this means an agent session that accumulates hundreds of tool call results slows down far less as context grows. With standard attention, inference gets progressively slower as context fills up; with MSA, that speed penalty is dramatically reduced.

2. 1M context without truncation

Many long agentic coding sessions fail when context overflows: the agent loses track of earlier decisions, repeats work, or contradicts itself. M3’s 1M token context window means:

A full day of agent work fits in context
No need for summarization or context compression hacks
The agent remembers decisions made hours ago
Tool call results accumulate without truncation

3. Native tool calling at 74.2% MCP Atlas

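M3 scores 74.2% on MCP Atlas, a benchmark measuring multi-step tool use reliability. This is behind Claude Opus 4.8 (82.2%) but ahead of most open-weight alternatives. The score is best read as a result over whole multi-step tasks, not as a per-call success rate: in workflows that chain dozens of tool calls, per-step errors compound, so individual calls must succeed far more often than 74.2% of the time for a task-level score that high. In production agent loops, pair the model with retries and automated checks so that a failed step is caught and corrected.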
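
To make the compounding effect concrete, here is a minimal sketch. It assumes every step in a chain succeeds independently with the same probability, which is a simplification and not a description of how MCP Atlas is scored. The function name chain_success_rate is hypothetical.

```python
def chain_success_rate(per_step: float, steps: int) -> float:
    """Probability that `steps` consecutive tool calls all succeed,
    assuming each succeeds independently with probability `per_step`."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Compare a few per-step rates over short and long chains
    for per_step in (0.742, 0.95, 0.99):
        for steps in (10, 30):
            rate = chain_success_rate(per_step, steps)
            print(f"per-step {per_step:.3f}, {steps} steps -> {rate:.1%}")
```

Under these assumptions, a 74.2% per-step rate would leave only about 5% of 10-step chains intact, which is why a benchmark score of that size makes more sense as a task-level figure, and why retries and automated checks matter in long chains.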

The ICLR paper reproduction

MiniMax demonstrated M3’s agentic capabilities by having it autonomously reproduce an ICLR 2025 Outstanding Paper. The details:

Duration: ~12 hours of continuous autonomous execution
Output: 18 git commits, 23 experimental figures
Process: Read the paper, understood the methodology, implemented the code, ran experiments, generated visualizations, iterated on results
Human intervention: None

This is qualitatively different from benchmark scores. SWE-bench Pro measures whether a model can fix a single GitHub issue. The ICLR reproduction measures whether it can sustain coherent, goal-directed work over many hours across multiple phases of a complex project.

Setting up M3 for agentic workflows

With Aider (recommended for coding agents)

export OPENAI_API_BASE="https://api.minimax.io/v1"
export OPENAI_API_KEY="your-minimax-key"

# Run Aider in auto mode with M3
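aider --model openai/minimax-m3 --yes-always --auto-commits

Aider’s --yes-always flag (named --yes in older releases) answers every confirmation prompt automatically, which enables autonomous operation. M3 will edit files, run commands, and commit changes without asking for confirmation.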

With a custom agent loop

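from openai import OpenAI
import json
import subprocess
import os

client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key=os.environ["MINIMAX_API_KEY"]
)

def execute_tool(call):
    # Dispatch one tool call to a local implementation and return its output as a string.
    # WARNING: this runs model-generated shell commands; use a sandbox or container.
    try:
        args = json.loads(call.function.arguments)
        name = call.function.name
        if name == "run_command":
            proc = subprocess.run(args["command"], shell=True,
                                  capture_output=True, text=True, timeout=300)
            return (proc.stdout + proc.stderr) or f"(exit code {proc.returncode})"
        if name == "write_file":
            with open(args["path"], "w") as f:
                f.write(args["content"])
            return f"Wrote {len(args['content'])} characters to {args['path']}"
        if name == "read_file":
            with open(args["path"]) as f:
                return f.read()
        return f"Unknown tool: {name}"
    except Exception as exc:
        # Report errors back to the model so it can recover instead of crashing the loop
        return f"Error: {exc}"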

tools = [
    {"type": "function", "function": {
        "name": "run_command",
        "description": "Execute a shell command and return output",
        "parameters": {"type": "object", "properties": {
            "command": {"type": "string"}
        }, "required": ["command"]}
    }},
    {"type": "function", "function": {
        "name": "write_file",
        "description": "Write content to a file",
        "parameters": {"type": "object", "properties": {
            "path": {"type": "string"},
            "content": {"type": "string"}
        }, "required": ["path", "content"]}
    }},
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read a file's contents",
        "parameters": {"type": "object", "properties": {
            "path": {"type": "string"}
        }, "required": ["path"]}
    }}
]

SYSTEM_PROMPT = "You are an autonomous coding agent. Complete the task using the available tools. Run tests after each change."

def agent_loop(task, max_iterations=50):
    # Start a fresh history per task so repeated calls do not share state
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    
    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="minimax-m3",
            messages=messages,
            tools=tools
        )
        
        msg = response.choices[0].message
        messages.append(msg)
        
        if msg.tool_calls:
            for call in msg.tool_calls:
                result = execute_tool(call)
                messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
        else:
            # No tool calls = agent is done
            return msg.content
    
    return "Max iterations reached"
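
The loop above stops as soon as the model returns a message with no tool calls, which means it trusts the model's own judgment that the task is finished. A stricter option is to verify success externally before accepting that answer. The sketch below is one way to do that; the helper name checks_pass and the demo command are hypothetical, and the real commands would be whatever defines "done" for your project (tests, build, lint).

```python
import subprocess
import sys


def checks_pass(commands, timeout=600):
    """Return True only if every command exits with status 0.

    `commands` is a list of argument lists, e.g. [["pytest", "-q"]].
    """
    for argv in commands:
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            # A hung or missing command counts as a failed check
            return False
        if proc.returncode != 0:
            return False
    return True


if __name__ == "__main__":
    # Stand-in command so the example runs anywhere; replace with your real checks
    demo = [[sys.executable, "-c", "print('ok')"]]
    print(checks_pass(demo))
```

One way to use it: when the model stops requesting tools, call checks_pass, and if it returns False, append a user message saying which check failed and let the loop continue instead of returning.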

With MiniMax Code

The dedicated coding interface at code.minimax.io provides a purpose-built environment for agentic coding. It handles the agent loop, tool execution, and file management automatically — similar to Claude Code but powered by M3.

Computer use for coding

M3 can operate a desktop computer — clicking, typing, navigating UIs. For coding workflows, this enables:

Visual testing: Write frontend code, open a browser, visually verify the rendering, fix issues
IDE interaction: Navigate complex IDEs, run debuggers, inspect variables
Documentation browsing: Search docs, read API references, apply findings to code
CI/CD interaction: Monitor build pipelines, read error logs, trigger deployments

This is the same kind of capability that Claude Opus 4.8 offers via its computer use feature (87.1% OSWorld). M3’s computer use is newer and less battle-tested, but the capability is there.

Cost comparison for agentic workloads

Running an agent for 8 hours per day (monthly figures assume 22 working days):

Model	Cost/hour (estimated)	Cost/day (8hr)	Monthly
MiniMax M3	~$0.50	~$4.00	~$88
MiniMax M3 (launch)	~$0.25	~$2.00	~$44
DeepSeek V4-Pro	~$0.08	~$0.64	~$14
MiMo V2.5 Pro	~$0.06	~$0.48	~$11
Claude Opus 4.8	~$2.25	~$18.00	~$396

Based on these estimates, M3 is 4.5× cheaper than Opus 4.8 for agentic work, but roughly 6× more expensive than DeepSeek V4-Pro and 8× more expensive than MiMo V2.5 Pro. The premium buys you multimodal capabilities, computer use, and the MSA speed advantage at long contexts.
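
The arithmetic behind the table can be reproduced in a few lines, which makes it easy to plug in your own hourly estimates. The hourly figures are copied from the table; the 22 working days per month is an assumption inferred from the monthly column, and the function names are hypothetical.

```python
HOURS_PER_DAY = 8
WORKING_DAYS_PER_MONTH = 22  # assumption inferred from the table's monthly column

# Estimated hourly costs in USD, copied from the table
HOURLY_COST = {
    "MiniMax M3": 0.50,
    "MiniMax M3 (launch)": 0.25,
    "DeepSeek V4-Pro": 0.08,
    "MiMo V2.5 Pro": 0.06,
    "Claude Opus 4.8": 2.25,
}


def daily_cost(hourly: float) -> float:
    """Cost of one working day at the given hourly rate."""
    return hourly * HOURS_PER_DAY


def monthly_cost(hourly: float) -> float:
    """Cost of one month of working days at the given hourly rate."""
    return daily_cost(hourly) * WORKING_DAYS_PER_MONTH


def cost_ratio(model_a: str, model_b: str) -> float:
    """How many times more expensive model_a is than model_b per hour."""
    return HOURLY_COST[model_a] / HOURLY_COST[model_b]


if __name__ == "__main__":
    for name, hourly in HOURLY_COST.items():
        print(f"{name}: ${daily_cost(hourly):.2f}/day, ${monthly_cost(hourly):.0f}/month")
    print(f"Opus 4.8 vs M3: {cost_ratio('Claude Opus 4.8', 'MiniMax M3'):.1f}x")
    print(f"M3 vs DeepSeek V4-Pro: {cost_ratio('MiniMax M3', 'DeepSeek V4-Pro'):.1f}x")
    print(f"M3 vs MiMo V2.5 Pro: {cost_ratio('MiniMax M3', 'MiMo V2.5 Pro'):.1f}x")
```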

When to use M3 vs alternatives

Scenario	Best model	Why
Complex multi-file coding (highest quality)	Claude Opus 4.8	69.2% SWE-bench Pro, dynamic workflows
Long-horizon autonomous tasks (budget)	MiniMax M3	1M context + MSA speed + $0.60/M
Maximum cost efficiency	DeepSeek V4-Pro	$0.435/$0.87, 80.6% SWE-bench Verified
Visual/multimodal agent tasks	MiniMax M3	Native image/video + computer use
Speed-critical agent loops	Step 3.7 Flash	400 t/s, Advisor Mode

Tips for long-running M3 agent sessions

Use cache reads aggressively — At $0.12/M, cached context is 5× cheaper than fresh input. Keep your system prompt and codebase context stable across calls.
Set clear stopping criteria — M3 will keep working indefinitely. Define success conditions (“all tests pass”, “build succeeds”, “no lint errors”) so the agent knows when to stop.
Commit frequently — Have the agent commit after each meaningful change. If something goes wrong at hour 6, you can roll back without losing hours 1-5.
Monitor token usage — A 12-hour session can consume 10-50M tokens. At $0.60/M input, that is $6-30. Set budget limits.
Use the 512K tier when possible — Pricing doubles above 512K context. If your task fits in 512K, you save 50%.
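
Several of these tips come down to tracking spend per request. The sketch below estimates cost from token counts and enforces a budget limit. The prices come from this article; everything else is an assumption for illustration: that "512K" means 512,000 tokens, that every rate doubles once a request's context exceeds that threshold, and the names request_cost and BudgetTracker. Check the provider's pricing page for the actual tier rules.

```python
from dataclasses import dataclass

INPUT_PER_M = 0.60       # USD per million fresh input tokens (from the article)
CACHE_READ_PER_M = 0.12  # USD per million cached input tokens (from the article)
OUTPUT_PER_M = 2.40      # USD per million output tokens (from the FAQ pricing)
LONG_CONTEXT_THRESHOLD = 512_000  # assumption: "512K" means 512,000 tokens
LONG_CONTEXT_MULTIPLIER = 2.0     # assumption: every rate doubles above the threshold


def request_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Estimated USD cost of one request, given its token counts."""
    base = (
        fresh_in * INPUT_PER_M
        + cached_in * CACHE_READ_PER_M
        + out * OUTPUT_PER_M
    ) / 1_000_000
    if fresh_in + cached_in > LONG_CONTEXT_THRESHOLD:
        return base * LONG_CONTEXT_MULTIPLIER
    return base


@dataclass
class BudgetTracker:
    limit_usd: float
    spent_usd: float = 0.0

    def record(self, fresh_in: int, cached_in: int, out: int) -> bool:
        """Add one request's cost and report whether the budget still holds."""
        self.spent_usd += request_cost(fresh_in, cached_in, out)
        return self.spent_usd <= self.limit_usd


if __name__ == "__main__":
    # Same 500K-token context, mostly cached vs entirely fresh
    print(f"cached: ${request_cost(20_000, 480_000, 1_000):.3f}")
    print(f"fresh:  ${request_cost(500_000, 0, 1_000):.3f}")
```

In an agent loop, the token counts would come from the usage data on each API response, and the loop would stop once record returns False. Whether the endpoint reports cached tokens separately is something to verify against the provider's documentation.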

FAQ

How does M3 compare to Claude Opus 4.8 for agentic coding?

Opus 4.8 is better on raw coding quality (69.2% vs 59.0% SWE-bench Pro) and has dynamic workflows for parallel agent orchestration. M3 is 4.5× cheaper, has native multimodal support, and uses MSA to make long contexts faster. Use Opus for the hardest tasks, M3 for sustained autonomous work where cost matters.

Can M3 run for 12 hours like the ICLR demo?

Yes, if your infrastructure supports it. The 1M context window and MSA architecture are designed for exactly this. The limiting factor is your agent loop implementation, not the model.

Is 59% SWE-bench Pro good enough for production agents?

It depends on your tolerance for errors. 59% means M3 resolves about 6 in 10 of the benchmark's GitHub issues on its own. For tasks with a test suite that catches failures, this is usually fine, because the agent can retry until tests pass. For tasks without automated verification, you may want Opus 4.8’s higher reliability.

What about the computer use capability?

It works but is less mature than Claude’s (87.1% OSWorld). Use it for visual verification and simple UI interactions. For complex browser automation, Claude Opus 4.8 is more reliable.

How does the cost compare to running my own agent on DeepSeek?

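On list prices, DeepSeek V4-Pro is about 1.4× cheaper on input and 2.8× cheaper on output ($0.435/$0.87 vs $0.60/$2.40 per million tokens). The estimated hourly costs above put the gap at roughly 6× for agentic workloads. If you do not need multimodal or computer use, DeepSeek is the better value for pure coding agents. M3’s advantage is the combination of multimodal + long context + computer use in one model.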