Jun 1, 2026 · 6 min read

MiniMax M3 1M Context Window: How MSA Makes Million-Token Inference Practical

MiniMax M3 supports a 1-million-token context window — enough to fit an entire large codebase, a 500-page document, or hours of agent conversation history in a single prompt. What makes this different from other 1M-context models is the speed: MiniMax Sparse Attention (MSA) delivers 15.6× faster decoding and 9.7× faster prefill at million-token contexts compared to standard attention.

This means you can actually use the full 1M context without waiting minutes for a response. Other models technically support long contexts but become impractically slow. M3 stays fast.

How MSA works (simplified)

Standard transformer attention scores every token against every other token, so the work grows with the square of the context length. At 1M tokens, that is on the order of 1 trillion (10^12) query-key scores per layer. This is why most models slow to a crawl at long contexts.
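
The quadratic growth is easy to check with a little arithmetic. The sketch below counts query-key pairs for dense attention at a few context lengths. The function name is ours, and the count ignores attention heads, batch size, and causal masking (which roughly halves the number).

```python
def dense_attention_pairs(context_tokens: int) -> int:
    """Number of query-key pairs scored by dense (all-to-all) attention."""
    return context_tokens * context_tokens


# Compare a short prompt, a typical 128K window, and the full 1M window
for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {dense_attention_pairs(n):.2e} pairs per layer")
```

Going from 128K to 1M tokens multiplies the context by about 8 but the pair count by about 61, which is the gap sparse attention tries to close.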

MSA (MiniMax Sparse Attention) takes a different approach:

Sparse patterns: Only computes attention between tokens that are likely to be relevant to each other
Uncompressed KV cache: Unlike DeepSeek’s MLA which compresses key-values (losing some precision), MSA keeps full precision
Hierarchical structure: Different attention patterns at different layers — some focus on local context, others on global relationships

The result: 15.6× faster decoding and 9.7× faster prefill at 1M tokens, with the KV cache kept at full precision. You get response times much closer to a short-context model, with the capacity of a million-token window.
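
To make the sparse-pattern idea concrete, here is a minimal sketch of one generic sparse layout: each token attends to a sliding local window plus a handful of global tokens at the start of the sequence. This is an assumption for illustration only. The article does not describe MSA's actual selection rule, and the window and global-token sizes below are made up.

```python
def causal_dense_pairs(n: int) -> int:
    """Query-key pairs when every token attends to itself and all earlier tokens."""
    return n * (n + 1) // 2


def local_plus_global_pairs(n: int, window: int, global_tokens: int) -> int:
    """Pairs when each query sees its last `window` tokens plus the first
    `global_tokens` positions (illustrative pattern, not MSA's real one)."""
    total = 0
    for q in range(n):
        local = min(window, q + 1)                # tokens inside the sliding window
        window_start = q + 1 - local              # first position the window covers
        extra = min(global_tokens, window_start)  # global tokens outside the window
        total += local + extra
    return total


n = 1_000_000
sparse = local_plus_global_pairs(n, window=4_096, global_tokens=64)
print(f"dense:  {causal_dense_pairs(n):.3e} pairs")
print(f"sparse: {sparse:.3e} pairs ({causal_dense_pairs(n) / sparse:.0f}x fewer)")
```

The reduction in scored pairs is not the same as the measured speedup. Real kernels also pay for memory traffic and for choosing which tokens to keep.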

What fits in 1M tokens?

Content	Approximate tokens	Fits in 1M?
Average novel (80K words)	~100K tokens	✅ (10 novels)
Large codebase (50K lines)	~200K tokens	✅
Very large monorepo (200K lines)	~800K tokens	✅
100-page PDF document	~50K tokens	✅ (20 documents)
8-hour agent session (tool calls)	~500K-1M tokens	✅
Full React codebase (create-react-app)	~30K tokens	✅ (trivial)
Linux kernel (27M lines)	~100M tokens	❌ (100× too large)

For most practical coding tasks, 1M tokens is more than enough. You can fit an entire medium-to-large codebase plus extensive conversation history.
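
A quick way to sanity-check your own material against the table is a character-based estimate. The sketch below assumes roughly 4 characters per token, which is a common rule of thumb for English text and not a property of M3's tokenizer. The function names and the output reserve are hypothetical.

```python
CONTEXT_LIMIT = 1_000_000   # M3's advertised context window, in tokens
CHARS_PER_TOKEN = 4         # rough heuristic; real tokenizers vary by language and content


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def fits_in_context(text: str, reserve_for_output: int = 8_000) -> bool:
    """True if the text plus a reserved output budget stays inside the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_LIMIT


sample = "def handler(request):\n    return authenticate(request)\n" * 1_000
print(estimate_tokens(sample), fits_in_context(sample))
```

Code and non-English text often tokenize less efficiently than this estimate suggests, so treat the result as a lower bound and leave headroom.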

Pricing: the 512K threshold

M3 has tiered pricing based on context usage:

Context range	Input ($/M tokens)	Output ($/M tokens)	Cache reads ($/M tokens)
≤512K tokens	$0.60	$2.40	$0.12
512K–1M tokens	$1.20	$4.80	$0.24

The price doubles above 512K. This means:

If your task fits in 512K, keep it there (most tasks do)
Only use 512K-1M when you genuinely need the extra context
Cache reads are cheap at both tiers — reuse context aggressively
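
The tiers can be turned into a small cost estimator. This is a sketch under stated assumptions: "512K" is taken to mean 512,000 tokens, the tier is chosen by total input size and applied to the whole request (the all-or-nothing rule described in the FAQ), and cached tokens are billed at the cache-read rate of whichever tier applies. Check the official pricing page before relying on any of these.

```python
TIER_THRESHOLD = 512_000  # input tokens; assumes "512K" means 512,000

# USD per million tokens, copied from the pricing table
RATES = {
    "low": {"input": 0.60, "output": 2.40, "cache_read": 0.12},
    "high": {"input": 1.20, "output": 4.80, "cache_read": 0.24},
}


def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimated request cost in USD.

    input_tokens is the total prompt size; cached_tokens is the part of it
    served from cache (assumed to count toward the tier threshold).
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    rates = RATES["high" if input_tokens > TIER_THRESHOLD else "low"]
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * rates["input"]
        + cached_tokens * rates["cache_read"]
        + output_tokens * rates["output"]
    )
    return total / 1_000_000


print(f"500K in, 2K out:             ${estimate_cost(500_000, 2_000):.4f}")
print(f"600K in, 2K out:             ${estimate_cost(600_000, 2_000):.4f}")
print(f"500K in (400K cached), 2K out: ${estimate_cost(500_000, 2_000, 400_000):.4f}")
```

Under these assumptions, growing the prompt from 500K to 600K tokens more than doubles the bill, while caching most of a 500K prompt cuts it by well over half.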

Use case 1: Entire codebase analysis

Load your full codebase into context and ask questions about it:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key=os.environ["MINIMAX_API_KEY"]
)

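# Collect all source files into one prompt string
parts = []
for root, _dirs, files in os.walk("./src"):
    for name in sorted(files):
        if name.endswith(('.py', '.ts', '.js')):
            path = os.path.join(root, name)
            # Read as UTF-8 and replace undecodable bytes so one odd file cannot abort the run
            with open(path, encoding="utf-8", errors="replace") as fh:
                parts.append(f"\n\n--- {path} ---\n{fh.read()}")
codebase = "".join(parts)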

response = client.chat.completions.create(
    model="minimax-m3",
    messages=[
        {"role": "system", "content": "You have the entire codebase in context. Answer questions about architecture, find bugs, suggest improvements."},
        {"role": "user", "content": f"Here is the full codebase:\n{codebase}\n\nFind all places where we handle authentication incorrectly."}
    ]
)

With MSA, this responds in seconds even at 500K+ tokens of context — where standard models would take minutes.

Use case 2: Long-running agent sessions

Agent sessions accumulate context over time: system prompts, user messages, tool calls, tool results, assistant responses. With standard 128K models, you hit the limit after 30-60 minutes of active work. With M3’s 1M context:

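# Agent session that runs for hours without context overflow.
# system_prompt, task_queue, tools, execute_tool, and summarize_old_messages
# are placeholders you define for your own agent.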
messages = [
    {"role": "system", "content": system_prompt},  # ~2K tokens
]

# This loop can run for hours
for task in task_queue:
    messages.append({"role": "user", "content": task})
    
    response = client.chat.completions.create(
        model="minimax-m3",
        messages=messages,
        tools=tools
    )
    
    # Accumulate all history — no summarization needed
    messages.append(response.choices[0].message)
    
    # Tool results accumulate too
    for tool_call in response.choices[0].message.tool_calls or []:
        result = execute_tool(tool_call)
        messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
    
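    # Rough size check (~4 characters per token); a real tokenizer is more accurate.
    # History mixes plain dicts with SDK message objects, and assistant content
    # can be None on tool-call turns, so handle both cases.
    total_tokens = sum(
        len((m.get("content") if isinstance(m, dict) else m.content) or "") // 4
        for m in messages
    )
    if total_tokens > 900_000:
        # Only summarize when approaching the limit
        messages = summarize_old_messages(messages)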
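
The loop above leaves `summarize_old_messages` undefined. One possible shape for it is sketched below, as an assumption and not MiniMax's recommended approach: keep the system prompt, replace older turns with a single summary message, and keep the most recent turns. The summary text here is a fixed placeholder. In practice you would generate it, for example by asking the model to summarize the dropped turns.

```python
def _role(message) -> str:
    """Read the role from a plain dict or from an SDK message object."""
    return message["role"] if isinstance(message, dict) else message.role


def summarize_old_messages(messages, keep_recent=20,
                           summary_text="[Summary of earlier conversation goes here]"):
    """Collapse everything except the system prompt and the latest turns.

    Hypothetical helper: summary_text is a stand-in for a real summary.
    """
    system = [m for m in messages[:1] if _role(m) == "system"]
    rest = messages[len(system):]
    if len(rest) <= keep_recent:
        return messages  # nothing to trim yet

    cut = len(rest) - keep_recent
    # Move the cut forward to the next user message so no tool result is
    # separated from the assistant tool call that produced it.
    while cut < len(rest) and _role(rest[cut]) != "user":
        cut += 1

    summary = {"role": "user", "content": summary_text}
    return system + [summary] + rest[cut:]
```

Cutting only at a user message matters because OpenAI-style chat APIs generally expect each tool result to follow the assistant message that requested it.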

Use case 3: Multi-document reasoning

Process multiple documents simultaneously and reason across them:

# load_all_documents is a placeholder helper; it should return all contracts as one labeled string
documents = load_all_documents("./contracts/")  # 20 legal contracts

response = client.chat.completions.create(
    model="minimax-m3",
    messages=[{
        "role": "user",
        "content": f"Here are 20 vendor contracts:\n\n{documents}\n\nIdentify all clauses that conflict with each other across different contracts."
    }]
)
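
The `load_all_documents` helper is not defined in the snippet above. A minimal sketch of what it might look like follows, assuming the contracts have already been converted to plain-text files. The header format and the `pattern` parameter are our own choices, and PDF extraction is out of scope here.

```python
from pathlib import Path


def load_all_documents(directory: str, pattern: str = "*.txt") -> str:
    """Concatenate text files into one prompt string, with a labeled header per file.

    Hypothetical helper: assumes plain-text inputs in a single directory.
    """
    parts = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        # A clear per-document header lets the model cite which contract a clause came from
        parts.append(f"=== DOCUMENT: {path.name} ===\n{text.strip()}")
    return "\n\n".join(parts)
```

Labeling each document by file name makes cross-document answers easier to verify, since the model can refer to a contract by the same name you see on disk.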

Use case 4: Video understanding

M3’s multimodal capabilities combined with long context enable processing long videos:

response = client.chat.completions.create(
    model="minimax-m3",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this 30-minute product demo. List every feature shown, in order, with timestamps."},
            {"type": "video_url", "video_url": {"url": "https://example.com/demo.mp4"}}
        ]
    }]
)

When NOT to use 1M context

Long context is not always better:

Simple questions — If your task only needs 1K tokens of context, using 1M wastes money (you pay per input token)
Latency-critical — Even with MSA, 1M context is slower than 10K context. For real-time chat, keep context short.
Above 512K — Pricing doubles. Only cross the threshold when genuinely needed.
Repetitive content — If your context is 90% boilerplate, extract the relevant parts instead of dumping everything in.

Comparison: M3 vs other 1M-context models

Model	Context	Speed at 1M	Input price (1M range)	Architecture
MiniMax M3	1M	Fast (MSA: 15.6× speedup)	$1.20/M	MSA (sparse, full precision)
Claude Opus 4.8	1M	Standard	$5.00/M	Standard attention
DeepSeek V4-Pro	1M	Standard	$0.435/M	MLA (compressed KV)
Gemini 3.5 Flash	1M	Fast	$0.15/M	Unknown

M3’s main advantage: fast long-context inference with an uncompressed KV cache. DeepSeek is cheaper but compresses the KV cache (potential quality loss at extreme lengths). Claude is higher quality but roughly 4× more expensive on input ($5.00/M vs $1.20/M). Gemini is cheapest but less capable on coding.
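
The price gaps are easier to judge for a concrete prompt size. The sketch below uses only the input prices from the comparison table and ignores output tokens, caching, and any tier rules the other providers may apply, so treat it as rough arithmetic.

```python
# Input prices in USD per million tokens, copied from the comparison table (1M range)
INPUT_PRICE_PER_M = {
    "MiniMax M3": 1.20,
    "Claude Opus 4.8": 5.00,
    "DeepSeek V4-Pro": 0.435,
    "Gemini 3.5 Flash": 0.15,
}


def input_cost(model: str, input_tokens: int) -> float:
    """Input-only cost in USD for one request."""
    return INPUT_PRICE_PER_M[model] * input_tokens / 1_000_000


def relative_to_m3(model: str) -> float:
    """How many times more (or less) expensive a model's input is than M3's."""
    return INPUT_PRICE_PER_M[model] / INPUT_PRICE_PER_M["MiniMax M3"]


for model in INPUT_PRICE_PER_M:
    print(f"{model:<18} ${input_cost(model, 800_000):.3f} per 800K-token prompt "
          f"({relative_to_m3(model):.2f}x M3)")
```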

Tips for efficient long-context usage

Use cache reads — At $0.12/M, cached context is 5× cheaper than fresh input. Keep system prompts and reference material in the cache.
Stay under 512K when possible — Pricing doubles above this threshold. Most tasks fit comfortably in 512K.
Place important context at the edges — Put the most relevant information at the beginning or end of the prompt. Even with 1M context, models attend more strongly to early and recent tokens than to the middle.
Chunk large codebases — If you have a 2M-token codebase, load the relevant modules (not everything) to stay within limits.
Monitor token counts — Use a tokenizer to estimate context size before sending requests. Avoid surprise pricing tier jumps.
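
The chunking and monitoring tips can be combined into a simple budget check before each request. The sketch below greedily selects files, in the priority order you give it, until an assumed 512,000-token budget is reached. The function name is hypothetical, and the 4-characters-per-token estimate is a rough heuristic that a real tokenizer should replace.

```python
def pack_files(files: dict[str, str], budget_tokens: int = 512_000) -> list[str]:
    """Pick files to send, most relevant first, without exceeding the budget.

    `files` maps path -> contents and should already be ordered by priority.
    """
    chosen: list[str] = []
    used = 0
    for path, text in files.items():
        cost = len(text) // 4  # rough token estimate
        if used + cost > budget_tokens:
            continue  # too big to fit; a smaller file later in the list still might
        chosen.append(path)
        used += cost
    return chosen


files = {
    "auth/login.py": "x" * 400,      # ~100 tokens
    "vendor/bundle.js": "x" * 800,   # ~200 tokens
    "auth/session.py": "x" * 100,    # ~25 tokens
}
print(pack_files(files, budget_tokens=130))
```

With the tiny budget in the usage example, the large vendored file is skipped and both auth files are kept, which is the behavior you want when staying under the cheaper pricing tier.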

FAQ

Is 1M context actually usable or just a marketing number?

Actually usable. MSA makes it fast enough for production workloads. The 15.6× speedup over standard attention means 1M-token responses come back in seconds, not minutes. MiniMax guarantees a minimum of 512K tokens; 1M is available but may have slightly higher latency.

Does quality degrade at long contexts?

All models show some degradation at extreme context lengths (the “lost in the middle” problem). MSA’s full-precision approach should minimize this compared to compressed-KV approaches like DeepSeek’s MLA. For critical information, place it at the beginning or end of the context.

How does the 512K pricing threshold work?

If your total input (system prompt + messages + context) exceeds 512K tokens, the entire request is billed at the higher tier ($1.20/$4.80 instead of $0.60/$2.40). There is no partial billing — it is all-or-nothing at the threshold.

Can I use 1M context locally?

Depends on hardware. The full 1M context requires significant memory. On consumer hardware (Mac Studio 128-192GB), you will likely be limited to 64-128K tokens locally. The full 1M is practical on multi-GPU server setups or via the API. See our local deployment guide.
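
To see why memory becomes the bottleneck, you can estimate the size of an uncompressed KV cache with the standard formula: one key and one value vector per token, per layer, per KV head. The model dimensions below are hypothetical placeholders, not M3's published architecture, and the estimate excludes the model weights themselves.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of an uncompressed KV cache for one sequence.

    The leading factor of 2 covers the key and the value stored for each token.
    bytes_per_value=2 corresponds to 16-bit formats such as fp16 or bf16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical configuration for illustration only (not M3's real dimensions)
config = dict(layers=60, kv_heads=8, head_dim=128)

for tokens in (131_072, 1_000_000):
    gib = kv_cache_bytes(tokens, **config) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:,.1f} GiB of KV cache")
```

Because the cache grows linearly with context length, a machine that handles 128K tokens comfortably can need several times more memory for the full 1M, on top of whatever the weights occupy.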

How does M3’s long context compare to Claude’s?

Both support 1M tokens. Claude Opus 4.8 has better retrieval accuracy at extreme lengths (based on Anthropic’s evaluations). M3 is faster at long contexts (MSA) and cheaper ($1.20/M vs $5.00/M for the 1M tier). For most practical tasks, both work well.