MiniMax M3 1M Context Window: How MSA Makes Million-Token Inference Practical
MiniMax M3 supports a 1-million-token context window, enough to fit an entire large codebase, a 500-page document, or hours of agent conversation history in a single prompt. What makes this different from other 1M-context models is the speed: MiniMax Sparse Attention (MSA) delivers 15.6× faster decoding and 9.7× faster prefill at million-token contexts compared to standard attention.
This means you can actually use the full 1M context without waiting minutes for a response. Other models technically support long contexts but become impractically slow. M3 stays fast.
How MSA works (simplified)
Standard transformer attention computes relationships between every token and every other token. At 1M tokens, that is 1 trillion attention computations per layer. This is why most models slow to a crawl at long contexts.
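The arithmetic behind that figure can be sketched in a few lines. This is a back-of-the-envelope count of query-key pairs for dense attention over one sequence (per layer, per head); the function name is illustrative, and causal masking, which roughly halves the count, is ignored.

```python
def dense_attention_pairs(num_tokens: int) -> int:
    """Number of query-key scores for full (dense) attention over one sequence."""
    return num_tokens * num_tokens


# Growth is quadratic: 100x more tokens means 10,000x more scores
for n in (10_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {dense_attention_pairs(n):.2e} pairs")
```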
MSA (MiniMax Sparse Attention) takes a different approach:
- Sparse patterns: Only computes attention between tokens that are likely to be relevant to each other
- Uncompressed KV cache: Unlike DeepSeek's MLA which compresses key-values (losing some precision), MSA keeps full precision
- Hierarchical structure: Different attention patterns at different layers, with some focused on local context and others on global relationships
The result: 15.6× faster decoding and 9.7× faster prefill at 1M tokens, with no precision loss from KV compression. You get most of the responsiveness of a short-context model with the capacity of a million-token window.
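To see why sparsity helps, here is a toy count of scored pairs under a generic local-window-plus-global-tokens pattern. The text above does not spell out MSA's actual selection rule, so this is not MSA itself: the pattern, the function name, and the `window` and `global_stride` values are all assumptions chosen for illustration.

```python
import math


def sparse_attention_pairs(num_tokens: int, window: int, global_stride: int) -> int:
    """Upper bound on scored query-key pairs for a toy local+global pattern.

    Each query scores at most `window` nearby tokens plus one "global" token
    every `global_stride` positions. Illustrative only; not MSA's actual rule.
    """
    num_global = math.ceil(num_tokens / global_stride)
    per_query = min(num_tokens, window + num_global)
    return num_tokens * per_query


n = 1_000_000
dense = n * n
sparse = sparse_attention_pairs(n, window=4096, global_stride=1024)  # hypothetical sizes
print(f"dense: {dense:.2e} pairs, sparse: {sparse:.2e} pairs, ratio: {dense / sparse:.0f}x")
```

The ratio printed here is a reduction in scored pairs, not a wall-clock speedup. Real speedups depend on memory traffic, kernel efficiency, and how the relevant tokens are selected.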
What fits in 1M tokens?
| Content | Approximate tokens | Fits in 1M? |
|---|---|---|
| Average novel (80K words) | ~100K tokens | Yes (10 novels) |
| Large codebase (50K lines) | ~200K tokens | Yes |
| Very large monorepo (200K lines) | ~800K tokens | Yes |
| 100-page PDF document | ~50K tokens | Yes (20 documents) |
| 8-hour agent session (tool calls) | ~500K-1M tokens | Yes |
| Full React codebase (create-react-app) | ~30K tokens | Yes (trivial) |
| Linux kernel (27M lines) | ~100M tokens | No (100× too large) |
For most practical coding tasks, 1M tokens is more than enough. You can fit an entire medium-to-large codebase plus extensive conversation history.
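A rough way to check where your own project lands in that table is to estimate tokens from character counts. The sketch below assumes about 4 characters per token, which is only a heuristic and varies by tokenizer and language; the function names, the file suffixes, and the 50K-token reserve for conversation and reply are illustrative choices.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic for English text and code; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def estimate_dir_tokens(root: str, suffixes=(".py", ".ts", ".js")) -> int:
    """Sum the estimated tokens of all matching source files under `root`."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total += estimate_tokens(path.read_text(encoding="utf-8", errors="replace"))
    return total


def fits(tokens: int, limit: int = 1_000_000, reserve: int = 50_000) -> bool:
    """True if `tokens` fits in the window with `reserve` left for chat and reply."""
    return tokens + reserve <= limit
```

For example, `fits(estimate_dir_tokens("./src"))` gives a quick yes/no before you build the prompt. For billing-sensitive work, count with the model's real tokenizer instead.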
Pricing: the 512K threshold
M3 has tiered pricing based on context usage:
| Context range | Input ($/M tokens) | Output ($/M tokens) | Cache reads ($/M tokens) |
|---|---|---|---|
| ≤512K tokens | $0.60 | $2.40 | $0.12 |
| 512K-1M tokens | $1.20 | $4.80 | $0.24 |
The price doubles above 512K. This means:
- If your task fits in 512K, keep it there (most tasks do)
- Only use 512K-1M when you genuinely need the extra context
- Cache reads are cheap at both tiers, so reuse context aggressively
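The tiers can be turned into a small cost estimator. This is a sketch under stated assumptions, not MiniMax's billing logic: it takes "512K" to mean 512,000 tokens (it could be 524,288), bills the whole request at one tier based on total input size (as the FAQ below describes), and counts cached tokens toward that total. The names `request_cost`, `PRICES`, and `TIER_THRESHOLD` are illustrative.

```python
TIER_THRESHOLD = 512_000  # assumption: "512K" means 512,000 tokens

# USD per million tokens: (input, output, cache read), taken from the pricing table
PRICES = {
    "low": (0.60, 2.40, 0.12),
    "high": (1.20, 4.80, 0.24),
}


def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimated USD cost of one request.

    `cached_tokens` is the part of `input_tokens` served from cache.
    The whole request is billed at one tier, chosen by total input size.
    """
    tier = "low" if input_tokens <= TIER_THRESHOLD else "high"
    in_price, out_price, cache_price = PRICES[tier]
    fresh_tokens = input_tokens - cached_tokens
    return (
        fresh_tokens * in_price
        + cached_tokens * cache_price
        + output_tokens * out_price
    ) / 1_000_000


# Just under and just over the threshold, same output size
print(round(request_cost(500_000, 2_000), 4))
print(round(request_cost(520_000, 2_000), 4))
```

Comparing the two printed values shows the cliff: a 4% larger prompt roughly doubles the bill once it crosses the threshold.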
Use case 1: Entire codebase analysis
Load your full codebase into context and ask questions about it:
import os
from openai import OpenAI
# OpenAI-compatible client pointed at the MiniMax endpoint
client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key=os.environ["MINIMAX_API_KEY"]
)
# Collect all source files under ./src into one labeled string
codebase = ""
for root, dirs, files in os.walk("./src"):
    for f in files:
        if f.endswith(('.py', '.ts', '.js')):
            path = os.path.join(root, f)
            # Replace undecodable bytes instead of failing on a stray binary file
            with open(path, encoding="utf-8", errors="replace") as fh:
                codebase += f"\n\n--- {path} ---\n{fh.read()}"
response = client.chat.completions.create(
    model="minimax-m3",
    messages=[
        {"role": "system", "content": "You have the entire codebase in context. Answer questions about architecture, find bugs, suggest improvements."},
        {"role": "user", "content": f"Here is the full codebase:\n{codebase}\n\nFind all places where we handle authentication incorrectly."}
    ]
)
With MSA, a request like this should come back in seconds even at 500K+ tokens of context, where standard-attention models can take minutes.
Use case 2: Long-running agent sessions
Agent sessions accumulate context over time: system prompts, user messages, tool calls, tool results, assistant responses. With standard 128K models, you hit the limit after 30-60 minutes of active work. With M3's 1M context:
# Agent session that runs for hours without context overflow
# Assumes system_prompt, tools, task_queue, execute_tool, and summarize_old_messages
# are defined elsewhere, and that client is the one created in use case 1
messages = [
    {"role": "system", "content": system_prompt},  # ~2K tokens
]
def text_length(message):
    # History mixes plain dicts with SDK message objects, and content can be None
    content = message.get("content") if isinstance(message, dict) else message.content
    return len(content or "")
# This loop can run for hours
for task in task_queue:
    messages.append({"role": "user", "content": task})
    response = client.chat.completions.create(
        model="minimax-m3",
        messages=messages,
        tools=tools
    )
    # Accumulate all history; no summarization needed
    messages.append(response.choices[0].message)
    # Tool results accumulate too
    for tool_call in response.choices[0].message.tool_calls or []:
        result = execute_tool(tool_call)
        messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
    # At 1M tokens, this runs for hours before hitting limits
    # Rough estimate: about 4 characters per token
    total_tokens = sum(text_length(m) // 4 for m in messages)
    if total_tokens > 900_000:
        # Only summarize when approaching the limit
        messages = summarize_old_messages(messages)
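The loop leaves `summarize_old_messages` undefined. One possible shape for it is sketched below; this is an assumption about what such a helper might do, not a prescribed implementation. The `keep_recent` count and the `summarize` hook are hypothetical, and in practice the hook would call the model to condense the old messages rather than emit a placeholder.

```python
from typing import Callable


def _role(message) -> str:
    """Role of a history entry, whether it is a dict or an SDK message object."""
    return message["role"] if isinstance(message, dict) else message.role


def summarize_old_messages(
    messages: list,
    keep_recent: int = 20,
    summarize: Callable[[list], str] = lambda old: f"[{len(old)} earlier messages omitted]",
) -> list:
    """Replace older history with one summary message, keeping the system prompt."""
    system = [m for m in messages if _role(m) == "system"]
    rest = [m for m in messages if _role(m) != "system"]
    if len(rest) <= keep_recent:
        return messages
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    # Tool results must follow the assistant message that requested them,
    # so move any orphaned leading tool results into the summarized part.
    while recent and _role(recent[0]) == "tool":
        old.append(recent.pop(0))
    summary = {"role": "user", "content": f"Summary of earlier work: {summarize(old)}"}
    return system + [summary] + recent
```

Keeping the most recent turns verbatim preserves the detail the agent is most likely to need next, while the summary keeps the earlier decisions reachable at a fraction of the token cost.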
Use case 3: Multi-document reasoning
Process multiple documents simultaneously and reason across them:
# load_all_documents is a helper you supply; it should return all contracts as one string
documents = load_all_documents("./contracts/")  # 20 legal contracts
response = client.chat.completions.create(
    model="minimax-m3",
    messages=[{
        "role": "user",
        "content": f"Here are 20 vendor contracts:\n\n{documents}\n\nIdentify all clauses that conflict with each other across different contracts."
    }]
)
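A minimal sketch of the `load_all_documents` helper follows, assuming the contracts have already been extracted to plain-text files. The `*.txt` pattern and the delimiter format are illustrative choices; PDFs or Word files would need a text-extraction step first.

```python
from pathlib import Path


def load_all_documents(directory: str, pattern: str = "*.txt") -> str:
    """Concatenate every matching file into one labeled string."""
    parts = []
    # Sort so the document numbering is stable between runs
    for i, path in enumerate(sorted(Path(directory).glob(pattern)), start=1):
        text = path.read_text(encoding="utf-8", errors="replace")
        parts.append(f"=== Document {i}: {path.name} ===\n{text.strip()}")
    return "\n\n".join(parts)
```

Numbered, named delimiters give the model something concrete to cite, so an answer can say "Document 7, clause 4.2" rather than quoting text you then have to locate.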
Use case 4: Video understanding
M3's multimodal capabilities combined with long context enable processing long videos:
# The "video_url" content part is shown as in this article; check the MiniMax API
# reference for the exact field names and supported video sources
response = client.chat.completions.create(
    model="minimax-m3",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this 30-minute product demo. List every feature shown, in order, with timestamps."},
            {"type": "video_url", "video_url": {"url": "https://example.com/demo.mp4"}}
        ]
    }]
)
When NOT to use 1M context
Long context is not always better:
- Simple questions: If your task only needs 1K tokens of context, using 1M wastes money (you pay per input token)
- Latency-critical: Even with MSA, 1M context is slower than 10K context. For real-time chat, keep context short.
- Above 512K: Pricing doubles. Only cross the threshold when genuinely needed.
- Repetitive content: If your context is 90% boilerplate, extract the relevant parts instead of dumping everything in.
Comparison: M3 vs other 1M-context models
| Model | Context | Speed at 1M | Input price (1M range) | Architecture |
|---|---|---|---|---|
| MiniMax M3 | 1M | Fast (MSA: 15.6× speedup) | $1.20/M | MSA (sparse, full precision) |
| Claude Opus 4.8 | 1M | Standard | $5.00/M | Standard attention |
| DeepSeek V4-Pro | 1M | Standard | $0.435/M | MLA (compressed KV) |
| Gemini 3.5 Flash | 1M | Fast | $0.15/M | Unknown |
M3's unique advantage: fast long-context inference without precision loss. DeepSeek is cheaper but compresses the KV cache (potential quality loss at extreme lengths). Claude is higher quality but about 4× more expensive. Gemini is cheapest but less capable on coding.
Tips for efficient long-context usage
- Use cache reads: At $0.12/M, cached context is 5× cheaper than fresh input. Keep system prompts and reference material in the cache.
- Stay under 512K when possible: Pricing doubles above this threshold. Most tasks fit comfortably in 512K.
- Front-load important context: Put the most relevant information early in the prompt. Even with 1M context, models attend more strongly to the earliest and most recent tokens than to the middle.
- Chunk large codebases: If you have a 2M-token codebase, load the relevant modules (not everything) to stay within limits.
- Monitor token counts: Use a tokenizer to estimate context size before sending requests. Avoid surprise pricing tier jumps.
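The cache tip is easiest to appreciate with numbers. The sketch below compares the input-side cost of many independent questions asked against the same large reference prefix, with and without cache hits, using the ≤512K tier prices. It assumes the first call pays the fresh input price for the prefix and every later call hits the cache; cache-write fees and cache expiry, if any, are ignored. The function and constant names are illustrative.

```python
INPUT_PER_M = 0.60       # USD per million fresh input tokens (<=512K tier)
CACHE_READ_PER_M = 0.12  # USD per million cached input tokens (<=512K tier)


def session_input_cost(prefix_tokens: int, new_tokens_per_call: int, calls: int, cached: bool) -> float:
    """Input-side USD cost of `calls` requests that share the same prefix.

    Assumes the first call pays the fresh price for the prefix and that later
    calls hit the cache every time; cache-write fees, if any, are ignored.
    """
    fresh = prefix_tokens + new_tokens_per_call * calls
    reused = prefix_tokens * (calls - 1)
    reused_price = CACHE_READ_PER_M if cached else INPUT_PER_M
    return (fresh * INPUT_PER_M + reused * reused_price) / 1_000_000


# 50 questions against a 300K-token reference prefix, 2K new tokens each
without_cache = session_input_cost(300_000, 2_000, 50, cached=False)
with_cache = session_input_cost(300_000, 2_000, 50, cached=True)
print(f"without cache: ${without_cache:.2f}, with cache: ${with_cache:.2f}")
```

The larger the shared prefix relative to the new tokens per call, the closer the overall saving gets to the full 5×.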
FAQ
Is 1M context actually usable or just a marketing number?
Actually usable. MSA makes it fast enough for production workloads. The 15.6× decoding speedup over standard attention means 1M-token responses come back in seconds, not minutes. MiniMax guarantees a minimum of 512K tokens; 1M is available but may have slightly higher latency.
Does quality degrade at long contexts?
All models show some degradation at extreme context lengths (the "lost in the middle" problem). MSA's full-precision approach should minimize this compared to compressed-KV approaches like DeepSeek's MLA. For critical information, place it at the beginning or end of the context.
How does the 512K pricing threshold work?
If your total input (system prompt + messages + context) exceeds 512K tokens, the entire request is billed at the higher tier ($1.20/$4.80 instead of $0.60/$2.40). There is no partial billing: it is all-or-nothing at the threshold.
Can I use 1M context locally?
Depends on hardware. The full 1M context requires significant memory. On consumer hardware (Mac Studio 128-192GB), you will likely be limited to 64-128K tokens locally. The full 1M is practical on multi-GPU server setups or via the API. See our local deployment guide.
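Much of that memory goes to the KV cache, which grows linearly with context length. The standard estimate is 2 (keys and values) × layers × KV heads × head dimension × bytes per value × tokens. The sketch below applies it with placeholder dimensions that are not M3's real architecture, and it ignores model weights and any cache layout specific to MSA.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory for keys and values across all layers for one sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Placeholder dimensions for illustration only; not M3's real architecture
example = dict(layers=60, kv_heads=8, head_dim=128, bytes_per_value=2)

for tokens in (128_000, 1_000_000):
    gib = kv_cache_bytes(tokens, **example) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:.1f} GiB of KV cache")
```

Whatever the true dimensions are, the linear scaling holds: going from 128K to 1M tokens multiplies KV memory by about 8, on top of the memory needed for the weights.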
How does M3's long context compare to Claude's?
Both support 1M tokens. Claude Opus 4.8 has better retrieval accuracy at extreme lengths (based on Anthropic's evaluations). M3 is faster at long contexts (MSA) and cheaper ($1.20/M vs $5.00/M for the 1M tier). For most practical tasks, both work well.