GPT-5.6 Sol Ultra Mode: How Subagents Push Terminal-Bench to 91.9%
GPT-5.6 Sol introduces two new features for controlling model behavior: max reasoning effort and ultra mode. Standard Sol scores 88.8% on Terminal-Bench 2.1. Enable ultra mode, and that jumps to 91.9%. At this level of the benchmark, a 3.1 percentage point gain is significant: the failure rate drops from 11.2% to 8.1%, which is more than a quarter fewer failed tasks.
This guide explains what these features are, how they work, when to use them, and how they compare to Claude’s approach to extended reasoning.
Understanding the Two Features
Max Reasoning Effort
All three GPT-5.6 models (Sol, Terra, and Luna) support a reasoning effort parameter. This is a continuous scale that controls how much internal computation the model dedicates to a response.
At low effort, the model responds quickly with less deliberation. At max effort, it takes more time and tokens to think through the problem carefully. Think of it as a cost/quality dial.
This concept is similar to what you might have seen in Claude’s extended thinking, but implemented as a continuous parameter rather than a discrete on/off toggle with a token budget.
Ultra Mode (Sol Only)
Ultra mode is fundamentally different from reasoning effort. It is only available on Sol, and it represents a new architecture for model inference.
When ultra mode is enabled, Sol can spawn subagent processes. These are separate model instances that work on subtasks in parallel. The main model acts as an orchestrator:
- It analyzes the incoming request
- It decomposes the problem into subtasks
- It spawns subagents to work on each subtask independently
- It collects and synthesizes results from all subagents
- It produces a final response
This is not just “thinking longer.” It is structurally different from a single model reasoning in sequence. Multiple reasoning processes run in parallel, each focused on a specific aspect of the problem.
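The orchestration steps above can be pictured with ordinary client-side concurrency. The sketch below is an analogy only: the source does not describe how Sol implements subagents internally, and `decompose`, `run_subagent`, and `synthesize` are hypothetical stand-ins that use plain string operations rather than model calls.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(request: str) -> list[str]:
    """Hypothetical decomposition: one subtask per semicolon-separated clause."""
    return [part.strip() for part in request.split(";") if part.strip()]


def run_subagent(subtask: str) -> str:
    """Hypothetical subagent: a real one would be a separate model instance."""
    return f"result for '{subtask}'"


def synthesize(results: list[str]) -> str:
    """Hypothetical synthesis step: join the subagent outputs in order."""
    return "\n".join(results)


def orchestrate(request: str) -> str:
    subtasks = decompose(request)
    # Subtasks run concurrently; map() still returns results in input order.
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        results = list(pool.map(run_subagent, subtasks))
    return synthesize(results)


if __name__ == "__main__":
    print(orchestrate("read the code; write the patch; check the tests"))
```

The point of the analogy is the shape of the control flow: the subtasks do not share a reasoning chain, and only the final synthesis step sees all of their outputs.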
Why Subagents Matter
The jump from 88.8% to 91.9% on Terminal-Bench demonstrates why parallel decomposition is powerful. Consider a complex coding task that involves:
- Understanding existing code structure
- Identifying the correct modification points
- Writing the implementation
- Handling edge cases
- Ensuring compatibility with existing tests
A standard model processes these sequentially within a single reasoning chain. Ultra mode can assign each subtask to a dedicated subagent. The subagent focusing on edge cases does not compete for attention with the subagent working on implementation structure.
This mirrors how expert developers actually work on complex problems: they decompose, tackle pieces independently, then integrate. Ultra mode gives Sol this same capability at the inference level.
How to Use Ultra Mode
Based on the available documentation, ultra mode is activated through the API with specific parameters:
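    # Assumes `client` and `your_prompt` are already defined.
    response = client.chat.completions.create(
        model="gpt-5.6-sol",
        messages=[{"role": "user", "content": your_prompt}],
        reasoning_effort="max",  # recommended alongside ultra mode
        ultra=True,  # Sol only
    )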
Key considerations:
- Ultra mode only works with gpt-5.6-sol. Terra and Luna do not support it.
- Setting reasoning_effort="max" is recommended when using ultra mode for best results.
- The response may take significantly longer due to subagent coordination.
- Token consumption is multiplied because each subagent consumes tokens independently.
Cost Implications
Ultra mode is expensive. Each subagent incurs its own token usage at Sol’s rates ($5 per 1M input tokens, $30 per 1M output tokens). A single ultra-mode request that spawns 4 subagents might consume:
- Orchestrator: ~10K input, ~5K output
- Subagent 1: ~30K input, ~20K output
- Subagent 2: ~30K input, ~15K output
- Subagent 3: ~25K input, ~18K output
- Subagent 4: ~20K input, ~12K output
Total tokens: ~115K input, ~70K output
Total cost: (115K × $5 + 70K × $30) / 1M = $0.575 + $2.10 = $2.675 per request
Compare this to a standard Sol request for the same task: perhaps 50K input, 10K output = $0.55. Ultra mode costs roughly 5x more in this example.
The question is whether that 5x cost multiplier buys you enough quality improvement to justify it. For a task where getting it right on the first try saves hours of debugging, the answer might be yes. For routine code completion, definitely not.
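The arithmetic above is easy to reproduce and adapt to your own token counts. This is a minimal sketch using the rates and the illustrative usage figures from this section; the function and variable names are made up for the example.

```python
# Sol rates from this article, in dollars per 1M tokens.
INPUT_RATE = 5.00
OUTPUT_RATE = 30.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request (or one subagent) at the rates above."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000


# (input_tokens, output_tokens) for the orchestrator and the four subagents.
ultra_usage = [
    (10_000, 5_000),
    (30_000, 20_000),
    (30_000, 15_000),
    (25_000, 18_000),
    (20_000, 12_000),
]

ultra_cost = sum(request_cost(i, o) for i, o in ultra_usage)
standard_cost = request_cost(50_000, 10_000)

if __name__ == "__main__":
    print(f"ultra:    ${ultra_cost:.3f}")
    print(f"standard: ${standard_cost:.3f}")
    print(f"ratio:    {ultra_cost / standard_cost:.1f}x")
```

With these inputs the ratio comes out at about 4.9x, which is the "roughly 5x" figure quoted above.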
For tracking ultra-mode costs separately from standard usage, see our guide on monitoring AI API spending.
When to Use Ultra Mode
Ultra mode shines for:
Complex Refactoring
Tasks that require understanding multiple files, their relationships, and coordinating changes across all of them. Subagents can each focus on a different file or module while the orchestrator ensures consistency.
Architecture Decisions
When you need the model to evaluate multiple design approaches, subagents can explore different options in parallel and the orchestrator can compare them.
Bug Hunting
For debugging complex issues, subagents can investigate different hypotheses simultaneously: one checking for race conditions, another for state management issues, another for API misuse.
Multi-Step Workflows
Tasks that have clear stages (research, design, implement, verify) benefit from having dedicated subagents for each stage rather than asking a single reasoning chain to do everything sequentially.
When NOT to Use Ultra Mode
Do not use ultra mode for:
- Simple code completion or generation
- Single-file edits
- Chat responses
- Tasks where latency matters more than quality
- High-volume workloads (use Luna instead)
- Anything where standard Sol’s 88.8% is already good enough
The 3.1 percentage point Terminal-Bench improvement is real but narrow. Most practical coding tasks fall well within the capability range where standard Sol already succeeds. Ultra mode is for the hard tail of tasks where standard reasoning fails.
Comparison with Claude’s Approach
Claude models handle extended reasoning differently:
Claude Extended Thinking
Claude Opus 4.8 and Sonnet 5 use extended thinking, where you allocate a thinking token budget and the model uses that budget for internal deliberation before responding.
- Single reasoning process (no subagents)
- You control the budget explicitly
- The model decides how to allocate thinking within the budget
- More predictable cost (you set the ceiling)
GPT-5.6 Sol Ultra
- Multiple reasoning processes (subagents)
- Less predictable token consumption
- Parallel decomposition of problems
- Higher cost ceiling but potentially better results on complex tasks
The fundamental difference is sequential vs parallel. Claude thinks longer within a single chain. Sol spawns multiple chains. For problems that benefit from decomposition, Sol’s approach has structural advantages. For problems that require deep sequential reasoning (following a long chain of logic), Claude’s approach may work better.
Benchmark Comparison
| Model | Configuration | Terminal-Bench 2.1 |
|---|---|---|
| GPT-5.6 Sol Ultra | Ultra + max reasoning | 91.9% |
| GPT-5.6 Sol | Max reasoning, no ultra | 88.8% |
| Claude Opus 4.8 | Extended thinking | 78.9% |
The gap between Sol Ultra (91.9%) and Opus 4.8 (78.9%) is 13 percentage points. Even standard Sol without ultra beats Opus 4.8 by nearly 10 points. This is a substantial capability difference on coding-focused benchmarks.
However, Terminal-Bench is one benchmark. Real-world performance varies by task type, and Claude models have strengths in instruction following, safety, and long-form reasoning that may not be captured here.
Practical Architecture for Ultra Mode
If you have GPT-5.6 access and want to integrate ultra mode effectively:
Tiered Routing
    def route_request(task_complexity):
        """Pick a model for the given complexity label."""
        if task_complexity == "simple":
            return "gpt-5.6-luna"  # Fast, cheap
        if task_complexity == "moderate":
            return "gpt-5.6-sol"  # Standard reasoning
        if task_complexity == "complex":
            return "gpt-5.6-sol"  # Same model; ultra is enabled in build_params
        raise ValueError(f"unknown complexity: {task_complexity!r}")


    def build_params(complexity):
        """Build request parameters, setting ultra only for complex tasks."""
        params = {"model": route_request(complexity)}
        if complexity == "complex":
            params["ultra"] = True
            params["reasoning_effort"] = "max"
        return params
Complexity Classification
You need a way to determine which requests warrant ultra mode. Options:
- Heuristic rules: Long prompts, multi-file contexts, explicit “refactor” or “debug” keywords
- Luna as classifier: Use a quick Luna call to assess complexity before routing to Sol Ultra
- User-driven: Let users opt into ultra mode for specific requests
- Retry escalation: Start with standard Sol, escalate to ultra if the result fails validation
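Two of the options above, heuristic rules and retry escalation, can be sketched in a few lines. Everything here is an assumption for illustration: the 8,000-character threshold, the two-signal rule, and the `call_model` and `is_valid` callables, which stand in for your own API wrapper and validation step.

```python
from typing import Callable

KEYWORDS = ("refactor", "debug")  # keywords named in the heuristic above


def classify(prompt: str, file_count: int = 1, long_prompt_chars: int = 8_000) -> str:
    """Hypothetical heuristic: returns "simple", "moderate", or "complex"."""
    text = prompt.lower()
    signals = sum([
        len(prompt) >= long_prompt_chars,        # long prompt
        file_count > 1,                          # multi-file context
        any(word in text for word in KEYWORDS),  # explicit keyword
    ])
    if signals >= 2:
        return "complex"
    return "moderate" if signals == 1 else "simple"


def run_with_escalation(
    prompt: str,
    call_model: Callable[[str, bool], str],
    is_valid: Callable[[str], bool],
) -> tuple[str, bool]:
    """Try standard Sol first; retry with ultra only if validation fails.

    `call_model(prompt, ultra)` and `is_valid(result)` are supplied by the
    caller. Returns (result, used_ultra).
    """
    result = call_model(prompt, False)
    if is_valid(result):
        return result, False
    return call_model(prompt, True), True
```

Retry escalation pays for the standard attempt even when ultra ends up being needed, so it suits workloads where most requests pass validation on the first try.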
Cost Guards
Set hard limits on ultra-mode usage:
- Maximum ultra-mode requests per hour
- Maximum total ultra-mode spend per day
- Automatic fallback to standard Sol if budget is exhausted
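The three limits above can be enforced with a small in-process guard. This is a sketch under stated assumptions: `UltraBudget` and `choose_params` are hypothetical names, the windows are rolling (last hour, last 24 hours) rather than calendar-based, and a multi-process deployment would need shared storage instead of in-memory deques.

```python
import time
from collections import deque


class UltraBudget:
    """Hypothetical guard: caps ultra requests per hour and spend per day."""

    def __init__(self, max_per_hour: int, max_daily_spend: float, clock=time.time):
        self.max_per_hour = max_per_hour
        self.max_daily_spend = max_daily_spend
        self.clock = clock
        self.request_times = deque()  # timestamps of recent ultra requests
        self.spend = deque()          # (timestamp, dollars) pairs

    def _prune(self, now: float) -> None:
        # Drop entries that have aged out of the rolling windows.
        while self.request_times and now - self.request_times[0] >= 3600:
            self.request_times.popleft()
        while self.spend and now - self.spend[0][0] >= 86_400:
            self.spend.popleft()

    def allow_ultra(self) -> bool:
        now = self.clock()
        self._prune(now)
        spent = sum(dollars for _, dollars in self.spend)
        return len(self.request_times) < self.max_per_hour and spent < self.max_daily_spend

    def record(self, dollars: float) -> None:
        """Call after each ultra request with its measured cost."""
        now = self.clock()
        self.request_times.append(now)
        self.spend.append((now, dollars))


def choose_params(budget: UltraBudget, wants_ultra: bool) -> dict:
    """Fall back to standard Sol when the ultra budget is exhausted."""
    params = {"model": "gpt-5.6-sol"}
    if wants_ultra and budget.allow_ultra():
        params["ultra"] = True
        params["reasoning_effort"] = "max"
    return params
```

The injectable `clock` argument exists so the guard can be tested without waiting for real time to pass.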
The Cerebras Factor
Cerebras is bringing 750 tokens-per-second hosting for Sol in July 2026. This has interesting implications for ultra mode:
- Subagent processes run in parallel, so faster inference per subagent means faster total ultra-mode completion
- At 750 tok/s, a subagent generating 20K tokens completes in ~27 seconds
- 4 parallel subagents still complete in ~27 seconds (parallel, not sequential)
- Total ultra-mode latency could be 30 to 60 seconds instead of minutes
Faster inference makes ultra mode more practical for interactive use cases. Check our AI API providers guide for updates on Cerebras availability.
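The latency estimate above is simple arithmetic, sketched here so you can plug in other token counts. It assumes each subagent decodes at the full 750 tok/s and ignores orchestration, queueing, and synthesis time, none of which the source quantifies.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` at a fixed decode rate, ignoring overhead."""
    return tokens / tokens_per_second


def parallel_seconds(subagent_tokens: list[int], tokens_per_second: float) -> float:
    """Parallel subagents finish when the slowest one does."""
    return max(generation_seconds(t, tokens_per_second) for t in subagent_tokens)


def sequential_seconds(subagent_tokens: list[int], tokens_per_second: float) -> float:
    """The same work done one subagent after another."""
    return sum(generation_seconds(t, tokens_per_second) for t in subagent_tokens)


if __name__ == "__main__":
    outputs = [20_000] * 4  # four subagents, 20K output tokens each
    print(round(parallel_seconds(outputs, 750), 1))    # 26.7
    print(round(sequential_seconds(outputs, 750), 1))  # 106.7
```

The gap between the two numbers is the benefit of parallelism; the remaining seconds in the 30 to 60 second estimate would come from the orchestrator's own planning and synthesis.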
Integrating with Existing Tools
Ultra mode works within the standard chat completions API, so existing AI coding tools should support it once they add the ultra parameter. Key considerations:
- Tools that manage context windows need to account for subagent token consumption
- Streaming responses may behave differently (subagents process before the final response streams)
- Timeout settings need to be longer for ultra-mode requests
- Cost tracking needs to capture subagent usage separately
Make sure your API keys are secured with appropriate rate limiting, especially for ultra mode, where even a single request made with a compromised key can consume significant resources.
FAQ
Does ultra mode always spawn subagents?
No. The model decides whether to use subagents based on the complexity of the request. For simpler requests, it may process normally even with ultra mode enabled. The 91.9% benchmark score is an aggregate across all Terminal-Bench tasks, some of which may not have triggered subagent spawning.
Can I control how many subagents are spawned?
Based on available documentation, you cannot directly control the number of subagents. The orchestrator decides the decomposition strategy. You may be able to influence this through prompt structure (explicitly breaking your request into numbered subtasks), but this is not guaranteed.
Is ultra mode available on Terra or Luna?
No. Ultra mode is exclusive to Sol. Terra and Luna support the reasoning effort parameter (low to max) but not subagent spawning. This makes sense since ultra mode’s cost implications would undermine Luna’s value proposition as the budget option.
How does ultra mode affect latency?
Significantly. Subagent spawning, parallel processing, and result synthesis all add latency. Expect ultra-mode requests to take 3 to 10x longer than standard Sol requests depending on the complexity of the decomposition. This is not suitable for real-time or low-latency applications.
Can I combine ultra mode with the cache system?
Yes. The orchestrator’s input (your prompt) benefits from caching normally. Subagent inputs are generated internally and are not directly cacheable across requests. However, if subagents reference your cached system prompt, that portion benefits from cache reads. The cost savings from caching apply to the orchestrator’s input but not to internally generated subagent contexts.