Jul 3, 2026 · 8 min read

GPT-5.6 Sol Ultra Mode: How Subagents Push Terminal-Bench to 91.9%

GPT-5.6 Sol introduces two new features for controlling model behavior: max reasoning effort and ultra mode. Standard Sol scores 88.8% on Terminal-Bench 2.1. Enable ultra mode, and that jumps to 91.9%. A 3.1-percentage-point gain at this level is significant: it cuts the failure rate from 11.2% to 8.1%, roughly 28% fewer failed tasks.

This guide explains what these features are, how they work, when to use them, and how they compare to Claude’s approach to extended reasoning.

Understanding the Two Features

Max Reasoning Effort

All three GPT-5.6 models (Sol, Terra, and Luna) support a reasoning effort parameter. This is a continuous scale that controls how much internal computation the model dedicates to a response.

At low effort, the model responds quickly with less deliberation. At max effort, it takes more time and tokens to think through the problem carefully. Think of it as a cost/quality dial.

This concept is similar to what you might have seen in Claude’s extended thinking, but implemented as a continuous parameter rather than a discrete on/off toggle with a token budget.

Ultra Mode (Sol Only)

Ultra mode is fundamentally different from reasoning effort. It is only available on Sol, and it represents a new architecture for model inference.

When ultra mode is enabled, Sol can spawn subagent processes. These are separate model instances that work on subtasks in parallel. The main model acts as an orchestrator:

It analyzes the incoming request
It decomposes the problem into subtasks
It spawns subagents to work on each subtask independently
It collects and synthesizes results from all subagents
It produces a final response

This is not just “thinking longer.” It is structurally different from a single model reasoning in sequence. Multiple reasoning processes run in parallel, each focused on a specific aspect of the problem.
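The orchestrator loop described above follows a familiar fan-out/fan-in pattern. The sketch below illustrates that pattern in plain Python. It is not Sol's internal implementation, which is not public. `decompose`, `run_subagent`, and `synthesize` are hypothetical stand-ins for what would be model calls in a real system.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(request: str) -> list[str]:
    """Hypothetical decomposition step: split a request into subtasks.

    A real orchestrator would ask a model to plan this; splitting on
    semicolons keeps the example self-contained.
    """
    return [part.strip() for part in request.split(";") if part.strip()]


def run_subagent(subtask: str) -> str:
    """Stand-in for an independent subagent working on one subtask."""
    return f"result for {subtask!r}"


def synthesize(results: list[str]) -> str:
    """Combine subagent outputs into a single final response."""
    return "\n".join(results)


def orchestrate(request: str) -> str:
    subtasks = decompose(request)
    # Fan out: every subtask is handed to its own worker and runs in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        results = list(pool.map(run_subagent, subtasks))
    # Fan in: pool.map returns results in input order, so synthesis is deterministic.
    return synthesize(results)


print(orchestrate("read the module; find call sites; write the patch"))
```

The key property is that no subtask waits for another: the wall-clock time of the fan-out step is set by the slowest subagent, not by the sum of all of them.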

Why Subagents Matter

The jump from 88.8% to 91.9% on Terminal-Bench suggests why parallel decomposition is powerful. Consider a complex coding task that involves:

Understanding existing code structure
Identifying the correct modification points
Writing the implementation
Handling edge cases
Ensuring compatibility with existing tests

A standard model processes these sequentially within a single reasoning chain. Ultra mode can assign each subtask to a dedicated subagent. The subagent focusing on edge cases does not compete for attention with the subagent working on implementation structure.

This mirrors how expert developers actually work on complex problems: they decompose, tackle pieces independently, then integrate. Ultra mode gives Sol this same capability at the inference level.

How to Use Ultra Mode

Based on the available documentation, ultra mode is activated through the API with specific parameters:

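from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
your_prompt = "Describe your task here"

response = client.chat.completions.create(
    model="gpt-5.6-sol",
    messages=[{"role": "user", "content": your_prompt}],
    reasoning_effort="max",
    extra_body={"ultra": True},  # forwarded as-is, even if the SDK has no `ultra` argument
)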

Key considerations:

Ultra mode only works with gpt-5.6-sol. Terra and Luna do not support it.
Setting reasoning_effort="max" is recommended when using ultra mode for best results.
The response may take significantly longer due to subagent coordination.
Token consumption is multiplied because each subagent consumes tokens independently.

Cost Implications

Ultra mode is expensive. Each subagent incurs its own token usage at Sol’s rates ($5/$30 per 1M tokens). A single ultra-mode request that spawns 4 subagents might consume:

Orchestrator: ~10K input, ~5K output
Subagent 1: ~30K input, ~20K output
Subagent 2: ~30K input, ~15K output
Subagent 3: ~25K input, ~18K output
Subagent 4: ~20K input, ~12K output

Total tokens: ~115K input, ~70K output
Total cost: (115K × $5 + 70K × $30) / 1M = $0.575 + $2.10 = $2.675 per request

Compare this to a standard Sol request for the same task: perhaps 50K input, 10K output = $0.55. Ultra mode costs roughly 5x more in this example.
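The arithmetic above is easy to rerun with your own token counts. This is a minimal sketch that assumes the $5/$30 per 1M token Sol rates quoted in this article and the illustrative token counts from the example; it ignores caching discounts and any pricing details not covered here.

```python
# Sol list prices quoted in this article, in USD per 1M tokens.
SOL_INPUT_PER_M = 5.00
SOL_OUTPUT_PER_M = 30.00


def request_cost(calls):
    """Total cost of a list of (input_tokens, output_tokens) pairs at Sol rates."""
    total_in = sum(i for i, _ in calls)
    total_out = sum(o for _, o in calls)
    return (total_in * SOL_INPUT_PER_M + total_out * SOL_OUTPUT_PER_M) / 1_000_000


# Illustrative counts from the example: orchestrator plus 4 subagents.
ultra = [(10_000, 5_000), (30_000, 20_000), (30_000, 15_000),
         (25_000, 18_000), (20_000, 12_000)]
standard = [(50_000, 10_000)]

ultra_cost = request_cost(ultra)
standard_cost = request_cost(standard)
print(f"ultra ${ultra_cost:.3f}, standard ${standard_cost:.2f}, "
      f"ratio {ultra_cost / standard_cost:.1f}x")
```

Note that output tokens dominate: at 6x the input price, the subagents' combined output accounts for about $2.10 of the $2.675 total.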

The question is whether that 5x cost multiplier buys you enough quality improvement to justify it. For a task where getting it right on the first try saves hours of debugging, the answer might be yes. For routine code completion, definitely not.

For tracking ultra-mode costs separately from standard usage, see our guide on monitoring AI API spending.

When to Use Ultra Mode

Ultra mode shines for:

Complex Refactoring

Tasks that require understanding multiple files, their relationships, and coordinating changes across all of them. Subagents can each focus on a different file or module while the orchestrator ensures consistency.

Architecture Decisions

When you need the model to evaluate multiple design approaches, subagents can explore different options in parallel and the orchestrator can compare them.

Bug Hunting

For debugging complex issues, subagents can investigate different hypotheses simultaneously: one checking for race conditions, another for state management issues, another for API misuse.

Multi-Step Workflows

Tasks that have clear stages (research, design, implement, verify) benefit from having dedicated subagents for each stage rather than asking a single reasoning chain to do everything sequentially.

When NOT to Use Ultra Mode

Do not use ultra mode for:

Simple code completion or generation
Single-file edits
Chat responses
Tasks where latency matters more than quality
High-volume workloads (use Luna instead)
Anything where standard Sol’s 88.8% is already good enough

The 3.1-percentage-point Terminal-Bench improvement is real but narrow. Most practical coding tasks fall well within the capability range where standard Sol already succeeds. Ultra mode is for the hard tail of tasks where standard reasoning fails.

Comparison with Claude’s Approach

Claude models handle extended reasoning differently:

Claude Extended Thinking

Claude Opus 4.8 and Sonnet 5 use extended thinking, where you allocate a thinking token budget and the model uses that budget for internal deliberation before responding.

Single reasoning process (no subagents)
You control the budget explicitly
The model decides how to allocate thinking within the budget
More predictable cost (you set the ceiling)

GPT-5.6 Sol Ultra

Multiple reasoning processes (subagents)
Less predictable token consumption
Parallel decomposition of problems
Higher cost ceiling but potentially better results on complex tasks

The fundamental difference is sequential vs parallel. Claude thinks longer within a single chain. Sol spawns multiple chains. For problems that benefit from decomposition, Sol’s approach has structural advantages. For problems that require deep sequential reasoning (following a long chain of logic), Claude’s approach may work better.

Benchmark Comparison

Model	Configuration	Terminal-Bench 2.1
GPT-5.6 Sol Ultra	Ultra + max reasoning	91.9%
GPT-5.6 Sol	Max reasoning, no ultra	88.8%
Claude Opus 4.8	Extended thinking	78.9%

The gap between Sol Ultra (91.9%) and Opus 4.8 (78.9%) is 13 percentage points. Even standard Sol without ultra beats Opus 4.8 by nearly 10 points (9.9). This is a substantial capability difference on this coding-focused benchmark.

However, Terminal-Bench is one benchmark. Real-world performance varies by task type, and Claude models have strengths in instruction following, safety, and long-form reasoning that may not be captured here.

Practical Architecture for Ultra Mode

If you have GPT-5.6 access and want to integrate ultra mode effectively:

Tiered Routing

def route_request(task_complexity):
    """Map a complexity label to a GPT-5.6 model name."""
    if task_complexity == "simple":
        return "gpt-5.6-luna"  # Fast, cheap
    if task_complexity in ("moderate", "complex"):
        return "gpt-5.6-sol"  # Complex tasks also get ultra mode below
    raise ValueError(f"unknown complexity: {task_complexity!r}")


def build_params(complexity):
    """Return request parameters, enabling ultra only for complex tasks."""
    params = {"model": route_request(complexity)}
    if complexity == "complex":
        params["ultra"] = True
        params["reasoning_effort"] = "max"
    return params

Complexity Classification

You need a way to determine which requests warrant ultra mode. Options:

Heuristic rules: Long prompts, multi-file contexts, explicit “refactor” or “debug” keywords
Luna as classifier: Use a quick Luna call to assess complexity before routing to Sol Ultra
User-driven: Let users opt into ultra mode for specific requests
Retry escalation: Start with standard Sol, escalate to ultra if the result fails validation
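Two of these options, heuristic rules and retry escalation, can be sketched without any model calls. The keywords and thresholds below are assumptions to tune for your workload, and `call` is a hypothetical wrapper you would write around your own API client (taking request params and a prompt, returning text). The labels match the ones used in the routing function above.

```python
from typing import Callable

# Illustrative heuristics; keywords and thresholds are assumptions to tune.
COMPLEX_KEYWORDS = ("refactor", "debug")
LONG_PROMPT_CHARS = 8_000
SHORT_PROMPT_CHARS = 200


def classify_complexity(prompt: str, num_files: int = 1) -> str:
    """Label a request 'simple', 'moderate', or 'complex' using heuristic rules."""
    text = prompt.lower()
    if num_files > 1 or len(prompt) > LONG_PROMPT_CHARS:
        return "complex"  # multi-file context or very long prompt
    if any(keyword in text for keyword in COMPLEX_KEYWORDS):
        return "complex"  # explicit refactor/debug request
    if len(prompt) < SHORT_PROMPT_CHARS:
        return "simple"
    return "moderate"


def run_with_escalation(
    prompt: str,
    call: Callable[[dict, str], str],
    validate: Callable[[str], bool],
) -> tuple[str, bool]:
    """Try standard Sol first; retry once with ultra mode if validation fails.

    Returns (output, escalated).
    """
    standard = {"model": "gpt-5.6-sol"}
    output = call(standard, prompt)
    if validate(output):
        return output, False
    ultra = {**standard, "ultra": True, "reasoning_effort": "max"}
    return call(ultra, prompt), True
```

Escalation pays the ultra premium only on failures, but it adds a full standard Sol request to the latency of every escalated task, and it is only as good as your `validate` check (for example, running the test suite against the proposed patch).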

Cost Guards

Set hard limits on ultra-mode usage:

Maximum ultra-mode requests per hour
Maximum total ultra-mode spend per day
Automatic fallback to standard Sol if budget is exhausted
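A minimal in-process sketch of these three limits follows. The class name, method names, and budget values are hypothetical, and it assumes you can compute each ultra request's actual cost after it completes (for example, from the response's usage data at Sol's rates).

```python
import time
from collections import deque


class UltraCostGuard:
    """Hypothetical guard for ultra-mode usage.

    Caps ultra requests in a rolling hour and ultra spend in a rolling
    24 hours, falling back to standard Sol once either budget is used up.
    """

    def __init__(self, max_per_hour, max_daily_usd, clock=time.monotonic):
        self.max_per_hour = max_per_hour
        self.max_daily_usd = max_daily_usd
        self.clock = clock
        self._requests = deque()  # timestamps of ultra requests
        self._spend = deque()     # (timestamp, usd) for completed ultra requests

    def _prune(self, now):
        # Drop entries that have left their rolling windows.
        while self._requests and now - self._requests[0] >= 3600:
            self._requests.popleft()
        while self._spend and now - self._spend[0][0] >= 86400:
            self._spend.popleft()

    def choose_params(self, base):
        """Return ultra params if both budgets allow it, else a copy of base."""
        now = self.clock()
        self._prune(now)
        spent = sum(usd for _, usd in self._spend)
        if len(self._requests) < self.max_per_hour and spent < self.max_daily_usd:
            self._requests.append(now)
            return {**base, "ultra": True, "reasoning_effort": "max"}
        return dict(base)  # automatic fallback to standard Sol

    def record_spend(self, usd):
        """Record the actual cost of a completed ultra request."""
        self._spend.append((self.clock(), usd))
```

Because the spend check happens before a request and the cost is only known afterward, a single request can push spend past the daily cap. The state also lives in one process; a multi-worker deployment would need the counters in shared storage.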

The Cerebras Factor

Cerebras is bringing 750 tokens-per-second hosting for Sol in July 2026. This has interesting implications for ultra mode:

Subagent processes run in parallel, so faster inference per subagent means faster total ultra-mode completion
At 750 tok/s, a subagent generating 20K tokens completes in ~27 seconds
4 parallel subagents still complete in ~27 seconds (parallel, not sequential)
Total ultra-mode latency could be 30 to 60 seconds instead of minutes
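The latency arithmetic above can be checked directly. This sketch reuses the illustrative subagent output counts from the cost example and the 750 tok/s figure, and it counts generation time only: input processing, orchestration, and final synthesis are ignored, and it assumes the subagents really do run concurrently.

```python
THROUGHPUT_TOK_S = 750  # Cerebras figure quoted above

# Output tokens per subagent, reusing the illustrative cost example.
subagent_output_tokens = [20_000, 15_000, 18_000, 12_000]

per_subagent_s = [n / THROUGHPUT_TOK_S for n in subagent_output_tokens]
parallel_s = max(per_subagent_s)    # wall clock is bounded by the slowest subagent
sequential_s = sum(per_subagent_s)  # the same work done one subagent after another

print(f"parallel ~{parallel_s:.0f}s, sequential ~{sequential_s:.0f}s")
```

In parallel the slowest subagent (20K tokens) sets the pace at about 27 seconds; run back to back, the same 65K tokens would take about 87 seconds.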

Faster inference makes ultra mode more practical for interactive use cases. Check our AI API providers guide for updates on Cerebras availability.

Integrating with Existing Tools

Ultra mode works within the standard chat completions API, so existing AI coding tools should support it once they add the ultra parameter. Key considerations:

Tools that manage context windows need to account for subagent token consumption
Streaming responses may behave differently (subagents process before the final response streams)
Timeout settings need to be longer for ultra-mode requests
Cost tracking needs to capture subagent usage separately

Make sure your API keys are secured with appropriate rate limiting, especially for ultra mode, where a single request made with a compromised key can consume significant resources.

FAQ

Does ultra mode always spawn subagents?

No. The model decides whether to use subagents based on the complexity of the request. For simpler requests, it may process normally even with ultra mode enabled. The 91.9% score is aggregated across all Terminal-Bench tasks, some of which may not have triggered subagent spawning.

Can I control how many subagents are spawned?

Based on available documentation, you cannot directly control the number of subagents. The orchestrator decides the decomposition strategy. You may be able to influence this through prompt structure (explicitly breaking your request into numbered subtasks), but this is not guaranteed.
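If you want to try the prompt-structure approach, a small helper keeps the numbered format consistent across requests. The function name and wording are hypothetical, and as noted above there is no guarantee the orchestrator will follow the decomposition you suggest.

```python
def numbered_subtask_prompt(goal: str, subtasks: list[str]) -> str:
    """Build a prompt that spells out a suggested decomposition as numbered subtasks."""
    lines = [goal, "", "Work through these subtasks:"]
    lines += [f"{i}. {task}" for i, task in enumerate(subtasks, start=1)]
    lines += ["", "Then combine the results into one final answer."]
    return "\n".join(lines)


prompt = numbered_subtask_prompt(
    "Find the cause of the intermittent checkout failure.",
    ["Check for race conditions", "Check state management", "Check for API misuse"],
)
print(prompt)
```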

Is ultra mode available on Terra or Luna?

No. Ultra mode is exclusive to Sol. Terra and Luna support the reasoning effort parameter (low to max) but not subagent spawning. This makes sense since ultra mode’s cost implications would undermine Luna’s value proposition as the budget option.

How does ultra mode affect latency?

Significantly. Subagent spawning, parallel processing, and result synthesis all add latency. Expect ultra-mode requests to take 3 to 10x longer than standard Sol requests depending on the complexity of the decomposition. This is not suitable for real-time or low-latency applications.

Can I combine ultra mode with the cache system?

Yes. The orchestrator's input (your prompt) benefits from caching normally. Subagent inputs are generated internally and are not directly cacheable across requests. However, if subagents reference your cached system prompt, that portion benefits from cache reads. In short, caching reduces the cost of the orchestrator's input and of any cached prefix a subagent reuses, but not of the internally generated subagent contexts.
