
Qwen 3.7 for Autonomous Agents: 35-Hour Sessions and 1M Context


Qwen 3.7 Max ran autonomously for 35 hours straight, making 1,158 tool calls in a single session without degradation. That is not a theoretical maximum. It is a validated benchmark result that demonstrates what this model was built for: sustained, autonomous operation at a scale no other model has publicly matched.

Most models start losing coherence after a few hundred tool calls. Context windows fill up, attention drifts, and the model starts repeating itself or making contradictory decisions. Qwen 3.7 Max was specifically trained to avoid these failure modes across ultra-long sessions.

This guide covers how to build agents that take advantage of these capabilities: architecture patterns, context management strategies, MCP integration, tool calling best practices, cost management, and how Qwen 3.7 compares to other models for agent workloads.

For general model information, see our Qwen 3.7 complete guide. For API setup, check How to use Qwen 3.7 API.

The 35-hour benchmark explained

The headline number needs context. Here is what the 35-hour benchmark actually tested:

  • Duration: 35 hours of continuous operation
  • Tool calls: 1,158 total (average: one every ~1.8 minutes)
  • Task type: Complex, multi-step software engineering tasks
  • Context management: Single session, no context resets
  • Quality metric: No measurable degradation in output quality between hour 1 and hour 35

This is not the model sitting idle for 35 hours. It was actively reasoning, calling tools, processing results, and making decisions throughout the entire session. The 1,158 tool calls included file reads, file writes, command executions, web requests, and multi-step reasoning chains.

For comparison, most agentic coding sessions with other frontier models last 30-60 minutes before context management becomes an issue. Qwen 3.7 Max extends that by roughly 35x to 70x.
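
The headline figures can be checked with quick arithmetic. This is a minimal sketch that uses only the numbers quoted in this section; the variable names are illustrative.

```python
# Numbers quoted in the benchmark description above
session_hours = 35
tool_calls = 1158

# Average gap between tool calls, in minutes
minutes_per_call = session_hours * 60 / tool_calls

# How much longer the session is than the 30-60 minute range cited for other models
ratio_vs_60_min = session_hours * 60 / 60
ratio_vs_30_min = session_hours * 60 / 30

print(f"{minutes_per_call:.1f} minutes per tool call")
print(f"{ratio_vs_60_min:.0f}x to {ratio_vs_30_min:.0f}x longer")
```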

Architecture for long-running agents

Building agents that can actually use 35 hours of continuous operation requires careful architecture. The model can handle it, but your infrastructure needs to keep up.

Basic agent loop

import anthropic
import time

client = anthropic.Anthropic(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-v1-your-key",
)

def run_agent(task: str, max_iterations: int = 2000):
    messages = [{"role": "user", "content": task}]
    tools = define_tools()  # Your tool definitions
    iteration = 0

    while iteration < max_iterations:
        response = client.messages.create(
            model="qwen/qwen-3.7-max",
            max_tokens=8192,
            system="You are an autonomous agent. Complete the task using available tools. Do not stop until the task is fully complete.",
            messages=messages,
            tools=tools,
        )

        # Append assistant response
        messages.append({"role": "assistant", "content": response.content})

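        # Stop unless the model asked for tools. This also covers "max_tokens"
        # and other stop reasons, which would otherwise send an empty user turn.
        if response.stop_reason != "tool_use":
            break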

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        messages.append({"role": "user", "content": tool_results})
        iteration += 1

    return messages
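
The loop above calls `define_tools()` and `execute_tool()` without showing them. One possible shape for those helpers is sketched below, assuming a simple in-process registry. The two tools, their schemas, and the `TOOLS` name are illustrative choices, not part of any Qwen or provider API.

```python
import subprocess
from pathlib import Path


def read_file(path: str) -> str:
    """Return the text content of a file."""
    return Path(path).read_text()


def run_command(command: str) -> str:
    """Run a shell command with a timeout and return its combined output."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=120
    )
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"


# Registry: tool name -> (handler, description, JSON Schema for the input)
TOOLS = {
    "read_file": (read_file, "Read a text file.", {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    }),
    "run_command": (run_command, "Run a shell command.", {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    }),
}


def define_tools() -> list[dict]:
    """Tool definitions in the name/description/input_schema shape used above."""
    return [
        {"name": name, "description": desc, "input_schema": schema}
        for name, (_, desc, schema) in TOOLS.items()
    ]


def execute_tool(name: str, tool_input: dict) -> str:
    """Dispatch a tool call; errors come back as text so the model can react."""
    if name not in TOOLS:
        return f"Error: unknown tool {name!r}"
    handler = TOOLS[name][0]
    try:
        return handler(**tool_input)
    except Exception as exc:
        return f"Error: {type(exc).__name__}: {exc}"
```

Returning errors as strings rather than raising keeps the agent loop alive and gives the model something to reason about on the next turn.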

Resilient agent with checkpointing

For sessions that may last hours, add checkpointing to survive crashes:

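import json
import os
import time

class ResilientAgent:
    def __init__(self, checkpoint_path: str = ".agent_checkpoint.json"):
        self.checkpoint_path = checkpoint_path
        self.messages = []
        self.iteration = 0
        self._load_checkpoint()

    def _load_checkpoint(self):
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path, "r") as f:
                data = json.load(f)
            self.messages = data["messages"]
            self.iteration = data["iteration"]

    def _save_checkpoint(self):
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a truncated checkpoint behind.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({
                "messages": self.messages,
                "iteration": self.iteration,
            }, f)
        os.replace(tmp_path, self.checkpoint_path)

    def _call_model(self):
        # Reuses client and define_tools() from the basic agent loop
        return client.messages.create(
            model="qwen/qwen-3.7-max",
            max_tokens=8192,
            messages=self.messages,
            tools=define_tools(),
        )

    def _process_response(self, response):
        # Build both messages first, then append together, so a failing tool
        # never leaves a tool_use in history without its tool_result.
        assistant_msg = {
            "role": "assistant",
            # Plain dicts keep the history JSON-serializable for checkpoints
            "content": [b.model_dump(exclude_none=True) for b in response.content],
        }
        tool_results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": execute_tool(b.name, b.input)}
            for b in response.content if b.type == "tool_use"
        ]
        self.messages.append(assistant_msg)
        if tool_results:
            self.messages.append({"role": "user", "content": tool_results})

    def run(self, task: str, max_iterations: int = 2000, max_retries: int = 5):
        if not self.messages:
            self.messages = [{"role": "user", "content": task}]

        failures = 0
        while self.iteration < max_iterations:
            try:
                response = self._call_model()
                self._process_response(response)
                self.iteration += 1
                self._save_checkpoint()
                failures = 0
            except Exception as e:
                failures += 1
                print(f"Error at iteration {self.iteration}: {e}")
                if failures >= max_retries:
                    raise  # Give up instead of retrying forever
                time.sleep(30)  # Back off and retry
                continue

            if response.stop_reason != "tool_use":
                break

        return self.messages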
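
A fixed 30-second sleep is the simplest retry policy. For sessions that run for hours, exponential backoff with jitter is a common alternative, because it retries quickly after a brief blip and slows down during a longer outage. The sketch below is one way to compute the delay. The function name and default values are assumptions for illustration, not tuned recommendations.

```python
import random


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`. The actual delay is
    drawn uniformly from [0, ceiling] ("full jitter") so that several agents
    restarting together do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

In a retry loop this would replace the constant, for example `time.sleep(backoff_delay(failures))`, where `failures` counts consecutive errors.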

1M context strategies

A 1M token context window is enormous, but it is not infinite. A 35-hour session with 1,158 tool calls can easily generate millions of tokens of conversation history. You need strategies to manage this.
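
Before picking a strategy, the agent needs a way to notice that the history is getting large. The sketch below uses a crude heuristic of about 4 characters per token, which is an assumption and not the model's real tokenizer. `estimate_tokens`, `needs_compression`, and the 50% threshold are illustrative names and values.

```python
import json

CONTEXT_LIMIT = 1_000_000  # Window size quoted in this guide


def estimate_tokens(message: dict) -> int:
    """Very rough token estimate: about 4 characters per token (heuristic)."""
    content = message["content"]
    # Structured content (tool calls, tool results) is measured by its JSON form
    text = content if isinstance(content, str) else json.dumps(content, default=str)
    return len(text) // 4 + 1


def needs_compression(messages: list[dict], threshold: float = 0.5) -> bool:
    """True once the estimated history size reaches a fraction of the window."""
    total = sum(estimate_tokens(m) for m in messages)
    return total >= CONTEXT_LIMIT * threshold
```

Checking this once per iteration is cheap, and triggering well below the limit leaves room for the next few tool results.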

Strategy 1: Summarize and compress

Periodically summarize older parts of the conversation to free up context space:

def compress_context(messages, keep_recent=50):
    """Keep recent messages verbatim, summarize older ones."""
    if len(messages) <= keep_recent:
        return messages

    # Cut on an assistant message so no tool_result is separated from the
    # tool_use it answers, and so roles still alternate after the summary.
    split = len(messages) - keep_recent
    while split < len(messages) and messages[split]["role"] != "assistant":
        split += 1
    if split == len(messages):
        return messages  # No safe cut point; leave the history untouched

    old_messages = messages[:split]
    recent_messages = messages[split:]

    # Ask the model to summarize the old context.
    # format_messages() is your own helper that renders messages as plain text.
    summary = client.messages.create(
        model="qwen/qwen-3.7-max",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"Summarize the key decisions, findings, and current state from this conversation history. Be concise but preserve all important technical details:\n\n{format_messages(old_messages)}"
        }],
    )

    # The summary is a user message, so the assistant message that opens
    # recent_messages follows it directly.
    compressed = [
        {"role": "user", "content": f"[Context summary of previous {len(old_messages)} messages]: {summary.content[0].text}"},
    ]

    return compressed + recent_messages

Strategy 2: Sliding window with anchors

Keep the original task and key decision points, but slide the window over intermediate steps:

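def sliding_window_context(messages, window_size=200, anchors=None):
    """Keep anchored messages + recent window."""
    if anchors is None:
        anchors = [0]  # Always keep the original task
    if len(messages) <= window_size:
        return messages  # Nothing to omit yet

    # Start the window on an assistant message so it never opens with a
    # tool_result whose matching tool_use was cut off.
    start = len(messages) - window_size
    while start < len(messages) and messages[start]["role"] != "assistant":
        start += 1

    # Skip anchors that already fall inside the window to avoid duplicates
    anchor_messages = [messages[i] for i in sorted(set(anchors)) if i < start]

    # The task anchor and the placeholder are consecutive user messages; merge
    # them into one message if your provider requires strict role alternation.
    return anchor_messages + [
        {"role": "user", "content": "[... intermediate steps omitted ...]"},
    ] + messages[start:]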

Strategy 3: External memory

For the longest sessions, offload completed subtask results to external storage:

import sqlite3

class AgentMemory:
    def __init__(self, db_path: str = ".agent_memory.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS memories (
                id INTEGER PRIMARY KEY,
                timestamp TEXT,
                category TEXT,
                content TEXT
            )
        """)

    def store(self, category: str, content: str):
        self.conn.execute(
            "INSERT INTO memories (timestamp, category, content) VALUES (datetime('now'), ?, ?)",
            (category, content)
        )
        self.conn.commit()

    def recall(self, category: str, limit: int = 10) -> list[str]:
        cursor = self.conn.execute(
            "SELECT content FROM memories WHERE category = ? ORDER BY id DESC LIMIT ?",
            (category, limit)
        )
        return [row[0] for row in cursor.fetchall()]
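
For the model to use external memory on its own, the store has to be exposed as tools. The sketch below shows one way to do that, assuming any object with `store(category, content)` and `recall(category, limit)` methods like the class above. The tool names `remember` and `recall` and their schemas are hypothetical.

```python
# Tool definitions in the name/description/input_schema shape used in this guide
MEMORY_TOOLS = [
    {
        "name": "remember",
        "description": "Save a finding or completed subtask result for later.",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["category", "content"],
        },
    },
    {
        "name": "recall",
        "description": "Fetch the most recent saved items in a category.",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "limit": {"type": "integer", "minimum": 1},
            },
            "required": ["category"],
        },
    },
]


def handle_memory_tool(memory, name: str, tool_input: dict) -> str:
    """Route a memory tool call to the store and return text for the model."""
    if name == "remember":
        memory.store(tool_input["category"], tool_input["content"])
        return "stored"
    if name == "recall":
        items = memory.recall(tool_input["category"], tool_input.get("limit", 10))
        return "\n".join(items) or "(nothing stored in this category)"
    raise ValueError(f"not a memory tool: {name}")
```

Once a subtask result is stored, the corresponding turns can be dropped or summarized without losing the information.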

MCP integration

Qwen 3.7 works well with the Model Context Protocol (MCP) for structured tool calling. MCP provides a standardized way to define and expose tools to the model.

Basic MCP server setup

# FastMCP is the high-level server class in the official MCP Python SDK
from mcp.server.fastmcp import FastMCP

server = FastMCP("dev-tools")

@server.tool()
async def read_file(path: str) -> str:
    """Read the contents of a file."""
    with open(path, "r") as f:
        return f.read()

@server.tool()
async def write_file(path: str, content: str) -> str:
    """Write content to a file."""
    with open(path, "w") as f:
        f.write(content)
    return f"Written {len(content)} bytes to {path}"

@server.tool()
async def run_command(command: str, working_dir: str = ".") -> str:
    """Execute a shell command and return output."""
    import subprocess
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, cwd=working_dir
    )
    return f"Exit code: {result.returncode}\nStdout: {result.stdout}\nStderr: {result.stderr}"

@server.tool()
async def search_codebase(query: str, path: str = ".") -> str:
    """Search for a pattern in the codebase using grep."""
    import subprocess
    result = subprocess.run(
        ["grep", "-r", "-n", query, path],
        capture_output=True, text=True
    )
    return result.stdout[:10000]  # Truncate large results

Connecting Qwen 3.7 to MCP tools

async def run_with_mcp(task: str):
    # Get tool definitions from MCP server
    tools = await server.list_tools()

    # Convert to Anthropic tool format
    anthropic_tools = [
        {
            "name": tool.name,
            "description": tool.description,
            "input_schema": tool.inputSchema,  # MCP's Tool type uses camelCase here
        }
        for tool in tools
    ]

    messages = [{"role": "user", "content": task}]

    while True:
        response = client.messages.create(
            model="qwen/qwen-3.7-max",
            max_tokens=8192,
            messages=messages,
            tools=anthropic_tools,
        )

        messages.append({"role": "assistant", "content": response.content})

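        # Stop unless the model asked for tools (also covers "max_tokens")
        if response.stop_reason != "tool_use":
            break

        # Execute tool calls via MCP
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = await server.call_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    # Depending on the SDK version, call_tool may return content
                    # blocks rather than a string. str() keeps this sketch
                    # version-agnostic; extract the text fields in real code.
                    "content": str(result),
                })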

        messages.append({"role": "user", "content": tool_results})

Tool calling patterns

After 1,158 tool calls in a single session, certain patterns emerge for reliable tool use with Qwen 3.7.

Pattern 1: Verify after write

Always read back after writing to confirm the operation succeeded:

System prompt addition:
"After writing any file, read it back to verify the content matches your intent. After running any command that modifies state, verify the state change with a follow-up command."

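The same check can also be enforced on the tool side, so it holds even if the model skips the read-back. A minimal sketch of a write tool that verifies its own output follows. The function name is hypothetical and the comparison is a plain byte-for-byte check.

```python
from pathlib import Path


def write_file_verified(path: str, content: str) -> str:
    """Write a file, read it back, and report whether the bytes match."""
    data = content.encode("utf-8")
    target = Path(path)
    # Writing bytes avoids platform newline translation changing the content
    target.write_bytes(data)

    # Read back from disk and compare with what was intended
    if target.read_bytes() != data:
        return f"Error: content of {path} does not match what was written"
    return f"Wrote and verified {len(data)} bytes to {path}"
```

The prompt instruction is still useful for state changes that a tool cannot check by itself, such as the effect of a shell command.
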
Pattern 2: Batch related tool calls

Group related tool calls to reduce round trips:

# One assistant turn requesting three reads, instead of three separate turns
tool_calls = [
    {"name": "read_file", "input": {"path": "src/main.py"}},
    {"name": "read_file", "input": {"path": "src/utils.py"}},
    {"name": "read_file", "input": {"path": "tests/test_main.py"}},
]

Qwen 3.7 handles parallel tool calls well. Let it read multiple files in one turn rather than forcing sequential reads.
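
When one response carries several tool calls, the harness can run them concurrently instead of one after another. The sketch below assumes content blocks with `type`, `id`, `name`, and `input` attributes, as in the agent loop earlier, and an `execute_tool(name, input)` callable. It is only appropriate when the calls are independent, such as reads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tool_calls(blocks, execute_tool, max_workers: int = 8) -> list[dict]:
    """Execute all tool_use blocks from one response concurrently.

    Results are returned in the same order as the calls, each tagged with
    the id of the tool_use block it answers.
    """
    calls = [b for b in blocks if b.type == "tool_use"]
    if not calls:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order regardless of completion order
        outputs = list(pool.map(lambda b: execute_tool(b.name, b.input), calls))
    return [
        {"type": "tool_result", "tool_use_id": b.id, "content": out}
        for b, out in zip(calls, outputs)
    ]
```

Writes and shell commands that depend on each other should stay sequential, since thread scheduling gives no ordering guarantee between them.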

Pattern 3: Progressive disclosure

For large codebases, do not dump everything into context at once. Let the model explore incrementally:

System prompt:
"Start by reading the project structure (directory listing). Then read only the files relevant to the current subtask. Do not read the entire codebase upfront."

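Progressive disclosure works best when the first tool the model reaches for is cheap. One option is a depth-limited directory listing, sketched below. The function name, the indentation format, and the choice to skip hidden directories are all assumptions made for illustration.

```python
import os


def list_tree(root: str = ".", max_depth: int = 2, max_entries: int = 200) -> str:
    """List files and directories under root, down to max_depth levels."""
    root = os.path.abspath(root)
    lines = []
    for current, dirs, files in os.walk(root):
        # Skip hidden directories such as .git, and keep the order stable
        dirs[:] = sorted(d for d in dirs if not d.startswith("."))
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirs[:] = []  # Do not descend below this level
        if rel != ".":
            lines.append("  " * (depth - 1) + os.path.basename(current) + "/")
        for name in sorted(files):
            lines.append("  " * depth + name)
        if len(lines) > max_entries:
            lines = lines[:max_entries] + ["[listing truncated]"]
            break
    return "\n".join(lines)
```

The model can then call the same tool again on a subdirectory when it needs more detail, which keeps the initial context small.
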
Pattern 4: Error recovery

Qwen 3.7 handles tool errors gracefully, but explicit instructions help:

System prompt addition:
"If a tool call fails, analyze the error message. Try an alternative approach rather than repeating the same call. If you are stuck after 3 attempts, explain what is blocking you and suggest a different strategy."

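The harness can support this instruction by reporting failures in a structured way. The sketch below wraps tool execution so exceptions become `tool_result` blocks flagged with `is_error`, and it adds a nudge once the same call has failed repeatedly. The `ToolRunner` class and its threshold are illustrative, and it assumes an `execute_tool(name, input)` callable that raises on failure.

```python
import json
from collections import Counter


class ToolRunner:
    """Run tools, convert exceptions to error results, and count repeats."""

    def __init__(self, execute_tool, max_repeats: int = 3):
        self.execute_tool = execute_tool
        self.max_repeats = max_repeats
        self.failures = Counter()  # (tool name, canonical input) -> failure count

    def run(self, tool_use_id: str, name: str, tool_input: dict) -> dict:
        key = (name, json.dumps(tool_input, sort_keys=True, default=str))
        try:
            content = str(self.execute_tool(name, tool_input))
            self.failures.pop(key, None)  # A success resets the counter
            is_error = False
        except Exception as exc:
            self.failures[key] += 1
            content = f"{type(exc).__name__}: {exc}"
            if self.failures[key] >= self.max_repeats:
                content += (
                    f"\nThis exact call has failed {self.failures[key]} times. "
                    "Try a different approach."
                )
            is_error = True
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": content,
            "is_error": is_error,
        }
```
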
Cost management for long sessions

A 35-hour session is not cheap. At $2.50/$7.50 per million tokens, costs can accumulate quickly.

Estimating session costs

| Session length | Estimated tool calls | Estimated tokens | Estimated cost |
| --- | --- | --- | --- |
| 1 hour | ~33 | 150K-300K | $1-3 |
| 4 hours | ~130 | 500K-1M | $4-10 |
| 12 hours | ~400 | 1.5M-3M | $12-25 |
| 35 hours | ~1,158 | 4M-8M | $35-65 |

Cost reduction strategies

1. Use context compression aggressively

Every token in context costs money on every subsequent turn. Compressing old context saves on input tokens for all future turns.

2. Set output token limits per turn

response = client.messages.create(
    model="qwen/qwen-3.7-max",
    max_tokens=4096,  # Limit per turn, not 65536
    messages=messages,
    tools=tools,
)

Only use the full 65K output when you actually need long code generation. For tool-calling turns, 4K-8K is usually sufficient.

3. Use Plus for non-critical subtasks

If your agent has subtasks that do not require Max’s agent optimization (simple lookups, formatting, summarization), route those to Qwen 3.7 Plus at $1.50/$4.50 to save 40%.
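
Routing can be as simple as a lookup on the kind of subtask. In the sketch below, the Max model ID is the one used throughout this guide, while the Plus ID is an assumption that should be checked against your provider's model list. The subtask labels are hypothetical and come from your own orchestration code.

```python
MAX_MODEL = "qwen/qwen-3.7-max"    # ID used in the examples in this guide
PLUS_MODEL = "qwen/qwen-3.7-plus"  # Assumed ID; verify with your provider

# Subtask kinds that do not need Max (labels are illustrative)
LIGHT_SUBTASKS = {"lookup", "formatting", "summarization"}


def pick_model(subtask_kind: str) -> str:
    """Send light subtasks to Plus and everything else to Max."""
    return PLUS_MODEL if subtask_kind in LIGHT_SUBTASKS else MAX_MODEL
```

Defaulting unknown kinds to Max is the conservative choice: a misrouted task costs more rather than risking a weaker result in the main agent loop.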

4. Implement early stopping

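def should_continue(messages, max_cost_usd=50.0):
    """Rough budget check: stop once the estimated cost exceeds the budget."""
    # estimate_tokens() is your own helper. This prices the current history
    # once; context is re-sent as input on every turn, so actual spend is higher.
    total_tokens = sum(estimate_tokens(m) for m in messages)
    estimated_cost = (total_tokens / 1_000_000) * 5.0  # Midpoint of $2.50 input and $7.50 output
    return estimated_cost < max_cost_usd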
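
A tighter budget check accumulates the token counts reported with each response instead of estimating from the history. The sketch below assumes each response exposes `usage.input_tokens` and `usage.output_tokens`, as the Anthropic SDK does; whether a given Qwen endpoint fills these in is worth verifying. The class name is hypothetical and the prices are the ones quoted in this section.

```python
from dataclasses import dataclass


@dataclass
class BudgetTracker:
    """Running cost of a session, priced separately for input and output."""

    max_cost_usd: float = 50.0
    input_price_per_m: float = 2.50   # USD per 1M input tokens
    output_price_per_m: float = 7.50  # USD per 1M output tokens
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, usage) -> None:
        """Add the usage reported for one API response."""
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens

    @property
    def cost_usd(self) -> float:
        return (
            self.input_tokens * self.input_price_per_m
            + self.output_tokens * self.output_price_per_m
        ) / 1_000_000

    def should_continue(self) -> bool:
        return self.cost_usd < self.max_cost_usd
```

In the agent loop this would be `tracker.record(response.usage)` after each call, followed by a `tracker.should_continue()` check before the next one. Because every turn's input is counted, the effect of context compression shows up directly in the total.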

Comparison with other models for agent use

How does Qwen 3.7 Max compare to other frontier models for autonomous agent workloads?

| Model | Context | Max session validated | Tool call reliability | Agent benchmarks | Price (output) |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.7 Max | 1M | 35 hours | Excellent | II: 56.6, TB-H: 50.8% | $7.50/1M |
| Claude Opus 4.6 | 200K | ~4 hours | Excellent | II: 55.1, TB-H: 49.2% | $75.00/1M |
| GPT-5.5 | 256K | ~2 hours | Good | II: 54.8, TB-H: 47.6% | $60.00/1M |
| Gemini 3.1 Pro | 2M | ~8 hours | Good | II: 53.2, TB-H: 45.1% | $10.00/1M |
| DeepSeek V4 Pro | 1M | ~6 hours | Good | II: 55.8, TB-H: 48.2% | $1.20/1M |

Key takeaways:

  • Qwen 3.7 Max has the longest validated session and best agent benchmarks, at a fraction of Claude/GPT pricing
  • Gemini 3.1 Pro has a larger context window (2M) but lower agent benchmark scores and shorter validated sessions
  • DeepSeek V4 Pro is 6x cheaper but has not been validated for sessions beyond ~6 hours
  • Claude Opus 4.6 is excellent for agents but 10x more expensive on output tokens

For a detailed comparison with Gemini on infrastructure tasks, see Race: Gemini for Infra Engineering.

When Qwen 3.7 is the right choice for agents

Use Qwen 3.7 Max for autonomous agents when:

  • Your tasks require sustained operation over many hours
  • You need hundreds or thousands of tool calls in a single session
  • The 1M context window is necessary for your codebase size
  • You want Anthropic API compatibility for existing tooling
  • Cost matters (compared to Claude/GPT, not compared to DeepSeek)

Use something else when:

  • Sessions are short (<1 hour) and cost is the priority (use DeepSeek V4 Pro)
  • You need the absolute largest context window (use Gemini 3.1 Pro at 2M)
  • You need open weights for self-hosting (use DeepSeek V4 Pro)
  • Vision is required in the agent loop (use Qwen 3.7 Plus or Gemini)

FAQ

How long can Qwen 3.7 actually run autonomously?

The validated benchmark is 35 hours with 1,158 tool calls in a single session. In practice, your session length is limited by your infrastructure (network stability, API availability) rather than the model’s capabilities. The model itself does not degrade over that timeframe.

Does the model get worse over long sessions?

No measurable degradation was observed in the 35-hour benchmark. The model maintained consistent output quality from hour 1 through hour 35. This is a key differentiator from other models that show attention drift or repetition in long sessions.

How much does a 35-hour session cost?

Roughly $35-65 depending on the complexity of the task and how many tokens are generated per tool call. Output tokens carry the highest rate at $7.50/1M, but the conversation history is re-sent as input on every turn, so context compression strategies can reduce the total significantly.

Can I use Qwen 3.7 with existing agent frameworks?

Yes. Qwen 3.7 supports both the Anthropic API protocol and OpenAI-compatible format. It works with LangChain, CrewAI, AutoGen, and any framework that supports either protocol. The Anthropic compatibility also means it works directly with Claude Code as an agent harness.

What happens if the session crashes mid-way?

The model does not maintain state between API calls. Your agent framework needs to handle checkpointing and recovery. See the “Resilient agent with checkpointing” pattern above. If you save conversation history to disk after each turn, you can resume from the last checkpoint.

How does the 1M context compare to Gemini’s 2M?

Gemini 3.1 Pro has a larger raw context window (2M vs 1M), but Qwen 3.7 Max has been more extensively validated for sustained agent operation within its context window. A larger window does not help if the model loses coherence at high utilization. Qwen 3.7 maintains quality even at 800K+ tokens of active context.

Is MCP required to use Qwen 3.7 for agents?

No. MCP is one way to structure tool definitions, but Qwen 3.7 works with any tool-calling format supported by the Anthropic or OpenAI API protocols. You can define tools inline, use function calling, or use MCP. The model handles all approaches well.

Related: Qwen 3.7 Complete Guide · How to Use Qwen 3.7 API · Race: Gemini for Infra Engineering · MCP Complete Developer Guide