Alibaba released Qwen 3.7 on May 20-21, 2026, just one month after Qwen 3.6. The monthly cadence continues: 3.5 in March, 3.6 in April, 3.7 in May.
The question: is it worth upgrading? Short answer: yes, if you’re using the API. The improvements are substantial across every benchmark, the context window quadrupled, and you get new capabilities like Anthropic protocol support.
Benchmark comparison
| Benchmark | Qwen 3.6 Max | Qwen 3.7 Max | Change |
|---|---|---|---|
| Intelligence Index v4.0 | ~52 (estimated) | 56.6 | +4.6 |
| Terminal-Bench Hard | 43.9% | 50.8% | +6.9% |
| Humanity’s Last Exam | 28.9% | 38.1% | +9.2% |
| CritPt | 3.7% | 13.4% | +9.7% (3.6x) |
| Apex Math | N/A | 44.5 | New benchmark |
| MCP-Atlas | N/A | 76.4 | New benchmark |
| Arena AI Elo | N/A | 1,475 (#13) | New ranking |
| Hallucination (AA-Omniscience) | N/A | 22.9% (lowest) | New metric |
Every single benchmark shows meaningful improvement. The CritPt jump from 3.7% to 13.4% is the standout, nearly 4x improvement in critical point reasoning. Terminal-Bench Hard went from 43.9% to 50.8%, crossing the 50% threshold for the first time.
Context window
| Qwen 3.6 Max | Qwen 3.7 Max | |
|---|---|---|
| Context window | 256K tokens | 1M tokens |
| Improvement | N/A | 4x |
This is the single biggest practical upgrade. Going from 256K to 1M tokens means you can now fit entire codebases, full documentation sets, or multiple long documents in a single prompt without chunking or retrieval.
For agent workflows, this means longer conversation histories, more tool call results, and less context management overhead.
Architecture changes
Both models are closed-weights, so internal architecture details are limited. What we know:
- Qwen 3.6: Hybrid linear attention + sparse MoE, available in multiple sizes (35B-A3B, 27B dense, Plus, Max Preview, Flash)
- Qwen 3.7: Two variants only (Max and Plus), likely evolved architecture optimized for longer context and autonomous operation
The 3.7 release focuses on the flagship API models rather than the open-weight ecosystem. Alibaba appears to be shipping the API first and following up with open weights later.
New capabilities in 3.7
Anthropic API protocol support
Qwen 3.7 Max natively supports the Anthropic API protocol. This means tools built for Claude (including Claude Code) work directly with Qwen 3.7 without any adapter or translation layer.
This didn’t exist in 3.6. It’s a strategic move that lets developers use Qwen as a drop-in replacement for Claude in existing toolchains.
35-hour autonomous operation
Alibaba demonstrated Qwen 3.7 Max running autonomously for 35 hours, executing 1,158 tool calls. This is a new capability class. While 3.6 supported tool calling, the sustained autonomous operation at this scale is new.
Lower hallucination rate
22.9% on AA-Omniscience is the lowest among frontier models. This wasn’t a tracked metric for 3.6, but the improvement in factual reliability is notable for production use cases.
Pricing comparison
| Qwen 3.6 Max Preview | Qwen 3.7 Max | |
|---|---|---|
| Input | Standard Aliyun pricing | $2.50/1M tokens |
| Output | Standard Aliyun pricing | $7.50/1M tokens |
| OpenRouter | Free (preview) | $2.50/1M input |
Qwen 3.6 Plus was available free on OpenRouter during its preview period. Qwen 3.7 Max is a paid model from day one at $2.50/$7.50 per million tokens. This is still extremely competitive compared to Western frontier models.
Model variants comparison
| Variant | Qwen 3.6 | Qwen 3.7 |
|---|---|---|
| Max/Flagship | Max Preview | Max |
| Plus/Mid-tier | Plus (free preview) | Plus (multimodal) |
| Flash/Speed | Flash | Not yet |
| Open-weight large | 35B-A3B (Apache 2.0) | Not yet |
| Open-weight dense | 27B | Not yet |
Qwen 3.6 had a broader model family at this point in its lifecycle. Qwen 3.7 launched with just Max and Plus, with open-weight variants expected to follow.
Should you upgrade?
Upgrade if you:
- Use Qwen 3.6 Max/Plus via API and want better performance
- Need more than 256K context
- Build autonomous agents that run for extended periods
- Want to use Qwen with Claude Code or other Anthropic-protocol tools
- Need lower hallucination rates for factual tasks
Stay on 3.6 if you:
- Run models locally (3.7 has no open weights yet, 3.6 35B-A3B and 27B still work)
- Need the free tier (3.6 Plus on OpenRouter may still be free)
- Have workflows that depend on specific 3.6 behavior and can’t risk regression
Migration notes
API endpoint changes
If you’re using DashScope, update your model parameter:
# Before (3.6)
model = "qwen-max-preview"
# After (3.7)
model = "qwen3.7-max"
OpenRouter
# Before (3.6)
model = "qwen/qwen-max-preview"
# After (3.7)
model = "qwen/qwen3.7-max"
Behavior differences
- Output style may differ slightly. Test your prompts before switching production traffic.
- The 1M context window means you can send larger payloads, but costs scale with token count.
- Tool calling format is compatible but may have improved reliability.
For full API setup instructions, see our Qwen 3.7 API guide.
The bigger picture
Alibaba’s monthly release cadence is aggressive. Each version brings meaningful improvements:
- 3.5 (March 2026): Established Qwen as a serious coding model
- 3.6 (April 2026): Open weights, 35B-A3B, free API preview
- 3.7 (May 2026): Frontier performance, 1M context, autonomous agents
At this pace, Qwen 3.8 could arrive in June. The gap between Qwen and the top 3 (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro) is narrowing with each release.
For a complete overview of Qwen 3.7’s capabilities, see our complete guide.
FAQ
Is Qwen 3.7 backward compatible with 3.6 API calls?
The API format is compatible (OpenAI-style), but you need to update the model name. Prompts that worked with 3.6 will work with 3.7, though outputs may differ.
Can I still use Qwen 3.6?
Yes. Qwen 3.6 models remain available. The open-weight variants (35B-A3B, 27B) are still the best option for local deployment.
Is the 1M context window real or just marketing?
It’s a real 1M token context window. Whether performance degrades at the edges (common with very long contexts) remains to be tested at scale, but the capability is there.
Why did Alibaba skip open weights for 3.7?
They didn’t skip them permanently. Following the 3.6 pattern, open-weight variants will likely come weeks after the API launch. Alibaba ships API first to monetize, then releases open weights for community adoption.
How much faster is 3.7 than 3.6?
Speed benchmarks haven’t been published yet. The focus of 3.7 is on capability (longer context, better reasoning, autonomous operation) rather than raw inference speed.
Does 3.7 replace 3.6 for coding tasks?
For API users, yes. Qwen 3.7 Max is strictly better than 3.6 Max on every published benchmark. For local users, 3.6 remains the only option until 3.7 open weights drop.