AI Dev Weekly Extra: Did Anthropic Let Opus 4.6 Rot So 4.7 Would Look Better?
AI Dev Weekly Extra — a special edition for breaking news that can’t wait until Thursday.
Anthropic shipped Claude Opus 4.7 this week. The benchmarks are impressive. The vision jump is absurd. And I should be writing a straightforward “here’s what’s new” piece right now.
But I can’t do that without talking about what happened to Opus 4.6 first. Because the story of 4.7 doesn’t start with its release — it starts with the slow, public deterioration of the model it replaces, and the uncomfortable questions that deterioration raises about trusting any AI provider with your production workloads.
The Opus 4.6 Collapse Was Real
Let me be blunt: Opus 4.6 got noticeably worse over the past several weeks, and the evidence isn’t anecdotal.
A HuggingFace analysis across 6,852 sessions documented a 67% drop in reasoning depth. On BridgeBench, Opus 4.6 fell from 83.3% — good enough for the #2 spot — down to 68.3%, landing it at #10. That’s not drift. That’s a cliff. An AMD senior director posted forensic evidence on GitHub showing systematic capability loss. Some users reported accuracy scores dropping by 58%.
If you were using Claude Code in mid-March, you probably felt it firsthand. Sessions hanging for 10-15 minutes on prompts that used to resolve in seconds. Outputs that felt shallow, hedging, stripped of the analytical depth that made Opus the model you reached for when the problem was hard.
Reddit and X lit up with the vocabulary we’ve all learned to use for this phenomenon: “AI shrinkflation.” “Lobotomized.” “Nerfed.” The community wasn’t being dramatic — they were describing a measurable reality.
Anthropic’s official response? They denied degrading the model weights.
I believe them, technically. I don’t think someone at Anthropic opened a config file and turned a dial labeled “make it worse.” But “we didn’t change the weights” is a narrow denial that sidesteps a lot of territory — infrastructure changes, serving optimizations, quantization adjustments, routing modifications. There are many ways a model’s effective capability can degrade without anyone touching the weights themselves.
Enter Opus 4.7: Savior or Convenient Timing?
Now here’s where it gets interesting. Opus 4.7 lands with numbers that look fantastic — especially when measured against the degraded version of 4.6 that users had been suffering through:
- SWE-bench Pro: 64.3% (up from 53.4%)
- CursorBench: 70% (up from 58%)
- Vision: 98.5% (up from 54.5%)
That vision jump alone — from 54.5% to 98.5% — is genuinely remarkable. The coding benchmarks represent real, meaningful progress. I’ve been running 4.7 through my own workflows for the past two days, and the improvement in structured reasoning and code generation is not imaginary. This is a better model.
But here’s the thing that keeps nagging at me: users on X have been joking that 4.7 “feels like early 4.6.” The version they actually liked. The one that scored 83.3% on BridgeBench before it started its mysterious decline.
So which is it? Is 4.7 a genuine leap forward, or did we just spend weeks watching 4.6 get worse so that “normal” would feel like a breakthrough?
I think the honest answer is: both. The SWE-bench and vision numbers suggest capabilities that go beyond where 4.6 ever was, even at its peak. But the subjective experience of improvement is amplified by the fact that we’ve been working with a degraded model for weeks. Anthropic gets to announce a 20% coding improvement against a baseline that had already fallen 15%. The math works out very nicely for the press release.
The Tokenizer Tax Nobody’s Talking About
Opus 4.7 ships at the same per-token price as 4.6. Anthropic made sure to highlight this. Same price, better model — what’s not to love?
The new tokenizer, that’s what.
Opus 4.7’s tokenizer uses up to 35% more tokens to represent the same content. If you’re processing the same codebase, the same documents, the same prompts you were running last week, you’re now paying up to 35% more for the privilege.
Let’s call this what it is: a hidden price increase. Not on the rate card — on the meter. It’s the AI equivalent of shrinking the cereal box while keeping the price tag the same. The “per token” price didn’t change, but the number of tokens your work requires did.
For hobbyists and occasional users, this is a rounding error. For teams running Claude through CI pipelines, code review automation, or document processing at scale, a 35% token increase is a material cost change that showed up with zero advance warning. If you’re budgeting API costs, recalculate now. Your March invoices are not predictive of your April ones.
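To put rough numbers on the tokenizer tax, here’s a back-of-the-envelope sketch. The rate and monthly volume below are made-up illustration values, not Anthropic’s actual pricing — the point is that a 35% token increase at a constant per-token price is a 35% increase on the invoice:

```python
# Back-of-the-envelope estimate of the "tokenizer tax".
# RATE_PER_MTOK and MONTHLY_MTOK_OLD are hypothetical values,
# not real pricing; only the 1.35 multiplier comes from the article.

RATE_PER_MTOK = 15.00        # dollars per million input tokens (hypothetical)
MONTHLY_MTOK_OLD = 200       # millions of tokens/month under the old tokenizer
TOKENIZER_INFLATION = 1.35   # same content now needs up to 35% more tokens

old_bill = MONTHLY_MTOK_OLD * RATE_PER_MTOK
new_bill = MONTHLY_MTOK_OLD * TOKENIZER_INFLATION * RATE_PER_MTOK

print(f"old: ${old_bill:,.0f}/mo  new: ${new_bill:,.0f}/mo  "
      f"delta: +${new_bill - old_bill:,.0f} ({TOKENIZER_INFLATION - 1:.0%})")
# → old: $3,000/mo  new: $4,050/mo  delta: +$1,050 (35%)
```

Run the same arithmetic against your own volumes before your next billing cycle closes.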
For a deeper dive into the technical differences, check out our Opus 4.7 vs 4.6 comparison.
The Mythos in the Room
Here’s the part of this story that doesn’t get enough attention. The same week Anthropic released 4.7, Axios ran a headline that should have been louder than it was: “Anthropic releases Claude Opus 4.7, concedes it trails unreleased Mythos.”
Mythos Preview beats 4.7 on almost every benchmark. And it’s restricted — available only in limited preview, not generally accessible through the API.
So we’re in a strange position. Anthropic is asking developers to be excited about 4.7 while simultaneously acknowledging they have something substantially better that they’re not shipping. I understand the reasons — safety evaluation, scaling infrastructure, responsible deployment. These are legitimate concerns. But it creates an awkward dynamic where the product you’re paying for is, by the company’s own admission, not the best they can do.
It also raises a strategic question: if you’re building a product on top of 4.7 today, how do you plan for a model that might be dramatically better arriving in weeks or months? Do you optimize for 4.7’s specific strengths, or do you build abstractions assuming the foundation will shift under you again?
For more context on how these models stack up, see our AI model comparison.
This Isn’t Just an Anthropic Problem
I want to be fair here. Anthropic is not uniquely guilty of anything. GPT-4 users reported strikingly similar degradation patterns before GPT-4o launched. OpenAI faced the exact same “did they nerf it?” accusations. The community had the same arguments, the same forensic analyses, the same official denials.
This is a structural problem with the entire model-as-a-service paradigm. When you call an API, you have no way to verify what’s actually running on the other side. The model you tested against last Tuesday might not be the model serving your requests today. There’s no checksum, no version hash, no way to pin a specific set of weights the way you’d pin a dependency version in your package manager.
You’re renting intelligence, not owning it. And the landlord can renovate your apartment while you’re at work without telling you.
This is fundamentally different from every other dependency in your stack. When you upgrade PostgreSQL, you choose when. When a library updates, your lockfile protects you. But your AI provider can change the effective capability of your most critical dependency at any time, and your only detection mechanism is “hmm, the outputs feel different.”
For developers who lived through the 4.6 degradation while running production workloads — that’s not a theoretical concern. That’s a retrospective incident report waiting to be written.
What Developers Should Actually Do
So where does this leave us? Here’s my honest take.
Opus 4.7 is a good model. Probably a genuinely great one. The complete guide covers the capabilities in detail, and the coding and vision improvements are real and significant. If you’re choosing a model today, 4.7 deserves serious consideration.
But the 4.6 episode should change how you architect around these models. Here’s what I’d recommend:
- Build evaluation harnesses, not vibes. If you don’t have automated quality checks on your AI-dependent workflows, the 4.6 degradation is what happens to you — slow, invisible capability loss that you only notice when users complain. Run benchmarks on your actual use cases. Weekly, at minimum.
- Budget for the tokenizer tax. If you’re on Opus, your costs may have just gone up by as much as 35%. Plan for it. Monitor it. Don’t let it surprise your finance team.
- Abstract your model layer. If you’re not already using a model-agnostic interface, start. The ability to swap between providers — or between Claude models — without rewriting your application isn’t a nice-to-have anymore. It’s operational resilience. Our Opus 4.6 vs 4.5 comparison shows how much can change between versions.
- Keep receipts. Log your inputs, outputs, and quality metrics. When the next degradation happens — and it will, from someone — you want data, not feelings.
- Watch Mythos. Whatever Anthropic is holding back is, by their own benchmarks, significantly better than what they just shipped. That’s either exciting or unsettling depending on your perspective. Either way, it’s worth tracking.
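The harness-plus-receipts combination doesn’t need to be elaborate. Here’s a minimal sketch: run fixed cases, score them, log a JSON line per case, and fail loudly when the pass rate dips below a stored baseline. The cases, the `call_model` stub, and the baseline number are all placeholders you’d replace with your own workloads and measurements:

```python
# Minimal regression harness for an AI-dependent workflow.
# CASES, BASELINE, and call_model are placeholders for illustration.
import json
import time

CASES = [
    {"prompt": "2+2=", "expected": "4"},
    {"prompt": "capital of France?", "expected": "Paris"},
]
BASELINE = 0.90  # pass rate measured when you last trusted the model


def call_model(prompt: str) -> str:
    # Placeholder: swap in your real API call here.
    return {"2+2=": "4", "capital of France?": "Paris"}[prompt]


def run_eval() -> float:
    passed = 0
    for case in CASES:
        output = call_model(case["prompt"])
        ok = case["expected"].lower() in output.lower()
        passed += ok
        # Keep receipts: one timestamped JSON line per case.
        print(json.dumps({"ts": time.time(), "prompt": case["prompt"],
                          "output": output, "pass": ok}))
    return passed / len(CASES)


score = run_eval()
assert score >= BASELINE, f"quality regression: {score:.0%} < {BASELINE:.0%}"
```

Wire this into a weekly CI job and the next silent degradation becomes an alert with data attached, not a hunch on Reddit.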
The AI industry has a trust problem it hasn’t solved. Not a safety trust problem — a reliability trust problem. The companies building these models need to give developers better tools for verifying, pinning, and monitoring the models they depend on. Until they do, we’re all building on ground that can shift without warning.
Opus 4.7 is a step forward. The way we got here is a step backward. Both things are true, and pretending otherwise doesn’t help anyone.
See you Thursday for the regular edition.