Race: We Upgraded Gemini from 2.5 Flash to 3.5 Flash — Can It Escape Last Place?
Yesterday Google dropped Gemini 3.5 Flash at I/O 2026. Today we upgraded the race’s Gemini agent to it. This is the second mid-race model upgrade — the first was DeepSeek V3 to V4 Pro in week 1 — and honestly, it might be Gemini’s last chance.
Because right now? Gemini is in last place. Dead last. 363 commits, the most bug loops of any agent, and the least actual progress toward revenue. The infrastructure is all there — LocalLeads has a domain, database, Vercel deployment, and PayPal integration. The blocker isn’t setup. It’s code quality. The old model kept generating broken code and looping on its own bugs.
So when Google announced 3.5 Flash with an 83.6% MCP Atlas score, the highest of any model on a benchmark that measures multi-step autonomous work, we didn’t hesitate.
What Changed
Here’s the before and after:
| | Before (May 19) | After (May 20) |
|---|---|---|
| CLI | Gemini CLI | Antigravity CLI (agy) |
| Model (premium) | Gemini 2.5 Pro | — |
| Model (cheap) | Gemini 2.5 Flash | — |
| Model (single tier) | — | Gemini 3.5 Flash |
| Sessions/day | 8 | 8 |
| Backlog | BACKLOG-PREMIUM.md + BACKLOG-CHEAP.md | BACKLOG.md (single) |
| Cost | $20/month (Google AI Pro) | $20/month (Google AI Pro) |
Same budget. Same session count. Completely different engine underneath.
Why We Upgraded (3 Reasons)
1. Gemini CLI Is Dying
Google announced Gemini CLI will be retired on June 18. We could have waited, but why run a dead tool for another month when the replacement is already better? Antigravity CLI (agy) supports 3.5 Flash natively and gives us cleaner session management.
2. 3.5 Flash Is a Coding Monster
The benchmarks speak for themselves:
- 76.2% on Terminal-bench — beats Gemini 3.1 Pro on coding tasks
- 83.6% MCP Atlas score — highest of any model, measuring exactly the multi-step agentic work our race demands
- 289 tokens/second — 4x faster than 2.5 Flash
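Speed matters in the race. Every session is time-boxed, so a model that generates tokens 4x faster can get substantially more done per session. It won’t be a full 4x, though, because tool calls, builds, and tests take the same time no matter how fast the model writes. The higher coding accuracy should also mean fewer bug loops, which is precisely what’s been killing Gemini’s progress.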
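To put a rough number on the speed claim, here is a minimal sketch using Amdahl's law. It assumes a session splits into time spent generating tokens and time spent on everything else (tool calls, builds, tests), and that only the first part speeds up. The `generation_fraction` values are illustrative assumptions, not measurements from the race.

```python
def session_speedup(generation_fraction: float, token_speedup: float) -> float:
    """Overall session speedup when only token generation gets faster.

    generation_fraction: share of session time spent generating tokens
                         (an assumed value, not measured in the race).
    token_speedup:       how much faster generation is (4.0 for the 4x claim).
    """
    if not 0.0 <= generation_fraction <= 1.0:
        raise ValueError("generation_fraction must be between 0 and 1")
    if token_speedup <= 0:
        raise ValueError("token_speedup must be positive")
    other_fraction = 1.0 - generation_fraction
    # Amdahl's law: the part that does not speed up limits the total gain.
    return 1.0 / (other_fraction + generation_fraction / token_speedup)


if __name__ == "__main__":
    for fraction in (1.0, 0.75, 0.5):
        gain = session_speedup(fraction, 4.0)
        print(f"{fraction:.0%} generation-bound: {gain:.2f}x per session")
```

Under these assumptions, a session that is only half generation-bound gains about 1.6x from a 4x faster model, so the real benefit depends on how much of each session is spent waiting on the model.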
3. The Agent Is Stuck
Let’s be honest. Looking at the week 4 results, Gemini has the worst trajectory of all five agents. It commits constantly but ships nothing usable. The pattern is clear: generate code → hit bug → try to fix → introduce new bug → loop until context is exhausted. A smarter model with better operational awareness is the only intervention that might break this cycle.
The Single-Backlog Approach
With 2.5 Flash and 2.5 Pro, we ran a two-tier backlog system: complex tasks went to Pro (premium sessions), simpler tasks went to Flash (cheap sessions). This added operational overhead and meant the agent was constantly context-switching between two different capability levels.
3.5 Flash eliminates this entirely. It’s good enough for everything — one model, one backlog. We merged BACKLOG-PREMIUM.md and BACKLOG-CHEAP.md into a single BACKLOG.md. This mirrors what we already do with Kimi, which has run a single-tier backlog from the start.
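For readers curious what the merge might look like, here is a minimal sketch. It assumes each backlog file stores tasks as markdown bullet lines starting with `- `, which is a guess about the format, and the `merge_backlogs` helper is hypothetical rather than the script we actually ran.

```python
def merge_backlogs(premium_text: str, cheap_text: str) -> str:
    """Merge two backlog files into one, premium items first, duplicates dropped."""
    seen = set()
    merged = []
    for text in (premium_text, cheap_text):
        for line in text.splitlines():
            item = line.strip()
            # Assumed format: only "- " bullet lines are tasks; headings are skipped.
            if not item.startswith("- "):
                continue
            key = item.lower()  # treat items differing only in case as duplicates
            if key in seen:
                continue
            seen.add(key)
            merged.append(item)
    return "# Backlog\n\n" + "\n".join(merged) + "\n"


if __name__ == "__main__":
    premium = "# Premium\n- Fix checkout\n- Audit product\n"
    cheap = "# Cheap\n- audit product\n- Update copy\n"
    print(merge_backlogs(premium, cheap))
```

Keeping the premium items first preserves the old priority order, so the single-tier agent still meets the hardest tasks at the top of the file.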
The first task on the new unified backlog: audit the entire LocalLeads product, merge any remaining backlog items, and identify the #1 blocker to revenue. No more coding blind. The agent needs to understand what exists before it builds more.
What We Expect
Within 48 hours, we should see:
- Fewer bug loops — 3.5 Flash’s higher coding accuracy should mean less time fixing its own mistakes
- Better operational awareness — the MCP Atlas score suggests it can hold multi-step plans together, which 2.5 Flash couldn’t
- Faster iteration — 289 tok/s means more gets done per session, compounding across 8 daily sessions
- Cleaner commits — quality over quantity, finally
The Gemini race page will show the results in real time. If the upgrade works, we should see a visible inflection point starting today.
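One way to make "fewer bug loops" measurable is to track what share of commits look like fixes. The sketch below is a rough heuristic, not the metric the race page uses: the keyword list and the `fix_commit_ratio` helper are assumptions, and commit messages are only a proxy for actual bug loops.

```python
import re

# Assumed keywords that suggest a commit is repairing earlier work.
FIX_PATTERN = re.compile(r"\b(fix|fixes|fixed|bug|revert|hotfix)\b", re.IGNORECASE)


def fix_commit_ratio(messages: list[str]) -> float:
    """Return the fraction of commit messages that look like bug fixes."""
    if not messages:
        return 0.0
    fixes = sum(1 for message in messages if FIX_PATTERN.search(message))
    return fixes / len(messages)


if __name__ == "__main__":
    # Made-up sample messages for illustration only.
    sample = [
        "Add lead form",
        "fix lead form",
        "Fix build error",
        "fix: revert broken import",
    ]
    print(f"fix ratio: {fix_commit_ratio(sample):.2f}")
```

Comparing this ratio for the commits before and after the upgrade would give a crude signal of whether the loop pattern is easing, provided the agent's commit message style stays consistent.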
Will It Work? (Honest Assessment)
I’m cautiously optimistic, but not certain.
The optimistic case: Gemini’s problems were purely model quality. 2.5 Flash was too weak for autonomous multi-step coding. 3.5 Flash fixes that, the bug loops stop, and LocalLeads starts shipping features. The infrastructure is already there — domain, database, payments. It just needs clean code on top.
The pessimistic case: Gemini’s problems are deeper than model quality. Maybe the project’s codebase is already so tangled from weeks of buggy commits that even a better model can’t navigate it. Maybe the context bloat problem is structural. Maybe 363 commits of technical debt is too much to overcome.
There’s also a practical constraint: the Google AI Pro subscription is shared with personal use. At 8 sessions/day for the race, we might hit quota limits. If that happens, we’ll need to reduce to 6 sessions and see if quality compensates for quantity.
The precedent is encouraging though. When we upgraded DeepSeek from V3 to V4 Pro, the improvement was immediate and dramatic. Better model, better output, simple as that.
We’ll know within 48 hours whether this is a turning point or whether Gemini’s problems run deeper than any model can fix. Either way — the race continues, and now every agent is running on the best model available to it.
The clock is ticking. Let’s see what 3.5 Flash can do.