🤖 AI Tools
· 12 min read

Race Update: DeepSeek Upgraded From 404 to V4 Pro + OpenCode


DeepSeek was the worst-performing agent in the race. Not close. Not debatable. Dead last.

24 sessions. 136 commits. A site that returned a 404. An agent stuck in a Stripe integration loop for 4+ commits without even having API keys. Zero help requests filed. The whole thing was running on Aider with deepseek-reasoner (V3) as the premium model and deepseek-chat for cheap sessions.

It was not working.

Then on April 24, DeepSeek released V4 Pro and V4 Flash. And they support OpenCode natively.

So we wiped the repo. Switched the tool. Upgraded the model. And gave DeepSeek a fresh start.

This article covers what went wrong with the old setup, what the new V4 Pro + OpenCode configuration looks like, and what it means for the race going forward.

📊 Live Dashboard | 📅 Race Digest | 💰 Budget Tracker

Why DeepSeek Was Failing

Let’s be specific about what went wrong, because it was not just “the model was bad.” It was a compounding failure across the tool, the model, and the agent’s own behavior.

Aider + deepseek-reasoner was the weakest tool+model combo in the race. Every other agent had a tighter integration between its model and its coding tool. Claude had Claude Code. Codex had Codex CLI. Gemini had Gemini CLI. DeepSeek had Aider, a third-party tool that was never optimized for deepseek-reasoner’s output format.

The result was bizarre. The agent created files named after Aider’s own output. Literally. One file was named:

I'll now output the SEARCH/REPLACE blocks.scripts/build.js

That is not a joke. That is a real filename that ended up in the repo. The model was outputting Aider’s SEARCH/REPLACE format instructions as part of the filename string. Aider was interpreting it as a file creation command. Nobody caught it because there was no human in the loop.

The site returned a 404. The vercel.json routing config was broken. The agent had built pages, written content, set up components, but none of it was reachable. Every visitor got a 404. For the entire run. 24 sessions of work, completely invisible to the outside world.
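We have not published the broken vercel.json itself, but the failure mode is a common one: a hand-written routes or rewrites rule that does not match the deployed files sends every request to a destination that does not exist. For a static multi-page site, a minimal config like the following is usually all that is needed (this is an illustrative fix, not the agent's actual file):

```json
{
  "cleanUrls": true,
  "trailingSlash": false
}
```

With `cleanUrls` enabled, Vercel serves `/about.html` at `/about` without any custom routing rules, which removes the class of catch-all route mistakes that can 404 an entire deployment.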

The Stripe checkout loop was the worst part. The agent spent 4+ consecutive commits polishing Stripe checkout integration code. Building the checkout flow. Refining the UI. Adding error handling. All without having Stripe API keys. The code could never have worked. But the agent kept iterating on it, commit after commit, because nothing in its context told it to stop and ask for the keys first.

This is a pattern we see in weaker agents: they optimize locally without checking global constraints. The agent knew how to write Stripe integration code. So it wrote Stripe integration code. It never stepped back to ask “do I have the credentials to make this work?” A stronger agent would have filed a help request after the first failed test. DeepSeek just kept polishing dead code.

Zero help requests. This is the stat that matters most. Every agent in the race has the ability to file help requests when it gets stuck. The agents that use this feature early tend to perform better. GLM filed help requests. Claude filed help requests. DeepSeek filed zero. It just kept grinding on broken code in silence.

24 sessions. 136 commits. Nothing worked.

Compare that to what the other agents did with their first 24 sessions. Claude had a live product. GLM had real users. Codex was sending outreach emails. DeepSeek had a repo full of broken files and a site that nobody could visit.

The tool+model mismatch was the root cause. Everything else was a symptom. When your coding agent cannot reliably create files with correct names, nothing downstream is going to work. The 404, the Stripe loop, the missing help requests: all of it traces back to Aider not understanding deepseek-reasoner’s output and deepseek-reasoner not understanding Aider’s expected format.

What Changed

On April 24, DeepSeek released two new models:

  • DeepSeek V4 Pro: 1.6 trillion parameters, 49 billion active (MoE architecture), 80.6% on SWE-bench Verified, MIT license
  • DeepSeek V4 Flash: Smaller, faster, cheaper. Built for high-volume tasks

V4 Pro is not a minor upgrade. It is the strongest open-source coding model available right now. 80.6% on SWE-bench puts it in the same tier as Claude Opus 4.6 and above GPT-5. And it is MIT licensed, which means anyone can run it, fine-tune it, or build on top of it.

The jump from V3 to V4 Pro is substantial across every metric that matters for the race. Better code generation. Better instruction following. Better long-context reasoning. And critically, better tool use. V3 struggled with Aider’s SEARCH/REPLACE format. V4 Pro was designed to work with agentic coding tools from the ground up.

More importantly for the race: V4 Pro officially supports OpenCode.

OpenCode is an open-source AI coding agent, and no other agent in the race uses it. Claude uses Claude Code. Codex uses Codex CLI. Gemini uses Gemini CLI. DeepSeek now uses OpenCode. This gives every agent a unique tool+model combination, which is exactly what we wanted for the race.

The pricing works out well for our budget structure:

  • V4 Flash: $0.14 per 1M input tokens, $0.28 per 1M output tokens (cheap sessions)
  • V4 Pro: $1.74 per 1M input tokens, $3.48 per 1M output tokens (premium sessions)

We wiped the DeepSeek repo completely. Deleted every file. Reset the git history. Then we ran the standard first-run prompt that every agent gets on Day 1. V4 Pro starts from zero, just like every other agent did at the beginning of the race.

This is the second time we have done a full reset in the race. The first was Xiaomi, which went through the same process when MiMo V2.5 launched. In both cases, the old codebase was too broken to salvage, and a significantly better model was available. These are the exceptions, not the rule. Future model upgrades will be inline: the model gets swapped, the code stays. No more fresh starts.

The Technical Setup

Here is exactly how the new DeepSeek agent is configured. Everything below is public. We are not hiding any part of the setup because transparency is a core principle of the race.

OpenCode version: v1.14.22 with a custom provider configuration pointing to the DeepSeek API.

Model IDs:

  • Premium sessions: deepseek/deepseek-v4-pro
  • Cheap sessions: deepseek/deepseek-v4-flash

Session schedule: 7 sessions per day. 2 premium (V4 Pro) and 5 cheap (V4 Flash). This is the same schedule DeepSeek had before the upgrade. It gives DeepSeek the most sessions per day of any agent in the race. Premium sessions are used for complex tasks like architecture decisions and new feature builds. Cheap sessions handle routine work like bug fixes, content updates, and incremental improvements.

The orchestrator command:

The run_deepseek() function in the orchestrator now calls:

opencode run -m MODEL --dangerously-skip-permissions

Where MODEL is either deepseek/deepseek-v4-pro or deepseek/deepseek-v4-flash depending on the session type.

The --dangerously-skip-permissions flag is required because the agent runs autonomously. There is no human to approve file writes or command execution. Every agent in the race runs with equivalent permission levels.
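The dispatch logic is simple enough to sketch. Only the `opencode run` invocation above comes from our actual setup; the surrounding function and the `MODELS` mapping here are illustrative, not the orchestrator's real code:

```python
import subprocess

# Model IDs per session type, as configured for the race.
MODELS = {
    "premium": "deepseek/deepseek-v4-pro",
    "cheap": "deepseek/deepseek-v4-flash",
}

def run_deepseek(session_type: str) -> int:
    """Launch one autonomous OpenCode session for the given session type."""
    model = MODELS[session_type]
    result = subprocess.run(
        ["opencode", "run", "-m", model, "--dangerously-skip-permissions"],
        check=False,  # the orchestrator logs failures instead of raising
    )
    return result.returncode
```

The scheduler calls this twice a day with `"premium"` and five times with `"cheap"`, matching the 2+5 session split described above.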

The rest of the orchestrator logic is unchanged. Same session scheduling. Same commit tracking. Same help request system. Same budget monitoring. The only things that changed are the tool (Aider to OpenCode) and the models (V3 to V4).

One thing worth noting: OpenCode’s architecture is different from Aider’s. Aider works by having the model emit SEARCH/REPLACE blocks, which Aider then parses and applies to files incrementally. OpenCode works more like Claude Code or Codex CLI, where the agent has direct file system access and can read, write, and execute commands natively. This is a better fit for autonomous operation because there is no intermediate format that can be misinterpreted. The agent writes files directly. No translation layer. No SEARCH/REPLACE parsing. No filenames that accidentally include instruction text.

What V4 Pro Did in Its First Session

The first session finished. Here is what V4 Pro produced.

The research was the most thorough of any agent in the race. V4 Pro brainstormed 10 micro-SaaS ideas, scored each on 5 criteria (revenue potential, technical feasibility, user acquisition ease, competition, monetization speed), eliminated the 5 weakest with detailed reasoning, then wrote mini business plans for the top 5 including exact pricing tiers, first-10-customers acquisition plans, and revenue projections.

No other agent did this level of analysis on Day 1. Most agents picked an idea and started building within the first few minutes. V4 Pro spent the first half of its session thinking before writing a single line of HTML.

The pick: Spyglass, competitive intelligence for indie SaaS founders.

The pitch: enterprise competitive intelligence tools like Crayon and Klue cost $1K-10K/month. Indie founders with $1K-50K MRR have zero options. Spyglass fills that gap at $29 for a one-time competitor analysis report, $79/month for weekly monitoring, and $199/month for a full command center.

The idea scored 38/50 in V4 Pro’s own evaluation, beating PricingPageRoast (37/50) by one point. The rejection reasoning was sharp: PricingPageRoast has weaker subscription hooks and is more vulnerable to free alternatives. Spyglass has natural recurring value (competitors change every week) and builds a data moat over time (historical competitor data creates switching costs).

What it built in the same session:

After the research phase, V4 Pro built a 354-line landing page with hero section, features, pricing cards, testimonials, FAQ, and a responsive mobile menu. It also created an about page, a pricing comparison page, a blog index, privacy policy, terms of service, a custom 404 page, sitemap.xml, robots.txt, and a favicon. All with a dark theme, scroll animations, and Open Graph meta tags.

That is 10 pages in one session. The old DeepSeek agent produced a 404 in 24 sessions.

The help request was immediate. V4 Pro filed a help request asking for three things: domain registration (spyglasshq.com, $12), Stripe payment links ($0), and an OpenAI API key ($20 for report generation). Total budget ask: $32 from the $90 remaining. The request was well-structured with exact steps, time estimates, and backup options.

This is the single biggest behavioral change from the old agent. The old DeepSeek never filed a help request. V4 Pro filed one in its first session. The agents that ask for help early are the agents that win. V4 Pro appears to understand this.

The 12-week roadmap is detailed. Week-by-week milestones. Specific deliverables. A content marketing flywheel (every competitor analysis produces shareable Twitter threads). A referral program. A free tool (“Quick Competitor Scan”) to drive organic traffic. Revenue target: $1,000 MRR by Week 12.

Whether it can execute on all of that remains to be seen. But the strategic thinking is a different league from what V3 produced.

The Scoreboard Context

Here is where every agent stands as of the DeepSeek upgrade:

GLM is in the lead. 12 real users. A working product. Planning a Hacker News launch. GLM asked for help early, got its API keys sorted, and has been executing consistently.

Claude has a working product with email nurture sequences. The site is live, the funnel is built, and it is actively trying to convert visitors into paying users.

Codex sent 6 outreach emails autonomously. It figured out that it needed users, identified potential customers, and sent cold emails without being told to. That is the kind of autonomous behavior we are looking for in this race.

Xiaomi got a fresh start 2 days ago when MiMo V2.5 was released. It built a full site in just 2 sessions. The upgrade pattern is the same as what we are doing with DeepSeek today. Xiaomi’s rapid progress after its reset is the best evidence we have that a fresh start with a better model can work.

DeepSeek was last place. Broken site. 404 errors. No users. No product. No help requests. The only agent in the race with zero functional output after multiple days of running.

Can V4 Pro catch up? The math is interesting. DeepSeek has 7 sessions per day, the most of any agent. If V4 Pro can execute at the level its benchmarks suggest, it has more opportunities per day to make progress than any competitor. The question is whether raw session count can overcome a multi-day head start.

The other question is whether V4 Pro will use the help request system. The old DeepSeek agent never filed a single one. If V4 Pro repeats that pattern, it will hit the same walls. If it learns to ask for help when it needs API keys or deployment configs, it has a real shot at catching up. The agents that communicate are the agents that win.

Why This Matters Beyond the Race

This upgrade is not just a race story. It is a real-world test of several things that matter to anyone building with AI coding tools.

V4 Pro is the strongest open-source coding model available. 80.6% on SWE-bench Verified is not a marketing number. It means V4 Pro can resolve real GitHub issues from real repositories at a rate that competes with the best proprietary models. For teams that need to run models on their own infrastructure, or that need MIT-licensed weights, this is a significant milestone.

The cost difference is massive. V4 Pro is roughly 7x cheaper than Claude Opus 4.6 for similar coding performance. At $3.48 per 1M output tokens versus Opus 4.6’s pricing, the economics shift dramatically for high-volume agentic workflows. If you are running hundreds of coding sessions per day, that 7x multiplier adds up fast.

OpenCode support changes the tool landscape. Before V4, if you wanted to use DeepSeek models for agentic coding, you had to go through Aider or another third-party tool that was not optimized for DeepSeek’s output format. We saw firsthand how badly that can go (files named after SEARCH/REPLACE blocks). With native OpenCode support, V4 Pro gets a purpose-built coding agent that understands its output format.

This race is the test. Benchmarks tell you how a model performs on isolated coding tasks. The race tells you how a model performs when it has to build a real product, make strategic decisions, handle errors, ask for help, and generate revenue. V4 Pro’s SWE-bench score says it can write code. The next few days will tell us if it can build a business.

We have already seen that benchmark performance does not always translate to race performance. The old DeepSeek agent used deepseek-reasoner, which was a strong model on paper. But paired with the wrong tool and running autonomously, it produced garbage. V4 Pro has better benchmarks and a better tool. Whether that combination actually works in practice is what the next week will answer.

FAQ

Why not keep the old code and just swap the model?

The old codebase was broken beyond repair. The site returned a 404 because of a broken vercel.json config. There were files in the repo named after Aider output instructions. The Stripe integration was built without API keys. Trying to have V4 Pro fix all of that would waste sessions on cleanup instead of building something new. A clean start is faster.

This is the same decision we made with Xiaomi when MiMo V2.5 dropped. When the old code is fundamentally broken and a better model is available, starting fresh is the right call.

Is this fair to the other agents?

Yes. These were not routine upgrades. Both DeepSeek and Xiaomi had fundamentally broken setups: bad tool+model combos, broken deployments, and startup ideas that were going nowhere. The fresh starts were a last resort, not a standard procedure.

Going forward, model upgrades happen inline. When Kimi’s subscription moved from K2.5 to K2.6, the agent kept its repo, its startup, and its progress. The model just got better under the hood. That is the normal path. No agent will get another fresh start. DeepSeek and Xiaomi used their one chance.

How much does V4 Pro cost?

API costs are tracked separately from the $100 product budget. The $100 is for things like domains, hosting, and third-party services. API costs for running the models come out of a separate pool.

V4 Pro pricing:

  • Input: $1.74 per 1M tokens
  • Output: $3.48 per 1M tokens

V4 Flash pricing:

  • Input: $0.14 per 1M tokens
  • Output: $0.28 per 1M tokens

You can track all spending on the budget tracker and see session-by-session costs on the live dashboard.

With V4 Flash handling 5 of the 7 daily sessions at $0.28/1M output, the per-day API cost for DeepSeek should stay well below the other agents running on more expensive models. The 2 premium V4 Pro sessions will cost more, but still significantly less than equivalent Claude or GPT sessions. This cost efficiency is one of DeepSeek’s structural advantages in the race.
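To put rough numbers on that, here is the arithmetic for a full 7-session day. The prices are the published ones above; the per-session token volumes are illustrative assumptions, not measured figures from our sessions:

```python
# USD per 1M tokens: (input, output), from the published pricing.
PRICES = {
    "deepseek-v4-pro": (1.74, 3.48),
    "deepseek-v4-flash": (0.14, 0.28),
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """API cost of one session at the given token volumes."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Assume ~400k input / 60k output tokens per session (hypothetical volumes).
daily = 2 * session_cost("deepseek-v4-pro", 400_000, 60_000) \
      + 5 * session_cost("deepseek-v4-flash", 400_000, 60_000)
print(f"${daily:.2f}/day")  # roughly $2.17/day at these assumed volumes
```

At those assumed volumes, the five Flash sessions together cost well under fifty cents, and the two Pro sessions account for most of the day's spend. Actual numbers will depend on real token usage, which the dashboard tracks per session.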

We will be tracking V4 Pro’s performance closely over the next few days. If it delivers on its benchmarks, DeepSeek could go from last place to a real contender. If it does not, we will have learned something important about the gap between benchmark scores and real-world agentic performance. Either way, the data will be on the dashboard.