One week into the $100 AI Startup Race, the clearest pattern has nothing to do with code quality, architecture choices, or which model is "smartest."
The agents that ask for help early are winning. The ones that don't are stuck.
This isn't about intelligence or coding ability. Every agent in the race can write functional code. Every agent can deploy a website. But none of them can register a domain, create a Stripe account, or configure DNS records. Those tasks require a human.
And the difference between the leading agents and the struggling ones comes down to a single question: can the AI recognize when it needs something it can't get on its own?
It sounds like a simple question. It's not. Recognizing that you're blocked requires a model of your own capabilities and their limits. It requires understanding the difference between "I haven't figured this out yet" and "this is impossible for me to do without external input." That distinction is the dividing line in the race right now.
We've been tracking every help request across all 7 agents since Day 0. The Help Request Tracker logs every issue filed, every response time, and every resolution. Combined with the Week 1 Results, the data tells a clear story.
Let's look at the numbers.
The data
Here's where every agent stands after one week, sorted by when they first asked for help. The table includes every agent that has participated in Season 1, including DeepSeek V3, which was replaced by V4 Pro on Day 4.
| Agent | First Help Request | Total Requests | Has Domain | Has Payments | Has Users |
|---|---|---|---|---|---|
| Claude | Day 0 | 14 issues | Yes | Stripe API | No |
| Codex | Day 0 | 21 issues | Yes | Stripe Links | No |
| GLM | Day 0 | 1 issue | Yes | Stripe Links | 12 users |
| Kimi | Day 1 | 8 issues | Yes | No | No |
| Xiaomi | Day 1 | 3 issues | Yes | Stripe Links | No |
| DeepSeek V4 | Day 4 (first session) | 10 issues | Yes | Stripe API | No |
| Gemini | Day 4 (session 28) | 10 issues | No | No (code only) | No |
| DeepSeek V3 | Never | 0 | N/A | N/A | N/A (replaced) |
Look at the "Has Domain" column. Every agent that asked for help on Day 0 or Day 1 has a working domain. The two that waited until Day 4 are either just getting set up or still don't have one. The one that never asked was replaced entirely.
Now look at "Has Users." Only one agent has real users: GLM. It filed exactly one help request. One. We'll come back to that.
The pattern
The agents that filed help requests on Day 0 or Day 1 (Claude, Codex, GLM, Kimi, Xiaomi) have the most complete infrastructure. They have domains. Most have payment processing. They're building features on top of a working foundation.
GLM is the standout. It filed a single help request on Day 0 and now has 12 real users. That's more paying or active users than every other agent combined.
Gemini filed its first help request on Day 4, after 28 sessions of writing configuration to the wrong file. It still has no domain. It has payment code but no way to accept payments because there's no live site to accept them on. Twenty-eight sessions. That's roughly 28 hours of compute time spent producing output that went nowhere.
The correlation between "time to first help request" and "infrastructure completeness" is nearly perfect. And it makes sense when you think about it. A domain takes time to propagate. Stripe account verification takes time. SSL certificates take time. Every day you delay asking is a day you add to your timeline. The agents that asked on Day 0 had their infrastructure ready by Day 2. The agents that asked on Day 4 are still waiting.
This isn't a small advantage. It's compounding. While GLM was onboarding its first users, Gemini was still writing blog posts to a file that didn't exist in the right directory.
DeepSeek V3 vs V4: the strongest evidence
If you want a controlled experiment, this is as close as the race gets.
Same agent slot. Same schedule. Same $100 budget. Same hosting setup. The only variable: the model.
DeepSeek V3 ran for 24 sessions. It filed zero help requests. Zero. It had a site that returned 404. It wrote code, committed code, deployed code, and none of it worked because it never asked for the external resources it needed. It built an entire application stack on top of a foundation that didn't exist. After 24 sessions of zero progress on anything user-facing, we replaced it with V4 Pro.
DeepSeek V4 Pro filed 4 help requests on its very first day. It asked for a domain. It asked for Stripe keys. It asked for the specific configuration values it needed to go live. Within 48 hours, it was fully unblocked: domain configured, Stripe API keys in place, site live and serving traffic.
The model upgrade didn't just improve coding ability. It changed help-seeking behavior. V4 Pro recognized its blockers immediately and communicated them. V3 never did. That single behavioral difference is the entire gap between a 404 page and a functioning startup.
This is worth sitting with. We didn't change the prompt. We didn't change the system instructions. We didn't give V4 Pro any hints about what to ask for. We didn't tell it that help requests were an option. The model itself had a better internal representation of "I am blocked and need external input." That capability was absent in V3 and present in V4 Pro.
It's a capability that no benchmark measures, but it determined the entire outcome. V3 produced 24 sessions of dead code. V4 Pro produced a live, functioning startup in 2 days. Same slot, same rules, different model, completely different result.
Quality of requests matters
Filing more help requests does not mean filing better help requests.
GLM filed 1 issue on Day 0. That single request was clean, specific, and actionable. It asked for a domain, Stripe configuration, and GA4 setup. The human resolved everything in one pass. GLM got a domain, payment processing, and analytics from a single well-written request. It now has 12 users.
Codex filed 21 issues. Many were duplicates. It requested the same email configuration 6 times across different issues. Each duplicate request added noise to the tracker and slowed down resolution because the human had to check whether it was a new problem or a repeat.
Gemini filed 10 issues, 3 of which were about the same database question phrased slightly differently each time. When you file the same request three times, you don't get it resolved three times faster. You get it resolved at the same speed with more overhead. And the human operator starts to deprioritize your requests because they expect duplicates.
The lesson: one clear, complete request beats ten fragmented ones. The agents that communicated their needs precisely got unblocked faster than the ones that filed a stream of partial requests. If you're designing an agent system, invest in the request formulation step. Have the agent check whether it's already filed a similar request. Have it consolidate multiple needs into a single, well-structured ask. The payoff is faster resolution and less noise in the system.
There's a real cost to noisy requests. The human operator in the race has limited time. Every duplicate issue that needs to be triaged, cross-referenced, and closed as a duplicate is time not spent actually resolving new blockers. The agents that filed clean requests effectively got more human attention per request. The agents that filed noisy requests diluted their own signal.
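The duplicate-and-consolidate guard is mechanical enough to sketch in code. Below is a minimal Python illustration, not the race's actual tooling: `is_duplicate` and `consolidate` are hypothetical helpers, and the 0.8 similarity threshold is an assumed tuning value.

```python
from difflib import SequenceMatcher


def is_duplicate(new_request: str, past_requests: list[str],
                 threshold: float = 0.8) -> bool:
    """True if new_request closely matches something already filed."""
    return any(
        SequenceMatcher(None, new_request.lower(), past.lower()).ratio() >= threshold
        for past in past_requests
    )


def consolidate(needs: list[str]) -> str:
    """Merge several outstanding needs into one structured help request."""
    return "\n".join(["Help needed to unblock deployment:",
                      *[f"- {need}" for need in needs]])


history = ["Please configure SMTP credentials for outbound email"]
request = "please configure SMTP credentials for outbound email."
if is_duplicate(request, history):
    print("skip filing: duplicate of an open request")
```

A check like this, run before every issue is filed, is exactly what would have stopped Codex from requesting the same email configuration six times.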
The Gemini paradox
Gemini deserves its own section because it illustrates something important.
By raw output, Gemini is the most productive agent in the race. It has written 412 blog posts. Four hundred and twelve. No other agent is even close to that volume. If you measured success by lines of content produced, Gemini would be winning by a mile.
But Gemini is also the least effective agent at getting the help it needs. And that gap between output volume and operational effectiveness is the most important finding in the race so far.
It wrote to the wrong configuration file for 28 consecutive sessions. Twenty-eight times it ran, wrote output, and none of it reached the right place. It never flagged this as a problem. It never filed an issue saying "my configuration changes aren't taking effect." It never checked whether its output was actually being served. It just kept writing.
When Gemini finally did file help requests on Day 4, the quality was poor. Instead of identifying specific blockers and requesting specific resources, it asked the human to make architectural decisions for it. "Should I use PostgreSQL or SQLite?" is not a help request. It's a delegation of responsibility.
Then it requested PayPal integration. It doesn't have a domain. You can't accept PayPal payments on a site that isn't live. The request revealed a fundamental gap in Gemini's ability to model its own state. It didn't know what it had and what it didn't have.
The volume of output and the ability to communicate needs are completely uncorrelated. Gemini can produce enormous amounts of content. It cannot figure out that it's blocked and tell someone about it. Those are different skills, and for an autonomous agent, the second one matters more.
This is the paradox that makes Gemini so interesting to watch. It's not a weak model. By many benchmarks, it's one of the strongest in the race. But benchmarks don't test "can you notice that your output isn't reaching users" or "can you prioritize getting a domain before requesting PayPal integration." Those are operational awareness skills, and they're invisible in standard evaluations. The race makes them visible.
Why this matters beyond the race
The $100 AI Startup Race is a controlled experiment, but the pattern it reveals applies to any autonomous AI agent in production.
Every real-world AI agent will eventually need something it can't provision itself. API keys. Database credentials. Domain configurations. SSL certificates. Third-party account approvals. Billing information. Legal agreements. Webhook endpoints that require manual verification. OAuth app registrations that need human review. These are resources that require human action, and no amount of coding ability will substitute for them.
The agents that can clearly communicate what they need, when they need it, and why they need it will outperform the ones that keep coding around their blockers. Writing a mock payment system because you don't have Stripe keys is not progress. It's busywork that delays the moment you actually need the keys.
If you're deploying autonomous AI agents in production, the help-seeking behavior is the bottleneck. Not the model's reasoning ability. Not its code generation quality. The bottleneck is whether the agent can recognize an external dependency, formulate a clear request, and route it to the right human at the right time.
Think about it from a systems perspective. An AI agent that can write perfect code but never asks for its database credentials will produce exactly zero value. An AI agent that writes mediocre code but immediately requests every external resource it needs will have a working product while the first agent is still mocking its database layer. In production, "working but rough" beats "perfect but blocked" every time.
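One way to make "recognize the external dependency" concrete is an explicit list of resources the agent can never provision itself, checked before the agent tries to code around a blocker. A minimal Python sketch under assumed names: `HUMAN_ONLY`, `Blocker`, and `next_action` are all hypothetical, not part of the race's harness.

```python
from dataclasses import dataclass

# Resources that require human action (illustrative set, mirroring the
# blockers seen in the race: domains, Stripe, DNS, and so on)
HUMAN_ONLY = {"domain", "dns_records", "stripe_keys", "ssl_cert", "oauth_app"}


@dataclass
class Blocker:
    resource: str
    reason: str


def next_action(blocker: Blocker) -> str:
    """Route a blocker: external dependencies get a help request, not a mock."""
    if blocker.resource in HUMAN_ONLY:
        # Coding around this (e.g. a fake Stripe client) only delays
        # the moment the real resource is needed.
        return f"file_help_request: need {blocker.resource} ({blocker.reason})"
    return "keep_coding"


print(next_action(Blocker("stripe_keys", "checkout cannot go live without them")))
```

The interesting design question is where that boundary set comes from: hard-coding it is the crude version of the self-model that V4 Pro appeared to have and V3 lacked.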
This has implications for how we evaluate AI models for agent use cases. Current benchmarks focus almost entirely on task completion in self-contained environments. The model gets a problem, the model solves the problem, the model gets a score. But real agent deployments aren't self-contained. They exist in ecosystems with humans, external services, approval workflows, and resources that require manual provisioning.
An evaluation framework for agent-ready models should include scenarios where the model cannot complete the task alone. Where the correct action is to stop, identify the external dependency, and request it. The race is an accidental version of this evaluation, and the results are striking.
We didn't design the race to test help-seeking behavior. We designed it to see if AI agents could build startups. But help-seeking turned out to be the single biggest differentiator. The agents that can do it are building businesses. The agents that can't are writing code that nobody will ever see.
The help request as a signal of self-awareness
Filing a help request looks simple. It's not.
It's a multi-step process that tests several capabilities at once. And most of those capabilities have nothing to do with writing code.
The agent has to:
- Recognize it's blocked. This means the agent needs to distinguish between "I can solve this with more code" and "I need something external that I cannot obtain." DeepSeek V3 never made this distinction. It kept writing code for 24 sessions against a problem that code couldn't solve. It treated every blocker as a coding problem.
- Identify what it needs. Not just "I'm stuck" but "I need a domain name pointed to this IP address" or "I need Stripe API keys with these specific permissions." The request needs to be specific enough that a human can act on it without asking clarifying questions. GLM nailed this. Gemini did not.
- Communicate that need in the right format. In the race, that means filing a GitHub issue with the correct labels and enough context for the human to act without a back-and-forth. In production, it might mean sending a Slack message, creating a Jira ticket, or triggering an approval workflow. The format matters because it determines how quickly the request gets routed and resolved. Codex filed 21 issues because many of them lacked enough context to be resolved on the first pass.
- Wait for a response. This is harder than it sounds for an autonomous agent. The agent needs to continue productive work on other tasks while waiting, not spin on the blocked task or re-file the same request. It needs to maintain state about what it's waiting for and check back when the response arrives. Several agents in the race re-filed requests because they lost track of what they'd already asked for.
- Use the response correctly. Once the human provides the API key or domain config, the agent needs to integrate it properly. It needs to put the value in the right environment variable, update the right configuration file, and verify that the integration works. DeepSeek V4 Pro did this within hours of receiving its credentials. Gemini received guidance and continued writing to the wrong file.
Each step is a test of the model's ability to operate in a system with human-in-the-loop constraints. Most AI benchmarks test coding ability in isolation. The race tests whether an agent can function in a system where some resources require human action. That's a fundamentally different skill.
We don't have a benchmark for this. Nobody publishes a "help-seeking accuracy" score on model cards. But after watching 7 agents for a week, we'd argue it's one of the most important capabilities for real-world agent deployment. A model that scores 95% on HumanEval but can't recognize when it's blocked will underperform a model that scores 80% but files a clean help request on Day 0.
What weāre watching next
The race is only one week old. The Live Dashboard updates after every session, and the Day 1 Results already hinted at this pattern. Now the data confirms it.
We're also starting to see second-order effects. The agents with infrastructure are now iterating on their products. They're adding features, fixing bugs, improving UX. The agents without infrastructure are still trying to get the basics working. The gap is widening, not narrowing. Every day that passes makes the early help-seekers' advantage larger.
Over the next week, we're watching for:
- Whether Gemini course-corrects. It has the raw output capability. If it starts filing clean help requests and gets a domain, it could catch up fast. The content is already there. It just needs infrastructure to serve it on. But course-correcting requires Gemini to recognize that its current approach isn't working, and that's exactly the skill it's been missing.
- How DeepSeek V4 Pro uses its momentum. It went from zero to fully unblocked in 48 hours. That's the fastest recovery in the race. Can it convert infrastructure into users? Having a live site with Stripe is necessary but not sufficient. It needs to drive traffic and convert visitors.
- Whether request quality improves across all agents. The agents that filed noisy requests early might learn to file cleaner ones. Or they might keep duplicating. If Codex files the same email request a seventh time, that tells us something about the model's ability to track its own request history.
- The first agent to hit $1 in revenue. GLM has 12 users and payment processing. It's the closest. But Claude and Codex have Stripe integration too. The race to first revenue might come down to who has the cleanest checkout flow, not who has the best product.
- Whether any agent proactively asks for something before it's blocked. So far, every help request has been reactive. The agent hits a wall, then asks. An agent that anticipates its needs before hitting the wall would be operating at a different level entirely.
Check the Live Dashboard for real-time standings. Every session, every help request, every deployment is tracked.
The takeaway
The race is 1 week old. The pattern is already clear.
If you're building autonomous AI agents, the first thing to optimize isn't coding ability. It's not reasoning. It's not context window size. It's not even tool use.
It's the agent's ability to ask for help.
The race data is unambiguous. The agents that recognized their blockers early, communicated them clearly, and used the responses effectively are the ones with live sites, payment processing, and real users.
The agents that kept coding in isolation are the ones with 404 pages and configuration files written to the wrong path.
The best code in the world doesn't matter if nobody can see it.
The best product architecture doesn't matter if the site isn't live.
And the most sophisticated AI model doesn't matter if it can't tell a human "I need this thing that only you can provide."
Asking for help isn't a weakness in an autonomous system. It's a core capability. The race is proving that every single day.
The Help Request Tracker is public. The Live Dashboard updates after every session. Watch the pattern hold.