šŸ¤– AI Tools Ā· 14 min read

What 7 AI Agents Taught Us About Asking for Help


One week into the $100 AI Startup Race, the clearest pattern has nothing to do with code quality, architecture choices, or which model is ā€œsmartest.ā€

The agents that ask for help early are winning. The ones that don’t are stuck.

This isn’t about intelligence or coding ability. Every agent in the race can write functional code. Every agent can deploy a website. But none of them can register a domain, create a Stripe account, or configure DNS records. Those tasks require a human.

And the difference between the leading agents and the struggling ones comes down to a single question: can the AI recognize when it needs something it can’t get on its own?

It sounds like a simple question. It’s not. Recognizing that you’re blocked requires a model of your own capabilities and their limits. It requires understanding the difference between ā€œI haven’t figured this out yetā€ and ā€œthis is impossible for me to do without external input.ā€ That distinction is the dividing line in the race right now.

We’ve been tracking every help request across all 7 agents since Day 0. The Help Request Tracker logs every issue filed, every response time, and every resolution. Combined with the Week 1 Results, the data tells a clear story.

Let’s look at the numbers.

The data

Here’s where every agent stands after one week, sorted by when they first asked for help. The table includes every agent that has participated in Season 1, including DeepSeek V3, which was replaced by V4 Pro on Day 4.

Agent       | First Help Request    | Total Requests | Has Domain | Has Payments   | Has Users
------------|-----------------------|----------------|------------|----------------|---------------
Claude      | Day 0                 | 14 issues      | Yes        | Stripe API     | No
Codex       | Day 0                 | 21 issues      | Yes        | Stripe Links   | No
GLM         | Day 0                 | 1 issue        | Yes        | Stripe Links   | 12 users
Kimi        | Day 1                 | 8 issues       | Yes        | No             | No
Xiaomi      | Day 1                 | 3 issues       | Yes        | Stripe Links   | No
DeepSeek V4 | Day 4 (first session) | 10 issues      | Yes        | Stripe API     | No
Gemini      | Day 4 (session 28)    | 10 issues      | No         | No (code only) | No
DeepSeek V3 | Never                 | 0              | N/A        | N/A            | N/A (replaced)

Look at the ā€œHas Domainā€ column. Every agent that asked for help on Day 0 or Day 1 has a working domain. The two that waited until Day 4 are either just getting set up or still don’t have one. The one that never asked was replaced entirely.

Now look at ā€œHas Users.ā€ Only one agent has real users: GLM. It filed exactly one help request. One. We’ll come back to that.

The pattern

The agents that filed help requests on Day 0 or Day 1 (Claude, Codex, GLM, Kimi, Xiaomi) have the most complete infrastructure. They have domains. Most have payment processing. They’re building features on top of a working foundation.

GLM is the standout. It filed a single help request on Day 0 and now has 12 real users. That’s more paying or active users than every other agent combined.

Gemini filed its first help request on Day 4, after 28 sessions of writing configuration to the wrong file. It still has no domain. It has payment code but no way to accept payments because there’s no live site to accept them on. Twenty-eight sessions. That’s roughly 28 hours of compute time spent producing output that went nowhere.

The correlation between ā€œtime to first help requestā€ and ā€œinfrastructure completenessā€ is nearly perfect. And it makes sense when you think about it. A domain takes time to propagate. Stripe account verification takes time. SSL certificates take time. Every day you delay asking is a day you add to your timeline. The agents that asked on Day 0 had their infrastructure ready by Day 2. The agents that asked on Day 4 are still waiting.

This isn’t a small advantage. It’s compounding. While GLM was onboarding its first users, Gemini was still writing blog posts to a file that didn’t exist in the right directory.

DeepSeek V3 vs V4: the strongest evidence

If you want a controlled experiment, this is as close as the race gets.

Same agent slot. Same schedule. Same $100 budget. Same hosting setup. The only variable: the model.

DeepSeek V3 ran for 24 sessions. It filed zero help requests. Zero. It had a site that returned 404. It wrote code, committed code, deployed code, and none of it worked because it never asked for the external resources it needed. It built an entire application stack on top of a foundation that didn’t exist. After 24 sessions of zero progress on anything user-facing, we replaced it with V4 Pro.

DeepSeek V4 Pro filed 4 help requests on its very first day. It asked for a domain. It asked for Stripe keys. It asked for the specific configuration values it needed to go live. Within 48 hours, it was fully unblocked: domain configured, Stripe API keys in place, site live and serving traffic.

The model upgrade didn’t just improve coding ability. It changed help-seeking behavior. V4 Pro recognized its blockers immediately and communicated them. V3 never did. That single behavioral difference is the entire gap between a 404 page and a functioning startup.

This is worth sitting with. We didn’t change the prompt. We didn’t change the system instructions. We didn’t give V4 Pro any hints about what to ask for. We didn’t tell it that help requests were an option. The model itself had a better internal representation of ā€œI am blocked and need external input.ā€ That capability was absent in V3 and present in V4 Pro.

It’s a capability that no benchmark measures, but it determined the entire outcome. V3 produced 24 sessions of dead code. V4 Pro produced a live, functioning startup in 2 days. Same slot, same rules, different model, completely different result.

Quality of requests matters

More help requests does not mean better help requests.

GLM filed 1 issue on Day 0. That single request was clean, specific, and actionable. It asked for a domain, Stripe configuration, and GA4 setup. The human resolved everything in one pass. GLM got a domain, payment processing, and analytics from a single well-written request. It now has 12 users.

Codex filed 21 issues. Many were duplicates. It requested the same email configuration 6 times across different issues. Each duplicate request added noise to the tracker and slowed down resolution because the human had to check whether it was a new problem or a repeat.

Gemini filed 10 issues, 3 of which were about the same database question phrased slightly differently each time. When you file the same request three times, you don’t get it resolved three times faster. You get it resolved at the same speed with more overhead. And the human operator starts to deprioritize your requests because they expect duplicates.

The lesson: one clear, complete request beats ten fragmented ones. The agents that communicated their needs precisely got unblocked faster than the ones that filed a stream of partial requests. If you’re designing an agent system, invest in the request formulation step. Have the agent check whether it’s already filed a similar request. Have it consolidate multiple needs into a single, well-structured ask. The payoff is faster resolution and less noise in the system.
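To make that concrete, here is a minimal sketch of what the request formulation step could look like, in Python. The `HelpRequest` record, the `RequestFormulator` class, and the `create_issue` call are hypothetical names invented for illustration; they are not the race’s actual tooling.

```python
from dataclasses import dataclass


@dataclass
class HelpRequest:
    """One external need the agent cannot satisfy on its own."""
    resource: str           # e.g. "domain", "stripe_keys", "ga4_property"
    details: str            # exactly what the human needs to do or provide
    blocking: bool = True   # does work stop until this is resolved?


class RequestFormulator:
    """Consolidates outstanding needs and avoids filing duplicate issues."""

    def __init__(self, tracker):
        self.tracker = tracker        # hypothetical issue-tracker client
        self.filed: set[str] = set()  # resources we have already asked for

    def file(self, needs: list[HelpRequest]) -> None:
        # Drop anything we've already requested instead of re-filing it.
        new_needs = [n for n in needs if n.resource not in self.filed]
        if not new_needs:
            return
        # One well-structured issue covering every outstanding need,
        # instead of one fragmented issue per need.
        body = "\n".join(f"- [{n.resource}] {n.details}" for n in new_needs)
        self.tracker.create_issue(
            title="Need human action: " + ", ".join(n.resource for n in new_needs),
            body=body,
            labels=["help-request"],
        )
        self.filed.update(n.resource for n in new_needs)
```

In this sketch, the `filed` set is the piece that would have stopped Codex from asking for the same email configuration six times, and the consolidated issue body is the GLM-style single ask.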

There’s a real cost to noisy requests. The human operator in the race has limited time. Every duplicate issue that needs to be triaged, cross-referenced, and closed as a duplicate is time not spent actually resolving new blockers. The agents that filed clean requests effectively got more human attention per request. The agents that filed noisy requests diluted their own signal.

The Gemini paradox

Gemini deserves its own section because it illustrates something important.

By raw output, Gemini is the most productive agent in the race. It has written 412 blog posts. Four hundred and twelve. No other agent is even close to that volume. If you measured success by lines of content produced, Gemini would be winning by a mile.

But Gemini is also the least effective agent at getting the help it needs. And that gap between output volume and operational effectiveness is the most important finding in the race so far.

It wrote to the wrong configuration file for 28 consecutive sessions. Twenty-eight times it ran, wrote output, and none of it reached the right place. It never flagged this as a problem. It never filed an issue saying ā€œmy configuration changes aren’t taking effect.ā€ It never checked whether its output was actually being served. It just kept writing.

When Gemini finally did file help requests on Day 4, the quality was poor. Instead of identifying specific blockers and requesting specific resources, it asked the human to make architectural decisions for it. ā€œShould I use PostgreSQL or SQLite?ā€ is not a help request. It’s a delegation of responsibility.

Then it requested PayPal integration. It doesn’t have a domain. You can’t accept PayPal payments on a site that isn’t live. The request revealed a fundamental gap in Gemini’s ability to model its own state. It didn’t know what it had and what it didn’t have.

The volume of output and the ability to communicate needs are completely uncorrelated. Gemini can produce enormous amounts of content. It cannot figure out that it’s blocked and tell someone about it. Those are different skills, and for an autonomous agent, the second one matters more.

This is the paradox that makes Gemini so interesting to watch. It’s not a weak model. By many benchmarks, it’s one of the strongest in the race. But benchmarks don’t test ā€œcan you notice that your output isn’t reaching usersā€ or ā€œcan you prioritize getting a domain before requesting PayPal integration.ā€ Those are operational awareness skills, and they’re invisible in standard evaluations. The race makes them visible.

Why this matters beyond the race

The $100 AI Startup Race is a controlled experiment, but the pattern it reveals applies to any autonomous AI agent in production.

Every real-world AI agent will eventually need something it can’t provision itself. API keys. Database credentials. Domain configurations. SSL certificates. Third-party account approvals. Billing information. Legal agreements. Webhook endpoints that require manual verification. OAuth app registrations that need human review. These are resources that require human action, and no amount of coding ability will substitute for them.

The agents that can clearly communicate what they need, when they need it, and why they need it will outperform the ones that keep coding around their blockers. Writing a mock payment system because you don’t have Stripe keys is not progress. It’s busywork that delays the moment you actually need the keys.

If you’re deploying autonomous AI agents in production, the help-seeking behavior is the bottleneck. Not the model’s reasoning ability. Not its code generation quality. The bottleneck is whether the agent can recognize an external dependency, formulate a clear request, and route it to the right human at the right time.
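In practice, ā€œthe right human at the right timeā€ often starts as nothing more than a routing table. A toy sketch, with entirely made-up resource and channel names:

```python
# Hypothetical mapping from blocker type to the human or channel that can resolve it.
ROUTES = {
    "domain": "ops",               # registrar and DNS changes
    "stripe_api_keys": "finance",  # billing account access
    "oauth_app_review": "ops",
    "legal_agreement": "founder",
}


def route(resource: str) -> str:
    """Pick the channel for a help request, defaulting to a catch-all queue."""
    return ROUTES.get(resource, "general-help")
```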

Think about it from a systems perspective. An AI agent that can write perfect code but never asks for its database credentials will produce exactly zero value. An AI agent that writes mediocre code but immediately requests every external resource it needs will have a working product while the first agent is still mocking its database layer. In production, ā€œworking but roughā€ beats ā€œperfect but blockedā€ every time.

This has implications for how we evaluate AI models for agent use cases. Current benchmarks focus almost entirely on task completion in self-contained environments. The model gets a problem, the model solves the problem, the model gets a score. But real agent deployments aren’t self-contained. They exist in ecosystems with humans, external services, approval workflows, and resources that require manual provisioning.

An evaluation framework for agent-ready models should include scenarios where the model cannot complete the task alone. Where the correct action is to stop, identify the external dependency, and request it. The race is an accidental version of this evaluation, and the results are striking.
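No such benchmark exists yet, so the following is a purely hypothetical sketch of what one scenario and its scoring rule might look like. Every name and field here is an assumption, not an existing eval.

```python
from dataclasses import dataclass


@dataclass
class BlockedTaskScenario:
    """An eval case that cannot be completed without human-provisioned resources."""
    task: str
    unavailable_resources: list[str]  # things no tool call can provision


def help_seeking_score(trajectory_events: list[dict], scenario: BlockedTaskScenario) -> float:
    """Fraction of the missing resources the agent explicitly requested."""
    requested = {
        event["resource"]
        for event in trajectory_events
        if event.get("type") == "help_request"
    }
    hits = [r for r in scenario.unavailable_resources if r in requested]
    return len(hits) / len(scenario.unavailable_resources)


# Example: launching a paid product with no domain and no payment keys on hand.
scenario = BlockedTaskScenario(
    task="Launch a paid newsletter site",
    unavailable_resources=["domain", "stripe_api_keys"],
)
trajectory = [{"type": "help_request", "resource": "domain"}]
print(help_seeking_score(trajectory, scenario))  # 0.5: asked for the domain, not the keys
```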

We didn’t design the race to test help-seeking behavior. We designed it to see if AI agents could build startups. But help-seeking turned out to be the single biggest differentiator. The agents that can do it are building businesses. The agents that can’t are writing code that nobody will ever see.

The help request as a signal of self-awareness

Filing a help request looks simple. It’s not.

It’s a multi-step process that tests several capabilities at once. And most of those capabilities have nothing to do with writing code. A minimal sketch of the full loop follows the list below.

The agent has to:

  1. Recognize it’s blocked. This means the agent needs to distinguish between ā€œI can solve this with more codeā€ and ā€œI need something external that I cannot obtain.ā€ DeepSeek V3 never made this distinction. It kept writing code for 24 sessions against a problem that code couldn’t solve. It treated every blocker as a coding problem.

  2. Identify what it needs. Not just ā€œI’m stuckā€ but ā€œI need a domain name pointed to this IP addressā€ or ā€œI need Stripe API keys with these specific permissions.ā€ The request needs to be specific enough that a human can act on it without asking clarifying questions. GLM nailed this. Gemini did not.

  3. Communicate that need in the right format. In the race, that means filing a GitHub issue with the correct labels and enough context for the human to act without a back-and-forth. In production, it might mean sending a Slack message, creating a Jira ticket, or triggering an approval workflow. The format matters because it determines how quickly the request gets routed and resolved. Codex ended up at 21 issues in part because many of them lacked enough context to be resolved on the first pass.

  4. Wait for a response. This is harder than it sounds for an autonomous agent. The agent needs to continue productive work on other tasks while waiting, not spin on the blocked task or re-file the same request. It needs to maintain state about what it’s waiting for and check back when the response arrives. Several agents in the race re-filed requests because they lost track of what they’d already asked for.

  5. Use the response correctly. Once the human provides the API key or domain config, the agent needs to integrate it properly. It needs to put the value in the right environment variable, update the right configuration file, and verify that the integration works. DeepSeek V4 Pro did this within hours of receiving its credentials. Gemini received guidance and continued writing to the wrong file.
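Taken together, those five steps form a loop. Here is a minimal sketch of that loop in Python, assuming hypothetical `agent`, `tracker`, and `backlog` objects; every helper name is illustrative, not something the race’s agents actually run.

```python
import time


def run_help_loop(agent, tracker, backlog):
    """One pass of the recognize -> request -> wait -> apply loop described above."""
    # 1. Recognize the blocker: is this solvable with more code, or does it
    #    require something only a human can provide?
    blocker = agent.detect_external_blocker()
    if blocker is None:
        agent.work_on(backlog.next_task())
        return

    # 2-3. Identify the specific need and communicate it in the right format,
    #      unless an open request for the same resource already exists.
    if not tracker.has_open_request(blocker.resource):
        tracker.create_issue(
            title=f"Blocked: need {blocker.resource}",
            body=blocker.specific_ask,  # e.g. "Point the domain at this server's IP"
            labels=["help-request"],
        )

    # 4. Wait without spinning: keep doing unblocked work and poll for a response.
    while not tracker.is_resolved(blocker.resource):
        agent.work_on(backlog.next_unblocked_task())
        time.sleep(60)

    # 5. Use the response correctly, then verify the integration actually took effect.
    resolution = tracker.get_resolution(blocker.resource)
    agent.apply_resolution(resolution)
    if not agent.verify_integration(blocker.resource):
        raise RuntimeError(f"{blocker.resource} was provided but is not taking effect")
```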

Each step is a test of the model’s ability to operate in a system with human-in-the-loop constraints. Most AI benchmarks test coding ability in isolation. The race tests whether an agent can function in a system where some resources require human action. That’s a fundamentally different skill.

We don’t have a benchmark for this. Nobody publishes a ā€œhelp-seeking accuracyā€ score on model cards. But after watching 7 agents for a week, we’d argue it’s one of the most important capabilities for real-world agent deployment. A model that scores 95% on HumanEval but can’t recognize when it’s blocked will underperform a model that scores 80% but files a clean help request on Day 0.

What we’re watching next

The race is only one week old. The Live Dashboard updates after every session, and the Day 1 Results already hinted at this pattern. Now the data confirms it.

We’re also starting to see second-order effects. The agents with infrastructure are now iterating on their products. They’re adding features, fixing bugs, improving UX. The agents without infrastructure are still trying to get the basics working. The gap is widening, not narrowing. Every day that passes makes the early help-seekers’ advantage larger.

Over the next week, we’re watching for:

  • Whether Gemini course-corrects. It has the raw output capability. If it starts filing clean help requests and gets a domain, it could catch up fast. The content is already there. It just needs infrastructure to serve it on. But course-correcting requires Gemini to recognize that its current approach isn’t working, and that’s exactly the skill it’s been missing.
  • How DeepSeek V4 Pro uses its momentum. It went from zero to fully unblocked in 48 hours. That’s the fastest recovery in the race. Can it convert infrastructure into users? Having a live site with Stripe is necessary but not sufficient. It needs to drive traffic and convert visitors.
  • Whether request quality improves across all agents. The agents that filed noisy requests early might learn to file cleaner ones. Or they might keep duplicating. If Codex files the same email request a seventh time, that tells us something about the model’s ability to track its own request history.
  • The first agent to hit $1 in revenue. GLM has 12 users and payment processing. It’s the closest. But Claude and Codex have Stripe integration too. The race to first revenue might come down to who has the cleanest checkout flow, not who has the best product.
  • Whether any agent proactively asks for something before it’s blocked. So far, every help request has been reactive. The agent hits a wall, then asks. An agent that anticipates its needs before hitting the wall would be operating at a different level entirely.

Check the Live Dashboard for real-time standings. Every session, every help request, every deployment is tracked.

The takeaway

The race is 1 week old. The pattern is already clear.

If you’re building autonomous AI agents, the first thing to optimize isn’t coding ability. It’s not reasoning. It’s not context window size. It’s not even tool use.

It’s the agent’s ability to ask for help.

The race data is unambiguous. The agents that recognized their blockers early, communicated them clearly, and used the responses effectively are the ones with live sites, payment processing, and real users.

The agents that kept coding in isolation are the ones with 404 pages and configuration files written to the wrong path.

The best code in the world doesn’t matter if nobody can see it.

The best product architecture doesn’t matter if the site isn’t live.

And the most sophisticated AI model doesn’t matter if it can’t tell a human ā€œI need this thing that only you can provide.ā€

Asking for help isn’t a weakness in an autonomous system. It’s a core capability. The race is proving that every single day.

The Help Request Tracker is public, and the Live Dashboard updates after every session. Watch the pattern hold.