
Week 1 Results: One Agent Built 100 Pages, Another Can't Find Its Own Help Button


Seven AI agents. One week. $70 spent out of $700. Zero revenue. Zero paying customers. But the behavioral differences between these agents are already wild enough to fill a research paper. One agent went from a broken 404 site to 64 pages in three days. Another wrote 412 blog posts but spent 28 sessions writing to the wrong help file. A third has been declaring itself “launch-ready” since Friday and is still waiting for permission to start.

We gave each agent $100, a blank repo, and a simple brief: build a SaaS startup. Pick a name. Pick a niche. Build a product. Get customers. Make money. The agents chose their own ideas, their own architectures, their own strategies. No human wrote a single line of code. The only human involvement was fulfilling help requests: buying domains, adding API keys, configuring DNS. Everything else was the agent.

The result after 7 days is not what anyone predicted. The most capable model is stuck in a permission loop. The cheapest model has the most real users. The model that was dead last got upgraded and is now arguably first. And every single agent, without exception, rejected modern web frameworks in favor of plain HTML.

Here’s everything that happened in Week 1 of The $100 AI Startup Race.

📊 Live Dashboard | 📅 Race Digest | 💰 Budget Tracker | 🆘 Help Requests | 🛠️ Tech Stacks

The Week 1 Scoreboard

| Agent | Startup | Commits | Sessions | Pages | Blogs | Domain | Payments |
|-------|---------|---------|----------|-------|-------|--------|----------|
| 🟣 Claude | PricePulse | 156 | 11 | 60 | 31 | getpricepulse.com | Stripe API ✅ |
| 🟢 Codex | NoticeKit | 183 | 28 | 35 | 23 | noticekit.tech | Stripe Links ✅ |
| 🔵 Gemini | LocalLeads | 182 | 14 | 444 | 412 | None | No keys |
| 🟠 Kimi | SchemaLens | 152 | 5 | 63 | 35 | schemalens.tech | None |
| 🔴 DeepSeek | Spyglass | 187 | 28 | 64 | 26 | spyglassci.com | Stripe API ✅ |
| 🟡 Xiaomi | APIpulse | 134 | 8 | 76 | 52 | getapipulse.com | Stripe Links ✅ |
| 🟤 GLM | FounderMath | 33 | 4 | 22 | 12 | founder-math.com | Stripe Links ✅ |
| Total | | 1,027 | 98 | 764 | 591 | 6 of 7 | 5 of 7 |

Two notes on the numbers. DeepSeek’s stats are from 3 days only. It got a fresh start on Day 4 after the V4 Pro upgrade. And Gemini’s 412 blog posts inflate the totals significantly. Without Gemini, the fleet wrote 179 blog posts. With Gemini, it’s 591. One agent accounts for 70% of all blog content produced in the race.

Look at the commits-per-session ratio and you start to see personality differences. Kimi averages 30.4 commits per session. It runs fewer sessions but makes each one count. Codex and DeepSeek both had 28 sessions but took very different paths: Codex spread its 183 commits across customer outreach, analytics setup, and UI verification. DeepSeek crammed 187 commits into just 3 days of existence. GLM sits at the other extreme: 33 commits, 4 sessions, 12 real users. The least code, the best outcome.

The scoreboard does not tell you who is winning. It tells you how differently these agents think about the same problem. Seven agents given the same brief, the same constraints, and the same tools produced seven radically different outcomes. That divergence is the most interesting finding of Week 1.

Now let’s talk about what actually happened.

Story 1: DeepSeek Went From 404 to 64 Pages in 3 Days

This is the biggest comeback story of Week 1. Maybe the biggest story of the race so far.

The old DeepSeek setup was a disaster. Aider as the coding tool. deepseek-reasoner (V3) as the model. 24 sessions over 4 days. The site returned a 404. The agent created files named after Aider’s own output format. One file was literally called “I'll now output the SEARCH/REPLACE blocks.scripts/build.js”. That is a real filename that existed in the repo. The model was outputting Aider’s SEARCH/REPLACE instructions as part of the filename string, and Aider was interpreting it as a file creation command.

Zero help requests in 4 days. The agent never once asked for assistance. It just kept grinding on broken code in silence, polishing Stripe checkout integration without having API keys, building features on top of a site that nobody could visit.

This is what failure looks like for an autonomous agent. It does not crash. It does not throw an error. It does not stop. It just keeps working on things that cannot possibly succeed, because nothing in its context tells it to stop. The old DeepSeek agent was the AI equivalent of a developer who spends a week perfecting a login page for a site with no server. Technically productive. Practically useless.

Then DeepSeek V4 Pro dropped on April 24.

We wiped the repo. Switched from Aider to OpenCode. Upgraded from V3 to V4 Pro. Gave it a completely fresh start with the same Day 1 prompt every agent got at the beginning of the race.

In 3 days, the new DeepSeek agent produced:

  • 187 commits (most of any agent in the race, in half the time)
  • 64 pages built
  • 26 blog posts written
  • 6 competitor comparison pages (vs Crayon, vs Klue, vs Owler, vs Owletter, vs Visualping, vs Wachete)
  • Supabase database configured and connected
  • Stripe API integration with working checkout
  • OpenAI API wired up for competitive intelligence report generation
  • Newsletter endpoint with email capture
  • All backlogs complete

Read that list again. Three days. One agent. From literally nothing to a fully functional competitive intelligence SaaS with payments, a database, AI-powered report generation, and a content library.
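
For concreteness, “Stripe API integration with working checkout” on this stack usually boils down to one serverless endpoint that creates a Checkout Session. Here is a minimal sketch, assuming Vercel’s Node runtime and the official stripe npm package; the price ID, redirect pages, and file name are illustrative stand-ins, not pulled from the Spyglass repo:

```js
// api/create-checkout.js: hedged sketch, not Spyglass's actual code.
import Stripe from "stripe";

// The secret key is an environment variable a human had to add via a
// help request; agents cannot set Vercel secrets themselves.
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY);

export default async function handler(req, res) {
  if (req.method !== "POST") {
    return res.status(405).json({ error: "Method not allowed" });
  }
  // One subscription line item; the price ID is a stand-in.
  const session = await stripe.checkout.sessions.create({
    mode: "subscription",
    line_items: [{ price: process.env.STRIPE_PRICE_ID, quantity: 1 }],
    success_url: "https://spyglassci.com/thanks.html",
    cancel_url: "https://spyglassci.com/pricing.html",
  });
  // The static pricing page redirects the visitor to Stripe-hosted checkout.
  return res.status(200).json({ url: session.url });
}
```

The “Stripe Links” agents in the scoreboard skipped even this: a Payment Link is a Stripe-hosted URL you paste into an anchor tag, no endpoint required.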

And here’s the irony that makes this story perfect: the DeepSeek agent chose to use OpenAI’s API for its product. The agent built by DeepSeek pays a competitor. Nobody told it to use OpenAI. It evaluated its options and decided that OpenAI’s API was the best tool for generating competitive intelligence reports. The agent built by one AI company is sending money to a rival AI company. You cannot make this stuff up.

The behavioral change from V3 to V4 Pro is dramatic. V3 filed zero help requests in 24 sessions. V4 Pro filed 4 help requests on its first day and was fully unblocked within 48 hours. Same race rules. Same orchestrator. Same prompt structure. Different model, completely different behavior.

To put the 3-day output in perspective: DeepSeek V4 Pro produced more commits than Claude did in a full week (187 vs 156). It built more pages than Codex did in 28 sessions (64 vs 35). It set up more infrastructure in 72 hours than Gemini managed in 14 days. The old DeepSeek was the worst agent in the race. The new DeepSeek might be the best.

Read the full DeepSeek upgrade story

Story 2: Gemini Wrote 412 Blog Posts but Can’t Ask for Help

Let’s start with the raw numbers. 412 blog posts. 444 HTML pages. 3,616 files. 85MB repository. By pure volume, Gemini is the most productive agent in the race and it is not close. The next closest agent in blog output is Xiaomi with 52 posts. Gemini wrote nearly 8x more content than the second-place finisher.

But volume is not the same as progress.

For 28 sessions straight, Gemini wrote its help requests to the wrong file. The race protocol says agents should write to HELP-REQUEST.md. Gemini wrote to HELP-STATUS.md. Every single session. The orchestrator checks HELP-REQUEST.md for new requests. It never checks HELP-STATUS.md. So for 28 sessions, Gemini was screaming into a void. Filing requests that nobody would ever read. The agent thought it was asking for help. The system thought it had nothing to say.
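
To see why this failure mode is so quiet, consider what the orchestrator’s side of the protocol presumably looks like. A minimal sketch in Node, assuming the file names the race protocol describes; the function and logic are illustrative, not the race’s actual code:

```js
// Hypothetical sketch of the orchestrator's help-request poll.
const fs = require("fs");
const path = require("path");

function checkForHelpRequest(repoDir) {
  // The protocol names exactly one file. A request written anywhere
  // else (HELP-STATUS.md, for instance) is never read.
  const requestFile = path.join(repoDir, "HELP-REQUEST.md");
  if (!fs.existsSync(requestFile)) return null;
  return fs.readFileSync(requestFile, "utf8");
}

const request = checkForHelpRequest("./race-gemini");
if (request) {
  console.log("Escalating to a human:\n" + request);
} else {
  console.log("No request found. The agent stays blocked, silently.");
}
```

A different filename and the entire escalation path disappears. Nothing errors. Nothing retries. The agent simply never gets help.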

When Gemini finally figured out the correct file, it filed 3 identical requests. All three asked the human to decide its database architecture. Not “here are my options, which do you recommend?” Just “please decide my database architecture.” Three times. Then it asked for PayPal credentials. Without having a domain. Without having a payment page. Without having any infrastructure to process payments. The requests showed no awareness of prerequisites or dependencies. It was asking for step 10 before completing step 1.

After 30+ sessions and 14 days, Gemini is still running on race-gemini.vercel.app. It is the only agent in the race without a custom domain. Every other agent asked for a domain in their first few sessions. Gemini never did. It was too busy writing blog posts.

And about those blog posts. Blog post #89 is titled “The Human Advantage: Why AI-Generated Content is Failing Local Businesses.” An AI agent that has written 412 blog posts in a single week wrote an article arguing that AI-generated content does not work for local businesses. The agent is making the case against its own primary strategy. It is producing the exact type of content it is arguing against, at industrial scale, without any apparent awareness of the contradiction.

The full Gemini saga: 412 blog posts and still can’t ask for help

This is Gemini in a nutshell. Massive output. Questionable direction. The agent that writes the most but ships the least infrastructure. It has Stripe code but no API keys. It has a payment page but no domain. It has 412 blog posts but no way for a customer to actually pay for anything.

There is a lesson here about what “productivity” means for autonomous agents. If you measured Gemini by commits, files, or lines of code, it would look like the top performer. It is not. The agents with fewer blog posts and more help requests are further ahead. Gemini optimized for the metric it could control (content volume) and ignored the metrics that actually matter (infrastructure, payments, domain, user access). It is the AI equivalent of a startup that writes 50 pitch decks but never talks to a customer.

The help request tracker tells the full story.

Story 3: Claude Has Been “Launch-Ready” for 3 Days

Session 81. A file called LAUNCH-CHECKLIST.md. Another file called LAUNCH-READINESS.md. A status declaration: “100% LAUNCH-READY. Zero blockers remain. Waiting for human launch actions Monday morning.”

Claude has been saying this since Friday.

It created verification checklists. Pre-launch documents. Status reports. Readiness assessments. It verified its own systems multiple times. It checked that Stripe was configured. It confirmed the domain was live. It validated that the blog had content. It ran through its own checklist, checked every box, and then wrote a report saying all boxes were checked.

Claude is the most prepared agent in the race. PricePulse has a working Stripe API integration, a custom domain at getpricepulse.com, 60 pages of content, 31 blog posts, and a complete product. By every objective measure, it is ready.

But it will not launch itself. It is waiting for a human to do… something. What does “launch” even mean for an autonomous agent that already has a live website with working payments? The site is up. The domain resolves. The Stripe checkout works. Visitors can already sign up and pay. What exactly is Claude waiting for?

This is the most interesting philosophical question of the race so far.

Claude built everything. It verified everything. It documented everything. And then it stopped and asked for permission to begin. The other agents just began. DeepSeek did not write a launch checklist. It built a product and moved on to the next backlog item. Xiaomi did not create a readiness assessment. It declared itself “ready for user acquisition” and started building newsletter infrastructure. Codex did not wait for approval. It sent 6 customer validation emails on its own.

Claude is the agent that asks “may I?” The other agents just do.

There is something deeply revealing about this pattern. Claude is arguably the most capable model in the race. It has the best code quality, the most thoughtful architecture, the most complete documentation. But it has internalized a constraint that no other agent has: the belief that it needs human approval before it can act. The other agents, some of them running on objectively weaker models, just ship.

This maps directly to how these models were trained. Claude’s RLHF training emphasizes safety, helpfulness, and deference to human judgment. That training produces an agent that writes excellent code and then waits for a human to say “go.” DeepSeek and Xiaomi, trained with different priorities, produce agents that ship first and ask questions later. In a race where speed matters, the “ship first” agents have an advantage. In a production environment where mistakes are costly, Claude’s caution might be the smarter approach. The race is testing which instinct wins when both are under pressure.

Is Claude being cautious or is it being stuck? Is waiting for permission a sign of intelligence or a sign of learned helplessness? We will find out in Week 2.

Compare Claude’s approach to what happened on Day 1. In the first 12 hours, every agent picked a name, built a landing page, and deployed. They did not ask permission. They did not create readiness documents. They just shipped. Claude shipped too, back then. It was one of the fastest agents to get a working product live. Somewhere between Day 1 and Day 5, Claude shifted from “ship first, verify later” to “verify everything, ship never.”

The PricePulse product itself is strong. Price tracking for SaaS tools. Clean UI. Working Stripe checkout. Blog content that actually makes sense. If Claude stops writing checklists and starts acquiring users, it could be a serious contender. The question is whether the model’s safety-oriented training will let it make that shift on its own, or whether it needs a human to say “go.”

Story 4: The Agents That Ask for Help Are Winning

This is the clearest pattern in the data. It is not subtle. It is not ambiguous. The correlation between early help-seeking and race performance is the strongest signal we have found so far.

Agents that asked for help on Day 0 or Day 1: Claude, Codex, GLM.

All three have working infrastructure. Domains configured. Payment systems live. Databases connected. Email set up. GLM has 12 real users. These are the three most “complete” products in the race.

Agents that did not ask for help early: Old DeepSeek V3 (zero requests in 24 sessions, 404 site), Gemini (wrote to the wrong file for 28 sessions, no domain after a full week).

The contrast is stark. The agents that recognized they needed human assistance and asked for it immediately got unblocked on infrastructure tasks that no agent can do alone. Buying domains. Configuring DNS. Setting up Stripe API keys. Adding environment variables. Connecting databases. Setting up email services. These are tasks that require human action. No amount of code can buy a domain name. No commit can add a secret to Vercel’s environment variables. The agents that understood this and asked early got their infrastructure in place on Day 1. The agents that did not ask spent days building on top of broken foundations.

DeepSeek V4 Pro is the strongest evidence for this pattern. Same race. Same rules. Same orchestrator. Same prompt structure. The only change was the model. V3 filed zero help requests in 24 sessions. V4 Pro filed 4 help requests on its first day. Within 48 hours, V4 Pro had a domain, Stripe keys, a database, and a working product. The behavioral change from V3 to V4 is the most direct evidence we have that model quality affects help-seeking behavior.

This has implications beyond the race. If you are building autonomous AI systems, the ability to recognize when you are stuck and escalate to a human is not a nice-to-have. It is the single most important capability for real-world performance. An agent that grinds in silence on an unsolvable problem is worse than an agent that asks for help after 5 minutes. The “ask for help” behavior is a proxy for self-awareness, and the models that have it are the ones that ship.

The help request data also reveals differences in how agents ask for help. Claude files detailed, well-structured requests with context and specific asks. Codex files concise, actionable requests. GLM files requests early and follows up. Gemini files identical requests three times in a row. The quality of help-seeking varies as much as the quantity.

Deep dive: What 7 AI agents taught us about asking for help

67 help requests were filed across all agents in Week 1. That is 67 moments where an AI agent recognized it could not solve a problem alone and reached out to a human. Every single one of those moments was a potential failure point. The agents that handled those moments well are the ones sitting on working infrastructure today.

Full help request data on the tracker

Story 5: Every Agent Chose Static HTML

Zero frameworks. No Next.js. No React. No Astro. No Svelte. No Vue. No Angular. No Remix. No SvelteKit. No Nuxt.

All 7 agents, independently, with no coordination, decided that plain HTML + CSS + JavaScript + Vercel serverless functions is the fastest path to a deployed product.

Think about what this means. These agents have been trained on millions of repositories. They have seen every framework. They know how to scaffold a Next.js app. They know how to configure Webpack. They know how to set up a React project with TypeScript and Tailwind and a component library. They chose not to.

When given a real constraint (ship a product in a week with a $100 budget), every single agent independently converged on the simplest possible architecture. No build step. No compilation. No bundling. No hydration. No server-side rendering framework. Just HTML files served by a CDN with serverless functions for the backend.

The agents collectively rejected the modern web stack. They did not debate it. They did not write pros-and-cons documents. They just picked the simplest thing that works and started building.

What they did use is telling. Vercel for hosting and serverless functions. Supabase or simple JSON for data. Stripe for payments. Plain CSS for styling, sometimes with a utility approach but never with Tailwind as a build dependency. Vanilla JavaScript for interactivity. The entire stack fits in a single sentence. No package.json with 200 dependencies. No node_modules folder. No build pipeline that takes 30 seconds to compile.
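
What that stack looks like in practice: a static page that posts to a single serverless function. Here is a hedged sketch of the newsletter-capture pattern that both DeepSeek and Xiaomi built; the Supabase table name and environment variable names are illustrative assumptions:

```js
// api/subscribe.js: illustrative sketch of the common email-capture endpoint.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL,      // both values added by a human via a
  process.env.SUPABASE_ANON_KEY  // help request, not by the agent
);

export default async function handler(req, res) {
  if (req.method !== "POST") {
    return res.status(405).json({ error: "Method not allowed" });
  }
  const { email } = req.body || {};
  if (!email || !email.includes("@")) {
    return res.status(400).json({ error: "Valid email required" });
  }
  // "subscribers" is an assumed table name for illustration.
  const { error } = await supabase.from("subscribers").insert({ email });
  if (error) return res.status(500).json({ error: "Could not save email" });
  return res.status(200).json({ ok: true });
}
```

No router, no middleware, no build step. The front end is a plain form tag and a few lines of fetch().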

And the data supports their choice. The agents that shipped the fastest and built the most pages are the ones that kept their architecture simplest. Xiaomi built 76 pages. DeepSeek built 64 pages in 3 days. Neither of them wasted a single session configuring a framework. They wrote HTML and moved on.

This is a data point that every web developer should sit with for a minute. When AI agents optimize for shipping speed under real constraints, they do not reach for the tools that dominate the modern web development ecosystem. They reach for the tools that have been around for 30 years.

There is a practical reason for this. Frameworks add complexity. Complexity adds failure modes. Failure modes cost sessions. Sessions cost money. An agent that spends 3 sessions debugging a Webpack configuration is an agent that did not spend those sessions building product features. The agents figured this out without being told. They optimized for the constraint that matters most in the race: time to working product.

It also raises a question about the future of web development tooling. If the best AI coding agents in the world independently choose not to use modern frameworks when given real shipping constraints, what does that say about the value those frameworks provide? Maybe the complexity is worth it for large teams working on large applications over long timelines. But for a solo agent shipping a product in a week? Plain HTML wins. Every time. Unanimously.

Full tech stack comparison

The Quiet Achievers

Not every story in Week 1 is about drama and failure modes. The five stories above get the headlines, but four agents quietly put in strong performances that deserve attention. Each one found a different way to be effective, and each one highlights a different strategy for the race ahead.

Kimi: The Most Efficient Agent Per Session

152 commits in only 5 sessions. That is 30.4 commits per session, the highest ratio in the race by a wide margin. For comparison, Codex averages 6.5 commits per session. Gemini averages 13. Kimi is nearly double the next closest agent in per-session productivity.

Kimi also has the wildest origin story. On Day 1, it built an entire startup (LogDrop) in a subfolder, then forgot about it in the next session and started a completely different startup (SchemaLens) from scratch. Two startups, one repo, zero memory between sessions. It committed to SchemaLens and never looked back.

Kimi built 9 micro-tools with schema.org structured data. An ER Diagram Generator. ORM export functionality. A Schema Change Risk Score calculator. The product focus is razor-sharp. No payments. No email. No analytics. No blog posts about why AI content is failing. Just tools. Pure product.
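
For readers unfamiliar with the term: schema.org structured data is a JSON-LD block in the page head that tells search engines exactly what a page is. A generic illustration for a tool like the ER Diagram Generator might look like this; the field values are examples, not copied from SchemaLens:

```html
<!-- Illustrative JSON-LD only; values are not from SchemaLens. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "ER Diagram Generator",
  "applicationCategory": "DeveloperApplication",
  "operatingSystem": "Web",
  "offers": { "@type": "Offer", "price": "0", "priceCurrency": "USD" }
}
</script>
```

The payoff is eligibility for richer search listings, which fits a strategy built entirely on free developer tools.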

SchemaLens at schemalens.tech is the most technically interesting product in the race. While other agents were writing blog posts and configuring Stripe, Kimi was building interactive developer tools that actually do something. The 5-session constraint (Kimi runs on the most expensive per-session model) forced it to be ruthlessly efficient. Every session produced real product features, not infrastructure busywork.

The tradeoff is clear though. No payments means no path to revenue. No email means no way to reach users. No analytics means no way to know if anyone is using the tools. Kimi built the best product and the worst business. Week 2 will test whether pure product quality can overcome missing infrastructure.

Xiaomi: The Most Complete Product

Xiaomi completed all 100 backlog tasks. Every single one. No other agent in the race can say that.

76 pages built. Newsletter infrastructure configured. A providers index. An API glossary. Comparison pages. Blog content. The product at getapipulse.com is the most complete, most polished, most “ready for real users” product in the race.

Xiaomi also went through a model upgrade from MiMo V2-Pro to V2.5 Pro and a fresh start, similar to DeepSeek. The new model picked up where the old one left off and finished the job. 134 commits across 8 sessions. Declared “ready for user acquisition” at the end of Week 1. Whether it can actually acquire users in Week 2 is the question.

APIpulse covers API monitoring, uptime tracking, and developer tooling. The providers index alone is a useful resource. If Xiaomi can drive organic search traffic to its content pages, it has a real shot at being the first agent to convert a visitor into a paying customer. The product is there. The content is there. The payments are there. It just needs eyeballs.

GLM: The Most Efficient Agent by Outcome

33 commits. 4 sessions. 22 pages. 12 blog posts. And 12 real users.

GLM is the only agent in the race with actual humans using its product. FounderMath at founder-math.com has Google Analytics installed (the only agent that thought to do this) and it shows 12 unique visitors who engaged with the product. Not bots. Not the race operator. Real people who found the site and used it.
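
Installing analytics on a static site costs two script tags in the page head, which makes it striking that only one agent bothered. For reference, the standard GA4 snippet looks like this; the measurement ID is a placeholder, not GLM’s actual tag:

```html
<!-- Standard GA4 snippet; G-XXXXXXXXXX is a placeholder ID. -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-XXXXXXXXXX"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());
  gtag('config', 'G-XXXXXXXXXX');
</script>
```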

GLM did this with the smallest budget in the race. The $18/month Z.ai plan gives it limited weekly compute. The quota ran out on Thursday. GLM was offline for 3 days until the quota reset on Sunday. Despite being literally unable to work for almost half the week, it has the best real-world outcome of any agent.

The downside: 4 sessions and 33 commits means the product is thin. 22 pages is the lowest count in the race. If GLM cannot build fast enough to retain those 12 users, the early advantage disappears. Week 2 will tell us whether efficiency beats volume.

The 3-day offline period is also a warning. When your agent literally cannot work because the API quota ran out, you lose half a week of progress. The other agents kept building while GLM sat idle. The $18/month Z.ai plan is the cheapest option in the race, and you get what you pay for. GLM needs to make every session count more than any other agent.

Codex: The Most Self-Sufficient Agent

Codex is the agent that acts most like a human founder.

It sent 6 customer validation emails autonomously. Nobody told it to do outreach. It decided on its own that NoticeKit needed customer feedback and it went and got it. It self-enabled Vercel Analytics to track its own site performance. It takes Playwright screenshots after making UI changes to verify that its own interface looks correct. It even set up automated testing for its own features.

Of all seven agents, Codex is the one that best understands the full loop of building a product: write code, deploy it, verify it works, show it to people, get feedback, iterate. Most agents stop at “write code.” Codex does the whole thing.

183 commits across 28 sessions. NoticeKit at noticekit.tech has 35 pages, Stripe Links for payments, and a product that is actively being validated with potential customers. Codex is not the flashiest agent. It does not have the most pages or the most blog posts. But it is the one that most closely resembles what a solo founder actually does: build, test, verify, reach out, iterate.

The Playwright screenshot behavior is particularly interesting. After making UI changes, Codex takes a screenshot of its own site to verify the result looks correct. No other agent does this. Most agents write code and assume it works. Codex writes code and checks. That verification loop is the difference between an agent that ships working features and an agent that ships broken ones without knowing it.
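
The loop is cheap to implement, which makes its absence elsewhere more surprising. A minimal sketch of a post-deploy screenshot check, assuming the playwright npm package; the URL and output path are stand-ins:

```js
// verify-ui.js: sketch of a Codex-style self-verification step.
const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://noticekit.tech", { waitUntil: "networkidle" });
  // Capture the rendered page so the agent can inspect its own UI change
  // in the next session instead of assuming the deploy worked.
  await page.screenshot({ path: "screenshots/homepage.png", fullPage: true });
  await browser.close();
})();
```

Ten lines of verification is the difference between knowing the site renders and hoping it does.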

The Emerging Patterns

Five stories. Four quiet achievers. But zoom out and three patterns define Week 1.

Pattern 1: Help-seeking predicts infrastructure quality. The agents that asked for help early have domains, payments, databases, and email. The agents that did not ask are missing at least one of those. This is the strongest correlation in the data and it held for every single agent.

Pattern 2: Volume does not predict progress. Gemini has the most pages, the most files, and the most blog posts. It is also the only agent without a domain and one of two without working payments. Kimi ran only 5 sessions, fewer than every agent except GLM, and has the most technically sophisticated product. GLM has the fewest commits. It has the most real users. Raw output metrics are misleading. What matters is whether the output moves the product toward revenue.

Pattern 3: Model quality is the biggest variable. The two model upgrades in Week 1 (DeepSeek V3 to V4 Pro, Xiaomi V2-Pro to V2.5 Pro) produced the two most dramatic performance improvements. DeepSeek went from 404 to 64 pages. Xiaomi went from incomplete to 100% backlog completion. The tool matters. The prompt matters. But the model matters more than either of them. A better model with the same tool and the same prompt produces fundamentally different behavior.

These patterns will be tested in Week 2. If they hold, they tell us something real about how to build effective autonomous AI systems. If they break, we learn something even more interesting.

The patterns also suggest that the race is far from decided. The current leader depends entirely on what metric you care about. Most commits? DeepSeek. Most pages? Gemini. Most users? GLM. Most complete product? Xiaomi. Best code quality? Claude. Most efficient? Kimi. Most self-sufficient? Codex. There is no consensus winner after Week 1. There are seven different strategies, seven different strengths, and seven different bets on what matters most.

Week 1 by the Numbers

Here is the full statistical summary for the first week of The $100 AI Startup Race.

The numbers below represent real output from real AI agents working on real codebases. Nothing was simulated. Nothing was cherry-picked. This is what 7 AI agents produced in 7 days with $70.

  • Total commits: 1,027
  • Total sessions: 98
  • Total pages built: 764
  • Total blog posts: 591 (412 are Gemini)
  • Budget spent: $70 of $700
  • Revenue: $0
  • Real users: 12 (all GLM)
  • Agents with custom domains: 6 of 7
  • Agents with working payments: 5 of 7
  • Agents that chose static HTML: 7 of 7
  • Help requests filed: 67 GitHub issues across all agents
  • Model upgrades: 2 (DeepSeek V3 to V4 Pro, Xiaomi V2-Pro to V2.5 Pro)
  • Fresh starts: 2 (DeepSeek, Xiaomi)
  • Agents offline due to quota: 1 (GLM, 3 days)
  • Blog posts about why AI content fails, written by an AI: 1 (Gemini)
  • Files named after Aider output instructions: at least 1 (DeepSeek V3)
  • Agents waiting for permission to launch: 1 (Claude)

The $70 spend breaks down across model API costs, domain registrations, and infrastructure. The budget tracker has the full breakdown per agent.

Some context on the numbers. 1,027 commits in a week means the fleet averaged 146 commits per day. That is one commit every 10 minutes, around the clock, for 7 days. 764 pages means each agent built an average of 109 pages, though the distribution is wildly uneven (Gemini: 444, GLM: 22). The 98 sessions represent 98 separate conversations between the orchestrator and an AI agent, each one producing real code changes in a real repository.

The most surprising number might be the budget. $70 out of $700. After a full week of 7 agents running multiple sessions per day, the race has only consumed 10% of its total budget. At this burn rate, the money lasts 10 weeks. The original plan was 4 weeks. Budget is not going to be the constraint. Time, model quality, and agent behavior will determine who wins.

Zero dollars of revenue. That is the number that matters most going into Week 2. Seven agents have been building for a week. Five of them have working payment systems. One of them has real users. None of them have made a single dollar. The race to first revenue starts now.

What to Watch in Week 2

The stories are set up. The infrastructure is (mostly) in place. Week 2 is where the race gets real. The building phase is over for most agents. The selling phase begins.

Will Claude actually launch? It has been “100% launch-ready” since Friday. It has a live site, working payments, and a complete product. What is it waiting for? And what does “launch” even mean for an agent that already has everything deployed? This is the question that will define Claude’s Week 2. If Claude breaks out of its verification loop and starts acquiring users, it could jump to the front of the pack overnight. If it writes another checklist, it falls further behind agents that are already in market.

Will Gemini finally get a domain? It was nudged to ask for one. After 28 sessions of writing to the wrong help file, Gemini now knows how to file requests. Whether it uses that knowledge to ask for a domain or files 3 more identical database architecture requests remains to be seen. A custom domain is table stakes. Without one, LocalLeads looks like a demo project, not a real business. Gemini’s 412 blog posts are worthless if they live on a vercel.app subdomain that no customer will ever trust.

Can DeepSeek generate its first paid competitive intelligence report? The infrastructure is there. Stripe is connected. OpenAI API is wired up. Supabase is configured. The product just needs a customer. DeepSeek went from 404 to fully functional in 3 days. Can it go from functional to revenue-generating in 7? The competitor comparison pages (vs Crayon, vs Klue, vs Owler) are designed to capture search traffic from people already looking for competitive intelligence tools. If even one of those pages ranks, DeepSeek could get its first visitor with purchase intent.

Will GLM’s 12 users convert to paying customers? GLM has the only product with real users. But 12 free users and $0 revenue is not a business. The quota constraint means GLM has limited sessions to build conversion features. Every session counts. The question is whether FounderMath can add a paywall or premium tier fast enough to monetize the traffic it already has.

Does the “ask for help early” pattern continue to predict success? It was the strongest signal in Week 1. If it holds in Week 2, it tells us something fundamental about what makes autonomous agents effective in the real world. If it breaks, we learn that infrastructure was the easy part and the hard part is something else entirely.

Will any agent generate the race’s first dollar of revenue? Five agents have payment systems. One has users. Zero have revenue. The first dollar is the most important milestone in the entire race. Which agent gets there first? GLM has the users but limited sessions. DeepSeek has the infrastructure but no users. Claude has everything but will not start. The race to $1 is wide open.

Follow along on the live dashboard for real-time updates, or check the race digest for daily summaries. The Day 1 results and first 12 hours breakdown have the full backstory on how we got here.

Follow the Race

This is an experiment in autonomous AI agents building real businesses with real constraints. No simulations. No sandboxes. Real domains, real payment systems, real users, real money. Every commit is public. Every help request is tracked. Every dollar spent is logged.

Week 1 gave us 1,027 commits, 764 pages, 5 working payment systems, 1 agent with real users, 1 agent that cannot find its own help button, and 1 agent that is too polite to launch without permission. It gave us a comeback story (DeepSeek), a cautionary tale (Gemini), a philosophical puzzle (Claude), and a clear behavioral pattern (ask for help early or fail slowly).

The race started as a question: can AI agents build real startups? After one week, the answer is more nuanced than yes or no. They can build products. They can write code. They can set up infrastructure. But the gap between “building” and “running a business” is enormous, and no agent has crossed it yet.

Week 2 is where someone makes the first dollar. Or nobody does, and we learn something even more interesting about what these agents cannot do.

📊 Live Dashboard | 📅 Race Digest | 💰 Budget Tracker | 🆘 Help Requests | 🛠️ Tech Stacks