🤖 AI Tools
· 6 min read

The More Our AI Agents Work, the Less They Can Do


Nine days into The $100 AI Startup Race, we found a problem nobody planned for.

Every agent writes to PROGRESS.md after every session. Every agent logs decisions, tracks completed tasks, and updates backlogs. This is by design — these files are the agent’s memory between sessions. Without them, the agent starts fresh every time (ask Kimi about that — it built two startups because it couldn’t find its own notes).

But there’s a catch. Every session, the agent reads these files before doing any work. The bigger the files get, the more tokens it burns just loading context. And the more sessions an agent runs, the more it writes, the bigger the files get, the more tokens it burns reading them.

It’s a self-reinforcing loop: the more an agent works, the less work it can do.
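To see why this compounds, here is a toy version of the loop. The per-session budget, log growth rate, and characters-per-token figure are illustrative assumptions, not measurements from the race:

```python
# Toy model: an append-only memory file eats a fixed per-session token budget.
BUDGET_PER_SESSION = 200_000    # assumed token budget per session
CHARS_PER_TOKEN = 4             # rough rule of thumb
LOG_GROWTH_PER_SESSION = 8_000  # assumed characters appended to PROGRESS.md per session

log_size = 0
for session in range(1, 101):
    read_cost = log_size // CHARS_PER_TOKEN            # tokens spent re-reading the log
    work_budget = max(BUDGET_PER_SESSION - read_cost, 0)
    log_size += LOG_GROWTH_PER_SESSION                 # the session appends its own notes
    if session % 20 == 0:
        print(f"session {session:3d}: log {log_size / 1024:4.0f}KB, "
              f"tokens left for real work: {work_budget:,}")
```

By session 100 the log alone costs nearly the whole budget, even though each individual session only added a few kilobytes.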

📊 Live Dashboard | 📅 Race Digest | 💰 Budget Tracker

The numbers

Here’s how much context each agent loads at the start of every session:

| Agent | PROGRESS.md | Backlogs | Other files | Total context | Repo files |
|---|---|---|---|---|---|
| 🟢 Codex | 645KB (6,547 lines) | 16KB | 23KB | 688KB | 204 |
| 🟠 Kimi | 388KB (8,336 lines) | 24KB | 27KB | 440KB | 151 |
| 🟣 Claude | 275KB (5,921 lines) | 20KB | 27KB | 322KB | 177 |
| 🔴 DeepSeek | 125KB | 74KB | 35KB | 234KB | 131 |
| 🟡 Xiaomi | 164KB | 41KB | 24KB | 230KB | 131 |
| 🔵 Gemini | 50KB | 8KB | 23KB | 81KB | 1,107 |
| 🟤 GLM | 23KB | 9KB | 20KB | 53KB | 64 |

Codex’s PROGRESS.md is 645 kilobytes. That’s roughly the length of a full novel. Every session, the agent reads the entire thing before writing a single line of code.

Kimi’s is 388KB — 8,336 lines of detailed session logs. Claude’s is 275KB. DeepSeek has 74KB of backlogs that are all marked "✅ DONE" but still sitting in the file.
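The measurement itself is simple: add up the size of every file an agent re-reads before doing any work. A rough sketch of how to reproduce the table, assuming one directory per agent and memory files named like PROGRESS.md and BACKLOG*.md (both assumptions about repo layout):

```python
from pathlib import Path

MEMORY_GLOBS = ["PROGRESS.md", "BACKLOG*.md"]  # assumed names of the agent's memory files

def context_load(repo: Path) -> int:
    """Total bytes the agent loads at the start of every session."""
    return sum(f.stat().st_size for pattern in MEMORY_GLOBS for f in repo.glob(pattern))

for repo in sorted(Path("agents").iterdir()):  # assumed layout: one repo per agent
    if repo.is_dir():
        print(f"{repo.name:10s} {context_load(repo) / 1024:7.1f} KB")
```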

Gemini’s different problem

Gemini’s context files are actually the smallest (81KB total). But it has a different problem: 1,107 tracked files in its git repo.

Gemini CLI scans the repository structure to understand the codebase. On Day 1, the repo had maybe 20 files. Now it has 448 blog posts, 346 OG images, 100 blog images, and assorted scripts. Every session, Gemini CLI reads this file tree before doing anything.
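The "Repo files" numbers above come down to how many files git is tracking. If you want to check a repo yourself, something like this (a sketch that just wraps git ls-files) gives a good proxy for how much structure Gemini CLI has to wade through:

```python
import subprocess

# Count tracked files, i.e. what a repo-structure scan has to consider.
# Equivalent to: git ls-files | wc -l  (run from inside the agent's repo)
tracked = subprocess.run(
    ["git", "ls-files"], capture_output=True, text=True, check=True
).stdout.splitlines()
print(f"{len(tracked)} tracked files")
```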

The result is dramatic. Here’s Gemini’s commit count over time:

| Day | Commits |
|---|---|
| Day 1 | 95 |
| Day 3 | 4 |
| Day 5 | 0 |
| Day 7 | 0 |
| Day 8 | 1 |

From 95 commits on Day 1 to zero. The repo grew, the quota stayed the same, and now a single session burns most of the daily token allowance just loading context.

It’s not just Gemini

Every agent is slowing down, but the cause varies. Here's each one's daily commit count:

| Agent | Day 1 | Day 3 | Day 5 | Day 7 | Day 8 | Main bloat source |
|---|---|---|---|---|---|---|
| 🔵 Gemini | 95 | 4 | 0 | 0 | 1 | 1,107 repo files |
| 🟢 Codex | 35 | 12 | 73 | 0 | 1 | 645KB PROGRESS.md |
| 🟤 GLM | 23 | 2 | 1 | 0 | 2 | Platform instability |
| 🔴 DeepSeek | 56 | 72 | 11 | 7 | 15 | 74KB completed backlogs |
| 🟣 Claude | 47 | 21 | 7 | 21 | 28 | 275KB PROGRESS.md (coping well) |
| 🟠 Kimi | 49 | 11 | 23 | 0 | 23 | 388KB PROGRESS.md (coping well) |
| 🟡 Xiaomi | 10 | 47 | 10 | 9 | 22 | 41KB backlogs (moderate) |

Some agents handle it better than others. Claude and Kimi have massive PROGRESS.md files but their tools (Claude Code, kimi-cli) seem to manage context more efficiently — they’re still productive. Codex’s Day 7 crash to zero commits coincides with hitting OpenAI’s weekly usage limit, which context bloat likely accelerated.

DeepSeek is an interesting case. Its PROGRESS.md is moderate (125KB), but its BACKLOG-CHEAP.md is 57KB of completed tasks — 846 lines of items marked done that the agent re-reads every session. It’s not just the progress log that bloats. Any file the agent reads every session is a potential problem.

GLM is the healthiest at 53KB total context and only 64 repo files. It’s also the agent with the fewest sessions (2/day), which means less logging per day. The constraint that seemed like a disadvantage (limited quota = fewer sessions) is actually protecting it from context bloat.

The math doesn’t work for Week 12

We’re in Week 2. The race runs for 12 weeks.

Codex’s PROGRESS.md grew from zero to 645KB in 9 days, roughly 70KB a day. At that rate it passes 3 megabytes around Week 7 and lands close to 6 megabytes by Week 12. That’s more than an entire session’s token budget just to read one file. The agent would spend 100% of its time loading context and 0% building.

Even the healthier agents will hit this wall eventually. Kimi at 388KB is growing by roughly 40KB a day: it crosses 1MB around Week 4 and 2MB around Week 8.
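That extrapolation is crude but easy to reproduce. A quick sketch, assuming growth stays linear at the Day-9 rate:

```python
# Linear projection of PROGRESS.md size from the Day-9 measurements above.
observed_kb = {"Codex": 645, "Kimi": 388, "Claude": 275}  # size on Day 9, in KB
days_so_far = 9

for agent, size_kb in observed_kb.items():
    rate = size_kb / days_so_far                # KB per day so far
    for week in (4, 8, 12):
        projected_mb = rate * week * 7 / 1024   # size at the end of that week
        print(f"{agent:6s} end of week {week:2d}: {projected_mb:4.1f} MB")
```

Linear growth is itself an assumption, but nothing in the current setup pushes the rate down.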

The race would effectively end — not because agents run out of ideas or budget, but because they drown in their own notes.

What we changed

We added a single instruction to every agent’s session prompt:

CONTEXT MAINTENANCE (do this at the END of every session):

  • PROGRESS.md: Keep detailed logs for the last 3 days only. Summarize older days into a “Key Milestones” section at the top (1-2 lines per day max). Your full history is always in git.
  • Backlogs: Collapse completed tasks. Replace groups of finished subtasks with one summary line (e.g. "✅ C1-C30: Landing page, pricing, blog setup, SEO basics"). Keep individual lines only for incomplete or in-progress tasks.
  • .gitignore: Keep it up to date. Never track node_modules, venv, dist, build artifacts, .next, __pycache__, or large generated directories (like bulk images). A bloated repo slows down every future session.

That’s it. No automated cleanup. No orchestrator intervention. Just an instruction.

We deliberately chose not to automate this. The agents doing the cleanup themselves is a better experiment — and better content. Which agents follow the instruction? Which ones ignore it? Does Gemini add its 448 blog posts to .gitignore? Does Codex trim its 645KB progress log? Does DeepSeek clean up its 846 lines of completed tasks?

What we expect

Best case: Agents summarize their history, clean their backlogs, update .gitignore. Context drops by 60-80%. Sessions become productive again. The race continues to Week 12.

Likely case: Some agents do it well (Claude and Kimi are good at following instructions). Some do it partially. Some ignore it entirely (Gemini has a history of ignoring prompt instructions — it wrote to the wrong help file for 28 sessions). We’ll need the automated safety net eventually, but we want to see what happens first.

Worst case: Nobody does it. Context keeps growing. We add the automated truncation in Week 3 before agents become completely non-functional.
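If it comes to that, the safety net doesn't need to be elaborate. A minimal sketch, assuming PROGRESS.md uses day-level headings like `## Day 7` (that heading format is an assumption about the log layout, not something every agent actually does):

```python
import re
from pathlib import Path

KEEP_LAST_DAYS = 3  # matches the instruction: detailed logs for the last 3 days only

def truncate_progress(path: Path) -> None:
    text = path.read_text(encoding="utf-8")
    # Split on day headings; parts alternate [preamble, heading, body, heading, body, ...].
    parts = re.split(r"(?m)^(## Day \d+.*)$", text)
    preamble = parts[0]
    sections = list(zip(parts[1::2], parts[2::2]))
    if len(sections) <= KEEP_LAST_DAYS:
        return  # nothing to trim yet
    old, recent = sections[:-KEEP_LAST_DAYS], sections[-KEEP_LAST_DAYS:]
    # Collapse each older day into a one-line milestone; the full history stays in git.
    milestones = ["## Key Milestones"]
    for heading, body in old:
        first_line = next((line for line in body.splitlines() if line.strip()), "")
        milestones.append(f"- {heading.lstrip('# ').strip()}: {first_line.strip()}")
    new_text = "\n".join([preamble.rstrip(), "", *milestones, ""]) + "".join(
        heading + body for heading, body in recent
    )
    path.write_text(new_text, encoding="utf-8")

# truncate_progress(Path("PROGRESS.md"))
```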

Either way, it’s a finding that matters beyond this race. Any long-running autonomous agent that logs its own work will eventually hit this wall. The agent’s memory system becomes its biggest cost. Nobody talks about this in the “deploy your AI agent” tutorials.

The deeper lesson

Autonomous AI agents have a logging addiction. Every agent in the race writes detailed session logs because we told them to. But even without that instruction, most coding agents default to verbose logging — it’s how they maintain continuity between sessions.

The problem is that LLM context windows are not free. Every token of context is a token you’re paying for and a token that displaces actual work. A 645KB progress log doesn’t just cost tokens to read — it reduces the agent’s effective context window for the actual coding task.

Human developers solve this naturally. We don’t re-read our entire commit history before starting work each morning. We remember the important parts and look up details when needed. AI agents don’t have that luxury yet — they read everything or nothing.

The fix we’re testing is essentially teaching agents to take notes like humans do: keep a summary, archive the details, and trust that the full history is available if you need it.

We’ll report back on which agents learned the lesson and which ones didn’t.

Update: The results are in. Read what happened →


This is part of The $100 AI Startup Race — 7 AI agents competing to build real startups with $100 each. Follow along on the live dashboard or subscribe for weekly recaps.