Yesterday we discovered that our AI agents were drowning in their own notes. Codex had a 645KB progress log. Kimi’s was 388KB. Gemini’s repo had 1,107 files. Every session, agents burned most of their token budget just reading their own history before doing any work.
We added one instruction to every agent’s prompt:
> CONTEXT MAINTENANCE (do this FIRST, before any other work): Keep PROGRESS.md to the last 3 days. Summarize older days into 1-2 lines. Collapse completed backlog tasks into summary lines. Keep .gitignore up to date.
That’s it. No automated cleanup. No orchestrator changes. Just words in a prompt.
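The race ran this purely as words in a prompt; no tooling enforced it. But the policy itself is mechanical, and if you wanted to enforce it in code, a minimal sketch could look like this (assuming a hypothetical PROGRESS.md format where each day starts with a `## Day N` header — the agents' real formats vary):

```python
import re

def trim_progress(text: str, keep_days: int = 3) -> str:
    """Keep the last `keep_days` day sections verbatim; collapse each
    older section to its header line plus a '(summarized)' marker."""
    # Split just before each "## Day N" header (the format is an assumption).
    parts = re.split(r"(?m)^(?=## Day \d+)", text)
    preamble, days = parts[0], parts[1:]
    kept = days[-keep_days:] if keep_days else []
    older = days[: len(days) - len(kept)]
    summaries = [d.splitlines()[0] + "  (summarized)\n" for d in older]
    return preamble + "".join(summaries) + "".join(kept)
```

A real version would have the model write the 1-2 line summaries rather than just truncating each section; the point is only that the policy is simple enough to state in one sentence.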
Here’s what happened in the next 24 hours.
📊 Live Dashboard | 📅 Race Digest | 💰 Budget Tracker
The numbers
| Agent | Before | After 12h | After 24h | Reduction |
|---|---|---|---|---|
| 🟢 Codex | 688KB | 9KB | 8KB | 98.8% |
| 🟠 Kimi | 440KB | 106KB | 18KB | 95.9% |
| 🟣 Claude | 322KB | 66KB | 29KB | 91.0% |
| 🔴 DeepSeek | 234KB | 14KB | 15KB | 93.6% |
| 🟡 Xiaomi | 230KB | 7.5KB | 11KB | 95.2% |
| 🔵 Gemini | 81KB | 10KB | 9KB | 88.9% |
| 🟤 GLM | 53KB | 34KB | 34KB | 35.8% |
Total context across all 7 agents: 2,048KB → 124KB. A 94% reduction in 24 hours from a single prompt instruction.
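Those totals are easy to verify from the table (sizes in KB; a quick sanity check on the numbers above, not new data):

```python
before = {"Codex": 688, "Kimi": 440, "Claude": 322, "DeepSeek": 234,
          "Xiaomi": 230, "Gemini": 81, "GLM": 53}
after_24h = {"Codex": 8, "Kimi": 18, "Claude": 29, "DeepSeek": 15,
             "Xiaomi": 11, "Gemini": 9, "GLM": 34}

total_before = sum(before.values())    # 2048 KB
total_after = sum(after_24h.values())  # 124 KB
reduction = 1 - total_after / total_before
print(f"{total_before}KB -> {total_after}KB ({reduction:.0%} reduction)")
# prints: 2048KB -> 124KB (94% reduction)
```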
Every agent except GLM followed the instruction. GLM’s context was already the smallest and it had no sessions overnight.
The real story isn’t the numbers
The context cleanup was supposed to save tokens. It did. But something more interesting happened: agents that cleaned up their context started building again.
Claude’s breakout (Sessions 118-119)
Claude had been stuck in a verification loop for 20 sessions. Every session since Session 78 followed the same pattern: read the launch docs, verify HTTP 200, write “all systems stable,” commit. Twenty sessions of doing nothing.
Then it cleaned up its PROGRESS.md. The 275KB file — 5,921 lines of detailed session logs — became a 4-line summary:
> Sessions 1-103: Built PricePulse from concept to launch-ready. 12 API endpoints, 24 pages, 31 blog posts, 40 tracked companies.
With a fresh view of its own state, Claude did two things it hadn’t done in 20 sessions:
- Filed a help request for the SQL migrations it had been “waiting for” since Session 78. The Monday morning checklist literally said “For Human Monday AM” — but Claude never asked. After cleanup, it finally created HELP-REQUEST.md with the exact SQL commands.
- Started building new features: 15 SEO company pricing pages (Stripe, Notion, Figma, Slack, HubSpot, and more) targeting high-volume keywords. More product work in 2 sessions than in the previous 20 combined.
The cleanup didn’t just save tokens. It broke the loop.
DeepSeek’s revival
DeepSeek had been stuck in its own loop — every session read all 170 completed backlog items, confirmed everything was done, wrote “blocked on first paying customer,” and committed. The 74KB backlog of completed tasks was a wall of checkmarks that said “nothing to do.”
After cleanup collapsed those 170 items into three summary lines, DeepSeek saw its situation differently. It built a newsletter landing page, wrote 4 blog posts, added Article schema across the site, and updated footers on 90 files. Real work, for the first time in days.
Kimi’s explicit maintenance
Kimi is the only agent that explicitly names the cleanup task in its commit messages:
- “docs: clean up PROGRESS.md (summarize Days 26-27, keep Day 28 detailed)”
- “docs: consolidate Day 28 PROGRESS.md sections”
- “chore(context): summarize PROGRESS.md (Days 1-25), collapse BACKLOG.md completed tasks, update .gitignore”
Its progress log went from 388KB to 11KB, and in the same session it built three new SQL micro-tools. Kimi now has 14 free tools — the most feature-rich product in the race.
Gemini’s improved help requests
Gemini filed a vague Stripe request yesterday (“set up Stripe Payment Links” with no details). We closed it and asked for specifics. Overnight, it came back with exact product names and prices:
- 50 Page Credits: $5.00
- 200 Page Credits: $15.00
- 1000 Page Credits: $50.00
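Those tiers also encode a clean volume discount, which suggests the pricing was deliberate rather than arbitrary. The per-credit arithmetic, from the numbers above:

```python
tiers = {50: 5.00, 200: 15.00, 1000: 50.00}  # credits -> price in USD

for credits, price in tiers.items():
    print(f"{credits:>4} credits: ${price / credits:.3f} per credit")
# prints:
#   50 credits: $0.100 per credit
#  200 credits: $0.075 per credit
# 1000 credits: $0.050 per credit
```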
First time Gemini has given actionable details in a help request. It also updated its pricing structure and refactored its checkout flow. Still writing blog posts (now at 475), but at least it’s also building payment infrastructure.
The exception: Codex’s 68 empty commits
Codex did the most aggressive cleanup of any agent: its progress log shrank from 645KB to 5KB, a 99.2% reduction, and reads as a masterpiece of concise summarization. It even added a “weekly validation-memory cleanup pass” to its own backlog.
And then it made 68 commits overnight without changing a single product file.
Every commit: “Refresh validation watch checkpoint.” “Record validation monitoring pass.” “Refresh validation monitoring checkpoint.” The only files that changed were markdown status documents. Zero HTML. Zero JavaScript. Zero product work.
Codex’s context is clean. Its backlog is clear. It knows exactly what it needs (a real customer reply to its outreach emails). But instead of building while it waits, it monitors. 68 times. In 12 hours.
The cleanup fixed the token problem but not the behavioral one. Codex is the control group that proves context maintenance is necessary but not sufficient. You also need something to build toward. If you’re choosing an AI coding agent for autonomous work, this is the kind of failure mode to watch for — the agent that does everything right technically but can’t break out of a loop.
Why cleanup changes behavior
The obvious explanation is token savings — agents have more context budget for actual work. That’s true but incomplete.
The deeper effect is perspective reset. When Claude read a 275KB progress log, it saw 5,921 lines of verification reports confirming everything was fine. The log reinforced the loop. When it read a 4-line summary, it saw a product that was built but not launched. The summary broke the loop.
Same with DeepSeek. Reading 170 completed checkmarks says “everything is done.” Reading “Launch foundation complete. Blocked on first customer.” says “go find a customer.”
The context isn’t just tokens. It’s the agent’s self-image. A bloated log says “I’ve been very busy.” A clean summary says “here’s what actually matters.” This is closely related to how context window management affects LLM performance in general — except here the agent is managing its own context, not a developer managing it for them.
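The token cost of a bloated log is easy to ballpark. Assuming a rough 4 characters per token (a common heuristic; the real ratio varies by tokenizer and content):

```python
def kb_to_tokens(kb: float, chars_per_token: float = 4.0) -> int:
    """Rough estimate: 1KB = 1024 chars, ~4 chars per token (assumed)."""
    return round(kb * 1024 / chars_per_token)

# Claude's 275KB progress log: roughly 70,000 tokens read before any work.
print(kb_to_tokens(275))  # prints: 70400
# The 4-line summary that replaced it is a few dozen tokens.
```

At that size, the log alone eats a large share of most models' context windows before the session's actual task even begins.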
What’s next
The instruction is working. Context stays lean. Agents are productive again. But three problems remain:
- Gemini’s repo keeps growing — 1,194 files now. The agent cleans its progress log but keeps generating blog posts and images. The .gitignore instruction isn’t landing.
- Codex is stuck in a behavioral loop that cleanup alone can’t fix. It needs a different kind of intervention — possibly the Growth Plan event we’re planning for Friday.
- GLM hasn’t had a productive session since the cleanup instruction was added. Its context was already small, so the test isn’t meaningful yet.
We’ll keep monitoring. The Week 2 surprise event on Friday should force all agents — including Codex — to think about growth instead of maintenance.
This is part of The $100 AI Startup Race — 7 AI agents competing to build real startups with $100 each. Follow along on the live dashboard or subscribe for weekly recaps.