An AI agent without cost controls is a credit card with no limit. One runaway loop, one verbose prompt, one user who discovers they can ask unlimited questions — and your monthly API bill goes from $50 to $5,000 overnight.
This isn’t theoretical. In our AI Startup Race, each agent has exactly $100 for 12 weeks. When the money runs out, the agent stops. That constraint forces smart cost management from day one. Here’s how to apply the same discipline to production agents.
Where the money goes
A typical AI agent interaction costs:
| Component | Tokens | Cost (GPT-4o) | Cost (GPT-4o-mini) |
|---|---|---|---|
| System prompt | 500-2,000 | $0.001-0.005 | $0.0001-0.0003 |
| User message | 100-500 | $0.0003-0.001 | $0.00002-0.0001 |
| Tool calls (3-5) | 2,000-10,000 | $0.005-0.025 | $0.0003-0.002 |
| Agent response | 500-2,000 | $0.002-0.008 | $0.0002-0.001 |
| Total per interaction | 3,000-15,000 | $0.008-0.04 | $0.0006-0.003 |
At 1,000 users making 10 requests/day: $80-400/day with GPT-4o, or $6-30/day with GPT-4o-mini. The model choice alone is a 10-15x cost difference.
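The daily figures follow directly from the per-interaction costs above; a quick sanity check:

```python
# Daily cost = users x requests/user x cost/interaction,
# using the per-interaction ranges from the table above.
USERS = 1_000
REQUESTS_PER_USER = 10

GPT_4O = (0.008, 0.04)         # USD per interaction (low, high)
GPT_4O_MINI = (0.0006, 0.003)

def daily_cost(low: float, high: float) -> tuple[float, float]:
    n = USERS * REQUESTS_PER_USER
    return (n * low, n * high)

print(daily_cost(*GPT_4O))       # roughly $80-400 per day
print(daily_cost(*GPT_4O_MINI))  # roughly $6-30 per day
```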
Strategy 1: Model routing
Not every request needs your most expensive model. Route based on complexity:
```python
from agents import Agent, Runner

# Cheap model for simple tasks
simple_agent = Agent(
    name="Quick Helper",
    model="gpt-4o-mini",
    instructions="Answer simple questions concisely.",
)

# Expensive model for complex tasks
complex_agent = Agent(
    name="Deep Thinker",
    model="gpt-4o",
    instructions="Handle complex reasoning, debugging, and architecture questions.",
)

async def route_request(message: str):
    # Simple heuristic: short messages get the cheap model
    if len(message) < 200 and "?" in message:
        return await Runner.run(simple_agent, message)
    return await Runner.run(complex_agent, message)
```
Better approach: use a classifier to route:
```python
classifier = Agent(
    name="Router",
    model="gpt-4o-mini",  # cheap model for classification
    instructions="""Classify the user's request as 'simple' or 'complex'.
Simple: factual questions, formatting, basic code snippets.
Complex: debugging, architecture, multi-step reasoning, security review.
Respond with only 'simple' or 'complex'.""",
)
```
This adds one cheap API call per request, but saves money on the 60-70% of requests that don't need the expensive model.
For open-source alternatives, route simple tasks to Ollama running locally (cost: $0) and only use paid APIs for complex tasks.
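The dispatch step itself can be a plain lookup; a minimal sketch, assuming the classifier's reply has been reduced to the string `simple` or `complex` (the model names match the agents above):

```python
# Map the classifier's one-word verdict to a model tier.
MODEL_FOR_LABEL = {
    "simple": "gpt-4o-mini",
    "complex": "gpt-4o",
}

def pick_model(label: str) -> str:
    # Unknown or garbled labels fall back to the expensive model,
    # so a misclassification never silently degrades an answer.
    return MODEL_FOR_LABEL.get(label.strip().lower(), "gpt-4o")
```

Failing open to the expensive model costs a little more on bad classifications but protects answer quality.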
Strategy 2: Per-user budgets
Track each user's token usage and reject requests once they exceed a limit:
```python
from datetime import date

import redis

r = redis.Redis()

DAILY_LIMIT = 50_000       # tokens per user per day
MONTHLY_LIMIT = 1_000_000  # tokens per user per month

async def check_and_track(user_id: str, tokens_used: int) -> bool:
    daily_key = f"usage:{user_id}:{date.today()}"
    monthly_key = f"usage:{user_id}:{date.today().strftime('%Y-%m')}"
    daily = int(r.get(daily_key) or 0)
    monthly = int(r.get(monthly_key) or 0)
    if daily + tokens_used > DAILY_LIMIT:
        return False  # daily limit reached
    if monthly + tokens_used > MONTHLY_LIMIT:
        return False  # monthly limit reached
    # Increment both counters in one round trip and keep their TTLs fresh
    pipe = r.pipeline()
    pipe.incrby(daily_key, tokens_used)
    pipe.expire(daily_key, 86_400)       # 24 hours
    pipe.incrby(monthly_key, tokens_used)
    pipe.expire(monthly_key, 2_678_400)  # 31 days
    pipe.execute()
    return True
```
When a user hits their limit, degrade gracefully: switch to a cheaper model, reduce max response length, or show a “limit reached” message with an upgrade option.
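The degradation step can be expressed as a pure policy function; a sketch with illustrative thresholds (the 80% cutoff and token caps are assumptions, not measured values):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Policy:
    model: str
    max_output_tokens: int

def degrade_policy(tokens_used: int, daily_limit: int) -> Optional[Policy]:
    """Pick a cheaper policy as the user approaches their daily limit.

    Returns None when the budget is exhausted, signaling the caller
    to show a "limit reached" message with an upgrade option.
    """
    used = tokens_used / daily_limit
    if used >= 1.0:
        return None
    if used >= 0.8:
        return Policy("gpt-4o-mini", 256)  # cheap model, short replies
    return Policy("gpt-4o", 1024)          # full experience
```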
Strategy 3: Prompt caching and compression
System prompts are sent with every request. A 2,000-token system prompt across 10,000 daily requests = 20M input tokens/day.
Cache system prompts: OpenAI and Anthropic both offer prompt caching that reduces cost for repeated prefixes by 50-90%.
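A back-of-the-envelope on the 20M-token example above, assuming a 50% discount on cached input tokens (actual discounts and minimum-prefix rules vary by provider and model):

```python
# 2,000-token system prompt x 10,000 requests/day = 20M input tokens/day.
PROMPT_TOKENS = 2_000
REQUESTS = 10_000
PRICE_PER_M = 2.50      # GPT-4o input, USD per 1M tokens
CACHE_DISCOUNT = 0.50   # assumed; check your provider's current pricing

tokens = PROMPT_TOKENS * REQUESTS
full_cost = tokens / 1_000_000 * PRICE_PER_M     # uncached: $50/day
cached_cost = full_cost * (1 - CACHE_DISCOUNT)   # cached:   $25/day
```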
Compress context: Instead of sending full file contents, send summaries:
```python
# Expensive: send the full file
context = open("large_file.py").read()  # ~5,000 tokens

# Cheap: send a summary
context = (
    "File: large_file.py (450 lines). Contains: FastAPI app with 12 endpoints, "
    "SQLAlchemy models for User, Order, Product. Auth middleware using JWT."
)  # ~50 tokens
```
Strategy 4: Circuit breakers
Prevent runaway agent loops:
```python
import time

class AgentBudgetExceeded(Exception):
    pass

MAX_TOOL_CALLS = 10
MAX_TOKENS_PER_RUN = 50_000
MAX_DURATION_SECONDS = 120

async def run_with_limits(agent, message):
    tool_calls = 0
    total_tokens = 0
    start = time.time()
    async for event in Runner.run_streamed(agent, message):
        if event.type == "tool_call":
            tool_calls += 1
            if tool_calls > MAX_TOOL_CALLS:
                raise AgentBudgetExceeded("Too many tool calls")
        total_tokens += getattr(event, "tokens", 0)
        if total_tokens > MAX_TOKENS_PER_RUN:
            raise AgentBudgetExceeded("Token limit exceeded")
        if time.time() - start > MAX_DURATION_SECONDS:
            raise AgentBudgetExceeded("Time limit exceeded")
        yield event
```
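The limit logic can be exercised without any API calls by feeding it a fake event stream; the event shape below is simplified for illustration:

```python
class AgentBudgetExceeded(Exception):
    pass

def enforce_limits(events, max_tool_calls: int = 3):
    """Yield events until the tool-call budget is exceeded."""
    tool_calls = 0
    for event in events:
        if event["type"] == "tool_call":
            tool_calls += 1
            if tool_calls > max_tool_calls:
                raise AgentBudgetExceeded("Too many tool calls")
        yield event

# A runaway loop of 5 tool calls trips the breaker after 3.
try:
    list(enforce_limits([{"type": "tool_call"}] * 5))
except AgentBudgetExceeded as e:
    print("circuit breaker tripped:", e)
```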
Strategy 5: Response caching
Many agent queries are similar. Cache responses for identical or near-identical inputs:
```python
import hashlib

from redis import asyncio as aioredis

r = aioredis.Redis()

def cache_key(message: str, agent_name: str) -> str:
    return hashlib.sha256(f"{agent_name}:{message}".encode()).hexdigest()

async def cached_run(agent, message):
    key = cache_key(message, agent.name)
    cached = await r.get(f"response:{key}")
    if cached:
        return cached.decode()  # cache hit: no API call, no cost
    result = await Runner.run(agent, message)
    await r.setex(f"response:{key}", 3600, result.final_output)  # cache for 1 hour
    return result.final_output
```
This works well for FAQ-style agents, documentation assistants, and code explanation tools where the same questions come up repeatedly.
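For near-identical inputs, normalizing the message before hashing raises the hit rate; a minimal sketch (the normalization rules here are illustrative, tune them to your traffic):

```python
import hashlib
import re

def normalized_cache_key(message: str, agent_name: str) -> str:
    # Lowercase, trim, and collapse whitespace runs so trivially
    # different phrasings of the same question share one cache entry.
    norm = re.sub(r"\s+", " ", message.strip().lower())
    return hashlib.sha256(f"{agent_name}:{norm}".encode()).hexdigest()
```

For genuinely paraphrased questions, an embedding-similarity lookup (semantic caching) goes further, at the cost of an extra embedding call per request.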
Real-time cost monitoring
Track costs in real-time with your observability platform:
# After each agent run
async def log_cost(user_id, agent_name, input_tokens, output_tokens, model):
costs = {
"gpt-4o": {"input": 2.50, "output": 10.00}, # per 1M tokens
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
"claude-sonnet-4": {"input": 3.00, "output": 15.00},
}
rate = costs.get(model, costs["gpt-4o-mini"])
cost = (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
await metrics.record("agent_cost", cost, tags={
"user": user_id, "agent": agent_name, "model": model
})
Set alerts for:
- Daily spend exceeding 2x the average
- Any single user consuming more than 10% of total budget
- Any single agent run costing more than $1
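The three alert rules are simple enough to express as a pure function, which also makes them unit-testable; a sketch (the thresholds mirror the list above):

```python
def spend_alerts(daily_spend: float, avg_daily_spend: float,
                 user_spend: float, total_budget: float,
                 run_cost: float) -> list[str]:
    """Return the names of any triggered alert rules."""
    alerts = []
    if daily_spend > 2 * avg_daily_spend:
        alerts.append("daily_spend_spike")
    if user_spend > 0.10 * total_budget:
        alerts.append("user_over_10_percent")
    if run_cost > 1.00:
        alerts.append("expensive_single_run")
    return alerts
```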
The $100 budget framework
From running the AI Startup Race, here’s what we learned about budgeting:
| Budget | What you get | Duration |
|---|---|---|
| $10/mo | ~3M tokens GPT-4o-mini | Hobby project |
| $50/mo | ~15M tokens GPT-4o-mini or ~1.5M GPT-4o | Small SaaS |
| $200/mo | Mixed routing, 100+ daily users | Growing product |
| $1,000/mo | Full GPT-4o, 1,000+ daily users | Serious product |
The key insight: start with the cheapest model that works, add expensive models only for tasks that need them, and always have per-user limits.
Related: Monitor AI API Spending · AI Coding Tools Pricing · Deploy AI Agents to Production · LLM Observability · OpenRouter Complete Guide · Best AI Agent Frameworks · FinOps for AI