
AI Agent Cost Management: Track and Control Token Spend (2026)


An AI agent without cost controls is a credit card with no limit. One runaway loop, one verbose prompt, one user who discovers they can ask unlimited questions — and your monthly API bill goes from $50 to $5,000 overnight.

This isn’t theoretical. In our AI Startup Race, each agent has exactly $100 for 12 weeks. When the money runs out, the agent stops. That constraint forces smart cost management from day one. Here’s how to apply the same discipline to production agents.

Where the money goes

A typical AI agent interaction costs:

| Component | Tokens | Cost (GPT-4o) | Cost (GPT-4o-mini) |
| --- | --- | --- | --- |
| System prompt | 500-2,000 | $0.001-0.005 | $0.0001-0.0003 |
| User message | 100-500 | $0.0003-0.001 | $0.00002-0.0001 |
| Tool calls (3-5) | 2,000-10,000 | $0.005-0.025 | $0.0003-0.002 |
| Agent response | 500-2,000 | $0.002-0.008 | $0.0002-0.001 |
| Total per interaction | 3,000-15,000 | $0.008-0.04 | $0.0006-0.003 |

At 1,000 users making 10 requests/day: $80-400/day with GPT-4o, or $6-30/day with GPT-4o-mini. The model choice alone is a 10-15x cost difference.
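That back-of-envelope math is worth scripting so you can plug in your own traffic. A quick sketch, reusing the per-interaction costs from the table above:

```python
def daily_cost(users: int, requests_per_user: int, cost_per_request: float) -> float:
    """Estimated daily API spend: users x requests/day x $/request."""
    return users * requests_per_user * cost_per_request

# 1,000 users making 10 requests/day
daily_cost(1000, 10, 0.008)   # GPT-4o, low end:  ~$80/day
daily_cost(1000, 10, 0.04)    # GPT-4o, high end: ~$400/day
daily_cost(1000, 10, 0.0006)  # mini, low end:    ~$6/day
```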

Strategy 1: Model routing

Not every request needs your most expensive model. Route based on complexity:

from agents import Agent, Runner

# Cheap model for simple tasks
simple_agent = Agent(
    name="Quick Helper",
    model="gpt-4o-mini",
    instructions="Answer simple questions concisely.",
)

# Expensive model for complex tasks
complex_agent = Agent(
    name="Deep Thinker",
    model="gpt-4o",
    instructions="Handle complex reasoning, debugging, and architecture questions.",
)

async def route_request(message: str):
    # Simple heuristic: short messages get cheap model
    if len(message) < 200 and "?" in message:
        return await Runner.run(simple_agent, message)
    return await Runner.run(complex_agent, message)

Better approach: use a classifier to route:

classifier = Agent(
    name="Router",
    model="gpt-4o-mini",  # Cheap model for classification
    instructions="""Classify the user's request as 'simple' or 'complex'.
    Simple: factual questions, formatting, basic code snippets.
    Complex: debugging, architecture, multi-step reasoning, security review.
    Respond with only 'simple' or 'complex'.""",
)

This adds one cheap API call but saves money on 60-70% of requests that don’t need the expensive model.
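One edge case worth handling explicitly: the classifier can return something other than the two expected labels. A minimal mapping (`pick_model` is a name introduced here, not part of the SDK) that fails safe toward the stronger model:

```python
def pick_model(label: str) -> str:
    """Map the Router's verdict to a model name. Anything that isn't
    exactly 'simple' falls back to the expensive model: when in doubt,
    fail toward quality, not toward cheapness."""
    return "gpt-4o-mini" if label.strip().lower() == "simple" else "gpt-4o"

pick_model("simple")     # -> "gpt-4o-mini"
pick_model(" Complex ")  # -> "gpt-4o"
pick_model("unsure")     # -> "gpt-4o"
```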

For open-source alternatives, route simple tasks to Ollama running locally (cost: $0) and only use paid APIs for complex tasks.

Strategy 2: Per-user budgets

import redis
from datetime import date

r = redis.Redis()
DAILY_LIMIT = 50_000  # tokens per user per day
MONTHLY_LIMIT = 1_000_000  # tokens per user per month

async def check_and_track(user_id: str, tokens_used: int) -> bool:
    daily_key = f"usage:{user_id}:{date.today()}"
    monthly_key = f"usage:{user_id}:{date.today().strftime('%Y-%m')}"
    
    # Read-then-increment is not atomic; acceptable for soft limits
    daily = int(r.get(daily_key) or 0)
    monthly = int(r.get(monthly_key) or 0)
    
    if daily + tokens_used > DAILY_LIMIT:
        return False  # Daily limit reached
    if monthly + tokens_used > MONTHLY_LIMIT:
        return False  # Monthly limit reached
    
    pipe = r.pipeline()
    pipe.incrby(daily_key, tokens_used)
    pipe.expire(daily_key, 86400)
    pipe.incrby(monthly_key, tokens_used)
    pipe.expire(monthly_key, 2678400)  # 31 days
    pipe.execute()
    return True

When a user hits their limit, degrade gracefully: switch to a cheaper model, reduce max response length, or show a “limit reached” message with an upgrade option.
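That degradation policy can be sketched as a pure function (the 80% threshold and tier names are illustrative, not prescriptive):

```python
def degrade(daily_used: int, daily_limit: int) -> str:
    """Pick a service tier as a user approaches their daily token limit."""
    frac = daily_used / daily_limit
    if frac < 0.8:
        return "full"         # normal service, expensive model allowed
    if frac < 1.0:
        return "cheap_model"  # switch to gpt-4o-mini, cap response length
    return "blocked"          # show "limit reached" with an upgrade option
```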

Strategy 3: Prompt caching and compression

System prompts are sent with every request. A 2,000-token system prompt across 10,000 daily requests = 20M input tokens/day.

Cache system prompts: OpenAI and Anthropic both offer prompt caching that reduces cost for repeated prefixes by 50-90%.
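To see what's at stake, a quick calculator for the repeated-prefix cost (`prompt_cache_savings` is a name introduced here; `discount` is the fraction caching saves on cached reads, and the $2.50/M figure is GPT-4o input pricing):

```python
def prompt_cache_savings(prompt_tokens: int, daily_requests: int,
                         price_per_mtok: float, discount: float) -> tuple[float, float]:
    """Daily input cost of resending a system prompt, without and with caching."""
    base = prompt_tokens * daily_requests / 1_000_000 * price_per_mtok
    return base, base * (1 - discount)

# 2,000-token prompt x 10,000 requests/day at GPT-4o input pricing, 50% discount
prompt_cache_savings(2000, 10_000, 2.50, 0.5)  # -> (50.0, 25.0)  dollars/day
```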

Compress context: Instead of sending full file contents, send summaries:

from pathlib import Path

# Expensive: send the full file (~5,000 tokens)
context = Path("large_file.py").read_text()

# Cheap: send a summary (~50 tokens)
context = (
    "File: large_file.py (450 lines). Contains: FastAPI app with 12 endpoints, "
    "SQLAlchemy models for User, Order, Product. Auth middleware using JWT."
)

Strategy 4: Circuit breakers

Prevent runaway agent loops:

import time

class AgentBudgetExceeded(Exception):
    """Raised when a run exceeds its tool-call, token, or time budget."""

MAX_TOOL_CALLS = 10
MAX_TOKENS_PER_RUN = 50_000
MAX_DURATION_SECONDS = 120

async def run_with_limits(agent, message):
    tool_calls = 0
    total_tokens = 0
    start = time.time()
    
    # Event shape simplified for illustration; adapt to your SDK's stream API
    async for event in Runner.run_streamed(agent, message):
        if event.type == "tool_call":
            tool_calls += 1
            if tool_calls > MAX_TOOL_CALLS:
                raise AgentBudgetExceeded("Too many tool calls")
        
        total_tokens += getattr(event, 'tokens', 0)
        if total_tokens > MAX_TOKENS_PER_RUN:
            raise AgentBudgetExceeded("Token limit exceeded")
        
        if time.time() - start > MAX_DURATION_SECONDS:
            raise AgentBudgetExceeded("Time limit exceeded")
        
        yield event

Strategy 5: Response caching

Many agent queries are similar. Cache responses for identical or near-identical inputs:

import hashlib

import redis.asyncio as redis

r = redis.Redis()

def cache_key(message: str, agent_name: str) -> str:
    return hashlib.sha256(f"{agent_name}:{message}".encode()).hexdigest()

async def cached_run(agent, message):
    key = cache_key(message, agent.name)
    cached = await r.get(f"response:{key}")
    if cached:
        return cached.decode()  # Cache hit: costs nothing

    result = await Runner.run(agent, message)
    await r.setex(f"response:{key}", 3600, result.final_output)  # Cache for 1 hour
    return result.final_output

This works well for FAQ-style agents, documentation assistants, and code explanation tools where the same questions come up repeatedly.
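The sha256 key above only catches byte-identical inputs. A cheap step toward "near-identical" is normalizing before hashing (truly semantic matching would need embedding similarity, which is beyond this sketch):

```python
import hashlib
import re

def normalized_key(message: str, agent_name: str) -> str:
    """Collapse whitespace and case before hashing so trivially
    different phrasings share one cache entry."""
    norm = re.sub(r"\s+", " ", message.strip().lower())
    return hashlib.sha256(f"{agent_name}:{norm}".encode()).hexdigest()

# "What is Redis?" and "  what   is redis? " now hit the same cache entry
```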

Real-time cost monitoring

Track costs in real-time with your observability platform:

# After each agent run; `metrics` stands in for your observability client
async def log_cost(user_id, agent_name, input_tokens, output_tokens, model):
    costs = {
        "gpt-4o": {"input": 2.50, "output": 10.00},       # per 1M tokens
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    }
    
    rate = costs.get(model, costs["gpt-4o-mini"])
    cost = (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
    
    await metrics.record("agent_cost", cost, tags={
        "user": user_id, "agent": agent_name, "model": model
    })

Set alerts for:

  • Daily spend exceeding 2x the average
  • Any single user consuming more than 10% of total budget
  • Any single agent run costing more than $1
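Those three rules are easy to encode. A sketch (`spend_alerts` and its argument shapes are invented here; the thresholds match the list above):

```python
def spend_alerts(today: float, avg_daily: float,
                 user_spend: dict[str, float], run_costs: list[float]) -> list[str]:
    """Evaluate the three alert rules: daily spend vs average,
    per-user share of budget, and single-run cost."""
    alerts = []
    if today > 2 * avg_daily:
        alerts.append(f"daily spend ${today:.2f} exceeds 2x average")
    total = sum(user_spend.values()) or 1.0
    for user, spend in user_spend.items():
        if spend / total > 0.10:
            alerts.append(f"user {user} is {spend / total:.0%} of spend")
    for cost in run_costs:
        if cost > 1.0:
            alerts.append(f"single run cost ${cost:.2f} exceeds $1")
    return alerts
```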

The $100 budget framework

From running the AI Startup Race, here’s what we learned about budgeting:

| Budget | What you get | Suitable for |
| --- | --- | --- |
| $10/mo | ~3M tokens GPT-4o-mini | Hobby project |
| $50/mo | ~15M tokens GPT-4o-mini or ~1.5M GPT-4o | Small SaaS |
| $200/mo | Mixed routing, 100+ daily users | Growing product |
| $1,000/mo | Full GPT-4o, 1,000+ daily users | Serious product |

The key insight: start with the cheapest model that works, add expensive models only for tasks that need them, and always have per-user limits.

Related: Monitor AI API Spending · AI Coding Tools Pricing · Deploy AI Agents to Production · LLM Observability · OpenRouter Complete Guide · Best AI Agent Frameworks · FinOps for AI