An AI agent without cost controls is a credit card with no limit. One runaway loop, one verbose prompt, one user who discovers they can ask unlimited questions — and your monthly API bill goes from $50 to $5,000 overnight.
This isn’t theoretical. In our AI Startup Race, each agent has exactly $100 for 12 weeks. When the money runs out, the agent stops. That constraint forces smart cost management from day one. Here’s how to apply the same discipline to production agents.
Where the money goes
A typical AI agent interaction costs:
| Component | Tokens | Cost (GPT-4o) | Cost (GPT-4o-mini) |
|---|---|---|---|
| System prompt | 500-2,000 | $0.001-0.005 | $0.0001-0.0003 |
| User message | 100-500 | $0.0003-0.001 | $0.00002-0.0001 |
| Tool calls (3-5) | 2,000-10,000 | $0.005-0.025 | $0.0003-0.002 |
| Agent response | 500-2,000 | $0.002-0.008 | $0.0002-0.001 |
| Total per interaction | 3,000-15,000 | $0.008-0.04 | $0.0006-0.003 |
At 1,000 users making 10 requests/day: $80-400/day with GPT-4o, or $6-30/day with GPT-4o-mini. The model choice alone is a 10-15x cost difference.
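The daily figures follow directly from the per-interaction costs above; a quick sanity check:

```python
# Daily cost = users x requests/user x cost/interaction,
# using the per-interaction ranges from the table above.
USERS = 1_000
REQUESTS_PER_USER = 10

GPT_4O = (0.008, 0.04)         # USD per interaction (low, high)
GPT_4O_MINI = (0.0006, 0.003)

def daily_cost(low: float, high: float) -> tuple[float, float]:
    n = USERS * REQUESTS_PER_USER
    return (n * low, n * high)

print(daily_cost(*GPT_4O))       # roughly $80-400 per day
print(daily_cost(*GPT_4O_MINI))  # roughly $6-30 per day
```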
Strategy 1: Model routing
Not every request needs your most expensive model. Route based on complexity:
```python
from agents import Agent, Runner

# Cheap model for simple tasks
simple_agent = Agent(
    name="Quick Helper",
    model="gpt-4o-mini",
    instructions="Answer simple questions concisely.",
)

# Expensive model for complex tasks
complex_agent = Agent(
    name="Deep Thinker",
    model="gpt-4o",
    instructions="Handle complex reasoning, debugging, and architecture questions.",
)

async def route_request(message: str):
    # Simple heuristic: short messages get the cheap model
    if len(message) < 200 and "?" in message:
        return await Runner.run(simple_agent, message)
    return await Runner.run(complex_agent, message)
```
Better approach: use a classifier to route:
```python
classifier = Agent(
    name="Router",
    model="gpt-4o-mini",  # cheap model for classification
    instructions="""Classify the user's request as 'simple' or 'complex'.
Simple: factual questions, formatting, basic code snippets.
Complex: debugging, architecture, multi-step reasoning, security review.
Respond with only 'simple' or 'complex'.""",
)
```
This adds one cheap API call per request, but saves money on the 60-70% of requests that don't need the expensive model.
For open-source alternatives, route simple tasks to Ollama running locally (cost: $0) and only use paid APIs for complex tasks.
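The dispatch step itself can be a plain lookup; a minimal sketch, assuming the classifier's reply has been reduced to the string `simple` or `complex` (the model names match the agents above):

```python
# Map the classifier's one-word verdict to a model tier.
MODEL_FOR_LABEL = {
    "simple": "gpt-4o-mini",
    "complex": "gpt-4o",
}

def pick_model(label: str) -> str:
    # Unknown or garbled labels fall back to the expensive model,
    # so a misclassification never silently degrades an answer.
    return MODEL_FOR_LABEL.get(label.strip().lower(), "gpt-4o")
```

Failing open to the expensive model costs a little more on bad classifications but protects answer quality.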
Strategy 2: Per-user budgets
Track each user's token usage and reject requests once they exceed a limit:
```python
from datetime import date

import redis

r = redis.Redis()

DAILY_LIMIT = 50_000       # tokens per user per day
MONTHLY_LIMIT = 1_000_000  # tokens per user per month

async def check_and_track(user_id: str, tokens_used: int) -> bool:
    daily_key = f"usage:{user_id}:{date.today()}"
    monthly_key = f"usage:{user_id}:{date.today().strftime('%Y-%m')}"
    daily = int(r.get(daily_key) or 0)
    monthly = int(r.get(monthly_key) or 0)
    if daily + tokens_used > DAILY_LIMIT:
        return False  # daily limit reached
    if monthly + tokens_used > MONTHLY_LIMIT:
        return False  # monthly limit reached
    # Increment both counters in one round trip and keep their TTLs fresh
    pipe = r.pipeline()
    pipe.incrby(daily_key, tokens_used)
    pipe.expire(daily_key, 86_400)       # 24 hours
    pipe.incrby(monthly_key, tokens_used)
    pipe.expire(monthly_key, 2_678_400)  # 31 days
    pipe.execute()
    return True
```
When a user hits their limit, degrade gracefully: switch to a cheaper model, reduce max response length, or show a “limit reached” message with an upgrade option.
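The degradation step can be expressed as a pure policy function; a sketch with illustrative thresholds (the 80% cutoff and token caps are assumptions, not measured values):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Policy:
    model: str
    max_output_tokens: int

def degrade_policy(tokens_used: int, daily_limit: int) -> Optional[Policy]:
    """Pick a cheaper policy as the user approaches their daily limit.

    Returns None when the budget is exhausted, signaling the caller
    to show a "limit reached" message with an upgrade option.
    """
    used = tokens_used / daily_limit
    if used >= 1.0:
        return None
    if used >= 0.8:
        return Policy("gpt-4o-mini", 256)  # cheap model, short replies
    return Policy("gpt-4o", 1024)          # full experience
```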
Strategy 3: Prompt caching and compression
System prompts are sent with every request. A 2,000-token system prompt across 10,000 daily requests = 20M input tokens/day.
Cache system prompts: OpenAI and Anthropic both offer prompt caching that reduces cost for repeated prefixes by 50-90%.
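A back-of-the-envelope on the 20M-token example above, assuming a 50% discount on cached input tokens (actual discounts and minimum-prefix rules vary by provider and model):

```python
# 2,000-token system prompt x 10,000 requests/day = 20M input tokens/day.
PROMPT_TOKENS = 2_000
REQUESTS = 10_000
PRICE_PER_M = 2.50      # GPT-4o input, USD per 1M tokens
CACHE_DISCOUNT = 0.50   # assumed; check your provider's current pricing

tokens = PROMPT_TOKENS * REQUESTS
full_cost = tokens / 1_000_000 * PRICE_PER_M     # uncached: $50/day
cached_cost = full_cost * (1 - CACHE_DISCOUNT)   # cached:   $25/day
```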
Compress context: Instead of sending full file contents, send summaries:
```python
# Expensive: send the full file
context = open("large_file.py").read()  # ~5,000 tokens

# Cheap: send a summary
context = (
    "File: large_file.py (450 lines). Contains: FastAPI app with 12 endpoints, "
    "SQLAlchemy models for User, Order, Product. Auth middleware using JWT."
)  # ~50 tokens
```
Strategy 4: Circuit breakers
Prevent runaway agent loops:
```python
import time

class AgentBudgetExceeded(Exception):
    pass

MAX_TOOL_CALLS = 10
MAX_TOKENS_PER_RUN = 50_000
MAX_DURATION_SECONDS = 120

async def run_with_limits(agent, message):
    tool_calls = 0
    total_tokens = 0
    start = time.time()
    async for event in Runner.run_streamed(agent, message):
        if event.type == "tool_call":
            tool_calls += 1
            if tool_calls > MAX_TOOL_CALLS:
                raise AgentBudgetExceeded("Too many tool calls")
        total_tokens += getattr(event, "tokens", 0)
        if total_tokens > MAX_TOKENS_PER_RUN:
            raise AgentBudgetExceeded("Token limit exceeded")
        if time.time() - start > MAX_DURATION_SECONDS:
            raise AgentBudgetExceeded("Time limit exceeded")
        yield event
```
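The limit logic can be exercised without any API calls by feeding it a fake event stream; the event shape below is simplified for illustration:

```python
class AgentBudgetExceeded(Exception):
    pass

def enforce_limits(events, max_tool_calls: int = 3):
    """Yield events until the tool-call budget is exceeded."""
    tool_calls = 0
    for event in events:
        if event["type"] == "tool_call":
            tool_calls += 1
            if tool_calls > max_tool_calls:
                raise AgentBudgetExceeded("Too many tool calls")
        yield event

# A runaway loop of 5 tool calls trips the breaker after 3.
try:
    list(enforce_limits([{"type": "tool_call"}] * 5))
except AgentBudgetExceeded as e:
    print("circuit breaker tripped:", e)
```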
Strategy 5: Response caching
Many agent queries are similar. Cache responses for identical or near-identical inputs:
```python
import hashlib

from redis import asyncio as aioredis

r = aioredis.Redis()

def cache_key(message: str, agent_name: str) -> str:
    return hashlib.sha256(f"{agent_name}:{message}".encode()).hexdigest()

async def cached_run(agent, message):
    key = cache_key(message, agent.name)
    cached = await r.get(f"response:{key}")
    if cached:
        return cached.decode()  # cache hit: no API call, no cost
    result = await Runner.run(agent, message)
    await r.setex(f"response:{key}", 3600, result.final_output)  # cache for 1 hour
    return result.final_output
```
This works well for FAQ-style agents, documentation assistants, and code explanation tools where the same questions come up repeatedly.
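For near-identical inputs, normalizing the message before hashing raises the hit rate; a minimal sketch (the normalization rules here are illustrative, tune them to your traffic):

```python
import hashlib
import re

def normalized_cache_key(message: str, agent_name: str) -> str:
    # Lowercase, trim, and collapse whitespace runs so trivially
    # different phrasings of the same question share one cache entry.
    norm = re.sub(r"\s+", " ", message.strip().lower())
    return hashlib.sha256(f"{agent_name}:{norm}".encode()).hexdigest()
```

For genuinely paraphrased questions, an embedding-similarity lookup (semantic caching) goes further, at the cost of an extra embedding call per request.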
Real-time cost monitoring
Track costs in real-time with your observability platform:
# After each agent run
async def log_cost(user_id, agent_name, input_tokens, output_tokens, model):
costs = {
"gpt-4o": {"input": 2.50, "output": 10.00}, # per 1M tokens
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
"claude-sonnet-4": {"input": 3.00, "output": 15.00},
}
rate = costs.get(model, costs["gpt-4o-mini"])
cost = (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
await metrics.record("agent_cost", cost, tags={
"user": user_id, "agent": agent_name, "model": model
})
Set alerts for:
- Daily spend exceeding 2x the average
- Any single user consuming more than 10% of total budget
- Any single agent run costing more than $1
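The three alert rules are simple enough to express as a pure function, which also makes them unit-testable; a sketch (the thresholds mirror the list above):

```python
def spend_alerts(daily_spend: float, avg_daily_spend: float,
                 user_spend: float, total_budget: float,
                 run_cost: float) -> list[str]:
    """Return the names of any triggered alert rules."""
    alerts = []
    if daily_spend > 2 * avg_daily_spend:
        alerts.append("daily_spend_spike")
    if user_spend > 0.10 * total_budget:
        alerts.append("user_over_10_percent")
    if run_cost > 1.00:
        alerts.append("expensive_single_run")
    return alerts
```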
The $100 budget framework
From running the AI Startup Race, here’s what we learned about budgeting:
| Budget | What you get | Duration |
|---|---|---|
| $10/mo | ~3M tokens GPT-4o-mini | Hobby project |
| $50/mo | ~15M tokens GPT-4o-mini or ~1.5M GPT-4o | Small SaaS |
| $200/mo | Mixed routing, 100+ daily users | Growing product |
| $1,000/mo | Full GPT-4o, 1,000+ daily users | Serious product |
The key insight: start with the cheapest model that works, add expensive models only for tasks that need them, and always have per-user limits.
Related: Monitor AI API Spending · AI Coding Tools Pricing · Deploy AI Agents to Production · LLM Observability · OpenRouter Complete Guide · Best AI Agent Frameworks · FinOps for AI