Most teams overspend on LLM APIs by 3-10x. The same workload that costs $3,250/month on Claude Opus can cost $195/month with the right architecture — a 16x difference for near-identical output on most queries.
Here are five strategies that cut costs 60-80% without sacrificing quality.
1. Model routing (40-60% savings)
The biggest win. Stop sending every request to your most expensive model.
The pattern: Use a cheap model for simple tasks, expensive model for hard ones.
```python
def route_request(query, complexity):
    if complexity == "simple":
        # Quick questions, formatting, simple edits
        return call_model("deepseek-chat", query)      # $0.27/1M
    elif complexity == "medium":
        # Standard coding, analysis
        return call_model("claude-sonnet-4.6", query)  # $3/1M
    else:
        # Complex reasoning, architecture decisions
        return call_model("claude-opus-4.6", query)    # $15/1M
```
In practice, 60-70% of requests are “simple.” Routing those to DeepSeek or Qwen Flash at $0.07-0.27/1M instead of Claude at $15/1M saves 40-60% immediately.
Tools like OpenRouter make this easy — one API, switch models per request. Aider has built-in --model and --weak-model flags for exactly this pattern.
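The `complexity` label itself doesn't need an LLM to produce. A crude heuristic works surprisingly well as a first pass — here's a sketch (the keyword list and length threshold are illustrative assumptions, not tuned values):

```python
def classify_complexity(query: str) -> str:
    """Crude heuristic router: keyword- and length-based (illustrative only)."""
    hard_markers = ("architecture", "design", "trade-off", "migrate", "refactor")
    q = query.lower()
    if any(marker in q for marker in hard_markers):
        return "complex"          # route to the expensive model
    if len(q.split()) > 80:
        return "medium"           # long prompts tend to need more reasoning
    return "simple"               # formatting, quick questions, small edits

print(classify_complexity("Fix the typo in README"))  # simple
```

A cheap model can replace the heuristic later: ask it to emit one word (`simple`/`medium`/`complex`) and route on that. The classifier call costs a fraction of a cent and pays for itself on the first routed request.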
2. Prompt caching (up to 90% on cached tokens)
Anthropic, OpenAI, and Google all offer prompt caching: if the first N tokens of your prompt match a recent request, those tokens are billed at a steep discount — up to 90% off on Anthropic (discounts and minimum cacheable prompt sizes vary by provider).
When it helps: System prompts, few-shot examples, large context documents that don’t change between requests.
```python
# Without caching: 10K system-prompt tokens × $15/1M   = $0.15  per request
# With caching:    10K cached tokens       × $1.50/1M  = $0.015 per request
# Savings: 90% on the system-prompt portion
```
For AI coding tools with large system prompts (like the ones in our AI Startup Race), this is significant. A 5K-token system prompt sent 1,000 times/day is 150M tokens/month; at Sonnet's $3/1M that's $450 uncached vs. ~$45 with cache hits — roughly $400/month saved on that one prompt.
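On Anthropic's API, caching is opt-in: you mark the prompt block with a `cache_control` field. A minimal sketch of a helper that builds the cacheable system block (field names follow Anthropic's prompt-caching docs at the time of writing — verify against the current docs, and note prompts below a minimum token count won't be cached):

```python
def cached_system_block(text: str) -> list[dict]:
    """Wrap a system prompt so Anthropic caches it across requests."""
    return [{
        "type": "text",
        "text": text,
        "cache_control": {"type": "ephemeral"},  # cached for a few minutes, refreshed on hit
    }]

# Usage: client.messages.create(model=..., system=cached_system_block(BIG_PROMPT), ...)
```

The first request pays a small write premium; every subsequent hit within the cache window bills the prompt at the discounted rate.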
3. Token optimization (30-50% reduction)
Every token costs money. Reduce them:
Shorter system prompts. Most system prompts are 2-3x longer than needed. Cut the fluff.
Structured output. Ask for JSON instead of prose — it’s shorter and parseable.
Context pruning. Don’t send your entire codebase. Only include relevant files. Aider’s --read flag and repo map do this automatically.
Summarize conversation history. Instead of sending the full chat history, summarize older messages:
```python
# Instead of 50 messages (~20K tokens):
messages = [system_prompt, summary_of_first_48, last_2_messages]
# Now: ~3K tokens
```
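The compaction step above can be sketched as a small helper — here the summary is a placeholder string, but in practice you'd generate it with a cheap model (the function name and message shape are illustrative assumptions):

```python
def compact_history(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Keep the system prompt and the last few turns; collapse the middle.

    The summary placeholder below would normally come from a cheap
    summarization call (e.g. your routing tier's "simple" model).
    """
    if len(messages) <= keep_last + 1:
        return messages  # nothing worth compacting
    system, middle, tail = messages[0], messages[1:-keep_last], messages[-keep_last:]
    summary = {
        "role": "user",
        "content": f"[Summary of {len(middle)} earlier messages goes here]",
    }
    return [system, summary, *tail]
```

Run it whenever the history crosses a token budget; the model still sees the recent turns verbatim, which is where most of the useful context lives.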
4. Batching (50% discount)
OpenAI and Anthropic offer batch APIs with 50% discounts for non-real-time workloads.
Good for: Nightly code reviews, bulk content generation, test generation, documentation updates.
```python
# OpenAI Batch API
batch = client.batches.create(
    input_file_id="file-abc123",
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results within 24 hours
)
# 50% cheaper than the real-time API
```
If your AI coding agent runs on a schedule (like our race agents do), batch the non-urgent tasks.
5. Self-host for predictable workloads
At some point, API costs exceed hardware costs. The break-even:
| Monthly API spend | Self-host option | Break-even |
|---|---|---|
| <$100/mo | Don’t bother | API is cheaper |
| $100-500/mo | Ollama on Mac/GPU | ~6 months |
| $500-2000/mo | Cloud GPU (A100) | ~3 months |
| >$2000/mo | Dedicated server | Immediately |
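The break-even column is straight amortization; a sketch, using the article's own figures as rough inputs (electricity and maintenance are ignored here — add them as `monthly_running_cost` if they matter for your setup):

```python
def breakeven_months(hardware_cost: float, monthly_api_spend: float,
                     monthly_running_cost: float = 0.0) -> float:
    """Months until owned hardware beats continued API spend."""
    monthly_saving = monthly_api_spend - monthly_running_cost
    if monthly_saving <= 0:
        return float("inf")  # the API stays cheaper
    return hardware_cost / monthly_saving

# Mac Mini at $1,150 vs $100/mo of API spend:
print(round(breakeven_months(1150, 100), 1))  # 11.5
```

Re-run the numbers when prices change — API pricing has been falling fast, which pushes break-even further out.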
For coding tasks, a Mac Mini M4 32GB ($1,150) running Qwen 3.5 27B replaces ~$50-100/month in API costs — it pays for itself in one to two years.
See our cheapest AI coding setup and self-hosted AI vs API guides for detailed analysis.
The combined impact
| Strategy | Savings | Effort |
|---|---|---|
| Model routing | 40-60% | Low (config change) |
| Prompt caching | 10-30% | Low (API flag) |
| Token optimization | 15-25% | Medium (prompt rewriting) |
| Batching | 50% (on batch-eligible work) | Low |
| Self-hosting | 50-90% (at scale) | High |
Combined, these strategies typically reduce costs by 60-80%. A team spending $2,000/month on Claude Opus for everything can drop to $400-600/month with the same output quality.
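Note that the savings compound multiplicatively, not additively — each discount applies to whatever spend is left after the previous one. A quick sketch (the example percentages are illustrative, not measurements):

```python
def combined_cost(base: float, savings: list[float]) -> float:
    """Apply each fractional saving to the remaining spend."""
    for s in savings:
        base *= (1 - s)
    return base

# $2,000/mo with 50% from routing, 25% from caching, 20% from token cuts:
print(round(combined_cost(2000, [0.50, 0.25, 0.20])))  # 600
```

That's why three moderate wins land in the 60-80% range even though no single one gets there alone.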
Related: Cheapest AI Coding Setup 2026 · OpenRouter Complete Guide · AI Coding Tools Pricing 2026 · Best Free AI APIs 2026