Most teams overspend on LLM APIs by 3-10x. The same workload that costs $3,250/month on Claude Opus can cost $195/month with the right architecture β a 16x difference for near-identical output on most queries.
Update (April 24, 2026): DeepSeek V4 Flash at $0.14/$0.28 per 1M tokens is the cheapest frontier option. See V4 API guide.
Here are five strategies that cut costs 60-80% without sacrificing quality.
1. Model routing (40-60% savings)
The biggest win. Stop sending every request to your most expensive model.
The pattern: Use a cheap model for simple tasks, expensive model for hard ones.
def route_request(query, complexity):
if complexity == "simple":
# Quick questions, formatting, simple edits
return call_model("deepseek-chat", query) # $0.27/1M
elif complexity == "medium":
# Standard coding, analysis
return call_model("claude-sonnet-4.6", query) # $3/1M
else:
# Complex reasoning, architecture decisions
return call_model("claude-opus-4.6", query) # $15/1M
In practice, 60-70% of requests are βsimple.β Routing those to DeepSeek or Qwen Flash at $0.07-0.27/1M instead of Claude at $15/1M saves 40-60% immediately.
Tools like OpenRouter make this easy β one API, switch models per request. Aider has built-in --model and --weak-model flags for exactly this pattern.
2. Prompt caching (up to 90% on cached tokens)
Anthropic, OpenAI, and Google all offer prompt caching β if the first N tokens of your prompt match a recent request, you pay 90% less for those tokens.
When it helps: System prompts, few-shot examples, large context documents that donβt change between requests.
# Without caching: 10K system prompt tokens Γ $15/1M = $0.15 per request
# With caching: 10K cached tokens Γ $1.50/1M = $0.015 per request
# Savings: 90% on the system prompt portion
For AI coding tools with large system prompts (like the ones in our AI Startup Race), this is significant. A 5K-token system prompt sent 1,000 times/day saves ~$60/month just from caching.
3. Token optimization (30-50% reduction)
Every token costs money. Reduce them:
Shorter system prompts. Most system prompts are 2-3x longer than needed. Cut the fluff.
Structured output. Ask for JSON instead of prose β itβs shorter and parseable.
Context pruning. Donβt send your entire codebase. Only include relevant files. Aiderβs --read flag and repo map do this automatically.
Summarize conversation history. Instead of sending the full chat history, summarize older messages:
# Instead of 50 messages (20K tokens):
messages = [system_prompt, summary_of_first_48, last_2_messages]
# Now: ~3K tokens
4. Batching (50% discount)
OpenAI and Anthropic offer batch APIs with 50% discounts for non-real-time workloads.
Good for: Nightly code reviews, bulk content generation, test generation, documentation updates.
# OpenAI Batch API
batch = client.batches.create(
input_file_id="file-abc123",
endpoint="/v1/chat/completions",
completion_window="24h" # Results within 24 hours
)
# 50% cheaper than real-time API
If your AI coding agent runs on a schedule (like our race agents do), batch the non-urgent tasks.
5. Self-host for predictable workloads
At some point, API costs exceed hardware costs. The break-even:
| Monthly API spend | Self-host option | Break-even |
|---|---|---|
| <$100/mo | Donβt bother | API is cheaper |
| $100-500/mo | Ollama on Mac/GPU | ~6 months |
| $500-2000/mo | Cloud GPU (A100) | ~3 months |
| >$2000/mo | Dedicated server | Immediately |
For coding tasks, a Mac Mini M4 32GB ($1,150) running Qwen 3.5 27B replaces ~$50-100/month in API costs. Pays for itself in a year.
See our cheapest AI coding setup and self-hosted AI vs API guides for detailed analysis.
The combined impact
| Strategy | Savings | Effort |
|---|---|---|
| Model routing | 40-60% | Low (config change) |
| Prompt caching | 10-30% | Low (API flag) |
| Token optimization | 15-25% | Medium (prompt rewriting) |
| Batching | 25% (on batch-eligible) | Low |
| Self-hosting | 50-90% (at scale) | High |
Combined, these strategies typically reduce costs by 60-80%. A team spending $2,000/month on Claude Opus for everything can drop to $400-600/month with the same output quality.
Related: Cheapest AI Coding Setup 2026 Β· OpenRouter Complete Guide Β· AI Coding Tools Pricing 2026 Β· Best Free AI APIs 2026 Β· Chinese AI Models Are 30x Cheaper Β· Migrate from GPT/Claude to DeepSeek/MiMo