
How to Reduce LLM API Costs by 70% — 5 Strategies That Actually Work


Most teams overspend on LLM APIs by 3-10x. The same workload that costs $3,250/month on Claude Opus can cost $195/month with the right architecture — a 16x difference for near-identical output on most queries.

Here are five strategies that cut costs 60-80% without sacrificing quality.

1. Model routing (40-60% savings)

The biggest win. Stop sending every request to your most expensive model.

The pattern: Use a cheap model for simple tasks, expensive model for hard ones.

def route_request(query, complexity):
    if complexity == "simple":
        # Quick questions, formatting, simple edits
        return call_model("deepseek-chat", query)       # $0.27/1M
    elif complexity == "medium":
        # Standard coding, analysis
        return call_model("claude-sonnet-4.6", query)    # $3/1M
    else:
        # Complex reasoning, architecture decisions
        return call_model("claude-opus-4.6", query)      # $15/1M

In practice, 60-70% of requests are “simple.” Routing those to DeepSeek or Qwen Flash at $0.07-0.27/1M instead of Claude at $15/1M saves 40-60% immediately.

Tools like OpenRouter make this easy — one API, switch models per request. Aider has built-in --model and --weak-model flags for exactly this pattern.
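Production routers often classify requests with a cheap model, but a keyword-and-length heuristic is a workable starting point. The sketch below is illustrative (the function name, keyword list, and length threshold are all assumptions to tune for your workload); it produces the complexity labels the route_request example expects:

```python
def classify_complexity(query: str) -> str:
    """Heuristic complexity classifier (illustrative; tune for your workload).

    Flags architecture/design-style queries as "complex", short requests
    as "simple", and everything else as "medium".
    """
    hard_keywords = {"architecture", "design", "refactor", "debug", "concurrency"}
    words = query.lower().split()
    if any(w.strip(".,?!") in hard_keywords for w in words):
        return "complex"
    if len(words) < 20:
        return "simple"
    return "medium"
```

In practice you would log the labels alongside output-quality feedback and adjust the rules, or replace them with a small classifier model, once you know your real traffic mix.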

2. Prompt caching (up to 90% on cached tokens)

Anthropic, OpenAI, and Google all offer prompt caching: if the first N tokens of your prompt match a recent request, those tokens are billed at a steep discount (roughly 90% off on Anthropic, 50% on OpenAI; Google's rates vary by model).

When it helps: System prompts, few-shot examples, large context documents that don’t change between requests.

# Without caching: 10K system prompt tokens × $15/1M = $0.15 per request
# With caching:    10K cached tokens × $1.50/1M = $0.015 per request
# Savings: 90% on the system prompt portion

For AI coding tools with large system prompts (like the ones in our AI Startup Race), this is significant. At the Opus rates above, a 5K-token system prompt sent 1,000 times/day costs ~$75/day uncached versus ~$7.50/day cached, saving roughly $2,000/month from caching alone.
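The per-request arithmetic above generalizes to a small helper. This is a sketch (the function name is mine, and the default prices are the $15/1M base and $1.50/1M cached rates from the example, not universal pricing):

```python
def caching_savings_per_month(prompt_tokens: int, requests_per_day: int,
                              base_price_per_m: float = 15.0,
                              cached_price_per_m: float = 1.50) -> float:
    """Monthly dollars saved by caching a fixed prompt prefix.

    Assumes every request after the first hits the cache (an upper bound;
    caches expire, so real savings are somewhat lower).
    """
    tokens_per_month = prompt_tokens * requests_per_day * 30
    uncached_cost = tokens_per_month / 1e6 * base_price_per_m
    cached_cost = tokens_per_month / 1e6 * cached_price_per_m
    return uncached_cost - cached_cost
```

Plugging in a 5K-token prompt at 1,000 requests/day gives just over $2,000/month, matching the back-of-envelope figure above.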

3. Token optimization (30-50% reduction)

Every token costs money. Reduce them:

Shorter system prompts. Most system prompts are 2-3x longer than needed. Cut the fluff.

Structured output. Ask for JSON instead of prose — it’s shorter and parseable.

Context pruning. Don’t send your entire codebase. Only include relevant files. Aider’s --read flag and repo map do this automatically.

Summarize conversation history. Instead of sending the full chat history, summarize older messages:

# Instead of 50 messages (20K tokens):
messages = [system_prompt, summary_of_first_48, last_2_messages]
# Now: ~3K tokens
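A minimal sketch of that summarization pattern. The function name and message shape are illustrative, and the summarize argument stands in for whatever you use to condense old messages (typically a cheap-model call):

```python
def compress_history(messages: list, summarize, keep_last: int = 2) -> list:
    """Keep the system prompt and the last `keep_last` messages verbatim;
    replace everything in between with a single summary message.

    `summarize` maps a list of messages to a short text summary.
    """
    system, rest = messages[0], messages[1:]
    if len(rest) <= keep_last:
        return messages  # nothing worth compressing
    old, recent = rest[:-keep_last], rest[-keep_last:]
    summary_msg = {
        "role": "user",
        "content": f"[Summary of earlier conversation: {summarize(old)}]",
    }
    return [system, summary_msg, *recent]
```

Run it before each request once the history crosses a token budget; the recent messages stay intact, so the model keeps full fidelity where it matters most.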

4. Batching (50% discount)

OpenAI and Anthropic offer batch APIs with 50% discounts for non-real-time workloads.

Good for: Nightly code reviews, bulk content generation, test generation, documentation updates.

# OpenAI Batch API
batch = client.batches.create(
    input_file_id="file-abc123",
    endpoint="/v1/chat/completions",
    completion_window="24h"  # Results within 24 hours
)
# 50% cheaper than real-time API

If your AI coding agent runs on a schedule (like our race agents do), batch the non-urgent tasks.
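The Batch API takes a JSONL input file with one request per line, each tagged with a custom_id so you can match results back after the batch completes. A sketch of building those lines (the helper name and model default are mine; the per-line request shape follows OpenAI's batch format):

```python
import json

def build_batch_lines(prompts: list[str], model: str = "gpt-4o-mini") -> str:
    """Build the JSONL body for an OpenAI Batch API input file.

    Each line is one chat-completion request; custom_id lets you pair
    each result in the output file with its original prompt.
    """
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

Write the string to a file, upload it with the Files API, and pass the returned file ID as input_file_id in batches.create as shown above.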

5. Self-host for predictable workloads

At some point, API costs exceed hardware costs. The break-even:

| Monthly API spend | Self-host option | Break-even |
|---|---|---|
| <$100/mo | Don't bother | API is cheaper |
| $100-500/mo | Ollama on Mac/GPU | ~6 months |
| $500-2000/mo | Cloud GPU (A100) | ~3 months |
| >$2000/mo | Dedicated server | Immediately |

For coding tasks, a Mac Mini M4 32GB ($1,150) running Qwen 3.5 27B replaces ~$50-100/month in API costs. Pays for itself in one to two years.
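The break-even math is simple enough to sketch. This helper is illustrative, and the power-cost default is an assumption for a small always-on machine (it matters little at these spend levels but keeps the comparison honest):

```python
def break_even_months(hardware_cost: float, monthly_api_spend: float,
                      monthly_power_cost: float = 5.0) -> float:
    """Months until self-hosted hardware pays for itself versus API spend.

    Ignores your time spent on setup and maintenance, which is the real
    hidden cost of self-hosting.
    """
    monthly_savings = monthly_api_spend - monthly_power_cost
    if monthly_savings <= 0:
        return float("inf")  # the API is simply cheaper
    return hardware_cost / monthly_savings
```

At $100/month of displaced API spend, the $1,150 machine clears break-even in about a year; at $50/month it takes closer to two.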

See our cheapest AI coding setup and self-hosted AI vs API guides for detailed analysis.

The combined impact

| Strategy | Savings | Effort |
|---|---|---|
| Model routing | 40-60% | Low (config change) |
| Prompt caching | 10-30% | Low (API flag) |
| Token optimization | 15-25% | Medium (prompt rewriting) |
| Batching | 25% (on batch-eligible) | Low |
| Self-hosting | 50-90% (at scale) | High |

Combined, these strategies typically reduce costs by 60-80%. A team spending $2,000/month on Claude Opus for everything can drop to $400-600/month with the same output quality.
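The savings stack roughly multiplicatively, though the layers overlap in practice (cached tokens are already cheap, and requests routed to a $0.27/1M model gain little from batching), so treat the stacked figure as an optimistic sketch:

```python
def combined_remaining_cost(*savings_fractions: float) -> float:
    """Fraction of the original bill left after stacking independent
    cost reductions multiplicatively.

    E.g. 50% from routing, 20% from caching, 20% from token cuts.
    """
    remaining = 1.0
    for s in savings_fractions:
        remaining *= (1.0 - s)
    return remaining
```

Stacking 50% routing, 20% caching, and 20% token savings leaves 32% of the original bill, a 68% reduction, squarely inside the 60-80% range above.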

Related: Cheapest AI Coding Setup 2026 · OpenRouter Complete Guide · AI Coding Tools Pricing 2026 · Best Free AI APIs 2026