Your team started with one developer using Claude Code. Now 20 people are calling AI APIs across 5 products. The monthly bill went from $200 to $8,000 and nobody knows which team is spending what. Welcome to AI FinOps.
What AI FinOps means
FinOps (Financial Operations) for AI applies cloud cost management principles to LLM spending:
- Visibility: who is spending what, on which models, for which features
- Allocation: assign costs to teams, products, and features
- Optimization: reduce waste without reducing quality
- Governance: budgets, alerts, and approval workflows
The AI cost visibility problem
Unlike cloud infrastructure where costs map to servers, AI costs map to API calls that are invisible in traditional monitoring:
Traditional cloud: EC2 instance → $500/month → assigned to Team A
AI costs: 47,000 API calls → $3,200/month → ???
Without tagging and tracking, you can't answer basic questions:
- Which feature costs the most?
- Which team is over budget?
- Are we using the right model for each task?
Step 1: Tag everything
Add metadata to every API call:
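# Assumes `client` is an OpenAI-compatible client routed through the Helicone proxy.
response = client.chat.completions.create(
    model="claude-opus-4.6",
    messages=[...],
    # Custom properties: each header becomes a filterable tag on the request.
    extra_headers={
        "Helicone-Property-Team": "backend",
        "Helicone-Property-Feature": "code-review",
        "Helicone-Property-Environment": "production",
    },
)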
If using OpenRouter, use their built-in tagging. If using Helicone, their proxy captures this automatically.
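
Once calls carry tags, the basic visibility questions reduce to a group-by. The following is a minimal sketch, assuming you log one record per call with a token count and its tags. The `CallRecord` shape, the `PRICE_PER_1M` table, and the flat per-token prices are illustrative assumptions, not any provider's real pricing or export format.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical flat prices in USD per 1M tokens (illustrative only).
PRICE_PER_1M = {"claude-opus": 15.00, "deepseek": 0.27}


@dataclass
class CallRecord:
    team: str
    feature: str
    model: str
    tokens: int  # total tokens billed for the call


def call_cost(record: CallRecord) -> float:
    """Cost of one call in USD under the flat-price assumption."""
    return record.tokens / 1_000_000 * PRICE_PER_1M[record.model]


def spend_by(records, key: str) -> dict:
    """Sum spend grouped by a tag name such as 'team' or 'feature'."""
    totals = defaultdict(float)
    for r in records:
        totals[getattr(r, key)] += call_cost(r)
    return dict(totals)


records = [
    CallRecord("backend", "code-review", "claude-opus", 2_000_000),
    CallRecord("backend", "autocomplete", "deepseek", 10_000_000),
    CallRecord("data", "batch-summaries", "deepseek", 20_000_000),
]
print(spend_by(records, "team"))
print(spend_by(records, "feature"))
```

Grouping by `feature` answers "which feature costs the most?", and grouping by `team` feeds the budget checks in the next step.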
Step 2: Set budgets
| Level | Budget | Alert at | Hard stop at |
|---|---|---|---|
| Company | $10,000/mo | 75% | 95% |
| Team | $2,000/mo | 80% | 90% |
| Feature | $500/mo | 80% | None |
| Individual | $200/mo | 90% | 100% |
See our monitoring guide for implementation.
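
The table above can be expressed as data plus one check. This is a sketch under assumptions: the `Budget` class and `budget_status` function are hypothetical names, and month-to-date spend is assumed to come from your own tracking.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Budget:
    limit: float                   # monthly budget in USD
    alert_at: float                # fraction of the limit that triggers an alert
    hard_stop_at: Optional[float]  # fraction that blocks calls; None means never block


# Limits and thresholds copied from the table above.
BUDGETS = {
    "company": Budget(10_000, 0.75, 0.95),
    "team": Budget(2_000, 0.80, 0.90),
    "feature": Budget(500, 0.80, None),
    "individual": Budget(200, 0.90, 1.00),
}


def budget_status(level: str, spent: float) -> str:
    """Return 'ok', 'alert', or 'blocked' for month-to-date spend at a level."""
    budget = BUDGETS[level]
    used = spent / budget.limit
    if budget.hard_stop_at is not None and used >= budget.hard_stop_at:
        return "blocked"
    if used >= budget.alert_at:
        return "alert"
    return "ok"


print(budget_status("team", 1_500))
print(budget_status("team", 1_650))
print(budget_status("feature", 600))
```

Features have no hard stop, so an over-budget feature only ever alerts; whether that is the right trade-off depends on how much you trust the team-level stop to catch runaway spend.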
Step 3: Optimize
The biggest wins, in order of effort:
- Model routing: use DeepSeek or MiniMax ($0.27-0.30/1M tokens) for routine tasks, Claude Opus ($15/1M tokens) only for complex ones. Saves 40-60%.
- Prompt caching: reuse system prompts across requests. Saves 10-30%.
- Token optimization: shorter prompts, structured outputs, context pruning. Saves 15-25%.
- Self-hosting for predictable workloads: worth it when API costs exceed hardware costs. See our cost calculator.
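
To make model routing concrete, here is a rough sketch. The keyword and length heuristic in `pick_model` is a made-up placeholder (real routers often use a small classifier or explicit task types), the model identifiers are stand-ins, and the prices are the per-1M-token figures quoted in the list above.

```python
CHEAP_MODEL = "deepseek"       # placeholder identifier
PREMIUM_MODEL = "claude-opus"  # placeholder identifier
PRICE_PER_1M = {CHEAP_MODEL: 0.27, PREMIUM_MODEL: 15.00}

# Hypothetical hints that a task deserves the premium model.
COMPLEX_HINTS = ("architecture", "security", "refactor", "design")


def pick_model(task: str, prompt_tokens: int) -> str:
    """Send long prompts or tasks matching a hint to the premium model."""
    text = task.lower()
    if prompt_tokens > 8_000 or any(hint in text for hint in COMPLEX_HINTS):
        return PREMIUM_MODEL
    return CHEAP_MODEL


def blended_price(cheap_share: float) -> float:
    """Average USD per 1M tokens when `cheap_share` of tokens go to the cheap model."""
    return (cheap_share * PRICE_PER_1M[CHEAP_MODEL]
            + (1 - cheap_share) * PRICE_PER_1M[PREMIUM_MODEL])


def savings_vs_premium(cheap_share: float) -> float:
    """Fractional saving compared with sending everything to the premium model."""
    return 1 - blended_price(cheap_share) / PRICE_PER_1M[PREMIUM_MODEL]


print(pick_model("Fix typo in README", 300))
print(pick_model("Review architecture of payment service", 300))
print(f"{savings_vs_premium(0.5):.1%}")
```

Under these prices, routing half of all tokens to the cheap model cuts the blended price roughly in half, which is consistent with the 40-60% range quoted above.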
Step 4: Showback reports
Monthly report to each team:
Team: Backend Engineering
Period: March 2026
Total spend: $1,847
Claude Opus: $1,200 (65%) - code review, architecture
DeepSeek: $147 (8%) - routine coding
Qwen Flash: $500 (27%) - batch processing
Budget: $2,000 (92% used)
Trend: +15% vs February
Optimization opportunity:
Code review uses Opus but 60% of reviews are simple.
Routing simple reviews to DeepSeek would save ~$400/month.
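
A report like this can be generated from per-model spend totals. The sketch below assumes a hypothetical `showback` helper and non-zero totals; the February figure of $1,606 is invented for illustration so that the trend line comes out at +15%.

```python
def showback(team, period, spend_by_model, budget, prev_total):
    """Render a plain-text showback report from per-model spend in USD."""
    total = sum(spend_by_model.values())
    lines = [
        f"Team: {team}",
        f"Period: {period}",
        f"Total spend: ${total:,.0f}",
    ]
    # Largest line items first, each with its share of the total.
    for model, usd in sorted(spend_by_model.items(), key=lambda kv: -kv[1]):
        lines.append(f"{model}: ${usd:,.0f} ({usd / total:.0%})")
    lines.append(f"Budget: ${budget:,.0f} ({total / budget:.0%} used)")
    lines.append(f"Trend: {(total - prev_total) / prev_total:+.0%} vs previous period")
    return "\n".join(lines)


report = showback(
    team="Backend Engineering",
    period="March 2026",
    spend_by_model={"Claude Opus": 1200, "DeepSeek": 147, "Qwen Flash": 500},
    budget=2000,
    prev_total=1606,  # illustrative February total, not from real data
)
print(report)
```

The optimization note at the bottom of the report is best written by a person, or at least reviewed by one, since it depends on knowing which calls could tolerate a cheaper model.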
Tools for AI FinOps
| Tool | What it does |
|---|---|
| Helicone | Cost tracking, tagging, caching |
| OpenRouter | Unified billing across providers |
| Portkey | Multi-provider routing + cost tracking |
| Custom middleware | Tag requests, enforce budgets |
For most teams, Helicone (1-line proxy setup) + OpenRouter (unified billing) covers 90% of AI FinOps needs.
When to start
- 1-5 developers: Track total spend, set a company budget. That's enough.
- 5-20 developers: Add team-level budgets and model routing.
- 20+ developers: Full FinOps with tagging, showback, and optimization reviews.
Don't over-engineer it early. A spreadsheet tracking monthly spend per team is better than no tracking at all.
Choosing tools by need
| Need | Tool | Why |
|---|---|---|
| Cost tracking | Helicone | 1-line proxy setup, best cost dashboards |
| Multi-provider billing | OpenRouter | One bill for all models, built-in cost tracking |
| Budget enforcement | Portkey | Programmable guardrails and spending limits |
| Custom dashboards | Grafana + custom logging | Full control, cheapest at scale |
For most teams, start with Helicone for visibility and OpenRouter for unified billing. Add custom tooling only when you outgrow these.
Quick wins that save 30-50%
These optimizations take less than a day to implement:
- Prompt caching: if your system prompt is >1000 tokens and shared across requests, caching saves 80-90% on input tokens
- Model routing: use DeepSeek or Haiku for classification/routing, Sonnet for generation
- Max tokens limit: set `max_tokens` on every request to prevent runaway generation
- Response caching: cache identical queries for 5-60 minutes depending on freshness needs
- Context trimming: most prompts include unnecessary context. Trim to what's actually needed.
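
Response caching is the easiest of these to sketch. The example below is a minimal in-memory version, assuming a single process and exact-match queries; `ResponseCache` and the stub `fake_llm` are hypothetical names, and a shared store such as Redis would be the more likely choice across several servers.

```python
import hashlib
import time


class ResponseCache:
    """In-memory TTL cache keyed on (model, prompt). A sketch, not production code."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}    # key -> (expires_at, response)

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()

    def get_or_call(self, model: str, prompt: str, call):
        """Return a fresh cached response, or invoke `call` and cache its result."""
        key = self._key(model, prompt)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        response = call(model, prompt)
        self._store[key] = (now + self.ttl, response)
        return response


def fake_llm(model: str, prompt: str) -> str:
    """Stand-in for a real API call."""
    return f"[{model}] reply to: {prompt}"


cache = ResponseCache(ttl_seconds=300)  # 5 minutes, the low end of the range above
print(cache.get_or_call("deepseek", "What is FinOps?", fake_llm))
print(cache.get_or_call("deepseek", "What is FinOps?", fake_llm))  # served from cache
```

Only identical prompts hit the cache here, so this helps most with repeated lookups and retries rather than free-form conversation.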
See our cost optimization guide for detailed implementation of each technique.
Related: How to Reduce LLM API Costs · Monitor and Control AI Spending · LLM Cost Calculator · AI Coding Tools Pricing 2026 · AI Cost Governance for Teams