You need to monitor your LLM app in production. Three tools dominate: Helicone (best for cost tracking), LangSmith (best for LangChain users), and Langfuse (best open-source option). Here’s how they compare.
## Quick comparison
| | Helicone | LangSmith | Langfuse |
|---|---|---|---|
| Best for | Cost analytics | LangChain users | Open-source, self-host |
| Setup | 1-line proxy | SDK integration | SDK or self-host |
| Tracing | ✅ | ✅ Deep | ✅ |
| Cost tracking | ✅ Best | ✅ | ✅ |
| Evals | Basic | ✅ Best | ✅ Good |
| Prompt management | ❌ | ✅ | ✅ |
| Open source | ✅ | ❌ | ✅ MIT |
| Self-host | ✅ | ❌ | ✅ |
| Free tier | 100K requests/mo | 5K traces/mo | 50K observations/mo |
| Paid | Usage-based | $39/mo team | Usage-based |
## Helicone — best for cost tracking
Helicone works as a proxy — change your API base URL and every request is automatically logged. No SDK needed.
```python
# One-line setup: point the client at Helicone's proxy
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your-helicone-key"},
)
```
Strengths: Instant setup, best cost dashboards, request caching (saves money), works with any provider.
Weaknesses: Less deep tracing than LangSmith, basic eval capabilities.
Pick Helicone when: Cost is your primary concern, you want the fastest setup, or you use multiple AI providers through OpenRouter.
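Because Helicone is header-driven, features like caching are opt-in per request. A minimal sketch (header names are from Helicone's docs as I recall them — verify against the current documentation):

```python
# Hedged sketch: Helicone features are toggled via request headers.
helicone_headers = {
    "Helicone-Auth": "Bearer your-helicone-key",  # auth, as in the setup above
    "Helicone-Cache-Enabled": "true",             # serve identical requests from cache
}
```

Pass this dict as `default_headers` when constructing the OpenAI client, exactly as in the one-line setup snippet.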
## LangSmith — best for LangChain users
Deep integration with LangChain. Automatic tracing of chains, agents, and tool calls. Best evaluation framework.
Strengths: Deepest tracing for LangChain apps, best eval/testing tools, prompt playground, dataset management.
Weaknesses: Tightly coupled to LangChain, not open source, $39/mo for teams.
Pick LangSmith when: You use LangChain and want the best debugging experience.
## Langfuse — best open-source option
MIT licensed, can be self-hosted for complete data control. Good balance of features.
```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment
langfuse = Langfuse()

# Trace a generation
trace = langfuse.trace(name="chat")
generation = trace.generation(
    name="llm-call",
    model="claude-opus-4.6",
    input=messages,   # the request messages you sent
    output=response,  # the model's response
)
```
Strengths: Open source (MIT), self-hostable for GDPR, good tracing + evals, works with any framework.
Weaknesses: Requires SDK integration (not a proxy), smaller community than Helicone.
Pick Langfuse when: You need open source, want to self-host for privacy, or want a balanced feature set without LangChain lock-in.
## Decision framework
| Situation | Pick |
|---|---|
| “I just want to see costs” | Helicone |
| “I use LangChain” | LangSmith |
| “I need open source / self-host” | Langfuse |
| “I need GDPR compliance” | Langfuse (self-hosted) |
| “I want the fastest setup” | Helicone (1-line proxy) |
| “I need deep eval/testing” | LangSmith |
## Other options worth knowing
- Portkey — best for multi-provider routing + observability
- Phoenix (Arize) — best for local debugging, fully open source
- SigNoz — best if you want LLM monitoring alongside full-stack observability
- OpenTelemetry — DIY with your existing monitoring stack
## Migrating between tools
All three tools use similar concepts (traces, spans, generations), so migrating isn’t painful. The main lock-in is:
- Helicone: Proxy URL in your config. Change one line to remove.
- LangSmith: SDK decorators in your code. More work to remove, especially if using LangChain callbacks.
- Langfuse: SDK calls in your code. Similar effort to LangSmith.
If you’re worried about lock-in, start with Helicone (1-line proxy, easiest to remove) or Langfuse (open source, can always self-host).
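Since Helicone is just a proxy URL, adding or removing it can be a config toggle rather than a code change. A minimal sketch (the `USE_HELICONE` env var and helper function are hypothetical, not part of any SDK):

```python
import os

def openai_base_url() -> str:
    """Route through the Helicone proxy only when monitoring is enabled."""
    if os.environ.get("USE_HELICONE", "false").lower() == "true":
        return "https://oai.helicone.ai/v1"
    return "https://api.openai.com/v1"

# Pass the result as base_url when constructing your OpenAI client;
# flipping USE_HELICONE turns observability on or off with no code change.
```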
## Cost comparison at scale
| Monthly volume | Helicone | LangSmith | Langfuse Cloud |
|---|---|---|---|
| 10K requests | Free | Free | Free |
| 100K requests | ~$20 | $39 | ~$25 |
| 500K requests | ~$80 | $39 + overages | ~$100 |
| 1M requests | ~$150 | Custom | ~$200 |
| Self-hosted | N/A | N/A | $0 (your infra) |
Langfuse self-hosted is the cheapest option at any scale — you only pay for the server (a $10/month VPS handles millions of traces). But you manage the infrastructure.
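To put the table in perspective, the approximate figures above work out to roughly the following per-1K-request costs (a quick sanity check, using only the rounded prices in the table):

```python
# Approximate monthly prices (USD) copied from the table above
tiers = {
    100_000:   {"Helicone": 20,  "Langfuse Cloud": 25},
    500_000:   {"Helicone": 80,  "Langfuse Cloud": 100},
    1_000_000: {"Helicone": 150, "Langfuse Cloud": 200},
}

for volume, prices in tiers.items():
    for tool, usd in prices.items():
        per_1k = usd / volume * 1000  # effective cost per 1K requests
        print(f"{tool:15s} at {volume:>9,} req/mo: ~${per_1k:.2f} per 1K requests")
```

Effective unit cost stays flat or falls with volume for both usage-based tools, so the crossover point versus a flat-rate plan depends entirely on your traffic.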
## What to monitor (regardless of tool)
Whichever tool you pick, track these metrics from day one:
- Cost per request — catch spending anomalies early
- Latency (P50, P95) — detect slowdowns before users complain
- Error rate — API failures, timeouts, rate limits
- Token usage trends — are prompts growing over time?
- Model distribution — which models are being used and how much?
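Every tool above exposes these numbers in a dashboard, but they are simple to compute yourself. A stdlib-only sketch over hypothetical request records (the log shape is made up for illustration):

```python
from statistics import quantiles

# Hypothetical per-request log records: (latency_ms, cost_usd, errored)
requests = [
    (120, 0.0021, False),
    (95,  0.0018, False),
    (310, 0.0042, True),
    (140, 0.0023, False),
    (880, 0.0105, False),
]

latencies = sorted(r[0] for r in requests)
cost_per_request = sum(r[1] for r in requests) / len(requests)
error_rate = sum(1 for r in requests if r[2]) / len(requests)

# P50/P95 via inclusive percentiles over the observed latencies
pct = quantiles(latencies, n=100, method="inclusive")
p50, p95 = pct[49], pct[94]

print(f"cost/request: ${cost_per_request:.4f}")
print(f"error rate:   {error_rate:.0%}")
print(f"P50: {p50:.0f} ms, P95: {p95:.0f} ms")
```

Note how a single slow outlier (880 ms) barely moves the P50 but dominates the P95, which is why both percentiles are worth tracking.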
See our what-to-log guide for the complete logging strategy and our LLM observability guide for broader monitoring practices.
Related: LLM Observability for Developers · What to Log in AI Systems · How to Reduce LLM API Costs · Monitor and Control AI Spending · Self-Hosted AI for GDPR