LLM Observability for Developers: How to Monitor AI Apps in Production
Traditional application monitoring tracks request times and error rates. LLM applications break this model: failures are silent, costs spike without warning, and the same input can produce wildly different outputs. Only 15% of GenAI deployments have proper LLM observability in place.
Here's what you actually need to monitor and how.
Why traditional APM fails for LLMs
Your existing monitoring (Datadog, New Relic, Grafana) tracks:
- HTTP status codes: the API returns 200 even when the model hallucinates
- Response times: LLM latency can vary by 10x depending on output length
- Error rates: a wrong answer isn't an error, it's a 200 with bad content
What LLM apps need instead:
- Token tracking: which requests burn the most tokens (and money)
- Quality evaluation: is the output actually correct?
- Prompt tracing: what exact prompt produced this output?
- Cost attribution: which feature/user/team is spending the most?
- Hallucination detection: is the model making things up?
The five pillars of LLM observability
1. Request tracing
Every LLM call should be traced end-to-end: user input → prompt construction → model call → response → post-processing. For RAG systems, this includes the retrieval step too.
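# Minimal tracing with OpenTelemetry
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("llm_request") as span:
    span.set_attribute("gen_ai.request.model", "claude-opus-4.6")
    response = call_llm(prompt)  # your own wrapper around the provider SDK
    # Token counts and cost are only known after the call returns;
    # read them from the usage data your provider reports.
    span.set_attribute("gen_ai.usage.input_tokens", prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", completion_tokens)
    span.set_attribute("gen_ai.cost", cost)  # custom attribute, not part of the standard set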
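OpenTelemetry's semantic conventions define gen_ai.* attributes specifically for LLM tracing, although they are still marked as in development and names may change. Using them means your traces can be read by any backend that understands the conventions.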
2. Cost tracking
LLM costs can spike 10x overnight if a prompt gets longer or traffic increases. Track:
- Cost per request: input tokens × input price + output tokens × output price
- Cost per user: which users consume the most?
- Cost per feature: which product feature is most expensive?
- Daily/weekly trends: catch spikes before the bill arrives
See our LLM cost calculator and cost reduction guide for optimization strategies.
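The per-request formula and the per-user or per-feature breakdown above can be sketched in a few lines. This is a minimal sketch, not a billing implementation: the model name, the prices, and the record fields (`feature`, `user`, `input_tokens`, `output_tokens`) are all hypothetical, so substitute your provider's real rates and your own log schema.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICES_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: input tokens x input price + output tokens x output price."""
    price = PRICES_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_by(records: list[dict], key: str) -> dict[str, float]:
    """Sum request cost per value of `key` (for example "user" or "feature")."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r[key]] += request_cost(r["model"], r["input_tokens"], r["output_tokens"])
    return dict(totals)


# Hypothetical log records, one per LLM request.
records = [
    {"model": "example-model", "feature": "search", "user": "u1",
     "input_tokens": 1000, "output_tokens": 200},
    {"model": "example-model", "feature": "chat", "user": "u2",
     "input_tokens": 500, "output_tokens": 1000},
]
print(cost_by(records, "feature"))
```

Grouping by a different key (`"user"` instead of `"feature"`) gives the per-user view from the same records.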
3. Quality evaluation
The hardest part. How do you know if the LLM's output is good?
Automated approaches:
- LLM-as-a-judge: use a second model to evaluate the first
- Regex/rule checks: verify output format, required fields
- Structured output validation: schema enforcement
- Retrieval relevance scoring: for RAG systems
Human approaches:
- Thumbs up/down from users
- Random sampling for manual review
- A/B testing different prompts
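Of the automated approaches, rule checks and structured output validation are the cheapest to start with. The sketch below assumes the model was asked to reply with a JSON object containing `answer` and `sources`. That schema and the refusal pattern are made up for illustration, so adapt both to your own output format.

```python
import json
import re

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}
# Hypothetical rule: flag refusal boilerplate in the answer text.
REFUSAL_PATTERN = re.compile(r"\bas an ai\b", re.IGNORECASE)


def check_output(raw: str) -> list[str]:
    """Return a list of problems found in a model response; empty means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for field: {field}")

    answer = data.get("answer")
    if isinstance(answer, str) and REFUSAL_PATTERN.search(answer):
        problems.append("answer matches refusal pattern")
    return problems


print(check_output('{"answer": "Paris", "sources": ["doc-1"]}'))
print(check_output('{"answer": 1}'))
```

Logging the returned problem list next to each trace lets you chart a pass rate over time without any human review.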
4. Latency monitoring
LLM latency has two components:
- Time to first token (TTFT): how long before the response starts streaming
- Total generation time: depends on output length
Track both separately. A slow TTFT usually points to queueing or a slow prefill (for example, a very long prompt). Slow generation usually means long outputs or an overloaded model or inference engine.
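One way to capture both numbers is to wrap the streaming iterator your client returns. This is a sketch under assumptions: `timed_stream` and `fake_stream` are hypothetical names, and `fake_stream` only simulates a provider stream with sleeps so the example runs on its own.

```python
import time
from typing import Iterable, Iterator


def timed_stream(chunks: Iterable[str], timings: dict) -> Iterator[str]:
    """Yield chunks unchanged while recording TTFT and total time (seconds) in `timings`."""
    start = time.perf_counter()
    first_seen = False
    for chunk in chunks:
        if not first_seen:
            timings["ttft_s"] = time.perf_counter() - start
            first_seen = True
        yield chunk
    timings["total_s"] = time.perf_counter() - start


def fake_stream() -> Iterator[str]:
    """Stand-in for a provider's streaming response."""
    time.sleep(0.05)  # simulated wait before the first token
    for token in ["Hello", ",", " world"]:
        time.sleep(0.01)  # simulated per-token generation time
        yield token


timings: dict = {}
text = "".join(timed_stream(fake_stream(), timings))
print(text, timings)
```

Because the wrapper is a generator, the clock starts when the caller begins consuming the stream, and the total includes any time the caller spends handling each chunk.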
5. Prompt versioning
When you change a prompt, quality can change. Track which prompt version produced which outputs so you can:
- Roll back to a previous version if quality drops
- A/B test prompt changes
- Correlate prompt changes with cost/quality metrics
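A lightweight way to get a version identifier is to hash the prompt template and store the hash with every logged output. This is a sketch of one possible scheme, not a feature of any particular tool. The function names and record fields are hypothetical.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Short, stable identifier derived from the template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def build_log_record(template: str, variables: dict, output: str) -> dict:
    """Attach the prompt version to the data you already log per request."""
    return {
        "prompt_version": prompt_version(template),
        "prompt": template.format(**variables),
        "output": output,
    }


TEMPLATE_V1 = "Summarize the following text in one sentence:\n{text}"
TEMPLATE_V2 = "Summarize the following text in one short sentence:\n{text}"

record = build_log_record(TEMPLATE_V1, {"text": "LLM apps need tracing."}, "(model output)")
print(record["prompt_version"], prompt_version(TEMPLATE_V2))
```

Any edit to the template changes the hash, so cost and quality metrics can be grouped by `prompt_version` without maintaining version numbers by hand.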
The tools
| Tool | Best for | Pricing | Open source? |
|---|---|---|---|
| Langfuse | Full tracing + evals | Free tier, then usage | Yes (MIT) |
| Helicone | Cost analytics + caching | Free tier, then usage | Yes |
| LangSmith | LangChain users | $39/mo teams | No |
| Portkey | Multi-provider routing | Free tier | Yes (gateway) |
| Phoenix (Arize) | Local debugging | Free | Yes |
| OpenTelemetry | DIY with existing stack | Free | Yes |
| SigNoz | Full-stack + LLM | Usage-based | Yes |
For most teams: Start with Langfuse (open source, generous free tier) or Helicone (best cost tracking). Add OpenTelemetry if you already have a monitoring stack.
For our AI race: We use custom logging in the orchestrator. Every session logs model, tokens, duration, and commits. Simple but effective for our use case.
Getting started in 5 minutes
The fastest path: add Helicone as a proxy. A small change to the client constructor gives you request logging:
from openai import OpenAI

# Before
client = OpenAI(api_key="sk-...")

# After: requests go through the Helicone proxy and are logged there
client = OpenAI(
    api_key="sk-...",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your-helicone-key"},
)
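The same approach works with OpenRouter, Claude, GPT, and any OpenAI-compatible API, though other providers may need a different Helicone base URL (check the Helicone docs). The only code changes are the base URL and the auth header.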
What to monitor first
Don't try to monitor everything at once. Start with:
- Cost per day: catch spikes immediately
- Latency p95: find slow requests
- Error rate: actual API failures
- Token usage trends: understand your consumption pattern
Add quality evaluation and prompt versioning once the basics are stable.
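If you are logging requests yourself, the four starter metrics fall out of a single pass over one day's records. The sketch below assumes hypothetical per-request fields (`cost_usd`, `latency_s`, `tokens`, `error`) and uses the standard library's `statistics.quantiles` for the percentile.

```python
import statistics


def p95(values: list[float]) -> float:
    """95th percentile (inclusive method); falls back to the single value if only one exists."""
    if len(values) < 2:
        return values[0]
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(values, n=100, method="inclusive")[94]


def summarize(requests: list[dict]) -> dict:
    """Daily summary: total cost, p95 latency, error rate, and total tokens."""
    return {
        "cost_usd": sum(r["cost_usd"] for r in requests),
        "latency_p95_s": p95([r["latency_s"] for r in requests]),
        "error_rate": sum(1 for r in requests if r["error"]) / len(requests),
        "total_tokens": sum(r["tokens"] for r in requests),
    }


# Hypothetical records for one day.
day = [
    {"cost_usd": 0.01, "latency_s": 1.0, "tokens": 100, "error": False},
    {"cost_usd": 0.02, "latency_s": 2.0, "tokens": 200, "error": False},
    {"cost_usd": 0.03, "latency_s": 3.0, "tokens": 300, "error": True},
    {"cost_usd": 0.04, "latency_s": 4.0, "tokens": 400, "error": False},
]
print(summarize(day))
```

Comparing each day's summary with the previous one is enough to notice a cost or latency spike before adding a dedicated tool.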
For GDPR compliance
If you're logging prompts and responses, you're logging user data. Make sure your observability tool:
- Has a DPA (Data Processing Agreement)
- Supports data retention policies
- Can redact PII from logs
- Stores data in your region (or self-host Langfuse)
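If your tool cannot redact PII for you, a rough first pass can run in your own code before anything is logged. The patterns below are deliberately simple assumptions that catch common email and phone formats only. They are not a compliance guarantee and will miss names, addresses, and many other identifiers.

```python
import re

# Simplified patterns for illustration; real PII detection needs more than regexes.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Mail jane.doe@example.com or call +49 30 1234567"))
```

Apply the function to both the prompt and the response before they reach your logging or tracing layer, so raw values never leave the application.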
Related: How to Reduce LLM API Costs · Monitor and Control AI Spending · LLM Inference Explained · AI and GDPR