Apr 27, 2026 · 4 min read

LLM Observability for Developers — How to Monitor AI Apps in Production

Traditional application monitoring tracks request times and error rates. LLM applications break this model completely — failures are silent, costs spike without warning, and the same input can produce wildly different outputs. Only 15% of GenAI deployments have proper LLM observability in place.

Here’s what you actually need to monitor and how.

Why traditional APM fails for LLMs

Your existing monitoring (Datadog, New Relic, Grafana) tracks:

HTTP status codes → LLMs return 200 even when hallucinating
Response times → LLM latency varies 10x based on output length
Error rates → A wrong answer isn’t an error, it’s a 200 with bad content

What LLM apps need instead:

Token tracking — which requests burn the most tokens (and money)
Quality evaluation — is the output actually correct?
Prompt tracing — what exact prompt produced this output?
Cost attribution — which feature/user/team is spending the most?
Hallucination detection — is the model making things up?

The five pillars of LLM observability

1. Request tracing

Every LLM call should be traced end-to-end: user input → prompt construction → model call → response → post-processing. For RAG systems, this includes the retrieval step too.

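# Minimal tracing with OpenTelemetry
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("llm_request") as span:
    span.set_attribute("gen_ai.request.model", "claude-opus-4.6")
    # call_llm is your own wrapper. It is assumed here to return the response
    # plus token counts and cost, which only exist once the call has finished.
    response, prompt_tokens, completion_tokens, cost = call_llm(prompt)
    span.set_attribute("gen_ai.usage.input_tokens", prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", completion_tokens)
    # Cost is not part of the gen_ai.* conventions, so this is a custom attribute
    span.set_attribute("llm.cost_usd", cost)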

OpenTelemetry's semantic conventions now define gen_ai.* attributes specifically for LLM tracing. Traces that follow them can be read by any backend that ingests OpenTelemetry data.
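To make the end-to-end idea concrete, here is a minimal standard-library sketch that records one timing per pipeline stage, including a retrieval step. It is a stand-in for spans, not OpenTelemetry itself. `RequestTrace`, `handle_request`, and the stubbed retrieval and model call are hypothetical names for illustration.

```python
import time
from contextlib import contextmanager


class RequestTrace:
    """Collects per-stage timings for one request (a stdlib stand-in for spans)."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.stages = []  # list of (stage_name, duration_seconds)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the stage even if it raised, so failures stay visible
            self.stages.append((name, time.perf_counter() - start))


def handle_request(question, trace):
    # Retrieval and the model call are stubbed; swap in your real implementations
    with trace.stage("retrieval"):
        docs = ["doc about " + question]
    with trace.stage("prompt_construction"):
        prompt = "Context: " + " ".join(docs) + "\nQuestion: " + question
    with trace.stage("model_call"):
        answer = " stub answer "
    with trace.stage("post_processing"):
        answer = answer.strip()
    return prompt, answer


trace = RequestTrace("req-1")
prompt, answer = handle_request("refund policy", trace)
stage_names = [name for name, _ in trace.stages]
```

In a real deployment each `stage` would likely become a child span, so the retrieval step shows up inside the same trace as the model call.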

2. Cost tracking

LLM costs can spike 10x overnight if a prompt gets longer or traffic increases. Track:

Cost per request — input tokens × input price + output tokens × output price
Cost per user — which users consume the most?
Cost per feature — which product feature is most expensive?
Daily/weekly trends — catch spikes before the bill arrives
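The per-request formula and the per-user and per-feature breakdowns can be sketched in a few lines. The prices, model name, and record fields below are made-up placeholders, not real provider rates, so substitute your own.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; substitute your provider's rates
PRICES = {"example-model": {"input": 3.00, "output": 15.00}}


def request_cost(model, input_tokens, output_tokens):
    """input tokens x input price + output tokens x output price."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_by(records, key):
    """Sum request cost grouped by a record field such as 'user' or 'feature'."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += request_cost(r["model"], r["input_tokens"], r["output_tokens"])
    return dict(totals)


records = [
    {"model": "example-model", "user": "alice", "feature": "chat",
     "input_tokens": 1000, "output_tokens": 500},
    {"model": "example-model", "user": "bob", "feature": "search",
     "input_tokens": 2000, "output_tokens": 100},
    {"model": "example-model", "user": "alice", "feature": "search",
     "input_tokens": 4000, "output_tokens": 200},
]

per_feature = cost_by(records, "feature")
per_user = cost_by(records, "user")
```

Tagging every logged request with a user and a feature at call time is what makes this kind of attribution possible later.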

See our LLM cost calculator and cost reduction guide for optimization strategies.

3. Quality evaluation

The hardest part. How do you know if the LLM’s output is good?

Automated approaches:

LLM-as-a-judge — use a second model to evaluate the first
Regex/rule checks — verify output format, required fields
Structured output validation — schema enforcement
Retrieval relevance scoring — for RAG systems
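The rule checks and structured output validation are the cheapest of these to automate. Here is a rough sketch, assuming the model was asked to return JSON with an `answer` string and a `sources` list. That schema and the refusal-phrase rule are illustrative assumptions.

```python
import json
import re

# Hypothetical schema: field name -> expected Python type
REQUIRED_FIELDS = {"answer": str, "sources": list}


def check_output(raw):
    """Return a list of problems found in a model response; empty means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for field: {field}")

    # Rule check: flag refusal boilerplate that leaked into the answer
    answer = data.get("answer")
    if isinstance(answer, str) and re.search(r"(?i)\bas an ai\b", answer):
        problems.append("answer contains refusal boilerplate")
    return problems


good = check_output('{"answer": "Refunds take 5 days.", "sources": ["policy.md"]}')
bad = check_output('{"answer": 42}')
```

Logging the list of problems next to each trace gives you a pass rate you can chart over time.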

Human approaches:

Thumbs up/down from users
Random sampling for manual review
A/B testing different prompts

4. Latency monitoring

LLM latency has two components:

Time to first token (TTFT) — how long before the response starts streaming
Total generation time — depends on output length

Track both separately. A slow TTFT usually means requests are queueing or prefill (processing the prompt) is the bottleneck. Slow generation usually means the model or inference engine is overloaded, or the output is simply long.
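Both numbers can be measured by wrapping whatever iterator your SDK returns for a streamed response. The sketch below uses a fake generator in place of a real stream. `measure_stream` and `fake_stream` are hypothetical names.

```python
import time


def measure_stream(chunks):
    """Consume an iterable of streamed text chunks; return (text, ttft_s, total_s)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in chunks:
        # The first non-empty chunk marks time to first token
        if first_token_at is None and chunk:
            first_token_at = time.perf_counter()
        parts.append(chunk)
    end = time.perf_counter()
    ttft = (first_token_at - start) if first_token_at is not None else None
    return "".join(parts), ttft, end - start


def fake_stream():
    # Stand-in for a streaming model response
    time.sleep(0.05)  # simulated wait before the first token
    yield "Hello"
    time.sleep(0.02)  # simulated gap between tokens
    yield ", world"


text, ttft, total = measure_stream(fake_stream())
```

Record `ttft` and `total` as separate metrics so a regression in one is not hidden by the other.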

5. Prompt versioning

When you change a prompt, quality can change. Track which prompt version produced which outputs so you can:

Roll back to a previous version if quality drops
A/B test prompt changes
Correlate prompt changes with cost/quality metrics
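One lightweight way to get a version ID, if you are not using a tool that manages prompts for you, is to hash the prompt template and log the hash with every output. This is a sketch of that idea. The function names and record fields are assumptions.

```python
import hashlib


def prompt_version(template):
    """Derive a short, stable version ID from the prompt template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def log_record(template, variables, output):
    """Build a log entry that ties an output to the prompt version that produced it."""
    return {
        "prompt_version": prompt_version(template),
        "prompt": template.format(**variables),
        "output": output,
    }


V1 = "Summarize the following text:\n{text}"
V2 = "Summarize the following text in one sentence:\n{text}"

rec1 = log_record(V1, {"text": "LLM apps fail silently."}, "stub output")
rec2 = log_record(V2, {"text": "LLM apps fail silently."}, "stub output")
```

Because the ID is derived from the template text, any edit to the prompt produces a new version automatically, and grouping cost or quality metrics by `prompt_version` shows what each change did.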

The tools

Tool	Best for	Pricing	Open source?
Langfuse	Full tracing + evals	Free tier, then usage	✅ MIT
Helicone	Cost analytics + caching	Free tier, then usage	✅
LangSmith	LangChain users	$39/mo teams	❌
Portkey	Multi-provider routing	Free tier	❌
Phoenix (Arize)	Local debugging	Free	✅
OpenTelemetry	DIY with existing stack	Free	✅
SigNoz	Full-stack + LLM	Usage-based	✅

For most teams: Start with Langfuse (open source, generous free tier) or Helicone (best cost tracking). Add OpenTelemetry if you already have a monitoring stack.

For our AI race: We use custom logging in the orchestrator — every session logs model, tokens, duration, and commits. Simple but effective for our use case.

Getting started in 5 minutes

The fastest path: add Helicone as a proxy. Change the base URL, add one auth header, and you get instant observability:

from openai import OpenAI

# Before
client = OpenAI(api_key="sk-...")

# After: all requests are now logged in Helicone
client = OpenAI(
    api_key="sk-...",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)

Helicone also works with OpenRouter, Claude, GPT, and any OpenAI-compatible API, though the base URL differs by provider. The only code changes are the base URL and the auth header.

What to monitor first

Don’t try to monitor everything at once. Start with:

Cost per day — catch spikes immediately
Latency p95 — find slow requests
Error rate — actual API failures
Token usage trends — understand your consumption pattern

Add quality evaluation and prompt versioning once the basics are stable.
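If your requests are already logged as structured records, the four starter metrics are a small aggregation away. Here is a sketch assuming each log entry carries `day`, `cost_usd`, `latency_s`, `tokens`, and `error` fields. Those field names are hypothetical.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def daily_summary(logs):
    """Aggregate cost, p95 latency, error rate, and token usage per day."""
    by_day = defaultdict(list)
    for entry in logs:
        by_day[entry["day"]].append(entry)

    summary = {}
    for day, entries in by_day.items():
        summary[day] = {
            "cost_usd": sum(e["cost_usd"] for e in entries),
            "latency_p95_s": p95([e["latency_s"] for e in entries]),
            "error_rate": sum(1 for e in entries if e["error"]) / len(entries),
            "tokens": sum(e["tokens"] for e in entries),
        }
    return summary


logs = [
    {"day": "2026-04-27", "cost_usd": 0.01, "latency_s": 1.0, "tokens": 100, "error": False},
    {"day": "2026-04-27", "cost_usd": 0.02, "latency_s": 2.0, "tokens": 200, "error": False},
    {"day": "2026-04-27", "cost_usd": 0.03, "latency_s": 3.0, "tokens": 300, "error": True},
    {"day": "2026-04-27", "cost_usd": 0.04, "latency_s": 10.0, "tokens": 400, "error": False},
]

summary = daily_summary(logs)
```

With only four requests the p95 is simply the slowest one. On real traffic it surfaces the slow tail that an average hides.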

For GDPR compliance

If you’re logging prompts and responses, you’re logging user data. Make sure your observability tool:

Has a DPA (Data Processing Agreement)
Supports data retention policies
Can redact PII from logs
Stores data in your region (or self-host Langfuse)
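If your tool cannot redact for you, you can scrub obvious identifiers before a prompt or response is ever logged. The sketch below covers only email addresses and phone-like numbers with two illustrative regular expressions. Real PII detection needs much broader coverage, so treat this as a starting point, not a compliance control.

```python
import re

# Illustrative patterns only; they will miss many kinds of PII
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


redacted = redact("Contact jane.doe@example.com or +1 (555) 010-9999 about order 42.")
```

Run the redaction at the point where you build the log record, so the raw text never reaches the observability backend.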
