LLM Observability for Developers: How to Monitor AI Apps in Production


Traditional application monitoring tracks request times and error rates. LLM applications break this model completely: failures are silent, costs spike without warning, and the same input can produce wildly different outputs. Only 15% of GenAI deployments have proper LLM observability in place.

Here’s what you actually need to monitor and how.

Why traditional APM fails for LLMs

Your existing monitoring (Datadog, New Relic, Grafana) tracks:

  • HTTP status codes → LLMs return 200 even when hallucinating
  • Response times → LLM latency varies 10x based on output length
  • Error rates → A wrong answer isn’t an error, it’s a 200 with bad content

What LLM apps need instead:

  • Token tracking: which requests burn the most tokens (and money)
  • Quality evaluation: is the output actually correct?
  • Prompt tracing: what exact prompt produced this output?
  • Cost attribution: which feature/user/team is spending the most?
  • Hallucination detection: is the model making things up?

The five pillars of LLM observability

1. Request tracing

Every LLM call should be traced end-to-end: user input → prompt construction → model call → response → post-processing. For RAG systems, this includes the retrieval step too.

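# Minimal tracing with OpenTelemetry
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("llm_request") as span:
    span.set_attribute("gen_ai.request.model", "claude-opus-4.6")
    # call_llm is your own wrapper around the provider SDK
    response = call_llm(prompt)
    # Token counts are only known after the call; read them from the
    # usage data in the response and compute cost from your price table
    span.set_attribute("gen_ai.usage.input_tokens", prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", completion_tokens)
    # Cost has no standard gen_ai.* attribute, so use a custom name
    span.set_attribute("llm.cost_usd", cost)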

OpenTelemetry defines gen_ai.* semantic conventions specifically for LLM tracing, so your traces work with any observability backend that ingests OpenTelemetry data.

2. Cost tracking

LLM costs can spike 10x overnight if a prompt gets longer or traffic increases. Track:

  • Cost per request: input tokens × input price + output tokens × output price
  • Cost per user: which users consume the most?
  • Cost per feature: which product feature is most expensive?
  • Daily/weekly trends: catch spikes before the bill arrives

See our LLM cost calculator and cost reduction guide for optimization strategies.

3. Quality evaluation

The hardest part. How do you know if the LLM’s output is good?

Automated approaches:

  • LLM-as-a-judge: use a second model to evaluate the first
  • Regex/rule checks: verify output format, required fields
  • Structured output validation: schema enforcement
  • Retrieval relevance scoring: for RAG systems
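Rule checks and structured output validation are the cheapest of these to automate. A minimal sketch, assuming the model was asked to return JSON with an "answer" string and a "sources" list of URLs (a hypothetical schema chosen for illustration):

```python
import json
import re

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}
URL_PATTERN = re.compile(r"^https?://\S+$")


def validate_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    # Structured validation: required fields with the right types.
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")

    # Rule check: every cited source must look like a URL.
    sources = data.get("sources")
    if isinstance(sources, list):
        for source in sources:
            if not (isinstance(source, str) and URL_PATTERN.match(source)):
                problems.append(f"source is not a URL: {source!r}")
    return problems
```

Logging the returned problem list alongside the trace turns "bad content with a 200" into something you can count and alert on.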

Human approaches:

  • Thumbs up/down from users
  • Random sampling for manual review
  • A/B testing different prompts

4. Latency monitoring

LLM latency has two components:

  • Time to first token (TTFT): how long before the response starts streaming
  • Total generation time: depends on output length

Track both separately. A slow TTFT usually points to queueing or a slow prefill, for example because of very long prompts. Slow generation usually points to long outputs or an overloaded model or inference engine.
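Both numbers can be measured around any streaming response. The sketch below assumes only that the stream is an iterable of text chunks; `fake_stream` is a stand-in for a real streaming call, not a provider API.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    start_stream: Callable[[], Iterable[str]],
) -> Tuple[str, Optional[float], float]:
    """Consume a stream and return (text, ttft_seconds, total_seconds).

    The timer starts before the request is issued, so TTFT includes
    network and queueing time. TTFT is None if no chunk arrives.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in start_stream():
        if ttft is None:
            ttft = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), ttft, total


def fake_stream():
    """Stand-in for a streaming LLM call (hypothetical)."""
    time.sleep(0.05)  # simulated wait before the first token
    yield "Hel"
    time.sleep(0.05)  # simulated generation time
    yield "lo"


text, ttft, total = measure_stream(fake_stream)
print(f"ttft={ttft:.3f}s total={total:.3f}s text={text!r}")
```

Passing a callable rather than an already-open stream is a deliberate choice here: it keeps the request start inside the timed region.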

5. Prompt versioning

When you change a prompt, quality can change. Track which prompt version produced which outputs so you can:

  • Roll back to a previous version if quality drops
  • A/B test prompt changes
  • Correlate prompt changes with cost/quality metrics
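One lightweight way to do this, sketched under the assumption that prompts live as template strings in your codebase, is to derive a version ID from a hash of the template and attach it to every logged output. The template text and field names below are illustrative, not a standard.

```python
import hashlib
import json
import time

# Hypothetical prompt template used for illustration.
PROMPT_TEMPLATE = "Summarize the following text in one sentence:\n\n{text}"


def prompt_version(template: str) -> str:
    """Derive a short, stable version ID from the template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def log_record(template: str, output: str, cost_usd: float) -> str:
    """Build one JSON log line that ties an output to its prompt version."""
    return json.dumps({
        "ts": time.time(),
        "prompt_version": prompt_version(template),
        "output": output,
        "cost_usd": cost_usd,
    })
```

Because the ID changes whenever the template text changes, grouping logs by prompt_version gives the cost and quality comparison between versions, and the old template in version control is the rollback target.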

The tools

| Tool | Best for | Pricing | Open source? |
|---|---|---|---|
| Langfuse | Full tracing + evals | Free tier, then usage | ✅ MIT |
| Helicone | Cost analytics + caching | Free tier, then usage | ✅ |
| LangSmith | LangChain users | $39/mo teams | ❌ |
| Portkey | Multi-provider routing | Free tier | ❌ |
| Phoenix (Arize) | Local debugging | Free | ✅ |
| OpenTelemetry | DIY with existing stack | Free | ✅ |
| SigNoz | Full-stack + LLM | Usage-based | ✅ |

For most teams: Start with Langfuse (open source, generous free tier) or Helicone (best cost tracking). Add OpenTelemetry if you already have a monitoring stack.

For our AI race: We use custom logging in the orchestrator. Every session logs model, tokens, duration, and commits. Simple but effective for our use case.

Getting started in 5 minutes

The fastest path: add Helicone as a proxy. Two extra arguments on the client, instant observability:

# Before
from openai import OpenAI

client = OpenAI(api_key="sk-...")

# After: all requests are now logged in Helicone
client = OpenAI(
    api_key="sk-...",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)

The same proxy pattern works with OpenRouter, Claude, GPT, and other OpenAI-compatible APIs, though the proxy base URL differs by provider (check Helicone’s docs). No code changes beyond the base URL and the auth header.

What to monitor first

Don’t try to monitor everything at once. Start with:

  1. Cost per day: catch spikes immediately
  2. Latency p95: find slow requests
  3. Error rate: actual API failures
  4. Token usage trends: understand your consumption pattern

Add quality evaluation and prompt versioning once the basics are stable.
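If you already log one record per request, these four starter metrics are a small aggregation. A rough sketch, assuming each record is a dict with "day", "cost_usd", "tokens", "latency_s", and "error" keys (a made-up shape, not any tool's schema) and that there is at least one record:

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(records):
    """Aggregate per-request records into the four starter metrics."""
    cost_per_day = defaultdict(float)
    tokens_per_day = defaultdict(int)
    for r in records:
        cost_per_day[r["day"]] += r["cost_usd"]
        tokens_per_day[r["day"]] += r["tokens"]
    return {
        "cost_per_day": dict(cost_per_day),
        "tokens_per_day": dict(tokens_per_day),
        "latency_p95_s": p95([r["latency_s"] for r in records]),
        # Fraction of requests flagged as actual API failures.
        "error_rate": sum(bool(r["error"]) for r in records) / len(records),
    }
```

In practice an observability backend computes these for you; the point of the sketch is that none of them needs more than token counts, a timestamp, a duration, and an error flag per request.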

For GDPR compliance

If you’re logging prompts and responses, you’re logging user data. Make sure your observability tool:

  • Has a DPA (Data Processing Agreement)
  • Supports data retention policies
  • Can redact PII from logs
  • Stores data in your region (or can be self-hosted, as Langfuse can)
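If your tool does not redact PII for you, a redaction pass before logging is one option. The sketch below uses two deliberately simple regular expressions for emails and phone-like numbers; these patterns are illustrative assumptions and will miss many kinds of personal data, so treat them as a starting point rather than a compliance guarantee.

```python
import re

# Deliberately simple patterns; real PII detection needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers before the text is logged."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or +1 (555) 123-4567 today"))
```

Apply the function to both the prompt and the response before they reach the logger, so unredacted text never leaves your process.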

Related: How to Reduce LLM API Costs · Monitor and Control AI Spending · LLM Inference Explained · AI and GDPR