When an AI agent gives a wrong answer, you need to trace back through its reasoning: what did it read, what tools did it call, what context did it have, and where did it go wrong? Without structured logging, debugging agents is guesswork.
Traditional application logging (logger.info("request processed")) isn't enough. Agent interactions are multi-step, non-deterministic, and involve external API calls that cost money. You need traces, not just logs.
This guide complements our LLM observability overview with agent-specific tracing patterns.
What to capture
Every agent interaction should log:
trace = {
    # Identity
    "trace_id": "abc-123",
    "session_id": "user_456_session_789",
    "user_id": "user_456",
    "agent_name": "Code Reviewer",

    # Input
    "user_message": "Review the auth middleware",
    "system_prompt_tokens": 450,

    # Reasoning steps
    "steps": [
        {
            "type": "tool_call",
            "tool": "read_file",
            "input": {"path": "src/auth/middleware.ts"},
            "output_tokens": 1200,
            "duration_ms": 45,
        },
        {
            "type": "tool_call",
            "tool": "search_code",
            "input": {"query": "jwt.verify", "path": "src/"},
            "output_tokens": 800,
            "duration_ms": 120,
        },
        {
            "type": "llm_call",
            "model": "claude-sonnet-4",
            "input_tokens": 3200,
            "output_tokens": 650,
            "duration_ms": 2400,
        },
    ],

    # Output
    "final_output": "Found 2 security issues...",
    "total_tokens": 6300,
    "total_cost_usd": 0.032,
    "total_duration_ms": 3100,
}
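A record like this is easy to accumulate with a small helper. The sketch below is illustrative, not from any SDK: the `TraceRecorder` class name and the blended per-token rate are assumptions you would replace with your own pricing.

```python
# Minimal sketch of a trace recorder that accumulates the fields above.
import time
import uuid

COST_PER_1K_TOKENS = 0.003  # assumed blended rate, for illustration only

class TraceRecorder:
    def __init__(self, agent_name: str, user_id: str, session_id: str):
        self.trace = {
            "trace_id": str(uuid.uuid4()),
            "session_id": session_id,
            "user_id": user_id,
            "agent_name": agent_name,
            "steps": [],
        }
        self._start = time.monotonic()

    def add_step(self, step_type, name, step_input, output_tokens, duration_ms):
        self.trace["steps"].append({
            "type": step_type,
            "name": name,
            "input": step_input,
            "output_tokens": output_tokens,
            "duration_ms": duration_ms,
        })

    def finish(self, final_output: str) -> dict:
        tokens = sum(s["output_tokens"] for s in self.trace["steps"])
        self.trace.update({
            "final_output": final_output,
            "total_tokens": tokens,
            "total_cost_usd": round(tokens / 1000 * COST_PER_1K_TOKENS, 6),
            "total_duration_ms": int((time.monotonic() - self._start) * 1000),
        })
        return self.trace
```

Totals are derived from the steps at `finish()`, so they can never drift out of sync with what was actually recorded.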
OpenTelemetry integration
The OpenAI Agents SDK has built-in tracing; by default it reports to the OpenAI dashboard, but you can register your own processors to route spans to OpenTelemetry-compatible backends:
from agents import Runner, trace
from agents.tracing import set_trace_processors

# Export traces to your observability platform. The Agents SDK defines its own
# TracingProcessor interface, so OTLP export goes through a bridge package
# (e.g. OpenInference or Logfire instrumentation) rather than a raw
# OpenTelemetry span processor.
set_trace_processors([otel_bridge_processor])  # processor supplied by your bridge

async def review_code(file_path: str):
    with trace("code-review"):  # trace() is a context manager, not a decorator
        result = await Runner.run(review_agent, f"Review {file_path}")
        return result.final_output
This sends structured traces to any OpenTelemetry-compatible backend: Jaeger, Grafana Tempo, Datadog, or Langfuse.
Platform-specific tracing
Helicone (proxy-based)
Helicone sits between your agent and the LLM API, capturing everything automatically:
import openai
client = openai.OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_API_KEY}",
        "Helicone-Session-Id": session_id,
        "Helicone-User-Id": user_id,
    },
)
Zero code changes to your agent. Helicone captures every LLM call with tokens, cost, latency, and the full request/response.
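It helps to build those headers in one place. A minimal sketch: the `Helicone-Auth`, `Helicone-Session-Id`, and `Helicone-User-Id` headers are documented Helicone fields, while the `Helicone-Property-Agent` entry is an example of Helicone's custom property-header convention, with an assumed property name.

```python
def helicone_headers(api_key: str, user_id: str, session_id: str) -> dict:
    """Build the per-client Helicone headers in one place."""
    return {
        "Helicone-Auth": f"Bearer {api_key}",
        "Helicone-Session-Id": session_id,
        "Helicone-User-Id": user_id,
        # Custom Helicone-Property-* headers become filterable fields in the
        # Helicone dashboard; "Agent" is an example property name.
        "Helicone-Property-Agent": "code-reviewer",
    }
```

Pass the result as `default_headers=` when constructing the `openai.OpenAI` client shown above, so every request is attributed to the right user, session, and agent.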
Langfuse (SDK-based)
Langfuse gives you more control with explicit trace creation:
from langfuse import Langfuse

langfuse = Langfuse()
trace = langfuse.trace(name="code-review", user_id=user_id)
span = trace.span(name="read-file", input={"path": file_path})
# ... do the work ...
span.end(output={"content": file_content, "tokens": 1200})
langfuse.flush()  # events are batched; flush before short-lived processes exit
Custom dashboard
For simple setups, log to PostgreSQL and build a dashboard:
CREATE TABLE agent_traces (
    id BIGSERIAL PRIMARY KEY,
    trace_id UUID NOT NULL,
    session_id TEXT,
    user_id TEXT,
    agent_name TEXT,
    step_type TEXT, -- 'tool_call', 'llm_call', 'error'
    step_name TEXT,
    input JSONB,
    output TEXT,
    tokens_used INTEGER,
    cost_usd NUMERIC(10, 6),
    duration_ms INTEGER,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Quick queries
-- Cost per user today
SELECT user_id, SUM(cost_usd) FROM agent_traces
WHERE created_at > NOW() - INTERVAL '1 day' GROUP BY user_id;
-- Slowest tool calls
SELECT step_name, AVG(duration_ms), COUNT(*) FROM agent_traces
WHERE step_type = 'tool_call' GROUP BY step_name ORDER BY AVG(duration_ms) DESC;
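For local development you can prototype the same shape with the standard library's sqlite3 before standing up PostgreSQL. This is a sketch, not a drop-in replacement: the JSONB column becomes serialized TEXT, and the query mirrors the slowest-tool-calls query above.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE agent_traces (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        trace_id TEXT NOT NULL,
        step_type TEXT,
        step_name TEXT,
        input TEXT,          -- JSON serialized (no JSONB in SQLite)
        tokens_used INTEGER,
        cost_usd REAL,
        duration_ms INTEGER
    )
""")

def log_step(trace_id, step_type, step_name, step_input, tokens, cost, ms):
    conn.execute(
        "INSERT INTO agent_traces (trace_id, step_type, step_name, input,"
        " tokens_used, cost_usd, duration_ms) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (trace_id, step_type, step_name, json.dumps(step_input), tokens, cost, ms),
    )

log_step("abc-123", "tool_call", "read_file", {"path": "a.ts"}, 1200, 0.004, 45)
log_step("abc-123", "tool_call", "read_file", {"path": "b.ts"}, 800, 0.002, 120)

# Slowest tool calls, mirroring the PostgreSQL query above
rows = conn.execute(
    "SELECT step_name, AVG(duration_ms) FROM agent_traces"
    " WHERE step_type = 'tool_call' GROUP BY step_name"
).fetchall()
```

One row per step (rather than one per trace) is what makes these aggregate queries a simple GROUP BY instead of a JSON unnesting exercise.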
Alerting on trace data
Set up alerts for anomalies:
| Alert | Threshold | Action |
|---|---|---|
| High error rate | >5% of traces have errors | Page on-call |
| Cost spike | Daily cost >2x average | Notify team |
| Slow responses | p95 latency >30s | Investigate |
| Loop detection | >3 identical tool calls in trace | Auto-interrupt agent |
| Token budget | User at 80% of daily limit | Warn user |
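The loop-detection alert reduces to counting identical tool calls within a single trace. A minimal sketch, assuming steps shaped like the trace schema above:

```python
import json
from collections import Counter

def detect_loops(steps: list, threshold: int = 3) -> list:
    """Return tool names called with identical input more than `threshold` times."""
    calls = Counter(
        # Serialize the input dict so identical calls compare equal
        (step["tool"], json.dumps(step["input"], sort_keys=True))
        for step in steps
        if step.get("type") == "tool_call"
    )
    return [tool for (tool, _inp), count in calls.items() if count > threshold]
```

Run it on each trace as steps arrive; a non-empty result is the signal to interrupt the agent before it burns through its token budget retrying the same call.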
What NOT to log
- Full user messages in plain text if they contain PII: hash or redact
- API keys or tokens: never log credentials
- Full file contents from tool calls: log file paths and sizes instead
- Every intermediate token: log summaries, not streams
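These rules can be applied in one place before anything reaches the log sink. A sketch, with an illustrative (and deliberately not exhaustive) secret pattern:

```python
import hashlib
import re

# Illustrative pattern only; a real deployment needs a proper secrets scanner
SECRET_PATTERN = re.compile(r"(sk-[A-Za-z0-9]{8,}|Bearer\s+\S+)")

def redact_for_logging(user_message: str, file_content: str, path: str) -> dict:
    """Produce a log-safe record: hashes and metadata instead of raw content."""
    return {
        # Hash lets you correlate repeat messages without storing the text
        "user_message_sha256": hashlib.sha256(user_message.encode()).hexdigest(),
        "user_message_len": len(user_message),
        # File path and size instead of full contents
        "file": {"path": path, "bytes": len(file_content.encode())},
        # Short secrets-scrubbed preview rather than the full message
        "preview": SECRET_PATTERN.sub("[REDACTED]", user_message)[:80],
    }
```

Funneling every log write through a function like this is easier to audit than redacting at each call site.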
Balance observability with privacy. See our GDPR guide for compliance requirements.
Related: LLM Observability for Developers · Helicone vs LangSmith vs Langfuse · How to Debug AI Agents · AI Agent Cost Management · AI Agent Error Handling · Deploy AI Agents to Production