Your LLM app is in production. Something goes wrong. You check the logs. They say "200 OK." That's it. No prompt, no response, no token count, no model version. You're blind.
Here's what to log in AI systems, and what to leave out.
What to log (always)
Per-request metadata
{
  "timestamp": "2026-04-13T15:30:00Z",
  "request_id": "req_abc123",
  "model": "claude-sonnet-4.6",
  "model_version": "2026-03-15",
  "input_tokens": 1247,
  "output_tokens": 389,
  "total_tokens": 1636,
  "latency_ms": 2340,
  "ttft_ms": 450,
  "cost_usd": 0.0082,
  "status": "success",
  "user_id": "user_hash_xyz",
  "feature": "code-review",
  "team": "backend"
}
This alone answers most debugging questions: which model, how many tokens, how long, how much, and who triggered it.
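A record like this can be assembled in one place so every call site emits the same fields. The sketch below is a minimal illustration: `hash_user_id`, `build_request_log`, and the salt value are hypothetical names chosen for this example, not part of any library.

```python
import hashlib
import json
from datetime import datetime, timezone


def hash_user_id(user_id: str, salt: str = "example-salt") -> str:
    """Return a salted SHA-256 digest so the raw user ID never reaches the logs."""
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return "user_" + digest[:16]


def build_request_log(request_id, model, input_tokens, output_tokens,
                      latency_ms, cost_usd, user_id, feature):
    """Assemble one per-request metadata record (hypothetical helper)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        # Derived here so it can never disagree with the two counts above.
        "total_tokens": input_tokens + output_tokens,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "status": "success",
        "user_id": hash_user_id(user_id),
        "feature": feature,
    }


record = build_request_log("req_abc123", "claude-sonnet-4.6", 1247, 389,
                           2340, 0.0082, "alice@example.com", "code-review")
print(json.dumps(record))
```

Note that a salted hash is pseudonymous rather than anonymous, so it may still count as personal data under your retention policy.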
Error details
{
  "status": "error",
  "error_type": "rate_limit",
  "error_message": "429 Too Many Requests",
  "retry_count": 2,
  "fallback_model": "deepseek-chat"
}
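One way such an error record might be produced is a retry loop that falls back to a second model. This is a sketch under assumptions: `RateLimitError`, `call_with_fallback`, and the `call_model` callable are hypothetical stand-ins for your provider client, and backoff delays are omitted for brevity.

```python
import logging

logger = logging.getLogger("llm")


class RateLimitError(Exception):
    """Stand-in for a provider's 429 error (hypothetical)."""


def call_with_fallback(call_model, primary, fallback, prompt, max_retries=2):
    """Try the primary model, then the fallback; return (result, last_error_record)."""
    error_record = None
    for attempt in range(max_retries + 1):
        try:
            return call_model(primary, prompt), error_record
        except RateLimitError as exc:
            error_record = {
                "status": "error",
                "error_type": "rate_limit",
                "error_message": str(exc),
                "retry_count": attempt,  # retries already spent before this failure
                "fallback_model": fallback,
            }
            logger.warning(error_record)
    # Every attempt on the primary failed, so switch to the fallback model.
    return call_model(fallback, prompt), error_record


def fake_call(model, prompt):
    """Simulated client: the primary is always rate limited."""
    if model == "claude-sonnet-4.6":
        raise RateLimitError("429 Too Many Requests")
    return f"{model}: ok"


result, last_error = call_with_fallback(
    fake_call, "claude-sonnet-4.6", "deepseek-chat", "hello")
```

Logging the fallback model alongside the retry count makes it possible to tell later which responses did not come from the model you intended.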
Tool calls (for MCP and agents)
{
  "tools_called": ["read_file", "search_codebase", "write_file"],
  "tool_count": 3,
  "tool_latency_ms": [120, 340, 89]
}
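The tool-call fields can be collected by routing each tool invocation through a small wrapper. The sketch below assumes tools are plain Python callables; `ToolCallLog` is a hypothetical helper, not an MCP or agent-framework API.

```python
import time


class ToolCallLog:
    """Collects tool names and latencies for one agent run (illustrative)."""

    def __init__(self):
        self.names = []
        self.latencies_ms = []

    def run(self, name, func, *args, **kwargs):
        """Call a tool, recording its name and latency even if it raises."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            self.names.append(name)
            self.latencies_ms.append(round((time.perf_counter() - start) * 1000))

    def to_record(self):
        """Return the fields in the same shape as the JSON example above."""
        return {
            "tools_called": self.names,
            "tool_count": len(self.names),
            "tool_latency_ms": self.latencies_ms,
        }


log = ToolCallLog()
content = log.run("read_file", lambda path: f"contents of {path}", "main.py")
hits = log.run("search_codebase", lambda query: [query.upper()], "todo")
record = log.to_record()
```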
What to log (carefully)
Prompts and responses
The dilemma: You need prompts for debugging. But prompts contain user data.
Solution: Log prompts in a separate, access-controlled store with automatic expiration:
# Main log: metadata only
logger.info({"request_id": req_id, "model": model, "tokens": tokens})
# Prompt store: full content, 30-day retention, restricted access
prompt_store.save(req_id, prompt, response, ttl_days=30)
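The `prompt_store` object above is not defined in the snippet. As a rough sketch of what it could look like, here is a toy in-memory store with per-record expiry. The class name and the injectable `clock` parameter are assumptions for illustration; a real deployment would more likely use a database with access controls and native expiration.

```python
import time


class InMemoryPromptStore:
    """Toy prompt store with per-record expiry (not for production use)."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._records = {}

    def save(self, request_id, prompt, response, ttl_days=30):
        expires_at = self._clock() + ttl_days * 86400
        self._records[request_id] = (expires_at, prompt, response)

    def get(self, request_id):
        """Return (prompt, response), or None if missing or expired."""
        entry = self._records.get(request_id)
        if entry is None:
            return None
        expires_at, prompt, response = entry
        if self._clock() >= expires_at:
            del self._records[request_id]
            return None
        return prompt, response

    def purge_expired(self):
        """Delete every expired record and return how many were removed."""
        now = self._clock()
        expired = [rid for rid, (exp, _, _) in self._records.items() if exp <= now]
        for rid in expired:
            del self._records[rid]
        return len(expired)


prompt_store = InMemoryPromptStore()
prompt_store.save("req_abc123", "Review this diff", "Looks good", ttl_days=30)
```

Keying the store by `request_id` lets you join full content back to the metadata log only when an investigation actually needs it.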
For GDPR: prompts containing PII must be covered by your data retention policy. Consider redacting PII before logging:
import re

def redact_pii(text):
    # Replace email addresses and US-style 10-digit phone numbers with placeholders.
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    return text
Evaluation scores
If using LLM-as-judge:
{
  "eval_score": 4.2,
  "eval_model": "claude-opus-4.6",
  "eval_criteria": "helpfulness"
}
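An evaluation record is most useful when it can be joined back to the request it judged. The sketch below assumes the judge is sampled several times and the scores are averaged; `build_eval_record` and the `eval_samples` field are hypothetical additions for illustration.

```python
from statistics import mean


def build_eval_record(request_id, judge_scores, eval_model, criteria):
    """Summarize one or more judge scores for a single request (hypothetical helper)."""
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    return {
        "request_id": request_id,  # links the score to the original call
        "eval_score": round(mean(judge_scores), 1),
        "eval_model": eval_model,
        "eval_criteria": criteria,
        "eval_samples": len(judge_scores),
    }


eval_record = build_eval_record(
    "req_abc123", [4, 5, 4, 4, 4], "claude-opus-4.6", "helpfulness")
```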
What NOT to log
API keys and tokens: Never log authentication credentials. Seems obvious but it happens.
Full model weights or configs: Log the model name and version, not internal configuration.
Raw user PII: Redact before logging. Names, emails, addresses should never appear in plain text in logs.
Every intermediate step in long chains: For RAG or agent workflows, log the start, end, and any errors. Don't log every retrieval result or intermediate reasoning step unless debugging.
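As a safety net for the first rule, a filter on the logger can scrub anything that looks like a credential before it is emitted. This is a sketch using the standard library's `logging.Filter`; the two regex patterns are illustrative assumptions only, since real key formats vary by provider.

```python
import logging
import re

# Illustrative patterns only; adjust them to the key formats you actually use.
SECRET_PATTERNS = [
    (re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"), "[API_KEY]"),
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer [TOKEN]"),
]


def scrub_secrets(text: str) -> str:
    """Replace anything matching a known secret pattern with a placeholder."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class SecretScrubbingFilter(logging.Filter):
    """Rewrites each record's message so matching secrets are never emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = scrub_secrets(record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True


scrubbed_logger = logging.getLogger("llm.scrubbed")
scrubbed_logger.addFilter(SecretScrubbingFilter())
```

A filter like this is a backstop, not a substitute for keeping credentials out of log calls in the first place: pattern matching will miss formats it does not know about.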
Log structure
Use structured logging (JSON), not free-text:
# Bad
logger.info(f"Called {model} with {tokens} tokens in {latency}ms")
# Good
logger.info({
    "event": "llm_call",
    "model": model,
    "tokens": tokens,
    "latency_ms": latency
})
Structured logs are searchable, filterable, and can feed into dashboards.
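One caveat: Python's standard `logging` module renders a dict message with `str()`, which produces a Python repr rather than valid JSON. A small formatter closes that gap. The sketch below uses only the standard library; `JsonFormatter` and the logger name `llm.json` are names chosen for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
        }
        if isinstance(record.msg, dict):
            payload.update(record.msg)  # structured fields become top-level keys
        else:
            payload["message"] = record.getMessage()
        # default=str keeps non-serializable values from crashing the logger
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
json_logger = logging.getLogger("llm.json")
json_logger.addHandler(handler)
json_logger.setLevel(logging.INFO)

json_logger.info({"event": "llm_call", "model": "claude-sonnet-4.6",
                  "tokens": 1636, "latency_ms": 2340})
```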
Where to send logs
| Option | Best for |
|---|---|
| Helicone | Automatic via proxy, best cost dashboards |
| Langfuse | Self-hostable, so logs can stay in your own infrastructure (helps with GDPR) |
| Your existing stack (Datadog, Grafana) | If you already have observability |
| Custom (PostgreSQL + Grafana) | Full control, cheapest |
For most teams, start with Helicone (1-line proxy setup) for AI-specific logs alongside your existing application logging.
Log retention and compliance
How long should you keep AI logs?
| Regulation | Requirement |
|---|---|
| GDPR | Delete personal data when no longer needed. Anonymized logs can be kept indefinitely. |
| EU AI Act | High-risk systems: logging must work over the system's lifetime; retain logs for at least 6 months, or longer where other law requires |
| SOC 2 | Typically 1 year minimum retention |
| Internal debugging | 30-90 days is usually sufficient |
Practical approach: Keep metadata logs (model, tokens, cost, latency) indefinitely. Keep full prompt/response logs for 30-90 days with automatic expiration. Redact PII before logging.
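The split between long-lived metadata and short-lived content maps naturally onto two tables with a scheduled purge on only one of them. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL; the table names, columns, and `purge_old_prompts` are assumptions for illustration.

```python
import sqlite3

DAY = 86400  # seconds

conn = sqlite3.connect(":memory:")  # stand-in for a real database
conn.execute("CREATE TABLE llm_metadata ("
             "request_id TEXT PRIMARY KEY, model TEXT, cost_usd REAL, created_at INTEGER)")
conn.execute("CREATE TABLE llm_prompts ("
             "request_id TEXT PRIMARY KEY, prompt TEXT, response TEXT, created_at INTEGER)")


def purge_old_prompts(conn, now_epoch, retention_days=90):
    """Delete prompt rows older than the window; metadata rows are untouched."""
    cutoff = now_epoch - retention_days * DAY
    cursor = conn.execute("DELETE FROM llm_prompts WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount


# Two sample requests: one 120 days old, one 5 days old.
now = 1_000 * DAY
for request_id, age_days in [("req_old", 120), ("req_new", 5)]:
    created = now - age_days * DAY
    conn.execute("INSERT INTO llm_metadata VALUES (?, ?, ?, ?)",
                 (request_id, "example-model", 0.01, created))
    conn.execute("INSERT INTO llm_prompts VALUES (?, ?, ?, ?)",
                 (request_id, "prompt text", "response text", created))

deleted = purge_old_prompts(conn, now)
```

Running the purge from a daily scheduled job keeps expiration automatic rather than dependent on someone remembering to clean up.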
Setting up alerts
Don't just log; alert on anomalies:
# Alert on cost spikes
if request_cost > avg_cost * 3:
    send_alert(f"Cost spike: ${request_cost:.4f} (3x average)")
# Alert on latency spikes
if latency_ms > p95_latency * 2:
    send_alert(f"Latency spike: {latency_ms}ms (2x P95)")
# Alert on error rate
if hourly_error_rate > 0.05:
    send_alert(f"Error rate: {hourly_error_rate:.1%} (>5%)")
These three alerts catch many common production issues (runaway costs, slow responses, failing calls) before users report them.
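The alert checks compare against baselines (`avg_cost`, `p95_latency`, `hourly_error_rate`) that have to come from somewhere. A minimal sketch of one approach is a rolling window over recent requests. `RollingStats` is a hypothetical helper, and it uses a count-based window rather than a true hourly one to keep the example short.

```python
from collections import deque


class RollingStats:
    """Keeps the last `size` requests and derives alert baselines from them."""

    def __init__(self, size=1000):
        self.costs = deque(maxlen=size)
        self.latencies = deque(maxlen=size)
        self.errors = deque(maxlen=size)

    def record(self, cost_usd, latency_ms, is_error):
        self.costs.append(cost_usd)
        self.latencies.append(latency_ms)
        self.errors.append(1 if is_error else 0)

    def avg_cost(self):
        return sum(self.costs) / len(self.costs) if self.costs else 0.0

    def p95_latency(self):
        """Nearest-rank 95th percentile of the recorded latencies."""
        if not self.latencies:
            return 0.0
        ordered = sorted(self.latencies)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]

    def error_rate(self):
        return sum(self.errors) / len(self.errors) if self.errors else 0.0


stats = RollingStats()
for i in range(1, 101):
    # 100 requests: latency 1..100 ms, every 20th request fails
    stats.record(cost_usd=0.01, latency_ms=i, is_error=(i % 20 == 0))
```

Compute the baselines before recording the current request, so a single spike is compared against history rather than against itself.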
Example: complete logging middleware
import time
import logging

logger = logging.getLogger("llm")

# generate_id, call_llm, and calculate_cost are assumed to be defined elsewhere in your app.
async def llm_middleware(model, messages, **kwargs):
    start = time.perf_counter()  # monotonic clock, unaffected by system clock changes
    request_id = generate_id()
    try:
        response = await call_llm(model, messages, **kwargs)
        latency = (time.perf_counter() - start) * 1000
        logger.info({
            "event": "llm_call",
            "request_id": request_id,
            "model": model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "latency_ms": round(latency),
            "cost_usd": calculate_cost(model, response.usage),
            "status": "success",
        })
        return response
    except Exception as e:
        latency = (time.perf_counter() - start) * 1000
        logger.error({
            "event": "llm_call",
            "request_id": request_id,
            "model": model,
            "latency_ms": round(latency),
            "status": "error",
            "error_type": type(e).__name__,
            "error_message": str(e),
        })
        raise  # re-raise so callers still see the failure
This middleware wraps every LLM call and logs the core metadata for both successes and failures. Add it once at the call boundary and every request becomes visible.
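The middleware relies on `generate_id` and `calculate_cost`, which are not shown. A possible shape for both is sketched below. The price table is entirely hypothetical (the model name and rates are placeholders), so substitute your provider's published per-token prices.

```python
import uuid
from types import SimpleNamespace

# Hypothetical prices in USD per million tokens; replace with real rates.
PRICES_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def generate_id() -> str:
    """Return a short random request ID such as 'req_3f9a1c0b7d2e'."""
    return "req_" + uuid.uuid4().hex[:12]


def calculate_cost(model, usage) -> float:
    """Cost in USD from token counts; raises KeyError for an unpriced model."""
    price = PRICES_PER_MTOK[model]
    cost = (usage.input_tokens * price["input"]
            + usage.output_tokens * price["output"]) / 1_000_000
    return round(cost, 6)


# A stand-in for the usage object a provider client might return.
usage = SimpleNamespace(input_tokens=1247, output_tokens=389)
cost = calculate_cost("example-model", usage)
```

Failing loudly on an unknown model is a deliberate choice here: a silent zero would hide spend on any model added without a price entry.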
The minimum viable logging
If you do nothing else, log these 5 fields for every LLM call:
- Model β which model responded
- Tokens β input + output count
- Latency β total response time
- Cost β calculated from tokens x price
- Status β success or error type
This takes about 10 minutes to implement and gives you enough data to debug most production issues.