May 2, 2026 · 4 min read

What to Log in AI Systems — And What Not To

Your LLM app is in production. Something goes wrong. You check the logs. They say “200 OK.” That’s it. No prompt, no response, no token count, no model version. You’re blind.

Here’s what to log in AI systems — and what to leave out.

What to log (always)

Per-request metadata

{
  "timestamp": "2026-04-13T15:30:00Z",
  "request_id": "req_abc123",
  "model": "claude-sonnet-4.6",
  "model_version": "2026-03-15",
  "input_tokens": 1247,
  "output_tokens": 389,
  "total_tokens": 1636,
  "latency_ms": 2340,
  "ttft_ms": 450,
  "cost_usd": 0.0082,
  "status": "success",
  "user_id": "user_hash_xyz",
  "feature": "code-review",
  "team": "backend"
}

This alone answers most debugging questions (roughly 80%, as a rule of thumb): which model, how many tokens, how long, how much, who triggered it.
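A minimal sketch of how such a record could be assembled in Python. The `LLMCallRecord` class, the `hash_user_id` helper, and the `secret` value are illustrative assumptions, not part of any library. The idea is to derive `total_tokens` rather than store it, and to hash the user identifier before it reaches the log.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


def hash_user_id(user_id: str, secret: bytes) -> str:
    """Keyed hash so raw user IDs never reach the logs (the secret is an assumption)."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


@dataclass
class LLMCallRecord:
    request_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    status: str
    user_id: str
    feature: str

    def to_json(self) -> str:
        record = asdict(self)
        # Derive total_tokens instead of storing it, so it cannot drift.
        record["total_tokens"] = self.input_tokens + self.output_tokens
        record["timestamp"] = datetime.now(timezone.utc).isoformat()
        return json.dumps(record)


record = LLMCallRecord(
    request_id="req_abc123",
    model="claude-sonnet-4.6",
    input_tokens=1247,
    output_tokens=389,
    latency_ms=2340,
    status="success",
    user_id=hash_user_id("alice@example.com", secret=b"rotate-me"),
    feature="code-review",
)
print(record.to_json())
```

A keyed hash still counts as pseudonymous data, so treat the `user_id` field with the same care as other personal data.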

Error details

{
  "status": "error",
  "error_type": "rate_limit",
  "error_message": "429 Too Many Requests",
  "retry_count": 2,
  "fallback_model": "deepseek-chat"
}
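To show where fields like `retry_count` and `fallback_model` could come from, here is a rough sketch of a retry-then-fallback wrapper. `RateLimitError`, `call_with_fallback`, and the model names are hypothetical stand-ins; a real provider SDK raises its own exception types.

```python
import time


class RateLimitError(Exception):
    """Stand-in for a provider's 429 error; real SDKs define their own class."""


def call_with_fallback(call, primary, fallback, max_retries=2, backoff_s=1.0):
    """Try `primary` up to max_retries + 1 times, then `fallback` once.

    Returns (result, log_fields) so the caller can merge the fields into its log record.
    """
    fields = {"status": "success", "retry_count": 0, "fallback_model": None}
    for attempt in range(max_retries + 1):
        try:
            return call(primary), fields
        except RateLimitError as exc:
            fields["error_type"] = "rate_limit"
            fields["error_message"] = str(exc)
            if attempt < max_retries:
                fields["retry_count"] = attempt + 1
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    # Primary model exhausted its retries: record the fallback and try it once.
    fields["fallback_model"] = fallback
    return call(fallback), fields


def fake_call(model):
    # Simulated client: the primary model is always rate limited.
    if model == "primary-model":
        raise RateLimitError("429 Too Many Requests")
    return f"ok from {model}"


result, fields = call_with_fallback(fake_call, "primary-model", "backup-model", backoff_s=0)
print(result, fields)
```

If the fallback call also fails, its exception propagates to the caller, which is where the error record would be written.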

Tool calls (for MCP and agents)

{
  "tools_called": ["read_file", "search_codebase", "write_file"],
  "tool_count": 3,
  "tool_latency_ms": [120, 340, 89]
}
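One possible way to collect these fields is a small context manager around each tool invocation. The `ToolCallLog` class below is an illustrative helper, not an MCP or agent-framework API.

```python
import time
from contextlib import contextmanager


class ToolCallLog:
    """Accumulates tool names and per-call latencies for one request."""

    def __init__(self):
        self.tools_called = []
        self.tool_latency_ms = []

    @contextmanager
    def track(self, tool_name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record even if the tool raises, so failed calls are still visible.
            self.tools_called.append(tool_name)
            self.tool_latency_ms.append(round((time.perf_counter() - start) * 1000))

    def to_fields(self):
        return {
            "tools_called": self.tools_called,
            "tool_count": len(self.tools_called),
            "tool_latency_ms": self.tool_latency_ms,
        }


tools = ToolCallLog()
with tools.track("read_file"):
    pass  # the real tool call would go here
with tools.track("search_codebase"):
    pass
fields = tools.to_fields()
print(fields)
```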

What to log (carefully)

Prompts and responses

The dilemma: You need prompts for debugging. But prompts contain user data.

Solution: Log prompts in a separate, access-controlled store with automatic expiration:

# Main log: metadata only
logger.info({"request_id": req_id, "model": model, "tokens": tokens})

# Prompt store: full content, 30-day retention, restricted access
prompt_store.save(req_id, prompt, response, ttl_days=30)
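The `prompt_store` above is left undefined. As a rough sketch, assuming SQLite as the backing store, it could look like the following. The `PromptStore` class and its method names are hypothetical, and access control (database permissions, encryption at rest) is outside what this example shows.

```python
import sqlite3
import time


class PromptStore:
    """Toy prompt store: one SQLite table with a per-row expiry timestamp."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS prompts ("
            "request_id TEXT PRIMARY KEY, prompt TEXT, response TEXT, expires_at REAL)"
        )

    def save(self, request_id, prompt, response, ttl_days=30, now=None):
        now = time.time() if now is None else now
        expires_at = now + ttl_days * 86400  # seconds per day
        self.db.execute(
            "INSERT OR REPLACE INTO prompts VALUES (?, ?, ?, ?)",
            (request_id, prompt, response, expires_at),
        )
        self.db.commit()

    def get(self, request_id, now=None):
        """Return (prompt, response), or None if missing or expired."""
        now = time.time() if now is None else now
        return self.db.execute(
            "SELECT prompt, response FROM prompts WHERE request_id = ? AND expires_at > ?",
            (request_id, now),
        ).fetchone()

    def purge_expired(self, now=None):
        """Run on a schedule; returns the number of rows deleted."""
        now = time.time() if now is None else now
        cur = self.db.execute("DELETE FROM prompts WHERE expires_at <= ?", (now,))
        self.db.commit()
        return cur.rowcount


store = PromptStore()
store.save("req_abc123", "Review this diff", "Looks good", ttl_days=30, now=0)
```

The `now` parameter exists only to make expiry testable; in production the default wall-clock time would be used.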

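For GDPR: prompts containing PII must be covered by your data retention policy. Consider redacting PII before logging. Regexes like the ones below only catch patterned identifiers such as emails and phone numbers; names and addresses need a dedicated PII detector: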

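import re

def redact_pii(text):
    # Email addresses: local part, "@", domain with at least one dot
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
    # 10-digit phone numbers in US-style grouping (e.g. 555-123-4567)
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    return text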

Evaluation scores

If using LLM-as-judge:

{
  "eval_score": 4.2,
  "eval_model": "claude-opus-4.6",
  "eval_criteria": "helpfulness"
}

What NOT to log

API keys and tokens — Never log authentication credentials. Seems obvious but it happens.

Full model weights or configs — Log the model name and version, not internal configuration.

Raw user PII — Redact before logging. Names, emails, addresses should never appear in plain text in logs.

Every intermediate step in long chains — For RAG or agent workflows, log the start, end, and any errors. Don’t log every retrieval result or intermediate reasoning step unless debugging.
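As a safety net for the credentials rule above, a `logging.Filter` can scrub anything that looks like a secret before a record is emitted. This is only a sketch: the two patterns are illustrative guesses at common key formats, not an exhaustive list, and `SecretScrubFilter` is a name made up for this example.

```python
import logging
import re

# Illustrative patterns only; adjust to the key formats your providers issue.
SECRET_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{8,}"),
    re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]{8,}"),
]


def scrub_secrets(text: str) -> str:
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


class SecretScrubFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message first so secrets passed as args are caught too.
        record.msg = scrub_secrets(record.getMessage())
        record.args = None
        return True  # never drop the record, only rewrite it


logger = logging.getLogger("llm.scrubbed")
logger.addFilter(SecretScrubFilter())
```

A filter like this is a backstop, not a substitute for keeping credentials out of log calls in the first place.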

Log structure

Use structured logging (JSON), not free-text:

# Bad
logger.info(f"Called {model} with {tokens} tokens in {latency}ms")

# Good
logger.info({
    "event": "llm_call",
    "model": model,
    "tokens": tokens,
    "latency_ms": latency
})

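Structured logs are searchable, filterable, and can feed into dashboards. Note that Python's standard logging module renders a dict passed this way with str(), so attach a JSON formatter or use a structured logging library to get real JSON output.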
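One way to emit real JSON from the standard `logging` module is a small custom formatter. The `JsonFormatter` class below is a sketch written for this post, not a standard-library class. It relies only on documented `logging` behavior: the dict passed to `logger.info` arrives as `record.msg`.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # If the caller passed a dict, use it as the payload; otherwise wrap the text.
        if isinstance(record.msg, dict):
            payload = dict(record.msg)
        else:
            payload = {"message": record.getMessage()}
        payload.setdefault("level", record.levelname)
        payload.setdefault("logger", record.name)
        return json.dumps(payload, default=str)


stream = io.StringIO()  # stand-in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("llm.json")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output via the root logger

logger.info({"event": "llm_call", "model": "claude-sonnet-4.6", "tokens": 1636, "latency_ms": 2340})
print(stream.getvalue())
```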

Where to send logs

Option	Best for
Helicone	Automatic via proxy, best cost dashboards
Langfuse	Self-hostable, so log data can stay in your own infrastructure (helps with GDPR)
Your existing stack (Datadog, Grafana)	If you already have observability
Custom (PostgreSQL + Grafana)	Full control, cheapest

For most teams, start with Helicone (1-line proxy setup) for AI-specific logs alongside your existing application logging.

Log retention and compliance

How long should you keep AI logs?

Regulation	Requirement
GDPR	Delete personal data when no longer needed. Anonymized logs can be kept indefinitely.
EU AI Act	High-risk systems: must support automatic event logging over the system's lifetime; keep the logs for at least six months unless other law requires longer
SOC 2	Typically 1 year minimum retention
Internal debugging	30-90 days is usually sufficient

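Practical approach: Keep metadata logs (model, tokens, cost, latency) indefinitely, provided they carry no personal data; a hashed user ID is still pseudonymous personal data under GDPR, so drop or aggregate it for long-term storage. Keep full prompt/response logs for 30-90 days with automatic expiration. Redact PII before logging.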

Setting up alerts

Don’t just log — alert on anomalies:

# Alert on cost spikes
if request_cost > avg_cost * 3:
    send_alert(f"Cost spike: ${request_cost:.4f} (3x average)")

# Alert on latency spikes
if latency_ms > p95_latency * 2:
    send_alert(f"Latency spike: {latency_ms}ms (2x P95)")

# Alert on error rate
if hourly_error_rate > 0.05:
    send_alert(f"Error rate: {hourly_error_rate:.1%} (>5%)")

These three alerts catch the large majority of production issues, often before users report them.
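The baselines used in those checks (`avg_cost`, `p95_latency`, `hourly_error_rate`) have to come from somewhere. A minimal sketch, assuming an in-memory rolling window is good enough, is shown here. `RollingStats` and `check_request` are hypothetical names, and a real deployment would more likely compute these in its metrics backend.

```python
import statistics
from collections import deque


class RollingStats:
    """Keeps the last `window` requests and derives the alert baselines."""

    def __init__(self, window=1000):
        self.costs = deque(maxlen=window)
        self.latencies = deque(maxlen=window)
        self.errors = deque(maxlen=window)  # 1 for error, 0 for success

    def record(self, cost_usd, latency_ms, is_error):
        self.costs.append(cost_usd)
        self.latencies.append(latency_ms)
        self.errors.append(1 if is_error else 0)

    def avg_cost(self):
        return statistics.fmean(self.costs)

    def p95_latency(self):
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        # Needs at least two recorded requests.
        return statistics.quantiles(self.latencies, n=20)[-1]

    def error_rate(self):
        return sum(self.errors) / len(self.errors)


def check_request(stats, cost_usd, latency_ms):
    """Compare one request against the baselines before recording it."""
    alerts = []
    if cost_usd > stats.avg_cost() * 3:
        alerts.append("cost_spike")
    if latency_ms > stats.p95_latency() * 2:
        alerts.append("latency_spike")
    return alerts


stats = RollingStats()
for i in range(100):
    stats.record(cost_usd=0.01, latency_ms=1000 + i, is_error=(i % 50 == 0))

print(check_request(stats, cost_usd=0.05, latency_ms=1200))
```

This window is request-based rather than time-based, so the error rate here covers the last N requests, not strictly the last hour.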

Example: complete logging middleware

import logging
import time
import uuid

logger = logging.getLogger("llm")

# call_llm() and calculate_cost() stand in for your own client wrapper and
# pricing lookup; define them before using this middleware.
async def llm_middleware(model, messages, **kwargs):
    start = time.perf_counter()  # monotonic clock, unaffected by system clock changes
    request_id = f"req_{uuid.uuid4().hex[:12]}"

    try:
        response = await call_llm(model, messages, **kwargs)
        latency = (time.perf_counter() - start) * 1000

        logger.info({
            "event": "llm_call",
            "request_id": request_id,
            "model": model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "latency_ms": round(latency),
            "cost_usd": calculate_cost(model, response.usage),
            "status": "success",
        })
        return response

    except Exception as e:
        latency = (time.perf_counter() - start) * 1000
        logger.error({
            "event": "llm_call",
            "request_id": request_id,
            "model": model,
            "latency_ms": round(latency),
            "status": "error",
            "error_type": type(e).__name__,
            "error_message": str(e),
        })
        raise  # re-raise so callers still see the failure

This middleware wraps every LLM call and logs the core metadata for it, on success and on failure. Add it once, get visibility forever.

The minimum viable logging

If you do nothing else, log these 5 fields for every LLM call:

Model — which model responded
Tokens — input + output count
Latency — total response time
Cost — calculated from tokens x price
Status — success or error type

This takes 10 minutes to implement and gives you enough data to debug 80% of production issues.
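The cost field is the only one of the five that needs a calculation. A minimal sketch, assuming per-million-token pricing: the model name and prices in the table are made up for illustration, and `usage` is assumed to expose `input_tokens` and `output_tokens` attributes, as in the middleware example.

```python
from types import SimpleNamespace

# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICES_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def calculate_cost(model: str, usage) -> float:
    """Cost in USD for one call; raises KeyError for models missing from the table."""
    price = PRICES_PER_MTOK[model]
    cost = (
        usage.input_tokens * price["input"] + usage.output_tokens * price["output"]
    ) / 1_000_000
    return round(cost, 6)


usage = SimpleNamespace(input_tokens=1247, output_tokens=389)
cost = calculate_cost("example-model", usage)
print(cost)
```

Failing loudly on an unknown model is a deliberate choice here: a silent zero would hide spend on newly added models.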