
LLM Alerting in Production β€” What to Alert On and What to Ignore



A β€œ200 OK” response from your LLM doesn’t mean everything is fine. The model could be hallucinating, costs could be spiking, or latency could be degrading. Traditional monitoring misses these AI-specific failure modes.

Here are the alerts that actually matter.

The 5 essential alerts

1. Cost spike alert

The most expensive failure mode. A runaway loop, a prompt injection that triggers expensive reasoning, or a traffic spike can blow your budget in hours.

# Alert when daily cost exceeds 2x the 7-day average
daily_cost = get_today_cost()
avg_cost = get_7_day_average()

if daily_cost > avg_cost * 2:
    alert("cost_spike", f"Daily cost ${daily_cost:.2f} is 2x average ${avg_cost:.2f}")

Threshold: 2x daily average for warning, 3x for critical. Channel: Slack + email (you need to act fast).
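The warning/critical split above can be folded into one check. A minimal sketch, assuming the thresholds from the text; `alert_fn` is a stand-in for whatever notifier you use, not a real API:

```python
def check_cost(daily_cost, avg_cost, alert_fn):
    """Warn at 2x the 7-day average, go critical at 3x (sketch).

    Returns the severity that fired, or None if the cost is normal.
    """
    if avg_cost <= 0:
        return None  # no baseline yet, nothing to compare against
    ratio = daily_cost / avg_cost
    if ratio >= 3:
        severity = "critical"  # Slack + email + phone
    elif ratio >= 2:
        severity = "warning"   # Slack only
    else:
        return None
    alert_fn(
        "cost_spike",
        f"[{severity}] Daily cost ${daily_cost:.2f} is "
        f"{ratio:.1f}x average ${avg_cost:.2f}",
    )
    return severity
```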

2. Error rate alert

LLM API errors (429 rate limits, 500 server errors, timeouts) directly impact users.

# Alert when error rate exceeds 5% over 15 minutes
error_rate = errors_last_15min / total_requests_last_15min

if error_rate > 0.05:
    alert("error_rate", f"Error rate {error_rate:.1%} exceeds 5% threshold")

Threshold: >5% for warning, >15% for critical. Channel: Slack (immediate action needed).

3. Latency degradation alert

Slower LLM responses usually mean one of two things: the provider is overloaded, or your prompts are getting longer.

# Alert when P95 latency exceeds 2x baseline
p95_latency = get_p95_latency_last_hour()
baseline_p95 = get_p95_latency_last_7_days()

if p95_latency > baseline_p95 * 2:
    alert("latency", f"P95 latency {p95_latency:.1f}s is 2x baseline {baseline_p95:.1f}s")

Threshold: 2x baseline P95 for warning. Channel: Slack.

4. Token usage anomaly

Sudden increases in token usage can indicate prompt injection, context-window stuffing, or a bug that’s sending too much data.

# Alert when average tokens per request exceeds 3x normal
avg_tokens = get_avg_tokens_last_hour()
baseline_tokens = get_avg_tokens_last_7_days()

if avg_tokens > baseline_tokens * 3:
    alert("token_anomaly", f"Avg tokens {avg_tokens} is 3x baseline {baseline_tokens}")

Threshold: 3x baseline for warning. Channel: Slack + review logs for prompt injection.

5. Model availability alert

Your LLM provider will go down eventually. You need to know before users report it.

# Ping the API every 5 minutes
try:
    response = call_llm("Say OK", max_tokens=5, timeout=10)
    if "ok" not in response.lower():
        alert("model_health", "Model returned unexpected response")
except Exception as e:
    alert("model_down", f"LLM API unreachable: {e}")

Threshold: 2 consecutive failures. Channel: Slack + email + consider automatic fallback to backup model.
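The two-consecutive-failures rule needs a little state between probes. A sketch, where `probe` and `notify` are stand-ins for your LLM health call and alert channel:

```python
class HealthChecker:
    """Pages only after `threshold` consecutive failed probes (sketch).

    A single success resets the streak, so one transient blip never pages.
    """

    def __init__(self, probe, notify, threshold=2):
        self.probe = probe          # callable: returns True if the model answered
        self.notify = notify        # callable: takes the alert message
        self.threshold = threshold
        self.failures = 0           # current streak of failed probes

    def check(self):
        try:
            ok = self.probe()
        except Exception:
            ok = False              # a raised exception counts as a failure
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.notify(f"{self.failures} consecutive health-check failures")
```

Run `check()` from your existing 5-minute cron or scheduler loop.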

What NOT to alert on

  • Individual slow requests β€” LLMs are inherently variable. Alert on P95, not individual requests.
  • Minor cost fluctuations β€” daily cost varies 20-30% naturally. Only alert on 2x+ spikes.
  • Model version changes β€” track these in logs but don’t alert unless quality drops.
  • Every 429 rate limit β€” occasional rate limits are normal. Alert on sustained error rates.
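Alerting on P95 rather than individual requests means keeping a rolling window of recent latencies. A sketch using the nearest-rank percentile method; the window size is an arbitrary assumption, tune it to your traffic:

```python
import math
from collections import deque

class RollingP95:
    """P95 latency over a rolling window of recent requests (sketch)."""

    def __init__(self, window=500):
        self.samples = deque(maxlen=window)  # oldest samples fall off the left

    def add(self, latency_s):
        self.samples.append(latency_s)

    def p95(self):
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        # nearest-rank method: P95 is the ceil(0.95 * n)-th smallest sample
        idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
        return ordered[idx]
```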

Alert fatigue prevention

The #1 reason alerting fails is too many alerts. Follow these rules:

  1. Maximum 5 alert types β€” the 5 above cover 90% of issues
  2. Escalation tiers β€” warning goes to Slack, critical goes to Slack + email + phone
  3. Cooldown periods β€” don’t re-alert for the same issue within 30 minutes
  4. Weekly review β€” if an alert fires more than 3x/week without action, raise the threshold
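Rule 3’s cooldown is a per-alert-key timestamp map. A sketch, with `send` standing in for your actual Slack or email sender:

```python
import time

class AlertDeduper:
    """Suppress repeat alerts for the same key within `cooldown` seconds (sketch)."""

    def __init__(self, send, cooldown=30 * 60):
        self.send = send            # callable: takes the formatted message
        self.cooldown = cooldown    # seconds; 30 min per the rule above
        self.last_sent = {}         # alert key -> timestamp of last send

    def alert(self, key, message, now=None):
        now = time.time() if now is None else now
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False            # within cooldown: swallow the repeat
        self.last_sent[key] = now
        self.send(f"[{key}] {message}")
        return True
```

The `now` parameter exists so the cooldown logic is testable without real clock time; in production you just call `alert(key, message)`.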

Tools for LLM alerting

| Tool | Best for | Setup effort |
| --- | --- | --- |
| Helicone | Cost + latency alerts via proxy | 5 minutes |
| Grafana + Prometheus | Custom dashboards + alerts | 2-4 hours |
| PagerDuty/OpsGenie | On-call escalation | 1 hour |
| Custom script + Slack webhook | Quick and dirty | 30 minutes |
| UptimeRobot | Health endpoint monitoring | 5 minutes |

For most teams, start with Helicone for LLM-specific alerts and UptimeRobot for uptime monitoring. Add Grafana when you need custom dashboards.
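The β€œcustom script + Slack webhook” option from the table fits in a few lines. A sketch, where the webhook URL is a placeholder you generate in Slack’s Incoming Webhooks settings:

```python
import json
import urllib.request

# Placeholder: replace with the URL Slack gives you for your channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def build_payload(alert_type: str, message: str) -> dict:
    # Slack incoming webhooks accept a JSON body with a "text" field
    return {"text": f":rotating_light: [{alert_type}] {message}"}

def send_slack_alert(alert_type: str, message: str) -> None:
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(build_payload(alert_type, message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```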

See our LLM observability guide for the full monitoring strategy, and our guide on what to log in AI systems for the data foundation.

Related: LLM Observability for Developers Β· What to Log in AI Systems Β· Helicone vs LangSmith vs Langfuse Β· LLM Regression Testing Β· AI App Deployment Checklist