
LLM Alerting in Production β€” What to Alert On and What to Ignore



A β€œ200 OK” response from your LLM doesn’t mean everything is fine. The model could be hallucinating, costs could be spiking, or latency could be degrading. Traditional monitoring misses these AI-specific failure modes.

Here are the alerts that actually matter.

The 5 essential alerts

1. Cost spike alert

The most expensive failure mode. A runaway loop, a prompt injection that triggers expensive reasoning, or a traffic spike can blow your budget in hours.

# Alert when daily cost exceeds 2x the 7-day average
daily_cost = get_today_cost()
avg_cost = get_7_day_average()

if daily_cost > avg_cost * 2:
    alert("cost_spike", f"Daily cost ${daily_cost:.2f} is 2x average ${avg_cost:.2f}")

Threshold: 2x daily average for warning, 3x for critical. Channel: Slack + email (you need to act fast).
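The warning/critical split above can be folded into one check. A minimal sketch, assuming the thresholds from the text; `alert_fn` is a stand-in for whatever notifier you use, not a real API:

```python
def check_cost(daily_cost, avg_cost, alert_fn):
    """Warn at 2x the 7-day average, go critical at 3x (sketch).

    Returns the severity that fired, or None if the cost is normal.
    """
    if avg_cost <= 0:
        return None  # no baseline yet, nothing to compare against
    ratio = daily_cost / avg_cost
    if ratio >= 3:
        severity = "critical"  # Slack + email + phone
    elif ratio >= 2:
        severity = "warning"   # Slack only
    else:
        return None
    alert_fn(
        "cost_spike",
        f"[{severity}] Daily cost ${daily_cost:.2f} is "
        f"{ratio:.1f}x average ${avg_cost:.2f}",
    )
    return severity
```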

2. Error rate alert

LLM API errors (429 rate limits, 500 server errors, timeouts) directly impact users.

# Alert when error rate exceeds 5% over 15 minutes
error_rate = errors_last_15min / total_requests_last_15min

if error_rate > 0.05:
    alert("error_rate", f"Error rate {error_rate:.1%} exceeds 5% threshold")

Threshold: >5% for warning, >15% for critical. Channel: Slack (immediate action needed).

3. Latency degradation alert

Slower LLM responses usually mean one of two things: the provider is overloaded, or your prompts are getting longer.

# Alert when P95 latency exceeds 2x baseline
p95_latency = get_p95_latency_last_hour()
baseline_p95 = get_p95_latency_last_7_days()

if p95_latency > baseline_p95 * 2:
    alert("latency", f"P95 latency {p95_latency:.1f}s is 2x baseline {baseline_p95:.1f}s")

Threshold: 2x baseline P95 for warning. Channel: Slack.

4. Token usage anomaly

Sudden increases in token usage can indicate prompt injection, context-window stuffing, or a bug that’s sending too much data.

# Alert when average tokens per request exceeds 3x normal
avg_tokens = get_avg_tokens_last_hour()
baseline_tokens = get_avg_tokens_last_7_days()

if avg_tokens > baseline_tokens * 3:
    alert("token_anomaly", f"Avg tokens {avg_tokens} is 3x baseline {baseline_tokens}")

Threshold: 3x baseline for warning. Channel: Slack + review logs for prompt injection.

5. Model availability alert

Your LLM provider will go down eventually. You need to know before users report it.

# Ping the API every 5 minutes
try:
    response = call_llm("Say OK", max_tokens=5, timeout=10)
    if "ok" not in response.lower():
        alert("model_health", "Model returned unexpected response")
except Exception as e:
    alert("model_down", f"LLM API unreachable: {e}")

Threshold: 2 consecutive failures. Channel: Slack + email + consider automatic fallback to backup model.
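The two-consecutive-failures rule needs a little state between probes. A sketch, where `probe` and `notify` are stand-ins for your LLM health call and alert channel:

```python
class HealthChecker:
    """Pages only after `threshold` consecutive failed probes (sketch).

    A single success resets the streak, so one transient blip never pages.
    """

    def __init__(self, probe, notify, threshold=2):
        self.probe = probe          # callable: returns True if the model answered
        self.notify = notify        # callable: takes the alert message
        self.threshold = threshold
        self.failures = 0           # current streak of failed probes

    def check(self):
        try:
            ok = self.probe()
        except Exception:
            ok = False              # a raised exception counts as a failure
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.notify(f"{self.failures} consecutive health-check failures")
```

Run `check()` from your existing 5-minute cron or scheduler loop.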

What NOT to alert on

  • Individual slow requests β€” LLMs are inherently variable. Alert on P95, not individual requests.
  • Minor cost fluctuations β€” daily cost varies 20-30% naturally. Only alert on 2x+ spikes.
  • Model version changes β€” track these in logs but don’t alert unless quality drops.
  • Every 429 rate limit β€” occasional rate limits are normal. Alert on sustained error rates.
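Alerting on P95 rather than individual requests means keeping a rolling window of recent latencies. A sketch using the nearest-rank percentile method; the window size is an arbitrary assumption, tune it to your traffic:

```python
import math
from collections import deque

class RollingP95:
    """P95 latency over a rolling window of recent requests (sketch)."""

    def __init__(self, window=500):
        self.samples = deque(maxlen=window)  # oldest samples fall off the left

    def add(self, latency_s):
        self.samples.append(latency_s)

    def p95(self):
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        # nearest-rank method: P95 is the ceil(0.95 * n)-th smallest sample
        idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
        return ordered[idx]
```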

Alert fatigue prevention

The #1 reason alerting fails is too many alerts. Follow these rules:

  1. Maximum 5 alert types β€” the 5 above cover 90% of issues
  2. Escalation tiers β€” warning goes to Slack, critical goes to Slack + email + phone
  3. Cooldown periods β€” don’t re-alert for the same issue within 30 minutes
  4. Weekly review β€” if an alert fires more than 3x/week without action, raise the threshold
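Rule 3’s cooldown is a per-alert-key timestamp map. A sketch, with `send` standing in for your actual Slack or email sender:

```python
import time

class AlertDeduper:
    """Suppress repeat alerts for the same key within `cooldown` seconds (sketch)."""

    def __init__(self, send, cooldown=30 * 60):
        self.send = send            # callable: takes the formatted message
        self.cooldown = cooldown    # seconds; 30 min per the rule above
        self.last_sent = {}         # alert key -> timestamp of last send

    def alert(self, key, message, now=None):
        now = time.time() if now is None else now
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False            # within cooldown: swallow the repeat
        self.last_sent[key] = now
        self.send(f"[{key}] {message}")
        return True
```

The `now` parameter exists so the cooldown logic is testable without real clock time; in production you just call `alert(key, message)`.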

Tools for LLM alerting

| Tool | Best for | Setup effort |
| --- | --- | --- |
| Helicone | Cost + latency alerts via proxy | 5 minutes |
| Grafana + Prometheus | Custom dashboards + alerts | 2-4 hours |
| PagerDuty/OpsGenie | On-call escalation | 1 hour |
| Custom script + Slack webhook | Quick and dirty | 30 minutes |
| UptimeRobot | Health endpoint monitoring | 5 minutes |

For most teams, start with Helicone for LLM-specific alerts and UptimeRobot for uptime monitoring. Add Grafana when you need custom dashboards.
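The β€œcustom script + Slack webhook” option from the table fits in a few lines. A sketch, where the webhook URL is a placeholder you generate in Slack’s Incoming Webhooks settings:

```python
import json
import urllib.request

# Placeholder: replace with the URL Slack gives you for your channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def build_payload(alert_type: str, message: str) -> dict:
    # Slack incoming webhooks accept a JSON body with a "text" field
    return {"text": f":rotating_light: [{alert_type}] {message}"}

def send_slack_alert(alert_type: str, message: str) -> None:
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(build_payload(alert_type, message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```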

See our LLM observability guide for the full monitoring strategy, and our guide on what to log in AI systems for the data foundation.

Related: LLM Observability for Developers Β· What to Log in AI Systems Β· Helicone vs LangSmith vs Langfuse Β· LLM Regression Testing Β· AI App Deployment Checklist