Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
A "200 OK" response from your LLM doesn't mean everything is fine. The model could be hallucinating, costs could be spiking, or latency could be degrading. Traditional monitoring misses these AI-specific failure modes.
Here are the alerts that actually matter.
The 5 essential alerts
1. Cost spike alert
The most expensive failure mode. A runaway loop, a prompt injection that triggers expensive reasoning, or a traffic spike can blow your budget in hours.
# Alert when daily cost exceeds 2x the 7-day average
daily_cost = get_today_cost()
avg_cost = get_7_day_average()
if daily_cost > avg_cost * 2:
    alert("cost_spike", f"Daily cost ${daily_cost:.2f} is 2x average ${avg_cost:.2f}")
Threshold: 2x daily average for warning, 3x for critical. Channel: Slack + email (you need to act fast).
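If you want the warning/critical split in code, one way is to reuse the same placeholder helpers with separate alert names so each tier can route to its own channel (a sketch, not a definitive implementation):

# Tiered cost-spike check: 2x average = warning, 3x = critical.
# get_today_cost, get_7_day_average, and alert are the same placeholder helpers as above.
def check_cost_spike():
    daily_cost = get_today_cost()
    avg_cost = get_7_day_average()
    if avg_cost <= 0:
        return  # not enough history to compare against yet
    if daily_cost > avg_cost * 3:
        alert("cost_spike_critical", f"Daily cost ${daily_cost:.2f} is 3x+ average ${avg_cost:.2f}")
    elif daily_cost > avg_cost * 2:
        alert("cost_spike", f"Daily cost ${daily_cost:.2f} is 2x+ average ${avg_cost:.2f}")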
2. Error rate alert
LLM API errors (429 rate limits, 500 server errors, timeouts) directly impact users.
# Alert when error rate exceeds 5% over 15 minutes
error_rate = errors_last_15min / total_requests_last_15min
if error_rate > 0.05:
    alert("error_rate", f"Error rate {error_rate:.1%} exceeds 5% threshold")
Threshold: >5% for warning, >15% for critical. Channel: Slack (immediate action needed).
3. Latency degradation alert
Slower LLM responses mean either the provider is overloaded or your prompts are getting longer.
# Alert when P95 latency exceeds 2x baseline
p95_latency = get_p95_latency_last_hour()
baseline_p95 = get_p95_latency_last_7_days()
if p95_latency > baseline_p95 * 2:
    alert("latency", f"P95 latency {p95_latency:.1f}s is 2x baseline {baseline_p95:.1f}s")
Threshold: 2x baseline P95 for warning. Channel: Slack.
4. Token usage anomaly
Sudden increases in token usage indicate prompt injection, context window stuffing, or a bug that's sending too much data.
# Alert when average tokens per request exceeds 3x normal
avg_tokens = get_avg_tokens_last_hour()
baseline_tokens = get_avg_tokens_last_7_days()
if avg_tokens > baseline_tokens * 3:
    alert("token_anomaly", f"Avg tokens {avg_tokens} is 3x baseline {baseline_tokens}")
Threshold: 3x baseline for warning. Channel: Slack + review logs for prompt injection.
5. Model availability alert
Your LLM provider goes down. You need to know before users report it.
# Ping the API every 5 minutes
try:
    response = call_llm("Say OK", max_tokens=5, timeout=10)
    if "ok" not in response.lower():
        alert("model_health", "Model returned unexpected response")
except Exception as e:
    alert("model_down", f"LLM API unreachable: {e}")
Threshold: 2 consecutive failures. Channel: Slack + email + consider automatic fallback to backup model.
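A minimal sketch of the consecutive-failure logic, assuming the call_llm and alert helpers above plus a hypothetical USE_BACKUP_MODEL flag that your request path checks:

# Health check: alert and flip a fallback flag after 2 consecutive failures.
consecutive_failures = 0
USE_BACKUP_MODEL = False

def run_health_check():
    global consecutive_failures, USE_BACKUP_MODEL
    try:
        response = call_llm("Say OK", max_tokens=5, timeout=10)
        healthy = "ok" in response.lower()
    except Exception:
        healthy = False

    if healthy:
        consecutive_failures = 0
        USE_BACKUP_MODEL = False
        return

    consecutive_failures += 1
    if consecutive_failures >= 2:
        alert("model_down", f"{consecutive_failures} consecutive health-check failures")
        USE_BACKUP_MODEL = True  # request path routes to the backup model

Call run_health_check() from a scheduler (cron, Celery beat, or similar) every 5 minutes.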
What NOT to alert on
- Individual slow requests: LLMs are inherently variable. Alert on P95, not individual requests.
- Minor cost fluctuations: daily cost varies 20-30% naturally. Only alert on 2x+ spikes.
- Model version changes: track these in logs but don't alert unless quality drops.
- Every 429 rate limit: occasional rate limits are normal. Alert on sustained error rates.
Alert fatigue prevention
The #1 reason alerting fails is too many alerts. Follow these rules:
- Maximum 5 alert types: the 5 above cover 90% of issues
- Escalation tiers: warning goes to Slack, critical goes to Slack + email + phone
- Cooldown periods: don't re-alert for the same issue within 30 minutes (see the sketch after this list)
- Weekly review: if an alert fires more than 3x/week without action, raise the threshold
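The cooldown rule is easiest to enforce inside the alert function itself. A minimal sketch, assuming a single-process monitor and a hypothetical send_to_slack delivery helper (sketched in the tools section below):

# Suppress repeat alerts for the same issue within a 30-minute cooldown window.
import time

COOLDOWN_SECONDS = 30 * 60
_last_sent = {}  # alert name -> timestamp of last delivery

def alert(name, message):
    now = time.time()
    if now - _last_sent.get(name, 0) < COOLDOWN_SECONDS:
        return  # still in cooldown; drop the duplicate
    _last_sent[name] = now
    send_to_slack(f"[{name}] {message}")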
Tools for LLM alerting
| Tool | Best for | Setup effort |
|---|---|---|
| Helicone | Cost + latency alerts via proxy | 5 minutes |
| Grafana + Prometheus | Custom dashboards + alerts | 2-4 hours |
| PagerDuty/OpsGenie | On-call escalation | 1 hour |
| Custom script + Slack webhook | Quick and dirty | 30 minutes |
| UptimeRobot | Health endpoint monitoring | 5 minutes |
For most teams, start with Helicone for LLM-specific alerts and UptimeRobot for uptime monitoring. Add Grafana when you need custom dashboards.
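If you go the custom script + Slack webhook route from the table, the delivery side is a single webhook POST. A minimal sketch of the send_to_slack helper referenced earlier, using the requests library and an incoming-webhook URL you create in Slack:

# Post an alert message to a Slack channel via an incoming webhook.
import os
import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # set to your webhook URL

def send_to_slack(message):
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    resp.raise_for_status()

# Example: send_to_slack("[cost_spike] Daily cost $42.10 is 2.3x average $18.30")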
See our LLM observability guide for the full monitoring strategy, and our guide on what to log in AI systems for the data foundation.
Related: LLM Observability for Developers · What to Log in AI Systems · Helicone vs LangSmith vs Langfuse · LLM Regression Testing · AI App Deployment Checklist