Your AI application worked perfectly last week. This week, users are complaining about worse responses. Nothing changed in your code. What happened?
Model drift. The AI provider updated the model, the distribution of user queries shifted, or your prompts, tuned against one model's behavior, stopped working against another's. Unlike traditional software bugs, model drift is silent: no errors, no crashes, just gradually worse outputs.
## Types of drift

### Provider-side drift

The model itself changed. Anthropic removed version pinning, OpenAI regularly updates models, and even "stable" versions receive silent patches.
Symptoms: sudden quality change across all users, often correlated with provider announcements or status page updates.
### Data drift

Your users' queries changed. The model was fine for the queries you tested with, but real-world usage patterns shifted.
Symptoms: gradual quality decline, concentrated in specific query types or user segments.
### Prompt drift
Your prompts accumulated context or were modified in ways that degraded quality. Common in systems with dynamic prompt construction.
Symptoms: quality decline correlated with prompt length or complexity increases.
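One way to check for that correlation (a sketch; the logged lengths and scores below are hypothetical data your metrics store would supply) is to compute the Pearson correlation between prompt length and quality score: a strongly negative value suggests prompt growth is hurting quality.

```python
import numpy as np

def prompt_length_correlation(prompt_lengths, scores):
    """Pearson correlation between prompt length and quality score.
    A strongly negative value suggests longer prompts degrade quality."""
    return float(np.corrcoef(prompt_lengths, scores)[0, 1])

# Hypothetical logged data: quality falls as prompts grow
lengths = [500, 800, 1200, 2000, 3500]
scores = [4.8, 4.6, 4.1, 3.7, 3.2]

r = prompt_length_correlation(lengths, scores)
if r < -0.7:
    print(f"Prompt length strongly anti-correlated with quality (r={r:.2f})")
```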
## Detection strategies

### Strategy 1: Continuous evaluation sampling
Sample a percentage of production responses and score them automatically:
```python
import random
from datetime import datetime, timezone

SAMPLE_RATE = 0.05  # Score 5% of responses

async def maybe_evaluate(user_message, agent_response):
    if random.random() > SAMPLE_RATE:
        return  # Skip this one
    score = await llm_judge(
        question=user_message,
        response=agent_response,
        criteria="accuracy, helpfulness, completeness",
    )
    await store_metric("response_quality", score, timestamp=datetime.now(timezone.utc))
```
Plot quality scores over time. A downward trend = drift.
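To make "downward trend" concrete, you can fit a least-squares line through the daily averages. A minimal sketch with made-up scores; the -0.05 points/day threshold is an illustrative choice, not a recommendation:

```python
import numpy as np

def quality_trend(daily_scores):
    """Slope of a least-squares line through daily average scores.
    A clearly negative slope sustained over a week suggests drift."""
    days = np.arange(len(daily_scores))
    slope, _intercept = np.polyfit(days, daily_scores, 1)
    return float(slope)

# Hypothetical week of daily average judge scores (1-5 scale)
week = [4.5, 4.4, 4.4, 4.2, 4.1, 3.9, 3.8]

slope = quality_trend(week)
if slope < -0.05:
    print(f"Quality trending down: {slope:.3f} points/day")
```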
### Strategy 2: Reference comparison

Maintain a set of "golden" query-response pairs. Periodically re-run them and compare:
```python
GOLDEN_SET = [
    {
        "query": "How do I add authentication to a Next.js app?",
        "expected_topics": ["NextAuth", "middleware", "session", "JWT"],
        "min_score": 4,
    },
    # ... 50-100 more
]

async def run_golden_set():
    results = []
    for item in GOLDEN_SET:
        response = await call_model(PRODUCTION_MODEL, item["query"])
        score = await evaluate(response, item["expected_topics"])
        results.append({"query": item["query"], "score": score, "min": item["min_score"]})
    failures = [r for r in results if r["score"] < r["min"]]
    avg_score = sum(r["score"] for r in results) / len(results)
    return {"avg_score": avg_score, "failures": len(failures), "total": len(results)}
```
Run this daily via a Claude Code Routine or cron job. Alert when the average score drops or failures increase.
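A sketch of the alerting check that daily run could feed, consuming the dict shape `run_golden_set()` returns; the thresholds are illustrative:

```python
def check_golden_results(results, min_avg=4.0, min_pass_rate=0.9):
    """Decide whether a golden-set run should alert.
    `results` is the dict returned by run_golden_set()."""
    pass_rate = (results["total"] - results["failures"]) / results["total"]
    alerts = []
    if results["avg_score"] < min_avg:
        alerts.append(f"avg score {results['avg_score']:.2f} below {min_avg}")
    if pass_rate < min_pass_rate:
        alerts.append(f"pass rate {pass_rate:.0%} below {min_pass_rate:.0%}")
    return alerts

# Hypothetical run: 100 items, 12 failed, average score 3.8 -> two alerts
alerts = check_golden_results({"avg_score": 3.8, "failures": 12, "total": 100})
```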
### Strategy 3: User feedback signals
Implicit and explicit signals from users:
```python
# Explicit: thumbs up/down
async def record_feedback(session_id, rating):
    await store_metric("user_feedback", rating)

# Implicit: retry detection (user asks the same question again)
async def detect_retry(user_id, message):
    recent = await get_recent_messages(user_id, hours=1)
    for prev in recent:
        if similarity(prev.content, message) > 0.85:
            await store_metric("user_retry", 1)  # User wasn't satisfied
            return True
    return False
```
A spike in retries or negative feedback = drift.
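The `similarity()` helper is assumed to exist; a minimal token-overlap (Jaccard) stand-in looks like this, though production systems usually compare embeddings instead:

```python
import re

def similarity(a, b):
    """Jaccard similarity over lowercased word tokens, in [0, 1].
    A crude stand-in for embedding-based similarity."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Near-identical retries score 1.0; unrelated questions score near 0
similarity("How do I reset my password", "how do i reset my password?")
```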
### Strategy 4: Statistical process control
Apply manufacturing quality control to AI outputs:
```python
import numpy as np

async def check_for_drift(metric_name, window_days=7):
    recent = await get_metrics(metric_name, days=window_days)
    baseline = await get_metrics(metric_name, days=30, offset=window_days)

    recent_mean = np.mean(recent)
    baseline_mean = np.mean(baseline)
    baseline_std = np.std(baseline)

    # Z-score: how many standard deviations from baseline?
    if baseline_std > 0:
        z_score = (recent_mean - baseline_mean) / baseline_std
    else:
        z_score = 0

    if abs(z_score) > 2:
        alert(f"Drift detected in {metric_name}: z-score={z_score:.2f}")
        return True
    return False
```
An absolute z-score above 2 means recent quality is statistically different from the baseline: something changed.
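A worked example on fabricated quality scores (1-5 scale) shows the arithmetic: here the recent window sits several standard deviations below a tight baseline, so it would trigger an alert.

```python
import numpy as np

# Fabricated metric history: stable baseline, then a quality dip
baseline = [4.2, 4.3, 4.1, 4.4, 4.2, 4.3, 4.1, 4.2, 4.3, 4.2]
recent = [3.8, 3.7, 3.9, 3.8, 3.6, 3.8, 3.9]

# Same formula as check_for_drift(): distance in baseline standard deviations
z = (np.mean(recent) - np.mean(baseline)) / np.std(baseline)
if abs(z) > 2:
    print(f"Drift: z-score={z:.2f}")
```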
## Automated response to drift
When drift is detected, respond automatically:
```python
async def handle_drift(metric_name, severity):
    if severity == "critical":  # z-score > 3
        # Immediate fallback to backup model
        await switch_to_fallback_model()
        alert_oncall(f"Critical drift in {metric_name}: switched to fallback model")
    elif severity == "warning":  # z-score > 2
        # Reduce traffic to affected model
        await set_canary_percentage(50)  # Route 50% to backup
        alert_team(f"Quality drift in {metric_name}: canary activated")
    elif severity == "info":  # z-score > 1.5
        # Log and monitor
        log(f"Possible drift in {metric_name}, monitoring")
```
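`handle_drift()` takes a severity label, so something has to map z-scores to labels first. A sketch using the thresholds from the comments above:

```python
def classify_severity(z_score):
    """Map a drift z-score to the severity labels handle_drift() expects.
    Thresholds mirror the comments above; tune them to your risk tolerance."""
    z = abs(z_score)
    if z > 3:
        return "critical"
    if z > 2:
        return "warning"
    if z > 1.5:
        return "info"
    return None  # Within normal variation: no action
```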
## Dashboard essentials
At minimum, track these metrics over time:
| Metric | Source | Alert threshold |
|---|---|---|
| Average quality score | LLM-as-judge sampling | Drop >0.5 points |
| User retry rate | Implicit feedback | Increase >50% |
| Thumbs down rate | Explicit feedback | Increase >100% |
| Golden set pass rate | Daily eval run | Drop below 90% |
| Response latency p95 | Request logs | Increase >50% |
| Token cost per request | Usage tracking | Increase >30% |
Connect these to your observability platform for real-time dashboards and alerting.
## Prevention
The best drift detection is prevention:
- Run regression tests on every model update to catch provider-side drift immediately
- Use feature flags for prompt changes: roll out gradually and measure impact
- Monitor user query distribution to detect data drift early
- Keep prompts simple and focused; complex prompts are more fragile
- Use OpenRouter for multi-provider resilience; don't depend on one provider
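The "monitor user query distribution" item can be made concrete with a population stability index (PSI) over query categories. A sketch; the category names, counts, and the 0.25 threshold are illustrative:

```python
import math

def psi(baseline_counts, recent_counts, eps=1e-6):
    """Population Stability Index between two category distributions.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    total_b = sum(baseline_counts.values())
    total_r = sum(recent_counts.values())
    categories = set(baseline_counts) | set(recent_counts)
    score = 0.0
    for c in categories:
        p = baseline_counts.get(c, 0) / total_b + eps  # eps avoids log(0)
        q = recent_counts.get(c, 0) / total_r + eps
        score += (q - p) * math.log(q / p)
    return score

# Hypothetical query categories: last month vs. this week
baseline = {"howto": 600, "debug": 300, "other": 100}
recent = {"howto": 200, "debug": 600, "other": 200}

if psi(baseline, recent) > 0.25:
    print("Major shift in query mix: investigate for data drift")
```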
Related: LLM Regression Testing · How to Handle AI Model Version Changes · AI Model Rollback Strategies · LLM Feature Flags · LLM Observability · AI Agent Error Handling