You changed a prompt. Tests pass. Regression tests look good. But production traffic is different from test data. A canary deploy lets you test with real users while limiting blast radius.
What's a canary deploy for LLMs?
Instead of shipping a prompt change to 100% of users at once, you:
- Ship to 5% of traffic
- Monitor quality, latency, and cost
- If metrics are good, increase to 25%, then 50%, then 100%
- If metrics drop, roll back instantly
Implementation
Step 1: Feature flags

    import hashlib

    # Configuration
    CANARY_PERCENTAGE = 5  # Start at 5%

    PROMPT_VERSIONS = {
        "stable": "You are a helpful assistant...",
        "canary": "You are a concise, helpful assistant...",
    }

    def get_prompt_version(user_id):
        # Hash the user ID into a stable bucket in [0, 100) so the same user
        # always gets the same prompt version.
        hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
        if hash_val < CANARY_PERCENTAGE:
            return "canary"
        return "stable"
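Two properties of this hash-based bucketing are worth checking before relying on it: the split should land close to the configured percentage, and raising the percentage should only add users to the canary group, never move anyone back to stable. The sketch below re-implements the same scheme with the percentage as an explicit parameter. The `bucket_for` and `is_canary` helpers and the synthetic user IDs are illustrative assumptions, not part of the code above.

```python
import hashlib

def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, 100), same scheme as get_prompt_version."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100

def is_canary(user_id: str, canary_percentage: int) -> bool:
    return bucket_for(user_id) < canary_percentage

# Synthetic user IDs, for illustration only
users = [f"user-{i}" for i in range(10_000)]

canary_at_5 = {u for u in users if is_canary(u, 5)}
canary_at_25 = {u for u in users if is_canary(u, 25)}

print(f"share at 5%:  {len(canary_at_5) / len(users):.1%}")
print(f"share at 25%: {len(canary_at_25) / len(users):.1%}")
# Everyone who was in the 5% canary is still in it at 25%
print("5% group kept at 25%:", canary_at_5 <= canary_at_25)
```

Because a user's bucket never changes, the same users are first into every canary. If that matters, mixing a per-rollout salt into the hashed string is one way to vary it.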
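Step 2: Route traffic

    import time

    @app.post("/chat")
    async def chat(request: ChatRequest):
        version = get_prompt_version(request.user_id)
        prompt = PROMPT_VERSIONS[version]

        start = time.perf_counter()
        response = await call_llm(
            system_prompt=prompt,
            messages=request.messages,
        )
        latency_ms = (time.perf_counter() - start) * 1000

        # tokens and cost are not computed here; fill them in from the
        # usage data your LLM client returns.

        # Log version for monitoring
        logger.info({
            "event": "llm_call",
            "prompt_version": version,
            # Built-in hash() is salted per process for strings, so log a
            # stable digest of the user ID instead.
            "user_id": hashlib.sha256(request.user_id.encode()).hexdigest()[:16],
            "latency_ms": latency_ms,
            "tokens": tokens,
            "cost": cost,
        })
        return response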
Step 3: Monitor canary vs stable

    def compare_versions(hours=24):
        stable_metrics = get_metrics(version="stable", hours=hours)
        canary_metrics = get_metrics(version="canary", hours=hours)

        print(f"{'Metric':<20} {'Stable':>10} {'Canary':>10} {'Diff':>10}")
        print("-" * 50)
        for metric in ["avg_quality", "p95_latency", "avg_cost", "error_rate"]:
            s = stable_metrics[metric]
            c = canary_metrics[metric]
            # Relative change in percent; guard against a zero baseline
            # (for example, a stable error_rate of 0).
            diff = ((c - s) / s) * 100 if s else float("nan")
            print(f"{metric:<20} {s:>10.3f} {c:>10.3f} {diff:>+9.1f}%")
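`compare_versions` relies on a `get_metrics` helper that is not shown. Here is a minimal sketch of the aggregation it might perform, assuming the logged events are available as a list of dicts. The `quality` and `error` fields, the `summarize` and `percentile` names, and the in-memory `records` list are all assumptions: Step 2 does not log quality or errors, and a real version would query your log store and filter by time window.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(records, version):
    """Aggregate logged calls for one prompt version into the four metrics."""
    rows = [r for r in records if r["prompt_version"] == version]
    if not rows:
        raise ValueError(f"no records for version {version!r}")
    ok = [r for r in rows if not r["error"]]  # quality is scored only on successful calls
    return {
        "avg_quality": sum(r["quality"] for r in ok) / len(ok) if ok else 0.0,
        "p95_latency": percentile([r["latency_ms"] for r in rows], 95),
        "avg_cost": sum(r["cost"] for r in rows) / len(rows),
        "error_rate": sum(1 for r in rows if r["error"]) / len(rows),
    }

# Hand-made records, for illustration only
records = [
    {"prompt_version": "stable", "latency_ms": 800, "cost": 0.002, "quality": 0.9, "error": False},
    {"prompt_version": "stable", "latency_ms": 1200, "cost": 0.004, "quality": 0.7, "error": False},
    {"prompt_version": "canary", "latency_ms": 600, "cost": 0.001, "quality": 0.8, "error": False},
    {"prompt_version": "canary", "latency_ms": 900, "cost": 0.001, "quality": 0.0, "error": True},
]
stable_summary = summarize(records, "stable")
canary_summary = summarize(records, "canary")
print(stable_summary)
print(canary_summary)
```

Where the quality score comes from (user feedback, an automated grader, or something else) is left open here, as it is in the steps above.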
Step 4: Gradual rollout

    # Day 1: 5% canary
    CANARY_PERCENTAGE = 5

    # Day 2: If metrics are good, increase
    # Check: quality within 5%, latency within 20%, cost within 30%
    # (only the quality check is spelled out here)
    if canary_quality >= stable_quality * 0.95:
        CANARY_PERCENTAGE = 25

    # Day 3: Continue if still good (still_good stands for the same checks)
    if still_good:
        CANARY_PERCENTAGE = 50

    # Day 4: Full rollout
    CANARY_PERCENTAGE = 100

    # Promote the canary prompt to stable
    PROMPT_VERSIONS["stable"] = PROMPT_VERSIONS["canary"]
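The Day 2 comment lists three checks, while the snippet only spells out the quality one. Below is a sketch of a gate that applies all three and picks the next step in the 5 → 25 → 50 → 100 ladder, assuming metrics arrive as dicts keyed like the Step 3 output. The `ROLLOUT_STEPS`, `canary_is_healthy`, and `next_percentage` names and the sample numbers are hypothetical.

```python
ROLLOUT_STEPS = [5, 25, 50, 100]  # percentages from the schedule above

def canary_is_healthy(stable, canary):
    """Quality within 5%, latency within 20%, cost within 30% of stable."""
    return (
        canary["avg_quality"] >= stable["avg_quality"] * 0.95
        and canary["p95_latency"] <= stable["p95_latency"] * 1.20
        and canary["avg_cost"] <= stable["avg_cost"] * 1.30
    )

def next_percentage(current, stable, canary):
    """Advance one step if healthy; otherwise hold at the current percentage."""
    if not canary_is_healthy(stable, canary):
        return current
    later = [step for step in ROLLOUT_STEPS if step > current]
    return later[0] if later else current

# Made-up numbers, for illustration only
stable = {"avg_quality": 0.80, "p95_latency": 1200.0, "avg_cost": 0.0040}
good = {"avg_quality": 0.79, "p95_latency": 1300.0, "avg_cost": 0.0045}
slow = {"avg_quality": 0.82, "p95_latency": 1600.0, "avg_cost": 0.0040}

print(next_percentage(5, stable, good))   # advances to the next step
print(next_percentage(25, stable, slow))  # holds: latency is over the 20% budget
```

Holding rather than rolling back on a failed gate is a design choice in this sketch; the rollback thresholds in Step 5 are deliberately looser than the promotion thresholds here.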
Step 5: Automatic rollback

    def check_canary_health():
        """Run every 15 minutes"""
        # Without this declaration, the assignments below would only
        # create a local variable and the rollback would have no effect.
        global CANARY_PERCENTAGE

        stable = get_metrics(version="stable", hours=1)
        canary = get_metrics(version="canary", hours=1)

        # Auto-rollback if quality drops more than 10%
        if canary["avg_quality"] < stable["avg_quality"] * 0.90:
            CANARY_PERCENTAGE = 0
            alert("canary_rollback",
                  f"Quality dropped: {canary['avg_quality']:.2f} vs {stable['avg_quality']:.2f}")
            return

        # Auto-rollback if the error rate more than doubles
        if canary["error_rate"] > stable["error_rate"] * 2:
            CANARY_PERCENTAGE = 0
            alert("canary_rollback", "Error rate more than doubled")
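The docstring says to run the check every 15 minutes but leaves the scheduling open. One option inside an async app is a background loop. The sketch below assumes an asyncio event loop; `run_periodically` and its parameters are hypothetical names, and a cron job or your platform's scheduler would work equally well.

```python
import asyncio

async def run_periodically(check, interval_seconds, max_runs=None):
    """Call check() every interval_seconds; stop after max_runs if given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            check()
        except Exception as exc:
            # A failing health check should not kill the loop
            print(f"health check failed: {exc!r}")
        runs += 1
        await asyncio.sleep(interval_seconds)
    return runs

# Demo with a stand-in check and a zero interval so it finishes immediately.
# In the app this would be check_canary_health with interval_seconds=15 * 60.
calls = []
completed = asyncio.run(run_periodically(lambda: calls.append("checked"), 0, max_runs=3))
print(completed, len(calls))
```

Note that setting a module-level `CANARY_PERCENTAGE` only affects the process that ran the check. With several workers, the value would need to live somewhere shared (a config service or database, for example) for a rollback to take effect everywhere.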
When to use canary deploys
| Change type | Canary needed? | Why |
|---|---|---|
| System prompt rewrite | ✅ Yes | High risk, affects all outputs |
| Adding few-shot examples | ✅ Yes | Can change behavior unpredictably |
| Model swap (Sonnet → DeepSeek) | ✅ Yes | Different model, different behavior |
| Temperature change | ⚠️ Maybe | Low risk but can affect consistency |
| Max tokens change | ❌ No | Predictable effect, just test offline |
| Bug fix in code (not prompt) | ❌ No | Standard deploy process |
Canary deploys vs A/B tests
| | Canary deploy | A/B test |
|---|---|---|
| Goal | Safe rollout | Measure which is better |
| Traffic split | 5% → 100% (gradual) | 50/50 (fixed) |
| Duration | Days | Weeks |
| Rollback | Automatic on failure | Manual after analysis |
| When to use | Shipping a change you believe is better | Comparing two options you're unsure about |
Use A/B tests to decide WHAT to ship. Use canary deploys to ship it SAFELY.
Tools
| Approach | Complexity | Best for |
|---|---|---|
| Custom feature flags (code above) | Low | Small teams, simple apps |
| LaunchDarkly | Medium | Teams with many feature flags |
| Helicone + custom | Low | LLM-specific monitoring |
| Kubernetes canary | High | Infrastructure-level canary |
For most AI apps, the custom approach (50 lines of Python) is sufficient. You don't need a platform until you're running multiple concurrent canaries.