
Canary Deploys for LLM Features: Ship Prompt Changes Safely


You changed a prompt. Tests pass. Regression tests look good. But production traffic is different from test data. A canary deploy lets you test with real users while limiting blast radius.

What's a canary deploy for LLMs

Instead of shipping a prompt change to 100% of users at once, you:

  1. Ship to 5% of traffic
  2. Monitor quality, latency, and cost
  3. If metrics are good, increase to 25%, then 50%, then 100%
  4. If metrics drop, roll back instantly

Implementation

Step 1: Feature flags

import hashlib

# Configuration
CANARY_PERCENTAGE = 5  # Start at 5%
PROMPT_VERSIONS = {
    "stable": "You are a helpful assistant...",
    "canary": "You are a concise, helpful assistant...",
}

def get_prompt_version(user_id):
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    if hash_val < CANARY_PERCENTAGE:
        return "canary"
    return "stable"
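As a quick sanity check on the bucketing function above, the sketch below restates it with the percentage passed as a parameter (an assumption made here so the check is self-contained; the original reads the module-level CANARY_PERCENTAGE). It checks two properties worth having in a canary: assignment is deterministic per user, and a user who is in the canary at 5% stays in it at 25%. The user IDs are synthetic.

```python
import hashlib

def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, 100)."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100

def get_prompt_version(user_id: str, canary_percentage: int) -> str:
    """Return "canary" for users whose bucket falls below the percentage."""
    return "canary" if bucket(user_id) < canary_percentage else "stable"

# Synthetic user IDs, for illustration only
users = [f"user-{i}" for i in range(10_000)]

at_5 = {u for u in users if get_prompt_version(u, 5) == "canary"}
at_25 = {u for u in users if get_prompt_version(u, 25) == "canary"}

# The same input always gives the same answer
assert get_prompt_version("user-42", 5) == get_prompt_version("user-42", 5)
# Raising the percentage only adds users; nobody flips back to stable
assert at_5 <= at_25

print(f"canary share at 5%: {len(at_5) / len(users):.1%}")
print(f"canary share at 25%: {len(at_25) / len(users):.1%}")
```

Because the bucket depends only on the user ID, each user sees a consistent prompt version across requests, and widening the rollout never moves an existing canary user back to stable.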

Step 2: Route traffic

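import hashlib
import time

@app.post("/chat")
async def chat(request: ChatRequest):
    version = get_prompt_version(request.user_id)
    prompt = PROMPT_VERSIONS[version]
    
    # Time the LLM call so latency can be compared per version
    start = time.perf_counter()
    response = await call_llm(
        system_prompt=prompt,
        messages=request.messages,
    )
    latency_ms = (time.perf_counter() - start) * 1000
    
    # Log version for monitoring. Python's built-in hash() is salted per
    # process for strings, so use a stable digest to pseudonymize the user ID.
    logger.info({
        "event": "llm_call",
        "prompt_version": version,
        "user_id": hashlib.sha256(request.user_id.encode()).hexdigest()[:16],
        "latency_ms": latency_ms,
        "tokens": response.tokens,  # assumes call_llm returns token usage
        "cost": response.cost,      # assumes call_llm returns the call's cost
    })
    
    return response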

Step 3: Monitor canary vs stable

def compare_versions(hours=24):
    stable_metrics = get_metrics(version="stable", hours=hours)
    canary_metrics = get_metrics(version="canary", hours=hours)
    
    print(f"{'Metric':<20} {'Stable':>10} {'Canary':>10} {'Diff':>10}")
    print("-" * 53)
    
    for metric in ["avg_quality", "p95_latency", "avg_cost", "error_rate"]:
        s = stable_metrics[metric]
        c = canary_metrics[metric]
        if s == 0:
            # Relative change is undefined when the stable value is zero
            print(f"{metric:<20} {s:>10.3f} {c:>10.3f} {'n/a':>10}")
            continue
        diff = ((c - s) / s) * 100
        print(f"{metric:<20} {s:>10.3f} {c:>10.3f} {diff:>+9.1f}%")
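The comparison above relies on a get_metrics helper that the article does not show. One possible shape for its aggregation step is sketched below, assuming the per-call log records have already been filtered by version and time window. The `quality` and `error` fields are assumptions: they would have to be added to the log line in step 2, for example from an automated evaluator and an exception handler. The sample records are made up.

```python
from statistics import mean

def summarize(records: list[dict]) -> dict:
    """Aggregate per-call log records into the four metrics compared above."""
    if not records:
        raise ValueError("no records to summarize")
    latencies = sorted(r["latency_ms"] for r in records)
    # Nearest-rank p95, computed with integer arithmetic: ceil(0.95 * n) - 1
    p95_index = (95 * len(latencies) + 99) // 100 - 1
    return {
        "avg_quality": mean(r["quality"] for r in records),
        "p95_latency": latencies[p95_index],
        "avg_cost": mean(r["cost"] for r in records),
        "error_rate": sum(1 for r in records if r["error"]) / len(records),
    }

# Hypothetical records: 19 normal calls plus one slow, failed call
records = [
    {"quality": 0.8, "latency_ms": 100.0 + i, "cost": 0.002, "error": False}
    for i in range(19)
]
records.append({"quality": 0.0, "latency_ms": 900.0, "cost": 0.002, "error": True})

metrics = summarize(records)
for name, value in metrics.items():
    print(f"{name:<12} {value:.3f}")
```

At 5% of traffic the canary sample can be small, so a single slow or failed call moves these numbers noticeably; it helps to look at the record count next to the metrics before acting on them.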

Step 4: Gradual rollout

# Day 1: 5% canary
CANARY_PERCENTAGE = 5

# Day 2: If metrics are good, increase
# Check: quality within 5%, latency within 20%, cost within 30%
if canary_quality >= stable_quality * 0.95:
    CANARY_PERCENTAGE = 25

# Day 3: Continue if still good
if still_good:
    CANARY_PERCENTAGE = 50

# Day 4: Full rollout
CANARY_PERCENTAGE = 100
# Rename canary to stable
PROMPT_VERSIONS["stable"] = PROMPT_VERSIONS["canary"]
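The schedule above mentions three checks (quality within 5%, latency within 20%, cost within 30%) but only spells out the quality one. A minimal sketch of a gate that applies all three and advances one stage at a time might look like this. The function names, the ROLLOUT_STAGES list, and the sample numbers are illustrative assumptions, not part of the original code.

```python
ROLLOUT_STAGES = [5, 25, 50, 100]

def metrics_within_limits(stable: dict, canary: dict) -> bool:
    """Quality within 5%, p95 latency within 20%, cost within 30% of stable."""
    return (
        canary["avg_quality"] >= stable["avg_quality"] * 0.95
        and canary["p95_latency"] <= stable["p95_latency"] * 1.20
        and canary["avg_cost"] <= stable["avg_cost"] * 1.30
    )

def next_percentage(current: int, stable: dict, canary: dict) -> int:
    """Advance one stage if the canary is healthy, otherwise hold."""
    if not metrics_within_limits(stable, canary):
        return current
    later = [p for p in ROLLOUT_STAGES if p > current]
    return later[0] if later else current

# Hypothetical metric snapshots
stable = {"avg_quality": 0.80, "p95_latency": 1000.0, "avg_cost": 0.010}
healthy = {"avg_quality": 0.79, "p95_latency": 1100.0, "avg_cost": 0.011}
slow = {"avg_quality": 0.82, "p95_latency": 1500.0, "avg_cost": 0.010}

print(next_percentage(5, stable, healthy))  # healthy canary: move up a stage
print(next_percentage(5, stable, slow))     # latency over the limit: hold
```

Holding rather than rolling back on a failed gate keeps this function separate from the automatic rollback in step 5, which uses stricter thresholds.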

Step 5: Automatic rollback

def check_canary_health():
    """Run every 15 minutes"""
    # Without this declaration, the assignments below would only create a
    # local variable and the rollback would never take effect.
    global CANARY_PERCENTAGE
    
    stable = get_metrics("stable", hours=1)
    canary = get_metrics("canary", hours=1)
    
    # Auto-rollback if quality drops more than 10%
    if canary["avg_quality"] < stable["avg_quality"] * 0.90:
        CANARY_PERCENTAGE = 0
        alert("canary_rollback", 
            f"Quality dropped: {canary['avg_quality']:.2f} vs {stable['avg_quality']:.2f}")
        return
    
    # Auto-rollback if error rate more than doubles
    if canary["error_rate"] > stable["error_rate"] * 2:
        CANARY_PERCENTAGE = 0
        alert("canary_rollback",
            f"Error rate more than doubled: {canary['error_rate']:.3f} vs {stable['error_rate']:.3f}")
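A module-level variable lives inside one process, so a rollback set by the health check will not reach other workers or servers. One simple way around that, sketched below under the assumption that all workers share a filesystem, is to keep the percentage in a small JSON file that the request path reads and the health check writes. The file name and both helper functions are hypothetical; a shared cache or a feature-flag service would fill the same role.

```python
import json
import os
import tempfile
from pathlib import Path

def write_percentage(path: Path, percentage: int) -> None:
    """Atomically replace the flag file so readers never see a partial write."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as tmp:
        json.dump({"canary_percentage": percentage}, tmp)
    os.replace(tmp_name, path)

def read_percentage(path: Path, default: int = 0) -> int:
    """Read the current percentage; fall back to stable-only if unreadable."""
    try:
        return int(json.loads(path.read_text())["canary_percentage"])
    except (OSError, ValueError, KeyError, TypeError):
        return default

# Demo in a temporary directory
flag_file = Path(tempfile.mkdtemp()) / "canary.json"

write_percentage(flag_file, 5)
started_at = read_percentage(flag_file)

write_percentage(flag_file, 0)  # what the health check would do on rollback
after_rollback = read_percentage(flag_file)

print(started_at, after_rollback)
```

Defaulting to 0 when the file is missing or corrupt means a broken flag store sends everyone to the stable prompt, which is the safe direction to fail.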

When to use canary deploys

| Change type | Canary needed? | Why |
|---|---|---|
| System prompt rewrite | ✅ Yes | High risk, affects all outputs |
| Adding few-shot examples | ✅ Yes | Can change behavior unpredictably |
| Model swap (Sonnet → DeepSeek) | ✅ Yes | Different model, different behavior |
| Temperature change | ⚠️ Maybe | Low risk but can affect consistency |
| Max tokens change | ❌ No | Predictable effect, just test offline |
| Bug fix in code (not prompt) | ❌ No | Standard deploy process |

Canary deploys vs A/B tests

| | Canary deploy | A/B test |
|---|---|---|
| Goal | Safe rollout | Measure which is better |
| Traffic split | 5% → 100% (gradual) | 50/50 (fixed) |
| Duration | Days | Weeks |
| Rollback | Automatic on failure | Manual after analysis |
| When to use | Shipping a change you believe is better | Comparing two options you're unsure about |

Use A/B tests to decide WHAT to ship. Use canary deploys to ship it SAFELY.

Tools

| Approach | Complexity | Best for |
|---|---|---|
| Custom feature flags (code above) | Low | Small teams, simple apps |
| LaunchDarkly | Medium | Teams with many feature flags |
| Helicone + custom | Low | LLM-specific monitoring |
| Kubernetes canary | High | Infrastructure-level canary |

For most AI apps, the custom approach (about 50 lines of Python) is sufficient. You don't need a platform until you're running multiple concurrent canaries.
