Feature flags are standard practice for rolling out code changes. They should be standard practice for AI model changes too. When you update a prompt, switch models, or add a new agent capability, feature flags let you control who sees the change, measure the impact, and roll back instantly if something breaks.
## Why AI needs feature flags
AI changes are riskier than code changes because:
- Non-deterministic: The same prompt change can improve quality for 80% of queries and degrade it for 20%
- Hard to test: You can't write unit tests for "does this prompt produce better responses?"
- User-facing: A bad model change is immediately visible to users (unlike a backend optimization)
- Expensive to roll back: If you've already served bad responses to 100% of users, the damage is done
Feature flags give you a controlled rollout: start with 1% of users, measure quality, and gradually increase.
## Basic implementation
```python
import hashlib

class AIFeatureFlags:
    def __init__(self, config: dict):
        self.config = config

    def is_enabled(self, flag: str, user_id: str) -> bool:
        flag_config = self.config.get(flag, {})
        if not flag_config.get("enabled", False):
            return False
        percentage = flag_config.get("percentage", 0)
        # Stable hash: the same (flag, user) pair always lands in the same bucket
        hash_val = int(hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest(), 16)
        return (hash_val % 100) < percentage

flags = AIFeatureFlags({
    "new_system_prompt": {"enabled": True, "percentage": 10},
    "gpt4o_to_claude": {"enabled": True, "percentage": 5},
    "agent_v2": {"enabled": False, "percentage": 0},
})

# Usage
async def handle_request(user_id, message):
    if flags.is_enabled("gpt4o_to_claude", user_id):
        model = "claude-sonnet-4"
    else:
        model = "gpt-4o"
    return await call_model(model, message)
```
The hash ensures the same user always gets the same experience (no flickering between versions).
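That stability claim is easy to verify in isolation. This standalone sketch reuses the same MD5 bucketing scheme; the flag names here are illustrative:

```python
import hashlib

def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in [0, 100)."""
    digest = hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

# The bucket never changes between calls, so a user's experience is stable.
assert bucket("new_system_prompt", "user-123") == bucket("new_system_prompt", "user-123")

# Including the flag name in the hash makes flags independent: the 10% of users
# in "flag_a" are not the same 10% that land in "flag_b" (the overlap is roughly
# the 1% expected by chance, not the full 10%).
in_a = {u for u in map(str, range(1000)) if bucket("flag_a", u) < 10}
in_b = {u for u in map(str, range(1000)) if bucket("flag_b", u) < 10}
```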
## What to feature-flag
| Change type | Risk | Flag strategy |
|---|---|---|
| Model swap (GPT-4o → Claude) | High | Start at 1%, measure for 48h |
| System prompt update | Medium | Start at 10%, measure for 24h |
| New tool/capability | Medium | Start at 5%, measure for 48h |
| Temperature/parameter change | Low | Start at 25%, measure for 12h |
| New agent workflow | High | Start at 1%, measure for 72h |
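The schedules in the table can be encoded as a simple ramp that only advances while metrics stay healthy. This is an illustrative sketch, not a library API; the `metrics_healthy` signal is assumed to come from whatever measurement pipeline you already have:

```python
from dataclasses import dataclass, field
import time

@dataclass
class RampedFlag:
    """Walk a flag through a rollout schedule, e.g. 1% -> 5% -> 25% -> 100%."""
    schedule: list  # (percentage, hold_seconds) pairs
    stage: int = 0
    stage_started: float = field(default_factory=time.time)

    @property
    def percentage(self) -> int:
        return self.schedule[self.stage][0]

    def advance_if_ready(self, metrics_healthy: bool) -> int:
        if not metrics_healthy:
            # Bad metrics: drop back to the first (smallest) stage immediately.
            self.stage = 0
            self.stage_started = time.time()
        elif (time.time() - self.stage_started >= self.schedule[self.stage][1]
              and self.stage < len(self.schedule) - 1):
            # Hold period elapsed with healthy metrics: widen the cohort.
            self.stage += 1
            self.stage_started = time.time()
        return self.percentage

# A high-risk model swap per the table: 1% for 48h, then 5%, 25%, 100%.
flag = RampedFlag(schedule=[(1, 48 * 3600), (5, 48 * 3600), (25, 24 * 3600), (100, 0)])
```

The returned percentage plugs straight into the hash-based `is_enabled` check above.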
## A/B testing prompts
Feature flags enable prompt A/B testing, one of the most underused optimization techniques in AI:
```python
import hashlib

PROMPT_VARIANTS = {
    "control": "You are a helpful coding assistant. Answer questions clearly and concisely.",
    "variant_a": "You are a senior software engineer. Provide specific, actionable code examples. Always explain your reasoning.",
    "variant_b": "You are a coding mentor. Start with the simplest solution, then offer improvements. Ask clarifying questions when the request is ambiguous.",
}

def get_prompt(user_id):
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    if hash_val < 33:
        return "control", PROMPT_VARIANTS["control"]
    elif hash_val < 66:
        return "variant_a", PROMPT_VARIANTS["variant_a"]
    else:
        return "variant_b", PROMPT_VARIANTS["variant_b"]

# Track which variant each user gets
variant, prompt = get_prompt(user_id)
result = await call_model(model, message, system_prompt=prompt)
await log_metric("response_quality", score, tags={"variant": variant})
```
After a week, compare quality scores across variants. The winning prompt becomes the new default.
## Measuring success
Feature flags are useless without measurement. Track:
```python
async def measure_flag_impact(flag_name, days=7):
    enabled_group = await get_metrics(flag=flag_name, enabled=True, days=days)
    control_group = await get_metrics(flag=flag_name, enabled=False, days=days)
    return {
        "quality_delta": enabled_group.avg_score - control_group.avg_score,
        "latency_delta": enabled_group.p95_latency - control_group.p95_latency,
        "cost_delta": enabled_group.avg_cost - control_group.avg_cost,
        "error_rate_delta": enabled_group.error_rate - control_group.error_rate,
        "sample_size": {"enabled": enabled_group.count, "control": control_group.count},
    }
```
Only promote a flag to 100% when:
- Quality is equal or better
- Latency is acceptable
- Cost is within budget
- Error rate hasn't increased
- Sample size is statistically significant (100+ per group minimum)
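The significance check can go beyond a raw sample count. A rough two-sample z-test on mean quality scores looks like this; it is an approximation (for small or constant samples, use a proper t-test such as `scipy.stats.ttest_ind` instead):

```python
import math

def quality_delta_significant(scores_enabled, scores_control, z_crit=1.96):
    """Approximate two-sample z-test on mean quality scores.

    Returns (is_significant, z). Assumes both groups have nonzero variance;
    z_crit=1.96 corresponds to a 95% confidence level.
    """
    def stats(xs):
        n = len(xs)
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
        return n, mean, var

    n1, m1, v1 = stats(scores_enabled)
    n2, m2, v2 = stats(scores_control)
    std_err = math.sqrt(v1 / n1 + v2 / n2)
    z = (m1 - m2) / std_err
    return abs(z) >= z_crit, z
```

A clearly separated pair of groups yields a large |z|; identical groups yield z = 0, so the quality delta would not justify promoting the flag.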
## Integration with existing tools
If you already use a feature flag service (LaunchDarkly, Flagsmith, Unleash), use it for AI flags too:
```python
# LaunchDarkly example
import ldclient

def get_model(user):
    if ldclient.get().variation("use-claude-sonnet", user, False):
        return "claude-sonnet-4"
    return "gpt-4o"
```
If you don't have a feature flag service, the hash-based implementation above works fine for most teams. Graduate to a proper service when you need targeting rules, audit logs, or team management.
## Emergency kill switches
Every AI feature should have a kill switch:
```python
KILL_SWITCHES = {
    "ai_agent_enabled": True,
    "code_execution_enabled": True,
    "external_api_calls_enabled": True,
}

async def check_kill_switch(feature):
    if not KILL_SWITCHES.get(feature, True):
        return {"error": "Feature temporarily disabled", "retry_after": 300}
    return None
```
Store kill switches in Redis or a fast key-value store so they take effect immediately. When something goes wrong at 3 AM, flipping a kill switch is faster than deploying a code fix.
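A minimal sketch of that pattern, with an in-memory dict standing in for the shared store (in production this would be a `redis.Redis` client's `get`/`set` so a flip is visible to every instance at once; the key naming is illustrative):

```python
class KillSwitchStore:
    """In-memory stand-in for a fast shared store such as Redis."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

def is_killed(store, feature: str) -> bool:
    # Absent key means enabled, matching the KILL_SWITCHES defaults above.
    return store.get(f"kill:{feature}") == "off"

def flip(store, feature: str, enabled: bool) -> None:
    store.set(f"kill:{feature}", "on" if enabled else "off")

store = KillSwitchStore()
flip(store, "ai_agent_enabled", False)  # the 3 AM move: one write, no deploy
```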
Related: How to Handle AI Model Version Changes · AI Model Rollback Strategies · LLM Regression Testing · Test AI Agents Before Production · AI Agent Error Handling · Deploy AI Agents to Production