🤖 AI Tools
· 3 min read

LLM Feature Flags: Safely Roll Out Model Changes to Users (2026)


Feature flags are standard practice for rolling out code changes. They should be standard practice for AI model changes too. When you update a prompt, switch models, or add a new agent capability, feature flags let you control who sees the change, measure the impact, and roll back instantly if something breaks.

Why AI needs feature flags

AI changes are riskier than code changes because:

  1. Non-deterministic: The same prompt change can improve quality for 80% of queries and degrade it for 20%
  2. Hard to test: You can't write unit tests for "does this prompt produce better responses"
  3. User-facing: A bad model change is immediately visible to users (unlike a backend optimization)
  4. Expensive to roll back: If you've already served bad responses to 100% of users, the damage is done

Feature flags give you a controlled rollout: start with 1% of users, measure quality, and gradually increase.

Basic implementation

import hashlib

class AIFeatureFlags:
    def __init__(self, config: dict):
        self.config = config
    
    def is_enabled(self, flag: str, user_id: str) -> bool:
        flag_config = self.config.get(flag, {})
        if not flag_config.get("enabled", False):
            return False
        
        percentage = flag_config.get("percentage", 0)
        hash_val = int(hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest(), 16)
        return (hash_val % 100) < percentage

flags = AIFeatureFlags({
    "new_system_prompt": {"enabled": True, "percentage": 10},
    "gpt4o_to_claude": {"enabled": True, "percentage": 5},
    "agent_v2": {"enabled": False, "percentage": 0},
})

# Usage
async def handle_request(user_id, message):
    if flags.is_enabled("gpt4o_to_claude", user_id):
        model = "claude-sonnet-4"
    else:
        model = "gpt-4o"
    
    return await call_model(model, message)

The hash ensures the same user always gets the same experience (no flickering between versions).
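That stability is easy to check. A quick sketch reproducing the same bucket computation as `AIFeatureFlags.is_enabled` above:

```python
import hashlib

def bucket(flag: str, user_id: str) -> int:
    # Same stable bucketing as AIFeatureFlags.is_enabled above
    return int(hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100

# Same user + flag always lands in the same bucket, on any machine
assert bucket("new_system_prompt", "user-123") == bucket("new_system_prompt", "user-123")

# Buckets are computed per flag, so enabling one flag for a user says
# nothing about whether another flag is enabled for the same user
a = bucket("new_system_prompt", "user-123")
b = bucket("gpt4o_to_claude", "user-123")
```

Note that MD5 is used here only as a cheap, stable bucketing hash, not for security.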

What to feature-flag

| Change type | Risk | Flag strategy |
| --- | --- | --- |
| Model swap (GPT-4o → Claude) | High | Start at 1%, measure for 48h |
| System prompt update | Medium | Start at 10%, measure for 24h |
| New tool/capability | Medium | Start at 5%, measure for 48h |
| Temperature/parameter change | Low | Start at 25%, measure for 12h |
| New agent workflow | High | Start at 1%, measure for 72h |
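These strategies can be encoded as ramp schedules keyed by risk level, so every rollout follows the same steps. A minimal sketch; `ROLLOUT_RAMPS` and `next_step` are hypothetical names, and the values only loosely mirror the table:

```python
# Hypothetical ramp schedules by risk level; each step is
# (target percentage, hours to hold and measure before the next step).
ROLLOUT_RAMPS = {
    "high":   [(1, 48), (5, 48), (25, 24), (100, 0)],
    "medium": [(5, 48), (25, 24), (100, 0)],
    "low":    [(25, 12), (100, 0)],
}

def next_step(risk: str, current_pct: int):
    """Return the next (percentage, hold_hours) step, or None once at 100%."""
    for pct, hold in ROLLOUT_RAMPS[risk]:
        if pct > current_pct:
            return pct, hold
    return None
```

A scheduler (or a human on a checklist) calls `next_step` after each measurement window instead of picking percentages ad hoc.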

A/B testing prompts

Feature flags enable prompt A/B testing, the most underused optimization technique in AI:

PROMPT_VARIANTS = {
    "control": "You are a helpful coding assistant. Answer questions clearly and concisely.",
    "variant_a": "You are a senior software engineer. Provide specific, actionable code examples. Always explain your reasoning.",
    "variant_b": "You are a coding mentor. Start with the simplest solution, then offer improvements. Ask clarifying questions when the request is ambiguous.",
}

def get_prompt(user_id):
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    
    if hash_val < 33:
        return "control", PROMPT_VARIANTS["control"]
    elif hash_val < 66:
        return "variant_a", PROMPT_VARIANTS["variant_a"]
    else:
        return "variant_b", PROMPT_VARIANTS["variant_b"]

# Track which variant each user gets
variant, prompt = get_prompt(user_id)
result = await call_model(model, message, system_prompt=prompt)
await log_metric("response_quality", score, tags={"variant": variant})

After a week, compare quality scores across variants. The winning prompt becomes the new default.
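"Compare quality scores" can be as simple as a Welch t-statistic over the logged per-variant scores. A stdlib-only sketch, with made-up scores standing in for real logged data:

```python
import math
import statistics

def welch_t(scores_a, scores_b):
    """Welch's t-statistic for two independent samples of quality scores."""
    ma, mb = statistics.fmean(scores_a), statistics.fmean(scores_b)
    va, vb = statistics.variance(scores_a), statistics.variance(scores_b)
    se = math.sqrt(va / len(scores_a) + vb / len(scores_b))
    return (ma - mb) / se

# Hypothetical per-response quality scores (e.g. 1-5 ratings) per variant
control   = [3.1, 3.4, 3.0, 3.3, 3.2, 3.5, 3.1, 3.2]
variant_a = [3.6, 3.8, 3.5, 3.9, 3.7, 3.6, 3.8, 3.7]

t = welch_t(variant_a, control)
# |t| well above ~2 suggests the difference is unlikely to be noise
```

With real traffic you would use far more samples per variant; this only illustrates the shape of the comparison.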

Measuring success

Feature flags are useless without measurement. Track:

async def measure_flag_impact(flag_name, days=7):
    enabled_group = await get_metrics(flag=flag_name, enabled=True, days=days)
    control_group = await get_metrics(flag=flag_name, enabled=False, days=days)
    
    return {
        "quality_delta": enabled_group.avg_score - control_group.avg_score,
        "latency_delta": enabled_group.p95_latency - control_group.p95_latency,
        "cost_delta": enabled_group.avg_cost - control_group.avg_cost,
        "error_rate_delta": enabled_group.error_rate - control_group.error_rate,
        "sample_size": {"enabled": enabled_group.count, "control": control_group.count},
    }

Only promote a flag to 100% when:

  • Quality is equal or better
  • Latency is acceptable
  • Cost is within budget
  • Error rate hasn't increased
  • Sample size is statistically significant (100+ per group minimum)
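This checklist can be gated in code against the `measure_flag_impact()` output above. A sketch; the threshold values are illustrative, not recommendations:

```python
def should_promote(impact: dict,
                   max_latency_delta_ms: float = 200.0,
                   max_cost_delta: float = 0.0,
                   min_samples: int = 100) -> bool:
    """Apply the promotion checklist to a measure_flag_impact() result.
    Thresholds are illustrative; tune them per product."""
    samples = impact["sample_size"]
    return (
        impact["quality_delta"] >= 0                         # equal or better quality
        and impact["latency_delta"] <= max_latency_delta_ms  # latency acceptable
        and impact["cost_delta"] <= max_cost_delta           # cost within budget
        and impact["error_rate_delta"] <= 0                  # error rate not up
        and min(samples["enabled"], samples["control"]) >= min_samples
    )
```

Encoding the gate means a flag can't quietly reach 100% with a degraded error rate or a too-small sample.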

Integration with existing tools

If you already use a feature flag service (LaunchDarkly, Flagsmith, Unleash), use it for AI flags too:

# LaunchDarkly example
import ldclient

def get_model(user):
    if ldclient.get().variation("use-claude-sonnet", user, False):
        return "claude-sonnet-4"
    return "gpt-4o"

If you don't have a feature flag service, the hash-based implementation above works fine for most teams. Graduate to a proper service when you need targeting rules, audit logs, or team management.

Emergency kill switches

Every AI feature should have a kill switch:

KILL_SWITCHES = {
    "ai_agent_enabled": True,
    "code_execution_enabled": True,
    "external_api_calls_enabled": True,
}

async def check_kill_switch(feature):
    if not KILL_SWITCHES.get(feature, True):
        return {"error": "Feature temporarily disabled", "retry_after": 300}
    return None

Store kill switches in Redis or a fast key-value store so they take effect immediately. When something goes wrong at 3 AM, flipping a kill switch is faster than deploying a code fix.
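A minimal sketch of such a store. The dict stands in for Redis so the example is self-contained; in production you would back `_store` with a Redis client (string `GET`/`SET` on the switch keys):

```python
class KillSwitchStore:
    """Fast-flip kill switches. The dict is a stand-in for Redis."""

    def __init__(self):
        self._store = {}  # swap for a redis.Redis(...) client in production

    def disable(self, feature: str):
        self._store[feature] = "0"

    def enable(self, feature: str):
        self._store[feature] = "1"

    def is_enabled(self, feature: str) -> bool:
        # Fail open: an unknown feature is treated as enabled, matching
        # KILL_SWITCHES.get(feature, True) above
        return self._store.get(feature, "1") == "1"

switches = KillSwitchStore()
switches.disable("code_execution_enabled")  # takes effect on the next read
```

Because every request reads the switch, flipping it changes behavior without a deploy.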

Related: How to Handle AI Model Version Changes · AI Model Rollback Strategies · LLM Regression Testing · Test AI Agents Before Production · AI Agent Error Handling · Deploy AI Agents to Production