Feature flags are standard practice for rolling out code changes. They should be standard practice for AI model changes too. When you update a prompt, switch models, or add a new agent capability, feature flags let you control who sees the change, measure the impact, and roll back instantly if something breaks.
## Why AI needs feature flags
AI changes are riskier than code changes because:
- Non-deterministic: The same prompt change can improve quality for 80% of queries and degrade it for 20%
- Hard to test: You can't write unit tests for "does this prompt produce better responses?"
- User-facing: A bad model change is immediately visible to users (unlike a backend optimization)
- Expensive to roll back: If you've already served bad responses to 100% of users, the damage is done
Feature flags give you a controlled rollout: start with 1% of users, measure quality, and gradually increase.
## Basic implementation
```python
import hashlib

class AIFeatureFlags:
    def __init__(self, config: dict):
        self.config = config

    def is_enabled(self, flag: str, user_id: str) -> bool:
        flag_config = self.config.get(flag, {})
        if not flag_config.get("enabled", False):
            return False
        percentage = flag_config.get("percentage", 0)
        # Stable hash: the same (flag, user) pair always lands in the same bucket
        hash_val = int(hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest(), 16)
        return (hash_val % 100) < percentage

flags = AIFeatureFlags({
    "new_system_prompt": {"enabled": True, "percentage": 10},
    "gpt4o_to_claude": {"enabled": True, "percentage": 5},
    "agent_v2": {"enabled": False, "percentage": 0},
})

# Usage
async def handle_request(user_id, message):
    if flags.is_enabled("gpt4o_to_claude", user_id):
        model = "claude-sonnet-4"
    else:
        model = "gpt-4o"
    return await call_model(model, message)
```
The hash ensures the same user always gets the same experience (no flickering between versions).
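That stability claim is easy to verify in isolation. This standalone sketch reuses the same MD5 bucketing scheme; the flag names here are illustrative:

```python
import hashlib

def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in [0, 100)."""
    digest = hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

# The bucket never changes between calls, so a user's experience is stable.
assert bucket("new_system_prompt", "user-123") == bucket("new_system_prompt", "user-123")

# Including the flag name in the hash makes flags independent: the 10% of users
# in "flag_a" are not the same 10% that land in "flag_b" (the overlap is roughly
# the 1% expected by chance, not the full 10%).
in_a = {u for u in map(str, range(1000)) if bucket("flag_a", u) < 10}
in_b = {u for u in map(str, range(1000)) if bucket("flag_b", u) < 10}
```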
## What to feature-flag
| Change type | Risk | Flag strategy |
|---|---|---|
| Model swap (GPT-4o → Claude) | High | Start at 1%, measure for 48h |
| System prompt update | Medium | Start at 10%, measure for 24h |
| New tool/capability | Medium | Start at 5%, measure for 48h |
| Temperature/parameter change | Low | Start at 25%, measure for 12h |
| New agent workflow | High | Start at 1%, measure for 72h |
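The schedules in the table can be encoded as a simple ramp that only advances while metrics stay healthy. This is an illustrative sketch, not a library API; the `metrics_healthy` signal is assumed to come from whatever measurement pipeline you already have:

```python
from dataclasses import dataclass, field
import time

@dataclass
class RampedFlag:
    """Walk a flag through a rollout schedule, e.g. 1% -> 5% -> 25% -> 100%."""
    schedule: list  # (percentage, hold_seconds) pairs
    stage: int = 0
    stage_started: float = field(default_factory=time.time)

    @property
    def percentage(self) -> int:
        return self.schedule[self.stage][0]

    def advance_if_ready(self, metrics_healthy: bool) -> int:
        if not metrics_healthy:
            # Bad metrics: drop back to the first (smallest) stage immediately.
            self.stage = 0
            self.stage_started = time.time()
        elif (time.time() - self.stage_started >= self.schedule[self.stage][1]
              and self.stage < len(self.schedule) - 1):
            # Hold period elapsed with healthy metrics: widen the cohort.
            self.stage += 1
            self.stage_started = time.time()
        return self.percentage

# A high-risk model swap per the table: 1% for 48h, then 5%, 25%, 100%.
flag = RampedFlag(schedule=[(1, 48 * 3600), (5, 48 * 3600), (25, 24 * 3600), (100, 0)])
```

The returned percentage plugs straight into the hash-based `is_enabled` check above.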
## A/B testing prompts
Feature flags enable prompt A/B testing, one of the most underused optimization techniques in AI:
```python
import hashlib

PROMPT_VARIANTS = {
    "control": "You are a helpful coding assistant. Answer questions clearly and concisely.",
    "variant_a": "You are a senior software engineer. Provide specific, actionable code examples. Always explain your reasoning.",
    "variant_b": "You are a coding mentor. Start with the simplest solution, then offer improvements. Ask clarifying questions when the request is ambiguous.",
}

def get_prompt(user_id):
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    if hash_val < 33:
        return "control", PROMPT_VARIANTS["control"]
    elif hash_val < 66:
        return "variant_a", PROMPT_VARIANTS["variant_a"]
    else:
        return "variant_b", PROMPT_VARIANTS["variant_b"]

# Track which variant each user gets
variant, prompt = get_prompt(user_id)
result = await call_model(model, message, system_prompt=prompt)
await log_metric("response_quality", score, tags={"variant": variant})
```
After a week, compare quality scores across variants. The winning prompt becomes the new default.
## Measuring success
Feature flags are useless without measurement. Track:
```python
async def measure_flag_impact(flag_name, days=7):
    enabled_group = await get_metrics(flag=flag_name, enabled=True, days=days)
    control_group = await get_metrics(flag=flag_name, enabled=False, days=days)
    return {
        "quality_delta": enabled_group.avg_score - control_group.avg_score,
        "latency_delta": enabled_group.p95_latency - control_group.p95_latency,
        "cost_delta": enabled_group.avg_cost - control_group.avg_cost,
        "error_rate_delta": enabled_group.error_rate - control_group.error_rate,
        "sample_size": {"enabled": enabled_group.count, "control": control_group.count},
    }
```
Only promote a flag to 100% when:
- Quality is equal or better
- Latency is acceptable
- Cost is within budget
- Error rate hasn't increased
- Sample size is statistically significant (100+ per group minimum)
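The significance check can go beyond a raw sample count. A rough two-sample z-test on mean quality scores looks like this; it is an approximation (for small or constant samples, use a proper t-test such as `scipy.stats.ttest_ind` instead):

```python
import math

def quality_delta_significant(scores_enabled, scores_control, z_crit=1.96):
    """Approximate two-sample z-test on mean quality scores.

    Returns (is_significant, z). Assumes both groups have nonzero variance;
    z_crit=1.96 corresponds to a 95% confidence level.
    """
    def stats(xs):
        n = len(xs)
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
        return n, mean, var

    n1, m1, v1 = stats(scores_enabled)
    n2, m2, v2 = stats(scores_control)
    std_err = math.sqrt(v1 / n1 + v2 / n2)
    z = (m1 - m2) / std_err
    return abs(z) >= z_crit, z
```

A clearly separated pair of groups yields a large |z|; identical groups yield z = 0, so the quality delta would not justify promoting the flag.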
## Integration with existing tools
If you already use a feature flag service (LaunchDarkly, Flagsmith, Unleash), use it for AI flags too:
```python
# LaunchDarkly example
import ldclient

def get_model(user):
    if ldclient.get().variation("use-claude-sonnet", user, False):
        return "claude-sonnet-4"
    return "gpt-4o"
```
If you don't have a feature flag service, the hash-based implementation above works fine for most teams. Graduate to a proper service when you need targeting rules, audit logs, or team management.
## Emergency kill switches
Every AI feature should have a kill switch:
```python
KILL_SWITCHES = {
    "ai_agent_enabled": True,
    "code_execution_enabled": True,
    "external_api_calls_enabled": True,
}

async def check_kill_switch(feature):
    if not KILL_SWITCHES.get(feature, True):
        return {"error": "Feature temporarily disabled", "retry_after": 300}
    return None
```
Store kill switches in Redis or a fast key-value store so they take effect immediately. When something goes wrong at 3 AM, flipping a kill switch is faster than deploying a code fix.
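A minimal sketch of that pattern, with an in-memory dict standing in for the shared store (in production this would be a `redis.Redis` client's `get`/`set` so a flip is visible to every instance at once; the key naming is illustrative):

```python
class KillSwitchStore:
    """In-memory stand-in for a fast shared store such as Redis."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

def is_killed(store, feature: str) -> bool:
    # Absent key means enabled, matching the KILL_SWITCHES defaults above.
    return store.get(f"kill:{feature}") == "off"

def flip(store, feature: str, enabled: bool) -> None:
    store.set(f"kill:{feature}", "on" if enabled else "off")

store = KillSwitchStore()
flip(store, "ai_agent_enabled", False)  # the 3 AM move: one write, no deploy
```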
Related: How to Handle AI Model Version Changes · AI Model Rollback Strategies · LLM Regression Testing · Test AI Agents Before Production · AI Agent Error Handling · Deploy AI Agents to Production