πŸ€– AI Tools Β· 3 min read

A/B Testing Prompts in Production β€” Replace Guesswork with Data


You wrote a new prompt that β€œfeels better.” But does it actually perform better with real users? Without A/B testing, you’re guessing. Here’s how to test prompt changes with data.

Why A/B test prompts

LLMs are non-deterministic. A prompt that works great on 10 test cases might fail on the 11th. The only way to know if a prompt change improves your app is to test it with real traffic.

Common prompt changes that need A/B testing:

  • Rewriting the system prompt for clarity
  • Adding or removing few-shot examples
  • Changing output format (structured outputs)
  • Switching models (e.g., Sonnet to DeepSeek)
  • Adding guardrails or safety instructions

The simple approach

Step 1: Define your metric

Pick one primary metric that matters for your use case:

App type         Primary metric
Chatbot          User satisfaction (thumbs up/down)
Code generator   Tests passing
Summarizer       LLM-as-judge quality score
Classifier       Accuracy vs ground truth
Search           Click-through rate
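
For a chatbot, for instance, the metric can be as simple as mapping explicit feedback to a 0/1 score that gets logged with every request. A minimal sketch (the feedback event names are assumptions; use whatever your app emits):

def quality_score(feedback):
    # Map explicit user feedback to a 0/1 quality score.
    # "thumbs_up" / "thumbs_down" are assumed event names.
    if feedback == "thumbs_up":
        return 1.0
    if feedback == "thumbs_down":
        return 0.0
    return None  # no feedback: exclude this request from the analysis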

Step 2: Split traffic

import hashlib

def get_prompt_variant(user_id):
    # Consistent assignment: same user always gets same variant
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "B" if hash_val % 100 < 50 else "A"

# In your request handler
variant = get_prompt_variant(request.user_id)

if variant == "A":
    system_prompt = PROMPT_V1  # Current production prompt
else:
    system_prompt = PROMPT_V2  # New candidate prompt

Hashing the user ID gives consistent assignment: the same user always sees the same variant, so their experience stays stable across sessions and each user's data lands in exactly one bucket.

Step 3: Log everything

logger.info({
    "event": "llm_call",
    "variant": variant,
    "prompt_version": "v1" if variant == "A" else "v2",
    "user_id": user_hash,
    "quality_score": score,
    "latency_ms": latency,
    "tokens": total_tokens,
    "cost": cost,
})

See our logging guide for the complete logging strategy.

Step 4: Analyze results

After collecting enough data (minimum 100 samples per variant):

import numpy as np
from scipy import stats

scores_a = [...]  # Quality scores for variant A
scores_b = [...]  # Quality scores for variant B

# Two-sample t-test
t_stat, p_value = stats.ttest_ind(scores_a, scores_b)

print(f"Variant A: {np.mean(scores_a):.3f} (n={len(scores_a)})")
print(f"Variant B: {np.mean(scores_b):.3f} (n={len(scores_b)})")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05 and np.mean(scores_b) > np.mean(scores_a):
    print("Variant B is significantly better. Ship it.")
elif p_value < 0.05:
    print("Variant A is significantly better. Keep current prompt.")
else:
    print("No significant difference. Need more data or the change doesn't matter.")

Step 5: Ship or revert

  • P-value < 0.05 and B is better: Ship variant B to 100% of traffic
  • P-value < 0.05 and A is better: Revert, your change made things worse
  • P-value > 0.05: No significant difference. Ship B if it’s cheaper/faster, otherwise keep A

What to measure beyond quality

Metric            Why it matters
Quality score     Is the output better?
Latency           Is it faster or slower?
Token usage       Is it cheaper or more expensive?
Error rate        Does the new prompt cause more failures?
User engagement   Do users interact more (thumbs up, follow-up questions)?

A prompt that’s 5% better quality but 50% more expensive might not be worth shipping. Track all dimensions.
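
In practice that means computing the same per-variant summary across every dimension you log. A sketch over the records from Step 3 (log_records is a placeholder for however you load them):

from statistics import mean

def summarize(records, variant):
    # Average each logged dimension for one variant
    rows = [r for r in records if r["variant"] == variant]
    return {
        "n": len(rows),
        "quality": mean(r["quality_score"] for r in rows),
        "latency_ms": mean(r["latency_ms"] for r in rows),
        "cost": mean(r["cost"] for r in rows),
    }

for v in ("A", "B"):
    print(v, summarize(log_records, v))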

Sample size calculator

How many requests do you need before the test is conclusive?

Expected improvement   Samples needed per variant
20%+ (large)           ~100
10% (medium)           ~400
5% (small)             ~1,600
2% (tiny)              ~10,000

If your app gets 100 requests/day, a large improvement is detectable in 2 days. A small improvement takes 32 days. Don’t run tests longer than necessary β€” ship fast.
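
The numbers above are roughly what a standard power calculation gives for a binary metric near a 50% baseline, tested at a 5% significance level with 80% power. If your baseline or metric is different, compute the sample size for your own case; a sketch using statsmodels' power utilities:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.50      # current thumbs-up rate (example value)
improvement = 0.05   # smallest absolute lift you care about detecting

effect_size = proportion_effectsize(baseline, baseline + improvement)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, ratio=1.0
)
print(f"~{round(n_per_variant)} samples per variant")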

Common mistakes

Testing too many things at once

Change one thing per test. If you change the system prompt AND the model AND the temperature, you won’t know which change caused the improvement.

Stopping too early

β€œIt looks better after 20 requests” is not statistically significant. Wait for your minimum sample size.

Ignoring cost

A prompt that adds 3 few-shot examples might improve quality 5% but double your token costs. Always compare cost-adjusted quality.
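
A quick way to make that call is to put the quality lift and the cost increase side by side before shipping. A tiny sketch (the per-variant means are placeholders to fill in from your logs):

# Mean quality score and mean cost per request for each variant,
# taken from the Step 3 logs (placeholders)
quality_a, cost_a = ..., ...
quality_b, cost_b = ..., ...

quality_lift = (quality_b - quality_a) / quality_a
cost_increase = (cost_b - cost_a) / cost_a
print(f"Quality lift: {quality_lift:+.1%}, cost increase: {cost_increase:+.1%}")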

Not testing edge cases

Your A/B test might show improvement on average but hide a regression on specific input types. Monitor regression tests alongside the A/B test.

Tools

Tool        A/B testing support
Promptfoo   Built-in comparison mode
Helicone    Tag requests with variant, compare in dashboard
Langfuse    Experiment tracking with scoring
Custom      The code above (50 lines of Python)

For most teams, the custom approach works fine. Graduate to Promptfoo or Langfuse when you’re running multiple concurrent experiments.

Related: LLM Regression Testing Β· LLM-as-a-Judge Β· How to Test AI Applications Β· LLM Observability Β· What to Log in AI Systems