You wrote a new prompt that "feels better." But does it actually perform better with real users? Without A/B testing, you're guessing. Here's how to test prompt changes with data.
## Why A/B test prompts
LLMs are non-deterministic. A prompt that works great on 10 test cases might fail on the 11th. The only way to know if a prompt change improves your app is to test it with real traffic.
Common prompt changes that need A/B testing:
- Rewriting the system prompt for clarity
- Adding or removing few-shot examples
- Changing output format (structured outputs)
- Switching models (e.g., Sonnet to DeepSeek)
- Adding guardrails or safety instructions
## The simple approach
### Step 1: Define your metric
Pick one primary metric that matters for your use case:
| App type | Primary metric |
|---|---|
| Chatbot | User satisfaction (thumbs up/down) |
| Code generator | Tests passing |
| Summarizer | LLM-as-judge quality score |
| Classifier | Accuracy vs ground truth |
| Search | Click-through rate |
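As a concrete example, the chatbot metric reduces to a per-variant satisfaction rate. A minimal sketch (the event shape here is an assumption; adapt it to however you store feedback):

```python
from collections import defaultdict

# Hypothetical feedback events; match the shape to your own logs
events = [
    {"variant": "A", "thumbs_up": True},
    {"variant": "A", "thumbs_up": False},
    {"variant": "B", "thumbs_up": True},
]

ups, totals = defaultdict(int), defaultdict(int)
for e in events:
    totals[e["variant"]] += 1
    ups[e["variant"]] += e["thumbs_up"]  # True counts as 1

for v in sorted(totals):
    print(f"Variant {v}: {ups[v] / totals[v]:.1%} satisfaction (n={totals[v]})")
```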
### Step 2: Split traffic
```python
import hashlib

def get_prompt_variant(user_id):
    # Consistent assignment: same user always gets same variant
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "B" if hash_val % 100 < 50 else "A"

# In your request handler
variant = get_prompt_variant(request.user_id)
if variant == "A":
    system_prompt = PROMPT_V1  # Current production prompt
else:
    system_prompt = PROMPT_V2  # New candidate prompt
```
Use a hash of the user ID for consistent assignment. The same user always sees the same variant, so their experience doesn't flip between prompts mid-conversation.
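If you run several experiments at once, a common refinement (an extra step beyond the snippet above, not required for a single test) is to salt the hash with an experiment name, so assignment in one experiment doesn't correlate with assignment in another:

```python
import hashlib

def get_variant(user_id: str, experiment: str) -> str:
    # Salting with the experiment name decorrelates assignments across experiments
    key = f"{experiment}:{user_id}".encode()
    hash_val = int(hashlib.md5(key).hexdigest(), 16)
    return "B" if hash_val % 100 < 50 else "A"

variant = get_variant("user-123", "system-prompt-v2-test")
```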
### Step 3: Log everything
```python
logger.info({
    "event": "llm_call",
    "variant": variant,
    "prompt_version": "v1" if variant == "A" else "v2",
    "user_id": user_hash,
    "quality_score": score,
    "latency_ms": latency,
    "tokens": total_tokens,
    "cost": cost,
})
```
See our logging guide for the complete logging strategy.
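Note that the `user_id` field above holds a `user_hash`, not the raw ID. One way to derive it (a sketch, assuming you keep a server-side salt; the constant name is hypothetical):

```python
import hashlib

HASH_SALT = "replace-with-a-server-side-secret"  # hypothetical salt; keep it out of logs

def hash_user_id(user_id: str) -> str:
    # Salted SHA-256 so raw user IDs never appear in logs
    return hashlib.sha256((HASH_SALT + user_id).encode()).hexdigest()[:16]

user_hash = hash_user_id("user-123")
```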
### Step 4: Analyze results
After collecting enough data (minimum 100 samples per variant):
```python
import numpy as np
from scipy import stats

scores_a = [...]  # Quality scores for variant A
scores_b = [...]  # Quality scores for variant B

# Welch's two-sample t-test (equal_var=False avoids assuming equal variances)
t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)

print(f"Variant A: {np.mean(scores_a):.3f} (n={len(scores_a)})")
print(f"Variant B: {np.mean(scores_b):.3f} (n={len(scores_b)})")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05 and np.mean(scores_b) > np.mean(scores_a):
    print("Variant B is significantly better. Ship it.")
elif p_value < 0.05:
    print("Variant A is significantly better. Keep current prompt.")
else:
    print("No significant difference. Need more data, or the change doesn't matter.")
```
### Step 5: Ship or revert
- P-value < 0.05 and B is better: Ship variant B to 100% of traffic
- P-value < 0.05 and A is better: Revert, your change made things worse
- P-value > 0.05: No significant difference. Ship B if it's cheaper or faster; otherwise keep A
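Shipping doesn't have to be a single 0-to-100 switch. One option (a sketch building on the Step 2 hash, not a required step) is to reuse the modulo threshold as a rollout dial:

```python
import hashlib

ROLLOUT_PERCENT = 75  # ramp gradually: 50 -> 75 -> 100

def get_prompt_variant(user_id: str) -> str:
    # Same hash as Step 2; the threshold now doubles as a rollout dial
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "B" if hash_val % 100 < ROLLOUT_PERCENT else "A"
```

Because the hash is stable, users already on B at 50% stay on B as the percentage rises.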
## What to measure beyond quality
| Metric | Why it matters |
|---|---|
| Quality score | Is the output better? |
| Latency | Is it faster or slower? |
| Token usage | Is it cheaper or more expensive? |
| Error rate | Does the new prompt cause more failures? |
| User engagement | Do users interact more (thumbs up, follow-up questions)? |
A prompt that's 5% better quality but 50% more expensive might not be worth shipping. Track all dimensions.
## Sample size calculator
How many requests do you need before the test is conclusive?
| Expected improvement | Samples needed per variant |
|---|---|
| 20%+ (large) | ~100 |
| 10% (medium) | ~400 |
| 5% (small) | ~1,600 |
| 2% (tiny) | ~10,000 |
If your app gets 100 requests/day, a large improvement is detectable in 2 days. A small improvement takes 32 days. Don't run tests longer than necessary; ship fast.
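The table is a rule of thumb. To size a test against your own baseline, a power analysis does the job; a sketch using statsmodels, where the baseline mean and standard deviation are placeholders you should replace with numbers from your logs:

```python
from statsmodels.stats.power import tt_ind_solve_power

# Placeholder numbers: pull the real baseline from your logs
baseline_mean = 0.70   # mean quality score for the current prompt
baseline_sd = 0.15     # standard deviation of that score
relative_lift = 0.05   # smallest improvement worth detecting (5%)

# Convert the expected lift into Cohen's d, the effect size the test expects
effect_size = (baseline_mean * relative_lift) / baseline_sd

n = tt_ind_solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
print(f"~{int(n) + 1} samples per variant")
```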
## Common mistakes
### Testing too many things at once
Change one thing per test. If you change the system prompt AND the model AND the temperature, you won't know which change caused the improvement.
### Stopping too early
"It looks better after 20 requests" is not statistically significant. Wait for your minimum sample size.
### Ignoring cost
A prompt that adds 3 few-shot examples might improve quality 5% but double your token costs. Always compare cost-adjusted quality.
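A crude way to make that comparison explicit (a sketch; "quality per dollar" is one possible trade-off function, not a standard metric):

```python
def quality_per_dollar(mean_quality: float, mean_cost_usd: float) -> float:
    # Crude trade-off metric: higher is better; weight it to match your product
    return mean_quality / mean_cost_usd

print(f"A: {quality_per_dollar(0.80, 0.004):.0f}")  # illustrative numbers
print(f"B: {quality_per_dollar(0.84, 0.008):.0f}")  # +5% quality at 2x cost: worse ratio
```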
### Not testing edge cases
Your A/B test might show improvement on average but hide a regression on specific input types. Monitor regression tests alongside the A/B test.
## Tools
| Tool | A/B testing support |
|---|---|
| Promptfoo | Built-in comparison mode |
| Helicone | Tag requests with variant, compare in dashboard |
| Langfuse | Experiment tracking with scoring |
| Custom | The code above (~50 lines of Python) |
For most teams, the custom approach works fine. Graduate to Promptfoo or Langfuse when you're running multiple concurrent experiments.
Related: LLM Regression Testing · LLM-as-a-Judge · How to Test AI Applications · LLM Observability · What to Log in AI Systems