You wrote a new prompt that "feels better." But does it actually perform better with real users? Without A/B testing, you're guessing. Here's how to test prompt changes with data.
## Why A/B test prompts
LLMs are non-deterministic. A prompt that works great on 10 test cases might fail on the 11th. The only way to know if a prompt change improves your app is to test it with real traffic.
Common prompt changes that need A/B testing:
- Rewriting the system prompt for clarity
- Adding or removing few-shot examples
- Changing output format (structured outputs)
- Switching models (e.g., Sonnet to DeepSeek)
- Adding guardrails or safety instructions
## The simple approach
### Step 1: Define your metric
Pick one primary metric that matters for your use case:
| App type | Primary metric |
|---|---|
| Chatbot | User satisfaction (thumbs up/down) |
| Code generator | Tests passing |
| Summarizer | LLM-as-judge quality score |
| Classifier | Accuracy vs ground truth |
| Search | Click-through rate |
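As a concrete example, the chatbot metric reduces to a per-variant satisfaction rate. A minimal sketch (the event shape here is an assumption; adapt it to however you store feedback):

```python
from collections import defaultdict

# Hypothetical feedback events; match the shape to your own logs
events = [
    {"variant": "A", "thumbs_up": True},
    {"variant": "A", "thumbs_up": False},
    {"variant": "B", "thumbs_up": True},
]

ups, totals = defaultdict(int), defaultdict(int)
for e in events:
    totals[e["variant"]] += 1
    ups[e["variant"]] += e["thumbs_up"]  # True counts as 1

for v in sorted(totals):
    print(f"Variant {v}: {ups[v] / totals[v]:.1%} satisfaction (n={totals[v]})")
```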
### Step 2: Split traffic
```python
import hashlib

def get_prompt_variant(user_id):
    # Consistent assignment: same user always gets same variant
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "B" if hash_val % 100 < 50 else "A"

# In your request handler
variant = get_prompt_variant(request.user_id)
if variant == "A":
    system_prompt = PROMPT_V1  # Current production prompt
else:
    system_prompt = PROMPT_V2  # New candidate prompt
```
Use a hash of the user ID for consistent assignment. The same user always sees the same variant, so their experience doesn't flip between prompts mid-conversation.
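If you run several experiments at once, a common refinement (an extra step beyond the snippet above, not required for a single test) is to salt the hash with an experiment name, so assignment in one experiment doesn't correlate with assignment in another:

```python
import hashlib

def get_variant(user_id: str, experiment: str) -> str:
    # Salting with the experiment name decorrelates assignments across experiments
    key = f"{experiment}:{user_id}".encode()
    hash_val = int(hashlib.md5(key).hexdigest(), 16)
    return "B" if hash_val % 100 < 50 else "A"

variant = get_variant("user-123", "system-prompt-v2-test")
```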
### Step 3: Log everything
```python
logger.info({
    "event": "llm_call",
    "variant": variant,
    "prompt_version": "v1" if variant == "A" else "v2",
    "user_id": user_hash,
    "quality_score": score,
    "latency_ms": latency,
    "tokens": total_tokens,
    "cost": cost,
})
```
See our logging guide for the complete logging strategy.
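Note that the `user_id` field above holds a `user_hash`, not the raw ID. One way to derive it (a sketch, assuming you keep a server-side salt; the constant name is hypothetical):

```python
import hashlib

HASH_SALT = "replace-with-a-server-side-secret"  # hypothetical salt; keep it out of logs

def hash_user_id(user_id: str) -> str:
    # Salted SHA-256 so raw user IDs never appear in logs
    return hashlib.sha256((HASH_SALT + user_id).encode()).hexdigest()[:16]

user_hash = hash_user_id("user-123")
```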
### Step 4: Analyze results
After collecting enough data (minimum 100 samples per variant):
```python
import numpy as np
from scipy import stats

scores_a = [...]  # Quality scores for variant A
scores_b = [...]  # Quality scores for variant B

# Welch's two-sample t-test (equal_var=False avoids assuming equal variances)
t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)

print(f"Variant A: {np.mean(scores_a):.3f} (n={len(scores_a)})")
print(f"Variant B: {np.mean(scores_b):.3f} (n={len(scores_b)})")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05 and np.mean(scores_b) > np.mean(scores_a):
    print("Variant B is significantly better. Ship it.")
elif p_value < 0.05:
    print("Variant A is significantly better. Keep current prompt.")
else:
    print("No significant difference. Need more data, or the change doesn't matter.")
```
### Step 5: Ship or revert
- P-value < 0.05 and B is better: Ship variant B to 100% of traffic
- P-value < 0.05 and A is better: Revert, your change made things worse
- P-value > 0.05: No significant difference. Ship B if it's cheaper or faster; otherwise keep A
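Shipping doesn't have to be a single 0-to-100 switch. One option (a sketch building on the Step 2 hash, not a required step) is to reuse the modulo threshold as a rollout dial:

```python
import hashlib

ROLLOUT_PERCENT = 75  # ramp gradually: 50 -> 75 -> 100

def get_prompt_variant(user_id: str) -> str:
    # Same hash as Step 2; the threshold now doubles as a rollout dial
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "B" if hash_val % 100 < ROLLOUT_PERCENT else "A"
```

Because the hash is stable, users already on B at 50% stay on B as the percentage rises.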
## What to measure beyond quality
| Metric | Why it matters |
|---|---|
| Quality score | Is the output better? |
| Latency | Is it faster or slower? |
| Token usage | Is it cheaper or more expensive? |
| Error rate | Does the new prompt cause more failures? |
| User engagement | Do users interact more (thumbs up, follow-up questions)? |
A prompt that's 5% better quality but 50% more expensive might not be worth shipping. Track all dimensions.
## Sample size calculator
How many requests do you need before the test is conclusive?
| Expected improvement | Samples needed per variant |
|---|---|
| 20%+ (large) | ~100 |
| 10% (medium) | ~400 |
| 5% (small) | ~1,600 |
| 2% (tiny) | ~10,000 |
If your app gets 100 requests/day, a large improvement is detectable in 2 days. A small improvement takes 32 days. Don't run tests longer than necessary; ship fast.
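The table is a rule of thumb. To size a test against your own baseline, a power analysis does the job; a sketch using statsmodels, where the baseline mean and standard deviation are placeholders you should replace with numbers from your logs:

```python
from statsmodels.stats.power import tt_ind_solve_power

# Placeholder numbers: pull the real baseline from your logs
baseline_mean = 0.70   # mean quality score for the current prompt
baseline_sd = 0.15     # standard deviation of that score
relative_lift = 0.05   # smallest improvement worth detecting (5%)

# Convert the expected lift into Cohen's d, the effect size the test expects
effect_size = (baseline_mean * relative_lift) / baseline_sd

n = tt_ind_solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
print(f"~{int(n) + 1} samples per variant")
```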
## Common mistakes
### Testing too many things at once
Change one thing per test. If you change the system prompt AND the model AND the temperature, you won't know which change caused the improvement.
### Stopping too early
"It looks better after 20 requests" is not statistically significant. Wait for your minimum sample size.
### Ignoring cost
A prompt that adds 3 few-shot examples might improve quality 5% but double your token costs. Always compare cost-adjusted quality.
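A crude way to make that comparison explicit (a sketch; "quality per dollar" is one possible trade-off function, not a standard metric):

```python
def quality_per_dollar(mean_quality: float, mean_cost_usd: float) -> float:
    # Crude trade-off metric: higher is better; weight it to match your product
    return mean_quality / mean_cost_usd

print(f"A: {quality_per_dollar(0.80, 0.004):.0f}")  # illustrative numbers
print(f"B: {quality_per_dollar(0.84, 0.008):.0f}")  # +5% quality at 2x cost: worse ratio
```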
### Not testing edge cases
Your A/B test might show improvement on average but hide a regression on specific input types. Monitor regression tests alongside the A/B test.
## Tools
| Tool | A/B testing support |
|---|---|
| Promptfoo | Built-in comparison mode |
| Helicone | Tag requests with variant, compare in dashboard |
| Langfuse | Experiment tracking with scoring |
| Custom | The code above (~50 lines of Python) |
For most teams, the custom approach works fine. Graduate to Promptfoo or Langfuse when you're running multiple concurrent experiments.
Related: LLM Regression Testing · LLM-as-a-Judge · How to Test AI Applications · LLM Observability · What to Log in AI Systems