You change a prompt. Tests pass. You deploy. Users report broken outputs. The prompt change degraded quality on edge cases your tests didn't cover. This is the #1 failure mode for LLM applications in production.
LLM regression testing catches these quality drops before they reach users.
Why traditional tests fail for LLMs
# This test is useless for LLMs
def test_summarize():
    result = summarize("The quick brown fox...")
    assert result == "A fox jumped over a dog"  # Will NEVER match exactly
LLMs are non-deterministic: the same input can produce different outputs from one run to the next. You can't use assertEqual. You need fuzzy evaluation.
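One way to make the summarization test above meaningful is to assert on properties of the output instead of its exact text. This is a minimal sketch: `check_summary` and its three checks are illustrative assumptions, and the hard-coded summary stands in for a real `summarize` call.

```python
def check_summary(source: str, summary: str, max_words: int = 100) -> list[str]:
    """Return the list of failed property checks (an empty list means pass)."""
    failures = []
    if not summary.strip():
        failures.append("summary is empty")
    if len(summary.split()) > max_words:
        failures.append(f"summary exceeds {max_words} words")
    if len(summary) >= len(source):
        failures.append("summary is not shorter than the source")
    return failures


def test_summarize_properties():
    source = "The quick brown fox jumps over the lazy dog. " * 20
    # Stand-in for summarize(source); swap in the real call in your suite.
    summary = "A fox jumps over a dog."
    assert check_summary(source, summary) == []
```

Checks like these are deterministic, so they will not flake, but they only cover structure. Judging whether the summary is actually good still needs a scored evaluation.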
Building a regression test suite
Step 1: Create an eval dataset
Collect 50-100 test cases from real usage:
eval_dataset = [
    {
        "input": "Summarize this bug report: ...",
        "expected_qualities": ["mentions the error", "suggests a fix", "under 100 words"],
        "baseline_score": 4.2  # Score from current production prompt
    },
    # ... 49 more cases
]
Sources for test cases:
- Real user queries from production logs
- Edge cases that previously caused failures
- Adversarial inputs (prompt injection attempts)
- Domain-specific hard cases
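If your production logs are stored as JSON Lines, pulling a reproducible sample of distinct queries might look like the sketch below. The log format and the `query` field name are assumptions, and `sample_cases_from_logs` is a hypothetical helper; adapt both to whatever your logging actually records.

```python
import json
import random


def sample_cases_from_logs(log_lines, n, seed=0):
    """Pick up to n distinct user queries from JSONL log lines (assumed format)."""
    queries = []
    seen = set()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        query = record.get("query", "").strip()  # "query" is an assumed field name
        if query and query not in seen:
            seen.add(query)
            queries.append(query)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = rng.sample(queries, min(n, len(queries)))
    # expected_qualities and baseline_score still need to be filled in per case.
    return [{"input": q, "expected_qualities": [], "baseline_score": None} for q in chosen]
```

A random sample covers the typical queries. The edge, adversarial, and hard cases from the list above usually have to be added by hand.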
Step 2: Define scoring
def score_response(input, response, expected_qualities):
    """Use LLM-as-judge to score quality 1-5."""
    judge_prompt = f"""Rate this response 1-5 on these criteria:
{expected_qualities}
Input: {input}
Response: {response}
Score (1-5):"""
    score = call_llm("claude-sonnet-4.6", judge_prompt)
    return parse_score(score)
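`parse_score` is left undefined above. Here is a minimal sketch of what it might look like, assuming the judge replies with a digit from 1 to 5 at the start of its output. The strict format is a design choice made for this example, not something a judge model guarantees.

```python
import re


def parse_score(raw: str) -> int:
    """Parse a judge reply that starts with a single digit 1-5.

    Anything else (for example "4.5", "10", or free text) raises ValueError,
    so a malformed reply fails loudly instead of silently skewing the average.
    """
    match = re.match(r"\s*([1-5])(?:\s|$)", raw)
    if match is None:
        raise ValueError(f"no 1-5 score at start of judge output: {raw!r}")
    return int(match.group(1))
```

Being strict here is deliberate. A looser pattern that grabs the first digit anywhere in the reply would read the "1" out of an echoed "Score (1-5)".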
For coding tasks, you can also run the generated code and check whether its tests pass, which gives you a deterministic signal.
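A rough sketch of that deterministic check: write the generated code and a few assert-based tests to a temporary file and run it in a fresh interpreter. `generated_code_passes` is a hypothetical helper. It runs whatever code it is given, so it assumes you execute it inside a sandbox or throwaway container.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def generated_code_passes(code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run generated code followed by assert-based tests in a separate process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a hang as a failure
    # A failed assert or any uncaught exception gives a non-zero exit code.
    return result.returncode == 0
```

The boolean maps naturally onto the 1-5 scale (for example 5 for pass, 1 for fail) if you want to average it alongside judge scores.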
Step 3: Run baseline
Score your current production prompt against the full eval dataset:
baseline_scores = []
for case in eval_dataset:
    response = call_llm(current_prompt, case["input"])
    score = score_response(case["input"], response, case["expected_qualities"])
    baseline_scores.append(score)
avg_baseline = sum(baseline_scores) / len(baseline_scores)
print(f"Baseline: {avg_baseline:.2f}")  # e.g., 4.2
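The baseline loop and the one in Step 4 differ only in which prompt they use, so the loop can be factored into one function. This is a sketch under assumptions: `evaluate_prompt` is a hypothetical name, and the LLM call and scorer are passed in as arguments so the function can be exercised with stubs.

```python
from statistics import mean
from typing import Callable


def evaluate_prompt(
    prompt: str,
    dataset: list[dict],
    call_llm: Callable[[str, str], str],
    score_response: Callable[[str, str, list[str]], float],
) -> tuple[float, list[float]]:
    """Score one prompt against every case; return (average, per-case scores)."""
    scores = []
    for case in dataset:
        response = call_llm(prompt, case["input"])
        scores.append(score_response(case["input"], response, case["expected_qualities"]))
    return mean(scores), scores
```

Returning the per-case scores as well as the average keeps the individual results available for later comparison.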
Step 4: Test changes
Before deploying a prompt change:
new_scores = []
for case in eval_dataset:
    response = call_llm(new_prompt, case["input"])
    score = score_response(case["input"], response, case["expected_qualities"])
    new_scores.append(score)
avg_new = sum(new_scores) / len(new_scores)
# Fail if quality drops more than 5%
if avg_new < avg_baseline * 0.95:
    raise Exception(f"REGRESSION: {avg_baseline:.2f} → {avg_new:.2f}")
else:
    print(f"PASS: {avg_baseline:.2f} → {avg_new:.2f}")
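An average can stay above the threshold while a handful of cases collapse, which is exactly the edge-case failure described at the top. Since each dataset entry already carries a `baseline_score`, a per-case comparison is cheap to add. In this sketch `find_case_regressions` and the `max_drop` default of 1.0 are assumptions to tune for your data.

```python
def find_case_regressions(eval_dataset, new_scores, max_drop=1.0):
    """Return (index, baseline, new) for cases whose score fell by more than max_drop."""
    regressions = []
    for index, (case, new_score) in enumerate(zip(eval_dataset, new_scores)):
        baseline = case["baseline_score"]
        if baseline - new_score > max_drop:
            regressions.append((index, baseline, new_score))
    return regressions
```

Judge scores are noisy, so a small per-case drop is not necessarily meaningful. A threshold of a full point or more keeps the check from firing on noise.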
Step 5: CI/CD integration
Run regression tests automatically on every prompt change:
# GitHub Actions
- name: LLM Regression Test
  run: python run_eval.py --prompt prompts/v2.txt --threshold 0.95
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_KEY }}
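The article does not show `run_eval.py`, so here is one possible skeleton for it. The `--prompt` and `--threshold` flags match the workflow step above. The `--baseline` flag, the `evals/baseline.json` file with an `"average"` field, and the injected `evaluate` callable are all assumptions made for the sketch.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv):
    parser = argparse.ArgumentParser(description="LLM regression gate")
    parser.add_argument("--prompt", required=True, help="path to the candidate prompt file")
    parser.add_argument("--threshold", type=float, default=0.95,
                        help="minimum allowed ratio of new average to baseline average")
    # Assumed addition: where the stored baseline average lives.
    parser.add_argument("--baseline", default="evals/baseline.json")
    return parser.parse_args(argv)


def gate(avg_new: float, avg_baseline: float, threshold: float) -> int:
    """Return a process exit code: 0 for pass, 1 for regression."""
    if avg_new < avg_baseline * threshold:
        print(f"REGRESSION: {avg_baseline:.2f} -> {avg_new:.2f}")
        return 1
    print(f"PASS: {avg_baseline:.2f} -> {avg_new:.2f}")
    return 0


def main(argv, evaluate):
    """evaluate(prompt_text) -> average score; supplied by your eval code."""
    args = parse_args(argv)
    prompt_text = Path(args.prompt).read_text(encoding="utf-8")
    baseline = json.loads(Path(args.baseline).read_text(encoding="utf-8"))
    return gate(evaluate(prompt_text), baseline["average"], args.threshold)
```

In the real script you would finish with `sys.exit(main(sys.argv[1:], evaluate))`. GitHub Actions marks a step as failed when its command exits non-zero, which is what blocks the deploy.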
Tools
| Tool | Approach |
|---|---|
| Promptfoo | CLI-based, config-driven eval. Best for CI/CD. |
| DeepEval | Python framework with built-in metrics. |
| Braintrust | Platform with dataset management + scoring. |
| Custom script | 50 lines of Python (shown above). |
For most teams, start with a custom script. Graduate to Promptfoo or DeepEval when you need more metrics and team collaboration.
What to test
| Test type | What it catches |
|---|---|
| Quality regression | Prompt changes that degrade output |
| Format regression | Output structure changes (structured outputs help) |
| Safety regression | New prompts that are more vulnerable to injection |
| Cost regression | Changes that increase token usage |
| Latency regression | Changes that slow down responses |
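Of these, format regression is the easiest to check without a judge. This is a minimal sketch for responses that are supposed to be JSON objects; `check_format` and the idea of a fixed set of required keys are assumptions about your output contract.

```python
import json


def check_format(raw_output: str, required_keys: set[str]) -> list[str]:
    """Return format problems for a response expected to be a JSON object."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return ["top-level value is not a JSON object"]
    missing = sorted(required_keys - set(parsed))
    return [f"missing key: {key}" for key in missing]
```

Run it on every response in the eval set and fail the run if any case returns a non-empty list. Unlike quality scores, format checks have no reason to tolerate a percentage drop.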
The minimum viable approach
If you do nothing else:
- Save 20 real user queries as test cases
- Score them with your current prompt (baseline)
- Before any prompt change, re-score and compare
- Block deployment if average score drops >5%
This takes roughly 2 hours to set up and, as a rule of thumb, catches about 80% of regressions. See our AI testing guide for the full evaluation framework and our observability guide for monitoring quality in production.
Common pitfalls
Testing with too few examples
20 test cases is the minimum. Below that, a single outlier can skew your average by 5% or more, which is enough to trip a 5% threshold on its own. Aim for 50-100 cases for production systems.
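The arithmetic behind that 5% figure can be checked directly. This sketch assumes a baseline average of 4.2 (the example value used earlier) and a worst-case outlier: one response falling from 5 to 1 while every other score stays the same.

```python
def outlier_shift(n_cases: int, baseline_avg: float = 4.2, drop: float = 4.0) -> float:
    """Relative change in the average when one case's score falls by `drop`."""
    return (drop / n_cases) / baseline_avg


for n in (10, 19, 20, 50, 100):
    print(f"{n:>3} cases: {outlier_shift(n):.1%}")
```

Under these assumptions the shift is about 9.5% at 10 cases, just over 5% at 19, about 4.8% at 20, and about 1% at 100. That is why 20 is a floor and 50-100 is the comfortable range.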
Not testing edge cases
Your eval dataset should include:
- Happy path (60%): normal, expected inputs
- Edge cases (20%): unusual but valid inputs (very long, very short, multilingual)
- Adversarial (10%): prompt injection attempts, confusing inputs
- Regression cases (10%): inputs that previously caused failures
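If you tag each test case with its category, a few lines of code can check the dataset against this mix. The `category` field, the category names, and the 5-point tolerance below are all assumptions made for illustration.

```python
from collections import Counter

# Target shares from the list above; the key names are hypothetical labels.
TARGET_MIX = {"happy_path": 0.60, "edge_case": 0.20, "adversarial": 0.10, "regression": 0.10}


def mix_gaps(cases, target=TARGET_MIX, tolerance=0.05):
    """Return {category: actual_share} for categories off target by more than tolerance."""
    counts = Counter(case["category"] for case in cases)
    total = len(cases)
    gaps = {}
    for category, want in target.items():
        share = counts[category] / total if total else 0.0
        if abs(share - want) > tolerance:
            gaps[category] = share
    return gaps
```

An empty result means the dataset is within tolerance. Anything else tells you which category to collect more of.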
Ignoring latency regression
A prompt change that improves quality by 5% but doubles latency is often a net negative for users. Track latency alongside quality:
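import time
start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
response = call_llm(new_prompt, case["input"])
latency = time.perf_counter() - start
# baseline_latency: the same measurement taken with the current production prompt
if latency > baseline_latency * 1.5:
    print(f"LATENCY REGRESSION: {baseline_latency:.1f}s → {latency:.1f}s")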
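A single timing is noisy, because network jitter or a cold connection can easily account for a 1.5x difference. Taking the median of several runs is one way to steady the comparison. `median_latency` is a hypothetical helper, and the default of 5 runs is an arbitrary choice.

```python
import time
from statistics import median


def median_latency(call, runs: int = 5) -> float:
    """Median wall-clock seconds over several invocations of a zero-argument callable."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return median(samples)
```

You would call it as `median_latency(lambda: call_llm(new_prompt, case["input"]))` and compare the result with the same measurement for the current prompt. Each run is a real API call, so the sample size trades accuracy against cost.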
Not versioning prompts
Store prompts in version control alongside your eval results:
prompts/
  v1.txt            # Original prompt
  v2.txt            # Updated prompt
evals/
  v1_20260413.json  # Eval results for v1
  v2_20260413.json  # Eval results for v2
This gives you a complete history of what changed, when, and how it affected quality.
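Writing the result files in that naming scheme takes only a few lines. In this sketch the JSON payload layout (version, date, average, per-case scores) is an assumption, as is the `save_eval_results` helper itself.

```python
import json
from datetime import date
from pathlib import Path


def save_eval_results(version, scores, evals_dir="evals", run_date=None):
    """Write scores to <evals_dir>/<version>_<YYYYMMDD>.json and return the path."""
    run_date = run_date or date.today()
    path = Path(evals_dir) / f"{version}_{run_date:%Y%m%d}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "version": version,
        "date": run_date.isoformat(),
        "average": sum(scores) / len(scores),
        "scores": scores,  # per-case scores, in dataset order
    }
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

Committing these files next to the prompt they describe means a `git log` on `prompts/` and `evals/` shows each prompt change together with its measured effect.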
Related: How to Test AI Applications · LLM Observability · Structured Outputs Explained · How to Benchmark LLM Inference · LLM-as-a-Judge