Apr 30, 2026

LLM Regression Testing — How to Catch Quality Drops Before Production

You change a prompt. Tests pass. You deploy. Users report broken outputs. The prompt change degraded quality on edge cases your tests didn’t cover. This is one of the most common failure modes for LLM applications in production.

LLM regression testing catches these quality drops before they reach users.

Why traditional tests fail for LLMs

# This test is useless for LLMs
def test_summarize():
    result = summarize("The quick brown fox...")
    assert result == "A fox jumped over a dog"  # Will NEVER match exactly

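LLM output is non-deterministic in practice: the same input can produce differently worded outputs from run to run, so exact-match checks such as assertEqual or == fail even when the output is perfectly good. You need fuzzy evaluation that scores properties of the output rather than its exact text.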
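One way to replace an exact-match test is to assert on properties of the output. The sketch below is illustrative only: `check_summary`, its parameters, and the two sample strings are hypothetical stand-ins, not part of any library or of the pipeline described here.

```python
def check_summary(result: str, required_terms: list[str], max_words: int) -> list[str]:
    """Return a list of failed checks; an empty list means the output passed."""
    failures = []
    lowered = result.lower()
    for term in required_terms:
        if term.lower() not in lowered:
            failures.append(f"missing term: {term}")
    word_count = len(result.split())
    if word_count > max_words:
        failures.append(f"too long: {word_count} words > {max_words}")
    return failures


# Two differently worded outputs both satisfy the same property checks.
output_a = "A fox jumped over a dog."
output_b = "The quick fox leaps over the lazy dog."
assert check_summary(output_a, ["fox", "dog"], max_words=20) == []
assert check_summary(output_b, ["fox", "dog"], max_words=20) == []
```

Checks like these are cheap and deterministic, but they only cover surface properties. Judging whether a summary is actually good still needs a scoring step.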

Building a regression test suite

Step 1: Create an eval dataset

Collect 50-100 test cases from real usage:

eval_dataset = [
    {
        "input": "Summarize this bug report: ...",
        "expected_qualities": ["mentions the error", "suggests a fix", "under 100 words"],
        "baseline_score": 4.2  # Score from current production prompt
    },
    # ... 49 more cases
]

Sources for test cases:

Real user queries from production logs
Edge cases that previously caused failures
Adversarial inputs (prompt injection attempts)
Domain-specific hard cases
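Once cases are collected from these sources, it helps to keep them in a file and validate them on load. The following is a minimal sketch that assumes one JSON object per line (JSONL) with the same `input` and `expected_qualities` fields used in the dataset above. The function name and the file layout are assumptions.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"input", "expected_qualities"}


def load_eval_cases(path: str) -> list[dict]:
    """Load one JSON object per line and reject cases missing required fields."""
    cases = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        case = json.loads(line)
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"{path}:{line_no} is missing fields: {sorted(missing)}")
        cases.append(case)
    return cases
```

Failing loudly on a malformed case keeps a broken dataset from silently shrinking the eval.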

Step 2: Define scoring

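def score_response(input_text, response, expected_qualities):
    """Use LLM-as-judge to score quality 1-5."""
    # input_text avoids shadowing Python's built-in input()
    judge_prompt = f"""Rate this response 1-5 on these criteria:
    {expected_qualities}
    
    Input: {input_text}
    Response: {response}
    
    Score (1-5):"""
    
    # call_llm and parse_score are placeholders for your own helpers.
    # This judge call passes (model, prompt), while Steps 3 and 4 pass
    # (prompt, input); match whatever signature your wrapper defines.
    score = call_llm("claude-sonnet-4.6", judge_prompt)
    return parse_score(score)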
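
The scoring function relies on a `parse_score` helper that is not shown. One possible version is sketched below, under the assumption that the judge replies with a single integer from 1 to 5, possibly surrounded by a few words. It takes the first standalone digit in that range, so a judge that echoes "1-5" before its answer would be misread.

```python
import re


def parse_score(raw: str) -> int:
    """Extract the first standalone digit 1-5 from the judge's reply."""
    match = re.search(r"\b([1-5])\b", raw)
    if match is None:
        raise ValueError(f"no 1-5 score found in judge output: {raw!r}")
    return int(match.group(1))
```

Raising on an unparseable reply is a deliberate choice in this sketch: silently defaulting to a middle score would hide judge failures inside the average.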

For coding tasks, you can also run the generated code and check if tests pass — that’s a deterministic signal.
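
A rough sketch of that deterministic check follows. It assumes the generated code and the tests are plain Python strings and that the tests are simple `assert` statements. `passes_tests` is a hypothetical helper. Model-generated code is untrusted, so a real setup should run it in a sandbox or container rather than directly on a CI runner.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(generated_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run generated code plus assertion-style tests in a separate Python process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(generated_code + "\n\n" + test_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs as failures
        return proc.returncode == 0  # a failed assert exits non-zero
```

The pass/fail result can be mapped to a score (for example 5 or 1) and mixed into the same averages as the judge scores.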

Step 3: Run baseline

Score your current production prompt against the full eval dataset:

baseline_scores = []
for case in eval_dataset:
    response = call_llm(current_prompt, case["input"])
    score = score_response(case["input"], response, case["expected_qualities"])
    baseline_scores.append(score)

avg_baseline = sum(baseline_scores) / len(baseline_scores)
print(f"Baseline: {avg_baseline:.2f}")  # e.g., 4.2

Step 4: Test changes

Before deploying a prompt change:

new_scores = []
for case in eval_dataset:
    response = call_llm(new_prompt, case["input"])
    score = score_response(case["input"], response, case["expected_qualities"])
    new_scores.append(score)

avg_new = sum(new_scores) / len(new_scores)

# Fail if the average drops more than 5% below baseline
if avg_new < avg_baseline * 0.95:
    raise SystemExit(f"REGRESSION: {avg_baseline:.2f} → {avg_new:.2f}")  # non-zero exit fails CI
else:
    print(f"PASS: {avg_baseline:.2f} → {avg_new:.2f}")
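
An average can hide exactly the edge-case failures this process is meant to catch, because gains on easy cases offset a collapse on a hard one. As a complement to the average check, the sketch below compares scores case by case. `find_case_regressions` and the `max_drop` default are assumptions chosen for illustration.

```python
def find_case_regressions(baseline_scores, new_scores, max_drop=1.0):
    """Return (index, old, new) for every case whose score fell by more than max_drop."""
    if len(baseline_scores) != len(new_scores):
        raise ValueError("score lists must cover the same cases in the same order")
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(baseline_scores, new_scores))
        if old - new > max_drop
    ]


# Both lists average 4.50, yet the case at index 3 drops from 5 to 3.
baseline = [4, 4, 5, 5]
candidate = [5, 5, 5, 3]
print(find_case_regressions(baseline, candidate))  # [(3, 5, 3)]
```

Printing the offending cases also tells you which inputs to inspect before deciding whether the prompt change is acceptable.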

Step 5: CI/CD integration

Run regression tests automatically on every prompt change:

# GitHub Actions
- name: LLM Regression Test
  run: python run_eval.py --prompt prompts/v2.txt --threshold 0.95
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_KEY }}
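
The workflow step calls a `run_eval.py` script that is not shown. A minimal sketch of what it might look like is below. The `--prompt` and `--threshold` flags come from the step above. The `--baseline` flag, the `evals/baseline.json` file, and the `evaluate_prompt` placeholder are assumptions added so the example is complete.

```python
import argparse
import json
from pathlib import Path


def evaluate_prompt(prompt_text: str) -> float:
    """Placeholder: run the eval dataset and return the average judge score."""
    raise NotImplementedError("wire this to the scoring loop from Step 4")


def main(argv=None, evaluate=evaluate_prompt) -> int:
    parser = argparse.ArgumentParser(description="LLM regression gate")
    parser.add_argument("--prompt", required=True, help="path to the candidate prompt file")
    parser.add_argument("--threshold", type=float, default=0.95,
                        help="minimum allowed ratio of new score to baseline")
    # Hypothetical flag: a JSON file of the form {"avg_score": 4.2}
    parser.add_argument("--baseline", default="evals/baseline.json")
    args = parser.parse_args(argv)

    avg_baseline = json.loads(Path(args.baseline).read_text(encoding="utf-8"))["avg_score"]
    avg_new = evaluate(Path(args.prompt).read_text(encoding="utf-8"))

    if avg_new < avg_baseline * args.threshold:
        print(f"REGRESSION: {avg_baseline:.2f} -> {avg_new:.2f}")
        return 1  # a non-zero exit code fails the CI step
    print(f"PASS: {avg_baseline:.2f} -> {avg_new:.2f}")
    return 0


# At the bottom of run_eval.py: raise SystemExit(main())
```

Returning an exit code from `main` rather than raising keeps the gate easy to unit-test with a stubbed `evaluate` function.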

Tools

Tool	Approach
Promptfoo	CLI-based, config-driven eval. Best for CI/CD.
DeepEval	Python framework with built-in metrics.
Braintrust	Platform with dataset management + scoring.
Custom script	50 lines of Python (shown above).

For most teams, start with a custom script. Graduate to Promptfoo or DeepEval when you need more metrics and team collaboration.

What to test

Test type	What it catches
Quality regression	Prompt changes that degrade output
Format regression	Output structure changes (structured outputs help)
Safety regression	New prompts that are more vulnerable to injection
Cost regression	Changes that increase token usage
Latency regression	Changes that slow down responses
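
Format regressions are the row in this table that is easiest to check without a judge model. The sketch below assumes the application expects a JSON object with a known set of keys. `check_json_format` is a hypothetical helper, and the key names in the usage comment are examples.

```python
import json


def check_json_format(response: str, required_keys: set[str]) -> list[str]:
    """Return format problems for a response that should be a JSON object."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    missing = required_keys - data.keys()
    return [f"missing key: {key}" for key in sorted(missing)]


# Example: check_json_format(response, {"summary", "severity"}) returns [] on success.
```

A check like this can run on every eval case alongside the quality score, so a prompt change that starts wrapping JSON in prose is caught even if the judge still rates the content highly.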

The minimum viable approach

If you do nothing else:

Save 20 real user queries as test cases
Score them with your current prompt (baseline)
Before any prompt change, re-score and compare
Block deployment if average score drops >5%

This takes roughly 2 hours to set up and, as a rough rule of thumb, catches around 80% of regressions. See our AI testing guide for the full evaluation framework and our observability guide for monitoring quality in production.

Common pitfalls

Testing with too few examples

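20 test cases is the minimum. On a 1-5 scale, one case swinging from 5 to 1 moves the average by 4/N points. With fewer than 20 cases and a baseline near 4.2, that single outlier shifts the average by more than 5%, enough to trip the regression threshold on its own. Aim for 50-100 cases for production systems.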
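
The arithmetic behind that rule can be checked directly. This sketch assumes the worst case, a single response swinging across the full 1-5 scale (4 points), and uses the 4.2 baseline from the earlier example. `max_outlier_shift` is a hypothetical helper.

```python
def max_outlier_shift(n_cases: int, baseline_avg: float, score_range: float = 4.0) -> float:
    """Largest relative change in the average caused by one case swinging the full range."""
    return (score_range / n_cases) / baseline_avg


for n in (10, 20, 50, 100):
    print(f"{n:>3} cases: up to {max_outlier_shift(n, 4.2):.1%} shift from one outlier")
```

With these assumptions the worst-case shift is roughly 9.5% at 10 cases, 4.8% at 20, 1.9% at 50, and 1.0% at 100, which is why 20 cases sits right at the edge of a 5% threshold.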

Not testing edge cases

Your eval dataset should include:

Happy path (60%) — normal, expected inputs
Edge cases (20%) — unusual but valid inputs (very long, very short, multilingual)
Adversarial (10%) — prompt injection attempts, confusing inputs
Regression cases (10%) — inputs that previously caused failures
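
To keep the dataset close to this mix as it grows, the proportions can be checked automatically. The sketch below assumes each case carries a `category` field, which is not part of the dataset format shown earlier. The category names and the tolerance are likewise illustrative.

```python
from collections import Counter

TARGET_MIX = {"happy_path": 0.60, "edge_case": 0.20, "adversarial": 0.10, "regression": 0.10}


def mix_report(cases: list[dict], tolerance: float = 0.05) -> dict[str, tuple[float, bool]]:
    """Map each category to (actual share, whether it is within tolerance of target)."""
    counts = Counter(case["category"] for case in cases)
    total = len(cases)
    report = {}
    for category, target in TARGET_MIX.items():
        share = counts.get(category, 0) / total if total else 0.0
        report[category] = (share, abs(share - target) <= tolerance)
    return report
```

The percentages are a guideline, so a tolerance band is more practical than an exact match.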

Ignoring latency regression

A prompt change that improves quality by 5% but doubles latency is often a net negative for users. Track latency alongside quality:

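import time

# baseline_latency: seconds measured for the same case with the current prompt
start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
response = call_llm(new_prompt, case["input"])
latency = time.perf_counter() - start

# Flag responses more than 50% slower than baseline
if latency > baseline_latency * 1.5:
    print(f"LATENCY REGRESSION: {baseline_latency:.1f}s → {latency:.1f}s")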
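
A single timing is noisy, since network and provider load vary from request to request. One way to steady the comparison is to time every case in the dataset and compare medians. The helper names below are hypothetical, and `call` stands in for whatever function sends the prompt and input to the model.

```python
import statistics
import time


def measure_latencies(call, inputs):
    """Time each call with a monotonic clock and return the durations in seconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        call(item)
        latencies.append(time.perf_counter() - start)
    return latencies


def latency_regressed(baseline, candidate, max_ratio=1.5):
    """Compare medians, which are less sensitive to one slow request than means."""
    return statistics.median(candidate) > statistics.median(baseline) * max_ratio
```

If tail latency matters for your users, the same structure works with a high percentile in place of the median.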

Not versioning prompts

Store prompts in version control alongside your eval results:

prompts/
  v1.txt          # Original prompt
  v2.txt          # Updated prompt
evals/
  v1_20260413.json  # Eval results for v1
  v2_20260413.json  # Eval results for v2

This gives you a complete history of what changed, when, and how it affected quality.
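
A small helper can write result files in that layout. This is a sketch under assumptions: the JSON fields, the SHA-256 prompt hash, and the function name are illustrative choices, and only the `evals/<version>_<YYYYMMDD>.json` naming follows the tree above.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def save_eval_result(prompt_path: str, scores: list[float], out_dir: str = "evals") -> Path:
    """Write scores plus a hash of the prompt text to <out_dir>/<version>_<YYYYMMDD>.json."""
    prompt_file = Path(prompt_path)
    prompt_text = prompt_file.read_text(encoding="utf-8")
    record = {
        "prompt_file": prompt_file.name,
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "scores": scores,
        "avg_score": sum(scores) / len(scores),
    }
    out_path = Path(out_dir) / f"{prompt_file.stem}_{date.today():%Y%m%d}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return out_path
```

Storing the prompt hash next to the scores makes it possible to confirm later that a result file really belongs to the prompt text currently in the repository.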