65% of LLM applications fail in production within 90 days due to inadequate testing. The problem: traditional test frameworks check exact matches, but LLMs generate different outputs each time. You can't assertEqual() on a creative response.
Here's how to actually test AI applications.
Why LLM testing is different
| Traditional software | LLM applications |
|---|---|
| Deterministic output | Non-deterministic (same input → different output) |
| Binary pass/fail | Quality is a spectrum |
| Unit tests work | Exact match tests fail |
| Bugs cause errors | Bad output returns 200 OK |
| Test once, deploy | Quality drifts over time |
The four levels of LLM testing
Level 1: Format validation
The easiest to automate. Does the output match the expected structure?
import json

def test_output_format(response):
    # If using structured outputs, the response should parse as JSON
    data = json.loads(response)  # Raises json.JSONDecodeError on invalid JSON
    assert "answer" in data  # Required fields present?
    assert "sources" in data
    assert len(data["answer"]) > 10  # Not empty or trivially short?
Use structured outputs to make this trivial: the model is constrained to your schema at generation time.
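Checking that keys exist does not confirm their types. Here is a minimal sketch of a stricter check using only the standard library. It assumes the same `answer` and `sources` fields as the test above, and it assumes `sources` is a list of strings, which the original does not specify. `validate_response` is a hypothetical helper name.

```python
import json
from typing import List

def validate_response(raw: str) -> List[str]:
    """Return a list of format problems; an empty list means the output is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    answer = data.get("answer")
    if not isinstance(answer, str) or len(answer.strip()) <= 10:
        problems.append("'answer' must be a string longer than 10 characters")

    sources = data.get("sources")
    # Assumption: 'sources' is a list of strings (for example URLs or document IDs).
    if not isinstance(sources, list) or not all(isinstance(s, str) for s in sources):
        problems.append("'sources' must be a list of strings")

    return problems
```

Returning a list of problems instead of asserting lets you log every format failure for a response in one pass.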
Level 2: Factual correctness
Does the output contain correct information? This requires ground truth data.
eval_dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What year was Python created?", "expected": "1991"},
]

for case in eval_dataset:
    response = call_llm(case["input"])  # call_llm: your own model wrapper
    # Fuzzy match: the answer should CONTAIN the expected value
    assert case["expected"].lower() in response.lower(), case["input"]
For coding tasks, you can run the generated code and check whether its tests pass. That's the approach SWE-bench takes.
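As a rough illustration of the "run the code and check the tests" idea, the sketch below writes generated code and assertion-style tests to a temporary file and runs them in a separate Python process. It is far simpler than the SWE-bench harness, and a subprocess is not a sandbox, so untrusted output should run inside a container or VM. `passes_tests` is a hypothetical helper name.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def passes_tests(generated_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run generated code plus its tests in a separate Python process.

    Returns True if the process exits with code 0 (no assertion failed).
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(generated_code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # Treat hangs and infinite loops as failures
        return result.returncode == 0
```

The timeout matters in practice: generated code that loops forever should count as a failed case, not stall the whole eval run.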
Level 3: Quality evaluation (LLM-as-judge)
Use a second model to evaluate the first. This is the standard approach for subjective quality.
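def evaluate_quality(question, answer):
    # call_llm and parse_score are your own helpers; they are not defined here
    judge_prompt = f"""Rate this answer on a scale of 1-5:
Question: {question}
Answer: {answer}
Score (1-5) and one-sentence explanation:"""
    # The judge should be a different model from the one being evaluated
    judgment = call_llm(judge_prompt, model="claude-opus-4.6")
    return parse_score(judgment)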
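The snippet above calls `parse_score` without defining it. One possible implementation, offered as a sketch rather than a reference, pulls the first standalone digit from 1 to 5 out of the judge's reply. It assumes the judge answers in free text such as "4 - clear and accurate" and may echo the "1-5" scale from the prompt.

```python
import re
from typing import Optional

def parse_score(judgment: str) -> Optional[int]:
    """Extract the first standalone integer from 1 to 5 in the judge's reply.

    Returns None when no score is found, so the caller can retry or skip
    the case instead of recording a made-up number.
    """
    # Drop an echoed "1-5" scale so it is not mistaken for the score itself.
    text = judgment.replace("1-5", " ")
    match = re.search(r"(?<!\d)([1-5])(?!\d)", text)
    return int(match.group(1)) if match else None
```

Asking the judge for JSON output and validating it, as in Level 1, is usually more robust than parsing free text.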
When LLM-as-judge works: General quality, helpfulness, coherence, style.
When it fails: Factual accuracy (the judge can be wrong too), domain-specific correctness, subtle bugs in code.
Level 4: Regression testing
The most important for production. When you change a prompt, model, or configuration, does quality stay the same?
# Run the same eval dataset before and after a change
baseline_scores = run_eval(eval_dataset, prompt_v1)
new_scores = run_eval(eval_dataset, prompt_v2)

# Alert if average quality drops more than 5%
avg_baseline = sum(baseline_scores) / len(baseline_scores)
avg_new = sum(new_scores) / len(new_scores)
if avg_new < avg_baseline * 0.95:
    raise Exception(f"Quality regression: {avg_baseline:.2f} -> {avg_new:.2f}")
Run this in CI/CD before every prompt change reaches production.
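The regression check relies on a `run_eval` function that is not shown. The sketch below is one way it might look, assuming the containment scoring from Level 2 and a prompt template with an `{input}` placeholder. The extra `call_llm` parameter, the `has_regressed` helper, and the `fake_llm` stand-in are all illustrative assumptions, not part of the original snippet.

```python
from statistics import mean
from typing import Callable, Dict, List

def run_eval(
    dataset: List[Dict[str, str]],
    prompt_template: str,
    call_llm: Callable[[str], str],
) -> List[float]:
    """Score each case 1.0 if the expected value appears in the response, else 0.0."""
    scores = []
    for case in dataset:
        prompt = prompt_template.format(input=case["input"])
        response = call_llm(prompt)
        hit = case["expected"].lower() in response.lower()
        scores.append(1.0 if hit else 0.0)
    return scores

def has_regressed(baseline: List[float], new: List[float], tolerance: float = 0.05) -> bool:
    """True if the new mean score is more than `tolerance` (relative) below the baseline."""
    return mean(new) < mean(baseline) * (1 - tolerance)

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "The capital of France is Paris." if "France" in prompt else "I am not sure."

dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What year was Python created?", "expected": "1991"},
]
scores = run_eval(dataset, "Answer briefly: {input}", fake_llm)
print(scores)  # [1.0, 0.0]
```

Passing `call_llm` in as an argument keeps the eval loop testable with a stub like `fake_llm`, without network access or API costs.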
Building an eval dataset
Your eval dataset is the most valuable asset in LLM testing. Build it from:
- Real user queries: sample from production logs (with GDPR compliance)
- Edge cases: inputs that previously caused failures
- Adversarial inputs: prompt injection attempts, confusing queries
- Domain-specific cases: the hardest questions for your use case
Start with 50-100 cases. Grow it over time as you find new failure modes.
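One way to keep such a dataset manageable is to store one JSON object per line (JSONL) and tag each case with its origin, so you can see which of the four sources is underrepresented. The sketch below assumes hypothetical field names `input`, `expected`, and `category`; none of them are prescribed by any tool mentioned here.

```python
import json
from collections import Counter
from typing import Dict, List

REQUIRED_FIELDS = {"input", "expected", "category"}

def load_eval_cases(jsonl_text: str) -> List[Dict[str, str]]:
    """Parse one JSON object per line, skipping blank lines."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        case = json.loads(line)
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        cases.append(case)
    return cases

def coverage_by_category(cases: List[Dict[str, str]]) -> Counter:
    """Count cases per category to spot gaps in the dataset."""
    return Counter(case["category"] for case in cases)

SAMPLE = """\
{"input": "What is the capital of France?", "expected": "Paris", "category": "real_user"}
{"input": "capital of france??", "expected": "Paris", "category": "edge_case"}
"""

cases = load_eval_cases(SAMPLE)
coverage = coverage_by_category(cases)
```

A line-oriented format also diffs cleanly in version control, which helps when the dataset grows alongside the prompts it tests.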
Tools
| Tool | Best for | Open source? |
|---|---|---|
| DeepEval | Python eval framework | Yes |
| Promptfoo | CLI-based prompt testing | Yes |
| Braintrust | Full eval platform | Free tier |
| LangSmith | LangChain integration | No |
| Langfuse | Tracing + evals | Yes (MIT) |
For most teams, start with Promptfoo (simple CLI) or DeepEval (Python). Add a platform like Braintrust or Langfuse when you need dashboards and team collaboration.
The minimum viable eval
If you do nothing else, do this:
- Create 20 test cases from real user queries
- Run them against your current prompt (baseline)
- Before any prompt change, run the same 20 cases
- Compare scores and block deployment if quality drops >5%
This catches 80% of regressions with minimal effort. See our LLM observability guide for monitoring quality in production.
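The four steps above need one extra piece in practice: somewhere to keep the baseline between runs. A minimal sketch, assuming the baseline average is stored in a small JSON file and that scores are numeric (for example 1-5 judge ratings), could look like this. The `gate` function and the file layout are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Sequence, Union

def gate(new_scores: Sequence[float], baseline_path: Union[str, Path], max_drop: float = 0.05) -> bool:
    """Return True if deployment may proceed.

    On the first run (no baseline file yet) the current average is
    recorded as the baseline and the gate passes.
    """
    new_avg = sum(new_scores) / len(new_scores)
    path = Path(baseline_path)
    if not path.exists():
        path.write_text(json.dumps({"average": new_avg}), encoding="utf-8")
        return True
    baseline_avg = json.loads(path.read_text(encoding="utf-8"))["average"]
    return new_avg >= baseline_avg * (1 - max_drop)

# Demo: the first run records the baseline, later runs are compared against it.
with tempfile.TemporaryDirectory() as tmp:
    baseline_file = Path(tmp) / "baseline.json"
    first_run = gate([5, 4, 4, 5], baseline_file)     # No baseline yet: record and pass
    same_quality = gate([5, 4, 4, 5], baseline_file)  # Equal to the baseline: pass
    regression = gate([3, 3, 4, 3], baseline_file)    # Mean 3.25 against 4.5: fail
```

In a CI job you would exit with a non-zero status when `gate` returns False, which is what blocks the deployment.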
For AI coding tools
If you're building tools like Aider or OpenCode, your eval dataset should include:
- Multi-file refactoring tasks
- Bug fixes with known solutions
- Code generation from specs
- Edge cases (empty files, huge files, unusual languages)
The SWE-bench dataset is a good starting point for coding evals.
Related: LLM Observability for Developers · Why Parsing LLM Output Breaks · Structured Outputs Explained · How to Benchmark LLM Inference