Apr 27, 2026 · 3 min read

How to Test AI Applications — A Developer's Guide to LLM Evaluation

65% of LLM applications fail in production within 90 days due to inadequate testing. The problem: traditional test frameworks check exact matches, but an LLM can return a different output for the same input on every call. You can’t assertEqual() on a creative response.

Here’s how to actually test AI applications.

Why LLM testing is different

Traditional software	LLM applications
Deterministic output	Non-deterministic (same input → different output)
Binary pass/fail	Quality is a spectrum
Unit tests work	Exact match tests fail
Bugs cause errors	Bad output returns 200 OK
Test once, deploy	Quality drifts over time
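
One way to cope with non-determinism is to sample the same input several times and assert on the pass rate rather than on a single output. A minimal sketch, assuming a hypothetical `generate` callable that stands in for your model call (replaced here by a fake that cycles through canned answers so the example runs offline):

```python
from itertools import cycle
from typing import Callable


def pass_rate(generate: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], n: int = 10) -> float:
    """Call the model n times on the same prompt; return the fraction of outputs that pass."""
    passed = sum(1 for _ in range(n) if check(generate(prompt)))
    return passed / n


# Hypothetical stand-in for a real model call: cycles through canned outputs.
_canned = cycle(["Paris", "The capital is Paris.", "I think it's Lyon."])


def fake_generate(prompt: str) -> str:
    return next(_canned)


rate = pass_rate(fake_generate, "What is the capital of France?",
                 check=lambda out: "paris" in out.lower(), n=9)
print(f"pass rate: {rate:.2f}")
```

A test would then assert something like `rate >= 0.9` instead of comparing one output to one expected string. The threshold is a judgment call for your use case.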

The four levels of LLM testing

Level 1: Format validation

The easiest to automate. Does the output match the expected structure?

import json

def test_output_format(response):
    # If using structured outputs, the response should parse as JSON.
    # json.loads raises json.JSONDecodeError on invalid input, which fails the test.
    data = json.loads(response)
    assert "answer" in data      # Required fields?
    assert "sources" in data
    assert len(data["answer"]) > 10  # Not empty or trivially short?

Use structured outputs to make this trivial — the model is constrained to your schema at generation time.
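
When the schema has more than a couple of fields, collecting every violation at once tends to be more useful than stopping at the first failed assert. A sketch using only the standard library: the field names `answer` and `sources` come from the example above, while the expected types and the `validate_response` helper are assumptions for illustration.

```python
import json

# Assumed schema: field name -> expected Python type after JSON parsing.
EXPECTED_FIELDS = {"answer": str, "sources": list}


def validate_response(raw: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the output is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems


good = '{"answer": "Paris is the capital of France.", "sources": ["https://example.com"]}'
bad = '{"answer": 42}'
print(validate_response(good))
print(validate_response(bad))
```

Returning a list instead of raising makes it easy to log format failures per case and track the failure rate over time.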

Level 2: Factual correctness

Does the output contain correct information? This requires ground truth data.

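# Ground truth: each case pairs an input with a value the answer must contain
eval_dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What year was Python created?", "expected": "1991"},
]

for case in eval_dataset:
    response = call_llm(case["input"])  # call_llm: your own wrapper around the model API
    # Substring match (case-insensitive): the answer should CONTAIN the expected value
    assert case["expected"].lower() in response.lower(), f"{case['input']!r}: expected {case['expected']!r}"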

For coding tasks, you can run the generated code and check if tests pass — that’s what SWE-bench does.
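
A plain substring check is brittle: "1991" also matches "19910", and a single failed assert hides how the rest of the dataset did. One possible refinement is to normalize both sides, match on word boundaries, and report an accuracy figure. In this sketch `call_llm` is replaced by a hypothetical stub so the example runs offline, and the helper names are illustrative.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and replace punctuation with spaces so comparisons ignore formatting."""
    return re.sub(r"[^\w\s]", " ", text.lower())


def contains_answer(response: str, expected: str) -> bool:
    """True if the expected value appears in the response as a whole word or phrase."""
    pattern = r"\b" + re.escape(normalize(expected).strip()) + r"\b"
    return re.search(pattern, normalize(response)) is not None


def accuracy(cases, call_llm) -> float:
    """Fraction of cases whose response contains the expected value."""
    hits = sum(contains_answer(call_llm(c["input"]), c["expected"]) for c in cases)
    return hits / len(cases)


def stub_llm(prompt: str) -> str:
    # Hypothetical canned answers; the second one is deliberately wrong.
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What year was Python created?": "Python first appeared in 19910.",
    }
    return canned[prompt]


eval_dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What year was Python created?", "expected": "1991"},
]
score = accuracy(eval_dataset, stub_llm)
print(f"accuracy: {score:.0%}")
```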

Level 3: Quality evaluation (LLM-as-judge)

Use a second model to evaluate the first. This is the standard approach for subjective quality.

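def evaluate_quality(question, answer):
    # call_llm and parse_score are your own helpers: one sends a prompt to a
    # model, the other extracts the numeric score from the judge's reply.
    judge_prompt = f"""Rate this answer on a scale of 1-5:
    Question: {question}
    Answer: {answer}

    Score (1-5) and one-sentence explanation:"""

    judgment = call_llm(judge_prompt, model="claude-opus-4.6")
    return parse_score(judgment)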

When LLM-as-judge works: General quality, helpfulness, coherence, style.

When it fails: Factual accuracy (the judge can be wrong too), domain-specific correctness, subtle bugs in code.
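
`parse_score` is left undefined in the snippet above. One possible implementation, offered as an assumption rather than the original helper, pulls the first standalone digit from 1 to 5 out of the judge's reply and fails loudly when none is found, so an unparseable judgment is never silently counted as a score.

```python
import re

# Echoes of the scale itself, e.g. "1-5" or "1 to 5", which must not be read as a score.
_RANGE = re.compile(r"\b1\s*(?:-|to)\s*5\b")
# A lone digit 1..5 that is not part of a longer token such as 10, 3.5, or v2.
_SCORE = re.compile(r"(?<![\w.])([1-5])(?!\.?\d|\w)")


def parse_score(judgment: str) -> int:
    """Extract the first standalone integer score (1-5) from a judge reply."""
    cleaned = _RANGE.sub("", judgment)
    match = _SCORE.search(cleaned)
    if match is None:
        raise ValueError(f"no 1-5 score found in judge reply: {judgment!r}")
    return int(match.group(1))


print(parse_score("Score (1-5): 4. Clear and correct."))
print(parse_score("4/5, mostly accurate"))
```

Asking the judge for a fixed reply format (for example a JSON object with a `score` field) would make parsing simpler still. The regex approach is a fallback for free-text replies.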

Level 4: Regression testing

The most important for production. When you change a prompt, model, or configuration — does quality stay the same?

# Run the same eval dataset before and after a change.
# run_eval is your own helper; it should return one numeric score per case.
baseline_scores = run_eval(eval_dataset, prompt_v1)
new_scores = run_eval(eval_dataset, prompt_v2)

# Fail if average quality drops by more than 5%
avg_baseline = sum(baseline_scores) / len(baseline_scores)
avg_new = sum(new_scores) / len(new_scores)

if avg_new < avg_baseline * 0.95:
    raise RuntimeError(f"Quality regression: {avg_baseline:.2f} → {avg_new:.2f}")

Run this in CI/CD before every prompt change reaches production.
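
An average can stay flat while individual cases get much worse. A possible extension, assuming `run_eval` returns one numeric score per case in dataset order (the snippet above does not define it), is to report per-case drops next to the aggregate check. The function name and both thresholds are illustrative.

```python
from statistics import mean


def find_regressions(baseline, new, max_avg_drop=0.05, max_case_drop=1.0):
    """Compare two score lists of equal length, one score per eval case.

    Returns (avg_regressed, worse_cases), where worse_cases holds the indices
    of cases whose score fell by more than max_case_drop points.
    """
    if len(baseline) != len(new):
        raise ValueError("score lists must cover the same eval cases")
    avg_regressed = mean(new) < mean(baseline) * (1 - max_avg_drop)
    worse_cases = [i for i, (b, n) in enumerate(zip(baseline, new))
                   if b - n > max_case_drop]
    return avg_regressed, worse_cases


baseline_scores = [4, 5, 4, 3, 5]
new_scores = [5, 5, 5, 1, 5]  # same average, but the case at index 3 collapsed
avg_regressed, worse_cases = find_regressions(baseline_scores, new_scores)
print(avg_regressed, worse_cases)
```

In this example the average check alone would let the change through, while the per-case list points at the one input that needs a closer look.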

Building an eval dataset

Your eval dataset is the most valuable asset in LLM testing. Build it from:

Real user queries — sample from production logs (with GDPR compliance)
Edge cases — inputs that previously caused failures
Adversarial inputs — prompt injection attempts, confusing queries
Domain-specific cases — the hardest questions for your use case

Start with 50-100 cases. Grow it over time as you find new failure modes.
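
One way to keep such a dataset maintainable is to store it as JSON Lines with a tag recording where each case came from, so coverage per category is easy to check. The field names and the `category` values below are assumptions for illustration, not a standard format.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass

# Assumed category tags, mirroring the four sources listed above.
CATEGORIES = ("real_user", "edge_case", "adversarial", "domain")


@dataclass
class EvalCase:
    input: str
    expected: str
    category: str


def dump_jsonl(cases: list[EvalCase]) -> str:
    """Serialize cases as one JSON object per line."""
    return "\n".join(json.dumps(asdict(c)) for c in cases)


def load_jsonl(text: str) -> list[EvalCase]:
    """Parse JSON Lines text back into EvalCase objects, skipping blank lines."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


cases = [
    EvalCase("What is the capital of France?", "Paris", "real_user"),
    EvalCase("What year was Python created?", "1991", "real_user"),
    EvalCase("capital of france??", "Paris", "edge_case"),
]

loaded = load_jsonl(dump_jsonl(cases))
coverage = Counter(c.category for c in loaded)
missing = [name for name in CATEGORIES if coverage[name] == 0]
print("cases per category:", dict(coverage))
print("categories with no cases yet:", missing)
```

Checking for empty categories is a cheap way to notice that, for instance, no adversarial inputs have been added yet.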

Tools

Tool	Best for	Open source?
DeepEval	Python eval framework	✅
Promptfoo	CLI-based prompt testing	✅
Braintrust	Full eval platform	Free tier
LangSmith	LangChain integration	❌
Langfuse	Tracing + evals	✅ MIT

For most teams, start with Promptfoo (simple CLI) or DeepEval (Python). Add a platform like Braintrust or Langfuse when you need dashboards and team collaboration.

The minimum viable eval

If you do nothing else, do this:

Create 20 test cases from real user queries
Run them against your current prompt (baseline)
Before any prompt change, run the same 20 cases
Compare scores — block deployment if quality drops >5%

This catches 80% of regressions with minimal effort. See our LLM observability guide for monitoring quality in production.
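
The steps above could be wired into CI roughly as follows. This sketch assumes scores are stored in a JSON file between runs; the file name, the `gate` function, and the hard-coded scores are hypothetical placeholders for your own eval run.

```python
import json
import tempfile
from pathlib import Path


def save_baseline(path: Path, scores: list[float]) -> None:
    """Persist baseline scores so later runs can compare against them."""
    path.write_text(json.dumps({"scores": scores}))


def gate(path: Path, new_scores: list[float], max_drop: float = 0.05) -> bool:
    """Return True if the change may ship: the new average is within max_drop of the baseline."""
    baseline = json.loads(path.read_text())["scores"]
    avg_baseline = sum(baseline) / len(baseline)
    avg_new = sum(new_scores) / len(new_scores)
    return avg_new >= avg_baseline * (1 - max_drop)


with tempfile.TemporaryDirectory() as tmp:
    baseline_file = Path(tmp) / "baseline.json"
    save_baseline(baseline_file, [4.0] * 20)                # baseline run on 20 cases
    small_change_ok = gate(baseline_file, [4.0] * 19 + [3.0])  # one case slightly worse
    big_drop_ok = gate(baseline_file, [3.0] * 20)              # every case worse

print(small_change_ok, big_drop_ok)
```

In a real pipeline the script would exit with a non-zero status when `gate` returns False, which is what blocks the deployment.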

For AI coding tools

If you’re building tools like Aider or OpenCode, your eval dataset should include:

Multi-file refactoring tasks
Bug fixes with known solutions
Code generation from specs
Edge cases (empty files, huge files, unusual languages)

The SWE-bench dataset is a good starting point for coding evals.
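
For tasks such as bug fixes with known solutions, scoring usually means executing the candidate code against tests. A minimal sketch, assuming the candidate and its tests are plain Python source strings: it runs them in a fresh interpreter with a timeout, which isolates crashes and hangs but is not a security sandbox. The function name is illustrative.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code followed by assert-based tests in a separate Python process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_code + "\n\n" + test_code)
        try:
            result = subprocess.run([sys.executable, str(script)],
                                    capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failure
        return result.returncode == 0  # a failed assert exits with a non-zero status


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
fixed = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"

print(passes_tests(fixed, tests), passes_tests(buggy, tests))
```

Model-generated code should be treated as untrusted, so a real harness would run this inside a container or similar isolation layer.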
