πŸ€– AI Tools
Β· 3 min read

How to Test AI Applications β€” A Developer's Guide to LLM Evaluation


65% of LLM applications fail in production within 90 days due to inadequate testing. The problem: traditional test frameworks check exact matches, but LLMs generate different outputs each time. You can’t assertEqual() on a creative response.

Here’s how to actually test AI applications.

Why LLM testing is different

Traditional softwareLLM applications
Deterministic outputNon-deterministic (same input β†’ different output)
Binary pass/failQuality is a spectrum
Unit tests workExact match tests fail
Bugs cause errorsBad output returns 200 OK
Test once, deployQuality drifts over time

The four levels of LLM testing

Level 1: Format validation

The easiest to automate. Does the output match the expected structure?

def test_output_format(response):
    # If using structured outputs
    assert json.loads(response)  # Valid JSON?
    data = json.loads(response)
    assert "answer" in data      # Required fields?
    assert "sources" in data
    assert len(data["answer"]) > 10  # Not empty?

Use structured outputs to make this trivial β€” the model is constrained to your schema at generation time.

Level 2: Factual correctness

Does the output contain correct information? This requires ground truth data.

eval_dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What year was Python created?", "expected": "1991"},
]

for case in eval_dataset:
    response = call_llm(case["input"])
    # Fuzzy match β€” the answer should CONTAIN the expected value
    assert case["expected"].lower() in response.lower()

For coding tasks, you can run the generated code and check if tests pass β€” that’s what SWE-bench does.

Level 3: Quality evaluation (LLM-as-judge)

Use a second model to evaluate the first. This is the standard approach for subjective quality.

def evaluate_quality(question, answer):
    judge_prompt = f"""Rate this answer on a scale of 1-5:
    Question: {question}
    Answer: {answer}
    
    Score (1-5) and one-sentence explanation:"""
    
    judgment = call_llm(judge_prompt, model="claude-opus-4.6")
    return parse_score(judgment)

When LLM-as-judge works: General quality, helpfulness, coherence, style.

When it fails: Factual accuracy (the judge can be wrong too), domain-specific correctness, subtle bugs in code.

Level 4: Regression testing

The most important for production. When you change a prompt, model, or configuration β€” does quality stay the same?

# Run the same eval dataset before and after a change
baseline_scores = run_eval(eval_dataset, prompt_v1)
new_scores = run_eval(eval_dataset, prompt_v2)

# Alert if quality drops more than 5%
avg_baseline = sum(baseline_scores) / len(baseline_scores)
avg_new = sum(new_scores) / len(new_scores)

if avg_new < avg_baseline * 0.95:
    raise Exception(f"Quality regression: {avg_baseline:.2f} β†’ {avg_new:.2f}")

Run this in CI/CD before every prompt change reaches production.

Building an eval dataset

Your eval dataset is the most valuable asset in LLM testing. Build it from:

  1. Real user queries β€” sample from production logs (with GDPR compliance)
  2. Edge cases β€” inputs that previously caused failures
  3. Adversarial inputs β€” prompt injection attempts, confusing queries
  4. Domain-specific cases β€” the hardest questions for your use case

Start with 50-100 cases. Grow it over time as you find new failure modes.

Tools

ToolBest forOpen source?
DeepEvalPython eval frameworkβœ…
PromptfooCLI-based prompt testingβœ…
BraintrustFull eval platformFree tier
LangSmithLangChain integration❌
LangfuseTracing + evalsβœ… MIT

For most teams, start with Promptfoo (simple CLI) or DeepEval (Python). Add a platform like Braintrust or Langfuse when you need dashboards and team collaboration.

The minimum viable eval

If you do nothing else, do this:

  1. Create 20 test cases from real user queries
  2. Run them against your current prompt (baseline)
  3. Before any prompt change, run the same 20 cases
  4. Compare scores β€” block deployment if quality drops >5%

This catches 80% of regressions with minimal effort. See our LLM observability guide for monitoring quality in production.

For AI coding tools

If you’re building tools like Aider or OpenCode, your eval dataset should include:

  • Multi-file refactoring tasks
  • Bug fixes with known solutions
  • Code generation from specs
  • Edge cases (empty files, huge files, unusual languages)

The SWE-bench dataset is a good starting point for coding evals.

Related: LLM Observability for Developers Β· Why Parsing LLM Output Breaks Β· Structured Outputs Explained Β· How to Benchmark LLM Inference