πŸ€– AI Tools
Β· 3 min read

How to Build an LLM Eval Dataset β€” From Zero to Production-Ready


Your LLM app needs an eval dataset before you can test, A/B test, or regression test anything. Most teams skip this step and rely on vibes. Here’s how to build one properly.

What’s an eval dataset

An eval dataset is a collection of input-output pairs with quality scores. It answers: β€œGiven this input, is the model’s output good enough?”

{
  "input": "Summarize this bug report: Connection timeout after 30s when...",
  "expected_qualities": ["mentions the error type", "suggests a fix", "under 100 words"],
  "reference_output": "Connection timeout caused by...",
  "baseline_score": 4.2
}

You need 50-100 examples minimum. 200+ is better.

Step 1: Collect real inputs (week 1)

The best eval examples come from real usage. Don’t invent test cases β€” they won’t represent actual user behavior.

Sources:

  • Production logs (if you have them)
  • Customer support tickets
  • Internal team usage
  • Beta tester feedback
  • Edge cases that caused failures
# Extract diverse examples from production logs
import random

logs = load_production_logs(last_30_days=True)

# Sample across different input types
short_inputs = [l for l in logs if len(l["input"]) < 100]
long_inputs = [l for l in logs if len(l["input"]) > 500]
error_cases = [l for l in logs if l.get("user_feedback") == "bad"]

eval_set = (
    random.sample(short_inputs, 20) +
    random.sample(long_inputs, 20) +
    error_cases[:10]  # Include all known failures
)

No production data yet? Generate synthetic examples:

  1. Write 20 typical inputs yourself
  2. Ask colleagues to contribute 10 each
  3. Use an LLM to generate 20 edge cases: β€œGenerate 20 unusual inputs for a code review tool”
  4. Add 10 adversarial inputs (prompt injection attempts)

Step 2: Define quality criteria

What makes a β€œgood” output for your use case?

App typeQuality criteria
ChatbotHelpful, accurate, concise, safe
Code generatorCorrect, follows style guide, handles edge cases
SummarizerCaptures key points, correct length, no hallucination
ClassifierCorrect label, confidence calibrated
RAG appUses retrieved context, doesn’t hallucinate, cites sources

Write 3-5 specific criteria. Vague criteria (β€œis it good?”) produce inconsistent scores.

Step 3: Score your baseline

Run your current prompt against the eval dataset and score each output:

def score_output(input_text, output, criteria):
    """Score 1-5 using LLM-as-judge"""
    judge_prompt = f"""Score this output 1-5 on each criterion.

Criteria: {criteria}

Input: {input_text}
Output: {output}

Return JSON: {{"scores": {{"criterion1": N, ...}}, "overall": N, "reasoning": "..."}}"""
    
    result = call_llm("claude-opus-4.6", judge_prompt)
    return json.loads(result)

See our LLM-as-judge guide for best practices on automated scoring.

Also score manually for the first 20 examples. Compare human scores with LLM-judge scores to calibrate. If they disagree on more than 30% of examples, your rubric needs work.

Step 4: Categorize and balance

Tag each example with categories:

eval_dataset = [
    {"input": "...", "category": "simple_question", "difficulty": "easy"},
    {"input": "...", "category": "code_review", "difficulty": "hard"},
    {"input": "...", "category": "adversarial", "difficulty": "hard"},
]

Ensure your dataset covers:

  • 60% happy path β€” normal, expected inputs
  • 20% edge cases β€” unusual but valid inputs
  • 10% adversarial β€” prompt injection, confusing inputs
  • 10% regression cases β€” inputs that previously caused failures

Step 5: Version and maintain

Store your eval dataset in version control alongside your prompts:

evals/
  dataset_v1.json       # Initial dataset
  dataset_v2.json       # Added edge cases from production
  results/
    v1_prompt_v1.json   # Scores for prompt v1
    v1_prompt_v2.json   # Scores for prompt v2

Maintenance schedule:

  • Weekly: Add 2-3 new examples from production failures
  • Monthly: Review and remove outdated examples
  • Quarterly: Full dataset audit β€” are categories still balanced?

Step 6: Automate

Run evals automatically on every prompt change:

# GitHub Actions
- name: Run LLM Eval
  run: |
    python run_eval.py \
      --dataset evals/dataset_v2.json \
      --prompt prompts/current.txt \
      --threshold 4.0
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_KEY }}

See our regression testing guide for full CI/CD integration.

Common mistakes

  • Too few examples β€” 10 examples is not an eval dataset. Aim for 50+.
  • All easy examples β€” if every example scores 5/5, your dataset isn’t testing anything
  • No adversarial examples β€” you need to test failure modes, not just happy paths
  • Static dataset β€” production inputs evolve. Your eval dataset should too.
  • Scoring without rubric β€” β€œrate 1-5” without criteria produces inconsistent scores

Related: How to Test AI Applications Β· LLM-as-a-Judge Β· LLM Regression Testing Β· A/B Testing Prompts Β· Prompt Injection Explained