May 11, 2026 · 3 min read

How to Build an LLM Eval Dataset — From Zero to Production-Ready

Your LLM app needs an eval dataset before you can test, A/B test, or regression test anything. Most teams skip this step and rely on vibes. Here’s how to build one properly.

What’s an eval dataset

An eval dataset is a collection of input-output pairs with quality scores. It answers: “Given this input, is the model’s output good enough?”

{
  "input": "Summarize this bug report: Connection timeout after 30s when...",
  "expected_qualities": ["mentions the error type", "suggests a fix", "under 100 words"],
  "reference_output": "Connection timeout caused by...",
  "baseline_score": 4.2
}

You need 50-100 examples minimum. 200+ is better.
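
If you want to catch malformed records early, a small validator helps. The sketch below is one possible shape, not a required schema: the `EvalExample` class and `validate_example` function are hypothetical names, and the 1-5 range for `baseline_score` is an assumption taken from the scoring scale used later in this guide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalExample:
    """One record of the eval dataset (field names follow the JSON example above)."""
    input: str
    expected_qualities: List[str]
    reference_output: str = ""
    baseline_score: Optional[float] = None


def validate_example(raw: dict) -> EvalExample:
    """Convert a raw dict into an EvalExample, raising ValueError on bad records."""
    text = raw.get("input")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")

    qualities = raw.get("expected_qualities") or []
    if not qualities:
        raise ValueError("expected_qualities must list at least one criterion")

    score = raw.get("baseline_score")
    # Assumption: scores live on the 1-5 scale used by the judge prompt.
    if score is not None and not 1 <= score <= 5:
        raise ValueError("baseline_score must be between 1 and 5")

    return EvalExample(text, list(qualities), raw.get("reference_output", ""), score)
```

Running every record through a check like this when you load the file keeps empty inputs and out-of-range scores from skewing your averages later.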

Step 1: Collect real inputs (week 1)

The best eval examples come from real usage. Invented test cases rarely represent actual user behavior, so start from real data whenever you have it.

Sources:

Production logs (if you have them)
Customer support tickets
Internal team usage
Beta tester feedback
Edge cases that caused failures

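# Extract diverse examples from production logs
import random

# load_production_logs stands in for your own log-loading helper (not shown)
logs = load_production_logs(last_30_days=True)

# Sample across different input types
short_inputs = [l for l in logs if len(l["input"]) < 100]
long_inputs = [l for l in logs if len(l["input"]) > 500]
error_cases = [l for l in logs if l.get("user_feedback") == "bad"]

# random.sample raises ValueError when asked for more items than exist,
# so cap each sample size at the size of its bucket
eval_set = (
    random.sample(short_inputs, min(20, len(short_inputs))) +
    random.sample(long_inputs, min(20, len(long_inputs))) +
    error_cases[:10]  # Up to 10 known failures; raise the cap to include more
)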

No production data yet? Generate synthetic examples:

Write 20 typical inputs yourself
Ask colleagues to contribute 10 each
Use an LLM to generate 20 edge cases: “Generate 20 unusual inputs for a code review tool”
Add 10 adversarial inputs (prompt injection attempts)
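
When inputs come from several people and an LLM, duplicates creep in. The following is a minimal sketch of merging those contributions, assuming each source hands you a plain list of strings; `merge_synthetic` and the `source` field are hypothetical names, not part of any library.

```python
from typing import Dict, List


def merge_synthetic(sources: Dict[str, List[str]]) -> List[dict]:
    """Merge inputs from several sources, dropping near-identical duplicates.

    Two inputs count as duplicates when they match after lowercasing and
    collapsing whitespace. The first occurrence wins.
    """
    seen = set()
    merged = []
    for source, inputs in sources.items():
        for text in inputs:
            key = " ".join(text.lower().split())
            if not key or key in seen:
                continue  # skip blanks and repeats
            seen.add(key)
            merged.append({"input": text, "source": source})
    return merged
```

Keeping the `source` tag lets you check later whether, say, LLM-generated edge cases score differently from the ones your colleagues wrote.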

Step 2: Define quality criteria

What makes a “good” output for your use case?

App type	Quality criteria
Chatbot	Helpful, accurate, concise, safe
Code generator	Correct, follows style guide, handles edge cases
Summarizer	Captures key points, correct length, no hallucination
Classifier	Correct label, confidence calibrated
RAG app	Uses retrieved context, doesn’t hallucinate, cites sources

Write 3-5 specific criteria. Vague criteria (“is it good?”) produce inconsistent scores.
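
One way to keep criteria specific is to store them as named entries rather than a free-text sentence. The sketch below assumes the bug-report summarizer from the earlier example; `RUBRIC`, `render_rubric`, and `under_word_limit` are illustrative names. It also shows that a criterion like a word limit can be checked in plain code without a judge.

```python
from typing import Dict

# Illustrative rubric for the bug-report summarizer example; adapt to your app.
RUBRIC: Dict[str, str] = {
    "error_type": "Names the error type from the bug report",
    "fix": "Suggests at least one concrete fix",
    "length": "Is under 100 words",
}


def render_rubric(rubric: Dict[str, str]) -> str:
    """Format criteria as a numbered list to paste into a judge prompt."""
    return "\n".join(
        f"{i}. {name}: {description}"
        for i, (name, description) in enumerate(rubric.items(), start=1)
    )


def under_word_limit(output: str, limit: int = 100) -> bool:
    """Deterministic check for the length criterion (whitespace-separated words)."""
    return len(output.split()) < limit
```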

Step 3: Score your baseline

Run your current prompt against the eval dataset and score each output:

import json

def score_output(input_text, output, criteria):
    """Score 1-5 using LLM-as-judge"""
    judge_prompt = f"""Score this output 1-5 on each criterion.

Criteria: {criteria}

Input: {input_text}
Output: {output}

Return JSON: {{"scores": {{"criterion1": N, ...}}, "overall": N, "reasoning": "..."}}"""

    # call_llm stands in for your own model client wrapper (not shown)
    result = call_llm("claude-opus-4.6", judge_prompt)
    return json.loads(result)

See our LLM-as-judge guide for best practices on automated scoring.
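
Judge models sometimes wrap their JSON in extra prose, which makes a bare `json.loads` fail. Here is a hedged sketch of a more forgiving parser; `parse_judge_response` is a hypothetical helper, and it assumes the response contains exactly one JSON object with an `overall` score on the 1-5 scale.

```python
import json


def parse_judge_response(text: str) -> dict:
    """Extract the JSON object from a judge response and sanity-check it."""
    # Take everything from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge response")

    parsed = json.loads(text[start:end + 1])

    overall = parsed.get("overall")
    if not isinstance(overall, (int, float)) or not 1 <= overall <= 5:
        raise ValueError(f"overall score missing or out of range: {overall!r}")
    return parsed
```

Failing loudly on a bad response is usually safer than silently recording a zero, because a parse error would otherwise look like a quality drop.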

Also score the first 20 examples by hand, then compare your scores with the LLM judge's scores to calibrate it. If the two disagree on more than 30% of examples, your rubric needs work.
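
The calibration check can be sketched in a few lines. This guide does not define what counts as a disagreement, so the version below assumes two scores disagree when they differ by more than one point; `disagreement_rate` and its `tolerance` parameter are illustrative names you should adjust to your own definition.

```python
from typing import Sequence


def disagreement_rate(
    human: Sequence[float], judge: Sequence[float], tolerance: float = 1
) -> float:
    """Fraction of examples where human and judge scores differ by more than `tolerance`."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    disagreements = sum(1 for h, j in zip(human, judge) if abs(h - j) > tolerance)
    return disagreements / len(human)


# Usage: five hand-scored examples against the judge's scores.
human_scores = [5, 4, 3, 2, 5]
judge_scores = [5, 3, 1, 2, 2]
rate = disagreement_rate(human_scores, judge_scores)
needs_rubric_work = rate > 0.30  # the 30% cutoff from the text above
```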

Step 4: Categorize and balance

Tag each example with categories:

eval_dataset = [
    {"input": "...", "category": "simple_question", "difficulty": "easy"},
    {"input": "...", "category": "code_review", "difficulty": "hard"},
    {"input": "...", "category": "adversarial", "difficulty": "hard"},
]

Ensure your dataset covers:

60% happy path — normal, expected inputs
20% edge cases — unusual but valid inputs
10% adversarial — prompt injection, confusing inputs
10% regression cases — inputs that previously caused failures
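
A quick report can show how far your dataset is from that mix. The sketch below assumes each example carries a `bucket` field with one of four values; that field name, the bucket labels, and the 5-point tolerance are all assumptions for illustration and do not appear in the tagging example above.

```python
from collections import Counter
from typing import Dict, List

# Target mix from the list above, keyed by hypothetical bucket labels.
TARGET_MIX: Dict[str, float] = {
    "happy_path": 0.60,
    "edge_case": 0.20,
    "adversarial": 0.10,
    "regression": 0.10,
}


def mix_report(dataset: List[dict], tolerance: float = 0.05) -> Dict[str, dict]:
    """Compare each bucket's actual share with its target share."""
    counts = Counter(example["bucket"] for example in dataset)
    total = len(dataset)
    report = {}
    for bucket, want in TARGET_MIX.items():
        have = counts.get(bucket, 0) / total if total else 0.0
        report[bucket] = {
            "have": round(have, 2),
            "want": want,
            "ok": abs(have - want) <= tolerance,
        }
    return report
```

Any bucket with `"ok": False` tells you where to add or trim examples first.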

Step 5: Version and maintain

Store your eval dataset in version control alongside your prompts:

evals/
  dataset_v1.json       # Initial dataset
  dataset_v2.json       # Added edge cases from production
  results/
    v1_prompt_v1.json   # Scores for prompt v1
    v1_prompt_v2.json   # Scores for prompt v2

Maintenance schedule:

Weekly: Add 2-3 new examples from production failures
Monthly: Review and remove outdated examples
Quarterly: Full dataset audit — are categories still balanced?
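
The weekly step can be a small function that produces the next dataset version. This is a sketch under the assumption that the dataset is a list of dicts with an `input` key; `next_version` is a hypothetical helper, and writing the result to a new file such as the next `dataset_vN.json` is left to you.

```python
from typing import List, Tuple


def next_version(dataset: List[dict], new_failures: List[dict]) -> Tuple[List[dict], int]:
    """Return a new dataset with unseen failures appended, plus how many were added.

    The original list is left untouched so the previous version stays reproducible.
    """
    known = {example["input"] for example in dataset}
    updated = list(dataset)
    added = 0
    for failure in new_failures:
        if failure["input"] in known:
            continue  # already covered by an existing example
        known.add(failure["input"])
        updated.append(failure)
        added += 1
    return updated, added
```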

Step 6: Automate

Run evals automatically on every prompt change:

# GitHub Actions
- name: Run LLM Eval
  run: |
    python run_eval.py \
      --dataset evals/dataset_v2.json \
      --prompt prompts/current.txt \
      --threshold 4.0
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_KEY }}

See our regression testing guide for full CI/CD integration.
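
The workflow above calls a `run_eval.py` script that this guide does not show. As a rough sketch of its gating logic only, assuming the script collects one overall score per example and should fail the build when the mean drops below `--threshold`, it could look like the following. `build_parser` and `exit_code` are illustrative names, and running the prompt against the model is omitted.

```python
import argparse
from statistics import mean
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Command-line flags matching the workflow step above."""
    parser = argparse.ArgumentParser(description="Run the eval set and gate on the mean score")
    parser.add_argument("--dataset", required=True, help="path to the eval dataset JSON")
    parser.add_argument("--prompt", required=True, help="path to the prompt file")
    parser.add_argument("--threshold", type=float, default=4.0, help="minimum mean score")
    return parser


def exit_code(overall_scores: List[float], threshold: float) -> int:
    """Return 0 when the mean score meets the threshold, 1 otherwise."""
    if not overall_scores:
        return 1  # an empty run should never pass silently
    return 0 if mean(overall_scores) >= threshold else 1
```

A nonzero exit code is what makes the CI step fail, so the script would end with something like `sys.exit(exit_code(scores, args.threshold))`.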

Common mistakes

Too few examples — 10 examples is not an eval dataset. Aim for 50+.
All easy examples — if every example scores 5/5, your dataset isn’t testing anything
No adversarial examples — you need to test failure modes, not just happy paths
Static dataset — production inputs evolve. Your eval dataset should too.
Scoring without rubric — “rate 1-5” without criteria produces inconsistent scores
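
The "all easy examples" mistake is simple to detect from baseline results. The helper below is a minimal sketch, assuming you have one overall score per example on the 1-5 scale; `saturation` is a hypothetical name.

```python
from typing import Sequence


def saturation(overall_scores: Sequence[float], max_score: float = 5) -> float:
    """Fraction of examples that already receive the maximum score."""
    if not overall_scores:
        raise ValueError("no scores to analyze")
    at_max = sum(1 for score in overall_scores if score >= max_score)
    return at_max / len(overall_scores)
```

A value close to 1.0 suggests the dataset needs harder or more adversarial examples before it can tell two prompts apart.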
