Your LLM app needs an eval dataset before you can test, A/B test, or regression test anything. Most teams skip this step and rely on vibes. Hereβs how to build one properly.
Whatβs an eval dataset
An eval dataset is a collection of input-output pairs with quality scores. It answers: βGiven this input, is the modelβs output good enough?β
{
"input": "Summarize this bug report: Connection timeout after 30s when...",
"expected_qualities": ["mentions the error type", "suggests a fix", "under 100 words"],
"reference_output": "Connection timeout caused by...",
"baseline_score": 4.2
}
You need 50-100 examples minimum. 200+ is better.
Step 1: Collect real inputs (week 1)
The best eval examples come from real usage. Donβt invent test cases β they wonβt represent actual user behavior.
Sources:
- Production logs (if you have them)
- Customer support tickets
- Internal team usage
- Beta tester feedback
- Edge cases that caused failures
# Extract diverse examples from production logs
import random
logs = load_production_logs(last_30_days=True)
# Sample across different input types
short_inputs = [l for l in logs if len(l["input"]) < 100]
long_inputs = [l for l in logs if len(l["input"]) > 500]
error_cases = [l for l in logs if l.get("user_feedback") == "bad"]
eval_set = (
random.sample(short_inputs, 20) +
random.sample(long_inputs, 20) +
error_cases[:10] # Include all known failures
)
No production data yet? Generate synthetic examples:
- Write 20 typical inputs yourself
- Ask colleagues to contribute 10 each
- Use an LLM to generate 20 edge cases: βGenerate 20 unusual inputs for a code review toolβ
- Add 10 adversarial inputs (prompt injection attempts)
Step 2: Define quality criteria
What makes a βgoodβ output for your use case?
| App type | Quality criteria |
|---|---|
| Chatbot | Helpful, accurate, concise, safe |
| Code generator | Correct, follows style guide, handles edge cases |
| Summarizer | Captures key points, correct length, no hallucination |
| Classifier | Correct label, confidence calibrated |
| RAG app | Uses retrieved context, doesnβt hallucinate, cites sources |
Write 3-5 specific criteria. Vague criteria (βis it good?β) produce inconsistent scores.
Step 3: Score your baseline
Run your current prompt against the eval dataset and score each output:
def score_output(input_text, output, criteria):
"""Score 1-5 using LLM-as-judge"""
judge_prompt = f"""Score this output 1-5 on each criterion.
Criteria: {criteria}
Input: {input_text}
Output: {output}
Return JSON: {{"scores": {{"criterion1": N, ...}}, "overall": N, "reasoning": "..."}}"""
result = call_llm("claude-opus-4.6", judge_prompt)
return json.loads(result)
See our LLM-as-judge guide for best practices on automated scoring.
Also score manually for the first 20 examples. Compare human scores with LLM-judge scores to calibrate. If they disagree on more than 30% of examples, your rubric needs work.
Step 4: Categorize and balance
Tag each example with categories:
eval_dataset = [
{"input": "...", "category": "simple_question", "difficulty": "easy"},
{"input": "...", "category": "code_review", "difficulty": "hard"},
{"input": "...", "category": "adversarial", "difficulty": "hard"},
]
Ensure your dataset covers:
- 60% happy path β normal, expected inputs
- 20% edge cases β unusual but valid inputs
- 10% adversarial β prompt injection, confusing inputs
- 10% regression cases β inputs that previously caused failures
Step 5: Version and maintain
Store your eval dataset in version control alongside your prompts:
evals/
dataset_v1.json # Initial dataset
dataset_v2.json # Added edge cases from production
results/
v1_prompt_v1.json # Scores for prompt v1
v1_prompt_v2.json # Scores for prompt v2
Maintenance schedule:
- Weekly: Add 2-3 new examples from production failures
- Monthly: Review and remove outdated examples
- Quarterly: Full dataset audit β are categories still balanced?
Step 6: Automate
Run evals automatically on every prompt change:
# GitHub Actions
- name: Run LLM Eval
run: |
python run_eval.py \
--dataset evals/dataset_v2.json \
--prompt prompts/current.txt \
--threshold 4.0
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_KEY }}
See our regression testing guide for full CI/CD integration.
Common mistakes
- Too few examples β 10 examples is not an eval dataset. Aim for 50+.
- All easy examples β if every example scores 5/5, your dataset isnβt testing anything
- No adversarial examples β you need to test failure modes, not just happy paths
- Static dataset β production inputs evolve. Your eval dataset should too.
- Scoring without rubric β βrate 1-5β without criteria produces inconsistent scores
Related: How to Test AI Applications Β· LLM-as-a-Judge Β· LLM Regression Testing Β· A/B Testing Prompts Β· Prompt Injection Explained