53% of teams with deployed AI agents use LLM-as-a-judge for evaluation. The idea: use a capable model (like Claude Opus) to score the outputs of your production model. It scales, it’s cheap, and when done correctly it correlates well with human judgment.
But 93% of teams struggle with implementation. Here’s why, and how to get it right.
How it works
```python
def judge(question, answer, criteria, model="claude-opus-4.6"):
    prompt = f"""Evaluate this answer on a scale of 1-5.
Criteria: {criteria}
Question: {question}
Answer: {answer}
Provide a score (1-5) and a one-sentence justification."""
    return call_llm(model, prompt)
```
The judge model reads the original question, the model’s answer, and your evaluation criteria, then scores it. Simple in concept, tricky in practice.
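For example, a single call might look like this (the question, answer, and criteria here are purely illustrative):

```python
verdict = judge(
    question="How do I reverse a list in Python?",
    answer="Use my_list[::-1] to get a reversed copy, or my_list.reverse() in place.",
    criteria="Accurate, directly answers the question, concise",
)
print(verdict)  # e.g. "5 - Correct, direct, and covers both the copy and in-place options."
```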
When it works well
Subjective quality assessment: “Is this response helpful, clear, and well-structured?” LLM judges correlate 80-90% with human evaluators on these criteria.
Style and tone: “Does this match our brand voice?” Models are good at detecting style consistency.
Completeness: “Did the response address all parts of the question?” Easy for a judge to verify.
Safety and policy compliance: “Does this response contain harmful content?” Models are trained specifically for this.
Comparative ranking: “Which of these two responses is better?” Pairwise comparison is more reliable than absolute scoring.
When it fails
Factual accuracy: The judge can’t verify facts it doesn’t know. If your model says “Python was first released in 1989” (wrong; the first release was 1991), a judge model might not catch it.
Domain expertise: Medical, legal, or financial accuracy requires domain knowledge the judge may lack.
Position bias: Swap the order of two candidate responses in a pairwise comparison, and the verdict flips 10-30% of the time. The judge prefers whichever response it reads first (or last, depending on the model).
Self-preference: Models tend to rate their own outputs higher. Don’t use the same model to judge its own outputs.
Verbosity bias: Longer responses get higher scores even when shorter ones are better. Judges confuse length with quality.
Inconsistency: Run the same evaluation twice and you might get different scores. Temperature, prompt phrasing, and ordinary sampling nondeterminism all shift the results.
How to do it right
1. Use a stronger model as judge
The judge should be more capable than the model being evaluated. Use Claude Opus to judge Sonnet outputs, not the other way around.
2. Use rubrics, not vibes
Bad: “Rate this response 1-5”
Good: “Rate on these specific criteria: (1) addresses the question directly, (2) provides code examples, (3) handles edge cases, (4) is concise”
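In code, the rubric can simply be the criteria string you pass to the judge (a sketch; the wording and numbering are yours to tune):

```python
RUBRIC = """Score each criterion from 1-5, then give an overall 1-5 score:
1. Addresses the question directly
2. Provides code examples
3. Handles edge cases
4. Is concise (no filler, no repetition)"""

verdict = judge(question, answer, RUBRIC)
```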
3. Mitigate position bias
For pairwise comparisons, run each comparison twice with swapped order. Only count it as a win if the same response wins both times.
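A sketch of that double-run check, using the same call_llm wrapper assumed above (the prompt wording is illustrative):

```python
def pairwise_winner(question, response_1, response_2):
    """Order-swapped pairwise judging: only count a win if it holds in both orderings."""
    def ask(a, b):
        prompt = (f"Question: {question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
                  "Which response is better? Answer with exactly 'A' or 'B'.")
        return call_llm("claude-opus-4.6", prompt).strip()

    first = ask(response_1, response_2)   # response_1 shown as A
    second = ask(response_2, response_1)  # order swapped: response_1 shown as B

    if first == "A" and second == "B":
        return 1   # response_1 wins in both orderings
    if first == "B" and second == "A":
        return 2   # response_2 wins in both orderings
    return None    # verdict flipped with position: treat as a tie
```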
4. Calibrate with human scores
Score the same 50 examples with both human raters and the LLM judge, then calculate the correlation between the two sets of scores. If it’s below 0.7, your rubric needs work.
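A minimal calibration check might look like this, assuming you already have the paired scores (Spearman via scipy; Pearson works just as well):

```python
from scipy.stats import spearmanr

human_scores = [4, 5, 2, 3, 5, 1, 4, 3]   # scores from your human raters
judge_scores = [4, 4, 2, 4, 5, 2, 3, 3]   # LLM judge scores on the same examples

rho, _ = spearmanr(human_scores, judge_scores)
print(f"Human/judge correlation: {rho:.2f}")
if rho < 0.7:
    print("Rubric needs work: the judge disagrees with humans too often")
```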
5. Combine with deterministic checks
Use LLM-as-judge for subjective quality AND deterministic checks for objective criteria:
```python
import json

# Deterministic: format, length, required fields
assert len(response) < 500   # Conciseness
assert "```" in response     # Contains a code block
json.loads(response)         # Valid JSON (raises if it isn't)

# LLM judge: quality, helpfulness, accuracy
quality_score = judge(question, response, rubric)
```
6. Log everything
Store every judgment with the full context (question, answer, rubric, score, justification). You’ll need this for debugging and calibration. See our observability guide.
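One lightweight way to do this is an append-only JSONL file, one line per judgment (a sketch; swap in whatever storage your observability stack already uses):

```python
import json
from datetime import datetime, timezone

def log_judgment(path, question, answer, rubric, score, justification):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "rubric": rubric,
        "score": score,
        "justification": justification,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # append-only JSONL log
```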
Cost
LLM-as-judge adds cost — every evaluation is an API call. For a 100-case eval dataset:
| Judge model | Cost per eval run |
|---|---|
| Claude Opus | ~$1.50 |
| Claude Sonnet | ~$0.30 |
| DeepSeek | ~$0.03 |
| MiniMax M2.7 | ~$0.03 |
Use a cheap model for frequent CI/CD evals and Opus for weekly deep evaluations.
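One way to wire that up (a sketch; the environment variable and the model identifiers are placeholders for whatever your call_llm wrapper expects):

```python
import os

# Cheap judge on every CI run, a stronger judge for the scheduled weekly deep eval.
WEEKLY = os.environ.get("EVAL_MODE") == "weekly"
JUDGE_MODEL = "claude-opus-4.6" if WEEKLY else "cheaper-judge-model"

# ...then pass JUDGE_MODEL as the judge model in your eval run.
```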
When NOT to use LLM-as-judge
- Code correctness — run the code and check whether its tests pass instead (see the sketch after this list)
- Exact factual claims — verify against a ground truth database
- Regulatory compliance — use rule-based checks, not probabilistic judges
- High-stakes decisions — human review is still necessary
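For the code-correctness case, the deterministic check can be as simple as writing the generated code to disk and running the existing test suite (a sketch; it assumes your tests live in tests/ and import the candidate module):

```python
import subprocess
from pathlib import Path

def code_is_correct(generated_code: str) -> bool:
    """Deterministic check: does the generated code pass the existing test suite?"""
    Path("candidate.py").write_text(generated_code)
    result = subprocess.run(["pytest", "tests/", "-q"], capture_output=True)
    return result.returncode == 0
```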
Building your first eval pipeline
Here’s a minimal but complete eval pipeline you can set up in an afternoon:
```python
import json
import os
import re
from datetime import datetime

def run_eval(prompt_version, eval_dataset, judge_model="claude-opus-4.6"):
    results = []
    for case in eval_dataset:
        # Get response from your production model
        response = call_llm(prompt_version, case["input"])
        # Judge the response, then pull the first 1-5 digit out of the judge's reply
        verdict = judge(case["input"], response, case["criteria"], judge_model)
        score = int(re.search(r"[1-5]", verdict).group())  # simplified parsing
        results.append({
            "input": case["input"][:100],
            "score": score,
            "timestamp": datetime.now().isoformat(),
        })
    avg_score = sum(r["score"] for r in results) / len(results)

    # Save results for comparison
    os.makedirs("evals", exist_ok=True)
    with open(f"evals/{prompt_version}_{datetime.now():%Y%m%d}.json", "w") as f:
        json.dump({"avg_score": avg_score, "results": results}, f)
    return avg_score

# Compare versions
baseline = run_eval("v1", dataset)
candidate = run_eval("v2", dataset)

if candidate < baseline * 0.95:
    print(f"REGRESSION: {baseline:.2f} → {candidate:.2f}")
    exit(1)
else:
    print(f"PASS: {baseline:.2f} → {candidate:.2f}")
```
Store eval results in git alongside your prompts. This gives you a history of quality over time and makes it easy to identify when regressions were introduced.
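To pinpoint when a regression was introduced, you can diff any two saved runs (a sketch against the JSON files that run_eval writes above; the paths are just examples of that naming pattern):

```python
import json

def compare_runs(old_path, new_path):
    """Print the average-score delta between two saved eval runs."""
    with open(old_path) as f:
        old = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    delta = new["avg_score"] - old["avg_score"]
    print(f"{old['avg_score']:.2f} -> {new['avg_score']:.2f} ({delta:+.2f})")

# Paths follow the evals/{version}_{date}.json pattern used by run_eval above
compare_runs("evals/v1_20250101.json", "evals/v2_20250108.json")
```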
The future of LLM evaluation
LLM-as-judge is a stopgap. The field is moving toward:
- Reward models — purpose-trained models that score quality faster and cheaper
- Constitutional AI — models that evaluate against explicit principles
- Human-AI hybrid — AI does first pass, humans review disagreements
For now, LLM-as-judge with proper rubrics and calibration is the most practical approach for most teams.
See our AI testing guide for the full evaluation framework and our regression testing guide for CI/CD integration.
Related: How to Test AI Applications · LLM Regression Testing · LLM Observability · How to Benchmark LLM Inference