53% of teams with deployed AI agents use LLM-as-a-judge for evaluation. The idea: use a capable model (like Claude Opus) to score the outputs of your production model. It scales, it’s cheap, and when done correctly it correlates well with human judgment.
But 93% of teams struggle with implementation. Here’s why, and how to get it right.
How it works
```python
def judge(question, answer, criteria, model="claude-opus-4.6"):
    prompt = f"""Evaluate this answer on a scale of 1-5.
Criteria: {criteria}
Question: {question}
Answer: {answer}
Provide a score (1-5) and a one-sentence justification."""
    return call_llm(model, prompt)
```
The judge model reads the original question, the model’s answer, and your evaluation criteria, then scores it. Simple in concept, tricky in practice.
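For example, a single call might look like this (the question, answer, and criteria here are purely illustrative):

```python
verdict = judge(
    question="How do I reverse a list in Python?",
    answer="Use my_list[::-1] to get a reversed copy, or my_list.reverse() in place.",
    criteria="Accurate, directly answers the question, concise",
)
print(verdict)  # e.g. "5 - Correct, direct, and covers both the copy and in-place options."
```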
When it works well
Subjective quality assessment: “Is this response helpful, clear, and well-structured?” LLM judges correlate 80-90% with human evaluators on these criteria.
Style and tone: “Does this match our brand voice?” Models are good at detecting style consistency.
Completeness: “Did the response address all parts of the question?” Easy for a judge to verify.
Safety and policy compliance: “Does this response contain harmful content?” Models are trained specifically for this.
Comparative ranking: “Which of these two responses is better?” Pairwise comparison is more reliable than absolute scoring.
When it fails
Factual accuracy: The judge can’t verify facts it doesn’t know. If your model says “Python was first released in 1989” (wrong; the first release was 1991), a judge model might not catch it.
Domain expertise: Medical, legal, or financial accuracy requires domain knowledge the judge may lack.
Position bias: Swap the order of two candidate responses in a pairwise comparison, and the verdict flips 10-30% of the time. The judge prefers whichever response it reads first (or last, depending on the model).
Self-preference: Models tend to rate their own outputs higher. Don’t use the same model to judge its own outputs.
Verbosity bias: Longer responses get higher scores even when shorter ones are better. Judges confuse length with quality.
Inconsistency: Run the same evaluation twice and you might get different scores. Temperature, prompt phrasing, and ordinary sampling nondeterminism all shift the results.
How to do it right
1. Use a stronger model as judge
The judge should be more capable than the model being evaluated. Use Claude Opus to judge Sonnet outputs, not the other way around.
2. Use rubrics, not vibes
Bad: “Rate this response 1-5”
Good: “Rate on these specific criteria: (1) addresses the question directly, (2) provides code examples, (3) handles edge cases, (4) is concise”
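In code, the rubric can simply be the criteria string you pass to the judge (a sketch; the wording and numbering are yours to tune):

```python
RUBRIC = """Score each criterion from 1-5, then give an overall 1-5 score:
1. Addresses the question directly
2. Provides code examples
3. Handles edge cases
4. Is concise (no filler, no repetition)"""

verdict = judge(question, answer, RUBRIC)
```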
3. Mitigate position bias
For pairwise comparisons, run each comparison twice with swapped order. Only count it as a win if the same response wins both times.
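A sketch of that double-run check, using the same call_llm wrapper assumed above (the prompt wording is illustrative):

```python
def pairwise_winner(question, response_1, response_2):
    """Order-swapped pairwise judging: only count a win if it holds in both orderings."""
    def ask(a, b):
        prompt = (f"Question: {question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
                  "Which response is better? Answer with exactly 'A' or 'B'.")
        return call_llm("claude-opus-4.6", prompt).strip()

    first = ask(response_1, response_2)   # response_1 shown as A
    second = ask(response_2, response_1)  # order swapped: response_1 shown as B

    if first == "A" and second == "B":
        return 1   # response_1 wins in both orderings
    if first == "B" and second == "A":
        return 2   # response_2 wins in both orderings
    return None    # verdict flipped with position: treat as a tie
```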
4. Calibrate with human scores
Score the same 50 examples with both human raters and the LLM judge, then calculate the correlation between the two sets of scores. If it’s below 0.7, your rubric needs work.
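A minimal calibration check might look like this, assuming you already have the paired scores (Spearman via scipy; Pearson works just as well):

```python
from scipy.stats import spearmanr

human_scores = [4, 5, 2, 3, 5, 1, 4, 3]   # scores from your human raters
judge_scores = [4, 4, 2, 4, 5, 2, 3, 3]   # LLM judge scores on the same examples

rho, _ = spearmanr(human_scores, judge_scores)
print(f"Human/judge correlation: {rho:.2f}")
if rho < 0.7:
    print("Rubric needs work: the judge disagrees with humans too often")
```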
5. Combine with deterministic checks
Use LLM-as-judge for subjective quality AND deterministic checks for objective criteria:
```python
import json

# Deterministic: format, length, required fields
assert len(response) < 500   # Conciseness
assert "```" in response     # Contains a code block
json.loads(response)         # Valid JSON (raises if it isn't)

# LLM judge: quality, helpfulness, accuracy
quality_score = judge(question, response, rubric)
```
6. Log everything
Store every judgment with the full context (question, answer, rubric, score, justification). You’ll need this for debugging and calibration. See our observability guide.
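One lightweight way to do this is an append-only JSONL file, one line per judgment (a sketch; swap in whatever storage your observability stack already uses):

```python
import json
from datetime import datetime, timezone

def log_judgment(path, question, answer, rubric, score, justification):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "rubric": rubric,
        "score": score,
        "justification": justification,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # append-only JSONL log
```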
Cost
LLM-as-judge adds cost — every evaluation is an API call. For a 100-case eval dataset:
| Judge model | Cost per eval run |
|---|---|
| Claude Opus | ~$1.50 |
| Claude Sonnet | ~$0.30 |
| DeepSeek | ~$0.03 |
| MiniMax M2.7 | ~$0.03 |
Use a cheap model for frequent CI/CD evals and Opus for weekly deep evaluations.
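One way to wire that up (a sketch; the environment variable and the model identifiers are placeholders for whatever your call_llm wrapper expects):

```python
import os

# Cheap judge on every CI run, a stronger judge for the scheduled weekly deep eval.
WEEKLY = os.environ.get("EVAL_MODE") == "weekly"
JUDGE_MODEL = "claude-opus-4.6" if WEEKLY else "cheaper-judge-model"

# ...then pass JUDGE_MODEL as the judge model in your eval run.
```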
When NOT to use LLM-as-judge
- Code correctness — run the code and check whether its tests pass instead (see the sketch after this list)
- Exact factual claims — verify against a ground truth database
- Regulatory compliance — use rule-based checks, not probabilistic judges
- High-stakes decisions — human review is still necessary
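For the code-correctness case, the deterministic check can be as simple as writing the generated code to disk and running the existing test suite (a sketch; it assumes your tests live in tests/ and import the candidate module):

```python
import subprocess
from pathlib import Path

def code_is_correct(generated_code: str) -> bool:
    """Deterministic check: does the generated code pass the existing test suite?"""
    Path("candidate.py").write_text(generated_code)
    result = subprocess.run(["pytest", "tests/", "-q"], capture_output=True)
    return result.returncode == 0
```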
Building your first eval pipeline
Here’s a minimal but complete eval pipeline you can set up in an afternoon:
```python
import json
import os
import re
from datetime import datetime

def run_eval(prompt_version, eval_dataset, judge_model="claude-opus-4.6"):
    results = []
    for case in eval_dataset:
        # Get response from your production model
        response = call_llm(prompt_version, case["input"])
        # Judge the response, then pull the first 1-5 digit out of the judge's reply
        verdict = judge(case["input"], response, case["criteria"], judge_model)
        score = int(re.search(r"[1-5]", verdict).group())  # simplified parsing
        results.append({
            "input": case["input"][:100],
            "score": score,
            "timestamp": datetime.now().isoformat(),
        })
    avg_score = sum(r["score"] for r in results) / len(results)

    # Save results for comparison
    os.makedirs("evals", exist_ok=True)
    with open(f"evals/{prompt_version}_{datetime.now():%Y%m%d}.json", "w") as f:
        json.dump({"avg_score": avg_score, "results": results}, f)
    return avg_score

# Compare versions
baseline = run_eval("v1", dataset)
candidate = run_eval("v2", dataset)

if candidate < baseline * 0.95:
    print(f"REGRESSION: {baseline:.2f} → {candidate:.2f}")
    exit(1)
else:
    print(f"PASS: {baseline:.2f} → {candidate:.2f}")
```
Store eval results in git alongside your prompts. This gives you a history of quality over time and makes it easy to identify when regressions were introduced.
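To pinpoint when a regression was introduced, you can diff any two saved runs (a sketch against the JSON files that run_eval writes above; the paths are just examples of that naming pattern):

```python
import json

def compare_runs(old_path, new_path):
    """Print the average-score delta between two saved eval runs."""
    with open(old_path) as f:
        old = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    delta = new["avg_score"] - old["avg_score"]
    print(f"{old['avg_score']:.2f} -> {new['avg_score']:.2f} ({delta:+.2f})")

# Paths follow the evals/{version}_{date}.json pattern used by run_eval above
compare_runs("evals/v1_20250101.json", "evals/v2_20250108.json")
```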
The future of LLM evaluation
LLM-as-judge is a stopgap. The field is moving toward:
- Reward models — purpose-trained models that score quality faster and cheaper
- Constitutional AI — models that evaluate against explicit principles
- Human-AI hybrid — AI does first pass, humans review disagreements
For now, LLM-as-judge with proper rubrics and calibration is the most practical approach for most teams.
See our AI testing guide for the full evaluation framework and our regression testing guide for CI/CD integration.
Related: How to Test AI Applications · LLM Regression Testing · LLM Observability · How to Benchmark LLM Inference