How to Evaluate RAG Quality: Metrics, Tools, and Common Failures
Your RAG app returns answers. But are they correct? Are they using the right documents? Are they hallucinating? Without evaluation, you're guessing.
The two things to evaluate
RAG has two stages, and each can fail independently:
- Retrieval: did we find the right documents?
- Generation: did the LLM use them correctly?
A perfect retrieval with bad generation = wrong answer from right sources. A bad retrieval with perfect generation = confident answer from wrong sources (worse).
Retrieval metrics
Hit rate (recall@k)
"Was the correct document in the top K results?"
def hit_rate(eval_set, k=5):
    """Fraction of eval queries whose labeled relevant document appears in the top-k results."""
    hits = 0
    for example in eval_set:
        retrieved = vector_db.search(example["query"], top_k=k)  # your vector store client
        retrieved_ids = [doc.id for doc in retrieved]
        if example["relevant_doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)
Target: >90% hit rate at k=5. Below 80% means your embeddings or chunking strategy needs work.
Mean Reciprocal Rank (MRR)
"How high in the results is the correct document?"
def mrr(eval_set, k=5):
    """Average of 1/rank of the relevant document; counts 0 when it is not retrieved at all."""
    reciprocal_ranks = []
    for example in eval_set:
        retrieved = vector_db.search(example["query"], top_k=k)
        for i, doc in enumerate(retrieved):
            if doc.id == example["relevant_doc_id"]:
                reciprocal_ranks.append(1 / (i + 1))
                break
        else:
            # for/else: runs only when the loop finishes without a break (document not found)
            reciprocal_ranks.append(0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
Target: >0.7 MRR. This means the correct document is usually in the top 2 results.
Generation metrics
Faithfulness
"Does the answer only use information from the retrieved documents?"
import json

def check_faithfulness(query, answer, retrieved_docs):
    """Ask a judge model whether the answer makes claims not supported by the sources."""
    prompt = f"""Given these source documents and the answer,
does the answer contain any claims NOT supported by the documents?
Documents: {retrieved_docs}
Answer: {answer}
Return JSON: {{"faithful": true/false, "unsupported_claims": [...]}}"""
    # call_llm sends the prompt to the judge model and returns its text reply (sketched below)
    result = call_llm("claude-opus-4.6", prompt)
    return json.loads(result)
This catches hallucination, the #1 RAG failure mode.
Answer relevance
"Does the answer actually address the question?"
def check_relevance(query, answer):
    """Ask a judge model to score how well the answer addresses the question, from 1 to 5."""
    prompt = f"""Does this answer address the question? Score 1-5.
Question: {query}
Answer: {answer}
Score (1=irrelevant, 5=perfectly relevant):"""
    return parse_score(call_llm("claude-opus-4.6", prompt))
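Both checks assume two small helpers that aren't shown above: call_llm, which sends a prompt to the judge model and returns its text, and parse_score, which pulls the numeric score out of the reply. A minimal sketch of both, here using the Anthropic Python SDK (the client setup, max_tokens value, and regex are illustrative assumptions, not part of the original code):

import re
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_llm(model, prompt):
    """Send a single-turn prompt to the judge model and return the text of its reply."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def parse_score(text):
    """Pull the first 1-5 digit out of the judge's reply; return None if no score is found."""
    match = re.search(r"[1-5]", text)
    return int(match.group()) if match else None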
See our LLM-as-judge guide for best practices on automated scoring.
Building a RAG eval dataset
You need 50+ question-answer pairs with labeled relevant documents:
{
  "query": "What's the refund policy for annual plans?",
  "relevant_doc_id": "doc_refund_policy_v3",
  "expected_answer_contains": ["30 days", "prorated", "annual"],
  "expected_answer_not_contains": ["no refund"]
}
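Given that format, a small helper can turn the expected_answer_contains / expected_answer_not_contains fields into a pass/fail check on a generated answer (a sketch; the field names follow the example above, and case-insensitive substring matching is an assumption):

def check_expected_strings(example, answer):
    """Check a generated answer against the labeled expectations of one eval example."""
    answer_lower = answer.lower()
    missing = [s for s in example.get("expected_answer_contains", [])
               if s.lower() not in answer_lower]
    forbidden = [s for s in example.get("expected_answer_not_contains", [])
                 if s.lower() in answer_lower]
    return {"passed": not missing and not forbidden,
            "missing": missing,
            "forbidden": forbidden}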
See our eval dataset guide for the complete process.
Tools
| Tool | What it evaluates |
|---|---|
| Ragas | Retrieval + generation metrics, open source |
| DeepEval | RAG-specific metrics with LLM-as-judge |
| Langfuse | Trace retrieval + generation, score in dashboard |
| Custom | The code above (100 lines of Python) |
The minimum viable RAG eval
If you do nothing else:
- Create 20 test questions with known correct answers
- Run them through your RAG pipeline
- Check: did it retrieve the right documents? (hit rate)
- Check: did it answer correctly? (manual review or LLM-as-judge)
- Track these metrics over time
This takes 2 hours to set up and catches 80% of RAG quality issues.
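Wiring those steps together can be as small as the sketch below, which reuses hit_rate and check_relevance from earlier and appends each run to a log file so the numbers can be tracked over time (answer_query stands in for your RAG pipeline, and the file name is arbitrary):

import json
import datetime

def run_minimum_eval(eval_set, k=5):
    """Minimal eval: retrieval hit rate plus an LLM-as-judge relevance score per question."""
    scores = []
    for example in eval_set:
        answer = answer_query(example["query"])  # your RAG pipeline (assumed helper)
        score = check_relevance(example["query"], answer)
        if score is not None:
            scores.append(score)
    results = {
        "date": datetime.date.today().isoformat(),
        "hit_rate": hit_rate(eval_set, k=k),
        "avg_relevance": sum(scores) / len(scores) if scores else None,
    }
    # Append one JSON line per run so the metrics can be tracked over time.
    with open("rag_eval_history.jsonl", "a") as f:
        f.write(json.dumps(results) + "\n")
    return results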
Common RAG failure modes
Understanding what can go wrong helps you build better evals:
| Failure | Symptom | Fix |
|---|---|---|
| Wrong documents retrieved | Answer is confident but incorrect | Improve embeddings, chunking, or add metadata filters |
| Right documents, wrong answer | LLM ignores or misinterprets the context | Improve prompt, reduce context window noise, use a better model |
| Hallucination | Answer contains facts not in any document | Add faithfulness check, use an "only answer from provided context" instruction |
| Outdated information | Answer uses old version of a document | Implement a document refresh pipeline; add timestamps to chunks and prefer recent documents in ranking |
| Partial retrieval | Answer is incomplete because not all relevant chunks were found | Increase k, improve chunking overlap, use hybrid search |
| Context overflow | Too many documents stuffed into the prompt | Reduce top_k, use reranking to pick the best documents |
When to use LLM-as-judge vs human evaluation
Use LLM-as-judge for:
- Faithfulness checks (is the answer grounded in sources?)
- Relevance scoring (does the answer address the question?)
- Running evals at scale (hundreds of test cases)
Use human evaluation for:
- Building your initial eval dataset (ground truth)
- Validating that LLM-as-judge agrees with human judgment
- Edge cases where nuance matters (legal, medical, financial content)
A good approach: start with 20 human-evaluated examples, then use LLM-as-judge for ongoing monitoring with periodic human spot-checks.
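One way to run that validation is a simple agreement rate between the judge and your human labels. The sketch below reuses check_relevance and assumes each human-labeled example carries a stored answer and a human_pass boolean (both field names, and the 4-out-of-5 pass threshold, are assumptions):

def judge_human_agreement(labeled_examples, pass_threshold=4):
    """Fraction of human-labeled examples where the LLM judge reaches the same pass/fail verdict."""
    agreements = 0
    for ex in labeled_examples:
        score = check_relevance(ex["query"], ex["answer"])
        judge_pass = score is not None and score >= pass_threshold
        if judge_pass == ex["human_pass"]:
            agreements += 1
    return agreements / len(labeled_examples)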
FAQ
How many test questions do I need for a reliable eval?
Start with 20-50 covering your main use cases. For production systems, aim for 100+ with good coverage of edge cases. The key is diversity: cover different question types, document types, and expected answer formats.
How often should I run RAG evaluations?
Run the full eval suite after every change to: embedding model, chunking strategy, retrieval parameters, prompt template, or LLM model. For production monitoring, run a subset daily and the full suite weekly.
My hit rate is good but answers are still wrong. Why?
Good retrieval + bad answers usually means: (1) the LLM is ignoring the context, (2) the context is too long and the relevant part gets lost in the middle, or (3) the prompt doesn't clearly instruct the model to only use provided sources. Try reducing chunk size or adding "Answer ONLY based on the provided documents" to your prompt.
Should I evaluate RAG differently for chat vs single-query?
Yes. Chat RAG needs additional evaluation for: conversation context handling (does it remember previous questions?), coreference resolution (does "it" refer to the right thing?), and progressive refinement (does follow-up retrieval improve answers?).
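A multi-turn eval case can extend the dataset format above with the prior turns, so the test exercises coreference as well as retrieval (a sketch; the conversation field and its shape are assumptions, the rest mirrors the earlier example):

multi_turn_example = {
    # Prior turns the pipeline must take into account when retrieving and answering.
    "conversation": [
        {"role": "user", "content": "What's the refund policy for annual plans?"},
        {"role": "assistant", "content": "Annual plans can be refunded within 30 days, prorated."},
    ],
    # "that" only resolves correctly if the pipeline carries the refund-policy context forward.
    "query": "Does that apply to monthly plans too?",
    "relevant_doc_id": "doc_refund_policy_v3",
    "expected_answer_contains": ["monthly"],
}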