πŸ€– AI Tools Β· 5 min read

How to Evaluate RAG Quality β€” Metrics, Tools, and Common Failures


Your RAG app returns answers. But are they correct? Are they using the right documents? Are they hallucinating? Without evaluation, you’re guessing.

The two things to evaluate

RAG has two stages, and each can fail independently:

  1. Retrieval β€” did we find the right documents?
  2. Generation β€” did the LLM use them correctly?

A perfect retrieval with bad generation = wrong answer from right sources. A bad retrieval with perfect generation = confident answer from wrong sources (worse).

Retrieval metrics

Hit rate (recall@k)

β€œWas the correct document in the top K results?”

def hit_rate(eval_set, k=5):
    """Fraction of eval queries whose labeled relevant document appears in the top-k results."""
    hits = 0
    for example in eval_set:
        retrieved = vector_db.search(example["query"], top_k=k)  # your vector store client
        retrieved_ids = [doc.id for doc in retrieved]
        if example["relevant_doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)

Target: >90% hit rate at k=5. Below 80% means your embeddings or chunking strategy needs work.

Mean Reciprocal Rank (MRR)

β€œHow high in the results is the correct document?”

def mrr(eval_set, k=5):
    """Average of 1/rank of the relevant document; 0 when it is not retrieved at all."""
    reciprocal_ranks = []
    for example in eval_set:
        retrieved = vector_db.search(example["query"], top_k=k)
        for i, doc in enumerate(retrieved):
            if doc.id == example["relevant_doc_id"]:
                reciprocal_ranks.append(1 / (i + 1))
                break
        else:  # no break: relevant doc missing from the top k
            reciprocal_ranks.append(0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

Target: >0.7 MRR. This means the correct document is usually in the top 2 results.
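To make the metric concrete, here is a tiny standalone example computing MRR directly from rank positions (hypothetical ranks, no vector DB needed):

```python
def mrr_from_ranks(ranks):
    """MRR from 1-based rank positions; None means the relevant doc was not retrieved."""
    return sum(0 if r is None else 1 / r for r in ranks) / len(ranks)

# Three queries: correct doc at rank 1, at rank 2, and not retrieved at all.
print(mrr_from_ranks([1, 2, None]))  # (1 + 0.5 + 0) / 3 = 0.5
```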

Generation metrics

Faithfulness

β€œDoes the answer only use information from the retrieved documents?”

import json

def check_faithfulness(query, answer, retrieved_docs):
    prompt = f"""Given these source documents and the answer,
    does the answer contain any claims NOT supported by the documents?

    Documents: {retrieved_docs}
    Answer: {answer}

    Return JSON: {{"faithful": true/false, "unsupported_claims": [...]}}"""

    result = call_llm("claude-opus-4.6", prompt)
    return json.loads(result)

This catches hallucination β€” the #1 RAG failure mode.

Answer relevance

β€œDoes the answer actually address the question?”

def check_relevance(query, answer):
    prompt = f"""Does this answer address the question? Score 1-5.
    
    Question: {query}
    Answer: {answer}
    
    Score (1=irrelevant, 5=perfectly relevant):"""
    
    return parse_score(call_llm("claude-opus-4.6", prompt))
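Here `call_llm` and `parse_score` are assumed helpers, not real library calls. A minimal sketch of `parse_score`, assuming the judge's reply contains the digit somewhere in its text:

```python
import re

def parse_score(text):
    """Pull the first 1-5 digit out of an LLM judge reply; fail loudly if absent."""
    match = re.search(r"\b([1-5])\b", text)
    if match is None:
        raise ValueError(f"No 1-5 score found in: {text!r}")
    return int(match.group(1))

print(parse_score("Score: 4. Mostly relevant."))  # 4
```

Failing loudly beats silently defaulting to a middle score, which would mask judge failures in aggregate metrics.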

See our LLM-as-judge guide for best practices on automated scoring.

Building a RAG eval dataset

You need 50+ question-answer pairs with labeled relevant documents:

{
    "query": "What's the refund policy for annual plans?",
    "relevant_doc_id": "doc_refund_policy_v3",
    "expected_answer_contains": ["30 days", "prorated", "annual"],
    "expected_answer_not_contains": ["no refund"]
}
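Given entries in this shape, the substring checks can be scored without an LLM at all. A sketch (`check_answer` is a made-up helper name):

```python
def check_answer(example, answer):
    """Return the list of failed expectations for one eval example (empty = pass)."""
    answer_lower = answer.lower()
    failures = []
    for phrase in example.get("expected_answer_contains", []):
        if phrase.lower() not in answer_lower:
            failures.append(f"missing: {phrase!r}")
    for phrase in example.get("expected_answer_not_contains", []):
        if phrase.lower() in answer_lower:
            failures.append(f"forbidden: {phrase!r}")
    return failures

example = {
    "expected_answer_contains": ["30 days", "prorated"],
    "expected_answer_not_contains": ["no refund"],
}
print(check_answer(example, "Annual plans are refundable for 30 days, prorated."))  # []
```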

See our eval dataset guide for the complete process.

Tools

| Tool | What it evaluates |
| --- | --- |
| Ragas | Retrieval + generation metrics, open source |
| DeepEval | RAG-specific metrics with LLM-as-judge |
| Langfuse | Trace retrieval + generation, score in dashboard |
| Custom | The code above (~100 lines of Python) |

The minimum viable RAG eval

If you do nothing else:

  1. Create 20 test questions with known correct answers
  2. Run them through your RAG pipeline
  3. Check: did it retrieve the right documents? (hit rate)
  4. Check: did it answer correctly? (manual review or LLM-as-judge)
  5. Track these metrics over time

This takes 2 hours to set up and catches 80% of RAG quality issues.
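Those five steps fit in one loop. A sketch, assuming you supply your own `retrieve` and `generate` callables (both hypothetical placeholders for your pipeline):

```python
def run_minimal_eval(test_set, retrieve, generate, k=5):
    """retrieve(query, k) -> list of doc ids; generate(query, doc_ids) -> answer string."""
    hits, results = 0, []
    for ex in test_set:
        doc_ids = retrieve(ex["query"], k)
        hit = ex["relevant_doc_id"] in doc_ids  # step 3: hit rate
        hits += hit
        # Step 4: store the answer for manual review or an LLM-as-judge pass.
        results.append({"query": ex["query"], "hit": hit,
                        "answer": generate(ex["query"], doc_ids)})
    return {"hit_rate": hits / len(test_set), "results": results}

# Stub components just to show the call shape:
report = run_minimal_eval(
    [{"query": "refund policy?", "relevant_doc_id": "doc_a"}],
    retrieve=lambda q, k: ["doc_a", "doc_b"],
    generate=lambda q, docs: "stub answer",
)
print(report["hit_rate"])  # 1.0
```

Log `hit_rate` per run (step 5) and you have a trend line from day one.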

Common RAG failure modes

Understanding what can go wrong helps you build better evals:

| Failure | Symptom | Fix |
| --- | --- | --- |
| Wrong documents retrieved | Answer is confident but incorrect | Improve embeddings, chunking, or add metadata filters |
| Right documents, wrong answer | LLM ignores or misinterprets the context | Improve prompt, reduce context window noise, use a better model |
| Hallucination | Answer contains facts not in any document | Add faithfulness check, use "only answer from provided context" instruction |
| Outdated information | Answer uses old version of a document | Add timestamps to chunks, prefer recent documents in ranking |
| Partial retrieval | Answer is incomplete because not all relevant chunks were found | Increase k, improve chunking overlap, use hybrid search |
| Context overflow | Too many documents stuffed into the prompt | Reduce top_k, use reranking to pick the best documents |

When to use LLM-as-judge vs human evaluation

Use LLM-as-judge for:

  • Faithfulness checks (is the answer grounded in sources?)
  • Relevance scoring (does the answer address the question?)
  • Running evals at scale (hundreds of test cases)

Use human evaluation for:

  • Building your initial eval dataset (ground truth)
  • Validating that LLM-as-judge agrees with human judgment
  • Edge cases where nuance matters (legal, medical, financial content)

A good approach: start with 20 human-evaluated examples, then use LLM-as-judge for ongoing monitoring with periodic human spot-checks.
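The "validate the judge" step is just an agreement rate over paired labels. A minimal sketch, assuming boolean labels ("answer acceptable" yes/no):

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of examples where the LLM judge matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("Label lists must be the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# e.g. 18 of 20 matching labels -> 0.9 agreement
```

As a rough rule of thumb, if agreement falls much below ~0.8, recalibrate the judge prompt before trusting its scores at scale.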

FAQ

How many test questions do I need for a reliable eval?

Start with 20-50 covering your main use cases. For production systems, aim for 100+ with good coverage of edge cases. The key is diversity β€” cover different question types, document types, and expected answer formats.

How often should I run RAG evaluations?

Run the full eval suite after every change to: embedding model, chunking strategy, retrieval parameters, prompt template, or LLM model. For production monitoring, run a subset daily and the full suite weekly.

My hit rate is good but answers are still wrong. Why?

Good retrieval + bad answers usually means: (1) the LLM is ignoring the context, (2) the context is too long and the relevant part gets lost in the middle, or (3) the prompt doesn’t clearly instruct the model to only use provided sources. Try reducing chunk size or adding β€œAnswer ONLY based on the provided documents” to your prompt.

Should I evaluate RAG differently for chat vs single-query?

Yes. Chat RAG needs additional evaluation for: conversation context handling (does it remember previous questions?), coreference resolution (does β€œit” refer to the right thing?), and progressive refinement (does follow-up retrieval improve answers?).