How to Evaluate RAG Quality: Metrics, Tools, and Common Failures
Your RAG app returns answers. But are they correct? Are they using the right documents? Are they hallucinating? Without evaluation, you're guessing.
The two things to evaluate
RAG has two stages, and each can fail independently:
- Retrieval: did we find the right documents?
- Generation: did the LLM use them correctly?
A perfect retrieval with bad generation = wrong answer from right sources. A bad retrieval with perfect generation = confident answer from wrong sources (worse).
Retrieval metrics
Hit rate (recall@k)
"Was the correct document in the top K results?"
def hit_rate(eval_set, k=5):
    """Fraction of eval queries whose labeled relevant document appears in the top-k results."""
    hits = 0
    for example in eval_set:
        retrieved = vector_db.search(example["query"], top_k=k)  # your vector store client
        retrieved_ids = [doc.id for doc in retrieved]
        if example["relevant_doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)
Target: >90% hit rate at k=5. Below 80% means your embeddings or chunking strategy needs work.
Mean Reciprocal Rank (MRR)
"How high in the results is the correct document?"
def mrr(eval_set, k=5):
    """Average of 1/rank of the relevant document; counts 0 when it is not retrieved at all."""
    reciprocal_ranks = []
    for example in eval_set:
        retrieved = vector_db.search(example["query"], top_k=k)
        for i, doc in enumerate(retrieved):
            if doc.id == example["relevant_doc_id"]:
                reciprocal_ranks.append(1 / (i + 1))
                break
        else:
            # for/else: runs only when the loop finishes without a break (document not found)
            reciprocal_ranks.append(0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
Target: >0.7 MRR. This means the correct document is usually in the top 2 results.
Generation metrics
Faithfulness
"Does the answer only use information from the retrieved documents?"
import json

def check_faithfulness(query, answer, retrieved_docs):
    """Ask a judge model whether the answer makes claims not supported by the sources."""
    prompt = f"""Given these source documents and the answer,
does the answer contain any claims NOT supported by the documents?
Documents: {retrieved_docs}
Answer: {answer}
Return JSON: {{"faithful": true/false, "unsupported_claims": [...]}}"""
    # call_llm sends the prompt to the judge model and returns its text reply (sketched below)
    result = call_llm("claude-opus-4.6", prompt)
    return json.loads(result)
This catches hallucination, the #1 RAG failure mode.
Answer relevance
"Does the answer actually address the question?"
def check_relevance(query, answer):
    """Ask a judge model to score how well the answer addresses the question, from 1 to 5."""
    prompt = f"""Does this answer address the question? Score 1-5.
Question: {query}
Answer: {answer}
Score (1=irrelevant, 5=perfectly relevant):"""
    return parse_score(call_llm("claude-opus-4.6", prompt))
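Both checks assume two small helpers that aren't shown above: call_llm, which sends a prompt to the judge model and returns its text, and parse_score, which pulls the numeric score out of the reply. A minimal sketch of both, here using the Anthropic Python SDK (the client setup, max_tokens value, and regex are illustrative assumptions, not part of the original code):

import re
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_llm(model, prompt):
    """Send a single-turn prompt to the judge model and return the text of its reply."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def parse_score(text):
    """Pull the first 1-5 digit out of the judge's reply; return None if no score is found."""
    match = re.search(r"[1-5]", text)
    return int(match.group()) if match else None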
See our LLM-as-judge guide for best practices on automated scoring.
Building a RAG eval dataset
You need 50+ question-answer pairs with labeled relevant documents:
{
  "query": "What's the refund policy for annual plans?",
  "relevant_doc_id": "doc_refund_policy_v3",
  "expected_answer_contains": ["30 days", "prorated", "annual"],
  "expected_answer_not_contains": ["no refund"]
}
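Given that format, a small helper can turn the expected_answer_contains / expected_answer_not_contains fields into a pass/fail check on a generated answer (a sketch; the field names follow the example above, and case-insensitive substring matching is an assumption):

def check_expected_strings(example, answer):
    """Check a generated answer against the labeled expectations of one eval example."""
    answer_lower = answer.lower()
    missing = [s for s in example.get("expected_answer_contains", [])
               if s.lower() not in answer_lower]
    forbidden = [s for s in example.get("expected_answer_not_contains", [])
                 if s.lower() in answer_lower]
    return {"passed": not missing and not forbidden,
            "missing": missing,
            "forbidden": forbidden}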
See our eval dataset guide for the complete process.
Tools
| Tool | What it evaluates |
|---|---|
| Ragas | Retrieval + generation metrics, open source |
| DeepEval | RAG-specific metrics with LLM-as-judge |
| Langfuse | Trace retrieval + generation, score in dashboard |
| Custom | The code above (100 lines of Python) |
The minimum viable RAG eval
If you do nothing else:
- Create 20 test questions with known correct answers
- Run them through your RAG pipeline
- Check: did it retrieve the right documents? (hit rate)
- Check: did it answer correctly? (manual review or LLM-as-judge)
- Track these metrics over time
This takes 2 hours to set up and catches 80% of RAG quality issues.
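Wiring those steps together can be as small as the sketch below, which reuses hit_rate and check_relevance from earlier and appends each run to a log file so the numbers can be tracked over time (answer_query stands in for your RAG pipeline, and the file name is arbitrary):

import json
import datetime

def run_minimum_eval(eval_set, k=5):
    """Minimal eval: retrieval hit rate plus an LLM-as-judge relevance score per question."""
    scores = []
    for example in eval_set:
        answer = answer_query(example["query"])  # your RAG pipeline (assumed helper)
        score = check_relevance(example["query"], answer)
        if score is not None:
            scores.append(score)
    results = {
        "date": datetime.date.today().isoformat(),
        "hit_rate": hit_rate(eval_set, k=k),
        "avg_relevance": sum(scores) / len(scores) if scores else None,
    }
    # Append one JSON line per run so the metrics can be tracked over time.
    with open("rag_eval_history.jsonl", "a") as f:
        f.write(json.dumps(results) + "\n")
    return results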
Common RAG failure modes
Understanding what can go wrong helps you build better evals:
| Failure | Symptom | Fix |
|---|---|---|
| Wrong documents retrieved | Answer is confident but incorrect | Improve embeddings, chunking, or add metadata filters |
| Right documents, wrong answer | LLM ignores or misinterprets the context | Improve prompt, reduce context window noise, use a better model |
| Hallucination | Answer contains facts not in any document | Add faithfulness check, use an "only answer from provided context" instruction |
| Outdated information | Answer uses old version of a document | Implement a document refresh pipeline; add timestamps to chunks and prefer recent documents in ranking |
| Partial retrieval | Answer is incomplete because not all relevant chunks were found | Increase k, improve chunking overlap, use hybrid search |
| Context overflow | Too many documents stuffed into the prompt | Reduce top_k, use reranking to pick the best documents |
When to use LLM-as-judge vs human evaluation
Use LLM-as-judge for:
- Faithfulness checks (is the answer grounded in sources?)
- Relevance scoring (does the answer address the question?)
- Running evals at scale (hundreds of test cases)
Use human evaluation for:
- Building your initial eval dataset (ground truth)
- Validating that LLM-as-judge agrees with human judgment
- Edge cases where nuance matters (legal, medical, financial content)
A good approach: start with 20 human-evaluated examples, then use LLM-as-judge for ongoing monitoring with periodic human spot-checks.
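One way to run that validation is a simple agreement rate between the judge and your human labels. The sketch below reuses check_relevance and assumes each human-labeled example carries a stored answer and a human_pass boolean (both field names, and the 4-out-of-5 pass threshold, are assumptions):

def judge_human_agreement(labeled_examples, pass_threshold=4):
    """Fraction of human-labeled examples where the LLM judge reaches the same pass/fail verdict."""
    agreements = 0
    for ex in labeled_examples:
        score = check_relevance(ex["query"], ex["answer"])
        judge_pass = score is not None and score >= pass_threshold
        if judge_pass == ex["human_pass"]:
            agreements += 1
    return agreements / len(labeled_examples)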
FAQ
How many test questions do I need for a reliable eval?
Start with 20-50 covering your main use cases. For production systems, aim for 100+ with good coverage of edge cases. The key is diversity: cover different question types, document types, and expected answer formats.
How often should I run RAG evaluations?
Run the full eval suite after every change to: embedding model, chunking strategy, retrieval parameters, prompt template, or LLM model. For production monitoring, run a subset daily and the full suite weekly.
My hit rate is good but answers are still wrong. Why?
Good retrieval + bad answers usually means: (1) the LLM is ignoring the context, (2) the context is too long and the relevant part gets lost in the middle, or (3) the prompt doesn't clearly instruct the model to only use provided sources. Try reducing chunk size or adding "Answer ONLY based on the provided documents" to your prompt.
Should I evaluate RAG differently for chat vs single-query?
Yes. Chat RAG needs additional evaluation for: conversation context handling (does it remember previous questions?), coreference resolution (does "it" refer to the right thing?), and progressive refinement (does follow-up retrieval improve answers?).
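A multi-turn eval case can extend the dataset format above with the prior turns, so the test exercises coreference as well as retrieval (a sketch; the conversation field and its shape are assumptions, the rest mirrors the earlier example):

multi_turn_example = {
    # Prior turns the pipeline must take into account when retrieving and answering.
    "conversation": [
        {"role": "user", "content": "What's the refund policy for annual plans?"},
        {"role": "assistant", "content": "Annual plans can be refunded within 30 days, prorated."},
    ],
    # "that" only resolves correctly if the pipeline carries the refund-policy context forward.
    "query": "Does that apply to monthly plans too?",
    "relevant_doc_id": "doc_refund_policy_v3",
    "expected_answer_contains": ["monthly"],
}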