You built a RAG system. It works on your test data. In production, it returns irrelevant garbage 30% of the time. Here are the seven most common causes and how to fix each one.
1. Bad chunking
Symptom: Answers are vague or miss important details.
Cause: Chunks are too large (entire pages) or too small (single sentences). Large chunks dilute the relevant information. Small chunks lose context.
Fix: Aim for chunks of 200-500 tokens with about 50 tokens of overlap. Prefer structure-aware chunking (splitting at paragraph and section boundaries) over fixed-size splitting:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # measured in characters by default, not tokens
    chunk_overlap=50,  # also in characters
    separators=["\n\n", "\n", ". ", " "],  # try natural boundaries first
)
chunks = splitter.split_text(document)  # document: the raw text as a str
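Because the splitter above counts characters, it can help to see the token-budget idea in isolation. The following is a minimal sketch, not a replacement for a real tokenizer: it packs whole paragraphs into chunks under an estimated token budget and repeats the last paragraph of each chunk as overlap. The names `estimate_tokens` and `chunk_paragraphs` and the four-characters-per-token heuristic are assumptions made for illustration.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token in English.
    # Replace with your embedding model's tokenizer for real counts.
    return max(1, len(text) // 4)


def chunk_paragraphs(document: str, max_tokens: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens (estimated).

    The last paragraph of a finished chunk is repeated at the start of the
    next chunk as overlap, provided it fits. A single paragraph larger than
    the budget becomes its own oversized chunk instead of being cut.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and estimate_tokens("\n\n".join(candidate)) > max_tokens:
            chunks.append("\n\n".join(current))
            overlap = current[-1]
            # Keep the overlap only if overlap + new paragraph fit the budget
            if estimate_tokens(overlap + "\n\n" + para) <= max_tokens:
                current = [overlap, para]
            else:
                current = [para]
        else:
            current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Paragraph-level overlap is coarser than a fixed 50-token overlap, but it never cuts a sentence in half, which is usually the point of splitting at natural boundaries.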
2. Wrong embedding model
Symptom: Search returns topically related but not actually relevant results.
Cause: Using a general-purpose embedding model for specialized content (code, legal, medical).
Fix: Use embeddings trained for your domain. For code, try Codestral Embed. For general text, OpenAI’s text-embedding-3-large is a safe default. See our embeddings guide.
3. No hybrid search
Symptom: Exact terms aren’t found. Searching for “ECONNRESET” returns pages about “connection errors” but not the specific error code.
Cause: Pure vector search matches meaning but misses exact keywords.
Fix: Combine vector search with BM25 keyword search:
# Pseudo-code for hybrid search
vector_results = vector_db.search(query_embedding, top_k=10)
keyword_results = bm25_search(query_text, top_k=10)
final_results = reciprocal_rank_fusion(vector_results, keyword_results)
Several vector databases, Weaviate among them, support hybrid search natively. Where yours does not, fuse the two ranked lists yourself.
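The fusion step in the pseudo-code can be sketched in a few lines. This version assumes each search returns a best-first list of document IDs (the pseudo-code above does not say what the result objects look like), and it uses k=60, the constant commonly used with reciprocal rank fusion.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*ranked_lists: list[str], k: int = 60) -> list[str]:
    """Fuse best-first lists of document IDs into one best-first list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Documents found by both searches rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative IDs only
vector_results = ["doc_a", "doc_b", "doc_c"]
keyword_results = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion(vector_results, keyword_results)
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Rank-based fusion sidesteps the fact that BM25 scores and cosine similarities live on different scales, so no score normalization is needed.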
4. Stale embeddings
Symptom: RAG returns outdated information even though docs were updated.
Cause: Documents changed but embeddings weren’t regenerated.
Fix: Re-embed on every content update. Use a hash to detect changes:
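import hashlib

def needs_reembedding(doc_id, new_content):
    # MD5 is acceptable here: it detects changes, it is not a security control
    new_hash = hashlib.md5(new_content.encode()).hexdigest()
    old_hash = get_stored_hash(doc_id)  # your own lookup; should return None for unseen docs
    return new_hash != old_hash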
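To show where the hash check sits in an update job, here is a hedged sketch of a sync loop. The in-memory `stored_hashes` dict and the `embed_and_upsert` callback are hypothetical stand-ins for your hash store and your embedding pipeline, and SHA-256 is swapped in for MD5 purely as a matter of taste.

```python
import hashlib

# Hypothetical in-memory store; in practice keep hashes next to your vectors.
stored_hashes: dict[str, str] = {}


def content_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def sync_documents(docs: dict[str, str], embed_and_upsert) -> list[str]:
    """Re-embed only new or changed documents and return their IDs."""
    updated = []
    for doc_id, content in docs.items():
        new_hash = content_hash(content)
        if stored_hashes.get(doc_id) == new_hash:
            continue  # unchanged since the last run, skip the embedding call
        embed_and_upsert(doc_id, content)
        # Record the hash only after the upsert succeeds, so a failed run retries
        stored_hashes[doc_id] = new_hash
        updated.append(doc_id)
    return updated
```

This sketch does not handle deleted documents; a real job would also remove vectors whose IDs no longer appear in `docs`.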
5. No metadata filtering
Symptom: Results from wrong categories, dates, or sources.
Cause: Vector similarity alone can’t distinguish between a 2024 guide and a 2026 guide on the same topic.
Fix: Store metadata and filter before vector search:
# Chroma-style syntax: multiple conditions must be wrapped in $and
results = collection.query(
    query_texts=["how to deploy"],
    where={"$and": [{"year": {"$gte": 2025}}, {"category": "deployment"}]},
    n_results=5,
)
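The query above leaves the filtering to the database. To make the order of operations explicit, here is a small plain-Python sketch that filters on metadata first and only then ranks the survivors by cosine similarity. The record layout, the `filtered_search` name, and the toy two-dimensional vectors are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, records, min_year, category, top_k=5):
    """Apply metadata filters first, then rank what is left by similarity."""
    candidates = [
        r for r in records
        if r["year"] >= min_year and r["category"] == category
    ]
    candidates.sort(key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return candidates[:top_k]


records = [
    {"id": "deploy-2024", "year": 2024, "category": "deployment", "vector": [1.0, 0.0]},
    {"id": "deploy-2026", "year": 2026, "category": "deployment", "vector": [0.9, 0.1]},
    {"id": "billing-2026", "year": 2026, "category": "billing", "vector": [1.0, 0.0]},
]
hits = filtered_search([1.0, 0.0], records, min_year=2025, category="deployment")
print([h["id"] for h in hits])  # ['deploy-2026']
```

Note that the 2024 guide is the closest vector match, yet it never reaches the ranking step. That is the behavior similarity alone cannot give you.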
6. Context window overflow
Symptom: LLM ignores some retrieved documents or gives incomplete answers.
Cause: Too many retrieved chunks exceed the model’s effective context window. Models degrade on information in the middle of long contexts (“lost in the middle” problem).
Fix: Retrieve fewer, more relevant chunks (3-5 instead of 10-20). Place the highest-ranked chunks at the start and end of the context, and the weaker ones in the middle.
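One simple way to get that ordering, sketched under the assumption that your retriever returns chunks sorted best-first, is to alternate chunks between the front and the back of the context. The function name is illustrative.

```python
def reorder_for_long_context(chunks_best_first: list[str]) -> list[str]:
    """Put top-ranked chunks at both ends and the weakest in the middle.

    Even-indexed items (1st, 3rd, ...) fill the front in order;
    odd-indexed items (2nd, 4th, ...) fill the back in reverse.
    """
    front = chunks_best_first[0::2]
    back = chunks_best_first[1::2]
    return front + back[::-1]


ordered = reorder_for_long_context(["c1", "c2", "c3", "c4", "c5"])
print(ordered)  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

Here the best chunk opens the context, the second-best closes it, and the lowest-ranked chunk lands in the middle, where the model is most likely to overlook it.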
7. No reranking
Symptom: The best result is at position 5 instead of position 1.
Cause: Embedding similarity is a rough approximation. The top-10 results are all “similar” but not equally relevant.
Fix: Add a reranker that scores each result against the query:
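from cohere import Client

co = Client(api_key="your-key")
reranked = co.rerank(
    model="rerank-v3.5",  # choose a current rerank model from Cohere's docs
    query="how to fix memory leak in Node.js",
    documents=[chunk.text for chunk in initial_results],
    top_n=3,
)
# Each result carries the index of the original document and a relevance score
top_chunks = [initial_results[r.index] for r in reranked.results]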
Cohere Rerank and cross-encoder models are the most common choices.
The debugging checklist
When your RAG returns bad results:
- Check the retrieved chunks — are they relevant? (retrieval problem)
- Check the prompt — does the LLM have enough context? (generation problem)
- Check the embedding model — is it appropriate for your domain?
- Check chunk sizes — too big or too small?
- Try hybrid search — does adding keywords help?
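The first item on the checklist is easier to act on with a number attached. A minimal sketch, assuming you have hand-labeled a few queries with the chunk IDs that should come back: measure how often at least one of them appears in the top k. The function name and data layout are illustrative.

```python
def hit_rate_at_k(
    retrieved: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of labeled queries with at least one relevant chunk in the top k."""
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query, expected in relevant.items()
        if expected & set(retrieved.get(query, [])[:k])
    )
    return hits / len(relevant)


# Illustrative labels and retrieval output
relevant = {"reset password": {"auth-12"}, "ECONNRESET": {"net-07"}}
retrieved = {
    "reset password": ["auth-03", "auth-12"],
    "ECONNRESET": ["net-01", "net-02"],
}
print(hit_rate_at_k(retrieved, relevant, k=5))  # 0.5
```

If this number is low, the problem is in retrieval (chunking, embeddings, hybrid search, filters). If it is high and answers are still bad, look at the prompt and the generation step instead.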
Related: Embeddings Explained · Vector Databases Compared · RAG vs Fine-Tuning · How to Build an AI Search Engine