What is RAG? Retrieval-Augmented Generation Explained for Developers
RAG (Retrieval-Augmented Generation) is a technique that gives AI models access to external knowledge at query time. Instead of relying only on what the model learned during training, RAG fetches relevant documents and includes them in the prompt.
This is how Perplexity answers questions with citations, how GitHub Copilot understands your codebase, and how enterprise chatbots know about your company's internal docs.
How it works
1. User asks a question
2. System searches a knowledge base for relevant documents
3. Retrieved documents are added to the LLM prompt as context
4. LLM generates an answer using those documents
5. Answer includes citations to the source documents
Without RAG, the model can only use knowledge from its training data (which has a cutoff date and doesn't include your private data). With RAG, it can answer questions about anything you put in the knowledge base.
A simple example
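```python
# vector_db and llm are placeholders for your vector database and LLM clients;
# the method names below are illustrative, not a specific SDK's API.

# 1. User question
question = "What's our refund policy?"
# 2. Search your docs (using embeddings + vector DB); assumed to return a list of strings
relevant_docs = vector_db.search(question, top_k=3)
# 3. Add docs to prompt, numbered so the model can cite them
context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(relevant_docs, start=1))
prompt = f"""Answer based only on these documents. Cite them as [1], [2], etc.
{context}
Question: {question}"""
# 4. LLM generates answer with citations
answer = llm.generate(prompt)
```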
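To try the flow without any external services, here is a self-contained sketch. It makes several simplifying assumptions. Bag-of-words word counts stand in for a real embedding model, and a Python list stands in for the vector database. The LLM call is left out, so the script only builds the prompt you would send. The names `embed`, `cosine`, `InMemoryVectorDB`, and `build_prompt` are hypothetical, and the three sample documents are made up.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorDB:
    """Hypothetical stand-in for a vector database: keeps (text, vector) pairs in a list."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, top_k: int = 3) -> list[str]:
        # Brute-force: score every stored document against the query, best first
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]


def build_prompt(question: str, docs: list[str]) -> str:
    """Number each document so the model can cite it as [1], [2], ..."""
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return (
        "Answer based only on these documents and cite them by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


vector_db = InMemoryVectorDB()
for doc in [
    "Refund policy: customers can request a full refund within 30 days of purchase.",
    "Shipping: orders ship within 2 business days.",
    "Support hours: our team is available Monday to Friday.",
]:
    vector_db.add(doc)

question = "What's our refund policy?"
relevant_docs = vector_db.search(question, top_k=2)
prompt = build_prompt(question, relevant_docs)
print(prompt)
```

Swapping `embed` for a real embedding model and `InMemoryVectorDB` for one of the databases listed under Key components improves retrieval quality. The shape of the pipeline stays the same.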
Key components
- Embeddings: convert text to numbers (vectors) for semantic search
- Vector database: stores and searches embeddings (Pinecone, Qdrant, Chroma)
- Chunking: splitting documents into searchable pieces
- LLM: generates the final answer (Claude, GPT, DeepSeek)
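Chunking is the component teams most often hand-roll. The sketch below assumes word-based chunks with a fixed overlap; real pipelines often split on tokens, sentences, or headings instead. The `chunk_text` name and its default sizes are illustrative, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to `chunk_size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


doc = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(doc, chunk_size=4, overlap=1):
    print(chunk)
```

Overlap lowers the chance that a fact split across a chunk boundary is lost to both neighboring chunks. The cost is some duplicated text in the index.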
When to use RAG
- Your data changes frequently
- You need answers about private/internal data
- You want source citations
- You can't (or don't want to) fine-tune a model
When NOT to use RAG
- The model already knows the answer (general knowledge)
- Your problem is output format or style rather than missing knowledge (use fine-tuning instead)
- Latency is critical (retrieval can add 100-500ms per query)
Learn more
- RAG vs Fine-Tuning: When to Use Each
- How to Build an AI Search Engine
- Why Your RAG Returns Bad Results
- Embeddings Explained for Developers
FAQ
How is RAG different from fine-tuning?
RAG retrieves external documents at query time and adds them to the prompt; it doesn't change the model itself. Fine-tuning modifies the model's weights to bake in new knowledge or behavior. RAG is better for frequently changing data; fine-tuning is better for consistent style or domain-specific reasoning.
Does RAG eliminate hallucinations?
RAG significantly reduces hallucinations by grounding answers in retrieved documents, but it doesn't eliminate them entirely. The model can still misinterpret retrieved context or generate unsupported claims. Adding source citations and relevance-score thresholds (declining to answer when no retrieved document scores high enough) helps catch the remaining issues.
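Here is one way to apply a score threshold and a basic citation check, with several assumptions. Retrieval is taken to return `(text, score)` pairs where higher means more relevant. The prompt is assumed to number sources `[1]`, `[2]`, and so on. The 0.5 cutoff is an arbitrary placeholder you would tune on your own data. `filter_by_score` and `invalid_citations` are hypothetical helpers, not part of any library, and the answer string is a pretend model output.

```python
import re


def filter_by_score(results: list[tuple[str, float]], min_score: float = 0.5) -> list[str]:
    """Keep only retrieved chunks whose relevance score meets the cutoff."""
    return [text for text, score in results if score >= min_score]


def invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return citation numbers like [4] that don't match any source given to the model."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)


results = [("Refunds are allowed within 30 days.", 0.82), ("Office dog policy.", 0.31)]
docs = filter_by_score(results)
if not docs:
    print("No sufficiently relevant documents; decline instead of guessing.")
else:
    # Pretend model output: it cites a second source that was never provided
    answer = "Refunds are allowed within 30 days [1], or 60 days for members [2]."
    print(invalid_citations(answer, num_sources=len(docs)))  # [2]
```

This only flags citations that point at nonexistent sources. Checking whether a cited document actually supports the claim needs a separate step, such as a second model pass or human review.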
What's the minimum amount of data needed for RAG to be useful?
RAG can be useful with as few as a dozen documents; there's no strict minimum. The key requirement is that your data contains information the model doesn't already know. Even a small internal FAQ or product spec can dramatically improve answer quality for domain-specific questions.
Related: How to Reduce LLM API Costs