AI agents need information to produce useful outputs. There are three fundamental mechanisms for providing that information: retrieval, memory, and tools. Each solves a different problem, has different latency characteristics, and fits different parts of the agent workflow. Understanding when to use each — and how to combine them — is the key to building agents that actually work in production.
Retrieval (RAG)
Retrieval-augmented generation pulls relevant documents from a knowledge base and injects them into the model’s context window. The agent searches for information related to the current query, retrieves the top results, and includes them alongside the user’s question.
Best for: Large, evolving knowledge bases that don’t fit in a prompt. Company documentation, product catalogs, codebases, legal documents, and any corpus that changes frequently.
How it works: The user’s query gets embedded into a vector, compared against pre-embedded document chunks, and the most similar chunks are retrieved and added to the prompt. This happens at query time, adding 100-500ms of latency.
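The retrieval step above can be sketched in a few lines. The bag-of-words `embed` function below is a toy stand-in for a real embedding model (a production system would call a sentence-embedding API), and the chunks are illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an
    # embedding model and get back a dense vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Embed the query, score every pre-chunked document, return top-k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "The refund policy allows returns within 30 days.",
    "Our office is open Monday through Friday.",
    "Refund requests require the original receipt.",
]
print(retrieve("what is the refund policy", chunks, k=2))
```

The retrieved chunks would then be prepended to the prompt alongside the user's question. Real pipelines pre-compute the chunk embeddings and store them in a vector index so only the query is embedded at request time.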
Limitations: Quality depends heavily on chunking strategy, embedding model choice, and retrieval parameters. Bad chunking produces irrelevant results. Too few results miss important context. Too many results overwhelm the model’s attention.
For a hands-on implementation guide, see how to build a local RAG pipeline with Ollama.
Memory
Memory gives agents access to information from previous interactions. This includes conversation history within a session, summaries of past sessions, user preferences learned over time, and facts the agent has been explicitly told to remember.
Best for: Personalization, multi-turn conversations, and maintaining continuity across sessions. An agent that remembers your coding style, your project structure, or decisions made in previous conversations.
How it works: Short-term memory is simply the conversation history in the current context window. Long-term memory requires explicit storage — writing facts to a database and retrieving them in future sessions. Some systems use the model itself to decide what’s worth remembering.
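A minimal sketch of that explicit storage step, assuming a JSON-lines file as the store. A production system would use a database and semantic retrieval rather than "most recent N facts", and would likely let the model decide what to write:

```python
import json
import os
import tempfile
import time

class LongTermMemory:
    """Append-only memory store: one JSON record per remembered fact."""

    def __init__(self, path: str):
        self.path = path

    def remember(self, user: str, fact: str) -> None:
        # Persist a fact so future sessions can retrieve it.
        with open(self.path, "a") as f:
            f.write(json.dumps({"user": user, "fact": fact, "ts": time.time()}) + "\n")

    def recall(self, user: str, limit: int = 5) -> list[str]:
        # Load this user's facts, most recent last; empty store is fine.
        try:
            with open(self.path) as f:
                rows = [json.loads(line) for line in f]
        except FileNotFoundError:
            return []
        return [r["fact"] for r in rows if r["user"] == user][-limit:]

mem = LongTermMemory(os.path.join(tempfile.mkdtemp(), "memory.jsonl"))
mem.remember("ada", "prefers concise answers")
print(mem.recall("ada"))  # → ['prefers concise answers']
```

At session start, the agent would call `recall` and inject the returned facts into the system prompt, which is where the curation problem mentioned above bites: everything recalled competes for context-window space.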
Limitations: Context windows are finite. Conversation history grows linearly and eventually must be summarized or truncated. Long-term memory requires careful curation to avoid retrieving outdated or irrelevant information.
For patterns on implementing persistent memory, see our guide on agent memory patterns.
Tools
Tools let agents take actions and fetch real-time information from external systems. Instead of relying on pre-indexed knowledge, the agent calls an API, queries a database, runs code, or interacts with a service to get current information or produce side effects.
Best for: Real-time data, actions that change state, and information that can’t be pre-indexed. Checking current prices, sending emails, creating tickets, querying live databases, or executing code.
How it works: The model generates a tool call — a structured request specifying which function to invoke and with what arguments. The runtime executes the tool and returns the result to the model for further processing.
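That loop can be sketched as follows. The `get_stock_price` tool and the JSON call format are illustrative assumptions, not any particular framework's API:

```python
import json

def get_stock_price(symbol: str) -> float:
    # Stub standing in for a real market-data API call.
    return {"ACME": 42.17}.get(symbol, 0.0)

# Registry mapping tool names (as the model sees them) to functions.
TOOLS = {"get_stock_price": get_stock_price}

def execute_tool_call(raw_call: str) -> str:
    """Parse a model-generated tool call, run it, and return the result
    as text to be appended to the model's context for the next turn."""
    call = json.loads(raw_call)
    fn = TOOLS[call["name"]]
    result = fn(**call["arguments"])
    return json.dumps({"name": call["name"], "result": result})

# A structured call as the model might emit it:
model_output = '{"name": "get_stock_price", "arguments": {"symbol": "ACME"}}'
print(execute_tool_call(model_output))  # → {"name": "get_stock_price", "result": 42.17}
```

The runtime then feeds that result string back to the model, which either answers the user or issues another call.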
Limitations: Each tool call adds latency (network round-trip plus execution time). Tools can fail, time out, or return unexpected results. The model must correctly decide when to use a tool and how to interpret its output.
When to use each
| Scenario | Best mechanism | Why |
|---|---|---|
| “What does our refund policy say?” | Retrieval | Static knowledge, pre-indexed |
| “What did I ask you yesterday?” | Memory | Past interaction history |
| “What’s the current stock price?” | Tool | Real-time data, can’t pre-index |
| “Create a Jira ticket for this bug” | Tool | Action with side effects |
| “Use the same format as last time” | Memory | User preference from past session |
| “Summarize this 200-page PDF” | Retrieval | Large document, needs chunking |
Combining all three
Production agents rarely use just one mechanism. A well-designed agent combines all three in a single turn. Understanding how AI agents work means understanding this orchestration.
Consider a coding assistant handling “fix the authentication bug from yesterday’s discussion”:
- Memory recalls yesterday’s conversation about the auth bug, including the file names and error messages discussed
- Retrieval pulls the relevant source files and documentation about the auth system
- Tools read the current file contents, run tests to reproduce the bug, and apply the fix
The orchestration layer decides which mechanism to invoke based on the query. Some systems use the model itself to plan which sources to consult. Others use deterministic routing based on query classification.
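A deterministic router can be as simple as keyword classification. The keywords below are illustrative only; real systems use a trained classifier or let the model itself plan:

```python
def route(query: str) -> str:
    """Deterministic routing sketch: keyword heuristics stand in for a
    trained query classifier or a model-driven planner."""
    q = query.lower()
    # References to past interactions go to memory.
    if any(w in q for w in ("yesterday", "last time", "we discussed", "earlier")):
        return "memory"
    # Real-time data and state-changing verbs go to tools.
    if any(w in q for w in ("current", "now", "today", "create", "send", "run")):
        return "tool"
    # Default: look it up in the pre-indexed knowledge base.
    return "retrieval"

print(route("Create a Jira ticket for this bug"))  # → tool
```

Model-driven planning trades this predictability for flexibility: the model can chain mechanisms (memory, then retrieval, then a tool) that a one-shot classifier cannot.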
Architecture decisions
Where to start: If you’re building your first agent, start with tools. They provide the most immediate value with the least infrastructure. Add retrieval when your knowledge base exceeds what fits in a prompt. Add memory when users expect continuity across sessions.
Latency budget: Retrieval adds 100-500ms. Tool calls add 200-2000ms depending on the external service. Memory retrieval is typically fast (50-100ms) if stored locally. Budget your latency accordingly — an agent that takes 10 seconds to respond because it’s calling five tools sequentially will frustrate users.
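When tool calls are independent of one another, issuing them concurrently keeps the total close to the slowest call rather than the sum. A sketch with simulated tools (the delays are stand-ins for real network round-trips):

```python
import asyncio
import time

async def call_tool(name: str, delay: float) -> str:
    # Stub tool that simulates a slow external service.
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main() -> None:
    start = time.perf_counter()
    # Five independent 0.4s calls run concurrently: ~0.4s total,
    # not the ~2s they would take sequentially.
    results = await asyncio.gather(*(call_tool(f"tool{i}", 0.4) for i in range(5)))
    print(results, f"{time.perf_counter() - start:.1f}s")

asyncio.run(main())
```

This only works when the calls don't depend on each other's outputs; a chain where tool B needs tool A's result is inherently sequential, which is exactly when latency budgets blow up.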
Failure modes: Each mechanism fails differently. Retrieval returns irrelevant results. Memory surfaces outdated information. Tools time out or error. Your agent needs graceful degradation for each — falling back to the model’s parametric knowledge when external sources fail.
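One way to sketch that degradation: wrap each external source and fall back to a parametric-knowledge path on failure. The function names and canned fallback here are hypothetical:

```python
def with_fallback(fetch, fallback_answer: str) -> str:
    """Try the external source; on any failure, degrade gracefully to
    the fallback (e.g. answering from the model's own knowledge)."""
    try:
        return fetch()
    except Exception:
        # In production: log the failure and tag the answer as degraded.
        return fallback_answer

def flaky_retrieval() -> str:
    # Simulates a vector store that is down or timing out.
    raise TimeoutError("vector store unavailable")

print(with_fallback(flaky_retrieval, "Answering from model knowledge (may be stale)."))
```

A variant of the same wrapper works for each mechanism: stale-memory checks can fall back to asking the user, and failed tool calls can fall back to reporting the action as not performed rather than hallucinating success.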
The context engineering perspective
Retrieval, memory, and tools are all forms of context engineering — they determine what information the model sees when generating a response. The model’s output quality is bounded by the quality of its context: a perfectly prompted model with bad context produces bad outputs, while a simply prompted model with excellent context can produce excellent ones.
The art is in the orchestration: knowing which mechanism to invoke, how much context to include, and when to stop gathering information and start generating a response.
FAQ
What’s the difference between RAG and agent memory?
RAG retrieves information from an external knowledge base — documents, code, articles — based on semantic similarity to the current query. Agent memory stores and retrieves information from the agent’s own past interactions — conversation history, user preferences, and learned facts. RAG answers “what does the documentation say?” while memory answers “what did we discuss before?”
When should an agent use tools vs retrieval?
Use retrieval when the information already exists in a pre-indexed knowledge base and doesn’t change in real-time. Use tools when you need current data (live prices, system status), when the information can’t be pre-indexed (dynamic database queries), or when the agent needs to take an action that changes state (sending messages, creating records, modifying files).
Can an agent use all three?
Yes, and production agents typically do. A single user query might trigger memory retrieval to recall context from past conversations, RAG to pull relevant documentation, and tool calls to fetch current data or execute actions. The orchestration layer — whether model-driven or rule-based — decides which mechanisms to invoke for each query.
Which is most important for production agents?
Tools are typically most important because they enable agents to take actions and access real-time information, which is what differentiates an agent from a chatbot. However, the answer depends on your use case. A customer support agent needs retrieval most (knowledge base access). A personal assistant needs memory most (user context). A DevOps agent needs tools most (system interactions).