🤖 AI Tools · 6 min read

MCP for RAG — Connect AI to Your Knowledge Base


MCP is a natural fit for RAG. Instead of building custom retrieval code for every AI host, expose your knowledge base as an MCP server. Any MCP-compatible host — Claude, Cursor, VS Code, custom apps — can then search your data through a single, standardized interface. Build once, use everywhere.

Why MCP makes sense for RAG

Traditional RAG pipelines are tightly coupled. You build a retrieval system, wire it into your specific AI application, and if you want to use a different AI host or add another consumer, you rebuild the integration. Every new client means new code.

MCP decouples retrieval from consumption. Your knowledge base becomes a service that any MCP-compatible client can query. The same RAG server works with Claude Desktop, Cursor, a custom chatbot, and your internal tools — all without changing a line of retrieval code.

This is the same pattern that made REST APIs successful: standardize the interface, and the ecosystem grows around it. For a deeper dive into the protocol itself, see our MCP complete developer guide.

Architecture

Here’s how MCP fits into a RAG pipeline:

       User question
             │
             ▼
AI Host (Claude / Cursor / Custom)
             │
             ▼
        MCP Client
             │
             ▼
      MCP RAG Server
             │
      ┌──────┼──────┐
      ▼      ▼      ▼
  Vector DB Search  Document
  (Qdrant)   API   Store (S3)
      │      │      │
      └──────┼──────┘
             │
             ▼
Relevant documents + metadata
             │
             ▼
AI generates answer with citations

The MCP RAG server is the bridge. It receives search queries from the AI host, routes them to your retrieval backends (vector databases, search engines, document stores), and returns formatted results. The AI host never needs to know about your storage layer — it just calls MCP tools.
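Concretely, when the host decides it needs information, its MCP client sends a standard JSON-RPC tools/call request to the server. The shape on the wire is roughly the following (the id and arguments values are illustrative; the search_docs tool is the one defined in the next section):

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "search_docs",
    "arguments": { "query": "How do API rate limits work?", "top_k": 5 }
  }
}
```

Because every host speaks this same request format, the server never has to care which AI application is on the other end.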

Building a RAG MCP server

Here’s a complete MCP server that connects to a vector database for semantic search. Everything is shown except the generateEmbedding helper, which is discussed after the listing:

import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { z } from 'zod';
import { QdrantClient } from '@qdrant/js-client-rest';

const server = new McpServer({
  name: 'rag-knowledge-base',
  version: '1.0.0',
});

const qdrant = new QdrantClient({ url: 'http://localhost:6333' });

// Main search tool — the AI host calls this to find relevant documents
server.tool('search_docs', {
  query: z.string().describe('Natural language search query'),
  top_k: z.number().default(5).describe('Number of results to return'),
  collection: z.string().default('docs').describe('Which collection to search'),
}, async ({ query, top_k, collection }) => {
  // Generate embedding for the query
  const embedding = await generateEmbedding(query);

  // Search the vector database
  const results = await qdrant.search(collection, {
    vector: embedding,
    limit: top_k,
    with_payload: true,
  });

  // Format results with source attribution
  const context = results.map((r, i) =>
    `Source [${i + 1}]: ${r.payload.title} (${r.payload.url})\n` +
    `Relevance: ${(r.score * 100).toFixed(1)}%\n` +
    `${r.payload.content}`
  ).join('\n\n---\n\n');

  return {
    content: [{
      type: 'text',
      text: results.length > 0
        ? `Found ${results.length} relevant documents:\n\n${context}`
        : 'No relevant documents found for this query.',
    }],
  };
});

// List available collections — helps the AI understand what's searchable
server.tool('list_collections', {}, async () => {
  const collections = await qdrant.getCollections();
  const list = collections.collections.map(c => `- ${c.name}`).join('\n');
  return {
    content: [{ type: 'text', text: `Available collections:\n${list}` }],
  };
});

// Expose knowledge base as a resource for context
server.resource('knowledge-base-info', 'docs://info', async () => ({
  contents: [{
    uri: 'docs://info',
    mimeType: 'text/plain',
    text: 'This knowledge base contains internal documentation, API references, and runbooks. Use search_docs to find relevant information.',
  }],
}));

const transport = new StdioServerTransport();
await server.connect(transport);

The generateEmbedding function can use any embedding model — OpenAI’s text-embedding-3-small for cloud setups, or a local model like nomic-embed-text via Ollama for fully local RAG. See our guide on building a local RAG pipeline with Ollama for the self-hosted approach.
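As a sketch of that helper, here is one way to implement generateEmbedding against Ollama’s /api/embeddings endpoint. The localhost URL and the nomic-embed-text model name are assumptions for a default local setup; for the cloud variant, swap in the OpenAI embeddings client instead:

```typescript
// Minimal embedding helper backed by a local Ollama instance.
// Assumes Ollama is running on localhost:11434 with nomic-embed-text pulled.
async function generateEmbedding(text: string): Promise<number[]> {
  const res = await fetch('http://localhost:11434/api/embeddings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'nomic-embed-text', prompt: text }),
  });
  if (!res.ok) {
    throw new Error(`Embedding request failed: ${res.status}`);
  }
  const data = (await res.json()) as { embedding: number[] };
  return data.embedding;
}
```

Whichever backend you choose, the rest of the server is unaffected: only this one function knows which embedding model is in use.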

Connecting to different vector databases

The MCP server pattern works with any vector database. Here’s how the connection layer differs:

Qdrant (shown above): Open-source, runs locally or in cloud. Excellent for self-hosted setups. Supports filtering, payload indexing, and multi-tenancy.

ChromaDB: Simplest to set up for prototyping. Runs in-process or as a server. Good for small-to-medium knowledge bases.

import { ChromaClient } from 'chromadb';
const chroma = new ChromaClient();
const collection = await chroma.getCollection({ name: 'docs' });
const results = await collection.query({ queryTexts: [query], nResults: top_k });

Weaviate: Strong hybrid search (vector + keyword). Good when you need both semantic and exact-match retrieval.

Pinecone: Fully managed cloud service. Easiest to scale but involves external data transfers (relevant for GDPR considerations).

For a detailed comparison, see our vector database comparison.

Why MCP for RAG beats custom integrations

  • Reusable — the same RAG server works with Claude, Cursor, GPT, any MCP-compatible host. Build once, connect everywhere.
  • Decoupled — update your knowledge base, change your vector database, or modify your retrieval logic without touching the AI application. The MCP interface stays the same.
  • Composable — combine your RAG server with other MCP servers. A developer can have simultaneous access to your docs (RAG server), your database (Postgres server), and your Git repos (Git server) through the same protocol.
  • Secure — MCP authentication controls who can search. Run locally with stdio transport for zero network exposure, or use SSE with auth for remote access.
  • GDPR-safe — run the entire stack locally and data never leaves your network.
  • Testable — MCP servers are just programs. You can unit test your retrieval logic, integration test the MCP interface, and load test the whole pipeline independently of the AI host.

Advanced patterns

Hybrid retrieval

Combine vector search with keyword search for better results. Vector search handles semantic similarity; keyword search catches exact terms the embedding might miss:

server.tool('hybrid_search', {
  query: z.string(),
  top_k: z.number().default(5),
}, async ({ query, top_k }) => {
  const [vectorResults, keywordResults] = await Promise.all([
    vectorSearch(query, top_k),
    keywordSearch(query, top_k),
  ]);

  // Reciprocal rank fusion to merge results
  const merged = reciprocalRankFusion(vectorResults, keywordResults);
  return formatResults(merged.slice(0, top_k));
});
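The reciprocalRankFusion helper above isn’t shown; a minimal version, assuming each result carries an id and using the conventional RRF constant k = 60, could look like this:

```typescript
interface RankedResult { id: string; score: number; }

// Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
// across the ranked lists it appears in, so items ranked well by
// both vector and keyword search float to the top.
function reciprocalRankFusion(
  a: RankedResult[],
  b: RankedResult[],
  k = 60,
): RankedResult[] {
  const scores = new Map<string, number>();
  for (const list of [a, b]) {
    list.forEach((r, rank) => {
      scores.set(r.id, (scores.get(r.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((x, y) => y.score - x.score);
}
```

Note that RRF ignores the raw scores entirely and uses only rank positions, which sidesteps the problem that vector similarities and keyword relevance scores live on incomparable scales.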

Multi-source RAG

Pull from multiple knowledge sources in a single query — documentation, Slack messages, Jira tickets, code comments:

server.tool('search_all', {
  query: z.string(),
  sources: z.array(z.enum(['docs', 'slack', 'jira', 'code'])).default(['docs']),
}, async ({ query, sources }) => {
  const results = await Promise.all(
    sources.map(source => searchSource(source, query))
  );
  return formatResults(results.flat().sort((a, b) => b.score - a.score));
});
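The searchSource helper is a thin dispatcher. One hedged sketch, assuming a registry pattern (the Source, ScoredDoc, and registerSource names are illustrative, not part of any SDK): map each source name to its own search function, so adding a backend never touches the tool handler:

```typescript
type Source = 'docs' | 'slack' | 'jira' | 'code';
interface ScoredDoc { source: Source; score: number; content: string; }
type SearchFn = (query: string) => Promise<ScoredDoc[]>;

// Registry of per-backend search functions; each entry wraps its own
// client and credentials (Qdrant, Slack API, Jira API, code index, ...).
const handlers = new Map<Source, SearchFn>();

function registerSource(source: Source, fn: SearchFn): void {
  handlers.set(source, fn);
}

// Dispatch a query to one source; unregistered sources yield no results.
async function searchSource(source: Source, query: string): Promise<ScoredDoc[]> {
  const fn = handlers.get(source);
  return fn ? fn(query) : [];
}
```

One caveat when merging afterwards: scores from different backends are rarely on the same scale, so normalize per source (or rank-fuse, as in the hybrid example) before the global sort.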

Contextual chunking

Instead of returning raw chunks, return chunks with surrounding context — the section title, the document it came from, and adjacent paragraphs:

// Store chunks with metadata
{
  content: "The API rate limit is 100 requests per minute.",
  title: "API Reference",
  section: "Rate Limits",
  prev_chunk: "Authentication uses Bearer tokens...",
  next_chunk: "Exceeding the rate limit returns 429...",
  url: "/docs/api-reference#rate-limits"
}

This gives the AI model much better context for generating accurate, well-attributed answers.
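Stitching that metadata into the retrieval output takes only a few lines. A sketch, with a StoredChunk type matching the field names in the example above (the exact rendering format is a design choice, not prescribed by MCP):

```typescript
interface StoredChunk {
  content: string;
  title: string;
  section: string;
  prev_chunk: string;
  next_chunk: string;
  url: string;
}

// Render a chunk with its surrounding context so the model sees where
// the passage came from and what borders it, not an isolated fragment.
function renderChunkWithContext(chunk: StoredChunk): string {
  return [
    `${chunk.title} > ${chunk.section} (${chunk.url})`,
    `…${chunk.prev_chunk}`,
    chunk.content,
    `${chunk.next_chunk}…`,
  ].join('\n');
}
```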

Choosing an embedding model

Your embedding model determines retrieval quality. Options:

Model                            Dimensions   Best for              Local?
OpenAI text-embedding-3-small    1536         General text          No (API)
OpenAI text-embedding-3-large    3072         High accuracy         No (API)
nomic-embed-text (Ollama)        768          Self-hosted general   Yes
Codestral Embed                  1024         Code search           No (API)
all-MiniLM-L6-v2                 384          Lightweight local     Yes

For GDPR-compliant setups, use nomic-embed-text via Ollama. For best quality without privacy constraints, OpenAI’s text-embedding-3-large is hard to beat.

FAQ

Can I use MCP for RAG?

Yes, and it’s one of the most practical MCP use cases. You build an MCP server that wraps your vector database and retrieval logic, then any MCP-compatible AI host can search your knowledge base. The server exposes tools like search_docs that the AI calls when it needs information. This is cleaner than building custom RAG integrations for each AI tool you use, and it means switching AI hosts doesn’t require rebuilding your retrieval pipeline.

Which MCP servers support vector databases?

There are community MCP servers for most popular vector databases — Qdrant, ChromaDB, Weaviate, and Pinecone all have MCP server implementations available. However, for production use, building your own MCP server (as shown above) gives you control over retrieval logic, result formatting, and access control. The community servers are good starting points but often lack features like hybrid search, contextual chunking, and multi-collection routing that production RAG systems need.

Is MCP better than custom RAG integration?

For most teams, yes. Custom integrations are faster to build initially for a single AI host, but they don’t scale. When you want to add a second AI host, support a new use case, or let a different team access the same knowledge base, you’re rebuilding from scratch. MCP gives you a standard interface that works across hosts, is independently testable, and separates your retrieval logic from your AI application. The overhead of building an MCP server versus a custom integration is minimal — and the long-term maintenance savings are significant.

Related: What is RAG? · What is a Vector Database? · Build a Local RAG Pipeline with Ollama · MCP Complete Developer Guide · Vector Databases Compared · Embeddings Explained · Why RAG Returns Bad Results