
Build a Local RAG Pipeline with Ollama: No Cloud, No API Keys (2026)


Large language models are impressive, but they don't know anything about your documents. Retrieval-Augmented Generation (RAG) fixes that: it lets you feed your own files to an LLM so it can answer questions grounded in your actual data instead of hallucinating.

The catch with most RAG tutorials? They ship your documents to OpenAI or some other cloud API. If you're working with internal docs, client data, or anything remotely sensitive, that's a non-starter.

This tutorial builds a fully local RAG pipeline. Nothing leaves your machine. No API keys, no cloud accounts, no subscriptions. Just Python, Ollama, and a vector database running in-process.

What you'll build

Here's the architecture in plain text:

Your Documents (.txt, .md, .pdf)
        │
        ▼
   Load & Chunk
        │
        ▼
  Embed with nomic-embed-text (Ollama)
        │
        ▼
  Store in ChromaDB (local vector DB)
        │
        ▼
   User Query → Embed → Similarity Search
        │
        ▼
  Retrieved Chunks + Query → LLM (Ollama)
        │
        ▼
     Answer
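
To make the "Similarity Search" box concrete, here is a minimal sketch of ranking chunks by vector similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and cosine similarity is just one common metric; the vector store you use may default to a different one.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda text: cosine_similarity(query, index[text]),
        reverse=True,
    )
    return ranked[:k]

# Hand-made toy "embeddings" (hypothetical values, for illustration only).
toy_index = {
    "Refunds are issued within 30 days.": [0.9, 0.1, 0.0],
    "Our office is closed on public holidays.": [0.1, 0.9, 0.1],
    "Returned items must be unused.": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]  # stands in for an embedded question about refunds

# The two refund-related chunks rank ahead of the office-hours chunk.
print(top_k(query_vector, toy_index))
```

In the real pipeline the embedding model produces the vectors and the vector store does the ranking; the idea is the same.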

The stack:

  • Ollama: runs the LLM and embedding model locally
  • nomic-embed-text: embedding model (768 dimensions, runs inside Ollama)
  • ChromaDB: lightweight vector store, runs in-process with no separate server
  • Python: glue code to tie it all together

You can use any Ollama-compatible LLM for generation (Qwen 3.6, Llama 4, Mistral), whatever fits your hardware. If you're unsure which model to pick, check how much VRAM you actually need.

Prerequisites

  • Python 3.10+
  • Ollama installed and running (full setup guide here)
  • 16 GB RAM minimum (a GPU helps but isn't required)
  • A folder of text documents you want to query (.txt or .md files work best for this tutorial)

Want to try this without buying hardware? Cloud GPU providers let you spin up the right GPU in minutes.

Step 1: Install dependencies

pip install chromadb ollama

That's it. Two packages. ChromaDB handles the vector store, and the ollama package talks to your local Ollama instance.

Step 2: Pull the models

You need two models, one for embeddings and one for generation:

ollama pull nomic-embed-text
ollama pull qwen3.6:35b-a3b

nomic-embed-text is a solid local embedding model that produces 768-dimensional vectors. For the LLM, we're using Qwen 3.6 35B-A3B here, but swap in any model you prefer. Llama 4 and Mistral both work fine.

Verify both models are available:

ollama list

Step 3: Load and chunk documents

RAG works by splitting your documents into small chunks, embedding each chunk, and then retrieving the most relevant ones at query time. Here's the loading and chunking step:

import os

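def load_documents(directory: str) -> list[str]:
    """Load all .txt and .md files from a directory."""
    docs = []
    for filename in sorted(os.listdir(directory)):  # sorted for a stable chunk order
        if filename.endswith((".txt", ".md")):
            # Assumes UTF-8 files; the platform default encoding is not always UTF-8
            with open(os.path.join(directory, filename), encoding="utf-8") as f:
                docs.append(f.read())
    return docs

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    if not 0 <= overlap < size:
        # Otherwise the window would never advance
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap  # step forward, keeping `overlap` characters
    return chunks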

# Load and chunk
documents = load_documents("./my_docs")
all_chunks = []
for doc in documents:
    all_chunks.extend(chunk_text(doc))

print(f"Created {len(all_chunks)} chunks from {len(documents)} documents")

Put your documents in a ./my_docs folder. The chunker uses a 500-character window with 50-character overlap so you don't lose context at chunk boundaries.
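
To see what the overlap does, the sketch below runs a copy of the same sliding-window logic on a short string with deliberately tiny numbers (10-character chunks, 3 characters shared). The sample string and sizes are chosen only to make the windows easy to read.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Sliding-window chunker: each window starts (size - overlap) after the last."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, size=10, overlap=3)

for first, second in zip(chunks, chunks[1:]):
    # The last 3 characters of one chunk reappear as the first 3 of the next.
    print(repr(first), "->", repr(second), "| shared:", repr(first[-3:]))
```

The final chunk is shorter than the rest ('vwxyz' here) because the window runs off the end of the text.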

Step 4: Create embeddings and store in ChromaDB

Now embed every chunk and store the vectors in ChromaDB:

import chromadb
import ollama

# Initialize ChromaDB (persistent, stored on disk)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="my_docs")

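# Embed and store each chunk
for i, chunk in enumerate(all_chunks):
    response = ollama.embed(model="nomic-embed-text", input=chunk)
    embedding = response["embeddings"][0]
    # upsert overwrites an existing entry with the same ID, so re-running is safe
    collection.upsert(
        ids=[f"chunk_{i}"],
        embeddings=[embedding],
        documents=[chunk],
    )

print(f"Stored {len(all_chunks)} embeddings in ChromaDB")

ChromaDB's PersistentClient saves everything to ./chroma_db on disk, so the index survives restarts. Note that this snippet still re-embeds every chunk each time it runs; to skip that work, run this step only when your documents change. The query step below opens the stored collection without re-embedding anything.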
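
Embedding one chunk per call works, but for larger folders it can help to send chunks in batches. The sketch below is one way to do that, under a few assumptions: `index_in_batches`, `batched`, and the batch size of 32 are illustrative names and values, the embedding call is passed in as a function, and the collection only needs an `upsert` method that accepts `ids`, `embeddings`, and `documents`.

```python
from typing import Any, Callable, Iterator

def batched(items: list[str], batch_size: int) -> Iterator[tuple[int, list[str]]]:
    """Yield (start_index, batch) pairs that cover items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def index_in_batches(
    chunks: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    collection: Any,
    batch_size: int = 32,
) -> int:
    """Embed chunks batch by batch and upsert them; returns how many were stored."""
    total = 0
    for start, batch in batched(chunks, batch_size):
        vectors = embed_batch(batch)  # one embedding call for the whole batch
        collection.upsert(
            ids=[f"chunk_{start + offset}" for offset in range(len(batch))],
            embeddings=vectors,
            documents=batch,
        )
        total += len(batch)
    return total

# Possible wiring to this tutorial's stack (needs a running Ollama, so shown
# as comments; assumes ollama.embed accepts a list of strings as input):
#
#   embed_batch = lambda texts: ollama.embed(
#       model="nomic-embed-text", input=texts)["embeddings"]
#   index_in_batches(all_chunks, embed_batch, collection)
```

Passing the embedding function in as a parameter also makes the indexing loop easy to test without a model running.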

Step 5: Query the pipeline

This is where it all comes together. Embed the user's question, find the most similar chunks, and pass them to the LLM as context:

import ollama
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection(name="my_docs")

def query_rag(question: str, n_results: int = 3) -> str:
    # Embed the question
    response = ollama.embed(model="nomic-embed-text", input=question)
    query_embedding = response["embeddings"][0]

    # Retrieve relevant chunks
    results = collection.query(query_embeddings=[query_embedding], n_results=n_results)
    context = "\n\n---\n\n".join(results["documents"][0])

    # Generate answer with context
    prompt = f"""Use the following context to answer the question. If the context doesn't contain the answer, say so.

Context:
{context}

Question: {question}

Answer:"""

    output = ollama.chat(
        model="qwen3.6:35b-a3b",
        messages=[{"role": "user", "content": prompt}],
    )
    return output["message"]["content"]

# Try it
answer = query_rag("What is the refund policy?")
print(answer)

The n_results=3 parameter controls how many chunks get retrieved. Three is a good default: enough context without overwhelming the LLM's context window.

Full working script

Here's everything in one file you can copy and run:

import os
import ollama
import chromadb

# --- Config ---
DOCS_DIR = "./my_docs"
CHROMA_PATH = "./chroma_db"
EMBED_MODEL = "nomic-embed-text"
LLM_MODEL = "qwen3.6:35b-a3b"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

# --- Load & Chunk ---
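def load_documents(directory: str) -> list[str]:
    docs = []
    for filename in sorted(os.listdir(directory)):  # sorted for a stable chunk order
        if filename.endswith((".txt", ".md")):
            # Assumes UTF-8 files; the platform default encoding is not always UTF-8
            with open(os.path.join(directory, filename), encoding="utf-8") as f:
                docs.append(f.read())
    return docs

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    if not 0 <= overlap < size:
        # Otherwise the window would never advance
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks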

# --- Embed & Store ---
def build_index(chunks: list[str]) -> None:
    client = chromadb.PersistentClient(path=CHROMA_PATH)
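    # Start fresh each time: remove the old collection if it exists
    try:
        client.delete_collection(name="my_docs")
    except Exception:
        # The error raised for a missing collection has differed between
        # ChromaDB releases, so catch broadly here
        pass
    collection = client.create_collection(name="my_docs")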

    for i, chunk in enumerate(chunks):
        response = ollama.embed(model=EMBED_MODEL, input=chunk)
        collection.add(
            ids=[f"chunk_{i}"],
            embeddings=[response["embeddings"][0]],
            documents=[chunk],
        )
    print(f"Indexed {len(chunks)} chunks.")

# --- Query ---
def query_rag(question: str, n_results: int = 3) -> str:
    client = chromadb.PersistentClient(path=CHROMA_PATH)
    collection = client.get_collection(name="my_docs")

    response = ollama.embed(model=EMBED_MODEL, input=question)
    results = collection.query(
        query_embeddings=[response["embeddings"][0]], n_results=n_results
    )
    context = "\n\n---\n\n".join(results["documents"][0])

    prompt = f"""Use the following context to answer the question. If the context doesn't contain the answer, say so.

Context:
{context}

Question: {question}

Answer:"""

    output = ollama.chat(
        model=LLM_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return output["message"]["content"]

# --- Main ---
if __name__ == "__main__":
    documents = load_documents(DOCS_DIR)
    all_chunks = []
    for doc in documents:
        all_chunks.extend(chunk_text(doc))

    print(f"Loaded {len(documents)} documents, {len(all_chunks)} chunks")
    build_index(all_chunks)

    while True:
        question = input("\nAsk a question (or 'quit'): ")
        if question.lower() == "quit":
            break
        print("\n" + query_rag(question))

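Drop your .txt or .md files into ./my_docs, run the script, and start asking questions. Each run rebuilds the index from scratch, so startup takes as long as embedding every chunk. After that, each question only needs one query embedding, a lookup in the stored index, and the LLM call.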

Tips for better results

Chunk size matters. 500 characters is a starting point. If your answers feel incomplete, try 800 to 1000. If they feel noisy and unfocused, go smaller; 300 characters works well for FAQ-style documents. There's no universal best value: it depends on the structure and density of your documents. Experiment and compare the output quality.

Overlap prevents lost context. The 50-character overlap repeats the end of each chunk at the start of the next, so a sentence that straddles a boundary is less likely to be cut off mid-thought in both chunks. For longer documents with complex paragraphs, try an overlap of 100 to 200 characters. The tradeoff is more chunks (and slightly more storage and embedding time) in exchange for better retrieval accuracy at the edges.
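
To get a feel for that tradeoff, the chunk count can be estimated directly: the window advances by size minus overlap each step, so a text of n characters yields roughly n / (size - overlap) chunks, rounded up. The helper name `estimate_chunks` and the 100,000-character corpus below are illustrative assumptions.

```python
import math

def estimate_chunks(n_chars: int, size: int = 500, overlap: int = 50) -> int:
    """Chunks produced by a sliding window that advances (size - overlap) per step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    return math.ceil(n_chars / (size - overlap))

# Same 500-character window, increasing overlap, for 100,000 characters of text.
for overlap in (50, 100, 200):
    print(f"overlap={overlap}: {estimate_chunks(100_000, overlap=overlap)} chunks")
# overlap=50: 223 chunks
# overlap=100: 250 chunks
# overlap=200: 334 chunks
```

Going from 50 to 200 characters of overlap adds about half again as many chunks to embed and store for the same text.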

Prompt template is everything. The simple template above works, but you can improve it significantly. Add instructions like "Answer in bullet points," "Cite which section the information comes from," or "If multiple chunks disagree, mention the discrepancy." A well-crafted prompt template is often one of the biggest levers for output quality, and can matter more than switching models.
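
As one possible variation on the template, the sketch below numbers each retrieved chunk so the model can cite it. The function name `build_prompt` and the exact instruction wording are assumptions to adapt, not a tested recipe.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Build a prompt whose context passages are numbered for citation."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Use only the numbered context passages below to answer the question.\n"
        "Cite the passage numbers you relied on, like [1].\n"
        "If the passages disagree, mention the discrepancy.\n"
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

print(build_prompt(
    "What is the refund policy?",
    ["Refunds are issued within 30 days.", "Returned items must be unused."],
))
```

The returned string can be passed as the user message content in place of the inline f-string prompt.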

Retrieve more, filter later. Pulling 5 chunks instead of 3 gives the LLM more to work with. If some retrieved chunks are irrelevant, the model usually ignores them. You can also add a relevance threshold: ChromaDB returns a distance with each result (smaller means closer), so you can filter out chunks that are too far from the query embedding.
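
A minimal sketch of that filter, assuming the query result is a dict with parallel "documents" and "distances" lists (one inner list per query). The function name `filter_by_distance` and the 0.5 cutoff are placeholders: a sensible threshold depends on the embedding model and the distance metric, so calibrate it against your own queries.

```python
def filter_by_distance(results: dict, max_distance: float) -> list[str]:
    """Keep only the retrieved documents whose distance is within the cutoff."""
    documents = results["documents"][0]  # results for the first (only) query
    distances = results["distances"][0]
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]

# Hand-written stand-in for a query result; the distances are made up.
fake_results = {
    "documents": [["Refunds are issued within 30 days.",
                   "Our office is closed on public holidays.",
                   "Returned items must be unused."]],
    "distances": [[0.21, 0.93, 0.35]],
}

# Keeps the two refund-related chunks and drops the distant one.
print(filter_by_distance(fake_results, max_distance=0.5))
```

If the filter removes every chunk, it is usually better to tell the user nothing relevant was found than to call the LLM with an empty context.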

Re-index when documents change. The full script above rebuilds the index each run. For a production setup, you'd want incremental updates: track file hashes and only re-embed changed files. ChromaDB supports upserting by ID, which makes this straightforward.
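
The hash-tracking part can be sketched with the standard library alone. The function names and the JSON manifest format here are assumptions made for illustration; the idea is to hash every document, compare against the hashes saved on the previous run, and re-embed only the files that differ.

```python
import hashlib
import json
from pathlib import Path

def scan_hashes(docs_dir: str) -> dict[str, str]:
    """Map each .txt/.md file name in docs_dir to the SHA-256 of its contents."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(docs_dir).iterdir())
        if path.is_file() and path.suffix in {".txt", ".md"}
    }

def changed_files(current: dict[str, str], previous: dict[str, str]) -> list[str]:
    """Names of files that are new or whose contents changed since the last run."""
    return [name for name, digest in current.items() if previous.get(name) != digest]

def load_manifest(manifest_path: str) -> dict[str, str]:
    """Read the saved hashes, or return an empty dict on the first run."""
    path = Path(manifest_path)
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

def save_manifest(manifest_path: str, hashes: dict[str, str]) -> None:
    """Persist the hashes; call this only after re-embedding has succeeded."""
    Path(manifest_path).write_text(json.dumps(hashes, indent=2), encoding="utf-8")
```

For this to work with upserts, chunk IDs would need to include the file name (for example a hypothetical f"{name}:{i}" scheme) instead of a global counter. Chunks left over from deleted or shortened files would also have to be removed from the collection, which this sketch does not handle.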

Try different embedding models. nomic-embed-text is a great default, but Ollama supports other embedding models too. If your documents are code-heavy, a code-specific embedding model may retrieve better results. The rest of the pipeline stays identical: swap the model name and rebuild the index, since vectors produced by different models are not comparable.

When to use local RAG vs cloud RAG

Go local when:

  • Your data is sensitive or regulated (GDPR considerations)
  • You want zero ongoing costs
  • You need offline access
  • You're building internal tools for a team

Go cloud when:

  • You need the absolute best model quality (GPT-4.5, Claude)
  • Your document corpus is massive (millions of chunks)
  • You don't want to manage hardware
  • Latency requirements are strict and you lack a good GPU

For most personal and small-team use cases (querying internal docs, building a private docs chatbot, codebase Q&A), local RAG is more than good enough. The models available through Ollama in 2026 are genuinely capable, and you keep full control of your data. Start local, and only move to cloud if you hit a concrete wall.
