Build a Local RAG Pipeline with Ollama: No Cloud, No API Keys (2026)
Large language models are impressive, but they don't know anything about your documents. Retrieval-Augmented Generation (RAG) fixes that: it lets you feed your own files to an LLM so it can answer questions grounded in your actual data instead of hallucinating.
The catch with most RAG tutorials? They ship your documents to OpenAI or some other cloud API. If you're working with internal docs, client data, or anything remotely sensitive, that's a non-starter.
This tutorial builds a fully local RAG pipeline. Nothing leaves your machine. No API keys, no cloud accounts, no subscriptions. Just Python, Ollama, and a vector database running in-process.
What you'll build
Here's the architecture in plain text:
Your Documents (.txt, .md)
        │
        ▼
Load & Chunk
        │
        ▼
Embed with nomic-embed-text (Ollama)
        │
        ▼
Store in ChromaDB (local vector DB)
        │
        ▼
User Query → Embed → Similarity Search
        │
        ▼
Retrieved Chunks + Query → LLM (Ollama)
        │
        ▼
Answer
The stack:
- Ollama: runs the LLM and embedding model locally
- nomic-embed-text: embedding model (768 dimensions, runs inside Ollama)
- ChromaDB: lightweight vector store, runs in-process with no separate server
- Python: glue code to tie it all together
You can use any Ollama-compatible LLM for generation: Qwen 3.6, Llama 4, Mistral, or whatever fits your hardware. If you're unsure which model to pick, check how much VRAM it actually needs first.
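To make the "Similarity Search" box in the diagram concrete, here is a minimal sketch of ranking stored vectors against a query by cosine similarity. The three-dimensional vectors and the ids are made up for illustration. Real embeddings from nomic-embed-text have 768 dimensions, and ChromaDB uses an approximate index with its own configurable distance metric rather than this brute-force loop.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda chunk_id: cosine_similarity(query, store[chunk_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings", invented for this example only.
store = {
    "refunds": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "careers": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]
print(top_k(query, store))  # ['refunds', 'shipping']
```

The retrieved ids stand in for the chunks that would be pasted into the LLM prompt.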
Prerequisites
- Python 3.10+
- Ollama installed and running (full setup guide here)
- 16 GB RAM minimum (a GPU helps but isn't required)
- A folder of text documents you want to query (.txt or .md files work best for this tutorial)
Want to try this without buying hardware? Cloud GPU providers let you spin up the right GPU in minutes.
Step 1: Install dependencies
pip install chromadb ollama
That's it. Two packages. ChromaDB handles the vector store, and the ollama package talks to your local Ollama instance.
Step 2: Pull the models
You need two models, one for embeddings and one for generation:
ollama pull nomic-embed-text
ollama pull qwen3.6:35b-a3b
nomic-embed-text is a solid local embedding model that produces 768-dimensional vectors. For the LLM, we're using Qwen 3.6 35B-A3B here, but swap in any model you prefer. Llama 4 and Mistral both work fine.
Verify both models are available:
ollama list
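If you would rather check this from Python before indexing, the comparison can be sketched as below. The helper name `missing_models` is hypothetical, and the installed list is typed in by hand. With the ollama package it could come from `ollama.list()`, but the exact shape of that response depends on the package version, so it is left out here. The sketch assumes that a name pulled without a tag shows up as `<name>:latest`.

```python
def missing_models(required: list[str], installed: list[str]) -> list[str]:
    """Return the required models that are not in the installed list.

    A name without a tag is treated as "<name>:latest" (an assumption about
    how untagged pulls are listed).
    """
    def normalize(name: str) -> str:
        return name if ":" in name else f"{name}:latest"

    have = {normalize(name) for name in installed}
    return [name for name in required if normalize(name) not in have]

# Hypothetical contents of `ollama list`, written by hand for the example.
installed = ["nomic-embed-text:latest", "llama3:8b"]
print(missing_models(["nomic-embed-text", "qwen3.6:35b-a3b"], installed))
# ['qwen3.6:35b-a3b']
```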
Step 3: Load and chunk documents
RAG works by splitting your documents into small chunks, embedding each chunk, and then retrieving the most relevant ones at query time. Here's the loading and chunking step:
import os

def load_documents(directory: str) -> list[str]:
    """Load all .txt and .md files from a directory."""
    docs = []
    # Sort so chunk IDs stay stable between runs
    for filename in sorted(os.listdir(directory)):
        if filename.endswith((".txt", ".md")):
            # Read as UTF-8 explicitly so results don't depend on the OS default
            with open(os.path.join(directory, filename), encoding="utf-8") as f:
                docs.append(f.read())
    return docs

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    if overlap >= size:
        # Otherwise start would never advance and the loop would not end
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap
    return chunks

# Load and chunk
documents = load_documents("./my_docs")
all_chunks = []
for doc in documents:
    all_chunks.extend(chunk_text(doc))
print(f"Created {len(all_chunks)} chunks from {len(documents)} documents")
Put your documents in a ./my_docs folder. The chunker uses a 500-character window with 50-character overlap so you don't lose context at chunk boundaries.
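To see what the overlap actually does, here is the same sliding window run on a short string with deliberately tiny values (size 10, overlap 3, chosen only so the output is readable). The function is a compact rewrite of the chunker above, not a different algorithm.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Sliding-window chunker: each window starts (size - overlap) after the last."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, size=10, overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

The last three characters of each chunk reappear at the start of the next one, so a phrase that straddles a boundary is still whole in at least one chunk, provided it is shorter than the overlap.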
Step 4: Create embeddings and store in ChromaDB
Now embed every chunk and store the vectors in ChromaDB:
import chromadb
import ollama

# Initialize ChromaDB (persistent, stored on disk)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="my_docs")

# Embed and store each chunk
for i, chunk in enumerate(all_chunks):
    response = ollama.embed(model="nomic-embed-text", input=chunk)
    embedding = response["embeddings"][0]
    # upsert instead of add, so re-running the script overwrites existing IDs
    collection.upsert(
        ids=[f"chunk_{i}"],
        embeddings=[embedding],
        documents=[chunk],
    )
print(f"Stored {len(all_chunks)} embeddings in ChromaDB")
ChromaDB's PersistentClient saves everything to ./chroma_db on disk, so the embeddings survive between runs. A separate query script can open the same collection later without re-embedding anything.
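Step 4 makes one embedding call per chunk. For larger folders, batching can cut the number of round trips. The sketch below keeps the batching logic independent of any library by taking the embed and store steps as callables. The names `batched`, `index_in_batches`, `embed_batch`, and `store_batch` are hypothetical, and the batch size of 32 is an arbitrary starting value. The stand-in functions at the bottom exist only so the sketch runs without Ollama or ChromaDB.

```python
from typing import Callable, Iterator, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[tuple[int, Sequence[str]]]:
    """Yield (start_index, batch) pairs covering items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def index_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    store_batch: Callable[[list[str], list[list[float]], list[str]], None],
    batch_size: int = 32,
) -> int:
    """Embed and store chunks batch by batch; return how many were stored."""
    stored = 0
    for start, batch in batched(chunks, batch_size):
        texts = list(batch)
        vectors = embed_batch(texts)
        # Same ID scheme as the tutorial: chunk_0, chunk_1, ...
        ids = [f"chunk_{start + offset}" for offset in range(len(texts))]
        store_batch(ids, vectors, texts)
        stored += len(texts)
    return stored

# In the real pipeline the callables could wrap the Step 4 objects, e.g.
#   embed_batch = lambda texts: ollama.embed(model="nomic-embed-text", input=texts)["embeddings"]
#   store_batch = lambda ids, vecs, docs: collection.add(ids=ids, embeddings=vecs, documents=docs)
# Below are stand-ins so the sketch runs on its own.
fake_db: dict[str, str] = {}
count = index_in_batches(
    [f"text {i}" for i in range(70)],
    embed_batch=lambda texts: [[float(len(t))] for t in texts],
    store_batch=lambda ids, vecs, docs: fake_db.update(zip(ids, docs)),
    batch_size=32,
)
print(count, len(fake_db))  # 70 70
```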
Step 5: Query the pipeline
This is where it all comes together. Embed the user's question, find the most similar chunks, and pass them to the LLM as context:
import ollama
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection(name="my_docs")

def query_rag(question: str, n_results: int = 3) -> str:
    # Embed the question
    response = ollama.embed(model="nomic-embed-text", input=question)
    query_embedding = response["embeddings"][0]
    # Retrieve relevant chunks
    results = collection.query(query_embeddings=[query_embedding], n_results=n_results)
    context = "\n\n---\n\n".join(results["documents"][0])
    # Generate answer with context
    prompt = f"""Use the following context to answer the question. If the context doesn't contain the answer, say so.
Context:
{context}
Question: {question}
Answer:"""
    output = ollama.chat(
        model="qwen3.6:35b-a3b",
        messages=[{"role": "user", "content": prompt}],
    )
    return output["message"]["content"]

# Try it
answer = query_rag("What is the refund policy?")
print(answer)
The n_results=3 parameter controls how many chunks get retrieved. Three is a good default: enough context without overwhelming the LLM's context window.
Full working script
Here's everything in one file you can copy and run:
import os
import ollama
import chromadb

# --- Config ---
DOCS_DIR = "./my_docs"
CHROMA_PATH = "./chroma_db"
EMBED_MODEL = "nomic-embed-text"
LLM_MODEL = "qwen3.6:35b-a3b"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

# --- Load & Chunk ---
def load_documents(directory: str) -> list[str]:
    docs = []
    # Sort so chunk IDs stay stable between runs
    for filename in sorted(os.listdir(directory)):
        if filename.endswith((".txt", ".md")):
            with open(os.path.join(directory, filename), encoding="utf-8") as f:
                docs.append(f.read())
    return docs

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    if overlap >= size:
        # Otherwise start would never advance and the loop would not end
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# --- Embed & Store ---
def build_index(chunks: list[str]) -> None:
    client = chromadb.PersistentClient(path=CHROMA_PATH)
    # Start fresh each time: remove the old collection if it exists.
    # The exception raised for a missing collection differs across
    # ChromaDB versions, so this catches broadly.
    try:
        client.delete_collection(name="my_docs")
    except Exception:
        pass
    collection = client.create_collection(name="my_docs")
    for i, chunk in enumerate(chunks):
        response = ollama.embed(model=EMBED_MODEL, input=chunk)
        collection.add(
            ids=[f"chunk_{i}"],
            embeddings=[response["embeddings"][0]],
            documents=[chunk],
        )
    print(f"Indexed {len(chunks)} chunks.")

# --- Query ---
def query_rag(question: str, n_results: int = 3) -> str:
    client = chromadb.PersistentClient(path=CHROMA_PATH)
    collection = client.get_collection(name="my_docs")
    response = ollama.embed(model=EMBED_MODEL, input=question)
    results = collection.query(
        query_embeddings=[response["embeddings"][0]], n_results=n_results
    )
    context = "\n\n---\n\n".join(results["documents"][0])
    prompt = f"""Use the following context to answer the question. If the context doesn't contain the answer, say so.
Context:
{context}
Question: {question}
Answer:"""
    output = ollama.chat(
        model=LLM_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return output["message"]["content"]

# --- Main ---
if __name__ == "__main__":
    documents = load_documents(DOCS_DIR)
    all_chunks = []
    for doc in documents:
        all_chunks.extend(chunk_text(doc))
    print(f"Loaded {len(documents)} documents, {len(all_chunks)} chunks")
    build_index(all_chunks)
    while True:
        question = input("\nAsk a question (or 'quit'): ")
        if question.strip().lower() == "quit":
            break
        print("\n" + query_rag(question))
Drop your .txt or .md files into ./my_docs, run the script, and start asking questions. Each run re-embeds the documents at startup. After that, every question in the session hits the stored index, so only the question itself has to be embedded.
Tips for better results
Chunk size matters. 500 characters is a starting point. If your answers feel incomplete, try 800 to 1000. If they feel noisy and unfocused, go smaller: 300 characters works well for FAQ-style documents. There's no universal best value; it depends entirely on the structure and density of your documents. Experiment and compare the output quality.
Overlap prevents lost context. The 50-character overlap means text near a chunk boundary appears in both neighboring chunks, so a sentence cut at the boundary is less likely to lose its meaning. For longer documents with complex paragraphs, try an overlap of 100 to 200. The tradeoff is more chunks (and slightly more storage), but better retrieval accuracy at the edges.
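The storage side of that tradeoff is easy to estimate: the window advances by `size - overlap` characters each step, so the chunk count is the text length divided by that step, rounded up. A small sketch, assuming a single 100,000-character document (an arbitrary figure for illustration):

```python
import math

def chunk_count(text_length: int, size: int, overlap: int) -> int:
    """Number of chunks the sliding-window chunker produces for one text."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    return math.ceil(text_length / (size - overlap))

for overlap in (0, 50, 100, 200):
    print(overlap, chunk_count(100_000, size=500, overlap=overlap))
# 0 200
# 50 223
# 100 250
# 200 334
```

Going from 50 to 200 characters of overlap adds roughly half again as many chunks, and each one costs an embedding call at index time.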
Prompt template is everything. The simple template above works, but you can improve it significantly. Add instructions like "Answer in bullet points," "Cite which section the information comes from," or "If multiple chunks disagree, mention the discrepancy." A well-crafted prompt template is often the single biggest lever for output quality, more impactful than switching models.
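One way to apply those instructions is to number the retrieved chunks so the model has something to cite. This is a sketch, not a tuned template: `build_prompt` is a hypothetical helper, and the exact wording of the instructions is a starting point to experiment with.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt whose context chunks are numbered for citation."""
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    instructions = (
        "Use the following context to answer the question. "
        "If the context doesn't contain the answer, say so.\n"
        "Cite the chunk numbers you relied on, like [1].\n"
        "If chunks disagree, mention the discrepancy."
    )
    return f"{instructions}\n\nContext:\n{numbered}\n\nQuestion: {question}\n\nAnswer:"

print(build_prompt(
    "What is the refund policy?",
    ["Refunds are accepted within 30 days.", "Shipping takes 5 business days."],
))
```

In query_rag, the list passed as `chunks` would be `results["documents"][0]`.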
Retrieve more, filter later. Pulling 5 chunks instead of 3 gives the LLM more to work with. If some retrieved chunks are irrelevant, the model usually ignores them. You can also add a relevance threshold: ChromaDB returns a distance with each result (lower means closer), so you can filter out chunks that are too far from the query embedding.
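A threshold filter over the query result can be sketched as follows. The result dictionary here is written by hand in the shape used by query_rag (`results["documents"][0]`, plus the matching `results["distances"][0]`), and the 0.5 cutoff is purely illustrative. A useful threshold depends on the distance metric and the embedding model, so it has to be found by inspecting real queries.

```python
def filter_by_distance(results: dict, max_distance: float) -> list[str]:
    """Keep only the documents whose distance is at most max_distance.

    Expects results for a single query embedding, with parallel lists in
    results["documents"][0] and results["distances"][0].
    """
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]

# Hand-written stand-in for a query result; the values are invented.
results = {
    "documents": [["Refunds within 30 days.", "Office hours are 9 to 5.", "Careers page."]],
    "distances": [[0.21, 0.48, 0.93]],
}
print(filter_by_distance(results, max_distance=0.5))
# ['Refunds within 30 days.', 'Office hours are 9 to 5.']
```

If the filter removes every chunk, it is usually better to tell the user nothing relevant was found than to call the LLM with an empty context.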
Re-index when documents change. The script above rebuilds the index each run. For a production setup, you'd want incremental updates: track file hashes and only re-embed changed files. ChromaDB supports upserting by ID, which makes this straightforward.
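The hash-tracking half of that idea needs only the standard library. The sketch below assumes a JSON manifest file that maps filenames to SHA-256 digests. The manifest format and the function names are hypothetical, and the sketch does not handle deleted files or files that shrink (both would leave stale chunks in the index).

```python
import hashlib
import json
from pathlib import Path

def file_hash(path: Path) -> str:
    """SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(docs_dir: Path, manifest_path: Path) -> tuple[list[Path], dict[str, str]]:
    """Return files whose hash differs from the manifest, plus the new manifest."""
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {
        p.name: file_hash(p)
        for p in sorted(docs_dir.iterdir())
        if p.suffix in {".txt", ".md"}
    }
    changed = [docs_dir / name for name, digest in new.items() if old.get(name) != digest]
    return changed, new

def save_manifest(manifest_path: Path, manifest: dict[str, str]) -> None:
    """Write the manifest after the changed files have been re-indexed."""
    manifest_path.write_text(json.dumps(manifest, indent=2))
```

For each changed file you would re-chunk it, embed the chunks, and upsert them under IDs that include the filename (for example `f"{path.name}:{i}"`), then call save_manifest once indexing succeeds.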
Try different embedding models. nomic-embed-text is a great default, but Ollama supports other embedding models too. If your documents are code-heavy, a code-specific embedding model may retrieve better results. The rest of the pipeline stays identical: swap the model name and rebuild the index, since vectors produced by different models can't be compared with each other.
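When comparing models, keeping one collection per embedding model avoids mixing incompatible vectors in the same index. A minimal sketch of a naming helper follows. `collection_name_for` and the second model name are hypothetical, and the character replacement is a conservative guess at what ChromaDB accepts in collection names, so check its naming rules for your version.

```python
import re

def collection_name_for(base: str, embed_model: str) -> str:
    """Build a collection name that encodes the embedding model.

    Anything other than letters, digits, underscore, and hyphen is replaced
    with a hyphen, since model tags often contain ':' or '.'.
    """
    safe_model = re.sub(r"[^A-Za-z0-9_-]", "-", embed_model)
    return f"{base}__{safe_model}"

print(collection_name_for("my_docs", "nomic-embed-text"))
# my_docs__nomic-embed-text
print(collection_name_for("my_docs", "example-code-embedder:v1"))  # hypothetical model name
# my_docs__example-code-embedder-v1
```

Passing the result as the `name` argument when creating and opening the collection lets you index the same documents under two models and compare retrieval side by side.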
When to use local RAG vs cloud RAG
Go local when:
- Your data is sensitive or regulated (GDPR considerations)
- You want zero ongoing costs
- You need offline access
- You're building internal tools for a team
Go cloud when:
- You need the absolute best model quality (GPT-4.5, Claude)
- Your document corpus is massive (millions of chunks)
- You don't want to manage hardware
- Latency requirements are strict and you lack a good GPU
For most personal and small-team use cases (querying internal docs, building a private docs chatbot, codebase Q&A), local RAG is more than good enough. The models available through Ollama in 2026 are genuinely capable, and you keep full control of your data. Start local, and only move to cloud if you hit a concrete wall.