Perplexity answers questions by searching the web, reading the results, and synthesizing an answer with citations. You can build the same thing. Here's how.
Architecture
User question
↓
1. Classify: needs web search? or local knowledge?
↓
2. Search: web API (Brave/Serper) + vector DB (local docs)
↓
3. Retrieve: fetch top 5-10 results, extract relevant text
↓
4. Generate: LLM synthesizes answer with citations
↓
5. Stream: response streams to user in real-time
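The first step in the diagram, deciding whether a question needs the web or local knowledge, can start as a simple keyword heuristic. This is a minimal sketch under that assumption: `classify_query`, the two hint lists, and the route labels are hypothetical names chosen for illustration, and a small LLM call could replace the heuristic later.

```python
import re

# Hypothetical hint lists; tune these for your own documents and users.
FRESHNESS_HINTS = {"today", "latest", "news", "current", "price", "weather", "yesterday"}
LOCAL_HINTS = {"our", "internal", "policy", "handbook", "runbook"}

def classify_query(question: str) -> str:
    """Return "web", "local", or "both" using simple keyword matching."""
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    wants_web = bool(words & FRESHNESS_HINTS)
    wants_local = bool(words & LOCAL_HINTS)
    if wants_web and not wants_local:
        return "web"
    if wants_local and not wants_web:
        return "local"
    return "both"  # ambiguous or no hints: search everywhere
```

Defaulting to "both" means a misclassified question costs an extra search rather than a missing answer.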
Step 1: Web search API
You need a search API that returns structured results (title, URL, and a text snippet for each hit), not just a page of links:
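import os
import requests

BRAVE_API_KEY = os.environ["BRAVE_API_KEY"]  # read the key from the environment

def web_search(query, num_results=5):
    # Brave Search API (free tier: 2,000 queries/month)
    resp = requests.get(
        "https://api.search.brave.com/res/v1/web/search",
        headers={"X-Subscription-Token": BRAVE_API_KEY},
        params={"q": query, "count": num_results},
        timeout=10,
    )
    resp.raise_for_status()  # surface auth and rate-limit errors early
    results = resp.json().get("web", {}).get("results", [])
    return [{"title": r["title"], "url": r["url"], "snippet": r.get("description", "")}
            for r in results]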
Alternatives: Serper ($50/5K queries), SerpAPI, or Tavily (built for AI search).
Step 2: Content extraction
Search results give you snippets. For better answers, fetch and extract the full page:
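from trafilatura import fetch_url, extract

def get_page_content(url):
    downloaded = fetch_url(url)  # returns None if the download fails
    if downloaded is None:
        return None
    # Returns the main text, or None if nothing could be extracted
    return extract(downloaded, include_links=False, include_tables=True)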
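Fetching five to ten pages one after another adds noticeable latency, so it can help to download them in parallel. The sketch below assumes a thread pool is acceptable for this I/O-bound work. `fetch_all`, `max_workers`, and `max_chars` are hypothetical names, and `fetch` stands for any function that takes a URL and returns text or None, such as `get_page_content`.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=5, max_chars=2000):
    """Fetch every URL in parallel and return {url: text} for pages that succeeded.

    `fetch` is any callable that takes a URL and returns text or None.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None  # treat any failure as "no content" for this page

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(safe_fetch, urls))  # preserves input order

    # Drop failed pages and truncate the rest to keep the prompt small
    return {url: text[:max_chars] for url, text in zip(urls, texts) if text}
```

A call such as `fetch_all([r["url"] for r in results], get_page_content)` would then return only the pages that could be downloaded and extracted.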
Step 3: The RAG pipeline
Combine search results with an LLM to generate an answer:
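import os
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API; swap base_url and key for OpenRouter, OpenAI, etc.
client = OpenAI(base_url="https://api.deepseek.com", api_key=os.environ["DEEPSEEK_API_KEY"])

def answer_question(question, search_results):
    # Number each source so the model can cite it as [1], [2], ...
    # Use the extracted page text when present, otherwise the search snippet.
    context = "\n\n".join([
        f"Source [{i+1}]: {r['title']}\nURL: {r['url']}\n"
        f"{(r.get('content') or r.get('snippet', ''))[:2000]}"
        for i, r in enumerate(search_results)
    ])
    response = client.chat.completions.create(
        model="deepseek-chat",  # Cheap and good enough
        messages=[{
            "role": "system",
            "content": "Answer the question using ONLY the provided sources. Cite sources as [1], [2], etc. If sources don't contain the answer, say so."
        }, {
            "role": "user",
            "content": f"Sources:\n{context}\n\nQuestion: {question}"
        }],
        stream=True
    )
    return response  # an iterator of chunks, because stream=True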
At DeepSeek's $0.27 per 1M tokens, a prompt with five sources of 2,000 characters each (roughly 2,500 tokens) costs under $0.001 per query. For better quality, use Claude or GPT-5.
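Because the request sets `stream=True`, `answer_question` returns an iterator of chunks rather than a finished string. A minimal sketch of turning those chunks into text, assuming the OpenAI-style chunk shape (`chunk.choices[0].delta.content`); the helper name `iter_answer_text` is hypothetical.

```python
def iter_answer_text(stream):
    """Yield text pieces from a streamed chat completion."""
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices (for example, usage-only chunks)
        piece = chunk.choices[0].delta.content
        if piece:  # role-only and final chunks have no content
            yield piece
```

To print an answer as it arrives, loop over `iter_answer_text(answer_question(question, results))` and call `print(piece, end="", flush=True)` for each piece.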
Step 4: Add a vector database for local knowledge
For searching your own documents (not just the web), add a vector database:
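import chromadb

# Index your docs once (your_documents is a list of strings).
# chromadb.Client() keeps the index in memory; the data is lost when the process exits.
collection = chromadb.Client().create_collection("knowledge")
collection.add(
    documents=your_documents,
    ids=[f"doc_{i}" for i in range(len(your_documents))]
)

# Search at query time
def local_search(query):
    # Returns a dict of parallel lists ("ids", "documents", "distances", ...), one entry per query text
    return collection.query(query_texts=[query], n_results=5)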
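Long documents are usually split into smaller passages before indexing, so that each retrieved result is focused and fits in the prompt. The sketch below assumes fixed-size character chunks with some overlap. `chunk_text` and its default sizes are hypothetical, and splitting on sentences or headings is a common alternative.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters, which reduces the chance
    that a relevant passage is split across a boundary.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk can then be added to the collection with its own id, for example `f"doc_{i}_chunk_{j}"`.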
Step 5: Combine web + local search
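def search(question):
    web_results = web_search(question)
    for r in web_results:  # prefer full page text, fall back to the snippet
        r["content"] = get_page_content(r["url"]) or r["snippet"]
    hits = local_search(question)  # Chroma returns parallel lists per query
    local_results = [{"title": doc_id, "url": f"local:{doc_id}", "content": doc}
                     for doc_id, doc in zip(hits["ids"][0], hits["documents"][0])]
    # Merge and deduplicate by URL, keeping the first occurrence
    merged = {}
    for r in web_results + local_results:
        merged.setdefault(r["url"], r)
    return answer_question(question, list(merged.values()))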
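Since the system prompt asks the model to cite sources as [1], [2], and so on, the application needs a way to map those markers back to URLs when it renders the answer. A minimal sketch, assuming the markers follow exactly that bracketed-number form; `cited_sources` is a hypothetical helper name.

```python
import re

def cited_sources(answer, sources):
    """Return the sources cited in `answer`, in order of first mention.

    `sources` is the list passed to the LLM, so marker [1] refers to sources[0].
    Markers that point outside the list are ignored.
    """
    cited, seen = [], set()
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        if 1 <= number <= len(sources) and number not in seen:
            seen.add(number)
            cited.append({"number": number, **sources[number - 1]})
    return cited
```

Ignoring out-of-range markers is a deliberate choice here: a model occasionally cites a source number that was never provided, and it is safer to drop that marker than to link it to the wrong page.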
Cost per query
| Component | Cost per query |
|---|---|
| Web search (Brave) | $0.025 |
| Content extraction | Free (self-hosted) |
| Embeddings (local search) | $0.0001 |
| LLM generation (DeepSeek) | $0.001 |
| Total | ~$0.03 |
At $0.03/query, 10,000 queries/month costs $300. Using prompt caching and model routing can cut this further.
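The arithmetic behind the table can be written out so you can plug in your own prices. The figures below are copied from the table; the `COSTS` dictionary and `monthly_cost` function are hypothetical names used only for this illustration.

```python
# Per-query costs in USD, copied from the table above.
COSTS = {
    "web_search": 0.025,
    "content_extraction": 0.0,
    "embeddings": 0.0001,
    "llm_generation": 0.001,
}

def monthly_cost(queries_per_month, costs=COSTS):
    """Return (cost per query, cost per month) in USD."""
    per_query = sum(costs.values())
    return per_query, per_query * queries_per_month

per_query, per_month = monthly_cost(10_000)
print(f"${per_query:.4f} per query, ${per_month:.2f} per month")
```

The unrounded sum is $0.0261 per query, or about $261 for 10,000 queries; the $300 figure comes from rounding up to $0.03. Web search dominates the total, so that is the component worth optimizing first.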
Going further
- Streaming UI: use Server-Sent Events to stream the answer as it generates
- Follow-up questions: maintain conversation context across queries
- Source quality ranking: weight authoritative sources higher
- Answer verification: cross-check claims against multiple sources
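For the streaming UI item, the Server-Sent Events wire format is simple enough to produce by hand: each message is one or more `data:` lines followed by a blank line. The sketch below shows that formatting only, independent of any web framework. `format_sse`, `sse_stream`, and the `done` event name are hypothetical choices for illustration.

```python
def format_sse(data, event=None):
    """Format one Server-Sent Events message."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # Each line of the payload needs its own "data:" field
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"  # a blank line ends the message

def sse_stream(pieces):
    """Wrap an iterable of text pieces as SSE messages, then signal completion."""
    for piece in pieces:
        yield format_sse(piece)
    yield format_sse("", event="done")  # hypothetical event name for the client
```

The generator can be returned from a streaming HTTP response with the content type `text/event-stream`, and the browser can read it with `EventSource`.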