RAG AI Agent: How to Build Retrieval-Augmented Generation Agents (2026 Guide)
Your AI agent is smart — until you ask it about your data. Then it hallucinates, guesses, or politely refuses. RAG (Retrieval-Augmented Generation) fixes that. It connects your agent to your knowledge base so every answer is grounded in real, up-to-date information.
This guide covers the complete architecture: from chunking documents to vector search to production-grade RAG agents that actually work. No theory-only fluff — you'll get code you can run today.
What Is RAG (And Why Your Agent Needs It)
Retrieval-Augmented Generation is a two-step process:
- Retrieve — Search your knowledge base for relevant documents
- Generate — Feed those documents to an LLM as context, then generate an answer
Without RAG, your agent only knows what was in its training data (which is months old). With RAG, it can answer questions about your company docs, product catalog, support tickets, or any proprietary data — accurately and with citations.
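In code, the whole pattern fits in about a dozen lines. A minimal sketch, assuming a search() helper that returns relevant chunks (built out layer by layer below) and the Anthropic SDK for generation:
# rag_minimal.py — sketch only; search() is implemented in the layers below
import anthropic

client = anthropic.Anthropic()

def answer(question: str) -> str:
    chunks = search(question)                    # 1. Retrieve
    context = "\n\n".join(c["text"] for c in chunks)
    response = client.messages.create(           # 2. Generate
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text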
When to Use RAG
- Internal knowledge bases — Company docs, SOPs, policies
- Customer support — Product manuals, FAQ, troubleshooting guides
- Legal / compliance — Contracts, regulations, case law
- Sales enablement — Product specs, pricing, competitor intel
- Research — Papers, reports, datasets
The RAG Agent Architecture
A production RAG agent has 5 layers. Skip any one and you'll feel it in production.
Layer 1: Document Ingestion
Raw documents go in. Clean, chunked text comes out.
# document_loader.py
import os
from pathlib import Path
class DocumentLoader:
"""Load and extract text from various file formats."""
LOADERS = {
'.pdf': 'load_pdf',
'.md': 'load_markdown',
'.txt': 'load_text',
'.html': 'load_html',
'.docx': 'load_docx',
'.csv': 'load_csv',
}
def load(self, file_path: str) -> list[dict]:
ext = Path(file_path).suffix.lower()
loader = self.LOADERS.get(ext)
if not loader:
raise ValueError(f"Unsupported format: {ext}")
return getattr(self, loader)(file_path)
def load_markdown(self, path: str) -> list[dict]:
with open(path) as f:
text = f.read()
return [{
'text': text,
'source': path,
'format': 'markdown'
}]
def load_pdf(self, path: str) -> list[dict]:
import pymupdf # pip install pymupdf
doc = pymupdf.open(path)
pages = []
for i, page in enumerate(doc):
pages.append({
'text': page.get_text(),
'source': f"{path}#page={i+1}",
'format': 'pdf',
'page': i + 1
})
return pages
Layer 2: Chunking
This is where most RAG systems fail. Bad chunks = bad retrieval = bad answers.
# chunker.py
from dataclasses import dataclass
@dataclass
class Chunk:
text: str
source: str
index: int
metadata: dict
class SmartChunker:
"""Semantic-aware chunking that respects document structure."""
def __init__(self, chunk_size=512, overlap=64):
self.chunk_size = chunk_size
self.overlap = overlap
def chunk_markdown(self, text: str, source: str) -> list[Chunk]:
"""Split by headers first, then by size."""
sections = self._split_by_headers(text)
chunks = []
for section in sections:
if len(section['text'].split()) <= self.chunk_size:
chunks.append(Chunk(
text=section['text'],
source=source,
index=len(chunks),
metadata={'heading': section.get('heading', '')}
))
else:
# Split large sections with overlap
sub_chunks = self._split_with_overlap(
section['text'],
self.chunk_size,
self.overlap
)
for sc in sub_chunks:
chunks.append(Chunk(
text=sc,
source=source,
index=len(chunks),
metadata={'heading': section.get('heading', '')}
))
return chunks
def _split_by_headers(self, text):
sections = []
current = {'text': '', 'heading': ''}
for line in text.split('\n'):
if line.startswith('#'):
if current['text'].strip():
sections.append(current)
current = {'text': '', 'heading': line.strip('# ')}
current['text'] += line + '\n'
if current['text'].strip():
sections.append(current)
return sections
def _split_with_overlap(self, text, size, overlap):
words = text.split()
chunks = []
start = 0
while start < len(words):
end = start + size
chunk = ' '.join(words[start:end])
chunks.append(chunk)
start = end - overlap
return chunks
Chunk size guidelines (a usage sketch follows this list; note the SmartChunker above splits on whitespace, so sizes are in words, a rough proxy for tokens):
- 512 tokens — Good default for most use cases
- 256 tokens — Better for precise Q&A (support, legal)
- 1024 tokens — Better for summarization and long-form
- Always overlap — 10-15% prevents losing context at boundaries
- Respect structure — Don't split mid-paragraph or mid-table
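A quick usage sketch wiring the loader and chunker above together with the default settings (the file path is hypothetical):
# chunk_docs.py — DocumentLoader + SmartChunker together
loader = DocumentLoader()
chunker = SmartChunker(chunk_size=512, overlap=64)   # ~12% overlap

chunks = []
for doc in loader.load("docs/handbook.md"):          # hypothetical file
    chunks.extend(chunker.chunk_markdown(doc['text'], doc['source']))

print(f"{len(chunks)} chunks, "
      f"avg {sum(len(c.text.split()) for c in chunks) // len(chunks)} words each")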
Layer 3: Embedding & Vector Storage
Convert chunks to vectors, store them in a database optimized for similarity search.
# embedder.py
class Embedder:
"""Generate embeddings using Voyage AI (Anthropic's partner)."""
def __init__(self):
# Or use OpenAI, Cohere, local models
import voyageai
self.client = voyageai.Client()
self.model = "voyage-3" # 1024 dimensions
def embed_chunks(self, chunks: list[Chunk]) -> list[list[float]]:
texts = [c.text for c in chunks]
# Batch embed (max 128 per call)
embeddings = []
for i in range(0, len(texts), 128):
batch = texts[i:i+128]
result = self.client.embed(
batch,
model=self.model,
input_type="document"
)
embeddings.extend(result.embeddings)
return embeddings
def embed_query(self, query: str) -> list[float]:
result = self.client.embed(
[query],
model=self.model,
input_type="query"
)
return result.embeddings[0]
# vector_store.py — Using ChromaDB (simplest to start)
import chromadb
import hashlib
class VectorStore:
def __init__(self, collection_name="knowledge_base"):
self.client = chromadb.PersistentClient(path="./chroma_db")
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"}
)
def add(self, chunks, embeddings):
self.collection.add(
ids=[f"chunk_{c.index}_{hash(c.source)}" for c in chunks],
embeddings=embeddings,
documents=[c.text for c in chunks],
metadatas=[{
'source': c.source,
'heading': c.metadata.get('heading', ''),
'index': c.index
} for c in chunks]
)
def search(self, query_embedding, n_results=5, where=None):
kwargs = {
'query_embeddings': [query_embedding],
'n_results': n_results,
}
if where:
kwargs['where'] = where
results = self.collection.query(**kwargs)
return [{
'text': doc,
'source': meta['source'],
'heading': meta['heading'],
'score': 1 - dist # Convert distance to similarity
} for doc, meta, dist in zip(
results['documents'][0],
results['metadatas'][0],
results['distances'][0]
)]
Layer 4: Retrieval Strategy
Basic vector search works. But smart retrieval is what separates "okay RAG" from "this is actually useful." Six patterns worth knowing (a sketch of the HyDE and multi-query variants follows the table):
| Strategy | How It Works | When to Use |
|---|---|---|
| Naive RAG | Embed query → top-k nearest chunks | Simple Q&A, prototypes |
| Hybrid Search | Vector search + keyword search (BM25), merge results | When exact terms matter (product names, codes) |
| Reranking | Retrieve top-20, rerank with cross-encoder to top-5 | When precision matters more than speed |
| HyDE | LLM generates hypothetical answer → embed that → search | When queries are vague or conversational |
| Multi-Query | LLM generates 3-5 query variations → search all → merge | Complex questions that touch multiple topics |
| Agentic RAG | Agent decides when/what to retrieve, can refine queries | Production systems with diverse data sources |
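Most of these patterns reappear later in this guide; HyDE and multi-query are quick to try on their own. A rough sketch of both, reusing the Embedder and VectorStore from Layer 3 plus a call_llm() placeholder (the same placeholder the advanced patterns below use):
# query_expansion.py — HyDE and multi-query sketches (call_llm is a placeholder)
def hyde_search(query, embedder, store, top_k=5):
    # Embed a hypothetical answer instead of the raw query
    fake_answer = call_llm(f"Write a short passage that answers: {query}")
    return store.search(embedder.embed_query(fake_answer), n_results=top_k)

def multi_query_search(query, embedder, store, top_k=5):
    variations = call_llm(
        f"Rewrite this question 3 different ways, one per line: {query}"
    ).splitlines()
    merged, seen = [], set()
    for q in [query] + variations:
        for r in store.search(embedder.embed_query(q), n_results=top_k):
            key = r['text'][:100]
            if key not in seen:        # crude dedup; RRF below does this better
                seen.add(key)
                merged.append(r)
    return merged[:top_k]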
# retrieval.py — Hybrid Search + Reranking
class SmartRetriever:
def __init__(self, vector_store, embedder):
self.vector_store = vector_store
self.embedder = embedder
def retrieve(self, query: str, top_k=5) -> list[dict]:
# Step 1: Vector search (semantic)
query_embedding = self.embedder.embed_query(query)
vector_results = self.vector_store.search(
query_embedding, n_results=top_k * 3
)
        # Step 2: Keyword search (BM25); a _bm25_search sketch follows this class
keyword_results = self._bm25_search(query, top_k * 3)
# Step 3: Merge with Reciprocal Rank Fusion
merged = self._reciprocal_rank_fusion(
[vector_results, keyword_results], k=60
)
# Step 4: Rerank top candidates
reranked = self._rerank(query, merged[:top_k * 2])
return reranked[:top_k]
def _reciprocal_rank_fusion(self, result_lists, k=60):
scores = {}
for results in result_lists:
for rank, result in enumerate(results):
key = result['source'] + result['text'][:100]
if key not in scores:
scores[key] = {'result': result, 'score': 0}
scores[key]['score'] += 1 / (k + rank + 1)
sorted_results = sorted(
scores.values(),
key=lambda x: x['score'],
reverse=True
)
return [item['result'] for item in sorted_results]
def _rerank(self, query, candidates):
"""Rerank using Cohere or a cross-encoder."""
import cohere
co = cohere.Client()
results = co.rerank(
query=query,
documents=[c['text'] for c in candidates],
model="rerank-v3.5",
top_n=len(candidates)
)
reranked = []
for r in results.results:
candidate = candidates[r.index]
candidate['rerank_score'] = r.relevance_score
reranked.append(candidate)
return reranked
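One gap in the snippet above: _bm25_search is referenced but not defined. A minimal sketch using the rank_bm25 package (one option among many), indexing the documents already stored in the Chroma collection; in production you would build the index once at ingestion time rather than on every query:
# bm25_search.py — sketch of the missing SmartRetriever._bm25_search method
from rank_bm25 import BM25Okapi   # pip install rank-bm25

def _bm25_search(self, query, n_results):
    # Pull every stored chunk (fine for small/medium collections)
    data = self.vector_store.collection.get(include=["documents", "metadatas"])
    corpus = data["documents"]
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_results]
    return [{
        'text': corpus[i],
        'source': data["metadatas"][i].get('source', ''),
        'heading': data["metadatas"][i].get('heading', ''),
        'score': float(scores[i]),
    } for i in top]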
Layer 5: The RAG Agent
Now wire it all together. The agent doesn't just answer — it decides when to search, what to search for, and whether the results are good enough.
# rag_agent.py
import anthropic
SYSTEM_PROMPT = """You are a knowledgeable assistant with access to a
company knowledge base. When answering questions:
1. ALWAYS search the knowledge base before answering factual questions
2. Cite your sources with [Source: filename] tags
3. If the knowledge base doesn't contain the answer, say so clearly
4. Never make up information — if unsure, say "I don't have that info"
5. For follow-up questions, search again with refined queries
You have these tools:
- search_knowledge_base: Search the vector store for relevant documents
- get_document: Retrieve a full document by source path (not wired up in the example below; add it if you need full-document context)
"""
class RAGAgent:
def __init__(self, retriever, model="claude-sonnet-4-20250514"):
self.retriever = retriever
self.client = anthropic.Anthropic()
self.model = model
self.conversation = []
def chat(self, user_message: str) -> str:
self.conversation.append({
"role": "user",
"content": user_message
})
# Let Claude decide whether to search
response = self.client.messages.create(
model=self.model,
max_tokens=4096,
system=SYSTEM_PROMPT,
tools=[{
"name": "search_knowledge_base",
"description": "Search the knowledge base for relevant info",
"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Search query"
},
"filters": {
"type": "object",
"description": "Optional metadata filters"
}
},
"required": ["query"]
}
}],
messages=self.conversation
)
# Handle tool use loop
while response.stop_reason == "tool_use":
tool_results = []
for block in response.content:
if block.type == "tool_use":
if block.name == "search_knowledge_base":
results = self.retriever.retrieve(
block.input["query"]
)
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": self._format_results(results)
})
self.conversation.append({
"role": "assistant",
"content": response.content
})
self.conversation.append({
"role": "user",
"content": tool_results
})
response = self.client.messages.create(
model=self.model,
max_tokens=4096,
system=SYSTEM_PROMPT,
tools=[...], # Same tools
messages=self.conversation
)
# Extract final text
answer = ""
for block in response.content:
if hasattr(block, 'text'):
answer += block.text
self.conversation.append({
"role": "assistant",
"content": answer
})
return answer
def _format_results(self, results):
formatted = "Retrieved documents:\n\n"
for i, r in enumerate(results, 1):
formatted += f"--- Document {i} ---\n"
formatted += f"Source: {r['source']}\n"
formatted += f"Section: {r['heading']}\n"
formatted += f"Content: {r['text']}\n\n"
return formatted
Vector Database Comparison (2026)
Your vector DB choice matters more than you think. Here's what we've tested:
| Database | Best For | Pricing | Max Vectors |
|---|---|---|---|
| ChromaDB | Prototyping, local dev | Free (open source) | ~1M (local) |
| Pinecone | Production SaaS, managed | Free tier → $70+/mo | Unlimited |
| Weaviate | Hybrid search, multi-modal | Free (self-host) / Cloud | Unlimited |
| Qdrant | Performance, filtering | Free (self-host) / Cloud | Unlimited |
| Supabase pgvector | Already using Postgres | Free tier → $25+/mo | ~5M+ |
| Turbopuffer | Cost-efficient at scale | Pay-per-query | Unlimited |
Embedding Models Compared
| Model | Dimensions | MTEB Score | Cost (1M tokens) |
|---|---|---|---|
| Voyage 3 | 1024 | 67.3 | $0.06 |
| OpenAI text-embedding-3-large | 3072 | 64.6 | $0.13 |
| Cohere embed-v4 | 1024 | 66.1 | $0.10 |
| BGE-M3 (local) | 1024 | 65.0 | Free |
| Nomic Embed (local) | 768 | 62.4 | Free |
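If you want the free local option from the table, BGE-M3 runs through the sentence-transformers package. A drop-in sketch matching the Embedder interface from Layer 3 (the model weights download on first use, roughly 2 GB):
# local_embedder.py — swap-in for the Voyage-based Embedder (local BGE-M3)
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

class LocalEmbedder:
    def __init__(self, model_name="BAAI/bge-m3"):
        self.model = SentenceTransformer(model_name)

    def embed_chunks(self, chunks):
        texts = [c.text for c in chunks]
        return self.model.encode(texts, normalize_embeddings=True).tolist()

    def embed_query(self, query):
        return self.model.encode([query], normalize_embeddings=True)[0].tolist()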
5 Advanced RAG Patterns
Pattern 1: Parent-Child Chunking
Store small chunks for precise retrieval, but return the full parent section for context.
# Small chunks (256 tokens) for embedding
# When matched, return the parent section (1024+ tokens)
# This gives you precise matching + rich context
# chunk(), embed(), embed_query() are placeholder helpers; this sketch also
# assumes the child store returns chunk ids alongside text
class ParentChildStore:
def __init__(self):
self.child_store = VectorStore("children")
self.parents = {} # child_id -> parent_text
def add(self, document):
# Create large parent chunks
parents = chunk(document, size=1024)
for parent in parents:
# Split each parent into children
children = chunk(parent.text, size=256)
child_embeddings = embed(children)
self.child_store.add(children, child_embeddings)
for child in children:
self.parents[child.id] = parent.text
def search(self, query, top_k=3):
# Search children, return parents
child_results = self.child_store.search(
embed_query(query), n_results=top_k
)
return [self.parents[r['id']] for r in child_results]
Pattern 2: Contextual Retrieval (Anthropic's Method)
Before embedding, prepend each chunk with LLM-generated context about where it fits in the document. This dramatically improves retrieval accuracy.
# Add context to each chunk before embedding
def add_context(chunk, full_document):
prompt = f"""Here is a document:
{full_document[:2000]}
Here is a chunk from that document:
{chunk.text}
Please give a short (2-3 sentence) context explaining what
this chunk is about and where it fits in the document."""
context = call_llm(prompt)
chunk.text = f"{context}\n\n{chunk.text}"
return chunk
# Anthropic reports up to a 49% reduction in retrieval failures when
# contextual embeddings are combined with contextual BM25 (35% for embeddings alone)
Pattern 3: Query Routing
Different questions need different retrieval strategies. Let the agent route.
# Route queries to the best retrieval strategy
ROUTING_PROMPT = """Classify this query:
1. FACTUAL - Needs specific facts (use hybrid search)
2. CONCEPTUAL - Needs explanation (use semantic search)
3. COMPARATIVE - Compares things (use multi-query)
4. PROCEDURAL - How-to steps (use parent-child)
5. NONE - Doesn't need retrieval (chitchat, math)
Query: {query}
Classification:"""
def route_query(query):
classification = call_llm(ROUTING_PROMPT.format(query=query))
if "FACTUAL" in classification:
return hybrid_search(query)
elif "CONCEPTUAL" in classification:
return semantic_search(query)
elif "COMPARATIVE" in classification:
return multi_query_search(query)
elif "PROCEDURAL" in classification:
return parent_child_search(query)
else:
return [] # No retrieval needed
Pattern 4: Self-Correcting RAG
# Agent checks if retrieved context actually answers the question
def rag_with_self_correction(query, max_retries=2):
for attempt in range(max_retries + 1):
results = retriever.retrieve(query)
# Grade the retrieval
grade = call_llm(f"""Do these documents contain
enough information to answer: "{query}"?
Documents: {format_results(results)}
Answer YES or NO with a brief explanation.""")
if "YES" in grade:
return generate_answer(query, results)
# Refine the query and try again
query = call_llm(f"""The search for "{query}" didn't
return good results. Rephrase the query to find
better matches. Just output the new query.""")
return "I couldn't find sufficient information to answer."
Pattern 5: Multi-Source RAG
# Agent can search multiple knowledge bases
tools = [
{"name": "search_docs", "desc": "Internal documentation"},
{"name": "search_tickets", "desc": "Support ticket history"},
{"name": "search_api_docs", "desc": "API reference"},
{"name": "search_slack", "desc": "Slack conversations"},
]
# The agent decides which source(s) to query
# and can cross-reference between them
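A sketch of how this wires into the RAGAgent tool-use loop from Layer 5: one retriever per source, dispatched by tool name (the retriever instances here are placeholders):
# multi_source.py — one knowledge base per tool, dispatched by name
SOURCES = {
    "search_docs": docs_retriever,          # placeholder SmartRetriever instances
    "search_tickets": tickets_retriever,
    "search_api_docs": api_retriever,
}

tools = [{
    "name": name,
    "description": f"Search the {name.removeprefix('search_')} knowledge base",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "Search query"}},
        "required": ["query"],
    },
} for name in SOURCES]

# Inside the tool-use loop from Layer 5, replace the single-tool check with:
#   if block.name in SOURCES:
#       results = SOURCES[block.name].retrieve(block.input["query"])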
Production RAG Checklist
Before going live, make sure you've handled these:
- Evaluation pipeline — Test with 50+ question-answer pairs. Measure retrieval recall and answer accuracy (a recall@5 sketch follows this checklist).
- Chunking quality — Manually review 20 random chunks. Can a human understand them out of context?
- Citation verification — Every answer should cite its source. Spot-check that citations match.
- Hallucination guard — Ask questions NOT in your knowledge base. The agent should say "I don't know."
- Freshness pipeline — Documents change. Set up automatic re-ingestion (daily or on change).
- Cost monitoring — Track embedding costs, LLM calls per query, vector DB usage.
- Latency budget — Set a target (e.g., <3s). Measure P50 and P99.
- Access control — Users should only retrieve documents they have access to. Use metadata filters.
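A minimal recall@k script for the first checklist item, assuming an eval file of question/expected-source pairs (the filename and format are just a suggestion):
# eval_recall.py — retrieval recall@k over a hand-built eval set
import json

def recall_at_k(retriever, eval_path="eval_set.json", k=5):
    # eval_set.json: [{"question": "...", "expected_source": "docs/refunds.md"}, ...]
    with open(eval_path) as f:
        cases = json.load(f)
    hits = sum(
        any(case["expected_source"] in r["source"]
            for r in retriever.retrieve(case["question"], top_k=k))
        for case in cases
    )
    print(f"recall@{k}: {hits}/{len(cases)} = {hits / len(cases):.0%}")
    return hits / len(cases)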
60-Minute Quickstart: Build Your First RAG Agent
Let's build a working RAG agent that answers questions about your documents.
1 Install Dependencies (2 min)
pip install anthropic chromadb voyageai pymupdf
2 Create the Ingestion Script (10 min)
# ingest.py
import os, chromadb, voyageai
from pathlib import Path
# Init
voyage = voyageai.Client()
chroma = chromadb.PersistentClient(path="./rag_db")
collection = chroma.get_or_create_collection("docs")
def chunk_text(text, size=500, overlap=50):
words = text.split()
chunks = []
for i in range(0, len(words), size - overlap):
chunks.append(' '.join(words[i:i + size]))
return chunks
def ingest_file(path):
with open(path) as f:
text = f.read()
chunks = chunk_text(text)
    # Voyage accepts max 128 texts per call; batch larger files accordingly
    embeddings = voyage.embed(
        chunks, model="voyage-3", input_type="document"
    ).embeddings
collection.add(
ids=[f"{path}_{i}" for i in range(len(chunks))],
embeddings=embeddings,
documents=chunks,
metadatas=[{"source": str(path)} for _ in chunks]
)
print(f"Ingested {path}: {len(chunks)} chunks")
# Ingest all .md and .txt files in ./docs/
docs_dir = Path("./docs")
for file in docs_dir.rglob("*"):
if file.suffix in ['.md', '.txt']:
ingest_file(str(file))
print(f"Total chunks: {collection.count()}")
3 Create the RAG Agent (15 min)
# agent.py
import anthropic, chromadb, voyageai
voyage = voyageai.Client()
chroma = chromadb.PersistentClient(path="./rag_db")
collection = chroma.get_or_create_collection("docs")
claude = anthropic.Anthropic()
def search(query, n=5):
embedding = voyage.embed(
[query], model="voyage-3", input_type="query"
).embeddings[0]
results = collection.query(
query_embeddings=[embedding], n_results=n
)
context = ""
for doc, meta in zip(results['documents'][0], results['metadatas'][0]):
context += f"\n---\nSource: {meta['source']}\n{doc}\n"
return context
def chat(question, history=None):
    history = history or []
context = search(question)
messages = history + [{
"role": "user",
"content": f"""Answer based on this context:
{context}
Question: {question}
Rules:
- Only use information from the context above
- Cite sources as [Source: filename]
- Say "I don't have that information" if not in context"""
}]
response = claude.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=2048,
messages=messages
)
return response.content[0].text
# Interactive loop
print("RAG Agent ready! Ask anything about your docs.")
history = []
while True:
q = input("\nYou: ")
if q.lower() in ['quit', 'exit']:
break
answer = chat(q, history)
print(f"\nAgent: {answer}")
history.append({"role": "user", "content": q})
history.append({"role": "assistant", "content": answer})
4 Add Your Documents (5 min)
mkdir docs
# Add your .md or .txt files to ./docs/
python ingest.py
5 Run It (1 min)
python agent.py
# > You: What's our refund policy?
# > Agent: Based on the documentation, your refund policy...
# [Source: docs/policies/refunds.md]
7 Common RAG Mistakes (And How to Fix Them)
1. Chunks Too Big or Too Small
Problem: Large chunks dilute relevance. Tiny chunks lose context.
Fix: Start at 512 tokens with 10% overlap. Test with your actual queries and adjust.
2. No Metadata Filtering
Problem: Searching the entire knowledge base for every query.
Fix: Add metadata (category, date, department) and filter before searching.
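With the Chroma-backed VectorStore from Layer 3 this is one extra argument; the department and year fields below assume you added them as metadata at ingestion:
# Only search HR documents from 2025 onwards
results = store.search(
    query_embedding,
    n_results=5,
    where={"$and": [
        {"department": {"$eq": "hr"}},
        {"year": {"$gte": 2025}},
    ]},
)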
3. Ignoring Document Structure
Problem: Splitting a table across chunks, or separating a header from its content.
Fix: Use structure-aware chunking. Split by headers, keep tables intact.
4. No Reranking
Problem: Vector search returns semantically similar but irrelevant results.
Fix: Add a reranker (Cohere Rerank, cross-encoder) as a second pass. 20-40% accuracy improvement.
5. Stale Data
Problem: Documents updated but embeddings still reflect the old version.
Fix: Track document hashes. Re-embed on change. Run nightly sync jobs.
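A sketch of hash-based change detection; reingest() is a placeholder for deleting a file's old chunks and re-running ingestion:
# freshness.py — re-embed only files whose content hash changed
import hashlib, json
from pathlib import Path

HASH_FILE = Path("ingest_hashes.json")

def sync(docs_dir="./docs"):
    hashes = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    for file in Path(docs_dir).rglob("*.md"):
        digest = hashlib.sha256(file.read_bytes()).hexdigest()
        if hashes.get(str(file)) != digest:
            reingest(file)           # placeholder: drop old chunks, then ingest_file()
            hashes[str(file)] = digest
    HASH_FILE.write_text(json.dumps(hashes, indent=2))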
6. No Evaluation Set
Problem: "It seems to work" isn't a metric.
Fix: Create 50+ question → expected source pairs. Measure retrieval recall@5 weekly.
7. Stuffing Too Much Context
Problem: Retrieving 20 chunks and cramming them all into the prompt.
Fix: Less is more. 3-5 highly relevant chunks beat 15 mediocre ones. Your LLM bill will thank you too.
Cost Breakdown: RAG in Production
| Component | Budget Build | Pro Build |
|---|---|---|
| Embedding model | BGE-M3 (free, local) | Voyage 3 ($0.06/1M tok) |
| Vector database | ChromaDB (free) | Qdrant Cloud ($30/mo) |
| LLM (generation) | Claude Haiku ($0.25/1M tok) | Claude Sonnet ($3/1M tok) |
| Reranker | None | Cohere Rerank ($1/1K queries) |
| Total (10K queries/mo) | ~$5/mo | ~$80/mo |
What's Next: RAG in 2026 and Beyond
- Agentic RAG — Agents that plan multi-step retrieval, cross-referencing sources automatically
- Graph RAG — Combining vector search with knowledge graphs for better reasoning over relationships
- Multimodal RAG — Indexing images, videos, and audio alongside text
- Real-time RAG — Streaming document updates into the index with sub-second latency
- Longer context windows — 200K+ context may reduce need for retrieval in some cases, but RAG still wins for large knowledge bases