RAG AI Agent: How to Build Retrieval-Augmented Generation Agents (2026 Guide)

Your AI agent is smart — until you ask it about your data. Then it hallucinates, guesses, or politely refuses. RAG (Retrieval-Augmented Generation) fixes that. It connects your agent to your knowledge base so every answer is grounded in real, up-to-date information.

This guide covers the complete architecture: from chunking documents to vector search to production-grade RAG agents that actually work. No theory-only fluff — you'll get code you can run today.

At a glance: 90% hallucination reduction · 5x more accurate answers · $0 fine-tuning cost · 60 minutes to build your first agent

What Is RAG (And Why Your Agent Needs It)

Retrieval-Augmented Generation is a two-step process:

  1. Retrieve — Search your knowledge base for relevant documents
  2. Generate — Feed those documents to an LLM as context, then generate an answer

Without RAG, your agent only knows what was in its training data (which is months old). With RAG, it can answer questions about your company docs, product catalog, support tickets, or any proprietary data — accurately and with citations.
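
In code, that loop is tiny. Here's a minimal sketch, assuming hypothetical search_index() and call_llm() helpers (real versions of both get built later in this guide):

# retrieve-then-generate, in miniature (search_index and call_llm are hypothetical helpers)
def rag_answer(question: str) -> str:
    # 1. Retrieve: pull the most relevant chunks from your knowledge base
    chunks = search_index(question, top_k=5)
    context = "\n\n".join(chunk["text"] for chunk in chunks)

    # 2. Generate: answer with the retrieved chunks as grounding context
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)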

RAG vs Fine-Tuning: Fine-tuning bakes knowledge into model weights (expensive, static). RAG retrieves knowledge at query time (cheap, always fresh). For 90% of use cases, RAG wins. Use fine-tuning only when you need to change the model's behavior, not just its knowledge.

The RAG Agent Architecture

A production RAG agent has 5 layers. Skip any one and you'll feel it in production.

Layer 1: Document Ingestion

Raw documents go in. Clean, chunked text comes out.

# document_loader.py
import os
from pathlib import Path

class DocumentLoader:
    """Load and extract text from various file formats."""
    
    # Map extensions to loader methods. Only load_markdown and load_pdf are
    # implemented below; add the remaining loaders as you need them.
    LOADERS = {
        '.pdf': 'load_pdf',
        '.md': 'load_markdown',
        '.txt': 'load_text',
        '.html': 'load_html',
        '.docx': 'load_docx',
        '.csv': 'load_csv',
    }
    
    def load(self, file_path: str) -> list[dict]:
        ext = Path(file_path).suffix.lower()
        loader = self.LOADERS.get(ext)
        if not loader:
            raise ValueError(f"Unsupported format: {ext}")
        return getattr(self, loader)(file_path)
    
    def load_markdown(self, path: str) -> list[dict]:
        with open(path) as f:
            text = f.read()
        return [{
            'text': text,
            'source': path,
            'format': 'markdown'
        }]
    
    def load_pdf(self, path: str) -> list[dict]:
        import pymupdf  # pip install pymupdf
        doc = pymupdf.open(path)
        pages = []
        for i, page in enumerate(doc):
            pages.append({
                'text': page.get_text(),
                'source': f"{path}#page={i+1}",
                'format': 'pdf',
                'page': i + 1
            })
        return pages

Layer 2: Chunking

This is where most RAG systems fail. Bad chunks = bad retrieval = bad answers.

# chunker.py
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    index: int
    metadata: dict

class SmartChunker:
    """Semantic-aware chunking that respects document structure."""
    
    def __init__(self, chunk_size=512, overlap=64):
        # Sizes are counted in whitespace-separated words here (a rough proxy for tokens)
        self.chunk_size = chunk_size
        self.overlap = overlap
    
    def chunk_markdown(self, text: str, source: str) -> list[Chunk]:
        """Split by headers first, then by size."""
        sections = self._split_by_headers(text)
        chunks = []
        
        for section in sections:
            if len(section['text'].split()) <= self.chunk_size:
                chunks.append(Chunk(
                    text=section['text'],
                    source=source,
                    index=len(chunks),
                    metadata={'heading': section.get('heading', '')}
                ))
            else:
                # Split large sections with overlap
                sub_chunks = self._split_with_overlap(
                    section['text'], 
                    self.chunk_size, 
                    self.overlap
                )
                for sc in sub_chunks:
                    chunks.append(Chunk(
                        text=sc,
                        source=source,
                        index=len(chunks),
                        metadata={'heading': section.get('heading', '')}
                    ))
        
        return chunks
    
    def _split_by_headers(self, text):
        sections = []
        current = {'text': '', 'heading': ''}
        for line in text.split('\n'):
            if line.startswith('#'):
                if current['text'].strip():
                    sections.append(current)
                current = {'text': '', 'heading': line.strip('# ')}
            current['text'] += line + '\n'
        if current['text'].strip():
            sections.append(current)
        return sections
    
    def _split_with_overlap(self, text, size, overlap):
        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = start + size
            chunk = ' '.join(words[start:end])
            chunks.append(chunk)
            start = end - overlap
        return chunks

⚠️ Chunking Rules of Thumb:
  - 512 tokens — Good default for most use cases
  - 256 tokens — Better for precise Q&A (support, legal)
  - 1024 tokens — Better for summarization and long-form
  - Always overlap — 10-15% prevents losing context at boundaries
  - Respect structure — Don't split mid-paragraph or mid-table

Layer 3: Embedding & Vector Storage

Convert chunks to vectors, store them in a database optimized for similarity search.

# embedder.py
from chunker import Chunk

class Embedder:
    """Generate embeddings using Voyage AI (Anthropic's partner)."""
    
    def __init__(self):
        # Or use OpenAI, Cohere, or a local model here
        import voyageai
        self.client = voyageai.Client()
        self.model = "voyage-3"  # 1024 dimensions
    
    def embed_chunks(self, chunks: list[Chunk]) -> list[list[float]]:
        texts = [c.text for c in chunks]
        # Batch embed (max 128 per call)
        embeddings = []
        for i in range(0, len(texts), 128):
            batch = texts[i:i+128]
            result = self.client.embed(
                batch, 
                model=self.model,
                input_type="document"
            )
            embeddings.extend(result.embeddings)
        return embeddings
    
    def embed_query(self, query: str) -> list[float]:
        result = self.client.embed(
            [query], 
            model=self.model,
            input_type="query"
        )
        return result.embeddings[0]

# vector_store.py — Using ChromaDB (simplest to start)
import chromadb

class VectorStore:
    def __init__(self, collection_name="knowledge_base"):
        self.client = chromadb.PersistentClient(path="./chroma_db")
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )
    
    def add(self, chunks, embeddings):
        self.collection.add(
            ids=[f"chunk_{c.index}_{hash(c.source)}" for c in chunks],
            embeddings=embeddings,
            documents=[c.text for c in chunks],
            metadatas=[{
                'source': c.source,
                'heading': c.metadata.get('heading', ''),
                'index': c.index
            } for c in chunks]
        )
    
    def search(self, query_embedding, n_results=5, where=None):
        kwargs = {
            'query_embeddings': [query_embedding],
            'n_results': n_results,
        }
        if where:
            kwargs['where'] = where
        results = self.collection.query(**kwargs)
        return [{
            'text': doc,
            'source': meta['source'],
            'heading': meta['heading'],
            'score': 1 - dist  # Convert distance to similarity
        } for doc, meta, dist in zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        )]
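
Before moving on, here's roughly how Layers 1-3 snap together into an ingestion pipeline. A sketch that assumes the classes above live in the module files named in their header comments:

# build_index.py - wire Layers 1-3 together (sketch)
from pathlib import Path

from document_loader import DocumentLoader
from chunker import SmartChunker
from embedder import Embedder
from vector_store import VectorStore

loader = DocumentLoader()
chunker = SmartChunker(chunk_size=512, overlap=64)
embedder = Embedder()
store = VectorStore("knowledge_base")

for path in Path("./docs").rglob("*.md"):
    for doc in loader.load(str(path)):
        chunks = chunker.chunk_markdown(doc['text'], doc['source'])
        if chunks:
            store.add(chunks, embedder.embed_chunks(chunks))
            print(f"Indexed {doc['source']}: {len(chunks)} chunks")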

Layer 4: Retrieval Strategy

Basic vector search works. But smart retrieval is what separates "okay RAG" from "this is actually useful." Six strategies you should know:

Strategy      | How It Works                                              | When to Use
Naive RAG     | Embed query → top-k nearest chunks                        | Simple Q&A, prototypes
Hybrid Search | Vector search + keyword search (BM25), merge results      | When exact terms matter (product names, codes)
Reranking     | Retrieve top-20, rerank with cross-encoder to top-5       | When precision matters more than speed
HyDE          | LLM generates hypothetical answer → embed that → search   | When queries are vague or conversational
Multi-Query   | LLM generates 3-5 query variations → search all → merge   | Complex questions that touch multiple topics
Agentic RAG   | Agent decides when/what to retrieve, can refine queries   | Production systems with diverse data sources

# retrieval.py — Hybrid Search + Reranking
class SmartRetriever:
    def __init__(self, vector_store, embedder):
        self.vector_store = vector_store
        self.embedder = embedder
    
    def retrieve(self, query: str, top_k=5) -> list[dict]:
        # Step 1: Vector search (semantic)
        query_embedding = self.embedder.embed_query(query)
        vector_results = self.vector_store.search(
            query_embedding, n_results=top_k * 3
        )
        
        # Step 2: Keyword search (BM25)
        # (_bm25_search isn't shown here; implement it over the same chunks,
        #  e.g. with the rank_bm25 package)
        keyword_results = self._bm25_search(query, top_k * 3)
        
        # Step 3: Merge with Reciprocal Rank Fusion
        merged = self._reciprocal_rank_fusion(
            [vector_results, keyword_results], k=60
        )
        
        # Step 4: Rerank top candidates
        reranked = self._rerank(query, merged[:top_k * 2])
        
        return reranked[:top_k]
    
    def _reciprocal_rank_fusion(self, result_lists, k=60):
        scores = {}
        for results in result_lists:
            for rank, result in enumerate(results):
                key = result['source'] + result['text'][:100]
                if key not in scores:
                    scores[key] = {'result': result, 'score': 0}
                scores[key]['score'] += 1 / (k + rank + 1)
        
        sorted_results = sorted(
            scores.values(), 
            key=lambda x: x['score'], 
            reverse=True
        )
        return [item['result'] for item in sorted_results]
    
    def _rerank(self, query, candidates):
        """Rerank using Cohere or a cross-encoder."""
        import cohere
        co = cohere.Client()
        results = co.rerank(
            query=query,
            documents=[c['text'] for c in candidates],
            model="rerank-v3.5",
            top_n=len(candidates)
        )
        reranked = []
        for r in results.results:
            candidate = candidates[r.index]
            candidate['rerank_score'] = r.relevance_score
            reranked.append(candidate)
        return reranked
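
The table above also lists HyDE and Multi-Query, which the code doesn't cover. Multi-query retrieval is a thin wrapper around the same SmartRetriever; a sketch, assuming a call_llm() helper like the placeholder used in the patterns further down:

# multi_query.py - multi-query retrieval on top of SmartRetriever (sketch)
def multi_query_retrieve(retriever, query: str, top_k=5, n_variants=3):
    # Ask an LLM for paraphrased variations of the user's question
    raw = call_llm(
        f"Rewrite this question {n_variants} different ways, one per line:\n{query}"
    )
    variants = [query] + [line.strip() for line in raw.splitlines() if line.strip()]

    # Run the normal retrieval pipeline for every variant
    result_lists = [retriever.retrieve(v, top_k=top_k) for v in variants]

    # Merge with the same Reciprocal Rank Fusion used inside SmartRetriever
    merged = retriever._reciprocal_rank_fusion(result_lists)
    return merged[:top_k]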

Layer 5: The RAG Agent

Now wire it all together. The agent doesn't just answer — it decides when to search, what to search for, and whether the results are good enough.

# rag_agent.py
import anthropic

SYSTEM_PROMPT = """You are a knowledgeable assistant with access to a 
company knowledge base. When answering questions:

1. ALWAYS search the knowledge base before answering factual questions
2. Cite your sources with [Source: filename] tags
3. If the knowledge base doesn't contain the answer, say so clearly
4. Never make up information — if unsure, say "I don't have that info"
5. For follow-up questions, search again with refined queries

You have these tools:
- search_knowledge_base: Search the vector store for relevant documents
- get_document: Retrieve a full document by source path
"""

# Tool definitions, shared by every messages.create call in the chat loop below.
# (The system prompt also lists a get_document tool; define it here the same
# way if you want full-document lookups.)
TOOLS = [{
    "name": "search_knowledge_base",
    "description": "Search the knowledge base for relevant info",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search query"
            },
            "filters": {
                "type": "object",
                "description": "Optional metadata filters"
            }
        },
        "required": ["query"]
    }
}]

class RAGAgent:
    def __init__(self, retriever, model="claude-sonnet-4-20250514"):
        self.retriever = retriever
        self.client = anthropic.Anthropic()
        self.model = model
        self.conversation = []
    
    def chat(self, user_message: str) -> str:
        self.conversation.append({
            "role": "user", 
            "content": user_message
        })
        
        # Let Claude decide whether to search
        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=TOOLS,
            messages=self.conversation
        )
        
        # Handle tool use loop
        while response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    if block.name == "search_knowledge_base":
                        results = self.retriever.retrieve(
                            block.input["query"]
                        )
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": self._format_results(results)
                        })
            
            self.conversation.append({
                "role": "assistant", 
                "content": response.content
            })
            self.conversation.append({
                "role": "user", 
                "content": tool_results
            })
            
            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                system=SYSTEM_PROMPT,
                tools=TOOLS,  # same tool definitions as the first call
                messages=self.conversation
            )
        
        # Extract final text
        answer = ""
        for block in response.content:
            if hasattr(block, 'text'):
                answer += block.text
        
        self.conversation.append({
            "role": "assistant", 
            "content": answer
        })
        return answer
    
    def _format_results(self, results):
        formatted = "Retrieved documents:\n\n"
        for i, r in enumerate(results, 1):
            formatted += f"--- Document {i} ---\n"
            formatted += f"Source: {r['source']}\n"
            formatted += f"Section: {r['heading']}\n"
            formatted += f"Content: {r['text']}\n\n"
        return formatted


Vector Database Comparison (2026)

Your vector DB choice matters more than you think. Here's what we've tested:

Database          | Best For                   | Pricing                  | Max Vectors
ChromaDB          | Prototyping, local dev     | Free (open source)       | ~1M (local)
Pinecone          | Production SaaS, managed   | Free tier → $70+/mo      | Unlimited
Weaviate          | Hybrid search, multi-modal | Free (self-host) / Cloud | Unlimited
Qdrant            | Performance, filtering     | Free (self-host) / Cloud | Unlimited
Supabase pgvector | Already using Postgres     | Free tier → $25+/mo      | ~5M+
Turbopuffer       | Cost-efficient at scale    | Pay-per-query            | Unlimited

Our recommendation: Start with ChromaDB locally. Move to Supabase pgvector or Qdrant for production. Use Pinecone if you want zero ops overhead and have budget.
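
If you later move off the local ChromaDB store, the VectorStore interface above maps almost one-to-one onto Qdrant. A minimal sketch with the qdrant-client package (URL, ids, and payload fields are illustrative; check the client API against the version you install):

# qdrant_store.py - rough Qdrant equivalent of the ChromaDB VectorStore (sketch)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

class QdrantStore:
    def __init__(self, collection_name="knowledge_base", url="http://localhost:6333"):
        self.client = QdrantClient(url=url)
        self.name = collection_name
        if not self.client.collection_exists(self.name):
            self.client.create_collection(
                collection_name=self.name,
                vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
            )

    def add(self, chunks, embeddings):
        # Qdrant ids must be ints or UUIDs; use stable ids in production so
        # re-ingesting a document replaces its points instead of duplicating them
        points = [
            PointStruct(
                id=i,
                vector=emb,
                payload={"text": c.text, "source": c.source,
                         "heading": c.metadata.get("heading", "")},
            )
            for i, (c, emb) in enumerate(zip(chunks, embeddings))
        ]
        self.client.upsert(collection_name=self.name, points=points)

    def search(self, query_embedding, n_results=5):
        hits = self.client.search(
            collection_name=self.name,
            query_vector=query_embedding,
            limit=n_results,
        )
        return [{"text": h.payload["text"], "source": h.payload["source"],
                 "heading": h.payload["heading"], "score": h.score} for h in hits]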

Embedding Models Compared

Model                         | Dimensions | MTEB Score | Cost (1M tokens)
Voyage 3                      | 1024       | 67.3       | $0.06
OpenAI text-embedding-3-large | 3072       | 64.6       | $0.13
Cohere embed-v4               | 1024       | 66.1       | $0.10
BGE-M3 (local)                | 1024       | 65.0       | Free
Nomic Embed (local)           | 768        | 62.4       | Free
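
The local models slot into the same Embedder interface. A sketch using sentence-transformers for BGE-M3 (model name as published on Hugging Face; some embedding models expect query/document prefixes, so check the model card):

# local_embedder.py - drop-in local alternative to the Voyage Embedder (sketch)
from sentence_transformers import SentenceTransformer

class LocalEmbedder:
    def __init__(self, model_name="BAAI/bge-m3"):
        # Downloads the model on first use; runs on CPU or GPU
        self.model = SentenceTransformer(model_name)

    def embed_chunks(self, chunks):
        texts = [c.text for c in chunks]
        return self.model.encode(texts, normalize_embeddings=True).tolist()

    def embed_query(self, query: str):
        return self.model.encode([query], normalize_embeddings=True)[0].tolist()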

5 Advanced RAG Patterns

Pattern 1: Parent-Child Chunking

Store small chunks for precise retrieval, but return the full parent section for context.

# Small chunks (256 tokens) for embedding
# When matched, return the parent section (1024+ tokens)
# This gives you precise matching + rich context
# (Sketch: chunk(), embed(), and embed_query() are placeholder helpers, and the
#  child store's search results must include each child's id for the parent lookup)

class ParentChildStore:
    def __init__(self):
        self.child_store = VectorStore("children")
        self.parents = {}  # child_id -> parent_text
    
    def add(self, document):
        # Create large parent chunks
        parents = chunk(document, size=1024)
        for parent in parents:
            # Split each parent into children
            children = chunk(parent.text, size=256)
            child_embeddings = embed(children)
            self.child_store.add(children, child_embeddings)
            for child in children:
                self.parents[child.id] = parent.text
    
    def search(self, query, top_k=3):
        # Search children, return parents
        child_results = self.child_store.search(
            embed_query(query), n_results=top_k
        )
        return [self.parents[r['id']] for r in child_results]

Pattern 2: Contextual Retrieval (Anthropic's Method)

Before embedding, prepend each chunk with LLM-generated context about where it fits in the document. This dramatically improves retrieval accuracy.

# Add context to each chunk before embedding
def add_context(chunk, full_document):
    prompt = f"""Here is a document:
{full_document[:2000]}

Here is a chunk from that document:
{chunk.text}

Please give a short (2-3 sentence) context explaining what 
this chunk is about and where it fits in the document."""
    
    context = call_llm(prompt)
    chunk.text = f"{context}\n\n{chunk.text}"
    return chunk

# Anthropic reports 49% reduction in retrieval failures
# with this simple technique

Pattern 3: Query Routing

Different questions need different retrieval strategies. Let the agent route.

# Route queries to the best retrieval strategy
ROUTING_PROMPT = """Classify this query:
1. FACTUAL - Needs specific facts (use hybrid search)
2. CONCEPTUAL - Needs explanation (use semantic search)
3. COMPARATIVE - Compares things (use multi-query)
4. PROCEDURAL - How-to steps (use parent-child)
5. NONE - Doesn't need retrieval (chitchat, math)

Query: {query}
Classification:"""

def route_query(query):
    classification = call_llm(ROUTING_PROMPT.format(query=query))
    if "FACTUAL" in classification:
        return hybrid_search(query)
    elif "CONCEPTUAL" in classification:
        return semantic_search(query)
    elif "COMPARATIVE" in classification:
        return multi_query_search(query)
    elif "PROCEDURAL" in classification:
        return parent_child_search(query)
    else:
        return []  # No retrieval needed

Pattern 4: Self-Correcting RAG

# Agent checks if retrieved context actually answers the question
def rag_with_self_correction(query, max_retries=2):
    for attempt in range(max_retries + 1):
        results = retriever.retrieve(query)
        
        # Grade the retrieval
        grade = call_llm(f"""Do these documents contain 
        enough information to answer: "{query}"?
        
        Documents: {format_results(results)}
        
        Answer YES or NO with a brief explanation.""")
        
        if "YES" in grade:
            return generate_answer(query, results)
        
        # Refine the query and try again
        query = call_llm(f"""The search for "{query}" didn't 
        return good results. Rephrase the query to find 
        better matches. Just output the new query.""")
    
    return "I couldn't find sufficient information to answer."

Pattern 5: Multi-Source RAG

# Agent can search multiple knowledge bases
tools = [
    {"name": "search_docs", "desc": "Internal documentation"},
    {"name": "search_tickets", "desc": "Support ticket history"},
    {"name": "search_api_docs", "desc": "API reference"},
    {"name": "search_slack", "desc": "Slack conversations"},
]

# The agent decides which source(s) to query
# and can cross-reference between them
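
In practice, each source becomes an ordinary tool in the agent's tool list, in the same Anthropic tool format the RAGAgent uses. A sketch, assuming one retriever per source collection (collection names here are placeholders):

# multi_source_tools.py - one search tool per knowledge source (sketch)
from embedder import Embedder
from retrieval import SmartRetriever
from vector_store import VectorStore

embedder = Embedder()

# One retriever per source collection
SOURCES = {
    "search_docs": ("Internal documentation", SmartRetriever(VectorStore("docs"), embedder)),
    "search_tickets": ("Support ticket history", SmartRetriever(VectorStore("tickets"), embedder)),
    "search_api_docs": ("API reference", SmartRetriever(VectorStore("api_docs"), embedder)),
    "search_slack": ("Slack conversations", SmartRetriever(VectorStore("slack"), embedder)),
}

# Expose each source as a separate tool
TOOLS = [{
    "name": name,
    "description": desc,
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "Search query"}},
        "required": ["query"],
    },
} for name, (desc, _) in SOURCES.items()]

def dispatch(tool_name: str, tool_input: dict) -> list[dict]:
    # Route the agent's tool_use block to the matching retriever
    _, retriever = SOURCES[tool_name]
    return retriever.retrieve(tool_input["query"])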

Production RAG Checklist

Before going live, make sure you've handled these:

  1. Evaluation pipeline — Test with 50+ question-answer pairs. Measure retrieval recall and answer accuracy.
  2. Chunking quality — Manually review 20 random chunks. Can a human understand them out of context?
  3. Citation verification — Every answer should cite its source. Spot-check that citations match.
  4. Hallucination guard — Ask questions NOT in your knowledge base. The agent should say "I don't know."
  5. Freshness pipeline — Documents change. Set up automatic re-ingestion (daily or on change).
  6. Cost monitoring — Track embedding costs, LLM calls per query, vector DB usage.
  7. Latency budget — Set a target (e.g., <3s). Measure P50 and P99.
  8. Access control — Users should only retrieve documents they have access to. Use metadata filters.

⚠️ The #1 RAG Mistake: Skipping evaluation. Without measuring retrieval quality, you're flying blind. Build a test set of questions + expected source documents BEFORE optimizing anything. A minimal recall check is sketched below.
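
Here's what that looks like in code: a tiny retrieval evaluation over a hand-built test set, reusing the SmartRetriever from Layer 4 (the eval_set entries are placeholder examples):

# eval_retrieval.py - measure retrieval recall@k over a hand-labelled test set (sketch)
from embedder import Embedder
from retrieval import SmartRetriever
from vector_store import VectorStore

retriever = SmartRetriever(VectorStore(), Embedder())

eval_set = [
    # question -> document that should be retrieved (placeholder examples)
    {"question": "What's our refund policy?", "expected_source": "docs/policies/refunds.md"},
    # ... aim for 50+ of these
]

def recall_at_k(k=5):
    hits = 0
    for case in eval_set:
        results = retriever.retrieve(case["question"], top_k=k)
        # Count a hit if the expected document appears anywhere in the top-k sources
        if any(case["expected_source"] in r["source"] for r in results):
            hits += 1
    return hits / len(eval_set)

print(f"recall@5: {recall_at_k(5):.1%}")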

60-Minute Quickstart: Build Your First RAG Agent

Let's build a working RAG agent that answers questions about your documents.

1 Install Dependencies (2 min)

pip install anthropic chromadb voyageai pymupdf

2 Create the Ingestion Script (10 min)

# ingest.py
import chromadb, voyageai
from pathlib import Path

# Init
voyage = voyageai.Client()
chroma = chromadb.PersistentClient(path="./rag_db")
collection = chroma.get_or_create_collection("docs")

def chunk_text(text, size=500, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunks.append(' '.join(words[i:i + size]))
    return chunks

def ingest_file(path):
    with open(path) as f:
        text = f.read()
    
    chunks = chunk_text(text)
    # Voyage embeds at most 128 texts per call; batch this for very large files
    embeddings = voyage.embed(
        chunks, model="voyage-3", input_type="document"
    ).embeddings
    
    collection.add(
        ids=[f"{path}_{i}" for i in range(len(chunks))],
        embeddings=embeddings,
        documents=chunks,
        metadatas=[{"source": str(path)} for _ in chunks]
    )
    print(f"Ingested {path}: {len(chunks)} chunks")

# Ingest all .md and .txt files in ./docs/
docs_dir = Path("./docs")
for file in docs_dir.rglob("*"):
    if file.suffix in ['.md', '.txt']:
        ingest_file(str(file))

print(f"Total chunks: {collection.count()}")

3 Create the RAG Agent (15 min)

# agent.py
import anthropic, chromadb, voyageai

voyage = voyageai.Client()
chroma = chromadb.PersistentClient(path="./rag_db")
collection = chroma.get_or_create_collection("docs")
claude = anthropic.Anthropic()

def search(query, n=5):
    embedding = voyage.embed(
        [query], model="voyage-3", input_type="query"
    ).embeddings[0]
    
    results = collection.query(
        query_embeddings=[embedding], n_results=n
    )
    
    context = ""
    for doc, meta in zip(results['documents'][0], results['metadatas'][0]):
        context += f"\n---\nSource: {meta['source']}\n{doc}\n"
    return context

def chat(question, history=None):
    history = history or []
    context = search(question)
    
    messages = history + [{
        "role": "user",
        "content": f"""Answer based on this context:

{context}

Question: {question}

Rules:
- Only use information from the context above
- Cite sources as [Source: filename]
- Say "I don't have that information" if not in context"""
    }]
    
    response = claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=messages
    )
    return response.content[0].text

# Interactive loop
print("RAG Agent ready! Ask anything about your docs.")
history = []
while True:
    q = input("\nYou: ")
    if q.lower() in ['quit', 'exit']:
        break
    answer = chat(q, history)
    print(f"\nAgent: {answer}")
    history.append({"role": "user", "content": q})
    history.append({"role": "assistant", "content": answer})

4 Add Your Documents (5 min)

mkdir docs
# Add your .md or .txt files to ./docs/
python ingest.py

5 Run It (1 min)

python agent.py
# > You: What's our refund policy?
# > Agent: Based on the documentation, your refund policy...
#   [Source: docs/policies/refunds.md]


7 Common RAG Mistakes (And How to Fix Them)

1. Chunks Too Big or Too Small

Problem: Large chunks dilute relevance. Tiny chunks lose context.
Fix: Start at 512 tokens with 10% overlap. Test with your actual queries and adjust.

2. No Metadata Filtering

Problem: Searching the entire knowledge base for every query.
Fix: Add metadata (category, date, department) and filter before searching.
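
With the ChromaDB VectorStore from Layer 3, filtering is just the where argument (store and query_embedding as in the Layer 3 code; the filter value is an example):

# Restrict the search to chunks from one source before ranking
results = store.search(
    query_embedding,
    n_results=5,
    where={"source": "docs/policies/refunds.md"},  # Chroma equality filter on stored metadata
)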

3. Ignoring Document Structure

Problem: Splitting a table across chunks, or separating a header from its content.
Fix: Use structure-aware chunking. Split by headers, keep tables intact.

4. No Reranking

Problem: Vector search returns semantically similar but irrelevant results.
Fix: Add a reranker (Cohere Rerank, cross-encoder) as a second pass. 20-40% accuracy improvement.

5. Stale Data

Problem: Documents updated but embeddings still reflect the old version.
Fix: Track document hashes. Re-embed on change. Run nightly sync jobs.
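
A minimal version of that hash check, assuming an ingest_file() function like the one in the quickstart above (the manifest filename is arbitrary):

# freshness.py - re-ingest only files whose content hash changed (sketch)
import hashlib, json
from pathlib import Path

MANIFEST = Path("ingest_manifest.json")

def sync(docs_dir="./docs"):
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    for file in Path(docs_dir).rglob("*"):
        if file.suffix not in ('.md', '.txt'):
            continue
        digest = hashlib.sha256(file.read_bytes()).hexdigest()
        if seen.get(str(file)) != digest:
            # Delete or upsert the file's old chunks first so re-ingestion
            # replaces them instead of duplicating ids
            ingest_file(str(file))
            seen[str(file)] = digest
    MANIFEST.write_text(json.dumps(seen, indent=2))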

6. No Evaluation Set

Problem: "It seems to work" isn't a metric.
Fix: Create 50+ question → expected source pairs. Measure retrieval recall@5 weekly.

7. Stuffing Too Much Context

Problem: Retrieving 20 chunks and cramming them all into the prompt.
Fix: Less is more. 3-5 highly relevant chunks beat 15 mediocre ones. Your LLM bill will thank you too.

Cost Breakdown: RAG in Production

Component              | Budget Build                | Pro Build
Embedding model        | BGE-M3 (free, local)        | Voyage 3 ($0.06/1M tok)
Vector database        | ChromaDB (free)             | Qdrant Cloud ($30/mo)
LLM (generation)       | Claude Haiku ($0.25/1M tok) | Claude Sonnet ($3/1M tok)
Reranker               | None                        | Cohere Rerank ($1/1K queries)
Total (10K queries/mo) | ~$5/mo                      | ~$80/mo

💡 Pro tip: Start with the budget build. Only upgrade components where you measure actual quality problems. Most teams don't need the pro build.

What's Next: RAG in 2026 and Beyond
