RAG Tutorial for Beginners: Build a Retrieval-Augmented Generation System
Quick answer: RAG combines a vector database (for finding relevant documents) with an LLM (for generating answers). The basic loop is: embed the user query → find similar document chunks → pass chunks as context to the LLM → generate an answer grounded in those chunks. This guide builds a working RAG system from scratch.
Why RAG exists
LLMs have two knowledge limitations:
- Knowledge cutoff: They don't know about events after their training date
- Private data: They don't know about your company's documents, databases, or internal knowledge
RAG solves both by retrieving relevant information at query time and passing it to the LLM as context. The LLM doesn't need to memorize your docs — it reads them on demand.
The RAG architecture
Indexing (done once, ahead of time):
[Your documents] → [Chunker] → [Embedder] → [Vector Store]

Querying (at request time):
[User query] → [Embedder] → [Vector similarity search]
↓
[Retrieved chunks] → [LLM] → [Answer]
Step 1: Chunk your documents
Documents must be split into smaller chunks before embedding. A chunk size of roughly 300-800 tokens is a common starting point; the splitter below counts words as a cheap approximation of tokens:
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; neighbors share `overlap` words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # overlap must be smaller than chunk_size
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
documents = [
    "Introduction to machine learning... [long text]",
    "How neural networks work... [long text]",
]

all_chunks = []
for doc in documents:
    all_chunks.extend(chunk_text(doc))

print(f"Total chunks: {len(all_chunks)}")
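To see what the overlap parameter actually does, here is the same chunker run on a tiny input with illustrative parameters (a small chunk_size so the overlap is visible). With overlap=1, each chunk repeats the last word of the previous one, which helps keep a sentence's context from being cut in half at a chunk boundary.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Tiny input, 4-word chunks, 1 word of overlap (toy numbers, not a recommendation)
chunks = chunk_text("one two three four five six seven eight", chunk_size=4, overlap=1)
for c in chunks:
    print(c)
# one two three four
# four five six seven
# seven eight
```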
Step 2: Create embeddings
Embeddings are numerical representations of text that capture semantic meaning. Similar meaning = similar numbers = close in vector space.
import openai
import numpy as np

client = openai.OpenAI()

def embed_texts(texts: list[str]) -> list[list[float]]:
    # Batch embedding: one API call for many texts is cheaper and faster
    response = client.embeddings.create(
        model="text-embedding-3-small",  # $0.02/1M tokens
        input=texts,
    )
    return [item.embedding for item in response.data]

chunk_embeddings = embed_texts(all_chunks)
print(f"Embedding dimension: {len(chunk_embeddings[0])}")  # 1536
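The "similar meaning = close in vector space" claim is easiest to see with cosine similarity on toy vectors. The 2D vectors below are made up for illustration (real embeddings have 1536 dimensions, but the math is identical): vectors pointing in similar directions score near 1, unrelated directions score near 0.

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the vectors divided by the product of their lengths
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dog     = [0.9, 0.1]  # pretend embedding for "dog"
puppy   = [0.8, 0.2]  # similar meaning → similar direction
invoice = [0.1, 0.9]  # unrelated meaning → different direction

print(cosine_similarity(dog, puppy))    # high, near 1.0
print(cosine_similarity(dog, invoice))  # much lower
```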
Step 3: Store in a vector database
For production, use a dedicated vector database. For a quick prototype, numpy works:
# Simple numpy-based vector store
class SimpleVectorStore:
    def __init__(self):
        self.chunks = []
        self.embeddings = []

    def add(self, chunks: list[str], embeddings: list[list[float]]):
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding: list[float], top_k: int = 5) -> list[str]:
        if not self.embeddings:
            return []
        # Cosine similarity between the query and every stored embedding
        q = np.array(query_embedding)
        E = np.array(self.embeddings)
        sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
        # Indices of the top_k highest similarities, best match first
        top_indices = np.argsort(sims)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]

store = SimpleVectorStore()
store.add(all_chunks, chunk_embeddings)
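As a quick sanity check of the search method, here is the same store exercised with hand-made 2D vectors (toy numbers, not real embeddings; the class body is repeated so the snippet runs on its own). A query pointing in the "animal" direction should rank the two animal chunks above the unrelated one.

```python
import numpy as np

class SimpleVectorStore:
    def __init__(self):
        self.chunks = []
        self.embeddings = []

    def add(self, chunks, embeddings):
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding, top_k=5):
        if not self.embeddings:
            return []
        q = np.array(query_embedding)
        E = np.array(self.embeddings)
        sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
        top_indices = np.argsort(sims)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]

demo_store = SimpleVectorStore()
demo_store.add(
    ["about dogs", "about cats", "about taxes"],
    [[0.9, 0.1], [0.7, 0.3], [0.1, 0.9]],  # made-up 2D "embeddings"
)
print(demo_store.search([1.0, 0.0], top_k=2))
# ['about dogs', 'about cats']
```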
For production: use Pinecone, pgvector, Qdrant, or Weaviate.
Step 4: Query and generate
import anthropic

anthropic_client = anthropic.Anthropic()

def rag_query(question: str) -> str:
    # 1. Embed the question with the same model used for the chunks
    query_embedding = embed_texts([question])[0]

    # 2. Retrieve the most similar chunks
    relevant_chunks = store.search(query_embedding, top_k=4)
    context = "\n\n---\n\n".join(relevant_chunks)

    # 3. Generate an answer grounded in the retrieved context
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=(
            "Answer questions based only on the provided context. "
            "If the answer is not in the context, say so clearly. "
            "Always cite which part of the context supports your answer."
        ),
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text

print(rag_query("How do neural networks learn?"))
Cost optimization for RAG
RAG input costs can be high because you're passing retrieved chunks as context. Use prompt caching when the same documents are retrieved repeatedly.
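One way to apply prompt caching is to place the stable, frequently retrieved context in its own system block marked with cache_control, so repeated requests reuse the cached prefix instead of re-processing it. The sketch below only builds the request body as a dict; the field names follow Anthropic's prompt-caching API, but verify them against the current docs before relying on them, and the model id and context string are placeholders.

```python
# Stable context you expect to send with many queries (placeholder text)
STABLE_CONTEXT = "...all frequently retrieved chunks..."

request = {
    "model": "claude-sonnet-4-20250514",  # example model id
    "max_tokens": 1024,
    "system": [
        {"type": "text",
         "text": "Answer questions based only on the provided context."},
        {"type": "text",
         "text": f"Context:\n{STABLE_CONTEXT}",
         # Marks the prefix up to this block as cacheable across requests
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [{"role": "user", "content": "How do neural networks learn?"}],
}
# anthropic_client.messages.create(**request)  # same call shape as in Step 4
```

Caching pays off only when the cached prefix is identical across requests, so it suits shared document sets better than per-query retrieved chunks.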
Model recommendations for RAG:
- Best quality: Claude Sonnet 4 or GPT-4o (best at grounding in context)
- Cost-optimized: Gemini 2.0 Flash with 1M context window
- Open-source: Command R+ (specifically optimized for RAG)
See the best LLMs for RAG ranking for a full comparison.