RAG Tutorial for Beginners: Build a Retrieval-Augmented Generation System
Quick answer: RAG combines a vector database (for finding relevant documents) with an LLM (for generating answers). The basic loop is: embed the user query → find similar document chunks → pass chunks as context to the LLM → generate an answer grounded in those chunks. This guide builds a working RAG system from scratch.
Why RAG exists
LLMs have two knowledge limitations:
- Knowledge cutoff: They don't know about events after their training date
- Private data: They don't know about your company's documents, databases, or internal knowledge
RAG solves both by retrieving relevant information at query time and passing it to the LLM as context. The LLM doesn't need to memorize your docs — it reads them on demand.
The RAG architecture
Indexing (done once, ahead of time):
[Your documents] → [Chunker] → [Embedder] → [Vector Store]

Querying (at request time):
[User query] → [Embedder] → [Vector similarity search]
↓
[Retrieved chunks] → [LLM] → [Answer]
Step 1: Chunk your documents
Documents must be split into smaller chunks before embedding. A chunk size of roughly 300-800 tokens is a common starting point; the splitter below counts words as a cheap approximation of tokens:
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; neighbors share `overlap` words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # overlap must be smaller than chunk_size
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
documents = [
    "Introduction to machine learning... [long text]",
    "How neural networks work... [long text]",
]

all_chunks = []
for doc in documents:
    all_chunks.extend(chunk_text(doc))

print(f"Total chunks: {len(all_chunks)}")
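To see what the overlap parameter actually does, here is the same chunker run on a tiny input with illustrative parameters (a small chunk_size so the overlap is visible). With overlap=1, each chunk repeats the last word of the previous one, which helps keep a sentence's context from being cut in half at a chunk boundary.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Tiny input, 4-word chunks, 1 word of overlap (toy numbers, not a recommendation)
chunks = chunk_text("one two three four five six seven eight", chunk_size=4, overlap=1)
for c in chunks:
    print(c)
# one two three four
# four five six seven
# seven eight
```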
Step 2: Create embeddings
Embeddings are numerical representations of text that capture semantic meaning. Similar meaning = similar numbers = close in vector space.
import openai
import numpy as np

client = openai.OpenAI()

def embed_texts(texts: list[str]) -> list[list[float]]:
    # Batch embedding: one API call for many texts is cheaper and faster
    response = client.embeddings.create(
        model="text-embedding-3-small",  # $0.02/1M tokens
        input=texts,
    )
    return [item.embedding for item in response.data]

chunk_embeddings = embed_texts(all_chunks)
print(f"Embedding dimension: {len(chunk_embeddings[0])}")  # 1536
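The "similar meaning = close in vector space" claim is easiest to see with cosine similarity on toy vectors. The 2D vectors below are made up for illustration (real embeddings have 1536 dimensions, but the math is identical): vectors pointing in similar directions score near 1, unrelated directions score near 0.

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the vectors divided by the product of their lengths
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dog     = [0.9, 0.1]  # pretend embedding for "dog"
puppy   = [0.8, 0.2]  # similar meaning → similar direction
invoice = [0.1, 0.9]  # unrelated meaning → different direction

print(cosine_similarity(dog, puppy))    # high, near 1.0
print(cosine_similarity(dog, invoice))  # much lower
```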
Step 3: Store in a vector database
For production, use a dedicated vector database. For a quick prototype, numpy works:
# Simple numpy-based vector store
class SimpleVectorStore:
    def __init__(self):
        self.chunks = []
        self.embeddings = []

    def add(self, chunks: list[str], embeddings: list[list[float]]):
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding: list[float], top_k: int = 5) -> list[str]:
        if not self.embeddings:
            return []
        # Cosine similarity between the query and every stored embedding
        q = np.array(query_embedding)
        E = np.array(self.embeddings)
        sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
        # Indices of the top_k highest similarities, best match first
        top_indices = np.argsort(sims)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]

store = SimpleVectorStore()
store.add(all_chunks, chunk_embeddings)
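As a quick sanity check of the search method, here is the same store exercised with hand-made 2D vectors (toy numbers, not real embeddings; the class body is repeated so the snippet runs on its own). A query pointing in the "animal" direction should rank the two animal chunks above the unrelated one.

```python
import numpy as np

class SimpleVectorStore:
    def __init__(self):
        self.chunks = []
        self.embeddings = []

    def add(self, chunks, embeddings):
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding, top_k=5):
        if not self.embeddings:
            return []
        q = np.array(query_embedding)
        E = np.array(self.embeddings)
        sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
        top_indices = np.argsort(sims)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]

demo_store = SimpleVectorStore()
demo_store.add(
    ["about dogs", "about cats", "about taxes"],
    [[0.9, 0.1], [0.7, 0.3], [0.1, 0.9]],  # made-up 2D "embeddings"
)
print(demo_store.search([1.0, 0.0], top_k=2))
# ['about dogs', 'about cats']
```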
For production: use Pinecone, pgvector, Qdrant, or Weaviate.
Step 4: Query and generate
import anthropic

anthropic_client = anthropic.Anthropic()

def rag_query(question: str) -> str:
    # 1. Embed the question with the same model used for the chunks
    query_embedding = embed_texts([question])[0]

    # 2. Retrieve the most similar chunks
    relevant_chunks = store.search(query_embedding, top_k=4)
    context = "\n\n---\n\n".join(relevant_chunks)

    # 3. Generate an answer grounded in the retrieved context
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=(
            "Answer questions based only on the provided context. "
            "If the answer is not in the context, say so clearly. "
            "Always cite which part of the context supports your answer."
        ),
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text

print(rag_query("How do neural networks learn?"))
Cost optimization for RAG
RAG input costs can be high because you're passing retrieved chunks as context. Use prompt caching when the same documents are retrieved repeatedly.
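One way to apply prompt caching is to place the stable, frequently retrieved context in its own system block marked with cache_control, so repeated requests reuse the cached prefix instead of re-processing it. The sketch below only builds the request body as a dict; the field names follow Anthropic's prompt-caching API, but verify them against the current docs before relying on them, and the model id and context string are placeholders.

```python
# Stable context you expect to send with many queries (placeholder text)
STABLE_CONTEXT = "...all frequently retrieved chunks..."

request = {
    "model": "claude-sonnet-4-20250514",  # example model id
    "max_tokens": 1024,
    "system": [
        {"type": "text",
         "text": "Answer questions based only on the provided context."},
        {"type": "text",
         "text": f"Context:\n{STABLE_CONTEXT}",
         # Marks the prefix up to this block as cacheable across requests
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [{"role": "user", "content": "How do neural networks learn?"}],
}
# anthropic_client.messages.create(**request)  # same call shape as in Step 4
```

Caching pays off only when the cached prefix is identical across requests, so it suits shared document sets better than per-query retrieved chunks.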
Model recommendations for RAG:
- Best quality: Claude Sonnet 4 or GPT-4o (best at grounding in context)
- Cost-optimized: Gemini 2.0 Flash with 1M context window
- Open-source: Command R+ (specifically optimized for RAG)
See the best LLMs for RAG ranking for a full comparison.