Building RAG Systems That Actually Work
Retrieval-Augmented Generation sounds simple. In practice, it's full of gotchas.
The Basic Architecture
Query → Embed → Search Vector DB → Retrieve Chunks → LLM → Answer
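In code, the whole naive pipeline is only a few lines. A minimal sketch, assuming placeholder `embed`, `vector_db`, and `llm` clients (swap in whatever embedding model, vector store, and LLM SDK you actually use):

```python
# Sketch of the naive pipeline: Query → Embed → Search → Retrieve → LLM → Answer.
# `embed`, `vector_db`, and `llm` are placeholders, not a specific library.
def naive_rag(query: str, k: int = 5) -> str:
    query_embedding = embed(query)                   # Query → Embed
    chunks = vector_db.search(query_embedding, k=k)  # Search → Retrieve Chunks
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)                      # LLM → Answer
```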
Simple enough. So why do so many RAG systems give garbage answers?
Problem 1: Chunking
Naive chunking destroys context:
```python
# Bad: Fixed-size chunks
chunks = [text[i:i+500] for i in range(0, len(text), 500)]

# Better: Semantic chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(text)
```
Even better: Use document structure (headers, sections).
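For Markdown sources, LangChain's `MarkdownHeaderTextSplitter` splits on headers and keeps them as metadata, so each chunk carries its section context. A sketch (the import path varies by LangChain version; newer releases use `langchain_text_splitters`):

```python
# Sketch: structure-aware chunking that splits on Markdown headers.
from langchain.text_splitter import MarkdownHeaderTextSplitter

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section")]
)
# Each returned document has .page_content plus the header values in .metadata.
docs = header_splitter.split_text(markdown_text)
```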
Problem 2: Retrieval Quality
Vector similarity ≠ relevance.
```python
# Hybrid search: combine dense + sparse
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi

# Dense retrieval
model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode(query)
dense_results = vector_db.search(query_embedding, k=10)

# Sparse retrieval (BM25)
bm25 = BM25Okapi(tokenized_corpus)
sparse_results = bm25.get_top_n(tokenized_query, documents, n=10)

# Reciprocal Rank Fusion
final_results = rrf_merge(dense_results, sparse_results)
```
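`rrf_merge` isn't a library call. A minimal Reciprocal Rank Fusion, assuming you've reduced both result sets to rank-ordered lists of document IDs, might look like this:

```python
# Sketch of Reciprocal Rank Fusion: fused score = sum over lists of 1 / (k + rank).
def rrf_merge(dense_ids, sparse_ids, k=60):
    scores = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

The constant `k` (conventionally 60) damps the influence of any single list's top ranks, which is what makes RRF robust when dense and sparse scores aren't comparable.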
Problem 3: Context Window Stuffing
Don't just concatenate all retrieved chunks:
```python
# Bad: Stuff everything
context = "\n".join(all_chunks)
prompt = f"Context: {context}\n\nQuestion: {query}"

# Better: Rerank and select
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in chunks])
top_chunks = [chunks[i] for i in np.argsort(scores)[-3:]]
```
Problem 4: No Source Attribution
Users don't trust black boxes:
```python
response = llm.generate(
    f"""Based on the following sources, answer the question.

Sources:
[1] {chunk1.text} (from: {chunk1.source})
[2] {chunk2.text} (from: {chunk2.source})

Question: {query}

Provide your answer with citations like [1], [2].
"""
)
```
Evaluation Metrics
Measure what matters:
- Retrieval: Recall@k, MRR (see the sketch after this list)
- Generation: Faithfulness, relevance
- End-to-end: User satisfaction, task completion
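Recall@k and MRR are easy to compute yourself once you have a small labeled set of queries with known relevant document IDs. A minimal sketch:

```python
# Sketch: retrieval metrics over one query's ranked results.
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    # Fraction of relevant documents that appear in the top k results.
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(retrieved_ids, relevant_ids):
    # Reciprocal rank of the first relevant document (0 if none retrieved).
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```

Average these over your eval queries; faithfulness and relevance of the generated answer need either human review or an LLM-as-judge setup.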
My Stack
- Embeddings: `text-embedding-3-small` (cost) or `voyage-2` (quality)
- Vector DB: Qdrant (good hybrid search support)
- Reranker: Cohere Rerank or cross-encoder
- LLM: Claude 3.5 Sonnet (best at following citations)
RAG isn't hard. Good RAG is.