Building a Retrieval-Augmented Generation (RAG) system sounds straightforward on paper: retrieve relevant documents, feed them to an LLM, get accurate answers. In practice? It's a completely different beast. Here's what I learned building one from scratch.
WHAT IS RAG AND WHY DO WE NEED IT
Large Language Models are powerful, but they have two critical limitations: knowledge cutoff and hallucinations. RAG solves both by grounding the model's responses in actual retrieved data.
Instead of relying on the LLM's parametric memory, we:
- Chunk and embed our documents
- Store embeddings in a vector database
- At query time, retrieve relevant chunks
- Pass them as context to the LLM
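Wired together, the whole loop is only a few lines. Here's a minimal sketch, where `embed_model`, `vector_db`, and `llm` are placeholders for whatever concrete pieces you pick:

```python
# Minimal RAG loop; embed_model, vector_db, and llm stand in for your
# embedding model, vector store, and LLM client of choice.
def answer(query, k=5):
    query_vec = embed_model.encode(query)          # embed the query
    chunks = vector_db.search(query_vec, k=k)      # retrieve top-k chunks
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)                    # answer grounded in the context
```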
Simple, right? Well, here's where it gets interesting.
LESSON 1: CHUNKING IS AN ART, NOT A SCIENCE
My first approach was naive: split documents every 500 tokens. The results were terrible. Chunks would cut off mid-sentence, losing crucial context.
What actually worked:
- Semantic chunking — Split on paragraph boundaries, headers, or logical sections
- Overlapping chunks — 10-20% overlap helps maintain context across boundaries
- Metadata preservation — Keep track of source, page number, and section headers
```python
def smart_chunk(text, max_tokens=500, overlap=50):
    # Split on double newlines first (paragraphs)
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        # Rough heuristic: ~4 characters per token
        if (len(current_chunk) + len(para)) // 4 < max_tokens:
            current_chunk += para + "\n\n"
        else:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            # Carry the tail of the previous chunk forward as overlap
            current_chunk = current_chunk[-overlap * 4:] + para + "\n\n"
    if current_chunk.strip():
        chunks.append(current_chunk.strip())  # don't drop the final chunk
    return chunks
```

LESSON 2: EMBEDDING MODELS MATTER MORE THAN YOU THINK
I started with OpenAI's text-embedding-ada-002. It's good, but not always the best choice. For domain-specific applications, fine-tuned or specialized models outperform generic ones.
Key considerations:
- Dimensionality — Higher isn't always better. 384-768 dimensions often suffice
- Domain alignment — Technical docs? Try sentence-transformers/all-MiniLM-L6-v2
- Multilingual needs — Consider multilingual-e5-base for non-English content
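Swapping embedding models is usually a one-line change with the sentence-transformers library (the sample sentences here are just for illustration):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = ["RAG grounds LLM answers in retrieved text.", "Chunking strategy matters."]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) -- MiniLM-L6-v2 produces 384-dim vectors
```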
LESSON 3: RETRIEVAL IS WHERE MOST RAG SYSTEMS FAIL
The retrieval step is the bottleneck. If you retrieve irrelevant chunks, even the best LLM can't save you.
Improvements that made a real difference:
- Hybrid search — Combine vector similarity with BM25 keyword matching
- Re-ranking — Use a cross-encoder to re-rank top-k results
- Query expansion — Generate multiple query variations to improve recall
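Two of these pieces are easy to sketch: reciprocal rank fusion, which the search function below leans on, and a cross-encoder re-ranker via sentence-transformers. Assume each result list is a ranked list of document IDs and chunks are plain strings:

```python
from sentence_transformers import CrossEncoder

def reciprocal_rank_fusion(*result_lists, k=60):
    # A document's fused score is the sum of 1/(k + rank) across lists;
    # k=60 is the constant from the original RRF paper.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A cross-encoder scores each (query, chunk) pair jointly, which is
# slower than bi-encoder retrieval but much more accurate for ranking.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks):
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked]
```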
```python
# vector_db, bm25_index, and cross_encoder are assumed to be set up elsewhere
def hybrid_search(query, k=10):
    # Vector search (over-fetch so fusion has candidates to work with)
    vector_results = vector_db.similarity_search(query, k=k * 2)
    # BM25 keyword search
    keyword_results = bm25_index.search(query, k=k * 2)
    # Reciprocal rank fusion
    combined = reciprocal_rank_fusion(vector_results, keyword_results)
    # Re-rank the fused candidates with a cross-encoder
    reranked = cross_encoder.rerank(query, combined[:k * 2])
    return reranked[:k]
```

LESSON 4: PROMPT ENGINEERING FOR RAG IS DIFFERENT
Standard prompting doesn't work well for RAG. The model needs explicit instructions on how to use the retrieved context.
What worked:
- Be explicit about citations — "Answer based ONLY on the provided context"
- Handle missing information — "If the context doesn't contain the answer, say so"
- Structure the context — Use clear delimiters between chunks
The resulting template:

```
You are a technical assistant. Answer questions using ONLY the provided context.

[CONTEXT]
{retrieved_chunks}
[/CONTEXT]

Question: {user_query}

If the answer is not in the context, respond with "I don't have enough information to answer that."
```

LESSON 5: EVALUATION IS HARDER THAN BUILDING
How do you know if your RAG is good? Vibes-based testing doesn't scale.
Metrics I found useful:
- Retrieval metrics — Precision@k, Recall@k, and MRR (mean reciprocal rank)
- Answer quality — Use an LLM-as-judge approach
- Faithfulness — Does the answer actually reflect the retrieved context?
Building a golden dataset of 100+ query-answer pairs was tedious but invaluable for iteration.
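To make the retrieval metrics concrete, here's a minimal sketch; it assumes `retrieved` is a ranked list of chunk IDs and `relevant` is the (non-empty) golden set of relevant IDs for a query:

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved chunks that are actually relevant
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant chunks that appear in the top-k
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(retrieved, relevant):
    # Reciprocal rank of the first relevant chunk (0 if none is retrieved)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```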
THE ARCHITECTURE THAT WORKED
After many iterations, here's the stack that performed best for my use case:
| Component | Choice |
|---|---|
| Vector DB | Qdrant (fast, feature-rich) |
| Embeddings | E5-base-v2 |
| Chunking | Semantic + 100 token overlap |
| Retrieval | Hybrid search + MMR |
| LLM | GPT-4 Turbo (for quality) |
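For reference, wiring the top of that stack together looks roughly like this (a sketch, not my exact production code; the collection name and payload fields are illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")  # E5 expects "query:"/"passage:" prefixes
client = QdrantClient(":memory:")                   # use a real server URL in production

client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),  # E5-base-v2 is 768-dim
)
vec = model.encode("passage: Qdrant is a vector database.")
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=vec.tolist(), payload={"source": "notes.md"})],
)
hits = client.search(
    collection_name="docs",
    query_vector=model.encode("query: what is Qdrant?").tolist(),
    limit=5,
)
```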
FINAL THOUGHTS
Building a RAG system taught me that the devil is in the details. Every component—chunking, embedding, retrieval, prompting—needs careful tuning for your specific use case.
The most important lesson? Start simple, measure everything, iterate fast. Don't over-engineer from day one. Get a basic pipeline working, then optimize based on real failures.
RAG isn't a silver bullet, but when done right, it's incredibly powerful. The key is understanding that it's a system of interconnected parts, not just "vector search + LLM."
If you're building something similar or have questions, feel free to reach out on Twitter or Telegram.
