Building a Retrieval-Augmented Generation (RAG) system sounds straightforward on paper: retrieve relevant documents, feed them to an LLM, get accurate answers. In practice? It's a completely different beast. Here's what I learned building one from scratch.
WHAT IS RAG AND WHY DO WE NEED IT
Large Language Models are powerful, but they have two critical limitations: knowledge cutoff and hallucinations. RAG solves both by grounding the model's responses in actual retrieved data.
Instead of relying on the LLM's parametric memory, we:
- Chunk and embed our documents
- Store embeddings in a vector database
- At query time, retrieve relevant chunks
- Pass them as context to the LLM
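Wired together, the whole loop is only a few lines. Here's a minimal sketch, where `embed_model`, `vector_db`, and `llm` are placeholders for whatever concrete pieces you pick:

```python
# Minimal RAG loop; embed_model, vector_db, and llm stand in for your
# embedding model, vector store, and LLM client of choice.
def answer(query, k=5):
    query_vec = embed_model.encode(query)          # embed the query
    chunks = vector_db.search(query_vec, k=k)      # retrieve top-k chunks
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)                    # answer grounded in the context
```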
Simple, right? Well, here's where it gets interesting.
LESSON 1: CHUNKING IS AN ART, NOT A SCIENCE
My first approach was naive: split documents every 500 tokens. The results were terrible. Chunks would cut off mid-sentence, losing crucial context.
What actually worked:
- Semantic chunking — Split on paragraph boundaries, headers, or logical sections
- Overlapping chunks — 10-20% overlap helps maintain context across boundaries
- Metadata preservation — Keep track of source, page number, and section headers
```python
def smart_chunk(text, max_tokens=500, overlap=50):
    # Split on double newlines first (paragraphs)
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        # Rough heuristic: ~4 characters per token
        if (len(current_chunk) + len(para)) // 4 < max_tokens:
            current_chunk += para + "\n\n"
        else:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            # Carry the tail of the previous chunk forward as overlap
            current_chunk = current_chunk[-overlap * 4:] + para + "\n\n"
    if current_chunk.strip():
        chunks.append(current_chunk.strip())  # don't drop the final chunk
    return chunks
```

LESSON 2: EMBEDDING MODELS MATTER MORE THAN YOU THINK
I started with OpenAI's text-embedding-ada-002. It's good, but not always the best choice. For domain-specific applications, fine-tuned or specialized models outperform generic ones.
Key considerations:
- Dimensionality — Higher isn't always better. 384-768 dimensions often suffice
- Domain alignment — Technical docs? Try sentence-transformers/all-MiniLM-L6-v2
- Multilingual needs — Consider multilingual-e5-base for non-English content
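Swapping embedding models is usually a one-line change with the sentence-transformers library (the sample sentences here are just for illustration):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = ["RAG grounds LLM answers in retrieved text.", "Chunking strategy matters."]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) -- MiniLM-L6-v2 produces 384-dim vectors
```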
LESSON 3: RETRIEVAL IS WHERE MOST RAG SYSTEMS FAIL
The retrieval step is the bottleneck. If you retrieve irrelevant chunks, even the best LLM can't save you.
Improvements that made a real difference:
- Hybrid search — Combine vector similarity with BM25 keyword matching
- Re-ranking — Use a cross-encoder to re-rank top-k results
- Query expansion — Generate multiple query variations to improve recall
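Two of these pieces are easy to sketch: reciprocal rank fusion, which the search function below leans on, and a cross-encoder re-ranker via sentence-transformers. Assume each result list is a ranked list of document IDs and chunks are plain strings:

```python
from sentence_transformers import CrossEncoder

def reciprocal_rank_fusion(*result_lists, k=60):
    # A document's fused score is the sum of 1/(k + rank) across lists;
    # k=60 is the constant from the original RRF paper.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A cross-encoder scores each (query, chunk) pair jointly, which is
# slower than bi-encoder retrieval but much more accurate for ranking.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks):
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked]
```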
```python
# vector_db, bm25_index, and cross_encoder are assumed to be set up elsewhere
def hybrid_search(query, k=10):
    # Vector search (over-fetch so fusion has candidates to work with)
    vector_results = vector_db.similarity_search(query, k=k * 2)
    # BM25 keyword search
    keyword_results = bm25_index.search(query, k=k * 2)
    # Reciprocal rank fusion
    combined = reciprocal_rank_fusion(vector_results, keyword_results)
    # Re-rank the fused candidates with a cross-encoder
    reranked = cross_encoder.rerank(query, combined[:k * 2])
    return reranked[:k]
```

LESSON 4: PROMPT ENGINEERING FOR RAG IS DIFFERENT
Standard prompting doesn't work well for RAG. The model needs explicit instructions on how to use the retrieved context.
What worked:
- Be explicit about citations — "Answer based ONLY on the provided context"
- Handle missing information — "If the context doesn't contain the answer, say so"
- Structure the context — Use clear delimiters between chunks
The resulting template:

```
You are a technical assistant. Answer questions using ONLY the provided context.

[CONTEXT]
{retrieved_chunks}
[/CONTEXT]

Question: {user_query}

If the answer is not in the context, respond with "I don't have enough information to answer that."
```

LESSON 5: EVALUATION IS HARDER THAN BUILDING
How do you know if your RAG is good? Vibes-based testing doesn't scale.
Metrics I found useful:
- Retrieval metrics — Precision@k, Recall@k, and MRR (mean reciprocal rank)
- Answer quality — Use an LLM-as-judge approach
- Faithfulness — Does the answer actually reflect the retrieved context?
Building a golden dataset of 100+ query-answer pairs was tedious but invaluable for iteration.
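To make the retrieval metrics concrete, here's a minimal sketch; it assumes `retrieved` is a ranked list of chunk IDs and `relevant` is the (non-empty) golden set of relevant IDs for a query:

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved chunks that are actually relevant
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant chunks that appear in the top-k
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(retrieved, relevant):
    # Reciprocal rank of the first relevant chunk (0 if none is retrieved)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```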
THE ARCHITECTURE THAT WORKED
After many iterations, here's the stack that performed best for my use case:
| Component | Choice |
|---|---|
| Vector DB | Qdrant (fast, feature-rich) |
| Embeddings | E5-base-v2 |
| Chunking | Semantic + 100 token overlap |
| Retrieval | Hybrid search + MMR |
| LLM | GPT-4 Turbo (for quality) |
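For reference, wiring the top of that stack together looks roughly like this (a sketch, not my exact production code; the collection name and payload fields are illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")  # E5 expects "query:"/"passage:" prefixes
client = QdrantClient(":memory:")                   # use a real server URL in production

client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),  # E5-base-v2 is 768-dim
)
vec = model.encode("passage: Qdrant is a vector database.")
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=vec.tolist(), payload={"source": "notes.md"})],
)
hits = client.search(
    collection_name="docs",
    query_vector=model.encode("query: what is Qdrant?").tolist(),
    limit=5,
)
```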
FINAL THOUGHTS
Building a RAG system taught me that the devil is in the details. Every component—chunking, embedding, retrieval, prompting—needs careful tuning for your specific use case.
The most important lesson? Start simple, measure everything, iterate fast. Don't over-engineer from day one. Get a basic pipeline working, then optimize based on real failures.
RAG isn't a silver bullet, but when done right, it's incredibly powerful. The key is understanding that it's a system of interconnected parts, not just "vector search + LLM."
If you're building something similar or have questions, feel free to reach out on Twitter or Telegram.
