
Everyone's building RAG. Almost nobody's doing it well. After auditing dozens of production systems, here's what separates the ones that work from the expensive chatbots that don't.
Retrieval Augmented Generation sounds simple: find relevant documents, stuff them into context, let the LLM synthesize. In practice, it's a minefield. Chunking strategies, embedding model selection, reranking, hybrid search—each decision compounds. I've seen teams burn six months building RAG systems that perform worse than a well-prompted base model. Here's how not to be one of them.
Most teams start with fixed-size chunks: 500 tokens, overlap of 50. It's the default in every tutorial. It's also almost always wrong. Documents have semantic structure—paragraphs, sections, ideas. Cutting through the middle of a concept is worse than no retrieval at all.
Smart chunking respects document structure. Use headers as chunk boundaries. Keep paragraphs together. Consider parent-child relationships—sometimes you need the context of the surrounding section, not just the matching snippet.
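Here's a minimal sketch of structure-aware chunking: split on headers, then pack whole paragraphs into chunks under a size budget, never cutting mid-paragraph. The function name and the character budget are illustrative, not from any particular library.

```python
# Structure-aware chunking sketch (illustrative, not a library API).
# Headers start new sections; paragraphs are packed whole into chunks.

def chunk_by_structure(text: str, max_chars: int = 1200) -> list[dict]:
    # Pass 1: split the document into (header, body) sections.
    sections, current_header, buf = [], "", []
    for line in text.splitlines():
        if line.startswith("#"):  # header line => section boundary
            if buf:
                sections.append((current_header, "\n".join(buf)))
            current_header, buf = line.lstrip("# ").strip(), []
        else:
            buf.append(line)
    if buf:
        sections.append((current_header, "\n".join(buf)))

    # Pass 2: pack whole paragraphs into chunks, tagged with their header
    # so the retriever keeps the surrounding section's context.
    chunks = []
    for header, body in sections:
        paragraphs = [p for p in body.split("\n\n") if p.strip()]
        cur = ""
        for p in paragraphs:
            if cur and len(cur) + len(p) > max_chars:
                chunks.append({"header": header, "text": cur.strip()})
                cur = ""
            cur += p + "\n\n"
        if cur.strip():
            chunks.append({"header": header, "text": cur.strip()})
    return chunks
```

The parent-child idea from the paragraph above maps onto the `header` field: retrieve on the chunk text, but hand the LLM the chunk plus its section header (or the whole parent section) for context.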
Not all embeddings are created equal. OpenAI's ada-002 is fine for general text. But if you're embedding code, legal documents, or domain-specific content, you need specialized models. The MTEB leaderboard is your friend here—test on benchmarks that match your domain.
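"Test on benchmarks that match your domain" is cheap to operationalize: build a small gold set of queries with known relevant documents and measure recall@k for each candidate embedding model. A sketch of that harness, where `retrieve` is a placeholder for your embed-and-search pipeline:

```python
# Domain-specific embedding eval sketch. `retrieve` is a placeholder for
# whatever pipeline you're testing (embedding model + vector store);
# swap models, rerun, compare recall@k on YOUR queries, not generic ones.

def recall_at_k(gold: dict[str, set[str]], retrieve, k: int = 5) -> float:
    """gold maps each query to the set of doc ids that actually answer it."""
    hits = 0
    for query, relevant in gold.items():
        top_k = set(retrieve(query)[:k])
        if top_k & relevant:  # hit if any relevant doc surfaces in the top k
            hits += 1
    return hits / len(gold)
```

Fifty hand-labeled query/document pairs from your own corpus will tell you more about ada-002 versus a domain model than any leaderboard position.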
Here's the move that separates working RAG from broken RAG: two-stage retrieval. First, cast a wide net with vector search (top 20-50 results). Then, use a cross-encoder reranker to score those results against the actual query. Cohere's reranker, BGE-reranker, or even a small LLM can dramatically improve precision.
Vector similarity finds things that are "about" the same topic. Reranking finds things that actually answer the question. They're solving different problems.
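The two-stage pipeline is a few lines of glue. In this sketch, `vector_search` and `rerank_score` are placeholders for your vector store and your cross-encoder (BGE-reranker, Cohere's rerank endpoint, or a small LLM scoring relevance):

```python
# Two-stage retrieval sketch. Both callables are placeholders:
#   vector_search(query, k) -> list of docs  (cheap, wide net)
#   rerank_score(query, doc) -> float        (expensive, precise)

def two_stage_retrieve(query, vector_search, rerank_score,
                       first_k=50, final_k=5):
    candidates = vector_search(query, k=first_k)
    scored = [(rerank_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]
```

The asymmetry is the point: the cross-encoder sees query and document together, so it's far more accurate, but it's too slow to run over the whole corpus. Vector search narrows the field to `first_k`; the reranker only pays its cost there.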
Pure vector search fails on exact matches—product codes, names, specific terms. Pure keyword search (BM25) misses semantic similarity. Hybrid search combines both, typically with reciprocal rank fusion. It's more complex to implement but handles the full range of real-world queries.
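Reciprocal rank fusion itself is simpler than it sounds: each document scores the sum of 1/(k + rank) over every ranking it appears in, with k = 60 as the conventional smoothing constant. A minimal sketch:

```python
# Reciprocal rank fusion sketch: merge rankings (e.g. one from vector
# search, one from BM25) into a single fused ordering.
# Each doc scores sum of 1/(k + rank) across the rankings it appears in.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only consumes ranks, not raw scores, it sidesteps the usual headache of normalizing cosine similarities against BM25 scores, which live on incomparable scales.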
RAG isn't magic. It's engineering. The teams that treat it like a science project—measuring, iterating, instrumenting—are the ones shipping systems that work. Everyone else is just hoping.