Hello Reader,

Back in 2025 at AWS, as an L7 Principal Solutions Architect, I was building Gen AI chatbots for many customers. We threw entire PDFs into the system and wondered why the responses were garbage. “RAG is simple,” we thought. “Just embed documents and query them.” We were so wrong.

Today, everyone’s jumping on the RAG bandwagon. But most are making the same mistakes I made back then. In this newsletter, we’re going deep into the hidden mechanics of RAG that actually determine success or failure. Let’s get started:

Bad: Uploading Whole Documents

Most beginners think RAG works like this: upload a 100-page PDF, ask questions, get answers. This is the worst approach possible. Why? When you query “What are the security requirements?”, the system retrieves massive chunks containing everything EXCEPT what you need. The LLM gets overwhelmed with irrelevant context, burns through tokens, and gives you mediocre answers.

Good: Strategic Document Chunking

Smart developers chunk documents into smaller pieces. But here’s where it gets interesting: there is no one-size-fits-all chunking strategy. Different strategies exist for different content types:

Fixed-size chunking - Split every X tokens (e.g. 512). Simple, but breaks context mid-sentence
Semantic chunking - Split at paragraph or section boundaries. Preserves meaning but creates uneven sizes
Recursive chunking - Try large chunks first, split recursively if needed. Best for technical docs
Context-aware chunking - Keep code blocks intact, preserve table structures, keep bullet point groups together

Here’s the trade-off: smaller chunks give precise retrieval but lose context. Larger chunks maintain context but dilute relevance. You need to experiment with your specific content; the sketch below shows the two simplest strategies in code.
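To make this concrete, here is a minimal sketch of fixed-size and recursive chunking in plain Python. It treats whitespace-separated words as stand-in “tokens”; a real pipeline would count tokens with the tokenizer that matches your embedding model, and the sizes here are just illustrative defaults.

```python
# Minimal chunking sketch. Whitespace words stand in for tokens; swap in the
# tokenizer that matches your embedding model for real token counts.

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into ~chunk_size-token pieces with overlap between them."""
    tokens = text.split()
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
    return chunks

def recursive_chunks(text, max_tokens=512, separators=("\n\n", "\n", ". ")):
    """Keep text whole if it fits; otherwise split on the coarsest separator
    that yields multiple pieces, recursing into any oversized piece."""
    if len(text.split()) <= max_tokens:
        return [text]
    for sep in separators:
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) > 1:
            out = []
            for part in parts:
                out.extend(recursive_chunks(part, max_tokens, separators))
            return out
    return fixed_size_chunks(text, max_tokens, overlap=0)  # no separator worked
```

The overlap in fixed-size chunking is what keeps a sentence that straddles a boundary retrievable from at least one chunk.

The next one also shocked me!

Bad: Using a Random Model for Embedding

We were picking embedding models more or less at random, and the accuracy of our RAG systems swung with every swap. Why? Read along...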
Good: Embedding Model Alignment (The Secret Sauce)

Here’s something that shocked me when I first learned it: different embedding models are trained on different corpus types. Using a general-purpose embedding model for specialized content is like using an English-Spanish dictionary to translate Japanese.

Three alignment strategies:

Domain-specific embeddings - Use medical embeddings for healthcare docs, legal embeddings for contracts
Fine-tuned embeddings - Train your embedding model on your actual documents and query patterns
Hybrid approach - Combine multiple embedding models and ensemble their results

Real-world example: at a financial services client, we switched from generic embeddings to finance-domain embeddings and saw retrieval accuracy jump from 62% to 89% overnight. Same documents. Same chunks. Different embeddings. The sketch below shows how to run that comparison yourself.
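Here is a hedged sketch of how you might A/B two embedding models on your own queries. It assumes the sentence-transformers library; all-MiniLM-L6-v2 is a real general-purpose model, while the second name is a placeholder for whatever domain-specific model fits your corpus.

```python
# Compare two embedding models on the same chunks and query, and eyeball
# which one retrieves the relevant chunk. sentence-transformers assumed.
from sentence_transformers import SentenceTransformer, util

chunks = [
    "Tier 1 capital ratio must remain above 6% under Basel III.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "Liquidity coverage ratio requirements apply to large banks.",
]
query = "What are the capital adequacy requirements?"

# Second model name is a placeholder; substitute your domain model.
for model_name in ["all-MiniLM-L6-v2", "your-finance-domain-model"]:
    model = SentenceTransformer(model_name)
    chunk_vecs = model.encode(chunks, convert_to_tensor=True)
    query_vec = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, chunk_vecs)[0]
    best = int(scores.argmax())
    print(f"{model_name}: best match -> {chunks[best]!r} (score {scores[best]:.2f})")
```

If the general-purpose model keeps surfacing the wrong chunk for queries your users actually ask, your embeddings are misaligned.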
The Vector Database Truth

Here’s what most tutorials won’t tell you: vector databases don’t just store vectors. What’s actually stored is, for every chunk, both the embedding vector AND the original chunk text (usually plus metadata such as the source document and position).

Why store the text? Because the final LLM never receives raw vectors. The vectors are only used for similarity search. Once relevant chunks are found, the actual text is sent to the LLM.

Important implication #1: Your vector database size isn’t just the embedding dimensions. You’re storing the full text too. Factor this in when estimating cost.

Important implication #2: Did you know the SAME embedding model is used twice? Once to embed documents into the vector database, and again at query time, when that same model converts the incoming prompt into a vector before searching.

You might ask: if we’re saving text in the vector DB anyway, why convert the prompt to a vector instead of just doing text search? Because vector search reduces to fast numeric operations over arrays, and it matches on meaning rather than exact keywords, which makes it far more effective (and fast at scale) than plain text search. The sketch below shows both implications in about twenty lines.
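A toy in-memory version, assuming sentence-transformers and numpy; every name here is illustrative, and a real vector DB adds indexing (e.g. HNSW) so it doesn’t have to scan every vector.

```python
# Toy vector store: holds the embedding AND the raw text for every chunk.
# One model instance serves both ingestion and querying.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # the SAME model, used twice

class TinyVectorStore:
    def __init__(self):
        self.vectors = []  # used only for similarity search
        self.texts = []    # what actually gets sent to the LLM

    def add(self, chunk: str) -> None:
        self.vectors.append(model.encode(chunk))  # embed at ingest time
        self.texts.append(chunk)                  # keep the raw text too

    def search(self, query: str, k: int = 3) -> list[str]:
        q = model.encode(query)                   # embed the prompt at query time
        mat = np.array(self.vectors)
        # Cosine similarity = dot product / product of norms
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
        return [self.texts[i] for i in np.argsort(-sims)[:k]]

store = TinyVectorStore()
store.add("All customer data must be encrypted at rest and in transit.")
store.add("The lunch menu rotates weekly in the cafeteria.")
print(store.search("What are the security requirements?", k=1))
```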
Advanced: Reranking Changes Everything

Initial vector search gives you candidates, but candidates aren’t perfect answers. This is where reranking enters the picture.

What is reranking? After vector search retrieves, say, 20 potentially relevant chunks, a reranking model re-scores them using more sophisticated cross-attention mechanisms, and only the top 5 reranked chunks go to your LLM. A chunk might be vector-similar because it contains the same keywords yet be completely irrelevant to the actual question intent; reranking filters this out.

Three reranking approaches (a cross-encoder sketch follows this list):

Cross-encoder reranking - Uses BERT-style models to score query-chunk pairs. Expensive but accurate
LLM-based reranking - Ask a small LLM to score relevance. Flexible but costs more tokens
Hybrid reranking - Combine vector scores, keyword matching (BM25), and cross-encoder scores
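Here is a minimal retrieve-then-rerank sketch, again assuming sentence-transformers; cross-encoder/ms-marco-MiniLM-L-6-v2 is a widely used public reranking model, and `candidates` stands in for whatever your vector search returned.

```python
# Rerank vector-search candidates with a cross-encoder, keep the top few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # The cross-encoder reads query and chunk together, so it can judge the
    # question's intent instead of just keyword overlap.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: -pair[0])
    return [chunk for _, chunk in ranked[:top_k]]

candidates = [  # pretend these came back from vector search
    "Security requirements: encrypt all data in transit and at rest.",
    "The security desk in the lobby is staffed from 9 to 5.",  # keyword match, wrong intent
]
print(rerank("What are the security requirements?", candidates, top_k=1))
```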
Your Three Next Steps

1. Audit your chunking strategy - Is it actually preserving the context your users need? Test with real queries.
2. Validate your embedding alignment - Run sample queries and check whether the retrieved chunks actually contain relevant information. If not, your embeddings are misaligned.
3. Implement basic reranking - Even a simple cross-encoder reranking layer will boost your RAG quality significantly.

RAG isn’t rocket science, but the devil is in these details. The difference between a mediocre RAG system and a delightful one comes down to understanding these hidden mechanics.

Question for you, readers: Have you implemented RAG in production? What was your biggest challenge - chunking, embeddings, or retrieval quality?

Keep learning and keep rocking 🚀,
Raj

P.S. - If you want to get an AWS Solutions Architect job without coding or learning every AWS service, the 8th cohort of the AWS SA Bootcamp launches on May 16th, 12 PM ET (Eastern Time) via live webinar. Please register for the webinar below:

Here’s what you get when you show up LIVE:

And good news: it already worked for last cohort’s students, who secured cloud jobs at top companies including AWS, Microsoft, Google, JPMorgan, and Reddit, and some of them didn’t even have cloud experience 💰. Spots are limited, so don’t miss it!