Hello Reader,

Back in 2025 at AWS, as an L7 Principal Solutions Architect, I was building Gen AI chatbots for many customers. We threw entire PDFs into the system and wondered why the responses were garbage. “RAG is simple,” we thought. “Just embed documents and query them.” We were so wrong.

Today, everyone’s jumping on the RAG bandwagon. But most are making the same mistakes I made back then. In this newsletter, we’re going deep into the hidden mechanics of RAG that actually determine success or failure. Let’s get started:

Bad: Uploading Whole Documents

Most beginners think RAG works like this: upload a 100-page PDF, ask questions, get answers. This is the worst approach possible. Why? When you query “What are the security requirements?”, the system retrieves massive chunks containing everything EXCEPT what you need. The LLM gets overwhelmed with irrelevant context, burns through tokens, and gives you mediocre answers.

Good: Strategic Document Chunking

Smart developers chunk documents into smaller pieces. But here’s where it gets interesting: there is no one-size-fits-all chunking strategy. Different strategies exist for different content types:

Fixed-size chunking - Split every X tokens (e.g. 512). Simple, but breaks context mid-sentence
Semantic chunking - Split at paragraph or section boundaries. Preserves meaning but creates uneven sizes
Recursive chunking - Try large chunks first, split recursively if needed. Best for technical docs
Context-aware chunking - Keep code blocks intact, preserve table structures, keep bullet point groups together

Here’s the trade-off: smaller chunks give precise retrieval but lose context. Larger chunks maintain context but dilute relevance. You need to experiment with your specific content; the sketch below shows the two simplest strategies in code.
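To make this concrete, here is a minimal sketch of fixed-size and recursive chunking in plain Python. It treats whitespace-separated words as stand-in “tokens”; a real pipeline would count tokens with the tokenizer that matches your embedding model, and the sizes here are just illustrative defaults.

```python
# Minimal chunking sketch. Whitespace words stand in for tokens; swap in the
# tokenizer that matches your embedding model for real token counts.

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into ~chunk_size-token pieces with overlap between them."""
    tokens = text.split()
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
    return chunks

def recursive_chunks(text, max_tokens=512, separators=("\n\n", "\n", ". ")):
    """Keep text whole if it fits; otherwise split on the coarsest separator
    that yields multiple pieces, recursing into any oversized piece."""
    if len(text.split()) <= max_tokens:
        return [text]
    for sep in separators:
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) > 1:
            out = []
            for part in parts:
                out.extend(recursive_chunks(part, max_tokens, separators))
            return out
    return fixed_size_chunks(text, max_tokens, overlap=0)  # no separator worked
```

The overlap in fixed-size chunking is what keeps a sentence that straddles a boundary retrievable from at least one chunk.

The next one also shocked me!

Bad: Using a Random Model for Embedding

We were picking embedding models more or less at random, and the accuracy of our RAG systems swung with every swap. Why? Read along...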
Good: Embedding Model Alignment (The Secret Sauce)

Here’s something that shocked me when I first learned it: different embedding models are trained on different corpus types. Using a general-purpose embedding model for specialized content is like using an English-Spanish dictionary to translate Japanese.

Three alignment strategies:

Domain-specific embeddings - Use medical embeddings for healthcare docs, legal embeddings for contracts
Fine-tuned embeddings - Train your embedding model on your actual documents and query patterns
Hybrid approach - Combine multiple embedding models and ensemble their results

Real-world example: at a financial services client, we switched from generic embeddings to finance-domain embeddings and saw retrieval accuracy jump from 62% to 89% overnight. Same documents. Same chunks. Different embeddings. The sketch below shows how to run that comparison yourself.
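Here is a hedged sketch of how you might A/B two embedding models on your own queries. It assumes the sentence-transformers library; all-MiniLM-L6-v2 is a real general-purpose model, while the second name is a placeholder for whatever domain-specific model fits your corpus.

```python
# Compare two embedding models on the same chunks and query, and eyeball
# which one retrieves the relevant chunk. sentence-transformers assumed.
from sentence_transformers import SentenceTransformer, util

chunks = [
    "Tier 1 capital ratio must remain above 6% under Basel III.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "Liquidity coverage ratio requirements apply to large banks.",
]
query = "What are the capital adequacy requirements?"

# Second model name is a placeholder; substitute your domain model.
for model_name in ["all-MiniLM-L6-v2", "your-finance-domain-model"]:
    model = SentenceTransformer(model_name)
    chunk_vecs = model.encode(chunks, convert_to_tensor=True)
    query_vec = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, chunk_vecs)[0]
    best = int(scores.argmax())
    print(f"{model_name}: best match -> {chunks[best]!r} (score {scores[best]:.2f})")
```

If the general-purpose model keeps surfacing the wrong chunk for queries your users actually ask, your embeddings are misaligned.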
The Vector Database Truth

Here’s what most tutorials won’t tell you: vector databases don’t just store vectors. What’s actually stored is, for every chunk, both the embedding vector AND the original chunk text (usually plus metadata such as the source document and position).

Why store the text? Because the final LLM never receives raw vectors. The vectors are only used for similarity search. Once relevant chunks are found, the actual text is sent to the LLM.

Important implication #1: Your vector database size isn’t just the embedding dimensions. You’re storing the full text too. Factor this in when estimating cost.

Important implication #2: Did you know the SAME embedding model is used twice? Once to embed documents into the vector database, and again at query time, when that same model converts the incoming prompt into a vector before searching.

You might ask: if we’re saving text in the vector DB anyway, why convert the prompt to a vector instead of just doing text search? Because vector search reduces to fast numeric operations over arrays, and it matches on meaning rather than exact keywords, which makes it far more effective (and fast at scale) than plain text search. The sketch below shows both implications in about twenty lines.
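A toy in-memory version, assuming sentence-transformers and numpy; every name here is illustrative, and a real vector DB adds indexing (e.g. HNSW) so it doesn’t have to scan every vector.

```python
# Toy vector store: holds the embedding AND the raw text for every chunk.
# One model instance serves both ingestion and querying.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # the SAME model, used twice

class TinyVectorStore:
    def __init__(self):
        self.vectors = []  # used only for similarity search
        self.texts = []    # what actually gets sent to the LLM

    def add(self, chunk: str) -> None:
        self.vectors.append(model.encode(chunk))  # embed at ingest time
        self.texts.append(chunk)                  # keep the raw text too

    def search(self, query: str, k: int = 3) -> list[str]:
        q = model.encode(query)                   # embed the prompt at query time
        mat = np.array(self.vectors)
        # Cosine similarity = dot product / product of norms
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
        return [self.texts[i] for i in np.argsort(-sims)[:k]]

store = TinyVectorStore()
store.add("All customer data must be encrypted at rest and in transit.")
store.add("The lunch menu rotates weekly in the cafeteria.")
print(store.search("What are the security requirements?", k=1))
```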
Advanced: Reranking Changes Everything

Initial vector search gives you candidates, but candidates aren’t perfect answers. This is where reranking enters the picture.

What is reranking? After vector search retrieves, say, 20 potentially relevant chunks, a reranking model re-scores them using more sophisticated cross-attention mechanisms, and only the top 5 reranked chunks go to your LLM. A chunk might be vector-similar because it contains the same keywords yet be completely irrelevant to the actual question intent; reranking filters this out.

Three reranking approaches (a cross-encoder sketch follows this list):

Cross-encoder reranking - Uses BERT-style models to score query-chunk pairs. Expensive but accurate
LLM-based reranking - Ask a small LLM to score relevance. Flexible but costs more tokens
Hybrid reranking - Combine vector scores, keyword matching (BM25), and cross-encoder scores
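Here is a minimal retrieve-then-rerank sketch, again assuming sentence-transformers; cross-encoder/ms-marco-MiniLM-L-6-v2 is a widely used public reranking model, and `candidates` stands in for whatever your vector search returned.

```python
# Rerank vector-search candidates with a cross-encoder, keep the top few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # The cross-encoder reads query and chunk together, so it can judge the
    # question's intent instead of just keyword overlap.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: -pair[0])
    return [chunk for _, chunk in ranked[:top_k]]

candidates = [  # pretend these came back from vector search
    "Security requirements: encrypt all data in transit and at rest.",
    "The security desk in the lobby is staffed from 9 to 5.",  # keyword match, wrong intent
]
print(rerank("What are the security requirements?", candidates, top_k=1))
```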
Your Three Next Steps

1. Audit your chunking strategy - Is it actually preserving the context your users need? Test with real queries.
2. Validate your embedding alignment - Run sample queries and check whether the retrieved chunks actually contain relevant information. If not, your embeddings are misaligned.
3. Implement basic reranking - Even a simple cross-encoder reranking layer will boost your RAG quality significantly.

RAG isn’t rocket science, but the devil is in these details. The difference between a mediocre RAG system and a delightful one comes down to understanding these hidden mechanics.

Question for you, readers: Have you implemented RAG in production? What was your biggest challenge - chunking, embeddings, or retrieval quality?

Keep learning and keep rocking 🚀,
Raj

P.S. - If you want to get an AWS Solutions Architect job without coding or learning every AWS service, the 8th cohort of the AWS SA Bootcamp launches on May 16th, 12 PM ET (Eastern Time) via live webinar. Please register for the webinar below:

Here’s what you get when you show up LIVE:

And good news: it already worked for last cohort’s students, who secured cloud jobs at top companies including AWS, Microsoft, Google, JPMorgan, and Reddit, and some of them didn’t even have cloud experience 💰. Spots are limited, so don’t miss it!