The problem RAG solves
LLMs have three hard limitations that RAG directly addresses:
- Hallucination — the model generates plausible-sounding but factually wrong text when it lacks knowledge
- Knowledge cutoff — training data has a fixed date; the model cannot know what happened last week
- Private data — the model was never trained on your internal docs, tickets, or codebases
RAG augments the model’s prompt with retrieved, relevant, grounded context at inference time — without retraining the model. It is cheaper, faster to iterate, and easier to audit than fine-tuning.
Naive RAG pipeline
Document corpus
↓ [1] Chunk
Chunks (e.g., 512-token segments)
↓ [2] Embed
Dense vectors
↓ [3] Store
Vector database
━━━━━━━━━━━━ query time ━━━━━━━━━━━━
User query
↓ [4] Embed query
Query vector
↓ [5] ANN retrieve (top-k)
Relevant chunks
↓ [6] Augment prompt
[System prompt + retrieved chunks + user question]
↓ [7] Generate
LLM response
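The seven steps above can be sketched end to end. Everything here is a stand-in: the bag-of-words `embed`, the brute-force `VectorStore`, and the prompt string are illustrative placeholders rather than any particular library's API, and step [7] is left as a comment since it needs a real LLM.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" standing in for a real embedding model.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def cosine(a, b):
    # Dot product of two sparse unit vectors.
    return sum(v * b.get(w, 0.0) for w, v in a.items())

class VectorStore:
    # Brute-force stand-in for an ANN index ([3] Store / [5] Retrieve).
    def __init__(self):
        self.items = []  # list of (vector, chunk) pairs

    def add(self, vector, chunk):
        self.items.append((vector, chunk))

    def search(self, query_vec, top_k=3):
        ranked = sorted(self.items, key=lambda it: cosine(query_vec, it[0]),
                        reverse=True)
        return [chunk for _, chunk in ranked[:top_k]]

# [1]-[3] Index time: chunk (here: one sentence per chunk), embed, store.
corpus = [
    "RAG augments prompts with retrieved context.",
    "BM25 is a keyword search algorithm.",
    "Chunking splits documents into segments.",
]
store = VectorStore()
for chunk in corpus:
    store.add(embed(chunk), chunk)

# [4]-[6] Query time: embed the query, retrieve top-k, augment the prompt.
question = "What does chunking do?"
chunks = store.search(embed(question), top_k=2)
prompt = ("Answer using only this context:\n"
          + "\n".join(chunks)
          + f"\n\nQuestion: {question}")
# [7] Generate: `prompt` would now be sent to the LLM.
```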
Chunking strategies
Chunking is the most underappreciated lever in RAG. Bad chunking destroys retrieval recall regardless of how good your model or index is.
Fixed-size chunking splits on token count with overlap:
chunk_size = 512 # tokens
overlap = 64 # tokens shared with adjacent chunks
# Simple, predictable, but may cut mid-sentence or mid-concept
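A runnable version of the fixed-size strategy, sketched over a pre-tokenized list (a real system would count tokens with the embedding model's own tokenizer):

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already covers the tail
    return chunks

tokens = ["t%d" % i for i in range(1200)]
chunks = fixed_size_chunks(tokens, chunk_size=512, overlap=64)
# Adjacent chunks overlap: chunks[0][-64:] == chunks[1][:64]
```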
Semantic chunking splits at natural boundaries (paragraphs, headings, sentence embedding distance thresholds). Higher quality but slower to build.
Hierarchical chunking stores both small chunks (precise retrieval) and their parent document/section (rich context). At query time, retrieve the small chunk for relevance scoring, but inject the parent for generation context. Implemented in LangChain as ParentDocumentRetriever; LlamaIndex ships the same pattern as its auto-merging retriever.
| Strategy | Retrieval precision | Context richness | Build cost |
|---|---|---|---|
| Fixed-size | Medium | Low | Very low |
| Semantic | High | Medium | Medium |
| Hierarchical | High | High | Medium |
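The hierarchical idea reduces to a small-chunk-to-parent mapping. A minimal sketch, with keyword overlap standing in for vector similarity (the function names and the word-count splitting are illustrative, not any library's API):

```python
def build_index(sections, child_words=8):
    """Split each parent section into small child chunks, remembering the parent."""
    child_to_parent = {}
    for parent in sections:
        words = parent.split()
        for i in range(0, len(words), child_words):
            child = " ".join(words[i:i + child_words])
            child_to_parent[child] = parent
    return child_to_parent

def retrieve_parent(index, query):
    """Score small children for relevance, but return the parent for context."""
    q = set(query.lower().split())
    best_child = max(index, key=lambda c: len(q & set(c.lower().split())))
    return index[best_child]

sections = [
    "The billing service retries failed charges three times before paging on-call engineers.",
    "The search service combines BM25 with a dense index and a cross-encoder reranker.",
]
index = build_index(sections)
parent = retrieve_parent(index, "how many retries for failed charges")
```

The precise match happens against an 8-word child, but `parent` is the full section, which is what gets injected into the prompt.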
Hybrid retrieval
Naive dense-only retrieval fails on exact terms (product codes, names, error messages). Production systems combine:
- BM25 — fast keyword search, strong on exact matches
- Dense ANN — semantic similarity, handles paraphrase and concept search
- Reranker — a cross-encoder model (e.g., Cohere Rerank, bge-reranker-v2) that scores each (query, chunk) pair with full attention — expensive but high accuracy
# Rough hybrid pipeline (bm25_index, vector_db, embed, reranker are stand-ins)
bm25_results = bm25_index.search(query, top_k=20)             # keyword channel
dense_results = vector_db.search(embed(query), top_k=20)      # semantic channel
merged = reciprocal_rank_fusion(bm25_results, dense_results)  # RRF merging
reranked = reranker.score(query, merged[:30])                 # cross-encoder rescoring
final_context = reranked[:5]                                  # inject top 5 into the prompt
Reciprocal Rank Fusion (RRF): score = Σ 1 / (k + rank_i), summed over each result list, where k = 60 is a standard smoothing constant. Robust, needs essentially no tuning, and works well in practice.
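RRF itself is only a few lines. A sketch that merges any number of ranked lists of document ids:

```python
def reciprocal_rank_fusion(*result_lists, k=60):
    """Merge ranked lists: each doc scores sum(1 / (k + rank)) across the lists."""
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["d1", "d2", "d3"]
dense_ids = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion(bm25_ids, dense_ids)
```

Note how d1 and d3 both rank highly in both lists, so they dominate the fused order even though neither channel agrees on which is first.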
Advanced RAG patterns
Query expansion generates multiple phrasings of the user question before retrieval, merging results. Compensates for vocabulary mismatch between query and document.
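A sketch of query expansion, where `generate` (an LLM call returning paraphrases) and `search` are injected stand-ins, and results are merged by each document's best rank in any list:

```python
def expand_and_retrieve(question, generate, search, top_k=5):
    """Retrieve with the original question plus LLM-generated paraphrases."""
    queries = [question] + generate(question)
    best_rank = {}
    for q in queries:
        for rank, doc in enumerate(search(q)):
            # Keep each document's best rank across all phrasings.
            best_rank[doc] = min(best_rank.get(doc, rank), rank)
    return sorted(best_rank, key=best_rank.get)[:top_k]
```

A production version would typically merge with RRF instead of best-rank, but the shape is the same: fan out queries, union the results.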
Multi-hop RAG iterates: retrieve context → reason → generate a follow-up query → retrieve again. Used when answering requires chaining facts (e.g., “Who founded the company that acquired Figma?”).
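The multi-hop loop can be sketched as below; the "SEARCH:" prefix is an invented convention for letting the model request another retrieval, and both `search` and `llm` are injected stand-ins, not a specific framework's interface:

```python
def multi_hop_answer(question, search, llm, max_hops=3):
    """Retrieve, then let the LLM either answer or emit a follow-up query
    (prefixed 'SEARCH:') when it still lacks a needed fact."""
    context, query = [], question
    for _ in range(max_hops):
        context += search(query)          # accumulate evidence across hops
        reply = llm(question, context)
        if reply.startswith("SEARCH:"):
            query = reply[len("SEARCH:"):].strip()
        else:
            return reply
    return llm(question, context)         # answer with whatever was gathered
```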
Self-RAG trains the model to decide whether to retrieve, and to critique the relevance and factuality of retrieved passages before generating. Reduces over-retrieval noise.
HyDE (Hypothetical Document Embeddings) embeds a hypothetical answer to the question and uses that as the query vector, rather than the raw question. Effective when questions and documents are phrased very differently.
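HyDE is essentially a one-line change to the retrieval step. In this sketch `generate`, `embed`, and `vector_search` are injected stand-ins:

```python
def hyde_retrieve(question, generate, embed, vector_search, top_k=5):
    """Embed a hypothetical answer, not the raw question, and search with that."""
    hypothetical = generate(f"Write a short passage answering: {question}")
    return vector_search(embed(hypothetical), top_k)
```

The hypothetical answer may be factually wrong; that is fine, since it only needs to be phrased like the documents you want to retrieve.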
Evaluation metrics
A RAG system has two failure modes: retrieval failing to find the right chunks, and the model failing to use them correctly.
| Metric | What it measures | Tool |
|---|---|---|
| Retrieval recall@k | Are the relevant chunks in the top-k? | Manual labels or RAGAS |
| Answer faithfulness | Is the answer grounded in retrieved context (no hallucination)? | RAGAS, TruLens |
| Answer relevance | Does the answer actually address the question? | RAGAS, human eval |
| Context precision | Of retrieved chunks, how many were actually useful? | RAGAS |
RAGAS (Retrieval Augmented Generation Assessment) is the de facto open-source evaluation framework. It uses LLM-as-judge internally for semantic metrics.
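Of these metrics, retrieval recall@k is the simplest to compute yourself once each query has labeled relevant chunks:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the gold-relevant chunks that appear in the top-k retrieved."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# One of two gold chunks appears in the top 3 retrieved.
score = recall_at_k(["c7", "c2", "c9", "c4"], relevant=["c2", "c4"], k=3)
```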
Interview angle
When asked to “design a Q&A system over internal docs,” walk through: document ingestion pipeline → chunking strategy choice → embedding model → vector DB with hybrid search → reranking layer → LLM generation → evaluation loop. The follow-up will be about latency: reranking adds 50–200ms per query, so you may want to gate it behind a threshold or run it asynchronously. Discuss caching embeddings of popular queries and prefix caching for stable system prompts.