AI Engineering · Topic 4 of 8

Retrieval-Augmented Generation (RAG)

200 XP

The problem RAG solves

LLMs have three hard limitations that RAG directly addresses:

  1. Hallucination — the model generates plausible-sounding but factually wrong text when it lacks knowledge
  2. Knowledge cutoff — training data has a fixed date; the model cannot know what happened last week
  3. Private data — the model was never trained on your internal docs, tickets, or codebases

RAG augments the model’s prompt with retrieved, relevant, grounded context at inference time — without retraining the model. It is cheaper, faster to iterate, and easier to audit than fine-tuning.

Naive RAG pipeline

Document corpus
    ↓  [1] Chunk
Chunks (e.g., 512-token segments)
    ↓  [2] Embed
Dense vectors
    ↓  [3] Store
Vector database
━━━━━━━━━━━━ query time ━━━━━━━━━━━━
User query
    ↓  [4] Embed query
Query vector
    ↓  [5] ANN retrieve (top-k)
Relevant chunks
    ↓  [6] Augment prompt
[System prompt + retrieved chunks + user question]
    ↓  [7] Generate
LLM response
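The seven steps above can be sketched end-to-end in a few dozen lines. This is a toy, not an implementation: bag-of-words counts stand in for a real embedding model, and brute-force cosine search stands in for an ANN index.

```python
import math
import re

def embed(text: str) -> dict[str, int]:
    # Toy "embedding": bag-of-words counts stand in for a dense vector model.
    vec: dict[str, int] = {}
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[token] = vec.get(token, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# [1]-[3] Index time: chunk (pre-chunked here), embed, store in memory.
chunks = [
    "RAG augments the prompt with retrieved context.",
    "Fine-tuning updates model weights on new data.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# [4]-[5] Query time: embed the query, brute-force retrieve top-k.
def retrieve(query: str, k: int = 1) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(qv, entry[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# [6]-[7] Augment the prompt; a real system would now call the LLM.
context = "\n".join(retrieve("how does RAG add context to a prompt?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```

Every production component (tokenizer-aware chunking, learned embeddings, ANN search) is a drop-in upgrade to one of these steps.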

Chunking strategies

Chunking is the most underappreciated lever in RAG. Bad chunking destroys retrieval recall regardless of how good your model or index is.

Fixed-size chunking splits on token count with overlap:

chunk_size = 512   # tokens
overlap = 64       # tokens shared with adjacent chunks
# Simple, predictable, but may cut mid-sentence or mid-concept
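A minimal sketch of that scheme, using a pre-tokenized list (whitespace tokens work as a stand-in for real tokenizer output):

```python
def fixed_size_chunks(tokens: list[str],
                      chunk_size: int = 512,
                      overlap: int = 64) -> list[list[str]]:
    # Slide a window of chunk_size tokens, stepping by chunk_size - overlap,
    # so adjacent chunks share exactly `overlap` tokens.
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

The overlap is what keeps a sentence cut at a boundary recoverable from at least one of the two chunks that straddle it.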

Semantic chunking splits at natural boundaries (paragraphs, headings, sentence embedding distance thresholds). Higher quality but slower to build.

Hierarchical chunking stores both small chunks (precise retrieval) and their parent document/section (rich context). At query time, retrieve the small chunk for relevance scoring, but inject the parent for generation context. Implemented in LangChain as ParentDocumentRetriever; LlamaIndex's auto-merging retriever follows the same idea.
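The small-to-big lookup is just a join from chunk to parent. A sketch, assuming each indexed chunk carries a `parent_id` pointing at its section (all names here are illustrative, not a library API):

```python
# Two-level store: small chunks for scoring, parent sections for context.
parents = {
    "sec-1": "Full refund policy section, including exceptions ...",
    "sec-2": "Full shipping section, including regions and carriers ...",
}
small_chunks = [
    {"text": "Refunds are accepted within 30 days.", "parent_id": "sec-1"},
    {"text": "Standard shipping takes 3-5 business days.", "parent_id": "sec-2"},
]

def parents_for(scored: list[tuple[float, dict]], k: int = 2) -> list[str]:
    # scored: (score, chunk) pairs from the small-chunk index.
    best = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    seen, context = set(), []
    for _, chunk in best:
        pid = chunk["parent_id"]
        if pid not in seen:          # deduplicate chunks sharing a parent
            seen.add(pid)
            context.append(parents[pid])
    return context
```

The relevance score comes from the small chunk; the text handed to the LLM comes from the parent.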

| Strategy     | Retrieval precision | Context richness | Build cost |
|--------------|---------------------|------------------|------------|
| Fixed-size   | Medium              | Low              | Very low   |
| Semantic     | High                | Medium           | Medium     |
| Hierarchical | High                | High             | Medium     |

Hybrid retrieval

Naive dense-only retrieval fails on exact terms (product codes, names, error messages). Production systems combine:

  1. BM25 — fast keyword search, strong on exact matches
  2. Dense ANN — semantic similarity, handles paraphrase and concept search
  3. Reranker — a cross-encoder model (e.g., Cohere Rerank, bge-reranker-v2) that scores each (query, chunk) pair with full attention — expensive but high accuracy

# Rough hybrid pipeline (bm25_index, vector_db, embed, reranker assumed)
bm25_results = bm25_index.search(query, top_k=20)
dense_results = vector_db.search(embed(query), top_k=20)
merged = reciprocal_rank_fusion(bm25_results, dense_results)  # RRF merging
# Cross-encoder pass: rescore the fused top 30, keep the best 5
reranked = sorted(merged[:30], key=lambda c: reranker.score(query, c), reverse=True)
final_context = reranked[:5]

Reciprocal Rank Fusion (RRF): score(d) = Σᵢ 1 / (k + rankᵢ(d)), where rankᵢ(d) is the rank of document d in retriever i and k = 60 is a smoothing constant. Robust, needs essentially no tuning beyond the default k, and works well in practice.
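The `reciprocal_rank_fusion` step in the hybrid pipeline above fits in a few lines. A sketch operating on lists of chunk IDs, best first:

```python
def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    # score(d) = sum over rankings of 1 / (k + rank of d), ranks start at 1.
    # Documents missing from a ranking simply contribute nothing for it.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF never has to reconcile BM25 scores with cosine similarities, which live on incompatible scales.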

Advanced RAG patterns

Query expansion generates multiple phrasings of the user question before retrieval, merging results. Compensates for vocabulary mismatch between query and document.
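One way to sketch the merge (the `paraphrase` and `search` callables are placeholders for an LLM call and your retriever, not a specific library API):

```python
def expand_and_retrieve(question: str, paraphrase, search, top_k: int = 5) -> list[str]:
    # paraphrase: LLM call returning alternate phrasings of the question.
    # search: retrieval returning ranked chunk IDs for one query string.
    queries = [question] + paraphrase(question)
    results = [search(q) for q in queries]
    merged, seen = [], set()
    # Round-robin merge: every phrasing's best hit first, then second-best, ...
    for rank in range(max(len(r) for r in results)):
        for hits in results:
            if rank < len(hits) and hits[rank] not in seen:
                seen.add(hits[rank])
                merged.append(hits[rank])
    return merged[:top_k]
```

RRF is an equally common choice for the merge step; round-robin is shown here because it needs no scores at all.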

Multi-hop RAG iterates: retrieve context → reason → generate a follow-up query → retrieve again. Used when answering requires chaining facts (e.g., "Who founded the company that makes Photoshop?" needs a hop from Photoshop to Adobe before it can retrieve the founders).

Self-RAG trains the model to decide whether to retrieve, and to critique the relevance and factuality of retrieved passages before generating. Reduces over-retrieval noise.

HyDE (Hypothetical Document Embeddings) embeds a hypothetical answer to the question and uses that as the query vector, rather than the raw question. Effective when questions and documents are phrased very differently.
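The whole trick is which string gets embedded. A sketch, where `llm`, `embed`, and `search` are caller-supplied stand-ins for a completion call, an embedding model, and vector search:

```python
def hyde_retrieve(question: str, llm, embed, search, top_k: int = 5) -> list[str]:
    # Ask the model to *answer* the question. Factual errors in this
    # hypothetical answer are tolerable: it only needs to be phrased
    # like the documents we want to find.
    hypothetical = llm(f"Write a short passage that answers: {question}")
    # Embed the hypothetical document, not the raw question.
    return search(embed(hypothetical), top_k)
```

A question ("why is my pod stuck in Pending?") and the document that answers it (a scheduler troubleshooting guide) often share almost no vocabulary; the hypothetical answer bridges that gap.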

Evaluation metrics

A RAG system has two failure modes: retrieval failing to find the right chunks, and the model failing to use them correctly.

| Metric              | What it measures                                               | Tool                   |
|---------------------|----------------------------------------------------------------|------------------------|
| Retrieval recall@k  | Are the relevant chunks in the top-k?                          | Manual labels or RAGAS |
| Answer faithfulness | Is the answer grounded in retrieved context (no hallucination)? | RAGAS, TruLens         |
| Answer relevance    | Does the answer actually address the question?                 | RAGAS, human eval      |
| Context precision   | Of retrieved chunks, how many were actually useful?            | RAGAS                  |

RAGAS (Retrieval Augmented Generation Assessment) is the de facto open-source evaluation framework. It uses LLM-as-judge internally for semantic metrics.
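Given hand-labeled relevant chunks, recall@k from the table above reduces to a one-line set intersection (RAGAS adds LLM-judged variants on top of this):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the labeled relevant chunks that appear in the top-k results.
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant) if relevant else 0.0
```

Measuring this first, before any generation metric, tells you whether to fix retrieval or prompting: the LLM cannot be faithful to context it never received.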

Interview angle

When asked to “design a Q&A system over internal docs,” walk through: document ingestion pipeline → chunking strategy choice → embedding model → vector DB with hybrid search → reranking layer → LLM generation → evaluation loop. The follow-up will be about latency: reranking adds 50–200ms per query, so you may want to gate it behind a threshold or run it asynchronously. Discuss caching embeddings of popular queries and prefix caching for stable system prompts.