Chapter 10 · RAG

Reading your documents

How an LLM accesses thousands of pages without memorizing them. Embeddings, semantic search, injected context.

The limit context can't solve

Imagine you want to build an assistant capable of answering questions about your entire company's internal documentation — 50,000 pages, updated every week.

Even with a 1-million-token window, you can't fit it all in. And even if you could, the cost would be prohibitive and response quality would degrade: models struggle to extract precise information from very long contexts.

You need a different approach. That's RAG: Retrieval-Augmented Generation.

The fundamental idea

Rather than giving the entire document to the model, you give it only the passages relevant to the question. To do this, you need two things:

  1. An index: a vector representation of all your documents, stored in advance.
  2. A semantic search engine: when a question arrives, it finds the semantically closest passages.

The retrieved passages are then injected into the LLM's context along with the question. The model answers by drawing on these excerpts.

This is exactly what semantic search with embeddings does — which you already know from chapter 03 — but applied to real documents.

The pipeline in three steps

1. Indexing (once)

The document is split into chunks — excerpts of a few hundred tokens (with slight overlap between them to avoid cutting meaning at the boundary).

Each chunk is then converted into a vector by an embedding model. These vectors are stored in a vector database (Pinecone, Chroma, pgvector…).

This step is done once, or whenever the document is updated.
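Here's a minimal indexing sketch, assuming the sentence-transformers package. The model name is one small local option (it reappears below), and a plain numpy array stands in for a real vector database:

```python
# Indexing sketch: embed each chunk once and store the vectors.
# Assumes: pip install sentence-transformers numpy
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model that runs locally

# "chunks" is the output of the splitting step; two toy examples here
chunks = [
    "Venus has the longest rotation period of any planet: 243 Earth days.",
    "Jupiter completes a rotation in just under 10 hours.",
]

# One normalized vector per chunk; normalizing makes cosine similarity a dot product
index = np.asarray(model.encode(chunks, normalize_embeddings=True))

# In production these vectors would go to Pinecone, Chroma, pgvector...
```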

2. Retrieval (at each question)

When a question arrives, it's converted into a vector with the same embedding model.

We compute the cosine similarity between this question-vector and all chunk-vectors in the database. The k closest chunks are retrieved — typically 3 to 5.

This is semantic search: "close" doesn't mean "contains the same words," but "expresses a similar meaning."
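Continuing the sketch above, retrieval reuses the same model and the stored index; k is the only knob here:

```python
# Retrieval sketch: embed the question with the SAME embedding model,
# then rank all chunks by cosine similarity.
def retrieve(question: str, k: int = 4) -> list[str]:
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = index @ q                  # cosine similarity with every chunk
    top = np.argsort(scores)[::-1][:k]  # indices of the k closest chunks
    return [chunks[i] for i in top]

print(retrieve("Which planet spins the slowest?", k=1))
```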

3. Generation

The retrieved chunks are assembled into a prompt:

Document excerpts:
[Excerpt 1 — ...]
[Excerpt 2 — ...]
[Excerpt 3 — ...]

Question: {user's question}

This prompt is sent to the LLM, which generates a response based on the excerpts. The model hasn't memorized the document — it reads it at the moment the question is asked.
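A generation sketch, reusing retrieve from above. The OpenAI client is one option among many, and the model name is illustrative:

```python
# Generation sketch: assemble the prompt shown above and send it to an LLM.
# Assumes: pip install openai, with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def answer(question: str) -> str:
    excerpts = retrieve(question)
    context = "\n".join(f"[Excerpt {i + 1} — {text}]" for i, text in enumerate(excerpts))
    prompt = f"Document excerpts:\n{context}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```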

Pipeline visualization

Here's the system in action on a mini planetology corpus: the question gets embedded, compared against the corpus chunks in vector space, and the most relevant ones are injected into the prompt before generation. That detour is what lets an LLM answer questions about documents it has never seen during training.

Chunk size: a real engineering choice

Splitting into chunks isn't trivial. It's often the primary source of problems in a RAG system.

Chunks too small: each chunk carries too little information. You retrieve fragments that, stripped of their context, don't give the model enough to answer correctly.

Chunks too large: you exceed what the embedding can represent precisely. A vector summarizing 2,000 tokens is a blurry average — the chunk will still be retrieved for some questions, but the relevant content will be buried in the surrounding text.

A practical rule: chunks of 200 to 500 tokens, with 10–15% overlap. The right parameter depends on the structure of your documents.
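A naive splitter to make the rule concrete. Real systems count tokens with the model's tokenizer and prefer structural boundaries (headings, paragraphs); words stand in for tokens here:

```python
# Chunking sketch: fixed-size windows with overlap. 40/300 gives ~13% overlap,
# inside the 10-15% rule of thumb above.
def split_into_chunks(text: str, size: int = 300, overlap: int = 40) -> list[str]:
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[start:start + size])
        for start in range(0, max(len(words) - overlap, 1), step)
    ]
```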

Which embedding model should you use?

Retrieval quality depends heavily on the embedding model. A model trained on general English will retrieve Python code poorly; a code-trained model will retrieve medical text poorly.

The recurring choices in practice:

  • text-embedding-3-small / -large (OpenAI) — solid quality, generalist, multilingual. The default when you start.
  • bge-large / bge-m3 (BAAI) — open-source, excellent multilingual, top of the MTEB leaderboard in 2024.
  • all-MiniLM-L6-v2 (Sentence-Transformers) — small, fast, deployable locally. Good cost/quality ratio for simple use cases.
  • Domain-specialized models (code, biomedical, legal) — always better in their niche, weaker everywhere else.

A practical rule: test two or three models on your data and your queries. The MTEB leaderboard tells you very little about your specific use case.
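One way to run that test, sketched under the same sentence-transformers assumption: build a handful of (question, expected chunk) pairs from your own corpus and measure recall@k per model. The model names are examples:

```python
# Eval sketch: what fraction of questions retrieve their expected chunk in the top-k?
from sentence_transformers import SentenceTransformer
import numpy as np

def recall_at_k(model_name: str, chunks: list[str],
                pairs: list[tuple[str, int]], k: int = 5) -> float:
    model = SentenceTransformer(model_name)
    index = model.encode(chunks, normalize_embeddings=True)
    hits = 0
    for question, expected in pairs:
        q = model.encode([question], normalize_embeddings=True)[0]
        hits += expected in np.argsort(index @ q)[::-1][:k]
    return hits / len(pairs)

# pairs = [("Which planet spins the slowest?", 0), ...]  # built by hand, from YOUR data
# for name in ["all-MiniLM-L6-v2", "BAAI/bge-m3"]:
#     print(name, recall_at_k(name, chunks, pairs))
```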

The reranker: a second pass

Vector search has a flaw: it's fast but coarse. A few-hundred-dimensional embedding is a fuzzy average of a text. Many "roughly relevant" chunks float to the top, and the truly best ones sometimes get drowned out.

That's why a second stage has become standard in serious RAG systems: the reranker.

The idea, in two steps:

  1. First pass (vector search) — retrieve the 50 or 100 closest chunks. Fast.
  2. Second pass (reranking) — a small model (often a cross-encoder) scores each (question, chunk) pair together, gives a precise score, and keeps the top 5–10. Slow per chunk, but only run on the 50 candidates.

The reranker sees the question and the chunk together, which a single embedding never does. It catches semantic subtleties that vector search misses. Cohere, Jina AI, and BAAI all offer ready-to-use rerankers.

Without a reranker, your RAG plateaus. With one, quality jumps a notch without changing the rest of the pipeline.
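Here's what the two passes can look like, reusing retrieve from earlier. The CrossEncoder class is part of sentence-transformers; the checkpoint name is one common open-source choice:

```python
# Reranking sketch: coarse vector search first, precise cross-encoder second.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(question: str, k: int = 5) -> list[str]:
    candidates = retrieve(question, k=50)  # first pass: fast but coarse
    # second pass: the cross-encoder scores each (question, chunk) pair together
    scores = reranker.predict([(question, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```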

Why not just keyword search?

A legitimate question: why not simply do keyword search, like a classic search engine?

Vector search is complementary, not a replacement. It finds semantically similar passages even when the wording is different: "Which planet spins the slowest?" can surface a chunk about "rotation period" even though those words never appear in the question.

In practice, the best systems combine both: hybrid search — a combined score of vector similarity and keyword matching (BM25).
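A hybrid scoring sketch, assuming the rank-bm25 package for the keyword side; the 50/50 weighting is arbitrary and worth tuning on your data:

```python
# Hybrid search sketch: mix normalized vector and BM25 scores.
# Assumes: pip install rank-bm25
# Reuses model, index, chunks, and np from the indexing sketch above.
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([c.lower().split() for c in chunks])

def hybrid_retrieve(question: str, k: int = 4, alpha: float = 0.5) -> list[str]:
    vec = index @ model.encode([question], normalize_embeddings=True)[0]
    kw = bm25.get_scores(question.lower().split())
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)  # both to [0, 1]
    combined = alpha * norm(vec) + (1 - alpha) * norm(kw)
    top = np.argsort(combined)[::-1][:k]
    return [chunks[i] for i in top]
```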

RAG's limitations

RAG isn't magic. Here are the most common problems:

The badly split chunk. If an answer spans two chunks and the split falls in the wrong place, neither chunk is sufficient on its own.

"Lost in the middle" within the chunk. If the chunk is too long, important information may end up in the middle and be ignored by the model.

The ambiguous question. If the question is vague, the question-embedding will be far from most relevant chunks. Solution: rephrase the question, or generate multiple variants.
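A sketch of that last mitigation, reusing the client and retrieve helpers from the sketches above; the rephrasing prompt is illustrative:

```python
# Multi-query sketch: retrieve with several phrasings, then deduplicate.
def multi_query_retrieve(question: str, k: int = 4) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content":
                   f"Rewrite this question 3 different ways, one per line:\n{question}"}],
    )
    variants = [question] + [
        line.strip() for line in response.choices[0].message.content.splitlines()
        if line.strip()
    ]
    merged: dict[str, None] = {}  # dict keys keep insertion order and deduplicate
    for variant in variants:
        for chunk in retrieve(variant, k=k):
            merged.setdefault(chunk, None)
    return list(merged)
```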

Hallucinations on excerpts. The LLM can extrapolate beyond what the chunks say. RAG reduces hallucinations; it doesn't eliminate them.

What RAG changes in practice

RAG is today the reference technique for connecting an LLM to private or recent data. Almost all enterprise chatbots, documentation assistants, and monitoring tools rely on it.

But it's still a "passive" architecture: the model receives a question, finds chunks, answers. For more complex tasks — search, then act on the result, then search again — you need to go one step further.
