Chapter 09 · Context

What the model remembers

The context window: perfect but bounded memory. Why ChatGPT forgets and what it costs.

The memory paradox

An LLM has perfect memory — and yet it forgets.

Perfect because everything in its context window is accessible instantly, with absolute precision. The model doesn't "roughly remember" what you said ten exchanges ago: it still sees it, exactly as it was.

Bounded because this window has a maximum size. Beyond it, tokens disappear — permanently, without residue. No gradual degradation, no blur. Just: present or absent.

This explains the strange experience of talking to ChatGPT for an hour on a project, then opening a new conversation and watching it "forget" all that context. That's not a bug — it's a fundamental limitation of the architecture.

What the context window really is

The context window is the number of tokens the model can process in a single pass. Everything that enters this window — your question, the conversation history, system instructions, documents you've pasted — is processed together.

The first versions of GPT-3 had windows of 2,048 tokens. Today, some models reach 1 million tokens. That's a massive evolution, but it doesn't change the nature of the problem: there's always a limit.
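To get a feel for how quickly a window fills up, you can count tokens yourself. Here is a minimal sketch using OpenAI's tiktoken library; the cl100k_base encoding is just one example, and different models use different tokenizers:

# Count how many tokens a conversation occupies, assuming the
# cl100k_base encoding (tokenizers vary from model to model).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

conversation = [
    "You are a helpful and concise assistant.",  # system prompt
    "What's the capital of France?",             # user turn
    "The capital of France is Paris.",           # assistant turn
]

total = sum(len(enc.encode(turn)) for turn in conversation)
print(f"{total} tokens used out of a hypothetical 8,000-token window")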

And that limit has a direct consequence on cost.

Attention is expensive

Remember the attention mechanism? Each token looks at all the other tokens to compute its weights. If you have n tokens in your window, attention performs on the order of n² operations.

Doubling the window doesn't double the cost — it quadruples it.

At 1,000 tokens: 1 million operations.
At 2,000 tokens: 4 million.
At 8,000 tokens: 64 million.
At 128,000 tokens: 16 billion.
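The arithmetic behind those numbers is nothing more than n squared, as this short sketch shows:

# Each of n tokens attends to all n tokens, so the pairwise work grows as n².
for n in [1_000, 2_000, 8_000, 128_000]:
    print(f"{n:>7} tokens -> {n * n:>17,} pairwise operations")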

That's why providers charge per token. And that's why long conversations cost far more than short ones: the growth is quadratic, not linear.

Play with the window

The interactive demo that accompanies this chapter lets you send messages one by one and watch old turns fall out of the window: once they do, the model can no longer access them, even if you explicitly refer to them. Changing the window size shows how forgetting speeds up or slows down, and anything that scrolls out is forgotten for good. The compute cost of attention grows in O(n²) with context length, which is why providers race for 1M-token windows.

The KV cache, in two words

Recomputing attention from scratch over all previous tokens at every new step would be catastrophic. The optimization that saves the day is called the KV cache: keep the work already done in memory, and only add the new token's row at each step.

This is what makes real-time generation possible — and what consumes VRAM proportional to context length. We come back to it in detail in chapter 18.
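To make the idea concrete, here is a toy sketch in plain NumPy. It is not any real inference engine's API, just the shape of the trick: keys and values already computed are kept around, and each step only appends one new row before attending over the whole cache.

import numpy as np

# Toy KV cache: one key and one value vector kept per token seen so far.
# Real engines store this per layer and per attention head on the GPU,
# which is why VRAM use grows with context length.
class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        # Only the new token's row is added; earlier work is reused as-is.
        self.keys.append(k)
        self.values.append(v)

    def attend(self, query):
        K = np.stack(self.keys)      # (n, d)
        V = np.stack(self.values)    # (n, d)
        scores = K @ query           # one dot product per cached token
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()     # softmax over the cached tokens
        return weights @ V           # weighted mix of cached values

cache = KVCache()
rng = np.random.default_rng(0)
for _ in range(5):                        # five "generation steps"
    k, v, q = rng.normal(size=(3, 8))     # stand-ins for real projections
    cache.append(k, v)
    out = cache.attend(q)                 # attends over everything cached
print(out.shape)                          # (8,)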

System prompts and chat templates

When you send a message to ChatGPT or Claude, what the model receives isn't exactly your message. It's a carefully formatted sequence that looks something like:

<|im_start|>system
You are a helpful and concise assistant.
<|im_end|>
<|im_start|>user
What's the capital of France?
<|im_end|>
<|im_start|>assistant

This formatting — the chat template — turns a dialogue into a linear sequence of tokens, with special markers distinguishing roles (system, user, assistant). The model has been fine-tuned to recognize them.

The system prompt is a special message placed at the very start. It sets behavior, tone, constraints ("respond in JSON", "never answer questions about topic X", "you are an American IP lawyer"). It lives in the context window like everything else, but its position at the start and its privileged role give it more weight than ordinary messages.

Each model family has its own template: ChatML for OpenAI, [INST]…[/INST] for Llama 2, newer variants for Llama 3 and Claude. If you type your dialogue by hand without respecting the right format, the model misbehaves — it was trained to see those markers and expects them.
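In practice you rarely type these markers by hand. With Hugging Face transformers, for instance, the tokenizer ships with its model's chat template and applies it for you; the model name below is only an example:

from transformers import AutoTokenizer

# The same message list gets formatted with whatever template this
# particular model family was trained on.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful and concise assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string, not token IDs
    add_generation_prompt=True,  # end with the assistant marker so the model answers
)
print(prompt)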

Why position in the window matters

A surprising finding from recent research: LLMs don't use all context with equal effectiveness.

Information at the beginning and at the end of the window is better exploited than information in the middle. This phenomenon is called "lost in the middle" — the model tends to ignore passages buried at the center of a long context.

Practical consequence: if you build an assistant with reference documents, put the most important information at the beginning or end of the prompt, not in the middle.
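One simple way to act on that, sketched below as a heuristic rather than a rule: if your passages are already ranked by relevance, interleave them so the best ones sit at the edges of the context and the weakest ones end up in the middle. The function name is mine, not from any library.

# Place the highest-ranked passages at the start and end of the prompt,
# pushing the weakest ones toward the middle, where they are most likely
# to be ignored. `passages` is assumed to be sorted best-first.
def order_for_context(passages):
    front, back = [], []
    for i, passage in enumerate(passages):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

print(order_for_context(["best", "second", "third", "fourth", "fifth"]))
# ['best', 'third', 'fifth', 'fourth', 'second']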

Toward longer windows

Researchers are exploring several approaches to extend windows without exploding costs:

  • Sparse attention: instead of looking at all tokens, each token only looks at a subset (nearby ones, or selected by importance).
  • Sliding window attention: each token only looks at its k nearest neighbors (see the sketch after this list).
  • Flash Attention: not a reduction in computation, but a much more GPU-efficient implementation of exact attention.
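To make the sliding-window idea concrete, here is a minimal sketch of the attention mask it implies, assuming causal attention; real implementations fold this into the kernel instead of building a full matrix:

import numpy as np

# Sliding-window causal mask: token i may attend to tokens j with
# i - k <= j <= i. True means "allowed to attend".
def sliding_window_mask(n_tokens, k):
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return (j <= i) & (j >= i - k)

print(sliding_window_mask(6, 2).astype(int))
# Each row has at most k + 1 ones, so the per-token work stays O(k)
# instead of growing with the full context length.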

But all these techniques have a cost in precision or implementation complexity. There's no perfect solution yet. The inference-side details — and especially how they interact with the KV cache — are in chapter 18.

What this means for you

A few practical rules that follow from all this:

Be concise in your prompts. Every token in context costs memory and compute. A 10,000-token context costs much more than ten times a 1,000-token context.

Structure your documents. If you inject content (notes, articles, code), chunk and filter first — don't paste long blocks where 90% isn't relevant.
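As a rough illustration of what "chunk and filter" can look like, here is a deliberately naive sketch; real pipelines rank chunks by relevance rather than by raw keyword matching, and the function name is mine:

# A naive sketch of "chunk and filter": split a document into paragraph
# chunks and keep only the few that mention the query terms, instead of
# pasting the whole document into the context window.
def relevant_chunks(document, query, max_chunks=5):
    chunks = [c.strip() for c in document.split("\n\n") if c.strip()]
    terms = query.lower().split()
    scored = [(sum(term in c.lower() for term in terms), c) for c in chunks]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:max_chunks]]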

Use short sessions. For long iterative tasks, it's often better to start a new conversation with a summary of the context rather than letting a giant conversation grow.

And if you need to give the model access to very large amounts of information — thousands of pages, a database — the context window is not the right solution. That's where the next chapter comes in.
