An interactive guide to LLMs
Step by Token
Understanding how large language models work, one interactive visualization at a time.
Table of contents
21 chapters · 189 min

- 01 · Predicting one word at a time (6 min)
What is a language model? Why predicting the next word is enough to make intelligence emerge.
- 02 · From text to tokens (8 min)
How text becomes numbers. BPE, subwords, and why LLMs struggle to count letters.
- 03 · The space of meaning (10 min)
Words in a geometric space. King − Man + Woman = Queen, and other vector miracles.
- 04 · Attention is all you need (12 min)
The mechanism that changes everything. How each token looks at all others to understand context.
- 05 · The Transformer, in full (14 min)
Putting the pieces together: multi-head attention, feed-forward, normalization, residual connections.
- 06 · How it learns (10 min)
Loss, gradient descent, backpropagation. And why billions of parameters are needed.
- 07 · Choosing the next word (7 min)
Temperature, top-k, top-p. The art of turning a probability distribution into text.
- 08 · From raw model to assistant (9 min)
Fine-tuning, RLHF, Constitutional AI. How we make an LLM useful and harmless.
- 09 · What the model remembers (8 min)
The context window: perfect but bounded memory. Why ChatGPT forgets and what it costs.
- 10 · Reading your documents (9 min)
How an LLM accesses thousands of pages without memorizing them. Embeddings, semantic search, injected context.
- 11 · From model that replies to model that acts (10 min)
Tool use, ReAct loop, multi-step tasks. How an LLM becomes an agent capable of acting in the world.
- 12 · The art of talking to an LLM (8 min)
Zero-shot, few-shot, chain-of-thought, self-consistency. Why prompt wording radically changes what a model produces.
- 13 · Why LLMs make things up (9 min)
Calibration, confident falsehoods, countermeasures. The structural mechanism behind the most common criticism — and what we can actually do about it.
- 14 · Specializing a model without retraining everything (9 min)
LoRA, QLoRA, SFT. How to adapt a generalist model to a specific domain by training only about 0.1% of its parameters.
- 15 · When the model reads images (8 min)
Patch embedding, ViT, CLIP. How a text Transformer becomes multimodal by treating an image as a grid of tokens.
- 16 · How do we know a model is better? (8 min)
MMLU, HumanEval, LMSYS Arena. Why measuring LLM intelligence is hard — and why no single benchmark is enough.
- 17 · Think before you answer (9 min)
Thinking tokens, extended reasoning, thinking budgets. How o1/o3-class models generate a hidden chain of thought before responding.
- 18 · Why the 2nd token is faster than the 1st (8 min)
The KV cache and autoregressive generation. Prefill vs decode, TTFT, and why the cache changes everything.
- 19 · Bigger, always better? (9 min)
Kaplan and Chinchilla scaling laws. Why GPT-3 was undertrained, and the optimal 20-tokens-per-parameter ratio.
- 20 · What's really going on inside? (9 min)
Circuits, polysemantic neurons, Sparse Autoencoders. How Anthropic and DeepMind are opening the black box.
- 21 · Generate an image by erasing noise (9 min)
Stable Diffusion, DALL-E, Midjourney. The reverse denoising process, the role of CLIP, and why U-Net is giving way to Transformers.