An interactive guide to LLMs

Step by Token

Understanding how large language models work, one interactive visualization at a time.

Table of contents

21 / 21 · 189 min
IV · Going Further
  1. 14 · Specializing a model without retraining everything

     LoRA, QLoRA, SFT. How to adapt a generalist model to a specific domain by training 0.1% of its parameters.

     9 min
  2. 15 · When the model reads images

     Patch embedding, ViT, CLIP. How a text Transformer becomes multimodal by treating an image as a grid of tokens.

     8 min
  3. 16 · How do we know a model is better?

     MMLU, HumanEval, LMSYS Arena. Why measuring LLM intelligence is hard, and why no single benchmark is enough.

     8 min
  4. 17 · Think before you answer

     Thinking tokens, extended reasoning, thinking budgets. How o1/o3-class models generate a hidden chain of thought before responding.

     9 min
  5. 18 · Why the 2nd token is faster than the 1st

     The KV cache and autoregressive generation. Prefill vs. decode, TTFT, and why the cache changes everything.

     8 min
  6. 19 · Bigger, always better?

     The Kaplan and Chinchilla scaling laws. Why GPT-3 was undertrained, and the optimal 20-tokens-per-parameter ratio.

     9 min
  7. 20 · What's really going on inside?

     Circuits, polysemantic neurons, Sparse Autoencoders. How Anthropic and DeepMind are opening the black box.

     9 min
  8. 21 · Generate an image by erasing noise

     Stable Diffusion, DALL-E, Midjourney. The reverse denoising process, the role of CLIP, and why U-Net is giving way to Transformers.

     9 min