Chapter 02 · Tokenization · 8 min
From text to tokens
How text becomes numbers. BPE, subwords, and why LLMs struggle to count letters.
Why tokens?
A language model cannot manipulate text directly. It manipulates numbers. The very first step, every time you talk to an LLM, is to convert your text into a sequence of integers — the token IDs.
Tokenization is the splitting step that makes this conversion possible.
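As a concrete illustration, here is a minimal sketch using the tiktoken library (one possible tokenizer among many; any encode/decode pair would show the same thing):

```python
# Minimal sketch: text in, integers out.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era vocabulary

ids = enc.encode("A language model cannot manipulate text directly.")
print(ids)               # a list of integers: the token IDs
print(enc.decode(ids))   # decoding the IDs gives the original text back
```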
Why not one token per word?
At first glance, you might imagine: one word = one token. Simple.
But it doesn't work:
- Language contains millions of possible words (inflected forms, neologisms, proper nouns, typos…). One token per word requires a gigantic vocabulary.
- The model can do nothing with a word it has never seen.
- Some languages (Chinese, Japanese) don't even have spaces between words.
The solution adopted by almost all modern LLMs: subwords.
Subword tokenization
With a subword tokenizer:
- Frequent words become a single token (the, and, is)
- Rare words are split into smaller pieces (tokenization → token + ization)
- Unknown characters can always be decomposed down to individual letters
The result: a vocabulary of reasonable size (typically between 30,000 and 200,000 tokens) that can represent any text.
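You can see this decomposition directly by decoding each token ID on its own. A quick sketch, again assuming tiktoken (the exact pieces depend on which vocabulary you load):

```python
# Decode each token ID separately to see how words are cut into subwords.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "is", "tokenization", "unbelievably"]:
    ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(i).decode("utf-8", errors="replace")
              for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
# Frequent words come back as a single piece; rarer ones as several.
```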
The most widely used algorithm is called BPE (Byte Pair Encoding). It works by starting from individual characters and iteratively merging the most frequent pairs in the training corpus.
BPE in thirty seconds
Imagine a tiny corpus of three words: low, lower, lowest. Start by tokenizing at the character level:
l o w
l o w e r
l o w e s t
At each iteration, find the most frequent adjacent pair. Here, l o appears three times — merge it into lo:
lo w
lo w e r
lo w e s t
Now lo w is the most frequent pair. Merge it:
low
low e r
low e s t
Keep going until you reach the desired vocabulary size. Chunks that recur often (low) become a single token. Rare ones (est) stay decomposed. That's exactly what BPE does — across billions of words instead of three, and over bytes rather than characters in modern models (byte-level BPE), which guarantees nothing is ever "out of vocabulary".
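The walkthrough above translates almost directly into code. Here is a toy sketch of the merge loop (with simplifying assumptions: characters rather than bytes, no word-boundary handling, and the learned merges are not stored for reuse, which a real tokenizer would need):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# The toy corpus above: each word as a tuple of characters, with its frequency.
corpus = {tuple("low"): 1, tuple("lower"): 1, tuple("lowest"): 1}

for step in range(4):
    pair = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {''.join(pair)}")
    print("  corpus:", [" ".join(word) for word in corpus])
```

The first two iterations reproduce the merges shown above (l o → lo, then lo w → low); each later merge grows the vocabulary by one reusable chunk.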
A few cousins worth knowing: WordPiece (BERT), Unigram LM (T5, mT5), and SentencePiece (the library behind the T5 and Llama tokenizers, implementing both BPE and Unigram). They share the same idea, a subword vocabulary, built with different training procedures.
Special tokens
Beyond text subwords, the tokenizer reserves a few special tokens that never appear naturally:
- <|im_start|>, <|im_end|> (OpenAI), [INST] … [/INST] (Llama), <|user|> / <|assistant|> — delimit conversation turns.
- <|endoftext|> — end of document.
- <|fim_prefix|>, <|fim_middle|> — for fill-in-the-middle, used in code completion.
These are what turn a dialogue "User said X, Assistant replied Y" into a single linear sequence of tokens the model can process. When you send a message to ChatGPT, these markers are added automatically before tokenization.
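As an illustration, here is roughly what that flattening can look like with ChatML-style markers. The exact markers and layout are an assumption here; each model family defines its own chat template, and in practice the tokenizer applies it for you:

```python
# Illustrative sketch: flatten a conversation into one string with turn delimiters.
def to_linear_text(messages):
    """Turn a list of {'role', 'content'} dicts into a single ChatML-style string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model: its turn to speak
    return "".join(parts)

print(to_linear_text([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many R's in strawberry?"},
]))
# The tokenizer then maps the special markers to reserved token IDs
# and everything else to ordinary subword tokens.
```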
Try it
Subwords appear in the right pane: each token is a reusable fragment, not necessarily a complete word. Frequent words fit in a single token; rare ones break into several pieces.
A few things to notice:
- Short, frequent words are rarely split.
- Long or rare words often end up in multiple pieces.
- The space preceding a word is part of the token (that's why ·hello and hello are different tokens; the snippet after this list makes it visible).
- In English, the token/word ratio (typically 1.2–1.4) is lower than in most other languages, because tokenizer vocabularies are trained mostly on English text.
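The point about leading spaces is easy to check. A small sketch, assuming tiktoken (the specific IDs you get depend on the vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode("hello"))    # one list of token IDs
print(enc.encode(" hello"))   # a different list: the leading space is part of the token
print(enc.encode("Hello"))    # capitalization changes the token too
```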
Practical consequences
This token business has lots of unexpected implications:
- LLMs count letters poorly. "How many R's in strawberry?" — they often answer 2 instead of 3, because the word arrives in a few tokens, not as separate letters.
- API prices are calculated in tokens, not words. Languages that tokenize less efficiently cost a bit more to process.
- Context windows (128k tokens, 200k tokens…) are also measured in tokens. A 100,000-word book represents roughly 130,000 tokens.
For the model, "tokenization" and "token·iza·tion" are the same thing. It only sees the pieces.
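In practice, this means you should measure documents in tokens rather than words when estimating API cost or checking whether something fits in the context window. A sketch, assuming tiktoken and a hypothetical book.txt:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = open("book.txt", encoding="utf-8").read()   # hypothetical ~100,000-word file
n_words = len(text.split())
n_tokens = len(enc.encode(text))
print(f"{n_words} words -> {n_tokens} tokens (ratio {n_tokens / n_words:.2f})")
```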
What's next
Now that your text has become a sequence of integers, those integers will enter the model. The first step inside the model: they are transformed into vectors, in a space of several hundred dimensions where meaning becomes a geometric position.
That's the subject of the next chapter.