Chapter 19 · Scaling · 9 min
Bigger, always better?
Kaplan and Chinchilla scaling laws. Why GPT-3 was undertrained, and the optimal 20-tokens-per-parameter ratio.
A misleading intuition
For years, the AI industry ran on a simple belief: a model twice as big is better. GPT-2 (1.5 billion parameters) was eclipsed by GPT-3 (175 billion). PaLM, Megatron, Gopher — the parameter race seemed endless.
Then in 2022, a team at DeepMind published a paper that changed everything. Their thesis: the large models of that era were massively undertrained. Not too small — undernourished in data.
The model that proved the thesis was named Chinchilla.
Kaplan's law: the first formulation
In 2020, OpenAI published a paper by Jared Kaplan and colleagues — "Scaling Laws for Neural Language Models" — showing something remarkable. Across dozens of models trained at different sizes, validation loss follows a simple power law:
L ≈ L∞ + (C₀ / C)^α
Decoding the formula:
- C — total compute spent on training (in FLOPs).
- L — final validation loss.
- L∞ — the irreducible loss: the floor below which you can't go, even with infinite compute. It's the natural entropy of human language — there's always some unpredictability in the next word.
- C₀ — a normalization constant that depends on the architecture.
- α ≈ 0.05 — the power-law exponent.
In plain terms: doubling compute reduces loss by a predictable amount. The law holds remarkably well across seven orders of magnitude of compute.
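To make the formula concrete, here is a minimal sketch that evaluates it numerically. The values of L∞ and C₀ are illustrative placeholders, not the constants fitted in the paper; only α ≈ 0.05 comes from the discussion above.

```python
# Minimal sketch of the power law L(C) = L_inf + (C0 / C)**alpha.
# l_inf and c0 are illustrative placeholders, not the paper's fitted constants.
def predicted_loss(compute_flops, l_inf=1.7, c0=1e21, alpha=0.05):
    """Validation loss predicted by the power law for a given training compute."""
    return l_inf + (c0 / compute_flops) ** alpha

for c in [1e21, 2e21, 4e21, 8e21]:
    print(f"C = {c:.0e} FLOPs -> predicted loss ~ {predicted_loss(c):.4f}")

# Each doubling of compute removes the same fraction of the reducible loss:
# 1 - 2**(-0.05), about 3.4 percent.
```

The absolute numbers here mean nothing; the point is the constant relative improvement per doubling of compute.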
Kaplan drew a conclusion that would guide the industry for two years: given a compute budget, allocate most of it to model size and comparatively little to data.
That's exactly what OpenAI did with GPT-3. 175 billion parameters, but "only" 300 billion training tokens.
Chinchilla flips the table
In 2022, Hoffmann et al. (DeepMind) redid the experiment with a different methodology. Instead of fixing model size and varying compute, they systematically explored the (N, D) plane at constant compute.
Their conclusion directly contradicts Kaplan's: N and D should grow at the same rate. To minimize loss under a fixed compute budget, train a modestly sized model on a lot of data.
More precisely, the optimal ratio is:
D ≈ 20 × N
For a 70-billion-parameter model, the optimum is about 1.4 trillion tokens (1,400 billion). GPT-3 (175 billion parameters, 300 billion tokens) had a ratio of about 1.7 tokens per parameter, more than ten times below the optimum.
DeepMind proved it by training Chinchilla: 70 billion parameters, 1.4 trillion tokens. Smaller than GPT-3, more tokens, and better on every benchmark.
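The arithmetic behind those numbers fits in a few lines; this sketch simply re-derives the ratios quoted above from the 20-tokens-per-parameter rule.

```python
# Back-of-envelope check of the D ~ 20 * N rule, using the figures quoted above.
def chinchilla_optimal_tokens(n_params):
    """Approximate compute-optimal token count for a model of n_params parameters."""
    return 20 * n_params

models = {
    "GPT-3":      (175e9, 300e9),    # (parameters, training tokens)
    "Chinchilla": (70e9,  1.4e12),
}

for name, (n, d) in models.items():
    print(f"{name:10s}  D/N = {d / n:5.1f}   "
          f"optimum ~ {chinchilla_optimal_tokens(n) / 1e12:.1f}T tokens")

# GPT-3 sits near 1.7 tokens per parameter (its optimum would be ~3.5T tokens);
# Chinchilla sits right at ~20.
```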
The compute map
On the log-log plot, loss decreases as a power law of compute. The N (parameters) and D (tokens) sliders show the iso-compute curve: for a fixed budget, there is an optimal N/D ratio — about 20 tokens per parameter according to Chinchilla.
Move the point to explore the (N, D) plane. The Chinchilla diagonal is the line where each dollar of compute is spent optimally. Above it, you've trained too long on too small a model; below, the opposite.
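To see where the diagonal comes from, you can combine the 20:1 ratio with the widely used approximation that training a dense transformer costs about 6 · N · D FLOPs (an assumption of this sketch, not a figure from this chapter). Fixing the budget then pins down a single optimal (N, D) point:

```python
import math

# Iso-compute sketch: assume training cost C ~ 6 * N * D FLOPs and impose the
# Chinchilla ratio D = 20 * N. Then C = 120 * N**2, so N = sqrt(C / 120).
def chinchilla_point(compute_flops, tokens_per_param=20.0):
    """Compute-optimal (N, D) on the iso-compute curve for a given budget."""
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

for c in [1e22, 1e23, 1e24]:
    n, d = chinchilla_point(c)
    print(f"C = {c:.0e} FLOPs -> N ~ {n / 1e9:5.1f}B params, D ~ {d / 1e12:.2f}T tokens")
```

Under the same 6 · N · D assumption, Chinchilla's own budget (about 6 × 70B × 1.4T ≈ 6 × 10²³ FLOPs) lands almost exactly on 70 billion parameters — that is the diagonal.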
You'll notice something interesting: LLaMA-3 sits well above the diagonal. With 70 billion parameters trained on 15 trillion tokens, its ratio is about 214 — roughly ten times the Chinchilla optimum.
Why? Because Meta optimized for something other than training-compute efficiency. They optimized for inference cost. A smaller model trained longer costs a bit more to train but much less to serve in production. Across billions of requests, the savings are massive.
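A rough version of that trade-off can be written down with two standard approximations: about 6 · N · D FLOPs for training and about 2 · N FLOPs per generated token at inference. The 140B "compute-optimal" configuration and the serving volume below are hypothetical comparison points, not numbers from this chapter.

```python
# Rough sketch of the train-vs-serve trade-off behind Meta's choice.
# Assumptions (not from the chapter): training ~ 6*N*D FLOPs, inference ~ 2*N
# FLOPs per generated token; configurations and serving volume are hypothetical.
def training_flops(n, d):
    return 6 * n * d

def inference_flops(n, tokens_served):
    return 2 * n * tokens_served

big = (140e9, 2.8e12)    # Chinchilla-ratio model (~20 tokens per parameter)
small = (70e9, 15e12)    # overtrained, LLaMA-3-style (~214 tokens per parameter)
served = 1e14            # hypothetical lifetime serving volume, in tokens

for name, (n, d) in [("compute-optimal 140B", big), ("overtrained 70B", small)]:
    total = training_flops(n, d) + inference_flops(n, served)
    print(f"{name}: train {training_flops(n, d):.2e} + serve "
          f"{inference_flops(n, served):.2e} = {total:.2e} FLOPs")

# Break-even: the extra training compute of the smaller model is repaid once
# enough tokens have been served at half the per-token cost.
extra_train = training_flops(*small) - training_flops(*big)
per_token_saving = 2 * (big[0] - small[0])
print(f"break-even after ~{extra_train / per_token_saving:.1e} served tokens")
```

Past that break-even point, every additional served token makes the smaller, longer-trained model the cheaper choice overall.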
Beyond parameters: data quality
Scaling laws aren't the end of the story. Several limits emerge.
Available data is finite. Common Crawl, Wikipedia, GitHub, ArXiv, scanned books — the inventory of high-quality text data on the internet isn't infinite. Several teams estimate we're approaching the wall: training a 1-trillion-parameter model at Chinchilla optimum would require 20 trillion tokens, well beyond clean public corpora.
Quality beats quantity, but only up to a point. Filtering a corpus to keep only high-quality data (textbooks, technical books, clean code) improves the model more than adding mediocre data. But over-aggressive filtering eventually impoverishes the distribution and hurts generalization.
Emergent abilities blur the curve. For certain tasks (multi-step reasoning, complex math, rare instructions), performance stays flat until a certain size threshold — then jumps sharply. These "emergent abilities" are controversial: some researchers (Schaeffer et al., 2023) show they disappear when you pick a more continuous metric. But the practical phenomenon remains: small models simply cannot do certain things, no matter how much fine-tuning.
The practical lesson
If you're training a model today, here's what scaling laws tell you:
- Fixed compute? Aim for a D/N ratio near 20. That's the training optimum.
- Will you serve the model at scale? Shift the ratio upward. A smaller model trained longer is cheaper at inference — that's what Meta, Mistral, and more teams are doing.
- Targeting an emergent capability? Small optimizations won't cut it. You need to cross a size threshold.
- Short on data? Quality, filtering, and diversity matter more than raw corpus size.
Scaling laws don't say you should keep growing indefinitely. They say there's a right ratio between parameters and data — and we spent years getting it wrong.