Chapter 20 · Interpretability · 9 min
What's really going on inside?
Circuits, polysemantic neurons, Sparse Autoencoders. How Anthropic and DeepMind are opening the black box.
A black box that works
A 70-billion-parameter LLM is something we can train, evaluate, and deploy — but not really understand. We know which weights it learned. We know those weights implement something. We don't know what.
For a long time, that didn't seem to matter much. If it works, it works. But as LLMs make decisions with real-world consequences — medical code, autonomous agents, content moderation — the question becomes urgent: can we open the box?
That's the goal of mechanistic interpretability. Not a discipline that describes what a model does from the outside (that's benchmarking), but one that tries to reverse-engineer the algorithms implemented inside the neurons.
A small vocabulary for this chapter
Before diving in, here are the terms we'll meet. We'll develop them along the way — this table is just a handrail so you don't get lost.
| Term | In one sentence |
|---|---|
| Feature | A recurring pattern in the model's activations that often corresponds to a human concept. |
| Polysemantic | A neuron that fires on several unrelated concepts. The norm in an LLM. |
| Monosemantic | A neuron or feature that responds to a single identifiable concept. The goal. |
| Superposition | The network encodes more concepts than it has neurons, by stacking them. |
| Circuit | A subgraph of the network that implements a specific function. |
| SAE (Sparse Autoencoder) | The technique that decomposes activations into monosemantic features. |
| Steering | Modifying the model's behavior by amplifying or suppressing a feature. |
The problem: polysemantic neurons
If you could open a Transformer and watch a specific neuron, you might hope to find "the dog detector" or "the addition neuron". The reality is messier.
A neuron in an LLM is typically polysemantic: it activates on several unrelated concepts. The same neuron can fire strongly on dog mentions, past-tense verbs, open-ended French questions, and HTML tags. Why? Because the network has many more concepts to represent than it has neurons — so it superposes them.
This superposition (Elhage et al., 2022) is a key finding. It explains why staring at one neuron almost never gives an interpretable signal.
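To get an intuition for why superposition works at all, here's a toy numerical sketch (an illustration, not the experimental setup of Elhage et al.): pack far more random "concept" directions into a space than it has dimensions and measure how much they interfere.

```python
# Toy sketch of superposition: many more "concept" directions than neurons.
# Random directions in high dimensions are nearly orthogonal, so the network
# can stack several concepts per neuron at the cost of a little interference.
import numpy as np

rng = np.random.default_rng(0)
d_neurons, n_concepts = 256, 2048          # 8x more concepts than dimensions

# One random unit vector per concept, expressed in neuron space.
directions = rng.normal(size=(n_concepts, d_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference between two concepts = cosine similarity of their directions.
overlap = directions @ directions.T
np.fill_diagonal(overlap, 0.0)

print(f"mean |interference|: {np.abs(overlap).mean():.3f}")  # roughly 1/sqrt(d_neurons)
print(f"max  |interference|: {np.abs(overlap).max():.3f}")   # well below 1.0
```

The price of this trick is exactly what makes individual neurons hard to read: every direction overlaps a little with thousands of others.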
The solution: Sparse Autoencoders
If concepts are superposed in neurons, then the right basis for looking at them isn't neuron-space — it's another, larger space where each dimension corresponds to one concept.
That's the idea behind Sparse Autoencoders (SAEs). You learn a projection from the model's internal activation into a much larger space (often 10× or 100× larger), with a sparsity constraint: only a few dimensions should fire at a time. The network is forced to represent each activation as a combination of a small number of interpretable features.
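Here is a minimal sketch of such an SAE in PyTorch. The dimensions, the ReLU encoder, and the L1 penalty are illustrative assumptions, not the exact recipe of any published SAE.

```python
# Minimal Sparse Autoencoder sketch: reconstruct an activation vector as a
# sparse combination of many more features than the model has dimensions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # project up into feature space
        self.decoder = nn.Linear(d_features, d_model)   # reconstruct the activation

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))       # non-negative, mostly zero
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_model=4096, d_features=65536)  # 16x wider (assumption)
acts = torch.randn(8, 4096)          # stand-in for real residual-stream activations

recon, features = sae(acts)
l1_coeff = 1e-3                      # sparsity strength, a hyperparameter to tune
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()
loss.backward()
```

The L1 term is what forces most features to stay at zero, so each activation ends up explained by a handful of active features, and it's those features you can hope to name.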
In 2024 Anthropic published a landmark paper ("Scaling Monosemanticity") applying this to Claude 3 Sonnet. They extracted millions of features, some spectacular: a feature for the Golden Gate Bridge, one for code bugs, one for sycophancy, one for betrayal. These features are monosemantic — each maps to a single recognizable concept.
Explore for yourself
Watch an individual neuron and you'll see it fire on apparently unrelated concepts: that's polysemanticity. Watch a Sparse Autoencoder feature instead and it lights up the same kind of token across very different passages. Some features, like negation or proper nouns, span languages and contexts; these are robust concepts the model has abstracted, and they're the foundation mechanistic interpretability builds on.
Circuits: emergent algorithms
Beyond features, interpretability studies circuits: subgraphs of the network that implement a specific function. A bit like identifying, inside a microprocessor, the subcircuit that performs addition.
The most famous example is the induction head, identified by Anthropic researchers (Olsson et al., 2022). It's a mechanism built from two cooperating attention heads in different layers, and it implements a simple rule: if the model has seen the pattern AB earlier in the context and now sees A again, it predicts B.
It's a primitive form of in-context learning. Before these heads form during training, the model barely exploits repetitions in its context; once they appear, it becomes sharply better at doing so, and this transition coincides with a visible jump in the model's in-context-learning ability.
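Written as ordinary code, the induction rule is almost trivial; the remarkable part is that a trained Transformer discovers attention heads that implement it on its own. This sketch only mirrors the behavior, not the attention mechanism that produces it.

```python
# The induction rule as plain code: if the pattern "A B" occurred earlier in
# the context and the current token is "A" again, predict "B".
def induction_prediction(tokens: list[str]) -> str | None:
    current = tokens[-1]
    # Look for the most recent earlier occurrence of the current token
    # and copy whatever followed it.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_prediction(["Mr", "Dur", "sley", "...", "Mr", "Dur"]))  # -> "sley"
```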
Several other circuits have been identified:
- Induction heads — copying patterns from context
- Bracket completion — coherent closing of nested parentheses
- Indirect object identification — completing "When Mary and John went to the store, John gave a drink to ..." with "Mary" rather than "John"
- Feature suppression — a head that turns off a feature in certain contexts
Each circuit is a small algorithm the network discovered on its own during training.
Why this matters for safety
Interpretability isn't just scientific curiosity. For many researchers, it's the most promising route to seriously aligning powerful models.
Today, alignment relies on RLHF and fine-tuning: we shape the observable outputs but don't know whether the model has truly internalized a value or is merely playing along on the surface. If we could identify the features and circuits responsible for, say, deceptive behavior, or moral reasoning, we'd have a much firmer grip.
Anthropic has shown you can steer a model directly: by artificially boosting the "Golden Gate Bridge" feature, they made Claude obsessed with the bridge — it kept bringing it up regardless of the question. A playful demo, but the same mechanism could in principle let us surgically remove a dangerous behavior without degrading the rest.
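Mechanically, that kind of steering can be as simple as adding a feature's direction to the residual stream during the forward pass. Here's a hedged sketch using a PyTorch forward hook; the model, the layer index, and feature_direction are placeholders, not Anthropic's actual setup.

```python
# Sketch of activation steering: add `strength * direction` to one layer's
# output on every forward pass. Placeholder names throughout.
import torch

def make_steering_hook(direction: torch.Tensor, strength: float):
    def hook(module, inputs, output):
        # Decoder layers often return a tuple whose first element is the
        # hidden states of shape (batch, seq, d_model).
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * direction
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage: `model` is a decoder-style Transformer and
# `feature_direction` is a unit vector read off an SAE's decoder weights.
# handle = model.model.layers[20].register_forward_hook(
#     make_steering_hook(feature_direction, strength=8.0))
# ... generate text: the chosen feature is now amplified ...
# handle.remove()   # restore normal behavior
```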
Current limits
Mechanistic interpretability is a young discipline. The challenges are real:
- Scale. The largest SAE trained on Claude 3 Sonnet extracted 34 million features. Annotating, naming, and understanding them one by one is an enormous task.
- Completeness. We find features. We also miss them. How many important concepts go undetected?
- Compositionality. Understanding a feature in isolation is doable. Understanding how 50 features interact to produce a behavior is much harder.
- Generalization. A feature found in GPT-2 doesn't automatically transfer to Claude or Llama. Each model is its own black box.
But the pace of publication is accelerating. Anthropic, DeepMind, EleutherAI, OpenAI, Apollo Research, Goodfire, Transluce — entire teams are forming around these questions.
If we want to trust ever more powerful models with ever more important decisions, we'll need more than watching them pass our tests. We'll need to look inside.