Chapter 07 · Generation · 7 min

Choosing the next word

Temperature, top-k, top-p. The art of turning a probability distribution into text.

The distribution is computed. Now what?

At this point in the book, the model knows how to compute a probability distribution over the next token. To generate text, it must now choose one. This is the sampling step.

We already glimpsed this mechanism in chapter 01. The time has come to revisit it carefully: this is what distinguishes a cautious model from a creative one, a reliable assistant from a repetitive parrot.

Why not always take the most probable?

The simplest strategy — greedy decoding — always picks the highest-probability token. Fast, deterministic, reproducible.

And yet, almost never used alone. Why?

Because greedy decoding produces strangely flat text. It gets stuck in loops ("very very very interesting interesting interesting…"). It always picks the most conventional word. It generates technically correct sentences, but without texture. The model was trained on human text, which always contains a degree of unpredictability — and that unpredictability must show up in generation.

Hence stochastic sampling: we draw a token at random according to the distribution. The more probable a token is, the more likely it is to be chosen — but others have a chance too.
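The draw itself is simple. A minimal sketch in plain Python, with a made-up four-token vocabulary and made-up logits (real models work over tens of thousands of tokens, but the mechanism is identical):

```python
import math
import random

def softmax(logits):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    # Draw one index at random, weighted by the distribution.
    r = random.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding

# Made-up vocabulary and logits, purely for illustration.
vocab = ["blue", "cloudy", "vast", "falling"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
token = vocab[sample(probs)]  # usually "blue", but not always
```

Run it several times: the most probable token wins most often, but the others do get picked.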

Three levers

Raw stochastic sampling would be too noisy. We typically reshape the distribution with three levers before drawing:

Temperature

Temperature T divides the logits before softmax. It's the most powerful setting.

  • T = 1.0 — distribution as it comes out of the model (reference)
  • T < 1.0 — distribution more peaked: highs become higher, lows become lower. The model becomes more predictable.
  • T → 0 — equivalent to greedy decoding (only one possible token).
  • T > 1.0 — distribution flattened: low-probability options come back up. The model becomes creative, or even bizarre.
  • T → ∞ — uniform, the model draws completely at random.

Temperature is a dial between predictability and imagination.
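The whole mechanism fits in one division. A sketch over made-up logits, showing how the top token's share moves with T:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up values

cold = softmax([x / 0.5 for x in logits])  # T = 0.5: sharper
base = softmax(logits)                     # T = 1.0: reference
hot = softmax([x / 2.0 for x in logits])   # T = 2.0: flatter

# The top token's probability grows as T shrinks:
# cold[0] > base[0] > hot[0]
```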

Top-k

Rather than allowing all tokens in the vocabulary (50,000+), we keep the k most probable and zero out the rest, then renormalize. k = 40 is a classic value for creativity, k = 1 returns to greedy.

Advantage: it outright eliminates the absurd tokens (those at 0.0001%) that might otherwise be drawn by chance. Disadvantage: the optimal value of k varies from step to step: sometimes 5 candidates suffice, sometimes 100 are all plausible.
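A minimal top-k filter over made-up probabilities (a naive sketch: exact ties at the cutoff would keep a few extra tokens):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    # Keep the k most probable tokens, zero out the rest, renormalize.
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

probs = softmax([3.0, 2.0, 1.0, 0.0, -4.0])  # made-up logits
filtered = top_k_filter(probs, k=2)
# Only the two largest entries survive, and they sum to 1 again.
```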

Top-p (nucleus sampling)

Smarter than top-k. We keep the smallest set of tokens whose cumulative probability reaches p, then renormalize.

  • At a step where the model hesitates between 30 close options → top-p keeps all options whose sum reaches p.
  • At a step where the model is 95% sure → top-p keeps just one token.

In practice, top-p ≈ 0.9 has become the default setting for modern LLMs. It adapts automatically to the model's confidence.
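The same idea in code: a sketch of nucleus filtering over two made-up distributions, one where the model is confident and one where it hesitates, to show how the kept set adapts:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches p, zero out the rest, renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [0.0] * len(probs)
    cumulative = 0.0
    for i in order:
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]

confident = softmax([10.0, 5.0, 1.0])  # model very sure (made-up logits)
hesitant = softmax([1.0, 0.9, 0.8])    # close call (made-up logits)
# With p = 0.9, the confident distribution keeps a single token,
# the hesitant one keeps all three.
```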

Play again

We already saw this visualization in chapter 01 — we replay it here because this is exactly the topic:

Play with the temperature: at 0, the model is deterministic and repetitive; at 1.5, it gets creative until it becomes incoherent. Top-k and top-p trim the long tail of unlikely candidates without touching the likely ones.

This time, observe:

  • On Capital, the model is so confident (10.0 vs 5.0 for the next) that it takes an extremely high temperature (>1.5) to get anything other than London. Typical of prompts where the answer is known.
  • On The sky at the second step, several continuations are semantically plausible. With temperature = 1.5, you might get color or sea instead of light. The meaning stays correct, but the text becomes less predictable.
  • At top-k = 1, any other setting becomes useless: you're in greedy mode.
  • At top-p = 0.5 on the Capital prompt, only the top answer survives (because it already reaches 50% on its own).

Typical settings

Common settings by use case:

| Use case                        | Temperature | Top-p |
|---------------------------------|-------------|-------|
| Code, structured data           | 0.0–0.2     | 1.0   |
| Factual response                | 0.3–0.5     | 0.9   |
| Generic conversation            | 0.7         | 0.9   |
| Brainstorming, creative writing | 0.9–1.2     | 0.95  |
| Generating variations           | 1.2–1.5     | 0.95  |

These aren't hard rules. But they're the ballpark you'll find in most products using an LLM.

Three more knobs to know

Beyond temperature / top-k / top-p, a few parameters show up everywhere in APIs.

Repetition penalty (OpenAI's APIs expose related additive variants: frequency penalty and presence penalty). Penalizes tokens that have already appeared in the output. Useful for breaking repetitive loops without cranking the temperature. Typical value: 1.05–1.2.
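A sketch of the common multiplicative convention (divide positive logits of already-seen tokens by the penalty, multiply negative ones, so the token drops either way); the numbers are made up:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    # Push down every token that already appeared in the output.
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

logits = [4.0, 2.0, -1.0]  # made-up values
penalized = apply_repetition_penalty(logits, generated_ids=[0, 2])
# Token 0: 4.0 -> ~3.33, token 2: -1.0 -> -1.2, token 1 untouched.
```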

Stop sequences. A list of strings that, if they appear in the output, halt generation immediately. Essential for structured uses: in a formatted dialogue, you stop on <|im_end|> or \n\nUser: to keep the model from continuing the conversation by itself.
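Stop handling is just string matching on the decoded output. A minimal sketch, using the sequences mentioned above:

```python
def apply_stops(text, stop_sequences):
    # Truncate at the earliest stop sequence; report whether one fired.
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut], cut < len(text)

out, stopped = apply_stops("Sure!\n\nUser: and then", ["<|im_end|>", "\n\nUser:"])
# out == "Sure!", stopped == True
```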

Beam search. Instead of stochastic sampling, you keep the k best partial sequences in parallel at each step, and pick the one that maximizes overall probability at the end. The output is smoother but often less natural. Used in machine translation and speech recognition; rarely in creative generation.
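A toy sketch of beam search, with a hand-written bigram table standing in for the model (all probabilities are made up):

```python
import math

# Toy next-token log-probabilities, keyed by the previous token.
MODEL = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"cat": math.log(0.5), "dog": math.log(0.5)},
    "a":   {"cat": math.log(0.9), "dog": math.log(0.1)},
    "cat": {"</s>": 0.0},
    "dog": {"</s>": 0.0},
}

def beam_search(k=2, steps=3):
    # Keep the k best partial sequences at each step,
    # return the one with the highest total log-probability.
    beams = [(["<s>"], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, logp in MODEL.get(seq[-1], {}).items():
                candidates.append((seq + [tok], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return max(beams, key=lambda c: c[1])

best_seq, best_score = beam_search()
# best_seq == ["<s>", "a", "cat", "</s>"]
```

Note the effect: greedy would commit to "the" (probability 0.6) and finish at 0.6 × 0.5 = 0.3, while the beam recovers "a cat" at 0.4 × 0.9 = 0.36.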

Illusory determinism

One last subtlety. Even with temperature = 0, two calls to the same model can give different results in practice:

  • Low-precision arithmetic (INT8/FP8 computations carry rounding error that can tip close calls between tokens).
  • Dynamic batching server-side (depending on other requests, floating-point operation orders change).
  • Model versions that change silently at the provider.

If you write tests that depend on the exact output of an LLM, you're going to have a bad time.

What's next

At this point, you know how a LLM:

  • splits text into tokens,
  • transforms them into vectors,
  • makes them look at each other via attention,
  • stacks Transformer blocks to refine the representation,
  • predicts a distribution over the next token,
  • picks one according to a sampling strategy.

And yet, what we call "an assistant" — something like ChatGPT or Claude — isn't just that. A base model at this stage would be a text completer, not an assistant. How do we get from one to the other? That's the subject of the last chapter: alignment.
