Chapter 17 · Reasoning

Think before you answer

Thinking tokens, extended reasoning, thinking budgets. How o1/o3-class models generate a hidden chain of thought before responding.

The quick answer is often wrong

What is the last digit of 7¹⁰⁰?

Ask a standard LLM and it'll probably answer "7" in a fraction of a second. It seems plausible: the sequence of powers starts at 7, and without thinking too hard you might assume the last digit stays 7. That answer is wrong. The last digits of powers of 7 cycle through 7, 9, 3, 1 (from 7, 49, 343, 2401, then the pattern repeats), and since 100 is a multiple of 4, the last digit of 7¹⁰⁰ is 1.
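
You can check both the cycle and the final answer in two lines of Python, using the three-argument pow for modular exponentiation:

```python
# Last digits of 7**n repeat with period 4: 7, 9, 3, 1.
print([pow(7, n, 10) for n in range(1, 9)])  # [7, 9, 3, 1, 7, 9, 3, 1]

# 100 is a multiple of 4, so 7**100 lands on the last element of the cycle.
print(pow(7, 100, 10))  # 1
```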

But ask the same question to a reasoning model like o1, o3, or DeepSeek-R1, and it hesitates. It "thinks" for 10, 20, sometimes 60 seconds. And it arrives at the right answer.

The difference isn't a new architecture; it's what the model is allowed to do before responding.

Thinking tokens

Every LLM generates tokens, one at a time, left to right. What sets reasoning models apart is that they first generate a long sequence of hidden tokens — an internal monologue the user never sees — before producing the final answer.

These hidden tokens are called thinking tokens.

The model can write anything there: intermediate calculations, hypotheses it then disproves, abandoned exploration branches, double-checks. It's a scratchpad it erases before showing you the clean result.

It's not magic. It's just extra space to work through a hard problem.
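
Here's what that looks like in the raw stream. Open models like DeepSeek-R1 wrap the scratchpad in explicit delimiter tokens, and the serving layer strips it before display. A minimal sketch in Python (the trace itself is invented for illustration):

```python
# Raw stream from a reasoning model, schematically. DeepSeek-R1, for
# example, wraps the scratchpad in <think> tags; the trace below is
# invented for illustration.
raw = """<think>
Last digits of 7**n: 7, 9, 3, 1, then it repeats. Period 4.
100 mod 4 = 0, so we land on the last element of the cycle: 1.
Check: 7**4 = 2401, ends in 1. Good.
</think>
The last digit of 7**100 is 1."""

# The serving layer hides everything up to the closing tag.
visible = raw.split("</think>")[-1].strip()
print(visible)   # The last digit of 7**100 is 1.
```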

Try it yourself

Set the thinking budget to "None" and click "Start reasoning." Watch the instant answer. Then switch to "Full" and run it again.

The greyed-out blocks are the internal chain of thought — the model hypothesizes, verifies, sometimes backtracks. These thinking tokens cost latency and money, but unlock problems that direct mode cannot solve.

The difference between the two isn't in the model's capacity — it's in the inference-time compute it's allowed to use.
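
If you'd rather reproduce the experiment against a real API, here's a sketch assuming Anthropic's extended-thinking parameter, which takes exactly this kind of token budget. The model name is illustrative, and parameter details may differ; check the current docs before running it:

```python
# Same question, with and without a thinking budget. Assumes Anthropic's
# extended-thinking API; the model name is illustrative.
import anthropic

client = anthropic.Anthropic()
question = "What is the last digit of 7**100?"

for budget in (None, 16_000):
    kwargs = {}
    if budget:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget}
    response = client.messages.create(
        model="claude-sonnet-4-20250514",                # illustrative
        max_tokens=(budget or 0) + 1_000,                # must exceed the budget
        messages=[{"role": "user", "content": question}],
        **kwargs,
    )
    # With thinking enabled, "thinking" blocks precede the final "text" block.
    print(budget, [block.type for block in response.content])
```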

How it works technically

This isn't a different architecture. It's the same Transformer, the same attention mechanism, the same autoregressive generation.

What changes is the training and decoding. During fine-tuning, the model learns to produce useful reasoning traces — chains of thought that converge toward the right answer. It's shown thousands of problems with their solutions and learns to construct the intermediate path.
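
Schematically, the training data for this stage is a pile of (problem, reasoning trace, answer) triples. A hypothetical record, with made-up field names:

```python
# Hypothetical shape of one training record for reasoning fine-tuning.
# Field names are illustrative, not any lab's actual schema.
example = {
    "problem": "What is the last digit of 7**100?",
    "trace":   "Last digits of 7**n cycle 7, 9, 3, 1 (period 4); 100 % 4 == 0, so 1.",
    "answer":  "1",
}
```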

At inference, it's given a thinking token budget — a limit on how many hidden tokens it can generate. The larger the budget, the more it can explore. Beyond a certain budget, quality improvements on hard tasks start to plateau.

An important detail: thinking tokens are generated before the answer, in the same token stream. The model doesn't "think" in parallel — it thinks in series, and that costs tokens just like everything else.
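
A sketch of that serial loop, with a hypothetical `model.next_token` standing in for one step of autoregressive sampling (no real inference API looks exactly like this):

```python
# Minimal sketch of budgeted serial decoding. `model.next_token` is a
# hypothetical stand-in; THINK_END and EOS are placeholder sentinels.
THINK_END, EOS = "</think>", "<eos>"

def generate(model, prompt_ids, thinking_budget, max_answer_tokens=512):
    ids = list(prompt_ids)
    # Phase 1: the hidden scratchpad, capped by the budget.
    for _ in range(thinking_budget):
        tok = model.next_token(ids)
        ids.append(tok)
        if tok == THINK_END:            # the model decided it's done thinking
            break
    else:
        ids.append(THINK_END)           # budget exhausted: force the transition
    # Phase 2: the visible answer, generated after (not alongside) the scratchpad.
    answer = []
    while len(answer) < max_answer_tokens:
        tok = model.next_token(ids)
        if tok == EOS:
            break
        ids.append(tok)
        answer.append(tok)
    return answer    # the user sees only this; everything in `ids` was billed
```

Both phases draw from the same context window, which is why the thinking budget trades directly against latency and cost.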

Extended reasoning vs. chain-of-thought

You may have seen the chain-of-thought (CoT) technique, where you explicitly ask the model to "think step by step." That's different, but related.

|                 | Chain-of-Thought (prompted)           | Extended reasoning (native)            |
|-----------------|---------------------------------------|----------------------------------------|
| Who triggers it | The user, in the prompt               | The model itself                       |
| Visibility      | Visible in the response               | Hidden (thinking tokens)               |
| Control         | User can guide the steps              | Model chooses its plan                 |
| Examples        | GPT-4 with "let's think step by step" | o1, o3, Claude with extended thinking  |

Prompted CoT also improves performance — but native reasoning goes further, because the model isn't constrained to write readable steps. It can explore messy paths, run calculations it later discards, contradict itself and self-correct, all within the hidden space.
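
The practical difference is small but telling: prompted CoT is just text you add to the prompt, so the steps come back in the visible response. Illustrative prompts:

```python
# Prompted CoT is ordinary prompt text; the reasoning comes back visibly.
question = "What is the last digit of 7**100?"
cot_prompt = f"{question}\nLet's think step by step, then give the final answer."

# With a native reasoning model you'd send `question` unchanged and the
# exploration would happen in hidden thinking tokens instead.
```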

When it's worth it

Extended reasoning meaningfully improves performance on:

  • Math and logic — proofs, combinatorics, exact arithmetic
  • Complex code — multi-file debugging, non-trivial algorithms
  • Structured reasoning — puzzles, chained deductions
  • Planning — tasks that require laying out a strategy before acting

For simple factual questions ("what's the capital of France?"), creative text, or translation, extended reasoning adds nothing — and costs more.

It's also one of the most effective countermeasures against hallucinations (chapter 13). A model that takes time to verify its own draft catches errors that a single-shot answer would have let through. Not magic — it can hallucinate inside its reasoning too — but the simple act of unfolding the steps filters out a meaningful share of factual errors.

The cost is the real constraint. Thinking tokens are billed like regular tokens. An o1 model that generates 1,000 thinking tokens before a 30-token answer actually consumes 1,030 tokens. At millions of requests, that adds up.
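
A back-of-the-envelope calculation makes the point; the price per million tokens below is a placeholder, not any provider's real rate:

```python
# Thinking tokens are billed like any other output tokens.
PRICE_PER_MTOK = 10.00                       # $ per million output tokens (hypothetical)

thinking, answer, requests = 1_000, 30, 1_000_000

with_thinking = (thinking + answer) * requests * PRICE_PER_MTOK / 1e6
without       = answer * requests * PRICE_PER_MTOK / 1e6
print(with_thinking, without)                # 10300.0 vs 300.0 dollars
```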

Test-time compute scaling

What reasoning models revealed is that you can buy intelligence at inference time: the more thinking tokens you allocate, the better the answers get on hard tasks.

This is called test-time compute scaling — as opposed to the usual scaling that increases model parameters during training.

The curve looks similar to classical scaling laws: doubling the thinking budget improves performance, but with diminishing returns. At some point, thinking longer no longer compensates for the extra tokens it burns.
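
You can trace that curve yourself by sweeping the budget over a fixed problem set. A sketch, where `ask` is a hypothetical wrapper around whichever reasoning API you use:

```python
# Sketch of a test-time-compute sweep. `ask` is a hypothetical wrapper:
# wire it to your provider's API before running.
def ask(question: str, budget: int) -> str:
    """Send `question` with a thinking budget of `budget` tokens
    and return the model's final answer as a string."""
    raise NotImplementedError("plug in your provider's API call here")

problems = [
    ("What is the last digit of 7**100?", "1"),
    ("What is 17 * 24?", "408"),
]

for budget in (0, 1_000, 4_000, 16_000, 64_000):
    correct = sum(ask(q, budget).strip() == a for q, a in problems)
    print(f"budget={budget:>6}: {correct}/{len(problems)} correct")
```

On easy problems the curve is flat from the start; on hard ones it climbs steeply, then flattens into the plateau described above.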

And this is a significant discovery: an LLM's intelligence isn't a fixed constant set by its weights. It also depends on the compute given to it at the moment of responding.

A model that thinks long and hard about a difficult problem can outperform a larger model that answers quickly. Speed isn't always a virtue.
