Chapter 18 · Inference · 8 min
Why the 2nd token is faster than the 1st
The KV cache and autoregressive generation. Prefill vs decode, TTFT, and why the cache changes everything.
The illusion of response time
You ask ChatGPT a question. It pauses for about a second before starting to answer. Then the words come out almost instantly, faster than you can read them.
This asymmetry isn't a UI quirk. It's the signature of a fundamental optimization, without which generating text with an LLM would cost a hundred times more: the KV cache.
How a Transformer generates a token
At each generation step, the Transformer must produce one new token. To do that, it computes the attention of the latest token over all the previous ones. That's how it takes the entire context into account.
But attention requires, for every token in context, two vectors: a key (K) and a value (V). With no optimization, every new token forces the model to recompute K and V for the entire sequence, including tokens already processed at the previous step. Summed over a full generation, that's O(n²) work in sequence length: doubling the number of tokens quadruples the total cost.
It's wasted work: those vectors haven't changed. Token 3 has the same key K₃ it had at the previous step.
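You can verify this in a few lines. A toy check (the shapes and projection below are illustrative, not from any real model): the key of a given token depends only on that token's own representation, so recomputing it at a later step yields exactly the same vector.

```python
import torch

torch.manual_seed(0)
d_model, d_head = 8, 8
W_k = torch.randn(d_model, d_head)   # key projection matrix
x = torch.randn(5, d_model)          # embeddings for 5 tokens

K_step4 = x[:4] @ W_k   # keys computed when 4 tokens were in context
K_step5 = x[:5] @ W_k   # keys recomputed one step later, with 5 tokens

# The key of token 3 is identical at both steps: recomputing it is pure waste.
print(torch.equal(K_step4[2], K_step5[2]))  # True
```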
The KV cache: never recompute what you already have
The idea is simple but decisive: keep the K and V of every already-processed token in GPU memory. At each new generation step, compute only the K and V of the new token and append them to the cache.
Attention still reads the full cache, but the new computation at each step covers a single token, not the whole sequence: the K/V projection work drops from O(n) to O(1), and the attention matrix shrinks from n × n to a single new row.
Without the cache, each new token recomputes attention over the entire prefix — O(n²) cost. With the cache, only the new row is computed. That's what separates the slow first token (prefill) from the fast next ones (decode).
Figure: on the left, no cache, every step redraws all rows; on the right, with cache, just one row is added per step. After a few tokens, the gap in cumulative operations becomes huge.
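Here is what the decode step looks like in code. A minimal sketch in PyTorch (single head, no batching, no causal mask needed since we only ever query with the newest token; `decode_step` and the weight names are mine, not a real library API):

```python
import torch
import torch.nn.functional as F

d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []   # grows by one entry per generated token

def decode_step(x_new):
    """One decode step: O(1) new K/V work; attention reads the whole cache."""
    q = x_new @ W_q
    K_cache.append(x_new @ W_k)   # only the NEW token's K and V are computed
    V_cache.append(x_new @ W_v)
    K = torch.stack(K_cache)      # (n, d): all past keys, never recomputed
    V = torch.stack(V_cache)
    attn = F.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V               # context vector for the new token

for _ in range(5):                # each iteration adds exactly one cache row
    decode_step(torch.randn(d))
```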
Prefill vs decode: two very distinct phases
LLM generation breaks into two phases that inference engineers carefully distinguish.
Prefill. The model receives the full prompt and computes K and V for all its tokens in parallel. It's fast in throughput terms — the GPU is saturated — but takes time on long prompts. This determines the TTFT (Time To First Token), the latency before the first word appears.
Decode. The model generates the rest, one token at a time, reusing the cache. Each step is fast individually, but sequential: future tokens can't be parallelized since each depends on the previous. This determines the ITL (Inter-Token Latency).
The two phases have completely different profiles:
| | Prefill | Decode |
|---|---|---|
| Parallelizable? | Yes (all tokens at once) | No (sequential) |
| Hardware bottleneck | Compute (FLOPs) | Memory (cache reads) |
| Effect of length | Up to quadratic in N (attention) | Linear in N per token |
| Key metric | TTFT | ITL |
On a long prompt, prefill can take several seconds. On a long output, decode dominates — and it's bottlenecked by how fast the KV cache can be read from GPU HBM.
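Both metrics are easy to measure from any streaming endpoint. A sketch, assuming a hypothetical `stream_tokens(prompt)` generator that yields tokens as the server produces them (not a real client library):

```python
import time

def measure_latencies(stream_tokens, prompt):
    """TTFT = delay before the first token; ITL = mean gap between the rest."""
    t0 = time.perf_counter()
    stamps = [time.perf_counter() for _ in stream_tokens(prompt)]
    ttft = stamps[0] - t0                       # dominated by prefill
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    itl = sum(gaps) / len(gaps)                 # dominated by decode
    return ttft, itl
```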
Why providers price input tokens differently
Look at the price lists from OpenAI, Anthropic, or Google: input tokens are consistently cheaper than output tokens, often 4× to 5× cheaper. That's not arbitrary. Prefill, which processes input tokens, is massively parallel and uses the GPU efficiently. Decode generates output tokens one by one and underuses the hardware.
More subtly: Anthropic, OpenAI and others now offer prefix caching. If many requests share the same system prompt, the KV cache for that prefix is computed once and reused. That's what makes agents and multi-turn chatbots economically viable: without prefix caching, each turn would cost reprocessing the entire conversation.
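Conceptually, the serving engine keys cached K/V by the token prefix. A toy version of the idea (real engines such as vLLM cache per fixed-size block, with eviction; everything below, including the `compute_kv` callback, is an illustrative assumption):

```python
prefix_cache = {}   # hash of the shared token prefix -> its (K, V) tensors

def prefill_with_reuse(system_tokens, user_tokens, compute_kv):
    """Pay for the shared system prompt once, then only for each user suffix."""
    key = hash(tuple(system_tokens))
    if key not in prefix_cache:
        prefix_cache[key] = compute_kv(system_tokens)       # computed once
    kv_system = prefix_cache[key]
    # Only the per-request suffix still needs a real prefill pass.
    kv_user = compute_kv(user_tokens, past_kv=kv_system)
    return kv_system + kv_user
```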
The hidden cost: GPU memory
The KV cache isn't free in memory. It occupies:
memory = 2 × n_layers × n_heads × d_head × seq_len × batch_size × 2 bytes (FP16)

(the leading 2 counts the K and V tensors; with GQA, n_heads is the number of KV heads, not attention heads)
For a 70-billion-parameter model with a 128,000-token context and batch size 1, that's tens of gigabytes. This is often what limits the practical context length, more than the model's ability to reason over the sequence itself.
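Plugging in numbers for a Llama-2-70B-shaped model (80 layers, GQA with 8 KV heads, head dimension 128; these figures match the published architecture, but check them for your exact model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per=2):
    # leading 2 = one K tensor + one V tensor per layer
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per

# Llama-2-70B-like config, FP16 cache, 128,000-token context, batch size 1
gb = kv_cache_bytes(80, 8, 128, 128_000) / 1e9
print(f"{gb:.0f} GB")   # ~42 GB; 8x more (~335 GB) without GQA's head sharing
```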
To push further, several techniques exist:
- Cache quantization: store K and V in INT8 or INT4 instead of FP16, halving or quartering memory (a minimal sketch follows this list).
- MQA / GQA (Multi-Query / Grouped-Query Attention): share K/V across multiple heads. Llama 2 70B and Llama 3 use GQA, drastically shrinking the cache.
- Sliding window attention: only keep a recent window of the cache (Mistral, Gemma).
- PagedAttention (vLLM): treat the cache as virtual memory pages for better dynamic batching.
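To make cache quantization concrete, here is a minimal symmetric INT8 round-trip (per-tensor scaling for brevity; production implementations quantize per channel or per group, and the function names are mine):

```python
import torch

def quantize_int8(t):
    """Symmetric INT8: store int8 values plus a single FP scale factor."""
    scale = t.abs().max() / 127
    q = torch.round(t / scale).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

K = torch.randn(1000, 128)            # a slice of cached keys
q, s = quantize_int8(K)               # half the memory of an FP16 cache
err = (dequantize(q, s) - K).abs().max()
print(f"max round-trip error: {err:.4f}")   # small relative to typical values
```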
Quantization, in two words
The word comes up everywhere in this part: QLoRA in 4 bits (chapter 14), cache quantization just above, GGUF models you download from Hugging Face. Time to explain what it means.
To quantize is to represent each model parameter with fewer bits. A 32-bit floating-point number (FP32) takes 4 bytes. In FP16, 2 bytes. In INT8, 1 byte. In INT4, half a byte — weight memory is divided by 8 compared to original FP32.
| Precision | Bytes / param | 70B model takes |
|---|---|---|
| FP32 | 4 | 280 GB |
| FP16 / BF16 | 2 | 140 GB |
| INT8 | 1 | 70 GB |
| INT4 | 0.5 | 35 GB |
| INT2 (extreme) | 0.25 | 17.5 GB |
The trick: a weight of 0.237 doesn't become exactly 0.237 in INT4 (which has only 16 possible values), but its closest match on a discrete grid. The quality loss depends on the model and the method, but it's typically negligible at INT8, modest at INT4 (a few % degradation on benchmarks), and significant below that.
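Worked through on the 0.237 example (an illustrative symmetric INT4 grid; real methods pick a scale per group of weights rather than the arbitrary 0.1 assumed here):

```python
# Symmetric INT4: 16 levels, the integers -8..7 multiplied by a scale factor
scale = 0.1                                  # assume the group's max weight set this
levels = [i * scale for i in range(-8, 8)]   # the 16 representable values
w = 0.237
w_q = min(levels, key=lambda v: abs(v - w))
print(w_q)   # 0.2 -> stored as the integer 2, decoded back as 2 * scale
```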
Modern techniques (GPTQ, AWQ, GGUF) don't quantize all weights equally — they preserve precision in sensitive layers and compress others more aggressively. And the KV cache quantization mentioned above applies exactly the same idea to activations stored in memory during generation.
This is the technique that lets Llama 70B run on a MacBook with 64 GB of RAM, where the FP16 version needs a cluster.
The lesson
Without the KV cache, LLMs in production would be impractical. A long conversation, a reasoning agent, a chatbot that remembers what you said five messages back — none of it would be economically viable.
But this cache is also what constrains context length. When we talk about a "1 million token window", it's largely a cache memory problem, not an attention computation problem.
The KV cache isn't one optimization among many. It's what turns attention from a theoretical mechanism into production infrastructure.