Chapter 18 · Inference · 8 min
Why the 2nd token is faster than the 1st
The KV cache and autoregressive generation. Prefill vs decode, TTFT, and why the cache changes everything.
The illusion of response time
You ask ChatGPT a question. It pauses for about a second before starting to answer. Then the words come out almost instantly, faster than you can read them.
This asymmetry isn't a UI quirk. It's the signature of a fundamental optimization, without which generating text with an LLM would cost a hundred times more: the KV cache.
How a Transformer generates a token
At each generation step, the Transformer must produce one new token. To do that, it computes the attention of the latest token over all the previous ones. That's how it takes the entire context into account.
But attention requires, for every token in context, two vectors: a key (K) and a value (V). With no optimization, every new token forces the model to recompute K and V for the entire sequence, including tokens already processed at the previous step. Summed over a full generation, that's O(n²) work in sequence length: doubling the number of tokens quadruples the total cost.
It's wasted work: those vectors haven't changed. Token 3 has the same key K₃ it had at the previous step.
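You can verify this in a few lines. A toy check (the shapes and projection below are illustrative, not from any real model): the key of a given token depends only on that token's own representation, so recomputing it at a later step yields exactly the same vector.

```python
import torch

torch.manual_seed(0)
d_model, d_head = 8, 8
W_k = torch.randn(d_model, d_head)   # key projection matrix
x = torch.randn(5, d_model)          # embeddings for 5 tokens

K_step4 = x[:4] @ W_k   # keys computed when 4 tokens were in context
K_step5 = x[:5] @ W_k   # keys recomputed one step later, with 5 tokens

# The key of token 3 is identical at both steps: recomputing it is pure waste.
print(torch.equal(K_step4[2], K_step5[2]))  # True
```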
The KV cache: never recompute what you already have
The idea is simple but decisive: keep the K and V of every already-processed token in GPU memory. At each new generation step, compute only the K and V of the new token and append them to the cache.
Attention still reads the full cache, but the new computation at each step covers a single token, not the whole sequence: the K/V projection work drops from O(n) to O(1), and the attention matrix shrinks from n × n to a single new row.
Without the cache, each new token recomputes attention over the entire prefix — O(n²) cost. With the cache, only the new row is computed. That's what separates the slow first token (prefill) from the fast next ones (decode).
Figure: on the left, no cache, every step redraws all rows; on the right, with cache, just one row is added per step. After a few tokens, the gap in cumulative operations becomes huge.
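Here is what the decode step looks like in code. A minimal sketch in PyTorch (single head, no batching, no causal mask needed since we only ever query with the newest token; `decode_step` and the weight names are mine, not a real library API):

```python
import torch
import torch.nn.functional as F

d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []   # grows by one entry per generated token

def decode_step(x_new):
    """One decode step: O(1) new K/V work; attention reads the whole cache."""
    q = x_new @ W_q
    K_cache.append(x_new @ W_k)   # only the NEW token's K and V are computed
    V_cache.append(x_new @ W_v)
    K = torch.stack(K_cache)      # (n, d): all past keys, never recomputed
    V = torch.stack(V_cache)
    attn = F.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V               # context vector for the new token

for _ in range(5):                # each iteration adds exactly one cache row
    decode_step(torch.randn(d))
```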
Prefill vs decode: two very distinct phases
LLM generation breaks into two phases that inference engineers carefully distinguish.
Prefill. The model receives the full prompt and computes K and V for all its tokens in parallel. It's fast in throughput terms — the GPU is saturated — but takes time on long prompts. This determines the TTFT (Time To First Token), the latency before the first word appears.
Decode. The model generates the rest, one token at a time, reusing the cache. Each step is fast individually, but sequential: future tokens can't be parallelized since each depends on the previous. This determines the ITL (Inter-Token Latency).
The two phases have completely different profiles:
| | Prefill | Decode |
|---|---|---|
| Parallelizable? | Yes (all tokens at once) | No (sequential) |
| Hardware bottleneck | Compute (FLOPs) | Memory (cache reads) |
| Effect of length | Up to quadratic in N (attention) | Linear in N per token |
| Key metric | TTFT | ITL |
On a long prompt, prefill can take several seconds. On a long output, decode dominates — and it's bottlenecked by how fast the KV cache can be read from GPU HBM.
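Both metrics are easy to measure from any streaming endpoint. A sketch, assuming a hypothetical `stream_tokens(prompt)` generator that yields tokens as the server produces them (not a real client library):

```python
import time

def measure_latencies(stream_tokens, prompt):
    """TTFT = delay before the first token; ITL = mean gap between the rest."""
    t0 = time.perf_counter()
    stamps = [time.perf_counter() for _ in stream_tokens(prompt)]
    ttft = stamps[0] - t0                       # dominated by prefill
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    itl = sum(gaps) / len(gaps)                 # dominated by decode
    return ttft, itl
```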
Why providers price input tokens differently
Look at the price lists from OpenAI, Anthropic, or Google: input tokens are consistently cheaper than output tokens, often 4× to 5× cheaper. That's not arbitrary. Prefill, which processes input tokens, is massively parallel and uses the GPU efficiently. Decode generates output tokens one by one and underuses the hardware.
More subtly: Anthropic, OpenAI and others now offer prefix caching. If many requests share the same system prompt, the KV cache for that prefix is computed once and reused. That's what makes agents and multi-turn chatbots economically viable: without prefix caching, each turn would cost reprocessing the entire conversation.
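Conceptually, the serving engine keys cached K/V by the token prefix. A toy version of the idea (real engines such as vLLM cache per fixed-size block, with eviction; everything below, including the `compute_kv` callback, is an illustrative assumption):

```python
prefix_cache = {}   # hash of the shared token prefix -> its (K, V) tensors

def prefill_with_reuse(system_tokens, user_tokens, compute_kv):
    """Pay for the shared system prompt once, then only for each user suffix."""
    key = hash(tuple(system_tokens))
    if key not in prefix_cache:
        prefix_cache[key] = compute_kv(system_tokens)       # computed once
    kv_system = prefix_cache[key]
    # Only the per-request suffix still needs a real prefill pass.
    kv_user = compute_kv(user_tokens, past_kv=kv_system)
    return kv_system + kv_user
```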
The hidden cost: GPU memory
The KV cache isn't free in memory. It occupies:
memory = 2 × n_layers × n_heads × d_head × seq_len × batch_size × 2 bytes (FP16)

(the leading 2 counts the K and V tensors; with GQA, n_heads is the number of KV heads, not attention heads)
For a 70-billion-parameter model with a 128,000-token context and batch size 1, that's tens of gigabytes. This is often what limits the practical context length, more than the model's ability to reason over the sequence itself.
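Plugging in numbers for a Llama-2-70B-shaped model (80 layers, GQA with 8 KV heads, head dimension 128; these figures match the published architecture, but check them for your exact model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per=2):
    # leading 2 = one K tensor + one V tensor per layer
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per

# Llama-2-70B-like config, FP16 cache, 128,000-token context, batch size 1
gb = kv_cache_bytes(80, 8, 128, 128_000) / 1e9
print(f"{gb:.0f} GB")   # ~42 GB; 8x more (~335 GB) without GQA's head sharing
```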
To push further, several techniques exist:
- Cache quantization: store K and V in INT8 or INT4 instead of FP16, halving or quartering memory (a minimal sketch follows this list).
- MQA / GQA (Multi-Query / Grouped-Query Attention): share K/V across multiple heads. Llama 2 70B and Llama 3 use GQA, drastically shrinking the cache.
- Sliding window attention: only keep a recent window of the cache (Mistral, Gemma).
- PagedAttention (vLLM): treat the cache as virtual memory pages for better dynamic batching.
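To make cache quantization concrete, here is a minimal symmetric INT8 round-trip (per-tensor scaling for brevity; production implementations quantize per channel or per group, and the function names are mine):

```python
import torch

def quantize_int8(t):
    """Symmetric INT8: store int8 values plus a single FP scale factor."""
    scale = t.abs().max() / 127
    q = torch.round(t / scale).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

K = torch.randn(1000, 128)            # a slice of cached keys
q, s = quantize_int8(K)               # half the memory of an FP16 cache
err = (dequantize(q, s) - K).abs().max()
print(f"max round-trip error: {err:.4f}")   # small relative to typical values
```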
Quantization, in two words
The word comes up everywhere in this part: QLoRA in 4 bits (chapter 14), cache quantization just above, GGUF models you download from Hugging Face. Time to explain what it means.
To quantize is to represent each model parameter with fewer bits. A 32-bit floating-point number (FP32) takes 4 bytes. In FP16, 2 bytes. In INT8, 1 byte. In INT4, half a byte — weight memory is divided by 8 compared to original FP32.
| Precision | Bytes / param | 70B model takes |
|---|---|---|
| FP32 | 4 | 280 GB |
| FP16 / BF16 | 2 | 140 GB |
| INT8 | 1 | 70 GB |
| INT4 | 0.5 | 35 GB |
| INT2 (extreme) | 0.25 | 17.5 GB |
The trick: a weight of 0.237 doesn't become exactly 0.237 in INT4 (which has only 16 possible values), but its closest match on a discrete grid. The quality loss depends on the model and the method, but it's typically negligible at INT8, modest at INT4 (a few % degradation on benchmarks), and significant below that.
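Worked through on the 0.237 example (an illustrative symmetric INT4 grid; real methods pick a scale per group of weights rather than the arbitrary 0.1 assumed here):

```python
# Symmetric INT4: 16 levels, the integers -8..7 multiplied by a scale factor
scale = 0.1                                  # assume the group's max weight set this
levels = [i * scale for i in range(-8, 8)]   # the 16 representable values
w = 0.237
w_q = min(levels, key=lambda v: abs(v - w))
print(w_q)   # 0.2 -> stored as the integer 2, decoded back as 2 * scale
```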
Modern techniques (GPTQ, AWQ, GGUF) don't quantize all weights equally — they preserve precision in sensitive layers and compress others more aggressively. And the KV cache quantization mentioned above applies exactly the same idea to activations stored in memory during generation.
This is the technique that lets Llama 70B run on a MacBook with 64 GB of RAM, where the FP16 version needs a cluster.
The lesson
Without the KV cache, LLMs in production would be impractical. A long conversation, a reasoning agent, a chatbot that remembers what you said five messages back — none of it would be economically viable.
But this cache is also what constrains context length. When we talk about a "1 million token window", it's largely a cache memory problem, not an attention computation problem.
The KV cache isn't one optimization among many. It's what turns attention from a theoretical mechanism into production infrastructure.