Chapter 21 · Diffusion · 9 min

Generate an image by erasing noise

Stable Diffusion, DALL-E, Midjourney. The reverse denoising process, the role of CLIP, and why U-Net is giving way to Transformers.

A different family

Everything we've seen so far describes Transformers that predict the next token. That's the architecture that dominates language, code, and increasingly video and audio.

But when you type a prompt into Midjourney, DALL-E or Stable Diffusion, that's not what's happening. The image isn't generated pixel by pixel from left to right. It appears everywhere at once, refined through successive steps.

This is the work of a very different family of models: diffusion models.

The core idea: learning to denoise

Diffusion is built on an almost-too-simple intuition. If I take a clean image and progressively add Gaussian noise, at some point it becomes indistinguishable from pure noise. If I learn to invert that process — to remove noise step by step — then I can start from pure noise and end with a clean image.

During training:

  1. Pick an image from the dataset
  2. Pick a random step t and add the corresponding amount of noise, following a predefined schedule (e.g. T = 1000 steps)
  3. Train a network to predict the noise that was added, given the noisy image and the step t

The model thus learns, for any noise level, to recognize what's "real image" and what's "added noise".
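
In code, the training step fits in a few lines. Here is a minimal PyTorch-style sketch with a simple linear noise schedule — the schedule and constants are illustrative, not those of any specific model:

```python
import torch

T = 1000                                   # number of noise steps in the schedule
betas = torch.linspace(1e-4, 0.02, T)      # simple linear schedule (an assumption, for illustration)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x0):
    """One training step: predict the noise that was mixed into x0."""
    t = torch.randint(0, T, (x0.shape[0],))            # a random timestep per image
    noise = torch.randn_like(x0)                        # Gaussian noise to add
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise        # noisy image at step t
    pred = model(x_t, t)                                # the network predicts the added noise
    return torch.nn.functional.mse_loss(pred, noise)    # simple MSE objective
```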

At inference, we reverse the process: start from pure noise; at each step the model predicts the noise, we remove part of it, and repeat. After T steps we get a plausible image — one that resembles the learned distribution.
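
The corresponding sampling loop, reusing T, betas, and alphas_cumprod from the training sketch above (plain DDPM ancestral sampling — most modern samplers keep this structure and change only the update rule):

```python
@torch.no_grad()
def sample(model, shape):
    """Start from pure noise and denoise for T steps."""
    x = torch.randn(shape)                              # pure Gaussian noise
    for t in reversed(range(T)):
        a_t, b_t = alphas_cumprod[t], betas[t]
        pred_noise = model(x, torch.full((shape[0],), t))
        # Remove the predicted noise (simplified DDPM posterior mean)
        x = (x - b_t / (1 - a_t).sqrt() * pred_noise) / (1 - b_t).sqrt()
        if t > 0:
            x = x + b_t.sqrt() * torch.randn_like(x)    # re-inject a little noise
    return x
```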

Try the denoising

The animation starts from pure noise and denoises it step by step. Each step follows the denoising direction learned during training, guided by a CLIP text embedding.

Play with the step count. At 5 steps, the image stays grainy — the model doesn't have time to refine. At 50 steps, it's clean, but at ten times the compute. Most modern samplers (DDIM, DPM-Solver) reach near-optimal quality in 20 to 30 steps.

Steering toward a specific image: guidance

Pure noise is the same for any prompt. How does the model know we want a sunset and not a dog? The answer: we condition the model on the text.

During training, the network receives, in addition to the noisy image and the timestep, an embedding of the associated text (often produced by a CLIP-style text encoder). The network thus learns conditional noise predictions: "given that this image is supposed to show a sunset, here's what the noise looks like".

At inference, we compute two predictions: one with the prompt, one without (or with an empty prompt). The Classifier-Free Guidance (CFG) technique extrapolates in the direction of the prompt:

final_prediction = unconditional + guidance × (conditional − unconditional)

The guidance coefficient (CFG scale) controls how strongly the prediction is pushed toward the prompt. At CFG = 0, the model ignores the text; around CFG = 7, the image follows the prompt faithfully without looking artificial; beyond 12, it tends to become oversaturated and lose fine detail.
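
A minimal sketch of that combination (the function and argument names here are illustrative, not any library's API):

```python
import torch

def cfg_predict(model, x_t, t, prompt_emb, null_emb, guidance_scale=7.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the prompt-conditioned one."""
    uncond = model(x_t, t, null_emb)     # prediction with an empty prompt
    cond = model(x_t, t, prompt_emb)     # prediction with the actual prompt
    return uncond + guidance_scale * (cond - uncond)
```

In practice the two predictions are usually batched into a single forward pass, which is why CFG roughly doubles the cost of each denoising step.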

Latent diffusion: doing less work

A 512×512 RGB image has 786,432 values (3 channels × 512 × 512 pixels). Running 1,000 denoising steps directly on that is very expensive. Stable Diffusion popularized a trick: work in a compressed latent space.

Before training the diffusion model, we train an autoencoder (a VAE) that compresses images into a 64×64×4 latent space — about 48× smaller. Diffusion then operates only on this latent, not on pixels. After denoising, we decode the latent back into the final image.

This is what makes Stable Diffusion runnable on a consumer GPU in a few seconds — when an equivalent pixel-based model would need a cluster.
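
To make the shapes concrete, here is a sketch using the VAE checkpoint commonly paired with Stable Diffusion 1.x through the diffusers library (the model name and the 0.18215 scaling factor are the usual SD 1.x values; treat the details as indicative):

```python
import torch
from diffusers import AutoencoderKL

# The VAE used by Stable Diffusion 1.x (other versions ship their own)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.randn(1, 3, 512, 512)        # stand-in for a real RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215   # SD 1.x scaling factor
    print(latents.shape)                    # torch.Size([1, 4, 64, 64]) — ~48x fewer values
    decoded = vae.decode(latents / 0.18215).sample
    print(decoded.shape)                    # torch.Size([1, 3, 512, 512])
```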

CLIP: the bridge between text and image

To condition on text, we need a textual representation that shares a space with images. That's the role of CLIP (Contrastive Language-Image Pretraining), trained by OpenAI on 400 million image/caption pairs.

CLIP learns two encoders — one for images, one for text — that produce embeddings in a shared space. A caption and its matching image end up with nearby embeddings; an unrelated caption ends up far away. This alignment is what lets the diffusion model understand prompts it has never seen verbatim during training.
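
You can see that shared space with a few lines of transformers code — a sketch, where the image path is a placeholder and any CLIP implementation behaves the same way:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("sunset.jpg")            # illustrative path to any local image
texts = ["a photo of a sunset", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# One row per image, one column per caption: higher score = closer in the shared space
print(out.logits_per_image.softmax(dim=-1))
```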

U-Net vs Transformer (DiT)

For a long time, the reference architecture for diffusion was the U-Net: a U-shaped convolutional network with skip connections that preserve fine detail. Stable Diffusion 1.4, 1.5, and 2 all use U-Nets.

But in 2022, Peebles and Xie proposed DiT (Diffusion Transformer): replace the U-Net with a pure Transformer. Each image "patch" is treated like a token, with full attention over all other patches. No more convolutional pyramid, no more skip connections — just stacked Transformer blocks.

Stable Diffusion 3, FLUX, Sora (video) — every recent architecture has migrated to Transformers. Why? Because Transformers scale better: we recover the same scaling laws as for language. Doubling compute on a DiT predictably improves the images.
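
The key move in a DiT is turning the latent image into a sequence of patch tokens, exactly as a ViT does. A rough sketch with Stable-Diffusion-style latent dimensions (the width is illustrative, and the conditioning is simplified — the DiT paper injects timestep and text via adaptive LayerNorm):

```python
import torch
import torch.nn as nn

latent = torch.randn(1, 4, 64, 64)       # a Stable-Diffusion-style latent
patch_size = 2                           # DiT typically uses 2x2 patches on the latent
dim = 768                                # Transformer width (illustrative)

# Conv with stride = kernel = patch_size: each 2x2x4 patch becomes one token
patchify = nn.Conv2d(4, dim, kernel_size=patch_size, stride=patch_size)
tokens = patchify(latent).flatten(2).transpose(1, 2)
print(tokens.shape)                      # torch.Size([1, 1024, 768]) — 32x32 = 1024 tokens

# From here: ordinary Transformer blocks with full attention over the 1024 tokens
block = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
out = block(tokens)                      # same shape in, same shape out
```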

The sampler: DDPM, DDIM, DPM-Solver

The sampler is the algorithm that decides, at each step, how to combine the noise prediction with the current state to produce the next, slightly less noisy state.

| Sampler | Typical steps | Particularity |
| --- | --- | --- |
| DDPM | 1000 | Stochastic, faithful to training, slow |
| DDIM | 20–50 | Deterministic, near-DDPM quality with far fewer steps |
| DPM-Solver | 10–25 | ODE solver, even faster |
| Euler / Heun | 20–30 | Classical ODE methods, simple and robust |
| LCM | 4–8 | Latent consistency distillation, ultra-fast |

Picking a sampler is one of the most accessible levers to trade speed for quality.
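
With a library like diffusers, the sampler (called a scheduler there) is a drop-in component you can swap without retraining anything. A sketch, assuming a GPU; the model id is the commonly used SD 1.5 checkpoint:

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the default sampler for DPM-Solver and cut the step count
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a sunset over the ocean",
             num_inference_steps=20, guidance_scale=7.0).images[0]
image.save("sunset.png")
```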

Why this family exists alongside Transformers

You might ask: why not generate images like text, token by token? The answer is subtle.

For text, the order is natural: we read left to right, and each word depends on the previous ones. For an image, there's no canonical order. Generating pixel by pixel introduces artifacts (the first pixels don't have the context of the last ones). Diffusion solves this by generating the whole image in parallel, refined globally at each step.

That's why autoregressive approaches on pixels (PixelRNN, OpenAI's original ImageGPT) were abandoned in favor of diffusion. The only possible comeback would be via autoregressive models over image tokens (DALL-E 1, and now Meta's Chameleon) — but they still trail diffusion on photorealistic quality.

Diffusion isn't a Transformer. It's another, mathematically orthogonal way of learning a complex distribution — and for images, it's the one that won.
