Chapter 15 · Multimodality

When the model reads images

Patch embedding, ViT, CLIP. How a text Transformer becomes multimodal by treating an image as a grid of tokens.

The model doesn't really see

When you send a photo to GPT-4o or Claude and it tells you what the photo shows, there's no "eye" in the model. No visual system, no object detector. The model doesn't "see" an image: it reads a sequence of vectors.

That's the entire trick of modern multimodality: transform any type of data (image, audio, video) into a representation that resembles text tokens. After this transformation, the standard Transformer does the rest.

A note on this chapter's scope. We're talking here about models that understand images (describe them, answer questions, read a chart). For models that generate images from text — Stable Diffusion, DALL-E, Midjourney — the architecture is different and is the subject of chapter 21.

ViT: cut the image into patches

The reference architecture for images is called ViT (Vision Transformer), proposed by Google in 2020.

The idea is disconcertingly simple: cut the image into small squares (patches) of fixed size — 16×16 pixels by default. Each patch is flattened into a vector, then projected into the Transformer's embedding space.

A 224×224 pixel image, cut into 16×16 patches, produces a 14 × 14 grid, i.e. 196 patches, and therefore 196 tokens. These tokens are sent to the Transformer exactly like text tokens. Attention processes them, relates them, extracts relevant features.

The position of each patch in the image is encoded via a positional embedding, exactly like for text tokens.
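
To make this pipeline concrete, here is a minimal NumPy sketch of the patch-embedding step (the shapes match ViT-Base, but the weights, names, and image are random placeholders, not actual ViT code): it cuts a 224×224 RGB image into 16×16 patches, flattens each one, applies a linear projection, and adds a positional embedding.

```python
import numpy as np

# Illustrative sketch of ViT-style patch embedding (not the real ViT code).
IMG, PATCH, D_MODEL = 224, 16, 768           # image size, patch size, embedding dim
N = (IMG // PATCH) ** 2                      # 14 * 14 = 196 patches

image = np.random.rand(IMG, IMG, 3)          # one RGB image, values in [0, 1]

# 1. Cut into non-overlapping 16x16 patches and flatten each one.
patches = image.reshape(IMG // PATCH, PATCH, IMG // PATCH, PATCH, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(N, PATCH * PATCH * 3)  # (196, 768)

# 2. Learned linear projection into the Transformer's embedding space.
W_proj = np.random.randn(PATCH * PATCH * 3, D_MODEL) * 0.02               # trained in practice
tokens = patches @ W_proj                                                  # (196, 768)

# 3. Add a positional embedding so the model knows where each patch was.
pos_emb = np.random.randn(N, D_MODEL) * 0.02                               # trained in practice
tokens = tokens + pos_emb

print(tokens.shape)   # (196, 768): 196 image "tokens", ready for the Transformer
```

At this point the image is literally 196 vectors of dimension 768; nothing downstream needs to know they came from pixels.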

Explore the patches

Here's a 16×16 image (a simplification — ViT-Base works on 224×224). Change the patch size to observe how resolution and token count vary.

The image is split into square patches; each patch becomes a token via a linear projection. The Transformer no longer cares whether the input is text or an image grid — it processes the sequence the same way.
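
If you'd rather see the same trade-off in numbers than in the widget, a tiny pure-Python sketch reproduces it (the sizes are illustrative):

```python
# Token count for a square image split into square patches (illustrative).
def num_patches(image_size: int, patch_size: int) -> int:
    return (image_size // patch_size) ** 2

for p in (8, 16, 32):
    print(p, num_patches(224, p))   # 8 -> 784, 16 -> 196, 32 -> 49
```

Smaller patches mean finer detail but a much longer, and therefore more expensive, sequence.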

The special [CLS] token

In early ViTs, a special [CLS] token is added at the beginning of the sequence. After attention has circulated information among all patches, the [CLS] vector is used as a global representation of the image.

It's the one sent to a classification head to answer "what is this?"

In modern multimodal models (GPT-4V, Claude 3, Gemini), the approach is different: image patch tokens are directly concatenated with text tokens in the same sequence, and ordinary self-attention over that mixed sequence does the rest.
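
To see the difference concretely, here is a schematic sketch of both setups (all the tensors are random placeholders standing in for real embeddings):

```python
import numpy as np

D_MODEL, N_PATCHES, N_TEXT = 768, 196, 12

image_tokens = np.random.randn(N_PATCHES, D_MODEL)   # output of the patch embedding above
text_tokens  = np.random.randn(N_TEXT, D_MODEL)      # embedded text tokens

# Classic ViT: prepend a learned [CLS] token; after attention, its final
# vector serves as the global image representation for a classification head.
cls_token = np.random.randn(1, D_MODEL)              # learned in practice
vit_input = np.concatenate([cls_token, image_tokens], axis=0)    # (197, 768)

# Modern vision-language models: concatenate image and text tokens into one
# sequence; ordinary self-attention then mixes the two modalities.
vlm_input = np.concatenate([image_tokens, text_tokens], axis=0)  # (208, 768)

print(vit_input.shape, vlm_input.shape)
```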

CLIP: aligning images and text

Before you can mix images and text in the same Transformer, image embeddings and text embeddings must live in the same vector space.

That's the problem CLIP (OpenAI, 2021) solved. CLIP trains two encoders in parallel — one for images, one for text — with a single objective: bring the representations of an image and its caption closer together, and push apart the representations of mismatched pairs.

After training on hundreds of millions of (image, text) pairs, CLIP produces a shared space where "a photo of a cat" and an actual photo of a cat have close vectors. This property is what lets an LLM "understand" an image injected into its context.
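
Here's a simplified sketch of that contrastive objective (the batch size, dimension, and fixed temperature are illustrative; the real CLIP learns its temperature and trains on far larger batches):

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# A batch of B (image, caption) pairs already encoded by the two encoders.
B, D = 4, 512
img_emb = l2_normalize(np.random.randn(B, D))   # image encoder outputs
txt_emb = l2_normalize(np.random.randn(B, D))   # text encoder outputs

temperature = 0.07
logits = (img_emb @ txt_emb.T) / temperature    # (B, B) scaled cosine similarities

# Each image should match its own caption (the diagonal) and no other:
# cross-entropy with the diagonal as the target (log-softmax over each row).
labels = np.arange(B)
row_max = logits.max(axis=1, keepdims=True)
log_probs = logits - row_max - np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
loss_img_to_txt = -log_probs[labels, labels].mean()
print(loss_img_to_txt)
```

The full training loss averages this term with its transpose (text-to-image), so both encoders are pulled toward the same shared space.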

The architecture of a vision-language model

Current multimodal LLMs are generally composed of:

  1. A visual encoder (often a pre-trained ViT) that produces image tokens.
  2. A projector (a small MLP) that maps image tokens into the LLM's embedding space.
  3. The LLM, which receives the mixed sequence (image tokens + text tokens) and generates the response.

This "projector" is often the only part trained when adapting a text LLM to multimodal — the rest stays frozen.

What the model actually sees

It's tempting to imagine the model has deep "visual understanding." In reality, here's what happens:

Each 16×16 patch is flattened into a vector of 768 values (16 × 16 pixels × 3 color channels), then linearly projected into the embedding space (which also happens to be 768-dimensional for ViT-Base). At this stage the vector is nothing more than the raw pixel values of one small square: a very local representation.

It's the attention between all these vectors that reconstructs spatial relationships, detects edges, recognizes shapes. The model has no built-in concept of "circle" or "face" — it discovers them statistically.

That's why visual LLMs can stumble on tasks that are simple for humans (counting objects, telling left from right) yet be remarkable on high-level tasks (interpreting a chart, reading a prescription).

What about audio?

The same principle — converting any modality into a token sequence — applies to voice. OpenAI's Whisper (2022) remains the reference for speech-to-text transcription. Its architecture is an encoder-decoder Transformer, exactly like a translation model.

The trick: convert the audio signal into a log-Mel spectrogram, a 2D image where the vertical axis is frequency and the horizontal axis is time. Thin time slices of this spectrogram become the input tokens, playing the same role as ViT patches for images. Whisper then produces text tokens as output.
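
Here is a rough sketch of that front end using librosa (the file name is a placeholder; the parameters approximate Whisper's published setup of 16 kHz audio, 80 Mel bands, and 10 ms hops; the convolutional stem and the Transformer itself are omitted):

```python
import librosa

sr = 16_000
audio, _ = librosa.load("speech.wav", sr=sr)           # placeholder path to any speech clip

# Log-Mel spectrogram: frequency on one axis, time on the other.
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel)                      # shape (80, n_frames)

# Each time slice (one column = 80 Mel values) plays the role of an input
# token, just as a flattened patch does for ViT.
frames = log_mel.T                                      # (n_frames, 80)
print(frames.shape)
```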

For voice generation (text-to-speech), the principle reverses: generate audio tokens from text. ElevenLabs, OpenAI TTS, Suno (for music) all use Transformers trained to predict the next audio token. Cloning a loved one's voice amounts to tokenizing a few minutes of recording and using it as conditioning.

The recent leap comes from natively multimodal voice models: GPT-4o realtime (2024), Gemini Live (2025), Claude voice. These models no longer round-trip through text internally — they reason directly in a space that mixes text tokens and audio tokens. That's what makes latency low (~300 ms) and prosody natural — the model can smile while speaking, because it never left the audio domain.

As with vision, the underlying architecture is still a Transformer. The only difference is the modality of the token.

Tokens: a universal currency

The true lesson of multimodality is that the token is a universal abstraction.

Text → tokens.
Images → tokens (patches).
Audio → tokens (sliced spectrogram).
Molecules → tokens (atoms).

As soon as you can convert a modality into a sequence of dense vectors, a Transformer can process it. That's why the same architectures that revolutionized NLP are now revolutionizing vision, audio, biology, and physics.

The Transformer is a token engine. Researchers keep inventing new ways to tokenize the world.

