Chapter 14 · Fine-tuning

Specializing a model without retraining everything

LoRA, QLoRA, SFT. How to adapt a generalist model to a specific domain by training 0.1% of its parameters.

The base model — what exactly is it?

When a model like GPT-4 or Claude comes out of pre-training (billions of tokens of raw text, predicting the next word), it knows a lot of things. But it doesn't know how to behave.

Ask a raw model to write a cover letter: it'll probably continue the sentence like a Wikipedia article about cover letters. Or worse, generate several contradictory responses, as if the text were continuing a FAQ.

Fine-tuning is the step that transforms this raw model into a usable assistant: helpful, consistent, and adapted to a domain.

Two problems, two solutions

Supervised fine-tuning (SFT)

The simplest form: show the model (question, ideal response) pairs and train it to produce those responses.
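In code, this is just next-token prediction with the loss masked to the response. A minimal sketch, assuming a Hugging Face-style causal language model (the function and the naive token-boundary handling are illustrative, not a library API):

    def sft_loss(model, tokenizer, question, ideal_response):
        # Tokenize the prompt alone to find where the response begins.
        prompt_ids = tokenizer(question, return_tensors="pt").input_ids
        full_ids = tokenizer(question + ideal_response, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # -100 = ignored by the loss
        # Standard causal-LM forward pass: cross-entropy on response tokens only.
        return model(input_ids=full_ids, labels=labels).loss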

That's what OpenAI did to create InstructGPT: humans wrote exemplary responses to thousands of questions, and the model learned to imitate them. The result: a model that "follows instructions."

The limitation: it requires expensive human-written data, and it doesn't teach the model how to compare two responses.

RLHF (and DPO in practice)

We covered this in chapter 08. Reinforcement learning from human feedback learns preferences. It's complementary to SFT, not a replacement.

Important detail: since 2024, most preference fine-tuning is done via DPO (Direct Preference Optimization) rather than classical RLHF's PPO. That's what turns an SFT-only model into an aligned one without the compute cost of RL. The full picture is in chapter 08.

But SFT and DPO/RLHF share a cost problem: they modify all the model's parameters. For a 70-billion-parameter model, that means weeks of compute and thousands of dollars in GPU time.

LoRA: training 0.1% of parameters

LoRA (Low-Rank Adaptation) is an elegant technique: instead of modifying the model's weights W, we add an update ΔW decomposed into two small matrices:

W_adapted = W_base + ΔW    where    ΔW = B × A
  • A: an r × d matrix (small)
  • B: a d × r matrix (small)
  • W_base: the original d × d weights, frozen and unmodified

with d the layer dimension and r the rank, r ≪ d.

The key insight: weight updates during fine-tuning have a low-rank structure in practice. You don't need a full matrix to capture the adaptation. Two small matrices suffice.

At rank r = 8, for a 4,096-dimension layer: LoRA trains 65,536 parameters versus the 16.8 million of the full weight matrix, about 0.4%.
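To make the arithmetic concrete, here is the decomposition in plain PyTorch (the names follow the formula above; in the actual method A is randomly initialized and B starts at zero, so training begins from the base model's behavior):

    import torch

    d, r = 4096, 8
    W_base = torch.randn(d, d)    # frozen original weights
    A = torch.randn(r, d) * 0.01  # trainable, r x d
    B = torch.zeros(d, r)         # trainable, d x r, zero-init: ΔW starts at 0
    delta_W = B @ A               # a full d x d update, from only 2*d*r numbers

    print(A.numel() + B.numel())  # 65,536 trainable parameters
    print(W_base.numel())         # 16,777,216 frozen parameters
    W_adapted = W_base + delta_W  # applied at inference, or folded into W_base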

The rank r controls the expressivity of the update: a higher r lets ΔW capture more complex adaptations, at the cost of more trainable parameters.

And because the base weights stay frozen, a single copy of the model can be shared across all fine-tuned variants: each variant is just a pair of small adapter matrices, a few megabytes to store and swap in.

Why does it work so well?

The surprising finding from LoRA experiments: even at very low rank (r = 4 or 8), post-fine-tuning performance is nearly identical to full fine-tuning. The intuitive reason: the space of useful adaptations is intrinsically low-dimensional.

In other words: to specialize a model for medical writing or Python code, you don't need to modify all its neurons. A few "directions" in weight space are enough.

QLoRA: going even lower

Where LoRA reduces the number of trained parameters, QLoRA (Quantized LoRA) also reduces the numerical precision of the frozen ones.

The base model is loaded in 4 bits (instead of 16 or 32), which divides its memory consumption by 4. The LoRA adapters remain in 16 bits for gradient precision. For the quantization mechanics themselves (FP32 → INT4 → INT2, what you lose at each level), see chapter 18.
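In practice this comes down to a few configuration flags. A hedged sketch with Hugging Face transformers, peft, and bitsandbytes; the model name and hyperparameters are placeholders:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # frozen base weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat4, as in the QLoRA paper
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16 bits
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",             # placeholder model
        quantization_config=bnb_config,
    )
    lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)         # 16-bit adapters, trainable
    model.print_trainable_parameters()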

Result: fine-tuning a 70-billion-parameter model on a single consumer GPU becomes possible. This is what enabled the explosion of specialized open-source models in 2023–2024.

The adaptation hierarchy

Today, practitioners distinguish several levels:

Prompting: no model modification. A few examples in context. Fast, zero cost, limited.

RAG: connect the model to a knowledge base (chapter 10). No model modification.

LoRA fine-tuning: adapt behavior and style on a targeted dataset. A few hours on one GPU, a few thousand examples. The right balance for most cases.

Full fine-tuning: modify all parameters. Necessary for deeply changing the model's capabilities (rare languages, very specific domain). Expensive.

Pre-training: start from scratch or continue training on a new corpus. Reserved for labs with access to clusters of thousands of GPUs.

What this changes for you

If you want to specialize a model for your use case — your editorial voice, your domain, your response formats — LoRA is today's reference technique.

Frameworks like Hugging Face PEFT or Axolotl let you launch a LoRA fine-tuning in a few lines of Python. The real difficulty isn't technical: it's building a quality dataset. A good fine-tune starts with 500 carefully written examples — not 10,000 examples generated by ChatGPT.
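For scale, here is roughly what those few lines look like with trl on top of PEFT. A sketch under stated assumptions: the dataset file, base model, and hyperparameters are placeholders, and the exact SFTTrainer arguments vary across trl versions:

    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    # One JSON object per line, each holding a formatted training example.
    dataset = load_dataset("json", data_files="my_500_examples.jsonl", split="train")

    trainer = SFTTrainer(
        model="mistralai/Mistral-7B-v0.1",        # placeholder base model
        train_dataset=dataset,
        peft_config=LoraConfig(r=8, lora_alpha=16),
        args=SFTConfig(output_dir="out", num_train_epochs=3),
    )
    trainer.train()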
