Chapter 08 · Alignment · 9 min
From raw model to assistant
Fine-tuning, RLHF, constitutional AI. How we make an LLM useful and harmless.
The raw model is not an assistant
At the end of pre-training, your LLM knows one thing and one thing only: continue text plausibly. That's useful, even magical. But it's not an assistant.
Ask a raw model "How do I make cookies?" and there's a good chance it continues like this:
"How do I make cookies? How do I make biscuits? How do I make cakes? The cookie recipe is a recipe that requires cookies, sugar…"
Not because it's stupid. Because it saw, in its corpus, many pages where a question is followed by more questions or noise. It's doing its job: predicting what statistically follows, not what would be useful.
To go from completer to assistant, we align the model.
Three identical prompts, two models: the raw model on the left, the same one after supervised fine-tuning and RLHF on the right. The raw model continues the text; the aligned one answers — and refuses problematic requests.
Three successive steps
Modern alignment is done in several phases stacked on top of pre-training.
1. Instruction tuning (SFT)
We fine-tune the model (in classic supervised fashion) on a dataset of instruction → ideal-response pairs written by humans. A few tens of thousands of pairs are enough. This is what teaches the model:
- to follow instructions rather than complete them
- to respect the requested format (list, paragraph, code…)
- to produce a complete response rather than rambling
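To make the mechanics concrete, here is a minimal sketch of how one instruction → response pair can be turned into a training example. The chat template and helper name are hypothetical, and `tokenizer` stands for any tokenizer with an `encode` method; the -100 label is the usual "ignore this position" convention for cross-entropy in PyTorch-style training.

```python
def build_sft_example(tokenizer, instruction: str, answer: str):
    """Turn one instruction -> ideal-response pair into a training example."""
    # A simple chat template (hypothetical): the exact format varies per model.
    prompt_ids = tokenizer.encode(f"User: {instruction}\nAssistant: ")
    answer_ids = tokenizer.encode(answer)

    input_ids = prompt_ids + answer_ids
    # The loss is computed only on the answer tokens: -100 marks positions
    # the cross-entropy loss should ignore.
    labels = [-100] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}
```

Masking the prompt tokens is what nudges the model toward producing the answer rather than continuing the question.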
This is the step that turns a base GPT-3 into an instruct model like InstructGPT. The difference is spectacular: the model finally starts answering.
2. RLHF (Reinforcement Learning from Human Feedback)
SFT alone isn't sufficient. It teaches the model a response style, but not the fine distinction between an "average" response and an "excellent" one.
Hence RLHF, in three sub-steps:
a) The model generates multiple possible responses to the same prompt.
b) A human ranks them (A > B > C).
c) We train a reward model that imitates human preferences, then optimize the LLM via reinforcement learning to maximize this reward.
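Step c) usually comes down to a pairwise preference loss on the reward model's scalar scores (a Bradley-Terry style objective). A minimal sketch, assuming the two scores for each comparison have already been computed:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor):
    # Pairwise preference loss: push the reward of the response the human
    # ranked higher above the reward of the one ranked lower.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```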
The result: a model that doesn't just respond, but responds as humans prefer models to respond. More polite, more structured, less arrogant, more useful.
2bis. DPO: PPO, made simpler
The RLHF we just described relies on an RL algorithm (PPO) that's heavy to train: separate reward model, numerical instability, huge compute cost.
In 2023, a Stanford team proposed DPO (Direct Preference Optimization). The idea: short-circuit the reward model and the RL step. Mathematically, you can derive a simple supervised loss that directly optimizes the LLM to prefer the "winning" response over the "losing" one for each comparison pair.
Concretely, starting from the same (prompt, response_A_better_than_B) pairs classical RLHF used, DPO trains the model in a single pass — like a regular supervised fine-tune. No separate reward model, no PPO, no instability.
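The DPO loss itself fits in a few lines. A sketch, assuming you have already computed the sequence log-probabilities of the winning and losing responses under the model being trained (the policy) and under a frozen reference copy of it:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta: float = 0.1):
    # How much more the policy prefers the winner than the reference does,
    # minus the same quantity for the loser.
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    # One supervised-style loss: no reward model, no RL loop.
    return -F.logsigmoid(beta * margin).mean()
```

The `beta` coefficient controls how far the model is allowed to drift from the reference, playing the role of the KL penalty in PPO-based RLHF.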
The result is nearly indistinguishable from PPO-RLHF on benchmarks, at a fraction of the cost. Since 2024, DPO and its variants (IPO, KTO, ORPO) have largely replaced classical PPO in the Llama and Mistral model families and at most open-source labs. Anthropic and OpenAI still use more complex pipelines, but the gap is closing.
You still see "RLHF" everywhere. It has become a generic term. Under the hood, it's increasingly DPO.
3. RLAIF / Constitutional AI
A variant: instead of humans, we use another model (often the same one) to provide feedback according to a written constitution — a set of principles ("don't give illegal instructions", "don't fabricate sources", "explain your reasoning when useful"…). This is called Constitutional AI.
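In the original Constitutional AI recipe, the constitution is applied through a critique-and-revise loop whose outputs then feed supervised fine-tuning and AI-labeled preference training. A rough sketch, where `llm` stands for any text-generation function and the principles and prompts are placeholders:

```python
PRINCIPLES = [
    "Do not give instructions for illegal activities.",
    "Do not fabricate sources.",
]

def constitutional_revision(llm, prompt: str) -> str:
    """Critique-and-revise loop: the model improves its own draft."""
    draft = llm(prompt)
    for principle in PRINCIPLES:
        critique = llm(
            f"Critique the following response against the principle "
            f"'{principle}'.\n\nResponse:\n{draft}"
        )
        draft = llm(
            f"Rewrite the response to address the critique.\n\n"
            f"Critique:\n{critique}\n\nResponse:\n{draft}"
        )
    # Revised drafts become SFT data; AI-ranked response pairs feed RLAIF.
    return draft
```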
Advantages: scalable (humans are expensive and slow), reproducible (the constitution is explicit), modifiable (you can adjust the principles without re-annotating everything).
This is the process Anthropic uses for Claude, and that many other labs have adopted since.
What alignment does not do
A few myths to dispel.
Alignment doesn't change the model's knowledge. If the raw model doesn't know that Napoleon died on Saint Helena, RLHF won't teach it that. RLHF changes how the model expresses what it knows, not the extent of what it knows.
Alignment isn't simple censorship. Refusing to give instructions for making a bomb isn't a blacklisted keyword: it's a learned policy that generalizes to indirect formulations and justifies the refusal.
Alignment isn't perfect. Jailbreaks (prompts that circumvent RLHF) still exist. Corpus biases partially persist. Hallucinations still exist, because the model sometimes gets more reward for seeming confident than for admitting it doesn't know.
Alignment has a cost. On some technical tasks, an aligned model is worse than a base model: it refuses to take risks, adds disclaimers, becomes overly cautious. This is called the alignment tax.
What about hallucinations?
Alignment improves a lot of things. It doesn't fix the fact that an LLM, by construction, is trained to always produce plausible text — even when it doesn't have the answer. That's what produces hallucinations.
Why RLHF doesn't erase the problem, and what actually works in practice (RAG, tools, extended reasoning, fine-tuning on uncertainty), is the subject of chapter 13, right after the agents chapter.
The open question
Alignment solves an immediate problem: making an LLM useful and generally reasonable. It doesn't exhaust the deeper question, sometimes called alignment with a capital A:
How do you guarantee that a system far more capable than a human acts in humanity's interest?
Today, we align through human feedback, because humans remain the best judges. When models become better than humans at the tasks we want to judge them on, this lever won't be enough. It's an open problem, and the subject of an entire branch of research.
End of part II
You've just traversed the entire internal pipeline of a modern LLM, from raw text bytes to aligned behavior:
- 01 — Predict the next word, again and again.
- 02 — Tokenize the text.
- 03 — Embed each token in a space of meaning.
- 04 — Let tokens look at each other via attention.
- 05 — Stack Transformer blocks.
- 06 — Train by gradient descent.
- 07 — Sample the next word.
- 08 — Align on human preferences.
None of these mechanisms is mysterious in isolation. None of them, alone, is enough to explain what you see when an LLM summarizes a scientific paper or writes a sonnet: intelligence emerges from their composition at scale.
The miracle isn't in any one of the pieces. It's in the entire chain, multiplied by billions of parameters, and trained on trillions of tokens.
What's next?
The model is ready. It can predict, reason, follow instructions. But between it and the experience you have when you use ChatGPT or Claude, there's a whole infrastructure: the context window that defines what it remembers, RAG that gives it access to your documents, agents that connect it to tools.
That's the subject of part III — The model in production.
And beyond that, part IV — Going further dives into current research topics: fine-tuning, multimodality, extended reasoning, scaling laws, interpretability, diffusion.
The pipeline is in place. The rest is everything we build on top of it.