Chapter 06 · Training · 10 min

How it learns

Loss, gradient descent, backpropagation. And why billions of parameters are needed.

Error, measured

At the start, the model is random. Give it "The sky is", and it'll predict "banana" with as much probability as "blue". What we want is for it to predict "blue" (or a plausible word).

To get there, two things are needed:

  1. A measure of how wrong it is.
  2. A mechanism to correct its parameters in the right direction.

That's all training is.

Cross-entropy, without the formula

At each step, we give the model a piece of text. It predicts the next token as a distribution (chapter 01). We look at the probability it assigned to the token actually present in the text. If it's high, it was right. If it's low, it was wrong.

Cross-entropy measures this error in log-probability:

Loss = −log(Probability assigned to the correct token)

When the model is confident and correct, the loss is small. When it is confident and wrong, the loss explodes.

It's a harsh measure: assigning 0.01% to the right answer costs much more than assigning 10%. The model learns to avoid false certainties.
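That asymmetry is easy to check numerically (a minimal sketch; in real training this loss is averaged over every token position in every sequence of the batch):

```python
import math

def cross_entropy(p_correct: float) -> float:
    """Loss for the probability the model assigned to the actual next token."""
    return -math.log(p_correct)

# Confident and correct: small loss.
print(round(cross_entropy(0.90), 2))    # 0.11
# Hedging on the right answer: moderate loss.
print(round(cross_entropy(0.10), 2))    # 2.3
# Near-certainty placed elsewhere: the loss explodes.
print(round(cross_entropy(0.0001), 2))  # 9.21
```

Going from 10% to 0.01% on the right answer quadruples the loss; going from 90% to 10% merely multiplies it by about twenty. The logarithm is what makes false certainty so expensive.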

Descending the slope

Once the loss is calculated, how do we adjust the parameters?

Imagine the loss as a surface in a gigantic space (as many dimensions as parameters: billions). The model is a point on this surface. We want it to descend toward the valleys.

The algorithm is called gradient descent: at each step, we compute the direction of steepest descent (the gradient) and move a little in that direction.

Parameters ← Parameters − η × Gradient

η (eta) is the learning rate: the step size. Too small, we don't move forward. Too large, we jump over the valley and diverge.
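The update rule fits in one line of code. Here is a minimal sketch on a one-dimensional "surface" whose minimum we know in advance (real training computes the gradient by backpropagation over billions of dimensions, but the step itself is exactly this):

```python
# Toy loss surface: loss(w) = (w - 3)**2, whose gradient is 2*(w - 3).
# The valley floor is at w = 3.
def gradient(w):
    return 2 * (w - 3)

w, eta = 0.0, 0.1  # start far from the valley, with a modest learning rate
for _ in range(50):
    w = w - eta * gradient(w)  # Parameters ← Parameters − η × Gradient

print(round(w, 4))  # ≈ 3, the bottom of the valley
```

Try eta = 1.5 in this sketch and w shoots off to infinity: that is the divergence regime described below.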

The loss curve drops in steps; each step corresponds to a new pattern the model has just finished learning. The learning-rate regimes expose the classic pitfalls: too low and the model stalls; too high and it diverges.

Three regimes to observe in the visualization:

  • Very low LR (≤ 0.001) — the curve descends, but slowly. The model learns, but we don't have time to wait.
  • Optimal LR (≈ 0.01) — steady descent, low asymptote. That's the goal.
  • Too high LR (≥ 0.05) — the loss oscillates, or even diverges. The model "jumps" over minima without being able to settle there.

In practice, we dynamically adjust the learning rate during training: linear warmup at the start (to avoid breaking everything), then cosine decay.
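A warmup-then-cosine schedule can be sketched in a few lines (the specific numbers here are illustrative, not taken from any particular model):

```python
import math

def lr_at(step, max_lr=3e-4, warmup=2000, total=100_000, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup           # linear ramp-up
    progress = (step - warmup) / (total - warmup)      # goes 0 -> 1 after warmup
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(0))        # tiny: we start gently
print(lr_at(2000))     # max_lr: warmup is over
print(lr_at(100_000))  # min_lr: end of training
```

The warmup protects the randomly initialized model from huge early gradients; the decay lets it settle into a minimum instead of bouncing around it.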

Adam: gradient descent, but better

The equation Parameters ← Parameters − η × Gradient describes pure gradient descent. In practice, no one uses it as-is to train an LLM.

The reference optimizer is called Adam (and its modern variant AdamW). The idea: instead of blindly stepping in the current gradient's direction, keep a running memory of the average recent direction (the momentum) and of the gradient's variance for each parameter.

  • Parameters whose gradient consistently points the same way take big steps.
  • Parameters that oscillate (noisy gradient) take small steps.

Adam thus adapts automatically to each parameter, where SGD applies the same learning rate to all. It's more stable, and it converges much faster in practice. AdamW (the variant most used today) adds a regularization called weight decay that keeps weights from blowing up over the course of training.
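The two running averages can be sketched for a single parameter (a minimal sketch with the common default hyperparameters; real implementations vectorize this over all parameters at once, and AdamW additionally subtracts lr × weight_decay × w):

```python
def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for one parameter. t is the step count, starting at 1."""
    m = b1 * m + (1 - b1) * grad             # momentum: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2        # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)  # per-parameter adaptive step
    return w, m, v
```

A parameter whose gradient always points the same way gets m_hat ≈ grad and v_hat ≈ grad², so the step is close to the full lr. A parameter whose gradient flips sign keeps m_hat near zero while v_hat stays large, so its steps shrink: that is the adaptivity described above.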

Today, training an LLM without AdamW is about as rare as coding in assembly without a good reason.

Backpropagation, in one sentence

To compute the gradient — meaning knowing how each parameter influences the loss — we use backpropagation. It's an algorithm that propagates the error from the output toward the input of the network, layer by layer, applying the chain rule of differentiation.

You don't need to understand the derivation to intuit what happens. Think of it as:

"If I had turned this knob by 0.001 units, would the loss have gone up or down, and by how much?"

The answer to that question, for each of the model's billions of knobs, in parallel, is exactly what backprop does. That's what makes it possible to train a 70-billion-parameter model in about ten days on a cluster.
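The knob question can be answered by brute force on a toy two-knob model (a sketch: backprop gets the same numbers analytically, for every knob at once, in a single backward pass, which is why it scales to billions of parameters where this finite-difference approach never could):

```python
def loss(w1, w2):
    pred = w1 * 2.0 + w2        # toy model: a prediction from two knobs
    return (pred - 5.0) ** 2    # squared error against the target 5.0

h = 0.001                       # turn each knob by 0.001 units
base = loss(1.0, 1.0)
d1 = (loss(1.0 + h, 1.0) - base) / h   # ~ d(loss)/d(w1), analytically -8
d2 = (loss(1.0, 1.0 + h) - base) / h   # ~ d(loss)/d(w2), analytically -4
print(d1, d2)
```

Both derivatives are negative: turning either knob up would lower the loss, so gradient descent turns them up. Note the cost: answering the question this way takes one extra forward pass per knob, whereas backprop answers it for all knobs with one backward pass.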

Pre-training = reading the internet

For an LLM to learn something useful, it needs a lot of text. Really a lot. Modern models see:

  • 1 to 15 trillion tokens during pre-training
  • Filtered text (Common Crawl, Wikipedia, books, code, papers)
  • Passing multiple times over certain parts (the best books, several epochs)

Throughout this process, the task is always the same: predict the next token. No labeled question-answer pairs, no "here's the correct translation," no human reward. Just raw text, and the objective of predicting what follows.

That's what we call self-supervision: the data provides its own "labels." No humans needed for annotation — just have text.

Batch size: how many examples at a time

You never compute the gradient on a single example. You group several sequences into a batch, compute the average gradient over the whole batch, then update the parameters once.

The bigger the batch, the more stable the gradient (less noise), and the larger the learning rate you can afford. But you need enough GPU memory to hold it all.

For modern LLMs, the effective batch size reaches several million tokens — usually obtained by combining:

  • The local batch (per GPU) — limited by VRAM.
  • Gradient accumulation — compute several small batches and only apply the update at the end.
  • Data parallelism — split the batch across dozens, hundreds, sometimes thousands of GPUs.

What engineers call the global batch size is the total amount of data contributing to a single optimization step. For GPT-4, we're talking millions of tokens per step.
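Gradient accumulation, the second ingredient, is simple enough to sketch (a toy version: grad_fn stands in for the real forward/backward pass, and the model is just pred = w × x):

```python
def grad_fn(w, batch):
    """Gradient of the mean squared error for the toy model pred = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train_step(w, micro_batches, eta=0.05):
    accumulated = 0.0
    for batch in micro_batches:           # several small forward/backward passes
        accumulated += grad_fn(w, batch)
    accumulated /= len(micro_batches)     # average over the whole global batch
    return w - eta * accumulated          # a single parameter update at the end

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
micro = [[point] for point in data]       # micro-batch of size 1, as if per GPU
w = 0.0
for _ in range(200):
    w = train_step(w, micro)
print(round(w, 3))  # converges toward 2.0
```

The update is mathematically identical to one big batch of four; only the memory footprint changes. Data parallelism does the same averaging, but across GPUs instead of across loop iterations.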

Data: half the work

We talk a lot about parameters. We talk less about data preparation, which in practice takes up half the time of any serious team training a model.

  • Filtering — strip out low-quality content (spam, 404 error pages, machine-generated content, product listings without context).
  • Deduplication — remove duplicates. Common Crawl contains many copies of the same pages; leaving them in makes the model memorize those pages instead of generalizing.
  • Mixing — balance sources (Wikipedia, books, code, scientific papers) by their pedagogical value, not their raw size.
  • Quality filtering — in the best teams, a classifier scores each document, keeping only what looks like a textbook or a well-researched article.
  • Decontamination — make sure evaluation benchmarks (MMLU, HumanEval…) don't leak into the training corpus.
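Two of these stages fit in a toy sketch (real pipelines use fuzzy deduplication like MinHash and trained quality classifiers; the exact-hash dedup and spam heuristic below are crude placeholders):

```python
import hashlib

docs = [
    "The mitochondria is the powerhouse of the cell.",
    "The mitochondria is the powerhouse of the cell.",  # exact duplicate
    "BUY NOW!!! CLICK HERE!!! BEST PRICE!!!",           # low-quality content
]

def dedup(documents):
    """Keep the first copy of each document, identified by its hash."""
    seen, kept = set(), []
    for d in documents:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept

def looks_like_spam(d):
    """Placeholder heuristic; real filters are trained classifiers."""
    return d.count("!") > 3 or "CLICK HERE" in d.upper()

clean = [d for d in dedup(docs) if not looks_like_spam(d)]
print(len(clean))  # 1
```

Three documents in, one out. At web scale, these two stages routinely discard the majority of raw Common Crawl.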

Brutal summary from a Meta researcher: "we spend 10% of the time training the model, 90% preparing the data."

That's also why the best open models (Llama, Mistral, DeepSeek) almost never disclose the details of their data recipe: it's their main competitive edge.

Overfitting and validation

The more you train a model, the more it knows its corpus by heart. At some point, it starts memorizing rather than generalizing. That's overfitting.

To detect it, we set aside a portion of the corpus — the validation set — on which we don't train. During training, we measure the loss on both. As long as both descend, all is well. When the validation loss rises while the training loss keeps descending, we're starting to overfit. That's the time to stop (or increase the data, or regularization).
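The stopping rule can be sketched in a few lines (the loss values here are made up to show the shape; in reality they come from periodic evaluations during the training loop):

```python
train_loss = [4.0, 3.0, 2.4, 2.0, 1.7, 1.5, 1.35, 1.25, 1.18, 1.12]
val_loss   = [4.1, 3.2, 2.6, 2.3, 2.1, 2.0, 2.02, 2.06, 2.11, 2.18]

def early_stop_step(val_losses, patience=2):
    """Return the evaluation at which to stop, or None to keep going.

    Stop once the validation loss has failed to improve on its best
    value for `patience` evaluations in a row."""
    best, worse_streak = float("inf"), 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best, worse_streak = loss, 0
        else:
            worse_streak += 1
            if worse_streak >= patience:
                return step
    return None

print(early_stop_step(val_loss))  # 7
```

The training loss keeps falling to the end; the validation loss bottoms out at 2.0 and then climbs. The rule fires two evaluations after the minimum: that gap is the price of being sure it is a trend, not noise.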

In the visualization, the dashed curve represents the validation loss. It rises slightly at the end — that's exactly it.

Compute = capacity

One last thing. Why are so many parameters and so much data needed? Because learning follows very regular scaling laws:

Loss = A × Compute^(−α)

Doubling the compute (parameters × data × iterations) divides the loss by a constant factor. The curve is smooth, predictable, over 6 orders of magnitude. The entire industry rests on this regularity: you know in advance that investing 10× more GPU will yield a measurable improvement.
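The regularity is visible as soon as the law is written in code (A and α here are illustrative values, not a fit to any published curve):

```python
A, alpha = 10.0, 0.05

def predicted_loss(compute):
    """Power-law scaling: Loss = A * Compute^(-alpha)."""
    return A * compute ** (-alpha)

# Each doubling of compute divides the loss by the same factor, 2**alpha.
for c in [1e21, 2e21, 4e21]:
    print(round(predicted_loss(c), 4))

print(round(predicted_loss(1e21) / predicted_loss(2e21), 4))  # 2**0.05, ~1.035
```

The factor does not depend on where you start: the tenth doubling buys the same relative improvement as the first. That is what makes GPU budgets plannable years in advance.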

We come back to this in detail in chapter 19 (Kaplan and Chinchilla scaling laws) — including why GPT-3 was undertrained on data, and what the optimal ratio between parameters and tokens really is.

What's next

The pre-trained model is now a very competent token predictor. It can complete any text with striking naturalness. But it's not yet an assistant.

Before getting there, one more thing to understand: once the distribution is predicted, how do we choose a token? We return to this point in the next chapter.
