Chapter 01 · Foundations · 6 min
Predicting one word at a time
What is a language model? Why predicting the next word is enough to make intelligence emerge.
The big surprise
Here is the strangest thing about modern AI: everything a large language model does rests on a single capability — predicting the next word.
You give the model the beginning of a sentence:
"The sky is blue because…"
The model calculates, among the tens of thousands of words it knows, which one is most likely to come next. Then it appends that word and repeats. And again. And again. That's all.
From this tiny, mechanical operation emerges everything else: translation, summarization, code, quantum physics explanations, jokes, poems.
Why it works
To predict the next word well, you need to understand an enormous amount about the world.
Consider this sentence:
"The doctor fired the nurse because she…"
To guess what follows, the model must know that "she" likely refers to the nurse (not the doctor) — it must understand grammar, context, perhaps even the social conventions of the medical profession.
Predicting words means modeling the world that produced them.
That is the central idea. Forcing a system to predict human text at scale forces it to learn, implicitly, how the world that produced that text works.
A distribution, not a word
When we say "the model predicts the next word," that's a shortcut. In reality, at each step, it produces a probability distribution over its entire vocabulary: every token receives a score, and they sum to 1.
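Concretely, that distribution comes from applying a softmax to the model's raw scores (the logits). A minimal sketch, using made-up logits for a four-token vocabulary:

```python
import math

def softmax(logits):
    # Shift by the max for numerical stability, then normalize to sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate next tokens
logits = [4.0, 2.5, 1.0, -1.0]
probs = softmax(logits)
print(probs)       # every token gets a probability...
print(sum(probs))  # ...and together they sum to 1.0
```

The exact numbers here are invented; the point is that higher logits get more probability mass, and the whole vocabulary shares a total of 1.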
To generate text, you must then choose a token from this distribution. That's where things get interesting: the same model, with the same prompt, can produce very different texts depending on the sampling strategy used.
At each step, the model proposes a distribution over every token in the vocabulary. The tallest bar is rarely the only plausible candidate — that's what makes the next part of a text open rather than mechanical.
Three levers you can play with in the visualization above:
- Temperature — divides the logits before the softmax. At low temperature (0.1–0.3), the distribution concentrates on the most likely candidate: the model becomes predictable, almost deterministic. At high temperature (1.5–2.0), it flattens: exotic options become credible again.
- Top-k — keeps only the k most probable candidates, eliminating the long tail of rare options.
- Top-p (nucleus sampling) — keeps the smallest set whose cumulative mass exceeds p. Smarter than top-k: if a step has an obvious answer, p can cut to just 1 candidate. If the model hesitates between 20 close options, it keeps them all.
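All three levers fit in a few lines of code. This is an illustrative sketch, not any particular library's API; the function and parameter names are assumptions:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature: divide the logits before the softmax.
    probs = softmax([x / temperature for x in logits])
    # Rank candidate tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        # Top-k: keep only the k most probable candidates.
        order = order[:top_k]
    if top_p is not None:
        # Top-p: keep the smallest set whose cumulative mass reaches p.
        kept, cum = [], 0.0
        for i in order:
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept
    # Renormalize over the survivors and draw one token index.
    total = sum(probs[i] for i in order)
    weights = [probs[i] / total for i in order]
    return random.choices(order, weights=weights)[0]
```

With a very low temperature or `top_k=1`, this collapses to always picking the tallest bar; raising the temperature spreads the draw across more candidates.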
Try the Capital prompt. The distribution is so peaked on Paris that temperature has almost no effect: you need to go above 1.8 before other options have a chance. The model is sure of itself.
Conversely, on The sky at the second step, several continuations are plausible (light, color, sea…) — that's where temperature really changes the result.
The loop that does everything
Everything an LLM does fits in this loop:
- Read the context (the tokens already present).
- Produce a probability distribution over the next token.
- Sample a token from this distribution.
- Add it to the context. Repeat.
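The four steps above can be sketched directly. Here a hypothetical `next_token_logits` stub stands in for the real network — only the loop structure matters:

```python
import math
import random

VOCAB = ["the", "sky", "is", "blue", "because", "light", "scatters", "."]

def next_token_logits(context):
    # Stand-in for the real model: a toy scorer that favors
    # tokens not already present in the context.
    return [0.0 if tok in context else 1.0 for tok in VOCAB]

def generate(context, steps):
    for _ in range(steps):
        # 1. Read the context; 2. get a distribution over the next token.
        logits = next_token_logits(context)
        weights = [math.exp(x) for x in logits]  # unnormalized softmax
        # 3. Sample a token; 4. add it to the context and repeat.
        token = random.choices(VOCAB, weights=weights)[0]
        context = context + [token]
    return context

print(generate(["the", "sky"], steps=4))
```

A real LLM differs only in step 2, where a Transformer with billions of parameters replaces the toy scorer; the loop around it is exactly this.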
It's mechanical, repetitive, dull to describe. And yet, executed billions of times on a model with hundreds of billions of parameters, this loop produces dialogues, demonstrations, code that compiles.
What's ahead
The journey is organized into four parts, from the most mechanical to the most advanced.
I. Anatomy of a model. Take the machine apart. Tokenization, embeddings, attention, Transformer — how text becomes a sequence of vectors that can be transformed.
II. Training and alignment. How those billions of parameters learn. Loss, gradient, sampling, RLHF — from random model to useful assistant.
III. The model in production. What actually happens when you send a prompt to ChatGPT or Claude. Context window, RAG, agents — the infrastructure that makes LLMs usable day to day.
IV. Going further. The topics on the current research frontier. Fine-tuning, multimodality, extended reasoning, scaling laws, interpretability, diffusion — to understand where this is going next.
Each chapter contains at least one interactive visualization. The goal isn't to make you memorize formulas, but to give you a mechanical intuition of what's happening inside.
Let's go.