Chapter 04 · Attention · 12 min
Attention is all you need
The mechanism that changes everything. How each token looks at all others to understand context.
The pronoun and the doctor
Let's return for a moment to a sentence we've already seen:
"The doctor fired the nurse because she was tired…"
For a human, "she" naturally refers to the nurse. For an LLM, this isn't obvious: at the moment it processes the token she, the word nurse appeared a few tokens earlier. How does it connect the two?
That's the role of attention.
Why attention exists
Before 2017, language models were mostly recurrent (RNN, LSTM): they read text token by token, propagating a "hidden state" that summarized everything they had seen so far.
The problem: this hidden state is a bottleneck. Everything must pass through it. As the sentence grows longer, old information dilutes. And learning is sequential — to process the 100th word, you must have processed the previous 99, which makes parallelization difficult.
The paper Attention Is All You Need (Vaswani et al., 2017) proposed a clean break:
No more recurrence. Each token looks directly at all the others, in parallel.
That's the mechanism that makes modern models possible.
The intuition
At each layer of the model, each token performs three operations:
- It asks a question to the rest of the sentence (the Query vector).
- Every other token displays a label summarizing what it is (the Key vector).
- The token compares its question to each label: where they match, it retrieves some content (the Value vector).
The result: a new representation for each token, which is a weighted sum of the others, where the weights come from Q-K matches.
No need to remember the formula — remember the idea. Each token looks at all the others and blends what it finds interesting.
Where do Q, K, and V actually come from?
Not from nowhere. For each token, we take its embedding vector x and multiply it by three matrices, learned during training and fixed afterwards:
Q = x · W_Q
K = x · W_K
V = x · W_V
These three matrices W_Q, W_K, W_V are the parameters of attention. They're shared across all tokens within a layer — that's what the model adjusts, via gradient descent, so that the right "questions" find the right "labels".
Written out, the operation is Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V. The √d_k factor keeps the dot products from blowing up at high dimension: without it, the pre-softmax values get huge, the softmax saturates, and the gradient dies. A technical detail, but a necessary one.
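To make the mechanics concrete, here is a minimal sketch of a single attention pass in Python/NumPy. The dimensions, matrices, and toy sequence are illustrative, not taken from any real model:

```python
import numpy as np

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    """Attention over a sequence of token embeddings X, shape (n, d_model)."""
    Q = X @ W_Q                              # each token's "question"
    K = X @ W_K                              # each token's "label"
    V = X @ W_V                              # each token's "content"
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # Q-K matches, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                       # weighted sum of the others' values

# toy example: 4 tokens, model dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out = scaled_dot_product_attention(X, W_Q, W_K, W_V)
print(out.shape)   # (4, 4): one new representation per token
```

Each row of weights sums to 1: that is the "blend of what it finds interesting" from the intuition above.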
Multiple questions at once
A single set of questions isn't enough. A token may need to look at its syntactic subject, its coreferent, and the main verb all at the same time.
Hence multi-head attention: instead of a single Q-K-V system, several run in parallel (typically 8, 16, or 32). Each learns to specialize in a type of relationship; a code sketch of the mechanism follows the list below. When you look at a trained model, you find heads dedicated to:
- local attention (each token looks at itself or its immediate neighbors)
- subject-verb binding
- coreference (pronouns back to their referent)
- delimiters (punctuation, start/end of sentence)
- rhyme or poetic structure
- things we don't know how to name
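As promised above, here is a minimal sketch of the multi-head version, reusing the single-head function and toy tensors from the previous snippet; the head count and dimensions are again illustrative:

```python
def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples, one per head.
    Each head attends independently; the outputs are concatenated,
    then projected back to the model dimension with W_O."""
    per_head = [scaled_dot_product_attention(X, W_Q, W_K, W_V)
                for (W_Q, W_K, W_V) in heads]
    return np.concatenate(per_head, axis=-1) @ W_O

# 2 heads of dimension 4 on the same 4 x 8 toy embeddings
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_O = rng.normal(size=(2 * 4, 8))   # projection back to d_model = 8
print(multi_head_attention(X, heads, W_O).shape)   # (4, 8)
```

Each head sees the same input but learns its own W_Q, W_K, W_V, which is what lets different heads specialize in different relationships.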
Explore it
The visualization below shows, on two sentences, what the attention of different heads can look like. The patterns are stylized (real weights would come from a trained model), but each head corresponds to behavior actually observed in current models.
Each row shows how a token looks at every other one. Some heads track syntax (subject ↔ verb), others capture semantics (referents, antecedents). None of these patterns is hand-coded — they emerge from training.
Three things to try:
- On "The cat sleeps" with the "Subject ↔ verb" head, look at the "sleeps" row. The strongest weight goes to "cat". The verb has "found" its subject.
- On "Coreference" with the "Coreference" head, look at the "he" row. The strongest weight points back to "baby". That's exactly the mechanism that resolves the pronoun puzzle.
- On any head, look at the upper-right triangle: it's gray. That's the causal mask — a token can only look at tokens that precede it. This is what forces the model to predict, not to copy.
Causal or bidirectional?
Not all attention is equal. Two regimes exist.
Bidirectional. Each token sees all the others, both backward and forward. That's what BERT (Google, 2018) and the encoder side of T5 use. These models excel at understanding a sentence — classification, extractive Q&A, semantic search — but they don't generate text token by token.
Causal. Each token sees only the previous ones. That's the triangular mask you saw earlier. This constraint is what makes autoregressive generation possible: to predict the next word, the model must work from the past alone.
GPT, Claude, Llama, Gemini, Mistral — every consumer-facing LLM uses causal attention. That mask is what makes them able to predict, not just describe.
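Here is a sketch of the causal variant, reusing the single-head function from earlier in the chapter; the only change is that future positions are masked out before the softmax:

```python
def causal_attention(X, W_Q, W_K, W_V):
    """Same as scaled_dot_product_attention, but token i only sees tokens j <= i."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # strict upper triangle
    scores = np.where(future, -np.inf, scores)           # future scores -> -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # masked entries become 0
    return weights @ V
```

The -inf entries turn into zero weights after the softmax: that is the gray upper-right triangle from the visualization.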
Attention is expensive
This elegance has a cost. For a sequence of length n, computing the attention matrix requires O(n²) operations. Doubling the context size quadruples the cost.
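A back-of-the-envelope illustration of that quadratic growth, counting only the pairwise scores in a single n × n attention matrix:

```python
# one n x n matrix of attention scores per head, per layer
for n in (2_048, 8_192, 200_000):
    print(f"{n:>9,} tokens -> {n * n:>16,} pairwise scores")
#     2,048 tokens ->        4,194,304 pairwise scores
#     8,192 tokens ->       67,108,864 pairwise scores
#   200,000 tokens ->   40,000,000,000 pairwise scores
```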
That's why context windows were limited to 1,024 tokens in GPT-2 and 2,048 in GPT-3, and it took algorithmic tricks (FlashAttention, sliding window, sparse attention) to reach the 200,000 tokens of today. We come back to these techniques in chapter 18, where they tie into KV-cache memory.
Attention is powerful; it's also the main bottleneck of modern LLMs.
What's next
Attention alone doesn't make a language model. You need to stack it in successive blocks, add feed-forward computation layers, normalizations, residual connections. That's the complete Transformer architecture — the subject of the next chapter.