Chapter 05 · Architecture · 14 min

The Transformer, in full

Putting the pieces together: multi-head attention, feed-forward, normalization, residual connections.

All that for this

We now have all the puzzle pieces:

  • Text becomes tokens (chapter 02).
  • Tokens become vectors in a space of meaning (chapter 03).
  • Attention lets each vector look at the others and reconfigure itself (chapter 04).

All that remains is understanding how we assemble these pieces into a complete model. The answer, elegantly minimalist: we stack them.

The basic block

A modern Transformer is the same small block, repeated N times. This block contains two sub-modules:

  1. A multi-head attention layer, which allows tokens to communicate with each other.
  2. A feed-forward network (FFN) — two linear transformations separated by a nonlinearity — which transforms each token independently.

Around these two sub-modules, two fixed structures:

  • LayerNorm before each sub-module: normalizes the vectors to stabilize learning.
  • Residual connections around each sub-module: the block's output is the input plus the transformation, never the transformation alone.

Attention diffuses information between tokens, the feed-forward transforms it locally, and normalization and residuals keep everything stable. Stacked 32 or 96 times, these blocks make a GPT-4 or a Claude.
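
In code, one such block is only a few lines. Here is a minimal sketch in PyTorch, assuming the pre-norm layout described above (names like TransformerBlock are illustrative; dropout, KV caching and other production details are omitted):

  import torch.nn as nn

  class TransformerBlock(nn.Module):
      def __init__(self, d_model=768, n_heads=12):
          super().__init__()
          self.norm_1 = nn.LayerNorm(d_model)   # LayerNorm before attention
          self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          self.norm_2 = nn.LayerNorm(d_model)   # LayerNorm before the FFN
          self.ffn = nn.Sequential(             # expand, nonlinearity, contract
              nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

      def forward(self, x, causal_mask=None):   # x: (batch, seq, d_model)
          h = self.norm_1(x)
          a, _ = self.attn(h, h, h, attn_mask=causal_mask)
          x = x + a                              # residual: input + transformation
          x = x + self.ffn(self.norm_2(x))       # same pattern around the FFN
          return x

Stacking N of these blocks, with the embedding at the bottom and the output projection at the top, is, to a first approximation, the whole model.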

Why residuals change everything

This is probably the most important architectural trick of the decade. Without a residual connection, stacking 96 successive blocks means passing a signal through 96 cascading transformations. The gradient (the learning signal) dilutes at each pass. After a few layers, nothing is left to learn.

With a residual connection, the block learns a modification rather than a total transformation: output = input + f(input). The original signal always passes intact through the entire network, and each block enriches it a little.

Without residuals, a deep Transformer cannot train. With them, it can stack 100+ layers without issue.
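
A quick way to see the effect is a toy experiment (random weights, not a real model): push a vector through 50 small layers and measure how much gradient makes it back to the input, with and without the residual addition.

  import torch

  torch.manual_seed(0)
  depth, d = 50, 64
  layers = [torch.randn(d, d) * 0.05 for _ in range(depth)]   # small random weights

  def forward(x, residual):
      for W in layers:
          h = torch.tanh(x @ W)
          x = (x + h) if residual else h    # output = input + f(input), or f(input) alone
      return x

  for residual in (False, True):
      x = torch.randn(1, d, requires_grad=True)
      forward(x, residual).sum().backward()
      print(f"residual={residual}: gradient norm at the input = {x.grad.norm():.2e}")

Without the residual path the gradient collapses toward zero after a few dozen layers; with it, the identity path keeps the learning signal alive.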

The FFN, the forgotten half

We talk a lot about attention. We talk less about the FFN, which holds roughly twice as many parameters as the attention layer.

At each layer, after attention, each token passes through an MLP:

FFN(x) = Linear_2(GELU(Linear_1(x)))

Linear_1 projects the vector into an intermediate dimension 4× wider (typically 4 × 768 = 3072 for GPT-2 small). Linear_2 brings it back to the original dimension. This expand-then-contract is where the model stores the majority of its factual knowledge — proper nouns, learned associations, recurring patterns.

When people quote a model's parameter count in the tens or hundreds of billions, most of those weights live in the FFNs.
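
A back-of-the-envelope count at GPT-2-small sizes makes the ratio concrete (weight matrices only; biases and embeddings ignored):

  d = 768
  attention = 4 * d * d              # W_Q, W_K, W_V and the output projection: ~2.4M
  ffn = d * (4 * d) + (4 * d) * d    # Linear_1 (768 -> 3072) + Linear_2 (3072 -> 768): ~4.7M
  print(attention, ffn, ffn / attention)   # the FFN holds about twice the attention weights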

Position and causality

Two details we haven't mentioned yet:

Positional encoding. Attention is permutation-invariant: if you shuffle the tokens of a sentence, attention returns the same result (just shuffled). That's not what we want. For an LLM to know that "The cat eats the fish" differs from "The fish eats the cat", we inject position information into each embedding vector (positional encoding, RoPE, ALiBi…). Today, RoPE (rotary position embedding) is the convention.
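
For the curious, here is a stripped-down sketch of the idea behind RoPE, assuming the "rotate half" layout used by several open models (real implementations precompute and cache the angles):

  import torch

  def rope(x, pos, base=10000.0):
      # x: a query or key vector of even dimension d; pos: its integer position
      d = x.shape[-1]
      half = d // 2
      freqs = base ** (-torch.arange(half, dtype=torch.float32) * 2 / d)
      angles = pos * freqs                     # one rotation angle per 2D pair
      cos, sin = angles.cos(), angles.sin()
      x1, x2 = x[..., :half], x[..., half:]
      return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

The key property: the rotation angle depends only on the position, so the dot product between a rotated query and a rotated key depends only on their relative offset.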

Causal mask. As we saw in chapter 4, in a generation model, each token can only look at its predecessors. The causal mask is applied to the attention matrix: future positions are set to −∞ before softmax. This forces the model to predict, not to copy.
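
A minimal illustration of the mask (illustrative tensors, not tied to any particular library's attention code):

  import torch

  seq = 4
  scores = torch.randn(seq, seq)                # raw attention scores
  future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
  weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
  print(weights)                                # row i has non-zero weight only on positions 0..i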

How many blocks?

The architecture is the same from GPT-2 to GPT-4. What changes is the scale:

Model          Blocks   Dimension   Heads   Parameters
GPT-2 small    12       768         12      117M
GPT-2 XL       48       1600        25      1.5B
GPT-3          96       12288       96      175B
Llama 3 70B    80       8192        64      70B

More blocks = more compositional reasoning possible (each layer can build on the abstractions of the previous one). More dimensions = more capacity per token. More heads = more simultaneous "questions."

The output: from vector to distribution

At this point, the last block gives us, for every position, a vector of a few thousand dimensions. How do we turn that back into a distribution over the vocabulary?

One step. Multiply that vector by a matrix W_out of dimensions (d_model × |vocab|), then apply a softmax. The result: for each position, a probability over the ~50,000 tokens of the vocabulary. That's the output of an LLM — a distribution.

Elegant detail: in most models, W_out shares its weights with the input embedding matrix (weight tying). The same transformation that maps token 5234 into a vector, run in reverse, maps a vector back to the probability of token 5234. Saves parameters, generalizes better.
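
A sketch of that last step, with weight tying (toy sizes; the names are illustrative):

  import torch
  import torch.nn as nn

  d_model, vocab_size = 768, 50257
  embedding = nn.Embedding(vocab_size, d_model)   # token id -> vector, reused at the output

  def output_distribution(h):                     # h: (batch, seq, d_model), from the last block
      logits = h @ embedding.weight.T             # the embedding matrix, run "in reverse", is W_out
      return torch.softmax(logits, dim=-1)        # a probability over the ~50,000 tokens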

Mixture of Experts: not all parameters fire

A new architectural variant has become dominant in recent models: Mixture of Experts (MoE). Mixtral, DeepSeek, and Llama 4 use it, and GPT-4 and Gemini are widely reported to as well.

The idea: instead of a single FFN per block, you place several in parallel (typically 8 to 128 experts). For each token, a small routing network (router) picks two or four — the most relevant for that token. Only those experts fire.

Consequence: a model can have 400 billion "total" parameters but only activate 50 billion per token. Capacity of a large model, compute cost of a small one. That's what makes Mixtral 8×7B (47 billion parameters) competitive at inference with much larger dense models.
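
A toy sketch of the routing logic (hypothetical sizes and names; real implementations batch the expert computation and usually renormalize the top-k weights):

  import torch
  import torch.nn as nn

  d_model, n_experts, top_k = 768, 8, 2
  experts = nn.ModuleList([
      nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
      for _ in range(n_experts)])
  router = nn.Linear(d_model, n_experts)          # the small routing network

  def moe_ffn(x):                                 # x: (tokens, d_model)
      weights, idx = router(x).softmax(dim=-1).topk(top_k, dim=-1)
      out = torch.zeros_like(x)
      for e in range(n_experts):                  # only the chosen experts actually run
          for k in range(top_k):
              chosen = idx[:, k] == e
              if chosen.any():
                  out[chosen] += weights[chosen, k, None] * experts[e](x[chosen])
      return out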

The trade-off: VRAM has to fit all the experts (otherwise you swap), and routing adds a layer of training instability. Active research area.

The miracle of simplicity

The entire edifice rests on a single pattern, repeated, normalized, added to itself. No specific structures for grammar. No separate modules for semantics. No hard-coded linguistic rules.

A Transformer knows nothing about language. It just knows how to mix vectors by looking at who resembles whom.

All the complexity emerges from training — that's the subject of the next chapter.
