Chapter 03 · Embeddings · 10 min

The space of meaning

Words in a geometric space. King − Man + Woman = Queen, and other vector miracles.

An equation that shouldn't work

Consider this operation:

king − man + woman ≈ queen

It's an arithmetic equation, like 5 − 2 + 4 = 7. Except it's about words.

And it works. Not because someone programmed "king" and "queen" to resemble each other. But because each word was converted into a list of numbers — a vector — and the algebra of meaning becomes ordinary algebra.
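Here is a minimal sketch of that idea in NumPy, with invented two-dimensional vectors whose axes we pretend to know (real embeddings are learned, have hundreds of dimensions, and no hand-labeled axes):

    import numpy as np

    # Invented 2-dimensional vectors, axes chosen by hand as [royalty, gender].
    # Real embeddings are learned from data and have no named axes.
    vectors = {
        "king":  np.array([0.9,  0.8]),
        "queen": np.array([0.9, -0.8]),
        "man":   np.array([0.1,  0.8]),
        "woman": np.array([0.1, -0.8]),
        "bread": np.array([-0.7, 0.0]),
    }

    # The "miracle" equation, done with ordinary arithmetic on the vectors.
    result = vectors["king"] - vectors["man"] + vectors["woman"]

    # Which known word is nearest to the result?
    nearest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - result))
    print(nearest)  # queen

In practice, the words appearing in the query (king, man, woman) are usually excluded from the nearest-neighbor search; the principle stays the same.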

This is the most counterintuitive idea in LLMs, and the most powerful.

From token to position

In the previous chapter, we saw that text becomes a sequence of token IDs — integers like 5234 or 91. But a bare integer has no structure. Token 5234 is neither "close to" nor "far from" token 5235. They're just numbered.

For a model to compute with words, it needs a richer representation. The solution: associate each token with a vector of real numbers, typically 768, 1024, or 4096 of them. This is what we call an embedding.
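In code, that association is nothing more exotic than a big lookup table. A minimal sketch, with invented sizes (a 50,000-token vocabulary, 768 dimensions per token):

    import numpy as np

    vocab_size, dim = 50_000, 768

    # One row of 768 numbers per token ID. In a real model this matrix is a
    # learned parameter (e.g. torch.nn.Embedding); here it is just random.
    embedding_table = np.random.normal(scale=0.02, size=(vocab_size, dim))

    token_ids = [5234, 91]                # what the tokenizer produced (previous chapter)
    vectors = embedding_table[token_ids]  # "embedding" a token = looking up its row
    print(vectors.shape)                  # (2, 768)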

At the start of training, these vectors are random. Gradually, through predicting millions of next words, the model learns to arrange them so that words with similar meanings have similar vectors.

No one wrote this rule. It emerges from the prediction task.

Why it works

Think about what it means to accurately predict the word after "The king told his…". The good answers are daughter, wife, mother, queen — not engine or algorithm. A model that predicts these continuations well must know that these words are interchangeable in this context.

The most economical way to memorize this equivalence, when you have billions of parameters and billions of sentences, is to group "daughter", "wife", "mother", "queen" in the same region of the vector space. The gradient pushes in that direction at each iteration, without any human ever needing to label anything.

Embeddings are not designed. They are the geometric trace of the prediction task.

Explore the space

The space below is a cartoon in two dimensions — real embeddings have hundreds. But the essential properties are there: semantic clusters, neighborhood, vector arithmetic.

Each dot is a word projected into a meaning space. Neighbors share a theme — not a spelling. The arrow shows the vector arithmetic that makes King − Man + Woman = Queen possible.

Three things to notice:

  • Clusters appear without being named. Hover over cat and you'll see dog, mouse, lion. Hover over joy and you'll see love, fear, sadness. Categories don't exist in the data — they exist in the geometry.
  • Certain directions have meaning. The vector from man to woman is roughly the same as the one from king to queen, or from father to mother. This regularity is what makes the arithmetic work.
  • Distances are relative, not absolute. The fact that cat sits at distance 0.32 from dog means nothing in itself. What matters is that it's closer to dog than to bread or anger.

The 2D illusion

In real models, an embedding typically has between 768 and 4096 dimensions. Why so many?

Because in 2D, you're forced to make compromises. cat must be close to dog (domestic animals), to mouse (mammals), to tiger (felines), to bird (animals in general). All these "proximities" pull in different directions — and in 2D, they conflict.

At 768 dimensions, each facet of meaning can have its own direction. The word cat can be close to dog along the "pet" axis, close to tiger along the "feline" axis, close to mouse along the "small mammal" axis. The space is vast enough for all these relationships to coexist without colliding.

Humans can't visualize 768 dimensions, and embeddings don't ask us to: they simply use those dimensions to store all of these relationships without collision.
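One way to build intuition for this, sketched below with random vectors: directions drawn at random in a high-dimensional space are almost orthogonal to one another, which is what leaves room for so many independent facets of meaning.

    import numpy as np

    rng = np.random.default_rng(0)

    def mean_abs_cosine(dim, n=200):
        # Draw n random directions and measure how aligned they look on average.
        v = rng.normal(size=(n, dim))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        sims = v @ v.T
        off_diagonal = sims[~np.eye(n, dtype=bool)]  # ignore self-similarity (always 1)
        return np.abs(off_diagonal).mean()

    for dim in (2, 16, 768):
        print(dim, round(mean_abs_cosine(dim), 3))
    # In 2 dimensions, two random directions are often quite aligned (~0.64 on average);
    # in 768 dimensions, they are almost all nearly orthogonal (~0.03).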

Measuring closeness: cosine similarity

When we say two words are "close" in embedding space, how do we actually measure that? Not with plain Euclidean distance. With cosine similarity.

The idea: look at the angle between two vectors, not their length. Two vectors pointing in the same direction have cosine similarity 1, regardless of their magnitudes. Two orthogonal vectors give 0. Two opposite ones, −1.

cos(u, v) = (u · v) / (||u|| × ||v||)

Why this measure rather than another? Because the norm of an embedding (its length) varies for reasons unrelated to meaning — word frequency, layer depth. The direction encodes meaning. Cosine similarity isolates what counts.
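In code, the formula is a one-liner. A small sketch with made-up vectors, to show that the length of the vectors plays no role:

    import numpy as np

    def cosine_similarity(u, v):
        # Compare directions only: +1 same direction, 0 orthogonal, -1 opposite.
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    u = np.array([1.0, 2.0, 3.0])
    v = np.array([2.0, 4.0, 6.0])    # same direction as u, twice as long
    w = np.array([-3.0, 0.0, 1.0])   # orthogonal to u (their dot product is 0)

    print(cosine_similarity(u, v))   # 1.0, the length difference is invisible
    print(cosine_similarity(u, w))   # 0.0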

This is also the measure that powers all modern semantic search: vector databases (Pinecone, pgvector, Chroma…) index millions of vectors and find the nearest neighbors of a query in a blink. We come back to this in chapter 10 (RAG).
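Under the hood, that search boils down to computing the similarity above against every stored vector and keeping the best scores. A brute-force sketch, with random vectors standing in for document embeddings (real vector databases rely on approximate indexes rather than scanning everything):

    import numpy as np

    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(10_000, 768))  # stand-ins for 10,000 document embeddings
    query = rng.normal(size=768)             # stand-in for the query's embedding

    # Normalize once, so cosine similarity becomes a plain dot product.
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)

    scores = corpus_n @ query_n              # cosine similarity with every stored vector
    top5 = np.argsort(scores)[-5:][::-1]     # indices of the 5 nearest neighbors
    print(top5, scores[top5])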

Consequences

This geometric representation has surprising effects:

  • Spelling mistakes barely matter. hello and helo end up with very close embeddings, so the model "understands" them almost identically — even though at the token level, they're totally different.
  • Biases get embedded. If, in the training corpus, nurse appears more often in feminine contexts and doctor in masculine ones, the arithmetic of embeddings will reflect it. doctor − man + woman might give nurse. Much work goes into correcting these biases — we'll revisit this in chapter 8.
  • Everything becomes computable. Once meaning has become a vector, you can add, project, measure angles. That's exactly what the next mechanism does.

What's next

Your word has become a vector. The word next to it too. And the one before. How, from this sequence of vectors, does the model decide that in "The doctor fired the nurse because **she**…", the pronoun she refers to the nurse and not the doctor?

Answer in the next chapter: attention, the mechanism that lets each token look at all the others before deciding who it is.

