
Meaning Becomes Geometry

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean · Google · 2013 · arXiv:1301.3781

Word2Vec learns a vector for every word such that words with similar meanings have similar vectors. The famous result: king − man + woman ≈ queen. Meaning has geometric structure, and that structure makes a new kind of search possible.

Context is meaning

Words that appear in similar contexts have similar meanings. "King" and "queen" both appear near "throne," "crown," "rule." "Cat" and "dog" both appear near "pet," "feed," "vet." Linguists called this the distributional hypothesis in the 1950s. Mikolov's team made it computable.

Word2Vec trains a shallow neural network on a simple task: given a word, predict the words around it (or vice versa). The network has a hidden layer of, say, 300 neurons. After training on billions of words of text, those 300 numbers for each word become its embedding -- a point in 300-dimensional space. Words that appear in similar contexts end up near each other. Words that don't, don't.
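The prediction task can be made concrete by looking at the training pairs it generates. The sketch below builds (center, context) pairs for the skip-gram variant: the network is shown the center word and trained to predict each of its neighbors within a window. This is a toy illustration of the data setup, not the paper's implementation.

```python
# Skip-gram framing: each word's "labels" are simply its neighbors
# within a fixed window. Toy sketch, not the original implementation.

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: the network sees `center`
    and is trained to predict each `context` word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the king sits on the throne".split()
pairs = skipgram_pairs(sentence, window=2)
```

Every pair is one tiny prediction problem; over billions of words, words with the same neighbors end up with the same hidden-layer activations, which is the embedding.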

The model never sees a dictionary. It never reads a definition. It learns meaning entirely from patterns of co-occurrence. The geometry emerges from the statistics of language.

Meaning as arithmetic

The result that made the paper famous: take the vector for "king," subtract "man," add "woman," and the nearest vector is "queen." This is not a party trick. It means the model learned that "royalty" and "gender" are independent dimensions of meaning. The direction from "man" to "woman" is the same as the direction from "king" to "queen." Meaning has geometric structure.

Figure: "gender" and "royalty" as independent directions in embedding space, with man, woman, king, and queen at the corners: king − man + woman ≈ queen.

The same arithmetic works across many relationships. Paris − France + Italy ≈ Rome. Walking − walk + swim ≈ swimming. The model discovers analogies as parallel lines in vector space. Each consistent relationship -- capital-of, tense, gender, size -- carves out its own direction in the geometry.
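The arithmetic can be shown with hand-made vectors. In the toy sketch below the two dimensions are labeled "gender" and "royalty" for clarity; this labeling is an illustrative assumption, since real Word2Vec dimensions are learned and unlabeled, and there are hundreds of them.

```python
# Toy illustration of analogy arithmetic on word vectors.
# Dimensions are hand-labeled (gender, royalty) for clarity only --
# real embeddings have hundreds of unlabeled, learned dimensions.
import math

vecs = {
    "man":   [ 1.0, 0.0],
    "woman": [-1.0, 0.0],
    "king":  [ 1.0, 1.0],
    "queen": [-1.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in
          zip(vecs["king"], vecs["man"], vecs["woman"])]

# nearest remaining word by cosine similarity
nearest = max(
    (w for w in vecs if w not in ("king", "man", "woman")),
    key=lambda w: cosine(vecs[w], target),
)
# nearest == "queen"
```

Subtracting "man" removes the male direction, adding "woman" adds the female one, and the royalty component rides along untouched; that is what "independent dimensions of meaning" cashes out to.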

Why this matters for search

Traditional search matches keywords. If you search "heart attack" you won't find a page about "myocardial infarction" unless someone manually added synonyms. The two phrases point to the same thing, but keyword search can't see that. The vocabulary gap is the oldest problem in information retrieval.

Embedding-based search matches meaning. Embed the query. Embed the documents. Find the documents whose vectors are closest to the query's vector. Two people describing the same idea in different vocabularies can find each other, because their embeddings land in the same region of vector space.
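A minimal sketch of that retrieval loop, assuming the embeddings already exist: rank documents by cosine similarity to the query vector. The 3-dimensional vectors below are hypothetical stand-ins; in practice a trained model produces them.

```python
# Embedding-based retrieval sketch: rank documents by cosine
# similarity to the query's vector. Vectors are made-up 3-D
# stand-ins for what a trained embedding model would produce.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

docs = {
    "myocardial infarction treatment": [0.9, 0.1, 0.2],
    "heart attack symptoms":           [0.8, 0.2, 0.1],
    "gardening in spring":             [0.0, 0.9, 0.4],
}

query_vec = [0.85, 0.15, 0.15]   # pretend embedding of "heart attack"

ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec),
                reverse=True)
```

The query never contains the word "myocardial," yet that document ranks near the top, because the two phrasings land in the same region of the space. That is the vocabulary gap closing.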

This is Bush's associative trails, realized as geometry. Bush wanted a machine that could link documents by association rather than by index. Embeddings do exactly this -- but the associations are computed from the statistics of language, not hand-built by a researcher. The memex was personal. The embedding space is shared.

Retrieval without reproduction

An embedding is a lossy compression of meaning. You can't reconstruct the original text from its embedding -- information is destroyed in the transformation. A 300-dimensional vector cannot encode the full content of a paragraph, let alone a book. It encodes what the text is about, not what the text says.

This matters legally. In Authors Guild v. Google, the Second Circuit held that scanning and indexing twenty million books was fair use because Google's search results showed only snippets, not full text. The index was transformative: it turned books into a searchable database.

Embedding-based indexing is on even stronger ground. A snippet reproduces some of the original. An embedding reproduces none of it. The transformation is more complete, the copy more lossy, the purpose more clearly different from the original. If full-text indexing is fair use, then embedding-based indexing -- which contains strictly less of the original work -- should be too.

The commons implication

Keyword search required centralized infrastructure. You need to crawl the entire web, build a massive inverted index, serve it from data centers that cost billions. Only a few companies can afford this. The technology dictates the structure: search engines became gatekeepers because the economics demanded scale.

Embedding search can be distributed. Anyone can embed their own content and contribute vectors to a shared index. The vectors are small -- a few hundred numbers per chunk of text. They don't reproduce the original. They don't require a centralized crawl. A community can build a search index the same way it builds a wiki: each contributor adds their own piece.
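The contribution model can be sketched in a few lines. The shard structure and names below are illustrative assumptions, not a real protocol: each contributor embeds their own text locally and publishes only (id, vector) pairs, and merging shards into a shared index is just dictionary union.

```python
# Sketch of a community-built vector index: contributors publish
# only (id, vector) pairs -- no original text -- and the shared
# index is the union of their shards. Names are illustrative.

def merge_shards(*shards):
    """Combine per-contributor vector shards into one shared index."""
    index = {}
    for shard in shards:
        index.update(shard)
    return index

# Hypothetical contributions: small vectors, no reproduced text.
alice = {"alice/notes-on-mikolov": [0.2, 0.7, 0.1]}
bob   = {"bob/fair-use-summary":   [0.6, 0.1, 0.3]}

shared = merge_shards(alice, bob)
```

Because a shard carries no original text, contributing to the index is structurally like contributing a wiki page: additive, incremental, and under no one's central control.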

The technology doesn't require a gatekeeper. That is the fact that changes the game.


Neighbors
  • ⚖ Authors Guild v. Google 2015 — previous in the Commons collection: the legal precedent that makes embedding-based indexing defensible
  • 📡 Info Theory — embeddings are lossy compression: Shannon's rate-distortion theory explains how much meaning can be preserved in how few dimensions
  • 🤖 ML — the architecture that made large-scale embeddings practical: Word2Vec proved meaning could be geometry, Transformers scaled it
  • 🔗 Linear Algebra Ch.2 — vector spaces and inner products: word vectors are elements of a high-dimensional vector space; cosine similarity is the inner product normalized by vector norms
  • 🤖 ML Ch.4 — neural networks and representation learning: Word2Vec's skip-gram model learns representations by predicting context — the geometry emerges from the prediction task
  • 📡 IT Ch.2 — mutual information: Word2Vec maximizes mutual information between a word and its context — the "meaning as geometry" is encoded mutual information

This is the technical fact that makes everything else in the collection actionable. Meaning as geometry means retrieval without reproduction. Copyleft + embeddings = a search engine that stays on the commons side.