
Transformers

MML (CC BY 4.0) · D2L (CC BY-SA 4.0) · 11 of 12

Attention replaces recurrence: every token attends to every other token in parallel. Queries and keys score similarity; values are averaged by those scores. Multi-head attention learns different relationship types. Positional encoding adds sequence order, since attention alone is permutation-equivariant.

[Figure: attention heatmap for "the cat sat down"; rows = queries, columns = keys, darker = more attention]

          the   cat   sat   down
  the     .40   .30   .20   .10
  cat     .15   .45   .25   .15
  sat     .10   .35   .30   .25
  down    .05   .15   .35   .45

Scaled dot-product attention

The core operation: given queries Q, keys K, and values V, compute attention(Q, K, V) = softmax(QKᵀ/√d)V. The dot product measures similarity between each query and each key. Dot products grow in magnitude with the dimension d, so dividing by √d keeps the softmax from saturating. The result is a weighted sum of values.
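The formula can be sketched directly in Python/NumPy (the card's notation table writes this as the Scheme form (attention q keys values); the Python names and shapes below are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) V.

    Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values
```

When all keys are identical, every query attends uniformly and the output is just the mean of the value rows, which makes the "weighted sum" reading concrete.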


Multi-head attention

Instead of one attention function, split Q, K, V into h heads, run attention on each, then concatenate and project. Each head learns a different type of relationship: one head might attend to syntactic neighbors, another to semantic relatives. This diversity of heads is a major source of the transformer's expressiveness.
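The split/attend/concat/project pipeline can be sketched as follows (a minimal NumPy version; the projection matrices Wq, Wk, Wv, Wo and the layout are illustrative assumptions, not the card's Scheme code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Split the model dimension into h heads, attend per head, concat, project."""
    n, d = X.shape
    dh = d // h                                    # per-head dimension d_k
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # reshape (n, d) -> (h, n, dh): one independent attention problem per head
    split = lambda M: M.reshape(n, h, dh).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)   # (h, n, n)
    heads = softmax(scores) @ Vh                        # (h, n, dh)
    concat = heads.transpose(1, 0, 2).reshape(n, d)     # glue heads back together
    return concat @ Wo                                  # final output projection
```

Note that the total compute matches single-head attention with the same d; the heads buy representational diversity, not extra capacity.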


Positional encoding (sinusoidal)

Attention treats its input as a set, not a sequence: it's permutation-equivariant. To inject order, add a positional encoding to each token embedding. The sinusoidal version uses sin and cos at different frequencies: position i, dimension j gets sin(i/10000^(j/d)) for even j and cos(i/10000^((j−1)/d)) for odd j. The wavelengths form a geometric progression from 2π to 20000π.
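This is the card's (pos-encoding pos) form; a NumPy sketch of the same sin/cos recipe (the function name and even-d assumption are illustrative):

```python
import numpy as np

def pos_encoding(pos, d):
    """Sinusoidal positional encoding for one position (d assumed even)."""
    j = np.arange(0, d, 2)               # even dimension indices 0, 2, 4, ...
    angles = pos / (10000 ** (j / d))    # geometric progression of frequencies
    pe = np.empty(d)
    pe[0::2] = np.sin(angles)            # even dimensions: sin
    pe[1::2] = np.cos(angles)            # odd dimensions: cos
    return pe
```

At position 0 the vector is all 0s in the sin slots and all 1s in the cos slots, and each pair of dimensions traces a circle as the position advances, so relative offsets become linear functions of the encoding.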


Notation reference

Math                Scheme                       Meaning
softmax(QKᵀ/√d)V    (attention q keys values)    Scaled dot-product attention
Q, K, V             query, keys, values          Query, Key, Value projections
dₖ                  d-head                       Key/query dimension per head
PE(pos, i)          (pos-encoding pos)           Positional encoding vector

Translation notes

"Attention Is All You Need" (Vaswani et al., 2017) introduced the transformer. Its encoder-decoder architecture processes both input and output sequences; modern LLMs typically use decoder-only transformers. Self-attention has O(n²) complexity in sequence length, which motivates recent work on linear attention, sparse attention, and state-space models. The quadratic cost makes context density a practical concern: when the context window is finite, every token must earn its place.
