Attention replaces recurrence: every token attends to every other token in parallel. Query–key dot products compute similarity; values carry the content that gets mixed. Multi-head attention learns different relationship types. Positional encoding adds sequence order since attention is permutation-equivariant.
Scaled dot-product attention
The core operation: given queries Q, keys K, and values V, compute attention(Q,K,V) = softmax(QKᵀ/√d)V. The dot product measures similarity between query and key. The √d scaling prevents the softmax from saturating when dimensions are large. The result is a weighted sum of values.
import math

# Scaled dot-product attention
def dot(a, b):
    return sum(x*y for x, y in zip(a, b))

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    d = len(query)
    scale = math.sqrt(d)
    scores = [dot(query, k) / scale for k in keys]
    weights = softmax(scores)
    print("Scores: ", [round(s, 2) for s in scores])
    print("Weights:", [round(w, 2) for w in weights])
    d_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return out

query = [1.0, 0.5, 0.0]
keys = [[1,0,0], [0,1,0], [0,0,1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
result = attention(query, keys, values)
print("Output: ", [round(x, 2) for x in result])
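In self-attention, every token plays all three roles at once: its embedding serves as the query against every token's key, and the values are the same embeddings. A minimal self-contained sketch of that idea (with the simplifying assumption that Q = K = V = the raw embeddings, i.e. no learned projections):

```python
import math

# Sketch: self-attention where each token's embedding is its own
# query against all tokens' keys (Q = K = V, no learned projections).
def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    return [e / sum(exps) for e in exps]

def self_attention(tokens):
    d = len(tokens[0])
    scale = math.sqrt(d)
    out = []
    for q in tokens:  # each token attends...
        scores = [sum(a*b for a, b in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)  # ...to every token, itself included
        out.append([sum(wi * t[j] for wi, t in zip(w, tokens))
                    for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(tokens):
    print([round(x, 2) for x in row])
```

Each output row is a mixture of all three input embeddings, weighted by how similar that token's query is to each key.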
Multi-head attention
Instead of one attention function, split Q, K, V into h heads, run attention on each, then concatenate and project. Each head learns a different type of relationship: one head might attend to syntactic neighbors, another to semantic relatives. This is why transformers are expressive.
# Multi-head split and concatenation
def split_heads(vec, n_heads):
    d_head = len(vec) // n_heads
    return [vec[i*d_head:(i+1)*d_head] for i in range(n_heads)]

def concat_heads(heads):
    return [x for head in heads for x in head]

embedding = [0.2, 0.8, -0.3, 0.5]
print("Full embedding:", embedding)
heads = split_heads(embedding, 2)
print("Head 1: ", heads[0])
print("Head 2: ", heads[1])
print("Concatenated: ", concat_heads(heads))
print("n_heads=2, d_model=4 -> d_head=2")
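Putting the two pieces together, a sketch of multi-head attention: run scaled dot-product attention independently on each head's slice, then concatenate. This simplifies away the learned per-head projection matrices (a real transformer applies learned W_Q, W_K, W_V per head and an output projection W_O):

```python
import math

# Sketch: attention per head, then concatenate. Simplification:
# each head operates directly on its slice of the embedding;
# real transformers apply learned W_Q, W_K, W_V, W_O projections.
def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    return [e / sum(exps) for e in exps]

def head_attention(query, keys, values):
    scale = math.sqrt(len(query))
    w = softmax([sum(a*b for a, b in zip(query, k)) / scale for k in keys])
    return [sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))]

def multi_head(query, keys, values, n_heads):
    d_head = len(query) // n_heads
    def head_slice(vec, h):
        return vec[h*d_head:(h+1)*d_head]
    out = []
    for h in range(n_heads):  # one independent attention per head
        out += head_attention(head_slice(query, h),
                              [head_slice(k, h) for k in keys],
                              [head_slice(v, h) for v in values])
    return out  # concatenated head outputs

q = [0.2, 0.8, -0.3, 0.5]
kv = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
print([round(x, 2) for x in multi_head(q, kv, kv, n_heads=2)])
```

Because each head sees a different slice, its attention weights differ, which is the mechanism behind heads specializing in different relationships.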
Positional encoding (sinusoidal)
Attention treats its input as a set, not a sequence — it's permutation-equivariant. To inject order, add a positional encoding to each token embedding. The sinusoidal version uses sin and cos at different frequencies: position pos, dimension 2i gets sin(pos/10000^(2i/d)), and dimension 2i+1 gets cos of the same angle. The wavelengths form a geometric progression from 2π to 20000π.
Scheme
; Sinusoidal positional encoding
; PE(pos, 2i)   = sin(pos / 10000^(2i/d))
; PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
(define d-model 8)

(define (pos-encoding pos)
  (let loop ((i 0) (enc '()))
    (if (= i d-model)
        (reverse enc)
        (let ((angle (/ pos (expt 10000 (/ (* 2 (quotient i 2)) d-model)))))
          (loop (+ i 1)
                (cons (if (even? i) (sin angle) (cos angle))
                      enc))))))

(define (show-enc pos)
  (display "pos=") (display pos) (display ": ")
  (display (map (lambda (x) (/ (round (* x 1000)) 1000))
                (pos-encoding pos)))
  (newline))

(show-enc 0)
(show-enc 1)
(show-enc 2)
(show-enc 3)
; Each position gets a unique pattern
; Nearby positions have similar encodings
; The model can learn relative positions from these
Python
import math

# Sinusoidal positional encoding
d_model = 8

def pos_encoding(pos):
    enc = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

for pos in range(4):
    enc = [round(x, 3) for x in pos_encoding(pos)]
    print("pos={}: {}".format(pos, enc))
</pre>
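The "nearby positions have similar encodings" claim can be checked directly: dot products between encoding vectors shrink as positions move apart (at least over small offsets — the sinusoids do oscillate at larger distances). A quick self-contained check, reusing the same formula:

```python
import math

# Quick check: compare each position's encoding to position 0
# by dot product; similarity falls off for small offsets.
d_model = 8

def pos_encoding(pos):
    enc = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

base = pos_encoding(0)
for pos in range(4):
    sim = sum(a * b for a, b in zip(base, pos_encoding(pos)))
    print("pos={}: similarity to pos 0 = {:.3f}".format(pos, sim))
```

Position 0's encoding dotted with itself gives d/2 = 4 (the cos dimensions are all 1), and the similarity decreases for positions 1, 2, 3.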
Notation reference
Math                Scheme                     Meaning
softmax(QKᵀ/√d)V    (attention q keys values)  Scaled dot-product attention
Q, K, V             query, keys, values        Query, Key, Value projections
d_k                 d-head                     Key/query dimension per head
PE(pos, i)          (pos-encoding pos)         Positional encoding vector
Translation notes
"Attention Is All You Need" (Vaswani et al., 2017) introduced the transformer. The encoder-decoder architecture processes both input and output sequences. Modern LLMs typically use decoder-only transformers. The self-attention mechanism has O(n²) complexity in sequence length — the motivation behind recent work on linear attention, sparse attention, and state-space models. The quadratic cost makes context density a practical concern: when the context window is finite, every token must earn its place.
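To make the quadratic cost concrete: self-attention computes one score per (query, key) pair, so the score matrix has n·n entries. A back-of-the-envelope illustration (sequence lengths chosen arbitrarily):

```python
# O(n^2) in concrete terms: one dot product per (query, key) pair.
def attention_scores(n):
    return n * n

# Doubling the context quadruples the number of scores.
for n in [128, 512, 2048, 8192]:
    print("n={:>5}  scores={:>12,}".format(n, attention_scores(n)))
```

Going from a 512-token to an 8192-token context multiplies the score count by 256, which is why long-context work targets this term specifically.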