
Transformers

MML (CC BY 4.0) · D2L (CC BY-SA 4.0) · 11 of 12

Attention replaces recurrence: every token attends to every other token in parallel. Queries and keys score similarity; values are averaged by those scores. Multi-head attention learns different relationship types. Positional encoding adds sequence order, since attention alone is permutation-equivariant.

[Figure: attention heatmap for "the cat sat down"; rows = queries, columns = keys, darker = more attention]

          the   cat   sat   down
  the     .40   .30   .20   .10
  cat     .15   .45   .25   .15
  sat     .10   .35   .30   .25
  down    .05   .15   .35   .45

Scaled dot-product attention

The core operation: given queries Q, keys K, and values V, compute attention(Q, K, V) = softmax(QKᵀ/√d)V. The dot product measures similarity between each query and each key. Dot products grow in magnitude with the dimension d, so dividing by √d keeps the softmax from saturating. The result is a weighted sum of values.
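The formula can be sketched directly in Python/NumPy (the card's notation table writes this as the Scheme form (attention q keys values); the Python names and shapes below are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) V.

    Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values
```

When all keys are identical, every query attends uniformly and the output is just the mean of the value rows, which makes the "weighted sum" reading concrete.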


Multi-head attention

Instead of one attention function, split Q, K, V into h heads, run attention on each, then concatenate and project. Each head learns a different type of relationship: one head might attend to syntactic neighbors, another to semantic relatives. This diversity of heads is a major source of the transformer's expressiveness.
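The split/attend/concat/project pipeline can be sketched as follows (a minimal NumPy version; the projection matrices Wq, Wk, Wv, Wo and the layout are illustrative assumptions, not the card's Scheme code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Split the model dimension into h heads, attend per head, concat, project."""
    n, d = X.shape
    dh = d // h                                    # per-head dimension d_k
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # reshape (n, d) -> (h, n, dh): one independent attention problem per head
    split = lambda M: M.reshape(n, h, dh).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)   # (h, n, n)
    heads = softmax(scores) @ Vh                        # (h, n, dh)
    concat = heads.transpose(1, 0, 2).reshape(n, d)     # glue heads back together
    return concat @ Wo                                  # final output projection
```

Note that the total compute matches single-head attention with the same d; the heads buy representational diversity, not extra capacity.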


Positional encoding (sinusoidal)

Attention treats its input as a set, not a sequence: it's permutation-equivariant. To inject order, add a positional encoding to each token embedding. The sinusoidal version uses sin and cos at different frequencies: position i, dimension j gets sin(i/10000^(j/d)) for even j and cos(i/10000^((j−1)/d)) for odd j. The wavelengths form a geometric progression from 2π to 20000π.
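This is the card's (pos-encoding pos) form; a NumPy sketch of the same sin/cos recipe (the function name and even-d assumption are illustrative):

```python
import numpy as np

def pos_encoding(pos, d):
    """Sinusoidal positional encoding for one position (d assumed even)."""
    j = np.arange(0, d, 2)               # even dimension indices 0, 2, 4, ...
    angles = pos / (10000 ** (j / d))    # geometric progression of frequencies
    pe = np.empty(d)
    pe[0::2] = np.sin(angles)            # even dimensions: sin
    pe[1::2] = np.cos(angles)            # odd dimensions: cos
    return pe
```

At position 0 the vector is all 0s in the sin slots and all 1s in the cos slots, and each pair of dimensions traces a circle as the position advances, so relative offsets become linear functions of the encoding.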


Notation reference

Math                Scheme                       Meaning
softmax(QKᵀ/√d)V    (attention q keys values)    Scaled dot-product attention
Q, K, V             query, keys, values          Query, Key, Value projections
dₖ                  d-head                       Key/query dimension per head
PE(pos, i)          (pos-encoding pos)           Positional encoding vector

Translation notes

"Attention Is All You Need" (Vaswani et al., 2017) introduced the transformer. Its encoder-decoder architecture processes both input and output sequences; modern LLMs typically use decoder-only transformers. Self-attention has O(n²) complexity in sequence length, which motivates recent work on linear attention, sparse attention, and state-space models. The quadratic cost makes context density a practical concern: when the context window is finite, every token must earn its place.
