Scale the transformer, pretrain on next-token prediction over the internet, then align with human feedback. In-context learning emerges: the model learns new tasks from examples in the prompt without weight updates. The LLM is a foundation model — a general-purpose function approximator that specializes via prompting.
Tokenization (BPE concept)
Before a model sees text, it's split into tokens. Byte Pair Encoding starts with individual characters and repeatedly merges the most frequent adjacent pair. The result: common words are single tokens, rare words are split into subwords. This gives a fixed vocabulary that can represent any text.
# Simplified BPE tokenization
from collections import Counter

def count_pairs(tokens):
    return Counter(zip(tokens[:-1], tokens[1:]))

def merge_pair(tokens, pair):
    merged = pair[0] + pair[1]
    result = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
            result.append(merged)
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result

tokens = list("low low lower")
print("Start: ", tokens)
for step in range(3):
    pairs = count_pairs(tokens)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    tokens = merge_pair(tokens, best)
    print("Merge {}: {} -> {}".format(step+1, best, tokens))
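Training produces an ordered merge list; tokenizing new text means replaying those merges in the order they were learned. A minimal sketch of that second half (the `encode` helper and the two-merge list are illustrative, not a real tokenizer's API):

```python
# Replaying learned BPE merges on unseen text (illustrative sketch)
def encode(text, merges):
    tokens = list(text)
    for pair in merges:  # merges must be applied in the order they were learned
        merged = pair[0] + pair[1]
        result, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                result.append(merged)
                i += 2
            else:
                result.append(tokens[i])
                i += 1
        tokens = result
    return tokens

merges = [("l", "o"), ("lo", "w")]  # hypothetical learned merge list
print(encode("lowest", merges))  # -> ['low', 'e', 's', 't']
```

The common stem fuses into one token while the rare suffix stays split into characters. Real tokenizers work on bytes and store tens of thousands of merges, but the replay loop is the same idea.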
Next-token sampling (temperature / top-k)
The model outputs a probability distribution over the vocabulary. Temperature controls sharpness: T→0 makes it greedy (always pick the most likely), T→∞ makes it uniform (random). Top-k restricts sampling to the k most probable tokens. Together they control the creativity/coherence tradeoff.
import math

# Next-token sampling with temperature
def softmax_temp(logits, temperature):
    scaled = [x / temperature for x in logits]
    mx = max(scaled)
    exps = [math.exp(x - mx) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "sat", "on", "mat"]
logits = [2.0, 1.5, 0.5, 0.3, 0.1]
for temp, label in [(1.0, "normal"), (0.3, "sharp"), (3.0, "flat")]:
    probs = softmax_temp(logits, temp)
    print("T={} ({}):".format(temp, label))
    for word, p in zip(vocab, probs):
        print("  {}: {:.3f}".format(word, p))
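The code above only varies temperature; top-k can be sketched in the same style. A minimal standalone version (the `top_k_probs` and `sample` helpers, and the probability values, are illustrative):

```python
import random

# Top-k: keep only the k most probable tokens, renormalize, then sample
def top_k_probs(probs, k):
    cutoff = sorted(probs, reverse=True)[k - 1]  # k-th largest probability
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def sample(vocab, probs):
    return random.choices(vocab, weights=probs, k=1)[0]

vocab = ["the", "cat", "sat", "on", "mat"]
probs = [0.45, 0.30, 0.12, 0.08, 0.05]  # pretend model output
pruned = top_k_probs(probs, 2)  # only "the" and "cat" survive
for word, p in zip(vocab, pruned):
    print("{}: {:.3f}".format(word, p))
print("sampled:", sample(vocab, pruned))
```

With k=2, the tail tokens get probability zero and the survivors are rescaled to sum to 1 — the long tail of barely-plausible tokens can never be sampled, which is the coherence half of the tradeoff.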
Prompt template
In-context learning works by formatting examples into the prompt. The model doesn't update its weights — it uses the pattern in the context to infer the task. This is why prompt engineering matters: the format of the examples is the program. Fitting more signal into fewer tokens is the problem context density addresses.
Scheme
; Prompt template: few-shot in-context learning
; The model sees examples, then completes the pattern
(define (make-prompt examples query)
  (let ((formatted
         (map (lambda (ex)
                (string-append "Input: " (car ex) "\nOutput: " (cdr ex)))
              examples)))
    (string-append
     (apply string-append
            (map (lambda (s) (string-append s "\n")) formatted))
     "Input: " query "\nOutput:")))
(define examples
'(("happy" . "sad")
("hot" . "cold")
("big" . "small")))
(display (make-prompt examples "fast")) (newline)
(newline)
(display "---") (newline)
(display "The model sees the pattern (antonyms) and completes:") (newline)
(display "Expected completion: slow") (newline)
(newline)
; Token count matters: more examples = better pattern recognition
; but you're limited by the context window
(display "Examples: ") (display (length examples)) (newline)
(display "Context window: ~4k to ~1M tokens")
Python
# Prompt template for few-shot in-context learning
def make_prompt(examples, query):
    formatted = [
        "Input: {}\nOutput: {}".format(inp, out)
        for inp, out in examples
    ]
    return "\n".join(formatted) + "\nInput: {}\nOutput:".format(query)
examples = [
("happy", "sad"),
("hot", "cold"),
("big", "small"),
]
print(make_prompt(examples, "fast"))
print()
print("---")
print("The model sees the pattern (antonyms) and completes:")
print("Expected completion: slow")
print()
print("Examples:", len(examples))
print("Context window: ~4k to ~1M tokens")
Notation reference
Math               Scheme               Meaning
P(x_t | x_<t)      (softmax logits)     Next-token distribution
T (temperature)    temperature          Sharpness of sampling
top-k              k most probable      Restrict sampling to k tokens
RLHF               reward model + PPO   Align model to human preferences
Translation notes
The jump from transformer to LLM is primarily one of scale: billions of parameters, trillions of tokens, thousands of GPUs. Emergent abilities (chain-of-thought reasoning, code generation, tool use) appear at scale thresholds that we can't predict from smaller models. Whether this is true emergence or smooth scaling is debated. What's not debated: the practical capabilities are real. Whether scale alone produces general intelligence is a separate question — one that depends on whether the system has a closed consolidation loop, not just a larger forward pass.
Neighbors
📡 Shannon Ch.1 — next-token prediction is minimizing surprise (cross-entropy)
🧠 Lovelace Ch.8 — cognitive architectures and the question of understanding
📖 SICP Ch.14 — the metacircular evaluator: programs that interpret programs, self-reference all the way down
Ready for the real thing?
Dive into Deep Learning (d2l.ai) covers everything here with runnable PyTorch/JAX code, full math, and exercises. It's free, it's excellent, and it's where these chapters point when the scaffolding comes off.