Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)
Entropy H(X) = −∑ P(x) log2 P(x) is the average surprise of a random variable. A fair coin has maximum entropy (1 bit). A loaded coin has less. Certainty has zero.
From surprise to entropy
Self-information measures the surprise of a single event. Entropy averages that surprise over the entire distribution. If you know the probabilities, entropy tells you the minimum number of bits per symbol you need on average to encode messages from the source. Shannon developed this to answer a practical question from telegraphy: how to transmit messages efficiently.
Python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)
print(f"fair coin: {entropy([0.5, 0.5]):.4f} bits")
print(f"loaded coin: {entropy([0.9, 0.1]):.4f} bits")
print(f"fair die: {entropy([1/6]*6):.4f} bits")
print(f"certain: {entropy([1.0]):.4f} bits")
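The two notions above can be tied together directly: entropy is the probability-weighted average of each event's self-information, −log2 P(x). A minimal sketch in the same Python (the helper name `self_information` is mine, not from the original):

```python
import math

def self_information(p):
    # surprise of a single event with probability p, in bits
    return -math.log2(p)

probs = [0.9, 0.1]
# entropy = probability-weighted average of each event's surprise
h = sum(p * self_information(p) for p in probs)
print(f"surprise of p=0.9 event: {self_information(0.9):.4f} bits")
print(f"surprise of p=0.1 event: {self_information(0.1):.4f} bits")
print(f"weighted average (entropy): {h:.4f} bits")
```

The weighted average matches the loaded-coin entropy computed above: the rare event is very surprising, but it rarely happens, so the average stays low.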
Uniform distributions maximize entropy
Among all distributions over n outcomes, the uniform distribution P(x) = 1/n has maximum entropy H = log2(n). Any deviation from uniformity reduces entropy. This is because concentrating probability on some outcomes reduces average surprise.
Scheme
; Entropy is maximized by the uniform distribution
(define (log2 x) (/ (log x) (log 2)))
(define (entropy probs)
  (let loop ((ps probs) (h 0))
    (if (null? ps) h
        (let ((p (car ps)))
          (if (= p 0) (loop (cdr ps) h)
              (loop (cdr ps)
                    (- h (* p (log2 p)))))))))
; Maximum entropy for n outcomes = log2(n)
(display "max H for 2 outcomes: ") (display (log2 2)) (newline)
(display "max H for 6 outcomes: ") (display (log2 6)) (newline)
(display "max H for 8 outcomes: ") (display (log2 8)) (newline)
; Deviations from uniform always decrease H
(display "uniform (3): ") (display (entropy '(0.333 0.333 0.334))) (newline)
(display "skewed (3): ") (display (entropy '(0.7 0.2 0.1))) (newline)
(display "peaked (3): ") (display (entropy '(0.98 0.01 0.01)))
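As a numerical spot check of the claim (not a proof), one can sample many random distributions and confirm that none exceeds log2(n). The sampling approach here is my own illustration:

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 6
h_max = math.log2(n)   # entropy of the uniform distribution over n outcomes
random.seed(0)         # fixed seed for reproducibility
for _ in range(1000):
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    p = [w / total for w in weights]   # normalize to a valid distribution
    assert entropy(p) <= h_max + 1e-12
print(f"1000 random distributions over {n} outcomes: all H <= {h_max:.4f} bits")
```

Every sampled distribution sits at or below the uniform bound, consistent with the concavity argument: spreading probability evenly maximizes average surprise.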
Shannon's telegraph example
Shannon's original motivation: English text is not uniform. The letter E appears far more often than Z. Because the distribution is skewed, the entropy per letter is lower than log2(27) ≈ 4.75 bits (26 letters plus space). Shannon estimated about 1 to 1.5 bits per character for English once sequential dependencies are accounted for. That gap is what compression exploits, and it is the principle behind context density as a measure of how much meaning is packed into a given span of text.
Scheme
; English is not uniform: entropy per letter is much less than log2(27) ≈ 4.75 bits
(define (log2 x) (/ (log x) (log 2)))
(define (entropy probs)
  (let loop ((ps probs) (h 0))
    (if (null? ps) h
        (let ((p (car ps)))
          (if (= p 0) (loop (cdr ps) h)
              (loop (cdr ps)
                    (- h (* p (log2 p)))))))))
; Uniform over 27 symbols (26 letters + space)
(define uniform-27
  (let loop ((n 27) (acc '()))
    (if (= n 0) acc
        (loop (- n 1) (cons (/ 1.0 27) acc)))))
(display "uniform 27: ")
(display (entropy uniform-27))
(display " bits/char") (newline)
; Approximate English letter frequencies over 27 symbols (26 letters + space),
; highest first: E=12.7%, T=9.1%, A=8.2%, O=7.5%, I=7.0%, N=6.7%, ...
(define english-approx
  (list 0.127 0.091 0.082 0.075 0.070 0.067 0.063 0.060 0.054
        0.043 0.040 0.037 0.034 0.029 0.025 0.024 0.020 0.019
        0.015 0.010 0.007 0.007 0.006 0.005 0.002 0.001 0.001))
(display "English: ")
(display (entropy english-approx))
(display " bits/char") (newline)
; Shannon estimated 1.0-1.5 bits/char accounting for sequential dependencies (bigrams, trigrams, ...)
(display "Shannon est: ~1.0-1.5 bits/char with context")
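The compression gap described above can be made concrete with Huffman coding. The sketch below is my own construction using Python's heapq; the dyadic toy distribution is deliberately chosen so that Huffman meets the entropy bound exactly, which is not true of skewed distributions in general:

```python
import heapq
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def huffman_lengths(probs):
    # Build a Huffman tree with a min-heap, tracking only code lengths.
    # The integer counter breaks ties so the heap never compares dicts.
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # merging two subtrees adds one bit to every code beneath them
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# dyadic probabilities: every p is a power of 1/2
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_lengths(probs)
avg = sum(probs[s] * lengths[s] for s in probs)
print(f"entropy: {entropy(probs):.3f} bits/symbol")   # → 1.750
print(f"huffman: {avg:.3f} bits/symbol")              # → 1.750
```

The frequent symbol gets a short code and the rare ones get long codes, so the average code length lands on the entropy. For English letter frequencies the same construction gets close to the per-letter entropy, and exploiting bigram and trigram context closes the rest of the gap toward Shannon's 1.0-1.5 bits/char estimate.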