Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)
Entropy H(X) = −∑ P(x) log2 P(x) is the average surprise of a random variable. A fair coin has maximum entropy (1 bit). A loaded coin has less. Certainty has zero.
From surprise to entropy
Self-information measures the surprise of a single event. Entropy averages that surprise over the entire distribution. If you know the probabilities, entropy tells you the minimum number of bits per symbol you need on average to encode messages from the source. Shannon developed this to answer a practical question from telegraphy: how to transmit messages efficiently.
Python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)
print(f"fair coin: {entropy([0.5, 0.5]):.4f} bits")
print(f"loaded coin: {entropy([0.9, 0.1]):.4f} bits")
print(f"fair die: {entropy([1/6]*6):.4f} bits")
print(f"certain: {entropy([1.0]):.4f} bits")
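The two notions above can be tied together directly: entropy is the probability-weighted average of each event's self-information, −log2 P(x). A minimal sketch in the same Python (the helper name `self_information` is mine, not from the original):

```python
import math

def self_information(p):
    # surprise of a single event with probability p, in bits
    return -math.log2(p)

probs = [0.9, 0.1]
# entropy = probability-weighted average of each event's surprise
h = sum(p * self_information(p) for p in probs)
print(f"surprise of p=0.9 event: {self_information(0.9):.4f} bits")
print(f"surprise of p=0.1 event: {self_information(0.1):.4f} bits")
print(f"weighted average (entropy): {h:.4f} bits")
```

The weighted average matches the loaded-coin entropy computed above: the rare event is very surprising, but it rarely happens, so the average stays low.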
Uniform distributions maximize entropy
Among all distributions over n outcomes, the uniform distribution P(x) = 1/n has maximum entropy H = log2(n). Any deviation from uniformity reduces entropy. This is because concentrating probability on some outcomes reduces average surprise.
Scheme
; Entropy is maximized by the uniform distribution
(define (log2 x) (/ (log x) (log 2)))
(define (entropy probs)
  (let loop ((ps probs) (h 0))
    (if (null? ps) h
        (let ((p (car ps)))
          (if (= p 0) (loop (cdr ps) h)
              (loop (cdr ps)
                    (- h (* p (log2 p)))))))))
; Maximum entropy for n outcomes = log2(n)
(display "max H for 2 outcomes: ") (display (log2 2)) (newline)
(display "max H for 6 outcomes: ") (display (log2 6)) (newline)
(display "max H for 8 outcomes: ") (display (log2 8)) (newline)
; Deviations from uniform always decrease H
(display "uniform (3): ") (display (entropy '(0.333 0.333 0.334))) (newline)
(display "skewed (3): ") (display (entropy '(0.7 0.2 0.1))) (newline)
(display "peaked (3): ") (display (entropy '(0.98 0.01 0.01)))
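As a numerical spot check of the claim (not a proof), one can sample many random distributions and confirm that none exceeds log2(n). The sampling approach here is my own illustration:

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 6
h_max = math.log2(n)   # entropy of the uniform distribution over n outcomes
random.seed(0)         # fixed seed for reproducibility
for _ in range(1000):
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    p = [w / total for w in weights]   # normalize to a valid distribution
    assert entropy(p) <= h_max + 1e-12
print(f"1000 random distributions over {n} outcomes: all H <= {h_max:.4f} bits")
```

Every sampled distribution sits at or below the uniform bound, consistent with the concavity argument: spreading probability evenly maximizes average surprise.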
Shannon's telegraph example
Shannon's original motivation: English text is not uniform. The letter E appears far more often than Z. Because the distribution is skewed, the entropy per letter is lower than log2(27) ≈ 4.75 bits (26 letters plus space). Shannon estimated about 1 to 1.5 bits per character for English once sequential dependencies are accounted for. That gap is what compression exploits, and it is the principle behind context density as a measure of how much meaning is packed into a given span of text.
Scheme
; English is not uniform: entropy per letter is much less than log2(27) ≈ 4.75 bits
(define (log2 x) (/ (log x) (log 2)))
(define (entropy probs)
  (let loop ((ps probs) (h 0))
    (if (null? ps) h
        (let ((p (car ps)))
          (if (= p 0) (loop (cdr ps) h)
              (loop (cdr ps)
                    (- h (* p (log2 p)))))))))
; Uniform over 27 symbols (26 letters + space)
(define uniform-27
  (let loop ((n 27) (acc '()))
    (if (= n 0) acc
        (loop (- n 1) (cons (/ 1.0 27) acc)))))
(display "uniform 27: ")
(display (entropy uniform-27))
(display " bits/char") (newline)
; Approximate English letter frequencies over 27 symbols (26 letters + space),
; highest first: E=12.7%, T=9.1%, A=8.2%, O=7.5%, I=7.0%, N=6.7%, ...
(define english-approx
  (list 0.127 0.091 0.082 0.075 0.070 0.067 0.063 0.060 0.054
        0.043 0.040 0.037 0.034 0.029 0.025 0.024 0.020 0.019
        0.015 0.010 0.007 0.007 0.006 0.005 0.002 0.001 0.001))
(display "English: ")
(display (entropy english-approx))
(display " bits/char") (newline)
; Shannon estimated 1.0-1.5 bits/char accounting for sequential dependencies (bigrams, trigrams, ...)
(display "Shannon est: ~1.0-1.5 bits/char with context")
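The compression gap described above can be made concrete with Huffman coding. The sketch below is my own construction using Python's heapq; the dyadic toy distribution is deliberately chosen so that Huffman meets the entropy bound exactly, which is not true of skewed distributions in general:

```python
import heapq
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def huffman_lengths(probs):
    # Build a Huffman tree with a min-heap, tracking only code lengths.
    # The integer counter breaks ties so the heap never compares dicts.
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # merging two subtrees adds one bit to every code beneath them
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# dyadic probabilities: every p is a power of 1/2
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_lengths(probs)
avg = sum(probs[s] * lengths[s] for s in probs)
print(f"entropy: {entropy(probs):.3f} bits/symbol")   # → 1.750
print(f"huffman: {avg:.3f} bits/symbol")              # → 1.750
```

The frequent symbol gets a short code and the rare ones get long codes, so the average code length lands on the entropy. For English letter frequencies the same construction gets close to the per-letter entropy, and exploiting bigram and trigram context closes the rest of the gap toward Shannon's 1.0-1.5 bits/char estimate.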