
Entropy: expected surprise

Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)

Entropy H(X) = −∑ P(x) log2 P(x) is the average surprise of a random variable. A fair coin has maximum entropy (1 bit). A loaded coin has less. Certainty has zero.

From surprise to entropy

Self-information measures the surprise of a single event. Entropy averages that surprise over the entire distribution. If you know the probabilities, entropy tells you how many bits per symbol you need on average to encode messages from the source. Shannon invented this to solve the telegraph problem: how to send messages efficiently.

[Figure: the binary entropy function H(p) = −p log2 p − (1−p) log2(1−p), plotted against p, the probability of heads. It peaks at 1 bit when p = 0.5, falls to 0 at p = 0 and p = 1, and is about 0.47 bits at p = 0.9.]
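The curve above can be reproduced with a small sketch. The names `entropy` and `log2` follow the notation table at the end of this page; this is a minimal stand-in for the page's interactive example, not its actual source.

```scheme
; Log base 2, via the natural log.
(define (log2 x) (/ (log x) (log 2)))

; H(X) = -sum_x P(x) log2 P(x).
; Terms with p = 0 contribute nothing (p log p -> 0 as p -> 0).
(define (entropy probs)
  (- (apply + (map (lambda (p)
                     (if (zero? p) 0 (* p (log2 p))))
                   probs))))

(entropy '(0.5 0.5)) ; fair coin: 1 bit
(entropy '(0.9 0.1)) ; loaded coin: about 0.469 bits
(entropy '(1))       ; certainty: 0 bits
```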

Uniform distributions maximize entropy

Among all distributions over n outcomes, the uniform distribution P(x) = 1/n has maximum entropy H = log2(n). Any deviation from uniformity reduces entropy: concentrating probability mass makes the likely outcomes common but unsurprising, and their low surprise dominates the average.

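As a quick check, sketching the `entropy` helper named in the notation table (the page's interactive example is not reproduced here): the uniform distribution over four outcomes hits log2(4) = 2 bits, and any skewed distribution over the same four outcomes falls below it.

```scheme
(define (log2 x) (/ (log x) (log 2)))
(define (entropy probs)
  (- (apply + (map (lambda (p)
                     (if (zero? p) 0 (* p (log2 p))))
                   probs))))

; Uniform over n = 4 outcomes: H = log2 4 = 2 bits.
(entropy '(0.25 0.25 0.25 0.25))
; Concentrating mass on one outcome lowers the average surprise:
(entropy '(0.7 0.1 0.1 0.1)) ; about 1.357 bits
```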

Shannon's telegraph example

Shannon's original motivation: English text is not uniform. The letter E appears far more often than Z. Because the distribution is skewed, the entropy per letter is well below log2(27) ≈ 4.75 bits (26 letters plus space). Shannon estimated about 1 to 1.5 bits per character for English. That gap is what compression exploits, and it is the principle behind context density as a measure of how much meaning is packed into a given span of text.

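The same helper can estimate the entropy of the letter frequencies observed in a sample string. The sample below is a toy stand-in, not a real corpus, so its number only illustrates the direction of Shannon's point: a skewed letter distribution yields far fewer bits per character than log2(27) ≈ 4.75.

```scheme
(define (log2 x) (/ (log x) (log 2)))
(define (entropy probs)
  (- (apply + (map (lambda (p)
                     (if (zero? p) 0 (* p (log2 p))))
                   probs))))

; How many times character c occurs in the list cs.
(define (occurrences c cs)
  (apply + (map (lambda (x) (if (eqv? x c) 1 0)) cs)))

; Distinct elements of cs, in first-seen order.
(define (distinct cs)
  (let loop ((cs cs) (seen '()))
    (cond ((null? cs) (reverse seen))
          ((memv (car cs) seen) (loop (cdr cs) seen))
          (else (loop (cdr cs) (cons (car cs) seen))))))

; Entropy of the empirical character distribution of string s.
(define (entropy-of-string s)
  (let* ((chars (string->list s))
         (n (exact->inexact (length chars))))
    (entropy (map (lambda (c) (/ (occurrences c chars) n))
                  (distinct chars)))))

(entropy-of-string "the telegraph problem") ; well below (log2 27)
```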

Notation reference

Symbol                 Scheme                        Meaning
H(X) = −∑ p log2 p     (entropy probs)               Shannon entropy
H(X) ≤ log2 n          (log2 n)                      Maximum entropy (uniform)
H(p, 1−p)              (entropy (list p (- 1 p)))    Binary entropy function