Surprise: the atomic unit
Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)
The information content of an event is I(x) = −log2 P(x). Rare events carry more information. Certain events carry none. The unit is the bit.
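The definition can be sketched in a few lines of Python (the function name `self_info` is illustrative, not from the source):

```python
import math

def self_info(p: float) -> float:
    """Self-information I(x) = -log2 P(x), in bits."""
    return -math.log2(p)

print(self_info(0.5))   # fair coin flip: 1 bit
print(self_info(1.0))   # certain event: no surprise, 0 bits
print(self_info(1/8))   # rare event: 3 bits
```

Note how the rarer the event, the larger the value: halving the probability adds exactly one bit.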
Why logarithm?
Shannon needed a measure of "surprise" with one property: independent events should add, not multiply. If you flip a coin and roll a die, the total surprise should be the sum. Since P(A and B) = P(A) × P(B) for independent events, and log turns products into sums, the logarithm is the only choice. This measure of surprise per event is the foundation of substance detection: high information content signals that something is worth attending to.
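The coin-and-die example can be checked numerically, a small sketch assuming a helper `self_info` as defined above:

```python
import math

def self_info(p):
    return -math.log2(p)

coin = self_info(1/2)            # 1 bit
die  = self_info(1/6)            # log2(6) ≈ 2.585 bits
both = self_info((1/2) * (1/6))  # joint probability of independent events

# Additivity: surprise of the joint event equals the sum of the surprises.
assert math.isclose(both, coin + die)
```

The assertion holds because −log2(pq) = −log2(p) − log2(q) for any positive p and q.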
Additivity forces the logarithm
Shannon's key insight: any function f(p) that satisfies f(p × q) = f(p) + f(q) must be a logarithm. This is the Cauchy functional equation. Adding the constraints that f is continuous and f(1/2) = 1 (defining the bit), the unique solution is f(p) = −log2(p).
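A quick spot-check of the two constraints, Cauchy additivity and the bit normalization f(1/2) = 1, for the candidate f(p) = −log2(p) (sampling ranges are illustrative):

```python
import math
import random

f = lambda p: -math.log2(p)   # the claimed unique solution

# Normalization defining the bit: one unit of surprise for a fair coin.
assert f(0.5) == 1.0

# Cauchy additivity f(p*q) = f(p) + f(q), checked on random probabilities.
random.seed(0)
for _ in range(1000):
    p, q = random.uniform(0.01, 1.0), random.uniform(0.01, 1.0)
    assert math.isclose(f(p * q), f(p) + f(q))
```

This is of course a numeric check, not a proof; the uniqueness argument needs the continuity assumption stated above.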
Choosing the base
The base of the logarithm picks the unit. Base 2 gives bits (Shannon's choice for digital communication). Base e gives nats (convenient for calculus). Base 10 gives hartleys. They differ only by a constant factor. The structure is the same.
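The constant-factor relationship between units can be verified directly (the probability 0.2 is an arbitrary example value):

```python
import math

p = 0.2
bits     = -math.log2(p)
nats     = -math.log(p)     # natural log, base e
hartleys = -math.log10(p)

# Units differ only by a constant factor:
# ln(2) nats per bit, log10(2) hartleys per bit.
assert math.isclose(nats, bits * math.log(2))
assert math.isclose(hartleys, bits * math.log10(2))
```

Changing the base rescales every value by the same constant, so all structural results (additivity, uniqueness) are base-independent.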
Notation reference
| Symbol | Scheme | Meaning |
|---|---|---|
| I(x) = −log2 P(x) | (self-info p) | Self-information / surprise |
| bit | log base 2 | Unit when base = 2 |
| nat | log base e | Unit when base = e |
| I(x,y) = I(x) + I(y) | (+ (self-info p) (self-info q)) | Additivity (independent events) |
Neighbors
- 📡 Shannon 02 — entropy is expected surprise
- 🎰 Probability Ch.1 — probability is the foundation: self-information is −log P(event)
- 🍞 Baez & Fritz 2011 — entropy as a functor: the category-theoretic characterization
- 🧠 Lovelace Ch.7 Language — surprisal as a linking hypothesis between language models and reading data