📡 Information Theory

Sources: Claude Shannon, "A Mathematical Theory of Communication" (1948, public domain); Wikipedia (CC BY-SA 4.0).

Entropy, divergence, and channels. If a paper page mentions "information loss" or "data processing inequality" and you want the definitions, start here.

[Diagram: entropy Venn diagram — H(X), H(Y): sec. 2; H(X|Y), H(Y|X), H(X,Y): sec. 3; I(X;Y): sec. 4]
Chapters
1. Surprise I(x) = -log P(x): rare events carry more information 📡
2. Entropy H(X) = expected surprise. Fair coin = 1 bit. Certainty = 0 📡
3. Joint and conditional H(X|Y) = H(X,Y) - H(Y): knowing Y can only reduce uncertainty about X 📡
4. Mutual information I(X;Y) = how much knowing one tells you about the other 📡
5. KL divergence D(P||Q) = expected extra surprise from modeling with Q when the truth is P 📡
6. Data processing inequality Processing cannot create information: X to Y to Z implies I(X;Z) ≤ I(X;Y) 📡
7. Channels and capacity A channel is P(Y|X). Capacity = max mutual information over inputs 📡
8. Entropy as functor Shannon entropy is the unique information measure that respects composition 📡
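The quantities in chapters 2–4 can be computed directly from a joint distribution. A minimal sketch in Python (the joint distribution here is a hypothetical pair of correlated bits, chosen only for illustration):

```python
from math import log2

# Hypothetical joint distribution P(X=x, Y=y) of two correlated bits
# (an assumption for illustration, not from the source).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def entropy(dist):
    """H = -sum p log2(p): expected surprise in bits (sec. 2)."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Marginals P(X) and P(Y), obtained by summing out the other variable.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

H_X, H_Y, H_XY = entropy(px), entropy(py), entropy(joint)
H_X_given_Y = H_XY - H_Y   # conditional entropy via the chain rule (sec. 3)
I_XY = H_X + H_Y - H_XY    # mutual information (sec. 4)

print(H_X, H_X_given_Y, I_XY)
```

With this joint, H(X) is exactly 1 bit (the marginal is a fair coin), and the correlation between X and Y shows up as I(X;Y) > 0 and H(X|Y) < H(X).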
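The KL divergence of chapter 5 is a one-liner, and two small checks make its character visible: it is zero only when the model matches the truth, and it is not symmetric. A sketch (the distributions are hypothetical examples):

```python
from math import log2

def kl(p, q):
    """D(P||Q) = sum p * log2(p/q): expected extra surprise from
    modeling with Q when samples actually come from P (sec. 5)."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # truth: a fair coin
q = [0.9, 0.1]   # model: a heavily biased coin

print(kl(p, q))  # positive: the biased model is a poor fit
print(kl(q, p))  # a different positive number: KL is asymmetric
print(kl(p, p))  # 0.0: no extra surprise when the model is exact
```

Note the asymmetry: D(P||Q) penalizes Q for assigning low probability where P puts mass, which is why the two directions differ.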
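Chapters 6–7 can be illustrated together with the binary symmetric channel, whose capacity has the closed form 1 − H(ε); chaining two such channels gives a concrete check of the data processing inequality. A sketch, assuming a uniform input bit and flip probability ε = 0.1 (both chosen for illustration):

```python
from math import log2

def h2(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Binary symmetric channel BSC(eps): each bit flips with probability eps.
# Its capacity is 1 - H(eps), achieved by a uniform input (sec. 7).
eps = 0.1
capacity = 1 - h2(eps)

# Data processing inequality (sec. 6): pass Y through a second BSC(eps)
# to get Z. The composite X -> Z is itself a BSC whose flip probability
# is eps2 = eps*(1-eps) + (1-eps)*eps (flip in exactly one hop).
eps2 = 2 * eps * (1 - eps)
I_XY = 1 - h2(eps)    # I(X;Y) for a uniform input
I_XZ = 1 - h2(eps2)   # I(X;Z) after the second noisy hop

assert I_XZ <= I_XY   # processing cannot create information
print(capacity, I_XZ)
```

The second hop raises the effective flip probability from 0.1 to 0.18, so I(X;Z) is strictly smaller than I(X;Y), exactly as the inequality requires.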

📺 Video lectures: MIT 6.441 Information Theory

Neighbors