📡 Information Theory
Sources: Claude Shannon, "A Mathematical Theory of Communication" (1948, public domain); Wikipedia (CC BY-SA 4.0).
Entropy, divergence, and channels. If a paper page mentions "information loss" or "data processing inequality" and you want the definitions, start here.
| # | Chapter | Key idea | |
|---|---|---|---|
| 1. | Surprise | I(x) = -log P(x): rare events carry more information | 📡 |
| 2. | Entropy | H(X) = expected surprise. Fair coin = 1 bit. Certainty = 0 | 📡 |
| 3. | Joint and conditional | H(X\|Y) = H(X,Y) - H(Y); conditioning never increases entropy, so H(X\|Y) ≤ H(X) | 📡 |
| 4. | Mutual information | I(X;Y) = how much knowing one tells you about the other | 📡 |
| 5. | KL divergence | D(P\|\|Q) = the extra surprise from assuming Q when the truth is P | 📡 |
| 6. | Data processing inequality | Processing cannot create information: if X → Y → Z is a Markov chain, then I(X;Z) ≤ I(X;Y) | 📡 |
| 7. | Channels and capacity | A channel is P(Y\|X). Capacity = max of I(X;Y) over input distributions | 📡 |
| 8. | Entropy as functor | Shannon entropy is the unique information measure that respects composition | 📡 |
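The quantities in chapters 1–7 fall out directly from the definitions above. A minimal sketch in Python, using only the standard library (the function names are my own, not from any chapter):

```python
import math

def entropy(p):
    """H(X) = -sum p(x) log2 p(x), in bits. Zero-probability terms contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    """D(P||Q) = sum p(x) log2(p(x)/q(x)): extra bits paid for assuming Q when the truth is P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) = D(P(X,Y) || P(X)P(Y)), from a joint table joint[x][y]."""
    px = [sum(row) for row in joint]                # marginal over rows
    py = [sum(col) for col in zip(*joint)]          # marginal over columns
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p: C = 1 - H(p)."""
    return 1.0 - entropy([p, 1 - p])

print(entropy([0.5, 0.5]))                          # fair coin: 1.0 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]])) # X = Y, fair coin: 1.0 bit
print(bsc_capacity(0.5))                            # pure-noise channel: 0.0 bits/use
```

Note that mutual information is itself a KL divergence (chapter 5 contains chapter 4 as a special case), and a noiseless binary channel recovers the fair-coin entropy as its capacity.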
📺 Video lectures: MIT 6.441 Information Theory
Neighbors
- 🎰 Probability — entropy is defined over probability distributions
- 🤖 Machine Learning — cross-entropy loss, mutual information, and compression
- 🐱 Category Theory — entropy as a functor is the final chapter
- 📊 Statistics — KL divergence bridges both fields