📡 Information Theory
Sources: Claude Shannon, "A Mathematical Theory of Communication" (1948, public domain); Wikipedia (CC BY-SA 4.0).
Entropy, divergence, and channels. If a paper page mentions "information loss" or "data processing inequality" and you want the definitions, start here.
| # | Chapter | Key idea | |
|---|---|---|---|
| 1. | Surprise | I(x) = -log P(x): rare events carry more information | 📡 |
| 2. | Entropy | H(X) = expected surprise. Fair coin = 1 bit. Certainty = 0 | 📡 |
| 3. | Joint and conditional | H(X\|Y) = H(X,Y) - H(Y); conditioning never increases entropy, so H(X\|Y) ≤ H(X) | 📡 |
| 4. | Mutual information | I(X;Y) = how much knowing one tells you about the other | 📡 |
| 5. | KL divergence | D(P\|\|Q) = the extra surprise from assuming Q when the truth is P | 📡 |
| 6. | Data processing inequality | Processing cannot create information: if X → Y → Z is a Markov chain, then I(X;Z) ≤ I(X;Y) | 📡 |
| 7. | Channels and capacity | A channel is P(Y\|X). Capacity = max of I(X;Y) over input distributions | 📡 |
| 8. | Entropy as functor | Shannon entropy is the unique information measure that respects composition | 📡 |
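The quantities in chapters 1–7 fall out directly from the definitions above. A minimal sketch in Python, using only the standard library (the function names are my own, not from any chapter):

```python
import math

def entropy(p):
    """H(X) = -sum p(x) log2 p(x), in bits. Zero-probability terms contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    """D(P||Q) = sum p(x) log2(p(x)/q(x)): extra bits paid for assuming Q when the truth is P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) = D(P(X,Y) || P(X)P(Y)), from a joint table joint[x][y]."""
    px = [sum(row) for row in joint]                # marginal over rows
    py = [sum(col) for col in zip(*joint)]          # marginal over columns
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p: C = 1 - H(p)."""
    return 1.0 - entropy([p, 1 - p])

print(entropy([0.5, 0.5]))                          # fair coin: 1.0 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]])) # X = Y, fair coin: 1.0 bit
print(bsc_capacity(0.5))                            # pure-noise channel: 0.0 bits/use
```

Note that mutual information is itself a KL divergence (chapter 5 contains chapter 4 as a special case), and a noiseless binary channel recovers the fair-coin entropy as its capacity.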
📺 Video lectures: MIT 6.441 Information Theory
Neighbors
- 🎰 Probability — entropy is defined over probability distributions
- 🤖 Machine Learning — cross-entropy loss, mutual information, and compression
- 🐱 Category Theory — entropy as a functor is the final chapter
- 📊 Statistics — KL divergence bridges both fields