โ† back to index

A Characterization of Entropy in Terms of Information Loss

Baez, Fritz, Leinster · 2011 · arXiv:1106.1791

Assumes you've seen Markov categories. 5 minutes.

Shannon entropy is the only way to measure information loss that respects composition. There is no other choice.

Information loss

When a stochastic function processes a distribution, some information is lost. The question is: how do you measure that loss? Shannon defined entropy in 1948. Baez, Fritz, and Leinster proved that Shannon's definition is the only one that satisfies three natural properties.
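For reference, Shannon entropy of a finite distribution is a one-liner. A minimal sketch (the `entropy` helper is my naming, not the paper's):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a finite distribution given as probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

entropy([0.5, 0.5])   # fair coin: 1.0 bit
entropy([0.9, 0.1])   # biased coin: ~0.469 bits
entropy([1.0])        # point mass: 0.0 bits, nothing to learn
```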

[Diagram: a distribution p with entropy H ≈ 1.85 bits passes through two morphisms; the chain rule H(f, g) = H(f) + H(g | f) illustrates that the loss of the composite is the sum of the losses, i.e. the functor preserves composition.]

Functoriality: the key constraint

The paper's theorem: Shannon entropy is the unique measure of information loss that is functorial. Functorial means: the loss of a composite equals the sum of the losses of the parts.

If you process data in two steps, the total information lost is the loss from step 1 plus the loss from step 2. No other definition of "loss" has this property (up to a constant factor).
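That additivity can be checked numerically. A minimal sketch, modeling morphisms as deterministic coarse-grainings of finite distributions (the `pushforward` and `loss` helpers are my naming, not the paper's):

```python
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def pushforward(p, f):
    """Image distribution of p under a deterministic map f (dict: outcome -> outcome)."""
    q = {}
    for x, px in p.items():
        q[f[x]] = q.get(f[x], 0.0) + px
    return q

def loss(p, f):
    """Information lost by f on input p: H(input) - H(output)."""
    return entropy(p) - entropy(pushforward(p, f))

p  = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}
f  = {'a': 'x', 'b': 'x', 'c': 'y', 'd': 'y'}   # step 1: 4 outcomes -> 2
g  = {'x': 'z', 'y': 'z'}                        # step 2: 2 outcomes -> 1
gf = {x: g[f[x]] for x in f}                     # composite g ∘ f

# Functoriality: loss of the composite = sum of the stepwise losses.
assert abs(loss(p, gf) - (loss(p, f) + loss(pushforward(p, f), g))) < 1e-9
```

The check passes because H telescopes: (H(p) − H(q)) + (H(q) − H(r)) = H(p) − H(r).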


Data processing inequality

A corollary: information can only decrease through processing. You can't create information by computing; you can only lose it or preserve it. This is the data processing inequality (DPI), proved categorically.
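This page's running DPI example merges four outcomes into two and checks that entropy decreases. A minimal sketch:

```python
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

# Merging outcomes is processing: it can only lose bits, never create them.
p = [0.4, 0.3, 0.2, 0.1]          # four outcomes
q = [p[0] + p[1], p[2] + p[3]]    # merged into two: [0.7, 0.3]

h_in, h_out = entropy(p), entropy(q)
assert h_out <= h_in              # the data processing inequality
```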


The uniqueness theorem

The paper's main result (Theorem 4): any function F that assigns a real number to a morphism in FinProb (finite probability distributions) and satisfies:

  1. Functoriality: F(g ∘ f) = F(f) + F(g)
  2. Convex linearity: F respects mixtures of distributions
  3. Continuity: small changes in probabilities → small changes in F

...must be F(f) = c · (H(input) − H(output)) for some constant c ≥ 0. That's Shannon entropy, up to scale. There is no other option.
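Axiom 2 can also be checked concretely. A sketch under my own encoding of the mixture λf ⊕ (1−λ)g: the mixed distribution λp ⊕ (1−λ)q lives on the disjoint union of the two outcome sets (tagged 'L'/'R'), and the mixed morphism acts componentwise. All helper names are illustrative, not the paper's:

```python
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def push(p, f):
    q = {}
    for x, px in p.items():
        q[f[x]] = q.get(f[x], 0.0) + px
    return q

def F(p, f):
    """Information loss of a deterministic morphism f on input p, with c = 1."""
    return entropy(p) - entropy(push(p, f))

def direct_sum(lam, p, q):
    """The mixture lam*p ⊕ (1-lam)*q on the disjoint union of outcome sets."""
    out = {('L', x): lam * v for x, v in p.items()}
    out.update({('R', y): (1 - lam) * v for y, v in q.items()})
    return out

def morphism_sum(f, g):
    """f ⊕ g acting componentwise on the tagged outcomes."""
    h = {('L', x): ('L', y) for x, y in f.items()}
    h.update({('R', x): ('R', y) for x, y in g.items()})
    return h

lam = 0.25
p, f = {'a': 0.5, 'b': 0.5}, {'a': 'u', 'b': 'u'}   # f merges: loses 1 bit
q, g = {'c': 0.4, 'd': 0.6}, {'c': 'v', 'd': 'w'}   # g injective: loses 0 bits

lhs = F(direct_sum(lam, p, q), morphism_sum(f, g))
rhs = lam * F(p, f) + (1 - lam) * F(q, g)
assert abs(lhs - rhs) < 1e-9    # convex linearity holds exactly
```

The mixing-entropy term H(λ, 1−λ) appears in both the input and output entropies of the direct sum, so it cancels in F, which is why the identity is exact.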


Notation reference

Paper                   Python               Meaning
H(p)                    -sum(p*log2(p))      Shannon entropy
F(f) = c(H(p) − H(q))   h_in - h_out         Information loss of morphism f
FinProb                 dicts summing to 1   Category of finite probability distributions
DPI                     h_out <= h_in        Data processing inequality

Translation notes

All examples use finite distributions with explicit probabilities. The paper's theorem holds over FinProb (finite probability distributions, with morphisms that preserve total mass). For example: the DPI example on this page merges four outcomes into two and checks that entropy decreases. In information theory more broadly, the same inequality governs a noisy communication channel transmitting continuous signals: the entropy of the received signal is always less than or equal to the entropy of the transmitted signal, regardless of the channel's noise model. The inequality is identical. The distributions are not.

The uniqueness result requires all three axioms; dropping any one allows other measures. Every example is simplified.

Framework connection: Shannon entropy is the unique measure of the Natural Framework pipeline's information budget; DPI governs how bits flow through each stage. (The Natural Framework)

Ready for the real thing? Read the paper on arXiv. Start at Theorem 4 (p. 8) for the uniqueness result. Short paper: 15 pages.