
Entropy as Functor

Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)

Prereqs: Entropy, Baez-Fritz-Leinster 2011. 5 min.

Shannon entropy is the unique functorial information measure. Baez, Fritz, and Leinster proved: any continuous function from finite probability spaces to real numbers that satisfies the chain rule and maximality must be Shannon entropy (up to a constant). Shannon did not choose entropy. The axioms forced it.

Entropy respects composition

The chain rule says H(X,Y) = H(X) + H(Y|X). When you process data in two steps, the total information equals the information from step 1 plus the conditional information from step 2. This is functoriality: the measure of the composite equals the sum of the measures of the parts.
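The chain rule is easy to verify numerically. The page's worked examples are in Scheme; here is the same computation as a Python sketch, using an illustrative joint distribution chosen for this example:

```python
import math

def H(probs):
    """Shannon entropy in bits of a finite distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An illustrative joint distribution p(x, y) over X in {0,1}, Y in {0,1}.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# Marginal p(x), then H(Y|X) = sum_x p(x) * H(Y | X = x).
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
H_joint = H(joint.values())
H_X = H(px.values())
H_Y_given_X = sum(
    px[x] * H([joint[(x, y)] / px[x] for y in (0, 1)]) for x in (0, 1)
)

# Chain rule: H(X,Y) = H(X) + H(Y|X).
assert abs(H_joint - (H_X + H_Y_given_X)) < 1e-12
```

Here H(X,Y) = 1.75 bits, split as H(X) ≈ 0.811 plus H(Y|X) ≈ 0.939.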

[Diagram: composable morphisms f : P → Q and g : Q → R in FinProb, illustrating H(g ; f) = H(f) + H(g | f).]

The three axioms

Baez-Fritz-Leinster proved that Shannon entropy is the unique function satisfying three properties:

  1. Continuity — small changes in probabilities produce small changes in entropy
  2. Maximality — the uniform distribution has the highest entropy among all distributions on n outcomes
  3. Chain rule — H(X,Y) = H(X) + H(Y|X). Composition is additive.

Drop any one axiom and other measures become possible. All three together force Shannon's formula.
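Maximality can be probed empirically by comparing random distributions against the uniform one. A minimal Python sketch (sample count and seed are arbitrary choices for this illustration):

```python
import math
import random

def H(probs):
    """Shannon entropy in bits of a finite distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 4
uniform = [1 / n] * n  # H(uniform) = log2(n) = 2 bits

random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    p = [x / sum(w) for x in w]
    # Maximality: no distribution on n outcomes beats the uniform one.
    assert H(p) <= H(uniform) + 1e-9
```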


The payoff: foundations meet category theory

This is the bridge page. Shannon defined entropy operationally in 1948. Baez-Fritz-Leinster proved in 2011 that it is the unique functor from the category of finite probability spaces to the real numbers that satisfies continuity, maximality, and the chain rule. The foundations (pages 1–7) are not arbitrary definitions. They are the only definitions that compose. The Stoch/Giry framework extends this categorical view to stochastic channels, connecting entropy to the broader structure of probabilistic computation.
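The information-loss functor itself fits in a few lines. A Python sketch, assuming the BFL convention F(f) = H(p) − H(q) with c = 1; the helper names `pushforward` and `info_loss` are illustrative, not from the paper:

```python
import math

def H(probs):
    """Shannon entropy in bits of a finite distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def pushforward(p, f, codomain_size):
    """Push p on {0..n-1} forward along f: q(y) = sum over f(x)=y of p(x)."""
    q = [0.0] * codomain_size
    for x, px in enumerate(p):
        q[f(x)] += px
    return q

def info_loss(p, q):
    """Information loss of a measure-preserving map (X,p) -> (Y,q)."""
    return H(p) - H(q)

p = [0.125, 0.125, 0.25, 0.5]   # distribution on 4 points
f = lambda x: x // 2            # merge {0,1} and {2,3}: 4 points -> 2
g = lambda y: 0                 # collapse everything: 2 points -> 1
q = pushforward(p, f, 2)        # q = [0.25, 0.75]
r = pushforward(q, g, 1)        # r = [1.0]

# Functoriality: loss of the composite = sum of the losses.
lhs = info_loss(p, r)
rhs = info_loss(p, q) + info_loss(q, r)
assert abs(lhs - rhs) < 1e-12
```

The losses telescope: H(p) − H(r) = (H(p) − H(q)) + (H(q) − H(r)), which is exactly why composition is additive.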


Notation reference

Symbol                       Scheme                  Meaning
H(X,Y) = H(X) + H(Y|X)       chain rule              Entropy is additive under composition
FinProb                      list of probabilities   Category of finite probability spaces
H : FinProb → R              (entropy probs)         Entropy as functor
F(f) = c · (H(p) − H(q))     (info-loss p q)         Information loss (unique up to scale)
H(g ; f) = H(f) + H(g|f)     functoriality           Loss of composite = sum of losses

Translation notes

All examples use finite uniform or explicit distributions. The Baez-Fritz-Leinster theorem works over FinProb, the category of finite probability distributions with measure-preserving maps as morphisms. The "functor" here maps each morphism to its information loss (a real number), and the theorem says this functor is unique. Rényi entropy, Tsallis entropy, and other generalizations satisfy some but not all three axioms. The chain rule is the one that eliminates them.
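The elimination can be seen concretely with Rényi entropy of order 2 on a dependent pair. Conditional Rényi entropy has several competing definitions; this sketch uses the naive weighted average, one common convention, and the joint distribution is an illustrative choice:

```python
import math

def renyi2(probs):
    """Rényi entropy of order 2, in bits: H2(p) = -log2(sum of p_i^2)."""
    return -math.log2(sum(p * p for p in probs))

# A dependent joint distribution p(x, y) and its marginal p(x).
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
px = {0: 0.75, 1: 0.25}

lhs = renyi2(joint.values())  # H2(X, Y)
# Naive conditional: sum_x p(x) * H2(Y | X = x).
rhs = renyi2(px.values()) + sum(
    px[x] * renyi2([joint[(x, y)] / px[x] for y in (0, 1)]) for x in (0, 1)
)

# The chain rule fails: H2(X,Y) != H2(X) + H2(Y|X).
assert abs(lhs - rhs) > 0.01
```

Rényi entropy is continuous and maximized by the uniform distribution, so the chain rule is precisely the axiom it loses.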
