
Data Processing Inequality

Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)

If X → Y → Z is a Markov chain, then I(X;Z) ≤ I(X;Y). Processing cannot create information: every step in a pipeline can only lose information or preserve it, never gain it. Equivalently: functors can't increase information.

The Markov chain condition

X → Y → Z means Z depends on X only through Y. Given Y, Z is conditionally independent of X. Whatever Y forgets about X is gone forever. No downstream processing of Y can recover it.

[Diagram: X → channel → Y → processing → Z, with I(X;Y) and I(X;Z) marked. Information about X can only shrink along the chain.]
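A minimal sketch of the inequality in Scheme, assuming the simplest concrete setting: a uniform binary input sent through two binary symmetric channels in series. The helper names (h2, bsc-mi, compose-flips) are ours, not standard library procedures.

```scheme
;; A binary symmetric channel (BSC) flips its input bit with
;; probability p.  Composing a BSC(p) with a BSC(q) gives another
;; BSC with flip probability p(1-q) + (1-p)q.
(define (log2 x) (/ (log x) (log 2)))

;; binary entropy function H(p)
(define (h2 p)
  (if (or (= p 0) (= p 1))
      0
      (- 0 (* p (log2 p)) (* (- 1 p) (log2 (- 1 p))))))

;; mutual information across a BSC with uniform input: 1 - H(p)
(define (bsc-mi p) (- 1 (h2 p)))

;; flip probability of two BSCs in series
(define (compose-flips p q) (+ (* p (- 1 q)) (* (- 1 p) q)))

(define p 0.1)   ; channel X -> Y
(define q 0.2)   ; processing Y -> Z

(display (bsc-mi p)) (newline)                    ; I(X;Y), about 0.53 bits
(display (bsc-mi (compose-flips p q))) (newline)  ; I(X;Z), smaller: about 0.17 bits
```

The second stage drives the effective flip probability from 0.1 up to 0.26, so I(X;Z) < I(X;Y), exactly as DPI requires.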

Why it matters

DPI is the reason compression and summarization are lossy. If your pipeline is X → Y → Z, then Z cannot know more about X than Y does. No clever algorithm applied to Y can recover what Y already lost. This bounds every learning algorithm, every codec, every statistical estimator. It also constrains multi-stage embedding pipelines: each stage in the chain can only preserve or destroy information, never create it.

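The same point computed directly from probability vectors, as the translation notes describe: X is uniform on four values, Y = X, and a summarization step Z keeps only the high bit of Y. The joint tables and the helpers marginal and mutual-information are our illustrative names.

```scheme
(define (log2 x) (/ (log x) (log 2)))

;; Sum of p over rows of the joint table where (sel row) = v.
(define (marginal joint sel v)
  (apply + (map (lambda (row)
                  (if (equal? (sel row) v) (caddr row) 0))
                joint)))

;; I(X;Y) = sum over (x y p) of p * log2(p / (p(x) p(y)))
(define (mutual-information joint)
  (apply +
         (map (lambda (row)
                (let* ((x (car row)) (y (cadr row)) (p (caddr row))
                       (px (marginal joint car x))
                       (py (marginal joint cadr y)))
                  (if (= p 0) 0 (* p (log2 (/ p (* px py)))))))
              joint)))

;; X uniform on {0,1,2,3}; Y = X (identity channel);
;; Z = (quotient Y 2) forgets the low bit.
(define joint-xy '((0 0 1/4) (1 1 1/4) (2 2 1/4) (3 3 1/4)))
(define joint-xz '((0 0 1/4) (1 0 1/4) (2 1 1/4) (3 1 1/4)))

(display (mutual-information joint-xy)) (newline)  ; 2 bits
(display (mutual-information joint-xz)) (newline)  ; 1 bit
```

One bit is destroyed by the summarization, and no further processing of Z can get it back.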

The functorial perspective

In categorical terms, a deterministic function f : X → Y between finite probability spaces is a morphism, and entropy assigns to it an information loss H(X) − H(Y). Baez, Fritz, and Leinster proved that Shannon entropy is, up to a constant factor, the unique continuous functorial measure of information loss. DPI is the axiom that makes this characterization work.

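A sketch of the special case H(f(X)) ≤ H(X) from the notation table: push a distribution forward along a deterministic function and compare entropies. The procedure names entropy and pushforward are ours; distributions are association lists of (value . probability) pairs.

```scheme
(define (log2 x) (/ (log x) (log 2)))

;; Entropy of a distribution ((value . prob) ...).
(define (entropy dist)
  (apply + (map (lambda (pair)
                  (let ((p (cdr pair)))
                    (if (= p 0) 0 (- (* p (log2 p))))))
                dist)))

;; Pushforward along f: merge the probabilities of all values
;; that f maps to the same output.
(define (pushforward f dist)
  (let loop ((dist dist) (out '()))
    (if (null? dist)
        out
        (let* ((y (f (caar dist))) (p (cdar dist))
               (hit (assoc y out)))
          (loop (cdr dist)
                (if hit
                    (map (lambda (pair)
                           (if (equal? (car pair) y)
                               (cons y (+ (cdr pair) p))
                               pair))
                         out)
                    (cons (cons y p) out)))))))

(define x-dist '((0 . 1/4) (1 . 1/4) (2 . 1/4) (3 . 1/4)))
(define (f x) (modulo x 2))   ; merges 0 with 2, and 1 with 3

(display (entropy x-dist)) (newline)                 ; 2 bits
(display (entropy (pushforward f x-dist))) (newline) ; 1 bit
```

A deterministic f can only merge outcomes, never split them, which is exactly why the pushforward's entropy cannot exceed the original's.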

Notation reference

Symbol              Scheme               Meaning
X → Y → Z           chain of channels    Markov chain
I(X;Z) ≤ I(X;Y)     (<= ixz ixy)         Data processing inequality
H(f(X)) ≤ H(X)      (<= hfx hx)          Entropy can't increase under functions
f : X → Y           (lambda (x) ...)     Deterministic channel (functor)

Translation notes

All examples use discrete finite distributions. The data processing inequality also holds for continuous distributions, Markov kernels, and general measurable spaces. The inequality is tight (equality) exactly when X → Z → Y is also a Markov chain, i.e. when Z is a sufficient statistic for X. Every example computes entropy directly from probability vectors rather than simulating the Markov chain.