
Data Processing Inequality

Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)

If X → Y → Z is a Markov chain, then I(X;Z) ≤ I(X;Y). Processing cannot create information: every step in a pipeline can only lose information or preserve it, never gain it. Equivalently: functors can't increase information.

The Markov chain condition

X → Y → Z means Z depends on X only through Y. Given Y, Z is conditionally independent of X. Whatever Y forgets about X is gone forever. No downstream processing of Y can recover it.

[Diagram: X → channel → Y → processing → Z, with I(X;Y) and I(X;Z) marked. Information about X can only shrink along the chain.]
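A minimal sketch of the inequality in Scheme, assuming the simplest concrete setting: a uniform binary input sent through two binary symmetric channels in series. The helper names (h2, bsc-mi, compose-flips) are ours, not standard library procedures.

```scheme
;; A binary symmetric channel (BSC) flips its input bit with
;; probability p.  Composing a BSC(p) with a BSC(q) gives another
;; BSC with flip probability p(1-q) + (1-p)q.
(define (log2 x) (/ (log x) (log 2)))

;; binary entropy function H(p)
(define (h2 p)
  (if (or (= p 0) (= p 1))
      0
      (- 0 (* p (log2 p)) (* (- 1 p) (log2 (- 1 p))))))

;; mutual information across a BSC with uniform input: 1 - H(p)
(define (bsc-mi p) (- 1 (h2 p)))

;; flip probability of two BSCs in series
(define (compose-flips p q) (+ (* p (- 1 q)) (* (- 1 p) q)))

(define p 0.1)   ; channel X -> Y
(define q 0.2)   ; processing Y -> Z

(display (bsc-mi p)) (newline)                    ; I(X;Y), about 0.53 bits
(display (bsc-mi (compose-flips p q))) (newline)  ; I(X;Z), smaller: about 0.17 bits
```

The second stage drives the effective flip probability from 0.1 up to 0.26, so I(X;Z) < I(X;Y), exactly as DPI requires.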

Why it matters

DPI is the reason compression and summarization are lossy. If your pipeline is X → Y → Z, then Z cannot know more about X than Y does. No clever algorithm applied to Y can recover what Y already lost. This bounds every learning algorithm, every codec, every statistical estimator. It also constrains multi-stage embedding pipelines: each stage in the chain can only preserve or destroy information, never create it.

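The same point computed directly from probability vectors, as the translation notes describe: X is uniform on four values, Y = X, and a summarization step Z keeps only the high bit of Y. The joint tables and the helpers marginal and mutual-information are our illustrative names.

```scheme
(define (log2 x) (/ (log x) (log 2)))

;; Sum of p over rows of the joint table where (sel row) = v.
(define (marginal joint sel v)
  (apply + (map (lambda (row)
                  (if (equal? (sel row) v) (caddr row) 0))
                joint)))

;; I(X;Y) = sum over (x y p) of p * log2(p / (p(x) p(y)))
(define (mutual-information joint)
  (apply +
         (map (lambda (row)
                (let* ((x (car row)) (y (cadr row)) (p (caddr row))
                       (px (marginal joint car x))
                       (py (marginal joint cadr y)))
                  (if (= p 0) 0 (* p (log2 (/ p (* px py)))))))
              joint)))

;; X uniform on {0,1,2,3}; Y = X (identity channel);
;; Z = (quotient Y 2) forgets the low bit.
(define joint-xy '((0 0 1/4) (1 1 1/4) (2 2 1/4) (3 3 1/4)))
(define joint-xz '((0 0 1/4) (1 0 1/4) (2 1 1/4) (3 1 1/4)))

(display (mutual-information joint-xy)) (newline)  ; 2 bits
(display (mutual-information joint-xz)) (newline)  ; 1 bit
```

One bit is destroyed by the summarization, and no further processing of Z can get it back.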

The functorial perspective

In categorical terms, a deterministic function f : X → Y between finite probability spaces is a morphism, and entropy assigns to it an information loss H(X) − H(Y). Baez, Fritz, and Leinster proved that Shannon entropy is, up to a constant factor, the unique continuous functorial measure of information loss. DPI is the axiom that makes this characterization work.

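A sketch of the special case H(f(X)) ≤ H(X) from the notation table: push a distribution forward along a deterministic function and compare entropies. The procedure names entropy and pushforward are ours; distributions are association lists of (value . probability) pairs.

```scheme
(define (log2 x) (/ (log x) (log 2)))

;; Entropy of a distribution ((value . prob) ...).
(define (entropy dist)
  (apply + (map (lambda (pair)
                  (let ((p (cdr pair)))
                    (if (= p 0) 0 (- (* p (log2 p))))))
                dist)))

;; Pushforward along f: merge the probabilities of all values
;; that f maps to the same output.
(define (pushforward f dist)
  (let loop ((dist dist) (out '()))
    (if (null? dist)
        out
        (let* ((y (f (caar dist))) (p (cdar dist))
               (hit (assoc y out)))
          (loop (cdr dist)
                (if hit
                    (map (lambda (pair)
                           (if (equal? (car pair) y)
                               (cons y (+ (cdr pair) p))
                               pair))
                         out)
                    (cons (cons y p) out)))))))

(define x-dist '((0 . 1/4) (1 . 1/4) (2 . 1/4) (3 . 1/4)))
(define (f x) (modulo x 2))   ; merges 0 with 2, and 1 with 3

(display (entropy x-dist)) (newline)                 ; 2 bits
(display (entropy (pushforward f x-dist))) (newline) ; 1 bit
```

A deterministic f can only merge outcomes, never split them, which is exactly why the pushforward's entropy cannot exceed the original's.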

Notation reference

Symbol              Scheme               Meaning
X → Y → Z           chain of channels    Markov chain
I(X;Z) ≤ I(X;Y)     (<= ixz ixy)         Data processing inequality
H(f(X)) ≤ H(X)      (<= hfx hx)          Entropy can't increase under functions
f : X → Y           (lambda (x) ...)     Deterministic channel (functor)

Translation notes

All examples use discrete finite distributions. The data processing inequality also holds for continuous distributions, Markov kernels, and general measurable spaces. The inequality is tight (equality) exactly when X → Z → Y is also a Markov chain, i.e. when Z is a sufficient statistic for X. Every example computes entropy directly from probability vectors rather than simulating the Markov chain.