
KL Divergence

Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)

KL divergence measures the expected extra surprise you incur when you use distribution Q to code data that actually comes from P. It is not a distance: D(P||Q) ≠ D(Q||P). It is always ≥ 0, and zero only when P = Q.

The formula

DKL(P || Q) = ∑_x P(x) log(P(x) / Q(x)). Each term weights the log-ratio by how likely x actually is under P. When Q assigns low probability to something P considers likely, that term blows up. When Q matches P perfectly, every ratio is 1 and every log is 0.

[Figure: bar chart of P (truth) against Q (model) over outcomes a–e, with each outcome's pointwise divergence term.]
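The Scheme listing that originally sat here can be sketched as follows, assuming the `kl-divergence` and `log2` names from the notation table below and representing each distribution as a list of probabilities; a minimal version, in bits:

```scheme
;; Base-2 logarithm, so divergences come out in bits.
(define (log2 x) (/ (log x) (log 2)))

;; KL divergence D(P || Q) for two discrete distributions given as
;; equal-length lists of probabilities over the same outcomes.
;; By convention a term with P(x) = 0 contributes nothing.
(define (kl-divergence P Q)
  (apply +
         (map (lambda (pi qi)
                (if (zero? pi)
                    0
                    (* pi (log2 (/ pi qi)))))
              P Q)))

;; A model that misses half the mass of a fair coin:
(display (kl-divergence '(0.5 0.5) '(0.9 0.1)))  ; about 0.737 bits
(newline)
```

Identical distributions give 0, and the divergence grows as Q drifts further from P.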

Gibbs' inequality: always non-negative

KL divergence is always ≥ 0. This is Gibbs' inequality. The proof applies Jensen's inequality to the concave log function. Intuitively: you can never do better than coding against the truth, so any mismatch between model and reality costs you bits.
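Spelled out for the finite case (summing over the support of P, where P(x) > 0), the proof fits in two lines:

```latex
% Jensen's inequality for the concave log: a weighted average of logs
% is at most the log of the weighted average.
-D_{\mathrm{KL}}(P \,\|\, Q)
  = \sum_x P(x) \log \frac{Q(x)}{P(x)}
  \le \log \sum_x P(x) \frac{Q(x)}{P(x)}
  = \log \sum_x Q(x)
  \le \log 1 = 0.
```

Equality in the Jensen step requires Q(x)/P(x) to be constant on the support, which together with normalization forces P = Q.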

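The Scheme example that originally sat here presumably checked non-negativity numerically; a self-contained sketch, using the `kl-divergence` and `log2` names from the notation table:

```scheme
(define (log2 x) (/ (log x) (log 2)))

(define (kl-divergence P Q)
  (apply + (map (lambda (pi qi)
                  (if (zero? pi) 0 (* pi (log2 (/ pi qi)))))
                P Q)))

;; Truth: a biased three-sided die.
(define P '(0.5 0.3 0.2))

;; However we pick the model Q, the divergence never goes negative,
;; and it is exactly 0 only for Q = P.
(for-each
 (lambda (Q)
   (display (kl-divergence P Q)) (newline))
 '((0.5 0.3 0.2)      ; perfect model: 0
   (0.25 0.25 0.5)    ; mismatched model: > 0
   (0.2 0.3 0.5)))    ; reversed model: > 0
```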

Not a distance

A true distance is symmetric: d(x,y) = d(y,x). KL divergence is not (it also fails the triangle inequality). D(P||Q) asks "how surprised am I using Q when P is true?" while D(Q||P) asks the reverse. These are different questions with different answers: D(P||Q) punishes Q harshly for assigning near-zero probability to outcomes P makes common, while D(Q||P) barely notices that mismatch.

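A self-contained sketch of the asymmetry, again using the `kl-divergence` and `log2` names from the notation table; swapping the arguments changes the answer:

```scheme
(define (log2 x) (/ (log x) (log 2)))

(define (kl-divergence P Q)
  (apply + (map (lambda (pi qi)
                  (if (zero? pi) 0 (* pi (log2 (/ pi qi)))))
                P Q)))

(define P '(0.5 0.5))   ; truth: a fair coin
(define Q '(0.9 0.1))   ; model: a heavily biased coin

;; The two directions ask different questions and give different answers.
(display (kl-divergence P Q)) (newline)   ; about 0.737 bits
(display (kl-divergence Q P)) (newline)   ; about 0.531 bits
```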

Notation reference

Symbol                   Scheme                     Meaning
DKL(P || Q)              (kl-divergence P Q)        KL divergence from Q to P
P(x) log(P(x)/Q(x))      (* pi (log2 (/ pi qi)))    Pointwise divergence term
D ≥ 0                    (>= d 0)                   Gibbs' inequality
D = 0 iff P = Q          (kl-divergence P P) = 0    Zero divergence means identical distributions

Translation notes

All examples use finite discrete distributions with explicit probability vectors. KL divergence extends to continuous distributions via integrals, and to distributions over countably infinite sets via infinite sums. The asymmetry is the same in all cases. The non-negativity proof (Gibbs' inequality) uses Jensen's inequality on the convexity of -log, which holds regardless of whether the sample space is finite.
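For reference, the continuous form mentioned above replaces the sum with an integral over densities p and q:

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx
```

which is well-defined when q(x) > 0 wherever p(x) > 0, i.e. when P is absolutely continuous with respect to Q.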