KL Divergence
Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)
KL divergence measures the extra surprise you get when you use distribution Q to code data that actually comes from P. It is not a distance: D(P||Q) ≠ D(Q||P). It is always ≥ 0, and zero only when P = Q.
The formula
DKL(P || Q) = ∑ P(x) log(P(x) / Q(x)), summing over the support of P, with log base 2 when you measure in bits. Each term weights the log-ratio by how likely x actually is under P. When Q assigns low probability to something P considers likely, that term blows up. When Q matches P perfectly, every ratio is 1, every log is 0, and the sum vanishes.
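The sum translates directly into code. A minimal sketch in Python (the name `kl_divergence` mirrors the Scheme-style notation in the table below; it is illustrative, not a library function), using log base 2 so the result is in bits:

```python
from math import log2

def kl_divergence(P, Q):
    """D_KL(P || Q) in bits, for finite discrete distributions.

    Terms where P(x) = 0 contribute nothing (0 * log 0 := 0);
    a zero in Q where P is positive makes the divergence infinite.
    """
    total = 0.0
    for p, q in zip(P, Q):
        if p == 0:
            continue                 # 0 * log(0/q) = 0 by convention
        if q == 0:
            return float("inf")      # Q rules out something P allows
        total += p * log2(p / q)     # weight the log-ratio by the truth P
    return total
```

For example, P = (1/2, 1/2) against Q = (1/4, 3/4) gives about 0.208 bits of extra surprise, and any distribution against itself gives exactly 0.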
Gibbs' inequality: always non-negative
KL divergence is always ≥ 0; this is Gibbs' inequality. The proof applies Jensen's inequality to the concavity of log (equivalently, the convexity of -log). Intuitively: you can never do better than coding with the true distribution. Any mismatch between model and reality costs you bits.
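A quick empirical check of Gibbs' inequality (a spot-check, not a proof): draw random pairs of distributions and confirm the divergence never goes negative, and is exactly zero when the two distributions coincide. The `kl` helper just re-implements the sum from the formula.

```python
import random
from math import log2

def kl(P, Q):
    # D_KL(P || Q) in bits; assumes strictly positive probabilities
    return sum(p * log2(p / q) for p, q in zip(P, Q))

def random_dist(n):
    # Normalize positive random weights into a probability vector
    w = [random.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(1000):
    P, Q = random_dist(5), random_dist(5)
    assert kl(P, Q) >= 0.0          # Gibbs' inequality
    assert abs(kl(P, P)) < 1e-12    # zero when model equals truth
```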
Not a distance
A true distance satisfies d(x,y) = d(y,x). KL divergence does not. D(P||Q) asks "how surprised am I using Q when P is true?" while D(Q||P) asks the reverse. These are different questions with different answers: D(P||Q) heavily penalizes Q for assigning near-zero probability to outcomes that P makes likely, while D(Q||P) penalizes the mismatch in the opposite direction.
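The asymmetry is easy to exhibit with concrete numbers (the distributions below are illustrative values, not from the source):

```python
from math import log2

def kl(P, Q):
    # D_KL(P || Q) in bits; assumes strictly positive entries
    return sum(p * log2(p / q) for p, q in zip(P, Q))

P = [0.9, 0.1]   # sharply peaked "truth"
Q = [0.5, 0.5]   # flat model

forward  = kl(P, Q)   # surprise using Q when P is true
backward = kl(Q, P)   # surprise using P when Q is true
assert forward != backward
```

Here `forward` is about 0.531 bits while `backward` is about 0.737 bits, so swapping the arguments changes the answer.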
Notation reference
| Symbol | Scheme | Meaning |
|---|---|---|
| DKL(P || Q) | (kl-divergence P Q) | KL divergence from Q to P |
| P(x) log(P(x)/Q(x)) | (* pi (log2 (/ pi qi))) | Pointwise divergence term |
| D ≥ 0 | (>= d 0) | Gibbs' inequality |
| D = 0 iff P = Q | (kl-divergence P P) = 0 | Zero divergence means identical |
Neighbors
- 📡 Mutual information — KL divergence between joint and product of marginals
- 📡 Data processing inequality — processing can never increase divergence: D(P||Q) ≥ D(f(P)||f(Q)) for any channel f
- 🍞 Sato, Katsumata 2023 — divergences on monads; KL is the primary example
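The mutual-information neighbor can be made concrete: I(X;Y) is exactly the KL divergence from the product of the marginals to the joint distribution. A sketch with a made-up joint distribution over two correlated bits (the numbers are illustrative):

```python
from math import log2

def kl(P, Q):
    # D_KL(P || Q) in bits over matching supports
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

# Hypothetical joint distribution of two correlated bits
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

# Marginals of X and Y
px = {x: sum(v for (a, b), v in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (a, b), v in joint.items() if b == y) for y in (0, 1)}

P = [joint[k] for k in sorted(joint)]               # the joint
Q = [px[x] * py[y] for (x, y) in sorted(joint)]     # product of marginals

mi = kl(P, Q)   # mutual information I(X;Y) in bits
assert mi > 0   # X and Y are dependent, so I(X;Y) > 0
```

For this joint, I(X;Y) comes out to about 0.278 bits; an independent joint would give exactly 0.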
Translation notes
All examples use finite discrete distributions with explicit probability vectors. KL divergence extends to continuous distributions via integrals, and to distributions over infinite countable sets via infinite sums. The asymmetry is the same in all cases. The non-negativity proof (Gibbs' inequality) uses Jensen's inequality on the convexity of -log, which holds regardless of whether the sample space is finite.
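For a countably infinite support, the same sum runs forever; numerically one truncates where the remaining tail mass is negligible. A sketch comparing two geometric distributions (parameters chosen here for illustration):

```python
from math import log2

def geometric_pmf(p, k):
    # P(K = k) for a geometric distribution on k = 0, 1, 2, ...
    return (1 - p) ** k * p

def kl_geometric(p, q, n_terms=200):
    # Truncated D_KL between Geometric(p) and Geometric(q), in bits.
    # The tail beyond n_terms is geometrically small and is ignored.
    total = 0.0
    for k in range(n_terms):
        pk, qk = geometric_pmf(p, k), geometric_pmf(q, k)
        total += pk * log2(pk / qk)
    return total

d = kl_geometric(0.5, 0.3)
assert d > 0   # Gibbs' inequality survives the passage to infinite sums
```

The truncated sum agrees with the closed form log2(p/q) + ((1-p)/p)·log2((1-p)/(1-q)) to well within floating-point tolerance, since the neglected tail shrinks geometrically.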