
KL Divergence

Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)

KL divergence measures the expected extra surprise you incur when you use distribution Q to code data that actually comes from P. It is not a distance: D(P||Q) ≠ D(Q||P). It is always ≥ 0, and zero only when P = Q.

The formula

DKL(P || Q) = ∑_x P(x) log(P(x) / Q(x)). Each term weights the log-ratio by how likely x actually is under P. When Q assigns low probability to something P considers likely, that term blows up. When Q matches P perfectly, every ratio is 1 and every log is 0.

[Figure: bar chart of P (truth) against Q (model) over outcomes a–e, with each outcome's pointwise divergence term.]
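The Scheme listing that originally sat here can be sketched as follows, assuming the `kl-divergence` and `log2` names from the notation table below and representing each distribution as a list of probabilities; a minimal version, in bits:

```scheme
;; Base-2 logarithm, so divergences come out in bits.
(define (log2 x) (/ (log x) (log 2)))

;; KL divergence D(P || Q) for two discrete distributions given as
;; equal-length lists of probabilities over the same outcomes.
;; By convention a term with P(x) = 0 contributes nothing.
(define (kl-divergence P Q)
  (apply +
         (map (lambda (pi qi)
                (if (zero? pi)
                    0
                    (* pi (log2 (/ pi qi)))))
              P Q)))

;; A model that misses half the mass of a fair coin:
(display (kl-divergence '(0.5 0.5) '(0.9 0.1)))  ; about 0.737 bits
(newline)
```

Identical distributions give 0, and the divergence grows as Q drifts further from P.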

Gibbs' inequality: always non-negative

KL divergence is always ≥ 0. This is Gibbs' inequality. The proof applies Jensen's inequality to the concave log function. Intuitively: you can never do better than coding against the truth, so any mismatch between model and reality costs you bits.
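Spelled out for the finite case (summing over the support of P, where P(x) > 0), the proof fits in two lines:

```latex
% Jensen's inequality for the concave log: a weighted average of logs
% is at most the log of the weighted average.
-D_{\mathrm{KL}}(P \,\|\, Q)
  = \sum_x P(x) \log \frac{Q(x)}{P(x)}
  \le \log \sum_x P(x) \frac{Q(x)}{P(x)}
  = \log \sum_x Q(x)
  \le \log 1 = 0.
```

Equality in the Jensen step requires Q(x)/P(x) to be constant on the support, which together with normalization forces P = Q.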

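The Scheme example that originally sat here presumably checked non-negativity numerically; a self-contained sketch, using the `kl-divergence` and `log2` names from the notation table:

```scheme
(define (log2 x) (/ (log x) (log 2)))

(define (kl-divergence P Q)
  (apply + (map (lambda (pi qi)
                  (if (zero? pi) 0 (* pi (log2 (/ pi qi)))))
                P Q)))

;; Truth: a biased three-sided die.
(define P '(0.5 0.3 0.2))

;; However we pick the model Q, the divergence never goes negative,
;; and it is exactly 0 only for Q = P.
(for-each
 (lambda (Q)
   (display (kl-divergence P Q)) (newline))
 '((0.5 0.3 0.2)      ; perfect model: 0
   (0.25 0.25 0.5)    ; mismatched model: > 0
   (0.2 0.3 0.5)))    ; reversed model: > 0
```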

Not a distance

A true distance is symmetric: d(x,y) = d(y,x). KL divergence is not (it also fails the triangle inequality). D(P||Q) asks "how surprised am I using Q when P is true?" while D(Q||P) asks the reverse. These are different questions with different answers: D(P||Q) punishes Q harshly for assigning near-zero probability to outcomes P makes common, while D(Q||P) barely notices that mismatch.

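A self-contained sketch of the asymmetry, again using the `kl-divergence` and `log2` names from the notation table; swapping the arguments changes the answer:

```scheme
(define (log2 x) (/ (log x) (log 2)))

(define (kl-divergence P Q)
  (apply + (map (lambda (pi qi)
                  (if (zero? pi) 0 (* pi (log2 (/ pi qi)))))
                P Q)))

(define P '(0.5 0.5))   ; truth: a fair coin
(define Q '(0.9 0.1))   ; model: a heavily biased coin

;; The two directions ask different questions and give different answers.
(display (kl-divergence P Q)) (newline)   ; about 0.737 bits
(display (kl-divergence Q P)) (newline)   ; about 0.531 bits
```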

Notation reference

Symbol                   Scheme                     Meaning
DKL(P || Q)              (kl-divergence P Q)        KL divergence from Q to P
P(x) log(P(x)/Q(x))      (* pi (log2 (/ pi qi)))    Pointwise divergence term
D ≥ 0                    (>= d 0)                   Gibbs' inequality
D = 0 iff P = Q          (kl-divergence P P) = 0    Zero divergence means identical distributions

Translation notes

All examples use finite discrete distributions with explicit probability vectors. KL divergence extends to continuous distributions via integrals, and to distributions over countably infinite sets via infinite sums. The asymmetry is the same in all cases. The non-negativity proof (Gibbs' inequality) uses Jensen's inequality on the convexity of -log, which holds regardless of whether the sample space is finite.
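For reference, the continuous form mentioned above replaces the sum with an integral over densities p and q:

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx
```

which is well-defined when q(x) > 0 wherever p(x) > 0, i.e. when P is absolutely continuous with respect to Q.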