Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)
Mutual information I(X;Y) = H(X) + H(Y) − H(X,Y) measures the information shared between two variables. It is always non-negative, and it is zero if and only if X and Y are independent.
The shared information
Mutual information answers: how much does knowing X tell you about Y? Equivalently: how much does the joint distribution differ from the product of marginals? Three equivalent formulas:
I(X;Y) = H(X) + H(Y) − H(X,Y)
I(X;Y) = H(X) − H(X|Y)
I(X;Y) = H(Y) − H(Y|X)
The first says mutual information is the overlap in the entropy Venn diagram. The second says it is how much knowing Y reduces uncertainty about X. The third says the same with the roles of X and Y swapped.
Scheme
; Mutual information: I(X;Y) = H(X) + H(Y) - H(X,Y)
; The information shared between X and Y
(define (log2 x) (/ (log x) (log 2)))
(define (entropy probs)
(let loop ((ps probs) (h 0))
(if (null? ps) h
(let ((p (car ps)))
(if (= p 0) (loop (cdr ps) h)
(loop (cdr ps)
(- h (* p (log2 p)))))))))
; Weather/umbrella example from ch.3
(define h-joint (entropy '(0.15 0.55 0.25 0.05)))
(define h-weather (entropy '(0.70 0.30)))
(define h-umbrella (entropy '(0.40 0.60)))
; Three equivalent formulas
(define mi-1 (+ h-weather h-umbrella (- h-joint)))
(define mi-2 (- h-weather (- h-joint h-umbrella)))
(define mi-3 (- h-umbrella (- h-joint h-weather)))
(display "I(W;U) = H(W) + H(U) - H(W,U) = ")
(display mi-1) (newline)
(display "I(W;U) = H(W) - H(W|U) = ")
(display mi-2) (newline)
(display "I(W;U) = H(U) - H(U|W) = ")
(display mi-3) (newline)
; All three agree
(display "all equal? ")
(display (and (< (abs (- mi-1 mi-2)) 0.0001)
(< (abs (- mi-2 mi-3)) 0.0001)))
Python
import math

def entropy(probs):
return -sum(p * math.log2(p) for p in probs if p > 0)
h_joint = entropy([0.15, 0.55, 0.25, 0.05])
h_w = entropy([0.70, 0.30])
h_u = entropy([0.40, 0.60])
mi = h_w + h_u - h_joint
print(f"I(W;U) = {mi:.4f} bits")
print(f"H(W) - H(W|U) = {h_w - (h_joint - h_u):.4f} bits")
print(f"H(U) - H(U|W) = {h_u - (h_joint - h_w):.4f} bits")
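The same number also falls out of the "difference from the product of marginals" view: I(X;Y) is the KL divergence between the joint distribution and the product of the marginals. A sketch with the same weather/umbrella numbers (the 2×2 table layout, rows for weather and columns for umbrella, is an assumption about how the four joint probabilities are arranged):

```python
import math

# Joint distribution p(w, u), laid out as a 2x2 table:
# rows = weather outcomes, columns = umbrella outcomes (assumed layout).
joint = [[0.15, 0.55],
         [0.25, 0.05]]

# Marginals: sum over rows and over columns.
p_w = [sum(row) for row in joint]         # [0.70, 0.30]
p_u = [sum(col) for col in zip(*joint)]   # [0.40, 0.60]

# I(W;U) = D( p(w,u) || p(w)p(u) ): how far the joint is
# from what independence would predict.
mi = sum(joint[i][j] * math.log2(joint[i][j] / (p_w[i] * p_u[j]))
         for i in range(2) for j in range(2)
         if joint[i][j] > 0)
print(f"I(W;U) = {mi:.4f} bits")
```

This agrees with the entropy-based computation above: the KL form and H(W) + H(U) − H(W,U) are algebraically the same quantity.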
Non-negativity and independence
Mutual information is always non-negative: I(X;Y) ≥ 0. This follows from I(X;Y) = H(X) − H(X|Y) and the fact that H(X|Y) ≤ H(X): conditioning never increases entropy. It equals zero if and only if X and Y are independent, i.e. exactly when the joint distribution is the product of the marginals, so there is no shared information.
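The independence case can be checked directly: build a joint distribution as an explicit product of marginals and the mutual information comes out to zero (up to floating-point rounding). The coin pair here is a hypothetical example, not one from the text:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A genuinely independent pair (hypothetical example):
# a fair coin and a biased coin tossed separately.
p_x = [0.5, 0.5]
p_y = [0.9, 0.1]

# For independent variables the joint is the product of marginals.
joint = [px * py for px in p_x for py in p_y]

mi = entropy(p_x) + entropy(p_y) - entropy(joint)
print(f"I(X;Y) for independent X, Y = {mi:.6f} bits")
```

Because H(X,Y) = H(X) + H(Y) holds exactly for a product distribution, the three terms cancel and mi is zero aside from rounding error.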
Mutual information is symmetric: I(X;Y) = I(Y;X). Knowing X tells you as much about Y as knowing Y tells you about X. This is not true of conditional entropy: H(X|Y) and H(Y|X) can differ.
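Both claims, symmetry of I and asymmetry of conditional entropy, can be seen in the weather/umbrella numbers used above:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Same weather/umbrella distributions as in the example above.
h_joint = entropy([0.15, 0.55, 0.25, 0.05])
h_w = entropy([0.70, 0.30])
h_u = entropy([0.40, 0.60])

# Conditional entropies via the chain rule: H(X|Y) = H(X,Y) - H(Y).
h_w_given_u = h_joint - h_u   # H(W|U)
h_u_given_w = h_joint - h_w   # H(U|W)

# The two conditional entropies differ...
print(f"H(W|U) = {h_w_given_u:.4f}, H(U|W) = {h_u_given_w:.4f}")
# ...but both uncertainty reductions give the same mutual information.
print(f"H(W) - H(W|U) = {h_w - h_w_given_u:.4f}")
print(f"H(U) - H(U|W) = {h_u - h_u_given_w:.4f}")
```

H(W|U) and H(U|W) come out unequal, yet the two differences agree: that common value is I(W;U).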