Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)
Mutual information I(X;Y) = H(X) + H(Y) − H(X,Y) measures the information shared between two variables. It is always non-negative, and it is zero if and only if X and Y are independent.
The shared information
Mutual information answers: how much does knowing X tell you about Y? Equivalently: how much does the joint distribution differ from the product of marginals? Three equivalent formulas:
I(X;Y) = H(X) + H(Y) − H(X,Y)
I(X;Y) = H(X) − H(X|Y)
I(X;Y) = H(Y) − H(Y|X)
The first says mutual information is the overlap in the entropy Venn diagram. The second says it is how much knowing Y reduces uncertainty about X. The third says the same with the roles of X and Y swapped.
Scheme
; Mutual information: I(X;Y) = H(X) + H(Y) - H(X,Y)
; The information shared between X and Y
(define (log2 x) (/ (log x) (log 2)))
(define (entropy probs)
(let loop ((ps probs) (h 0))
(if (null? ps) h
(let ((p (car ps)))
(if (= p 0) (loop (cdr ps) h)
(loop (cdr ps)
(- h (* p (log2 p)))))))))
; Weather/umbrella example from ch.3
(define h-joint (entropy '(0.15 0.55 0.25 0.05)))
(define h-weather (entropy '(0.70 0.30)))
(define h-umbrella (entropy '(0.40 0.60)))
; Three equivalent formulas
(define mi-1 (+ h-weather h-umbrella (- h-joint)))
(define mi-2 (- h-weather (- h-joint h-umbrella)))
(define mi-3 (- h-umbrella (- h-joint h-weather)))
(display "I(W;U) = H(W) + H(U) - H(W,U) = ")
(display mi-1) (newline)
(display "I(W;U) = H(W) - H(W|U) = ")
(display mi-2) (newline)
(display "I(W;U) = H(U) - H(U|W) = ")
(display mi-3) (newline)
; All three agree
(display "all equal? ")
(display (and (< (abs (- mi-1 mi-2)) 0.0001)
(< (abs (- mi-2 mi-3)) 0.0001)))
Python
import math

def entropy(probs):
return -sum(p * math.log2(p) for p in probs if p > 0)
h_joint = entropy([0.15, 0.55, 0.25, 0.05])
h_w = entropy([0.70, 0.30])
h_u = entropy([0.40, 0.60])
mi = h_w + h_u - h_joint
print(f"I(W;U) = {mi:.4f} bits")
print(f"H(W) - H(W|U) = {h_w - (h_joint - h_u):.4f} bits")
print(f"H(U) - H(U|W) = {h_u - (h_joint - h_w):.4f} bits")
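The same number also falls out of the "difference from the product of marginals" view: I(X;Y) is the KL divergence between the joint distribution and the product of the marginals. A sketch with the same weather/umbrella numbers (the 2×2 table layout, rows for weather and columns for umbrella, is an assumption about how the four joint probabilities are arranged):

```python
import math

# Joint distribution p(w, u), laid out as a 2x2 table:
# rows = weather outcomes, columns = umbrella outcomes (assumed layout).
joint = [[0.15, 0.55],
         [0.25, 0.05]]

# Marginals: sum over rows and over columns.
p_w = [sum(row) for row in joint]         # [0.70, 0.30]
p_u = [sum(col) for col in zip(*joint)]   # [0.40, 0.60]

# I(W;U) = D( p(w,u) || p(w)p(u) ): how far the joint is
# from what independence would predict.
mi = sum(joint[i][j] * math.log2(joint[i][j] / (p_w[i] * p_u[j]))
         for i in range(2) for j in range(2)
         if joint[i][j] > 0)
print(f"I(W;U) = {mi:.4f} bits")
```

This agrees with the entropy-based computation above: the KL form and H(W) + H(U) − H(W,U) are algebraically the same quantity.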
Non-negativity and independence
Mutual information is always non-negative: I(X;Y) ≥ 0. This follows from I(X;Y) = H(X) − H(X|Y) and the fact that H(X|Y) ≤ H(X): conditioning never increases entropy. It equals zero if and only if X and Y are independent, i.e. exactly when the joint distribution is the product of the marginals, so there is no shared information.
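The independence case can be checked directly: build a joint distribution as an explicit product of marginals and the mutual information comes out to zero (up to floating-point rounding). The coin pair here is a hypothetical example, not one from the text:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A genuinely independent pair (hypothetical example):
# a fair coin and a biased coin tossed separately.
p_x = [0.5, 0.5]
p_y = [0.9, 0.1]

# For independent variables the joint is the product of marginals.
joint = [px * py for px in p_x for py in p_y]

mi = entropy(p_x) + entropy(p_y) - entropy(joint)
print(f"I(X;Y) for independent X, Y = {mi:.6f} bits")
```

Because H(X,Y) = H(X) + H(Y) holds exactly for a product distribution, the three terms cancel and mi is zero aside from rounding error.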
Mutual information is symmetric: I(X;Y) = I(Y;X). Knowing X tells you as much about Y as knowing Y tells you about X. This is not true of conditional entropy: H(X|Y) and H(Y|X) can differ.
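Both claims, symmetry of I and asymmetry of conditional entropy, can be seen in the weather/umbrella numbers used above:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Same weather/umbrella distributions as in the example above.
h_joint = entropy([0.15, 0.55, 0.25, 0.05])
h_w = entropy([0.70, 0.30])
h_u = entropy([0.40, 0.60])

# Conditional entropies via the chain rule: H(X|Y) = H(X,Y) - H(Y).
h_w_given_u = h_joint - h_u   # H(W|U)
h_u_given_w = h_joint - h_w   # H(U|W)

# The two conditional entropies differ...
print(f"H(W|U) = {h_w_given_u:.4f}, H(U|W) = {h_u_given_w:.4f}")
# ...but both uncertainty reductions give the same mutual information.
print(f"H(W) - H(W|U) = {h_w - h_w_given_u:.4f}")
print(f"H(U) - H(U|W) = {h_u - h_u_given_w:.4f}")
```

H(W|U) and H(U|W) come out unequal, yet the two differences agree: that common value is I(W;U).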