Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)
Joint entropy H(X,Y) measures the total uncertainty of two variables together. Conditional entropy H(X|Y) = H(X,Y) − H(Y) is what remains uncertain about X after observing Y. Conditioning never increases entropy.
Joint entropy
The joint entropy H(X,Y) = −∑ P(x,y) log2 P(x,y) measures the total surprise of observing both X and Y together. If X and Y are independent, H(X,Y) = H(X) + H(Y). If they are dependent, the joint entropy is less: shared structure reduces total uncertainty.
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution over (weather, umbrella) and its two marginals.
# The marginals are consistent with the joint: rows sum to 0.70/0.30,
# columns to 0.40/0.60.
joint = [0.15, 0.55, 0.25, 0.05]
p_weather = [0.70, 0.30]
p_umbrella = [0.40, 0.60]

print(f"H(weather,umbrella) = {entropy(joint):.4f} bits")
print(f"H(weather) = {entropy(p_weather):.4f} bits")
print(f"H(umbrella) = {entropy(p_umbrella):.4f} bits")
print(f"H(X)+H(Y) = {entropy(p_weather)+entropy(p_umbrella):.4f} bits")
print(f"dependent? {entropy(joint) < entropy(p_weather)+entropy(p_umbrella)}")
Conditional entropy
H(X|Y) = H(X,Y) − H(Y) tells you how much uncertainty remains about X once you know Y. This is the chain rule of entropy: H(X,Y) = H(Y) + H(X|Y). Conditioning never increases entropy: H(X|Y) ≤ H(X). Knowing something can only help, on average. (A particular observation Y = y can leave you more uncertain about X; the inequality holds for the average over y.)
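A minimal sketch, reusing the weather/umbrella distributions from the joint-entropy example above, shows H(X|Y) computed via the chain rule and confirms that it does not exceed H(X):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Same example distributions as above.
joint = [0.15, 0.55, 0.25, 0.05]
p_weather = [0.70, 0.30]
p_umbrella = [0.40, 0.60]

# H(weather | umbrella) = H(weather, umbrella) - H(umbrella)
h_cond = entropy(joint) - entropy(p_umbrella)

print(f"H(weather|umbrella) = {h_cond:.4f} bits")
print(f"H(weather)          = {entropy(p_weather):.4f} bits")
print(f"conditioning never hurts? {h_cond <= entropy(p_weather)}")
```

Observing the umbrella cuts the uncertainty about the weather from about 0.88 bits to about 0.63 bits.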
The chain rule generalizes: H(X1, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1,...,Xn-1). Each new variable adds only its residual uncertainty, conditioned on everything before it.