Shannon 1948 (public domain) · Wikipedia (CC BY-SA 4.0)
Joint entropy H(X,Y) measures the total uncertainty of two variables together. Conditional entropy H(X|Y) = H(X,Y) − H(Y) is what remains uncertain about X after observing Y. Conditioning never increases entropy.
Joint entropy
The joint entropy H(X,Y) = −∑ P(x,y) log2 P(x,y) measures the total surprise of observing both X and Y together. If X and Y are independent, H(X,Y) = H(X) + H(Y). If they are dependent, the joint entropy is less: shared structure reduces total uncertainty.
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution over (weather, umbrella) and its two marginals.
# The marginals are consistent with the joint: rows sum to 0.70/0.30,
# columns to 0.40/0.60.
joint = [0.15, 0.55, 0.25, 0.05]
p_weather = [0.70, 0.30]
p_umbrella = [0.40, 0.60]

print(f"H(weather,umbrella) = {entropy(joint):.4f} bits")
print(f"H(weather) = {entropy(p_weather):.4f} bits")
print(f"H(umbrella) = {entropy(p_umbrella):.4f} bits")
print(f"H(X)+H(Y) = {entropy(p_weather)+entropy(p_umbrella):.4f} bits")
print(f"dependent? {entropy(joint) < entropy(p_weather)+entropy(p_umbrella)}")
Conditional entropy
H(X|Y) = H(X,Y) − H(Y) tells you how much uncertainty remains about X once you know Y. This is the chain rule of entropy: H(X,Y) = H(Y) + H(X|Y). Conditioning never increases entropy: H(X|Y) ≤ H(X). Knowing something can only help, on average. (A particular observation Y = y can leave you more uncertain about X; the inequality holds for the average over y.)
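A minimal sketch, reusing the weather/umbrella distributions from the joint-entropy example above, shows H(X|Y) computed via the chain rule and confirms that it does not exceed H(X):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Same example distributions as above.
joint = [0.15, 0.55, 0.25, 0.05]
p_weather = [0.70, 0.30]
p_umbrella = [0.40, 0.60]

# H(weather | umbrella) = H(weather, umbrella) - H(umbrella)
h_cond = entropy(joint) - entropy(p_umbrella)

print(f"H(weather|umbrella) = {h_cond:.4f} bits")
print(f"H(weather)          = {entropy(p_weather):.4f} bits")
print(f"conditioning never hurts? {h_cond <= entropy(p_weather)}")
```

Observing the umbrella cuts the uncertainty about the weather from about 0.88 bits to about 0.63 bits.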
The chain rule generalizes: H(X1, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1,...,Xn-1). Each new variable adds only its residual uncertainty, conditioned on everything before it.