Logistic Regression
Deisenroth et al., Mathematics for Machine Learning (CC BY 4.0) · mml-book.github.io
Logistic regression squashes linear output through a sigmoid to get probabilities. Cross-entropy loss measures how wrong the probabilities are — it penalizes confident wrong answers heavily. Maximum likelihood estimation is equivalent to minimizing cross-entropy. The decision boundary is where the probability hits 0.5.
Sigmoid function
The sigmoid σ(z) = 1 / (1 + e⁻ᶻ) maps any real number to (0, 1). Large positive inputs map near 1; large negative inputs map near 0. At z = 0, σ(0) = 0.5.
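A minimal sketch of the sigmoid in plain Python (the numerically stable two-branch form is an implementation choice, not from the text: it avoids overflow in `exp(-z)` for large negative `z`):

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For very negative z, exp(-z) would overflow; rewrite using exp(z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # ≈ 0.99995 (large positive input maps near 1)
print(sigmoid(-10))  # ≈ 0.000045 (large negative input maps near 0)
```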
Cross-entropy loss
Cross-entropy measures the mismatch between the true label distribution and the predicted probabilities (not a true distance: it is asymmetric). For binary classification: L = -[y log(p) + (1-y) log(1-p)]. When the model is confident and right, loss is low. When confident and wrong, loss explodes.
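The binary formula translates directly (a sketch in plain Python; the `eps` clamp is an added practical guard, not from the text, keeping `log` finite when `p` hits exactly 0 or 1):

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy in nats. eps clamps p away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confident and right: low loss.
print(cross_entropy(1, 0.99))  # ≈ 0.01
# Confident and wrong: loss explodes.
print(cross_entropy(1, 0.01))  # ≈ 4.6
```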
Gradient descent for logistic regression
For a single example, the gradient of the cross-entropy loss with respect to the weight is (p - y) · x. Same update rule as linear regression, but p comes from the sigmoid. This elegance is not a coincidence — it falls out of the exponential family.
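A minimal training loop using that gradient (the toy 1-D dataset, learning rate, and epoch count are illustrative choices, not from the text):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: negative x labeled 0, positive x labeled 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(1000):
    dw = db = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        dw += (p - y) * x   # dL/dw = (p - y) * x, same form as linear regression
        db += (p - y)       # dL/db = (p - y)
    w -= lr * dw / len(xs)  # average gradient over the batch
    b -= lr * db / len(xs)

print(w, b)                  # w ends up positive; b near 0 on this symmetric data
print(sigmoid(w * 2.0 + b))  # a clearly positive point gets probability near 1
```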
Notation reference
| Math | Scheme | Python | Meaning |
|---|---|---|---|
| σ(z) | (sigmoid z) | sigmoid(z) | Sigmoid function |
| -log p | (- (log p)) | -math.log(p) | Surprise / information |
| H(y, p) | (cross-entropy y p) | cross_entropy(y, p) | Cross-entropy loss |
| p̂ = σ(wx + b) | (sigmoid (+ (* w x) b)) | sigmoid(w*x + b) | Predicted probability |
| ∂L/∂w = (p-y)x | dw | dw | Gradient of loss w.r.t. weight |
Translation notes
The sigmoid is the bridge between linear models and probability. Scheme's (exp z) maps directly to Python's math.exp(z) — same function, same numerical behavior. The gradient (p - y) · x looks identical to linear regression's gradient, but p is now the sigmoid output instead of the raw linear prediction. This unification is the power of the generalized linear model.
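The "identical gradient" point can be made concrete: one update step where only the prediction function changes (the `grad_step` helper and its inputs are hypothetical, written for this illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_step(predict, w, b, xs, ys, lr=0.1):
    """One gradient step. The update rule (p - y) * x is the same for both
    models; only `predict` differs (identity vs. sigmoid)."""
    n = len(xs)
    dw = sum((predict(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
    db = sum((predict(w * x + b) - y) for x, y in zip(xs, ys)) / n
    return w - lr * dw, b - lr * db

xs, ys = [0.0, 1.0, 2.0], [0, 0, 1]
# Linear regression: identity prediction (squared-error gradient).
w_lin, b_lin = grad_step(lambda z: z, 0.0, 0.0, xs, ys)
# Logistic regression: sigmoid prediction (cross-entropy gradient).
w_log, b_log = grad_step(sigmoid, 0.0, 0.0, xs, ys)
print(w_lin, b_lin)
print(w_log, b_log)
```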
Cross-entropy is measured in nats here (natural log). Shannon's original formulation uses log base 2 (bits). The optimization is the same either way — only the scale changes. The connection between cross-entropy and information content runs deep: a freshness filter uses exactly this measure of surprise to detect whether a signal carries substance or is just noise.
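The nats-to-bits conversion is just a constant factor of ln 2, which a quick check makes visible (plain Python, standard `math` only):

```python
import math

# Surprise of an event with probability 1/4, in nats vs. bits.
p = 0.25
nats = -math.log(p)        # natural log
bits = -math.log2(p)       # log base 2, Shannon's original units
print(nats)                # ≈ 1.386
print(bits)                # 2.0
print(nats / math.log(2))  # 2.0: same quantity, rescaled by ln 2
```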
Neighbors
- Shannon Ch.1 — information and surprise: cross-entropy is the expected surprise under the wrong distribution
- Grinstead Ch.4 — Bayes' theorem: the probabilistic foundation for maximum likelihood
- Information Theory Ch.2 — cross-entropy loss is KL divergence from the true distribution
- Probability Ch.4 — Bayes' theorem connects posterior probability to the logistic function
- Statistics Ch.5 — hypothesis testing applies the same maximum likelihood framework