
Logistic Regression

Deisenroth et al., Mathematics for Machine Learning (CC BY 4.0) · mml-book.github.io

Logistic regression squashes linear output through a sigmoid to get probabilities. Cross-entropy loss measures how wrong the probabilities are — it penalizes confident wrong answers heavily. Maximum likelihood estimation is equivalent to minimizing cross-entropy. The decision boundary is where the probability hits 0.5.

[Figure: the sigmoid curve σ(z) rising from 0 to 1, with class 0 on the left and class 1 on the right]

Sigmoid function

The sigmoid σ(z) = 1 / (1 + e⁻ᶻ) maps any real number to (0, 1). Large positive inputs map near 1; large negative inputs map near 0. At z = 0, σ(0) = 0.5.

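A minimal Python sketch of the sigmoid (the function name matches the notation table below; the two-branch form is a standard guard against overflow in exp, not part of the definition):

```python
import math

def sigmoid(z):
    # Maps any real z into (0, 1); sigmoid(0) == 0.5.
    # Split on the sign of z so exp() is only ever called
    # on a non-positive argument and cannot overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

Note the symmetry σ(−z) = 1 − σ(z): the curve is centered on the 0.5 crossing at z = 0.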

Cross-entropy loss

Cross-entropy measures the mismatch between the true label distribution and the predicted probabilities. For binary classification: L = -[y log(p) + (1-y) log(1-p)]. When the model is confident and right, the loss is low. When it is confident and wrong, the loss explodes.

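A direct Python translation of the binary formula (the eps clamp is a common numerical guard so log never sees exactly 0, not part of the formula itself):

```python
import math

def cross_entropy(y, p, eps=1e-12):
    # y is the true label (0 or 1), p the predicted probability of class 1.
    # Clamp p into [eps, 1 - eps] to keep log() finite at the extremes.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

Confident and right (y = 1, p = 0.99) gives a loss near 0; confident and wrong (y = 1, p = 0.01) gives a loss above 4 nats.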

Gradient descent for logistic regression

The gradient of cross-entropy loss with respect to the weight is (p - y) · x. Same update rule as linear regression, but p comes from the sigmoid. This elegance is not a coincidence — it falls out of the exponential family.

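A single-feature sketch in Python, assuming full-batch gradient descent with illustrative defaults for the learning rate and epoch count; the inner loop is exactly the (p − y)·x rule from the text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=1000):
    # Fit weight w and bias b by gradient descent on mean cross-entropy.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        dw = db = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            dw += (p - y) * x   # gradient of the loss w.r.t. w
            db += (p - y)       # gradient of the loss w.r.t. b
        w -= lr * dw / n
        b -= lr * db / n
    return w, b
```

On a toy separable set like xs = [-2, -1, 1, 2], ys = [0, 0, 1, 1], the learned w comes out positive and the decision boundary sits near x = 0, where the predicted probability crosses 0.5.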

Notation reference

Math | Scheme | Python | Meaning
σ(z) | (sigmoid z) | sigmoid(z) | Sigmoid function
-log p | (- (log p)) | -math.log(p) | Surprise / information
H(y, p) | (cross-entropy y p) | cross_entropy(y, p) | Cross-entropy loss
p̂ = σ(wx + b) | (sigmoid (+ (* w x) b)) | sigmoid(w*x + b) | Predicted probability
∂L/∂w = (p - y)x | dw | dw | Gradient of loss w.r.t. weight

Translation notes

The sigmoid is the bridge between linear models and probability. Scheme's (exp z) maps directly to Python's math.exp(z) — same function, same numerical behavior. The gradient (p - y) · x looks identical to linear regression's gradient, but p is now the sigmoid output instead of the raw linear prediction. This unification is the power of the generalized linear model.

Cross-entropy is measured in nats here (natural log). Shannon's original formulation uses log base 2 (bits). The optimization is the same either way — only the scale changes. The connection between cross-entropy and information content runs deep: a freshness filter uses exactly this measure of surprise to detect whether a signal carries substance or is just noise.
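The nats-to-bits conversion is just a constant factor of ln 2, as a quick sketch (the helper name is illustrative):

```python
import math

def to_bits(nats):
    # An information quantity in nats divided by ln 2 gives bits.
    return nats / math.log(2)

# ln 2 nats is exactly 1 bit; 1 nat is about 1.44 bits.
```

So a loss curve plotted in nats and the same curve in bits have identical minima — the y-axis is simply rescaled.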

Neighbors
Ready for the real thing? Read Mathematics for Machine Learning Ch. 12 and D2L Ch. 4.