
Logistic Regression

Deisenroth et al., Mathematics for Machine Learning (CC BY 4.0) · mml-book.github.io

Logistic regression squashes linear output through a sigmoid to get probabilities. Cross-entropy loss measures how wrong the probabilities are — it penalizes confident wrong answers heavily. Maximum likelihood estimation is equivalent to minimizing cross-entropy. The decision boundary is where the probability hits 0.5.

[Figure: the sigmoid curve σ(z) rising from 0 to 1, with class 0 on the left and class 1 on the right]

Sigmoid function

The sigmoid σ(z) = 1 / (1 + e⁻ᶻ) maps any real number to (0, 1). Large positive inputs map near 1; large negative inputs map near 0. At z = 0, σ(0) = 0.5.

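A minimal Python sketch of the sigmoid (the function name matches the notation table below; the two-branch form is a standard guard against overflow in exp, not part of the definition):

```python
import math

def sigmoid(z):
    # Maps any real z into (0, 1); sigmoid(0) == 0.5.
    # Split on the sign of z so exp() is only ever called
    # on a non-positive argument and cannot overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

Note the symmetry σ(−z) = 1 − σ(z): the curve is centered on the 0.5 crossing at z = 0.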

Cross-entropy loss

Cross-entropy measures the mismatch between the true label distribution and the predicted probabilities. For binary classification: L = -[y log(p) + (1-y) log(1-p)]. When the model is confident and right, the loss is low. When it is confident and wrong, the loss explodes.

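A direct Python translation of the binary formula (the eps clamp is a common numerical guard so log never sees exactly 0, not part of the formula itself):

```python
import math

def cross_entropy(y, p, eps=1e-12):
    # y is the true label (0 or 1), p the predicted probability of class 1.
    # Clamp p into [eps, 1 - eps] to keep log() finite at the extremes.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

Confident and right (y = 1, p = 0.99) gives a loss near 0; confident and wrong (y = 1, p = 0.01) gives a loss above 4 nats.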

Gradient descent for logistic regression

The gradient of cross-entropy loss with respect to the weight is (p - y) · x. Same update rule as linear regression, but p comes from the sigmoid. This elegance is not a coincidence — it falls out of the exponential family.

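A single-feature sketch in Python, assuming full-batch gradient descent with illustrative defaults for the learning rate and epoch count; the inner loop is exactly the (p − y)·x rule from the text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=1000):
    # Fit weight w and bias b by gradient descent on mean cross-entropy.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        dw = db = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            dw += (p - y) * x   # gradient of the loss w.r.t. w
            db += (p - y)       # gradient of the loss w.r.t. b
        w -= lr * dw / n
        b -= lr * db / n
    return w, b
```

On a toy separable set like xs = [-2, -1, 1, 2], ys = [0, 0, 1, 1], the learned w comes out positive and the decision boundary sits near x = 0, where the predicted probability crosses 0.5.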

Notation reference

Math | Scheme | Python | Meaning
σ(z) | (sigmoid z) | sigmoid(z) | Sigmoid function
-log p | (- (log p)) | -math.log(p) | Surprise / information
H(y, p) | (cross-entropy y p) | cross_entropy(y, p) | Cross-entropy loss
p̂ = σ(wx + b) | (sigmoid (+ (* w x) b)) | sigmoid(w*x + b) | Predicted probability
∂L/∂w = (p - y)x | dw | dw | Gradient of loss w.r.t. weight

Translation notes

The sigmoid is the bridge between linear models and probability. Scheme's (exp z) maps directly to Python's math.exp(z) — same function, same numerical behavior. The gradient (p - y) · x looks identical to linear regression's gradient, but p is now the sigmoid output instead of the raw linear prediction. This unification is the power of the generalized linear model.

Cross-entropy is measured in nats here (natural log). Shannon's original formulation uses log base 2 (bits). The optimization is the same either way — only the scale changes. The connection between cross-entropy and information content runs deep: a freshness filter uses exactly this measure of surprise to detect whether a signal carries substance or is just noise.
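The nats-to-bits conversion is just a constant factor of ln 2, as a quick sketch (the helper name is illustrative):

```python
import math

def to_bits(nats):
    # An information quantity in nats divided by ln 2 gives bits.
    return nats / math.log(2)

# ln 2 nats is exactly 1 bit; 1 nat is about 1.44 bits.
```

So a loss curve plotted in nats and the same curve in bits have identical minima — the y-axis is simply rescaled.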

Neighbors
Ready for the real thing? Read Mathematics for Machine Learning Ch. 12 and D2L Ch. 4.