Deisenroth et al., Mathematics for Machine Learning (CC BY 4.0) · mml-book.github.io
Find weights that minimize squared error. The normal equation gives the closed-form solution: w = (XᵀX)⁻¹Xᵀy. Gradient descent gives the iterative one: update weights in the direction that reduces error fastest. Both arrive at the same answer for linear models.
Normal equation
For y = wx + b, the least-squares solution has a closed form. With one feature: w = sum((xᵢ - x̄)(yᵢ - ȳ)) / sum((xᵢ - x̄)²) and b = ȳ - w·x̄. This is the normal equation in scalar form.
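Where the formula comes from: set both partial derivatives of the squared error to zero and solve. A brief sketch:

```latex
L(w,b) = \sum_i (w x_i + b - y_i)^2, \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; b = \bar{y} - w\bar{x}, \qquad
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\;
w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
```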
Scheme
; Normal equation for simple linear regression
; y = wx + b, minimize sum of squared residuals
(define xs '(1 2 3 4 5 6 7))
(define ys '(2.1 4.0 5.8 8.1 9.9 12.2 13.8))
(define (mean lst)
(/ (apply + lst) (length lst)))
(define (dot a b)
(apply + (map * a b)))
(define x-bar (mean xs))
(define y-bar (mean ys))
; Center the data
(define xc (map (lambda (x) (- x x-bar)) xs))
(define yc (map (lambda (y) (- y y-bar)) ys))
; w = sum(xc * yc) / sum(xc * xc)
(define w (/ (dot xc yc) (dot xc xc)))
(define b (- y-bar (* w x-bar)))
(display "w = ") (display w) (newline)
(display "b = ") (display b) (newline)
(display "Model: y = ") (display w) (display "x + ")
(display b)
Python
# Normal equation for simple linear regression
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.1, 4.0, 5.8, 8.1, 9.9, 12.2, 13.8]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
xc = [x - x_bar for x in xs]
yc = [y - y_bar for y in ys]
w = sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)
b = y_bar - w * x_bar
print("w =", round(w, 4))
print("b =", round(b, 4))
print("Model: y =", round(w, 4), "* x +", round(b, 4))
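The centered-data formula is the intro's matrix form w = (XᵀX)⁻¹Xᵀy specialized to one feature. A sketch of the general form with a bias column, writing out the 2×2 inverse by hand rather than calling a linear-algebra library:

```python
# Normal equation in matrix form: w = (XᵀX)⁻¹ Xᵀy.
# The design matrix X has a column of ones for the bias, so one
# solve recovers both w and b. For a single feature,
# XᵀX = [[Σx², Σx], [Σx, n]] and Xᵀy = [Σxy, Σy].
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.1, 4.0, 5.8, 8.1, 9.9, 12.2, 13.8]
n = len(xs)

sxx = sum(x * x for x in xs)
sx = sum(xs)
sxy = sum(x * y for x, y in zip(xs, ys))
sy = sum(ys)

# Explicit 2x2 inverse via the determinant (Cramer's rule)
det = sxx * n - sx * sx
w = (n * sxy - sx * sy) / det
b = (sxx * sy - sx * sxy) / det
print("w =", round(w, 4))
print("b =", round(b, 4))
```

This gives the same w and b as the centered-data version; centering is just a change of variables that makes XᵀX diagonal.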
Gradient descent loop
When closed-form solutions are too expensive (millions of features), iterate instead. Compute the gradient of the loss with respect to each weight, then take a small step downhill. Repeat until convergence.
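The update directions used in the code below come from differentiating the mean squared error; for reference:

```latex
L(w,b) = \frac{1}{n}\sum_{i=1}^{n} (w x_i + b - y_i)^2
\quad\Rightarrow\quad
\frac{\partial L}{\partial w} = \frac{2}{n}\sum_i (w x_i + b - y_i)\,x_i,
\qquad
\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i (w x_i + b - y_i)
```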
Scheme
; Gradient descent for linear regression
; y = wx + b, loss = (1/n) * sum((y_pred - y_true)^2)
(define xs '(1 2 3 4 5 6 7))
(define ys '(2.1 4.0 5.8 8.1 9.9 12.2 13.8))
(define n (length xs))
(define lr 0.01) ; learning rate
(define (train w b steps)
(if (= steps 0)
(begin
(display "w = ") (display w) (newline)
(display "b = ") (display b))
(let* ((preds (map (lambda (x) (+ (* w x) b)) xs))
(errors (map - preds ys))
(dw (/ (* 2 (apply + (map * errors xs))) n))
(db (/ (* 2 (apply + errors)) n)))
(train (- w (* lr dw))
(- b (* lr db))
(- steps 1)))))
(display "After 1000 steps of gradient descent:") (newline)
(train 0 0 1000) ; start with w=0, b=0
Python
# Gradient descent for linear regression
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.1, 4.0, 5.8, 8.1, 9.9, 12.2, 13.8]
n = len(xs)
lr = 0.01
w, b = 0.0, 0.0
for step in range(1000):
    preds = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(preds, ys)]
    dw = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    db = 2 * sum(errors) / n
    w -= lr * dw
    b -= lr * db
print("After 1000 steps:")
print("w =", round(w, 4))
print("b =", round(b, 4))
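A fixed step count is arbitrary; a common variant runs until the loss stops improving. A sketch of that variant, where the tolerance `tol` and the step cap are illustrative choices, not values from the text:

```python
# Gradient descent with a convergence check: stop when the loss
# improvement per step drops below `tol` instead of running a
# fixed number of iterations.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.1, 4.0, 5.8, 8.1, 9.9, 12.2, 13.8]
n = len(xs)
lr, tol = 0.01, 1e-10
w, b = 0.0, 0.0
prev_loss = float("inf")
for step in range(100_000):          # safety cap on iterations
    preds = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / n
    if prev_loss - loss < tol:       # converged: loss barely moved
        break
    prev_loss = loss
    w -= lr * 2 * sum(e * x for e, x in zip(errors, xs)) / n
    b -= lr * 2 * sum(errors) / n
print("converged after", step, "steps: w =", round(w, 3), "b =", round(b, 3))
```

The bias converges more slowly than the weight here, which is why 1000 fixed steps leave b further from the closed-form answer than w.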
Comparing predictions
Both methods should agree closely: gradient descent converges toward the normal-equation solution. Compare by computing predictions on held-out data and measuring mean squared error.
Scheme
; Compare normal equation vs gradient descent predictions
; using the weights found above
; Normal equation solution (pre-computed)
(define w-ne 1.9857)
(define b-ne 0.0429)
; Test data
(define test-x '(8 9 10))
(define test-y '(16.0 18.1 19.8))
(define (predict w b x) (+ (* w x) b))
(define (mse predictions actuals)
(/ (apply + (map (lambda (p a) (expt (- p a) 2))
predictions actuals))
(length predictions)))
(define preds (map (lambda (x) (predict w-ne b-ne x))
test-x))
(display "Predictions: ") (display preds) (newline)
(display "Actuals: ") (display test-y) (newline)
(display "MSE: ") (display (mse preds test-y))
Python
# Compare predictions
w, b = 1.9857, 0.0429
test_x = [8, 9, 10]
test_y = [16.0, 18.1, 19.8]
preds = [w * x + b for x in test_x]
mse = sum((p - y)**2 for p, y in zip(preds, test_y)) / len(test_y)
print("Predictions:", [round(p, 2) for p in preds])
print("Actuals: ", test_y)
print("MSE:", round(mse, 4))
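MSE is scale-dependent; R² (the coefficient of determination) is a complementary, scale-free check: 1 means a perfect fit, 0 means no better than predicting the mean. A sketch using the least-squares weights on the same held-out points:

```python
# R² = 1 - SS_res / SS_tot on the held-out test set
w, b = 1.9857, 0.0429
test_x = [8, 9, 10]
test_y = [16.0, 18.1, 19.8]
preds = [w * x + b for x in test_x]
y_mean = sum(test_y) / len(test_y)
ss_res = sum((p - y) ** 2 for p, y in zip(preds, test_y))  # residual sum of squares
ss_tot = sum((y - y_mean) ** 2 for y in test_y)            # total sum of squares
r2 = 1 - ss_res / ss_tot
print("R² =", round(r2, 4))
```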
Notation reference
Math | Scheme          | Python                 | Meaning
w, b | w, b            | w, b                   | Weight and bias
XᵀX  | (dot xc xc)     | sum(x*x for x in xc)   | Inner product
∇L   | dw, db          | dw, db                 | Gradient of loss
η    | lr              | lr                     | Learning rate
MSE  | (mse preds ys)  | mse                    | Mean squared error
Translation notes
The normal equation computes the answer in one pass — clean and functional. Gradient descent is inherently iterative: each step depends on the previous weights. Scheme handles this via tail recursion (train calls itself with updated w and b). Python uses a for loop with mutation. Same computation, different idioms.
In practice, gradient descent wins for large datasets because the normal equation requires inverting XᵀX, which is O(d³) in the number of features. Gradient descent is O(nd) per step.
Neighbors
Hefferon Ch. 1 — linear systems: the normal equation is a linear system in disguise
Calculus Ch. 12 — gradient: the direction of steepest ascent that gradient descent reverses