Deisenroth et al., Mathematics for Machine Learning (CC BY 4.0) · mml-book.github.io
Find weights that minimize squared error. The normal equation gives the closed-form solution: w = (XᵀX)⁻¹Xᵀy. Gradient descent gives the iterative one: update weights in the direction that reduces error fastest. Both arrive at the same answer for linear models.
Normal equation
For y = wx + b, the least-squares solution has a closed form. With one feature: w = sum((xᵢ - x̄)(yᵢ - ȳ)) / sum((xᵢ - x̄)²) and b = ȳ - w·x̄. This is the normal equation in scalar form.
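Where the formula comes from: set both partial derivatives of the squared error to zero and solve. A brief sketch:

```latex
L(w,b) = \sum_i (w x_i + b - y_i)^2, \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; b = \bar{y} - w\bar{x}, \qquad
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\;
w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
```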
Scheme
; Normal equation for simple linear regression
; y = wx + b, minimize sum of squared residuals
(define xs '(1 2 3 4 5 6 7))
(define ys '(2.1 4.0 5.8 8.1 9.9 12.2 13.8))
(define (mean lst)
(/ (apply + lst) (length lst)))
(define (dot a b)
(apply + (map * a b)))
(define x-bar (mean xs))
(define y-bar (mean ys))
; Center the data
(define xc (map (lambda (x) (- x x-bar)) xs))
(define yc (map (lambda (y) (- y y-bar)) ys))
; w = sum(xc * yc) / sum(xc * xc)
(define w (/ (dot xc yc) (dot xc xc)))
(define b (- y-bar (* w x-bar)))
(display "w = ") (display w) (newline)
(display "b = ") (display b) (newline)
(display "Model: y = ") (display w) (display "x + ")
(display b)
Python
# Normal equation for simple linear regression
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.1, 4.0, 5.8, 8.1, 9.9, 12.2, 13.8]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
xc = [x - x_bar for x in xs]
yc = [y - y_bar for y in ys]
w = sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)
b = y_bar - w * x_bar
print("w =", round(w, 4))
print("b =", round(b, 4))
print("Model: y =", round(w, 4), "* x +", round(b, 4))
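The centered-data formula is the intro's matrix form w = (XᵀX)⁻¹Xᵀy specialized to one feature. A sketch of the general form with a bias column, writing out the 2×2 inverse by hand rather than calling a linear-algebra library:

```python
# Normal equation in matrix form: w = (XᵀX)⁻¹ Xᵀy.
# The design matrix X has a column of ones for the bias, so one
# solve recovers both w and b. For a single feature,
# XᵀX = [[Σx², Σx], [Σx, n]] and Xᵀy = [Σxy, Σy].
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.1, 4.0, 5.8, 8.1, 9.9, 12.2, 13.8]
n = len(xs)

sxx = sum(x * x for x in xs)
sx = sum(xs)
sxy = sum(x * y for x, y in zip(xs, ys))
sy = sum(ys)

# Explicit 2x2 inverse via the determinant (Cramer's rule)
det = sxx * n - sx * sx
w = (n * sxy - sx * sy) / det
b = (sxx * sy - sx * sxy) / det
print("w =", round(w, 4))
print("b =", round(b, 4))
```

This gives the same w and b as the centered-data version; centering is just a change of variables that makes XᵀX diagonal.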
Gradient descent loop
When closed-form solutions are too expensive (millions of features), iterate instead. Compute the gradient of the loss with respect to each weight, then take a small step downhill. Repeat until convergence.
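The update directions used in the code below come from differentiating the mean squared error; for reference:

```latex
L(w,b) = \frac{1}{n}\sum_{i=1}^{n} (w x_i + b - y_i)^2
\quad\Rightarrow\quad
\frac{\partial L}{\partial w} = \frac{2}{n}\sum_i (w x_i + b - y_i)\,x_i,
\qquad
\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i (w x_i + b - y_i)
```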
Scheme
; Gradient descent for linear regression
; y = wx + b, loss = (1/n) * sum((y_pred - y_true)^2)
(define xs '(1 2 3 4 5 6 7))
(define ys '(2.1 4.0 5.8 8.1 9.9 12.2 13.8))
(define n (length xs))
(define lr 0.01) ; learning rate
(define (train w b steps)
(if (= steps 0)
(begin
(display "w = ") (display w) (newline)
(display "b = ") (display b))
(let* ((preds (map (lambda (x) (+ (* w x) b)) xs))
(errors (map - preds ys))
(dw (/ (* 2 (apply + (map * errors xs))) n))
(db (/ (* 2 (apply + errors)) n)))
(train (- w (* lr dw))
(- b (* lr db))
(- steps 1)))))
(display "After 1000 steps of gradient descent:") (newline)
(train 0 0 1000) ; start with w=0, b=0
Python
# Gradient descent for linear regression
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.1, 4.0, 5.8, 8.1, 9.9, 12.2, 13.8]
n = len(xs)
lr = 0.01
w, b = 0.0, 0.0
for step in range(1000):
    preds = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(preds, ys)]
    dw = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    db = 2 * sum(errors) / n
    w -= lr * dw
    b -= lr * db
print("After 1000 steps:")
print("w =", round(w, 4))
print("b =", round(b, 4))
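A fixed step count is arbitrary; a common variant runs until the loss stops improving. A sketch of that variant, where the tolerance `tol` and the step cap are illustrative choices, not values from the text:

```python
# Gradient descent with a convergence check: stop when the loss
# improvement per step drops below `tol` instead of running a
# fixed number of iterations.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.1, 4.0, 5.8, 8.1, 9.9, 12.2, 13.8]
n = len(xs)
lr, tol = 0.01, 1e-10
w, b = 0.0, 0.0
prev_loss = float("inf")
for step in range(100_000):          # safety cap on iterations
    preds = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / n
    if prev_loss - loss < tol:       # converged: loss barely moved
        break
    prev_loss = loss
    w -= lr * 2 * sum(e * x for e, x in zip(errors, xs)) / n
    b -= lr * 2 * sum(errors) / n
print("converged after", step, "steps: w =", round(w, 3), "b =", round(b, 3))
```

The bias converges more slowly than the weight here, which is why 1000 fixed steps leave b further from the closed-form answer than w.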
Comparing predictions
Both methods should agree closely: gradient descent converges toward the normal-equation solution. Compare by computing predictions on held-out data and measuring mean squared error.
Scheme
; Compare normal equation vs gradient descent predictions
; using the weights found above
; Normal equation solution (pre-computed)
(define w-ne 1.9857)
(define b-ne 0.0429)
; Test data
(define test-x '(8 9 10))
(define test-y '(16.0 18.1 19.8))
(define (predict w b x) (+ (* w x) b))
(define (mse predictions actuals)
(/ (apply + (map (lambda (p a) (expt (- p a) 2))
predictions actuals))
(length predictions)))
(define preds (map (lambda (x) (predict w-ne b-ne x))
test-x))
(display "Predictions: ") (display preds) (newline)
(display "Actuals: ") (display test-y) (newline)
(display "MSE: ") (display (mse preds test-y))
Python
# Compare predictions
w, b = 1.9857, 0.0429
test_x = [8, 9, 10]
test_y = [16.0, 18.1, 19.8]
preds = [w * x + b for x in test_x]
mse = sum((p - y)**2 for p, y in zip(preds, test_y)) / len(test_y)
print("Predictions:", [round(p, 2) for p in preds])
print("Actuals: ", test_y)
print("MSE:", round(mse, 4))
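MSE is scale-dependent; R² (the coefficient of determination) is a complementary, scale-free check: 1 means a perfect fit, 0 means no better than predicting the mean. A sketch using the least-squares weights on the same held-out points:

```python
# R² = 1 - SS_res / SS_tot on the held-out test set
w, b = 1.9857, 0.0429
test_x = [8, 9, 10]
test_y = [16.0, 18.1, 19.8]
preds = [w * x + b for x in test_x]
y_mean = sum(test_y) / len(test_y)
ss_res = sum((p - y) ** 2 for p, y in zip(preds, test_y))  # residual sum of squares
ss_tot = sum((y - y_mean) ** 2 for y in test_y)            # total sum of squares
r2 = 1 - ss_res / ss_tot
print("R² =", round(r2, 4))
```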
Notation reference
Math | Scheme          | Python                 | Meaning
w, b | w, b            | w, b                   | Weight and bias
XᵀX  | (dot xc xc)     | sum(x*x for x in xc)   | Inner product
∇L   | dw, db          | dw, db                 | Gradient of loss
η    | lr              | lr                     | Learning rate
MSE  | (mse preds ys)  | mse                    | Mean squared error
Translation notes
The normal equation computes the answer in one pass — clean and functional. Gradient descent is inherently iterative: each step depends on the previous weights. Scheme handles this via tail recursion (train calls itself with updated w and b). Python uses a for loop with mutation. Same computation, different idioms.
In practice, gradient descent wins for large datasets because the normal equation requires inverting XᵀX, which is O(d³) in the number of features. Gradient descent is O(nd) per step.
Neighbors
Hefferon Ch. 1 — linear systems: the normal equation is a linear system in disguise
Calculus Ch. 12 — gradient: the direction of steepest ascent that gradient descent reverses