
Linear Regression

Deisenroth et al., Mathematics for Machine Learning (CC BY 4.0) · mml-book.github.io

Find weights that minimize squared error. The normal equation gives the closed-form solution: w = (XᵀX)⁻¹Xᵀy. Gradient descent gives the iterative one: update weights in the direction that reduces error fastest. Both arrive at the same answer for linear models.
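As a sketch of the closed-form route, here is the normal equation w = (XᵀX)⁻¹Xᵀy specialized to y = wx + b, where X has columns [x, 1] so that XᵀX is a 2×2 matrix we can invert by hand. The function name and variables are illustrative, not from the original:

```python
# Hedged sketch: the matrix normal equation for the model y = w*x + b.
# With design matrix X = [[x_i, 1]], the weights are (w, b).

def normal_equation(xs, ys):
    """Solve (XᵀX) [w, b] = Xᵀy for the 2x2 case analytically."""
    # Entries of XᵀX = [[sum x², sum x], [sum x, n]] and Xᵀy.
    sxx = sum(x * x for x in xs)
    sx = sum(xs)
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)
    # Invert the 2x2 matrix via its determinant (Cramer's rule).
    det = sxx * n - sx * sx
    w = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return w, b

# Points on y = 2x + 1 recover the line exactly.
w, b = normal_equation([0, 1, 2, 3], [1, 3, 5, 7])  # → (2.0, 1.0)
```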

Figure: x–y data with fitted line; residuals (green dashes) are what the line gets wrong.

Normal equation

For y = wx + b, the least-squares solution has a closed form. With one feature: w = sum((xᵢ - x̄)(yᵢ - ȳ)) / sum((xᵢ - x̄)²) and b = ȳ - w·x̄. This is the normal equation in scalar form.

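The scalar formula above translates directly to code. A minimal Python sketch, with illustrative names (`fit_line`, `xs`, `ys`) not taken from the original:

```python
# Hedged sketch: closed-form least squares for one feature.

def fit_line(xs, ys):
    """Return (w, b) minimizing squared error for y = w*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # w = sum((x_i - x̄)(y_i - ȳ)) / sum((x_i - x̄)²)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    w = num / den
    b = y_bar - w * x_bar  # b = ȳ - w·x̄
    return w, b

w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # → w = 2.0, b = 1.0
```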

Gradient descent loop

When closed-form solutions are too expensive (millions of features), iterate instead. Compute the gradient of the loss with respect to each weight, then take a small step downhill. Repeat until convergence.

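The loop described above can be sketched as follows; the function name `train` and the hyperparameters (`lr`, `steps`) are illustrative choices, not values from the original:

```python
# Hedged sketch of the gradient-descent loop for y = w*x + b with MSE loss.

def train(xs, ys, lr=0.05, steps=2000):
    """Minimize mean squared error by repeated small steps downhill."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        preds = [w * x + b for x in xs]
        # Gradient of MSE: dL/dw = (2/n) Σ (pred - y)·x, dL/db = (2/n) Σ (pred - y)
        dw = (2 / n) * sum((p - y) * x for p, x, y in zip(preds, xs, ys))
        db = (2 / n) * sum(p - y for p, y in zip(preds, ys))
        w -= lr * dw  # step in the direction that reduces error fastest
        b -= lr * db
    return w, b

w, b = train([0, 1, 2, 3], [1, 3, 5, 7])  # converges near w = 2, b = 1
```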

Comparing predictions

Both methods should give the same answer. Compare by computing predictions on held-out data and measuring mean squared error.

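A minimal sketch of the comparison, assuming both methods produced the same fitted weights; the held-out points and helper name `mse` are illustrative:

```python
# Hedged sketch: evaluate a fitted line by MSE on held-out data.

def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Held-out points from the same line y = 2x + 1 (illustrative data).
held_x = [4, 5]
held_y = [9, 11]
w, b = 2.0, 1.0  # weights recovered by either method
error = mse([w * x + b for x in held_x], held_y)  # → 0.0 on noiseless data
```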

Notation reference

Math   Scheme            Python                   Meaning
w, b   w, b              w, b                     Weight and bias
XᵀX    (dot xc xc)       sum(x*x for x in xc)     Inner product
∇L     dw, db            dw, db                   Gradient of loss
η      lr                lr                       Learning rate
MSE    (mse preds ys)    mse                      Mean squared error

Translation notes

The normal equation computes the answer in one pass — clean and functional. Gradient descent is inherently iterative: each step depends on the previous weights. Scheme handles this via tail recursion (train calls itself with updated w and b). Python uses a for loop with mutation. Same computation, different idioms.
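To make the contrast concrete, here are both idioms side by side, written in Python for brevity (the original pairs Scheme recursion with a Python loop); `step` is a placeholder update, not the real gradient step:

```python
# Illustrative contrast: recursive (Scheme-like) vs iterative (Python-like)
# training loops. The update rule here is a stand-in, not a real gradient.

def step(w, b, lr=0.1):
    # Placeholder update; a real step would subtract lr * gradient.
    return w - lr * w, b - lr * b

# Recursive style: each call passes the updated weights forward,
# mirroring Scheme's tail-recursive train.
def train_rec(w, b, steps):
    if steps == 0:
        return w, b
    w2, b2 = step(w, b)
    return train_rec(w2, b2, steps - 1)

# Iterative style: rebind w and b in a loop.
def train_loop(w, b, steps):
    for _ in range(steps):
        w, b = step(w, b)
    return w, b
```

Both functions trace the identical sequence of updates; only the control flow differs.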

In practice, gradient descent wins for large datasets because the normal equation requires inverting XᵀX, which costs O(d³) for d features. Gradient descent costs only O(nd) per step for n examples.

Ready for the real thing? Read Mathematics for Machine Learning Ch. 9 and D2L Ch. 3.