
Optimization

Deisenroth et al., Mathematics for Machine Learning (CC BY 4.0) · mml-book.github.io

Gradient descent follows the negative gradient downhill. The learning rate controls the step size: too small and you crawl, too large and you overshoot. A convex function has a single global minimum, so gradient descent converges to it given a small enough learning rate. A non-convex function has many local minima, and you get whichever one you fall into.

Figure: each step follows the negative gradient from the starting point toward the minimum.

1D gradient descent

Start simple: minimize f(x) = x². The gradient is 2x, so each update x ← x − η·2x multiplies x by (1 − 2η), moving it toward zero. Watch the value shrink exponentially.

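A minimal Python sketch of this loop, using the `df` and `lr` names from the notation table below (the function name `descend`, the step count, and the starting point are illustrative choices, not fixed by the text):

```python
def df(x):
    return 2 * x  # gradient of f(x) = x^2

def descend(x, lr=0.1, steps=50):
    """Repeatedly apply the update rule x <- x - lr * df(x)."""
    for _ in range(steps):
        x -= lr * df(x)  # each step multiplies x by (1 - 2*lr)
    return x

print(descend(5.0))  # starts at 5, converges toward the minimum at 0
```

With lr = 0.1 each step scales x by 0.8, so fifty steps leave only a tiny residue of the starting value.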

2D gradient descent

Now minimize f(x, y) = x² + 2y². The gradient is (2x, 4y). Two dimensions, same idea: follow the negative gradient at each step.

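The same sketch in two dimensions (again in Python; `grad` and `descend2` are illustrative names). Note that y has the steeper gradient, so at the same learning rate it shrinks faster than x:

```python
def grad(x, y):
    return 2 * x, 4 * y  # gradient of f(x, y) = x^2 + 2y^2

def descend2(x, y, lr=0.1, steps=100):
    """Follow the negative gradient in both coordinates at once."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * gx  # x scales by (1 - 2*lr) per step
        y -= lr * gy  # y scales by (1 - 4*lr) per step: faster decay
    return x, y

print(descend2(3.0, 2.0))  # both coordinates head toward (0, 0)
```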

Learning rate too large

What happens when the learning rate is too big? The updates overshoot the minimum and the loss increases instead of decreasing. This is divergence.

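For f(x) = x² the update is x ← (1 − 2η)x, so any η > 1 makes |1 − 2η| > 1 and each step grows |x| instead of shrinking it. A sketch with η = 1.1 (the threshold η > 1 is specific to this function, not a general rule):

```python
def df(x):
    return 2 * x  # gradient of f(x) = x^2

def step(x, lr):
    return x - lr * df(x)  # equivalent to x * (1 - 2*lr)

x = 1.0
for _ in range(10):
    x = step(x, lr=1.1)  # |1 - 2*1.1| = 1.2: |x| grows 20% per step
print(x)  # far from the minimum and getting farther
```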

Notation reference

Math                   Scheme                 Python            Meaning
∇f(x)                  (df x)                 df(x)             Gradient
η                      lr                     lr                Learning rate
xₙ₊₁ = xₙ − η∇f(xₙ)    (- x (* lr (df x)))    x -= lr * df(x)   Update rule
convex                 —                      —                 One global minimum
non-convex             —                      —                 Multiple local minima

Translation notes

Gradient descent is a natural fit for tail recursion: each call passes the updated state forward, and no work remains after the recursive call returns. The Scheme version makes this explicit — descend is a loop with no stack growth. Python's for loop achieves the same thing with mutable assignment.
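To make the contrast concrete, here is a hedged Python sketch of both styles (the recursive version mirrors the Scheme `descend`; the names and step counts are illustrative). Python does not eliminate tail calls, so the recursive form grows the stack and hits the interpreter's recursion limit for large step counts, while the loop does not:

```python
def df(x):
    return 2 * x  # gradient of f(x) = x^2

# Tail-recursive style, as in Scheme: pass the updated state forward.
# In Python each call still consumes a stack frame.
def descend_rec(x, lr, steps):
    if steps == 0:
        return x
    return descend_rec(x - lr * df(x), lr, steps - 1)

# Idiomatic Python: the same state-passing, as a loop with reassignment.
def descend_loop(x, lr, steps):
    for _ in range(steps):
        x = x - lr * df(x)
    return x
```

Both compute identical values for modest step counts; only the Scheme version can run the recursive form for arbitrarily many steps in constant stack space.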

The divergence example is the most important one. Learning-rate selection is the first hyperparameter choice every practitioner faces, and the failure mode is dramatic: the loss explodes exponentially. In practice, learning-rate schedules (decay, warmup) and adaptive methods (Adam, RMSProp) handle this automatically.
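A minimal sketch of the decay idea, assuming a simple 1/(1 + t) schedule (one of many; the function name and constants are illustrative, not a recommendation). On f(x) = x² an initial rate of 1.1 would diverge if held constant, but the shrinking steps recover:

```python
def df(x):
    return 2 * x  # gradient of f(x) = x^2

def descend_decay(x, lr0=1.1, steps=50):
    """Gradient descent with a hypothetical 1/(1 + t) learning-rate decay."""
    for t in range(steps):
        lr = lr0 / (1 + t)  # step 0 overshoots; later steps shrink
        x -= lr * df(x)
    return x

print(descend_decay(5.0))  # approaches 0 despite the too-large start
```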

Neighbors
Ready for the real thing? Read Mathematics for Machine Learning Ch. 7 and D2L Ch. 12.