Optimization
Deisenroth et al., Mathematics for Machine Learning (CC BY 4.0) · mml-book.github.io
Gradient descent follows the negative gradient downhill. The learning rate controls step size: too small and you crawl, too large and you overshoot. A convex function has a single global minimum, so gradient descent with a suitably small learning rate converges to it. A non-convex function has many local minima, and you get whichever basin you fall into.
1D gradient descent
Start simple: minimize f(x) = x². The gradient is 2x, so the update x ← x − η·2x scales x by (1 − 2η) each step, moving it toward zero. Watch the value shrink geometrically.
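A minimal Python sketch of this (function and variable names are my own, not from the book):

```python
def gradient_descent_1d(df, x, lr=0.1, steps=20):
    """Repeatedly step against the gradient df, starting from x."""
    for _ in range(steps):
        x = x - lr * df(x)
    return x

# f(x) = x^2, so df(x) = 2x. Each step multiplies x by (1 - 2*lr) = 0.8,
# so x shrinks geometrically toward the minimum at 0.
x_final = gradient_descent_1d(lambda x: 2 * x, x=5.0)
```

After 20 steps from x = 5.0, the iterate is roughly 5 · 0.8²⁰ ≈ 0.058, illustrating the exponential decay.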
2D gradient descent
Now minimize f(x, y) = x² + 2y². The gradient is (2x, 4y). Two dimensions, same idea: follow the negative gradient at each step.
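A sketch in the same style (names are illustrative): each coordinate gets its own gradient component, and the steeper y-direction (curvature 4 vs 2) contracts faster.

```python
def gradient_descent_2d(grad, point, lr=0.1, steps=50):
    """Follow the negative gradient of a 2D function."""
    x, y = point
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# f(x, y) = x^2 + 2y^2, so grad = (2x, 4y).
# Per step: x is scaled by 0.8, y by 0.6, so y converges faster.
x, y = gradient_descent_2d(lambda x, y: (2 * x, 4 * y), (3.0, 2.0))
```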
Learning rate too large
What happens when the learning rate is too big? The updates overshoot the minimum and the loss increases instead of decreasing. This is divergence.
Notation reference
| Math | Scheme | Python | Meaning |
|---|---|---|---|
| ∇f | (df x) | df(x) | Gradient |
| η | lr | lr | Learning rate |
| xₙ₊₁ = xₙ - η∇f(xₙ) | (- x (* lr (df x))) | x -= lr * df(x) | Update rule |
| convex | — | — | One global minimum |
| non-convex | — | — | Multiple local minima |
Translation notes
Gradient descent is a natural fit for tail recursion: each call passes the updated state forward, and no work remains after the recursive call returns. The Scheme version makes this explicit — descend is a loop with no stack growth. Python's for loop achieves the same thing with mutable assignment.
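The tail-recursive shape described above can be sketched in Python as well (names are illustrative; unlike Scheme, Python does not eliminate tail calls, so this only works for small step counts):

```python
def descend(df, x, lr, steps):
    """Tail-recursive descent: all state is passed forward, and no work
    remains after the recursive call returns. Python grows the stack
    anyway, so keep steps well below the recursion limit."""
    if steps == 0:
        return x
    return descend(df, x - lr * df(x), lr, steps - 1)

result = descend(lambda x: 2 * x, 5.0, 0.1, 20)
```

This computes the same iterates as the loop version; the Scheme original runs in constant stack space, which is the point the note is making.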
The divergence example is the most important one. Learning rate selection is the first hyperparameter choice every practitioner faces, and the failure mode is dramatic: the loss explodes exponentially. In practice, learning rate schedules (decay, warmup) and adaptive methods (Adam, RMSProp) handle this automatically. Even with good optimization, learning plateaus emerge from the recursive structure of consolidation: the system cannot improve further until it reorganizes what it already knows.
Neighbors
- Calculus Ch.12 — gradient: the mathematical foundation for the direction of steepest descent
- Analysis Ch.4 — continuity: the smoothness conditions that make gradient descent work
- Calculus Ch.6 — applications of derivatives: gradient descent is optimization by following the derivative downhill
- Linear Algebra Ch.5 — eigenvalues determine convergence rates of gradient methods
- Algorithms Ch.2 — dynamic programming shares the optimal substructure property with convex optimization