Optimization
Deisenroth et al., Mathematics for Machine Learning (CC BY 4.0) · mml-book.github.io
Gradient descent follows the negative gradient downhill. The learning rate controls step size: too small and you crawl, too large and you overshoot. A convex function has a single global minimum, so gradient descent with a suitably small learning rate converges to it. A non-convex function has many local minima, and you get whichever basin you fall into.
1D gradient descent
Start simple: minimize f(x) = x². The gradient is 2x, so the update x ← x − η·2x scales x by (1 − 2η) each step, moving it toward zero. Watch the value shrink geometrically.
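A minimal Python sketch of this (function and variable names are my own, not from the book):

```python
def gradient_descent_1d(df, x, lr=0.1, steps=20):
    """Repeatedly step against the gradient df, starting from x."""
    for _ in range(steps):
        x = x - lr * df(x)
    return x

# f(x) = x^2, so df(x) = 2x. Each step multiplies x by (1 - 2*lr) = 0.8,
# so x shrinks geometrically toward the minimum at 0.
x_final = gradient_descent_1d(lambda x: 2 * x, x=5.0)
```

After 20 steps from x = 5.0, the iterate is roughly 5 · 0.8²⁰ ≈ 0.058, illustrating the exponential decay.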
2D gradient descent
Now minimize f(x, y) = x² + 2y². The gradient is (2x, 4y). Two dimensions, same idea: follow the negative gradient at each step.
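A sketch in the same style (names are illustrative): each coordinate gets its own gradient component, and the steeper y-direction (curvature 4 vs 2) contracts faster.

```python
def gradient_descent_2d(grad, point, lr=0.1, steps=50):
    """Follow the negative gradient of a 2D function."""
    x, y = point
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# f(x, y) = x^2 + 2y^2, so grad = (2x, 4y).
# Per step: x is scaled by 0.8, y by 0.6, so y converges faster.
x, y = gradient_descent_2d(lambda x, y: (2 * x, 4 * y), (3.0, 2.0))
```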
Learning rate too large
What happens when the learning rate is too big? The updates overshoot the minimum and the loss increases instead of decreasing. This is divergence.
Notation reference
| Math | Scheme | Python | Meaning |
|---|---|---|---|
| ∇f | (df x) | df(x) | Gradient |
| η | lr | lr | Learning rate |
| xₙ₊₁ = xₙ - η∇f(xₙ) | (- x (* lr (df x))) | x -= lr * df(x) | Update rule |
| convex | — | — | One global minimum |
| non-convex | — | — | Multiple local minima |
Translation notes
Gradient descent is a natural fit for tail recursion: each call passes the updated state forward, and no work remains after the recursive call returns. The Scheme version makes this explicit — descend is a loop with no stack growth. Python's for loop achieves the same thing with mutable assignment.
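The tail-recursive shape described above can be sketched in Python as well (names are illustrative; unlike Scheme, Python does not eliminate tail calls, so this only works for small step counts):

```python
def descend(df, x, lr, steps):
    """Tail-recursive descent: all state is passed forward, and no work
    remains after the recursive call returns. Python grows the stack
    anyway, so keep steps well below the recursion limit."""
    if steps == 0:
        return x
    return descend(df, x - lr * df(x), lr, steps - 1)

result = descend(lambda x: 2 * x, 5.0, 0.1, 20)
```

This computes the same iterates as the loop version; the Scheme original runs in constant stack space, which is the point the note is making.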
The divergence example is the most important one. Learning rate selection is the first hyperparameter choice every practitioner faces, and the failure mode is dramatic: the loss explodes exponentially. In practice, learning rate schedules (decay, warmup) and adaptive methods (Adam, RMSProp) handle this automatically. Even with good optimization, learning plateaus emerge from the recursive structure of consolidation: the system cannot improve further until it reorganizes what it already knows.
Neighbors
- Calculus Ch.12 — gradient: the mathematical foundation for the direction of steepest descent
- Analysis Ch.4 — continuity: the smoothness conditions that make gradient descent work
- Calculus Ch.6 — applications of derivatives: gradient descent is optimization by following the derivative downhill
- Linear Algebra Ch.5 — eigenvalues determine convergence rates of gradient methods
- Algorithms Ch.2 — dynamic programming shares the optimal substructure property with convex optimization