A neural network is function composition: layers of linear maps plus nonlinear activations. Backpropagation is the chain rule applied to the computation graph. Universal approximation: a single hidden layer with enough units can approximate any continuous function on a compact domain to arbitrary accuracy.
Forward pass (matrix multiply + ReLU)
A single layer computes h = ReLU(Wx + b), where W is a weight matrix, b is a bias vector, and ReLU(z) = max(0, z). Stacking layers gives y = W2 · ReLU(W1 · x + b1) + b2. Without the nonlinearity, multiple layers collapse to one linear map.
Scheme
; Forward pass through a 2-layer network
; Input: 2D, Hidden: 3 units, Output: 1
(define (relu x) (if (> x 0) x 0))

; Weights (hand-picked for demo)
; W1: 3x2, b1: 3x1
(define W1 (list (list 0.5 -0.3)
                 (list -0.2 0.8)
                 (list 0.7 0.1)))
(define b1 (list 0.1 -0.1 0.2))

; W2: 1x3, b2: scalar
(define W2 (list (list 0.4 -0.6 0.3)))
(define b2 (list 0.05))

; Row-wise dot product of W with x, plus bias b
(define (mat-vec-add W x b)
  (map (lambda (row bi)
         (+ (let loop ((r row) (x x) (s 0))
              (if (null? r) s
                  (loop (cdr r) (cdr x)
                        (+ s (* (car r) (car x))))))
            bi))
       W b))

(define (forward x)
  (let* ((z1 (mat-vec-add W1 x b1))
         (h (map relu z1))
         (z2 (mat-vec-add W2 h b2)))
    (list z1 h z2)))

(define input (list 1.0 0.5))
(define result (forward input))
(display "z1 (pre-relu): ") (display (car result)) (newline)
(display "h (post-relu): ") (display (cadr result)) (newline)
(display "output: ") (display (caddr result)) (newline)
Python
# Forward pass: 2-layer network
def relu(x):
    return max(0, x)

def mat_vec(W, x, b):
    return [sum(w*xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]
W1 = [[0.5, -0.3], [-0.2, 0.8], [0.7, 0.1]]
b1 = [0.1, -0.1, 0.2]
W2 = [[0.4, -0.6, 0.3]]
b2 = [0.05]
x = [1.0, 0.5]
z1 = mat_vec(W1, x, b1)
h = [relu(z) for z in z1]
y = mat_vec(W2, h, b2)
print("z1 (pre-relu):", [round(v,3) for v in z1])
print("h (post-relu):", [round(v,3) for v in h])
print("output:", [round(v,3) for v in y])
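The claim that layers without a nonlinearity collapse to one linear map can be checked numerically with the same weights: W2(W1·x + b1) + b2 equals (W2·W1)·x + (W2·b1 + b2). A quick sketch:

```python
# Claim check: without ReLU, stacking W2(W1 x + b1) + b2 collapses to a
# single linear map with W = W2*W1 and b = W2*b1 + b2.
def mat_vec(W, x, b):
    return [sum(w*xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

W1 = [[0.5, -0.3], [-0.2, 0.8], [0.7, 0.1]]
b1 = [0.1, -0.1, 0.2]
W2 = [[0.4, -0.6, 0.3]]
b2 = [0.05]
x = [1.0, 0.5]

# Two stacked linear layers (no activation)
y_stacked = mat_vec(W2, mat_vec(W1, x, b1), b2)

# Collapsed single layer: W = W2*W1, b = W2*b1 + b2
W = [[sum(W2[i][k]*W1[k][j] for k in range(3)) for j in range(2)]
     for i in range(1)]
b = [sum(W2[i][k]*b1[k] for k in range(3)) + b2[i] for i in range(1)]
y_collapsed = mat_vec(W, x, b)

print(y_stacked, y_collapsed)  # the two agree (up to floating-point rounding)
```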
Backprop by hand (chain rule)
Backpropagation computes dL/dW for each weight by applying the chain rule backward through the network. For output y and squared loss L = (y - target)², we get dL/dy = 2(y - target), then dL/dW2 = dL/dy · hᵀ and dL/dh = W2ᵀ · dL/dy. Propagating through ReLU masks this by its derivative (1 if z > 0, else 0), giving dL/dz1, from which dL/dW1 = dL/dz1 · xᵀ.
# Backprop by hand for a 2-layer network
def relu(x): return max(0, x)
def relu_d(x): return 1 if x > 0 else 0

x = [1.0, 0.5]
target = 1.0
W1 = [[0.5, -0.3], [-0.2, 0.8]]
b1 = [0.1, -0.1]
W2 = [[0.4, -0.6]]
b2 = [0.05]

# Forward
z1 = [sum(w*xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
h = [relu(z) for z in z1]
y = sum(w*hi for w, hi in zip(W2[0], h)) + b2[0]
loss = (y - target)**2

# Backward
dL_dy = 2 * (y - target)
dL_dW2 = [dL_dy * hi for hi in h]
dL_dz1 = [W2[0][i] * dL_dy * relu_d(z1[i]) for i in range(2)]
dL_dW1 = [[dL_dz1[j] * x[i] for i in range(2)] for j in range(2)]

print("y = {:.3f}, loss = {:.4f}".format(y, loss))
print("dL/dy = {:.3f}".format(dL_dy))
print("dL/dW2 =", [round(v, 3) for v in dL_dW2])
print("dL/dz1 =", [round(v, 3) for v in dL_dz1])
print("dL/dW1 =", [[round(v, 3) for v in row] for row in dL_dW1])
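A finite-difference check is the standard way to verify hand-derived gradients: nudge one weight, recompute the loss, and compare the slope against the analytic value. A sketch for one entry of W2 (the eps value is a typical choice):

```python
# Finite-difference check of the hand-derived gradient dL/dW2[0][0].
def relu(x): return max(0, x)

def loss_at(w):
    # loss as a function of the single weight W2[0][0], all else fixed
    x = [1.0, 0.5]; target = 1.0
    W1 = [[0.5, -0.3], [-0.2, 0.8]]; b1 = [0.1, -0.1]
    z1 = [sum(wi*xi for wi, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [relu(z) for z in z1]
    y = w*h[0] - 0.6*h[1] + 0.05
    return (y - target)**2

eps = 1e-6
numeric = (loss_at(0.4 + eps) - loss_at(0.4 - eps)) / (2*eps)

# Analytic gradient at w = 0.4: dL/dW2[0][0] = 2*(y - target) * h[0]
x = [1.0, 0.5]; target = 1.0
z1 = [0.5*x[0] - 0.3*x[1] + 0.1, -0.2*x[0] + 0.8*x[1] - 0.1]
h = [relu(z) for z in z1]
y = 0.4*h[0] - 0.6*h[1] + 0.05
analytic = 2*(y - target)*h[0]

print("numeric:", round(numeric, 6), "analytic:", round(analytic, 6))
```

If the two values disagree beyond roughly eps-sized error, the hand derivation has a bug.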
Training loop on XOR
XOR is not linearly separable, so a single-layer network cannot learn it. A two-layer network can: two hidden ReLU units suffice in principle, though the example below uses four to make training from a random initialization more reliable. Training iterates: forward pass, compute loss, backward pass, update weights. After enough iterations, the network learns the nonlinear boundary.
import random
random.seed(1)
def relu(x): return max(0.0, x)
def relu_d(x): return 1.0 if x > 0 else 0.0

# XOR data
data = [([0,0],0), ([0,1],1), ([1,0],1), ([1,1],0)]

# Random init
W1 = [[random.gauss(0, 0.5) for _ in range(2)] for _ in range(4)]
b1 = [0.0]*4
W2 = [random.gauss(0, 0.5) for _ in range(4)]
b2 = 0.0
lr = 0.1

for epoch in range(2000):
    for x, t in data:
        # Forward
        z1 = [sum(W1[j][i]*x[i] for i in range(2)) + b1[j] for j in range(4)]
        h = [relu(z) for z in z1]
        y = sum(W2[j]*h[j] for j in range(4)) + b2
        # Backward
        dy = 2*(y - t)
        for j in range(4):
            dh = W2[j]*dy*relu_d(z1[j])
            for i in range(2):
                W1[j][i] -= lr*dh*x[i]
            b1[j] -= lr*dh
            W2[j] -= lr*dy*h[j]
        b2 -= lr*dy

print("After training:")
for x, t in data:
    z1 = [sum(W1[j][i]*x[i] for i in range(2)) + b1[j] for j in range(4)]
    h = [relu(z) for z in z1]
    y = sum(W2[j]*h[j] for j in range(4)) + b2
    print("  {} -> {:.3f} (target {})".format(x, y, t))
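The two-hidden-unit claim can also be checked constructively with hand-picked weights (illustrative weights, not ones the training loop above would find): h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1), y = h1 - 2·h2.

```python
def relu(x): return max(0.0, x)

# Hand-built XOR: y = ReLU(x1 + x2) - 2*ReLU(x1 + x2 - 1)
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [1.0, -2.0]

outs = []
for x in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    h = [relu(sum(W1[j][i]*x[i] for i in range(2)) + b1[j]) for j in range(2)]
    y = sum(W2[j]*h[j] for j in range(2))
    outs.append(y)
    print(x, "->", y)
# outs == [0.0, 1.0, 1.0, 0.0]
```

The second unit fires only when both inputs are on, and its negative weight cancels the first unit's contribution, which is exactly the nonlinear boundary a single layer cannot express.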
Notation reference
| Math | Scheme | Meaning |
|------|--------|---------|
| h = σ(Wx + b) | (map relu (mat-vec W x b)) | Layer computation |
| ReLU(z) = max(0, z) | (if (> x 0) x 0) | Activation function |
| ∂L/∂W | dL/dW | Gradient of loss w.r.t. weights |
| W ← W − η∇L | (- w (* lr dw)) | Gradient descent update |
Translation notes
Backpropagation is the chain rule applied to a computation graph. Capucci (2021) shows this is a lens: the forward pass computes the function, the backward pass transports gradients. The universal approximation theorem says one hidden layer suffices for any continuous function -- but it says nothing about how many neurons you need or how easy it is to train.
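The lens view can be sketched as a pair of functions per layer: a forward map from input to output, and a backward map that transports the output gradient to an input gradient. A minimal sketch of the idea for the ReLU layer (not Capucci's formal construction):

```python
# A "lens" for the ReLU layer: forward pass plus a backward pass that
# transports the output gradient back to an input gradient.
def relu_forward(x):
    return [max(0.0, xi) for xi in x]

def relu_backward(x, dy):
    # ReLU's derivative: pass the gradient through where the input was positive
    return [dyi if xi > 0 else 0.0 for xi, dyi in zip(x, dy)]

x = [0.45, -0.2, 0.95]
dy = [1.0, 1.0, 1.0]
fwd = relu_forward(x)
grad = relu_backward(x, dy)
print("forward:", fwd)    # [0.45, 0.0, 0.95]
print("backward:", grad)  # [1.0, 0.0, 1.0]
```

Composing layers composes both maps: forward passes chain left to right, backward passes chain right to left, which is precisely the chain rule on the computation graph.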
Neighbors
Calculus Ch.5 – the chain rule: the mathematical foundation of backpropagation
Capucci 2021 – backprop as a lens: categorical perspective on gradient flow
CogSci Ch.4 – neural networks in cognitive science
This chapter covers the core mechanics. For optimization (Adam, batch norm, dropout), architectures (CNNs, RNNs, transformers), and the theory of deep learning, see Goodfellow, Bengio & Courville's Deep Learning (free online).