A neural network is function composition: layers of linear maps plus nonlinear activations. Backpropagation is the chain rule applied to the computation graph. Universal approximation: a single hidden layer with enough units can approximate any continuous function on a compact domain to arbitrary accuracy.
Forward pass (matrix multiply + ReLU)
A single layer computes h = ReLU(Wx + b), where W is a weight matrix, b is a bias vector, and ReLU(z) = max(0, z). Stacking layers gives y = W2 · ReLU(W1 · x + b1) + b2. Without the nonlinearity, multiple layers collapse to one linear map.
Scheme
; Forward pass through a 2-layer network
; Input: 2D, Hidden: 3 units, Output: 1
(define (relu x) (if (> x 0) x 0))

; Weights (hand-picked for demo)
; W1: 3x2, b1: 3x1
(define W1 (list (list 0.5 -0.3)
                 (list -0.2 0.8)
                 (list 0.7 0.1)))
(define b1 (list 0.1 -0.1 0.2))

; W2: 1x3, b2: scalar
(define W2 (list (list 0.4 -0.6 0.3)))
(define b2 (list 0.05))

; Row-wise dot product of W with x, plus bias b
(define (mat-vec-add W x b)
  (map (lambda (row bi)
         (+ (let loop ((r row) (x x) (s 0))
              (if (null? r) s
                  (loop (cdr r) (cdr x)
                        (+ s (* (car r) (car x))))))
            bi))
       W b))

(define (forward x)
  (let* ((z1 (mat-vec-add W1 x b1))
         (h (map relu z1))
         (z2 (mat-vec-add W2 h b2)))
    (list z1 h z2)))

(define input (list 1.0 0.5))
(define result (forward input))
(display "z1 (pre-relu): ") (display (car result)) (newline)
(display "h (post-relu): ") (display (cadr result)) (newline)
(display "output: ") (display (caddr result)) (newline)
Python
# Forward pass: 2-layer network
def relu(x):
    return max(0, x)

def mat_vec(W, x, b):
    return [sum(w*xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]
W1 = [[0.5, -0.3], [-0.2, 0.8], [0.7, 0.1]]
b1 = [0.1, -0.1, 0.2]
W2 = [[0.4, -0.6, 0.3]]
b2 = [0.05]
x = [1.0, 0.5]
z1 = mat_vec(W1, x, b1)
h = [relu(z) for z in z1]
y = mat_vec(W2, h, b2)
print("z1 (pre-relu):", [round(v,3) for v in z1])
print("h (post-relu):", [round(v,3) for v in h])
print("output:", [round(v,3) for v in y])
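The claim that layers without a nonlinearity collapse to one linear map can be checked numerically with the same weights: W2(W1·x + b1) + b2 equals (W2·W1)·x + (W2·b1 + b2). A quick sketch:

```python
# Claim check: without ReLU, stacking W2(W1 x + b1) + b2 collapses to a
# single linear map with W = W2*W1 and b = W2*b1 + b2.
def mat_vec(W, x, b):
    return [sum(w*xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

W1 = [[0.5, -0.3], [-0.2, 0.8], [0.7, 0.1]]
b1 = [0.1, -0.1, 0.2]
W2 = [[0.4, -0.6, 0.3]]
b2 = [0.05]
x = [1.0, 0.5]

# Two stacked linear layers (no activation)
y_stacked = mat_vec(W2, mat_vec(W1, x, b1), b2)

# Collapsed single layer: W = W2*W1, b = W2*b1 + b2
W = [[sum(W2[i][k]*W1[k][j] for k in range(3)) for j in range(2)]
     for i in range(1)]
b = [sum(W2[i][k]*b1[k] for k in range(3)) + b2[i] for i in range(1)]
y_collapsed = mat_vec(W, x, b)

print(y_stacked, y_collapsed)  # the two agree (up to floating-point rounding)
```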
Backprop by hand (chain rule)
Backpropagation computes dL/dW for each weight by applying the chain rule backward through the network. For output y and squared loss L = (y - target)², we get dL/dy = 2(y - target), then dL/dW2 = dL/dy · hᵀ and dL/dh = W2ᵀ · dL/dy. Propagating through ReLU masks this by its derivative (1 if z > 0, else 0), giving dL/dz1, from which dL/dW1 = dL/dz1 · xᵀ.
# Backprop by hand for a 2-layer network
def relu(x): return max(0, x)
def relu_d(x): return 1 if x > 0 else 0

x = [1.0, 0.5]
target = 1.0
W1 = [[0.5, -0.3], [-0.2, 0.8]]
b1 = [0.1, -0.1]
W2 = [[0.4, -0.6]]
b2 = [0.05]

# Forward
z1 = [sum(w*xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
h = [relu(z) for z in z1]
y = sum(w*hi for w, hi in zip(W2[0], h)) + b2[0]
loss = (y - target)**2

# Backward
dL_dy = 2 * (y - target)
dL_dW2 = [dL_dy * hi for hi in h]
dL_dz1 = [W2[0][i] * dL_dy * relu_d(z1[i]) for i in range(2)]
dL_dW1 = [[dL_dz1[j] * x[i] for i in range(2)] for j in range(2)]

print("y = {:.3f}, loss = {:.4f}".format(y, loss))
print("dL/dy = {:.3f}".format(dL_dy))
print("dL/dW2 =", [round(v, 3) for v in dL_dW2])
print("dL/dz1 =", [round(v, 3) for v in dL_dz1])
print("dL/dW1 =", [[round(v, 3) for v in row] for row in dL_dW1])
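A finite-difference check is the standard way to verify hand-derived gradients: nudge one weight, recompute the loss, and compare the slope against the analytic value. A sketch for one entry of W2 (the eps value is a typical choice):

```python
# Finite-difference check of the hand-derived gradient dL/dW2[0][0].
def relu(x): return max(0, x)

def loss_at(w):
    # loss as a function of the single weight W2[0][0], all else fixed
    x = [1.0, 0.5]; target = 1.0
    W1 = [[0.5, -0.3], [-0.2, 0.8]]; b1 = [0.1, -0.1]
    z1 = [sum(wi*xi for wi, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [relu(z) for z in z1]
    y = w*h[0] - 0.6*h[1] + 0.05
    return (y - target)**2

eps = 1e-6
numeric = (loss_at(0.4 + eps) - loss_at(0.4 - eps)) / (2*eps)

# Analytic gradient at w = 0.4: dL/dW2[0][0] = 2*(y - target) * h[0]
x = [1.0, 0.5]; target = 1.0
z1 = [0.5*x[0] - 0.3*x[1] + 0.1, -0.2*x[0] + 0.8*x[1] - 0.1]
h = [relu(z) for z in z1]
y = 0.4*h[0] - 0.6*h[1] + 0.05
analytic = 2*(y - target)*h[0]

print("numeric:", round(numeric, 6), "analytic:", round(analytic, 6))
```

If the two values disagree beyond roughly eps-sized error, the hand derivation has a bug.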
Training loop on XOR
XOR is not linearly separable, so a single-layer network cannot learn it. A two-layer network can: two hidden ReLU units suffice in principle, though the example below uses four to make training from a random initialization more reliable. Training iterates: forward pass, compute loss, backward pass, update weights. After enough iterations, the network learns the nonlinear boundary.
import random
random.seed(1)
def relu(x): return max(0.0, x)
def relu_d(x): return 1.0 if x > 0 else 0.0

# XOR data
data = [([0,0],0), ([0,1],1), ([1,0],1), ([1,1],0)]

# Random init
W1 = [[random.gauss(0, 0.5) for _ in range(2)] for _ in range(4)]
b1 = [0.0]*4
W2 = [random.gauss(0, 0.5) for _ in range(4)]
b2 = 0.0
lr = 0.1

for epoch in range(2000):
    for x, t in data:
        # Forward
        z1 = [sum(W1[j][i]*x[i] for i in range(2)) + b1[j] for j in range(4)]
        h = [relu(z) for z in z1]
        y = sum(W2[j]*h[j] for j in range(4)) + b2
        # Backward
        dy = 2*(y - t)
        for j in range(4):
            dh = W2[j]*dy*relu_d(z1[j])
            for i in range(2):
                W1[j][i] -= lr*dh*x[i]
            b1[j] -= lr*dh
            W2[j] -= lr*dy*h[j]
        b2 -= lr*dy

print("After training:")
for x, t in data:
    z1 = [sum(W1[j][i]*x[i] for i in range(2)) + b1[j] for j in range(4)]
    h = [relu(z) for z in z1]
    y = sum(W2[j]*h[j] for j in range(4)) + b2
    print("  {} -> {:.3f} (target {})".format(x, y, t))
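The two-hidden-unit claim can also be checked constructively with hand-picked weights (illustrative weights, not ones the training loop above would find): h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1), y = h1 - 2·h2.

```python
def relu(x): return max(0.0, x)

# Hand-built XOR: y = ReLU(x1 + x2) - 2*ReLU(x1 + x2 - 1)
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [1.0, -2.0]

outs = []
for x in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    h = [relu(sum(W1[j][i]*x[i] for i in range(2)) + b1[j]) for j in range(2)]
    y = sum(W2[j]*h[j] for j in range(2))
    outs.append(y)
    print(x, "->", y)
# outs == [0.0, 1.0, 1.0, 0.0]
```

The second unit fires only when both inputs are on, and its negative weight cancels the first unit's contribution, which is exactly the nonlinear boundary a single layer cannot express.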
Notation reference
| Math | Scheme | Meaning |
|------|--------|---------|
| h = σ(Wx + b) | (map relu (mat-vec W x b)) | Layer computation |
| ReLU(z) = max(0, z) | (if (> x 0) x 0) | Activation function |
| ∂L/∂W | dL/dW | Gradient of loss w.r.t. weights |
| W ← W − η∇L | (- w (* lr dw)) | Gradient descent update |
Translation notes
Backpropagation is the chain rule applied to a computation graph. Capucci (2021) shows this is a lens: the forward pass computes the function, the backward pass transports gradients. The universal approximation theorem says one hidden layer suffices for any continuous function -- but it says nothing about how many neurons you need or how easy it is to train.
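The lens view can be sketched as a pair of functions per layer: a forward map from input to output, and a backward map that transports the output gradient to an input gradient. A minimal sketch of the idea for the ReLU layer (not Capucci's formal construction):

```python
# A "lens" for the ReLU layer: forward pass plus a backward pass that
# transports the output gradient back to an input gradient.
def relu_forward(x):
    return [max(0.0, xi) for xi in x]

def relu_backward(x, dy):
    # ReLU's derivative: pass the gradient through where the input was positive
    return [dyi if xi > 0 else 0.0 for xi, dyi in zip(x, dy)]

x = [0.45, -0.2, 0.95]
dy = [1.0, 1.0, 1.0]
fwd = relu_forward(x)
grad = relu_backward(x, dy)
print("forward:", fwd)    # [0.45, 0.0, 0.95]
print("backward:", grad)  # [1.0, 0.0, 1.0]
```

Composing layers composes both maps: forward passes chain left to right, backward passes chain right to left, which is precisely the chain rule on the computation graph.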
Neighbors
Calculus Ch.5 – the chain rule: the mathematical foundation of backpropagation
Capucci 2021 – backprop as a lens: categorical perspective on gradient flow
CogSci Ch.4 – neural networks in cognitive science
This chapter covers the core mechanics. For optimization (Adam, batch norm, dropout), architectures (CNNs, RNNs, transformers), and the theory of deep learning, see Goodfellow, Bengio & Courville's Deep Learning (free online).