
Neural Networks

Machine Learning · Ch. 8 of 12

A neural network is function composition: layers of linear maps + nonlinear activations. Backpropagation is the chain rule applied to the computation graph. Universal approximation: one hidden layer can approximate any continuous function.

[Diagram: a network with inputs x1–x3, hidden units h1–h4, and output y; the hidden layer computes W1 · x + b1, the output layer W2 · h + b2.]

Forward pass (matrix multiply + ReLU)

A single layer computes h = ReLU(Wx + b), where W is a weight matrix, b is a bias vector, and ReLU(z) = max(0, z). Stacking layers gives y = W2 · ReLU(W1 · x + b1) + b2. Without the nonlinearity, multiple layers collapse to one linear map.
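The layer above can be written directly in Scheme. This is a minimal sketch with vectors as plain lists and a matrix as a list of row lists; helper names like `mat-vec` and `add-vec` are assumptions, not part of the chapter's code.

```scheme
;; ReLU activation: max(0, z)
(define (relu z) (if (> z 0) z 0))

;; dot product of two vectors (lists)
(define (dot u v) (apply + (map * u v)))

;; W·x: one dot product per row of W
(define (mat-vec W x)
  (map (lambda (row) (dot row x)) W))

(define (add-vec u v) (map + u v))

;; h = ReLU(Wx + b)
(define (layer W b x)
  (map relu (add-vec (mat-vec W x) b)))

;; y = W2 · ReLU(W1 · x + b1) + b2  (linear output layer)
(define (forward W1 b1 W2 b2 x)
  (add-vec (mat-vec W2 (layer W1 b1 x)) b2))
```

With the identity matrix for W1 and zero biases, `(layer '((1 0) (0 1)) '(0 0) '(2 -3))` just applies ReLU elementwise, yielding `(2 0)`.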


Backprop by hand (chain rule)

Backpropagation computes dL/dW for each weight by applying the chain rule backward through the network. For output y and loss L = (y − target)², we get dL/dy = 2(y − target), then dL/dW2 = dL/dy · hᵀ and dL/dh = W2ᵀ · dL/dy. To continue past the hidden layer, dL/dh is masked elementwise by ReLU's derivative (1 if z > 0, else 0), giving the gradient at the pre-activation z.
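The chain-rule steps above can be sketched in Scheme for the scalar-output case. Assumptions: vectors are lists, `w2` is the single output row as a flat list, and `z1`/`h` are the hidden pre-activations and activations saved during the forward pass; `scale-vec` and `outer` are hypothetical helpers.

```scheme
;; scalar · vector
(define (scale-vec a v) (map (lambda (x) (* a x)) v))

;; outer product u·vᵀ as a list of row lists
(define (outer u v)
  (map (lambda (ui) (scale-vec ui v)) u))

;; Backward pass for loss L = (y - target)^2:
(define (backward x z1 h w2 y target)
  (let* ((dy  (* 2 (- y target)))           ; dL/dy = 2(y - target)
         (dw2 (scale-vec dy h))             ; dL/dW2 = dL/dy · hᵀ
         (dh  (scale-vec dy w2))            ; dL/dh  = W2ᵀ · dL/dy
         (dz1 (map (lambda (g z)            ; mask by ReLU derivative
                     (if (> z 0) g 0))
                   dh z1))
         (dw1 (outer dz1 x)))               ; dL/dW1 = dz1 · xᵀ
    (list dw1 dz1 dw2 dy))))                ; db1 = dz1, db2 = dy
```

Each line mirrors one equation from the derivation; the ReLU mask is what turns the hidden-layer gradient dL/dh into the pre-activation gradient dz1.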

[Diagram: input x → hidden h = σ(Wx) → output ŷ = σ(Vh), with weights W and V. The forward pass computes the output; the backward pass computes the gradients ∂L/∂V and ∂L/∂W.]

Training loop on XOR

XOR is not linearly separable, so a single-layer network cannot learn it. A two-layer network with two hidden units can. Training iterates: forward pass, compute loss, backward pass, update weights. After enough iterations, the network learns the nonlinear boundary.
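Here is a compact sketch of that training loop in Scheme: two inputs, two sigmoid hidden units, one sigmoid output, squared loss, plain gradient descent. The fixed asymmetric initial weights, the learning rate of 0.5, and the 5000 epochs are all assumptions, not values from the chapter; a symmetric init would fail to break ties between the hidden units.

```scheme
(define (sigmoid z) (/ 1.0 (+ 1.0 (exp (- z)))))
(define (dot u v) (apply + (map * u v)))

;; XOR truth table: ((input) target)
(define data '(((0 0) 0) ((0 1) 1) ((1 0) 1) ((1 1) 0)))

;; parameters (small asymmetric init)
(define W1 '((0.5 -0.4) (-0.3 0.6)))
(define b1 '(0.1 -0.1))
(define w2 '(0.7 -0.5))
(define b2 0.2)
(define lr 0.5)

;; forward pass: returns (hidden-activations . output)
(define (forward x)
  (let* ((h (map (lambda (row b) (sigmoid (+ (dot row x) b))) W1 b1))
         (y (sigmoid (+ (dot w2 h) b2))))
    (cons h y)))

;; one gradient-descent step on one example
(define (train-step! x target)
  (let* ((hy (forward x)) (h (car hy)) (y (cdr hy))
         (dy (* 2 (- y target) y (- 1 y)))                ; grad at output pre-activation
         (dh (map (lambda (w hi) (* dy w hi (- 1 hi))) w2 h))) ; grad at hidden pre-activations
    (set! w2 (map (lambda (w hi) (- w (* lr dy hi))) w2 h))
    (set! b2 (- b2 (* lr dy)))
    (set! W1 (map (lambda (row dhi)
                    (map (lambda (w xi) (- w (* lr dhi xi))) row x))
                  W1 dh))
    (set! b1 (map (lambda (b dhi) (- b (* lr dhi))) b1 dh))))

;; iterate: forward, loss gradient, backward, update
(do ((i 0 (+ i 1))) ((= i 5000))
  (for-each (lambda (ex) (train-step! (car ex) (cadr ex))) data))
```

After training, `(cdr (forward '(0 1)))` should approach 1 and `(cdr (forward '(0 0)))` should approach 0, though how fast depends on the initial weights and learning rate.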


Notation reference

Math                  Scheme                        Meaning
h = σ(Wx + b)         (map relu (mat-vec W x b))    Layer computation
ReLU(z) = max(0, z)   (if (> z 0) z 0)              Activation function
∂L/∂W                 dL/dW                         Gradient of loss w.r.t. weights
W ← W − η ∇L          (- w (* lr dw))               Gradient descent update

Translation notes

Backpropagation is the chain rule applied to a computation graph. Capucci (2021) shows this is a lens: the forward pass computes the function, the backward pass transports gradients. The universal approximation theorem says one hidden layer suffices for any continuous function, but it says nothing about how many neurons you need or how easy the network is to train.


Ready for the real thing?

This chapter covers the core mechanics. For optimization (Adam, batch norm, dropout), architectures (CNNs, RNNs, transformers), and the theory of deep learning, see Goodfellow, Bengio & Courville's Deep Learning (free online).