
Convolutional Networks

MML (CC BY 4.0) · D2L (CC BY-SA 4.0) · 9 of 12

Convolutions exploit spatial structure: shared weights, local receptive fields, translation equivariance. Pooling provides invariance. The hierarchy — edges, textures, parts, objects — emerges from stacking layers.

[Figure: a kernel (1 0 -1) slides across the input (3 1 4 1 5 9 2); at one position it computes 4·1 + 1·0 + 5·(−1) = −1. The same weights are applied at every position: translation equivariance.]

1D convolution by hand

A 1D convolution slides a kernel (a small vector of weights) across an input signal. At each position, it computes the dot product of the kernel with the local patch. The kernel is shared across all positions — this is the parameter sharing that makes CNNs efficient.

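The listing here was not preserved; a minimal sketch of a "valid" cross-correlation (no flip, no padding), using the `conv1d` name from the notation table, might look like this:

```scheme
;; Dot product of two equal-length prefixes (stops when the kernel runs out).
(define (dot xs ys)
  (if (null? ys)
      0
      (+ (* (car xs) (car ys))
         (dot (cdr xs) (cdr ys)))))

;; Slide the kernel one step at a time; stop when it no longer fits.
(define (conv1d signal kernel)
  (if (< (length signal) (length kernel))
      '()
      (cons (dot signal kernel)
            (conv1d (cdr signal) kernel))))

;; (conv1d '(3 1 4 1 5 9 2) '(1 0 -1))  ; => (-1 0 -1 -8 3)
```

The kernel list is the shared weights: the same `(1 0 -1)` is dotted with every local patch, which is exactly the parameter sharing described above.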

Max pooling

Pooling downsamples the feature map by taking the maximum (or average) over local windows. This provides a degree of translation invariance: small shifts in the input don't change the pooled output. It also reduces the spatial dimensions, cutting computation.

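Again the original listing is missing; a sketch of non-overlapping max pooling with window `k`, matching the table's `(max-pool xs k)` signature, could be:

```scheme
;; take/drop limited to the list's length, so a ragged final window is allowed.
(define (take-at-most xs k)
  (if (or (null? xs) (= k 0))
      '()
      (cons (car xs) (take-at-most (cdr xs) (- k 1)))))

(define (drop-at-most xs k)
  (if (or (null? xs) (= k 0))
      xs
      (drop-at-most (cdr xs) (- k 1))))

;; Take the max over each window of k, then move on to the next window.
(define (max-pool xs k)
  (if (null? xs)
      '()
      (cons (apply max (take-at-most xs k))
            (max-pool (drop-at-most xs k) k))))

;; (max-pool '(-1 0 -1 -8 3) 2)  ; => (0 -1 3)
```

Note the output is roughly half the input's length for k = 2: this is the spatial downsampling the paragraph describes.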

Conv + pool pipeline

A CNN stacks convolution and pooling layers. Each convolution extracts local features; each pooling step compresses the spatial dimension. After enough layers, the feature map is small enough to feed into a fully connected classifier. The early layers detect edges, the middle layers textures, the deep layers objects.

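One stage of that pipeline can be sketched by composing the pieces; this assumes `conv1d` and `max-pool` (as named in the notation table) are already defined, and `relu` is the table's `(max 0 x)` mapped over a list:

```scheme
;; Elementwise rectification: negative activations are clamped to zero.
(define (relu xs)
  (map (lambda (x) (max 0 x)) xs))

;; One conv -> ReLU -> pool stage.
(define (conv-pool-stage signal kernel k)
  (max-pool (relu (conv1d signal kernel)) k))

;; Stacking stages: each kernel extracts features, each pool shrinks the map.
(define (cnn signal kernels k)
  (if (null? kernels)
      signal
      (cnn (conv-pool-stage signal (car kernels) k)
           (cdr kernels)
           k)))
```

`kernels` here is a list of weight lists, one per layer; in a real CNN these would be learned, not hand-picked, and each layer would have many channels rather than one.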

Notation reference

Math                        Scheme                      Meaning
(f * g)[n]                  (conv1d signal kernel)      1D convolution
max(x_i, …, x_{i+k−1})      (max-pool xs k)             Max pooling with window k
ReLU(x) = max(0, x)         (max 0 x)                   Rectified linear unit
stride, padding             step size, zero-fill        Control output dimensions

Translation notes

Convolution in signal processing flips the kernel; in deep learning, cross-correlation (no flip) is standard but still called "convolution." The Scheme code above implements cross-correlation. In practice, learned kernels absorb the flip.
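The relationship is easy to state in code: true convolution is cross-correlation with the kernel reversed. Assuming the `conv1d` procedure named in the notation table is in scope:

```scheme
;; Signal-processing convolution via cross-correlation: flip the kernel first.
(define (true-conv1d signal kernel)
  (conv1d signal (reverse kernel)))
```

Since a learned kernel's weights are free parameters, training with or without the flip yields the same family of functions, which is why the distinction rarely matters in practice.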
