Convolutional Networks
MML (CC BY 4.0) · D2L (CC BY-SA 4.0) · 9 of 12
Convolutions exploit spatial structure: shared weights, local receptive fields, translation equivariance. Pooling provides invariance. The hierarchy — edges, textures, parts, objects — emerges from stacking layers.
1D convolution by hand
A 1D convolution slides a kernel (a small vector of weights) across an input signal. At each position, it computes the dot product of the kernel with the local patch. The kernel is shared across all positions — this is the parameter sharing that makes CNNs efficient.
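A minimal sketch of this sliding dot product, using the `(conv1d signal kernel)` form from the notation table below. Assumptions: plain lists, valid mode (no padding), stride 1; `take` is defined locally so the snippet stands alone.

```scheme
;; Valid-mode 1D cross-correlation: slide the kernel, take the
;; dot product at each position. No padding, stride 1.
(define (take lst n)
  (if (zero? n)
      '()
      (cons (car lst) (take (cdr lst) (- n 1)))))

(define (conv1d signal kernel)
  (if (< (length signal) (length kernel))
      '()
      (cons (apply + (map * (take signal (length kernel)) kernel))
            (conv1d (cdr signal) kernel))))

;; A difference kernel responds to edges in the signal:
;; (conv1d '(1 1 5 5 1 1) '(1 -1))  => (0 -4 0 4 0)
```

The same two weights are applied at every position; a fully connected layer would need a separate weight per position.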
Max pooling
Pooling downsamples the feature map by taking the maximum (or average) over local windows. This provides a degree of translation invariance: small shifts in the input don't change the pooled output. It also reduces the spatial dimensions, cutting computation.
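A sketch of non-overlapping max pooling, matching the `(max-pool xs k)` form in the notation table below (a trailing window shorter than k is dropped; `take` is defined locally).

```scheme
;; Max pooling over non-overlapping windows of size k.
(define (take lst n)
  (if (zero? n)
      '()
      (cons (car lst) (take (cdr lst) (- n 1)))))

(define (max-pool xs k)
  (if (< (length xs) k)
      '()
      (cons (apply max (take xs k))
            (max-pool (list-tail xs k) k))))

;; Swapping two values inside a window leaves the output
;; unchanged -- the translation invariance described above:
;; (max-pool '(1 3 2 5 4 6) 2)  => (3 5 6)
;; (max-pool '(3 1 2 5 4 6) 2)  => (3 5 6)
```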
Conv + pool pipeline
A CNN stacks convolution and pooling layers. Each convolution extracts local features; each pooling step compresses the spatial dimension. After enough layers, the feature map is small enough to feed into a fully connected classifier. The early layers detect edges, the middle layers textures, the deep layers objects.
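One stage of that pipeline can be sketched as convolution, then ReLU, then pooling. This composes the `conv1d` and `max-pool` forms from the notation table; the helper definitions are repeated here so the snippet runs on its own.

```scheme
;; One conv -> ReLU -> pool stage of the pipeline.
(define (take lst n)
  (if (zero? n)
      '()
      (cons (car lst) (take (cdr lst) (- n 1)))))

(define (conv1d signal kernel)            ; valid-mode cross-correlation
  (if (< (length signal) (length kernel))
      '()
      (cons (apply + (map * (take signal (length kernel)) kernel))
            (conv1d (cdr signal) kernel))))

(define (max-pool xs k)                   ; non-overlapping windows
  (if (< (length xs) k)
      '()
      (cons (apply max (take xs k))
            (max-pool (list-tail xs k) k))))

(define (relu xs)                         ; ReLU(x) = max(0, x), elementwise
  (map (lambda (x) (max 0 x)) xs))

(define (conv-block signal kernel k)
  (max-pool (relu (conv1d signal kernel)) k))

;; (conv-block '(1 1 5 5 1 1) '(1 -1) 2)  => (0 4)
```

Each stage roughly halves the spatial dimension (for k = 2), which is why a few layers suffice before the fully connected classifier.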
Notation reference
| Math | Scheme | Meaning |
|---|---|---|
| (f * g)[n] | (conv1d signal kernel) | 1D convolution |
| max(x[i..i+k-1]) | (max-pool xs k) | Max pooling with window k |
| ReLU(x) = max(0,x) | (max 0 x) | Rectified linear unit |
| stride, padding | step size, zero-fill | Control output dimensions |
Translation notes
Convolution in signal processing flips the kernel; in deep learning, cross-correlation (no flip) is standard but still called "convolution." The Scheme code above implements cross-correlation. In practice, learned kernels absorb the flip.
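The flip is a one-line difference. A sketch, assuming a cross-correlating `conv1d` as in the notation table (a compact definition is inlined so the snippet stands alone):

```scheme
;; Cross-correlation: slide the kernel, no flip.
(define (conv1d signal kernel)
  (define (dot s k)
    (if (null? k)
        0
        (+ (* (car s) (car k)) (dot (cdr s) (cdr k)))))
  (if (< (length signal) (length kernel))
      '()
      (cons (dot signal kernel) (conv1d (cdr signal) kernel))))

;; True (signal-processing) convolution: flip the kernel first.
(define (true-conv1d signal kernel)
  (conv1d signal (reverse kernel)))

;; Convolving a unit impulse reproduces the kernel:
;; (true-conv1d '(0 0 1 0 0) '(1 2 3))  => (1 2 3)
```

Since a learned kernel is just a set of free parameters, training finds the flipped weights automatically, which is why the distinction is harmless in deep learning.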
Neighbors
- ∫ Ch.8 Integration Techniques — convolution as integral transform
- 📡 Shannon Ch.6 — channel capacity constrains what filters can extract