Day Two, Session One

Convolutional Neural Networks

Mark Andrews

Images as data

  • A greyscale image is a 2D array of pixel intensities.
  • A colour image has three such arrays, one per channel (RGB).
  • PyTorch represents images as tensors of shape \((\text{channels}, \text{height}, \text{width})\).
  • The question: how do we design a network that exploits the spatial structure of images?
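The shapes above can be checked directly in PyTorch. A minimal sketch, assuming `torch` is installed; the sizes are arbitrary examples:

```python
import torch

# A batch of 8 colour images, 32x32 pixels: (batch, channels, height, width).
images = torch.rand(8, 3, 32, 32)
print(images.shape)   # torch.Size([8, 3, 32, 32])

# A single greyscale image is a (1, H, W) tensor: one channel.
grey = torch.rand(1, 28, 28)
print(grey.shape)     # torch.Size([1, 28, 28])
```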

The problem with MLPs for images

Parameter count. A \(224 \times 224\) colour image has \(224 \times 224 \times 3 = 150{,}528\) inputs. A first hidden layer of 512 units requires over 77 million weights.

No spatial structure. Flattening an image discards all proximity information. A fully-connected layer must learn spatial relationships from scratch, which is wasteful and slow.
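The parameter count quoted above can be verified by building that first fully-connected layer and counting its parameters (weights plus biases):

```python
import torch.nn as nn

n_inputs = 224 * 224 * 3                    # 150,528 input features
layer = nn.Linear(n_inputs, 512)

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)                             # 77,070,848 (weights + 512 biases)
```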

Convolutional layers: local connectivity

  • A convolutional layer replaces the global weight matrix with a small, reusable filter.
  • A typical filter is \(3 \times 3\) — nine weights, plus a bias.
  • The filter slides across the image and computes a weighted sum at each position.
  • Each output unit depends only on a small local region of the input.
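A convolutional layer with a single \(3 \times 3\) filter is one line in PyTorch. A sketch with an arbitrary \(28 \times 28\) single-channel input; note that without padding each spatial dimension shrinks by two:

```python
import torch
import torch.nn as nn

# One 3x3 filter over a single-channel input: nine weights plus one bias.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

x = torch.rand(1, 1, 28, 28)   # (batch, channels, height, width)
y = conv(x)
print(y.shape)                 # torch.Size([1, 1, 26, 26]) -- no padding
```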

The convolution operation

For filter \(w\) of size \(k \times k\) on a single-channel input \(x\):

\[y_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} x_{i+m,\,j+n}\cdot w_{m,n} + b\]

  • One filter produces one feature map of the same spatial size (with padding).
  • A layer with 32 filters produces 32 feature maps stacked together.
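The formula above can be implemented directly with two nested loops and checked against PyTorch's `F.conv2d` (which computes exactly this cross-correlation). A sketch with a small random input and filter:

```python
import torch
import torch.nn.functional as F

k = 3
x = torch.rand(6, 6)                 # single-channel input
w = torch.rand(k, k)                 # filter weights
b = torch.tensor(0.5)                # bias

# Direct implementation of y_{i,j} = sum_{m,n} x_{i+m, j+n} * w_{m,n} + b
out = x.shape[0] - k + 1
y = torch.empty(out, out)
for i in range(out):
    for j in range(out):
        y[i, j] = (x[i:i + k, j:j + k] * w).sum() + b

# F.conv2d expects (batch, channels, H, W) tensors and a 1-D bias.
y_ref = F.conv2d(x[None, None], w[None, None], bias=b[None])[0, 0]
print(torch.allclose(y, y_ref, atol=1e-5))   # True
```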

Parameter sharing and translation invariance

  • The same filter weights are used at every position: parameter sharing.
  • A \(3 \times 3\) filter on a \(28 \times 28\) image uses \(9 + 1 = 10\) parameters regardless of image size.
  • This means the filter learns to detect a feature — an edge, a corner, a texture — wherever it appears.
  • Translation equivariance: a cat in the top-left and a cat in the bottom-right activate the same filters, just at different positions in the feature map; pooling then adds a degree of invariance.
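Parameter sharing is easy to demonstrate: the same ten parameters serve any input size. A sketch using `padding=1` so the spatial size is preserved:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))   # 10: 9 weights + 1 bias

# The same ten parameters apply to inputs of any spatial size.
for size in (28, 224):
    y = conv(torch.rand(1, 1, size, size))
    print(y.shape)   # spatial size preserved by padding=1
```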

Pooling

  • Pooling layers reduce spatial dimensions without learnable parameters.
  • Max pooling takes the maximum value in each non-overlapping window.
  • A \(2 \times 2\) max pool halves height and width.
  • Combined with convolutions, successive pooling builds increasingly compact representations.
  • Also provides a degree of local translation invariance within the pooling window.
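The halving behaviour and the absence of learnable parameters can both be seen directly. A sketch with 32 feature maps of size \(28 \times 28\):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)   # non-overlapping 2x2 windows
x = torch.rand(1, 32, 28, 28)        # 32 feature maps
y = pool(x)
print(y.shape)                       # torch.Size([1, 32, 14, 14])

# Pooling has no learnable parameters.
print(sum(p.numel() for p in pool.parameters()))   # 0
```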

Batch normalisation

Applied after convolution, before activation, per channel:

\[\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y = \gamma\hat{x} + \beta\]

  • \(\mu_B\) and \(\sigma_B^2\) are the batch mean and variance.
  • \(\gamma\) and \(\beta\) are learned scale and shift parameters.
  • Prevents activations from growing very large or collapsing to zero in deep networks.
  • At evaluation time, uses running estimates accumulated during training.
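The formula can be reproduced by hand and checked against `nn.BatchNorm2d`. At initialisation \(\gamma = 1\) and \(\beta = 0\), so in training mode the module output equals the normalised \(\hat{x}\); note the mean and (biased) variance are taken per channel, over the batch and both spatial dimensions:

```python
import torch
import torch.nn as nn

x = torch.rand(16, 8, 10, 10)        # (batch, channels, H, W)
bn = nn.BatchNorm2d(8)
bn.train()                           # use batch statistics, not running estimates
y = bn(x)

# Manual per-channel normalisation over batch and spatial dimensions.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)

# gamma = 1, beta = 0 at initialisation, so x_hat matches the module.
print(torch.allclose(y, x_hat, atol=1e-5))   # True
```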

CNN architecture

A typical image classification CNN:

input image (C, H, W)
  → Conv → BatchNorm → ReLU → Pool
  → Conv → BatchNorm → ReLU → Pool
  → Flatten
  → Linear → logits

  • Spatial dimensions shrink as depth increases.
  • The number of channels typically grows (more filters per layer).
  • The final flatten converts spatial feature maps into a vector for the classifier.
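The diagram above translates directly into `nn.Sequential`. A minimal sketch for hypothetical 3-channel \(32 \times 32\) inputs and 10 classes (the channel counts 16 and 32 are illustrative choices, not prescribed by the slide):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(2),                 # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),                 # 16x16 -> 8x8
    nn.Flatten(),                    # 32 channels * 8 * 8 = 2048 features
    nn.Linear(32 * 8 * 8, 10),       # logits for 10 classes
)

logits = model(torch.rand(4, 3, 32, 32))
print(logits.shape)                  # torch.Size([4, 10])
```

Note how the spatial size halves at each pool while the channel count doubles, exactly the pattern described in the bullets.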