Day Two, Session One

Convolutional Neural Networks

Mark Andrews

Images as data

  • A greyscale image is a 2D array of pixel intensities.
  • A colour image has three such arrays, one per channel (RGB).
  • PyTorch represents images as tensors of shape \((\text{channels}, \text{height}, \text{width})\).
  • The question: how do we design a network that exploits the spatial structure of images?
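The shapes above can be checked directly in PyTorch. A minimal sketch, assuming `torch` is installed; the sizes are arbitrary examples:

```python
import torch

# A batch of 8 colour images, 32x32 pixels: (batch, channels, height, width).
images = torch.rand(8, 3, 32, 32)
print(images.shape)   # torch.Size([8, 3, 32, 32])

# A single greyscale image is a (1, H, W) tensor: one channel.
grey = torch.rand(1, 28, 28)
print(grey.shape)     # torch.Size([1, 28, 28])
```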

The problem with MLPs for images

Parameter count. A \(224 \times 224\) colour image has \(224 \times 224 \times 3 = 150{,}528\) inputs. A first hidden layer of 512 units requires over 77 million weights.

No spatial structure. Flattening an image discards all proximity information. A fully-connected layer must learn spatial relationships from scratch, which is wasteful and slow.
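The parameter count quoted above can be verified by building that first fully-connected layer and counting its parameters (weights plus biases):

```python
import torch.nn as nn

n_inputs = 224 * 224 * 3                    # 150,528 input features
layer = nn.Linear(n_inputs, 512)

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)                             # 77,070,848 (weights + 512 biases)
```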

Convolutional layers: local connectivity

  • A convolutional layer replaces the global weight matrix with a small, reusable filter.
  • A typical filter is \(3 \times 3\) — nine weights, plus a bias.
  • The filter slides across the image and computes a weighted sum at each position.
  • Each output unit depends only on a small local region of the input.
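A convolutional layer with a single \(3 \times 3\) filter is one line in PyTorch. A sketch with an arbitrary \(28 \times 28\) single-channel input; note that without padding each spatial dimension shrinks by two:

```python
import torch
import torch.nn as nn

# One 3x3 filter over a single-channel input: nine weights plus one bias.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

x = torch.rand(1, 1, 28, 28)   # (batch, channels, height, width)
y = conv(x)
print(y.shape)                 # torch.Size([1, 1, 26, 26]) -- no padding
```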

The convolution operation

For filter \(w\) of size \(k \times k\) on a single-channel input \(x\):

\[y_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} x_{i+m,\,j+n}\cdot w_{m,n} + b\]

  • One filter produces one feature map of the same spatial size (with padding).
  • A layer with 32 filters produces 32 feature maps stacked together.
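The formula above can be implemented directly with two nested loops and checked against PyTorch's `F.conv2d` (which computes exactly this cross-correlation). A sketch with a small random input and filter:

```python
import torch
import torch.nn.functional as F

k = 3
x = torch.rand(6, 6)                 # single-channel input
w = torch.rand(k, k)                 # filter weights
b = torch.tensor(0.5)                # bias

# Direct implementation of y_{i,j} = sum_{m,n} x_{i+m, j+n} * w_{m,n} + b
out = x.shape[0] - k + 1
y = torch.empty(out, out)
for i in range(out):
    for j in range(out):
        y[i, j] = (x[i:i + k, j:j + k] * w).sum() + b

# F.conv2d expects (batch, channels, H, W) tensors and a 1-D bias.
y_ref = F.conv2d(x[None, None], w[None, None], bias=b[None])[0, 0]
print(torch.allclose(y, y_ref, atol=1e-5))   # True
```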

Parameter sharing and translation invariance

  • The same filter weights are used at every position: parameter sharing.
  • A \(3 \times 3\) filter on a \(28 \times 28\) image uses \(9 + 1 = 10\) parameters regardless of image size.
  • This means the filter learns to detect a feature — an edge, a corner, a texture — wherever it appears.
  • Translation equivariance: a cat in the top-left and a cat in the bottom-right activate the same filters, just at different positions in the feature map; pooling then adds a degree of invariance.
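Parameter sharing is easy to demonstrate: the same ten parameters serve any input size. A sketch using `padding=1` so the spatial size is preserved:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))   # 10: 9 weights + 1 bias

# The same ten parameters apply to inputs of any spatial size.
for size in (28, 224):
    y = conv(torch.rand(1, 1, size, size))
    print(y.shape)   # spatial size preserved by padding=1
```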

Pooling

  • Pooling layers reduce spatial dimensions without learnable parameters.
  • Max pooling takes the maximum value in each non-overlapping window.
  • A \(2 \times 2\) max pool halves height and width.
  • Combined with convolutions, successive pooling builds increasingly compact representations.
  • Also provides a degree of local translation invariance within the pooling window.
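The halving behaviour and the absence of learnable parameters can both be seen directly. A sketch with 32 feature maps of size \(28 \times 28\):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)   # non-overlapping 2x2 windows
x = torch.rand(1, 32, 28, 28)        # 32 feature maps
y = pool(x)
print(y.shape)                       # torch.Size([1, 32, 14, 14])

# Pooling has no learnable parameters.
print(sum(p.numel() for p in pool.parameters()))   # 0
```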

Batch normalisation

Applied after convolution, before activation, per channel:

\[\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y = \gamma\hat{x} + \beta\]

  • \(\mu_B\) and \(\sigma_B^2\) are the batch mean and variance.
  • \(\gamma\) and \(\beta\) are learned scale and shift parameters.
  • Prevents activations from growing very large or collapsing to zero in deep networks.
  • At evaluation time, uses running estimates accumulated during training.
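The formula can be reproduced by hand and checked against `nn.BatchNorm2d`. At initialisation \(\gamma = 1\) and \(\beta = 0\), so in training mode the module output equals the normalised \(\hat{x}\); note the mean and (biased) variance are taken per channel, over the batch and both spatial dimensions:

```python
import torch
import torch.nn as nn

x = torch.rand(16, 8, 10, 10)        # (batch, channels, H, W)
bn = nn.BatchNorm2d(8)
bn.train()                           # use batch statistics, not running estimates
y = bn(x)

# Manual per-channel normalisation over batch and spatial dimensions.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)

# gamma = 1, beta = 0 at initialisation, so x_hat matches the module.
print(torch.allclose(y, x_hat, atol=1e-5))   # True
```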

CNN architecture

A typical image classification CNN:

input image (C, H, W)
  → Conv → BatchNorm → ReLU → Pool
  → Conv → BatchNorm → ReLU → Pool
  → Flatten
  → Linear → logits

  • Spatial dimensions shrink as depth increases.
  • The number of channels typically grows (more filters per layer).
  • The final flatten converts spatial feature maps into a vector for the classifier.
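The diagram above translates directly into `nn.Sequential`. A minimal sketch for hypothetical 3-channel \(32 \times 32\) inputs and 10 classes (the channel counts 16 and 32 are illustrative choices, not prescribed by the slide):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(2),                 # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),                 # 16x16 -> 8x8
    nn.Flatten(),                    # 32 channels * 8 * 8 = 2048 features
    nn.Linear(32 * 8 * 8, 10),       # logits for 10 classes
)

logits = model(torch.rand(4, 3, 32, 32))
print(logits.shape)                  # torch.Size([4, 10])
```

Note how the spatial size halves at each pool while the channel count doubles, exactly the pattern described in the bullets.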