Day One, Session Two

Training Neural Networks

Mark Andrews

The training problem

  • A network with random weights produces random outputs.
  • Training is the process of adjusting the weights so outputs become useful.
  • We need three things: a way to measure error (loss function), a way to compute gradients (autograd), and a rule for updating weights (optimiser).

Loss functions

  • A loss function measures how far predictions are from targets.
  • Training minimises the loss over the data by adjusting the parameters.
  • The choice of loss depends on the task.

  Task            Loss
  Regression      Mean squared error
  Classification  Cross-entropy

Mean squared error

For \(n\) samples with predictions \(\hat{y}\) and targets \(y\):

\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]

  • Penalises large errors more than small ones.
  • Appropriate when the target is a continuous quantity.
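As a sketch, the MSE formula translates directly into plain Python (illustrative only; in PyTorch you would use nn.MSELoss):

```python
def mse(y, y_hat):
    """Mean squared error over paired lists of targets and predictions."""
    n = len(y)
    return sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / n

# Squaring means large errors dominate: one error of 1.0 contributes
# four times as much as an error of 0.5.
print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # (0.25 + 0 + 1) / 3 ≈ 0.4167
```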

Cross-entropy loss

For one sample with \(C\) classes, predicted probabilities \(\hat{p}_1,\ldots,\hat{p}_C\) (from softmax), and true class \(c\):

\[\text{CE} = -\log \hat{p}_c\]

Negative log-likelihood of the correct class: maximising likelihood minimises the loss.

  • The logit form \(-z_c + \log\sum_j e^{z_j}\) is algebraically identical but numerically stable.
  • PyTorch’s nn.CrossEntropyLoss expects raw logits. Do not apply softmax first.
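A pure-Python sketch of both forms, showing they agree; subtracting the maximum logit (the log-sum-exp trick) is what makes the logit form numerically stable:

```python
import math

def cross_entropy(logits, true_class):
    """Stable logit form: CE = -z_c + log(sum_j exp(z_j))."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -logits[true_class] + log_sum_exp

def cross_entropy_naive(logits, true_class):
    """Naive route: softmax first, then -log p_c. Can overflow for large logits."""
    exps = [math.exp(z) for z in logits]
    p_c = exps[true_class] / sum(exps)
    return -math.log(p_c)

logits = [2.0, 1.0, 0.1]
print(cross_entropy(logits, 0))  # ≈ 0.417: low loss for a confident, correct prediction
```

This is why nn.CrossEntropyLoss takes raw logits: it applies the stable form internally, so adding your own softmax would apply it twice.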

Autograd

  • PyTorch builds a computational graph as operations are applied to tensors with requires_grad=True.
  • Calling .backward() on a scalar traverses the graph in reverse using the chain rule.
  • Gradients accumulate in the .grad attribute of each leaf tensor — the model parameters.

For example, if \(h = 3x^2\), calling .backward() yields \(dh/dx = 6x\), computed automatically.
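The gradient autograd reports can be checked numerically; a pure-Python finite-difference sketch for \(h = 3x^2\) (central differences approximate the exact gradient that autograd computes via the chain rule):

```python
def h(x):
    return 3 * x ** 2

def numeric_grad(f, x, eps=1e-6):
    """Central finite difference -- approximates the exact derivative."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

print(numeric_grad(h, 2.0))  # ≈ 12.0, matching dh/dx = 6x at x = 2
```

In PyTorch the same number comes from `x = torch.tensor(2.0, requires_grad=True)`, `(3 * x**2).backward()`, then reading `x.grad`.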

Gradient descent

The basic parameter update:

\[\theta \leftarrow \theta - \eta\,\nabla_\theta L\]

  • \(\eta\) is the learning rate: too large overshoots, too small converges slowly.
  • Running gradient descent on the full dataset is expensive for large datasets.
  • In practice we use mini-batches — small random subsets of the data.
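A minimal sketch of the update rule on a toy loss \(L(\theta) = (\theta - 3)^2\) (a hypothetical example, not from the slides):

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2 * (theta - 3)

theta, eta = 0.0, 0.1
for _ in range(100):
    theta = theta - eta * grad(theta)  # theta <- theta - eta * dL/dtheta

print(round(theta, 4))  # 3.0: converges to the minimiser
```

With \(\eta = 0.1\) each step multiplies the error by 0.8; with \(\eta > 1\) the same loop would diverge, illustrating the overshooting bullet above.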

Mini-batches and epochs

  • A mini-batch is a small random subset of the training data.
  • Gradients estimated from a mini-batch are noisy but fast to compute.
  • One pass through the entire training set is one epoch.
  • The noise in mini-batch gradients can help escape shallow local minima.
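One epoch of mini-batching can be sketched in plain Python (illustrative; PyTorch's DataLoader does this when constructed with shuffle=True):

```python
import random

def minibatches(data, batch_size, rng):
    """Shuffle once per epoch, then yield consecutive slices as mini-batches."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

rng = random.Random(0)
batches = list(minibatches(list(range(10)), batch_size=3, rng=rng))
# One epoch: every sample appears exactly once, split across batches.
print([len(b) for b in batches])  # [3, 3, 3, 1]
```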

The Adam optimiser

  • Plain SGD uses the same learning rate for every parameter throughout training.
  • Adam maintains per-parameter running estimates of the mean gradient and mean squared gradient.
  • Parameters with consistently large gradients receive a smaller effective rate; sparse gradients receive larger updates.
  • Adam typically converges faster than plain SGD and needs less learning-rate tuning.
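A pure-Python sketch of the standard Adam update for a single parameter (defaults \(\beta_1 = 0.9\), \(\beta_2 = 0.999\); in practice you would use torch.optim.Adam):

```python
import math

def adam_step(theta, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta with gradient g."""
    m = b1 * m + (1 - b1) * g        # running estimate of the mean gradient
    v = b2 * v + (1 - b2) * g * g    # running estimate of the mean squared gradient
    m_hat = m / (1 - b1 ** t)        # bias corrections: the running estimates
    v_hat = v / (1 - b2 ** t)        #   start at zero, so early values are rescaled
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise the toy loss L(theta) = (theta - 3)^2 with Adam.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    g = 2 * (theta - 3)
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.01)
print(round(theta, 3))
```

Dividing by \(\sqrt{\hat{v}}\) is what gives each parameter its own effective rate: consistently large gradients inflate \(\hat{v}\) and shrink the step.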

Overfitting and generalisation

  • A network trained for long enough can memorise the training data rather than learn transferable patterns.
  • We detect this by monitoring performance on data the model never sees during training.
  • Standard split: training for learning, validation for monitoring, test for final unbiased evaluation.
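A minimal index-splitting sketch (the 80/10/10 fractions here are illustrative, not prescribed by the slides):

```python
import random

def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle indices once, then carve out validation and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed so the split is reproducible
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    val, test, train = idx[:n_val], idx[n_val:n_val + n_test], idx[n_val + n_test:]
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 800 100 100
```

The test set is touched once, at the very end; tuning against it would make its estimate of generalisation optimistic.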

Regularisation

Dropout: during training, each activation is set to zero with probability \(p\) and survivors are scaled by \(1/(1-p)\). Disabled at evaluation. Forces the network to learn distributed representations.
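A pure-Python sketch of this (inverted) dropout; the \(1/(1-p)\) scaling keeps the expected activation unchanged between training and evaluation:

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    if not training:
        return list(activations)  # disabled at evaluation time
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * scale for a in activations]

rng = random.Random(0)
dropped = dropout([1.0] * 10000, p=0.5, rng=rng)
# About half are zeroed and the rest doubled, so the mean stays near 1.0.
print(round(sum(dropped) / len(dropped), 2))
```

In PyTorch, nn.Dropout applies this during model.train() and is a no-op under model.eval().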

Weight decay: adds \(\lambda\sum_\theta \theta^2\) to the loss, penalising large weights and preventing any single parameter from dominating. Passed as weight_decay to the optimiser.
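A sketch of the effect on plain SGD: the penalty \(\lambda\theta^2\) contributes gradient \(2\lambda\theta\), which shrinks each weight geometrically toward zero even when the data gradient vanishes:

```python
def decayed_sgd_step(theta, g, eta=0.1, lam=0.01):
    """SGD step with an L2 penalty: its gradient 2*lam*theta shrinks the weight."""
    return theta - eta * (g + 2 * lam * theta)

theta = 5.0
for _ in range(3):
    theta = decayed_sgd_step(theta, g=0.0)  # data gradient zero: pure decay
print(round(theta, 6))  # 5 * (1 - 0.1 * 0.02)^3 = 5 * 0.998^3
```

In PyTorch this penalty is requested via the optimiser's weight_decay argument, as noted above.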