Day One, Session Two

Training Neural Networks

Mark Andrews

The training problem

  • A network with random weights produces random outputs.
  • Training is the process of adjusting the weights so outputs become useful.
  • We need three things: a way to measure error (loss function), a way to compute gradients (autograd), and a rule for updating weights (optimiser).

Loss functions

  • A loss function measures how far predictions are from targets.
  • Training minimises the loss over the data by adjusting the parameters.
  • The choice of loss depends on the task.

  Task            Loss
  Regression      Mean squared error
  Classification  Cross-entropy

Mean squared error

For \(n\) samples with predictions \(\hat{y}\) and targets \(y\):

\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]

  • Penalises large errors more than small ones.
  • Appropriate when the target is a continuous quantity.
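As a sketch, the MSE formula translates directly into plain Python (illustrative only; in PyTorch you would use nn.MSELoss):

```python
def mse(y, y_hat):
    """Mean squared error over paired lists of targets and predictions."""
    n = len(y)
    return sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / n

# Squaring means large errors dominate: one error of 1.0 contributes
# four times as much as an error of 0.5.
print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # (0.25 + 0 + 1) / 3 ≈ 0.4167
```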

Cross-entropy loss

For one sample with \(C\) classes, predicted probabilities \(\hat{p}_1,\ldots,\hat{p}_C\) (from softmax), and true class \(c\):

\[\text{CE} = -\log \hat{p}_c\]

Negative log-likelihood of the correct class: maximising likelihood minimises the loss.

  • The logit form \(-z_c + \log\sum_j e^{z_j}\) is algebraically identical but numerically stable.
  • PyTorch’s nn.CrossEntropyLoss expects raw logits. Do not apply softmax first.
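A pure-Python sketch of both forms, showing they agree; subtracting the maximum logit (the log-sum-exp trick) is what makes the logit form numerically stable:

```python
import math

def cross_entropy(logits, true_class):
    """Stable logit form: CE = -z_c + log(sum_j exp(z_j))."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -logits[true_class] + log_sum_exp

def cross_entropy_naive(logits, true_class):
    """Naive route: softmax first, then -log p_c. Can overflow for large logits."""
    exps = [math.exp(z) for z in logits]
    p_c = exps[true_class] / sum(exps)
    return -math.log(p_c)

logits = [2.0, 1.0, 0.1]
print(cross_entropy(logits, 0))  # ≈ 0.417: low loss for a confident, correct prediction
```

This is why nn.CrossEntropyLoss takes raw logits: it applies the stable form internally, so adding your own softmax would apply it twice.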

Autograd

  • PyTorch builds a computational graph as operations are applied to tensors with requires_grad=True.
  • Calling .backward() on a scalar traverses the graph in reverse using the chain rule.
  • Gradients accumulate in the .grad attribute of each leaf tensor — the model parameters.

For example, if \(h = 3x^2\), calling .backward() yields \(dh/dx = 6x\), computed automatically.
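The gradient autograd reports can be checked numerically; a pure-Python finite-difference sketch for \(h = 3x^2\) (central differences approximate the exact gradient that autograd computes via the chain rule):

```python
def h(x):
    return 3 * x ** 2

def numeric_grad(f, x, eps=1e-6):
    """Central finite difference -- approximates the exact derivative."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

print(numeric_grad(h, 2.0))  # ≈ 12.0, matching dh/dx = 6x at x = 2
```

In PyTorch the same number comes from `x = torch.tensor(2.0, requires_grad=True)`, `(3 * x**2).backward()`, then reading `x.grad`.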

Gradient descent

The basic parameter update:

\[\theta \leftarrow \theta - \eta\,\nabla_\theta L\]

  • \(\eta\) is the learning rate: too large overshoots, too small converges slowly.
  • Running gradient descent on the full dataset is expensive for large datasets.
  • In practice we use mini-batches — small random subsets of the data.
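A minimal sketch of the update rule on a toy loss \(L(\theta) = (\theta - 3)^2\) (a hypothetical example, not from the slides):

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2 * (theta - 3)

theta, eta = 0.0, 0.1
for _ in range(100):
    theta = theta - eta * grad(theta)  # theta <- theta - eta * dL/dtheta

print(round(theta, 4))  # 3.0: converges to the minimiser
```

With \(\eta = 0.1\) each step multiplies the error by 0.8; with \(\eta > 1\) the same loop would diverge, illustrating the overshooting bullet above.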

Mini-batches and epochs

  • A mini-batch is a small random subset of the training data.
  • Gradients estimated from a mini-batch are noisy but fast to compute.
  • One pass through the entire training set is one epoch.
  • The noise in mini-batch gradients can help escape shallow local minima.
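One epoch of mini-batching can be sketched in plain Python (illustrative; PyTorch's DataLoader does this when constructed with shuffle=True):

```python
import random

def minibatches(data, batch_size, rng):
    """Shuffle once per epoch, then yield consecutive slices as mini-batches."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

rng = random.Random(0)
batches = list(minibatches(list(range(10)), batch_size=3, rng=rng))
# One epoch: every sample appears exactly once, split across batches.
print([len(b) for b in batches])  # [3, 3, 3, 1]
```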

The Adam optimiser

  • Plain SGD uses the same learning rate for every parameter throughout training.
  • Adam maintains per-parameter running estimates of the mean gradient and mean squared gradient.
  • Parameters with consistently large gradients receive a smaller effective rate; sparse gradients receive larger updates.
  • Adam typically converges faster than plain SGD and needs less learning-rate tuning.
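A pure-Python sketch of the standard Adam update for a single parameter (defaults \(\beta_1 = 0.9\), \(\beta_2 = 0.999\); in practice you would use torch.optim.Adam):

```python
import math

def adam_step(theta, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta with gradient g."""
    m = b1 * m + (1 - b1) * g        # running estimate of the mean gradient
    v = b2 * v + (1 - b2) * g * g    # running estimate of the mean squared gradient
    m_hat = m / (1 - b1 ** t)        # bias corrections: the running estimates
    v_hat = v / (1 - b2 ** t)        #   start at zero, so early values are rescaled
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise the toy loss L(theta) = (theta - 3)^2 with Adam.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    g = 2 * (theta - 3)
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.01)
print(round(theta, 3))
```

Dividing by \(\sqrt{\hat{v}}\) is what gives each parameter its own effective rate: consistently large gradients inflate \(\hat{v}\) and shrink the step.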

Overfitting and generalisation

  • A network trained for long enough can memorise the training data rather than learn transferable patterns.
  • We detect this by monitoring performance on data the model never sees during training.
  • Standard split: training for learning, validation for monitoring, test for final unbiased evaluation.
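A minimal index-splitting sketch (the 80/10/10 fractions here are illustrative, not prescribed by the slides):

```python
import random

def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle indices once, then carve out validation and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed so the split is reproducible
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    val, test, train = idx[:n_val], idx[n_val:n_val + n_test], idx[n_val + n_test:]
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 800 100 100
```

The test set is touched once, at the very end; tuning against it would make its estimate of generalisation optimistic.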

Regularisation

Dropout: during training, each activation is set to zero with probability \(p\) and survivors are scaled by \(1/(1-p)\). Disabled at evaluation. Forces the network to learn distributed representations.
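A pure-Python sketch of this (inverted) dropout; the \(1/(1-p)\) scaling keeps the expected activation unchanged between training and evaluation:

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    if not training:
        return list(activations)  # disabled at evaluation time
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * scale for a in activations]

rng = random.Random(0)
dropped = dropout([1.0] * 10000, p=0.5, rng=rng)
# About half are zeroed and the rest doubled, so the mean stays near 1.0.
print(round(sum(dropped) / len(dropped), 2))
```

In PyTorch, nn.Dropout applies this during model.train() and is a no-op under model.eval().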

Weight decay: adds \(\lambda\sum_\theta \theta^2\) to the loss, penalising large weights and preventing any single parameter from dominating. Passed as weight_decay to the optimiser.
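A sketch of the effect on plain SGD: the penalty \(\lambda\theta^2\) contributes gradient \(2\lambda\theta\), which shrinks each weight geometrically toward zero even when the data gradient vanishes:

```python
def decayed_sgd_step(theta, g, eta=0.1, lam=0.01):
    """SGD step with an L2 penalty: its gradient 2*lam*theta shrinks the weight."""
    return theta - eta * (g + 2 * lam * theta)

theta = 5.0
for _ in range(3):
    theta = decayed_sgd_step(theta, g=0.0)  # data gradient zero: pure decay
print(round(theta, 6))  # 5 * (1 - 0.1 * 0.02)^3 = 5 * 0.998^3
```

In PyTorch this penalty is requested via the optimiser's weight_decay argument, as noted above.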