Training Neural Networks
Mark Andrews
We introduce PyTorch tensors and autograd, PyTorch’s automatic differentiation engine. We then cover loss functions, the mechanics of gradient descent, and how PyTorch optimizers implement and extend it. The session closes with train/validation/test splitting and the two most common regularisation techniques: dropout and weight decay.
A PyTorch tensor is the equivalent of a NumPy array, a multi-dimensional array of numbers, but with two additional capabilities: it can run on a GPU, and it can track operations for automatic differentiation.
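A minimal sketch of creating a tensor and checking these capabilities (the device move is guarded, since a GPU may not be present):

```python
import torch

x = torch.tensor([1.5, 2.0, -1.0])
print(x.dtype)    # torch.float32 -- Python floats default to 32-bit
print(x.device)   # cpu -- tensors live on the CPU unless moved

# Move to the GPU only if one is actually available
if torch.cuda.is_available():
    x = x.to("cuda")
```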
Tensors and NumPy arrays convert to one another cheaply.
import numpy as np
a = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(a)   # shares memory with the NumPy array
t
tensor([1., 2., 3.], dtype=torch.float64)
Element-wise and matrix operations work as in NumPy.
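For instance, arithmetic is element-wise and @ performs matrix (here dot) products, exactly as in NumPy:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])
print(x + y)   # tensor([5., 7., 9.]) -- element-wise addition
print(x * y)   # tensor([ 4., 10., 18.]) -- element-wise product
print(x @ y)   # tensor(32.) -- dot product
```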
PyTorch distinguishes between operations that return a new tensor and operations that modify a tensor in place. x.abs() returns a new tensor containing the absolute values, leaving x unchanged. x.abs_() modifies x in place and returns it. The trailing underscore is a consistent PyTorch convention: any method ending in _ operates in place.
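A quick check of the two behaviours:

```python
import torch

x = torch.tensor([-1.0, 2.0, -3.0])
y = x.abs()    # returns a new tensor; x is left unchanged
print(x)       # tensor([-1.,  2., -3.])

x.abs_()       # trailing underscore: modifies x in place
print(x)       # tensor([1., 2., 3.])
```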
In-place operations on tensors that participate in a computational graph will raise an error or silently produce incorrect gradients. Avoid in-place operations on any tensor that has requires_grad=True or that was produced by such a tensor.
Autograd is PyTorch’s automatic differentiation engine. When a tensor is created with requires_grad=True, PyTorch builds a computational graph as operations are applied to it. Calling .backward() on a scalar traverses that graph in reverse and computes derivatives via the chain rule.
The computed derivative is stored in the .grad attribute of each leaf tensor. A leaf tensor is one created directly by the user rather than computed from other tensors. The key point is that .grad on a leaf tensor always stores the derivative of whatever scalar .backward() was called on, with respect to that leaf tensor.
In the simplest case, \(y = x^2\) where \(x = 3\). Calling y.backward() computes \(dy/dx = 2x = 6\) and stores it in x.grad.
The reason the gradient appears on x rather than on y is that x is the leaf tensor. y is an intermediate result computed from x, and PyTorch does not store gradients on intermediate tensors by default.
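The example in full:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.backward()     # computes dy/dx = 2x via the chain rule
print(x.grad)    # tensor(6.)
```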
This generalises to any chain of operations. Suppose \(y = x^2\), \(z = 3y\), \(h = z + 1\), so that \(h = 3x^2 + 1\). Calling h.backward() computes \(dh/dx\) by the chain rule through all intermediate steps and stores the result in x.grad.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
z = 3 * y
h = z + 1
h.backward()      # dh/dx = 6x = 12 at x = 2
x.grad
tensor(12.)
Activation functions are differentiable operations, so autograd handles them transparently.
x = torch.tensor(2.0, requires_grad=True)
h = 3 * torch.relu(x) ** 2 + 1
h.backward()      # x is positive, so relu passes through unchanged
x.grad
tensor(12.)
The ReLU derivative is 1 where the input is positive and 0 where it is negative, so the chain rule passes through unchanged when the input lies in the active region. Tanh and sigmoid have smooth, everywhere-defined derivatives that autograd computes equally easily.
x = torch.tensor(1.0, requires_grad=True)
y = torch.tanh(x)
y.backward()      # dy/dx = 1 - tanh(x)^2
x.grad
tensor(0.4200)
PyTorch only populates .grad on leaf tensors. After .backward(), the .grad attribute of intermediate tensors such as y and z in the examples above remains None. This is intentional: storing gradients for every intermediate value in a large network would be prohibitively expensive.
If you do need the gradient with respect to an intermediate tensor, call .retain_grad() on it before the backward pass.
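A sketch using the chain \(y = x^2\), \(z = 3y\):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.retain_grad()   # ask autograd to keep the gradient on this intermediate
z = 3 * y
z.backward()

print(y.grad)     # tensor(3.)  -- dz/dy = 3
print(x.grad)     # tensor(18.) -- dz/dx = 6x = 18
```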
After .backward() is called, PyTorch frees the computational graph by default to reclaim memory. Attempting to call .backward() again on any tensor from that graph will raise a RuntimeError.
Passing retain_graph=True prevents the graph from being freed, allowing multiple backward passes. The cost is that all intermediate buffers remain allocated for as long as you hold the reference.
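For example:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
y.backward(retain_graph=True)   # graph survives this pass
y.backward()                    # a second pass is now legal
print(x.grad)                   # tensor(8.) -- the two passes accumulated
```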
PyTorch adds newly computed gradients to .grad rather than replacing it. Calling .backward() more than once on computations that share the same leaf tensor will accumulate the results.
x = torch.tensor(1.0, requires_grad=True)
y = x ** 2
y.backward()      # x.grad is now tensor(2.)
y = x ** 2        # rebuild the graph
y.backward()      # adds another 2
x.grad
tensor(4.)
The gradient from the second call was added to the value already in x.grad. This accumulation is by design: in some training setups, gradient contributions from multiple loss terms are summed by running several backward passes before taking a single parameter update step. In a standard training loop, however, each backward pass should operate on a fresh gradient, so .grad must be zeroed explicitly before each new pass.
Without zeroing, running the same training step twice would double the gradient and corrupt the parameter update. This is why optimizer.zero_grad() is the first line of every training loop.
For a vector-valued leaf tensor, the gradient is computed element-wise. Each element of .grad is the derivative of the scalar output with respect to the corresponding element of the leaf.
The same logic extends to matrices. Each element of .grad is the derivative of the scalar output with respect to the corresponding element of the parameter matrix. This is the situation for weight matrices in a neural network.
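A vector example, where the scalar loss is the sum of squares, so each gradient entry is \(2x_i\):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()   # scalar output
loss.backward()
print(x.grad)           # tensor([2., 4., 6.]) -- element-wise 2x
```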
To see all of this in a realistic setting, consider a two-layer network with a three-dimensional input, a four-dimensional hidden layer with ReLU activation, and a two-dimensional output.
W1 = torch.randn(4, 3, requires_grad=True)
W2 = torch.randn(2, 4, requires_grad=True)
W1.shape, W2.shape
(torch.Size([4, 3]), torch.Size([2, 4]))
The network output out is a two-dimensional vector. Because .backward() requires a scalar, a loss function collapses the output to a single number before differentiating. Each weight matrix then receives a gradient of the same shape as itself, where every entry is the derivative of the scalar loss with respect to that weight.
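A sketch of that flow, with the weight matrices redefined here (and a plain sum standing in for a real loss function) so the snippet is self-contained:

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(4, 3, requires_grad=True)   # hidden layer, 3 -> 4
W2 = torch.randn(2, 4, requires_grad=True)   # output layer, 4 -> 2
x = torch.randn(3)

out = W2 @ torch.relu(W1 @ x)   # two-dimensional output
loss = out.sum()                # collapse to a scalar before backward
loss.backward()

print(W1.grad.shape)   # torch.Size([4, 3]) -- same shape as W1
print(W2.grad.shape)   # torch.Size([2, 4]) -- same shape as W2
```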
When the output is a vector and you want the full matrix of partial derivatives (the derivative of each output component with respect to each input component), the result is a Jacobian. PyTorch provides torch.autograd.functional.jacobian for this case.
from torch.autograd.functional import jacobian

def f(x):
    return W2 @ torch.relu(W1 @ x)

# For this particular random input every hidden ReLU unit was inactive,
# so all partial derivatives are zero
jacobian(f, torch.randn(3))
tensor([[0., 0., 0.],
        [0., 0., 0.]])
The entry at row \(i\), column \(j\) is the partial derivative of the \(i\)-th output with respect to the \(j\)-th input. In standard neural network training the loss is always a scalar, so the Jacobian is not needed in the training loop itself. It arises in sensitivity analysis and certain second-order optimisation methods.
A loss function measures how far the network’s predictions are from the true targets. Training minimises this loss over the data by adjusting the network’s parameters.
Mean squared error (MSE) is the standard loss for regression. For \(n\) predictions \(\hat{y}_i\) and targets \(y_i\):
\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
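In PyTorch this is nn.MSELoss; a small illustration with made-up numbers:

```python
import torch
import torch.nn as nn

pred   = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])

mse = nn.MSELoss()(pred, target)   # mean of (0.5^2 + 0.5^2 + 0^2)
print(mse)                         # tensor(0.1667)
```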
Cross-entropy loss is the standard loss for classification. It arises from maximum likelihood estimation: minimising it is equivalent to maximising the log-likelihood of the observed labels under the model’s predicted probabilities.
For a \(K\)-class problem, the network produces a probability distribution \(\hat{p}_1, \ldots, \hat{p}_K\) via softmax. The observed label is one-hot: \(y_c = 1\) for the true class \(c\) and \(y_k = 0\) for all other \(k\). The general cross-entropy between the true and predicted distributions is:
\[H(y, \hat{p}) = -\sum_{k=1}^{K} y_k \log \hat{p}_k\]
Because only the \(c\)-th term is non-zero, this reduces to:
\[\text{CE} = -\log \hat{p}_c\]
Substituting the softmax \(\hat{p}_k = e^{z_k} / \sum_j e^{z_j}\) and expanding the logarithm gives an equivalent expression entirely in terms of the raw logits \(z_1, \ldots, z_K\):
\[\text{CE} = -z_c + \log \sum_{j=1}^{K} e^{z_j}\]
The two forms are algebraically identical. The logit form is used in practice because computing the softmax probabilities explicitly can cause numerical overflow when logits are large, whereas the log-sum-exp computation avoids this.
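The equivalence is easy to verify numerically (the logits here are arbitrary):

```python
import torch

z = torch.tensor([2.0, 0.5, -1.0])   # arbitrary logits
c = 0                                # index of the true class

probs = torch.softmax(z, dim=0)
ce_prob  = -torch.log(probs[c])               # probability form
ce_logit = -z[c] + torch.logsumexp(z, dim=0)  # logit form

print(ce_prob, ce_logit)   # the same value, up to rounding
```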
PyTorch’s nn.CrossEntropyLoss takes raw logits and applies the softmax internally, so no softmax layer is needed at the end of the network.
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, -1.0]])   # batch of one, three classes
target = torch.tensor([0])                  # true class is 0
loss_fn(logits, target)
tensor(0.2413)
The higher the logit for the true class relative to the others, the lower the loss.
A different situation arises when the output is not a single \(K\)-way categorical variable but \(K\) independent binary variables. A multi-label classification problem is a typical example: each output unit indicates whether a given category is present or absent. Each output unit gets a sigmoid activation, giving \(K\) independent probabilities \(\hat{p}_k = \sigma(z_k) \in (0, 1)\), each the probability that binary variable \(k\) equals 1.
This is a different distributional model: \(K\) independent Bernoulli variables rather than one categorical variable. The log-likelihood for a single binary variable with observed label \(y_k \in \{0, 1\}\) and predicted probability \(\hat{p}_k\) is \(y_k \log \hat{p}_k + (1 - y_k) \log(1 - \hat{p}_k)\), so the negative log-likelihood — the binary cross-entropy — is:
\[\text{BCE} = -[y_k \log \hat{p}_k + (1 - y_k) \log(1 - \hat{p}_k)]\]
When \(y_k = 1\), only the first term survives: the loss penalises a low predicted probability for the positive class. When \(y_k = 0\), only the second term survives: the loss penalises a high predicted probability for the positive class. For \(K\) binary outputs the total loss is the mean of the \(K\) individual terms.
nn.BCEWithLogitsLoss accepts raw logits and applies sigmoid internally. This is preferred over nn.BCELoss (which expects probabilities already passed through sigmoid) for the same numerical stability reason as with multi-class cross-entropy.
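A sketch comparing the built-in loss to the formula above (the logits and labels are made up):

```python
import torch
import torch.nn as nn

logits  = torch.tensor([1.2, -0.8, 2.5])   # one raw logit per binary output
targets = torch.tensor([1.0, 0.0, 1.0])    # BCE targets must be floats

loss = nn.BCEWithLogitsLoss()(logits, targets)

# The same quantity computed manually from the BCE formula
p = torch.sigmoid(logits)
manual = -(targets * torch.log(p) + (1 - targets) * torch.log(1 - p)).mean()
print(loss, manual)   # equal up to floating-point rounding
```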
Once we have a loss, we want to adjust the parameters to reduce it. Gradient descent does this by moving each parameter a small step in the direction that decreases the loss. For a parameter \(\theta\) and a loss \(L\), the update rule is:
\[\theta \leftarrow \theta - \eta \, \nabla_\theta L\]
where \(\eta\) is the learning rate, a small positive scalar that controls the step size, and \(\nabla_\theta L\) is the gradient of the loss with respect to \(\theta\). A large learning rate risks overshooting the minimum; a small one converges slowly.
Autograd computes \(\nabla_\theta L\) for every parameter automatically. The manual loop below makes the mechanics explicit: compute the loss, call .backward() to get the gradient, then update the parameter.
w = torch.tensor(2.0, requires_grad=True)   # start away from the minimum at w = 1
lr = 0.1

for step in range(6):
    loss = (w - 1.0) ** 2
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
    w.grad.zero_()
    print(f"step {step+1}: w={w.item():.4f} loss={loss.item():.4f}")
step 1: w=1.8000 loss=1.0000
step 2: w=1.6400 loss=0.6400
step 3: w=1.5120 loss=0.4096
step 4: w=1.4096 loss=0.2621
step 5: w=1.3277 loss=0.1678
step 6: w=1.2621 loss=0.1074
torch.no_grad() is needed around the parameter update so that PyTorch does not add that operation to the computational graph.
In practice, gradient descent is run on small random subsets of the data called mini-batches rather than the full dataset. Each pass through the full training set is one epoch. Mini-batch gradient descent is faster per update, and the noise in its gradient estimates can help the optimiser escape shallow local minima.
PyTorch optimizers handle the parameter update and gradient zeroing, replacing the manual loop above. The pattern is always the same four lines:

optimizer.zero_grad()         # clear old gradients
loss = loss_fn(model(x), y)   # forward pass and loss
loss.backward()               # backward pass
optimizer.step()              # parameter update
The simplest optimizer is SGD (stochastic gradient descent), which implements the update rule above directly.
w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(6):
    optimizer.zero_grad()
    loss = (w - 1.0) ** 2
    loss.backward()
    optimizer.step()
    print(f"step {step+1}: w={w.item():.4f} loss={loss.item():.4f}")
step 1: w=1.8000 loss=1.0000
step 2: w=1.6400 loss=0.6400
step 3: w=1.5120 loss=0.4096
step 4: w=1.4096 loss=0.2621
step 5: w=1.3277 loss=0.1678
step 6: w=1.2621 loss=0.1074
Adam (Adaptive Moment Estimation) is the default choice for most deep learning. Plain SGD uses the same learning rate for every parameter throughout training. Adam instead maintains a running estimate of the mean gradient and the mean squared gradient for each parameter separately, and uses these to scale the update for each parameter individually.
Parameters whose gradients have been consistently large get a smaller effective learning rate. Parameters whose gradients have been small or inconsistent get a larger effective learning rate. The result is that Adam tends to converge faster than SGD, requires less learning rate tuning, and handles the very different scales of gradients that occur across the layers of a deep network.
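Swapping Adam into a toy quadratic problem requires changing only the optimizer line (the loss here is an illustrative stand-in, not a real network):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.Adam([w], lr=0.1)

for step in range(100):
    optimizer.zero_grad()
    loss = (w - 1.0) ** 2
    loss.backward()
    optimizer.step()

print(w.item())   # close to the minimum at w = 1
```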
A model trained long enough will memorise the training data rather than learn patterns that generalise. To detect this, we hold out data the model never sees during training.
The standard split is three-way. The training set is what the model is trained on. The validation set is used to monitor the model during training: we evaluate loss and accuracy on it after each epoch to see whether the model is still improving or beginning to overfit. The test set is held back entirely and evaluated only once, at the end, to give an unbiased estimate of performance.
Using the validation set to make decisions (such as when to stop training, or which architecture to use) means it is no longer fully independent, which is why a separate test set is needed for the final evaluation.
With PyTorch’s random_split, a dataset can be divided into any proportions.
from torch.utils.data import random_split, TensorDataset
# A small synthetic dataset for illustration
X = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,))
dataset = TensorDataset(X, y)
n_train, n_val, n_test = 700, 150, 150
train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n_test])
len(train_set), len(val_set), len(test_set)
(700, 150, 150)
In practice, many benchmark datasets (including MNIST) provide a predefined train/test split. The test set is treated as the held-out evaluation set, and a portion of the training data is reserved as a validation set.
Regularisation refers to techniques that reduce overfitting by discouraging the model from becoming too complex.
Dropout is a regularisation technique in which each neuron in a layer is independently and randomly deactivated during each forward pass of training. Concretely, each activation \(h_i\) is multiplied by a Bernoulli random variable \(b_i \sim \text{Bernoulli}(1-p)\), where \(p\) is the dropout probability. When \(b_i = 0\) the unit is set to zero; when \(b_i = 1\) it is kept. The surviving activations are scaled by \(1/(1-p)\) so that the expected total activation remains unchanged.
During evaluation, dropout is disabled entirely: all units are active and no scaling is applied. The net effect during training is that the network cannot rely on any single unit always being present, which forces it to learn more distributed and robust representations.
import torch.nn as nn

drop = nn.Dropout(p=0.5)
drop(torch.ones(10))   # which units are zeroed is random; survivors are scaled by 1/(1-p) = 2
tensor([0., 0., 2., 0., 2., 2., 2., 0., 0., 0.])
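The evaluation-mode behaviour can be checked directly: after .eval(), the layer is a no-op:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
drop.eval()                 # evaluation mode: dropout is disabled
x = torch.ones(10)
print(drop(x))              # all ones, unchanged and unscaled
```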
Weight decay penalises large parameter values by adding a term to the loss proportional to the sum of squared parameters:
\[L_{\text{reg}} = L + \lambda \sum_\theta \theta^2\]
where \(\lambda\) is the weight decay coefficient. This encourages the model to use smaller weights and prevents any single parameter from dominating. In PyTorch it is passed directly to the optimizer.
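For example, with Adam (the coefficient value here is just a typical starting point):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
# weight_decay applies the L2 penalty inside the optimizer update itself
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```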