Training Neural Networks

Author

Mark Andrews

Abstract

We cover autograd, torch’s automatic differentiation engine, showing how torch builds a computational graph and computes gradients by backpropagation. We then cover loss functions, the mechanics of gradient descent, and how torch optimizers implement and extend it. The session closes with train/validation/test splitting and the two most common regularisation techniques: dropout and weight decay.

Autograd

Autograd is torch’s automatic differentiation engine. When a tensor is created with requires_grad = TRUE, torch builds a computational graph as operations are applied to it. Calling $backward() on a scalar traverses that graph in reverse and computes derivatives via the chain rule.

The computed derivative is stored in the $grad attribute of each leaf tensor. A leaf tensor is one created directly by the user rather than computed from other tensors.

In the simplest case, \(y = x^2\) where \(x = 3\). Calling y$backward() computes \(dy/dx = 2x = 6\) and stores it in x$grad.

x <- torch_tensor(3.0, requires_grad = TRUE)
y <- x^2
y$backward()
x$grad
torch_tensor
 6
[ CPUFloatType{1} ]

The reason the gradient appears on x rather than on y is that x is the leaf tensor. y is an intermediate result computed from x, and torch does not store gradients on intermediate tensors by default.

This generalises to any chain of operations. Suppose \(y = x^2\), \(z = 3y\), \(h = z + 1\), so that \(h = 3x^2 + 1\). Calling h$backward() computes \(dh/dx\) by the chain rule through all intermediate steps.

x <- torch_tensor(2.0, requires_grad = TRUE)
y <- x^2        # y = x^2
z <- 3 * y      # z = 3x^2
h <- z + 1      # h = 3x^2 + 1
h$backward()
x$grad          # dh/dx = 6x = 12
torch_tensor
 12
[ CPUFloatType{1} ]
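
One way to sanity-check a gradient reported by autograd is to compare it with a finite-difference estimate. The sketch below uses only base R; the function h and the step size eps are illustrative choices and are not part of torch.

```r
# Finite-difference check of dh/dx for h(x) = 3x^2 + 1 at x = 2.
# `h` and `eps` are illustrative names, not part of torch.
h <- function(x) 3 * x^2 + 1

x0  <- 2
eps <- 1e-5

# Central difference: (h(x + eps) - h(x - eps)) / (2 * eps)
numeric_grad  <- (h(x0 + eps) - h(x0 - eps)) / (2 * eps)
analytic_grad <- 6 * x0   # chain rule result, dh/dx = 6x

print(c(numeric = numeric_grad, analytic = analytic_grad))
```

Both numbers should agree with the value torch stored in x$grad, up to floating-point error in the finite-difference estimate.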

Autograd through activation functions

Activation functions are differentiable operations, so autograd handles them transparently.

x <- torch_tensor(2.0, requires_grad = TRUE)
y <- x^2
z <- 3 * y
h <- nnf_relu(z)    # relu is identity for positive inputs
h$backward()
x$grad           # d(relu(3x^2))/dx = 6x = 12, since 3x^2 > 0 at x = 2
torch_tensor
 12
[ CPUFloatType{1} ]

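The ReLU derivative is 1 where the input is positive and 0 where it is negative (torch also uses 0 at exactly zero), so the gradient passes through unchanged when the input lies in the active region. Smooth activations such as tanh and sigmoid are handled the same way, using their closed-form derivatives.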

x <- torch_tensor(1.0, requires_grad = TRUE)
y <- torch_tanh(x)
y$backward()
x$grad           # d tanh(x)/dx = 1 - tanh(x)^2
torch_tensor
 0.4200
[ CPUFloatType{1} ]

x <- torch_tensor(1.0, requires_grad = TRUE)
y <- torch_sigmoid(x)
y$backward()
x$grad           # sigmoid(x) * (1 - sigmoid(x))
torch_tensor
 0.1966
[ CPUFloatType{1} ]
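
The two values printed above can be reproduced from the closed-form derivatives given in the comments. This is a minimal base-R check; the helper sigmoid is defined here for illustration.

```r
# Closed-form derivatives at x = 1, evaluated in base R.
x <- 1
sigmoid <- function(z) 1 / (1 + exp(-z))   # illustrative helper

d_tanh    <- 1 - tanh(x)^2                  # derivative of tanh
d_sigmoid <- sigmoid(x) * (1 - sigmoid(x))  # derivative of sigmoid

print(round(c(tanh = d_tanh, sigmoid = d_sigmoid), 4))
```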

Intermediate tensor gradients

torch only populates $grad on leaf tensors. If you need the gradient with respect to an intermediate tensor, call $retain_grad() on it before the backward pass.

x <- torch_tensor(2.0, requires_grad = TRUE)
y <- x^2
z <- 3 * y
z$retain_grad()    # ask torch to keep z's gradient
h <- z + 1
h$backward()
z$grad             # dh/dz = 1
torch_tensor
 1
[ CPUFloatType{1} ]

Graph retention

After $backward() is called, torch frees the computational graph by default to reclaim memory, so a second $backward() call on the same graph raises an error. Passing retain_graph = TRUE keeps the graph, allowing multiple backward passes through it.

x <- torch_tensor(2.0, requires_grad = TRUE)
y <- x^2
z <- 3 * y
h <- nnf_relu(z)
h$backward(retain_graph = TRUE)   # graph kept
x$grad                            # dh/dx = 12
torch_tensor
 12
[ CPUFloatType{1} ]

Gradient accumulation

torch adds newly computed gradients to $grad rather than replacing it. Calling $backward() more than once will accumulate the results.

x <- torch_tensor(2.0, requires_grad = TRUE)
y <- x^2
y$backward()
x$grad           # dy/dx = 2x = 4.0
torch_tensor
 4
[ CPUFloatType{1} ]
z <- x^3
z$backward()
x$grad           # 4.0 + dz/dx = 4.0 + 3x^2 = 4.0 + 12.0 = 16.0
torch_tensor
 16
[ CPUFloatType{1} ]

The gradient from the second call was added to the value already in x$grad. In a standard training loop each backward pass should operate on a fresh gradient, so $grad must be zeroed explicitly before each new pass.

x$grad$zero_()
torch_tensor
 0
[ CPUFloatType{1} ]
x$grad
torch_tensor
 0
[ CPUFloatType{1} ]

Without zeroing, running the same training step twice would double the gradient and corrupt the parameter update. This is why optimizer$zero_grad() is called at the start of every training step.
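
The effect of stale gradients on an update can be seen without torch. The sketch below runs plain gradient descent on \((w - 1)^2\) in base R, once resetting a stored gradient each step and once letting it accumulate; run_gd and its arguments are hypothetical names used only for this illustration.

```r
# Plain gradient descent on (w - 1)^2 in base R, with and without
# resetting the stored gradient. `run_gd` is a hypothetical helper.
run_gd <- function(zero_each_step, steps = 6, lr = 0.1) {
  w    <- 2
  grad <- 0                          # plays the role of w$grad
  for (step in seq_len(steps)) {
    if (zero_each_step) grad <- 0    # analogue of zeroing the gradient
    grad <- grad + 2 * (w - 1)       # a backward pass adds to the stored value
    w    <- w - lr * grad
  }
  w
}

w_zeroed      <- run_gd(zero_each_step = TRUE)
w_accumulated <- run_gd(zero_each_step = FALSE)
print(c(zeroed = w_zeroed, accumulated = w_accumulated))
```

With zeroing, w moves steadily toward the minimum at 1. With accumulation, each step uses the sum of all earlier gradients, so the steps are too large and w overshoots the minimum.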

Vector-valued leaf tensors

For a vector-valued leaf tensor, the gradient is computed element-wise. Each element of $grad is the derivative of the scalar output with respect to the corresponding element of the leaf.

x <- torch_tensor(c(1.0, 2.0, 3.0), requires_grad = TRUE)
y <- (x^2)$sum()
y$backward()
x$grad           # dy/dx_i = 2*x_i for each i
torch_tensor
 2
 4
 6
[ CPUFloatType{3} ]

Matrix-valued leaf tensors

The same logic extends to matrices. Each element of $grad is the derivative of the scalar output with respect to the corresponding element of the parameter matrix. This is the situation for weight matrices in a neural network.

W <- torch_randn(3, 2, requires_grad = TRUE)
xv <- torch_tensor(c(1.0, 2.0))
y <- (W$mv(xv))$sum()
y$backward()
W$grad           # shape (3, 2): each entry is dy/dW_ij
torch_tensor
 1  2
 1  2
 1  2
[ CPUFloatType{3,2} ]
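
The pattern in this gradient follows from \(y = \sum_i \sum_j W_{ij} x_j\), which gives \(\partial y / \partial W_{ij} = x_j\): every row of the gradient equals the input vector. The base-R sketch below checks this against a finite-difference estimate; the function f and the step size eps are illustrative.

```r
# Gradient of y = sum(W %*% x) with respect to W, in base R.
set.seed(1)
W <- matrix(rnorm(6), nrow = 3, ncol = 2)
x <- c(1, 2)

f <- function(W) sum(W %*% x)   # illustrative scalar function of W

# Analytic gradient: dy/dW_ij = x_j, so every row equals x.
grad_analytic <- matrix(x, nrow = 3, ncol = 2, byrow = TRUE)

# Finite-difference estimate, one entry at a time.
eps <- 1e-6
grad_numeric <- matrix(0, nrow = 3, ncol = 2)
for (i in 1:3) {
  for (j in 1:2) {
    W_plus <- W
    W_plus[i, j] <- W_plus[i, j] + eps
    grad_numeric[i, j] <- (f(W_plus) - f(W)) / eps
  }
}

print(grad_analytic)
print(round(grad_numeric, 4))
```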

A full forward pass

To see all of this in a realistic setting, consider a two-layer network with a three-dimensional input, a four-dimensional hidden layer with ReLU activation, and a two-dimensional output.

W1 <- torch_randn(4, 3, requires_grad = TRUE)
b1 <- torch_zeros(4, requires_grad = TRUE)
W2 <- torch_randn(2, 4, requires_grad = TRUE)
b2 <- torch_zeros(2, requires_grad = TRUE)

x_in <- torch_tensor(c(1.0, 0.5, -1.0))
h   <- nnf_relu(W1$mv(x_in) + b1)   # hidden layer: shape (4,)
out <- W2$mv(h) + b2                 # output: shape (2,)
loss <- out$sum()                    # reduce to scalar for backward
loss$backward()

list(W1$grad$shape, W2$grad$shape)
[[1]]
[1] 4 3

[[2]]
[1] 2 4

Each weight matrix receives a gradient of the same shape as itself, where every entry is the derivative of the scalar loss with respect to that weight.
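
To make explicit what the backward pass computes for this network, the chain rule can be written out by hand. The following is a base-R sketch of the same architecture with its own randomly drawn weights, so the numbers will not match a torch run; all variable names are illustrative.

```r
# Manual backward pass for out = W2 %*% relu(W1 %*% x + b1) + b2,
# loss = sum(out). Base R only; all names here are illustrative.
set.seed(1)
W1 <- matrix(rnorm(12), nrow = 4, ncol = 3)
b1 <- rep(0, 4)
W2 <- matrix(rnorm(8), nrow = 2, ncol = 4)
b2 <- rep(0, 2)
x  <- c(1.0, 0.5, -1.0)

# Forward pass
z    <- as.vector(W1 %*% x) + b1    # pre-activation, length 4
h    <- pmax(z, 0)                  # ReLU
out  <- as.vector(W2 %*% h) + b2    # length 2
loss <- sum(out)

# Backward pass (chain rule, from the loss back toward the input)
d_out <- rep(1, 2)                  # d loss / d out
d_W2  <- outer(d_out, h)            # 2 x 4
d_b2  <- d_out
d_h   <- as.vector(t(W2) %*% d_out) # length 4
d_z   <- d_h * (z > 0)              # ReLU passes gradient only where z > 0
d_W1  <- outer(d_z, x)              # 4 x 3
d_b1  <- d_z

print(dim(d_W1))
print(dim(d_W2))
```

Rows of d_W1 that correspond to inactive hidden units (z <= 0) are entirely zero, which is the ReLU derivative at work.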

Loss functions

A loss function measures how far the network’s predictions are from the true targets. Training minimises this loss over the data by adjusting the network’s parameters.

Mean squared error

Mean squared error (MSE) is the standard loss for regression. For \(n\) predictions \(\hat{y}_i\) and targets \(y_i\):

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

mse <- nn_mse_loss()
predictions <- torch_tensor(c(0.8, 0.2, 0.9))
targets     <- torch_tensor(c(1.0, 0.0, 1.0))
mse(predictions, targets)
torch_tensor
0.030000001192092896
[ CPUFloatType{} ]

Cross-entropy loss

Cross-entropy loss is the standard loss for classification. It arises from maximum likelihood estimation: minimising it is equivalent to maximising the log-likelihood of the observed labels under the model’s predicted probabilities.

For a \(K\)-class problem, the network produces a probability distribution \(\hat{p}_1, \ldots, \hat{p}_K\) via softmax. The observed label is one-hot: \(y_c = 1\) for the true class \(c\) and \(y_k = 0\) for all other \(k\). The cross-entropy reduces to:

\[\text{CE} = -\log \hat{p}_c\]

Substituting the softmax and expanding gives an equivalent expression in terms of the raw logits \(z_1, \ldots, z_K\):

\[\text{CE} = -z_c + \log \sum_{j=1}^{K} e^{z_j}\]

nn_cross_entropy_loss takes raw logits and applies the softmax internally, so no softmax layer is needed at the end of the network.

ce <- nn_cross_entropy_loss()
logits <- torch_tensor(matrix(c(2.0, 0.5, -1.0), nrow = 1))  # 1 sample, 3 classes
target <- torch_tensor(1L)                                      # true class is 1 (1-indexed)
ce(logits, target)
torch_tensor
0.24131132662296295
[ CPUFloatType{} ]

The higher the logit for the true class relative to the others, the lower the loss.
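
The equivalence of the two expressions for the cross-entropy can be checked numerically. This base-R sketch uses the same logits and target as the example above; the variable names are illustrative.

```r
# Cross-entropy for one sample, computed two ways in base R.
z      <- c(2.0, 0.5, -1.0)   # logits
c_true <- 1                   # index of the true class (1-indexed)

p <- exp(z) / sum(exp(z))                         # softmax probabilities
ce_from_probs  <- -log(p[c_true])                 # CE = -log p_c
ce_from_logits <- -z[c_true] + log(sum(exp(z)))   # CE = -z_c + log sum_j exp(z_j)

print(c(ce_from_probs, ce_from_logits))   # both approximately 0.2413
```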

Binary cross-entropy

A different situation arises when the output is not a single \(K\)-way categorical variable but \(K\) independent binary variables. Each output unit gets a sigmoid activation, giving \(K\) independent probabilities \(\hat{p}_k = \sigma(z_k) \in (0, 1)\). The binary cross-entropy for a single output is:

\[\text{BCE} = -[y_k \log \hat{p}_k + (1 - y_k) \log(1 - \hat{p}_k)]\]

nn_bce_with_logits_loss accepts raw logits, applies sigmoid internally, and by default averages the per-output losses.

bce <- nn_bce_with_logits_loss()
logits_b <- torch_tensor(c(2.0, -1.0, 0.5))    # 3 independent binary outputs
targets_b <- torch_tensor(c(1.0, 0.0, 1.0))    # true binary labels
bce(logits_b, targets_b)
torch_tensor
0.3047555983066559
[ CPUFloatType{} ]
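
The printed value can be reproduced by applying the formula to each output and averaging. This is a base-R sketch that assumes the default mean reduction; the helper sigmoid is defined for illustration.

```r
# Binary cross-entropy from logits, averaged over outputs, in base R.
logits  <- c(2.0, -1.0, 0.5)
targets <- c(1, 0, 1)

sigmoid <- function(z) 1 / (1 + exp(-z))   # illustrative helper
p <- sigmoid(logits)

bce_each <- -(targets * log(p) + (1 - targets) * log(1 - p))
bce_mean <- mean(bce_each)

print(bce_mean)   # approximately 0.3048
```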

Gradient descent

Once we have a loss, we want to adjust the parameters to reduce it. Gradient descent does this by moving each parameter a small step in the direction that decreases the loss. For a parameter \(\theta\) and a loss \(L\), the update rule is:

\[\theta \leftarrow \theta - \eta \, \nabla_\theta L\]

where \(\eta\) is the learning rate, a small positive scalar that controls the step size.

The manual loop below makes the mechanics explicit: compute the loss, call $backward() to get the gradient, then update the parameter.

w <- torch_tensor(2.0, requires_grad = TRUE)
lr <- 0.1

for (step in seq_len(6)) {
  loss <- (w - 1)^2         # minimum at w = 1
  loss$backward()
  with_no_grad({
    w$subtract_(lr * w$grad)  # update without tracking
  })
  w$grad$zero_()
  cat(sprintf("step %d: w=%.4f  loss=%.4f\n", step, w$item(), loss$item()))
}
step 1: w=1.8000  loss=1.0000
step 2: w=1.6400  loss=0.6400
step 3: w=1.5120  loss=0.4096
step 4: w=1.4096  loss=0.2621
step 5: w=1.3277  loss=0.1678
step 6: w=1.2621  loss=0.1074

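with_no_grad is needed around the parameter update so that torch does not record that operation in the computational graph; an in-place update of a leaf tensor that requires gradients is otherwise rejected with an error. Note that each printed loss is the value computed before that step's update, so it corresponds to the previous value of w.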
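
The printed values of w can also be derived by hand. For this particular loss, \(\nabla_w L = 2(w - 1)\), so substituting into the update rule gives a closed form (a worked derivation for this example only, writing \(w_t\) for the value after step \(t\) and \(w_0\) for the starting value):

\[w_{t+1} - 1 = (w_t - 1) - 2\eta\,(w_t - 1) = (1 - 2\eta)(w_t - 1) \quad\Longrightarrow\quad w_t = 1 + (1 - 2\eta)^t\,(w_0 - 1)\]

With \(\eta = 0.1\) and \(w_0 = 2\) this gives \(w_t = 1 + 0.8^t\), for example \(w_6 = 1 + 0.8^6 \approx 1.2621\), matching the last line of the output.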

In practice, gradient descent is run on small random subsets of the data called mini-batches rather than the full dataset. Each pass through the full training set is one epoch.
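
The mini-batch and epoch structure can be sketched in base R for a one-parameter linear model. The simulated data, batch size, learning rate, and number of epochs below are all illustrative assumptions.

```r
# Mini-batch SGD for a one-parameter linear model y ~ w * x, base R only.
# The data, batch size, and learning rate are illustrative assumptions.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 3 * x + rnorm(n, sd = 0.1)   # true slope is 3

w          <- 0
lr         <- 0.1
batch_size <- 20
n_epochs   <- 20

for (epoch in seq_len(n_epochs)) {
  idx     <- sample(n)                                     # reshuffle each epoch
  batches <- split(idx, ceiling(seq_along(idx) / batch_size))
  for (b in batches) {
    grad <- mean(2 * (w * x[b] - y[b]) * x[b])             # d MSE / d w on this batch
    w    <- w - lr * grad
  }
}

print(w)
```

Each inner iteration is one gradient step on 20 observations, and each outer iteration is one epoch of 10 such steps. The estimate should end close to the true slope of 3.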

Optimizers

torch optimizers handle the parameter update and gradient zeroing, replacing the manual loop above. The pattern is always the same four lines:

optimizer$zero_grad()          # clear accumulated gradients
loss <- criterion(model(X), y) # forward pass and loss
loss$backward()                # compute gradients
optimizer$step()               # update parameters

The simplest optimizer is SGD (stochastic gradient descent), which implements the update rule directly.

w <- torch_tensor(2.0, requires_grad = TRUE)
optimizer <- optim_sgd(list(w), lr = 0.1)

for (step in seq_len(6)) {
  optimizer$zero_grad()
  loss <- (w - 1)^2
  loss$backward()
  optimizer$step()
  cat(sprintf("step %d: w=%.4f  loss=%.4f\n", step, w$item(), loss$item()))
}
step 1: w=1.8000  loss=1.0000
step 2: w=1.6400  loss=0.6400
step 3: w=1.5120  loss=0.4096
step 4: w=1.4096  loss=0.2621
step 5: w=1.3277  loss=0.1678
step 6: w=1.2621  loss=0.1074

Adam

Adam (Adaptive Moment Estimation) is a common default choice in deep learning. Plain SGD uses the same learning rate for every parameter. Adam instead maintains, for each parameter separately, exponentially weighted running averages of the gradient and of the squared gradient, and uses these to scale that parameter's update.

Parameters whose gradients have been consistently large get a smaller effective learning rate. Parameters whose gradients have been small or inconsistent get a larger effective learning rate. The result is that Adam tends to converge faster than SGD, requires less learning rate tuning, and handles the very different scales of gradients that occur across the layers of a deep network.
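
The per-parameter scaling described above can be written out by hand for a single scalar parameter. This base-R sketch assumes the standard Adam formulation with bias correction and the commonly used defaults beta1 = 0.9, beta2 = 0.999, and eps = 1e-8; it minimises \((w - 1)^2\) starting from \(w = 2\).

```r
# Adam update written out by hand for one scalar parameter, base R only.
# beta1, beta2, and eps are the commonly used default values (assumed here).
lr    <- 0.3
beta1 <- 0.9
beta2 <- 0.999
eps   <- 1e-8

w <- 2   # initial parameter
m <- 0   # running average of gradients
v <- 0   # running average of squared gradients

for (t in seq_len(10)) {
  g <- 2 * (w - 1)                       # gradient of (w - 1)^2
  m <- beta1 * m + (1 - beta1) * g
  v <- beta2 * v + (1 - beta2) * g^2
  m_hat <- m / (1 - beta1^t)             # bias correction for the zero start
  v_hat <- v / (1 - beta2^t)
  w <- w - lr * m_hat / (sqrt(v_hat) + eps)
}

print(w)   # approximately 0.6447
```

Because the update divides by the square root of the averaged squared gradient, the first step has magnitude close to lr regardless of how large the gradient is.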

w <- torch_tensor(2.0, requires_grad = TRUE)
optimizer <- optim_adam(list(w), lr = 0.3)

for (step in seq_len(10)) {
  optimizer$zero_grad()
  loss <- (w - 1)^2
  loss$backward()
  optimizer$step()
}

w$item()
[1] 0.6446719

After ten steps w has overshot the minimum at w = 1. With lr = 0.3 the early Adam steps have magnitude close to 0.3 each, and the running average of past gradients carries w past the minimum before it turns back. A smaller learning rate or more steps brings it closer to 1.

Train, validation, and test splits

A model with enough capacity, trained long enough, can memorise the training data rather than learn patterns that generalise. To detect this, we hold out data the model never sees during training.

The standard split is three-way. The training set is what the model is trained on. The validation set is used to monitor the model during training and to compare settings such as the learning rate or the number of epochs. The test set is held back entirely and evaluated only once, at the end.

With dataset_subset, a dataset can be divided into any proportions by passing index vectors.

X_data <- torch_randn(1000, 10)
y_data <- torch_randint(0L, 2L, 1000L)
dataset <- tensor_dataset(X_data, y_data)

idx       <- sample(length(dataset))   # random permutation of the row indices
train_set <- dataset_subset(dataset, idx[1:700])
val_set   <- dataset_subset(dataset, idx[701:850])
test_set  <- dataset_subset(dataset, idx[851:1000])

c(length(train_set), length(val_set), length(test_set))
[1] 700 150 150

Regularisation

Regularisation refers to techniques that reduce overfitting by discouraging the model from becoming too complex.

Dropout

Dropout randomly deactivates each neuron with probability \(p\) during each forward pass of training. The surviving activations are scaled by \(1/(1-p)\) so that the expected total activation remains unchanged. During evaluation, dropout is disabled entirely.

dropout <- nn_dropout(p = 0.5)

x <- torch_ones(10)
dropout(x)        # training mode: each unit kept with probability 0.5, survivors scaled by 2
torch_tensor
 0
 0
 0
 0
 0
 0
 2
 0
 2
 0
[ CPUFloatType{10} ]
dropout$eval()
dropout(x)        # evaluation mode: all units active, no scaling
torch_tensor
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
[ CPUFloatType{10} ]
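
The training-mode behaviour can be sketched in base R to show why the rescaling keeps the expected activation unchanged. dropout_train is a hypothetical helper written for this illustration, not a torch function.

```r
# Inverted dropout in base R. `dropout_train` is a hypothetical helper.
dropout_train <- function(x, p) {
  keep <- rbinom(length(x), size = 1, prob = 1 - p)  # 1 = keep, 0 = drop
  x * keep / (1 - p)                                 # rescale the survivors
}

set.seed(123)
x   <- rep(1, 1e5)
out <- dropout_train(x, p = 0.5)

print(table(out))   # only the values 0 and 2 occur
print(mean(out))    # close to 1, the mean of the input
```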

Weight decay

Weight decay penalises large parameter values by adding a term to the loss proportional to the sum of squared parameters:

\[L_{\text{reg}} = L + \lambda \sum_\theta \theta^2\]

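where \(\lambda\) is the weight decay coefficient. In torch it is passed directly to the optimizer as weight_decay. The optimizer adds weight_decay times \(\theta\) to each parameter's gradient, which is the gradient of \(\tfrac{1}{2}\,\text{weight\_decay} \sum_\theta \theta^2\), so \(\lambda\) in the formula above corresponds to half the value passed as weight_decay.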

# model is assumed to be an nn_module defined elsewhere
optimizer <- optim_adam(model$parameters, lr = 1e-3, weight_decay = 1e-4)
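
The effect on a single update is easiest to see for plain SGD. The base-R sketch below assumes the optimizer implements weight decay by adding weight_decay * theta to the gradient before the step; the parameter values and gradient are made up for illustration.

```r
# One SGD step with weight decay, base R only.
# Assumes the optimizer adds weight_decay * theta to the gradient.
theta        <- c(0.5, -2.0, 3.0)
grad_loss    <- c(0.1, 0.0, -0.2)   # illustrative gradient of the unpenalised loss
lr           <- 0.1
weight_decay <- 0.01

grad_total <- grad_loss + weight_decay * theta
theta_new  <- theta - lr * grad_total

# Equivalent view: shrink each parameter, then take the usual step.
theta_alt <- (1 - lr * weight_decay) * theta - lr * grad_loss

print(theta_new)
print(theta_alt)
```

The second parameter has zero loss gradient, yet it still moves slightly toward zero. That shrinkage is the "decay" in weight decay.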