x <- torch_tensor(3.0, requires_grad = TRUE)
y <- x^2
y$backward()
x$grad
torch_tensor
6
[ CPUFloatType{1} ]
Mark Andrews
We cover autograd, torch’s automatic differentiation engine, showing how torch builds a computational graph and computes gradients by backpropagation. We then cover loss functions, the mechanics of gradient descent, and how torch optimizers implement and extend it. The session closes with train/validation/test splitting and the two most common regularisation techniques: dropout and weight decay.
Autograd is torch’s automatic differentiation engine. When a tensor is created with requires_grad = TRUE, torch builds a computational graph as operations are applied to it. Calling $backward() on a scalar traverses that graph in reverse and computes derivatives via the chain rule.
The computed derivative is stored in the $grad attribute of each leaf tensor. A leaf tensor is one created directly by the user rather than computed from other tensors.
In the simplest case, \(y = x^2\) where \(x = 3\). Calling y$backward() computes \(dy/dx = 2x = 6\) and stores it in x$grad.
torch_tensor
6
[ CPUFloatType{1} ]
The reason the gradient appears on x rather than on y is that x is the leaf tensor. y is an intermediate result computed from x, and torch does not store gradients on intermediate tensors by default.
This generalises to any chain of operations. Suppose \(y = x^2\), \(z = 3y\), \(h = z + 1\), so that \(h = 3x^2 + 1\). Calling h$backward() computes \(dh/dx\) by the chain rule through all intermediate steps.
torch_tensor
12
[ CPUFloatType{1} ]
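The printed gradient of 12 is what the chain rule gives when \(x = 2\), since \(dh/dx = 6x\). The starting value is not stated in the text, so \(x = 2\) is an assumption in this minimal sketch:

```r
library(torch)

# Assumption: x = 2, the value consistent with the printed gradient of 12.
x <- torch_tensor(2.0, requires_grad = TRUE)

y <- x^2      # y = x^2
z <- 3 * y    # z = 3y
h <- z + 1    # h = 3x^2 + 1

h$backward()  # dh/dx = 6x, accumulated through y and z
x$grad
```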
Activation functions are differentiable operations (almost everywhere, in the case of ReLU), so autograd handles them transparently.
torch_tensor
12
[ CPUFloatType{1} ]
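One way to obtain this value is to wrap the previous chain in a ReLU. The sketch below assumes \(x = 2\) and the expression \(3x^2 + 1\) (neither is confirmed by the text), and adds a second, hypothetical tensor `x_neg` to show the inactive region:

```r
library(torch)

# Assumption: same chain as before with x = 2, passed through ReLU.
x <- torch_tensor(2.0, requires_grad = TRUE)
h <- nnf_relu(3 * x^2 + 1)   # pre-activation is 13 > 0, so ReLU is the identity here
h$backward()
x$grad                       # same gradient as without ReLU: 6x

# Hypothetical negative input: ReLU outputs 0 and blocks the gradient.
x_neg <- torch_tensor(-2.0, requires_grad = TRUE)
h_neg <- nnf_relu(x_neg)
h_neg$backward()
x_neg$grad
```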
The ReLU derivative is 1 where the input is positive and 0 where it is negative, so the gradient passes through unchanged when the input lies in the active region and is blocked otherwise.
torch_tensor
0.4200
[ CPUFloatType{1} ]
torch only populates $grad on leaf tensors. If you need the gradient with respect to an intermediate tensor, call $retain_grad() on it before the backward pass.
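A minimal sketch of `$retain_grad()`, using illustrative values that are not from the original text (\(x = 2\), \(y = x^2\), \(z = 3y\)):

```r
library(torch)

x <- torch_tensor(2.0, requires_grad = TRUE)
y <- x^2
y$retain_grad()   # ask torch to keep the gradient on this intermediate tensor
z <- 3 * y
z$backward()

y$grad            # dz/dy = 3
x$grad            # dz/dx = 6x = 12
```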
After $backward() is called, torch frees the computational graph by default to reclaim memory. Passing retain_graph = TRUE prevents the graph from being freed, allowing multiple backward passes.
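As a sketch of `retain_graph = TRUE`, the example below runs two backward passes over the same graph. The values are illustrative; `first` and `second` are hypothetical names used only to record the gradient after each pass.

```r
library(torch)

x <- torch_tensor(2.0, requires_grad = TRUE)
y <- x^2

y$backward(retain_graph = TRUE)   # keep the graph alive for another pass
first <- as.numeric(x$grad)       # dy/dx = 2x = 4

y$backward()                      # second pass over the same graph
second <- as.numeric(x$grad)      # 8: the second pass adds another 4 to x$grad
```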
torch adds newly computed gradients to $grad rather than replacing it. Calling $backward() more than once will accumulate the results.
torch_tensor
4
[ CPUFloatType{1} ]
torch_tensor
16
[ CPUFloatType{1} ]
The gradient from the second call was added to the value already in x$grad. In a standard training loop each backward pass should operate on a fresh gradient, so $grad must be zeroed explicitly before each new pass.
Without zeroing, running the same training step twice would double the gradient and corrupt the parameter update. This is why optimizer$zero_grad() is the first line of every training loop.
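The values 4 and 16 shown above are consistent with two backward passes at \(x = 2\): first through \(x^2\) (gradient 4), then through \(3x^2 + 1\) (gradient 12, accumulated to 16). That pairing is an assumption. The sketch below reproduces it and then zeroes the gradient by hand with `x$grad$zero_()`:

```r
library(torch)

x <- torch_tensor(2.0, requires_grad = TRUE)

y <- x^2
y$backward()
g1 <- as.numeric(x$grad)   # 4

h <- 3 * x^2 + 1
h$backward()
g2 <- as.numeric(x$grad)   # 4 + 12 = 16, accumulated

x$grad$zero_()             # reset in place before the next pass
h <- 3 * x^2 + 1
h$backward()
g3 <- as.numeric(x$grad)   # 12, a fresh gradient
```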
For a vector-valued leaf tensor, the gradient is computed element-wise. Each element of $grad is the derivative of the scalar output with respect to the corresponding element of the leaf.
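For instance, with the illustrative scalar output \(s = \sum_i x_i^2\), each element of the gradient is \(2x_i\). A minimal sketch:

```r
library(torch)

x <- torch_tensor(c(1, 2, 3), requires_grad = TRUE)
s <- (x^2)$sum()   # scalar output
s$backward()
x$grad             # element-wise derivative 2 * x_i: 2, 4, 6
```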
The same logic extends to matrices. Each element of $grad is the derivative of the scalar output with respect to the corresponding element of the parameter matrix. This is the situation for weight matrices in a neural network.
To see all of this in a realistic setting, consider a two-layer network with a three-dimensional input, a four-dimensional hidden layer with ReLU activation, and a two-dimensional output.
[[1]]
[1] 4 3
[[2]]
[1] 2 4
Each weight matrix receives a gradient of the same shape as itself, where every entry is the derivative of the scalar loss with respect to that weight.
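A sketch of such a network built from raw tensors. The random inputs, the omission of bias terms, and the use of the summed output as a stand-in scalar loss are all assumptions made for illustration; only the layer sizes come from the text.

```r
library(torch)

torch_manual_seed(1)

x  <- torch_randn(3)                           # three-dimensional input
W1 <- torch_randn(4, 3, requires_grad = TRUE)  # input -> hidden
W2 <- torch_randn(2, 4, requires_grad = TRUE)  # hidden -> output

h    <- nnf_relu(torch_matmul(W1, x))          # four-dimensional hidden layer
out  <- torch_matmul(W2, h)                    # two-dimensional output
loss <- out$sum()                              # assumed scalar loss

loss$backward()

# Each gradient has the same shape as its weight matrix.
shapes <- lapply(list(W1$grad, W2$grad), dim)
shapes
```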
A loss function measures how far the network’s predictions are from the true targets. Training minimises this loss over the data by adjusting the network’s parameters.
Mean squared error (MSE) is the standard loss for regression. For \(n\) predictions \(\hat{y}_i\) and targets \(y_i\):
\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
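As a quick check of the formula, the sketch below compares a hand computation with `nnf_mse_loss` on three made-up predictions and targets:

```r
library(torch)

y_hat <- torch_tensor(c(2.5, 0.0, 2.0))    # illustrative predictions
y     <- torch_tensor(c(3.0, -0.5, 2.0))   # illustrative targets

# Hand computation: squared errors 0.25, 0.25, 0, averaged over n = 3.
manual  <- mean((as.numeric(y) - as.numeric(y_hat))^2)
builtin <- nnf_mse_loss(y_hat, y)$item()

c(manual = manual, builtin = builtin)
```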
Cross-entropy loss is the standard loss for classification. It arises from maximum likelihood estimation: minimising it is equivalent to maximising the log-likelihood of the observed labels under the model’s predicted probabilities.
For a \(K\)-class problem, the network produces a probability distribution \(\hat{p}_1, \ldots, \hat{p}_K\) via softmax. The observed label is one-hot: \(y_c = 1\) for the true class \(c\) and \(y_k = 0\) for all other \(k\). The cross-entropy reduces to:
\[\text{CE} = -\log \hat{p}_c\]
Substituting the softmax and expanding gives an equivalent expression in terms of the raw logits \(z_1, \ldots, z_K\):
\[\text{CE} = -z_c + \log \sum_{j=1}^{K} e^{z_j}\]
nn_cross_entropy_loss takes raw logits and applies the softmax internally, so no softmax layer is needed at the end of the network.
torch_tensor
0.24131132662296295
[ CPUFloatType{} ]
The higher the logit for the true class relative to the others, the lower the loss.
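The loss value shown above is consistent with logits \((2, 0.5, -1)\) and a true class of 1. Those inputs are an assumption, since the original code is not shown. The sketch below compares `nn_cross_entropy_loss` with the logit formula; note that class indices are 1-based in torch for R.

```r
library(torch)

# Assumed inputs: one observation, three classes, true class = 1.
z      <- c(2.0, 0.5, -1.0)
logits <- torch_tensor(matrix(z, nrow = 1))        # shape 1 x 3
target <- torch_tensor(1L, dtype = torch_long())   # 1-based class index

loss_fn <- nn_cross_entropy_loss()
builtin <- loss_fn(logits, target)$item()

# CE = -z_c + log(sum_j exp(z_j)), computed by hand
manual <- -z[1] + log(sum(exp(z)))

c(builtin = builtin, manual = manual)   # both approximately 0.2413
```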
A different situation arises when the output is not a single \(K\)-way categorical variable but \(K\) independent binary variables. Each output unit gets a sigmoid activation, giving \(K\) independent probabilities \(\hat{p}_k = \sigma(z_k) \in (0, 1)\). The binary cross-entropy for a single output is:
\[\text{BCE} = -[y_k \log \hat{p}_k + (1 - y_k) \log(1 - \hat{p}_k)]\]
nn_bce_with_logits_loss accepts raw logits and applies sigmoid internally.
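A minimal sketch with three made-up logits and binary targets, comparing `nn_bce_with_logits_loss` (which averages over outputs by default) with the formula applied by hand:

```r
library(torch)

z <- c(1.5, -0.5, 0.0)   # illustrative raw logits
y <- c(1, 0, 1)          # illustrative binary targets

loss_fn <- nn_bce_with_logits_loss()
builtin <- loss_fn(torch_tensor(z), torch_tensor(y))$item()

# Hand computation: sigmoid, then BCE per output, then the mean.
p      <- 1 / (1 + exp(-z))
manual <- mean(-(y * log(p) + (1 - y) * log(1 - p)))

c(builtin = builtin, manual = manual)
```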
Once we have a loss, we want to adjust the parameters to reduce it. Gradient descent does this by moving each parameter a small step in the direction that decreases the loss. For a parameter \(\theta\) and a loss \(L\), the update rule is:
\[\theta \leftarrow \theta - \eta \, \nabla_\theta L\]
where \(\eta\) is the learning rate, a small positive scalar that controls the step size.
The manual loop below makes the mechanics explicit: compute the loss, call $backward() to get the gradient, then update the parameter.
step 1: w=1.8000 loss=1.0000
step 2: w=1.6400 loss=0.6400
step 3: w=1.5120 loss=0.4096
step 4: w=1.4096 loss=0.2621
step 5: w=1.3277 loss=0.1678
step 6: w=1.2621 loss=0.1074
with_no_grad is needed around the parameter update so that torch does not add that operation to the computational graph.
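The trace above is consistent with the loss \(L(w) = (w - 1)^2\), a starting value \(w = 2\), and a learning rate of 0.1. These settings are inferred from the printed numbers rather than stated in the text, so treat the loop below as a sketch under those assumptions:

```r
library(torch)

w  <- torch_tensor(2.0, requires_grad = TRUE)   # assumed starting value
lr <- 0.1                                       # assumed learning rate

for (step in 1:6) {
  loss <- (w - 1)^2          # assumed loss, minimised at w = 1
  loss$backward()            # gradient 2 * (w - 1) lands in w$grad

  with_no_grad({
    w$sub_(lr * w$grad)      # in-place update, kept out of the graph
    w$grad$zero_()           # reset so the next pass starts fresh
  })

  cat(sprintf("step %d: w=%.4f loss=%.4f\n", step, w$item(), loss$item()))
}
```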
In practice, gradient descent is run on small random subsets of the data called mini-batches rather than the full dataset. Each pass through the full training set is one epoch.
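torch optimizers handle the parameter update and gradient zeroing, replacing the manual loop above. The pattern is always the same four steps: zero the gradients with optimizer$zero_grad(), compute the loss, call loss$backward(), and call optimizer$step().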
The simplest optimizer is SGD (stochastic gradient descent), which implements the update rule directly.
step 1: w=1.8000 loss=1.0000
step 2: w=1.6400 loss=0.6400
step 3: w=1.5120 loss=0.4096
step 4: w=1.4096 loss=0.2621
step 5: w=1.3277 loss=0.1678
step 6: w=1.2621 loss=0.1074
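The same trace can be produced with `optim_sgd`. As before, the loss \((w - 1)^2\), the starting value 2, and the learning rate 0.1 are assumptions inferred from the printed numbers:

```r
library(torch)

w         <- torch_tensor(2.0, requires_grad = TRUE)   # assumed starting value
optimizer <- optim_sgd(list(w), lr = 0.1)              # assumed learning rate

for (step in 1:6) {
  optimizer$zero_grad()      # clear gradients from the previous step
  loss <- (w - 1)^2          # forward pass: compute the loss
  loss$backward()            # backward pass: fill w$grad
  optimizer$step()           # apply w <- w - lr * grad

  cat(sprintf("step %d: w=%.4f loss=%.4f\n", step, w$item(), loss$item()))
}
```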
Adam (Adaptive Moment Estimation) is the default choice for most deep learning. Plain SGD uses the same learning rate for every parameter throughout training. Adam instead maintains a running estimate of the mean gradient and the mean squared gradient for each parameter separately, and uses these to scale the update for each parameter individually.
Parameters whose gradients have been consistently large get a smaller effective learning rate. Parameters whose gradients have been consistently small get a larger one. Where gradients are inconsistent and keep changing sign, the running mean partly cancels and the step shrinks. In practice Adam often converges faster than SGD, needs less learning rate tuning, and copes with the very different gradient scales that occur across the layers of a deep network.
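The per-parameter scaling is easiest to see on the very first step, where the update is roughly the learning rate regardless of gradient size. The sketch below uses two hypothetical parameters, `a` and `b`, whose gradients differ by a factor of a million; later steps depend on the running averages and will not be this uniform.

```r
library(torch)

a <- torch_tensor(1.0, requires_grad = TRUE)
b <- torch_tensor(1.0, requires_grad = TRUE)
optimizer <- optim_adam(list(a, b), lr = 0.1)

optimizer$zero_grad()
loss <- 1000 * a + 0.001 * b   # gradients: 1000 for a, 0.001 for b
loss$backward()
optimizer$step()

# Both parameters move by about lr = 0.1 on the first step.
# Plain SGD with the same lr would have moved them by 100 and 0.0001.
c(a = a$item(), b = b$item())
```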
A model trained long enough will memorise the training data rather than learn patterns that generalise. To detect this, we hold out data the model never sees during training.
The standard split is three-way. The training set is what the model is trained on. The validation set is used to monitor the model during training. The test set is held back entirely and evaluated only once, at the end.
With dataset_subset, a dataset can be divided into any proportions by passing index vectors.
X_data <- torch_randn(1000, 10)
y_data <- torch_randint(0L, 2L, 1000L)
dataset <- tensor_dataset(X_data, y_data)
idx <- sample(length(dataset))
train_set <- dataset_subset(dataset, idx[1:700])
val_set <- dataset_subset(dataset, idx[701:850])
test_set <- dataset_subset(dataset, idx[851:1000])
c(length(train_set), length(val_set), length(test_set))
[1] 700 150 150
Regularisation refers to techniques that reduce overfitting by discouraging the model from becoming too complex.
Dropout randomly deactivates each neuron with probability \(p\) during each forward pass of training. The surviving activations are scaled by \(1/(1-p)\) so that the expected total activation remains unchanged. During evaluation, dropout is disabled entirely.
torch_tensor
0
0
0
0
0
0
2
0
2
0
[ CPUFloatType{10} ]
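Output of this form, containing only 0 and 2, is what `nn_dropout` with \(p = 0.5\) produces on a vector of ones, since survivors are scaled by \(1/(1 - 0.5) = 2\). The input and the value of \(p\) are assumptions here, and the seed is arbitrary, so the exact pattern of zeros will differ:

```r
library(torch)

torch_manual_seed(42)        # arbitrary seed

drop <- nn_dropout(p = 0.5)  # assumed dropout probability
x    <- torch_ones(10)       # assumed input

drop$train()                 # training mode: units are dropped at random
out_train <- drop(x)         # each entry is either 0 or 1 / (1 - p) = 2

drop$eval()                  # evaluation mode: dropout is disabled
out_eval <- drop(x)          # input passes through unchanged
```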
Weight decay penalises large parameter values by adding a term to the loss proportional to the sum of squared parameters:
\[L_{\text{reg}} = L + \lambda \sum_\theta \theta^2\]
where \(\lambda\) is the weight decay coefficient. In torch it is passed directly to the optimizer.
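A sketch of passing `weight_decay` to `optim_sgd`. It assumes the optimizer adds `weight_decay` times the parameter to the gradient, as PyTorch's SGD does; under that convention `weight_decay` plays the role of \(2\lambda\) in the formula above. The loss is deliberately constant in `w` so that any movement comes from the decay term alone.

```r
library(torch)

w <- torch_tensor(1.0, requires_grad = TRUE)
optimizer <- optim_sgd(list(w), lr = 0.1, weight_decay = 0.5)

optimizer$zero_grad()
loss <- (0 * w)$sum()   # gradient with respect to w is exactly 0
loss$backward()
optimizer$step()

# Assumed update: w <- w - lr * (grad + weight_decay * w) = 1 - 0.1 * 0.5 = 0.95
w$item()
```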