We build and train a multilayer perceptron using torch’s nn_module. The running example is MNIST handwritten digit classification. We first implement the training loop explicitly to see all the moving parts, then replace it with luz, which provides a high-level training interface and reduces the boilerplate considerably.
Defining a network with nn_module
The standard way to define a network in torch for R is to use nn_module. The function takes two key components: initialize, which declares the learnable components of the network, and forward, which describes the computation.
nn_linear(in_features, out_features) creates a single fully-connected layer. Internally it holds two learnable tensors: a weight matrix \(W\) of shape \((\text{out}, \text{in})\) and a bias vector \(b\) of shape \((\text{out},)\). When called on an input \(x\) of shape \((\text{batch}, \text{in})\), it computes:
\[y = x W^T + b\]
giving an output of shape \((\text{batch}, \text{out})\).
Chaining layers
In initialize we are only declaring the three layers. The connection between them is specified in forward. The dimensions must be consistent: the output size of fc1 is 128, so the input size of fc2 must also be 128, and so on. The network produces 10 outputs, one logit per digit class.
forward
forward defines the computation. nnf_relu is applied after fc1 and fc2 to introduce non-linearity. No activation is applied after fc3 because the cross-entropy loss expects raw logits.
Model parameters
Every nn_linear layer registers its weight matrix and bias vector as parameters. We can count them.
model$parameters returns a list of all learnable tensors in the network. This is what we pass to the optimizer so that it knows what to update.
nn_sequential
For straightforward feedforward networks, nn_sequential avoids writing a full nn_module. It takes a sequence of modules and calls them in order.
model <-nn_sequential(nn_flatten(),nn_linear(784, 128),nn_relu(),nn_linear(128, 64),nn_relu(),nn_linear(64, 10))
nn_relu() is a module, which is what nn_sequential expects. nn_flatten() collapses all dimensions except the batch dimension into a single vector.
nn_sequential is appropriate when data flows straight through from one layer to the next. For anything more complex — skip connections, branching paths, multiple inputs or outputs — you need a full nn_module.
Loading MNIST
torchvision provides MNIST and other standard datasets. The transform argument applies a preprocessing function to each sample as it is loaded. transform_to_tensor converts each image to a float tensor with values in \([0.0, 1.0]\).
Split "train" of dataset <mnist> (~12 MB) will be downloaded and processed if
not already available.
<mnist> dataset loaded with 60000 images across 10 classes.
Split "test" of dataset <mnist> (~12 MB) will be downloaded and processed if
not already available.
<mnist> dataset loaded with 10000 images across 10 classes.
The image shape is [1, 28, 28]: one channel (greyscale), 28 rows, 28 columns. nn_linear expects a flat vector, not a 3D tensor, which is why nn_flatten is needed before the first linear layer.
dataloader
dataloader wraps a dataset and serves it in mini-batches, handling shuffling and parallel data loading.
Each iteration yields a list of (images, labels) where images has shape [batch_size, 28, 28]. The channel dimension (size 1 for greyscale) is omitted by the default batch collation in torch for R. The nn_flatten() layer at the start of the network handles this correctly, flattening [B, 28, 28] to [B, 784].
b <-dataloader_make_iter(train_loader)$.next()b[[1]]$shape
[1] 64 1 28 28
b[[2]]$shape
[1] 64
Training loop
We are now ready to train. The explicit training loop below is the standard torch pattern. It is deliberately verbose — the goal is to make every step visible. Later in this session we will use luz to replace most of this boilerplate.
First, define the model, the loss function, and the optimizer. The model includes nn_flatten() as its first layer so that the [batch, 1, 28, 28] images coming from the dataloader are flattened to [batch, 784] automatically.
model <-nn_sequential(nn_flatten(),nn_linear(784, 128),nn_relu(),nn_linear(128, 10))criterion <-nn_cross_entropy_loss()optimizer <-optim_adam(model$parameters, lr =1e-3)
The training loop runs for a fixed number of epochs. Within each epoch it iterates over every mini-batch: forward pass, loss, backward pass, parameter update.
This is the irreducible core of neural network training in torch. The four lines inside the inner loop — zero_grad, forward pass, backward, step — are always the same regardless of model architecture, dataset, or task.
Evaluation
After training, switch the model to evaluation mode before measuring accuracy. This disables dropout and any other training-specific behaviour. with_no_grad suppresses gradient tracking during inference, saving memory and time.
argmax(dim = 2) picks the class with the highest logit for each sample in the batch. In torch for R, dimensions are 1-indexed, so dim = 2 refers to the class dimension of the [batch, classes] output tensor.
Plotting the loss
plot(seq_along(losses), losses, type ="o", pch =16,xlab ="Epoch", ylab ="Loss", main ="Training loss")
luz
The explicit loop above works, but it requires writing the same boilerplate every time. luz wraps torch models in a high-level training interface, replacing the manual epoch loop with a single call to fit and printing a formatted training table automatically.
To use luz we define the network architecture as an nn_module and pass the class (not an instance) to setup. luz handles the training loop, evaluation, and metric computation internally.
The data must be provided via dataloaders. We prepare flat tensors from the raw MNIST arrays. train_data$data holds the images as a plain R array; torch_tensor() converts it to a tensor before further operations.
A `luz_module_evaluation`
── Results ─────────────────────────────────────────────────────────────────────
loss: 0.0896
acc: 0.9723
Because luz provides a consistent interface, switching architectures, loss functions, or optimizers requires only changing the relevant arguments to setup.