Introduction

Deep Learning and Artificial Neural Networks

Mark Andrews

Deep learning vs artificial neural networks

  • Artificial neural networks (ANNs) are the general family of models built from interconnected computational neurons that map inputs to outputs.
  • Deep learning refers to training ANNs with multiple stacked layers and the practical ecosystem — large data, specialised architectures (CNNs, RNNs, transformers), optimisation techniques, and hardware — that enables hierarchical representation learning.
  • In short, deep learning is a subset of the ANN family: it denotes deep, large-scale ANNs plus the methods and infrastructure that make them effective.

Single-layer era: perceptrons

  • In 1958 Frank Rosenblatt introduced the perceptron as a simple neuron model that could learn linear decision boundaries from labelled examples.
  • Early experiments on image recognition created excitement because the perceptron demonstrated that machines could adapt their internal parameters from data.
  • Researchers soon recognised that single-layer perceptrons could not represent functions such as exclusive-or (XOR), which hinted at the need for multilayer architectures.

The “dark ages” after Minsky and Papert

  • In 1969 Marvin Minsky and Seymour Papert published a rigorous critique showing fundamental limits of perceptrons, which discouraged funding and interest.
  • Through the 1970s and early 1980s, symbolic approaches dominated artificial intelligence while connectionist methods received limited attention.
  • Small pockets of research persisted, but progress slowed because training deeper networks remained computationally and algorithmically difficult.

Backpropagation and the return of multilayer learning

  • In 1986 David Rumelhart, Geoffrey Hinton, and Ronald Williams popularised error backpropagation, which made it practical to train multilayer feedforward networks.
  • Yann LeCun and colleagues demonstrated convolutional neural networks trained by gradient descent for handwritten digit recognition, linking neural networks to computer vision.
  • Despite these advances, compute, data, and hardware constraints limited performance on large real-world problems for another two decades.

Deep learning renaissance

  • Around 2006 Geoffrey Hinton, Ruslan Salakhutdinov, Yoshua Bengio, and others showed that layer-wise pretraining could initialise deep networks effectively.
  • In 2012 Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the ImageNet Large Scale Visual Recognition Challenge with AlexNet, powered by GPUs and ReLU activations.
  • Rapid progress followed across speech, vision, and language as larger datasets, better regularisation, and improved architectures unlocked new capabilities.

Attention and the Transformer

  • In 2017 Ashish Vaswani and colleagues introduced the Transformer architecture in the paper “Attention Is All You Need”.
  • Transformers replaced recurrent computation with self-attention, enabling efficient parallel training and superior performance on sequence modelling tasks.
  • Subsequent models such as BERT, GPT, and ViT scaled data and parameters, establishing Transformers as a general-purpose foundation across modalities.

The current era

  • Large-scale pretraining, transfer learning, and instruction tuning have produced versatile models that support research, industry, and education at unprecedented scale.
  • Advances in optimisation, tooling, and hardware have turned neural networks into an engineering discipline with reproducible pipelines and strong empirical baselines.
  • Ongoing debates about safety, evaluation, and societal impact now accompany technical progress, as researchers balance innovation with responsible deployment.

How learning works

  • The network starts with random weights and biases.
  • It computes outputs for a batch of inputs and measures error with a loss function that compares outputs to targets.
  • Gradients indicate how each weight should change to reduce the loss, and the parameters are nudged in that direction.
  • This cycle repeats over many passes until performance on held-out data stops improving.
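The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real network: a single linear model y = w·x + b with a squared-error loss, hand-derived gradients, and one parameter update. All names and numbers (w, b, lr, the toy data) are assumptions chosen for the example.

```python
# Toy batch: inputs and targets drawn from y = 3x (unknown to the model).
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

w, b = 0.5, 0.0   # starting parameters (stand-ins for random initialisation)
lr = 0.05         # learning rate: how far to nudge the parameters per step

def loss(w, b):
    # Mean squared error between model outputs and targets.
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Gradients of the loss with respect to w and b (derived by hand for this model).
grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)

before = loss(w, b)
w -= lr * grad_w          # nudge each parameter against its gradient
b -= lr * grad_b
after = loss(w, b)
print(before, after)      # the loss drops after a single update
```

In a real network, the gradients are produced automatically by backpropagation rather than derived by hand, but the nudge itself has exactly this form.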

The learning loop

  • The network produces predictions from inputs, the loss measures the gap to targets, gradients are computed, and the weights are updated.
  • Repeating this loop gradually improves performance if the task is learnable and the data are informative.
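Repeating that single update many times is the whole loop. As a sketch under the same toy assumptions (one linear model, squared-error loss, hand-derived gradients), the following learns y = 2x + 1 from four labelled examples; the target function and hyperparameters are illustrative.

```python
# Labelled examples of the learnable pattern y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, lr = 0.0, 0.0, 0.05

for epoch in range(2000):   # many passes over the data
    # Predictions -> loss gap -> gradients, all folded into the gradient sums.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w        # update step
    b -= lr * grad_b

print(round(w, 2), round(b, 2))   # parameters approach 2 and 1
```

Because the task is learnable and the data are informative, the loop converges; on a noisy or uninformative dataset the same loop would stall at a high loss.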

Overfitting

  • A network can memorise training examples instead of learning general patterns that transfer to new data.
  • Performance on validation data helps detect this because accuracy on unseen data can decline while training accuracy keeps rising.
  • Simpler architectures, regularisation techniques, and more diverse data reduce overfitting.
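A small numerical demonstration of the points above, using a polynomial as a stand-in for a network: a curve flexible enough to pass through every training point exactly will memorise the noise, while a simpler model transfers better. The data and degrees here are assumptions chosen for illustration.

```python
import numpy as np

# Underlying pattern is y = x; fixed "noise" perturbs the training targets.
train_x = np.arange(8, dtype=float)
noise   = np.array([0.3, -0.2, 0.25, -0.3, 0.2, -0.25, 0.3, -0.2])
train_y = train_x + noise

val_x = train_x[:-1] + 0.5   # unseen points between the training inputs
val_y = val_x                # the true pattern, without noise

# A degree-7 polynomial has enough capacity to memorise all 8 training points;
# a degree-1 fit (a line) is the "simpler architecture".
overfit = np.polyfit(train_x, train_y, deg=7)
simple  = np.polyfit(train_x, train_y, deg=1)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print(mse(overfit, train_x, train_y))  # near zero: training points memorised
print(mse(overfit, val_x, val_y))      # much larger: poor generalisation
print(mse(simple, val_x, val_y))       # small: the simpler model transfers
```

The gap between training and validation error is exactly the signal that validation data provide: training error keeps falling while error on unseen inputs does not.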

What to remember

  • A network is a flexible function built from many simple units arranged in layers.
  • Each unit adds weighted inputs, shifts the result with a bias, and applies a non-linear activation.
  • Learning is an iterative process that reduces a loss on examples by adjusting the weights and biases with gradient information.
  • Depth, non-linearity, and sufficient data provide the power seen in modern applications.
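The unit computation in the second bullet can be written directly in code. This is a sketch with illustrative weights and inputs, using ReLU as one common choice of non-linear activation.

```python
def unit(inputs, weights, bias):
    # Weighted sum of the inputs, shifted by the bias...
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ...then passed through a non-linearity (here, the ReLU activation).
    return max(0.0, z)

# A layer is just several units applied to the same inputs; stacking layers
# (feeding one layer's outputs into the next) gives the network its depth.
def layer(inputs, weight_rows, biases):
    return [unit(inputs, w, b) for w, b in zip(weight_rows, biases)]

x = [1.0, -2.0]
print(unit(x, [0.5, 0.25], 0.1))                       # 0.5*1 + 0.25*(-2) + 0.1 = 0.1
print(layer(x, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))  # [1.0, 0.0] after ReLU
```

Without the non-linearity, any stack of such layers would collapse into a single linear map, which is why the activation is essential to the power of depth.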

Further reading

  • Bishop, C. M., & Bishop, H. (2023). Deep learning: Foundations and concepts. Springer.
  • Prince, S. J. D. (2023). Understanding deep learning. MIT Press.