Introduction

Deep Learning and Artificial Neural Networks

Mark Andrews

Deep learning vs artificial neural networks

  • Artificial neural networks (ANNs) are the general family of models built from interconnected computational neurons that map inputs to outputs.
  • Deep learning refers to training ANNs with multiple stacked layers and the practical ecosystem — large data, specialised architectures (CNNs, RNNs, transformers), optimisation techniques, and hardware — that enables hierarchical representation learning.
  • In short, deep learning is a subset of the ANN family: it denotes deep, large-scale ANNs plus the methods and infrastructure that make them effective.

Single-layer era: perceptrons

  • In 1958 Frank Rosenblatt introduced the perceptron as a simple neuron model that could learn linear decision boundaries from labelled examples.
  • Early experiments on image recognition created excitement because the perceptron demonstrated that machines could adapt their internal parameters from data.
  • Researchers soon recognised that single-layer perceptrons could not represent functions such as exclusive-or (XOR), which hinted at the need for multilayer architectures.

The “dark ages” after Minsky and Papert

  • In 1969 Marvin Minsky and Seymour Papert published a rigorous critique showing fundamental limits of perceptrons, which discouraged funding and interest.
  • Through the 1970s and early 1980s, symbolic approaches dominated artificial intelligence while connectionist methods received limited attention.
  • Small pockets of research persisted, but progress slowed because training deeper networks remained computationally and algorithmically difficult.

Backpropagation and the return of multilayer learning

  • In 1986 David Rumelhart, Geoffrey Hinton, and Ronald Williams popularised error backpropagation, which made it practical to train multilayer feedforward networks.
  • Yann LeCun and colleagues demonstrated convolutional neural networks trained by gradient descent for handwritten digit recognition, linking neural networks to computer vision.
  • Despite these advances, compute, data, and hardware constraints limited performance on large real-world problems for another two decades.

Deep learning renaissance

  • Around 2006 Geoffrey Hinton, Ruslan Salakhutdinov, Yoshua Bengio, and others showed that layer-wise pretraining could initialise deep networks effectively.
  • In 2012 Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the ImageNet Large Scale Visual Recognition Challenge with AlexNet, powered by GPUs and ReLU activations.
  • Rapid progress followed across speech, vision, and language as larger datasets, better regularisation, and improved architectures unlocked new capabilities.

Attention and the Transformer

  • In 2017 Ashish Vaswani and colleagues introduced the Transformer architecture in the paper “Attention Is All You Need”.
  • Transformers replaced recurrent computation with self-attention, enabling efficient parallel training and superior performance on sequence modelling tasks.
  • Subsequent models such as BERT, GPT, and ViT scaled data and parameters, establishing Transformers as a general-purpose foundation across modalities.

The current era

  • Large-scale pretraining, transfer learning, and instruction tuning have produced versatile models that support research, industry, and education at unprecedented scale.
  • Advances in optimisation, tooling, and hardware have turned neural networks into an engineering discipline with reproducible pipelines and strong empirical baselines.
  • Ongoing debates about safety, evaluation, and societal impact now accompany technical progress, as researchers balance innovation with responsible deployment.

How learning works

  • The network starts with random weights and biases.
  • It computes outputs for a batch of inputs and measures error with a loss function that compares outputs to targets.
  • Gradients indicate how each weight should change to reduce the loss, and the parameters are nudged in that direction.
  • This cycle repeats over many passes until performance on held-out data stops improving.
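The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real network: a single linear model y = w·x + b with a squared-error loss, hand-derived gradients, and one parameter update. All names and numbers (w, b, lr, the toy data) are assumptions chosen for the example.

```python
# Toy batch: inputs and targets drawn from y = 3x (unknown to the model).
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

w, b = 0.5, 0.0   # starting parameters (stand-ins for random initialisation)
lr = 0.05         # learning rate: how far to nudge the parameters per step

def loss(w, b):
    # Mean squared error between model outputs and targets.
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Gradients of the loss with respect to w and b (derived by hand for this model).
grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)

before = loss(w, b)
w -= lr * grad_w          # nudge each parameter against its gradient
b -= lr * grad_b
after = loss(w, b)
print(before, after)      # the loss drops after a single update
```

In a real network, the gradients are produced automatically by backpropagation rather than derived by hand, but the nudge itself has exactly this form.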

The learning loop

  • The network produces predictions from inputs, the loss measures the gap to targets, gradients are computed, and the weights are updated.
  • Repeating this loop gradually improves performance if the task is learnable and the data are informative.
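Repeating that single update many times is the whole loop. As a sketch under the same toy assumptions (one linear model, squared-error loss, hand-derived gradients), the following learns y = 2x + 1 from four labelled examples; the target function and hyperparameters are illustrative.

```python
# Labelled examples of the learnable pattern y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, lr = 0.0, 0.0, 0.05

for epoch in range(2000):   # many passes over the data
    # Predictions -> loss gap -> gradients, all folded into the gradient sums.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w        # update step
    b -= lr * grad_b

print(round(w, 2), round(b, 2))   # parameters approach 2 and 1
```

Because the task is learnable and the data are informative, the loop converges; on a noisy or uninformative dataset the same loop would stall at a high loss.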

Overfitting

  • A network can memorise training examples instead of learning general patterns that transfer to new data.
  • Performance on validation data helps detect this because accuracy on unseen data can decline while training accuracy keeps rising.
  • Simpler architectures, regularisation techniques, and more diverse data reduce overfitting.
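A small numerical demonstration of the points above, using a polynomial as a stand-in for a network: a curve flexible enough to pass through every training point exactly will memorise the noise, while a simpler model transfers better. The data and degrees here are assumptions chosen for illustration.

```python
import numpy as np

# Underlying pattern is y = x; fixed "noise" perturbs the training targets.
train_x = np.arange(8, dtype=float)
noise   = np.array([0.3, -0.2, 0.25, -0.3, 0.2, -0.25, 0.3, -0.2])
train_y = train_x + noise

val_x = train_x[:-1] + 0.5   # unseen points between the training inputs
val_y = val_x                # the true pattern, without noise

# A degree-7 polynomial has enough capacity to memorise all 8 training points;
# a degree-1 fit (a line) is the "simpler architecture".
overfit = np.polyfit(train_x, train_y, deg=7)
simple  = np.polyfit(train_x, train_y, deg=1)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print(mse(overfit, train_x, train_y))  # near zero: training points memorised
print(mse(overfit, val_x, val_y))      # much larger: poor generalisation
print(mse(simple, val_x, val_y))       # small: the simpler model transfers
```

The gap between training and validation error is exactly the signal that validation data provide: training error keeps falling while error on unseen inputs does not.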

What to remember

  • A network is a flexible function built from many simple units arranged in layers.
  • Each unit adds weighted inputs, shifts the result with a bias, and applies a non-linear activation.
  • Learning is an iterative process that reduces a loss on examples by adjusting the weights and biases with gradient information.
  • Depth, non-linearity, and sufficient data provide the power seen in modern applications.
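The unit computation in the second bullet can be written directly in code. This is a sketch with illustrative weights and inputs, using ReLU as one common choice of non-linear activation.

```python
def unit(inputs, weights, bias):
    # Weighted sum of the inputs, shifted by the bias...
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ...then passed through a non-linearity (here, the ReLU activation).
    return max(0.0, z)

# A layer is just several units applied to the same inputs; stacking layers
# (feeding one layer's outputs into the next) gives the network its depth.
def layer(inputs, weight_rows, biases):
    return [unit(inputs, w, b) for w, b in zip(weight_rows, biases)]

x = [1.0, -2.0]
print(unit(x, [0.5, 0.25], 0.1))                       # 0.5*1 + 0.25*(-2) + 0.1 = 0.1
print(layer(x, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))  # [1.0, 0.0] after ReLU
```

Without the non-linearity, any stack of such layers would collapse into a single linear map, which is why the activation is essential to the power of depth.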

Further reading

  • Bishop, C. M., & Bishop, H. (2023). Deep learning: Foundations and concepts. Springer.
  • Prince, S. J. D. (2023). Understanding deep learning. MIT Press.