Attention Mechanism Fundamentals

Transformers in Brief

Mark Andrews

The core problem

  • Standard feedforward layers process each input element independently; convolutional layers combine only a small local neighbourhood.
  • Many tasks require each element to be informed by others — including elements that are far away.
  • In language: the meaning of a word depends on distant context.
  • In images: understanding a region can require relating it to distant regions.

Long-range context in language

  • Many words have several senses whose disambiguation requires context.
  • “I submitted my paper to the journal” — paper means academic manuscript.
  • “I picked up the paper from the doorstep” — paper means newspaper.
  • The disambiguating words (journal, doorstep) need not be adjacent.
  • A fixed window of neighbouring words is not enough.

Long-range context in images

  • Image patches that are far apart often belong to the same object or structure.
  • A face: the eyes, nose, mouth, and jawline are spread across the image, but recognising a face requires relating all of them simultaneously.
  • The distance between the eyes, the proportions of the features relative to each other — none of these are local relationships.
  • A convolutional layer can only directly combine nearby pixels.
  • Long-range dependencies in CNNs require stacking many layers so the receptive field grows gradually.
  • Attention provides a direct path between any two positions in a single step.

What attention does

  • Attention allows each element to gather information from every other element.
  • The contribution of each element is weighted by its relevance.
  • The weights are learned, not fixed.
  • The result: each element’s representation is informed by global context, not just local neighbours.

Queries, keys, and values

Each input element produces three vectors via learned linear projections:

  • Query: what this element is looking for in others.
  • Key: what this element offers to others seeking information.
  • Value: the actual information content passed when attended to.

The dot product of query \(i\) with key \(j\) gives the relevance of element \(j\) to element \(i\). Softmax converts scores to weights. The output for element \(i\) is a weighted sum of all value vectors.
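A minimal numpy sketch of this step, for a single element. The projection matrices are random here purely for illustration; in a trained model they are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                       # 4 input elements, model dimension 8

X = rng.normal(size=(n, d))       # input element vectors
W_Q = rng.normal(size=(d, d))     # learned linear projections
W_K = rng.normal(size=(d, d))     # (random weights here, for illustration only)
W_V = rng.normal(size=(d, d))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Relevance of each element j to element i = 0: dot product of query 0 with every key.
scores = Q[0] @ K.T                              # shape (n,)
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: scores -> weights
output_0 = weights @ V                           # weighted sum of all value vectors
```

The weights are non-negative and sum to 1, so `output_0` is a convex combination of the value vectors, dominated by whichever elements element 0 found most relevant.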

The attention formula

For input matrix \(X \in \mathbb{R}^{n \times d}\), with projections \(Q = XW_Q\), \(K = XW_K\), \(V = XW_V\):

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

  • \(QK^\top\) is an \(n \times n\) score matrix — relevance of every element to every other.
  • Dividing by \(\sqrt{d_k}\) keeps the scores from growing with the key dimension; large scores would saturate the softmax, leaving near-zero gradients.
  • Softmax is applied row-wise: each row sums to 1.
  • Multiplying by \(V\) produces an \(n \times d_v\) output — one updated vector per element.
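The formula translates almost line for line into numpy. This is a sketch of scaled dot-product attention for a single head, with random inputs standing in for real projections:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) score matrix
    weights = softmax(scores, axis=-1)    # row-wise softmax: each row sums to 1
    return weights @ V                    # (n, d_v): one updated vector per element

rng = np.random.default_rng(1)
n, d_k, d_v = 5, 16, 16
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = attention(Q, K, V)                  # shape (5, 16)
```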

Multi-head attention

  • A single attention head finds one pattern of relevance across elements.
  • Multi-head attention runs \(h\) heads in parallel, each in a lower-dimensional subspace.
  • Different heads learn different notions of relevance simultaneously.
  • Outputs from all heads are concatenated and projected back to the original dimension.
  • Total parameter count stays roughly equal to single-head attention.
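One common way to realise this, sketched in numpy: project once to the full dimension, split the projections into \(h\) subspaces of size \(d/h\), run attention in each, then concatenate and apply an output projection. The weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d = X.shape
    d_h = d // h                            # each head works in a d/h-dimensional subspace
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V     # full projections, then split per head
    heads = []
    for i in range(h):
        s = slice(i * d_h, (i + 1) * d_h)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_h)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ W_O  # concat heads, project back to d

rng = np.random.default_rng(2)
n, d, h = 6, 32, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)  # shape (6, 32)
```

Note the parameter count: four \(d \times d\) matrices, the same order as a single full-dimension head.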

The transformer block

Each block applies two operations in sequence, with residual connections and layer normalisation at each step:

  1. Self-attention — each element attends to all others.
  2. Feedforward network — a small two-layer MLP applied independently at each position.

\[x \leftarrow \text{LayerNorm}(x + \text{Attention}(x))\]
\[x \leftarrow \text{LayerNorm}(x + \text{FFN}(x))\]

  • Residual connections allow gradients to flow directly through many stacked blocks.
  • Layer normalisation keeps activations at a stable scale throughout training.
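A sketch of one block in numpy, following the post-norm equations above. All weights are random placeholders for learned parameters, and the layer norm omits the usual learned gain and bias for brevity:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, d_ff = 4, 8, 32   # 4 elements, model dimension 8, FFN hidden dimension 32

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # two-layer MLP, ReLU nonlinearity

def transformer_block(x, p):
    x = layer_norm(x + self_attention(x, p["W_Q"], p["W_K"], p["W_V"]))  # sublayer 1
    x = layer_norm(x + ffn(x, p["W1"], p["b1"], p["W2"], p["b2"]))       # sublayer 2
    return x

p = {
    "W_Q": rng.normal(size=(d, d)) * 0.1, "W_K": rng.normal(size=(d, d)) * 0.1,
    "W_V": rng.normal(size=(d, d)) * 0.1,
    "W1": rng.normal(size=(d, d_ff)) * 0.1, "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d)) * 0.1, "b2": np.zeros(d),
}
x = rng.normal(size=(n, d))
y = transformer_block(x, p)   # same shape as x
```

The residual additions `x + ...` are what let gradients bypass each sublayer; the two layer norms keep activations at a stable scale.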

Why the feedforward network?

  • The attention output is a weighted average of value vectors — linear in the values it combines.
  • Without nonlinearity, stacking many attention layers collapses to a single linear transformation.
  • The feedforward network introduces nonlinearity at each position after the mixing step.
  • It gives each token’s representation the capacity to be further transformed beyond what attention assembled.
  • Attention handles mixing across tokens; the feedforward network handles transformation within each token.

The transformer architecture

input elements  →  embeddings
                →  transformer block 1
                →  transformer block 2
                →  ...
                →  transformer block N
                →  task-specific output layer

  • Each block output has the same shape as its input — blocks stack cleanly.
  • After \(N\) blocks, each element’s representation carries rich global context.
  • For language: input elements are tokens; output is next-token probabilities or class labels.
  • For images (Vision Transformer): input elements are image patches; output is class probabilities.
  • The transformer stack is generic — the task-specific part is a thin layer at the end.
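A compact numpy sketch of the stack, with random weights standing in for learned ones. The point is the shape invariant: every block maps \((n, d)\) to \((n, d)\), so any number can be stacked.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, N = 4, 8, 3   # 4 input elements, dimension 8, 3 stacked blocks

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def block(x, W_Q, W_K, W_V, W1, W2):
    a = softmax((x @ W_Q) @ (x @ W_K).T / np.sqrt(d)) @ (x @ W_V)  # self-attention
    x = norm(x + a)
    f = np.maximum(0, x @ W1) @ W2                                 # position-wise FFN
    return norm(x + f)

x = rng.normal(size=(n, d))   # stand-in for the input embeddings
for _ in range(N):            # each block gets its own weights
    params = [rng.normal(size=s) * 0.1
              for s in [(d, d)] * 3 + [(d, 4 * d), (4 * d, d)]]
    x = block(x, *params)
# x still has shape (n, d) after every block — a task head would go here
```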