Day Two, Session Two

Language Models and Transformers

Mark Andrews

The language modelling task

  • A language model assigns a probability to a sequence of tokens.
  • The training objective is next-token prediction: given tokens so far, predict the next one.

\[p(x_{t+1} \mid x_1, x_2, \ldots, x_t)\]

  • Training minimises cross-entropy between the predicted distribution and the true next token.
  • This single objective is sufficient to learn rich representations of language.
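The objective above can be sketched in a few lines of PyTorch. This is a minimal illustration with random logits standing in for a model's output; the shapes and names (`logits`, `targets`) are illustrative, not part of any specific implementation.

```python
import torch
import torch.nn.functional as F

# Toy setup: batch of 2 sequences, 5 positions each, vocabulary of 10 tokens.
vocab_size = 10
logits = torch.randn(2, 5, vocab_size)          # model's predicted scores per position
targets = torch.randint(0, vocab_size, (2, 5))  # the true next token at each position

# Cross-entropy between the predicted distribution and the true next token,
# averaged over every position in the batch.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())  # a single scalar; training drives this down
```

`F.cross_entropy` applies the softmax internally, so the model outputs raw logits rather than probabilities.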

Tokenisation

  • Text must be converted to integers before a model can process it.
  • Character-level: each character maps to one integer; simple, small vocabulary (~60–70 tokens).
  • Subword (e.g. BPE): common sequences of characters become single tokens; reduces sequence length.
  • For this session we use character-level tokenisation — no external library needed.
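A character-level tokeniser needs nothing beyond the standard library: build the vocabulary from the text itself, then map characters to integers and back. The variable names below (`stoi`, `itos`) are a common convention, not a requirement.

```python
# Character-level tokeniser: the vocabulary is the set of characters in the corpus.
text = "hello world"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(len(chars), ids)        # vocabulary size and the integer sequence
assert decode(ids) == "hello" # encode/decode round-trips exactly
```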

Embeddings and positional encoding

  • An embedding layer maps each integer token to a dense vector: nn.Embedding(vocab_size, embed_dim).
  • The embedding vectors are learned parameters.
  • A transformer processes all tokens simultaneously — it has no inherent notion of order.
  • A second embedding indexed by position is added to inject positional information.

\[h_t = \text{tok\_emb}(x_t) + \text{pos\_emb}(t)\]
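In PyTorch this is two `nn.Embedding` tables and an addition; the sizes below are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, block_size = 65, 32, 16  # illustrative sizes

tok_emb = nn.Embedding(vocab_size, embed_dim)   # indexed by token identity
pos_emb = nn.Embedding(block_size, embed_dim)   # indexed by position 0..T-1

x = torch.randint(0, vocab_size, (1, block_size))  # (batch, time) of token ids
positions = torch.arange(block_size)               # 0, 1, ..., T-1
h = tok_emb(x) + pos_emb(positions)                # broadcasts to (batch, T, embed_dim)
print(h.shape)  # torch.Size([1, 16, 32])
```

Both tables are ordinary learned parameters, updated by the same gradient descent as the rest of the model.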

The problem that attention solves

  • In a sequence, each token’s meaning depends on its context.
  • “Bank” means different things in “river bank” and “bank account”.
  • Recurrent networks process tokens one at a time; this limits parallelism and struggles with long-range dependencies.
  • Attention lets every token directly aggregate information from any other position in the sequence.

Scaled dot-product attention

For \(T\) tokens packed into matrices \(Q, K, V \in \mathbb{R}^{T \times d_k}\):

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

  • Each row of \(Q\) is a query — what this position is looking for.
  • Each row of \(K\) is a key — what this position has to offer.
  • Each row of \(V\) is the value — what is aggregated if selected.
  • The query–key dot product measures compatibility; softmax turns scores into weights.
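The formula translates directly into a few tensor operations. A minimal single-head sketch, with random matrices standing in for learned projections:

```python
import math
import torch
import torch.nn.functional as F

T, d_k = 4, 8
Q = torch.randn(T, d_k)  # queries: what each position is looking for
K = torch.randn(T, d_k)  # keys: what each position has to offer
V = torch.randn(T, d_k)  # values: what is aggregated if selected

scores = Q @ K.T / math.sqrt(d_k)    # (T, T) compatibility scores
weights = F.softmax(scores, dim=-1)  # each row is a probability distribution
out = weights @ V                    # (T, d_k) weighted sum of values

assert torch.allclose(weights.sum(dim=-1), torch.ones(T))
print(out.shape)  # torch.Size([4, 8])
```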

Why divide by \(\sqrt{d_k}\)?

  • For large \(d_k\), dot products grow large in magnitude.
  • This pushes the softmax into regions of near-zero gradient — the distribution becomes a near one-hot vector.
  • Dividing by \(\sqrt{d_k}\) keeps scores on a scale where the softmax gradient remains healthy and training is stable.
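The effect of the scale factor is easy to verify numerically: with unit-variance inputs, raw dot products have variance roughly \(d_k\), and dividing by \(\sqrt{d_k}\) restores variance roughly 1.

```python
import torch

d_k = 512
q = torch.randn(1000, d_k)  # 1000 unit-variance query vectors
k = torch.randn(1000, d_k)  # 1000 unit-variance key vectors

raw = (q * k).sum(dim=-1)   # unscaled dot products: std ~ sqrt(d_k) ~ 22.6
scaled = raw / d_k ** 0.5   # scaled dot products: std ~ 1

print(raw.std().item(), scaled.std().item())
```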

Causal masking

  • In language modelling, position \(t\) must not attend to positions \(> t\).
  • Without masking, the model could read the answer directly during training — it would learn nothing useful.
  • The causal mask sets future entries in the score matrix to \(-\infty\) before the softmax.
  • After softmax, \(-\infty \to 0\): those positions receive no weight.
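One common way to implement this is a lower-triangular mask applied to the score matrix before the softmax:

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)           # raw attention scores
mask = torch.tril(torch.ones(T, T))  # lower triangle = allowed (past and present)
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)

# Row t now places zero weight on every future position > t,
# and each row still sums to 1 over the allowed positions.
print(weights)
```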

Multi-head attention

Single-head attention computes one context vector per position. Multi-head attention runs \(H\) attention operations in parallel, each in a subspace of dimension \(d_k = C / H\), where \(C\) is the model's embedding dimension:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\,W^O\]

  • Each head uses its own learned Q, K, V projection matrices.
  • Different heads can attend to different kinds of relationship simultaneously.
  • Outputs are concatenated and projected back to the original dimension \(C\).
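The split-attend-concatenate pattern can be sketched as a small module. This version fuses the Q, K, V projections into one linear layer (a common implementation trick) and omits the causal mask for clarity; it is an illustration, not a reference implementation.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention (no causal mask, for clarity)."""
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.H = num_heads
        self.d_k = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # fused Q, K, V projections
        self.proj = nn.Linear(embed_dim, embed_dim)     # output projection W^O

    def forward(self, x):                               # x: (B, T, C)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split channels into heads: (B, T, C) -> (B, H, T, d_k)
        q, k, v = (t.view(B, T, self.H, self.d_k).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5
        att = att.softmax(dim=-1)
        out = att @ v                                   # (B, H, T, d_k)
        out = out.transpose(1, 2).reshape(B, T, C)      # concatenate heads back to C
        return self.proj(out)

mha = MultiHeadSelfAttention(embed_dim=32, num_heads=4)
y = mha(torch.randn(2, 8, 32))
print(y.shape)  # torch.Size([2, 8, 32])
```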

The transformer block

Each block applies, in order:

  1. Multi-head self-attention
  2. Residual connection and layer normalisation: \(x \leftarrow \text{LayerNorm}(x + \text{attn}(x))\)
  3. Feedforward sublayer: two linear layers with ReLU between
  4. Another residual connection and layer normalisation

Residual connections allow gradients to flow back through many blocks without vanishing, making deep transformers trainable.
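The four steps above can be sketched as one module. This uses PyTorch's built-in `nn.MultiheadAttention` for brevity (again without the causal mask), and the conventional 4x widening of the feedforward hidden layer is an assumption, not something fixed by the architecture:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block, matching the ordering listed above."""
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),  # widen
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim),  # project back to C
        )
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        a, _ = self.attn(x, x, x, need_weights=False)
        x = self.ln1(x + a)           # residual connection + layer norm
        x = self.ln2(x + self.ff(x))  # residual connection + layer norm
        return x

block = Block(embed_dim=32, num_heads=4)
y = block(torch.randn(2, 8, 32))
print(y.shape)  # torch.Size([2, 8, 32])
```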

The GPT architecture

\[\text{token embeddings} + \text{positional embeddings}\] \[\downarrow\] \[N \text{ transformer blocks (causal self-attention)}\] \[\downarrow\] \[\text{layer norm} \;\to\; \text{linear} \;\to\; \text{logits over vocabulary}\]

  • Decoder-only: attends only to past tokens.
  • Generates text autoregressively — one token at a time, feeding each output back as input.
  • Session three implements this in full and trains it on a text corpus.
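The autoregressive loop itself is simple and independent of the model's internals. A sketch with a stand-in "model" (any callable mapping a `(1, T)` tensor of token ids to `(1, T, vocab_size)` logits; `dummy_model` is a placeholder, not part of the architecture):

```python
import torch

vocab_size = 10

def dummy_model(ids):
    # Placeholder: random logits in place of a trained transformer.
    return torch.randn(1, ids.shape[1], vocab_size)

ids = torch.zeros(1, 1, dtype=torch.long)  # a single start token
for _ in range(5):
    logits = dummy_model(ids)[:, -1, :]    # logits for the last position only
    probs = logits.softmax(dim=-1)
    nxt = torch.multinomial(probs, num_samples=1)  # sample the next token
    ids = torch.cat([ids, nxt], dim=1)     # feed the sample back as input

print(ids.shape)  # torch.Size([1, 6])
```

Each pass through the loop extends the sequence by one token, which is exactly what "one token at a time, feeding each output back as input" means in code.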