Day Two, Session Two

Language Models and Transformers

Author

Mark Andrews

Abstract

We introduce the language modelling task and build up the transformer architecture from its component parts. Starting with tokenisation and embeddings, we implement scaled dot-product attention from scratch, explain causal masking, and assemble a transformer block. Session three builds a complete GPT-style model on these foundations.

The language modelling task

A language model assigns a probability to a sequence of tokens. The standard training objective is next-token prediction: given a sequence of tokens so far, predict the next one. This single objective turns out to be sufficient to learn rich representations of language.

The model sees a sequence \(x_1, x_2, \ldots, x_T\) and must produce a probability distribution over the vocabulary at each position:

\[p(x_{t+1} \mid x_1, x_2, \ldots, x_t)\]

Training minimises the cross-entropy between the predicted distribution and the true next token, summed over all positions. The same model that is trained to predict the next token can then generate text by sampling from its own predictions autoregressively, one token at a time.
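To make the objective concrete, here is a toy calculation (a minimal sketch with made-up logits, not output from a trained model):

```python
import torch
import torch.nn.functional as F

# Toy setup: vocabulary of 4 tokens, predictions at 3 positions.
# logits[t] holds the model's unnormalised scores over the vocabulary
# for the token following position t (numbers invented for illustration).
logits = torch.tensor([[2.0, 0.5, 0.1, 0.1],
                       [0.2, 1.5, 0.3, 0.1],
                       [0.1, 0.2, 0.1, 2.2]])
targets = torch.tensor([0, 1, 3])   # the true next token at each position

# cross_entropy applies softmax internally and averages over positions
loss = F.cross_entropy(logits, targets)

# Equivalently: the negative log-probability assigned to the true next
# token, averaged over positions
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(3), targets].mean()

print(loss, manual)   # the two values agree
```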

Tokenisation

Before a model can process text, the text must be converted to a sequence of integers — tokens. The simplest scheme is character-level tokenisation, where each character in the vocabulary is mapped to an integer.

text = "To be, or not to be, that is the question."

chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary: {chars}")
print(f"Vocab size: {vocab_size}")
Vocabulary: [' ', ',', '.', 'T', 'a', 'b', 'e', 'h', 'i', 'n', 'o', 'q', 'r', 's', 't', 'u']
Vocab size: 16
stoi = {c: i for i, c in enumerate(chars)}  # string to integer
itos = {i: c for c, i in stoi.items()}       # integer to string

encoded = [stoi[c] for c in text]
decoded = ''.join(itos[i] for i in encoded)

print(encoded[:10])
print(decoded[:20])
[3, 10, 0, 5, 6, 1, 0, 10, 12, 0]
To be, or not to be,

Character-level tokenisation is simple but produces long sequences, since each character is one token. In practice, modern language models use subword tokenisation methods such as Byte Pair Encoding (BPE), which represents common sequences of characters as single tokens. This reduces sequence length and allows the vocabulary to cover rare words through combinations of subword units. The Hugging Face tokenizers library implements BPE and other schemes.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

# A BPE tokenizer (untrained — just to show the interface)
bpe_tokenizer = Tokenizer(BPE())
bpe_tokenizer.pre_tokenizer = Whitespace()
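The core idea of BPE is simple enough to sketch directly (purely illustrative; the library's implementation is far more sophisticated). Starting from characters, repeatedly find the most frequent adjacent pair of symbols and merge it into a single new symbol:

```python
from collections import Counter

# One BPE-style merge loop on a tiny string: each pass fuses the most
# frequent adjacent pair of symbols into a single new symbol.
tokens = list("the theme thereof")

for step in range(3):
    pairs = Counter(zip(tokens, tokens[1:]))
    (a, b), count = pairs.most_common(1)[0]
    merged = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)   # the pair becomes one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    tokens = merged
    print(f"merged {(a, b)!r} -> {tokens}")
```

Merging never changes the underlying text, only how it is segmented, so the sequence gets shorter while remaining decodable.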

For this session we will use character-level tokenisation, which is sufficient to understand the full architecture.

Embeddings

Raw token indices are integers with no meaningful geometric relationship. An embedding layer maps each integer to a dense vector in a continuous space. Nearby vectors in that space come to represent tokens that appear in similar contexts.

nn.Embedding(num_embeddings, embedding_dim) is a learnable lookup table. Given a tensor of integer token indices, it returns the corresponding rows of the table.

import torch
import torch.nn as nn

embed_dim = 16
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([stoi['T'], stoi['o'], stoi[' ']])
embedded = embedding(token_ids)
embedded.shape            # (3, 16): three tokens, each a vector of length 16
torch.Size([3, 16])

The embedding vectors are parameters with requires_grad=True and are updated during training like any other weight.
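A quick check confirms this. After a backward pass, only the rows of the table that were actually looked up receive a gradient:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(16, 8)
print(embedding.weight.requires_grad)   # True: the table is a parameter

token_ids = torch.tensor([3, 10, 0])
out = embedding(token_ids)
out.sum().backward()

# Only the looked-up rows receive gradient; every other row stays zero
grad_rows = embedding.weight.grad.abs().sum(dim=1)
print((grad_rows != 0).nonzero().flatten())   # tensor([0, 3, 10])
```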

Positional encoding

A transformer processes all tokens in a sequence simultaneously rather than left to right, and the attention operation itself is indifferent to the order of its inputs. Positional information must therefore be injected explicitly, since the model has no other way to know where each token sits in the sequence.

The simplest approach is a learned positional embedding: a second embedding table indexed by position rather than token. The positional embedding for position \(t\) is added to the token embedding for the token at that position.

max_seq_len = 64
pos_embedding = nn.Embedding(max_seq_len, embed_dim)

seq_len = token_ids.shape[0]
positions = torch.arange(seq_len)
pos_emb = pos_embedding(positions)

x = embedded + pos_emb    # token embedding + position embedding
x.shape
torch.Size([3, 16])

The combined representation carries both what the token is and where it appears in the sequence.

Self-attention

Self-attention is the mechanism that allows each token to gather information from other tokens in the sequence. It is the defining operation of the transformer, and the main reason transformers have largely replaced earlier recurrent architectures.

Each input vector is projected into three separate vectors: a query \(q\), a key \(k\), and a value \(v\). These are produced by three separate linear transformations applied to the input. The query and key of each token are compared by a dot product to produce attention scores. Tokens whose keys align closely with a given token’s query receive high attention weights. The output for each token is a weighted sum of the value vectors of all tokens, where the weights come from the attention scores.

For a sequence of \(T\) tokens with representations packed into matrices \(Q\), \(K\), \(V \in \mathbb{R}^{T \times d_k}\), scaled dot-product attention is:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V\]

The scaling by \(\sqrt{d_k}\) prevents the dot products from becoming very large when \(d_k\) is large, which would push the softmax into regions where its gradient is near zero.
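The effect of the scaling is easy to verify numerically with random vectors:

```python
import torch

torch.manual_seed(0)
d_k = 256
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)

raw    = (q * k).sum(dim=-1)      # unscaled dot products
scaled = raw / d_k ** 0.5

# The variance of a dot product of two standard-normal vectors is d_k;
# dividing by sqrt(d_k) brings it back to roughly 1
print(raw.var(), scaled.var())
```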

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

We can try this on a small example, with the query, key, and value of each of our three tokens all set equal to its embedding vector.

Q = K = V = x   # (3, 16)
out, weights = scaled_dot_product_attention(Q, K, V)
out.shape        # (3, 16): one output vector per token
torch.Size([3, 16])
weights          # (3, 3): attention weight from each token to every other
tensor([[9.5804e-01, 4.0746e-02, 1.2104e-03],
        [3.6737e-03, 9.9632e-01, 8.5242e-06],
        [4.9348e-03, 3.8546e-04, 9.9468e-01]], grad_fn=<SoftmaxBackward0>)

Each row of weights sums to one and tells us how much attention token \(i\) pays to every token when computing its output.

Causal masking

In language modelling, the model predicts the next token given only past tokens. When computing attention, a token must therefore not be allowed to attend to future tokens: otherwise the model could simply copy the token it is meant to predict, and training would teach it nothing useful.

Causal masking enforces this by setting attention scores for future positions to \(-\infty\) before the softmax. After the softmax, \(-\infty\) becomes 0, so those positions receive no weight.

T = 5
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
mask
tensor([[False,  True,  True,  True,  True],
        [False, False,  True,  True,  True],
        [False, False, False,  True,  True],
        [False, False, False, False,  True],
        [False, False, False, False, False]])

The upper triangle is True, meaning those positions are masked. diagonal=1 leaves the main diagonal unmasked, so each token can still attend to itself.

Q = K = V = torch.randn(T, embed_dim)
out, weights = scaled_dot_product_attention(Q, K, V, mask=mask)
weights.round(decimals=2)
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 1.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0100, 0.9900, 0.0000, 0.0000],
        [0.0100, 0.0100, 0.0100, 0.9700, 0.0000],
        [0.1000, 0.0200, 0.0300, 0.0400, 0.8100]])

The lower-triangular pattern confirms that each token only attends to itself and earlier tokens.

Multi-head attention

Single-head attention computes one weighted combination of value vectors at each position. Multi-head attention runs \(h\) attention operations in parallel, each in a lower-dimensional subspace. Different heads can simultaneously attend to different kinds of relationship: one might track local co-occurrence, another long-range dependencies, another syntactic structure.

The computation has three stages. First, Q, K, and V are each projected through separate learned weight matrices into \(h\) subspaces of dimension \(d_h = C / h\), where \(C\) is the embedding dimension. Second, scaled dot-product attention runs independently in each subspace. Third, the \(h\) outputs are concatenated back to dimension \(C\) and passed through a final linear projection.

For a batched input of shape \((B, T, C)\):

  • After projection and splitting into heads: Q, K, V each \((B, H, T, C/H)\)
  • Attention scores per head: \((B, H, T, T)\)
  • Attention output per head: \((B, H, T, C/H)\)
  • After concatenating heads and projecting: \((B, T, C)\)

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim  = embed_dim // num_heads
        self.q_proj    = nn.Linear(embed_dim, embed_dim)
        self.k_proj    = nn.Linear(embed_dim, embed_dim)
        self.v_proj    = nn.Linear(embed_dim, embed_dim)
        self.out_proj  = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        H = self.num_heads
        # Project and split into heads: (B, T, C) -> (B, H, T, C/H)
        Q = self.q_proj(x).view(B, T, H, self.head_dim).transpose(1, 2)
        K = self.k_proj(x).view(B, T, H, self.head_dim).transpose(1, 2)
        V = self.v_proj(x).view(B, T, H, self.head_dim).transpose(1, 2)
        # scaled_dot_product_attention broadcasts over B and H
        out, _ = scaled_dot_product_attention(Q, K, V, mask=mask)   # (B, H, T, C/H)
        # Concatenate heads and project back: (B, H, T, C/H) -> (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out)

The .view and .transpose calls do the head-splitting and merging. After projecting Q to shape (B, T, C), .view(B, T, H, head_dim) reinterprets the last axis as \(H\) heads each of width head_dim, and .transpose(1, 2) moves the head axis before the sequence axis so that scaled_dot_product_attention sees independent (T, head_dim) slices per head. After attention, .transpose(1, 2).contiguous().view(B, T, C) reverses this, concatenating the heads back into a single vector of length \(C\) at each position.
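The split and merge are lossless regroupings of the channels, which is easy to verify:

```python
import torch

B, T, C, H = 2, 5, 16, 4
head_dim = C // H
x = torch.randn(B, T, C)

# Split: (B, T, C) -> (B, H, T, C/H)
heads = x.view(B, T, H, head_dim).transpose(1, 2)

# Head h holds channels [h*head_dim, (h+1)*head_dim) of every token
print(torch.equal(heads[:, 1], x[..., head_dim:2 * head_dim]))   # True

# Merging reverses the split exactly
merged = heads.transpose(1, 2).contiguous().view(B, T, C)
print(torch.equal(merged, x))   # True
```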

mha_hand = MultiHeadAttention(embed_dim=embed_dim, num_heads=4)
x        = torch.randn(2, T, embed_dim)
out      = mha_hand(x, mask=mask)
out.shape    # (2, 5, 16): batch=2, seq=5, embed=16
torch.Size([2, 5, 16])

PyTorch’s nn.MultiheadAttention implements the same computation with additional optimisations.

mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=4, batch_first=True)
attn_mask = torch.triu(torch.ones(T, T), diagonal=1).bool()

out, weights = mha(x, x, x, attn_mask=attn_mask)
out.shape
torch.Size([2, 5, 16])

The three x arguments are the query, key, and value inputs — all the same tensor for self-attention.
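PyTorch 2.0 and later also expose the attention primitive itself as torch.nn.functional.scaled_dot_product_attention, and our hand-written version can be checked against it (the function is restated here so the comparison is self-contained):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Same implementation as earlier in the session
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ V, None

T, d = 5, 16
Q, K, V = torch.randn(1, T, d), torch.randn(1, T, d), torch.randn(1, T, d)
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()

ours, _ = scaled_dot_product_attention(Q, K, V, mask=mask)
# is_causal=True applies the same upper-triangular mask internally
builtin = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

print(torch.allclose(ours, builtin, atol=1e-5))   # True
```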

Transformer blocks

A transformer block combines multi-head self-attention with a small feedforward network. Two features appear in every block: residual connections and layer normalisation.

A residual connection adds the block’s input directly to its output before normalisation. If the block computes \(f(x)\), the residual output is \(x + f(x)\). Residual connections allow gradients to flow directly back through many layers without vanishing, which makes deep transformers trainable.

Layer normalisation normalises the activations across the embedding dimension for each token independently, keeping training stable.
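A quick check on random activations shows what this means: whatever the input statistics, each token's vector comes out with mean close to zero and variance close to one across the embedding dimension.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 5, 16) * 3 + 7   # activations with mean 7, std 3
norm = nn.LayerNorm(16)
y = norm(x)

# Each token is normalised independently across the embedding dimension
print(y.mean(dim=-1).abs().max())             # close to 0
print(y.var(dim=-1, unbiased=False).mean())   # close to 1
```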

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.attn  = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff    = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + attn_out)          # residual + normalise
        x = self.norm2(x + self.ff(x))        # residual + normalise
        return x

block = TransformerBlock(embed_dim=embed_dim, num_heads=4, ff_dim=64)
x = torch.randn(2, T, embed_dim)
block(x, attn_mask=attn_mask).shape
torch.Size([2, 5, 16])

The GPT architecture

GPT (Generative Pre-trained Transformer) is a decoder-only transformer: it consists of a stack of transformer blocks with causal masking, so it can only attend to past tokens. The full model is token embedding + positional embedding, followed by \(N\) transformer blocks, then a final linear layer that projects to logit scores over the vocabulary.

input tokens → token embeddings + positional embeddings
             → transformer block 1
             → transformer block 2
             → ...
             → transformer block N
             → layer norm
             → linear layer → logits over vocabulary
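As a rough sketch of how these pieces fit together, here is a minimal version of this stack, using PyTorch's nn.TransformerEncoderLayer as a stand-in for the block defined above (the class name and hyperparameters are illustrative, not the model built in session three):

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    # Illustrative skeleton of the decoder-only stack
    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_seq_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(embed_dim, num_heads,
                                       dim_feedforward=4 * embed_dim,
                                       batch_first=True)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx):   # idx: (B, T) token indices
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        # Causal mask: True marks positions each token may not attend to
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device),
                          diagonal=1)
        for block in self.blocks:
            x = block(x, src_mask=mask)
        return self.head(self.norm(x))   # (B, T, vocab_size) logits

model = MiniGPT(vocab_size=16, embed_dim=32, num_heads=4,
                num_layers=2, max_seq_len=64)
logits = model(torch.randint(0, 16, (2, 5)))
print(logits.shape)   # torch.Size([2, 5, 16])
```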

Session three implements this in full and trains it on a text corpus.