We introduce the language modelling task and build up the transformer architecture from its component parts. Starting with tokenisation and embeddings, we implement scaled dot-product attention from scratch, explain causal masking, and assemble a transformer block. Session three builds a complete GPT-style model on these foundations.
The language modelling task
A language model assigns a probability to a sequence of tokens. The standard training objective is next-token prediction: given a sequence of tokens so far, predict the next one. This single objective turns out to be sufficient to learn rich representations of language.
The model sees a sequence \(x_1, x_2, \ldots, x_T\) and must produce a probability distribution over the vocabulary at each position:
\[p(x_{t+1} \mid x_1, x_2, \ldots, x_t)\]
Training minimises the cross-entropy between the predicted distribution and the true next token, summed over all positions. The same model that is trained to predict the next token can then generate text by sampling from its own predictions autoregressively, one token at a time.
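As a concrete illustration of the objective, the loss can be computed with F.cross_entropy on per-position logits. A toy sketch with made-up numbers, not a trained model:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits over a 5-token vocabulary at 4 positions,
# plus the true next token at each position.
vocab_size, T = 5, 4
torch.manual_seed(0)
logits = torch.randn(T, vocab_size)   # one row of scores per position
targets = torch.tensor([2, 0, 4, 1])  # true next token at each position

# Cross-entropy compares each predicted distribution with the true
# next token, averaged over positions: the language-modelling loss.
loss = F.cross_entropy(logits, targets)

# Equivalent by hand: mean negative log-probability of the true tokens.
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(T), targets].mean()
print(loss.item(), manual.item())
```

The two values agree: minimising cross-entropy is exactly maximising the log-probability the model assigns to the observed next tokens.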
Tokenisation
Before a model can process text, the text must be converted to a sequence of integers — tokens. The simplest scheme is character-level tokenisation, where each character in the vocabulary is mapped to an integer.
text = "To be, or not to be, that is the question."
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary: {chars}")
print(f"Vocab size: {vocab_size}")
stoi = {c: i for i, c in enumerate(chars)}  # string to integer
itos = {i: c for c, i in stoi.items()}  # integer to string
encoded = [stoi[c] for c in text]
decoded = ''.join(itos[i] for i in encoded)
print(encoded[:10])
print(decoded[:20])
[3, 10, 0, 5, 6, 1, 0, 10, 12, 0]
To be, or not to be,
Character-level tokenisation is simple but produces long sequences, since each character is one token. In practice, modern language models use subword tokenisation methods such as Byte Pair Encoding (BPE), which represents common sequences of characters as single tokens. This reduces sequence length and allows the vocabulary to cover rare words through combinations of subword units. The Hugging Face tokenizers library implements BPE and other schemes.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

# A BPE tokenizer (untrained; just to show the interface)
bpe_tokenizer = Tokenizer(BPE())
bpe_tokenizer.pre_tokenizer = Whitespace()
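The core of BPE is easy to sketch in plain Python: repeatedly count adjacent symbol pairs and merge the most frequent pair into a single new symbol. This is an illustrative toy, not the tokenizers library's implementation:

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from characters and apply three merge steps.
symbols = list("to be or not to be")
for _ in range(3):
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)
print(symbols)
```

Each merge shortens the sequence while growing the vocabulary by one symbol; a real BPE tokenizer records the learned merges and replays them at encoding time.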
For this session we will use character-level tokenisation, which is sufficient to understand the full architecture.
Embeddings
Raw token indices are integers with no meaningful geometric relationship. An embedding layer maps each integer to a dense vector in a continuous space. Nearby vectors in that space come to represent tokens that appear in similar contexts.
nn.Embedding(num_embeddings, embedding_dim) is a learnable lookup table. Given a tensor of integer token indices, it returns the corresponding rows of the table.
import torch
import torch.nn as nn

embed_dim = 16
embedding = nn.Embedding(vocab_size, embed_dim)
token_ids = torch.tensor([stoi['T'], stoi['o'], stoi[' ']])
embedded = embedding(token_ids)
embedded.shape  # (3, 16): three tokens, each a vector of length 16
torch.Size([3, 16])
The embedding vectors are parameters with requires_grad=True and are updated during training like any other weight.
Positional encoding
A transformer processes all tokens in a sequence simultaneously rather than left to right. This means positional information must be injected explicitly, since the model has no other way to know the order of tokens.
The simplest approach is a learned positional embedding: a second embedding table indexed by position rather than token. The positional embedding for position \(t\) is added to the token embedding for the token at that position.
The combined representation carries both what the token is and where it appears in the sequence.
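A minimal sketch of the combination, reusing the vocabulary size from the character example and assuming a hypothetical maximum sequence length of 32:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, max_len = 16, 16, 32  # max_len is an assumption

token_embedding = nn.Embedding(vocab_size, embed_dim)
position_embedding = nn.Embedding(max_len, embed_dim)  # indexed by position

token_ids = torch.tensor([[3, 10, 0, 5, 6]])  # (B=1, T=5) batch of indices
T = token_ids.shape[1]
positions = torch.arange(T)                   # 0, 1, 2, 3, 4

# (1, 5, 16) + (5, 16): the positional rows broadcast over the batch.
x = token_embedding(token_ids) + position_embedding(positions)
print(x.shape)  # torch.Size([1, 5, 16])
```

Both tables are learned, so the model is free to discover whatever positional code is useful for prediction.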
Self-attention
Self-attention is the mechanism that allows each token to gather information from other tokens in the sequence. It is the defining operation of the transformer, and the main reason transformers have largely replaced earlier recurrent architectures.
Each input vector is projected into three separate vectors: a query \(q\), a key \(k\), and a value \(v\). These are produced by three separate linear transformations applied to the input. The query and key of each token are compared by a dot product to produce attention scores. Tokens whose keys align closely with a given token’s query receive high attention weights. The output for each token is a weighted sum of the value vectors of all tokens, where the weights come from the attention scores.
For a sequence of \(T\) tokens with representations packed into matrices \(Q\), \(K\), \(V \in \mathbb{R}^{T \times d_k}\), scaled dot-product attention is:
\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]
The scaling by \(\sqrt{d_k}\) prevents the dot products from becoming very large when \(d_k\) is large, which would push the softmax into regions where its gradient is near zero.
Row \(i\) of the attention weights sums to one and tells us how much attention token \(i\) pays to each token in the sequence when computing its output.
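The formula translates directly into code. A minimal sketch of the scaled_dot_product_attention function used in the rest of this session, written to match how it is called below (the boolean mask marks positions that must receive no attention):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q, K, V have shape (..., T, d_k); any leading dimensions
    (batch, heads) broadcast. `mask` is boolean, True = disallowed.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., T, T)
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)  # each row sums to one
    return weights @ V, weights

T, d_k = 5, 16
Q = K = V = torch.randn(T, d_k)
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(dim=-1))  # each row of weights sums to 1
```

Returning the weights alongside the output makes the attention pattern easy to inspect, which the masking examples below rely on.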
Causal masking
In language modelling, the model predicts the next token given only past tokens. When computing attention, a token must not be allowed to attend to future tokens — otherwise training would be trivially easy and the model would learn nothing useful.
Causal masking enforces this by setting attention scores for future positions to \(-\infty\) before the softmax. Since \(e^{-\infty} = 0\), those positions receive exactly zero weight after the softmax.
T = 5
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
mask
The upper triangle is True, meaning those positions are masked. diagonal=1 leaves the main diagonal unmasked, so each token can still attend to itself.
Q = K = V = torch.randn(T, embed_dim)
out, weights = scaled_dot_product_attention(Q, K, V, mask=mask)
weights.round(decimals=2)
The lower-triangular pattern confirms that each token only attends to itself and earlier tokens.
Multi-head attention
Single-head attention computes one weighted combination of value vectors at each position. Multi-head attention runs \(h\) attention operations in parallel, each in a lower-dimensional subspace. Different heads can simultaneously attend to different kinds of relationship: one might track local co-occurrence, another long-range dependencies, another syntactic structure.
The computation has three stages. First, Q, K, and V are each projected through separate learned weight matrices into \(h\) subspaces of dimension \(d_h = C / h\), where \(C\) is the embedding dimension. Second, scaled dot-product attention runs independently in each subspace. Third, the \(h\) outputs are concatenated back to dimension \(C\) and passed through a final linear projection.
For a batched input of shape \((B, T, C)\):
After projection and splitting into heads: Q, K, V each \((B, H, T, C/H)\)
Attention scores per head: \((B, H, T, T)\)
Attention output per head: \((B, H, T, C/H)\)
After concatenating heads and projecting: \((B, T, C)\)
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        H = self.num_heads
        # Project and split into heads: (B, T, C) -> (B, H, T, C/H)
        Q = self.q_proj(x).view(B, T, H, self.head_dim).transpose(1, 2)
        K = self.k_proj(x).view(B, T, H, self.head_dim).transpose(1, 2)
        V = self.v_proj(x).view(B, T, H, self.head_dim).transpose(1, 2)
        # scaled_dot_product_attention broadcasts over B and H
        out, _ = scaled_dot_product_attention(Q, K, V, mask=mask)  # (B, H, T, C/H)
        # Concatenate heads and project back: (B, H, T, C/H) -> (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out)
The .view and .transpose calls do the head-splitting and merging. After projecting Q to shape (B, T, C), .view(B, T, H, head_dim) reinterprets the last axis as \(H\) heads each of width head_dim, and .transpose(1, 2) moves the head axis before the sequence axis so that scaled_dot_product_attention sees independent (T, head_dim) slices per head. After attention, .transpose(1, 2).contiguous().view(B, T, C) reverses this, concatenating the heads back into a single vector of length \(C\) at each position.
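The split-and-merge round trip can be checked in isolation on a small tensor (hypothetical sizes):

```python
import torch

B, T, C, H = 2, 5, 16, 4
head_dim = C // H

x = torch.arange(B * T * C, dtype=torch.float32).reshape(B, T, C)

# Split: (B, T, C) -> (B, T, H, head_dim) -> (B, H, T, head_dim)
heads = x.view(B, T, H, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 4, 5, 4])

# Merge: the exact inverse of the split
merged = heads.transpose(1, 2).contiguous().view(B, T, C)
print(torch.equal(merged, x))  # True
```

Because the merge is the exact inverse of the split, no information is lost; the heads only partition the channel dimension.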
In the forward pass the same tensor x feeds all three projections: in self-attention, the query, key, and value inputs are the same tensor.
Transformer blocks
A transformer block combines multi-head self-attention with a small feedforward network. Two features appear in every block: residual connections and layer normalisation.
A residual connection adds the block’s input directly to its output before normalisation. If the block computes \(f(x)\), the residual output is \(x + f(x)\). Residual connections allow gradients to flow directly back through many layers without vanishing, which makes deep transformers trainable.
Layer normalisation normalises the activations across the embedding dimension for each token independently, keeping training stable.
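Putting the pieces together, a transformer block can be sketched as follows. This sketch uses PyTorch's built-in nn.MultiheadAttention so it is self-contained; the MultiHeadAttention class above would be a drop-in replacement. It follows the residual-then-normalise ordering described above (the original post-norm arrangement):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Multi-head self-attention plus a small feedforward network,
    each wrapped in a residual connection followed by layer norm."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ffwd = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),  # conventional 4x expansion
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x, mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)       # residual add, then normalise
        x = self.ln2(x + self.ffwd(x))   # residual around the feedforward
        return x

block = TransformerBlock(embed_dim=16, num_heads=4)
x = torch.randn(2, 5, 16)
print(block(x).shape)  # torch.Size([2, 5, 16])
```

The block maps \((B, T, C)\) to \((B, T, C)\), which is what lets blocks be stacked \(N\) deep. (GPT-style models actually move the layer norms before each sublayer, a pre-norm variant that trains more stably at depth.)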
GPT (Generative Pre-trained Transformer) is a decoder-only transformer: it consists of a stack of transformer blocks with causal masking, so it can only attend to past tokens. The full model is token embedding + positional embedding, followed by \(N\) transformer blocks, then a final linear layer that projects to logit scores over the vocabulary.