Language Models and Transformers

Author

Mark Andrews

Abstract

We introduce the language modelling task and build up the transformer architecture from its component parts. Starting with tokenisation and embeddings, we implement scaled dot-product attention from scratch, explain causal masking, and assemble a transformer block. Session three builds a complete GPT-style model on these foundations.

The language modelling task

A language model assigns a probability to a sequence of tokens. The standard training objective is next-token prediction: given a sequence of tokens so far, predict the next one. This single objective turns out to be sufficient to learn rich representations of language.

The model sees a sequence \(x_1, x_2, \ldots, x_T\) and must produce a probability distribution over the vocabulary at each position:

\[p(x_{t+1} \mid x_1, x_2, \ldots, x_t)\]
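Multiplying these conditionals across positions recovers the probability of the whole sequence, which is the sense in which a next-token predictor is a language model. This is the standard chain-rule factorisation, written with the same symbols as above; the first factor \(p(x_1)\) is the unconditional probability of the first token.

\[p(x_1, x_2, \ldots, x_T) = p(x_1) \prod_{t=1}^{T-1} p(x_{t+1} \mid x_1, x_2, \ldots, x_t)\]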

Training minimises the cross-entropy between the predicted distribution and the true next token, summed over all positions. The same model that is trained to predict the next token can then generate text by sampling from its own predictions autoregressively, one token at a time.
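The following is a minimal sketch of how one training example is formed, not a definitive training loop. The token ids are illustrative values, and the all-zero logits stand in for an untrained model that predicts a uniform distribution. Note that `nnf_cross_entropy` averages over positions by default, which differs from the sum only by a constant factor.

```r
library(torch)

vocab_size <- 16L
# Hypothetical token ids for a six-token sequence (1-based, as torch for R expects)
ids <- torch_tensor(c(15L, 10L, 1L, 5L, 6L, 2L), dtype = torch_long())

# Inputs are positions 1..T-1; targets are the same sequence shifted by one
inputs  <- ids[1:5]   # what a model would consume (unused by the stand-in below)
targets <- ids[2:6]   # what it must predict at each position

# Stand-in for model output: one row of logits per input position.
# All-zero logits correspond to a uniform predicted distribution.
logits <- torch_zeros(5, vocab_size)

# Mean cross-entropy over the five positions
loss <- nnf_cross_entropy(logits, targets)
loss$item()        # equals log(16) up to floating-point error
log(vocab_size)
```

A uniform predictor therefore starts at a loss of \(\log V\) for a vocabulary of size \(V\), which is a useful baseline to compare against when training begins.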

Tokenisation

Before a model can process text, the text must be converted to a sequence of integers — tokens. The simplest scheme is character-level tokenisation, where each character in the vocabulary is mapped to an integer.

library(torch)

text  <- "To be, or not to be, that is the question."
chars <- sort(unique(strsplit(text, "")[[1]]))
vocab_size <- length(chars)
cat("Vocabulary:", paste(chars, collapse = " "), "\n")

Vocabulary:   , . a b e h i n o q r s t T u

cat("Vocab size:", vocab_size, "\n")

Vocab size: 16

stoi <- setNames(seq_along(chars), chars)
itos <- setNames(chars, as.character(seq_along(chars)))

text_chars <- strsplit(text, "")[[1]]
encoded <- as.integer(stoi[text_chars])
decoded <- paste(itos[as.character(encoded)], collapse = "")

cat(head(encoded, 10), "\n")

15 10 1 5 6 2 1 10 12 1

cat(substr(decoded, 1, 20), "\n")

To be, or not to be,

Character-level tokenisation is simple but produces long sequences, since each character is one token. In practice, modern language models use subword tokenisation methods such as Byte Pair Encoding (BPE), which represents common sequences of characters as single tokens. This reduces sequence length and allows the vocabulary to cover rare words through combinations of subword units. For this session we will use character-level tokenisation, which is sufficient to understand the full architecture.
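To give a feel for how BPE builds its vocabulary, here is a rough base R sketch of a single merge step: count adjacent token pairs and fuse the most frequent pair into one new token. `bpe_merge_once` is a hypothetical helper, not a library function. Real BPE tokenisers repeat this step many times over a large corpus and handle details omitted here.

```r
# One BPE-style merge on a vector of tokens (hypothetical helper, base R only)
bpe_merge_once <- function(tokens) {
  n     <- length(tokens)
  left  <- tokens[-n]
  right <- tokens[-1]

  # Count adjacent pairs; "\u0001" is a separator assumed absent from the text
  counts <- table(paste(left, right, sep = "\u0001"))
  best   <- strsplit(names(counts)[which.max(counts)], "\u0001", fixed = TRUE)[[1]]

  # Rebuild the sequence, replacing each occurrence of the best pair
  merged <- character(0)
  i <- 1
  while (i <= n) {
    if (i < n && tokens[i] == best[1] && tokens[i + 1] == best[2]) {
      merged <- c(merged, paste0(best[1], best[2]))
      i <- i + 2
    } else {
      merged <- c(merged, tokens[i])
      i <- i + 1
    }
  }
  list(tokens = merged, pair = best)
}

text   <- "To be, or not to be, that is the question."
chars  <- strsplit(text, "")[[1]]
result <- bpe_merge_once(chars)

result$pair               # the most frequent adjacent pair: " " followed by "t"
length(chars)             # 42 character tokens before the merge
length(result$tokens)     # 39 tokens after merging three occurrences
```

Each merge shortens the sequence and adds one entry to the vocabulary, which is the trade-off between sequence length and vocabulary size described above.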

Embeddings

Raw token indices are integers with no meaningful geometric relationship. An embedding layer maps each integer to a dense vector in a continuous space. Over the course of training, nearby vectors in that space come to represent tokens that appear in similar contexts.

nn_embedding(num_embeddings, embedding_dim) is a learnable lookup table. Given a tensor of integer token indices, it returns the corresponding rows of the table.

embed_dim  <- 16L
embedding  <- nn_embedding(vocab_size, embed_dim)

token_ids  <- torch_tensor(as.integer(stoi[c("T", "o", " ")]), dtype = torch_long())
embedded   <- embedding(token_ids)
embedded$shape    # [3, 16]: three tokens, each a vector of length 16

[1]  3 16

The embedding vectors are parameters with requires_grad = TRUE and are updated during training like any other weight.
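A small sketch can make the lookup-table behaviour concrete. The sizes and token ids below are arbitrary choices for illustration, and `emb` is a fresh layer rather than the one defined above.

```r
library(torch)

emb <- nn_embedding(num_embeddings = 16, embedding_dim = 4)

emb$weight$shape            # 16 rows (one per token), 4 columns
emb$weight$requires_grad    # TRUE: the table is a trainable parameter

# The same index always returns the same row of the table
ids <- torch_tensor(c(3L, 7L, 3L), dtype = torch_long())
out <- emb(ids)

torch_equal(out[1, ], out[3, ])          # TRUE: both are row 3 of the table
torch_equal(out[1, ], emb$weight[3, ])   # TRUE: lookup is plain row selection
```

Because the lookup is just row selection, only the rows for tokens that actually occur in a batch receive non-zero gradients on that step.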

Positional encoding

A transformer processes all tokens in a sequence simultaneously rather than left to right, and the attention operation at its core treats its inputs as an unordered set. Positional information must therefore be injected explicitly, since the model has no other way to know the order of tokens.

The simplest approach is a learned positional embedding: a second embedding table indexed by position rather than token. The positional embedding for position \(t\) is added to the token embedding for the token at that position.

max_seq_len  <- 64L
pos_embedding <- nn_embedding(max_seq_len, embed_dim)

seq_len   <- token_ids$shape[1]
positions <- torch_arange(1, seq_len, dtype = torch_long())
pos_emb   <- pos_embedding(positions)

x <- embedded + pos_emb    # token embedding + position embedding
x$shape

[1]  3 16

The combined representation carries both what the token is and where it appears in the sequence.

Self-attention

Self-attention is the mechanism that allows each token to gather information from other tokens in the sequence. It is the defining operation of the transformer, and the main reason transformers have largely replaced earlier recurrent architectures.

Each input vector is projected into three separate vectors: a query \(q\), a key \(k\), and a value \(v\). The query and key of each token are compared by a dot product to produce attention scores. Tokens whose keys align closely with a given token’s query receive high attention weights. The output for each token is a weighted sum of the value vectors of all tokens, where the weights come from the attention scores.

For a sequence of \(T\) tokens with representations packed into matrices \(Q\), \(K\), \(V \in \mathbb{R}^{T \times d_k}\), scaled dot-product attention is:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V\]

Dividing by \(\sqrt{d_k}\) keeps the scores on a similar scale as \(d_k\) grows. Without it, the dot products become large in magnitude when \(d_k\) is large, which would push the softmax into saturated regions where its gradient is near zero.
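A quick simulation in base R illustrates the effect. It assumes, purely for illustration, that query and key components are independent draws from a standard normal distribution; under that assumption the dot product of two \(d_k\)-dimensional vectors has standard deviation \(\sqrt{d_k}\), and the scaled version has standard deviation close to one.

```r
set.seed(1)
d_k <- 64
n   <- 10000

# n random query/key pairs, one pair per row
q <- matrix(rnorm(n * d_k), nrow = n)
k <- matrix(rnorm(n * d_k), nrow = n)

dots <- rowSums(q * k)       # one dot product per row

sd(dots)                     # close to sqrt(64) = 8
sd(dots / sqrt(d_k))         # close to 1
```

Learned projections will not be exactly standard normal, so this is only a guide to the order of magnitude, not a guarantee.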

scaled_dot_product_attention <- function(Q, K, V, mask = NULL) {
  d_k    <- Q$shape[length(Q$shape)]
  scores <- Q$matmul(K$transpose(-2, -1)) / sqrt(d_k)
  if (!is.null(mask)) {
    scores <- scores$masked_fill(mask, -Inf)
  }
  weights <- nnf_softmax(scores, dim = -1)
  list(weights$matmul(V), weights)
}

We can try this on a small example: three tokens whose query, key, and value are all set to x, the sum of their token and positional embeddings.

Q <- K <- V <- x   # [3, 16]
result  <- scaled_dot_product_attention(Q, K, V)
out     <- result[[1]]
weights <- result[[2]]
out$shape     # [3, 16]: one output vector per token

[1]  3 16

weights   # [3, 3]: row i holds the attention weights from token i to every token

torch_tensor
 9.9833e-01  5.8591e-04  1.0806e-03
 6.8523e-07  9.9995e-01  4.4754e-05
 2.1547e-03  7.6302e-02  9.2154e-01
[ CPUFloatType{3,3} ][ grad_fn = <SoftmaxBackward0> ]

Each row of weights sums to one and tells us how much attention token \(i\) pays to every token when computing its output.
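Nothing in this computation depends on the order of the rows, which is why the positional embeddings introduced earlier are needed. The base R sketch below is independent of torch and uses two hypothetical helpers, `softmax_rows` and `attention_base`, with Q, K, and V all equal to the input. It suggests that permuting the input tokens merely permutes the outputs in the same way.

```r
# Row-wise softmax, subtracting each row's maximum for numerical stability
softmax_rows <- function(m) {
  e <- exp(m - apply(m, 1, max))
  e / rowSums(e)
}

# Scaled dot-product self-attention with Q = K = V = X (no learned projections)
attention_base <- function(X) {
  weights <- softmax_rows(X %*% t(X) / sqrt(ncol(X)))
  weights %*% X
}

set.seed(42)
X    <- matrix(rnorm(3 * 4), nrow = 3)   # three tokens, embedding size 4
perm <- c(3, 1, 2)                       # reorder the tokens

out      <- attention_base(X)
out_perm <- attention_base(X[perm, ])

# Shuffling the inputs shuffles the outputs identically: attention alone
# cannot tell the two orderings apart
all.equal(out[perm, ], out_perm)   # TRUE
```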

Causal masking

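In language modelling, the model predicts the next token given only past tokens. When computing attention, a token must not be allowed to attend to future tokens. Otherwise the model could read the token it is supposed to predict directly from its input, making training trivially easy, and it would learn nothing useful for generation, where future tokens do not yet exist.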

Causal masking enforces this by setting attention scores for future positions to \(-\infty\) before the softmax. After the softmax, \(-\infty\) becomes 0, so those positions receive no weight.
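The effect of \(-\infty\) on the softmax can be checked with plain arithmetic. This base R sketch uses a hypothetical `softmax` helper and made-up scores for a single row.

```r
softmax <- function(z) {
  e <- exp(z - max(z))   # subtract the maximum for numerical stability
  e / sum(e)
}

scores <- c(2, 1, 0.5)
softmax(scores)          # all three positions receive some weight

masked <- scores
masked[3] <- -Inf        # treat position 3 as a future token
softmax(masked)          # 0.731 0.269 0.000: the remaining weights renormalise
```

The masked position gets exactly zero weight because exp(-Inf) is 0, and the other weights still sum to one. A row in which every score is \(-\infty\) would instead produce NaN, so every row needs at least one unmasked position.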

T_len <- 5L
mask  <- torch_triu(torch_ones(T_len, T_len), diagonal = 1)$to(dtype = torch_bool())
mask

torch_tensor
 0  1  1  1  1
 0  0  1  1  1
 0  0  0  1  1
 0  0  0  0  1
 0  0  0  0  0
[ CPUBoolType{5,5} ]

The upper triangle is TRUE, meaning those positions are masked. diagonal = 1 leaves the main diagonal unmasked, so each token can still attend to itself.

Q <- K <- V <- torch_randn(T_len, embed_dim)
result  <- scaled_dot_product_attention(Q, K, V, mask = mask)
weights <- result[[2]]
round(as.array(weights), 2)

     [,1] [,2] [,3] [,4] [,5]
[1,] 1.00 0.00 0.00 0.00 0.00
[2,] 0.02 0.98 0.00 0.00 0.00
[3,] 0.19 0.03 0.78 0.00 0.00
[4,] 0.02 0.04 0.05 0.89 0.00
[5,] 0.00 0.00 0.00 0.00 0.99

The lower-triangular pattern confirms that each token attends only to itself and earlier tokens.

Multi-head attention

Single-head attention computes one weighted combination of value vectors at each position. Multi-head attention runs \(h\) attention operations in parallel, each in a lower-dimensional subspace. Different heads can simultaneously attend to different kinds of relationship: one might track local co-occurrence, another long-range dependencies, another syntactic structure.

The computation has three stages. First, Q, K, and V are each projected through separate learned weight matrices into \(h\) subspaces of dimension \(d_h = C / h\), where \(C\) is the embedding dimension. Second, scaled dot-product attention runs independently in each subspace. Third, the \(h\) outputs are concatenated back to dimension \(C\) and passed through a final linear projection.

MultiHeadAttention <- nn_module(
  initialize = function(embed_dim, num_heads) {
    stopifnot(embed_dim %% num_heads == 0)
    self$num_heads <- num_heads
    self$head_dim  <- embed_dim %/% num_heads
    self$q_proj    <- nn_linear(embed_dim, embed_dim)
    self$k_proj    <- nn_linear(embed_dim, embed_dim)
    self$v_proj    <- nn_linear(embed_dim, embed_dim)
    self$out_proj  <- nn_linear(embed_dim, embed_dim)
  },
  forward = function(x, mask = NULL) {
    B     <- x$shape[1]
    T_seq <- x$shape[2]   # sequence length; named T_seq to avoid masking R's built-in T (TRUE)
    C     <- x$shape[3]
    H     <- self$num_heads
    # Project and split into heads: [B, T, C] -> [B, H, T, C/H]
    Q <- self$q_proj(x)$view(c(B, T_seq, H, self$head_dim))$transpose(2, 3)
    K <- self$k_proj(x)$view(c(B, T_seq, H, self$head_dim))$transpose(2, 3)
    V <- self$v_proj(x)$view(c(B, T_seq, H, self$head_dim))$transpose(2, 3)
    result <- scaled_dot_product_attention(Q, K, V, mask = mask)
    out    <- result[[1]]   # [B, H, T, C/H]
    # Concatenate heads: [B, H, T, C/H] -> [B, T, C]
    out <- out$transpose(2, 3)$contiguous()$view(c(B, T_seq, C))
    self$out_proj(out)
  }
)

mha_hand <- MultiHeadAttention(embed_dim = embed_dim, num_heads = 4L)
x_batch  <- torch_randn(2, T_len, embed_dim)
out      <- mha_hand(x_batch, mask = mask)
out$shape    # [2, 5, 16]: batch=2, seq=5, embed=16

[1]  2  5 16
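The head split and merge inside forward are pure reshapes: no values are changed, only regrouped. The sketch below isolates those two steps on a random tensor, with sizes chosen to match the example above. The variable names are illustrative.

```r
library(torch)

B <- 2; T_seq <- 5; C <- 16; H <- 4
x <- torch_randn(B, T_seq, C)

# Split the last dimension into H heads of size C/H, then move heads forward
heads <- x$view(c(B, T_seq, H, C %/% H))$transpose(2, 3)
heads$shape                                  # 2 4 5 4

# Head 1 holds the first C/H channels of every token
torch_equal(heads[, 1, , ], x[, , 1:4])      # TRUE

# Merging the heads again recovers the original tensor exactly
back <- heads$transpose(2, 3)$contiguous()$view(c(B, T_seq, C))
torch_equal(x, back)                         # TRUE
```

The call to contiguous() is needed because transpose returns a view with rearranged strides, and view requires a compatible memory layout.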

torch’s nn_multihead_attention implements the same computation with additional optimisations.

mha       <- nn_multihead_attention(embed_dim = embed_dim, num_heads = 4L, batch_first = TRUE)
attn_mask <- torch_triu(torch_ones(T_len, T_len), diagonal = 1L)$to(dtype = torch_bool())

result  <- mha(x_batch, x_batch, x_batch, attn_mask = attn_mask)
out     <- result[[1]]
out$shape

[1]  2  5 16

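The three x_batch arguments are the query, key, and value inputs, which for self-attention are all the same tensor. The module returns a list whose first element is the attention output and whose second is the attention weights, hence result[[1]].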

Transformer blocks

A transformer block combines multi-head self-attention with a small feedforward network. Two features appear in every block: residual connections and layer normalisation.

A residual connection adds a sub-layer’s input directly to its output before normalisation. If the sub-layer (attention or feedforward) computes \(f(x)\), the residual output is \(x + f(x)\). Residual connections allow gradients to flow directly back through many layers without vanishing, which makes deep transformers trainable.

Layer normalisation normalises the activations across the embedding dimension for each token independently.
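The arithmetic can be sketched in base R. `layer_norm` here is a hypothetical helper covering only the normalisation step: each token's vector is shifted to zero mean and scaled to unit variance. The value of eps is an assumption chosen to match the usual torch default. nn_layer_norm additionally applies a learnable per-feature scale and shift, which start at 1 and 0 and so have no effect before training.

```r
layer_norm <- function(x, eps = 1e-5) {
  mu <- mean(x)
  v  <- mean((x - mu)^2)      # variance with divisor n, not n - 1
  (x - mu) / sqrt(v + eps)
}

set.seed(7)
# Three tokens (rows), each with an 8-dimensional activation vector
acts <- matrix(rnorm(3 * 8, mean = 5, sd = 2), nrow = 3)

# Normalise each token independently, across its embedding dimension
normed <- t(apply(acts, 1, layer_norm))

round(rowMeans(normed), 6)      # approximately 0 for every token
round(rowMeans(normed^2), 4)    # approximately 1 for every token
```

Because the statistics are computed per token, the result does not depend on the other tokens in the sequence or on the other sequences in the batch.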

TransformerBlock <- nn_module(
  initialize = function(embed_dim, num_heads, ff_dim) {
    self$attn  <- nn_multihead_attention(embed_dim, num_heads, batch_first = TRUE)
    self$ff    <- nn_sequential(
      nn_linear(embed_dim, ff_dim),
      nn_relu(),
      nn_linear(ff_dim, embed_dim)
    )
    self$norm1 <- nn_layer_norm(embed_dim)
    self$norm2 <- nn_layer_norm(embed_dim)
  },
  forward = function(x, attn_mask = NULL) {
    attn_out <- self$attn(x, x, x, attn_mask = attn_mask)[[1]]
    x <- self$norm1(x + attn_out)          # residual + normalise
    x <- self$norm2(x + self$ff(x))        # residual + normalise
    x
  }
)

block <- TransformerBlock(embed_dim = embed_dim, num_heads = 4L, ff_dim = 64L)
x_batch <- torch_randn(2, T_len, embed_dim)
block(x_batch, attn_mask = attn_mask)$shape

[1]  2  5 16

The GPT architecture

GPT (Generative Pre-trained Transformer) is a decoder-only transformer: it consists of a stack of transformer blocks with causal masking, so each position can attend only to itself and earlier positions. The full model is token embedding + positional embedding, followed by \(N\) transformer blocks, a final layer normalisation, and then a linear layer that projects to logit scores over the vocabulary.

input tokens → token embeddings + positional embeddings
             → transformer block 1
             → transformer block 2
             → ...
             → transformer block N
             → layer norm
             → linear layer → logits over vocabulary

Session three implements this in full and trains it on a text corpus.