Day Two, Session Three

Training a Language Model and Pretrained Transformers

Mark Andrews

What we are building

  • We assemble all the components from the previous session into a working GPT-style model.
  • We train it on a character-level corpus from scratch.
  • We implement temperature and top-k sampling for text generation.
  • We close by examining what pretrained models provide and how to use them.

Character-level language modelling

  • Vocabulary: the set of distinct characters in the training text — typically 60–70 characters.
  • Input: a sequence of character indices of fixed length seq_len.
  • Target: the same sequence shifted by one position.
  • The model must predict each next character given the preceding context.

Why character-level? No tokeniser required; vocabulary is completely transparent; the model’s sampled output provides immediate, interpretable feedback.

Preparing training data

  • Encode the full text as a sequence of integers.
  • Extract overlapping windows of length seq_len to form (input, target) pairs.
  • Input at position \(t\) is the sequence \(x_{t-T}, \ldots, x_{t-1}\); target is \(x_{t-T+1}, \ldots, x_t\).
  • Wrap in a TensorDataset and serve in batches with DataLoader.
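The steps above can be sketched as follows, using a toy corpus (any training text works the same way):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

text = "hello world, hello world"          # toy stand-in for the training corpus
chars = sorted(set(text))                   # vocabulary: the distinct characters
stoi = {c: i for i, c in enumerate(chars)}  # character -> integer index
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

seq_len = 8
# Overlapping windows: input covers positions i..i+seq_len-1,
# target is the same window shifted one position to the right.
inputs = torch.stack([data[i : i + seq_len]
                      for i in range(len(data) - seq_len)])
targets = torch.stack([data[i + 1 : i + 1 + seq_len]
                       for i in range(len(data) - seq_len)])

dataset = TensorDataset(inputs, targets)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
```

Each `(input, target)` pair differs only by the one-position shift, so every position in the window supplies a next-character training signal.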

Training a language model

  • Instantiate the GPT model with small hyperparameters for CPU training.
  • The model outputs logits of shape \((B, T, \text{vocab\_size})\).
  • Reshape to \((B \times T, \text{vocab\_size})\) before computing cross-entropy against targets of shape \((B \times T,)\).
  • The same four-step loop: zero grad, forward, backward, step.

At each position the model is simultaneously learning to predict the next character from all prefix lengths up to \(T\).
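A minimal sketch of the reshape and the four-step loop. The real GPT model is replaced here by a trivial embedding-plus-linear stand-in (an assumption for brevity); any module producing \((B, T, \text{vocab\_size})\) logits slots in the same way:

```python
import torch
import torch.nn as nn

B, T, vocab_size = 4, 8, 65

# Stand-in for the GPT model: embedding + linear, NOT the real transformer.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

x = torch.randint(0, vocab_size, (B, T))    # input character indices
y = torch.randint(0, vocab_size, (B, T))    # shifted targets

optimizer.zero_grad()                        # 1. zero grad
logits = model(x)                            # 2. forward: (B, T, vocab_size)
# Cross-entropy expects (N, C) logits and (N,) targets, hence the reshape.
loss = criterion(logits.view(B * T, vocab_size), y.view(B * T))
loss.backward()                              # 3. backward
optimizer.step()                             # 4. step
```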

Generating text

  1. Feed a seed sequence into the model.
  2. Take the logits at the last position.
  3. Convert to a probability distribution.
  4. Sample the next token.
  5. Append it to the sequence and repeat.

The model generates one character at a time, feeding its own output back as the next input.
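The five-step loop can be sketched as a single function. The model here is again a toy embedding-plus-linear stand-in (an assumption); the loop itself is the same for the trained GPT:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def generate(model, seed, n_new, seq_len):
    """Autoregressive sampling: the model's own output is fed back as input.
    Assumes `model` maps (1, T) indices to (1, T, vocab_size) logits."""
    seq = seed.clone()
    for _ in range(n_new):
        context = seq[:, -seq_len:]               # 1. feed the (truncated) sequence
        logits = model(context)[:, -1, :]         # 2. logits at the last position
        probs = torch.softmax(logits, dim=-1)     # 3. probability distribution
        nxt = torch.multinomial(probs, 1)         # 4. sample the next token
        seq = torch.cat([seq, nxt], dim=1)        # 5. append and repeat
    return seq

vocab_size = 65
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
seed = torch.randint(0, vocab_size, (1, 4))
out = generate(model, seed, n_new=10, seq_len=8)
```

Truncating the context to the last `seq_len` tokens matters because the model was only trained on windows of that length.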

Temperature

Temperature \(\tau\) scales logits before the softmax:

\[p_i = \frac{e^{z_i/\tau}}{\sum_j e^{z_j/\tau}}\]

  • \(\tau > 1\): flattens the distribution — more varied, more random output.
  • \(\tau < 1\): sharpens the distribution — more repetitive, more conservative output.
  • \(\tau = 1\): the model’s native distribution.

Top-k sampling

  • At each step, restrict sampling to the \(k\) most probable tokens.
  • All other tokens are assigned zero probability before sampling.
  • Prevents rare or incoherent tokens from appearing while retaining variability.
  • Top-p (nucleus) sampling is a related approach: include the smallest set of tokens whose cumulative probability exceeds \(p\).

Both strategies are composable with temperature.
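A sketch of top-k filtering composed with temperature: masked-out logits are set to \(-\infty\), so the softmax assigns them exactly zero probability before sampling.

```python
import torch

def top_k_sample(logits, k, tau=1.0):
    """Restrict sampling to the k most probable tokens; tau scales first."""
    logits = logits / tau
    topk = torch.topk(logits, k)
    filtered = torch.full_like(logits, float("-inf"))
    filtered[topk.indices] = topk.values       # keep only the k largest logits
    probs = torch.softmax(filtered, dim=-1)    # all other tokens get zero mass
    return torch.multinomial(probs, 1).item()
```

With `k=2`, only the two highest-scoring tokens can ever be drawn, however many samples are taken.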

From scratch to pretrained

Training from scratch on a short corpus produces a model that memorises patterns in that corpus only. Pretrained models have been trained on hundreds of billions of tokens and encode far richer representations.

  Model           Parameters   Trained on
  GPT-2 (small)   117M         ~40 GB of text
  BERT (base)     110M         Books + Wikipedia
  DistilBERT      66M          Same corpus (distilled from BERT)

The HuggingFace transformers library provides all of these ready to use.

The HuggingFace ecosystem

  • pipeline: bundles tokeniser, model, and post-processing into one callable. Easiest entry point.
  • AutoTokenizer, AutoModel: load any model by name; the correct class is detected automatically.
  • AutoModelForSequenceClassification: adds a classification head for labelled tasks.
  • Fine-tuning: continue training a pretrained model on a new dataset with a small learning rate.
  • The Trainer API handles the training loop, evaluation, checkpointing, and logging.