Day Two, Session Three

Training a Language Model and Pretrained Transformers

Mark Andrews

What we are building

  • We assemble all the components from the previous session into a working GPT-style model.
  • We train it on a character-level corpus from scratch.
  • We implement temperature and top-k sampling for text generation.
  • We close by examining what pretrained models provide and how to use them.

Character-level language modelling

  • Vocabulary: the set of distinct characters in the training text — typically 60–70 characters.
  • Input: a sequence of character indices of fixed length seq_len.
  • Target: the same sequence shifted by one position.
  • The model must predict each next character given the preceding context.

Why character-level? No tokeniser required; vocabulary is completely transparent; the model’s sampled output provides immediate, interpretable feedback.

Preparing training data

  • Encode the full text as a sequence of integers.
  • Extract overlapping windows of length seq_len to form (input, target) pairs.
  • Input at position \(t\) is the sequence \(x_{t-T}, \ldots, x_{t-1}\); target is \(x_{t-T+1}, \ldots, x_t\).
  • Wrap in a TensorDataset and serve in batches with DataLoader.
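The steps above can be sketched as follows, using a toy corpus (any training text works the same way):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

text = "hello world, hello world"          # toy stand-in for the training corpus
chars = sorted(set(text))                   # vocabulary: the distinct characters
stoi = {c: i for i, c in enumerate(chars)}  # character -> integer index
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

seq_len = 8
# Overlapping windows: input covers positions i..i+seq_len-1,
# target is the same window shifted one position to the right.
inputs = torch.stack([data[i : i + seq_len]
                      for i in range(len(data) - seq_len)])
targets = torch.stack([data[i + 1 : i + 1 + seq_len]
                       for i in range(len(data) - seq_len)])

dataset = TensorDataset(inputs, targets)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
```

Each `(input, target)` pair differs only by the one-position shift, so every position in the window supplies a next-character training signal.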

Training a language model

  • Instantiate the GPT model with small hyperparameters for CPU training.
  • The model outputs logits of shape \((B, T, \text{vocab\_size})\).
  • Reshape to \((B \times T, \text{vocab\_size})\) before computing cross-entropy against targets of shape \((B \times T,)\).
  • The same four-step loop: zero grad, forward, backward, step.

At each position the model is simultaneously learning to predict the next character from all prefix lengths up to \(T\).
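A minimal sketch of the reshape and the four-step loop. The real GPT model is replaced here by a trivial embedding-plus-linear stand-in (an assumption for brevity); any module producing \((B, T, \text{vocab\_size})\) logits slots in the same way:

```python
import torch
import torch.nn as nn

B, T, vocab_size = 4, 8, 65

# Stand-in for the GPT model: embedding + linear, NOT the real transformer.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

x = torch.randint(0, vocab_size, (B, T))    # input character indices
y = torch.randint(0, vocab_size, (B, T))    # shifted targets

optimizer.zero_grad()                        # 1. zero grad
logits = model(x)                            # 2. forward: (B, T, vocab_size)
# Cross-entropy expects (N, C) logits and (N,) targets, hence the reshape.
loss = criterion(logits.view(B * T, vocab_size), y.view(B * T))
loss.backward()                              # 3. backward
optimizer.step()                             # 4. step
```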

Generating text

  1. Feed a seed sequence into the model.
  2. Take the logits at the last position.
  3. Convert to a probability distribution.
  4. Sample the next token.
  5. Append it to the sequence and repeat.

The model generates one character at a time, feeding its own output back as the next input.
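The five-step loop can be sketched as a single function. The model here is again a toy embedding-plus-linear stand-in (an assumption); the loop itself is the same for the trained GPT:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def generate(model, seed, n_new, seq_len):
    """Autoregressive sampling: the model's own output is fed back as input.
    Assumes `model` maps (1, T) indices to (1, T, vocab_size) logits."""
    seq = seed.clone()
    for _ in range(n_new):
        context = seq[:, -seq_len:]               # 1. feed the (truncated) sequence
        logits = model(context)[:, -1, :]         # 2. logits at the last position
        probs = torch.softmax(logits, dim=-1)     # 3. probability distribution
        nxt = torch.multinomial(probs, 1)         # 4. sample the next token
        seq = torch.cat([seq, nxt], dim=1)        # 5. append and repeat
    return seq

vocab_size = 65
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
seed = torch.randint(0, vocab_size, (1, 4))
out = generate(model, seed, n_new=10, seq_len=8)
```

Truncating the context to the last `seq_len` tokens matters because the model was only trained on windows of that length.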

Temperature

Temperature \(\tau\) scales logits before the softmax:

\[p_i = \frac{e^{z_i/\tau}}{\sum_j e^{z_j/\tau}}\]

  • \(\tau > 1\): flattens the distribution — more varied, more random output.
  • \(\tau < 1\): sharpens the distribution — more repetitive, more conservative output.
  • \(\tau = 1\): the model’s native distribution.

Top-k sampling

  • At each step, restrict sampling to the \(k\) most probable tokens.
  • All other tokens are assigned zero probability before sampling.
  • Prevents rare or incoherent tokens from appearing while retaining variability.
  • Top-p (nucleus) sampling is a related approach: include the smallest set of tokens whose cumulative probability exceeds \(p\).

Both strategies are composable with temperature.
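A sketch of top-k filtering composed with temperature: masked-out logits are set to \(-\infty\), so the softmax assigns them exactly zero probability before sampling.

```python
import torch

def top_k_sample(logits, k, tau=1.0):
    """Restrict sampling to the k most probable tokens; tau scales first."""
    logits = logits / tau
    topk = torch.topk(logits, k)
    filtered = torch.full_like(logits, float("-inf"))
    filtered[topk.indices] = topk.values       # keep only the k largest logits
    probs = torch.softmax(filtered, dim=-1)    # all other tokens get zero mass
    return torch.multinomial(probs, 1).item()
```

With `k=2`, only the two highest-scoring tokens can ever be drawn, however many samples are taken.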

From scratch to pretrained

Training from scratch on a short corpus produces a model that memorises patterns in that corpus only. Pretrained models have been trained on hundreds of billions of tokens and encode far richer representations.

  Model           Parameters   Trained on
  GPT-2 (small)   117M         ~40 GB of text
  BERT (base)     110M         Books + Wikipedia
  DistilBERT      66M          Same corpus (distilled from BERT)

The HuggingFace transformers library provides all of these ready to use.

The HuggingFace ecosystem

  • pipeline: bundles tokeniser, model, and post-processing into one callable. Easiest entry point.
  • AutoTokenizer, AutoModel: load any model by name; the correct class is detected automatically.
  • AutoModelForSequenceClassification: adds a classification head for labelled tasks.
  • Fine-tuning: continue training a pretrained model on a new dataset with a small learning rate.
  • The Trainer API handles the training loop, evaluation, checkpointing, and logging.