Implementing a GPT Language Model

Author

Mark Andrews

Abstract

We implement a minimal GPT-style language model from scratch in PyTorch and train it on a character-level text corpus. We then cover text generation strategies: temperature scaling, top-k, and top-p sampling. The topic closes with the Hugging Face Transformers library, covering the pipeline API, AutoTokenizer and AutoModel, and text classification with a pre-trained model.

A minimal GPT

We assemble the components from the previous topic into a complete GPT-style model. The model takes a sequence of token indices as input and produces a distribution over the vocabulary at each position, from which the next token can be sampled.

We reuse the TransformerBlock defined in the previous topic.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.attn  = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff    = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x

The GPT class adds token embeddings, positional embeddings, a stack of transformer blocks, a final layer normalisation, and a linear head that projects to vocabulary logits. The causal mask is constructed inside forward from the current sequence length, so the model handles sequences of any length up to max_seq_len.

class GPT(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_seq_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
        self.blocks  = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, ff_dim=4 * embed_dim)
            for _ in range(num_layers)
        ])
        self.ln      = nn.LayerNorm(embed_dim)
        self.head    = nn.Linear(embed_dim, vocab_size, bias=False)
        self.max_seq_len = max_seq_len

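    def forward(self, idx):
        B, T = idx.shape                                         # batch size, sequence length
        tok = self.tok_emb(idx)                                  # (B, T, embed_dim)
        pos = self.pos_emb(torch.arange(T, device=idx.device))   # (T, embed_dim), broadcast over the batch
        x = tok + pos
        # Boolean causal mask: True marks future positions that must not be attended to
        mask = torch.triu(torch.ones(T, T, device=idx.device), diagonal=1).bool()
        for block in self.blocks:
            x = block(x, attn_mask=mask)
        x = self.ln(x)
        return self.head(x)                                      # (B, T, vocab_size) logits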
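
To see what the causal mask built inside forward looks like, here is a minimal sketch for an illustrative sequence length of 4. The uniform scores are made up for the example; they only show that masked positions receive zero attention weight after the softmax.

```python
import torch

T = 4  # illustrative sequence length

# True above the diagonal: position t may not attend to any position after t
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
print(mask)

# With uniform (all-zero) scores, masked entries are set to -inf,
# so the softmax spreads weight only over the current and earlier positions
scores = torch.zeros(T, T).masked_fill(mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)
print(weights)
```

Row t of weights is non-zero only in columns 0 to t, which is what prevents the model from looking ahead.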

Preparing text data

We use the Tiny Shakespeare dataset, a widely used benchmark for character-level language models.

import urllib.request

url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
with urllib.request.urlopen(url) as f:
    full_text = f.read().decode('utf-8')

print(f"Full corpus: {len(full_text):,} characters")
print(full_text[:200])

Full corpus: 1,115,394 characters
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you

The full corpus is about 1 million characters. For this rendered document we use a short excerpt to keep execution time manageable. In a live session, use the full text for a model that learns more coherent Shakespeare-like patterns.

text = full_text[:10_000]   # use first 10k characters for rendering
                             # replace with full_text in a live session

We build the character vocabulary and encode the corpus as a sequence of integers.

chars = sorted(set(text))
vocab_size = len(chars)
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
print(f"Vocab size: {vocab_size}")
print(f"Encoded length: {len(data):,}")

Vocab size: 57
Encoded length: 10,000
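
The mapping between characters and integers can be checked with a round trip. The sketch below uses a made-up toy string rather than the corpus, and the helper names encode and decode are hypothetical, introduced only for illustration.

```python
toy_text = "to be or not to be"          # toy stand-in for the corpus
toy_chars = sorted(set(toy_text))
toy_stoi = {c: i for i, c in enumerate(toy_chars)}
toy_itos = {i: c for c, i in toy_stoi.items()}

def encode(s, stoi):
    """Map each character to its integer index."""
    return [stoi[c] for c in s]

def decode(ids, itos):
    """Map integer indices back to a string."""
    return ''.join(itos[i] for i in ids)

ids = encode("not to be", toy_stoi)
print(ids)
print(decode(ids, toy_itos))
```

A character that never occurs in the training text has no entry in the vocabulary, so encoding it raises a KeyError. This matters when choosing a seed string for generation.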

We create input-target pairs by extracting overlapping windows of length seq_len. Each target window is the input window shifted one character ahead, so at every position the model's job is to predict the next token: it predicts y from X.

seq_len = 64
n = len(data) - seq_len

X = torch.stack([data[i : i + seq_len]         for i in range(n)])
y = torch.stack([data[i + 1 : i + seq_len + 1] for i in range(n)])

X.shape, y.shape

(torch.Size([9936, 64]), torch.Size([9936, 64]))
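
The one-step shift between inputs and targets is easier to see on a tiny example. This sketch uses the integers 0 to 9 as a stand-in for the encoded corpus and a window length of 4; both values are arbitrary choices for illustration.

```python
import torch

toy = torch.arange(10)           # stand-in for the encoded corpus
toy_len = 4                      # illustrative window length
n_toy = len(toy) - toy_len       # number of windows with a full target

X_toy = torch.stack([toy[i : i + toy_len]         for i in range(n_toy)])
y_toy = torch.stack([toy[i + 1 : i + toy_len + 1] for i in range(n_toy)])

print(X_toy[0], y_toy[0])        # the target is the input shifted by one step
print(X_toy.shape, y_toy.shape)
```

Because consecutive windows overlap in all but one position, the windows are highly redundant. That is convenient for a small demonstration but means one epoch revisits each character many times.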

from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(X, y)
loader  = DataLoader(dataset, batch_size=64, shuffle=True)

Training

We instantiate a model small enough to train on a CPU in a short time.

model = GPT(
    vocab_size  = vocab_size,
    embed_dim   = 64,
    num_heads   = 4,
    num_layers  = 2,
    max_seq_len = seq_len,
)

sum(p.numel() for p in model.parameters())
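
The parameter count can also be worked out by hand from the layer shapes. The sketch below is a back-of-the-envelope calculation, assuming the GPT and TransformerBlock definitions above (ff_dim = 4 * embed_dim, biases everywhere except the output head, and the packed query/key/value projection used by nn.MultiheadAttention). The function name gpt_param_count is hypothetical.

```python
def gpt_param_count(vocab_size, embed_dim, num_layers, max_seq_len):
    E, F = embed_dim, 4 * embed_dim
    tok_emb = vocab_size * E                       # token embedding table
    pos_emb = max_seq_len * E                      # positional embedding table
    attn = (3 * E * E + 3 * E) + (E * E + E)       # packed q/k/v projection + output projection
    ff = (E * F + F) + (F * E + E)                 # two linear layers with biases
    norms = 2 * (2 * E)                            # two LayerNorms, each with weight and bias
    block = attn + ff + norms
    final_ln = 2 * E
    head = E * vocab_size                          # bias=False
    return tok_emb + pos_emb + num_layers * block + final_ln + head

print(gpt_param_count(vocab_size=57, embed_dim=64, num_layers=2, max_seq_len=64))
```

With the settings used here this gives 111,488 parameters. The number of heads does not appear in the formula: splitting the embedding across heads changes how the projections are used, not how many weights they contain.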

The training loop has the same structure as the one from day one. The only difference is that the loss is averaged over every position in every sequence, rather than over one output per sample.

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)

for epoch in range(5):
    model.train()
    epoch_loss = 0
    for xb, yb in loader:
        optimizer.zero_grad()
        logits = model(xb)                          # (B, T, vocab_size)
        loss = criterion(
            logits.view(-1, vocab_size),            # (B*T, vocab_size)
            yb.view(-1)                             # (B*T,)
        )
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"Epoch {epoch+1}: loss={epoch_loss/len(loader):.4f}")

Epoch 1: loss=2.4347
Epoch 2: loss=1.7955
Epoch 3: loss=1.2408
Epoch 4: loss=0.7726
Epoch 5: loss=0.4705

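The logits tensor has shape (B, T, vocab_size). We reshape it to (B×T, vocab_size), and the targets to (B×T,), before passing them to CrossEntropyLoss, which in its basic form expects a 2D tensor of scores and a 1D tensor of target indices. With the default mean reduction, the resulting loss is the average over all B×T positions.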
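
Two properties of this loss are worth checking on toy tensors. The sketch below uses made-up shapes (B=2, T=3, V=5) and names prefixed with toy_ to avoid clashing with the training code. It shows that a model predicting a uniform distribution scores ln(V), and that flattening is equivalent to moving the class dimension to position 1, which CrossEntropyLoss also accepts.

```python
import math
import torch
import torch.nn as nn

B, T, V = 2, 3, 5                                  # illustrative sizes
toy_criterion = nn.CrossEntropyLoss()
toy_targets = torch.randint(0, V, (B, T))

# All-zero logits correspond to a uniform distribution over V classes
uniform_logits = torch.zeros(B, T, V)
uniform_loss = toy_criterion(uniform_logits.view(-1, V), toy_targets.view(-1))
print(uniform_loss.item(), math.log(V))

# Flattened (B*T, V) form versus the (B, V, T) form with classes in dim 1
rand_logits = torch.randn(B, T, V)
loss_flat = toy_criterion(rand_logits.view(-1, V), toy_targets.view(-1))
loss_perm = toy_criterion(rand_logits.permute(0, 2, 1), toy_targets)
print(loss_flat.item(), loss_perm.item())
```

For the 57-character vocabulary used here, the uniform baseline is ln(57), roughly 4.04, which gives a reference point for reading the training losses.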

Text generation

To generate text, we feed a seed sequence into the model, sample the next token from the predicted distribution, append it to the sequence, and repeat.

def generate(model, seed_ids, max_new_tokens, temperature=1.0, top_k=None):
    model.eval()
    idx = seed_ids.unsqueeze(0)                     # (1, T)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -model.max_seq_len:]
            logits = model(idx_cond)[:, -1, :]      # logits for last position
            logits = logits / temperature
            if top_k is not None:
                top_vals, _ = torch.topk(logits, top_k)
                logits[logits < top_vals[:, -1:]] = float('-inf')
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, 1)
            idx = torch.cat([idx, next_id], dim=1)
    return idx.squeeze(0)

seed = torch.tensor([stoi[c] for c in "ROMEO:"])
ids  = generate(model, seed, max_new_tokens=200, temperature=1.0)
print(''.join(itos[i.item()] for i in ids))

ROMEO:
Confesome for usurers; repeady: th likn to their recie love that endeach,
Yet are the couls, in in in repetits and his not noble sot the patring one at that
hatheirs receive lon. What's the more, you

The output is not coherent Shakespeare, because the model was trained on only a small excerpt. With the full corpus and more training time, the model produces recognisable dialogue structure.

Temperature

Temperature \(\tau\) scales the logits before the softmax. High temperature (\(\tau > 1\)) flattens the distribution, making all tokens more equally likely and output more random. Low temperature (\(\tau < 1\)) sharpens the distribution, making the most likely tokens even more dominant and output more repetitive.

\[p_i = \frac{e^{z_i / \tau}}{\sum_j e^{z_j / \tau}}\]

generate(model, seed, max_new_tokens=100, temperature=0.5)   # sharp, repetitive
generate(model, seed, max_new_tokens=100, temperature=2.0)   # flat, random
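
The effect of the temperature formula can be seen directly on a small vector of logits. This is a minimal sketch with made-up logits; the dictionary name temp_probs is introduced only for the example.

```python
import torch

z = torch.tensor([2.0, 1.0, 0.0])      # illustrative logits
temp_probs = {}
for tau in (0.5, 1.0, 2.0):
    # Same formula as above: softmax of the logits divided by tau
    temp_probs[tau] = torch.softmax(z / tau, dim=-1)
    print(tau, temp_probs[tau])
```

The probability of the most likely token shrinks as tau grows, while the ranking of the tokens never changes. As tau approaches zero, sampling approaches greedy (argmax) decoding.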

Top-k sampling

Top-k sampling restricts sampling to the \(k\) most probable tokens at each step. All other tokens are assigned zero probability. This prevents the model from sampling rare or incoherent tokens while retaining some variability.

ids = generate(model, seed, max_new_tokens=200, temperature=1.0, top_k=10)
print(''.join(itos[i.item()] for i in ids))

ROMEO:
Ceffetor us! True, indeed! True, in kior,
ebun king the long patcntence thus, devinged here
igh on statacce comd ches enemy: thy surter, dog tranks answer'd slaves, in hightr; and oncal the counsels
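
The filtering step inside generate can be inspected in isolation. The sketch below applies the same two lines to made-up logits with k = 2 and shows that only the two largest entries keep non-zero probability.

```python
import torch

toy_logits = torch.tensor([[1.0, 3.0, 0.5, 2.0, -1.0]])   # illustrative logits, shape (1, 5)
k = 2

top_vals, _ = torch.topk(toy_logits, k)                   # sorted descending, shape (1, k)
filtered = toy_logits.clone()
# Everything below the k-th largest value is excluded from sampling
filtered[filtered < top_vals[:, -1:]] = float('-inf')

topk_probs = torch.softmax(filtered, dim=-1)
print(topk_probs)
```

Because the comparison is a strict less-than, tokens tied with the k-th largest logit all survive, so slightly more than k tokens can remain in that case.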

Top-p sampling (nucleus sampling) is a related approach. Instead of fixing the number of tokens, it includes the smallest set of tokens whose cumulative probability exceeds a threshold \(p\). Both strategies are available in the Hugging Face generation API.
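
The generate function above does not implement top-p. A minimal sketch of one way to add it follows; the function name top_p_filter and the toy probabilities are assumptions made for illustration, and library implementations may treat the boundary token slightly differently.

```python
import torch

def top_p_filter(logits, p):
    """Keep the smallest set of tokens whose cumulative probability exceeds p."""
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token if the mass accumulated before it already exceeds p
    remove = (cum - sorted_probs) > p
    sorted_logits = sorted_logits.masked_fill(remove, float('-inf'))
    # Put the filtered logits back in their original vocabulary order
    filtered = torch.full_like(logits, float('-inf'))
    filtered.scatter_(-1, sorted_idx, sorted_logits)
    return filtered

# Toy distribution over four tokens, given as log-probabilities
toy_p_logits = torch.log(torch.tensor([[0.15, 0.5, 0.05, 0.3]]))
nucleus = top_p_filter(toy_p_logits, p=0.7)
kept = torch.isfinite(nucleus)
print(kept)

nucleus_probs = torch.softmax(nucleus, dim=-1)
next_token = torch.multinomial(nucleus_probs, 1)
print(next_token)
```

With p = 0.7 the two most probable tokens (0.5 and 0.3) are kept, since 0.5 alone does not exceed the threshold but 0.8 does. Unlike top-k, the number of surviving tokens adapts to how peaked the distribution is.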

Hugging Face Transformers

Training from scratch produces a model that has learned only from the small corpus we gave it. Pre-trained models such as BERT, GPT-2, and DistilBERT have been trained on far larger corpora, on the order of billions of words, and encode far richer representations of language. The Hugging Face transformers library provides these models ready to use.

The pipeline API

pipeline is the simplest entry point. It bundles a tokenizer, a pre-trained model, and post-processing into a single callable. Models are downloaded automatically on first use and cached locally.

from transformers import pipeline

classifier = pipeline('sentiment-analysis')
classifier(["This course is excellent.", "The explanation was very confusing."])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f.
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'POSITIVE', 'score': 0.9998623132705688},
 {'label': 'NEGATIVE', 'score': 0.9989590644836426}]

# GPT-2 text generation is slow on CPU; run interactively rather than rendering
generator = pipeline('text-generation', model='gpt2')
result = generator("The study of statistics", max_new_tokens=40, num_return_sequences=1)
print(result[0]['generated_text'])

AutoTokenizer and AutoModel

For more control, load the tokenizer and model separately. AutoTokenizer and AutoModel read the checkpoint's configuration, select the matching tokenizer and model classes, and load them automatically.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model_hf  = AutoModel.from_pretrained('distilbert-base-uncased')

tokens = tokenizer("Deep learning is powerful.", return_tensors='pt')
tokens

DistilBertModel LOAD REPORT from: distilbert-base-uncased

Key                     | Status
------------------------+-----------
vocab_transform.bias    | UNEXPECTED
vocab_layer_norm.weight | UNEXPECTED
vocab_transform.weight  | UNEXPECTED
vocab_projector.bias    | UNEXPECTED
vocab_layer_norm.bias   | UNEXPECTED

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.

{'input_ids': tensor([[ 101, 2784, 4083, 2003, 3928, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

with torch.no_grad():
    output = model_hf(**tokens)

output.last_hidden_state.shape    # (1, seq_len, 768): one vector per token

torch.Size([1, 7, 768])

last_hidden_state contains the contextualised representation of each token. These representations can be used directly as features for downstream tasks.
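
One common way to turn per-token representations into a single feature vector per sentence is to average them while ignoring padding. The sketch below is an assumption about how one might do this, not something the library does for you; it uses random tensors in place of real model output, and the function name mean_pool is hypothetical.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over the sequence, skipping padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)   # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                    # (B, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                           # avoid division by zero
    return summed / counts

# Random stand-ins shaped like a batch of two sequences from the model above
hidden = torch.randn(2, 7, 768)
attn = torch.tensor([[1, 1, 1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 0, 0, 0]])   # second sequence has 3 padding tokens

features = mean_pool(hidden, attn)
print(features.shape)
```

The resulting (batch, 768) matrix can be passed to any standard classifier or regression model as fixed features.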

Text classification

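For classification we load a model with a classification head. AutoModelForSequenceClassification adds a small classification head on top of a pooled representation of the sequence (for DistilBERT, the hidden state of the first token). The checkpoint below shares its vocabulary with distilbert-base-uncased, so we reuse the tokenizer loaded earlier.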

from transformers import AutoModelForSequenceClassification

clf = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased-finetuned-sst-2-english'
)
clf.eval()

sentences = ["The food was excellent.", "The service was slow and disappointing."]
inputs = tokenizer(sentences, padding=True, return_tensors='pt')

with torch.no_grad():
    logits = clf(**inputs).logits

predictions = logits.argmax(dim=1)
# Read the label names from the model config rather than hard-coding their order
[clf.config.id2label[p.item()] for p in predictions]

['POSITIVE', 'NEGATIVE']
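
The argmax gives only the predicted class. To obtain confidence scores like those returned by the pipeline, apply a softmax to the logits. The sketch below uses made-up logits and a hand-written label mapping as stand-ins for the model output, so the printed scores are illustrative only.

```python
import torch

toy_cls_logits = torch.tensor([[-4.0, 4.5],
                               [ 3.5, -3.0]])        # made-up logits for two sentences
toy_id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}        # assumed label order for illustration

cls_probs = torch.softmax(toy_cls_logits, dim=1)     # each row sums to 1
pred = cls_probs.argmax(dim=1)
conf = cls_probs.max(dim=1).values

for p, c in zip(pred.tolist(), conf.tolist()):
    print({'label': toy_id2label[p], 'score': round(c, 4)})
```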

Fine-tuning

Pre-trained models can be fine-tuned on a new dataset by continuing training with a small learning rate. The Trainer API in Hugging Face handles the training loop, evaluation, checkpointing, and logging.

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=clf,
    args=args,
    train_dataset=train_dataset,   # placeholder: a tokenized Hugging Face Dataset with labels (not defined here)
    eval_dataset=eval_dataset,     # placeholder: a held-out split in the same format
)

trainer.train()

Fine-tuning a pre-trained model on a domain-specific dataset typically requires far less data and compute than training from scratch, and usually achieves better performance. This is the standard approach for applying language models to new tasks.