Introduction to MCMC

Mark Andrews

Posterior distributions

Given observed data \(D\) and a model \(\Omega\), the posterior distribution over the parameters \(\theta\) is

\[ p(\theta \mid D, \Omega) = \frac{\overbrace{p(D \mid \theta)}^{\text{Likelihood}}\;\overbrace{p(\theta \mid \Omega)}^{\text{Prior}}} {\underbrace{\int p(D \mid \theta)\, p(\theta \mid \Omega)\, d\theta}_{\text{Marginal likelihood}}} \]

The marginal likelihood gives the probability of the observed data under the model. Our aim is usually to characterise the posterior in terms of its mean, variance, credible intervals, and so on. We may also wish to calculate the posterior predictive distribution: \(p(x_{\text{new}} \mid D, \Omega) = \int p(x_{\text{new}} \mid \theta, \Omega)\, p(\theta \mid D, \Omega)\, d\theta\).

The limits of analytical solutions

  • Conjugate priors are the primary route to analytical Bayesian inference.
  • Conjugate priors and analytical solutions only exist for a limited set of models, such as the Bernoulli model and the normal linear model, but most models of practical interest, including generalised linear models and hierarchical models, do not have analytical posteriors.
  • The posterior exists in all these cases, but we cannot write it down in closed form.
  • The solution is to generate samples from the posterior numerically.

Monte Carlo methods

In only rare cases can we characterise the posterior or compute posterior predictive distributions in closed form. If we can draw samples \(\tilde{\theta}_1, \tilde{\theta}_2, \ldots, \tilde{\theta}_N\) from \(p(\theta \mid D, \Omega)\), we can approximate the posterior mean by

\[ \langle\theta\rangle \approx \frac{1}{N} \sum_{i=1}^N \tilde{\theta}_i, \]

and the posterior predictive distribution by

\[ p(x_{\text{new}} \mid D, \Omega) = \int p(x_{\text{new}} \mid \theta, \Omega)\, p(\theta \mid D, \Omega)\, d\theta \approx \frac{1}{N} \sum_{i=1}^N p(x_{\text{new}} \mid \tilde{\theta}_i, \Omega). \]

Markov Chain Monte Carlo

  • MCMC generates a sequence of samples from the posterior.
  • Each sample depends only on the previous one, which is the Markov property.
  • The chain eventually reaches a stationary distribution equal to the posterior.
  • We discard the early “burn-in” samples and use the rest.

The Metropolis algorithm

Let \(f(\theta) = p(\theta \mid D, \Omega)\). Starting at \(\tilde{\theta}_0\), we draw a proposal \(\tilde{\theta} \sim Q(\theta \mid \tilde{\theta}_0)\) from a symmetric proposal distribution centred on the current position, then accept it with probability

\[ \alpha = \min\!\left(1,\, \frac{f(\tilde{\theta})}{f(\tilde{\theta}_0)}\right). \]

The chain therefore moves toward regions of higher posterior probability more often than lower. After convergence, the accepted samples are draws from \(f(\theta)\). Crucially, \(f(\theta)\) need only be known up to a proportional constant, so the intractable marginal likelihood cancels.

The normalising constant cancels

\[ \frac{f(\tilde{\theta}_1)}{f(\tilde{\theta}_0)} = \frac{p(\tilde{\theta}_1 \mid D, \Omega)}{p(\tilde{\theta}_0 \mid D, \Omega)} = \frac{p(D \mid \tilde{\theta}_1)\, p(\tilde{\theta}_1 \mid \Omega)}{p(D \mid \tilde{\theta}_0)\, p(\tilde{\theta}_0 \mid \Omega)} \]

When we expand the second ratio using Bayes’s theorem, the marginal likelihood \(p(D \mid \Omega)\) appears in both the numerator and denominator and cancels. MCMC only requires the unnormalised posterior, the product of the likelihood and the prior.

Hamiltonian Monte Carlo

The Metropolis algorithm as described is random walk Metropolis. At each step it proposes a move in a random direction, and in high-dimensional spaces almost all random directions lead away from the region of high posterior mass. The sampler therefore either takes tiny steps or rejects proposals at very high rates.

Hamiltonian Monte Carlo introduces an auxiliary momentum variable and simulates physical dynamics. The log-posterior plays the role of potential energy. As the trajectory moves away from the typical set, potential energy rises and momentum slows, eventually reversing and carrying the chain back into the high-probability region. The result is large, directed leaps that remain within the typical set rather than diffusing randomly through parameter space.

Probabilistic programming languages

A probabilistic programming language (PPL) allows the user to specify the model (the likelihood and the prior) and constructs and runs the sampler automatically.

  • BUGS was the original PPL, using Gibbs sampling.
  • JAGS is a later reimplementation of BUGS.
  • Stan is the current state of the art, implementing Hamiltonian Monte Carlo and compiling to C++ for speed.
  • brms is an R package that generates Stan code automatically from standard R model formulas.

The brms pipeline

brm() formula → Stan code → C++ → compiled sampler → MCMC samples
M_2 <- brm(y ~ x_1 + x_2, data = data_df1)
stancode(M_2)

Comparing brm with lm

M_1 <- lm(y ~ x_1 + x_2, data = data_df1)
summary(M_1)

M_2 <- brm(y ~ x_1 + x_2, data = data_df1)
summary(M_2)

Point estimates are numerically similar. Uncertainty quantification differs fundamentally in meaning.

MCMC settings

M_3 <- brm(y ~ x_1 + x_2,
  chains = 4,
  cores = 4,
  iter = 10000,
  warmup = 2000,
  save_pars = save_pars(all = TRUE),
  data = data_df1
)
  • Four chains run in parallel.
  • Two thousand warm-up (burn-in) samples are discarded per chain.
  • Eight thousand post-warmup samples are retained in total.

Burn-in and convergence

Chains start outside the high-density region and converge to the posterior. Burn-in samples are discarded before analysis.

Summary

  • MCMC generates samples from the posterior without computing it analytically.
  • The normalising constant cancels in the Metropolis acceptance ratio.
  • Stan implements Hamiltonian Monte Carlo, and brms provides the R formula interface.
  • From here, fitting complex models is largely a matter of changing the formula.