Introduction to MCMC and brms

Author

Mark Andrews

Abstract

Markov Chain Monte Carlo methods are introduced as a general numerical solution for Bayesian inference in models where analytical calculation is not possible. We introduce the Metropolis algorithm to build intuition, then turn to Stan via the brms package and fit our first Bayesian linear regression model.

Why MCMC?

The analytical approach of the previous sessions works because the beta prior is conjugate to the binomial likelihood. The posterior has a known closed form. For almost every realistic statistical model this is not the case. A normal linear model with more than one predictor, a logistic regression, a multilevel model: none of these have analytically tractable posteriors.

The posterior exists in principle, it is just an integral that we cannot evaluate in closed form. Markov Chain Monte Carlo is the general solution. Rather than computing the posterior analytically, MCMC generates a large number of samples from the posterior distribution. Once we have those samples, we can approximate any posterior quantity: the mean, the median, any credible interval, the probability that \(\theta > 0\).

The Metropolis algorithm

The basic idea of MCMC can be understood through the Metropolis algorithm, the simplest member of the MCMC family. The algorithm explores the parameter space by proposing random moves and accepting or rejecting them in a way that ensures the chain eventually samples from the posterior.

Start at some initial value \(\tilde{\theta}_0\). At each step, propose a new value \(\tilde{\theta}_1\) drawn from a proposal distribution centred on the current position. Compute the acceptance ratio:

\[ r = \frac{p(\tilde{\theta}_1 \mid \text{data})}{p(\tilde{\theta}_0 \mid \text{data})} \]

If \(r \geq 1\) the proposed value is more probable than the current position, so accept it. If \(r < 1\) accept the proposal with probability \(r\), otherwise stay at \(\tilde{\theta}_0\).

A crucial point: the normalising constant cancels in this ratio. The posterior is proportional to likelihood times prior, so:

\[ \frac{p(\tilde{\theta}_1 \mid D)}{p(\tilde{\theta}_0 \mid D)} = \frac{p(D \mid \tilde{\theta}_1)\, p(\tilde{\theta}_1)}{p(D \mid \tilde{\theta}_0)\, p(\tilde{\theta}_0)} \]

The intractable denominator \(p(D)\) is the same in both numerator and denominator and cancels out. This is why MCMC works for posteriors we cannot calculate directly: we only need the unnormalised posterior, which is just likelihood times prior.

Stan and brms

Stan is a state-of-the-art probabilistic programming language and MCMC engine. It implements Hamiltonian Monte Carlo, a more efficient variant of Metropolis that uses gradient information to make better proposals. Stan compiles models to efficient C++ machine code.

Writing Stan models directly requires learning Stan’s syntax. The brms package provides a high-level R interface to Stan that accepts familiar R formula syntax. The pipeline is: brm() in R specifies the model, brms translates it to Stan code, Stan compiles it to C++, and the compiled sampler runs MCMC.

You can inspect the Stan code that brms generates for any model.

M_2 <- brm(y ~ x_1 + x_2, data = data_df1)
stancode(M_2)

Comparing brm with lm

The most direct way to understand what brm does is to compare it with lm on the same data.

M_1 <- lm(y ~ x_1 + x_2, data = data_df1)
summary(M_1)

M_2 <- brm(y ~ x_1 + x_2, data = data_df1)
summary(M_2)

The point estimates (posterior means from brm, coefficient estimates from lm) are numerically very close when default weakly informative priors are used. The key difference is in what the uncertainty estimates mean. The credible intervals from brm are direct probability statements about the parameters. The confidence intervals from lm are not: they refer to the long-run coverage of the procedure, not to the probability that the parameter lies in any given interval.

MCMC settings

By default brm runs four chains, each with 2000 iterations of which the first 1000 are warm-up (burn-in). The warm-up samples are discarded. These defaults can be adjusted.

Running chains in parallel with cores = 4 is generally advisable on a multicore machine. The save_pars(all = TRUE) argument saves all parameters including those needed for Bayes factor computation.