Introduction to Bayesian Inference

Mark Andrews

What is Bayesian data analysis?

  • An approach to statistical inference based on Bayes’ theorem
  • Not a specialised technique but an alternative general framework
  • Sits alongside, not above, classical frequentist statistics
  • The two approaches differ in how probability is defined and used
  • In practice, the different definitions of probability lead to two alternative approaches to statistical inference

Frequentist probability

  • Probability is defined as long-run relative frequency
  • Parameters are fixed but unknown constants. Hence Bayesian methods were “..founded upon an error, and must be wholly rejected” because “(i)nferences respecting populations, from which known samples have been drawn, cannot be expressed in terms of probability” (Fisher, 1925).
  • Inference is based on sampling distributions: if the true value of \(\theta\) were \(\theta_0\), what would be the probability distribution of the observed data?
  • Leads to p-values, confidence intervals, rejection regions
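
The sampling-distribution logic can be illustrated with a small simulation. This is a minimal Python sketch, assuming a normal model with known standard deviation. The values of `theta_0`, `sigma`, `n`, and `observed_mean` are hypothetical and chosen only for illustration.

```python
import random
import statistics


def simulate_sample_means(theta_0, sigma, n, n_sims, seed=1):
    """Approximate the sampling distribution of the sample mean,
    assuming the true parameter value is theta_0."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_sims):
        draws = [rng.gauss(theta_0, sigma) for _ in range(n)]
        means.append(sum(draws) / n)
    return means


# Hypothetical setup: hypothesised value 0, known sd 1, sample size 25.
theta_0, sigma, n = 0.0, 1.0, 25
sample_means = simulate_sample_means(theta_0, sigma, n, n_sims=5000)

# Hypothetical observed sample mean.
observed_mean = 0.45

# Two-sided p-value: proportion of simulated means at least as far
# from theta_0 as the observed mean.
extreme = [abs(m - theta_0) >= abs(observed_mean - theta_0) for m in sample_means]
p_value = sum(extreme) / len(extreme)

print(f"sd of simulated means: {statistics.stdev(sample_means):.3f}")
print(f"simulated p-value: {p_value:.3f}")
```

Everything here is conditional on the hypothesised value `theta_0`: the p-value describes the data under that assumption, not the parameter itself.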

Bayesian probability

  • Probability is a means to quantify uncertainty
  • Parameters are quantities about which we are uncertain
  • Uncertainty is always represented by a probability distribution
  • Before data: unknowns have a prior distribution; after data: they have a posterior distribution, calculated using Bayes’ rule
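
The move from prior to posterior can be shown exactly in the simplest conjugate case. This is a minimal Python sketch, assuming a binomial proportion with a Beta prior, for which Bayes' rule gives a Beta posterior in closed form. The prior parameters and the data are hypothetical.

```python
def beta_binomial_update(a, b, successes, failures):
    """Given a Beta(a, b) prior on a binomial proportion, return the
    parameters of the Beta posterior after observing the data."""
    return a + successes, b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical prior and data: Beta(2, 2) prior, 7 successes, 3 failures.
prior = (2, 2)
posterior = beta_binomial_update(*prior, successes=7, failures=3)

print(f"prior mean:     {beta_mean(*prior):.3f}")
print(f"posterior:      Beta{posterior}")
print(f"posterior mean: {beta_mean(*posterior):.3f}")
```

Both the prior and the posterior are full probability distributions over the unknown proportion; the data simply shift and sharpen the distribution.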

Data and model

\[y_1, y_2, \ldots, y_n\]

\[\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\]

A model specifies a distribution for \(y_i\), conditional on \(\mathbf{x}_i\). The model is parameterized by fixed and unknown parameters \(\theta\).

Example: A linear model

\[y_i \sim \mathrm{N}(\mu_i, \sigma^2)\]

\[\mu_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ki}\]

  • The “sampling” part: \(y_i \sim \mathrm{N}(\mu_i, \sigma^2)\)
  • The “structural” part: the linear predictor
  • Bayesian inference begins by assigning or assuming prior distributions for \(\beta_0, \beta_1, \ldots, \beta_K, \sigma^2\)
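
The two parts of the linear model, together with priors, define a complete generative process. This is a minimal Python sketch of that process. The specific priors (standard normal coefficients, half-normal \(\sigma\)) and the predictor values are assumptions made only for illustration and are not recommendations from the slides.

```python
import random


def linear_predictor(beta_0, betas, x_i):
    """Structural part: mu_i = beta_0 + sum_k beta_k * x_ki."""
    return beta_0 + sum(b * x for b, x in zip(betas, x_i))


def draw_from_priors(K, rng):
    """Hypothetical priors, for illustration only:
    beta_k ~ N(0, 1) for k = 0..K and sigma ~ half-normal(0, 1),
    which implies a prior on sigma^2."""
    beta_0 = rng.gauss(0, 1)
    betas = [rng.gauss(0, 1) for _ in range(K)]
    sigma = abs(rng.gauss(0, 1))
    return beta_0, betas, sigma


def simulate_outcomes(beta_0, betas, sigma, X, rng):
    """Sampling part: y_i ~ N(mu_i, sigma^2)."""
    return [rng.gauss(linear_predictor(beta_0, betas, x_i), sigma) for x_i in X]


rng = random.Random(42)

# Hypothetical predictors: n = 5 observations, K = 2 predictors each.
X = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(5)]

beta_0, betas, sigma = draw_from_priors(K=2, rng=rng)
y = simulate_outcomes(beta_0, betas, sigma, X, rng)
print([round(y_i, 2) for y_i in y])
```

Drawing parameters from the priors and then simulating outcomes in this way is one means of checking what a set of priors implies about the data before any data are observed.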

Bayes’ theorem

\[p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{\int p(\text{data} \mid \theta)\, p(\theta)\, d\theta}\]

  • Posterior \(\propto\) Likelihood \(\times\) Prior
  • The denominator normalises; it does not depend on \(\theta\)
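
For a single parameter, Bayes' theorem can be evaluated numerically on a grid. This is a minimal Python sketch, assuming hypothetical binomial data and a uniform prior; the integral in the denominator is approximated by a sum over grid midpoints.

```python
import math


def grid_posterior(likelihood, prior, grid):
    """Approximate the posterior density on an evenly spaced grid."""
    width = grid[1] - grid[0]
    unnormalised = [likelihood(theta) * prior(theta) for theta in grid]
    # Numerical approximation of the integral in the denominator.
    evidence = sum(unnormalised) * width
    return [u / evidence for u in unnormalised]


# Hypothetical data: 7 successes in 10 Bernoulli trials.
successes, trials = 7, 10


def likelihood(theta):
    """Binomial likelihood of the hypothetical data."""
    return (
        math.comb(trials, successes)
        * theta**successes
        * (1 - theta) ** (trials - successes)
    )


def prior(theta):
    """Uniform prior density on (0, 1)."""
    return 1.0


n_points = 1000
grid = [(i + 0.5) / n_points for i in range(n_points)]  # interval midpoints
width = grid[1] - grid[0]

posterior = grid_posterior(likelihood, prior, grid)
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * width
print(f"posterior mean: {posterior_mean:.4f}")
```

With a uniform prior, the exact posterior in this example is Beta(8, 4), whose mean is 8/12, so the grid result can be checked against a known answer. Grid evaluation does not scale beyond a few parameters, which is why sampling methods are used for realistic models.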

What the posterior gives you

  • A complete probability distribution over the parameters
  • The posterior tells you exactly what you can say about the unknowns given what you know (e.g. data) or have assumed (e.g. model, prior distribution).
  • Summarise the posterior however you wish: mean, median, any interval
  • Posterior (aka credible) intervals are direct probability statements about the parameters
  • No need to condition on a hypothetical null hypothesis
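
In practice the posterior is usually available as a set of samples, and every summary is computed from them. This is a minimal Python sketch, assuming a hypothetical Beta(9, 5) posterior from which samples can be drawn directly; `credible_interval` is an illustrative helper, not a library function.

```python
import random
import statistics


def credible_interval(samples, prob=0.95):
    """Equal-tailed interval: cut (1 - prob) / 2 from each tail of the
    sorted samples."""
    ordered = sorted(samples)
    tail = (1 - prob) / 2
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1 - tail) * len(ordered)) - 1]
    return lower, upper


rng = random.Random(0)

# Hypothetical posterior for a proportion: Beta(9, 5).
samples = [rng.betavariate(9, 5) for _ in range(20000)]

post_mean = statistics.fmean(samples)
post_median = statistics.median(samples)
lower, upper = credible_interval(samples, prob=0.95)

# A direct probability statement about the parameter.
prob_above_half = sum(s > 0.5 for s in samples) / len(samples)

print(f"mean {post_mean:.3f}, median {post_median:.3f}")
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
print(f"P(theta > 0.5 | data) is approximately {prob_above_half:.3f}")
```

The interval and the final probability are statements about the parameter given the data, which is the interpretation a confidence interval is often mistakenly given.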

When Bayesian methods offer practical advantages

  • Small samples: prior information stabilises estimation
  • Complex models: MCMC is a general-purpose method for inference
  • Quantifying uncertainty: the posterior is exactly what you need, and the full repertoire of probability theory is available for any inference or prediction
  • This includes model comparison: e.g., posterior model probabilities
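
To give a sense of why MCMC is general purpose, here is a minimal Python sketch of a random-walk Metropolis sampler. It needs only the unnormalised posterior, so the integral in Bayes' theorem is never computed. The target (hypothetical binomial data with a uniform prior), the step size, and the burn-in length are assumptions chosen for illustration. This is not the algorithm Stan uses, which is a more sophisticated Hamiltonian Monte Carlo method.

```python
import math
import random
import statistics


def log_unnormalised_posterior(theta, successes=7, trials=10):
    """Log of likelihood times prior for hypothetical binomial data with
    a uniform prior on (0, 1), up to an additive constant."""
    if not 0 < theta < 1:
        return -math.inf
    return successes * math.log(theta) + (trials - successes) * math.log(1 - theta)


def metropolis(log_target, start, n_samples, step=0.2, seed=0):
    """Random-walk Metropolis with a symmetric normal proposal."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


draws = metropolis(log_unnormalised_posterior, start=0.5, n_samples=20000)
kept = draws[2000:]  # discard an initial burn-in period
print(f"posterior mean estimate: {statistics.fmean(kept):.3f}")
```

The exact posterior here is Beta(8, 4) with mean 8/12, so the estimate can be checked. The same sampler works for any model for which the unnormalised log posterior can be evaluated.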

That wretched prior

  • The prior is just another modelling assumption. It represents uncertainty about parameters before seeing data
  • Can encode genuine prior knowledge or just weak regularisation
  • Its influence diminishes as more data are collected
  • With enough data, different reasonable priors give essentially the same posterior
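
The diminishing influence of the prior can be seen with a small calculation. This is a minimal Python sketch, assuming a binomial proportion with two hypothetical Beta priors and hypothetical data in which 70% of trials are successes at two different sample sizes.

```python
def posterior_mean(a, b, successes, trials):
    """Posterior mean of a binomial proportion under a Beta(a, b) prior."""
    return (a + successes) / (a + b + trials)


# Two hypothetical priors: uniform Beta(1, 1) and optimistic Beta(10, 2).
priors = {"uniform": (1, 1), "optimistic": (10, 2)}


def prior_gap(successes, trials):
    """Absolute difference between posterior means under the two priors."""
    means = [posterior_mean(a, b, successes, trials) for a, b in priors.values()]
    return abs(means[0] - means[1])


gap_small = prior_gap(successes=7, trials=10)
gap_large = prior_gap(successes=7000, trials=10000)

print(f"n = 10:    posterior means differ by {gap_small:.4f}")
print(f"n = 10000: posterior means differ by {gap_large:.4f}")
```

With 10 observations the choice of prior moves the posterior mean by roughly 0.1; with 10,000 observations the difference is negligible.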

Summary

  • Bayesian inference updates prior beliefs using data via Bayes’ theorem
  • The result is a posterior distribution over the parameters
  • This is conceptually straightforward and practically powerful
  • The key tool we will use is brms, an R interface to Stan