Introduction to Bayesian Inference

Author

Mark Andrews

Abstract

This guide introduces Bayesian data analysis as an alternative approach to statistical inference. We cover what Bayesian inference is, how it differs from the classical frequentist tradition, and when it offers practical advantages for empirical researchers.

What is Bayesian data analysis?

Bayesian data analysis is an approach to statistical inference based on a simple and general principle. Given some observed data and a statistical model, we ask which values of the model’s unknown parameters are plausible in light of those data. The answer is a probability distribution over the parameters, called the posterior distribution, which is obtained by applying Bayes’ theorem.
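
For reference, Bayes’ theorem can be written as follows, using \(\theta\) for the parameters and \(y\) for the observed data collectively (the symbol \(\mathrm{P}\) for a probability distribution or density is a notational choice made here for illustration):

\[ \mathrm{P}(\theta \mid y) = \frac{\mathrm{P}(y \mid \theta)\, \mathrm{P}(\theta)}{\mathrm{P}(y)}, \quad \mathrm{P}(y) = \int \mathrm{P}(y \mid \theta)\, \mathrm{P}(\theta)\, d\theta \]

Here \(\mathrm{P}(\theta)\) is the prior, \(\mathrm{P}(y \mid \theta)\) is the likelihood, and \(\mathrm{P}(\theta \mid y)\) is the posterior. Because \(\mathrm{P}(y)\) does not depend on \(\theta\), the posterior is proportional to the likelihood multiplied by the prior.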

This is not a specialised or advanced form of statistics. It is an alternative general approach, sitting alongside the classical tradition variously called frequentist, sampling-theory-based, or null-hypothesis significance testing. The two approaches differ fundamentally in how they define and use probability.

Bayesian versus frequentist probability

In the classical tradition, probability is defined as long-run relative frequency. A statement like “the probability of heads is 0.5” means that in a long sequence of coin flips, the proportion coming up heads would approach one half. Parameters are treated as fixed but unknown constants. Classical inference answers questions like “if the true parameter were θ, how probable would data like these be?”, which leads to p-values, confidence intervals, and rejection regions.
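
The classical question can be made concrete with a small calculation. The sketch below assumes a hypothetical experiment of 10 coin flips with 7 heads (numbers chosen purely for illustration) and computes a one-sided p-value: the probability of observing at least 7 heads if the true probability of heads were 0.5.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """Probability of exactly k successes in n trials with success probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


# Hypothetical data and null value, assumed for illustration.
n_flips, n_heads, theta_null = 10, 7, 0.5

# One-sided p-value: probability of a result at least as extreme as the
# observed one, computed as if theta_null were the true parameter value.
p_value = sum(binomial_pmf(k, n_flips, theta_null) for k in range(n_heads, n_flips + 1))

print(f"P(at least {n_heads} heads | theta = {theta_null}) = {p_value:.4f}")
```

The result, 176/1024 or roughly 0.17, is a statement about the data given a fixed parameter value. It is not a statement about how probable any value of the parameter is.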

In the Bayesian tradition, probability is a measure of degree of belief or uncertainty. A parameter is not a fixed unknown constant but a quantity about which we are uncertain, and that uncertainty is represented by a probability distribution. Before seeing the data we have a prior distribution over the parameters. After seeing the data we update to a posterior distribution, using Bayes’ theorem to do the updating.
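
The updating step can be sketched numerically with a grid approximation. The example below is a minimal illustration under stated assumptions, not a general-purpose method: the data (7 heads in 10 flips), the uniform prior, and the grid size are all chosen here for illustration.

```python
from math import comb

# Hypothetical data, assumed for illustration.
n_flips, n_heads = 10, 7

# Candidate values of theta on an evenly spaced grid over [0, 1].
grid = [i / 200 for i in range(201)]

# Prior: uniform over the grid (an assumption; any non-negative weights would do).
prior = [1.0 for _ in grid]

# Likelihood: binomial probability of the observed data at each candidate theta.
likelihood = [
    comb(n_flips, n_heads) * t**n_heads * (1 - t) ** (n_flips - n_heads)
    for t in grid
]

# Bayes' theorem on the grid: posterior is proportional to likelihood times prior,
# then normalised so that the posterior weights sum to one.
unnormalised = [lik * pri for lik, pri in zip(likelihood, prior)]
total = sum(unnormalised)
posterior = [u / total for u in unnormalised]

posterior_mean = sum(t * p for t, p in zip(grid, posterior))
posterior_mode = grid[posterior.index(max(posterior))]

print(f"posterior mean: {posterior_mean:.3f}")
print(f"posterior mode: {posterior_mode:.3f}")
```

With a uniform prior and binomial data, the exact posterior is a Beta(8, 4) distribution, so the grid results should land close to a mean of 8/12 and a mode of 0.7.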

The posterior distribution is the complete answer to the question “what do the data tell us about the parameters?”. It contains everything we know about the parameters after observing the data, and we can summarise it in any way we choose: its mean, its mode, any interval containing 95% of its mass, and so on.
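
To illustrate that more than one interval can contain 95% of the posterior mass, the sketch below draws samples from an assumed Beta(8, 4) posterior (the posterior for 7 heads in 10 flips under a uniform prior, used here only as an example) and computes both a central interval and the shortest interval covering 95% of the samples.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so that the sketch is reproducible

# Draws from an assumed Beta(8, 4) posterior, sorted for quantile lookups.
samples = sorted(random.betavariate(8, 4) for _ in range(20_000))
n = len(samples)

posterior_mean = mean(samples)

# Central 95% interval: cut off 2.5% of the samples in each tail.
central = (samples[int(0.025 * n)], samples[int(0.975 * n)])

# Shortest 95% interval: slide a window spanning 95% of the sorted samples
# and keep the narrowest one.
k = int(0.95 * n)
start = min(range(n - k), key=lambda i: samples[i + k] - samples[i])
shortest = (samples[start], samples[start + k])

print(f"posterior mean:        {posterior_mean:.3f}")
print(f"central 95% interval:  ({central[0]:.3f}, {central[1]:.3f})")
print(f"shortest 95% interval: ({shortest[0]:.3f}, {shortest[1]:.3f})")
```

Both intervals are valid summaries of the same posterior. For a skewed posterior such as this one they differ slightly, with the shortest interval shifted towards the mode.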

Data and model notation

Consider a dataset of \(n\) observations. The observed values are \(y_1, y_2, \ldots, y_n\), and for each observation we may also have predictor values \(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\). A statistical model specifies a probability distribution for \(y_i\) given some parameters \(\theta\). In a linear model, for example, we write

\[ y_i \sim \mathrm{N}(\mu_i,\, \sigma^2), \quad \mu_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ki} \]

The parameters here are \(\beta_0, \beta_1, \ldots, \beta_K, \sigma^2\). In the Bayesian approach we assign prior distributions to all of these parameters and then update those priors using the data.
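
As a rough sketch of what assigning priors and updating them involves, the code below evaluates the unnormalised log posterior of the linear model, that is, the log likelihood plus the log prior. The specific priors (normal with standard deviation 10 on each coefficient, exponential with rate 1 on \(\sigma\) rather than on \(\sigma^2\)), the simulated data, and all function names are assumptions made for illustration.

```python
import math
import random


def normal_logpdf(value, mean, sd):
    """Log density of a normal distribution with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd**2) - (value - mean) ** 2 / (2 * sd**2)


def log_likelihood(beta0, betas, sigma, xs, ys):
    """Sum over observations of log N(y_i | mu_i, sigma^2), with mu_i linear in x_i."""
    total = 0.0
    for x_i, y_i in zip(xs, ys):
        mu_i = beta0 + sum(b * x for b, x in zip(betas, x_i))
        total += normal_logpdf(y_i, mu_i, sigma)
    return total


def log_prior(beta0, betas, sigma):
    """Illustrative priors (assumptions): N(0, 10^2) on coefficients, Exponential(1) on sigma."""
    if sigma <= 0:
        return -math.inf
    coefficient_part = sum(normal_logpdf(b, 0.0, 10.0) for b in [beta0, *betas])
    return coefficient_part - sigma  # log of the Exponential(1) density is -sigma


def log_posterior_unnormalised(beta0, betas, sigma, xs, ys):
    """Log posterior up to an additive constant: log likelihood plus log prior."""
    return log_likelihood(beta0, betas, sigma, xs, ys) + log_prior(beta0, betas, sigma)


# Simulated data with K = 2 predictors; the "true" values are hypothetical.
random.seed(0)
true_beta0, true_betas, true_sigma = 1.0, [2.0, -0.5], 1.5
xs = [[random.uniform(-2, 2) for _ in true_betas] for _ in range(50)]
ys = [
    random.gauss(true_beta0 + sum(b * x for b, x in zip(true_betas, x_i)), true_sigma)
    for x_i in xs
]

at_truth = log_posterior_unnormalised(true_beta0, true_betas, true_sigma, xs, ys)
at_zero = log_posterior_unnormalised(0.0, [0.0, 0.0], true_sigma, xs, ys)
print(f"log posterior at generating values: {at_truth:.1f}")
print(f"log posterior with all coefficients zero: {at_zero:.1f}")
```

A function of this kind is what sampling algorithms evaluate repeatedly. Parameter values near those that generated the data should receive a much higher log posterior than values far from them.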

When do Bayesian methods offer advantages?

For simple models with large amounts of data, Bayesian and classical methods often give numerically similar answers, even though the conceptual foundations differ. The practical advantages of Bayesian methods become more apparent in several situations.

When data are scarce, incorporating prior information can make estimation possible where classical methods would be unstable or undefined. When models are complex or hierarchical, Bayesian methods fitted by Markov chain Monte Carlo (MCMC) handle them naturally, whereas classical maximum likelihood methods often face convergence problems. When the goal is to quantify uncertainty about parameters rather than simply to reject a null hypothesis, the posterior distribution provides exactly what is needed. When comparing models, Bayes factors and posterior model probabilities give coherent answers that classical likelihood ratio tests and information criteria only approximate.
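
The scarce-data case can be illustrated with a small hypothetical example: 0 successes in 5 trials. The sketch below compares the maximum likelihood estimate with the posterior under a weakly informative Beta(2, 2) prior. The data, the prior, and the use of the Wald standard error as the classical measure of uncertainty are all assumptions made for illustration.

```python
# Hypothetical scarce data: no successes in five trials.
n_trials, n_successes = 5, 0

# Classical maximum likelihood estimate and its Wald standard error.
mle = n_successes / n_trials
wald_se = (mle * (1 - mle) / n_trials) ** 0.5

# Bayesian alternative with an assumed Beta(2, 2) prior. With binomial data,
# the posterior is Beta(prior_a + successes, prior_b + failures).
prior_a, prior_b = 2, 2
post_a = prior_a + n_successes
post_b = prior_b + (n_trials - n_successes)

posterior_mean = post_a / (post_a + post_b)
posterior_sd = (
    post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))
) ** 0.5

print(f"MLE: {mle:.3f}, Wald standard error: {wald_se:.3f}")
print(f"posterior mean: {posterior_mean:.3f}, posterior sd: {posterior_sd:.3f}")
```

The maximum likelihood estimate is exactly zero with a standard error of zero, which suggests a certainty that five trials cannot justify. The posterior instead has a mean of 2/9 and a clearly non-zero spread.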

The role of the prior

The prior distribution is both the most distinctive and the most contested feature of Bayesian inference. It represents our uncertainty about the parameters before seeing the data. It can encode genuine prior knowledge, domain expertise, or simply a weak regularising constraint that keeps the model from fitting noise.

The influence of the prior diminishes as more data are collected. With enough data, the posterior is determined almost entirely by the likelihood and different reasonable priors give effectively the same answer. This is reassuring: Bayesian inference is not primarily about imposing subjective beliefs on the data but about a coherent mechanism for updating beliefs in light of evidence.
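
The diminishing influence of the prior can be checked with a short calculation. The sketch below assumes a binomial model with Beta priors, in which case the posterior mean has a closed form, and compares two quite different priors as the number of coin flips grows while the observed proportion of heads stays at 0.7. The priors and the data are hypothetical.

```python
def posterior_mean(prior_a, prior_b, heads, flips):
    """Posterior mean of theta under a Beta(prior_a, prior_b) prior and binomial data."""
    return (prior_a + heads) / (prior_a + prior_b + flips)


# Two assumed priors: one flat, one concentrated around 0.5.
priors = {"Beta(1, 1)": (1, 1), "Beta(20, 20)": (20, 20)}

gaps = []
for flips in (10, 100, 10_000):
    heads = 7 * flips // 10  # observed proportion of heads fixed at 0.7
    means = [posterior_mean(a, b, heads, flips) for a, b in priors.values()]
    gaps.append(abs(means[0] - means[1]))
    summary = ", ".join(f"{name}: {m:.4f}" for name, m in zip(priors, means))
    print(f"n = {flips:>6}  {summary}  gap = {gaps[-1]:.4f}")
```

With 10 flips the two posterior means differ noticeably. With 10,000 flips they agree to about three decimal places, because the data terms dominate the prior terms in both the numerator and the denominator.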

We will see priors and posteriors concretely in the next two sessions, where we work through a complete example with a one-parameter model and visualise exactly how the prior, the likelihood, and the posterior relate to each other.