Bayes’ Rule and the Bernoulli Model

Mark Andrews

The product rule of probability

For events \(A\) and \(B\):

\[p(A, B) = p(A \mid B)\, p(B) = p(B \mid A)\, p(A)\]

Setting these equal and rearranging:

\[p(B \mid A) = \frac{p(A \mid B)\, p(B)}{p(A)}\]

This is Bayes’ rule. It follows directly from the product rule of probability.
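As a quick numerical check of the rule, here is a minimal Python sketch. The function name `bayes_rule` and all the numbers are hypothetical, chosen only for illustration: \(B\) is "has the condition" and \(A\) is "test is positive". The denominator \(p(A)\) is expanded with the law of total probability, which is not shown above.

```python
def bayes_rule(p_a_given_b, p_b, p_a_given_not_b):
    """Return p(B | A) from p(A | B), p(B) and p(A | not B)."""
    # p(A) by the law of total probability over B and not-B.
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)
    # Bayes' rule: p(B | A) = p(A | B) p(B) / p(A).
    return p_a_given_b * p_b / p_a


# Hypothetical inputs: 95% sensitivity, 1% prevalence, 5% false-positive rate.
posterior = bayes_rule(p_a_given_b=0.95, p_b=0.01, p_a_given_not_b=0.05)
print(round(posterior, 3))  # 0.161
```

Even with an accurate test, the low value of \(p(B)\) keeps \(p(B \mid A)\) small in this example.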

Bayes’ rule for statistical inference

Replace \(B\) with the parameter \(\theta\) and \(A\) with the observed data:

\[P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta)\, P(\theta)}{P(\text{data})}\]

\(P(\theta \mid \text{data})\): posterior distribution
\(P(\text{data} \mid \theta)\): likelihood
\(P(\theta)\): prior distribution
\(P(\text{data})\): marginal likelihood (normalising constant)

\[P(\text{data}) = \int P(\text{data} \mid \theta)\, P(\theta)\, d\theta\]

Posterior is proportional to likelihood times prior

\[P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta)\, P(\theta)\]

The denominator \(P(\text{data})\) does not depend on \(\theta\). It just ensures the posterior integrates to one.
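The role of the denominator can be sketched with a toy case. The setup is an assumption made purely for illustration: \(\theta\) is restricted to three hypothetical candidate values with equal prior weight, and the data are a single observed success, so the likelihood of each candidate is \(\theta\) itself. With discrete candidates the integral becomes a sum.

```python
# Hypothetical example: three candidate values of theta, equal prior weight.
thetas = [0.25, 0.50, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]

# Likelihood of one observed success under each candidate is theta itself.
likelihood = [theta for theta in thetas]

# Unnormalised posterior: likelihood times prior.
unnormalised = [lik * pr for lik, pr in zip(likelihood, prior)]

# P(data): with discrete candidates the integral becomes a sum.
p_data = sum(unnormalised)

# Dividing by P(data) rescales the values so they sum to one.
posterior = [u / p_data for u in unnormalised]
print([round(p, 3) for p in posterior])  # [0.167, 0.333, 0.5]
```

The ordering of the candidates is already fixed by the unnormalised products; dividing by \(P(\text{data})\) only rescales them.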

The Bernoulli model

Observe \(n\) binary outcomes: coin flips, positive/negative tests, success/failure
Unknown parameter: \(\theta \in [0, 1]\), the probability of success
Each observation: \(x_i \sim \text{Bernoulli}(\theta)\)

Simple enough to treat analytically. Rich enough to illustrate every key concept in Bayesian inference.

The likelihood function

With \(m\) successes in \(n\) trials:

\[p(m \mid \theta) = \binom{n}{m} \theta^m (1 - \theta)^{n-m}\]

Viewed as a function of \(\theta\): peaks at \(\hat{\theta} = m/n\) (the MLE)
Wider when \(n\) is small, narrower when \(n\) is large

Visualising the likelihood

bernoulli_likelihood(n = 250, m = 139)

Peak at \(\hat{\theta} = 139/250 = 0.556\)
Quantifies how probable the data are for each possible value of \(\theta\)
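The location of the peak can be checked numerically. The sketch below is not the plotting function used above; `binomial_likelihood` is a hypothetical helper that evaluates the likelihood formula on a grid of \(\theta\) values using only the Python standard library.

```python
import math


def binomial_likelihood(theta, n, m):
    """p(m | theta) for m successes in n trials."""
    return math.comb(n, m) * theta**m * (1.0 - theta) ** (n - m)


n, m = 250, 139

# Grid of theta values: 0.000, 0.001, ..., 1.000.
grid = [i / 1000 for i in range(1001)]
values = [binomial_likelihood(theta, n, m) for theta in grid]

# The grid point with the largest likelihood should coincide with m / n.
theta_hat = grid[values.index(max(values))]
print(theta_hat, m / n)  # 0.556 0.556
```

A finer or coarser grid would work the same way, provided it contains a point at or near \(m/n\).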

The beta distribution as prior

Need a distribution on \([0, 1]\)
The beta distribution: \(p(\theta) \propto \theta^{\alpha - 1}(1-\theta)^{\beta-1}\)
\(\alpha\) and \(\beta\) are shape parameters

beta_plot(alpha = 3, beta = 5) # mild prior below 0.5

Different beta priors

beta_plot(alpha = 1, beta = 1) # uniform
beta_plot(alpha = 3, beta = 5) # mild prior below 0.5, mean 3/8 = 0.375
beta_plot(alpha = 9, beta = 15) # stronger prior, same mean 0.375

\(\text{Beta}(1,1)\): uniform on \([0, 1]\), often used to represent prior ignorance about \(\theta\)
Other choices encode stronger prior assumptions about \(\theta\)
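How strong each prior is can be summarised by its mean and standard deviation. The sketch below uses the standard formulas for a beta distribution, mean \(\alpha/(\alpha+\beta)\) and variance \(\alpha\beta / \big((\alpha+\beta)^2(\alpha+\beta+1)\big)\); the helper name `beta_mean_sd` is hypothetical.

```python
import math


def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total**2 * (total + 1))
    return mean, math.sqrt(variance)


# The three priors plotted above.
for alpha, beta in [(1, 1), (3, 5), (9, 15)]:
    mean, sd = beta_mean_sd(alpha, beta)
    print(f"Beta({alpha}, {beta}): mean = {mean:.3f}, sd = {sd:.3f}")
# Beta(1, 1): mean = 0.500, sd = 0.289
# Beta(3, 5): mean = 0.375, sd = 0.161
# Beta(9, 15): mean = 0.375, sd = 0.097
```

\(\text{Beta}(3,5)\) and \(\text{Beta}(9,15)\) share the same mean, but the second has a smaller standard deviation, which is the sense in which it is the stronger prior.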

Conjugacy

Beta prior + binomial likelihood \(\mapsto\) beta posterior
Prior: \(\text{Beta}(\alpha, \beta)\)
Posterior: \(\text{Beta}(\alpha + m,\; \beta + n - m)\)
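
One way to see why the posterior is again a beta distribution is to multiply the likelihood by the prior and collect powers of \(\theta\), dropping every factor that does not depend on \(\theta\):

\[p(\theta \mid m) \propto \theta^m (1 - \theta)^{n-m} \times \theta^{\alpha - 1}(1-\theta)^{\beta-1} = \theta^{\alpha + m - 1}(1-\theta)^{\beta + n - m - 1}\]

The right-hand side has the same form as the \(\text{Beta}(\alpha + m,\; \beta + n - m)\) density, up to its normalising constant.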

The prior parameters act like pseudo-counts: \(\alpha\) is added to the observed successes and \(\beta\) to the observed failures. Conjugacy lets us compute the posterior analytically.
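The update rule is short enough to sketch directly. Pairing the \(\text{Beta}(3, 5)\) prior with the data \(n = 250\), \(m = 139\) is an assumption made here for illustration, and the function name `beta_update` is hypothetical.

```python
def beta_update(alpha, beta, n, m):
    """Posterior Beta parameters after m successes in n trials."""
    return alpha + m, beta + n - m


# Assumed pairing for illustration: Beta(3, 5) prior, 139 successes in 250 trials.
alpha_post, beta_post = beta_update(alpha=3, beta=5, n=250, m=139)
posterior_mean = alpha_post / (alpha_post + beta_post)

print(alpha_post, beta_post)     # 142 116
print(round(posterior_mean, 3))  # 0.55
```

In this example the posterior mean falls between the prior mean (0.375) and the MLE (0.556), and much closer to the MLE because the data contribute far more counts than the prior.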

Summary

Bayes’ rule follows from the product rule of probability
Posterior \(\propto\) likelihood \(\times\) prior
For the Bernoulli model with a beta prior, the posterior is also beta
We can visualise all three components (prior, likelihood, posterior) simultaneously