The General Linear Model

Mark Andrews

Regression as a probability model

Regression is about modelling how the probability distribution of an outcome variable depends on predictor variables
It is not just “fitting lines to points”
The general linear model is the foundation for all generalized linear models

The normal linear model

For observations \(i = 1, \ldots, n\):

\[ \begin{aligned} y_i &\sim \mathrm{Normal}(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \sum_{k=1}^K \beta_k x_{ki} \end{aligned} \]

Each \(y_i\) is normally distributed with the same variance \(\sigma^2\)
The mean \(\mu_i\) is a linear function of the predictors

Interpreting coefficients

\[ \mu_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots \]

\(\beta_0\) is the expected value of \(y\) when all predictors are zero
\(\beta_k\) is the expected change in \(y\) for a one-unit increase in \(x_k\), holding other predictors fixed

Estimation

The parameters \(\beta_0, \beta_1, \ldots, \beta_K\) and \(\sigma^2\) are estimated by maximum likelihood.

The maximum likelihood estimators (MLEs) are:

\[ \hat{\beta}_{MLE} = (X^\top X)^{-1} X^\top y, \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\mu}_i)^2 = \frac{RSS}{n} \]

\(\hat{\beta}_{MLE}\) coincides with the ordinary least squares estimator
\(\hat{\sigma}^2_{MLE}\) is biased; the unbiased estimator is \(s^2 = RSS/(n-p)\) where \(p\) is the number of parameters

Standard errors are the standard deviations of the sampling distributions of the estimators.

Fitting with lm

weight_df <- read_csv("data/weight.csv")

M_1 <- lm(weight ~ height, data = weight_df)
coef(M_1)
sigma(M_1)   # sqrt(RSS / (n - p)), the unbiased estimator

M_2 <- lm(weight ~ height + age, data = weight_df)
summary(M_2)

Categorical predictors

Categorical predictors are re-coded as binary dummy variables
For a predictor with \(L\) levels, \(L - 1\) dummies are created
The omitted level is the reference category; all other coefficients are differences from it

M_3 <- lm(weight ~ height + age + gender, data = weight_df)
summary(M_3)

Confidence intervals

A 95% confidence interval for \(\beta_k\) is

\[ \hat{\beta}_k \pm t^*_{n-p,\, 0.975} \cdot SE(\hat{\beta}_k) \]

where \(t^*_{n-p,\, 0.975}\) is the 97.5th percentile of the \(t\)-distribution with \(n - p\) degrees of freedom.

confint(M_3)

Predictions

For predictions at new \(\vec{x}^*\), two types of interval are available:

Confidence interval on the mean response \(\mu^* = \vec{x}^{*\top}\hat{\beta}\)
Prediction interval for a new individual observation (always wider, adds \(\hat{\sigma}^2\) for the new observation’s own variability)

new_df <- tibble(height = c(160, 170, 180), age = 30, gender = "female")
predict(M_3, newdata = new_df, interval = "confidence")
predict(M_3, newdata = new_df, interval = "prediction")

The likelihood at the maximum

The normal linear model’s log-likelihood is

\[ \log L(\beta, \sigma^2 \mid \mathbf{y}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mu_i)^2 \]

Evaluated at the MLEs \(\hat{\beta}_{MLE}\) and \(\hat{\sigma}^2_{MLE} = RSS/n\), this achieves its maximum value — the highest log-likelihood attainable by any choice of parameters within this model structure.

Model comparison

For nested models \(\mathcal{M}_0 \subset \mathcal{M}_1\), we use the F-test:

\[ F = \frac{(RSS_0 - RSS_1)\,/\,q}{RSS_1\,/\,(n - p)} \sim F(q,\, n-p) \]

where \(q = K_1 - K_0\) is the difference in the number of parameters. This distribution is exact under normality.

The F-test is closely related to the log likelihood ratio statistic:

\[ \Lambda = 2\log\frac{L(\hat{\beta}_1)}{L(\hat{\beta}_0)} \]

anova(M_null, M_full)

AIC

\[ \mathrm{AIC} = -2 \log L(\hat{\beta}) + 2p \]

\(p\) is the number of estimated parameters
Lower AIC indicates a better balance of fit and complexity
Useful for comparing non-nested models

AIC(M_null, M_full)

Summary

The normal linear model assumes normally distributed outcomes with constant variance
Coefficients are MLEs; sigma() returns the unbiased estimator of \(\sigma\), not the MLE
Standard errors are sampling distribution standard deviations; confidence intervals are centred on the MLE \(\pm\) a \(t\)-quantile times the standard error
Nested model comparison uses the F-test, which is an exact finite-sample counterpart of the log likelihood ratio statistic \(\Lambda\)