The General Linear Model

Author

Mark Andrews

Abstract

This guide covers the normal general linear model as the foundation for all generalized linear models. We treat regression as a model of how the probability distribution of an outcome variable depends on predictor variables, fit models with lm, interpret continuous and categorical predictors, and compare nested models using likelihood ratio tests and AIC.

Regression as a probability model

Regression is often introduced as fitting a line to a scatter plot, which is a limited perspective. More generally, a regression model specifies how the probability distribution of an outcome variable varies as a function of predictor variables. In the normal linear model, we assume the outcome is normally distributed and that its mean is a linear function of the predictors.

For \(n\) observations \((y_i, \vec{x}_i)\), the normal linear model is

\[ \begin{aligned} y_i &\sim \mathrm{Normal}(\mu_i, \sigma^2), \quad i = 1, \ldots, n,\\ \mu_i &= \beta_0 + \sum_{k=1}^K \beta_k x_{ki}. \end{aligned} \]

Each observation is drawn from a normal distribution with the same variance \(\sigma^2\) but a mean \(\mu_i\) that depends on the predictors for that observation. The \(\beta\) coefficients tell us how the mean changes with the predictors: a one-unit increase in \(x_k\) shifts the mean by \(\beta_k\).

Fitting a linear model

We fit linear models with lm. The weight.csv dataset contains body weight (kg), height (cm), age (years), and gender for a sample of individuals.

weight_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/iglmr24/main/data/weight.csv")
Rows: 5769 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): gender, handedness, race
dbl (4): subjectid, height, weight, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A simple model predicting weight from height:

M_1 <- lm(weight ~ height, data = weight_df)
coef(M_1)
(Intercept)      height 
-115.889544    1.141495 
sigma(M_1)
[1] 11.813

The coefficient for height tells us the expected change in weight in kilograms for each additional centimetre of height. sigma returns the estimated standard deviation \(\sigma\) of the residuals.

Adding age as a second predictor:

M_2 <- lm(weight ~ height + age, data = weight_df)
summary(M_2)

Call:
lm(formula = weight ~ height + age, data = weight_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.166  -7.521  -0.621   6.841  47.628 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -124.18784    2.90999  -42.68   <2e-16 ***
height         1.12596    0.01679   67.06   <2e-16 ***
age            0.36825    0.01721   21.39   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.37 on 5766 degrees of freedom
Multiple R-squared:  0.4688,    Adjusted R-squared:  0.4686 
F-statistic:  2544 on 2 and 5766 DF,  p-value: < 2.2e-16

The coefficient for each predictor is now interpreted as its effect holding the other predictors constant.

Categorical predictors

When a predictor is categorical, R creates dummy (indicator) variables automatically. For a binary predictor such as gender, one level is treated as the reference category and the coefficient gives the difference between the two groups.

M_3 <- lm(weight ~ height + age + gender, data = weight_df)
summary(M_3)

Call:
lm(formula = weight ~ height + age + gender, data = weight_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.711  -7.584  -0.672   6.618  48.000 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -94.99699    3.67976  -25.82   <2e-16 ***
height        0.93654    0.02231   41.99   <2e-16 ***
age           0.35758    0.01700   21.03   <2e-16 ***
gendermale    5.39907    0.42584   12.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.22 on 5765 degrees of freedom
Multiple R-squared:  0.4832,    Adjusted R-squared:  0.4829 
F-statistic:  1797 on 3 and 5765 DF,  p-value: < 2.2e-16

For a categorical predictor with \(L\) levels, \(L - 1\) dummy variables are created, each comparing one level to the reference. The intercept in the model represents the expected value when all predictors are at their reference levels.

Confidence intervals and predictions

Confidence intervals on the coefficients:

confint(M_3)
                   2.5 %      97.5 %
(Intercept) -102.2107065 -87.7832753
height         0.8928152   0.9802688
age            0.3242573   0.3909109
gendermale     4.5642684   6.2338793

Predicted values for new data can be obtained with predict or add_predictions from the modelr package:

new_df <- tibble(height = c(160, 170, 180), age = 30, gender = "female")
predict(M_3, newdata = new_df)
       1        2        3 
65.57726 74.94268 84.30810 
add_predictions(new_df, M_3)
# A tibble: 3 × 4
  height   age gender  pred
   <dbl> <dbl> <chr>  <dbl>
1    160    30 female  65.6
2    170    30 female  74.9
3    180    30 female  84.3

Model comparison

When comparing two nested models — a full model \(\mathcal{M}_1\) with \(K\) predictors and a reduced model \(\mathcal{M}_0\) with a subset of those predictors — we can use a likelihood ratio test. The test statistic is the difference in deviances, \(D_0 - D_1\), where the deviance of a model is \(-2 \log L(\hat{\beta} \mid \mathcal{D})\). Under the null hypothesis that the two models are equivalent, this difference is \(\chi^2\)-distributed with degrees of freedom equal to the difference in the number of parameters.

M_null <- lm(weight ~ height, data = weight_df)
M_full <- lm(weight ~ height + age + gender, data = weight_df)
anova(M_null, M_full)
Analysis of Variance Table

Model 1: weight ~ height
Model 2: weight ~ height + age + gender
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1   5767 804767                                  
2   5765 725358  2     79409 315.56 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC provides a penalised measure of fit useful for comparing non-nested models or choosing among a set of candidates:

AIC(M_null, M_full)
       df      AIC
M_null  3 44865.35
M_full  5 44270.02

Lower AIC indicates a better trade-off between fit and complexity.