This guide covers the normal general linear model as the foundation for all generalized linear models. We treat regression as a model of how the probability distribution of an outcome variable depends on predictor variables, fit models with lm, interpret continuous and categorical predictors, and compare nested models using likelihood ratio tests and AIC.
Regression as a probability model
Regression is often introduced as fitting a line to a scatter plot, which is a limited perspective. More generally, a regression model specifies how the probability distribution of an outcome variable varies as a function of predictor variables. In the normal linear model, we assume the outcome is normally distributed and that its mean is a linear function of the predictors.
For \(n\) observations \((y_i, \vec{x}_i)\), the normal linear model is
Each observation is drawn from a normal distribution with the same variance \(\sigma^2\) but a mean \(\mu_i\) that depends on the predictors for that observation. The \(\beta\) coefficients tell us how the mean changes with the predictors: a one-unit increase in \(x_k\) shifts the mean by \(\beta_k\).
Fitting a linear model
We fit linear models with lm. The weight.csv dataset contains body weight (kg), height (cm), age (years), and gender for a sample of individuals.
Rows: 5769 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): gender, handedness, race
dbl (4): subjectid, height, weight, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A simple model predicting weight from height:
M_1 <-lm(weight ~ height, data = weight_df)coef(M_1)
(Intercept) height
-115.889544 1.141495
sigma(M_1)
[1] 11.813
The coefficient for height tells us the expected change in weight in kilograms for each additional centimetre of height. sigma returns the estimated standard deviation \(\sigma\) of the residuals.
Adding age as a second predictor:
M_2 <-lm(weight ~ height + age, data = weight_df)summary(M_2)
Call:
lm(formula = weight ~ height + age, data = weight_df)
Residuals:
Min 1Q Median 3Q Max
-42.166 -7.521 -0.621 6.841 47.628
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -124.18784 2.90999 -42.68 <2e-16 ***
height 1.12596 0.01679 67.06 <2e-16 ***
age 0.36825 0.01721 21.39 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.37 on 5766 degrees of freedom
Multiple R-squared: 0.4688, Adjusted R-squared: 0.4686
F-statistic: 2544 on 2 and 5766 DF, p-value: < 2.2e-16
The coefficient for each predictor is now interpreted as its effect holding the other predictors constant.
Categorical predictors
When a predictor is categorical, R creates dummy (indicator) variables automatically. For a binary predictor such as gender, one level is treated as the reference category and the coefficient gives the difference between the two groups.
M_3 <-lm(weight ~ height + age + gender, data = weight_df)summary(M_3)
Call:
lm(formula = weight ~ height + age + gender, data = weight_df)
Residuals:
Min 1Q Median 3Q Max
-39.711 -7.584 -0.672 6.618 48.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -94.99699 3.67976 -25.82 <2e-16 ***
height 0.93654 0.02231 41.99 <2e-16 ***
age 0.35758 0.01700 21.03 <2e-16 ***
gendermale 5.39907 0.42584 12.68 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.22 on 5765 degrees of freedom
Multiple R-squared: 0.4832, Adjusted R-squared: 0.4829
F-statistic: 1797 on 3 and 5765 DF, p-value: < 2.2e-16
For a categorical predictor with \(L\) levels, \(L - 1\) dummy variables are created, each comparing one level to the reference. The intercept in the model represents the expected value when all predictors are at their reference levels.
# A tibble: 3 × 4
height age gender pred
<dbl> <dbl> <chr> <dbl>
1 160 30 female 65.6
2 170 30 female 74.9
3 180 30 female 84.3
Model comparison
When comparing two nested models — a full model \(\mathcal{M}_1\) with \(K\) predictors and a reduced model \(\mathcal{M}_0\) with a subset of those predictors — we can use a likelihood ratio test. The test statistic is the difference in deviances, \(D_0 - D_1\), where the deviance of a model is \(-2 \log L(\hat{\beta} \mid \mathcal{D})\). Under the null hypothesis that the two models are equivalent, this difference is \(\chi^2\)-distributed with degrees of freedom equal to the difference in the number of parameters.
M_null <-lm(weight ~ height, data = weight_df)M_full <-lm(weight ~ height + age + gender, data = weight_df)anova(M_null, M_full)
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ height + age + gender
Res.Df RSS Df Sum of Sq F Pr(>F)
1 5767 804767
2 5765 725358 2 79409 315.56 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC provides a penalised measure of fit useful for comparing non-nested models or choosing among a set of candidates:
AIC(M_null, M_full)
df AIC
M_null 3 44865.35
M_full 5 44270.02
Lower AIC indicates a better trade-off between fit and complexity.