Binary logistic regression is the workhorse model for binary outcome data. This guide covers the Bernoulli distributional assumption, the logit link function, odds and log odds, fitting with glm, interpreting coefficients as log odds ratios, obtaining predicted probabilities, and comparing models with deviance and AIC.
The binary outcome problem
When the outcome variable is binary — taking only the values 0 and 1 — modelling it as normally distributed is a severe misspecification. The predicted values can fall outside the \([0, 1]\) range, and the constant-variance assumption makes no sense for a Bernoulli variable. The natural fix is to replace the normal distribution with the Bernoulli distribution and connect the linear predictor to the probability parameter via the logit function.
The logistic regression model
For observations \(i = 1, \ldots, n\) with binary outcomes \(y_i \in \{0, 1\}\) and predictors \(\vec{x}_i\), binary logistic regression assumes
The logit transformation maps the probability \(\theta \in (0, 1)\) to the real line, so the linear predictor can take any real value. The inverse of the logit is
which maps any real number back to the interval \((0, 1)\). In R, plogis computes the inverse logit.
Odds and log odds
The odds of an event with probability \(\theta\) is \(\theta / (1 - \theta)\). Odds greater than 1 indicate the event is more likely than not; odds less than 1 indicate the opposite. The log odds is simply the natural logarithm of the odds, which maps to the real line symmetrically around zero.
The affairs.csv dataset records whether individuals have had an extramarital affair, along with predictors including years married, age, gender, religiousness, and a self-rated marriage quality score.
Rows: 601 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, children
dbl (7): affairs, age, yearsmarried, religiousness, education, occupation, r...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Call:
glm(formula = had_affair ~ yearsmarried, family = binomial(link = "logit"),
data = affairs_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.60859 0.18327 -8.777 < 2e-16 ***
yearsmarried 0.05882 0.01724 3.412 0.000646 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 663.50 on 599 degrees of freedom
AIC: 667.5
Number of Fisher Scoring iterations: 4
The family = binomial(link = "logit") argument tells glm to use a Bernoulli likelihood with the logit link.
Interpreting coefficients
Coefficients in logistic regression are on the log-odds scale. A one-unit increase in yearsmarried changes the log odds of an affair by the coefficient for that predictor. Exponentiating gives the odds ratio — the multiplicative factor by which the odds change for a one-unit increase in the predictor.
predict(M_4, newdata = affairs_new, type ="response") # probabilities
1 2 3
0.2117386 0.2649559 0.3936137
add_predictions(affairs_new, M_4) # log odds via modelr
# A tibble: 3 × 2
yearsmarried pred
<dbl> <dbl>
1 5 -1.31
2 10 -1.02
3 20 -0.432
add_predictions(affairs_new, M_4, type ="response") # probabilities
# A tibble: 3 × 2
yearsmarried pred
<dbl> <dbl>
1 5 0.212
2 10 0.265
3 20 0.394
Model comparison with deviance
The deviance of a model is \(-2 \log L(\hat{\beta} \mid \mathcal{D})\). Comparing the deviances of a full and a reduced model gives a likelihood ratio test statistic that is approximately \(\chi^2\) distributed under the null hypothesis.
deviance(M_4)
[1] 663.5002
deviance(M_5)
[1] 609.5104
anova(M_4, M_5, test ="Chisq")
Analysis of Deviance Table
Model 1: had_affair ~ yearsmarried
Model 2: had_affair ~ gender + age + yearsmarried + children + religiousness +
education + occupation + rating
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 599 663.50
2 592 609.51 7 53.99 2.363e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC penalises for model complexity and is useful for comparing non-nested models:
AIC(M_4, M_5)
df AIC
M_4 2 667.5002
M_5 9 627.5104
Stepwise selection by AIC can help identify a parsimonious model when there are many predictors:
M_5_step <-step(M_5, trace =0)summary(M_5_step)
Call:
glm(formula = had_affair ~ gender + age + yearsmarried + religiousness +
rating, family = binomial(link = "logit"), data = affairs_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.94760 0.61234 3.181 0.001470 **
gendermale 0.38612 0.20703 1.865 0.062171 .
age -0.04393 0.01806 -2.432 0.015011 *
yearsmarried 0.11133 0.02983 3.732 0.000190 ***
religiousness -0.32714 0.08947 -3.656 0.000256 ***
rating -0.46721 0.08928 -5.233 1.67e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 611.86 on 595 degrees of freedom
AIC: 623.86
Number of Fisher Scoring iterations: 4