Binary Logistic Regression

Author

Mark Andrews

Abstract

Binary logistic regression is the workhorse model for binary outcome data. This guide covers the Bernoulli distributional assumption, the logit link function, odds and log odds, fitting with glm, interpreting coefficients as log odds ratios, obtaining predicted probabilities, and comparing models with deviance and AIC.

The binary outcome problem

When the outcome variable is binary — taking only the values 0 and 1 — modelling it as normally distributed is a severe misspecification. The predicted values can fall outside the \([0, 1]\) range, and the constant-variance assumption makes no sense for a Bernoulli variable. The natural fix is to replace the normal distribution with the Bernoulli distribution and connect the linear predictor to the probability parameter via the logit function.

The logistic regression model

For observations \(i = 1, \ldots, n\) with binary outcomes \(y_i \in \{0, 1\}\) and predictors \(\vec{x}_i\), binary logistic regression assumes

\[ \begin{aligned} y_i &\sim \mathrm{Bernoulli}(\theta_i),\\ \mathrm{logit}(\theta_i) &= \beta_0 + \sum_{k=1}^K \beta_k x_{ki}, \end{aligned} \]

where

\[ \mathrm{logit}(\theta) = \log\!\left(\frac{\theta}{1 - \theta}\right). \]

The logit transformation maps the probability \(\theta \in (0, 1)\) to the real line, so the linear predictor can take any real value. The inverse of the logit is

\[ \theta = \mathrm{ilogit}(x) = \frac{1}{1 + e^{-x}}, \]

which maps any real number back to the interval \((0, 1)\). In R, plogis computes the inverse logit.

Odds and log odds

The odds of an event with probability \(\theta\) is \(\theta / (1 - \theta)\). Odds greater than 1 indicate the event is more likely than not; odds less than 1 indicate the opposite. The log odds is simply the natural logarithm of the odds, which maps to the real line symmetrically around zero.

theta <- c(0.1, 0.25, 0.5, 0.75, 0.9)
odds  <- theta / (1 - theta)
log_odds <- log(odds)
tibble(theta, odds, log_odds)
# A tibble: 5 × 3
  theta  odds log_odds
  <dbl> <dbl>    <dbl>
1  0.1  0.111    -2.20
2  0.25 0.333    -1.10
3  0.5  1         0   
4  0.75 3         1.10
5  0.9  9         2.20

Fitting with glm

The affairs.csv dataset records whether individuals have had an extramarital affair, along with predictors including years married, age, gender, religiousness, and a self-rated marriage quality score.

affairs_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/iglmr24/main/data/affairs.csv") |>
  mutate(had_affair = affairs > 0)
Rows: 601 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, children
dbl (7): affairs, age, yearsmarried, religiousness, education, occupation, r...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A model with a single predictor:

M_4 <- glm(had_affair ~ yearsmarried,
           family = binomial(link = "logit"),
           data = affairs_df)
summary(M_4)

Call:
glm(formula = had_affair ~ yearsmarried, family = binomial(link = "logit"), 
    data = affairs_df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.60859    0.18327  -8.777  < 2e-16 ***
yearsmarried  0.05882    0.01724   3.412 0.000646 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 663.50  on 599  degrees of freedom
AIC: 667.5

Number of Fisher Scoring iterations: 4

The family = binomial(link = "logit") argument tells glm to use a Bernoulli likelihood with the logit link.

Interpreting coefficients

Coefficients in logistic regression are on the log-odds scale. A one-unit increase in yearsmarried changes the log odds of an affair by the coefficient for that predictor. Exponentiating gives the odds ratio — the multiplicative factor by which the odds change for a one-unit increase in the predictor.

exp(coef(M_4))
 (Intercept) yearsmarried 
   0.2001702    1.0605864 
exp(confint.default(M_4))
                 2.5 %    97.5 %
(Intercept)  0.1397644 0.2866833
yearsmarried 1.0253452 1.0970389

A fuller model with several predictors:

M_5 <- glm(had_affair ~ gender + age + yearsmarried + children +
             religiousness + education + occupation + rating,
           family = binomial(link = "logit"),
           data = affairs_df)
exp(confint.default(M_5))
                  2.5 %     97.5 %
(Intercept)   0.6957875 22.5836746
gendermale    0.8283413  2.1146797
age           0.9230973  0.9915466
yearsmarried  1.0321395  1.1710634
childrenyes   0.8405721  2.6353522
religiousness 0.6061456  0.8617361
education     0.9250129  1.1275524
occupation    0.8960496  1.1872016
rating        0.5238093  0.7480534

Predicted probabilities

The linear predictor gives predicted log odds; applying plogis converts to predicted probabilities.

affairs_new <- tibble(yearsmarried = c(5, 10, 20))

predict(M_4, newdata = affairs_new)               # log odds
         1          2          3 
-1.3144772 -1.0203672 -0.4321472 
predict(M_4, newdata = affairs_new, type = "response")  # probabilities
        1         2         3 
0.2117386 0.2649559 0.3936137 
add_predictions(affairs_new, M_4)                 # log odds via modelr
# A tibble: 3 × 2
  yearsmarried   pred
         <dbl>  <dbl>
1            5 -1.31 
2           10 -1.02 
3           20 -0.432
add_predictions(affairs_new, M_4, type = "response")    # probabilities
# A tibble: 3 × 2
  yearsmarried  pred
         <dbl> <dbl>
1            5 0.212
2           10 0.265
3           20 0.394

Model comparison with deviance

The deviance of a model is \(-2 \log L(\hat{\beta} \mid \mathcal{D})\). Comparing the deviances of a full and a reduced model gives a likelihood ratio test statistic that is approximately \(\chi^2\) distributed under the null hypothesis.

deviance(M_4)
[1] 663.5002
deviance(M_5)
[1] 609.5104
anova(M_4, M_5, test = "Chisq")
Analysis of Deviance Table

Model 1: had_affair ~ yearsmarried
Model 2: had_affair ~ gender + age + yearsmarried + children + religiousness + 
    education + occupation + rating
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       599     663.50                          
2       592     609.51  7    53.99 2.363e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC penalises for model complexity and is useful for comparing non-nested models:

AIC(M_4, M_5)
    df      AIC
M_4  2 667.5002
M_5  9 627.5104

Stepwise selection by AIC can help identify a parsimonious model when there are many predictors:

M_5_step <- step(M_5, trace = 0)
summary(M_5_step)

Call:
glm(formula = had_affair ~ gender + age + yearsmarried + religiousness + 
    rating, family = binomial(link = "logit"), data = affairs_df)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.94760    0.61234   3.181 0.001470 ** 
gendermale     0.38612    0.20703   1.865 0.062171 .  
age           -0.04393    0.01806  -2.432 0.015011 *  
yearsmarried   0.11133    0.02983   3.732 0.000190 ***
religiousness -0.32714    0.08947  -3.656 0.000256 ***
rating        -0.46721    0.08928  -5.233 1.67e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 611.86  on 595  degrees of freedom
AIC: 623.86

Number of Fisher Scoring iterations: 4