Poisson Regression

Mark Andrews

The Poisson distribution

  • A discrete distribution over \(\{0, 1, 2, \ldots\}\)
  • Parameterised by a single rate \(\lambda > 0\)

\[ \Pr(x = k \mid \lambda) = \frac{e^{-\lambda}\lambda^k}{k!} \]

  • Mean \(= \lambda\), Variance \(= \lambda\) (equidispersion)
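As a quick check of the formula and of equidispersion, the following sketch compares the pmf written out by hand with R's built-in `dpois()` and then computes the mean and variance directly from the probabilities. The rate 3.5 and the ranges of \(k\) are arbitrary choices made for illustration.

```r
# Check the Poisson pmf formula against R's built-in dpois()
lambda <- 3.5            # illustrative rate, chosen arbitrarily
k <- 0:10

pmf_formula <- exp(-lambda) * lambda^k / factorial(k)
pmf_builtin <- dpois(k, lambda = lambda)
all.equal(pmf_formula, pmf_builtin)

# Mean and variance computed from the pmf over a wide range of k
# (the probability mass beyond k = 200 is negligible for this lambda)
k_wide <- 0:200
p <- dpois(k_wide, lambda = lambda)
pois_mean <- sum(k_wide * p)
pois_var  <- sum((k_wide - pois_mean)^2 * p)
c(mean = pois_mean, variance = pois_var)
```

Both the mean and the variance should agree with \(\lambda\) up to numerical error.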

The Poisson distribution: shape

As \(\lambda\) increases, the distribution shifts right and widens

[Figure: Poisson probability mass functions for \(\lambda = 3.5\), \(\lambda = 5\), \(\lambda = 10\), and \(\lambda = 15\)]

The variance grows with the mean, unlike the normal distribution, where the variance is a separate parameter
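The widening can be put in numbers. This is a minimal sketch using the four rates listed above: for each, it tabulates the standard deviation \(\sqrt{\lambda}\) and the central 95% range of counts from `qpois()`. The choice of a 95% range is only for illustration.

```r
lambdas <- c(3.5, 5, 10, 15)   # the four rates listed above

shape_df <- data.frame(
  lambda = lambdas,
  sd     = sqrt(lambdas),             # standard deviation is sqrt(lambda)
  q025   = qpois(0.025, lambdas),     # lower end of the central 95% range
  q975   = qpois(0.975, lambdas)      # upper end of the central 95% range
)
shape_df$width <- shape_df$q975 - shape_df$q025
shape_df
```

Both `sd` and `width` grow as \(\lambda\) grows, which is the spreading described above.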

Poisson regression

\[ \begin{aligned} y_i &\sim \mathrm{Poisson}(\lambda_i)\\ \log(\lambda_i) &= \beta_0 + \sum_k \beta_k x_{ki} \end{aligned} \]

  • The log link ensures \(\lambda_i > 0\)
  • Equivalently: \(\lambda_i = \exp(\beta_0 + \sum_k \beta_k x_{ki})\)
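One way to see the model at work is to simulate from it and check that `glm()` recovers the coefficients. This is a sketch under stated assumptions: the predictor `x`, the "true" values `b0` and `b1`, the sample size, and the seed are all hypothetical choices, not taken from any dataset in these slides.

```r
set.seed(101)                      # arbitrary seed, for reproducibility only

n  <- 5000
x  <- rnorm(n)                     # a single hypothetical predictor
b0 <- 0.5                          # assumed "true" intercept
b1 <- 0.3                          # assumed "true" slope

lambda <- exp(b0 + b1 * x)         # inverse of the log link; always positive
y <- rpois(n, lambda = lambda)     # y_i ~ Poisson(lambda_i)

M_sim <- glm(y ~ x, family = poisson(link = "log"))
coef(M_sim)                        # estimates should lie close to b0 and b1
```

Because \(\lambda_i\) is obtained by exponentiating the linear predictor, it is positive whatever values the coefficients and predictors take.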

Fitting with glm

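library(tidyverse)   # provides read_csv(), mutate(), tibble()

doctor_df <- read_csv("data/DoctorAUS.csv") |>
  mutate(age = age * 100)   # rescale age (presumably stored as years / 100)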

M_10 <- glm(doctorco ~ age,
            data = doctor_df,
            family = poisson(link = "log"))
summary(M_10)

Interpreting coefficients

  • On the log scale, \(\beta_k\) is the additive change in \(\log(\lambda)\) per unit increase in \(x_k\)
  • On the original count scale, a unit increase in \(x_k\), with the other predictors held fixed, multiplies \(\lambda\) by \(e^{\beta_k}\); writing \(\lambda^+\) for the rate after the increase:

\[ \lambda^+ = \lambda \cdot e^{\beta_k} \]

exp(coef(M_10)["age"])   # multiplicative effect per unit increase in age
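The multiplicative effect follows directly from the log link. As a short derivation, write \(\lambda\) for the rate at \(x_k\) and \(\lambda^+\) for the rate at \(x_k + 1\), with all other predictors \(x_j\) unchanged:

\[ \begin{aligned} \log(\lambda^+) &= \beta_0 + \beta_k (x_k + 1) + \sum_{j \neq k} \beta_j x_j\\ &= \log(\lambda) + \beta_k\\ \lambda^+ &= \lambda \cdot e^{\beta_k} \end{aligned} \]

For a purely hypothetical coefficient \(\beta_k = 0.05\), \(e^{0.05} \approx 1.051\), so each unit increase in \(x_k\) raises the expected count by about 5.1%.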

Predicted counts

library(modelr)   # provides add_predictions()

doctor_new <- tibble(age = seq(20, 80))
add_predictions(doctor_new, M_10, type = "response")   # expected counts, not log counts

The expected count changes multiplicatively with each predictor: it increases when \(\beta_k > 0\) and decreases when \(\beta_k < 0\)
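The relation between the two prediction scales can be checked with base R's `predict()`. This sketch uses simulated data so that it runs without the data file; the data frame `sim_df`, the variable `visits`, and the coefficients \(-2\) and \(0.02\) are hypothetical stand-ins, not estimates from the doctor visits data.

```r
set.seed(202)                                   # arbitrary seed
sim_df <- data.frame(age = runif(2000, min = 20, max = 80))          # hypothetical ages
sim_df$visits <- rpois(2000, lambda = exp(-2 + 0.02 * sim_df$age))   # assumed coefficients

M_age <- glm(visits ~ age, data = sim_df, family = poisson(link = "log"))

new_df <- data.frame(age = seq(20, 80))
eta    <- predict(M_age, newdata = new_df, type = "link")      # log of expected count
lambda <- predict(M_age, newdata = new_df, type = "response")  # expected count

all.equal(exp(eta), lambda)   # the response scale is exp() of the link scale

# The ratio of predictions one year apart is constant and equals exp(beta_age)
ratios <- lambda[-1] / lambda[-length(lambda)]
range(ratios)
exp(coef(M_age)[["age"]])
```

The constant ratio is what "multiplicative" means here: each additional year scales the expected count by the same factor, whatever the starting age.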

Model comparison

M_12 <- glm(doctorco ~ sex + age + income,
            data = doctor_df,
            family = poisson(link = "log"))

anova(M_10, M_12, test = "Chisq")

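M_10 is nested in M_12, so the likelihood ratio test works the same way as for other GLMs: the drop in deviance is compared with a \(\chi^2\) distribution whose degrees of freedom equal the number of added parameters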
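The test statistic reported by `anova()` can be reproduced by hand. The sketch below does so on simulated data; the predictors `x1` and `x2`, the coefficients, and the model names `M_small` and `M_large` are hypothetical and chosen only to give a pair of nested Poisson models.

```r
set.seed(303)                                   # arbitrary seed
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rpois(n, lambda = exp(0.2 + 0.4 * x1 + 0.3 * x2))   # assumed coefficients

M_small <- glm(y ~ x1,      family = poisson(link = "log"))
M_large <- glm(y ~ x1 + x2, family = poisson(link = "log"))

# Likelihood ratio statistic: the drop in deviance from the smaller to the larger model
lr_stat <- deviance(M_small) - deviance(M_large)
df_diff <- df.residual(M_small) - df.residual(M_large)    # number of added parameters
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

lrt <- anova(M_small, M_large, test = "Chisq")
c(by_hand = lr_stat, anova = lrt$Deviance[2])
```

The two values should match, since `anova()` with `test = "Chisq"` compares the same deviance difference with the same \(\chi^2\) distribution.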

Exposure and offset

  • When observation periods differ across individuals, counts are not directly comparable
  • Include the log of the exposure \(u_i\) (for example, the length of the observation period) as a fixed offset in the linear predictor:

\[ \log(\lambda_i) = \log(u_i) + \beta_0 + \sum_k \beta_k x_{ki} \]

M_ins <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
             data = insur_df, family = poisson)

The offset in practice

  • offset(log(Holders)) tells R to add \(\log(\text{Holders})\) to the linear predictor with coefficient fixed at 1
  • Coefficients now describe effects on the rate (claims per holder), not the raw count
  • The exposure correction makes observations with different exposures comparable, whether the exposure is a time window or, as here, a number of policyholders
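The effect of the offset can be illustrated on simulated data where the true rate is known. In this sketch the variables `exposure` and `x`, the rate coefficients \(-2\) and \(0.5\), and the model names `M_off` and `M_raw` are all hypothetical; the insurance data is not used.

```r
set.seed(404)                                   # arbitrary seed
n <- 3000
exposure <- sample(10:500, n, replace = TRUE)   # hypothetical u_i, e.g. policyholders
x <- rbinom(n, size = 1, prob = 0.5)            # hypothetical binary predictor
rate <- exp(-2 + 0.5 * x)                       # assumed events per unit of exposure
y <- rpois(n, lambda = exposure * rate)

M_off <- glm(y ~ x + offset(log(exposure)), family = poisson)
M_raw <- glm(y ~ x, family = poisson)           # ignores exposure

coef(M_off)     # should lie close to -2 and 0.5, the rate-scale values
coef(M_raw)     # intercept absorbs the typical exposure instead

# Predicted rate per unit of exposure: set exposure to 1 so that log(1) = 0
new_df <- data.frame(x = c(0, 1), exposure = 1)
rate_hat <- predict(M_off, newdata = new_df, type = "response")
rate_hat
```

With the offset, the coefficients describe the rate per unit of exposure; without it, the intercept instead reflects the expected raw count at typical exposure levels.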

Summary

  • Poisson regression is the standard model for unbounded count data
  • Log link ensures positive rates; coefficients are log rate ratios
  • \(e^{\beta_k}\) is the multiplicative change in the expected count
  • Exposure terms handle unequal observation windows via offsets