Poisson Regression

Mark Andrews

The Poisson distribution

A discrete distribution over \(\{0, 1, 2, \ldots\}\)
Parameterised by a single rate \(\lambda > 0\)

\[ \Pr(x = k \mid \lambda) = \frac{e^{-\lambda}\lambda^k}{k!} \]

Mean \(= \lambda\), Variance \(= \lambda\) (equidispersion)
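
As a quick numerical check of the formula above, the following sketch compares it with R's built-in dpois() and confirms that the mean and variance both come out as \(\lambda\). The value \(\lambda = 3.5\) and the truncation of the support at 100 are arbitrary choices made for illustration.

```r
# Illustrative values: lambda and the truncation point are arbitrary choices
lambda <- 3.5
k <- 0:100  # the probability mass beyond 100 is negligible for this lambda

# Probability mass from the formula and from R's built-in function
p_formula <- exp(-lambda) * lambda^k / factorial(k)
p_builtin <- dpois(k, lambda)

# Mean and variance computed directly from the probability mass function
mean_k <- sum(k * p_builtin)
var_k  <- sum((k - mean_k)^2 * p_builtin)

all.equal(p_formula, p_builtin)
c(mean = mean_k, variance = var_k)
```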

The Poisson distribution: shape

As \(\lambda\) increases, the distribution shifts right and widens

[Figure: Poisson probability mass functions for \(\lambda = 3.5\), \(\lambda = 5\), \(\lambda = 10\), \(\lambda = 15\)]

The variance grows with the mean, unlike the normal distribution, where the variance is a separate parameter
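
One way to see the widening is by simulation. The sketch below draws samples with rpois() for the four rates shown and tabulates the sample mean and variance; the seed and sample size are arbitrary assumptions, and the figures will vary slightly from run to run if the seed is changed.

```r
set.seed(10)  # arbitrary seed, for reproducibility only
lambdas <- c(3.5, 5, 10, 15)

# Draw 100,000 samples for each rate
draws <- lapply(lambdas, function(l) rpois(1e5, l))

# Sample mean and variance should both sit close to lambda
shape_df <- data.frame(
  lambda      = lambdas,
  sample_mean = sapply(draws, mean),
  sample_var  = sapply(draws, var)
)
shape_df
```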

Poisson regression

\[ \begin{aligned} y_i &\sim \mathrm{Poisson}(\lambda_i)\\ \log(\lambda_i) &= \beta_0 + \sum_k \beta_k x_{ki} \end{aligned} \]

The log link ensures \(\lambda_i > 0\)
Equivalently: \(\lambda_i = \exp(\beta_0 + \sum_k \beta_k x_{ki})\)
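
To make the model concrete, the following sketch simulates data from it with one predictor and then recovers the coefficients with glm(). The sample size, seed, and true values of \(\beta_0\) and \(\beta_1\) are invented for illustration and are not taken from any real data set.

```r
set.seed(101)  # arbitrary seed
n <- 5000
beta_0 <- 0.5  # hypothetical true intercept
beta_1 <- 0.3  # hypothetical true slope

x <- rnorm(n)
lambda <- exp(beta_0 + beta_1 * x)  # inverse of the log link: always positive
y <- rpois(n, lambda)               # one Poisson draw per observation

# Fit the same model; estimates should be close to beta_0 and beta_1
M_sim <- glm(y ~ x, family = poisson(link = "log"))
coef(M_sim)
```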

Fitting with glm

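library(tidyverse)  # read_csv(), mutate(), tibble()

doctor_df <- read_csv("data/DoctorAUS.csv") |>
  mutate(age = age * 100)  # assuming age is stored as years / 100, this converts it to years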

M_10 <- glm(doctorco ~ age,
            data = doctor_df,
            family = poisson(link = "log"))
summary(M_10)

Interpreting coefficients

On the log scale, \(\beta_k\) is the additive change in \(\log(\lambda)\) per unit increase in \(x_k\)
On the original count scale, a unit increase in \(x_k\) multiplies \(\lambda\) by \(e^{\beta_k}\)

\[ \lambda^+ = \lambda \cdot e^{\beta_k} \]
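
This identity follows from the model equation in one step. Writing \(\lambda^+\) for the rate after increasing \(x_k\) by one unit while holding every other predictor \(x_j\), \(j \neq k\), fixed:

\[ \begin{aligned} \log(\lambda^+) &= \beta_0 + \beta_k (x_k + 1) + \sum_{j \neq k} \beta_j x_j = \log(\lambda) + \beta_k\\ \lambda^+ &= e^{\log(\lambda) + \beta_k} = \lambda \cdot e^{\beta_k} \end{aligned} \]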

exp(coef(M_10)["age"])   # multiplicative effect per unit increase in age

Predicted counts

library(modelr)  # provides add_predictions()
doctor_new <- tibble(age = seq(20, 80))
add_predictions(doctor_new, M_10, type = "response")

The expected count changes multiplicatively with each predictor, increasing when \(\beta_k > 0\) and decreasing when \(\beta_k < 0\)
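
The same predictions can be obtained with base R's predict(). The sketch below uses simulated data (the data frame sim_df, its coefficients, and the seed are assumptions for illustration) to show that response-scale predictions are the exponential of link-scale predictions, and that each unit step in the predictor multiplies the predicted count by the same factor.

```r
set.seed(20)  # arbitrary seed
sim_df <- data.frame(x = runif(500, 0, 10))          # hypothetical predictor
sim_df$y <- rpois(500, exp(0.1 + 0.2 * sim_df$x))    # hypothetical true model
M_sim <- glm(y ~ x, data = sim_df, family = poisson(link = "log"))

new_df <- data.frame(x = 0:5)
eta        <- predict(M_sim, newdata = new_df, type = "link")      # log scale
lambda_hat <- predict(M_sim, newdata = new_df, type = "response")  # count scale

# Ratio of consecutive predicted counts: constant and equal to exp(beta)
ratios <- lambda_hat[-1] / lambda_hat[-length(lambda_hat)]
ratios
exp(coef(M_sim)["x"])
```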

Model comparison

M_12 <- glm(doctorco ~ sex + age + income,
            data = doctor_df,
            family = poisson(link = "log"))

anova(M_10, M_12, test = "Chisq")

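The likelihood ratio test works the same way as for other GLMs: the drop in deviance between the nested models is compared to a \(\chi^2\) distribution with degrees of freedom equal to the number of added parameters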
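
The test that anova() reports can be reproduced by hand from the two deviances. The following sketch does so on simulated data; the model names M_small and M_big, the predictors, and the coefficients are hypothetical and chosen only to give a nested pair of models.

```r
set.seed(42)  # arbitrary seed
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rpois(n, exp(0.2 + 0.4 * x1 + 0.3 * x2))  # hypothetical true model

M_small <- glm(y ~ x1,      family = poisson(link = "log"))
M_big   <- glm(y ~ x1 + x2, family = poisson(link = "log"))

# Likelihood ratio statistic: drop in deviance from the smaller to the larger model
dev_diff <- deviance(M_small) - deviance(M_big)
df_diff  <- df.residual(M_small) - df.residual(M_big)
p_manual <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)

# The same quantities as reported by anova()
lrt <- anova(M_small, M_big, test = "Chisq")
c(manual = p_manual, anova = lrt$`Pr(>Chi)`[2])
```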

Exposure and offset

When observation periods differ across individuals, counts are not directly comparable
Include the log of the exposure \(u_i\) as a fixed offset in the linear predictor:

\[ \log(\lambda_i) = \log(u_i) + \beta_0 + \sum_k \beta_k x_{ki} \]
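
With \(u_i\) denoting the exposure of observation \(i\), moving the offset to the left-hand side shows that the linear predictor models the rate per unit of exposure rather than the raw expected count:

\[ \begin{aligned} \log(\lambda_i) - \log(u_i) &= \beta_0 + \sum_k \beta_k x_{ki}\\ \log\left(\frac{\lambda_i}{u_i}\right) &= \beta_0 + \sum_k \beta_k x_{ki}\\ \lambda_i &= u_i \cdot \exp\left(\beta_0 + \sum_k \beta_k x_{ki}\right) \end{aligned} \]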

M_ins <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
             data = insur_df, family = poisson)

The offset in practice

offset(log(Holders)) tells R to add \(\log(\text{Holders})\) to the linear predictor with coefficient fixed at 1
Coefficients now describe effects on the rate (claims per holder), not the raw count
The exposure correction makes observations with different exposures comparable (here, different numbers of policyholders; in other settings, different time windows)
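
The points above can be checked on simulated data. In this sketch the variables exposure, x, and claims, along with the true rate coefficients, are invented for illustration and do not come from the insurance data. The offset model recovers the rate coefficients, and letting the coefficient on log(exposure) be estimated freely gives a value close to 1, which is the value the offset fixes.

```r
set.seed(7)  # arbitrary seed
n <- 2000
exposure <- sample(10:500, n, replace = TRUE)  # hypothetical exposure per row
x <- rbinom(n, 1, 0.5)                         # hypothetical binary predictor
rate <- exp(-2 + 0.5 * x)                      # hypothetical claims per unit of exposure
claims <- rpois(n, exposure * rate)            # expected count = exposure * rate

# Offset model: the coefficient on log(exposure) is fixed at 1
M_off <- glm(claims ~ x + offset(log(exposure)), family = poisson)
coef(M_off)  # estimates of the rate coefficients (true values -2 and 0.5)

# For comparison: estimate the coefficient on log(exposure) instead of fixing it
M_free <- glm(claims ~ x + log(exposure), family = poisson)
coef(M_free)["log(exposure)"]
```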

Summary

Poisson regression is the standard model for unbounded count data
Log link ensures positive rates; coefficients are log rate ratios
\(e^{\beta_k}\) is the multiplicative change in the expected count
Exposure terms handle unequal observation windows via offsets