Negative Binomial Regression

Mark Andrews

Overdispersion

  • The Poisson distribution has mean \(=\) variance \(= \lambda\)
  • Real count data often have variance much greater than the mean; this is called overdispersion
  • Fitting a Poisson model to overdispersed data underestimates standard errors, so confidence intervals are too narrow and p-values too small

Diagnosing overdispersion

For the biochemists dataset (publications by PhD students):

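# Publication counts, one value per PhD student
pubs <- biochem_df$publications
# For Poisson-distributed counts the variance-to-mean ratio is close to 1
c(mean = mean(pubs), variance = var(pubs), ratio = var(pubs) / mean(pubs))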

A variance-to-mean ratio well above 1 indicates overdispersion
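
The same check can be tried on simulated counts, where the true answer is known. This is a minimal sketch: the sample size, the mean of 2 and the size of 1 are made-up values chosen only for illustration, and dispersion_ratio is a hypothetical helper, not part of any package.

```r
set.seed(101)
n <- 10000

# Poisson counts: variance should be close to the mean
y_pois <- rpois(n, lambda = 2)
# Negative binomial counts with the same mean but extra variance
y_nb <- rnbinom(n, mu = 2, size = 1)

# Hypothetical helper: sample variance divided by sample mean
dispersion_ratio <- function(y) var(y) / mean(y)

ratio_pois <- dispersion_ratio(y_pois)  # expected to be close to 1
ratio_nb <- dispersion_ratio(y_nb)      # theoretical value is 1 + mu/size = 3
c(poisson = ratio_pois, negbin = ratio_nb)
```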

The negative binomial distribution

  • A Poisson distribution where \(\lambda\) itself varies according to a gamma distribution
  • This gamma mixing inflates the variance relative to Poisson
  • Mean \(= \mu\), Variance \(= \mu + \mu^2/r\) where \(r > 0\) is a dispersion parameter

As \(r \to \infty\), the distribution approaches Poisson
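
The gamma mixture can be checked by simulation. In this sketch the values of mu and r are arbitrary assumptions; the gamma distribution is given shape r and rate r/mu so that its mean is mu.

```r
set.seed(102)
n <- 100000
mu <- 3
r <- 2

# Draw a separate Poisson rate for each observation from a gamma with mean mu
lambda <- rgamma(n, shape = r, rate = r / mu)
# Then draw one Poisson count per rate
y <- rpois(n, lambda)

# Sample moments should be close to the negative binomial values
c(sample_mean = mean(y), theory_mean = mu)
c(sample_var = var(y), theory_var = mu + mu^2 / r)
```

Increasing r in this simulation shrinks the gamma's spread around mu, which is why the counts move towards Poisson.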

Probability mass function

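\[ \Pr(x = k \mid r, \theta) = \binom{r+k-1}{k} \theta^r (1-\theta)^k, \quad k = 0, 1, 2, \ldots \]

For non-integer \(r\), the binomial coefficient is \(\Gamma(r+k) / (\Gamma(r)\, k!)\).

In the \((\mu, r)\) parameterisation, \(\theta = r/(r+\mu)\), which gives mean \(\mu\) and

\[ \mathrm{Var}(x) = \mu + \frac{\mu^2}{r} \]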

Larger \(r\) \(\Rightarrow\) less overdispersion (closer to Poisson)
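
The two parameterisations can be compared numerically with R's dnbinom, which accepts either prob or mu alongside size. A small sketch with arbitrary values of r and mu, using the gamma-function form of the binomial coefficient so that non-integer r works:

```r
r <- 2.5
mu <- 4
theta <- r / (r + mu)   # success probability implied by (mu, r)
k <- 0:10

# PMF written out by hand; lgamma is used for numerical stability
pmf_manual <- exp(lgamma(r + k) - lgamma(r) - lgamma(k + 1)) *
  theta^r * (1 - theta)^k

# Built-in density under each parameterisation
pmf_prob <- dnbinom(k, size = r, prob = theta)
pmf_mu <- dnbinom(k, size = r, mu = mu)

all.equal(pmf_manual, pmf_prob)
all.equal(pmf_prob, pmf_mu)
```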

Negative binomial regression

\[ \begin{aligned} y_i &\sim \mathrm{NegBinomial}(\mu_i, r)\\ \log(\mu_i) &= \beta_0 + \sum_k \beta_k x_{ki} \end{aligned} \]

  • Log link, same coefficient interpretation as Poisson
  • Dispersion \(r\) is estimated from the data

Fitting with glm.nb

library(MASS)

M_13 <- glm.nb(publications ~ prestige, data = biochem_df)
summary(M_13)

  • Theta in the output is the estimated dispersion \(r\) (not the \(\theta\) of the probability mass function)
  • Larger Theta means less overdispersion
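
To see what glm.nb recovers, it can be run on simulated data with known parameters. The sketch below does not use biochem_df; the predictor x, the coefficients and the dispersion r_true are all hypothetical values chosen for illustration.

```r
library(MASS)

set.seed(103)
n <- 5000
x <- rnorm(n)

# Hypothetical true values, chosen only for illustration
beta_0 <- 0.5
beta_1 <- 0.3
r_true <- 2
y <- rnbinom(n, mu = exp(beta_0 + beta_1 * x), size = r_true)

M_sim <- glm.nb(y ~ x)
coef(M_sim)        # estimates of beta_0 and beta_1 on the log scale
exp(coef(M_sim))   # multiplicative effects on the expected count
M_sim$theta        # estimated dispersion, reported as Theta by summary()
```

Exponentiating a slope gives the factor by which the expected count changes for a one-unit increase in the predictor, exactly as in Poisson regression.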

Comparing standard errors: Poisson vs negative binomial

M_14 <- glm(publications ~ prestige, family = poisson, data = biochem_df)

summary(M_14)$coefficients   # Poisson (SEs likely too small)
summary(M_13)$coefficients   # Negative binomial (SEs allow for overdispersion)

When overdispersion is present, negative binomial standard errors are larger and better reflect the uncertainty in the estimates
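
The size of the difference can be illustrated on simulated data. This is a sketch under assumed values (size = 0.8 gives strong overdispersion); M_pois and M_nb are illustrative names, not models from the biochemists analysis.

```r
library(MASS)

set.seed(104)
n <- 1000
x <- rnorm(n)
# Strongly overdispersed counts: small size means large extra variance
y <- rnbinom(n, mu = exp(1 + 0.5 * x), size = 0.8)

M_pois <- glm(y ~ x, family = poisson)
M_nb <- glm.nb(y ~ x)

# Pull the standard error column from each coefficient table
se_pois <- summary(M_pois)$coefficients[, "Std. Error"]
se_nb <- summary(M_nb)$coefficients[, "Std. Error"]

rbind(poisson = se_pois, negbin = se_nb, ratio = se_nb / se_pois)
```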

A fuller model

M_16 <- glm.nb(publications ~ gender + married + I(children > 0) + prestige + mentor,
               data = biochem_df)
summary(M_16)

Model comparison

Nested negative binomial models can be compared with anova, which performs a likelihood ratio test:

anova(M_13, M_16)   # M_13 (prestige only) is nested within M_16

AIC can compare the Poisson model against the negative binomial model with the same predictor (lower is better):

AIC(M_14, M_13)
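
Both comparisons can be rehearsed on simulated data. In this sketch x2 is a hypothetical predictor with no true effect, and all parameter values are assumptions made for illustration.

```r
library(MASS)

set.seed(105)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)   # hypothetical predictor with no true effect on y
y <- rnbinom(n, mu = exp(0.5 + 0.4 * x1), size = 1.5)

M_small <- glm.nb(y ~ x1)
M_big <- glm.nb(y ~ x1 + x2)
M_pois <- glm(y ~ x1, family = poisson)

# Likelihood ratio test for the two nested negative binomial models
anova(M_small, M_big)

# AIC for Poisson versus negative binomial with the same predictor
aic_table <- AIC(M_pois, M_small)
aic_table
```

Because the simulated counts are overdispersed, the negative binomial model should have the lower AIC even though it has one more parameter.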

Summary

  • Overdispersion occurs when the variance substantially exceeds the mean
  • Negative binomial regression adds a dispersion parameter \(r\) to handle this
  • Log link and coefficient interpretation remain the same as Poisson
  • glm.nb from MASS fits the model; Theta is the estimated dispersion