Negative Binomial Regression

Author

Mark Andrews

Abstract

The negative binomial model is an alternative to Poisson regression for overdispersed count data, where the variance substantially exceeds the mean. This guide covers the negative binomial distribution, overdispersion diagnostics, fitting with glm.nb from the MASS package, and comparing Poisson and negative binomial fits.

Overdispersion in count data

The Poisson distribution has the property that its mean and variance are both equal to \(\lambda\). Real count data frequently violate this: the variance can be considerably larger than the mean, a condition called overdispersion. Fitting a Poisson model to overdispersed data leads to standard errors that are too small, inflating test statistics and producing overconfident inferences.

The biochemists dataset records the number of academic publications by PhD students over three years.

biochem_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/iglmr24/main/data/biochemist.csv")

Rows: 915 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, married
dbl (4): publications, children, prestige, mentor

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pubs <- biochem_df$publications
c(mean = mean(pubs), variance = var(pubs), ratio = var(pubs) / mean(pubs))

    mean variance    ratio 
1.692896 3.709742 2.191358

A variance-to-mean ratio substantially greater than 1 signals overdispersion.

The negative binomial distribution

The negative binomial distribution can be derived as a Poisson distribution whose rate parameter \(\lambda\) itself varies across observations according to a gamma distribution. Marginalising over this gamma mixing yields a distribution with mean \(\mu\) and variance \(\mu + \mu^2 / r\), where \(r > 0\) is a dispersion parameter. As \(r \to \infty\), the extra variance term vanishes and the distribution approaches Poisson.

Negative binomial regression

In negative binomial regression, each count \(y_i\) is modelled as

\[ \begin{aligned} y_i &\sim \mathrm{NegBinomial}(\mu_i, r),\\ \log(\mu_i) &= \beta_0 + \sum_k \beta_k x_{ki}. \end{aligned} \]

The log link and coefficient interpretation are identical to Poisson regression. The extra parameter \(r\) is estimated from the data and accounts for overdispersion.

Fitting with glm.nb

M_13 <- glm.nb(publications ~ prestige, data = biochem_df)
summary(M_13)


Call:
glm.nb(formula = publications ~ prestige, data = biochem_df, 
    init.theta = 1.731467472, link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.25820    0.12018   2.148   0.0317 *
prestige     0.08531    0.03648   2.338   0.0194 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.7315) family taken to be 1)

    Null deviance: 1007.1  on 914  degrees of freedom
Residual deviance: 1001.7  on 913  degrees of freedom
AIC: 3220.4

Number of Fisher Scoring iterations: 1

              Theta:  1.731 
          Std. Err.:  0.181 

 2 x log-likelihood:  -3214.415

The Theta parameter in the output is the estimated dispersion \(r\). A larger value indicates less overdispersion (closer to Poisson).

Compare standard errors between Poisson and negative binomial models:

M_14 <- glm(publications ~ prestige,
            family = poisson(link = "log"),
            data = biochem_df)

summary(M_14)$coefficients

              Estimate Std. Error z value    Pr(>|z|)
(Intercept) 0.25817731 0.08665761 2.97928 0.002889264
prestige    0.08531946 0.02601124 3.28010 0.001037702

summary(M_13)$coefficients

              Estimate Std. Error  z value   Pr(>|z|)
(Intercept) 0.25820123 0.12017797 2.148490 0.03167481
prestige    0.08531208 0.03648261 2.338431 0.01936489

The negative binomial standard errors will be larger when overdispersion is present, reflecting the true uncertainty more honestly.

A fuller model

M_16 <- glm.nb(publications ~ gender + married + I(children > 0) + prestige + mentor,
               data = biochem_df)
summary(M_16)


Call:
glm.nb(formula = publications ~ gender + married + I(children > 
    0) + prestige + mentor, data = biochem_df, init.theta = 2.23905577, 
    link = log)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.391470   0.128451   3.048  0.00231 ** 
genderWomen         -0.206377   0.072876  -2.832  0.00463 ** 
marriedSingle       -0.143843   0.084788  -1.696  0.08979 .  
I(children > 0)TRUE -0.228953   0.085438  -2.680  0.00737 ** 
prestige             0.015478   0.035978   0.430  0.66705    
mentor               0.029274   0.003229   9.066  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.2391) family taken to be 1)

    Null deviance: 1104.7  on 914  degrees of freedom
Residual deviance: 1004.5  on 909  degrees of freedom
AIC: 3139.8

Number of Fisher Scoring iterations: 1

              Theta:  2.239 
          Std. Err.:  0.267 

 2 x log-likelihood:  -3125.827

Model comparison

Nested negative binomial models can be compared with anova:

M_17 <- glm.nb(publications ~ gender + I(children > 0) + mentor,
               data = biochem_df)
anova(M_17, M_16)

Likelihood ratio tests of Negative Binomial Models

Response: publications
                                                   Model    theta Resid. df
1                      gender + I(children > 0) + mentor 2.218397       911
2 gender + married + I(children > 0) + prestige + mentor 2.239056       909
     2 x log-lik.   Test    df LR stat.   Pr(Chi)
1       -3128.777                                
2       -3125.827 1 vs 2     2 2.949479 0.2288383

Poisson and negative binomial models are not nested in the usual sense (they have different dispersion assumptions), but AIC can still be used to compare them as candidate models for the same data.