Zero-Inflated Models

Author

Mark Andrews

Abstract

Zero-inflated models are for count data that contain more zeros than a standard Poisson or negative binomial model predicts. They combine a binary component (modelling whether an observation is a structural zero) with a count component (modelling the count for non-structural observations). This guide covers the zero-inflated Poisson and zero-inflated negative binomial models and their implementation via zeroinfl from the pscl package.

Excess zeros

Count data sometimes contain far more zeros than a Poisson or negative binomial model predicts. A practical diagnostic is to compare the observed proportion of zeros to the proportion predicted under the fitted model. If there is a large discrepancy, a zero-inflated model may be appropriate.

The smoking.csv dataset records the number of cigarettes smoked daily by survey respondents. Many non-smokers are included, producing a spike at zero that no standard count model fits well.

smoking_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/iglmr24/main/data/smoking.csv")

Rows: 807 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (10): educ, cigpric, white, age, income, cigs, restaurn, lincome, agesq,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

table(smoking_df$cigs)


  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  18  19  20 
497   7   5   5   2   7   3   2   3   2  28   2   4   2   1  23   1   3   1 101 
 25  28  30  33  35  40  50  55  60  80 
  7   3  42   1   2  37   6   1   8   1

mean(smoking_df$cigs)

[1] 8.686493

For comparison, a Poisson with the observed mean predicts far fewer zeros than we observe:

mean(rpois(1e6, lambda = mean(smoking_df$cigs)) == 0)

[1] 0.000191

The zero-inflated Poisson model

A zero-inflated Poisson (ZIP) model assumes that each observation comes from one of two latent groups. With probability \(\theta_i\), the observation is a structural zero — the individual is, so to speak, incapable of producing a non-zero count. With probability \(1 - \theta_i\), the count is drawn from a Poisson distribution with rate \(\lambda_i\).

\[ y_i \sim \begin{cases} 0 & \text{with probability } \theta_i,\\ \mathrm{Poisson}(\lambda_i) & \text{with probability } 1 - \theta_i. \end{cases} \]

Both \(\theta_i\) and \(\lambda_i\) can depend on predictors:

\[ \mathrm{logit}(\theta_i) = a + b x_i, \qquad \log(\lambda_i) = \alpha + \beta x_i. \]

Fitting with zeroinfl

M_19 <- zeroinfl(cigs ~ educ, data = smoking_df)
summary(M_19)


Call:
zeroinfl(formula = cigs ~ educ, data = smoking_df)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.9775 -0.7747 -0.6606  0.9575  5.6549 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 2.697867   0.056779  47.516  < 2e-16 ***
educ        0.034718   0.004536   7.653 1.96e-14 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.56273    0.30605  -1.839 0.065960 .  
educ         0.08357    0.02417   3.457 0.000546 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 16 
Log-likelihood: -2441 on 4 Df

The output has two sections: Count model coefficients (the Poisson component, log scale) and Zero-inflation model coefficients (the binary component, logit scale).

Extracting and interpreting the two components

estimates_zero  <- coef(M_19, model = "zero")
estimates_count <- coef(M_19, model = "count")

# Probability of being a structural zero (non-smoker) at education = 10 and 20
plogis(estimates_zero[1] + estimates_zero[2] * 10)

(Intercept) 
  0.5678196

plogis(estimates_zero[1] + estimates_zero[2] * 20)

(Intercept) 
  0.7518781

# Expected number of cigarettes smoked (among smokers) at education = 10 and 20
exp(estimates_count[1] + estimates_count[2] * 10)

(Intercept) 
   21.01109

exp(estimates_count[1] + estimates_count[2] * 20)

(Intercept) 
   29.73232

Predictions

zeroinfl supports three prediction types:

smoking_new <- tibble(educ = seq(5, 20))

add_predictions(smoking_new, M_19, type = "response")  # overall expected count

# A tibble: 16 × 2
    educ  pred
   <int> <dbl>
 1     5  9.47
 2     6  9.42
 3     7  9.36
 4     8  9.28
 5     9  9.19
 6    10  9.08
 7    11  8.96
 8    12  8.82
 9    13  8.67
10    14  8.51
11    15  8.34
12    16  8.17
13    17  7.98
14    18  7.78
15    19  7.58
16    20  7.38

add_predictions(smoking_new, M_19, type = "count")     # expected count among smokers

# A tibble: 16 × 2
    educ  pred
   <int> <dbl>
 1     5  17.7
 2     6  18.3
 3     7  18.9
 4     8  19.6
 5     9  20.3
 6    10  21.0
 7    11  21.8
 8    12  22.5
 9    13  23.3
10    14  24.1
11    15  25.0
12    16  25.9
13    17  26.8
14    18  27.7
15    19  28.7
16    20  29.7

add_predictions(smoking_new, M_19, type = "zero") |>   # P(structural zero)
  mutate(pred = 1 - pred)                              # convert to P(smoker)

# A tibble: 16 × 2
    educ  pred
   <int> <dbl>
 1     5 0.536
 2     6 0.515
 3     7 0.494
 4     8 0.474
 5     9 0.453
 6    10 0.432
 7    11 0.412
 8    12 0.392
 9    13 0.372
10    14 0.353
11    15 0.334
12    16 0.316
13    17 0.298
14    18 0.281
15    19 0.264
16    20 0.248

Zero-inflated negative binomial

If the count component is also overdispersed, we can use a zero-inflated negative binomial model by adding dist = "negbin":

M_zinb <- zeroinfl(cigs ~ educ, data = smoking_df, dist = "negbin")
summary(M_zinb)


Call:
zeroinfl(formula = cigs ~ educ, data = smoking_df, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.7695 -0.6307 -0.5465  0.7923  4.7048 

Count model coefficients (negbin with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.69164    0.16546   16.27  < 2e-16 ***
educ         0.03509    0.01344    2.61  0.00906 ** 
Log(theta)   1.09484    0.09501   11.52  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.57130    0.30675  -1.862 0.062546 .  
educ         0.08403    0.02422   3.469 0.000522 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 2.9887 
Number of iterations in BFGS optimization: 19 
Log-likelihood: -1748 on 5 Df

AIC(M_19, M_zinb)

       df      AIC
M_19    4 4890.561
M_zinb  5 3506.621