Zero-Inflated Models

Mark Andrews

Excess zeros

  • Count data sometimes contain far more zeros than a Poisson or negative binomial model predicts
  • Diagnostic: compare observed proportion of zeros to model-predicted proportion
  • Example: survey data on cigarettes smoked daily, where the many non-smokers produce structural zeros
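The diagnostic above can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not the smoking survey: the sample size, the 60% share of structural zeros, and the rate of 5.5 are all assumed values chosen for the example.

```r
# Simulated counts (hypothetical data, not the smoking survey):
# about 60% structural zeros, the rest Poisson with mean 5.5
set.seed(101)
n <- 1000
is_structural_zero <- rbinom(n, size = 1, prob = 0.6)
y <- ifelse(is_structural_zero == 1, 0, rpois(n, lambda = 5.5))

# Intercept-only Poisson regression: the fitted rate is the sample mean
M_pois <- glm(y ~ 1, family = poisson)
lambda_hat <- unname(exp(coef(M_pois)))

observed_zero  <- mean(y == 0)          # observed proportion of zeros
predicted_zero <- dpois(0, lambda_hat)  # proportion of zeros the Poisson fit predicts

c(observed = observed_zero, predicted = predicted_zero)
```

If the observed proportion is much larger than the predicted one, as it should be here by construction, that points towards excess zeros.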

The Poisson distribution with \(\lambda = 5.5\)

A standard Poisson can produce zeros by chance, but the probability of a zero is fixed at \(e^{-\lambda}\), which is small unless \(\lambda\) is close to zero:

\[ \Pr(x = 0 \mid \lambda = 5.5) = e^{-5.5} \approx 0.004 \]

If the observed proportion of zeros is 0.6, the Poisson is clearly inadequate
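The probability above can be checked directly with `dpois`. The sample size of 1000 in the last line is a hypothetical figure, used only to turn the probability into an expected number of zeros.

```r
lambda <- 5.5

# Probability of a zero under Poisson(lambda); identical to exp(-lambda)
p_zero <- dpois(0, lambda = lambda)
p_zero
all.equal(p_zero, exp(-lambda))

# Expected number of zeros in a hypothetical sample of 1000 observations
expected_zeros <- 1000 * p_zero
expected_zeros
```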

Zero-inflated Poisson (ZIP)

Each observation comes from a mixture of two components:

\[ y_i \sim \begin{cases} 0 & \text{with probability } \theta_i \\ \mathrm{Poisson}(\lambda_i) & \text{with probability } 1-\theta_i \end{cases} \]

  • \(\theta_i\) is the probability of a structural zero: a zero that arises regardless of the count process, e.g. from a non-smoker
  • \(\lambda_i\) is the Poisson rate for non-structural observations
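Under this mixture, a zero can come from either component, so \(\Pr(y_i = 0) = \theta_i + (1-\theta_i)e^{-\lambda_i}\), and the overall mean is \((1-\theta_i)\lambda_i\). The sketch below checks both numerically. `dzip` is a hypothetical helper name written for this illustration, not a function from pscl, and the parameter values are assumed.

```r
# Probability mass function of a zero-inflated Poisson.
# dzip is a hypothetical helper, not part of any package.
dzip <- function(y, theta, lambda) {
  # Structural zeros add mass only at y = 0
  theta * (y == 0) + (1 - theta) * dpois(y, lambda)
}

theta  <- 0.6   # illustrative probability of a structural zero
lambda <- 5.5   # illustrative Poisson rate

y_vals <- 0:200  # wide enough that the omitted tail is negligible

p_zero   <- dzip(0, theta, lambda)                  # theta + (1 - theta) * exp(-lambda)
total    <- sum(dzip(y_vals, theta, lambda))        # should be 1 up to rounding
zip_mean <- sum(y_vals * dzip(y_vals, theta, lambda))  # should match (1 - theta) * lambda

c(p_zero = p_zero, total = total, mean = zip_mean)
```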

The ZIP model components

Both components are regression models:

\[ \mathrm{logit}(\theta_i) = a + bx_i \qquad \log(\lambda_i) = \alpha + \beta x_i \]

  • Binary component: logistic regression for the probability that an observation is a structural zero
  • Count component: standard Poisson regression
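To make the two link functions concrete, the following sketch simulates data from a ZIP regression. All coefficient values, the predictor range, and the name `sim_df` are assumptions made up for the illustration; they are not estimates from the smoking data.

```r
set.seed(42)
n <- 500
x <- runif(n, min = 5, max = 20)   # hypothetical predictor

# Hypothetical coefficient values, chosen only for illustration
a     <- -3.0; b    <- 0.25    # zero-inflation component (logit scale)
alpha <-  3.0; beta <- -0.05   # count component (log scale)

theta  <- plogis(a + b * x)        # inverse logit: P(structural zero)
lambda <- exp(alpha + beta * x)    # inverse log: Poisson rate

# Draw the component first, then the count for non-structural observations
structural_zero <- rbinom(n, size = 1, prob = theta)
y <- ifelse(structural_zero == 1, 0L, rpois(n, lambda))

sim_df <- data.frame(x = x, y = y)
head(sim_df)
```

A data frame like `sim_df` could then be passed to a zero-inflated regression to see how well the assumed coefficients are recovered.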

Fitting with zeroinfl

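library(tidyverse)  # provides read_csv
library(pscl)       # provides zeroinfl

smoking_df <- read_csv("data/smoking.csv")
# dist = "poisson" is the default, so this fits a ZIP model
M_19 <- zeroinfl(cigs ~ educ, data = smoking_df)
summary(M_19)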

The summary output has two coefficient tables: one for the count model (log link) and one for the zero-inflation model (logit link)

Interpreting the two components

est_zero  <- coef(M_19, model = "zero")
est_count <- coef(M_19, model = "count")

# P(structural zero | educ = 10)
plogis(est_zero[1] + est_zero[2] * 10)

# Expected cigarettes for smokers | educ = 10
exp(est_count[1] + est_count[2] * 10)

Three types of prediction

library(modelr)  # provides add_predictions; tibble() comes from the tidyverse

smoking_new <- tibble(educ = seq(5, 20))

add_predictions(smoking_new, M_19, type = "response")  # overall expected count
add_predictions(smoking_new, M_19, type = "count")     # mean of the count component (non-structural observations)
add_predictions(smoking_new, M_19, type = "zero")      # P(structural zero)
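The three prediction types are linked: the overall expected count is the count-component mean scaled by the probability of not being a structural zero, i.e. \((1-\theta_i)\lambda_i\). The sketch below reproduces that relationship by hand in base R. The coefficient vectors are hypothetical stand-ins for `coef(M_19, model = "zero")` and `coef(M_19, model = "count")`, not estimates from the smoking data.

```r
# Hypothetical coefficient values (intercept, slope);
# they are NOT estimates from the smoking data
est_zero  <- c(-3.0, 0.25)   # logit scale
est_count <- c(3.0, -0.05)   # log scale

educ <- seq(5, 20)

p_zero     <- plogis(est_zero[1] + est_zero[2] * educ)  # analogue of type = "zero"
count_mean <- exp(est_count[1] + est_count[2] * educ)   # analogue of type = "count"
response   <- (1 - p_zero) * count_mean                 # analogue of type = "response"

data.frame(educ, p_zero, count_mean, response)
```

Because `1 - p_zero` is below one, the overall expected count is always smaller than the count-component mean.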

Zero-inflated negative binomial

For overdispersed count data alongside excess zeros:

M_zinb <- zeroinfl(cigs ~ educ, data = smoking_df, dist = "negbin")
AIC(M_19, M_zinb)
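To see why the negative binomial count component helps with overdispersion, the sketch below compares the variance of a ZIP and a ZINB distribution that share the same mean. `dzinb` is a hypothetical helper written for this illustration, and all parameter values are assumed. The `size` argument of `dnbinom` is the dispersion parameter, which the pscl summary appears to report as theta; that is a different quantity from \(\theta_i\) in these slides.

```r
# dzinb is a hypothetical helper, not part of pscl
dzinb <- function(y, p_zero, mu, size) {
  # Structural zeros add mass only at y = 0
  p_zero * (y == 0) + (1 - p_zero) * dnbinom(y, mu = mu, size = size)
}

p_zero <- 0.6   # illustrative probability of a structural zero
mu     <- 5.5   # illustrative mean of the count component
size   <- 2     # illustrative dispersion; smaller means more overdispersion

y_vals <- 0:2000  # wide enough that the omitted tail is negligible

# Mean and variance under ZINB
probs_zinb <- dzinb(y_vals, p_zero, mu, size)
zinb_mean  <- sum(y_vals * probs_zinb)
zinb_var   <- sum((y_vals - zinb_mean)^2 * probs_zinb)

# Same calculation with a Poisson count component (ZIP)
probs_zip <- p_zero * (y_vals == 0) + (1 - p_zero) * dpois(y_vals, mu)
zip_mean  <- sum(y_vals * probs_zip)
zip_var   <- sum((y_vals - zip_mean)^2 * probs_zip)

c(zip_mean = zip_mean, zinb_mean = zinb_mean, zip_var = zip_var, zinb_var = zinb_var)
```

The two means agree, while the ZINB variance is larger, which is the extra flexibility the `dist = "negbin"` option provides.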

Summary

  • Zero-inflated models combine a binary component (structural zeros) with a count component
  • Both components are regression models with their own coefficients
  • zeroinfl from pscl fits ZIP and ZINB models
  • Three prediction types: overall response, count component, zero component