Zero-inflated models are for count data that contain more zeros than a standard Poisson or negative binomial model predicts. They combine a binary component (modelling whether an observation is a structural zero) with a count component (modelling the count for non-structural observations). This guide covers the zero-inflated Poisson and zero-inflated negative binomial models and their implementation via zeroinfl from the pscl package.
Excess zeros
Count data sometimes contain far more zeros than a Poisson or negative binomial model predicts. A practical diagnostic is to fit the standard model and compare the observed proportion of zeros with the proportion that the fitted model predicts. If the observed proportion is much larger than the predicted one, a zero-inflated model may be appropriate.
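The diagnostic just described can be sketched in base R. The data here are simulated stand-ins, not the smoking data, and the names x, y, and M_pois are hypothetical; the point is only to show how the observed and predicted zero proportions might be computed and compared.

```r
# Simulated stand-in data (hypothetical; not the smoking data)
set.seed(101)
n <- 1000
x <- rnorm(n)
# About 40% of observations are structural zeros; the rest are Poisson counts
is_structural_zero <- rbinom(n, size = 1, prob = 0.4)
y <- ifelse(is_structural_zero == 1, 0, rpois(n, lambda = exp(1 + 0.3 * x)))

# Fit an ordinary Poisson regression
M_pois <- glm(y ~ x, family = poisson(link = "log"))

# Observed proportion of zeros
obs_zero <- mean(y == 0)

# Predicted proportion of zeros: the average of P(y_i = 0) under the fitted rates
pred_zero <- mean(dpois(0, lambda = fitted(M_pois)))

c(observed = obs_zero, predicted = pred_zero)
```

With data generated this way, the observed proportion should sit well above the predicted one, which is the pattern that suggests zero inflation.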
The smoking.csv dataset records the number of cigarettes smoked daily by survey respondents. Many non-smokers are included, producing a spike at zero that no standard count model fits well.
Rows: 807 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (10): educ, cigpric, white, age, income, cigs, restaurn, lincome, agesq,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A zero-inflated Poisson (ZIP) model assumes that each observation comes from one of two latent groups. With probability \(\theta_i\), the observation is a structural zero: the individual belongs to a group that always produces a count of zero (here, non-smokers). With probability \(1 - \theta_i\), the count is drawn from a Poisson distribution with rate \(\lambda_i\), which can itself produce a zero by chance.
\[
y_i \sim \begin{cases}
0 & \text{with probability } \theta_i,\\
\mathrm{Poisson}(\lambda_i) & \text{with probability } 1 - \theta_i.
\end{cases}
\]
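Two consequences of this mixture are worth writing out. A zero can arise in two ways, either as a structural zero or as a Poisson draw that happens to equal zero, while a positive count can only come from the Poisson component:
\[
P(y_i = 0) = \theta_i + (1 - \theta_i)\, e^{-\lambda_i}, \qquad
P(y_i = k) = (1 - \theta_i)\, \frac{e^{-\lambda_i} \lambda_i^{k}}{k!} \quad \text{for } k = 1, 2, \ldots
\]
Because structural zeros contribute nothing to the mean, the overall expected count is
\[
\mathrm{E}(y_i) = (1 - \theta_i)\, \lambda_i.
\]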
Both \(\theta_i\) and \(\lambda_i\) can depend on predictors:
\[
\mathrm{logit}(\theta_i) = a + b x_i, \qquad \log(\lambda_i) = \alpha + \beta x_i.
\]
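As a rough check on this definition, the sketch below simulates from a ZIP model at a single predictor value and compares the simulated zero proportion and mean with the values the mixture implies. The coefficient values and the helper dzip are hypothetical, chosen only for illustration; they are not estimates from the smoking data.

```r
# Hypothetical coefficient values, chosen only for illustration
a <- -0.5; b <- 0.8          # zero-inflation component (logit scale)
alpha <- 1.0; beta <- 0.3    # count component (log scale)

# ZIP probability mass function (hypothetical helper, not part of pscl)
dzip <- function(y, lambda, theta) {
  ifelse(y == 0,
         theta + (1 - theta) * dpois(0, lambda),   # structural or chance zero
         (1 - theta) * dpois(y, lambda))           # positive counts: Poisson only
}

# Simulate at a single predictor value x = 1
set.seed(42)
x <- 1
theta <- plogis(a + b * x)        # probability of a structural zero
lambda <- exp(alpha + beta * x)   # Poisson rate for the count component
n <- 1e5
structural <- rbinom(n, size = 1, prob = theta)
y <- ifelse(structural == 1, 0, rpois(n, lambda))

# Compare simulated and theoretical zero proportion and mean
c(sim_zero = mean(y == 0), theory_zero = dzip(0, lambda, theta))
c(sim_mean = mean(y), theory_mean = (1 - theta) * lambda)
```

The simulated and theoretical values should agree closely, and the zero proportion should exceed theta because the Poisson component also produces some zeros.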
Fitting with zeroinfl
library(pscl)
M_19 <- zeroinfl(cigs ~ educ, data = smoking_df)
summary(M_19)
Call:
zeroinfl(formula = cigs ~ educ, data = smoking_df)

Pearson residuals:
    Min      1Q  Median      3Q     Max
-0.9775 -0.7747 -0.6606  0.9575  5.6549

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.697867   0.056779  47.516  < 2e-16 ***
educ        0.034718   0.004536   7.653 1.96e-14 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.56273    0.30605  -1.839 0.065960 .
educ         0.08357    0.02417   3.457 0.000546 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 16
Log-likelihood: -2441 on 4 Df
The output has two sections: Count model coefficients (the Poisson component, log scale) and Zero-inflation model coefficients (the binary component, logit scale).
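To make the two sets of coefficients more concrete, the following sketch fits a ZIP model by directly maximising its log-likelihood with optim and the BFGS method, using simulated data. This is an illustration under stated assumptions, not pscl's actual implementation: the function zip_nll, the simulated data, and the true coefficient values are all hypothetical.

```r
# Simulated data with known (hypothetical) coefficients
set.seed(7)
n <- 2000
x <- rnorm(n)
theta_true <- plogis(-0.5 + 0.8 * x)   # zero-inflation component
lambda_true <- exp(1.0 + 0.3 * x)      # count component
y <- ifelse(rbinom(n, 1, theta_true) == 1, 0, rpois(n, lambda_true))

# Negative log-likelihood of the ZIP model; par = (alpha, beta, a, b)
zip_nll <- function(par, y, x) {
  lambda <- exp(par[1] + par[2] * x)    # log link for the Poisson rate
  theta <- plogis(par[3] + par[4] * x)  # logit link for the structural-zero probability
  lik <- ifelse(y == 0,
                theta + (1 - theta) * exp(-lambda),
                (1 - theta) * dpois(y, lambda))
  -sum(log(lik))
}

# Minimise the negative log-likelihood, starting from all-zero coefficients
fit <- optim(par = c(0, 0, 0, 0), fn = zip_nll, y = y, x = x,
             method = "BFGS", hessian = TRUE)

estimates <- fit$par
names(estimates) <- c("count_intercept", "count_slope",
                      "zero_intercept", "zero_slope")

# Approximate standard errors from the inverse Hessian at the optimum
std_errors <- sqrt(diag(solve(fit$hessian)))
round(cbind(estimate = estimates, std_error = std_errors), 3)
```

The first two estimates play the role of the count model coefficients and the last two the role of the zero-inflation model coefficients, so with this sample size they should land near the values used to simulate the data.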
Extracting and interpreting the two components
estimates_zero <- coef(M_19, model = "zero")
estimates_count <- coef(M_19, model = "count")
# Probability of being a structural zero (non-smoker) at education = 10 and 20
plogis(estimates_zero[1] + estimates_zero[2] * 10)
(Intercept)
0.5678196
plogis(estimates_zero[1] + estimates_zero[2] * 20)
(Intercept)
0.7518781
# Expected number of cigarettes smoked (among smokers) at education = 10 and 20
exp(estimates_count[1] + estimates_count[2] * 10)