Hurdle models are a close relative of zero-inflated models and are likewise designed for count data with excess zeros. Unlike zero-inflated models, they treat the zero/non-zero boundary as a conceptually clean hurdle: the first model component predicts whether the count exceeds zero, and the second predicts the size of the count given that it does. This guide covers hurdle Poisson and hurdle negative binomial models, fitted with the hurdle() function from the pscl package.
The hurdle framework
Both zero-inflated and hurdle models address excess zeros, but they embody different causal stories.
In a zero-inflated model, zeros can arise in two ways: from the structural zero-generating process, or by chance from the count process (a Poisson random variable can produce a zero). This makes the distinction between “true zeros” and “chance zeros” latent and unobserved.
In a hurdle model, the distinction is cleaner. Every observation is first classified as zero or non-zero by a binary model (the hurdle stage). If the count exceeds zero, its exact value is then modelled by a truncated count distribution — a Poisson or negative binomial that assigns zero probability to zero. There is no latent mixture: all zeros come from the binary component, and all positive counts come from the truncated count component.
This separation makes hurdle models easier to interpret when the two stages correspond to genuinely distinct processes. In the smoking dataset, for example, the hurdle stage models whether someone smokes at all, and the count stage models how many cigarettes a smoker typically smokes.
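The contrast between the two frameworks can be written compactly. As a sketch, let \(f\) be the probability mass function of the untruncated count distribution (Poisson or negative binomial), \(\pi\) the probability of a structural zero in the zero-inflated model, and \(p\) the probability of clearing the hurdle. These symbols are notation assumed here for illustration and are not taken from pscl.

\[
\text{Zero-inflated:}\quad \Pr(y = 0) = \pi + (1 - \pi)\, f(0), \qquad \Pr(y = k) = (1 - \pi)\, f(k), \quad k \geq 1,
\]
\[
\text{Hurdle:}\quad \Pr(y = 0) = 1 - p, \qquad \Pr(y = k) = p\, \frac{f(k)}{1 - f(0)}, \quad k \geq 1.
\]

In the zero-inflated case a zero has two possible sources, whereas in the hurdle case the count distribution contributes only through its positive part.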
The hurdle Poisson model
The model has two components. The binary component is a logistic regression predicting \(\Pr(y_i > 0)\):
\[
\mathrm{logit}\!\left(\Pr(y_i > 0 \mid \vec{x}_i)\right) = a + b x_i.
\]
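The count component is a zero-truncated Poisson regression, fitted only to observations with \(y_i > 0\): the log of the Poisson rate is linear in the predictor, and the distribution is rescaled so that it places no probability on zero.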
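Written out, a zero-truncated Poisson regression takes the following form. The coefficient symbols \(\alpha\) and \(\beta\) and the rate \(\lambda_i\) are notation assumed here, chosen only to keep the count coefficients distinct from \(a\) and \(b\) in the binary component.

\[
\Pr(y_i = k \mid y_i > 0, x_i) = \frac{e^{-\lambda_i} \lambda_i^{k}}{k!\,\left(1 - e^{-\lambda_i}\right)}, \quad k = 1, 2, \ldots, \qquad \log \lambda_i = \alpha + \beta x_i.
\]

The factor \(1 - e^{-\lambda_i}\) is the Poisson probability of a non-zero count; dividing by it redistributes the mass at zero over the positive integers. The conditional mean is therefore \(\lambda_i / (1 - e^{-\lambda_i})\), which is slightly larger than \(\lambda_i\).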
The examples below use the smoking data, held in the data frame smoking_df. Reading the file reports its dimensions and column types:
Rows: 807 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (10): educ, cigpric, white, age, income, cigs, restaurn, lincome, agesq,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(pscl)  # provides hurdle() and zeroinfl()

M_hurdle <- hurdle(cigs ~ educ, data = smoking_df)
summary(M_hurdle)
Call:
hurdle(formula = cigs ~ educ, data = smoking_df)

Pearson residuals:
    Min      1Q  Median      3Q     Max
-0.9775 -0.7747 -0.6606  0.9575  5.6549

Count model coefficients (truncated poisson with log link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.697867   0.056779  47.516  < 2e-16 ***
educ        0.034718   0.004536   7.653 1.96e-14 ***
Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.56273    0.30605   1.839 0.065960 .
educ        -0.08357    0.02417  -3.457 0.000546 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 11
Log-likelihood: -2441 on 4 Df
The output has two sections: Count model coefficients (zero-truncated Poisson, log scale) and Zero hurdle model coefficients (logistic, logit scale). Note that the sign convention in the hurdle model’s binary component is the opposite of the zero-inflated model: a positive coefficient here increases the probability of a non-zero count, whereas in zeroinfl a positive coefficient increases the probability of a structural zero.
Interpreting the two components
estimates_hurdle <- coef(M_hurdle, model = "zero")
estimates_count <- coef(M_hurdle, model = "count")

# Probability of a non-zero count (being a smoker) at education = 10 and 20
plogis(estimates_hurdle[1] + estimates_hurdle[2] * c(10, 20))
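The same quantities can be checked by hand from the printed coefficients, without refitting the model. The sketch below hard-codes the estimates from the summary(M_hurdle) output (rounded as printed) and uses base R only; the variable names are illustrative assumptions. It also combines the two stages into an overall expected count, using the zero-truncated Poisson mean \(\lambda / (1 - e^{-\lambda})\).

```r
# Coefficients copied from the summary(M_hurdle) output (rounded as printed).
zero_coefs  <- c(intercept = 0.56273,  educ = -0.08357)  # logit scale
count_coefs <- c(intercept = 2.697867, educ = 0.034718)  # log scale

educ <- c(10, 20)

# Hurdle stage: probability of a non-zero count (being a smoker).
p_nonzero <- plogis(zero_coefs[["intercept"]] + zero_coefs[["educ"]] * educ)

# Count stage: rate of the untruncated Poisson, then the zero-truncated mean.
lambda <- exp(count_coefs[["intercept"]] + count_coefs[["educ"]] * educ)
mean_given_positive <- lambda / (1 - exp(-lambda))

# Overall expected count: P(non-zero) times the mean among non-zero counts.
expected_count <- p_nonzero * mean_given_positive

round(data.frame(educ, p_nonzero, lambda, mean_given_positive, expected_count), 3)
```

With these rounded coefficients the probability of being a smoker comes out at roughly 0.43 for 10 years of education and roughly 0.25 for 20 years, reflecting the negative educ coefficient in the hurdle stage.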
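When the positive counts are overdispersed relative to a Poisson distribution, replace the truncated Poisson with a truncated negative binomial by setting dist = "negbin":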
M_hurdle_nb <- hurdle(cigs ~ educ, data = smoking_df, dist = "negbin")
summary(M_hurdle_nb)
Call:
hurdle(formula = cigs ~ educ, data = smoking_df, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max
-0.7696 -0.6306 -0.5466  0.7927  4.7072

Count model coefficients (truncated negbin with log link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.69331    0.16515  16.309  < 2e-16 ***
educ         0.03496    0.01342   2.605  0.00919 **
Log(theta)   1.09528    0.09495  11.535  < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.56273    0.30605   1.839 0.065960 .
educ        -0.08357    0.02417  -3.457 0.000546 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 2.99
Number of iterations in BFGS optimization: 14
Log-likelihood: -1748 on 5 Df
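The Log(theta) row and the Theta line report the same dispersion parameter on two scales. As a rough sketch in base R, the snippet below exponentiates the printed estimate and shows how theta inflates the variance of the untruncated negative binomial, whose variance is mu + mu^2 / theta. The mean mu = 20 is an assumed illustrative value, not a model estimate, and the variable names are hypothetical.

```r
# Value copied from the Log(theta) row of the summary output.
log_theta <- 1.09528
theta <- exp(log_theta)  # agrees with the printed "Theta: count = 2.99" after rounding

# Illustrative mean for the untruncated count distribution (assumed value).
mu <- 20

var_poisson <- mu                 # Poisson: variance equals the mean
var_negbin  <- mu + mu^2 / theta  # negative binomial: extra mu^2 / theta term

c(theta = theta, var_poisson = var_poisson, var_negbin = var_negbin)
```

The smaller theta is, the larger the extra variance term, so a modest theta such as this one indicates substantial overdispersion in the positive counts.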
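The two hurdle models can then be compared by AIC:
AIC(M_hurdle, M_hurdle_nb)
            df      AIC
M_hurdle     4 4890.561
M_hurdle_nb  5 3506.659
The negative binomial hurdle model has by far the lower AIC (3506.7 versus 4890.6), so the extra dispersion parameter is well worth its one additional degree of freedom for these data.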
Comparing hurdle and zero-inflated models
Hurdle and zero-inflated models are not nested, so a standard likelihood ratio test does not apply. An information criterion such as AIC is a suitable alternative:
M_zip <- zeroinfl(cigs ~ educ, data = smoking_df)
AIC(M_zip, M_hurdle)
         df      AIC
M_zip     4 4890.561
M_hurdle  4 4890.561
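In this output the two AIC values agree to the printed precision. One plausible explanation, offered as a sketch rather than something the source states, is that the fitted Poisson rates are large enough that the count process almost never produces a zero, in which case the zero-inflated and hurdle likelihoods nearly coincide. The check below uses the hurdle model's count coefficients from the earlier summary output; the education values are hypothetical and chosen only for illustration.

```r
# Count-model coefficients copied from the summary(M_hurdle) output.
intercept <- 2.697867
slope     <- 0.034718

# Hypothetical education values, for illustration only.
educ <- c(6, 12, 18)

lambda <- exp(intercept + slope * educ)  # Poisson rate at each education level
p_zero <- exp(-lambda)                   # Poisson probability of a zero, i.e. dpois(0, lambda)

data.frame(educ, lambda, p_zero)
```

When the probability of a chance zero is this small, essentially every zero is attributed to the binary part under either model, so AIC alone cannot separate them here.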
The choice between them should also be guided by subject-matter considerations: if the zero-generating process and the count-generating process are conceptually distinct, the hurdle model’s clean separation is usually easier to justify and communicate.