Hurdle Models

Mark Andrews

Two ways to handle excess zeros

Zero-inflated and hurdle models both address excess zeros, but they assume different data-generating processes

Zero-inflated: zeros arise from two sources — structural zeros AND chance zeros from the Poisson process

Hurdle: zeros and positive counts arise from two completely separate processes — no chance zeros in the count component
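The contrast can be written out for the Poisson case. This is a sketch in notation assumed here rather than taken from the slides: \(\pi_i\) is the probability of a structural zero in the zero-inflated model, \(\theta_i\) is the probability of clearing the hurdle in the hurdle model, and \(\lambda_i\) is the Poisson rate in both.

\[ \text{Zero-inflated Poisson:} \quad \Pr(y_i = 0) = \pi_i + (1 - \pi_i)\,e^{-\lambda_i}, \qquad \Pr(y_i = k) = (1 - \pi_i)\,\frac{e^{-\lambda_i}\lambda_i^{k}}{k!}, \quad k \ge 1 \]

\[ \text{Hurdle Poisson:} \quad \Pr(y_i = 0) = 1 - \theta_i, \qquad \Pr(y_i = k) = \theta_i\,\frac{e^{-\lambda_i}\lambda_i^{k}}{k!\,\left(1 - e^{-\lambda_i}\right)}, \quad k \ge 1 \]

The term \((1 - \pi_i)\,e^{-\lambda_i}\) is the chance-zero contribution, which has no counterpart in the hurdle model. The division by \(1 - e^{-\lambda_i}\) renormalises the Poisson over the positive counts only.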

The hurdle framework

Hurdle stage: a binary model predicts whether \(y_i > 0\) or \(y_i = 0\)
Count stage: a zero-truncated count model predicts the value of \(y_i\) given that \(y_i > 0\)

All zeros come from the binary stage
All positive counts come from the count stage
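The two stages can be illustrated by simulating from them directly. The following is a minimal sketch in base R; the values of p_pos and lambda are assumptions chosen for illustration, and the zero-truncated draws use inverse-CDF sampling restricted to the interval above Pr(Y = 0).

```r
# Sketch: simulate from a hurdle Poisson process in two stages.
# p_pos (probability of clearing the hurdle) and lambda are assumed values.
set.seed(101)
n      <- 10000
p_pos  <- 0.4
lambda <- 3

# Hurdle stage: does each observation clear the hurdle (y > 0)?
cleared <- rbinom(n, size = 1, prob = p_pos)

# Count stage: zero-truncated Poisson draws, obtained by inverting the
# Poisson CDF on the interval (Pr(Y = 0), 1) so that each draw is at least 1.
u       <- runif(n, min = dpois(0, lambda), max = 1)
y_trunc <- qpois(u, lambda)

# Every zero comes from the hurdle stage; every positive count from the count stage.
y <- ifelse(cleared == 1, y_trunc, 0)

mean(y == 0)     # close to 1 - p_pos
mean(y[y > 0])   # close to lambda / (1 - exp(-lambda)), not lambda
```

Note that the mean of the positive counts is larger than lambda, because removing the zeros from a Poisson distribution shifts its mean upwards.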

The hurdle Poisson model

Binary component (logistic regression):

\[ \mathrm{logit}\!\left(\Pr(y_i > 0)\right) = a + bx_i \]

Count component (zero-truncated Poisson):

\[ y_i \mid y_i > 0 \sim \mathrm{ZeroTruncatedPoisson}(\lambda_i), \quad \log(\lambda_i) = \alpha + \beta x_i \]
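Taken together, the two components define a single probability mass function. The sketch below writes it out in base R; dhurdle_pois() is a hypothetical helper (not a function from pscl or any other package), and the two linear-predictor values are assumed for illustration.

```r
# Sketch of the hurdle Poisson probability mass function.
# dhurdle_pois() is a hypothetical helper, not part of any package.
#   p_pos  = Pr(y > 0) from the binary component
#   lambda = rate of the untruncated Poisson behind the count component
dhurdle_pois <- function(y, p_pos, lambda) {
  # Positive counts: Poisson mass rescaled by the probability of being non-zero
  pos <- p_pos * dpois(y, lambda) / (1 - dpois(0, lambda))
  ifelse(y == 0, 1 - p_pos, pos)
}

p_pos  <- plogis(-0.5)   # assumed value of a + b * x on the logit scale
lambda <- exp(1.2)       # assumed value of alpha + beta * x on the log scale

y_grid <- 0:200          # wide enough that the omitted tail is negligible
probs  <- dhurdle_pois(y_grid, p_pos, lambda)

sum(probs)                            # the mass function sums to 1
sum(y_grid * probs)                   # overall mean, computed numerically
p_pos * lambda / (1 - exp(-lambda))   # closed form for the same mean
```

The overall mean is the probability of clearing the hurdle multiplied by the mean of the zero-truncated Poisson, \(\lambda_i / (1 - e^{-\lambda_i})\).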

Sign convention

In zeroinfl: positive zero-inflation coefficient \(\Rightarrow\) more likely to be a structural zero
In hurdle: positive hurdle coefficient \(\Rightarrow\) more likely to exceed zero (clear the hurdle)

The direction of the binary component is reversed between the two model families
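The reversal follows from which outcome the binary component treats as the "success". A small base R sketch on simulated data (all values assumed) shows that a logistic regression for "count is positive" and one for "count is zero" give coefficients of equal magnitude and opposite sign. This illustrates the direction only: zeroinfl estimates are not simply negated hurdle estimates, because its binary component models structural zeros rather than all zeros.

```r
# Sketch with simulated data; the generating values are assumptions.
set.seed(42)
n <- 2000
x <- rnorm(n)

# Higher x makes a positive count more likely
is_pos <- rbinom(n, size = 1, prob = plogis(-0.5 + 1 * x))

# Hurdle-style coding: the modelled outcome is "count is positive"
m_pos  <- glm(is_pos ~ x, family = binomial)

# Zero-style coding: the modelled outcome is "count is zero"
m_zero <- glm(I(1 - is_pos) ~ x, family = binomial)

coef(m_pos)    # slope on x is positive
coef(m_zero)   # same magnitudes, opposite signs
```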

Fitting with hurdle

library(pscl)

M_hurdle <- hurdle(cigs ~ educ, data = smoking_df)
summary(M_hurdle)

The summary has two sections: Count model coefficients (zero-truncated Poisson with log link) and Zero hurdle model coefficients (binomial with logit link)

Interpreting the components

est_hurdle <- coef(M_hurdle, model = "zero")
est_count  <- coef(M_hurdle, model = "count")

# P(non-zero count | educ = 10) — probability of clearing the hurdle
plogis(est_hurdle[1] + est_hurdle[2] * 10)

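# Expected cigarettes given smoker | educ = 10
# exp() of the linear predictor gives lambda, the mean of the untruncated Poisson;
# the mean of the zero-truncated Poisson is lambda / (1 - exp(-lambda))
lambda <- exp(est_count[1] + est_count[2] * 10)
lambda / (1 - exp(-lambda))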

Predictions

library(tibble)
library(modelr)

smoking_new <- tibble(educ = seq(5, 20))

add_predictions(smoking_new, M_hurdle, type = "response")  # overall expected count
add_predictions(smoking_new, M_hurdle, type = "count")     # mean of the untruncated count distribution
add_predictions(smoking_new, M_hurdle, type = "zero")      # ratio P(y > 0) / P(untruncated count > 0)
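The three prediction types can be reproduced by hand, which makes their relationship explicit. The sketch below assumes the definitions given in the pscl documentation for predictions from hurdle objects, and uses hypothetical coefficient values in place of the fitted ones.

```r
# Hypothetical coefficient values standing in for coef(M_hurdle, ...)
est_hurdle <- c(1.5, -0.10)   # zero hurdle component (logit scale)
est_count  <- c(3.0, -0.02)   # count component (log scale)
educ <- 10

p_pos  <- plogis(est_hurdle[1] + est_hurdle[2] * educ)  # Pr(y > 0)
lambda <- exp(est_count[1] + est_count[2] * educ)       # untruncated Poisson mean

pred_count    <- lambda                       # corresponds to type = "count"
pred_zero     <- p_pos / (1 - exp(-lambda))   # corresponds to type = "zero"
pred_response <- pred_zero * pred_count       # corresponds to type = "response"

# The overall mean equals Pr(y > 0) times the zero-truncated mean
trunc_mean <- lambda / (1 - exp(-lambda))
all.equal(pred_response, p_pos * trunc_mean)
```

When lambda is large, the probability of a zero under the untruncated Poisson is tiny, so the "zero" prediction is close to Pr(y > 0) and the "count" prediction is close to the zero-truncated mean. For small lambda the differences are substantial.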

Hurdle negative binomial

When the count component is overdispersed:

M_hurdle_nb <- hurdle(cigs ~ educ, data = smoking_df, dist = "negbin")
AIC(M_hurdle, M_hurdle_nb)
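The negative binomial adds a dispersion parameter, so the variance of the count distribution can exceed its mean. A short base R sketch of what that changes, using assumed values for the mean mu and dispersion theta:

```r
mu    <- 8     # assumed mean of the untruncated count distribution
theta <- 1.5   # assumed dispersion (size); smaller values mean more overdispersion

# Variance of the untruncated distributions
mu                  # Poisson: variance equals the mean
mu + mu^2 / theta   # negative binomial: variance exceeds the mean

# Probability of a zero under each untruncated distribution
p0_pois <- dpois(0, mu)
p0_nb   <- dnbinom(0, size = theta, mu = mu)

# Zero-truncated means: untruncated mean divided by Pr(count > 0)
mu / (1 - p0_pois)
mu / (1 - p0_nb)
```

As theta grows large the negative binomial approaches the Poisson, which is why the hurdle Poisson can be seen as a limiting case of the hurdle negative binomial.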

Comparing hurdle and zero-inflated models

Not nested, so no likelihood ratio test
Use AIC for statistical comparison:

M_zip <- zeroinfl(cigs ~ educ, data = smoking_df)
AIC(M_zip, M_hurdle)

Subject-matter considerations matter: if the zero and count processes are conceptually distinct, the hurdle model is usually easier to interpret and justify

Summary

Hurdle models cleanly separate zero generation from count generation
All zeros from the binary stage; all positive counts from the truncated count stage
hurdle() from the pscl package fits hurdle Poisson and hurdle negative binomial models
Compare with zero-inflated models using AIC and subject-matter reasoning