Hurdle Models

Mark Andrews

Two ways to handle excess zeros

Zero-inflated and hurdle models both address excess zeros, but they embody different causal stories

Zero-inflated: zeros arise from two sources, structural zeros and chance zeros produced by the (untruncated) count distribution

Hurdle: zeros and positive counts arise from two completely separate processes; the count component is zero-truncated, so it produces no chance zeros

The hurdle framework

  1. Hurdle stage: a binary model predicts whether \(y_i > 0\) or \(y_i = 0\)
  2. Count stage: a zero-truncated count model predicts the value of \(y_i\) given that \(y_i > 0\)
  • All zeros come from the binary stage
  • All positive counts come from the count stage
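The two stages can be sketched as a small simulation in base R. The values of `p_nonzero` and `lambda` below are illustrative assumptions, not estimates from the smoking data, and the inverse-CDF draw is just one convenient way to sample a zero-truncated Poisson.

```r
# Simulate counts from a two-stage hurdle process (illustrative values only).
set.seed(101)
n <- 5000
p_nonzero <- 0.4   # assumed probability of clearing the hurdle
lambda    <- 2.5   # assumed rate of the untruncated Poisson

# Stage 1: does the observation clear the hurdle?
clears <- rbinom(n, size = 1, prob = p_nonzero)

# Stage 2: zero-truncated Poisson draws by inverse-CDF sampling.
# Drawing u uniformly above P(Y = 0) means qpois() returns at least 1.
u <- runif(n, min = dpois(0, lambda), max = 1)
positive_counts <- qpois(u, lambda)

# Zeros come only from stage 1; positive values come only from stage 2.
y <- ifelse(clears == 1, positive_counts, 0)

mean(y == 0)      # close to 1 - p_nonzero
mean(y[y > 0])    # close to lambda / (1 - exp(-lambda))
```

Note that the average of the positive counts is larger than `lambda`, because truncation removes the zeros the Poisson would otherwise produce.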

The hurdle Poisson model

Binary component (logistic regression):

\[ \mathrm{logit}\!\left(\Pr(y_i > 0)\right) = a + bx_i \]

Count component (zero-truncated Poisson):

\[ y_i \mid y_i > 0 \sim \mathrm{ZeroTruncatedPoisson}(\lambda_i), \quad \log(\lambda_i) = \alpha + \beta x_i \]
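Writing \(\pi_i = \Pr(y_i > 0)\) for the probability from the binary component (a symbol introduced here for convenience), the two components combine into a single probability mass function:

\[ \Pr(y_i = 0) = 1 - \pi_i, \qquad \Pr(y_i = k) = \pi_i \, \frac{e^{-\lambda_i} \lambda_i^{k} / k!}{1 - e^{-\lambda_i}}, \quad k = 1, 2, \ldots \]

A minimal sketch of this in base R follows. The function name `dhurdle_pois` is hypothetical (it is not part of pscl), and the parameter values are illustrative.

```r
# Probability mass function of a hurdle Poisson distribution.
# p_nonzero: probability of a non-zero count (binary component)
# lambda:    rate of the untruncated Poisson (count component)
dhurdle_pois <- function(k, p_nonzero, lambda) {
  truncated <- dpois(k, lambda) / (1 - dpois(0, lambda))  # zero-truncated Poisson
  ifelse(k == 0, 1 - p_nonzero, p_nonzero * truncated)
}

# Illustrative values, not estimates from any data set
p_nonzero <- 0.4
lambda <- 2.5
k <- 0:100

probs <- dhurdle_pois(k, p_nonzero, lambda)
sum(probs)                                 # total probability, effectively 1
sum(k * probs)                             # mean of the hurdle distribution
p_nonzero * lambda / (1 - exp(-lambda))    # closed form for the same mean
```

The last two lines agree, which shows that the overall expected count is the probability of clearing the hurdle multiplied by the zero-truncated Poisson mean.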

Sign convention

  • In zeroinfl: positive zero-inflation coefficient \(\Rightarrow\) more likely to be a structural zero
  • In hurdle: positive hurdle coefficient \(\Rightarrow\) more likely to exceed zero (clear the hurdle)

The direction of the binary component is reversed between the two model families
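The effect of choosing which outcome counts as the "event" can be seen with plain logistic regression on simulated data. This is only a sketch of the coding direction, with assumed coefficient values: a zero-inflation coefficient from zeroinfl will not generally be the exact negative of the corresponding hurdle coefficient, because structural zeros are only a subset of all zeros.

```r
# Simulated data: the chance of a positive count rises with x (assumed values).
set.seed(202)
n <- 2000
x <- rnorm(n)
is_positive <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Hurdle-style coding: the modelled event is "count is positive".
fit_positive <- glm(is_positive ~ x, family = binomial)

# Reversed coding: the modelled event is "count is zero".
fit_zero <- glm(I(1 - is_positive) ~ x, family = binomial)

# Same fit, opposite signs on every coefficient.
rbind(positive = coef(fit_positive), zero = coef(fit_zero))
```

Both fits describe the same data equally well; only the sign of the coefficients changes, which is why the direction has to be checked before interpreting them.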

Fitting with hurdle

library(pscl)

M_hurdle <- hurdle(cigs ~ educ, data = smoking_df)
summary(M_hurdle)

Two sections: Count model coefficients (zero-truncated Poisson) and Zero hurdle model coefficients (logistic)

Interpreting the components

est_hurdle <- coef(M_hurdle, model = "zero")
est_count  <- coef(M_hurdle, model = "count")

# P(non-zero count | educ = 10) — probability of clearing the hurdle
plogis(est_hurdle[1] + est_hurdle[2] * 10)

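# Expected cigarettes given smoker | educ = 10
# exp() of the linear predictor gives the untruncated Poisson rate lambda;
# the zero-truncated Poisson mean is lambda / (1 - exp(-lambda))
lambda <- exp(est_count[1] + est_count[2] * 10)
lambda / (1 - exp(-lambda))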

Predictions

library(tibble)
library(modelr)

smoking_new <- tibble(educ = seq(5, 20))

add_predictions(smoking_new, M_hurdle, type = "response")  # overall expected count
add_predictions(smoking_new, M_hurdle, type = "count")     # mean of the untruncated count distribution
add_predictions(smoking_new, M_hurdle, type = "zero")      # ratio: P(non-zero) / P(non-zero under untruncated count)
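The three prediction types are linked: as the pscl documentation describes them, `"count"` is the mean of the untruncated count distribution, `"zero"` is a ratio of non-zero probabilities (hurdle component over count component), and `"response"` is their product. The sketch below reproduces those quantities by hand in base R. The coefficient values are hypothetical stand-ins, not estimates from `M_hurdle`.

```r
# Hypothetical coefficients (intercept, slope for educ); not fitted values.
est_hurdle <- c(0.5, -0.10)   # logit scale: P(non-zero)
est_count  <- c(3.0, -0.02)   # log scale: untruncated Poisson rate

educ <- seq(5, 20)
p_nonzero <- plogis(est_hurdle[1] + est_hurdle[2] * educ)
lambda    <- exp(est_count[1] + est_count[2] * educ)

pred <- data.frame(
  educ      = educ,
  p_nonzero = p_nonzero,
  count     = lambda,                                 # untruncated count mean
  zero      = p_nonzero / (1 - exp(-lambda)),         # ratio of non-zero probabilities
  mean_given_positive = lambda / (1 - exp(-lambda))   # zero-truncated Poisson mean
)
pred$response <- pred$count * pred$zero               # overall expected count

head(pred)
```

The `response` column equals `p_nonzero * mean_given_positive`, so the probability of a non-zero count and the expected count among non-zero cases can both be recovered from the coefficients even though neither is returned directly.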

Hurdle negative binomial

When the count component is overdispersed:

M_hurdle_nb <- hurdle(cigs ~ educ, data = smoking_df, dist = "negbin")
AIC(M_hurdle, M_hurdle_nb)
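Before comparing the two fits, a rough moment-based check on the positive counts can suggest whether overdispersion is present. The sketch below uses simulated data with assumed parameters and ignores covariates; it is an informal diagnostic, not a formal test. It finds the Poisson rate whose zero-truncated mean matches the sample mean and compares the variance that rate implies with the observed variance.

```r
# Simulated positive counts from a negative binomial (assumed parameters).
set.seed(303)
draws <- rnbinom(20000, size = 1.5, mu = 8)
y_pos <- draws[draws > 0]    # dropping zeros leaves a zero-truncated sample

# Poisson rate whose zero-truncated mean equals the sample mean.
m <- mean(y_pos)
lambda_hat <- uniroot(
  function(l) l / (1 - exp(-l)) - m,
  interval = c(1e-8, 1e4)
)$root

# Variance of a zero-truncated Poisson with that rate: m * (1 + lambda - m).
var_ztp <- m * (1 + lambda_hat - m)

c(observed = var(y_pos), truncated_poisson = var_ztp)
```

An observed variance far above the truncated Poisson value, as here, is the situation in which the negative binomial count component is worth fitting.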

Comparing hurdle and zero-inflated models

  • The two models are not nested, so a standard likelihood ratio test does not apply
  • Use AIC for statistical comparison:
M_zip <- zeroinfl(cigs ~ educ, data = smoking_df)
AIC(M_zip, M_hurdle)
  • Subject-matter considerations matter: if the zero and count processes are conceptually distinct, the hurdle model is usually easier to interpret and justify

Summary

  • Hurdle models cleanly separate zero generation from count generation
  • All zeros from the binary stage; all positive counts from the truncated count stage
  • hurdle from pscl fits hurdle Poisson and hurdle negative binomial models
  • Compare with zero-inflated models using AIC and subject-matter reasoning