Hurdle Models

Author

Mark Andrews

Abstract

Hurdle models are a close relative of zero-inflated models, also designed for count data with excess zeros. Unlike zero-inflated models, they treat the zero/non-zero boundary as a conceptually clean hurdle: the first model component predicts whether the count exceeds zero, and the second predicts the size of the count given that it has. This guide covers hurdle Poisson and hurdle negative binomial models implemented via hurdle from pscl.

The hurdle framework

Both zero-inflated and hurdle models address excess zeros, but they embody different causal stories.

In a zero-inflated model, zeros can arise in two ways: from the structural zero-generating process, or by chance from the count process (a Poisson random variable can produce a zero). This makes the distinction between “true zeros” and “chance zeros” latent and unobserved.

In a hurdle model, the distinction is cleaner. Every observation is first classified as zero or non-zero by a binary model (the hurdle stage). If the count exceeds zero, its exact value is then modelled by a truncated count distribution — a Poisson or negative binomial that assigns zero probability to zero. There is no latent mixture: all zeros come from the binary component, and all positive counts come from the truncated count component.

This separation makes hurdle models easier to interpret when the two stages correspond to genuinely distinct processes. For example, in the smoking dataset: the hurdle stage models whether someone is a smoker at all, and the count stage models how many cigarettes a smoker typically smokes.

The hurdle Poisson model

The model has two components. The binary component is a logistic regression predicting \(\Pr(y_i > 0)\):

\[ \mathrm{logit}\!\left(\Pr(y_i > 0 \mid \vec{x}_i)\right) = a + b x_i. \]

The count component is a zero-truncated Poisson regression for observations with \(y_i > 0\):

\[ y_i \mid y_i > 0 \sim \mathrm{ZeroTruncatedPoisson}(\lambda_i), \quad \log(\lambda_i) = \alpha + \beta x_i. \]

Fitting with hurdle

smoking_df <- read_csv(
  "https://raw.githubusercontent.com/mark-andrews/iglmr24/main/data/smoking.csv"
)
Rows: 807 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (10): educ, cigpric, white, age, income, cigs, restaurn, lincome, agesq,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
M_hurdle <- hurdle(cigs ~ educ, data = smoking_df)
summary(M_hurdle)

Call:
hurdle(formula = cigs ~ educ, data = smoking_df)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.9775 -0.7747 -0.6606  0.9575  5.6549 

Count model coefficients (truncated poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 2.697867   0.056779  47.516  < 2e-16 ***
educ        0.034718   0.004536   7.653 1.96e-14 ***
Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.56273    0.30605   1.839 0.065960 .  
educ        -0.08357    0.02417  -3.457 0.000546 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 11 
Log-likelihood: -2441 on 4 Df

The output has two sections: Count model coefficients (zero-truncated Poisson, log scale) and Zero hurdle model coefficients (logistic, logit scale). Note that the sign convention in the hurdle model’s binary component is the opposite of the zero-inflated model: a positive coefficient here increases the probability of a non-zero count, whereas in zeroinfl a positive coefficient increases the probability of a structural zero.

Interpreting the two components

estimates_hurdle <- coef(M_hurdle, model = "zero")
estimates_count <- coef(M_hurdle, model = "count")

# Probability of a non-zero count (being a smoker) at education = 10 and 20
plogis(estimates_hurdle[1] + estimates_hurdle[2] * 10)
(Intercept) 
  0.4321804 
plogis(estimates_hurdle[1] + estimates_hurdle[2] * 20)
(Intercept) 
  0.2481219 
# Expected cigarettes conditional on being a smoker, at education = 10 and 20
exp(estimates_count[1] + estimates_count[2] * 10)
(Intercept) 
   21.01109 
exp(estimates_count[1] + estimates_count[2] * 20)
(Intercept) 
   29.73232 

Predictions

smoking_new <- tibble(educ = seq(5, 20))

add_predictions(smoking_new, M_hurdle, type = "response") # overall expected count
# A tibble: 16 × 2
    educ  pred
   <int> <dbl>
 1     5  9.47
 2     6  9.42
 3     7  9.36
 4     8  9.28
 5     9  9.19
 6    10  9.08
 7    11  8.96
 8    12  8.82
 9    13  8.67
10    14  8.51
11    15  8.34
12    16  8.17
13    17  7.98
14    18  7.78
15    19  7.58
16    20  7.38
add_predictions(smoking_new, M_hurdle, type = "count") # expected count | non-zero
# A tibble: 16 × 2
    educ  pred
   <int> <dbl>
 1     5  17.7
 2     6  18.3
 3     7  18.9
 4     8  19.6
 5     9  20.3
 6    10  21.0
 7    11  21.8
 8    12  22.5
 9    13  23.3
10    14  24.1
11    15  25.0
12    16  25.9
13    17  26.8
14    18  27.7
15    19  28.7
16    20  29.7
add_predictions(smoking_new, M_hurdle, type = "zero") # P(non-zero)
# A tibble: 16 × 2
    educ  pred
   <int> <dbl>
 1     5 0.536
 2     6 0.515
 3     7 0.494
 4     8 0.474
 5     9 0.453
 6    10 0.432
 7    11 0.412
 8    12 0.392
 9    13 0.372
10    14 0.353
11    15 0.334
12    16 0.316
13    17 0.298
14    18 0.281
15    19 0.264
16    20 0.248

Hurdle negative binomial

When the count component is overdispersed, use dist = "negbin":

M_hurdle_nb <- hurdle(cigs ~ educ, data = smoking_df, dist = "negbin")
summary(M_hurdle_nb)

Call:
hurdle(formula = cigs ~ educ, data = smoking_df, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.7696 -0.6306 -0.5466  0.7927  4.7072 

Count model coefficients (truncated negbin with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.69331    0.16515  16.309  < 2e-16 ***
educ         0.03496    0.01342   2.605  0.00919 ** 
Log(theta)   1.09528    0.09495  11.535  < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.56273    0.30605   1.839 0.065960 .  
educ        -0.08357    0.02417  -3.457 0.000546 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 2.99
Number of iterations in BFGS optimization: 14 
Log-likelihood: -1748 on 5 Df
AIC(M_hurdle, M_hurdle_nb)
            df      AIC
M_hurdle     4 4890.561
M_hurdle_nb  5 3506.659

Comparing hurdle and zero-inflated models

Hurdle and zero-inflated models are not nested, so they cannot be compared with a likelihood ratio test. AIC is the appropriate tool:

M_zip <- zeroinfl(cigs ~ educ, data = smoking_df)
AIC(M_zip, M_hurdle)
         df      AIC
M_zip     4 4890.561
M_hurdle  4 4890.561

The choice between them should also be guided by subject-matter considerations: if the zero-generating process and the count-generating process are conceptually distinct, the hurdle model’s clean separation is usually easier to justify and communicate.