Regression Models for Count Data

Loading up the Data and Models from Part 1

library(foreign)
dat = read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
dat = within(dat, {
  prog = factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
  id = factor(id)
})

pois.1 = glm(daysabs ~ math + prog, data=dat, family=poisson(link="log")) 

library(MASS)
nb.1 = glm.nb(daysabs ~ math + prog, data=dat)
summary(nb.1)

## 
## Call:
## glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1547  -1.0192  -0.3694   0.2285   2.5273  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.615265   0.197460  13.245  < 2e-16 ***
## math           -0.005993   0.002505  -2.392   0.0167 *  
## progAcademic   -0.440760   0.182610  -2.414   0.0158 *  
## progVocational -1.278651   0.200720  -6.370 1.89e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)
## 
##     Null deviance: 427.54  on 313  degrees of freedom
## Residual deviance: 358.52  on 310  degrees of freedom
## AIC: 1741.3
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.033 
##           Std. Err.:  0.106 
## 
##  2 x log-likelihood:  -1731.258

couart2 = read.dta("http://www.ats.ucla.edu/stat/stata/examples/long/couart2.dta")
summary(couart2)

##       art            fem           mar           kid5       
##  Min.   : 0.000   Men  :494   Single :309   Min.   :0.0000  
##  1st Qu.: 0.000   Women:421   Married:606   1st Qu.:0.0000  
##  Median : 1.000                             Median :0.0000  
##  Mean   : 1.693                             Mean   :0.4951  
##  3rd Qu.: 2.000                             3rd Qu.:1.0000  
##  Max.   :19.000                             Max.   :3.0000  
##       phd             ment       
##  Min.   :0.755   Min.   : 0.000  
##  1st Qu.:2.260   1st Qu.: 3.000  
##  Median :3.150   Median : 6.000  
##  Mean   :3.103   Mean   : 8.767  
##  3rd Qu.:3.920   3rd Qu.:12.000  
##  Max.   :4.620   Max.   :77.000

lm.c = lm(art ~ fem + mar + kid5 + phd + ment, data=couart2)
summary(lm.c)

## 
## Call:
## lm(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0209 -1.2358 -0.4125  0.7517 14.8409 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.334249   0.240911   5.538    4e-08 ***
## femWomen    -0.380885   0.128139  -2.972  0.00303 ** 
## marMarried   0.263199   0.145423   1.810  0.07064 .  
## kid5        -0.291442   0.090919  -3.206  0.00140 ** 
## phd         -0.011365   0.063666  -0.179  0.85836    
## ment         0.061495   0.006605   9.310  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.822 on 909 degrees of freedom
## Multiple R-squared:  0.1104, Adjusted R-squared:  0.1055 
## F-statistic: 22.55 on 5 and 909 DF,  p-value: < 2.2e-16

pois.2 = glm(art ~ fem + mar + kid5 + phd + ment, data=couart2, family=poisson(link="log"))
summary(pois.2)

## 
## Call:
## glm(formula = art ~ fem + mar + kid5 + phd + ment, family = poisson(link = "log"), 
##     data = couart2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5672  -1.5398  -0.3660   0.5722   5.4467  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.304617   0.102981   2.958   0.0031 ** 
## femWomen    -0.224594   0.054613  -4.112 3.92e-05 ***
## marMarried   0.155243   0.061374   2.529   0.0114 *  
## kid5        -0.184883   0.040127  -4.607 4.08e-06 ***
## phd          0.012823   0.026397   0.486   0.6271    
## ment         0.025543   0.002006  12.733  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1817.4  on 914  degrees of freedom
## Residual deviance: 1634.4  on 909  degrees of freedom
## AIC: 3314.1
## 
## Number of Fisher Scoring iterations: 5

library(MASS)
nb.2 = glm.nb(art ~ fem + mar + kid5 + phd + ment, data=couart2)
summary(nb.2)

## 
## Call:
## glm.nb(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2, 
##     init.theta = 2.264387695, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1678  -1.3617  -0.2806   0.4476   3.4524  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.256144   0.137348   1.865 0.062191 .  
## femWomen    -0.216418   0.072636  -2.979 0.002887 ** 
## marMarried   0.150489   0.082097   1.833 0.066791 .  
## kid5        -0.176415   0.052813  -3.340 0.000837 ***
## phd          0.015271   0.035873   0.426 0.670326    
## ment         0.029082   0.003214   9.048  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.2644) family taken to be 1)
## 
##     Null deviance: 1109.0  on 914  degrees of freedom
## Residual deviance: 1004.3  on 909  degrees of freedom
## AIC: 3135.9
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.264 
##           Std. Err.:  0.271 
## 
##  2 x log-likelihood:  -3121.917

Excess Zeros

Count data often have “excessive” zeroes; the majority of the units in the sample do not experience the event at all. Usually the Negative Binomial model can accommodate this by increasing the conditional variance while leaving the conditional mean the same. Note: Conditional refers to the fact that in the model the expected value of the outcome is specified as a function of the independent variables. In that sense the expected value of the outcome is conditional upon the values of the independent variables.

However, often the Negative Binomial will not be able to fully adjust for the excess zeroes. For example, this may be because the basic assumption that all scientists may not publish at all or that all scientists will publish is unrealistic. Perhaps there are those who will never publish and others who will publish at least one article. This possibility can be modeled by way of what are referred to as the zero-inflated models, sometimes also called the with zeros or hurdle models, depending upon how the model is specified and estimated. These models adjust the basic Poisson and Negative Binomial models by not only increasing the conditional variance but also increasing the probability of seeing zeros.

Poisson and Negative Binomial fail with Excess Zeros

Let us see how the Poisson and the Negative Binomial fall short in terms of being able to account for the excess zeros.

zero.obs = couart2$art == 0 
zero.poi = dpois(0, lambda=exp(predict(pois.2)))
c(obs=mean(zero.obs), poi=mean(zero.poi))

##       obs       poi 
## 0.3005464 0.2092071

The code above is doing the following.

First, it is counting how many zeros are there out of the total sample and saving this as zero.obs
It is then using the predictions from the model to estimate the number of zeros predicted by the model and saving this as zero.poi
The c() command is calculating the means and showing you the proportion of actual zeros (zero.obs) versus the proportion of predicted zeros (zero.poi)
Note that whereas some 30.05% of the sample has zeros, the Poisson model is only predicting about 20.92% of zeros.

What about the Negative Binomial? How does it fit the excess zeros?

munb = exp(predict(nb.2))
theta = nb.2$theta
nb = dnbinom(0, mu=munb, size=theta)
mean(nb)

## [1] 0.3035957

Not bad at all! Note that the Negative Binomial is predicting some 30.35% of the sample will have zeros, quite close to the observed proportion of 30.05%

The With Zeros Model

This model assumes the population is made up of two groups, and a person falls in group 1 (these persons only have zero counts) with probability \(\psi\) and in group 2 (these persons may have zero counts but will also have some non-zero counts) with probability \(1 - \psi\). This probability is unknown and thus has to be estimated from the data. Now, if you see a scientist with zero publications, you don’t know what group she/he is in so how can we estimate \(\psi\)? By making \(\psi\) a function of some set of independent variables. Usually \(\psi\) is modeled with the same set of independent variables and estimated via binary logit or probit models. These are then two-equation models where the first equation determines via logit or probit whether the person is in group 1 or 2, and the second question is either a Poisson (leading to the zero-inflated Poisson model) or the Negative Binomial (leading to the zero-inflated Negative Binomial model).

Caution: The binary logit/probit models are fit such that the outcome of interest that is being predicted is the probability of being in group 1 (the always zero group). As such if you see a statistically significant positive coefficient for a variable, it means this variable increases the probability of seeing only zeros. If the sign on the statistically significant coefficient is negative then the variable decreases the probability of seeing only zeros. Don’t forget this coding scheme when interpreting your estimates.

The Zero-Inflated Poisson Model

library(pscl)
zip.2 = zeroinfl(art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment, data=couart2, dist="poisson", link="logit", EM=TRUE)
summary(zip.2)

## 
## Call:
## zeroinfl(formula = art ~ fem + mar + kid5 + phd + ment | fem + mar + 
##     kid5 + phd + ment, data = couart2, dist = "poisson", link = "logit", 
##     EM = TRUE)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3254 -0.8652 -0.2826  0.5404  7.2974 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.640863   0.121306   5.283 1.27e-07 ***
## femWomen    -0.209147   0.063405  -3.299 0.000972 ***
## marMarried   0.103749   0.071111   1.459 0.144573    
## kid5        -0.143317   0.047429  -3.022 0.002514 ** 
## phd         -0.006170   0.031008  -0.199 0.842269    
## ment         0.018098   0.002294   7.888 3.07e-15 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.577045   0.509356  -1.133  0.25726   
## femWomen     0.109722   0.280066   0.392  0.69523   
## marMarried  -0.353985   0.317591  -1.115  0.26502   
## kid5         0.217101   0.196469   1.105  0.26915   
## phd          0.001225   0.145252   0.008  0.99327   
## ment        -0.134072   0.045224  -2.965  0.00303 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 1 
## Log-likelihood: -1605 on 12 Df

The Zero-Inflated Negative Binomial Model

library(pscl)
zinb.2 = zeroinfl(art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment, data=couart2, dist="negbin", link="logit", EM=TRUE)
summary(zinb.2)

## 
## Call:
## zeroinfl(formula = art ~ fem + mar + kid5 + phd + ment | fem + mar + 
##     kid5 + phd + ment, data = couart2, dist = "negbin", link = "logit", 
##     EM = TRUE)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2942 -0.7601 -0.2910  0.4447  6.4154 
## 
## Count model coefficients (negbin with log link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.4167587  0.1435966   2.902  0.00370 ** 
## femWomen    -0.1954951  0.0755919  -2.586  0.00970 ** 
## marMarried   0.0975896  0.0844519   1.156  0.24786    
## kid5        -0.1517237  0.0542062  -2.799  0.00513 ** 
## phd         -0.0006967  0.0362697  -0.019  0.98467    
## ment         0.0247845  0.0034926   7.096 1.28e-12 ***
## Log(theta)   0.9764245  0.1354754   7.207 5.70e-13 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.19248    1.32231  -0.146  0.88426   
## femWomen     0.63599    0.84857   0.749  0.45356   
## marMarried  -1.49868    0.93824  -1.597  0.11019   
## kid5         0.62831    0.44265   1.419  0.15578   
## phd         -0.03754    0.30789  -0.122  0.90296   
## ment        -0.88189    0.31604  -2.790  0.00526 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 2.6549 
## Number of iterations in BFGS optimization: 1 
## Log-likelihood: -1550 on 13 Df

Note a couple of things …

ment is significant and negative in the logit portion of both ZIP and ZINB, indicating that students of more published mentors are less likely to be in the always zeros group
Holding all else constant, women publish fewer articles than do men
Holding all else constant, those with more children five years old or younger publish less
Holding all else constant, students of more published mentors publish more themselves

Choosing Between Models

Some argue that the Poisson model is nested within the ZIP model, and likewise the Negative Binomial model is nested within the ZINB model. More generally, however, it is argued that these are really non-nested models, and as such we will have to employ the Vuong test which is implemented in pscl.

The Vuong non-nested test is based on a comparison of the predicted probabilities of two models that do not nest. Examples include comparisons of zero-inflated count models with their nonzero-inflated analogs (e.g., zero-inflated Poisson versus ordinary Poisson, or zero-inflated Negative Binomial versus ordinary Negative Binomial). A large, positive test statistic provides evidence of the superiority of model 1 over model 2, while a large, negative test statistic is evidence of the superiority of model 2 over model 1. Under the null that the models are indistinguishable, the test statistic is asymptotically distributed standard normal.

vuong(pois.2, zip.2)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -4.180398 model2 > model1  1.455e-05
## AIC-corrected         -3.638468 model2 > model1 0.00013713
## BIC-corrected         -2.332709 model2 > model1 0.00983172

vuong(nb.2, zinb.2)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A  p-value
## Raw                   -2.241702 model2 > model1 0.012490
## AIC-corrected         -1.015327 model2 > model1 0.154975
## BIC-corrected          1.939579 model1 > model2 0.026215

So the ZIP is better than the Poisson, and the ZINB is better than the Negative Binomial, bringing the choice down to the ZIP model versus the ZINB model. Some folks run the vuong test on ZIP versus ZINB but some see this as a mistake. The classical way then to compare the ZIP and the ZINB models is via the likelihood-ratio test in the lmtest package.

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

lrtest(zip.2, zinb.2)

## Likelihood ratio test
## 
## Model 1: art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + 
##     ment
## Model 2: art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + 
##     ment
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  12 -1604.8                         
## 2  13 -1550.0  1 109.56  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This test rules in favor of the ZINB model so that is the one we would opt for. You can also see the ZINB favored overall (i.e., among the four models) by the AIC:

AIC(pois.2, nb.2, zip.2, zinb.2)

##        df      AIC
## pois.2  6 3314.113
## nb.2    7 3135.917
## zip.2  12 3233.546
## zinb.2 13 3125.982

Before we leave this, however, let us also see the mean zero counts from the ZIP versus the ZINB.

pr = predict(zip.2, type="zero")  # π
mu = predict(zip.2, type="count") # μ
zip = predict(zip.2,type="prob")[,1]

pr = predict(zinb.2, type="zero")  # π
mu = predict(zinb.2, type="count") # μ
zinb = predict(zinb.2, type="prob")[,1]

c(zero.obs = mean(zero.obs), zip = mean(zip), zinb = mean(zinb))

##  zero.obs       zip      zinb 
## 0.3005464 0.2985749 0.3119524

Note that the ZIP model is predicting about 29.85% to be zeros while the ZINB model is predicting 31.19% to be zeros. The actual share of observed zeros is 30.05%.

Hurdle Models

Another method of dealing with the excess zeros is to fit what are called hurdle models. In brief, these models use a logit or probit to fit the “always zero” versus “Always 1 or more” groups. Then, only positive, non-zero counts of the outcome are used to fit a truncated Poisson model or a truncated Negative Binomial model. These are sometimes preferred because they are easier to work with but the choice should be motivated by the substantive problem you are modeling and of course how well you know the subject you are studying.

So how do these models differ from the ZIP/ZINB models? ZIP/ZINB models assume that there are two groups – those that will always be zero and those that may sometimes be zero but will typically have non-zero counts most of the time. Hurdle models assume a clear separation – there are really two groups, those who are always zero and those who are always at least 1. Thus, for example, if you thought that people can be divided into two groups, those who never go to a doctor and those who will go at least once, then the hurdle model would be appropriate to analyze why some folks never go to a doctor while others go one or more times per year.

hurdle.p2 = hurdle(art ~ fem + mar + kid5 + phd + ment, data=couart2, dist="poisson", link="logit")
summary(hurdle.p2)

## 
## Call:
## hurdle(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2, 
##     dist = "poisson", link = "logit")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4105 -0.8913 -0.2817  0.5530  7.0324 
## 
## Count model coefficients (truncated poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.67114    0.12246   5.481 4.24e-08 ***
## femWomen    -0.22858    0.06522  -3.505 0.000457 ***
## marMarried   0.09649    0.07283   1.325 0.185209    
## kid5        -0.14219    0.04845  -2.934 0.003341 ** 
## phd         -0.01273    0.03130  -0.407 0.684343    
## ment         0.01875    0.00228   8.222  < 2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.23680    0.29552   0.801   0.4230    
## femWomen    -0.25115    0.15911  -1.579   0.1144    
## marMarried   0.32623    0.18082   1.804   0.0712 .  
## kid5        -0.28525    0.11113  -2.567   0.0103 *  
## phd          0.02222    0.07956   0.279   0.7800    
## ment         0.08012    0.01302   6.155 7.52e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 12 
## Log-likelihood: -1605 on 12 Df

hurdle.nb2 = hurdle(art ~ fem + mar + kid5 + phd + ment, data=couart2, dist="negbin", link="logit")
summary(hurdle.nb2)

## 
## Call:
## hurdle(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2, 
##     dist = "negbin", link = "logit")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2581 -0.8036 -0.2497  0.4745  6.2753 
## 
## Count model coefficients (truncated negbin with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.355125   0.196832   1.804  0.07120 .  
## femWomen    -0.244672   0.097218  -2.517  0.01184 *  
## marMarried   0.103417   0.109430   0.945  0.34463    
## kid5        -0.153260   0.072229  -2.122  0.03385 *  
## phd         -0.002933   0.048067  -0.061  0.95134    
## ment         0.023738   0.004287   5.537 3.07e-08 ***
## Log(theta)   0.603472   0.224995   2.682  0.00731 ** 
## Zero hurdle model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.23680    0.29552   0.801   0.4230    
## femWomen    -0.25115    0.15911  -1.579   0.1144    
## marMarried   0.32623    0.18082   1.804   0.0712 .  
## kid5        -0.28525    0.11113  -2.567   0.0103 *  
## phd          0.02222    0.07956   0.279   0.7800    
## ment         0.08012    0.01302   6.155 7.52e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta: count = 1.8285
## Number of iterations in BFGS optimization: 15 
## Log-likelihood: -1553 on 13 Df

Note that in these models the binary equation (logit/probit) is predicting the probability of non-zero positive counts so a positive statistically significant coefficient suggests a higher probability of seeing 1 or more counts of the outcome.

Comparing the ZIP, ZINB and Hurdle Models with the absences data:

pois.1 = glm(daysabs ~ math + prog, data=dat, family=poisson(link="log")) 
nb.1 = glm.nb(daysabs ~ math + prog, data=dat)
zip.1 = zeroinfl(daysabs ~ math + prog | math, data=dat, dist="poisson", link="logit", EM=TRUE)
summary(zip.1)

## 
## Call:
## zeroinfl(formula = daysabs ~ math + prog | math, data = dat, dist = "poisson", 
##     link = "logit", EM = TRUE)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2495 -1.2165 -0.5783  0.4888  7.2434 
## 
## Count model coefficients (poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.5885695  0.0614817  42.103  < 2e-16 ***
## math           -0.0052220  0.0009319  -5.603  2.1e-08 ***
## progAcademic   -0.3040149  0.0567368  -5.358  8.4e-08 ***
## progVocational -0.9599637  0.0813696 -11.798  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.465589   0.383326  -6.432 1.26e-10 ***
## math         0.017144   0.006385   2.685  0.00725 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 1 
## Log-likelihood: -1204 on 6 Df

zinb.1 = zeroinfl(daysabs ~ math + prog | math, data=dat, dist="negbin", link="logit", EM=TRUE)
summary(zinb.1)

## 
## Call:
## zeroinfl(formula = daysabs ~ math + prog | math, data = dat, dist = "negbin", 
##     link = "logit", EM = TRUE)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9787 -0.7216 -0.3309  0.2581  4.9535 
## 
## Count model coefficients (negbin with log link):
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.597121   0.195399  13.291  < 2e-16 ***
## math           -0.005575   0.002602  -2.142   0.0322 *  
## progAcademic   -0.423978   0.182301  -2.326   0.0200 *  
## progVocational -1.260127   0.203269  -6.199 5.67e-10 ***
## Log(theta)      0.079718   0.144461   0.552   0.5811    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.81661    3.77644  -1.540    0.124
## math         0.03083    0.03914   0.788    0.431
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 1.083 
## Number of iterations in BFGS optimization: 1 
## Log-likelihood: -865.5 on 7 Df

hurdle.p1 = hurdle(daysabs ~ math + prog, data=dat, dist="poisson", link="logit")
summary(hurdle.p1)

## 
## Call:
## hurdle(formula = daysabs ~ math + prog, data = dat, dist = "poisson", 
##     link = "logit")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3412 -1.1691 -0.6011  0.5870  7.3923 
## 
## Count model coefficients (truncated poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.5883018  0.0614399  42.127  < 2e-16 ***
## math           -0.0052167  0.0009301  -5.609 2.04e-08 ***
## progAcademic   -0.3040621  0.0567529  -5.358 8.43e-08 ***
## progVocational -0.9397182  0.0794914 -11.822  < 2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
##                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     1.919e+01  1.680e+03   0.011   0.9909  
## math           -1.294e-02  6.791e-03  -1.905   0.0568 .
## progAcademic   -1.671e+01  1.680e+03  -0.010   0.9921  
## progVocational -1.770e+01  1.680e+03  -0.011   0.9916  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 10 
## Log-likelihood: -1192 on 8 Df

hurdle.nb1 = hurdle(daysabs ~ math + prog, data=dat, dist="negbin", link="logit")
summary(hurdle.nb1)

## 
## Call:
## hurdle(formula = daysabs ~ math + prog, data = dat, dist = "negbin", 
##     link = "logit")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0047 -0.7325 -0.3413  0.2973  4.6260 
## 
## Count model coefficients (truncated negbin with log link):
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.485921   0.195939  12.687  < 2e-16 ***
## math           -0.004954   0.002648  -1.871   0.0613 .  
## progAcademic   -0.329739   0.180483  -1.827   0.0677 .  
## progVocational -1.076286   0.212003  -5.077 3.84e-07 ***
## Log(theta)      0.106207   0.170355   0.623   0.5330    
## Zero hurdle model coefficients (binomial with logit link):
##                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     1.919e+01  1.680e+03   0.011   0.9909  
## math           -1.294e-02  6.791e-03  -1.905   0.0568 .
## progAcademic   -1.671e+01  1.680e+03  -0.010   0.9921  
## progVocational -1.770e+01  1.680e+03  -0.011   0.9916  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta: count = 1.1121
## Number of iterations in BFGS optimization: 10 
## Log-likelihood: -859.9 on 9 Df

vuong(pois.1, zip.1)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -4.454775 model2 > model1 4.1991e-06
## AIC-corrected         -4.383097 model2 > model1 5.8502e-06
## BIC-corrected         -4.248721 model2 > model1 1.0750e-05

vuong(nb.1, zinb.1)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -0.245435 model2 > model1    0.40306
## AIC-corrected          3.724452 model1 > model2  9.787e-05
## BIC-corrected         11.166786 model1 > model2 < 2.22e-16

lrtest(zip.1, zinb.1)

## Likelihood ratio test
## 
## Model 1: daysabs ~ math + prog | math
## Model 2: daysabs ~ math + prog | math
##   #Df   LogLik Df  Chisq Pr(>Chisq)    
## 1   6 -1204.34                         
## 2   7  -865.51  1 677.68  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC(pois.1, nb.1, zip.1, zinb.1)

##        df      AIC
## pois.1  4 2665.285
## nb.1    5 1741.258
## zip.1   6 2420.687
## zinb.1  7 1745.010

Concluding Thoughts

Small samples might be problematic in that your model estimates could be very noisy and in some cases the model may not converge
You will have to worry about the separation issue here as well since you now have a binary logit/probit to deal with.
In practice ZINB models are some of the hardest to work with because of convergence issues
If you have excess zeros, the Negative Binomial will often work well but you should not assume this but instead try to fit ZIP and ZINB models before settling for a particular model specification
Generating predicted values, etc. is very cumbersome here. This is one reason (apart from theoretical or substantive motivations) that folks often veer towards the hurdle models since in a pinch one can basically fit two separate equations – one the logit/probit and the the other the truncated \((y \geq 1)\) Poisson or Negative Binomial – and then use visreg for the plots

Practice Data-sets

Data (a)

We have data on 250 groups that went to a park. Each group was questioned before leaving the park about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper). The outcome variable of interest will be the number of fish caught. Even though the question about the number of fish caught was asked to everyone, it does not mean that everyone went fishing. What would be the reason for someone to report a zero count? Was it because this person was unlucky and didn’t catch any fish, or was it because this person didn’t go fishing at all? If a person didn’t go fishing, the outcome would be always zero. Otherwise, if a person went fishing, the count could be zero or non-zero. So we can see that there seemed to be two processes that would generate zero counts: unlucky in fishing or didn’t go fishing. For now we will ignore these more interesting questions and focus on fitting Poisson and Negative Binomial models to these data.

fish = read.dta("http://www.stata-press.com/data/r10/fish.dta")
names(fish); summary(fish)

## [1] "nofish"   "livebait" "camper"   "persons"  "child"    "xb"      
## [7] "zg"       "count"

##      nofish         livebait         camper         persons     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :1.000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:2.000  
##  Median :0.000   Median :1.000   Median :1.000   Median :2.000  
##  Mean   :0.296   Mean   :0.864   Mean   :0.588   Mean   :2.528  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:4.000  
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :4.000  
##      child             xb                  zg              count        
##  Min.   :0.000   Min.   :-3.275050   Min.   :-5.6259   Min.   :  0.000  
##  1st Qu.:0.000   1st Qu.: 0.008267   1st Qu.:-1.2527   1st Qu.:  0.000  
##  Median :0.000   Median : 0.954550   Median : 0.6051   Median :  0.000  
##  Mean   :0.684   Mean   : 0.973796   Mean   : 0.2523   Mean   :  3.296  
##  3rd Qu.:1.000   3rd Qu.: 1.963855   3rd Qu.: 1.9932   3rd Qu.:  2.000  
##  Max.   :3.000   Max.   : 5.352674   Max.   : 4.2632   Max.   :149.000

Data (b)

Cross-section data on the number of recreational boating trips to Lake Somerville, Texas, in 1980, based on a survey administered to 2,000 registered leisure boat owners in 23 counties in eastern Texas.

trips – Number of recreational boating trips.
quality – Facility’s subjective quality ranking on a scale of 1 to 5.
ski – factor. Was the individual engaged in water-skiing at the lake?
income – Annual household income of the respondent (in 1,000 USD).
userfee – factor. Did the individual pay an annual user fee at Lake Somerville?
costC – Expenditure when visiting Lake Conroe (in USD).
costS – Expenditure when visiting Lake Somerville (in USD).
costH – Expenditure when visiting Lake Houston (in USD).

library(AER)
data(RecreationDemand)
names(RecreationDemand)

## [1] "trips"   "quality" "ski"     "income"  "userfee" "costC"   "costS"  
## [8] "costH"

summary(RecreationDemand)

##      trips           quality       ski          income      userfee  
##  Min.   : 0.000   Min.   :0.000   no :417   Min.   :1.000   no :646  
##  1st Qu.: 0.000   1st Qu.:0.000   yes:242   1st Qu.:3.000   yes: 13  
##  Median : 0.000   Median :0.000             Median :3.000            
##  Mean   : 2.244   Mean   :1.419             Mean   :3.853            
##  3rd Qu.: 2.000   3rd Qu.:3.000             3rd Qu.:5.000            
##  Max.   :88.000   Max.   :5.000             Max.   :9.000            
##      costC            costS             costH       
##  Min.   :  4.34   Min.   :  4.767   Min.   :  5.70  
##  1st Qu.: 28.24   1st Qu.: 33.312   1st Qu.: 28.96  
##  Median : 41.19   Median : 47.000   Median : 42.38  
##  Mean   : 55.42   Mean   : 59.928   Mean   : 55.99  
##  3rd Qu.: 69.67   3rd Qu.: 72.573   3rd Qu.: 68.56  
##  Max.   :493.77   Max.   :491.547   Max.   :491.05