library(foreign)
dat = read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
dat = within(dat, {
prog = factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id = factor(id)
})
pois.1 = glm(daysabs ~ math + prog, data=dat, family=poisson(link="log"))
library(MASS)
nb.1 = glm.nb(daysabs ~ math + prog, data=dat)
summary(nb.1)
##
## Call:
## glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1547 -1.0192 -0.3694 0.2285 2.5273
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.615265 0.197460 13.245 < 2e-16 ***
## math -0.005993 0.002505 -2.392 0.0167 *
## progAcademic -0.440760 0.182610 -2.414 0.0158 *
## progVocational -1.278651 0.200720 -6.370 1.89e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)
##
## Null deviance: 427.54 on 313 degrees of freedom
## Residual deviance: 358.52 on 310 degrees of freedom
## AIC: 1741.3
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.033
## Std. Err.: 0.106
##
## 2 x log-likelihood: -1731.258
couart2 = read.dta("http://www.ats.ucla.edu/stat/stata/examples/long/couart2.dta")
summary(couart2)
## art fem mar kid5
## Min. : 0.000 Men :494 Single :309 Min. :0.0000
## 1st Qu.: 0.000 Women:421 Married:606 1st Qu.:0.0000
## Median : 1.000 Median :0.0000
## Mean : 1.693 Mean :0.4951
## 3rd Qu.: 2.000 3rd Qu.:1.0000
## Max. :19.000 Max. :3.0000
## phd ment
## Min. :0.755 Min. : 0.000
## 1st Qu.:2.260 1st Qu.: 3.000
## Median :3.150 Median : 6.000
## Mean :3.103 Mean : 8.767
## 3rd Qu.:3.920 3rd Qu.:12.000
## Max. :4.620 Max. :77.000
lm.c = lm(art ~ fem + mar + kid5 + phd + ment, data=couart2)
summary(lm.c)
##
## Call:
## lm(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0209 -1.2358 -0.4125 0.7517 14.8409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.334249 0.240911 5.538 4e-08 ***
## femWomen -0.380885 0.128139 -2.972 0.00303 **
## marMarried 0.263199 0.145423 1.810 0.07064 .
## kid5 -0.291442 0.090919 -3.206 0.00140 **
## phd -0.011365 0.063666 -0.179 0.85836
## ment 0.061495 0.006605 9.310 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.822 on 909 degrees of freedom
## Multiple R-squared: 0.1104, Adjusted R-squared: 0.1055
## F-statistic: 22.55 on 5 and 909 DF, p-value: < 2.2e-16
pois.2 = glm(art ~ fem + mar + kid5 + phd + ment, data=couart2, family=poisson(link="log"))
summary(pois.2)
##
## Call:
## glm(formula = art ~ fem + mar + kid5 + phd + ment, family = poisson(link = "log"),
## data = couart2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5672 -1.5398 -0.3660 0.5722 5.4467
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.304617 0.102981 2.958 0.0031 **
## femWomen -0.224594 0.054613 -4.112 3.92e-05 ***
## marMarried 0.155243 0.061374 2.529 0.0114 *
## kid5 -0.184883 0.040127 -4.607 4.08e-06 ***
## phd 0.012823 0.026397 0.486 0.6271
## ment 0.025543 0.002006 12.733 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1817.4 on 914 degrees of freedom
## Residual deviance: 1634.4 on 909 degrees of freedom
## AIC: 3314.1
##
## Number of Fisher Scoring iterations: 5
library(MASS)
nb.2 = glm.nb(art ~ fem + mar + kid5 + phd + ment, data=couart2)
summary(nb.2)
##
## Call:
## glm.nb(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2,
## init.theta = 2.264387695, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1678 -1.3617 -0.2806 0.4476 3.4524
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.256144 0.137348 1.865 0.062191 .
## femWomen -0.216418 0.072636 -2.979 0.002887 **
## marMarried 0.150489 0.082097 1.833 0.066791 .
## kid5 -0.176415 0.052813 -3.340 0.000837 ***
## phd 0.015271 0.035873 0.426 0.670326
## ment 0.029082 0.003214 9.048 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.2644) family taken to be 1)
##
## Null deviance: 1109.0 on 914 degrees of freedom
## Residual deviance: 1004.3 on 909 degrees of freedom
## AIC: 3135.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.264
## Std. Err.: 0.271
##
## 2 x log-likelihood: -3121.917
Count data often have “excessive” zeroes; the majority of the units in the sample do not experience the event at all. Usually the Negative Binomial model can accommodate this by increasing the conditional variance while leaving the conditional mean the same. Note: Conditional refers to the fact that in the model the expected value of the outcome is specified as a function of the independent variables. In that sense the expected value of the outcome is conditional upon the values of the independent variables.
However, often the Negative Binomial will not be able to fully adjust for the excess zeroes. For example, this may be because the basic assumption that all scientists may not publish at all or that all scientists will publish is unrealistic. Perhaps there are those who will never publish and others who will publish at least one article. This possibility can be modeled by way of what are referred to as the zero-inflated models, sometimes also called the with zeros or hurdle models, depending upon how the model is specified and estimated. These models adjust the basic Poisson and Negative Binomial models by not only increasing the conditional variance but also increasing the probability of seeing zeros.
Let us see how the Poisson and the Negative Binomial fall short in terms of being able to account for the excess zeros.
zero.obs = couart2$art == 0
zero.poi = dpois(0, lambda=exp(predict(pois.2)))
c(obs=mean(zero.obs), poi=mean(zero.poi))
## obs poi
## 0.3005464 0.2092071
The code above is doing the following.
What about the Negative Binomial? How does it fit the excess zeros?
munb = exp(predict(nb.2))
theta = nb.2$theta
nb = dnbinom(0, mu=munb, size=theta)
mean(nb)
## [1] 0.3035957
Not bad at all! Note that the Negative Binomial is predicting some 30.35% of the sample will have zeros, quite close to the observed proportion of 30.05%
This model assumes the population is made up of two groups, and a person falls in group 1 (these persons only have zero counts) with probability \(\psi\) and in group 2 (these persons may have zero counts but will also have some non-zero counts) with probability \(1 - \psi\). This probability is unknown and thus has to be estimated from the data. Now, if you see a scientist with zero publications, you don’t know what group she/he is in so how can we estimate \(\psi\)? By making \(\psi\) a function of some set of independent variables. Usually \(\psi\) is modeled with the same set of independent variables and estimated via binary logit or probit models. These are then two-equation models where the first equation determines via logit or probit whether the person is in group 1 or 2, and the second question is either a Poisson (leading to the zero-inflated Poisson model) or the Negative Binomial (leading to the zero-inflated Negative Binomial model).
Caution: The binary logit/probit models are fit such that the outcome of interest that is being predicted is the probability of being in group 1 (the always zero group). As such if you see a statistically significant positive coefficient for a variable, it means this variable increases the probability of seeing only zeros. If the sign on the statistically significant coefficient is negative then the variable decreases the probability of seeing only zeros. Don’t forget this coding scheme when interpreting your estimates.
library(pscl)
zip.2 = zeroinfl(art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment, data=couart2, dist="poisson", link="logit", EM=TRUE)
summary(zip.2)
##
## Call:
## zeroinfl(formula = art ~ fem + mar + kid5 + phd + ment | fem + mar +
## kid5 + phd + ment, data = couart2, dist = "poisson", link = "logit",
## EM = TRUE)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.3254 -0.8652 -0.2826 0.5404 7.2974
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.640863 0.121306 5.283 1.27e-07 ***
## femWomen -0.209147 0.063405 -3.299 0.000972 ***
## marMarried 0.103749 0.071111 1.459 0.144573
## kid5 -0.143317 0.047429 -3.022 0.002514 **
## phd -0.006170 0.031008 -0.199 0.842269
## ment 0.018098 0.002294 7.888 3.07e-15 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.577045 0.509356 -1.133 0.25726
## femWomen 0.109722 0.280066 0.392 0.69523
## marMarried -0.353985 0.317591 -1.115 0.26502
## kid5 0.217101 0.196469 1.105 0.26915
## phd 0.001225 0.145252 0.008 0.99327
## ment -0.134072 0.045224 -2.965 0.00303 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 1
## Log-likelihood: -1605 on 12 Df
library(pscl)
zinb.2 = zeroinfl(art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment, data=couart2, dist="negbin", link="logit", EM=TRUE)
summary(zinb.2)
##
## Call:
## zeroinfl(formula = art ~ fem + mar + kid5 + phd + ment | fem + mar +
## kid5 + phd + ment, data = couart2, dist = "negbin", link = "logit",
## EM = TRUE)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.2942 -0.7601 -0.2910 0.4447 6.4154
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4167587 0.1435966 2.902 0.00370 **
## femWomen -0.1954951 0.0755919 -2.586 0.00970 **
## marMarried 0.0975896 0.0844519 1.156 0.24786
## kid5 -0.1517237 0.0542062 -2.799 0.00513 **
## phd -0.0006967 0.0362697 -0.019 0.98467
## ment 0.0247845 0.0034926 7.096 1.28e-12 ***
## Log(theta) 0.9764245 0.1354754 7.207 5.70e-13 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.19248 1.32231 -0.146 0.88426
## femWomen 0.63599 0.84857 0.749 0.45356
## marMarried -1.49868 0.93824 -1.597 0.11019
## kid5 0.62831 0.44265 1.419 0.15578
## phd -0.03754 0.30789 -0.122 0.90296
## ment -0.88189 0.31604 -2.790 0.00526 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 2.6549
## Number of iterations in BFGS optimization: 1
## Log-likelihood: -1550 on 13 Df
Note a couple of things …
Some argue that the Poisson model is nested within the ZIP model, and likewise the Negative Binomial model is nested within the ZINB model. More generally, however, it is argued that these are really non-nested models, and as such we will have to employ the Vuong test which is implemented in pscl
.
The Vuong non-nested test is based on a comparison of the predicted probabilities of two models that do not nest. Examples include comparisons of zero-inflated count models with their nonzero-inflated analogs (e.g., zero-inflated Poisson versus ordinary Poisson, or zero-inflated Negative Binomial versus ordinary Negative Binomial). A large, positive test statistic provides evidence of the superiority of model 1 over model 2, while a large, negative test statistic is evidence of the superiority of model 2 over model 1. Under the null that the models are indistinguishable, the test statistic is asymptotically distributed standard normal.
vuong(pois.2, zip.2)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -4.180398 model2 > model1 1.455e-05
## AIC-corrected -3.638468 model2 > model1 0.00013713
## BIC-corrected -2.332709 model2 > model1 0.00983172
vuong(nb.2, zinb.2)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -2.241702 model2 > model1 0.012490
## AIC-corrected -1.015327 model2 > model1 0.154975
## BIC-corrected 1.939579 model1 > model2 0.026215
So the ZIP is better than the Poisson, and the ZINB is better than the Negative Binomial, bringing the choice down to the ZIP model versus the ZINB model. Some folks run the vuong
test on ZIP versus ZINB but some see this as a mistake. The classical way then to compare the ZIP and the ZINB models is via the likelihood-ratio test in the lmtest
package.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
lrtest(zip.2, zinb.2)
## Likelihood ratio test
##
## Model 1: art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd +
## ment
## Model 2: art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd +
## ment
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 12 -1604.8
## 2 13 -1550.0 1 109.56 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This test rules in favor of the ZINB model so that is the one we would opt for. You can also see the ZINB favored overall (i.e., among the four models) by the AIC:
AIC(pois.2, nb.2, zip.2, zinb.2)
## df AIC
## pois.2 6 3314.113
## nb.2 7 3135.917
## zip.2 12 3233.546
## zinb.2 13 3125.982
Before we leave this, however, let us also see the mean zero counts from the ZIP versus the ZINB.
pr = predict(zip.2, type="zero") # π
mu = predict(zip.2, type="count") # μ
zip = predict(zip.2,type="prob")[,1]
pr = predict(zinb.2, type="zero") # π
mu = predict(zinb.2, type="count") # μ
zinb = predict(zinb.2, type="prob")[,1]
c(zero.obs = mean(zero.obs), zip = mean(zip), zinb = mean(zinb))
## zero.obs zip zinb
## 0.3005464 0.2985749 0.3119524
Note that the ZIP model is predicting about 29.85% to be zeros while the ZINB model is predicting 31.19% to be zeros. The actual share of observed zeros is 30.05%.
Another method of dealing with the excess zeros is to fit what are called hurdle models. In brief, these models use a logit or probit to fit the “always zero” versus “Always 1 or more” groups. Then, only positive, non-zero counts of the outcome are used to fit a truncated Poisson model or a truncated Negative Binomial model. These are sometimes preferred because they are easier to work with but the choice should be motivated by the substantive problem you are modeling and of course how well you know the subject you are studying.
So how do these models differ from the ZIP/ZINB models? ZIP/ZINB models assume that there are two groups – those that will always be zero and those that may sometimes be zero but will typically have non-zero counts most of the time. Hurdle models assume a clear separation – there are really two groups, those who are always zero and those who are always at least 1. Thus, for example, if you thought that people can be divided into two groups, those who never go to a doctor and those who will go at least once, then the hurdle model would be appropriate to analyze why some folks never go to a doctor while others go one or more times per year.
hurdle.p2 = hurdle(art ~ fem + mar + kid5 + phd + ment, data=couart2, dist="poisson", link="logit")
summary(hurdle.p2)
##
## Call:
## hurdle(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2,
## dist = "poisson", link = "logit")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.4105 -0.8913 -0.2817 0.5530 7.0324
##
## Count model coefficients (truncated poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.67114 0.12246 5.481 4.24e-08 ***
## femWomen -0.22858 0.06522 -3.505 0.000457 ***
## marMarried 0.09649 0.07283 1.325 0.185209
## kid5 -0.14219 0.04845 -2.934 0.003341 **
## phd -0.01273 0.03130 -0.407 0.684343
## ment 0.01875 0.00228 8.222 < 2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.23680 0.29552 0.801 0.4230
## femWomen -0.25115 0.15911 -1.579 0.1144
## marMarried 0.32623 0.18082 1.804 0.0712 .
## kid5 -0.28525 0.11113 -2.567 0.0103 *
## phd 0.02222 0.07956 0.279 0.7800
## ment 0.08012 0.01302 6.155 7.52e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 12
## Log-likelihood: -1605 on 12 Df
hurdle.nb2 = hurdle(art ~ fem + mar + kid5 + phd + ment, data=couart2, dist="negbin", link="logit")
summary(hurdle.nb2)
##
## Call:
## hurdle(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2,
## dist = "negbin", link = "logit")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.2581 -0.8036 -0.2497 0.4745 6.2753
##
## Count model coefficients (truncated negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.355125 0.196832 1.804 0.07120 .
## femWomen -0.244672 0.097218 -2.517 0.01184 *
## marMarried 0.103417 0.109430 0.945 0.34463
## kid5 -0.153260 0.072229 -2.122 0.03385 *
## phd -0.002933 0.048067 -0.061 0.95134
## ment 0.023738 0.004287 5.537 3.07e-08 ***
## Log(theta) 0.603472 0.224995 2.682 0.00731 **
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.23680 0.29552 0.801 0.4230
## femWomen -0.25115 0.15911 -1.579 0.1144
## marMarried 0.32623 0.18082 1.804 0.0712 .
## kid5 -0.28525 0.11113 -2.567 0.0103 *
## phd 0.02222 0.07956 0.279 0.7800
## ment 0.08012 0.01302 6.155 7.52e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta: count = 1.8285
## Number of iterations in BFGS optimization: 15
## Log-likelihood: -1553 on 13 Df
Note that in these models the binary equation (logit/probit) is predicting the probability of non-zero positive counts so a positive statistically significant coefficient suggests a higher probability of seeing 1 or more counts of the outcome.
pois.1 = glm(daysabs ~ math + prog, data=dat, family=poisson(link="log"))
nb.1 = glm.nb(daysabs ~ math + prog, data=dat)
zip.1 = zeroinfl(daysabs ~ math + prog | math, data=dat, dist="poisson", link="logit", EM=TRUE)
summary(zip.1)
##
## Call:
## zeroinfl(formula = daysabs ~ math + prog | math, data = dat, dist = "poisson",
## link = "logit", EM = TRUE)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.2495 -1.2165 -0.5783 0.4888 7.2434
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.5885695 0.0614817 42.103 < 2e-16 ***
## math -0.0052220 0.0009319 -5.603 2.1e-08 ***
## progAcademic -0.3040149 0.0567368 -5.358 8.4e-08 ***
## progVocational -0.9599637 0.0813696 -11.798 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.465589 0.383326 -6.432 1.26e-10 ***
## math 0.017144 0.006385 2.685 0.00725 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 1
## Log-likelihood: -1204 on 6 Df
zinb.1 = zeroinfl(daysabs ~ math + prog | math, data=dat, dist="negbin", link="logit", EM=TRUE)
summary(zinb.1)
##
## Call:
## zeroinfl(formula = daysabs ~ math + prog | math, data = dat, dist = "negbin",
## link = "logit", EM = TRUE)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.9787 -0.7216 -0.3309 0.2581 4.9535
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.597121 0.195399 13.291 < 2e-16 ***
## math -0.005575 0.002602 -2.142 0.0322 *
## progAcademic -0.423978 0.182301 -2.326 0.0200 *
## progVocational -1.260127 0.203269 -6.199 5.67e-10 ***
## Log(theta) 0.079718 0.144461 0.552 0.5811
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.81661 3.77644 -1.540 0.124
## math 0.03083 0.03914 0.788 0.431
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 1.083
## Number of iterations in BFGS optimization: 1
## Log-likelihood: -865.5 on 7 Df
hurdle.p1 = hurdle(daysabs ~ math + prog, data=dat, dist="poisson", link="logit")
summary(hurdle.p1)
##
## Call:
## hurdle(formula = daysabs ~ math + prog, data = dat, dist = "poisson",
## link = "logit")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.3412 -1.1691 -0.6011 0.5870 7.3923
##
## Count model coefficients (truncated poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.5883018 0.0614399 42.127 < 2e-16 ***
## math -0.0052167 0.0009301 -5.609 2.04e-08 ***
## progAcademic -0.3040621 0.0567529 -5.358 8.43e-08 ***
## progVocational -0.9397182 0.0794914 -11.822 < 2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.919e+01 1.680e+03 0.011 0.9909
## math -1.294e-02 6.791e-03 -1.905 0.0568 .
## progAcademic -1.671e+01 1.680e+03 -0.010 0.9921
## progVocational -1.770e+01 1.680e+03 -0.011 0.9916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 10
## Log-likelihood: -1192 on 8 Df
hurdle.nb1 = hurdle(daysabs ~ math + prog, data=dat, dist="negbin", link="logit")
summary(hurdle.nb1)
##
## Call:
## hurdle(formula = daysabs ~ math + prog, data = dat, dist = "negbin",
## link = "logit")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.0047 -0.7325 -0.3413 0.2973 4.6260
##
## Count model coefficients (truncated negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.485921 0.195939 12.687 < 2e-16 ***
## math -0.004954 0.002648 -1.871 0.0613 .
## progAcademic -0.329739 0.180483 -1.827 0.0677 .
## progVocational -1.076286 0.212003 -5.077 3.84e-07 ***
## Log(theta) 0.106207 0.170355 0.623 0.5330
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.919e+01 1.680e+03 0.011 0.9909
## math -1.294e-02 6.791e-03 -1.905 0.0568 .
## progAcademic -1.671e+01 1.680e+03 -0.010 0.9921
## progVocational -1.770e+01 1.680e+03 -0.011 0.9916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta: count = 1.1121
## Number of iterations in BFGS optimization: 10
## Log-likelihood: -859.9 on 9 Df
vuong(pois.1, zip.1)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -4.454775 model2 > model1 4.1991e-06
## AIC-corrected -4.383097 model2 > model1 5.8502e-06
## BIC-corrected -4.248721 model2 > model1 1.0750e-05
vuong(nb.1, zinb.1)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -0.245435 model2 > model1 0.40306
## AIC-corrected 3.724452 model1 > model2 9.787e-05
## BIC-corrected 11.166786 model1 > model2 < 2.22e-16
lrtest(zip.1, zinb.1)
## Likelihood ratio test
##
## Model 1: daysabs ~ math + prog | math
## Model 2: daysabs ~ math + prog | math
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 6 -1204.34
## 2 7 -865.51 1 677.68 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(pois.1, nb.1, zip.1, zinb.1)
## df AIC
## pois.1 4 2665.285
## nb.1 5 1741.258
## zip.1 6 2420.687
## zinb.1 7 1745.010
visreg
for the plotsWe have data on 250 groups that went to a park. Each group was questioned before leaving the park about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper). The outcome variable of interest will be the number of fish caught. Even though the question about the number of fish caught was asked to everyone, it does not mean that everyone went fishing. What would be the reason for someone to report a zero count? Was it because this person was unlucky and didn’t catch any fish, or was it because this person didn’t go fishing at all? If a person didn’t go fishing, the outcome would be always zero. Otherwise, if a person went fishing, the count could be zero or non-zero. So we can see that there seemed to be two processes that would generate zero counts: unlucky in fishing or didn’t go fishing. For now we will ignore these more interesting questions and focus on fitting Poisson and Negative Binomial models to these data.
fish = read.dta("http://www.stata-press.com/data/r10/fish.dta")
names(fish); summary(fish)
## [1] "nofish" "livebait" "camper" "persons" "child" "xb"
## [7] "zg" "count"
## nofish livebait camper persons
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:2.000
## Median :0.000 Median :1.000 Median :1.000 Median :2.000
## Mean :0.296 Mean :0.864 Mean :0.588 Mean :2.528
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:4.000
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :4.000
## child xb zg count
## Min. :0.000 Min. :-3.275050 Min. :-5.6259 Min. : 0.000
## 1st Qu.:0.000 1st Qu.: 0.008267 1st Qu.:-1.2527 1st Qu.: 0.000
## Median :0.000 Median : 0.954550 Median : 0.6051 Median : 0.000
## Mean :0.684 Mean : 0.973796 Mean : 0.2523 Mean : 3.296
## 3rd Qu.:1.000 3rd Qu.: 1.963855 3rd Qu.: 1.9932 3rd Qu.: 2.000
## Max. :3.000 Max. : 5.352674 Max. : 4.2632 Max. :149.000
Cross-section data on the number of recreational boating trips to Lake Somerville, Texas, in 1980, based on a survey administered to 2,000 registered leisure boat owners in 23 counties in eastern Texas.
library(AER)
data(RecreationDemand)
names(RecreationDemand)
## [1] "trips" "quality" "ski" "income" "userfee" "costC" "costS"
## [8] "costH"
summary(RecreationDemand)
## trips quality ski income userfee
## Min. : 0.000 Min. :0.000 no :417 Min. :1.000 no :646
## 1st Qu.: 0.000 1st Qu.:0.000 yes:242 1st Qu.:3.000 yes: 13
## Median : 0.000 Median :0.000 Median :3.000
## Mean : 2.244 Mean :1.419 Mean :3.853
## 3rd Qu.: 2.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :88.000 Max. :5.000 Max. :9.000
## costC costS costH
## Min. : 4.34 Min. : 4.767 Min. : 5.70
## 1st Qu.: 28.24 1st Qu.: 33.312 1st Qu.: 28.96
## Median : 41.19 Median : 47.000 Median : 42.38
## Mean : 55.42 Mean : 59.928 Mean : 55.99
## 3rd Qu.: 69.67 3rd Qu.: 72.573 3rd Qu.: 68.56
## Max. :493.77 Max. :491.547 Max. :491.05