Several outcomes of interest are measured as counts – the number of patient visits to the Emergency Room in a year, the number of fish caught, the number of traffic fatalities, the number of charitable contributions made, the number of wars fought, and so on. Just as with the binary logit/probit and the ordinal logit/probit, estimating count outcomes as if they were continuous variables leads to inconsistent and biased estimates. Rather, these data should be modeled as arising from a Poisson process (or one of its extensions). In the Poisson model, the probability of observing a given count is governed by the Poisson distribution.
Let \(y\) be a random variable recording the number of times an event occurs in a specific interval of time. This variable \(y\) is said to be Poisson distributed with mean \(\lambda > 0\) if
\[Pr(y \mid \lambda) = \dfrac{e^{-\lambda}\lambda^{y}}{y!} \quad \text{for } y = 0, 1, 2, 3, \ldots\]
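A defining property of the Poisson distribution – one we will lean on repeatedly below – is that its mean and variance are both equal to \(\lambda\):
\[E(y) = \text{Var}(y) = \lambda\]
When the observed variance of a count outcome exceeds its mean, this equality fails and we say the data are overdispersed.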
Let us set \(\lambda\) to specific values to see what the distribution looks like. In particular, the plots below show \(\lambda = 0.8\), \(\lambda = 1.5\), \(\lambda = 3\), and \(\lambda = 10\).
par(mfrow=c(2,2))  # arrange the four panels in a 2 x 2 grid
x = seq(0, 20, by=1)  # candidate values of y
plot(x, dpois(x, lambda=0.8), type="h", lwd=2, col="firebrick", ylim=c(0, 0.5), ylab="Probability of a Specific y value", xlab="Specific values of y")
plot(x, dpois(x, lambda=1.5), type="h", lwd=2, col="firebrick", ylim=c(0, 0.4), ylab="Probability of a Specific y value", xlab="Specific values of y")
plot(x, dpois(x, lambda=3), type="h", lwd=2, col="firebrick", ylim=c(0, 0.3), ylab="Probability of a Specific y value", xlab="Specific values of y")
plot(x, dpois(x, lambda=10), type="h", lwd=2, col="firebrick", ylim=c(0, 0.2), ylab="Probability of a Specific y value", xlab="Specific values of y")
Simulated Poisson Distributions with Varying Lambda (i.e., Mean) Values
par(mfrow=c(1,1))
library(foreign)
dat = read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
dat = within(dat, {
prog = factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id = factor(id)
})
summary(dat)
## id gender math daysabs
## 1001 : 1 female:160 Min. : 1.00 Min. : 0.000
## 1002 : 1 male :154 1st Qu.:28.00 1st Qu.: 1.000
## 1003 : 1 Median :48.00 Median : 4.000
## 1004 : 1 Mean :48.27 Mean : 5.955
## 1005 : 1 3rd Qu.:70.00 3rd Qu.: 8.000
## 1006 : 1 Max. :99.00 Max. :35.000
## (Other):308
## prog
## General : 40
## Academic :167
## Vocational:107
The outcome of interest is the number of days a student is absent in a school year. We have three independent variables – the student's gender, the student's math score, and the program the student is enrolled in.
Let us start with a simple barplot so we can see the counts of the number of days students are absent.
tab.1 = table(dat$daysabs); tab.1
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 27
## 57 41 28 27 25 20 16 14 10 13 5 7 7 6 4 3 7 1 2 3 2 2 2 1 2
## 28 29 30 34 35
## 2 1 2 2 2
barplot(tab.1, ylab="Frequency", xlab="No. of Days Absent", ylim=c(0,60), col="salmon")
Let us also compute the mean and the variance of absences:
mean(dat$daysabs)
## [1] 5.955414
var(dat$daysabs)
## [1] 49.51877
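A quick, informal screen for overdispersion is the variance-to-mean ratio, which should be close to 1 under a Poisson process (a descriptive check we add here, not a formal test):
# variance-to-mean ratio: about 8.3 here, far above the 1 implied by a Poisson process
var(dat$daysabs) / mean(dat$daysabs)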
Notice that the variance far exceeds the mean, so we have overdispersion. Note also that the most likely outcome is 0 days absent. The overdispersion implies that these data are most likely not generated by a Poisson data-generating process. Overdispersion can be driven by a misspecified model or by unobserved heterogeneity – for example, male students might accumulate more absences than female students, or vice-versa. As a rule, then, we have to be sure that we are fitting the correct model and that heterogeneity is accounted for before we discard the Poisson distribution in favor of another (the most common alternative being the Negative Binomial, precisely because it does not assume that the mean equals the variance).
lm.0 = lm(daysabs ~ math + prog, data=dat)
summary(lm.0)
##
## Call:
## lm(formula = daysabs ~ math + prog, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.747 -4.082 -1.652 1.705 27.735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.60373 1.22469 10.291 < 2e-16 ***
## math -0.04359 0.01505 -2.896 0.00405 **
## progAcademic -3.81316 1.13845 -3.349 0.00091 ***
## progVocational -7.38494 1.21535 -6.076 3.6e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.464 on 310 degrees of freedom
## Multiple R-squared: 0.1642, Adjusted R-squared: 0.1561
## F-statistic: 20.3 on 3 and 310 DF, p-value: 4.894e-12
Note what is significant and what isn’t significant.
pois.1 = glm(daysabs ~ math + prog, data=dat, family=poisson(link="log"))
summary(pois.1)
##
## Call:
## glm(formula = daysabs ~ math + prog, family = poisson(link = "log"),
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.2597 -2.2038 -0.9193 0.6511 7.4233
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.651974 0.060736 43.664 < 2e-16 ***
## math -0.006808 0.000931 -7.313 2.62e-13 ***
## progAcademic -0.439897 0.056672 -7.762 8.35e-15 ***
## progVocational -1.281364 0.077886 -16.452 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2217.7 on 313 degrees of freedom
## Residual deviance: 1774.0 on 310 degrees of freedom
## AIC: 2665.3
##
## Number of Fisher Scoring iterations: 5
exp(coef(pois.1))
## (Intercept) math progAcademic progVocational
## 14.1820032 0.9932147 0.6441025 0.2776583
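These exponentiated coefficients are incidence-rate ratios – multiplicative changes in the expected count for a one-unit change in a predictor. A small sketch (our calculation, not part of the original output) re-expresses them as percent changes:
# percent change in expected days absent per one-unit increase in each predictor
(exp(coef(pois.1)) - 1) * 100
# e.g., each additional math point is associated with roughly a 0.7% drop in
# expected absences; vocational-program students have about 72% fewer expected
# absences than general-program students, holding math constant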
AIC(pois.1); BIC(pois.1)
## [1] 2665.285
## [1] 2680.283
library(pscl)
pR2(pois.1)
## llh llhNull G2 McFadden r2ML
## -1328.6424931 -1550.5092295 443.7334727 0.1430928 0.7566279
## r2CU
## 0.7566668
The usual set of basic model fit statistics is available to us – the AIC, BIC, and the pseudo-\(R^2\). We could also exponentiate the coefficients to obtain incidence-rate ratios, but it is often easier to generate predicted counts that can be shown in a table for the more interesting covariate profiles and/or plotted.
Here are the basic predicted counts generated from in-sample values.
dat$phat = predict(pois.1, type="response")
How about plotting these?
dat = dat[with(dat, order(prog, math)), ]
library(ggplot2)
plot.p = ggplot(dat, aes(x = math, y = phat, colour = prog)) + geom_point(aes(y = daysabs), alpha=.5, position=position_jitter(h=.2)) + geom_line(size = 1) + labs(x = "Math Score", y = "Expected number of Days Absent")
plot.p
These are the predicted numbers of days a student is absent (\(\hat{y}\)). They are not very useful on their own because they are not tied to specific covariate profiles. Let us generate a specific profile and see how the predicted counts differ from the actual counts for the profiled students.
newdata.2 = data.frame(
  math = rep(seq(from = min(dat$math), to = max(dat$math), length.out = 100), 3),
  prog = factor(rep(1:3, each = 100), levels = 1:3, labels = levels(dat$prog))
)
newdata.2 = cbind(newdata.2, predict(pois.1, newdata.2, type = "link", se.fit=TRUE))
newdata.2 = within(newdata.2, {
DaysAbsent.p = exp(fit)  # predicted counts: exponentiate the linear predictor
LL.p = exp(fit - 1.96 * se.fit)  # 95% bounds built on the link scale, then exponentiated
UL.p = exp(fit + 1.96 * se.fit)
})
ggplot(newdata.2, aes(math, DaysAbsent.p)) +
geom_ribbon(aes(ymin = LL.p, ymax = UL.p, fill = prog), alpha = .25) +
geom_line(aes(colour = prog), size = 2) +
labs(x = "Math Score", y = "Predicted Days Absent")
We could also use the visreg package:
library(visreg)
visreg(pois.1, "math", by="prog", scale="response", overlay=TRUE)
The Poisson model will rarely fit in practice because there is almost always overdispersion. If we proceed with the Poisson anyway, our estimates will be consistent but the standard errors will be incorrect, with the usual consequences for inference. The Negative Binomial model works around the Poisson's restrictive assumption that the mean equals the variance by allowing the variance of the outcome to exceed the mean.
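Concretely, in the parameterization used by glm.nb() in the MASS package, the mean stays \(\mu\) but the variance becomes
\[\text{Var}(y) = \mu + \dfrac{\mu^{2}}{\theta}\]
so the variance exceeds the mean for any finite \(\theta\), and the Poisson is recovered as \(\theta \rightarrow \infty\).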
library(MASS)
nb.1 = glm.nb(daysabs ~ math + prog, data=dat)
summary(nb.1)
##
## Call:
## glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1547 -1.0192 -0.3694 0.2285 2.5273
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.615265 0.197460 13.245 < 2e-16 ***
## math -0.005993 0.002505 -2.392 0.0167 *
## progAcademic -0.440760 0.182610 -2.414 0.0158 *
## progVocational -1.278651 0.200720 -6.370 1.89e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)
##
## Null deviance: 427.54 on 313 degrees of freedom
## Residual deviance: 358.52 on 310 degrees of freedom
## AIC: 1741.3
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.033
## Std. Err.: 0.106
##
## 2 x log-likelihood: -1731.258
library(pscl)
pR2(nb.1)
## llh llhNull G2 McFadden r2ML
## -865.62889583 -896.47237109 61.68695051 0.03440538 0.17836191
## r2CU
## 0.17895472
The formal test for overdispersion, with the null hypothesis \(\alpha = 0\) (i.e., that the Poisson model, which constrains \(\alpha\) to be \(0\), fits these data), can be run with the odTest command in the pscl library.
library(pscl)
odTest(nb.1)
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
##
## Critical value of test statistic at the alpha= 0.05 level: 2.7055
## Chi-Square Test Statistic = 926.0272 p-value = < 2.2e-16
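The test statistic is simply the likelihood-ratio statistic comparing the two fitted models, and we can reproduce it by hand from the log-likelihoods (a check we add here):
# LR statistic: twice the log-likelihood gap between the NB and Poisson fits
as.numeric(2 * (logLik(nb.1) - logLik(pois.1)))
## [1] 926.0272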
Note that the null of no overdispersion is soundly rejected. We should also compare the AIC() and BIC() of the two models.
AIC(nb.1)
## [1] 1741.258
BIC(nb.1)
## [1] 1760.005
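For a side-by-side comparison of the two fits (both criteria favor the negative binomial by a wide margin):
# information criteria for the competing Poisson and negative binomial models
AIC(pois.1, nb.1)
BIC(pois.1, nb.1)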
dat$nbhat = predict(nb.1, type="response")
dat = dat[with(dat, order(prog, math)), ]
plot.nb = ggplot(dat, aes(x = math, y = nbhat, colour = prog)) + geom_point(aes(y = daysabs), alpha=.5, position=position_jitter(h=.2)) + geom_line(size = 1) + labs(x = "Math Score", y = "Expected number of Days Absent")
plot.nb
And now using newdata.2 …
newdata.2 = data.frame(
  math = rep(seq(from = min(dat$math), to = max(dat$math), length.out = 100), 3),
  prog = factor(rep(1:3, each = 100), levels = 1:3, labels = levels(dat$prog))
)
newdata.2 = cbind(newdata.2, predict(nb.1, newdata.2, type = "link", se.fit=TRUE))
newdata.2 = within(newdata.2, {
DaysAbsent.nb = exp(fit)
LL.nb = exp(fit - 1.96 * se.fit)
UL.nb = exp(fit + 1.96 * se.fit)
})
ggplot(newdata.2, aes(math, DaysAbsent.nb)) +
geom_ribbon(aes(ymin = LL.nb, ymax = UL.nb, fill = prog), alpha = .25) + geom_line(aes(colour = prog), size = 2) + labs(x = "Math Score", y = "Predicted Days Absent")
We could also use the visreg package:
library(visreg)
visreg(nb.1, "math", by="prog", scale="response", overlay=TRUE)
library(stargazer)
stargazer(lm.0, pois.1, nb.1, type="html", star.cutoffs=c(0.05, 0.01, 0.001))
Dependent variable: daysabs

|                     | OLS (1)                 | Poisson (2)       | Negative Binomial (3) |
|---------------------|-------------------------|-------------------|-----------------------|
| math                | -0.044** (0.015)        | -0.007*** (0.001) | -0.006* (0.003)       |
| progAcademic        | -3.813*** (1.138)       | -0.440*** (0.057) | -0.441* (0.183)       |
| progVocational      | -7.385*** (1.215)       | -1.281*** (0.078) | -1.279*** (0.201)     |
| Constant            | 12.604*** (1.225)       | 2.652*** (0.061)  | 2.615*** (0.197)      |
| Observations        | 314                     | 314               | 314                   |
| R2                  | 0.164                   |                   |                       |
| Adjusted R2         | 0.156                   |                   |                       |
| Log Likelihood      |                         | -1,328.642        | -866.629              |
| theta               |                         |                   | 1.033*** (0.106)      |
| Akaike Inf. Crit.   |                         | 2,665.285         | 1,741.258             |
| Residual Std. Error | 6.464 (df = 310)        |                   |                       |
| F Statistic         | 20.300*** (df = 3; 310) |                   |                       |

Note: *p<0.05; **p<0.01; ***p<0.001
In a study of scientific productivity, Long (1990) considered factors affecting the number of papers published during graduate school by a sample of 915 biochemists. The average number of articles was 1.7 with a variance of 3.7, which suggests overdispersion in the distribution of articles. These data include the following measures: art (articles published), fem (gender), mar (marital status), kid5 (number of young children), phd (prestige of the PhD program), and ment (articles published by the student's mentor over the last three years).
library(foreign)
couart2 = read.dta("http://www.ats.ucla.edu/stat/stata/examples/long/couart2.dta")
summary(couart2)
## art fem mar kid5
## Min. : 0.000 Men :494 Single :309 Min. :0.0000
## 1st Qu.: 0.000 Women:421 Married:606 1st Qu.:0.0000
## Median : 1.000 Median :0.0000
## Mean : 1.693 Mean :0.4951
## 3rd Qu.: 2.000 3rd Qu.:1.0000
## Max. :19.000 Max. :3.0000
## phd ment
## Min. :0.755 Min. : 0.000
## 1st Qu.:2.260 1st Qu.: 3.000
## Median :3.150 Median : 6.000
## Mean :3.103 Mean : 8.767
## 3rd Qu.:3.920 3rd Qu.:12.000
## Max. :4.620 Max. :77.000
If we plot the number of articles published, and we calculate the mean and the variance, we see the following:
mean(couart2$art)
## [1] 1.692896
var(couart2$art)
## [1] 3.709742
barplot(table(couart2$art), ylab="Frequency", xlab="No. of Articles Published", ylim=c(0,300), col="salmon")
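As with the attendance data, the variance-to-mean ratio gives a quick read on overdispersion – milder here, but still present:
# variance-to-mean ratio: about 2.2, versus the 1 implied by a Poisson process
var(couart2$art) / mean(couart2$art)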
lm.c = lm(art ~ fem + mar + kid5 + phd + ment, data=couart2)
summary(lm.c)
##
## Call:
## lm(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0209 -1.2358 -0.4125 0.7517 14.8409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.334249 0.240911 5.538 4e-08 ***
## femWomen -0.380885 0.128139 -2.972 0.00303 **
## marMarried 0.263199 0.145423 1.810 0.07064 .
## kid5 -0.291442 0.090919 -3.206 0.00140 **
## phd -0.011365 0.063666 -0.179 0.85836
## ment 0.061495 0.006605 9.310 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.822 on 909 degrees of freedom
## Multiple R-squared: 0.1104, Adjusted R-squared: 0.1055
## F-statistic: 22.55 on 5 and 909 DF, p-value: < 2.2e-16
Make a quick note of what is significant and of the signs on the significant estimates. We know this is the wrong model, so let us move on.
Let us fit the following Poisson model:
\[\log E(\text{art}) = \beta_0 + \beta_1\,\text{fem} + \beta_2\,\text{mar} + \beta_3\,\text{kid5} + \beta_4\,\text{phd} + \beta_5\,\text{ment}\]
pois.2 = glm(art ~ fem + mar + kid5 + phd + ment, data=couart2, family=poisson(link="log"))
summary(pois.2)
##
## Call:
## glm(formula = art ~ fem + mar + kid5 + phd + ment, family = poisson(link = "log"),
## data = couart2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5672 -1.5398 -0.3660 0.5722 5.4467
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.304617 0.102981 2.958 0.0031 **
## femWomen -0.224594 0.054613 -4.112 3.92e-05 ***
## marMarried 0.155243 0.061374 2.529 0.0114 *
## kid5 -0.184883 0.040127 -4.607 4.08e-06 ***
## phd 0.012823 0.026397 0.486 0.6271
## ment 0.025543 0.002006 12.733 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1817.4 on 914 degrees of freedom
## Residual deviance: 1634.4 on 909 degrees of freedom
## AIC: 3314.1
##
## Number of Fisher Scoring iterations: 5
exp(coef(pois.2))
## (Intercept) femWomen marMarried kid5 phd ment
## 1.3561053 0.7988403 1.1679422 0.8312018 1.0129051 1.0258718
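Again, these exponentiated coefficients are incidence-rate ratios. To turn the model into an expected count for a concrete profile – here a hypothetical one we construct: a married woman with no young children, a phd score at the sample mean, and a mentor with 6 recent articles (the sample median) – we can use predict():
# expected article count for one hypothetical covariate profile
profile.1 = data.frame(fem="Women", mar="Married", kid5=0, phd=3.103, ment=6)
predict(pois.2, newdata=profile.1, type="response")
# roughly 1.5 expected articles for this profile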
AIC(pois.2); BIC(pois.2)
## [1] 3314.113
## [1] 3343.026
Here are the basic predicted counts generated from in-sample values.
couart2$phat.pc = predict(pois.2, type="response")
These are the predicted numbers of articles (\(\hat{y}\)). They are not very useful on their own because they are not tied to specific covariate profiles. Let us generate a specific profile and see how the predicted counts differ from the actual counts for the profiled individuals.
newdata.a = data.frame(fem="Women", mar="Married", kid5 = 0, phd=3.103, ment=seq(0, 77, by=1))
newdata.a$phat.pc = predict(pois.2, newdata=newdata.a, type="response")
Now let us plot how \(\hat{y}\) changes with ment and, as a visual comparison of how well the model fits the data, superimpose the actual counts \((y)\) on this plot itself.
plot(couart2$ment, couart2$art, ylim=c(0,12), pch=16, col="salmon", xlab="Articles by Mentor in Last 3 Years", ylab="Actual and Predicted Counts of Articles")
lines(newdata.a$ment, newdata.a$phat.pc, col="cornflowerblue", lwd=2)  # Poisson predictions
The plot shows that predicted counts overestimate the actual counts at the high end of ment and underestimate the actual counts at the low end of ment.
Remember that overdispersion can stem from many sources, but heterogeneity is one of the usual suspects whenever we see it. Consequently, we can try to "fix" the standard errors by relying on the Huber-White sandwich estimator (very similar to White's heteroscedasticity-corrected standard errors). Let us do this now:
library(lmtest)
library(sandwich)
coeftest(pois.2, vcov = sandwich)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3046168 0.1465195 2.0790 0.0376156 *
## femWomen -0.2245942 0.0716622 -3.1341 0.0017240 **
## marMarried 0.1552434 0.0819292 1.8948 0.0581125 .
## kid5 -0.1848827 0.0559633 -3.3036 0.0009544 ***
## phd 0.0128226 0.0419641 0.3056 0.7599392
## ment 0.0255427 0.0038178 6.6905 2.224e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare these estimated standard errors and p-values to those of the base Poisson model. Note what has changed.
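One way to see the change at a glance is the ratio of robust to naive standard errors (a quick sketch using sandwich() from the sandwich package loaded above):
# ratio of robust (sandwich) to naive Poisson standard errors; values well
# above 1 are the overdispersion showing up in the inference
sqrt(diag(sandwich(pois.2))) / sqrt(diag(vcov(pois.2)))
# the ratios run from roughly 1.3 to 1.9 across the coefficients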
library(MASS)
nb.2 = glm.nb(art ~ fem + mar + kid5 + phd + ment, data=couart2)
summary(nb.2)
##
## Call:
## glm.nb(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2,
## init.theta = 2.264387695, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1678 -1.3617 -0.2806 0.4476 3.4524
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.256144 0.137348 1.865 0.062191 .
## femWomen -0.216418 0.072636 -2.979 0.002887 **
## marMarried 0.150489 0.082097 1.833 0.066791 .
## kid5 -0.176415 0.052813 -3.340 0.000837 ***
## phd 0.015271 0.035873 0.426 0.670326
## ment 0.029082 0.003214 9.048 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.2644) family taken to be 1)
##
## Null deviance: 1109.0 on 914 degrees of freedom
## Residual deviance: 1004.3 on 909 degrees of freedom
## AIC: 3135.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.264
## Std. Err.: 0.271
##
## 2 x log-likelihood: -3121.917
As before, the formal test for overdispersion, with the null hypothesis \(\alpha = 0\), can be run with the odTest command in the pscl library.
library(pscl)
odTest(nb.2)
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
##
## Critical value of test statistic at the alpha= 0.05 level: 2.7055
## Chi-Square Test Statistic = 180.196 p-value = < 2.2e-16
Note that the null of no overdispersion is again soundly rejected. We should also consider the AIC() and BIC().
AIC(nb.2)
## [1] 3135.917
BIC(nb.2)
## [1] 3169.649
Let us see how the predicted counts from the Negative Binomial stack up against those from the Poisson.
newdata.a$phat.nbc = predict(nb.2, newdata=newdata.a, type="response")
plot(couart2$ment, couart2$art, ylim=c(0,12), pch=16, col="salmon", xlab="Articles by Mentor in Last 3 Years", ylab="Actual and Predicted Counts of Articles")
lines(newdata.a$ment, newdata.a$phat.pc, col="cornflowerblue", lwd=2)  # Poisson predictions
lines(newdata.a$ment, newdata.a$phat.nbc, col="darkgreen", lwd=2)  # negative binomial predictions
text(60, 9, "Negative Binomial", cex=0.75)
text(70, 5.5, "Poisson", cex=0.75)
We could also use the visreg package for both models:
library(visreg)
visreg(pois.2, "ment", by="fem", scale="response", overlay=TRUE)
visreg(nb.2, "ment", by="fem", scale="response", overlay=TRUE)
And now putting the estimates in a table …
library(stargazer)
stargazer(lm.c, pois.2, nb.2, type="html", star.cutoffs=c(0.05, 0.01, 0.001))
Dependent variable: art

|                     | OLS (1)                 | Poisson (2)       | Negative Binomial (3) |
|---------------------|-------------------------|-------------------|-----------------------|
| femWomen            | -0.381** (0.128)        | -0.225*** (0.055) | -0.216** (0.073)      |
| marMarried          | 0.263 (0.145)           | 0.155* (0.061)    | 0.150 (0.082)         |
| kid5                | -0.291** (0.091)        | -0.185*** (0.040) | -0.176*** (0.053)     |
| phd                 | -0.011 (0.064)          | 0.013 (0.026)     | 0.015 (0.036)         |
| ment                | 0.061*** (0.007)        | 0.026*** (0.002)  | 0.029*** (0.003)      |
| Constant            | 1.334*** (0.241)        | 0.305** (0.103)   | 0.256 (0.137)         |
| Observations        | 915                     | 915               | 915                   |
| R2                  | 0.110                   |                   |                       |
| Adjusted R2         | 0.105                   |                   |                       |
| Log Likelihood      |                         | -1,651.056        | -1,561.958            |
| theta               |                         |                   | 2.264*** (0.271)      |
| Akaike Inf. Crit.   |                         | 3,314.113         | 3,135.917             |
| Residual Std. Error | 1.822 (df = 909)        |                   |                       |
| F Statistic         | 22.553*** (df = 5; 909) |                   |                       |

Note: *p<0.05; **p<0.01; ***p<0.001
We have data on 250 groups that went to a park. Each group was questioned before leaving the park about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper). The outcome of interest is the number of fish caught. Even though everyone was asked how many fish they caught, not everyone went fishing. What would lead someone to report a zero count? Was it because the person was unlucky and caught no fish, or because the person did not go fishing at all? If a person did not go fishing, the outcome is always zero; if a person went fishing, the count could be zero or non-zero. So there appear to be two processes that can generate zero counts: being unlucky at fishing, or not fishing at all. For now we will set aside these more interesting questions and focus on fitting Poisson and Negative Binomial models to these data.
fish = read.dta("http://www.stata-press.com/data/r10/fish.dta")
names(fish); summary(fish)
## [1] "nofish" "livebait" "camper" "persons" "child" "xb"
## [7] "zg" "count"
## nofish livebait camper persons
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:2.000
## Median :0.000 Median :1.000 Median :1.000 Median :2.000
## Mean :0.296 Mean :0.864 Mean :0.588 Mean :2.528
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:4.000
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :4.000
## child xb zg count
## Min. :0.000 Min. :-3.275050 Min. :-5.6259 Min. : 0.000
## 1st Qu.:0.000 1st Qu.: 0.008267 1st Qu.:-1.2527 1st Qu.: 0.000
## Median :0.000 Median : 0.954550 Median : 0.6051 Median : 0.000
## Mean :0.684 Mean : 0.973796 Mean : 0.2523 Mean : 3.296
## 3rd Qu.:1.000 3rd Qu.: 1.963855 3rd Qu.: 1.9932 3rd Qu.: 2.000
## Max. :3.000 Max. : 5.352674 Max. : 4.2632 Max. :149.000
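As a starting point, here is a minimal sketch of the two fits previewed above; the right-hand-side specification (persons, child, camper) is our choice for illustration, not something dictated by the data:
# Poisson and negative binomial fits to the fish counts
pois.f = glm(count ~ persons + child + camper, data=fish, family=poisson(link="log"))
library(MASS)
nb.f = glm.nb(count ~ persons + child + camper, data=fish)
AIC(pois.f, nb.f)  # compare the two fits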
Next we have cross-section data on the number of recreational boating trips to Lake Somerville, Texas, in 1980, based on a survey administered to 2,000 registered leisure boat owners in 23 counties in eastern Texas.
library(AER)
data(RecreationDemand)
names(RecreationDemand)
## [1] "trips" "quality" "ski" "income" "userfee" "costC" "costS"
## [8] "costH"
summary(RecreationDemand)
## trips quality ski income userfee
## Min. : 0.000 Min. :0.000 no :417 Min. :1.000 no :646
## 1st Qu.: 0.000 1st Qu.:0.000 yes:242 1st Qu.:3.000 yes: 13
## Median : 0.000 Median :0.000 Median :3.000
## Mean : 2.244 Mean :1.419 Mean :3.853
## 3rd Qu.: 2.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :88.000 Max. :5.000 Max. :9.000
## costC costS costH
## Min. : 4.34 Min. : 4.767 Min. : 5.70
## 1st Qu.: 28.24 1st Qu.: 33.312 1st Qu.: 28.96
## Median : 41.19 Median : 47.000 Median : 42.38
## Mean : 55.42 Mean : 59.928 Mean : 55.99
## 3rd Qu.: 69.67 3rd Qu.: 72.573 3rd Qu.: 68.56
## Max. :493.77 Max. :491.547 Max. :491.05
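In the same spirit, a possible starting Poisson model for the trip counts (again, the specification below is our sketch; overdispersion checks would proceed exactly as before):
# a candidate Poisson model for the number of boating trips
rd.pois = glm(trips ~ quality + ski + income + userfee + costC + costS + costH, data=RecreationDemand, family=poisson(link="log"))
summary(rd.pois)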