Several outcomes of interest are measured as counts – the number of patient visits to the Emergency Room in a year, the number of fish caught, the number of traffic fatalities, the number of charitable contributions made, the number of wars fought, and so on. Just as with the binary logit/probit and the ordinal logit/probit, estimating count outcomes as if they were continuous variables leads to inconsistent and biased estimates. Rather, these data should be modeled as arising from a Poisson process (or one of its extensions). In the Poisson model, the probability of seeing a particular count value is given by the Poisson distribution.

The Poisson Distribution

Let \(y\) be a random variable indicating the number of times an event has occurred in a specific interval of time. This variable \(y\) is said to be Poisson distributed with mean \(\lambda > 0\) if

\[Pr(y \mid \lambda) = \dfrac{e^{-\lambda}\lambda^{y}}{y!} \quad \text{for } y = 0, 1, 2, 3, \ldots\]
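
For example, with \(\lambda = 1.5\) the probability of observing exactly two events is \(Pr(y = 2 \mid \lambda = 1.5) = e^{-1.5}(1.5)^{2}/2! \approx 0.251\), which dpois() confirms:

dpois(2, lambda = 1.5)   # approximately 0.251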

Let us set \(\lambda\) equal to specific values to see what the distribution looks like. In particular, see the plots below, where \(\lambda = 0.8\), \(\lambda = 1.5\), \(\lambda = 3\), and \(\lambda = 10\).

par(mfrow=c(2,2))   # arrange the four plots in a 2 x 2 grid
x = 0:20
lambdas = c(0.8, 1.5, 3, 10)
ymax = c(0.5, 0.4, 0.3, 0.2)   # y-axis limit matched to each distribution
for (i in seq_along(lambdas)) {
  plot(x, dpois(x, lambda = lambdas[i]), type = "h", lwd = 2, col = "firebrick",
       ylim = c(0, ymax[i]), ylab = "Probability of a Specific y value",
       xlab = "Specific values of y", main = bquote(lambda == .(lambdas[i])))
}
Simulated Poisson Distributions with Varying Lambda (i.e., Mean) Values

par(mfrow=c(1,1))
  • In the four plots, see how the most likely value of \(y\) shifts to the right as the mean \((\lambda)\) increases. So, on average, if you were told that the mean rate of events is 2, you would expect to see 2 events in the next time interval … \(E(y)=\lambda\)
  • As the mean \((\lambda)\) increases, the probability of seeing \(y=0\) decreases. Unfortunately, for many count variables there will be more \(y=0\) values in the data than the Poisson distribution predicts
  • As the mean \((\lambda)\) increases, the Poisson distribution approximates the Normal distribution.
  • The Poisson distribution assumes that every event is independent – i.e., seeing an event occur in one time interval has no impact on the probability of seeing the event occur in the next time interval
  • The Poisson has mean = variance, i.e., \(Var(y) = E(y) = \lambda\) … this is known as equidispersion, although in practice you often see overdispersion – where the variance far exceeds the mean.
  • Formally, in a Poisson model, equidispersion means that the residual deviance divided by its degrees of freedom should be \(\approx 1\). Values \(> 1\) indicate overdispersion (the true variance is bigger than the mean); values \(< 1\) indicate underdispersion (the true variance is smaller than the mean). Corrective measures include letting the dispersion parameter be estimated from the data (quasipoisson) or, in the case of overdispersion, fitting the Negative Binomial model.
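
A minimal simulation sketch of equidispersion (the \(\lambda\) value here is illustrative): draws from a Poisson with \(\lambda = 3\) should have a mean and a variance both close to 3.

set.seed(42)                       # for reproducibility
y.sim = rpois(100000, lambda = 3)  # a large number of Poisson draws
mean(y.sim)                        # close to 3
var(y.sim)                         # also close to 3 (equidispersion)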

A Simple Example to Start With

library(foreign)
dat = read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
dat = within(dat, {
  prog = factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
  id = factor(id)
})

summary(dat)
##        id         gender         math          daysabs      
##  1001   :  1   female:160   Min.   : 1.00   Min.   : 0.000  
##  1002   :  1   male  :154   1st Qu.:28.00   1st Qu.: 1.000  
##  1003   :  1                Median :48.00   Median : 4.000  
##  1004   :  1                Mean   :48.27   Mean   : 5.955  
##  1005   :  1                3rd Qu.:70.00   3rd Qu.: 8.000  
##  1006   :  1                Max.   :99.00   Max.   :35.000  
##  (Other):308                                                
##          prog    
##  General   : 40  
##  Academic  :167  
##  Vocational:107  
##                  
##                  
##                  
## 

The outcome of interest is the number of days a student is absent in a school-year. We have three independent variables – the student’s gender, the student’s math score, and the program a student is in.

Let us start with a simple barplot so we can see the counts of the number of days students are absent.

tab.1 = table(dat$daysabs); tab.1
## 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 27 
## 57 41 28 27 25 20 16 14 10 13  5  7  7  6  4  3  7  1  2  3  2  2  2  1  2 
## 28 29 30 34 35 
##  2  1  2  2  2
barplot(tab.1, ylab="Frequency", xlab="No. of Days Absent", ylim=c(0,60), col="salmon")

Let us also compute the mean and the variance of absences

mean(dat$daysabs)
## [1] 5.955414
var(dat$daysabs)
## [1] 49.51877

Notice that the variance far exceeds the mean, so we have overdispersion. Note also that the most likely outcome is 0 days absent. The overdispersion implies that these data were most likely not generated by a Poisson process. It could be driven by a misspecified model or by heterogeneity – for example, what if male students had more absences than female students, or vice-versa? As a rule, then, we must be sure that we are fitting the correct model and that heterogeneity is being accounted for before we discard the Poisson distribution for another (the most common option being the Negative Binomial in the case of overdispersion, because that model does not assume that the mean equals the variance).
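
A quick way to gauge the degree of overdispersion is the variance-to-mean ratio; a Poisson process implies a value of 1.

var(dat$daysabs) / mean(dat$daysabs)   # roughly 8.3, far above 1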

Fitting a Linear Regression Model

lm.0 = lm(daysabs ~ math + prog, data=dat)
summary(lm.0)
## 
## Call:
## lm(formula = daysabs ~ math + prog, data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.747 -4.082 -1.652  1.705 27.735 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    12.60373    1.22469  10.291  < 2e-16 ***
## math           -0.04359    0.01505  -2.896  0.00405 ** 
## progAcademic   -3.81316    1.13845  -3.349  0.00091 ***
## progVocational -7.38494    1.21535  -6.076  3.6e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.464 on 310 degrees of freedom
## Multiple R-squared:  0.1642, Adjusted R-squared:  0.1561 
## F-statistic:  20.3 on 3 and 310 DF,  p-value: 4.894e-12

Note what is significant and what isn’t significant.

Fitting the Poisson Model to these Data

pois.1 = glm(daysabs ~ math + prog, data=dat, family=poisson(link="log")) 
summary(pois.1)
## 
## Call:
## glm(formula = daysabs ~ math + prog, family = poisson(link = "log"), 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2597  -2.2038  -0.9193   0.6511   7.4233  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.651974   0.060736  43.664  < 2e-16 ***
## math           -0.006808   0.000931  -7.313 2.62e-13 ***
## progAcademic   -0.439897   0.056672  -7.762 8.35e-15 ***
## progVocational -1.281364   0.077886 -16.452  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2217.7  on 313  degrees of freedom
## Residual deviance: 1774.0  on 310  degrees of freedom
## AIC: 2665.3
## 
## Number of Fisher Scoring iterations: 5
exp(coef(pois.1))
##    (Intercept)           math   progAcademic progVocational 
##     14.1820032      0.9932147      0.6441025      0.2776583
AIC(pois.1); BIC(pois.1)
## [1] 2665.285
## [1] 2680.283
library(pscl)
pR2(pois.1)
##           llh       llhNull            G2      McFadden          r2ML 
## -1328.6424931 -1550.5092295   443.7334727     0.1430928     0.7566279 
##          r2CU 
##     0.7566668
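
Before interpreting anything, we can apply the equidispersion rule of thumb noted earlier – the residual deviance divided by its degrees of freedom:

deviance(pois.1) / df.residual(pois.1)   # 1774.0 / 310, roughly 5.72: well above 1, so overdispersion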

The usual set of basic model fit statistics is available to us – the AIC, BIC, and the pseudo-\(R^2\). We could also interpret the exponentiated coefficients as incidence-rate ratios, but it would be far easier to generate predicted counts that could either be shown in a table for the more interesting covariate profiles and/or plotted.
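
For instance, a hedged sketch of reading the exponentiated coefficients above as percent changes in the expected count:

100 * (exp(coef(pois.1)) - 1)   # e.g., each additional math point lowers expected absences
                                # by about 0.7%, and Vocational students have roughly 72%
                                # fewer expected absences than General students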

Here are the basic predicted counts generated from in-sample values.

dat$phat = predict(pois.1, type="response")

How about plotting these?

dat = dat[with(dat, order(prog, math)), ]
library(ggplot2)
plot.p = ggplot(dat, aes(x = math, y = phat, colour = prog)) + geom_point(aes(y = daysabs), alpha=.5, position=position_jitter(h=.2)) + geom_line(size = 1) + labs(x = "Math Score", y = "Expected number of Days Absent")
plot.p

These are the predicted numbers of days a student is absent (\(\hat{y}\)). On their own they are not very useful because they are not tied to specific covariate profiles. Let us generate a specific profile and see how the predicted counts differ from the actual counts for the profiled students.

newdata.2 = data.frame(
  math = rep(seq(from = min(dat$math), to = max(dat$math), length.out = 100), 3),
  prog = factor(rep(1:3, each = 100), levels = 1:3, labels = levels(dat$prog)))

newdata.2 = cbind(newdata.2, predict(pois.1, newdata.2, type = "link", se.fit=TRUE))
newdata.2 = within(newdata.2, {
  DaysAbsent.p = exp(fit)            # predicted counts: exponentiate the linear predictor
  LL.p = exp(fit - 1.96 * se.fit)    # 95% CI built on the link scale, then exponentiated
  UL.p = exp(fit + 1.96 * se.fit)
})

ggplot(newdata.2, aes(math, DaysAbsent.p)) +
  geom_ribbon(aes(ymin = LL.p, ymax = UL.p, fill = prog), alpha = .25) +
  geom_line(aes(colour = prog), size = 2) +
  labs(x = "Math Score", y = "Predicted Days Absent")

We could also use visreg

library(visreg)
visreg(pois.1, "math", by="prog", scale="response", overlay=TRUE)

The Negative Binomial Model

The Poisson model rarely fits in practice because there is almost always overdispersion. If we proceed with the Poisson anyway we will have consistent estimates, but the standard errors will be incorrect, with the usual consequences for inference. The Negative Binomial model works around the restrictive Poisson assumption that the mean equals the variance by allowing the variance of the outcome to exceed the mean.

  • The expected value of the outcome (i.e., the mean) is the same for both Poisson and for the Negative Binomial … \(\lambda\)
  • The variance in Poisson is \(\lambda\) but under the Negative Binomial it is \(\lambda + \alpha\left(\lambda^2\right)\)
  • If \(\alpha = 0\) you are back to the Poisson
  • One usually tests for this under \(H_0: \alpha=0\); \(H_A: \alpha \neq 0\)
  • Note that in R you will see \(\theta\) reported, and \(\theta = \dfrac{1}{\alpha}\).
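
A minimal simulation sketch of the Negative Binomial variance function (the \(\lambda\) and \(\alpha\) values below are illustrative, not estimates):

set.seed(42)
lambda = 5; alpha = 0.5
y.nb = rnbinom(100000, mu = lambda, size = 1/alpha)  # size is theta = 1/alpha
mean(y.nb)   # close to lambda = 5
var(y.nb)    # close to lambda + alpha * lambda^2 = 17.5
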
library(MASS)
nb.1 = glm.nb(daysabs ~ math + prog, data=dat)
summary(nb.1)
## 
## Call:
## glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1547  -1.0192  -0.3694   0.2285   2.5273  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.615265   0.197460  13.245  < 2e-16 ***
## math           -0.005993   0.002505  -2.392   0.0167 *  
## progAcademic   -0.440760   0.182610  -2.414   0.0158 *  
## progVocational -1.278651   0.200720  -6.370 1.89e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)
## 
##     Null deviance: 427.54  on 313  degrees of freedom
## Residual deviance: 358.52  on 310  degrees of freedom
## AIC: 1741.3
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.033 
##           Std. Err.:  0.106 
## 
##  2 x log-likelihood:  -1731.258
library(pscl)
pR2(nb.1)
##           llh       llhNull            G2      McFadden          r2ML 
## -865.62889583 -896.47237109   61.68695051    0.03440538    0.17836191 
##          r2CU 
##    0.17895472
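
Given the \(\theta\) reported above, we can recover \(\alpha\) directly:

1 / nb.1$theta   # alpha is roughly 0.97, clearly far from 0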

The formal test for overdispersion, with the NULL of \(\alpha=0\) (i.e., that the Poisson model fits these data with \(\alpha\) constrained to be \(0\)) can be run with the odTest command in the pscl library.

library(pscl)
odTest(nb.1) 
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
## 
## Critical value of test statistic at the alpha= 0.05 level: 2.7055 
## Chi-Square Test Statistic =  926.0272 p-value = < 2.2e-16
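
The test statistic is simply the likelihood-ratio statistic comparing the two fitted models, which we can verify by hand:

as.numeric(2 * (logLik(nb.1) - logLik(pois.1)))   # roughly 926.03, matching the odTest statistic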

Note that the NULL of no overdispersion is soundly rejected. We should also compare the AIC() and BIC() of each model.

AIC(nb.1)
## [1] 1741.258
BIC(nb.1)
## [1] 1760.005

Plotting Predicted Values for the Negative Binomial

dat$nbhat = predict(nb.1, type="response")

dat = dat[with(dat, order(prog, math)), ]

plot.nb = ggplot(dat, aes(x = math, y = nbhat, colour = prog)) + geom_point(aes(y = daysabs), alpha=.5, position=position_jitter(h=.2)) + geom_line(size = 1) + labs(x = "Math Score", y = "Expected number of Days Absent")
plot.nb

And now using newdata.2 …

newdata.2 = data.frame(
  math = rep(seq(from = min(dat$math), to = max(dat$math), length.out = 100), 3),
  prog = factor(rep(1:3, each = 100), levels = 1:3, labels =
  levels(dat$prog)))

newdata.2 = cbind(newdata.2, predict(nb.1, newdata.2, type = "link", se.fit=TRUE))
newdata.2 = within(newdata.2, {
  DaysAbsent.nb = exp(fit)
  LL.nb = exp(fit - 1.96 * se.fit)
  UL.nb = exp(fit + 1.96 * se.fit)
})

ggplot(newdata.2, aes(math, DaysAbsent.nb)) +
  geom_ribbon(aes(ymin = LL.nb, ymax = UL.nb, fill = prog), alpha = .25) + geom_line(aes(colour = prog), size = 2) + labs(x = "Math Score", y = "Predicted Days Absent")

We could also use visreg

library(visreg)
visreg(nb.1, "math", by="prog", scale="response", overlay=TRUE)

library(stargazer)
stargazer(lm.0, pois.1, nb.1, type="html", star.cutoffs=c(0.05, 0.01, 0.001))
========================================================================
                                Dependent variable: daysabs
                     ---------------------------------------------------
                          OLS            Poisson       negative binomial
                          (1)              (2)                (3)
------------------------------------------------------------------------
math                    -0.044**        -0.007***          -0.006*
                        (0.015)         (0.001)            (0.003)
progAcademic            -3.813***       -0.440***          -0.441*
                        (1.138)         (0.057)            (0.183)
progVocational          -7.385***       -1.281***          -1.279***
                        (1.215)         (0.078)            (0.201)
Constant                12.604***        2.652***           2.615***
                        (1.225)         (0.061)            (0.197)
------------------------------------------------------------------------
Observations              314              314                314
R2                       0.164
Adjusted R2              0.156
Log Likelihood                        -1,328.642          -866.629
theta                                                   1.033*** (0.106)
Akaike Inf. Crit.                      2,665.285          1,741.258
Residual Std. Error   6.464 (df = 310)
F Statistic          20.300*** (df = 3; 310)
------------------------------------------------------------------------
Note: *p<0.05; **p<0.01; ***p<0.001

Another Example: Article Counts

In a study of scientific productivity, Long (1990) considered factors affecting the number of papers published during graduate school by a sample of 915 biochemists. The average number of articles was 1.7 with a variance of 3.7, which indicates overdispersion in the distribution of articles. These data (loaded below) include the following measures:

  • art … Articles in last 3 yrs of PhD
  • fem … 1=female, 0=male (read in as a Men/Women factor)
  • mar … Married: 1=yes, 0=no (read in as a Single/Married factor)
  • kid5 … Number of children < 6
  • phd … Prestige of the PhD program
  • ment … Articles by mentor in last 3 yrs
library(foreign)
couart2 = read.dta("http://www.ats.ucla.edu/stat/stata/examples/long/couart2.dta")
summary(couart2)
##       art            fem           mar           kid5       
##  Min.   : 0.000   Men  :494   Single :309   Min.   :0.0000  
##  1st Qu.: 0.000   Women:421   Married:606   1st Qu.:0.0000  
##  Median : 1.000                             Median :0.0000  
##  Mean   : 1.693                             Mean   :0.4951  
##  3rd Qu.: 2.000                             3rd Qu.:1.0000  
##  Max.   :19.000                             Max.   :3.0000  
##       phd             ment       
##  Min.   :0.755   Min.   : 0.000  
##  1st Qu.:2.260   1st Qu.: 3.000  
##  Median :3.150   Median : 6.000  
##  Mean   :3.103   Mean   : 8.767  
##  3rd Qu.:3.920   3rd Qu.:12.000  
##  Max.   :4.620   Max.   :77.000

If we plot the number of articles published, and we calculate the mean and the variance, we see the following:

mean(couart2$art)
## [1] 1.692896
var(couart2$art)
## [1] 3.709742
barplot(table(couart2$art), ylab="Frequency", xlab="No. of Articles Published", ylim=c(0,300), col="salmon")

Fitting the Inappropriate Linear Regression Model

lm.c = lm(art ~ fem + mar + kid5 + phd + ment, data=couart2)
summary(lm.c)
## 
## Call:
## lm(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0209 -1.2358 -0.4125  0.7517 14.8409 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.334249   0.240911   5.538    4e-08 ***
## femWomen    -0.380885   0.128139  -2.972  0.00303 ** 
## marMarried   0.263199   0.145423   1.810  0.07064 .  
## kid5        -0.291442   0.090919  -3.206  0.00140 ** 
## phd         -0.011365   0.063666  -0.179  0.85836    
## ment         0.061495   0.006605   9.310  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.822 on 909 degrees of freedom
## Multiple R-squared:  0.1104, Adjusted R-squared:  0.1055 
## F-statistic: 22.55 on 5 and 909 DF,  p-value: < 2.2e-16

Make a quick note of what is significant and the signs on the significant estimates. We know this is the wrong model so let us move on.

Fitting the Poisson Model

Let us fit the following Poisson model:

\[\log\left(E(art)\right) = \beta_0 + \beta_1 fem + \beta_2 mar + \beta_3 kid5 + \beta_4 phd + \beta_5 ment\]

pois.2 = glm(art ~ fem + mar + kid5 + phd + ment, data=couart2, family=poisson(link="log"))
summary(pois.2)
## 
## Call:
## glm(formula = art ~ fem + mar + kid5 + phd + ment, family = poisson(link = "log"), 
##     data = couart2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5672  -1.5398  -0.3660   0.5722   5.4467  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.304617   0.102981   2.958   0.0031 ** 
## femWomen    -0.224594   0.054613  -4.112 3.92e-05 ***
## marMarried   0.155243   0.061374   2.529   0.0114 *  
## kid5        -0.184883   0.040127  -4.607 4.08e-06 ***
## phd          0.012823   0.026397   0.486   0.6271    
## ment         0.025543   0.002006  12.733  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1817.4  on 914  degrees of freedom
## Residual deviance: 1634.4  on 909  degrees of freedom
## AIC: 3314.1
## 
## Number of Fisher Scoring iterations: 5
exp(coef(pois.2))
## (Intercept)    femWomen  marMarried        kid5         phd        ment 
##   1.3561053   0.7988403   1.1679422   0.8312018   1.0129051   1.0258718
AIC(pois.2); BIC(pois.2)
## [1] 3314.113
## [1] 3343.026
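
The same quick dispersion check as before suggests milder, but still real, overdispersion:

deviance(pois.2) / df.residual(pois.2)   # 1634.4 / 909, roughly 1.80: above 1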

Here are the basic predicted counts generated from in-sample values.

couart2$phat.pc = predict(pois.2, type="response")

These are the predicted numbers of articles (\(\hat{y}\)). On their own they are not very useful because they are not tied to specific covariate profiles. Let us generate a specific profile and see how the predicted counts differ from the actual counts for the profiled individuals.

newdata.a = data.frame(fem = "Women", mar = "Married", kid5 = 0,
                       phd = 3.103,                # phd held at its sample mean
                       ment = seq(0, 77, by = 1))  # full observed range of ment
newdata.a$phat.pc = predict(pois.2, newdata=newdata.a, type="response")

Now let us plot how \(\hat{y}\) changes with ment and, as a visual check of model fit, superimpose the actual counts \((y)\) on the same plot.

plot(couart2$ment, couart2$art, ylim=c(0,12), pch=16, col="salmon",
     xlab="Articles by Mentor in Last 3 Years",
     ylab="Actual and Predicted Counts of Articles")
lines(newdata.a$ment, newdata.a$phat.pc, col="cornflowerblue", lwd=2)  # Poisson predicted counts

The plot shows that predicted counts overestimate the actual counts at the high end of ment and underestimate the actual counts at the low end of ment.

Remember that overdispersion can arise for many reasons, but heterogeneity is one of the usual suspects whenever we see it. Consequently, we can try to "fix" the standard errors by relying on the Huber-White sandwich estimator (very similar to White's heteroscedasticity-corrected standard errors). Let us do this now:

library(lmtest)
library(sandwich)
coeftest(pois.2, vcov = sandwich)
## 
## z test of coefficients:
## 
##               Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)  0.3046168  0.1465195  2.0790 0.0376156 *  
## femWomen    -0.2245942  0.0716622 -3.1341 0.0017240 ** 
## marMarried   0.1552434  0.0819292  1.8948 0.0581125 .  
## kid5        -0.1848827  0.0559633 -3.3036 0.0009544 ***
## phd          0.0128226  0.0419641  0.3056 0.7599392    
## ment         0.0255427  0.0038178  6.6905 2.224e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compare these estimated standard errors and p-values to those of the base Poisson model. Note what has changed.
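
A small sketch that lines the two sets of standard errors up side by side makes the change explicit:

se.model = sqrt(diag(vcov(pois.2)))       # model-based (Poisson) standard errors
se.robust = sqrt(diag(sandwich(pois.2)))  # Huber-White sandwich standard errors
round(cbind(se.model, se.robust, ratio = se.robust / se.model), 4)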

The Negative Binomial Model for couart2

library(MASS)
nb.2 = glm.nb(art ~ fem + mar + kid5 + phd + ment, data=couart2)
summary(nb.2)
## 
## Call:
## glm.nb(formula = art ~ fem + mar + kid5 + phd + ment, data = couart2, 
##     init.theta = 2.264387695, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1678  -1.3617  -0.2806   0.4476   3.4524  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.256144   0.137348   1.865 0.062191 .  
## femWomen    -0.216418   0.072636  -2.979 0.002887 ** 
## marMarried   0.150489   0.082097   1.833 0.066791 .  
## kid5        -0.176415   0.052813  -3.340 0.000837 ***
## phd          0.015271   0.035873   0.426 0.670326    
## ment         0.029082   0.003214   9.048  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.2644) family taken to be 1)
## 
##     Null deviance: 1109.0  on 914  degrees of freedom
## Residual deviance: 1004.3  on 909  degrees of freedom
## AIC: 3135.9
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.264 
##           Std. Err.:  0.271 
## 
##  2 x log-likelihood:  -3121.917

The formal test for overdispersion, with the NULL of \(\alpha=0\) (i.e., that the Poisson model fits these data with \(\alpha\) constrained to be \(0\)) can be run with the odTest command in the pscl library.

library(pscl)
odTest(nb.2) 
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
## 
## Critical value of test statistic at the alpha= 0.05 level: 2.7055 
## Chi-Square Test Statistic =  180.196 p-value = < 2.2e-16

Note that the NULL of no overdispersion is soundly rejected. We should also consider the AIC() and BIC().

AIC(nb.2)
## [1] 3135.917
BIC(nb.2)
## [1] 3169.649
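
Since AIC() accepts multiple models, a side-by-side comparison is a one-liner:

AIC(pois.2, nb.2)   # the Negative Binomial has the lower (better) AIC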

Generating Predicted Counts after Negative Binomial

Let us see how the predicted counts from the Negative Binomial stack up against those from the Poisson.

newdata.a$phat.nbc = predict(nb.2, newdata=newdata.a, type="response")
plot(couart2$ment, couart2$art, ylim=c(0,12), pch=16, col="salmon",
     xlab="Articles by Mentor in Last 3 Years",
     ylab="Actual and Predicted Counts of Articles")
lines(newdata.a$ment, newdata.a$phat.pc, col="cornflowerblue", lwd=2)  # Poisson
lines(newdata.a$ment, newdata.a$phat.nbc, col="darkgreen", lwd=2)      # Negative Binomial
text(60, 9, "Negative Binomial", cex=0.75)
text(70, 5.5, "Poisson", cex=0.75)

We could also use visreg

library(visreg)
visreg(pois.2, "ment", by="fem", scale="response", overlay=TRUE)

visreg(nb.2, "ment", by="fem", scale="response", overlay=TRUE)

And now putting the estimates in a table …

library(stargazer)
stargazer(lm.c, pois.2, nb.2, type="html", star.cutoffs=c(0.05, 0.01, 0.001))
========================================================================
                                Dependent variable: art
                     ---------------------------------------------------
                          OLS            Poisson       negative binomial
                          (1)              (2)                (3)
------------------------------------------------------------------------
femWomen                -0.381**        -0.225***          -0.216**
                        (0.128)         (0.055)            (0.073)
marMarried               0.263           0.155*             0.150
                        (0.145)         (0.061)            (0.082)
kid5                    -0.291**        -0.185***          -0.176***
                        (0.091)         (0.040)            (0.053)
phd                     -0.011           0.013              0.015
                        (0.064)         (0.026)            (0.036)
ment                     0.061***        0.026***           0.029***
                        (0.007)         (0.002)            (0.003)
Constant                 1.334***        0.305**            0.256
                        (0.241)         (0.103)            (0.137)
------------------------------------------------------------------------
Observations              915              915                915
R2                       0.110
Adjusted R2              0.105
Log Likelihood                        -1,651.056         -1,561.958
theta                                                   2.264*** (0.271)
Akaike Inf. Crit.                      3,314.113          3,135.917
Residual Std. Error   1.822 (df = 909)
F Statistic          22.553*** (df = 5; 909)
------------------------------------------------------------------------
Note: *p<0.05; **p<0.01; ***p<0.001

Practice Data-sets

Data (a)

We have data on 250 groups that went to a park. Each group was questioned before leaving the park about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper). The outcome variable of interest is the number of fish caught. Even though everyone was asked how many fish they caught, not everyone went fishing. What would be the reason for someone to report a zero count? Was it because this person was unlucky and didn't catch any fish, or because this person didn't go fishing at all? If a person didn't go fishing, the outcome would always be zero; if a person went fishing, the count could be zero or non-zero. So there seem to be two processes that could generate zero counts: being unlucky while fishing, or not fishing at all. For now we will ignore these more interesting questions and focus on fitting Poisson and Negative Binomial models to these data.

fish = read.dta("http://www.stata-press.com/data/r10/fish.dta")
names(fish); summary(fish)
## [1] "nofish"   "livebait" "camper"   "persons"  "child"    "xb"      
## [7] "zg"       "count"
##      nofish         livebait         camper         persons     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :1.000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:2.000  
##  Median :0.000   Median :1.000   Median :1.000   Median :2.000  
##  Mean   :0.296   Mean   :0.864   Mean   :0.588   Mean   :2.528  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:4.000  
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :4.000  
##      child             xb                  zg              count        
##  Min.   :0.000   Min.   :-3.275050   Min.   :-5.6259   Min.   :  0.000  
##  1st Qu.:0.000   1st Qu.: 0.008267   1st Qu.:-1.2527   1st Qu.:  0.000  
##  Median :0.000   Median : 0.954550   Median : 0.6051   Median :  0.000  
##  Mean   :0.684   Mean   : 0.973796   Mean   : 0.2523   Mean   :  3.296  
##  3rd Qu.:1.000   3rd Qu.: 1.963855   3rd Qu.: 1.9932   3rd Qu.:  2.000  
##  Max.   :3.000   Max.   : 5.352674   Max.   : 4.2632   Max.   :149.000
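
As a hedged starting point (the exact specification is yours to choose), one might begin with something like:

pois.fish = glm(count ~ persons + child + camper, data = fish,
                family = poisson(link = "log"))
nb.fish = MASS::glm.nb(count ~ persons + child + camper, data = fish)
AIC(pois.fish, nb.fish)   # compare the two fits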

Data (b)

Cross-section data on the number of recreational boating trips to Lake Somerville, Texas, in 1980, based on a survey administered to 2,000 registered leisure boat owners in 23 counties in eastern Texas.

  • trips – Number of recreational boating trips.
  • quality – Facility’s subjective quality ranking on a scale of 1 to 5.
  • ski – factor. Was the individual engaged in water-skiing at the lake?
  • income – Annual household income of the respondent (in 1,000 USD).
  • userfee – factor. Did the individual pay an annual user fee at Lake Somerville?
  • costC – Expenditure when visiting Lake Conroe (in USD).
  • costS – Expenditure when visiting Lake Somerville (in USD).
  • costH – Expenditure when visiting Lake Houston (in USD).
library(AER)
data(RecreationDemand)
names(RecreationDemand)
## [1] "trips"   "quality" "ski"     "income"  "userfee" "costC"   "costS"  
## [8] "costH"
summary(RecreationDemand)
##      trips           quality       ski          income      userfee  
##  Min.   : 0.000   Min.   :0.000   no :417   Min.   :1.000   no :646  
##  1st Qu.: 0.000   1st Qu.:0.000   yes:242   1st Qu.:3.000   yes: 13  
##  Median : 0.000   Median :0.000             Median :3.000            
##  Mean   : 2.244   Mean   :1.419             Mean   :3.853            
##  3rd Qu.: 2.000   3rd Qu.:3.000             3rd Qu.:5.000            
##  Max.   :88.000   Max.   :5.000             Max.   :9.000            
##      costC            costS             costH       
##  Min.   :  4.34   Min.   :  4.767   Min.   :  5.70  
##  1st Qu.: 28.24   1st Qu.: 33.312   1st Qu.: 28.96  
##  Median : 41.19   Median : 47.000   Median : 42.38  
##  Mean   : 55.42   Mean   : 59.928   Mean   : 55.99  
##  3rd Qu.: 69.67   3rd Qu.: 72.573   3rd Qu.: 68.56  
##  Max.   :493.77   Max.   :491.547   Max.   :491.05
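
Similarly, a possible first pass at the boating-trips data (illustrative; the variable selection is part of the exercise):

pois.rec = glm(trips ~ quality + ski + income + userfee + costC + costS + costH,
               data = RecreationDemand, family = poisson(link = "log"))
summary(pois.rec)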