Linear Probability Models

Ever since regression analysis could be conducted with widely available software packages that used drop-down menus, folks began running regression models on everything under the sun. The fact that the dependent variable was a categorical variable that could only assume one of two values was not a deterrent. As a result, you saw the usual regression model being fit to dependent variables such as

  • whether an individual had registered to vote (\(y=1\)) or not (\(y=0\))
  • whether a patient undergoing some experimental treatment for a terminal illness improved (\(y=1\)) or not (\(y=0\))
  • whether an individual who was provided job training as part of a welfare program secured employment (\(y=1\)) or not (\(y=0\))
  • whether a prisoner released on parole and placed in a program designed to reduce the chance of recidivating did indeed commit another crime (\(y=1\)) or not (\(y=0\))
  • whether a voter voted for candidate A or for candidate B

The list of possible dependent variables is endless.

In principle, of course, the basic logic of regression analysis still holds. That is, we could line up whatever independent variables \((x_1, x_2, x_3, \cdots, x_k)\) we believe explain the outcome \((y)\) and then test our beliefs by fitting an appropriate model. For example, the results would allow us to figure out if the probability of seeing a particular outcome increases/decreases as \(x_1\) increases or if it has no impact on the outcome. We could do this for all independent variables. We could model interactions and see if these help improve the model. We could generate predicted values of the outcomes and plot these against each independent variable. In brief, then, we could do everything with categorical dependent variables that we have done with the continuous dependent variable we have used in our regression models thus far. So what is the big deal? Let us see.

Fitting the Linear Probability Model (LPM)

We’ll use a particular data set to see the trouble with the LPM. Spector and Mazzeo examined the effect of a teaching method known as PSI on the performance of students in an intermediate macroeconomics course. The question was whether students exposed to the method scored higher on exams in the class. They collected data from students in two classes, one in which PSI was used and another in which a traditional teaching method was employed. For each of 32 students, they gathered data on:

  • gpa — Grade point average before taking the class. Observed values range from a low of 2.06 to a high of 4.0 with mean 3.12.
  • tuce — the score on an exam given at the beginning of the term to test entering knowledge of the material. In the sample, TUCE ranges from a low of 12 to a high of 29 with a mean of 21.94.
  • psi — a dummy variable indicating the teaching method used (1 = used Psi, 0 = other method). 14 of the 32 sample members (43.75%) are in PSI.
  • grade — coded 1 if the final grade was an A, 0 if the final grade was a B or C. 11 sample members (34.38%) got As and are coded 1.

grade was the dependent variable, and of particular interest was whether psi had a significant effect on grade, with tuce and gpa included as control variables. The model then is:

\[grade = \beta_0 + \beta_1(gpa) + \beta_2(tuce) + \beta_3(psi) + e\]

library(foreign)
psi = read.dta("http://www3.nd.edu/~rwilliam/statafiles/logist.dta")
psi$psi = factor(psi$psi, labels=c("Not in PSI", "In PSI"))
summary(psi)
##      grade             gpa             tuce               psi    
##  Min.   :0.0000   Min.   :2.060   Min.   :12.00   Not in PSI:18  
##  1st Qu.:0.0000   1st Qu.:2.812   1st Qu.:19.75   In PSI    :14  
##  Median :0.0000   Median :3.065   Median :22.50                  
##  Mean   :0.3438   Mean   :3.117   Mean   :21.94                  
##  3rd Qu.:1.0000   3rd Qu.:3.515   3rd Qu.:25.00                  
##  Max.   :1.0000   Max.   :4.000   Max.   :29.00
lpm = lm(grade ~ gpa + tuce + psi, data = psi) 
summary(lpm)
## 
## Call:
## lm(formula = grade ~ gpa + tuce + psi, data = psi)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78153 -0.27731  0.00531  0.21089  0.81145 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.49802    0.52389  -2.859  0.00793 **
## gpa          0.46385    0.16196   2.864  0.00784 **
## tuce         0.01050    0.01948   0.539  0.59436   
## psiIn PSI    0.37855    0.13917   2.720  0.01109 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3881 on 28 degrees of freedom
## Multiple R-squared:  0.4159, Adjusted R-squared:  0.3533 
## F-statistic: 6.646 on 3 and 28 DF,  p-value: 0.001571

Note that grade is coded such that \(grade=1\) if the student got an A and \(grade = 0\) if the student got any other grade. So you can think of the regression as modeling the probability of getting an A (i.e., \(y=1\)).

Now interpret the regression results …

Let us generate the predicted values, the residuals, and let us also run a residual plot:

psi$predicted = fitted(lpm)
psi$residuals = resid(lpm)
library(car)
residualPlots(lpm)

##            Test stat Pr(>|t|)
## gpa            1.441    0.161
## tuce           0.445    0.660
## psi               NA       NA
## Tukey test     0.408    0.683

Go ahead and click on the data-frame to open it up in the viewer. Notice anything strange about the predicted values and the residuals? What pattern do you see? You are also seeing this in residualPlots(); focus on the plot against fitted values. The residuals clearly show a pattern, and with a binary dependent variable they always will. Consequently we end up with heteroscedasticity in our model, with the usual consequences: our estimates are unbiased BUT the standard errors are off, and so the significance tests and confidence intervals will be unreliable.
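To see why the pattern is unavoidable, note that with a binary \(y\) every residual must equal either \(-\hat{y}\) (when \(y=0\)) or \(1-\hat{y}\) (when \(y=1\)). A quick sketch using the columns we just added:

# Each residual is y - yhat with y either 0 or 1, so the residual-vs-fitted
# plot always collapses onto two parallel lines:
plot(psi$predicted, psi$residuals, xlab = "Fitted values", ylab = "Residuals")
abline(a = 0, b = -1, lty = 2)   # traced by the y = 0 observations
abline(a = 1, b = -1, lty = 2)   # traced by the y = 1 observations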

We also realize that in cases with a binary dependent variable that is tapping some latent (i.e., underlying but unmeasured) probability, the relationship between \(x\) and \(y\) is unlikely to be linear. Let us see this with a simple plot:

library(ggplot2)
ggplot(psi, aes(gpa, grade)) + geom_point() + geom_smooth(method="lm")

The regression line says your probability of getting an A increases by a set amount for every unit increase in gpa. But surely once my gpa reaches a certain level where I am practically guaranteed to get an A, any further increase in my gpa should have a negligible impact on my probability of getting an A?

The same thing should happen at the low end of gpa. If my gpa is the minimum, the probability of getting an A probably hovers around 0 until I cross some level of gpa, after which my chances of an A get closer to 50%, and after some more increases in gpa my chances of an A should be better than 50%. What this suggests is that the probability of getting an A most likely looks like the image below: the ubiquitous S-shaped curve.
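We can sketch such a curve with plogis(), R's logistic function; the intercept and slope below are hypothetical values chosen only to produce the shape over the observed range of gpa:

# plogis(z) = exp(z)/(1 + exp(z)); the coefficients are made up for illustration
curve(plogis(-13 + 4*x), from = 2, to = 4, xlab = "gpa", ylab = "Pr(grade = A)")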

This S-shaped curve most likely captures how the probability of getting an A changes as gpa increases. It stays flat at low values of gpa, then shoots up around the middle of the distribution of gpa, and finally levels off for the highest values of gpa. All in all, then, the linear probability model has some problems:

  1. The model will be plagued by heteroscedasticity (non-constant variance of the residuals) and as such significance tests become unreliable
  2. Predictions will be out of bounds (going below 0 and above 1)
  3. The true relationship between \(y\) and \(x\) might be non-linear but the linear regression model will not capture this possibility and as a result the LPM is using an incorrectly specified model (which, remember, also leads to heteroscedasticity).

Therefore, because the LPM is inappropriate for a binary dependent variable, we have to rely on a different family of models – the logit (or logistic) and probit models. I’ll start by explaining some key constructs – contingency tables, odds, and odds-ratios.

Contingency Tables

Contingency analysis comes into use when we need to test for the independence of two categorical variables. For example, did Male passengers on the Titanic have the same survival chances as Female passengers? Is lung cancer independent of smoking?

In order to do such tests we often rely upon odds and odds ratios. The odds of an event are calculated as the number of events divided by the number of non-events. For example, on average 51 boys are born in every 100 births, so the odds of any randomly chosen delivery being that of a boy are \(\dfrac{51}{49}=1.040816\).

Likewise the odds of a girl being born are \(\dfrac{49}{51}=0.9607843\)

Note: If \(p=\) probability of success; \(1-p\) = probability of failure, then the odds of success: \(O = \dfrac{p}{1-p}\). Since with a sample we estimate the odds we have \(\hat{O} = \dfrac{\hat{p}}{1-\hat{p}}\)
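A quick check of the boy/girl odds in R:

p = 0.51        # probability that a birth is a boy
p / (1 - p)     # odds of a boy: 51/49 = 1.040816
(1 - p) / p     # odds of a girl: 49/51 = 0.9607843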

Aspirin and Cancer

Does Aspirin reduce the chances of getting cancer?

Outcome     Aspirin   Placebo   Total
Cancer         1438      1427    2865
No Cancer     18496     18515   37011
Total         19934     19942   39876

What were the odds of (i) not getting cancer, and (ii) getting cancer for the Aspirin group versus the Placebo group? These were:

p.ncA = (18496/19934); p.ncA
## [1] 0.9278619
p.cA = (1438/19934); p.cA
## [1] 0.07213806
Odds.ncA = p.ncA/p.cA; Odds.ncA
## [1] 12.86231
p.ncP = (18515/19942); p.ncP
## [1] 0.9284425
p.cP = (1427/19942); p.cP
## [1] 0.07155752
Odds.ncP = p.ncP/p.cP; Odds.ncP
## [1] 12.97477

So the odds-ratio of not getting cancer for the Aspirin versus the Placebo group was:

OR.Aspirin = Odds.ncA/Odds.ncP; OR.Aspirin
## [1] 0.9913321

In plain words: The odds of not getting cancer were slightly less for the Aspirin group than for the Placebo group.

What is the range of values that the Odds and the Odds Ratios can assume?

Let \(p=\) the probability of success; \(1-p\) = probability of failure. Then, as previously defined, the odds of success are calculated as \(O = \dfrac{p}{1-p}\) and for two groups the odds ratio of success is \(OR=\dfrac{O_1}{O_2}\).

Table 1 (see below) maps out the minimum probability of success (\(=0\)) and the maximum probability of success (\(=1\)) for a single group. Note \(p=0\) means success never occurs and \(p=1\) means success always occurs, for this particular group.

Expressing the probability as odds yields a corresponding range of values that are bounded below at \(0\) and unbounded above, approaching \(\infty\) as \(p \to 1\) … these are the limits of the odds.

\(p\)    \(1-p\)   \(odds\)
0.00     1.00      0.00
0.10     0.90      0.11
0.20     0.80      0.25
0.30     0.70      0.43
0.40     0.60      0.67
0.50     0.50      1.00
0.60     0.40      1.50
0.70     0.30      2.33
0.80     0.20      4.00
0.90     0.10      9.00
1.00     0.00      \(\infty\)

The implication is that the odds follow an asymmetric distribution – bounded below at \(0\) but unbounded above – and hence require a transformation before we can attach an estimate of uncertainty to them. The log-transformation achieves this by rendering the distribution of the odds symmetric.

Logit Models

The natural logarithm of the odds is called the logit, or the logit transformation of the probability of success \((p)\). Because of the properties of odds given above, the logit has these properties (demonstrated with qlogis(), R's logit function, after the list):

  • If \(odds(success) = 1\), then \(logit(p) = 0\)
  • If \(odds(success) < 1\), then \(logit(p) < 0\)
  • If \(odds(success) > 1\), then \(logit(p) > 0\)
  • The logit transform fails if \(p = 0\) or \(p = 1\)
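A few values of qlogis() confirm these properties:

qlogis(0.50)   # 0: the odds are exactly 1
qlogis(0.25)   # -1.0986: odds below 1 give a negative logit
qlogis(0.75)   #  1.0986: odds above 1 give a positive logit
qlogis(0)      # -Inf: the transform breaks down at p = 0 (and at p = 1)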

Mathematically, if \(y\) is the probability of success for a given value of \(x\), the logistic function gives \(y\) as follows:

\[y = \dfrac{e^{\beta_0 + \beta_1(x)}}{1 + e^{\beta_0 + \beta_1(x)}}\]

Solving the above for the odds, \(\dfrac{y}{1-y} = e^{\beta_0 + \beta_1(x)}\), and then taking the natural logarithm gives: \[logit(y) = \ln\left(\dfrac{y}{1-y}\right) = \beta_0 + \beta_1(x)\]

As such, a logit (or logistic) regression is a linear regression on the logit transform of \(y\), where \(y\) is the probability of success for a given value of \(x\). So how do we fit a logit model? Let us do this first for the Aspirin data we started with.

aspirin = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter09/chap09e2AspirinCancer.csv"))

glm.c = glm(cancer ~ aspirinTreatment, data=aspirin, family=binomial(link=logit))
summary(glm.c)
## 
## Call:
## glm(formula = cancer ~ aspirinTreatment, family = binomial(link = logit), 
##     data = aspirin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2966   0.3854   0.3854   0.3870   0.3870  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             2.554301   0.027377  93.303   <2e-16 ***
## aspirinTreatmentPlacebo 0.008706   0.038785   0.224    0.822    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20607  on 39875  degrees of freedom
## Residual deviance: 20607  on 39874  degrees of freedom
## AIC: 20611
## 
## Number of Fisher Scoring iterations: 5
exp(glm.c$coefficients)
##             (Intercept) aspirinTreatmentPlacebo 
##               12.862309                1.008744

The exponentiated coefficient recovers the odds-ratio from the contingency table itself (as it should); note that 1.008744 is the Placebo-versus-Aspirin ratio, i.e., the reciprocal of OR.Aspirin computed above \((1/0.9913321 = 1.008744)\).
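A quick check tying the model back to the hand calculations:

Odds.ncP / Odds.ncA   # 1.008744: Placebo-vs-Aspirin odds of no cancer
1 / OR.Aspirin        # the same quantity via the odds-ratio computed earlier
exp(coef(glm.c))[2]   # matches the exponentiated model coefficient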

Now on to the psi data

glm.0 = glm(grade ~ gpa + tuce + psi, data=psi, family=binomial(link="logit"))
summary(glm.0)
## 
## Call:
## glm(formula = grade ~ gpa + tuce + psi, family = binomial(link = "logit"), 
##     data = psi)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9551  -0.6453  -0.2570   0.5888   2.0966  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -13.02135    4.93127  -2.641  0.00828 **
## gpa           2.82611    1.26293   2.238  0.02524 * 
## tuce          0.09516    0.14155   0.672  0.50143   
## psiIn PSI     2.37869    1.06456   2.234  0.02545 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.183  on 31  degrees of freedom
## Residual deviance: 25.779  on 28  degrees of freedom
## AIC: 33.779
## 
## Number of Fisher Scoring iterations: 5
psi$predicted = fitted(glm.0)

Interpreting Logit Model Estimates

The estimates \(\hat{\beta_i}\) are in log-odds units and have to be interpreted with care. In particular,

  • Holding all else constant, as gpa increases by 1 the log-odds of getting an A \((i.e., y=1)\) increase by 2.82611
  • Holding all else constant, the log-odds of getting an A \((i.e., y=1)\) are higher by 2.37869 for those in PSI as compared to those not in PSI
  • Changes in tuce have no statistically significant influence on the log-odds of grade=A

Estimates in log-odds units are hard to make sense of so it is often preferred that we report the odds ratios. These can be recovered from the estimates very easily because the odds-ratio is just \(e^{\beta_i}\)

Variable   Estimate   Odds-Ratio
gpa        2.82611    16.87967
tuce       0.09516     1.099835
psi        2.37869    10.79076

Exponentiating estimates to get the odds-ratios is done as follows:

glm.0.OR = exp(coef(glm.0))
round(glm.0.OR, digits=5)
## (Intercept)         gpa        tuce   psiIn PSI 
##     0.00000    16.87972     1.09983    10.79073

Now we can say that holding all else constant, the odds of getting an A are 10.79 times as large for someone in PSI as compared to someone not in PSI. What about interpreting gpa? In general, interpreting the estimates of a logit model is tricky because most of us are not used to dealing with log-odds or odds-ratios. One solution is thus to use predicted probabilities to explain what the model is telling us.

We can do this in two ways, either by

  1. Calculating predicted probabilities for an individual with specific values of gpa, tuce, and psi and then changing some of these values to see how the predicted probability changes, or
  2. Graphically, via visreg

We will do both in sequence. First, let us see what happens when two students are both in psi and both have tuce = 19.97, but one has gpa=2.812 and the other has gpa=3.065. What odds-ratio does this difference in gpa imply?

\[\beta_1(gpa=3.065) - \beta_1(gpa=2.812) = \left(2.82611 \times 3.065\right) - \left(2.82611 \times 2.812\right) = 2.82611 \times 0.253\]

Exponentiating this difference in log-odds gives the odds-ratio:

\[e^{2.82611 \times \left(3.065-2.812\right)} = e^{2.82611 \times 0.253} \approx 2.04\]

exp(2.82611 * 0.253)   # ≈ 2.04: this gpa difference roughly doubles the odds of an A

newdata1 = data.frame(tuce=19.97, gpa=2.812, psi="In PSI")
predict(glm.0, newdata1, type="response")
##         1 
## 0.3110249
newdata2 = data.frame(tuce=19.97, gpa=3.065, psi="In PSI")
predict(glm.0, newdata2, type="response")
##         1 
## 0.4799294
newdata3 = data.frame(tuce=19.97, gpa=3.065, psi="Not in PSI")
predict(glm.0, newdata3, type="response")
##          1 
## 0.07878191

Note what I have done here. I created newdata to be a data-frame with specific values of the independent variables. The only difference between newdata1 and newdata2 is the value of gpa.

  • In newdata1 gpa is set to the first quartile value and in newdata2 it is set to the median value.
  • Similarly, newdata3 differs from newdata2 only in terms of the individual being in psi (see newdata2) and not being in psi (see newdata3).

The resulting values of predicted probabilities can then be used to explain the results.

This could be useful if the audience wanted to know the impact of specific variables, and how changes in these variables’ values influence the probability of \(y=1\).
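These predictions also let us verify the hand calculation from earlier: converting the two PSI students' predicted probabilities into odds and taking the ratio recovers \(e^{2.82611 \times 0.253}\):

odds.1 = 0.3110249 / (1 - 0.3110249)   # odds implied by newdata1 (gpa = 2.812)
odds.2 = 0.4799294 / (1 - 0.4799294)   # odds implied by newdata2 (gpa = 3.065)
odds.2 / odds.1                        # ≈ 2.04, i.e., exp(2.82611 * 0.253)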

Now let us do this en masse to see how the predicted probabilities change as gpa changes, comparing those in psi with those not in psi while holding tuce constant.

newdata4 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="Not in PSI")
yhat.4 = predict(glm.0, newdata4, type="response")
yhat.4
##           1           2           3           4           5           6 
## 0.005562042 0.007365167 0.009747107 0.012889378 0.017027236 0.022463235 
##           7           8           9          10          11          12 
## 0.029582466 0.038868267 0.050915907 0.066439704 0.086266357 0.111304187 
##          13          14          15          16          17          18 
## 0.142475849 0.180603283 0.226241947 0.279479898 0.339744807 0.405687926 
##          19          20 
## 0.475218059 0.545722471
plot(newdata4$gpa, yhat.4)

newdata5 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="In PSI")
yhat.5 = predict(glm.0, newdata5, type="response")
yhat.5
##          1          2          3          4          5          6 
## 0.05691891 0.07413001 0.09601554 0.12350052 0.15748258 0.19869539 
##          7          8          9         10         11         12 
## 0.24752487 0.30380461 0.36664563 0.43437502 0.50464694 0.57473574 
##         13         14         15         16         17         18 
## 0.64194382 0.70400068 0.75933433 0.80715746 0.84738750 0.88046778 
##         19         20 
## 0.90716315 0.92838154
plot(newdata5$gpa, yhat.5)

Let us compare the two predicted probabilities in a single plot:

# Combine the two sets of predictions with gpa and their difference
plot.data = data.frame(cbind(yhat.4, yhat.5, newdata4$gpa, (yhat.5 - yhat.4)))
colnames(plot.data) = c("Yhat4", "Yhat5", "gpa", "Yhat5-Yhat4")
plot(plot.data$gpa, yhat.4, type="b", col="cornflowerblue", xlab="gpa", ylab="Probability grade = A", ylim=c(0,1))
par(new=TRUE)  # overlay the second curve on the same axes
plot(plot.data$gpa, yhat.5, type="b", col="firebrick", yaxt="n", ylab="", xlab="", ylim=c(0,1))
abline(h=0.5, col="gray")  # reference line at probability = 0.5

  • If you look at this plot it is obvious which group’s probability of getting an A increases faster.
  • You also notice that for low gpa there is not much difference, but as gpa rises, the psi group crosses the horizontal line (at probability = 0.50) at a smaller gpa than does the non-psi group.
  • You also see the psi group’s probability starting to flatten once you hit a gpa of 3.5.
  • Clearly psi mattered, because otherwise you would not see such a gulf in the chance of getting an A for otherwise similar students (i.e., students with the same gpa, and with tuce held at a set value for all students).

visreg is helpful more globally, in capturing the effects of the independent variables in a more summary format.

library(visreg)
visreg(glm.0, "gpa", by="psi", rug=2) 

visreg(glm.0, "gpa", by="psi", rug=2, scale="response", overlay=TRUE)

visreg(glm.0, "gpa", by="psi", rug=2, scale="response")

Notice a couple of things about these plots as well:

  1. You see the rug being used here. Rugs on top are for \(y=1\) and rugs below are for \(y=0\).
  2. The first plot is useful for seeing how well the model fits. Too many observations away from the line will hint at a poor fit.
  3. The second plot is too noisy since the confidence intervals overlap, and so the third plot circumvents this issue by switching off the overlay option.
  4. What is clearly visible in the third plot is the fact that if you were in psi, your gpa translated into a higher probability of earning an A faster than it did for the non-psi group. One way of thinking about this is to draw an imaginary horizontal line from 0.5 on the vertical axis and then drop a perpendicular line down to gpa from where the horizontal line hits the curve. You will see that a smaller gpa is needed to hit this 50:50 point in the psi group than is true of the non-psi group.

Testing Goodness of Fit for Logit Models

There is no \(R^2\) for logit models so we have to settle on other measures to decide how well our model fits the actual data. A popular test is the Hosmer and Lemeshow goodness of fit test. In brief, this test compares the observed distribution of \(y = \{0,1\}\) to the distribution predicted by the model. The test is basically a \(\chi^2\) test that splits your data into groups, with the number of groups specified as the number of independent variables \(+ 1\). Within each group the test compares the number of observed versus expected zeroes and ones. If the observed and expected counts are close across the groups, we have a "good model". As such, the null hypothesis is of good model fit; the alternative is poor model fit.

library(ResourceSelection)
hl = hoslem.test(psi$grade, fitted(glm.0), g=4)
hl
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  psi$grade, fitted(glm.0)
## X-squared = 2.8318, df = 2, p-value = 0.2427

Many people also just set the number of groups to 10. Let us see what happens if we go this route:

hl = hoslem.test(psi$grade, fitted(glm.0), g=10)
hl
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  psi$grade, fitted(glm.0)
## X-squared = 6.0948, df = 8, p-value = 0.6366

Notice that for the glm.0 model we are unable to reject \(H_0\) of good model fit. This does not necessarily mean that we have a good model (because failing to reject \(H_0\) is not the same as "accepting" \(H_0\)). However, this is a popular test and so we should know how to carry it out and to interpret the results. We should also recognize that the Hosmer-Lemeshow test is most trustworthy in large samples (where \(n \geq 400\)); in fact the authors do not recommend its use when \(n < 400\).

McFadden’s \(R^2\) (or the pseudo-\(R^2\))

People trained in linear regression always want something like the \(R^2\) or \(adjusted-R^2\) to figure out how good the model really is. One of the popular choices floating around for some time now has been McFadden’s \(R^2\) (also known as the pseudo-\(R^2\)). This measure compares the log-likelihood from the full model to the log-likelihood from the NULL model (i.e., the model without any independent variables). The statistic is \[1 - \dfrac{LL_{full}}{LL_{NULL}}\]

The closer it is to \(1\), the better the model, but we cannot say these variables explain x% of the variation in y, so be careful; this is just a rough measure of model fit.
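The statistic can also be computed by hand from the two log-likelihoods. A minimal sketch, where glm.null (a name introduced here) is the intercept-only model:

glm.null = glm(grade ~ 1, data = psi, family = binomial(link = "logit"))
1 - as.numeric(logLik(glm.0)) / as.numeric(logLik(glm.null))   # ≈ 0.374

The pscl package reports this (along with other variants) via pR2():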

library(pscl)
pR2(glm.0) 
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -12.8896335 -20.5917297  15.4041925   0.3740383   0.3820706   0.5277965

Note that McFadden’s \(R^2\) is 0.3740383

Confidence Intervals of the Logit Estimates

confint(glm.0)
##                   2.5 %     97.5 %
## (Intercept) -25.1681675 -4.9023538
## gpa           0.6393669  5.7572537
## tuce         -0.1702054  0.4050443
## psiIn PSI     0.4786090  4.8102606
log.odds = cbind(log.odds = coef(glm.0), confint(glm.0)); log.odds
##                 log.odds       2.5 %     97.5 %
## (Intercept) -13.02134832 -25.1681675 -4.9023538
## gpa           2.82611297   0.6393669  5.7572537
## tuce          0.09515767  -0.1702054  0.4050443
## psiIn PSI     2.37868773   0.4786090  4.8102606
odds.ratios = exp(cbind(odds.ratios = coef(glm.0), confint(glm.0))); odds.ratios
##              odds.ratios        2.5 %       97.5 %
## (Intercept) 2.212587e-06 1.173826e-11 7.429076e-03
## gpa         1.687972e+01 1.895281e+00 3.164780e+02
## tuce        1.099832e+00 8.434915e-01 1.499369e+00
## psiIn PSI   1.079073e+01 1.613828e+00 1.227636e+02
round(log.odds, digits=4)
##             log.odds    2.5 %  97.5 %
## (Intercept) -13.0213 -25.1682 -4.9024
## gpa           2.8261   0.6394  5.7573
## tuce          0.0952  -0.1702  0.4050
## psiIn PSI     2.3787   0.4786  4.8103
round(odds.ratios, digits=4)
##             odds.ratios  2.5 %   97.5 %
## (Intercept)      0.0000 0.0000   0.0074
## gpa             16.8797 1.8953 316.4780
## tuce             1.0998 0.8435   1.4994
## psiIn PSI       10.7907 1.6138 122.7636

  • The first cbind() command gives you the estimates and the confidence intervals in terms of the log odds.
  • The second cbind() converts these estimates into the odds-ratios. Convention dictates that we present the odds-ratios because they can be more readily interpreted by the reader.
  • Notice the rounding to 4 decimal places. If we wanted, we could reduce these to two decimal places (but don’t go any lower).

Percent Correctly Predicted

One other popular statistic used to evaluate logit models is the percent correctly predicted – basically a simple count of the number of correctly predicted zeroes \(+\) the number of correctly predicted ones, divided by the total sample size. So if the model is perfect, the percent correctly predicted will be 100%.

Some folks are used to doing this manually by creating a new predicted value that equals 1 if the predicted probability is \(> 0.5\) and 0 if it is \(\leq 0.5\), and then building a \(2 \times 2\) table of actual versus predicted zeroes and ones. A minimal sketch of this manual route:
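pred01 = as.numeric(fitted(glm.0) > 0.5)          # classify at the 0.5 threshold
table(observed = psi$grade, predicted = pred01)   # the 2 x 2 table
mean(pred01 == psi$grade)                         # 0.8125, i.e., 81.25% correct

The hitmiss() command in the pscl library does these calculations for us: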

library(pscl)
hitmiss(glm.0)
## Classification Threshold = 0.5 
##        y=0 y=1
## yhat=0  18   3
## yhat=1   3   8
## Percent Correctly Predicted = 81.25%
## Percent Correctly Predicted = 85.71%, for y = 0
## Percent Correctly Predicted = 72.73%  for y = 1
## Null Model Correctly Predicts 65.62%
## [1] 81.25000 85.71429 72.72727

Some statisticians argue that the decision to use the 0.5 threshold to convert the predicted probabilities into 0 and 1 is problematic. They would in fact recommend the following:

probabilities = predict(glm.0, type = "response")
percent.correct = (1 - mean(abs(probabilities - psi$grade))) * 100
percent.correct
## [1] 74.2198

You will notice that you get a slightly smaller answer from this latter command. I would recommend using hitmiss() but recognizing that not everyone agrees with this approach.

Comparing Nested Models via the Likelihood Ratio Test

Recall that in the linear regression case we compared models with one or more additional independent variables against a model with these variables excluded. We did this via the \(F\)-test.

Logit models have a similar tool for comparing what are called nested models. Assume a model such as \(logit(y) = \beta_0 + \beta_1(x_1) + \beta_2(x_2)\) is the FULL model. Then, a model that excludes either \(x_1\) or \(x_2\) is said to be nested within the FULL model. Nested models such as these can be compared very easily.

summary(glm.0)
## 
## Call:
## glm(formula = grade ~ gpa + tuce + psi, family = binomial(link = "logit"), 
##     data = psi)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9551  -0.6453  -0.2570   0.5888   2.0966  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -13.02135    4.93127  -2.641  0.00828 **
## gpa           2.82611    1.26293   2.238  0.02524 * 
## tuce          0.09516    0.14155   0.672  0.50143   
## psiIn PSI     2.37869    1.06456   2.234  0.02545 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.183  on 31  degrees of freedom
## Residual deviance: 25.779  on 28  degrees of freedom
## AIC: 33.779
## 
## Number of Fisher Scoring iterations: 5

Notice that the summary() command gives you the deviance … this is similar (in principle but not in the mathematics) to the residual standard error from linear regression. A lower deviance is better than a higher deviance. There is a simple test that compares the deviance from the NULL model (i.e., with no independent variables) to the one with all the independent variables you specified. The test statistic is a \(\chi^2\) test:

with(glm.0, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 0.001501878

The p-value of the test is 0.0015, quite a bit less than \(\alpha=0.05\) so we can conclude that FULL model is an improvement upon the NULL model that has no independent variables.

We can also use a similar approach to compare two nested models. Let us fit a smaller nested model and compare the two models to see which is better:

glm.1 = glm(grade ~ gpa + tuce, data=psi, family=binomial(link="logit"))
summary(glm.1)
## 
## Call:
## glm(formula = grade ~ gpa + tuce, family = binomial(link = "logit"), 
##     data = psi)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4085  -0.6882  -0.4677   0.7262   2.4553  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -10.65601    4.05709  -2.627  0.00863 **
## gpa           2.53828    1.18185   2.148  0.03174 * 
## tuce          0.08555    0.13318   0.642  0.52064   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.183  on 31  degrees of freedom
## Residual deviance: 31.983  on 29  degrees of freedom
## AIC: 37.983
## 
## Number of Fisher Scoring iterations: 4
anova(glm.0, glm.1)
## Analysis of Deviance Table
## 
## Model 1: grade ~ gpa + tuce + psi
## Model 2: grade ~ gpa + tuce
##   Resid. Df Resid. Dev Df Deviance
## 1        28     25.779            
## 2        29     31.983 -1  -6.2037

The results indicate that adding psi reduces the deviance by 6.2037 at the expense of one degree of freedom.
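Note that anova() reports no p-value by default; we can request a \(\chi^2\) test, or compute the p-value directly from the change in deviance:

anova(glm.0, glm.1, test = "Chisq")          # adds a Pr(>Chi) column
pchisq(6.2037, df = 1, lower.tail = FALSE)   # ≈ 0.013, so psi earns its keep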

The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC)

R defines AIC as \(-2 \times (\text{maximized log-likelihood}) + 2 \times (\text{number of parameters})\) and as a rule a lower AIC is preferable to a higher AIC.

We can keep testing alternative models until the AIC stops dropping any further.

We also have a related but more robust criterion for choosing between alternative models – the Bayesian Information Criterion (BIC). Many analysts prefer the BIC over the AIC because it imposes a stiffer penalty on model complexity:

\[BIC = -2 \times LL_{FULL} + \ln(n) \times (\text{number of parameters})\]

Since \(\ln(n) > 2\) whenever \(n \geq 8\), each additional parameter costs more under the BIC than under the AIC.

In either event, convention suggests that we report both the AIC and the BIC, recognizing that when choosing between models lower values of AIC and BIC suggest a better model fit.
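As a sanity check, both criteria can be reproduced from the model's log-likelihood; for glm.0, \(n = 32\) and 4 parameters were estimated:

-2 * as.numeric(logLik(glm.0)) + 2 * 4        # 33.779, the AIC
-2 * as.numeric(logLik(glm.0)) + log(32) * 4  # 39.642, the BIC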

AIC(glm.0); BIC(glm.0)
## [1] 33.77927
## [1] 39.64221
AIC(glm.1); BIC(glm.1)
## [1] 37.98296
## [1] 42.38017

In closing, note that AIC and BIC can also be used to compare non-nested models.

Things to Bear in Mind …

  • Empty cells or small cells: You should check for empty or small cells by running frequency tables that cross-tabulate your categorical independent variable(s) and the dependent variable (see the sketch after this list). If any cell has a low frequency count the model could run into trouble. With very small cell counts the model may even crash.
  • There is a technical issue of separation or quasi-separation (also called perfect prediction). In brief, if the outcome does not vary for given values of the independent variable(s) the model crashes.
  • These are maximum likelihood estimates. As such they require larger samples than does the simple linear regression model estimated via ordinary least squares (OLS).
  • If the dependent variable assumes \(Y=1\) very rarely in your sample, most likely because you are studying a very rare outcome, the default logit model will not work. Rather, you should use a logit model designed for rare events – the rare events logit.
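For instance, the cross-tabulation suggested in the first bullet takes one line with the psi data:

with(psi, table(psi, grade))   # look for empty or very small cells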

Diagnostics for Logit Regression Models

You can use the same diagnostics for logit models that you used for regression models. There are some differences in how residuals are defined, etc. but we’ll ignore these mathematical details.

library(car)
marginalModelPlots(glm.0)

avPlots(glm.0)

residualPlots(glm.0)

##      Test stat Pr(>|t|)
## gpa      2.112    0.146
## tuce     0.172    0.678
## psi         NA       NA
influenceIndexPlot(glm.0)

influencePlot(glm.0)

##     StudRes       Hat     CookD
## 2  2.364568 0.1298970 0.3434321
## 30 1.222462 0.3288526 0.1377491
outlierTest(glm.0)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##   rstudent unadjusted p-value Bonferonni p
## 2 2.364568           0.018051      0.57764
vif(glm.0)
##      gpa     tuce      psi 
## 1.188153 1.078528 1.148723

Note: You cannot use ncvTest() because it only works with a linear regression model.

Separation? Quasi-Complete Separation?

y = c(0,0,0,0,1,1,1,1)
x1 =c(1,2,3,3,5,6,10,11)
x2 =c(3,2,-1,-1,2,4,1,0)
my.df1 = data.frame(cbind(y, x1, x2))
lm1 = glm(y ~ x1 + x2, data=my.df1, family=binomial)
summary(lm1)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial, data = my.df1)
## 
## Deviance Residuals: 
##          1           2           3           4           5           6  
## -2.110e-08  -1.404e-05  -2.522e-06  -2.522e-06   1.564e-05   2.110e-08  
##          7           8  
##  2.110e-08   2.110e-08  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -66.098 183471.722   0.000        1
## x1              15.288  27362.843   0.001        1
## x2               6.241  81543.720   0.000        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.1090e+01  on 7  degrees of freedom
## Residual deviance: 4.5454e-10  on 5  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 24

Note what happens here. You get a warning message. Why? Take a look at my.df1: \(y = 0\) whenever \(x_1 \leq 3\) and \(y = 1\) whenever \(x_1 \geq 5\) … this is perfect separation and the model becomes unstable. Look at your standard errors; they have exploded. A simple plot will explain what you have here.

with(my.df1, plot(x1, y))
abline(v=3.1, col="red")

Now quasi-complete separation …

y = c(0,0,0,0,1,1,1,1,1,1)
x1 =c(1,2,3,3,3,4,5,6,10,11)
x2 =c(3,0,-1,4,1,0,2,7,3,4)
my.df2 = data.frame(cbind(y, x1, x2))
lm2 = glm(y~ x1 + x2, data=my.df2, family=binomial)
summary(lm2)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial, data = my.df2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.00419  -0.00006   0.00000   0.00000   1.46892  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -58.0761 17511.9031  -0.003    0.997
## x1             19.1778  5837.3010   0.003    0.997
## x2             -0.1206     0.6098  -0.198    0.843
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13.4602  on 9  degrees of freedom
## Residual deviance:  3.7792  on 7  degrees of freedom
## AIC: 9.7792
## 
## Number of Fisher Scoring iterations: 21

Here, \(y=0\) whenever \(x_1 < 3\) and \(y = 1\) whenever \(x_1 > 3\) so the only zone of uncertainty occurs when \(x_1 = 3\). This is quite obviously a less severe situation than we have in my.df1.

with(my.df2, plot(x1, y))
abline(v=2.9, col="red")
abline(v=3.1, col="red")

In brief, what you are seeing is a problem with the distributions of the variables – you just don’t have a healthy mix of values of the independent and dependent variables. With perfect separation you will basically be forced to drop \(x_1\). With quasi-complete separation you could keep \(x_1\) in the model, but you won’t get any reasonable estimates of the “slope” and standard error for \(x_1\); the estimates will be just fine for all other variables not plagued by the separation problem.

Voting in the 1992 General Election

Now let us practice with the vote92 data in the pscl package.

Survey data containing self-reports of vote choice in the 1992 U.S. Presidential election, with numerous covariates, from the 1992 American National Election Studies. A data frame with 909 observations on the following 10 variables.

  • vote – a factor with levels Perot, Clinton, Bush
  • dem – a numeric vector, 1 if the respondent reports identifying with the Democratic party, 0 otherwise
  • rep – a numeric vector, 1 if the respondent reports identifying with the Republican party, 0 otherwise
  • female – a numeric vector, 1 if the respondent is female, 0 otherwise
  • persfinance – a numeric vector, -1 if the respondent reports that their personal financial situation has gotten worse over the last 12 months, 0 for no change, 1 if better
  • natlecon – a numeric vector, -1 if the respondent reports that national economic conditions have gotten worse over the last 12 months, 0 for no change, 1 if better
  • clintondis – a numeric vector, squared difference between respondent’s self-placement on a scale measure of political ideology and the respondent’s placement of the Democratic candidate, Bill Clinton
  • bushdis – a numeric vector, squared ideological distance of the respondent from the Republican candidate, President George H.W. Bush
  • perotdis – a numeric vector, squared ideological distance of the respondent from the Reform Party candidate, Ross Perot

library(pscl)
data(vote92); names(vote92)
## [1] "vote"        "dem"         "rep"         "female"      "persfinance"
## [6] "natlecon"    "clintondis"  "bushdis"     "perotdis"
summary(vote92)
##       vote          dem              rep             female      
##  Perot  :183   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Clinton:416   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Bush   :310   Median :0.0000   Median :0.0000   Median :0.0000  
##                Mean   :0.4884   Mean   :0.4301   Mean   :0.4752  
##                3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   persfinance           natlecon         clintondis         bushdis       
##  Min.   :-1.000000   Min.   :-1.0000   Min.   : 0.0004   Min.   : 0.1024  
##  1st Qu.:-1.000000   1st Qu.:-1.0000   1st Qu.: 0.9604   1st Qu.: 0.4624  
##  Median : 0.000000   Median :-1.0000   Median : 1.0404   Median : 1.7424  
##  Mean   :-0.009901   Mean   :-0.6722   Mean   : 3.5062   Mean   : 3.3793  
##  3rd Qu.: 1.000000   3rd Qu.: 0.0000   3rd Qu.: 4.0804   3rd Qu.: 5.3824  
##  Max.   : 1.000000   Max.   : 1.0000   Max.   :16.1600   Max.   :18.6620  
##     perotdis      
##  Min.   : 0.2401  
##  1st Qu.: 0.2401  
##  Median : 2.2201  
##  Mean   : 2.1710  
##  3rd Qu.: 2.2801  
##  Max.   :12.1800

Let us drop Perot voters from the data-frame and create a second data-frame:

# Keep only the Clinton and Bush voters. The empty "Perot" level remains in
# the factor (see the summary below); glm() drops unused levels when it
# builds the model frame, so the fit further down is unaffected.
vote92b = subset(vote92, vote != "Perot")
summary(vote92b)
##       vote          dem             rep             female      
##  Perot  :  0   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  Clinton:416   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Bush   :310   Median :1.000   Median :0.0000   Median :0.0000  
##                Mean   :0.522   Mean   :0.4187   Mean   :0.4959  
##                3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##   persfinance          natlecon         clintondis         bushdis       
##  Min.   :-1.00000   Min.   :-1.0000   Min.   : 0.0004   Min.   : 0.1024  
##  1st Qu.:-1.00000   1st Qu.:-1.0000   1st Qu.: 0.9604   1st Qu.: 0.4624  
##  Median : 0.00000   Median :-1.0000   Median : 1.0404   Median : 1.7424  
##  Mean   :-0.01102   Mean   :-0.6749   Mean   : 3.5811   Mean   : 3.5482  
##  3rd Qu.: 1.00000   3rd Qu.: 0.0000   3rd Qu.: 4.0804   3rd Qu.: 5.3824  
##  Max.   : 1.00000   Max.   : 1.0000   Max.   :16.1600   Max.   :18.6620  
##     perotdis      
##  Min.   : 0.2401  
##  1st Qu.: 0.2401  
##  Median : 2.2201  
##  Mean   : 2.3066  
##  3rd Qu.: 2.2801  
##  Max.   :12.1800

Now we fit the model:

# With the empty "Perot" level dropped, the response is Clinton (first level,
# "failure") vs. Bush ("success"), so the model predicts Pr(vote = Bush)
glm.v0 = glm(vote ~ dem + rep + female + clintondis + bushdis + persfinance + natlecon, data=vote92b, family=binomial(link="logit"))
summary(glm.v0)
## 
## Call:
## glm(formula = vote ~ dem + rep + female + clintondis + bushdis + 
##     persfinance + natlecon, family = binomial(link = "logit"), 
##     data = vote92b)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6356  -0.3245  -0.1250   0.3512   2.7556  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.002577   0.470869  -0.005 0.995634    
## dem         -1.891518   0.441773  -4.282 1.86e-05 ***
## rep          2.494360   0.427411   5.836 5.35e-09 ***
## female       0.442343   0.291080   1.520 0.128596    
## clintondis   0.098416   0.041049   2.397 0.016508 *  
## bushdis     -0.207106   0.054690  -3.787 0.000153 ***
## persfinance  0.294340   0.182242   1.615 0.106288    
## natlecon     0.913165   0.270653   3.374 0.000741 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 990.92  on 725  degrees of freedom
## Residual deviance: 359.10  on 718  degrees of freedom
## AIC: 375.1
## 
## Number of Fisher Scoring iterations: 6
vote92b$yhat = predict(glm.v0, type="response")
library(visreg)
visreg(glm.v0, "clintondis", by="persfinance", scale="response")