Ever since regression analysis could be conducted with widely available software packages that used drop-down menus, folks began running regression models on everything under the sun. The fact that the dependent variable was a categorical variable that could only assume one of two values was not a deterrent. As a result, you saw the usual regression model being fit to all manner of binary dependent variables.
The list of possible dependent variables is endless.
In principle, of course, the basic logic of regression analysis still holds. That is, we could line up whatever independent variables \((x_1, x_2, x_3, \cdots, x_k)\) we believe explain the outcome \((y)\) and then test our beliefs by fitting an appropriate model. For example, the results would allow us to figure out if the probability of seeing a particular outcome increases/decreases as \(x_1\) increases or if it has no impact on the outcome. We could do this for all independent variables. We could model interactions and see if these help improve the model. We could generate predicted values of the outcomes and plot these against each independent variable. In brief, then, we could do everything with categorical dependent variables that we have done with the continuous dependent variable we have used in our regression models thus far. So what is the big deal? Let us see.
We’ll use a particular data-set to see the trouble with the linear probability model (LPM), that is, an ordinary regression fit to a binary dependent variable. In particular, Spector and Mazzeo examined the effect of a teaching method known as PSI (the Personalized System of Instruction) on the performance of students in an intermediate macroeconomics course. The question was whether students exposed to the method scored higher on exams in the class. They collected data from students in two classes, one in which PSI was used and another in which a traditional teaching method was employed. For each of 32 students, they gathered data on the variables grade, gpa, tuce, and psi (summarized below).
grade was the dependent variable, and of particular interest was whether psi had a significant effect on grade, with tuce and gpa included as control variables. The model then is:
\[grade = \beta_0 + \beta_1(gpa) + \beta_2(tuce) + \beta_3(psi) + e\]
library(foreign)
psi = read.dta("http://www3.nd.edu/~rwilliam/statafiles/logist.dta")
psi$psi = factor(psi$psi, labels=c("Not in PSI", "In PSI"))
summary(psi)
## grade gpa tuce psi
## Min. :0.0000 Min. :2.060 Min. :12.00 Not in PSI:18
## 1st Qu.:0.0000 1st Qu.:2.812 1st Qu.:19.75 In PSI :14
## Median :0.0000 Median :3.065 Median :22.50
## Mean :0.3438 Mean :3.117 Mean :21.94
## 3rd Qu.:1.0000 3rd Qu.:3.515 3rd Qu.:25.00
## Max. :1.0000 Max. :4.000 Max. :29.00
lpm = lm(grade ~ gpa + tuce + psi, data = psi)
summary(lpm)
##
## Call:
## lm(formula = grade ~ gpa + tuce + psi, data = psi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78153 -0.27731 0.00531 0.21089 0.81145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.49802 0.52389 -2.859 0.00793 **
## gpa 0.46385 0.16196 2.864 0.00784 **
## tuce 0.01050 0.01948 0.539 0.59436
## psiIn PSI 0.37855 0.13917 2.720 0.01109 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3881 on 28 degrees of freedom
## Multiple R-squared: 0.4159, Adjusted R-squared: 0.3533
## F-statistic: 6.646 on 3 and 28 DF, p-value: 0.001571
Note that grade is coded such that \(grade=1\) if the student got an A and \(grade = 0\) if the student got any other grade. So you can think of the regression as modeling the probability of getting an A (i.e., \(y=1\)).
Now interpret the regression results …
Let us generate the predicted values, the residuals, and let us also run a residual plot:
psi$predicted = fitted(lpm)
psi$residuals = resid(lpm)
library(car)
residualPlots(lpm)
## Test stat Pr(>|t|)
## gpa 1.441 0.161
## tuce 0.445 0.660
## psi NA NA
## Tukey test 0.408 0.683
Go ahead and click on the data-frame to open it up in the viewer. Notice anything strange about the predicted values and the residuals? What pattern do you see? By the way, you can also see this in the residualPlots() output; focus on the plot against the fitted values. The residuals clearly show a pattern, and with a binary dependent variable they always will. Consequently we end up with heteroscedasticity in our model, with the usual consequences: our estimates are unbiased BUT the standard errors are off, and so the significance tests and confidence intervals will be unreliable.
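One way to see the problem concretely is to inspect the fitted values themselves: nothing in the LPM constrains them to the \([0, 1]\) range of a probability. A quick check, using the predicted column created above:
summary(psi$predicted)
sum(psi$predicted < 0 | psi$predicted > 1)   # how many fitted "probabilities" fall outside [0, 1]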
We also realize that in cases with a binary dependent variable that is tapping some latent (i.e., underlying but unmeasured) probability, the relationship between \(x\) and \(y\) is unlikely to be linear. Let us see this with a simple plot:
library(ggplot2)
ggplot(psi, aes(gpa, grade)) + geom_point() + geom_smooth(method="lm")
The regression line says your probability of getting an A increases by a set amount for every unit increase in gpa. But surely once my gpa reaches a level where I am practically guaranteed to get an A, any further increase in my gpa should have a negligible impact on my probability of getting an A?
The same thing should happen at the low end of gpa. If my gpa is at the minimum, the probability of getting an A probably hovers around 0 until I cross some level of gpa, after which my chances of an A move closer to 50%, and after some further increases in gpa my chances of an A should be better than 50%. What this suggests is that the probability of getting an A most likely looks like the image below:
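Since the shape is easier to see than to describe, here is a sketch that draws an illustrative S-curve; the coefficients -12 and 4 are arbitrary values chosen only to make the shape visible over the range of gpa, not estimates from the data.
curve(plogis(-12 + 4 * x), from = 2, to = 4,
      xlab = "gpa", ylab = "Probability grade = A")   # plogis() is the logistic function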
This S-shaped curve most likely captures how the probability of getting an A changes as gpa increases. It stays flat at low values of gpa, then shoots up around the middle of the distribution of gpa, and finally levels off for the highest values of gpa. All in all, then, the linear probability model has some problems: its fitted values are not constrained to the \([0,1]\) range of a probability, its residuals are necessarily heteroscedastic (so the usual standard errors are unreliable), and it forces a straight-line relationship where the true relationship between \(x\) and the probability of \(y=1\) is almost certainly S-shaped.
Therefore, because the LPM is inappropriate for a binary dependent variable, we have to rely on a particular family of models: the logit (or logistic) and probit models. I’ll start by explaining some key constructs: contingency tables, odds, and odds-ratios.
Contingency analysis comes into use when we need to test for the independence of two categorical variables. For example, did Male passengers on the Titanic have the same survival chances as Female passengers? Is lung cancer independent of smoking?
In order to do such tests we often rely upon odds and odds-ratios. The odds of an event are calculated as the number of events divided by the number of non-events. For example, on average 51 boys are born in every 100 births, so the odds of any randomly chosen delivery being that of a boy are \(\dfrac{51}{49}=1.040816\). Likewise, the odds of a girl being born are \(\dfrac{49}{51}=0.9607843\).
Note: If \(p=\) the probability of success and \(1-p=\) the probability of failure, then the odds of success are \(O = \dfrac{p}{1-p}\). Since with a sample we estimate the odds, we have \(\hat{O} = \dfrac{\hat{p}}{1-\hat{p}}\).
Does Aspirin reduce the chances of getting cancer?
Outcome | Aspirin | Placebo | Total |
---|---|---|---|
Cancer | 1438 | 1427 | 2865 |
No Cancer | 18496 | 18515 | 37011 |
Total | 19934 | 19942 | 39876 |
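To follow along in R, the table can be entered as a small matrix; a sketch, with aspirin.tab a name introduced here:
aspirin.tab = matrix(c(1438, 18496, 1427, 18515), nrow = 2,
                     dimnames = list(Outcome = c("Cancer", "No Cancer"),
                                     Group = c("Aspirin", "Placebo")))
addmargins(aspirin.tab)   # reproduces the table above, including the row and column totals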
What were the odds of (i) not getting cancer, and (ii) getting cancer for the Aspirin group versus the Placebo group? These were:
p.ncA = (18496/19934); p.ncA
## [1] 0.9278619
p.cA = (1438/19934); p.cA
## [1] 0.07213806
Odds.ncA = p.ncA/p.cA; Odds.ncA
## [1] 12.86231
p.ncP = (18515/19942); p.ncP
## [1] 0.9284425
p.cP = (1427/19942); p.cP
## [1] 0.07155752
Odds.ncP = p.ncP/p.cP; Odds.ncP
## [1] 12.97477
So the odds-ratio of not getting cancer for the Aspirin versus the Placebo group was:
OR.Aspirin = Odds.ncA/Odds.ncP; OR.Aspirin
## [1] 0.9913321
In plain words: The odds of not getting cancer were slightly less for the Aspirin group than for the Placebo group.
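Since we opened this discussion with testing the independence of two categorical variables, note that the same table feeds straight into a chi-squared test of independence; a sketch, using the aspirin.tab matrix defined above:
chisq.test(aspirin.tab, correct = FALSE)   # tests whether the outcome is independent of treatment group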
Let \(p=\) the probability of success; \(1-p\) = the probability of failure. Then, as previously defined, the odds of success are calculated as \(O = \dfrac{p}{1-p}\), and for two groups the odds-ratio of success is \(OR=\dfrac{O_1}{O_2}\).
Table 1 (see below) maps out probabilities of success for a single group, ranging from the minimum \((p=0)\) to the maximum \((p=1)\). Note \(p=0\) means success never occurs and \(p=1\) means success always occurs for this particular group.
Expressing the probability as odds yields a corresponding range of values that are anchored below at \(0\) and above at \(\infty\) … these are the limits of the odds.
\(p\) | \(1-p\) | \(odds\) |
---|---|---|
0.00 | 1.00 | 0.00 |
0.10 | 0.90 | 0.11 |
0.20 | 0.80 | 0.25 |
0.30 | 0.70 | 0.43 |
0.40 | 0.60 | 0.67 |
0.50 | 0.50 | 1.00 |
0.60 | 0.40 | 1.50 |
0.70 | 0.30 | 2.33 |
0.80 | 0.20 | 4.00 |
0.90 | 0.10 | 9.00 |
1.00 | 0.00 | \(\infty\) |
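Table 1 is easy to reproduce in R; a small sketch:
p = seq(0, 1, by = 0.1)
data.frame(p = p, one.minus.p = 1 - p, odds = round(p / (1 - p), 2))   # odds = Inf when p = 1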
The implication is that the odds follow an asymmetric distribution and hence require a transformation in order for us to come up with some estimate of the uncertainty surrounding the odds. The log-transformation achieves this: it maps the odds, which are bounded below at \(0\) and unbounded above, onto a symmetric scale that runs from \(-\infty\) to \(+\infty\).
The natural logarithm of the odds is called the logit, or the logit transformation of the probability of success \((p)\). Because of the properties of the odds given above, the logit is negative when \(p < 0.5\), zero when \(p = 0.5\), positive when \(p > 0.5\), and it ranges from \(-\infty\) to \(+\infty\), symmetric about \(0\), as the short sketch below illustrates.
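A quick sketch of the logit values for the probabilities in Table 1:
p = seq(0.1, 0.9, by = 0.1)
round(qlogis(p), 2)   # qlogis(p) = log(p / (1 - p)), the log-odds; symmetric about 0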
Mathematically, if \(y\) is the probability that a binary outcome equals \(1\) for a given value of \(x\), the logistic model specifies:
\[y = \dfrac{e^{\beta_0 + \beta_1(x)}}{1 + e^{\beta_0 + \beta_1(x)}}\]
Taking the natural logarithm of the odds, \(\dfrac{y}{1-y}\), gives: \[logit(y) = \ln\left(\dfrac{y}{1-y}\right) = \beta_0 + \beta_1(x)\]
As such, a logit (or logistic) regression is a linear regression on the logit transform of \(y\), where \(y\) is the probability of success for a given value of \(x\). So how do we fit a logit model? Let us do this first for the Aspirin data we started with.
aspirin = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter09/chap09e2AspirinCancer.csv"))
glm.c = glm(cancer ~ aspirinTreatment, data=aspirin, family=binomial(link=logit))
summary(glm.c)
##
## Call:
## glm(formula = cancer ~ aspirinTreatment, family = binomial(link = logit),
## data = aspirin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2966 0.3854 0.3854 0.3870 0.3870
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.554301 0.027377 93.303 <2e-16 ***
## aspirinTreatmentPlacebo 0.008706 0.038785 0.224 0.822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20607 on 39875 degrees of freedom
## Residual deviance: 20607 on 39874 degrees of freedom
## AIC: 20611
##
## Number of Fisher Scoring iterations: 5
exp(glm.c$coefficients)
## (Intercept) aspirinTreatmentPlacebo
## 12.862309 1.008744
The exponentiated coefficient, 1.0087, is exactly the odds-ratio we get from the contingency table itself (as it should be); here it is expressed as Placebo relative to Aspirin, i.e., \(1/0.9913 = 1.0087\).
Now on to the psi data
glm.0 = glm(grade ~ gpa + tuce + psi, data=psi, family=binomial(link="logit"))
summary(glm.0)
##
## Call:
## glm(formula = grade ~ gpa + tuce + psi, family = binomial(link = "logit"),
## data = psi)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9551 -0.6453 -0.2570 0.5888 2.0966
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.02135 4.93127 -2.641 0.00828 **
## gpa 2.82611 1.26293 2.238 0.02524 *
## tuce 0.09516 0.14155 0.672 0.50143
## psiIn PSI 2.37869 1.06456 2.234 0.02545 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.183 on 31 degrees of freedom
## Residual deviance: 25.779 on 28 degrees of freedom
## AIC: 33.779
##
## Number of Fisher Scoring iterations: 5
psi$predicted = fitted(glm.0)
The estimates \(\hat{\beta_i}\) are in log-odds units and have to be interpreted with care. Estimates in log-odds units are hard to make sense of, so it is often preferred that we report the odds-ratios instead. These can be recovered from the estimates very easily because the odds-ratio is just \(e^{\hat{\beta_i}}\).
Variable | Estimate | Odds-Ratio |
---|---|---|
gpa | 2.82611 | 16.87967 |
tuce | 0.09516 | 1.099835 |
psi | 2.37869 | 10.79076 |
Exponentiating estimates to get the odds-ratios is done as follows:
glm.0.OR = exp(coef(glm.0))
round(glm.0.OR, digits=5)
## (Intercept) gpa tuce psiIn PSI
## 0.00000 16.87972 1.09983 10.79073
Now we can say that, holding all else constant, the odds of getting an A are 10.79 times as high for someone in PSI as for someone not in PSI. What about interpreting gpa? In general, interpreting the estimates of a logit model is tricky because most of us are not used to dealing with log-odds or odds-ratios. One solution is thus to use predicted probabilities to explain what the model is telling us.
We can do this in two ways: by calculating predicted probabilities for specific values of the independent variables with predict(), or by plotting the fitted relationships with visreg. We will do both in sequence. First, let us see what happens when two students are both in psi, both have tuce = 19.97, but one has gpa = 2.812 and the other has gpa = 3.065. How much does this difference in their gpa impact the odds-ratio?
The difference in the log-odds for the two students is \[\hat{\beta}_1 \times 3.065 - \hat{\beta}_1 \times 2.812 = 2.82611 \times (3.065 - 2.812) = 2.82611 \times 0.253\] so the odds-ratio comparing them is \(e^{2.82611 \times 0.253} \approx 2.04\):
exp(2.82611 * 0.253)
newdata1 = data.frame(tuce=19.97, gpa=2.812, psi="In PSI")
predict(glm.0, newdata1, type="response")
## 1
## 0.3110249
newdata2 = data.frame(tuce=19.97, gpa=3.065, psi="In PSI")
predict(glm.0, newdata2, type="response")
## 1
## 0.4799294
newdata3 = data.frame(tuce=19.97, gpa=3.065, psi="Not in PSI")
predict(glm.0, newdata3, type="response")
## 1
## 0.07878191
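As a check, the two predicted probabilities printed above imply the same odds-ratio we just calculated by hand; a small sketch:
p1 = 0.3110249   # predicted probability at gpa = 2.812 (in PSI, tuce = 19.97)
p2 = 0.4799294   # predicted probability at gpa = 3.065 (in PSI, tuce = 19.97)
(p2 / (1 - p2)) / (p1 / (1 - p1))   # ratio of the implied odds; about 2.04, matching exp(2.82611 * 0.253)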
Note what I have done here. I created newdata1, newdata2, and newdata3 as data-frames with specific values of the independent variables. The only difference between newdata1 and newdata2 is the value of gpa, and the only difference between newdata2 and newdata3 is psi.
The resulting values of predicted probabilities can then be used to explain the results.
This could be useful if the audience wanted to know the impact of specific variables, and how changes in these variables’ values influence the probability of \(y=1\).
Now we do it en masse to see how the predicted probabilities change as gpa changes, comparing those in psi with those not in psi, holding tuce constant.
newdata4 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="Not in PSI")
yhat.4 = predict(glm.0, newdata4, type="response")
yhat.4
## 1 2 3 4 5 6
## 0.005562042 0.007365167 0.009747107 0.012889378 0.017027236 0.022463235
## 7 8 9 10 11 12
## 0.029582466 0.038868267 0.050915907 0.066439704 0.086266357 0.111304187
## 13 14 15 16 17 18
## 0.142475849 0.180603283 0.226241947 0.279479898 0.339744807 0.405687926
## 19 20
## 0.475218059 0.545722471
plot(newdata4$gpa, yhat.4)
newdata5 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="In PSI")
yhat.5 = predict(glm.0, newdata5, type="response")
yhat.5
## 1 2 3 4 5 6
## 0.05691891 0.07413001 0.09601554 0.12350052 0.15748258 0.19869539
## 7 8 9 10 11 12
## 0.24752487 0.30380461 0.36664563 0.43437502 0.50464694 0.57473574
## 13 14 15 16 17 18
## 0.64194382 0.70400068 0.75933433 0.80715746 0.84738750 0.88046778
## 19 20
## 0.90716315 0.92838154
plot(newdata5$gpa, yhat.5)
Let us compare the two predicted probabilities in a single plot:
plot.data = data.frame(cbind(yhat.4, yhat.5, newdata4$gpa, (yhat.5 - yhat.4)))
colnames(plot.data) = c("Yhat4", "Yhat5", "gpa", "Yhat5-Yhat4")
plot(plot.data$gpa, yhat.4, type="b", col="cornflowerblue", xlab="gpa", ylab="Probability grade = A", ylim=c(0,1))
par(new=TRUE)  # overlay the second series on the same plot
plot(plot.data$gpa, yhat.5, type="b", col="firebrick", yaxt="n", ylab="", xlab="", ylim=c(0,1))
abline(h=0.5, col="gray")  # reference line at probability = 0.5
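If you prefer ggplot2 (loaded earlier), a roughly equivalent version of this comparison plot might look like the sketch below; it assumes the plot.data frame built above.
library(ggplot2)
ggplot(plot.data, aes(x = gpa)) +
  geom_line(aes(y = Yhat4, colour = "Not in PSI")) +
  geom_point(aes(y = Yhat4, colour = "Not in PSI")) +
  geom_line(aes(y = Yhat5, colour = "In PSI")) +
  geom_point(aes(y = Yhat5, colour = "In PSI")) +
  geom_hline(yintercept = 0.5, colour = "gray") +
  ylim(0, 1) +
  labs(y = "Probability grade = A", colour = "psi")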
visreg is helpful more globally, in capturing the effects of the independent variables in a more summary format.
library(visreg)
visreg(glm.0, "gpa", by="psi", rug=2)
visreg(glm.0, "gpa", by="psi", rug=2, scale="response", overlay=TRUE)
visreg(glm.0, "gpa", by="psi", rug=2, scale="response")
Notice a couple of things about these plots as well: by default visreg plots the fitted relationship on the scale of the linear predictor (i.e., in log-odds units), while adding scale="response" converts the vertical axis to predicted probabilities, and the rug marks show where the observed gpa values fall in each group.
There is no \(R^2\) for logit models, so we have to settle on other measures to decide how well our model fits the actual data. A popular test is the Hosmer and Lemeshow goodness-of-fit test. In brief, this test compares the observed distribution of \(y=[0,1]\) to that predicted from the model. The test is basically a \(\chi^2\) test that splits your data into groups (below we set the number of groups to the number of independent variables \(+1\)). Within each group the test compares the observed versus the expected numbers of zeroes and ones. If the observed and expected counts are close across most groups, we have a "good model". As such, the null hypothesis is of good model fit; the alternative is a poor model fit.
library(ResourceSelection)
hl = hoslem.test(psi$grade, fitted(glm.0), g=4)
hl
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: psi$grade, fitted(glm.0)
## X-squared = 2.8318, df = 2, p-value = 0.2427
Many people also just set the number of groups to 10. Let us see what happens if we go this route:
hl = hoslem.test(psi$grade, fitted(glm.0), g=10)
hl
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: psi$grade, fitted(glm.0)
## X-squared = 6.0948, df = 8, p-value = 0.6366
Notice that for the glm.0 model we are unable to reject \(H_0\) of good model fit. This does not necessarily mean that we have a good model (because failing to reject \(H_0\) is not the same as "accepting" \(H_0\)). However, this is a popular test and so we should know how to carry it out and how to interpret the results. We should also recognize that the Hosmer-Lemeshow test is most trustworthy in large samples (where \(n \geq 400\)); in fact the authors do not recommend its use when \(n < 400\).
People trained in linear regression always want something like the \(R^2\) or adjusted \(R^2\) to figure out how good the model really is. One of the popular choices floating around for some time now has been McFadden’s \(R^2\) (also known as the pseudo-\(R^2\)). This measure basically compares the log-likelihood from the full model to the log-likelihood from the NULL model (i.e., the model without any independent variables). The statistic is \[1 - \dfrac{LL_{full}}{LL_{NULL}}\]
The closer it is to \(1\), the better the model, but we cannot say these variables explain x% of the variation in y, so be careful; this is just a rough measure of model fit.
library(pscl)
pR2(glm.0)
## llh llhNull G2 McFadden r2ML r2CU
## -12.8896335 -20.5917297 15.4041925 0.3740383 0.3820706 0.5277965
Note that McFadden’s \(R^2\) is 0.3740383.
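As a sanity check, McFadden’s measure can also be computed by hand from the two log-likelihoods; a sketch, where glm.null is a name introduced here for the intercept-only model:
glm.null = glm(grade ~ 1, data = psi, family = binomial(link = "logit"))
1 - as.numeric(logLik(glm.0)) / as.numeric(logLik(glm.null))   # McFadden's pseudo R-squared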
We can also obtain confidence intervals for the estimates, in log-odds units, and then exponentiate them to get confidence intervals for the odds-ratios:
confint(glm.0)
## 2.5 % 97.5 %
## (Intercept) -25.1681675 -4.9023538
## gpa 0.6393669 5.7572537
## tuce -0.1702054 0.4050443
## psiIn PSI 0.4786090 4.8102606
log.odds = cbind(log.odds = coef(glm.0), confint(glm.0)); log.odds
## log.odds 2.5 % 97.5 %
## (Intercept) -13.02134832 -25.1681675 -4.9023538
## gpa 2.82611297 0.6393669 5.7572537
## tuce 0.09515767 -0.1702054 0.4050443
## psiIn PSI 2.37868773 0.4786090 4.8102606
odds.ratios = exp(cbind(odds.ratios = coef(glm.0), confint(glm.0))); odds.ratios
## odds.ratios 2.5 % 97.5 %
## (Intercept) 2.212587e-06 1.173826e-11 7.429076e-03
## gpa 1.687972e+01 1.895281e+00 3.164780e+02
## tuce 1.099832e+00 8.434915e-01 1.499369e+00
## psiIn PSI 1.079073e+01 1.613828e+00 1.227636e+02
round(log.odds, digits=4)
## log.odds 2.5 % 97.5 %
## (Intercept) -13.0213 -25.1682 -4.9024
## gpa 2.8261 0.6394 5.7573
## tuce 0.0952 -0.1702 0.4050
## psiIn PSI 2.3787 0.4786 4.8103
round(odds.ratios, digits=4)
## odds.ratios 2.5 % 97.5 %
## (Intercept) 0.0000 0.0000 0.0074
## gpa 16.8797 1.8953 316.4780
## tuce 1.0998 0.8435 1.4994
## psiIn PSI 10.7907 1.6138 122.7636
One other popular statistic used to evaluate logit models is the percent correctly predicted: a simple count of the number of correctly predicted zeroes plus the number of correctly predicted ones, divided by the total sample size. So if the model is perfect, the percent correctly predicted should be 100%. Some folks are used to doing this manually by creating a new predicted value that equals 1 if the predicted probability is \(> 0.5\) and 0 if the predicted probability is \(\leq 0.5\), and then building a \(2 \times 2\) table of actual 0/1 versus predicted 0/1. A command in the pscl library does the calculations for us:
library(pscl)
hitmiss(glm.0)
## Classification Threshold = 0.5
## y=0 y=1
## yhat=0 18 3
## yhat=1 3 8
## Percent Correctly Predicted = 81.25%
## Percent Correctly Predicted = 85.71%, for y = 0
## Percent Correctly Predicted = 72.73% for y = 1
## Null Model Correctly Predicts 65.62%
## [1] 81.25000 85.71429 72.72727
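For reference, the manual 0.5-cutoff calculation described above amounts to something like this sketch (pred.class is a name introduced here); it should reproduce the hitmiss() figures:
pred.class = as.numeric(fitted(glm.0) > 0.5)         # 1 if the predicted probability exceeds 0.5, else 0
table(observed = psi$grade, predicted = pred.class)   # 2 x 2 table of actual versus predicted
mean(pred.class == psi$grade) * 100                   # percent correctly predicted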
Some statisticians argue that the decision to use the 0.5 threshold to convert the predicted probabilities into 0 and 1 is problematic. They would in fact recommend the following:
probabilities = predict(glm.0, type = "response")
percent.correct = (1 - mean(abs(probabilities - psi$grade))) * 100
percent.correct
## [1] 74.2198
You will notice that you get a slightly smaller answer from this latter calculation. I would recommend using hitmiss(), while recognizing that not everyone agrees with this approach.
Recall that in the linear regression case we had compared models with one or more additional independent variables versus a model with these variables excluded. We did this via the \(F\)-test.
Logit models have a similar tool for comparing what are called nested models. Assume a model such as \(logit(y) = \beta_0 + \beta_1(x_1) + \beta_2(x_2)\) is the FULL model. Then, a model that excludes either \(x_1\) or \(x_2\) is said to be nested within the FULL model. Nested models such as these can be compared very easily.
summary(glm.0)
##
## Call:
## glm(formula = grade ~ gpa + tuce + psi, family = binomial(link = "logit"),
## data = psi)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9551 -0.6453 -0.2570 0.5888 2.0966
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.02135 4.93127 -2.641 0.00828 **
## gpa 2.82611 1.26293 2.238 0.02524 *
## tuce 0.09516 0.14155 0.672 0.50143
## psiIn PSI 2.37869 1.06456 2.234 0.02545 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.183 on 31 degrees of freedom
## Residual deviance: 25.779 on 28 degrees of freedom
## AIC: 33.779
##
## Number of Fisher Scoring iterations: 5
Notice that the summary() command gives you the deviance; this is similar (in principle but not in the mathematics) to the residual standard error from linear regression. A lower deviance is better than a higher deviance. There is a simple test that compares the deviance from the NULL model (i.e., with no independent variables) to the deviance from the model with all the independent variables you specified. The test statistic is a \(\chi^2\) test:
with(glm.0, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 0.001501878
The p-value of the test is 0.0015, quite a bit less than \(\alpha=0.05\), so we can conclude that the FULL model is an improvement upon the NULL model that has no independent variables.
We can also use a similar approach to compare two nested models. Let us fit a smaller nested model and compare the two models to see which is better:
glm.1 = glm(grade ~ gpa + tuce, data=psi, family=binomial(link="logit"))
summary(glm.1)
##
## Call:
## glm(formula = grade ~ gpa + tuce, family = binomial(link = "logit"),
## data = psi)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4085 -0.6882 -0.4677 0.7262 2.4553
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.65601 4.05709 -2.627 0.00863 **
## gpa 2.53828 1.18185 2.148 0.03174 *
## tuce 0.08555 0.13318 0.642 0.52064
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.183 on 31 degrees of freedom
## Residual deviance: 31.983 on 29 degrees of freedom
## AIC: 37.983
##
## Number of Fisher Scoring iterations: 4
anova(glm.0, glm.1)
## Analysis of Deviance Table
##
## Model 1: grade ~ gpa + tuce + psi
## Model 2: grade ~ gpa + tuce
## Resid. Df Resid. Dev Df Deviance
## 1 28 25.779
## 2 29 31.983 -1 -6.2037
The results indicate that the deviance has been reduced in the FULL model by 6.2037 at the expense of one degree of freedom by adding psi.
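Note that anova() will also report a p-value for this deviance difference if we ask it to run the chi-squared test; a minimal sketch:
anova(glm.1, glm.0, test = "Chisq")
# equivalently, by hand: pchisq(6.2037, df = 1, lower.tail = FALSE)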
R defines AIC as \(-2 \times (\text{maximized log-likelihood}) + 2 \times (\text{number of parameters})\), and as a rule a lower AIC is preferable to a higher AIC.
We can keep testing alternative models until the AIC doesn’t drop any further.
We also have a related but more robust criterion for choosing between alternative models: the Bayesian Information Criterion (BIC). Many analysts prefer the BIC over the AIC because it imposes a stiffer penalty on additional parameters, one that grows with the sample size \(n\):
\[BIC = -2 \times LL_{FULL} + \ln(n) \times (\text{number of parameters})\]
In either event, convention suggests that we report both the AIC and the BIC, recognizing that when choosing between models lower values of AIC and BIC suggest a better model fit.
AIC(glm.0); BIC(glm.0)
## [1] 33.77927
## [1] 39.64221
AIC(glm.1); BIC(glm.1)
## [1] 37.98296
## [1] 42.38017
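As a quick check on the definitions above, the AIC and BIC of glm.0 can be reproduced from its log-likelihood; a sketch, noting that glm.0 estimates 4 parameters and the psi data have 32 observations:
k = 4; n = nrow(psi)
-2 * as.numeric(logLik(glm.0)) + 2 * k        # AIC
-2 * as.numeric(logLik(glm.0)) + log(n) * k   # BIC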
In closing, note that AIC and BIC can also be used to compare non-nested models.
You can use the same diagnostics for logit models that you used for regression models. There are some differences in how residuals are defined, etc. but we’ll ignore these mathematical details.
library(car)
marginalModelPlots(glm.0)
avPlots(glm.0)
residualPlots(glm.0)
## Test stat Pr(>|t|)
## gpa 2.112 0.146
## tuce 0.172 0.678
## psi NA NA
influenceIndexPlot(glm.0)
influencePlot(glm.0)
## StudRes Hat CookD
## 2 2.364568 0.1298970 0.3434321
## 30 1.222462 0.3288526 0.1377491
outlierTest(glm.0)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 2 2.364568 0.018051 0.57764
vif(glm.0)
## gpa tuce psi
## 1.188153 1.078528 1.148723
Note: You cannot use ncvTest() here because it only works with a linear regression model.
One more issue deserves a closer look: separation. First, a small example of complete (or perfect) separation:
y = c(0,0,0,0,1,1,1,1)
x1 =c(1,2,3,3,5,6,10,11)
x2 =c(3,2,-1,-1,2,4,1,0)
my.df1 = data.frame(cbind(y, x1, x2))
lm1 = glm(y ~ x1 + x2, data=my.df1, family=binomial)
summary(lm1)
##
## Call:
## glm(formula = y ~ x1 + x2, family = binomial, data = my.df1)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -2.110e-08 -1.404e-05 -2.522e-06 -2.522e-06 1.564e-05 2.110e-08
## 7 8
## 2.110e-08 2.110e-08
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -66.098 183471.722 0.000 1
## x1 15.288 27362.843 0.001 1
## x2 6.241 81543.720 0.000 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.1090e+01 on 7 degrees of freedom
## Residual deviance: 4.5454e-10 on 5 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 24
Note what happens here. You get a warning message. Why? Take a look at my.df1: \(y = 0\) whenever \(x_1 \leq 3\) and \(y = 1\) whenever \(x_1 \geq 5\). This is perfect separation, and the model becomes unstable. Look at your standard errors; they have exploded. A simple plot will explain what you have here.
with(my.df1, plot(x1, y))
abline(v=3.1, col="red")
Now, an example of quasi-complete separation:
y = c(0,0,0,0,1,1,1,1,1,1)
x1 =c(1,2,3,3,3,4,5,6,10,11)
x2 =c(3,0,-1,4,1,0,2,7,3,4)
my.df2 = data.frame(cbind(y, x1, x2))
lm2 = glm(y~ x1 + x2, data=my.df2, family=binomial)
summary(lm2)
##
## Call:
## glm(formula = y ~ x1 + x2, family = binomial, data = my.df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.00419 -0.00006 0.00000 0.00000 1.46892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -58.0761 17511.9031 -0.003 0.997
## x1 19.1778 5837.3010 0.003 0.997
## x2 -0.1206 0.6098 -0.198 0.843
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13.4602 on 9 degrees of freedom
## Residual deviance: 3.7792 on 7 degrees of freedom
## AIC: 9.7792
##
## Number of Fisher Scoring iterations: 21
Here, \(y=0\) whenever \(x_1 < 3\) and \(y = 1\) whenever \(x_1 > 3\), so the only zone of uncertainty occurs when \(x_1 = 3\). This is quite obviously a less severe situation than we have in my.df1.
with(my.df2, plot(x1, y))
abline(v=2.9, col="red")
abline(v=3.1, col="red")
In brief, what you are seeing is a problem with the joint distribution of the variables: you just don’t have a healthy mix of values of the independent variable across both values of the dependent variable. With perfect separation this will basically force you to drop \(x_1\). With quasi-complete separation you could keep \(x_1\) in the model, but you won’t get any reasonable estimate of the "slope" or standard error for \(x_1\); the estimates will be just fine for all other variables not plagued by the separation problem.
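Two quick ways to spot separation in your own data, sketched here with the objects created above:
round(fitted(lm1), 6)                  # with complete separation the fitted probabilities collapse to 0 and 1
table(x1 = my.df1$x1, y = my.df1$y)    # the cross-tabulation shows no overlap in x1 across y = 0 and y = 1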
Now let us practice with the vote92 data in the pscl package.
These are survey data containing self-reports of vote choice in the 1992 U.S. Presidential election, with numerous covariates, from the 1992 American National Election Studies: a data frame with 909 observations on the variables listed below.
library(pscl)
data(vote92); names(vote92)
## [1] "vote" "dem" "rep" "female" "persfinance"
## [6] "natlecon" "clintondis" "bushdis" "perotdis"
summary(vote92)
## vote dem rep female
## Perot :183 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Clinton:416 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Bush :310 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.4884 Mean :0.4301 Mean :0.4752
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## persfinance natlecon clintondis bushdis
## Min. :-1.000000 Min. :-1.0000 Min. : 0.0004 Min. : 0.1024
## 1st Qu.:-1.000000 1st Qu.:-1.0000 1st Qu.: 0.9604 1st Qu.: 0.4624
## Median : 0.000000 Median :-1.0000 Median : 1.0404 Median : 1.7424
## Mean :-0.009901 Mean :-0.6722 Mean : 3.5062 Mean : 3.3793
## 3rd Qu.: 1.000000 3rd Qu.: 0.0000 3rd Qu.: 4.0804 3rd Qu.: 5.3824
## Max. : 1.000000 Max. : 1.0000 Max. :16.1600 Max. :18.6620
## perotdis
## Min. : 0.2401
## 1st Qu.: 0.2401
## Median : 2.2201
## Mean : 2.1710
## 3rd Qu.: 2.2801
## Max. :12.1800
Let us drop the Perot voters from the data-frame and create a second data-frame:
vote92b = subset(vote92, vote != "Perot")  # keep only the Clinton and Bush voters
attach(vote92b)
vote92b$vote[vote == "Perot"] = NA  # redundant here, since the Perot rows were already dropped above
detach(vote92b)
summary(vote92b)
## vote dem rep female
## Perot : 0 Min. :0.000 Min. :0.0000 Min. :0.0000
## Clinton:416 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Bush :310 Median :1.000 Median :0.0000 Median :0.0000
## Mean :0.522 Mean :0.4187 Mean :0.4959
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000
## persfinance natlecon clintondis bushdis
## Min. :-1.00000 Min. :-1.0000 Min. : 0.0004 Min. : 0.1024
## 1st Qu.:-1.00000 1st Qu.:-1.0000 1st Qu.: 0.9604 1st Qu.: 0.4624
## Median : 0.00000 Median :-1.0000 Median : 1.0404 Median : 1.7424
## Mean :-0.01102 Mean :-0.6749 Mean : 3.5811 Mean : 3.5482
## 3rd Qu.: 1.00000 3rd Qu.: 0.0000 3rd Qu.: 4.0804 3rd Qu.: 5.3824
## Max. : 1.00000 Max. : 1.0000 Max. :16.1600 Max. :18.6620
## perotdis
## Min. : 0.2401
## 1st Qu.: 0.2401
## Median : 2.2201
## Mean : 2.3066
## 3rd Qu.: 2.2801
## Max. :12.1800
Now we fit the model:
glm.v0 = glm(vote ~ dem + rep + female + clintondis + bushdis + persfinance + natlecon, data=vote92b, family=binomial(link="logit"))
summary(glm.v0)
##
## Call:
## glm(formula = vote ~ dem + rep + female + clintondis + bushdis +
## persfinance + natlecon, family = binomial(link = "logit"),
## data = vote92b)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6356 -0.3245 -0.1250 0.3512 2.7556
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.002577 0.470869 -0.005 0.995634
## dem -1.891518 0.441773 -4.282 1.86e-05 ***
## rep 2.494360 0.427411 5.836 5.35e-09 ***
## female 0.442343 0.291080 1.520 0.128596
## clintondis 0.098416 0.041049 2.397 0.016508 *
## bushdis -0.207106 0.054690 -3.787 0.000153 ***
## persfinance 0.294340 0.182242 1.615 0.106288
## natlecon 0.913165 0.270653 3.374 0.000741 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 990.92 on 725 degrees of freedom
## Residual deviance: 359.10 on 718 degrees of freedom
## AIC: 375.1
##
## Number of Fisher Scoring iterations: 6
vote92b$yhat = predict(glm.v0, type="response")
library(visreg)
visreg(glm.v0, "clintondis", by="persfinance", scale="response")