Think of the outcome we are modeling in terms of a latent variable – something that isn’t observed; instead, what is observed (and measured) is some other version of this outcome. Assume that there really exists some continuous variable \(y^{*}\) that reflects a student’s TRUE probability of earning an A. Since all we observe is \(y \in \{0, 1\}\) – whether a student did or did not earn an A – there is a slight disconnect between \(y^{*}\) and \(y\). Mathematically, it is always easier to think of models like the logit and probit in terms of latent variables, in which case the logit and probit models can be depicted as follows:
\[y^{*}_{i} = x_{i}\beta + e_{i}\]
\[y=\begin{cases} 1 & \text{if } y^{*}_{i} > \tau, \\ 0 & \text{if }y^{*}_{i} \leq \tau \end{cases}\]
Note that \(\tau\) is some threshold that, when crossed, lets us observe \(y=1\) and if the threshold is not crossed then we observe \(y=0\).
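A quick simulation may help fix ideas. Below is a minimal sketch (with made-up coefficients and normally distributed errors, i.e., the probit case) that generates a latent \(y^{*}\) and then records only whether it crosses the threshold \(\tau = 0\):
set.seed(42)
n = 1000
x = rnorm(n)                       # a single predictor
ystar = 0.5 + 1.2 * x + rnorm(n)   # latent y* = xB + e; we never see this
y = as.numeric(ystar > 0)          # all we observe: did y* cross tau = 0?
table(y)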
Just as in the linear regression case, here too we know that the mean of the residuals will be \(0\). However, because we don’t see \(y^*\) we don’t know how the residuals might vary. We thus have to assume some distribution that would be a reasonable approximation to the one generating \(y^{*}\). Since we are dealing with a binary outcome, the binomial function (more on this below) is one of the simplest generators of binary outcomes. Therefore, when we specify which distribution we are assuming, in R we say the outcome is coming from the family=binomial().
When we specify link="logit" we are saying the logit function describes the distribution of outcomes well. For the logit function the errors are assumed to have variance \(Var\left(\epsilon | x\right) =\dfrac{\pi^{2}}{3}\), while for the probit the errors are assumed to have variance \(Var\left(\epsilon | x\right) =1\). If we could not specify which specific function within the binomial family describes the outcomes, we would be unable to model these data, because maximum likelihood estimation needs to know what variance of the residuals it must assume in order to give you your logit/probit results.
Note: Our choice of logit versus probit is somewhat arbitrary because there is no way to use our sample and our models to verify that the errors underlying \(y^{*}\) do indeed follow the logistic or the standard normal distribution.
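As a quick sanity check of the logistic error variance, we can simulate a large number of draws from the standard logistic distribution and compare their variance to \(\pi^{2}/3 \approx 3.29\) (a sketch; any large number of draws will do):
var(rlogis(1000000))  # sample variance of standard logistic draws
pi^2 / 3              # theoretical variance, about 3.2899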
Assume we wanted to draw a random sample of 10 individuals and wondered how likely we were to see \(y=0, 1, 2, \ldots, 10\) men. The answer depends upon the population proportion of men versus women. If 50% of the population is male then we can calculate the probability of seeing each of these outcomes very easily via the binomial formula:
\[\begin{eqnarray*} P\left[X \text{ successes}\right] = \binom{n}{x}p^{x}\left(1 - p\right)^{n-x} \\ \text{where } \binom{n}{x} = \dfrac{n!}{x!(n-x)!} \text{ and } \\ n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1 \end{eqnarray*}\]
Thus, for example, if we wanted to know the probability of finding 3 men in a random sample of 10 adults under the assumption that the population is split 50:50, the probability could be calculated as:
\[\begin{eqnarray*} P\left[3 \text{ men}\right] = \binom{10}{3}0.5^{3}\left(1 - 0.5\right)^{10-3} \end{eqnarray*}\]
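As a check, we can evaluate this by hand in R and compare the result with the built-in binomial density (the two should match exactly; dbinom() is simply the formula above):
choose(10, 3) * 0.5^3 * (1 - 0.5)^(10 - 3)  # the binomial formula by hand
dbinom(3, size = 10, prob = 0.5)            # built-in equivalent; both give 0.1171875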
x = seq(0, 10, by=1)
n = 10
x.d = dbinom(x, n, p=0.5)
round(x.d, digits=4)
## [1] 0.0010 0.0098 0.0439 0.1172 0.2051 0.2461 0.2051 0.1172 0.0439 0.0098
## [11] 0.0010
barplot(x.d)
So clearly we would expect to see zero men with a probability of 0.0010, 1 man with 0.0098, 2 men with 0.0439, and so on. Note that the highest probability is for 5 men … which is what you would expect if men made up 50% of the population (as we have assumed).
In the real world, of course, you don’t know the population proportion but instead are using your sample to figure out how likely is it that what you see in your sample holds for the population as well. In other words, you have \(x\), you have \(n\), but you don’t know \(p_0\). The method of maximum likelihood thus works to figure out, given your sample, the most likely value of \(p_0\).
Thus, for example, say we didn’t know that men made up 50% of the population. In fact we didn’t know what proportion of the population was male. What we would do is try to figure out what population proportion gives us the highest probability of seeing exactly 3 men in a random sample of 10 adults.
x = 3
n = 10
probs = seq(0.01, 0.99, by=0.01)
x.d = dbinom(x, n, p=probs)
round(x.d, digits=4)
## [1] 0.0001 0.0008 0.0026 0.0058 0.0105 0.0168 0.0248 0.0343 0.0452 0.0574
## [11] 0.0706 0.0847 0.0995 0.1146 0.1298 0.1450 0.1600 0.1745 0.1883 0.2013
## [21] 0.2134 0.2244 0.2343 0.2429 0.2503 0.2563 0.2609 0.2642 0.2662 0.2668
## [31] 0.2662 0.2644 0.2614 0.2573 0.2522 0.2462 0.2394 0.2319 0.2237 0.2150
## [41] 0.2058 0.1963 0.1865 0.1765 0.1665 0.1564 0.1464 0.1364 0.1267 0.1172
## [51] 0.1080 0.0991 0.0905 0.0824 0.0746 0.0673 0.0604 0.0540 0.0480 0.0425
## [61] 0.0374 0.0327 0.0285 0.0247 0.0212 0.0181 0.0154 0.0130 0.0108 0.0090
## [71] 0.0074 0.0060 0.0049 0.0039 0.0031 0.0024 0.0019 0.0014 0.0011 0.0008
## [81] 0.0006 0.0004 0.0003 0.0002 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000
## [91] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
plot(probs, x.d, xlab="Proportion of Men in the Population", ylab="Probability of 3 men in n=10", type="l", col="firebrick")
abline(v=0.3, col="cornflowerblue")
What you see in the plot is that the most likely \(p_0\) is \(p_0=0.3\).
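We can confirm this numerically by asking which candidate proportion maximizes the likelihood, using the probs and x.d vectors just created (the same check applied to the next example returns 0.6):
probs[which.max(x.d)]  # the proportion with the highest likelihood: 0.3 = 3/10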
Similarly, if you had ended up with 6 men out of a random sample of 10 adults, what would be the most likely population proportion?
x = 6
n = 10
probs = seq(0.01, 0.99, by=0.01)
x.d = dbinom(x, n, p=probs)
round(x.d, digits=4)
## [1] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001
## [11] 0.0002 0.0004 0.0006 0.0009 0.0012 0.0018 0.0024 0.0032 0.0043 0.0055
## [21] 0.0070 0.0088 0.0109 0.0134 0.0162 0.0195 0.0231 0.0272 0.0317 0.0368
## [31] 0.0422 0.0482 0.0547 0.0616 0.0689 0.0767 0.0849 0.0934 0.1023 0.1115
## [41] 0.1209 0.1304 0.1401 0.1499 0.1596 0.1692 0.1786 0.1878 0.1966 0.2051
## [51] 0.2130 0.2204 0.2271 0.2331 0.2384 0.2427 0.2462 0.2488 0.2503 0.2508
## [61] 0.2503 0.2487 0.2461 0.2424 0.2377 0.2320 0.2253 0.2177 0.2093 0.2001
## [71] 0.1903 0.1798 0.1689 0.1576 0.1460 0.1343 0.1225 0.1108 0.0993 0.0881
## [81] 0.0773 0.0670 0.0573 0.0483 0.0401 0.0326 0.0260 0.0202 0.0153 0.0112
## [91] 0.0078 0.0052 0.0033 0.0019 0.0010 0.0004 0.0001 0.0000 0.0000
plot(probs, x.d, xlab="Proportion of Men in the Population", ylab="Probability of 6 men in n=10", type="l", col="firebrick")
abline(v=0.6, col="cornflowerblue")
Note how the maximum probability is at 0.6.
This is pretty much how the principle of maximum likelihood estimation (MLE) works with regression models, except in far more complex ways: it doesn’t have to find just a single value but instead the intercept, the slope coefficients, their standard errors, and so on. MLE looks at the model you have specified and then figures out the most likely values of \(\beta_0, \beta_1, \ldots, \beta_k\) (and their standard errors) in the population that would generate your sample data.
The logit model is given by:
\[logit(y) = \beta_0 + \beta_1(x)\]
Mathematically, this is expressed as: \[Pr\left(y_i = 1 | x_i\right) = \dfrac{e^{x_{i}\beta}}{1 + e^{x_{i}\beta}}\]
The probit model, instead of working with the log-odds, relies on the cumulative distribution function of the normal distribution. In particular,
\[probit(y) = \beta_0 + \beta_1(x)\]
Mathematically, this is expressed as: \[Pr\left(y_i = 1 | x_i\right) = \int^{x_{i}\beta}_{-\infty}\dfrac{1}{\sqrt{2 \pi}}e^{-\frac{z^{2}}{2}}\,dz = \Phi\left(x_{i}\beta \right)\]
In simple terms you should think of the probit as fitting the outcome \((y)\) as a standard normal variable (a \(z\) score) where \(z = \beta_0 + \beta_1(x)\). That is, the probit function takes your independent variables as you have specified them, calculates \(x_{i}\beta\), and then converts the resulting \(z\) score into a probability.
library(foreign)
psi = read.dta("http://www3.nd.edu/~rwilliam/statafiles/logist.dta")
psi$psi = factor(psi$psi, labels=c("Not in PSI", "In PSI"))
summary(psi)
## grade gpa tuce psi
## Min. :0.0000 Min. :2.060 Min. :12.00 Not in PSI:18
## 1st Qu.:0.0000 1st Qu.:2.812 1st Qu.:19.75 In PSI :14
## Median :0.0000 Median :3.065 Median :22.50
## Mean :0.3438 Mean :3.117 Mean :21.94
## 3rd Qu.:1.0000 3rd Qu.:3.515 3rd Qu.:25.00
## Max. :1.0000 Max. :4.000 Max. :29.00
glm.2 = glm(grade ~ gpa + tuce + psi, data=psi, family=binomial(link="probit"))
summary(glm.2)
##
## Call:
## glm(formula = grade ~ gpa + tuce + psi, family = binomial(link = "probit"),
## data = psi)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9392 -0.6508 -0.2229 0.5934 2.0451
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.45231 2.57152 -2.898 0.00376 **
## gpa 1.62581 0.68973 2.357 0.01841 *
## tuce 0.05173 0.08119 0.637 0.52406
## psiIn PSI 1.42633 0.58695 2.430 0.01510 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.183 on 31 degrees of freedom
## Residual deviance: 25.638 on 28 degrees of freedom
## AIC: 33.638
##
## Number of Fisher Scoring iterations: 6
psi$predictedP = fitted(glm.2)
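To connect these estimates back to maximum likelihood: the sketch below writes out the probit log-likelihood by hand and maximizes it with optim(). This is an illustration only (glm() actually uses Fisher scoring rather than optim()), but the two sets of coefficients should agree to several decimal places.
X = model.matrix(glm.2)   # design matrix: intercept, gpa, tuce, psi dummy
y = psi$grade
negLL = function(b) {     # negative probit log-likelihood
  eta = drop(X %*% b)
  -sum(y * pnorm(eta, log.p = TRUE) +
       (1 - y) * pnorm(eta, lower.tail = FALSE, log.p = TRUE))
}
fit = optim(rep(0, ncol(X)), negLL, method = "BFGS")
cbind(optim = round(fit$par, 4), glm = round(coef(glm.2), 4))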
Let \(gpa=2.06\), \(tuce=22\) and \(psi=1\). Let us see what value the model predicts …
\[-7.45231 + 1.62581(2.06) + 0.05173(22) + 1.42633(1)\]
\[= -1.538751\]
What you have calculated is the \(z\) score for this student. Converting this \(z\) score into a probability will give us the probability of this student getting an A …
pnorm(-1.538751, lower.tail=TRUE)
## [1] 0.06193255
This gives us a probability of 0.06193255. If you compare this with predictedP in the data-frame you will see they are one and the same thing. This is how the probit model works.
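The same numbers can be obtained without any hand calculation, as a quick check (newstudent here is just a hypothetical data-frame holding the covariate values used above):
newstudent = data.frame(gpa = 2.06, tuce = 22, psi = "In PSI")
predict(glm.2, newstudent, type = "link")      # the z score: -1.538751
predict(glm.2, newstudent, type = "response")  # the probability: 0.06193255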
Making sense of probit coefficients is less straightforward than with the logit. Here, about the best we can do by looking at the raw estimates is to note the sign and statistical significance of each coefficient. It is far better, as usual, to rely on graphics and calculated values of predicted probabilities for a student with specific values of the independent variables.
Let us use the same covariate values we used for the logit:
newdata1 = data.frame(tuce=19.97, gpa=2.812, psi="In PSI")
predict(glm.2, newdata1, type="response")
## 1
## 0.3368113
newdata2 = data.frame(tuce=19.97, gpa=3.065, psi="In PSI")
predict(glm.2, newdata2, type="response")
## 1
## 0.4960702
newdata3 = data.frame(tuce=19.97, gpa=3.065, psi="Not in PSI")
predict(glm.2, newdata3, type="response")
## 1
## 0.07547535
Note what I have done here. I created newdata1 through newdata3 as data-frames with specific values of the independent variables. The only difference between newdata1 and newdata2 is the value of gpa, and the only difference between newdata2 and newdata3 is psi. The resulting values of predicted probabilities can then be used to explain the results. This could be useful if the audience wanted to know the impact of specific variables, and how changes in these variables’ values influence the probability of \(y=1\).
Now we do it en masse to see how the probabilities change as gpa changes, comparing those in psi to those not in psi, with tuce held constant.
newdata4 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="Not in PSI")
yhat.6 = predict(glm.2, newdata4, type="response")
yhat.6
## 1 2 3 4 5 6
## 0.001327509 0.002237996 0.003681899 0.005912118 0.009267197 0.014183099
## 7 8 9 10 11 12
## 0.021198450 0.030949269 0.044149402 0.061553888 0.083904512 0.111859788
## 13 14 15 16 17 18
## 0.145914995 0.186321283 0.233015172 0.285570508 0.343183348 0.404696424
## 19 20
## 0.468664208 0.533453149
plot(newdata4$gpa, yhat.6)
newdata5 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="In PSI")
yhat.7 = predict(glm.2, newdata5, type="response")
yhat.7
## 1 2 3 4 5 6
## 0.05719557 0.07836161 0.10499419 0.13763289 0.17659102 0.22188170
## 7 8 9 10 11 12
## 0.27316375 0.32971831 0.39046400 0.45401306 0.51876435 0.58302327
## 13 14 15 16 17 18
## 0.64513365 0.70360466 0.75721674 0.80509414 0.84673733 0.88201526
## 19 20
## 0.91112297 0.93451449
plot(newdata5$gpa, yhat.7)
Let us compare the two predicted probabilities in a single plot:
plot.data = data.frame(cbind(yhat.6, yhat.7, newdata4$gpa, (yhat.7 - yhat.6)))
colnames(plot.data) = c("Yhat6", "Yhat7", "gpa", "Yhat7-Yhat6")
plot(plot.data$gpa, yhat.6, type="b", col="cornflowerblue", xlab="gpa", ylab="Probability grade = A", ylim=c(0,1))
par(new=TRUE)
plot(plot.data$gpa, yhat.7, type="b", col="firebrick", yaxt="n", ylab="", xlab="", ylim=c(0,1))
abline(h=0.5, col="gray")
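Since the probit predicts a probability of exactly 0.5 where the linear predictor equals zero, we can also solve for the gpa at which each group crosses the gray 0.5 line (a small sketch using the fitted coefficients, with tuce held at 19.97 as above):
b = coef(glm.2)  # order: (Intercept), gpa, tuce, psiIn PSI
# z = 0  implies  gpa = -(intercept + tuce effect + psi effect) / gpa slope
-(b[1] + b[3] * 19.97 + b[4]) / b[2]  # In PSI: about 3.07
-(b[1] + b[3] * 19.97) / b[2]         # Not in PSI: about 3.95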
visreg is helpful more globally, in capturing the effects of the independent variables in a more summary format.
library(visreg)
visreg(glm.2, "gpa", by="psi", rug=2)
visreg(glm.2, "gpa", by="psi", rug=2, scale="response", overlay=TRUE)
visreg(glm.2, "gpa", by="psi", rug=2, scale="response")
Notice a couple of things about these plots as well: the first is drawn on the scale of the linear predictor, while the latter two are drawn on the response (probability) scale, once with the two psi groups overlaid and once in separate panels.
McFadden’s \(R^2\) works here as well, and so do the other goodness-of-fit tests.
library(pscl)
pR2(glm.2)
## llh llhNull G2 McFadden r2ML r2CU
## -12.8188033 -20.5917297 15.5458527 0.3774781 0.3848000 0.5315670
library(ResourceSelection)
hl = hoslem.test(psi$grade, fitted(glm.2), g=10)
hl
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: psi$grade, fitted(glm.2)
## X-squared = 6.0702, df = 8, p-value = 0.6394
Again, we are unable to reject the \(H_0\) that the model fit is good.
What about the percent correctly predicted?
library(pscl)
hitmiss(glm.2)
## Classification Threshold = 0.5
## y=0 y=1
## yhat=0 18 3
## yhat=1 3 8
## Percent Correctly Predicted = 81.25%
## Percent Correctly Predicted = 85.71%, for y = 0
## Percent Correctly Predicted = 72.73% for y = 1
## Null Model Correctly Predicts 65.62%
## [1] 81.25000 85.71429 72.72727
Some statisticians argue that the decision to use the 0.5 threshold to convert the predicted probabilities into 0 and 1 is problematic. They would in fact recommend the following:
probabilities = predict(glm.2, type = "response")
percent.correct = (1 - mean(abs(probabilities - psi$grade))) * 100
percent.correct
## [1] 74.20134
You will notice that you get a slightly smaller answer from this latter command. I would recommend using hitmiss() while recognizing that not everyone agrees with this approach.
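If you would rather not depend on the pscl package, the same 0.5-threshold classification table can be built by hand in base R (a sketch; the counts should match hitmiss() above):
predicted = as.numeric(fitted(glm.2) >= 0.5)  # convert probabilities to 0/1
table(observed = psi$grade, predicted = predicted)
mean(predicted == psi$grade) * 100            # percent correctly predicted: 81.25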
Confidence intervals also work the same way:
round(confint(glm.2, level=0.95), digits=4)
## 2.5 % 97.5 %
## (Intercept) -13.1149 -2.9738
## gpa 0.3740 3.1314
## tuce -0.1044 0.2279
## psiIn PSI 0.3286 2.7032
round(confint(glm.2, level=0.99), digits=4)
## 0.5 % 99.5 %
## (Intercept) -15.1747 -1.7206
## gpa 0.0145 3.6698
## tuce -0.1515 0.2884
## psiIn PSI 0.0033 3.1583
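Note that confint() on a glm object profiles the likelihood. Quick Wald intervals (estimate plus or minus a critical value times the standard error) are also available via confint.default() and will be close, though not identical, in a sample this small:
round(confint.default(glm.2, level = 0.95), digits = 4)  # Wald intervals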
What about comparing nested models? Same thing as before …
with(glm.2, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 0.001404895
glm.3 = glm(grade ~ gpa + tuce, data=psi, family=binomial(link="probit"))
summary(glm.3)
##
## Call:
## glm(formula = grade ~ gpa + tuce, family = binomial(link = "probit"),
## data = psi)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3950 -0.7229 -0.4824 0.7625 2.4645
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.03431 2.16642 -2.785 0.00535 **
## gpa 1.40957 0.65097 2.165 0.03036 *
## tuce 0.05267 0.07433 0.709 0.47861
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.183 on 31 degrees of freedom
## Residual deviance: 32.304 on 29 degrees of freedom
## AIC: 38.304
##
## Number of Fisher Scoring iterations: 5
anova(glm.2, glm.3)
## Analysis of Deviance Table
##
## Model 1: grade ~ gpa + tuce + psi
## Model 2: grade ~ gpa + tuce
## Resid. Df Resid. Dev Df Deviance
## 1 28 25.638
## 2 29 32.304 -1 -6.6667
These tests can also be run via the lmtest package.
library(lmtest)
lrtest(glm.3, glm.2)
## Likelihood ratio test
##
## Model 1: grade ~ gpa + tuce
## Model 2: grade ~ gpa + tuce + psi
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -16.152
## 2 4 -12.819 1 6.6667 0.009823 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, the same goes for the AIC and the BIC …
AIC(glm.2); BIC(glm.2)
## [1] 33.63761
## [1] 39.50055
AIC(glm.3); BIC(glm.3)
## [1] 38.30431
## [1] 42.70152
You can use the same diagnostics for probit models that you used for regression models. There are some differences in how residuals are defined, etc. but we’ll ignore these mathematical details.
library(car)
marginalModelPlots(glm.2)
avPlots(glm.2)
residualPlots(glm.2)
## Test stat Pr(>|t|)
## gpa 2.174 0.140
## tuce 0.131 0.717
## psi NA NA
influenceIndexPlot(glm.2)
influencePlot(glm.2)
## StudRes Hat CookD
## 2 2.3366794 0.1526221 0.37697378
## 7 -0.9006084 0.3150740 0.06311229
outlierTest(glm.2)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 2 2.336679 0.019456 0.62259
vif(glm.2)
## gpa tuce psi
## 1.158296 1.104320 1.071533
There has been considerable debate on whether these models need an interaction effect to be explicitly modeled. Some folks argue that you don’t need it, that basically you can fit a model such as this:
\[logit(grade) = \beta_0 + \beta_1(gpa) + \beta_2(tuce) + \beta_3(psi)\]
and still talk about how two or more variables interact to change the probability of the student earning an A.
Others argue that you need to specify the interaction in the model itself. This latter group argues that the model should then look like this if you think psi interacts with gpa:
\[logit(grade) = \beta_0 + \beta_1(gpa) + \beta_2(tuce) + \beta_3(psi) + \beta_4(gpa * psi)\]
My recommendation: Include the interaction in the model. Then use visreg to see how the predicted probability changes as the values of the variables that make up the interaction change.
glm.4 = glm(grade ~ gpa + tuce + psi + gpa:psi, data=psi, family=binomial(link="probit"))
summary(glm.4)
##
## Call:
## glm(formula = grade ~ gpa + tuce + psi + gpa:psi, family = binomial(link = "probit"),
## data = psi)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.73973 -0.59883 -0.08161 0.57457 1.98176
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3819 5.6864 -2.177 0.0294 *
## gpa 3.1152 1.5879 1.962 0.0498 *
## tuce 0.0459 0.0876 0.524 0.6003
## psiIn PSI 8.2143 5.9857 1.372 0.1700
## gpa:psiIn PSI -2.0480 1.7475 -1.172 0.2412
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.183 on 31 degrees of freedom
## Residual deviance: 23.976 on 27 degrees of freedom
## AIC: 33.976
##
## Number of Fisher Scoring iterations: 8
psi$predictedP = fitted(glm.4)
AIC(glm.2, glm.4); BIC(glm.2, glm.4)
## df AIC
## glm.2 4 33.63761
## glm.4 5 33.97578
## df BIC
## glm.2 4 39.50055
## glm.4 5 41.30446
library(visreg)
visreg(glm.4, "gpa", by="psi", rug=2, scale="response", overlay=TRUE)
visreg(glm.4, "gpa", by="psi", rug=2, scale="response")
library(pscl)
pR2(glm.4)
## llh llhNull G2 McFadden r2ML r2CU
## -11.9878895 -20.5917297 17.2076805 0.4178299 0.4159333 0.5745748
library(ResourceSelection)
hl = hoslem.test(psi$grade, fitted(glm.4), g=10)
hl
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: psi$grade, fitted(glm.4)
## X-squared = 5.1669, df = 8, p-value = 0.7396
hitmiss(glm.4)
## Classification Threshold = 0.5
## y=0 y=1
## yhat=0 18 2
## yhat=1 3 9
## Percent Correctly Predicted = 84.38%
## Percent Correctly Predicted = 85.71%, for y = 0
## Percent Correctly Predicted = 81.82% for y = 1
## Null Model Correctly Predicts 65.62%
## [1] 84.37500 85.71429 81.81818
library(lmtest)
lrtest(glm.2, glm.4)
## Likelihood ratio test
##
## Model 1: grade ~ gpa + tuce + psi
## Model 2: grade ~ gpa + tuce + psi + gpa:psi
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 4 -12.819
## 2 5 -11.988 1 1.6618 0.1974
So should we go with the model that has the interaction (glm.4) or the one without an interaction (glm.2)? This is how I would decide which model to go with, weighing the evidence listed above (the AIC/BIC comparison, the percent correctly predicted, and the likelihood-ratio test).
If my goal was to have the best possible predictive model then I would go with glm.4 because this gives me a better fit to the data (look at the percent correctly predicted). If, however, I was testing a theory about the interaction between gpa and psi then I would show glm.4 so people could see the interaction is indeed not significant.
What would happen if I was theory testing and went with glm.2 instead? Anybody who looked at my visreg plots would ‘see’ an interaction in the plots of gpa by psi because the predicted probabilities would not look like parallel curves. This would make the reader think gpa has a different effect for the psi group than it does for the non-psi group (which is what an interaction is supposed to test). The reader would leave with a mistaken conclusion because had they seen glm.4 they would have realized there is no significant interaction.
Take-away: Always explicitly include the interaction term if it is a part of your theory. If you forget to include it or don’t want to for some reason, DO NOT interpret any plots or predicted probabilities as resulting from an interaction of this and that variables.
The logit and probit models are very similar and which one you choose is a matter of habit and ease. Traditionally, the logit model was easier to compute so most analysts veer in this direction. There are some differences of course:
glm.0 = glm(grade ~ gpa + tuce + psi, data=psi, family=binomial(link="logit"))
newdata4 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="Not in PSI")
yhat.4 = predict(glm.0, newdata4, type="response")
newdata5 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="In PSI")
yhat.5 = predict(glm.0, newdata5, type="response")
yhat.6 = predict(glm.2, newdata4, type="response")
yhat.7 = predict(glm.2, newdata5, type="response")
plot.data2 = data.frame(cbind(yhat.4, yhat.5, yhat.6, yhat.7, newdata4$gpa))
colnames(plot.data2) = c("Logit (no A)", "Probit (no A)", "Logit (A)", "Probit (A)", "gpa")
plot(plot.data2$gpa, yhat.4, type="b", col="cornflowerblue", xlab="gpa", ylab="Probability grade = A", ylim=c(0,1))
par(new=TRUE)
plot(plot.data2$gpa, yhat.6, type="b", col="firebrick", yaxt="n", ylab="", xlab="", ylim=c(0,1))
abline(h=0.5, col="gray")
plot(plot.data2$gpa, yhat.5, type="b", col="cornflowerblue", xlab="gpa", ylab="Probability grade = A", ylim=c(0,1))
par(new=TRUE)
plot(plot.data2$gpa, yhat.7, type="b", col="firebrick", yaxt="n", ylab="", xlab="", ylim=c(0,1))
abline(h=0.5, col="gray")
You see the slight drift between the two sets of estimates (logit predicted probabilities are in blue).
The variances are different. For the logit, the variance is set to \(\dfrac{\pi^{2}}{3}\) while for the probit, since this is the standard normal distribution (i.e., the \(z\) score distribution) the variance is set to \(1\).
If you multiply the probit coefficients by \(1.81\) (some say 1.6) you will get the approximate logit coefficients. If you multiply the logit coefficients by \(0.55\) (some say 0.625) you will get the approximate probit coefficients. Why? Because of the ratio of the standard deviations of the two error distributions: \(\dfrac{\pi}{\sqrt{3}} \approx 1.81\).
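You can check this rule of thumb directly with the two fitted models (glm.0 is the logit fit from below); the match is approximate, not exact:
coef(glm.2) * pi / sqrt(3)  # probit coefficients scaled up by about 1.81
coef(glm.0)                 # the actual logit coefficients, for comparison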
Some folks will fit both and then see which link function does a better job of predicting the outcomes, and then will opt to work with this link function.
library(stargazer)
stargazer(glm.0, glm.2, glm.4, type="html", covariate.labels = c("Incoming GPA", "TUCE (Pre-test Score)", "PSI Group", "Incoming GPA * PSI Group", "Intercept"))
Dependent variable: grade

| | logistic (1) | probit (2) | probit (3) |
|---|---|---|---|
| Incoming GPA | 2.826** | 1.626** | 3.115** |
| | (1.263) | (0.690) | (1.588) |
| TUCE (Pre-test Score) | 0.095 | 0.052 | 0.046 |
| | (0.142) | (0.081) | (0.088) |
| PSI Group | 2.379** | 1.426** | 8.214 |
| | (1.065) | (0.587) | (5.986) |
| Incoming GPA * PSI Group | | | -2.048 |
| | | | (1.747) |
| Intercept | -13.021*** | -7.452*** | -12.382** |
| | (4.931) | (2.572) | (5.686) |
| Observations | 32 | 32 | 32 |
| Log Likelihood | -12.890 | -12.819 | -11.988 |
| Akaike Inf. Crit. | 33.779 | 33.638 | 33.976 |

Note: *p<0.1; **p<0.05; ***p<0.01
Many of the wells used for drinking water in Bangladesh and other South Asian countries are contaminated with natural arsenic, affecting an estimated 100 million people. Exposure to arsenic increases the risk of cancer and other diseases, with the risk proportional to exposure.
Any locality can include wells with a range of arsenic levels, and even if your neighbor’s well is safe there is no guarantee that your well is safe. However, if your neighbor is willing to share water with you, you can walk over and collect water from their well. The nearest safe well may be quite a distance away, though.
A research team measured and labeled wells in several villages, tagging the wells as “safe” (if arsenic levels were below 0.5 in units of hundreds of micrograms per liter, the standard set by Bangladesh) or “unsafe” (if arsenic levels exceeded this threshold). People with unsafe wells were encouraged to dig deeper tube wells or find safe wells from which to get water. A few years later the team returned to see who had switched wells. The dependent variable is whether someone switched \((y=1)\) or not \((y=0)\). We have the following independent variables: the arsenic level of the respondent’s well (arsenic), the distance to the nearest safe well (dist, in meters), the education of the head of household (educ, in years), and whether or not anyone in the household belongs to a community association (assoc).
The data are in wells.RData, so download and save this file to your working directory.
Note: distance is in meters so first convert it into 100-meter units and then fit the following model:
\[Pr(switched) = \beta_{0} + \beta_{1}(arsenic) + \beta_{2}(distance) + \beta_{3}(education) + \beta_{4}(association)\]
load("/Users/ruhil/Box Sync/MPA6020/Data/wells.RData")
wells$dist100 = wells$dist/100
glm.w0 = glm(switched ~ arsenic + dist100 + educ + assoc, data=wells, family=binomial(link="probit"))
summary(glm.w0)
##
## Call:
## glm(formula = switched ~ arsenic + dist100 + educ + assoc, family = binomial(link = "probit"),
## data = wells)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7310 -1.2001 0.7601 1.0628 1.6658
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.084460 0.061307 -1.378 0.1683
## arsenic 0.276508 0.024295 11.381 < 2e-16 ***
## dist100 -0.546022 0.063528 -8.595 < 2e-16 ***
## educ 0.026587 0.005881 4.521 6.15e-06 ***
## assocBelongs -0.079645 0.047428 -1.679 0.0931 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3909.7 on 3015 degrees of freedom
## AIC: 3919.7
##
## Number of Fisher Scoring iterations: 4
wells$predicted = predict(glm.w0, type="response")
plot(wells$arsenic, wells$predicted)
library(visreg)
visreg(glm.w0, "arsenic", by="dist100", rug=2, scale="response", overlay=TRUE)
Of course this is not the be-all and end-all of binary dependent variables. There are several other types of models that may be called for. Below I am giving you snippets of two extensions you may end up seeing in the literature more often than other extensions.
The Scobit: There is something known as the “skewed probit”, designed for a particular situation. The way the logit and probit curves are estimated, the independent variables are treated as having their maximal impact when \(Pr(y_{i}=1|x) = 0.5\), that is, right when the probability of success is 0.5. At this point even a small increase in an independent variable with a positive influence on \(y\) can push the probability of success above 0.5. This may not be true of the phenomenon you are studying and if that is the case, then the scobit would be recommended because it uses the data you have to estimate the point of maximal impact rather than assume it to be at \(0.5\). Unfortunately, though, these models are very hard to estimate and often will simply fail to converge, leaving you with no results whatsoever.
The Heteroscedastic Probit (and Logit): Because the probit model is built on the standard normal distribution (the \(z\) distribution), by assumption the residuals are forced to have a mean of \(0\) and a constant variance of \(1\). The problem is this assumption of a constant variance of \(1\); there is no reason to believe that heteroscedasticity can only arise in a linear regression setting. In fact, all the factors that gave rise to heteroscedasticity in a linear regression setting are also in effect here. The same is true for logit models (where again the variance is assumed to be a constant); heteroscedasticity could be an issue with logits as well.
As a result, several analysts try to correct for heteroscedasticity by running what is known as a heteroscedastic probit. Not everybody agrees with this approach or thinks that it is an acceptable way to deal with heteroscedasticity. However, if you really need to do it, here is an example of how to fit such a model:
library(glmx)
hglm.wP = hetglm(switched ~ arsenic + dist100 + educ + assoc, data=wells, family=binomial(link="probit"))
summary(hglm.wP)
##
## Call:
## hetglm(formula = switched ~ arsenic + dist100 + educ + assoc, data = wells,
## family = binomial(link = "probit"))
##
## Deviance residuals:
## Min 1Q Median 3Q Max
## -2.0500 -1.1811 0.7375 1.0479 1.8877
##
## Coefficients (binomial model with probit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.48868 0.14735 -3.316 0.000912 ***
## arsenic 0.78503 0.16268 4.826 1.40e-06 ***
## dist100 -0.96513 0.20413 -4.728 2.27e-06 ***
## educ 0.03357 0.01212 2.770 0.005603 **
## assocBelongs -0.02700 0.08931 -0.302 0.762412
##
## Latent scale model coefficients (with log link):
## Estimate Std. Error z value Pr(>|z|)
## arsenic 0.24385 0.05145 4.740 2.14e-06 ***
## dist100 0.16018 0.16052 0.998 0.3184
## educ -0.01104 0.01573 -0.702 0.4828
## assocBelongs 0.34836 0.14537 2.396 0.0166 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: -1944 on 9 Df
## LR test for homoskedasticity: 22.61 on 4 Df, p-value: 0.0001517
## Dispersion: 1
## Number of iterations in nlminb optimization: 16
hglm.wL = hetglm(switched ~ arsenic + dist100 + educ + assoc, data=wells, family=binomial(link="logit"))
summary(hglm.wL)
##
## Call:
## hetglm(formula = switched ~ arsenic + dist100 + educ + assoc, data = wells,
## family = binomial(link = "logit"))
##
## Deviance residuals:
## Min 1Q Median 3Q Max
## -2.0264 -1.1801 0.7378 1.0468 1.8685
##
## Coefficients (binomial model with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.79248 0.24054 -3.295 0.000986 ***
## arsenic 1.27238 0.26971 4.718 2.39e-06 ***
## dist100 -1.56452 0.33778 -4.632 3.63e-06 ***
## educ 0.05409 0.01966 2.751 0.005940 **
## assocBelongs -0.04192 0.14430 -0.290 0.771441
##
## Latent scale model coefficients (with log link):
## Estimate Std. Error z value Pr(>|z|)
## arsenic 0.24122 0.05341 4.516 6.29e-06 ***
## dist100 0.16537 0.16741 0.988 0.323
## educ -0.01175 0.01635 -0.719 0.472
## assocBelongs 0.35959 0.14923 2.410 0.016 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: -1944 on 9 Df
## LR test for homoskedasticity: 20.47 on 4 Df, p-value: 0.0004027
## Dispersion: 1
## Number of iterations in nlminb optimization: 16
glm.wP = glm(switched ~ arsenic + dist100 + educ + assoc, data=wells, family=binomial(link="probit"))
summary(glm.wP)
##
## Call:
## glm(formula = switched ~ arsenic + dist100 + educ + assoc, family = binomial(link = "probit"),
## data = wells)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7310 -1.2001 0.7601 1.0628 1.6658
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.084460 0.061307 -1.378 0.1683
## arsenic 0.276508 0.024295 11.381 < 2e-16 ***
## dist100 -0.546022 0.063528 -8.595 < 2e-16 ***
## educ 0.026587 0.005881 4.521 6.15e-06 ***
## assocBelongs -0.079645 0.047428 -1.679 0.0931 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3909.7 on 3015 degrees of freedom
## AIC: 3919.7
##
## Number of Fisher Scoring iterations: 4
glm.wL = glm(switched ~ arsenic + dist100 + educ + assoc, data=wells, family=binomial(link="logit"))
summary(glm.wL)
##
## Call:
## glm(formula = switched ~ arsenic + dist100 + educ + assoc, family = binomial(link = "logit"),
## data = wells)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5942 -1.1976 0.7541 1.0632 1.6739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.156712 0.099601 -1.573 0.116
## arsenic 0.467022 0.041602 11.226 < 2e-16 ***
## dist100 -0.896110 0.104576 -8.569 < 2e-16 ***
## educ 0.042447 0.009588 4.427 9.55e-06 ***
## assocBelongs -0.124300 0.076966 -1.615 0.106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3907.8 on 3015 degrees of freedom
## AIC: 3917.8
##
## Number of Fisher Scoring iterations: 4
library(stargazer)
stargazer(glm.wL, glm.wP, hglm.wL, hglm.wP, type="html")
Dependent variable: switched

| | logistic (1) | probit (2) | heteroskedastic GLM (3) | heteroskedastic GLM (4) |
|---|---|---|---|---|
| arsenic | 0.467*** | 0.277*** | 1.272*** | 0.785*** |
| | (0.042) | (0.024) | (0.270) | (0.163) |
| dist100 | -0.896*** | -0.546*** | -1.565*** | -0.965*** |
| | (0.105) | (0.064) | (0.338) | (0.204) |
| educ | 0.042*** | 0.027*** | 0.054*** | 0.034*** |
| | (0.010) | (0.006) | (0.020) | (0.012) |
| assocBelongs | -0.124 | -0.080* | -0.042 | -0.027 |
| | (0.077) | (0.047) | (0.144) | (0.089) |
| Constant | -0.157 | -0.084 | -0.792*** | -0.489*** |
| | (0.100) | (0.061) | (0.241) | (0.147) |
| Observations | 3,020 | 3,020 | 3,020 | 3,020 |
| Log Likelihood | -1,953.913 | -1,954.872 | -1,943.677 | -1,943.568 |
| Akaike Inf. Crit. | 3,917.826 | 3,919.744 | | |

Note: *p<0.1; **p<0.05; ***p<0.01
Note that the null hypothesis is one of no heteroscedasticity (i.e., of constant variance), so if this gets rejected then it means we have heteroscedasticity and should use the results from the heteroscedastic fits (hglm.wL and hglm.wP).
Note also that in the latent scale model some of the variables (dist100 and educ) are not significant while one (assoc) is, the reverse of the pattern in the mean equation.
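Finally, the LR test for homoskedasticity printed in each summary can be reproduced by hand by comparing the log-likelihoods of the heteroscedastic and ordinary probit fits. A sketch, assuming glmx supplies a logLik() method for hetglm objects:
LR = 2 * (as.numeric(logLik(hglm.wP)) - as.numeric(logLik(glm.wP)))
LR                                      # should match the 22.61 reported above
pchisq(LR, df = 4, lower.tail = FALSE)  # df = number of scale-model coefficients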