Model Fit Statistics to report as a rule

  1. The pseudo-\(R^2\)
library(pscl)
pR2(glm.2) 
  2. The Percent Correctly Predicted
library(pscl)
hitmiss(glm.2)
  3. The Hosmer-Lemeshow Goodness-of-Fit Test
library(ResourceSelection)
hl = hoslem.test(psi$grade, fitted(glm.2), g=10)
hl
  4. The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC)
AIC(glm.2); BIC(glm.2)

The Latent Variable Formulation of the Logit and Probit

Think of the outcome we are modeling in terms of a latent variable – something that isn’t observed; what is observed (and measured) instead is some other version of this outcome. Assume that there really exists some continuous variable \(y^{*}\) that reflects a student’s TRUE probability of earning an A. Since all we observe is \(y \in \{0, 1\}\) – whether a student did or did not earn an A – there is a slight disconnect between \(y^{*}\) and \(y\). Mathematically, it is always easier to think of models like the logit and probit in terms of latent variables, in which case the logit and probit models can be depicted as follows:

\[y^{*}_{i} = x_{i}\beta + e_{i}\]

\[y=\begin{cases} 1 & \text{if } y^{*}_{i} > \tau, \\ 0 & \text{if }y^{*}_{i} \leq \tau \end{cases}\]

Note that \(\tau\) is some threshold that, when crossed, lets us observe \(y=1\) and if the threshold is not crossed then we observe \(y=0\).
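To make the latent-variable idea concrete, here is a minimal simulation (with invented data, not the psi data used later): we build \(y^{*}\) with logistic errors, observe only whether it crosses \(\tau = 0\), and check that a logit model recovers the coefficients we built in.

set.seed(123)
n = 500
x = rnorm(n)
ystar = 0.5 + 1.5 * x + rlogis(n)  # the latent variable; never observed
y = ifelse(ystar > 0, 1, 0)        # all we actually get to see
coef(glm(y ~ x, family = binomial(link = "logit")))  # close to (0.5, 1.5)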

Just as in the linear regression case, here too we know that the mean of the residuals will be \(0\). However, because we don’t see \(y^*\) we don’t know how the residuals might vary. We thus have to assume some distribution that would be a reasonable approximation for \(y^{*}\). Since we are dealing with a binary outcome, the binomial function (more on this below) is one of the simplest generators of binary outcomes. Therefore, when we specify which distribution we are assuming, in R we say the outcome comes from family=binomial().

When we specify link=“logit” we are saying the logit function describes the distribution of outcomes well. For the logit function the errors are assumed to have variance \(Var\left(\epsilon | x\right) =\dfrac{\pi^{2}}{3}\), while for the probit the errors are assumed to have variance \(Var\left(\epsilon | x\right) =1\). If we could not specify which specific function within the binomial family describes the outcomes, we would be unable to model these data, because maximum likelihood estimation needs to know what variance to assume for the residuals before it can give you your logit/probit results.
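Although we never observe these errors, a quick simulation verifies what each assumption implies:

pi^2 / 3             # 3.289868, the assumed logit error variance
var(rlogis(100000))  # simulated logistic errors: approximately 3.29
var(rnorm(100000))   # simulated normal (probit) errors: approximately 1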

Note: Our choice of logit versus probit is somewhat arbitrary because there is no way to use our sample and our models to verify that the variance of the errors when predicting \(y^{*}\) does indeed follow the logistic or the normal distribution.

What is Maximum Likelihood

Assume we draw a random sample of 10 individuals and wonder how likely we are to see \(y=0, 1, 2, \ldots, 10\) men. The answer depends upon the population proportion of men versus women. If 50% of the population is male then we can calculate the probability of each of these outcomes very easily via the binomial formula:

\[\begin{eqnarray*} P\left[X \text{ successes}\right] = \binom{n}{x}p^{x}\left(1 - p\right)^{n-x} \\ \text{where } \binom{n}{x} = \dfrac{n!}{x!(n-x)!} \text{ and } \\ n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1 \end{eqnarray*}\]

Thus, for example, if we wanted to know the probability of finding 3 men in a random sample of 10 adults under the assumption that the population is split 50:50, the probability could be calculated as:

\[\begin{eqnarray*} P\left[3 \text{ men}\right] = \binom{10}{3}0.5^{3}\left(1 - 0.5\right)^{10-3} \end{eqnarray*}\]
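You can verify this arithmetic in R; dbinom(), R’s built-in binomial density, returns the same value as the hand formula:

choose(10, 3) * 0.5^3 * (1 - 0.5)^7  # the formula by hand: 0.1171875
dbinom(3, size = 10, prob = 0.5)     # the same value from R

Extending this to every possible number of men: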

x = seq(0, 10, by=1)          # the possible numbers of men in a sample of 10
n = 10                        # the sample size
x.d = dbinom(x, n, prob=0.5)  # binomial probabilities assuming p = 0.5
round(x.d, digits=4)
##  [1] 0.0010 0.0098 0.0439 0.1172 0.2051 0.2461 0.2051 0.1172 0.0439 0.0098
## [11] 0.0010
barplot(x.d)

So clearly we would expect to see no men with a probability of 0.0010, 1 man with 0.0098, 2 men with 0.0439, and so on. Note that the highest probability is for 5 men … which is what you would expect if men made up 50% of the population (as we have assumed).

In the real world, of course, you don’t know the population proportion but instead are using your sample to figure out how likely is it that what you see in your sample holds for the population as well. In other words, you have \(x\), you have \(n\), but you don’t know \(p_0\). The method of maximum likelihood thus works to figure out, given your sample, the most likely value of \(p_0\).

Thus, for example, say we didn’t know that men made up 50% of the population. In fact we didn’t know what proportion of the population was male. What we would do is try to figure out what population proportion gives us the highest probability of seeing exactly 3 men in a random sample of 10 adults.

x = 3                             # the observed number of men
n = 10                            # the sample size
probs = seq(0.01, 0.99, by=0.01)  # candidate population proportions
x.d = dbinom(x, n, prob=probs)    # likelihood of seeing 3 men under each candidate
round(x.d, digits=4)
##  [1] 0.0001 0.0008 0.0026 0.0058 0.0105 0.0168 0.0248 0.0343 0.0452 0.0574
## [11] 0.0706 0.0847 0.0995 0.1146 0.1298 0.1450 0.1600 0.1745 0.1883 0.2013
## [21] 0.2134 0.2244 0.2343 0.2429 0.2503 0.2563 0.2609 0.2642 0.2662 0.2668
## [31] 0.2662 0.2644 0.2614 0.2573 0.2522 0.2462 0.2394 0.2319 0.2237 0.2150
## [41] 0.2058 0.1963 0.1865 0.1765 0.1665 0.1564 0.1464 0.1364 0.1267 0.1172
## [51] 0.1080 0.0991 0.0905 0.0824 0.0746 0.0673 0.0604 0.0540 0.0480 0.0425
## [61] 0.0374 0.0327 0.0285 0.0247 0.0212 0.0181 0.0154 0.0130 0.0108 0.0090
## [71] 0.0074 0.0060 0.0049 0.0039 0.0031 0.0024 0.0019 0.0014 0.0011 0.0008
## [81] 0.0006 0.0004 0.0003 0.0002 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000
## [91] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
plot(probs, x.d, xlab="Proportion of Men in the Population", ylab="Probability of 3 men in n=10", type="l", col="firebrick")
abline(v=0.3, col="cornflowerblue")

What you see in the plot is that the most likely value of \(p_0\) is \(p_0 = 0.3\).

Similarly, if you had ended up with 6 men out of a random sample of 10 adults, what would be the most likely population proportion?

x = 6                             # now the observed number of men is 6
n = 10
probs = seq(0.01, 0.99, by=0.01)
x.d = dbinom(x, n, prob=probs)    # likelihood of seeing 6 men under each candidate
round(x.d, digits=4)
##  [1] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001
## [11] 0.0002 0.0004 0.0006 0.0009 0.0012 0.0018 0.0024 0.0032 0.0043 0.0055
## [21] 0.0070 0.0088 0.0109 0.0134 0.0162 0.0195 0.0231 0.0272 0.0317 0.0368
## [31] 0.0422 0.0482 0.0547 0.0616 0.0689 0.0767 0.0849 0.0934 0.1023 0.1115
## [41] 0.1209 0.1304 0.1401 0.1499 0.1596 0.1692 0.1786 0.1878 0.1966 0.2051
## [51] 0.2130 0.2204 0.2271 0.2331 0.2384 0.2427 0.2462 0.2488 0.2503 0.2508
## [61] 0.2503 0.2487 0.2461 0.2424 0.2377 0.2320 0.2253 0.2177 0.2093 0.2001
## [71] 0.1903 0.1798 0.1689 0.1576 0.1460 0.1343 0.1225 0.1108 0.0993 0.0881
## [81] 0.0773 0.0670 0.0573 0.0483 0.0401 0.0326 0.0260 0.0202 0.0153 0.0112
## [91] 0.0078 0.0052 0.0033 0.0019 0.0010 0.0004 0.0001 0.0000 0.0000
plot(probs, x.d, xlab="Proportion of Men in the Population", ylab="Probability of 6 men in n=10", type="l", col="firebrick")
abline(v=0.6, col="cornflowerblue")

Note how the maximum probability is at 0.6.
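Rather than scanning a grid of candidate proportions by hand, we could also hand the likelihood to a numerical optimizer. A minimal sketch using R’s optimize():

# maximize the binomial likelihood of seeing 6 men in n = 10 over p
optimize(function(p) dbinom(6, size = 10, prob = p),
         interval = c(0, 1), maximum = TRUE)$maximum  # approximately 0.6, i.e., 6/10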

This is pretty much how the principle of maximum likelihood estimation (MLE) works with regression models, though of course in far more complex ways: it doesn’t have to find just a single value, but rather the intercept, the slope coefficients, their standard errors, and so on. MLE looks at the model you have specified and then tries to figure out the most likely values of \(\beta_0, \beta_1, \ldots, \beta_k\) (and their standard errors) in the population that would generate your sample data.
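Here is a minimal sketch of that idea (on simulated data invented for illustration, not the course data): we write out the log-likelihood of a simple logit, maximize it with optim(), and confirm that glm() lands on essentially the same estimates.

set.seed(42)
n = 200
x = rnorm(n)
y = rbinom(n, 1, plogis(-0.5 + 1.2 * x))  # true intercept -0.5, slope 1.2
negLL = function(b) {
  p = plogis(b[1] + b[2] * x)             # Pr(y = 1) implied by candidate betas
  -sum(dbinom(y, 1, p, log = TRUE))       # negative log-likelihood
}
optim(c(0, 0), negLL)$par                 # MLE by direct numerical search
coef(glm(y ~ x, family = binomial(link = "logit")))  # glm() agrees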

Probit Models

Whereas the logit model is given by:

\[logit(y) = \beta_0 + \beta_1(x)\]

Mathematically, this is expressed as: \[Pr\left(y_i = 1 | x_i\right) = \dfrac{e^{x_{i}\beta}}{1 + e^{x_{i}\beta}}\]
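The right-hand side is the inverse-logit function; R’s plogis() computes exactly this quantity, which is a handy way to convert log-odds into probabilities by hand:

xb = 0.75                # some value of the linear predictor
exp(xb) / (1 + exp(xb))  # the formula above: 0.6791787
plogis(xb)               # the same value from R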

The probit model, instead of working with the log-odds, relies on the cumulative distribution function of the normal distribution. In particular,

\[probit(y) = \beta_0 + \beta_1(x)\]

Mathematically, this is expressed as: \[Pr\left(y_i = 1 | x_i\right) = \int^{x_{i}\beta}_{-\infty}\dfrac{1}{\sqrt{2 \pi}}e^{-\frac{z^{2}}{2}}\,dz = \Phi\left(x_{i}\beta \right)\]

In simple terms you should think of the probit as fitting the outcome \((y)\) as a standard normal variable (a \(z\) score) where \(z = \beta_0 + \beta_1(x)\). That is, the probit function takes your independent variables as you have specified them, calculates \(x_{i}\beta\), and treats this value as a \(z\) score that is converted into a probability via the standard normal distribution.

library(foreign)
psi = read.dta("http://www3.nd.edu/~rwilliam/statafiles/logist.dta")
psi$psi = factor(psi$psi, labels=c("Not in PSI", "In PSI"))
summary(psi)
##      grade             gpa             tuce               psi    
##  Min.   :0.0000   Min.   :2.060   Min.   :12.00   Not in PSI:18  
##  1st Qu.:0.0000   1st Qu.:2.812   1st Qu.:19.75   In PSI    :14  
##  Median :0.0000   Median :3.065   Median :22.50                  
##  Mean   :0.3438   Mean   :3.117   Mean   :21.94                  
##  3rd Qu.:1.0000   3rd Qu.:3.515   3rd Qu.:25.00                  
##  Max.   :1.0000   Max.   :4.000   Max.   :29.00
glm.2 = glm(grade ~ gpa + tuce + psi, data=psi, family=binomial(link="probit"))
summary(glm.2)
## 
## Call:
## glm(formula = grade ~ gpa + tuce + psi, family = binomial(link = "probit"), 
##     data = psi)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9392  -0.6508  -0.2229   0.5934   2.0451  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -7.45231    2.57152  -2.898  0.00376 **
## gpa          1.62581    0.68973   2.357  0.01841 * 
## tuce         0.05173    0.08119   0.637  0.52406   
## psiIn PSI    1.42633    0.58695   2.430  0.01510 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.183  on 31  degrees of freedom
## Residual deviance: 25.638  on 28  degrees of freedom
## AIC: 33.638
## 
## Number of Fisher Scoring iterations: 6
psi$predictedP = fitted(glm.2)

Let \(gpa=2.06\), \(tuce=22\) and \(psi=1\). Let us see what value the model predicts …

\[-7.45231 + 1.62581(2.06) + 0.05173(22) + 1.42633(1)\]

\[= -1.538751\]

What you have calculated is the \(z\) score for this student. Converting this \(z\) score into a probability will give us the probability of this student getting an A …

pnorm(-1.538751, lower.tail=TRUE)
## [1] 0.06193255

This gives us a probability of 0.06193255. If you compare this with predictedP in the data-frame you will see they are one and the same thing. This is how the probit model works.
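A quick way to confirm this mechanic for all students at once, using the glm.2 fit from above: asking predict() for the link returns each student’s \(z\) score, and pnorm() of those scores reproduces the fitted probabilities exactly.

z = predict(glm.2, type="link")  # the z score (x*beta) for every student
head(pnorm(z))                   # converted into probabilities ...
head(fitted(glm.2))              # ... identical to the fitted values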

Interpreting Probit Coefficients

Making sense of probit coefficients is less straightforward than with the logit. Here, the best we can do by looking at the raw estimates is to say:

  • Holding all else constant, a unit increase in gpa increases the probability of an A
  • Holding all else constant, being in psi increases the probability of an A

It is far better, as usual, to rely on graphics and calculated values of predicted probabilities for a student with specific values of the independent variables.

Calculating Probabilities

Let us use the same covariate values we used for the logit:

newdata1 = data.frame(tuce=19.97, gpa=2.812, psi="In PSI")
predict(glm.2, newdata1, type="response")
##         1 
## 0.3368113
newdata2 = data.frame(tuce=19.97, gpa=3.065, psi="In PSI")
predict(glm.2, newdata2, type="response")
##         1 
## 0.4960702
newdata3 = data.frame(tuce=19.97, gpa=3.065, psi="Not in PSI")
predict(glm.2, newdata3, type="response")
##          1 
## 0.07547535

Note what I have done here. I created newdata to be a data-frame with specific values of the independent variables. The only difference between newdata1 and newdata2 is the value of gpa.

  • In newdata1 gpa is set to the first quartile value and in newdata2 it is set to the median value.
  • Similarly, newdata3 differs from newdata2 only in terms of the individual being in psi (see newdata2) and not being in psi (see newdata3).

The resulting values of predicted probabilities can then be used to explain the results.

This could be useful if the audience wanted to know the impact of specific variables, and how changes in these variables’ values influence the probability of \(y=1\).

Now we do it en masse to see how probabilities change as gpa changes, comparing those in psi versus those not in psi, with tuce held constant.

newdata4 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="Not in PSI")
yhat.6 = predict(glm.2, newdata4, type="response")
yhat.6
##           1           2           3           4           5           6 
## 0.001327509 0.002237996 0.003681899 0.005912118 0.009267197 0.014183099 
##           7           8           9          10          11          12 
## 0.021198450 0.030949269 0.044149402 0.061553888 0.083904512 0.111859788 
##          13          14          15          16          17          18 
## 0.145914995 0.186321283 0.233015172 0.285570508 0.343183348 0.404696424 
##          19          20 
## 0.468664208 0.533453149
plot(newdata4$gpa, yhat.6)

newdata5 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="In PSI")
yhat.7 = predict(glm.2, newdata5, type="response")
yhat.7
##          1          2          3          4          5          6 
## 0.05719557 0.07836161 0.10499419 0.13763289 0.17659102 0.22188170 
##          7          8          9         10         11         12 
## 0.27316375 0.32971831 0.39046400 0.45401306 0.51876435 0.58302327 
##         13         14         15         16         17         18 
## 0.64513365 0.70360466 0.75721674 0.80509414 0.84673733 0.88201526 
##         19         20 
## 0.91112297 0.93451449
plot(newdata5$gpa, yhat.7)

Let us compare the two predicted probabilities in a single plot:

plot.data = data.frame(cbind(yhat.6, yhat.7, newdata4$gpa, (yhat.7 - yhat.6)))
colnames(plot.data) = c("Yhat6", "Yhat7", "gpa", "Yhat7-Yhat6")
plot(plot.data$gpa, yhat.6, type="b", col="cornflowerblue", xlab="gpa", ylab="Probability grade = A", ylim=c(0,1))
par(new=TRUE)
plot(plot.data$gpa, yhat.7, type="b", col="firebrick", yaxt="n", ylab="", xlab="", ylim=c(0,1))
abline(h=0.5, col="gray")

Using visreg

visreg is helpful more globally, capturing the effects of the independent variables in a more summary format.

library(visreg)
visreg(glm.2, "gpa", by="psi", rug=2) 

visreg(glm.2, "gpa", by="psi", rug=2, scale="response", overlay=TRUE)

visreg(glm.2, "gpa", by="psi", rug=2, scale="response")

Notice a couple of things about these plots as well:

  1. You see the rug being used here. Rugs on top are for \(y=1\) and rugs below are for \(y=0\).
  2. The first plot is useful for seeing how well the model fits. Too many observations away from the line will hint at a poor fit.
  3. The second plot is too noisy since the confidence intervals overlap, and so the third plot circumvents this issue by switching off the overlay option.
  4. What is clearly visible in the third plot is the fact that if you were in psi, your gpa translated into a higher probability of earning an A faster than it did for the non-psi group. One way of thinking about this is to draw an imaginary horizontal line from 0.5 on the vertical axis and then drop a perpendicular line down to gpa from where the horizontal line hits the curve. You will see that a smaller gpa is needed to hit this 50:50 point in the psi group than is true of the non-psi group.
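We can make point 4 in this list concrete with a little algebra: the predicted probability crosses 0.5 exactly where the probit index \(x_{i}\beta = 0\) (since \(\Phi(0) = 0.5\)), so we can solve for the gpa each group needs, holding tuce at 19.97 as before. A short sketch using the glm.2 coefficients:

b = coef(glm.2)
gpa.50 = function(inpsi) {
  # solve b0 + b_gpa(gpa) + b_tuce(19.97) + b_psi(inpsi) = 0 for gpa
  -(b["(Intercept)"] + b["tuce"] * 19.97 + b["psiIn PSI"] * inpsi) / b["gpa"]
}
gpa.50(inpsi = 1)  # In PSI: roughly 3.07
gpa.50(inpsi = 0)  # Not in PSI: roughly 3.95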

McFadden’s \(R^2\) works here as well, and so do the other goodness-of-fit tests.

library(pscl)
pR2(glm.2) 
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -12.8188033 -20.5917297  15.5458527   0.3774781   0.3848000   0.5315670
library(ResourceSelection)
hl = hoslem.test(psi$grade, fitted(glm.2), g=10)
hl
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  psi$grade, fitted(glm.2)
## X-squared = 6.0702, df = 8, p-value = 0.6394

Again, we are unable to reject the \(H_0\) that the model fit is good.

What about the percent correctly predicted?

library(pscl)
hitmiss(glm.2)
## Classification Threshold = 0.5 
##        y=0 y=1
## yhat=0  18   3
## yhat=1   3   8
## Percent Correctly Predicted = 81.25%
## Percent Correctly Predicted = 85.71%, for y = 0
## Percent Correctly Predicted = 72.73%  for y = 1
## Null Model Correctly Predicts 65.62%
## [1] 81.25000 85.71429 72.72727

Some statisticians argue that the decision to use the 0.5 threshold to convert the predicted probabilities into 0 and 1 is problematic. They would in fact recommend the following:

probabilities = predict(glm.2, type = "response")
percent.correct = (1 - mean(abs(probabilities - psi$grade))) * 100
percent.correct
## [1] 74.20134

You will notice that you get a slightly smaller answer from this latter command. I would recommend using hitmiss() but recognizing that not everyone agrees with this approach.

Confidence intervals also work the same way:

round(confint(glm.2, level=0.95), digits=4)
##                2.5 %  97.5 %
## (Intercept) -13.1149 -2.9738
## gpa           0.3740  3.1314
## tuce         -0.1044  0.2279
## psiIn PSI     0.3286  2.7032
round(confint(glm.2, level=0.99), digits=4)
##                0.5 %  99.5 %
## (Intercept) -15.1747 -1.7206
## gpa           0.0145  3.6698
## tuce         -0.1515  0.2884
## psiIn PSI     0.0033  3.1583

What about comparing nested models? Same thing as before …

with(glm.2, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 0.001404895
glm.3 = glm(grade ~ gpa + tuce, data=psi, family=binomial(link="probit"))
summary(glm.3)
## 
## Call:
## glm(formula = grade ~ gpa + tuce, family = binomial(link = "probit"), 
##     data = psi)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3950  -0.7229  -0.4824   0.7625   2.4645  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -6.03431    2.16642  -2.785  0.00535 **
## gpa          1.40957    0.65097   2.165  0.03036 * 
## tuce         0.05267    0.07433   0.709  0.47861   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.183  on 31  degrees of freedom
## Residual deviance: 32.304  on 29  degrees of freedom
## AIC: 38.304
## 
## Number of Fisher Scoring iterations: 5
anova(glm.2, glm.3)
## Analysis of Deviance Table
## 
## Model 1: grade ~ gpa + tuce + psi
## Model 2: grade ~ gpa + tuce
##   Resid. Df Resid. Dev Df Deviance
## 1        28     25.638            
## 2        29     32.304 -1  -6.6667

These tests can also be run via the lmtest package.

library(lmtest)
lrtest(glm.3, glm.2)
## Likelihood ratio test
## 
## Model 1: grade ~ gpa + tuce
## Model 2: grade ~ gpa + tuce + psi
##   #Df  LogLik Df  Chisq Pr(>Chisq)   
## 1   3 -16.152                        
## 2   4 -12.819  1 6.6667   0.009823 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally, the same goes for the AIC and the BIC …

AIC(glm.2); BIC(glm.2)
## [1] 33.63761
## [1] 39.50055
AIC(glm.3); BIC(glm.3)
## [1] 38.30431
## [1] 42.70152

Diagnostics for Probit Regression Models

You can use the same diagnostics for probit models that you used for regression models. There are some differences in how residuals are defined, etc. but we’ll ignore these mathematical details.

library(car)
marginalModelPlots(glm.2)

avPlots(glm.2)

residualPlots(glm.2)

##      Test stat Pr(>|t|)
## gpa      2.174    0.140
## tuce     0.131    0.717
## psi         NA       NA
influenceIndexPlot(glm.2)

influencePlot(glm.2)

##      StudRes       Hat      CookD
## 2  2.3366794 0.1526221 0.37697378
## 7 -0.9006084 0.3150740 0.06311229
outlierTest(glm.2)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##   rstudent unadjusted p-value Bonferonni p
## 2 2.336679           0.019456      0.62259
vif(glm.2)
##      gpa     tuce      psi 
## 1.158296 1.104320 1.071533

Modeling Interaction Effects in Logit and Probit Models

There has been considerable debate on whether these models need an interaction effect to be explicitly modeled. Some folks argue that you don’t need it, that basically you can fit a model such as this:

\[logit(grade) = \beta_0 + \beta_1(gpa) + \beta_2(tuce) + \beta_3(psi)\]

and still talk about how two or more variables interact to change the probability of the student earning an A.

Others argue that you need to specify the interaction in the model itself. This latter group argues that the model should then look like this if you think psi interacts with gpa:

\[logit(grade) = \beta_0 + \beta_1(gpa) + \beta_2(tuce) + \beta_3(psi) + \beta_4(gpa * psi)\]

My recommendation: Include the interaction in the model. Then,

  • Use AIC and BIC to see which model is better. Report the outcome in your write-up.
  • Generate predicted probabilities for specific values of the independent variables
  • Use visreg to see how the predicted probability changes as the values of the variables that make up the interaction change.
glm.4 = glm(grade ~ gpa + tuce + psi + gpa:psi, data=psi, family=binomial(link="probit"))
summary(glm.4)
## 
## Call:
## glm(formula = grade ~ gpa + tuce + psi + gpa:psi, family = binomial(link = "probit"), 
##     data = psi)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.73973  -0.59883  -0.08161   0.57457   1.98176  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -12.3819     5.6864  -2.177   0.0294 *
## gpa             3.1152     1.5879   1.962   0.0498 *
## tuce            0.0459     0.0876   0.524   0.6003  
## psiIn PSI       8.2143     5.9857   1.372   0.1700  
## gpa:psiIn PSI  -2.0480     1.7475  -1.172   0.2412  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.183  on 31  degrees of freedom
## Residual deviance: 23.976  on 27  degrees of freedom
## AIC: 33.976
## 
## Number of Fisher Scoring iterations: 8
psi$predictedP = fitted(glm.4)

AIC(glm.2, glm.4); BIC(glm.2, glm.4)
##       df      AIC
## glm.2  4 33.63761
## glm.4  5 33.97578
##       df      BIC
## glm.2  4 39.50055
## glm.4  5 41.30446
library(visreg)
visreg(glm.4, "gpa", by="psi", rug=2, scale="response", overlay=TRUE)

visreg(glm.4, "gpa", by="psi", rug=2, scale="response")

library(pscl)
pR2(glm.4) 
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -11.9878895 -20.5917297  17.2076805   0.4178299   0.4159333   0.5745748
library(ResourceSelection)
hl = hoslem.test(psi$grade, fitted(glm.4), g=10)
hl
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  psi$grade, fitted(glm.4)
## X-squared = 5.1669, df = 8, p-value = 0.7396
hitmiss(glm.4)
## Classification Threshold = 0.5 
##        y=0 y=1
## yhat=0  18   2
## yhat=1   3   9
## Percent Correctly Predicted = 84.38%
## Percent Correctly Predicted = 85.71%, for y = 0
## Percent Correctly Predicted = 81.82%  for y = 1
## Null Model Correctly Predicts 65.62%
## [1] 84.37500 85.71429 81.81818
library(lmtest)
lrtest(glm.2, glm.4)
## Likelihood ratio test
## 
## Model 1: grade ~ gpa + tuce + psi
## Model 2: grade ~ gpa + tuce + psi + gpa:psi
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   4 -12.819                     
## 2   5 -11.988  1 1.6618     0.1974

So should we go with the model that has the interaction (glm.4) or the one without an interaction (glm.2)? This is how I would decide which model to go with …

  • Neither psi nor the interaction term \(gpa * psi\) is significant
  • The AIC and BIC are lower for glm.2 than for glm.4
  • The likelihood-ratio test suggests that glm.4 is not a significant improvement over glm.2
  • When I compare the percent correctly predicted I see this to be 84.38% for glm.4 and 81.25% for glm.2
  • McFadden’s pseudo-\(R^2\) is 0.4178 for glm.4 and 0.3774 for glm.2

Weighing the evidence listed above, if my goal was to have the best possible predictive model then I would go with glm.4 because this gives me a better fit to the data (look at percent correctly predicted). If, however, I was testing a theory about the interaction between gpa and psi then I would show glm.4 so people could see the interaction is indeed not significant.

What would happen if I were theory testing and went with glm.2 instead? Anybody who looked at my visreg plots would ‘see’ an interaction in the plots of gpa by psi because the predicted probabilities would not look like parallel curves. This would make the reader think gpa has a different effect for the psi group than it does for the non-psi group (which is what an interaction is supposed to test). The reader would leave with a mistaken conclusion because had he/she seen glm.4 they would have realized there is no significant interaction.

Take-away: Always explicitly include the interaction term if it is a part of your theory. If you forget to include it, or choose not to for some reason, DO NOT interpret any plots or predicted probabilities as resulting from an interaction between the variables.

Comparing Logit and Probit Models

The logit and probit models are very similar and which one you choose is a matter of habit and ease. Traditionally, the logit model was easier to compute so most analysts veer in this direction. There are some differences of course:

  • The logit model is fatter in the tails

Logit versus Probit curves

glm.0 = glm(grade ~ gpa + tuce + psi, data=psi, family=binomial(link="logit"))

newdata4 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="Not in PSI")
yhat.4 = predict(glm.0, newdata4, type="response")

newdata5 = data.frame(tuce=19.97, gpa=seq(2.1, 4.0, by=0.1), psi="In PSI")
yhat.5 = predict(glm.0, newdata5, type="response")

yhat.6 = predict(glm.2, newdata4, type="response")

yhat.7 = predict(glm.2, newdata5, type="response")

plot.data2 = data.frame(cbind(yhat.4, yhat.5, yhat.6, yhat.7, newdata4$gpa))
colnames(plot.data2) = c("Logit (Not in PSI)", "Logit (In PSI)", "Probit (Not in PSI)", "Probit (In PSI)", "gpa")
plot(plot.data2$gpa, yhat.4, type="b", col="cornflowerblue", xlab="gpa", ylab="Probability grade = A", ylim=c(0,1))
par(new=TRUE)
plot(plot.data2$gpa, yhat.6, type="b", col="firebrick", yaxt="n", ylab="", xlab="", ylim=c(0,1))
abline(h=0.5, col="gray")

plot(plot.data2$gpa, yhat.5, type="b", col="cornflowerblue", xlab="gpa", ylab="Probability grade = A", ylim=c(0,1))
par(new=TRUE)
plot(plot.data2$gpa, yhat.7, type="b", col="firebrick", yaxt="n", ylab="", xlab="", ylim=c(0,1))
abline(h=0.5, col="gray")

You see the slight drift between the two sets of estimates (logit predicted probabilities are in blue).

  • The variances are different. For the logit, the variance is set to \(\dfrac{\pi^{2}}{3}\) while for the probit, since this is the standard normal distribution (i.e., the \(z\) score distribution) the variance is set to \(1\).

  • If you multiply the probit coefficients by \(1.81\) (some say 1.61) you will get the approximate logit coefficients. If you multiply the logit coefficients by 0.55 (some say 0.625) you will get the approximate probit coefficients. Why? Because of the ratio of their standard deviations: \(\dfrac{\pi}{\sqrt{3}} \approx 1.81\)
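You can check this rule of thumb against our own fits, with glm.0 as the logit and glm.2 as the probit:

coef(glm.0)[-1] / coef(glm.2)[-1]  # slope ratios land between roughly 1.6 and 1.9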

Some folks will fit both and then see which link function does a better job of predicting the outcomes, and then will opt to work with this link function.

Using ‘stargazer’ to compare multiple models

library(stargazer)
stargazer(glm.0, glm.2, glm.4, type="html", covariate.labels = c("Incoming GPA", "TUCE (Pre-test Score)", "PSI Group", "Incoming GPA * PSI Group", "Intercept"))
                              Dependent variable: grade
                          ----------------------------------------
                            logistic       probit        probit
                               (1)           (2)           (3)
Incoming GPA                 2.826**       1.626**       3.115**
                            (1.263)       (0.690)       (1.588)
TUCE (Pre-test Score)         0.095         0.052         0.046
                            (0.142)       (0.081)       (0.088)
PSI Group                    2.379**       1.426**        8.214
                            (1.065)       (0.587)       (5.986)
Incoming GPA * PSI Group                                 -2.048
                                                        (1.747)
Intercept                  -13.021***    -7.452***     -12.382**
                            (4.931)       (2.572)       (5.686)
Observations                    32            32            32
Log Likelihood              -12.890       -12.819       -11.988
Akaike Inf. Crit.            33.779        33.638        33.976
Note: *p<0.1; **p<0.05; ***p<0.01

Practicing with the Bangladesh Wells Data

Many of the wells used for drinking water in Bangladesh and other South Asian countries are contaminated with natural arsenic, affecting an estimated 100 million people. Exposure to arsenic increases the risk of cancer and other diseases, with the risk proportional to exposure.

Any locality can include wells with a range of arsenic levels, and even if your neighbor’s well is safe there is no guarantee that your well is safe. However, if your neighbor is willing to share water with you, you can walk over and collect water from their well. The nearest safe well may be quite a distance away, though.

A research team measured and labeled wells in several villages, tagging each well as “safe” (if arsenic levels were below 0.5, in units of hundreds of micrograms per liter, the standard set by Bangladesh) or “unsafe” (if arsenic levels exceeded this threshold). People with unsafe wells were encouraged to dig deeper tube wells or find safe wells from which to get water. A few years later the team returned to see who had switched wells. The dependent variable is whether someone switched \((y=1)\) or not \((y=0)\). We have the following independent variables:

  • The distance (in meters) to the nearest safe well
  • The arsenic level of the respondent’s well
  • Whether any members of the household were active in local community organizations, and
  • The education level of the head of the household

The data are in wells.RData so download and save this file to your working directory.

Note: distance is in meters so first convert it into 100-meter units and then fit the following model:

\[probit(switched) = \beta_{0} + \beta_{1}(arsenic) + \beta_{2}(distance) + \beta_{3}(education) + \beta_{4}(association)\]

load("/Users/ruhil/Box Sync/MPA6020/Data/wells.RData")
wells$dist100 = wells$dist/100
glm.w0 = glm(switched ~ arsenic + dist100 + educ + assoc, data=wells, family=binomial(link="probit"))
summary(glm.w0)
## 
## Call:
## glm(formula = switched ~ arsenic + dist100 + educ + assoc, family = binomial(link = "probit"), 
##     data = wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7310  -1.2001   0.7601   1.0628   1.6658  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.084460   0.061307  -1.378   0.1683    
## arsenic       0.276508   0.024295  11.381  < 2e-16 ***
## dist100      -0.546022   0.063528  -8.595  < 2e-16 ***
## educ          0.026587   0.005881   4.521 6.15e-06 ***
## assocBelongs -0.079645   0.047428  -1.679   0.0931 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3909.7  on 3015  degrees of freedom
## AIC: 3919.7
## 
## Number of Fisher Scoring iterations: 4
wells$predicted = predict(glm.w0, type="response")

plot(wells$arsenic, wells$predicted, lty=1)

library(visreg)
visreg(glm.w0, "arsenic", by="dist100", rug=2, scale="response", overlay=TRUE)

Some Important Extensions to Keep in Mind

Of course this is not the be-all and end-all of binary dependent variables. There are several other types of models that may be called for. Below are snippets of two extensions you are likely to see in the literature more often than others.

  1. The Scobit: There is something known as the “skewed probit”, designed for a particular situation. The way the logit and probit curves are estimated, the independent variables are treated as having their maximal impact when \(Pr(y_{i}=1|x) = 0.5\), that is, right when the probability of success is 0.5. At this point even a small increase in an independent variable with a positive influence on \(y\) can push the probability of success above 0.5. This may not be true of the phenomenon you are studying, and if that is the case the scobit would be recommended because it uses the data you have to estimate the point of maximal impact rather than assuming it to be at \(0.5\). Unfortunately, though, these models are very hard to estimate and often will simply fail to converge, leaving you with no results whatsoever.

  2. The Heteroscedastic Probit (and Logit): Because the probit model is built on the standard normal distribution (the \(z\) distribution), by assumption the residuals are forced to have a mean of \(0\) and a constant variance of \(1\). The problem is this assumption of a constant variance; there is no reason to believe that heteroscedasticity can only arise in a linear regression setting. In fact, all the factors that gave rise to heteroscedasticity in a linear regression setting are also in effect here. The same is true for logit models (where again the variance is assumed to be a constant); heteroscedasticity could be an issue with logits as well.

As a result, several analysts try to correct for heteroscedasticity by running what is known as a heteroscedastic probit. Not everybody agrees with this approach or thinks that it is an acceptable way to deal with heteroscedasticity. However, if you really need to do it, here is an example of how to fit such a model:

library(glmx)
hglm.wP = hetglm(switched ~ arsenic + dist100 + educ + assoc, data=wells, family=binomial(link="probit"))
summary(hglm.wP)
## 
## Call:
## hetglm(formula = switched ~ arsenic + dist100 + educ + assoc, data = wells, 
##     family = binomial(link = "probit"))
## 
## Deviance residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0500 -1.1811  0.7375  1.0479  1.8877 
## 
## Coefficients (binomial model with probit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.48868    0.14735  -3.316 0.000912 ***
## arsenic       0.78503    0.16268   4.826 1.40e-06 ***
## dist100      -0.96513    0.20413  -4.728 2.27e-06 ***
## educ          0.03357    0.01212   2.770 0.005603 ** 
## assocBelongs -0.02700    0.08931  -0.302 0.762412    
## 
## Latent scale model coefficients (with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## arsenic       0.24385    0.05145   4.740 2.14e-06 ***
## dist100       0.16018    0.16052   0.998   0.3184    
## educ         -0.01104    0.01573  -0.702   0.4828    
## assocBelongs  0.34836    0.14537   2.396   0.0166 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Log-likelihood: -1944 on 9 Df
## LR test for homoskedasticity: 22.61 on 4 Df, p-value: 0.0001517
## Dispersion: 1
## Number of iterations in nlminb optimization: 16
hglm.wL = hetglm(switched ~ arsenic + dist100 + educ + assoc, data=wells, family=binomial(link="logit"))
summary(hglm.wL)
## 
## Call:
## hetglm(formula = switched ~ arsenic + dist100 + educ + assoc, data = wells, 
##     family = binomial(link = "logit"))
## 
## Deviance residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0264 -1.1801  0.7378  1.0468  1.8685 
## 
## Coefficients (binomial model with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.79248    0.24054  -3.295 0.000986 ***
## arsenic       1.27238    0.26971   4.718 2.39e-06 ***
## dist100      -1.56452    0.33778  -4.632 3.63e-06 ***
## educ          0.05409    0.01966   2.751 0.005940 ** 
## assocBelongs -0.04192    0.14430  -0.290 0.771441    
## 
## Latent scale model coefficients (with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## arsenic       0.24122    0.05341   4.516 6.29e-06 ***
## dist100       0.16537    0.16741   0.988    0.323    
## educ         -0.01175    0.01635  -0.719    0.472    
## assocBelongs  0.35959    0.14923   2.410    0.016 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Log-likelihood: -1944 on 9 Df
## LR test for homoskedasticity: 20.47 on 4 Df, p-value: 0.0004027
## Dispersion: 1
## Number of iterations in nlminb optimization: 16
glm.wP = glm(switched ~ arsenic + dist100 + educ + assoc, data=wells, family=binomial(link="probit"))
summary(glm.wP)
## 
## Call:
## glm(formula = switched ~ arsenic + dist100 + educ + assoc, family = binomial(link = "probit"), 
##     data = wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7310  -1.2001   0.7601   1.0628   1.6658  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.084460   0.061307  -1.378   0.1683    
## arsenic       0.276508   0.024295  11.381  < 2e-16 ***
## dist100      -0.546022   0.063528  -8.595  < 2e-16 ***
## educ          0.026587   0.005881   4.521 6.15e-06 ***
## assocBelongs -0.079645   0.047428  -1.679   0.0931 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3909.7  on 3015  degrees of freedom
## AIC: 3919.7
## 
## Number of Fisher Scoring iterations: 4
glm.wL = glm(switched ~ arsenic + dist100 + educ + assoc, data=wells, family=binomial(link="logit"))
summary(glm.wL)
## 
## Call:
## glm(formula = switched ~ arsenic + dist100 + educ + assoc, family = binomial(link = "logit"), 
##     data = wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5942  -1.1976   0.7541   1.0632   1.6739  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.156712   0.099601  -1.573    0.116    
## arsenic       0.467022   0.041602  11.226  < 2e-16 ***
## dist100      -0.896110   0.104576  -8.569  < 2e-16 ***
## educ          0.042447   0.009588   4.427 9.55e-06 ***
## assocBelongs -0.124300   0.076966  -1.615    0.106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3907.8  on 3015  degrees of freedom
## AIC: 3917.8
## 
## Number of Fisher Scoring iterations: 4
library(stargazer)
stargazer(glm.wL, glm.wP, hglm.wL, hglm.wP, type="html")
                           Dependent variable: switched
                  ----------------------------------------------------
                    logistic     probit       heteroskedastic GLM
                       (1)         (2)          (3)          (4)
arsenic             0.467***    0.277***     1.272***     0.785***
                    (0.042)     (0.024)      (0.270)      (0.163)
dist100            -0.896***   -0.546***    -1.565***    -0.965***
                    (0.105)     (0.064)      (0.338)      (0.204)
educ                0.042***    0.027***     0.054***     0.034***
                    (0.010)     (0.006)      (0.020)      (0.012)
assocBelongs        -0.124      -0.080*      -0.042       -0.027
                    (0.077)     (0.047)      (0.144)      (0.089)
Constant            -0.157      -0.084      -0.792***    -0.489***
                    (0.100)     (0.061)      (0.241)      (0.147)
Observations         3,020       3,020        3,020        3,020
Log Likelihood    -1,953.913  -1,954.872   -1,943.677   -1,943.568
Akaike Inf. Crit.  3,917.826   3,919.744
Note: *p<0.1; **p<0.05; ***p<0.01

Note that the NULL hypothesis is one of no heteroscedasticity (i.e., of constant variance), so if this gets rejected it means we have heteroscedasticity and should use the results from the heteroscedastic models (hglm.wL and hglm.wP).

Note also that in the latent scale (variance) model the pattern flips relative to the mean equation: some of the variables (dist100 and educ) are not significant there, while one (assoc) is.