Regression Diagnostics

We saw that the predicted piscore and the actual piscore were very highly correlated, which is generally an indication of a good model, especially because these are out-of-sample predictions. That is, the 2014 data were used to generate a regression equation that was then used to generate predicted piscore for 2015. Now we ask if the model meets some of the central assumptions of regression analysis.

Joint Normality

One of the assumptions of a linear regression is the notion of joint normality. How might we see if this assumption is met? There are several ways, but here we will explore joint normality graphically. Let us draw a data ellipse first, but only after dropping any observation with a missing value (otherwise the plot won’t work).

load("~/Downloads/df.15.RData")
df = na.omit(df.15)

library(car)
with(df, dataEllipse(pctmobile, piscore, charter, id.n = 2, id.cex = 0.5, pch = 21, labels = birn, center.pch = "+", level = c(0.99), fill = TRUE, fill.alpha = 0.1, color = c("cornflowerblue"), group.labels = c("Public", "Charter")))

Here we have one band – the 99% confidence band – that traps most of the schools, but you also see some schools outside it. The ones farthest from the center of the distribution (the red cross) are identified for us (you see their birn displayed). You see two colors – red for charter schools and black for public schools. The two types of schools farthest from their respective centers are birn = 139 and birn = 11511 for charter schools, and birn = 18671 and birn = 37101 for public schools.

Now we generate these for the other independent variables we used.

with(df, dataEllipse(pctdisabled, piscore, charter, id.n = 2, id.cex = 0.5, pch = 21, labels = birn, center.pch = "+", level = c(0.99), fill = TRUE, fill.alpha = 0.1, color = c("cornflowerblue"), group.labels = c("Public", "Charter")))

with(df, dataEllipse(pctecon, piscore, charter, id.n = 2, id.cex = 0.5, pch = 21, labels = birn, center.pch = "+", level = c(0.99), fill = TRUE, fill.alpha = 0.1, color = c("cornflowerblue"), group.labels = c("Public", "Charter")))

with(df, dataEllipse(pctlep, piscore, charter, id.n = 2, id.cex = 0.5, pch = 21, labels = birn, center.pch = "+", level = c(0.99), fill = TRUE, fill.alpha = 0.1, color = c("cornflowerblue"), group.labels = c("Public", "Charter")))

What can we conclude? The basic conclusion is that some schools are potential outliers. But are they really influencing our regression estimates? This question can be answered by looking at the residuals \(= y_i - \hat{y}_i\).
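As a quick aside, resid() in R returns exactly this quantity for any fitted model. A self-contained sketch on simulated data (not the schools data):

```r
# Toy model: resid() is literally y minus the fitted values.
set.seed(1)
x = rnorm(100)
y = 2 + 3 * x + rnorm(100)
fit = lm(y ~ x)
all.equal(unname(resid(fit)), unname(y - fitted(fit)))  # TRUE
```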

Normally distributed residuals?

One of the assumptions of regression analysis is that the residuals should be normally distributed. One way to test this is via the Shapiro-Wilk and Anderson-Darling tests. In order to run these tests we’ll have to save the residuals from our regression. I’ll do this from a regression fit to the 2015 data-set. I’ll use df so that we have no missing values.

lm.A = lm(piscore ~ pctlep + pctecon + pctmobile + pctdisabled + pctminority + enrollment + charter, data = df)
summary(lm.A)
## 
## Call:
## lm(formula = piscore ~ pctlep + pctecon + pctmobile + pctdisabled + 
##     pctminority + enrollment + charter, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.838  -4.085   0.686   4.592  31.075 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.076e+02  3.840e-01 280.194  < 2e-16 ***
## pctlep         -4.478e-02  1.800e-02  -2.488   0.0129 *  
## pctecon        -2.005e-01  6.600e-03 -30.376  < 2e-16 ***
## pctmobile      -2.261e-01  2.022e-02 -11.180  < 2e-16 ***
## pctdisabled    -3.283e-01  1.578e-02 -20.803  < 2e-16 ***
## pctminority    -9.995e-02  6.771e-03 -14.762  < 2e-16 ***
## enrollment      1.402e-03  3.158e-04   4.441 9.28e-06 ***
## charterCharter -3.051e+00  5.741e-01  -5.315 1.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.802 on 3213 degrees of freedom
## Multiple R-squared:  0.6777, Adjusted R-squared:  0.677 
## F-statistic: 965.1 on 7 and 3213 DF,  p-value: < 2.2e-16
df$residuals = resid(lm.A)

shapiro.test(df$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$residuals
## W = 0.95978, p-value < 2.2e-16
library(nortest)
ad.test(df$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  df$residuals
## A = 22.307, p-value < 2.2e-16

Both tests reject the null hypothesis of normally distributed residuals. This is not a lethal problem but it is still worth investigating.
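Formal tests aside, a normal quantile-quantile plot of the residuals is a useful visual check; car’s qqPlot() adds a confidence envelope, so departures in the tails are easier to judge than with qqnorm() alone. A sketch on a toy fit (the id.n argument follows the older car interface used elsewhere in this chapter); for the schools model the call would simply be qqPlot(lm.A):

```r
library(car)
# Toy fit just to illustrate the call.
set.seed(1)
toy = data.frame(x = rnorm(100))
toy$y = 2 + 3 * toy$x + rnorm(100)
fit = lm(y ~ x, data = toy)
qqPlot(fit, id.n = 3)  # studentized residuals vs. t quantiles, with envelope
```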

Plotting residuals to identify problems

If the regression model has extracted all of the systematic variation in piscore that is there to be explained, the residuals should represent pure random noise. They should exhibit no pattern but instead be scattered in an unsystematic way. We can see what pattern, if any, the residuals exhibit via a series of plots.

residualPlots(lm.A)

##             Test stat Pr(>|t|)
## pctlep         -5.177    0.000
## pctecon         3.001    0.003
## pctmobile       7.100    0.000
## pctdisabled     5.675    0.000
## pctminority   -12.237    0.000
## enrollment     -5.163    0.000
## charter            NA       NA
## Tukey test     -0.995    0.320

The plots show a red line drawn as a smoother through the data points. If this line exhibits curvature, and the curvature is pronounced enough, you can see it in the plot. Since the naked eye can be fooled quite easily, there is also a formal test with the null hypothesis of a linear relationship between the residuals and each independent variable, as well as between the residuals and the predicted dependent variable (the fitted values). If the p-value is \(\leq 0.05\) then we can reject the null of linearity, which is a sign that we have a problem with a given variable. In the present case, all of the independent variables are problematic.

How is such an issue typically addressed? Since we see a curved line being drawn, that hints at non-linearity, which can typically be addressed either by including the square of an independent variable alongside the original, or, if that does not work, by seeing whether the natural logarithm of the variable helps mitigate the problem.
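A small simulated illustration of why the squared term helps, assuming the true relationship is curved (toy data, not the schools data):

```r
set.seed(42)
x = runif(500, 0, 100)
y = 50 - 0.5 * x + 0.004 * x^2 + rnorm(500, sd = 2)  # truly curved
fit.lin = lm(y ~ x)            # misspecified: curvature leaks into residuals
fit.sq  = lm(y ~ x + I(x^2))   # adding the squared term absorbs it
c(linear = summary(fit.lin)$r.squared, squared = summary(fit.sq)$r.squared)
```

On data like these, residualPlots(fit.lin) would show the tell-tale curved smoother while residualPlots(fit.sq) would not.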

These are useful if we have just a couple of independent variables that show up with significant p-values. Here all of them do! So what could we try? The first thing to do would be to see if we can control for the schools with the largest enrollments. We know these are ECOT and VAO. I’ll run an updated model that includes a dummy variable for each of these schools.

df$ECOT[df$birn == 133413] = 1
df$ECOT[df$birn != 133413] = 0
table(df$ECOT)
## 
##    0    1 
## 3220    1
df$VAO[df$birn == 142950] = 1
df$VAO[df$birn != 142950] = 0
table(df$VAO)
## 
##    0    1 
## 3220    1
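As an aside, each pair of assignments above can be collapsed into one line: a logical comparison coerced with as.numeric() yields the same 0/1 dummy. A toy sketch (for the real data it would be df$ECOT = as.numeric(df$birn == 133413)):

```r
# A toy vector standing in for df$birn.
birn = c(133413, 142950, 99999)
ECOT = as.numeric(birn == 133413)  # TRUE/FALSE coerced to 1/0
ECOT  # 1 0 0
```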
lm.B = lm(piscore ~ pctlep + pctecon + pctmobile + pctdisabled + pctminority + enrollment + charter + ECOT + VAO, data = df)
summary(lm.B)
## 
## Call:
## lm(formula = piscore ~ pctlep + pctecon + pctmobile + pctdisabled + 
##     pctminority + enrollment + charter + ECOT + VAO, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.672  -4.070   0.600   4.554  31.400 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.065e+02  4.366e-01 243.935  < 2e-16 ***
## pctlep         -4.322e-02  1.793e-02  -2.411 0.015962 *  
## pctecon        -1.954e-01  6.647e-03 -29.396  < 2e-16 ***
## pctmobile      -2.161e-01  2.024e-02 -10.680  < 2e-16 ***
## pctdisabled    -3.261e-01  1.573e-02 -20.737  < 2e-16 ***
## pctminority    -1.058e-01  6.836e-03 -15.476  < 2e-16 ***
## enrollment      3.172e-03  4.623e-04   6.862 8.13e-12 ***
## charterCharter -2.563e+00  5.795e-01  -4.423 1.01e-05 ***
## ECOT           -4.981e+01  1.041e+01  -4.786 1.77e-06 ***
## VAO            -3.366e+01  9.192e+00  -3.662 0.000255 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.772 on 3211 degrees of freedom
## Multiple R-squared:  0.6804, Adjusted R-squared:  0.6795 
## F-statistic: 759.6 on 9 and 3211 DF,  p-value: < 2.2e-16

Now we plot the residuals again:

residualPlots(lm.B)

##             Test stat Pr(>|t|)
## pctlep         -5.193    0.000
## pctecon         2.838    0.005
## pctmobile       6.756    0.000
## pctdisabled     5.785    0.000
## pctminority   -11.685    0.000
## enrollment     -0.740    0.460
## charter            NA       NA
## ECOT           -1.380    0.168
## VAO             0.208    0.835
## Tukey test     -1.514    0.130

Now you see enrollment is no longer significant but the others remain so. This would be a good time to try and include the squared versions along with the original measures of pctlep through pctminority.

df$pctlep.sq = df$pctlep^2
df$pctecon.sq = df$pctecon^2
df$pctmobile.sq = df$pctmobile^2
df$pctdisabled.sq = df$pctdisabled^2
df$pctminority.sq = df$pctminority^2

lm.C = lm(piscore ~ pctlep + pctlep.sq + pctecon + pctecon.sq + pctmobile + pctmobile.sq + pctdisabled + pctdisabled.sq + pctminority + pctminority.sq + enrollment + charter + ECOT + VAO, data = df)
summary(lm.C)
## 
## Call:
## lm(formula = piscore ~ pctlep + pctlep.sq + pctecon + pctecon.sq + 
##     pctmobile + pctmobile.sq + pctdisabled + pctdisabled.sq + 
##     pctminority + pctminority.sq + enrollment + charter + ECOT + 
##     VAO, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.290  -3.884   0.493   4.480  31.611 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.081e+02  6.611e-01 163.462  < 2e-16 ***
## pctlep         -5.160e-03  3.790e-02  -0.136  0.89171    
## pctlep.sq      -1.228e-03  5.824e-04  -2.109  0.03502 *  
## pctecon        -2.263e-01  2.059e-02 -10.992  < 2e-16 ***
## pctecon.sq      4.620e-04  1.785e-04   2.589  0.00968 ** 
## pctmobile      -4.296e-01  3.609e-02 -11.904  < 2e-16 ***
## pctmobile.sq    3.967e-03  5.330e-04   7.442 1.26e-13 ***
## pctdisabled    -5.063e-01  4.235e-02 -11.956  < 2e-16 ***
## pctdisabled.sq  2.079e-03  4.814e-04   4.319 1.61e-05 ***
## pctminority     9.816e-02  2.048e-02   4.793 1.72e-06 ***
## pctminority.sq -2.192e-03  2.010e-04 -10.904  < 2e-16 ***
## enrollment      1.980e-03  4.595e-04   4.309 1.69e-05 ***
## charterCharter -3.396e+00  5.925e-01  -5.732 1.09e-08 ***
## ECOT           -3.164e+01  1.016e+01  -3.113  0.00187 ** 
## VAO            -2.032e+01  8.945e+00  -2.272  0.02318 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.508 on 3206 degrees of freedom
## Multiple R-squared:  0.7022, Adjusted R-squared:  0.7009 
## F-statistic: 539.9 on 14 and 3206 DF,  p-value: < 2.2e-16
residualPlots(lm.C)

##                Test stat Pr(>|t|)
## pctlep            -0.137    0.891
## pctlep.sq         -1.132    0.258
## pctecon            1.339    0.181
## pctecon.sq        -0.481    0.631
## pctmobile         -0.480    0.631
## pctmobile.sq       2.612    0.009
## pctdisabled       -0.625    0.532
## pctdisabled.sq    -2.505    0.012
## pctminority        0.010    0.992
## pctminority.sq     3.604    0.000
## enrollment         0.082    0.934
## charter               NA       NA
## ECOT               0.600    0.549
## VAO                0.864    0.387
## Tukey test        -4.615    0.000

Well, things have improved but we still see significant p-values for pctmobile.sq, pctdisabled.sq, and pctminority.sq. Unfortunately, while the natural logarithm would usually help, you cannot take the natural logarithm of 0, and some schools have values of 0 for some of these continuous independent variables.
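A common workaround, worth noting even though we do not pursue it here, is to shift the variable by one before logging, i.e. log(x + 1); base R’s log1p() computes exactly this:

```r
x = c(0, 1, 10, 100)  # zeros are legal values for these percentages
log(x)                 # first element is -Inf: unusable as a regressor
log1p(x)               # log(x + 1): equals 0 at x = 0 and is finite everywhere
```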

At this point our only option would be to run a formal test to identify outliers and if any are flagged, we would then see if things improve by excluding them.

outlierTest(lm.C)
##        rstudent unadjusted p-value Bonferonni p
## 3704 -10.233642         3.2899e-24   1.0590e-20
## 2707  -5.566818         2.8081e-08   9.0393e-05
## 3066  -5.405106         6.9512e-08   2.2376e-04
## 2821  -5.308145         1.1829e-07   3.8077e-04
## 47    -4.969087         7.0795e-07   2.2789e-03
## 2666  -4.898045         1.0159e-06   3.2702e-03
## 743   -4.897745         1.0174e-06   3.2751e-03

Yikes! Seven schools are flagged as outliers. Let us see if things improve by excluding these schools.

df2 = subset(df, rownames(df) != 3704 & rownames(df) != 2707 & rownames(df) != 3066 & rownames(df) != 2821 & rownames(df) != 47 & rownames(df) != 2666 & rownames(df) != 743)
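The chained != comparisons work, but %in% expresses the same exclusion more compactly and is less error-prone. A toy sketch of the idiom (for the real data the vector would hold the seven flagged row names):

```r
# Drop rows whose row names fall in a given set, using %in% instead of chained !=.
m = data.frame(v = 1:5, row.names = c("10", "20", "30", "40", "50"))
drop = c("20", "40")                        # names of rows to exclude
m2 = m[!(rownames(m) %in% drop), , drop = FALSE]
rownames(m2)  # "10" "30" "50"
```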

lm.D = update(lm.C, data = df2)
summary(lm.D)
## 
## Call:
## lm(formula = piscore ~ pctlep + pctlep.sq + pctecon + pctecon.sq + 
##     pctmobile + pctmobile.sq + pctdisabled + pctdisabled.sq + 
##     pctminority + pctminority.sq + enrollment + charter + ECOT + 
##     VAO, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.400  -3.958   0.452   4.383  31.879 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.084e+02  6.366e-01 170.306  < 2e-16 ***
## pctlep          1.109e-02  3.652e-02   0.304  0.76133    
## pctlep.sq      -1.409e-03  5.600e-04  -2.516  0.01191 *  
## pctecon        -2.316e-01  1.978e-02 -11.711  < 2e-16 ***
## pctecon.sq      4.975e-04  1.714e-04   2.902  0.00373 ** 
## pctmobile      -4.403e-01  3.465e-02 -12.708  < 2e-16 ***
## pctmobile.sq    4.040e-03  5.116e-04   7.897 3.89e-15 ***
## pctdisabled    -5.028e-01  4.076e-02 -12.335  < 2e-16 ***
## pctdisabled.sq  2.197e-03  4.666e-04   4.709 2.60e-06 ***
## pctminority     9.051e-02  1.969e-02   4.598 4.43e-06 ***
## pctminority.sq -2.143e-03  1.932e-04 -11.091  < 2e-16 ***
## enrollment      1.827e-03  4.410e-04   4.143 3.52e-05 ***
## charterCharter -2.953e+00  5.703e-01  -5.178 2.38e-07 ***
## ECOT           -2.959e+01  9.751e+00  -3.035  0.00243 ** 
## VAO            -1.898e+01  8.583e+00  -2.211  0.02712 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.203 on 3199 degrees of freedom
## Multiple R-squared:  0.7197, Adjusted R-squared:  0.7185 
## F-statistic: 586.7 on 14 and 3199 DF,  p-value: < 2.2e-16
residualPlots(lm.D)

##                Test stat Pr(>|t|)
## pctlep             0.185    0.853
## pctlep.sq         -1.182    0.237
## pctecon           -1.001    0.317
## pctecon.sq        -0.312    0.755
## pctmobile         -0.716    0.474
## pctmobile.sq       2.691    0.007
## pctdisabled       -0.600    0.548
## pctdisabled.sq    -1.268    0.205
## pctminority        0.108    0.914
## pctminority.sq     3.665    0.000
## enrollment         0.405    0.686
## charter               NA       NA
## ECOT              -0.741    0.459
## VAO                0.729    0.466
## Tukey test        -6.851    0.000
outlierTest(lm.D)
##       rstudent unadjusted p-value Bonferonni p
## 687   4.452923         8.7608e-06     0.028140
## 1128 -4.385927         1.1922e-05     0.038293

So now pctmobile.sq and pctminority.sq are still problematic and the outlierTest() shows two more schools as outliers. Let us exclude these and see what happens.

df3 = subset(df2, rownames(df2) != 687 & rownames(df2) != 1128)

lm.E = update(lm.D, data = df3)
summary(lm.E)
## 
## Call:
## lm(formula = piscore ~ pctlep + pctlep.sq + pctecon + pctecon.sq + 
##     pctmobile + pctmobile.sq + pctdisabled + pctdisabled.sq + 
##     pctminority + pctminority.sq + enrollment + charter + ECOT + 
##     VAO, data = df3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.6673  -3.9489   0.4722   4.3813  25.3580 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.083e+02  6.334e-01 170.957  < 2e-16 ***
## pctlep          1.363e-02  3.632e-02   0.375  0.70742    
## pctlep.sq      -1.423e-03  5.568e-04  -2.556  0.01063 *  
## pctecon        -2.321e-01  1.966e-02 -11.805  < 2e-16 ***
## pctecon.sq      4.896e-04  1.704e-04   2.873  0.00410 ** 
## pctmobile      -4.369e-01  3.448e-02 -12.672  < 2e-16 ***
## pctmobile.sq    4.045e-03  5.088e-04   7.950 2.56e-15 ***
## pctdisabled    -4.881e-01  4.060e-02 -12.022  < 2e-16 ***
## pctdisabled.sq  2.054e-03  4.646e-04   4.420 1.02e-05 ***
## pctminority     9.135e-02  1.957e-02   4.667 3.18e-06 ***
## pctminority.sq -2.154e-03  1.921e-04 -11.214  < 2e-16 ***
## enrollment      1.795e-03  4.385e-04   4.093 4.36e-05 ***
## charterCharter -2.937e+00  5.674e-01  -5.175 2.42e-07 ***
## ECOT           -2.928e+01  9.695e+00  -3.021  0.00254 ** 
## VAO            -1.873e+01  8.534e+00  -2.195  0.02823 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.162 on 3197 degrees of freedom
## Multiple R-squared:  0.7214, Adjusted R-squared:  0.7202 
## F-statistic: 591.3 on 14 and 3197 DF,  p-value: < 2.2e-16
residualPlots(lm.E)

##                Test stat Pr(>|t|)
## pctlep            -0.113    0.910
## pctlep.sq         -1.171    0.242
## pctecon            0.983    0.326
## pctecon.sq        -0.475    0.635
## pctmobile         -0.740    0.459
## pctmobile.sq       2.406    0.016
## pctdisabled       -0.524    0.600
## pctdisabled.sq    -1.187    0.235
## pctminority        0.172    0.863
## pctminority.sq     3.581    0.000
## enrollment         0.518    0.604
## charter               NA       NA
## ECOT              -0.857    0.392
## VAO                0.177    0.859
## Tukey test        -6.512    0.000
outlierTest(lm.E)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##       rstudent unadjusted p-value Bonferonni p
## 2819 -4.159712         3.2704e-05      0.10498

Now, the outlierTest() shows no significant outliers but we still see some non-linearity associated with pctmobile.sq and pctminority.sq.

What could we do? Not much, short of digging into the schools to try to understand why we see these nonlinearities. We saw some of this speculation earlier this month, when we realized that some schools have 100% learning disabled kids and clearly cannot be treated the same as other schools. There are also dropout recovery schools in the data-set, as well as inner-city, rural, and suburban schools.

Measuring Influence

As a rule, then, we’ll have to figure out what counts as an outlier and, even if some observation seems to be one, whether it influences the regression line enough to matter. If an observation can change the regression line by virtue of being in the sample, we refer to it as a leveraged (high-leverage) observation.
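The quantities behind influencePlot() can also be extracted directly: hatvalues() gives leverage, rstudent() the studentized residuals, and cooks.distance() combines the two into overall influence. A self-contained sketch with one deliberately high-leverage point (toy data):

```r
set.seed(7)
x = c(rnorm(50), 10)                 # 51st point sits far out in x...
y = c(2 * x[1:50] + rnorm(50), 20)   # ...but right on the true line y = 2x
fit = lm(y ~ x)
lev = hatvalues(fit)        # leverage: how unusual each x value is
cd  = cooks.distance(fit)   # influence: leverage combined with residual size
which.max(lev)              # the 51st point has by far the most leverage
```

High leverage alone need not mean high influence: because this point lies near the true line, its Cook’s distance tends to stay modest.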

influencePlot(lm.E)

##        StudRes         Hat       CookD
## 2667 -4.143170 0.150846394 0.202270412
## 2819 -4.159712 0.003233525 0.003723136
## 3745       NaN 1.000000000         NaN

What school is this? Let us eyeball this observation.

df3[2667, ]
##       dirn  birn                building          district   county
## 3088 49031 29462 Payne Elementary School Wayne Trace Local Paulding
##                schtype piscore pipercent pigrade enrollment pctdisabled
## 3088 Elementary School  97.978      81.6       B        256    16.40625
##      pctmobile  pctecon pctlep pctminority charter residuals ECOT VAO
## 3088  14.45312 49.21875      0       3.125  Public  8.851546    0   0
##      pctlep.sq pctecon.sq pctmobile.sq pctdisabled.sq pctminority.sq
## 3088         0   2422.485     208.8928        269.165       9.765625

Non-Constant Variance (i.e., Heteroscedasticity)

Heteroscedasticity, \(Var(e_i) \neq \sigma^2\), is a violation of the assumption that the variance of the residuals is constant. When might heteroscedasticity arise? If:

  • The spread of the outcome varies across the independent variable (e.g., expenditure as a function of income; typing errors over time; etc.)
  • Outliers are present in the data
  • You omit one or more independent variables that should be in the model
  • One or more of the independent variables is skewed
  • Incorrect functional form (i.e., you needed a squared term but forgot or didn’t know to include it in the model)

Consequences of ignoring heteroscedasticity:

  • Your estimates will not be biased, so that isn’t the issue here
  • But the standard errors of the estimates will be off – too high or too low – and this will throw off the \(p\)-value of each coefficient
  • “The reason OLS is not optimal when heteroskedasticity is present is that it gives equal weight to all observations when, in fact, observations with larger disturbance variance contain less information than observations with smaller disturbance variance.”
  • Fortunately, unless heteroscedasticity is “marked,” significance tests are virtually unaffected, and thus OLS estimation can be used without fear of serious distortion. But severe heteroscedasticity can sometimes be a problem.
  • Note that heteroscedasticity is often a by-product of other violations of assumptions.

We could use various plots and tests to detect this, but we’ll focus on one test. The null hypothesis is constant variance – basically saying that none of the independent variables has a statistically significant relationship with the residuals. This is what we want, because it means the residuals are random. But if \(H_0\) is rejected we have a problem; the conclusion then becomes that one or more of the independent variables are significant predictors of the residuals.

ncvTest(lm.E)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 256.5238    Df = 1     p = 9.823328e-58
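The tiny p-value rejects the null of constant variance, so the usual OLS standard errors are suspect. One standard response is to keep the OLS coefficients but report heteroscedasticity-consistent (“sandwich”) standard errors. A sketch on toy heteroscedastic data, assuming the sandwich and lmtest packages are installed; for the schools model the last line would be coeftest(lm.E, vcov = vcovHC(lm.E, type = "HC3")):

```r
library(sandwich)  # heteroscedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for re-testing coefficients

set.seed(3)
x = runif(200, 1, 10)
y = 1 + 2 * x + rnorm(200, sd = x)  # noise grows with x: heteroscedastic
fit = lm(y ~ x)
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))  # robust standard errors
```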

Multicollinearity

Variance Inflation Factors (VIFs) and Generalized Variance Inflation Factors (GVIFs) are among the more commonly used indicators of multicollinearity. They estimate how much larger the standard error (and hence the confidence interval) of each variable’s coefficient is than it would be if that variable had zero correlation with the other variables in the regression model. A common rule of thumb is that variables with VIF > 4 are suspect.
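The VIF of a predictor \(x_j\) is \(1/(1 - R_j^2)\), where \(R_j^2\) is the R-squared from regressing \(x_j\) on all the other predictors. A base-R sketch of this computation on simulated data (car’s vif() returns the same numbers):

```r
set.seed(5)
x1 = rnorm(200)
x2 = x1 + rnorm(200, sd = 0.3)  # x2 is nearly a copy of x1
x3 = rnorm(200)
# VIF for x1 by hand: regress x1 on the remaining predictors.
r2.x1 = summary(lm(x1 ~ x2 + x3))$r.squared
1 / (1 - r2.x1)  # well above the rule-of-thumb cutoff of 4
```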

Here is a stylized example with very high multicollinearity.

library(MASS)
library(clusterGeneration)
set.seed(2)
num.vars = 15
num.obs = 200
cov.mat = genPositiveDefMat(num.vars,covMethod = "unifcorrmat")$Sigma
rand.vars = mvrnorm(num.obs,rep(0,num.vars),Sigma = cov.mat)
parms = runif(num.vars,-10,10)
y = rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)
lm.dat = data.frame(y,rand.vars)
form.in = paste('y ~',paste(names(lm.dat)[-1],collapse = '+'))
mod1 = lm(form.in,data = lm.dat)
summary(mod1)
## 
## Call:
## lm(formula = form.in, data = lm.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.855 -13.150   0.647  13.548  53.344 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    0.631      1.489   0.424  0.67233   
## X1             5.523      2.345   2.356  0.01954 * 
## X2            -7.089      3.389  -2.091  0.03787 * 
## X3             2.512      1.685   1.490  0.13788   
## X4             3.811      5.430   0.702  0.48362   
## X5            -4.776      1.766  -2.704  0.00749 **
## X6            11.374      8.777   1.296  0.19665   
## X7            -3.426      6.207  -0.552  0.58166   
## X8            -0.310      7.855  -0.039  0.96857   
## X9            -6.424      3.279  -1.959  0.05160 . 
## X10           11.559      5.761   2.007  0.04626 * 
## X11            6.768      3.546   1.909  0.05782 . 
## X12            4.830      3.235   1.493  0.13712   
## X13            2.271      1.465   1.550  0.12294   
## X14           -1.182      3.619  -0.327  0.74433   
## X15            2.849      3.905   0.730  0.46647   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.38 on 184 degrees of freedom
## Multiple R-squared:  0.774,  Adjusted R-squared:  0.7556 
## F-statistic: 42.02 on 15 and 184 DF,  p-value: < 2.2e-16
vif(mod1)
##         X1         X2         X3         X4         X5         X6 
##  27.735278  36.894720  12.569420  50.738554   8.350699 114.685122 
##         X7         X8         X9        X10        X11        X12 
##  67.341542 153.597013  48.226663  50.737140  33.972005  43.254102 
##        X13        X14        X15 
##  12.082329  74.618689  29.872246

What does multicollinearity look like in our schools’ data and models? Does it threaten the validity of our last model?

vif(lm.E)
##         pctlep      pctlep.sq        pctecon     pctecon.sq      pctmobile 
##       5.528808       5.039877      21.382269      22.685046       4.983245 
##   pctmobile.sq    pctdisabled pctdisabled.sq    pctminority pctminority.sq 
##       3.807165       9.050120       8.571838      22.158275      19.461321 
##     enrollment        charter           ECOT            VAO 
##       2.359902       1.436646       1.831980       1.419406
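One caveat on these numbers: a variable and its square are mechanically correlated, so pairs like pctecon and pctecon.sq inflate each other’s VIFs by construction. Centering a variable before squaring typically weakens that correlation without changing the fitted curve; a sketch of the idea on simulated data (not rerun on the schools models here):

```r
set.seed(9)
x = runif(500, 0, 100)  # a percentage-style predictor
cor(x, x^2)              # raw variable and its square: nearly collinear
xc = x - mean(x)         # center first...
cor(xc, xc^2)            # ...and the correlation largely disappears
```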