If two variables are jointly normally distributed, we can draw a band (an ellipse) around the data points to see where a given percentage of the data should fall under the assumption of joint normality. Typically we are interested in the 95% band because we use \(\alpha=0.05\), and anything outside such a band is a data point far from the center of the regression space. Being outside the envelope doesn’t necessarily make the data point an outlier (in the sense of influencing the regression line), but it is still an unusual value.
load("~/Downloads/sowc.RData")
sowc$Urban_Rural = factor(sowc$Urban_Rural)
library(car)
with(sowc, dataEllipse(FemaleYouthLR, Income, id.n = 4, id.cex = 0.5, pch = 21, labels = country, center.pch = "+", level = 0.95, fill = TRUE, fill.alpha = 0.1, ylim = c(-20000, 80000), xlim = c(20, 140), col = "cornflowerblue"))
with(sowc, dataEllipse(FemaleYouthLR, Income, Urban_Rural, id.n = 4, id.cex = 0.5, pch = c(16,21), labels = country, center.pch = "+", level = 0.95, fill = TRUE, fill.alpha = 0.1, ylim = c(-20000, 80000), xlim = c(20, 140), group.labels = c("Rural", "Urban")))
sowc.U = subset(sowc, Urban_Rural == "Urban")
sowc.R = subset(sowc, Urban_Rural == "Rural")
with(sowc.U, dataEllipse(FemaleYouthLR, Income, id.n = 4, id.cex = 0.5, pch = c(16,21), labels = country, center.pch = "+", level = 0.95, fill = TRUE, fill.alpha = 0.1, ylim = c(-20000, 80000), xlim = c(45, 125)))
with(sowc.R, dataEllipse(FemaleYouthLR, Income, id.n = 4, id.cex = 0.5, pch = c(16,21), labels = country, center.pch = "+", level = 0.95, fill = TRUE, fill.alpha = 0.1, ylim = c(-10000, 30000), xlim = c(20, 140)))
The first data ellipse shows the four countries farthest from the 95% envelope – Qatar, Singapore, Kuwait, and Guinea. The second adds one more dimension – the ability to differentiate between Urban and Rural countries. This latter data ellipse shows four Urban countries – Qatar, Nigeria, Côte d’Ivoire, and Dem. Rep. of Congo – and four Rural countries – Slovenia, Trinidad and Tobago, Guinea, and Equatorial Guinea. So these might be the countries likely to influence the regression line. Do they? We’ll see by focusing on the residuals from a regression model.
There is another package that will give you some meaningful plots and show you how correlated your variables happen to be. You have seen this before – ggpairs!
library(GGally)
ggpairs(sowc, c(2, 4, 7, 13, 15))
#options(show.signif.stars = TRUE)
library(car)
The one thing you’ll notice right away is that diagnostics are all about the ordinary residuals, remember them?
\[\text{Ordinary residuals: } e_i = y_i - \hat{y}_i, \quad i = 1, 2, 3, \ldots, n\]
Residuals are the basis of most diagnostic methods because, if the regression model is correctly specified, the residuals should be strictly random with mean and variance given by \(E(e_i) = 0\) and \(Var(e_i) = \sigma^2(1-h_{ii})\), respectively. Note that \(\sigma^2\) is the variance of \(y_i\) and is estimated by \(\left(\text{Residual standard error}\right)^2\).
Of course, you can’t do any diagnostics without having a regression model to work with, so I’ll rely on the sowc data and fit a simple model with a couple of independent variables (see below).
lm.1 = lm(U5MR ~ Income + FemaleYouthLR + Urban_Rural, data = sowc)
summary(lm.1)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + Urban_Rural, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.132 -11.296 -4.887 3.888 95.875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.728e+02 1.014e+01 17.037 <2e-16 ***
## Income -5.026e-04 2.162e-04 -2.325 0.0216 *
## FemaleYouthLR -1.412e+00 1.277e-01 -11.052 <2e-16 ***
## Urban_RuralUrban -9.116e+00 4.827e+00 -1.888 0.0612 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.34 on 129 degrees of freedom
## Multiple R-squared: 0.649, Adjusted R-squared: 0.6408
## F-statistic: 79.51 on 3 and 129 DF, p-value: < 2.2e-16
residuals = resid(lm.1)
How good a regression model is lm.1? Let us see, but before we do, note that I have saved the residuals via the residuals = resid(lm.1) command.
We can use a variety of tests to see whether the residuals are normally distributed. The two we will focus on are the Anderson-Darling test and the Shapiro-Wilk test. The NULL hypothesis for each test is that the residuals are normally distributed, and the ALTERNATIVE hypothesis is that they are not. Obviously we would like the test to tell us we cannot reject \(H_0\), because if we do reject \(H_0\) then we have a problem with the residuals. Note that the Anderson-Darling test is meant for large samples; the Shapiro-Wilk test is for smaller samples (under about \(n=5,000\)).
library(nortest)
ad.test(residuals)
##
## Anderson-Darling normality test
##
## data: residuals
## A = 6.1347, p-value = 3.727e-15
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.86863, p-value = 1.694e-09
This is tricky business, so be careful. Why? Because with large samples these tests will easily reject the null hypothesis even when the distribution looks perfectly normal. I pointed this out earlier when we first ran the Shapiro-Wilk test. In any event, normality of the residuals is not a critical assumption in linear regression, so we need not worry about it too much.
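To see why, here is a quick simulated illustration (the seed and the fake sample are mine, not part of the sowc analysis): a large draw from a t distribution with 10 degrees of freedom looks very close to normal, yet Shapiro-Wilk will usually still reject \(H_0\).
set.seed(10)
fake.sample = rt(4500, df = 10)  # slightly heavier tails than the normal, but visually near-normal
shapiro.test(fake.sample)        # with n = 4,500 the test will usually reject H0 anyway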
If we plot the residuals against the independent variables, and also against the predicted values of \(y\), the semblance of any pattern other than a random one tells us that something could be amiss. That is, maybe the residuals are not purely random because we have mis-specified the model. Maybe the residuals do not have constant variance. Maybe there are one or more outliers in the dataset. The car library has several built-in functions to generate some useful plots (see below).
residualPlots(lm.1)
## Test stat Pr(>|t|)
## Income 1.779 0.078
## FemaleYouthLR -3.571 0.001
## Urban_Rural NA NA
## Tukey test -3.888 0.000
Notice the output. First, focus on the plots. You see three plots with the vertical axis labeled Pearson residuals which are, in our case, nothing but the ordinary residuals \(e_i =y_i - \hat{y}_i\).
Second, notice that the plots are for each of the two independent variables (Income and FemaleYouthLR), and then for what are called the fitted values; these are the \(\hat{y}\), the predicted U5MR.
Third, if the red line, which is fit to track the shape of the data, exhibits significant curvature, then we should worry: it means there is some non-linearity in the data that is not being captured by our regression model. You see noticeable curvature in all three plots. Since we have a categorical variable in the model (Urban_Rural), we also want the medians of each group to be around 0 and the box-plots to be roughly symmetric for each group.
Fourth, you see some test results with p-values reported. Each test involves squaring the corresponding independent variable and including it in the model; if the squared term is not significant then we know we can leave it out. The NULL hypothesis of each test is that of a linear relationship, so we do not want to reject \(H_0\); otherwise we have to conclude that we have a problem. If you look at the p-values, you see the only ones that are \(\leq 0.05\) are for FemaleYouthLR and for the Tukey test, which is run with the fitted values (the predicted U5MR) as the independent variable.
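If you want to see what these curvature tests amount to, here is a sketch (the model name lm.1.curve is mine, not part of the original analysis): add the square of FemaleYouthLR to lm.1 and look at the t-test on the squared term; it should roughly reproduce the test statistic reported above for FemaleYouthLR.
lm.1.curve = update(lm.1, . ~ . + I(FemaleYouthLR^2))
summary(lm.1.curve)$coefficients["I(FemaleYouthLR^2)", ]  # the t value should be near -3.571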
Note: You can also run these plots one at a time if that eases visibility.
residualPlots(lm.1, ~ 1, fitted = TRUE)
## Test stat Pr(>|t|)
## Tukey test -3.888 0
residualPlots(lm.1, ~ FemaleYouthLR, fitted = FALSE)
## Test stat Pr(>|t|)
## FemaleYouthLR -3.571 0.001
residualPlots(lm.1, ~ Income, fitted = FALSE)
## Test stat Pr(>|t|)
## Income 1.779 0.078
So we have a problem with the FemaleYouthLR variable. Perhaps U5MR is non-linearly related to FemaleYouthLR. One way to see if this suspicion is true would be to square FemaleYouthLR and include it in the regression model.
sowc$FemaleYouthLR.sq = (sowc$FemaleYouthLR)^2
lm.2 = lm(U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data = sowc)
summary(lm.2)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq,
## data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.80 -10.92 -5.20 4.51 81.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.2053580 24.2798773 3.839 0.000193 ***
## Income -0.0004244 0.0002070 -2.050 0.042387 *
## FemaleYouthLR 1.1485934 0.7198725 1.596 0.113036
## FemaleYouthLR.sq -0.0188275 0.0050436 -3.733 0.000283 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.48 on 129 degrees of freedom
## Multiple R-squared: 0.6745, Adjusted R-squared: 0.6669
## F-statistic: 89.09 on 3 and 129 DF, p-value: < 2.2e-16
residualPlots(lm.2) # To see if things improve.
## Test stat Pr(>|t|)
## Income 1.317 0.190
## FemaleYouthLR 0.230 0.818
## FemaleYouthLR.sq 1.247 0.215
## Tukey test 1.342 0.180
Note: If squaring the suspect independent variable doesn’t do the trick, transforming it by taking the natural logarithm usually will. Why? In general, taking the logarithm of a variable yields a more symmetrically distributed variable. For example, with the fake data below, compare the ratio between 2 and 10 (= 5) with the ratio of their logarithms (0.6931472 and 2.3025851, a ratio of about 3.32). In brief, the logarithm compresses the spread of the values, so nothing looks that extreme at the low- or the high-end.
x = c(2, 2, 3, 4, 9, 10, 10); x
## [1] 2 2 3 4 9 10 10
log(x)
## [1] 0.6931472 0.6931472 1.0986123 1.3862944 2.1972246 2.3025851 2.3025851
cbind(x, log(x))
## x
## [1,] 2 0.6931472
## [2,] 2 0.6931472
## [3,] 3 1.0986123
## [4,] 4 1.3862944
## [5,] 9 2.1972246
## [6,] 10 2.3025851
## [7,] 10 2.3025851
boxplot(x, horizontal = TRUE)
boxplot(log(x), horizontal = TRUE)
The box-plots show the median shifting towards the middle for log(x).
Just as the simple mean and standard deviation (also the variance) of any numerical variable can be influenced by an outlier, so can the regression line. For example, assume we have the following data and model:
x0 = c(2, 3, 4, 5, 6)
y0 = c(3, 4.5, 4.5, 7.5, 9)
df0 = data.frame(cbind(x0, y0))
lm.00 = lm(y0 ~ x0)
summary(lm.00)
##
## Call:
## lm(formula = y0 ~ x0)
##
## Residuals:
## 1 2 3 4 5
## 0.3 0.3 -1.2 0.3 0.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3000 1.0392 -0.289 0.79163
## x0 1.5000 0.2449 6.124 0.00875 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7746 on 3 degrees of freedom
## Multiple R-squared: 0.9259, Adjusted R-squared: 0.9012
## F-statistic: 37.5 on 1 and 3 DF, p-value: 0.008754
library(ggplot2)
p0 = ggplot(df0, aes(x = x0, y = y0)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
p0
Now imagine an unusual observation:
x1 = c(2, 3, 4, 5, 16)
y1 = c(3, 4.5, 4.5, 7.5, 9)
df1 = data.frame(cbind(x1, y1))
lm.01 = lm(y1 ~ x1)
summary(lm.01)
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
## 1 2 3 4 5
## -1.22308 -0.09231 -0.46154 2.16923 -0.39231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4846 1.0225 3.408 0.0422 *
## x1 0.3692 0.1299 2.843 0.0655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.481 on 3 degrees of freedom
## Multiple R-squared: 0.7293, Adjusted R-squared: 0.6391
## F-statistic: 8.084 on 1 and 3 DF, p-value: 0.06547
p1 = ggplot(df1, aes(x = x1, y = y1)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
p1
What happened to the regression line? It was pulled down towards the \(x=16\) data point.
As a rule, then, we’ll have to (i) figure out a way to identify what counts as an outlier, and (ii) even if some observation seems to be an outlier, determine whether it is large enough to influence the regression line. If an observation can change the regression line simply by virtue of being in the sample, we refer to it as a leveraged (high-leverage) observation.
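To see leverage in the toy example above, look at the hat-values of lm.01, one per observation (this is just an illustrative check): the x1 = 16 point should dominate.
hatvalues(lm.01)  # the fifth observation (x1 = 16) has a hat-value close to 1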
One such diagnostic technique is a quantile-quantile (Q-Q) plot. You can think of this plot as ordering the residuals and plotting their quantiles – the 1%, 2%, 3%, …, 99% values – against the quantiles we would expect if the residuals were normally distributed. If the sample is normally distributed, all points should fall on the straight line shown in the plot. Points that deviate a lot are unusual observations (i.e., potential outliers).
You will see a red-line with two dashed red-lines on either side; these are the 95% confidence intervals. The goal is to see if any observations fall outside these bands.
You will also see the \(y-axis\) labeled Studentized Residuals. These are residuals calculated by leaving one observation out of the sample, re-estimating the regression model, calculating the residual, calculating the z-score of the residual, then throwing this one observation back in and taking another observation out, re-estimating the model, calculating the residual, calculating the z-score of the residual, and so on until every observation has been excluded one by one. Why do this?
An outlier is measured in terms of the residual \(e_i = y_i - \hat{y}_i\). But in order to calculate the z-score of a residual you need the standard deviation of the residuals, and this turns out to be the Residual Standard Error reported by the regression. If we want to know whether a particular observation is influencing the regression line, though, logic dictates that we calculate the z-score of its residual with that observation left out of the fit, as follows:
\[e_{si} = \dfrac{e_i}{\hat{\sigma}_{-i}\sqrt{1 - h_{ii}}}\]
where \(\hat{\sigma}_{-i}\) is the standard deviation of the residuals from a regression model that excludes the \(i^{th}\) observation, and \(h_{ii}\) is that observation’s hat-value (its leverage). If \(h_{ii} = 0\) the studentized residual is just the ordinary residual scaled by \(\hat{\sigma}_{-i}\), and the observation exerts no leverage on (i.e., does not unduly influence) the regression line.
In simple terms, think of these as a modified version of the ordinary residual so that we can judge if any observation has an unusually large influence on the regression line. If you find one or more such observations, then you can suspect that something might be going on with these observations.
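Conveniently, you do not have to do the leave-one-out work yourself; the pieces of the formula above are available directly (a small sketch, using object names of my own choosing):
e.si = rstudent(lm.2)   # the Studentized (leave-one-out) residuals, e_si
h.ii = hatvalues(lm.2)  # the hat-values, h_ii, used in the denominator
head(sort(abs(e.si), decreasing = TRUE), 4)  # the four most unusual observations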
qqPlot(lm.2, id.n = 4, id.cex = 0.65, col = "blue")
## 64 3 104 35
## 130 131 132 133
We set id.n = 4 so that the four observations with the largest absolute Studentized residuals (i.e., the four most unusual observations) are flagged. The output lists the row numbers of these flagged observations, so you can look up which country each one is.
A formal test is available as well:
outlierTest(lm.2)
## rstudent unadjusted p-value Bonferonni p
## 35 3.855320 0.00018207 0.024215
## 104 3.824442 0.00020379 0.027104
## 3 3.752534 0.00026430 0.035152
This test involves picking the observation with the largest (in absolute terms) Studentized residual and calculating a p-value. The p-value is essentially telling us how likely we would be to see such a large absolute Studentized residual if indeed there were no outliers.
You see something called a Bonferroni p (the output spells it “Bonferonni”). This is a modified p-value. Why? Because in a test such as this, where we are making multiple simultaneous comparisons by leaving one observation out at a time, the probability of committing a Type I error is no longer, say, 0.05, but in fact much higher. How?
Assume I have \(\alpha = 0.05\) and I leave one observation out, then re-run the test. Maybe I conclude that the excluded observation is not an outlier. Loosely put, I could be wrong 5 times out of 100. Now I leave another observation out and repeat the exercise; again \(\alpha = 0.05\), and again I could be wrong 5 times out of 100. Since I ran two such tests back-to-back, what is the probability that I end up with at least one p-value \(\leq 0.05\) by chance? It is \(1 -\) the probability that neither of the two tests gives a significant result \((0.95 \times 0.95)\), i.e., \(1 - 0.9025 = 0.0975\)! In other words, if you run two tests simultaneously with \(\alpha = 0.05\), the chance of at least one test giving you a significant result is as high as 0.0975, not 0.05. If I were running 10 tests, the chance of at least one significant result would be \(1 - (0.95)^{10} \approx 0.40\). Yikes!
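The arithmetic, for the record:
1 - 0.95^2   # chance of at least one "significant" result in 2 tests: 0.0975
1 - 0.95^10  # chance of at least one "significant" result in 10 tests: about 0.40
0.05/10      # the Bonferroni-adjusted alpha for 10 simultaneous comparisons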
This has led to various corrections being applied to the p-value, and one such correction is the Bonferroni correction, which says that if you are making \(n\) simultaneous comparisons then your effective alpha is \(\dfrac{\alpha}{n}\). So if I am using \(\alpha=0.05\) and making 10 simultaneous comparisons, my effective \(\alpha\) is \(\dfrac{0.05}{10} = 0.005\), and I can reject \(H_0\) only if the p-value for a comparison is \(\leq 0.005\). Equivalently, the test multiplies each unadjusted p-value by the number of comparisons and reports this Bonferroni p-value, which can then be compared with the usual 0.05. As it turns out, all three observations have a Bonferroni p-value less than 0.05, suggesting that each is an outlier.
Should we then exclude them and re-estimate the regression model? Not yet; let us run some more tests first.
One other method of checking for influential observations involves constructing what are called index plots – graphs of particular statistics against the observation index. A popular one is influenceIndexPlot() because it shows the Studentized residuals, the corresponding Bonferroni p-values for outliers, the hat-values, and something called Cook’s distances.
influenceIndexPlot(lm.2)
An observation is influential if it is both an outlier and has high leverage: if you remove it, the regression model changes significantly.
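Cook’s distance combines exactly these two ingredients – the size of the residual and the leverage. As a check of the textbook formula \(D_i = \dfrac{r_i^2}{p} \times \dfrac{h_{ii}}{1 - h_{ii}}\), where \(r_i\) is the standardized residual and \(p\) the number of estimated coefficients, you can rebuild it from the built-in pieces (the object names below are mine):
r.i = rstandard(lm.2)             # standardized residuals
h.ii = hatvalues(lm.2)            # hat-values
p = length(coef(lm.2))            # number of estimated coefficients
D = (r.i^2/p) * h.ii/(1 - h.ii)   # Cook's distance built by hand
all.equal(unname(D), unname(cooks.distance(lm.2)))  # should be TRUE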
An easy way to see Cook’s distances is by constructing the influencePlot()
for the estimated model.
influencePlot(lm.2)
## StudRes Hat CookD
## 35 3.8553196 0.01205959 0.04095727
## 94 0.9294449 0.40496289 0.14713536
In this plot, the size of the bubbles is proportional to Cook’s distance for each observation; the larger the bubble, the more likely it is that the observation is both an outlier and influencing the regression model. We see two culprits, observation 35 and observation 94. What countries are these? See for yourself.
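One quick way to see for yourself (the sowc data has a country column; it supplied the labels for dataEllipse() earlier):
sowc[c(35, 94), "country"]  # the two rows flagged by influencePlot()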
Now let us see what happens if we remove observation 94 from the sample and re-estimate the regression model.
lm.3 <- update(lm.2, subset = rownames(sowc) != 94)
summary(lm.3)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq,
## data = sowc, subset = rownames(sowc) != 94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.512 -10.663 -4.868 4.390 82.611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.141641 24.381862 3.902 0.000153 ***
## Income -0.000581 0.000267 -2.176 0.031393 *
## FemaleYouthLR 1.070168 0.725178 1.476 0.142472
## FemaleYouthLR.sq -0.018090 0.005108 -3.541 0.000556 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.49 on 128 degrees of freedom
## Multiple R-squared: 0.6747, Adjusted R-squared: 0.6671
## F-statistic: 88.51 on 3 and 128 DF, p-value: < 2.2e-16
compareCoefs(lm.2, lm.3)
##
## Call:
## 1: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc)
## 2: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = rownames(sowc) != 94)
## Est. 1 SE 1 Est. 2 SE 2
## (Intercept) 93.205358 24.279877 95.141641 24.381862
## Income -0.000424 0.000207 -0.000581 0.000267
## FemaleYouthLR 1.148593 0.719872 1.070168 0.725178
## FemaleYouthLR.sq -0.018828 0.005044 -0.018090 0.005108
lm.4 = update(lm.3, subset = rownames(sowc) != 35)
summary(lm.4)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq,
## data = sowc, subset = rownames(sowc) != 35)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.725 -10.101 -4.367 4.814 80.097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.1791115 23.0732950 3.995 0.000108 ***
## Income -0.0004448 0.0001968 -2.260 0.025496 *
## FemaleYouthLR 1.1944856 0.6841566 1.746 0.083223 .
## FemaleYouthLR.sq -0.0192625 0.0047939 -4.018 9.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.36 on 128 degrees of freedom
## Multiple R-squared: 0.7032, Adjusted R-squared: 0.6962
## F-statistic: 101.1 on 3 and 128 DF, p-value: < 2.2e-16
compareCoefs(lm.2, lm.3, lm.4)
##
## Call:
## 1: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc)
## 2: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = rownames(sowc) != 94)
## 3: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = rownames(sowc) != 35)
## Est. 1 SE 1 Est. 2 SE 2 Est. 3
## (Intercept) 93.205358 24.279877 95.141641 24.381862 92.179111
## Income -0.000424 0.000207 -0.000581 0.000267 -0.000445
## FemaleYouthLR 1.148593 0.719872 1.070168 0.725178 1.194486
## FemaleYouthLR.sq -0.018828 0.005044 -0.018090 0.005108 -0.019263
## SE 3
## (Intercept) 23.073295
## Income 0.000197
## FemaleYouthLR 0.684157
## FemaleYouthLR.sq 0.004794
lm.5 = update(lm.2, subset = (rownames(sowc) != 35 & rownames(sowc) != 94) )
summary(lm.5)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq,
## data = sowc, subset = (rownames(sowc) != 35 & rownames(sowc) !=
## 94))
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.396 -10.211 -4.142 4.509 80.367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.3862869 23.1339863 4.080 7.9e-05 ***
## Income -0.0006244 0.0002536 -2.462 0.015142 *
## FemaleYouthLR 1.1051346 0.6880976 1.606 0.110742
## FemaleYouthLR.sq -0.0184219 0.0048474 -3.800 0.000223 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.34 on 127 degrees of freedom
## Multiple R-squared: 0.7044, Adjusted R-squared: 0.6974
## F-statistic: 100.9 on 3 and 127 DF, p-value: < 2.2e-16
compareCoefs(lm.2, lm.3, lm.4, lm.5)
##
## Call:
## 1: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc)
## 2: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = rownames(sowc) != 94)
## 3: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = rownames(sowc) != 35)
## 4: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = (rownames(sowc) != 35 & rownames(sowc) != 94))
## Est. 1 SE 1 Est. 2 SE 2 Est. 3
## (Intercept) 93.205358 24.279877 95.141641 24.381862 92.179111
## Income -0.000424 0.000207 -0.000581 0.000267 -0.000445
## FemaleYouthLR 1.148593 0.719872 1.070168 0.725178 1.194486
## FemaleYouthLR.sq -0.018828 0.005044 -0.018090 0.005108 -0.019263
## SE 3 Est. 4 SE 4
## (Intercept) 23.073295 94.386287 23.133986
## Income 0.000197 -0.000624 0.000254
## FemaleYouthLR 0.684157 1.105135 0.688098
## FemaleYouthLR.sq 0.004794 -0.018422 0.004847
Another commonly used strategy is the Marginal Model Plot. These plots show, for each continuous \(x\) variable, its relationship with \(y\) while ignoring all other independent variables in the model.
Added-Variable plots are similar to the Marginal Model Plots except that these adjust for all other independent variables when constructing the plot for each continuous independent variable. These are constructed as follows:
marginalModelPlots(lm.1)
## Warning in mmps(...): Interactions and/or factors skipped
avPlots(lm.2, id.n = 2, id.cex = 0.6)
Earlier we noticed that there was some non-linearity associated with FemaleYouthLR, but once we added the squared term (FemaleYouthLR.sq) we could get around this problem. So unless we had already fit this version of the model (lm.2), we would have to do so and then re-check these marginal model plots and added-variable plots.
sowc$FemaleYouthLR.sq = (sowc$FemaleYouthLR)^2
lm.3 = lm(U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data = sowc)
summary(lm.3)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq,
## data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.80 -10.92 -5.20 4.51 81.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.2053580 24.2798773 3.839 0.000193 ***
## Income -0.0004244 0.0002070 -2.050 0.042387 *
## FemaleYouthLR 1.1485934 0.7198725 1.596 0.113036
## FemaleYouthLR.sq -0.0188275 0.0050436 -3.733 0.000283 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.48 on 129 degrees of freedom
## Multiple R-squared: 0.6745, Adjusted R-squared: 0.6669
## F-statistic: 89.09 on 3 and 129 DF, p-value: < 2.2e-16
residuals.3 = resid(lm.3)
marginalModelPlots(lm.3)
avPlots(lm.3, id.n = 2, id.cex = 0.6)
The squared version of FemaleYouthLR does a better job, as is evident from the two lines being much closer in these marginal model plots than in the case of lm.1 (where FemaleYouthLR.sq was not included in the regression).
Heteroscedasticity, \(Var(e_i) \neq \sigma^2\), is a violation of the assumption that the variance of the residuals is constant. It often arises when the spread of the dependent variable grows (or shrinks) with the level of one of the independent variables, or when the model has been mis-specified. If we ignore heteroscedasticity, the coefficient estimates remain unbiased but their standard errors (and hence the t-tests and confidence intervals) are no longer reliable.
So how do we detect non-constant variance? As usual, we’ll have a choice of graphical explorations buttressed by formal statistical tests. Let us start with a graphical exploration.
We use Spread-Level Plots that graph the log of the absolute Studentized residuals against the log of the fitted values. Any pattern evident hints at a problem. The plot will also show a transformation suggested for the dependent variable.
A formal test – the Breusch-Pagan Test – is also available. This test checks whether the assumption of equal variance is met or should be rejected.
An example of heteroscedasticity in action is shown below:
data(Ornstein)
The Ornstein data frame has 248 rows and 4 columns. The observations are the 248 largest Canadian firms with publicly available information in the mid-1970s; the names of the firms were not available. The variables are assets (the firm’s assets, in millions of dollars), sector (the firm’s sector of operation, a factor), nation (the nation controlling the firm, a factor), and interlocks (the number of interlocking directorate and executive positions shared with other major firms).
lm.A <- lm(interlocks + 1 ~ log(assets) + sector + nation, data=Ornstein)
summary(lm.A)
##
## Call:
## lm(formula = interlocks + 1 ~ log(assets) + sector + nation,
## data = Ornstein)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.936 -6.169 -0.140 4.745 47.440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.4430 4.9272 -5.570 6.98e-08 ***
## log(assets) 5.9908 0.6814 8.792 3.24e-16 ***
## sectorBNK 17.3227 5.1847 3.341 0.000971 ***
## sectorCON -2.7127 5.4241 -0.500 0.617463
## sectorFIN -1.2745 3.4121 -0.374 0.709100
## sectorHLD -2.2916 4.6132 -0.497 0.619835
## sectorMAN 1.2440 2.3666 0.526 0.599621
## sectorMER -0.8801 3.0346 -0.290 0.772058
## sectorMIN 1.7566 2.4448 0.719 0.473153
## sectorTRN 1.8888 3.3023 0.572 0.567888
## sectorWOD 5.1056 3.0990 1.647 0.100801
## nationOTH -3.0533 3.0872 -0.989 0.323676
## nationUK -5.3294 3.0714 -1.735 0.084030 .
## nationUS -8.4913 1.7174 -4.944 1.46e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.26 on 234 degrees of freedom
## Multiple R-squared: 0.5353, Adjusted R-squared: 0.5094
## F-statistic: 20.73 on 13 and 234 DF, p-value: < 2.2e-16
spreadLevelPlot(lm.A)
## Warning in spreadLevelPlot.lm(lm.A): 16 negative fitted values removed
##
## Suggested power transformation: 0.5540406
Notice the ballooning that happens at the upper-end of the distribution. This is a classic example of heteroscedasticity. What if we ran the formal test?
ncvTest(lm.A, ~log(assets) + sector + nation, data=Ornstein)
## Non-constant Variance Score Test
## Variance formula: ~ log(assets) + sector + nation
## Chisquare = 290.8694 Df = 13 p = 1.952521e-54
The \(p-value\) is practically 0 so we have to reject the Null hypothesis of constant variance; we have a problem with heteroscedasticity.
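As a sketch of one possible fix (the name lm.A2 is mine; the power is the one suggested by spreadLevelPlot() above and is close to a square root), we could transform the response and re-run the test:
lm.A2 = update(lm.A, I((interlocks + 1)^0.5540406) ~ .)
ncvTest(lm.A2, ~ log(assets) + sector + nation, data = Ornstein)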
What about the model we fit to the sowc
data?
summary(lm.1)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + Urban_Rural, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.132 -11.296 -4.887 3.888 95.875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.728e+02 1.014e+01 17.037 <2e-16 ***
## Income -5.026e-04 2.162e-04 -2.325 0.0216 *
## FemaleYouthLR -1.412e+00 1.277e-01 -11.052 <2e-16 ***
## Urban_RuralUrban -9.116e+00 4.827e+00 -1.888 0.0612 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.34 on 129 degrees of freedom
## Multiple R-squared: 0.649, Adjusted R-squared: 0.6408
## F-statistic: 79.51 on 3 and 129 DF, p-value: < 2.2e-16
spreadLevelPlot(lm.1)
## Warning in spreadLevelPlot.lm(lm.1): 2 negative fitted values removed
##
## Suggested power transformation: 0.3511475
ncvTest(lm.1, ~ Income + FemaleYouthLR, data = sowc)
## Non-constant Variance Score Test
## Variance formula: ~ Income + FemaleYouthLR
## Chisquare = 40.05976 Df = 2 p = 2.000477e-09
Note that the spread-level plot shows the green line deviating from a flat line, and you also get a suggestion to transform the dependent variable by raising it to the power 0.3511475. What if we transform the dependent variable as suggested?
sowc$U5MR2 = sowc$U5MR^(0.3511475)
lm.new1 = update(lm.1, U5MR2 ~ Income + FemaleYouthLR)
summary(lm.new1)
##
## Call:
## lm(formula = U5MR2 ~ Income + FemaleYouthLR, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18482 -0.48231 -0.09641 0.34651 2.39779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.153e+00 2.958e-01 24.183 < 2e-16 ***
## Income -3.423e-05 6.205e-06 -5.516 1.8e-07 ***
## FemaleYouthLR -4.114e-02 3.499e-03 -11.757 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6938 on 130 degrees of freedom
## Multiple R-squared: 0.6631, Adjusted R-squared: 0.6579
## F-statistic: 127.9 on 2 and 130 DF, p-value: < 2.2e-16
spreadLevelPlot(lm.new1)
##
## Suggested power transformation: 0.986857
ncvTest(lm.new1, ~ Income + FemaleYouthLR, data = sowc)
## Non-constant Variance Score Test
## Variance formula: ~ Income + FemaleYouthLR
## Chisquare = 3.755136 Df = 2 p = 0.1529617
Aha, now we have a high p-value so the NULL hypothesis of constant variance cannot be rejected. We have gotten around that problem.
What if we had been unable to achieve this fix? There is a sledgehammer we could use (though we should not abuse it): what are called Huber-White heteroscedasticity-corrected standard errors, computed by the function below.
summaryw <- function(model) {
  s <- summary(model)
  X <- model.matrix(model)
  u2 <- residuals(model)^2
  XDX <- 0
  ## Here one needs to calculate X'DX. But because D is huge (N x N),
  ## it is better to do it with a loop.
  for (i in 1:nrow(X)) {
    XDX <- XDX + u2[i] * X[i, ] %*% t(X[i, ])
  }
  # inverse of X'X
  XX1 <- solve(t(X) %*% X)
  # variance calculation (bread x meat x bread -- the "sandwich")
  varcovar <- XX1 %*% XDX %*% XX1
  # degrees-of-freedom adjustment
  dfc <- sqrt(nrow(X)) / sqrt(nrow(X) - ncol(X))
  # standard errors of the coefficient estimates are the
  # square roots of the diagonal elements
  stdh <- dfc * sqrt(diag(varcovar))
  t <- model$coefficients / stdh
  p <- 2 * pnorm(-abs(t))
  results <- cbind(model$coefficients, stdh, t, p)
  dimnames(results) <- dimnames(s$coefficients)
  results
}
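As an aside, if you would rather not hand-roll this, the sandwich and lmtest packages (assuming they are installed) provide the same kind of correction: vcovHC() builds a heteroscedasticity-consistent covariance matrix and coeftest() reports the coefficients with the corrected standard errors. The "HC1" type applies a small-sample adjustment similar to the one used in summaryw() above.
library(sandwich)
library(lmtest)
coeftest(lm.1, vcov. = vcovHC(lm.1, type = "HC1"))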
Now let us compare the two standard errors – the original ones and the ones corrected for heteroscedasticity.
summary(lm.1)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + Urban_Rural, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.132 -11.296 -4.887 3.888 95.875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.728e+02 1.014e+01 17.037 <2e-16 ***
## Income -5.026e-04 2.162e-04 -2.325 0.0216 *
## FemaleYouthLR -1.412e+00 1.277e-01 -11.052 <2e-16 ***
## Urban_RuralUrban -9.116e+00 4.827e+00 -1.888 0.0612 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.34 on 129 degrees of freedom
## Multiple R-squared: 0.649, Adjusted R-squared: 0.6408
## F-statistic: 79.51 on 3 and 129 DF, p-value: < 2.2e-16
summaryw(lm.1)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.727879e+02 1.578141e+01 10.948819 6.731948e-28
## Income -5.026061e-04 1.615177e-04 -3.111770 1.859691e-03
## FemaleYouthLR -1.411607e+00 1.889460e-01 -7.470954 7.961551e-14
## Urban_RuralUrban -9.116080e+00 5.698287e+00 -1.599793 1.096445e-01
Some of the partial slope coefficients’ standard errors will be larger, others will be smaller, but in general this correction – the Huber-White sandwich estimator – gives you standard errors (and hence tests and confidence intervals) that remain valid in the presence of heteroscedasticity; it does not remove the heteroscedasticity itself.
Variance Inflation Factors (VIFs), or Generalized Variance Inflation Factors (GVIFs) when the model includes factors, are among the more commonly used indicators of multicollinearity. What these try to get at is an estimate of how much larger the standard error (and hence the confidence interval) of each variable’s coefficient is than it would be if that variable had zero correlation with the other variables in the regression model. They rely upon the following:
\[VIF_j = \dfrac{1}{1 - R_j^2}\]
where \(R_j^2\) is the \(R^2\) obtained by regressing the \(j^{th}\) independent variable on all the other independent variables in the model.
data("Ericksen")
summary(Ericksen)
## minority crime poverty language
## Min. : 0.700 Min. : 25.00 Min. : 6.80 Min. : 0.200
## 1st Qu.: 5.025 1st Qu.: 48.00 1st Qu.: 9.95 1st Qu.: 0.500
## Median :15.800 Median : 55.00 Median :12.50 Median : 0.850
## Mean :19.436 Mean : 63.06 Mean :13.47 Mean : 1.926
## 3rd Qu.:28.200 3rd Qu.: 73.00 3rd Qu.:16.60 3rd Qu.: 2.350
## Max. :72.600 Max. :143.00 Max. :23.90 Max. :12.700
## highschool housing city conventional
## Min. :17.50 Min. : 7.00 city :16 Min. : 0.00
## 1st Qu.:27.45 1st Qu.: 9.40 state:50 1st Qu.: 0.00
## Median :31.60 Median :11.50 Median : 0.00
## Mean :33.65 Mean :15.67 Mean : 11.73
## 3rd Qu.:41.67 3rd Qu.:20.30 3rd Qu.: 9.75
## Max. :51.80 Max. :52.10 Max. :100.00
## undercount
## Min. :-2.310
## 1st Qu.: 0.320
## Median : 1.445
## Mean : 1.921
## 3rd Qu.: 3.305
## Max. : 8.180
lm.E = lm(undercount ~ ., data = Ericksen)
summary(lm.E)
##
## Call:
## lm(formula = undercount ~ ., data = Ericksen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8356 -0.8033 -0.0553 0.7050 4.2467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.611411 1.720843 -0.355 0.723678
## minority 0.079834 0.022609 3.531 0.000827 ***
## crime 0.030117 0.012998 2.317 0.024115 *
## poverty -0.178369 0.084916 -2.101 0.040117 *
## language 0.215125 0.092209 2.333 0.023200 *
## highschool 0.061290 0.044775 1.369 0.176415
## housing -0.034957 0.024630 -1.419 0.161259
## citystate -1.159982 0.770644 -1.505 0.137791
## conventional 0.036989 0.009253 3.997 0.000186 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.426 on 57 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.6667
## F-statistic: 17.25 on 8 and 57 DF, p-value: 1.044e-12
vif(lm.E)
## minority crime poverty language highschool
## 5.009065 3.343586 4.625178 1.635568 4.619169
## housing city conventional
## 1.871745 3.537750 1.691320
# sqrt(vif(lm.E)) > 2  # a common rule-of-thumb check
What if we fit the sowc
data with several independent variables and then tested the model for multicollinearity?
lm.7 = lm(U5MR ~ Income + Urban + FemaleYouthLR + MaleYouthLR + Fertility, data = sowc)
summary(lm.7)
##
## Call:
## lm(formula = U5MR ~ Income + Urban + FemaleYouthLR + MaleYouthLR +
## Fertility, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.554 -8.960 -1.524 4.243 91.514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.7803109 22.6136479 2.201 0.0295 *
## Income -0.0001930 0.0001993 -0.968 0.3346
## Urban -0.0849962 0.1039780 -0.817 0.4152
## FemaleYouthLR -0.6491575 0.2999460 -2.164 0.0323 *
## MaleYouthLR 0.0855617 0.4121839 0.208 0.8359
## Fertility 15.0451524 1.8761706 8.019 6.04e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.07 on 127 degrees of freedom
## Multiple R-squared: 0.7693, Adjusted R-squared: 0.7602
## F-statistic: 84.71 on 5 and 127 DF, p-value: < 2.2e-16
vif(lm.7)
## Income Urban FemaleYouthLR MaleYouthLR Fertility
## 1.599467 1.873242 11.395764 10.565046 2.916705
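As a check on the formula given earlier, here is the VIF for FemaleYouthLR computed by hand (the object name r2.f is mine): regress it on the other predictors in lm.7 and apply \(1/(1 - R_j^2)\).
r2.f = summary(lm(FemaleYouthLR ~ Income + Urban + MaleYouthLR + Fertility, data = sowc))$r.squared
1/(1 - r2.f)  # should be close to the vif() value for FemaleYouthLR reported above (about 11.4)
The VIFs above 10 for FemaleYouthLR and MaleYouthLR suggest these two variables are highly correlated with each other, which is what inflates the standard errors of their coefficients.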