If two variables are jointly normally distributed, we can draw a band (an ellipse) around the data points to see where a given percentage of the data should fall under the assumption of joint normality. Typically we are interested in the 95% band because we use \(\alpha=0.05\), and anything outside such a band is a data point far from the center of the regression space. Being outside the envelope doesn’t necessarily make the data point an outlier (in the sense of influencing the regression line), but it is still an unusual value.
load("~/Downloads/sowc.RData")
sowc$Urban_Rural = factor(sowc$Urban_Rural)
library(car)
with(sowc, dataEllipse(FemaleYouthLR, Income, id.n = 4, id.cex = 0.5, pch = 21, labels = country, center.pch = "+", level = 0.95, fill = TRUE, fill.alpha = 0.1, ylim = c(-20000, 80000), xlim = c(20, 140), col = "cornflowerblue"))
with(sowc, dataEllipse(FemaleYouthLR, Income, Urban_Rural, id.n = 4, id.cex = 0.5, pch = c(16,21), labels = country, center.pch = "+", level = 0.95, fill = TRUE, fill.alpha = 0.1, ylim = c(-20000, 80000), xlim = c(20, 140), group.labels = c("Rural", "Urban")))
sowc.U = subset(sowc, Urban_Rural == "Urban")
sowc.R = subset(sowc, Urban_Rural == "Rural")
with(sowc.U, dataEllipse(FemaleYouthLR, Income, id.n = 4, id.cex = 0.5, pch = c(16,21), labels = country, center.pch = "+", level = 0.95, fill = TRUE, fill.alpha = 0.1, ylim = c(-20000, 80000), xlim = c(45, 125)))
with(sowc.R, dataEllipse(FemaleYouthLR, Income, id.n = 4, id.cex = 0.5, pch = c(16,21), labels = country, center.pch = "+", level = 0.95, fill = TRUE, fill.alpha = 0.1, ylim = c(-10000, 30000), xlim = c(20, 140)))
The first data ellipse shows the four countries farthest from the 95% envelope – Qatar, Singapore, Kuwait, and Guinea. The second adds one more dimension – the ability to differentiate between Urban and Rural countries. This latter data ellipse shows four Urban countries – Qatar, Nigeria, Côte d’Ivoire, and Dem. Rep. of Congo – and four Rural countries – Slovenia, Trinidad and Tobago, Guinea, and Equatorial Guinea. So these might be the countries likely to influence the regression line. Do they? We’ll see by focusing on the residuals from a regression model.
There is another package that will give you some meaningful plots and show you how correlated your variables happen to be. You have seen this before – ggpairs!
library(GGally)
ggpairs(sowc, c(2, 4, 7, 13, 15))
#options(show.signif.stars = TRUE)
library(car)
The one thing you’ll notice right away is that diagnostics are all about the ordinary residuals, remember them?
\[\text{Ordinary residuals: } e_i = y_i - \hat{y}_i, \quad i = 1, 2, 3, \ldots, n\]
Residuals are the basis of most diagnostic methods because, if the regression model is correctly specified, the residuals should be strictly random with mean and variance given by \(E(e_i) = 0\) and \(Var(e_i) = \sigma^2(1-h_{ii})\), respectively. Note that \(\sigma^2\) is the variance of \(y_i\) and is estimated by \(\left(\text{Residual standard error}\right)^2\).
Of course, you can’t do any diagnostics without having a regression model to work with, so I’ll rely on the sowc data and fit a simple model with a couple of independent variables (see below).
lm.1 = lm(U5MR ~ Income + FemaleYouthLR + Urban_Rural, data = sowc)
summary(lm.1)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + Urban_Rural, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.132 -11.296 -4.887 3.888 95.875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.728e+02 1.014e+01 17.037 <2e-16 ***
## Income -5.026e-04 2.162e-04 -2.325 0.0216 *
## FemaleYouthLR -1.412e+00 1.277e-01 -11.052 <2e-16 ***
## Urban_RuralUrban -9.116e+00 4.827e+00 -1.888 0.0612 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.34 on 129 degrees of freedom
## Multiple R-squared: 0.649, Adjusted R-squared: 0.6408
## F-statistic: 79.51 on 3 and 129 DF, p-value: < 2.2e-16
residuals = resid(lm.1)
How good a regression model is lm.1? Let us see, but before we do, note that I have saved the residuals via the residuals = resid(lm.1) command.
We can use a variety of tests to see whether the residuals are normally distributed. The two we will focus on are the Anderson-Darling test and the Shapiro-Wilk test. The NULL hypothesis for each test is that the residuals are normally distributed, and the ALTERNATIVE hypothesis is that they are not. Obviously we would like the test to tell us we cannot reject \(H_0\), because if we do reject \(H_0\) then we have a problem with the residuals. Note that the Anderson-Darling test is meant for large samples; the Shapiro-Wilk test is for smaller samples (under about \(n=5,000\)).
library(nortest)
ad.test(residuals)
##
## Anderson-Darling normality test
##
## data: residuals
## A = 6.1347, p-value = 3.727e-15
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.86863, p-value = 1.694e-09
This is tricky business, so be careful. Why? Because with large samples these tests will easily reject the null hypothesis even when the distribution looks perfectly normal. I pointed this out earlier when we first ran the Shapiro-Wilk test. In any event, normality of the residuals is not a critical assumption in linear regression, so we need not worry about it too much.
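To see why, here is a quick simulated illustration (the seed and the fake sample are mine, not part of the sowc analysis): a large draw from a t distribution with 10 degrees of freedom looks very close to normal, yet Shapiro-Wilk will usually still reject \(H_0\).
set.seed(10)
fake.sample = rt(4500, df = 10)  # slightly heavier tails than the normal, but visually near-normal
shapiro.test(fake.sample)        # with n = 4,500 the test will usually reject H0 anyway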
If we plot the residuals against the independent variables, and also against the predicted values of \(y\), the semblance of any pattern other than a random one tells us that something could be amiss. That is, maybe the residuals are not purely random because we have mis-specified the model. Maybe the residuals do not have constant variance. Maybe there are one or more outliers in the dataset. The car library has several built-in functions to generate some useful plots (see below).
residualPlots(lm.1)
## Test stat Pr(>|t|)
## Income 1.779 0.078
## FemaleYouthLR -3.571 0.001
## Urban_Rural NA NA
## Tukey test -3.888 0.000
Notice the output. First, focus on the plots. You see three plots with the vertical axis labeled Pearson residuals which are, in our case, nothing but the ordinary residuals \(e_i =y_i - \hat{y}_i\).
Second, notice that the plots are for each of the two independent variables (Income and FemaleYouthLR), and then for what are called the fitted values; these are the \(\hat{y}\), the predicted U5MR.
Third, if the red line, which is fit to track the shape of the data, exhibits significant curvature, then we should worry: it means there is some non-linearity in the data that is not being captured by our regression model. You see noticeable curvature in all three plots. Since we have a categorical variable in the model (Urban_Rural), we also want the medians of each group to be around 0 and the box-plots to be roughly symmetric for each group.
Fourth, you see some test results with p-values reported. Each test involves squaring the corresponding independent variable and including it in the model; if the squared term is not significant then we know we can leave it out. The NULL hypothesis of each test is that of a linear relationship, so we do not want to reject \(H_0\); otherwise we have to conclude that we have a problem. If you look at the p-values, you see the only ones that are \(\leq 0.05\) are for FemaleYouthLR and for the Tukey test, which is run with the fitted values (the predicted U5MR) as the independent variable.
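If you want to see what these curvature tests amount to, here is a sketch (the model name lm.1.curve is mine, not part of the original analysis): add the square of FemaleYouthLR to lm.1 and look at the t-test on the squared term; it should roughly reproduce the test statistic reported above for FemaleYouthLR.
lm.1.curve = update(lm.1, . ~ . + I(FemaleYouthLR^2))
summary(lm.1.curve)$coefficients["I(FemaleYouthLR^2)", ]  # the t value should be near -3.571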
Note: You can also run these plots one at a time if that eases visibility.
residualPlots(lm.1, ~ 1, fitted = TRUE)
## Test stat Pr(>|t|)
## Tukey test -3.888 0
residualPlots(lm.1, ~ FemaleYouthLR, fitted = FALSE)
## Test stat Pr(>|t|)
## FemaleYouthLR -3.571 0.001
residualPlots(lm.1, ~ Income, fitted = FALSE)
## Test stat Pr(>|t|)
## Income 1.779 0.078
So we have a problem with the FemaleYouthLR variable. Perhaps U5MR is non-linearly related to FemaleYouthLR. One way to see if this suspicion is true would be to square FemaleYouthLR and include it in the regression model.
sowc$FemaleYouthLR.sq = (sowc$FemaleYouthLR)^2
lm.2 = lm(U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data = sowc)
summary(lm.2)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq,
## data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.80 -10.92 -5.20 4.51 81.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.2053580 24.2798773 3.839 0.000193 ***
## Income -0.0004244 0.0002070 -2.050 0.042387 *
## FemaleYouthLR 1.1485934 0.7198725 1.596 0.113036
## FemaleYouthLR.sq -0.0188275 0.0050436 -3.733 0.000283 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.48 on 129 degrees of freedom
## Multiple R-squared: 0.6745, Adjusted R-squared: 0.6669
## F-statistic: 89.09 on 3 and 129 DF, p-value: < 2.2e-16
residualPlots(lm.2) # To see if things improve.
## Test stat Pr(>|t|)
## Income 1.317 0.190
## FemaleYouthLR 0.230 0.818
## FemaleYouthLR.sq 1.247 0.215
## Tukey test 1.342 0.180
Note: If squaring the suspect independent variable doesn’t do the trick, transforming it by taking the natural logarithm usually will. Why? In general, taking the logarithm of a variable yields a more symmetrically distributed variable. For example, with the fake data below, compare the ratio between 2 and 10 (= 5) with the ratio of their logarithms (0.6931472 and 2.3025851, a ratio of about 3.32). In brief, the logarithm compresses the spread of the values, so nothing looks that extreme at the low- or the high-end.
x = c(2, 2, 3, 4, 9, 10, 10); x
## [1] 2 2 3 4 9 10 10
log(x)
## [1] 0.6931472 0.6931472 1.0986123 1.3862944 2.1972246 2.3025851 2.3025851
cbind(x, log(x))
## x
## [1,] 2 0.6931472
## [2,] 2 0.6931472
## [3,] 3 1.0986123
## [4,] 4 1.3862944
## [5,] 9 2.1972246
## [6,] 10 2.3025851
## [7,] 10 2.3025851
boxplot(x, horizontal = TRUE)
boxplot(log(x), horizontal = TRUE)
The box-plots show the median shifting towards the middle for log(x).
Just as the simple mean and standard deviation (also the variance) of any numerical variable can be influenced by an outlier, so can the regression line. For example, assume we have the following data and model:
x0 = c(2, 3, 4, 5, 6)
y0 = c(3, 4.5, 4.5, 7.5, 9)
df0 = data.frame(cbind(x0, y0))
lm.00 = lm(y0 ~ x0)
summary(lm.00)
##
## Call:
## lm(formula = y0 ~ x0)
##
## Residuals:
## 1 2 3 4 5
## 0.3 0.3 -1.2 0.3 0.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3000 1.0392 -0.289 0.79163
## x0 1.5000 0.2449 6.124 0.00875 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7746 on 3 degrees of freedom
## Multiple R-squared: 0.9259, Adjusted R-squared: 0.9012
## F-statistic: 37.5 on 1 and 3 DF, p-value: 0.008754
library(ggplot2)
p0 = ggplot(df0, aes(x = x0, y = y0)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
p0
Now imagine an unusual observation:
x1 = c(2, 3, 4, 5, 16)
y1 = c(3, 4.5, 4.5, 7.5, 9)
df1 = data.frame(cbind(x1, y1))
lm.01 = lm(y1 ~ x1)
summary(lm.01)
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
## 1 2 3 4 5
## -1.22308 -0.09231 -0.46154 2.16923 -0.39231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4846 1.0225 3.408 0.0422 *
## x1 0.3692 0.1299 2.843 0.0655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.481 on 3 degrees of freedom
## Multiple R-squared: 0.7293, Adjusted R-squared: 0.6391
## F-statistic: 8.084 on 1 and 3 DF, p-value: 0.06547
p1 = ggplot(df1, aes(x = x1, y = y1)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
p1
What happened to the regression line? It was pulled down towards the \(x=16\) data point.
As a rule, then, we’ll have to (i) figure out a way to identify what counts as an outlier, and (ii) even if some observation seems to be an outlier, determine whether it is large enough to influence the regression line. If an observation can change the regression line simply by virtue of being in the sample, we refer to it as a leveraged (high-leverage) observation.
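To see leverage in the toy example above, look at the hat-values of lm.01, one per observation (this is just an illustrative check): the x1 = 16 point should dominate.
hatvalues(lm.01)  # the fifth observation (x1 = 16) has a hat-value close to 1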
One such diagnostic technique is a quantile-quantile (Q-Q) plot. You can think of this plot as ordering the residuals and plotting their quantiles – the 1%, 2%, 3%, …, 99% values – against the quantiles we would expect if the residuals were normally distributed. If the sample is normally distributed, all points should fall on the straight line shown in the plot. Points that deviate a lot are unusual observations (i.e., potential outliers).
You will see a red-line with two dashed red-lines on either side; these are the 95% confidence intervals. The goal is to see if any observations fall outside these bands.
You will also see the \(y-axis\) labeled Studentized Residuals. These are residuals calculated by leaving one observation out of the sample, re-estimating the regression model, calculating the residual, calculating the z-score of the residual, then throwing this one observation back in and taking another observation out, re-estimating the model, calculating the residual, calculating the z-score of the residual, and so on until every observation has been excluded one by one. Why do this?
An outlier is measured in terms of the residual \(e_i = y_i - \hat{y}_i\). But in order to calculate the z-score of a residual you need the standard deviation of the residuals, and this turns out to be the Residual Standard Error reported by the regression. If we want to know whether a particular observation is influencing the regression line, though, logic dictates that we calculate the z-score of its residual with that observation left out of the fit, as follows:
\[e_{si} = \dfrac{e_i}{\hat{\sigma}_{-i}\sqrt{1 - h_{ii}}}\]
where \(\hat{\sigma}_{-i}\) is the standard deviation of the residuals from a regression model that excludes the \(i^{th}\) observation, and \(h_{ii}\) is that observation’s hat-value (its leverage). If \(h_{ii} = 0\) the studentized residual is just the ordinary residual scaled by \(\hat{\sigma}_{-i}\), and the observation exerts no leverage on (i.e., does not unduly influence) the regression line.
In simple terms, think of these as a modified version of the ordinary residual so that we can judge if any observation has an unusually large influence on the regression line. If you find one or more such observations, then you can suspect that something might be going on with these observations.
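Conveniently, you do not have to do the leave-one-out work yourself; the pieces of the formula above are available directly (a small sketch, using object names of my own choosing):
e.si = rstudent(lm.2)   # the Studentized (leave-one-out) residuals, e_si
h.ii = hatvalues(lm.2)  # the hat-values, h_ii, used in the denominator
head(sort(abs(e.si), decreasing = TRUE), 4)  # the four most unusual observations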
qqPlot(lm.2, id.n = 4, id.cex = 0.65, col = "blue")
## 64 3 104 35
## 130 131 132 133
We set id.n = 4 so that the four observations with the largest absolute Studentized residuals (i.e., the four most unusual observations) are flagged. The output lists the row numbers of these flagged observations, so you can look up which country each one is.
A formal test is available as well:
outlierTest(lm.2)
## rstudent unadjusted p-value Bonferonni p
## 35 3.855320 0.00018207 0.024215
## 104 3.824442 0.00020379 0.027104
## 3 3.752534 0.00026430 0.035152
This test involves picking the observation with the largest (in absolute terms) Studentized residual and calculating a p-value. The p-value is essentially telling us how likely we would be to see such a large absolute Studentized residual if indeed there were no outliers.
You see something called a Bonferroni p (the output spells it “Bonferonni”). This is a modified p-value. Why? Because in a test such as this, where we are making multiple simultaneous comparisons by leaving one observation out at a time, the probability of committing a Type I error is no longer, say, 0.05, but in fact much higher. How?
Assume I have \(\alpha = 0.05\) and I leave one observation out, then re-run the test. Maybe I conclude that the excluded observation is not an outlier. Loosely put, I could be wrong 5 times out of 100. Now I leave another observation out and repeat the exercise; again \(\alpha = 0.05\), and again I could be wrong 5 times out of 100. Since I ran two such tests back-to-back, what is the probability that I end up with at least one p-value \(\leq 0.05\) by chance? It is \(1 -\) the probability that neither of the two tests gives a significant result \((0.95 \times 0.95)\), i.e., \(1 - 0.9025 = 0.0975\)! In other words, if you run two tests simultaneously with \(\alpha = 0.05\), the chance of at least one test giving you a significant result is as high as 0.0975, not 0.05. If I were running 10 tests, the chance of at least one significant result would be \(1 - (0.95)^{10} \approx 0.40\). Yikes!
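The arithmetic, for the record:
1 - 0.95^2   # chance of at least one "significant" result in 2 tests: 0.0975
1 - 0.95^10  # chance of at least one "significant" result in 10 tests: about 0.40
0.05/10      # the Bonferroni-adjusted alpha for 10 simultaneous comparisons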
This has led to various corrections being applied to the p-value, and one such correction is the Bonferroni correction, which says that if you are making \(n\) simultaneous comparisons then your effective alpha is \(\dfrac{\alpha}{n}\). So if I am using \(\alpha=0.05\) and making 10 simultaneous comparisons, my effective \(\alpha\) is \(\dfrac{0.05}{10} = 0.005\), and I can reject \(H_0\) only if the p-value for a comparison is \(\leq 0.005\). Equivalently, the test multiplies each unadjusted p-value by the number of comparisons and reports this Bonferroni p-value, which can then be compared with the usual 0.05. As it turns out, all three observations have a Bonferroni p-value less than 0.05, suggesting that each is an outlier.
Should we then exclude them and re-estimate the regression model? Not yet; let us run some more tests first.
One other method of checking for influential observations involves constructing what are called index plots – graphs of particular statistics against the observation index. A popular one is influenceIndexPlot() because it shows the Studentized residuals, the corresponding Bonferroni p-values for outliers, the hat-values, and something called Cook’s distances.
influenceIndexPlot(lm.2)
An observation is influential if it is both an outlier and has high leverage: if you remove it, the regression model changes significantly.
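Cook’s distance combines exactly these two ingredients – the size of the residual and the leverage. As a check of the textbook formula \(D_i = \dfrac{r_i^2}{p} \times \dfrac{h_{ii}}{1 - h_{ii}}\), where \(r_i\) is the standardized residual and \(p\) the number of estimated coefficients, you can rebuild it from the built-in pieces (the object names below are mine):
r.i = rstandard(lm.2)             # standardized residuals
h.ii = hatvalues(lm.2)            # hat-values
p = length(coef(lm.2))            # number of estimated coefficients
D = (r.i^2/p) * h.ii/(1 - h.ii)   # Cook's distance built by hand
all.equal(unname(D), unname(cooks.distance(lm.2)))  # should be TRUE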
An easy way to see Cook’s distances is by constructing the influencePlot()
for the estimated model.
influencePlot(lm.2)
## StudRes Hat CookD
## 35 3.8553196 0.01205959 0.04095727
## 94 0.9294449 0.40496289 0.14713536
In this plot, the size of the bubbles is proportional to Cook’s distance for each observation; the larger the bubble, the more likely it is that the observation is both an outlier and influencing the regression model. We see two culprits, observation 35 and observation 94. What countries are these? See for yourself.
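One quick way to see for yourself (the sowc data has a country column; it supplied the labels for dataEllipse() earlier):
sowc[c(35, 94), "country"]  # the two rows flagged by influencePlot()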
Now let us see what happens if we remove observation 94 from the sample and re-estimate the regression model.
lm.3 <- update(lm.2, subset = rownames(sowc) != 94)
summary(lm.3)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq,
## data = sowc, subset = rownames(sowc) != 94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.512 -10.663 -4.868 4.390 82.611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.141641 24.381862 3.902 0.000153 ***
## Income -0.000581 0.000267 -2.176 0.031393 *
## FemaleYouthLR 1.070168 0.725178 1.476 0.142472
## FemaleYouthLR.sq -0.018090 0.005108 -3.541 0.000556 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.49 on 128 degrees of freedom
## Multiple R-squared: 0.6747, Adjusted R-squared: 0.6671
## F-statistic: 88.51 on 3 and 128 DF, p-value: < 2.2e-16
compareCoefs(lm.2, lm.3)
##
## Call:
## 1: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc)
## 2: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = rownames(sowc) != 94)
## Est. 1 SE 1 Est. 2 SE 2
## (Intercept) 93.205358 24.279877 95.141641 24.381862
## Income -0.000424 0.000207 -0.000581 0.000267
## FemaleYouthLR 1.148593 0.719872 1.070168 0.725178
## FemaleYouthLR.sq -0.018828 0.005044 -0.018090 0.005108
lm.4 = update(lm.3, subset = rownames(sowc) != 35)
summary(lm.4)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq,
## data = sowc, subset = rownames(sowc) != 35)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.725 -10.101 -4.367 4.814 80.097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.1791115 23.0732950 3.995 0.000108 ***
## Income -0.0004448 0.0001968 -2.260 0.025496 *
## FemaleYouthLR 1.1944856 0.6841566 1.746 0.083223 .
## FemaleYouthLR.sq -0.0192625 0.0047939 -4.018 9.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.36 on 128 degrees of freedom
## Multiple R-squared: 0.7032, Adjusted R-squared: 0.6962
## F-statistic: 101.1 on 3 and 128 DF, p-value: < 2.2e-16
compareCoefs(lm.2, lm.3, lm.4)
##
## Call:
## 1: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc)
## 2: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = rownames(sowc) != 94)
## 3: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = rownames(sowc) != 35)
## Est. 1 SE 1 Est. 2 SE 2 Est. 3
## (Intercept) 93.205358 24.279877 95.141641 24.381862 92.179111
## Income -0.000424 0.000207 -0.000581 0.000267 -0.000445
## FemaleYouthLR 1.148593 0.719872 1.070168 0.725178 1.194486
## FemaleYouthLR.sq -0.018828 0.005044 -0.018090 0.005108 -0.019263
## SE 3
## (Intercept) 23.073295
## Income 0.000197
## FemaleYouthLR 0.684157
## FemaleYouthLR.sq 0.004794
lm.5 = update(lm.2, subset = (rownames(sowc) != 35 & rownames(sowc) != 94) )
summary(lm.5)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq,
## data = sowc, subset = (rownames(sowc) != 35 & rownames(sowc) !=
## 94))
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.396 -10.211 -4.142 4.509 80.367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.3862869 23.1339863 4.080 7.9e-05 ***
## Income -0.0006244 0.0002536 -2.462 0.015142 *
## FemaleYouthLR 1.1051346 0.6880976 1.606 0.110742
## FemaleYouthLR.sq -0.0184219 0.0048474 -3.800 0.000223 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.34 on 127 degrees of freedom
## Multiple R-squared: 0.7044, Adjusted R-squared: 0.6974
## F-statistic: 100.9 on 3 and 127 DF, p-value: < 2.2e-16
compareCoefs(lm.2, lm.3, lm.4, lm.5)
##
## Call:
## 1: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc)
## 2: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = rownames(sowc) != 94)
## 3: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = rownames(sowc) != 35)
## 4: lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data
## = sowc, subset = (rownames(sowc) != 35 & rownames(sowc) != 94))
## Est. 1 SE 1 Est. 2 SE 2 Est. 3
## (Intercept) 93.205358 24.279877 95.141641 24.381862 92.179111
## Income -0.000424 0.000207 -0.000581 0.000267 -0.000445
## FemaleYouthLR 1.148593 0.719872 1.070168 0.725178 1.194486
## FemaleYouthLR.sq -0.018828 0.005044 -0.018090 0.005108 -0.019263
## SE 3 Est. 4 SE 4
## (Intercept) 23.073295 94.386287 23.133986
## Income 0.000197 -0.000624 0.000254
## FemaleYouthLR 0.684157 1.105135 0.688098
## FemaleYouthLR.sq 0.004794 -0.018422 0.004847
Another commonly used strategy is the Marginal Model Plot. These plots show, for each continuous \(x\) variable, its relationship with \(y\) while ignoring all other independent variables in the model.
Added-Variable plots are similar to the Marginal Model Plots except that these adjust for all other independent variables when constructing the plot for each continuous independent variable. These are constructed as follows:
marginalModelPlots(lm.1)
## Warning in mmps(...): Interactions and/or factors skipped
avPlots(lm.2, id.n = 2, id.cex = 0.6)
Earlier we noticed that there was some non-linearity associated with FemaleYouthLR, but once we added the squared term (FemaleYouthLR.sq) we could get around this problem. So unless we had already fit this version of the model (lm.2), we would have to do so and then re-check these marginal model plots and added-variable plots.
sowc$FemaleYouthLR.sq = (sowc$FemaleYouthLR)^2
lm.3 = lm(U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq, data = sowc)
summary(lm.3)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + FemaleYouthLR.sq,
## data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.80 -10.92 -5.20 4.51 81.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.2053580 24.2798773 3.839 0.000193 ***
## Income -0.0004244 0.0002070 -2.050 0.042387 *
## FemaleYouthLR 1.1485934 0.7198725 1.596 0.113036
## FemaleYouthLR.sq -0.0188275 0.0050436 -3.733 0.000283 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.48 on 129 degrees of freedom
## Multiple R-squared: 0.6745, Adjusted R-squared: 0.6669
## F-statistic: 89.09 on 3 and 129 DF, p-value: < 2.2e-16
residuals.3 = resid(lm.3)
marginalModelPlots(lm.3)
avPlots(lm.3, id.n = 2, id.cex = 0.6)
The squared version of FemaleYouthLR does a better job, as is evident from the two lines being much closer in these marginal model plots than in the case of lm.1 (where FemaleYouthLR.sq was not included in the regression).
Heteroscedasticity, \(Var(e_i) \neq \sigma^2\), is a violation of the assumption that the variance of the residuals is constant. It often arises when the spread of the dependent variable grows (or shrinks) with the level of one of the independent variables, or when the model has been mis-specified. If we ignore heteroscedasticity, the coefficient estimates remain unbiased but their standard errors (and hence the t-tests and confidence intervals) are no longer reliable.
So how do we detect non-constant variance? As usual, we’ll have a choice of graphical explorations buttressed by formal statistical tests. Let us start with a graphical exploration.
We use Spread-Level Plots that graph the log of the absolute Studentized residuals against the log of the fitted values. Any pattern evident hints at a problem. The plot will also show a transformation suggested for the dependent variable.
A formal test – the Breusch-Pagan Test – is also available. This test checks whether the assumption of equal variance is met or should be rejected.
An example of heteroscedasticity in action is shown below:
data(Ornstein)
The Ornstein data frame has 248 rows and 4 columns. The observations are the 248 largest Canadian firms with publicly available information in the mid-1970s; the names of the firms were not available. The variables are assets (the firm’s assets, in millions of dollars), sector (the firm’s sector of operation, a factor), nation (the nation controlling the firm, a factor), and interlocks (the number of interlocking directorate and executive positions shared with other major firms).
lm.A <- lm(interlocks + 1 ~ log(assets) + sector + nation, data=Ornstein)
summary(lm.A)
##
## Call:
## lm(formula = interlocks + 1 ~ log(assets) + sector + nation,
## data = Ornstein)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.936 -6.169 -0.140 4.745 47.440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.4430 4.9272 -5.570 6.98e-08 ***
## log(assets) 5.9908 0.6814 8.792 3.24e-16 ***
## sectorBNK 17.3227 5.1847 3.341 0.000971 ***
## sectorCON -2.7127 5.4241 -0.500 0.617463
## sectorFIN -1.2745 3.4121 -0.374 0.709100
## sectorHLD -2.2916 4.6132 -0.497 0.619835
## sectorMAN 1.2440 2.3666 0.526 0.599621
## sectorMER -0.8801 3.0346 -0.290 0.772058
## sectorMIN 1.7566 2.4448 0.719 0.473153
## sectorTRN 1.8888 3.3023 0.572 0.567888
## sectorWOD 5.1056 3.0990 1.647 0.100801
## nationOTH -3.0533 3.0872 -0.989 0.323676
## nationUK -5.3294 3.0714 -1.735 0.084030 .
## nationUS -8.4913 1.7174 -4.944 1.46e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.26 on 234 degrees of freedom
## Multiple R-squared: 0.5353, Adjusted R-squared: 0.5094
## F-statistic: 20.73 on 13 and 234 DF, p-value: < 2.2e-16
spreadLevelPlot(lm.A)
## Warning in spreadLevelPlot.lm(lm.A): 16 negative fitted values removed
##
## Suggested power transformation: 0.5540406
Notice the ballooning that happens at the upper-end of the distribution. This is a classic example of heteroscedasticity. What if we ran the formal test?
ncvTest(lm.A, ~log(assets) + sector + nation, data=Ornstein)
## Non-constant Variance Score Test
## Variance formula: ~ log(assets) + sector + nation
## Chisquare = 290.8694 Df = 13 p = 1.952521e-54
The \(p-value\) is practically 0 so we have to reject the Null hypothesis of constant variance; we have a problem with heteroscedasticity.
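As a sketch of one possible fix (the name lm.A2 is mine; the power is the one suggested by spreadLevelPlot() above and is close to a square root), we could transform the response and re-run the test:
lm.A2 = update(lm.A, I((interlocks + 1)^0.5540406) ~ .)
ncvTest(lm.A2, ~ log(assets) + sector + nation, data = Ornstein)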
What about the model we fit to the sowc
data?
summary(lm.1)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + Urban_Rural, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.132 -11.296 -4.887 3.888 95.875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.728e+02 1.014e+01 17.037 <2e-16 ***
## Income -5.026e-04 2.162e-04 -2.325 0.0216 *
## FemaleYouthLR -1.412e+00 1.277e-01 -11.052 <2e-16 ***
## Urban_RuralUrban -9.116e+00 4.827e+00 -1.888 0.0612 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.34 on 129 degrees of freedom
## Multiple R-squared: 0.649, Adjusted R-squared: 0.6408
## F-statistic: 79.51 on 3 and 129 DF, p-value: < 2.2e-16
spreadLevelPlot(lm.1)
## Warning in spreadLevelPlot.lm(lm.1): 2 negative fitted values removed
##
## Suggested power transformation: 0.3511475
ncvTest(lm.1, ~ Income + FemaleYouthLR, data = sowc)
## Non-constant Variance Score Test
## Variance formula: ~ Income + FemaleYouthLR
## Chisquare = 40.05976 Df = 2 p = 2.000477e-09
Note that the spread-level plot shows the green line deviating from a flat line, and you also get a suggestion to transform the dependent variable by raising it to the power 0.3511475. What if we transform the dependent variable as suggested?
sowc$U5MR2 = sowc$U5MR^(0.3511475)
lm.new1 = update(lm.1, U5MR2 ~ Income + FemaleYouthLR)
summary(lm.new1)
##
## Call:
## lm(formula = U5MR2 ~ Income + FemaleYouthLR, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18482 -0.48231 -0.09641 0.34651 2.39779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.153e+00 2.958e-01 24.183 < 2e-16 ***
## Income -3.423e-05 6.205e-06 -5.516 1.8e-07 ***
## FemaleYouthLR -4.114e-02 3.499e-03 -11.757 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6938 on 130 degrees of freedom
## Multiple R-squared: 0.6631, Adjusted R-squared: 0.6579
## F-statistic: 127.9 on 2 and 130 DF, p-value: < 2.2e-16
spreadLevelPlot(lm.new1)
##
## Suggested power transformation: 0.986857
ncvTest(lm.new1, ~ Income + FemaleYouthLR, data = sowc)
## Non-constant Variance Score Test
## Variance formula: ~ Income + FemaleYouthLR
## Chisquare = 3.755136 Df = 2 p = 0.1529617
Aha, now we have a high p-value so the NULL hypothesis of constant variance cannot be rejected. We have gotten around that problem.
What if we had been unable to achieve this fix? There is a sledgehammer we could use (though we should not abuse it): what are called Huber-White heteroscedasticity-corrected standard errors, computed by the function below.
summaryw <- function(model) {
  s <- summary(model)
  X <- model.matrix(model)
  u2 <- residuals(model)^2
  XDX <- 0
  ## Here one needs to calculate X'DX. But because D is huge (N x N),
  ## it is better to do it with a loop.
  for (i in 1:nrow(X)) {
    XDX <- XDX + u2[i] * X[i, ] %*% t(X[i, ])
  }
  # inverse of X'X
  XX1 <- solve(t(X) %*% X)
  # variance calculation (bread x meat x bread -- the "sandwich")
  varcovar <- XX1 %*% XDX %*% XX1
  # degrees-of-freedom adjustment
  dfc <- sqrt(nrow(X)) / sqrt(nrow(X) - ncol(X))
  # standard errors of the coefficient estimates are the
  # square roots of the diagonal elements
  stdh <- dfc * sqrt(diag(varcovar))
  t <- model$coefficients / stdh
  p <- 2 * pnorm(-abs(t))
  results <- cbind(model$coefficients, stdh, t, p)
  dimnames(results) <- dimnames(s$coefficients)
  results
}
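As an aside, if you would rather not hand-roll this, the sandwich and lmtest packages (assuming they are installed) provide the same kind of correction: vcovHC() builds a heteroscedasticity-consistent covariance matrix and coeftest() reports the coefficients with the corrected standard errors. The "HC1" type applies a small-sample adjustment similar to the one used in summaryw() above.
library(sandwich)
library(lmtest)
coeftest(lm.1, vcov. = vcovHC(lm.1, type = "HC1"))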
Now let us compare the two standard errors – the original ones and the ones corrected for heteroscedasticity.
summary(lm.1)
##
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + Urban_Rural, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.132 -11.296 -4.887 3.888 95.875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.728e+02 1.014e+01 17.037 <2e-16 ***
## Income -5.026e-04 2.162e-04 -2.325 0.0216 *
## FemaleYouthLR -1.412e+00 1.277e-01 -11.052 <2e-16 ***
## Urban_RuralUrban -9.116e+00 4.827e+00 -1.888 0.0612 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.34 on 129 degrees of freedom
## Multiple R-squared: 0.649, Adjusted R-squared: 0.6408
## F-statistic: 79.51 on 3 and 129 DF, p-value: < 2.2e-16
summaryw(lm.1)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.727879e+02 1.578141e+01 10.948819 6.731948e-28
## Income -5.026061e-04 1.615177e-04 -3.111770 1.859691e-03
## FemaleYouthLR -1.411607e+00 1.889460e-01 -7.470954 7.961551e-14
## Urban_RuralUrban -9.116080e+00 5.698287e+00 -1.599793 1.096445e-01
Some of the partial slope coefficients’ standard errors will be larger, others will be smaller, but in general this correction – the Huber-White sandwich estimator – gives you standard errors (and hence tests and confidence intervals) that remain valid in the presence of heteroscedasticity; it does not remove the heteroscedasticity itself.
Variance Inflation Factors (VIFs), or Generalized Variance Inflation Factors (GVIFs) when the model includes factors, are among the more commonly used indicators of multicollinearity. What these try to get at is an estimate of how much larger the standard error (and hence the confidence interval) of each variable’s coefficient is than it would be if that variable had zero correlation with the other variables in the regression model. They rely upon the following:
\[VIF_j = \dfrac{1}{1 - R_j^2}\]
where \(R_j^2\) is the \(R^2\) obtained by regressing the \(j^{th}\) independent variable on all the other independent variables in the model.
data("Ericksen")
summary(Ericksen)
## minority crime poverty language
## Min. : 0.700 Min. : 25.00 Min. : 6.80 Min. : 0.200
## 1st Qu.: 5.025 1st Qu.: 48.00 1st Qu.: 9.95 1st Qu.: 0.500
## Median :15.800 Median : 55.00 Median :12.50 Median : 0.850
## Mean :19.436 Mean : 63.06 Mean :13.47 Mean : 1.926
## 3rd Qu.:28.200 3rd Qu.: 73.00 3rd Qu.:16.60 3rd Qu.: 2.350
## Max. :72.600 Max. :143.00 Max. :23.90 Max. :12.700
## highschool housing city conventional
## Min. :17.50 Min. : 7.00 city :16 Min. : 0.00
## 1st Qu.:27.45 1st Qu.: 9.40 state:50 1st Qu.: 0.00
## Median :31.60 Median :11.50 Median : 0.00
## Mean :33.65 Mean :15.67 Mean : 11.73
## 3rd Qu.:41.67 3rd Qu.:20.30 3rd Qu.: 9.75
## Max. :51.80 Max. :52.10 Max. :100.00
## undercount
## Min. :-2.310
## 1st Qu.: 0.320
## Median : 1.445
## Mean : 1.921
## 3rd Qu.: 3.305
## Max. : 8.180
lm.E = lm(undercount ~ ., data = Ericksen)
summary(lm.E)
##
## Call:
## lm(formula = undercount ~ ., data = Ericksen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8356 -0.8033 -0.0553 0.7050 4.2467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.611411 1.720843 -0.355 0.723678
## minority 0.079834 0.022609 3.531 0.000827 ***
## crime 0.030117 0.012998 2.317 0.024115 *
## poverty -0.178369 0.084916 -2.101 0.040117 *
## language 0.215125 0.092209 2.333 0.023200 *
## highschool 0.061290 0.044775 1.369 0.176415
## housing -0.034957 0.024630 -1.419 0.161259
## citystate -1.159982 0.770644 -1.505 0.137791
## conventional 0.036989 0.009253 3.997 0.000186 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.426 on 57 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.6667
## F-statistic: 17.25 on 8 and 57 DF, p-value: 1.044e-12
vif(lm.E)
## minority crime poverty language highschool
## 5.009065 3.343586 4.625178 1.635568 4.619169
## housing city conventional
## 1.871745 3.537750 1.691320
# sqrt(vif(lm.E)) > 2  # a common rule-of-thumb check
What if we fit the sowc
data with several independent variables and then tested the model for multicollinearity?
lm.7 = lm(U5MR ~ Income + Urban + FemaleYouthLR + MaleYouthLR + Fertility, data = sowc)
summary(lm.7)
##
## Call:
## lm(formula = U5MR ~ Income + Urban + FemaleYouthLR + MaleYouthLR +
## Fertility, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.554 -8.960 -1.524 4.243 91.514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.7803109 22.6136479 2.201 0.0295 *
## Income -0.0001930 0.0001993 -0.968 0.3346
## Urban -0.0849962 0.1039780 -0.817 0.4152
## FemaleYouthLR -0.6491575 0.2999460 -2.164 0.0323 *
## MaleYouthLR 0.0855617 0.4121839 0.208 0.8359
## Fertility 15.0451524 1.8761706 8.019 6.04e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.07 on 127 degrees of freedom
## Multiple R-squared: 0.7693, Adjusted R-squared: 0.7602
## F-statistic: 84.71 on 5 and 127 DF, p-value: < 2.2e-16
vif(lm.7)
## Income Urban FemaleYouthLR MaleYouthLR Fertility
## 1.599467 1.873242 11.395764 10.565046 2.916705
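As a check on the formula given earlier, here is the VIF for FemaleYouthLR computed by hand (the object name r2.f is mine): regress it on the other predictors in lm.7 and apply \(1/(1 - R_j^2)\).
r2.f = summary(lm(FemaleYouthLR ~ Income + Urban + MaleYouthLR + Fertility, data = sowc))$r.squared
1/(1 - r2.f)  # should be close to the vif() value for FemaleYouthLR reported above (about 11.4)
The VIFs above 10 for FemaleYouthLR and MaleYouthLR suggest these two variables are highly correlated with each other, which is what inflates the standard errors of their coefficients.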