More on Simple Linear Regression

With a single dummy variable as the independent variable we can suppress the intercept. This is often convenient because it eases interpretation for non-technical audiences: each coefficient then gives the predicted mean for one group. Below we do this, first running the model with the intercept and then running the model without the intercept.

load("~/Downloads/sowc.RData")

The model with the intercept.

lm.3 = lm(U5MR ~ Urban_Rural, data = sowc)
summary(lm.3)
## 
## Call:
## lm(formula = U5MR ~ Urban_Rural, data = sowc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.421 -18.421  -8.421   9.579 139.382 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        63.421      4.500  14.094  < 2e-16 ***
## Urban_RuralUrban  -38.803      5.953  -6.519  1.4e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.97 on 131 degrees of freedom
## Multiple R-squared:  0.2449, Adjusted R-squared:  0.2392 
## F-statistic: 42.49 on 1 and 131 DF,  p-value: 1.402e-09

The model without the intercept.

lm.4 = lm(U5MR ~ Urban_Rural - 1, data = sowc)
summary(lm.4)
## 
## Call:
## lm(formula = U5MR ~ Urban_Rural - 1, data = sowc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.421 -18.421  -8.421   9.579 139.382 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## Urban_RuralRural   63.421      4.500  14.094  < 2e-16 ***
## Urban_RuralUrban   24.618      3.897   6.317 3.83e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.97 on 131 degrees of freedom
## Multiple R-squared:  0.6455, Adjusted R-squared:  0.6401 
## F-statistic: 119.3 on 2 and 131 DF,  p-value: < 2.2e-16

Let us stack the results side-by-side with the stargazer package. This package allows you to make neat tables of alternative models run on the same dependent variable but with different sets of independent variables.

library(stargazer)
stargazer(lm.3, lm.4, type = "html")
===================================================================
                                      Dependent variable:
                               ------------------------------------
                                              U5MR
                                     (1)                 (2)
-------------------------------------------------------------------
Urban_RuralRural                                      63.421***
                                                       (4.500)
Urban_RuralUrban                 -38.803***           24.618***
                                  (5.953)              (3.897)
Constant                          63.421***
                                  (4.500)
-------------------------------------------------------------------
Observations                        133                  133
R2                                 0.245                0.646
Adjusted R2                        0.239                0.640
Residual Std. Error (df = 131)    33.972               33.972
F Statistic              42.492*** (df = 1; 131) 119.280*** (df = 2; 131)
-------------------------------------------------------------------
Note:                                  *p<0.1; **p<0.05; ***p<0.01

Note what this shows us …
* If it is a Rural country, the predicted U5MR is 63.421 in both models: it is the intercept in the first model and the coefficient on Urban_RuralRural in the second.
* If we have an intercept and it is an Urban country, the predicted U5MR is the intercept plus the Urban coefficient: 63.421 - 38.803 = 24.618.
* If we have suppressed the intercept and it is an Urban country, the predicted U5MR is read directly off the coefficient on Urban_RuralUrban: 24.618 (verified with predict() in the sketch below).
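
As a quick check (a minimal sketch, assuming sowc, lm.3, and lm.4 are still in memory), predict() confirms that both models yield identical fitted values; suppressing the intercept only relabels the coefficients.

newdat = data.frame(Urban_Rural = c("Rural", "Urban"))
predict(lm.3, newdata = newdat)  # 63.421 for Rural, 24.618 for Urban
predict(lm.4, newdata = newdat)  # identical predictions from the no-intercept model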

What are the means of U5MR for urban versus rural countries?

library(dplyr)
grouped = group_by(sowc, Urban_Rural)
summarize(grouped, Mean.U5MR = mean(U5MR))
##   Urban_Rural Mean.U5MR
##   Rural        63.42105
##   Urban        24.61842

In a nutshell, then, what your linear regression is doing is just predicting the mean U5MR for each country type.

Plotting Groups’ Regression Lines

Let us now go back to the model with Income, Urban_Rural, and an interaction between Income and Urban_Rural.

lm.5 = lm(U5MR ~ Income + Urban_Rural + Income:Urban_Rural, data = sowc)
summary(lm.5)
## 
## Call:
## lm(formula = U5MR ~ Income + Urban_Rural + Income:Urban_Rural, 
##     data = sowc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.025 -16.480  -9.149   8.979 133.984 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              73.926165   5.004014  14.773  < 2e-16 ***
## Income                   -0.004076   0.001068  -3.816  0.00021 ***
## Urban_RuralUrban        -40.018345   6.962291  -5.748 6.19e-08 ***
## Income:Urban_RuralUrban   0.003226   0.001108   2.912  0.00424 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.55 on 129 degrees of freedom
## Multiple R-squared:  0.3588, Adjusted R-squared:  0.3439 
## F-statistic: 24.06 on 3 and 129 DF,  p-value: 1.97e-12

Could we plot the regression line and confidence intervals for Urban versus Rural countries? Sure we could:

lm5.data = cbind(sowc, predict(lm.5, interval = "confidence", 
    level = 0.95))

library(ggplot2)
ggplot(lm5.data, aes(x = Income, y = U5MR, group = Urban_Rural)) +
    geom_point(aes(color = Urban_Rural)) +
    geom_line(aes(y = fit, color = Urban_Rural)) +
    geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.4)

Notice what we did …

  1. We generated the predicted values of U5MR along with the 95% confidence intervals via predict(lm.5, interval = "confidence", level = 0.95) and bound these to the sowc data via cbind().
  2. We then used ggplot2 to plot the actual observations for Income and U5MR, added the regression lines and the associated confidence intervals, all the while distinguishing between Urban and Rural countries by color.

What should be obvious is that the confidence intervals are not equally wide across the range of Income for either type of country. Why do you think this is? (The sketch below offers a hint.)
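
One way to see why (a minimal sketch using predict()'s se.fit option): the standard error of a fitted value grows as an observation moves away from the bulk of the data, so the ribbons fan out where Income values are sparse.

pr = predict(lm.5, se.fit = TRUE)  # fitted values plus their standard errors
tapply(pr$se.fit, sowc$Urban_Rural, range)  # the se's vary considerably within each group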

Coefplot()

We could also use a simple plot both to visualize and to communicate the precision of our regression estimates; see below.

library(coefplot)
coefplot(lm.5, CI = 2)

Note that in this command CI = 2 plots the 95% confidence interval (roughly two standard errors on either side of each estimate). What are the values of these confidence intervals? We can see them quite easily:

CI.95 = confint(lm.5, level = 0.95)
CI.95
##                                2.5 %      97.5 %
## (Intercept)              64.0256007  83.8267284
## Income                   -0.0061893  -0.0019625
## Urban_RuralUrban        -53.7934073 -26.2432823
## Income:Urban_RuralUrban   0.0010340   0.0054181

estimates1 = cbind(lm.5$coefficients, CI.95)
round(estimates1, digits = 4)
##                                      2.5 %   97.5 %
## (Intercept)              73.9262  64.0256  83.8267
## Income                   -0.0041  -0.0062  -0.0020
## Urban_RuralUrban        -40.0183 -53.7934 -26.2433
## Income:Urban_RuralUrban   0.0032   0.0010   0.0054

CI.99 = confint(lm.5, level = 0.99)
CI.99
##                                0.5 %       99.5 %
## (Intercept)              60.8432772  87.0090520
## Income                   -0.0068686  -0.0012832
## Urban_RuralUrban        -58.2211052 -21.8155844
## Income:Urban_RuralUrban   0.0003295   0.0061227

estimates2 = cbind(lm.5$coefficients, CI.99)
round(estimates2, digits = 4)
##                                      0.5 %   99.5 %
## (Intercept)              73.9262  60.8433  87.0091
## Income                   -0.0041  -0.0069  -0.0013
## Urban_RuralUrban        -40.0183 -58.2211 -21.8156
## Income:Urban_RuralUrban   0.0032   0.0003   0.0061

Note that the round() command rounds the estimates to four decimal places; otherwise the scientific notation slows interpretation.
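
An alternative to rounding (a sketch): raise R's scipen option so that fixed notation is preferred over scientific notation when printing.

options(scipen = 999)  # penalize scientific notation globally
estimates1             # now prints without scientific notation, no rounding needed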

Plots with visreg()

Source: Visualization of Regression Models Using visreg

In simple linear regression, it is both straightforward and extremely useful to plot the regression line. The plot tells you everything you need to know about the model and what it predicts. It is common to superimpose this line over a scatter plot of the two variables. A further refinement is the addition of a confidence band. Thus, in one plot, the analyst can immediately assess the empirical relationship between \(x\) and \(y\) in addition to the relationship estimated by the model and the uncertainty in that estimate, and also assess how well the two agree and whether assumptions may be violated.

Multiple regression models address a more complicated question: what is the relationship between an explanatory variable and the outcome as the other explanatory variables are held constant? This relationship is just as important to visualize as the relationship in simple linear regression, but doing so is not nearly as common in statistical practice. As models get more complicated, it becomes more difficult to construct these sorts of plots. With multiple variables, we cannot simply plot the observed data, as this does not hold the other variables constant. Interactions among variables, transformations, and non-linear relationships all add extra barriers, making it time-consuming for the analyst to construct these plots. This is unfortunate, however: as models grow more complex, there is an even greater need to represent them with simple illustrations.

The visreg package is a very handy tool for plotting regression lines, visualizing interactions, and so on. The resulting plots are useful presentation and reporting tools since they convey a lot of information. Let us see how the package works in the context of some of the regression models we have estimated.

lm.a = lm(U5MR ~ Income, data = sowc)
summary(lm.a)
## 
## Call:
## lm(formula = U5MR ~ Income, data = sowc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.65 -26.53 -12.68  18.29 129.71 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.2358231  3.7144805  14.332  < 2e-16 ***
## Income      -0.0016308  0.0002899  -5.625 1.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.09 on 131 degrees of freedom
## Multiple R-squared:  0.1946, Adjusted R-squared:  0.1884 
## F-statistic: 31.65 on 1 and 131 DF,  p-value: 1.072e-07
library(visreg)
p1 = visreg(lm.a, alpha = 0.05)
abline(h = 7, lwd = 0.5)  # horizontal reference line at U5MR = 7, Qatar's value (see below)

This is the usual plot of the regression line with its 95% confidence band (alpha = 0.05). You can tell that when Income = 0 the predicted U5MR is slightly above 50; the exact value, 53.24, is evident from the regression output itself. The points shown on the plot are the actual U5MR values, and the gap between an observed and a predicted value, \(U5MR_i - \hat{U5MR}_i\), is the residual. Let us calculate the predicted values and the residuals and add them to sowc.

sowc$predicted = predict(lm.a)
sowc$residuals = resid(lm.a)

Let's reorder the columns so that we can see the essential ones next to one another.

names(sowc)
##  [1] "country"       "U5MR"          "Births"        "Income"       
##  [5] "AdultLR"       "MaleYouthLR"   "FemaleYouthLR" "Mobiles"      
##  [9] "Internet"      "Population"    "PopulationU5"  "Life"         
## [13] "Fertility"     "Urban"         "Urban_Rural"   "predicted"    
## [17] "residuals"
sowc = sowc[, c(1, 2, 4, 16, 17, 3, 5:15)]

Now open sowc and sort it in descending order of Income (or do the sort in code, as sketched below). For Qatar you'll see that the observed U5MR is 7 while the predicted value is \(-75.1404748\), so the residual is \(7 - (-75.1404748) = 82.1404748\).
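
If you prefer to do the sort in code rather than in the data viewer, a minimal sketch:

# the three highest-Income countries, with their fitted values and residuals
head(sowc[order(-sowc$Income), c("country", "U5MR", "Income", "predicted", "residuals")], 3)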

Obviously Qatar and some of the other high-income countries (Singapore, followed by Kuwait) are most likely pulling the regression line towards them. What if we removed these from the data and then re-estimated the regression line?

sowc.sub = subset(sowc, Income <= 36040)  # keeps Income <= 36,040, dropping the three highest-income countries
lm.b = lm(U5MR ~ Income, data = sowc.sub)
summary(lm.b)
## 
## Call:
## lm(formula = U5MR ~ Income, data = sowc.sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.631 -24.148  -9.631  15.356 123.541 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 60.1513744  3.9021094  15.415  < 2e-16 ***
## Income      -0.0029178  0.0004165  -7.005 1.25e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.32 on 128 degrees of freedom
## Multiple R-squared:  0.2771, Adjusted R-squared:  0.2715 
## F-statistic: 49.07 on 1 and 128 DF,  p-value: 1.253e-10
visreg(lm.b)

Notice the change in the estimates. These high-income countries had very low U5MR, so removing them raises the average U5MR; this is reflected in the higher intercept. The regression line also has a steeper (more negative) slope.
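
To see the shift directly, here is a quick side-by-side of the two sets of estimates (a sketch):

round(rbind(all.countries = coef(lm.a), income.trimmed = coef(lm.b)), 4)
# the intercept rises from about 53.24 to 60.15 and the slope steepens from -0.0016 to -0.0029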

Visreg for Multiple Regression

Now let us see how visreg works with multiple independent variables. Say we go back to the model with Income and Urban_Rural, plus an interaction between the two.

lm.5 = lm(U5MR ~ Income + Urban_Rural + Income:Urban_Rural, data = sowc)
summary(lm.5)
## 
## Call:
## lm(formula = U5MR ~ Income + Urban_Rural + Income:Urban_Rural, 
##     data = sowc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.025 -16.480  -9.149   8.979 133.984 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              73.926165   5.004014  14.773  < 2e-16 ***
## Income                   -0.004076   0.001068  -3.816  0.00021 ***
## Urban_RuralUrban        -40.018345   6.962291  -5.748 6.19e-08 ***
## Income:Urban_RuralUrban   0.003226   0.001108   2.912  0.00424 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.55 on 129 degrees of freedom
## Multiple R-squared:  0.3588, Adjusted R-squared:  0.3439 
## F-statistic: 24.06 on 3 and 129 DF,  p-value: 1.97e-12
visreg(lm.5, "Income", by = "Urban_Rural", overlay = TRUE)

visreg(lm.5, "Income", by = "Urban_Rural", overlay = FALSE)

visreg(lm.5, "Urban_Rural", "Income", breaks = c(100, 500, 1500), 
    overlay = FALSE)

visreg(lm.5, "Urban_Rural", "Income", breaks = c(100, 500, 1500), 
    overlay = TRUE)

Note the variants here. First, overlay = TRUE sometimes helps to reveal patterns, but in other cases it clutters the plot. Second, you can choose which variable goes on the x-axis. Third, you can choose the values of the continuous conditioning variable via breaks; by default visreg uses the \(10^{th}\), \(50^{th}\), and \(90^{th}\) percentiles. Note also that visreg holds any continuous variable not shown in the plot at its median and any categorical variable at its most frequent (modal) category; the sketch below shows how to override these defaults.
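
If the defaults do not suit you (a minimal sketch), visreg's cond argument lets you fix the conditioning variables at values of your choosing:

visreg(lm.5, "Income", cond = list(Urban_Rural = "Urban"))  # condition on Urban rather than the modal category
visreg(lm.5, "Urban_Rural", cond = list(Income = 1000))     # hold Income at 1,000 instead of its median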