With a single dummy variable as the independent variable we can suppress the intercept. This is often convenient because it eases interpretation for non-technical audiences. Below we first run the model with the intercept and then run it without the intercept.
load("~/Downloads/sowc.RData")
The model with the intercept.
lm.3 = lm(U5MR ~ Urban_Rural, data = sowc)
summary(lm.3)
##
## Call:
## lm(formula = U5MR ~ Urban_Rural, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.421 -18.421 -8.421 9.579 139.382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.421 4.500 14.094 < 2e-16 ***
## Urban_RuralUrban -38.803 5.953 -6.519 1.4e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.97 on 131 degrees of freedom
## Multiple R-squared: 0.2449, Adjusted R-squared: 0.2392
## F-statistic: 42.49 on 1 and 131 DF, p-value: 1.402e-09
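Before interpreting the coefficients, it helps to see how R codes the dummy; a quick check (assuming Urban_Rural is stored as a factor with levels Rural and Urban):
levels(sowc$Urban_Rural)  # "Rural" comes first, so it is the reference category
contrasts(sowc$Urban_Rural)  # Urban coded 1, Rural 0, hence the Urban_RuralUrban column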
The model without the intercept.
lm.4 = lm(U5MR ~ Urban_Rural - 1, data = sowc)
summary(lm.4)
##
## Call:
## lm(formula = U5MR ~ Urban_Rural - 1, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.421 -18.421 -8.421 9.579 139.382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Urban_RuralRural 63.421 4.500 14.094 < 2e-16 ***
## Urban_RuralUrban 24.618 3.897 6.317 3.83e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.97 on 131 degrees of freedom
## Multiple R-squared: 0.6455, Adjusted R-squared: 0.6401
## F-statistic: 119.3 on 2 and 131 DF, p-value: < 2.2e-16
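Notice that R-squared jumps from 0.245 to 0.646 even though the fit is identical (the residual standard error is unchanged). When the intercept is suppressed, R computes R-squared around zero rather than around the mean of U5MR, so the two values are not comparable. A quick check confirms that the two models are reparameterizations of the same fit:
all.equal(fitted(lm.3), fitted(lm.4))  # should return TRUE: identical fitted values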
Let us stack the results side-by-side with stargazer(). The stargazer package allows you to make neat tables of alternative models run on the same dependent variable but with different sets of independent variables.
library(stargazer)
stargazer(lm.3, lm.4, type = "html")
Dependent variable: U5MR

| | (1) | (2) |
|---|---|---|
| Urban_RuralRural | | 63.421*** (4.500) |
| Urban_RuralUrban | -38.803*** (5.953) | 24.618*** (3.897) |
| Constant | 63.421*** (4.500) | |
| Observations | 133 | 133 |
| R2 | 0.245 | 0.646 |
| Adjusted R2 | 0.239 | 0.640 |
| Residual Std. Error (df = 131) | 33.972 | 33.972 |
| F Statistic | 42.492*** (df = 1; 131) | 119.280*** (df = 2; 131) |

Note: *p<0.1; **p<0.05; ***p<0.01
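If you are not rendering HTML, stargazer can also print a plain-text version to the console and, via its out argument, write the table to a file (the file name below is just an example):
stargazer(lm.3, lm.4, type = "text", out = "models.txt")  # example file name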
Note what this table shows us …
* If a country is Rural, the predicted U5MR is 63.421: the intercept in lm.3, and the Urban_RuralRural coefficient in lm.4.
* If we have an intercept and the country is Urban, the predicted U5MR is the intercept plus the Urban coefficient: 63.421 - 38.803 = 24.618.
* If we have suppressed the intercept and the country is Urban, the predicted U5MR is read directly off the Urban_RuralUrban coefficient: 24.618.
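We can verify this equivalence directly from the two fitted objects:
coef(lm.3)["(Intercept)"] + coef(lm.3)["Urban_RuralUrban"]  # with the intercept: 63.421 - 38.803 = 24.618
coef(lm.4)["Urban_RuralUrban"]  # without the intercept: 24.618 directly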
What are the means of U5MR for urban versus rural countries?
library(dplyr)
grouped = group_by(sowc, Urban_Rural)
summarize(grouped, Mean.U5MR = mean(U5MR))
| Urban_Rural | Mean.U5MR |
|---|---|
| Rural | 63.42105 |
| Urban | 24.61842 |
In a nutshell, then, this linear regression is simply predicting the mean U5MR for each country type.
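A quick check with predict() recovers the same two values:
predict(lm.3, newdata = data.frame(Urban_Rural = c("Rural", "Urban")))  # 63.42105 and 24.61842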
Let us now go back to the model with Income and Urban_Rural, adding an interaction between Income and Urban_Rural.
lm.5 = lm(U5MR ~ Income + Urban_Rural + Income:Urban_Rural, data = sowc)
summary(lm.5)
##
## Call:
## lm(formula = U5MR ~ Income + Urban_Rural + Income:Urban_Rural,
## data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.025 -16.480 -9.149 8.979 133.984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.926165 5.004014 14.773 < 2e-16 ***
## Income -0.004076 0.001068 -3.816 0.00021 ***
## Urban_RuralUrban -40.018345 6.962291 -5.748 6.19e-08 ***
## Income:Urban_RuralUrban 0.003226 0.001108 2.912 0.00424 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.55 on 129 degrees of freedom
## Multiple R-squared: 0.3588, Adjusted R-squared: 0.3439
## F-statistic: 24.06 on 3 and 129 DF, p-value: 1.97e-12
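The interaction means the Income slope differs by country type: for Rural countries it is -0.004076, while for Urban countries it is -0.004076 + 0.003226 = -0.00085. We can pull these directly from the coefficients:
b = coef(lm.5)
b["Income"]  # Income slope for Rural countries
b["Income"] + b["Income:Urban_RuralUrban"]  # Income slope for Urban countries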
Could we plot the regression line and confidence intervals for Urban versus Rural countries? Sure we could:
lm5.data = cbind(sowc, predict(lm.5, interval = "confidence", level = 0.95))
library(ggplot2)
ggplot(lm5.data, aes(x = Income, y = U5MR, group = Urban_Rural)) +
    geom_point(aes(color = Urban_Rural)) +
    geom_line(aes(y = fit, color = Urban_Rural)) +
    geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.4)
Notice what we did … We generated the predicted values and the 95% confidence intervals with the predict(lm.5, interval = "confidence", level = 0.95) command, bound these to the sowc data via cbind(), and then used ggplot2 to plot the actual observations for Income and U5MR, adding the regression lines and the associated confidence intervals while distinguishing between Urban and Rural countries by color. What should be obvious is that the confidence intervals are not equally wide for the two types of country as Income increases. Why do you think this is?
We could also use a simple plot both to visualize and to communicate the precision of our regression estimates; see below.
library(coefplot)
coefplot(lm.5, CI = 2)
Note that in this command CI = 2 plots an interval of two standard errors, i.e., roughly the 95% confidence interval. What are the exact values of these confidence intervals? We can see these quite easily:
CI.95 = confint.lm(lm.5, level = 0.95)
CI.95
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 64.0256007 | 83.8267284 |
| Income | -0.0061893 | -0.0019625 |
| Urban_RuralUrban | -53.7934073 | -26.2432823 |
| Income:Urban_RuralUrban | 0.0010340 | 0.0054181 |
estimates1 = cbind(lm.5$coefficients, CI.95)
round(estimates1, digits = 4)
| | Estimate | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 73.9262 | 64.0256 | 83.8267 |
| Income | -0.0041 | -0.0062 | -0.0020 |
| Urban_RuralUrban | -40.0183 | -53.7934 | -26.2433 |
| Income:Urban_RuralUrban | 0.0032 | 0.0010 | 0.0054 |
CI.99 = confint.lm(lm.5, level = 0.99)
CI.99
| | 0.5 % | 99.5 % |
|---|---|---|
| (Intercept) | 60.8432772 | 87.0090520 |
| Income | -0.0068686 | -0.0012832 |
| Urban_RuralUrban | -58.2211052 | -21.8155844 |
| Income:Urban_RuralUrban | 0.0003295 | 0.0061227 |
estimates2 = cbind(lm.5$coefficients, CI.99)
round(estimates2, digits = 4)
| | Estimate | 0.5 % | 99.5 % |
|---|---|---|---|
| (Intercept) | 73.9262 | 60.8433 | 87.0091 |
| Income | -0.0041 | -0.0069 | -0.0013 |
| Urban_RuralUrban | -40.0183 | -58.2211 | -21.8156 |
| Income:Urban_RuralUrban | 0.0032 | 0.0003 | 0.0061 |
Note that the round() command rounds the estimates to four decimal places; otherwise the scientific notation slows interpretation.
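An alternative, if you would rather not round, is to raise R's penalty for scientific notation (a global option):
options(scipen = 999)  # print fixed rather than scientific notation from here on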
visreg()
Source: Visualization of Regression Models Using visreg
In simple linear regression, it is both straightforward and extremely useful to plot the regression line. The plot tells you everything you need to know about the model and what it predicts. It is common to superimpose this line over a scatter plot of the two variables. A further refinement is the addition of a confidence band. Thus, in one plot, the analyst can immediately assess the empirical relationship between \(x\) and \(y\) in addition to the relationship estimated by the model and the uncertainty in that estimate, and also assess how well the two agree and whether assumptions may be violated.
Multiple regression models address a more complicated question: what is the relationship between an explanatory variable and the outcome as the other explanatory variables are held constant? This relationship is just as important to visualize as the relationship in simple linear regression, but doing so is not nearly as common in statistical practice. As models get more complicated, it becomes more difficult to construct these sorts of plots. With multiple variables, we cannot simply plot the observed data, as this does not hold the other variables constant. Interactions among variables, transformations, and non-linear relationships all add extra barriers, making it time-consuming for the analyst to construct these plots. This is unfortunate, however – as models grow more complex, there is an even greater need to represent them with simple illustrations.
The visreg package is a very handy tool for plotting regression lines, visualizing interactions, and so on. The resulting plots make good presentation and reporting tools since they convey a lot of information. Let us see how the package works in the context of some of the regression models we have estimated.
lm.a = lm(U5MR ~ Income, data = sowc)
summary(lm.a)
##
## Call:
## lm(formula = U5MR ~ Income, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.65 -26.53 -12.68 18.29 129.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.2358231 3.7144805 14.332 < 2e-16 ***
## Income -0.0016308 0.0002899 -5.625 1.07e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.09 on 131 degrees of freedom
## Multiple R-squared: 0.1946, Adjusted R-squared: 0.1884
## F-statistic: 31.65 on 1 and 131 DF, p-value: 1.072e-07
library(visreg)
p1 = visreg(lm.a, alpha = 0.05)  # alpha = 0.05 gives the 95% confidence band
abline(h = 7, lwd = 0.5)  # thin reference line at U5MR = 7, Qatar's value (used below)
This is just the usual plot of the regression line with the 95% confidence interval. You can tell that when Income = 0 the predicted U5MR is slightly above 50; the exact value, 53.23, is evident from the regression results themselves. The points shown on the plot are the actual U5MR values. The gap between the actual and the predicted value, \((U5MR_i - \hat{U5MR}_i)\), is the residual. Let us calculate the predicted values and the residuals and add them to sowc.
sowc$predicted = predict(lm.a)
sowc$residuals = resid(lm.a)
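A quick sanity check that the residual is indeed the gap between the actual and predicted values:
head(sowc$U5MR - sowc$predicted)  # should match the stored residuals
head(sowc$residuals)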
Let's change how the columns are ordered so that we can see the essential columns next to one another.
names(sowc)
## [1] "country" "U5MR" "Births" "Income"
## [5] "AdultLR" "MaleYouthLR" "FemaleYouthLR" "Mobiles"
## [9] "Internet" "Population" "PopulationU5" "Life"
## [13] "Fertility" "Urban" "Urban_Rural" "predicted"
## [17] "residuals"
sowc = sowc[, c(1, 2, 4, 16, 17, 3, 5:15)]
Now open sowc and sort it in descending order of Income. For Qatar, you'll see the residual is \(7 - (-75.140475) = 82.1404748\).
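If you prefer to do the sorting in code rather than in the data viewer:
head(sowc[order(-sowc$Income), c("country", "Income", "U5MR", "predicted", "residuals")])  # highest-Income countries first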
Obviously Qatar and some of the other high-income countries (Singapore, followed by Kuwait) are most likely pulling the regression line towards them. What if we removed these from the data and then re-estimated the regression line?
sowc.sub = subset(sowc, Income <= 36040)  # drops the three highest-income countries: Qatar, Singapore, Kuwait
lm.b = lm(U5MR ~ Income, data = sowc.sub)
summary(lm.b)
##
## Call:
## lm(formula = U5MR ~ Income, data = sowc.sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.631 -24.148 -9.631 15.356 123.541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.1513744 3.9021094 15.415 < 2e-16 ***
## Income -0.0029178 0.0004165 -7.005 1.25e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.32 on 128 degrees of freedom
## Multiple R-squared: 0.2771, Adjusted R-squared: 0.2715
## F-statistic: 49.07 on 1 and 128 DF, p-value: 1.253e-10
visreg(lm.b)
Notice the change in the estimates. These high-income countries had very low U5MR, so removing them raises the average U5MR; this is reflected in the higher intercept. The regression line also has a steeper (more negative) slope.
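Placing the two sets of coefficients side by side makes the shift explicit:
rbind(full.sample = coef(lm.a), trimmed = coef(lm.b))  # compare intercepts and Income slopes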
Now let us see how visreg works with multiple independent variables. Let's say we go back to the model with Income and Urban_Rural, plus an interaction between these two variables.
lm.5 = lm(U5MR ~ Income + Urban_Rural + Income:Urban_Rural, data = sowc)
summary(lm.5)
##
## Call:
## lm(formula = U5MR ~ Income + Urban_Rural + Income:Urban_Rural,
## data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.025 -16.480 -9.149 8.979 133.984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.926165 5.004014 14.773 < 2e-16 ***
## Income -0.004076 0.001068 -3.816 0.00021 ***
## Urban_RuralUrban -40.018345 6.962291 -5.748 6.19e-08 ***
## Income:Urban_RuralUrban 0.003226 0.001108 2.912 0.00424 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.55 on 129 degrees of freedom
## Multiple R-squared: 0.3588, Adjusted R-squared: 0.3439
## F-statistic: 24.06 on 3 and 129 DF, p-value: 1.97e-12
visreg(lm.5, "Income", by = "Urban_Rural", overlay = TRUE)
visreg(lm.5, "Income", by = "Urban_Rural", overlay = FALSE)
visreg(lm.5, "Urban_Rural", "Income", breaks = c(100, 500, 1500), overlay = FALSE)
visreg(lm.5, "Urban_Rural", "Income", breaks = c(100, 500, 1500), overlay = TRUE)
Note the variants here. First, overlay = TRUE sometimes helps you see patterns, but in other cases it obfuscates things. Second, you can choose which variable you want on the x-axis. Third, you can choose the values of the continuous variable via breaks; by default visreg uses the \(10^{th}\), \(50^{th}\), and \(90^{th}\) percentiles. Note also that visreg holds any continuous variable not being plotted at its median value, and any categorical variable at its modal value.
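If you want to hold the other variables at values of your own choosing rather than the median or mode, visreg's cond argument accepts a named list:
visreg(lm.5, "Income", cond = list(Urban_Rural = "Urban"))  # Income effect with Urban_Rural fixed at Urban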