Recall the fundamentals of a linear regression model. In brief, you have
Let us fit a simple linear regression model to the sowc
data, with U5MR as the dependent variable.
load("~/Downloads/sowc.RData")
lm.1 = lm(U5MR ~ Income, data = sowc)
summary(lm.1)
##
## Call:
## lm(formula = U5MR ~ Income, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.65 -26.53 -12.68 18.29 129.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.2358231 3.7144805 14.332 < 2e-16 ***
## Income -0.0016308 0.0002899 -5.625 1.07e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.09 on 131 degrees of freedom
## Multiple R-squared: 0.1946, Adjusted R-squared: 0.1884
## F-statistic: 31.65 on 1 and 131 DF, p-value: 1.072e-07
Note the structure of the command …
lm()
stands for a linear model. I have chosen an arbitray name for this regression, calling it lm.1
lm(dependent ~ independent, data = dataset)
summary(lm.1)
How might we interpret our results? Just as we did before. That is:
We can generate predicted values with relative ease, and then plot these as well.
new.data = data.frame(Income = seq(0, 80000, by = 1000))
predicted1 = predict(lm.1, interval = "confidence", conf.level = 0.95,
data = new.data)
head(predicted1)
fit | lwr | upr |
---|---|---|
46.56587 | 40.26342 | 52.86831 |
46.53325 | 40.23420 | 52.83230 |
45.76678 | 39.54192 | 51.99163 |
47.16926 | 40.80063 | 53.53789 |
43.36950 | 37.30482 | 49.43418 |
27.06154 | 19.24410 | 34.87898 |
colnames(predicted1) = c("predicted.U5MR1", "lwr1", "upr1")
head(predicted1)
predicted.U5MR1 | lwr1 | upr1 |
---|---|---|
46.56587 | 40.26342 | 52.86831 |
46.53325 | 40.23420 | 52.83230 |
45.76678 | 39.54192 | 51.99163 |
47.16926 | 40.80063 | 53.53789 |
43.36950 | 37.30482 | 49.43418 |
27.06154 | 19.24410 | 34.87898 |
predicted2 = predict(lm.1, interval = "prediction", conf.level = 0.95,
data = new.data)
## Warning in predict.lm(lm.1, interval = "prediction", conf.level = 0.95, : predictions on current data refer to _future_ responses
head(predicted2)
fit | lwr | upr |
---|---|---|
46.56587 | -23.12982 | 116.26156 |
46.53325 | -23.16213 | 116.22863 |
45.76678 | -23.92194 | 115.45549 |
47.16926 | -22.53245 | 116.87097 |
43.36950 | -26.30509 | 113.04410 |
27.06154 | -42.78745 | 96.91053 |
colnames(predicted2) = c("predicted.U5MR2", "lwr2", "upr2")
head(predicted2)
predicted.U5MR2 | lwr2 | upr2 |
---|---|---|
46.56587 | -23.12982 | 116.26156 |
46.53325 | -23.16213 | 116.22863 |
45.76678 | -23.92194 | 115.45549 |
47.16926 | -22.53245 | 116.87097 |
43.36950 | -26.30509 | 113.04410 |
27.06154 | -42.78745 | 96.91053 |
df = cbind.data.frame(sowc, predicted1, predicted2)
library(ggplot2)
p1 = ggplot(data = df, aes(x = Income, y = predicted.U5MR1)) +
geom_line() + geom_ribbon(data = df, aes(ymin = lwr1, ymax = upr1),
alpha = 0.3)
p1
p2 = ggplot(data = df, aes(x = Income, y = predicted.U5MR2)) +
geom_line() + geom_ribbon(data = df, aes(ymin = lwr2, ymax = upr2),
alpha = 0.3)
p2
Note a few things about the commands:
new.data = data.frame(Income = seq(0, 80000, by = 1000))
. In brief, I am setting Income to start at 0 and increase to 80,000, with the step increase of 1000 dollars.predict()
command generates predicted values by using the estimates from a specific model (here lm.1
) and Income from the new.data
we created. When generating predicted values you can ask for either the confidence intervals or the prediction intervals. You can also specify the interval (95% or 99%) via conf.level = 0.95
or conf.level = 0.99
.df = cbind.data.frame(sowc, predicted1, predicted2)
ggplot2
to generate a plot of the predicted regression line and the associated intervals. The geom_ribbon()
command is shading the intervals.Assume that we are also interested in seeing if U5MR also differs according to whether a country is flagged as Urban or as Rural. So we include this independent variable in our model.
sowc$Urban_Rural[sowc$Urban > 50] = "Urban"
sowc$Urban_Rural[sowc$Urban <= 50] = "Rural"
lm.2 = lm(U5MR ~ Income + Urban_Rural, data = sowc)
summary(lm.2)
##
## Call:
## lm(formula = U5MR ~ Income + Urban_Rural, data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.188 -16.837 -8.108 10.429 132.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.620e+01 4.362e+00 15.175 < 2e-16 ***
## Income -1.077e-03 2.916e-04 -3.694 0.000324 ***
## Urban_RuralUrban -2.980e+01 6.185e+00 -4.819 3.96e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.44 on 130 degrees of freedom
## Multiple R-squared: 0.3166, Adjusted R-squared: 0.3061
## F-statistic: 30.12 on 2 and 130 DF, p-value: 1.787e-11
Watch the coefficient on Urban_Rural. First, you see the coefficient estimated for Urban
and the value is -29.80, indicating that holding all else constant there are almost 30 fewer infant mortalities per 1000 live briths in an Urban country as compared to in a Rural country.
Recall the discussion about a model with interactions. Briefly, this might be:
\[U5MR = \beta_0 + \beta_1(Income) + \beta_2(Urban\_Rural) + \beta_4(Income \times Urban\_Rural)\]
If it is an Urban country, then the equation becomes:
\[U5MR = \beta_0 + \beta_1(Income) + \beta_2(Urban) + \beta_4(Income \times Urban)\]
If it is a Rural country, then the equation becomes:
\[U5MR = \beta_0 + \beta_1(Income)\]
lm.3 = lm(U5MR ~ Income + Urban_Rural + Income:Urban_Rural, data = sowc)
summary(lm.3)
##
## Call:
## lm(formula = U5MR ~ Income + Urban_Rural + Income:Urban_Rural,
## data = sowc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.025 -16.480 -9.149 8.979 133.984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.926165 5.004014 14.773 < 2e-16 ***
## Income -0.004076 0.001068 -3.816 0.00021 ***
## Urban_RuralUrban -40.018345 6.962291 -5.748 6.19e-08 ***
## Income:Urban_RuralUrban 0.003226 0.001108 2.912 0.00424 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.55 on 129 degrees of freedom
## Multiple R-squared: 0.3588, Adjusted R-squared: 0.3439
## F-statistic: 24.06 on 3 and 129 DF, p-value: 1.97e-12
Note: R is using Urban as the group to be modeled, which means if we are interested in the Rural group we’ll have to cancel any term with Urban_RuralUrban
in it.