Simple Linear Regression

Recall the fundamentals of a linear regression model. In brief, you have

  • \(y =\) a continuous dependent variable
  • \(x_{i} =\) one or more independent variables that might be qualitative (categorical, nominal or ordinal) and/or quantitative (continuous/discrete, ratio or interval)
  • the existence of a linear relationship between the dependent and the independent variables
  • a set of assumptions that, if met, ensure that our slope coefficients \((\hat{\beta}_i)\) are the best linear unbiased estimators (BLUE)
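For a single independent variable, the population model being estimated is

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]

where \(\varepsilon_i\) is the error term. The BLUE property follows from the Gauss–Markov assumptions: the errors have zero mean and constant variance, and are uncorrelated across observations.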

Let us fit a simple linear regression model to the sowc data, with U5MR as the dependent variable.

load("~/Downloads/sowc.RData")
lm.1 = lm(U5MR ~ Income, data = sowc)
summary(lm.1)
## 
## Call:
## lm(formula = U5MR ~ Income, data = sowc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.65 -26.53 -12.68  18.29 129.71 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.2358231  3.7144805  14.332  < 2e-16 ***
## Income      -0.0016308  0.0002899  -5.625 1.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.09 on 131 degrees of freedom
## Multiple R-squared:  0.1946, Adjusted R-squared:  0.1884 
## F-statistic: 31.65 on 1 and 131 DF,  p-value: 1.072e-07

Note the structure of the command …

  • lm() stands for linear model. I have chosen an arbitrary name for this regression, calling it lm.1
  • We specify the regression equation as lm(dependent ~ independent, data = dataset)
  • We then check the resulting estimates via summary(lm.1)

Interpreting the Estimates

How might we interpret our results? Just as we did before. That is:

  • When Income is 0, U5MR is predicted to be 53.24 (i.e., about 53 under-five deaths per 1,000 live births)
  • For every $1 increase in Income, U5MR is predicted to decrease by 0.0016 under-five deaths per 1,000 live births
  • The adjusted \(R^2\) is 0.1884, so this model explains about 18.84% of the variation in U5MR
  • The residual standard error (the RMSE) is 35.09, indicating that our typical prediction error, based on this model, will be about \(\pm 35.09\) under-five deaths per 1,000 live births
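As a quick check, the fitted line can be evaluated by hand from the reported estimates. The coefficient values below are copied from summary(lm.1); the Income value of 10,000 is just an illustrative choice.

```r
# Coefficient estimates copied from summary(lm.1)
b0 <- 53.2358231   # intercept
b1 <- -0.0016308   # slope on Income

# Predicted U5MR for a (hypothetical) country with Income = 10000
u5mr_hat <- b0 + b1 * 10000
round(u5mr_hat, 2)  # 36.93
```

This is exactly what predict() does behind the scenes for a point prediction.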

Generating Predicted Values

We can generate predicted values with relative ease, and then plot these as well.

new.data = data.frame(Income = seq(0, 80000, by = 1000))

predicted1 = predict(lm.1, interval = "confidence", conf.level = 0.95, 
    data = new.data)
head(predicted1)
##      fit      lwr      upr
## 46.56587 40.26342 52.86831
## 46.53325 40.23420 52.83230
## 45.76678 39.54192 51.99163
## 47.16926 40.80063 53.53789
## 43.36950 37.30482 49.43418
## 27.06154 19.24410 34.87898
colnames(predicted1) = c("predicted.U5MR1", "lwr1", "upr1")
head(predicted1)
## predicted.U5MR1     lwr1     upr1
##        46.56587 40.26342 52.86831
##        46.53325 40.23420 52.83230
##        45.76678 39.54192 51.99163
##        47.16926 40.80063 53.53789
##        43.36950 37.30482 49.43418
##        27.06154 19.24410 34.87898
predicted2 = predict(lm.1, interval = "prediction", conf.level = 0.95, 
    data = new.data)
## Warning in predict.lm(lm.1, interval = "prediction", conf.level = 0.95, : predictions on current data refer to _future_ responses
head(predicted2)
##      fit       lwr       upr
## 46.56587 -23.12982 116.26156
## 46.53325 -23.16213 116.22863
## 45.76678 -23.92194 115.45549
## 47.16926 -22.53245 116.87097
## 43.36950 -26.30509 113.04410
## 27.06154 -42.78745  96.91053
colnames(predicted2) = c("predicted.U5MR2", "lwr2", "upr2")
head(predicted2)
## predicted.U5MR2      lwr2      upr2
##        46.56587 -23.12982 116.26156
##        46.53325 -23.16213 116.22863
##        45.76678 -23.92194 115.45549
##        47.16926 -22.53245 116.87097
##        43.36950 -26.30509 113.04410
##        27.06154 -42.78745  96.91053
df = cbind.data.frame(sowc, predicted1, predicted2)

Plots

library(ggplot2)

p1 = ggplot(data = df, aes(x = Income, y = predicted.U5MR1)) + 
    geom_line() + geom_ribbon(data = df, aes(ymin = lwr1, ymax = upr1), 
    alpha = 0.3)
p1

p2 = ggplot(data = df, aes(x = Income, y = predicted.U5MR2)) + 
    geom_line() + geom_ribbon(data = df, aes(ymin = lwr2, ymax = upr2), 
    alpha = 0.3)
p2

Note a few things about the commands:

  • There is little sense in trying to predict U5MR from the Income values already in the dataset. Ideally you would test your regression model against a fresh dataset, and then compare how well your predicted U5MR stacks up against the actual U5MR in that fresh dataset. Since we lack this luxury here, we generate a set of Income values with new.data = data.frame(Income = seq(0, 80000, by = 1000)). In brief, I am setting Income to start at 0 and increase to 80,000, in steps of 1,000 dollars.
  • The predict() command generates predicted values using the estimates from a specific model (here lm.1). You can ask for either the confidence intervals or the prediction intervals, and you can set the interval level (95% or 99%) via level = 0.95 or level = 0.99. One caution: predict.lm() expects the new data via the newdata argument; as written above, data = new.data (and conf.level) are silently ignored, so the predictions are actually for the original sowc Income values. That is why R warns that "predictions on current data refer to _future_ responses", and why the results bind cleanly to sowc.
  • It is easy to then bind these values to the original data frame. This is being done via df = cbind.data.frame(sowc, predicted1, predicted2)
  • Finally we use ggplot2 to generate a plot of the predicted regression line and the associated intervals. The geom_ribbon() command is shading the intervals.
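The workflow above can be sketched in a self-contained way on simulated data (the sowc data set is not reproduced here, so the numbers below are illustrative only). Note the use of newdata = and level =, which are the argument names predict.lm() actually recognizes.

```r
# Simulate a rough stand-in for the sowc relationship
set.seed(42)
d <- data.frame(Income = runif(100, 0, 80000))
d$U5MR <- 50 - 0.0015 * d$Income + rnorm(100, sd = 20)
m <- lm(U5MR ~ Income, data = d)

# A grid of new Income values, as in the text
new.data <- data.frame(Income = seq(0, 80000, by = 1000))

# Predictions on the new grid, with 95% confidence and prediction intervals
ci <- predict(m, newdata = new.data, interval = "confidence", level = 0.95)
pi <- predict(m, newdata = new.data, interval = "prediction", level = 0.95)

nrow(ci)  # 81 rows -- one per value in new.data
```

At any given Income, the prediction interval is wider than the confidence interval, because it accounts for the error variance of a single new observation rather than just the uncertainty in the fitted line.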

Dummy Independent Variables

Assume that we are also interested in seeing whether U5MR differs according to whether a country is flagged as Urban or as Rural. So we include this independent variable in our model.

sowc$Urban_Rural[sowc$Urban > 50] = "Urban"
sowc$Urban_Rural[sowc$Urban <= 50] = "Rural"
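The two assignment lines can also be written as a single ifelse() call. The toy vector below stands in for sowc$Urban (assumed here to be the percentage of the population living in urban areas); its values are illustrative only.

```r
# Toy stand-in for sowc$Urban (percent urban); values are illustrative only
urban_pct <- c(80, 30, 55, 50)

# One-line equivalent of the two assignments above
ifelse(urban_pct > 50, "Urban", "Rural")
# "Urban" "Rural" "Urban" "Rural"
```

Note that a country at exactly 50 falls in the Rural group, matching the <= 50 condition above.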

lm.2 = lm(U5MR ~ Income + Urban_Rural, data = sowc)
summary(lm.2)
## 
## Call:
## lm(formula = U5MR ~ Income + Urban_Rural, data = sowc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.188 -16.837  -8.108  10.429 132.540 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.620e+01  4.362e+00  15.175  < 2e-16 ***
## Income           -1.077e-03  2.916e-04  -3.694 0.000324 ***
## Urban_RuralUrban -2.980e+01  6.185e+00  -4.819 3.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.44 on 130 degrees of freedom
## Multiple R-squared:  0.3166, Adjusted R-squared:  0.3061 
## F-statistic: 30.12 on 2 and 130 DF,  p-value: 1.787e-11

Watch the coefficient on Urban_Rural. R reports it as Urban_RuralUrban, with an estimate of -29.80, indicating that, holding all else constant, there are almost 30 fewer under-five deaths per 1,000 live births in an Urban country than in a Rural country.
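In other words, the two groups share a slope on Income but have different intercepts, which can be recovered by hand. The values below are copied (rounded) from summary(lm.2).

```r
# Estimates copied from summary(lm.2)
b0      <- 66.20   # intercept (Rural is the reference category)
b_urban <- -29.80  # Urban_RuralUrban

b0            # Rural intercept: 66.20
b0 + b_urban  # Urban intercept: 36.40
```

The Urban line thus sits about 30 units below the Rural line at every level of Income.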

Interaction Effects

Recall the discussion about a model with interactions. Briefly, this might be:

\[U5MR = \beta_0 + \beta_1(Income) + \beta_2(Urban\_Rural) + \beta_3(Income \times Urban\_Rural)\]

where \(Urban\_Rural = 1\) for an Urban country and \(0\) for a Rural country. If it is an Urban country, then the equation becomes:

\[U5MR = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)(Income)\]

If it is a Rural country, then the equation becomes:

\[U5MR = \beta_0 + \beta_1(Income)\]

lm.3 = lm(U5MR ~ Income + Urban_Rural + Income:Urban_Rural, data = sowc)
summary(lm.3)
## 
## Call:
## lm(formula = U5MR ~ Income + Urban_Rural + Income:Urban_Rural, 
##     data = sowc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.025 -16.480  -9.149   8.979 133.984 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              73.926165   5.004014  14.773  < 2e-16 ***
## Income                   -0.004076   0.001068  -3.816  0.00021 ***
## Urban_RuralUrban        -40.018345   6.962291  -5.748 6.19e-08 ***
## Income:Urban_RuralUrban   0.003226   0.001108   2.912  0.00424 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.55 on 129 degrees of freedom
## Multiple R-squared:  0.3588, Adjusted R-squared:  0.3439 
## F-statistic: 24.06 on 3 and 129 DF,  p-value: 1.97e-12

Note: R is treating Rural as the reference (baseline) category, so the Urban_RuralUrban terms capture the differences for the Urban group. This means that if we are interested in the Rural group we simply drop any term with Urban_RuralUrban in it.
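For example, the group-specific slopes on Income can be recovered from the reported estimates (values copied from summary(lm.3)):

```r
# Estimates copied from summary(lm.3)
b_income      <- -0.004076  # Income (Rural, the reference group)
b_interaction <-  0.003226  # Income:Urban_RuralUrban

b_income                  # Rural slope: -0.004076
b_income + b_interaction  # Urban slope: -0.00085
```

So rising income predicts a much steeper decline in U5MR among Rural countries than among Urban ones, which is exactly what the significant interaction term is telling us.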