Simple Linear Regression

Recall the fundamentals of a linear regression model. In brief, you have

  • \(y =\) a continuous dependent variable
  • \(x_{i} =\) one or more independent variables that might be qualitative (categorical, nominal or ordinal) and/or quantitative (continuous/discrete, ratio or interval)
  • the existence of a linear relationship between the dependent and the independent variables
  • a set of assumptions that, if met, ensure that our slope coefficients \((\hat{\beta}_i)\) are the best linear unbiased estimators (BLUE)
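For a single independent variable, the population model being estimated is

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]

where \(\varepsilon_i\) is the error term. The BLUE property follows from the Gauss–Markov assumptions: the errors have zero mean and constant variance, and are uncorrelated across observations.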

Let us fit a simple linear regression model to the sowc data, with U5MR as the dependent variable.

load("~/Downloads/sowc.RData")
lm.1 = lm(U5MR ~ Income, data = sowc)
summary(lm.1)
## 
## Call:
## lm(formula = U5MR ~ Income, data = sowc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.65 -26.53 -12.68  18.29 129.71 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.2358231  3.7144805  14.332  < 2e-16 ***
## Income      -0.0016308  0.0002899  -5.625 1.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.09 on 131 degrees of freedom
## Multiple R-squared:  0.1946, Adjusted R-squared:  0.1884 
## F-statistic: 31.65 on 1 and 131 DF,  p-value: 1.072e-07

Note the structure of the command …

  • lm() stands for linear model. I have chosen an arbitrary name for this regression, calling it lm.1
  • We specify the regression equation as lm(dependent ~ independent, data = dataset)
  • We then check the resulting estimates via summary(lm.1)

Interpreting the Estimates

How might we interpret our results? Just as we did before. That is:

  • When Income is 0, U5MR is predicted to be 53.24 (i.e., about 53 under-five deaths per 1,000 live births)
  • For every $1 increase in Income, U5MR is predicted to decrease by 0.0016 under-five deaths per 1,000 live births
  • The adjusted \(R^2\) is 0.1884, so this model explains about 18.84% of the variation in U5MR
  • The residual standard error (the RMSE) is 35.09, indicating that our typical prediction error, based on this model, will be about \(\pm 35.09\) under-five deaths per 1,000 live births
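As a quick check, the fitted line can be evaluated by hand from the reported estimates. The coefficient values below are copied from summary(lm.1); the Income value of 10,000 is just an illustrative choice.

```r
# Coefficient estimates copied from summary(lm.1)
b0 <- 53.2358231   # intercept
b1 <- -0.0016308   # slope on Income

# Predicted U5MR for a (hypothetical) country with Income = 10000
u5mr_hat <- b0 + b1 * 10000
round(u5mr_hat, 2)  # 36.93
```

This is exactly what predict() does behind the scenes for a point prediction.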

Generating Predicted Values

We can generate predicted values with relative ease, and then plot these as well.

new.data = data.frame(Income = seq(0, 80000, by = 1000))

predicted1 = predict(lm.1, interval = "confidence", conf.level = 0.95, 
    data = new.data)
head(predicted1)
##      fit      lwr      upr
## 46.56587 40.26342 52.86831
## 46.53325 40.23420 52.83230
## 45.76678 39.54192 51.99163
## 47.16926 40.80063 53.53789
## 43.36950 37.30482 49.43418
## 27.06154 19.24410 34.87898
colnames(predicted1) = c("predicted.U5MR1", "lwr1", "upr1")
head(predicted1)
## predicted.U5MR1     lwr1     upr1
##        46.56587 40.26342 52.86831
##        46.53325 40.23420 52.83230
##        45.76678 39.54192 51.99163
##        47.16926 40.80063 53.53789
##        43.36950 37.30482 49.43418
##        27.06154 19.24410 34.87898
predicted2 = predict(lm.1, interval = "prediction", conf.level = 0.95, 
    data = new.data)
## Warning in predict.lm(lm.1, interval = "prediction", conf.level = 0.95, : predictions on current data refer to _future_ responses
head(predicted2)
##      fit       lwr       upr
## 46.56587 -23.12982 116.26156
## 46.53325 -23.16213 116.22863
## 45.76678 -23.92194 115.45549
## 47.16926 -22.53245 116.87097
## 43.36950 -26.30509 113.04410
## 27.06154 -42.78745  96.91053
colnames(predicted2) = c("predicted.U5MR2", "lwr2", "upr2")
head(predicted2)
## predicted.U5MR2      lwr2      upr2
##        46.56587 -23.12982 116.26156
##        46.53325 -23.16213 116.22863
##        45.76678 -23.92194 115.45549
##        47.16926 -22.53245 116.87097
##        43.36950 -26.30509 113.04410
##        27.06154 -42.78745  96.91053
df = cbind.data.frame(sowc, predicted1, predicted2)

Plots

library(ggplot2)

p1 = ggplot(data = df, aes(x = Income, y = predicted.U5MR1)) + 
    geom_line() + geom_ribbon(data = df, aes(ymin = lwr1, ymax = upr1), 
    alpha = 0.3)
p1

p2 = ggplot(data = df, aes(x = Income, y = predicted.U5MR2)) + 
    geom_line() + geom_ribbon(data = df, aes(ymin = lwr2, ymax = upr2), 
    alpha = 0.3)
p2

Note a few things about the commands:

  • There is little sense in trying to predict U5MR from the Income values already in the dataset. Ideally you would test your regression model against a fresh dataset, and then compare how well your predicted U5MR stacks up against the actual U5MR in that fresh dataset. Since we lack this luxury here, we generate a set of Income values with new.data = data.frame(Income = seq(0, 80000, by = 1000)). In brief, I am setting Income to start at 0 and increase to 80,000, in steps of 1,000 dollars.
  • The predict() command generates predicted values using the estimates from a specific model (here lm.1). You can ask for either the confidence intervals or the prediction intervals, and you can set the interval level (95% or 99%) via level = 0.95 or level = 0.99. One caution: predict.lm() expects the new data via the newdata argument; as written above, data = new.data (and conf.level) are silently ignored, so the predictions are actually for the original sowc Income values. That is why R warns that "predictions on current data refer to _future_ responses", and why the results bind cleanly to sowc.
  • It is easy to then bind these values to the original data frame. This is being done via df = cbind.data.frame(sowc, predicted1, predicted2)
  • Finally we use ggplot2 to generate a plot of the predicted regression line and the associated intervals. The geom_ribbon() command is shading the intervals.
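The workflow above can be sketched in a self-contained way on simulated data (the sowc data set is not reproduced here, so the numbers below are illustrative only). Note the use of newdata = and level =, which are the argument names predict.lm() actually recognizes.

```r
# Simulate a rough stand-in for the sowc relationship
set.seed(42)
d <- data.frame(Income = runif(100, 0, 80000))
d$U5MR <- 50 - 0.0015 * d$Income + rnorm(100, sd = 20)
m <- lm(U5MR ~ Income, data = d)

# A grid of new Income values, as in the text
new.data <- data.frame(Income = seq(0, 80000, by = 1000))

# Predictions on the new grid, with 95% confidence and prediction intervals
ci <- predict(m, newdata = new.data, interval = "confidence", level = 0.95)
pi <- predict(m, newdata = new.data, interval = "prediction", level = 0.95)

nrow(ci)  # 81 rows -- one per value in new.data
```

At any given Income, the prediction interval is wider than the confidence interval, because it accounts for the error variance of a single new observation rather than just the uncertainty in the fitted line.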

Dummy Independent Variables

Assume that we are also interested in seeing whether U5MR differs according to whether a country is flagged as Urban or as Rural. So we include this independent variable in our model.

sowc$Urban_Rural[sowc$Urban > 50] = "Urban"
sowc$Urban_Rural[sowc$Urban <= 50] = "Rural"
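The two assignment lines can also be written as a single ifelse() call. The toy vector below stands in for sowc$Urban (assumed here to be the percentage of the population living in urban areas); its values are illustrative only.

```r
# Toy stand-in for sowc$Urban (percent urban); values are illustrative only
urban_pct <- c(80, 30, 55, 50)

# One-line equivalent of the two assignments above
ifelse(urban_pct > 50, "Urban", "Rural")
# "Urban" "Rural" "Urban" "Rural"
```

Note that a country at exactly 50 falls in the Rural group, matching the <= 50 condition above.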

lm.2 = lm(U5MR ~ Income + Urban_Rural, data = sowc)
summary(lm.2)
## 
## Call:
## lm(formula = U5MR ~ Income + Urban_Rural, data = sowc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.188 -16.837  -8.108  10.429 132.540 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.620e+01  4.362e+00  15.175  < 2e-16 ***
## Income           -1.077e-03  2.916e-04  -3.694 0.000324 ***
## Urban_RuralUrban -2.980e+01  6.185e+00  -4.819 3.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.44 on 130 degrees of freedom
## Multiple R-squared:  0.3166, Adjusted R-squared:  0.3061 
## F-statistic: 30.12 on 2 and 130 DF,  p-value: 1.787e-11

Watch the coefficient on Urban_Rural. R reports it as Urban_RuralUrban, with an estimate of -29.80, indicating that, holding all else constant, there are almost 30 fewer under-five deaths per 1,000 live births in an Urban country than in a Rural country.
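In other words, the two groups share a slope on Income but have different intercepts, which can be recovered by hand. The values below are copied (rounded) from summary(lm.2).

```r
# Estimates copied from summary(lm.2)
b0      <- 66.20   # intercept (Rural is the reference category)
b_urban <- -29.80  # Urban_RuralUrban

b0            # Rural intercept: 66.20
b0 + b_urban  # Urban intercept: 36.40
```

The Urban line thus sits about 30 units below the Rural line at every level of Income.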

Interaction Effects

Recall the discussion about a model with interactions. Briefly, this might be:

\[U5MR = \beta_0 + \beta_1(Income) + \beta_2(Urban\_Rural) + \beta_3(Income \times Urban\_Rural)\]

where \(Urban\_Rural = 1\) for an Urban country and \(0\) for a Rural country. If it is an Urban country, then the equation becomes:

\[U5MR = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)(Income)\]

If it is a Rural country, then the equation becomes:

\[U5MR = \beta_0 + \beta_1(Income)\]

lm.3 = lm(U5MR ~ Income + Urban_Rural + Income:Urban_Rural, data = sowc)
summary(lm.3)
## 
## Call:
## lm(formula = U5MR ~ Income + Urban_Rural + Income:Urban_Rural, 
##     data = sowc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.025 -16.480  -9.149   8.979 133.984 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              73.926165   5.004014  14.773  < 2e-16 ***
## Income                   -0.004076   0.001068  -3.816  0.00021 ***
## Urban_RuralUrban        -40.018345   6.962291  -5.748 6.19e-08 ***
## Income:Urban_RuralUrban   0.003226   0.001108   2.912  0.00424 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.55 on 129 degrees of freedom
## Multiple R-squared:  0.3588, Adjusted R-squared:  0.3439 
## F-statistic: 24.06 on 3 and 129 DF,  p-value: 1.97e-12

Note: R is treating Rural as the reference (baseline) category, so the Urban_RuralUrban terms capture the differences for the Urban group. This means that if we are interested in the Rural group we simply drop any term with Urban_RuralUrban in it.
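For example, the group-specific slopes on Income can be recovered from the reported estimates (values copied from summary(lm.3)):

```r
# Estimates copied from summary(lm.3)
b_income      <- -0.004076  # Income (Rural, the reference group)
b_interaction <-  0.003226  # Income:Urban_RuralUrban

b_income                  # Rural slope: -0.004076
b_income + b_interaction  # Urban slope: -0.00085
```

So rising income predicts a much steeper decline in U5MR among Rural countries than among Urban ones, which is exactly what the significant interaction term is telling us.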