Chapter 9 Analysis of Variance (ANOVA)

The t-tests we covered in the preceding chapter allowed us to compare two groups on a numeric outcome. But what are our options when we have a numeric outcome of interest but more than two groups to compare? It may be tempting to carry out pairwise comparisons, two at a time. If in doing so we find that A and B are not statistically different, and neither are A and C, but B and C are, we would seem to have learned which groups differ. So what could be wrong with such an approach? For example, look at the customer satisfaction scores provided by a random sample of customers served by a leading cellular phone service provider in three service locations – Atlanta, Dallas, and Seattle. I look at the means and I look at the boxplots, and it looks to me at least as if Atlanta has the highest median scores, followed by Dallas, with Seattle bringing up the rear. So couldn’t we compare Atlanta to Dallas via a t-test, then Atlanta to Seattle, and finally Dallas to Seattle, and see which pairs are statistically different?

TABLE 9.1: Customer Satisfaction Scores
Observation.No. Atlanta Dallas Seattle
1 85 71 59
2 75 75 64
3 82 73 62
4 76 74 69
5 71 69 75
6 85 82 67

Here are some summary statistics …

TABLE 9.2: Descriptive Statistics
Location Mean Variance Std. Dev.
Atlanta 79 34 5.83
Dallas 74 20 4.47
Seattle 66 32 5.66

… and boxplots by location

The problem, as it turns out, is the issue of inflated Type I error rates. Come again? In any single comparison of two groups, A versus B, we know that if \(\alpha\), the level of significance, is set to \(0.05\), then that is essentially the probability of our rejecting the Null Hypothesis when in fact the Null Hypothesis is true (i.e., committing a Type I error). By extension, \(1-\alpha\) is the probability of not committing a Type I error; if \(\alpha = 0.05\), then \(1 - \alpha = 0.95\). Let us now calculate the probability that at least one of these pairwise comparisons throws up a significant result by chance alone …

\[\begin{align*} P(\text{no Type I error in 1 comparison}) & = 0.95 \\ P(\text{no Type I error in 2 comparisons}) & = 0.95 \times 0.95 = 0.9025 \\ P(\text{Type I error in 2 comparisons}) & = 1 - 0.9025 = 0.0975 \\ P(\text{no Type I error in 3 comparisons}) & = 0.95 \times 0.95 \times 0.95 = 0.857375 \\ P(\text{Type I error in 3 comparisons}) & = 1 - 0.857375 = 0.142625 \end{align*}\]

Note what is happening here … in making three pairwise comparisons you are going to have a functional \(\alpha = 0.1426\) even though you think each pairwise comparison has \(\alpha = 0.05\). This is what we call the problem of multiple comparisons.
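The compounding above is easy to verify with a few lines of Python; a minimal sketch, assuming the comparisons are independent (as the multiplication above does):

```python
# Probability of at least one Type I error across m independent comparisons,
# each conducted at alpha = 0.05.
alpha = 0.05
for m in (1, 2, 3):
    p_at_least_one = 1 - (1 - alpha) ** m
    print(f"{m} comparison(s): P(at least one Type I error) = {p_at_least_one:.4f}")
```

For three comparisons this prints \(0.1426\), the inflated functional \(\alpha\) described above.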

There is a way to correct for multiple comparisons: use a corrected level of significance … \(\alpha^* = \frac{\alpha}{\text{No. of Trials}}\) … and then modify our decision rules such that we now reject \(H_0\) only if the p-value \(\leq \alpha^*\).

TABLE 9.3: Bonferroni Correction
Number of Trials Adjusted Alpha
1 0.0500
2 0.0250
3 0.0167
4 0.0125
5 0.0100
6 0.0083
7 0.0071
8 0.0063
9 0.0056
10 0.0050
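Table 9.3 can be reproduced in a couple of lines; the sketch below simply divides \(\alpha = 0.05\) by the number of trials:

```python
# Bonferroni correction: adjusted alpha = alpha / number of trials
alpha = 0.05
adjusted = {m: alpha / m for m in range(1, 11)}
for m, a in adjusted.items():
    print(f"{m:2d} trial(s): adjusted alpha = {a:.4f}")
```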

In fact, there is a better way to compare three or more groups – by relying on the analysis of variance (ANOVA).

9.1 The Logic of ANOVA

ANOVA is a hypothesis testing procedure that allows us to simultaneously compare three or more groups and determine if they are drawn from a common population or from different populations. What does this test look like, this ANOVA test?

The test statistic is a ratio: \(\dfrac{\text{Difference between groups}}{\text{Difference within groups}}\)

If the difference between groups \(>\) the difference within the groups, then something must differentiate the groups such that they are noticeably different from each other. Fair enough. But how can we measure the difference between groups and the difference within groups? In fact, come to think of it, what do we mean by difference here? Well, let us break it down.

If I ignore the fact that there are three service centers and calculate the mean for all 18 satisfaction scores, I get what we call the overall mean \(\bar{\bar{x}}\), and this turns out to be \(73\). I can then see how much each satisfaction score differs from this overall mean by calculating \(\sum\left(x_i - \bar{\bar{x}}\right)^2\). This is just the numerator of the usual sample variance calculation, \(\dfrac{\sum\left(x_i - \bar{x}\right)^2}{n - 1}\), and we can call it the total squared difference or, the sum of squares total (SST):

TABLE 9.4: Overall Mean and Squared Differences
Observation.No. Location Score mean difference squared difference
1 Atlanta 85 73 12 144
2 Atlanta 75 73 2 4
3 Atlanta 82 73 9 81
4 Atlanta 76 73 3 9
5 Atlanta 71 73 -2 4
6 Atlanta 85 73 12 144
1 Dallas 71 73 -2 4
2 Dallas 75 73 2 4
3 Dallas 73 73 0 0
4 Dallas 74 73 1 1
5 Dallas 69 73 -4 16
6 Dallas 82 73 9 81
1 Seattle 59 73 -14 196
2 Seattle 64 73 -9 81
3 Seattle 62 73 -11 121
4 Seattle 69 73 -4 16
5 Seattle 75 73 2 4
6 Seattle 67 73 -6 36

The overall squared difference (SST) turns out to be \(946\).
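As a quick check, a short Python sketch recomputes the overall mean and SST from the scores in Table 9.1:

```python
# Satisfaction scores from TABLE 9.1
atlanta = [85, 75, 82, 76, 71, 85]
dallas  = [71, 75, 73, 74, 69, 82]
seattle = [59, 64, 62, 69, 75, 67]

scores = atlanta + dallas + seattle
overall_mean = sum(scores) / len(scores)            # 73.0
sst = sum((x - overall_mean) ** 2 for x in scores)  # 946.0
print(overall_mean, sst)
```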

Okay, fair enough. Now what about the variance within each group? For this I know I have to calculate the mean for each group and then sum the squared difference of each score in the group from the group mean.

TABLE 9.5: Group Means and Squared Differences
Observation.No. Location Score mean difference squared difference
1 Atlanta 85 79 6 36
2 Atlanta 75 79 -4 16
3 Atlanta 82 79 3 9
4 Atlanta 76 79 -3 9
5 Atlanta 71 79 -8 64
6 Atlanta 85 79 6 36
1 Dallas 71 74 -3 9
2 Dallas 75 74 1 1
3 Dallas 73 74 -1 1
4 Dallas 74 74 0 0
5 Dallas 69 74 -5 25
6 Dallas 82 74 8 64
1 Seattle 59 66 -7 49
2 Seattle 64 66 -2 4
3 Seattle 62 66 -4 16
4 Seattle 69 66 3 9
5 Seattle 75 66 9 81
6 Seattle 67 66 1 1

If you add the squared differences within each of the three groups you will get \(170\) for Atlanta, \(100\) for Dallas, and \(160\) for Seattle. So how much is the sum of squares when calculated within the groups? That turns out to be \(170 + 100 + 160 = 430\)

Now, the total sum of squared differences is composed of the sum of squared differences within groups and between groups, i.e.,

\[\text{Sum of Squares Total} = \text{Sum of Squares Within Groups} + \text{Sum of Squares Between Groups}\]

Since \(\text{Sum of Squares Total} = 946\) and \(\text{Sum of Squares Within Groups} = 430\), it must be that \(946 - 430 = 516\) is the \(\text{Sum of Squares Between Groups}\)

Now, ANOVA tests the ratio of the variance between groups to the variance within groups. So what would be these respective variances? Recall the formula for the sample variance … \(\dfrac{\sum\left(x_i - \bar{x} \right)^2}{n - 1}\)

How many groups are we calculating the variance for? \(3\). So the variance between groups would really be \(\dfrac{516}{3 - 1} = \dfrac{516}{2} = 258\)

Similarly, the variance within groups would be \(\dfrac{430}{18 - 3} = \dfrac{430}{15} = 28.66667\)

Now the ratio we are chasing … \(\dfrac{258}{28.66667} = 9.0\)

How is this ratio distributed? It follows neither the \(z\) nor the \(t\), but instead the \(F\) distribution, which has two degrees of freedom … one for the numerator and one for the denominator. For the numerator, the degrees of freedom are \(df_{k-1} = k - 1 = 3 - 1 = 2\), where \(k = \text{number of groups}\). For the denominator, the degrees of freedom are \(df_{N-k} = N - k = 18 - 3 = 15\).

Under the assumption that the Null Hypothesis is true, i.e., that the group means are all very similar, \(F\) is distributed as \(F^{k-1}_{N-k}\), and we can calculate the probability of finding, for the given degrees of freedom, by chance alone, an \(F\) equal to or greater than a given value. Consequently, what we have is an F-statistic that is a ratio of two mean squares – the mean squared difference due to “treatments” (MSTR), and the mean squared difference due to chance and other factors (MSE) … \(F^{k-1}_{N-k} = \dfrac{MSTR}{MSE}\)

If the probability of seeing our calculated \(F\) by chance is less than or equal to \(\alpha = 0.05\), we can reject the Null Hypothesis; else we fail to reject the Null Hypothesis. So how are the hypotheses specified?

\[H_0: \text{The group means are all the same}\] \[H_1: \text{At least one group mean is different from one or more of the other group means}\]

Alternatively, you will see the hypotheses specified as

\[H_0: \text{All population means are equal}\] \[H_1: \text{Not all population means are equal}\]
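The entire chain of calculations for the satisfaction-score example – sums of squares, mean squares, the \(F\) ratio, and its \(p\)-value – can be verified in Python. This is a sketch assuming scipy is available; `scipy.stats.f_oneway` should return the same \(F\):

```python
from scipy import stats

atlanta = [85, 75, 82, 76, 71, 85]
dallas  = [71, 75, 73, 74, 69, 82]
seattle = [59, 64, 62, 69, 75, 67]
groups = [atlanta, dallas, seattle]

n = sum(len(g) for g in groups)               # 18 observations
k = len(groups)                               # 3 groups
grand_mean = sum(sum(g) for g in groups) / n  # 73

# Sums of squares within and between groups
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)     # 430
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)  # 516

mstr = ss_between / (k - 1)  # 258
mse = ss_within / (n - k)    # ~28.67
f_stat = mstr / mse          # 9.0
p_value = stats.f.sf(f_stat, k - 1, n - k)  # upper-tail probability of F

# scipy's built-in one-way ANOVA agrees
f_check, p_check = stats.f_oneway(atlanta, dallas, seattle)
```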

9.1.1 Example 1: Systolic Blood Pressure and Treatments

Take for example a program that presumably lowers blood pressure; to do so it assigns participants to one of four conditions (see table). Each participant’s systolic blood pressure is measured after two weeks of treatment. The hypothesis is that a combination of treatments (Diet and Drug) will be more effective than either individual treatment in isolation. These data are shown below.

TABLE 9.6: ANOVA Example 01
Control Diet Only Drug Only Diet and Drug
163 166 161 153
178 173 171 168
180 188 178 176
181 190 183 198
185 193 195 200

Do one or more of these treatments differ? Let us run an ANOVA and see what the data reveal.

\(H_0:\) There is no difference in mean systolic blood pressure between the treatment groups
\(H_1:\) Not all treatment group means are the same

Overall mean \(\bar{\bar{x}} = 179\)

Mean for Control group \(\bar{x}_{Control} = 177.4\)
Mean for Diet only group \(\bar{x}_{Diet} = 182\)
Mean for Drug only group \(\bar{x}_{Drug} = 177.6\)
Mean for Diet and Drug group \(\bar{x}_{Diet+Drug} = 179\)

\(SST = 3170\)

\(SSTR = 67.6\)

\(SSE = 3170 - 67.6 = 3102.4\)

\(MSTR = \dfrac{SSTR}{k - 1} = \dfrac{67.6}{4-1} = \dfrac{67.6}{3} = 22.53\)

\(MSE = \dfrac{SSE}{n - k} = \dfrac{3102.4}{20 - 4} = \dfrac{3102.4}{16} = 193.9\)

\(F = \dfrac{MSTR}{MSE} = \dfrac{22.53}{193.9} = 0.116\)

The \(p\)-value is \(0.949\), so we are unable to reject the null hypothesis; these data suggest that there is no difference in mean systolic blood pressure between the treatment groups.
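These results can be reproduced with `scipy.stats.f_oneway` (a sketch, assuming scipy is available):

```python
from scipy import stats

# Blood pressure readings from TABLE 9.6
control   = [163, 178, 180, 181, 185]
diet_only = [166, 173, 188, 190, 193]
drug_only = [161, 171, 178, 183, 195]
diet_drug = [153, 168, 176, 198, 200]

f_stat, p_value = stats.f_oneway(control, diet_only, drug_only, diet_drug)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```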

9.1.2 Example 2: Pre-surgical Fitness and Recovery Times

Does Pre-surgical fitness influence recovery times? That is, do fitter patients undergoing surgery (perhaps for total hip/knee replacement) recover faster than patients with poor fitness? To analyze this question a surgeon gathered data on a sample of 24 males aged 18-30 who underwent corrective knee surgery, measuring their prior fitness level and the number of days each needed for successful physical therapy post-surgery.

Below Average Average Above Average
29 30 26
42 35 32
38 39 21
40 28 20
43 31 23
40 31 22
30 29
42 35
29
33

\(H_0:\) There is no difference in average number of days needed to recover across the fitness levels

\(H_1:\) Not all fitness level group means are the same

Set \(\alpha = 0.05\)

The calculated test statistic turns out to be \(F^{2}_{21} = 16.96\) with a \(p\)-value of \(0.0000413\), leading us to reject the Null Hypothesis. These data suggest that not all fitness level group means are the same; at least one is different.

9.2 ANOVA with two independent variables (aka factors)

Thus far we have looked at a single source of variation – the location one is in, relationship status, and so on. But what happens when you have more than one factor at work? For example, say you are looking at students’ 3rd grade reading scores and have recorded, in addition to the scores, both the humidity level in the classroom (Factor A), and the temperature in the classroom (Factor B). How can you figure out whether it is the humidity or the temperature that influences 3rd grade reading scores? ANOVA allows us to answer this question.

We will bypass the mathematics underlying the analysis and instead focus on the mechanics. In brief, we have two sets of hypotheses now …

\[H_0: \text{There is no main effect of Factor A}\] \[H_1: \text{There is a main effect of Factor A}\]

and then …

\[H_0: \text{There is no main effect of Factor B}\] \[H_1: \text{There is a main effect of Factor B}\]

The decision rules remain the same … we reject the Null Hypothesis if the probability of our calculated \(F\) statistic turns out to be \(\leq 0.05\) and we do not reject if the probability is \(> 0.05\).

Now, let us assume the following patterns are visible in the data.

\(70^{\circ}\) \(80^{\circ}\) \(90^{\circ}\) Row Mean
Low Humidity \(\bar{x} = 85\) \(\bar{x} = 80\) \(\bar{x} = 75\) \(\bar{x}_{low} = 80\)
High Humidity \(\bar{x} = 75\) \(\bar{x} = 70\) \(\bar{x} = 65\) \(\bar{x}_{high} = 70\)
Column Mean \(\bar{x}_{70}=80\) \(\bar{x}_{80} = 75\) \(\bar{x}_{90}=70\)

We can spell out the main effects of each factor as follows:

  1. Main Effect of Factor A (Humidity): Difference between mean scores for high versus low humidity
  2. Main Effect for Factor B (Temperature): Difference between mean scores for \(70^{\circ},\) \(80^{\circ},\) and \(90^{\circ}\) temperatures, respectively

The Null Hypothesis for testing effects of Factor A would be \(H_0: \mu_{A1} = \mu_{A2}\), and the Null Hypothesis for testing effects of Factor B: \(H_0: \mu_{B1} = \mu_{B2} = \mu_{B3}\)

Note how the mean score drops as Temperature increases … by exactly \(5\)

Note also how the mean score drops as Humidity rises

  • At \(70^{\circ}\) there is a difference of \(10\) between Low/High humidity
  • At \(80^{\circ}\) there is a difference of \(10\) between Low/High humidity
  • At \(90^{\circ}\) there is a difference of \(10\) between Low/High humidity

If we plot the means, the following pattern should become visible.

The key pattern evident here is the constant difference in average performance as humidity goes from low to high, no matter the temperature, and likewise as temperature rises, no matter the humidity. The two lines look like parallel train tracks, which tells us the effect of one factor does not depend on the level of the other. But what if this wasn’t true and the data showed something else altogether? That is, what if we had an interaction between the two factors, Factor A and Factor B?
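A tiny Python sketch makes the “parallel tracks” point concrete: the Low-minus-High humidity gap is identical at every temperature, which is exactly what “no interaction” means.

```python
# Cell means from the table above; keys are temperatures in degrees.
low_humidity  = {70: 85, 80: 80, 90: 75}
high_humidity = {70: 75, 80: 70, 90: 65}

# Low-minus-High gap at each temperature; a constant gap means parallel lines,
# i.e., no interaction between humidity and temperature.
gaps = {t: low_humidity[t] - high_humidity[t] for t in (70, 80, 90)}
print(gaps)
```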

9.3 Interaction effects

At times, we stop and wonder if there are specific combinations of the two factors that exert an additional impact on the outcome. One of the examples I am particularly fond of is that of gender-based pay inequalities. One could hypothesize that at lower educational attainment levels women are paid considerably less than men for the same job but at higher educational attainment levels this gap shrinks.

Let us stick with the humidity and temperature example for now, and see what an interaction might look like. The data shown below are for learning, humidity, and temperature, but have been modified such that the impact of temperature increases is no longer constant.

\(70^{\circ}\) \(80^{\circ}\) \(90^{\circ}\) Row Mean
Low Humidity \(\bar{x} = 80\) \(\bar{x} = 80\) \(\bar{x} = 80\) \(\bar{x}_{low} = 80\)
High Humidity \(\bar{x} = 80\) \(\bar{x} = 70\) \(\bar{x} = 60\) \(\bar{x}_{high} = 70\)
Column Mean \(\bar{x}_{70}=80\) \(\bar{x}_{80} = 75\) \(\bar{x}_{90}=70\)

Note how the humidity gap now grows as Temperature increases …

  • At \(70^{\circ}\) there is a difference of \(0\) between Low/High humidity

  • At \(80^{\circ}\) there is a difference of \(10\) between Low/High humidity

  • At \(90^{\circ}\) there is a difference of \(20\) between Low/High humidity

In a nutshell, at Low humidity raising the temperature has no impact, but at High humidity each rise in temperature lowers scores even further!

This is an interaction – when the effect (impact) of a particular value of one factor depends upon the value of the other factor. Testing now involves three pairs of hypotheses …

  1. Main Effect of Factor A

\[H_0: \text{There is no main effect of factor A}\] \[H_1: \text{There is a main effect of factor A}\]

  2. Main Effect of Factor B

\[H_0: \text{There is no main effect of factor B}\] \[H_1: \text{There is a main effect of factor B}\]

  3. Interaction Effect of Factor A and Factor B

\[H_0: \text{There is no interaction effect of factor A and factor B}\] \[H_1: \text{There is an interaction effect of factor A and factor B}\]

9.3.1 Example 01: The Drug and Diet Example Modified

This is a modified version of the blood pressure example from before. In brief, say I want to test whether giving the patient a drug (Factor A) lowers their systolic blood pressure, or whether diet (Factor B) has an impact, and then whether a combination of both the drug and the diet (interaction of Factor A and Factor B) has an impact on blood pressure.

TABLE 9.7: Drug and Diet Interaction?
No Diet, No Drug Diet, No Drug No Diet, Drug Both Diet & Drug
185 188 171 153
190 183 176 163
195 198 181 173
200 178 166 178
180 193 161 168

The hypotheses would be …

  (a) Main Effect of Drug

\[H_0: \text{There is no main effect of the drug on blood pressure}\] \[H_1: \text{There is a main effect of the drug on blood pressure}\]

  (b) Main Effect of Diet

\[H_0: \text{There is no main effect of the diet on blood pressure}\] \[H_1: \text{There is a main effect of the diet on blood pressure}\]

  (c) Interaction Effect of Drug and Diet

\[H_0: \text{There is no interaction effect of the drug and diet on blood pressure}\] \[H_1: \text{There is an interaction effect of the drug and diet on blood pressure}\]

For (a) the F turns out to be 28.571 with a p-value of about 0.0000657, so we can reject the Null; these data suggest that the drug does have a main effect on blood pressure.

For (b) the F is 0.643, with a p-value of 0.434 so we are unable to reject the Null; these data suggest that the diet does not have a main effect on blood pressure.

For (c), the F is 0.071, with a p-value of 0.793 so again we are unable to reject the Null; these data suggest that there is no interaction effect of drug and diet on blood pressure.
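scipy does not ship a ready-made two-way ANOVA, so here is a hand-rolled sketch for this balanced design (equal cell sizes assumed); it recomputes the three \(F\) statistics from the cell means:

```python
from scipy import stats

# Scores from TABLE 9.7, keyed by (drug given?, diet given?)
cells = {
    (0, 0): [185, 190, 195, 200, 180],  # no drug, no diet
    (0, 1): [188, 183, 198, 178, 193],  # no drug, diet
    (1, 0): [171, 176, 181, 166, 161],  # drug, no diet
    (1, 1): [153, 163, 173, 178, 168],  # drug, diet
}
n_cell = 5
grand = sum(sum(v) for v in cells.values()) / (4 * n_cell)  # overall mean

cell_mean = {key: sum(v) / n_cell for key, v in cells.items()}
drug_mean = {a: (cell_mean[(a, 0)] + cell_mean[(a, 1)]) / 2 for a in (0, 1)}
diet_mean = {b: (cell_mean[(0, b)] + cell_mean[(1, b)]) / 2 for b in (0, 1)}

# Sums of squares for a balanced two-way layout
ss_drug = 2 * n_cell * sum((m - grand) ** 2 for m in drug_mean.values())
ss_diet = 2 * n_cell * sum((m - grand) ** 2 for m in diet_mean.values())
ss_int = n_cell * sum(
    (cell_mean[(a, b)] - drug_mean[a] - diet_mean[b] + grand) ** 2
    for a in (0, 1) for b in (0, 1)
)
ss_error = sum((x - cell_mean[key]) ** 2 for key, v in cells.items() for x in v)

mse = ss_error / 16           # 20 observations - 4 cells = 16 error df
f_drug = (ss_drug / 1) / mse  # each main effect has 1 df here
f_diet = (ss_diet / 1) / mse
f_int = (ss_int / 1) / mse
p_drug = stats.f.sf(f_drug, 1, 16)
```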

9.3.2 Example 02: Pedagogy, Subject, and Learning

In this example we have three subjects, each taught via the traditional lecture method versus using computers. Students’ scores are recorded for each subject and method.

TABLE 9.8: Interaction of Pedagogy and Subject?
Statistics/Standard English/Standard History/Standard Statistics/Computer English/Computer History/Computer
44 47 46 53 13 45
18 37 21 42 10 36
48 42 40 49 16 41
32 42 30 51 11 35
35 39 29 47 16 38
27 33 20 34 6 33

The hypotheses would be …

  1. Main Effect of Subject

\[H_0: \text{There is no main effect of the subject on scores}\] \[H_1: \text{There is a main effect of the subject on scores}\]

  2. Main Effect of Method

\[H_0: \text{There is no main effect of the method on scores}\] \[H_1: \text{There is a main effect of the method on scores}\]

  3. Interaction Effect of Subject and Method

\[H_0: \text{There is no interaction effect of the subject and the method on scores}\] \[H_1: \text{There is an interaction effect of the subject and the method on scores}\]

The results will show that subject has a main effect on scores – F is 10.737 with a p-value of 0.000304, and there is an interaction of subject and method – F is 25.629 with a p-value of 0.000000323. However, the method has no main effect on scores.
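As with the previous example, the three \(F\) statistics can be recomputed by hand for this balanced design; a sketch, assuming scipy is available for the \(p\)-values:

```python
from scipy import stats

# Scores from TABLE 9.8, keyed by (subject, method); 6 students per cell
cells = {
    ("Statistics", "Standard"): [44, 18, 48, 32, 35, 27],
    ("English",    "Standard"): [47, 37, 42, 42, 39, 33],
    ("History",    "Standard"): [46, 21, 40, 30, 29, 20],
    ("Statistics", "Computer"): [53, 42, 49, 51, 47, 34],
    ("English",    "Computer"): [13, 10, 16, 11, 16, 6],
    ("History",    "Computer"): [45, 36, 41, 35, 38, 33],
}
subjects = ("Statistics", "English", "History")
methods = ("Standard", "Computer")
n_cell = 6
grand = sum(sum(v) for v in cells.values()) / (6 * n_cell)

cell_mean = {key: sum(v) / n_cell for key, v in cells.items()}
subj_mean = {s: sum(cell_mean[(s, m)] for m in methods) / 2 for s in subjects}
meth_mean = {m: sum(cell_mean[(s, m)] for s in subjects) / 3 for m in methods}

ss_subj = 2 * n_cell * sum((v - grand) ** 2 for v in subj_mean.values())
ss_meth = 3 * n_cell * sum((v - grand) ** 2 for v in meth_mean.values())
ss_int = n_cell * sum(
    (cell_mean[(s, m)] - subj_mean[s] - meth_mean[m] + grand) ** 2
    for s in subjects for m in methods
)
ss_err = sum((x - cell_mean[key]) ** 2 for key, v in cells.items() for x in v)

mse = ss_err / (36 - 6)       # 30 error df
f_subj = (ss_subj / 2) / mse  # subject has 3 - 1 = 2 df
f_meth = (ss_meth / 1) / mse  # method has 2 - 1 = 1 df
f_int = (ss_int / 2) / mse    # interaction has 2 x 1 = 2 df
p_subj = stats.f.sf(f_subj, 2, 30)
p_meth = stats.f.sf(f_meth, 1, 30)
p_int = stats.f.sf(f_int, 2, 30)
```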