T-tests

With a numerical outcome variable of interest we know we can run some simple t-tests. The basic setup is familiar: we have a null hypothesis we wish to reject, and we attempt to do so with the sample at hand. We’ll run through some examples with the hsb2 data (hsb2.RData), so let us load it first.

setwd("~/Downloads")
load("hsb2.RData")
tab.1 = summary(hsb2[c("read", "math")])

library(knitr)
kable(tab.1, caption = "Table 1: Summary Statistics for Reading & Mathematics", 
    align = "l", format = "markdown")
|read          |math          |
|:-------------|:-------------|
|Min.   :28.00 |Min.   :33.00 |
|1st Qu.:44.00 |1st Qu.:45.00 |
|Median :50.00 |Median :52.00 |
|Mean   :52.23 |Mean   :52.65 |
|3rd Qu.:60.00 |3rd Qu.:59.00 |
|Max.   :76.00 |Max.   :75.00 |

Table 1: Summary Statistics for Reading & Mathematics

One-Sample t-test

Assume we want to test the following: \(H_0: \mu_{Reading} = 60\) versus \(H_1: \mu_{Reading} \neq 60\). This is easily done with the t.test() command in R.

t.test(hsb2$read, mu = 60, alternative = "two.sided", conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  hsb2$read
## t = -10.717, df = 199, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 60
## 95 percent confidence interval:
##  50.80035 53.65965
## sample estimates:
## mean of x 
##     52.23

Note the key elements: mu = 60 specifies the null value for the population mean, alternative = "two.sided" specifies that it is a two-tailed test, and conf.level = 0.95 specifies that we wish to use \(\alpha=0.05\).

What if we were testing \(H_0: \mu_{Reading} \geq 60\) versus \(H_1: \mu_{Reading} < 60\)?

t.test(hsb2$read, mu = 60, alternative = "less", conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  hsb2$read
## t = -10.717, df = 199, p-value < 2.2e-16
## alternative hypothesis: true mean is less than 60
## 95 percent confidence interval:
##      -Inf 53.42808
## sample estimates:
## mean of x 
##     52.23

\(H_0: \mu_{Reading} \leq 60\) versus \(H_1: \mu_{Reading} > 60\)?

t.test(hsb2$read, mu = 60, alternative = "greater", conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  hsb2$read
## t = -10.717, df = 199, p-value = 1
## alternative hypothesis: true mean is greater than 60
## 95 percent confidence interval:
##  51.03192      Inf
## sample estimates:
## mean of x 
##     52.23

Now run these same tests with math after picking a reasonable value for the population mean. Note that all subject scores are on a 0-100 scale.
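For instance, with 50 as one plausible (and entirely hypothetical) null value for the population mean of math scores, the call would be:

# two-tailed test of H0: mu_Math = 50 (hypothetical null value)
t.test(hsb2$math, mu = 50, alternative = "two.sided", conf.level = 0.95)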

Two-sample t-tests

Recall that these tests apply when we have two groups, for example, male versus female students. Suppose we wish to test whether male and female students differ in terms of how they perform on the reading test. Let us also assume that we have a simple two-tailed test with the following hypotheses: \(H_0: \mu_{FemaleReading} = \mu_{MaleReading}\) versus \(H_1: \mu_{FemaleReading} \neq \mu_{MaleReading}\), which translates into \(H_0: \mu_{FemaleReading} - \mu_{MaleReading} = 0\) and \(H_1: \mu_{FemaleReading} - \mu_{MaleReading} \neq 0\), respectively.

Assumptions:

  1. Random samples
  2. Variables are drawn from normally distributed Populations
  3. Variables are drawn from populations with equal variances

Rules-of-thumb:

  • Draw larger samples if you suspect the Population(s) may be skewed
  • Go with the assumption of equal variances if both of the following are met:
    1. The assumption is theoretically justified and the standard deviations are fairly close
    2. \(n_1 \geq 30\) and \(n_2 \geq 30\)
  • Go with the assumption of unequal variances if both of the following are met:
    1. One standard deviation is at least twice the other standard deviation
    2. \(n_1 < 30\) or \(n_2 < 30\)

Two-Tailed Hypothesis:

\(H_0\): \(\mu_{1} = \mu_{2}\); \(H_A\): \(\mu_{1} \neq \mu_{2}\)

These can be rewritten as \(H_0\): \(\mu_{1} - \mu_{2} = 0\); \(H_A\): \(\mu_{1} - \mu_{2} \neq 0\)

One-Tailed Hypotheses:

\(H_0\): \(\mu_{1} \leq \mu_{2}\); \(H_A\): \(\mu_{1} > \mu_{2}\)

These can be rewritten as \(H_0\): \(\mu_{1} - \mu_{2} \leq 0\); \(H_A\): \(\mu_{1} - \mu_{2} > 0\)

Alternatively, the setup may be \(H_0\): \(\mu_{1} \geq \mu_{2}\); \(H_A\): \(\mu_{1} < \mu_{2}\)

These can be rewritten as \(H_0\): \(\mu_{1} - \mu_{2} \geq 0\); \(H_A\): \(\mu_{1} - \mu_{2} < 0\)

The Test Statistic:

\[t = \dfrac{\left(\bar{X}_{1} - \bar{X}_{2} \right) - \left(\mu_{1} - \mu_{2} \right)}{SE_{\bar{X}_{1} - \bar{X}_{2}}} \]

The pooled sample variance: \[s^{2}_{p} = \dfrac{df_{1}s^{2}_{1} + df_{2}s^{2}_{2}}{df_{1} + df_{2}}\] \[df_{1}=n_{1}-1; df_{2}=n_{2}-1\]

The standard error \(SE_{\bar{X}_{1} - \bar{X}_{2}}\) and the \(df\) are calculated in one of two ways:

  1. Assuming equal population variances

\[SE_{\bar{X}_{1} - \bar{X}_{2}} = \sqrt{ s^{2}_{p} \left(\dfrac{1}{n_{1}} + \dfrac{1}{n_{2}}\right) } \] \[df = n_{1} + n_{2} - 2\]

  2. Assuming unequal population variances \[SE_{\bar{X}_{1} - \bar{X}_{2}} = \sqrt{\dfrac{s^{2}_{1}}{n_{1}} + \dfrac{s^{2}_{2}}{n_{2}} } \]

\[\text{approximate } df = \dfrac{\left( \dfrac{s^{2}_{1}}{n_1} + \dfrac{s^{2}_{2}}{n_2} \right)^{2}}{\left[\dfrac{\left(\frac{s^{2}_{1}}{n_{1}} \right)^{2}}{df_{1}} + \dfrac{\left( \frac{s^{2}_{2}}{n_2} \right)^{2}}{df_{2}} \right]}\]
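To see these formulas at work, here is a minimal by-hand sketch under the equal-variances assumption, using the read scores by sex that we analyze next (this assumes the female factor carries the labels Male and Female, as the output below shows); the result should match the var.equal = TRUE run later in this section.

# by-hand two-sample t assuming equal population variances
x1 <- hsb2$read[hsb2$female == "Male"]    # assumes these factor labels
x2 <- hsb2$read[hsb2$female == "Female"]
n1 <- length(x1); n2 <- length(x2)
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)  # pooled variance
se <- sqrt(sp2 * (1/n1 + 1/n2))           # standard error of the difference
(mean(x1) - mean(x2)) / se                # t statistic with df = n1 + n2 - 2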

Running the test is simple:

t.test(hsb2$read ~ hsb2$female, alternative = "two.sided", conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  hsb2$read by hsb2$female
## t = 0.74506, df = 188.46, p-value = 0.4572
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.796263  3.976725
## sample estimates:
##   mean in group Male mean in group Female 
##             52.82418             51.73394

Note the ~ hsb2$female portion of the command; this says the two groups are flagged by the female variable in the hsb2 data.

If we had one-tailed tests we would switch out the input in the alternative= portion of the command.

t.test(hsb2$read ~ hsb2$female, alternative = "less", conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  hsb2$read by hsb2$female
## t = 0.74506, df = 188.46, p-value = 0.7714
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 3.508987
## sample estimates:
##   mean in group Male mean in group Female 
##             52.82418             51.73394
t.test(hsb2$read ~ hsb2$female, alternative = "greater", conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  hsb2$read by hsb2$female
## t = 0.74506, df = 188.46, p-value = 0.2286
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -1.328525       Inf
## sample estimates:
##   mean in group Male mean in group Female 
##             52.82418             51.73394

Assuming equal versus unequal variances

Recall that when comparing two groups we have to choose between assuming the groups come from populations with either (i) equal variances or (ii) unequal variances. If we decide to go with equal variances then we would specify var.equal=TRUE and if we opt for unequal variances then we would specify var.equal=FALSE.

t.test(hsb2$read ~ hsb2$female, alternative = "two.sided", conf.level = 0.95, 
    var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  hsb2$read by hsb2$female
## t = 0.74801, df = 198, p-value = 0.4553
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.783998  3.964459
## sample estimates:
##   mean in group Male mean in group Female 
##             52.82418             51.73394
t.test(hsb2$read ~ hsb2$female, alternative = "two.sided", conf.level = 0.95, 
    var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  hsb2$read by hsb2$female
## t = 0.74506, df = 188.46, p-value = 0.4572
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.796263  3.976725
## sample estimates:
##   mean in group Male mean in group Female 
##             52.82418             51.73394

But how do we decide which of the two variance conditions holds? Well, we can utilize two tests – the F test and/or Levene's test. We will see both tests in action in just a bit.

Paired t-tests

A classic paired design is a “before” and “after” study such as the speed in syrup versus water study. What you have are two conditions, one the control and the other the treatment, and every unit is measured under both conditions. You then take the difference in the measurements of each unit. If the treatment has no effect then, on average, there should be no difference.

\begin{eqnarray*} d_{i} &=& X_{1i} - X_{2i} \\ \bar{d} &=& \dfrac{\sum{d_i}}{n} \\ s^{2}_{d} &=& \dfrac{\sum(d_i - \bar{d})^2}{n-1} \\ s_d &=& \sqrt{\dfrac{\sum(d_i - \bar{d})^2}{n-1}} \end{eqnarray*}

Test Statistic: \[t = \dfrac{\bar{d} - \mu_d}{s_d/\sqrt{n}}; df=n-1 \]

Interval Estimate: \[\bar{d} \pm t_{\alpha/2}\left(\dfrac{s_d}{\sqrt{n}}\right) \]

The Hypotheses:

  • \(H_0: \mu_{d}=0; H_A: \mu_{d} \neq 0\)
  • \(H_0: \mu_{d} \leq 0; H_A: \mu_{d} > 0\)
  • \(H_0: \mu_{d} \geq 0; H_A: \mu_{d} < 0\)

Assumptions

  1. Random samples
  2. The differences are approximately Normally distributed (… but \(X_1\) and \(X_2\) can follow any distribution)
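To see the mechanics, here is a minimal sketch that treats each student’s reading and writing scores as two measurements on the same unit (this assumes hsb2 also carries a write variable, and the pairing is purely illustrative rather than a true before/after design):

# paired t-test sketch: read vs. write for the same students
# (assumes hsb2 contains a write variable; illustrative only)
t.test(hsb2$read, hsb2$write, paired = TRUE, alternative = "two.sided",
    conf.level = 0.95)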

Testing the assumptions of t-tests

  • Random sample
  • Outcome of interest is measured on a ratio or an interval scale
  • Outcome is normally distributed in the population

Although t-tests are fairly robust to minor departures from normality, extreme skewness can lead to unreliable test results. Consequently, it is a good idea to test whether the distributions are normal via formal tests and graphical explorations.

(a) Graphical Explorations of normality

While we could use box-plots and histograms to see the distribution of the outcome variable, quantile-quantile plots (QQ plots) are designed to show how much faith we should place in the assumption that the distribution of a certain outcome variable in our sample reflects the theoretical normal distribution. The plots below are for reading alone, and then for reading by sex.

library(lattice)

qqmath(~read, data = hsb2, prepanel = prepanel.qqmathline, panel = function(x, 
    ...) {
    panel.grid()
    panel.qqmathline(x, distribution = qnorm)
    panel.qqmath(x, ...)
}, layout = c(1, 1), aspect = 1, xlab = "Unit Normal Quantile", 
    ylab = "Reading Scores", main = "Overall Distribution of Reading Scores")

qqmath(~read | female, data = hsb2, prepanel = prepanel.qqmathline, 
    panel = function(x, ...) {
        panel.grid()
        panel.qqmathline(x, distribution = qnorm)
        panel.qqmath(x, ...)
    }, layout = c(2, 1), aspect = 1, xlab = "Unit Normal Quantile", 
    ylab = "Reading Scores", main = "Distribution of Reading Scores (by Sex)")

The idea here is to see whether most of the data lie close to the diagonal line. If many data points deviate a lot from the line, this suggests we may have a problem. However, one has to be careful, because in small samples you will almost always have a few extreme data points. Therefore one should not discard the assumption of normality too easily in small samples.

(b) Formal Tests of Normality

There are several tests for normality and hence one has to choose with care. We will, however, focus on the two used most often.

  • The Anderson-Darling (A-D) test is suited for situations where you see non-normality in the tails and have large sample sizes \((n > 5000)\).
  • The Shapiro-Wilk (S-W) test is suited for situations where \(n \leq 5000\). This is one of the most powerful tests for detecting non-normality.

Both tests take Normality as the \(H_0\), and as such you hope you don’t end up rejecting the null of normality. That is:

\(H_0:\) The sample is drawn from a normal population

\(H_A:\) The sample is not drawn from a normal population

\[W =\dfrac{\left(\displaystyle\sum^{n}_{i=1}a_iX_{(i)}\right)^2}{\displaystyle\sum^{n}_{i=1}\left(X_i - \bar{X} \right)^2}\] \[0 \leq W \leq 1\]

\(W \to 1\): Observed distribution is as expected if population were Normal

Skewness: \(\displaystyle\sum^{n}_{i=1}\dfrac{\left(X_i - \bar{X}\right)^3}{\left(n-1\right)s^{3}}\)

Skewness \(= 0\): Normal distribution; skewness \(< 0\): skewed left; skewness \(> 0\): skewed right

Kurtosis: \(\displaystyle\sum^{n}_{i=1}\dfrac{\left(X_i - \bar{X}\right)^4}{\left(n-1\right)s^{4}} - 3\)

\(k = 0\): Standard Normal distribution; \(k > 0\): “peaked”; \(k < 0\): “flat”
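Base R has no built-in skewness or kurtosis functions, so here is a minimal sketch that computes both for the reading scores directly from the formulas above:

# skewness and excess kurtosis of read, per the formulas above
x <- hsb2$read
n <- length(x); s <- sd(x)
sum((x - mean(x))^3) / ((n - 1) * s^3)      # skewness
sum((x - mean(x))^4) / ((n - 1) * s^4) - 3  # excess kurtosis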

library(nortest)
ad.test(hsb2$read)
## 
##  Anderson-Darling normality test
## 
## data:  hsb2$read
## A = 1.4917, p-value = 0.0007367
shapiro.test(hsb2$read)
## 
##  Shapiro-Wilk normality test
## 
## data:  hsb2$read
## W = 0.97979, p-value = 0.005553

Note that regardless of the test used we get very small p-values, and as such we end up rejecting the \(H_0\) of normality; reading scores may not be normally distributed. What do we do now? Technically, the t-test should not be used. Instead, we should opt for alternative tests, unless we can transform reading scores (maybe via the logarithm, the square-root, or some other transformation scheme) into a normally distributed variable.
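As a quick sketch of the transformation route, we could re-run the S-W test on transformed reading scores (two illustrative candidates; neither is guaranteed to restore normality) and see whether the null of normality survives:

shapiro.test(log(hsb2$read))   # log-transformed reading scores
shapiro.test(sqrt(hsb2$read))  # square-root-transformed reading scores

Before we go down that road, though, one caution.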

The Trouble with Normality Tests

Let us generate two variables. One \((x)\) will be a fairly small sample and the other \((y)\) will be a pretty large sample.

set.seed(987123)

x = rbinom(15, 12, 0.6)
hist(x)

qqnorm(x)
qqline(x)

shapiro.test(x)
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.91011, p-value = 0.1359
ad.test(x)
## 
##  Anderson-Darling normality test
## 
## data:  x
## A = 0.6116, p-value = 0.09076
y = rt(5e+05, 200)
hist(y)

qqnorm(y)
qqline(y)

ad.test(y)
## 
##  Anderson-Darling normality test
## 
## data:  y
## A = 1.0665, p-value = 0.008447

Well, it looks like a duck but certainly doesn’t quack like a duck! The problem is this: in a small sample, even a few unusual data points will not lead to a rejection of the assumption of normality. Conversely, in large samples, even minor departures from normality will lead the formal tests to reject it. Thus, as a rule, we weigh the presence or absence of normality by first recognizing that sample size may influence the test results, and then juxtaposing the formal tests with the QQ plots to see how many observations wander off the line and how far they wander.

Testing for Equality of Variances

  1. The \(F\)-test:
  • Assumes normally distributed populations (hence sensitive to departures from normality)
  • If \(X_1\) and \(X_2\) are two independent random variables distributed as \(\chi^{2}\) with \(df_1\) and \(df_2\), respectively, then the ratio \(\dfrac{\frac{X_1}{df_1}}{\frac{X_2}{df_2}}\) follows the \(F\) distribution with \(df_1\) in the numerator and \(df_2\) in the denominator
  • Hypotheses: \(H_0: \sigma^{2}_{1} = \sigma^{2}_{2}; H_A: \sigma^{2}_{1} \neq \sigma^{2}_{2}\)
  • Test Statistic: \(F=\dfrac{s^{2}_{1}}{s^{2}_{2}}\); \(F \sim F_{\alpha/2, df_1, df_2}\)

Note: \(s^{2}_{1}\) is the larger sample variance (a by-hand sketch of both test statistics follows this list)

  2. Levene’s Test for Homogeneity of Variances:
  • Assumes roughly symmetric frequency distributions within all groups
  • Robust to violations of assumption
  • Can be used with 2 or more groups
  • Hypotheses: \(H_0: \sigma^{2}_{1} = \sigma^{2}_{2} = \sigma^{2}_{3} = \cdots = \sigma^{2}_{k}\); \(H_A:\) For at least one pair of \((i,j)\) we have \(\sigma^{2}_{i} \neq \sigma^{2}_{j}\)
  • Test Statistic: \(W = \dfrac{ (N-k)\displaystyle\sum^{k}_{i=1}n_{i}\left( \bar{Z}_{i} - \bar{Z} \right)^{2} }{(k-1)\displaystyle\sum^{k}_{i=1}\sum^{n_i}_{j=1}\left( Z_{ij} - \bar{Z}_{i}\right)^{2}}\)
  • \(Z_{ij} = |{X_{ij} - \bar{X}_i}|\); \(\bar{Z}_i\) is the mean for all \(X\) in the \(i^{th}\) group; \(\bar{Z}\) is the mean for all \(X\) in the study; \(k\) is the number of groups in the study; and \(n_i\) is the sample size for group \(i\)
  • If you opt for the more robust version that uses the Median, then, \(Z_{ij} = |X_{ij} - \tilde{X}_{i}|\) where \(\tilde{X}_{i}\) is the median of the \(i^{th}\) group
  • \(W \sim F_{\alpha, k-1, n-k}\)
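Before calling the built-in functions, here is a minimal by-hand sketch of both statistics (assuming the Male/Female labels used throughout); the numbers should match the var.test() and leveneTest() output below.

# by-hand F: ratio of the two sample variances; var.test() puts the
# first factor level in the numerator
with(hsb2, var(read[female == "Male"]) / var(read[female == "Female"]))

# by-hand Levene (center = mean): a one-way ANOVA on the absolute
# deviations from each group's mean reproduces the W statistic
Z <- with(hsb2, abs(read - ave(read, female, FUN = mean)))
anova(lm(Z ~ hsb2$female))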

Let us run both tests with our hsb2 data, maybe sticking with the read variable.

var.test(hsb2$read ~ hsb2$female)
## 
##  F test to compare two variances
## 
## data:  hsb2$read by hsb2$female
## F = 1.0913, num df = 90, denom df = 108, p-value = 0.6613
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7356804 1.6305090
## sample estimates:
## ratio of variances 
##           1.091251
library(car)
leveneTest(read ~ female, data = hsb2, center = mean)
|      |  Df|   F value|    Pr(>F)|
|:-----|---:|---------:|---------:|
|group |   1| 0.4542217|  0.501123|
|      | 198|        NA|        NA|

leveneTest(read ~ female, data = hsb2, center = median)

|      |  Df|   F value|    Pr(>F)|
|:-----|---:|---------:|---------:|
|group |   1| 0.6023704| 0.4386011|
|      | 198|        NA|        NA|

So the F-test fails to reject the null, suggesting that reading scores are about equally variable for male and female students. Levene’s test (with either the mean or the median) concludes likewise. In a nutshell, we can assume equal variances.

Violations of Normality

What if we ignore violations of the assumption of normality? Perhaps nothing will go wrong in terms of the test results. Most tests are robust to violations of normality, so long as you can rely on the Central Limit Theorem and the test you are dealing with involves the mean. In brief, large samples will let you get away with some non-normality, so long as you are not testing variances. How large is large? There is no clear-cut answer. Depending upon what you are looking to test, how many groups are involved, and the research design, even 500 units in each group may be insufficient.

Unequal Standard Deviations

The rule of thumb used for sample sizes and relative standard deviations of the groups applies here as well. If you have small samples or the standard deviation of one group is quite a bit larger than that of the other group(s) (i.e., one standard deviation is at least twice the other standard deviation), then you should not assume equal variances. Of course you also have the formal tests for equality of variance and for the homogeneity of variances, respectively (i.e., the F-test and Levene’s test).
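A quick look at the group standard deviations and sample sizes makes this rule of thumb concrete:

# compare the standard deviations and sizes of the two groups
tapply(hsb2$read, hsb2$female, sd)
table(hsb2$female)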

If the F-test and Levene’s test give you conflicting results, I would recommend going with Levene’s test if the data are non-normally distributed, since it is the more robust of the two.