Testing Sample Proportions

Say you are flipping a fair coin twice. You know you can end up with Heads or Tails on any given toss. The sample space would then be \(S = \left\{ HH, HT, TH, TT \right\}\), i.e., there are four equally likely outcomes. The probability of getting exactly one Tails is \(\dfrac{2}{4} = 0.5\), and so is the probability of exactly one Heads. What if we flipped the coin ten times? What would be the probability of exactly three Tails in ten flips of the coin? We could spell out the full sample space and then calculate the needed probability, but that would be tedious. It would be easier to use a formula that does this calculation for us.

\[ \begin{align} P\left[X \text{ successes}\right] &= \binom{n}{X}p^{X}\left(1 - p\right)^{n-X} \\ & \text{where } \binom{n}{X} = \dfrac{n!}{X!(n-X)!} \text{ and } \\ & n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1 \end{align} \]

Let us define a Success as a flip resulting in Tails, and let \(X = 3\) be the number of Tails in ten flips of the coin (i.e., \(n=10\)). Let us also assume that it is a fair coin so that the probability of Tails on any single flip of the coin is \(p=0.5\). Plugging these values into the equation yields:

\[ \begin{align} P\left[3 \text{ successes}\right] &= \binom{10}{3}0.5^{3}\left(1 - 0.5 \right)^{10-3} \\ &= \dfrac{10 \times 9 \times 8 \times \cdots \times 1}{\left(3 \times 2 \times 1 \right) \left(7 \times 6 \times \cdots \times 1 \right)} \left(0.5 \right)^{3} \left(0.5 \right)^{7} \\ &= \dfrac{10 \times 9 \times 8 }{\left(3 \times 2 \times 1 \right)} \left(0.5 \right)^{3} \left(0.5 \right)^{7} \\ &= \left(10 \times 3 \times 4 \right)\left(0.5 \right)^{3} \left(0.5 \right)^{7} \\ &= \left(120\right) \left(0.125 \right) \left(0.0078125 \right) \\ &= 0.1171875 \end{align} \]

Using R as a calculator we could solve the above formula as follows:

n = 10
x = 3
p = 0.5
p.3tails = choose(n, x) * p^x * (1 - p)^(n - x)
p.3tails
## [1] 0.1171875

What about three or more Tails? We could calculate the probability of exactly three Tails, four Tails, and so on through ten Tails, add these up, and we would have our answer … all using the formula above. Alternatively, R’s rbinom() function can simulate a large number of ten-flip experiments, and the relative frequency of each outcome will approximate its probability. Below we do so for the example above.

set.seed(13579)
n = 1e+07
x = rbinom(n, 10, p = 0.5)
tab.x = table(x)
tab.x
## x
##       0       1       2       3       4       5       6       7       8       9      10 
##    9736   97255  438778 1169791 2051968 2462822 2050665 1171457  439391   98260    9877
ptab.x = (tab.x/n)
ptab.x
## x
##         0         1         2         3         4         5         6         7         8         9        10 
## 0.0009736 0.0097255 0.0438778 0.1169791 0.2051968 0.2462822 0.2050665 0.1171457 0.0439391 0.0098260 0.0009877
plot(ptab.x, ylab = "Probability", xlab = "No. of Tails", xlim = c(0, 
    10), ylim = c(0, 0.25), col = "firebrick")
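
For comparison, the exact probabilities can also be computed directly in R with dbinom() and pbinom() (no simulation needed):

dbinom(3, size = 10, prob = 0.5)  # exactly 3 Tails: 0.1171875
pbinom(2, size = 10, prob = 0.5, lower.tail = FALSE)  # 3 or more Tails: 1 - P[X <= 2] = 0.9453125
sum(dbinom(3:10, size = 10, prob = 0.5))  # same answer, summing the individual terms

The simulated relative frequencies above (for instance, 0.1169791 for exactly three Tails) are, as expected, very close to these exact values.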

Note: If it is a fair coin then half the flips should result in Tails … this is the most likely outcome. Say you flip the coin 10 times and you end up with only 1 Tails. Could this happen by chance? Sure, because we can see that this could happen with some non-zero probability, just as we could also end up with no Tails or even with ten Tails! Strange outcomes can result, even if they occur very rarely, and by chance alone. As a result, we have to generate some rules that can guide us in determining when to suspect that something is wrong because of an unusual result and when to conclude that the result isn’t very unusual after all.

The Binomial Test

Standard Error: \(SE_{\hat{p}} = \sqrt{\dfrac{\hat{p}\left(1-\hat{p} \right)}{n-1}}\)

For the Agresti-Coull confidence interval:
First calculate \(p^{'} = \dfrac{X+2}{n+4}\)
CI is then given by: \(p^{'} - z\sqrt{ \dfrac{p^{'} \left(1-p^{'} \right) } {n+4} } < p < p^{'} + z\sqrt{ \dfrac{p^{'} \left(1-p^{'} \right) } {n+4} }\)
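
As a quick illustration, the interval can be computed directly in R. This is just a sketch using the values from the cancer-study example further below (x = 91 successes in n = 220 trials); it comes very close to, but does not exactly match, the binom.confint() interval shown later because the binom package's "agresti-coull" method uses \(z^2/2\) and \(z^2\) in place of the 2 and 4 above.

x = 91  # illustrative values from the example below
n = 220
z = qnorm(0.975)  # two-sided critical value for a 95% CI
p.prime = (x + 2)/(n + 4)  # Agresti-Coull adjusted proportion
se.prime = sqrt(p.prime * (1 - p.prime)/(n + 4))
c(lower = p.prime - z * se.prime, upper = p.prime + z * se.prime)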

Alternatively, use the binom library as shown below:

library(binom)
# binom.test(x, n, p=?, alternative='?', conf.level=?)
# binom.confint(x, n, p=?, alternative='?', conf.level=?,
# methods = 'ac')

An Example

In 1955, John Wayne played Genghis Khan in The Conqueror. The movie was filmed downwind from a site where 11 aboveground nuclear bomb tests had been conducted. Of the 220 people who worked on the movie, 91 ended up being diagnosed with cancer by the early 1980s. Population data indicate that only about 14% of people in the same age group as the movie crew should, on average, have been diagnosed with cancer within the same time frame of exposure.

  1. What is the best estimate of the probability of a member of the crew getting cancer within the study interval?
n = 220
x = 91
p.sample = x/n
p.sample
## [1] 0.4136364

The best estimate is 0.4136364

  2. Test whether the cancer rate for the movie crew was different from that for the population. Use \(\alpha = 0.05\) and calculate the Agresti-Coull confidence interval.

\(H_0:\) The cancer rate for the movie crew is the same as that of the population (\(p = 0.14\))
\(H_A:\) The cancer rate for the movie crew is NOT the same as that of the population (\(p \neq 0.14\))

library(binom)
binom.test(x, n, p = 0.14, alternative = "two.sided", conf.level = 0.95)
## 
##  Exact binomial test
## 
## data:  x and n
## number of successes = 91, number of trials = 220, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.14
## 95 percent confidence interval:
##  0.3478469 0.4817857
## sample estimates:
## probability of success 
##              0.4136364
binom.confint(x, n, p = 0.14, alternative = "two.sided", conf.level = 0.95, 
    methods = "ac")
##          method  x   n      mean     lower     upper
## 1 agresti-coull 91 220 0.4136364 0.3505683 0.4796687

The P-value is practically 0 so we can easily reject the null hypothesis (H0). The data suggest that the cancer rate for the movie crew was different from that of the population.

Another Example: The Titanic

library(titanic)
data(titanic_train)  # load data from titanic package
df = titanic_train  # save it as df
names(df)  # see variables in the data-set
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

VARIABLE DESCRIPTIONS:

  • Survived Survival (0 = No; 1 = Yes)
  • Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • Name Name
  • Sex Sex
  • Age Age
  • SibSp Number of Siblings/Spouses Aboard
  • Parch Number of Parents/Children Aboard
  • Ticket Ticket Number
  • Fare Passenger Fare
  • Cabin Cabin
  • Embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:

Pclass is a proxy for socio-economic status (SES) – 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years – Fractional if Age less than One (1). If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

  • Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic

  • Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)

  • Parent: Mother or Father of Passenger Aboard Titanic

  • Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, so parch=0 for them. Some passengers also travelled with very close friends or neighbors from their village, but the definitions do not capture such relationships.

Let us clean it up a bit by labeling Survived, Pclass and the port of embarkation (Embarked) as follows:

df$Survived = factor(df$Survived, levels = c(0, 1), labels = c("Died", 
    "Survived"))
df$Pclass = factor(df$Pclass, levels = c(1, 2, 3), labels = c("1st Class", 
    "2nd Class", "3rd Class"))
df$Embarked = factor(df$Embarked, labels = c(NA, "Cherbourg", 
    "Queenstown", "Southampton"))

Unconditional probability of survival

Now let us ask ourselves: Ignoring all other information about a passenger (sex, age, etc.), what was the probability of survival of a passenger on the Titanic? A simple frequency table will show this to us.

tab.1 = table(df$Survived)
tab.1
## 
##     Died Survived 
##      549      342
tab.2 = prop.table(tab.1)
tab.2
## 
##      Died  Survived 
## 0.6161616 0.3838384

The base probability of survival was 0.3838.

Let us assume we were told there was a 50:50 chance of survival on any passenger ship at the time that the Titanic sailed to sea. How would we test whether passengers on the Titanic had a statistically significant better or worse survival rate than that for all passenger ships? Why, with a simple hypothesis test of course.

\(H_0:\) Probability of survival was 0.50
\(H_A:\) Probability of survival was NOT 0.50

x = 382  # number who survived
n = 891  # total sample size
library(binom)
binom.test(x, n, p = 0.5, alternative = "two.sided", conf.level = 0.95)
## 
##  Exact binomial test
## 
## data:  x and n
## number of successes = 382, number of trials = 891, p-value =
## 2.359e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.3959570 0.4619793
## sample estimates:
## probability of success 
##              0.4287318
binom.confint(x, n, p = 0.5, alternative = "two.sided", conf.level = 0.95, 
    methods = c("ac"))
##          method   x   n      mean     lower     upper
## 1 agresti-coull 382 891 0.4287318 0.3966092 0.4614662

The p-value is 2.359e-05, i.e., 0.00002359, far smaller than \(\alpha=0.05\), so we can easily reject the null hypothesis. The Agresti-Coull confidence interval indicates that we can be about 95% confident that the true probability of survival lies between 0.3966 and 0.4615. (One caveat: the frequency table above shows 342 survivors, so x should arguably be 342 rather than 382; with 342 the sample proportion is 0.3838, even farther from 0.50, so the conclusion would not change.)

The \(\chi^2\) Distribution

The \(\chi^2\) distribution is used with multinomial data (i.e., when the categorical variable has more than two categories) to test whether observed frequency counts differ from expected frequency counts.

The Goodness-of-Fit Test for a single variable

\(H_0\): Proportions are all the same
\(H_A\): Proportions are NOT all the same

\[\chi^{2} = \displaystyle\sum\limits_{i}\dfrac{\left(Observed_i - Expected_i \right)^2}{Expected_i}\]

\(\chi^2\) distributed with \(\left( \text{no. of categories}-1 \right)\) degrees of freedom \((\textit{df})\)

Reject \(H_0\) if \(P-value \leq \alpha\); Do not reject \(H_0\) otherwise

Assumptions:

  1. No category has expected frequencies \(< 1\)
  2. No more than 20% of the categories should have expected frequencies \(< 5\)

Was boarding distributed equally across the embarkation points?

If the answer is yes, then we should see roughly one-third of the passengers embarking at each of the ports. This can be easily tested as follows:

\(H_0:\) Embarkation was equally distributed across the three ports
\(H_A:\) Embarkation was NOT equally distributed across the three ports

tab.E = table(df$Embarked)
tab.E
## 
##          NA   Cherbourg  Queenstown Southampton 
##           2         168          77         644
Xsq.1 = chisq.test(tab.E)
Xsq.1
## 
##  Chi-squared test for given probabilities
## 
## data:  tab.E
## X-squared = 1124.2, df = 3, p-value < 2.2e-16
ls(Xsq.1)
## [1] "data.name" "expected"  "method"    "observed"  "p.value"   "parameter"
## [7] "residuals" "statistic" "stdres"
Xsq.1$expected
##        <NA>   Cherbourg  Queenstown Southampton 
##      222.75      222.75      222.75      222.75

The p-value is practically 0 so we can easily reject the null hypothesis; the data suggest that embarkation loads were not equally distributed across the three ports. (Strictly speaking, the two passengers with a missing port of embarkation were treated here as a fourth category, which is why the test has 3 degrees of freedom; dropping those two records first would be cleaner but would not change the conclusion.)
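
As a check on the machinery, the \(\chi^2\) statistic can be reproduced directly from the goodness-of-fit formula using the observed counts and the equal expected counts shown above:

obs = c(2, 168, 77, 644)  # observed counts from tab.E (including the missing-port category)
E = sum(obs)/length(obs)  # equal expected count per category: 891/4 = 222.75
X2 = sum((obs - E)^2/E)  # should reproduce X-squared = 1124.2 from the output above
X2
pchisq(X2, df = length(obs) - 1, lower.tail = FALSE)  # the corresponding p-value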

Did survival vary by sex?

\(H_0:\) Survival was independent of a passenger’s sex
\(H_A:\) Survival was NOT independent of a passenger’s sex

tab.ss = table(df$Survived, df$Sex)
tab.ss
##           
##            female male
##   Died         81  468
##   Survived    233  109
Xsq.2 = chisq.test(tab.ss)
Xsq.2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab.ss
## X-squared = 260.72, df = 1, p-value < 2.2e-16
ls(Xsq.2)
## [1] "data.name" "expected"  "method"    "observed"  "p.value"   "parameter"
## [7] "residuals" "statistic" "stdres"
Xsq.2$expected
##           
##              female     male
##   Died     193.4747 355.5253
##   Survived 120.5253 221.4747

The p-value is again practically 0, so we reject \(H_0\); survival was NOT independent of a passenger’s sex. Note also that every expected cell count is well above 5, so the assumptions of the test are easily met.
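
As an aside, the expected counts come straight from the marginal totals of the observed table, so they are easy to verify by hand:

outer(rowSums(tab.ss), colSums(tab.ss))/sum(tab.ss)  # (row total x column total)/grand total; should match Xsq.2$expected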

What if we have “thin cells”?

Recall that the data must satisfy the assumptions of the \(\chi^2\) test. If any of these is violated then we can either collapse some of the categories (if doing so doesn’t distort the distribution) or opt for another test. These days, though, your best bet is to skip the chi-square test altogether and go with Fisher’s Exact test; modern software can run it even with quite large samples.

fisher.test(tab.ss)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab.ss
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.0575310 0.1138011
## sample estimates:
## odds ratio 
## 0.08128333

Non-parametric “t-tests”

What do I do if my data are non-normally distributed?

  • Have faith in the Central Limit Theorem. So long as you have a sample size of 30 or more you can assume that the sampling distribution of the sample mean follows the normal distribution (even if the original measurement does not come from a normal distribution)
  • Test statistics for means & confidence intervals of means will be largely unaffected so long as you have a “large enough” sample to work with
    • Most tests are robust to some violations of the normality assumption so long as the sample size is not too small and the skew isn’t very extreme
    • How small? If the groups being compared have similar skewness, having 30 in each group will work
    • If the two groups are skewed in opposite directions (one left-skewed, the other right-skewed) then you need samples of a few hundred units each before the test can be trusted
  • If all else fails (or you are a conservative analyst), try to transform the data and rerun all tests … We will try this tactic next
  • If this doesn’t solve the problem (or you are not a fan of massaging your data) then switch to tests that do not require the Normality assumption … We will also see these tests in action in a bit.

Transforming Data

If we are uncomfortable ignoring violations of the normality assumption, one option is to transform the data. Several transformations are available, each with the goal of reducing the skew in the sample (i.e., rendering it more “normally” distributed).

  1. Log Transformation \[X^{'} = ln\left( X \right)\] This is very commonly used in biological research to reduce skew. It usually works so long as the measure does not include \(0\). If it does, then standard practice entails adding \(1\) to each measurement and then taking the natural logarithm.

The log transformation is most useful when:

  1. the measures represent ratios or products
  2. The data are positively skewed (i.e., skewed right)
  3. The group with the larger mean also has the larger standard deviation
  4. The data span several orders of magnitude
  • If substantially positively skewed and \(X > 0\) then use \(X^{'} = ln(X)\)
  • If substantially positively skewed and \(X \geq 0\) then use \(ln(X + c)\); where \(c\) is a constant chosen such that the smallest score \(= 1\)
  • If substantially negatively skewed then use \(ln(K - X)\); where \(K = X_{max} + 1\) so that the smallest score \(=1\)
  2. Square-root Transformation \[X^{'} = \sqrt{X}\] Used often when there is moderate skew, and with count data
  • If moderately positively skewed and \(X > 0\) then use \(\sqrt{X}\)
  • If moderately positively skewed and \(X \geq 0\) then use \(\sqrt{X + 0.5}\)
  • If moderately negatively skewed use \(\sqrt{K - X}\); where \(K = X_{max} + 1\) so that the smallest score \(=1\)
  3. Arcsine Transformation If the data are proportions, the arcsine transformation \(arcsin\left( \sqrt{p}\right)\) is suggested, but I would not do this readily … proportions essentially come from Binomial or Poisson distributions, and in those cases we would rarely run a \(t-test\) or fit a linear regression

  4. Other Transformations The square \((X^{'} = X^{2})\) is used if the data are skewed left, and if this fails, then the anti-log might work \((X^{'} = e^{X})\).

If data are skewed right, the reciprocal may also work \((X^{'} = \frac{1}{X})\).

For the square and the reciprocal you need to have all measurements be positive. If they are negative then you need to multiply each by \(-1\) and then carry out the transformation.

If we use a transformation, the estimated statistics have to be back-transformed to the original metric.

  • The Natural Logarithm: \(X^{'} = ln[X]\) if \(X > 0\)
  • The Natural Logarithm: \(X^{'} = ln[X + 1]\) if \(X \geq 0\)
  • The Square Root: \(X^{'} = \sqrt{X + \frac{1}{2}}\)
  • The Arcsine: \(p^{'} =arcsin[\sqrt{p}]\)

The Corresponding Back-transformations

  • The Natural Logarithm: \(X = e^{X^{'}}\)
  • The Square Root: \(X = X^{'2} - \frac{1}{2}\)
  • The Arcsine: \(p = \left(sin[p^{'}]\right)^{2}\)
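
Here is a minimal sketch of the transform/back-transform workflow, using simulated right-skewed data (the variable names and values are illustrative, not taken from any data set above):

set.seed(42)
x = rlnorm(40, meanlog = 1, sdlog = 0.8)  # simulated right-skewed measurements
x.log = log(x)  # the natural-log transformation
ci.log = t.test(x.log)$conf.int  # 95% CI for the mean of log(X)
exp(ci.log)  # back-transformed CI on the original scale

Keep in mind that back-transforming a confidence interval for the mean of \(ln(X)\) gives an interval around the geometric mean of \(X\), not the arithmetic mean; this is a standard caveat when reporting back-transformed results.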

Nonparametric Alternatives to the \(t-tests\)

Non-parametric tests are used (a) when transformations do not work, or (b) when the data represent ordinal categories (or are ranked data)

  • Called non-parametric because, unlike (say) the \(t-test\), which requires a distributional assumption (i.e., Normality) and involves parameters (i.e., the mean and the variance), these alternatives make no such distributional assumptions and involve no such parameters
  • They are more likely to commit a Type II error, so if the assumptions of parametric tests are met, use parametric tests
  • Here are a few non-parametric analogues to the \(t-tests\):
    • Sign Test: Alternative to the One-Sample \(t-test\) or to the Paired \(t-test\)
    • Wilcoxon Signed-Rank Test: Alternative to the Paired \(t-test\)
    • Mann-Whitney \(U-test\): Alternative to the Two-Sample \(t-test\) with equal variances
    • Kolmogorov-Smirnov test: Alternative to the Two-Sample \(t-test\) with unequal variances

The Sign Test

Assumption: Samples are drawn independently from a continuous distribution

  • Tests whether the Median equals a hypothesized value (\(H_0\) value)
  • Scores above the \(H_0\) value are marked \(+\); scores below are marked \(-\)
  • Scores \(=\) to the Median are dropped
  • If \(H_0\) is correct then 50% of the scores should be \(+\) and 50% should be \(-\)
  • Essentially a Binomial test where \(p_0=0.50\)
  • One-Sample Hypotheses:
    • \(H_0\): The proportion of observations above the hypothesized median is \(p = 0.50\)
    • \(H_A\): The proportion of observations above the hypothesized median is not \(0.50\) (\(p \neq 0.50\))
  • Paired Design Hypotheses:
    • \(H_0\): The distribution of the two measurements is the same
    • \(H_A\): The distribution of the two measurements is not the same
  • Has little power because it is simplistic and works best with large samples

An Example of the Sign Test

The process by which a single species splits into two species is still not well understood. One proposal involves “sexual conflict” – a genetic arms race between males and females that arises from their different reproductive roles. For example, in a number of insect species, male seminal fluid contains chemicals that reduce the tendency of the female to mate again with other males. However, these chemicals also reduce female survival, so females have developed mechanisms to counter these chemicals. Sexual conflict can cause rapid genetic divergence between isolated populations of the same species, leading to the formation of new species. Sexual conflict is more pronounced in species in which females mate more than once, leading to the prediction that they should form new species at a more rapid rate.

To investigate this, Arnqvist et al. (2000) identified 25 insect taxa (groups) in which females mate multiple times, and they paired each of these groups to a closely related insect group in which females only mate once. Which type of insects tend to have more species? Note: These are treated as paired data because the two sets of groups are closely related

taxa = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter13/chap13e4SexualConflict.csv"))
head(taxa)
##   taxonPair nSpeciesMultipleMating nSpeciesSingleMating difference
## 1         A                     53                   10         43
## 2         B                     73                  120        -47
## 3         C                    228                   74        154
## 4         D                    353                  289         64
## 5         E                    157                   30        127
## 6         F                    300                    4        296
names(taxa)
## [1] "taxonPair"              "nSpeciesMultipleMating"
## [3] "nSpeciesSingleMating"   "difference"
summary(taxa)
##  taxonPair  nSpeciesMultipleMating nSpeciesSingleMating   difference      
##  A      : 1  Min.   :    7         Min.   :   4.0       Min.   : -980.0   
##  B      : 1  1st Qu.:   34         1st Qu.:  10.0       1st Qu.:   -3.0   
##  C      : 1  Median :   77         Median :  30.0       Median :   16.0   
##  D      : 1  Mean   : 1134         Mean   : 263.5       Mean   :  870.5   
##  E      : 1  3rd Qu.:  228         3rd Qu.: 115.0       3rd Qu.:   79.0   
##  F      : 1  Max.   :21000         Max.   :3500.0       Max.   :20940.0   
##  (Other):19
par(mfrow = c(1, 2))
hist(taxa$difference, breaks = 9, col = "firebrick", main = "Histogram of Difference in Species Number", 
    xlab = "", cex.main = 0.9)
qqnorm(taxa$difference, col = "blue")
qqline(taxa$difference, col = "red")

par(mfrow = c(1, 1))

shapiro.test(taxa$difference)
## 
##  Shapiro-Wilk normality test
## 
## data:  taxa$difference
## W = 0.25473, p-value = 2.911e-10

The difference variable includes negative values, so we cannot take logarithms directly. What do we do? We can shift every value upward so that the minimum becomes \(2\) and then take the natural logarithm:

log.diff = log(taxa$difference + (2 - min(taxa$difference)))
summary(log.diff)

boxplot(log.diff, horizontal = TRUE)
shapiro.test(log.diff)

That didn’t work. We could try some R packages designed to help transform the data but that would be pretty useless. Consequently, we might as well opt for the Sign Test.

\(H_0\): Median difference in number of species is \(= 0\)
\(H_A\): Median difference in number of species is \(\neq 0\)

Let us use the BSDA library, which has the test built in. Passing both measurement columns, as below, runs the dependent-samples (i.e., paired) version of the sign test on the pairwise differences, which is appropriate here because the taxon pairs are matched.

library(BSDA)
SIGN.test(taxa$nSpeciesMultipleMating, taxa$nSpeciesSingleMating, 
    md = 0, alternative = "two.sided", conf.level = 0.95)
## 
##  Dependent-samples Sign-Test
## 
## data:  taxa$nSpeciesMultipleMating and taxa$nSpeciesSingleMating
## S = 18, p-value = 0.04329
## alternative hypothesis: true median difference is not equal to 0
## 95 percent confidence interval:
##   1.00000 77.16674
## sample estimates:
## median of x-y 
##            16
##                   Conf.Level L.E.pt  U.E.pt
## Lower Achieved CI     0.8922      1 70.0000
## Interpolated CI       0.9500      1 77.1667
## Upper Achieved CI     0.9567      1 78.0000

Given the p-value we can reject \(H_0\); the Median difference is not \(=0\). The data suggest that groups of insects whose females mate multiple times have more species than groups whose females mate singly, consistent with the sexual-conflict hypothesis.

Equivalently, since the data set already contains the pairwise differences, we could have run the one-sample version of the test on that column and obtained the same result.

SIGN.test(taxa$difference, md = 0, alternative = "two.sided", 
    conf.level = 0.95)
## 
##  One-sample Sign-Test
## 
## data:  taxa$difference
## s = 18, p-value = 0.04329
## alternative hypothesis: true median is not equal to 0
## 95 percent confidence interval:
##   1.00000 77.16674
## sample estimates:
## median of x 
##          16
##                   Conf.Level L.E.pt  U.E.pt
## Lower Achieved CI     0.8922      1 70.0000
## Interpolated CI       0.9500      1 77.1667
## Upper Achieved CI     0.9567      1 78.0000

Let us also see how to run the Sign test manually. The first thing we will do is count how many cases are \(+\), how many are \(-\), and how many are equal to the hypothesized median of \(0\).

taxa$sign = "equal to median"
taxa$sign[taxa$difference > 0] = "plus"
taxa$sign[taxa$difference < 0] = "minus"
taxa$sign = factor(taxa$sign, levels = c("minus", "equal to median", 
    "plus"))
library(knitr)  # needed for kable()
kable(taxa)
taxonPair nSpeciesMultipleMating nSpeciesSingleMating difference sign
A 53 10 43 plus
B 73 120 -47 minus
C 228 74 154 plus
D 353 289 64 plus
E 157 30 127 plus
F 300 4 296 plus
G 34 18 16 plus
H 3400 3500 -100 minus
I 20 1000 -980 minus
J 196 486 -290 minus
K 1750 660 1090 plus
L 55 63 -8 minus
M 37 115 -78 minus
N 100 30 70 plus
O 21000 60 20940 plus
P 37 40 -3 minus
Q 7 5 2 plus
R 15 7 8 plus
S 18 6 12 plus
T 240 13 227 plus
U 15 14 1 plus
V 77 16 61 plus
W 15 14 1 plus
X 85 6 79 plus
Y 86 8 78 plus

Here is a simple frequency of pluses and minuses

table(taxa$sign)
## 
##           minus equal to median            plus 
##               7               0              18

So we have 7 minuses and 18 pluses. If the distribution were evenly spread we should have seen 50% of the cases below and 50% above the median. This is not what we see, but is this chance or something systematic? Let us apply the binomial test and see.

binom.test(7, 25, p = 0.5)
## 
##  Exact binomial test
## 
## data:  7 and 25
## number of successes = 7, number of trials = 25, p-value = 0.04329
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.1207167 0.4938768
## sample estimates:
## probability of success 
##                   0.28

Note the p-value: it matches the one from SIGN.test() above, so we can again reject \(H_0\); the median difference is not \(0\).

The Wilcoxon Signed-Rank Test

The Sign Test is weak in that it does not take into account the magnitudes of how much above/below the median an observation falls. The Wilcoxon Signed-Rank Test is a slight improvement because it ranks observations on the basis of how far below/above they fall from the hypothesized median.

  • Assumes the distribution is symmetric around the Median … in other words, an assumption of no skew, which is quite restrictive
  • Steps:
    • Calculate \(X_i - \mu_{0}\) for all \(i=1,2,\ldots,n\)
    • Rank in ascending order the absolute differences \(\left|X_i - \mu_{0} \right|\) for all \(i=1,2,\ldots,n\)
    • Assign + or - to each rank
    • Let the sum of + and - ranks be \(W^{+}\) and \(W^{-}\), respectively
    • Let \(W = min\left(W^{+},W^{-}\right)\) and \(W^{*}_{\alpha}\) be critical W
    • Reject \(H_0\): The median difference between the two samples is \(\mu_0 = 0\) if …
      • \(H_A\): is \(\mu \neq \mu_0\) and \(W \leq W^{*}_{\alpha}\)
      • \(H_A\): is \(\mu > \mu_0\) and \(W^{-} \leq W^{*}_{\alpha}\)
      • \(H_A\): is \(\mu < \mu_0\) and \(W^{+} \leq W^{*}_{\alpha}\)
  • Normal approximation: \(z_0 = \dfrac{W^{+} - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}}\) … used when \(n \geq 50\) and there are no ties

Let us run the test on the Sexual Conflict data:

wilcox.test(taxa$nSpeciesMultipleMating, taxa$nSpeciesSingleMating, 
    paired = TRUE)
## Warning in wilcox.test.default(taxa$nSpeciesMultipleMating, taxa
## $nSpeciesSingleMating, : cannot compute exact p-value with ties
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  taxa$nSpeciesMultipleMating and taxa$nSpeciesSingleMating
## V = 230, p-value = 0.07139
## alternative hypothesis: true location shift is not equal to 0

Do not reject \(H_0\) at \(\alpha = 0.05\); we cannot conclude that the number of species differs between the polyandrous and the monandrous groups. Note that this conflicts with the Sign Test result above; the signed-rank test assumes the differences are symmetrically distributed, and that assumption is questionable for these heavily skewed data.
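
To see where the test statistic comes from, here is a by-hand sketch of the signed ranks (this assumes the taxa data frame from above is still in memory):

d = taxa$difference[taxa$difference != 0]  # drop any zero differences
r = rank(abs(d))  # rank the absolute differences (ties get average ranks)
W.plus = sum(r[d > 0])  # sum of ranks for positive differences; should match V = 230 above
W.minus = sum(r[d < 0])  # sum of ranks for negative differences
c(W.plus = W.plus, W.minus = W.minus)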

The Mann-Whitney \(U-test\)

If we have a two-sample (i.e., unpaired) design, then the Mann-Whitney U-Test can be used assuming we couldn’t transform the data to run a \(t-test\).

  • The assumptions of the Mann-Whitney U test are:
    • The variable of interest is continuous (not discrete), and the measurement scale is at least ordinal
    • The probability distributions of the two populations are identical, except for location (i.e., the “center”)
    • The two samples are independent
    • Both are simple random samples from their respective populations
  • \(H_0\): The samples come from populations with similar probability distributions
  • Test Process and Statistic …
    • Combine both samples and rank, in ascending order, all values
    • If there are ties, rank accordingly
    • Sum the ranks of the smaller group \((R_1)\)
    • \(U_1\) = \(n_1 n_2 + \dfrac{n_1(n_1 + 1)}{2} - R_{1}\); \(U_2 = n_1 n_2 - U_{1}\)
    • Choose the larger of \(U_1\) or \(U_2\) as the test statistic
    • Reject \(H_0\) if \(P-value \leq \alpha\)

An Example of the Mann-Whitney U Test (aka the Wilcoxon Rank-Sum Test)

The sage cricket, Cyphoderris strepitans, has an unusual form of mating. During mating, the male offers his fleshy hind wings to the female to eat. The wounds are not fatal but a male with already nibbled wings is less likely to be chosen by females he meets later. Females get some nutrition from feeding on the wings, which raises the question, “Are females more likely to mate if they are hungry?”

Johnson et al. (1999) answered this question by randomly dividing 24 females into two groups: one group of 11 females was starved for at least two days and the other group of 13 females was fed during the same period. Finally, each female was put separately into a cage with a single (new) male, and the waiting time to mating was recorded.

\(H_0\): Time to mating is the same for female crickets that were starved as for those who were fed
\(H_A\): Time to mating is not the same for female crickets that were starved as for those who were fed

cric = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter13/chap13e5SagebrushCrickets.csv"))
head(cric)
##   feedingStatus timeToMating
## 1       starved          1.9
## 2       starved          2.1
## 3       starved          3.8
## 4       starved          9.0
## 5       starved          9.6
## 6       starved         13.0
names(cric)
## [1] "feedingStatus" "timeToMating"
wilcox.test(cric$timeToMating ~ cric$feedingStatus, paired = FALSE)
## 
##  Wilcoxon rank sum test
## 
## data:  cric$timeToMating by cric$feedingStatus
## W = 88, p-value = 0.3607
## alternative hypothesis: true location shift is not equal to 0

Do not reject \(H_0\); time to mating does not appear to be different for starved versus fed female crickets
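
To connect this output to the formulas listed earlier, here is a by-hand sketch of the rank sums and the two U statistics (assuming the cric data frame is still in memory; R’s wilcox.test() labels its rank-sum-based statistic W, and it corresponds to one of the two U values):

r = rank(cric$timeToMating)  # pooled ranks of all 24 waiting times (ties averaged)
R1 = sum(r[cric$feedingStatus == "starved"])  # rank sum for the smaller (starved) group
n1 = sum(cric$feedingStatus == "starved")
n2 = sum(cric$feedingStatus == "fed")
U1 = n1 * n2 + n1 * (n1 + 1)/2 - R1  # U based on the starved group's rank sum
U2 = n1 * n2 - U1  # the complementary U
c(U1 = U1, U2 = U2)  # one of these should correspond to W = 88 above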

Note: wilcox.test() warns us when there are ties in the data because it cannot then compute an exact p-value; if we have ties we need to switch to a version of the Wilcoxon test designed to handle them. I’ll load the Blind Gestures data to see this in action.

gestures = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/review2/rev2q13BlindGestures.csv"))
head(gestures)
##   sightedness numberOfGestures
## 1       Blind                0
## 2       Blind                1
## 3       Blind                1
## 4       Blind                2
## 5       Blind                1
## 6       Blind                1
names(gestures)
## [1] "sightedness"      "numberOfGestures"
wilcox.test(gestures$numberOfGestures ~ gestures$sightedness, 
    paired = FALSE)
## Warning in wilcox.test.default(x = c(0L, 1L, 1L, 2L, 1L, 1L, 1L, 3L, 1L, :
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  gestures$numberOfGestures by gestures$sightedness
## W = 62.5, p-value = 0.5755
## alternative hypothesis: true location shift is not equal to 0
library(exactRankTests)
wilcox.exact(gestures$numberOfGestures ~ gestures$sightedness, 
    paired = FALSE)
## 
##  Exact Wilcoxon rank sum test
## 
## data:  gestures$numberOfGestures by gestures$sightedness
## W = 62.5, p-value = 0.6361
## alternative hypothesis: true mu is not equal to 0

Kolmogorov-Smirnov (K-S) Test

This (very weak) test is used to compare the distributions of two groups by comparing the empirical cumulative distribution functions (ecdfs) of the two groups and finding the greatest absolute distance between them.

  • The \(ecdf\) is \(\hat{F}(X) =\) fraction of sample with values \(\leq X_i\) , where \(i=1,2,3, \cdots, n\)
  • The \(K-S\) statistic is \(D_{max} = \max_{X} \left| \hat{F}_1(X) - \hat{F}_2(X) \right|\)
  • Assumptions:
    • The measurement scale is at least ordinal.
    • The probability distributions are continuous
    • The two samples are mutually independent
    • Both samples are simple random samples from their respective populations
  • \(H_0:\) \(F_1(X) = F_2(X)\) for all \(X_i\); \(H_A:\) \(F_1(X) \neq F_2(X)\) for at least one \(X_i\)
  • Reject \(H_0\) if \(P-value\) of calculated \(D_{max} \leq \alpha\)
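
Before the example, here is a minimal sketch of how \(D_{max}\) is computed, using two small simulated samples (g1 and g2 are illustrative, not course data):

set.seed(1)
g1 = rnorm(30)  # illustrative sample 1
g2 = rnorm(30, mean = 0.5)  # illustrative sample 2
x.grid = sort(c(g1, g2))  # the ecdfs can only change at observed values
D = max(abs(ecdf(g1)(x.grid) - ecdf(g2)(x.grid)))  # greatest vertical gap between the two ecdfs
D
ks.test(g1, g2)$statistic  # should agree with the hand-computed D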

An Example

The stalk-eyed fly, Cyrtodiopsis dalmanni, is a bizarre-looking insect from the jungles of Malaysia. Its eyes are at the ends of long stalks that emerge from its head, making the fly look like something from the cantina scene in Star Wars. These eye stalks are present in both sexes, but they are particularly impressive in males. The span of the eye stalk in males enhances their attractiveness to females as well as their success in battles against other males. The span, in millimeters, from one eye to the other, was measured in a random sample of 45 male stalk-eyed flies reared on one of two diets, corn or cotton (the food variable in the data below). Does diet have an effect on eye span?

Let us plot the distribution and run the formal tests to see if we can stick with the usual \(t-test\)

eyes = read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter12/chap12q18StalkieEyespan.csv"))
head(eyes)
##   food eyeSpan
## 1 Corn    2.15
## 2 Corn    2.14
## 3 Corn    2.13
## 4 Corn    2.13
## 5 Corn    2.12
## 6 Corn    2.11
names(eyes)
## [1] "food"    "eyeSpan"
boxplot(eyes$eyeSpan ~ eyes$food, horizontal = TRUE)

var.test(eyes$eyeSpan ~ eyes$food)
## 
##  F test to compare two variances
## 
## data:  eyes$eyeSpan by eyes$food
## F = 0.068748, num df = 20, denom df = 23, p-value = 9.646e-08
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.0291719 0.1663771
## sample estimates:
## ratio of variances 
##           0.068748
library(car)
leveneTest(eyes$eyeSpan, eyes$food, center = "mean")
##       Df  F value    Pr(>F)
## group  1 14.76323 0.0003965
##       43
leveneTest(eyes$eyeSpan, eyes$food, center = "median")
##       Df F value    Pr(>F)
## group  1 14.7773 0.0003943
##       43

The box-plots and the formal tests (at least Levene’s test for homogeneity of variances) suggest we should reject the null of equal variances. We could try to normalize the data via the natural logarithm.

eyes$log.span = log(eyes$eyeSpan)
boxplot(eyes$log.span ~ eyes$food, horizontal = TRUE)

var.test(eyes$log.span ~ eyes$food)
## 
##  F test to compare two variances
## 
## data:  eyes$log.span by eyes$food
## F = 0.040201, num df = 20, denom df = 23, p-value = 7.154e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.01705872 0.09729156
## sample estimates:
## ratio of variances 
##         0.04020145
leveneTest(eyes$log.span, group = eyes$food, center = "mean")
##       Df  F value   Pr(>F)
## group  1 19.99886 5.58e-05
##       43
leveneTest(eyes$log.span, group = eyes$food, center = "median")
##       Df  F value   Pr(>F)
## group  1 18.63644 9.13e-05
##       43

Now we have a quandary: whether we center on the mean or the median, Levene’s test still tells us to reject the null of equal variances. A safe approach is to concede that we cannot normalize the data, so the Kolmogorov-Smirnov test may be called for.

ks.test(eyes$food, eyes$eyeSpan, alternative = "two.sided")
## Warning in ks.test(eyes$food, eyes$eyeSpan, alternative = "two.sided"):
## cannot compute exact p-value with ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  eyes$food and eyes$eyeSpan
## D = 0.46667, p-value = 0.0001109
## alternative hypothesis: two-sided

Uh oh! We have ties, so an exact p-value cannot be computed. (Note, too, that the call above passes the food factor itself as the first sample; the intended two-sample comparison would split eyeSpan by food, e.g. ks.test(eyes$eyeSpan[eyes$food == "Corn"], eyes$eyeSpan[eyes$food == "Cotton"]), and ties would likely still be an issue because the eye spans are recorded to only two decimal places.) Technically, a permutation test would be useful here, but we won’t look at its mechanics until later in the semester. For now, though, we can run the test to close the loop.

library(exactRankTests)
perm.test(eyes$eyeSpan ~ eyes$food, paired = FALSE, alternative = "two.sided", 
    exact = TRUE)
## 
##  2-sample Permutation Test (scores mapped into 1:(m+n) using
##  rounded scores)
## 
## data:  eyes$eyeSpan by eyes$food
## T = 855, p-value = 9.363e-09
## alternative hypothesis: true mu is not equal to 0

Well, we might as well try the coin library then.

library(coin)
independence_test(eyes$eyeSpan ~ eyes$food, alternative = "two.sided")
## 
##  Asymptotic General Independence Test
## 
## data:  eyes$eyeSpan by eyes$food (Corn, Cotton)
## Z = 5.095, p-value = 3.487e-07
## alternative hypothesis: two.sided

The test tells us we can easily reject \(H_0\); diet appears to have an effect on eye spans.

The Empirical Cumulative Distribution Function

There are several ways to plot this.

library(Hmisc)
Ecdf(~eyeSpan | food, data = eyes)

The plot above was generated via the Hmisc library but you could also whip one up by hand. See below for an example.

corn = subset(eyes, food == "Corn")
cotton = subset(eyes, food == "Cotton")

plot(ecdf(corn$eyeSpan), do.points = FALSE, verticals = TRUE, 
    xlim = c(1, 2.2), pch = 46, col = "cornflowerblue", xlab = "Eye Span", 
    ylab = "Proportion", main = "ECDF of Eye Span (Corn vs. Cotton)")
lines(ecdf(cotton$eyeSpan), do.points = FALSE, verticals = TRUE, 
    pch = 46, col = "salmon")
abline(v = 1.77, col = "gray")
legend(1, 0.8, legend = c("Corn", "Cotton"), col = c("cornflowerblue", 
    "salmon"), lwd = 1, cex = 0.75)


The Testing Sequence

  • One-Sample or Paired Design
    • Test for Normality
    • If \(N()\) holds then use the One-Sample \(t-test\) (or the Paired \(t-test\) for a paired design)
    • If \(N()\) does not hold then try a transformation, test for \(N()\) again
    • If \(N()\) still does not hold then use the Sign test (one sample) or the Wilcoxon Signed-Rank test (paired design)
  • Two Sample design (equal variances)
    • Test for Normality and equal variances
    • If \(N()\) holds and variances are equal then use the Two-Sample \(t-test\) with \(var.equal=TRUE\)
    • If \(N()\) does not hold then try a transformation, test for \(N()\) again
    • If \(N()\) still does not hold then use the Mann-Whitney \(U-test\)
  • Two Sample design (unequal variances)
    • Test for Normality and equal variances
    • If \(N()\) holds but variances are unequal then use the Two-Sample \(t-test\) with \(var.equal=FALSE\)
    • If \(N()\) does not hold then try a transformation, test for \(N()\) again
    • If \(N()\) still does not hold then use the Kolmogorov-Smirnov test
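
As a rough (and deliberately simplified) sketch, the two-sample branch of this sequence could be wired up in R as follows; the function name and the 0.05 cutoffs are illustrative choices, and the transformation step described above is omitted for brevity:

two.sample.check = function(y, g) {
    g = factor(g)
    norm.p = tapply(y, g, function(v) shapiro.test(v)$p.value)  # normality within each group
    var.p = var.test(y ~ g)$p.value  # F-test for equal variances
    if (all(norm.p > 0.05)) {
        t.test(y ~ g, var.equal = var.p > 0.05)  # pooled t-test if variances look equal, Welch otherwise
    } else if (var.p > 0.05) {
        wilcox.test(y ~ g)  # Mann-Whitney U test
    } else {
        ks.test(y[g == levels(g)[1]], y[g == levels(g)[2]])  # Kolmogorov-Smirnov test
    }
}
# Example: two.sample.check(eyes$eyeSpan, eyes$food)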