Chapter 11 Comparing Proportions

Thus far we have worked with continuous and multinomial outcomes, but the measures you are more likely to encounter in the social and behavioral sciences happen to be dichotomous: Did the job-training participant get employment \((y=1)\) or not \((y=0)\)? Did the patient improve post-surgery because they received pre-operative physical therapy \((y=1)\) or not \((y=0)\)? Does the community support this initiative \((y=1)\) or not \((y=0)\)? Did someone volunteer for the food drive \((y=1)\) or not \((y=0)\)? The dichotomy may also arise in other ways as, for example, with the questions of whether women are more likely to vote \((y=1)\) than men \((y=0)\), whether women are more likely to vote Democrat \((y=1)\) than men \((y=0)\), and so on. With data such as these, the basic logic of hypothesis testing continues to guide analysis, but the manner in which we analyze the data differs from the \(t\)-tests used in the preceding chapter. The need for hypothesis testing remains the same: We have to determine whether the proportions we see in our sample are very likely to reflect what is true in the population before we state a claim.

Take, for instance, the question of blacks killed by police officers. If you read the story you see statements such as “30% of black victims were unarmed compared to 21% of white victims in 2015”. The Washington Post releases a continually updated data-set of all individuals killed by a police officer and so we could analyze the statement for ourselves. Similarly, we could shatter the myth of the chivalry of the seas with the specific example of the sinking of the Titanic and asking whether women were more likely to survive than men by studying the details of the passenger manifest. These are but two examples of the ways in which categorical data tied to a pressing issue or to a lingering romantic historical artifact can be used to judge claims for ourselves, and to correct the record if necessary.

To begin to do so, we can recognize that these are dichotomous (or binary) outcomes, where only one of two events can occur at a time per person – a passenger either survives or does not survive – and that the binomial distribution characterizes such outcomes. We covered this distribution in Chapter 6 (section 6.2.1.1, to be precise). How could the binomial distribution help us here?

Let us start with the naive assumption that in any shipwreck involving passengers of both sexes, the probability of survival is \(0.50\) for both men and women. Recall that the probability of observing \(x\) successes in \(n\) trials of a binomial process is given by

\[\begin{eqnarray*} P\left[x \text{ successes}\right] = \binom{n}{x}p^{x}\left(1 - p\right)^{n-x} \\ \text{where } \binom{n}{x} = \dfrac{n!}{x!(n-x)!} \text{ and } n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1 \end{eqnarray*}\]

How many female passengers (aka trials) do we have? 466. Let us take this piece of data, together with \(p=0.50\) to generate the distribution of female passengers’ survival via the binomial distribution. The resulting number of female passengers surviving the shipwreck is shown below. The distribution peaks at 233, which is \(=0.5 \times 466\) … i.e., the most likely outcome is that one-half of the female passengers survive.

FIGURE 11.1: Binomial Distribution of Female Passenger’s Survival with p = 0.5
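The distribution in Figure 11.1 can be reproduced directly from the binomial formula. The following is a minimal sketch using only Python's standard library; the function and variable names (`binom_pmf`, `n_female`, `p_naive`) are illustrative choices, not something taken from the text.

```python
from math import comb

def binom_pmf(x, n, p):
    """P[x successes] in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n_female = 466   # number of female passengers (trials), from the text
p_naive = 0.5    # the naive survival probability assumed above

# Probability of every possible number of survivors, 0 through 466
pmf = [binom_pmf(x, n_female, p_naive) for x in range(n_female + 1)]

# The most likely number of survivors is where the distribution peaks
most_likely = max(range(n_female + 1), key=lambda x: pmf[x])

print(most_likely)          # location of the peak
print(round(sum(pmf), 6))   # the probabilities should sum to 1
```

Plotting `pmf` against the number of survivors would give the shape shown in the figure.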

How many of the female passengers survived the Titanic sinking? Some 339 did, which makes the actual probability of survival approximately \(0.7275\). If this were the true probability of a female passenger surviving a shipwreck, the distribution of survivals should have looked as follows:

FIGURE 11.2: Binomial Distribution of Female Passenger’s Survival with p = 0.5 vs. p = 0.7275

Note how the actual distribution of survival (in blue) differs from the expected distribution of survival (in red), and this difference is stark. But this is no surprise given that we had seen a massive gap between the expected probability of \(0.50\) and the actual probability of \(0.7275\). Assuming no complications with the data and factors shaping survival, this simple analysis would lead us to conclude that survival probabilities, at least on the Titanic, were quite a bit higher than \(0.5\) for female passengers. What about for the 843 male passengers? Their survival probability was about \(0.1910\), leading to two very different distributions of survival for the sexes.

FIGURE 11.3: Binomial Distribution of Female vs. Male Passengers

This side-by-side comparison would also lead us to suspect that males were less likely to survive than females. But how could we draw these sorts of conclusions via hypothesis tests? As it turns out, in a fairly straightforward manner, and I demonstrate the approach below.

11.1 Specifying Hypotheses for One-group Tests

Now we no longer refer to the null population mean \(\mu\) but instead to the null population proportion \(p_0\) when constructing hypotheses, with the sample proportion represented by \(\bar{p}\). In particular, the hypotheses will look as follows:

\[\begin{eqnarray*} \text{Lower Tail Test } \ldots H_{0}: p \geq p_{0}; H_{1}: p < p_{0} \\ \text{Upper Tail Test } \ldots H_{0}: p \leq p_{0}; H_{1}: p > p_{0} \\ \text{Two Tailed Test } \ldots H_{0}: p = p_{0}; H_{1}: p \neq p_{0} \end{eqnarray*}\]

The sample standard deviation \((s)\) is calculated as \(s = \sqrt{p_0 \times ( 1 - p_0)}\). Note that we are using \(p_0\) and not \(\bar{p}\). The standard error \((s_{\bar{p}})\) is then calculated as \(s_{\bar{p}} = \dfrac{s}{ \sqrt{n} } = \dfrac{ \sqrt{ p_0 \times \left(1 - p_0 \right) } }{ \sqrt{n} }\).

The test statistic is: \(z = \dfrac{ \bar{p}-p_{0} }{ s_{\bar{p}} }\). Again, note that we are using the \(z\) distribution and not the \(t\).

Confidence intervals are calculated as before except the standard error is calculated with the sample proportion \(\bar{p}\), i.e.,

\[\bar{s}_{\bar{p}} = \sqrt{ \dfrac{ \left( \bar{p} \times (1-\bar{p}) \right) }{n} }\]

leading to the confidence interval defined as \(\bar{p} \pm z_{\alpha/2} (\bar{s}_{\bar{p}})\). We also make a continuity correction to adjust for the fact that we are approximating a discrete distribution with a continuous distribution: we subtract \(\dfrac{0.5}{n}\) from the lower confidence interval limit and add \(\dfrac{0.5}{n}\) to the upper limit.

The decision rules stay the same: Specify the hypotheses, set \(\alpha = 0.05\) or \(\alpha=0.01\), reject the null hypothesis if the \(p-value \leq \alpha\), state the conclusion in words, and then report and interpret the confidence interval for the estimate.

11.1.1 The Normal Approximation to One-group Tests

We will start with the normal approximation tests before switching to the more precise tests, if only because you will see the normal approximation tests in use in other texts.

11.1.1.1 Example 1

The probability of surviving a shipwreck is assumed to be \(0.5\). When the Titanic went down, only some 38.1971% of its 1,309 passengers survived. Did the Titanic have a significantly lower survival rate than the assumed probability of \(0.5\)?

\[\begin{array}{l} H_0: \text{ Probability of surviving the Titanic } (p \geq 0.50) \\ H_1: \text{ Probability of surviving the Titanic } (p < 0.50) \end{array}\]

Let us use \(\alpha = 0.01\)

The standard deviation is \(s = \sqrt{p_0 \times \left( 1 - p_0 \right)} = \sqrt{0.5 \times \left( 1 - 0.5\right)} = \sqrt{0.25} = 0.5\)

The standard error is \(s_{\bar{p}} = \dfrac{s}{\sqrt{n}} = \dfrac{0.5}{\sqrt{1309}} = \dfrac{0.5}{36.18011} = 0.01381975\)

The test statistic is \(z=\dfrac{\bar{p}-p_{0}}{s_{\bar{p}}} = \dfrac{0.381971 - 0.5}{0.01381975} = \dfrac{-0.118029}{0.01381975} = -8.540603\). The \(p-value\) is practically \(0\) (to be precise, it is \(6.676161e-18\)) so we can easily reject the null hypothesis. The data suggest that the Titanic had a significantly lower survival rate than \(0.5\).

The 99% confidence interval is \(\bar{p} \pm z_{\alpha/2}(\bar{s}_{\bar{p}}) = 0.381971 \pm 2.576 \left(\sqrt{\dfrac{0.381971 \times (1 - 0.381971)}{1309}}\right) = (0.3473774, 0.4165646)\), where \(z_{\alpha/2} = z_{0.005} = 2.576\) is the critical value for a two-sided 99% interval. With the continuity correction of \(\dfrac{0.5}{1309} = 0.000381971\) we have \((0.3473774 - 0.000381971)\) and \((0.4165646 + 0.000381971)\), respectively. That is, we can be about 99% confident that the true survival rate for shipwrecks lies in the interval given by \(0.3470\) and \(0.4169\).

11.1.1.2 Example 2

Your public school district carried out a drug-use survey and found the proportion of the 497 students reporting “occasional recreational use of opioids” to be \(0.11\). The national average is reported to be \(0.06\); is your school district’s rate significantly different from the national average? Use \(\alpha = 0.05\).

\[\begin{array}{l} H_0: \text{ The district's rate is not different from the national average } (p = 0.06) \\ H_1: \text{ The district's rate is different from the national average } (p \neq 0.06) \end{array}\]

The standard deviation is \(s = \sqrt{p_0 \times \left( 1 - p_0 \right)} = \sqrt{0.06 \times \left( 1 - 0.06\right)} = \sqrt{0.0564} = 0.2374868\)

The standard error is \(s_{\bar{p}} = \dfrac{s}{\sqrt{n}} = \dfrac{0.2374868}{\sqrt{497}} = \dfrac{0.2374868}{22.2935} = 0.01065274\)

The test statistic is \(z=\dfrac{\bar{p}-p_{0}}{s_{\bar{p}}} = \dfrac{0.11 - 0.06}{0.01065274} = \dfrac{0.05}{0.01065274} = 4.693628\). This has a \(p-value\) that is practically \(0\) and so we can reject the null hypothesis; these data suggest that the school district’s rate is significantly different from the national average.

The 95% confidence interval is \(\bar{p} \pm z_{\alpha/2} (\bar{s}_{\bar{p}}) = 0.11 \pm 1.96 (0.01403502) = (0.08249136, 0.1375086)\), and with the continuity correction we have \((0.08249136 - 0.001006036, 0.1375086 + 0.001006036)\), indicating that we can be about 95% confident the true rate lies between 8.15% and 13.85%.
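The arithmetic in Example 2 can be checked with a few lines of code. This is a sketch under the formulas given in section 11.1 (standard error from \(p_0\) for the test, from \(\bar{p}\) for the interval, and a \(0.5/n\) continuity correction); the function name `one_sample_prop_ztest` is hypothetical, and the critical value comes from the standard library rather than a rounded table value.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_prop_ztest(p_bar, p0, n, conf=0.95):
    """Two-tailed z test for one proportion, plus a continuity-corrected CI."""
    std_norm = NormalDist()                      # standard normal, mean 0 and sd 1
    se_null = sqrt(p0 * (1 - p0) / n)            # standard error under H0 (uses p0)
    z = (p_bar - p0) / se_null
    p_value = 2 * (1 - std_norm.cdf(abs(z)))     # two-tailed p-value
    se_ci = sqrt(p_bar * (1 - p_bar) / n)        # standard error for the CI (uses p_bar)
    z_crit = std_norm.inv_cdf(1 - (1 - conf) / 2)
    cc = 0.5 / n                                 # continuity correction
    lower = p_bar - z_crit * se_ci - cc
    upper = p_bar + z_crit * se_ci + cc
    return z, p_value, (lower, upper)

# Example 2: 497 students, sample proportion 0.11, national average 0.06
z, p_value, (lower, upper) = one_sample_prop_ztest(p_bar=0.11, p0=0.06, n=497)
print(round(z, 4), p_value, round(lower, 4), round(upper, 4))
```

For a one-tailed test such as Example 1, the p-value line would instead use a single tail, e.g. `std_norm.cdf(z)` for a lower-tail test.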

11.1.2 The Binomial Test

The normal approximation we used above works well in “large samples” though defining “large” is tricky. Some folks are perfectly content to rely on the central limit theorem to use the normal approximation so long as they have 30 or more cases but others proceed with more caution. The latter group will focus on the known or, if unknown, the suspected population proportion \(p\). If \(p = 0.5\), then one can get away with the normal approximation when the sample size is 9. As \(p\) gets closer to \(0\) or \(1\), the sample size needed increases. If \(p=0.8\) or \(p=0.2\), you need \(n=36\) and if \(p=0.9\) or \(p=0.1\) you need \(n=81\).
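The sample sizes quoted above (9, 36, and 81) are consistent with the rule of thumb \(n \geq 9 \times \max\left(\dfrac{p}{1-p}, \dfrac{1-p}{p}\right)\). The text does not name the rule it relies on, so treat the formula in this sketch as an assumption that happens to reproduce those three numbers; the function name is likewise made up for illustration.

```python
from fractions import Fraction
from math import ceil

def min_n_for_normal_approx(p):
    """Smallest n with n >= 9 * max(p/(1-p), (1-p)/p); an assumed rule of thumb."""
    p = Fraction(str(p))   # exact arithmetic avoids float round-off in the ratios
    q = 1 - p
    return ceil(9 * max(p / q, q / p))

for p in (0.5, 0.8, 0.2, 0.9, 0.1):
    print(p, min_n_for_normal_approx(p))
```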

Others will eschew the normal approximation in favor of more precise binomial tests. I follow this route because with the availability of computers there is really no longer any need to use the normal approximation with proportions. In fact, you don’t even need a statistics package, all you need is a calculator and a browser and you can run the tests via, for example, this applet or this applet.

11.1.2.1 Example 1 Revisited

Using the first applet, enter the values asked for and click Calculate. The first line of results will have a row called “Binomial Probability: P(X = 500)” and the corresponding \(p-value\) will be shown as \(< 0.000001\). This \(p-value\) is clearly smaller than the \(\alpha = 0.01\) we chose for this example, so we can reject the null hypothesis. This is the same conclusion we arrived at earlier.

The 99% confidence interval runs from \(0.3480\) to \(0.4170\).

11.1.2.2 Example 2 Revisited

Once again, if you enter the values asked for and calculate the \(p-value\) you will see “Cumulative Probability: P(X > 55)” with a corresponding value of \(5.74906923800356E-06\). Since it is a two-tailed hypothesis test, the \(p-value\) becomes \(0.00001149814\), again allowing us to reject the null hypothesis.

The 95% confidence interval runs from \(0.0858\) to \(0.1414\).
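If you would rather not depend on an applet, the exact binomial tail probabilities can be computed in a few lines. The sketch below is one way to do it, not the applet's own method: it works on the log scale because \(0.5^{1309}\) is too small for ordinary floating-point arithmetic. Two assumptions are worth flagging: Example 2's count is taken as 55 students (since \(0.11 \times 497 = 54.67\)), and the upper tail is computed as \(P(X \geq 55)\); whether the applet's tail includes 55 itself is not settled by the text, so the results may differ slightly from the values quoted above.

```python
from math import exp, lgamma, log

def binom_pmf(x, n, p):
    """Binomial probability computed on the log scale to avoid underflow."""
    log_coef = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return exp(log_coef + x * log(p) + (n - x) * log(1 - p))

def lower_tail(x, n, p):
    """P(X <= x)."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))

def upper_tail(x, n, p):
    """P(X >= x)."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

# Example 1: 500 survivors among 1,309 passengers, H0: p >= 0.50 (lower-tail test)
p_lower = lower_tail(500, 1309, 0.50)

# Example 2: about 55 of 497 students, H0: p = 0.06 (two-tailed test)
p_upper = upper_tail(55, 497, 0.06)
p_two_sided = min(1.0, 2 * p_upper)   # doubling the tail, as the text does

print(p_lower, p_upper, p_two_sided)
```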

11.2 Two-group Tests

What if we have two groups instead of one as, for example, in the question of survival rates of males versus females on the Titanic, or even drug use among male versus female students in the school district? How can we carry out hypothesis tests for these designs?

11.2.1 Example 1

Assume, for example, that we are studying the Titanic disaster and interested in how survival rates differed by the passenger’s sex. A simple cross-tabulation of sex by survival status would look like the following:

FIGURE 11.4: Titanic Passengers’ Survival by Sex

TABLE 11.1: Survival Status by Sex

| | Died | Survived | Total |
|---|---|---|---|
| female | 127 | 339 | 466 |
| male | 682 | 161 | 843 |
| Total | 809 | 500 | 1309 |

The bar-chart shows a marked difference in the chance of survival for male versus female passengers, with female passengers more likely to survive. Specifically, only 19.1% of males survived versus 72.75% of females. The same pattern is visible in the contingency table as well. On the basis of these visualizations, it is very likely that a test for an association between sex and survival will reject the null hypothesis of no association between the two variables. Let us now turn to the hypothesis test.

In essence, we have a \(2 \times 2\) contingency table and could test whether the proportion of female deaths differs from the proportion of male deaths via a normal approximation but instead we will rely upon two other tests – (i) the \(\chi^2\) test, and (ii) the more precise Fisher’s Exact test.

11.2.2 The \(\chi^2\) Test

We saw this test in action in the preceding chapter so the mechanics are familiar to us. Let us run through them.

\[\begin{array}{l} H_0: \text{ Survival was independent of a passenger's sex } (p_{male} = p_{female}) \\ H_1: \text{ Survival was NOT independent of a passenger's sex } (p_{male} \neq p_{female}) \end{array}\]

Using the online calculator we obtain \(\chi^2_{1} = 363.62\) with a \(p-value < 0.0001\) so we can easily reject the null hypothesis; these data suggest that survival was not independent of a passenger’s sex.
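The statistic reported above can be reproduced by hand. Working through the arithmetic suggests that \(363.62\) is the \(\chi^2\) value with Yates' continuity correction applied (without the correction the same table gives roughly \(365.89\)); that reading is an inference from the numbers, not something the text states. The sketch below uses the shortcut formula for \(2 \times 2\) tables, and the function name `chi2_2x2` is illustrative.

```python
from math import erfc, sqrt

def chi2_2x2(table, yates=True):
    """Pearson chi-square for a 2 x 2 table [[a, b], [c, d]]; returns (statistic, p-value)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)    # Yates' continuity correction
    stat = n * diff**2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(stat / 2))       # upper tail of a chi-square with 1 df
    return stat, p_value

titanic = [[127, 339],   # female: died, survived
           [682, 161]]   # male:   died, survived

stat, p_value = chi2_2x2(titanic)
print(round(stat, 2), p_value)
```

The same function works for any \(2 \times 2\) table of counts; pass `yates=False` to see the uncorrected statistic.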

11.2.3 Fisher’s Exact Test

There is an online calculator for \(2 \times 2\) tables; see here. It does not report the exact \(p-value\), but it is the one calculator we currently have. As it turns out, the \(p-value\) is indeed very small (almost \(0\)) and so this too allows us to reject the null hypothesis.
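Since the calculator does not show the exact \(p-value\), here is a hedged sketch of how one could compute it directly. It conditions on the table margins and uses the hypergeometric distribution; the two-sided \(p-value\) sums the probabilities of all tables that are no more likely than the observed one, which is a common convention but an assumption here, since the text does not say which definition the calculator uses.

```python
from math import comb

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value for a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

titanic = [[127, 339],   # female: died, survived
           [682, 161]]   # male:   died, survived

print(fisher_exact_2x2(titanic))
```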

11.2.4 Example 2

Suicide rates tend to be very high in India, with the causes ranging from indebtedness to a sense of shame for having failed an exam, divorce, and other life course events. Farmers tend to be one of the more vulnerable groups with, by some accounts, some 60,000 farmers having committed suicide over the past three decades due to rising temperatures and the resultant stress on India’s agricultural sector. We have access to data on suicides over the 2001-2012 period, and this data-set contains information on 3,193 of 19,799 suicides that occurred in 2012.

The question of interest is whether men and women are just as likely to commit suicide for the following two causes: (i) Fall in Social Reputation, and (ii) Dowry Dispute.

\[\begin{array}{l} H_0: \text{ There is no association between the cause and the suicide victim's sex} \\ H_1: \text{ There is an association between the cause and the suicide victim's sex} \end{array}\]

The contingency table and bar-chart are shown below:

TABLE 11.2: Suicide Cause by Sex (India, 2012)

| | Dowry Dispute | Fall in Social Reputation | Total |
|---|---|---|---|
| Female | 47 | 50 | 97 |
| Male | 14 | 79 | 93 |
| Total | 61 | 129 | 190 |

FIGURE 11.5: Cause of Suicide by Sex

Running the \(\chi^2\) test yields \(\chi^2_{df = 1} = 22.79\) with a \(p-value = 0.000001807\) and so we can reject the null hypothesis; these data suggest that cause of suicide and the victim’s sex are not independent.

11.3 Measuring the Strength of the Association

Merely rejecting or failing to reject a null hypothesis is rarely the ultimate goal of any analysis. Say you reject the null hypothesis of a passenger’s survival being independent of their sex. Wouldn’t you at least want to know how strong the association between these two variables is? You often do, and we can answer questions about the strength of the association between categorical variables (be they nominal or ordinal) by way of some statistics. In this section we will look not only at our options but also at which measure of association should be used and when.

Say our contingency table is made up of two nominal variables, one being support for abortion and the other being the respondent’s educational attainment level. The data are mapped for you in the table that follows:

TABLE 11.3: Educational Attainment and Support for Abortion

| | Less than High School | High School or More | Total |
|---|---|---|---|
| No Support (n) | NA | NA | 555.0 |
| No Support (%) | NA | NA | 57.1 |
| Yes Support (n) | NA | NA | 417.0 |
| Yes Support (%) | NA | NA | 42.9 |
| Total (n) | NA | NA | 972.0 |
| Total (%) | NA | NA | 100.0 |

Having rejected the null hypothesis our interest now is in being able to predict support for abortion from educational attainment levels because we suspect that support increases with education. In the language of data analysis, support for abortion would be called the dependent variable while educational attainment would be labeled the independent variable. You only see row totals of 555 and 417; i.e., the totals against and for abortion, respectively. You pick one person at random. Is this person most likely to support abortion or most likely not to support abortion? Well, the only thing you can do is look at the modal response, which was 555 individuals indicating no support for abortion. So your best bet would be to expect the individual you drew at random to echo no support for abortion.

Now, what if you were provided the missing information, the missing cell frequencies?

TABLE 11.4: Educational Attainment and Support for Abortion

| | Less than High School | High School or More | Total |
|---|---|---|---|
| No Support (n) | 399.0 | 156.0 | 555.0 |
| No Support (%) | 64.3 | 44.4 | 57.1 |
| Yes Support (n) | 222.0 | 195.0 | 417.0 |
| Yes Support (%) | 35.7 | 55.6 | 42.9 |
| Total (n) | 621.0 | 351.0 | 972.0 |
| Total (%) | 100.0 | 100.0 | 100.0 |

Now, a few things would become readily apparent. First, working with only the modal response you would have incorrectly predicted no support for abortion for 417 individuals. The percent correctly predicted would thus have been \(\dfrac{555}{972} \times 100 = 57.1\). Second, if you took educational attainment into account, how many errors would you make? You might have predicted that everyone with High School or More supports abortion, the modal response in that column, but only \(195\) do so, leading you to make \(156\) errors here. Similarly, you would have predicted that everyone with Less than High School opposes abortion but only \(399\) do so, leading you to make \(222\) errors here. Total errors when taking education (the independent variable) into account would then sum to \(156 + 222 = 378\).

These errors are fewer in number – \(378\) versus \(417\) – than when you lacked any information on the breakdown of support by education. In a way, then, you have reduced prediction error by folding in information on an individual’s educational attainment. This predictive leverage is what lies behind the notion of proportional reduction in error (PRE), with \(0 \leq PRE \leq 1\); the closer PRE is to \(1\), the greater the reduction in prediction error. Technically, PRE is calculated as follows:

  1. Set \(E_1 = n -\) Modal frequency \(= 972 - 555 = 417\). This is the number of prediction errors made by ignoring an independent variable.
  2. \(E_2 = (n_{column_i} - mode_{column_i}) + (n_{column_j} - mode_{column_j})\), \(\therefore E_2 = (621 - 399) + (351 - 195) = 222 + 156 = 378\). This is the number of prediction errors made by taking an independent variable into account.

Now calculate \(PRE = \dfrac{E_1 - E_2}{E_1} = \dfrac{417 - 378}{417} = \dfrac{39}{417} = 0.09\). We improved our predictive ability by 9% when using educational attainment as compared to if we ignored or did not have access to information about an individual’s educational attainment.
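The two steps above translate directly into code. This is a small sketch that assumes the table is laid out as in Table 11.4, with the dependent variable in the rows and the independent variable in the columns; the function name is hypothetical.

```python
def proportional_reduction_in_error(table):
    """Return (E1, E2, PRE) for a table with the dependent variable in the rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    columns = list(zip(*table))
    e1 = n - max(row_totals)                           # errors when ignoring the independent variable
    e2 = sum(sum(col) - max(col) for col in columns)   # errors when predicting each column's mode
    return e1, e2, (e1 - e2) / e1

# Cell counts from Table 11.4
abortion_by_education = [[399, 156],   # No support:  Less than High School, High School or More
                         [222, 195]]   # Yes support: Less than High School, High School or More

e1, e2, pre = proportional_reduction_in_error(abortion_by_education)
print(e1, e2, round(pre, 4))
```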

There are several PRE measures that could be used, depending upon the nature of the variables that comprise the contingency table. Let us see each of these in turn.

11.3.1 Goodman-Kruskal Lambda \((\lambda)\)

\(\lambda = \dfrac{E_1 - E_2}{E_1}\) was used in the preceding example that explained the concept of proportional reduction in error. It is an asymmetrical measure of association in that its value will differ depending upon which variable is used as the dependent variable. For example, consider the following two tables, built with the same data except that one uses violence as the dependent variable while the other uses assailant’s status as the dependent variable.

In the first table,

\[\begin{array}{l} \lambda = \dfrac{E_1 - E_2}{E_1} \\ E_1 = n - \text{ Modal frequency of the dependent variable} \\ \therefore E_1 = 9,898,980 - 8,264,320 = 1,634,660 \\ E_2 = (5,045,040 - 3,992,090) + (4,853,940 - 4,272,230) \\ \therefore E_2 = 1,052,950 + 581,710 = 1,634,660 \\ \lambda = \dfrac{E_1 - E_2}{E_1} = \dfrac{1,634,660 - 1,634,660}{1,634,660} = 0 \end{array}\]

In the second table,

\[\begin{array}{l} \lambda = \dfrac{E_1 - E_2}{E_1} \\ E_1 = n - \text{ Modal frequency of the dependent variable} \\ \therefore E_1 = 9,898,980 - 5,045,040 = 4,853,940 \\ E_2 = (472,760 - 350,670) + (1,161,900 - 930,860) + (8,264,320 - 4,272,230) \\ \therefore E_2 = (122,090 + 231,040 + 3,992,090) = 4,345,220 \\ \lambda = \dfrac{E_1 - E_2}{E_1} = \dfrac{4,853,940 - 4,345,220}{4,853,940} = 0.1048 \end{array}\]

Conventionally, the dependent variable is always the row variable and you should follow this rule.

11.3.2 Phi \((\phi)\) Coefficient

If both variables are nominal and you have a \(2 \times 2\) contingency table, then we use the phi coefficient to measure the strength of the association between the two variables.

\(\phi = \sqrt{\dfrac{\chi^2}{n}}\)

Technically, if the table is as follows,

TABLE 11.5: Calculating phi

| | Alive | Dead |
|---|---|---|
| Drug A | a | b |
| Drug B | c | d |

then \(\phi = \dfrac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}\), with \(df=(r-1)(c-1)\) for the underlying \(\chi^2\) test.

11.3.3 Cramer’s \(V\) and Contingency \(C\)

Cramer’s \(V\) is used when both variables are nominal but at least one has more than \(2\) categories. Cramer’s V \(= \sqrt{\dfrac{\chi^2}{n \times m}}\) where \(m\) is the smaller of \((r-1)\) or \((c-1)\).

Contingency Coefficient \(C = \sqrt{ \dfrac{\chi^{2}}{\chi^{2} + n} }\) is recommended for \(5 \times 5\) tables since it appears to underestimate the strength of the association in smaller tables.

Both \(C\) and \(V\) will fall in the \(\left[0,1\right]\) range, i.e., \(0 \leq V \leq 1\) and \(0 \leq C \leq 1\).
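To make the three \(\chi^2\)-based measures concrete, the sketch below computes \(\phi\), Cramer's \(V\), and the contingency coefficient for the Titanic table (Table 11.1). It uses the uncorrected Pearson \(\chi^2\), which is an assumption since the text does not say whether a continuity correction should be applied here, and the standard definition \(C = \sqrt{\chi^2 / (\chi^2 + n)}\). Function names are illustrative.

```python
from math import sqrt

def pearson_chi2(table):
    """Pearson chi-square (no continuity correction) for an r x c table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def association_measures(table):
    """Return (phi, Cramer's V, contingency coefficient C)."""
    stat, n = pearson_chi2(table)
    m = min(len(table) - 1, len(table[0]) - 1)   # smaller of (r - 1) and (c - 1)
    phi = sqrt(stat / n)                         # intended for 2 x 2 tables
    cramers_v = sqrt(stat / (n * m))
    contingency_c = sqrt(stat / (stat + n))
    return phi, cramers_v, contingency_c

titanic = [[127, 339],   # female: died, survived
           [682, 161]]   # male:   died, survived

phi, cramers_v, contingency_c = association_measures(titanic)
print(round(phi, 4), round(cramers_v, 4), round(contingency_c, 4))
```

For a \(2 \times 2\) table \(m = 1\), so \(V\) and \(\phi\) coincide; note that \(\sqrt{\chi^2/n}\) drops the sign that the \(ad - bc\) formula carries.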

Reference tables are available to classify the strength of the association:

  1. For \(df=1\),
    • Small effect if \(0.10 < V \leq 0.30\)
    • Medium effect if \(0.30 < V \leq 0.50\)
    • Large effect if \(V > 0.50\)
  2. For \(df=2\),
    • Small effect if \(0.07 < V \leq 0.21\)
    • Medium effect if \(0.21 < V \leq 0.35\)
    • Large effect if \(V > 0.35\)
  3. For \(df=3\),
    • Small effect if \(0.06 < V \leq 0.17\)
    • Medium effect if \(0.17 < V \leq 0.29\)
    • Large effect if \(V > 0.29\)

11.3.4 Goodman-Kruskal Gamma \((\gamma)\)

What if we have two ordinal variables, as is the case in the contingency table shown below?

TABLE 11.6: Two Ordinal Variables

| | High School or More | Less than High School | Total |
|---|---|---|---|
| High Financial Satisfaction | 1194 | 193 | 1387 |
| Low Financial Satisfaction | 477 | 147 | 624 |
| Total | 1671 | 340 | 2011 |

There are four pairs in the table – High education and High financial satisfaction, High education and Low financial satisfaction, Low education and High financial satisfaction, and Low education and Low financial satisfaction. Let us call these pairs High-High, High-Low, Low-High, and Low-Low.

The research question in play is whether education has any impact on financial satisfaction. If it does so perfectly, without any error, then we should see all those high on education to also be high on financial satisfaction, and likewise all those low on education to also be low on financial satisfaction. However, that is not the case; we do see High-Low and Low-High pairs with non-zero frequencies!

One way to evaluate the association between the two variables might be to count how many concordant pairs there are (High-High and Low-Low) versus discordant pairs (High-Low, Low-High). Let us label the number of concordant pairs \(N_s\) and the number of discordant pairs \(N_d\). Then, the total number of concordant pairs possible would be given by the number High-High \(\times\) the number Low-Low. Similarly, the total number of discordant pairs would be given by the number High-Low \(\times\) the number Low-High. That is, for the given table, we would calculate

\[\begin{array}{l} N_s = 1194 \times 147 = 175,518 \\ N_d = 193 \times 477 = 92,061 \end{array}\]

If most of the pairs possible are concordant, then we could calculate a ratio of the difference between the number of concordant and discordant pairs to the total number of pairs (concordant + discordant). This is the measure of association called Goodman-Kruskal’s gamma \((\gamma)\) where

\[\gamma = \dfrac{N_s - N_d}{N_s + N_d}\]

such that

\[\begin{array}{l} N_s > N_d, \gamma \to +1 \\ N_s < N_d, \gamma \to -1 \\ N_s = N_d, \gamma = 0 \end{array}\]

In the present example, we have \(\gamma = \dfrac{N_s - N_d}{N_s + N_d} = \dfrac{175,518 - 92,061}{175,518 + 92,061}=0.31\), indicating a moderate impact of education on financial satisfaction.

Note, before we move on, that if \(N_s = 100, N_d = 0, \gamma = \dfrac{N_s - N_d}{N_s + N_d} = \dfrac{100 - 0}{100 + 0} = 1\) and if \(N_s = 0, N_d = 100, \gamma = \dfrac{N_s - N_d}{N_s + N_d} = \dfrac{0 - 100}{0 + 100} = -1\)
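The calculation for Table 11.6 is short enough to check in code. This sketch assumes a \(2 \times 2\) table whose rows and columns are both ordered from high to low, as in the table above; the function name is illustrative.

```python
def gamma_2x2(table):
    """Goodman-Kruskal gamma for a 2 x 2 table with rows and columns ordered high to low."""
    (high_high, high_low), (low_high, low_low) = table
    n_s = high_high * low_low    # concordant pairs
    n_d = high_low * low_high    # discordant pairs
    return n_s, n_d, (n_s - n_d) / (n_s + n_d)

# Rows: High, Low financial satisfaction; columns: High School or More, Less than High School
satisfaction_by_education = [[1194, 193],
                             [477, 147]]

n_s, n_d, gamma = gamma_2x2(satisfaction_by_education)
print(n_s, n_d, round(gamma, 2))
```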

11.3.5 Kendall’s \((\tau_b)\)

Unfortunately, however, \(\gamma\) ignores what are called tied pairs. Say what?

TABLE 11.7: Tied Dependent and Independent Pairs

| | High School or More | Less than High School | Total |
|---|---|---|---|
| High Financial Satisfaction | 1194 | 193 | 1387 |
| Low Financial Satisfaction | 477 | 147 | 624 |
| Total | 1671 | 340 | 2011 |

With tied dependent pairs \((T_y)\), we have individuals with the same value of the dependent variable but different values of the independent variable. Here, 1,194 and 193 are tied on the dependent variable value of High while 477 and 147 are tied on the dependent variable value of Low. With tied independent pairs \((T_x)\), we have individuals with the same value of the independent variable but different values of the dependent variable. Here, 1194 and 477 are tied on the independent variable value of High while 193 and 147 are tied on the independent variable value of Low.

\[\begin{array}{l} T_y = (1,194 \times 193) + (477 \times 147) = 230,442 + 70,119= 300,561 \\ T_x = (1,194 \times 477) + (193 \times 147) = 569,538 + 28,371 = 597,909 \end{array}\]

Then, we can calculate \(\gamma\)'s replacement: \(\tau_b = \dfrac{N_s - N_d}{\sqrt{(N_s + N_d + T_y)(N_s + N_d + T_x)}} = \dfrac{83,457}{\sqrt{568,140 \times 865,488}} = \dfrac{83,457}{701,226} = 0.12\), indicative of a weak positive association. This estimate is much smaller than what we had for \(\gamma\) but that should be no surprise given that \(\tau_b\) will, as a rule, be smaller in magnitude than \(\gamma\) because \(\tau_b\) takes all tied pairs into account whereas \(\gamma\) does not.

\(\tau_b\) is best used with square tables: \(2 \times 2\), \(3 \times 3\), and so on.
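The tie counts and \(\tau_b\) for Table 11.7 can be verified with the following sketch. It is written for the \(2 \times 2\) case only, with the dependent variable in the rows and the independent variable in the columns; the function name is illustrative.

```python
from math import sqrt

def kendall_tau_b_2x2(table):
    """Kendall's tau-b for a 2 x 2 table; rows = dependent, columns = independent."""
    (a, b), (c, d) = table
    n_s = a * d            # concordant pairs
    n_d = b * c            # discordant pairs
    t_y = a * b + c * d    # tied on the dependent variable (same row)
    t_x = a * c + b * d    # tied on the independent variable (same column)
    tau_b = (n_s - n_d) / sqrt((n_s + n_d + t_y) * (n_s + n_d + t_x))
    return t_y, t_x, tau_b

satisfaction_by_education = [[1194, 193],
                             [477, 147]]

t_y, t_x, tau_b = kendall_tau_b_2x2(satisfaction_by_education)
print(t_y, t_x, round(tau_b, 2))
```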

11.3.6 Kendall’s \((\tau_c)\)

This measure can be used instead of \(\tau_b\) when you have tables that are not square: \(2 \times 3\), \(3 \times 4\), and so on. Specifically, \(\tau_c = \left(N_s - N_d\right) \times \left[2m / (n^{2} (m - 1) )\right]\), where \(m\) is the number of rows or columns, whichever is smaller, and \(n\) is sample size. All properties that hold for \(\tau_b\) apply to \(\tau_c\) as well.
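As a rough sketch of the formula, the code below counts concordant and discordant pairs for a table of any size and then applies the \(\tau_c\) expression. It assumes rows and columns are ordered the same way (for example, both high to low). The chapter provides no non-square table, so the function is demonstrated on the \(2 \times 2\) education and financial satisfaction table purely to show the mechanics; the function names are illustrative.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in an r x c table whose rows and
    columns are ordered in the same direction."""
    n_s = n_d = 0
    rows, cols = len(table), len(table[0])
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):      # compare only with lower rows to avoid double counting
                for l in range(cols):
                    if l > j:                 # both variables move the same way
                        n_s += table[i][j] * table[k][l]
                    elif l < j:               # the variables move in opposite ways
                        n_d += table[i][j] * table[k][l]
    return n_s, n_d

def kendall_tau_c(table):
    """Kendall's tau-c: (Ns - Nd) * 2m / (n^2 * (m - 1))."""
    n = sum(sum(row) for row in table)
    m = min(len(table), len(table[0]))        # smaller of the number of rows and columns
    n_s, n_d = concordant_discordant(table)
    return (n_s - n_d) * 2 * m / (n**2 * (m - 1))

satisfaction_by_education = [[1194, 193],
                             [477, 147]]

print(concordant_discordant(satisfaction_by_education))
print(round(kendall_tau_c(satisfaction_by_education), 4))
```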

11.3.7 Somers’ \(D\)

\(\gamma, \tau_b, \tau_c\) are all symmetric measures in that it does not matter which variable is the independent variable and which is the dependent variable. Indeed, you may have no a priori idea of what your dependent variable is when using these and the resulting estimates would be fine. However, when you have a priori hypotheses about a particular variable being the dependent variable of interest, then you should use Somers’ \(D\).

\[D_{yx} = \dfrac{N_s - N_d}{N_s + N_d + T_y}\]

Again, because this measure adjusts for tied dependent pairs its value will be less than \(\gamma\). In the ongoing example of education and financial satisfaction,

\[D_{yx} = \dfrac{N_s - N_d}{N_s + N_d + T_y} = \dfrac{175,518 - 92,061}{175,518 + 92,061 + 300,561} = \dfrac{83,457}{568,140} = 0.1468951\]

Somers’ \(D\) will work with square and non-square tables of any size.
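The asymmetry of this measure is easy to see numerically. The sketch below computes \(D_{yx}\) as defined above for the \(2 \times 2\) education and financial satisfaction table, and also a mirrored version, labeled \(D_{xy}\) here, that swaps in \(T_x\) for \(T_y\). The mirrored formula is not given in the text; it is included as an assumption to illustrate that the value changes when the roles of the two variables are reversed.

```python
def somers_d_2x2(table):
    """Return (D_yx, D_xy) for a 2 x 2 table; rows = dependent (y), columns = independent (x)."""
    (a, b), (c, d) = table
    n_s = a * d            # concordant pairs
    n_d = b * c            # discordant pairs
    t_y = a * b + c * d    # tied on the dependent variable (same row)
    t_x = a * c + b * d    # tied on the independent variable (same column)
    d_yx = (n_s - n_d) / (n_s + n_d + t_y)
    d_xy = (n_s - n_d) / (n_s + n_d + t_x)   # assumed mirror image with the roles of x and y swapped
    return d_yx, d_xy

satisfaction_by_education = [[1194, 193],
                             [477, 147]]

d_yx, d_xy = somers_d_2x2(satisfaction_by_education)
print(round(d_yx, 4), round(d_xy, 4))
```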

11.4 Practice Problems

Problem 1

Radioactive waste, John Wayne, and Hollywood. Yup, not making that up, read it for yourself! Turns out that John Wayne starred in a movie (The Conqueror) that was shot in St. George (Utah) in 1954, on a site 137 miles downwind of a nuclear testing site in Yucca Flat (Nevada). As it happens, by the early 1980s some 91 of the 220 cast and crew had been diagnosed with cancer of one form or another. According to epidemiological data available to us, only 14% of the population this group of 220 represented should have been diagnosed with cancer between 1954 and the early 1980s. Was a cast or crew member more likely to get cancer because of exposure to radioactive waste while working on the film?

Problem 2

Suicides and the holiday season, a common myth implies, go hand in hand. Similarly, “articles in the medical literature and lay press have supported a belief that individuals, including those dying of cancer, can temporarily postpone their death to survive a major holiday or other significant event, but results and effects have been variable” (Young and Hade 2004). To study this latter question, whether death does indeed take a holiday or not, the authors looked at 12,028 cancer deaths that occurred in Ohio between 1989 and 2000. Of these deaths, 6,052 occurred in the week before Christmas while the rest occurred in the week after Christmas. Do these data suggest that people are really able to postpone their passing until after the holiday season?

Problem 3

Young and Hade (2004) also looked at the breakdown of these cancer deaths by the deceased’s sex, finding that of the 6,252 men, 3,192 died in the week before Christmas while of the 5,776 women, 2,858 died in the week before Christmas. Do these data suggest a difference in the ability of men and women to control the timing of their passing?

Problem 4

Does a driver’s race have any impact on the probability that they will be stopped more often than a driver of another race? When stopped, what is the “hit rate” for each group – the probability that the stop actually results in some contraband being found on the driver or in the vehicle? If the hit rate is lower for minority drivers, that is typically used as an “outcome test” to argue that minority drivers are pulled over without good cause more often than are non-minority drivers. This isn’t foolproof evidence though because of the problem of “infra-marginality”, as has been argued here.

Let us look at Rhode Island’s data, provided to us courtesy of the Stanford Open Policing Project 2017. You can download the full data-set from here.

Once you download the data I want you to create a simple dummy variable that groups the drivers into one of two mutually exclusive groups – Minority \(=1\) if the driver’s race is “Black” or “Hispanic” and Minority \(=0\) if the driver’s race is recorded as “White”. Then calculate the search rates and hit rates for each group. The result should be a table such as the following:

| Group | Stops | Searches | Hits | Search Rate | Hit Rate |
|---|---|---|---|---|---|
| Minority | 69883 | 3533 | 1475 | 0.05055593 | 0.4174922 |
| Non-Minority | 159808 | 3759 | 1828 | 0.02352198 | 0.4862995 |

Now answer the following questions:

  1. Are search rates (i.e., number of searches per number of stops) independent of a driver’s minority status?
  2. Are hit rates (i.e., number of hits per number of searches) independent of a driver’s minority status?
  3. When you combine the information about search rates with information on hit rates, what story emerges? Does this suggest a possible pattern of discrimination?

Problem 5

Nationally, some 15.9% of all first-time enrollees in a two- or four-year program at a college/university are first-generation students (students who are the first in their family to study beyond high school). A major public university in the Midwest claims that of the 6,000 students who enrolled in the Fall of 2017, some 22.7% were first-generation students. Is this university’s rate significantly different from the national average?

Problem 6

The Youth Risk Behavior Surveillance System (YRBSS) was developed in 1990 to monitor priority health risk behaviors that contribute markedly to the leading causes of death, disability, and social problems among youth and adults in the United States. These behaviors, often established during childhood and early adolescence, include (i) Behaviors that contribute to unintentional injuries and violence; (ii) Sexual behaviors related to unintended pregnancy and sexually transmitted infections, including HIV infection; (iii) Alcohol and other drug use; (iv) Tobacco use; (v) Unhealthy dietary behaviors; and (vi) Inadequate physical activity. In addition, the YRBSS monitors the prevalence of obesity and asthma and other priority health-related behaviors plus sexual identity and sex of sexual contacts. From 1991 through 2015, the YRBSS has collected data from more than 3.8 million high school students in 1,700+ separate surveys.

The problems that follow rely upon the YRBSS 2015 data and the documentation for the data-set can be found here. Read the documentation carefully, in particular, the details of the survey questions. Then answer the following questions:

  1. Is there an association between drinking and driving (Q11) and texting and driving (Q12)?
  2. How strong is the relationship between these two variables?
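For questions of this kind, the test statistic and a measure of strength can be computed directly from a cross-tabulation. The sketch below uses a made-up \(2 \times 3\) table, not YRBSS counts, and reports Cramér’s \(V\) as one common measure of association; treat that choice, and the function name `chi_square_and_cramers_v`, as our assumptions. It returns only the statistic, the degrees of freedom, and \(V\); the p-value can then be read from a \(\chi^2\) table or obtained from statistical software.

```python
import math

def chi_square_and_cramers_v(table):
    """Pearson chi-square statistic, degrees of freedom, and Cramer's V
    for an r x c table of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    n_rows, n_cols = len(table), len(col_totals)
    df = (n_rows - 1) * (n_cols - 1)
    v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return chi2, df, v

# Hypothetical counts for illustration only.
observed = [
    [30, 20, 10],
    [20, 30, 40],
]
chi2, df, v = chi_square_and_cramers_v(observed)
print(f"chi-square = {chi2:.3f}, df = {df}, Cramer's V = {v:.3f}")
```

Before applying this to survey data, drop or recode missing responses and check that the expected counts are not too small for the chi-square approximation.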

Problem 7

The General Social Survey (GSS) gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years. The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events. Altogether the GSS is the single best source for sociological and attitudinal trend data covering the United States. It allows researchers to examine the structure and functioning of society in general as well as the role played by relevant subgroups and to compare the United States to other nations.

Using the 2016 GSS data, (a) test whether educational attainment (coldeg1) is related to confidence in the scientific community (consci), and (b) assess the strength of this relationship.

Problem 8

In 1984, the Centers for Disease Control and Prevention (CDC) initiated the state-based Behavioral Risk Factor Surveillance System (BRFSS), a cross-sectional telephone survey that state health departments conduct monthly over landline telephones and cellular telephones with a standardized questionnaire and technical and methodologic assistance from CDC. BRFSS is used to collect prevalence data among adult U.S. residents regarding their risk behaviors and preventive health practices that can affect their health status. Respondent data are forwarded to CDC to be aggregated for each state, returned with standard tabulations, and published at year’s end by each state. In 2011, more than 500,000 interviews were conducted in the states, the District of Columbia, and participating U.S. territories and other geographic areas.

The 2016 BRFSS data-set for Ohio is available here and the accompanying codebook is here. Using these data, answer the following questions:

  1. Test for a possible relationship between level of education completed (v_educag) and the four categories of body mass index (v_bmi5ca). How strong is this relationship?
  2. Test for a possible relationship between income (v_incomg) and the four categories of body mass index (v_bmi5ca). How strong is this relationship?
  3. Use an appropriate chart to reflect the relationships you tested in (1) and (2).