Descriptive Statistics

We can now start looking at some descriptive statistics – the usual mean, median, minimum, maximum, standard deviation, variance, etc. We won’t go into the statistical theory underlying these estimates since we covered this last semester. Let us start easy, by loading some data and then seeing some of the functions that give us summaries of our data.

The summary() Function

Let us load the hsb2.RData we read and saved earlier. If we want to look at its contents and get a quick feel for the distributions of each variable we can do so via the summary() function.

load("~/Downloads/hsb2.RData")

summary(hsb2)
##        id             female                  race          ses         schtyp             prog
##  Min.   :  1.00   Male  : 91   Hispanic        : 24   Low   :47   Public :168   General   : 45
##  1st Qu.: 50.75   Female:109   Asian           : 11   Middle:95   Private: 32   Academic  :105
##  Median :100.50                African American: 20   High  :58                 Vocational: 50
##  Mean   :100.50                White           :145
##  3rd Qu.:150.25
##  Max.   :200.00
##
##       read            write            math           science          socst
##  Min.   :28.00   Min.   :31.00   Min.   :33.00   Min.   :26.00   Min.   :26.00
##  1st Qu.:44.00   1st Qu.:45.75   1st Qu.:45.00   1st Qu.:44.00   1st Qu.:46.00
##  Median :50.00   Median :54.00   Median :52.00   Median :53.00   Median :52.00
##  Mean   :52.23   Mean   :52.77   Mean   :52.65   Mean   :51.85   Mean   :52.41
##  3rd Qu.:60.00   3rd Qu.:60.00   3rd Qu.:59.00   3rd Qu.:58.00   3rd Qu.:61.00
##  Max.   :76.00   Max.   :67.00   Max.   :75.00   Max.   :74.00   Max.   :71.00

Note how you see each variable along with some key statistics. You don’t see the standard deviation or variance listed for the numeric variables but these are easily calculated.

We could compute each of these estimates one variable at a time, but that quickly becomes tedious. Instead, we will rely on some R packages to build tables of these values. Before we do so, however, let us look at some of the functions we will use quite often. The code below uses a generic data frame (df) and a generic variable (x). You will have to replace df with whatever you have called the data frame and replace x with the actual name of the variable.

  • mean(df$x): the mean
  • sd(df$x): the standard deviation
  • var(df$x): the variance
  • min(df$x): the minimum
  • max(df$x): the maximum
  • quantile(df$x, c(0.25, 0.50, 0.75)): the first quartile, the median, the third quartile
  • IQR(df$x): the interquartile range
  • sum(df$x): the total of the values of variable x
  • scale(df$x): the z-score of variable x
  • cor(df$x1, df$x2): the correlation between x1 and x2

sd(hsb2$read)
## [1] 10.25294
var(hsb2$read)
## [1] 105.1227
quantile(hsb2$math, c(0.25, 0.50, 0.75))
## 25% 50% 75% 
##  45  52  59
IQR(hsb2$math)
## [1] 14
cor(hsb2$math, hsb2$science)
## [1] 0.6307332
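
Several of these functions are related in ways that are easy to verify. The sketch below is a minimal check, not part of the original analysis. It uses R's built-in mtcars data as a stand-in for hsb2 (an assumption made only so that the code runs without the course file) and confirms that the IQR is the third quartile minus the first, that scale() matches the z-score computed by hand, and that the variance is the squared standard deviation.

```r
# Stand-in data (assumption): mtcars ships with R; swap in hsb2$math, etc.
x <- mtcars$mpg

# Quartiles: the 50% quantile is the median.
q <- quantile(x, c(0.25, 0.50, 0.75))
print(q)

# IQR() is the distance between the third and first quartiles.
iqr_by_hand <- unname(q["75%"] - q["25%"])
print(all.equal(iqr_by_hand, IQR(x)))

# scale() returns a one-column matrix; as.numeric() flattens it to a vector.
z <- as.numeric(scale(x))
z_by_hand <- (x - mean(x)) / sd(x)
print(all.equal(z, z_by_hand))

# The variance is the square of the standard deviation.
print(all.equal(var(x), sd(x)^2))
```

Because scale() returns a matrix rather than a plain vector, wrapping it in as.numeric() is usually the safer choice when you want to store the z-scores as a new column.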

Using data.table

There are a number of ways to generate tables of descriptive statistics for numeric variables in a data frame. One of the more promising is the data.table package. On the web you can find many examples of how to accomplish particular tasks with it, but for now we will focus on generating simple tables of aggregates such as means and standard deviations. The commands that follow also use two other packages, knitr and printr, to dress up the tables. As such, you will see each table created in two steps: first generating the table we want and giving it a name (table.1, table.2, etc.), and then dressing it up via the kable() command, which comes from knitr.

library(data.table)
library(knitr)  # provides kable()
DT = data.table(hsb2)

table.1 = DT[, list(Mean.Reading = mean(read), Md.Reading = median(read), 
    SD.Reading = sd(read), Mean.Writing = mean(write), Md.Writing = median(write), 
    SD.Writing = sd(write), Mean.Math = mean(math), Md.Math = median(math), 
    SD.Math = sd(math))]

kable(table.1, digits = 2, booktabs = TRUE, caption = "Table 1: Descriptive Statistics")

Table 1: Descriptive Statistics

Mean.Reading  Md.Reading  SD.Reading  Mean.Writing  Md.Writing  SD.Writing  Mean.Math  Md.Math  SD.Math
       52.23          50       10.25         52.77          54        9.48      52.65       52     9.37
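
Typing out every statistic for every variable, as in table.1, gets repetitive as the number of variables grows. The following is a minimal base R sketch of one alternative, not the data.table approach used in this section: sapply() applies the same small function to each selected column. The built-in mtcars data and the variable names in vars are stand-ins (assumptions) so that the code runs without hsb2.

```r
# Stand-in data (assumption): mtcars ships with R; swap in hsb2 and its columns.
df <- mtcars
vars <- c("mpg", "hp", "wt")  # hypothetical choice of numeric variables

# Apply the same three summaries to every selected column.
stats <- sapply(df[vars], function(x) {
  c(Mean = mean(x), Median = median(x), SD = sd(x))
})

# Transpose so that each variable is a row and each statistic a column.
table.alt <- round(t(stats), 2)
print(table.alt)
```

The result is a matrix with one row per variable, which can then be passed to kable() in the same way as the tables in this section.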

If we want these estimates for, say, Male versus Female students, and then perhaps for Male versus Female students in Public versus Private schools, we can do so via:

table.2 = DT[, list(Mean.Reading = mean(read), SD.Reading = sd(read), 
    Mean.Writing = mean(write), SD.Writing = sd(write), Mean.Math = mean(math), 
    SD.Math = sd(math)), by = "female"]

kable(table.2, digits = 2, booktabs = TRUE, caption = "Table 2: Descriptive Statistics (by Gender)")

Table 2: Descriptive Statistics (by Gender)

female  Mean.Reading  SD.Reading  Mean.Writing  SD.Writing  Mean.Math  SD.Math
Male           52.82       10.51         50.12       10.31      52.95     9.66
Female         51.73       10.06         54.99        8.13      52.39     9.15

table.3 = DT[, list(Mean.Reading = mean(read), SD.Reading = sd(read), 
    Mean.Writing = mean(write), SD.Writing = sd(write), Mean.Math = mean(math), 
    SD.Math = sd(math)), by = list(female, schtyp)]

kable(table.3, digits = 2, booktabs = TRUE, caption = "Table 3: Descriptive Statistics (by Gender & School-type)")

Table 3: Descriptive Statistics (by Gender & School-type)

female  schtyp   Mean.Reading  SD.Reading  Mean.Writing  SD.Writing  Mean.Math  SD.Math
Male    Public          52.35       10.81         49.36       10.54      52.31     9.57
Female  Public          51.42       10.12         54.69        8.41      52.19     9.37
Male    Private         55.43        8.50         54.29        8.00      56.43     9.81
Female  Private         53.33        9.85         56.50        6.54      53.44     8.13
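
If you want to cross-check grouped results like these without data.table, base R's aggregate() can produce the same kind of table. The sketch below is only an illustration of that swapped-in technique. It uses the built-in mtcars data as a stand-in for hsb2 (an assumption), and mean_sd is a hypothetical helper name introduced here.

```r
# Stand-in data (assumption): mtcars ships with R; swap in hsb2 and its factors.
df <- mtcars
df$am <- factor(df$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
df$vs <- factor(df$vs, levels = c(0, 1), labels = c("V-shaped", "Straight"))

# Hypothetical helper: return both statistics at once.
mean_sd <- function(x) c(Mean = mean(x), SD = sd(x))

# One grouping variable. aggregate() stores the two statistics as a
# matrix column, so do.call(data.frame, ...) flattens them into columns.
by_am <- do.call(data.frame, aggregate(mpg ~ am, data = df, FUN = mean_sd))
print(by_am)

# Two grouping variables: one row per combination that occurs in the data.
by_am_vs <- do.call(data.frame,
                    aggregate(mpg ~ am + vs, data = df, FUN = mean_sd))
print(by_am_vs)
```

The formula interface (outcome ~ group1 + group2) plays the same role as the by argument in the data.table calls above.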

data.table is also useful for collapsing a data frame by a select set of variables. This is often done when we want to calculate the mean or median for a specific group. Say, for example, we wanted the mean of all subject scores for Male versus Female students, or by Race. This could be achieved via:

table.4 = DT[, lapply(.SD, mean), by = female, .SDcols = c("read", 
    "write", "math", "science", "socst")]

table.5 = DT[, lapply(.SD, mean), by = race, .SDcols = c("read", 
    "write", "math", "science", "socst")]

table.6 = DT[, lapply(.SD, mean), .SDcols = c("read", "write", 
    "math", "science", "socst"), by = c("female", "race", "schtyp")]


kable(table.4, digits = 2, booktabs = TRUE, caption = "Table 4: Mean Scores (by Subject and Gender)")

Table 4: Mean Scores (by Subject and Gender)

female   read  write   math  science  socst
Male    52.82  50.12  52.95    53.23  51.79
Female  51.73  54.99  52.39    50.70  52.92

kable(table.5, digits = 2, booktabs = TRUE, caption = "Table 5: Mean Scores (by Subject and Race)")

Table 5: Mean Scores (by Subject and Race)

race               read  write   math  science  socst
White             53.92  54.06  53.97    54.20  53.68
African American  46.80  48.20  46.75    42.80  49.45
Hispanic          46.67  46.46  47.42    45.38  47.79
Asian             51.91  58.00  57.27    51.45  51.00

kable(table.6, digits = 2, booktabs = TRUE, caption = "Table 6: Mean Scores (by Subject, Gender, Race, and School Type)")

Table 6: Mean Scores (by Subject, Gender, Race, and School Type)

female  race              schtyp    read  write   math  science  socst
Male    White             Public   54.28  50.52  53.61    55.41  52.76
Female  White             Public   53.73  56.41  53.64    53.16  54.25
Male    African American  Public   46.86  47.00  45.29    46.71  49.00
Male    Hispanic          Public   47.31  44.38  49.23    45.54  45.69
Male    White             Private  55.43  54.29  56.43    55.29  56.00
Male    Asian             Public   52.33  55.67  58.67    53.00  47.67
Female  Hispanic          Public   41.33  47.00  43.44    43.56  48.44
Female  White             Private  51.77  56.92  54.46    53.15  52.23
Female  African American  Public   46.27  48.36  47.55    39.55  48.09
Female  African American  Private  49.50  51.50  47.50    47.00  58.50
Female  Hispanic          Private  66.50  57.50  53.50    52.50  58.50
Female  Asian             Public   51.29  58.86  57.43    52.14  51.71
Female  Asian             Private  55.00  59.00  52.00    42.00  56.00
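
Means for very small groups should be read with caution, and some of the groups in the last table are likely to be small: the summary() output earlier shows only 11 Asian and 20 African American students in total. A simple safeguard is to report the group size next to each mean. The sketch below uses data.table's .N symbol for this. It runs on the built-in mtcars data as a stand-in for hsb2, and the names DT.demo and table.n are hypothetical.

```r
library(data.table)

# Stand-in data (assumption): mtcars ships with R; swap in data.table(hsb2).
DT.demo <- data.table(mtcars)

# .N is data.table's built-in count of the rows in each group.
table.n <- DT.demo[, list(N = .N, Mean.mpg = mean(mpg), Mean.hp = mean(hp)),
                   by = list(cyl, am)]
print(table.n)
```

With hsb2, adding N = .N to the list() in any of the grouped tables above should show how many students each mean is based on.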

Frequency Tables

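With qualitative variables such as female, race, etc., we know we can best represent their distributions via frequency tables. These can be created very easily, and then dressed up a bit. The basic command for a frequency table of one variable is table(df$x), and for a cross-tabulation of two variables it is table(df$x1, df$x2). Neither creates row or column totals, so we add them with addmargins() and, for a single variable, convert the result with as.data.frame(addmargins(tab.x, FUN = Total)). The line Total = sum simply gives the sum() function a second name so that the added margin is labelled Total rather than the default Sum. The kable() command generates the final table when we knit the document.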

tab.6a = table(hsb2$ses)

Total = sum

tab.6b = as.data.frame(addmargins(tab.6a, FUN = Total))

colnames(tab.6b) = c("SES Category", "Frequency")

kable(tab.6b, booktabs = TRUE, caption = "Table 6: Frequency Table of SES")

Table 6: Frequency Table of SES

SES Category  Frequency
Low                  47
Middle               95
High                 58
Total               200

The table above (tab.6b) is a table of simple frequencies. What if we wanted a table of relative frequencies (as proportions or percentages)? We could build such a table as follows:

tab.6c = prop.table(tab.6a) * 100

tab.6d = as.data.frame(addmargins(tab.6c, FUN = Total, quiet = TRUE))

colnames(tab.6d) = c("SES Category", "Percent")

kable(tab.6d, booktabs = TRUE, caption = "Table 7: Relative Frequency Table of SES")

Table 7: Relative Frequency Table of SES

SES Category  Percent
Low              23.5
Middle           47.5
High             29.0
Total           100.0
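
It is often convenient to show counts and percentages side by side rather than in two separate tables. The following is a minimal sketch of one way to do that in base R. It uses the number of cylinders in the built-in mtcars data as a stand-in for hsb2$ses (an assumption), and the object names are hypothetical.

```r
# Stand-in data (assumption): cylinders in mtcars; swap in hsb2$ses.
counts <- table(mtcars$cyl)

# Percentages are kept unrounded until the end so the total is exactly 100.
pct <- 100 * as.vector(prop.table(counts))

freq <- data.frame(
  Category  = c(names(counts), "Total"),
  Frequency = c(as.vector(counts), sum(counts)),
  Percent   = round(c(pct, sum(pct)), 1)
)
print(freq)
```

Rounding only at the last step avoids a total such as 99.9 or 100.1 that can appear when already-rounded percentages are added up.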

These are simple frequency/relative frequency tables of a single variable. What if we wanted to cross-tabulate ses and schtyp to create a contingency table?

tab.6e = table(hsb2$ses, hsb2$schtyp)

tab.6f = addmargins(tab.6e, FUN = Total, quiet = TRUE)

kable(tab.6f, digits = 0, booktabs = TRUE, caption = "Table 8: Crosstabulation of SES & School-type")

Table 8: Crosstabulation of SES & School-type

        Public  Private  Total
Low         45        2     47
Middle      76       19     95
High        47       11     58
Total      168       32    200

The result is a cross-tabulation of frequencies. We can convert this into a table of relative frequencies by modifying the code used earlier. In particular, note the use of 1 in the prop.table() command and then the use of 2 in the addmargins() command. The 1 in prop.table() says to turn each frequency into a proportion by dividing it by its row total; multiplying by 100 then gives row percentages. The 2 in addmargins() says to add the total as an extra column, so that the percentages are summed across each row and every row total equals 100.

tab.6g = prop.table(tab.6e, 1) * 100

tab.6h = addmargins(tab.6g, 2, FUN = Total, quiet = TRUE)

kable(tab.6h, digits = 2, booktabs = TRUE, caption = "Table 9: Crosstabulation of SES & School-type (Row Percentages)")

Table 9: Crosstabulation of SES & School-type (Row Percentages)

        Public  Private  Total
Low      95.74     4.26    100
Middle   80.00    20.00    100
High     81.03    18.97    100

If we wanted column percentages instead, we would swap the two arguments, using 2 in prop.table() to divide by the column totals and 1 in addmargins() to add a Total row:

tab.6i = prop.table(tab.6e, 2) * 100

tab.6j = addmargins(tab.6i, 1, FUN = Total, quiet = TRUE)

kable(tab.6j, digits = 2, booktabs = TRUE, caption = "Table 10: Crosstabulation of SES & School-type (Column Percentages)")

Table 10: Crosstabulation of SES & School-type (Column Percentages)

        Public  Private
Low      26.79     6.25
Middle   45.24    59.38
High     27.98    34.38
Total   100.00   100.00
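
A quick way to confirm which margin you have used is to check which totals come out as 100. The sketch below rebuilds the SES by school-type counts shown in the cross-tabulation above by hand, so that it runs without hsb2; the object names tab, row_pct and col_pct are hypothetical.

```r
# Cell counts copied from the SES by school-type cross-tabulation above.
tab <- as.table(matrix(
  c(45, 76, 47,   # Public:  Low, Middle, High
     2, 19, 11),  # Private: Low, Middle, High
  nrow = 3,
  dimnames = list(SES = c("Low", "Middle", "High"),
                  School = c("Public", "Private"))
))

row_pct <- prop.table(tab, 1) * 100  # divide each cell by its row total
col_pct <- prop.table(tab, 2) * 100  # divide each cell by its column total

print(round(row_pct, 2))
print(rowSums(row_pct))  # with margin 1, every row adds up to 100

print(round(col_pct, 2))
print(colSums(col_pct))  # with margin 2, every column adds up to 100
```

If the rows of your table do not add up to 100 when you expected row percentages, the margin argument in prop.table() is the first thing to check.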

This is a small snippet of some essential tables we might need to use but of course, as with all things R, this is but 1% of what R can do when it comes to constructing tables.


Hands-on Practice

Go ahead and take a turn at the following tasks. You can put your heads together.

Data-set 1

  • Open up the sowc.RData and calculate the mean, median, variance, and standard deviation for the under-5 mortality rate
  • Now repeat the preceding calculation for Urban versus Rural countries
  • What percentage of the data-set is made up of Urban countries? Rural countries?

Data-set 2

  • With the cps85.RData and focusing on the hourly wages as the outcome of interest, calculate the mean, median and variance
  • Repeat the preceding calculations for the Union versus non-Union members
  • Calculate the mean and the standard deviation for all unique groups formed by union and sex (i.e., males in a union, males not in a union, females in a union, females not in a union)
  • What percent of the sample are males in a union? What percent of the sample are females in a union?