Descriptive Statistics
- The summary() Function
- Using data.table
Frequency Tables
Hands-on Practice

Descriptive Statistics

We can now start looking at some descriptive statistics – the usual mean, median, minimum, maximum, standard deviation, variance, etc. We won’t go into the statistical theory underlying these estimates since we covered this last semester. Let us start easy, by loading some data and then seeing some of the functions that give us summaries of our data.

The summary() Function

Let us load the hsb2.RData we read and saved earlier. If we want to look at its contents and get a quick feel for the distributions of each variable we can do so via the summary() function.

load("~/Downloads/hsb2.RData")

summary(hsb2)

id	female	race	ses	schtyp	prog	read	write	math	science	socst
Min. : 1.00	Male : 91	Hispanic : 24	Low :47	Public :168	General : 45	Min. :28.00	Min. :31.00	Min. :33.00	Min. :26.00	Min. :26.00
1st Qu.: 50.75	Female:109	Asian : 11	Middle:95	Private: 32	Academic :105	1st Qu.:44.00	1st Qu.:45.75	1st Qu.:45.00	1st Qu.:44.00	1st Qu.:46.00
Median :100.50	NA	African American: 20	High :58	NA	Vocational: 50	Median :50.00	Median :54.00	Median :52.00	Median :53.00	Median :52.00
Mean :100.50	NA	White :145	NA	NA	NA	Mean :52.23	Mean :52.77	Mean :52.65	Mean :51.85	Mean :52.41
3rd Qu.:150.25	NA	NA	NA	NA	NA	3rd Qu.:60.00	3rd Qu.:60.00	3rd Qu.:59.00	3rd Qu.:58.00	3rd Qu.:61.00
Max. :200.00	NA	NA	NA	NA	NA	Max. :76.00	Max. :67.00	Max. :75.00	Max. :74.00	Max. :71.00

Note how you see each variable along with some key statistics. You don’t see the standard deviation or variance listed for the numeric variables but these are easily calculated.

Calculating these estimates one variable at a time is tedious, so we will rely on some R packages to obtain these values. Before we do so, however, let us look at some of the functions we will use quite often. The code below uses a generic data frame (df) and a generic variable (x). You will have to replace df with whatever you have called the data frame and replace x with the actual name of the variable.

mean(df$x) $\cdots$ the mean
sd(df$x) $\cdots$ the standard deviation
var(df$x) $\cdots$ the variance
min(df$x) $\cdots$ the minimum
max(df$x) $\cdots$ the maximum
quantile(df$x, c(0.25, 0.50, 0.75)) $\cdots$ the first quartile, the median, the third quartile
IQR(df$x) $\cdots$ the interquartile range
sum(df$x) $\cdots$ the total of the values of variable x
scale(df$x) $\cdots$ the z-score of variable x
cor(df$x1, df$x2) $\cdots$ the correlation between x1 and x2
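
To see these functions in action without loading any data, here is a minimal sketch on a small made-up data frame. The name toy and its values are invented for this illustration and are not part of hsb2.

```r
# Hypothetical data, invented purely for illustration
toy <- data.frame(x1 = c(2, 4, 4, 6, 9), x2 = c(1, 3, 2, 5, 8))

m <- mean(toy$x1)                           # arithmetic mean: 5
v <- var(toy$x1)                            # sample variance (n - 1 denominator): 7
s <- sd(toy$x1)                             # sample standard deviation, sqrt(v)
q <- quantile(toy$x1, c(0.25, 0.50, 0.75))  # first quartile, median, third quartile
iqr <- IQR(toy$x1)                          # third quartile minus first quartile: 2
z <- scale(toy$x1)                          # z-scores, returned as a one-column matrix
r <- cor(toy$x1, toy$x2)                    # Pearson correlation (the default method)
```

Note that scale() returns a matrix rather than a plain vector, so wrap it in as.numeric() if you want to store the z-scores as an ordinary column.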

sd(hsb2$read)

## [1] 10.25294

var(hsb2$read)

## [1] 105.1227

quantile(hsb2$math, c(0.25, 0.50, 0.75))

## 25% 50% 75% 
##  45  52  59

IQR(hsb2$math)

## [1] 14

cor(hsb2$math, hsb2$science)

## [1] 0.6307332
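
The same functions can be applied to several numeric variables in one pass with sapply(). What follows is only a sketch on a made-up data frame (toy and its columns are hypothetical); with hsb2 the same idea would apply to its numeric columns.

```r
# Hypothetical data, invented purely for illustration
toy <- data.frame(
  group = factor(c("a", "a", "b", "b")),
  read  = c(40, 50, 60, 70),
  math  = c(45, 55, 50, 60)
)

# Flag the numeric columns so that factors are skipped
num_cols <- sapply(toy, is.numeric)

# Standard deviation and variance of every numeric column
sds  <- sapply(toy[num_cols], sd)
vars <- sapply(toy[num_cols], var)
```

Both results are named vectors, with one element per numeric column.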

Using data.table

There are several ways to generate tables of descriptive statistics for the numeric variables in a data frame. One of the more promising is the data.table package. You can find many examples on the web of how to accomplish particular tasks with it, but for now we will focus on simple tables that aggregate means, standard deviations, and so on. The commands that follow also use two other packages – knitr and printr – to dress up the tables; kable() comes from knitr, so that package must be loaded before kable() is called. Each table is therefore created in two steps: first we generate the table and give it a name (table.1, table.2, etc.), and then we dress it up via the kable() command.

library(data.table)
library(knitr)  # provides kable()
DT = data.table(hsb2)

table.1 = DT[, list(Mean.Reading = mean(read), Md.Reading = median(read), 
    SD.Reading = sd(read), Mean.Writing = mean(write), Md.Writing = median(write), 
    SD.Writing = sd(write), Mean.Math = mean(math), Md.Math = median(math), 
    SD.Math = sd(math))]

kable(table.1, digits = 2, booktabs = TRUE, caption = "Table 1: Descriptive Statistics")

Table 1: Descriptive Statistics
Mean.Reading	Md.Reading	SD.Reading	Mean.Writing	Md.Writing	SD.Writing	Mean.Math	Md.Math	SD.Math
52.23	50	10.25	52.77	54	9.48	52.65	52	9.37

If we want these estimates for, say, Male versus Female students, and then perhaps for Male versus Female students in Public versus Private schools, we can do so via:

table.2 = DT[, list(Mean.Reading = mean(read), SD.Reading = sd(read), 
    Mean.Writing = mean(write), SD.Writing = sd(write), Mean.Math = mean(math), 
    SD.Math = sd(math)), by = "female"]

kable(table.2, digits = 2, booktabs = TRUE, caption = "Table 2: Descriptive Statistics (by Gender)")

Table 2: Descriptive Statistics (by Gender)
female	Mean.Reading	SD.Reading	Mean.Writing	SD.Writing	Mean.Math	SD.Math
Male	52.82	10.51	50.12	10.31	52.95	9.66
Female	51.73	10.06	54.99	8.13	52.39	9.15

table.3 = DT[, list(Mean.Reading = mean(read), SD.Reading = sd(read), 
    Mean.Writing = mean(write), SD.Writing = sd(write), Mean.Math = mean(math), 
    SD.Math = sd(math)), by = list(female, schtyp)]

kable(table.3, digits = 2, booktabs = TRUE, caption = "Table 3: Descriptive Statistics (by Gender & School-type)")

Table 3: Descriptive Statistics (by Gender & School-type)
female	schtyp	Mean.Reading	SD.Reading	Mean.Writing	SD.Writing	Mean.Math	SD.Math
Male	Public	52.35	10.81	49.36	10.54	52.31	9.57
Female	Public	51.42	10.12	54.69	8.41	52.19	9.37
Male	Private	55.43	8.50	54.29	8.00	56.43	9.81
Female	Private	53.33	9.85	56.50	6.54	53.44	8.13
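
The calls above all follow the general data.table pattern DT[i, j, by]: leave i empty to keep every row, compute the summaries in j, and name the grouping variables in by. Here is a minimal sketch of that pattern, assuming the data.table package is installed; the table toy and its values are invented for illustration.

```r
library(data.table)

# Hypothetical scores, invented purely for illustration
toy <- data.table(
  sex    = c("M", "M", "F", "F", "F"),
  school = c("pub", "priv", "pub", "pub", "priv"),
  score  = c(50, 60, 55, 65, 70)
)

# One row per sex: group size, mean and standard deviation of score.
# .() is shorthand for list(), and .N is the number of rows in each group.
by_sex <- toy[, .(n = .N, mean_score = mean(score), sd_score = sd(score)), by = sex]

# One row per observed sex/school combination
by_both <- toy[, .(n = .N, mean_score = mean(score)), by = .(sex, school)]
```

Including .N alongside the means is a useful habit, since sd() returns NA for a group that contains a single observation.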

data.table is also useful for collapsing a data frame by a select set of variables. This is often done when we want to calculate the mean or median for a specific group. Say, for example, we wanted the mean of all subject scores for Male versus Female students, or by Race. This could be achieved via:

table.4 = DT[, lapply(.SD, mean), by = female, .SDcols = c("read", 
    "write", "math", "science", "socst")]

table.5 = DT[, lapply(.SD, mean), by = race, .SDcols = c("read", 
    "write", "math", "science", "socst")]

table.6 = DT[, lapply(.SD, mean), .SDcols = c("read", "write", 
    "math", "science", "socst"), by = c("female", "race", "schtyp")]


kable(table.4, digits = 2, booktabs = TRUE, caption = "Table 4: Mean Scores (by Subject and Gender)")

Table 4: Mean Scores (by Subject and Gender)
female	read	write	math	science	socst
Male	52.82	50.12	52.95	53.23	51.79
Female	51.73	54.99	52.39	50.70	52.92

kable(table.5, digits = 2, booktabs = TRUE, caption = "Table 5: Mean Scores (by Subject and Race)")

Table 5: Mean Scores (by Subject and Race)
race	read	write	math	science	socst
White	53.92	54.06	53.97	54.20	53.68
African American	46.80	48.20	46.75	42.80	49.45
Hispanic	46.67	46.46	47.42	45.38	47.79
Asian	51.91	58.00	57.27	51.45	51.00

kable(table.6, digits = 2, booktabs = TRUE, caption = "Table 6: Mean Scores (by Subject, Gender, Race, and School Type)")

Table 6: Mean Scores (by Subject, Gender, Race, and School Type)
female	race	schtyp	read	write	math	science	socst
Male	White	Public	54.28	50.52	53.61	55.41	52.76
Female	White	Public	53.73	56.41	53.64	53.16	54.25
Male	African American	Public	46.86	47.00	45.29	46.71	49.00
Male	Hispanic	Public	47.31	44.38	49.23	45.54	45.69
Male	White	Private	55.43	54.29	56.43	55.29	56.00
Male	Asian	Public	52.33	55.67	58.67	53.00	47.67
Female	Hispanic	Public	41.33	47.00	43.44	43.56	48.44
Female	White	Private	51.77	56.92	54.46	53.15	52.23
Female	African American	Public	46.27	48.36	47.55	39.55	48.09
Female	African American	Private	49.50	51.50	47.50	47.00	58.50
Female	Hispanic	Private	66.50	57.50	53.50	52.50	58.50
Female	Asian	Public	51.29	58.86	57.43	52.14	51.71
Female	Asian	Private	55.00	59.00	52.00	42.00	56.00
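
In these calls .SD stands for the subset of the data belonging to each group, restricted to the columns named in .SDcols, and lapply() applies the chosen function to each of those columns. A minimal sketch on made-up data follows (toy and subjects are hypothetical names; the data.table package is assumed to be installed).

```r
library(data.table)

# Hypothetical scores, invented purely for illustration
toy <- data.table(
  sex  = c("M", "M", "F", "F"),
  read = c(40, 50, 60, 70),
  math = c(45, 55, 50, 60)
)

subjects <- c("read", "math")

# Mean of each subject column within each sex
means <- toy[, lapply(.SD, mean), by = sex, .SDcols = subjects]

# Swapping the function gives a different summary with the same layout
medians <- toy[, lapply(.SD, median), by = sex, .SDcols = subjects]
```

Storing the column names in a vector such as subjects avoids retyping them for every table.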

Frequency Tables

With qualitative variables such as female, race, etc. we know we can best represent their distributions via frequency tables. These can be created very easily, and then dressed up a bit. The basic command is table(df$x) for the frequencies of a single variable and table(df$x1, df$x2) for a cross-tabulation of two variables. Neither creates the row and column totals, so we use the as.data.frame(addmargins(tab.x, FUN = Total)) command. Because addmargins() labels the margin with the name of the function it is given, we first define Total = sum so that the margin reads "Total". The kable() command generates the final table when we knit the document.

tab.6a = table(hsb2$ses)

Total = sum

tab.6b = as.data.frame(addmargins(tab.6a, FUN = Total))

colnames(tab.6b) = c("SES Category", "Frequency")

kable(tab.6b, booktabs = TRUE, caption = "Table 6b: Frequency Table of SES")

Table 6b: Frequency Table of SES
SES Category	Frequency
Low	47
Middle	95
High	58
Total	200
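
The steps used to build this table can be tried on a tiny made-up factor. This is only a sketch: the vector ses and its values are invented here, and the point is to show how the margin label follows the name of the function passed to addmargins().

```r
# Hypothetical SES values, invented purely for illustration
ses <- factor(c("Low", "Middle", "Middle", "High", "Middle", "Low"),
              levels = c("Low", "Middle", "High"))

tab <- table(ses)  # counts per category

# Give sum() a second name so the margin is labelled "Total";
# calling addmargins(tab) with no FUN would label it "Sum" instead
Total <- sum
with_total <- addmargins(tab, FUN = Total)

# Convert to a two-column data frame, ready for kable()
freq_df <- as.data.frame(with_total)
colnames(freq_df) <- c("SES Category", "Frequency")
```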

Table 6b is a table of simple frequencies. What if we wanted a table of relative frequencies (as proportions or percentages)? We could build such a table as follows:

tab.6c = prop.table(tab.6a) * 100

tab.6d = as.data.frame(addmargins(tab.6c, FUN = Total, quiet = TRUE))

colnames(tab.6d) = c("SES Category", "Percent")

kable(tab.6d, booktabs = TRUE, caption = "Table 7: Relative Frequency Table of SES")

Table 7: Relative Frequency Table of SES
SES Category	Percent
Low	23.5
Middle	47.5
High	29.0
Total	100.0

These are simple frequency/relative frequency tables of a single variable. What if we wanted to cross-tabulate ses and schtyp to create a contingency table?

tab.6e = table(hsb2$ses, hsb2$schtyp)

tab.6f = addmargins(tab.6e, FUN = Total, quiet = TRUE)

kable(tab.6f, digits = 0, booktabs = TRUE, caption = "Table 8: Crosstabulation of SES & School-type")

Table 8: Crosstabulation of SES & School-type
	Public	Private	Total
Low	45	2	47
Middle	76	19	95
High	47	11	58
Total	168	32	200

The result is a cross-tabulation of frequencies. We can flip this into a table of relative frequencies by modifying the code used earlier. In particular, note the use of 1 in the prop.table() command and then the use of 2 in the addmargins() command. The 1 in prop.table() says to convert each frequency into a proportion by dividing it by its row total; multiplying by 100 then turns the proportions into percentages. The 2 in addmargins() adds a margin along the second dimension (the columns), that is, a Total column that sums the percentages across each row, so every row should total 100.

tab.6g = prop.table(tab.6e, 1) * 100

tab.6h = addmargins(tab.6g, 2, FUN = Total, quiet = TRUE)

kable(tab.6h, digits = 2, booktabs = TRUE, caption = "Table 9: Crosstabulation of SES & School-type (Row Percentages)")

Table 9: Crosstabulation of SES & School-type (Row Percentages)
	Public	Private	Total
Low	95.74	4.26	100
Middle	80.00	20.00	100
High	81.03	18.97	100

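If we wanted column percentages instead, we would swap the two margins, using 2 in prop.table() and 1 in addmargins(), so that each frequency is divided by its column total and a Total row is added: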

tab.6i = prop.table(tab.6e, 2) * 100

tab.6j = addmargins(tab.6i, 1, FUN = Total, quiet = TRUE)

kable(tab.6j, digits = 2, booktabs = TRUE, caption = "Table 10: Crosstabulation of SES & School-type (Column Percentages)")

Table 10: Crosstabulation of SES & School-type (Column Percentages)
	Public	Private
Low	26.79	6.25
Middle	45.24	59.38
High	27.98	34.38
Total	100.00	100.00
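
The role of the margin argument can be checked directly. The sketch below re-enters the counts from Table 8 by hand (the object name counts is an assumption made for this example, so that hsb2 does not need to be loaded) and confirms that row percentages sum to 100 across each row while column percentages sum to 100 down each column.

```r
# Counts copied from Table 8 (SES by school type)
counts <- as.table(matrix(
  c(45, 76, 47,   # Public
     2, 19, 11),  # Private
  nrow = 3,
  dimnames = list(ses = c("Low", "Middle", "High"),
                  schtyp = c("Public", "Private"))
))

# margin = 1: divide each cell by its row total
row_pct <- prop.table(counts, 1) * 100

# margin = 2: divide each cell by its column total
col_pct <- prop.table(counts, 2) * 100

rowSums(row_pct)  # each row adds up to 100
colSums(col_pct)  # each column adds up to 100

# addmargins() with margin 2 appends a total column (labelled "Sum" by default)
row_pct_total <- addmargins(row_pct, 2)
```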

These are only a few of the essential tables we might need; as with all things R, they represent a small fraction of what R can do when it comes to constructing tables.

Hands-on Practice

Go ahead and take a turn at the following tasks. You can put your heads together.

Data-set 1

Open up the sowc.RData and calculate the mean, median, variance, and standard deviation for the under-5 mortality rate.
Now repeat the preceding calculations for Urban versus Rural countries.
What percentage of the data-set is made up of Urban countries? Rural countries?

Data-set 2

With the cps85.RData and focusing on the hourly wages as the outcome of interest, calculate the mean, median, and variance.
Repeat the preceding calculations for the Union versus non-Union members.
Calculate the mean and the standard deviation for all unique groups formed by union and sex (i.e., males in a union, males not in a union, females in a union, females not in a union).
What percent of the sample are males in a union? What percent of the sample are females in a union?

Descriptive Statistics

MPA 6020 (Spring 2016)

08 February, 2016
