We can now start looking at some descriptive statistics – the usual mean, median, minimum, maximum, standard deviation, variance, etc. We won’t go into the statistical theory underlying these estimates since we covered this last semester. Let us start easy, by loading some data and then seeing some of the functions that give us summaries of our data.
Let us load the hsb2.RData file we read and saved earlier. To look at its contents and get a quick feel for the distribution of each variable, we can use the summary() function.
load("~/Downloads/hsb2.RData")
summary(hsb2)
id | female | race | ses | schtyp | prog | read | write | math | science | socst |
---|---|---|---|---|---|---|---|---|---|---|
Min. : 1.00 | Male : 91 | Hispanic : 24 | Low :47 | Public :168 | General : 45 | Min. :28.00 | Min. :31.00 | Min. :33.00 | Min. :26.00 | Min. :26.00 |
1st Qu.: 50.75 | Female:109 | Asian : 11 | Middle:95 | Private: 32 | Academic :105 | 1st Qu.:44.00 | 1st Qu.:45.75 | 1st Qu.:45.00 | 1st Qu.:44.00 | 1st Qu.:46.00 |
Median :100.50 | NA | African American: 20 | High :58 | NA | Vocational: 50 | Median :50.00 | Median :54.00 | Median :52.00 | Median :53.00 | Median :52.00 |
Mean :100.50 | NA | White :145 | NA | NA | NA | Mean :52.23 | Mean :52.77 | Mean :52.65 | Mean :51.85 | Mean :52.41 |
3rd Qu.:150.25 | NA | NA | NA | NA | NA | 3rd Qu.:60.00 | 3rd Qu.:60.00 | 3rd Qu.:59.00 | 3rd Qu.:58.00 | 3rd Qu.:61.00 |
Max. :200.00 | NA | NA | NA | NA | NA | Max. :76.00 | Max. :67.00 | Max. :75.00 | Max. :74.00 | Max. :71.00 |
Note how you see each variable along with some key statistics. You don’t see the standard deviation or variance listed for the numeric variables but these are easily calculated.
There is a tedious way of getting these estimates. Instead, we can rely on some R packages to obtain these values. Before we do so, however, let us look at some of the functions we will use quite often. The code below applies them to variables in the hsb2 data frame. The general pattern is f(df$x), where you replace df with whatever you have called the data frame and x with the actual name of the variable.
sd(hsb2$read)
## [1] 10.25294
var(hsb2$read)
## [1] 105.1227
quantile(hsb2$math, c(0.25, 0.59, 0.75))
## 25% 59% 75%
## 45.00 54.41 59.00
IQR(hsb2$math)
## [1] 14
cor(hsb2$math, hsb2$science)
## [1] 0.6307332
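To make the "tedious way" concrete, here is a minimal sketch that computes the sample variance and standard deviation by hand and checks them against var() and sd(). The scores vector and the toy data frame are made up for illustration; they are not taken from hsb2.

```r
# Toy scores, made up for illustration (not from hsb2)
scores <- c(40, 50, 60, 70, 80)

# The "tedious" route: sample variance = sum of squared deviations / (n - 1)
n <- length(scores)
var.by.hand <- sum((scores - mean(scores))^2) / (n - 1)
sd.by.hand <- sqrt(var.by.hand)

# The built-in functions should agree with the hand calculation
all.equal(var.by.hand, var(scores))
all.equal(sd.by.hand, sd(scores))

# Apply sd() to every column of a (hypothetical) all-numeric data frame
toy <- data.frame(read = c(40, 50, 60, 70, 80),
                  math = c(45, 55, 50, 65, 60))
sapply(toy, sd)
```

The sapply() call only works as written when every column is numeric. With a mixed data frame such as hsb2 you would first select the numeric columns.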
There are a number of ways to generate tables of descriptive statistics for numeric variables in a data frame. One of the more promising ways is via the data.table package. On the web you can find several examples of how to accomplish a particular task, but for now we will focus on generating simple tables. These tables are aggregates of means, standard deviations, etc. The commands that follow also use two other packages, knitr and printr, to dress up the tables. As such, you will see these tables created in two steps: first generating the table we want and giving it a name (table.1, table.2, etc.), and then dressing each table via the kable() command.
library(data.table)
library(knitr)  # provides kable()
DT = data.table(hsb2)
table.1 = DT[, list(Mean.Reading = mean(read), Md.Reading = median(read),
SD.Reading = sd(read), Mean.Writing = mean(write), Md.Writing = median(write),
SD.Writing = sd(write), Mean.Math = mean(math), Md.Math = median(math),
SD.Math = sd(math))]
kable(table.1, digits = 2, booktabs = TRUE, caption = "Table 1: Descriptive Statistics")
Mean.Reading | Md.Reading | SD.Reading | Mean.Writing | Md.Writing | SD.Writing | Mean.Math | Md.Math | SD.Math |
---|---|---|---|---|---|---|---|---|
52.23 | 50 | 10.25 | 52.77 | 54 | 9.48 | 52.65 | 52 | 9.37 |
If we want these estimates for, say, Male versus Female students, and then perhaps for Male versus Female students in Public versus Private schools, we can do so via:
table.2 = DT[, list(Mean.Reading = mean(read), SD.Reading = sd(read),
Mean.Writing = mean(write), SD.Writing = sd(write), Mean.Math = mean(math),
SD.Math = sd(math)), by = "female"]
kable(table.2, digits = 2, booktabs = TRUE, caption = "Table 2: Descriptive Statistics (by Gender)")
female | Mean.Reading | SD.Reading | Mean.Writing | SD.Writing | Mean.Math | SD.Math |
---|---|---|---|---|---|---|
Male | 52.82 | 10.51 | 50.12 | 10.31 | 52.95 | 9.66 |
Female | 51.73 | 10.06 | 54.99 | 8.13 | 52.39 | 9.15 |
table.3 = DT[, list(Mean.Reading = mean(read), SD.Reading = sd(read),
Mean.Writing = mean(write), SD.Writing = sd(write), Mean.Math = mean(math),
SD.Math = sd(math)), by = list(female, schtyp)]
kable(table.3, digits = 2, booktabs = TRUE, caption = "Table 3: Descriptive Statistics (by Gender & School-type)")
female | schtyp | Mean.Reading | SD.Reading | Mean.Writing | SD.Writing | Mean.Math | SD.Math |
---|---|---|---|---|---|---|---|
Male | Public | 52.35 | 10.81 | 49.36 | 10.54 | 52.31 | 9.57 |
Female | Public | 51.42 | 10.12 | 54.69 | 8.41 | 52.19 | 9.37 |
Male | Private | 55.43 | 8.50 | 54.29 | 8.00 | 56.43 | 9.81 |
Female | Private | 53.33 | 9.85 | 56.50 | 6.54 | 53.44 | 8.13 |
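The grouping pattern used for Tables 2 and 3 can be tried on a much smaller data set. The sketch below uses a made-up data.table (the toy object and its values are hypothetical, not from hsb2) and adds data.table's built-in .N group counter, which the tables above do not use, to show how many rows fall into each group.

```r
library(data.table)

# Toy data, made up for illustration (not from hsb2)
toy <- data.table(
  female = c("Male", "Male", "Female", "Female", "Female", "Male"),
  schtyp = c("Public", "Private", "Public", "Public", "Private", "Public"),
  read   = c(50, 60, 40, 60, 70, 40)
)

# One grouping variable: one row per level of female
by.gender <- toy[, list(Mean.Reading = mean(read), SD.Reading = sd(read)),
                 by = female]

# Two grouping variables: one row per observed female/schtyp combination;
# .N counts the rows in each group
by.gender.school <- toy[, list(Mean.Reading = mean(read), N = .N),
                        by = list(female, schtyp)]

by.gender
by.gender.school
```

Checking the group sizes is worthwhile before reading too much into a grouped standard deviation: sd() returns NA for a group with a single observation.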
data.table is also useful for collapsing a data frame by a select set of variables. This is often done when we want to calculate the mean or median for a specific group. Say, for example, we wanted the mean of all subject scores for Male versus Female students, or by Race. This could be achieved via:
table.4 = DT[, lapply(.SD, mean), by = female, .SDcols = c("read",
"write", "math", "science", "socst")]
table.5 = DT[, lapply(.SD, mean), by = race, .SDcols = c("read",
"write", "math", "science", "socst")]
table.6 = DT[, lapply(.SD, mean), .SDcols = c("read", "write",
"math", "science", "socst"), by = c("female", "race", "schtyp")]
kable(table.4, digits = 2, booktabs = TRUE, caption = "Table 4: Mean Scores (by Subject and Gender)")
female | read | write | math | science | socst |
---|---|---|---|---|---|
Male | 52.82 | 50.12 | 52.95 | 53.23 | 51.79 |
Female | 51.73 | 54.99 | 52.39 | 50.70 | 52.92 |
kable(table.5, digits = 2, booktabs = TRUE, caption = "Table 5: Mean Scores (by Subject and Race)")
race | read | write | math | science | socst |
---|---|---|---|---|---|
White | 53.92 | 54.06 | 53.97 | 54.20 | 53.68 |
African American | 46.80 | 48.20 | 46.75 | 42.80 | 49.45 |
Hispanic | 46.67 | 46.46 | 47.42 | 45.38 | 47.79 |
Asian | 51.91 | 58.00 | 57.27 | 51.45 | 51.00 |
kable(table.6, digits = 2, booktabs = TRUE, caption = "Table 6: Mean Scores (by Subject, Gender, Race, and School Type)")
female | race | schtyp | read | write | math | science | socst |
---|---|---|---|---|---|---|---|
Male | White | Public | 54.28 | 50.52 | 53.61 | 55.41 | 52.76 |
Female | White | Public | 53.73 | 56.41 | 53.64 | 53.16 | 54.25 |
Male | African American | Public | 46.86 | 47.00 | 45.29 | 46.71 | 49.00 |
Male | Hispanic | Public | 47.31 | 44.38 | 49.23 | 45.54 | 45.69 |
Male | White | Private | 55.43 | 54.29 | 56.43 | 55.29 | 56.00 |
Male | Asian | Public | 52.33 | 55.67 | 58.67 | 53.00 | 47.67 |
Female | Hispanic | Public | 41.33 | 47.00 | 43.44 | 43.56 | 48.44 |
Female | White | Private | 51.77 | 56.92 | 54.46 | 53.15 | 52.23 |
Female | African American | Public | 46.27 | 48.36 | 47.55 | 39.55 | 48.09 |
Female | African American | Private | 49.50 | 51.50 | 47.50 | 47.00 | 58.50 |
Female | Hispanic | Private | 66.50 | 57.50 | 53.50 | 52.50 | 58.50 |
Female | Asian | Public | 51.29 | 58.86 | 57.43 | 52.14 | 51.71 |
Female | Asian | Private | 55.00 | 59.00 | 52.00 | 42.00 | 56.00 |
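To see what .SD and .SDcols are doing in the commands above, here is a minimal sketch on a made-up data.table (the toy object, its values, and the score.cols helper are hypothetical, not from hsb2).

```r
library(data.table)

# Toy data, made up for illustration (not from hsb2)
toy <- data.table(
  female = c("Male", "Male", "Female", "Female"),
  read   = c(50, 60, 40, 70),
  write  = c(45, 55, 50, 60)
)

score.cols <- c("read", "write")

# .SD holds each group's rows, restricted to the columns named in .SDcols,
# so lapply(.SD, mean) returns one mean per listed column per group
means <- toy[, lapply(.SD, mean), by = female, .SDcols = score.cols]

# The same pattern with a different summary function
medians <- toy[, lapply(.SD, median), by = female, .SDcols = score.cols]

means
medians
```

Storing the column names in a vector keeps the call short when the same set of columns is summarised several times.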
With qualitative variables such as female, race, etc. we know we can best represent their distributions via frequency tables. These can be created very easily, and then dressed up a bit. The basic command for a frequency table is table(df$x) for a single variable, or table(df$x1, df$x2) for a cross-tabulation of two. This command does not create the row and column totals, so we use the as.data.frame(addmargins(tab.x, FUN = Total)) command. Here Total is simply another name for sum, defined so that the margin is labelled "Total". The kable() command generates the final table when we knit the document.
tab.6a = table(hsb2$ses)
Total = sum
tab.6b = as.data.frame(addmargins(tab.6a, FUN = Total))
colnames(tab.6b) = c("SES Category", "Frequency")
kable(tab.6b, booktabs = TRUE, caption = "Table 6: Frequency Table of SES")
SES Category | Frequency |
---|---|
Low | 47 |
Middle | 95 |
High | 58 |
Total | 200 |
The table above (tab.6b) is a table of simple frequencies. What if we wanted a table of relative frequencies (as proportions or percentages)? We could build such a table as follows:
tab.6c = prop.table(tab.6a) * 100
tab.6d = as.data.frame(addmargins(tab.6c, FUN = Total, quiet = TRUE))
colnames(tab.6d) = c("SES Category", "Percent")
kable(tab.6d, booktabs = TRUE, caption = "Table 7: Relative Frequency Table of SES")
SES Category | Percent |
---|---|
Low | 23.5 |
Middle | 47.5 |
High | 29.0 |
Total | 100.0 |
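The same steps can be checked on a small made-up factor where the counts are easy to verify by eye. This is only a sketch: the ses vector below is hypothetical and is not the ses variable from hsb2.

```r
# Toy factor, made up for illustration (not from hsb2)
ses <- factor(c("Low", "Middle", "Middle", "High",
                "Low", "Middle", "High", "Middle"),
              levels = c("Low", "Middle", "High"))

counts <- table(ses)                    # frequencies
percents <- prop.table(counts) * 100    # relative frequencies, in percent

# Naming the function Total makes addmargins() label the margin "Total"
Total <- sum
counts.with.total <- addmargins(counts, FUN = Total)
percents.with.total <- addmargins(percents, FUN = Total)

as.data.frame(counts.with.total)
as.data.frame(percents.with.total)
```

Setting the factor levels explicitly keeps the categories in Low, Middle, High order rather than alphabetical order.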
These are simple frequency/relative frequency tables of a single variable. What if we wanted to cross-tabulate ses and schtyp to create a contingency table?
tab.6e = table(hsb2$ses, hsb2$schtyp)
tab.6f = addmargins(tab.6e, FUN = Total, quiet = TRUE)
kable(tab.6f, digits = 0, booktabs = TRUE, caption = "Table 8: Crosstabulation of SES & School-type")
SES | Public | Private | Total |
---|---|---|---|
Low | 45 | 2 | 47 |
Middle | 76 | 19 | 95 |
High | 47 | 11 | 58 |
Total | 168 | 32 | 200 |
The result is a cross-tabulation of frequencies. We can flip this into a table of relative frequencies by modifying the code used earlier. In particular, note the use of 1 in the prop.table() command and the use of 2 in the addmargins() command. The 1 in prop.table() converts each frequency into a proportion by dividing it by its row total. Because the proportions are multiplied by 100, the results are row percentages. The 2 in addmargins() then appends a Total column, so the percentages are summed across each row.
tab.6g = prop.table(tab.6e, 1) * 100
tab.6h = addmargins(tab.6g, 2, FUN = Total, quiet = TRUE)
kable(tab.6h, digits = 2, booktabs = TRUE, caption = "Table 9: Crosstabulation of SES & School-type (Row Percentages)")
SES | Public | Private | Total |
---|---|---|---|
Low | 95.74 | 4.26 | 100 |
Middle | 80.00 | 20.00 | 100 |
High | 81.03 | 18.97 | 100 |
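If we wanted column percentages, we would swap the two margins: 2 in prop.table() so that each frequency is divided by its column total, and 1 in addmargins() so that a Total row is appended (see below):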
tab.6i = prop.table(tab.6e, 2) * 100
tab.6j = addmargins(tab.6i, 1, FUN = Total, quiet = TRUE)
kable(tab.6j, digits = 2, booktabs = TRUE, caption = "Table 10: Crosstabulation of SES & School-type (Column Percentages)")
SES | Public | Private |
---|---|---|
Low | 26.79 | 6.25 |
Middle | 45.24 | 59.38 |
High | 27.98 | 34.38 |
Total | 100.00 | 100.00 |
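A quick way to confirm which margin does what is to check the row and column sums directly. The sketch below does this on a made-up 2 x 2 table of counts (hypothetical values, not from hsb2).

```r
# Toy 2 x 2 table of counts, made up for illustration (not from hsb2)
counts <- as.table(matrix(c(30, 10, 20, 40), nrow = 2,
                          dimnames = list(ses = c("Low", "High"),
                                          schtyp = c("Public", "Private"))))

row.pct <- prop.table(counts, 1) * 100   # margin 1: each row sums to 100
col.pct <- prop.table(counts, 2) * 100   # margin 2: each column sums to 100

rowSums(row.pct)
colSums(col.pct)

Total <- sum
addmargins(row.pct, 2, FUN = Total, quiet = TRUE)  # appends a Total column
addmargins(col.pct, 1, FUN = Total, quiet = TRUE)  # appends a Total row
```

If the totals do not come out to 100, the margins passed to prop.table() and addmargins() are most likely the same number when they should be opposite.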
These are a few of the essential tables we might need but, as with all things R, they are only a small fraction of what R can do when it comes to constructing tables.
Go ahead and take a turn at the following tasks. You can put your heads together.