Aggregating Data and Other Operations

Generating aggregated data frames is frequently used in analysis. I’ll demonstrated some operations with the dplyr package. This is not the only package capable of performing the operations shown below but it is one (if not the) fastest. We’ll use a particular data set to show what dplyr can do – the nycflights13 data frame that contains all 336776 flights that departed from New York City in 2013 (source: US Bureau of Transportation Statistics).

library(nycflights13)
library(dplyr)
head(flights)
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
2013 1 1 542 2 923 33 AA N619AA 1141 JFK MIA 160 1089 5 42
2013 1 1 544 -1 1004 -18 B6 N804JB 725 JFK BQN 183 1576 5 44
2013 1 1 554 -6 812 -25 DL N668DN 461 LGA ATL 116 762 5 54
2013 1 1 554 -4 740 12 UA N39463 1696 EWR ORD 150 719 5 54

filter() allows you to subset the data. In the code below we are asking for all flights on January 1 (i.e., with month == 1, day == 1). The other ways of subsetting the data are also shown below.

filter(flights, month == 1, day == 1)
flights[flights$month == 1 & flights$day == 1, ]
subset(flights, month == 1 & day == 1)

Let us now sort the data frame, first by year, month and day, and then by arrival delay (arr_delay).

arrange(flights, year, month, day)
arrange(flights, arr_delay)
arrange(flights, desc(arr_delay))

Note the difference between arr_delay and desc(arr_delay).

We can select columns by name in several ways:

select(flights, year, month, day)
select(flights, year:day)
select(flights, -c(year, month, day))
select(flights, -(year:day))

Note that year, month, day is the same as year:day. Note too that we can ask for some columns to be excluded by running -c(year, month, day) … which is the same as -(year:day).

If we want to rename some columns we can do so via rename().

rename(flights, arrival.delay = arr_delay)
rename(flights, arrival.time = arr_time, departure.time = dep_time)

Unique values in a column can be extracted as well. For example, how many unique aircraft do we have flagged by the tail number (tailnum)? This is extracted by distinct(select()), both for a single column (as in tailnum) and for combinations of multiple columns (as in origin, des).

distinct(select(flights, tailnum))
distinct(select(flights, origin, dest))

If we want to calculate something based on existing columns and then add these new calculated values to the data frame, mutate comes in handy.

flights = mutate(flights, gain = arr_delay - dep_delay)

If you look at the data frame you will see that gain has been added as the \(17^{th}\) column. Now, one of the amazing things about the mutate command is that it allows you to refer to a previous calculation within the same command (see below);

flights = mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain/(air_time/60))

Note that gain_per_hour has been added by dividing gain by (air_time / 60) … since longer flights can gain more time than shorter flights.

summarise() is another useful function (see below).

tab.1 = summarise(flights, mean_arr_delay = mean(arr_delay, na.rm = TRUE), 
    sd_arr_delay = sd(arr_delay, na.rm = TRUE), mean_dep_delay = mean(dep_delay, 
        na.rm = TRUE), sd_dep_delay = sd(dep_delay, na.rm = TRUE))

library(knitr)
kable(tab.1, caption = "Mean and Standard Deviation of Arrival and Departure Delays ")
Mean and Standard Deviation of Arrival and Departure Delays
mean_arr_delay sd_arr_delay mean_dep_delay sd_dep_delay
6.895377 44.63329 12.63907 40.21006

You can draw random samples from your data frame as well. The code below shows a random sub-sample of 15 observations and a random sub-sample of 2% of the full data.

sample_n(flights, 15)
sample_frac(flights, 0.02)

So far we have been running these commands on all observations rather than by some group indicator. For example, what if we wanted to run some calculations for each unique aircraft (identified by__tailnum__)? The first thing we will do is create the grouping key by_tailnum via the group_by() command.

by_tailnum = group_by(flights, tailnum)
delay = summarise(by_tailnum, count = n(), dist = mean(distance, 
    na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE))

If you look at delay you will see 4044 observations of 4 variables – count tells you how many times each aircraft has arrived at NYC, dist tells you the average distance flown, and delay tells you the average arrival delay. Now, is the average delay related to the average distance flown? We’ll ask and answer this question via a simple scatter-plot on a subset of the delay data frame, subset in that the aircraft must have more than 20 arrivals in NYC and flown average distance of less than 2000 miles.

library(plotly)
library(ggplot2)
delay <- filter(delay, count > 20, dist < 2000)
p = ggplot(delay, aes(x = dist, y = delay)) + geom_point(aes(size = count), 
    alpha = 0.5) + geom_smooth() + scale_size_area()
p

What if we wanted to know the number of flights to each destination listed in the data frame?

destinations = group_by(flights, dest)
summarise(destinations, flights = n())

What about the number of flights to each destination by month?

destinations = group_by(flights, month, dest)
summarise(destinations, flights = n())

Note that now you have an extra column (month) because you asked for the number of flights to each destination for each month.

Now for some fun. Say we want to know the number of flights per day to each destination? What about per month? Per year?

destinations = group_by(flights, day, dest)
flights_per_day = summarise(destinations, flights = n())

flights_per_month = summarise(flights_per_day, flights = sum(flights))

flights_per_year = summarise(flights_per_month, flights = sum(flights))