Aggregating Data and Other Operations

Generating aggregated data frames is frequently used in analysis. I’ll demonstrated some operations with the dplyr package. This is not the only package capable of performing the operations shown below but it is one (if not the) fastest. We’ll use a particular data set to show what dplyr can do – the nycflights13 data frame that contains all 336776 flights that departed from New York City in 2013 (source: US Bureau of Transportation Statistics).

library(nycflights13)
library(dplyr)
head(flights)

year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
2013	1	1	542	2	923	33	AA	N619AA	1141	JFK	MIA	160	1089	5	42
2013	1	1	544	-1	1004	-18	B6	N804JB	725	JFK	BQN	183	1576	5	44
2013	1	1	554	-6	812	-25	DL	N668DN	461	LGA	ATL	116	762	5	54
2013	1	1	554	-4	740	12	UA	N39463	1696	EWR	ORD	150	719	5	54

filter() allows you to subset the data. In the code below we are asking for all flights on January 1 (i.e., with month == 1, day == 1). The other ways of subsetting the data are also shown below.

filter(flights, month == 1, day == 1)
flights[flights$month == 1 & flights$day == 1, ]
subset(flights, month == 1 & day == 1)

Let us now sort the data frame, first by year, month and day, and then by arrival delay (arr_delay).

arrange(flights, year, month, day)
arrange(flights, arr_delay)
arrange(flights, desc(arr_delay))

Note the difference between arr_delay and desc(arr_delay).

We can select columns by name in several ways:

select(flights, year, month, day)
select(flights, year:day)
select(flights, -c(year, month, day))
select(flights, -(year:day))

Note that year, month, day is the same as year:day. Note too that we can ask for some columns to be excluded by running -c(year, month, day) … which is the same as -(year:day).

If we want to rename some columns we can do so via rename().

rename(flights, arrival.delay = arr_delay)
rename(flights, arrival.time = arr_time, departure.time = dep_time)

Unique values in a column can be extracted as well. For example, how many unique aircraft do we have flagged by the tail number (tailnum)? This is extracted by distinct(select()), both for a single column (as in tailnum) and for combinations of multiple columns (as in origin, des).

distinct(select(flights, tailnum))
distinct(select(flights, origin, dest))

If we want to calculate something based on existing columns and then add these new calculated values to the data frame, mutate comes in handy.

flights = mutate(flights, gain = arr_delay - dep_delay)

If you look at the data frame you will see that gain has been added as the \(17^{th}\) column. Now, one of the amazing things about the mutate command is that it allows you to refer to a previous calculation within the same command (see below);

flights = mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain/(air_time/60))

Note that gain_per_hour has been added by dividing gain by (air_time / 60) … since longer flights can gain more time than shorter flights.

summarise() is another useful function (see below).

tab.1 = summarise(flights, mean_arr_delay = mean(arr_delay, na.rm = TRUE), 
    sd_arr_delay = sd(arr_delay, na.rm = TRUE), mean_dep_delay = mean(dep_delay, 
        na.rm = TRUE), sd_dep_delay = sd(dep_delay, na.rm = TRUE))

library(knitr)
kable(tab.1, caption = "Mean and Standard Deviation of Arrival and Departure Delays ")

Mean and Standard Deviation of Arrival and Departure Delays
mean_arr_delay	sd_arr_delay	mean_dep_delay	sd_dep_delay
6.895377	44.63329	12.63907	40.21006

You can draw random samples from your data frame as well. The code below shows a random sub-sample of 15 observations and a random sub-sample of 2% of the full data.

sample_n(flights, 15)
sample_frac(flights, 0.02)

So far we have been running these commands on all observations rather than by some group indicator. For example, what if we wanted to run some calculations for each unique aircraft (identified by__tailnum__)? The first thing we will do is create the grouping key by_tailnum via the group_by() command.

by_tailnum = group_by(flights, tailnum)
delay = summarise(by_tailnum, count = n(), dist = mean(distance, 
    na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE))

If you look at delay you will see 4044 observations of 4 variables – count tells you how many times each aircraft has arrived at NYC, dist tells you the average distance flown, and delay tells you the average arrival delay. Now, is the average delay related to the average distance flown? We’ll ask and answer this question via a simple scatter-plot on a subset of the delay data frame, subset in that the aircraft must have more than 20 arrivals in NYC and flown average distance of less than 2000 miles.

library(plotly)
library(ggplot2)
delay <- filter(delay, count > 20, dist < 2000)
p = ggplot(delay, aes(x = dist, y = delay)) + geom_point(aes(size = count), 
    alpha = 0.5) + geom_smooth() + scale_size_area()
p

What if we wanted to know the number of flights to each destination listed in the data frame?

destinations = group_by(flights, dest)
summarise(destinations, flights = n())

What about the number of flights to each destination by month?

destinations = group_by(flights, month, dest)
summarise(destinations, flights = n())

Note that now you have an extra column (month) because you asked for the number of flights to each destination for each month.

Now for some fun. Say we want to know the number of flights per day to each destination? What about per month? Per year?

destinations = group_by(flights, day, dest)
flights_per_day = summarise(destinations, flights = n())

flights_per_month = summarise(flights_per_day, flights = sum(flights))

flights_per_year = summarise(flights_per_month, flights = sum(flights))

Data Acquisition and Management (2)

MPA 6020 (Spring 2016)

22 February, 2016

Aggregating Data and Other Operations