The tidyr package is another useful tool in the data analysts’ toolkit. In fact very often you will see tidyr being used in tandem with dplyr because they complement each other well. One thing that we will do here is to use the pipe operator %>% because, well, because it is efficient and most likely will become the default even if it isn’t today.

What does this pipe operator do? Let us see its abilities in the context of how we have run commands thus far. In particular, thus far we have used dplyr as shown below:

arrange(
            summarize(
                filter(data, variable == numeric_value),
                Total = sum(variable)
            ),
        desc(Total)
    )

where each function is nested within another. This was essentially the same as doing the following:

a = filter(data, variable == numeric_value)
b = summarize(a, Total = sum(variable))
c = arrange(b, desc(Total))

These operations, when written with the pipe operator, will look strange (see below):

data %>%
            filter(variable == “value”) %>%
            summarize(Total = sum(variable)) %>%
            arrange(desc(Total))

tidyr’s Key Functions

There are four tidyr functions you are likely to use.

Before we do anything else, let us read in the data we will work with:

library(tidyr)
library(knitr)

setwd("~/Downloads/")
load("DF.RData")

kable(head(DF))
County Year Population State
Alamance 2006 141472 NC
Alamance 2007 144824 NC
Alamance 2008 147704 NC
Alamance 2009 149954 NC
Alamance 2010 151528 NC
Alamance 2011 153291 NC

If you open the data frame you will see there are seven counties (all from North Carolina), with data for multiple years (2006 through 2014) on population.

Let us see how we might create a single column that combines the County and the State columns.

unite

unite() will merge two columns into a single column for each observation.

DF.unite1 = unite(DF, CountyState, County, State, sep=", ")
DF.unite2 = DF %>% unite(CountyState, County, State, sep=", ")

Note the components of the command. Without the pipe operator you specify the data frame, then the two variables to be united, and finally if and how these should be separated. Notice the space after the comma in sep=", ". If you didn’t put the space the names would show up as Alamance,NC instead of as Alamance, NC.

separate

separate() does exactly the opposite; it will split a column. Let us split the CountyState column we just created into two columns, one we’ll call CountyName and the other we’ll call StateAbbreviation.

DF.separate1 = separate(DF.unite1, CountyState, c("CountyName", "StateAbbreviation"))
DF.separate2 = DF.unite2 %>% separate(CountyState, c("CountyName", "StateAbbreviation"))

Notice the components … if you are not using the pipe operator, the data frame comes first, then the variable to be split, and finally the names of the new columns to be created.

spread

spread() flips data from the long format to the wide format. Let us work with the DF data frame.

DF$Year = sub("^", "FY", DF$Year)
DF.spread1P = spread(DF, Year, Population)
DF.spread2P = DF %>% spread(Year, Population)

gather

gather() reshapes data from the wide to the long format.

gather(data, key, value, x1, x2, na.rm = FALSE, convert = FALSE) # without the pipe operator
data %>% gather(key, value, x1, x2, na.rm = FALSE, convert = FALSE) # with the pipe operatore

In this command, data refers to the data frame, key refers to the column name representing the new variable to be created, value refers to the column name representing the new variable’s values, x1, x2 are the pseudo-names of the columns we want to gather, na.rm indicates whether missing values are to be removed or not, and finally convert=TRUE will automatically convert values to logical, integer, numeric, complex or factor type as appropriate.

DF.gather1 = DF.spread1P %>% gather(Year, Popn, FY2006:FY2014)
DF.gather2 = DF.spread2P %>% gather(Year, Popn, FY2006:FY2014)

Note the sequence, again: We first specify what to call the long column (Year), then specify the name of the column that will contain the values (Popn), and then the columns to be converted from wide to long.

Reshaping slightly more complicated data

We first create a data frame.

library(dplyr)
grades <- tbl_df(read.table(header = TRUE, text = "
   ID   Test Year   Fall Spring Winter
    1   1   2008    15      16      19
    1   1   2009    12      13      27
    1   2   2008    22      22      24
    1   2   2009    10      14      20
    2   1   2008    12      13      25
    2   1   2009    16      14      21
    2   2   2008    13      11      29
    2   2   2009    23      20      26
    3   1   2008    11      12      22
    3   1   2009    13      11      27
    3   2   2008    17      12      23
    3   2   2009    14      9       31
"))

grades.long = grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)

kable(head(grades.long))
ID Year Semester TestType1 TestType2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20

This looks quirky but it isn’t; it shows you how you can blend tidyr and dplyr to end up with the desired result. Let us see how it works.

  1. We know that gather() will take multiple columns and put them into a single column (what we’ve referred to as going from wide to long). Here the gather command is taking the three semesters (Fall, Spring, Winter) and collapsing these into a single column called Semester, and putting the test scores from Fall, Spring and Winter into a new column called Score.
step1 = grades %>%
  gather(Semester, Score, Fall:Winter) 

kable(head(step1, 12))
ID Test Year Semester Score
1 1 2008 Fall 15
1 1 2009 Fall 12
1 2 2008 Fall 22
1 2 2009 Fall 10
2 1 2008 Fall 12
2 1 2009 Fall 16
2 2 2008 Fall 13
2 2 2009 Fall 23
3 1 2008 Fall 11
3 1 2009 Fall 13
3 2 2008 Fall 17
3 2 2009 Fall 14
  1. We know that mutate() creates a new variable. Here the new variable is called Test and is being created by pasting the string “TestType” to Test 1 and Test 2.
step12 = grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test))

kable(head(step12, 12))
ID Test Year Semester Score
1 TestType1 2008 Fall 15
1 TestType1 2009 Fall 12
1 TestType2 2008 Fall 22
1 TestType2 2009 Fall 10
2 TestType1 2008 Fall 12
2 TestType1 2009 Fall 16
2 TestType2 2008 Fall 13
2 TestType2 2009 Fall 23
3 TestType1 2008 Fall 11
3 TestType1 2009 Fall 13
3 TestType2 2008 Fall 17
3 TestType2 2009 Fall 14
  1. Likewise, spread() shifts data from the long format to the wide format. Here, the spread command is then taking the newly created “Test” column and creating a column for each unique TestType. In these new columns the corresponding values from the Score column are being inserted.
step123 = grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score)

kable(head(step123, 12))
ID Year Semester TestType1 TestType2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20
2 2008 Fall 12 13
2 2008 Spring 13 11
2 2008 Winter 25 29
2 2009 Fall 16 23
2 2009 Spring 14 20
2 2009 Winter 21 26
  1. Finally, arrange() organizes the data into a particular order. Here we are asking the final result to be arranged by ID, Year and Semester.
grades.long = grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)

kable(head(grades.long))
ID Year Semester TestType1 TestType2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20

This example shows how you can chain commands via the pipe operator %>% and rather than breaking up the data manipulation tasks collapse all the commands into a single piped command. The trickier the data you are working with, the more difficult this is to do well. One trick is to take a small portion of your complicated data and subset it into a small portion that has maybe two or three unique units (individuals, geographies, etc). Then see if you can get the code to do what you need it to do. If it works on this subset then it will work on the full data set as well.