tidyr’s Key Functions
- unite
- separate
- spread
- gather
Reshaping slightly more complicated data

The tidyr package is another useful tool in the data analysts’ toolkit. In fact very often you will see tidyr being used in tandem with dplyr because they complement each other well. One thing that we will do here is to use the pipe operator %>% because, well, because it is efficient and most likely will become the default even if it isn’t today.

What does this pipe operator do? Let us see its abilities in the context of how we have run commands thus far. In particular, thus far we have used dplyr as shown below:

arrange(
            summarize(
                filter(data, variable == numeric_value),
                Total = sum(variable)
            ),
        desc(Total)
    )

where each function is nested within another. This was essentially the same as doing the following:

a = filter(data, variable == numeric_value)
b = summarize(a, Total = sum(variable))
c = arrange(b, desc(Total))

These operations, when written with the pipe operator, will look strange (see below):

data %>%
            filter(variable == “value”) %>%
            summarize(Total = sum(variable)) %>%
            arrange(desc(Total))

tidyr’s Key Functions

There are four tidyr functions you are likely to use.

separate() splits a single column into multiple columns
spread() takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider
gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer
unite() combines multiple columns into a single column

Before we do anything else, let us read in the data we will work with:

library(tidyr)
library(knitr)

setwd("~/Downloads/")
load("DF.RData")

kable(head(DF))

County	Year	Population	State
Alamance	2006	141472	NC
Alamance	2007	144824	NC
Alamance	2008	147704	NC
Alamance	2009	149954	NC
Alamance	2010	151528	NC
Alamance	2011	153291	NC

If you open the data frame you will see there are seven counties (all from North Carolina), with data for multiple years (2006 through 2014) on population.

Let us see how we might create a single column that combines the County and the State columns.

unite

unite() will merge two columns into a single column for each observation.

DF.unite1 = unite(DF, CountyState, County, State, sep=", ")
DF.unite2 = DF %>% unite(CountyState, County, State, sep=", ")

Note the components of the command. Without the pipe operator you specify the data frame, then the two variables to be united, and finally if and how these should be separated. Notice the space after the comma in sep=", ". If you didn’t put the space the names would show up as Alamance,NC instead of as Alamance, NC.

separate

separate() does exactly the opposite; it will split a column. Let us split the CountyState column we just created into two columns, one we’ll call CountyName and the other we’ll call StateAbbreviation.

DF.separate1 = separate(DF.unite1, CountyState, c("CountyName", "StateAbbreviation"))
DF.separate2 = DF.unite2 %>% separate(CountyState, c("CountyName", "StateAbbreviation"))

Notice the components … if you are not using the pipe operator, the data frame comes first, then the variable to be split, and finally the names of the new columns to be created.

spread

spread() flips data from the long format to the wide format. Let us work with the DF data frame.

DF$Year = sub("^", "FY", DF$Year)
DF.spread1P = spread(DF, Year, Population)
DF.spread2P = DF %>% spread(Year, Population)

gather

gather() reshapes data from the wide to the long format.

gather(data, key, value, x1, x2, na.rm = FALSE, convert = FALSE) # without the pipe operator
data %>% gather(key, value, x1, x2, na.rm = FALSE, convert = FALSE) # with the pipe operatore

In this command, data refers to the data frame, key refers to the column name representing the new variable to be created, value refers to the column name representing the new variable’s values, x1, x2 are the pseudo-names of the columns we want to gather, na.rm indicates whether missing values are to be removed or not, and finally convert=TRUE will automatically convert values to logical, integer, numeric, complex or factor type as appropriate.

DF.gather1 = DF.spread1P %>% gather(Year, Popn, FY2006:FY2014)
DF.gather2 = DF.spread2P %>% gather(Year, Popn, FY2006:FY2014)

Note the sequence, again: We first specify what to call the long column (Year), then specify the name of the column that will contain the values (Popn), and then the columns to be converted from wide to long.

Reshaping slightly more complicated data

We first create a data frame.

library(dplyr)
grades <- tbl_df(read.table(header = TRUE, text = "
   ID   Test Year   Fall Spring Winter
    1   1   2008    15      16      19
    1   1   2009    12      13      27
    1   2   2008    22      22      24
    1   2   2009    10      14      20
    2   1   2008    12      13      25
    2   1   2009    16      14      21
    2   2   2008    13      11      29
    2   2   2009    23      20      26
    3   1   2008    11      12      22
    3   1   2009    13      11      27
    3   2   2008    17      12      23
    3   2   2009    14      9       31
"))

grades.long = grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)

kable(head(grades.long))

ID	Year	Semester	TestType1	TestType2
1	2008	Fall	15	22
1	2008	Spring	16	22
1	2008	Winter	19	24
1	2009	Fall	12	10
1	2009	Spring	13	14
1	2009	Winter	27	20

This looks quirky but it isn’t; it shows you how you can blend tidyr and dplyr to end up with the desired result. Let us see how it works.

We know that gather() will take multiple columns and put them into a single column (what we’ve referred to as going from wide to long). Here the gather command is taking the three semesters (Fall, Spring, Winter) and collapsing these into a single column called Semester, and putting the test scores from Fall, Spring and Winter into a new column called Score.

step1 = grades %>%
  gather(Semester, Score, Fall:Winter) 

kable(head(step1, 12))

ID	Test	Year	Semester	Score
1	1	2008	Fall	15
1	1	2009	Fall	12
1	2	2008	Fall	22
1	2	2009	Fall	10
2	1	2008	Fall	12
2	1	2009	Fall	16
2	2	2008	Fall	13
2	2	2009	Fall	23
3	1	2008	Fall	11
3	1	2009	Fall	13
3	2	2008	Fall	17
3	2	2009	Fall	14

We know that mutate() creates a new variable. Here the new variable is called Test and is being created by pasting the string “TestType” to Test 1 and Test 2.

step12 = grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test))

kable(head(step12, 12))

ID	Test	Year	Semester	Score
1	TestType1	2008	Fall	15
1	TestType1	2009	Fall	12
1	TestType2	2008	Fall	22
1	TestType2	2009	Fall	10
2	TestType1	2008	Fall	12
2	TestType1	2009	Fall	16
2	TestType2	2008	Fall	13
2	TestType2	2009	Fall	23
3	TestType1	2008	Fall	11
3	TestType1	2009	Fall	13
3	TestType2	2008	Fall	17
3	TestType2	2009	Fall	14

Likewise, spread() shifts data from the long format to the wide format. Here, the spread command is then taking the newly created “Test” column and creating a column for each unique TestType. In these new columns the corresponding values from the Score column are being inserted.

step123 = grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score)

kable(head(step123, 12))

ID	Year	Semester	TestType1	TestType2
1	2008	Fall	15	22
1	2008	Spring	16	22
1	2008	Winter	19	24
1	2009	Fall	12	10
1	2009	Spring	13	14
1	2009	Winter	27	20
2	2008	Fall	12	13
2	2008	Spring	13	11
2	2008	Winter	25	29
2	2009	Fall	16	23
2	2009	Spring	14	20
2	2009	Winter	21	26

Finally, arrange() organizes the data into a particular order. Here we are asking the final result to be arranged by ID, Year and Semester.

grades.long = grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("TestType", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)

kable(head(grades.long))

ID	Year	Semester	TestType1	TestType2
1	2008	Fall	15	22
1	2008	Spring	16	22
1	2008	Winter	19	24
1	2009	Fall	12	10
1	2009	Spring	13	14
1	2009	Winter	27	20

This example shows how you can chain commands via the pipe operator %>% and rather than breaking up the data manipulation tasks collapse all the commands into a single piped command. The trickier the data you are working with, the more difficult this is to do well. One trick is to take a small portion of your complicated data and subset it into a small portion that has maybe two or three unique units (individuals, geographies, etc). Then see if you can get the code to do what you need it to do. If it works on this subset then it will work on the full data set as well.

tidy your data

ruhil