The tidyr
package is another useful tool in the data analysts’ toolkit. In fact very often you will see tidyr
being used in tandem with dplyr
because they complement each other well. One thing that we will do here is to use the pipe operator %>%
because, well, because it is efficient and most likely will become the default even if it isn’t today.
What does this pipe operator do? Let us see its abilities in the context of how we have run commands thus far. In particular, thus far we have used dplyr as shown below:
arrange(
summarize(
filter(data, variable == numeric_value),
Total = sum(variable)
),
desc(Total)
)
where each function is nested within another. This was essentially the same as doing the following:
a = filter(data, variable == numeric_value)
b = summarize(a, Total = sum(variable))
c = arrange(b, desc(Total))
These operations, when written with the pipe operator, will look strange (see below):
data %>%
filter(variable == “value”) %>%
summarize(Total = sum(variable)) %>%
arrange(desc(Total))
There are four tidyr functions you are likely to use.
separate()
splits a single column into multiple columnsspread()
takes two columns (key & value) and spreads in to multiple columns, it makes “long” data widergather()
takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longerunite()
combines multiple columns into a single columnBefore we do anything else, let us read in the data we will work with:
library(tidyr)
library(knitr)
setwd("~/Downloads/")
load("DF.RData")
kable(head(DF))
County | Year | Population | State |
---|---|---|---|
Alamance | 2006 | 141472 | NC |
Alamance | 2007 | 144824 | NC |
Alamance | 2008 | 147704 | NC |
Alamance | 2009 | 149954 | NC |
Alamance | 2010 | 151528 | NC |
Alamance | 2011 | 153291 | NC |
If you open the data frame you will see there are seven counties (all from North Carolina), with data for multiple years (2006 through 2014) on population.
Let us see how we might create a single column that combines the County and the State columns.
unite()
will merge two columns into a single column for each observation.
DF.unite1 = unite(DF, CountyState, County, State, sep=", ")
DF.unite2 = DF %>% unite(CountyState, County, State, sep=", ")
Note the components of the command. Without the pipe operator you specify the data frame, then the two variables to be united, and finally if and how these should be separated. Notice the space after the comma in sep=", "
. If you didn’t put the space the names would show up as Alamance,NC instead of as Alamance, NC.
separate()
does exactly the opposite; it will split a column. Let us split the CountyState column we just created into two columns, one we’ll call CountyName and the other we’ll call StateAbbreviation.
DF.separate1 = separate(DF.unite1, CountyState, c("CountyName", "StateAbbreviation"))
DF.separate2 = DF.unite2 %>% separate(CountyState, c("CountyName", "StateAbbreviation"))
Notice the components … if you are not using the pipe operator, the data frame comes first, then the variable to be split, and finally the names of the new columns to be created.
spread()
flips data from the long format to the wide format. Let us work with the DF data frame.
DF$Year = sub("^", "FY", DF$Year)
DF.spread1P = spread(DF, Year, Population)
DF.spread2P = DF %>% spread(Year, Population)
gather()
reshapes data from the wide to the long format.
gather(data, key, value, x1, x2, na.rm = FALSE, convert = FALSE) # without the pipe operator
data %>% gather(key, value, x1, x2, na.rm = FALSE, convert = FALSE) # with the pipe operatore
In this command, data
refers to the data frame, key
refers to the column name representing the new variable to be created, value
refers to the column name representing the new variable’s values, x1, x2
are the pseudo-names of the columns we want to gather, na.rm
indicates whether missing values are to be removed or not, and finally convert=TRUE
will automatically convert values to logical, integer, numeric, complex or factor type as appropriate.
DF.gather1 = DF.spread1P %>% gather(Year, Popn, FY2006:FY2014)
DF.gather2 = DF.spread2P %>% gather(Year, Popn, FY2006:FY2014)
Note the sequence, again: We first specify what to call the long column (Year), then specify the name of the column that will contain the values (Popn), and then the columns to be converted from wide to long.
We first create a data frame.
library(dplyr)
grades <- tbl_df(read.table(header = TRUE, text = "
ID Test Year Fall Spring Winter
1 1 2008 15 16 19
1 1 2009 12 13 27
1 2 2008 22 22 24
1 2 2009 10 14 20
2 1 2008 12 13 25
2 1 2009 16 14 21
2 2 2008 13 11 29
2 2 2009 23 20 26
3 1 2008 11 12 22
3 1 2009 13 11 27
3 2 2008 17 12 23
3 2 2009 14 9 31
"))
grades.long = grades %>%
gather(Semester, Score, Fall:Winter) %>%
mutate(Test = paste0("TestType", Test)) %>%
spread(Test, Score) %>%
arrange(ID, Year, Semester)
kable(head(grades.long))
ID | Year | Semester | TestType1 | TestType2 |
---|---|---|---|---|
1 | 2008 | Fall | 15 | 22 |
1 | 2008 | Spring | 16 | 22 |
1 | 2008 | Winter | 19 | 24 |
1 | 2009 | Fall | 12 | 10 |
1 | 2009 | Spring | 13 | 14 |
1 | 2009 | Winter | 27 | 20 |
This looks quirky but it isn’t; it shows you how you can blend tidyr
and dplyr
to end up with the desired result. Let us see how it works.
gather()
will take multiple columns and put them into a single column (what we’ve referred to as going from wide to long). Here the gather command is taking the three semesters (Fall, Spring, Winter) and collapsing these into a single column called Semester, and putting the test scores from Fall, Spring and Winter into a new column called Score.step1 = grades %>%
gather(Semester, Score, Fall:Winter)
kable(head(step1, 12))
ID | Test | Year | Semester | Score |
---|---|---|---|---|
1 | 1 | 2008 | Fall | 15 |
1 | 1 | 2009 | Fall | 12 |
1 | 2 | 2008 | Fall | 22 |
1 | 2 | 2009 | Fall | 10 |
2 | 1 | 2008 | Fall | 12 |
2 | 1 | 2009 | Fall | 16 |
2 | 2 | 2008 | Fall | 13 |
2 | 2 | 2009 | Fall | 23 |
3 | 1 | 2008 | Fall | 11 |
3 | 1 | 2009 | Fall | 13 |
3 | 2 | 2008 | Fall | 17 |
3 | 2 | 2009 | Fall | 14 |
mutate()
creates a new variable. Here the new variable is called Test and is being created by pasting the string “TestType” to Test 1 and Test 2.step12 = grades %>%
gather(Semester, Score, Fall:Winter) %>%
mutate(Test = paste0("TestType", Test))
kable(head(step12, 12))
ID | Test | Year | Semester | Score |
---|---|---|---|---|
1 | TestType1 | 2008 | Fall | 15 |
1 | TestType1 | 2009 | Fall | 12 |
1 | TestType2 | 2008 | Fall | 22 |
1 | TestType2 | 2009 | Fall | 10 |
2 | TestType1 | 2008 | Fall | 12 |
2 | TestType1 | 2009 | Fall | 16 |
2 | TestType2 | 2008 | Fall | 13 |
2 | TestType2 | 2009 | Fall | 23 |
3 | TestType1 | 2008 | Fall | 11 |
3 | TestType1 | 2009 | Fall | 13 |
3 | TestType2 | 2008 | Fall | 17 |
3 | TestType2 | 2009 | Fall | 14 |
spread()
shifts data from the long format to the wide format. Here, the spread command is then taking the newly created “Test” column and creating a column for each unique TestType. In these new columns the corresponding values from the Score column are being inserted.step123 = grades %>%
gather(Semester, Score, Fall:Winter) %>%
mutate(Test = paste0("TestType", Test)) %>%
spread(Test, Score)
kable(head(step123, 12))
ID | Year | Semester | TestType1 | TestType2 |
---|---|---|---|---|
1 | 2008 | Fall | 15 | 22 |
1 | 2008 | Spring | 16 | 22 |
1 | 2008 | Winter | 19 | 24 |
1 | 2009 | Fall | 12 | 10 |
1 | 2009 | Spring | 13 | 14 |
1 | 2009 | Winter | 27 | 20 |
2 | 2008 | Fall | 12 | 13 |
2 | 2008 | Spring | 13 | 11 |
2 | 2008 | Winter | 25 | 29 |
2 | 2009 | Fall | 16 | 23 |
2 | 2009 | Spring | 14 | 20 |
2 | 2009 | Winter | 21 | 26 |
arrange()
organizes the data into a particular order. Here we are asking the final result to be arranged by ID, Year and Semester.grades.long = grades %>%
gather(Semester, Score, Fall:Winter) %>%
mutate(Test = paste0("TestType", Test)) %>%
spread(Test, Score) %>%
arrange(ID, Year, Semester)
kable(head(grades.long))
ID | Year | Semester | TestType1 | TestType2 |
---|---|---|---|---|
1 | 2008 | Fall | 15 | 22 |
1 | 2008 | Spring | 16 | 22 |
1 | 2008 | Winter | 19 | 24 |
1 | 2009 | Fall | 12 | 10 |
1 | 2009 | Spring | 13 | 14 |
1 | 2009 | Winter | 27 | 20 |
This example shows how you can chain commands via the pipe operator %>% and rather than breaking up the data manipulation tasks collapse all the commands into a single piped command. The trickier the data you are working with, the more difficult this is to do well. One trick is to take a small portion of your complicated data and subset it into a small portion that has maybe two or three unique units (individuals, geographies, etc). Then see if you can get the code to do what you need it to do. If it works on this subset then it will work on the full data set as well.