As discussed previously, R can read data created in various formats, both from files downloaded and saved to your machine and directly from the web. The goal today is not to focus on reading these data. Rather, the goal is to see how we can work with, say, a Census data file or a file from the CDC, the Bureau of Labor Statistics, the Bureau of Economic Analysis, the United Nations, and so on. Our path will be simple: download the raw data, then clean and shape it into the version we need for our analyses.
For a decade or more now there has been tremendous interest in making some data open – available to anyone who wishes to use it – but this trend was hastened by governments signing on and encouraging open data. Take a look at some of these sites …
Let us download a specific file – the Voting Age Population by Citizenship and Race (CVAP) data compiled from the American Community Survey (ACS) 5-year (2009-2013) collection. The file can be downloaded from here. Once the file downloads, double-click the zip archive and extract all files. Documentation for these data is available here.
Once you extract the data, load the Place data as shown below:
setwd("~/Downloads/")
df.cvap = read.csv("CVAP_CSV_Format_2009-2013/Place.csv", sep = ",",
header = TRUE)
If you open the data frame you will see multiple rows per Place. Take a look at the documentation to understand what the columns represent.
Each row represents a population group, identified by the LNTITLE (group label) and LNNUMBER (group code) columns.
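The full list of groups is in the documentation, but you can also recover it directly from the data. A minimal sketch, assuming the columns are named LNNUMBER and LNTITLE as in the CVAP codebook:

```r
# List the unique group code/label pairs that repeat for every Place
unique(df.cvap[, c("LNNUMBER", "LNTITLE")])
```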
The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
The 2014 BRFSS survey data can be downloaded in ASCII format from here while the codebook is here and the variable layout is here.
Note: ASCII (pronounced ask-ee) stands for the American Standard Code for Information Interchange and is a scheme for representing English characters as numbers, with each character assigned a number from 0 to 127. “For example, the ASCII code for uppercase M is 77. Most computers use ASCII codes to represent text, which makes it possible to transfer data from one computer to another.”
If you download the data and look at the variable layout here, you will notice that every column has specific starting and ending positions. If we choose to read the downloaded ASCII data we will have to pass the variable layout information to R. I will show you how to do this, but let us first take a shortcut via one of the other formats the CDC so kindly makes available for download, namely the SAS Transport Format found here.
Download and extract two files and put both in your working directory.
Then execute the following code:
setwd("~/Downloads/")
library(foreign)
brfss2014a = read.xport("LLCP2014.XPT ")
Note: The file name really does contain a trailing blank space (it is "LLCP2014.XPT " and not "LLCP2014.XPT"), so run the code exactly as written.
When the code executes you should see brfss2014a in the upper-right window of RStudio. Very easy, but what if we were given the raw ASCII data? Then we would need the variable layout information, plus a library that can read fixed-width files. Two I use are readr (my preferred library) and LaF, and both are shown in action below.
Note: The original column layout specifications had to be modified because the third variable listed online (IDATE) subsumes the three that follow it. As such, IDATE was deleted from the specifications below.
The readr code specifies the start and end of the columns, as well as the column names.
fwf_positions_begin = c(1, 16, 18, 20, 22, 31, 35, 45, 62, 63,
64, 65, 67, 68, 70, 72, 80, 81, 83, 85, 87, 88, 89, 90, 91,
92, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105,
106, 108, 109, 146, 147, 148, 150, 151, 152, 154, 158, 170,
171, 172, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185,
186, 187, 188, 189, 190, 192, 193, 196, 198, 200, 202, 203,
209, 210, 211, 213, 215, 216, 218, 219, 220, 221, 222, 223,
224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235,
236, 237, 243, 255, 256, 257, 258, 261, 264, 266, 268, 270,
271, 272, 273, 275, 277, 279, 281, 282, 284, 285, 310, 311,
312, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 326,
327, 329, 330, 332, 334, 336, 339, 340, 341, 342, 343, 345,
346, 347, 348, 349, 351, 352, 354, 356, 357, 359, 360, 361,
362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 573,
574, 575, 576, 578, 579, 581, 582, 583, 584, 590, 591, 595,
623, 625, 626, 627, 628, 629, 645, 647, 1356, 1357, 1363,
1393, 1403, 1417, 1419, 1421, 1425, 1426, 1427, 1430, 1473,
1612, 1669, 1671, 1673, 1674, 1675, 1711, 1907, 1910, 1913,
1916, 2155, 2156, 2157, 2158, 2159, 2160, 2161, 2162, 2163,
2164, 2221, 2223, 2227, 2228, 2229, 2230, 2231, 2232, 2234,
2235, 2236, 2239, 2242, 2247, 2251, 2252, 2253, 2254, 2255,
2256, 2257, 2258, 2259, 2262, 2263, 2267, 2271, 2272, 2273,
2274, 2275, 2276, 2277, 2278, 2279, 2281, 2283, 2284, 2286,
2292)
width = c(2, 2, 2, 2, 4, 4, 10, 10, 1, 1, 1, 1, 1, 2, 2, 2, 1,
2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 4, 4, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 2, 1, 6,
1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 6, 2, 1, 1, 1, 3, 3, 2, 2, 2, 1, 1, 1,
2, 2, 2, 2, 1, 2, 1, 25, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 3, 1, 2, 1, 2, 2, 2, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2,
1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 1, 2, 1, 1, 1, 6, 1, 4, 28, 2, 1, 1, 1, 1, 1,
2, 2, 1, 6, 10, 10, 10, 2, 2, 1, 1, 1, 1, 10, 10, 1, 2, 2,
1, 1, 1, 10, 3, 3, 3, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
2, 1, 1, 1, 1, 1, 2, 1, 1, 3, 3, 5, 4, 1, 1, 1, 1, 1, 1,
1, 1, 3, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1)
fwf_positions_end = fwf_positions_begin + (width - 1)
col_names = c("STATE", "FMONTH", "IMONTH", "IDAY", "IYEAR", "DISPCODE",
"SEQNO", "_PSU", "CTELENUM", "PVTRESD1", "COLGHOUS", "STATERES",
"LADULT", "NUMADULT", "NUMMEN", "NUMWOMEN", "GENHLTH", "PHYSHLTH",
"MENTHLTH", "POORHLTH", "HLTHPLN1", "PERSDOC2", "MEDCOST",
"CHECKUP1", "EXERANY2", "SLEPTIM1", "CVDINFR4", "CVDCRHD4",
"CVDSTRK3", "ASTHMA3", "ASTHNOW", "CHCSCNCR", "CHCOCNCR",
"CHCCOPD1", "HAVARTH3", "ADDEPEV2", "CHCKIDNY", "DIABETE3",
"DIABAGE2", "LASTDEN3", "RMVTETH3", "VETERAN3", "MARITAL",
"CHILDREN", "EDUCA", "EMPLOY1", "INCOME2", "WEIGHT2", "HEIGHT3",
"NUMHHOL2", "NUMPHON2", "CPDEMO1", "INTERNET", "RENTHOM1",
"SEX", "PREGNANT", "QLACTLM2", "USEEQUIP", "BLIND", "DECIDE",
"DIFFWALK", "DIFFDRES", "DIFFALON", "SMOKE100", "SMOKDAY2",
"STOPSMK2", "LASTSMK2", "USENOW3", "ALCDAY5", "AVEDRNK2",
"DRNK3GE5", "MAXDRNKS", "FLUSHOT6", "FLSHTMY2", "PNEUVAC3",
"SHINGLE2", "FALL12MN", "FALLINJ2", "SEATBELT", "DRNKDRI2",
"HADMAM", "HOWLONG", "PROFEXAM", "LENGEXAM", "HADPAP2", "LASTPAP2",
"HADHYST2", "PCPSAAD2", "PCPSADI1", "PCPSARE1", "PSATEST1",
"PSATIME", "PCPSARS1", "BLDSTOOL", "LSTBLDS3", "HADSIGM3",
"HADSGCO1", "LASTSIG3", "HIVTST6", "HIVTSTD3", "WHRTST10",
"PDIABTST", "PREDIAB1", "INSULIN", "BLDSUGAR", "FEETCHK2",
"DOCTDIAB", "CHKHEMO3", "FEETCHK", "EYEEXAM", "DIABEYE",
"DIABEDU", "PAINACT2", "QLMENTL2", "QLSTRES2", "QLHLTH2",
"MEDICARE", "HLTHCVR1", "DELAYMED", "DLYOTHER", "NOCOV121",
"LSTCOVRG", "DRVISITS", "MEDSCOST", "CARERCVD", "MEDBILL1",
"ASBIALCH", "ASBIDRNK", "ASBIBING", "ASBIADVC", "ASBIRDUC",
"WTCHSALT", "LONGWTCH", "DRADVISE", "ASTHMAGE", "ASATTACK",
"ASERVIST", "ASDRVIST", "ASRCHKUP", "ASACTLIM", "ASYMPTOM",
"ASNOSLEP", "ASTHMED3", "ASINHALR", "IMFVPLAC", "TETANUS",
"HPVTEST", "HPLSTTST", "HPVADVC2", "HPVADSHT", "CNCRDIFF",
"CNCRAGE", "CNCRTYP1", "CSRVTRT1", "CSRVDOC1", "CSRVSUM",
"CSRVRTRN", "CSRVINST", "CSRVINSR", "CSRVDEIN", "CSRVCLIN",
"CSRVPAIN", "CSRVCTL1", "RRCLASS2", "RRCOGNT2", "RRATWRK2",
"RRHCARE3", "RRPHYSM2", "RREMTSM2", "SCNTMNY1", "SCNTMEL1",
"SCNTPAID", "SCNTWRK1", "SCNTLPAD", "SCNTLWK1", "SCNTVOT1",
"SXORIENT", "TRNSGNDR", "RCSBIRTH", "RCSGENDR", "RCHISLA1",
"RCSRACE1", "RCSBRAC1", "RCSRLTN2", "CASTHDX2", "CASTHNO2",
"EMTSUPRT", "LSATISFY", "QSTVER", "QSTLANG", "MSCODE", "_STSTR",
"_STRWT", "_RAWRAKE", "_WT2RAKE", "_AGE80", "_IMPRACE", "_IMPNPH",
"_IMPEDUC", "_IMPMRTL", "_IMPHOME", "_LANDWT2", "_LANDWT",
"_CHISPNC", "_CPRACE", "_CRACE1", "_IMPCAGE", "_IMPCRAC",
"_IMPCSEX", "_CLANDWT", "NPHH", "NAHH", "NCHH", "_HHOLDWT",
"_RFHLTH", "_HCVU651", "_TOTINDA", "_LTASTH1", "_CASTHM1",
"_ASTHMS1", "_DRDXAR1", "_EXTETH2", "_ALTETH2", "_DENVST2",
"_PRACE1", "_MRACE1", "_HISPANC", "_RACE", "_RACEG21", "_RACEGR3",
"_RACE_G1", "_AGEG5YR", "_AGE65YR", "_AGE_G", "HTIN4", "HTM4",
"WTKG3", "_BMI5", "_BMI5CAT", "_RFBMI5", "_CHLDCNT", "_EDUCAG",
"_INCOMG", "_SMOKER3", "_RFSMOK3", "DRNKANY5", "DROCDY3_",
"_RFBING5", "_DRNKDY4", "_DRNKMO4", "_RFDRHV4", "_RFDRMN4",
"_RFDRWM4", "_FLSHOT6", "_PNEUMO2", "_RFSEAT2", "_RFSEAT3",
"_RFMAM2Y", "_MAM502Y", "_RFPAP32", "_RFPSA21", "_RFBLDS2",
"_RFSIGM2", "_AIDTST3")
library(readr)
brfss2014b = read_fwf("~/Downloads/LLCP2014.ASC ", fwf_positions(fwf_positions_begin,
fwf_positions_end, col_names))
brfss2014b = subset(brfss2014b, STATE == "39")
save(brfss2014b, file = "~/Downloads/brfss2014b.RData")
Next you see the LaF version.
library(LaF)
laf = laf_open_fwf("LLCP2014.ASC ", column_widths = width, column_types = rep("character",
length(width)), column_names = col_names)
brfss2014c = data.frame(laf[, ])
save(brfss2014c, file = "~/Downloads/brfss2014c.RData")
Notice that with LaF we had to specify the column widths via column_widths = width and the column headings via column_names = col_names.
When these files are read, some of the variables will come in as characters when in fact they are numeric (see STATE, for example) or factors (see PHYSHLTH, for example). Others will be read as integers (see GENHLTH), and so on. Consequently, we often have to reset the read formats. Further, we know from the codebook that missing values and other responses (such as refused) carry numeric codes in the survey. For instance, GENHLTH stores the responses Excellent, Very Good, Good, Fair, and Poor as the numeric codes 1 through 5.
We can take this mapping and fix GENHLTH by creating a new variable, say general.health as a factor.
load("~/Downloads/brfss2014b.RData")
brfss2014b$general.health = factor(brfss2014b$GENHLTH, levels = c(1,
2, 3, 4, 5), labels = c("Excellent", "Very Good", "Good",
"Fair", "Poor"))
table(brfss2014b$general.health)
Excellent | Very Good | Good | Fair | Poor |
---|---|---|---|---|
1689 | 3499 | 3425 | 1643 | 651 |
Let us also flip STATE into a numeric variable, and format FMONTH, IMONTH, IDAY, IYEAR as dates. For dates the lubridate
package comes in very handy.
brfss2014b$STATE = as.numeric(brfss2014b$STATE)
We’ll combine IMONTH, IDAY and IYEAR to create a date variable.
brfss2014b$date = with(brfss2014b, sprintf("%s-%s-%s", IMONTH,
IDAY, IYEAR))
head(brfss2014b$date)
## [1] "11-05-2014" "07-14-2014" "12-16-2014" "09-15-2014" "09-08-2014"
## [6] "08-29-2014"
library(lubridate)
brfss2014b$Date = mdy(brfss2014b$date)
is.Date(brfss2014b$Date)
## [1] FALSE
str(brfss2014b$Date)
## POSIXct[1:10933], format: "2014-11-05" "2014-07-14" "2014-12-16" "2014-09-15" ...
Note how we built date: we first concatenated the three columns and then used lubridate to parse the result via the mdy() specification. The is.Date() check returns FALSE because mdy() created a POSIXct date-time rather than a Date object; the str() output confirms that the dates were nevertheless parsed correctly. Now that Date has been created we could drop the original three columns, but we will let that be for now.
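If you did want to drop them, setting each column to NULL does it:

```r
# Drop the original month/day/year columns (and the intermediate string)
brfss2014b$IMONTH <- NULL
brfss2014b$IDAY <- NULL
brfss2014b$IYEAR <- NULL
brfss2014b$date <- NULL
```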
The Bureau of Labor Statistics (BLS) compiles a lot of data, including some under the Local Area Unemployment Statistics (LAUS) program – a Federal-State cooperative effort in which monthly estimates of total employment and unemployment are prepared for approximately 7,300 areas:
These estimates are key indicators of local economic conditions. The Bureau of Labor Statistics (BLS) of the U.S. Department of Labor is responsible for the concepts, definitions, technical procedures, validation, and publication of the estimates that State employment security agencies prepare under agreement with BLS. Data available at BLS can be found here
Similarly, the Bureau of Economic Analysis (BEA) has a regional economic accounts program that details the geographic distribution of U.S. economic activity and growth. “The estimates of gross domestic product by state and state and local area personal income, and the accompanying detail, provide a consistent framework for analyzing and comparing individual state and local area economies.”
While the BEA data read below were obtained from here, the BLS data read below were obtained from here and here, respectively.
setwd("~/Downloads/")
df.bea = read.csv("CA1/CA1_1969_2014_OH.csv")
df.bls1 = read.csv("http://www.bls.gov/lau/laucnty14.txt", header = FALSE,
skip = 6, sep = "")
colnames(df.bls1) = c("LAUS Code", "StateFIPSCode", "CountyFIPSCode",
"CountyName", "StateAbbreviation", "Year", "TotalLaborForce",
"TotalEmployed", "TotalUnemployed", "UnemploymentRate")
df.bls2 = read.csv("http://www.bls.gov/cew/data/api/2015/2/industry/10.csv",
sep = ",", header = TRUE)
The Initiative and Referendum Institute has long gathered data on direct democracy in the states. You can read more here. Two data sets are readily available for download here:
IRI initiatives by state and year for the period 1904 through 2010. This database lists all statewide initiatives to appear on the ballot since the first statewide initiative in Oregon in 1904. The list does not contain information on issues placed on the ballot by state legislatures, commonly referred to as “legislative measures.”
setwd("~/Downloads/")
library(readxl)
df.ir = read_excel("IRI initiatives by state-year (1904-2010).xls",
col_names = FALSE)
colnames(df.ir) = c("StateName", "Year", "Number", "ApprovalRate")
The Legal Landscape Database describes direct democracy provisions in American cities in 2005. The database includes the 1,000 largest cities in the country as well as the ten largest cities in every state. Data items include city name, FIPS code, population, and initiative status.
setwd("~/Downloads/")
df.ircity = read_excel("Legal Landscape (version 20080504).xls")
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 00 00 e5 05 00 00 05 00
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 34 30 32 33 30 30 30 30 64 62 63 62 36 3c 00 00 0b 00 8a 00
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 34 39 61 31 30 30 30 30 32 36 37 37 33 3c 00 00 07 00 67 00
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 36 30 36 38 30 30 30 30 34 38 61 36 35 3c 00 00 09 00 67 00
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 38 66 61 39 30 30 30 30 34 39 39 62 34 3c 00 00 05 00 67 00
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 62 61 33 62 30 30 30 30 37 64 33 30 31 3c 00 00 0d 00 8a 00
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 62 64 36 38 30 30 30 30 36 34 32 38 31 3c 00 00 03 00 67 00
“The Common Core of Data (CCD) is a program of the U.S. Department of Education’s National Center for Education Statistics that annually collects fiscal and non-fiscal data about all public schools, public school districts and state education agencies in the United States. The data are supplied by state education agency officials and include information that describes schools and school districts, including name, address, and phone number; descriptive information about students and staff, including demographics; and fiscal data, including revenues and current expenditures.” You can learn more about the program here and access the data and documentation here.
setwd("~/Downloads/")
df.nces = read.csv("nces1314.csv", header = TRUE, sep = ",",
skip = 5)
colnames(df.nces) = c("SchoolName", "State", "City", "Address",
"StateAbb", "ZIP", "ZIP4", "StateAbb2", "SchoolID", "Agency",
"AgencyID", "County", "Countynumber", "StateFIPS", "SchoolType",
"AgencyType", "Status", "Charter", "Magnet", "Latitude",
"Longitude", "StateSchoolID", "StateAgencyID", "VirtualSchool",
"FreeLunch", "ReducedLunch", "TotalFRL.N", "TotalStudents",
"FTE.Teachers", "PupilTeacherRatio")
These are data for a single year. If we wanted to build a panel (i.e., data for multiple schools over multiple years) we would have to download all the files and then bind them into a single data frame.
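A rough sketch of that workflow, with hypothetical file names for two survey years (the real NCES files will have their own names, and the columns are assumed to match across years):

```r
# Hypothetical file names -- substitute the files you actually downloaded
files <- c("nces1213.csv", "nces1314.csv")
years <- c("2012-13", "2013-14")

df.list <- lapply(seq_along(files), function(i) {
    d <- read.csv(files[i], header = TRUE, sep = ",", skip = 5)
    d$SchoolYear <- years[i]  # tag each row with its survey year
    d
})
df.panel <- do.call(rbind, df.list)  # stack the years into one data frame
```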
Data are rarely clean, unless you have a little army working for you that knows exactly what is needed to prepare the data for analysis or, if you are lucky, someone else has shipped you an essentially clean version of the data. More often than not you will end up spending 90% of your time cleaning and preparing your data for analysis. Some of the more common procedures you might run into are outlined below, but do note that these are but a smidgen of what may need to be done or what is possible with R.
We will look at a small array of operations, starting with the easiest – sub-setting data.
Sub-setting a data set means creating a smaller piece of the original data by selecting observations that meet a criterion of interest. For example, maybe we are not interested in the nationwide brfss2014b data we downloaded but instead just the data for Ohio. This could be easily achieved as follows:
setwd("~/Downloads/")
load("brfss2014c.RData")
brfss2014c.OH = subset(brfss2014c, STATE == 39)
Take another example, the Census data we downloaded earlier. Suppose we wanted to focus our analysis on all Places with a population size of 25000 or more. How might we do this?
df.cvap.sub = subset(df.cvap, TOT_EST >= 25000)
Wait a minute. Let us make it even smaller, by focusing only on Ohio’s Places with a total population of at least 25000. To do this we’ll have to split the GEONAME column to create two columns – Placename and Statename.
library(splitstackshape)
df.cvap2 = cSplit(as.data.table(df.cvap), "GEONAME", ",")
Now we have two new columns, GEONAME_1 and GEONAME_2, so let us rename them as needed.
names(df.cvap2)[names(df.cvap2) == "GEONAME_1"] <- "PlaceName"
names(df.cvap2)[names(df.cvap2) == "GEONAME_2"] <- "StateName"
Now let us subset this data frame to places in Ohio with a total population of 25000 or more.
df.cvap3 = subset(df.cvap2, StateName == "Ohio" & TOT_EST >= 25000)
What if we wanted multiple states, say Ohio and West Virginia, and only retain the rows for Total Population?
df.cvap4 = subset(df.cvap2, (StateName == "Ohio" & TOT_EST >= 25000 & LNNUMBER == 1) |
    (StateName == "West Virginia" & TOT_EST >= 25000 & LNNUMBER == 1))
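The same subset can be written a little more compactly with %in%, which also scales to longer lists of states:

```r
df.cvap4 <- subset(df.cvap2, StateName %in% c("Ohio", "West Virginia") &
    TOT_EST >= 25000 & LNNUMBER == 1)
```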
We often end up having to construct new variables because the original variables are not perfectly suited for our purposes. Maybe we want to create a new variable based on quartiles or collapse some nominal or ordinal variable into fewer categories. We’ll see a couple of ways to carry out these recodings. First, let us create a new category called LargeCity in the df.cvap data set.
df.cvap$LargeCity = NA
df.cvap$LargeCity[df.cvap$TOT_EST <= 1e+05 & df.cvap$LNNUMBER == 1] = "Not Large City"
df.cvap$LargeCity[df.cvap$TOT_EST > 1e+05 & df.cvap$LNNUMBER == 1] = "Large City"
table(df.cvap$LargeCity)
Large City | Not Large City |
---|---|
303 | 29206 |
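The quartile-based recode mentioned earlier can be built with quantile() and cut(); a sketch using the total population estimate (and assuming the quartile cut points of TOT_EST are distinct):

```r
# Cut TOT_EST at its quartiles, yielding a four-level factor
df.cvap$PopnQuartile <- cut(df.cvap$TOT_EST,
    breaks = quantile(df.cvap$TOT_EST, na.rm = TRUE),
    labels = c("Q1", "Q2", "Q3", "Q4"), include.lowest = TRUE)
table(df.cvap$PopnQuartile)
```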
Broadly summarized, data tend to arrive in one of two formats – either in what we call the wide
format (shown below)
GeoName | LineCode | Description | X1969 | X1970 | … | X2014 |
---|---|---|---|---|---|---|
Adams | 1 | Population | 18895 | 19117 | … | 28129 |
Adams | 2 | Income | 45179 | 45202 | … | 51039 |
Athens | 1 | Population | 16895 | 17117 | … | 31129 |
Athens | 2 | Income | 41179 | 41102 | … | 61572 |
or in the long
format (shown below).
GeoName | LineCode | Description | Year | Estimate |
---|---|---|---|---|
Adams | 1 | Population | 1969 | 18895 |
Adams | 1 | Population | 1970 | 19117 |
Athens | 1 | Population | 1969 | 16895 |
Athens | 1 | Population | 1970 | 17117 |
setwd("~/Downloads/CA1/")
df = read.csv("CA1_1969_2014_OH.csv", sep = ",", header = TRUE)
head(df[c(2, 7, 8, 9)], n = 9)
GeoName | Description | X1969 | X1970 |
---|---|---|---|
Ohio state total | Personal income (thousands of dollars) | 41814714 | 44121386 |
Ohio state total | Population (persons) 1/ | 10563000 | 10668839 |
Ohio state total | Per capita personal income (dollars) 2/ | 3959 | 4136 |
Adams, OH | Personal income (thousands of dollars) | 45171 | 49857 |
Adams, OH | Population (persons) 1/ | 18895 | 19117 |
Adams, OH | Per capita personal income (dollars) 2/ | 2391 | 2608 |
Allen, OH | Personal income (thousands of dollars) | 415365 | 439215 |
Allen, OH | Population (persons) 1/ | 110340 | 111084 |
Allen, OH | Per capita personal income (dollars) 2/ | 3764 | 3954 |
df2 = subset(df, Region != "NA", )
df2$IndustryClassification = NULL
df2$Table = NULL
df2$Region = NULL
df2$LineCode = factor(df2$LineCode, levels = c(1, 2, 3), labels = c("Personal Income",
"Population", "Per Capita Personal Income"))
df2$Description = NULL
library(reshape2)
df3 = melt(df2, id.vars = c("GeoFIPS", "GeoName", "LineCode"),
variable.name = "Year", value.name = "Estimate")
df4 = as.data.frame(sapply(df3, gsub, pattern = "X", replacement = ""))
df5 <- dcast(df4, GeoName + Year ~ LineCode, value.var = "Estimate")
colnames(df5) = c("GeoName", "Year", "pcpersinc", "persinc",
"popn")
cols = c(3, 4, 5)
df5[, cols] = apply(df5[, cols], 2, function(x) as.numeric(as.character(x)))
head(df5, n = 9)
GeoName | Year | pcpersinc | persinc | popn |
---|---|---|---|---|
Adams, OH | 1969 | 2391 | 45171 | 18895 |
Adams, OH | 1970 | 2608 | 49857 | 19117 |
Adams, OH | 1971 | 2566 | 50988 | 19870 |
Adams, OH | 1972 | 2708 | 56026 | 20687 |
Adams, OH | 1973 | 2897 | 63292 | 21847 |
Adams, OH | 1974 | 3253 | 73276 | 22525 |
Adams, OH | 1975 | 3553 | 79034 | 22247 |
Adams, OH | 1976 | 3703 | 85818 | 23178 |
Adams, OH | 1977 | 4200 | 98894 | 23546 |
Sometimes the data are not rectangularized. That is, although every unit is supposed to have the same number of rows and columns as all others, at times the data do not reflect this property. One example is the Ohio Department of Education’s district-level data. Let us download three files: one with total enrollment, another with the number of students with an economic disadvantage, and a third with the racial distribution of students. These data can be downloaded from here. Make sure that when you Export the results you do so as a plain text, tab-delimited file, with Export Report Title deselected.
setwd("~/Downloads/")
df.a = read.csv("Enrollment (District).txt", skip = 4, header = FALSE,
sep = "\t")
summary(df.a)
V1 | V2 | V3 | V4 | |
---|---|---|---|---|
Min. : 125 | Buckeye Local : 3 | Mode:logical | 88: 5 | |
1st Qu.: 43588 | Crestview Local: 3 | NA’s:1061 | 98: 5 | |
Median : 46359 | Eastern Local : 3 | NA | 139: 5 | |
Mean : 47115 | Green Local : 3 | NA | 36: 4 | |
3rd Qu.: 49601 | Madison Local : 3 | NA | 62: 4 | |
Max. :151209 | Northwest Local: 3 | NA | 108: 4 | |
NA | (Other) :1043 | NA | (Other) :1034 |
df.a$V3 = NULL
colnames(df.a) = c("DistrictIRN", "DistrictName", "Enrollment")
df.b = read.csv("Enrollment by Student Demographic (District).txt",
skip = 4, header = FALSE, sep = "\t")
summary(df.b)
V1 | V2 | V3 | V4 | V5 | V6 | |
---|---|---|---|---|---|---|
Min. : 125 | Perry Local : 14 | American Indian or Alaskan Native: 43 | Mode:logical | 11: 93 | >95%: 285 | |
1st Qu.: 43950 | Buckeye Local : 13 | Asian :281 | NA’s:3372 | 12: 90 | 1.7%: 59 | |
Median : 46094 | Northwest Local : 13 | Black, Non-Hispanic :707 | NA | 14: 80 | .9%: 57 | |
Mean : 47575 | Springfield Local: 13 | Hispanic :652 | NA | 13: 64 | 1.2%: 55 | |
3rd Qu.: 49205 | Madison Local : 12 | Multiracial :712 | NA | 16: 63 | 1.3%: 55 | |
Max. :151209 | Lakota Local : 10 | Pacific Islander : 18 | NA | 15: 60 | .8%: 50 | |
NA | (Other) :3297 | White, Non-Hispanic :959 | NA | (Other) :2922 | (Other):2811 |
df.b$V2 = NULL
df.b$V4 = NULL
df.b$V6 = NULL
colnames(df.b) = c("DistrictIRN", "Race", "Number")
df.c = read.csv("Enrollment by Student Demographic (District) (1).txt",
skip = 4, header = FALSE, sep = "\t")
summary(df.c)
V1 | V2 | V3 | V4 | V5 | V6 | |
---|---|---|---|---|---|---|
Min. : 125 | Buckeye Local : 6 | N: 902 | Mode:logical | 10: 10 | >95%: 210 | |
1st Qu.: 43968 | Crestview Local: 6 | Y:1009 | NA’s:1911 | 11: 9 | 36.3%: 8 | |
Median : 46565 | Green Local : 6 | NA | NA | 16: 9 | 63.7%: 8 | |
Mean : 47795 | Madison Local : 6 | NA | NA | 20: 8 | 42.2%: 7 | |
3rd Qu.: 49639 | Northwest Local: 6 | NA | NA | 22: 8 | 43.7%: 7 | |
Max. :151209 | Perry Local : 6 | NA | NA | 76: 8 | 46.1%: 7 | |
NA | (Other) :1875 | NA | NA | (Other) :1859 | (Other):1664 |
df.c$V2 = NULL
df.c$V4 = NULL
df.c$V6 = NULL
colnames(df.c) = c("DistrictIRN", "Econ", "Number")
We read in each file with read.csv("filepath", skip = 4, header = FALSE, sep = "\t"), skipping 4 lines from the top (because this gives us a cleaner file). Because there are no column headers, the data frame ends up with columns named V1, V2, and so on. I am dropping certain columns because they are either blank or because we don’t need them. The colnames(dataframe) = c("...") command then assigns meaningful column headers to each data frame.
df.a has one row per district but df.c should have two rows per district – one row for the number of students flagged as economically disadvantaged (Econ == “Y”) and the other row for the number of students flagged as not economically disadvantaged (Econ == “N”). Let us see how many rows there are:
table(df.c$Econ)
N | Y |
---|---|
902 | 1009 |
Aha! Only 902 “N” and 1009 “Y”. If these were equal in number we would have a rectangular data frame, but we can make it so via the epicalc library and the following code:
library(epicalc)
df.c2 <- fillin(df.c, select = c(DistrictIRN, Econ))
table(df.c2$Econ)
N | Y |
---|---|
1061 | 1061 |
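One caveat: epicalc has since been archived on CRAN, so it may not install on a current R setup. The same rectangularization can be sketched in base R by building the full grid of district/Econ combinations and merging the observed data onto it:

```r
# Every DistrictIRN x Econ combination that should exist
full <- expand.grid(DistrictIRN = unique(df.c$DistrictIRN),
    Econ = unique(df.c$Econ))
# Left-join the observed counts onto the grid; absent cells become NA
df.c2 <- merge(full, df.c, by = c("DistrictIRN", "Econ"), all.x = TRUE)
table(df.c2$Econ)
```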
The race data, df.b, have the same problem and can be fixed similarly.
table(df.b$Race)
American Indian or Alaskan Native | Asian | Black, Non-Hispanic | Hispanic | Multiracial | Pacific Islander | White, Non-Hispanic |
---|---|---|---|---|---|---|
43 | 281 | 707 | 652 | 712 | 18 | 959 |
df.b2 <- fillin(df.b, select = c(DistrictIRN, Race))
table(df.b2$Race)
American Indian or Alaskan Native | Asian | Black, Non-Hispanic | Hispanic | Multiracial | Pacific Islander | White, Non-Hispanic |
---|---|---|---|---|---|---|
1059 | 1059 | 1059 | 1059 | 1059 | 1059 | 1059 |
One last thing before we move on: let us flip DistrictIRN in each data frame, along with Enrollment (in df.a) and Number (in df.b2 and df.c2), into numeric variables. Pay attention to the code; there is a subtle difference in how the comma-formatted count variables are flipped.
df.a$DistrictIRN2 = as.numeric(df.a$DistrictIRN)
df.a$Enrollment2 = as.numeric(gsub(",", "", df.a$Enrollment))
df.b2$DistrictIRN2 = as.numeric(df.b2$DistrictIRN)
df.b2$Number2 = as.numeric(gsub(",", "", df.b2$Number))
df.c2$DistrictIRN2 = as.numeric(df.c2$DistrictIRN)
df.c2$Number2 = as.numeric(gsub(",", "", df.c2$Number))
If a numeric variable has been read as a Factor and contains commas, the gsub(",", "", df$variable) portion of the code strips the commas before the variable is converted to numeric. Note that to be safe I am creating new variables in each data frame, one for every format conversion. This way, if something goes wrong during the conversion we will still have the original variable to rework.
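A quick illustration of why the comma-stripping matters, using a single made-up value:

```r
x <- factor("1,234")           # a count read in as a factor, comma and all
as.numeric(as.character(x))    # returns NA with a warning -- the comma blocks conversion
as.numeric(gsub(",", "", x))   # strips the comma first, then converts: 1234
```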
Now that we have the education data in three files we can merge them into a single data frame. This is often necessary since data will not always be downloaded in one single file. Merging is fairly seamless, so long as one is careful about a few things. First, the column (or columns) used to merge the files must be identically named and formatted in each file to be merged. This means DistrictIRN2 in each data frame must end up named DistrictIRN, which we accomplish by dropping the original DistrictIRN columns and then renaming DistrictIRN2 to DistrictIRN.
df.a$DistrictIRN = NULL
df.b2$DistrictIRN = NULL
df.c2$DistrictIRN = NULL
df.a$Enrollment = NULL
df.b2$Number = NULL
df.c2$Number = NULL
names(df.a)[names(df.a) == "DistrictIRN2"] <- "DistrictIRN"
names(df.b2)[names(df.b2) == "DistrictIRN2"] <- "DistrictIRN"
names(df.c2)[names(df.c2) == "DistrictIRN2"] <- "DistrictIRN"
Second, you need to decide beforehand whether you want a 1:1 merge, where only records present in both files are kept. Thus, for example, if we have 1061 districts in df.a and df.b2 but only 1059 districts in df.c, a 1:1 merge leaves the merged data frame with only 1059 districts. We could, of course, tweak the code to ask that the merged data frame include all districts even if they appear in only one file; this is what is being done below.
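A toy example, with two small made-up data frames, makes the difference concrete:

```r
left <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c(10, 20, 30))
merge(left, right, by = "id")              # 1:1 merge: only ids 2 and 3 survive
merge(left, right, by = "id", all = TRUE)  # full merge: ids 1 through 4, with NAs filled in
```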
Before we merge though, I am going to reshape the Econ and the Race data frames.
df.c3 <- reshape(df.c2, timevar = "Econ", idvar = c("DistrictIRN"),
direction = "wide")
df.b3 <- reshape(df.b2, timevar = "Race", idvar = c("DistrictIRN"),
direction = "wide")
and now we merge df.a and df.c3 first.
merge.1 = merge(df.a, df.c3, by = c("DistrictIRN"), all = TRUE)
If you look at merge.1 you will see blank Number2.N cells for some districts. Let us fill these in by calculating Number2.N = Enrollment2 - Number2.Y
if Number2.N is blank. This can be done via the data.table
package.
library(data.table)
df1 = data.table(merge.1)
df1.b = df1[is.na(Number2.N), `:=`(Number2.N, Enrollment2 - Number2.Y)]
Now we can create a PctEconDis column that will be (Number2.Y / Enrollment2) * 100
and then we can drop the Number2.N and Number2.Y columns.
df1.b$PctEconDis = with(df1, (Number2.Y/Enrollment2) * 100)
df1.b$Number2.N = NULL
df1.b$Number2.Y = NULL
Now we do the same thing with the Race data, merging df.a and df.b3. This one will be trickier; you’ll see why.
merge.2 = merge(df.a, df.b3, by = c("DistrictIRN"), all = TRUE)
Note how many cells are blank because we have data for a district on students of some race but not the other groups. As such we’ll have to figure out a way to create a variable called PctMinority. This could be easily done if we had data for every district for the Number2.White, Non-Hispanic variable. But this column has missing values too so we’ll use a modified version of the preceding code.
df2 = data.table(merge.2)
myRowSums = function(...) {
    # Row-wise sum across the supplied columns, treating NA as zero
    # (equivalent to rowSums(cbind(...), na.rm = TRUE))
    retVal = as.numeric()
    df = data.frame(...)
    for (r in row.names(df)) {
        retVal = append(retVal, sum(df[r, ], na.rm = TRUE))
    }
    return(retVal)
}
merge.2 <- within(merge.2, NonMinority <- ifelse(!is.na(`Number2.White, Non-Hispanic`),
`Number2.White, Non-Hispanic`, Enrollment2 - myRowSums(Number2.Asian,
`Number2.Black, Non-Hispanic`, Number2.Hispanic, `Number2.American Indian or Alaskan Native`,
`Number2.Pacific Islander`)))
merge.2$PctMinority = with(merge.2, ((Enrollment2 - NonMinority)/Enrollment2) *
100)
my.vars = c("DistrictIRN", "PctMinority")
df2.b = merge.2[my.vars]
Whew! Now we can merge df1.b and df2.b to construct a single data frame that has, for every district, Enrollment2, PctEconDis and PctMinority, along with the DistrictIRN and DistrictName.
df.educ = merge(df1.b, df2.b, by = "DistrictIRN", all = TRUE)
colnames(df.educ) = c("DistrictIRN", "DistrictName", "Enrollment",
"PctEconDis", "PctMinority")
save(df.educ, file = "educ.RData")