1 Data Acquisition

As discussed previously, R can read data created in various formats, whether from a file downloaded and saved to your machine or directly from the web. The goal today is not to focus on reading these data. Rather, it is to see how we can work with, say, a Census data file, or a file from the CDC, the Bureau of Labor Statistics, the Bureau of Economic Analysis, the United Nations, and so on. Our path will be simple: download the raw data, then clean and shape it into the version we need for our analyses.

2 Some Useful Data Sources

For a decade or more now there has been tremendous interest in making data open – available to anyone who wishes to use it – and this trend was hastened by governments signing on and encouraging open data. Take a look at some of these sites …

2.1 Census Data

Let us download a specific file – the Voting Age Population by Citizenship and Race (CVAP) data compiled from the American Community Survey (ACS) 5-year (2009-2013) collection. The file can be downloaded from here. Once the file downloads, double-click the zip archive and extract all files. Documentation for these data is available here.

Once you extract the data, load the Place data as shown below:

setwd("~/Downloads/")
df.cvap = read.csv("CVAP_CSV_Format_2009-2013/Place.csv", sep = ",", 
    header = TRUE)

If you open the data frame you will see multiple rows per Place. Take a look at the documentation to understand what the columns represent.

  • TOT_EST: The rounded estimate of the total number of people for that geographic area and group. (Not available for tracts or block groups.)
  • TOT_MOE: The margin of error for the total number of people for that geographic area and group. (Not available for tracts or block groups.)
  • ADU_EST: The rounded estimate of the total number of people 18 years of age or older for that geographic area and group. (Not available for tracts or block groups.)
  • ADU_MOE: The margin of error for the total number of people 18 years of age or older for that geographic area and group. (Not available for tracts or block groups.)
  • CIT_EST: The rounded estimate of the total number of United States citizens for that geographic area and group.
  • CIT_MOE: The margin of error for the total number of United States citizens for that geographic area and group.
  • CVAP_EST: The rounded estimate of the total number of United States citizens 18 years of age or older for that geographic area and group.
  • CVAP_MOE: The margin of error for the total number of United States citizens 18 years of age or older for that geographic area and group.

The rows represent the following groups, with LNTITLE followed by LNNUMBER in parentheses:

  • Total (1)
  • Not Hispanic or Latino (2)
  • American Indian or Alaska Native Alone (3)
  • Asian Alone (4)
  • Black or African American Alone (5)
  • Native Hawaiian or Other Pacific Islander Alone (6)
  • White Alone (7)
  • American Indian or Alaska Native and White (8)
  • Asian and White (9)
  • Black or African American and White (10)
  • American Indian or Alaska Native and Black or African American (11)
  • Remainder of Two or More Race Responses (12)
  • Hispanic or Latino (13)
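Because the file stacks these thirteen group rows under every Place, a common first step is to keep only the Total rows (LNNUMBER == 1) so that each Place appears once. A minimal sketch, with made-up numbers standing in for the real Place.csv columns:

```r
# toy stand-in for a slice of the CVAP Place file (values are made up)
cvap = data.frame(
    GEONAME  = c("Athens city, Ohio", "Athens city, Ohio"),
    LNNUMBER = c(1, 13),        # 1 = Total, 13 = Hispanic or Latino
    TOT_EST  = c(23730, 560)
)

# keep one row per Place: the Total group only
cvap.total = subset(cvap, LNNUMBER == 1)
nrow(cvap.total)
```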

2.2 CDC’s BRFSS Data

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

The 2014 BRFSS survey data can be downloaded in ASCII format from here while the codebook is here and the variable layout is here.

Note: ASCII (pronounced ask-ee) stands for the American Standard Code for Information Interchange and is a code for representing English characters as numbers, with each character assigned a number from 0 to 127. “For example, the ASCII code for uppercase M is 77. Most computers use ASCII codes to represent text, which makes it possible to transfer data from one computer to another.”
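You can verify the quoted mapping directly in R: utf8ToInt() returns a character's code point and intToUtf8() inverts it.

```r
# ASCII code for uppercase M ...
utf8ToInt("M")
# ... and back again
intToUtf8(77)
```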

If you download the data and look at the variable layout here, you will notice that every column has specific starting and ending positions. If we choose to read the downloaded ASCII data we’ll have to pass the variable layout information to R. I’ll show you how to do this, but let us first take a shortcut via the other formats the CDC so kindly makes available for download, namely the SAS Transport Format found here.

Download and extract two files and put both in your working directory.

Then execute the following code:

setwd("~/Downloads/")
library(foreign)
brfss2014a = read.xport("LLCP2014.XPT ")

Note: There is an extra blank space after LLCP2014.XPT so run the code as written, not as "LLCP2014.XPT".

When the code executes you should see brfss2014a in the upper-right window of RStudio. Very easy, but what if we were given the raw ASCII data? Then we need the variable layout information, plus a library that can read fixed-width files. Two I use are readr (my preferred library) and LaF, and both are shown in action below.

Note: The original column layout specifications had to be modified because the third variable listed online (IDATE) subsumes the three that follow. As such IDATE was deleted from the specifications below.

The readr code specifies the start and end of the columns, as well as the column names.

fwf_positions_begin = c(1, 16, 18, 20, 22, 31, 35, 45, 62, 63, 
    64, 65, 67, 68, 70, 72, 80, 81, 83, 85, 87, 88, 89, 90, 91, 
    92, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 
    106, 108, 109, 146, 147, 148, 150, 151, 152, 154, 158, 170, 
    171, 172, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 
    186, 187, 188, 189, 190, 192, 193, 196, 198, 200, 202, 203, 
    209, 210, 211, 213, 215, 216, 218, 219, 220, 221, 222, 223, 
    224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 
    236, 237, 243, 255, 256, 257, 258, 261, 264, 266, 268, 270, 
    271, 272, 273, 275, 277, 279, 281, 282, 284, 285, 310, 311, 
    312, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 326, 
    327, 329, 330, 332, 334, 336, 339, 340, 341, 342, 343, 345, 
    346, 347, 348, 349, 351, 352, 354, 356, 357, 359, 360, 361, 
    362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 573, 
    574, 575, 576, 578, 579, 581, 582, 583, 584, 590, 591, 595, 
    623, 625, 626, 627, 628, 629, 645, 647, 1356, 1357, 1363, 
    1393, 1403, 1417, 1419, 1421, 1425, 1426, 1427, 1430, 1473, 
    1612, 1669, 1671, 1673, 1674, 1675, 1711, 1907, 1910, 1913, 
    1916, 2155, 2156, 2157, 2158, 2159, 2160, 2161, 2162, 2163, 
    2164, 2221, 2223, 2227, 2228, 2229, 2230, 2231, 2232, 2234, 
    2235, 2236, 2239, 2242, 2247, 2251, 2252, 2253, 2254, 2255, 
    2256, 2257, 2258, 2259, 2262, 2263, 2267, 2271, 2272, 2273, 
    2274, 2275, 2276, 2277, 2278, 2279, 2281, 2283, 2284, 2286, 
    2292)

width = c(2, 2, 2, 2, 4, 4, 10, 10, 1, 1, 1, 1, 1, 2, 2, 2, 1, 
    2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 4, 4, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 2, 1, 6, 
    1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 6, 2, 1, 1, 1, 3, 3, 2, 2, 2, 1, 1, 1, 
    2, 2, 2, 2, 1, 2, 1, 25, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 3, 1, 2, 1, 2, 2, 2, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 
    1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 2, 1, 2, 1, 1, 1, 6, 1, 4, 28, 2, 1, 1, 1, 1, 1, 
    2, 2, 1, 6, 10, 10, 10, 2, 2, 1, 1, 1, 1, 10, 10, 1, 2, 2, 
    1, 1, 1, 10, 3, 3, 3, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 
    2, 1, 1, 1, 1, 1, 2, 1, 1, 3, 3, 5, 4, 1, 1, 1, 1, 1, 1, 
    1, 1, 3, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1)

fwf_positions_end = fwf_positions_begin + (width - 1)
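Layout vectors this long are easy to mistype, so a quick consistency check is worthwhile before reading the file: the begin and width vectors must be the same length, and no field should run past the start of the next. A sketch using just the first three fields (begin3, width3, and end3 are illustrative names, not part of the BRFSS layout):

```r
# first three fields of the layout, checked in isolation
begin3 = c(1, 16, 18)
width3 = c(2, 2, 2)
end3 = begin3 + (width3 - 1)

# same number of begins and widths?
length(begin3) == length(width3)
# each field ends before the next one begins?
all(head(end3, -1) < tail(begin3, -1))
```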

col_names = c("STATE", "FMONTH", "IMONTH", "IDAY", "IYEAR", "DISPCODE", 
    "SEQNO", "_PSU", "CTELENUM", "PVTRESD1", "COLGHOUS", "STATERES", 
    "LADULT", "NUMADULT", "NUMMEN", "NUMWOMEN", "GENHLTH", "PHYSHLTH", 
    "MENTHLTH", "POORHLTH", "HLTHPLN1", "PERSDOC2", "MEDCOST", 
    "CHECKUP1", "EXERANY2", "SLEPTIM1", "CVDINFR4", "CVDCRHD4", 
    "CVDSTRK3", "ASTHMA3", "ASTHNOW", "CHCSCNCR", "CHCOCNCR", 
    "CHCCOPD1", "HAVARTH3", "ADDEPEV2", "CHCKIDNY", "DIABETE3", 
    "DIABAGE2", "LASTDEN3", "RMVTETH3", "VETERAN3", "MARITAL", 
    "CHILDREN", "EDUCA", "EMPLOY1", "INCOME2", "WEIGHT2", "HEIGHT3", 
    "NUMHHOL2", "NUMPHON2", "CPDEMO1", "INTERNET", "RENTHOM1", 
    "SEX", "PREGNANT", "QLACTLM2", "USEEQUIP", "BLIND", "DECIDE", 
    "DIFFWALK", "DIFFDRES", "DIFFALON", "SMOKE100", "SMOKDAY2", 
    "STOPSMK2", "LASTSMK2", "USENOW3", "ALCDAY5", "AVEDRNK2", 
    "DRNK3GE5", "MAXDRNKS", "FLUSHOT6", "FLSHTMY2", "PNEUVAC3", 
    "SHINGLE2", "FALL12MN", "FALLINJ2", "SEATBELT", "DRNKDRI2", 
    "HADMAM", "HOWLONG", "PROFEXAM", "LENGEXAM", "HADPAP2", "LASTPAP2", 
    "HADHYST2", "PCPSAAD2", "PCPSADI1", "PCPSARE1", "PSATEST1", 
    "PSATIME", "PCPSARS1", "BLDSTOOL", "LSTBLDS3", "HADSIGM3", 
    "HADSGCO1", "LASTSIG3", "HIVTST6", "HIVTSTD3", "WHRTST10", 
    "PDIABTST", "PREDIAB1", "INSULIN", "BLDSUGAR", "FEETCHK2", 
    "DOCTDIAB", "CHKHEMO3", "FEETCHK", "EYEEXAM", "DIABEYE", 
    "DIABEDU", "PAINACT2", "QLMENTL2", "QLSTRES2", "QLHLTH2", 
    "MEDICARE", "HLTHCVR1", "DELAYMED", "DLYOTHER", "NOCOV121", 
    "LSTCOVRG", "DRVISITS", "MEDSCOST", "CARERCVD", "MEDBILL1", 
    "ASBIALCH", "ASBIDRNK", "ASBIBING", "ASBIADVC", "ASBIRDUC", 
    "WTCHSALT", "LONGWTCH", "DRADVISE", "ASTHMAGE", "ASATTACK", 
    "ASERVIST", "ASDRVIST", "ASRCHKUP", "ASACTLIM", "ASYMPTOM", 
    "ASNOSLEP", "ASTHMED3", "ASINHALR", "IMFVPLAC", "TETANUS", 
    "HPVTEST", "HPLSTTST", "HPVADVC2", "HPVADSHT", "CNCRDIFF", 
    "CNCRAGE", "CNCRTYP1", "CSRVTRT1", "CSRVDOC1", "CSRVSUM", 
    "CSRVRTRN", "CSRVINST", "CSRVINSR", "CSRVDEIN", "CSRVCLIN", 
    "CSRVPAIN", "CSRVCTL1", "RRCLASS2", "RRCOGNT2", "RRATWRK2", 
    "RRHCARE3", "RRPHYSM2", "RREMTSM2", "SCNTMNY1", "SCNTMEL1", 
    "SCNTPAID", "SCNTWRK1", "SCNTLPAD", "SCNTLWK1", "SCNTVOT1", 
    "SXORIENT", "TRNSGNDR", "RCSBIRTH", "RCSGENDR", "RCHISLA1", 
    "RCSRACE1", "RCSBRAC1", "RCSRLTN2", "CASTHDX2", "CASTHNO2", 
    "EMTSUPRT", "LSATISFY", "QSTVER", "QSTLANG", "MSCODE", "_STSTR", 
    "_STRWT", "_RAWRAKE", "_WT2RAKE", "_AGE80", "_IMPRACE", "_IMPNPH", 
    "_IMPEDUC", "_IMPMRTL", "_IMPHOME", "_LANDWT2", "_LANDWT", 
    "_CHISPNC", "_CPRACE", "_CRACE1", "_IMPCAGE", "_IMPCRAC", 
    "_IMPCSEX", "_CLANDWT", "NPHH", "NAHH", "NCHH", "_HHOLDWT", 
    "_RFHLTH", "_HCVU651", "_TOTINDA", "_LTASTH1", "_CASTHM1", 
    "_ASTHMS1", "_DRDXAR1", "_EXTETH2", "_ALTETH2", "_DENVST2", 
    "_PRACE1", "_MRACE1", "_HISPANC", "_RACE", "_RACEG21", "_RACEGR3", 
    "_RACE_G1", "_AGEG5YR", "_AGE65YR", "_AGE_G", "HTIN4", "HTM4", 
    "WTKG3", "_BMI5", "_BMI5CAT", "_RFBMI5", "_CHLDCNT", "_EDUCAG", 
    "_INCOMG", "_SMOKER3", "_RFSMOK3", "DRNKANY5", "DROCDY3_", 
    "_RFBING5", "_DRNKDY4", "_DRNKMO4", "_RFDRHV4", "_RFDRMN4", 
    "_RFDRWM4", "_FLSHOT6", "_PNEUMO2", "_RFSEAT2", "_RFSEAT3", 
    "_RFMAM2Y", "_MAM502Y", "_RFPAP32", "_RFPSA21", "_RFBLDS2", 
    "_RFSIGM2", "_AIDTST3")

library(readr)

brfss2014b = read_fwf("~/Downloads/LLCP2014.ASC ", fwf_positions(fwf_positions_begin, 
    fwf_positions_end, col_names))

brfss2014b = subset(brfss2014b, STATE == "39")

save(brfss2014b, file = "~/Downloads/brfss2014b.RData")

Next you see the LaF approach.

library(LaF)

laf = laf_open_fwf("LLCP2014.ASC ", column_widths = width, column_types = rep("character", 
    length(width)), column_names = col_names)

brfss2014c = data.frame(laf[, ])

save(brfss2014c, file = "~/Downloads/brfss2014c.RData")

Notice that with LaF we had to specify the column widths via column_widths = width and also the column headings via column_names = col_names.

2.3 Changing Variable Formats

When these files are read, some of the variables will come in as characters when in fact they are numeric (see STATE, for example) or factors (see PHYSHLTH, for example). Others will be read as integers (see GENHLTH), and so on. Consequently, we often have to reset the read formats. Further, we know from the codebook that missing values and other responses (such as refused) carry numeric codes in the survey. For instance, GENHLTH is supposed to have the following responses mapped to the numeric values stored in the variable:

  • 1 = Excellent
  • 2 = Very Good
  • 3 = Good
  • 4 = Fair
  • 5 = Poor
  • 7 = Don’t Know/Not Sure
  • 9 = Refused
  • BLANK = Not asked or Missing

We can take this mapping and fix GENHLTH by creating a new variable, say general.health, as a factor. Note that any values not listed as levels (here 7, 9, and blank) will be set to NA.

load("~/Downloads/brfss2014b.RData")

brfss2014b$general.health = factor(brfss2014b$GENHLTH, levels = c(1, 
    2, 3, 4, 5), labels = c("Excellent", "Very Good", "Good", 
    "Fair", "Poor"))

table(brfss2014b$general.health)
Excellent Very Good      Good      Fair      Poor 
     1689      3499      3425      1643       651 

Let us also flip STATE into a numeric variable, and format IMONTH, IDAY, and IYEAR as a date. For dates the lubridate package comes in very handy.

brfss2014b$STATE = as.numeric(brfss2014b$STATE)

We’ll combine IMONTH, IDAY and IYEAR to create a date variable.

brfss2014b$date = with(brfss2014b, sprintf("%s-%s-%s", IMONTH, 
    IDAY, IYEAR))

head(brfss2014b$date)
## [1] "11-05-2014" "07-14-2014" "12-16-2014" "09-15-2014" "09-08-2014"
## [6] "08-29-2014"
library(lubridate)
brfss2014b$Date = mdy(brfss2014b$date)
is.Date(brfss2014b$Date)
## [1] FALSE
str(brfss2014b$Date)
##  POSIXct[1:10933], format: "2014-11-05" "2014-07-14" "2014-12-16" "2014-09-15" ...

Note how we built date … we first concatenated the three columns and then used lubridate to format date via the mdy() specification. I then checked the result with is.Date(), which returns FALSE here because mdy() produced a POSIXct date-time rather than a Date, as the str() command confirms. Now that Date has been created we could drop the original three columns, but we will let that be for now.
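If a plain Date (rather than a POSIXct date-time) is preferred, base R's as.Date() with an explicit format string gets us there without lubridate; a self-contained sketch:

```r
# base-R equivalent of mdy(): parse a month-day-year string into a Date
d = as.Date("11-05-2014", format = "%m-%d-%Y")
class(d)                  # "Date"
format(d, "%Y-%m-%d")     # "2014-11-05"
```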

2.4 BEA and BLS Data

The Bureau of Labor Statistics (BLS) compiles a lot of data, including some under the Local Area Unemployment Statistics (LAUS) program – a Federal-State cooperative effort in which monthly estimates of total employment and unemployment are prepared for approximately 7,300 areas:

  • Census regions and divisions
  • States
  • Metropolitan Statistical Areas and Metropolitan NECTAS (New England City and Town Areas)
  • Metropolitan Divisions and NECTA Divisions
  • Micropolitan Statistical Areas and Micropolitan NECTAs
  • Combined Metropolitan Statistical Areas and Combined NECTAs
  • Small Labor Market Areas
  • Counties and county equivalents
  • Cities of 25,000 population or more
  • Cities and towns in New England regardless of population

These estimates are key indicators of local economic conditions. The Bureau of Labor Statistics (BLS) of the U.S. Department of Labor is responsible for the concepts, definitions, technical procedures, validation, and publication of the estimates that State employment security agencies prepare under agreement with BLS. Data available at BLS can be found here

Similarly, the Bureau of Economic Analysis (BEA) has a regional economic accounts program that details the geographic distribution of U.S. economic activity and growth. “The estimates of gross domestic product by state and state and local area personal income, and the accompanying detail, provide a consistent framework for analyzing and comparing individual state and local area economies.”

While the BEA data read below were obtained from here, the BLS data read below were obtained from here and here, respectively.

setwd("~/Downloads/")
df.bea = read.csv("CA1/CA1_1969_2014_OH.csv")

df.bls1 = read.csv("http://www.bls.gov/lau/laucnty14.txt", header = FALSE, 
    skip = 6, sep = "")
colnames(df.bls1) = c("LAUS Code", "StateFIPSCode", "CountyFIPSCode", 
    "CountyName", "StateAbbreviation", "Year", "TotalLaborForce", 
    "TotalEmployed", "TotalUnemployed", "UnemploymentRate")

df.bls2 = read.csv("http://www.bls.gov/cew/data/api/2015/2/industry/10.csv", 
    sep = ",", header = TRUE)

3 Initiative and Referendum Data (1904-2010)

The Initiative and Referendum Institute has long gathered data on direct democracy in the states. You can read more here. Two data sets are readily available for download here:

IRI initiatives by state and year for the 1904 through 2010 period. This database lists all statewide initiatives to appear on the ballot since the first statewide initiative in Oregon in 1904. This list does not contain information on issues placed on the ballot by state legislatures, commonly referred to as “legislative measures.”

setwd("~/Downloads/")
library(readxl)
df.ir = read_excel("IRI initiatives by state-year (1904-2010).xls", 
    col_names = FALSE)
colnames(df.ir) = c("StateName", "Year", "Number", "ApprovalRate")

The Legal Landscape Database describes direct democracy provisions in American cities in 2005. The database includes the 1,000 largest cities in the country as well as the ten largest cities in every state. Data items include city name, FIPS code, population, and initiative status.

setwd("~/Downloads/")
df.ircity = read_excel("Legal Landscape (version 20080504).xls")
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 00 00 e5 05 00 00 05 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 34 30 32 33 30 30 30 30 64 62 63 62 36 3c 00 00 0b 00 8a 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 34 39 61 31 30 30 30 30 32 36 37 37 33 3c 00 00 07 00 67 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 36 30 36 38 30 30 30 30 34 38 61 36 35 3c 00 00 09 00 67 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 38 66 61 39 30 30 30 30 34 39 39 62 34 3c 00 00 05 00 67 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 62 61 33 62 30 30 30 30 37 64 33 30 31 3c 00 00 0d 00 8a 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 62 64 36 38 30 30 30 30 36 34 32 38 31 3c 00 00 03 00 67 00 

4 Education Data (CCD)

“The Common Core of Data (CCD) is a program of the U.S. Department of Education’s National Center for Education Statistics that annually collects fiscal and non-fiscal data about all public schools, public school districts and state education agencies in the United States. The data are supplied by state education agency officials and include information that describes schools and school districts, including name, address, and phone number; descriptive information about students and staff, including demographics; and fiscal data, including revenues and current expenditures.” You can learn more about the program here and access the data and documentation here.

setwd("~/Downloads/")
df.nces = read.csv("nces1314.csv", header = TRUE, sep = ",", 
    skip = 5)

colnames(df.nces) = c("SchoolName", "State", "City", "Address", 
    "StateAbb", "ZIP", "ZIP4", "StateAbb2", "SchoolID", "Agency", 
    "AgencyID", "County", "Countynumber", "StateFIPS", "SchoolType", 
    "AgencyType", "Status", "Charter", "Magnet", "Latitude", 
    "Longitude", "StateSchoolID", "StateAgencyID", "VirtualSchool", 
    "FreeLunch", "ReducedLunch", "TotalFRL.N", "TotalStudents", 
    "FTE.Teachers", "PupilTeacherRatio")

These are data for a single year. If we wanted to build a panel (i.e., data for multiple schools over multiple years) we would have to download all the files and then bind them into a single data frame.
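The binding step could be sketched as follows, with toy one-year data frames standing in for the downloaded CCD files (ccd1314 and ccd1415 are hypothetical names, not real file contents):

```r
# toy stand-ins for two yearly CCD extracts (values are made up)
ccd1314 = data.frame(SchoolID = c(1, 2), TotalStudents = c(300, 450))
ccd1415 = data.frame(SchoolID = c(1, 2), TotalStudents = c(310, 440))

# tag each file with its year, then stack them into one panel
ccd1314$Year = 2013
ccd1415$Year = 2014
panel = rbind(ccd1314, ccd1415)
nrow(panel)
```

With many files, reading them in a loop via lapply() and stacking with do.call(rbind, ...) avoids repeating yourself.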

5 Cleaning and Preparing Acquired Data

Data are rarely clean, unless you have a little army working for you that knows exactly what is needed to prepare the data for analysis, or, if you are lucky, someone else has basically shipped you a clean version of the data. More often than not you will end up spending 90% of your time cleaning and preparing your data for analysis. Some of the more common procedures you might run into are outlined below but do note that these are but a smidgen of what may need to be done or what is possible with R.

We will look at a small array of operations, starting with the easiest – sub-setting data.

5.1 Subsetting Data

Sub-setting a data set means creating a smaller piece of the original data by selecting observations that meet a criterion of interest. For example, maybe we are not interested in the nationwide BRFSS data we downloaded but instead just the data for Ohio (state FIPS code 39). This could be easily achieved as follows:

setwd("~/Downloads/")

load("brfss2014c.RData")

brfss2014c.OH = subset(brfss2014c, STATE == 39)

Take another example, the Census data we downloaded earlier. Suppose we wanted to focus our analysis on all Places with a population size of 25000 or more. How might we do this?

df.cvap.sub = subset(df.cvap, TOT_EST >= 25000)

Wait a minute. Let us make it even smaller, by focusing only on Ohio’s Places with a total population of at least 25000. To do this we’ll have to split the GEONAME column to create two columns – PlaceName and StateName.

library(splitstackshape)
df.cvap2 = cSplit(as.data.table(df.cvap), "GEONAME", ",")

Now we have two new columns, GEONAME_1 and GEONAME_2, so let us rename them as needed.

names(df.cvap2)[names(df.cvap2) == "GEONAME_1"] <- "PlaceName"
names(df.cvap2)[names(df.cvap2) == "GEONAME_2"] <- "StateName"

Now let us subset this data frame to places in Ohio with a total population of 25000 or more.

df.cvap3 = subset(df.cvap2, StateName == "Ohio" & TOT_EST >= 
    25000)

What if we wanted multiple states, say Ohio and West Virginia, and wanted to retain only the rows for the total population?

df.cvap4 = subset(df.cvap2, (StateName == "Ohio" & TOT_EST >= 
    25000 & LNNUMBER == 1) | (StateName == "West Virginia" & 
    TOT_EST >= 25000 & LNNUMBER == 1))
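Conditions like this get unwieldy as more states are added; %in% keeps the state test compact. A sketch with toy data standing in for df.cvap2:

```r
# toy stand-in for df.cvap2 (values are made up)
places = data.frame(
    StateName = c("Ohio", "West Virginia", "Kentucky"),
    TOT_EST   = c(30000, 27000, 40000),
    LNNUMBER  = c(1, 1, 1)
)

# one condition covers any number of states
kept = subset(places, StateName %in% c("Ohio", "West Virginia") &
    TOT_EST >= 25000 & LNNUMBER == 1)
nrow(kept)
```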

5.2 Recoding Variables

We often end up having to construct new variables because the original variables are not perfectly suited for our purposes. Maybe we want to create a new variable based on quartiles or collapse some nominal or ordinal variable into fewer categories. We’ll see a couple of ways to carry out these recodings. First, let us create a new category called LargeCity in the df.cvap data set.

df.cvap$LargeCity = NA
df.cvap$LargeCity[df.cvap$TOT_EST <= 1e+05 & df.cvap$LNNUMBER == 
    1] = "Not Large City"
df.cvap$LargeCity[df.cvap$TOT_EST > 1e+05 & df.cvap$LNNUMBER == 
    1] = "Large City"
table(df.cvap$LargeCity)

    Large City Not Large City 
           303          29206 
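The quartile-based recode mentioned at the top of this section can be done in one step with cut(), which bins a numeric variable at supplied break points; a self-contained sketch on simulated data:

```r
set.seed(42)
x = rnorm(100, mean = 50000, sd = 10000)    # simulated population estimates

# bin at the quartile boundaries; include.lowest keeps the minimum in Q1
quartile.group = cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
    include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4"))
table(quartile.group)
```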

5.3 Reshaping Data

Broadly summarized, data tend to arrive in one of two formats – either in what we call the wide format (shown below)

GeoName  LineCode  Description  X1969  X1970  X2014
Adams           1  Population   18895  19117  28129
Adams           2  Income       45179  45202  51039
Athens          1  Population   16895  17117  31129
Athens          2  Income       41179  41102  61572

or in the long format (shown below).

GeoName  LineCode  Description  Year  Estimate
Adams           1  Population   1969     18895
Adams           1  Population   1970     19117
Athens          1  Population   1969     16895
Athens          1  Population   1970     17117

setwd("~/Downloads/CA1/")
df = read.csv("CA1_1969_2014_OH.csv", sep = ",", header = TRUE)
head(df[c(2, 7, 8, 9)], n = 9)
GeoName           Description                                 X1969     X1970
Ohio state total  Personal income (thousands of dollars)   41814714  44121386
Ohio state total  Population (persons) 1/                  10563000  10668839
Ohio state total  Per capita personal income (dollars) 2/      3959      4136
Adams, OH         Personal income (thousands of dollars)      45171     49857
Adams, OH         Population (persons) 1/                     18895     19117
Adams, OH         Per capita personal income (dollars) 2/      2391      2608
Allen, OH         Personal income (thousands of dollars)     415365    439215
Allen, OH         Population (persons) 1/                    110340    111084
Allen, OH         Per capita personal income (dollars) 2/      3764      3954
df2 = subset(df, Region != "NA")
df2$IndustryClassification = NULL
df2$Table = NULL
df2$Region = NULL
df2$LineCode = factor(df2$LineCode, levels = c(1, 2, 3), labels = c("Personal Income", 
    "Population", "Per Capita Personal Income"))
df2$Description = NULL


library(reshape2)
df3 = melt(df2, id.vars = c("GeoFIPS", "GeoName", "LineCode"), 
    variable.name = "Year", value.name = "Estimate")

df4 = as.data.frame(sapply(df3, gsub, pattern = "X", replacement = ""))

df5 <- dcast(df4, GeoName + Year ~ LineCode, value.var = "Estimate")

colnames(df5) = c("GeoName", "Year", "pcpersinc", "persinc", 
    "popn")

cols = c(3, 4, 5)

df5[, cols] = apply(df5[, cols], 2, function(x) as.numeric(as.character(x)))

head(df5, n = 9)
GeoName    Year  pcpersinc  persinc   popn
Adams, OH  1969       2391    45171  18895
Adams, OH  1970       2608    49857  19117
Adams, OH  1971       2566    50988  19870
Adams, OH  1972       2708    56026  20687
Adams, OH  1973       2897    63292  21847
Adams, OH  1974       3253    73276  22525
Adams, OH  1975       3553    79034  22247
Adams, OH  1976       3703    85818  23178
Adams, OH  1977       4200    98894  23546

5.4 Rectangularizing Data

Sometimes the data are not rectangular. A rectangular data set has the same number of rows (one per category) and the same columns for every unit; at times the downloaded data will not have this property. One example is the Ohio Department of Education’s district-level data. Let us download three files: one with total enrollment, another with the number of students with an economic disadvantage, and a third with the racial distribution of students. These data can be downloaded from here. Make sure when you Export the results you do so as a Plain text file, tab-delimited and with Export Report Title deselected.

setwd("~/Downloads/")
df.a = read.csv("Enrollment (District).txt", skip = 4, header = FALSE, 
    sep = "\t")
summary(df.a)
       V1                      V2           V3            V4      
 Min.   :   125   Buckeye Local  :   3   Mode:logical   88     :   5  
 1st Qu.: 43588   Crestview Local:   3   NA's:1061      98     :   5  
 Median : 46359   Eastern Local  :   3                  139    :   5  
 Mean   : 47115   Green Local    :   3                  36     :   4  
 3rd Qu.: 49601   Madison Local  :   3                  62     :   4  
 Max.   :151209   Northwest Local:   3                  108    :   4  
                  (Other)        :1043                  (Other):1034  
df.a$V3 = NULL
colnames(df.a) = c("DistrictIRN", "DistrictName", "Enrollment")

df.b = read.csv("Enrollment by Student Demographic (District).txt", 
    skip = 4, header = FALSE, sep = "\t")
summary(df.b)
       V1                        V2                                       V3           V4            V5             V6      
 Min.   :   125   Perry Local      :  14   American Indian or Alaskan Native: 43   Mode:logical   11     :  93   >95%   : 285  
 1st Qu.: 43950   Buckeye Local    :  13   Asian                            :281   NA's:3372      12     :  90   1.7%   :  59  
 Median : 46094   Northwest Local  :  13   Black, Non-Hispanic              :707                  14     :  80   .9%    :  57  
 Mean   : 47575   Springfield Local:  13   Hispanic                         :652                  13     :  64   1.2%   :  55  
 3rd Qu.: 49205   Madison Local    :  12   Multiracial                      :712                  16     :  63   1.3%   :  55  
 Max.   :151209   Lakota Local     :  10   Pacific Islander                 : 18                  15     :  60   .8%    :  50  
                  (Other)          :3297   White, Non-Hispanic              :959                  (Other):2922   (Other):2811  
df.b$V2 = NULL
df.b$V4 = NULL
df.b$V6 = NULL
colnames(df.b) = c("DistrictIRN", "Race", "Number")

df.c = read.csv("Enrollment by Student Demographic (District) (1).txt", 
    skip = 4, header = FALSE, sep = "\t")
summary(df.c)
       V1                      V2       V3          V4            V5             V6      
 Min.   :   125   Buckeye Local  :   6   N: 902   Mode:logical   10     :  10   >95%   : 210  
 1st Qu.: 43968   Crestview Local:   6   Y:1009   NA's:1911      11     :   9   36.3%  :   8  
 Median : 46565   Green Local    :   6                           16     :   9   63.7%  :   8  
 Mean   : 47795   Madison Local  :   6                           20     :   8   42.2%  :   7  
 3rd Qu.: 49639   Northwest Local:   6                           22     :   8   43.7%  :   7  
 Max.   :151209   Perry Local    :   6                           76     :   8   46.1%  :   7  
                  (Other)        :1875                           (Other):1859   (Other):1664  
df.c$V2 = NULL
df.c$V4 = NULL
df.c$V6 = NULL
colnames(df.c) = c("DistrictIRN", "Econ", "Number")

We read in each file with read.csv("filepath", skip = 4, header = FALSE, sep = "\t"), skipping 4 lines from the top (because this gives us a cleaner file). Because there are no column headers the data frame will end up with columns named V1, V2, … and so on. I am dropping certain columns because they are either blank or because we don’t need them. The colnames(dataframe) = c(…) command assigns meaningful column headers to each data frame.

df.a has one row per district but df.c should have two rows per district – one row for the number of students flagged as economically disadvantaged (Econ == “Y”) and the other row for the number of students flagged as not economically disadvantaged (Econ == “N”). Let us see how many rows there are:

table(df.c$Econ)
   N    Y 
 902 1009 

Aha! Only 902 “N” rows but 1009 “Y” rows. If every district had both rows the counts would be equal and the data frame rectangular, but we can make it so via the epicalc library and the following code:

library(epicalc)
df.c2 <- fillin(df.c, select = c(DistrictIRN, Econ))
table(df.c2$Econ)
N Y
1061 1061

The race data, df.b, have the same problem and can be fixed similarly.

table(df.b$Race)
American Indian or Alaskan Native Asian Black, Non-Hispanic Hispanic Multiracial Pacific Islander White, Non-Hispanic
43 281 707 652 712 18 959
df.b2 <- fillin(df.b, select = c(DistrictIRN, Race))
table(df.b2$Race)
American Indian or Alaskan Native Asian Black, Non-Hispanic Hispanic Multiracial Pacific Islander White, Non-Hispanic
1059 1059 1059 1059 1059 1059 1059
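As an aside, epicalc has since been archived on CRAN, so fillin() may not install cleanly on current R versions. Base R can rectangularize the same way by merging the data onto the full grid of key combinations; a sketch on toy numbers:

```r
# Toy version of the problem: district 3 is missing its "N" row
df <- data.frame(DistrictIRN = c(1, 1, 2, 2, 3),
                 Econ = c("N", "Y", "N", "Y", "Y"),
                 Number = c(10, 40, 20, 30, 25))

# Build every DistrictIRN x Econ combination, then left-join the data onto it;
# the missing (3, "N") combination comes back with Number = NA
grid <- expand.grid(DistrictIRN = unique(df$DistrictIRN),
                    Econ = unique(df$Econ))
df2 <- merge(grid, df, all.x = TRUE)
```

After the merge, table(df2$Econ) shows equal counts for "N" and "Y", just as fillin() produces.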

One last thing before we move on: Let us flip the following variables:

  • In df.a, flip DistrictIRN to numeric and Enrollment to numeric
  • In df.b2, flip DistrictIRN to numeric and Number to numeric
  • In df.c2, flip DistrictIRN to numeric and Number to numeric

Pay attention to the code; there is a subtle difference in how the variables are converted.

df.a$DistrictIRN2 = as.numeric(df.a$DistrictIRN)
df.a$Enrollment2 = as.numeric(gsub(",", "", df.a$Enrollment))

df.b2$DistrictIRN2 = as.numeric(df.b2$DistrictIRN)
df.b2$Number2 = as.numeric(gsub(",", "", df.b2$Number))

df.c2$DistrictIRN2 = as.numeric(df.c2$DistrictIRN)
df.c2$Number2 = as.numeric(gsub(",", "", df.c2$Number))

If a numeric variable has been read as a Factor and contains commas, the gsub(",", "", df$variable) portion of the code strips the commas before the variable is converted to numeric. Note that, to be safe, I am creating new variables in each data frame, one for every format conversion. This way, if something goes wrong during the conversion we will still have the original variable to rework.
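The comma problem is easy to reproduce on a toy vector: as.numeric() alone cannot parse "1,234" and yields NA (with a warning), while stripping the commas first recovers the actual numbers.

```r
x <- c("1,234", "567")          # counts read in as text, with commas
as.numeric(x)                   # "1,234" cannot be parsed: NA, with a warning
as.numeric(gsub(",", "", x))    # strip commas first: 1234, 567
```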

5.5 Merging Data

Now that we have the education data in three data frames we can merge them into a single data frame. This is often necessary since data will not always be downloaded in one single file. Merging is fairly seamless, so long as one is careful about a few things. First, the column (or columns) you merge on must be identically named and formatted in every file being merged. This means DistrictIRN2 must end up named DistrictIRN in each data frame, which we do by dropping the original DistrictIRN (along with the character versions of Enrollment and Number) and then renaming DistrictIRN2 to DistrictIRN.

df.a$DistrictIRN = NULL
df.b2$DistrictIRN = NULL
df.c2$DistrictIRN = NULL

df.a$Enrollment = NULL
df.b2$Number = NULL
df.c2$Number = NULL

names(df.a)[names(df.a) == "DistrictIRN2"] <- "DistrictIRN"
names(df.b2)[names(df.b2) == "DistrictIRN2"] <- "DistrictIRN"
names(df.c2)[names(df.c2) == "DistrictIRN2"] <- "DistrictIRN"

Second, you need to decide beforehand whether you want a 1:1 merge, where only those records present in both files are kept. Thus, for example, if we have 1061 districts in df.a and df.b2 but only 1059 districts in df.c, a 1:1 merge would leave the merged data frame with only 1059 districts. We could, of course, ask merge() to keep all districts even if they appear in only one file; that is what is being done below with all = TRUE.
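The difference between the two behaviors is easy to demonstrate on toy data: merge() defaults to keeping only matched rows, while all = TRUE keeps everything.

```r
a <- data.frame(DistrictIRN = c(1, 2, 3), Enrollment = c(100, 200, 300))
b <- data.frame(DistrictIRN = c(2, 3, 4), PctEconDis = c(40, 55, 60))

inner <- merge(a, b, by = "DistrictIRN")              # districts 2 and 3 only
full  <- merge(a, b, by = "DistrictIRN", all = TRUE)  # districts 1-4, with NAs
```

The unmatched rows in the full merge carry NA in the columns coming from the other data frame.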

Before we merge though, I am going to reshape the Econ and the Race data frames.

df.c3 <- reshape(df.c2, timevar = "Econ", idvar = c("DistrictIRN"), 
    direction = "wide")

df.b3 <- reshape(df.b2, timevar = "Race", idvar = c("DistrictIRN"), 
    direction = "wide")
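If the wide direction of reshape() is unfamiliar: it spreads each level of timevar into its own column, suffixing the value column with the level. On toy data:

```r
long <- data.frame(DistrictIRN = c(1, 1, 2, 2),
                   Econ = c("N", "Y", "N", "Y"),
                   Number = c(10, 40, 20, 30))

# One row per district; Number splits into Number.N and Number.Y
wide <- reshape(long, timevar = "Econ", idvar = "DistrictIRN",
                direction = "wide")
```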

and now we merge df.a and df.c3 first.

merge.1 = merge(df.a, df.c3, by = c("DistrictIRN"), all = TRUE)

If you look at merge.1 you will see missing Number2.N values for some districts. Let us fill these in by computing Number2.N = Enrollment2 - Number2.Y wherever Number2.N is missing. This can be done via the data.table package.

library(data.table)
df1 = data.table(merge.1)
df1.b = df1[is.na(Number2.N), `:=`(Number2.N, Enrollment2 - Number2.Y)]
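For readers not using data.table, the same conditional fill can be written in base R with ifelse(); a sketch on made-up numbers:

```r
d <- data.frame(Enrollment2 = c(100, 200),
                Number2.Y   = c(40, 120),
                Number2.N   = c(60, NA))

# Where Number2.N is missing, derive it from the other two columns
d$Number2.N <- ifelse(is.na(d$Number2.N),
                      d$Enrollment2 - d$Number2.Y,
                      d$Number2.N)
```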

Now we can create a PctEconDis column that will be (Number2.Y / Enrollment2) * 100 and then we can drop the Number2.N and Number2.Y columns.

df1.b$PctEconDis = with(df1.b, (Number2.Y/Enrollment2) * 100)

df1.b$Number2.N = NULL
df1.b$Number2.Y = NULL

Now we do the same thing with the Race data, merging df.a and df.b3. This one will be trickier; you'll see why.

merge.2 = merge(df.a, df.b3, by = c("DistrictIRN"), all = TRUE)

Note how many cells are missing because a district may have data on students of some racial groups but not others. We will therefore have to figure out a way to create a variable called PctMinority. This would be easy if every district had a value for the Number2.White, Non-Hispanic variable, but that column has missing values too, so we use a modified version of the preceding code.

df2 = data.table(merge.2)

# Sum the supplied vectors row by row, treating NA as zero
myRowSums = function(...) {
    retVal = as.numeric()
    df = data.frame(...)
    for (r in row.names(df)) {
        retVal = append(retVal, sum(df[r, ], na.rm = TRUE))
    }
    return(retVal)
}
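The loop above works, but base R's rowSums() with na.rm = TRUE computes the same NA-ignoring row sums in a single vectorized call:

```r
m <- cbind(a = c(1, NA, 3), b = c(4, 5, NA))
rowSums(m, na.rm = TRUE)   # NAs are dropped before each row is summed
```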

merge.2 <- within(merge.2, NonMinority <- ifelse(!is.na(`Number2.White, Non-Hispanic`), 
    `Number2.White, Non-Hispanic`, Enrollment2 - myRowSums(Number2.Asian, 
        `Number2.Black, Non-Hispanic`, Number2.Hispanic, `Number2.American Indian or Alaskan Native`, 
        `Number2.Pacific Islander`)))

merge.2$PctMinority = with(merge.2, ((Enrollment2 - NonMinority)/Enrollment2) * 
    100)

my.vars = c("DistrictIRN", "PctMinority")

df2.b = merge.2[my.vars]

Whew! Now we can merge df1.b and df2.b to construct a single data frame that has, for every district, Enrollment2, PctEconDis, and PctMinority, along with the DistrictIRN and DistrictName.

df.educ = merge(df1.b, df2.b, by = "DistrictIRN", all = TRUE)

colnames(df.educ) = c("DistrictIRN", "DistrictName", "Enrollment", 
    "PctEconDis", "PctMinority")

save(df.educ, file = "educ.RData")