1 Data Acquisition

As discussed previously, R can read data created in various formats, whether from a file downloaded and saved to your machine or directly from the web. The goal today is not to focus on reading these data. Rather, it is to see how we can work with, say, a Census data file, or a file from the CDC, the Bureau of Labor Statistics, the Bureau of Economic Analysis, the United Nations, and so on. Our path will be simple: download the raw data, then clean and shape it into the version we need for our analyses.

2 Some Useful Data Sources

For a decade or more now there has been tremendous interest in making data open – available to anyone who wishes to use it – and this trend was hastened by governments signing on and encouraging open data. Take a look at some of these sites …

2.1 Census Data

Let us download a specific file – the Voting Age Population by Citizenship and Race (CVAP) data compiled from the American Community Survey (ACS) 5-year (2009-2013) collection. The file can be downloaded from here. Once the file downloads, double-click the zip archive and extract all files. Documentation for these data is available here.

Once you extract the data, load the Place data as shown below:

setwd("~/Downloads/")
df.cvap = read.csv("CVAP_CSV_Format_2009-2013/Place.csv", sep = ",", 
    header = TRUE)

If you open the data frame you will see multiple rows per Place. Take a look at the documentation to understand what the columns represent.

  • TOT_EST: The rounded estimate of the total number of people for that geographic area and group. (Not available for tracts or block groups.)
  • TOT_MOE: The margin of error for the total number of people for that geographic area and group. (Not available for tracts or block groups.)
  • ADU_EST: The rounded estimate of the total number of people 18 years of age or older for that geographic area and group. (Not available for tracts or block groups.)
  • ADU_MOE: The margin of error for the total number of people 18 years of age or older for that geographic area and group. (Not available for tracts or block groups.)
  • CIT_EST: The rounded estimate of the total number of United States citizens for that geographic area and group.
  • CIT_MOE: The margin of error for the total number of United States citizens for that geographic area and group.
  • CVAP_EST: The rounded estimate of the total number of United States citizens 18 years of age or older for that geographic area and group.
  • CVAP_MOE: The margin of error for the total number of United States citizens 18 years of age or older for that geographic area and group.

The rows represent the following groups, with LNTITLE followed by LNNUMBER in parentheses:

  • Total (1)
  • Not Hispanic or Latino (2)
  • American Indian or Alaska Native Alone (3)
  • Asian Alone (4)
  • Black or African American Alone (5)
  • Native Hawaiian or Other Pacific Islander Alone (6)
  • White Alone (7)
  • American Indian or Alaska Native and White (8)
  • Asian and White (9)
  • Black or African American and White (10)
  • American Indian or Alaska Native and Black or African American (11)
  • Remainder of Two or More Race Responses (12)
  • Hispanic or Latino (13)
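Because the file stacks these thirteen group rows under every Place, a common first step is to keep only the Total rows (LNNUMBER == 1) so that each Place appears once. A minimal sketch, with made-up numbers standing in for the real Place.csv columns:

```r
# toy stand-in for a slice of the CVAP Place file (values are made up)
cvap = data.frame(
    GEONAME  = c("Athens city, Ohio", "Athens city, Ohio"),
    LNNUMBER = c(1, 13),        # 1 = Total, 13 = Hispanic or Latino
    TOT_EST  = c(23730, 560)
)

# keep one row per Place: the Total group only
cvap.total = subset(cvap, LNNUMBER == 1)
nrow(cvap.total)
```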

2.2 CDC’s BRFSS Data

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

The 2014 BRFSS survey data can be downloaded in ASCII format from here while the codebook is here and the variable layout is here.

Note: ASCII (pronounced ask-ee) stands for the American Standard Code for Information Interchange and is a code for representing English characters as numbers, with each character assigned a number from 0 to 127. “For example, the ASCII code for uppercase M is 77. Most computers use ASCII codes to represent text, which makes it possible to transfer data from one computer to another.”
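You can verify the quoted mapping directly in R: utf8ToInt() returns a character's code point and intToUtf8() inverts it.

```r
# ASCII code for uppercase M ...
utf8ToInt("M")
# ... and back again
intToUtf8(77)
```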

If you download the data and look at the variable layout here, you will notice that every column has specific starting and ending positions. If we choose to read the downloaded ASCII data we’ll have to pass the variable layout information to R. I’ll show you how to do this, but let us first take a shortcut via the other formats the CDC so kindly makes available for download, namely the SAS Transport Format found here.

Download and extract two files and put both in your working directory.

Then execute the following code:

setwd("~/Downloads/")
library(foreign)
brfss2014a = read.xport("LLCP2014.XPT ")

Note: There is an extra blank space after LLCP2014.XPT so run the code as written, not as "LLCP2014.XPT".

When the code executes you should see brfss2014a in the upper-right window of RStudio. Very easy, but what if we were given the raw ASCII data? Then we need the variable layout information, plus a library that can read fixed-width files. Two I use are readr (my preferred library) and LaF, and both are shown in action below.

Note: The original column layout specifications had to be modified because the third variable listed online (IDATE) subsumes the three that follow. As such IDATE was deleted from the specifications below.

The readr code specifies the start and end of the columns, as well as the column names.

fwf_positions_begin = c(1, 16, 18, 20, 22, 31, 35, 45, 62, 63, 
    64, 65, 67, 68, 70, 72, 80, 81, 83, 85, 87, 88, 89, 90, 91, 
    92, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 
    106, 108, 109, 146, 147, 148, 150, 151, 152, 154, 158, 170, 
    171, 172, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 
    186, 187, 188, 189, 190, 192, 193, 196, 198, 200, 202, 203, 
    209, 210, 211, 213, 215, 216, 218, 219, 220, 221, 222, 223, 
    224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 
    236, 237, 243, 255, 256, 257, 258, 261, 264, 266, 268, 270, 
    271, 272, 273, 275, 277, 279, 281, 282, 284, 285, 310, 311, 
    312, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 326, 
    327, 329, 330, 332, 334, 336, 339, 340, 341, 342, 343, 345, 
    346, 347, 348, 349, 351, 352, 354, 356, 357, 359, 360, 361, 
    362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 573, 
    574, 575, 576, 578, 579, 581, 582, 583, 584, 590, 591, 595, 
    623, 625, 626, 627, 628, 629, 645, 647, 1356, 1357, 1363, 
    1393, 1403, 1417, 1419, 1421, 1425, 1426, 1427, 1430, 1473, 
    1612, 1669, 1671, 1673, 1674, 1675, 1711, 1907, 1910, 1913, 
    1916, 2155, 2156, 2157, 2158, 2159, 2160, 2161, 2162, 2163, 
    2164, 2221, 2223, 2227, 2228, 2229, 2230, 2231, 2232, 2234, 
    2235, 2236, 2239, 2242, 2247, 2251, 2252, 2253, 2254, 2255, 
    2256, 2257, 2258, 2259, 2262, 2263, 2267, 2271, 2272, 2273, 
    2274, 2275, 2276, 2277, 2278, 2279, 2281, 2283, 2284, 2286, 
    2292)

width = c(2, 2, 2, 2, 4, 4, 10, 10, 1, 1, 1, 1, 1, 2, 2, 2, 1, 
    2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 4, 4, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 2, 1, 6, 
    1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 6, 2, 1, 1, 1, 3, 3, 2, 2, 2, 1, 1, 1, 
    2, 2, 2, 2, 1, 2, 1, 25, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 3, 1, 2, 1, 2, 2, 2, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 
    1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 2, 1, 2, 1, 1, 1, 6, 1, 4, 28, 2, 1, 1, 1, 1, 1, 
    2, 2, 1, 6, 10, 10, 10, 2, 2, 1, 1, 1, 1, 10, 10, 1, 2, 2, 
    1, 1, 1, 10, 3, 3, 3, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 
    2, 1, 1, 1, 1, 1, 2, 1, 1, 3, 3, 5, 4, 1, 1, 1, 1, 1, 1, 
    1, 1, 3, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1)

fwf_positions_end = fwf_positions_begin + (width - 1)
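Layout vectors this long are easy to mistype, so a quick consistency check is worthwhile before reading the file: the begin and width vectors must be the same length, and no field should run past the start of the next. A sketch using just the first three fields (begin3, width3, and end3 are illustrative names, not part of the BRFSS layout):

```r
# first three fields of the layout, checked in isolation
begin3 = c(1, 16, 18)
width3 = c(2, 2, 2)
end3 = begin3 + (width3 - 1)

# same number of begins and widths?
length(begin3) == length(width3)
# each field ends before the next one begins?
all(head(end3, -1) < tail(begin3, -1))
```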

col_names = c("STATE", "FMONTH", "IMONTH", "IDAY", "IYEAR", "DISPCODE", 
    "SEQNO", "_PSU", "CTELENUM", "PVTRESD1", "COLGHOUS", "STATERES", 
    "LADULT", "NUMADULT", "NUMMEN", "NUMWOMEN", "GENHLTH", "PHYSHLTH", 
    "MENTHLTH", "POORHLTH", "HLTHPLN1", "PERSDOC2", "MEDCOST", 
    "CHECKUP1", "EXERANY2", "SLEPTIM1", "CVDINFR4", "CVDCRHD4", 
    "CVDSTRK3", "ASTHMA3", "ASTHNOW", "CHCSCNCR", "CHCOCNCR", 
    "CHCCOPD1", "HAVARTH3", "ADDEPEV2", "CHCKIDNY", "DIABETE3", 
    "DIABAGE2", "LASTDEN3", "RMVTETH3", "VETERAN3", "MARITAL", 
    "CHILDREN", "EDUCA", "EMPLOY1", "INCOME2", "WEIGHT2", "HEIGHT3", 
    "NUMHHOL2", "NUMPHON2", "CPDEMO1", "INTERNET", "RENTHOM1", 
    "SEX", "PREGNANT", "QLACTLM2", "USEEQUIP", "BLIND", "DECIDE", 
    "DIFFWALK", "DIFFDRES", "DIFFALON", "SMOKE100", "SMOKDAY2", 
    "STOPSMK2", "LASTSMK2", "USENOW3", "ALCDAY5", "AVEDRNK2", 
    "DRNK3GE5", "MAXDRNKS", "FLUSHOT6", "FLSHTMY2", "PNEUVAC3", 
    "SHINGLE2", "FALL12MN", "FALLINJ2", "SEATBELT", "DRNKDRI2", 
    "HADMAM", "HOWLONG", "PROFEXAM", "LENGEXAM", "HADPAP2", "LASTPAP2", 
    "HADHYST2", "PCPSAAD2", "PCPSADI1", "PCPSARE1", "PSATEST1", 
    "PSATIME", "PCPSARS1", "BLDSTOOL", "LSTBLDS3", "HADSIGM3", 
    "HADSGCO1", "LASTSIG3", "HIVTST6", "HIVTSTD3", "WHRTST10", 
    "PDIABTST", "PREDIAB1", "INSULIN", "BLDSUGAR", "FEETCHK2", 
    "DOCTDIAB", "CHKHEMO3", "FEETCHK", "EYEEXAM", "DIABEYE", 
    "DIABEDU", "PAINACT2", "QLMENTL2", "QLSTRES2", "QLHLTH2", 
    "MEDICARE", "HLTHCVR1", "DELAYMED", "DLYOTHER", "NOCOV121", 
    "LSTCOVRG", "DRVISITS", "MEDSCOST", "CARERCVD", "MEDBILL1", 
    "ASBIALCH", "ASBIDRNK", "ASBIBING", "ASBIADVC", "ASBIRDUC", 
    "WTCHSALT", "LONGWTCH", "DRADVISE", "ASTHMAGE", "ASATTACK", 
    "ASERVIST", "ASDRVIST", "ASRCHKUP", "ASACTLIM", "ASYMPTOM", 
    "ASNOSLEP", "ASTHMED3", "ASINHALR", "IMFVPLAC", "TETANUS", 
    "HPVTEST", "HPLSTTST", "HPVADVC2", "HPVADSHT", "CNCRDIFF", 
    "CNCRAGE", "CNCRTYP1", "CSRVTRT1", "CSRVDOC1", "CSRVSUM", 
    "CSRVRTRN", "CSRVINST", "CSRVINSR", "CSRVDEIN", "CSRVCLIN", 
    "CSRVPAIN", "CSRVCTL1", "RRCLASS2", "RRCOGNT2", "RRATWRK2", 
    "RRHCARE3", "RRPHYSM2", "RREMTSM2", "SCNTMNY1", "SCNTMEL1", 
    "SCNTPAID", "SCNTWRK1", "SCNTLPAD", "SCNTLWK1", "SCNTVOT1", 
    "SXORIENT", "TRNSGNDR", "RCSBIRTH", "RCSGENDR", "RCHISLA1", 
    "RCSRACE1", "RCSBRAC1", "RCSRLTN2", "CASTHDX2", "CASTHNO2", 
    "EMTSUPRT", "LSATISFY", "QSTVER", "QSTLANG", "MSCODE", "_STSTR", 
    "_STRWT", "_RAWRAKE", "_WT2RAKE", "_AGE80", "_IMPRACE", "_IMPNPH", 
    "_IMPEDUC", "_IMPMRTL", "_IMPHOME", "_LANDWT2", "_LANDWT", 
    "_CHISPNC", "_CPRACE", "_CRACE1", "_IMPCAGE", "_IMPCRAC", 
    "_IMPCSEX", "_CLANDWT", "NPHH", "NAHH", "NCHH", "_HHOLDWT", 
    "_RFHLTH", "_HCVU651", "_TOTINDA", "_LTASTH1", "_CASTHM1", 
    "_ASTHMS1", "_DRDXAR1", "_EXTETH2", "_ALTETH2", "_DENVST2", 
    "_PRACE1", "_MRACE1", "_HISPANC", "_RACE", "_RACEG21", "_RACEGR3", 
    "_RACE_G1", "_AGEG5YR", "_AGE65YR", "_AGE_G", "HTIN4", "HTM4", 
    "WTKG3", "_BMI5", "_BMI5CAT", "_RFBMI5", "_CHLDCNT", "_EDUCAG", 
    "_INCOMG", "_SMOKER3", "_RFSMOK3", "DRNKANY5", "DROCDY3_", 
    "_RFBING5", "_DRNKDY4", "_DRNKMO4", "_RFDRHV4", "_RFDRMN4", 
    "_RFDRWM4", "_FLSHOT6", "_PNEUMO2", "_RFSEAT2", "_RFSEAT3", 
    "_RFMAM2Y", "_MAM502Y", "_RFPAP32", "_RFPSA21", "_RFBLDS2", 
    "_RFSIGM2", "_AIDTST3")

library(readr)

brfss2014b = read_fwf("~/Downloads/LLCP2014.ASC ", fwf_positions(fwf_positions_begin, 
    fwf_positions_end, col_names))

brfss2014b = subset(brfss2014b, STATE == "39")

save(brfss2014b, file = "~/Downloads/brfss2014b.RData")

Next you see the LaF approach.

library(LaF)

laf = laf_open_fwf("LLCP2014.ASC ", column_widths = width, column_types = rep("character", 
    length(width)), column_names = col_names)

brfss2014c = data.frame(laf[, ])

save(brfss2014c, file = "~/Downloads/brfss2014c.RData")

Notice that with LaF we had to specify the column widths via column_widths = width and also the column headings via column_names = col_names.

2.3 Changing Variable Formats

When these files are read, some of the variables will come in as characters when in fact they are numeric (see STATE, for example) or factors (see PHYSHLTH, for example). Others will be read as integers (see GENHLTH), and so on. Consequently, we often have to reset the read formats. Further, we know from the codebook that missing values and other responses (such as refused) carry numeric codes in the survey. For instance, GENHLTH is supposed to have the following responses mapped to the numeric values stored in the variable:

  • 1 = Excellent
  • 2 = Very Good
  • 3 = Good
  • 4 = Fair
  • 5 = Poor
  • 7 = Don’t Know/Not Sure
  • 9 = Refused
  • BLANK = Not asked or Missing

We can take this mapping and fix GENHLTH by creating a new variable, say general.health, as a factor. Note that any values not listed as levels (here 7, 9, and blank) will be set to NA.

load("~/Downloads/brfss2014b.RData")

brfss2014b$general.health = factor(brfss2014b$GENHLTH, levels = c(1, 
    2, 3, 4, 5), labels = c("Excellent", "Very Good", "Good", 
    "Fair", "Poor"))

table(brfss2014b$general.health)
Excellent Very Good      Good      Fair      Poor 
     1689      3499      3425      1643       651 

Let us also flip STATE into a numeric variable, and format IMONTH, IDAY, and IYEAR as a date. For dates the lubridate package comes in very handy.

brfss2014b$STATE = as.numeric(brfss2014b$STATE)

We’ll combine IMONTH, IDAY and IYEAR to create a date variable.

brfss2014b$date = with(brfss2014b, sprintf("%s-%s-%s", IMONTH, 
    IDAY, IYEAR))

head(brfss2014b$date)
## [1] "11-05-2014" "07-14-2014" "12-16-2014" "09-15-2014" "09-08-2014"
## [6] "08-29-2014"
library(lubridate)
brfss2014b$Date = mdy(brfss2014b$date)
is.Date(brfss2014b$Date)
## [1] FALSE
str(brfss2014b$Date)
##  POSIXct[1:10933], format: "2014-11-05" "2014-07-14" "2014-12-16" "2014-09-15" ...

Note how we built date … we first concatenated the three columns and then used lubridate to format date via the mdy() specification. I then checked the result with is.Date(), which returns FALSE here because mdy() produced a POSIXct date-time rather than a Date, as the str() command confirms. Now that Date has been created we could drop the original three columns, but we will let that be for now.
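If a plain Date (rather than a POSIXct date-time) is preferred, base R's as.Date() with an explicit format string gets us there without lubridate; a self-contained sketch:

```r
# base-R equivalent of mdy(): parse a month-day-year string into a Date
d = as.Date("11-05-2014", format = "%m-%d-%Y")
class(d)                  # "Date"
format(d, "%Y-%m-%d")     # "2014-11-05"
```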

2.4 BEA and BLS Data

The Bureau of Labor Statistics (BLS) compiles a lot of data, including some under the Local Area Unemployment Statistics (LAUS) program – a Federal-State cooperative effort in which monthly estimates of total employment and unemployment are prepared for approximately 7,300 areas:

  • Census regions and divisions
  • States
  • Metropolitan Statistical Areas and Metropolitan NECTAS (New England City and Town Areas)
  • Metropolitan Divisions and NECTA Divisions
  • Micropolitan Statistical Areas and Micropolitan NECTAs
  • Combined Metropolitan Statistical Areas and Combined NECTAs
  • Small Labor Market Areas
  • Counties and county equivalents
  • Cities of 25,000 population or more
  • Cities and towns in New England regardless of population

These estimates are key indicators of local economic conditions. The Bureau of Labor Statistics (BLS) of the U.S. Department of Labor is responsible for the concepts, definitions, technical procedures, validation, and publication of the estimates that State employment security agencies prepare under agreement with BLS. Data available at BLS can be found here

Similarly, the Bureau of Economic Analysis (BEA) has a regional economic accounts program that details the geographic distribution of U.S. economic activity and growth. “The estimates of gross domestic product by state and state and local area personal income, and the accompanying detail, provide a consistent framework for analyzing and comparing individual state and local area economies.”

While the BEA data read below were obtained from here, the BLS data read below were obtained from here and here, respectively.

setwd("~/Downloads/")
df.bea = read.csv("CA1/CA1_1969_2014_OH.csv")

df.bls1 = read.csv("http://www.bls.gov/lau/laucnty14.txt", header = FALSE, 
    skip = 6, sep = "")
colnames(df.bls1) = c("LAUS Code", "StateFIPSCode", "CountyFIPSCode", 
    "CountyName", "StateAbbreviation", "Year", "TotalLaborForce", 
    "TotalEmployed", "TotalUnemployed", "UnemploymentRate")

df.bls2 = read.csv("http://www.bls.gov/cew/data/api/2015/2/industry/10.csv", 
    sep = ",", header = TRUE)

3 Initiative and Referendum Data (1904-2010)

The Initiative and Referendum Institute has long gathered data on direct democracy in the states. You can read more here. Two data sets are readily available for download here:

IRI initiatives by state and year for the 1904 through 2010 period. This database lists all statewide initiatives to appear on the ballot since the first statewide initiative in Oregon in 1904. This list does not contain information on issues placed on the ballot by state legislatures, commonly referred to as “legislative measures.”

setwd("~/Downloads/")
library(readxl)
df.ir = read_excel("IRI initiatives by state-year (1904-2010).xls", 
    col_names = FALSE)
colnames(df.ir) = c("StateName", "Year", "Number", "ApprovalRate")

The Legal Landscape Database describes direct democracy provisions in American cities in 2005. The database includes the 1,000 largest cities in the country as well as the ten largest cities in every state. Data items include city name, FIPS code, population, and initiative status.

setwd("~/Downloads/")
df.ircity = read_excel("Legal Landscape (version 20080504).xls")
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 00 00 e5 05 00 00 05 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 34 30 32 33 30 30 30 30 64 62 63 62 36 3c 00 00 0b 00 8a 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 34 39 61 31 30 30 30 30 32 36 37 37 33 3c 00 00 07 00 67 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 36 30 36 38 30 30 30 30 34 38 61 36 35 3c 00 00 09 00 67 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 38 66 61 39 30 30 30 30 34 39 39 62 34 3c 00 00 05 00 67 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 62 61 33 62 30 30 30 30 37 64 33 30 31 3c 00 00 0d 00 8a 00 
## DEFINEDNAME: 00 00 00 1b 07 00 00 00 01 00 00 00 00 00 00 44 6f 63 75 6d 65 6e 74 31 7a 7a 53 50 5f 62 64 36 38 30 30 30 30 36 34 32 38 31 3c 00 00 03 00 67 00 

4 Education Data (CCD)

“The Common Core of Data (CCD) is a program of the U.S. Department of Education’s National Center for Education Statistics that annually collects fiscal and non-fiscal data about all public schools, public school districts and state education agencies in the United States. The data are supplied by state education agency officials and include information that describes schools and school districts, including name, address, and phone number; descriptive information about students and staff, including demographics; and fiscal data, including revenues and current expenditures.” You can learn more about the program here and access the data and documentation here.

setwd("~/Downloads/")
df.nces = read.csv("nces1314.csv", header = TRUE, sep = ",", 
    skip = 5)

colnames(df.nces) = c("SchoolName", "State", "City", "Address", 
    "StateAbb", "ZIP", "ZIP4", "StateAbb2", "SchoolID", "Agency", 
    "AgencyID", "County", "Countynumber", "StateFIPS", "SchoolType", 
    "AgencyType", "Status", "Charter", "Magnet", "Latitude", 
    "Longitude", "StateSchoolID", "StateAgencyID", "VirtualSchool", 
    "FreeLunch", "ReducedLunch", "TotalFRL.N", "TotalStudents", 
    "FTE.Teachers", "PupilTeacherRatio")

These are data for a single year. If we wanted to build a panel (i.e., data for multiple schools over multiple years) we would have to download all the files and then bind them into a single data frame.
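The binding step could be sketched as follows, with toy one-year data frames standing in for the downloaded CCD files (ccd1314 and ccd1415 are hypothetical names, not real file contents):

```r
# toy stand-ins for two yearly CCD extracts (values are made up)
ccd1314 = data.frame(SchoolID = c(1, 2), TotalStudents = c(300, 450))
ccd1415 = data.frame(SchoolID = c(1, 2), TotalStudents = c(310, 440))

# tag each file with its year, then stack them into one panel
ccd1314$Year = 2013
ccd1415$Year = 2014
panel = rbind(ccd1314, ccd1415)
nrow(panel)
```

With many files, reading them in a loop via lapply() and stacking with do.call(rbind, ...) avoids repeating yourself.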

5 Cleaning and Preparing Acquired Data

Data are rarely clean, unless you have a little army working for you that knows exactly what is needed to prepare the data for analysis, or, if you are lucky, someone else has basically shipped you a clean version of the data. More often than not you will end up spending 90% of your time cleaning and preparing your data for analysis. Some of the more common procedures you might run into are outlined below but do note that these are but a smidgen of what may need to be done or what is possible with R.

We will look at a small array of operations, starting with the easiest – sub-setting data.

5.1 Subsetting Data

Sub-setting a data set means creating a smaller piece of the original data by selecting observations that meet a criterion of interest. For example, maybe we are not interested in the nationwide BRFSS data we downloaded but instead just the data for Ohio (state FIPS code 39). This could be easily achieved as follows:

setwd("~/Downloads/")

load("brfss2014c.RData")

brfss2014c.OH = subset(brfss2014c, STATE == 39)

Take another example, the Census data we downloaded earlier. Suppose we wanted to focus our analysis on all Places with a population size of 25000 or more. How might we do this?

df.cvap.sub = subset(df.cvap, TOT_EST >= 25000)

Wait a minute. Let us make it even smaller, by focusing only on Ohio’s Places with a total population of at least 25000. To do this we’ll have to split the GEONAME column to create two columns – PlaceName and StateName.

library(splitstackshape)
df.cvap2 = cSplit(as.data.table(df.cvap), "GEONAME", ",")

Now we have two new columns, GEONAME_1 and GEONAME_2, so let us rename them as needed.

names(df.cvap2)[names(df.cvap2) == "GEONAME_1"] <- "PlaceName"
names(df.cvap2)[names(df.cvap2) == "GEONAME_2"] <- "StateName"

Now let us subset this data frame to places in Ohio with a total population of 25000 or more.

df.cvap3 = subset(df.cvap2, StateName == "Ohio" & TOT_EST >= 
    25000)

What if we wanted multiple states, say Ohio and West Virginia, and wanted to retain only the rows for the total population?

df.cvap4 = subset(df.cvap2, (StateName == "Ohio" & TOT_EST >= 
    25000 & LNNUMBER == 1) | (StateName == "West Virginia" & 
    TOT_EST >= 25000 & LNNUMBER == 1))
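Conditions like this get unwieldy as more states are added; %in% keeps the state test compact. A sketch with toy data standing in for df.cvap2:

```r
# toy stand-in for df.cvap2 (values are made up)
places = data.frame(
    StateName = c("Ohio", "West Virginia", "Kentucky"),
    TOT_EST   = c(30000, 27000, 40000),
    LNNUMBER  = c(1, 1, 1)
)

# one condition covers any number of states
kept = subset(places, StateName %in% c("Ohio", "West Virginia") &
    TOT_EST >= 25000 & LNNUMBER == 1)
nrow(kept)
```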

5.2 Recoding Variables

We often end up having to construct new variables because the original variables are not perfectly suited for our purposes. Maybe we want to create a new variable based on quartiles or collapse some nominal or ordinal variable into fewer categories. We’ll see a couple of ways to carry out these recodings. First, let us create a new category called LargeCity in the df.cvap data set.

df.cvap$LargeCity = NA
df.cvap$LargeCity[df.cvap$TOT_EST <= 1e+05 & df.cvap$LNNUMBER == 
    1] = "Not Large City"
df.cvap$LargeCity[df.cvap$TOT_EST > 1e+05 & df.cvap$LNNUMBER == 
    1] = "Large City"
table(df.cvap$LargeCity)

    Large City Not Large City 
           303          29206 
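The quartile-based recode mentioned at the top of this section can be done in one step with cut(), which bins a numeric variable at supplied break points; a self-contained sketch on simulated data:

```r
set.seed(42)
x = rnorm(100, mean = 50000, sd = 10000)    # simulated population estimates

# bin at the quartile boundaries; include.lowest keeps the minimum in Q1
quartile.group = cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
    include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4"))
table(quartile.group)
```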

5.3 Reshaping Data

Broadly summarized, data tend to arrive in one of two formats – either in what we call the wide format (shown below)

GeoName  LineCode  Description  X1969  X1970  X2014
Adams           1  Population   18895  19117  28129
Adams           2  Income       45179  45202  51039
Athens          1  Population   16895  17117  31129
Athens          2  Income       41179  41102  61572

or in the long format (shown below).

GeoName  LineCode  Description  Year  Estimate
Adams           1  Population   1969     18895
Adams           1  Population   1970     19117
Athens          1  Population   1969     16895
Athens          1  Population   1970     17117

setwd("~/Downloads/CA1/")
df = read.csv("CA1_1969_2014_OH.csv", sep = ",", header = TRUE)
head(df[c(2, 7, 8, 9)], n = 9)
GeoName           Description                                 X1969     X1970
Ohio state total  Personal income (thousands of dollars)   41814714  44121386
Ohio state total  Population (persons) 1/                  10563000  10668839
Ohio state total  Per capita personal income (dollars) 2/      3959      4136
Adams, OH         Personal income (thousands of dollars)      45171     49857
Adams, OH         Population (persons) 1/                     18895     19117
Adams, OH         Per capita personal income (dollars) 2/      2391      2608
Allen, OH         Personal income (thousands of dollars)     415365    439215
Allen, OH         Population (persons) 1/                    110340    111084
Allen, OH         Per capita personal income (dollars) 2/      3764      3954
df2 = subset(df, Region != "NA")
df2$IndustryClassification = NULL
df2$Table = NULL
df2$Region = NULL
df2$LineCode = factor(df2$LineCode, levels = c(1, 2, 3), labels = c("Personal Income", 
    "Population", "Per Capita Personal Income"))
df2$Description = NULL


library(reshape2)
df3 = melt(df2, id.vars = c("GeoFIPS", "GeoName", "LineCode"), 
    variable.name = "Year", value.name = "Estimate")

df4 = as.data.frame(sapply(df3, gsub, pattern = "X", replacement = ""))

df5 <- dcast(df4, GeoName + Year ~ LineCode, value.var = "Estimate")

colnames(df5) = c("GeoName", "Year", "pcpersinc", "persinc", 
    "popn")

cols = c(3, 4, 5)

df5[, cols] = apply(df5[, cols], 2, function(x) as.numeric(as.character(x)))

head(df5, n = 9)
GeoName    Year  pcpersinc  persinc   popn
Adams, OH  1969       2391    45171  18895
Adams, OH  1970       2608    49857  19117
Adams, OH  1971       2566    50988  19870
Adams, OH  1972       2708    56026  20687
Adams, OH  1973       2897    63292  21847
Adams, OH  1974       3253    73276  22525
Adams, OH  1975       3553    79034  22247
Adams, OH  1976       3703    85818  23178
Adams, OH  1977       4200    98894  23546

5.4 Rectangularizing Data

Sometimes the data are not rectangular. A rectangular data set has the same number of rows (one per category) and the same columns for every unit; at times the downloaded data will not have this property. One example is the Ohio Department of Education’s district-level data. Let us download three files: one with total enrollment, another with the number of students with an economic disadvantage, and a third with the racial distribution of students. These data can be downloaded from here. Make sure when you Export the results you do so as a Plain text file, tab-delimited and with Export Report Title deselected.

setwd("~/Downloads/")
df.a = read.csv("Enrollment (District).txt", skip = 4, header = FALSE, 
    sep = "\t")
summary(df.a)
       V1                      V2           V3            V4      
 Min.   :   125   Buckeye Local  :   3   Mode:logical   88     :   5  
 1st Qu.: 43588   Crestview Local:   3   NA's:1061      98     :   5  
 Median : 46359   Eastern Local  :   3                  139    :   5  
 Mean   : 47115   Green Local    :   3                  36     :   4  
 3rd Qu.: 49601   Madison Local  :   3                  62     :   4  
 Max.   :151209   Northwest Local:   3                  108    :   4  
                  (Other)        :1043                  (Other):1034  
df.a$V3 = NULL
colnames(df.a) = c("DistrictIRN", "DistrictName", "Enrollment")

df.b = read.csv("Enrollment by Student Demographic (District).txt", 
    skip = 4, header = FALSE, sep = "\t")
summary(df.b)
       V1                        V2                                       V3           V4            V5             V6      
 Min.   :   125   Perry Local      :  14   American Indian or Alaskan Native: 43   Mode:logical   11     :  93   >95%   : 285  
 1st Qu.: 43950   Buckeye Local    :  13   Asian                            :281   NA's:3372      12     :  90   1.7%   :  59  
 Median : 46094   Northwest Local  :  13   Black, Non-Hispanic              :707                  14     :  80   .9%    :  57  
 Mean   : 47575   Springfield Local:  13   Hispanic                         :652                  13     :  64   1.2%   :  55  
 3rd Qu.: 49205   Madison Local    :  12   Multiracial                      :712                  16     :  63   1.3%   :  55  
 Max.   :151209   Lakota Local     :  10   Pacific Islander                 : 18                  15     :  60   .8%    :  50  
                  (Other)          :3297   White, Non-Hispanic              :959                  (Other):2922   (Other):2811  
df.b$V2 = NULL
df.b$V4 = NULL
df.b$V6 = NULL
colnames(df.b) = c("DistrictIRN", "Race", "Number")

df.c = read.csv("Enrollment by Student Demographic (District) (1).txt", 
    skip = 4, header = FALSE, sep = "\t")
summary(df.c)
       V1                      V2       V3          V4            V5             V6      
 Min.   :   125   Buckeye Local  :   6   N: 902   Mode:logical   10     :  10   >95%   : 210  
 1st Qu.: 43968   Crestview Local:   6   Y:1009   NA's:1911      11     :   9   36.3%  :   8  
 Median : 46565   Green Local    :   6                           16     :   9   63.7%  :   8  
 Mean   : 47795   Madison Local  :   6                           20     :   8   42.2%  :   7  
 3rd Qu.: 49639   Northwest Local:   6                           22     :   8   43.7%  :   7  
 Max.   :151209   Perry Local    :   6                           76     :   8   46.1%  :   7  
                  (Other)        :1875                           (Other):1859   (Other):1664  
df.c$V2 = NULL
df.c$V4 = NULL
df.c$V6 = NULL
colnames(df.c) = c("DistrictIRN", "Econ", "Number")

We read in each file with read.csv("filepath", skip = 4, header = FALSE, sep = "\t"), skipping 4 lines from the top (because this gives us a cleaner file). Because there are no column headers the data frame will end up with columns named V1, V2, … and so on. I am dropping certain columns because they are either blank or because we don’t need them. The colnames(dataframe) = c(…) command assigns meaningful column headers to each data frame.

df.a has one row per district but df.c should have two rows per district – one row for the number of students flagged as economically disadvantaged (Econ == “Y”) and the other row for the number of students flagged as not economically disadvantaged (Econ == “N”). Let us see how many rows there are:

table(df.c$Econ)
   N    Y 
 902 1009 

Aha! Only 902 “N” rows but 1009 “Y” rows. If every district had both rows the counts would be equal and the data frame rectangular, but we can make it so via the epicalc library and the following code:

library(epicalc)
df.c2 <- fillin(df.c, select = c(DistrictIRN, Econ))
table(df.c2$Econ)
N Y
1061 1061

The race data, df.b, have the same problem and can be fixed similarly.

table(df.b$Race)
American Indian or Alaskan Native Asian Black, Non-Hispanic Hispanic Multiracial Pacific Islander White, Non-Hispanic
43 281 707 652 712 18 959
df.b2 <- fillin(df.b, select = c(DistrictIRN, Race))
table(df.b2$Race)
American Indian or Alaskan Native Asian Black, Non-Hispanic Hispanic Multiracial Pacific Islander White, Non-Hispanic
1059 1059 1059 1059 1059 1059 1059
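As an aside, epicalc has since been archived on CRAN, so fillin() may not install cleanly on current R versions. Base R can rectangularize the same way by merging the data onto the full grid of key combinations; a sketch on toy numbers:

```r
# Toy version of the problem: district 3 is missing its "N" row
df <- data.frame(DistrictIRN = c(1, 1, 2, 2, 3),
                 Econ = c("N", "Y", "N", "Y", "Y"),
                 Number = c(10, 40, 20, 30, 25))

# Build every DistrictIRN x Econ combination, then left-join the data onto it;
# the missing (3, "N") combination comes back with Number = NA
grid <- expand.grid(DistrictIRN = unique(df$DistrictIRN),
                    Econ = unique(df$Econ))
df2 <- merge(grid, df, all.x = TRUE)
```

After the merge, table(df2$Econ) shows equal counts for "N" and "Y", just as fillin() produces.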

One last thing before we move on: Let us flip the following variables:

  • In df.a, flip DistrictIRN to numeric and Enrollment to numeric
  • In df.b2, flip DistrictIRN to numeric and Number to numeric
  • In df.c2, flip DistrictIRN to numeric and Number to numeric

Pay attention to the code; there is a subtle difference in how the variables are converted.

df.a$DistrictIRN2 = as.numeric(df.a$DistrictIRN)
df.a$Enrollment2 = as.numeric(gsub(",", "", df.a$Enrollment))

df.b2$DistrictIRN2 = as.numeric(df.b2$DistrictIRN)
df.b2$Number2 = as.numeric(gsub(",", "", df.b2$Number))

df.c2$DistrictIRN2 = as.numeric(df.c2$DistrictIRN)
df.c2$Number2 = as.numeric(gsub(",", "", df.c2$Number))

If a numeric variable has been read as a Factor and contains commas, the gsub(",", "", df$variable) portion of the code strips the commas before the variable is converted to numeric. Note that, to be safe, I am creating new variables in each data frame, one for every format conversion. This way, if something goes wrong during the conversion we will still have the original variable to rework.
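The comma problem is easy to reproduce on a toy vector: as.numeric() alone cannot parse "1,234" and yields NA (with a warning), while stripping the commas first recovers the actual numbers.

```r
x <- c("1,234", "567")          # counts read in as text, with commas
as.numeric(x)                   # "1,234" cannot be parsed: NA, with a warning
as.numeric(gsub(",", "", x))    # strip commas first: 1234, 567
```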

5.5 Merging Data

Now that we have the education data in three data frames we can merge them into a single data frame. This is often necessary since data will not always be downloaded in one single file. Merging is fairly seamless, so long as one is careful about a few things. First, the column (or columns) you merge on must be identically named and formatted in every file being merged. This means DistrictIRN2 must end up named DistrictIRN in each data frame, which we do by dropping the original DistrictIRN (along with the character versions of Enrollment and Number) and then renaming DistrictIRN2 to DistrictIRN.

df.a$DistrictIRN = NULL
df.b2$DistrictIRN = NULL
df.c2$DistrictIRN = NULL

df.a$Enrollment = NULL
df.b2$Number = NULL
df.c2$Number = NULL

names(df.a)[names(df.a) == "DistrictIRN2"] <- "DistrictIRN"
names(df.b2)[names(df.b2) == "DistrictIRN2"] <- "DistrictIRN"
names(df.c2)[names(df.c2) == "DistrictIRN2"] <- "DistrictIRN"

Second, you need to decide beforehand whether you want a 1:1 merge, where only those records present in both files are kept. Thus, for example, if we have 1061 districts in df.a and df.b2 but only 1059 districts in df.c, a 1:1 merge would leave the merged data frame with only 1059 districts. We could, of course, ask merge() to keep all districts even if they appear in only one file; that is what is being done below with all = TRUE.
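The difference between the two behaviors is easy to demonstrate on toy data: merge() defaults to keeping only matched rows, while all = TRUE keeps everything.

```r
a <- data.frame(DistrictIRN = c(1, 2, 3), Enrollment = c(100, 200, 300))
b <- data.frame(DistrictIRN = c(2, 3, 4), PctEconDis = c(40, 55, 60))

inner <- merge(a, b, by = "DistrictIRN")              # districts 2 and 3 only
full  <- merge(a, b, by = "DistrictIRN", all = TRUE)  # districts 1-4, with NAs
```

The unmatched rows in the full merge carry NA in the columns coming from the other data frame.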

Before we merge though, I am going to reshape the Econ and the Race data frames.

df.c3 <- reshape(df.c2, timevar = "Econ", idvar = c("DistrictIRN"), 
    direction = "wide")

df.b3 <- reshape(df.b2, timevar = "Race", idvar = c("DistrictIRN"), 
    direction = "wide")
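If the wide direction of reshape() is unfamiliar: it spreads each level of timevar into its own column, suffixing the value column with the level. On toy data:

```r
long <- data.frame(DistrictIRN = c(1, 1, 2, 2),
                   Econ = c("N", "Y", "N", "Y"),
                   Number = c(10, 40, 20, 30))

# One row per district; Number splits into Number.N and Number.Y
wide <- reshape(long, timevar = "Econ", idvar = "DistrictIRN",
                direction = "wide")
```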

and now we merge df.a and df.c3 first.

merge.1 = merge(df.a, df.c3, by = c("DistrictIRN"), all = TRUE)

If you look at merge.1 you will see missing Number2.N values for some districts. Let us fill these in by computing Number2.N = Enrollment2 - Number2.Y wherever Number2.N is missing. This can be done via the data.table package.

library(data.table)
df1 = data.table(merge.1)
df1.b = df1[is.na(Number2.N), `:=`(Number2.N, Enrollment2 - Number2.Y)]
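For readers not using data.table, the same conditional fill can be written in base R with ifelse(); a sketch on made-up numbers:

```r
d <- data.frame(Enrollment2 = c(100, 200),
                Number2.Y   = c(40, 120),
                Number2.N   = c(60, NA))

# Where Number2.N is missing, derive it from the other two columns
d$Number2.N <- ifelse(is.na(d$Number2.N),
                      d$Enrollment2 - d$Number2.Y,
                      d$Number2.N)
```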

Now we can create a PctEconDis column that will be (Number2.Y / Enrollment2) * 100 and then we can drop the Number2.N and Number2.Y columns.

df1.b$PctEconDis = with(df1.b, (Number2.Y/Enrollment2) * 100)

df1.b$Number2.N = NULL
df1.b$Number2.Y = NULL

Now we do the same thing with the Race data, merging df.a and df.b3. This one will be trickier; you'll see why.

merge.2 = merge(df.a, df.b3, by = c("DistrictIRN"), all = TRUE)

Note how many cells are missing because a district may have data on students of some racial groups but not others. We will therefore have to figure out a way to create a variable called PctMinority. This would be easy if every district had a value for the Number2.White, Non-Hispanic variable, but that column has missing values too, so we use a modified version of the preceding code.

df2 = data.table(merge.2)

# Sum the supplied vectors row by row, treating NA as zero
myRowSums = function(...) {
    retVal = as.numeric()
    df = data.frame(...)
    for (r in row.names(df)) {
        retVal = append(retVal, sum(df[r, ], na.rm = TRUE))
    }
    return(retVal)
}
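The loop above works, but base R's rowSums() with na.rm = TRUE computes the same NA-ignoring row sums in a single vectorized call:

```r
m <- cbind(a = c(1, NA, 3), b = c(4, 5, NA))
rowSums(m, na.rm = TRUE)   # NAs are dropped before each row is summed
```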

merge.2 <- within(merge.2, NonMinority <- ifelse(!is.na(`Number2.White, Non-Hispanic`), 
    `Number2.White, Non-Hispanic`, Enrollment2 - myRowSums(Number2.Asian, 
        `Number2.Black, Non-Hispanic`, Number2.Hispanic, `Number2.American Indian or Alaskan Native`, 
        `Number2.Pacific Islander`)))

merge.2$PctMinority = with(merge.2, ((Enrollment2 - NonMinority)/Enrollment2) * 
    100)

my.vars = c("DistrictIRN", "PctMinority")

df2.b = merge.2[my.vars]

Whew! Now we can merge df1.b and df2.b to construct a single data frame that has, for every district, Enrollment2, PctEconDis, and PctMinority, along with the DistrictIRN and DistrictName.

df.educ = merge(df1.b, df2.b, by = "DistrictIRN", all = TRUE)

colnames(df.educ) = c("DistrictIRN", "DistrictName", "Enrollment", 
    "PctEconDis", "PctMinority")

save(df.educ, file = "educ.RData")