R is a free software environment for statistical computing and graphics. It is powerful, elegant, and incredibly flexible, and the best part is you don’t need to be a programmer to use it. RStudio is an integrated development environment (IDE) for R that gives you a graphical interface to work in; it is also free and, in my view, more capable than many commercial alternatives.
The first thing you need to do to get going is to download R. You can download R for Windows from here and R for Mac from here. Double-click on the downloaded file and accept the default settings as you go through the installation. Once R is installed you can install RStudio for Windows from here and RStudio for Mac from here. There are daily builds of RStudio, and the latest builds have some features not in the latest stable release. I would suggest that for this semester you download and install RStudio from here for Windows and here for Mac.
Accept the default prompts through the installation process. Once installation finishes double-click the RStudio shortcut/icon and RStudio will launch. If all goes well you should see R starting up inside RStudio and the interface looking as shown below:
Both R and RStudio go through very frequent updates, some minor, some major. As needed, repeat the steps you took above to re-install and update your version of R and RStudio.
R has packages dedicated to performing specific tasks. Want to analyze gene sequencing data? There is a package for it. Want to analyze financial data? There is a package for that. How about elegant graphics? You bet; there is a package for that. Mapping anyone? You betcha; there is a package for that too. We will use specific packages in this course and I will point out what packages you need and when. You will have to use the “Install Packages…” option under Tools to install these packages. Packages are frequently updated so once a month you should see if there are updates available by running “Check for Package Updates…”, also under Tools.
Let us go ahead and install some packages we will need. The necessary packages are listed below:
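Packages can also be installed from the R console instead of the Tools menu. The sketch below is only an illustration: it assumes the packages wanted are readxl and foreign (the two loaded later in this document), and missing_packages is a helper name made up for this example.

```r
# Hypothetical helper: return the packages in 'pkgs' that are not yet installed.
missing_packages = function(pkgs) {
  installed = rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Assumed package list: the two packages loaded later in this document.
to_install = missing_packages(c("readxl", "foreign"))

# interactive() is TRUE at the RStudio console, so nothing gets installed
# when this code is run as a non-interactive script.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Running update.packages() at the console is the counterpart of “Check for Package Updates…” under Tools.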
R can read data created in various formats (SPSS, SAS, Stata, Excel, CSV, TXT, etc.). The most common data formats you will encounter are likely to be CSV or Excel files. Let us see how to read data in these formats by first downloading and saving the data available here (as a zip archive). Once this file downloads, double-click it and extract all four files to a new folder (title it Data) that you create in your OU Box folder for the course.
The first thing R will need to know is where your data reside. This can be accomplished either by setting the working directory or by explicitly specifying the path to your data. We will employ the second option for now.
df.csv = read.csv("~/Downloads/Archive/file1.csv", sep = ",",
header = TRUE)
df.csv is the name I have chosen to give to the data being read. The command tells R that the file is in CSV format (read.csv), where the file resides and what it is called ("~/Downloads/Archive/file1.csv"), that the variables (one in each column) are separated by a comma (sep = ","), and that the original data have column headings (header = TRUE).
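If you would like to try read.csv() before downloading anything, the following sketch writes a tiny made-up data set to a temporary folder and reads it back by giving the full path. The object names (toy, toy.path, toy.csv) and the values are invented for this illustration.

```r
# Build a tiny illustrative data set (values are made up for this example).
toy = data.frame(id = c(1, 2, 3), score = c(90, 85, 77))

# Write it to a temporary folder so the example does not touch your own files.
toy.path = file.path(tempdir(), "toy.csv")
write.csv(toy, file = toy.path, row.names = FALSE)

# Read it back by explicitly specifying the path to the file.
toy.csv = read.csv(toy.path, sep = ",", header = TRUE)

# Inspect the result: column names and the first few rows.
names(toy.csv)
head(toy.csv)
```

Because header = TRUE, the first line of the file (id,score) becomes the column names rather than a row of data.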
Note that when you create anything in R, you do so either via the = symbol or via the <- symbol. Thus df.csv = read.csv(…) is the same as df.csv <- read.csv(…), but my suggestion would be to stick with =.
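A minimal check of the two assignment symbols, using made-up objects a and b. Be careful not to confuse the assignment arrow <- with <=, which is the "less than or equal to" comparison and does not create anything.

```r
# Both symbols create an object.
a = c(1, 2, 3)
b <- c(1, 2, 3)

# The two objects are exactly the same.
identical(a, b)   # TRUE

# <= is a comparison, not an assignment: it tests each value against 2.
a <= 2            # TRUE TRUE FALSE
```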
When you execute the command you will see df.csv showing up under Data in the upper-right pane of RStudio. Click on df.csv and you can see the data.
A similar process works for reading in tab-delimited files where the columns are separated by a tab rather than by a comma.
df.tab = read.csv("~/Downloads/Archive/file1.txt", sep = "\t",
header = TRUE)
Note the one difference here: I have told R it is a tab-delimited file by specifying sep = "\t".
Excel files can be read with the readxl package. Whenever we need to use a package we will have to first load it and then execute whatever commands call upon the loaded package’s features, as shown below.
library(readxl)
df.xls = read_excel("~/Downloads/Archive/file1.xls")
df.xlsx = read_excel("~/Downloads/Archive/file2.xlsx")
Note the one minor difference in the commands; the xlsx file is called file2.xlsx.
It is also possible to specify the full web-path for a file.
fpe = read.table("http://data.princeton.edu/wws509/datasets/effort.dat")
test = read.table("http://www.ats.ucla.edu/stat/data/test.txt",
header = TRUE)
test.csv = read.csv("http://www.ats.ucla.edu/stat/data/test.csv",
header = TRUE)
library(foreign)
hsb2.spss = read.spss("http://www.ats.ucla.edu/stat/spss/webbooks/reg/hsb2.sav")
df.hsb2.spss = as.data.frame(hsb2.spss)
rm("hsb2.spss") # Removing the intermediate object; the data frame df.hsb2.spss is kept
R is able to read data from Twitter feeds, buoys sitting in the Atlantic Ocean, and so much more!
You can generate your own data, manipulate data by adding, subtracting, dividing, or multiplying, and convert numeric data to factors (qualitative variables), etc. We will see a few basic data operations at work below. Let us start by creating some data.
Let us create two variables, x and y.
x = c(100, 101, 102, 103, 104, 105, 106)
y = c(7, 8, 9, 10, 11, 12, 13)
df = as.data.frame(cbind(x, y))
The commands above generate two columns, x and y, and then bind them as columns into a data-set called df. If we used rbind() instead, it would bind x and y as rows instead of columns.
x = c(100, 101, 102, 103, 104, 105, 106)
y = c(7, 8, 9, 10, 11, 12, 13)
df.rows = as.data.frame(rbind(x, y))
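One way to confirm the difference between the two layouts is to compare their dimensions. This is a small sketch; by.cols and by.rows are names made up for the illustration.

```r
x = c(100, 101, 102, 103, 104, 105, 106)
y = c(7, 8, 9, 10, 11, 12, 13)

# Bind the same two variables as columns and as rows.
by.cols = as.data.frame(cbind(x, y))
by.rows = as.data.frame(rbind(x, y))

# dim() reports the number of rows first, then the number of columns.
dim(by.cols)        # 7 2
dim(by.rows)        # 2 7

# With rbind() the variable names end up as row names.
rownames(by.rows)   # "x" "y"
```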
Note that when we use rbind() it names the columns V1, V2, and so on. Often we will want to label the columns differently from how they were read in. This is easily accomplished:
names(df) = c("Variable 1", "Variable 2")
names(df.rows) = c("Variable 1", "Variable 2", "Variable 3",
"Variable 4", "Variable 5", "Variable 6", "Variable 7")
You can also generate data-sets that combine quantitative and qualitative variables. This is demonstrated below:
x = c(100, 101, 102, 103, 104, 105, 106)
y = c("Male", "Female", "Male", "Female", "Female", "Male", "Female")
df.1 = data.frame(x, y) # data.frame() keeps x numeric; cbind() would convert x to text because y is text
x = c(100, 101, 102, 103, 104, 105, 106)
y = c(0, 1, 0, 1, 1, 0, 1)
df.2 = as.data.frame(cbind(x, y))
Note that in df.1 y is a string variable with values of Male/Female. In contrast, df.2 has y specified as a 0/1 variable, with 0 = Male and 1 = Female. We could label the 0/1 values in df.2 as follows:
df.2$y = factor(df.2$y, labels = c("Male", "Female"))
If you click the “play” button before df.2 you will see the contents of the data-set. Note that x is shown as num (numeric) while y is shown as Factor with two levels, “Male” and “Female”.
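The same information is available from the console. The sketch below rebuilds the small data-set under a made-up name (check) so that it runs on its own, and then inspects it with str(), levels(), and table().

```r
x = c(100, 101, 102, 103, 104, 105, 106)
y = c(0, 1, 0, 1, 1, 0, 1)
check = as.data.frame(cbind(x, y))

# Label the 0/1 values: 0 becomes Male, 1 becomes Female.
check$y = factor(check$y, labels = c("Male", "Female"))

# str() lists each variable with its type (num, Factor, ...).
str(check)

# levels() shows the labels; table() counts the cases in each level.
levels(check$y)   # "Male" "Female"
table(check$y)    # Male: 3, Female: 4
```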
We can operate on x as follows:
df.2$x1 = df.2$x * 10
df.2$x2 = df.2$x * 100
df.2$x3 = df.2$x/10
df.2$x4 = sqrt(df.2$x)
df.2$x5 = df.2$x^(2)
df.2$x6 = df.2$x * 1.31
Note the various operators; we multiply via *, divide via /, take the square-root via sqrt(), raise to a power via ^, and so on.
We can save a data-set we have created quite easily (see below):
save(df.2, file = "~/Downloads/Archive/df2.RData")
Note the sequence. We specify the data set we want to save, here df.2, and then the location and filename of the saved data: file = "~/Downloads/Archive/df2.RData". If you look at the folder specified in the command you will see a file called df2.RData.
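A saved .RData file is read back with load(). The sketch below is a self-contained round trip that uses a temporary folder and a made-up object called scores, so it does not touch your own files.

```r
# A small made-up data set to save.
scores = data.frame(x = c(1, 2, 3))

# Save it to a temporary location.
scores.path = file.path(tempdir(), "scores.RData")
save(scores, file = scores.path)

# Remove the object from the workspace, then bring it back from the file.
rm(scores)
load(scores.path)   # restores the object under its original name, scores

exists("scores")    # TRUE
```

Note that load() does not need an assignment: the object comes back with the name it had when it was saved.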
Let us load some larger data-sets, perhaps the hsb2 data we used last semester.
hsb2 = read.table("http://www.ats.ucla.edu/stat/r/modules/hsb2.csv",
header = TRUE, sep = ",")
Note that there are no labels for the various qualitative variables (female, race, ses, schtyp, and prog) so we’ll have to create these.
hsb2$female = factor(hsb2$female, labels = c("Male", "Female"))
hsb2$race = factor(hsb2$race, labels = c("Hispanic", "Asian",
"African American", "White"))
hsb2$ses = factor(hsb2$ses, labels = c("Low", "Middle", "High"))
hsb2$schtyp = factor(hsb2$schtyp, labels = c("Public", "Private"))
hsb2$prog = factor(hsb2$prog, labels = c("General", "Academic",
"Vocational"))
Having added labels to the factors in hsb2 we can now save the data for later use.
save(hsb2, file = "~/Downloads/Archive/hsb2.RData")
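One caution about labelling factors this way: factor() with only labels assigns them to the sorted values that actually occur in the data. If a code happens to be absent, the labels no longer line up. Supplying levels as well makes the mapping explicit. The sketch below uses a made-up vector of codes (ses.codes) with no 2 in it, purely for illustration.

```r
# Made-up codes for illustration: 1 = Low, 2 = Middle, 3 = High, with no 2 present.
ses.codes = c(1, 3, 3, 1)

# With labels only, R finds two distinct values but three labels and stops with an error.
attempt = try(factor(ses.codes, labels = c("Low", "Middle", "High")), silent = TRUE)
inherits(attempt, "try-error")   # TRUE

# With levels supplied, each code is tied to its own label.
ses = factor(ses.codes, levels = c(1, 2, 3), labels = c("Low", "Middle", "High"))
table(ses)   # Low: 2, Middle: 0, High: 2
```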