1 Graphics in R

Now we’ll see one of R’s premier packages in action when graphing data. Let us load the hsb2.RData we saved earlier.

load("~/Downloads/Archive/hsb2.RData")

ggplot2 is one of the leading R packages for graphics, followed closely by lattice. Let us work with ggplot2 first and fit some simple graphs. Note that there is extensive help available for ggplot2 on the web. You can start with the Cookbook for R or the ggplot2 documentation. You can also search on stackoverflow.

1.1 The Mechanics of ggplot2

ggplot2 uses the grammar of graphics to build graphs by breaking up each graph into three components – data, aesthetics, and geometry. You specify the data frame with the data command, then the x and y coordinates with the aes command, and finally the geometry (bar-chart, histogram, etc.) via the geom_ command. The geometry for some of the graphs we will use most often is listed below:

  • geom_bar() – bar-chart
  • geom_histogram() – histogram
  • geom_line() – line chart
  • geom_point() – scatte plot
  • geom_density() – density plots
  • geom_jitter() – stripcharts

2 Constructing Graphs

Recall that for numeric variables we can rely on box-plots and histograms to explore the distribution of a numeric (scale) variable. Perhaps we are interested in reading scores and want to start with a histogram.

2.1 Histograms

library(ggplot2)
ggplot(data = hsb2, aes(x = read)) + geom_histogram()

You see R telling you that stat_bin() using bins = 30. Pick better value with binwidth.. That is, R is automatically grouping read in a way that there are 30 groups. Maybe we want fewer groups, maybe 10. This can be done as follows:

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10)

We can customize this histogram further, changing the colors, the labels for the x-axis, the y-axis, adding a title, and so on.

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "cornflowerblue") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency")

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "salmon") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency")

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "deeppink1") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency")

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "yellowgreen") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency")

Note: A small snippet of the wide expanse of colors available in R can be seen here and you can always brew your own color palette (ask me and I’ll give you the code).

What if wanted to construct these histograms for male versus female students, or for each of the SES groups?

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "tomato") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency") + facet_wrap(~female)

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "tomato") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency") + facet_wrap(~ses)

What if we wanted to break it out by female/male students in public versus private schools?

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "tomato") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency") + facet_wrap(female ~ 
    schtyp)

2.2 Kernel Density Plots

ggplot(data = iris, aes(x = Sepal.Length, fill = Species)) + 
    geom_density(alpha = 0.3, trim = TRUE)

ggplot(data = iris, aes(x = Sepal.Length)) + geom_histogram(aes(y = ..density..), 
    binwidth = 0.2, fill = "cornflowerblue") + ggtitle("Histogram & Kernel Density Plot of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency") + geom_density(alpha = 1, 
    color = "tomato4", trim = TRUE) + facet_wrap(~Species)

When we construct a histogram we choose the bin-widths (i.e., how many groups do we want and how wide should each group be?). As a result, histograms are not smooth, and depend on both the width of the bins and the end points of the bins. Kernel density plots get around two of these problems – they are smooth and do not depend on the end points of the bins. One can think of them as probability distribution functions, similar to the standard normal distribution \((z)\), that tell you how your data are distributed.

2.3 Box-plots

Now we can revisit our old friends, the box-plots.

ggplot(data = hsb2, aes(x = female, y = read)) + geom_boxplot(fill = "seagreen2") + 
    ggtitle("Box-Plot of Reading Scores") + xlab("Gender") + 
    ylab("Reading Score") + coord_flip()

ggplot(data = hsb2, aes(x = female, y = read)) + geom_boxplot(fill = "peachpuff") + 
    ggtitle("Box-Plot of Reading Scores (by Gender & School Type)") + 
    xlab("Gender") + ylab("Reading Score") + coord_flip() + facet_wrap(~schtyp)

2.4 Violin Plots

While box-plots are very useful for loking at the general shape of the distribution, violin plots tend to be more informative since they combine box-plots and kernel density plots. But not everyone likes these (or is used to them).

ggplot(data = hsb2, aes(x = female, y = read)) + geom_violin(fill = "seagreen2", 
    trim = FALSE, adjust = 0.5) + ggtitle("Violin Plots of Reading Scores") + 
    geom_boxplot(width = 0.1) + xlab("Gender") + ylab("Reading Score") + 
    coord_flip()

ggplot(data = hsb2, aes(x = female, y = read)) + geom_violin(fill = "peachpuff", 
    trim = FALSE, adjust = 0.5) + geom_boxplot(width = 0.1) + 
    ggtitle("Violin Plots of Reading Scores (by Gender & School Type)") + 
    xlab("Gender") + ylab("Reading Score") + coord_flip() + facet_wrap(~schtyp)

ggplot(data = hsb2, aes(x = female, y = read, fill = schtyp)) + 
    geom_violin(trim = FALSE, adjust = 0.25) + geom_boxplot(width = 0.1) + 
    ggtitle("Violin Plots of Reading Scores") + xlab("Gender") + 
    ylab("Reading Score") + coord_flip()

ggplot(data = hsb2, aes(x = female, y = read, fill = schtyp)) + 
    geom_violin(trim = FALSE, adjust = 0.5) + geom_boxplot(width = 0.1) + 
    ggtitle("Violin Plots of Reading Scores") + xlab("Gender") + 
    ylab("Reading Score") + coord_flip()

ggplot(data = hsb2, aes(x = female, y = read, fill = schtyp)) + 
    geom_violin(trim = FALSE, adjust = 0.75) + geom_boxplot(width = 0.1) + 
    ggtitle("Box-Plot of Reading Scores") + xlab("Gender") + 
    ylab("Reading Score") + coord_flip()

2.5 Bar-Charts

Recall the bar-charts we used for qualitative variables last semester. Let us generate a few for gender, schtyp, prog, ses, and race.

ggplot(data = hsb2, aes(female)) + geom_bar(fill = "seagreen2", 
    width = 0.25) + ggtitle("Bar-Chart of Gender") + xlab("Gender") + 
    ylab("Frequency") + theme(axis.text.x = element_text(angle = 90, 
    hjust = 0))

ggplot(data = hsb2, aes(race)) + geom_bar(fill = "seagreen2") + 
    ggtitle("Bar-Chart of Race (by School Type)") + xlab("Race") + 
    ylab("Frequency") + facet_wrap(~schtyp) + theme(axis.text.x = element_text(angle = 90, 
    hjust = 0))

ggplot(data = hsb2, aes(race)) + geom_bar(fill = "seagreen2") + 
    ggtitle("Bar-Chart of Race (by SES & School Type)") + xlab("Race") + 
    ylab("Frequency") + facet_wrap(ses ~ schtyp) + theme(axis.text.x = element_text(angle = 90, 
    hjust = 0))

ggplot(data = hsb2, aes(race)) + geom_bar(fill = "seagreen2") + 
    ggtitle("Bar-Chart of Race (by SES & School Type)") + xlab("Race") + 
    ylab("Frequency") + facet_wrap(ses ~ schtyp, ncol = 2) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 0))

Clearly some of these aren’t very helpful because we have too few students of a given racial/ethnic group. Let us look at different rendition of a bar-chart.

ggplot(data = hsb2, aes(prog)) + geom_bar(aes(fill = ses)) + 
    ggtitle("Bar-Chart of Program Type (by SES & School Type)") + 
    xlab("Race") + ylab("Frequency") + facet_wrap(~schtyp, ncol = 2) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 0))

Use these with caution since they are useless unless the differences in the frequencies are very conspicuous and hence will jump out at the viewer.

2.6 Line Charts

If you have data over time then line charts are a good way to show trends over time. Run the code below (first making sure have the plotly package installed) and see the result.

library(plotly)
py = plotly("aniruhil", "qbyq4fx812")
p = plot_ly(economics, x = date, y = uempmed, name = "unemployment")

plotly is a special graphics package for interactive graphics so don’t think this is how the typical line chart might look. For example, the same plot rendered via ggplot2 would look as follows:

ggplot(data = economics, aes(x = date, y = uempmed)) + geom_line()

A little touch of magic via ggplot and the plotly package, and voila!!

p2 = ggplot(data = economics, aes(x = date, y = uempmed)) + geom_line(aes(color = "red"), 
    size = 0.25) + xlab("Date") + ylab("Unemployment Rate")
p2 = ggplotly(p2)

Regardless of the package-specific rendering, the basic point should be obvious: You can see how unemployment varies over time. If you are interested, check out plotly’s capabilities here.

2.7 Scatter-plots

If we have two numeric (scale) variables then a scatter-plot is a great way to explore if and how these two variables are related.

ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Width, color = Species)) + 
    geom_point()

ggplot(data = mtcars, aes(x = qsec, y = mpg, color = factor(cyl))) + 
    geom_point()

There is another package for interactive graphics , the ggvis package. The same plot as that rendered above can be drawn with ggvis (not shown below).

library(ggvis)
iris %>% ggvis(~Sepal.Length, ~Petal.Width) %>% layer_points(fill = ~Species)

More examples of ggvis graphics are available here.

Finally, let us close out scatter-plots by looking at the rCharts package, yet another way to generate interactive graphics (not shown below).

library(rCharts)
scatter.rcharts <- rPlot(Petal.Width ~ Sepal.Length, data = iris, 
    color = "Species", type = "point")
scatter.rcharts$print(include_assets = TRUE)

More documentation on rCharts is available here.

2.8 Stripcharts

ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) + 
    geom_jitter()

ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) + 
    geom_jitter() + stat_summary(fun.y = median, geom = "point", 
    size = 3, color = "black")

ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) + 
    geom_jitter() + stat_summary(fun.data = "mean_sdl", geom = "pointrange", 
    size = 0.5, color = "black")

ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) + 
    geom_boxplot() + geom_jitter()

3 A Teaser on Mapping with ggplot2 and ggmap

I’ll leave you with a few maps, first of the 48 states on the continent, then of all counties in the country, then one of counties in Ohio, and finally a googlemap of Athens.

library(maps)
library(ggmap)

states = map_data("state")

ggplot() + geom_polygon(data = states, aes(x = long, y = lat, 
    group = group, fill = region)) + coord_fixed(1.3) + guides(fill = FALSE)

counties <- map_data("county")

ggplot() + geom_polygon(data = counties, aes(x = long, y = lat, 
    group = group, fill = region)) + coord_fixed(1.3) + guides(fill = FALSE)

ohio = subset(counties, region == "ohio")

ggplot() + geom_polygon(data = ohio, aes(x = long, y = lat, group = group, 
    fill = subregion)) + coord_fixed(1.3) + guides(fill = FALSE)

athens = get_map(location = "Athens, Ohio", zoom = 14, source = "osm")
ggmap(athens)

whitehouse = get_map(location = "The White House, Washington DC", 
    zoom = 16, source = "osm")
ggmap(whitehouse)

athens = get_map(location = "Athens, Ohio", zoom = 14)
ggmap(athens)