Now we’ll see one of R’s premier packages in action when graphing data. Let us load the hsb2.RData we saved earlier.
load("~/Downloads/Archive/hsb2.RData")
ggplot2
is one of the leading R packages for graphics, followed closely by lattice
. Let us work with ggplot2 first and fit some simple graphs. Note that there is extensive help available for ggplot2
on the web. You can start with the Cookbook for R or the ggplot2 documentation. You can also search on stackoverflow.
ggplot2
uses the grammar of graphics to build graphs by breaking up each graph into three components – data, aesthetics, and geometry. You specify the data frame with the data
command, then the x and y coordinates with the aes
command, and finally the geometry (bar-chart, histogram, etc.) via the geom_
command. The geometry for some of the graphs we will use most often is listed below:
Recall that for numeric variables we can rely on box-plots and histograms to explore the distribution of a numeric (scale) variable. Perhaps we are interested in reading scores and want to start with a histogram.
library(ggplot2)
ggplot(data = hsb2, aes(x = read)) + geom_histogram()
You see R telling you that
. That is, R is automatically grouping read in a way that there are 30 groups. Maybe we want fewer groups, maybe 10. This can be done as follows:stat_bin()
using bins = 30
. Pick better value with binwidth
.
ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10)
We can customize this histogram further, changing the colors, the labels for the x-axis, the y-axis, adding a title, and so on.
ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10,
fill = "cornflowerblue") + ggtitle("Histogram of Reading Scores") +
xlab("Reading Score") + ylab("Frequency")
ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10,
fill = "salmon") + ggtitle("Histogram of Reading Scores") +
xlab("Reading Score") + ylab("Frequency")
ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10,
fill = "deeppink1") + ggtitle("Histogram of Reading Scores") +
xlab("Reading Score") + ylab("Frequency")
ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10,
fill = "yellowgreen") + ggtitle("Histogram of Reading Scores") +
xlab("Reading Score") + ylab("Frequency")
Note: A small snippet of the wide expanse of colors available in R can be seen here and you can always brew your own color palette (ask me and I’ll give you the code).
What if wanted to construct these histograms for male versus female students, or for each of the SES groups?
ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10,
fill = "tomato") + ggtitle("Histogram of Reading Scores") +
xlab("Reading Score") + ylab("Frequency") + facet_wrap(~female)
ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10,
fill = "tomato") + ggtitle("Histogram of Reading Scores") +
xlab("Reading Score") + ylab("Frequency") + facet_wrap(~ses)
What if we wanted to break it out by female/male students in public versus private schools?
ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10,
fill = "tomato") + ggtitle("Histogram of Reading Scores") +
xlab("Reading Score") + ylab("Frequency") + facet_wrap(female ~
schtyp)
ggplot(data = iris, aes(x = Sepal.Length, fill = Species)) +
geom_density(alpha = 0.3, trim = TRUE)
ggplot(data = iris, aes(x = Sepal.Length)) + geom_histogram(aes(y = ..density..),
binwidth = 0.2, fill = "cornflowerblue") + ggtitle("Histogram & Kernel Density Plot of Reading Scores") +
xlab("Reading Score") + ylab("Frequency") + geom_density(alpha = 1,
color = "tomato4", trim = TRUE) + facet_wrap(~Species)
When we construct a histogram we choose the bin-widths (i.e., how many groups do we want and how wide should each group be?). As a result, histograms are not smooth, and depend on both the width of the bins and the end points of the bins. Kernel density plots get around two of these problems – they are smooth and do not depend on the end points of the bins. One can think of them as probability distribution functions, similar to the standard normal distribution \((z)\), that tell you how your data are distributed.
Now we can revisit our old friends, the box-plots.
ggplot(data = hsb2, aes(x = female, y = read)) + geom_boxplot(fill = "seagreen2") +
ggtitle("Box-Plot of Reading Scores") + xlab("Gender") +
ylab("Reading Score") + coord_flip()
ggplot(data = hsb2, aes(x = female, y = read)) + geom_boxplot(fill = "peachpuff") +
ggtitle("Box-Plot of Reading Scores (by Gender & School Type)") +
xlab("Gender") + ylab("Reading Score") + coord_flip() + facet_wrap(~schtyp)
While box-plots are very useful for loking at the general shape of the distribution, violin plots tend to be more informative since they combine box-plots and kernel density plots. But not everyone likes these (or is used to them).
ggplot(data = hsb2, aes(x = female, y = read)) + geom_violin(fill = "seagreen2",
trim = FALSE, adjust = 0.5) + ggtitle("Violin Plots of Reading Scores") +
geom_boxplot(width = 0.1) + xlab("Gender") + ylab("Reading Score") +
coord_flip()
ggplot(data = hsb2, aes(x = female, y = read)) + geom_violin(fill = "peachpuff",
trim = FALSE, adjust = 0.5) + geom_boxplot(width = 0.1) +
ggtitle("Violin Plots of Reading Scores (by Gender & School Type)") +
xlab("Gender") + ylab("Reading Score") + coord_flip() + facet_wrap(~schtyp)
ggplot(data = hsb2, aes(x = female, y = read, fill = schtyp)) +
geom_violin(trim = FALSE, adjust = 0.25) + geom_boxplot(width = 0.1) +
ggtitle("Violin Plots of Reading Scores") + xlab("Gender") +
ylab("Reading Score") + coord_flip()
ggplot(data = hsb2, aes(x = female, y = read, fill = schtyp)) +
geom_violin(trim = FALSE, adjust = 0.5) + geom_boxplot(width = 0.1) +
ggtitle("Violin Plots of Reading Scores") + xlab("Gender") +
ylab("Reading Score") + coord_flip()
ggplot(data = hsb2, aes(x = female, y = read, fill = schtyp)) +
geom_violin(trim = FALSE, adjust = 0.75) + geom_boxplot(width = 0.1) +
ggtitle("Box-Plot of Reading Scores") + xlab("Gender") +
ylab("Reading Score") + coord_flip()
Recall the bar-charts we used for qualitative variables last semester. Let us generate a few for gender, schtyp, prog, ses, and race.
ggplot(data = hsb2, aes(female)) + geom_bar(fill = "seagreen2",
width = 0.25) + ggtitle("Bar-Chart of Gender") + xlab("Gender") +
ylab("Frequency") + theme(axis.text.x = element_text(angle = 90,
hjust = 0))
ggplot(data = hsb2, aes(race)) + geom_bar(fill = "seagreen2") +
ggtitle("Bar-Chart of Race (by School Type)") + xlab("Race") +
ylab("Frequency") + facet_wrap(~schtyp) + theme(axis.text.x = element_text(angle = 90,
hjust = 0))
ggplot(data = hsb2, aes(race)) + geom_bar(fill = "seagreen2") +
ggtitle("Bar-Chart of Race (by SES & School Type)") + xlab("Race") +
ylab("Frequency") + facet_wrap(ses ~ schtyp) + theme(axis.text.x = element_text(angle = 90,
hjust = 0))
ggplot(data = hsb2, aes(race)) + geom_bar(fill = "seagreen2") +
ggtitle("Bar-Chart of Race (by SES & School Type)") + xlab("Race") +
ylab("Frequency") + facet_wrap(ses ~ schtyp, ncol = 2) +
theme(axis.text.x = element_text(angle = 90, hjust = 0))
Clearly some of these aren’t very helpful because we have too few students of a given racial/ethnic group. Let us look at different rendition of a bar-chart.
ggplot(data = hsb2, aes(prog)) + geom_bar(aes(fill = ses)) +
ggtitle("Bar-Chart of Program Type (by SES & School Type)") +
xlab("Race") + ylab("Frequency") + facet_wrap(~schtyp, ncol = 2) +
theme(axis.text.x = element_text(angle = 90, hjust = 0))
Use these with caution since they are useless unless the differences in the frequencies are very conspicuous and hence will jump out at the viewer.
If you have data over time then line charts are a good way to show trends over time. Run the code below (first making sure have the plotly
package installed) and see the result.
library(plotly)
py = plotly("aniruhil", "qbyq4fx812")
p = plot_ly(economics, x = date, y = uempmed, name = "unemployment")
plotly is a special graphics package for interactive graphics so don’t think this is how the typical line chart might look. For example, the same plot rendered via ggplot2
would look as follows:
ggplot(data = economics, aes(x = date, y = uempmed)) + geom_line()
A little touch of magic via ggplot and the plotly
package, and voila!!
p2 = ggplot(data = economics, aes(x = date, y = uempmed)) + geom_line(aes(color = "red"),
size = 0.25) + xlab("Date") + ylab("Unemployment Rate")
p2 = ggplotly(p2)
Regardless of the package-specific rendering, the basic point should be obvious: You can see how unemployment varies over time. If you are interested, check out plotly’s capabilities here.
If we have two numeric (scale) variables then a scatter-plot is a great way to explore if and how these two variables are related.
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Width, color = Species)) +
geom_point()
ggplot(data = mtcars, aes(x = qsec, y = mpg, color = factor(cyl))) +
geom_point()
There is another package for interactive graphics , the ggvis
package. The same plot as that rendered above can be drawn with ggvis (not shown below).
library(ggvis)
iris %>% ggvis(~Sepal.Length, ~Petal.Width) %>% layer_points(fill = ~Species)
More examples of ggvis
graphics are available here.
Finally, let us close out scatter-plots by looking at the rCharts
package, yet another way to generate interactive graphics (not shown below).
library(rCharts)
scatter.rcharts <- rPlot(Petal.Width ~ Sepal.Length, data = iris,
color = "Species", type = "point")
scatter.rcharts$print(include_assets = TRUE)
More documentation on rCharts
is available here.
ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) +
geom_jitter()
ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) +
geom_jitter() + stat_summary(fun.y = median, geom = "point",
size = 3, color = "black")
ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) +
geom_jitter() + stat_summary(fun.data = "mean_sdl", geom = "pointrange",
size = 0.5, color = "black")
ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) +
geom_boxplot() + geom_jitter()