Chapter 3 Visualizing Data
One of the first tasks of data analysis should be to look at our data. Certainly, you should open up the dataset and eyeball it to make sure everything looks okay, that the file isn’t mysteriously corrupted, etc., but that is not what I mean when I say “look”. Nor am I talking about data visualization in the sense of the masterpieces of stalwarts such as Alberto Cairo, Mona Chalabi, Andy Kirk, Robert Kosara, Giorgia Lupi, David McCandless, Cole Nussbaumer Knaflic, Randy Olson, Lisa Charlotte Rost, Jon Schwabish, Nathan Yau, Stephanie Evergreen, Martin Wattenberg, Fernanda Viégas, and many others. We are not going to create visuals that rival these artists’ stunning works. Rather, our goal will be simpler: to use appropriate tables and graphics when analyzing and describing our data, so that we both understand what the data are telling us and can communicate the resulting story to our audience. To accomplish these tasks we will obey simple rules that help us make wise decisions.
3.1 Visualizing Nominal/Ordinal Data
3.1.1 Bar-charts and Frequency Tables
If a variable is measured in nominal or ordinal terms, either (i) a bar-chart or (ii) a frequency table is a very effective display of how the variable is distributed: What value seems to be most common? Are high values as likely as low values? You can see both these visuals below, drawn from data on what people were doing when they were killed by tigers. First, the frequency table.
Activity | Frequency | Proportion | Percentage |
---|---|---|---|
Disturbing tiger kill | 5 | 0.06 | 6 |
Fishing | 8 | 0.09 | 9 |
Forest products | 11 | 0.12 | 12 |
Fuelwood/timber | 5 | 0.06 | 6 |
Grass/fodder | 44 | 0.50 | 50 |
Herding | 7 | 0.08 | 8 |
Sleeping in house | 3 | 0.03 | 3 |
Toilet | 2 | 0.02 | 2 |
Walking | 3 | 0.03 | 3 |
Total | 88 | 1.00 | 100 |
Notice how the frequency table is constructed. You have each human activity listed, followed by the frequency (the number of times someone was killed by a tiger while engaging in each activity), then a column with this frequency converted into a proportion, and finally the proportion converted into a percentage. None of this should be a mystery to you:
\[proportion = \dfrac{frequency}{Total}\] \[percentage = \left(\dfrac{frequency}{Total}\right) \times 100 = proportion \times 100\]
The last row in the table shows you the total frequency (we have a total of 88 humans killed), the total proportion (which must sum to \(1\)), and the total percentage (which should sum to \(100\), apart from rounding). I love frequency tables such as these because they show all the data and the story: most kills occurred while the individuals were cutting grass for cattle fodder, followed by gathering other forest products, and fewest while they were going to the toilet.[4] We could improve upon this table and bar-chart by ordering them so that the most dangerous activity is listed/plotted first. This would have the added benefit of quickly drawing the reader’s/viewer’s eyes to the most dangerous activity.
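As a rough sketch of the arithmetic behind the table, the Python snippet below recomputes the proportion and percentage columns from the frequencies and lists the activities from most to least frequent. The function and variable names are illustrative choices, not part of the original analysis.

```python
# Frequencies copied from the table above (activity -> number of kills).
kills = {
    "Disturbing tiger kill": 5,
    "Fishing": 8,
    "Forest products": 11,
    "Fuelwood/timber": 5,
    "Grass/fodder": 44,
    "Herding": 7,
    "Sleeping in house": 3,
    "Toilet": 2,
    "Walking": 3,
}


def frequency_table(freqs):
    """Return (category, frequency, proportion, percentage) rows, largest first."""
    total = sum(freqs.values())
    rows = []
    for category, count in freqs.items():
        proportion = count / total
        rows.append((category, count, proportion, proportion * 100))
    # Sorting by frequency is acceptable here because the categories are nominal.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


table = frequency_table(kills)
for category, count, proportion, percentage in table:
    print(f"{category:<22} {count:>3} {proportion:>5.2f} {percentage:>4.0f}")
print(f"{'Total':<22} {sum(kills.values()):>3}")
```

Sorting inside the function is a design choice suited to nominal data only; for an ordinal variable the rows would need to stay in their natural order.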
What if the variable was an ordinal level variable, say something like an individual’s frequency of praying?
Response Category | Frequency |
---|---|
Never | 196 |
Once a week or less | 226 |
A few times a week | 264 |
Once a day | 346 |
Several times a day | 456 |
Total | 1488 |
We would be facing the same options, a bar-chart or a frequency table. However, we would have to be careful that the categories in both the table and the bar-chart keep their logical order. Notice the Response Category column: we start with the option “Never”, and each category that follows reflects praying more often than the preceding category. This order has to be maintained, which means we cannot rearrange the table or bar-chart in ascending/descending order of the “Frequency” column. If we made that mistake we would be destroying the natural order that exists in the ordinal variable we have before us.
You might be wondering, what about pie charts? Why not use those with nominal/ordinal data? There are two camps, those who hate them and those who think they may be useful. If you are interested, read “What do you mean I’m not supposed to use Pie Charts?!” but I for one do not use them since I find them less useful than bar-charts.
3.1.2 Contingency Tables and Bar-charts
Frequency tables and bar-charts are also useful when you have two nominal/ordinal variables to work with and are interested in exploring differences between the two or more groups reflected in one variable with respect to whatever is being measured by the second variable. Let us use a specific example, one where we ask if religiosity differs between Liberals, Moderates, and Conservatives. First the table.
Response Category | Liberal | Moderate | Conservative | Total |
---|---|---|---|---|
Never | 62 | 46 | 36 | 144 |
Once a week or less | 53 | 63 | 52 | 168 |
A few times a week | 53 | 69 | 76 | 198 |
Once a day | 76 | 89 | 107 | 272 |
Several times a day | 69 | 99 | 176 | 344 |
Total | 313 | 366 | 447 | 1126 |
If you look at Table 3.3, you see the frequencies reported in what we call a contingency table or a crosstabulation, where the distributions of two nominal/ordinal variables are jointly displayed. Reading such a table should be simple. In brief, you see how many Liberals said they prayed “Never”, “Once a week or less”, “A few times a week”, “Once a day”, or “Several times a day”, and likewise for Moderates and Conservatives. What pattern is evident from this table? Of the 447 Conservatives, the largest number say they pray several times a day, followed by once a day, with the fewest saying they never pray. The pattern is similar for Liberals and Moderates, although the difference between the numbers responding “Never” and “Several times a day” is smaller for Liberals than it is for Conservatives.
The story could be helped a great deal if we calculated percentages for these frequencies. We have two choices when calculating these percentages. We could calculate them as row percentages, where we ask “what percent of those who said Never were Liberal, Moderate, and Conservative, respectively?” We could then repeat this for the other categories of religiosity. The result is shown in Table 3.4. This table shows quite clearly that 51.16% of those who say they pray several times a day are Conservative. Likewise, the largest share (43.06%) of those who say they never pray are Liberal.
Response Category | Liberal | Moderate | Conservative |
---|---|---|---|
Never | 43.06 | 31.94 | 25.00 |
Once a week or less | 31.55 | 37.50 | 30.95 |
A few times a week | 26.77 | 34.85 | 38.38 |
Once a day | 27.94 | 32.72 | 39.34 |
Several times a day | 20.06 | 28.78 | 51.16 |
Alternatively, if we calculated column percentages we would be able to answer such questions as: “What percent of Liberals said they never pray, pray once a week or less, a few times a week, once a day, several times a day?” The same could then be asked of Moderates and Conservatives, respectively. If we used the column percentages shown in Table 3.5 it would be obvious that Moderates and Conservatives tend to be more religious than Liberals. The essential takeaway here is that how you calculate the percentages (row versus column) depends upon what story you want to highlight, that is, the question you want to ask and answer.
Response Category | Liberal | Moderate | Conservative |
---|---|---|---|
Never | 19.81 | 12.57 | 8.05 |
Once a week or less | 16.93 | 17.21 | 11.63 |
A few times a week | 16.93 | 18.85 | 17.00 |
Once a day | 24.28 | 24.32 | 23.94 |
Several times a day | 22.04 | 27.05 | 39.37 |
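The row-versus-column distinction can be sketched in a few lines of Python. The counts come from the contingency table; the row labels assume the rows run from “Never” to “Several times a day” in the order the text lists them, and the function names are hypothetical.

```python
# Counts from the contingency table: rows are prayer frequency,
# columns are Liberal, Moderate, Conservative (in that order).
counts = {
    "Never": [62, 46, 36],
    "Once a week or less": [53, 63, 52],
    "A few times a week": [53, 69, 76],
    "Once a day": [76, 89, 107],
    "Several times a day": [69, 99, 176],
}


def row_percentages(table):
    """Each cell as a percentage of its row total."""
    return {
        label: [round(100 * cell / sum(row), 2) for cell in row]
        for label, row in table.items()
    }


def column_percentages(table):
    """Each cell as a percentage of its column total."""
    totals = [sum(col) for col in zip(*table.values())]
    return {
        label: [round(100 * cell / total, 2) for cell, total in zip(row, totals)]
        for label, row in table.items()
    }


row_pct = row_percentages(counts)
col_pct = column_percentages(counts)
print(row_pct["Never"])  # of those who never pray, the share in each ideology
print(col_pct["Never"])  # within each ideology, the share who never pray
```

The two functions differ only in the denominator, which is exactly the choice the question being asked should drive.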
Graphing these tables is easily done as well, and very effective; see Figure 3.3.
Categorical variables (one or two) -> bar-chart
If you have cross-tabulations, choose between stacked versus dodged bar-charts
3.2 Visualizing Interval/Ratio Data
We have several visuals we could draw with numeric data that are either interval or ratio levels of measurement. Let us see these first before we look at the one frequency table that could be used with interval/ratio data.
3.2.1 The Histogram
A histogram is used with a single numeric variable and looks like a bar-chart except there are no gaps between consecutive bars unless no observations fall in that range of values. The example that follows uses a popular dataset known as hsb2, which contains information about 200 randomly selected students from a national survey of high school seniors called the High School and Beyond survey. The variables in this dataset include:
- id = student id
- female = (0/1)
- race = ethnicity (1=hispanic 2=asian 3=african-amer 4=white)
- ses = (1=low 2=middle 3=high)
- schtyp = type of school (1=public 2=private)
- prog = type of program (1=general 2=academic 3=vocational)
- read = standardized reading score
- write = standardized writing score
- math = standardized math score
- science = standardized science score
- socst = standardized social studies score
I’ll use read (Reading scores on a standardized test) to plot a histogram. Notice a few things about the histogram in Figure 3.4. The height of the bars, representing how often a particular score occurs, varies a great deal. A few students have done poorly while a few have done very well, but the rest are distributed over the middle range of test scores.
We could further break this histogram apart by asking if male and female students (female), private versus public students (schtyp), or students of different races/ethnicities (race) perform differently on the reading test.
These plots are hard to read for a number of reasons. First, with just \(n = 200\) students in all we have more students in some groups than in others and as a result the histograms look very thin for the groups with fewer data points, making it difficult to tease out any patterns. Second, within any group there are too many different test scores so we don’t see a clear pattern at all. To fix these problems we construct groups of scores, turning what is a numeric variable into an ordinal variable. Let us build this by first creating a grouped frequency table and then plotting this table as a histogram.
Histogram bins must be chosen with some care
3.2.2 Grouped Frequency Tables
Because numeric variables take on too many values it is often easier to see their distribution by grouping the numeric values. For example, we could start by seeing what are the lowest and highest reading scores. These turn out to be 28 and 76, respectively. We can build the groups as follows:
- Calculate the difference between the maximum and minimum values, which turns out to be \(76 - 28 = 48\)
- Decide how many groups we want. Good practice suggests no fewer than four or five and no more than six or seven. Say we go with 5 groups.
- Divide the gap between the maximum and minimum values by the desired number of groups, \(\dfrac{48}{5} = 9.6\), and round up to the nearest whole number, \(10\). This tells us how wide each group should be.
- The groups could thus be \(28-38, 38-48, 48-58, 58-68, 68-78\).
Notice that these groups span, start to finish, all the values of reading scores. But we’ll have to decide which group to include 38, 48, 58, and 68 in. Should 38 go in 28-38 or in 38-48? The choice doesn’t matter so long as we are consistent. Let us choose a rule that says include 38 in 38-48, 48 in 48-58, 58 in 58-68, and 68 in 68-78. Using this rule we now find each reading score and drop it into its group. Then we calculate how many scores fall in each group, creating our Frequency column.
Grouped Scores | Frequency |
---|---|
28-38 | 14 |
38-48 | 68 |
48-58 | 62 |
58-68 | 36 |
68-78 | 20 |
Total | 200 |
Now it is easy to see that the largest number of students (68) appear to fall in the 38-48 group of reading scores while the smallest frequency (14) occurs in the 28-38 group. Let us add to this table a percentage column.
Grouped Scores | Frequency | Percentage |
---|---|---|
28-38 | 14 | 7 |
38-48 | 68 | 34 |
48-58 | 62 | 31 |
58-68 | 36 | 18 |
68-78 | 20 | 10 |
Total | 200 | 100 |
The percentages make it even easier to see how the distribution breaks down; 34 percent of the students have scores in the 38-48 range while only 10 percent scored in the highest bracket (68-78). We can also break this down by the three groups we used earlier.
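The grouping steps described above can be sketched as follows. The scores here are a small made-up list used purely for illustration (they are not the hsb2 reading scores), and `grouped_frequencies` is a hypothetical helper name. The sketch follows the rule chosen in the text: a boundary value such as 38 goes into 38-48 rather than 28-38.

```python
import math

# Hypothetical reading scores, invented for illustration (not the hsb2 data).
scores = [28, 34, 38, 41, 44, 47, 48, 50, 52, 55, 57, 58, 60, 63, 68, 71, 76]


def grouped_frequencies(values, n_groups):
    """Return (label, frequency, percentage) rows for equal-width groups."""
    lowest, highest = min(values), max(values)
    # Width = (max - min) / number of groups, rounded up to a whole number.
    width = math.ceil((highest - lowest) / n_groups)
    table = []
    for i in range(n_groups):
        lower = lowest + i * width
        upper = lower + width
        last = i == n_groups - 1
        # Each group includes its lower bound and excludes its upper bound;
        # the last group also keeps a maximum that sits exactly on its upper bound.
        count = sum(1 for v in values if lower <= v < upper or (last and v == upper))
        table.append((f"{lower}-{upper}", count, 100 * count / len(values)))
    return table


groups = grouped_frequencies(scores, 5)
for label, count, percentage in groups:
    print(f"{label:<6} {count:>3} {percentage:>6.1f}")
```

With a minimum of 28, a maximum of 76 and five groups, the helper reproduces the group boundaries used in the tables above; only the counts differ because the data are invented.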
Histograms were once quite popular but there is a better visual for looking at our numeric variables, one called the box-plot that we’ll see in the next chapter. I am not a huge fan of histograms because grouping decisions influence the story being told. My advice would be to use grouped frequency tables instead of histograms to present a summary overview of your numeric data.
Histograms of grouped frequencies of numerical variables can be useful for summary depictions of the distribution
3.2.3 Scatterplots
With two numeric variables, a scatterplot comes in very handy if we want to explore how one variable might be related to another. For example, we may want to ask whether students who score high on the reading test also tend to score high on the mathematics test. This could be visually explored via a scatter-plot, as shown below. The goal should be to look for a pattern: Does one variable increase as the other increases or does one variable decrease as the other increases? Or does there seem to be no relationship at all?
Quite clearly, students who do well in Reading also tend to do well in Mathematics. You can see this by virtue of the upward, left-to-right tilt of the cloud of points.
What about Science scores and Mathematics scores? Is there any relationship between doing well in Science and doing well in Mathematics?
Of course, just like everything else we could break this down by any nominal/ordinal variable. The visual below shows you the breakouts by the student’s sex.
Scatterplots work best with two numerical variables since they show the pattern of association between the two variables
3.2.4 Line Graphs
If we have time-series data, such as the Presidential Approval data we saw earlier, then a line graph works well because it shows you how the outcome/phenomenon varies over time. Let us take another example, median household incomes. Say I am curious about trends in median household incomes in Ohio, Pennsylvania, and West Virginia.
These plots work well not just for financial or election data; they are ideally suited for any phenomenon that is measured over time, for example the size of the immigrant population over the years, or even the number of lynx pelts reported in Canada per year from 1752 to 1819.
Line graphs are the default choice for showing trends – patterns over time
3.2.5 Polar Charts
These charts are helpful for visualizing data that might have otherwise been explored via a bar-chart. For example, say you want to look at the miles per gallon given by a number of different automobiles. We could use a bar-chart, as shown below:
Quite obviously, the Toyota Corolla has the best fuel economy, followed by the Fiat 128, while the Cadillac Fleetwood is tied with the Lincoln Continental for the worst fuel economy. A polar chart would present the same information as seen in the bar-chart, albeit in a more aesthetically pleasing manner.
3.3 Some Essential Rules for Good Visualizations
There are several rules, some favored by this expert or that, but regardless of the source of the rule, its merit is not in doubt. So here are the ones I try to follow (although I do break these at times, sometimes by design and other times unintentionally). It all starts of course with Edward Tufte’s maxims: show the data, tell the truth, help the viewer think about the information rather than the design, encourage the eye to compare the data, make large data sets coherent.
A more specific listing of the rules I try to keep in mind follows:
- Do not include anything in your visualization that is not very informative. Give titles and subtitles where needed and make these stand on their own. At the same time, do not clutter your charts and tables with too much information otherwise the reader gets lost.
- Use colors wisely. Is this visual to be printed on a color printer? Is it part of a presentation in a poorly lit room? Choose bright colors that stand out from each other and yet are visible on a printed page or on the projection screen in the room. Remember, a fair share of men seeing the visual could be color-blind so use color palettes designed for color-blind individuals. If this is news to you, read “How a dog sees a rainbow, and 12 other images that explain how we see color”. Pay attention to what is being plotted: use sequential colors for ordered data values (that go from low to high or vice-versa); qualitative colors for nominal data; and diverging colors if the goal is to put equal emphasis on mid-range critical values and extremes at both ends of the data range, emphasizing the middle with light colors and the low/high extremes with dark colors that have contrasting hues.
- The visualization is not your personal Mona Lisa; it must be effective for the target audience and help you tell the story. Do not get carried away in creating it.
- Use a table if a table would be more effective than a graph, but remember: while tables are useful for audiences that will want to see more (not less) of the actual data you have, graphs are favored by those who want a quick take-away.
- Always have your y-axis (or x-axis if relevant) starting at zero. If you don’t do this you can misrepresent the data. If you are forced to truncate the axis, point this out to the reader so that they can interpret dips and swells in line charts or differences between heights of frequency bars with care. Go back and look at the scatterplots and line charts and try to visualize what these might look like if the y-axis had been forced to start at \(0\); they would certainly look different. For illustration purposes I have let the plotting software pick the starting coordinates.
- Combine multiple graphs/tables into a single figure if you can so long as it does not lead to information overload.
- For non-technical audiences you should round all percentages/proportions to no more than one decimal place and ideally to no decimal place unless doing so distorts the picture. For technical audiences, stay with two or four decimal places.
- Above all, start with pencil and paper and consider all alternate visualizations possible by drawing a rough sketch of what the finished product should look like.
- Consider using a choropleth map if geographic variation is one of the key narratives.
- Be prepared to alter your visual if it does not work; we are all hesitant to delete a page or a graphic and start from scratch but this locks you into something that isn’t working to begin with and you will be stuck in a rut.
- Avoid jargon like the plague. We often think we sound intelligent when we use jargon but this distracts from the oral/written presentation. Think of your audience and write and present in a manner that will resonate with them.
- Show as much data as you can; it vastly improves the effectiveness of the visual narration.
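To see why the rule about starting an axis at zero matters, here is a minimal numeric sketch. It compares how tall two bars appear relative to each other when the axis starts at zero versus at a truncated baseline. The two values and the helper name `apparent_ratio` are made up for illustration.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights for a and b when the axis starts at baseline."""
    if baseline >= min(a, b):
        raise ValueError("baseline must lie below both values")
    return (a - baseline) / (b - baseline)


# Two made-up values that differ by 4 percent.
high, low = 52.0, 50.0
honest = apparent_ratio(high, low)                    # axis starts at zero
truncated = apparent_ratio(high, low, baseline=48.0)  # axis starts at 48
print(f"axis from 0: {honest:.2f}x, axis from 48: {truncated:.2f}x")
```

A 4 percent difference drawn on an axis that starts at 48 looks like one bar being twice as tall as the other, which is why a truncated axis should be flagged for the reader.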
3.4 Chapter 3 Practice Problems
Problem 1
Download the monthly Great Lakes water level dataset SPSS format from here and Excel format from here. Construct an appropriate chart to display the data for Lake Superior. Be sure to label the x- and y-axis, and to title the chart. Note that water level is in meters.
Problem 2
Download the number of births per 10,000 of 23 year old women, U.S., 1917-1975 SPSS format from here and Excel format from here. Construct an appropriate chart to display the trend in the data. Be sure to label the x- and y-axis, and to title the chart.
Problem 3
Download the winning speed (in kilometers per hour) for several men’s track and field distances world meets over the 1900 - 2012 period SPSS format from here and Excel format from here. Construct an appropriate chart to display the speeds for the 100 meter dash. Be sure to label the x- and y-axis, and to title the chart. Note that the data are monthly and replicate the speed from the preceding month if the fastest speed was not eclipsed.
Problem 4
Use this data-set, also used in Practice Problem 4 in the preceding Chapter, noting these details of each variable. Construct a frequency table for belief in life after death, showing both the frequencies and the relative frequencies (as percentages). Based on the table, what do most people seem to believe? Report the percentage for your answer. Construct an appropriate chart for these data, making sure to label all axes and provide a title.
Problem 5
Construct an appropriate chart that shows the relationship between high school GPA and college GPA. Label both axes and title the chart. What does this plot show? Are the two positively/negatively related? How strong would you guess is the relationship?
Problem 6
Construct a contingency table of vegetarianism against belief in life after death. What percent of vegetarians believe in life after death? What percent of those who believe in life after death are vegetarians? Use an appropriate chart to show the relationship between vegetarianism and belief in life after death. Label everything.
Problem 7
Construct a grouped frequency table with five groups of the variable age. Report both the frequencies and the relative frequencies (as percentages). Plot the relative frequencies using an appropriate chart. Label everything as usual. What is the modal age group?
The next set of questions revolves around the 2016 Boston Marathon race results available here. The dataset contains the following variables:
- Bib = the runner’s bib number
- Name = the runner’s name
- Age = the runner’s age (in years)
- M/F = the runner’s gender (M = Male; F = Female)
- City = the runner’s home city
- State = the runner’s home state
- Country = the runner’s home country
- finishtime = the runner’s time (in seconds) to the finish line
Problem 9
What was the distribution of the runners’ gender? Use a suitable chart to reflect the distribution.
Problem 10
Construct a grouped frequency table of the runners’ age, using the following groupings – 18-25; 25-32; 32-39; 39-46; 46-53; 53-60; 60-67; 67-74; 74-81; 81-88. Also construct a grouped histogram. What is the modal age group?
Problem 11
Draw a scatter-plot of runners’ age and finish times. Does this show any relationship? If it does, what sort of a relationship?
Problem 12
Using a reasonable grouping structure, construct country-specific histograms of finish times for runners from each of the following countries: AUS, BRA, CAN, CHN, FRA, GBR, GER, ITA, JPN, MEX, and USA. Are finish times skewed for each country? Is the direction of the skew similar? What country seems to have the least skew? What country seems to have the most skew?
Use this data-set, also used in Practice Problem 5 in the preceding Chapter, to answer the following questions.
Problem 13
Were employees who had a workplace accident more likely to leave than employees who did not have an accident? Briefly explain your conclusion with appropriate charts/tables.
Problem 14
Are low salary employees more likely to leave than medium/high salary employees? Why do you conclude as you do? Briefly explain your reasoning with appropriate charts/tables.
Problem 15
Construct an appropriate chart that shows the distribution of the number of years employees spent in the company. What patterns do you see in this chart?
Download the 2020 County Health Rankings data (SPSS format from here, CSV format from here) and the accompanying analytic codebook.
Problem 16
You should see the following measures (measurename):
- Adult obesity
- Children in poverty
- High school graduation
- Preventable hospital stays
- Unemployment rate
Looking at pairs of variables in turn, briefly (20 words or less) state whether you would expect a positive, negative, or no relationship between the variables in each pair, and why.
[4] Indoor plumbing is lacking in many still-developing countries, particularly in the rural areas. In some of these countries, particularly India, tigers aren’t the most dangerous animals.
Nominal -> no hierarchy to the levels, values of the categorical variable (A = B = C = …)
Ordinal -> some hierarchy to the levels, values of the categorical variable (A > B > C > …)