1 Calculating Measures of Central Tendency and Variability

Let us start by downloading the relevant data from here

These data are state-level measures of the “MEANS OF TRANSPORTATION TO WORK” for Workers 16 years and older, and have been drawn from the U.S. Census Bureau’s 2011-2013 3-Year American Community Survey.

Once the data open in SPSS, look at the variables and descriptions in the “Variable View” tab. The variables are:

  • Id = Unique ID used by the Census Bureau
  • Id2 = Each state/territory’s unique Federal Information Processing Standards (FIPS) code
  • Geography = Name of the state or territory
  • EstimateTotal = Total estimated population 16 years old or older who are working
  • EstimateCartruckorvan = Number of people in the population group commuting by: Car, truck, or van:
  • EstimateCartruckorvanDrovealone = Number of people in the population group commuting by: Car, truck, or van: - Drove alone
  • EstimateCartruckorvanCarpooled = Number of people in the population group commuting by: Car, truck, or van: - Carpooled:
  • EstimateCartruckorvanCarpooledIn2personcarpool = Number of people in the population group commuting by: Car, truck, or van: - Carpooled: - In 2-person carpool
  • EstimateCartruckorvanCarpooledIn3personcarpool = Number of people in the population group commuting by: Car, truck, or van: - Carpooled: - In 3-person carpool
  • EstimateCartruckorvanCarpooledIn4personcarpool = Number of people in the population group commuting by: Car, truck, or van: - Carpooled: - In 4-person carpool
  • EstimateCartruckorvanCarpooledIn5or6personc = Number of people in the population group commuting by: Car, truck, or van: - Carpooled: - In 5- or 6-person carpool
  • EstimateCartruckorvanCarpooledIn7ormoreperson = Number of people in the population group commuting by: Car, truck, or van: - Carpooled: - In 7-or-more-person carpool
  • EstimatePublictransportationexcludingtaxicab = Number of people in the population group commuting by: Public transportation (excluding taxicab):
  • EstimatePublictransportationexcludingtaxicabBusortr = Number of people in the population group commuting by: Public transportation (excluding taxicab): - Bus or trolley bus
  • EstimatePublictransportationexcludingtaxicabStreetcar = Number of people in the population group commuting by: Public transportation (excluding taxicab): - Streetcar or trolley car (carro publico in Puerto Rico)
  • EstimatePublictransportationexcludingtaxicabSubwayor = Number of people in the population group commuting by: Public transportation (excluding taxicab): - Subway or elevated
  • EstimatePublictransportationexcludingtaxicabRailroad = Number of people in the population group commuting by: Public transportation (excluding taxicab): - Railroad
  • EstimatePublictransportationexcludingtaxicabFerryboat = Number of people in the population group commuting by: Public transportation (excluding taxicab): - Ferryboat
  • EstimateTaxicab = Number of people in the population group commuting by: Taxicab
  • EstimateMotorcycle = Number of people in the population group commuting by: Motorcycle
  • EstimateBicycle = Number of people in the population group commuting by: Bicycle
  • EstimateWalked = Number of people in the population group commuting by: Walked
  • EstimateOthermeans = Number of people in the population group commuting by: Other means
  • EstimateWorkedathome = Number of people in the population group commuting by: Worked at home
  • Census_Region = Whether the state falls in the Northeast, Midwest, South, or West

1.1 Excluding Puerto Rico and the District of Columbia

Since these aren’t states, it would be best to exclude them from any calculations we want to perform. Puerto Rico is Id2 == 72 and the District of Columbia is Id2 == 11.

In the main menu bar, go to Data and then to Select Cases… You will see a dialogue box. Choose the second option, the one that reads “If condition is satisfied”. Click the “If” button, select Id2, and move it to the box on the right-hand side to create the following statement: Id2 ~= 11 & Id2 ~= 72

Then click the Continue button. Make sure Filter out unselected cases is selected, and then click OK. If you look at the data now, in Data View, you will see Puerto Rico and DC have been crossed out. These two rows of data will not be used for any calculations or graphics unless you deactivate this selection of cases.
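
For readers who want to see the logic of the selection condition outside SPSS, here is a minimal Python sketch. The rows are a toy stand-in for the data set (only the FIPS codes are real), and `keep_case` is a hypothetical helper name, not anything SPSS provides.

```python
# Toy rows standing in for the SPSS data set; only Id2 and Geography are shown.
rows = [
    {"Id2": 6, "Geography": "California"},
    {"Id2": 11, "Geography": "District of Columbia"},
    {"Id2": 36, "Geography": "New York"},
    {"Id2": 72, "Geography": "Puerto Rico"},
]


def keep_case(row):
    """Mirror the SPSS condition Id2 ~= 11 & Id2 ~= 72 (hypothetical helper)."""
    return row["Id2"] != 11 and row["Id2"] != 72


# Keep only the rows that satisfy the condition, as the SPSS filter does.
selected = [row for row in rows if keep_case(row)]
print([row["Geography"] for row in selected])
```

Both parts of the condition must hold, which is why the two tests are joined with "and" (the `&` in SPSS) rather than "or".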

2 The task at hand

We will focus on four measures:

  1. EstimateTotal
  2. EstimateWorkedathome
  3. EstimateBicycle
  4. EstimateCartruckorvanDrovealone

2.1 Basic descriptive statistics

Our goal will be to calculate the Mean, Median, Minimum, Maximum, Range, Interquartile Range, Variance, and Standard Deviation of each. Once we have calculated these, let us try to answer the following questions:

  1. How close are the Mean and the Median?
  2. Which of the four variables has the highest versus the lowest variance?
  3. Which of the four variables has the smallest Interquartile Range?
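
As a rough cross-check on what each measure means, the sketch below computes the same eight statistics in Python using only the standard library. The numbers in `toy` are made up (they are not Census figures), and `describe` is a hypothetical helper. Quartile conventions differ between software packages, so the Interquartile Range may not match SPSS output exactly on real data.

```python
import statistics


def describe(values):
    """Return the summary measures listed above for one variable."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(values),      # square root of the sample variance
    }


toy = [2, 4, 4, 5, 7, 9, 25]  # made-up values with one very large observation
summary = describe(toy)
for name, value in summary.items():
    print(name, value)
```

In this toy example the single large value pulls the mean above the median, which is the kind of pattern question 1 asks you to look for.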

2.2 Answers

The results are shown here. Note that this is just the output file you see, exported as a web-report from SPSS; hence it looks different than it does in SPSS.

  1. The mean and median are far apart for every variable, with the mean always being greater than the median.
  2. The variance is the highest for EstimateTotal and the lowest for EstimateBicycle.
  3. EstimateBicycle has the smallest Interquartile Range.

3 Constructing box-plots

Now we construct some box-plots. Remember, a box-plot shows the minimum, Quartile 1, the Median (which is Quartile 2), Quartile 3, the maximum value, and any outliers. Outliers are any data points that fall more than \(1.5 \times IQR\) above Quartile 3 or below Quartile 1. If we calculate this boundary value for EstimateBicycle, an outlier is any data point that lies more than

  1. 27174 above Quartile 3, or
  2. 27174 below Quartile 1.

We see that Quartile 1 is 3553 and Quartile 3 is 21248.

If we add the boundary value of 27174 to the Quartile 3 value of 21248, we end up with 48422. Only three states seem to be outliers on this particular variable – California with 183669, Florida with 54652, and New York with 53187. Notice that Oregon and Illinois do not count as outliers.

There are no outliers at the lower end of the distribution since 27174 below Quartile 1 would mean -23621, and no state can have a value that low. Why? Because the smallest possible value is 0 (i.e., nobody bicycles to work in the state).
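
The fence arithmetic above can be checked with a few lines of Python. This is only a sketch: it uses the quartiles, the boundary value, and the three state values quoted in the text, and the variable names are made up. Note that \(1.5 \times (21248 - 3553)\) works out to 26542.5 rather than 27174; the sketch keeps the 27174 used in the text, which may reflect an IQR computed under a different quartile definition (this is a guess, not something the output confirms).

```python
q1, q3 = 3553, 21248   # quartiles of EstimateBicycle quoted in the text
step = 27174           # the 1.5 x IQR boundary value quoted in the text

upper_fence = q3 + step   # values above this are outliers
lower_fence = q1 - step   # values below this are outliers

# For comparison only: 1.5 x IQR computed directly from the two quartiles.
step_from_quartiles = 1.5 * (q3 - q1)

# Values for the three states named in the text.
bicycle = {"California": 183669, "Florida": 54652, "New York": 53187}

outliers = [
    state for state, value in bicycle.items()
    if value > upper_fence or value < lower_fence
]
print(upper_fence, lower_fence, step_from_quartiles, outliers)
```

All three named states lie above the upper fence under either boundary value, and the lower fence is negative, so no state can fall below it.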

4 Improving the analysis

If you think about what we have done, perhaps we should not have compared the states the way we did. Why would it be unfair to compare the states, for example, on the basis of the number of people in each state who bicycle to work? Who walk to work? Who drive alone? … How could we carry out a fair comparison?

4.1 Calculating “Percent”

Obviously we should take into account the fact that every state is of a different population size. So the appropriate way to compare the state residents’ commuting patterns would be to convert the variables into percents by dividing each by EstimateTotal (and multiplying by 100 if you want percentages rather than proportions). This can be done by using the Transform menu, and choosing Compute Variable…
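
The computation itself is a single division per variable. Here is a minimal Python sketch for one made-up state; the counts are invented for illustration, and the names `percent_bicycle` and `percent_drove_alone` are assumptions (only `percent_worked_at_home` appears later in this guide).

```python
def to_percent(count, total):
    """Express a count as a percentage of the total."""
    return 100 * count / total


# Made-up counts for a single hypothetical state.
state = {
    "EstimateTotal": 200000,
    "EstimateWorkedathome": 9000,
    "EstimateBicycle": 1000,
    "EstimateCartruckorvanDrovealone": 150000,
}

total = state["EstimateTotal"]
percents = {
    "percent_worked_at_home": to_percent(state["EstimateWorkedathome"], total),
    "percent_bicycle": to_percent(state["EstimateBicycle"], total),
    "percent_drove_alone": to_percent(state["EstimateCartruckorvanDrovealone"], total),
}
print(percents)
```

Because each percentage is relative to the state's own working population, a large state and a small state can now be compared directly.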

4.2 Repeating the preceding calculations and box-plots

Once you have created the three percent variables, go back and calculate the Mean, Median, Minimum, Maximum, Range, Interquartile Range, Variance, and Standard Deviation of the three new variables. Then answer the same questions:

  1. How close are the Mean and the Median?
  2. Which of the three variables has the highest versus the lowest variance?
  3. Which of the three variables has the smallest Interquartile Range?

Finally, construct the box-plots and note if you see any outliers. Also identify whether you see any skewed distributions, and if you do, is the skew positive or negative. Which variable seems to be most skewed? Check your answer against what you see in the Output (looking at the Skewness measure reported for each variable).

4.3 Percentiles, Deciles, and more

SPSS has a useful function to create variables that show you a data point’s “rank”, what quartile or decile it falls into, and its percentile rank. We’ll start with quartiles.

4.3.1 Quartiles

Go to Transform in the main menu bar and then choose Rank Cases….

  1. Select percent_worked_at_home
  2. Select Assign Rank 1 to Largest value
  3. Click Rank Types… and make sure Rank is selected and Ntiles is set to 4.
  4. Click OK

What you have done is ask SPSS to rank the 50 states in descending order of percent_worked_at_home AND to lump each state into one of four mutually exclusive groups: the bottom 25% are in group 1, the next 25% are in group 2, the next 25% are in group 3, and the top 25% are in group 4. The rank variable is called Rpercent and the group variable is called Npercent.

4.3.2 Deciles

Now assume you wanted to lump each state into one of 10 mutually exclusive groups. In a way, you are saying you want to figure out which state is in the top 10%, which in the next 10%, and so on. You can follow the same sequence you did to create the four groups. The ONLY change you want to make is to change Ntiles: to 10. Once you do this and then you click OK, look at the data.
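
To make the idea of ranks and Ntile groups concrete, here is a small Python sketch. It follows the grouping described in 4.3.1 (rank 1 for the largest value, group 1 for the lowest values), uses made-up percentages, and splits the cases into equal-sized groups in the simplest possible way. The function names are hypothetical, and SPSS's own rules for ties and group boundaries may differ.

```python
def rank_descending(values):
    """Rank 1 goes to the largest value; ties are not handled specially."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def ntile_groups(values, n):
    """Assign each value to one of n groups; group 1 holds the lowest values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for position, index in enumerate(order):
        groups[index] = position * n // len(values) + 1
    return groups


toy_percent = [4.4, 2.1, 7.5, 3.9, 5.2, 3.4, 6.0, 4.8]  # made-up values

ranks = rank_descending(toy_percent)
quartile_groups = ntile_groups(toy_percent, 4)   # Ntiles set to 4
decile_groups = ntile_groups(toy_percent, 10)    # Ntiles set to 10
print(ranks)
print(quartile_groups)
print(decile_groups)
```

Switching from quartiles to deciles only changes the `n` argument, just as the only change in SPSS is the Ntiles value.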

5 Descriptive statistics and Box-plots by Grouping Variable

We can also calculate the descriptive statistics and construct box-plots for groups represented by some grouping variable. For example, we could compare the mean and median reading scores of male versus female students, or the percent working at home in each of the four Census regions.

5.1 Statistics by group

See the images below for how this would be done with Census_Region as the grouping variable. Note that Census_Region shows up in the Factor List: and that we have added Geography to the box titled Label Cases by:
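
Conceptually, statistics by group just means splitting the cases by the grouping variable and summarizing each piece separately. Here is a minimal Python sketch of that idea; the percentages are made up, and the variable names are chosen for illustration only.

```python
import statistics
from collections import defaultdict

# (Census_Region, percent_worked_at_home) pairs; the percentages are made up.
toy = [
    ("Northeast", 4.0), ("Northeast", 5.0),
    ("Midwest", 3.0), ("Midwest", 4.0), ("Midwest", 5.0),
    ("South", 3.5),
    ("West", 6.0), ("West", 5.0),
]

# Split the values by region.
by_region = defaultdict(list)
for region, pct in toy:
    by_region[region].append(pct)

# Summarize each region separately.
region_summary = {
    region: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }
    for region, values in by_region.items()
}
for region, stats in region_summary.items():
    print(region, stats)
```

Each region gets its own mean and median, which is what SPSS reports once Census_Region is placed in the Factor List.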

5.2 Box-plots by group

When constructing box-plots, we can break these out – one per group member – as shown below. Note the grouping variable has been selected as the x-axis variable.