Let us start by downloading the relevant data from here.
These data are state-level measures of the “MEANS OF TRANSPORTATION TO WORK” for workers 16 years and older, and have been drawn from the U.S. Census Bureau’s 2011-2013 3-Year American Community Survey.
Once the data open in SPSS, look at the variables and descriptions in the “Variable View” tab. The variables are:
Note that the data include Puerto Rico and the District of Columbia. Since these aren’t states, it would be best to exclude them from any calculations we want to perform. Puerto Rico is Id2 == 72 and the District of Columbia is Id2 == 11.
In the main menu bar, go to Data and then to Select Cases… You will see a dialogue box. Choose the second option you see, the one that reads “If condition is satisfied”. Click the “If” button and then, choosing Id2, move it to the box on the right-hand side to create the following statement: Id2 ~= 11 & Id2 ~= 72
Then click the Continue button. Make sure Filter out unselected cases is selected, and then click OK. If you look at the data now, in Data View, you will see Puerto Rico and DC have been crossed out. These two rows of data will not be used for any calculations or graphics unless you deactivate this selection of cases.
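For readers who want to see the logic of the selection condition outside the dialogue box, here is a minimal Python sketch of the same filter. The rows are made-up illustrations; only the Id2 codes 11 and 72 come from the text above.

```python
# Hypothetical rows for illustration; the real file has one row per geography.
rows = [
    {"Id2": 6, "Geography": "California"},
    {"Id2": 11, "Geography": "District of Columbia"},
    {"Id2": 41, "Geography": "Oregon"},
    {"Id2": 72, "Geography": "Puerto Rico"},
]

# Same condition as Id2 ~= 11 & Id2 ~= 72: keep a row only if it is neither code.
EXCLUDED_IDS = {11, 72}
states_only = [row for row in rows if row["Id2"] not in EXCLUDED_IDS]

print([row["Geography"] for row in states_only])
```

As in SPSS, the excluded rows are not deleted from `rows`; they are simply left out of `states_only`.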
We will focus on a few measures –
Our goal will be to calculate the Mean, Median, Minimum, Maximum, Range, Interquartile Range, Variance, and Standard Deviation of each. Once we have calculated these, let us try to answer the following questions for each variable:
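If it helps to see what each of these statistics measures, the following Python sketch computes all eight for a small made-up list of numbers. The function name `describe` and the sample values are assumptions for illustration. The quartile method used by Python's `statistics.quantiles` may not match the one SPSS uses, so the interquartile range can differ slightly between the two.

```python
import statistics

def describe(values):
    """Return the eight descriptive statistics for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(values),      # square root of the sample variance
    }

# Made-up values, not taken from the commuting data.
sample = [2, 4, 4, 5, 7, 9, 11]
for name, value in describe(sample).items():
    print(name, value)
```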
The results are shown here. Note that this is just the output file you see, exported as a web-report from SPSS; hence it looks different than it does in SPSS.
Now we construct some box-plots. Remember, you will see the minimum, Quartile 1, the Median (which is Quartile 2), Quartile 3, the maximum value, and any outliers. Outliers are any data points that fall more than \(1.5 \times IQR\) above Quartile 3 or more than \(1.5 \times IQR\) below Quartile 1. If we calculate the values that will define the outliers, and we do this for EstimateBicycle, any data point that has a value of …
We see that Quartile 1 is 3553 and Quartile 3 is 21248.
If we add the boundary value of 27174 to the Quartile 3 value of 21248, we end up with 48422. Only three states seem to be outliers on this particular variable – California with 183669, Florida with 54652, and New York with 53187. Notice that Oregon and Illinois do not count as outliers.
There are no outliers at the lower end of the distribution since 27174 below Quartile 1 would mean -23621 and that is impossible. Why? Because the minimum value must be 0 (i.e., nobody bicycles to work in the state).
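The outlier arithmetic above can be checked with a few lines of Python. This sketch takes the quartiles, the boundary value of 27174, and the three state counts exactly as reported in the text; it does not recompute the boundary from the raw data.

```python
# Values as reported in the text for EstimateBicycle.
Q1 = 3553
Q3 = 21248
BOUNDARY = 27174  # distance beyond the quartiles that marks an outlier

upper_fence = Q3 + BOUNDARY  # values above this are high outliers
lower_fence = Q1 - BOUNDARY  # values below this would be low outliers

# The three states named in the text.
bicycle = {"California": 183669, "Florida": 54652, "New York": 53187}

outliers = [state for state, count in bicycle.items()
            if count > upper_fence or count < lower_fence]

print(upper_fence, lower_fence, outliers)
```

Because the lower fence is negative and a count cannot be below 0, no state can fall under it.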
If you think about what we have done, perhaps we should not have compared the states the way we did. Why would it be unfair to compare the states, for example, on the basis of the number of people in each state who bicycle to work? Who walk to work? Who drive alone? … How could we carry out a fair comparison?
Obviously we should take into account the fact that every state is of a different population size. So the appropriate way to compare the state residents’ commuting patterns would be to convert the variables into percents by dividing each by EstimateTotal and multiplying by 100. This can be done by using the Transform menu, and choosing Compute Variable…
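To see why the conversion matters, here is a small Python sketch of the same calculation. The two states and their counts are made up, and the name `percent_bicycle` is a hypothetical choice; only EstimateTotal and EstimateBicycle are variable names from the data.

```python
def to_percent(count, total):
    """Express a count as a percent of the total number of workers."""
    return 100 * count / total

# Made-up rows for illustration only.
rows = [
    {"Geography": "State A", "EstimateTotal": 1_000_000, "EstimateBicycle": 5_000},
    {"Geography": "State B", "EstimateTotal": 200_000, "EstimateBicycle": 3_000},
]

for row in rows:
    row["percent_bicycle"] = to_percent(row["EstimateBicycle"], row["EstimateTotal"])
    print(row["Geography"], row["percent_bicycle"])
```

In this made-up example State A has more cyclists in absolute terms, but State B has the larger share of workers who bicycle, which is the comparison we actually care about.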
Once you have created the three percent variables, go back and calculate the Mean, Median, Minimum, Maximum, Range, Interquartile Range, Variance, and Standard Deviation of the three new variables. Then answer the same questions:
Finally, construct the box-plots and note if you see any outliers. Also identify whether you see any skewed distributions, and if you do, is the skew positive or negative. Which variable seems to be most skewed? Check your answer against what you see in the Output (looking at the Skewness measure reported for each variable).
SPSS has a useful function to create variables that show you a data point’s “rank”, what quartile or decile it falls into, and its percentile rank. We’ll start with quartiles.
Go to Transform in the main menu bar and then choose Rank Cases…
(1) Select percent_worked_at_home
(2) Select Assign Rank 1 to Largest value
(3) Click Rank Types… and make sure Rank is selected and Ntiles is set to 4.
(4) Click OK
What you have done is ask SPSS to rank the 50 states in descending order of percent_worked_at_home AND to lump each state into one of four mutually exclusive groups – the bottom 25% is group 1, the next 25% are in group 2, the next 25% are in group 3, and the top 25% are in group 4. The rank variable is called Rpercent and the group variable is called Npercent.
Now assume you wanted to lump each state into one of 10 mutually exclusive groups. In a way, you are saying you want to figure out which state is in the top 10%, which in the next 10%, and so on. You can follow the same sequence you did to create the four groups. The ONLY change you want to make is to change Ntiles: to 10. Once you do this and then you click OK, look at the data.
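The idea behind ranking and then cutting the ranks into equal-sized groups can be sketched in Python as follows. This is only an illustration under simple assumptions: the values are made up, ties are not handled, and the function names are hypothetical. The exact formula SPSS uses may differ. In this sketch group 1 simply holds the lowest rank numbers, which is an assumption of the sketch, so compare the numbering against the group variable SPSS actually produces.

```python
import math

def rank_descending(values):
    """Give rank 1 to the largest value, rank 2 to the next largest, and so on."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def ntile_groups(ranks, n_groups):
    """Cut the ranks into n_groups groups of (nearly) equal size."""
    count = len(ranks)
    return [math.ceil(rank * n_groups / count) for rank in ranks]

# Made-up percentages for eight hypothetical states.
values = [4.1, 9.8, 2.5, 6.3, 5.0, 7.7, 3.2, 8.4]
ranks = rank_descending(values)
quartile_groups = ntile_groups(ranks, 4)

print(ranks)
print(quartile_groups)
```

Changing the second argument of `ntile_groups` from 4 to 10 mirrors changing Ntiles from 4 to 10 in the dialogue box.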
We can also calculate the descriptive statistics and construct box-plots for groups represented by some grouping variable. For example, we could compare the mean and median reading scores of male versus female students, or the percent working at home in each of the four Census regions.
See the images below for how this would be done with Census_Region as the grouping variable. Note that Census_Region shows up in the Factor List: and that we have added Geography to the box titled Label Cases by:
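Conceptually, using a grouping variable means splitting the rows by group and then summarizing each group separately. A minimal Python sketch of that idea follows. The rows and percentages are made up; Census_Region and percent_worked_at_home are the variable names used in this exercise.

```python
from collections import defaultdict
import statistics

# Made-up rows for illustration only.
rows = [
    {"Census_Region": "Northeast", "percent_worked_at_home": 4.0},
    {"Census_Region": "Northeast", "percent_worked_at_home": 5.0},
    {"Census_Region": "Northeast", "percent_worked_at_home": 6.0},
    {"Census_Region": "West", "percent_worked_at_home": 5.5},
    {"Census_Region": "West", "percent_worked_at_home": 6.5},
]

# Step 1: collect the values for each region.
by_region = defaultdict(list)
for row in rows:
    by_region[row["Census_Region"]].append(row["percent_worked_at_home"])

# Step 2: summarize each region on its own.
summary = {
    region: {"mean": statistics.mean(values), "median": statistics.median(values)}
    for region, values in by_region.items()
}

for region, stats in summary.items():
    print(region, stats)
```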
When constructing box-plots, we can break these out – one per group member – as shown below. Note the grouping variable has been selected as the x-axis variable.