SPSS Tutorials: Crosstabs
The Crosstabs procedure is used to create contingency tables, which describe the interaction between two categorical variables.
Crosstabs
To summarize a single categorical variable, we use frequency tables. To summarize the relationship between two categorical variables, we use a cross-tabulation (also called a contingency table). A cross-tabulation (or crosstab for short) is a table that depicts the number of times each of the possible category combinations occurred in the sample data.
To create a crosstab, click Analyze > Descriptive Statistics > Crosstabs.
A Row(s): One or more variables to use in the rows of the crosstab(s). You must enter at least one Row variable.
B Column(s): One or more variables to use in the columns of the crosstab(s). You must enter at least one Column variable.
Also note that if you specify one row variable and two or more column variables, SPSS will print crosstabs for each pairing of the row variable with the column variables. The same is true if you have one column variable and two or more row variables, or if you have multiple row and column variables.
C Layer: An optional "stratification" variable. When a layer variable is specified, the crosstab between the Row and Column variable(s) will be created at each level of the layer variable. You can have multiple layers of variables by specifying the first layer variable and then clicking Next to specify the second layer variable. Alternatively, you can try out multiple variables as single layers at a time by putting them all in the Layer 1 of 1 box.
D Statistics: Opens the Crosstabs: Statistics window, which contains fifteen different inferential statistics for comparing categorical variables. (These statistics will be covered in detail in a later tutorial.)
E Cells: Opens the Crosstabs: Cell Display window, which controls which output is displayed in each cell of the crosstab.
F Format: Opens the Crosstabs: Table Format window, which specifies how the rows of the table are sorted.
Special Considerations for Crosstabs
Data Requirements
The crosstabs procedure can use numeric or string variables defined as nominal, ordinal, or scale. However, crosstabs should only be used when there are a limited number of categories.
Note that in most cases, the row or column variables can be treated interchangeably. The choice of row/column variable is usually dictated by space requirements or interpretation of the results.
Describing a Crosstab
The dimensions of the crosstab refer to the number of rows and columns in the table (not including the row/column totals). The table dimensions are reported as RxC, where R is the number of categories for the row variable, and C is the number of categories for the column variable.
Additionally, a "square" crosstab is one in which the row and column variables have the same number of categories. Tables of dimensions 2x2, 3x3, 4x4, etc. are all square crosstabs.
Example 1: "Long" table
- Row variable: Class Rank (4 categories: freshman, sophomore, junior, senior)
- Column variable: Gender (2 categories: male, female)
- Table dimension: 4x2
Example 2: "Wide" table
- Row variable: Gender (2 categories: male, female)
- Column variable: Smoking (3 categories: never smoked, past smoker, current smoker)
- Table dimension: 2x3
Example 3: "Square" table
- Row variable: Gender (2 categories: male, female)
- Column variable: Alcohol (2 categories: no, yes)
- Table dimension: 2x2 (square)
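The naming convention in these three examples can be sketched in a few lines of Python (used here only for illustration; it is not part of SPSS). The helper name `describe_dimensions` is hypothetical, and the sketch assumes that "long" means more rows than columns and "wide" means more columns than rows, as the examples suggest.

```python
def describe_dimensions(row_categories, column_categories):
    """Return the RxC label and an informal shape name for a crosstab."""
    r = len(row_categories)      # R: number of row-variable categories
    c = len(column_categories)   # C: number of column-variable categories
    if r == c:
        shape = "square"
    elif r > c:
        shape = "long"           # assumption: more rows than columns
    else:
        shape = "wide"           # assumption: more columns than rows
    return f"{r}x{c}", shape


rank = ["freshman", "sophomore", "junior", "senior"]
gender = ["male", "female"]
smoking = ["never smoked", "past smoker", "current smoker"]
alcohol = ["no", "yes"]

print(describe_dimensions(rank, gender))     # ('4x2', 'long')
print(describe_dimensions(gender, smoking))  # ('2x3', 'wide')
print(describe_dimensions(gender, alcohol))  # ('2x2', 'square')
```

Row and column totals are deliberately left out of the count, matching the definition of table dimensions given above.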
Understanding Row, Column, and Total Percents
A typical 2x2 crosstab has the following construction:
| | Column 1 | Column 2 | Row totals |
|---|---|---|---|
| Row 1 | a | b | a + b |
| Row 2 | c | d | c + d |
| Column totals | a + c | b + d | a + b + c + d |
The letters a, b, c, and d represent what are called cell counts.
- a is the number of observations corresponding to Row 1 AND Column 1.
- b is the number of observations corresponding to Row 1 AND Column 2.
- c is the number of observations corresponding to Row 2 AND Column 1.
- d is the number of observations corresponding to Row 2 AND Column 2.
By adding a, b, c, and d, we can determine the total number of observations in each category, and in the table overall.
- Row sum of row 1 (i.e., total number of observations in Row 1): a + b
- Row sum of row 2 (i.e., total number of observations in Row 2): c + d
- Column sum of column 1 (i.e., total number of observations in Column 1): a + c
- Column sum of column 2 (i.e., total number of observations in Column 2): b + d
- Total sum (i.e., total number of observations in the table): n = a + b + c + d
The row sums and column sums are sometimes referred to as marginal frequencies. Note that if you were to make separate frequency tables for your row variable and your column variable, their counts should match the row totals and column totals, respectively.
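To make the link between raw observations, cell counts, and marginal frequencies concrete, here is a minimal Python sketch (for illustration only; SPSS does this counting for you). The seven observations are made up, and the category names simply reuse the "Row 1"/"Column 1" labels from the table above.

```python
from collections import Counter

# Hypothetical raw data: one (row category, column category) pair per observation.
observations = [
    ("Row 1", "Column 1"), ("Row 1", "Column 1"), ("Row 1", "Column 2"),
    ("Row 2", "Column 1"), ("Row 2", "Column 2"), ("Row 2", "Column 2"),
    ("Row 2", "Column 2"),
]

cell_counts = Counter(observations)                       # cell counts a, b, c, d
row_totals = Counter(row for row, _ in observations)      # a + b and c + d
column_totals = Counter(col for _, col in observations)   # a + c and b + d
n = len(observations)                                     # a + b + c + d

a = cell_counts[("Row 1", "Column 1")]
b = cell_counts[("Row 1", "Column 2")]
c = cell_counts[("Row 2", "Column 1")]
d = cell_counts[("Row 2", "Column 2")]

print(a, b, c, d)            # 2 1 1 3
print(dict(row_totals))      # {'Row 1': 3, 'Row 2': 4}
print(dict(column_totals))   # {'Column 1': 3, 'Column 2': 4}
print(n)                     # 7
```

The two `Counter` objects built from a single variable are exactly the one-variable frequency tables, which is why they agree with the margins of the crosstab.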
When you are describing the composition of your sample, it is often useful to refer to the proportion of the row or column that fell within a particular category. This can be achieved by computing the row percentages or column percentages.
| | Column 1 | Column 2 | Row totals |
|---|---|---|---|
| Row 1 | a | b | a + b |
| Row 1 % | a / (a + b) | b / (a + b) | (a + b) / (a + b) = 100% |
| Row 2 | c | d | c + d |
| Row 2 % | c / (c + d) | d / (c + d) | (c + d) / (c + d) = 100% |
| Column totals | a + c | b + d | a + b + c + d |
| % of total | (a + c) / (a + b + c + d) | (b + d) / (a + b + c + d) | (a + b + c + d) / (a + b + c + d) = 100% |
Notice that when computing row percentages, the denominators for cells a, b, c, d are determined by the row sums (here, a + b and c + d). This implies that the percentages in the "row totals" column must equal 100%.
| | Column 1 | Column 2 | Row totals |
|---|---|---|---|
| Row 1 | a | b | a + b |
| Column % | a / (a + c) | b / (b + d) | (a + b) / (a + b + c + d) |
| Row 2 | c | d | c + d |
| Column % | c / (a + c) | d / (b + d) | (c + d) / (a + b + c + d) |
| Column totals | a + c | b + d | a + b + c + d |
| Column % | (a + c) / (a + c) = 100% | (b + d) / (b + d) = 100% | (a + b + c + d) / (a + b + c + d) = 100% |
Notice that when computing column percentages, the denominators for cells a, b, c, d are determined by the column sums (here, a + c and b + d). This implies that the percentages in the "column totals" row must equal 100%.
| | Column 1 | Column 2 | Row totals |
|---|---|---|---|
| Row 1 | a | b | a + b |
| % of total | a / (a + b + c + d) | b / (a + b + c + d) | (a + b) / (a + b + c + d) |
| Row 2 | c | d | c + d |
| % of total | c / (a + b + c + d) | d / (a + b + c + d) | (c + d) / (a + b + c + d) |
| Column totals | a + c | b + d | a + b + c + d |
| % of total | (a + c) / (a + b + c + d) | (b + d) / (a + b + c + d) | (a + b + c + d) / (a + b + c + d) = 100% |
Notice that when total percentages are computed, the denominators for all of the computations are equal to the total number of observations in the table, i.e. a + b + c + d.
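The only thing that changes between row, column, and total percentages is the denominator. The following Python sketch (illustrative only, not SPSS code) makes that explicit. The function name `percentages` and the counts a = 10, b = 30, c = 20, d = 40 are hypothetical, and the sketch assumes no row or column sum is zero.

```python
def percentages(table, kind):
    """Convert a table of counts (a list of rows) into percentages.

    kind is "row", "column", or "total" and selects the denominator.
    """
    row_sums = [sum(row) for row in table]
    column_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, count in enumerate(row):
            if kind == "row":
                denominator = row_sums[i]      # a + b or c + d
            elif kind == "column":
                denominator = column_sums[j]   # a + c or b + d
            elif kind == "total":
                denominator = n                # a + b + c + d
            else:
                raise ValueError(f"unknown kind: {kind!r}")
            out_row.append(100 * count / denominator)
        result.append(out_row)
    return result


# Hypothetical 2x2 counts: a = 10, b = 30, c = 20, d = 40
table = [[10, 30], [20, 40]]

row_pct = percentages(table, "row")        # each row sums to 100
column_pct = percentages(table, "column")  # each column sums to 100
total_pct = percentages(table, "total")    # all cells together sum to 100

print(row_pct[0])   # [25.0, 75.0]
print(total_pct)    # [[10.0, 30.0], [20.0, 40.0]]
```

The same function works for tables larger than 2x2, since the denominators are computed from whatever rows and columns are supplied.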
Example: Summarizing the Relationships of Three Categorical Variables
Problem Statement
Some universities in the United States require that freshmen live in the on-campus dormitories during their first year, with exceptions for students whose families live within a certain radius of campus. That is, certain freshmen whose families live close enough to campus are permitted to live off-campus. After completing their first or second year of school, students living in the dorms may choose to move into an off-campus apartment. How prevalent is this pattern?
In the sample dataset, there are several variables relating to this question:
- Rank - Class rank (Freshman, Sophomore, Junior, Senior)
- RankUpperUnder - Class rank recoded into Underclassman/Upperclassman (see the Recode into Different Variables tutorial)
- LiveOnCampus - Do you live on campus? (Yes/No)
- State - Are you an in-state or out-of-state student? (In State, Out of state)
- State_Residency - State residency, converted from string to numeric so that missing values are correctly identified (See the Automatic Recode tutorial)
Let's use different aspects of the Crosstabs procedure to investigate the relationship between class rank and living on campus.
Part 1 - Simple Crosstabs
Using the sample data, let's make a crosstab of the variables Rank and LiveOnCampus. Let the row variable be Rank, and the column variable be LiveOnCampus.
Running the Procedure
Using the Crosstabs Dialog Window
- Open the Crosstabs window (Analyze > Descriptive Statistics > Crosstabs).
- Select Rank as the row variable, and LiveOnCampus as the column variable.
- Click OK.
Using Syntax
* Crosstab of Rank (rows) by LiveOnCampus (columns), showing cell counts only.
CROSSTABS
  /TABLES=Rank BY LiveOnCampus
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT
  /COUNT ROUND CELL.
Output
The Case Processing Summary tells us what proportion of the observations had nonmissing values for both Rank and LiveOnCampus. In this sample, there were 47 cases that had a missing value for Rank, LiveOnCampus, or for both Rank and LiveOnCampus.
The second table (here, Class Rank * Do you live on campus? Crosstabulation) contains the crosstab. We can quickly observe information about the interaction of these two variables:
- Many more freshmen lived on-campus (100) than off-campus (37)
- About an equal number of sophomores lived off-campus (42) versus on-campus (48)
- Far more juniors lived off-campus (90) than on-campus (8)
- Only one (1) senior lived on campus; the rest lived off-campus (62)
Note the margins of the crosstab (i.e., the "total" row and column) give us the same information that we would get from frequency tables of Rank and LiveOnCampus, respectively:
- The sample had 137 freshmen, 90 sophomores, 98 juniors, and 63 seniors
- There were 231 individuals who lived off-campus, and 157 individuals lived on-campus
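As a quick arithmetic check, the margins can be recomputed from the eight cell counts quoted above. This is a small Python sketch for illustration (not SPSS syntax); the dictionary layout is just one convenient way to hold the numbers.

```python
# Cell counts reported in the output above: (off campus, on campus) per class rank.
counts = {
    "Freshman": (37, 100),
    "Sophomore": (42, 48),
    "Junior": (90, 8),
    "Senior": (62, 1),
}

# Row margins: total students in each class rank.
row_totals = {rank: off + on for rank, (off, on) in counts.items()}

# Column margins: total students living off campus and on campus.
off_total = sum(off for off, _ in counts.values())
on_total = sum(on for _, on in counts.values())
n = off_total + on_total

print(row_totals)               # {'Freshman': 137, 'Sophomore': 90, 'Junior': 98, 'Senior': 63}
print(off_total, on_total, n)   # 231 157 388
```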
Part 2 - Row, column, and total percentages
Let's build on the table shown in Part 1 by adding row, column, and total percentages. For simplicity's sake, let's switch out the variable Rank (which has four categories) with the variable RankUpperUnder (which has two categories).
Running the Procedure
Using the Crosstabs Dialog Window
- Reopen the Crosstabs window (Analyze > Descriptive Statistics > Crosstabs).
- In the Row box, replace variable Rank with RankUpperUnder.
- Click Cells. In the Percentages area, check the Row, Column, and Total boxes. (In the following examples, we will show each of these one at a time for ease of reading.) Click Continue.
- Click OK to run.
Using Syntax
* Crosstab of RankUpperUnder by LiveOnCampus with counts plus row, column, and total percentages.
CROSSTABS
  /TABLES=RankUpperUnder BY LiveOnCampus
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT ROW COLUMN TOTAL
  /COUNT ROUND CELL.
Output
Row percents
If the row variable is RankUpperUnder and the column variable is LiveOnCampus, then the row percentages will tell us what percentage of the upperclassmen or what percentage of the underclassmen live on campus. That is, variable RankUpperUnder will determine the denominator of the percentage computations.
- The proportion of underclassmen who live off campus is 34.8%, or 79/227.
- The proportion of underclassmen who live on campus is 65.2%, or 148/227.
- The proportion of upperclassmen who live off campus is 94.4%, or 152/161.
- The proportion of upperclassmen who live on campus is 5.6%, or 9/161.
Column percents
If the row variable is RankUpperUnder and the column variable is LiveOnCampus, then the column percentages will tell us what percentage of the individuals who live on campus (or off campus) are upperclassmen or underclassmen. That is, variable LiveOnCampus will determine the denominator of the percentage computations.
- The proportion of individuals living off campus that are underclassmen is 34.2%, or 79/231.
- The proportion of individuals living off campus that are upperclassmen is 65.8%, or 152/231.
- The proportion of individuals living on campus that are underclassmen is 94.3%, or 148/157.
- The proportion of individuals living on campus that are upperclassmen is 5.7%, or 9/157.
Total percents
If the row variable is RankUpperUnder and the column variable is LiveOnCampus, then the total percentage tells us what proportion of the total is within each combination of RankUpperUnder and LiveOnCampus. That is, the overall table size determines the denominator of the percentage computations.
- Underclassmen living off campus make up 20.4% of the sample (79/388).
- Underclassmen living on campus make up 38.1% of the sample (148/388).
- Upperclassmen living off campus make up 39.2% of the sample (152/388).
- Upperclassmen living on campus make up 2.3% of the sample (9/388).
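All twelve percentages in this part follow from the same four cell counts (79, 148, 152, 9); only the denominator changes. The Python sketch below (illustrative only, not SPSS syntax) recomputes them, rounding to one decimal place as in the text. The variable names are hypothetical.

```python
# Counts from the RankUpperUnder by LiveOnCampus table: [off campus, on campus].
counts = {
    "Underclassman": [79, 148],
    "Upperclassman": [152, 9],
}

column_sums = [sum(col) for col in zip(*counts.values())]  # [231, 157]
n = sum(column_sums)                                       # 388

percents = {}
for group, row in counts.items():
    row_sum = sum(row)  # 227 for underclassmen, 161 for upperclassmen
    percents[group] = {
        "row": [round(100 * x / row_sum, 1) for x in row],
        "column": [round(100 * x / total, 1) for x, total in zip(row, column_sums)],
        "total": [round(100 * x / n, 1) for x in row],
    }
    print(group, percents[group])

# Underclassman {'row': [34.8, 65.2], 'column': [34.2, 94.3], 'total': [20.4, 38.1]}
# Upperclassman {'row': [94.4, 5.6], 'column': [65.8, 5.7], 'total': [39.2, 2.3]}
```

Reading across the three result lists for one cell shows how different the answers can be: the 148 on-campus underclassmen are 65.2% of underclassmen, 94.3% of on-campus residents, and 38.1% of the whole sample.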
Part 3 - Crosstabs with Layer Variable
Let's modify our analysis slightly by taking into account the students' state of residence (in-state or out-of-state). Here, we will be working with three categorical variables: RankUpperUnder, LiveOnCampus, and State_Residency.
In this example, we want to create a crosstab of RankUpperUnder by LiveOnCampus, with variable State_Residency acting as a stratification, or layer, variable.
Running the Procedure
Using the Crosstabs Dialog Window
- Open the Crosstabs dialog (Analyze > Descriptive Statistics > Crosstabs).
- Select RankUpperUnder as the row variable, and LiveOnCampus as the column variable.
- Select State_Residency as the layer variable.
- You may want to go back to the Cells options and turn off the row, column, and total percentages if you have just run the previous example.
- Click OK.
Using Syntax
* Crosstab of RankUpperUnder by LiveOnCampus, layered by State_Residency.
CROSSTABS
  /TABLES=RankUpperUnder BY LiveOnCampus BY State_Residency
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT
  /COUNT ROUND CELL.
Output
Again, the Crosstabs output includes the Case Processing Summary table and the crosstabulation itself.
Notice that after including the layer variable State Residency, the number of valid cases we have to work with has dropped from 388 to 367. This is because the crosstab requires nonmissing values for all three variables: row, column, and layer.
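The drop in valid cases follows from a simple rule: a case is counted only if every variable in the table is nonmissing. The Python sketch below illustrates that rule with four made-up cases (it is not SPSS code, and it does not use the sample dataset); `count_valid` is a hypothetical helper name.

```python
# Hypothetical cases; None marks a missing value.
cases = [
    {"RankUpperUnder": "Underclassman", "LiveOnCampus": "Yes", "State_Residency": "In state"},
    {"RankUpperUnder": "Upperclassman", "LiveOnCampus": "No", "State_Residency": None},
    {"RankUpperUnder": None, "LiveOnCampus": "No", "State_Residency": "Out of state"},
    {"RankUpperUnder": "Underclassman", "LiveOnCampus": "Yes", "State_Residency": "Out of state"},
]


def count_valid(cases, variables):
    """Count cases that have nonmissing values on every listed variable."""
    return sum(all(case[v] is not None for v in variables) for case in cases)


two_way = count_valid(cases, ["RankUpperUnder", "LiveOnCampus"])
three_way = count_valid(cases, ["RankUpperUnder", "LiveOnCampus", "State_Residency"])

print(two_way)    # 3
print(three_way)  # 2
```

Adding a variable to the table can only keep the number of valid cases the same or reduce it, never increase it.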
The layered crosstab shows the individual Rank by Campus tables within each level of State Residency. Some observations we can draw from this table include:
- A slightly higher proportion of out-of-state underclassmen live on campus (30/43) than do in-state underclassmen (110/168).
- There were about equal numbers of out-of-state upper and underclassmen; for in-state students, the underclassmen outnumbered the upperclassmen.
- Of the nine upperclassmen living on-campus, only two were from out of state.
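Because the two layers contain different numbers of students, comparisons between them are easier to read as percentages than as raw fractions. This short Python sketch (illustrative only) converts the two fractions quoted in the first observation above; the dictionary layout is an assumption made for the example.

```python
# (on campus, total) for underclassmen in each layer, as quoted in the text.
underclassmen = {
    "Out of state": (30, 43),
    "In state": (110, 168),
}

on_campus_pct = {
    residency: round(100 * on_campus / total, 1)
    for residency, (on_campus, total) in underclassmen.items()
}

print(on_campus_pct)  # {'Out of state': 69.8, 'In state': 65.5}
```

These are row percentages computed within each level of the layer variable, so the denominators are the underclassman totals for each residency group rather than the overall sample size.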