The Crosstabs procedure is used to create contingency tables, which describe the "interaction" between two categorical variables. This tutorial covers the descriptive statistics aspects of the Crosstabs procedure.
To summarize a single categorical variable, we use frequency tables. To summarize the relationship between two categorical variables, we use a cross-tabulation (also called a contingency table). A cross-tabulation (or crosstab for short) is a table that depicts the number of times each of the possible category combinations occurred in the sample data.
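To make the definition concrete, here is a minimal sketch outside of SPSS that counts category combinations by hand. The data values and the names `sample` and `cell_counts` are made up for illustration; they are not taken from the sample dataset.

```python
from collections import Counter

# Hypothetical sample: each tuple is one respondent's (class rank, housing).
sample = [
    ("Freshman", "On-campus"),
    ("Freshman", "On-campus"),
    ("Freshman", "Off-campus"),
    ("Senior", "Off-campus"),
    ("Senior", "Off-campus"),
    ("Senior", "On-campus"),
    ("Senior", "Off-campus"),
]

# A crosstab counts how often each category combination occurs.
cell_counts = Counter(sample)

rows = sorted({rank for rank, _ in sample})
cols = sorted({housing for _, housing in sample})

# Print the counts as a small table: one line per row category.
print("".ljust(10) + "".join(c.ljust(12) for c in cols))
for r in rows:
    print(r.ljust(10) + "".join(str(cell_counts[(r, c)]).ljust(12) for c in cols))
```

Each printed cell is the number of respondents with that particular pair of values, which is what the Crosstabs procedure reports when only counts are requested.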
To create a crosstab, click Analyze > Descriptive Statistics > Crosstabs.
A Row(s): One or more variables to use in the rows of the crosstab(s). You must enter at least one Row variable.
B Column(s): One or more variables to use in the columns of the crosstab(s). You must enter at least one Column variable.
Also note that if you specify one row variable and two or more column variables, SPSS will print crosstabs for each pairing of the row variable with the column variables. The same is true if you have one column variable and two or more row variables, or if you have multiple row and column variables.
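The pairing rule described above can be sketched as a simple cross product. This only illustrates which tables to expect, assuming one table per row/column pairing as stated; the variable lists are hypothetical entries for the Row(s) and Column(s) boxes.

```python
from itertools import product

# Hypothetical entries for the Row(s) and Column(s) boxes.
row_vars = ["Rank"]
col_vars = ["LiveOnCampus", "State_Residency"]

# One crosstab per (row variable, column variable) pairing.
tables = [f"{r} * {c}" for r, c in product(row_vars, col_vars)]

for title in tables:
    print(title)
```

With one row variable and two column variables this lists two tables; with two of each it would list four.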
C Layer: An optional "stratification" variable. When a layer variable is specified, the crosstab between the Row and Column variable(s) will be created at each level of the layer variable. You can have multiple layers of variables by specifying the first layer variable and then clicking Next to specify the second layer variable. Alternatively, you can use several variables as separate single layers, one at a time, by putting them all in the Layer 1 of 1 box.
D Statistics: Opens the Crosstabs: Statistics window, which contains fifteen different inferential statistics for comparing categorical variables. (These statistics will be covered in detail in a later tutorial.)
E Cells: Opens the Crosstabs: Cell Display window, which controls which output is displayed in each cell of the crosstab.
F Format: Opens the Crosstabs: Table Format window, which specifies how the rows of the table are sorted.
The Crosstabs procedure can use numeric or string variables defined as nominal, ordinal, or scale. However, crosstabs should only be used when each variable has a limited number of categories.
Note that in most cases, the row or column variables can be treated interchangeably. The choice of row/column variable is usually dictated by space requirements or interpretation of the results.
The dimensions of the crosstab refer to the number of rows and columns in the table (not including the row/column totals). The table dimensions are reported as RxC, where R is the number of categories for the row variable, and C is the number of categories for the column variable.
Additionally, a "square" crosstab is one in which the row and column variables have the same number of categories. Tables of dimensions 2x2, 3x3, 4x4, etc. are all square crosstabs.
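As a small illustration of the RxC convention, the sketch below counts the distinct categories observed for each variable. The helper `crosstab_dimensions` and the data are hypothetical, and the sketch assumes that only categories that actually occur in the data count toward the dimensions.

```python
def crosstab_dimensions(row_values, col_values):
    """Return (R, C): the number of distinct row and column categories."""
    return len(set(row_values)), len(set(col_values))

# Hypothetical data: four class ranks, two housing categories.
rank = ["Freshman", "Sophomore", "Junior", "Senior", "Freshman"]
housing = ["On-campus", "Off-campus", "Off-campus", "Off-campus", "On-campus"]

r, c = crosstab_dimensions(rank, housing)
print(f"{r}x{c}", "(square)" if r == c else "(not square)")
```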
A typical 2x2 crosstab has the following construction:
| | Column 1 | Column 2 | Row totals |
|---|---|---|---|
| Row 1 | a | b | a + b |
| Row 2 | c | d | c + d |
| Column totals | a + c | b + d | a + b + c + d |
The letters a, b, c, and d represent what are called cell counts.
By adding the cell counts across a row or down a column, we can determine the total number of observations in each category (a + b, c + d, a + c, and b + d). Adding all four cell counts gives the total number of observations in the table.
Note that if you were to make frequency tables for your row variable and your column variable, the frequency tables should match the values for the row and column totals.
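The agreement between the table margins and the one-variable frequency tables can be checked by hand. The following is a minimal sketch with made-up data; the names `pairs`, `cells`, `row_totals`, and `col_totals` are illustrative only.

```python
from collections import Counter

# Hypothetical (row category, column category) pairs, one per case.
pairs = [
    ("Row 1", "Column 1"), ("Row 1", "Column 1"), ("Row 1", "Column 2"),
    ("Row 2", "Column 1"), ("Row 2", "Column 2"), ("Row 2", "Column 2"),
]

# Cell counts (a, b, c, d in the table above).
cells = Counter(pairs)

# Margins: add the cell counts across each row and down each column.
row_totals, col_totals = Counter(), Counter()
for (row, col), n in cells.items():
    row_totals[row] += n
    col_totals[col] += n

# Frequency tables computed from each variable on its own.
row_freq = Counter(row for row, _ in pairs)
col_freq = Counter(col for _, col in pairs)

print(row_totals == row_freq)  # margins match the row variable's frequencies
print(col_totals == col_freq)  # margins match the column variable's frequencies
```

This holds when both frequency tables are built from the same cases that enter the crosstab. If a case is missing on one variable only, it appears in the other variable's frequency table but not in the crosstab.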
When you are describing the composition of your sample, it is often useful to refer to the proportion of the row or column that fell within a particular category. This can be achieved by computing the row percentages or column percentages.
| | Column 1 | Column 2 | Row totals |
|---|---|---|---|
| Row 1 | a | b | a + b |
| Row 1 % | a / (a + b) | b / (a + b) | (a + b) / (a + b) = 100% |
| Row 2 | c | d | c + d |
| Row 2 % | c / (c + d) | d / (c + d) | (c + d) / (c + d) = 100% |
| Column totals | a + c | b + d | a + b + c + d |
| % of total | (a + c) / (a + b + c + d) | (b + d) / (a + b + c + d) | (a + b + c + d) / (a + b + c + d) = 100% |
Notice that when computing row percentages, the denominators for cells a, b, c, d are determined by the row sums (here, a + b and c + d). This implies that the percentages in the "row totals" column must equal 100%.
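A worked example of the row percentage arithmetic, using hypothetical cell counts for a, b, c, and d (these numbers are not from the sample dataset):

```python
# Hypothetical cell counts for a 2x2 crosstab.
a, b, c, d = 30, 10, 20, 40
table = [[a, b],
         [c, d]]

# Row percentages: each cell is divided by the sum of its own row.
row_pct = [[100 * cell / sum(row) for cell in row] for row in table]

for label, pcts in zip(["Row 1", "Row 2"], row_pct):
    formatted = ", ".join(f"{p:.1f}%" for p in pcts)
    print(f"{label}: {formatted} (sums to {sum(pcts):.1f}%)")
```

Here Row 1 uses the denominator a + b = 40 and Row 2 uses c + d = 60, so each row's percentages add up to 100%.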
| | Column 1 | Column 2 | Row totals |
|---|---|---|---|
| Row 1 | a | b | a + b |
| Column % | a / (a + c) | b / (b + d) | (a + b) / (a + b + c + d) |
| Row 2 | c | d | c + d |
| Column % | c / (a + c) | d / (b + d) | (c + d) / (a + b + c + d) |
| Column totals | a + c | b + d | a + b + c + d |
| Column % | (a + c) / (a + c) = 100% | (b + d) / (b + d) = 100% | (a + b + c + d) / (a + b + c + d) = 100% |
Notice that when computing column percentages, the denominators for cells a, b, c, d are determined by the column sums (here, a + c and b + d). This implies that the percentages in the "column totals" row must equal 100%.
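The same kind of worked example for column percentages, again with hypothetical cell counts. The only change from a row percentage calculation is the denominator, which is now the column sum.

```python
# Hypothetical cell counts for a 2x2 crosstab.
a, b, c, d = 30, 10, 20, 40
table = [[a, b],
         [c, d]]

# Column sums: a + c and b + d.
col_sums = [sum(col) for col in zip(*table)]

# Column percentages: each cell is divided by the sum of its own column.
col_pct = [[100 * cell / col_sums[j] for j, cell in enumerate(row)]
           for row in table]

for j, label in enumerate(["Column 1", "Column 2"]):
    column = [row[j] for row in col_pct]
    formatted = ", ".join(f"{p:.1f}%" for p in column)
    print(f"{label}: {formatted} (sums to {sum(column):.1f}%)")
```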
| | Column 1 | Column 2 | Row totals |
|---|---|---|---|
| Row 1 | a | b | a + b |
| % of total | a / (a + b + c + d) | b / (a + b + c + d) | (a + b) / (a + b + c + d) |
| Row 2 | c | d | c + d |
| % of total | c / (a + b + c + d) | d / (a + b + c + d) | (c + d) / (a + b + c + d) |
| Column totals | a + c | b + d | a + b + c + d |
| % of total | (a + c) / (a + b + c + d) | (b + d) / (a + b + c + d) | (a + b + c + d) / (a + b + c + d) = 100% |
Notice that when total percentages are computed, the denominators for all of the computations are equal to the total number of observations in the table, i.e. a + b + c + d.
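For total percentages, a short sketch with hypothetical cell counts shows that every cell shares one denominator, so the four cell percentages together add up to 100%.

```python
# Hypothetical cell counts for a 2x2 crosstab.
a, b, c, d = 30, 10, 20, 40
table = [[a, b],
         [c, d]]

# Grand total: a + b + c + d.
n = sum(sum(row) for row in table)

# Total percentages: every cell is divided by the grand total.
total_pct = [[100 * cell / n for cell in row] for row in table]

# The margins use the same denominator, e.g. (a + b) / n for Row 1.
row_margin_pct = [100 * sum(row) / n for row in table]

print(total_pct)
print(row_margin_pct)
print(sum(sum(row) for row in total_pct))
```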
Some universities in the United States require that freshmen live in the dorms (or on-campus) during their first year, with exceptions for students whose families live within a certain radius of campus. That is, certain freshmen whose families live close enough to campus are permitted to live off-campus. After the first year or two, students living in the dorms may choose to move into an off-campus apartment. How prevalent is this pattern?
In the sample dataset, there are several variables relating to this question:
Let's use different aspects of the Crosstabs procedure to investigate the relationship between class rank and living on campus.
Using the sample data, let's make a crosstab of the variables Rank and LiveOnCampus. Let the row variable be Rank, and the column variable be LiveOnCampus.
CROSSTABS
/TABLES=Rank BY LiveOnCampus
/FORMAT=AVALUE TABLES
/CELLS=COUNT
/COUNT ROUND CELL.
The Case Processing Summary tells us what proportion of the observations had nonmissing values for both Rank and LiveOnCampus. In this sample, there were 47 cases that had a missing value for Rank, LiveOnCampus, or for both Rank and LiveOnCampus.
The second table (here, Class Rank * Do you live on campus? Crosstabulation) contains the crosstab. We can quickly observe information about the interaction of these two variables:
Note the margins of the crosstab (i.e., the "total" row and column) give us the same information that we would get from frequency tables of Rank and LiveOnCampus, respectively:
Let's build on the table shown in Example 1 by adding row, column, and total percentages. For simplicity's sake, let's switch out the variable Rank (which has four categories) with the variable RankUpperUnder (which has two categories).
CROSSTABS
/TABLES=RankUpperUnder BY LiveOnCampus
/FORMAT=AVALUE TABLES
/CELLS=COUNT ROW COLUMN TOTAL
/COUNT ROUND CELL.
If the row variable is RankUpperUnder and the column variable is LiveOnCampus, then the row percentages will tell us what percentage of the upperclassmen or what percentage of the underclassmen live on campus. That is, variable RankUpperUnder will determine the denominator of the percentage computations.
If the row variable is RankUpperUnder and the column variable is LiveOnCampus, then the column percentages will tell us what percentage of the individuals who live on campus are upper or underclassmen. That is, variable LiveOnCampus will determine the denominator of the percentage computations.
If the row variable is RankUpperUnder and the column variable is LiveOnCampus, then the total percentage tells us what proportion of the total is within each combination of RankUpperUnder and LiveOnCampus. That is, the overall table size determines the denominator of the percentage computations.
Let's modify our analysis slightly by looking at whether the relationship between class rank and living on campus differs by state residency. Here, we will be working with three categorical variables: RankUpperUnder, LiveOnCampus, and State_Residency.
In this example, we want to create a crosstab of RankUpperUnder by LiveOnCampus, with variable State_Residency acting as a stratum, or layer variable.
CROSSTABS
/TABLES=RankUpperUnder BY LiveOnCampus BY State_Residency
/FORMAT=AVALUE TABLES
/CELLS=COUNT
/COUNT ROUND CELL.
Again, the Crosstabs output includes the boxes Case Processing Summary and the crosstabulation itself.
Notice that after including the layer variable State Residency, the number of valid cases we have to work with has dropped from 388 to 367. This is because the crosstab requires nonmissing values for all three variables: row, column, and layer.
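The drop in valid cases can be reproduced on a toy dataset. The sketch below uses made-up records (with `None` standing in for a missing value) and a hypothetical helper `n_valid`; it assumes a case is counted only when every variable in the table is nonmissing, as described above.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"rank": "Upper", "campus": "On",  "state": "In state"},
    {"rank": "Upper", "campus": "Off", "state": None},
    {"rank": None,    "campus": "On",  "state": "In state"},
    {"rank": "Under", "campus": "On",  "state": "Out of state"},
    {"rank": "Under", "campus": None,  "state": None},
]

def n_valid(records, variables):
    """Count cases with nonmissing values on every listed variable."""
    return sum(all(rec[v] is not None for v in variables) for rec in records)

two_way = n_valid(records, ["rank", "campus"])
three_way = n_valid(records, ["rank", "campus", "state"])

print(two_way, three_way)
```

The second record is valid for the two-way table but is dropped once the layer variable is added, because its state value is missing. Adding a variable can only keep the valid count the same or lower it.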
The layered crosstab shows the individual Rank by Campus tables within each level of State Residency. Some observations we can draw from this table include: