Our tutorials reference a dataset called "sample" in many examples. If you'd like to download the sample dataset to work through the examples, choose one of the files below:
To describe a single categorical variable, we use frequency tables. To describe the relationship between two categorical variables, we use a special type of table called a cross-tabulation (or "crosstab" for short). Consider the following sets of tables, both of which summarize the categorical variables "Gender" and "Athlete":
Frequency tables of variables Gender and Athlete
Crosstab of Gender and Athlete
In a cross-tabulation, the categories of one variable determine the rows of the table, and the categories of the other variable determine the columns. The cells of the table contain the number of times that a particular combination of categories occurred. The "edges" (or "margins") of the table typically contain the total number of observations for that category.
This type of table is also known as a:
In this tutorial, we'll cover how to create crosstabs using the SAS procedure PROC FREQ, and how to interpret the frequencies and proportions in these tables.
The dimensions of the crosstab refer to the number of rows and columns in the table (not including the row/column totals). The table dimensions are reported as as RxC, where R is the number of categories for the row variable, and C is the number of categories for the column variable.
Additionally, a "square" crosstab is one in which the row and column variables have the same number of categories. Tables of dimensions 2x2, 3x3, 4x4, etc. are all square crosstabs.
A typical 2x2 crosstab has the following construction:
Column 1 | Column 2 | Row totals | |
---|---|---|---|
Row 1 | a | b | a + b |
Row 2 | c | d | c + d |
Column totals | a + c | b + d | a + b + c + d |
The letters a, b, c, and d represent what are called cell counts.
By adding a, b, c, and d, we can determine the total number of observations in each category, and in the table overall.
The row sums and column sums are sometimes referred to as marginal frequencies. Note that if you were to make frequency tables for your row variable and your column variable, the frequency table should match the values for the row totals and column totals, respectively.
When you are describing the composition of your sample, it is often useful to refer to the proportion of the row or column that fell within a particular category. This can be achieved by computing the row percentages or column percentages.
Column 1 | Column 2 | Row totals | |
---|---|---|---|
Row 1 Row 1 % |
a a / (a + b) |
b b / (a + b) |
a + b (a + b) / (a+b) = 100% |
Row 2 Row 2 % |
c c / (c + d) |
d d / (c + d) |
c + d (c + d) / (c + d) = 100% |
Column totals % of total |
a + c (a + c) / (a + b + c + d) |
b + d (b + d) / (a + b + c + d) |
a + b + c + d (a + b + c + d) / (a + b + c + d) = 100% |
Notice that when computing row percentages, the denominators for cells a, b, c, d are determined by the row sums (here, a + b and c + d). This implies that the percentages in the "row totals" column must equal 100%.
Column 1 | Column 2 | Row totals | |
---|---|---|---|
Row 1 Column 1 % |
a a / (a + c) |
b b / (b + d) |
a + b (a + b) / (a + b + c + d) |
Row 2 Column 2 % |
c c / (a + c) |
d d / (b + d) |
c + d (c + d) / (a + b + c + d) |
Column totals Percentage % |
a + c (a + c) / (a + c) = 100% |
b + d (b + d) / (b + d) = 100% |
a + b + c + d (a + b + c + d) / (a + b + c + d) = 100% |
Notice that when computing column percentages, the denominators for cells a, b, c, d are determined by the column sums (here, a + c and b + d). This implies that the percentages in the "column totals" row must equal 100%.
Column 1 | Column 2 | Row totals | |
---|---|---|---|
Row 1 % of total |
a a / (a + b + c + d) |
b b / (a + b + c + d) |
a + b (a + b) / (a + b + c + d) |
Row 2 % of total |
c c / (a + b + c + d) |
d d / (a + b + c + d) |
c + d (c + d) / (a + b + c + d) |
Column totals % of total |
a + c (a + c) / (a + b + c + d) |
b + d (b + d) / (a + b + c + d) |
a + b + c + d (a + b + c + d) / (a + b + c + d) = 100% |
Notice that when total percentages are computed, the denominators for all of the computations are equal to the total number of observations in the table, i.e. a + b + c + d.
Your data must meet the following requirements:
Note that the choice of row/column variable is often dictated by space requirements or interpretation of the results. If your particular set of variables has what could be considered "independent" and "dependent" variables, it is conventional to put the "independent" variable as the column variable, and the "dependent" variable as the row variable. However, if you plan to compute relative risk, it is conventional to put the "independent" variable as the row and the "dependent" variable as the column variable.
Your dataset should have the following structure:
For crosstabs, the basic syntax of the FREQ procedure is:
PROC FREQ DATA=dataset <options>;
TABLES RowVar*ColVar / <options>;
RUN;
In the first line, PROC FREQ
tells SAS to execute the FREQ procedure on the dataset given in the DATA=
argument. If desired, additional options you can include on this line are:
NLEVELS
Adds a table to the output summarizing the number of levels (categories) for each variable named in the TABLES
statement.
ORDER
=data
ORDER
=freq
On the next line, the TABLES
statement is where you put pairs of variables you want to produce crosstabs for. To create a basic cross-tab between two variables A and B, place an asterisk (*) between the names of the variables in the TABLES statement. You can list as many variables or variable pairs as you want, with each variable or variable pair separated by a space. This is the minimum that is required to produce a crosstab using PROC FREQ, but there are several important analysis options to be aware of, which you can add on this line after a slash (/) character:
PLOTS=FREQPLOT
PLOTS=MOSAICPLOT
MISSING
MISSPRINT
NOROW
, NOCOL
, and NOPERCENT
PLOTS=FREQPLOT
PLOTS=MOSAICPLOT
Some universities in the United States require that freshmen live in the on-campus dormitories during their first year, with exceptions for students whose families live within a certain radius of campus. That is, certain freshmen whose families live close enough to campus are permitted to live off-campus. After completing their first or second year of school, students living in the dorms may choose to move into an off-campus apartment. How prevalent is this pattern?
In the sample dataset, there are several variables relating to this question:
Let's use different aspects of PROC FREQ to investigate the relationship between class rank and living on campus.
Using the sample data, let's make crosstab of the variables Rank and LiveOnCampus. Let the row variable be Rank, and the column variable be LiveOnCampus.
PROC FREQ DATA=work.sample;
TABLE Rank*LiveOnCampus;
RUN;
In this syntax:
Rank*LiveOnCampus
will create a crosstab of variable Rank (as the row variable) against LiveOnCampus (as the column variable). The table will include all of the default output for crosstabs.The first table contains the actual crosstab:
Notice the square to the left of the table: it contains the legend for how to read the cells of this crosstab. The legend here tells us that, in this example, each cell of the table has 4 numbers:
From this table, we can make several observations:
Note the margins of the crosstab (i.e., the "total" row and column) give us the same information that we would get from frequency tables of Rank and LiveOnCampus, respectively:
Lastly, the outermost row of the table shows the total number of cases with missing values for either Rank, LiveOnCampus, or both (47).
Let's delve into the proportions from the previous table. Although the default table already contains all three types of proportions, it's a little overwhelming to see all the information at once, so let's use the NOROW
, NOCOL
, and NOPERCENT
options to limit what type of percentages we see.
If the row variable is Rank and the column variable is LiveOnCampus, then the row percentages will tell us what percentage of the freshmen, sophomores, juniors, and seniors live on campus. That is, variable Rank will determine the denominator of the percentage computations.
/*Show row proportions only - suppress column and total proportions*/
PROC FREQ DATA=work.sample;
TABLE Rank*LiveOnCampus / NOCOL NOPERCENT;
RUN;
If the row variable is Rank and the column variable is LiveOnCampus, then the column percentages will tell us what percentage of the individuals who live on campus are freshmen, sophomores, juniors, or seniors. That is, variable LiveOnCampus will determine the denominator of the percentage computations.
/*Show column proportions only - suppress row and total proportions*/
PROC FREQ DATA=work.sample;
TABLE Rank*LiveOnCampus / NOROW NOPERCENT;
RUN;
If the row variable is Rank and the column variable is LiveOnCampus, then the total percentage tells us what proportion of the total is within each combination of Rank and LiveOnCampus. That is, the overall table size determines the denominator of the percentage computations.
/*Show overall proportions only - suppress row and column proportions*/
PROC FREQ DATA=work.sample;
TABLE Rank*LiveOnCampus / NOROW NOCOL;
RUN;
A full list of options for the FREQ procedure can be found in the SAS Help and Documentation guide.