Our tutorials reference a dataset called "sample" in many examples, including the examples in this tutorial.
Categorical variables can be summarized using a frequency table, which shows the number and percentage of cases observed for each category of a variable. In this tutorial, we will show how to use the SAS procedure PROC FREQ to create frequency tables that summarize individual categorical variables.
The FREQ procedure prints all values of a given categorical variable in the Output window, along with the counts and proportions. The FREQ procedure works with both string (character) and numeric categorical variables.
The basic syntax of the FREQ procedure is:
PROC FREQ DATA=dataset <options>;
TABLES variable(s);
RUN;
* Alternatively, if you will be using any of the analysis options available in the TABLES statement:;
PROC FREQ DATA=dataset <options>;
TABLES variable(s) / <options>;
RUN;
In the first line, PROC FREQ tells SAS to execute the FREQ procedure on the dataset given in the DATA= argument. If desired, you can include additional options on this line:
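NLEVELS
Adds a table to the output summarizing the number of levels (categories) for each variable named in the TABLES statement.
ORDER=data
Sorts the rows of the frequency table in the order the values first appear in the dataset.
ORDER=freq
Sorts the rows of the frequency table from most frequent to least frequent.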
On the next line, the TABLES statement is where you put the names of the variables you want to produce a frequency table for. (Note that SAS will recognize both TABLE and TABLES.) You can list as many variables as you want, with each variable separated by a space. If the TABLES statement is not included, then SAS will generate a table for every variable in the dataset. This is all that is required to produce basic frequency tables, but there are many useful analysis enhancements that can be added on this line after a slash (/) character:
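PLOTS=FREQPLOT
Adds a bar chart of the frequencies for each variable to the output (requires ODS Graphics to be enabled).
BINOMIAL
Requests the binomial proportion, its confidence limits, and a test of the proportion for one-way tables.
MISSING
Treats missing values as a valid category: they appear as a row in the table and are included in the percentages.
MISSPRINT
Displays the frequency of missing values as a row in the table, but leaves them out of the percentage calculations.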
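As a minimal sketch of how these TABLES options are written in practice, the step below requests a frequency bar chart for State and prints the missing-value count as a row without changing the percentages. It assumes the sample dataset is available in the WORK library under the name sample; the particular combination of options is only an illustration.

```sas
/* PLOTS= needs ODS Graphics, so turn it on for this step */
ODS GRAPHICS ON;

PROC FREQ DATA=sample;
    /* Options for the TABLES statement go after the slash */
    TABLES State / PLOTS=FREQPLOT MISSPRINT;
RUN;

ODS GRAPHICS OFF;
```

Several options can be listed after the slash, separated by spaces; they apply to every variable named in that TABLES statement.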
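Sometimes, your dataset may contain a “count” variable: each row represents a category rather than an individual case, and a separate variable holds the number of cases in that category. In this case, the WEIGHT statement specifies which variable acts as the frequency variable. This statement is given after the TABLES statement.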
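The following sketch shows one way the WEIGHT statement might be used with pre-aggregated data. The dataset rank_counts, the variable name Count, and the counts themselves are hypothetical and are made up purely for illustration; they are not taken from the sample dataset.

```sas
/* Hypothetical aggregated data: one row per category, plus a count */
DATA rank_counts;
    INPUT Rank Count;
    DATALINES;
1 40
2 25
3 30
4 20
;
RUN;

PROC FREQ DATA=rank_counts;
    TABLES Rank;
    /* Count holds the number of cases that each row represents */
    WEIGHT Count;
RUN;
```

Without the WEIGHT statement, each of the four rows would be counted as a single case, so every category would show a frequency of 1.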
Recall that in our sample dataset, the variable State is a nominal categorical variable (representing whether the student is an in-state or out-of-state student), while variable Rank is an ordinal categorical variable (representing the student's class rank).
Recall also that State is a string variable, and Rank is a numeric variable. This simply means that the observations for Rank were recorded as numbers (with value labels applied later), while the observations for State were recorded as characters (strings). This example will show that PROC FREQ works for both types of variables.
Create frequency tables for the variables State and Rank.
PROC FREQ DATA=sample;
TABLE State Rank;
RUN;
PROC FREQ creates one frequency table per variable. Each frequency table contains four columns of summary measures: Frequency, Percent, Cumulative Frequency, and Cumulative Percent.
If there are cases with missing values for the variable, the number of missing values is given below the table.
If you do not specify an ORDER option in the PROC FREQ statement, the frequency table will be sorted by the values of the categories. This means that for string variables, the categories will be ordered alphabetically, while for numeric variables, the categories will be ordered from the smallest to the largest number code.
Suppose we want to create frequency tables, but we'd also like to know the number of distinct categories for each variable. (This can be useful if you have a variable with many categories, where the number of rows in the frequency table can't easily be determined by a quick scan.)
To do this, we add the NLEVELS option to the PROC FREQ statement.
PROC FREQ DATA=sample NLEVELS;
TABLE State Rank;
RUN;
Adding the NLEVELS option to our syntax will add one new table to the output, right at the beginning:
The new table shows the number of "nonmissing levels" (i.e., observed categories) and the number of "missing levels" (i.e., distinct missing value codes present in the data) for each variable. For variable State, there are two nonmissing levels; for variable Rank, there are four nonmissing levels. These are easily confirmed by scanning the rows of the corresponding frequency tables.
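If only the number of levels is of interest, a common approach is to pair NLEVELS with the NOPRINT option in the TABLES statement, which suppresses the frequency tables themselves. The sketch below assumes the same sample dataset as the examples above.

```sas
PROC FREQ DATA=sample NLEVELS;
    /* NOPRINT hides the frequency tables; the levels table is still shown */
    TABLES State Rank / NOPRINT;
RUN;
```

This can be handy for variables with a very large number of categories, where the full frequency table would be long.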
SAS normally orders the rows of the frequency table based on the order of the category values. In some cases, we may wish to sort the rows of the frequency table by descending counts. This makes it much easier to determine which categories occur most frequently.
PROC FREQ DATA=sample ORDER=freq;
TABLE State Rank;
RUN;
The ORDER=freq option in the first line of the syntax tells SAS to order the rows of each table by descending frequency, so the most common category appears first.
With the rows of the frequency tables ordered by relative frequency, it's much easier to tell which categories are the most common. For variable State, there are many more in-state students than out-of-state students. For variable Rank, the most common group is Freshmen (category 1), followed by Juniors (category 3).
When including a frequency table in a write-up or report, it's usually preferable to order tables for nominal categorical variables by frequency. However, for ordinal categorical variables, it usually makes more sense to order the table with respect to the level of the categories.
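Because ORDER is an option of the PROC FREQ statement, it applies to every table the step produces. One way to follow the advice above, sketched here under the assumption that State is treated as nominal and Rank as ordinal, is to run two separate steps:

```sas
/* Nominal variable: order the rows by descending frequency */
PROC FREQ DATA=sample ORDER=FREQ;
    TABLES State;
RUN;

/* Ordinal variable: keep the default ordering by category value */
PROC FREQ DATA=sample;
    TABLES Rank;
RUN;
```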
Notice that in the previous tables, the counts are based only on the number of nonmissing observations. The number of missing values is printed below the table, but the number in the last row of the "cumulative frequency" column is the total number of nonmissing values for the variable. A consequence of this is that the proportions in the table represent the proportion of nonmissing cases. What if we instead want the proportions to be based on the total number of cases (i.e., the number of nonmissing values plus the number of missing values)?
To do this, we can add the MISSING option to the TABLE statement:
PROC FREQ DATA=sample;
TABLE State Rank / MISSING;
RUN;
The MISSING option appearing after the slash (/) in the TABLE statement tells SAS to include the missing values as a row in the table.
After adding the MISSING option, notice that the first row of the table is now the number of missing values. Since variable State is a string variable, the row has a blank label; and since variable Rank is a numeric variable, the row has a "." label.
If we compare the proportions in this table to the ones in the previous examples, we can see that the proportions have changed. It is also easier to see that approximately 6% of the responses are missing for both State and Rank (before, we only saw the number of missing responses for those variables).
Note: If you specify ORDER=FREQ in the PROC FREQ statement and include the MISSING option in the TABLES statement, the missing values will always appear as the first row of the table, even if they aren't the most frequently occurring category. The ORDER option will affect the ordering of the nonmissing categories that appear after the missing category.
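The combination described in the note can be tried with a step like the following, a minimal sketch that again assumes the sample dataset used throughout this tutorial:

```sas
PROC FREQ DATA=sample ORDER=FREQ;
    /* The missing category is listed first; the remaining rows are sorted by frequency */
    TABLES State Rank / MISSING;
RUN;
```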