
SAS Tutorials: Missing Values

This SAS software tutorial describes what missing values are, and how SAS handles missing values for string and numeric variables.

What are missing values?

When working with a dataset, you may come across cases where some variables have empty values, or use special characters to denote that a value is absent.

Missing values are instances where no information has been recorded for a given case and variable. "Missingness" can happen for many reasons. For example, some values may be missing due to error or accident, while others may be intentionally left missing.

Handling missing data is an important part of the data analysis process because missing data can affect our statistical estimates. In this tutorial, we'll go over how SAS represents missing values in data sets, and how different SAS procedures treat missing data.

Missing Values in SAS Datasets

SAS stores, displays, and represents missing values differently depending on the variable type.

  • Character variables: Missing values for character variables appear as blanks.
  • Numeric variables: Missing values for numeric variables (including date variables) appear as a period.

Screenshot of VIEWTABLE view in SAS. Missing values for numeric variables are shown as periods, while missing values for character variables are shown as blank cells.
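As a minimal sketch of the two representations, the DATA step below builds a small table with one missing character value and one missing numeric value, then flags them with the MISSING function. The dataset and variable names (demo, name, age) are hypothetical and chosen only for illustration. With list input, a lone period in the data lines is read as a missing value for both variable types.

```sas
/* Hypothetical example data; names and values are illustrative only */
data demo;
    input name $ age;
    datalines;
Ann 34
Bob .
. 51
;
run;

/* MISSING() returns 1 for a missing value and 0 otherwise,
   and accepts both character and numeric arguments */
data demo_flags;
    set demo;
    name_missing = missing(name);
    age_missing  = missing(age);
run;

proc print data=demo_flags;
run;
```

In the printed output, the missing name should appear as a blank and the missing age as a period.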

It is important to understand how SAS handles missing values when you execute statements. Depending on the statements being used, SAS might handle missing values in different ways. For example, it might treat a missing value as the lowest possible value (e.g., when sorting with PROC SORT or evaluating a comparison such as x < 10), or it might omit the value from the computation (e.g., regression).

Internally, SAS treats a numeric missing value as smaller than any number when sorting or comparing values. Most of the time, this will not affect you. However, if you are subsetting data or writing any kind of conditional logic based on continuous numeric values, you should always explicitly tell SAS how to handle missing values first, because a condition such as score < 60 is also true when score is missing.
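The sketch below illustrates why this matters for conditional logic. It uses a hypothetical dataset (flagged) with a made-up cutoff of 60; neither comes from a real analysis. The naive rule flags the missing score as "low" because a missing value compares as smaller than 60, while the explicit rule checks for missingness first.

```sas
/* Hypothetical example: three cases, one with a missing score */
data flagged;
    input id score;

    /* Naive rule: a missing score compares as lower than any number,
       so the case with a missing score is also flagged as low */
    naive_low = (score < 60);

    /* Explicit rule: handle missing values first, then compare */
    if missing(score) then safe_low = .;
    else safe_low = (score < 60);

    datalines;
1 72
2 55
3 .
;
run;

proc print data=flagged;
run;
```

For the case with the missing score, naive_low should be 1 while safe_low stays missing.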

The SAS documentation for each statement or procedure describes how missing values are treated when it executes.

When importing data from an external file, SAS automatically recognizes blank cells as missing values. You do not need to enter a period character in your external file's blank cells.

Listwise versus Pairwise Missing Values

An important concept when analyzing data with missing values is the distinction between listwise deletion and pairwise deletion. (You might also see the terms listwise exclusion and pairwise exclusion used to describe these concepts.) Listwise and pairwise deletion/exclusion refer to how missing data is handled when there are two or more variables under consideration.

Recall that, for a set of n variables, there are $$ \frac{n \cdot (n-1)}{2} $$ possible pairs of variables among them. This means that for a set of two variables, there is only one pairwise comparison; when there are three variables, there are three pairwise comparisons; for a set of four variables, there are six pairwise comparisons; and so on. So in this context:

  • In listwise deletion, the statistic(s) computed for each pair of variables will only use cases with nonmissing values on all analysis variables. This will mean that there is only one sample size for the analysis, and the same set of cases will be used for all pairwise comparisons.
  • In pairwise deletion, the statistic(s) computed for a given pair of variables will use all cases with nonmissing values on the pair, regardless of missingness on the other variables in the set. If there are three or more variables under consideration, this will mean that there are (potentially) different sets of cases and sample sizes being used for each comparison.

To illustrate the difference between listwise and pairwise deletion, consider the following data table, where blank cells indicate missing values:

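| ID | English | Reading | Math |
|----|---------|---------|------|
| 1  | 75.6    | 70.3    | 79.8 |
| 2  | 79.8    |         | 91.5 |
| 3  | 82.6    | 81.3    |      |
| 4  |         |         | 46.0 |
| 5  |         | 90.2    | 95.6 |
| 6  | 92.5    | 99.8    | 89.9 |
| 7  | 67.1    |         | 67.2 |
| 8  |         |         |      |
| 9  | 76.3    | 70.1    | 85.2 |
| 10 | 84.1    | 85.6    | 71.3 |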

This dataset has ten cases on four variables. In reviewing the table for missingness, we see that:

  • ID has no missing values.
  • Missing values for English are in rows 4, 5, 8. English has 3 missing values.
  • Missing values for Reading are in rows 2, 4, 7, 8. Reading has 4 missing values.
  • Missing values for Math are in rows 3 and 8. Math has 2 missing values.
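These counts can be checked in SAS. The sketch below enters the example table as a dataset named scores (the name is an assumption) and asks PROC MEANS for the number of nonmissing (N) and missing (NMISS) values per variable. Because list input uses blanks as delimiters, each blank cell is typed as a period in the data lines.

```sas
/* Example table from the text; periods stand in for the blank cells */
data scores;
    input ID English Reading Math;
    datalines;
1 75.6 70.3 79.8
2 79.8 . 91.5
3 82.6 81.3 .
4 . . 46.0
5 . 90.2 95.6
6 92.5 99.8 89.9
7 67.1 . 67.2
8 . . .
9 76.3 70.1 85.2
10 84.1 85.6 71.3
;
run;

/* N = count of nonmissing values, NMISS = count of missing values */
proc means data=scores n nmiss;
    var English Reading Math;
run;
```

The NMISS column should match the counts listed above: 3 for English, 4 for Reading, and 2 for Math.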

Suppose we wanted to look at the correlation between English, Reading, and Math. Since there are three (3) variables under consideration, we can use the above formula to determine that there are $$ \frac{3 \cdot (3-1)}{2} = 3 $$ possible pairs: (English & Reading), (English & Math), and (Reading & Math).

The correlation between a given pair of variables can only be computed using cases with nonmissing values for both variables in the pair. If all available data for a given pair is used, we would have:

  • English and Reading: 5 usable cases (rows 1, 3, 6, 9, 10)
  • English and Math: 6 usable cases (rows 1, 2, 6, 7, 9, 10)
  • Reading and Math: 5 usable cases (rows 1, 5, 6, 9, 10)

This is an example of pairwise deletion.
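A minimal sketch of this in SAS, assuming the example table is entered as a dataset named scores (the name is an assumption, and periods stand in for the blank cells): PROC CORR with no extra options computes each correlation from the cases available for that pair.

```sas
/* Example table from the text; periods stand in for the blank cells */
data scores;
    input ID English Reading Math;
    datalines;
1 75.6 70.3 79.8
2 79.8 . 91.5
3 82.6 81.3 .
4 . . 46.0
5 . 90.2 95.6
6 92.5 99.8 89.9
7 67.1 . 67.2
8 . . .
9 76.3 70.1 85.2
10 84.1 85.6 71.3
;
run;

/* Default behavior: each pair uses every case with nonmissing values on both variables */
proc corr data=scores;
    var English Reading Math;
run;
```

When the sample sizes differ across pairs, the correlation table reports the number of observations used for each coefficient, which should match the 5, 6, and 5 usable cases counted above.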

Now suppose we wanted to fit a multiple linear regression model, where the dependent variable is Math score and the independent variables are English and Reading scores. Multiple linear regression can only make use of cases with nonmissing values for all the variables included in the model. This is an example of listwise deletion: If a given case is missing even one value out of the three, it cannot be used to fit the regression model, and will be dropped from the analysis. Stated differently, a case must have nonmissing values on Math and English and Reading to be used for this analysis. In this example dataset, applying listwise deletion would leave us with only 4 usable cases (rows 1, 6, 9, 10).
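The regression described above can be sketched as follows, again assuming the example table is entered as a dataset named scores (the name is an assumption, and periods stand in for the blank cells). The second step is an optional comparison: the NOMISS option asks PROC CORR to use listwise deletion as well, so its correlations are based on the same complete cases.

```sas
/* Example table from the text; periods stand in for the blank cells */
data scores;
    input ID English Reading Math;
    datalines;
1 75.6 70.3 79.8
2 79.8 . 91.5
3 82.6 81.3 .
4 . . 46.0
5 . 90.2 95.6
6 92.5 99.8 89.9
7 67.1 . 67.2
8 . . .
9 76.3 70.1 85.2
10 84.1 85.6 71.3
;
run;

/* Regression drops any case with a missing value on Math, English, or Reading */
proc reg data=scores;
    model Math = English Reading;
run;
quit;

/* NOMISS switches PROC CORR from pairwise to listwise deletion */
proc corr data=scores nomiss;
    var English Reading Math;
run;
```

The PROC REG output should report 10 observations read and 4 observations used. With so few complete cases, the fitted model is only an illustration of how cases are dropped, not a meaningful analysis.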

Using listwise deletion when there are many variables under consideration can greatly reduce your usable sample size. Thus, it is important to understand how the SAS procedures you use handle missing data, and to make an informed decision about which type of deletion/exclusion is appropriate. Some procedures default to pairwise deletion, such as PROC CORR for correlation and PROC TTEST for independent samples and paired samples t-tests. Other procedures, such as those used for regression and ANOVA, can only use listwise deletion.