When working with a dataset, you may come across variables that have empty values or that use special characters to denote missing data.
Missing values are instances where no information has been recorded for a given case and variable. "Missingness" can happen for many reasons. For example, some values may be missing due to error or accident, while others may be intentionally left missing.
Handling missing data is an important part of the data analysis process because missing data can affect our statistical estimates. In this tutorial, we'll go over how SAS represents missing values in data sets, and how different SAS procedures treat missing data.
SAS stores, displays, and represents missing values differently depending on the variable type.
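As a minimal sketch of that difference: SAS shows a numeric missing value as a period and a character missing value as a blank. The dataset name demo and its variables are made up for illustration.

```sas
/* Hypothetical dataset: one numeric and one character variable */
data demo;
    length name $ 10;
    name = 'Ann'; score = 85; output;
    name = 'Bob'; score = .;  output;   /* numeric missing: a period */
    name = ' ';   score = 72; output;   /* character missing: a blank */
run;

/* The MISSING function works for both variable types (1 = missing, 0 = not) */
data demo_flags;
    set demo;
    name_is_missing  = missing(name);
    score_is_missing = missing(score);
run;

proc print data=demo_flags;
run;
```

In the printed output, the missing score should appear as a period and the missing name as an empty cell.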
It is important to understand how SAS handles missing values when you execute statements. Depending on the statements being used, SAS might handle missing values in different ways. For example, it might treat a missing value as the lowest possible value (e.g., frequency tables in PROC FREQ), or it might omit the value from the computation (e.g., regression).
In comparisons and sorting, SAS treats numeric missing values as smaller than any number. Most of the time this will not affect you, but it matters whenever a condition compares a numeric variable against a cutoff: an expression such as x < 10 is true when x is missing. In general, if you are subsetting data or doing any kind of conditional logic based on continuous numeric values, you should explicitly tell SAS how to handle missing values first.
The SAS documentation for each procedure describes how missing values are treated by the statements you are executing.
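The following sketch shows how a cutoff comparison can silently misclassify a missing value, and one way to guard against it. The dataset, the variable names, and the cutoff of 60 are all invented for the example.

```sas
/* Hypothetical data: case 2 has a missing Math score */
data scores;
    input ID Math;
    datalines;
1 79.8
2 .
3 45.0
;
run;

data flagged;
    set scores;
    length naive checked $ 8;

    /* Naive: a missing value compares as smaller than any number,
       so "Math < 60" is true for case 2 and it is labeled Low */
    if Math < 60 then naive = 'Low';
    else naive = 'High';

    /* Safer: handle missing values before comparing */
    if missing(Math) then checked = 'Unknown';
    else if Math < 60 then checked = 'Low';
    else checked = 'High';
run;

proc print data=flagged;
run;
```

Checking for missing values first keeps them out of both the Low and High groups, so case 2 is labeled Unknown rather than Low.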
When importing data from an external file, SAS automatically recognizes blank cells as missing values. You do not need to enter a period character in your external file's blank cells.
An important concept when analyzing data with missing values is the distinction between listwise deletion and pairwise deletion. (You might also see the terms listwise exclusion and pairwise exclusion used to describe these concepts.) Listwise and pairwise deletion/exclusion refer to how missing data is handled when there are two or more variables under consideration.
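Recall that, for a set of n variables, there are $$ \frac{n \cdot (n-1)}{2} $$ possible pairs of variables among them. This means that for a set of two variables, there is only one pairwise comparison; when there are three variables, there are three pairwise comparisons; for a set of four variables, there are six pairwise comparisons; and so on. So in this context, listwise deletion means that a case is dropped from the entire analysis if it is missing a value on any of the variables under consideration, while pairwise deletion means that a case is dropped only from the pairwise comparisons that involve a variable on which it has a missing value.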
To illustrate the difference between listwise and pairwise deletion, consider the following data table, where blank cells indicate missing values:
ID | English | Reading | Math |
---|---|---|---|
1 | 75.6 | 70.3 | 79.8 |
2 | 79.8 | 91.5 | |
3 | 82.6 | 81.3 | |
4 | 46.0 | | |
5 | 90.2 | 95.6 | |
6 | 92.5 | 99.8 | 89.9 |
7 | 67.1 | 67.2 | |
8 | | | |
9 | 76.3 | 70.1 | 85.2 |
10 | 84.1 | 85.6 | 71.3 |
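This dataset has ten cases on four variables: an ID and three test scores. In reviewing the table for missingness, we see that four cases (rows 1, 6, 9, and 10) have scores on all three tests, one case (row 8) is missing all three scores, and the remaining cases are each missing at least one score.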
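One way to review missingness without scanning the table by eye is to ask PROC MEANS for the nonmissing and missing counts of each variable. This sketch assumes the blank cells fall in the columns shown in the table above; the dataset name scores is made up, and blanks are typed as periods because list input requires it.

```sas
/* Example table entered by hand; periods mark the blank cells
   (their column positions are an assumption based on the table as displayed) */
data scores;
    input ID English Reading Math;
    datalines;
1 75.6 70.3 79.8
2 79.8 91.5 .
3 82.6 81.3 .
4 46.0 . .
5 90.2 95.6 .
6 92.5 99.8 89.9
7 67.1 67.2 .
8 . . .
9 76.3 70.1 85.2
10 84.1 85.6 71.3
;
run;

/* N = number of nonmissing values, NMISS = number of missing values */
proc means data=scores n nmiss;
    var English Reading Math;
run;
```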
Suppose we wanted to look at the correlation between English, Reading, and Math. Since there are three (3) variables under consideration, we can use the above formula to determine that there are $$ \frac{3 \cdot (3-1)}{2} = 3 $$ possible pairs: (English & Reading), (English & Math), and (Reading & Math).
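The correlation between a given pair of variables can only be computed using cases with nonmissing values for both variables in the pair. If all available data for a given pair is used, each correlation is computed from whichever cases are complete on that pair, so the number of cases can differ from one pair to the next.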
This is an example of pairwise deletion.
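A sketch of pairwise deletion in practice: by default, PROC CORR uses every case that is complete on a given pair. As before, the dataset name scores is hypothetical, and the positions of the missing values are assumed to match the table as displayed.

```sas
/* Example table; periods mark the blank cells (positions assumed from the table) */
data scores;
    input ID English Reading Math;
    datalines;
1 75.6 70.3 79.8
2 79.8 91.5 .
3 82.6 81.3 .
4 46.0 . .
5 90.2 95.6 .
6 92.5 99.8 89.9
7 67.1 67.2 .
8 . . .
9 76.3 70.1 85.2
10 84.1 85.6 71.3
;
run;

/* Default behavior: pairwise deletion */
proc corr data=scores;
    var English Reading Math;
run;
```

When the number of usable cases differs between pairs, the correlation table reports the number of observations behind each coefficient.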
Now suppose we wanted to fit a multiple linear regression model, where the dependent variable is Math score and the independent variables are English and Reading scores. Multiple linear regression can only make use of cases with nonmissing values for all the variables included in the model. This is an example of listwise deletion: If a given case is missing even one value out of the three, it cannot be used to fit the regression model, and will be dropped from the analysis. Stated differently, a case must have nonmissing values on Math and English and Reading to be used for this analysis. In this example dataset, applying listwise deletion would leave us with only 4 usable cases (rows 1, 6, 9, 10).
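Listwise deletion can be sketched in three ways: by keeping only complete cases yourself, by letting PROC REG drop incomplete cases, or by requesting it in PROC CORR with the NOMISS option. The dataset names are hypothetical, and the positions of the missing values are again assumed from the table as displayed.

```sas
/* Example table; periods mark the blank cells (positions assumed from the table) */
data scores;
    input ID English Reading Math;
    datalines;
1 75.6 70.3 79.8
2 79.8 91.5 .
3 82.6 81.3 .
4 46.0 . .
5 90.2 95.6 .
6 92.5 99.8 89.9
7 67.1 67.2 .
8 . . .
9 76.3 70.1 85.2
10 84.1 85.6 71.3
;
run;

/* Manual listwise deletion: keep cases with no missing scores */
data complete_cases;
    set scores;
    if nmiss(English, Reading, Math) = 0;
run;

/* PROC REG drops any case missing a value on a model variable */
proc reg data=scores;
    model Math = English Reading;
run;
quit;

/* NOMISS makes PROC CORR use listwise instead of pairwise deletion */
proc corr data=scores nomiss;
    var English Reading Math;
run;
```

The PROC REG output lists how many observations were read and how many were used, which is a quick check on how much data listwise deletion removed.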
You can imagine that using listwise deletion when there are many variables under consideration can greatly reduce your usable sample size. Thus, it is important to understand how the SAS procedures you choose will handle missing data, and to make an informed decision about which type of deletion/exclusion is appropriate. Some procedures default to pairwise deletion, such as PROC CORR for correlation and PROC TTEST for independent samples and paired samples t-tests, while other procedures, such as those used for regression and ANOVA, can only use listwise deletion.