Our tutorials reference a dataset called "sample" in many examples. If you'd like to download the sample dataset to work through the examples, choose one of the files below:
Sometimes you want to change the structure of your dataset so that observations are ordered according to one or more variables, called sorting. Data is easily sorted by one or more variables with a procedure called PROC SORT
. You can sort data by both numeric and character variables. The general format of the sort procedure is:
PROC SORT <options>;
BY var;
RUN;
In the syntax above, PROC
is the keyword that starts the proc step and SORT
is the name of the procedure. Immediately following PROC SORT is where you put any options you want to include. Let’s review some of the more common options:
DATA
= Dataset-Name
NODUPLICATES
NODUPLICATES
is used without NODUPKEY
, then records are only considered duplicates if they have identical values for every variable.NODUPKEY
OUT
= New-Dataset-Name
In the next line is the BY
statement, where you tell SAS what variable(s) to sort the data on, and what order to do the sorting in.
DESCENDING
before the variable name that you want the dataset to be sorted on.
BY
keyword should be separated by a space, and should be listed in the order you want SAS to order by.The RUN
statement is placed at the end of the block and tells SAS to execute the code.
In the sample dataset, we have information about gender and birthday (but not necessarily age) for the subjects in the sample. How can we sort each gender from youngest (most recent birthdate) to oldest (least recent birthdate)?
The birthday variable (bday) in the sample dataset is a date variable. Recall that date variables are a special type of numeric variable; therefore, date variables will follow the same sorting rules as regular numeric variables. This means that when date variables are sorted in ascending order, missing values will come first, and then the dates will be sorted from oldest to newest. Conversely, if a date variable is sorted in descending order, the newest (most recent) dates will come first.
(Why is this? Recall that dates in SAS are internally measured as the amount of time that has elapsed since the reference date (January 1, 1960). This implies that more recent dates will technically be stored as larger numbers.)
PROC SORT data=sample;
BY gender descending bday;
RUN;
After sorting, your data will look similar to this:
The data is sorted first by gender, in ascending order. Notice how missing gender values appear first, then 0 (coded for male). Within each gender, the data is then sorted in descending order by birth date. Among the cases with missing values for gender, we can see that SAS recognizes that November 29, 1995 > July 28, 1995 > April 7, 1994. In rows 10 through 12, we see the 3 "largest" (most recent) birthdates for males: January 2, 1996 > December 25, 1995 > December 12, 1995.