Our tutorials reference a dataset called "sample" in many examples. If you'd like to download the sample dataset to work through the examples, choose one of the files below:
Recall that SAS programs consist of two main blocks of code: the data step and the procedure (proc) step. The data step is where data is created, imported, modified, merged, or calculated.
The data step has the following general form:
DATA Dataset-Name (OPTIONS);
.
.
.
RUN;
In the general form above, DATA is the keyword that starts the data step, meaning that it tells SAS to create a dataset. Dataset-Name is the name of the dataset that you want to create or manipulate. If you want to add any of the dataset options (see below), they go in the parentheses after the dataset name. In between the first and last lines are the statements that create and manipulate the dataset. Note that the data step ends with a RUN statement, which, like every SAS statement, is terminated by a semicolon.
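As a concrete illustration of this general form, here is a minimal sketch of a data step that creates a small dataset from values typed directly into the program. The dataset name demo and the variables id, height, and weight are hypothetical and are not part of the sample dataset.

```sas
/* Minimal sketch: dataset and variable names are hypothetical */
DATA demo;                      /* start the data step and name the output dataset */
    INPUT id height weight;     /* read three numeric variables from each data line */
    DATALINES;
1 65 150
2 70 180
3 62 120
;
RUN;                            /* end the data step */
```

The statements between DATA and RUN are what fill the dataset; here they read three rows of in-line data.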
When you need to copy or modify an existing dataset, use the SET statement in the data step. In general, the code will follow this form:
DATA New-Dataset-Name (OPTIONS);
SET Existing-Dataset-Name (OPTIONS);
.
.
.
RUN;
The statements above tell SAS to create a new dataset (New-Dataset-Name) by reading in an existing SAS dataset (Existing-Dataset-Name). Any statements placed between SET and RUN are applied to the new dataset only, which allows you to create new variables or recode existing variables without changing the original data. (It is strongly recommended that you do not alter your original data files.)
A data step containing only the SET statement will create an exact copy of the dataset. For example, the program
DATA new_sample;
SET sample;
RUN;
creates a new temporary dataset called new_sample that is a clone of the already existing dataset called sample. You might use code like this when you want to copy a dataset from the temporary library to a permanent library or vice versa.
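The copy between libraries mentioned above might look like the following sketch. The library reference mylib and the folder path are placeholders, so substitute a folder that exists on your system. WORK is the name SAS gives to the temporary library.

```sas
/* Assumption: 'C:\my_sas_data' is a placeholder path; replace it with a real folder */
LIBNAME mylib 'C:\my_sas_data';

/* Copy the temporary dataset WORK.sample into the permanent library mylib */
DATA mylib.sample;
    SET work.sample;
RUN;
```

Swapping the two dataset names copies in the other direction, from the permanent library back into WORK.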
If you do not want to make a copy of a dataset, and instead wish to modify an existing dataset, then you can simply use the same dataset name in the DATA statement and in the SET statement.
DATA sample;
SET sample;
<other commands here>
RUN;
However, you should be aware that this will overwrite the existing dataset. That is, if you use the same name in both statements, SAS replaces the existing dataset with the new one you are creating, and the earlier version is no longer available.
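If you do want to modify a dataset in place, one cautious pattern is to save a backup copy first. This is only a sketch: the names sample_backup and obs_num are hypothetical, and _N_ is the automatic SAS variable that counts data step iterations.

```sas
/* Keep an untouched copy before overwriting (sample_backup is a hypothetical name) */
DATA sample_backup;
    SET sample;
RUN;

/* Overwrite sample with a version that has one additional variable */
DATA sample;
    SET sample;
    obs_num = _N_;   /* hypothetical new variable: the row's position in the dataset */
RUN;
```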
Data step options generally perform variable-level actions, like renaming or dropping variables from a dataset. Options usually appear in parentheses right after the name(s) of a dataset that is referenced in the DATA statement or in the SET statement.
In the data step, DROP and KEEP are used to "throw out" certain variables from your dataset:
KEEP tells SAS to keep only the listed variables; all other variables are removed from the dataset.
DROP tells SAS to remove only the listed variables from the dataset; all other variables are kept.
These two options can accomplish the same thing, but in a given situation one will likely be easier than the other. If you only want to remove a couple of variables from a dataset, then using a DROP option would be easier than specifying all the variables to stay in a KEEP option. Conversely, if you only want to keep a couple of variables in the dataset, then using a KEEP option would be easier than specifying all the variables to remove in a DROP option.
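To see that the two options can reach the same result, consider the sketch below. It builds a small hypothetical dataset (demo, with the made-up variables id, age, height, and weight) so that the full variable list is known, then subsets it both ways.

```sas
/* Hypothetical dataset with four variables */
DATA demo;
    INPUT id age height weight;
    DATALINES;
1 20 65 150
2 31 70 180
;
RUN;

/* Both steps below produce a dataset containing only id and age */
DATA demo_keep (KEEP = id age);
    SET demo;
RUN;

DATA demo_drop (DROP = height weight);
    SET demo;
RUN;
```

With only four variables the two versions are equally short; the difference in convenience grows with the number of variables in the dataset.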
Suppose we want to create a new dataset with a variable BMI computed from the existing variables height and weight, and that we don't want height and weight to be carried over into the new dataset. The following example creates two new variables (bmi and height2) based on the existing variables height and weight, but removes height and weight from the new dataset sample_new_vars. The factor 703 in the BMI formula applies when weight is measured in pounds and height in inches, and 0.0254 is the number of meters in one inch.
DATA sample_new_vars (DROP = height weight);
SET sample;
bmi = (weight / (height*height) ) * 703;
height2 = height * 0.0254;
RUN;
Let's say that we want to make a copy of our dataset, but only keep the character variables (and the ID variable, ids). SAS has special syntax for "name lists", which allow you to use a special nickname to refer to "all character variables in this dataset" (_CHAR_) or "all numeric variables in this dataset" (_NUMERIC_). These name lists can be used in a DROP or KEEP option, in place of (or in addition to) typical variable names.
DATA sample_stringonly;
SET sample(KEEP=ids _CHAR_);
RUN;
DATA sample_numericonly;
SET sample(KEEP=ids _NUMERIC_);
RUN;
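Name lists work the same way with DROP. As a sketch, the step below removes every numeric variable from the copy, which leaves only the character variables. Note that if ids is stored as a numeric variable, it is dropped as well. The output name sample_nonumeric is hypothetical.

```sas
/* Drop all numeric variables while reading sample; only character variables remain */
DATA sample_nonumeric;
    SET sample (DROP = _NUMERIC_);
RUN;
```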
The RENAME option tells SAS to change the name of one or more variables. Its general form is:
RENAME = (oldvariable1=newvariable1 oldvariable2=newvariable2 ...)
You can rename more than one variable within the parentheses as long as each old=new pair is separated from the next by a space.
To change the names of the variables Gender and DOB to Sex and Date_of_Birth, respectively, we could use the following syntax:
DATA sample2 (RENAME=(Gender=Sex DOB=Date_of_Birth));
SET sample;
RUN;
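Where the RENAME option is placed affects which name the statements inside the data step must use. The sketch below illustrates this; the output dataset names sample3 and sample4 and the variable sex_copy are hypothetical.

```sas
/* RENAME on the SET statement: the variable is renamed as it is read,
   so statements in the step refer to the NEW name */
DATA sample3;
    SET sample (RENAME = (Gender = Sex));
    sex_copy = Sex;
RUN;

/* RENAME on the DATA statement: the variable is renamed as it is written,
   so statements in the step still refer to the OLD name */
DATA sample4 (RENAME = (Gender = Sex));
    SET sample;
    sex_copy = Gender;
RUN;
```

Both output datasets end up with a variable named Sex; only the name used inside the step differs.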
You may have noticed in the above examples that some included the DROP or KEEP options in the SET statement, while others put it in the DATA statement (the first line of the data step).
Data step options provide SAS with additional instructions on how to read or write the dataset you name. They are generally attached to an output dataset (one that SAS is going to create), but they can also be attached to an input dataset (one that SAS is going to read, like when a SET statement is used).
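The practical difference shows up when a variable you drop is also needed in a calculation. The sketch below reuses the BMI calculation from the earlier example; the output dataset names bmi_ok and bmi_missing are hypothetical.

```sas
/* DROP on the DATA statement: height and weight are read in, used to
   compute bmi, and only left out when the output dataset is written */
DATA bmi_ok (DROP = height weight);
    SET sample;
    bmi = (weight / (height*height)) * 703;
RUN;

/* DROP on the SET statement: height and weight are never read in, so
   SAS treats them as new, uninitialized variables inside the step and
   bmi is missing for every row */
DATA bmi_missing;
    SET sample (DROP = height weight);
    bmi = (weight / (height*height)) * 703;
RUN;
```

As a rule of thumb, put DROP or KEEP on the SET statement when the step never needs the excluded variables, and on the DATA statement when the step uses them but the output should not contain them.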
We have covered some of the most common data step options here. You can discover more options in the SAS Help and Documentation window.