Our tutorials reference a dataset called "sample" in many examples. If you'd like to download the sample dataset to work through the examples, choose one of the files below:
Ideally, datasets are structured so that each row corresponds to one unique subject or object, and each column corresponds to a single variable. However, data can be recorded or collected in many different arrangements, depending on what is convenient or cost-effective for the data collector. Furthermore, we may need our data to be arranged in a particular format in order to use a specific analysis or procedure. This is where transposing or reshaping a dataset comes in to play.
The simplest possible case of transposing switches the rows and columns of a matrix or dataset.
For example, consider the following 2x3 matrix (2 rows, 3 columns):
1 2 3 4 5 6
Transposing this matrix would turn it into a 3x2 matrix (3 rows, 2 columns):
1 4 2 5 3 6
Notice that the first row of matrix 1 becomes the first column of matrix 2. The matrix is essentially turned around a diagonal axis.
A "wide" dataset contains exactly one row per subject, and uses a unique ID (called a "key") to identify each subject. Any measurements about that subject are recorded in columns. If a given measurement was taken under different conditions -- for example, in a within-subjects experiment where a subject completes a questionnaire more than once under different conditions -- there will be separate columns for each instance of the measurement.
Statistical software packages typically require data to be in "wide" format for procedures like:
A "long" dataset contains more than one row per subject, and uses a unique ID to identify each subject. Panel (or longitudinal) data is often recorded in this "long" format.
Statistical software packages typically require data to be in "long" format for procedures like:
Consider a clinic where patients come in for 1-month and 3-month follow-up visits after some procedure. As patients come into the clinic, each visit is recorded in the clinic’s records. That is, each row of the "appointments" dataset corresponds to visit. A single visit record might contain information about the patient's name, the type of visit, and the weight of the patient during that visit. In this situation, the patient identifier and the type of visit are both "key" variables that uniquely identify each record; while the patient identifier uniquely identifies a given subject. This arrangement would be considered "long format", since there are multiple rows associated with each subject.
Unique Patient ID | Time | Weight |
---|---|---|
Patient1 | 1-month follow-up visit | Weight |
Patient1 | 3-month follow-up visit | Weight |
Patient2 | 1-month follow-up visit | Weight |
Patient2 | 3-month follow-up visit | Weight |
. | . | . |
. | . | . |
. | . | . |
Patientn | 1-month follow-up visit | Weight |
Patientn | 3-month follow-up visit | Weight |
Now suppose you want to create a scatterplot of how the patients' weights changed between their 1-month and 3-month follow-up visits, or compute the correlation between these measurements. To do this, you might want to transpose the data so that each patient has one line of data that includes both weight values (i.e., a wide dataset), like below:
Unique Patient ID | Weight at Time=1-month visit | Weight at Time=3-month visit |
---|---|---|
Patient1 | Patient1's Weight at 1-month visit | Patient1's Weight at 3-month visit |
Patient2 | Patient2's Weight at 1-month visit | Patient1's Weight at 3-month visit |
. | . | . |
. | . | . |
. | . | . |
Patientn | Patientn's Weight at 1-month visit | Patientn's Weight at 3-month visit |
In SAS, PROC TRANSPOSE
can perform simple transposes, as well as wide-to-long and long-to-wide restructuring of datasets. The general format is:
PROC TRANSPOSE DATA=Dataset-name OUT=New-dataset-name; BY variable(s); ID variable; VAR variable(s); RUN;
In the SAS code above:
PROC TRANSPOSE
statement tells SAS to execute the transpose procedure on an existing dataset called Dataset-name
.
OUT
keyword says that the transposed dataset should be created as a new dataset called New-dataset-name
.BY
statement is used to determine the row structure of the transposed dataset. You can include more than one variable in the BY
statement. Your data must be sorted on your BY variables before running PROC TRANSPOSE.
ID
statement assigns names to the transposed value columns that match the values in the variable listed in the ID statement.
VAR
statement is where you actually tell SAS what variables you want transposed. These are the values that will appear in the cells of the transposed variables.
Other options available in the PROC TRANSPOSE
statement that can be found in the SAS Help Guide.