Our tutorials reference a dataset called "sample" in many examples. If you'd like to download the sample dataset to work through the examples, choose one of the files below:
Sometimes you will want to transform a variable by grouping its categories or values together. For example, you may want to change a continuous variable into a categorical variable, or you may want to merge the categories of a nominal variable. In SPSS, this type of transform is called recoding.
In SPSS, there are three basic options for recoding variables:
Each of these options allows you to re-categorize an existing variable. Recode into Different Variables and DO IF syntax create a new variable without modifying the original variable, while Recode into Same Variables will permanently overwrite the original variable. In general, it is best to recode a variable into a different variable so that you never alter the original data and can easily access the original data if you need to make different changes later on.
Recoding into a different variable transforms an original variable into a new variable. That is, the changes do not overwrite the original variable; they are instead applied to a copy of the original variable under a new name.
To recode into different variables, click Transform > Recode into Different Variables.
The Recode into Different Variables window will appear.
The left column lists all of the variables in your dataset. Select the variable you wish to recode by clicking it. Click the arrow in the center to move the selected variable to the center text box, (B).
A Input Variable -> Output Variable: The center text box lists the variable(s) you have selected to recode, as well as the name your new variable(s) will have after the recode. You will define the new name in (C).
B Output Variable: Define the name and label for your recoded variable(s) by typing them in the text fields. Once you are finished, click Change. Now the center text box, (B), will display both the name of the original variable as well as the name for the new variable (e.g., “Height --> Height_categ”).
C Old and New Variables: Click the Old and New Values to specify how you wish to recode the values for the selected variable.
D If: The If option allows you to specify the conditions under which your recode will be applied. (We discuss the If option in more detail later in this tutorial.)
Once you click Old and New Values, a new window where you will specify how to transform the values will appear.
1Old Value: Specify the type of value you wish to recode (e.g., a specific value, missing data, or a range of values) and the specific value to be recoded (e.g., a value of “1” or a range of “1-5”).
When recoding variables, always handle the missing values first! The most common recoding errors happen when you don't tell SPSS explicitly what to do with missing values: SPSS may recode missing values into one of the new valid categories. This is especially true if using the "Lowest thru", "thru Highest", or "Range - through" options.
2New Value: Specify the new value for your variable (i.e., a specific numeric code such as “2,” system-missing, or copy old values).
3Old -> New: Once you have selected the old and new values for your selected variable in (1) and (2), click Add in area (3), Old-->New. The recode that you have specified now appears in the text field. If you need to change one of the recodes that you have added to the Old-->New area section, simply click on the one you wish to change and make changes in (1) and (2) as necessary.
You will need to repeat these steps for each value that you wish to recode. Once you have specified all the transformations that you wish to make for the selected variable, click the “Continue” button.
Sometimes you may wish to recode values for a specific variable only when other conditions in your data are satisfied. This means that cases meeting the conditions will be recoded, and cases not meeting the conditions will be assigned a missing value. To specify such conditions, click If to bring up the Recode into Different Variables: If Cases window.
1 The left column displays all of the variables in your dataset. You will use one or more variables to define the conditions under which your recode should be applied to the data.
2 The default specification for a recode is to Include all cases. To specify the conditions under which the recode should be applied, however, you will need to click Include if case satisfies condition. This will allow you to specify the conditions under which the recode will be applied to your data.
3 The center of the window includes a collection of arithmetic operators, Boolean operators, and numeric characters, which you can use to specify the conditions under which your recode will be applied to the data. There are many kinds of conditions you can specify by selecting a variable (or multiple variables) from the left column, moving them to the center text field, and using the blue buttons to specify values (e.g., “1”) and operations (e.g., +, *, /). You can also use the options in the Function group list.
When you are finished defining the conditions under which your recode will be applied to the data, click Continue.
Note: Recode into Different Variables does not include the ability to add value labels to the new categories, so immediately after recoding, you should add value labels to your new numeric codes.
When you are ready to run the procedure, click OK. Now your new variable will be recoded according to the criteria you specified. You can find your new variable in the last column in Data View or in the last row of Variable View.
Recoding into the same variable (Transform > Recode into Same Variables) works the same way as described above, except for that any changes made will permanently alter the original variable. That is, the original values will be replaced by the recoded values.
In general, it is good practice not to recode into the same variable because it overwrites the original variable. If you ever needed to use the variable in its original form (or wanted to double-check your steps), that information would be lost.
DO IF-ELSE IF syntax performs similarly to the Recode procedures, but allows for more control over specifying numeric ranges. If you want to discretize a numeric variable into more than three categories, or if you want to perform a recoding based on more than one variable, you'll need to use DO IF-ELSE IF syntax. (You could use DO IF-ELSE IF for recoding a categorical variable, but there's no real reason to use it over Recode; the Recode syntax is shorter and more efficient for that situation.)
The DO IF-ELSE IF syntax is:
DO IF (conditional statement).
COMPUTE (variable assignment statement).
ELSE IF (conditional statement).
COMPUTE (variable assignment statement).
...
ELSE.
COMPUTE (variable assignment statement).
END IF.
EXECUTE.
The DO IF
and ELSE IF
lines tell SPSS to perform the nested computation if certain conditions are true. These conditions are statements (or chains of statements) that evaluate as true or false. For example:
x > 2
is a conditional statement that returns true if the value of x is greater than 2, and returns false if the value of x is less than or equal to 2.x > 2 AND x < 10
returns true if x is larger than two and also smaller than 10 (i.e., 2 < x < 10), and returns false if x is less than or equal to two or if x is greater than or equal to ten (x ≤ 2 or x ≥ 10).MISSING(...)
returns true if its argument is system-missing or user-missing. If you want to handle the recoding of missing values, you would use the syntax DO IF(MISSING(variable)).
A list of operators that SPSS recognizes in conditional (or logical) statements is given in the following table. Note that you can use the letter combinations or the mathematical symbols in your statements. You can also use parentheses to group or distribute the effects of an operator.
Operator | Symbol | Definition |
---|---|---|
EQ |
= |
Equal to |
NE |
~= |
Not equal to |
LT |
< |
Less than |
LE |
<= |
Less than or equal to |
GT |
> |
Greater than |
GE |
>= |
Greater than or equal to |
AND |
& |
Both statements must be true |
OR |
| |
One or both statements must be true |
NOT |
~ |
Negation (must not be true) |
The ELSE
line tells SPSS to perform its nested computation on all other values not accounted for by the previous conditional statements. ELSE is optional -- you don't necessarily have to use it, but it is often more convenient to use than addressing every possible outcome using ELSE IF. If you do use ELSE, it must be at the very end of the loop (right before the END DO
statement).
When using
DO IF
, any conditions based on missing values must be included in the DO IF step; they can not be included inELSE IF
statements. If missing value conditions are used in ELSE IF statements, they are ignored.
The COMPUTE
statements are where the new variable(s) are actually computed or set. Note that if you want to set a a variable equal to a missing value in a COMPUTE statement, use the syntax var=$SYSMIS
. The term $SYSMIS refers to system-missing values. (Note that although SPSS indicates numeric missing values using period characters (.), you would not use the assignment statement var=.
; this will return a syntax error.)
You may encounter this syntax error after executing a DO IF block:
Error # 4095. Command name: EXECUTE The transformations program contains an unclosed LOOP, DO IF, or complex file structure. Use the level-of-control shown to the left of the SPSS Statistics commands to determine the range of LOOPs and DO IFs. Execution of this command stops.
If this happens, you may need to add a hyphen (-) before the
COMPUTE
statement(s).
Class ranks for high schools and colleges are are nicknames for what year of study the person is completing: "freshman" (first-year), "sophomore" (second-year), "junior" (third-year), "senior" (fourth-year). Class ranks are also sometimes divided into "underclassmen" (first or second-year students) and "upperclassmen" (third or fourth-year students).
In the sample dataset, the variable Rank has the categories Freshman (1), Sophomore (2), Junior (3), and Senior (4). Let's use Recode into Different Variables to merge the categories and create a new indicator variable called RankIndicator with the levels Underclassman (1) and Upperclassman (2).
We will show three different ways of defining the categories that produce identical results. You only have to use one of these; we show multiple methods to show that there is flexibility in how you define the groups.
This method tells SPSS exactly how to map each old category onto a new category.
RECODE Rank (SYSMIS=SYSMIS) (1=1) (2=1) (3=2) (4=2) INTO RankIndicator.
VARIABLE LABELS RankIndicator 'Class Rank (binary)'.
EXECUTE.
This method uses ranges. Note that this method works OK for integers, but will often yield unexpected results when used on variables that have one or more nonzero decimal places.
RECODE Rank (SYSMIS=SYSMIS) (1 thru 2=1) (3 thru 4=2) INTO RankIndicator2.
VARIABLE LABELS RankIndicator2 'Class Rank (binary)'.
EXECUTE.
This method uses the "Lowest thru" and "thru Highest" ranges. The "Lowest thru" option acts as "less than or equal to some-number", and the "thru Highest" option acts as "greater than or equal to some-number".
RECODE Rank (SYSMIS=SYSMIS) (Lowest thru 2=1) (3 thru Highest=2) INTO RankIndicator3.
VARIABLE LABELS RankIndicator3 'Class Rank (binary)'.
EXECUTE.
After recoding, we should be able to compare the frequencies old and new variables. There should be an identical number of missing values; the number of underclassmen should equal the sum of the number of freshmen and sophomores; and the number of upperclassmen should equal the sum of the number of juniors and seniors.
One important use of the Recode procedure is dichotomizing or discretizing a continuous variable. Dichotomizing a continuous variable transforms a scale variable into a binary categorical variable by splitting the values into two groups based on a cut point. Discretizing a continuous variable transforms a scale variable into an ordinal categorical variable by splitting the values into three or more groups based on several cut points.
In the sample dataset, the variable CommuteTime represents the amount of time (in minutes) it takes the respondent to commute to campus. Let's try recoding this variable into three ordinal groups:
To check your work, go to the Variable View tab in the Data Editor window. Right-click on the new CommuteLength variable and click Descriptives Statistics. This will create a quick frequency table and summary statistics of the new variable. Make sure that the new variable has the same number of missing values as the original variable. You will also want to set the value labels for the new variable before doing any analysis using this variable.
RECODE CommuteTime (SYSMIS=SYSMIS) (Lowest thru 30=1) (60 thru Highest=3) (ELSE=2) INTO CommuteLength.
EXECUTE.
Why didn't we use the "Range" option to specify category 2?
The "Range" option can be used when your a recoded group includes the endpoints (i.e., is defined by "greater than or equal to" AND "less than or equal to" statements). However, it should NOT be used if one or both of the endpoints is "open" (which happens if a group is defined by a "[strictly] greater than" or "[strictly] less than" statement).
Using "All other values" to define group 2 was completely dependent on us correctly accounting for all other possible categories first, including the missing values. Had we not first handled the missing values, category 2 would have included all of the cases with 30 < time < 60 and all of the cases with missing values.
The above example showed how to discretize a continuous variable into three categories using Recode into Different Variables. Recode into Different Variables was able to correctly account for all possible values in that situation. However, if we wanted to discretize into four or more categories, Recode into Different Variables isn't equipped to properly define each range. We'll illustrate this with a test case, then show how to use DO IF syntax to properly implement the desired recoding scheme.
Suppose we have test scores as percentages, and want to convert those percentages to a letter grade. A typical grading scheme in the United States is:
Recall that the Range specification in Recode into Different Variables allows us to specify a range of values which includes both endpoints. With that constraint, how would we achieve a grouping that was intended to have an open endpoint? For the "D" and "C" grades, we could try specifying the ranges as [60, 69.9] -> D and [70, 70.9] -> C. This could work if scores were only recorded to one decimal place, but what would happen to a score with two decimal places -- say, 69.99? Imagine a number line:
In that instance, the score 69.99 would fall into a "gap" not covered by any recoding rules. In general, your instructions to SPSS should be specified in such a way that all possible outcomes are accounted for, regardless of whether you're using the menus or syntax.
In the sample dataset, the variable Math represents the subjects' scores (out of 100 points) on a math placement test. Suppose we want to recode these scores to have a letter grade using the scheme described above. Let's use DO IF syntax to perform this recode and save the results as a new variable, MathGrade.
This computation must be done using syntax.
DO IF(MISSING(Math)).
COMPUTE MathGrade=$SYSMIS.
ELSE IF (Math < 60).
COMPUTE MathGrade = 1.
ELSE IF (Math >= 60 AND Math < 70).
COMPUTE MathGrade = 2.
ELSE IF (Math >= 70 AND Math < 80).
COMPUTE MathGrade = 3.
ELSE IF (Math >= 80 AND Math < 90).
COMPUTE MathGrade = 4.
ELSE IF (Math >= 90).
COMPUTE MathGrade = 5.
END IF.
EXECUTE.
NOTE: This syntax has been tested in SPSS Version 22 and 23. We have found that it may not work properly in SPSS Version 20. If you are using version 20, you may need to put dashes before each COMPUTE statement inside the DO IF block.
If the recode was performed successfully, we should see the new variable in the Data Editor window.
If the new variable appeared but all of the values are missing, then there is something wrong with your code; you may have forgotten an EXECUTE statement.
We should also be able to check our new variable to make sure that it performed as we expected. There should be the same number of missing values that we started with, and each of the original scores should be classified into exactly one of the grade categories. We can check this using the Compare Means via the syntax below, or via the menus (Analyze > Compare Means > Means. The dependent variable is Math, and the layer/ independent variable is MathGrade):
MEANS TABLES=Math BY MathGrade
/CELLS=COUNT MIN MAX.
Remember that before you perform any further analysis with this variable, you'll want to add value labels showing 1='F', 2='D', and so on.