This tutorial is a primer on how to work with data from multiple choice, multiple-response (or "check all that apply") questions in SPSS Statistics.
Multiple response sets occur when you have a set of related choices or characteristics in which a subject or experimental unit can possess one or more of those characteristics. In this tutorial, we will focus on a specific type of multiple response set: multiple response (or "check-all-that-apply") questionnaire items.
A multiple response question presents a list of possible answer options, and the respondent selects all options that are true for them. For example, suppose we are interested in surveying a group about what types of electronic devices they own, and suppose we are especially interested in the three most common types of mobile computing devices: laptops, phones, and tablets. We might create a survey question like this one:
As individual users complete the survey, their selections might look like this:
Selects "laptop" and "phone" and "tablet"
Selects "phone" and "other"; types "mp3 player" in the write-in box
This particular question type is deceptively simple. On its surface, it looks similar to "single-choice" multiple choice questions, which can be summarized using (univariate) frequency tables. However, this is not the case for multiple response questions: each checkbox functions like a "Yes or No" question. For example, we could restructure this question into a series of single-choice, "Yes or No" questions:
This means that one multiple-response question is actually composed of several binary variables. (There will be as many binary variables as there are "selectable" options.) Additionally, we are not just concerned about how many individuals selected a given choice; we may also care about how many of the options were selected, and what combinations of the options were most common (i.e., are the selections correlated).
To properly analyze these responses, our data must be structured correctly. In practice, there are two basic data structures for this type of data, but one of them is much easier to work with than the other.
If you are given the choice between these two structures, the multiple-column scheme is strongly preferred. If your data is recorded using the single-column structure, you will need to "clean up" the data to get it into the one-column-per-selection format.
To properly analyze multiple response questions in SPSS, your dataset should have the following structure:
The following two examples demonstrate both schemes using the same underlying data. In these examples, the columns represent the answers to a check-all-that-apply question, "Which of the following devices do you own?", with four answer options: laptop, phone, tablet, or "other". In plain language, the data used in both examples:
In this coding scheme, we have a distinct numeric code representing the "checked" or "present" state, but use a missing value (blank) to represent the "unchecked" or "absent" state.
In this example, 1 denotes "present" or "checked", and blank cells denote "absent" or "not checked". (Remember that when entering data, we do not type anything in the cell when the value is missing. The dot you see in the cell is something that SPSS displays, not something that you as the user add.) Variable labels are strongly recommended, since those determine the labeling used in the multiple response frequency tables. Value labels are also useful, but are not a strict requirement.
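As a sketch of the labeling recommended above, variable and value labels can be attached with syntax. The variable names are the ones used in this tutorial's sample data; the label text itself is an assumption chosen for illustration.

```spss
* Label text below is illustrative, not taken from the original survey.
VARIABLE LABELS
  owns_laptop 'Laptop'
  owns_phone 'Phone'
  owns_tablet 'Tablet'
  owns_other 'Other'.
VALUE LABELS owns_laptop owns_phone owns_tablet owns_other
  1 'Checked'.
```

Short variable labels tend to work best here, because they become the row labels in the multiple response frequency tables.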
In this coding scheme, we have a distinct numeric code representing the "checked" or "present" state, and a distinct numeric code representing the "unchecked" or "absent" state.
In this example, 0 denotes "absent" or "not checked", and 1 denotes "present" or "checked". You do not necessarily have to use the numbers 0 and 1, but you should use the same numeric codes across all of the columns. Value labels are strongly recommended, so that you can remember the meanings of the codes.
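If data arrive in the 1s/blanks scheme and the 0/1 scheme is wanted, one possible conversion is sketched below. It assumes the four device variables from this tutorial's sample data; the new variable names ending in 01 are hypothetical. Blanks are turned into 0 only for respondents who checked at least one box, on the assumption that someone with no boxes checked skipped the question and should stay missing.

```spss
* Sketch only. The new variable names ending in 01 are hypothetical.
* NVALID counts the nonmissing values across the listed variables.
DO IF (NVALID(owns_laptop, owns_phone, owns_tablet, owns_other) > 0).
RECODE owns_laptop owns_phone owns_tablet owns_other
    (1=1) (SYSMIS=0) INTO laptop01 phone01 tablet01 other01.
END IF.
EXECUTE.
VALUE LABELS laptop01 phone01 tablet01 other01
  0 'Not checked'
  1 'Checked'.
```

Recoding into new variables (rather than overwriting) keeps the original columns available for checking the result.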
In this tutorial, we will be using simulated data from a hypothetical survey with two questions. Neither question was required, so respondents could choose to skip one or both questions.
The data for the multiple response question was encoded using the 1s/blanks scheme, and the answers to the question on gender were encoded as 0=Male, 1=Female.
An important task when working with check-all-that-apply questions is being able to say how many people did not answer the question. This task is not as straightforward as it is with single-choice multiple-choice questions, where we can simply count the number of missing values in a single column. That approach will not work with multiple-response questions, because the answers are spread across multiple variables, and can be selected independently. For example, someone who responds that they own a phone will still have missing values for laptop and tablet and other. This person clearly answered the question, despite having "missing values" on some of the variables in the set. It stands to reason that a person who did not answer the question must have missing values on all variables in the response set. How do we count the number of nonmissing responses a person gave?
Additionally, we may want to know how many options respondents tended to select. Did most people only select 1 option, or did most people tend to select 2 or 3 options?
We can answer both of these questions using the Count Values Within Cases procedure in SPSS. This procedure takes a set of variables and counts the number of times a specific value occurs for a given case/row. This "count" is added as a new variable to the dataset, which we can then use to apply filters.
In our example data, we used the number 1 to indicate "present", so we want to count the number of 1's a person has across the four multiple response variables. We can use Count Values Within Cases to count the number of "checked boxes" for a given respondent. If someone does not have any 1's, they will have a count of 0.
Count Values Within Cases can be configured to count any number or range of numbers, and can even count missing values. In this example, we choose to count the number of 1's, so individuals who selected zero choices will have values of 0, and individuals who answered the question will have counts greater than 0. This interpretation is more intuitive, and makes it easy to filter out non-responders.
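COUNT selected=owns_laptop owns_phone owns_tablet owns_other(1).
EXECUTE.
In this syntax:
COUNT is the name of the procedure.
The name of the new variable, selected, appears to the left of the equals sign.
The variables to count across (owns_laptop, owns_phone, owns_tablet, owns_other) appear to the right of the equals sign.
The value to count (1) appears in parentheses after the variable list.
The EXECUTE statement tells SPSS to carry out the computation and write the result to the active dataset. (If you run the COUNT statement without the EXECUTE statement, SPSS will "queue up" the command, but not actually carry it out.)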
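Because Count Values Within Cases can also count missing values, the same idea can be turned around. The following is a minimal sketch that counts blanks instead of 1s; the variable name n_blank is hypothetical.

```spss
* Alternative sketch: count blanks instead of 1s. n_blank is a hypothetical name.
COUNT n_blank=owns_laptop owns_phone owns_tablet owns_other(MISSING).
EXECUTE.
```

Under this approach, a respondent with n_blank equal to 4 left every box unchecked. Counting the 1s, as in the main example, is usually easier to interpret.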
The Output window will display the syntax from the Count Values Within Cases command, but will not show any table output. To see the result, go to the Data Editor window; if we were successful, our new variable should appear at the end of the dataset (you may need to scroll to the right to see it).
Notice how cases 1, 2, and 5 had values of 1 for owns_laptop, owns_phone, and owns_tablet, and that their value of selected is 3. Cases 3, 4, 6, and 8 had values of 1 for owns_laptop and owns_phone, so their value of selected is 2. Case 7 only had a 1 for owns_phone, so their value of selected is 1.
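To experiment with COUNT without the sample file, the device columns for the eight cases described above can be typed in as a toy dataset. This is only a sketch: it covers those eight cases and the four device variables, nothing else, and it assumes that a lone period in the data lines is read as a system-missing value.

```spss
* Toy data: device columns for the eight cases described above.
* DATA LIST replaces the active dataset, so save any open work first.
DATA LIST LIST / owns_laptop owns_phone owns_tablet owns_other.
BEGIN DATA
1 1 1 .
1 1 1 .
1 1 . .
1 1 . .
1 1 1 .
1 1 . .
. 1 . .
1 1 . .
END DATA.
COUNT selected=owns_laptop owns_phone owns_tablet owns_other(1).
EXECUTE.
LIST.
```

The LIST command prints the cases to the Output window, which makes it easy to compare each row's value of selected with the number of 1s in that row.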
We can now look at how many devices the respondents owned by creating a frequency table of variable selected (Analyze > Descriptive Statistics > Frequencies).
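For readers working in the syntax window, the menu steps above correspond to a one-line command. This is a minimal sketch that relies on the default options.

```spss
* Frequency table of the number of boxes each respondent checked.
FREQUENCIES VARIABLES=selected.
```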
From this table, we can see that six (6) respondents did not select any electronic devices. Keep this number in mind when reviewing the Multiple Response Frequencies output in the next example.
This table also tells us:
To filter out individuals who did not answer the multiple response question, use the Select Cases procedure to keep cases if selected > 0 (selected greater than 0).
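In syntax, one way to apply this filter is sketched below. The flag variable answered is a hypothetical name; it is not part of the sample data.

```spss
* answered is a hypothetical flag: 1 if at least one box was checked, 0 otherwise.
COMPUTE answered=(selected > 0).
FILTER BY answered.
EXECUTE.
```

A filter only hides the unselected cases from later procedures; it does not delete them. Running FILTER OFF. afterwards restores all cases.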
Working with multiple response sets through the dialog windows is a two-step process:
After a multiple response set is defined, it is only retained as long as the SPSS session is active. Once you close SPSS, the multiple response set definition is erased; the next time you start SPSS, you would need to re-define the multiple response set if you wanted to re-run the multiple response frequency tables. (The exception to this is if you have the Custom Tables module, which is not covered here.) The best way to avoid having to re-define your multiple response sets is to save the syntax created by the Multiple Response Frequency Tables and Crosstabs procedures in an SPSS syntax file, because the syntax for these procedures automatically includes the definitions of the response sets. (This is covered in both examples later in the tutorial.)
To define a multiple response set through the dialog windows, click Analyze > Multiple Response > Define Variable Sets.
A Variables in Set: The variables from the dataset that compose the multiple response set. For surveys, this is typically the set of columns corresponding to the "selectable" choices for a single survey question.
B Variables Are Coded As: The data values used to indicate that the category was present.
Note that it is only possible to choose one of these schemes. If your data does not match one of these schemes, you may need to compute recoded versions of the variables using the Recode into Different Variables procedure.
C Name and Label: The name (required) and label (optional) of the multiple response set. The naming rules for multiple response set names are the same as the normal variable naming rules in SPSS (no spaces, must start with a letter).
D Multiple Response Sets: List of all response sets that have been defined in the current SPSS session. This panel will be blank if no response sets are defined.
After setting up a multiple response set, you will be able to access the Multiple Response Frequencies option through the menus. To do this, click Analyze > Multiple Response > Frequencies.
All multiple response sets you've defined during the current SPSS session will appear on the left.
The two options in the Missing Values section control how cases with missing values should be treated. These settings will have different effects, depending on whether you use blanks versus numeric codes to represent unselected choices, and whether you specified a dichotomy or a range of category codes in the previous step:
To avoid having to re-define the same response set, we recommend using the Paste button (instead of the OK button) to generate the command syntax code for the multiple response frequency table or crosstab. This is because the syntax command for multiple response sets, MULT RESPONSE, contains the definition of the set in the command. (This will be illustrated in the example below.) Using the Paste button will write the syntax commands to the syntax window, which you can then use to execute the analysis without needing to go through the dialog windows.
After setting up a multiple response set, you will be able to access the Multiple Response Crosstabs option through the menus. To do this, click Analyze > Multiple Response > Crosstabs.
A Variable list: The variables in the current dataset. Categorical variables in this list can be used as Row, Column, or Layer variables. For each variable in this list that you use in the table, you will need to use the Define Ranges button to tell SPSS which numeric category codes you want to be included in the table.
B Multiple Response Sets: The multiple response sets that have been defined during the current session. These variables can be used as Row, Column, or Layer variables.
C Rows: The variable(s) you want to be used as the rows in the crosstab.
D Columns: The variable(s) you want to be used as the columns in the crosstab.
E Layers: The variable(s) you want to be used as the "layer" variable in the crosstab. The categories of the layer variable will appear on the outermost edge of the table.
If multiple variables are entered in the Row, Column, and/or Layer boxes, there will be a separate table for each unique combination of the row*column*layer variables.
F Define Range: Opens the Define Range prompt. This option becomes available when you've added a regular variable to the Row, Column, or Layer box, and have clicked on the variable so that it's highlighted.
Note that this means that you cannot use string variables in these tables, and the numeric category codes you want to include in the table must be sequential. (For example, if you had categories 1=Disagree, 2=Neutral, 3=Agree, and you only wanted to include categories 1 and 3 in the table, you would need to recode the variable so that Neutral is not within the range.) If there are numbers between the minimum and maximum that are not represented in the observed data, those numbers will be ignored.
G Options: Opens the Options window:
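Returning to the Define Range note under F: one way to make the wanted category codes sequential is to recode into a new variable. In this sketch, opinion and opinion2 are hypothetical variable names, and the codes follow the Disagree/Neutral/Agree example given in that note.

```spss
* Hypothetical variables: opinion (1=Disagree, 2=Neutral, 3=Agree) and opinion2.
* Neutral is not mentioned in the recode, so it becomes system-missing in opinion2.
RECODE opinion (1=1) (3=2) INTO opinion2.
VALUE LABELS opinion2
  1 'Disagree'
  2 'Agree'.
EXECUTE.
```

With this recode, opinion2 could be used in the crosstab with a range of minimum 1 and maximum 2.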
Suppose we want to know what types of electronic devices (laptops, smartphones, and tablets) college students commonly own. Our desired summary would look something like this:
[Table: % of respondents (n=??) owning each device type, with the percentages still to be filled in]
If we were to try to use the regular Frequencies procedure on this data (Analyze > Descriptive Statistics > Frequencies), the resulting tables would not be succinct:
The first table shows the number of valid and missing responses for each variable. Notice the number of missing responses for each variable: Because we are using the scheme of 1=checked, missing=not checked, the missing values here actually represent the number of people who did not select that option. It does not necessarily mean that they did not answer the question! We should only consider individuals who left all four options blank as skipping the question. It's not possible to determine how many individuals left all four options blank from the basic Frequencies procedure.
In the individual frequency tables, we see the number of people who checked that option (in the rows labeled "Valid - 1"). The Percent column represents the proportion of the total sample who checked that option. Because this procedure can't determine if there were individuals who did not answer the question, we don't know for certain if we should use the total sample size as the denominator to compute the percentages.
Instead, we should use the Multiple Response Frequencies procedure, which can deal with all of these issues, and produce a table structured like the above.
If using the dialog windows, we must do this in two steps: first, using the Define Multiple Response Sets window, and then using the Multiple Response Frequencies window.
After clicking Close, nothing will appear to happen; this is normal. To actually create the table, we now run the Multiple Response Frequencies procedure:
Using syntax for multiple response frequency tables is much simpler, because the definition of the set and the request for the frequency table are part of the same command:
MULT RESPONSE GROUPS=$devices 'Electronic devices owned' (owns_laptop owns_phone owns_tablet
    owns_other (1))
  /FREQUENCIES=$devices.
Running the above steps or syntax produces the following output:
The first table, Case Summary, counts the number of cases with valid values and the number of cases with "true" missing values, i.e., cases that did not have any of the options checked. We see that only 6 cases did not select any of the answer options. This matches what we saw from the Count Values Within Cases procedure (above).
The second table, $devices Frequencies, is the frequency table of interest. From left to right, the columns of this table show:
Using the values of N and Percentage of Cases from the multiple response frequency table, we can fill in the table from the beginning of this example:
[Table: % of respondents (n=429) owning each device type]
This table tells us that:
We saw that the Multiple Response Frequencies procedure will treat an individual as "missing" (i.e. did not answer the question) if the individual had missing values for all variables in the set. However, our survey question only had four options -- laptop, phone, tablet, and "other". All of these options assume that the respondent owns an electronic device. If someone does not have an electronic device, the only way they can accurately respond is to not select any choices! This means that we can't distinguish between people who don't own any electronic devices and people who skipped the question. Given our original research question, this would be especially problematic: if we are interested in knowing the electronic devices that college students own, we need to be certain about what proportion of students do not own any devices, since that could impact students' access to online course materials.
How can we prevent this problem when designing future surveys? One option is to add an answer choice that would specifically accommodate individuals who don't own an electronic device. If we do this, we will need to take an extra step to prevent respondents from giving contradictory answers: for example, we don't want to allow the option for someone to answer "I own a phone and I don't own any electronic devices". Some online survey platforms (such as Qualtrics) allow the survey designer to designate specific answer options as "exclusive". Answers marked as "exclusive" will be "either-or": you can choose any and all of the non-exclusive options, or you can choose the exclusive option, but not both simultaneously.
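If a future survey did include such an option but the platform did not enforce exclusivity, contradictory answers could be flagged after data collection. The sketch below assumes a hypothetical fifth checkbox variable, owns_none, coded 1 when checked; the other new variable names are also hypothetical.

```spss
* Hypothetical: owns_none is a new exclusive "I do not own any devices" checkbox.
COUNT n_devices=owns_laptop owns_phone owns_tablet owns_other(1).
COUNT n_none=owns_none(1).
* contradictory is 1 when a device box and the "none" box were both checked.
COMPUTE contradictory=(n_devices > 0 AND n_none > 0).
EXECUTE.
FREQUENCIES VARIABLES=contradictory.
```

Using COUNT for both parts means the comparison never involves a missing value, so every case receives either 0 or 1.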
Remember: a good multiple choice question will have answers that span the full range of possible answers. (This is true for both single-choice and check-all-that-apply question types!) Consider your research question, and use it to guide whether you should include an option like "other", "not applicable", or "none of these".
We've gone over how to do frequency tables for multiple response variables; in that example, our concern was counting how common each of the electronic device options was. What if we want to compare device ownership between independent groups, such as men and women? Our desired table of results might look like this:
We would like to obtain a crosstab, but as we saw in the previous example, the regular Crosstabs procedure does not work the way we would expect when multiple response set variables are involved:
Recall that the Crosstabs procedure can only use cases that have nonmissing values for both variables. Since unselected values are coded as missing values, the Crosstabs procedure drops them from the table entirely. What we really want is a table that will only drop cases if they're missing values for gender or for the multiple response set (i.e., didn't select any of the answer options). These tables alone won't give us the information we need to fill in the table above. We'll need to use the Multiple Response Crosstabs procedure instead.
If you have not done so already, follow the instructions above to define the multiple response set. Then do the following:
Unlike the normal Crosstabs procedure, we need to specify the range of numeric codes we want to be included in the table. (This must be done for any variable in our crosstab that isn't a multiple response set.) Click Define Ranges.
Notice that the same procedure, MULT RESPONSE, powers both the multiple response frequencies and multiple response crosstabs:
MULT RESPONSE GROUPS=$devices 'Electronic devices owned' (owns_laptop owns_phone owns_tablet
    owns_other (1))
  /VARIABLES=Gender(0 1)
  /TABLES=$devices BY Gender
  /BASE=CASES
  /CELLS=COLUMN.
In this syntax:
MULT RESPONSE is the name of the command
GROUPS= contains the definition of the multiple response group. The definition appears to the right of the equals sign, and has the following components:
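$devices is the name of the multiple response set; note the dollar sign prefix, which denotes that this is a multiple response variable
'Electronic devices owned' is the label of the multiple response set
The parentheses contain the variables in the set, followed by the counted value (here, 1) in its own parentheses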
/VARIABLES= specifies any non-multiple-response variables to be used in the tables, and which numeric values (given in parentheses) should be included in the table (in this case, 0 and 1)
/TABLES= gives the desired structure of the table to the right of the equals sign, using the pattern
rowvar BY colvar BY layervar
/BASE=CASES says that, if proportions are computed, use the total number of cases as the denominator
/CELLS=COLUMN says to print column percentages in the table.
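The /BASE and /CELLS subcommands can be changed to request a different view of the same crosstab. The following variation is a sketch, not the command used to produce the output discussed in this tutorial: it prints counts and row percentages, and bases the percentages on the number of responses rather than the number of cases.

```spss
* Variation: counts plus row percentages, with percentages based on responses.
MULT RESPONSE GROUPS=$devices 'Electronic devices owned' (owns_laptop owns_phone owns_tablet
    owns_other (1))
  /VARIABLES=Gender(0 1)
  /TABLES=$devices BY Gender
  /BASE=RESPONSES
  /CELLS=COUNT ROW.
```

Percentages based on responses sum to 100%, while percentages based on cases can exceed 100% because one respondent can check several boxes.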
Using the column proportions, we can observe that:
We can now fill in our table:
Warning: Do not use the chi-square test of independence on a crosstab containing a multiple response variable. The chi-square test of independence assumes that the observations are independent of each other, which means that no subject or respondent may be counted more than once. Because a respondent can select more than one of the multiple response options, respondents can be counted multiple times, and their responses will be inherently correlated, which violates this critical assumption.