A categorical variable stores one value, or a limited number of distinct values, for each respondent. Categorical variables are generally based on questions that have a predefined list of possible responses, known as categories. For example, the age variable in the Museum sample data set stores the responses to the following question. This question has eight categories, which represent the possible responses:
The age variable is called a single response variable because when a respondent answers the question, he or she must choose only one response from the list of categories.
In some categorical questions, the respondent can choose more than one category from the list of categories. A variable that stores the responses to this type of question is called a multiple response variable and it can store more than one response for each respondent. Here is an example of a multiple response question:
Notice that this question has a category with the text Other. If a respondent has visited museums that are not in the list of categories, he or she can select this category and write the museum names in the space provided. This type of category is called an Other Specify category. The open-ended responses entered for it are stored in a text variable, called an Other Specify variable, that is associated with the main categorical variable. Variables that are associated with a main variable and hold additional information are called helper variables.
In a categorical variable, there is one category for each response in the question on which it is based. Sometimes categorical variables have additional items that represent, for example, the base or the mean value. You can add such items to variables for use in your tables. However, in some data sets (particularly IBM® SPSS® Quanvert™ databases) these additional items are built into the structure of the variable.
How do categorical variables store the responses?
IBM® SPSS® Data Collection Survey Reporter accesses the data through the IBM® SPSS® Data Collection Data Model, which presents data in a consistent way regardless of the underlying data format. You do not need to understand how the Data Model represents the responses stored in a categorical variable when you are simply building tables or defining simple filters. However, understanding the representation is helpful if you want to use some of the advanced features, such as advanced expressions that define filters.
The Data Model assigns a unique numeric (integer) value to each unique category full name in the data set. These unique values are called mapped category values. Category full names must be unique within a question, but the same full name can be used in different questions. For example, categories called Yes and No can be used in several questions, and will have the same mapped value in each one.
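The idea of one mapped value per unique category name can be sketched in a few lines of Python. This is only an illustration: the function `build_category_map`, the question names, and the sequential numbering starting at 1 are assumptions made for the example, and the values that the Data Model actually assigns may be chosen quite differently.

```python
def build_category_map(questions):
    """Assign one integer to each distinct category name across all questions.

    Hypothetical helper: sequential numbering is an assumption, not the
    Data Model's documented scheme.
    """
    mapped = {}
    next_value = 1
    for categories in questions.values():
        for name in categories:
            # A name that has already been seen keeps its existing value,
            # so "yes" in two questions shares one mapped value.
            if name not in mapped:
                mapped[name] = next_value
                next_value += 1
    return mapped


# Hypothetical questions that reuse the same category names.
questions = {
    "visited_before": ["yes", "no"],
    "recommend": ["yes", "no", "not_sure"],
}
category_map = build_category_map(questions)
print(category_map)
```

Because the map is keyed on the category name alone, Yes and No receive the same value in every question that uses them, which is the behavior described above.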
By default, the Data Model presents the responses to a categorical question as a string in which the mapped values are enclosed in braces ({ }) and separated by commas (,). For example, the response to a single response question might be {24} and the response to a multiple response question might be {31,36,43}, where 24, 31, 36, and 43 are the mapped values of the chosen categories. However, provided a metadata source is available, the Data Model can also present the responses using the category names rather than the mapped values. Our example responses might then appear as {female} and {dinosaurs,insects,human_biology}.
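The brace-and-comma format is simple enough to read and write by hand. The following Python sketch shows one way to do so. The helpers `format_response`, `parse_response`, and `to_names` are hypothetical and are not part of the Data Model. The lookup table pairs the example values with the example names in the order in which they appear above, which is an assumption.

```python
def format_response(values):
    """Join values with commas and wrap them in braces, e.g. {31,36,43}."""
    return "{" + ",".join(str(v) for v in values) + "}"


def parse_response(text):
    """Split a braced response string into a list of its items (as strings)."""
    inner = text.strip()[1:-1]  # drop the surrounding braces
    return [item.strip() for item in inner.split(",") if item.strip()]


def to_names(text, value_to_name):
    """Rewrite a response of mapped values using category names instead."""
    return format_response(value_to_name[int(v)] for v in parse_response(text))


# Assumed pairing of the example mapped values with the example names.
value_to_name = {24: "female", 31: "dinosaurs", 36: "insects", 43: "human_biology"}

print(to_names("{24}", value_to_name))
print(to_names("{31,36,43}", value_to_name))
```

In this sketch the metadata source is reduced to a plain dictionary. Without such a lookup, only the mapped-value form can be produced.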
When you refer to specific responses in, for example, a filter expression, you should normally use the category names and not the mapped values. The following table provides examples of doing this.
Use the category names | Rather than the mapped values |
---|---|
gender = {female} | gender = {24} |
remember = {dinosaurs,insects,human_biology} | remember = {31,36,43} |
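
To see why the name-based form is easier to work with, here is a rough Python sketch of applying such a filter to a handful of respondents. It assumes that the comparison matches a response holding exactly the listed categories, in any order. That reading of the = operator is an assumption, as are the `matches` helper, the `male` category, and the invented respondent records.

```python
def matches(response, wanted):
    """True when the response holds exactly the wanted categories.

    Order is ignored. Exact-set matching is an assumption about how
    the comparison behaves, made for this illustration.
    """
    return set(response) == set(wanted)


# Invented records for illustration only.
respondents = [
    {"id": 1, "gender": ["female"], "remember": ["dinosaurs", "insects", "human_biology"]},
    {"id": 2, "gender": ["male"], "remember": ["dinosaurs"]},
    {"id": 3, "gender": ["female"], "remember": ["insects"]},
]

# Equivalent in spirit to the filter: gender = {female}
female_ids = [r["id"] for r in respondents if matches(r["gender"], ["female"])]

# Equivalent in spirit to: remember = {dinosaurs,insects,human_biology}
remember_ids = [
    r["id"]
    for r in respondents
    if matches(r["remember"], ["dinosaurs", "insects", "human_biology"])
]
print(female_ids, remember_ids)
```

A filter written with names such as `female` can be checked at a glance, whereas the equivalent written with 24 requires looking up the mapped value first.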