Article

How Do Respondents Attend to Verbal Labels in Rating Scales?

Field Methods 2014, Vol. 26(1) 21–39 © The Author(s) 2013 Reprints and permission: sagepub.com/journalsPermissions.nav DOI: 10.1177/1525822X13508270 fm.sagepub.com

Natalja Menold¹, Lars Kaczmirek¹, Timo Lenzner¹, and Aleš Neusar²

Abstract

Two formats of labeling in rating scales are commonly used in questionnaires: verbal labels for end categories only (END form) and verbal labels for each of the categories (ALL form). We examine attention processes and respondents’ burden in using verbal labels in rating scales. Attention was tracked in a laboratory setting employing eye-tracking technology. The results of the two experiments are presented: One applied seven and the other applied five categories in rating scales comparing END and ALL forms (n = 47 in each experiment). The results show that the ALL form provides higher reliability, although the probability that respondents attend to a verbal label seems to decrease as the number of verbally labeled categories increases.

Keywords: cognitive response process, rating scales, labeling, eye tracking

¹ GESIS – Leibniz Institute for the Social Sciences, Mannheim, Germany
² Palacký University, Olomouc, Czech Republic

Corresponding Author: Natalja Menold, GESIS – Leibniz Institute for the Social Sciences, P.O. Box 12 21 55, D-68072 Mannheim, Germany. Email: [email protected]

This article is concerned with the labeling of rating scales in web surveys. Rating scales present a continuum, going from one extreme to the other. When giving their answers, respondents have to graduate them, for example, on an agree–disagree continuum. In our research, we focus on rating scales and their use in multiple item sets that are expected to measure a single construct (Likert-type scales). With regard to labeling, two different rating scale formats are commonly used in questionnaires: verbal labels for end categories only, often combined with numerical labels (END form); and verbal labels for each of the categories (ALL form). A great deal of research shows that the ALL form is associated with higher measurement quality. Other results show that numerous verbal labels may increase the difficulty of the cognitive task of respondents. This article addresses the question of how thoroughly respondents attend to rating scale labels in web surveys and how this affects data quality (reliability of measurements). Response behavior and attention to different parts of the rating scales are measured by means of eye tracking. The next section describes the theoretical background of the response behavior. Specifically, we compare two models developed by Parducci (1983) and Krosnick and Alwin (1987).

Theoretical Background

According to Parducci (1983), characteristics of rating scales impact the cognitive representation of the measurement dimensions by respondents. The crucial characteristics are scale range and frequency. Scale range is determined by the meaning of the end categories, while frequency is determined by the number of categories between the end categories. When respondents answer a question with a rating scale, they match the end anchors of their internal subjective range to the end categories of the rating scale, whereby scale categories between the end points divide the internal representation of the dimension into equal subranges. Consequently, verbal labels should affect the understanding of range and subranges in rating scales (Tourangeau et al. 2000). If so, respondents should attend to the verbal labels to understand the range and frequency of the dimension, which is presented by a rating scale. However, this attention can be limited if the mapping process becomes more complicated and burdensome with an increasing amount of information that respondents have to comprehend and to distinguish. According to satisficing theory (Krosnick and Alwin 1987), difficulties in the mapping process are likely to increase the chance of superficial information processing (satisficing), which, in turn, impedes data quality. However, optimal information processing, which considers the categories and their labels while mapping judgments onto rating scales, would be in agreement with the model by Parducci (1983).

The ALL form can provide a better understanding of the range and frequency of a rating scale as verbal labeling for each category clarifies its meaning and thus reduces the interpretation variability across respondents (e.g., Krosnick and Fabrigar 1997; Maitland 2009). However, the mapping process can also become more burdensome with the ALL form compared to the END form because more information has to be considered. In accordance with this discrepancy, previous studies reported mixed results with respect to the measurement quality of the ALL form. Some studies found that, compared to the END form, the ALL form increased cross-sectional reliability (Alwin and Krosnick 1991; Saris and Gallhofer 2007), interrater reliability (e.g., Peters and McCormick 1966), retest reliability (Krosnick and Berent 1993; Weng 2004), and validity (Saris and Gallhofer 2007). However, there are also contradictory findings. In many other studies, the ALL form did not impact the cross-sectional reliability (see Churchill and Peter [1984] for a meta-analysis and see Tränkle [1987]). Andrews (1984), who applied a multitrait-multimethod design using different surveys on U.S. and Canadian populations, found a negative relationship between the ALL form and the measurement quality of rating scales with two to nine categories. In addition, developing adequate labels for all categories in (longer) rating scales may present a challenge for questionnaire designers. Generally, studies in which apparently equidistant labels were developed for different measures (e.g., agreement, frequency) are rare and show some problems in arriving at equidistance (Rohrmann 1978).
To summarize, there are many research findings that recommend using the ALL form, while there are also studies that come to contradictory conclusions. We believe that an analysis of the respondents’ attention that is spent on verbal labels can help us better understand this issue. The cognitive models by Tourangeau et al. (2000) and Parducci (1983) remain vague in their assumptions about attention processes during the mapping stage. In addition, there is little empirical evidence of attention processes with respect to labels in rating scales. In several studies, the ALL form showed a somewhat higher response time and a reduced impact of irrelevant visual features compared to the END form with numerical labels (e.g., Couper et al. 2007; Toepoel and Dillman 2011). In the current study, we take a closer look at how respondents attend to the information conveyed by verbal labels in the ALL and END forms.

Parducci’s (1983) model implies that respondents should pay attention to the extreme categories of a rating scale (which convey information about the range of the underlying dimension of measurement) and to intermediate categories (which convey information about frequency and subranges). If these assumptions are met, respondents should attend to verbal labels, irrespective of the form used (ALL or END). If respondents pay limited attention to the verbal labels, such behavior may be related to satisficing behavior, which may result in a lower quality of measurements.

In our study, we conducted two experiments employing a web-based survey and varying the labeling. We used ALL and END forms with seven (in the first experiment) and five (in the second experiment) response categories. These numbers of categories are commonly used in surveys and have often been accepted as appropriate (Krosnick and Fabrigar 1997). Attention was measured by means of eye tracking, which allowed us to determine exactly how often and for how long respondents looked at a particular verbal label, and thus provided a window into the ways in which they processed the different rating scales. In both experiments, we expected to obtain similar results and tested the following three hypotheses:

Hypothesis 1: Attention will be paid to all verbal labels in the ALL and END forms; consequently, the attention to the verbal labels will not significantly differ between the ALL and the END forms.

Hypothesis 2: Since information processing is more complex in the case of the ALL form, it will be more time consuming to read the labels. It takes longer to attend to a rating scale in the case of the ALL form as opposed to the END form.

Hypothesis 3: Finally, we expect to confirm the findings of previous research regarding the superior reliability of the ALL form.

General Method

The study was conducted between June and July 2009 at the Max Planck Institute for Human Development in Berlin, Germany. Forty-seven respondents participated in the study. Their mean age was 26 (SD = 3.6); 57% were female. All participants had at least 12 years of schooling; 70% were currently enrolled as university students. The native language of all participants was German (the language in which the questionnaires were designed).

The Tobii T120 eye-tracking system was used to record and analyze participants’ eye movements. In the T120 hardware, the eye-tracking cameras are integrated into a 17-in. TFT monitor allowing for unobtrusive recording of respondents’ eye movements. The documentation of the T120 describes its accuracy to be within 0.5°, with less than 0.3° drift over time, and less than 1° due to head motion. It allows for head movement within a 30 × 22 × 30 cm volume centered up to 70 cm from the camera. The sampling rate is 120 Hz, meaning that 120 data points per second are collected for each eye. The accuracy of the T120 was found to be generally sufficient to determine the words on which respondents fixate. However, to make sure that all fixations could be unequivocally allocated to the words respondents had actually read, we used a larger font size, up to 18 pixels, and an increased line height. Screen resolution was set to 1280 × 1024. In our analyses, we included all fixations that lasted at least 100 milliseconds and encompassed 20 pixels (about four characters of text; see Galesic et al. [2008] for similar methodology).

The two randomized experiments were part of a larger study with several unrelated experiments (e.g., Lenzner et al. 2011). In experiment 1, respondents were confronted with the ALL versus END forms for the first time. After this experiment, respondents were randomly assigned to the next experiment with different content and a different response scale, then they were again regrouped (randomly) to the conditions of experiment 2. This procedure ensures that possible effects of the previous experiments are randomly divided between the two conditions in both experiments (hence, the error cannot be systematic but is rather random). The entire study took about 2 hours, 1 hour of which was devoted to eye tracking. As the accuracy of the calibration might decrease over time, respondents were recalibrated every 10–15 minutes.
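The inclusion rule for fixations described above (a minimum duration of 100 milliseconds at a sampling rate of 120 Hz) can be illustrated with a minimal sketch. The `Fixation` record, its field names, and the sample data are hypothetical and are not taken from the Tobii software. The 20-pixel criterion is assumed to be handled by the eye tracker's own fixation detection and is not modeled here.

```python
from dataclasses import dataclass

SAMPLING_RATE_HZ = 120   # T120 sampling rate reported in the text
MIN_DURATION_MS = 100    # minimum fixation duration reported in the text


@dataclass
class Fixation:
    """Hypothetical fixation record; field names are illustrative only."""
    x: float            # horizontal screen position in pixels
    y: float            # vertical screen position in pixels
    duration_ms: float  # fixation duration in milliseconds


def min_samples(duration_ms: float, rate_hz: int = SAMPLING_RATE_HZ) -> int:
    """Number of gaze samples that span a given duration at the sampling rate."""
    return round(duration_ms * rate_hz / 1000)


def keep_valid(fixations: list[Fixation]) -> list[Fixation]:
    """Keep only fixations lasting at least MIN_DURATION_MS."""
    return [f for f in fixations if f.duration_ms >= MIN_DURATION_MS]


# Made-up example data, not from the study.
sample = [Fixation(100, 200, 80), Fixation(300, 200, 100), Fixation(640, 512, 250)]
valid = keep_valid(sample)
```

At 120 Hz, the 100-millisecond threshold corresponds to 12 consecutive gaze samples per eye.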
A technical assistant was seated at a table next to the respondent and was monitoring his or her eye movements on a separate computer monitor. Respondents were seated in front of the eye tracker so that their eyes were approximately 60 cm from the screen. They were instructed to read at normal speed while trying to understand and answer the questions. After participants had successfully completed a standardized calibration procedure, they were presented with the pages of the web survey. In the beginning of the eye-tracking session, all respondents received three questions on political opinions (taken from the German General Survey, ALLBUS; http://www.gesis.org/allbus). From these questions, we computed respondents’ average fixation length and average fixation count, which were later used as covariates in the analysis to control for individual differences in respondents’ reading speed (see Lenzner et al. 2011). For their participation in the whole study, respondents received a compensation of €10.

Experiment 1

Method

Experiment 1 involved a Likert-type scale consisting of four items about national identity (see the Appendix). The items were originally asked in the ALLBUS 2008 and have been found (by our own principal component analysis [PCA] with ALLBUS data) to build a single factor (44% of explained variance). The rating scale consisted of seven categories. Respondents were randomly assigned to one of the following two conditions:

1. ALL form with verbal labels for each category: “does not apply at all,” “does not apply,” “applies to some extent,” “applies partly,” “applies to a great extent,” “applies almost fully,” and “applies fully” (own translation from German).
2. END form with verbal labels at the two end points (“does not apply at all” and “applies fully”).

The labels are described in Rohrmann (1978), who developed verbal labels for different dimensions in the German language.

Results

Descriptive Results

No significant differences were found between the two conditions regarding means, neither for the individual items nor for the summarized mean of the four items, the highest t(45) = 1.1, p > .10, d = 0.33. The mean values for individual items range in the condition ALL from M(item 3) = 1.21 (SD = 0.51) to M(item 1) = 3.58 (SD = 1.95; see the Appendix). In the condition END, the means for individual items range from M(item 3) = 1.49 (SD = 1.08) to M(item 1) = 3.74 (SD = 1.66). There was also no significant difference in completion times for all four items between the two conditions, log-transformed response time, M(ALL) = 3.62; SD = 0.29; M(END) = 3.59; SD = 0.34; t(45) = 0.55, p > .10, d = 0.08.

The distributions of all items significantly differed from a standard normal distribution in both conditions, except for item 1 in the END condition (the highest significant Shapiro–Wilk value was 0.898, p < .05; the significant skewness values of items ranged from S = 1.4 to S = 3.6, SE = 0.47; the significant kurtosis values ranged from K = 1.76 to K = 14.84, SE = 0.94). We obtained differences in distributions between the two forms looking at the summarized Likert-type scale means. Here, the mean values followed a normal distribution in the ALL condition (Shapiro–Wilk = 0.90, p > .05), but not in the END condition (Shapiro–Wilk = 0.93, p < .05). Additionally, the skewness (S = 1.33, SE = 0.48) and kurtosis (K = 2.3, SE = 0.94) of the scale mean were both positive and significant in the END condition, whereas in the ALL condition they were not significant (S = 0.39, SE = 0.47; K = 0.84, SE = 0.92). Overall, we obtained a better fit with respect to normality for the summarized Likert-type scale means in the ALL condition as compared to the END condition.
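The skewness and kurtosis values above are judged against their standard errors. The following is a minimal sketch, assuming the small-sample standard-error formulas that common statistics packages use (the article does not state which formulas were applied) and the group sizes n(END) = 23 and n(ALL) = 24 given in the note to Table 1. Values computed this way may differ from the reported ones in the last digit.

```python
import math


def se_skewness(n: int) -> float:
    """Standard error of sample skewness for a sample of size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n: int) -> float:
    """Standard error of sample excess kurtosis for a sample of size n."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


def z_ratio(statistic: float, standard_error: float) -> float:
    """Ratio of a statistic to its standard error; |z| > 1.96 suggests p < .05."""
    return statistic / standard_error


# Skewness of the scale mean as reported for experiment 1.
z_end = z_ratio(1.33, se_skewness(23))  # END condition
z_all = z_ratio(0.39, se_skewness(24))  # ALL condition
print(abs(z_end) > 1.96, abs(z_all) > 1.96)  # prints True False
```

Under these assumptions, the skewness of the scale mean exceeds the conventional threshold in the END condition but not in the ALL condition, which agrees with the pattern described in the text.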

Respondents’ Attention

Respondents’ attention was examined by means of eye tracking. Eye movements were analyzed for different areas of the screen (Table 1, Figure 1): (1) the area comprising all items (items), (2) the rating scale, (3) each verbal label, (4) the answers (comprising the radio buttons), and (5) all these areas together (combined total). Concerning eye movements, we considered fixation length and fixation count of respondents on different areas. Fixation length is the time in seconds that a respondent looks at a specific area. The fixation count is the number of fixations on an area. We conducted analyses of covariance (ANCOVAs) for the comparisons between the ALL and END forms. As covariates, we used the average fixation length/fixation count to control for differences in respondents’ reading speed (see General Method section).

Concerning the area of the rating scale (response scale), the results show significantly longer fixation times and more fixations for the ALL form, as compared to the END form (Table 1). This means that respondents spend more time looking at the ALL form. Concerning the verbal labels, only fixations on the end categories can be compared between the two conditions. Table 1 shows that in the ALL condition, the label on the right end receives significantly lower fixation times and fixation counts than the right end label in the condition END. For the intermediate verbal labels of the ALL form, the fixation times and counts linearly decrease from left to right (Table 1, note). Thus, respondents who used the ALL form paid less attention to the right-end label and

28

Field Methods 26(1)

Table 1. Mean Fixation Length and Fixation Counts in the Two Experiments.

| Screen areas | Mean fixation length (a): END | Mean fixation length (a): ALL | F(1,44) | Partial η² | Mean fixation count: END | Mean fixation count: ALL | F(1,44) | Partial η² |
|---|---|---|---|---|---|---|---|---|
| Experiment 1 | | | | | | | | |
| Left-end label | 1.92 | 1.84 | 0.18 | .00 | 6.44 | 5.71 | 0.69 | .02 |
| Right-end label | 1.10 | 0.31 | 18.47** | .30 | 3.78 | 1.04 | 17.95** | .29 |
| Response scale | 3.95 | 6.19 | 7.16** | .14 | 14.73 | 19.42 | 5.37** | .11 |
| Items | 9.32 | 10.35 | 0.16 | .00 | 45.65 | 45.79 | 0.01 | .00 |
| Answers | 7.37 | 6.48 | 1.43 | .03 | 30.65 | 24.08 | 3.64 | .08 |
| Combined total | 22.95 | 24.78 | 0.12 | .00 | 102.04 | 97.58 | 0.37 | .01 |
| Experiment 2 | | | | | | | | |
| Left-end label | 2.19 | 1.13 | 11.67** | .21 | 7.25 | 4.09 | 7.89** | .15 |
| Right-end label | 1.39 | 0.37 | 19.67** | .31 | 4.88 | 1.78 | 14.64** | .25 |
| Response scale | 4.42 | 7.04 | 10.49** | .19 | 16.33 | 26.44 | 13.98** | .24 |
| Items | 26.49 | 25.42 | 0.00 | .00 | 119.42 | 121.87 | 0.21 | .01 |
| Answers | 18.61 | 17.58 | 0.09 | .00 | 67.83 | 68.00 | 0.05 | .00 |
| Combined total | 52.87 | 53.81 | 0.21 | .01 | 219.72 | 234.09 | 0.92 | .02 |

Note: Experiment 1: Mean fixation length (counts) on the intermediate categories (from left to right): 1.41 (4.27); 0.98 (3.27); 0.81 (2.58); 0.47 (1.43); 0.37 (1.12); n(END) = 23; n(ALL) = 24. Experiment 2: Mean fixation length (counts) on the intermediate categories (from left to right): 2.38 (8.75); 1.37 (5.84); 1.79 (5.98); n(END) = 24; n(ALL) = 23.
(a) In seconds.
**p < .01.

Figure 1. Areas of interest for the analysis of fixations in experiment 1, condition ALL.

other intermediate labels on the right part of the rating scale. Table 1 also shows that in the END form, the label on the right receives lower fixation times and fixation counts than the label on the left. There are no other significant differences in fixation times and fixation counts between the two conditions, either for the total screen area (combined total) or for the item or answer areas. Using ANCOVAs, we showed that the respondents fixated longer on the rating scale area in the ALL condition than in the END condition and that the labels on the right received less attention, in particular in the ALL form.

Next, we examined the probability for a verbal label to receive attention. Thus, we analyzed whether a respondent fixated on a verbal label at least one time before he or she answered the first item (initial fixations) and after he or she had answered all four items (final fixations). We were interested in initial fixations to analyze whether respondents had considered the meaning of the range and subranges of the rating scales (presented by the verbal labels) before they started to map their answers onto the response categories (in accordance with the assumptions of the range frequency principle). If that was not the case, we were interested in whether the meaning of the labels was considered by the respondents during the answering process (at least toward the end of this process, measured by the number of final fixations). We used observations of eye-tracking videos to obtain initial and final fixations. The observations were made by two observers and were subsequently checked by three other observers. During this control check, no corrections of data were necessary.

For initial fixations, we found that in the END condition, 74% (17 participants) read both labels before they started to answer the first item. In contrast, in the ALL condition, only 12.5% (three participants) read all seven labels before they answered the first item.
The majority of participants in the ALL condition (66%) initially fixated on five or fewer labels. For final fixations, after having answered all four items, nearly all participants (96%) had fixated on both labels in the END condition at least one time. In contrast, only six participants (25%) had fixated on all seven labels in the ALL condition at least once.

To test the significance of differences in initial and final fixations between the two conditions, we calculated for each respondent the ratios of fixated labels as related to the number of labels in a rating scale. In the END condition, the ratios were x/2 and in the ALL condition the ratios were x/7 (where x is the number of fixated labels by one respondent). For ratios of the initial reading, the mean was M = 0.87 (SD = 0.22) in the END condition and M = 0.69 (SD = 0.21) in the ALL condition. For the final reading, the mean was M = 0.98 (SD = 0.10) in the END condition and M = 0.76 (SD = 0.20) in the ALL condition. The ratios were significantly smaller in the ALL condition for initial fixations, t(45) = 2.86, p < .01, d = 0.86, as well as for final fixations, t(34.9) = 4.67, p < .001, d = 1.38. So, the probability of fixating on a verbal label at least once was significantly lower in the ALL form than in the END form for both initial reading and reading at the end of the mapping process.

The results do not support Hypothesis 1, because there were differences in attention spent on the verbal labels in the ALL and END forms. Most participants attended to only some of the verbal labels, where the probability for a verbal label to be attended to was significantly lower in the condition ALL than in the condition END. Hypothesis 2 is supported by our data: It takes respondents longer to process the information in fully labeled rating scales, despite the fact that some of the labels are disregarded by the respondents.
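To make the ratio measure concrete: each respondent's count of fixated labels is divided by the number of labels on the scale (2 for END, 7 for ALL), and the two groups of ratios are then compared. The sketch below uses made-up counts, not the study's data, and computes Cohen's d with a pooled standard deviation, which is one common definition and an assumption here.

```python
import math
from statistics import mean, stdev


def label_ratios(fixated_counts, n_labels):
    """Share of the scale's verbal labels each respondent fixated at least once."""
    return [x / n_labels for x in fixated_counts]


def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Hypothetical counts of fixated labels per respondent (illustration only).
end_counts = [2, 2, 1, 2, 2, 1]   # END form: 2 labels on the scale
all_counts = [5, 4, 7, 3, 5, 6]   # ALL form: 7 labels on the scale

end_ratios = label_ratios(end_counts, 2)
all_ratios = label_ratios(all_counts, 7)
d = cohens_d(end_ratios, all_ratios)
```

Dividing by the number of labels puts both forms on the same 0 to 1 scale, so a respondent who reads one of two END labels (0.5) can be compared with one who reads four of seven ALL labels (about 0.57).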

Reliability

It is important to use an appropriate coefficient to measure reliability. In this regard, the crucial question is whether items are homogeneous and equivalent in measuring the latent dimension (see Lord and Novick 1968). The items used in the experiment describe different aspects (national pride vs. acceptance of dictatorship) of a latent dimension “national identity” (see Experiment 1, Method) and are rather heterogeneous. Cronbach’s α (which is a commonly used reliability coefficient) requires that the items are at least essentially τ-equivalent (have the same loadings on the latent dimension) and that the error terms should not be correlated. Thus, Cronbach’s α is applicable only for highly homogeneous items (Lord and Novick 1968). As an alternative measure of reliability, Guttman’s lambda (λ) coefficients can be used (Lord and Novick 1968). In particular, λ4 (a measure for two test halves) can be considered, because (1) parallelism or equivalence of two parts are not assumed; (2) it is not necessary for two parts to have comparable variances; and (3) in contrast to Cronbach’s α, which can be thought of as a mean of all possible test halves, λ4 finds the splits (these could also be of different length) with the largest reliability (Callender and Osburn 1979; Guttman 1945).

In our data, λ4 is remarkably higher in the ALL condition (λ4 = .71) than in the END condition (λ4 = .57). The reliability coefficient of λ4 = .71 is low but acceptable (we referred to Fisseni’s [1997] recommendation to evaluate reliabilities). Because Cronbach’s α is the most common measure (and it has often been applied without the consideration of item equivalence), we also report the results obtained by Cronbach’s α: Alphas do not differ between the two conditions and both values are unacceptably low, α(END) = .61; α(ALL) = .60. Considering the heterogeneity of the items, the ALL condition resulted in a higher reliability than the END condition in accordance with the assumption of Hypothesis 3.
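The difference between the two coefficients can be sketched in a few lines. The code below assumes the textbook definitions, α = k/(k − 1) · (1 − Σ item variances / total variance) and λ4 = 2 · (1 − (variance of part A + variance of part B) / total variance), and searches all splits of the items into two non-empty parts, as described above. The function names and the response data are made up for illustration and are not the authors' code or data.

```python
from itertools import combinations
from statistics import variance


def sum_scores(items, idx):
    """Per-respondent sum over the item columns listed in idx."""
    n_resp = len(items[0])
    return [sum(items[i][r] for i in idx) for r in range(n_resp)]


def cronbach_alpha(items):
    """Cronbach's alpha from item variances and total-score variance."""
    k = len(items)
    total_var = variance(sum_scores(items, range(k)))
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / total_var)


def max_lambda4(items):
    """Largest Guttman lambda-4 over all splits of the items into two parts."""
    k = len(items)
    total_var = variance(sum_scores(items, range(k)))
    best = float("-inf")
    for size in range(1, k // 2 + 1):          # parts may differ in length
        for part_a in combinations(range(k), size):
            part_b = [i for i in range(k) if i not in part_a]
            var_a = variance(sum_scores(items, part_a))
            var_b = variance(sum_scores(items, part_b))
            best = max(best, 2 * (1 - (var_a + var_b) / total_var))
    return best


# Hypothetical data: four item columns, seven respondents each.
items = [
    [1, 2, 3, 4, 5, 6, 7],
    [2, 1, 4, 3, 6, 5, 7],
    [3, 3, 2, 5, 4, 6, 5],
    [4, 2, 5, 3, 3, 6, 4],
]
alpha = cronbach_alpha(items)
lam4 = max_lambda4(items)
```

Because α equals the average of λ4 over all equal-length splits, the maximal λ4 can never fall below α for the same data, which is why the two coefficients can diverge for heterogeneous items.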

Conclusion

Although we did not find differences in the means of item values between the two conditions, we can report a better measurement quality for the ALL form. The ALL form obtained higher reliability than the END form, and the summarized scale value was normally distributed in the ALL form while it was not normally distributed in the END form. Normality of the distribution is a prerequisite for many linear tests of significance. However, the results also showed that the probability for a verbal label to receive attention (both before and after completion of the mapping process) was remarkably low in the ALL condition as compared to the END condition. In this regard, respondents did not behave in accordance with the assumptions proposed by the range frequency model, but rather shortened the cognitive effort as described by satisficing theory.

Experiment 2

Method

In this experiment, we compared the END and ALL forms using five-category rating scales. Respondents were randomly assigned to one of the following two conditions:

1. ALL form with fully labeled categories (“do not agree at all,” “agree to some extent,” “neither nor,” “agree to a great extent,” and “agree fully”).
2. END form with verbal labels at the two end points (“do not agree at all,” “agree fully”). Additionally, the response categories were marked with numbers from 1 to 5.

The labels differed from the labels used in experiment 1. We found it important to change the wording of the labels in the second experiment to minimize the training effect from the participation in the first experiment and to ensure that the information in the rating scale is new so that respondents are required to pay attention to this new information.

In this experiment, we presented 11 heterogeneous items about opinions on the European Union (EU; see the Appendix; the original questions were part of the German Longitudinal Election Study, GLES). By our own PCA with GLES 2011 data, we found that the items form two factors with 49% of explained variance (the first factor, which we label EU1/sovereignty, consists of the items 1, 2, 4, 6, 7, and 8 and addresses EU impact on German life and economy as well as sovereignty of the states within the EU; the second factor, which we label EU2/integrity, contains the remaining items, which describe EU success and the integrity of states within the EU). To compute the scale means for EU1, we recoded reversed items (1, 6, and 8; see the Appendix). In this experiment, we tested the same three hypotheses as in experiment 1.

Results

Descriptive Results

There were neither significant mean differences in answers to each of the items nor mean differences in summarized Likert-type scale values of EU1 and EU2, the highest t(45) = 1.63, p > .05, d = 0.69. The mean values for individual items range in the condition ALL from M(item 8) = 2.30 (SD = 1.91) to M(item 3) = 3.87 (SD = 0.69). In the condition END, the means for individual items range from M(item 8) = 1.92 (SD = 0.83) to M(item 11) = 3.83 (SD = 1.13). In addition, we found no differences in the completion times for the items (logarithmic times; ALL: M = 4.56; SD = 0.22; END: M = 4.5; SD = 0.32; t(45) = 0.75, p > .10, d = 0.22).

No item followed a normal distribution, regardless of the experimental condition (the highest Shapiro–Wilk value was 0.87, p < .001). Regarding the skewness and kurtosis of the items, no systematic patterns that could demonstrate the effect of the experimental manipulation were found. The scale means of EU1 and EU2 factors were normally distributed in the END condition (the highest Shapiro–Wilk value was 0.93, p > .05). In the ALL condition, the scale value of EU2 is close to normality (Shapiro–Wilk = 0.91, p = .05), whereas the value of EU1 is normally distributed (Shapiro–Wilk = 0.92, p > .05). However, in the END condition, significant positive kurtosis of K = 2.79, SE = 0.9 was obtained for the EU2 scale, so that the conditions were rather similar with respect to normality.

Respondents’ Attention

Eye movements were analyzed the same way as in experiment 1. The results were also similar to the results obtained in experiment 1. We found significantly longer fixation times and more fixations for the rating scale area in the ALL form than in the END form (Table 1). However, the labels at the left and right extremes received significantly shorter fixation times and fewer fixations in the ALL form than in the END form. The intermediate labels of the ALL form received longer fixation times and higher fixation counts in comparison to the labels on the extremes (Table 1, note). In particular, the second label “agree to some extent” obtained the longest fixation time and the highest fixation count. The lowest attention was paid to the right end label of the ALL form (“agree fully”). Similar to experiment 1, we found no significant differences between the two conditions for all other areas of the screen.

Similar to experiment 1, we compared the counts of initially and finally read verbal labels in the two conditions. Before answering the first item (initially), 46% fixated on both verbal labels of the END form at least once. In the ALL condition, only one participant (4%) initially fixated on all five labels at least one time, and only six participants (26%) fixated on four labels at least once. After having answered all 11 items, all respondents had read both verbal labels in the END condition. In the ALL condition, 61% had read all five verbal labels. The ratio of initially read labels to the number of verbal labels in a rating scale was significantly lower, t(45) = 2.65, p < .01, d = 0.77, in the ALL condition (M = 0.54, SD = 0.26) than in the END condition (M = 0.73; SD = 0.25). After having answered all items, significantly more respondents had read all labels in the END condition (END: M = 1.0; SD = 0.0) compared to the ALL condition, M = 0.91, SD = 0.12; t(45) = 3.61, p < .001, d = 0.45.
In line with the results of experiment 1, experiment 2 showed that many participants only fixated on some of the verbal labels in the ALL form, even after they had answered all 11 items. In contrast, respondents were more likely to attend to both verbal labels when the END form was presented. These results differ from the expectations of Hypothesis 1. Similar to experiment 1, attending to the ALL form required more time and more fixations, which is in accordance with the expectations of Hypothesis 2.
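The effect sizes that accompany the ANCOVA results in Table 1 can be related to the F statistics. As a minimal sketch, assuming the standard conversion partial η² = F · df_effect / (F · df_effect + df_error), the snippet below recovers the partial η² for the right-end label (fixation length) from the F(1, 44) values listed in Table 1. The function name is illustrative, and small discrepancies for other rows can arise because the published F values are rounded.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)


# F(1, 44) for the right-end label, fixation length (Table 1).
eta_exp1 = partial_eta_squared(18.47, 1, 44)  # experiment 1
eta_exp2 = partial_eta_squared(19.67, 1, 44)  # experiment 2
print(f"{eta_exp1:.2f} {eta_exp2:.2f}")  # prints 0.30 0.31
```

Both values match the partial η² entries reported for these rows (.30 and .31).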

Reliability

The 11 items on opinions toward the EU measure two dimensions (EU1 and EU2, see Experiment 2, Method), so the analyses are separately conducted for each dimension. Comparable to experiment 1, the items within the EU dimensions are heterogeneous (within the single factors they address, e.g., impact of the EU and EU politics; see the Appendix). In addition, there are items with different polarity within the EU1 factor, and, consequently, there are negative correlations between several items (correlations range from r = .29 to r = .61). Applying Guttman’s Lambda 4 (λ4), we calculated a reliability of λ4(EU1) = .70 and λ4(EU2) = .69 (low, but still acceptable coefficients) in the condition ALL. In the condition END, the values are λ4(EU1) = .22 (an unacceptable value) and λ4(EU2) = .64 (a rather low value).

Lord and Novick (1968:97) suggest using Lambda 2 (λ2) if some of the items are negatively correlated. In contrast to λ4, λ2 is not a split-half coefficient, but an internal consistency coefficient like Cronbach’s α. The main difference from Cronbach’s α is that λ2 considers covariances between the items omitting the variances (Guttman 1945). For factor EU1, we also calculated a higher λ2 in the ALL condition, λ2(EU1) = .66, than in the END condition, λ2(EU1) = .46. The fact that the items within the two factors are indeed heterogeneous is also reflected in low Cronbach’s α coefficients. Cronbach’s α for the reversed items 1, 6, and 8 is α(EU1) = .29 and α(EU2) = .62 in the ALL condition as well as α(EU1) = .13 and α(EU2) = .38 in the END condition, respectively, which are all low reliability coefficients (Fisseni 1997). Hypothesis 3 (which predicted higher reliability for the ALL form) is supported by our data, taking into account the heterogeneity of the items and a reversed wording of several items within the EU1 factor.
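How λ2 uses the covariances between items can be sketched as follows. The code assumes Guttman's (1945) definition, λ2 = λ1 + sqrt(k/(k − 1) · Σ of squared between-item covariances) / total variance, where λ1 is the sum of between-item covariances divided by the total variance, and it returns Cronbach's α = k/(k − 1) · λ1 for comparison. The function names and the response data are hypothetical and serve only as an illustration.

```python
import math
from statistics import variance


def covariance(x, y):
    """Sample covariance of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def alpha_and_lambda2(items):
    """Return (Cronbach's alpha, Guttman's lambda-2) for a list of item columns."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    total_var = variance(totals)
    # Covariances between different items only; item variances are left out.
    off_diag = [covariance(items[i], items[j])
                for i in range(k) for j in range(k) if i != j]
    lambda1 = sum(off_diag) / total_var
    alpha = k / (k - 1) * lambda1
    lambda2 = lambda1 + math.sqrt(k / (k - 1) * sum(c * c for c in off_diag)) / total_var
    return alpha, lambda2


# Made-up responses (one list per item, one entry per respondent). The last
# item is deliberately left unreversed so it correlates negatively with the rest.
items = [
    [1, 2, 3, 4, 5, 4],
    [2, 2, 3, 5, 4, 4],
    [1, 3, 2, 4, 5, 5],
    [5, 4, 4, 2, 1, 3],
]
alpha, lambda2 = alpha_and_lambda2(items)
```

Because the covariances enter λ2 as squares, negatively correlated items still contribute to the second term, so λ2 is never smaller than α for the same data.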

Conclusion

The results of experiment 2 are similar to those of experiment 1. Again, we found no differences in means between the ALL and END forms, but we did find differences in measurement quality. The ALL form resulted in a remarkably higher reliability than the END form. Moreover, we found differences in the ways respondents processed the ALL and END forms. In the END condition, respondents attended longer and more often to the verbal labels of the end categories than respondents in the ALL condition. Hence, respondents in the ALL condition made less use of the verbal labels during the response process to understand the range and subranges of the rating scale.

Discussion

The aim of this study was to examine how respondents attend to the information conveyed by verbal labels in rating scales. In particular, we focused on whether respondents map their answers onto the response categories in accordance with the range and frequency model (Parducci 1983), and thus whether they pay attention to each of the verbal labels. We found clear differences in the underlying cognitive processes and in the measurement quality between the ALL and END forms. The results of both experiments are quite similar, despite the different lengths of the rating scales (five vs. seven categories) and the different content of the items. This suggests a certain level of stability of the effects we reported. In both experiments, the ALL form resulted in higher reliability than the END form (similar to previous research, e.g., Krosnick and Fabrigar 1997). The differences in the attention process identified in both experiments can be summarized as follows: In the case of the END form, both verbal labels were attended to more often and for a longer period than the verbal labels on the extremes in the ALL form. In the case of the END form, a considerable proportion of the respondents did not look at the right-end category before they started to answer the first item. In the ALL condition, most respondents did not perceive the whole rating scale (its range and the meaning of subranges) at the beginning of the response process and some of them disregarded this information even at the end of the answering process. However, in the case of five categories, the probability of using all verbal labels during the mapping process seems to be higher than in the case of seven categories. This finding can be partly explained by the different number of items in the multiple item set and by the fact that in experiment 1 (seven categories) most of the items were skewed to the right. These alternative explanations should be addressed in future research.
In sum, our results support the assumption that satisficing behavior with respect to the attention paid to the verbal labels is more dominant than the optimizing behavior predicted by Parducci (for both the ALL and END forms, even if the shortcuts in attention to verbal labels are more pronounced in the case of the ALL form). Our finding that limited attention to the verbal labels of the ALL form was not associated with lower measurement quality might be explained by the fact that in the ALL form the verbal labels provide more clarity and a universal understanding of the subranges in rating scales. Information on subranges conveyed by the intermediate verbal labels in the ALL form might be helpful for respondents, even if they do not consider all information. We assume that the END form (also combined with numbers in experiment 2) conveys too little information, which is not sufficient to unequivocally understand the underlying dimension of measurement, even if this information had been fully considered by the respondents. Considering the results of our experiments, we suggest using the ALL form. However, the results on the attention process pose a challenging question: How can we ease the mapping process required by the ALL form? Especially with regard to the limited attention to the verbal labels shown in experiment 1, it is important to examine whether five-category rating scales would be a preferable alternative. The results showed a sufficient reliability with a fully labeled five-category rating scale, a finding that is in accordance with several previous studies (Birkett 1986; Masters 1974; Rosenstone et al. 1986). However, more research is needed to find out whether five categories are more appropriate than seven categories in rating scales with the ALL form. Given that we did not use a fully crossed experimental design, it is not possible to directly compare the measurement quality of five versus seven categories with our data. This question should be addressed by further research, applying, for example, a three-factorial design that varies verbal and numerical labels and the number of categories. In addition, our results are constrained by the sample size and the laboratory setting of the study. At the same time, conducting this study in the laboratory with payment for participation has probably enforced a rather conscientious information processing style. Consequently, respondents in real surveys may show even more satisficing than the participants in our experiments.

Appendix

Items Used in Experiment 1

To what extent do the following statements about national identity apply?

1. I am proud to be German.
2. In certain circumstances, a dictatorship is the better form of government.
3. National Socialism also had its positive aspects.
4. If it were not for the events of the Holocaust, Hitler would be considered a great statesman today.

Items Used in Experiment 2

And now a few questions about the policy of the European Union (EU). Please indicate to which extent you agree with the statements, using the scale below.

1. In Germany, the social security is weakened by EU regulations.
2. The regions in Europe should preserve their sovereignty.
3. The EU is able to assist in overcoming the current economic crisis.
4. A member state should be able to quit the EU of one’s own accord.
5. The eastward expansion led to an economic upturn in Germany.
6. The eastward expansion endangered job security in Germany.
7. All EU citizens should be able to decide on EU contracts by referendum.
8. The eastward expansion led to an increase of criminal activities in Germany.
9. The introduction of the euro has been a great success so far.
10. The euro should be introduced into all EU states.
11. The EU needs a common foreign and security policy.

Acknowledgments

The authors are grateful to Mirta Galesic, Gregor Caregnato, and Christian Elsner for their help in conducting the study.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no funding for the research, authorship, and/or publication of this article. Data collection was supported by the Max Planck Institute for Human Development, Berlin.

References

Alwin, D. F., and J. A. Krosnick. 1991. The reliability of survey attitude measurement: The influence of question and response attributes. Sociological Methods and Research 20:139–81.
Andrews, F. M. 1984. Construct validity and error components of survey measures: A structural equation approach. Public Opinion Quarterly 48:409–48.
Birkett, N. J. 1986. Selecting the number of response categories for a Likert-type scale. Proceedings of the American Statistical Association 1987 Annual Meetings, Section on Survey Research Methods, American Statistical Association 488–92, ASA, Alexandria, VA. http://www.amstat.org/sections/srms/Proceedings/ (accessed September 10, 2013).
Callender, J., and H. G. Osburn. 1979. An empirical comparison of coefficient alpha, Guttman’s Lambda-2, and MSPLIT maximized split-half reliability estimates. Journal of Educational Measurement 16:89–99.
Churchill, G. A., Jr., and J. P. Peter. 1984. Research design effects on the reliability of rating scales: A meta-analysis. Journal of Marketing Research 21:360–75.
Couper, M. P., F. G. Conrad, and R. Tourangeau. 2007. Visual context effects in web surveys. Public Opinion Quarterly 71:623–34.
Fisseni, H.-J. 1997. Lehrbuch der psychologischen Diagnostik. Göttingen, Germany: Hogrefe.
Galesic, M., R. Tourangeau, M. P. Couper, and F. G. Conrad. 2008. Eye-tracking data: New insights on response order effects and other cognitive shortcuts in survey responding. Public Opinion Quarterly 72:892–913.
Guttman, L. 1945. A basis for analyzing test-retest reliability. Psychometrika 10:255–81.
Krosnick, J. A., and D. Alwin. 1987. An evaluation of a cognitive theory of response-order effects in survey measurement. Public Opinion Quarterly 51:201–19.
Krosnick, J. A., and M. K. Berent. 1993. Comparison of party identification and policy preferences: The impact of survey question format. American Journal of Political Science 37:941–64.
Krosnick, J. A., and L. R. Fabrigar. 1997. Designing rating scales for effective measurement in surveys. In Survey measurement and process quality, eds. L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz, and D. Trewin, 141–64. New York: John Wiley.
Lenzner, T., L. Kaczmirek, and M. Galesic. 2011. Seeing through the eyes of the respondent: An eye-tracking study on survey question comprehension. International Journal of Public Opinion Research 23:361–73.
Lord, F. M., and M. R. Novick. 1968. Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Maitland, A. 2009. Should I label all scale points or just the end points for attitudinal questions? Survey Practice, 04. AAPOR e-journal. http://surveypractice.files.wordpress.com/2009/05/survey-practice-april-20091.pdf (accessed September 10, 2013).
Masters, J. R. 1974. The relationship between number of response categories and reliability of Likert-type questionnaires. Journal of Educational Measurement 11:49–53.
Parducci, A. 1983. Category ratings and the relational character of judgment. In Modern issues in perception, eds. H. G. Geissler, H. F. J. M. Buffart, E. L. H. Leeuwenberg, and V. Sarris, 262–82. Berlin: VEB Deutscher Verlag der Wissenschaften.
Peters, D. L., and E. J. McCormick. 1966. Comparative reliability of numerically anchored versus job-task anchored rating scales. Journal of Applied Psychology 50:92–96.
Rohrmann, B. 1978. Empirische Studien zur Entwicklung von Antwortskalen. Zeitschrift für Sozialpsychologie 9:222–45.
Rosenstone, S. J., J. M. Hansen, and D. R. Kinder. 1986. Measuring change in personal economic well-being. Public Opinion Quarterly 50:176–92.
Saris, W. E., and I. N. Gallhofer. 2007. Design, evaluation, and analysis of questionnaires for survey research. Hoboken, NJ: John Wiley.
Toepoel, V., and D. A. Dillman. 2011. How visual design affects the interpretability of survey questions. In Social and behavioral research and the Internet: Advances in applied methods and research strategies, eds. M. Das, P. Ester, and L. Kaczmirek, 165–90. Oxford: Taylor and Francis.
Tourangeau, R., L. J. Rips, and K. Rasinski. 2000. The psychology of survey response. Cambridge: Cambridge University Press.
Tränkle, U. 1987. Auswirkung der Gestaltung der Antwortskala auf quantitative Urteile. Zeitschrift für Sozialpsychologie 18:88–99.
Weng, L. 2004. Impact of the number of response categories and anchor labels on coefficient alpha and test-retest reliability. Educational and Psychological Measurement 64:956–72.