Why we should not use 5-point Likert scales: The ...

369 downloads 921 Views 178KB Size Report
Subjective quality of life (SQOL) is gaining prominence as a measure of intervention effectiveness. ... the discriminative capacity of most people in terms of their perceived well-being. ..... Life goals, satisfaction, and self-rated health: Preliminary.
Cummins, R.A. & Gullone, E. (2000). Why we should not use 5-point Likert scales: The case for subjective quality of life measurement. Proceedings, Second International Conference on Quality of Life in Cities (pp.74-93). Singapore: National University of Singapore.

Why we should not use 5-point Likert scales: The case for subjective quality of life measurement

Robert A. Cummins School of Psychology Deakin University

Eleonora Gullone and

Department of Psychology Monash University

Key words: Likert scale, history, reliability, sensitivity.

Correspondence: Robert A. Cummins Professor of Psychology Deakin University 221 Burwood Highway Melbourne, Victoria 3125 Australia E-mail: [email protected]

Abstract An argument is presented that the Likert scales commonly employed to measure subjective quality of life (SQOL) are not sufficiently sensitive for the purpose of using SQOL as a measure of outcome. A review of the literature indicates that expanding the number of choice-points beyond 5- or 7-points does not systematically damage scale reliability, yet such an increase does increase scale sensitivity. It is also argued that naming the Likert scale categories detracts from the interval nature of the derived data. As a consequence it is recommended that SQOL be measured using 10-point, end-defined scales.

Acknowledgement I thank Betina Gardner for her assistance in the preparation of this document.

Subjective quality of life (SQOL) is gaining prominence as a measure of intervention effectiveness. This is most evident in the fields of human service delivery and medicine where a very large number of instruments have been devised to measure SQOL in some form. Typically, each instrument will comprise 10 to 20 items, with each item scored on a 3- to 5point Likert scale. While this recognition of personal, perceived well-being as a measure of outcome effectiveness represents a major conceptual advance, it is time to take the forms of instrumentation more seriously. Cummins (1997) lists around 400 instruments that have been devised to measure SQOL or related constructs, and almost all rely on the use of ‘Likert-type’ scales. However, a glance at this collection reveals an apparent absence of rules by which researchers have designed their scales. The number of choice points can vary from two to 100, some use unidimensional scales (e.g. from ‘no satisfaction’ to ‘complete satisfaction), some use bidimensional scales (e.g. from ‘complete dissatisfaction’ to ‘complete satisfaction’), some use a neutral scale mid-point while others do not, some use extreme anchors (e.g. ‘Terrible’) while others use mild anchors (e.g. ‘Dissatisfied’), and so on. Each of these different forms is known to influence the response pattern that people make to Likert scales, and yet there has been almost no attention to such issues in relation to SQOL measurement. At one level it could be argued this is relatively unimportant. The Likert scale has a fairly robust character which has proved to be reliable over a wide variety of forms, and the larger issues concern such matters as the item content of scales related to their construct validity. While such concerns are valid there is one aspect of Likert scale construction which is at least of equal importance when the data are used as measures of outcome. As suggested by Guyatt and Jaeschke (1990), this is the issue of measurement sensitivity, and it is really quite curious that this crucial parameter has been virtually ignored. The typical Likert scale offers 5- or 7-choice points which, of itself, is hardly likely to exploit the discriminative capacity of most people in terms of their perceived well-being. But even this modest array of response choices is actually reduced by a number of factors, some of which are as follows: 1. SQOL is not free to vary over its entire range subject to the contingency of personal circumstance. It is constrained in range by its trait characteristic (see for example McCauley & Bremer, 1991) which causes it to behave as though it is held under homeostatic control. For example, the population mean for life satisfaction, a common measure of SQOL, is held within the normative range of 70-80 percentage of the scale maximum within Western populations (Cummins, 1995, 1998). Consequently, respondents will normally experience SQOL variations that are limited to only a proportion of the available scale. 2. As noted above, SQOL data display a natural negative skew when measured using ordinary unipolar or bidimensional Likert scales (Watson, 1930; Cummins, 1997). Consequently, again the majority of respondents are employing only the higher or positive half of the scale in order to register their judgement. 3. Responses to Likert scales comprise a high degree of response-set. For example, Peabody (1962) found an average correlation of .54±.24 between individuals’ intensity of agreement or disagreement to sets of attitudinal questions. Moreover, he calculated the

relative contribution of response direction (positive or negative around a neutral midpoint) and response intensity (response distance from the neutral mid-point to an end of the scale) to a composite score, and found that intensity contributed only 10 percent to the composite variation. This has important implications for the detection of differences between QOL domains, such as provided by the Comprehensive Quality of Life Scale (Cummins, 1997). Maximum scale sensitivity is required in order to measure such differential levels of intensity. 4. Vokman (1951) first drew attention to the ‘variable series effect’ (see also Upshaw, 1962). Here it is proposed that the stimulus range is a major determinate of the value assigned to items in a series. Essentially, if people are presented with a scale in which their attitude is non-mid-point, then they will subjectively divide the range between the values that they recognize as being consistent with their own attitude range and, as a consequence, exhibit a narrower response range than the presented scale intends. So the conclusion to be drawn is that Likert scales, as currently constructed, are likely to represent very blunt instruments by which to judge change in SQOL. In order to see why this has come about it is necessary to outline some of the history of scale development. A history of Scale Construction Likert (1932) was not the first to obtain subjective ratings on a printed scale, and it is most interesting to note that the early scale developers used far more sensitive scales than we currently employ. Freyd (1923) discusses the various forms of scale available at that time and notes that they tended to be based on 10-point or 100-point formats. For example, the ‘Decile’ scale comprises a number of statements corresponding to different levels of construct ‘strength’ against numbers from 0 – 10. This numbering system is undoubtedly the most intuitive and easy to conceptualize. The traditional counting task for children involves their fingers or toes. It also has the advantage of having a perception of equal psychometric distance between the scale points. This is an essential supposition when such scale are used in combinations with parametric statistics, even though this condition is known to be violated using Likert’s scale to varying degrees (see e.g. Ferguson, 1941). Freyd then introduced his ‘Graphic rating method’ which had the following form: [item] Does he appear neat or slovenly in his dress? Extremely neat and clean. Almost a dude.

Appropriately and neatly dressed.

Inconspicuous in dress.

Somewhat careless in his dress.

Very slovenly and unkempt.

The above scale was intended to be used in conjunction with job interviews, and raters were instructed as follows: “When you have satisfied yourself on the standing of this person in the trait on which you are rating him, place a check at the appropriate point on the horizontal line. You do not have to place your check directly above a descriptive phrase. You may place your check at any point on the line.” (p.88).

He then recommended scoring the responses by dividing the line into 10 or 20 equal intervals. A few years later, Watson (1930) published a similar scale to measure an aspect of SQOL as follows:

Most miserable of all

About three-fourths of the population are happier than you are

The average person of your own age and sex

Happier, on the whole, than threefourths of the population of similar age and sex

Happiest of all

Instructions to respondents were: “Comparing yourself with other persons of the same age and sex how do you feel you should rate your own general happiness? Place a short vertical mark across the line below to indicate about where you belong. Consider your average state over several months.” (p.84) The scale was then scored 0-100. Then, in 1932, Likert produced his scale which had the following form:

Strongly Approve

Approve

Undecided

Disapprove

Strongly Disapprove

While this format is clearly derivative from the previous examples, it importantly and drastically reduces the number of effective choice-points in two ways. First, the scoring system is no longer continuous. Respondents are now required to mark the line only adjacent to one of the labels, making the scoring system 1-5. Second, he has introduced the bidimensional scale with a neutral mid-point. More than six decades have passed since Likert’s formulation was published and it instructive to consider why this original form has remained so popular. The reasons include the type of psychometric investigation to which it has been subjected, the difficulty of generating substantially larger numbers of labeled choice points, and the complex nature of alternative scales. Each of these issues will now be considered with the aim of demonstrating that the 5-point format has survived because it has utility for some types of measurement, but that these have not included the issue of scale sensitivity in the measurement of SQOL. Psychometric properties of the Likert scale The three basic properties of Likert scales are reliability, validity, and sensitivity. And the extent to which psychometric research has concentrated on the former is astonishing. Almost

40 years after the scale had been published, Jacoby and Matell (1971) stated “ … most of the psychometric literature, dealing with the number-of-alternatives problem emphasizes reliability as the major criterion in the number of scale points. However, the ultimate criterion is the effect a change in the number of scale points has on the validity of the scale. An intensive literature search failed to reveal any empirical investigation addressed to this question.” (p.496). While the situation has changed somewhat over the intervening period it is still true that the kinds of psychometric data researchers report in support of their scales are generally restricted to reliability (internal and test-retest) and convergent/divergent validity which is, itself, highly dependent on reliability. This narrow focus has led, inexorably, to the view that smaller, rather than larger number of scale points are advantageous to measurement (e.g. Cronbach, 1946). The reason for this has been twofold. First, early research showed that increased numbers of choice-points means that there is more scope for people to display response-sets (see Cronbach, 1950, for a review), and this scope is minimized in dichotomously scored scales. Second, later research produced considerable divergence of opinion on the merit of shorter vs. longer scales. Given such ambiguity, the most pragmatic choice favors shorter scales since they reduce response time in large surveys where they are most commonly employed. But now it is time to revisit this literature from a different perspective. Not whether evidence can be found to support the use of simple-choice Likert scales in surveys, but whether evidence can be found to support an expanded number of choice points for the purpose of SQOL measurement. In terms of this appraisal there are two pertinent issues. First whether such expansion is likely to enhance measurement sensitivity, and second whether such expansion is detrimental to the other psychometric characteristics of scale reliability and validity. When Cronbach (1946, 1950) railed at the use of multiple-point scales he did so largely in the context of educational tests where response-sets were evident in knowledge-based questions, most particularly where the student did not know the answer and so engaged in such response-sets as acquiescence, by favoring ‘true’ response modes over false. But as Cronbach acknowledged “If a situation is structured for the student, so that he knows the answer required, he responds directly to the content of the item and response sets probably are unimportant.” (1946, p.483). Surely this, then, places the SQOL scales in a quite different context. The SQOL questions are simple, respondents have a view that they can express on the scale provided, and the questions are not knowledge based. And, indeed, the reliability data on non-knowledge based items do seem to indicate that response-sets are generally ‘unimportant’ as will be demonstrated. In terms of sensitivity the issue seems intuitive. Few people would feel their discriminative capacity for perceived well-being to be limited to five levels of experience. Moreover, as indicated earlier, there are a number of scale-construction factors which tend to reduce the effective choice still further. So, it would be expected that the empirical literature would support this view, and it does. Indeed, opinion seems to be unanimous; increasing the number of scale points increases scale sensitivity. Thus, for example, Diefenbach et al. (1993) found a 7-point scale to be more sensitive than a 5-point scale while Russell and Bobko (1992) found that data from a 15-point scale increased regression analysis effect sizes by 93 percent over those from a 5-point scale. As noted by Jaeschke and Guyatt (1990) in the context of medical QOL scales, it seems clear that 5-point scales, at the least, do not provide

sufficient sensitivity to detect small, clinically significant differences. So what, then, are the psychometric impediments to using expanded Likert scales? The main issue here is scale reliability, and it is true that some authors (e.g. Bardo & Yeager, 1982a&b; Bardo, et al., 1982) have concluded that scale reliability decreases as the number of choice-points exceeds two. However, these authors had the specific intention of detecting response-sets with a methodology that involved asking people to respond randomly to presented scales, an approach which has little relevance for SQOL measurement. Another source of data generally favoring simple scales has been derived from computer simulations. For example, Lissitz and Green (1975) determined that Likert-scale reliability increased from 2- to 5-points, but that no further gains were made beyond this point. However, later and larger-scale simulations (Cicchetti, et al., 1985; Jenkins & Taber, 1977) extended this upward range. Using monte carlo methodology these authors found that while the largest changes occurred over the range 2-points to 5/7-points, gradual increases beyond 5/7-points were evident in a range of psychometric parameters including inter-rater reliability and judgement accuracy. Such findings imply no impediment to the generation of more complex scales from consideration of reliability. The conclusions reached by researchers using human-generated data have been more variable. On the minimalist side, Peabody (1962) argued the case that agree/disagree scales should simply be scored dichotomously according to the direction of response. This was based on his finding that, when people responded to attitude scales, a ‘composite score’ derived by combining direction of response (agree or disagree) with the intensity of response (extent of agreement or disagreement) was dominated by the former. In fact, only 10 percent of composite score variation could be attributed to intensity, as opposed to the 70-80 percent attributed to direction. He also found a high degree of response-set within the intensity dimension. While all of this led him to recommend dichotomous scales two factors should be noted. First, he used a 6-point scale and acknowledged that with an increased number of choice-points to 9 or 11 “the reasoning used with regard to six-point scales would imply that the extremeness component might play a more significant role.” (p.72). Second, his data formed near-normal distributions, as opposed to the determined negative skew of SQOL data. In relation to this he acknowledged that a similar calculation would be unreliable since subjects tend to respond in the same direction to nearly all items. Other studies which have argued for fewer numbers of choice points are McKelvie (1978) and Chang (1994). The former compared 5-, 7-, and 11-point scales over two studies, and two rating tasks. In the main no differences were found between the scales on inter-rater reliability (agreement), test-retest reliability, and validity (a tone judgement task). Where effects were found on reliability, they tended to be contradictory, showing quite different trends for raw and transformed data. In terms of validity, the 5-point scale was clearly inferior. Curiously, however, the author concluded the 5-point scale to be superior, despite the weight of empirical evidence to the contrary. Chang (1994) took a more sophisticated approach. Using multitrait-multimethod to separate trait and method variance he found lower internal reliability within a 6-point agree/disagree scale than within a 4-point scale, but no difference between the scales in terms of criterionrelated reliability. The author concludes that increasing the number of scale points creates opportunities for response sets to arise. However, he also acknowledges “The issue of

selecting 4- versus 6-point scales may not be generally resolvable, but may rather depend on the empirical setting.” (p.205). In summary for the negative, none of these authors have presented a strong generalist case against the use of more complex scales in terms of reliability, and certainly not in the context of negatively skewed data. On the other hand, other researchers have found no change in reliability over scales ranging from 2 to 19 points (Test-retest and internal, Matell & Jacoby, 1971) or even 5 to 100 points (Diefenbach, et al., 1993). Still others have found increased rater reliability (the ability of single raters to discriminate differences between the rated stimuli:Bendig, 1954) and test reliability (the consistency of individual differences in the assignment of high and low ratings) when employing from two to five categories (Bendig, 1954), and an increased internal reliability from two- to six-points (Komorita & Graham, 1963). Finally, in a meta-analysis of 131 studies in the marketing research literature, Churchill and Peter (1984) found a positive relationship between internal reliability and the number of scale choice points (mean number of points 5.8±2.3, range 1-20). Their regression analysis revealed that the number of choice points explained five percent of the reliability variance. From all of these data it may be reasonably concluded that increasing the response options beyond 7-points does not systematically detract from scale reliability, a conclusion shared by others (e.g. Russell & Bobko, 1992). So, since many people will have a discriminative capacity that exceeds 7 points, restricting people to such scales results in a loss of potentially discriminative data. This is most particularly relevant in relation to SQOL measurement where the data are skewed. However, there are some potential difficulties in generating expanded scales, and one of these involves the tradition of using category names. The issue of categorical naming Likert (1932) named all of his five categories and the received wisdom is that this is a good idea. Andrews and Withey (1976), authors of the most widely cited document in the QOL literature, endorse category naming because, they suggest, it enhances comparability in the way respondents use the scales, and it makes it possible to know exactly what the respondent was endorsing. However they also note that it is not really true. The use of such terms as ‘good’ does not, in fact, ensure a standardized point of reference. For example, a ‘good’ house has very different meanings depending on a respondent’s socio-economic status, and high ambiguity in the allocation of values to scale categories has been documented for some time (e.g. Jones & Thurstone, 1955). So these reasons do not withstand scrutiny. Not only that, but there are other, powerful reasons to actually avoid category naming and these ideas have not been given the publicity they deserve. The Likert scale makes the assumption that the psychometric distance between categories is equal. This aspect of scale construction is always dutifully portrayed as an equally-spaced visual image, comprising marks on a horizontal line or a series of boxes. It may even be reinforced by a linear numbering system, for example from 1 to 7, with each successive integer corresponding to the next printed category. And then we add category names. The clear implication is that these categorical names exhibit the same internal scaling as the printed scale and numbers suggest. This, however, is wrong, sometimes very wrong, and the

data demonstrating this have been available for a considerable period of time (Cronbach, 1946). Table 1 Estimates of frequency (percent) Category

Never Rarely Very unusual Unusual Seldom Very unlikely Once or twice Infrequently Unlikely Once in a while

A

6.7

B

STUDY C

D

0.2*±.6 5.3±3.6

Maximum Difference

1.4 9.0 22.0

25.6 28.4±11.3 28.9 31.1 31.4±10.4 32.2

Now and then Sometimes Occasionally Pretty often Often

47.8 50.0 55.6 56.7 60.0

Frequently Repeatedly Usually Very probable Almost always

65.6 72.2

Most of the time Always

95.6

33.4±17.0 19.5±15.1

35.0 33.0

16.6 36.1

60.9±13.1

58.5

2.4 81.2±10.4

75.9±9.8

30.0

45.9 82.5±8.8

94.2±4.9 99.5*±2.0

83.1±8.8

16.4

* Participants were told that always = 100% and never = 0%

Table code: A: B: C: D:

Spector (1976) from College students. Kenney (1981) from ‘professionals’. Respondents were told that always = 100%, and never = 0%. Nakao & Axelrod (1983) from ‘non-physicians’. Tavana et al. (1997) from financial strategy experts.

The data in Table 1 indicate the percentage allocated to category labels of ‘frequency’ by respondents in four separate studies. The following observations can be made:

1.

With two exceptions (‘rarely’ and ‘often’) there is considerable discrepancy between the average category values derived from the studies. The worst labels in this regard are ‘occasionally’ and ‘usually’ which differ by 36.1 percent and 45.9 percent respectively.

2.

The relative ordering of terms is not consistent between studies. See, for example, ‘sometimes’ in relation to ‘occasionally’. The relative ordering of terms within studies, however, is always consistent with the presented Likert scale order.

3.

Study B (Kenney, 1981) demonstrates a clear pattern of increasing intra-category variance towards the middle categories. But this is likely to be an artifact of their methodology where participants were told that ‘never = 0%’ and ‘always = 100%’. In the absence of such cuing, Jones and Thurstone (1955) found exactly the opposite result, that the standard deviation of score allocation on a 9-point scale (+4 to –4) increased from the more neutral, central categories (e.g. good) to the extreme categories (e.g. Best of all). Actually, the highest variance was accorded to extreme negative values (e.g. Loath, despise), but perhaps this is because they are unusual words and their relative meaning was uncertain to some respondents.

The conclusion from all of this is obvious. People have such a varying interpretation of the numerical value accorded to frequency category labels, that their application in Likert scales is simply detracting form the interval-nature of such scales. The reason that the intra-study category values always accord with the presented order is that respondents are being cued by the relative position of category labels on the printed scale. Support for this has been provided by Solomon and Kopelman (1984) who found higher scale interval reliability when items were grouped by content and labeled, rather than being randomly distributed. It is also notable that the category labels of ‘frequency’ which form Table I, should provide at least some response cues to their numerical equivalence. This type of cue is far less evident in the category labels used in most SQOL scales, and so the latter are probably even less consistently employed than the frequency data. Some data are available to support this. Ware and Gander (1994) used the Thurstone method of equal-appearing intervals to calculate the following distances between category labels used in the SF-36 (Ware & Sherbourne, 1992) as follows: Poor (1.0), Fair (2.3), Good (3.4), Very Good (4.3), and Excellent (5.0). It can be seen that the distance between the lowest two categories (1.3) is about double that between the highest two categories (0.7). A comparison can also be made between the ratings of these same categories by Spector (1976). In equivalent units he found the separation of ‘poor’ and ‘fair’ to be 2.3, while that between ‘good’ and ‘excellent’ to be 1.1 units. It can be seen that the disparities between low and high adjacent category labels is in the same direction but of even greater magnitude. It can also be noted that the business of naming Likert categories constitutes a severe impediment to the expansion of scales due to the difficulty of finding appropriate categorical names. Consider, for example, the 9-point scale generated by Roy Morgan Research (1993) as: Delighted, very pleased, pleased, mostly satisfied, mixed feelings, mostly dissatisfied, unhappy, very unhappy, terrible. This scale assumes that the psychometric distance from neutrality to ‘pleased’ and to ‘unhappy’ are equivalent. There are no data known to me that support this view and yet such assumptions are forced by the inability of our language to make fine discriminations between affective states. So it can be concluded that the addition of category names to Likert scales not only detracts from the interval nature of the scale but also makes it difficult to generate expanded choice formats. Therefore since expanded Likert

scales are desirable for SQOL measurement, as has been previously argued, the appropriate scale format may be a 10-point, end-defined scale. Justification for a 10-point, end-defined scale The use of end-defined scales was pioneered by Jones and Thurstone (1955). They generated a 9-point scale in relation to food preference, anchored by ‘greatest like/dislike’ (+4/-4), with a central category of ‘Neither like nor dislike’ (0), and with the intermediate categories labeled only by their appropriate integer. So, do such scales produce data that is different from conventionally labeled Likert scales? The answer appears to be in the negative. Matell and Jacoby (1972) used verbally anchored adjective statements related to civic beliefs, with the number of intervening points varying from two to 19. Apart from the fact that testing time increased with >12-point formats, no differences were found on the proportion of scale utilized (>3-points) or in the proportion of ‘uncertain’ responses (>5-points). Similarly, Wyatt and Meyers (1987) using a 5-point scale, and Dixon et al. (1984) using a 6-point scale, found no systematic differences between the data from end-defined and conventional Likert scales. From this it can be concluded that the end-defined format seems not to bias the data in any particular way. It is also interesting to note that an increasing number of recent authors (e.g. Hooker & Siegler, 1993; Watkins, et al., 1998) are using 10-point end-defined scales. Conclusion Likert scales in common use, within the field of SQOL measurement, were devised for a quite different context. Of particular importance in this regard is the fact that SQOL data are negatively skewed, which means that most people will respond only to a restricted portion of the conventional scale. Moreover, when SQOL is used as a measure of outcome, scale sensitivity becomes a critical concern since this construct has a high trait component, and small deviations are highly meaningful. So, it is proposed, the number of choice points needs to be expanded. Such expansion appears not to systematically influence scale reliability, and is therefore psychometrically feasible, but is made difficult by the convention of naming all response categories. It has been argued that this naming is quite unnecessary and actually detracts from the interval nature of the scale. So the solution proposed is to adopt 10-point, enddefined scales. These offer a form of rating (one to ten) which lies within common experience and produce increased sensitivity of the measurement instrument.

References Andrews, F.M., & Withey, S.B. (1976). Social Indicators of well-being: Americans’ perceptions of life quality. New York: Plenum Press. Bardo, J.W., & Yeager, S.J. (1982a). Consistency of response style across types of response formats. Perceptual and Motor Skills, 55, 307-310. Bardo, J.W., & Yeager, S.J. (1982b). Note on reliability of fixed-response formats. Perceptual and Motor Skills, 54, 1163-1166. Bardo, J.W., Yeager, S.J., & Klingsporn, M.J. (1982). Preliminary assessment of formatspecific control tendency and leniency error in summated rating scales. Perceptual and Motor Skills, 54, 227-234. Bendig, A.W. (1954). Reliability of short rating scales and the heterogeneity of the rated stimuli. Journal of Applied Psychology , 38, 167-170. Chang, L. (1994). A psychometric evaluation of 4-point and 6-point Likert-type scales in relation to reliability and validity. Applied Psychological Measurement, 18, 205-216. Cicchetti, D.V., Showalter, D., & Tyrer, P.J. (1985). The effect of number of rating scale categories on levels of interater reliability: A Monte Carlo investigation. Applied Psychological Measurement, 9, 31-36. Cronbach, L.J. (1946). Response sets and test validity. Educational and Psychological Measurement, 6, 475-494. Cronbach, L.J. (1950). Further evidence on response sets and test design. Educational and Psychological Measurement, 10, 3-31. Cummins, R.A. (1995). On the trail of the gold standard for subjective well-being. Social Indicators Research, 35, 179-200. Cummins, R.A. (1997). The Directory of Instruments to measure quality of life and cognate areas of study. Fourth Edition. Melbourne: Deakin University. Cummins, R.A. (1997). The Comprehensive Quality of Life Scale – Adult. Fifth Edition. Melbourne: Deakin University. Cummins, R.A. (1998). The second approximation to an international standard for life satisfaction. Social Indicators Research, 43, 307-334. Diefenbach, M.A., Weinstein, N.D., & O’Reilly, J. (1993). Scales for assessing perceptions of health hazard susceptibility. Health Education Research, 8, 181-192. Dixon, P.N., Bobo, M., & Stevick, R.A. (1984). Response differences and preferences for all-category-defined and end-defined Likert formats. Educational and Psychological Measurement, 44, 61-66. Ferguson, L.W. (1941). A study of the Likert technique of attitude scale construction. Journal of Social Psychology, 13, 51-57. Freyd, M. (1923). The graphic rating scale. Journal of Educational Psychology, 14, 83-102. Guyatt, G.H., & Jaeschker, R. (1990). Measurement in clinical trials: Choosing the appropriate approach. In: B. Spilker (Ed.), Quality of life assessment in clinical trials (pp.37-46). New York: Raven Press. Hooker, K., & Siegler, I.C. (1993). Life goals, satisfaction, and self-rated health: Preliminary findings. Experimental aging research, 19, 97-110. Jacoby, J., & Matell, M.S. (1971). Three-point Likert scales are good enough. Journal of Marketing Research, 8, 495-500. Jaeschke, R., & Guyatt, G.H. (1990). How to develop and validate a new quality of life instrument. In: B. Spilker (Ed.) Quality of life assessment in clinical trials (pp.4757). New York: Raven Press. Jenkins, G.D., & Taber, T.D. (1977). A monte carlo study of factors affecting three indices of composite scale reliability. Journal of Applied Psychology, 62, 392-398.

Jones, L.V., & Thurstone, L.L. (1955). The psychophysics of semantics. Journal of Applied Psychology, 39, 31-36. Kenney, R.M. (1981). Between never and always. New England Journal of Medicine, 305, 1097-8. Komorita, S.S., & Graham, W.K. (1965). Number of scale points and the reliability of scales. Educational and Psychological Measurement, 25, 987-995. Likert, R. (1932). A technique for the measurement of attitudes. Archives in Psychology, 140, 1-55. Lissitz, R.W., & Green, S.B. (1975). Effect of the number of scale points on reliability: A Monte Carlo approach. Journal of Applied Psychology, 60, 10-13. Matell, M.S., & Jacoby, J. (1971). Is there an optimal number of alternatives for Likert scale items? Study 1: Reliability and validity. Educational and Psychological Measurement, 31, 657-674. McCauley, C., & Bremer, B.A. (1991). Subjective quality of life for evaluating medical intervention. Evaluation and the Health Professions, 14, 371-387. McKelvie, S.J. (1978). Graphic rating scale – How many categories? British Journal of Psychology, 69, 185-202. Nakao, M.A., & Axelrod, S. (1983). Numbers are better than words: Verbal specifications of frequency have no place in medicine. American Journal of Medicine, 74, 1061-1065. Peabody, D. (1962). Two components in bipolar scales: Direction and extremeness. Psychological Review, 69, 65-73. Roy Morgan Research (1993). International values audit, 22/23 May. Melbourne: Roy Morgan Research Centre. Russell, C., & Bobko, P. (1992). Moderated regression analysis and Likert scales: Too coarse for comfort. Journal of Applied Psychology, 77, 336-342. Solomon, E., & Kopelman, R.E. (1984). Questionnaire format and scale reliability: An examination of three modes of item presentation. Psychological Reports, 54, 447452. Spector, P.E. (1976). Choosing response categories for summated rating scales. Journal of Applied Psychology, 61, 374-375. Tavana, M., Kennedy, D.T., & Mohebbi, B. (1997). An applied study using the analytic hierarchy process to translate common verbal phrases to numerical probabilities. Journal of Behavioral Decision Making, 10, 133-150. Upshaw, H.S. (1962). Own attitude as an anchor in equal-appearing intervals. Journal of Abnormal and Social Psychology, 64, 85-96. Vokman, P. (1951). In: Upshaw, H.S. (1962). Own attitude as an anchor in equal-appearing intervals. Journal of Abnormal and Social Psychology, 64, 85-96. Ware, J.E., & Gander, B. (1994). The SF-36 health survey: Development and use in mental health research and the IQOLA project. International Journal of Mental Health, 23, 49-73. Ware, J.E., & Sherbourne, C.D. (1992). The MOS 36-item short-form health survey (SF36). Medical Care, 30, 473-481. Watkins, D., Akande, A., Fleming, J., et al. (1998). Cultural dimensions, gender, and the nature of self-concept: A fourteen-country study. International Journal of Psychology, 33, 17-31. Watson, G.B. (1930). Happiness among adult students of education. Journal of Educational Psychology, 21, 79-109. Wyatt, R.C., & Meyers, L.S. (1987). Psychometric properties of four 5-point Likert-type response scales. Educational and Psychological Measurement, 47, 27-35.