Improving the measurement of customer satisfaction: a test of three methods to reduce halo

Introduction

Today, increasing customer satisfaction has become a main focus of many firms seeking to boost repeat business and benefit from positive word-of-mouth, thus increasing long-term profitability. To better manage customer satisfaction, firms spend millions of dollars on tracking it, mostly on an attribute-by-attribute level. One disturbing aspect of studies employing multi-attribute models is the potential existence of halo effects. A number of studies in a variety of disciplines have shown that halo effects between attribute measures are a potential threat to the usefulness of such data (Murphy et al., 1993). Wirtz and Bateson (1995) showed empirically that halo effects can be present in satisfaction data and can severely limit the interpretability of such data. These findings were later replicated in a second empirical study by Wirtz (2000). Halo effects between attributes have been extensively studied in the context of social psychology and personnel management, and a number of potential ways to control or reduce halo effects have been proposed and tested. Despite these findings, no research has yet examined whether these ways of controlling and/or reducing halo can be transferred to the context of customer satisfaction. In this paper, the effectiveness of three methods is tested. The methods were selected because they seem to be effective in other disciplines and, at the same time, could be easily implemented in customer satisfaction studies. The three methods are: (1) measuring attribute satisfaction immediately after consumption rather than with a time delay; (2) using relative rating scales instead of standard agree-disagree or satisfied-dissatisfied scales; and (3) increasing the number of attributes to be evaluated.

Jochen Wirtz

The author
Jochen Wirtz is Associate Professor in Marketing at the National University of Singapore, Faculty of Business Administration, Department of Marketing, Singapore.

Keywords
Customer satisfaction, Service quality, Measurement, Methodology

Abstract
Many firms measure customer satisfaction on an attribute-by-attribute level. Past research has shown that halo errors can pose a serious threat to the interpretability of such data. Examines three factors that potentially reduce halo, using a combination of an experimental and quasi-experimental research design. Three conclusions were drawn. First, measurement after consumption showed less halo than delayed measurement. Second, relative rating scales contained less halo than standard satisfaction scales. Third, an interaction effect was found between the number of attributes to be evaluated and the rating scale used. The evaluation of many attributes reduced halo in comparison to an evaluation of few attributes when a standard satisfaction rating scale was used. However, when the more complex relative rating scale was used, halo was not reduced when subjects had to evaluate a large number of attributes, perhaps due to the increased complexity of the task.

Electronic access
The research register for this journal is available at http://www.mcbup.com/research_registers

The author gratefully acknowledges the research assistance of Loh Kah Lan and Jerome Kho Sze Wee. Furthermore, the author thanks the Service Quality Centre (SQ Centre) in Singapore and David Kwok for facilitating the conduct of the study with the STAR Profiling Service as research context. This research was partially funded by the National University of Singapore.

The current issue and full text archive of this journal is available at http://www.emerald-library.com/ft

Managing Service Quality, Volume 11, Number 2, 2001, pp. 99-111. © MCB University Press. ISSN 0960-4529


Customer satisfaction and halo

Figure 1 The impact of halo on orthogonal attribute evaluations

The concept of multi-attribute models was transferred from the consumer choice literature to the evaluation of individual consumption experiences. Many researchers supported the application of these models to consumer satisfaction (e.g. Churchill and Surprenant, 1982; Day, 1977; Woodruff et al., 1983), and it remains the prevailing conceptualisation of the components in satisfaction models (Oliver, 1997). Management is also able to use the analysis of satisfaction levels with salient attributes as a guiding tool. For example, overall dissatisfaction with a product can be the result of dissatisfaction with one or several product attributes. By identifying the causes of dissatisfaction, managerial action can be taken to improve performance perceptions with regard to these attributes.

Halo effects
Halo effects are distortions of consumer perceptions of attribute-specific properties. Two main kinds of these distortions are discussed in the marketing literature. First, the response to a particular attribute can be influenced by the general impression of the overall object (Beckwith et al., 1978) or its affective overtone (Holbrook, 1983). For example, a strong liking (disliking) of a brand can have a positive (negative) influence on the evaluation of all other attributes of this product. Second, the response to other attributes can be influenced by the evaluation of a dominant attribute (Nisbett and Wilson, 1977). It has been suggested that halo effects simply reflect the individual's tendency to maintain cognitive consistency (Abelson et al., 1968; Holbrook, 1983) and/or to avoid cognitive dissonance (Beckwith et al., 1978). In technical terms, a halo effect is the excess correlation over and above the true correlation of the attributes (Murphy and Jako, 1989).
Halo effects in general assimilate the evaluation of different attributes, flatten the overall profile of evaluations and compress the differences between evaluations of different attribute performances (Murphy et al., 1993). Figure 1 shows a graphical representation of halo between two independent dimensions, the satisfaction scores for attributes ‘‘A’’ and ‘‘B’’. Line 1 shows that the true level of satisfaction for attribute B is independent of attribute A.

Line 2 shows a correlation between the observed evaluations of the two orthogonal dimensions, as would be caused by halo. Now, the observed satisfaction levels for attribute "B" range from dissatisfaction (-2) to satisfaction (+2), depending on the performance of attribute "A". In the real world, of course, many true levels of correlation are not zero, as most products tend to be good (or poor) on a number of attributes. In these cases, halo can further increase the observed level of correlation.

Implications of halo effects on the interpretability of satisfaction scores
At present, much applied research takes attribute-specific data at face value when comparing satisfaction levels of various aspects of a brand or between brands (e.g. attributes of in-flight service such as cabin crew service, food and entertainment). It has been demonstrated in two experimental studies that halo can render those analyses unreliable on at least two levels (Wirtz, 2000; Wirtz and Bateson, 1995). First, absolute satisfaction levels of a particular attribute can be contaminated by one or more halo effects from other attributes. For example, a satisfactory rating for in-flight food can mean:
- the food is all right (no halo); or
- the food is not quite satisfactory, but other aspects of the in-flight service, such as excellent cabin crew service, have pushed up the rating for food (positive halo); or
- the food is actually better than satisfactory, but its rating has been pushed down by one or more ratings of other attributes, such as by poor cabin


crew service or in-flight entertainment (negative halo). As a consequence, unless the direction and magnitude of potential halo effects are known, it would be difficult for a marketing manager to draw conclusions based on observed satisfaction levels. Second, and a logical extension of the first point, comparisons of attribute satisfaction ratings across brands seem to be particularly unreliable. Past experimental studies showed that halo effects can render a comparison of attribute satisfaction scores across products meaningless (Wirtz, 2000; Wirtz and Bateson, 1995). For example, a difference in satisfaction with a particular attribute of airline A versus airline B may be the result of a halo effect or combination of halo effects with any of the other attributes of either brand. Therefore, halo effects do not allow for a reliable interpretation of attribute-specific comparisons between brands. This is a severe limitation, as attribute-specific data are often used precisely for that purpose, such as for competitive comparisons in regularly published reports on the airline industry, in which, among others, the satisfaction levels with various attributes of in-flight service are compared across a number of airlines.
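The excess-correlation view of halo described above can be reproduced in a small simulation (a hypothetical sketch, not the study's data): two attributes that are truly uncorrelated acquire a shared general-impression component, and the observed correlation rises well above the true value of zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# True attribute satisfactions: independent by construction (true r = 0),
# corresponding to the two orthogonal dimensions of Figure 1.
true_a = rng.normal(size=n)
true_b = rng.normal(size=n)

# General-impression halo: part of each observed rating reflects an overall
# affective reaction to the product rather than the attribute itself.
overall = rng.normal(size=n)
halo_weight = 0.8          # illustrative strength of the halo component
obs_a = true_a + halo_weight * overall
obs_b = true_b + halo_weight * overall

true_r = np.corrcoef(true_a, true_b)[0, 1]   # close to 0
obs_r = np.corrcoef(obs_a, obs_b)[0, 1]      # roughly 0.39 in expectation

print(f"true correlation: {true_r:.2f}")
print(f"observed correlation: {obs_r:.2f}   (excess over true = halo)")
```

The gap between the two printed values is precisely the "excess correlation over and above the true correlation" that defines halo in technical terms.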

Propositions on halo reducing methods

Theoretical background on the causes of halo
Halo, first observed by Wells (1907) and later named by Thorndike (1920), has long been regarded as a pervasive form of inadvertent rater judgement bias (Borman, 1977; Cooper, 1981). Over the past 80 years, researchers from a variety of disciplines have studied this phenomenon. The majority of this research has been conducted in the context of social psychology, for example in interpersonal judgement (Murphy and Jako, 1989; Nisbett and Wilson, 1977) and self-assessment (Lay and Jackson, 1969), and in human resource management, such as evaluative judgement in job interviews and performance appraisals (Farh et al., 1991). Within marketing, research has examined the role of halo in the relationship between beliefs and attitudes as well as its role in multi-attribute choice models (e.g. Bagozzi, 1996; Beckwith and Lehmann, 1975; Holbrook, 1983). In marketing, almost all work directed at controlling halo has focused on statistical approaches (e.g. Bemmaor and Huber, 1978; Dillon et al., 1984; Holbrook and Huber, 1979; Huber and James, 1978). However, it seems clear that techniques such as partialling out the overall rating from each attribute rating will not work. These techniques seem to remove the true correlation between attributes as well as any potential halo, and do little to improve the usefulness of the resulting attribute evaluations (Lance and Woehr, 1986; Murphy et al., 1993). Murphy and his colleagues conclude in their review paper that, unless the true correlation between evaluations is known, "it is virtually impossible to measure or even to estimate the level of true halo present in a set of ratings" (p. 220). Therefore, statistical control does not seem to be a fruitful avenue to pursue in the search for methods to control halo in attribute-based satisfaction measures, and alternative methods for reducing and/or controlling halo need to be explored. A review of three such methods is provided in the next section, and propositions on their use in consumer satisfaction are advanced.

A number of methods for reducing halo have been proposed and tested in the areas of social psychology and organizational behavior, and their relevance for customer satisfaction has been discussed in detail elsewhere (Wirtz, 1996). Table I provides a summary of those potentially applicable methods in the context of customer satisfaction measurement. A recent study tested four of these methods (Wirtz and Lim, 1998). The results showed less halo on the responses of highly involved subjects, in the measurement of more rather than less attributes, and when subjects were presented with a developmental purpose of the study versus an evaluative purpose. A hypothesis that several randomised orders of attributes decreased halo was inconclusive. The present study focuses on two further halo reduction methods, which have not yet been tested in the context of customer satisfaction. They are: (1) measuring attribute satisfaction immediately after consumption; and (2) using relative rating scales.
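The earlier point that partialling the overall rating out of each attribute rating strips true correlation along with halo can be made concrete with a hypothetical simulation. The setup and values below are illustrative only, and the overall rating is deliberately an extreme case (the exact average of the two attributes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Two attributes sharing a genuine (true) correlation plus a halo component.
common = rng.normal(size=n)               # source of true correlation
halo = rng.normal(size=n)                 # general-impression halo
a = common + rng.normal(size=n) + halo    # observed rating, attribute A
b = common + rng.normal(size=n) + halo    # observed rating, attribute B
overall = (a + b) / 2                     # overall rating: here, the exact average

def partial_corr(x, y, z):
    """Correlation of x and y after partialling out z (correlation of residuals)."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

r_observed = np.corrcoef(a, b)[0, 1]      # true correlation inflated by halo (~0.67)
r_partial = partial_corr(a, b, overall)   # ~ -1: the true correlation is removed too

print(f"observed r = {r_observed:.2f}, after partialling = {r_partial:.2f}")
```

Rather than isolating halo, partialling out the overall rating here destroys the genuine association entirely, consistent with the conclusion that statistical control is not a fruitful avenue for handling halo.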


Table I Attributes in the "few" and "many" attribute conditions

Few attributes                       Many attributes                                           Symbol
Realism of stories in video clips    Realism of stories in video clips                         SReal
Approachability of staff             Approachability of staff                                  SStaff
Time duration of service             Time duration of service                                  STime
User-friendliness of program         User-friendliness of program                              SUser
Design of waiting area               Design of waiting area                                    SWaiting
                                     Clarity of instructions on screen                         SInstruc
                                     Response time from touch of screen to computer reaction   SRespons
                                     Design of computer room                                   SRoom
                                     Sound system                                              SSound
                                     Visual appeal of video clips                              SVisual

Note: The symbols for the attribute-specific perceived performance and disconfirmation-of-expectations measures have the prefixes "P" and "D", respectively, instead of the prefix "S" shown here for the satisfaction measures

A third method is the increase of the number of attributes to be evaluated, which was tested again, aiming to replicate the Wirtz and Lim (1998) findings. Each of the three methods included in the present study is discussed in the context of consumer satisfaction in the following paragraphs.

Time of measurement
There is evidence that memory-based evaluations are subject to systematic distortions, which result in high levels of halo. Specifically, substantial delays between observation and judgment increase the likelihood that raters will rely on global impressions in making specific judgments (Murphy and Balzer, 1986; Murphy and Reynolds, 1988; Murphy and Anhalt, 1992; Nathan and Lord, 1983). In a study by Murphy and Balzer (1986), ratings in a delayed rating condition showed significantly higher levels of correlation between attributes than did ratings obtained immediately after exposure. In another study, on personality assessments, Schweder and D'Andrade (1979) found that halo was higher when ratings were based on older observations than when they were obtained concurrently with observations. It seems reasonable to suggest that a delay between consumption and measurement of satisfaction can also lead to increased halo. Conversely, measurement immediately after consumption should show less halo, as it avoids evaluations being retrieved from (long-term) memory, which may cause general impression halo:
H1: Immediate measurement after consumption shows less halo than delayed measurement.

Relative rating scales
In the area of psychological construct measurement there are many examples of scaling methods that were explicitly developed to reduce measurement errors such as halo (Cooper, 1981). Among the more prominent ones are behaviourally anchored rating scales (Smith and Kendall, 1963) and forced-choice scales (Bartlett, 1983; King et al., 1980). It seems feasible to adopt the principles of anchoring and comparing attributes not only, as is currently done, in measuring consumer choice variables such as the relative importance of attributes and the attractiveness of various performance levels of those attributes, but also in satisfaction research. It has been suggested in social psychology that scales which instruct raters to make comparisons among the different dimensions of an object or person to be evaluated can decrease halo (Bartlett, 1983). Such scales should work particularly well in suppressing halo that results from inadequate discrimination between attributes and/or the influence of salient attributes:
H2: Relative rating scales show lower levels of halo than standard rating scales.

Number of attributes
Murphy et al. (1993, p. 222) suggest that "halo errors seem most likely when there are only a few dimensions, each of which is highly relevant to one's overall evaluation, and less likely when there are many dimensions, several of which are apparently unrelated to overall performance." This argument is advanced on the basis that having more


attributes should force the rater to exert more cognitive effort in the rating process than in the case of few attributes. The extra cognitive effort expended is proposed to help evaluators better distinguish the differences between attributes, and thereby lead to less halo in their attribute-specific judgments. (The potential increase in the number of attributes is limited: as the number increases, fatigue will gradually set in, leading to less cognitive effort being exerted on each individual item just to finish the questionnaire quickly. Fatigue and the reduced level of cognitive effort may increase halo.)
H3: Measurement of a larger number of attributes shows less halo than measurement of a small number of attributes.

Research design

Method and research context
A 2 × 2 × 2 between-subject factorial design was employed, in which the resultant eight cells were defined by the combinations of time delay after the service experience (immediate versus one to six months' delay), type of rating scale (standard versus relative), and number of attributes measured (five versus ten). The type of rating scale and number of attributes were manipulated in a true experimental design, while the time frame was examined using a quasi-experimental approach. A field setting was used. Specifically, a multi-media service, the Service Traits and Attitudinal Response (STAR) Profiling, provided by the Service Quality Centre (SQ Centre) in its training centre in Singapore, was chosen as the research context. The STAR Profiling service assesses service traits and attitudes of front-line staff, mainly for developmental purposes. The study asked customers (i.e. front-line staff) who went through a STAR Profiling session to rate their satisfaction with this service. The STAR Profiling service was selected for two reasons:
(1) The performance of the multi-media based service is the same for each respondent, which provides relatively standardised service experiences across customers.
(2) The service has a wide enough performance domain, which facilitates

the "number of attributes" manipulation to be realistically operationalized.

Manipulations and measures
The time frame of rating condition was manipulated through the administration of questionnaires to customers who experienced the service at different points in time, namely immediate satisfaction measurement after consumption, and measurement with a delay of one to six months. Type of rating scale and number of attributes measured were manipulated through the use of different questionnaire versions. The type of scale (relative rating scale versus standard satisfaction scale) was manipulated by using two different questionnaire versions. The standard rating scale employed is a widely used satisfaction measure and consists of a one-item seven-point Likert-type scale ranging from "extremely satisfied" to "extremely dissatisfied". The relative rating scale has been used in the marketing choice literature, especially in studies using self-explicated conjoint tasks to measure relative attractiveness (e.g. Green and Srinivasan, 1990), but not yet in satisfaction studies. For the purpose of this study, the relative rating scale was adapted to the context of consumer satisfaction. It uses a sequential comparison of attribute satisfactions, with scores ranging from "10 - most satisfied" to "0 - not at all satisfied". The Appendix shows the relative rating scale in the five-attribute condition. To identify what attributes to measure, existing customer feedback forms were reviewed and interviews with service employees and customers were conducted. An initial list of 15 attributes was obtained and then reduced to ten, based on the consideration that the final list of attributes should capture all important aspects of the service. The final set of attribute-specific satisfaction items was developed and pretested to ensure that all items were easily understood. Further, five attributes were selected from the ten, based on their importance ratings in a pretest. The selection by perceived importance is consistent with practice in marketing research, which often focuses on key attributes to keep questionnaires short. The manipulation of number of attributes was operationalized on two levels:
(1) few (five attributes); and
(2) many (ten attributes).


Table I shows the two categories of attributes. Disconfirmation-of-expectations and perceived performance measures were included for all attributes to allow examination of the nomological validity of the attribute-specific satisfaction measures. Disconfirmation-of-expectations was measured using Oliver's (1980) one-item semantic differential scale ranging from "better than expected" to "worse than expected". Perceived performance was measured with a widely used one-item seven-point semantic differential scale anchored in a positive and a negative performance description, such as "the instructions on the screen were very clear" to "not clear at all".

Procedure
To avoid demand effects, this study was presented to respondents as part of SQ Centre's customer satisfaction programme, rather than as a research project. In the immediate rating context, the four questionnaire versions (few attributes/standard scale, few attributes/relative scale, many attributes/standard scale, and many attributes/relative scale) were randomly distributed to customers upon the completion of their service experience. The forms were collected within 30 minutes after the completion of the service. In the delayed condition, lists of customers who used the service during the past six months were generated from SQ Centre's database. As these customers came from different companies, consent was sought from their employers to administer the survey. This was followed by a randomised distribution of the four questionnaire versions to these customers, either via interoffice mail or hand delivery. These questionnaires contained a cover letter that closely followed the introduction used in the case of the immediate rating. Respondents took on average three to five days to return the questionnaires. A total of 264 questionnaires were collected. Of the sample, 2.1 per cent of all subjects were 20 years of age or below, 44.2 per cent were between 21 and 30, 36.8 per cent between 31 and 40, and 16.9 per cent were 41 years of age or more. A total of 79.2 per cent were female and 20.8 per cent male. Further, 41.6 per cent had GCE "O" levels (the local equivalent of the UK's Cambridge ordinary levels) or below as their highest educational qualification, 26.3 per cent had GCE "A" levels (the local equivalent of the UK's Cambridge advanced levels), 24.3 per cent had a diploma or a university degree, and 7.8 per cent had other highest levels of education (e.g. professional qualifications).

Data analysis

Validity of measures
The nomological validity of all attribute-specific satisfaction measures was examined using two relationships predicted by the disconfirmation-of-expectations model. The model predicts that each performance measure of an attribute should correlate more highly with the disconfirmation-of-expectations measure of the same attribute than with the disconfirmation measures of all other attributes. Furthermore, the disconfirmation measure of each attribute should correlate more highly with the satisfaction measure of the same attribute than with the satisfaction measures of all other attributes. As can be seen in Table II, with the exception of DInstruc, all attribute-specific measures behaved according to the two predictions and show high nomological validity. In the case of DInstruc, its correlation with SInstruc (r = 0.55) was marginally lower than with SUser (r = 0.59). This may be explained by the fact that clarity of screen instructions (SInstruc) and user-friendliness (SUser) can be considered closely related dimensions of a multi-media service offering. The high observed correlation may therefore reflect a true high correlation between the two attributes, and it was decided to retain DInstruc for further analysis.

Findings
As the objective of this study was to test the effectiveness of various halo-reducing methods, the analyses centre on comparisons of inter-item correlation coefficients between the experimental conditions (cf. Murphy and Balzer, 1986). This assumes that the observed correlations contain halo in addition to the "true" levels of underlying correlations between attributes. The limitations associated with this assumption are discussed in the limitations section at the end of this paper. All individual inter-item correlations were transformed to z-scores using Fisher's r-to-z


Table II Correlation of attribute-specific measures

           PInstruc  PReal  PRespons  PRoom  PSound  PStaff  PTime  PUser  PVisual  PWaiting
DInstruc     0.57    0.41     0.42    0.42    0.55    0.31   0.33   0.54    0.50     0.25
DReal        0.36    0.54     0.36    0.38    0.33    0.20   0.39   0.38    0.42     0.31
DRespons     0.23    0.26     0.58    0.24    0.27    0.19   0.30   0.31    0.20     0.28
DRoom        0.43    0.43     0.34    0.76    0.44    0.27   0.30   0.33    0.48     0.41
DSound       0.33    0.28     0.29    0.30    0.56    0.14   0.21   0.35    0.40     0.28
DStaff       0.38    0.16     0.46    0.39    0.40    0.47   0.31   0.25    0.40     0.39
DTime        0.23    0.34     0.41    0.36    0.35    0.28   0.54   0.25    0.30     0.33
DUser        0.40    0.39     0.36    0.40    0.53    0.29   0.34   0.55    0.38     0.37
DVisual      0.50    0.50     0.41    0.43    0.51    0.25   0.38   0.38    0.58     0.22
DWaiting     0.24    0.25     0.27    0.34    0.25    0.30   0.28   0.27    0.23     0.63

           DInstruc  DReal  DRespons  DRoom  DSound  DStaff  DTime  DUser  DVisual  DWaiting
SInstruc     0.55    0.32     0.27    0.31    0.37    0.33   0.28   0.42    0.45     0.31
SReal        0.36    0.51     0.20    0.33    0.23    0.14   0.34   0.35    0.48     0.22
SRespons     0.38    0.29     0.53    0.20    0.34    0.31   0.32   0.37    0.33     0.27
SRoom        0.28    0.33     0.23    0.54    0.31    0.25   0.29   0.33    0.30     0.32
SSound       0.48    0.32     0.28    0.30    0.55    0.21   0.26   0.42    0.38     0.31
SStaff       0.26    0.26     0.34    0.29    0.26    0.49   0.31   0.30    0.23     0.35
STime        0.26    0.29     0.41    0.29    0.34    0.25   0.51   0.39    0.22     0.37
SUser        0.59    0.36     0.26    0.36    0.47    0.30   0.26   0.56    0.40     0.38
SVisual      0.54    0.33     0.26    0.42    0.43    0.32   0.23   0.50    0.58     0.36
SWaiting     0.30    0.27     0.31    0.32    0.23    0.37   0.31   0.26    0.20     0.60

           SInstruc  SReal  SRespons  SRoom  SSound  SStaff  STime  SUser  SVisual  SWaiting
SInstruc     1.0
SReal        0.48    1.0
SRespons     0.45    0.25     1.0
SRoom        0.41    0.31     0.42    1.0
SSound       0.55    0.37     0.45    0.56    1.0
SStaff       0.47    0.32     0.40    0.52    0.49    1.0
STime        0.33    0.31     0.52    0.43    0.43    0.42   1.0
SUser        0.53    0.46     0.44    0.37    0.56    0.41   0.38   1.0
SVisual      0.58    0.41     0.48    0.46    0.61    0.40   0.34   0.69    1.0
SWaiting     0.33    0.26     0.32    0.49    0.41    0.59   0.49   0.33    0.36     1.0

Note: The diagonal entries of the first two panels (italicised in the original) are the correlations between performance and disconfirmation measures of the same attribute, and between disconfirmation and satisfaction measures of the same attribute, respectively

transformation to test for significant differences between coefficients.
H1 predicted that measurement of attribute-specific satisfaction immediately after consumption shows lower halo than delayed measurement. Table III shows the correlation coefficients of the attributes measured immediately after consumption and those measured with a time delay. Six out of the ten item pairs are in the anticipated direction, and the only item pair that differs significantly between the two conditions is also in the hypothesised direction. The average inter-item correlation dropped from 0.37 to 0.31 when immediate rather than delayed measurement of satisfaction was conducted. This drop is only marginally significant at the 0.06 level (one-tailed). In conclusion, these findings tentatively support H1.
H2 predicted that relative rating scales show lower levels of halo than standard rating scales. The data provide support for H2. Eight of the ten item pairs are in the predicted direction, and all five significant differences are also as hypothesised. The average inter-item correlation dropped from 0.47 to 0.34 (p < 0.001) when a relative rating scale was


used instead of the standard scale, supporting H2.
H3 predicted that halo is lower when a large number of attributes is measured rather than a small number. Table III shows the correlation coefficients of the five attributes that were included in both experimental conditions of few (five) and many (ten) attributes. Six out of the ten item pairs are in the anticipated direction. In addition, the one item pair that differs significantly between the two conditions is also lower in the many-attributes condition. These findings seem to provide support for H3. However, the average inter-item correlation hardly drops when the larger number of attributes is measured. Although in the hypothesised direction, it drops only from 0.40 to 0.39 (n.s. at the 10 per cent level), rejecting H3.

Interaction effects
Potential interaction effects between all experimental conditions were examined, as it

Table III Comparison of inter-item correlation across experimental conditions

Time of measurement (H1)
                   Delayed                          Immediate
Attributes         1     2     3     4     5        1     2     3     4     5
1 SReal            1.0                              1.0
2 SStaff           0.31  1.0                        0.11  1.0
3 STime            0.30  0.36  1.0                  0.19  0.46  1.0
4 SUser            0.51* 0.37  0.38  1.0            0.11* 0.40  0.30  1.0
5 SWaiting         0.19  0.56  0.47  0.25  1.0      0.13  0.56  0.46  0.42  1.0
Average
intercorrelation   0.37**                           0.31**
Notes: No. of item pairs supporting H1 = six out of ten; number of significant item pairs supporting H1 = one out of one

Type of rating scale (H2)
                   Standard                         Relative
Attributes         1     2     3     4     5        1     2     3     4     5
1 SReal            1.0                              1.0
2 SStaff           0.38  1.0                        0.31  1.0
3 STime            0.41* 0.38  1.0                  0.20* 0.47  1.0
4 SUser            0.62* 0.54* 0.49* 1.0            0.28* 0.26* 0.26* 1.0
5 SWaiting         0.29  0.59  0.53  0.43* 1.0      0.28  0.66  0.48  0.21* 1.0
Average
intercorrelation   0.47*                            0.34*
Notes: No. of item pairs supporting H2 = eight out of ten; number of significant item pairs supporting H2 = five out of five

Number of attributes (H3)
                   Few                              Many
Attributes         1     2     3     4     5        1     2     3     4     5
1 SReal            1.0                              1.0
2 SStaff           0.37  1.0                        0.27  1.0
3 STime            0.31  0.37  1.0                  0.30  0.47  1.0
4 SUser            0.54* 0.49  0.40  1.0            0.34* 0.33  0.35  1.0
5 SWaiting         0.19  0.58  0.43  0.35  1.0      0.34  0.59  0.55  0.33  1.0
Average
intercorrelation   0.40                             0.39
Notes: No. of item pairs supporting H3 = six out of ten; number of significant item pairs supporting H3 = one out of one; the italics in the original indicate correlation coefficients that are lower, as hypothesised, in the respective experimental condition; * indicates a significant difference between coefficients at p < 0.05 (two-tailed); ** indicates that the difference in average intercorrelation is marginally significant (p = 0.06, one-tailed) and in the hypothesised direction


is of interest whether the halo-reducing methods are simply additive and can all be applied at the same time, or whether potential interaction effects prescribe a more selective mix of halo-reducing methods. Further correlation analyses revealed the presence of an interaction effect between type of rating scale and number of attributes. No other interaction effects were observed. As Table IV and Figure 2 show, the type of rating scale exerts a different impact on halo in the two number-of-attributes conditions. In the case of the standard scale, as hypothesised, halo was lower in the many-attributes condition than in the few-attributes condition (p = 0.06), replicating the findings of an earlier study by Wirtz and Lim (1998). However, in the relative scale condition no significant impact was observed, and the absolute difference was opposite to the hypothesised direction. This implies that H3 is partially confirmed: many attributes reduce halo as long as they are measured with a standard scale, and not with a relative scale. Implications of this finding are discussed in the Summary findings and conclusions section.
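The pairwise comparisons of correlation coefficients reported in this analysis rest on Fisher's r-to-z transformation. A minimal sketch follows; the two correlations match the reported H2 averages, but the cell sizes are illustrative assumptions, not the study's actual figures.

```python
import math

def fisher_z(r):
    """Fisher r-to-z transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def compare_correlations(r1, n1, r2, n2):
    """Two-tailed z-test for the difference between two independent correlations."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    p = 1 - math.erf(abs(z) / math.sqrt(2))   # two-tailed p from the normal CDF
    return z, p

# e.g. an average inter-item correlation of 0.47 (standard scale) versus
# 0.34 (relative scale), with hypothetical subsamples of 130 each
z, p = compare_correlations(0.47, 130, 0.34, 130)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With larger samples, or when the difference is averaged over many item pairs as in the study, the same transformation yields correspondingly sharper tests.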

Summary findings and conclusions

Many firms measure customer satisfaction on an attribute-by-attribute level. Past research has shown that halo errors can pose a serious threat to the interpretability of such data. The present study provides insights into the effectiveness, in the consumer satisfaction context, of halo-reducing methods that had previously been developed mainly in social psychology and organisational behaviour. Our findings show that, as proposed in H1, immediate measurement after consumption produces less halo than delayed measurement. This finding closely replicates studies in social psychology (e.g.

Murphy and Balzer, 1986; Nathan and Lord, 1983). The findings also support the hypothesis that relative rating scales show lower halo than standard rating scales (H2). This suggests that relative rating scales are superior to standard scales in reducing halo in satisfaction measures, and that the principles derived from behaviourally anchored rating scales and forced-choice measurement in social psychology are also applicable in the satisfaction context (cf. Bartlett, 1983).

The measurement of many attributes rather than few showed only an insignificant main effect in the hypothesised direction, and H3 was rejected. However, a significant interaction effect was found between number of attributes and type of rating scale. When the standard rating scale was used, measuring many rather than few attributes reduced halo as hypothesised. This replicates the results of Wirtz and Lim (1998), who also used standard rating scales. When the more complex relative rating scale was used, on the other hand, no significant effect was observed (an insignificant increase was found, opposite to the prediction of H3). One explanation is that evaluating more attributes on a more complex rating scale was cognitively too demanding a task: subjects may have been less able to distinguish among the different dimensions than their counterparts in the few-attribute condition, and as a result halo was not reduced. This finding is interesting, as it shows that the mix of halo-reducing methods to be employed needs to be examined carefully for potential interaction effects. The effects of halo-reducing methods may not be simply additive. For example, the combined complexity of using many attributes and relative rating scales may reduce respondents' ability to discriminate among the various attributes and lead to higher rather than lower halo.

Table IV  Interaction effect between type of rating scale and number of attributes

                                  Number of attributes
Type of rating scale     Few       Many     p        Anticipated direction
Standard                 0.50      0.42     0.06     Yes
Relative                 0.31      0.38     ns       No
p                        < 0.001   ns
Anticipated direction    Yes       Yes

Notes: Decimals represent average intercorrelations of the same five attribute-specific satisfaction measures for the respective conditions; significance levels are two-tailed, ns at the 10 per cent level

Figure 2  Interaction effect between type of rating scale and number of attributes
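Why a shared "general impression" inflates attribute intercorrelations can be illustrated with a small simulation: ratings are generated as a weighted mix of a common impression component and attribute-specific components, and the average intercorrelation rises with the weight of the shared component. This is an illustrative sketch, not the study's data-generating model:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulated_halo(halo_weight, n_respondents=500, n_attributes=5):
    """Simulate attribute ratings as halo_weight * shared general impression
    plus (1 - halo_weight) * attribute-specific signal, then return the
    average intercorrelation (the halo index)."""
    impression = rng.normal(size=(n_respondents, 1))            # shared component
    specific = rng.normal(size=(n_respondents, n_attributes))   # attribute-specific
    ratings = halo_weight * impression + (1 - halo_weight) * specific
    corr = np.corrcoef(ratings, rowvar=False)
    return corr[np.tril_indices(n_attributes, k=-1)].mean()

print(f"weak halo:   {simulated_halo(0.2):.2f}")   # mostly attribute-specific signal
print(f"strong halo: {simulated_halo(0.8):.2f}")   # impression dominates the ratings
```

In this framing, halo-reducing methods work by shrinking the effective weight of the shared impression component relative to attribute-specific information.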

Managerial implications

Halo has a number of detrimental consequences: it can render the interpretation of attribute-specific satisfaction data meaningless, obscure the identification of product strengths and weaknesses, and make attribute-specific comparisons across brands unreliable. By adopting halo-reducing techniques, market researchers can curtail these problems and thereby reduce the risks that stem from misinterpreting attribute-specific data. Investments in satisfaction measurement and product improvement are hence likely to be guided by more accurate feedback, which in turn can lead to better marketing decisions. For example, managers would be better able to judge whether a poorly rated attribute represents a real product weakness or merely an evaluation contaminated by halo. Given that at least some of the available halo-reducing methods can be easily incorporated into research designs, market researchers should consider their adoption.

Three specific implications can be drawn from this study. First, researchers should re-examine the currently frequent practice of surveying respondents whose last consumption experience may date back from a few days to even several months. In many cases, these respondents simply may not recall their experience correctly and are thus not capable of providing accurate feedback on attribute satisfaction levels; general impression halo is likely to obscure attribute-specific data. Where budgets and time permit, satisfaction should be measured immediately after consumption to reduce the magnitude of potential halo biases.

Second, market researchers should consider the use of relative rating scales, which were shown to be the most effective of the three halo-reduction measures tested in this study. Care must be taken, however, that the number of attributes included in the research design does not create too complex a rating task; as the findings show, relative scales work better when few rather than many attributes are measured.

Third, it is recommended to incorporate more attributes into research designs. This encourages respondents to expend more cognitive effort in distinguishing between the different attributes, which can reduce the amount of halo error. This recommendation is also consistent with product managers' frequent desire to capture satisfaction ratings across a large number of attributes. However, as the study results show, when many attributes are used, relative rating scales should not be employed at the same time. Trade-offs between the type of scale used and the number of attributes measured should thus be made in an attempt to reduce halo.

This study pioneered the use of halo-reduction methods in a consumer satisfaction context, and largely found that what had previously been shown to work in fields such as social psychology and organisational behaviour also applies in our context. It therefore seems likely that other methods tested in those fields, but not examined in the present study, may also be useful in reducing halo in customer satisfaction studies. We summarise potentially interesting halo-reduction methods in Table V, and briefly discuss their ease of implementation as well as potential advantages and drawbacks for application in applied customer satisfaction research.

Table V  Methods for reducing halo in attribute-specific satisfaction measures

(1) Increase the number of attributes measured (Murphy et al., 1993)
    + Insights into more attributes
    - Longer questionnaire
(2) Use forced-choice or relative rating scales (e.g. Bartlett, 1983)
    - More complex to answer, and not all respondents may be able to handle this format well
    - Respondents take more time to answer
(3) Measure satisfaction immediately after consumption (e.g. Murphy and Balzer, 1986)
    - Data collection may be more expensive and time consuming
(4) Use respondents who are able (experienced and knowledgeable) enough to evaluate the product; alternatively, control for respondents' ability through measurement (e.g. Murphy et al., 1993)
    - May bias the sample towards heavy users who are not representative of the customer base, and/or longer questionnaire if ability is measured
(5) Increase respondents' willingness to evaluate the various attributes by stressing a developmental rather than an evaluative purpose of the evaluation (e.g. Banks and Murphy, 1983), and by avoiding respondents with low product involvement or controlling for involvement through separate measurement (e.g. Banks and Murphy, 1985)
    + Portrays the firm as responsive to customer wishes and may improve its image
    - Bias towards higher-involvement customers, and/or longer questionnaire if involvement is measured
(6) Control for affective overtones or affective reactions towards the product by measuring affect (e.g. Holbrook, 1983; Bagozzi, 1996)
    - Longer questionnaire
    - More complex data analysis (structural equation modelling)
(7) Randomise the order of attributes within the same study (e.g. Judd et al., 1991)
    - More expensive production and administering of the questionnaire
    - Slightly more complex data analysis

Notes: Each method was also rated for ease of implementation across consumer goods (fast-moving consumer goods; durables) and consumer services (customer comes to the service firm; service firm goes to the customer; transactions at arm's length); the classification of services was adopted from Lovelock (1983), and the table was adapted from Wirtz (1996)

Further research and limitations

Although the various halo-reduction methods listed in Table V have been demonstrated to work in other contexts, it would be good to demonstrate empirically that they apply equally well to the consumer satisfaction context. Furthermore, as with any research, this study is subject to a number of limitations that can be addressed in future research. First, the study was conducted in a field setting, which increases the authenticity of the procedure and reduces demand effects, but compromises internal validity; future studies may use true experimental designs to overcome this issue. Second, the relative satisfaction scale, which anchors the most satisfactory attribute at the highest level of satisfaction (a score of 10 on a 0-10 scale), cannot be used when products or services with lower overall satisfaction levels are to be evaluated. To address this shortcoming, an "absolute" measurement of the most satisfactory attribute may be obtained using a standard scale and all other relative ratings readjusted accordingly, or overall satisfaction measures may be used to readjust the relative attribute-specific ratings.
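The first readjustment approach described above, obtaining an absolute standard-scale rating for the most satisfactory attribute and rescaling the remaining relative ratings against it, can be sketched as follows; the attribute names and scores are hypothetical:

```python
def readjust_relative_ratings(relative, anchor_absolute, relative_max=10):
    """Rescale relative attribute ratings onto an absolute satisfaction scale.

    relative: dict mapping attribute -> relative score, where the most
        satisfactory attribute is fixed at relative_max (10 in the
        appendix's relative scale).
    anchor_absolute: a standard-scale ('absolute') rating obtained
        separately for that most satisfactory attribute.
    """
    scale = anchor_absolute / relative_max
    return {attribute: round(score * scale, 2) for attribute, score in relative.items()}

# Hypothetical respondent: most satisfied with the waiting area (relative 10),
# which they rate 8 on an absolute 0-10 standard scale
relative = {"waiting_area": 10, "staff": 8, "video_clips": 5,
            "duration": 4, "program": 7}
print(readjust_relative_ratings(relative, anchor_absolute=8))
# -> {'waiting_area': 8.0, 'staff': 6.4, 'video_clips': 4.0, 'duration': 3.2, 'program': 5.6}
```

The linear rescaling preserves the respondent's relative ordering of attributes while allowing comparisons of absolute satisfaction levels across products with different overall satisfaction.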

References

Abelson, R.P., Aronson, E., McGuire, W.J., Newcomb, T.M., Rosenberg, M.J. and Tannenbaum, P.H. (1968), Theories of Cognitive Consistency: A Sourcebook, Rand McNally, Chicago, IL.
Bagozzi, R.P. (1996), "The role of arousal in the creation and control of the halo effect in attitude models", Psychology and Marketing, Vol. 13 No. 3, pp. 235-64.
Banks, C.G. and Murphy, K.R. (1983), "What's the difference between valid and invalid halo? Forced-choice measurement without forcing a choice", Journal of Applied Psychology, Vol. 68, pp. 218-26.
Bartlett, C.J. (1983), "What's the difference between valid and invalid halo? Forced-choice measurement without forcing a choice", Journal of Applied Psychology, Vol. 68, pp. 218-26.
Beckwith, N.E. and Lehmann, D.R. (1975), "The importance of halo effect in multi-attribute models", Journal of Marketing Research, Vol. 12, pp. 265-75.
Beckwith, N.E., Kassarjian, H.H. and Lehmann, D.R. (1978), "Halo effects in marketing research: review and prognosis", in Hunt, H.K. (Ed.), Advances in Consumer Research, Vol. 5, Association for Consumer Research, Ann Arbor, MI, pp. 465-7.
Bemmaor, A.C. and Huber, J.C. (1978), "Econometric estimation of halo effect: single vs simultaneous equation models", in Hunt, H.K. (Ed.), Advances in Consumer Research, Vol. 5, Association for Consumer Research, Ann Arbor, MI, pp. 477-80.
Borman, W.C. (1977), "Consistency of rating accuracy and rating errors in judgement of human performance", Organisational Behaviour and Human Performance, Vol. 20, pp. 238-52.
Churchill, G.A. Jnr and Surprenant, C. (1982), "An investigation into the determinants of customer satisfaction", Journal of Marketing Research, Vol. 19, pp. 491-504.
Cooper, W.H. (1981), "Ubiquitous halo", Psychological Bulletin, Vol. 90, pp. 218-44.
Day, R.L. (1977), "Towards a process model of consumer satisfaction", in Hunt, H.K. (Ed.), Conceptualisation and Measurement of Consumer Satisfaction and Dissatisfaction, Marketing Science Institute, Cambridge, MA, pp. 153-83.
Dillon, W.R., Mulani, N. and Frederick, D.G. (1984), "Removing perceptual distortions in product space analysis", Journal of Marketing Research, Vol. 21, pp. 184-93.
Farh, J.L., Cannella, A.A. Jnr and Bedeian, A. (1991), "Peer ratings: the impact of purpose on rating quality and acceptance", Group and Organisation Studies, Vol. 16 No. 4, pp. 367-86.
Green, P.E. and Srinivasan, V. (1990), "Conjoint analysis in consumer research: issues and outlook", Journal of Marketing, Vol. 54, pp. 3-19.
Holbrook, M.B. (1983), "Using a structural model of halo effect to assess perceptual distortion due to affective overtones", Journal of Consumer Research, Vol. 10, pp. 247-52.
Holbrook, M.B. and Huber, J. (1979), "Separating perceptual dimensions from affective overtones: an application to consumer aesthetics", Journal of Consumer Research, Vol. 5, pp. 272-83.
Huber, J. and James, W. (1978), "A measure of halo", in Hunt, H.K. (Ed.), Advances in Consumer Research, Vol. 5, Association for Consumer Research, Ann Arbor, MI, pp. 468-73.
Judd, C.M., Drake, R.A., Downing, J.W. and Krosnick, J.A. (1991), "Some dynamic properties of attitude structures: context-induced response facilitation and polarization", Journal of Personality and Social Psychology, Vol. 60, pp. 193-202.
King, L.M., Hunter, J.E. and Schmidt, F.L. (1980), "Halo in a multidimensional forced-choice performance evaluation scale", Journal of Applied Psychology, Vol. 65, pp. 507-13.
Lance, C.E. and Woehr, D.J. (1986), "Statistical control of halo: clarification from two cognitive models of the performance appraisal process", Journal of Applied Psychology, Vol. 71 No. 4, pp. 679-85.
Lay, C.H. and Jackson, D.N. (1969), "Analysis of the generality of trait-inferential relationships", Journal of Personality and Social Psychology, Vol. 12, pp. 12-21.
Lovelock, C.H. (1983), "Classifying services to gain strategic insights", Journal of Marketing, Vol. 47, Summer, pp. 9-20.
Murphy, K.R. and Anhalt, R.L. (1992), "Is halo error a property of the rater, the ratees, or the specific behaviours observed?", Journal of Applied Psychology, Vol. 77, pp. 494-500.
Murphy, K.R. and Balzer, W.K. (1986), "Systematic distortion in memory-based behaviour ratings and performance evaluations: consequences for rating accuracy", Journal of Applied Psychology, Vol. 71, pp. 39-44.


Murphy, K.R. and Jako, R. (1989), "Under what conditions are observed intercorrelations greater or smaller than true intercorrelations?", Journal of Applied Psychology, Vol. 74, pp. 828-30.
Murphy, K.R. and Reynolds, D.H. (1988), "Does true halo affect observed halo?", Journal of Applied Psychology, Vol. 73, pp. 235-8.
Murphy, K.R., Jako, R. and Anhalt, R.L. (1993), "Nature and consequences of halo error: a critical analysis", Journal of Applied Psychology, Vol. 78, pp. 218-25.
Nathan, B.R. and Lord, R.G. (1983), "Cognitive categorization and dimensional schemata: a process approach to the study of halo in performance ratings", Journal of Applied Psychology, Vol. 68, pp. 102-14.
Nisbett, R.E. and Wilson, T.D. (1977), "The halo effect: evidence for unconscious alteration of judgements", Journal of Personality and Social Psychology, Vol. 35 No. 4, pp. 250-6.
Oliver, R.L. (1980), "A cognitive model of the antecedents and consequences of satisfaction decisions", Journal of Marketing Research, Vol. 17, pp. 460-9.
Oliver, R.L. (1997), Satisfaction: A Behavioral Perspective on the Consumer, McGraw-Hill, New York, NY.
Schweder, R. and D'Andrade, R. (1979), "The systematic distortion hypothesis", in Schweder, R. and Fiske, D. (Eds), New Directions for Methodology of Behavioural Science: Fallible Judgement in Behavioural Research, Jossey-Bass, San Francisco, CA, pp. 37-58.
Smith, P.C. and Kendall, L.M. (1963), "Retranslation of expectations: an approach to the construction of unambiguous anchors for rating scales", Journal of Applied Psychology, Vol. 47, pp. 149-55.
Thorndike, E.L. (1920), "A constant error in psychological ratings", Journal of Applied Psychology, Vol. 4, pp. 25-9.
Wells, F. (1907), "A statistical study of literary merit", Archives of Psychology, Vol. 1 No. 7.
Wirtz, J. (1996), "Controlling halo in attribute-specific customer satisfaction measures - towards a conceptual framework", Asian Journal of Marketing, Vol. 5 No. 1, pp. 41-58.
Wirtz, J. (2000), "An examination of the presence, magnitude and impact of halo on consumer satisfaction measures", Journal of Retailing and Consumer Services, Vol. 7, pp. 89-99.
Wirtz, J. and Bateson, J.E.G. (1995), "An experimental investigation of halo effects in satisfaction measures of service attributes", International Journal of Service Industry Management, Vol. 6 No. 3, pp. 84-102.
Wirtz, J. and Lim, S.J. (1998), "How to reduce halo in attribute-specific satisfaction measures: an empirical examination of four halo reduction methods", Asia-Pacific Advances in Consumer Research, Vol. 3, Association for Consumer Research, Provo, UT, p. 89.
Woodruff, R.B., Cadotte, E.R. and Jenkins, R.L. (1983), "Modelling consumer satisfaction processes using experience-based norms", Journal of Marketing Research, Vol. 20, pp. 296-304.

Appendix. Relative satisfaction scale

How satisfied are you with the following aspects of the STAR service? First, choose the feature that you are most satisfied with and give it a score of "10". Then, rate the remaining features against your most satisfied option using a scale of "0" to "9".

Example: If you are most satisfied with the design of the outside waiting area, give it a "10". If you are nearly as satisfied with the approachability of SQ staff, give it a "9" or "8". If you are satisfied with it a lot less, then give it a lower score, say "4" or "3". If you are not at all satisfied with it, give it a "0".

                                        SCORE
Realism of stories in video clips       ______
Approachability of SQ staff             ______
Time duration of STAR service           ______
User-friendliness of STAR program       ______
Design of outside waiting area          ______
