Journal of Applied Psychology 2000, Vol. 85, No. 5, 708-723

Copyright 2000 by the American Psychological Association, Inc. 0021-9010/00/$5.00 DOI: 10.1037//0021-9010.85.5.708

Performance Appraisal Reactions: Measurement, Modeling, and Method Bias

Lisa M. Keeping and Paul E. Levy
University of Akron

In this study, the authors attempted to comprehensively examine the measurement of performance appraisal reactions. They first investigated how well the reaction scales, representative of those used in the field, measured their substantive constructs. A confirmatory factor analysis indicated that these scales did a favorable job of measuring appraisal reactions, with a few concerns. The authors also found that the data fit a higher order appraisal reactions model. In contrast, a nested model where the reaction constructs were operationalized as one general factor did not adequately fit the data. Finally, the authors tested the notion that self-report data are affectively driven for the specific case of appraisal reactions, using the techniques delineated by L. J. Williams, M. B. Gavin, and M. L. Williams (1996). Results indicated that neither positive nor negative affect presented method biases in the reaction measures, at either the measurement or construct levels.

Lisa M. Keeping and Paul E. Levy, Department of Psychology, University of Akron. A version of this article was presented at the thirteenth annual meeting of the Society for Industrial and Organizational Psychology, Dallas, April 1998. We thank both Rosalie Hall and Andee Snell for their assistance regarding data-analytic techniques and strategies. We also thank Jeanné Makiney for her assistance with the figures. Correspondence concerning this article should be addressed to Paul E. Levy, Department of Psychology, University of Akron, Akron, Ohio 44325-4301. Electronic mail may be sent to [email protected].

Appraisal effectiveness has been at the heart of much of the research literature in performance appraisal. Appraisal effectiveness refers to how well the appraisal system is operating as a tool for the assessment of work performance (Cardy & Dobbins, 1994). The interest in appraisal effectiveness has focused on both the predictor and the criterion sides of the equation. Thus, a substantial body of research has examined the factors contributing to effective performance appraisals. This research has focused on such variables as appraisal participation, relevancy of performance dimensions, trust in supervisor, and appraisal frequency (Murphy & Cleveland, 1995). Complementing this literature has been research investigating how to evaluate the effectiveness of appraisals. Research in this area has assessed the quality of performance ratings as well as employee reactions toward the appraisal (Cardy & Dobbins, 1994; Murphy & Cleveland, 1995). When we focus more specifically on appraisal reactions, a review of the extant literature reveals that it is difficult to consolidate and interpret the current research concerning reactions for the following reasons: (a) this area contains many inconsistencies in terms of the measurement of appraisal reactions, (b) measurement has seemingly been conducted without a theoretical basis, and (c) many researchers develop idiosyncratic measures of what appear to be the same construct, thus flooding the field with a multitude of scales, most of which are never validated.

With regard to the measurement of constructs, Pedhazur and Schmelkin (1991) suggested that in sociobehavioral research, measures often become confounded, as we tend to assume that because different measures are given the same label, they are in fact the same construct. Similarly, when measures are given different labels, we assume they are in fact distinct constructs. It is our contention that the research literature that has examined appraisal reactions is one example of an area where these assumptions are pervasive. The purpose of the present study is to examine and delineate the various employee appraisal reactions and provide a preliminary assessment of their measurement properties. Thus, we attempt to address some of the concerns raised above. We hope that the present study represents a first step toward a better understanding of appraisal reactions. To this end, we discuss why the measurement of employee reactions is important in the first place, followed by a description of the current state of measurement of appraisal reactions. Next, a brief overview of the relevant literature on appraisal reactions is presented, with specific emphasis placed on the interrelations among these constructs.

Why Study Employees' Appraisal Reactions?

Before we discuss the measurement issues relevant to appraisal reactions, it is necessary to address the importance of measuring appraisal reactions in the first place. We believe that the assessment of employees' reactions toward their performance appraisals is important for many reasons, including (a) the notion that reactions represent a criterion of great interest to practitioners and (b) the fact that reactions have been theoretically linked to determinants of appraisal acceptance and success but have been relatively ignored in research. Each of these issues fits within the context of the scientist-practitioner gap that is often debated within the field of industrial-organizational (I/O) psychology. The scientist-practitioner gap has been of great concern to I/O psychologists over the years (e.g., Hyatt et al., 1997). That is, many psychologists in the field are troubled by the apparent lack of alignment between research and practice. Although this is an area of concern to the field as a whole, performance appraisal is considered by many to be one of the areas most affected by the scientist-practitioner gap (e.g., Banks & Murphy, 1985; Bretz, Milkovich, & Read, 1992; Ilgen, Barnes-Farrell, & McKellin, 1993; Maroney & Buckley, 1992; Smither, 1998). In relation to this gap, Balzer and Sulsky (1990) suggested the existence of a criterion gap in performance appraisal that originates from two sources. First, many of the dependent variables measured in research are of little interest to practitioners. Second, many of the dependent variables that are of interest to practitioners are largely or completely ignored by researchers. In general, researchers have traditionally focused on measuring the psychometric properties of ratings, whereas practitioners are more often interested in practical criteria such as appraisal reactions (Cardy & Dobbins, 1994; Murphy & Cleveland, 1995). In addition, many surveys of both employees and managers indicate dissatisfaction with the performance appraisal process; as Bernardin, Hagan, Kane, and Villanova (1998) pointed out, "the appraisal of performance appraisal is not good" (p. 3). Thus, it seems important that researchers continue to address the issue of employee appraisal reactions to help bridge the gap between science and practice in performance appraisal.

Many researchers have suggested that appraisal reactions play an important role in the appraisal process because they are vital to the acceptance and use of an appraisal system (e.g., Bernardin & Beatty, 1984; Cardy & Dobbins, 1994; Murphy & Cleveland, 1995) as well as a contributing factor to the validity of an appraisal (Lawler, 1967). In fact, Cardy and Dobbins (1994) asserted that "with dissatisfaction and feelings of unfairness in process and inequity in evaluations, any appraisal system will be doomed to failure" (p. 54). Similarly, Murphy and Cleveland (1995) contended that "reaction criteria are almost always relevant, and an unfavorable reaction may doom the most carefully constructed appraisal system" (p. 314). These are strong statements that speak to the importance of assessing the reactions of employees regarding their appraisals. Finally, Hedge and Borman (1995) suggested that worker reactions toward performance appraisal may play an increasingly important role as appraisal processes and procedures continue to develop. Thus, it seems clear that there is general consensus among performance appraisal researchers that the assessment of appraisal reactions is important.

The Current State of Appraisal Reaction Measurement

The main purpose of the present article is to clarify the measurement of appraisal reactions. The previous section highlighted the importance of measuring appraisal reactions. The following sections focus on the different conceptualizations and operationalizations found in the literature of employees' reactions toward their appraisals. This brief review of the most frequently assessed appraisal reactions draws on a recent meta-analysis by Cawley, Keeping, and Levy (1998) that identified and discussed these conceptualizations and operationalizations in detail. The meta-analysis conducted an extensive review of the appraisal reactions literature, with a focus on the role of employee participation in affecting those reactions. Each of the reactions discussed below and included in the current study was identified as relevant by that meta-analysis.
The purpose of the review that follows is primarily to illustrate the state of the extant measures assessing seemingly similar constructs within the appraisal reactions literature. Although this overview is not an attempt to question the substantive findings of past research regarding employee reactions, the lack of clarity and consistency inherent in these measures should present some concern to researchers in this area. It is important to note, however, that measurement has been identified as a neglected area in I/O psychology in general (e.g., MacCallum, 1998), and we do not wish to suggest that the appraisal reactions literature is worse than any other in this regard. In addition, as the following review illustrates, many of the reaction measures were developed more than a decade ago. Since then, our knowledge of measurement has advanced, and our conceptualizations and operationalizations of many constructs have expanded. Thus, our goal is to identify potential measurement issues regarding appraisal reaction measures while acknowledging that many of these issues could not have been addressed at the time of development. We feel that the more advanced knowledge of measurement issues by I/O psychologists today provides an opportunity to critically examine some of these longstanding measures.

Satisfaction

Satisfaction has been the most frequently measured appraisal reaction (Giles & Mossholder, 1990). Appraisal satisfaction has been primarily conceptualized in three ways: (a) satisfaction with the appraisal interview or session, (b) satisfaction with the appraisal system, and (c) satisfaction with performance ratings. Some researchers have clearly operationalized their measures to be consistent with these conceptualizations. For example, Giles and Mossholder developed separate measures for system and session satisfaction. Similarly, in assessing satisfaction with performance ratings, Taylor, Tracy, Renard, Harrison, and Carroll (1995) had employees respond to the item "I was very satisfied with the results of the performance evaluation I received during the Pilot Project." This item very clearly refers to satisfaction with the results. Unfortunately, many operationalizations of satisfaction have been inconsistent and may be contaminated with other variables. For example, Dipboye and de Pontbriand (1981) used a one-item measure of satisfaction that asked participants to rate their satisfaction with the effectiveness of the current appraisal. This item potentially confounds satisfaction and utility. Greller (1978) constructed a three-item measure of satisfaction that has been used by subsequent researchers. This measure assesses the degree to which subordinates are satisfied, report accurate and fair evaluations, and feel they will improve their working relations with their supervisors. Thus, Greller's measure may confound the constructs of satisfaction, fairness, accuracy, and utility. Similarly, Taylor et al. (1995) operationalized satisfaction with a four-item scale assessing whether the organization should change the appraisal system, whether there are fewer work problems as a result of the system, whether employees are satisfied with the way the organization conducted the appraisal, and whether having appraisals is a waste of time. This measure may potentially confound satisfaction and utility. Finally, some satisfaction measures seem to confound appraisal satisfaction with job satisfaction. For example, Dorfman, Stephan, and Loveland (1986) measured appraisal satisfaction by having employees respond to the following items: "How satisfied were you with the discussion between yourself and your supervisor about your job performance?", "In general, how satisfied are you with your supervisor?", "In general, how satisfied are you with your job?", and "How satisfied are you with the overall evaluation of your performance?"

Fairness

A review of the measures assessing appraisal fairness presents a more complicated picture than does any other appraisal reaction. This is due to the recent influence of organizational justice on the measurement of employee reactions to performance appraisal, which is consistent with Smither's (1998) argument that the ultimate appraisal system is especially sensitive to issues of justice or fairness. Traditionally, appraisal fairness was conceptualized as either the perceived fairness of the performance rating or the perceived fairness of the appraisal in general. Recently, however, researchers in performance appraisal have adopted the constructs of procedural and distributive justice and have used these measures to assess the issue of fairness (e.g., Korsgaard & Roberson, 1995). Thus, appraisal fairness has been conceptualized in four different ways: (a) fairness of performance ratings, (b) fairness of the appraisal system, (c) procedural justice, and (d) distributive justice. Fairness of performance ratings has often been operationalized by one-item measures. For example, Taylor et al. (1995) asked employees whether the appraisal received was a fair one. Unfortunately, there is some inconsistency, even among the one-item measures. For example, Burke, Weitzel, and Weir (1978) asked employees how fair they felt their last appraisal session was. This focuses on the session rather than on the system or the ratings received. Conversely, Evans and McShane (1988) asked employees to indicate their overall perception of the fairness of their company's appraisal system. This item encompasses more than just the fairness of the session or the final rating. Fairness has also been operationalized as distributive justice. For example, Korsgaard and Roberson (1995) assessed distributive justice by asking employees how fair they felt the appraisal was, how much it agreed with their own final rating, and how well it represented their performance. Although this may be conceptualized as distributive justice, the measure also appears to tap perceived accuracy of the rating, particularly in the latter two items. Supporting this interpretation, one of the two items Taylor et al. (1995) used to measure accuracy (not fairness) assessed whether the evaluation showed how employees really performed on the job (i.e., how well the appraisal represented their performance). Similarly, Inderrieden, Allen, and Keaveny (1992) operationalized fairness of ratings with the item "My supervisor overlooked important things affecting my job performance in his ratings of performance." Again, this item may assess perceived accuracy rather than fairness. In addition, this measure focuses particularly on the fairness extended by the supervisor.

Perceived Utility

Another popular reaction has been the perceived utility of the appraisal. Compared with satisfaction and fairness, the measurement of utility has been relatively consistent and unconfounded. The most typical conceptualizations of utility have focused on the usefulness of the appraisal session. For example, Greller (1978) conceptualized utility in terms of the appraisal session and operationalized this with items such as "The appraisal helped me learn how I can do my job better" and "I learned a lot from the appraisal." Many researchers since then have operationalized perceived utility in the same way and have measured it with Greller's scale or some modification of it (e.g., Nathan, Mohrman, & Milliman, 1991; Prince & Lawler, 1986).

Perceived Accuracy

The assessment of perceived accuracy presents an unusual case compared with the other reactions typically measured. In reviewing the research in which perceived accuracy has been used as a criterion, Cawley et al. (1998) reported that the vast majority of studies appear to confound accuracy with other reactions, most notably fairness. For example, in their seminal paper on employee reactions and performance appraisal, Landy, Barnes, and Murphy (1978) asked subordinates to respond to the question, "Has your performance been fairly and accurately evaluated?"

Putting It All Together: Relations Among Appraisal Reactions

As we mentioned previously, the research literature on appraisal reactions lacks a theoretical framework, and, because of this, researchers have not carefully considered how the various reactions might work together. However, most studies measuring one appraisal reaction tend to measure other reactions as well. As we also mentioned earlier, and in relation to the previous criticism, the rationale behind the choice of reactions measured is rarely provided. Thus, the literature is full of various combinations of appraisal reactions and their bivariate correlations. An examination of the bivariate correlations among appraisal reactions reveals that these correlations are quite high, which might lead one to question whether these rationally developed constructs are actually distinct entities. Some researchers have suggested that appraisal reactions are really just measures of an overall construct of appraisal effectiveness. In this vein, Cardy and Dobbins (1994) conceptualized appraisal effectiveness as a multidimensional construct or an ultimate criterion (Thorndike, 1949) that cannot be directly measured but rather is assessed through the measurement of other subordinate criteria. They further suggested that the subordinate criteria that each reflect a portion of the overall concept of appraisal effectiveness are rater errors, rating accuracy, and qualitative aspects of the appraisal (i.e., appraisal reactions). Thus, it is possible that the appraisal reactions considered to be separate constructs in the research literature are in fact indicators contributing to a larger overall construct, appraisal reactions, that represents one component of the ultimate criterion labeled appraisal effectiveness by Cardy and Dobbins (1994). The current study focuses on empirically examining this conceptualization.

Appraisal Reactions and Affect

When appraisal reactions are used as variables in appraisal research, the logical and most straightforward mode of measurement involves self-report measures. However, criticisms regarding the validity of self-report data are ubiquitous in the I/O literature, particularly the literature concerning employee attitudes and reactions (Spector, 1994). Perhaps the most contentious of these criticisms is that self-report data are susceptible to common method variance, or systematic error variance that is related to the measurement process rather than the actual constructs of interest (see Doty & Glick, 1998). That is, it is often argued that observed correlations are produced by the fact that the data originated from the same source rather than from relations among substantive constructs. Furthermore, Feldman and Lynch (1988), as well as Tourangeau and Rasinski (1988), have discussed how self-reported attitudes can be biased because of contextual aspects such as the timing, order, and method of measurement. In line with this reasoning, our focus in the current study is on the potentially biasing contextual aspect of an individual's affect at the time of completing self-report measures. In a similar vein, research in the organizational behavior literature has examined the effects of negative affectivity and, to a lesser extent, positive affectivity on self-report data. For example, Williams and Anderson (1994), applying a latent-variable structural equation approach suggested by Schaubroeck, Ganster, and Fox (1992), investigated the potential biasing effect of common method variance (i.e., positive or negative affectivity) on self-reports of job attitudes. They found that although some method variance existed in the self-reports of job attitudes, these effects were minimal and essentially meaningless in any practical sense. The present study examines potential method bias effects derived from the method variance associated with positive and negative affect, in the specific context of performance appraisal reactions. Our approach is consistent with Doty and Glick's (1998) conceptualization of common methods variance and common methods bias, as manifested in independently assessed method effects such as affectivity or social desirability.

The Present Study

From reviewing the literature, it is apparent to us that there is some confusion in terms of what has been measured in the area of reactions toward performance appraisal. The purpose of the present study is to clarify and reorganize this area by examining the measurement properties of various measures of appraisal reactions. Appraisal reactions have been measured in research for over 30 years, and it seems appropriate, in fact overdue, that we now begin to examine exactly what we have been measuring. The field has come a long way from employing single-item measures to employing multi-item measures as well as theory-driven measures such as procedural and distributive justice. The next logical step is to assess the measures currently in use and to examine whether and to what extent they might work together to assess the overall construct of appraisal effectiveness. To this end, we use structural equation modeling to assess the measurement of appraisal reactions. More specifically, we are interested in pursuing answers to the following questions: (a) How adequately are appraisal reaction constructs being measured with current scales? (b) How adequately do the data support a model in which the separate reaction constructs reflect a higher order appraisal reactions construct? Alternatively, how adequately do the data support a model in which the appraisal reactions are operationalized as a single factor? and (c) Does the method variance associated with positive and negative affect result in method-biased measurement of appraisal reactions? If so, what is the extent of this bias in terms of substantive effects on the relationships among constructs?

Method

Participants

The sample consisted of approximately 350 employees from the midwestern head office of a large international organization. Surveys were returned by 208 employees, resulting in a 59% response rate. Only cases in which the employee had received a performance review and participated in an appraisal discussion were included in analyses, resulting in a sample size of 182 employees, 95% of whom were employed full time. This sample contained 32% men and 64% women from a variety of positions ranging from corporate lawyer to shipper/receiver. The average organizational tenure for respondents was 4.58 years, and the average job tenure was 2.90 years. The sample was fairly well educated, with 46% holding a college degree, 21% having some college, and 6% having only a high school diploma. With respect to age, the sample was fairly evenly distributed, with 33% under the age of 30, 32% between 30 and 39, and 35% 40 years or older. Complete data suitable for structural equation modeling were provided by 169 of these employees.

Procedure

We distributed questionnaires to employees with an attached letter requesting their voluntary participation. The 94-item surveys were completed anonymously and returned to us in self-addressed, stamped envelopes. Employees were asked to reflect on their most recent performance review when responding to items. The measures of positive and negative affect were placed at the end of the questionnaire, after the reaction measures. Appraisals had taken place approximately 2 months prior to questionnaire distribution. The appraisals were used for personnel decisions such as salary allocations and promotions. Employees were evaluated on eight separate dimensions (e.g., job knowledge and skills, quality of work) and one overall performance dimension, with ratings on each ranging from 1 (unsatisfactory performance) to 5 (exceptional performance). Raters were instructed to carefully evaluate an employee's work performance in relation to his or her current job responsibilities.

Measures

To provide a comprehensive investigation of the current state of measurement of appraisal reactions, we specifically chose, as much as possible, reaction measures that have been used most pervasively in the field. We make no a priori claim that these are the best measures or the worst measures but only that they are frequently used as research tools.

Satisfaction with the appraisal session. Employees' satisfaction with the performance appraisal discussion was assessed using the three-item measure developed by Giles and Mossholder (1990). A sample item is "I felt quite satisfied with my last review discussion." Responses were indicated on a 6-point Likert scale, with 1 representing strongly disagree and 6 representing strongly agree.

Satisfaction with the appraisal system. Satisfaction with the appraisal system was measured using the three-item scale developed by Giles and Mossholder (1990). A sample item is "In general, I feel the company has an excellent performance review system." Responses were indicated using the same 6-point scale described for session satisfaction.

Perceived utility of the appraisal. The perceived utility of the appraisal was assessed with Greller's (1978) four-item measure. This is one of the oldest measures of utility and has been used in many studies. A sample item is "The performance review helped me learn how I can do my job better."


Table 1
Means, Standard Deviations, Reliabilities, and Intercorrelations Among Variables

Variable                    M     SD    α     1     2     3     4     5     6     7     8
1. Session Satisfaction   4.09  1.60  .95    --
2. System Satisfaction    3.40  1.44  .90   .69    --
3. Utility                2.16  0.84  .91   .71   .67    --
4. Accuracy               4.75  1.70  .96   .81   .67   .56    --
5. Distributive justice   3.36  1.28  .95   .82   .70   .58   .88    --
6. Procedural justice     4.98  1.67  .96   .72   .69   .59   .76   .76    --
7. Positive affect        4.96  1.22  .89   .29   .29   .32   .26   .25   .27    --
8. Negative affect        2.95  1.66  .95  -.27  -.29  -.22  -.27  -.27  -.20  -.67    --

Note. All correlations are significant at α = .01. Ns range from 175 to 181.

better." Employees indicated their responses on a 4-point scale ranging from I do not feel this way, not at all to t feel exactly this way, completely. Perceived accuracy of the appraisal. The extent to which employees perceived the appraisal as accurate was measured with an eight-item modification of Stone, Gueutal, and Mclntosh's (1984) measure of feedback accuracy. A sample item from this scale is "I do not feel the feedback reflected my actual perfbrmance." Employees indicated their responses on a 7-point Likert scale ranging from strongly disagree to strongly agree. Procedural justice. A four-item procedural justice scale developed by Keeping, Makiney, Levy, Moon, and Gillette (1999) was used. This procedural justice scale is specific to the performance appraisal context. A sample item from this scale is "The procedures used to evaluate my performance were fair." Responses to items in this scale were made on the same 7-point scale as described for perceived accuracy. Distributive justice. Distributive justice was assessed with the fouritem appraisal distributive justice measure developed by Korsgaard and Roberson (1995). A sample item is "The performance appraisal fairly represented my past year's performance." Responses were indicated on a 5-point Likert scale ranging from strongly disagree to strongly agree. Positive and negative affect. Positive and negative affect were measured using scales developed by Zuwerink and Devine (1996). Positive affect was measured with eight adjectives associated with positivity (i.e., happy, optimistic, content, good, confident, proud, satisfied with myself, and pleased with myself). A simple confirmatory factor analysis of positive affect indicated that content and satisfied with myself were poor indicators (standardized loadings were .56 and .58, respectively), and thus, these two items were dropped and the remaining six items were used for all analyses. Negative affect was measured with a six-item scale conceptually related to irritation (i.e., agitated, angry, annoyed, bothered, disgusted, and irritated). For both scales, employees responded in terms of how they felt at that moment, thus assessing their current affective state. Responses were indicated on a 7-point scale from does not apply at all to applies very much. Results Table 1 presents means, standard deviations, reliabilities, and intercorrelations for the various scales. As we expected, the intercorrelations between the appraisal reactions were quite high. It is interesting to note that although positive and negative affect were significantly correlated with the appraisal reaction measures, these correlations were not very high (ranging in magnitude from .20 to .32). All subsequent analyses were conducted using LISREL VIII (Jrreskog & SOrbom, 1993) with a covariance matrix as input. Table 2 presents a description of our analytic strategy for the study,

complete with descriptions of the models estimated and the associated rationale. 1 A s s e s s i n g the M e a s u r e m e n t o f A p p r a i s a l R e a c t i o n s Appraisal reaction measurement models. The procedures used to assess the measurement models are summarized in the top portion of Table 2. To determine how well the appraisal reaction indicators were measuring the substantive constructs, we estimated a measurement model containing satisfaction with the session, perceived utility, satisfaction with the system, perceived accuracy, procedural justice, and distributive justice. Each of these six appraisal reaction constructs was modeled with multiple single-item indicators, except for perceived accuracy, which was modeled with three rationally derived parcels (2 two-item parcels and 1 four-item parcel), following recommendations in the literature (Bagozzi & Edwards, 1998; Hall, Snell, & Foust, 1999). Consistent with standard practice, the appraisal reaction constructs were allowed to freely correlate with one another (Anderson & Gerbing, 1988). Table 3 presents the chi-square values, degrees of freedom, and fit indices for the measurement model, as well as for subsequent models. In accordance with the two-index presentation strategy suggested by Hu and Bentler (1999), the standardized root mean square residual (SRMSR) is presented, along with the root mean square error of approximation (RMSEA) and the nonnormed fit index (NNFI), also known as the Tucker-Lewis Index (TLI). Although the fit was not unreasonable, as the SRMSR fell below the .08 cutoff suggested by Hu and Bentler (1999), the value o f . 11 for the R M S E A presented some cause for concern, as it exceeded the .08 cutoff suggested by Browne and Cudeck (1993) and the .06 cutoff suggested by Hu and Bentler. Finally, the value of .91 for the TLI fell below the suggested cutoff of .95 (Hu & Bentler, 1999). On the basis of these criteria, we rejected the measurement model. Modification indices for the error variances indicated that model fit might improve if some of the uniquenesses were allowed to correlate. Most of these modifications involved allowing correlations between uniquenesses within constructs. Thus, we estimated a modified measurement model, allowing

We thank two anonymous reviewers for suggesting that we summarize our analytic strategy in this manner.

713
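To make the specification concrete, the sketch below re-expresses this six-construct measurement model in Python using the open-source semopy package (the authors used LISREL VIII; semopy and its lavaan-style syntax are our substitution). The column names (sess1, acc1, and so on), the file name, and the item-to-parcel assignments for perceived accuracy are hypothetical stand-ins; the article specifies only two 2-item parcels and one 4-item parcel.

    import pandas as pd
    import semopy

    def make_accuracy_parcels(df: pd.DataFrame) -> pd.DataFrame:
        """Build the three rationally derived accuracy parcels as item means
        (illustrative item-to-parcel assignments)."""
        df = df.copy()
        df["acc_p1"] = df[["acc1", "acc2"]].mean(axis=1)
        df["acc_p2"] = df[["acc3", "acc4"]].mean(axis=1)
        df["acc_p3"] = df[["acc5", "acc6", "acc7", "acc8"]].mean(axis=1)
        return df

    # lavaan-style syntax; covariances among the six latent factors are
    # estimated freely by default, matching the intercorrelated specification.
    MEASUREMENT_MODEL = """
    SessionSat =~ sess1 + sess2 + sess3
    SystemSat  =~ syst1 + syst2 + syst3
    Utility    =~ util1 + util2 + util3 + util4
    Accuracy   =~ acc_p1 + acc_p2 + acc_p3
    ProcJust   =~ proc1 + proc2 + proc3 + proc4
    DistJust   =~ dist1 + dist2 + dist3 + dist4
    """

    data = make_accuracy_parcels(pd.read_csv("appraisal_reactions.csv"))
    model = semopy.Model(MEASUREMENT_MODEL)
    model.fit(data)                    # ML estimation on the covariance structure
    print(model.inspect())             # loadings and factor covariances
    print(semopy.calc_stats(model).T)  # chi-square, df, RMSEA, TLI, etc.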

Table 2
Description of Analytic Strategy

Goal: Assessing the measurement and modeling of reactions
1. Estimate measurement model.
   Rationale: To assess how well reaction indicators measured their substantive constructs.
   Description: Six appraisal reaction constructs were modeled with indicators. Reaction constructs were allowed to freely correlate with one another.
2. Estimate modified measurement model.
   Rationale: To try to improve fit of the baseline measurement model before addressing other questions.
   Description: Same as measurement model, except allowed for eight correlations between errors within constructs (see Figure 1).
3. Estimate hierarchical model.
   Rationale: To assess whether the data support a model where the reaction constructs compose a higher order Appraisal Reactions construct.
   Description: Appraisal reactions were modeled as distinct latent constructs that together reflect a higher order factor (i.e., reactions became latent indicators of an overall Appraisal Reactions construct; see Figure 2).
4. Estimate single-factor model.
   Rationale: To assess whether the data support a model where the appraisal reactions are operationalized as one factor (i.e., to see if they are all measuring the same thing).
   Description: Data were modeled as in the hierarchical model, except that loadings from the hierarchical factor to the reaction factors were all constrained to 1.0.
5. Chi-square difference test.
   Rationale: To determine which model provides a better fit for the data: the hierarchical model or the single-factor model.
   Description: Compute the chi-square difference test by subtracting the chi-square for the hierarchical model from the chi-square for the single-factor model.

Goal: Assessing method effects at the measurement level
1. Estimate baseline model.
   Rationale: To establish a baseline for subsequent comparison.
   Description: Six reaction constructs were allowed to freely correlate with each other, but the Method factor was left uncorrelated with each of the reaction constructs. Direct paths from the reaction constructs to each of their associated indicators were also estimated.
2. Estimate method model.
   Rationale: To assess method effects at the measurement level (i.e., represented as factor loadings) and provide a standard of comparison.
   Description: Estimated direct paths from the Method factor to each of the indicators of the six reaction constructs and direct paths from reaction constructs to their respective indicators. As with the baseline model, correlations between reaction constructs were estimated, but the Method factor was not allowed to correlate with any of the reaction constructs (see Figure 3).
3. Chi-square difference test.
   Rationale: To test overall relationships with the Method factor.
   Description: Computed the chi-square difference test between the baseline model and the method model. If the test is significant, this suggests the presence of method variance induced by the Method factor; proceed to Step 4. If the test is not significant, this suggests the Method factor does not significantly induce method variance; stop here.
4. Estimate constrained method model.
   Rationale: To assess the measurement effects of the Method factor on the estimates of the appraisal reaction correlations.
   Description: Same as the method model except the 15 appraisal reaction correlations were set to equal the LISREL estimates that were obtained from the baseline model.
5. Chi-square difference test.
   Rationale: To test the extent to which the method effect (represented as factor loadings) indicated in the last step changes (i.e., biases) substantive relations among appraisal reaction constructs.
   Description: Computed the chi-square difference test between the method model and the constrained method model. If the test is significant, this indicates the method effect significantly biases the appraisal reaction factor correlations obtained (i.e., there is method bias). If the test is not significant, this indicates the method effect does not bias the appraisal reaction factor correlations obtained (i.e., there is no method bias).

Goal: Assessing method effects at the structural level
1. Estimate structural baseline model.
   Rationale: To establish a baseline for subsequent comparison.
   Description: Identical to the original baseline model except modeled on both the x and the y sides. Thus, estimates 15 disturbance correlations rather than factor correlations.
2. Estimate structural method model.
   Rationale: To assess method effects at the structural level (i.e., represented as direct paths) and provide a standard of comparison.
   Description: Structural paths from the Method factor to the reaction constructs were estimated (see Figure 4).
3. Chi-square difference test.
   Rationale: To test overall relationships with the Method factor.
   Description: Computed the chi-square difference between the structural baseline model and the structural method model. If the test is significant, this suggests the presence of substantive Method factor relationships; proceed to Step 4. If the test is not significant, this suggests the Method factor does not substantively affect relationships; stop here.
4. Estimate constrained structural method model.
   Rationale: To assess the extent of the substantive impact of the Method factor on relations between reaction constructs.
   Description: Same as the structural method model except the 15 disturbance correlations were set to equal the LISREL estimates that were obtained in the structural baseline model.
5. Chi-square difference test.
   Rationale: To test the extent to which the method effect (represented as direct paths) indicated in the last step changes (i.e., biases) substantive relations among appraisal reaction constructs.
   Description: Computed the chi-square difference between the structural method model and the constrained structural method model. If the test is significant, this indicates the method effect significantly biases the appraisal reaction factor correlations obtained (i.e., there is method bias). If the test is not significant, this indicates the method effect does not bias the appraisal reaction factor correlations obtained (i.e., there is no method bias).


Table 3
Summary Statistics for Models

Measurement model (reaction constructs intercorrelated):
    χ² = 509.15, df = 174, SRMSR = .05, TLI = .91, RMSEA = .11
Modified measurement model (correlated errors within constructs):
    χ² = 308.58, df = 166, SRMSR = .04, TLI = .96, RMSEA = .07
Hierarchical model (reactions are latent indicators of a higher order factor, Appraisal Reactions):
    χ² = 349.47, df = 175, SRMSR = .06, TLI = .95, RMSEA = .08
Single-factor model (paths from higher order factor to reactions constrained to 1.0):
    χ² = 515.97, df = 181, SRMSR = .38, TLI = .91, RMSEA = .10
PA baseline model (reactions not correlated with PA):
    χ² = 408.64, df = 229, SRMSR = .12, TLI = .95, RMSEA = .07
PA method model (paths from PA to substantive indicators estimated):
    χ² = 367.33, df = 208, SRMSR = .04, TLI = .96, RMSEA = .07
PA constrained method model (paths from PA to substantive indicators estimated and reaction correlations set to original estimates):
    χ² = 368.67, df = 223, SRMSR = .06, TLI = .96, RMSEA = .06
NA baseline model (reactions not correlated with NA):
    χ² = 381.92, df = 229, SRMSR = .11, TLI = .96, RMSEA = .06
NA method model (paths from NA to substantive indicators estimated):
    χ² = 355.70, df = 208, SRMSR = .04, TLI = .96, RMSEA = .07
PA structural baseline model (modeled on the x and y sides; identical to the PA baseline model):
    χ² = 408.64, df = 229, SRMSR = .12, TLI = .95, RMSEA = .07
PA structural method model (paths from PA to reaction constructs estimated):
    χ² = 389.48, df = 223, SRMSR = .05, TLI = .96, RMSEA = .07
PA constrained structural method model (paths from PA to reaction constructs estimated and disturbance correlations set to original estimates):
    χ² = 390.61, df = 238, SRMSR = .06, TLI = .96, RMSEA = .06
NA structural baseline model (modeled on the x and y sides; identical to the NA baseline model):
    χ² = 381.92, df = 229, SRMSR = .11, TLI = .96, RMSEA = .06
NA structural method model (paths from NA to reaction constructs estimated):
    χ² = 367.67, df = 223, SRMSR = .04, TLI = .96, RMSEA = .06
NA constrained structural method model (paths from NA to reaction constructs estimated and disturbance correlations set to original estimates):
    χ² = 368.36, df = 238, SRMSR = .05, TLI = .97, RMSEA = .06

Note. All analyses were conducted on an N of 182. SRMSR = standardized root mean square residual; TLI = Tucker-Lewis Index (also referred to as the nonnormed fit index); RMSEA = root mean square error of approximation; PA = positive affect; NA = negative affect.

Figure 1 presents this modified measurement model, complete with indicator loadings and error variances. As Table 3 indicates, the fit for the modified measurement model was substantially better than that for the measurement model, with the RMSEA and TLI both within acceptable ranges. Because of this improvement in fit, we used the modified measurement model, rather than the original measurement model, as the measurement basis for later models. As Figure 1 indicates, the indicators showed strong loadings on the latent constructs (ranging from .76 to .97), and all were statistically significant. In addition, the latent correlations between the appraisal reactions did not change substantially from the measurement model to the modified measurement model, with an average difference of .025. As we expected, these correlations were quite high, with perceived utility being slightly less correlated with the other reactions.²

[Figure 1. Modified measurement model. Sat = satisfaction; Proc = procedural; Dist = distributive; X = manifest indicator.]
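In lavaan-style syntax, the modified model simply frees selected residual covariances; the sketch below continues our earlier semopy re-expression. The specific item pairs are illustrative assumptions: the article reports only that all eight correlated errors were within the session satisfaction, system satisfaction, and procedural justice constructs (see the Discussion).

    # Same six-factor model, with within-construct residual covariances freed.
    # Which eight pairs to free was dictated by modification indices in the
    # original analysis; the pairs below are placeholders for illustration.
    MODIFIED_MEASUREMENT_MODEL = MEASUREMENT_MODEL + """
    sess1 ~~ sess2
    syst1 ~~ syst2
    proc1 ~~ proc2
    proc3 ~~ proc4
    """

    modified = semopy.Model(MODIFIED_MEASUREMENT_MODEL)
    modified.fit(data)
    print(semopy.calc_stats(modified).T)  # fit should improve if the freed
                                          # residual covariances are warranted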


Hierarchical model. We next modeled the appraisal reactions as latent constructs reflecting an overarching factor. That is, we explored the fit of the data to a model in which the appraisal reactions were still modeled as distinct yet together reflected a hierarchical factor we labeled Appraisal Reactions. This hierarchical factor is analogous to the qualitative component of the ultimate criterion of appraisal effectiveness suggested by Cardy and Dobbins (1994). In this way, the appraisal reaction constructs became latent indicators of a higher order factor. Figure 2 illustrates this hierarchical model and presents the loadings from the higher order factor to the appraisal reaction constructs, as well as the indicator loadings and error variances. All six of the reaction constructs loaded highly and significantly on the hierarchical Appraisal Reactions factor, lending support for this model. With respect to model fit, as Table 3 indicates, the fit for the hierarchical model was only slightly lower than that for the modified measurement model. Unfortunately, because the two models were not nested, it was not possible to compare them statistically. However, we computed a chi-square to degrees of freedom ratio (Bollen, 1989) for each model as a means of subjective comparison. The ratios were 1.86 and 2.00 for the modified measurement and hierarchical models, respectively. Bollen (1989) suggested a rule of thumb of no higher than 3.0 for this ratio, which suggests that both models provide reasonable fit, given the number of parameters estimated. Thus, the similarity of the fit indices and chi-square/degrees of freedom ratios suggests that this hierarchical model provides a reasonable representation of the appraisal reactions.

[Figure 2. Hierarchical model. Sat = satisfaction; Proc = procedural; Dist = distributive; Y = manifest indicator.]

Single-factor model (constrained hierarchical model). To test the notion that the appraisal reactions were not actually distinct constructs but rather part of one general reaction factor, we modeled the data in the same way as for the hierarchical model described previously, except that the loadings from the hierarchical factor to the reaction factors were constrained to 1.0. In essence, this conceptualizes the appraisal reactions as measuring one construct and is operationally identical to modeling the items of the appraisal reaction measures as indicators of one general reaction factor rather than as separate factors (constraining the correlations between reaction constructs to 1.0 also produces the same result). Modeling the data in this way has the advantage of providing a model nested within the hierarchical model, such that the two models can be directly compared using a chi-square difference test.

similarity of the fit indices and chi-square/degrees of freedom ratios suggests that this hierarchical model provides a reasonable representation of the appraisal reactions. Single factor model (constrained hierarchical model). To test the notion that the appraisal reactions were not actually distinct constructs but rather part of one general reaction factor, we modeled the data in the same way as for the hierarchical model described previously, except that the loadings from the hierarchical factor to the reaction factors were constrained to 1.0. In essence, this is conceptualizing the appraisal reactions as measuring one construct and is operationally identical to modeling the items of the appraisal reaction measures as indicators of one general reaction factor rather than as separate factors (constraining the correlations between reaction constructs to 1.0 also produces the same result). Modeling the data in this way has the advantage of providing a model nested within the hierarchical model, such that the

² The very high correlation between perceived accuracy and distributive justice led us to estimate an alternate nested model in which the correlation between these two constructs was constrained to 1.0. In addition, the correlations between the other reaction constructs and each of these two constructs were set to be equal. For example, the correlations between accuracy and utility and between distributive justice and utility were set to be equal. A significant chi-square difference test, χ²(5) = 63.21, p < .001, however, indicated that the data better fit the model in which the constructs were modeled as distinct (the modified measurement model).
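Under the same assumptions as the earlier sketches, the hierarchical model adds a second-order factor over the six first-order constructs, and the single-factor variant fixes all six second-order loadings to 1.0 (semopy, like lavaan, fixes a loading by premultiplying the variable with a constant). With the first-order factors now endogenous, their covariances are no longer free: they are carried by the higher order factor plus disturbances.

    # Second-order model: the six reaction factors become indicators of a
    # higher order Appraisal Reactions factor.
    HIERARCHICAL_MODEL = MEASUREMENT_MODEL + """
    Reactions =~ SessionSat + SystemSat + Utility + Accuracy + ProcJust + DistJust
    """

    # Constrained (single-factor) variant: every second-order loading fixed
    # to 1.0, equivalent to modeling all items as indicators of one factor.
    SINGLE_FACTOR_MODEL = MEASUREMENT_MODEL + """
    Reactions =~ 1*SessionSat + 1*SystemSat + 1*Utility + 1*Accuracy + 1*ProcJust + 1*DistJust
    """

    for description in (HIERARCHICAL_MODEL, SINGLE_FACTOR_MODEL):
        m = semopy.Model(description)
        m.fit(data)
        print(semopy.calc_stats(m).T)

    # The models are nested, so the reported comparison is a direct
    # chi-square difference test: 515.97 - 349.47 = 166.50 on 181 - 175 = 6 df.
    print(chi_square_difference(349.47, 175, 515.97, 181))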

Table 4
Summary of Relevant Chi-Square Difference Tests and Their Implications

Hierarchical model vs. single-factor model: Δχ²(6) = 166.50, p < .001.
    Implication: The hierarchical model is a better representation of the data than the single-factor model, which models the appraisal reactions as a single factor.
PA baseline model vs. PA method model: Δχ²(21) = 41.31, p < .01.
    Implication: Suggests the presence of PA method effects in the measurement of appraisal reactions.
PA method model vs. PA constrained method model: Δχ²(15) = 1.34, ns.
    Implication: Although some method effects in the measurement of appraisal reactions exist as a result of PA, these effects do not bias the relationships among the appraisal reaction factors.
NA baseline model vs. NA method model: Δχ²(21) = 26.22, ns.
    Implication: No NA method effects are present in the measurement of appraisal reactions.
PA structural baseline model vs. PA structural method model: Δχ²(6) = 19.16, p < .01.
    Implication: Suggests the presence of PA method effects in the substantive appraisal reaction constructs.
PA structural method model vs. PA constrained structural method model: Δχ²(15) = 1.13, ns.
    Implication: The method effects due to PA in the substantive constructs did not account for significant variance in the relationships among the appraisal reaction constructs.
NA structural baseline model vs. NA structural method model: Δχ²(6) = 14.25, p < .05.
    Implication: Suggests the presence of NA method effects in the substantive appraisal reaction constructs.
NA structural method model vs. NA constrained structural method model: Δχ²(15) = 0.69, ns.
    Implication: The method effects due to NA in the substantive constructs did not account for significant variance in the relationships among the appraisal reaction constructs.

Note. All analyses were conducted on an N of 182. PA = positive affect; NA = negative affect.

It is important to note that the fit indices suggest that this single-factor model does not represent the data as well as either the modified measurement model or the hierarchical model does. In particular, the SRMSR, which Hu and Bentler (1998) found to be the index most sensitive to models with misspecified factor covariances, rose to .38. In addition, the chi-square/degrees of freedom ratio was 2.85, which is considerably higher than that for the hierarchical model and approaches the upper bound of Bollen's (1989) rule of thumb, suggesting that the hierarchical model is more parsimonious. We conducted a chi-square difference test by subtracting the chi-square obtained for the hierarchical model from that obtained for the single-factor model. To facilitate comprehension, the associated values and implications of this test, as well as subsequent chi-square difference tests, are summarized in Table 4. As Table 4 indicates, this test revealed a significant difference between the two models. This, coupled with the chi-square/degrees of freedom ratio and the high SRMSR value, suggests that the data better fit a model in which the reaction constructs are modeled as distinct yet are related through a higher order factor (i.e., the hierarchical model).

Assessing Affect as a Method Effect of Appraisal Reaction Measurement

We were also interested in investigating the notion that self-report measures of appraisal reactions may be driven by a respondent's mood at the time of filling out the survey. Thus, we conducted tests of method bias separately for PA and NA, using the procedures described by Williams et al. (1996).³ These procedures are summarized in the middle portion of Table 2, and the results are summarized in Tables 3 and 4.

Positive affect. We first estimated a baseline measurement model (the PA baseline model) containing the six appraisal reactions and PA. In this model, the six appraisal reaction constructs were allowed to freely correlate, but PA was left uncorrelated with each of the reaction constructs. We also estimated the direct paths from the reaction constructs to each of their associated indicators. We compared the PA baseline model with a method effect model (the PA method model), which is illustrated in Figure 3. This model was identical to the PA baseline model except that direct paths from PA to each of the indicators of the six reaction constructs were estimated, in addition to the direct paths from the reaction constructs to their respective indicators. Results for the PA method model indicate that all but one of the substantive indicators (Variable 17, a procedural justice indicator) loaded significantly on the PA construct, although the loadings were not considerably high (range = .14 to .32). We conducted a chi-square difference test comparing the two models, which allowed for an overall test of relationships with PA. This resulted in a significant difference, χ²(21) = 41.31, p < .01, suggesting the presence of method variance in the form of PA in the measurement of appraisal reactions. The correlations between the appraisal reaction constructs were only slightly lower for the PA method model than for the PA baseline model, with an average difference of .025. In terms of fit, estimating PA method effects improved the fit only slightly. The next step involved assessing the measurement effects of PA on the estimates of the appraisal reaction correlations.

³ The analyses presented were performed separately for PA and NA, primarily because of the issue of sample size per parameters estimated associated with the method effect models (see Williams & Anderson, 1994, for a more detailed discussion). However, analyses examining the simultaneous method effects associated with PA and NA were also conducted, in line with the procedures described by Williams and Anderson (1994). The pattern of results and interpretations were consistent with those presented in the current article and are available from us.

[Figure 3. Method effect models. PA = positive affect; NA = negative affect; Sat = satisfaction; Proc = procedural; Dist = distributive; X and Y represent manifest indicators.]

In other words, did the method variance represented as factor loadings result in method-biased measurement of the appraisal reactions? To this end, we estimated the PA constrained method model. This model was similar to the PA method model, except that the 15 appraisal reaction correlations, which were estimated freely in the PA method model, were set to equal the LISREL estimates obtained from the baseline model. The chi-square for the PA constrained method model was compared with that for the PA method model, and a difference test was computed. This difference was not significant, χ²(15) = 1.34, suggesting that the PA method effect did not bias the appraisal reaction factor correlations obtained.
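A sketch of the same logic in our running semopy notation: PA is given its six mood-adjective indicators, every substantive indicator is additionally allowed to load on PA, and the covariances between PA and the reaction constructs are fixed at zero (the 0* constant follows lavaan conventions, which semopy also accepts). Item names remain hypothetical.

    REACTION_FACTORS = ["SessionSat", "SystemSat", "Utility",
                        "Accuracy", "ProcJust", "DistJust"]
    REACTION_INDICATORS = [
        "sess1", "sess2", "sess3", "syst1", "syst2", "syst3",
        "util1", "util2", "util3", "util4", "acc_p1", "acc_p2", "acc_p3",
        "proc1", "proc2", "proc3", "proc4", "dist1", "dist2", "dist3", "dist4",
    ]  # the 21 reaction indicators

    PA_METHOD_MODEL = (
        MEASUREMENT_MODEL
        + "PA =~ pa1 + pa2 + pa3 + pa4 + pa5 + pa6\n"       # affect indicators
        + "PA =~ " + " + ".join(REACTION_INDICATORS) + "\n"  # method loadings
        + "\n".join(f"PA ~~ 0*{factor}" for factor in REACTION_FACTORS)
    )  # PA kept orthogonal to the substantive constructs

    pa_method = semopy.Model(PA_METHOD_MODEL)
    pa_method.fit(data)

    # Overall test of PA method variance: the baseline model is the method
    # model with the 21 method loadings constrained to zero (Table 3 values).
    print(chi_square_difference(367.33, 208, 408.64, 229))  # 41.31 on 21 df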


To obtain a more simplified picture of the effects of PA on the measurement of the appraisal reaction constructs, we partitioned the systematic variance of each of the indicators in the PA method model into a component associated with its substantive reaction construct and a component associated with PA. We obtained these estimates by squaring the factor loadings from the completely standardized LISREL estimates for the 21 reaction indicators, as suggested by Williams et al. (1996). This analysis indicated that, on average, 74% of the variance in the indicators was accounted for by the appraisal reactions (range = 49% to 88%), whereas only 7% of the variance, on average, was accounted for by PA (range = 2% to 10%). This quite clearly indicates that PA did not present a significant method bias in the measurement of the appraisal reactions.
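The partitioning itself is just arithmetic on the completely standardized solution: square each indicator's loading on its substantive factor and its loading on PA. The loadings below are placeholder values chosen to reproduce the reported averages, not the actual estimates.

    import numpy as np

    # Placeholder completely standardized loadings for the 21 indicators.
    substantive_loadings = np.full(21, 0.86)  # loadings on own reaction construct
    method_loadings = np.full(21, 0.26)       # loadings on the PA method factor

    substantive_variance = substantive_loadings ** 2  # ~.74 each
    method_variance = method_loadings ** 2            # ~.07 each
    print(substantive_variance.mean(), method_variance.mean())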
Negative affect. The procedures for testing the method effects of NA on appraisal reactions were identical to those described previously for PA. Thus, the NA baseline model and the NA method model were estimated. Similar to the case with PA, all but one of the indicator loadings in the NA method model (Variable 8, a utility indicator) on the NA construct were significant. However, similar to the results for PA, the range of these loadings was quite low (.15 to .29). A chi-square difference test computed by subtracting the chi-square for the NA method model from the chi-square for the NA baseline model resulted in a nonsignificant value, χ²(21) = 26.22, suggesting that NA did not represent a method effect in the measurement of the appraisal reactions. Further support for this suggestion comes from the squared factor loadings for the NA method model, which indicate that only 5% (range = 2% to 8%) of the variance in the indicators, on average, was accounted for by NA, whereas on average, 75% (range = 50% to 90%) of the variance was accounted for by the reaction constructs. Given the finding of no method variance effect, examining potential method bias was unnecessary.

Method effects on reaction constructs. The previous two sections deal with the effects of affect on the measurement of appraisal reactions. Following Williams et al. (1996), we also examined the potential for method effects at the level of structural parameters, and these procedures are summarized in the bottom portion of Table 2. That is, we estimated the effect of positive and negative affect on the appraisal reaction constructs themselves rather than on the measurement of these constructs (referred to as the congeneric model in Williams et al., 1996). This provides a test of the presence of substantive effects of affect on the relationships among the appraisal reaction constructs. That is, do method effects represented as structural paths significantly bias the relations among appraisal reaction constructs? Given that the pattern of results was identical for PA and NA, in the interest of brevity, the results for these method effects are presented together. Results are written with respect to PA, and results for NA appear in parentheses. The first step was to estimate a simple structural model, the PA structural baseline model (NA structural baseline model), identical to the PA baseline model (NA baseline model), in which PA (NA) was not related to the other constructs. Because this model was estimated on the x and y sides in LISREL, the PA structural baseline model (NA structural baseline model) estimated 15 disturbance correlations (i.e., ψs), whereas the PA baseline model (NA baseline model) estimated 15 factor correlations. Its matrices,

chi-square values, fit indices, and so forth were identical to those of the PA baseline model (NA baseline model), as we expected. We then compared the PA structural baseline model (NA structural baseline model) with the PA structural method model (NA structural method model), in which the structural paths from PA (NA) to the six reaction constructs were estimated. This model is illustrated in Figure 4. A chi-square difference test between the two models was significant, PA: χ²(6) = 19.16, p < .01; NA: χ²(6) = 14.25, p < .05, indicating the presence of substantive PA (NA) relationships. All six of the paths from PA (NA) to the reactions were significant. In addition, the correlations among the disturbances for the PA structural method model (NA structural method model) were not considerably lower than those obtained for the PA structural baseline model (NA structural baseline model). These correlations are conceptually equivalent to partial correlations controlling for PA (NA; Williams et al., 1996) and thus represent the remaining shared variance among appraisal reaction constructs after controlling for PA (NA). The average difference between these correlations for the PA structural baseline model and the PA structural method model was .083, whereas the average difference between the NA structural baseline model and the NA structural method model was .057. Thus, although the chi-square difference test indicates the presence of method effects, they do not appear to have altered the relationships among the reaction constructs in any practical sense, for either PA or NA. To further assess the substantive impact of PA (NA) on the relations between the reaction constructs, we estimated the PA constrained structural method model (NA constrained structural method model). The chi-square difference test between the PA structural method model (NA structural method model) and the PA constrained structural method model (NA constrained structural method model) was not significant, PA: χ²(15) = 1.13; NA: χ²(15) = 0.69, further suggesting that PA (NA) did not account for a significant amount of variance in the relationships among the appraisal reactions and thus did not represent a method bias. This is consistent with the small differences found in the partial correlations between the baseline and method effects models, as we discussed previously.

[Figure 4. Structural method effect models. PA = positive affect; NA = negative affect; Sat = satisfaction; Proc = procedural; Dist = distributive.]
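In the running semopy notation, the structural (congeneric) version replaces the indicator-level method loadings with regressions of each reaction construct on PA and frees the 15 disturbance covariances among the now-endogenous reaction factors. This is a sketch of the form under our earlier assumptions, not the authors' LISREL setup.

    from itertools import combinations

    # 15 disturbance covariances among the endogenous reaction constructs.
    DISTURBANCES = "\n".join(
        f"{a} ~~ {b}" for a, b in combinations(REACTION_FACTORS, 2)
    )

    PA_STRUCTURAL_METHOD_MODEL = (
        MEASUREMENT_MODEL
        + "PA =~ pa1 + pa2 + pa3 + pa4 + pa5 + pa6\n"
        + "\n".join(f"{factor} ~ PA" for factor in REACTION_FACTORS) + "\n"
        + DISTURBANCES
    )

    pa_structural = semopy.Model(PA_STRUCTURAL_METHOD_MODEL)
    pa_structural.fit(data)

    # Structural baseline vs. structural method model (values from Table 3):
    print(chi_square_difference(389.48, 223, 408.64, 229))  # 19.16 on 6 df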
Discussion

The purpose of the present study was to address particular concerns regarding the measurement of reactions in the performance appraisal literature and to examine this issue in detail. More specifically, we attempted to address questions regarding the adequacy of the measurement of appraisal reactions, the appropriateness of operationalizing appraisal reactions as reflecting a higher order factor, and the extent to which positive and negative affect might present a method bias in assessing appraisal reactions. Our discussion addresses each of these three issues in turn.
Measurement Adequacy of Current Appraisal Measures

The confirmatory factor analysis examining the six appraisal reactions suggests that, overall, the indicators reflected the reaction constructs very well. As Figure 1 illustrates, the factor loadings ranged from .76 to .97, with an average loading of .89. The reliabilities for these scales were also quite high, as Table 1 indicates. Thus, it appears that insofar as these scales are representative of the measures typically employed, they do a good job of assessing appraisal reactions. That is not to say, however, that these measures could not be improved.

Figure 4. Structural method effect models. PA = positive affect; NA = negative affect; Sat = satisfaction; Proc = procedural; Dist = distributive.

Correlated errors. On the basis of modification indices, we allowed eight within-construct correlations between errors, which resulted in an improvement in fit for the measurement model. The constructs affected by these correlated errors were system satisfaction, session satisfaction, and procedural justice. Although this by no means invalidates our results, it does represent a potential problem with these measures. The fact that these errors are highly correlated indicates that the items share common variance beyond the variance they share by reflecting the same construct. This may represent what Anderson and Gerbing (1988) termed high external consistency, which is often indicative of the multidimensionality of a measure. Often, the unspecified source of variance that a set of indicators reflects is a methodological factor. A look at the items composing the procedural justice construct indicates that this is a probable explanation for the high correlations among the errors. More specifically, this methodological bias seems to result from the fact that the four items involved all share one of two question stems (i.e., "The procedures used to evaluate my performance were . . ." or "The process used to evaluate my performance was . . ."). Perhaps employees were not easily able to distinguish between appraisal procedures and processes, and the similar stems led to the correlated errors. With respect to the two appraisal satisfaction constructs, it is less clear why the errors were highly correlated. On the surface, they do not appear to suffer from a methodological bias, as was the case with procedural justice. It is possible that each of these measures is related to a second, substantive factor that was unmeasured in the current study. Future research should continue to examine this potential measurement concern.

Cross-loadings of indicators. It is important to recall that we suggested earlier that the appraisal reactions literature contains scales that may be confounded because researchers often assume that measures are assessing distinct constructs simply because they are given different labels.


A review of the modification indices for the indicators of the original measurement model in the present study reveals some support for our contention, particularly for the system and session satisfaction measures. More specifically, one distributive justice indicator, two of the three system satisfaction indicators, and all three of the session satisfaction indicators possessed large modification indices for multiple constructs, suggesting that model fit would be improved if these indicators were allowed to load on multiple constructs. The worst case involved the first indicator for session satisfaction, the item "I felt quite satisfied with my last review discussion." Modification indices indicated that simply allowing this item to cross-load on the system satisfaction, accuracy, procedural justice, and distributive justice constructs would reduce the chi-square by a total of 160.26. Similar to the issue discussed previously with respect to correlated errors, Anderson and Gerbing (1988) suggested that indicators with cross-loadings represent another situation in which there might be high external consistency and, therefore, multidimensionality. This is perhaps not surprising, given the nature of the satisfaction construct, which tends to be more general than other appraisal reactions. Future research should continue to examine these and alternate measures of session and system satisfaction and investigate the question of multidimensionality with these constructs.
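Because a modification index approximates the drop in the model chi-square that would result from freeing a single fixed parameter, each index can be referred to the chi-square distribution with 1 degree of freedom. The sketch below illustrates this screening logic; the index values are invented (chosen only so that they sum to roughly the 160.26 total reported above) and the construct labels are for illustration.

```python
from scipy.stats import chi2

# Critical value for freeing one parameter at alpha = .05 (about 3.84).
critical = chi2.ppf(0.95, df=1)

# Hypothetical modification indices for candidate cross-loadings of a
# session satisfaction item; the values are illustrative only.
mod_indices = {
    "system satisfaction": 58.3,
    "perceived accuracy": 41.7,
    "procedural justice": 35.1,
    "distributive justice": 25.2,
}

# Flag cross-loadings whose expected chi-square reduction exceeds the
# chi-square(1) criterion.
flagged = {k: v for k, v in mod_indices.items() if v > critical}
print(f"Cross-loadings exceeding the {critical:.2f} criterion: {flagged}")
print(f"Total expected chi-square reduction: {sum(flagged.values()):.2f}")
```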

Conceptualizing and Operationalizing the Nature of Appraisal Reactions

Cardy and Dobbins (1994) suggested that appraisal reactions might together reflect one aspect of a multidimensional construct they labeled appraisal effectiveness. As a preliminary empirical examination of this conceptualization, we estimated a model in which the reaction constructs were modeled as latent reflectors of a higher order construct. Results indicate that this model fit the data quite well: The paths relating Appraisal Reactions to the appraisal constructs were all high and significant, and the fit indices indicated good fit to the data. In contrast, an alternate, nested model representing the appraisal reactions as a single factor did not fit the data as well as the hierarchical model did. Thus, it appears that appraisal reactions are better conceptualized and operationalized as separate yet related entities.

The difficulty in addressing the issue of how best to represent appraisal reactions involves the comparison between the modified measurement model illustrated in Figure 1 and the hierarchical model illustrated in Figure 2. Because it is not possible to compare these models statistically, the comparison becomes a subjective one (MacCallum, 1995). On the basis of the fit indices and the chi-square/degrees of freedom ratio, these models do not differ in any meaningful way. For example, the RMSEA and TLI for the hierarchical model are both lower by .01 compared with the modified measurement model. However, this was expected, given the larger number of parameters estimated in the modified measurement model (166 vs. 175 degrees of freedom for the hierarchical model; Hu & Bentler, 1995), and therefore it does not represent a meaningful difference between the two models.

For us, the issue of model choice is ultimately one of parsimony. The hierarchical model simply presents a more parsimonious representation of the data, both conceptually and empirically. Empirically, we obtained roughly equivalent fit while estimating fewer parameters. Mulaik and James (1995) suggested that, given the same data set, models with good fit and more degrees of freedom are preferred because they are "subjected to more conditions of disconfirmability" (p. 132). More importantly, however, representing appraisal reactions as reflecting an overall Appraisal Reactions construct is conceptually more parsimonious than representing these constructs as correlated entities. In addition, the hierarchical model is based on sound theoretical grounds, which is an important consideration when interpreting models (MacCallum, 1995) and lends credence to our preference for this model over the modified measurement model. It would be remiss of us not to emphasize, however, that both the modified measurement model and the hierarchical model are viable, given the data.
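The parsimony argument can also be seen in the form of the fit indices themselves. One common estimate of the RMSEA (following Browne & Cudeck, 1993) is shown below; because the model degrees of freedom enter the denominator, a model that attains a comparable chi-square while estimating fewer parameters, and thus retaining more degrees of freedom, is not penalized for its parsimony.

```latex
\widehat{\mathrm{RMSEA}}
  = \sqrt{\max\!\left(\frac{\chi^{2} - df}{df\,(N - 1)},\; 0\right)}
  = \sqrt{\max\!\left(\frac{\chi^{2}/df - 1}{N - 1},\; 0\right)}
```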

Affective Method Bias and Appraisal Reactions

The final issue that we addressed in the current study is whether, and to what extent, positive and negative affect presented a method bias in the measurement of appraisal reactions (Doty & Glick, 1998). We felt that this was an important issue to examine in the context of appraisal reactions, given the pervasive criticism that self-report data often suffer from common method variance. At the measurement level, results indicated that no method variance occurred in the form of negative affect, and although method variance occurred in the form of positive affect, it did not result in method bias. We also examined the effect of positive and negative affect on the appraisal constructs themselves, as opposed to just the measurement of them. Similar to the findings at the measurement level, results strongly suggested that although small method effects were present in the data as a result of both positive and negative affect, these effects did not bias the appraisal reactions in any meaningful respect.

To our knowledge, the present study represents the first examination of method bias in the measurement of appraisal reactions. Although common method variance is a prevalent critique of attitude and reactions research, findings from other research areas are consistent with ours and argue against a substantial role for common method variance as a biasing factor in the assessment of focal constructs. First, with respect to method effects at the measurement level, Williams and Anderson (1994) found that positive emotionality had only a minor impact on the relationships among organizational attitudes such as job satisfaction and organizational commitment; negative emotionality was not found to induce any method bias in the same variables. Similarly, Williams et al. (1996) found that small method effects in the form of negative affectivity did not substantially affect the relationships among variables such as role ambiguity, job satisfaction, and affective organizational commitment. In addition, Munz, Huelsman, Konold, and McKinney (1996) found that negative affectivity had weak but inconsequential measurement relationships with job characteristics and affective outcomes. Finally, Doty and Glick (1998) concluded from their review of multitrait-multimethod studies that although common method bias is a legitimate concern, it is neither as prevalent nor as damning a criticism as the literature might suggest. In particular, the detected bias was typically not large enough to affect the theoretical interpretations of the substantive relationships.

With respect to method effects at the construct level, both Williams et al. (1996) and Munz et al. (1996) found method effects for negative emotionality/affectivity on substantive constructs, but again, these effects did not alter the relationships between those constructs. In combination, our study and these others, which examined the role of common method variance in separate research areas, arrive at a very similar conclusion: Common method variance exists, but at low and usually inconsequential levels.

Implications and Future Research

The results of the present study suggest some important implications for the measurement of appraisal reactions. First, the data support a model in which the reaction constructs remain distinct yet reflect a higher order construct, consistent with Cardy and Dobbins's (1994) conceptualization of appraisal reactions as representing one part of the ultimate criterion of appraisal effectiveness. According to these authors, rating errors and rating accuracy, in conjunction with appraisal reactions, complete the assessment of appraisal effectiveness. However, the performance appraisal field has been moving away from indices of error and accuracy and relying more on qualitative criteria such as reactions, which underscores the importance of appropriate and accurate measurement of these criteria. Our findings provide a theoretical and practical framework for these criteria, which continue to grow in importance.

Second, on the basis of the scales employed in this study, the measurement of appraisal reactions appears to be quite good, with some qualifications. Measurement of the procedural justice, system satisfaction, and session satisfaction constructs needs to be fine-tuned to alleviate the problems associated with correlated errors and cross-loadings. Thus, although the reaction variables are distinct constructs, there is some overlap that could be attenuated with careful attention to item writing and measurement. Because we examined only one measure per construct, it was not possible to address whether there is a measurement problem in the appraisal reactions literature such that scales given the same label actually reflect different constructs (Pedhazur & Schmelkin, 1991). As we mentioned previously, there are multiple measures of appraisal reactions in the literature, many of which are labeled as the same construct. Future research could adopt a multitrait-multimethod approach (Campbell & Fiske, 1959) to examine the prevalence of this assumption in appraisal reactions. In general, we suggest that researchers and practitioners be careful when choosing or developing measures, in an attempt to minimize confounds.

Third, positive and negative affect do not seem to induce a method bias in the measurement of appraisal reactions or in the constructs themselves. Thus, the criticism that self-report data are contaminated by method variance seems, in the context of appraisal reactions, to be unfounded, at least with respect to affect as the biasing factor. Of course, this issue should be re-examined in future research, and results should be cross-validated before any definitive conclusions are drawn. However, our research strongly supports the notion that affect does not substantively bias the relationships among appraisal reaction constructs.

Fourth, it is important to note that in the current study, appraisals were conducted for administrative purposes. It is reasonable to expect that appraisal purpose should influence employees' reactions.4


Moreover, as Murphy and Cleveland (1995) suggested, beliefs regarding appraisal purpose are just as important as the actual purpose. In the current study, for example, it was clear from employees' open-ended responses that the appraisal was used for administrative purposes and that employees understood this. Although appraisal purpose should affect employees' reactions, such that different reaction criteria become more or less important for different appraisal purposes, we suggest that it should not affect the type of reactions that exist. That is, although appraisal purpose may affect the mean level of reactions, such that certain reactions are stronger for a particular purpose, it should not affect the existence of those reactions. Thus, the measurement properties examined in the current article should be relevant across appraisal purposes, although the level of the reactions may vary. This is, however, an empirical question that should be addressed by future research. Previous research has focused on the effect of appraisal purpose on appraisal processes and outcomes, such as the psychometric quality and accuracy of ratings (Murphy & Cleveland, 1995), but little attention has been directed toward appraisal reactions as a criterion. Given the increasing importance placed on employee reactions, it seems timely for research to examine this issue.

Finally, we began this article by presenting three major concerns associated with appraisal reactions and posing sequential questions in an attempt to examine the measurement of these constructs. Although the present study represents only a preliminary examination of these issues, we believe that it contributes to the appraisal literature for the following reasons. First, we have identified some of the deficiencies in the measurement of appraisal reactions and suggested areas for improvement. Second, we have provided researchers and practitioners with a conceptual and empirical starting point for measuring appraisal reactions. Third, we have shown that appraisal reactions do not appear to be biased by method variance in the form of positive or negative affect. Ultimately, the issue of appraisal measurement, like so many others, exists within a larger organizational context complete with constraints; thus, accurate and comprehensive measurement is not always possible. It is our hope, however, that the current study provides an awareness of some of the important issues that should be considered when measuring appraisal reactions.

4 We thank an anonymous reviewer for raising this issue.

References

Anderson, J. C., & Gerbing, D. W. (1988). Structural equation modeling in practice: A review and recommended two-step approach. Psychological Bulletin, 103, 411-423.
Bagozzi, R. P., & Edwards, J. R. (1998). A general approach for representing constructs in organizational research. Organizational Research Methods, 1, 45-87.
Balzer, W. K., & Sulsky, L. M. (1990). Performance appraisal effectiveness. In K. R. Murphy & F. E. Saal (Eds.), Psychology in organizations: Integrating science and practice. Hillsdale, NJ: Erlbaum.
Banks, C. G., & Murphy, K. R. (1985). Toward narrowing the research-practice gap in performance appraisal. Personnel Psychology, 38, 335-345.
Bernardin, H. J., & Beatty, R. W. (1984). Performance appraisal: Assessing human performance at work. Boston: Kent.
Bernardin, H. J., Hagan, C. M., Kane, J. S., & Villanova, P. (1998). Effective performance management: A focus on precision, customers, and situational constraints. In J. W. Smither (Ed.), Performance appraisal: State of the art in practice. San Francisco: Jossey-Bass.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bretz, R. D., Milkovich, G. T., & Read, W. (1992). The current state of performance appraisal research and practice: Concerns, directions, and implications. Journal of Management, 18, 321-352.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models. Newbury Park, CA: Sage.
Burke, R. J., Weitzel, W., & Weir, T. (1978). Characteristics of effective employee performance review and development interviews: Replication and extension. Personnel Psychology, 31, 903-919.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cardy, R. L., & Dobbins, G. H. (1994). Performance appraisal: Alternative perspectives. Cincinnati, OH: South-Western Publishing.
Cawley, B. D., Keeping, L. M., & Levy, P. E. (1998). Participation in the performance appraisal process and employee reactions: A meta-analytic review of field investigations. Journal of Applied Psychology, 83, 615-633.
Dipboye, R. L., & dePontbriand, R. (1981). Correlates of employee reactions to performance appraisals and appraisal systems. Journal of Applied Psychology, 66, 248-251.
Dorfman, P. W., Stephan, W. G., & Loveland, J. (1986). Performance appraisal behaviors: Supervisor perceptions and subordinate reactions. Personnel Psychology, 39, 579-597.
Doty, D. H., & Glick, W. H. (1998). Common methods bias: Does common methods variance really bias results? Organizational Research Methods, 1, 374-406.
Evans, E. M., & McShane, S. L. (1988). Employee perceptions of performance appraisal fairness in two organizations. Canadian Journal of Behavioural Science, 20, 177-191.
Feldman, J. M., & Lynch, J. G. (1988). Self-generated validity and other effects of measurement on belief, attitude, intention, and behavior. Journal of Applied Psychology, 73, 421-435.
Giles, W. F., & Mossholder, K. W. (1990). Employee reactions to contextual and session components of performance appraisal. Journal of Applied Psychology, 75, 371-377.
Greller, M. M. (1978). The nature of subordinate participation in the appraisal interview. Academy of Management Journal, 21, 646-658.
Hall, R. J., Snell, A. F., & Foust, M. S. (1999). Item parceling strategies in SEM: Investigating the subtle effects of unmodeled secondary constructs. Organizational Research Methods, 2, 233-256.
Hedge, J. W., & Borman, W. C. (1995). Changing conceptions and practices in performance appraisal. In A. Howard (Ed.), The changing nature of work. San Francisco: Jossey-Bass.
Hu, L., & Bentler, P. M. (1995). Evaluating model fit. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications. Thousand Oaks, CA: Sage.
Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424-453.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Hyatt, D., Cropanzano, R., Finfer, L. A., Levy, P. E., Ruddy, T. M., Vandaveer, V., & Walker, S. (1997). Bridging the gap between academics and practice: Suggestions from the field. The Industrial-Organizational Psychologist, 35(1), 29-32.
Ilgen, D. R., Barnes-Farrell, J. L., & McKellin, D. B. (1993). Performance appraisal process research in the 1980s: What has it contributed to appraisals in use? Organizational Behavior and Human Decision Processes, 54, 321-368.
Inderrieden, E. J., Allen, R. E., & Keaveny, T. J. (1992, August). An investigation of the antecedents and consequences of voluntary self-ratings in a performance appraisal system. Paper presented at the annual meeting of the Academy of Management, Las Vegas, NV.
Jöreskog, K. G., & Sörbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Hillsdale, NJ: Erlbaum.
Keeping, L. M., Makiney, J. D., Levy, P. E., Moon, M., & Gillette, L. M. (1999, August). Self-ratings and reactions to feedback: It's not how you finish but where you start. In R. A. Noe (Chair), New approaches to understanding employees' affective and behavioral responses to multirater feedback systems. Symposium conducted at the annual meeting of the Academy of Management, Chicago, IL.
Korsgaard, M. A., & Roberson, L. (1995). Procedural justice in performance evaluation: The role of instrumental and non-instrumental voice in performance appraisal discussions. Journal of Management, 21, 657-669.
Landy, F. J., Barnes, J., & Murphy, K. (1978). Correlates of perceived fairness and accuracy of performance appraisals. Journal of Applied Psychology, 63, 751-754.
Lawler, E. E. (1967). The multitrait-multirater approach to measuring managerial job performance. Journal of Applied Psychology, 51, 369-381.
MacCallum, R. C. (1995). Model specification: Procedures, strategies, and related issues. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications. Thousand Oaks, CA: Sage.
MacCallum, R. C. (1998). Commentary on quantitative methods in I-O research. The Industrial-Organizational Psychologist, 35(4), 19-30.
Maroney, B. P., & Buckley, M. R. (1992). Does research in performance appraisal influence the practice of performance appraisal? Regretfully not! Public Personnel Management, 21, 185-196.
Mulaik, S. A., & James, L. R. (1995). Objectivity and reasoning in science and structural equation modeling. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications. Thousand Oaks, CA: Sage.
Munz, D. C., Huelsman, T. J., Konold, T. R., & McKinney, J. J. (1996). Are there methodological and substantive roles for affectivity in Job Diagnostic Survey relationships? Journal of Applied Psychology, 81, 795-805.
Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational, and goal-based perspectives. Thousand Oaks, CA: Sage.
Nathan, B. R., Mohrman, A. M., & Milliman, J. (1991). Interpersonal relations as a context for the effects of appraisal interviews on performance and satisfaction: A longitudinal study. Academy of Management Journal, 34, 352-369.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Erlbaum.
Prince, J. B., & Lawler, E. E. (1986). Does salary discussion hurt the developmental performance appraisal? Organizational Behavior and Human Decision Processes, 37, 357-374.
Schaubroeck, J., Ganster, D., & Fox, M. (1992). Dispositional affect and work-related stress. Journal of Applied Psychology, 77, 322-335.
Smither, J. W. (1998). Lessons learned: Research implications for performance appraisal and management practice. In J. W. Smither (Ed.), Performance appraisal: State of the art in practice. San Francisco: Jossey-Bass.
Spector, P. E. (1994). Using self-report questionnaires in OB research: A comment on the use of a controversial method. Journal of Organizational Behavior, 15, 385-392.
Stone, D. L., Gueutal, H. G., & McIntosh, B. (1984). The effects of feedback sequence and expertise of the rater on perceived feedback accuracy. Personnel Psychology, 37, 487-506.
Taylor, M. S., Tracy, K. B., Renard, M. K., Harrison, J. K., & Carroll, S. J. (1995). Due process in performance appraisal: A quasi-experiment in procedural justice. Administrative Science Quarterly, 40, 495-523.
Thorndike, R. L. (1949). Personnel selection. New York: Wiley.
Tourangeau, R., & Rasinski, K. A. (1988). Cognitive processes underlying context effects in attitude measurement. Psychological Bulletin, 103, 299-314.
Williams, L. J., & Anderson, S. E. (1994). An alternative approach to method effects by using latent-variable models: Applications in organizational behavior research. Journal of Applied Psychology, 79, 323-331.
Williams, L. J., Gavin, M. B., & Williams, M. L. (1996). Measurement and nonmeasurement processes with negative affectivity and employee attitudes. Journal of Applied Psychology, 81, 88-101.
Zuwerink, J. R., & Devine, P. G. (1996). Attitude importance and resistance to persuasion: It's not just the thought that counts. Journal of Personality and Social Psychology, 70, 931-944.

Received April 29, 1999
Revision received October 12, 1999
Accepted October 13, 1999
