Journal of Management, 2000, Vol. 26, No. 4, 813–835

Convergent and Discriminant Validity of Assessment Center Dimensions: A Conceptual and Empirical Reexamination of the Assessment Center Construct-Related Validity Paradox

Winfred Arthur, Jr., David J. Woehr, and Robyn Maldegen
Texas A&M University

Direct all correspondence to: Winfred Arthur, Jr., Department of Psychology, Texas A&M University, College Station, TX 77843-4235. Tel.: (409) 845-2502; E-mail address: [email protected]. Copyright © 2000 by Elsevier Science Inc.

This study notes that the lack of convergent and discriminant validity of assessment center ratings in the presence of content-related and criterion-related validity is paradoxical within a unitarian framework of validity. It also empirically demonstrates an application of generalizability theory to examining the convergent and discriminant validity of assessment center dimensional ratings. Generalizability analyses indicated that person, dimension, and person by dimension effects contribute large proportions of variance to the total variance in assessment center ratings. Alternately, exercise, rater, person by exercise, and dimension by exercise effects are shown to contribute little to the total variance. Correlational and confirmatory factor analyses results were consistent with the generalizability results. This provides strong evidence for the convergent and discriminant validity of the assessment center dimension ratings – a finding consistent with the conceptual underpinnings of the unitarian view of validity and inconsistent with previously reported results. Implications for future research and practice are discussed. © 2000 Elsevier Science Inc. All rights reserved.

Assessment centers are currently used in numerous private and public organizations (Gaugler, Rosenthal, Thornton, & Bentson, 1987; Spychalski, Quiñones, Gaugler, & Pohley, 1997) to evaluate thousands of people each year (Thornton & Byham, 1982). Assessment centers are used regularly for managerial assessment. In addition, they are being used more frequently for purposes other
than managerial assessment. For example, assessment centers have been used to evaluate salespeople, teachers and principals, engineers, rehabilitation counselors, and blue collar workers, as well as police and fire forces (Gaugler et al., 1987; Klimoski & Brickner, 1987). They have also been used to assist high school and college students with career planning (Arthur & Benjamin, 1999; Rowe & Mauer, 1991). The criterion-related validity of assessment centers, which has been extensively documented (Gaugler et al., 1987), is undoubtedly partially responsible for their popularity. In addition, content-related methods of validation are also used in assessment center development in an effort to meet professional and legal requirements (Sackett, 1987). Evidence for the construct-related validity of assessment center dimensions, however, has been less promising. Assessment centers are designed to evaluate individuals on specific dimensions or constructs across situations or exercises. Research, however, indicates that exercise rather than dimension factors emerge in the evaluation of assessees (Bycio, Alvares, & Hahn, 1987; Highhouse & Harris, 1993; Schneider & Schmitt, 1992; Turnage & Muchinsky, 1982). Thus, a lack of convergent validity, as well as a partial lack of discriminant validity, has been reported extensively in the literature (Brannick, Michaels, & Baker, 1989; Klimoski & Brickner, 1987; Sackett & Harris, 1988).

We would like to contribute to this debate by noting that the prevailing view that assessment center dimensions display content-related and criterion-related validity, but not construct-related validity, represents somewhat of a paradox. Specifically, this view is inconsistent with the current unitarian conceptualization of validity, which postulates that content-, criterion-, and construct-related validity are simply different strategies for demonstrating the construct validity of a test or measure (Binning & Barrett, 1989). As evidential bases from which, on the basis of some predictor or test score, inferences about future job performance can be supported or justified (Binning & Barrett, 1989; Landy, 1986; Lawshe, 1985), content-, criterion-, and construct-related validity form an interrelated, bound, logical system, such that demonstration of any two conceptually requires a demonstration of the third (Binning & Barrett, 1989). So within the framework of the unitarian view, at a theoretical/conceptual level (which we distinguish from single study demonstrations), if a measurement tool demonstrates criterion-related validity and content-related validity, as has been demonstrated with assessment centers, it should conceptually also be expected to demonstrate construct-related validity. Consequently, the prevailing view that assessment centers display criterion-related and content-related validity but not construct-related validity is somewhat paradoxical because such a state is inconsistent with the unitarian view of validity.

There are two explanations for the presence of assessment center content- and criterion-related validity in the absence of convergent and discriminant validity. First, assessment center procedures, implementation, and other methodological factors (Gaugler et al., 1987; Jones, 1992; Lievens, 1998; Schmitt, Schneider, & Cohen, 1990) may add measurement error that prevents appropriate convergent and discriminant validities from being obtained.
Second, assessment centers may be measuring constructs other than those originally intended by the
assessment center designers (Lance et al., 1998; Raymark & Binning, 1997; Russell & Domm, 1995). This second explanation suggests that the lack of convergent and discriminant validity evidence is not attributable to measurement error, but instead to mis-specification of the latent structure of the construct domain. As Russell and Domm (1995) note, "simply put, assessment center ratings must be valid representations of some construct(s), we just do not know which one(s)" (p. 26). The present study focuses on the first explanation; the implications of the second are reviewed and discussed in the Discussion section.

A number of procedure, implementation, and method-related explanations have been offered to account for the lack of convergent and discriminant validity of assessment center dimensions (Lievens, 1998). One explanation suggests that performance may be exercise-specific (Neidig & Neidig, 1984). For example, it is common for people to perform some job tasks better than others. If assessment center exercises are designed to elicit skills and abilities related to these job tasks, then it is also likely that people will perform better in some exercises than in others. This would result in higher dimension ratings for those exercises on which the person performs better and lower dimension ratings on exercises where worse performance occurs. The result is what seems to be inconsistent performance on the same dimension across exercises. This leads to smaller dimension intercorrelations across exercises and higher correlations among different dimensions in the same exercise.

A second explanation suggests that a given performance dimension may be more observable in some exercises than in others (Sackett & Dreher, 1982). Therefore, a candidate not using his or her one opportunity to perform a behavior indicative of a particular dimension may be rated higher or lower on that dimension than would be the case if there were ample opportunity to observe behaviors related to that performance dimension. This, too, could lead to increased variance in dimension ratings and a reduction in the size of the dimension correlations across exercises.

A final set of explanations that we explore focuses on assessment center design and implementation. Although there have been attempts to standardize the development and implementation of assessment centers (e.g., Task Force on Assessment Center Guidelines, 1989), they are primarily a method of assessment and there is currently no "one" way to design and implement them (Bender, 1973; Spychalski et al., 1997). (We draw a distinction between the method of measurement, that is, the means by which a construct is measured, and the content being measured, that is, the construct itself.) Indeed, a substantial amount of research indicates that differences in the implementation of assessment centers can lead to large variations in their predictive validity. For example, Schmitt et al. (1990) compared the correlations of overall assessment ratings (OARs) with teacher ratings from one assessment center implemented at 16 different sites. Although the original implementation was the same across sites, some sites took liberties to make changes during the time the assessment center was in use. These changes in implementation resulted in a considerable range in predictive validity coefficients. Corrected for unreliability and restriction of range, these coefficients ranged from −0.40 to 0.82.


Several other methodological factors and design characteristics appear to influence the psychometric properties of assessment center ratings. These include the number of dimensions observed, recorded, and subsequently rated (Bycio et al., 1987; Gaugler & Thornton, 1989; Schmitt, 1977); assessor/rater training (Gaugler et al., 1987; Woehr & Huffcutt, 1994); the participant-to-assessor ratio (Gaugler et al., 1987); and the type of evaluation approach (i.e., within-dimension vs. within-exercise; Harris, Becker, & Smith, 1993; Sackett & Dreher, 1988; Silverman, DeLessio, Woods, & Johnson, 1986). Additional factors identified and discussed by Gaugler et al. (1987) include the type of assessor (psychologists versus managers and supervisors; cf. Spychalski et al., 1997); whether feedback was given; the amount of assessor training (Dugan, 1988); and the number of days of observation. The effects of design and implementation factors on the predictive validity of assessment center ratings are consistent with Gaugler et al.'s (1987) conclusion that "assessment centers show both validity generalization and situational specificity" (p. 493, emphasis added).

Our objective in the present study was to extend previous research examining systematic sources of variance in assessment center ratings. Our goal was to examine whether an assessment center incorporating features, characteristics, and guidelines recommended by research and professional practice would demonstrate convergent and discriminant validity for the measured dimensions. In doing so, we took several steps to ensure that the design and implementation of the assessment center used here incorporated the features, characteristics, and guidelines recommended by research and professional practice. It should be noted that it was not our intention to examine the effects of specific design and implementation characteristics. This would have called for the implementation of a typical experimental design with manipulation of single features while holding others constant in a control group. Such a design was not possible or feasible in the present operational assessment center. It was inconceivable to ask the client organizations to allow us to increase the number of dimensions assessed to, say, 15, just to see what the effect of this manipulation would be! Thus we incorporated several variables at once. Consequently, this study does not speak to the specific effects of each design characteristic. However, the inclusion of several variables at once allows us to consider this study as a test of what current research would suggest is a reasonably good scenario for generating assessment center ratings that display discriminant and convergent validity. The specific assessment center design features and characteristics, along with the conceptual basis for their inclusion, are next reviewed.

Participant-to-Assessor Ratio and Number of Dimensions

The first variables of interest were the participant-to-assessor ratio and the number of dimensions assessors were asked to observe, record, and rate. These variables play an important role in the validity of assessment center ratings (Bycio et al., 1987). For example, Schmitt (1977) found that, instead of using the 17 designated dimensions in evaluating participants, assessors actually collapsed these 17 dimensions into three global dimensions for rating purposes. Along


similar lines, Sackett and Hakel (1979) found that only 5 of 17 dimensions were required to predict most of the variance in OARs. In an extension of Sackett and Hakel (1979), Russell (1985) also found that a single dimension dominated assessors' ratings of 16 dimensions. Gaugler and Thornton (1989) further demonstrated that assessors have difficulty differentiating between a large number of performance dimensions. Thus, it seems that when asked to rate a large number of dimensions, the cognitive demands placed on assessors may make it difficult for them to process information at the dimension level, resulting in a failure to obtain convergent and discriminant validity. A review of the extant literature showed that out of 19 studies identified that investigated and failed to support the convergent/discriminant validity of assessment centers, assessors were typically responsible for rating performance on a fairly large number of dimensions (mean = 11.10, SD = 5.24, median = 9, mode = 7, min = 3, max = 25). (A table summarizing the design and implementation characteristics and features of the 19 studies is available on request from the first author. See also Woehr and Arthur [1999].) Furthermore, 45% of the studies used 10 or more dimensions, an interesting statistic in light of Gaugler and Thornton's findings. Specifically, in Gaugler and Thornton's study, assessors were responsible for rating three, six, or nine dimensions. Those assessors who were asked to rate three or six dimensions provided more accurate ratings than those asked to rate nine. These findings are consistent with the upper limits of human information processing capacity reported in the cognitive psychology literature (Miller, 1956).

In summary, this body of research suggests there is much to lose from the inclusion of a large number of assessment center dimensions. The inability to simultaneously process a large number of dimensions may account for assessors' tendency to rate using more global dimensions, resulting in a failure to obtain convergent and discriminant validity. A large number of dimensions results in inaccurate ratings; assessors are more accurate when they are required to rate fewer dimensions. Thus, in the present assessment center, the participant-to-assessor ratio was never greater than 2 to 1 and the number of dimensions was limited to a relatively manageable set (i.e., 9 dimensions) compared to previous studies (mean = 11.10, SD = 5.24, median = 9, mode = 7, min = 3, max = 25). From an information processing and cognitive demand perspective, both of these features (i.e., low participant-to-assessor ratio and relatively small number of dimensions) ensured that assessors had ample time and opportunity to effectively observe, record, categorize, and subsequently rate participant behaviors.

Type of Assessor

The second variable focused on was the type of assessor, specifically psychologists versus managers and supervisors (Gaugler et al., 1987). Gaugler et al. posited that psychologists make better assessors because their education and training better equip them to observe, record, and rate behavior. Similar effects are reported by Sagie and Magnezy (1997). Of the previously mentioned 19 studies


examining the convergent/discriminant validity of assessment centers, only 10 reported information on the type of assessor used. Contrary to the research-based recommendation, all 10 of these studies used line managers or other employees of the organization instead of psychologists as assessors. Each assessor group (which consisted of three to four assessors) in our assessment center had at least two post-Master's and/or Ph.D. industrial/organizational (I/O) psychologists; thus, in total, 51% of the assessor staff consisted of I/O psychologists.

Evaluation Approach

A large number of the previous convergent/discriminant validity studies rated participants' performance on dimensions within-exercise instead of within-dimension. Specifically, two evaluation approaches have been identified across assessment centers (Sackett & Dreher, 1982). In the within-exercise approach, assessees are rated on each dimension after completion of each exercise. Alternately, evaluation may occur after all of the exercises have been completed, at which time the set of dimensions is rated based on performance from all of the exercises; this is the within-dimension approach. Conceptually, it is argued that rating dimensions after each exercise forces assessors to process information in terms of exercises. On the other hand, if assessors rate dimensions after the completion of all exercises, they are more likely to process information in terms of dimensions instead of exercises (Howard, 1997). Silverman et al. (1986) provided some evidence that the choice of approach may moderate findings of convergent and discriminant validity in assessment center ratings. Although their results suggested that a within-dimension approach is preferable to a within-exercise approach, Harris et al. (1993) failed to replicate their findings. Specifically, Harris et al.'s results "showed that both within-dimension and within-exercise scoring methods produced virtually the same average monotrait-heteromethod correlations and heterotrait-monomethod correlations" (p. 677). Nevertheless, since it has been posited as conceptually more appropriate because it is more consistent with obtaining dimension factors (Howard, 1997; Silverman et al., 1986), the present study used the within-dimension approach.

Assessor Training

Although assessor training is an important element in the development and implementation of assessment centers, a review of the extant literature indicated that rarely is any information provided about the type of assessor training. For instance, in a survey of assessment center practices in organizations, although Spychalski et al. (1997) provided substantial information about the content of assessor training programs, few details were provided about how the training programs were conducted. The assessor training used in this assessment center followed a frame-of-reference (FOR) approach, because the consensus in the rater training literature is that FOR is the most effective rater training approach (Woehr & Huffcutt, 1994). FOR training emphasizes the distinctiveness of dimensions, understanding how dimensions are defined, the use of behavioral incidents describing the dimensions, and practice and feedback using the rating instruments.


Generalizability Theory

A secondary objective was to recommend and demonstrate the use of generalizability theory analysis to assess convergent/discriminant validity. Our use of generalizability theory in this manner is not unique (e.g., Kraiger & Teachout, 1990). In addition, generalizability theory has been strongly recommended as a useful framework for understanding sources of variance in assessment center ratings (Brannick, Michaels, & Baker, 1989; Jones, 1992) as well as work sample tests in general (McHenry & Schmitt, 1994). This is because, as noted by previous authors (e.g., Jones, 1992; Saal, Downey, & Lahey, 1980; Turnage & Muchinsky, 1982), a lack of clear separation of sources of variance has confounded efforts to assess the construct-related validity of assessment centers. In addition, previous analytic approaches (i.e., correlational and confirmatory factor analysis [CFA] approaches to multitrait/multimethod [MTMM] matrices) have been limited in that they focus only on variance associated with assessment center exercises and dimensions, ignoring potential variation attributable to other sources such as the individual (assessee), the rater (assessor), and relevant interactions (described below). Thus, the choice of a generalizability theory approach (Brennan, 1994; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1991) as the primary analysis in the present study is particularly appropriate because it permits differentiation among multiple sources of systematic variance in assessment center ratings. In addition, it allows for the calculation of a summary coefficient (i.e., a generalizability coefficient) that reflects the dependability of measurement. This generalizability coefficient is analogous to the reliability coefficient in classical test theory. Despite these advantages and previous recommendations concerning the appropriateness of generalizability theory with respect to assessment center ratings, we are aware of no published study to date that has used this approach in the context of assessment center construct-related validity.

Four potential sources of variance in assessment center ratings were examined: person effects, dimension effects, exercise effects, and assessor effects. Previous research (e.g., Schneider & Schmitt, 1992; Turnage & Muchinsky, 1982) has examined these factors either individually or in pairs, but no study has examined the variance associated with all of these factors and their interactions in the same study. These four sources of variance are particularly relevant to an understanding of the variance in assessment center ratings. Specifically, the variance associated with the person component – the person effect – represents the overall level of agreement in ratings for individuals across exercises and dimensions. Second, variance associated with the person by dimension interaction represents the extent to which an individual's performance is discriminated across dimensions. This component in essence reflects the discriminant validity across dimensions. Third, variance associated with the main effect for assessors represents the systematic variance attributable to assessors rather than assessees (i.e., rater bias). Fourth, variance associated with the person by exercise interaction represents the extent to which individuals' performance differs across exercises. This component represents the extent to which performance is explainable by


exercises (situational effects). Finally, variance associated with the dimension by exercise interaction represents the extent to which dimensions are differentially represented in different exercises. Table 1 provides a summary of the sources of variance examined in the present study.

Evidence of construct-related validity is derived from the extent to which the variance associated with the constructs of interest (the measurement focus) is large relative to the variance associated with the conditions of measurement. In effect, this demonstrates that variance in the measurements is actually attributable to the intended construct(s). In the context of assessment centers, construct-related validity would be supported by relatively large variance components associated with the individuals being assessed (person effects), the constructs on which the individuals are assessed (dimension effects), and the interaction between individuals and dimensions (person × dimension effects), and relatively small variance components associated with the assessment center raters (rater effects), exercises (exercise effects), the interaction between individuals and exercises (person × exercise effects), and the interaction between dimensions and exercises (dimension × exercise effects). However, it should be noted that it is this pattern of effects that provides construct-related validity evidence and not any one effect in isolation. Of primary importance for evidence of construct-related validity is the variance associated with person by dimension effects. Specifically, this component indicates the extent to which distinctions are made across both individuals and dimensions.

Previous examinations of assessment center convergent/discriminant validity have most commonly analyzed their data using zero-order correlational analysis or a factor-analytic approach (i.e., traditional MTMM analyses). Consequently, for comparative purposes, we also analyzed our data using traditional MTMM-oriented correlations and CFA.

Table 1. Sources of Variability in Assessment Center Ratings

Source of Variability    Type of Variability
P:R                      Systematic variance associated with the person factor
X                        Systematic variance associated with exercise
D                        Systematic variance associated with dimension
R                        Systematic variance associated with rater
P × X                    The extent to which an assessee's performance rating differs on one exercise relative to the other exercises
P × D                    The extent to which an assessee's performance differs on one dimension relative to the other dimensions
D × X                    The extent to which dimensions are more observable in some exercises than others
P:R × X × D × R          The residual error and unmeasured aspects of the p, x, d, and r facets
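Although the article does not write it out, the facets in Table 1 imply a standard generalizability-theory score model. As a minimal sketch in our own notation (not an equation from the paper), the rating given to person p (nested within rater r) on dimension d in exercise x can be decomposed as

    X_{prdx} = \mu + \nu_{p:r} + \nu_{d} + \nu_{x} + \nu_{r} + \nu_{pd} + \nu_{px} + \nu_{dx} + e_{prdx},

where each \nu term is a random effect corresponding to a row of Table 1 (with the person effects nested within raters) and e is the residual. The generalizability analyses reported below estimate the variance of each of these effects; the construct-related validity argument rests on the person, dimension, and person-by-dimension variances being large relative to the rest.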


Method

Participants

Data were collected as part of an operational assessment center developed for two federal agencies. The assessment center was used 10 times over an 8-year period (i.e., 1989–1996) to evaluate 149 government managerial employees for developmental purposes. To date, 97 males and 52 females were assessed, including 119 Caucasians, 18 African Americans, 5 Hispanics, and 7 who described their race as "other."

Assessors

Across the 10 administrations, a total of 31 different assessors (15 males and 16 females) were used. Twenty-four assessors were Caucasian, 3 were African American, 3 were Hispanic, and 1 was Asian American. Finally, 51% of the assessors were Ph.D. or post-Master's I/O psychologists; the remainder were nonpsychologists in business-related fields (e.g., management consultants).

The Assessment Center

The assessment center was developed using the following steps. First, a detailed job analysis was conducted using interviews with incumbents and supervisors, and a review of current job descriptions on file. Draft job descriptions were then reviewed by incumbents and supervisors, their comments and suggestions solicited, and on the basis of this, the job description was revised and finalized. Second, using the job description, critical knowledge, skills, abilities, and other characteristics (KSAOs) necessary for successful job performance were identified. The third step called for identification of behavioral dimensions related to the identified KSAOs (or job requirements). The implementation of this step (i.e., the identification, labeling, and definition of dimensions) was guided by the extant literature (e.g., Thornton & Byham, 1982, pp. 138–140) and our familiarity with the assessment center and managerial performance literatures. Finally, exercises measuring the behavioral dimensions were identified. This developmental process ensured a high level of content-related validity. The sequence of steps implemented in the development of the assessment center is summarized below:

Job Analysis ➞ Job Description (Work Behaviors) ➞ KSAOs ➞ Behavioral Dimensions ➞ Exercises

Nine behavioral dimensions – oral communication, written communication, influencing others, innovation, flexibility, team building, organizing and planning, problem solving, and stress tolerance – were identified as a result of this process. Definitions of these nine dimensions are presented in the Appendix. Four assessment center exercises were used. Sequentially, these were (1) a competitive allocation exercise (leaderless group discussion); (2) an in-basket exercise followed by an interview to answer questions concerning the in-basket (the in-basket interview was used to obtain oral communication scores for this exercise); (3) a written exercise; and (4) a noncompetitive management problems exercise


(another leaderless group discussion). (A detailed description of the four exercises is available from the first author upon request.)

Assessors received an 8-hr FOR-based training program (Bernardin & Buckley, 1981). The assessment center method and purpose for conducting the assessment center were initially explained. This was followed by definitions of the dimensions and descriptions of the exercises. Assessors were given examples of behaviors representing each dimension. Each set of behaviors was accompanied by a rating scale indicating the effectiveness levels of the behaviors listed. Once assessors were familiar with the dimensions and exercises, they practiced observing behaviors using videotapes of assessee groups participating in the competitive allocation exercise. Prior to the practice observations, trainers stressed observing and recording assessee behaviors without trying to categorize the behaviors into dimensions. After watching the videotape and recording behaviors, assessors were asked to categorize behaviors into dimensions and provide ratings on each dimension. When assessors finished categorizing behaviors and rating performance on the dimensions, they discussed their observations and ratings, using this as an opportunity to build a common frame of reference. After the competitive allocation exercise, assessors were each given a completed in-basket. Assessors were directed to record any information provided by the assessee and were given the opportunity to clarify any confusion during a practice in-basket interview. After the in-basket exercise, assessors were directed to record behaviors and provide dimension ratings. Again, behaviors and ratings were discussed so that a common frame of reference could be developed. The next training segment familiarized assessors with the rating process. Assessors were trained to work across exercises and to complete rating one dimension before moving on to the next. The final segment of training demonstrated how to prepare feedback reports and conduct feedback interviews. Finally, as part of their training, new assessors shadowed a more experienced assessor for at least one assessment center administration before they served as a full assessor whose ratings were used operationally.

Each assessment center candidate participated in each of the four exercises and was evaluated on the nine dimensions previously discussed (see Figure 1). Groups of three to four assessors were assigned to observe and/or rate the materials of a group of four to six assessees. For the group exercises, this included sitting in the back of the room, so as not to be obtrusive, and recording behaviors displayed by the assessees. For the group exercises (i.e., competitive allocation and noncompetitive management problems), each assessor observed and recorded the behavior of one or two assessment center participants. Each assessor observed different participants in each exercise. However, all assessors in the group were present during each exercise. Upon completion of each exercise, assessors categorized their recorded observations into dimensions using materials that described each dimension in detail, along with a list of some representative behaviors. For the in-basket exercise (including the follow-up in-basket interview) and written exercise, assessors reviewed and rated the materials for each participant to whom they were assigned.


Figure 1. Dimension by exercise matrix. (Note: Shaded areas represent dimensions included in this analysis. Areas with diagonal lines represent dimensions that are unobservable in the specified exercise.)

After completing all exercises, the assessors in each group met to rate assessees' performance. Each rater provided independent ratings of each participant in the group on each dimension. For exercises in which a rater was not assigned to record behaviors of a participant, ratings were based on a verbatim listing of observed behaviors recorded by the assigned assessor as well as the assessor's own observations. The rating process proceeded as follows. Selecting a participant, the assessors started with a specified dimension (e.g., oral communication). A verbatim listing of observed behaviors for the first relevant exercise (e.g., competitive allocation) for that dimension was presented by the assigned assessor. After the presentation of observed behaviors for the exercise, an independent rating was made on the dimension in question at the exercise level;


this process was repeated for all the exercises (e.g., in-basket, management problems) relevant to the dimension in question. These were the data used in the present study. It is important to note that no group discussion or consensus process occurred prior to each assessor's ratings. Ratings were made on a 7-point Likert-type scale with descriptive anchors at the mid- and endpoints of the scale. This sequence of steps was repeated for the remaining eight dimensions. The entire process was then repeated for each participant assigned to the group of assessors. No criterion data were available as part of the assessment center.

Data Analyses

For the generalizability analysis, we used the VARCOMP procedure in SAS (SAS Institute, 1990) to estimate variance components associated with each measurement facet (see Table 1). The design was a 4-factor (i.e., person, rater, dimension, exercise) random effects design with one repeated measure. The facets were considered random variables because the levels of each facet were judged to be representative of the population of possible facet levels. Specifically, the assessees evaluated in this assessment center ranged in experience and level in their organizations. The exercises used in this assessment center were obtained commercially and were considered representative of exercises currently available on the market and commonly used in assessment centers (Spychalski et al., 1997). Exercises were chosen because they were judged to permit assessment of the identified performance dimensions of interest. Thornton and Byham (1982) report a list of commonly used assessment center dimensions. The dimensions used in this assessment center were judged to be representative of this list. Assessors consisted of I/O psychologists and management consultants and were, therefore, also judged to be representative of the population that should be used as assessors (Gaugler et al., 1987; Spychalski et al., 1997).

Assessors provided ratings for each participant in their group on each dimension observable in each exercise. Thus, the assessee facet was nested within assessors, because assessors provided ratings for only the participants in their group (i.e., not all assessors rated all assessees). In generalizability theory, obtaining variance estimates from a completely crossed design increases the generalizability of scores (Nunnally & Bernstein, 1994). Furthermore, a fully crossed design permits the researcher to partition unique variance associated with a facet from error variance, thereby yielding more useful information. To obtain a fully crossed design, we analyzed ratings on only three of the four exercises and four of the nine dimensions. Thus, all dimensions and exercises used in the analysis were completely crossed. The selected exercises were the in-basket exercise, the allocation exercise, and the management problems exercise. The selected dimensions were oral communication, team building, innovation, and stress tolerance. All other dimensions were excluded from this analysis because they were not observable in the selected exercises. Likewise, the written exercise was omitted because oral communication was not observable in this exercise. Figure 1 presents an illustration of the dimensions and exercises used.
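The analysis itself was run with PROC VARCOMP in SAS; the sketch below is only a rough, hypothetical Python analogue for readers without SAS access, not the authors' code. The long-format DataFrame `df` and its column names (rating, person, rater, dimension, exercise) are assumptions, and for simplicity the sketch treats all facets as crossed random effects rather than nesting persons within raters as in the actual design.

```python
# Hypothetical sketch of a variance components analysis (not the authors' SAS code).
# Assumes a long-format DataFrame `df` with columns: rating, person, rater, dimension, exercise.
import pandas as pd
import statsmodels.api as sm

def estimate_variance_components(df: pd.DataFrame):
    data = df.copy()
    data["one_group"] = 1  # single dummy group so every facet enters as a crossed component
    vc = {
        "person": "0 + C(person)",
        "dimension": "0 + C(dimension)",
        "exercise": "0 + C(exercise)",
        "rater": "0 + C(rater)",
        "person_x_dimension": "0 + C(person):C(dimension)",
        "person_x_exercise": "0 + C(person):C(exercise)",
        "dimension_x_exercise": "0 + C(dimension):C(exercise)",
    }
    model = sm.MixedLM.from_formula(
        "rating ~ 1", data=data, groups="one_group", vc_formula=vc, re_formula="0"
    )
    result = model.fit()
    print(result.summary())  # reports the estimated variance for each named component plus the residual
    return result
```

A restricted maximum likelihood fit of this kind can be slow with 149 assessees; the point of the sketch is simply to show which effects are being separated, not to reproduce the exact SAS estimates.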


In addition to generalizability analysis, we examined correlations among dimensions and exercises and conducted a CFA test. The correlational analyses involved comparing mean within-dimension, across-exercise correlations with mean within-exercise, across-dimension correlations. Higher values for the former relative to the latter would be indicative of convergent/discriminant validity.

Although correlational analysis provides some evidence of either dimension or exercise effects, it does not allow for an overall test of these effects. Consequently, we also used CFA to evaluate a model representing both exercise and dimension factors (i.e., we used a traditional CFA approach to MTMM data). The model evaluated (presented in Figure 2) comprised seven latent variables. Four of the latent variables represented the four dimension factors (analogous to trait factors in MTMM analysis) and three of the latent variables represented exercise factors (analogous to method factors in MTMM analysis). Overall measures of fit for this model indicated how well a model specifying the four dimension and three exercise factors corresponded to the data. In addition, a comparison of the magnitude of individual parameter estimates of the dimension factors on the ratings versus parameter estimates of the exercise factors on the ratings provides an indication of the relative magnitude of dimension and exercise effects. Specifically, large dimension factor loadings indicate the existence of convergent validity, large exercise factor loadings indicate the existence of exercise effects, and large dimension correlations (i.e., those approaching 1.0) indicate a lack of discriminant validity (Marsh & Grayson, 1995).
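As an illustration of the correlational comparison just described (ours, not the authors' code), the two mean correlations can be computed from the same kind of long-format ratings data; the DataFrame and column names below are assumed.

```python
# Hypothetical sketch: compare mean same-dimension/different-exercise correlations
# (monotrait-heteromethod) with mean different-dimension/same-exercise correlations
# (heterotrait-monomethod), given ratings averaged over raters.
import itertools
import numpy as np
import pandas as pd

def mtmm_mean_correlations(df: pd.DataFrame):
    # one row per person, one column per dimension-exercise combination
    wide = df.pivot_table(index="person", columns=["dimension", "exercise"], values="rating")
    corr = wide.corr()
    same_dimension, same_exercise = [], []
    for (d1, e1), (d2, e2) in itertools.combinations(corr.columns, 2):
        r = corr.loc[(d1, e1), (d2, e2)]
        if d1 == d2 and e1 != e2:
            same_dimension.append(r)   # same dimension rated in different exercises
        elif d1 != d2 and e1 == e2:
            same_exercise.append(r)    # different dimensions rated in the same exercise
    return float(np.mean(same_dimension)), float(np.mean(same_exercise))
```

In the present study these two means turned out to be .60 and .39, respectively (see the Results section).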

Figure 2. MTMM CFA model. (Note: IN = innovation; OC = oral communication; ST = stress tolerance; TB = team building; A = allocation; IB = in-basket; MP = management problems.)
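For readers who want to reproduce this kind of MTMM CFA outside LISREL, the following is a minimal, hypothetical sketch using the open-source semopy package. The observed-variable names (e.g., IB_OC for the in-basket oral communication rating) and the wide-format DataFrame `ratings_wide` are assumptions, and the covariance structure among the factors (freely correlated dimension factors, exercise factors treated as independent) would still need to be constrained to match the original LISREL analysis.

```python
# Hypothetical semopy sketch of the MTMM CFA in Figure 2: four dimension (trait) factors
# and three exercise (method) factors, each rating loading on one of each. Column names
# such as IB_OC are assumed; the original study fit this model in LISREL 8.14.
import pandas as pd
import semopy

# First four lines define the dimension factors; the last three define the exercise factors.
MODEL_DESC = """
OC =~ IB_OC + A_OC + MP_OC
TB =~ IB_TB + A_TB + MP_TB
IN =~ IB_IN + A_IN + MP_IN
ST =~ IB_ST + A_ST + MP_ST
IB =~ IB_OC + IB_TB + IB_IN + IB_ST
A =~ A_OC + A_TB + A_IN + A_ST
MP =~ MP_OC + MP_TB + MP_IN + MP_ST
"""

def fit_mtmm_cfa(ratings_wide: pd.DataFrame):
    model = semopy.Model(MODEL_DESC)
    model.fit(ratings_wide)            # one row per assessee, one column per rating
    print(semopy.calc_stats(model))    # chi-square and other overall fit indices
    return model.inspect()             # parameter estimates (loadings, factor covariances)
```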


Results

The means and SDs of the dimension/exercise-level ratings, and of each dimension (across exercises) and each exercise (across dimensions), are presented in Table 2.

Table 2. Means and SDs of the Dimension/Exercise-Level Ratings, and of Each Dimension (Across Exercises) and Exercise (Across Dimensions)

Dimension              Allocation      In-Basket       Management Problems    Mean    SD
Innovation             4.23 (1.51)     3.79 (1.55)     4.68 (1.50)            4.29    1.57
Oral communication     4.95 (1.43)     5.34 (1.26)     4.93 (1.41)            5.08    1.38
Stress tolerance       6.13 (1.26)     6.14 (1.40)     6.25 (1.23)            6.17    1.31
Team building          4.63 (1.46)     5.02 (1.55)     4.54 (1.51)            4.59    1.58
Exercise mean          4.80            4.98            4.92
Exercise SD            1.16            1.10            1.20

Note: Dimension/exercise-level SDs are in parentheses.

Generalizability Analysis

The variance component estimates for each of the four main effects, as well as the assessee by dimension, assessee by exercise, and dimension by exercise interactions, are presented in Table 3.
Table 3. Variance Component Estimates

Source of Variation                          Variance Component Estimate    % Total Variance
Person (rater), p:r                          0.503                          18
Dimension, d                                 0.570                          21
Exercise, x                                  0.000                          <1
Rater, r                                     0.176                          6
Person × Dimension (rater), (p × d):r        0.558                          20
Person × Exercise (rater), (p × x):r         0.143                          5
Dimension × Exercise, d × x                  0.104                          4
Error, e, (p × d × x × r):r                  0.703                          25

Examination of the variance estimates indicated a pattern clearly supporting the convergent/discriminant validity of the assessment center dimension ratings. Specifically, both the dimension main effect and the person by dimension interaction accounted for substantial portions of total variance. The dimension main effect accounted for approximately 21% of total variance. This provides evidence that performance on dimensions is variable across participants. The person by dimension interaction accounted for approximately 20% of total variance. The variance associated with this interaction indicates that performance across traits (i.e., dimensions) varies as would be expected if the ratings in fact captured latent individual differences in the dimensions. In addition, 18% of the total variance was accounted for by the person main effect. This effect indicates that the assessors differentiated among the assessment center participants' performance. Alternately, the assessor component accounted for only 6% of the total variance, indicating that a relatively small amount of variance was contributed by assessors, and ratings were relatively consistent across assessors. Similarly, the exercise effect accounted for very little, if any, of the total variance in assessment center scores (less than 1%). The person by exercise interaction accounted for approximately 5% of the variance, indicating that participants' performance is relatively stable across exercises. Finally, the dimension by exercise interaction accounted for approximately 4% of the variance, indicating that the dimensions were generally assessable in all exercises.

In summary, consistent with convergent/discriminant validity, those facets associated with the person and dimension cumulatively accounted for approximately 60% of the total variance. In addition, those facets associated with the exercises and assessors accounted for only approximately 11% of total variance. Finally, we calculated the generalizability and dependability coefficients (see Table 4). It is important to note that the generalizability coefficient is comparable to a reliability coefficient. One of the similarities is that a g-coefficient ranges from 0.0 to 1.0, where larger numbers indicate more dependable measures. The generalizability coefficient for this study was 0.98, indicating that the scores from this assessment center provide a highly reliable assessment of individual performance.

Table 4. Generalizability Coefficient Estimates

Source of Variance                                                                                        Component Estimate
σ²_Rel = σ²_(p×x):r / (n_x·n_r) + σ²_(d×x) / (n_d·n_x) + σ²_error / (n_d·n_x·n_r)                         0.017
σ²_Abs = σ²_x / n_x + σ²_r / n_r + σ²_(p×x):r / (n_x·n_r) + σ²_(d×x) / (n_d·n_x) + σ²_error / (n_d·n_x·n_r)    0.022
Eρ²_Rel = [σ²_p:r + σ²_(p×d):r] / [σ²_p:r + σ²_(p×d):r + σ²_Rel]                                          0.98
φ = [σ²_p:r + σ²_(p×d):r] / [σ²_p:r + σ²_(p×d):r + σ²_Abs]                                                0.97

Note: σ²_Rel is the relative error variance, σ²_Abs is the absolute error variance, Eρ²_Rel is the generalizability coefficient for relative decisions, and φ is the generalizability coefficient for absolute decisions.
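To make the arithmetic behind Tables 3 and 4 concrete, the short script below (ours, not part of the original analysis) recomputes the percentage column of Table 3 and the two generalizability coefficients from the published component estimates; small discrepancies from the printed values simply reflect rounding of those inputs.

```python
# Illustrative recomputation from the published estimates in Tables 3 and 4 (not the authors' code).
variance_components = {
    "person (p:r)": 0.503,
    "dimension (d)": 0.570,
    "exercise (x)": 0.000,
    "rater (r)": 0.176,
    "person x dimension": 0.558,
    "person x exercise": 0.143,
    "dimension x exercise": 0.104,
    "error": 0.703,
}
total = sum(variance_components.values())
for source, estimate in variance_components.items():
    print(f"{source:22s} {estimate:.3f}  {100 * estimate / total:5.1f}% of total variance")

# Universe-score variance for profiles of dimension scores: person plus person-by-dimension.
universe_score = variance_components["person (p:r)"] + variance_components["person x dimension"]
sigma2_rel, sigma2_abs = 0.017, 0.022   # relative and absolute error variance (Table 4)
g_relative = universe_score / (universe_score + sigma2_rel)   # ~0.98, coefficient for relative decisions
g_absolute = universe_score / (universe_score + sigma2_abs)   # ~0.98 with rounded inputs; reported as 0.97
print(f"Generalizability coefficients: relative = {g_relative:.2f}, absolute = {g_absolute:.2f}")
```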


Correlational Analysis

The intercorrelations among the ratings for each dimension derived from each exercise are presented in Table 5. These correlations reflect a pattern consistent with the results of the generalizability analysis and provide further evidence suggesting that this assessment center demonstrates trait (dimension) factors as opposed to method (exercise) factors. As indicated in the table, the overall mean correlation among ratings of the same dimension across exercises was 0.60, compared with an overall mean correlation among ratings of different dimensions within the same exercise of 0.39.

Table 5. Dimension and Exercise Intercorrelations

                              In-Basket                  Allocation                 Management Problems
                         OC    TB    IN    ST       OC    TB    IN    ST       OC    TB    IN    ST
In-Basket         OC      —
                  TB     .43    —
                  IN     .38   .48    —
                  ST     .31   .22   .30    —
Allocation        OC     .63   .40   .39   .27       —
                  TB     .37   .55   .42   .22      .51    —
                  IN     .30   .40   .53   .17      .45   .51    —
                  ST     .32   .20   .24   .72      .38   .28   .24    —
Management        OC     .54   .38   .42   .24      .68   .48   .45   .31       —
Problems          TB     .36   .44   .44   .22      .49   .62   .41   .28      .60    —
                  IN     .31   .26   .50   .23      .39   .41   .49   .27      .49   .51    —
                  ST     .34   .21   .28   .70      .34   .28   .22   .80      .38   .32   .30    —

Mean within-dimension, across-exercise r: oral communication = .62, team building = .54, innovation = .51, stress tolerance = .74; overall mean = .60.
Mean within-exercise, across-dimension r: In-Basket = .35, Allocation = .40, Management Problems = .43; overall mean = .39.
Note: OC = oral communication; TB = team building; IN = innovation; ST = stress tolerance.
Confirmatory Factor Analysis

We used a CFA application of LISREL 8.14 to evaluate the fit of the model presented in Figure 2. Covariances among the 12 ratings (1 rating for each of the four dimensions based on each of the three exercises) served as input to the program. Overall fit indices indicated that the model provides an excellent representation of the data (χ²(36) = 19.29, ns; GFI = 0.98; AGFI = 0.95; RMSR = 0.05; NFI = 0.98; CFI = 1.0).
Table 6. Standardized Parameter Estimates From the MTMM CFA Model

                        Dimensions                     Exercises
Rating          OC      TB      IN      ST        IB      A       MP
IB, OC         .70                                .28
IB, TB                 .65                        .40
IB, IN                         .70                .35
IB, ST                                 .77        .19
A, OC          .87                                        .12
A, TB                  .84                                .01
A, IN                          .76                        .07
A, ST                                  .91                .37
MP, OC         .79                                                .39
MP, TB                 .72                                        .43
MP, IN                         .63                                .33
MP, ST                                 .88                        .14

Mean dimension loading = .77; mean exercise loading = .26.
Note: OC = oral communication; TB = team building; IN = innovation; ST = stress tolerance; IB = in-basket; A = allocation; MP = management problems.

Examination of the individual parameter estimates (presented in Table 6) indicates that while both dimension and exercise factors contribute to the ratings, the estimates for the dimension parameters are substantially larger in all cases than the estimates for the exercise factors (mean parameter estimates of 0.77 and 0.26 for the dimension and exercise parameters, respectively). In addition, intercorrelations among the latent dimension factors ranged from 0.36 to 0.77 (mean r = 0.55), providing evidence of discriminant validity across dimensions. Thus, these results support and are consistent with those obtained from the generalizability analysis. That is, the dimension factors are the primary determinants of assessment center ratings.

Discussion

The contribution of the present study is threefold. First, we draw attention to the fact that the lack of construct-related validity of assessment center ratings in the presence of content-related and criterion-related validity is inconsistent with the unitarian view of validity. Second, we present an empirical demonstration of the application of generalizability theory to examining the convergent and discriminant validity of a set of assessment center dimensions – an application that has been absent in previous investigations of this issue. We examined systematic variance associated with person, dimension, exercise, and assessor factors as well as person by dimension and person by exercise interactions. Results indicated that the person, dimension, and person by dimension components accounted for a substantial portion of the variance in assessment center ratings. Conversely, relatively small portions of rating variance were accounted for by the assessor, exercise, and person by exercise components. Correlational and confirmatory factor analyses provided consistent findings indicating that the present results are not an artifact of the


analytic approach. Finally, our analyses demonstrated considerable reliability in the measurement of individual performance and discriminability of individual performance across dimensions.

There are two potential explanations for the differences between the results of the present study and previous studies. First, an important difference between our study and previous attempts to partition variance estimates has to do with the analyses. Specifically, although other studies used analysis of variance approaches, none used a generalizability framework in which all conceptually relevant sources of variance were estimated. In addition, we deliberately limited the number of dimensions and exercises to those dimensions and exercises that were completely crossed. The use of generalizability theory and a completely crossed design allowed for simultaneous estimation of all conceptually relevant assessment center facets. Turnage and Muchinsky (1982), for example, used an analysis of variance approach to assess the variance associated with person, person by dimension, and person by exercise effects, but confounded exercise effects with assessor effects (i.e., they had only one assessor per exercise). This limitation may account for their findings.

It is possible that limiting the dimensions and exercises in our analyses to those that were completely crossed reduced variability in the assessment center scores. Thus, we may have obtained different variance component estimates had all dimensions and exercises been included. We believe, however, that it is unlikely that the addition of 5 dimensions and 1 exercise would drastically change the findings of this study. In fact, because the correlational and CFA analyses are not as dependent on the fully crossed design, we replicated both of these analyses using all the exercises and dimensions and obtained essentially the same results as for the fully crossed set of dimensions and exercises reported here. (Results of the CFA analyses replicate those for the reduced model [i.e., overall levels of fit are excellent and dimension parameter estimates are substantially larger than those for the exercises]. Results of both supplementary CFA and correlational analyses are available from the first author upon request.) Thus, inclusion of the additional exercise and dimensions would most likely only affect the amount of variance associated with specific dimension and exercise factors. Although the proportion of variance accounted for by each facet in the generalizability analyses may change, it is not likely that the exercise facet would suddenly account for a greater proportion of the total variance in assessment center scores than the dimension facet.

One may also consider the nested design of the assessment center to be a limitation. More information could have been obtained if participants had not been nested within assessors. That is, if each assessor had rated all of the participants, it would have been possible to examine independent person and assessor components. However, we do not consider this to be a serious limitation because the assessment center design more closely resembles typical operational assessment centers than a design in which each assessor evaluates all participants. In fact, the results obtained in this study may have more ecological validity than results that would have been obtained had all assessors rated all participants.


Finally, it should also be noted that in the present study dimensions and exercises were analyzed as random effects. Although the dimensions and exercises used were judged to be representative of those typically used in assessment centers, strictly speaking they were not "randomly" selected from all possible exercises and dimensions that might have been available in the population. Future replication using different exercises and dimensions could conceivably yield different results.

Thus, a second potential explanation for the differences between the results of the present study and previous studies pertains to the fact that assessment centers are susceptible to differences in developmental and implementation factors that may affect the likelihood of finding convergent and discriminant validity (Lievens, 1998). In light of this, several steps recommended by both research and professional practice were incorporated into our assessment center to enhance its ability to measure the constructs/dimensions of interest.

Suggestions for Future Research

The goal of the present study was to examine whether an assessment center incorporating the features, characteristics, and guidelines recommended by research and professional practice would demonstrate convergent and discriminant validity for the measured dimensions. Because this was an operational assessment center, we incorporated several variables at once, so it was not possible to determine which particular design feature resulted in the obtained outcomes. Nevertheless, having done so allows us to consider this study as a test of what current research would suggest is a good case for generating assessment center ratings that would display discriminant and convergent validity. Moreover, in an applied area such as this, one could argue that generating an "effective" assessment center by incorporating multiple features may outweigh the inferential advantages associated with single, isolated manipulations. Nevertheless, single manipulations, where feasible, contribute to furthering our understanding of assessment centers. Guidelines recommended by research and professional practice for the development and implementation of assessment centers would suggest a number of features and characteristics that may impact assessment center validity. However, to date, limited systematic investigation of the effects of these factors on assessment center validity has occurred. We strongly believe future research should be directed toward systematically examining implementation factors influencing the psychometric properties of assessment centers.

Another potential limitation is the absence of criterion data. Although criterion-related validity is not required to demonstrate convergent and discriminant (construct-related) validity of an assessment tool, presentation of this evidence would have made a stronger case for the construct-related validity of our assessment center. Unfortunately, because this was a developmental assessment center, the client organizations had no interest in the collection or documentation of criterion data for operational or research use. On the other hand, Arthur and Tubre (1999) report a correlation of 0.34 (p < .01, N = 50) between assessment center and multisource performance ratings for a developmental assessment that was


signed and implemented in a manner identical, with the exception of the number and content of dimensions, to the assessment center used in the present study. Nevertheless, future research investigating the criterion-related validity of developmental assessment centers is warranted. This is highlighted by the results of a review of the literature which indicated that there are a very limited number of studies reporting criterion-related validities for developmental assessment centers. In fact, we could find only three studies, two of which were reported in Gaugler et al.’s (1987) meta-analysis— one was a published article (Ritchie & Moses, 1983), and the other was a technical report (Metropolitan Transit Authority, 1972). The third study was by Jones and Whitmore (1995). In addition, consistent with the view that a demonstration of convergent/ discriminant validity does not require a demonstration of criterion-related validity, it is important to note that of the 19 studies that investigated and failed to obtain convergent/discriminant validity for assessment center ratings, only four present criterion-related validity data. Consequently, prevailing summary statements about assessment center ratings demonstrating criterion-related validity, but not convergent and discriminant validity, are based on relatively independent sets of studies. Future investigations of multiple evidential strategies and inferences within the same study may be a worthwhile effort. In addition to methodologically-driven reasons, another plausible explanation that could account for or reconcile what we consider to be the assessment center construct-related validity paradox is the possibility of construct misspecification. That is, it is conceivable that instead of measuring the targeted constructs of interest – such as team building, flexibility, influencing others – assessment centers may unwittingly be measures of unspecified constructs like, for example, self-monitoring or impression management (Church, 1997; Cronshaw & Ellis, 1991). Thus, in this particular example, the actual explanatory variable – self-monitoring – is a “deeper” source trait operating at a different nomological level than assessment center constructs ostensibly being measured. This construct mis-specification hypothesis has yet to receive extensive empirical attention and appears to be an area worthy of future research (c.f., Arthur and Tubre [1999]; see also Russell and Domm’s [1995] investigation of role congruency as a plausible explanatory construct). If emprically confirmed, it has dire implications for the use of assessment centers as training and development interventions since this use is predicated on the assumption that assessment centers are indeed measuring the specified dimensions of interest (e.g., team building, flexibility, influencing others). Developmental feedback reports and interviews, and individual development plans are all designed and developed around these dimensions. Are, and have all of these efforts been fundamentally misguided? Is this important use of assessment centers fundamentally flawed? We think not. Although conceptually plausible, the mis-specification hypothesis has yet to receive extensive empirical support. Furthermore, given the data and arguments presented here, we are inclined to believe that the lack of discriminant and convergent validity is due to development and implementation factors, a more parsimonious and succinct explanation for the assessment center construct-related validity paradox. 

We also believe that future research should make more extensive use of generalizability theory analyses. Generalizability theory is particularly appropriate in that it permits differentiation among multiple sources of systematic variance in assessment center ratings. Thus, it allows for the partitioning not only of the variance associated with dimension and exercise effects, but also of other potentially important sources of variance, such as assessor and interaction effects (a minimal illustrative sketch of such a variance partitioning is presented after the Appendix).

In summary, the results of this study provide considerable support for the convergent/discriminant validity of our assessment center ratings. Our goal was not to examine the relative or comparative effects of specific design and implementation characteristics. Instead, it was to investigate whether dimensions measured with an assessment center designed to incorporate research findings and professional design recommendations would demonstrate convergent and discriminant validity. This research question was motivated by our position that the prevailing view that assessment center dimensions display content-related and criterion-related validity, but not construct-related validity, is inconsistent with the current unitarian view of validity and thus represents somewhat of a paradox.

In conclusion, our findings question the received doctrine that assessment center dimensions do not have construct-related validity and instead lead us to conclude that, as measurement tools, assessment centers are probably only as good as their development, design, and implementation. Finally, we take this opportunity to call for additional research on the role of methodological and implementation factors in the validity of assessment center ratings.

Appendix: Definitions of Assessment Center Dimensions

Oral Communication. The extent to which an individual effectively conveys oral information and responds to questions and challenges.

Written Communication. The extent to which an individual effectively and persuasively conveys information in writing.

Influencing Others. The extent to which an individual is effective in persuading others to do something or adopt a point of view in order to produce desired results without creating hostility.

Team-building. The extent to which an individual successfully engages and works collaboratively with other members of a group such that others are involved in, and contribute to, the process and outcome.

Problem Solving. The extent to which an individual gathers data, effectively analyzes and uses data and information, and selects supportable courses of action for problems and situations.

Organizing and Planning. The extent to which an individual effectively and systematically arranges his or her own work and resources, as well as those of others, for efficient task accomplishment; and the extent to which the individual anticipates and prepares for the future.

Innovation. The extent to which an individual generates new or creative ideas and solutions, and uses available resources in new and more efficient ways.

Flexibility. The extent to which an individual adapts his or her behavior and ideas to respond to various people and conditions in order to reach a desired goal.

Stress Tolerance. The extent to which an individual maintains a consistent level of performance under the stress of confrontation, tight time frames, and/or uncertainty.
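The sketch below illustrates the kind of variance partitioning that a generalizability analysis provides. It is a minimal illustration under simplifying assumptions rather than the analysis conducted in this study: it assumes a fully crossed person × dimension × exercise design with a single (e.g., post-consensus) rating per cell, omits the rater facet, runs on simulated data, and uses function and variable names of our own choosing. The estimates follow the standard expected-mean-squares equations for a two-facet crossed random-effects design (Shavelson & Webb, 1991).

```python
import numpy as np

def variance_components(x):
    """Estimate variance components for a fully crossed person x dimension x
    exercise G-study with one rating per cell, using the random-effects
    ANOVA (expected-mean-squares) approach (Shavelson & Webb, 1991)."""
    P, D, E = x.shape
    grand = x.mean()

    # Marginal means for each main effect and two-way combination
    m_p = x.mean(axis=(1, 2))   # persons
    m_d = x.mean(axis=(0, 2))   # dimensions
    m_e = x.mean(axis=(0, 1))   # exercises
    m_pd = x.mean(axis=2)       # person x dimension
    m_pe = x.mean(axis=1)       # person x exercise
    m_de = x.mean(axis=0)       # dimension x exercise

    # Mean squares (sums of squares divided by their degrees of freedom)
    ms = {
        "p": D * E * np.sum((m_p - grand) ** 2) / (P - 1),
        "d": P * E * np.sum((m_d - grand) ** 2) / (D - 1),
        "e": P * D * np.sum((m_e - grand) ** 2) / (E - 1),
        "pd": E * np.sum((m_pd - m_p[:, None] - m_d[None, :] + grand) ** 2)
              / ((P - 1) * (D - 1)),
        "pe": D * np.sum((m_pe - m_p[:, None] - m_e[None, :] + grand) ** 2)
              / ((P - 1) * (E - 1)),
        "de": P * np.sum((m_de - m_d[:, None] - m_e[None, :] + grand) ** 2)
              / ((D - 1) * (E - 1)),
    }
    resid = (x - m_pd[:, :, None] - m_pe[:, None, :] - m_de[None, :, :]
             + m_p[:, None, None] + m_d[None, :, None] + m_e[None, None, :] - grand)
    ms["pde"] = np.sum(resid ** 2) / ((P - 1) * (D - 1) * (E - 1))

    # Solve the expected-mean-squares equations for the variance components;
    # negative estimates are conventionally truncated at zero.
    vc = {
        "person": (ms["p"] - ms["pd"] - ms["pe"] + ms["pde"]) / (D * E),
        "dimension": (ms["d"] - ms["pd"] - ms["de"] + ms["pde"]) / (P * E),
        "exercise": (ms["e"] - ms["pe"] - ms["de"] + ms["pde"]) / (P * D),
        "person x dimension": (ms["pd"] - ms["pde"]) / E,
        "person x exercise": (ms["pe"] - ms["pde"]) / D,
        "dimension x exercise": (ms["de"] - ms["pde"]) / P,
        "residual (p x d x e, error)": ms["pde"],
    }
    return {source: max(est, 0.0) for source, est in vc.items()}

# Simulated example: 60 assessees rated on 10 dimensions across 4 exercises.
# Exercise-linked effects are deliberately set to zero to mimic the pattern
# described in this study (person, dimension, and person x dimension effects
# large; exercise-related effects negligible).
rng = np.random.default_rng(0)
P, D, E = 60, 10, 4
ratings = (rng.normal(0, 1.0, (P, 1, 1))     # person (assessee) effect
           + rng.normal(0, 0.6, (1, D, 1))   # dimension effect
           + rng.normal(0, 0.8, (P, D, 1))   # person x dimension effect
           + rng.normal(0, 0.5, (P, D, E)))  # residual error
for source, est in variance_components(ratings).items():
    print(f"{source:30s}{est:7.3f}")
```

Each estimated component can then be expressed as a percentage of the total observed variance; a pattern in which the person, dimension, and person × dimension components dominate while the exercise-linked components approach zero is the pattern consistent with convergent and discriminant validity.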


Acknowledgments: The authors would like to acknowledge the involvement of Karlease Clark and Winston Bennett, Jr. in the design and implementation of the assessment center. We also thank Pamela Stanush and Winston Bennett, Jr. for the indispensable role they played in the entry and management of the assessment center data set. Finally, we thank Travis Tubre for his comments on several drafts of this work, and John Binning for his comments on an earlier version of this paper.

References

Arthur, W., Jr., & Benjamin, L. T., Jr. 1999. Psychology applied to business. In A. M. Stec & D. A. Bernstein (Eds.), Psychology: Fields of application: 98–115. Boston, MA: Houghton Mifflin.

Arthur, W., Jr., & Tubre, T. C. 1999. The assessment center construct-related validity paradox: A case of construct misspecification? In Quiñones, M. A. (Chair), Assessment centers, 21st century: New issues and new answers to old problems. Symposium presented at the 14th Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.

Bender, J. M. 1973. What is "typical" of assessment centers? Personnel, 50: 50–57.

Bernardin, H. J., & Buckley, M. R. 1981. Strategies in rater training. Academy of Management Review, 6: 205–212.

Binning, J. F., & Barrett, G. V. 1989. Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74: 478–494.

Brannick, M. T., Michaels, C. E., & Baker, D. P. 1989. Construct validity of in-basket scores. Journal of Applied Psychology, 74: 957–963.

Brennan, R. L. 1994. Variance components in generalizability theory. In C. R. Reynolds (Ed.), Cognitive assessment: A multidisciplinary perspective: 175–207. NY: Plenum Press.

Bycio, P., Alvares, K. M., & Hahn, J. 1987. Situation specificity in assessment center ratings: A confirmatory analysis. Journal of Applied Psychology, 72: 463–474.

Church, A. H. 1997. Managerial self-awareness in high-performing individuals in organizations. Journal of Applied Psychology, 82: 281–292.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. 1972. The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley & Sons.

Cronshaw, S. F., & Ellis, R. J. 1991. A process investigation of self-monitoring and leader emergence. Small Group Research, 22: 403–420.

Dugan, B. 1988. Effects of assessor training on information use. Journal of Applied Psychology, 73: 743–748.

Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., III, & Bentson, B. 1987. Meta-analysis of assessment center validity. Journal of Applied Psychology, 72: 493–511.

Gaugler, B. B., & Thornton, G. C., III. 1989. Number of assessment center dimensions as a determinant of assessor generalizability of the assessment center ratings. Journal of Applied Psychology, 74: 611–618.

Harris, M. M., Becker, A. S., & Smith, D. E. 1993. Does the assessment center scoring method affect the cross-situational consistency of ratings? Journal of Applied Psychology, 78: 675–678.

Highhouse, S., & Harris, M. M. 1993. The measurement of assessment center situations: Bem's template matching technique for examining exercise similarity. Journal of Applied Social Psychology, 23: 140–155.

Howard, A. 1997. A reassessment of assessment centers: Challenges for the 21st century. Journal of Social Behavior and Personality, 12: 13–52.
Jones, R. G. 1992. Construct validation of assessment center final dimension ratings: Definition and measurement issues. Human Resource Management Review, 2: 195–220.

Jones, R. G., & Whitmore, M. D. 1995. Evaluating developmental assessment centers as interventions. Personnel Psychology, 48: 377–388.

Klimoski, R., & Brickner, M. 1987. Why do assessment centers work? The puzzle of assessment center validity. Personnel Psychology, 40: 243–259.

Kraiger, K., & Teachout, M. S. 1990. Generalizability theory as construct-related evidence of the validity of job performance ratings. Human Performance, 3: 19–35.

Lance, C. E., Gatewood, R. D., Newbolt, W. H., Foster, M. S., French, N. R., & Smith, D. E. 1998. Assessment center exercise factors represent cross-situational specificity, not method bias. Unpublished manuscript, University of Georgia, Athens, GA.


Landy, F. J. 1986. Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41: 1181–1192.

Lievens, F. 1998. Factors which improve the construct validity of assessment centers: A review. International Journal of Selection and Assessment, 6: 141–152.

Marsh, H. W., & Grayson, D. 1995. Latent variable models of multitrait-multimethod data. In R. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications. Thousand Oaks, CA: Sage Publications.

McHenry, K. R., & Schmitt, N. 1994. Multimedia testing. In M. J. Rumsey, C. D. Walker, & J. Harris (Eds.), Personnel selection and classification research: 193–222. Mahwah, NJ: Lawrence Erlbaum.

Metropolitan Transit Authority. 1972. The uses of the assessment center in a government agency's management program. Unpublished manuscript.

Miller, G. A. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63: 81–97.

Neidig, R. D., & Neidig, P. J. 1984. Multiple assessment center exercises and job relatedness. Journal of Applied Psychology, 69: 182–186.

Nunnally, J. C., & Bernstein, I. H. 1994. Psychometric theory (3rd ed.). New York: McGraw-Hill.

Raymark, P. H., & Binning, J. F. 1997. Explaining assessment center validity: A test of the criterion contamination hypothesis. Paper presented at the 1997 Academy of Management meeting, Boston, MA.

Ritchie, R. J., & Moses, J. L. 1983. Assessment center correlates of women's advancement into middle management: A 7-year longitudinal analysis. Journal of Applied Psychology, 68: 227–231.

Rowe, F. A., & Mauer, K. A. 1991. Career guidance, career assessment, and consultancy. Journal of Career Development, 17: 223–233.

Russell, C. 1985. Individual decision processes in an assessment center. Journal of Applied Psychology, 70: 737–746.

Russell, C. J., & Domm, D. R. 1995. Two field tests of an explanation of assessment centre validity. Journal of Occupational and Organizational Psychology, 68: 25–47.

Saal, F. E., Downey, R. G., & Lahey, M. A. 1980. Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88: 413–428.

Sackett, P. R. 1987. Assessment centers and content validity: Some neglected issues. Personnel Psychology, 40: 13–25.

Sackett, P. R., & Dreher, G. F. 1982. Constructs and assessment center dimensions: Some troubling empirical findings. Journal of Applied Psychology, 67: 401–410.

Sackett, P. R., & Hakel, M. D. 1979. Temporal stability and individual differences in using assessment center information from overall ratings. Organizational Behavior and Human Performance, 23: 120–137.

Sackett, P. R., & Harris, M. M. 1988. A further examination of the constructs underlying assessment center ratings. Journal of Business and Psychology, 3: 214–229.

Sagie, A., & Magnezy, R. 1997. Assessor type, number of distinguishable categories, and assessment centre construct validity. Journal of Occupational and Organizational Psychology, 67: 401–410.

SAS Institute. 1990. SAS/STAT user's guide (version 6, 4th ed., vol. 2). Cary, NC: SAS Institute Inc.

Schmitt, N. 1977. Interrater agreement in dimensionality and combination of assessment center judgments. Journal of Applied Psychology, 62: 171–176.

Schmitt, N., Schneider, J. R., & Cohen, S. A. 1990. Factors affecting validity of a regionally administered assessment center. Personnel Psychology, 43: 1–12.

Schneider, J. R., & Schmitt, N. 1992. An exercise design approach to understanding assessment center dimension and exercise constructs. Journal of Applied Psychology, 77: 32–41.

Shavelson, R. J., & Webb, N. M. 1991. Generalizability theory: A primer, vol. 1. Newbury Park, CA: Sage Publications.

Silverman, W. H., Dalessio, A., Woods, S. B., & Johnson, R. L., Jr. 1986. Influence of assessment center methods on assessors' ratings. Personnel Psychology, 39: 565–578.

Spychalski, A. C., Quiñones, M. A., Gaugler, B. B., & Pohley, K. 1997. A survey of assessment center practices in organizations in the United States. Personnel Psychology, 50: 71–90.

Task Force on Assessment Center Guidelines. 1989. Guidelines for ethical considerations. Public Personnel Management, 18: 457–470.

Thornton, G. C., III, & Byham, W. C. 1982. Assessment centers and managerial performance. NY: Academic Press.

Turnage, J. J., & Muchinsky, P. M. 1982. Transsituational variability in human performance within assessment centers. Organizational Behavior and Human Performance, 30: 174–200.

Woehr, D. J., & Arthur, W., Jr. 1999. The assessment center validity paradox: A review of the role of methodological factors. In Quiñones, M. A. (Chair), Assessment centers, 21st century: New issues and new answers to old problems. Symposium presented at the 14th Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.

Woehr, D. J., & Huffcutt, A. I. 1994. Rater training for performance appraisal: A meta-analytic review. Journal of Occupational and Organizational Psychology, 67: 189–205.