Original Article

The Predictive Validity of Assessment Centers in German-Speaking Regions: A Meta-Analysis

Nicolas Becker,¹ Stefan Höft,² Marcus Holzenkamp,³ and Frank M. Spinath¹

¹Universität des Saarlandes, Saarbrücken, Germany; ²Hochschule der Bundesagentur für Arbeit (HdBA), Mannheim, Germany; ³Obermann Consulting GmbH, Cologne, Germany

Abstract. As previous meta-analyses have focused almost solely on English-speaking regions, this study presents the first systematic meta-analytical examination of the predictive validity of assessment centers (ACs) conducted in German-speaking regions. It summarizes 24 validity coefficients taken from 19 studies (N = 3,556), yielding a mean corrected validity of ρ = .396 (80% credibility interval: .235 ≤ ρ ≤ .558). ACs with different purposes and different kinds of criterion measures were analyzed separately. Furthermore, target group (internal vs. external candidates), average age of the assessees, inclusion of intelligence measures, number of instruments used, AC duration, as well as time elapsed between AC and criterion assessment were found to moderate the validity. Keywords: personnel selection, assessment center, German-speaking regions, meta-analysis, predictive validity

Assessment centers are a common and widely used instrument in human resources management. With their seminal meta-analytical examination, Gaugler, Rosenthal, Thornton, and Bentson (1987) provided substantial evidence for the general validity of assessment centers (mean corrected validity ρ = .37). However, recent meta-analyses investigating the predictive validity of Overall Assessment Ratings (OAR), initiated to update the results presented by Gaugler et al. (1987), show considerably lower estimates of the mean corrected validity of assessment centers. Hermelin, Lievens, and Robertson (2007) reported a mean corrected validity of ρ = .28 based on 26 studies and 27 validity coefficients (correlations between OAR and supervisory performance ratings; N = 5,850; corrections made for sampling error, direct range restriction, and error of measurement). Hardison and Sackett (2007) analyzed 40 coefficients (correlations between OAR and different criterion measures; N = 11,136) and corrected them for sampling error and error of measurement; they reported a mean corrected validity of ρ = .26. Both studies provide important and long overdue information, since the original data reported by Gaugler et al. (1987) had been collected more than 20 years earlier. It has to be kept in mind, however, that these meta-analyses focused almost solely on assessment centers conducted in English-speaking countries. The generalizability of these findings to countries with differing cultural backgrounds should not automatically be assumed.

A large-scale survey of 959 organizations from 20 countries conducted by Ryan, McFarland, Baron, and Page (1997) showed substantial differences in general staffing practices (variety and number of methods used) across nations. More specifically, Krause and Gebert (2003) compared assessment center practices in organizations in German-speaking regions and in the United States. They used their own survey results (descriptions of assessment center implementations in 75 German-speaking organizations with more than 2,000 employees) and similar US-based data published by Spychalski, Quiñones, Gaugler, and Pohley (1997). They found rather striking cross-sample differences in the use of AC measures. A recent update of the German survey by Höft and Obermann (in press) confirms these general findings of robust differences between the US and German-speaking countries in the design and execution of ACs. Possible causes for these differences are manifold. Originally discussing possible constraints on the design of global staffing systems, Wiechmann, Ryan, and Hemingway (2003, p. 82) enumerate a multitude of national and regional obstacles, most of which also apply in this context: differing legal requirements, educational systems, economic conditions, ability to acquire and use technology, labor market variations, value differences across cultures, availability of off-the-shelf assessment tools, level of HR experience, role of HR in hiring, familiarity with a tool or practice, etc. Krause and Gebert (2003), for example, offered three explanations for their results (p. 308): First, ideologically rooted reservations lead to a remarkably low utilization of psychometric tests (both intelligence and personality) in German-speaking ACs.


Second, compared to US-based ACs, a comparative lack of professionalism in handling ACs (e.g., less elaborate job analyses) seems to be inherent in German-speaking regions. And third, the US and the German-speaking countries vary remarkably in labor legislation, and this leads to differences in AC design (e.g., in German-speaking regions worker committee representatives more often serve as AC observers, and more details on objectives, exercises, data privacy, etc. are given to AC participants). Based on the results and the interpretation of Krause and Gebert (2003), we should expect a lower predictive validity of ACs in German-speaking regions compared to the existing, primarily US-based findings. As the original authors themselves stated, some of the differences identified by Krause and Gebert (2003) could be attributed to general cultural differences (e.g., as described by Hofstede, 2001, in terms of five dimensions, namely power distance, individualism/collectivism, masculinity/femininity, uncertainty avoidance, and long/short-term orientation). For example, Brodbeck, Frese, and Javidan (2002) showed in the German part of the worldwide study "Global Leadership and Organizational Behaviour Effectiveness" (GLOBE) that institutionalized collectivism (e.g., solving emerging organizational conflicts on a collective basis through representatives such as employer associations and unions) is more typical of Germany than of the US. On a more general level, Ryan et al. (1997) showed that two of the cultural dimensions described by Hofstede (2001) (uncertainty avoidance and, to a lesser extent, power distance) explained some of the national differences observed in the extensiveness of methods used for personnel selection. Granted that cultural differences in the use and implementation of personnel selection exist, it seems plausible that the criterion-related validities of these selection procedures are affected as well. An example from the field of general mental ability (GMA) testing should clarify this argument: Salgado et al. (2003) meta-analyzed the criterion-related validity of GMA tests used in the European Community and found validity coefficients comparable to the original, primarily US-based meta-analyses by Ghiselli (1973) and Hunter and Hunter (1984). Of the 89 primary studies, only nine had been conducted in Germany. An alternative meta-analysis by Hülsheger, Maier, and Stumpp (2007) concentrated solely on German samples and collected 54 independent primary studies. Their findings suggest that overall German operational validities are comparable with earlier findings, but training success validities are slightly lower in Germany than in US or European meta-analyses. Rather striking are the results of the moderator analyses: They show that job complexity and year of publication are relevant moderator variables, with lower job complexity levels and older studies being associated with higher validities. The first moderator in particular does not correspond to international findings (Hunter & Hunter, 1984, as well as Salgado et al., 2003, identified growing validities for higher job levels). The authors offer a plausible explanation for these results, drawing on the early stratification of the German school system, which results in indirect range restriction. These results show that it would have been premature to transfer the general (US- and European-based) results to the German context without cross-checking.


In sum, considering the strong role of German companies in international markets and given the fact that many German companies and institutions use ACs (Höft & Obermann, in press), a systematic investigation of ACs conducted in German-speaking countries constitutes a valuable endeavor. Variations between countries in the usage and design of staffing systems in general (Ryan et al., 1997) and of ACs in particular (Krause & Gebert, 2003) are well known, and consequences for the criterion-related validities are plausible. Accordingly, the major aim of the present study is to provide the first systematic meta-analytical study on the validity of assessment centers in German-speaking regions.

Method

Database

We used two strategies to identify validity studies for the present meta-analysis. As a first step, a search of several computerized databases (PsycInfo, Academic Search Premier, Business Source Premier, Google Scholar, etc.) was carried out. As a second step, we contacted members of the "Arbeitskreis Assessment Center e.V.", a nonprofit discussion forum for personnel selection and personnel development whose agenda is shaped by German-speaking practitioners and experts working in industrial and service organizations. Additionally, we contacted members of the section "Wirtschaftspsychologie" of the "Berufsverband Deutscher Psychologen" (the business section of the professional association of German psychologists) and the members of different sections of the "Deutsche Gesellschaft für Psychologie," an association of psychologists working in science and education. Using this strategy, we were able to contact a wide variety of persons with scientific as well as business backgrounds.

Inclusion and Coding Criteria

During the search procedure, potentially relevant studies had to meet four criteria: (a) the assessment center had to be carried out in the German language or had to be conducted predominantly with German-speaking participants, (b) a measure of performance or success had to be available at least for the selected candidates, (c) the sample size for the criterion sample had to be reported, and (d) the correlation between predictor and criterion had to be reported or, alternatively, sufficient information to calculate this coefficient had to be available. To base our analysis on a large pool of studies, we chose to include studies that used a wide variety of validation criteria and had been conducted for different purposes. Validation criteria included in the study were supervisor ratings, promotions, development of payscale, job performance, and training success.



The purposes of the ACs were selection (for specific jobs or entry positions), development (mostly for self-development without consequences for the career), and potential analysis (intraorganizational candidates served as participants, the results were used for long-term career development planning, and the participants received detailed feedback). This search resulted in 20 potentially relevant studies. One of these studies (Birri & Naef, 2006; yielding two independent coefficients: r = .641, r = .721) involved an extreme enhancement of range because only the best and the worst candidates were evaluated (97 out of more than 1,700). As this leads to an overestimation of the actual correlation, we decided to exclude it from the analysis. The final sample consisted of 19 studies with 24 independent coefficients and a total N of 3,556. Of these coefficients, 14 had been published and 10 were unpublished. The earliest study included was published in 1987; the most recent studies were conducted in 2007. All of the ACs were conducted in occupational contexts with real applicants or employees and were used for selection, potential analysis, or as development instruments.

Meta-Analytic Procedure

In our analysis we followed the procedures described by Hunter and Schmidt (2004) and used the software recommended there (i.e., Schmidt & Le, 2005). As individual artifact correction was not possible due to a lack of information, we used artifact distribution meta-analysis instead (see Hunter & Schmidt, 2004, p. 137). We corrected for sampling error (yielding "bare-bones" results), error of measurement in the criterion variable, and direct range restriction (yielding the mean corrected validities, ρ). Error of measurement in the predictor variable was not corrected for, since our aim was to obtain the predictive validity of "realistic" AC results (i.e., results with imperfect reliability). Indirect range restriction was not corrected because the studies in our sample did not provide sufficient information and appropriate estimates could not be deduced from other studies. Information for the generation of the artifact distribution for criterion reliability was available for six of the coefficients. For the other coefficients we used estimates provided in the literature: .52 for supervisory ratings (see Salgado et al., 2003; Viswesvaran, Ones, & Schmidt, 1996), .80 for training success, and 1.00 for promotion, sales figures, and development of payscale (artifact distribution along the lines of Gaugler et al., 1987). As none of the studies in our sample reported information on direct range restriction, we drew on estimates employed in other meta-analyses: .936 for supervisory ratings (artifact distribution along the lines of Hermelin et al., 2007), .837 for promotion, sales figures, and development of payscale, and .977 for training success (both artifact distributions along the lines of Gaugler et al., 1987). The sample of coefficients used to evaluate the validity on some of the moderator levels was rather small. Hence, as a hedge against possible file-drawer effects, we additionally computed fail-safe Ns as recommended by Hunter and Schmidt (2004, p. 501). In doing so, we applied the formulas derived by Pearlman (1982) and Orwin (1983) and, following McNatt (2000), regarded a correlation of .05 as trivial.
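To make the correction logic concrete, the following sketch applies an artifact-distribution correction for criterion unreliability and direct range restriction to a bare-bones mean correlation and computes an Orwin-type fail-safe N. The study data, artifact means, and variable names are hypothetical placeholders and do not reproduce the values analyzed here; the actual computations were carried out with the Schmidt and Le (2005) programs.

```python
# Minimal sketch of an artifact-distribution meta-analysis
# (general logic of Hunter & Schmidt, 2004; all input values are hypothetical).
import math

# Hypothetical primary studies: (criterion sample size N, observed validity r)
studies = [(120, 0.25), (85, 0.40), (200, 0.31), (60, 0.18)]

mean_ryy = 0.52   # assumed mean criterion reliability (e.g., supervisory ratings)
mean_u = 0.936    # assumed mean range-restriction ratio SD_restricted / SD_unrestricted

k = len(studies)
total_n = sum(n for n, _ in studies)

# Bare-bones step: sample-size-weighted mean of the observed correlations.
mean_r = sum(n * r for n, r in studies) / total_n

# Correct the mean for criterion unreliability (attenuation factor sqrt(r_yy)) ...
r_dis = mean_r / math.sqrt(mean_ryy)
# ... and for direct range restriction (Thorndike Case II formula).
rho = r_dis / math.sqrt(mean_u ** 2 + r_dis ** 2 * (1 - mean_u ** 2))

# Orwin-type fail-safe N: number of null results (r = 0) needed to pull the
# bare-bones mean down to a "trivial" correlation of .05 (McNatt, 2000).
r_trivial = 0.05
fail_safe_n = k * (mean_r - r_trivial) / r_trivial

print(f"bare-bones mean r = {mean_r:.3f}")
print(f"mean corrected validity rho = {rho:.3f}")
print(f"fail-safe N = {fail_safe_n:.0f}")
```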


Results

Mean (Corrected) Validity

The mean observed correlation weighted by sample size was r = .329 (80% credibility intervals [80%-CRs] for all reported coefficients are given in Table 1). Correcting additionally for criterion unreliability and direct range restriction resulted in a corrected validity of ρ = .396. Inspection of the corresponding fail-safe N indicates that it would take 166 null findings to reduce the mean effect to a trivial size. As this clearly exceeds the number of coefficients included in our study, we consider this result to be rather robust. Nevertheless, only 37.43% of the variance of the observed correlations was attributable to the three artifacts, indicating that a substantial amount of variance across the studies is due to factors that could not be corrected for. Since this amount fails to meet the 75% rule, a generalization of the mean validity of this analysis is not possible; instead, a search for moderator variables is necessary (see Hunter & Schmidt, 2004, p. 401).
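The variance decomposition behind the 75% rule and the 80% credibility interval can be sketched as follows. The per-study values and the single combined attenuation factor are hypothetical, and only sampling error is removed here; the full analysis also removes variance attributable to differences in criterion reliability and range restriction across studies.

```python
# Sketch: sampling-error variance, the 75% rule, and an 80% credibility interval
# (Hunter & Schmidt logic; study data and the attenuation factor are hypothetical).
import math

studies = [(120, 0.25), (85, 0.40), (200, 0.31), (60, 0.18)]
k = len(studies)
total_n = sum(n for n, _ in studies)

mean_r = sum(n * r for n, r in studies) / total_n
# Sample-size-weighted observed variance of the validity coefficients.
var_obs = sum(n * (r - mean_r) ** 2 for n, r in studies) / total_n
# Variance expected from sampling error alone (uses the average sample size).
var_err = (1 - mean_r ** 2) ** 2 / (total_n / k - 1)

pct_artifact = 100 * var_err / var_obs      # compared against the 75% rule
var_res = max(var_obs - var_err, 0.0)       # residual variance in the r metric

a = 0.83          # hypothetical combined attenuation (unreliability x range restriction)
rho = mean_r / a                            # mean corrected validity
sd_rho = math.sqrt(var_res) / a             # SD of corrected validities

cr_low, cr_high = rho - 1.282 * sd_rho, rho + 1.282 * sd_rho
print(f"{pct_artifact:.1f}% of observed variance due to sampling error")
print(f"80% credibility interval: {cr_low:.3f} <= rho <= {cr_high:.3f}")
```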

Moderator Hypotheses

Hypothetically, a wide variety of moderators is conceivable in the present context. Based on the work of Gaugler et al. (1987) and further considerations regarding the content of the studies in our sample, we identified eleven potential moderator variables for which substantial data were available. In detail, the following hypotheses were the rationale for testing their influence on AC validity: As mentioned earlier, we chose to include studies using rather heterogeneous criteria, which represent different aspects of the assessees' performance or success and might therefore show differences in predictability. The purpose of the AC might also influence the validity of its result. As studies on ACs used for development and potential analysis usually make it possible to obtain criterion measures for nearly all candidates, whereas studies on ACs used for selection purposes do so only for the selected assessees, differences in range restriction can be expected, which should result in different validities. Analogously, we hypothesized that studies conducted with (preselected) internal candidates should show lower validity than studies conducted with external applicants, who show greater variability in potential. Concerning the assessees' average age, it can be hypothesized that older applicants might compensate for differences in potential through job experience, which decreases the validity in such samples. As GMA is discussed as one of the best predictors of job success (see Kramer, 2009; Schmidt & Hunter, 1998), we assumed that studies employing intelligence tests should show a higher mean validity. Because the amount of information gained about the assessees rises with the number of instruments used, we expected ACs employing more instruments to be more valid. Likewise, it can be expected that ACs lasting longer show a higher validity because they sample a broader range of behavior. As the predictive power in diagnostic settings typically tends to wane over time (see, e.g., Anastasi & Urbina, 1997), we expected studies assessing the criterion in close temporal proximity to the AC to report higher validities.

ka

Nb

Mean rccc

Journal of Personnel Psychology 2011; Vol. 10(2):61–69

20 3,029 0.844 0.163 0.9 0.06 0.325 10 1,664 0.795 0.157 0.91 0.052 0.408 10 1,365 0.892 0.161 0.889 0.068 0.224

18 2,903 0.811 0.178 0.904 0.057 0.337 7 774 0.796 0.217 0.905 0.066 0.175 11 2,129 0.82 0.159 0.903 0.054 0.395

24 3,556 0.83 0.181 0.897 0.059 0.329 14 2,052 0.756 0.164 0.924 0.05 0.377 10 1,504 0.932 0.158 0.861 0.051 0.262

Duration of AC One day Several days

Time elapsed Less than two years More than two years

Publication Published Unpublished

0.119 0.122 0.021 0.029

0.116 0 0.084 0 0 0 0.394 0.258 0.333 0.512

0.396 0.389 0.043 0.564 0.324 0.236

q

0.017 0.02 0.001 0

0.016 0 0.01 0 0 0

VAR

0.129 0.14 0.024 0

0.126 0 0.099 0 0 0

SD

0.015 0.122 0.392 0.018 0.132 0.006 0.078 0.442 0.006 0.075 0.019 0.137 0.216 0.025 0.159

0.014 0.116 0.396 0.016 0.126 0.006 0.077 0.464 0.005 0.075 0.016 0.127 0.309 0.02 0.143

0.016 0.125 0.407 0.019 0.137 0.019 0.138 0.216 0.027 0.164 0.001 0.038 0.475 0 0

0.017 0.129 0.387 0.02 0.141 0.002 0.046 0.494 0 0.02 0.016 0.125 0.264 0.019 0.14

0.016 0.127 0.391 0.019 0.14 0.01 0.098 0.266 0.012 0.11 0 0.022 0.527 0 0

0.019 0.137 0.4 0.024 0.153 0 0 0.556 0 0 0.013 0.113 0.251 0.017 0.129

0.017 0.13 0.398 0.021 0.144 0.003 0.057 0.495 0.002 0.045 0.017 0.132 0.278 0.023 0.152

0.014 0.015 0 0.001

0.014 0 0.007 0 0 0

SD

0.221 0.347 0.013

0.235 0.368 0.127

0.231 0.006 0.475

0.207 0.468 0.085

0.212 0.125 0.527

0.204 0.556 0.086

0.214 0.437 0.084

0.23 0.079 0.303 0.512

0.235 0.389 0.084 0.564 0.324 0.236

0.562 0.538 0.419

0.558 0.561 0.493

0.583 0.426 0.475

0.568 0.52 0.444

0.57 0.406 0.527

0.597 0.556 0.417

0.583 0.552 0.472

0.56 0.437 0.364 0.512

0.558 0.389 0.17 0.564 0.324 0.236

0.331 0.394 0.094

0.34 0.414 0.216

0.337 0.09 0.448

0.32 0.459 0.173

0.327 0.206 0.507

0.325 0.556 0.174

0.327 0.454 0.166

0.334 0.171 0.305 0.488

0.34 0.389 0.071 0.564 0.324 0.236

0.453 0.49 0.338

0.452 0.514 0.402

0.477 0.342 0.502

0.454 0.529 0.355

0.455 0.326 0.547

0.475 0.556 0.328

0.469 0.536 0.39

0.454 0.345 0.361 0.536

0.452 0.389 0.157 0.564 0.324 0.236

34.04 59.45 34.59

37.43 62.74 25.59

32.1 33.55 106.44

30.57 95.05 33.06

34.35 53.35 128.47

30.16 152.62 43.95

31.11 81.24 28.5

35.02 41.79 90.29 123.83

37.43 381.57 52.35 144.04 232.55 —

150 118 23

166 116 52

129 23 94

135 89 43

150 65 67

133 71 48

132 98 36

151 46 17 74

166 47 0 93 22 4

80%-CR 80%-CR 95%-CI 95%-CI Variance UBf LBg UBh reduction NFS LBe

Validity generalization

Notes. aNumber of coefficients used for analysis. bTotal N. cMean reliability of the criterion computed as arithmetic mean. dMean range restriction. elower bound of 80% credibility interval (CR). fUpper bound of 80% credibility interval. gLower bound of 95% confidence interval. hUpper bound of 95% confidence interval (CI). iAs the number of coefficients from studies with sufficient information for the moderator analysis varied, we present both the results for the full group in a specific moderator analysis as well as the results for each moderator subgroup separately.

22 3,324 0.845 0.176 0.892 0.057 0.326 15 2,597 0.836 0.176 0.893 0.055 0.367 7 727 0.864 0.189 0.891 0.069 0.18

22 3,042 0.814 0.182 0.903 0.058 0.323 15 1,524 0.846 0.194 0.894 0.065 0.219 7 1,518 0.746 0.139 0.922 0.037 0.427

Number of instruments Up to 10 More than 10

Target group Internal External

19 2,689 0.785 0.178 0.913 0.056 0.328 7 1,416 0.677 0.107 0.936 0 0.435 12 1,273 0.848 0.185 0.9 0.068 0.209

Usage of intelligence test Intelligence test used Intelligence test not used

0.329 0.214 0.284 0.42

0.329 0.334 0.036 0.43 0.28 0.199

VAR

Bare-bones Mean r

19 3,072 0.795 0.185 0.906 0.056 0.326 11 1,701 0.752 0.153 0.925 0.046 0.402 8 1,326 0.855 0.218 0.879 0.06 0.228

0.057 0.066 0 0.046

0.059 0 0 0 0 0

SD u

All ages Up to 30 years Over 30 years

0.936 0.893 0.837 0.911

0.897 0.837 0.837 0.936 0.977 0.837

Mean ud

22 3,379 0.845 0.176 11 1,088 0.852 0.193 3 605 1 0 8 1,686 0.778 0.157

0.181 0 0 0.116 0.035 0

SD rcc

All purposes Selection Development Potential analysis

All criterion measures 24 3,556 0.83 Promotion 7 1,122 1 Sales figures 3 386 1 Supervisor ratings 9 1,504 0.642 Training success 4 306 0.783 Development of payscale 1 238 1

Analysis

Artifact distribution

Table 1. Meta-analytic results and moderator analyses


Anastasi & Urbina, 1997) we expected studies assessing the criterion in close temporal proximity to the AC to report higher validities. Finally, we exploratorily analyzed two formal aspects of the studies as moderators of the validity: First, the publication year of the study was accounted for, to test for a global decrease in predictive validity over time of publication. This trend could be suggested comparing the results of the older meta-analysis (Gaugler et al., 1987) with the newer ones (Hardison & Sackett, 2007; Hermelin et al., 2007). Second, the difference between published and unpublished studies was analyzed to evaluate possible filedrawer effects.

Moderator Analyses

To test for moderating effects, we split the coefficients into subgroups, applying cut-off points based on considerations of theoretical or practical relevance and analyzability. For example, the question of whether an intelligence test was used implied a simple split (yes/no), resulting in two subgroups of comparable size. For "average age of assessees," we applied a cut-off at the age of 30, assuming that below this age most assessments aimed at entry-level participants without extensive work experience. This cut-off also resulted in two subgroups of comparable size. To test the significance of the difference of ρ between the different moderator levels, we used 95% confidence intervals (95%-CIs) as recommended by Hunter and Schmidt (2000), Schmidt and Hunter (1999), and Whitener (1990). A distinct mean difference and, especially, nonoverlapping confidence intervals were considered good indicators of moderating effects. As the software employed in our study (i.e., Schmidt & Le, 2005) does not provide this option, we computed the intervals manually using the formula stated by Hunter and Schmidt (2004, p. 207). All intervals are reported in Table 1.
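A simplified version of that manual step is sketched below: the standard error of a subgroup's mean corrected validity is approximated from the observed standard deviation of its coefficients, the number of coefficients, and a combined attenuation factor, and two moderator levels are compared via interval overlap. The ρ values echo those reported below for the intelligence-test moderator; the standard deviations and the attenuation factor are hypothetical, and the formula is a stand-in for, not a reproduction of, the Hunter and Schmidt (2004, p. 207) computation.

```python
# Sketch: 95% confidence intervals for subgroup mean corrected validities and an
# overlap check (simplified; SDs and the attenuation factor are hypothetical).
import math

def ci95(rho, sd_r_obs, k, attenuation):
    """Approximate 95% CI around a subgroup's mean corrected validity."""
    se = sd_r_obs / (math.sqrt(k) * attenuation)  # SE of the corrected mean
    return rho - 1.96 * se, rho + 1.96 * se

# Example: ACs with vs. without an intelligence test (rho values as in the text).
with_test = ci95(rho=0.556, sd_r_obs=0.11, k=7, attenuation=0.85)
without_test = ci95(rho=0.251, sd_r_obs=0.14, k=12, attenuation=0.85)

# Non-overlapping intervals are taken as an indicator of a moderating effect.
overlap = with_test[0] <= without_test[1] and without_test[0] <= with_test[1]
print(f"with test:    {with_test[0]:.3f} - {with_test[1]:.3f}")
print(f"without test: {without_test[0]:.3f} - {without_test[1]:.3f}")
print("confidence intervals overlap:", overlap)
```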

Moderator Results

The mean corrected validity for unpublished coefficients (ρ = .309) was lower than for published coefficients (ρ = .464). Whether this finding is an indication of a direct or indirect "file-drawer effect" is difficult to say. The fact that we were able to retrieve a considerable number of unpublished coefficients (42%, or 10 out of 24) suggests that the effect of publication bias on our overall validity estimate is rather small. Nevertheless, it probably contributes to a slight overestimation of the true underlying correlation in our data. However, compared to other recent meta-analyses, such as the Hermelin et al. (2007) study, which included only two unpublished findings, our findings are less likely to be overestimates. Therefore, we tend not to attribute the higher mean validity in our study to publication bias. Two of the moderator variables showed no moderating effect.


Year of publication had a very small positive effect on the study results, which was mainly caused by the studies published in 2007 (r = .365 including and r = .031 excluding studies conducted in 2007), so that an overall linear trend of the publication year was ruled out. A differentiation regarding the percentage of male candidates (less than 85%, ρ = .438, vs. more than 85%, ρ = .364) yielded different mean validities but also a strong overlap of confidence intervals, indicating the absence of a moderating effect. For the remaining eight variables we found clear moderating effects. The results for these moderators are shown in Table 1. The kind of criterion used to establish AC validity was the first moderator. Supervisor ratings were predicted best (ρ = .564). The validities for promotion (ρ = .389), training success (ρ = .324), and development of payscale (ρ = .236) ranged lower but were also predicted well, whereas sales figures (ρ = .043) were not. An examination of the moderating effect of the AC purpose also indicated different predictive validities. ACs used for potential analysis yielded the highest predictive validity (ρ = .512), followed by ACs used as development instruments (ρ = .333) and ACs used for selection (ρ = .258). Comparing target groups, we found that internal candidates (ρ = .442) were better predicted than external candidates (ρ = .216). ACs showed a higher predictive validity if the assessees were, on average, younger than 30 years (ρ = .495 vs. ρ = .278 for older participants), if an intelligence test was included (ρ = .556 vs. ρ = .251 for ACs without intelligence tests), if more than 10 instruments were used (ρ = .527 vs. ρ = .266 for ACs using fewer instruments), if the AC lasted no longer than one day (ρ = .494 vs. ρ = .264 for longer ACs), and if more than two years had passed between the AC and the assessment of the criterion (ρ = .475 vs. ρ = .216 for earlier criterion assessment). In the above analyses, each moderator was treated as an independent variable. This could be regarded as an oversimplified approach, since the covariance of moderators across studies is both possible and plausible. In order to analyze the degree to which the moderators were confounded, we computed pairwise Kendall's τ coefficients. Substantial and significant associations were found between the duration of the AC and the age of the assessees (Kendall's τ = .764; p < .01) as well as between the duration and the use of an intelligence test (Kendall's τ = .516; p < .05), indicating that short ACs used to assess younger candidates employed intelligence tests more frequently. Furthermore, associations between the number of instruments used and the employment of an intelligence test (Kendall's τ = .889; p < .01) and between the number of instruments and the target group (Kendall's τ = .480; p < .05) were found, indicating that ACs using many instruments were more likely to employ intelligence tests and to be conducted with internal candidates. Finally, the employment of an intelligence test and the time between the AC and the criterion assessment were significantly associated (Kendall's τ = .577; p < .05). Possible implications of the nonindependence among moderators are discussed below.
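The confounding check can be reproduced with pairwise Kendall's τ coefficients over study-level moderator codings, as in the sketch below; the 0/1 codings shown are invented for illustration and do not reproduce the 24 coefficients of this meta-analysis.

```python
# Sketch: pairwise Kendall's tau between study-level moderator codings
# (hypothetical binary codings, one entry per validity coefficient).
from itertools import combinations
from scipy.stats import kendalltau

moderators = {
    "one_day_ac":        [1, 1, 0, 0, 1, 0, 1, 0],
    "intelligence_test": [1, 1, 0, 0, 1, 0, 1, 1],
    "age_under_30":      [1, 0, 0, 0, 1, 0, 1, 0],
    "internal_target":   [1, 1, 1, 0, 1, 0, 0, 1],
}

for (name_a, x), (name_b, y) in combinations(moderators.items(), 2):
    tau, p = kendalltau(x, y)
    print(f"{name_a:17s} x {name_b:17s}  tau = {tau:+.2f}  p = {p:.3f}")
```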

Discussion


Until now, all meta-analytic studies examining the predictive validity of assessment centers have focused almost solely on ACs conducted in English-speaking regions. The present study was conducted to systematically analyze data from German-speaking regions in order to provide information complementary to the recently published meta-analyses based on data from international samples. Country-specific variations in the use and design of selection methods in general (Ryan et al., 1997), and differences between assessment centers designed in the US and in German-speaking countries in particular, are well known (Höft & Obermann, in press; Krause & Gebert, 2003). One major aim of our study was to investigate the consequences of these differences for the criterion-related validity of ACs. In addition to presenting a first overview of the predictive validity available in German-speaking countries, the data analyses were carried out in a way that maximizes methodological comparability with recent international findings. Inspecting the results of the meta-analysis for the full sample yields a substantial mean predictive validity of the ACs examined (ρ = .396). Although this result cannot be generalized due to substantial residual variance, it is very similar to the finding presented by Gaugler et al. (1987; ρ = .37) and higher than the mean predictive validity reported by Hardison and Sackett (2007; ρ = .26). Comparing the mean corrected validity of Hermelin et al. (2007; ρ = .28), which is based exclusively on studies using supervisor ratings as the criterion, with the corresponding result of our analysis (ρ = .564) reveals even greater differences. The concerns raised by the results of Krause and Gebert (2003) are not confirmed: The predictive validity of ACs conducted in German-speaking regions is comparable to international findings or even higher. Additionally, taking Gaugler et al. (1987) as a reference point, our results do not support a global decrease in the mean predictive validity of ACs. As none of the studies in the final sample was conducted before 1987, we were not able to test this assumption directly, but the absence of a linear trend between year of publication and validity points in the same direction. Whether the differences in AC validity between German-speaking and English-speaking regions based on the more recent studies involve differences in pre-selection processes or systematic differences in variance restriction is difficult to evaluate at this point, as reliable information on this topic does not exist for the German studies. This touches on one of the major limitations of our study: In order to address such issues in future analyses, it is of vital importance that more validity studies including detailed information on the full diagnostic process become available. Such information includes details on the pre-selection process and range restriction, on the company or, more generally, the assessment setting, on the population of applicants, on the exercises used, as well as on the development procedure and evaluation of the AC itself. In our moderator analysis we identified eight variables that systematically influenced AC validity in the studies available for this meta-analysis. Of these variables, two (criterion measure, purpose of the AC) were not confounded with other moderators, while substantial and significant associations were identified among the six remaining variables.

In the following, the first two moderators will be discussed separately; the results concerning the confounded moderators will be discussed together. Examining the moderating effects of the criterion measure used for the validation of the AC, one finding is striking: Supervisor ratings are predicted very well (ρ = .564), while the validities for the criteria promotion (ρ = .389), training success (ρ = .324), and development of payscale (ρ = .236) range lower, and the validity for sales figures (ρ = .043) cannot be regarded as significantly greater than zero. Contamination of predictors and criteria is one possible explanation for these discrepancies in predictability. On the one hand, supervisor ratings are subjective evaluations of job performance. Promotions, training success, and development of payscale are, to a lesser degree but still substantially, influenced by subjective ratings of judges (e.g., managers, supervisors). Sales figures, on the other hand, are an objectively measurable criterion. AC results are also made up of subjective ratings of judges, thus sharing method variance and potential bias (e.g., response sets and halo errors) with the subjective criteria (i.e., supervisor ratings, promotions, training success, and development of payscale). Thus, it could be argued that the correlation between AC ratings and subjective criteria may be inflated due to the joint mode of assessment. This hypothesis is supported by the equivalent results in the meta-analyses of Gaugler et al. (1987) as well as Hardison and Sackett (2007). It has to be remembered that direct job performance indicators (i.e., sales figures) are not the main focus of assessment centers. ACs typically cover broad dimensions, for example, consideration/awareness of others, communication, drive, influencing others, organizing and planning, or problem solving (Arthur, Day, McNelly, & Edens, 2003), that are typically measured by behavioral indicators. Following the idea of Brunswik symmetry (a short description of the theoretical background can be found in Ackerman & Lohman, 2006, p. 147), it should be kept in mind that predictor-criterion validities are maximized on the structural level when (a) there is a match between predictor breadth and criterion breadth and (b) the mapping between predictors and criteria is direct. Both conditions are fulfilled by subjective performance ratings but not by objective performance indicators. At the same time, it has to be remembered that the results concerning the predictability of sales figures in our study were based on three coefficients only (N = 386). Nevertheless, the question of whether AC ratings predicting subjective performance ratings are contaminated by common method variance or are ideally customized for this kind of criterion could be of interest in future studies. A differentiation with regard to the purpose of the ACs revealed that potential analysis worked best, whereas ACs used for selection and development showed somewhat lower yet still positive predictive validities. A glance at the credibility intervals revealed that ACs used for the purpose of development (80%-CR: .303 ≤ ρ ≤ .364) and potential analysis (80%-CR: .512 ≤ ρ ≤ .512) constituted homogeneous groups, whereas ACs used for selection (80%-CR: .079 ≤ ρ ≤ .437) yielded more heterogeneous results.


As hypothesized, one possible explanation could be that the influence of direct or indirect range restriction (which could not be corrected for) is rather strong in this group of studies. Additionally, overall assessment center ratings used in the selection context often concentrate solely on the final pass/fail decision, disregarding further differentiation between candidates (e.g., Krause & Gebert, 2003). Unfortunately, our database does not allow further analysis to clarify this heterogeneity at this point. With respect to the associations among the six remaining variables that moderated the validity of AC results, one prototypical group of ACs with higher validities and particular characteristics can be identified. These ACs are completed on a single day and use many instruments, one of which is an intelligence test. Such ACs are employed to assess younger internal candidates, and the criterion is evaluated no earlier than 2 years after AC completion. How do these characteristics contribute to AC validity, and can we elucidate their interrelation? The findings regarding the number of instruments and AC duration seem conflicting at first sight. As expected, ACs employing more instruments proved to be more valid. Greater validity as a result of a larger number of instruments can be understood through the bandwidth and fidelity of ACs (see Cronbach & Gleser, 1965). By employing a larger number of instruments, more information about the candidates can be gathered (i.e., bandwidth). As a consequence, more accurate decisions can be made (i.e., fidelity). How does this tie in with the unexpected fact that shorter ACs turned out to be more valid than longer ACs in our study? Our interpretation of this finding involves the assumption that short ACs might provoke more stress, because candidates have less time to become accustomed to the tasks, which in turn might be advantageous for the separation of successful and less successful assessees. Put together, these two lines of reasoning suggest that short ACs employing many instruments have a higher predictive validity because of a broader bandwidth and higher fidelity plus an intensified assessment of stress tolerance. It can be expected that another critical issue involves the diversity of instruments, assuming that the employment of more diverse instruments yields higher predictive validities. Unfortunately, we are not able to quantify the diversity of the instruments used in the ACs of our study, as these particular data were not available. Nevertheless, it can be assumed that the more instruments are employed, the more diverse they are. This leads to the conclusion that it is not the duration of the exercises that is crucial for higher predictive validity, but the depth and breadth of the information gathered about the candidates. The result that higher validities were found in studies conducted with internal candidates contradicts our initial hypothesis on this moderator. A possible explanation for this effect could be that AC raters are often recruited within the organizations. Thus, some of them may have known the assessees before the AC and been able to draw on these experiences in their ratings. A different explanation could be that the raters were more accurate (and, as a consequence, more reliable) in their ratings because they felt a greater responsibility for the future career prospects of the in-house participants.


Last but not least, Klimoski and Strickland's (1977) old hypothesis of self-fulfilling prophecies (a supervisor's knowledge of a candidate's good AC results leads to better performance ratings) could be more prominent for internal candidates. The average age of the assessees moderated the validity in such a way that the prediction works better for younger participants (mean age below 30 years in our study). One possible explanation involves the role of job experience: Younger assessees have less job experience, thus AC success depends more on potential than in the older group, in which candidates might compensate for differences in potential through job experience, resulting in more homogeneous behavior. The result that higher validity was found if more than two years had elapsed between the AC and the criterion assessment might seem surprising, since the predictive power in diagnostic settings typically tends to wane over time (see, e.g., Anastasi & Urbina, 1997). In the present context, however, two possible explanations should be considered: Especially if assessees are young, career entry into a new job takes time, as do promotions, wage raises, etc. Differences between more and less qualified candidates do not typically emerge until a certain phase of acclimatization has passed, and critical tests of aptitude within real job settings typically do not occur very early after career entry. To the degree that motivational factors play an additional role in determining success on the job, it could be argued that successful AC candidates go through a "honeymoon period" that tends to homogenize individual differences in motivation for some time. This line of reasoning ties in well with previous research on temporal increases in predictive power (e.g., Gaugler et al., 1987; Helmreich, Sawin, & Carsrud, 1986). A comparable reasoning applies to the criterion measurement (e.g., Moser, Schuler, & Funke, 1999): A supervisor needs to gather diverse information in different situations over a suitable time period to form a valid performance rating. The advantage of ACs including intelligence tests is very much in line with the literature on the high predictive validity of cognitive measures in the prediction of vocational success (e.g., Kramer, 2009; Schmidt & Hunter, 1998). Although Schmidt and Hunter focused on the incremental validity of additional diagnostic measures over and above intelligence and showed that ACs have small to no incremental validity over IQ tests, we argue that intelligence tests are an important instrument within ACs of higher validity and that their inclusion constitutes a central node in the constellation of AC "success factors." Summarizing the results for all relevant (and modifiable) validity moderators, it is possible to formulate a few recommendations for the design and evaluation of ACs in practice. A somewhat simplified recommendation could be "keep it short but complex and diverse." It concentrates on the design and the choice of the individual exercises: Based on our results, a multitude of short exercises with relevant levels of cognitive complexity should be favored over long-term simulations. That also implies that ACs should be regarded as "harmonized multimethod arrangements" composed not only of (short) simulations but also of interviews and psychometric tests. The related recommendation "use intelligence tests in ACs to guarantee higher validity" seems premature, since the specifics of the diagnostic setting in which intelligence tests improve AC validity are not yet fully understood.


If, for example, the acceptance (social validity; see Schuler, 1993) of cognitive measures in ACs is a crucial factor for their diagnostic value, the employment of cognitive measures could deflate AC validity. A last recommendation, to be taken with a grain of salt, can be formulated for AC designers who have to evaluate their own systems: They are well advised to choose supervisory ratings, collected at least 2 years after the assessment, as criteria for their ACs (ideally ones used for potential analysis). Putting all the information together, our meta-analysis indicates that the AC approach generally works in German-speaking countries, and the identified moderators provide a starting point for further empirical studies aimed at understanding and advancing the predictive validity of ACs.

References

Ackerman, P. L., & Lohman, D. F. (2006). Individual differences in cognitive functions. In P. A. Alexander & P. H. Winne (Eds.), Handbook of educational psychology (pp. 139–161). Hillsdale, NJ: Erlbaum.
Anastasi, A., & Urbina, S. (1997). Psychological testing. Upper Saddle River, NJ: Prentice Hall.
*Annen, H., & Eggimann, N. (2006). Assessment center results as predictor of study success. Paper presented at the 48th Annual Conference of the International Military Testing Association, Kingston, Canada.
Arthur, W. Jr., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related validity of assessment center dimensions. Personnel Psychology, 56, 125–154.
Birri, R., & Naef, B. (2006). Wirkung und Nutzen des Assessment-Center-Feedbacks im Entwicklungsverlauf von Nachwuchsführungskräften – eine retrospektive Studie [Impact and usefulness of assessment center feedback for the personnel development of junior managers: A retrospective study]. Wirtschaftspsychologie, 8, 58–67.
Brodbeck, F. C., Frese, M., & Javidan, M. (2002). Leadership made in Germany: Low on compassion, high on performance. Academy of Management Executive, 16, 16–29.
Cronbach, L., & Gleser, G. (1965). Psychological tests and personnel selection. Urbana, IL: University of Illinois Press.
*Damitz, M., Manzey, D., Kleinmann, M., & Severin, K. (2003). Assessment centre for pilot selection: Construct and criterion validity and the impact of assessor type. Applied Psychology: An International Review, 52, 193–212.
*Emrich, M. (2001). Evaluation der Personalentwicklungsmaßnahme "PQN – Praxisqualifizierung für Nachwuchskräfte" bei der DaimlerChrysler AG [Evaluation of the human resources development instrument "PQN – Practical Qualification for Newcomers" at DaimlerChrysler AG]. Unpublished manuscript.
*Fennekels, G. (1987). Validität des Assessment-Centers bei Führungskräfteauswahl und -entwicklung [Validity of assessment centers for the selection and development of executives]. Unpublished doctoral dissertation, University of Bonn, Germany.
Gaugler, B. B., Rosenthal, D. B., Thornton, G. C. I., & Bentson, C. (1987). Meta-analysis of assessment centre validity. Journal of Applied Psychology, 72, 493–511.
Ghiselli, E. E. (1973). The validity of aptitude tests in personnel selection. Personnel Psychology, 26, 461–477.

*Gierschmann, F. (2005). Evaluation von Auswahl- und Potenzial Assessment Centern. Beispiele der Deutschen Post AG [Evaluation of assessment centers for selection and potential analysis. Examples from Deutsche Post AG]. In K. Sünderhauf, S. Stumpf, & S. Höft (Eds.), Assessment Center: Von der Auftragsklärung bis zur Qualitätssicherung (pp. 375–388). Lengerich, Germany: Papst.
*Görlich, Y., Schuler, H., Becker, K., & Diemand, A. (2005). Evaluation zweier Potenzialanalyseverfahren zur internen Auswahl und Klassifikation [Evaluation of two instruments for potential assessments in the context of internal selection and classification]. In H. Schuler (Ed.), Assessment Center zur Potenzialanalyse (pp. 203–232). Göttingen, Germany: Hogrefe.
*Gutknecht, S. P., Semmer, N. K., & Annen, H. (2005). Prognostische Validität eines Assessment Centers für den Studien- und Berufserfolg von Berufsoffizieren der Schweizer Armee [Predictive validity of an assessment center for academic and professional success of Swiss army officers]. Zeitschrift für Personalpsychologie, 4, 170–180.
Hardison, C. M., & Sackett, P. R. (2007). Kriteriumsbezogene Validität des Assessment Centers: lebendig und wohlauf? [Criterion-related validity of assessment centers: Alive and well?] In H. Schuler (Ed.), Assessment Center zur Potenzialanalyse (pp. 192–202). Göttingen, Germany: Hogrefe.
Helmreich, R. L., Sawin, L. L., & Carsrud, A. L. (1986). The honeymoon effect in job performance: Temporal increases in the predictive power of achievement motivation. Journal of Applied Psychology, 71, 185–188.
Hermelin, E., Lievens, F., & Robertson, I. T. (2007). The validity of assessment centres for the prediction of supervisory performance ratings: A meta-analysis. International Journal of Selection and Assessment, 15, 405–411.
Hofstede, G. (2001). Culture's consequences: Comparing values, behaviors, institutions and organizations across nations. London, UK: Sage.
Höft, S., & Obermann, C. (in press). Die Qualität von Assessment Centern im deutschsprachigen Raum: Stabil mit Hoffnung zur Besserung [The quality of assessment centers in German-speaking countries: Stable with hope for improvement]. In P. Gelleri & C. Winsen (Eds.), Personalpsychologische Diagnostik als Beitrag zu Berufs- und Unternehmenserfolg. Göttingen, Germany: Hogrefe.
*Holling, H., & Reiners, W. (1994). Prognostische Validierung von Assessment Center-Urteilen anhand objektiver Kriterien [Validation of assessment center ratings with objective criteria]. In K. Pawlik (Ed.), Bericht über den 39. Kongress der Deutschen Gesellschaft für Psychologie (pp. 801–802). Göttingen, Germany: Hogrefe.
Hülsheger, U. R., Maier, G. W., & Stumpp, T. (2007). Validity of general mental ability for the prediction of job performance and training success in Germany: A meta-analysis. International Journal of Selection and Assessment, 15, 3–18.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–98.
Hunter, J. E., & Schmidt, F. L. (2000). Fixed effects vs. random effects meta-analysis models: Implications for cumulative knowledge in psychology. International Journal of Selection and Assessment, 8, 275–292.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis (2nd ed.). Thousand Oaks, CA: Sage.
*Kersting, M. (2003). Assessment Center: Erfolgsmessung und Qualitätskontrolle [Assessment center: Performance measurement and quality control]. In S. Höft & B. Wolf (Eds.), Qualitätsstandards für Personalentwicklung in Wirtschaft und Verwaltung: Wie Konzepte greifen (pp. 72–93). Hamburg, Germany: Windmühle.
Klimoski, R., & Strickland, W. J. (1977). Assessment centers: Valid or merely prescient? Personnel Psychology, 30, 353–363.


Kramer, J. (2009). Allgemeine Intelligenz und beruflicher Erfolg in Deutschland: Vertiefende und weiterführende Metaanalysen [General mental ability and occupational success in Germany: Further meta-analytic elaborations and amplifications]. Psychologische Rundschau, 60, 82–98.
Krause, D. E., & Gebert, D. (2003). A comparison of assessment centre practices in organizations in German-speaking regions and the United States. International Journal of Selection and Assessment, 11, 297–312.
*Laubsch, K. (2001). Assessment-Center und Situatives Interview in der Personalauswahl von Operateuren komplexer technischer Systeme – ein Verfahrensvergleich unter den Aspekten Konstruktvalidität und Prädiktive Validität [Assessment center and situational interview for the selection of operators of complex technical systems – a comparison regarding predictive and construct validity] (DLR-Forschungsbericht 2001-01). Hamburg, Germany: DLR.
McNatt, D. B. (2000). Ancient Pygmalion joins contemporary management: A meta-analysis of the result. Journal of Applied Psychology, 85, 314–322.
*Moser, K., Schuler, H., & Funke, U. (1999). The moderating effect of raters' opportunities to observe ratees' job performance on the validity of an assessment centre. International Journal of Selection and Assessment, 7, 133–141.
*Neubauer, R., & Leiter, R. F. (1996). Validierung von Assessment Centern mit qualitativen Daten [Validation of assessment centers with qualitative data]. In Arbeitskreis Assessment Center e.V. (Eds.), Assessment Center heute: Schlüsselkompetenzen, Qualitätsstandards, Prozessoptimierung (1st ed., pp. 273–285). Hamburg, Germany: Windmühle.
Orwin, R. G. (1983). A fail-safe N for effect size. Journal of Educational Statistics, 8, 147–159.
Pearlman, K. (1982). The Bayesian approach to validity generalization: A systematic examination of the robustness of procedures and conclusions. Unpublished doctoral dissertation, Department of Psychology, George Washington University, Washington, DC.
*Plate, T. (2007). Personalauswahl in Unternehmensberatungen [Personnel selection in management consultancies]. Wiesbaden, Germany: Deutscher Universitäts-Verlag.
Ryan, A. M., McFarland, L., Baron, H., & Page, R. (1997). An international look at selection practices: Nation and culture as explanations for variability in practice. Personnel Psychology, 52, 359–391.
Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., De Fruyt, F., & Rolland, J. P. (2003). A meta-analytic study of general mental ability validity for different occupations in the European Community. Journal of Applied Psychology, 88, 1068–1081.


Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274. Schmidt, F. L., & Hunter, J. E. (1999). Comparison of three meta-analysis methods revisited: An analysis of Johnson, Mullen, and Salas (1995). Journal of Applied Psychology, 84, 144–148. Schmidt, F. L., & Le, H. (2005). Hunter and Schmidt metaanalysis programs (V1.1). Iowa City, IA: Department of Management and Organizations, University of Iowa. Schuler, H. (1993). Social validity of selection situations: A concept and some empirical results. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Individual and organizational perspectives (pp. 11–26). Hillsdale, NJ: Erlbaum. Spychalski, A. C., Quinones, M. A., Gaugler, B. B., & Pohley, K. (1997). A survey of assessment center practices in organizations in the United States. Personnel Psychology, 50, 71–90. *Tatzel, H. (2004). Analyse und Bewertung eines Assessmentcenters im Rahmen eines Bewerberauswahlprozesses [Analysis and evaluation of an assessment centre for personnel selection]. Mainz, Germany: Johannes Gutenberg Universita¨t, Unpublished diploma thesis. Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81, 557–574. Whitener, E. M. (1990). Confusion on confidence intervals and credibility intervals in meta-analysis. Journal of Applied Psychology, 75, 315–321. Wiechmann, D., Ryan, A. M., & Hemingway, M. (2003). Designing and implementing global staffing systems: Part I. – Leaders in global staffing. Human Resource Management, 42, 71–83.

Stefan Höft
Hochschule der Bundesagentur für Arbeit (HdBA)
Seckenheimer Landstr. 16
68163 Mannheim
Germany
Tel. +49 621 4209-356
Fax +49 621 4209-880356
E-mail [email protected]
