
Journal of Educational Measurement Spring 2009, Vol. 46, No. 1, pp. 43–58

An Examination of Rater Drift Within a Generalizability Theory Framework

Polina Harik, Brian E. Clauser, Irina Grabovsky, Ronald J. Nungester, and Dave Swanson
National Board of Medical Examiners

Ratna Nandakumar
University of Delaware

The present study examined the long-term usefulness of estimated parameters used to adjust the scores from a performance assessment to account for differences in rater stringency. Ratings from four components of the USMLE Step 2 Clinical Skills Examination data were analyzed. A generalizability-theory framework was used to examine the extent to which rater-related sources of error could be eliminated through statistical adjustment. Particular attention was given to the stability of these estimated parameters over time. The results suggest that rater stringency estimates obtained at a point in time and then used to adjust ratings over a period of months may substantially decrease in usefulness. In some cases, over several months, the use of these adjustments may become counterproductive. Additionally, it is hypothesized that the rate of deterioration in the usefulness of estimated parameters may be a function of the characteristics of the scale.

During recent years, considerable attention has been given to procedures for scaling and equating tests that include complex performance-based formats. In many situations, these tasks are scored by having trained raters review and rate the responses. One concern with this approach is that a rater's individual tendency to be relatively strict or lenient may result in an unfair advantage or disadvantage for some examinees. Unless all tasks are rated by the same raters, this factor may pose a nontrivial threat to the validity of score interpretations and the reliability of the assessment. Although raters are commonly trained to apply standardized criteria, ". . . raters are human and they are therefore subject to all the errors to which humankind must plead guilty" (Guilford, 1936, p. 272). Differences in rater stringency have long been documented as a nontrivial source of measurement error on tests of this type, and this source of error was among the reasons that many assessment programs based on essays and other performance formats were replaced by multiple-choice items during the last century (e.g., Hartog & Rhodes, 1936; Hubbard & Levit, 1985). The recent resurgence of performance assessments has necessitated reexamination of this issue. A few studies have examined the effects of varying rater stringency in the context of large-scale, continuously administered examinations (Clauser, Harik, & Margolis, 2006; Engelhard, 1994; Longford, 1994). The findings emphasize that differences due to varying rater stringencies may contribute substantially to the measurement error for tests of this sort. Numerous authors have discussed issues and procedures for adjusting for rater effects (Braun & Wainer, 1989; De Gruijter, 1984; Engelhard, 1996; Raymond & Viswesvaran, 1993).


These adjustment procedures have typically used either linear or IRT models in which ratings are considered a function of rater stringency, item difficulty, examinee ability, and other factors of the design. The resulting rater stringency estimates are then applied to the original ratings to adjust for varying rater stringencies. The approaches cited above include a number of linear models based on analysis of variance, including complete block designs in which all raters are assigned to all examinees (Stanley, 1962), and incomplete and partially balanced incomplete block designs (Fleiss, 1981), which have been shown to provide accurate parameter estimation when adequate overlap of raters, examinees, and tasks is ensured (Braun, 1988). In addition to linear models, an IRT-based family of models has been used in calibration procedures for performance assessments. These polytomous probabilistic models simultaneously estimate examinee ability and item difficulty for categorical data (Linacre & Wright, 2002; van der Linden & Hambleton, 1997).

Although substantial research has been conducted on modeling and adjusting for rater stringency and/or task difficulty at a point in time, very few studies have examined the effectiveness of these procedures over long periods. The issue of the stability of rater stringency estimates is an important one because, in any large-scale examination administered continuously, measurement instruments must remain invariant over time to ensure that valid score interpretations can be made. In traditional multiple-choice examinations, item parameters are periodically reevaluated to account for any possible changes in item difficulties. Changes in item difficulties, often referred to as item parameter drift, are understood to be due to changes in educational practice or to item exposure (Bock, Muraki, & Pfeiffenberger, 1988). When human judgment is involved in the measurement process, in addition to these factors, progressive changes in rater stringency over time may occur due to changes in rater behavior resulting from environmental and psychological factors. It has been shown that rating judgments may change over time despite attempts to maintain constant standards, such as rater retraining and frequent feedback (Bock, 1995; Congdon & McQueen, 2000; McKinley & Boulet, 2004).

Wolfe, Moulder, and Myford (2001) examined rater stringency drift in a simulation study in which the probability of an examinee receiving a rating in a particular category from a particular rater on a specific task was estimated using a multifaceted Rasch rating scale model. Examining the rater effect over time, within and across raters, the authors detected various types of rater drift. However, the generalizability of the findings is limited because the results were based on simulations only; the researchers focused on modeling rater stringency drift rather than identifying it. Additionally, the simulated conditions in the study reflected one type of drift at a time, whereas real data may contain multiple simultaneous patterns of rater stringency instability over time. Using elementary school writing performance data, Congdon and McQueen (2000) examined the stability of rater stringency over a period of seven working days. The authors used the multifaceted Rasch model to produce daily estimates of relative rater stringency for 16 raters, who used a six-point categorical scale to assess a writing task.
The analyses of the rater stringency estimates showed that relative rater stringency (the stringency of a rater compared to other raters on the same day) varied from day to day without an apparent predictable pattern.


Additionally, the absolute change in rater stringency, or the change in stringency within a rater, was shown to be significant for the majority of the raters. The study focused on short-term changes that may be classified as daily fluctuations. As such, these effects would likely be viewed as part of the residual error in most contexts, because daily adjustment of rater effects would be impractical, if not impossible. Other research involving operational programs that make periodic adjustments for rater stringency has shown that parameters that are relatively stable across days or weeks may shift over more extended periods of time. McKinley and Boulet (2004) examined whether rater parameters are sufficiently stable to allow substantial improvement in measurement precision when applied to examinees tested within a 3-month period. The study was based on a yearly cohort (N = 5,335) of international medical graduates who were required to complete a clinical skills assessment in order to enter graduate training in the United States. Variation in rater stringency over time was analyzed using an ANCOVA design. The dependent variable was the score; raters and time periods were the independent variables; two covariates representing examinee ability were used to control for potential differences in examinees over time periods. The authors reported relatively small effects for most raters but large effects for some. The larger differences in rater stringency over time were about one-half of a standard deviation, representing a potentially important effect. The authors stressed that such effects may give an unfair advantage or disadvantage to some examinees.

Although the studies referenced here provide evidence of rater stringency drift, they provide little practical information about how long the estimated parameters remain useful. When an examination is administered periodically and scoring and scaling can be completed within days or weeks, change in rater performance over time may be minimal. When assessments requiring human scoring are administered continuously over months, or year round, the impact of changes in rater behavior may become a matter of practical concern. In addition, these previous studies did not address the impact of rater stringency drift on the reliability or generalizability of scores.

The present study provides an assessment of the degree to which the usefulness of statistical adjustments designed to account for differences in rater stringency and task difficulty decreases over periods of months. The results are interpreted within a generalizability theory framework. The data come from the USMLE Step 2 Clinical Skills Examination. The examination is administered and scored throughout the year, and the results include information about several different score scales. The scoring algorithms for these scales vary from highly analytic to holistic, presenting an opportunity for examination and comparison of rater stringency drift on different types of scales.

Method

Description of the USMLE Step 2 Clinical Skills Examination

The Clinical Skills Examination is required for licensure of allopathic physicians (those with an M.D. degree) in the United States. It is designed to evaluate several aspects of physicians' clinical skills including data gathering, documentation, communication and interpersonal skills, and spoken English.


The structure of the examination requires the examinee to interact with 12 standardized patients (subsequent analyses will be based on 11 cases per examinee because one of the 12 may be used for pretesting cases and training new standardized patients). These standardized patients are laypersons who have been trained to portray patients with specific medical complaints. The examinee has 15 minutes to interact with the patient, taking the patient's history and performing a focused physical examination. The examinee then has 10 minutes to produce a structured patient note. During this 10-minute period, the standardized patient completes three assessment instruments: (a) a checklist that documents whether the examinee completed specified physical examination maneuvers and inquired about specific aspects of the patient's medical history and the history and status of the present problem; (b) a set of rating scales that assess the examinee's proficiency in information gathering, information sharing, and professional manner and rapport; and (c) a single rating scale that assesses the examinee's spoken English. The structured patient note is subsequently scored by a physician trained to apply a case-specific scoring algorithm. These four assessment instruments result in four scores: (a) a data-gathering score based on the checklist, (b) a communication and interpersonal skills score, (c) a spoken English proficiency score, and (d) a documentation score based on the structured patient note.

The four scores examined in this work differ in several important respects. Most obviously, they differ in the specific proficiency they are intended to measure, but they also differ in that three of the scores are based on rating scales while one is based on a behavioral checklist. The checklist is highly analytic and is necessarily scenario specific. By contrast, the rating scales used for the communication score and the spoken English score are generic in the sense that the same scale is used across patient scenarios without scenario-specific training. Although the standardized patients are given extensive training in the use of these scales, the scales are less analytic and require more judgment. The documentation score is also based on a rating scale, but the criteria include both general factors, such as clarity of expression, and specific factors representing case content.

Within the current operational scoring and scaling for the examination, the effects associated with the difficulty of the clinical scenario (or case) are confounded with the stringency of the individual providing the rating, because each standardized patient scores only one case for a given examinee. Similarly, each physician rater will typically score only one patient note for each examinee. A linear model, described in detail in the next section, is used to estimate a parameter representing the relative difficulty associated with the combined rater stringency and case difficulty. This confounded effect of rater and clinical case scenario will be referred to as the "rater/case" effect for the remainder of this article. Previous research indicates that this confounded source of error may contribute substantially to the overall variance for tests of this sort. Margolis, Clauser, Swanson, and Boulet (2003) presented the results of a generalizability analysis of a related examination and showed that this combined rater/case effect contributed significantly to error variance.
A subsequent paper (Clauser et al., 2006), based on data from the USMLE Step 2 Clinical Skills Examination, showed that this combined rater/case effect makes a significant proportional contribution to total error variance for each of the four scores.


In the case of the spoken English score, however, this represents a relatively trivial contribution to overall variance. The impact that the combined rater/case effect has on the reproducibility of scores argues for the potential benefit of a calibration or adjustment procedure that can account for these differences in difficulty.

The Calibration/Adjustment Procedure

The adjustment used in operational scoring of this examination is based on a linear model. An offset is estimated for each standardized patient on a clinical scenario, or rater/case combination. An examinee's score given by a rater for a case is modeled through a general linear model applied to an analysis of variance design in which each examinee and each rater/case combination represent their own "block" or "treatment." This design is similar to the partially balanced incomplete block design described in Braun (1988), who showed that unbiased estimates can be obtained for this type of design. The balance in this design comes from the fact that there are groups of raters assigned to groups of examinees (within any given day of an examination); at the same time, individual raters are members of other rater groups. Each examinee therefore receives 11 ratings on different cases, and each rater evaluates hundreds of examinees on a given case.

In order to treat this analysis of variance design within a linear modeling framework, dummy variables Z are created to indicate each effect. Suppose there are J examinees and K rater/case combinations; then there are J − 1 examinee dummy variables, K − 1 rater/case dummy variables, and one column for the intercept. The model can be represented in matrix notation as

$$
\underset{(I \times 1)}{y} \;=\; \underset{(I \times N)}{Z_e}\,\underset{(N \times 1)}{\alpha} \;+\; \underset{(I \times R)}{Z_r}\,\underset{(R \times 1)}{\beta} \;+\; \underset{(I \times 1)}{\varepsilon},
$$

where y is a vector of I ratings, α is a vector of examinee parameters, β is a vector of rater/case parameters, Z_e is an I by N matrix of the intercept and examinee dummy variables, Z_r is an I by R matrix of the rater/case dummy variables, N = 1 + (J − 1), R = K − 1, and ε is a vector of random errors. A least squares estimation procedure can be used to obtain the estimates. Once estimated, the rater/case difficulties can be used to adjust observed ratings.

Conceptually, this procedure calculates a parameter for each rater/case combination and a parameter for each examinee. Because the independent variables (the rater/case combinations and the examinees) are coded 0 or 1, the estimated regression weights are offsets representing relative difficulty for rater/case combinations and relative proficiency, or expected score, for examinees. These are scaled relative to the omitted reference categories (the Jth examinee and the Kth rater/case combination), but it is trivial to rescale and center these values around zero or around the mean of the original raw score distribution. The examinee offsets produced by the procedure provide examinee scores that have been adjusted for the difficulty of the rater/case combinations that the examinee encountered. To apply the rater/case difficulty parameters to examinees not included in the sample used for parameter estimation, an overall score for an examinee can be computed by averaging the 11 individual ratings the examinee received after those ratings have been adjusted for the individual rater/case difficulties.
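
The estimation itself is ordinary least squares on the dummy-coded design matrix. The Python sketch below illustrates the idea on a small, fully crossed synthetic data set; the variable names, sample sizes, and fully crossed layout are illustrative assumptions rather than the operational implementation, which works with a much larger, partially crossed data set.

```python
# Illustrative sketch (not the operational code): least-squares estimation of
# examinee and rater/case offsets from dummy-coded ratings.
import numpy as np

rng = np.random.default_rng(0)

J, K = 200, 11                         # examinees and rater/case combinations (toy sizes)
true_theta = rng.normal(0.0, 1.0, J)   # simulated examinee proficiencies
true_beta = rng.normal(0.0, 0.5, K)    # simulated rater/case offsets

# Toy layout: every examinee is rated on every rater/case combination.
examinee = np.repeat(np.arange(J), K)
ratercase = np.tile(np.arange(K), J)
y = true_theta[examinee] + true_beta[ratercase] + rng.normal(0.0, 0.3, J * K)

# Design matrix: intercept, J-1 examinee indicators, K-1 rater/case indicators.
I = len(y)
Ze = np.zeros((I, J))                  # column 0 is the intercept
Ze[:, 0] = 1.0
for j in range(J - 1):
    Ze[examinee == j, 1 + j] = 1.0
Zr = np.zeros((I, K - 1))
for k in range(K - 1):
    Zr[ratercase == k, k] = 1.0

coef, *_ = np.linalg.lstsq(np.hstack([Ze, Zr]), y, rcond=None)
# coef[:J] holds the intercept and examinee offsets; coef[J:] holds the
# rater/case offsets relative to the reference combination K.
beta_hat = coef[J:]

# Rescale so the offsets are centered at zero (the reference combination has
# an implicit offset of 0 before centering), as described in the text.
beta_centered = np.append(beta_hat, 0.0)
beta_centered -= beta_centered.mean()

# Should be small: recovery of the simulated (centered) rater/case offsets.
print(np.max(np.abs(beta_centered - (true_beta - true_beta.mean()))))
```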


For the purposes of illustration, and to simplify notation, Y_jk denotes the rating given to examinee j on rater/case k, and β̂_k is the difficulty estimate for rater/case k. The overall score for examinee j, averaged across the 11 tasks, is

$$
\bar{\theta}_j = \frac{1}{11} \sum_{k=1}^{11} \left( Y_{jk} - \hat{\beta}_k \right).
$$
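
As a concrete illustration of this adjustment (with made-up numbers, not operational data), the computation is simply a mean of offset-corrected ratings:

```python
# Hypothetical example of the adjusted overall score: subtract each rater/case
# difficulty estimate from the corresponding rating and average the results.
import numpy as np

def adjusted_score(ratings, offsets):
    """theta_bar_j: mean over the 11 encounters of (Y_jk - beta_hat_k)."""
    ratings = np.asarray(ratings, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    return float(np.mean(ratings - offsets))

# Eleven illustrative ratings and the centered difficulty estimates for the
# eleven rater/case combinations this examinee happened to encounter.
ratings = [6, 7, 5, 8, 6, 7, 7, 5, 6, 8, 7]
offsets = [0.4, -0.2, 0.1, 0.5, -0.3, 0.0, 0.2, -0.4, 0.1, 0.3, -0.1]
print(adjusted_score(ratings, offsets))   # raw mean is about 6.55; adjusted mean is about 6.49
```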

Parameter Drift

To examine the stability of the estimated adjustments across samples at a point in time, and to investigate the stability of rater/case difficulty estimates over time, two related studies were conducted. For the first study, the data consisted of approximately 12,800 task-level ratings for nearly 1,200 examinees who completed the USMLE Step 2 Clinical Skills Examination in the fall of 2005. The data were randomly divided into two samples. Rater/case difficulty parameters were estimated on the first sample of examinees using the general linear model described in the previous section. The estimated parameters were then applied to the observed ratings of the second sample to adjust for rater/case difficulties. An analysis of variance method was used to estimate rater/case and examinee variance components before and after the adjustment for both samples. This provided evidence about both the usefulness of the adjustments and the stability of the estimates across samples.

A second study was conducted to evaluate the extent to which the estimated parameters remain stable over time. The data for the second study consisted of about 41,000 task-level ratings for approximately 3,700 examinees who completed the USMLE Step 2 Clinical Skills Examination at a single examination center between November 2005 and June 2006. For each component of the examination, rater/case difficulty was estimated using the data from the first 2 months and then applied to the data for the following 6 months. The adjusted scores were then examined using a variance components analysis. The variance components due to the rater/case effects were then compared for each month of the data.

G-Theory Framework

Variance components analysis provides a practical approach for examining the usefulness of the estimated parameters in making adjustments to account for differences in rater/case difficulty. When applied directly to raw scores, the variance components analysis provides an estimate of the impact of differences in rater/case difficulty on the overall score variance. Once the parameters have been estimated and applied to the raw scores, estimating variance components for the adjusted scores provides a measure of the extent to which differences in rater/case difficulty have been eliminated through the adjustments. When the analysis is applied to an independent sample of examinee performances from the same time period in which the parameter estimates were made, it provides a cross validation of the estimates. When it is applied to a sample of examinee performances from a different time period, it provides a basis for evaluating the extent to which the rater/case difficulties remain constant over time. The variance components analysis approach also fits seamlessly into a generalizability theory framework (Brennan, 2001).
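
To make the logic concrete, the sketch below estimates examinee, rater/case, and residual variance components for a balanced persons-crossed-with-rater/case design using the standard expected-mean-squares equations from generalizability theory. The operational data set is unbalanced, so this balanced-design sketch only illustrates the logic rather than reproducing the estimates reported below; all names and numbers are synthetic.

```python
# Illustrative variance-component estimation for a balanced p x i (examinee x
# rater/case) design, using expected mean squares; not the operational estimator.
import numpy as np

def crossed_variance_components(scores):
    """scores: (n_p, n_i) array with one rating per examinee x rater/case cell."""
    n_p, n_i = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)
    i_means = scores.mean(axis=0)

    ms_p = n_i * np.sum((p_means - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((i_means - grand) ** 2) / (n_i - 1)
    resid = scores - p_means[:, None] - i_means[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))

    return {"examinee": (ms_p - ms_res) / n_i,
            "rater_case": (ms_i - ms_res) / n_p,
            "residual": ms_res}

# Synthetic raw scores with examinee, rater/case, and residual effects.
rng = np.random.default_rng(1)
n_p, n_i = 500, 11
raw = (rng.normal(0.0, 1.0, (n_p, 1))       # examinee effects
       + rng.normal(0.0, 0.8, (1, n_i))     # rater/case effects
       + rng.normal(0.0, 1.5, (n_p, n_i)))  # residual
print(crossed_variance_components(raw))

# Adjusting with offsets estimated on this same sample (here, centered column
# means) forces the rater/case component to roughly zero, which is why an
# independent sample is needed for cross validation.
adjusted = raw - raw.mean(axis=0, keepdims=True) + raw.mean()
print(crossed_variance_components(adjusted))
```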


Previous analysis of the generalizability of scores for assessments of this sort has used a design in which examinees are crossed with raters nested in (and confounded with) tasks or cases (Boulet, Rebbecchi, Denton, McKinley, & Whalen, 2007; Clauser et al., 2006; Margolis et al., 2003). In the examinees-crossed-with-rater/case-combinations model, the examinee variance represents variability resulting from differences in examinee proficiency. The rater/case variance component represents variability resulting from a combination of differences in rater stringency, case difficulty, and the rater-case interaction. The examinee-by-rater/case variance represents the extent to which interactions exist between examinees and rater/case combinations; this source of variance also includes the residual effects of sources of variance not accounted for in the design.

Generalizability analysis based on the raw scores provides an estimate of the extent to which the reproducibility of the scores is impacted by differences in rater/case difficulty. This source of error will impact the reproducibility of scores when examinees are compared on the basis of different forms of the test (or when examinee scores are compared to a domain-referenced standard). The same analysis based on adjusted scores provides a basis for comparison for evaluating the extent to which the reproducibility of scores is improved as a result of the adjustments. The analysis can similarly show the extent to which this improvement remains stable over time.

As suggested previously, the general analysis considers rater/case combinations and examinees as facets of the design. Within any test session, examinees are crossed with rater/case combinations. Across test sessions, rater/case combinations are nested in examinees (although the sampling is not strictly random because cases are sampled to meet the requirements of the test specifications). For the data-gathering, communication and interpersonal skills, and spoken English scores, raters may be viewed as nested in cases or cases may be viewed as nested in raters, in that each case may be rated (and portrayed) by more than one rater and each rater may rate (and portray) one or two cases. Because many cases are rated by only one rater and many raters rate only one case, differentiating rater and case effects is impossible for these scores. For the documentation scores, each rater may rate performances on five or more cases and each case is rated by at least three raters. This allows for estimation of variance components for a more complex design for the documentation scores. Given the structure of the data set, raters could be viewed as nested in cases or cases could be viewed as nested in raters; for the following analysis, raters are treated as nested in cases because this design supports substantive interpretation of the resulting variance components.

Results

Tables 1 through 4 present the variance components for the observed scores, the adjusted scores, and the cross-validated adjusted scores for the design with only examinees and rater/case combinations as facets. In these tables, group 1 is the sample used to estimate the parameters used for the adjustment and group 2 is an independent group used for cross validation. Consistent with the typical presentation of generalizability theory results, each table presents variance components for single observations and for mean scores.


The examinee (or person) variance components are identical for single observations and mean scores. For the remaining components (those contributing to measurement error), mean scores are produced by dividing the components for single observations by the number of times that component is sampled to produce the score. The values for single observations have been divided by 11 in these tables because test scores are based on results across 11 cases. The tables also include dependability coefficients¹ produced using

$$
\Phi = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_i^2 / n_i + \sigma_{pi}^2 / n_i}.
$$
In this equation, the subscript p represents examinees or persons, and the subscript i represents the item or, in this instance, the rater/case combination. The subscript pi represents the residual term. For each of the scales, the adjustments result in a proportionally substantial improvement in precision. The procedure used for adjusting the scores is designed to produce a rater/case variance component of zero for the sample used to estimate the adjustments. The point of interest is whether the estimated parameters result in a similar reduction when applied to independent samples of examinee performances. Tables 1 through 4 show that similar reductions in the magnitude of the variance components occur when the estimated adjustments are applied to independent samples. The practical importance of the adjustments varies across score scales.

TABLE 1
Communication and Interpersonal Skills Score Variance Components

                                  Single Observation           Mean Score
Group  Component            Un-adjusted    Adjusted    Un-adjusted    Adjusted
1      Examinee                1.64900      1.66195       1.64900      1.66195
1      Rater/Case              1.78224       .00000        .16202       .00000
1      Residual                3.04666      3.00007        .27697       .27273
2      Examinee                1.56379      1.56127       1.56379      1.56127
2      Rater/Case              1.64359       .14374        .14942       .01307
2      Residual                3.02732      3.02695        .27521       .27518
2      Dependability (Φ)                                   .78645       .84415

TABLE 2
Data-Gathering Score Variance Components

                                  Single Observation           Mean Score
Group  Component            Un-adjusted    Adjusted    Un-adjusted    Adjusted
1      Examinee                 .00258       .00260        .00258       .00260
1      Rater/Case               .00549       .00000        .00050       .00000
1      Residual                 .01163       .01145        .00106       .00104
2      Examinee                 .00278       .00272        .00278       .00272
2      Rater/Case               .00607       .00027        .00055       .00002
2      Residual                 .01154       .01153        .00105       .00105
2      Dependability (Φ)                                   .63457       .71716

TABLE 3
Documentation Score Variance Components

                                  Single Observation           Mean Score
Group  Component            Un-adjusted    Adjusted    Un-adjusted    Adjusted
1      Examinee                 .17966       .19157        .17966       .19157
1      Rater/Case               .26843       .00000        .02440       .00000
1      Residual                 .72577       .70383        .06597       .06398
2      Examinee                 .18841       .18901        .18841       .18901
2      Rater/Case               .27749       .03397        .02523       .00309
2      Residual                 .68736       .68900        .06248       .06264
2      Dependability (Φ)                                   .68234       .74199

TABLE 4
Spoken English Proficiency Score Variance Components

                                  Single Observation           Mean Score
Group  Component            Un-adjusted    Adjusted    Un-adjusted    Adjusted
1      Examinee                1.58074      1.57951       1.58074      1.57951
1      Rater/Case               .08095       .00000        .00736       .00000
1      Residual                 .42381       .41759        .03853       .03796
2      Examinee                1.52021      1.51937       1.52021      1.51937
2      Rater/Case               .09301       .00876        .00846       .00080
2      Residual                 .38407       .38322        .03492       .03484
2      Dependability (Φ)                                   .97226       .97708
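
As a quick check on the reported coefficients, the dependability formula above can be applied directly to the mean-score variance components, which already reflect division by the 11 cases; the values below are the group 2 communication and interpersonal skills components from Table 1.

```python
# Reproducing the Table 1 (group 2) dependability coefficients from the
# mean-score variance components reported above.
def dependability(var_p, var_i, var_pi):
    # Phi = var_p / (var_p + var_i + var_pi), with the error components
    # already divided by the number of cases (n_i = 11).
    return var_p / (var_p + var_i + var_pi)

print(round(dependability(1.56379, 0.14942, 0.27521), 5))  # un-adjusted: 0.78645
print(round(dependability(1.56127, 0.01307, 0.27518), 5))  # adjusted:   0.84415
```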

For spoken English proficiency, the rater/case variance is a small proportion of the total variance and, relative to the other scales, a smaller proportion of the error variance. As a result, the statistical adjustment of the spoken English score has only a modest impact on the dependability of the scores. On average, for an 11-case examination the scores would have a dependability coefficient of .97 without statistically adjusting for rater/case differences and a dependability of .98 after making the adjustment. In either instance, the scores are highly reliable. For the data-gathering score, the dependability increases from .63 to .72 when adjustments are made for rater/case variability; to achieve the same increase in dependability by lengthening the test rather than making the statistical adjustment would require increasing the test length by approximately 50%. For the documentation score, the dependability would increase from .68 to .74 after making adjustments for rater/case difficulty differences. For the communication and interpersonal skills score, the equivalent increase would be from .79 to .84. For these latter two score scales, achieving the same increase in dependability by lengthening the test would require an increase of approximately one-third of the original test length. The results reported in Tables 1 through 4 make it clear that statistically adjusting the data-gathering, documentation, and communication and interpersonal skills scores to account for differences in rater/case difficulty may substantially increase the precision of measurement.

FIGURE 1. Communication and interpersonal skills score rater/case variance. [Line graph, not reproduced: unadjusted, adjusted, and adjusted-with-cross-validation rater/case variance plotted by month, months 1-8.]

FIGURE 2. Data-gathering score rater/case variance. [Line graph, not reproduced: unadjusted, adjusted, and adjusted-with-cross-validation rater/case variance plotted by month, months 1-8.]

The question remains: how stable are the estimated parameters used to make the adjustments? Figures 1 through 3 show the rater/case variance components estimated on a monthly basis for communication and interpersonal skills, data gathering, and documentation, respectively. Each graph shows the variance components based on raw scores for examinees tested in each of the 8 months and the variance components for the same examinees after adjustment for rater/case difficulty based on parameter estimation carried out using the scores from the first 2 months. As a frame of reference, markers on each graph indicate the variance components based on a cross-validated sample of examinees from the first 2 months.

For the data-gathering score (Figure 2), the rater/case variance is only modestly greater during months 3 through 8 than the cross-validated values for the first 2 months. For the documentation score (Figure 3), the variance components for the adjusted score during months 3 and 4 are similar to the cross-validated values. These values then rise over months 5 through 8 but remain well below those for the raw scores. For the communication and interpersonal skills score (Figure 1), the value for the third month is similar to that for the cross-validated results for months 1 and 2 but rapidly rises to reach the same general magnitude as the observed (unadjusted) scores.
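
The month-by-month comparison can be sketched with the same machinery as the earlier variance-components example. The layout below (a dictionary mapping month numbers to examinee-by-rater/case score matrices with a common column ordering) and the use of centered column means as offsets are simplifying assumptions for a balanced toy design; the operational offsets come from the linear model described above, and the helper crossed_variance_components() is the one defined in the earlier sketch.

```python
# Illustrative month-by-month drift check: offsets estimated from the first two
# months are held fixed, applied to every month, and the rater/case variance
# component of the adjusted scores is tracked over time.
import numpy as np

def monthly_rater_case_variance(scores_by_month, calibration_months=(1, 2)):
    # For a balanced toy layout, the least-squares rater/case offsets reduce to
    # centered column means of the calibration data.
    calib = np.vstack([scores_by_month[m] for m in calibration_months])
    offsets = calib.mean(axis=0) - calib.mean()

    drift = {}
    for month, scores in sorted(scores_by_month.items()):
        adjusted = scores - offsets[np.newaxis, :]
        drift[month] = crossed_variance_components(adjusted)["rater_case"]
    return drift
```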

FIGURE 3. Documentation score rater/case variance. [Line graph, not reproduced: unadjusted, adjusted, and adjusted-with-cross-validation rater/case variance plotted by month, months 1-8.]

FIGURE 4. Documentation score rater and case variance. [Line graph, not reproduced: unadjusted and adjusted case and rater variance plotted by month, months 1-8.]

The more elaborate data collection design for the documentation score, which allows estimation of variance components for the rater-nested-in-case effect as well as the case effect, provides a basis for distinguishing between case difficulty and rater stringency. The case and rater-nested-in-case variance components are presented in Figure 4 for each of the 8 months. The estimated variance components for months 2, 4, 6, and 8 are presented in Table 5.

TABLE 5
Documentation Score Variance Components by Month

                                     Single Observation           Mean Score
Month  Component             Un-adjusted    Adjusted    Un-adjusted    Adjusted
2      Examinee                 .18259       .19439        .18259       .19439
2      Case                     .05143       .00000        .00468       .00000
2      Rater within Case        .25183       .00000        .02289       .00000
2      Residual                 .67016       .66334        .06092       .06030
4      Examinee                 .11817       .12126        .11817       .12126
4      Case                     .06699       .00539        .00609       .00049
4      Rater within Case        .23866       .04357        .02170       .00396
4      Residual                 .67992       .67995        .06181       .06181
6      Examinee                 .22438       .22706        .22438       .22706
6      Case                     .03758       .00000        .00342       .00000
6      Rater within Case        .25614       .07367        .02329       .00670
6      Residual                 .66546       .66720        .06050       .06065
8      Examinee                 .16545       .16389        .16545       .16389
8      Case                     .06704       .00000        .00609       .00000
8      Rater within Case        .19855       .10259        .01805       .00933
8      Residual                 .63068       .63257        .05733       .05751

Both the table and the figure show the estimated variance components for cases and for raters nested in cases based on the raw scores for examinees testing during the 8-month period. They also present the variance components for the same examinees adjusted for rater and case difficulty based on parameter estimation carried out using the scores from the first 2 months. The variance component reflecting case difficulty for the adjusted scores remains near zero throughout the 8-month period. The variance component for raters nested in cases increases. This suggests that the reduction in the usefulness of the adjustments for rater/case difficulty seen in Figure 3 is a function of changes in rater behavior rather than a characteristic of the case or of examinee behavior.

Discussion

The results presented in this article lead to three general conclusions:

1. Adjusting for rater/case differences substantially improves the precision of the scores for three of the four score scales on the Step 2 Clinical Skills Examination.
2. The linear model currently used to make those adjustments, and described in this article, effectively makes the required adjustments within the group of examinees used to estimate the parameters. Those estimated parameters continue to be appropriate when applied to a randomly equivalent group of examinees tested during the same time period or to examinees tested within 1 to 2 months of the time period used to produce the estimates.
3. The usefulness of the estimates deteriorates over time. In some cases, the estimates may become essentially useless in as little as 5 or 6 months.


Although it was hypothesized that the instability in estimated parameters over time was a function of changes in rater stringency (i.e., rater behavior) rather than changes in the difficulty of the challenge represented by the case scenario, the results presented in Tables 1 through 4 and Figures 1 through 3 provide no basis for distinguishing between these sources of variability. To more fully examine the breakdown of the rater/case variance component into a case component and a component for raters nested in cases would require a data collection design in which each case is rated by more than one rater. Only one component of the examination, the documentation score, satisfies this requirement. The results presented in Table 5 and Figure 4 strongly suggest that the change in the usefulness of the adjustments over time is a function of changes in rater behavior rather than changes in the case difficulty. It should be noted in interpreting the rater-nested-in-case variance component that it confounds the effect for overall rater stringency with the effect for the interaction of raters and cases. As such, it does not reflect a characteristic of the rater in the abstract (i.e., one that necessarily generalizes across cases that the rater may rate) but rather a characteristic of the rater when rating a specific case.

The results reported in this article demonstrate that parameters estimated to adjust for differences in rater/case difficulty in scoring may be relatively unstable. Although it is risky to generalize from the results of a single study, it is tempting to hypothesize about a relationship between the rate of deterioration in the parameter estimates and the characteristics of the scale. The data-gathering scale is the most analytic of the scales examined in this article. The checklist items are clearly defined; beyond the actual checklist, raters are trained using extensive written descriptions of the response required to receive credit for each item. Raters are monitored and targeted for additional training if their scores vary from those provided by an expert rater monitoring a sample of examinee performances. This is the scale for which the estimated parameters are most stable. By contrast, the communication and interpersonal skills score is the scale for which the estimated parameters are least stable. For this scale, raters make three holistic ratings for each examinee interaction: one for information gathering, one for information sharing, and one for professional manner and rapport. This is the least analytic of the scales. Although the raters receive extensive training and are monitored in their use of the scale, the ratings require more judgment. The reported results are far from definitive, but it may be that more analytic scoring rules result in more stable rater performance over time.

An alternative explanation, or possibly another means of describing the same relationship, arises when one considers that the source of the variance components described in this study is a confounding of the variance representing variability in the difficulty of the challenge represented by the case and the variance representing differences in rater stringency in assessing the examinees. It may be that the case variance is stable and the rater variance is not. Given that the same rating scale is used across cases for assessing communication and interpersonal skills, it may be that relatively little of the variability in score differences across rater/case pairs is associated with the case.
By contrast, the data-gathering scale is highly case-content specific. This may result in substantial variability in case difficulty. The documentation scores fall somewhere in between, with part of the scoring criterion depending on case content and part depending on more generic written communication skills.


The issue of the variability of estimated parameters for rater/case difficulty over time may also be considered from the perspective of the adequacy of training. It may be that the stability of these estimated parameters is an indicator of the adequacy of rater training and monitoring. When training is optimal, there will be relatively little room for the rater to exercise personal opinion and thus little variability from one rater to the next. Of course, the ease with which this standard of training can be met is likely to depend on the clarity and thoroughness with which the scoring criteria are developed.

One technical issue related to the cross-validation results presented in the tables and figures warrants a comment. The estimation of the parameters of interest and the application of these parameters to an independent sample of examinee performances is included to demonstrate that the apparent usefulness of these parameters does not result from overfitting and capitalization on chance. (Note that the model forces the rater/case variance component to be zero for the sample on which the estimates were produced.) The results indicate that the adjustment is effective when applied to a cross-validation sample, but it should be noted that the cross-validation procedure may actually underestimate the effectiveness of the procedure. The cross validation requires splitting the sample, estimating parameters based on one half of the sample, and applying the estimated parameters to the other half. To the extent that the accuracy of estimation increases with the size of the sample, reducing the sample in this way may under-represent the usefulness of the procedure when applied to an independent sample. More elaborate bootstrapping procedures may overcome this limitation, but they do not seem justified given the effectiveness of the estimated adjustments based on the split samples.

Conclusions

The results of this study suggest that, for assessments that are administered over an extended period and scored by raters, it may be important to evaluate the stability of estimated parameters used to make adjustments for differences in rater stringency and case difficulty. This will be true whether the adjustments are made using a linear model or an IRT-based procedure. The results should guide decisions about how frequently the estimation should be repeated. This work also raises a variety of questions for further research. These include issues such as: (a) What are the characteristics of scoring procedures that relate to the stability of these parameter estimates? (b) What training procedures may be useful in creating stable rater performance? (c) Are there approaches that would be useful for modeling changes in rater performance when estimating rater difficulty parameters?

As a practical matter, this research has led to frequent reestimation of the rater/case parameters used in scoring the Step 2 Clinical Skills Examination. These results have also led to a significant research effort targeted at developing a Bayesian procedure to produce more robust estimates using the necessarily smaller data sets available for frequent parameter estimation.

Note

1. The dependability index is defined as the ratio of the universe score variance component to the sum of the universe score variance and absolute error variance components (Brennan, 2001). Throughout this report we refer to the reported index as a dependability coefficient, but the reader should note that, for the nested design used in this study, it is identical to the generalizability coefficient.


References

Bock, R. D. (1995). Open-ended exercises in large-scale educational assessment. In L. B. Resnick & J. G. Wirt (Eds.), Linking school and work: Roles for standards and assessment (pp. 305–338). San Francisco: Jossey-Bass.
Bock, R. D., Muraki, E., & Pfeiffenberger, W. (1988). Item pool maintenance in the presence of item parameter drift. Journal of Educational Measurement, 25, 275–285.
Boulet, J. R., Rebbecchi, T., Denton, E. C., McKinley, D. W., & Whalen, G. P. (2007). Assessing the written communication skills of medical school graduates. Advances in Health Sciences Education, 9, 47–60.
Braun, H. I. (1988). Understanding scoring reliability: Experiments in calibrating essay readers. Journal of Educational Statistics, 13, 1–18.
Braun, H. I., & Wainer, H. (1989). Making essay test scores fairer with statistics. In J. Tanur, F. Mosteller, W. H. Kruskal, E. L. Lehmann, R. F. Link, R. S. Pieters, & G. S. Rising (Eds.), Statistics: A guide to the unknown (3rd ed., pp. 178–188). Pacific Grove, CA: Wadsworth.
Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
Clauser, B. E., Harik, P., & Margolis, M. J. (2006). A multivariate generalizability analysis of data from a performance assessment of physicians' clinical skills. Journal of Educational Measurement, 43, 173–191.
Congdon, P. J., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37, 163–178.
De Gruijter, D. N. M. (1984). Two simple models for rater effects. Applied Psychological Measurement, 8, 213–218.
Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31, 93–112.
Engelhard, G., Jr. (1996). Evaluating rater accuracy in performance assessments. Journal of Educational Measurement, 33, 56–70.
Fleiss, J. L. (1981). Balanced incomplete block designs for inter-rater reliability studies. Applied Psychological Measurement, 5, 105–112.
Guilford, J. P. (1936). Psychometric methods. New York: McGraw-Hill.
Hartog, P., & Rhodes, E. C. (1936). The marks of examiners. London: Macmillan.
Hubbard, J. P., & Levit, E. J. (1985). The National Board of Medical Examiners: The first seventy years. Philadelphia: National Board of Medical Examiners.
Linacre, J. M., & Wright, B. D. (2002). Understanding Rasch measurement: Construction of measures from many-facet data. Journal of Applied Measurement, 3, 486–512.
Longford, N. T. (1994). Reliability of essay rating and score adjustment. Journal of Educational and Behavioral Statistics, 19, 171–200.
Margolis, M. J., Clauser, B. E., Swanson, D. B., & Boulet, J. R. (2003). Analysis of the relationship between score components on a standardized patient clinical skills examination. Academic Medicine, 78, S68–S71.
McKinley, D., & Boulet, J. R. (2004). Detecting score drift in a high-stakes performance-based assessment. Advances in Health Sciences Education, 9, 29–38.
Raymond, M. R., & Viswesvaran, C. (1993). Least squares models to correct for rater effects in performance assessment. Journal of Educational Measurement, 30, 253–268.
Stanley, J. C. (1962). Analysis-of-variance principles applied to the grading of essay tests. Journal of Experimental Education, 30, 279–283.
van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. New York: Springer-Verlag.
Wolfe, E. W., Moulder, B. C., & Myford, C. M. (2001). Detecting differential rater functioning over time (DRIFT) using a Rasch multi-faceted rating scale model. Journal of Applied Measurement, 2, 256–280.

Authors

POLINA HARIK is a Senior Measurement Scientist, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104; [email protected]. Her primary research interests include applied statistical methods.
BRIAN E. CLAUSER is Associate Vice President, Measurement Consulting Services, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104; [email protected]. His primary research interests include psychometric methods.
IRINA GRABOVSKY is a Senior Psychometrician, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104; [email protected]. Her primary research interests include applied statistical methods and psychometrics.
RONALD J. NUNGESTER is Senior Vice President, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104; [email protected]. His primary research interests include psychometric methods.
DAVE SWANSON is Vice President, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104; [email protected]. His primary research interests include achievement testing in medical education and computer-based testing.
RATNA NANDAKUMAR is Professor, School of Education, University of Delaware, Newark, DE 19716; [email protected]. Her primary research interests include test modeling and differential item functioning.
