Educational Assessment, 13:108–131, 2008 Copyright © Taylor & Francis Group, LLC ISSN: 1062-7197 print/1532-6977 online DOI: 10.1080/10627190802394255

Methods for Evaluating the Validity of Test Scores for English Language Learners

Stephen G. Sireci, Kyung T. Han, and Craig S. Wells
University of Massachusetts, Amherst

In the United States, when English language learners (ELLs) are tested, they are usually tested in English and their limited English proficiency is a potential cause of construct-irrelevant variance. When such irrelevancies affect test scores, inaccurate interpretations of ELLs’ knowledge, skills, and abilities may occur. In this article, we review validity issues relevant to the educational assessment of ELLs and discuss methods that can be used to evaluate the degree to which interpretations of their test scores are valid. Our discussion is organized using the five sources of validity evidence promulgated by the Standards for Educational and Psychological Testing. Technical details for some validation methods are provided. When evaluating the validity of a test for ELLs, the evaluation methods should be selected so that the evidence gathered specifically addresses appropriate test use. Such evaluations should be comprehensive and based on multiple sources of validity evidence.

All tests measure language proficiency to some degree, whether it is part of the targeted construct or not. Proficiency in the language of an assessment can interfere with accurate measurement of students' knowledge, skills, and abilities, whenever a student is tested in a secondary language. The most common example
in the United States is when English Language Learners (ELLs)1 are tested in English. In such situations, the validity of inferences derived from these students' test scores is suspect, and their scores may not be comparable to scores from students who are fully proficient in English. This threat to valid test score interpretations has been brought to the attention of the measurement community for several decades (e.g., Geisinger, 1992; Likert, 1932; Olmedo, 1981) and was explicitly addressed in the last two versions of the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999; APA, AERA, & NCME, 1985). For example, the most recent version of the Standards stated,

any test that employs language is, in part, a measure of their language skills.... This is of particular concern for test takers whose first language is not the language of the test.... In such instances, test results may not reflect accurately the qualities and competencies intended to be measured. (AERA et al., 1999, p. 91)

This statement refers to the situation where proficiency in a particular language is not the proficiency (construct) being targeted by the assessment. In such cases, for example, when ELLs are being tested in mathematics or science, language proficiency impedes accurate measurement of the targeted construct. As Messick (1989) and Haladyna and Downing (2004) described, language proficiency is a common cause of construct-irrelevant variance in test scores. For this reason, the Standards also advise that "a testing practice should be designed to reduce threats to the reliability and validity of test score inferences that may arise from language differences" (AERA et al., 1999, p. 97). To facilitate valid interpretation of scores obtained by linguistic minorities, the Standards provide advice for both test construction and test administration. With respect to test construction, they advise the following:

In testing applications where the level of linguistic or reading ability is not part of the construct of interest, the linguistic or reading demands of the test should be kept to the minimum necessary for the valid assessment of the intended construct. (AERA et al., 1999, p. 82)

1 We use the term English language learner, or ELL, in this article to refer to students within the United States whose native language is not English but who have an educational goal to increase their English proficiency. Such students are sometimes referred to as having limited English proficiency; however, that term is problematic because complete proficiency in English can never be unequivocally demonstrated for anyone. In any case, the issues discussed in this article apply in full force whenever someone is tested in a language in which they are not sufficiently proficient to understand and respond to the test items.


With respect to test administration, the Standards acknowledge that accommodations to the standardized conditions may be necessary. These accommodations include test translation/adaptation or modifying aspects of the test or test administration procedure such as

the presentation format, response format, the time allowed to complete the test, the test setting ... and the use of only those portions of the test that are appropriate for the level of language proficiency of the test taker. (AERA et al., 1999, p. 92)

In a review of the literature on test accommodations for ELLs, Sireci, Li, and Scarpati (2003) found that all such modifications have been used as attempts to reduce construct-irrelevant variance associated with English proficiency. However, research is needed to demonstrate that such modifications facilitate, rather than hinder, valid score interpretation.

To summarize the current situation in testing ELLs, it is known that limited English proficiency is likely to impede the test performance of these students, and for this reason various accommodations have been proposed to remove or reduce any linguistic barriers. However, the utility and fairness of these accommodations have not been widely studied. Although accommodations are made available to provide greater access to tests and to promote valid score interpretations, they may actually inhibit valid score interpretation if they change the construct measured or if they provide an unfair advantage to the students who receive them. They may also nullify comparisons across students who take the tests with accommodations and those who do not. For these reasons, a review of the research and practice in promoting valid assessments of ELLs is warranted.

The purposes of this article are to (a) discuss validity issues involved in measuring the knowledge and skills of ELLs and (b) describe statistical and qualitative procedures that can be used to facilitate or evaluate the validity of inferences derived from the test scores of such students. In the first section of this article, we elaborate on the validity issues introduced earlier and describe some test accommodations for ELLs that are designed to increase their access to educational tests and promote more valid interpretations of their knowledge and skills. Drawing from these validity issues and the sources of validity evidence stipulated in the Standards, we describe methods that can be used to evaluate the validity of these test accommodations. These methods include sensitivity review, differential item functioning analysis, exploratory factor analysis, confirmatory factor analysis, multidimensional scaling analysis, experimental studies (testing with and without accommodations), and analysis of population invariance of equating functions. In the concluding section, we provide suggestions for future research in evaluating test accommodations for ELLs.


VALIDITY ISSUES IN ASSESSING STUDENTS WITH LIMITED ENGLISH PROFICIENCY

Thus far, we have discussed English proficiency as a potential barrier to valid measurement of ELLs' knowledge and skills. Before extending our discussion of issues relevant to valid interpretation of ELLs' test scores, we must first formally define validity. To do that, we turn to the Standards as the authoritative source.2 According to the Standards, validity refers to "the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (p. 9). From this definition, we see that it is not a test per se that is validated but rather the decisions or interpretations made on the basis of test scores. Therefore, in considering validity issues involved in the assessment of ELLs, we must consider the purpose of the assessment, because issues of validity and fairness will change depending on the testing purpose and the actions made on the basis of test scores.

Test Use, Interpretations, and Consequences

What are the specific purposes for which ELLs are tested? In some cases, they are assessed using tests designed to provide norm-referenced information. These tests are typically selected by the school district for the purpose of gauging students' academic achievement relative to local and national peers. In such norm-referenced situations, the degree to which the norm group is appropriate for these students is a critical validity issue (Geisinger, 2004). In other cases, ELLs are tested as part of statewide assessments for accountability purposes or for grade promotion or high school graduation. In this situation, the degree to which ELLs are accurately placed into proficiency classifications is a critical validity issue. A third case in which ELLs are tested involves admissions or employment testing, where tests are used to select a small number of applicants for a limited number of openings. In these cases, the comparability of score interpretations across ELL and non-ELL examinees is an important validity issue.

We isolate these three separate situations because the types of decisions or inferences that are made based on test scores are so varied that different types of evidence might be needed to support each type of inference. Thus, there are some critical validity questions in testing ELLs: How can we evaluate the utility of specific tests for measuring ELLs? How can we evaluate the validity or fairness of test accommodations that may be provided in these situations? Once again, we can look to the Standards for guidance, as well as to other recent guidelines developed in this area (e.g., Rabinowitz & Sato, 2006).

2 The Standards are authoritative in the sense that they represent consensus standards from the three major organizations involved in appropriate test use and that the courts have used them widely whenever tests are challenged (Sireci & Parker, 2006). However, it should be noted that they represent the cumulative thinking of several prominent validity theorists over the years (e.g., Cronbach, 1971; Cronbach & Meehl, 1955; Kane, 1992; Messick, 1989; Shepard, 1993).


Sources of Validity Evidence

The Standards provide a framework for evaluating appropriate test use and interpretation that involves five "sources of validity evidence" (AERA et al., 1999, p. 11). The Standards describe them as "sources of evidence that might be used in evaluating a proposed interpretation of test scores for particular purposes" (p. 11). We find this framework useful, and so the methods subsequently described are organized using these five sources. The sources are evidence based on (a) test content, (b) response processes, (c) internal structure, (d) relations to other variables, and (e) consequences of testing. In the next section, we briefly describe each of these sources as well as their implications for evaluating the validity of inferences derived from ELLs' test scores.

Evidence based on test content. Evidence based on test content refers to information obtained from traditional content validity studies that investigate the congruence of test items to the test specifications and how well they represent the construct measured (Sireci, 1998). It also includes practice analyses used in licensure and certification testing, and alignment studies used in K–12 accountability testing. When tests have high content validity, the content is relevant to and representative of the construct measured and should not contain any material that would be unfair to ELLs.

Validity evidence to support the claim that the test content is appropriate for ELLs could come from two sources. The first is a qualitative method called sensitivity review. In such a review, experts familiar with the language and culture of specific ELL groups represented in the population of students tested are asked to review tests for material that may (a) be construed as offensive to ELLs; (b) portray ELLs unfavorably or in a stereotypical fashion; (c) provide an unfair advantage or disadvantage to ELLs; or (d) be unfamiliar to ELLs (Ramsey, 1993; Sireci, 2004; Sireci & Mullane, 1994). The goal of a sensitivity review is to flag items, passages, and other test material that should be brought forward to content committees to judge their appropriateness from a content perspective. Any material that is deemed problematic by sensitivity reviewers but cannot be justified from a content perspective should be removed from the test.

A second source of validity evidence based on test content is statistical and involves screening items that may be biased. This method is called analysis of differential item functioning (DIF). The goal of a DIF analysis for ELLs is to identify items that function differentially for ELLs


in the sense that ELLs are more or less likely to do well on the item when compared to non-ELL students with very similar performance on the test (or some similar matching criterion). DIF analysis alone does not provide content-related evidence for validity. However, when items are flagged for DIF, they also are sent to content review committees for adjudication. If DIF for or against ELLs is judged to be due to construct-irrelevant problems within an item, the item is likely to be removed. Using simulated data, Lewis and Wells (2007) illustrated that the effects of including items that function differentially against ELLs could be disastrous. Specifically, significantly fewer ELL students achieved passing scores when DIF items were included on the test.

When testing programs contain sensitivity reviews and DIF procedures designed to identify and screen items that may be problematic for ELLs, they reduce the likelihood that ELLs' test performance will be affected by construct-irrelevant variance. The results of these studies, including documentation that indicates how problematic items were dealt with, provide important evidence that the test is appropriate for ELLs (Rabinowitz & Sato, 2006). Of course, such evidence is insufficient for concluding that interpretations of ELLs' test performance are valid. Thus, these analyses provide useful, but limited, evidence for supporting use of a test for ELLs.

Evidence based on internal structure. Validity evidence based on internal test structure includes information regarding item homogeneity, such as estimates of internal consistency reliability, as well as dimensionality analyses that reflect the major dimensions needed to represent students' responses to the items and tasks making up an assessment. These techniques can be used to evaluate whether the intended dimensionality for an assessment is supported by empirical analysis of the data. With respect to assessment of ELLs, studies in this area often involve an evaluation of whether the dimensionality of an assessment is consistent across ELL and non-ELL students. As we report later, much of the work comparing test accommodations with a standard assessment has focused on evidence based on internal structure. These studies have primarily used confirmatory factor analysis to compare the structure of a test across ELL and non-ELL samples. In some situations, the factor structure of the data from standard and accommodated test administrations has been compared.

A somewhat more recent methodological procedure for looking at the consistency of internal structure across groups is known as population invariance analysis or score equity assessment (Dorans, 2002; Liu, Cahn, & Dorans, 2006). Analysis of population invariance is unique to the situation where parallel forms of a test are equated onto a common scale, such as when different test forms are administered in consecutive years. Dorans and Holland (2000), Dorans (2002), and Kolen and Brennan (2004) proposed the use of "population invariance" as a criterion for evaluating the results of equating. Using this criterion, tests are


considered equitable to the extent that the same equating function is obtained across important subgroups of the examinee population. To evaluate population invariance, separate equatings are done using the data for the entire population (i.e., the typical equating procedure) and using only the data for the subgroup of interest. Invariance can be assessed by looking at differences in achievement level classifications (Wells et al., 2007) across the two equatings, differences in test form concordance tables (Wells et al., 2007), or by computing the root mean square difference (or the root expected mean square difference) between the two equating functions (Dorans, 2004). When item response theory is used, differences between the separate test characteristic curves computed from ELL and non-ELL populations could also be used to evaluate invariance.

Evidence based on relations with other variables. Evidence for validity based on relations with other variables refers to what was traditionally known as criterion-related validity. This source of validity evidence includes correlations of test scores with relevant criteria, such as other measures of the targeted construct (convergent validity) as well as variables hypothesized to be unrelated to the construct (discriminant validity). In addition to simple correlations among test scores and other variables, regression analysis is often used in this context, particularly when test scores are used for prediction, such as is the case in college or postgraduate admissions testing. With respect to ELLs, validity evidence in this category most often involves analysis of differential predictive validity, which refers to the consistency of the predictive utility of test scores over subgroups of examinees. Evidence in support of use of a test with ELLs would be in the form of consistent regression coefficients relating test scores to external criteria across ELL and non-ELL subgroups of examinees (Wainer & Sireci, 2005). Inspection of the pattern of errors of prediction across these groups may also be informative regarding the efficiency of prediction of test scores across ELL and non-ELL groups. In addition, more sophisticated regression models, such as those based on structural equation modeling, can be used to ascertain whether test scores exhibit consistent relationships with several variables across ELL and non-ELL groups.

Experimental studies of the effects of test accommodations on the test performance of ELLs can also be used to provide validity evidence with respect to interpreting test scores from ELLs. In many cases, ELL and non-ELL students are tested with and without accommodations, using either random assignment across conditions or a counterbalanced repeated measures design (Sireci et al., 2003). Obviously, students cannot be randomly assigned to ELL status, and so these studies are really quasi-experimental and use covariates to account for preexisting differences across ELL and non-ELL groups whenever possible. The validity hypothesis tested in these studies is whether there is an interaction


between student group and accommodation status. That is, if the accommodation simply removes an English proficiency barrier, it should not result in higher scores for non-ELL students, but it should result in higher scores for ELLs (Sireci, Scarpati, & Li, 2005). Experimental and quasi-experimental studies of test score differences are classified under the "relations with other variables" source of validity evidence because they relate test scores to inherent differences in examinees or testing conditions. It is possible that subsequent revisions to the Standards may describe such studies using a unique label for this source of validity evidence.

Evidence based on response processes. Validity evidence based on response processes refers to analysis of the cognitive processes and strategies examinees use in answering test items, and the degree to which those processes and strategies are congruent with the definition of the construct being measured. If the construct definition involves specific skills, and test items are designed to measure those skills, evidence that confirms students use the hypothesized processes in successfully answering test items supports the interpretations intended to be made from test scores. Specific methods for gathering such evidence include think-aloud protocols (e.g., Leighton, 2004; Wendt, Kenny, & Marks, 2007), analysis of examinees' draft responses (e.g., scratch paper, draft essays, etc.), and chronometric analysis (analysis of item response time). There are very few examples of applied studies in this area, particularly with respect to ELLs. This paucity of research is unfortunate because confirming that ELLs and non-ELLs use the same processes and strategies in responding to test items would be important evidence in support of the validity of interpretations of ELLs' test scores.

Evidence based on consequences of testing. Validity evidence based on the consequences of testing involves considering the positive and negative consequences associated with a testing program—both intended and unintended. Obviously, tests are used to promote positive consequences such as protection of the public (licensure testing), improved instruction (educational accountability testing), and better understanding of students' strengths and weaknesses (diagnostic testing). Other positive consequences may appear, even though they were not explicitly intended or envisioned (e.g., professional development for teachers; Cizek, 2001). However, unintended negative consequences may also occur. Such consequences may include increased school dropout, decreased education and employment opportunities for certain groups, and decreased morale within schools and other institutions.

The degree to which testing consequences should be considered in evaluating the validity of inferences derived from test scores is a subject of some controversy. Some see considerations of testing consequences as an important social policy issue, but one that is extraneous to validity (Popham, 1997). Others see such


considerations as central for evaluating the appropriateness of use of a test (Shepard, 1997). Messick (1989) stressed the importance of considering the value implications associated with test score interpretations (e.g., classifying examinees as passing or failing) as well as societal consequences associated with test use. As he described,

The central question is whether the proposed testing should serve as means to the intended end, in light of the other ends it might inadvertently serve and in consideration of the place of the intended end in the pluralistic framework of social choices.... It is not that the end should not justify the means.... Rather, it is that the intended ends do not provide sufficient justification, especially if adverse consequences are linked to sources of test invalidity. (p. 85)

With respect to testing consequences and validity, an important evaluation should be the degree to which the positive outcomes of the test outweigh any negative consequences. Shepard (1993) may have had considerations of consequences in mind when she stated,

In the early years of testing, validity addressed the question "Does the test measure what it purports to measure?"... An appropriate metaphor might have been truth in labeling (i.e., does the test have the ingredients or meaning it says?). Today, a more appropriate question to guide validity investigations is "Does the test do what it claims to do?" A more contemporary analogy is the Federal Drug Administration's standards for testing a new drug: "Do the empirically demonstrated effects weighed against the side effects warrant use of the test (drug)?" (pp. 443–444)

With respect to testing ELLs, many consequences should be considered in evaluating the validity of test use and score interpretation. In some situations, disparate impact (relatively higher proportions of ELLs scoring below a standard) is a major issue. Differential referral rates for remediation are another potential concern. When tests are used to evaluate instructional programs, such as native language instruction, test scores may serve as one impetus for closing some programs or eliminating such instruction altogether. Whenever such negative consequences occur, the technical quality of the test must be demonstrated, as should the benefits associated with the testing program. Examples of evidence based on testing consequences that could be used to support the use of a test with ELLs include analysis of educational gains associated with testing programs, the degree to which the tests have positively influenced instruction and provided professional development for teachers, the effects of the test on retention/dropout, and the degree to which the tests may have increased parental involvement in the education of ELLs. Although these are just some examples of the many influences that could be studied, little


research has been done on the testing consequences for ELLs, outside of issues of disparate impact and gaps between the test results for ELL and other students.

Summary of sources of validity evidence. The Standards provide a useful framework for evaluating the utility of a test for a particular purpose and for evaluating fairness issues in testing ELLs with respect to testing purposes and consequences. Depending on the purpose of testing and the types of interpretations of test scores that are made, some sources of validity evidence may be more pertinent than others. As with all research, the more sources of evidence that can be used to answer a specific question, the better. However, as we report later, only a few sources of validity evidence have been used to support use of a test for ELLs.

TEST ACCOMMODATIONS FOR ENGLISH LANGUAGE LEARNERS

As mentioned earlier, accommodations to tests and to the conditions under which they are administered are sometimes made to provide better measurement of the knowledge and skills of ELLs. Sireci et al. (2003) reported the following accommodations used for ELLs: test translation/adaptation, linguistic modification (a.k.a. simplified English), providing bilingual or customized dictionaries, adding glosses to the margins of test booklets to define specific words, and extended testing time.

When accommodations are used for ELLs, specific validity questions that are raised include the following: (a) Has the accommodation changed the construct measured? (b) Are scores from accommodated tests comparable to scores from the standard version? (c) Do the scores from accommodated tests provide more accurate measurement of ELLs' knowledge and skills relative to scores from the standard test? and (d) Does the accommodation provide an unfair advantage to ELLs? The relative importance of these questions and their priority in a validity investigation depends, of course, on the purpose of the assessment and the types of test score interpretations that are made. When ELL and non-ELL examinees are competing for a job, or for a limited number of openings in an academic program, whether the accommodation provides an unfair advantage may be a highly prioritized validity question. If comparisons are not made across ELL and non-ELL students, the degree to which the accommodation alters the construct may be prioritized (although it is hard to imagine a situation in which that question would not be important to address).

Research on the effects of test accommodations for ELLs has generally focused on three areas. First, factor analysis has been used to investigate whether the same factor structure underlies the data from standard and accommodated


tests. Second, DIF analyses have been conducted to see if either (a) items function differentially across ELL and non-ELL groups, or (b) items function differentially across their standard and accommodated formats. The third area of research has used experimental and quasi-experimental designs to evaluate score differences across ELL and non-ELL student groups under both standard and accommodated conditions. Thus, research in this area has focused on validity evidence based on internal structure and relationships with other variables, and not on test content, response processes, or testing consequences.

In the next section of this article, we describe the methods that have been used to evaluate the validity of test interpretations made from ELLs' test scores as well as methods used to evaluate test accommodations for ELLs. All of the methods described were introduced earlier, but without the technical details needed to understand the analyses. In Table 1, we summarize the issues and methods presented thus far by listing methods used to investigate validity issues in testing ELLs, organized by the five sources of validity evidence promulgated by the Standards. The validity question addressed by each method is also listed, and sample references to applied studies are provided.

The information presented in Table 1 illustrates that most of the empirical research on validity issues with respect to ELLs has focused on test content, internal structure, and relations to other variables as sources of validity evidence. It is difficult to find any published literature that analyzes response processes or testing consequences as a source of validity evidence in testing ELLs. It is also disappointing that we could not find any studies using population invariance analysis to assess the invariance of equating functions across ELL and non-ELL groups. A closer look at the studies cited indicates there are essentially three types from a methodological perspective: (a) studies of DIF, including DIF between ELL and non-ELL groups taking a standard version of a test in English (e.g., Schmitt, 1988) and DIF between original and translated versions of test items (e.g., Sireci & Khaliq, 2002); (b) studies of factorial invariance (e.g., Abedi, Lord, Hofstetter, & Baker, 2000); and (c) experimental and quasi-experimental analyses of mean score differences across standard and accommodated tests administered to ELL and non-ELL groups. Abedi and his colleagues have primarily conducted this latter research (e.g., Abedi, 2001; Abedi, Hofstetter, Baker, & Lord, 2001; Abedi, Lord, Boscardin, & Miyoshi, 2001).
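The experimental and quasi-experimental comparisons just described typically test for a group-by-condition interaction. The following is a minimal sketch, not drawn from the article, of how such an analysis might be set up on simulated data; the cell means, sample sizes, column names, and use of statsmodels are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 200  # examinees per cell (hypothetical)

def simulate_cell(group, condition, mean):
    """Simulate one group-by-condition cell of test scores."""
    return pd.DataFrame({
        "score": rng.normal(mean, 10, n),
        "group": group,
        "condition": condition,
    })

# Hypothetical pattern: the accommodation raises ELL scores but leaves non-ELL scores unchanged.
data = pd.concat([
    simulate_cell("non-ELL", "standard", 60),
    simulate_cell("non-ELL", "accommodated", 60),
    simulate_cell("ELL", "standard", 48),
    simulate_cell("ELL", "accommodated", 55),
], ignore_index=True)

# The group x condition interaction term carries the validity hypothesis of interest.
model = smf.ols("score ~ C(group) * C(condition)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```

A significant interaction in the expected direction, with a benefit for ELLs only, is the pattern described above as supporting the claim that the accommodation removes an English proficiency barrier without conferring an unfair advantage.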

METHODS FOR EVALUATING TEST ACCOMMODATIONS FOR ELL STUDENTS

In the preceding sections, we reviewed important issues related to the validity of assessing ELLs. Although we briefly described the methods used for gathering validity information, we did not describe any of the technical details for the statistical methods.

TABLE 1
Summary of Validity Issues in Testing ELLs and Methods to Study Them

Test content
- Do items function differentially across ELL and non-ELL groups? Evaluation method: differential item functioning. Applications to ELLs: Schmitt (1988).
- Does linguistic simplification remove construct-irrelevant variance in ELLs' test performance? Evaluation method: differential item functioning. Applications to ELLs: Abedi & Lord (2001).
- Do original and translated items function differentially? Evaluation method: differential item functioning. Applications to ELLs: Sireci & Khaliq (2002).

Internal structure
- Is the factor structure equivalent across standard and accommodated test administrations? Evaluation methods: confirmatory FA, MDS. Applications to ELLs: Sireci & Khaliq (2002).
- Is the factor structure consistent across ELL and non-ELL subgroups of students? Evaluation method: confirmatory FA. Applications to ELLs: Sireci & Khaliq (2002); Abedi, Lord, Hofstetter, & Baker (2000).
- Are equating functions invariant across ELL and non-ELL subgroups? Evaluation method: population invariance. Applications to ELLs: (none cited).

Relations with other variables
- Is the predictive utility of an assessment consistent across ELL and non-ELL groups? Evaluation method: differential predictive validity. Applications to ELLs: Zwick & Schlemer (2004).
- Do test accommodations lead to improved scores for ELLs relative to non-ELLs? Evaluation method: experimental analysis of mean differences. Applications to ELLs: Abedi (2001); Abedi, Hofstetter, et al. (2001); Abedi, Lord, et al. (2001); Albus et al. (2001); Anderson et al. (2000); Castellon-Wellington (1999); Garcia et al. (2000).

Response processes
- Do ELL and non-ELL students use the same processes in responding to test items? Evaluation method: think-aloud protocols. Applications to ELLs: (none cited).
- Do ELL and non-ELL students differ with respect to the time needed to answer test items? Evaluation method: survival analysis. Applications to ELLs: Schnipke & Pashley (1997).

Testing consequences
- Do teachers of ELL and non-ELL students have the same opinions of standardized tests? Evaluation method: surveys, interviews, focus groups. Applications to ELLs: (none cited).
- Have state-mandated assessments differentially affected instruction for ELLs and non-ELLs? Evaluation method: surveys, interviews, focus groups. Applications to ELLs: (none cited).

Note. ELL = English language learner; FA = factor analysis; MDS = multidimensional scaling.


In this section, we briefly describe some of the statistical models used to evaluate validity issues in testing ELLs.

Analysis of Test Content

When a test is deemed appropriate for ELLs, the validity evidence based on test content that is put forward is typically in the form of DIF analysis followed by content review of the items. DIF analyses evaluate whether examinees with similar knowledge and skill from different groups (e.g., ELL and non-ELL) have similar probabilities of success on an item. In DIF analyses, conditioning procedures are used to systematically match examinees of similar proficiency across groups to distinguish between overall group differences on an item (item impact) and potential item bias. That is, DIF procedures focus on differences in item performance across groups after the groups are matched on the construct measured by the test. In most cases, the matching is done using total test score, as it is assumed to be unbiased. Items are considered to be functioning differentially across groups if the probability of a particular response differs significantly across test takers who are equivalent (i.e., matched) on proficiency (see Holland & Wainer, 1993, for other descriptions of DIF theory and methodology).

There are many statistical methods for evaluating DIF, and although we discuss only one method here, most people with a working knowledge of statistics can take comfort in the fact that all DIF procedures follow the logic of analysis of covariance. That is, DIF analyses look at differences between groups after controlling for general proficiency. However, given that many test items are scored dichotomously, analysis of covariance has not seen much direct application to the problem of detecting DIF.

The method we describe to illustrate DIF analysis is the standardization method, which was proposed by Dorans and Kulick (1986). This method is essentially a "conditional p value" procedure, in which separate proportion-correct item difficulty statistics (p values) are computed for each group of examinees on each item, conditional on total test score. That is, for examinees with a given test score, the proportion of examinees who answered the item correctly is computed for each group. This process is repeated for all other levels of test score. In practice, test score intervals are typically used to match examinees so that the sample sizes per test score interval are not too small. The standardization index describes the average difference between the conditional p values for the two groups. This index (STD-P) is computed as

\text{STD-P} = \frac{\sum_{m} w_{m}\,(E_{fm} - E_{rm})}{\sum_{m} w_{m}} \qquad (1)


where w_{m} is the relative frequency of the reference group at score level m, and E_{fm} and E_{rm} are the proportions of examinees at score level m who answered the item correctly in the focal and reference groups, respectively. The reference group could represent examinees who responded to the original version of an item, and the focal group could represent examinees who responded to the adapted version of an item, or they could simply represent non-ELL and ELL groups, respectively. The standardization index ranges from −1 to 1. Although there is no statistical test associated with the statistic, an effect size can be computed. For example, a STD-P of −.10 indicates that, on average, examinees in the reference group who are matched to examinees in the focal group have a 10% greater chance of answering the item correctly. Using this value as a flagging criterion, if 10 items on a test were flagged for DIF, and they were all in favor of one of the two groups, the aggregate level of DIF on the test would be about 1 point on the total raw test score scale in favor of the relevant group.
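As a concrete illustration of Equation 1, the following sketch computes STD-P from item scores, total scores, and group membership. It is not from the article; the simulated data, the function name std_p, and the screen for sparse score levels are illustrative assumptions.

```python
import numpy as np

def std_p(item_correct, total_score, is_focal, min_per_group=1):
    """Standardization index (Equation 1): the average focal-minus-reference
    difference in conditional p values, weighted by the reference-group
    frequency at each matching score level (as described in the text)."""
    item_correct = np.asarray(item_correct, dtype=float)  # 0/1 item scores
    total_score = np.asarray(total_score)                 # matching criterion (e.g., total test score)
    is_focal = np.asarray(is_focal, dtype=bool)

    numerator, denominator = 0.0, 0.0
    for m in np.unique(total_score):
        at_m = total_score == m
        focal = at_m & is_focal
        reference = at_m & ~is_focal
        if focal.sum() < min_per_group or reference.sum() < min_per_group:
            continue  # skip sparse score levels (score intervals would be used in practice)
        w_m = reference.sum()  # raw counts suffice because the index divides by the sum of weights
        numerator += w_m * (item_correct[focal].mean() - item_correct[reference].mean())
        denominator += w_m
    return numerator / denominator

# Toy illustration with simulated examinees; the data-generating choices are arbitrary.
rng = np.random.default_rng(0)
n = 2000
is_focal = rng.random(n) < 0.3
ability = rng.normal(loc=-0.3 * is_focal, scale=1.0)
total = np.clip(np.round(20 + 5 * ability + rng.normal(0, 2, n)), 0, 40)
p_correct = 1 / (1 + np.exp(-(ability - 0.3 * is_focal)))  # item mildly disfavors the focal group
item = (rng.random(n) < p_correct).astype(int)

print(round(std_p(item, total, is_focal), 3))  # should be negative here: the item favors the reference group
```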

Analysis of Internal Structure

Studies of the invariance of the factor structure across ELL and non-ELL groups, and comparisons of the factor structures across standard and accommodated versions of a test, address the validity issue of whether the construct measured by an assessment is consistent across these student groups and test forms. Similarity of factor structure is an expected aspect of construct equivalence, but it is certainly not a sufficient condition for ensuring equivalence. Given that the data for such analyses are readily available in large-scale assessments involving ELLs, it is not surprising that many applications of these analyses can be found in the literature.

Factor Analysis

Factor analysis has long been used as a method for understanding what general characteristics are measured by educational assessments (Cattell, 1944). In the case of test evaluation, factor analysis is used to derive latent summary variables from a set of test items that best represent the major dimensions characterizing students' performance. For example, three factors—perhaps number relations, algebra, and geometry—might characterize a 50-item math test. In some cases, factor analyses are conducted to create factor-specific scores for students. However, in the case of test validation, factor analysis is applied to understand the dimensionality (or factor structure) of students' responses to test items and to compare this structure to that hypothesized by the test developer. In some cases, the hypothesized factor structure is unidimensional, such as when only a total score is reported on an assessment. In other cases,


a multidimensional factor structure is hypothesized to represent a multidimensional construct. When the factor analysis solution replicates the intended factor structure, validity evidence based on this "confirmed" (internal) structure is obtained. In evaluating and comparing factor structures across different groups of examinees (e.g., ELL and non-ELLs) or across different forms of a test (e.g., a standard version presented in English and a version translated into Spanish), confirmatory factor analysis is typically conducted. The measurement model in a confirmatory factor analysis is

x = \Lambda_{x}\,\xi + \delta \qquad (2)

where x represents the vector of measured variables (e.g., test items), \Lambda_{x} is a matrix of factor loadings (i.e., loadings of items on the latent variables \xi), and \delta is a vector of the residuals associated with the measured variables. Constraints on the factor-loading matrix separate confirmatory from exploratory factor analysis. When confirmatory factor analysis is used to compare factor structures across two or more groups, the factor loadings, correlations among factors, and item residuals can be constrained to be equal across groups, and statistical tests can be performed to determine whether models that relax these constraints provide a statistically significant improvement in the fit of the model to the data. Descriptive indices of model-data fit are also used to evaluate factorial invariance (Sireci, Bastari, & Allalouf, 1998).
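To make the measurement model concrete, the sketch below builds the model-implied covariance matrix Sigma = Lambda Phi Lambda' + Theta and evaluates a maximum-likelihood discrepancy between it and each group's sample covariance matrix. This is not a full multiple-group CFA estimator and is not taken from the article; the loadings, sample sizes, and the deliberately weakened loading in the second group are hypothetical.

```python
import numpy as np

def implied_cov(loadings, factor_cov, resid_var):
    """Model-implied covariance under x = Lambda*xi + delta: Sigma = Lambda Phi Lambda' + Theta."""
    L = np.asarray(loadings)
    return L @ factor_cov @ L.T + np.diag(resid_var)

def ml_discrepancy(sample_cov, model_cov):
    """ML fit function: F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p."""
    p = sample_cov.shape[0]
    return (np.log(np.linalg.det(model_cov))
            + np.trace(sample_cov @ np.linalg.inv(model_cov))
            - np.log(np.linalg.det(sample_cov)) - p)

# Hypothetical one-factor model for six items, with loadings constrained equal across groups.
lam = np.array([[0.8], [0.7], [0.6], [0.75], [0.65], [0.7]])
phi = np.array([[1.0]])
theta = 1.0 - (lam ** 2).ravel()
sigma_constrained = implied_cov(lam, phi, theta)

rng = np.random.default_rng(1)

def sample_cov(loadings, n):
    """Simulate item scores from the measurement model and return their covariance matrix."""
    xi = rng.normal(size=(n, 1))
    resid_sd = np.sqrt(1.0 - (loadings ** 2).ravel())
    delta = rng.normal(scale=resid_sd, size=(n, loadings.shape[0]))
    return np.cov(xi @ loadings.T + delta, rowvar=False)

lam_group2 = lam.copy()
lam_group2[3, 0] = 0.3  # one item loads more weakly in the second group (simulated noninvariance)

# A larger discrepancy for group 2 signals that the equal-loadings model fits that group worse.
print(round(ml_discrepancy(sample_cov(lam, 2000), sigma_constrained), 3))
print(round(ml_discrepancy(sample_cov(lam_group2, 2000), sigma_constrained), 3))
```

In practice, structural equation modeling software (e.g., LISREL, Mplus, or the R lavaan package) fits the constrained and unconstrained multiple-group models directly and provides the significance tests and descriptive fit indices referred to above.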

Multidimensional Scaling

Another way to compare the factor structure of a test across ELL and non-ELL groups or across different language versions of a test is via multidimensional scaling (MDS). Like factor analysis, MDS is designed to discover the most salient dimensions underlying a set of items. The end result of an MDS analysis is a set of item coordinates (similar to factor loadings) on dimensions that best account for the ways in which the examinees responded to the items. MDS models use distance formulae to arrange items within a multidimensional "space" so that items that are responded to similarly by examinees are close together in this space and are distant from items to which students responded differently. With respect to the multiple-group case, weights for each group are incorporated into the distance model. This procedure, called weighted MDS (WMDS), derives both a common structure that best represents the data for all matrices (subgroups) simultaneously, and a set of unique weights for each matrix that can be used to adjust this common structure to best fit the data for each specific matrix. The most common WMDS model (Carroll & Chang, 1970) specifies a weighted


Euclidean distance formula:

d_{ijk} = \sqrt{\sum_{a=1}^{r} w_{ka}\,(x_{ia} - x_{ja})^{2}} \qquad (3)

where d_{ijk} is the Euclidean distance between items i and j for group k, w_{ka} is the weight for group k on dimension a, x_{ia} is the coordinate of item i on dimension a, and r is the dimensionality of the model. A common structural space, called the group stimulus space, is derived for the stimuli. The "personal" distances for each group are related to the common stimulus space by

x_{kia} = x_{ia}\,\sqrt{w_{ka}} \qquad (4)

where x_{kia} represents the coordinate for item i on dimension a in the personal space for group k, w_{ka} represents the weight of group k on dimension a, and x_{ia} represents the coordinate of stimulus i on dimension a in the common stimulus space.

Differences in dimensional structure across groups are reflected in the group weights (i.e., w_{ka}). The larger a weight on a dimension (a), the more that dimension is necessary for accounting for the variation in the data for the specific group (k). When structure is consistent across all groups, there is very little variation in the group weights. Using simulated data, Sireci et al. (1998) found that when structural differences exist across groups on one or more dimensions, one or more groups will have weights near zero, whereas other groups will have noticeably larger weights. They concluded nonequivalence of the structure of an assessment across groups should be obvious via inspection of the MDS weights.
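The following sketch, not taken from the article, applies Equations 3 and 4 to a hypothetical group stimulus space; the item coordinates and group weights are invented solely to show how a near-zero weight collapses a dimension for one group.

```python
import numpy as np

def weighted_distances(coords, weights):
    """Weighted Euclidean distances (Equation 3): d_ijk = sqrt(sum_a w_ka (x_ia - x_ja)^2)."""
    diff = coords[:, None, :] - coords[None, :, :]          # item-by-item coordinate differences
    return np.sqrt(np.einsum("ija,a->ij", diff ** 2, weights))

def personal_space(coords, weights):
    """Personal (group-specific) coordinates (Equation 4): x_kia = x_ia * sqrt(w_ka)."""
    return coords * np.sqrt(weights)

# Hypothetical group stimulus space for eight items in two dimensions.
common = np.array([[1.2, 0.1], [0.9, -0.2], [1.1, 0.3], [0.8, 0.0],
                   [-0.9, 1.1], [-1.0, 0.9], [-1.2, 1.2], [-0.8, 1.0]])

w_non_ell = np.array([1.0, 1.0])   # both dimensions needed for this group
w_ell = np.array([0.2, 1.6])       # Dimension 1 nearly collapses for this group

d_non_ell = weighted_distances(common, w_non_ell)
d_ell = weighted_distances(common, w_ell)
print(np.round(personal_space(common, w_ell), 2))
print(round(d_non_ell[0, 4], 2), round(d_ell[0, 4], 2))    # same item pair, different groups
```

Estimating the common coordinates and group weights from data requires an individual-differences (INDSCAL-type) WMDS program; the sketch only shows how fitted weights and coordinates translate into group-specific distances.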


Examples of WMDS analysis of ELL data are presented in Figures 1 and 2. These figures are based on the Sireci and Khaliq (2002) study and show the results from a WMDS analysis of two forms of a fourth-grade statewide mathematics assessment. One form was the standard version, and the other was a dual-language (English–Spanish) version administered to Spanish-speaking ELLs. In the dual-language form, the original English-language versions of the items were printed on the left-side pages of the test booklet and the Spanish translations of each item were printed on the right-side pages of the booklet. The four data points displayed in Figure 1 are the dimension weights for four groups of students. The three points clustered together represent random samples of students who took the standard English version of the test. The three points clustered together in Figure 2 are stratified random samples of students who took the standard English version of the test, but the random selection was stratified (matched) by total test score so that these students would have the same score distribution as the ELL group. In both figures, the ELL group has a noticeably larger weight on Dimension 2, and a noticeably smaller weight on Dimension 1, relative to the non-ELL groups.

FIGURE 1 Multidimensional scaling (MDS) dimension weights for ELL and non-ELL students.

FIGURE 2 Multidimensional scaling (MDS) dimension weights for ELL and non-ELL students matched on total test score.


This pattern suggests that the dimensions differentially account for the structure in each test form; the difference is much more salient in Figure 1. These results illustrate a general lack of factorial invariance across these two test forms, but it appears the difference is exacerbated by general differences in proficiency between the groups. More important for this discussion, however, is that the figures show how MDS can provide a visual display of the similarity of factor structure across accommodated versions of a test or across ELL and non-ELL groups.

Population Invariance

Although we have not yet found studies that investigated population invariance across ELL and non-ELL groups, we believe the methodology can be used to evaluate the appropriateness of assessments for ELLs. Dorans and Holland (2000) introduced two indices that can be used to gauge population invariance: the standardized root mean square difference (RMSD) and the standardized root expected mean square difference (REMSD). The standardized RMSD measures the subpopulation dependence at each X score level, as follows:

\mathrm{RMSD}(x) = \frac{\sqrt{\sum_{j} w_{j}\left[e_{P_{j}}(x) - e_{P}(x)\right]^{2}}}{\sigma_{YP}} \qquad (5)

where e_{P_{j}}(x) and e_{P}(x) represent the transformed scores of form X to the scale of form Y for the subpopulations and total population, respectively; w_{j} = N_{j}/N denotes the proportion of examinees from population P that are in subpopulation P_{j}; and \sigma_{YP} transforms the units of the measure to proportions of the standard deviation of Y scores in P, producing an effect-size-type measure. The standardized REMSD summarizes the values of RMSD(x) by averaging over the distribution of X in population P. REMSD is computed as follows:

\mathrm{REMSD} = \frac{\sqrt{\sum_{j} w_{j}\,E_{P}\!\left\{\left[e_{P_{j}}(x) - e_{P}(x)\right]^{2}\right\}}}{\sigma_{YP}} \qquad (6)

where E_{P}\{\cdot\} denotes averaging over the distribution of X in P. Essentially, REMSD is a doubly weighted average of the differences between the linking functions for the subpopulations and the total group. First, the squared differences between each subpopulation linking function at each score level x are averaged over the subpopulations, using the subgroup size as a weight. Second, the mean of the resulting weighted sums of squared differences is computed across score levels, using the relative number of examinees in the total population at each score level as a weight. Taking the square root and dividing by the standard deviation of test Y in the total group produces a measure in which the standard deviation of the composite score is unity.

When tests classify students into achievement level categories, an additional means for evaluating population invariance is to assess classification consistency across the general and group-specific equating procedures. If invariance holds, ELLs will generally receive the same classification whether the equating was done based on the entire population or based on analysis of only the ELL data. The extent to which the classifications vary will indicate the degree of invariance. The advantage of this method is that it directly addresses the consequences associated with violations of equating invariance.
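A minimal sketch of Equations 5 and 6 follows; it is not from the article. The equated-score vectors, subgroup sizes, score distribution, and standard deviation below are made-up inputs intended only to show the computations.

```python
import numpy as np

def rmsd_remsd(e_total, e_subgroups, subgroup_sizes, score_freq, sd_y_total):
    """Standardized RMSD(x) and REMSD for equating invariance (Equations 5 and 6).

    e_total        : length-S array, total-population equated scores at each X score level
    e_subgroups    : (J, S) array, subgroup equated scores at each X score level
    subgroup_sizes : length-J array, number of examinees per subgroup
    score_freq     : length-S array, relative frequency of each X score in the total population
    sd_y_total     : standard deviation of Y scores in the total population
    """
    e_total = np.asarray(e_total, dtype=float)
    e_sub = np.asarray(e_subgroups, dtype=float)
    score_freq = np.asarray(score_freq, dtype=float)
    w_j = np.asarray(subgroup_sizes, dtype=float)
    w_j = w_j / w_j.sum()                                  # w_j = N_j / N

    sq_diff = (e_sub - e_total) ** 2                       # squared linking differences, (J, S)
    rmsd_x = np.sqrt(np.average(sq_diff, axis=0, weights=w_j)) / sd_y_total
    remsd = np.sqrt((w_j[:, None] * sq_diff * score_freq[None, :]).sum()) / sd_y_total
    return rmsd_x, remsd

# Toy example: two subgroups whose equating functions diverge slightly at low scores.
scores = np.arange(0, 41)
e_total = scores + 1.0
e_sub = np.vstack([scores + 0.9, scores + np.where(scores < 15, 2.0, 1.1)])
sizes = np.array([9000, 1000])
freq = np.full(scores.size, 1 / scores.size)

rmsd_x, remsd = rmsd_remsd(e_total, e_sub, sizes, freq, sd_y_total=8.0)
print(round(remsd, 3), round(rmsd_x.max(), 3))
```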


Analysis of Relations to Other Variables

As mentioned earlier, in some situations, such as in admissions and employment testing, the critical validity issue is the degree to which a test accurately predicts future performance. Predictive validity is the degree to which test scores accurately predict scores on a criterion measure. A conspicuous example is the degree to which college admissions test scores predict college grade point average (GPA). Given this predictive context, it should not be surprising that regression models are used to evaluate predictive validity. Regression models can also be used to evaluate bias in the predictive power of a test. This analysis of test bias typically investigates whether the relationship between test and criterion scores is consistent across examinees from different groups. Such studies of test bias are often referred to as studies of differential predictive validity.

There are two common methods for evaluating differential predictive validity. The first involves fitting separate regression lines to the predictor (test score) and criterion data for each group and then testing for differences in regression coefficients and intercepts. When sufficient data for separate equations are not available, a different method must be used. This method involves fitting only one regression equation. In one variation of this method, the coefficients in the equation are estimated using only the data from the majority group. The analysis of test bias then focuses on predicting the criterion data for examinees in the minority group and examining the errors of prediction (residuals). In another variation of this method, the data from all examinees (i.e., majority and minority examinees) are used to compute the regression equation. Then, the residuals are compared across groups.
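Before turning to the regression model itself, here is a minimal sketch, not from the article, of the two approaches just described; the simulated test scores, GPA criterion, group sizes, and true coefficients are assumptions for illustration only.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares for y = b*x + a; returns slope, intercept, and residuals."""
    b, a = np.polyfit(x, y, 1)
    return b, a, y - (b * x + a)

# Simulated admissions-style data (hypothetical): a test score predicting first-year GPA.
rng = np.random.default_rng(7)
n_non_ell, n_ell = 2000, 300
x_non = rng.normal(24, 4, n_non_ell)
x_ell = rng.normal(21, 4, n_ell)
gpa_non = 1.0 + 0.08 * x_non + rng.normal(0, 0.4, n_non_ell)
gpa_ell = 1.0 + 0.08 * x_ell + rng.normal(0, 0.4, n_ell)   # same true relationship in both groups

# Method 1: separate regressions, then compare slopes and intercepts across groups.
b_non, a_non, _ = fit_line(x_non, gpa_non)
b_ell, a_ell, _ = fit_line(x_ell, gpa_ell)
print(round(b_non, 3), round(b_ell, 3), round(a_non, 3), round(a_ell, 3))

# Method 2: one equation fit to the majority group, then examine minority-group residuals.
resid_ell = gpa_ell - (b_non * x_ell + a_non)
print(round(resid_ell.mean(), 3))  # systematic over- or underprediction would signal differential prediction
```

Because the simulated relationship is identical in both groups, the slopes and intercepts should agree within sampling error and the mean ELL residual should be near zero; systematic departures from that pattern are what differential prediction analyses look for.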


The simplest form of a regression model used in test bias research is

y = b_{1}X_{1} + a + e \qquad (7)

where y is the predicted criterion value, b_{1} is a coefficient describing the utility of variable 1 for predicting the criterion, a is the intercept of the regression line (i.e., the predicted value of the criterion when the value of the predictor is zero), and e represents error (i.e., variation in the criterion not explained by the predictors). If the criterion variable were freshman-year GPA and the predictor variable were ACT score, b_{1} would represent the utility of the ACT for predicting freshman GPA. To estimate the parameters in this equation, data on the predictor and criterion are needed. The residuals (e in the equation) represent the difference between the criterion value predicted by the equation and the actual criterion value. Every examinee in a predictive validity study has a test score and a criterion score. The residual is simply the criterion score minus the score predicted by the test.

Depending on the data available, investigation of the consistency of prediction across ELL and non-ELL groups could be conducted by statistical testing of the similarity of regression slopes, intercepts, and residuals. However, when ELL and non-ELL examinees have large differences on both the predictor and criterion measures, disentangling potential bias from regression to the mean may be problematic (Wainer & Sireci, 2005).

DISCUSSION

In this article, we summarized validity issues relevant to the assessment of ELLs, and we discussed common practices for promoting and evaluating the validity of ELLs' test scores. We also discussed methodological issues in gathering validity evidence. A theme we tried to incorporate throughout these discussions is that the way test scores are used and the impact this use has on ELLs should be a major consideration when deciding how to go about gathering validity evidence for the assessment of ELLs.

A variety of sophisticated statistical procedures exists for evaluating specific aspects of validity, but many other aspects have not received much study. By matching research designs and statistical techniques to specific validity questions, we can go a long way toward a comprehensive evaluation of the appropriateness of specific testing situations for ELLs. For example, experimental designs allow us to gauge the impact of test accommodations on ELL and non-ELL test performance. When the results of these studies show that non-ELL students do not benefit from the accommodations but ELLs do, we can have confidence that the accommodation does not provide an unfair advantage to ELLs, and, assuming it was based on a linguistic theory, we can also be confident that it reduced construct-irrelevant variance due to English proficiency.


When the comparability of test scores and the fidelity of the construct measured are critical validity issues, confirmatory factor analysis and MDS are useful for analyzing the structural equivalence of a test across ELL and non-ELL groups, or across original and translated versions of a test. In addition, DIF analysis can evaluate problems with specific items as well as the effectiveness of specific accommodations such as linguistic simplification.

Measurement of ELLs' knowledge and skills involves communication by both the testing agency and the student. In the United States, this communication typically takes place in English. As educational researchers, we are obligated to use our statistical tools to help educational assessment policymakers understand the degree to which the inferences derived from ELLs' test scores are valid. We are also obligated to promote fair and accurate assessment practices for ELLs.

One finding that was disappointing is that applied studies of the validity of assessments for ELLs have generally used only three of the five sources of validity evidence promulgated by the Standards. Analysis of the cognitive processes used by ELLs when responding to assessment items, and comparison of these results to those based on non-ELLs, has not been studied but is likely to be illuminating with respect to the construct validity of ELLs' test scores. As Messick (1989) pointed out, "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (p. 13). Use of the statistical methods described in this article can help researchers form an integrated, evaluative judgment about the degree to which the interpretations of ELLs' test scores are appropriate.

We recommend that all sources of validity evidence be investigated whenever possible, and we also recommend that test developers consider ELLs throughout the test development process. Such consideration involves explicit inclusion of ELLs in pilot tests, norming studies, and analyses of DIF, and careful consideration of the various types of ELLs when conducting sensitivity reviews.

ACKNOWLEDGMENTS

This article was originally presented as Center for Educational Assessment Research Report No. 633 (Center for Educational Assessment, University of Massachusetts Amherst) at the 2007 Annual Meeting of the American Educational Research Association (Division D), as part of the symposium "Design and Evaluation of Accessible Assessment Items for English Learners" (R. Duran, Chair). Kyung T. Han is now at the Graduate Management Admission Council.

Correspondence should be sent to Stephen G. Sireci, School of Education, University of Massachusetts, 156 Hills South, Amherst, MA 01003, USA. E-mail: [email protected]


REFERENCES

Abedi, J. (2001, December). Language accommodation for large-scale assessment in science: Assessing English language learners (Final Deliverable, Project 2.4 Accommodation). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing, University of California Los Angeles.
Abedi, J., Hofstetter, C., Baker, E., & Lord, C. (2001, February). NAEP math performance and test accommodations: Interactions with student language background (CSE Tech. Rep. No. 536). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.
Abedi, J., Lord, C., Boscardin, C. K., & Miyoshi, J. (2001). The effects of accommodations on the assessment of limited English proficient students in the National Assessment of Educational Progress (National Center for Education Statistics Working Paper, Publication No. NCES 2001-13). Washington, DC: National Center for Education Statistics.
Abedi, J., Lord, C., Hofstetter, C., & Baker, E. (2000). Impact of accommodation strategies on English language learners' test performance. Educational Measurement: Issues and Practice, 19(3), 16–26.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Carroll, J. D., & Chang, J. J. (1970). An analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 35, 283–319.
Cattell, R. B. (1944). Psychological measurement: Normative, ipsative, interactive. Psychological Review, 51, 292–303.
Cizek, G. J. (2001). More unintended consequences of high-stakes testing. Educational Measurement: Issues and Practice, 20(4), 19–27.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Dorans, N. J. (2002). Recentering and realigning the SAT score distributions: How and why. Journal of Educational Measurement, 39, 59–84.
Dorans, N. J. (2004). Using subpopulation invariance to assess test score equity. Journal of Educational Measurement, 41, 43–68.
Dorans, N. J., & Holland, P. W. (2000). Population invariance and equitability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37, 281–306.
Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23, 355–368.
Geisinger, K. F. (1992). Psychological testing of Hispanics. Washington, DC: American Psychological Association.
Geisinger, K. F. (2004). Testing of students with limited English proficiency. In J. Wall & G. Walz (Eds.), Measuring up: Assessment issues for teachers, counselors, and administrators (pp. 147–159). Greensboro, NC: CAPS Press.
Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17–27.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.


Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices. New York: Springer-Verlag.
Leighton, J. P. (2004). Avoiding misconception, misuse, and missed opportunities: The collection of verbal reports in educational achievement testing. Educational Measurement: Issues and Practice, 23(4), 6–15.
Lewis, C., & Wells, C. S. (2007, April). The effect of DIF on the classification of LEP students. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 44–53.
Liu, J., Cahn, M. F., & Dorans, N. J. (2006). An application of score equity assessment: Invariance of linkage of new SAT® to old SAT across gender groups. Journal of Educational Measurement, 43, 113–130.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed.). Washington, DC: American Council on Education.
Olmedo, E. L. (1981). Testing linguistic minorities. American Psychologist, 36, 1078–1085.
Popham, W. J. (1997). Consequential validity: Right concern—wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13.
Rabinowitz, S. N., & Sato, E. (2006). The technical adequacy of assessments for alternate student populations: Guidelines for consumers and developers. San Francisco: WestEd.
Ramsey, P. A. (1993). Sensitivity review: The ETS experience as a case study. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 367–388). Hillsdale, NJ: Erlbaum.
Schmitt, A. P. (1988). Language and cultural characteristics that explain differential item functioning for Hispanic examinees on the Scholastic Aptitude Test. Journal of Educational Measurement, 25, 1–13.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405–450.
Shepard, L. A. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16(2), 5–24.
Sireci, S. G. (1998). Gathering and analyzing content validity data. Educational Assessment, 5, 299–321.
Sireci, S. G. (2004, April). The role of sensitivity review and differential item functioning analyses in reducing the achievement gap. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Sireci, S. G., Bastari, B., & Allalouf, A. (1998, August). Evaluating construct equivalence across adapted tests. Invited paper presented at the annual meeting of the American Psychological Association (Division 5), San Francisco, CA.
Sireci, S. G., & Khaliq, S. N. (2002, April). An analysis of the psychometric properties of dual language test forms. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Sireci, S. G., Li, S., & Scarpati, S. (2003). The effects of test accommodations on test performance: A review of the literature (Commissioned paper by the National Academy of Sciences/National Research Council's Board on Testing and Assessment). Washington, DC: National Research Council.
Sireci, S. G., & Mullane, L. A. (1994). Evaluating test fairness in licensure testing: The sensitivity review process. CLEAR Exam Review, 5(2), 22–28.
Sireci, S. G., & Parker, P. (2006). Validity on trial: Psychometric and legal conceptualizations of validity. Educational Measurement: Issues and Practice, 25(3), 27–34.
Sireci, S. G., Scarpati, S., & Li, S. (2005). Test accommodations for students with disabilities: An analysis of the interaction hypothesis. Review of Educational Research, 75, 457–490.


Wainer, H., & Sireci, S. G. (2005). Item and test bias. In Encyclopedia of social measurement (Vol. 2, pp. 365–371). San Diego: Elsevier.
Wells, C. S., Baldwin, S., Hambleton, R. K., Karantonis, A., Jirka, S., Sireci, S. G., et al. (2007). Evaluating score equity across selected states for the 2005 Grade 8 NAEP math and reading assessments (Center for Educational Assessment Research Report No. 604). Amherst, MA: Center for Educational Assessment.
Wendt, A., Kenny, L. E., & Marks, C. (2007). Assessing critical thinking using a talk-aloud protocol. CLEAR Exam Review, 18(1), 18–27.
Zwick, R., & Schlemer, L. (2004). SAT validity for linguistic minorities at the University of California, Santa Barbara. Educational Measurement: Issues and Practice, 23(1), 6–16.