TOEFL iBT™ Research Insight

Series 1, Volume 3

Reliability and Comparability of TOEFL iBT™ Scores

ETS — Listening. Learning. Leading.®


Foreword

We are very excited to announce the TOEFL iBT™ Research Insight Series, a bimonthly publication to make important research on the TOEFL iBT available to all test score users in a user-friendly format.

Since the 1970s, the TOEFL test has had a rigorous, productive and far-ranging research program. But why should test score users care about the research base for a test? In short, because it is only through a rigorous program of research that a testing company can demonstrate its forward-looking vision and substantiate claims about what test takers know or can do based on their test scores. This is why ETS has made the establishment of a strong research base a consistent feature of the evolution of the TOEFL test.

The TOEFL iBT test is the most widely accepted English language assessment, used for admissions purposes in more than 130 countries including the United Kingdom, Canada, Australia, New Zealand and the United States. Since its initial launch in 1964, the TOEFL test has undergone several major revisions motivated by advances in theories of language ability and changes in English teaching practices. The most recent revision, the TOEFL iBT test, was launched in 2005. It contains a number of innovative design features, including the use of integrated tasks that engage multiple language skills to simulate language use in academic settings, and the use of test materials that reflect the reading and listening demands of real-world academic environments.

At ETS we understand that you use TOEFL iBT test scores to help make important decisions about your students, and we would like to keep you up to date about the research results that assure the quality of these scores. Through the TOEFL iBT Research Insight Series we wish both to communicate to the institutions and English teachers who use TOEFL iBT test scores the strong research and development base that underlies the TOEFL iBT test, and to demonstrate our strong, continued commitment to research. We hope you will find this series relevant, informative and useful. We welcome your comments and suggestions about how to make it a better resource for you.

Ida Lawrence
Senior Vice President
Research & Development Division
Educational Testing Service

Preface

The TOEFL test is developed and supported by a world-class team of test developers, educational measurement specialists, statisticians and researchers. Our test developers have advanced degrees in such fields as English, language education and linguistics. They also possess extensive international experience, having taught English in Africa, Asia, Europe, North America and South America. Our research, measurement and statistics team includes some of the world’s most distinguished scientists and internationally recognized leaders in diverse areas such as test validity, language learning and testing, and educational measurement and statistics.

To date, more than 150 peer-reviewed TOEFL research reports, technical reports and monographs have been published by ETS, many of which have also appeared in academic journals and book volumes. In addition to the 20-30 TOEFL-related research projects conducted by ETS Research & Development staff each year, the TOEFL Committee of Examiners (COE), composed of language learning and testing experts from the academic community, funds an annual program of TOEFL research by external researchers from all over the world, including preeminent researchers from Australia, the UK, the US, Canada and Japan.

In Series One of the TOEFL iBT Research Insight Series, we provide a comprehensive account of the essential concepts, procedures and research results that assure the quality of scores on the TOEFL iBT test. The six issues in this Series will cover the following topics:


Issue 1: TOEFL iBT Test Framework and Development
The TOEFL iBT test is described along with the processes used to develop test questions and forms. These processes include rigorous review of test materials, with special attention to fairness concerns. Item pretesting, tryouts and scoring procedures are also detailed.

Issue 2: TOEFL Research
The TOEFL Program has supported rigorous research to maintain and improve test quality. Over 150 reports and monographs are catalogued on the TOEFL website. A brief overview of some recent research on fairness and automated scoring is presented here.

Issue 3: Reliability and Comparability of Test Scores
Given that hundreds of thousands of test takers take the TOEFL iBT test each year, many different test forms are developed and administered. Procedures to achieve score comparability on different forms are described in this section.

Issue 4: Validity Evidence Supporting Test Score Interpretation and Use
The many types of evidence supporting the proposed interpretation and use of test scores as a measure of English-language proficiency in academic contexts are discussed.

Issue 5: Information for Score Users, Teachers and Learners
Materials and guidelines are available to aid in the interpretation and appropriate use of test scores, as well as resources for teachers and learners that support English-language instruction and test preparation.


Issue 6: TOEFL Program History
A brief overview of the history and governance of the TOEFL Program is presented. The evolution of the TOEFL test constructs and contents from 1964 to the present is summarized.

Future series will feature summaries of recent studies on topics of interest to our score users, such as “what TOEFL iBT test scores tell us about how examinees perform in academic settings” and “how score users perceive and use TOEFL iBT test scores.” The close collaboration with TOEFL iBT score users, English language learning and teaching experts and university professors in the redesign of the TOEFL iBT test has contributed to its great success. Therefore, through this publication, we hope to foster an ever stronger connection with our score users by sharing the rigorous measurement and research base and solid test development that continues to ensure the quality of TOEFL iBT scores to meet the needs of score users.

Xiaoming Xi
Senior Research Scientist
Research & Development Division
Educational Testing Service

Contributors

The primary authors of this section are Mary Enright and Eileen Tyson. The following individuals also contributed to this section by providing their careful review as well as editorial suggestions (in alphabetical order): Cris Breining, Rosalie Szabo, Xiaofei Tang and Xiaoming Xi.

Reliability and Comparability of TOEFL iBT™ Scores

ETS has always been committed to the quality of its test scores. As an ETS assessment program, TOEFL® strives to ensure score reliability and comparability through strict adherence to guidelines and practices established for the development and operational implementation of its products and services. Evidence of score reliability and comparability is important because it suggests that test scores will have the same meaning across test forms. ETS strives to ensure that the test scores of the TOEFL Internet-based test (iBT) are reliable and comparable by:

• implementing and adhering to standardized administration and test security procedures



• using detailed test specifications to guide test development



• monitoring score reliability and generalizability



• employing an appropriate scale for reporting scores



• using equating and other means to maintain comparable scores across test forms

Standardized Administration and Security Procedures

In large-scale tests such as the TOEFL iBT test, a critical component of ensuring score validity and fairness is implementing and adhering to standardized procedures for test administration and related test security. Standardized test administrations and test security measures ensure that the TOEFL iBT test is given under comparable conditions to all test takers, no matter where or when they take the test. As a result, the test scores reflect the test takers’ abilities and are not unduly influenced by other, unrelated factors. TOEFL iBT operational procedures for maintaining standardized conditions for test administration and security follow the requirements laid out in the ETS Standards for Quality and Fairness (ETS, 2002). The TOEFL program also provides extensive material to test administrators and test takers so that violations of such procedures can be reported to ETS for investigation. The major procedures are:



• certifying all test centers’ facilities and equipment for administering TOEFL iBT tests, including hardware, software and Internet connections



• training test center staff on how to handle a test administration session, including test taker identity verification, test launch and incident and irregularity management



• providing online practice tests and other supporting information for test takers to become familiar with the test and test-taking conditions, including section sequence, test duration, use of headphones and microphones and navigating within and across test sections (information for test takers is available at www.toeflgoanywhere.org)



• using technology to control test delivery and transmission of test-related data to ensure the security of test contents and test results



• informing test takers about how to report fraudulent behavior in a test session

Test Specifications

Another important way to help ensure the comparability of scores across test forms is to create detailed specifications to guide the test development process. Test specifications are a detailed operational definition of test characteristics that form an exact content sampling plan. For example, test specifications may define the kind of content covered by the test, the number of test questions, the format of test questions and responses, and the response options. The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999, p. 43) provides general guidelines for developing and evaluating test specifications. When multiple forms of a test are developed according to well-defined test specifications, the test characteristics can be expected to remain very similar across the test forms and across test administrations.
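To make the idea of a content sampling plan concrete, the sketch below shows, in Python, the kind of constraints a test specification might record and check. It is purely illustrative: the field names and values are hypothetical and are not the actual TOEFL iBT specifications.

```python
# A purely illustrative sketch of the kind of information a test specification
# might capture; the fields and values are hypothetical, not the actual
# TOEFL iBT specifications.
reading_section_spec = {
    "content_domains": ["exposition", "argumentation", "historical narrative"],
    "passages_per_form": 3,
    "questions_per_passage": {"min": 12, "max": 14},
    "question_formats": ["four-option multiple choice", "insert-a-sentence", "prose summary"],
    "timing_minutes": 60,
}

def conforms(passages_on_form: int, spec: dict) -> bool:
    """Check one simple constraint from the sampling plan: the number of passages."""
    return passages_on_form == spec["passages_per_form"]

print(conforms(3, reading_section_spec))  # True for a form built to this hypothetical plan
```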


The TOEFL iBT test offers multiple test administrations each year, so it is critical to ensure that test forms used for these administrations are similar in content. This is accomplished by following the test specifications for the TOEFL iBT test in the test development process. Details of how these specifications were developed using Evidence Centered Design can be found in Pearlman (2008).

Score Reliability and Generalizability

An important measure of the quality of a test is how reliable the test scores are. Reliability is important because it indicates the replicability of the test scores when either a test could be given twice or more to the same group of people, or two tests constructed in the same manner could be given to the same group of people. The more reliable the scores are, the more confidence score users have in using the scores for making important decisions about test takers. Testing, like other measurement events, is subject to the influence of many factors that are not relevant to the ability being measured. Such irrelevant factors contribute to what is called “measurement error,” which in turn determines how reliable test scores are.

In educational measurement, score reliability is a statistical index to quantify and evaluate the consistency of test scores. In essence, “the concern of reliability is to quantify the precision of test scores and other measurements” (Haertel, 2006, p. 65). A test score is a measurement outcome of a person’s performance on a test, and as such, just like any other measurement, the score contains some degree of measurement error due to many factors. In fact, a person’s real or true ability can never be observed in a test. A well-developed test is expected to yield a test score that will reflect a person’s real ability as much as possible and to keep measurement error to a minimum. This is what “precision of test scores” really means.

Precision of test scores is also expressed as a measurement index called the standard error of measurement (SEM). A person’s true ability score is never obtainable and is therefore estimated using statistical methods. Imagine that a person could take two tests that are constructed to the same specifications and are essentially equivalent. The person would receive two test scores, but neither of the two scores would be the person’s exact true ability score. However, both scores should be around his or her true ability score. SEM is a measure that defines a score range in which one’s true ability score lies with a certain level of probability. Obviously, the smaller the SEM, the better the quality (the more precise) the test scores will be.

Reliability is expressed in a statistical index whose value ranges from 0 (not at all reliable) to 1 (perfectly reliable). Such an index, called a coefficient, can be estimated in different ways depending on the intended use of the coefficient and the underlying theoretical frameworks. In the TOEFL iBT test, the reliability estimation for the Reading and Listening sections that contain selected response questions is carried out using a method based on item response theory (IRT) (Lord, 1980). For the Speaking and Writing sections that contain constructed response tasks, generalizability theory (G-theory) is used (Brennan, 1983).

Generally speaking, there are two steps taken in producing IRT-based reliability estimates for Reading and Listening scores. The first step estimates the SEM for each scaled score point after a new test form is equated; such a SEM specific to each score point is called a conditional SEM, or CSEM. Item parameters and personal ability estimates from an IRT model are used in calculating the CSEM values. In the second step, the CSEM values across all the scaled score points are averaged, and this averaged CSEM value is used in the calculation of the reliability estimate for the overall test of Reading or Listening by substituting this overall SEM (averaged CSEM) for the SEM in the formula below:

SEM = σ_x √(1 − r_xx),

where σ_x is the standard deviation of the scaled score, r_xx is the reliability estimate, and solving for r_xx will then obtain the estimated reliability. The SEM is on the same metric as the scaled scores that are reported.

Generalizability theory has seen extensive application in measuring the score reliability for tests with constructed response tasks, in which the test taker generates answers to test questions instead of selecting from a list of possible responses. Using an analysis of variance, generalizability theory separates different sources of variance (test taker ability, rater effects, and task effects) so that the effect of each source of variance (called a facet) can be evaluated. The index for score reliability in this framework is a generalizability coefficient (G-coefficient), which is also on a scale of 0 to 1, with a value closer to 1 being desired. A ‘person by task’ generalizability model is used for the Speaking scores, whereas a nested model is used for Writing (rating nested within task, and this is crossed with person) (see Lee & Kantor, 2005 for details).

The above-mentioned reliability and generalizability analyses are conducted for every test form. Table 1 presents the average section and total score reliability estimates and standard errors of measurement based on operational data from 2007.
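To make the relationship in the formula above concrete, the following minimal Python sketch solves SEM = σ_x √(1 − r_xx) for the reliability r_xx, and also shows the forward direction. The numerical values in the example are invented for illustration and are not operational TOEFL statistics.

```python
import math

def reliability_from_sem(sem: float, sd: float) -> float:
    """Solve SEM = sd * sqrt(1 - r) for the reliability r."""
    return 1.0 - (sem / sd) ** 2

def sem_from_reliability(reliability: float, sd: float) -> float:
    """Forward form of the same identity: SEM = sd * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical values: an averaged CSEM of 3.35 scaled-score points together with
# a scaled-score standard deviation of 8.6 would imply a reliability of about 0.85.
print(round(reliability_from_sem(sem=3.35, sd=8.6), 2))
```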

Table 1. Reliabilities and Standard Errors of Measurement

Score        Scale    Reliability Estimate    SEM
Reading      0-30     0.85                    3.35
Listening    0-30     0.85                    3.20
Speaking     0-30     0.88                    1.62
Writing      0-30     0.74                    2.76
Total        0-120    0.94                    5.64

The reliability estimates for the Reading, Listening, Speaking, and Total scores are relatively high, while the reliability of the Writing score is somewhat lower. This is a typical result for writing measures composed of only two tasks (Breland, Bridgeman, & Fowles, 1999) and reflects one well-documented limitation of performance testing: reliability estimates for measures composed of a small number of time-consuming tasks are often lower than estimates for measures composed of many shorter, less time-consuming tasks. However, the construct of academic writing as defined for the TOEFL iBT test required the production of extended writing samples (Cumming, Kantor, Powers, Santos, & Taylor, 2000).

One implication of these results is that, for making high-stakes decisions such as admissions to college or graduate school, the Total score provides the best information, both because it reflects all four language skills and because it is the most reliable. Nevertheless, there are circumstances under which decision makers may want to examine the profile of scores for test takers, such as when considering the demands of the curriculum or a need for additional language training. ETS also encourages score users to consider a number of other factors when making admissions decisions, including grade point average, scores on other admissions exams, teacher recommendations, and interviews with individuals.

The reliability estimates in Table 1 are those used for the TOEFL iBT operational test scores. Other types of reliability estimates also exist that take into account other sources of variability, such as differences in test forms or changes in examinees’ performances from day to day. Alternate form reliability, for example, is calculated based on examinees’ scores on two different forms of a test. This requires examinees to take two different test forms, something only a few examinees would volunteer to do. Some examinees do, however, take the test twice within a period too short for much learning to occur, for reasons of their own. An analysis of the scores of these repeat test takers on the two test forms provides an approximation of alternate form reliability. Zhang (2008) compared the test scores of more than 12,000 examinees who were identified as having taken two TOEFL iBT tests within a period of one month. The correlations of their scores on the two test forms were 0.77 for the Listening and Writing sections, 0.78 for Reading, 0.84 for Speaking, and 0.91 for the total test score. Because these measures of reliability take into account additional sources of variability, they are typically lower than internal consistency measures. Nevertheless, they indicate a high degree of consistency in the rank ordering of the scores of these test repeaters.
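As a simple illustration of what such an alternate-form approximation involves, the sketch below correlates the scores of the same (hypothetical) examinees on two forms. The score pairs are invented for illustration only and do not represent TOEFL data.

```python
# Minimal sketch of an alternate-form reliability approximation from repeat
# test takers' scores; the paired section scores (0-30 scale) are invented.
from statistics import correlation  # Pearson correlation, available in Python 3.10+

first_form  = [24, 18, 27, 21, 15, 29, 22]   # hypothetical section scores on the first form
second_form = [25, 17, 28, 20, 16, 30, 21]   # the same examinees on a second form

alternate_form_reliability = correlation(first_form, second_form)
print(round(alternate_form_reliability, 2))
```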

Scaling TOEFL iBT Scores

Reported test scores are derived from performance on a test through a statistical process called “scaling.” In a simple example, a student who answered 55 questions correctly out of 60 questions on a test would receive a score of 55 if each correct answer was worth one score point. This score is the number-correct score, which is also called a raw score. For standardized tests, raw score scales are almost never used directly as reported scores. This is because the raw score is directly dependent on the specific items on a particular test form; this particular form may not have exactly the same difficulty level as other forms of the same test. As a result, the same raw scores from two different test forms may not represent the same level of performance or ability. Instead, the raw scale is transformed to a reporting score scale, which produces a scaled score. A carefully developed score scale, together with an equating plan (see the following section), is important in maintaining score comparability and meaningful interpretation of scores across test forms and over time.

The scales for the measures on the TOEFL® iBT test were established such that the same scale range (0-30) for the four sections was chosen to indicate that all sections should be viewed as being equally important in measuring the construct of academic language ability. The total score would be the sum of the four section scores. The decision to use a 0-to-30 scale was based primarily on the need to provide reasonable raw-to-scale score mappings for each of the sections, which differed in their maximum raw scores. The maximum number of raw score points on the four sections of the form used in the field study ranged from 20 for writing to 44 for reading.

Maintaining Score Comparability across Test Forms

For testing programs that have multiple administrations with different test forms, it is necessary to maintain score comparability across test forms for meaningful comparison of scores. Score comparability across test forms is typically maintained using certain statistical processes called equating. Equating is a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably. “Equating adjusts for differences in difficulty among forms that are built to be similar in difficulty and content” (Kolen & Brennan, 1995, p. 2). For tests containing selected response items, equating is routinely carried out to produce reporting scores for a new test form, as is the case for the TOEFL iBT Reading and Listening sections.

Such an equating process, however, is not practical or feasible for performance-based tests that contain only a few tasks. For example, a writing test may have only one or two writing tasks that are scored by human raters using a scoring rubric (rules or guidelines for assigning scores to constructed-response questions or performance tasks). Many equating procedures require repeating previously administered items in the current administration. This may not be feasible if a writing test has only one or two tasks that are easily remembered and shared with other examinees. Threats to score comparability on such performance tests result from both differences in test form difficulty and inconsistency in human raters’ scoring activities. In the absence of any feasible equating procedures, various other statistical analyses can be used to evaluate and control the quality of scores (Baldwin, Fowles, & Livingston, 2005).

Equating TOEFL iBT Reading and Listening Sections

A non-equivalent anchor test design (also called a common item non-equivalent group design) is used as the equating data collection design for the TOEFL iBT Reading and Listening sections. This means that each new form of the TOEFL iBT test contains an anchor block, which is a set of items that have been pretested in previously administered forms and have already been analyzed. This design enables an adjustment for possible ability differences between the group in which the items were pretested in an old test form and the group in which the items are used in the current new test form for equating. This is possible because the same items are given to two groups of candidates, and the differences in item statistics for the two groups reflect the ability differences between the two groups. Such differences need to be adjusted during the equating process.

A statistical model within the item response theory (IRT) framework is used to analyze the items and candidates’ abilities. From the IRT-based analysis, item parameters and candidates’ abilities are derived. The derived item parameters and ability values are on the TOEFL iBT IRT scales that were already established. This way, items and candidates from different test administrations can be directly compared.
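As background for the equating procedure described next, the sketch below shows, under simplified assumptions, how an IRT model links item parameters and an ability value to an expected (“true”) raw score on a form. The three-parameter logistic response function and the item parameters used here are illustrative only; they are not ETS’s operational model, item pool, or data.

```python
import math

def prob_correct(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """3PL item response function: probability of a correct answer at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def true_score(theta: float, items: list[tuple[float, float, float]]) -> float:
    """Expected raw score on a form: the sum of item probabilities at ability theta."""
    return sum(prob_correct(theta, a, b, c) for a, b, c in items)

# Hypothetical (a, b, c) item parameters for two small forms.
new_form  = [(1.1, -0.4, 0.2), (0.9, 0.3, 0.2), (1.3, 1.0, 0.2)]
base_form = [(1.0, -0.2, 0.2), (1.2, 0.5, 0.2), (0.8, 0.9, 0.2)]

# True-score equating pairs the raw scores that correspond to the same ability:
# at a given theta, the new-form true score is treated as equivalent to the
# base-form true score, and the base form's raw-to-scale conversion is then applied.
theta = 0.5
print(true_score(theta, new_form), true_score(theta, base_form))
```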

The IRT true score equating method (see detailed descriptions of this method in Kolen & Brennan, 1995) is used to establish the relationship between scores on the current new form and scores on the base form. After equating, a raw score on the new form is adjusted to be equivalent to a raw score on the base form. Because each raw score on the base form already corresponds to a scaled score, each raw score on the new form is now related to a scaled score between 0 and 30. The scaled scores for any test forms are directly comparable, as they now indicate the same levels of performance and ability.

Comparability of TOEFL iBT Speaking and Writing Sections across Forms

Speaking and Writing scores are not equated statistically due to technical and test security constraints. The two test sections have six and two constructed response tasks, respectively. Because these few tasks are prone to memorization, test security concerns preclude the possibility of repeating tasks on every new test form, as is required by equating. To minimize differences in test form difficulty and potential inconsistency due to human scoring, a number of non-statistical procedures are put in place in test development and scoring. Careful test development effort and rigorous scoring standards are used to maintain score quality.

Detailed task specifications guide the development of parallel tasks. Then, small-scale tryouts are used to screen out poorly performing tasks. Training is given to raters using well-defined and articulated scoring rubrics. In addition, raters are certified before they can begin scoring work and, prior to each scoring session, they must pass a calibration test. Furthermore, during all scoring sessions, raters are monitored and supervised by chief raters.

Several statistical methods are also implemented to monitor and evaluate the performance of the tasks in each test form, and the performance of raters. All the constructed response tasks in the Writing and Speaking sections are analyzed after a test is given. The analysis examines such statistics as average scores on a task, distributions of scores on a task, and correlations between the Writing or Speaking sections and the Reading and Listening sections.

The performance of raters is evaluated by statistics on rater agreement rates, which include both exact agreement (no score difference between two raters) and adjacent agreement (a 1-point difference); a small illustration of these agreement statistics follows at the end of this section. The average of all the scores a rater assigns in a scoring session is compared with the average score of all the raters participating in the same session. A large difference between these two average scores may alert a scoring leader to a possible problem in a rater’s performance.

Whenever possible, monitor papers are also used to evaluate cross-administration scoring consistency. Monitor papers are selected responses on a task from a prior test administration that were scored previously. If the task is occasionally included in a new test administration, these monitor papers are intermixed with the responses to the task on the new test for scoring. Because these monitor papers are indistinguishable from the responses to the task on the new test, raters will score them in the same way as they score the new responses. Then, the old and new scores on the monitor papers are compared, and the agreement rates between the two sets of scores indicate cross-administration rater consistency in scoring.

Another type of statistical evidence of score consistency across forms comes from the analysis of repeat test takers. As noted in the section on reliability, Zhang (2008) analyzed the test scores of examinees who chose to take the test twice within a short period of time. The correlational analyses established that the examinees were rank-ordered consistently on the two test forms. Zhang also reported that the differences in scores on the two test forms for all four sections and for the total score were negligible for most examinees.
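As referenced above, the short sketch below tallies exact and adjacent agreement rates for a pair of raters. The ratings are invented for illustration and do not represent operational scoring data.

```python
# Minimal sketch of exact and adjacent rater agreement rates on hypothetical ratings.
rater_1 = [4, 3, 5, 2, 4, 3, 1, 4]
rater_2 = [4, 4, 5, 2, 3, 3, 2, 5]

exact    = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)
adjacent = sum(abs(a - b) == 1 for a, b in zip(rater_1, rater_2)) / len(rater_1)

print(f"exact agreement: {exact:.0%}, adjacent agreement: {adjacent:.0%}")
```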


Conclusion

Because different forms of the TOEFL iBT test are administered to test takers at different times and in different locations, score reliability and comparability are important criteria for evaluating the quality of the test. Therefore, ETS has implemented a variety of procedures to enhance test score reliability and comparability. Evidence of score reliability and comparability for TOEFL iBT scores comes both from statistical analyses and from the application of accepted test development, administration and scoring practices. Such evidence allows decision makers to evaluate the trustworthiness of the test scores when the scores are used to indicate candidates’ abilities or performances.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). The standards for educational and psychological testing. Washington, DC: American Educational Research Association. www.apa.org/science/standards.html

Breland, H., Bridgeman, B., & Fowles, M. E. (1999). Writing assessment in admission to higher education: Review and framework (ETS Research Rep. No. 99-03). Princeton, NJ: ETS. www.ets.org/Media/Research/pdf/RR-99-03Breland.pdf

Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing Program.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Cumming, A., Kantor, R., Powers, D., Santos, T., & Taylor, C. (2000). TOEFL 2000 writing framework: A working paper (TOEFL Monograph No. 18). Princeton, NJ: ETS. www.ets.org/Media/Research/pdf/RM-00-05.pdf

Educational Testing Service. (2002). ETS standards for quality and fairness. Princeton, NJ: Author. www.ets.org/Media/About_ETS/pdf/standards.pdf

Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65-110). Westport, CT: American Council on Education and Praeger.

Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices. New York, NY: Springer-Verlag.

Lee, Y.-W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes (ETS Research Rep. No. RR-05-14). Princeton, NJ: ETS. http://infoport/sites/rrpts/publications/RR-05-14.pdf

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Pearlman, M. (2008). Finalizing the test blueprint. In C. Chapelle, M. K. Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language™. New York, NY: Routledge/Taylor & Francis Group.

Zhang, Y. (2008). Repeater analyses for TOEFL iBT (ETS Research Memorandum No. RM-08-05). Princeton, NJ: ETS.


Contact Us [email protected]


Copyright © 2011 by Educational Testing Service. All rights reserved. ETS, the ETS logo, LISTENING. LEARNING. LEADING. and TOEFL are registered trademarks of Educational Testing Service (ETS) in the United States and other countries. TOEFL iBT is a trademark of ETS. EDU00025

Listening. Learning. Leading.® www.ets.org
