
ACKNOWLEDGEMENTS

I want to thank my content advisor, Professor Carolyn Duffy, for inspiring me to write this thesis, guiding me every step of the way, and telling me: ‘I look forward to reading your study.’ I want to thank my thesis advisor, Professor Harold Booth, who traveled through a snowstorm to the library just to give me a draft of my writing, who read and commented meticulously on the content, the organization, and the language of the thesis, who kept me on track with my writing, and who showed me the rhetorical features of thesis writing, from the importance of hedging to the power of simple sentences. Your comments have always been appreciated, Professor. Professor Duffy and Professor Booth, I would not have completed this thesis but for you.

I want to thank my participants, who were willing to give their time to the scripts and to share frankly about their scoring. Your openness regarding your own scoring practices and your enthusiasm for scoring made me aware of the importance of what I am doing.

I want to thank all of my professors at Saint Michael’s College, whose knowledge and passion have had such a significant impact on me. Every page of this writing is inspired by the classes I took with you.

I also want to thank Titapa Chinkijakan for making me laugh and for staying up late with me every night while I was working on my paper. Nat, I would not have been able to stay awake to work on my thesis without your company.

Thank you, Erika Bodin, for your hugs and for the sound of your violin while I was writing this thesis. The retreats with your family revitalized me. They helped me look beyond the stress of writing the thesis and appreciate other values besides my studies.

Thank you to all of my friends who came over to say ‘hi’ to me in the library, who brought me food, and who believed in my ability to finish this paper with flying colors. Thank you for your tea cup, Bonisiwe; for the kettle, Moldir and Nada; for the jam, Kiki; for the thesis paper, Nazgul; and for the tea, Mike.

Thank you, Heather Battig; you created the most memorable experiences during my time here. May kindness return to you in the same beautiful way that it was given.

Thank you, Durick Library; you have been a second home to me. Your quietness, your comfort, and your resources have been the foundation for my continuous efforts.

Thank you, my dad, my mum, my younger brother, my aunts, and my second Vietnamese families here in the States, who love me unconditionally. You are my support and the stable shoulders on which I have always been able to rely. Dad, I will always be your beloved daughter, no matter what I am doing. Mum, you taught me the most important lesson in life: persistence is the key to success. Hoang, I would not believe in myself this much without your love and admiration. Aunt Hanh and Aunt Lam, you are the architects of my success here. I would not have been able to pursue my dream without you! Last but not least, thank you, ‘Chị Hai’, the unbelievably kind-hearted woman who treats me as her daughter and looks after me like a mother every day here in Vermont.

ABSTRACT

Standardized writing tests are often high-stakes tests for students, as the writing test scores often have consequences for the student, the writing teacher, and the instructional program which trains the student writers. How to evaluate writing is usually an important part of teaching writing. However, a close investigation of the current realities of scoring writing in an EFL teaching context reveals a number of difficulties that teachers face in improving validity when scoring writing for standardized tests. This study explores the best practices that raters should use when they score standardized test essays and provides feasible solutions to the scoring of two popular standardized tests, notably the IELTS and the TOEFL-iBT, by non-native English-speaking raters in the English language teaching context of Vietnam. A thorough review of the current literature regarding the scoring of standardized test essays provided this study with essential information on the best rating practices. In addition, a small set of anecdotal data from questionnaires and interviews with three experienced raters is used to supplement key ideas in the literature review. A framework of decision-making behaviors and rating instruments is proposed, which enables raters to align their decision-making behaviors with the rating requirements, thereby enhancing the validity and consistency of their scoring of EFL writing.

TABLE OF CONTENTS

Acknowledgements
Abstract
Table of Contents
List of Tables

Chapter One: Introduction
1.1. Background of the Study
    1.1.1. Lack of Consensus on Decision-Making Behaviors When Rating ESL/EFL Writing
    1.1.2. Lack of Training for Non-Native English-Speaking Teachers in EFL Teaching Contexts
1.2. Aims and Objectives
1.3. Research Questions

Chapter Two: Methodology
2.1. Data Collection
2.2. Literature Review
2.3. Questionnaire for Raters
    2.3.1. Participants of the Study
    2.3.2. Writing Sample Responses
    2.3.3. Questionnaire for Experienced Raters
    2.3.4. Questionnaire Procedures
2.4. Interviews with Experienced Raters

Chapter Three: Results and Discussion
Part A: Results and Discussion from the Literature Review
Research Question 1: What Are the Best Practices that Standardized Test Raters Use When They Score English Language Essays?
3.1. Components of Scoring
    3.1.1. Scoring Validity
    3.1.2. Rating Criteria
    3.1.3. Types of Rating Scales
    3.1.4. An Example of an Analytic Approach – the IELTS Test
    3.1.5. An Example of a Holistic Approach – the TOEFL-iBT Test
3.2. Rating Practices
3.3. Raters’ Decision-Making Behavior When Rating Standardized Tests Using Analytic Rating Scales
    3.3.1. Four Approaches to Analytic Scoring
    3.3.2. Analytic Scoring Approaches to the IELTS Writing Component
3.4. Raters’ Decision-Making Behavior When Rating Standardized Tests Using Holistic Rating Scales
    3.4.1. Cumming, Kantor, and Powers’ (2002) Framework of Raters’ Decision-Making Behavior
    3.4.2. Holistic Scoring Approaches to the TOEFL-iBT Writing Component
Research Question 2: How Can Non-Native English-Speaking Teachers Acquire the Best Rating Practices?
3.5. Differences in Native and Nonnative-Speaking EFL Teachers’ Evaluations of ESL Writing
3.6. Scoring Training
    3.6.1. Benefits of Training
    3.6.2. Kinds of Rater Consistency
    3.6.3. Training Procedures
        3.6.3.1. Norming
        3.6.3.2. Other Scoring Training Methods
3.7. Summary of the Results of the Literature Review
Part B: Results of the Questionnaire and Interview about Native English-Speaking Raters’ Decision-Making Behavior When Rating Standardized Writing Tests
3.8. Summary of the Questionnaire Procedures
    3.8.1. Participants
    3.8.2. Writing Sample Responses
    3.8.3. Questionnaire Procedures
3.9. IELTS Writing Task 1
    3.9.1. Description of IELTS Writing Task 1
    3.9.2. Description of the Scoring of the Response to IELTS Writing Task 1 (IELTS Response 1)
    3.9.3. Report of the Participants’ Responses and Researcher’s Comments on IELTS Response 1
    3.9.4. Further Comments on the Scoring of Grammatical Range and Accuracy of IELTS Response 1
3.10. IELTS Writing Task 2
    3.10.1. Description of IELTS Writing Task 2
    3.10.2. Description of the Scoring of the Response to IELTS Writing Task 2 (IELTS Response 2)
    3.10.3. Report of the Participants’ Responses and Researcher’s Comments on IELTS Response 2
    3.10.4. Further Comments on the Rating of Grammatical Range and Structure of IELTS Response 2
3.11. TOEFL-iBT Integrated Writing Task
    3.11.1. Description of the TOEFL-iBT Integrated Writing Task
    3.11.2. Description of the Scoring of the Response to the TOEFL-iBT Integrated Writing Task (TOEFL-iBT Response 3)
    3.11.3. Report of the Participants’ Responses and Researcher’s Comments on TOEFL-iBT Response 3
    3.11.4. Further Comments on the Ratings of TOEFL-iBT Response 3
3.12. TOEFL-iBT Independent Writing Task
    3.12.1. Description of the TOEFL-iBT Independent Writing Task
    3.12.2. Description of the Scoring of the Response to the TOEFL-iBT Independent Writing Task (TOEFL-iBT Response 4)
    3.12.3. Report of the Participants’ Responses and Researcher’s Comments on TOEFL-iBT Response 4
    3.12.4. Further Comments on the Ratings of TOEFL-iBT Response 4
3.13. Discussion of the Questionnaire
    3.13.1. Score Assignment by Raters
    3.13.2. Importance of Norming
    3.13.3. Priority of Errors in Assessing Writing Tasks
    3.13.4. Time in Scoring Writing
3.14. Summary of the Interview Procedures
    3.14.1. Interview Question 1: What do you think about the tasks?
    3.14.2. Interview Question 2: What are your opinions regarding the analytic IELTS rating scale and the holistic TOEFL-iBT rating scale?
    3.14.3. Interview Question 3: What are the procedures that you follow when scoring?
    3.14.4. Interview Question 4: Do you score the responses based on your preconceived ideas on good writing?
    3.14.5. Interview Question 5: What are the best practices for improving rating performances for NES and NNES raters?
    3.14.6. Interview Question 6: Can NNES become good raters?
3.15. Discussion of the Results from the Interview

Chapter Four: Conclusions
4.1. Research Question 1: What Are the Best Practices that Raters Use When They Score Standardized Test Essays?
    4.1.1. Suggested Rating Behavior for the IELTS and TOEFL-iBT Writing Tasks
    4.1.2. Other Factors that Impact Scoring
4.2. Research Question 2: How Can Non-Native English-Speaking Teachers Acquire the Best Practices in Rating Standardized Test Essays?
4.3. Limitations of the Study
4.4. Recommendations for Future Studies
4.5. Application of the Thesis to NNES Raters in Viet Nam
4.6. Summary

Appendices
References

LIST OF TABLES

Table 1: Revised Method of Assessment for IELTS Writing
Table 2: Descriptive Framework of Decision-Making Behaviors While Rating TOEFL Writing Tasks
Table 3: Milanovic, Saville, and Shen’s Model of Decision-Making Processes of Examiners Using Holistic Rubrics
Table 4: Norming Process
Table 5: Description of the Scoring of IELTS Response 1
Table 6: Report on Task Achievement of IELTS Response 1
Table 7: Report on Coherence and Cohesion of IELTS Response 1
Table 8: Report on Lexical Resource of IELTS Response 1
Table 9: Report on Grammatical Range and Accuracy of IELTS Response 1
Table 10: Description of the Scoring of IELTS Response 2
Table 11: Report on Task Response of IELTS Response 2
Table 12: Report on Coherence and Cohesion of IELTS Response 2
Table 13: Report on Lexical Resource of IELTS Response 2
Table 14: Report on Grammatical Range and Accuracy of IELTS Response 2
Table 15: Description of the Scoring of TOEFL-iBT Response 3
Table 16: Report on TOEFL-iBT Response 3
Table 17: Description of the Scoring of TOEFL-iBT Response 4
Table 18: Report on TOEFL-iBT Response 4
Table 19: Time in Scoring Writing
Table 20: Interview Question 1: What do you think about the tasks?


Chapter One
INTRODUCTION

This study explores the best practices that raters use when they score standardized test essays. A framework of decision-making behaviors and rating instruments is proposed, which enables raters to align their decision-making behaviors with the rating requirements, thereby enhancing the validity and consistency of their scoring of EFL writing. The study also aims to provide feasible solutions to the scoring of popular standardized tests, notably the International English Language Testing System (IELTS) and the Test of English as a Foreign Language Internet-Based Test (TOEFL-iBT), by non-native English-speaking raters in the English language teaching context of Vietnam. Chapter 1 presents a rationale for the current study based on current theory and practical considerations. The aims and objectives of the research are then presented, accompanied by the research questions.

1.1. Background of the Study

1.1.1. Lack of Consensus on Decision-Making Behaviors When Rating ESL/EFL Writing

Writing tests are often high-stakes tests for students, as the writing test scores often have consequences for the student, the writing teacher, and the instructional program which trains the student writers. How to score writing is usually an important part of teaching writing. However, a close investigation of the current realities of scoring writing in an EFL teaching context reveals a number of difficulties that teachers face in improving validity in scoring standardized tests (Shi, 2001; Cumming, Kantor, & Powers, 2002). This study identifies a lack of theory and training in rating ESL/EFL writing as two major issues that undermine the reliability of scoring among writing teachers. First, there is a lack of consensus among raters, experts and classroom teachers alike, as to the correct criteria for scoring student essays (Shaw & Weir, 2007). Secondly, this lack of consensus is compounded in EFL settings, where there is often a lack of basic training for non-native English-speaking teachers in assessing written English.

A survey of previous studies illustrates the lack of consensus among writing examiners with differing experience and backgrounds regarding their decision-making behaviors when rating ESL/EFL writing (Weir, 1983; Hamp-Lyons, 1991; O’Loughlin, 1992). Such differences in decision-making behaviors also exist between expert raters and native English-speaking EFL/ESL classroom teachers. Two of these differences are the extent of their previous experience with evaluating essays and the language features they choose to focus on. Weir (1983) states that when scoring student essays, instructors in English language programs ascribe great importance to the mechanics of writing, whereas experienced raters attend primarily to content and organization. Examining how language-trained experts and teachers in other disciplines assess writing by non-native English-speaking learners in an EFL context, Hamp-Lyons (1991) observes that native English-speaking teachers are more concerned with the rhetorical features of writing, whereas language-trained experts and teachers in other disciplines are primarily concerned with content. In a similar study contrasting EFL teachers and teachers of English literature, O’Loughlin (1992) states that when grading native and non-native speaker essays, EFL teachers focus more on grammar and cohesion, while teachers of literature focus more on content.

Numerous studies have explored the subjectivity and inconsistency of writing examiners (Wigglesworth, 1993; Weigle, 1994; Wolfe, 1997; Tedick, 2002). However, little research has addressed the best practices that examiners use in rating standardized test essays. Furthermore, only a few studies have been conducted on the differences in scoring by native and non-native English-speaking teachers of English. In addition to this lack of theoretical background in scoring writing among non-native English-speaking (NNES) teachers, scoring itself may not be emphasized in some EFL teaching contexts. In an EFL context such as Vietnam, the need for expertise in assessing writing by NNES teachers becomes very important: there is no pool of native English-speaking (NES) raters available, and NNES teachers may not have been provided with adequate knowledge of scoring theories and the training necessary to apply them.

1.1.2. Lack of Training for Non-Native English-Speaking Teachers in EFL Teaching Contexts

In Vietnam, the student teacher of English receives little or no training in the rating of writing in general, and for standardized tests in particular. The curriculum of English Teaching majors in Vietnamese universities does not incorporate knowledge of the scoring of written composition, including areas such as rater characteristics, the rating process, rating conditions, rater training, post-exam adjustment, or the grading and awarding of scores. Therefore, the teacher of English, in both public and private settings, usually lacks sufficient training to rate essays for standardized tests properly. This lack of training in scoring can put Vietnamese teachers of English at a substantial disadvantage in comparison to NES teachers, who most likely have received such training and also have the advantage of intuitive proficiency in the language.

In addition to the lack of training of raters in the Vietnamese setting, the demands of scoring essays can be somewhat formidable for NNES teachers in general, especially those in universities and private language schools. These Vietnamese lecturers of English are confronted by a significant scoring load due to the increasing demand for the scoring of standardized tests, such as the IELTS and the TOEFL-iBT. For example, during my two years teaching writing in a language school, I scored around seventy 250-word essays per week; correcting essays took approximately five hours each day. Although I utilized the IELTS and TOEFL-iBT rubrics published by the IELTS and TOEFL-iBT testing agencies to guide my teaching and assess my students’ writing throughout the course, my students scored lower than I had expected on the actual exams. This was a disappointment to me. Therefore, I wished to acquire an accurate understanding of the IELTS and TOEFL-iBT writing rubrics and of how IELTS and TOEFL-iBT examiners score student responses, so that I could improve my teaching and better evaluate my students’ writing.

Generally, as a result of NNES teachers’ lack of training and theoretical knowledge when scoring compositions, their scoring methods may not be sufficient to the task, irrespective of the time and effort devoted to it, and thereby remain somewhat ineffective. The current research seeks to remedy this situation by establishing the optimal practices in scoring compositions.

From the above-mentioned theoretical background and its practical implications, it is clear that the development of a proper understanding of the best practices that standardized test raters should use when scoring English language essays is of crucial importance to all writing examiners, whether NES or NNES. Proper rating must necessarily include a framework of decision-making behaviors and rating instruments. Whereas such a methodology is needed by all writing examiners, it is of particular consequence for non-native EFL raters, who may possess neither the advantage of intuitive language proficiency nor proper training, yet are confronted by a significant demand for such knowledge in their job assignments. This study addresses the best practices that raters of standardized tests, notably the IELTS and the TOEFL-iBT, should use when they score English language essays and suggests optimal methods for the application of these best practices by NNES raters in EFL teaching contexts.


1.2. Aims and Objectives

This research examines the best practices that standardized test raters use when they score English language essays. A framework of decision-making behaviors and rating instruments is proposed, enabling language proficiency raters to align their decision-making behaviors with the rating requirements, thereby enhancing the validity and consistency of their scoring of EFL writing. This study also aims to propose practical and feasible solutions to the scoring of standardized tests, such as the IELTS and the TOEFL-iBT, by NNES raters in the English language teaching context of Vietnam.

1.3. Research Questions

The current research seeks to answer two questions:

1. What are the best practices that raters use when they score standardized test essays?
2. How can non-native English-speaking teachers acquire these best practices in rating standardized test essays?


Chapter Two
METHODOLOGY

This chapter presents the methods used to collect the data from the literature review and to conduct a preliminary investigation into the rating systems of three experienced raters.

2.1. Data Collection

The current research is based primarily on a review of data from previous studies. A thorough review of the current literature regarding the scoring of standardized test essays provided this study with essential information on the best rating practices to utilize when evaluating student writing. In addition, a small set of anecdotal data from questionnaires and interviews with experienced raters of writing is used to supplement key ideas in the literature review.

2.2. Literature Review

The key terms of the current study were researched using the online search tools provided at the Saint Michael’s College Library: the library search tools, ProQuest Dissertations, Google Scholar, and WorldCat. The original search terms included scoring validity, scoring criteria, rating scale, and rater characteristics. Abstracts of articles, dissertations, and books found through these search tools were skimmed before twenty articles, dissertations, and books from reliable sources, notably TESOL Quarterly, Written Communication, and Cambridge University Press, were selected for detailed reading. These sources were chosen for what was deemed to be their relevance to the decision-making processes in the scoring of standardized tests. Some of the selected sources detailed a descriptive framework of the decision-making behaviors to utilize when scoring standardized tests, while others presented studies on aspects of scoring validity, namely scoring criteria, rating scale, rater characteristics, rating process, rating conditions, rater training, post-exam adjustment, and grading and awarding. Also, as the current study proposes to address scoring practices on standardized tests such as the IELTS and the TOEFL-iBT, sources on rater training for the IELTS and the TOEFL-iBT were searched. Articles related to differences in the scoring of writing between NES and NNES teachers were selected, since current thinking regarding these differences is pivotal to answering the second research question: how NNES teachers can acquire the best practices in rating standardized test essays.

From the many sources obtained through library research, one became prominent in yielding insights regarding the best practices in the rating of writing tasks: Cumming, Kantor, and Powers’ (2002) Decision Making while Rating ESL/EFL Writing Tasks: A Descriptive Framework. This article introduces a framework of decision-making behaviors that raters should utilize when scoring writing tasks. This description helped the researcher realize, and made explicit, the complexity and the interactive facets of the decision-making behaviors involved in the evaluation of writing. Based on the information from the Cumming, Kantor, and Powers (2002) article, further research was conducted to obtain articles and books focusing on recommended rating procedures for the IELTS and the TOEFL-iBT, as the current study proposes to address the scoring practices in these standardized tests.

A second source that influenced the selection of content for this study was the book Examining Writing: Research and Practice in Assessing Second Language Writing by Shaw and Weir (2007). Shaw and Weir established the importance of several terms relevant to scoring, such as scoring validity, scoring criteria, rating scale, rater characteristics, rating process, and rater training. Information on the different standardized tests of the Cambridge ESOL testing agency, including the IELTS, was also provided. Shaw and Weir also made important recommendations on how to improve scoring performance. These terms and ideas, in turn, helped shape further research quests, utilizing search terms such as decision making of scorers in standardized tests and classroom tests, rubrics, IELTS writing rubrics, TOEFL-iBT writing rubrics, IELTS examiners, and TOEFL examiners. The results clustered around the best practices in rating standardized tests using analytic and holistic scoring. These ideas became essential parts of the outline of the current thesis. The best practices for the IELTS and the TOEFL-iBT have also been presented in a number of articles, books, and especially materials provided by the IELTS and TOEFL testing agencies. These resources are integral to the foundation of the reporting of the current research.

2.3. Questionnaire for Raters

2.3.1. Participants of the Study

The participants are three writing instructors who teach in both the Intensive English Program and the Academic English Program at Saint Michael’s College. They are, therefore, accustomed to evaluating and grading students’ written responses at all levels, from low intermediate through advanced. They include two males and one female, each of whom has been scoring ESL students’ essays for a minimum of five years. To protect the participants’ confidentiality, they will be referred to as Rater 1 (R1), Rater 2 (R2), and Rater 3 (R3).

2.3.2. Writing Sample Responses

As the current study seeks to address the scoring practices utilized by the IELTS and TOEFL-iBT testing agencies, writing sample responses from the IELTS and the TOEFL-iBT, hereafter referred to as the responses, are used. The four responses (see Appendix A) that were used in the data collection include one response each to IELTS Writing Task 1, IELTS Writing Task 2, TOEFL-iBT Integrated Writing, and TOEFL-iBT Independent Writing. The two IELTS responses were taken from the Cambridge IELTS 9 Self-study Pack (2013), and the two TOEFL-iBT responses were taken from TOEFL-iBT Writing Sample Response (2003). All four of these responses had been scored by expert scorers of the IELTS and TOEFL-iBT agencies.

2.3.3. Questionnaire for Experienced Raters

The questionnaire consists of seven questions addressing the decision-making behaviors that the raters employed when scoring the four responses. In the questionnaire for IELTS Writing Task 1, the first question asked how much time the raters had spent scoring the student response. The four subsequent questions probed the raters’ decision-making behaviors when arriving at a score for Task Achievement, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy. The questionnaire for the student response to IELTS Writing Task 2 follows a similar pattern. The questions for the student responses to the TOEFL-iBT Integrated and Independent Writing Tasks were designed to determine the amount of time that the raters had spent scoring these responses and their decision-making behaviors when arriving at a score for each response.

2.3.4. Questionnaire Procedures

The raters in this project were given the four above-mentioned responses from the two standardized tests, along with the IELTS and TOEFL-iBT rating criteria. The raters then scored the responses on their own, and according to their own schedules, in order to assign a level of writing proficiency. After scoring the responses, the raters answered a list of written questions (see Appendix B). These questions sought to determine the linguistic features that most shaped the raters’ decisions in determining the score for student writing proficiency. The raters were also asked to highlight words and phrases that illustrated certain linguistic features of the student responses. Neither instructions on how to use the rubrics nor recommended decision-making behaviors were given.

2.4. Interviews with Experienced Raters

In a follow-up interview session between this researcher and the raters, there was further examination of which language features were deemed pertinent in the scoring and of how the scoring instruments were applied. Data from the interviews, to a certain extent, revealed further information about what actually happened in the scoring process. Topics such as the instructors’ scoring experience, the raters’ decision-making behaviors, the linguistic features of high-scored and low-scored essays, the use of standardized rubrics, and the raters’ confidence in their scoring performance were investigated (see Appendix C). Factors that affect scoring performance, such as validity and reliability in scoring, the frequency of training or norming, and the raters’ perspectives on training or norming, were also explored. Data from the scoring of the responses and the questionnaires were interpreted in relation to the findings from the literature review. The data from this preliminary study are, in all cases, anecdotal, and no quantification of the data was conducted. This preliminary investigation may, hopefully, serve as the basis for future investigation.


Chapter Three
RESULTS AND DISCUSSION
This study explores the best practices that raters of standardized tests utilize when they score English language essays. Chapter 1 of this thesis presented a rationale for this study along with the aims and objectives of the research and the research questions. Chapter 2 provided details of the methodology utilized to answer the following research questions:
1. What are the best practices that raters use when they score standardized test essays?
2. How can non-native English-speaking teachers acquire the best practices in rating standardized test essays?
To address Research Question One, which seeks to identify the best practices that trained standardized test raters use when they score essays written for standardized tests, the current chapter reviews studies conducted on standardized test rating of the IELTS and the TOEFL-iBT and presents the results in Section 3.1, Components of Scoring, and Section 3.2, Rating Practices. Raters’ decision-making behaviors when rating standardized tests using analytic and holistic rating scales are presented in Sections 3.3 and 3.4, respectively. To address Research Question Two, regarding how non-native English-speaking teachers can acquire the best practices in scoring English language standardized test essays, Section 3.5 presents the differences in native and non-native-speaking EFL teachers’ evaluations of ESL writing, and Section 3.6 addresses rater training. In addition to the results from the literature review, the results from a small survey of raters utilizing questionnaires and interviews are subsequently presented. Chapter 4 indicates the limitations of the study and useful areas for further research.


PART A: RESULTS AND DISCUSSION FROM LITERATURE REVIEW
Results from the Literature Review - Research Question 1: What Are the Best Practices that Standardized Test Raters Use When They Score English Language Essays?
3.1. Components of Scoring
3.1.1. Scoring Validity
Weir (2005) asserts that scoring validity is an a posteriori (after-the-test) component of writing test performance. “Scoring validity is linked directly to both context and cognitive validity and is employed as a superordinate term for all aspects of reliability” (p. 47). This superordinate term includes such factors as the rating scale, rater characteristics, the rating process, rating conditions, rater training, post-exam adjustment, and grading and awarding. This definition of scoring validity is of crucial importance to the current study: it identifies the critical components of scoring validity and limits the search to a constrained number of sources.
3.1.2. Rating Criteria
A rating scale, which is key to achieving validity in scoring, is defined as a coherent set of criteria for evaluating student work that includes descriptions of levels of performance quality on the specified criteria (Brookhart, 2013). McNamara (1996) states that marking schemes are important in assessing writing performance because they epitomize the theory on which a given test is based, including the skills and abilities that the test developers aim to measure. The criteria and the descriptors for each criterion will, in turn, determine how the ratings are applied in arriving at a score for a student’s writing (Alderson, Clapham, & Wall, 1995). However, rubrics cannot be effective without rater training. Other studies cast doubt on the extent to which language descriptors can enable a consistent and reliable rater decision-making process. Rubrics may be no more than mnemonic devices through which raters of writing


are trying to match the score with the best description of the student work (Shaw & Weir, 2007). Unless all raters internalize the rating scale in a consensual way, each rater may demonstrate significant differences in scoring according to his or her background and preferences (Shaw & Weir, 2007). These findings further substantiate the importance of rater training in ensuring that a consensus on the application of language descriptors and appropriate rating behavior is attained.
3.1.3. Types of Rating Scales
Two main kinds of rubrics are currently used in second language testing: holistic and analytic. With holistic rubrics, all criteria are considered simultaneously (Brookhart, 2013). Since raters make judgments by forming an overall impression of a performance and matching it to the best-fitting descriptors on a scale, raters may save time by minimizing the number of decisions they must make (Tedick, 2002). White (1995) indicates that reading holistically is a more natural process than reading analytically. Holistic scoring, however, has a number of significant drawbacks. It does not necessarily lend itself to the provision of diagnostic feedback, the absence of which can cause substantial difficulties for second language writers (Charney, 1984). This is because “different aspects of writing ability develop at different rates for different writers” (Weigle, 2002, p. 114). Weigle further argues that the levels of proficiency indicated by holistic marking do not inform raters and teachers about the acquisition of the wide range of writing skills in student essays. In other words, holistic scoring does not allow raters to differentiate features of writing, such as the extent of lexical use, aspects of rhetorical organization, and grammatical accuracy and structures (Shaw & Weir, 2007). Both the advantages and disadvantages of holistic rating should be taken into account when assessing student responses using a holistic marking approach.


With analytic rubrics, as opposed to holistic ones, each criterion has separate performance descriptions (Brookhart, 2013). Numerous standardized and classroom writing tests capitalize on the analytic scale since it not only provides useful feedback on learner strengths and weaknesses (Hamp-Lyons, 1991), but is also advantageous for focusing rater judgments and ensuring an acceptable consensus among raters (Weir, 1990). Analytic scales, in particular, seem to lend themselves to the evaluation of second language writing because different features of writing develop at different rates (Weigle, 2002). Other researchers, however, indicate potential drawbacks associated with the construct validity of analytic rating scales. Tedick (2002) emphasizes that “the whole is greater than the sum of its parts” (p. 36), meaning that separate assessment of individual subscales does not contribute to authentic rating since it fails to assess “the whole of a performance”. Echoing Tedick, Hughes (1989) and Fulcher (2010) explicitly articulate that focusing on single aspects of a piece of writing may divert a rater’s attention from the overall effect. Whatever the strengths and weaknesses of analytic and holistic rating criteria, however, the testing agencies have their own theory and philosophy behind their choice of rating criteria. Any assessment of the rating criteria, therefore, should take into account the theory and philosophy espoused by the testing agencies, which is implicit in their rating criteria. The current research is limited to the investigation of rating practices that expert scorers employ when rating standardized test essays of the IELTS and the TOEFL-iBT. The IELTS writing test is given by Cambridge English for Speakers of Other Languages (Cambridge ESOL), which is part of the University of Cambridge and develops various standardized tests in the United Kingdom.
The TOEFL-iBT writing test is given by Educational Testing Service (ETS), an educational testing and assessment organization which provides English language assessments


and qualifications in the United States. Results from both the IELTS and the TOEFL-iBT tests are accepted by universities all over the world. The IELTS writing test employs an analytic approach to scoring, while the TOEFL-iBT writing test exemplifies a holistic approach to the evaluation of writing. In the following sections, a description of the two standardized tests - the IELTS and the TOEFL-iBT - is provided to help illuminate raters’ decision-making behaviors when scoring writing products of the IELTS and the TOEFL-iBT.
3.1.4. An Example of an Analytic Approach to Scoring - the IELTS Test
IELTS is a popular English-language test for study, work, and immigration in English-speaking countries. The IELTS Academic Test is designed for those seeking admission to undergraduate or postgraduate programs in which the language of instruction is English, as well as for advancement in employment. The IELTS Academic test includes four parts - Listening, Reading, Writing, and Speaking. IELTS results are graded on the IELTS 9-band scale. The IELTS Academic Writing test is 60 minutes long. It has two writing tasks requiring minimums of 150 words and 250 words, respectively. On Task 1 of the IELTS Academic Writing test, test-takers are asked to describe and interpret information from a graph, table, chart, or diagram in their own words. This task requires test-takers to interpret information provided by the visual aids; they are not asked to express their opinions or to speculate on the information. They are given 20 minutes to write at least 150 words. In Task 2, test-takers are presented with a point of view, argument, or problem, and they are asked to present and defend their opinions concerning a social issue. Test-takers are given 40 minutes to respond, using a minimum of 250 words. Task 1 essays are scored on the basis of four writing criteria: Task Achievement, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy. Task 2


essays are scored based on the criteria of Task Response, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy. Task Achievement assesses the extent to which the essay responds to the task requirements appropriately, accurately, and relevantly, with at least 150 words, within the 20-minute timeframe. The Coherence and Cohesion criterion examines the candidate’s ability to organize and link information, ideas, and language. Coherence is concerned with the linking of ideas through logical sequencing. Cohesion refers to the appropriate use of diverse cohesive devices, specifically logical connectors, pronoun referents, and conjunctions. Lexical Resource assesses the variety and appropriateness of vocabulary in the response. Grammatical Range and Accuracy refers to the range and accurate use of grammatical resources at the sentence level. The Task Response criterion on Task 2 requires the test-taker to formulate and develop a position in relation to a given prompt in the form of a question or a statement. Candidates should strengthen their position by using relevant evidence and examples, writing a minimum of 250 words in 40 minutes. In both tasks, the required minimum length must be strictly adhered to. The tasks in the Academic IELTS Writing Module are assessed independently of each other. Task 2 carries more weight than Task 1, but the exact weighting of each task is not published by the IELTS testing agency. IELTS test-takers receive scores on a scale from 1 to 9, with a score reported for each test component. The individual test scores are then averaged and rounded to produce an overall score according to a confidential IELTS score conversion table. Overall scores and individual test scores are reported in whole and half scores.


3.1.5. An Example of a Holistic Approach to Scoring - the TOEFL-iBT Test
The TOEFL-iBT test measures candidates’ ability to use and understand English at the university level. It evaluates how well examinees combine listening, reading, speaking, and writing skills to perform academic tasks. There are two writing tasks on the TOEFL-iBT. In addition to outlining, organizing, and developing ideas, TOEFL-iBT writing tasks require accurate and appropriate use of various grammatical features, idiomatic expressions, and mechanics. The first task is an integrated one, requiring test-takers to read, listen, and then write in response to what they have read and heard. Examinees have 20 minutes to plan and write their response. The response is assessed on the overall quality of the writing and on how well it presents the main points of the lecture and their relationship to the reading passage. Typically, an effective response ranges from 150 to 225 words. The second task is an independent one, requiring test-takers to support an opinion on a topic. Examinees are expected to present their points of view with relevant examples and convincing reasons in an appropriate discourse type that highlights the topic-centered nature of English writing. In the independent task, examinees have 30 minutes to plan, write, and revise their essay. Typically, an effective independent essay contains a minimum of 300 words. Each TOEFL-iBT writing response is graded on a scale from 0 to 5. As can be seen from the descriptions of the IELTS and the TOEFL-iBT, the tests serve specific purposes and have been systematized in terms of tasks, prompts, rating criteria, and scores. Given that rating scales embody the theory on which a given test is based, including the skills and abilities that the test developers aim to measure (McNamara, 1996), the IELTS and the TOEFL-iBT rating criteria represent the skills and abilities that their


testing agencies aim to measure in the test-takers. This implies that each rating behavior for these standardized tests should be purposefully conducted, conforming to the intentions of the test designers. In other words, the best practices in rating the IELTS and the TOEFL-iBT writing tasks should be those recommended by their respective testing agencies: Educational Testing Service (ETS) for the TOEFL-iBT and Cambridge English for Speakers of Other Languages (Cambridge ESOL) for the IELTS. The best practices for using analytic and holistic rating scales to score standardized writing tests - for this research, the IELTS and the TOEFL-iBT - will be examined. These rating practices are frequently revised and updated by the testing agencies and are those currently in use in rater training by both Cambridge ESOL and ETS.
3.2. Rating Practices
Rating practices, which are defined as the decision-making behaviors that raters attend to in arriving at a score for a script, have long been a concern in scoring validity. Unsystematic or ill-conceived scoring is likely to vitiate all the effort that has gone into improving other aspects of scoring validity (Alderson et al., 1995). How raters make decisions when scoring, therefore, is one of the most crucial facets of the marking procedure, and this process has received significant attention in the literature. A better understanding of raters’ decision-making behavior (DMB) is necessary for providing basic information for making more valid and reliable assessments (Milanovic, Saville, & Shen, 1996). Variation in raters’ DMB has been shown in many studies, notably Huot (1990) and McNamara (1996). McNamara identifies four factors leading to rater variance in marking behaviors:
1. Raters may be at variance because of their tendency to leniency.
2. Raters may display bias toward certain candidates or types of task.
3. Raters may exhibit inconsistent rating behaviors.
4. Raters may interpret and apply the rating scale instrument differently.
(adapted from McNamara, 1996)
Among these factors, the impact of rating scales on rater behaviors has been investigated

in a number of studies. McNamara (1996) highlights the importance of rating scales in assessing writing performance since rating scales embody the theory on which the test is based, including the skills and abilities that the test developers aim to measure. Furthermore, the scales and the descriptors for each scale level have been shown to determine how the ratings are applied in arriving at a score for a student’s writing (Alderson, Clapham, & Wall, 1995). As previously mentioned, this research is limited to the investigation of decision-making behaviors that expert scorers employ when rating the standardized test essays of the IELTS (analytic approach) and the TOEFL-iBT (holistic approach). In the following sections, raters’ decision-making behaviors when using analytic and holistic rating scales to score standardized writing tests, notably the IELTS and the TOEFL-iBT, are examined in detail.
3.3. Raters’ Decision-Making Behaviors When Rating Standardized Tests Using Analytic Rating Scales
3.3.1. Four Approaches to Analytic Scoring
The behaviors that raters attend to when scoring standardized tests using analytic rubrics have received significant attention in the literature. Milanovic and Saville (1994) examined different approaches utilized by 16 raters in evaluating the essays of two English tests through interviews, introspective verbal reports, and retrospective written reports. Their results identified four basic approaches to analytic scoring: the principled two-scan approach, the pragmatic two-scan approach, the read-through approach, and the provisional-mark approach. The principled two-scan/read approach is characterized by double scanning, or reading the essay twice, before deciding on a final mark. In this approach, the second reading plays the main role in determining the score for the essay. The pragmatic two-scan approach also utilizes a double reading, but the second reading occurs only when raters have difficulty scoring during the first reading. In the read-through approach, raters read an essay once, note its strong and weak points, and then assign a score. Raters adopting the provisional-mark approach read the script once, but they stop for an initial assessment, usually towards the beginning of the essay, before resuming reading to either confirm or reject the initial assessment.
3.3.2. Analytic Scoring Approaches to the IELTS Writing Component
Other significant findings on examiners’ DMBs in analytic rating are revealed in the IELTS Writing Revision Project (2006) carried out by Cambridge ESOL, a department of Cambridge University known for providing English language assessments, including the IELTS. This project provides information on how examiners interpret rating scales, what particular information raters attend to, how they reach their final judgments, and whether some criteria are more difficult to assess than others. These issues are manifest in the revised method of assessment for IELTS writing tasks (See Table 1). The word “script” in Table 1 means a response to a writing task.


Table 1: Revised Method of Assessment for IELTS Writing
1. Work through the four criteria in order, starting with Task Achievement or Task Response, noting length requirements.
2. For each criterion, start with the over-arching statement that most closely matches the appropriate features of the script.
3. Read through the most detailed features of performance at that band and match these to the script.
4. Check that all positive features of that band are evident in the script.
5. Check through the descriptors BELOW the band to ensure that there are no penalties/ceilings that are relevant.
6. Check through the descriptors ABOVE the band to ensure that the rating is accurate.
7. Where necessary, check the number of words and note the necessary penalty on the Answer Sheet.
8. Write the band for the criterion in the appropriate box on the Answer Sheet and ensure that any other relevant boxes are completed.
9. Rate all Task 1 and Task 2 responses together.
(Taken from Shaw & Weir, 2007, p. 274)
The IELTS Writing Revision Project (2006) states that IELTS raters most often adopt one of two marking approaches: a principled two-scan/read or a pragmatic two-scan/read. On the whole, the raters in the project adhere to the revised method of assessment for IELTS writing tasks. The raters first analyze the task requirements and identify the salient features of the task by addressing questions like: “What does the prompt call for?” or “What are the requirements of the task?” (p. 11). The raters are also advised to examine all responses to the first task before rating the responses to the second. The four examiners in the IELTS Writing Revision Project tended to review the task throughout their marking. This recurring behavior mirrors the results of previous research, which indicates that prior to assigning a score, raters


may read the prompt several times and note important information that should be included in student responses (Shaw, 2002). It is clear, then, that raters utilize the first reading of the text to gain an initial impression but pay particular attention to Task Achievement on Task 1 and Task Response on Task 2, as well as noting the length of the essay. Raters may ask themselves the following questions in reaching a score for a response:
1. Is the response balanced?
2. Is the response relevant, accurate and appropriate?
3. Does it present both sides of the argument?
4. Does it have a conclusion?
5. Is there a progression to the argument?
6. Is it copied from the rubric?
(Taken from IELTS Writing Revision Project, 2006, p. 11)
The last question, whether the text incorporates language from the rubric, is important

since candidates are required to write using their own words. They can paraphrase the prompts, but they should not copy from them directly. Responses which copy from the task prompts are, therefore, assigned lower scores. Similar to TOEFL-iBT holistic scoring, essay length is a compulsory element on the IELTS writing test. On IELTS Task 1, candidates are expected to write at least 150 words in 20 minutes. On IELTS Task 2, candidates are required to write at least 250 words in 40 minutes. Failure to attain the minimum length results in lower scores on the test. Shaw and Weir (2007) also highlight that “Guidelines for what constitutes a word and a ‘word count’ are a helpful aid in assessing IELTS scripts (or responses)” (p. 12). In


brief, essay length is a salient factor in evaluating both Task Achievement on Task 1 and Task Response on Task 2 of the IELTS writing tests. To reach a score for Task Achievement on Task 1 and Task Response on Task 2, raters read the rubrics from top to bottom to select the overall statement that is most closely aligned with the global features of the response. Raters then read the details of Task Achievement on Task 1 and Task Response on Task 2 in order to determine whether sufficient detail is manifest in the response. Raters also read the descriptions above and below the score to ensure that their evaluation is accurate and that there are no relevant penalties. The three raters in the above-mentioned IELTS Writing Revision Project then asked themselves a number of questions before they awarded a score for a response. To illustrate, with regard to Coherence and Cohesion, the raters addressed whether the writing was logically organized, whether it had clear progression and paragraphing, and whether a range of cohesive devices was manifest. In assigning a score for Lexical Resource, the raters considered occasional errors and the intelligibility of the words; the range of vocabulary was also given consideration. In terms of Grammatical Range and Accuracy, the raters studied complex rhetorical structures and punctuation issues and assessed whether a broad range of sentence structures was used in order to yield a higher score. Regardless of the various scoring strategies employed, the raters followed the same behavior in assigning a score. The raters all referred to the rubrics and searched for the descriptions that most closely matched the features of the responses. They then sought to confirm or revise their score assignment by considering the descriptors above and below the score they had selected.
In addition, the raters were asked to write the score for the response in a box on the answer sheet and were required to note the number of words in the essay, as well as to add any penalties as necessary.


In comparison to Cumming, Kantor, and Powers’ (2002) descriptive framework, the revised method of assessment for IELTS Writing seems more specific and orderly. The framework presents a thorough and detailed overview of raters’ DMB, while the revised method of assessment directs raters’ attention to the salient features of the rating scales. In the revised method of assessment, suggestions for rater DMBs are provided, thus promoting unity in scoring. For example, raters are expected to analyze the task requirements before attempting to rate the responses. It is also advised that raters examine all responses to the first task before moving on to the second. Rating criteria like Task Achievement and Task Response require consideration before Lexical Resource and Grammatical Range and Accuracy. Both the revised method of assessment and the descriptive framework reveal that several scoring behaviors vary depending on the use of rubrics. That is, when raters of standardized tests rate an essay by comparing the student writing against rubrics rather than against their own criteria, they are more likely to attain consensus on the application and interpretation of rating scales; as a result, their DMBs become more consistent. In other words, the descriptors on the rating scales of standardized tests play a critical part in determining how the rating is applied. Generally, both the revised method of assessment and the descriptive framework capture the main aspects of the process of decision-making behavior. However, the revised method of assessment of Cambridge ESOL includes the use of rubrics, which facilitates greater unanimity of rating behaviors.


3.4. Raters’ Decision-Making Behaviors When Rating Standardized Tests Using Holistic Rating Scales
3.4.1. Cumming, Kantor, and Powers’ (2002) Framework of Raters’ Decision-Making Behaviors
As with their analytic rubric counterparts, rating behaviors using holistic rubrics when scoring standardized tests have been prominently featured in the literature. Cumming, Kantor, and Powers (2002) propose a set of beliefs, ideas, and rules that should be used as the basis for making rating decisions - a so-called framework of raters’ decision-making behaviors - as benchmarks in the rating of TOEFL writing tasks (See Table 2). This framework was based on the think-aloud protocols of 10 experienced ESL/EFL raters and 7 highly experienced English-mother-tongue composition raters. According to this framework, raters tend to employ self-monitoring strategies, such as interpretation and judgment. Interpretation strategies are reading strategies that aim at overall comprehension of the composition; they also include reading or interpreting prompts as well as envisioning the personal situation of the writer. Raters also use judgment strategies - evaluation strategies for formulating a rating, or score. Raters then decide on strategies for reading and rating, or compare a composition with other compositions: the majority of raters simply start reading and assessing the compositions in the sequence in which they are presented, whereas others consider the shortest papers first. Along with interpretation and judgment strategies, raters focus on rhetoric, ideas, and language. Cumming, Kantor, and Powers conclude that decision-making is an interactive and multifaceted process: raters carry out both interpretation and judgment processes while attending to various aspects of rhetoric and ideas as well as language use.


3.4.2. Holistic Scoring Approaches to the TOEFL-iBT Writing Component
It should be noted that the raters in Cumming, Kantor, and Powers’ (2002) research “are left to their own devices and judgments as to how to rate the composition” (p. 71). Neither rubrics nor benchmark papers were provided to them. In other words, these raters were not simultaneously considering the same criteria when scoring the writing samples. Through this non-use of rating scales, the researchers were able to investigate the self-monitoring strategies and the rhetorical, ideational, and language features that the experienced raters attend to. At the same time, however, the researchers may have overlooked certain potential behaviors that raters might have employed had they been provided with more uniform criteria. The descriptive framework of decision-making behaviors while rating TOEFL writing tasks in Table 2 presents a thorough and detailed overview of examiners’ DMB. The framework helps raise raters’ awareness of the rating process, including judgment and interpretation strategies as well as those features of student responses that they need to attend to in assigning a score. As can be seen from the framework, rater behaviors are diverse, and there is no single approach to assigning a score. Some raters may read or interpret prompts, task input, or both. Some might consider their own personal responses or biases, whereas others define or revise their own criteria. Other raters may prefer classifying errors into types while summarizing ideas or propositions. Each rater approaches a piece of writing from a different point of view and with a different focus.


Table 2: Descriptive Framework of Decision-Making Behaviors while Rating TOEFL Writing Tasks
Interpretation Strategies
Self-Monitoring Focus: Read or interpret prompt or task input or both; Read or reread composition; Envision personal situation of the writer
Rhetorical and Ideational Focus: Discern rhetorical structure; Summarize ideas or propositions; Scan whole composition or observe layout
Language Focus: Classify errors into types; Interpret or edit ambiguous or unclear patterns
Judgment Strategies
Self-Monitoring Focus: Decide on macro-strategy for reading and rating, compare with other compositions, or summarize, distinguish, or tally judgments collectively; Consider own personal responses or bias; Define or revise own criteria; Articulate general impression; Articulate or revise scoring decision
Rhetorical and Ideational Focus: Assess reasoning, logic, or topic development; Assess task completion or relevance; Assess coherence and identify redundancies; Assess interest, originality, or creativity; Assess text organization, style, register, discourse functions, or genre; Consider use and understanding of source material; Rate ideas or rhetoric
Language Focus: Assess quantity of total written production; Assess comprehensibility and fluency; Consider frequency and gravity of errors; Consider lexis; Consider syntax and morphology; Consider spelling or punctuation; Rate language overall
(Taken from Cumming, Kantor, & Powers, 2002, p. 77)


The framework in Cumming, Kantor, and Powers’ (2002) study (See Table 2) is expected to serve as a foundation for the TOEFL rating criteria as well as a tool for the training of raters. When training novice raters, trainers should be aware that each rater brings his or her own rating experience, language background, and personal preferences to the rating process. These differences must be taken into account in training raters in the scoring of standardized tests. It is also important to propose ways to help level the considerable differences among raters. Even when a given approach is recommended, and attempts at leveling are made, a wide range of variables may still influence raters’ performance. This variance in rating behaviors calls for the adoption of a recommended framework that meets the demands of the specific standardized test; in addition, each rating behavior in the recommended framework should serve a pre-determined purpose. Another approach to the decision-making processes of examiners using holistic rubrics is the model of Milanovic, Saville, and Shen (1996). Consistent with the findings of Cumming, Kantor, and Powers (2002) and other studies on holistic rating, Milanovic, Saville, and Shen’s model sets out the following steps for examiners using holistic rubrics (See Table 3).
Table 3: Milanovic, Saville, and Shen’s (1996) Model of Decision-Making Processes of Examiners Using Holistic Rubrics
1. internalize marking schemes and interpret the tasks;
2. scan for length, format, handwriting and organization;
3. read quickly to establish level of comprehensibility;
4. rate by assessing relevance, development of topic, coherence and organization, error distribution, command of syntactic complexity, and mechanics;
5. reassess and revise if necessary; then
6. decide the final mark (as cited in Sakyi, 2003, p. 24).


The above-mentioned model is based on the twenty-eight decision-making behaviors in Cumming's (1998) study, and it provides a step-by-step list for raters to follow in holistic rating. According to this list, internalizing a marking scheme and interpreting the tasks are the first rating behaviors which require attention. This appears reasonable, since internalizing the marking scheme and interpreting the tasks are skills that may not be easily acquired through the rater's own practice and experience. Materials on rater training for standardized tests reveal that the internalization of marking schemes and the interpretation of tasks develop in expert raters through a number of steps. Raters have to first understand the descriptors on the rubrics in the ways intended by the test developers. Through benchmark essays, or prototypical example essays, raters can ascertain why one criterion is chosen over other criteria, and what terms such as "successfully select the important information", "coherently and accurately presents this information", "generally good in selecting the important information", or "vagueness" (p. 1, TOEFL-iBT rubrics) mean. Overall, the internalization of marking schemes and the interpretation of tasks should be prioritized over other steps in rating. In addition, raters need to be trained in rubric internalization and task interpretation so that they may shape their perspectives according to recommended behaviors, thereby gradually eliminating bias or personal opinions stemming from differences in rater experience and language background. According to Milanovic, Saville, and Shen's (1996) model, essay length is another important feature in the evaluation of essays, as a minimum length for the essay is assigned on the TOEFL-iBT writing rubrics. On the TOEFL-iBT Integrated task, candidates are expected to write from 150 to 225 words in 20 minutes. On the TOEFL-iBT Independent task, candidates are required to write at least 300 words in 30 minutes. Without a required minimum essay length,


some candidates write less than others in order to avoid making errors. A required essay length helps to ensure equity among candidates, as well as allowing raters to be exposed to a wide range of writing proficiencies within uniform lengths. The third and fourth steps of Milanovic, Saville, and Shen's (1996) model include reading to establish level of comprehensibility and assessing features of the writing. These steps roughly correspond to three processes espoused by Freedman (1981): building text image, evaluating the text image, and articulating judgment (as cited in Wolfe, 1997, p. 10). The works of Milanovic, Saville, and Shen and of Freedman both support White's (1984) assertion that rating essays is basically a problem-solving activity. Overall, as Sakyi (2003) comments, the model in Milanovic, Saville, and Shen's studies captures the two main aspects of marking: (1) the process of examiner decision-making behavior and (2) the composition characteristics to which examiners must direct their focus. Holistic rating practices, however, may lead to scoring inconsistency. Sakyi (2003) reveals that raters who follow holistic scoring criteria, such as that of the TOEFL-iBT, tend to rely on either the content or the language of the essay in determining its score. That is to say, in contrast to scoring from an analytic rating scale, where raters can assign a different score for different features of an essay, all features are considered simultaneously and represented by one score when using a holistic rating scale. There are no separate descriptors for different features such as organization, content, or language. As a result, raters may focus on any criterion or feature in the descriptors, since no pre-defined priority is given. This, in turn, may lead raters to fall back on their own scoring practices in assigning a score to a response. These individual scoring practices may not align with what the testing agency aims to measure, and may lead to lower inter-rater consistency – the similarity


of the marks awarded by different examiners. In other words, holistic scoring criteria may increase the chance of inconsistency in scoring, particularly inter-rater inconsistency. The above discussion of scoring scales for writing highlights the best practices in rating standardized tests using the holistic and the analytic scoring approaches. These practices are gathered from the expertise and experience of expert scorers whose work has been subject to regular and frequent review and revision by Cambridge ESOL and ETS. Cambridge ESOL and ETS intend these practices to be promoted in rater training (IELTS Writing Revision Project, 2006; Milanovic, Saville, & Shen, 1996). The current research recommends the utilization of these rating practices for improving rating performance, whether using analytic or holistic rating criteria. Based on an understanding of these best practices, this study seeks to answer the question of whether NNES raters are able to become good scorers, provided that they apply these rating practices. It also addresses possible methods for helping NNES raters acquire these best rating practices. In the sections that follow, differences in native and nonnative-speaking EFL teachers' evaluations of ESL writing are presented in order to determine whether there is an insurmountable gap between NES and NNES teachers and what the nature of these differences is. The following section also addresses whether and how training methods for scoring, notably norming, paired rating, and rating scales, aid NNES raters in acquiring the best rating practices.


Results from the Literature Review - Research Question 2: How Can Non-Native English-Speaking Teachers Acquire the Best Rating Practices?

3.5. Differences in Native and Nonnative-Speaking EFL Teachers' Evaluations of ESL Writing

Previous studies point out differences in scoring between native and non-native English-speaking (NES and NNES) raters, and these variations may be attributable to differences in their language proficiency. Shi (2001) compares the differences in native and nonnative-speaking EFL teachers' evaluations of Chinese students' writing in English. Shi states that NES teachers attend more positively to content and language, whereas NNES teachers attend more to the organization and length of the essays. His study further suggests that NNES instructors appear less focused on language quality than NES raters because NNES raters lack confidence, and possibly ability, in their English language use. In other words, some NNES raters are aware of their lack of knowledge and training in the English language, and thus are afraid to comment on the language quality of the essay, or they may lack the ability to do so. In another study that questions nonnative ESL teachers' ability to correct English composition, Kobayashi (1992) indicates that native English-speaking raters are more likely to examine grammaticality and provide accurate corrections than their Japanese-speaking counterparts, who might accept grammatically correct but awkward sentences. In addition, native English-speaking teachers seem to evaluate clarity and organization more positively than their Japanese counterparts. Kobayashi further asserts that "whereas a native ESL instructor can judge the acceptability of certain expressions by intuition, drawing upon implicit knowledge, nonnative ESL instructors depend greatly on their explicit knowledge of prescriptive grammar unless they have had exposure sufficient to develop intuitions for written English" (p. 82). In other words,


according to this assertion, NNES raters are capable of becoming competent raters for writing tests if they are exposed to written English to the extent that they develop intuition for it. This view aligns with Bowden, Steinhauer, Sanz, and Ullman's (2013) assertion that native-like proficiency in syntax can be acquired by typical university foreign-language learners. In Bowden, Steinhauer, Sanz, and Ullman's study, advanced L2 students with over three years of college classroom experience and one to two semesters abroad achieved substantially native-like brain processing of syntax. The more raters know about written English, the more readily they can notice grammar and mechanics, lexical resources, and syntactic use in a student response and attend to other aspects of writing, such as coherence, cohesion, and task achievement. NNES raters, therefore, may be able to evaluate and score student essays as accurately as NES raters, provided that the NNES rater has acquired native-like proficiency in written English. Unfortunately, there are few studies on whether NNES raters are as proficient as NES raters in cases where NNES raters have a sufficient foundation in written English, including grammar, vocabulary, and syntax. Likewise, it is largely unexplored how much exposure is needed for NNES teachers to achieve native-like proficiency. In other words, native-like proficiency in English writing has not been clearly defined. The testing agency of the Pearson Test of English Academic, for example, requires that raters have English as a native language. Nonnative English speakers must have native-like proficiency, meaning their English proficiency should be nearly equal to that of educated native speakers. As a minimum requirement at Cambridge ESOL, examiners are selected based on educational background (including teaching qualifications), teaching and/or examining experience, age, and overall language proficiency. Generally, provided that NNES raters have this native-like proficiency, it is likely that the testing agencies will admit them as raters. Also, provided that NNES raters are given the


same training in the recommended rating practices, they are capable of becoming expert scorers of standardized writing tests. A question that remains unanswered concerns which kinds of support are offered by Cambridge ESOL and ETS in order to improve the rating performance of raters in general, and of NNES raters in particular. These kinds of support include norming, peer rating, Multi-faceted Rasch measurement, and the rating scale. The benefits of scoring training and the scoring methods in current use are discussed in the following section.

3.6. Rater Training

3.6.1. Benefits of Training

A review of the literature indicates that although debate remains on the effects of training on scoring, training is generally believed to benefit raters in a number of ways. Weigle (1994) indicates three benefits of such training. First, training raises raters' awareness of features that have been highlighted by test designers and testing agencies. As previously mentioned, rating criteria have been written and revised by expert scorers. If a certain criterion is valued in the rubrics, raters should know where and how this criterion is reflected in the writing. Shaw and Weir (2007) highlight this benefit of training, since the descriptors on the rating scale may contradict raters' previous scoring practices. Training also helps adjust rater expectations of examinees' essays. In training, student responses at different levels of competence are analyzed, discussed, and scored in rater meetings. Through reading and discussing the benchmark essays, raters acquire the testing agencies' expectations regarding criteria and levels of performance, and knowledge of the linguistic features that indicate a certain level of proficiency. Second, training provides a comparative reference group of other raters. In other words, a record of raters' individual scoring performances, including their accuracy and consistency, is collected throughout the training. This is important for improving raters' scoring consistency


because from this record, administrators monitor scoring progress and take prompt action in case a rater's scoring differs significantly from that of other raters. Finally, rater training is of great benefit in analytic scoring. This is logical, since analytic scoring generates diagnostic information on student language proficiency levels, thus allowing raters to reach a consensus on understanding the rubrics (Shaw & Weir, 2007). Davies (1999) confirms that scoring performance can be improved as a result of proper rater training, including multiple ratings. In general, training can be said to be advantageous in improving rating performance with the use of standardized test rubrics.

3.6.2. Kinds of Rater Consistency

There are two kinds of rater consistency that training seeks to achieve: intra-rater consistency and inter-rater consistency. Intra-rater consistency refers to the consistency of an individual examiner's scoring, whereas inter-rater consistency refers to the similarity of the marks awarded by different examiners. Intra-rater consistency is believed to be promoted during training, as raters become more consistent in their marking approach (Lunz, Wright, & Linacre, 1990) and less harsh in awarding their scores after successive iterations of training (Shaw, 2002). Regarding inter-rater reliability, peer pressure is a major factor contributing to its improvement. Cooper (1977) points out that peer pressure supports and enhances inter-rater reliability. Wigglesworth (1993) further emphasizes that feedback from expert raters "engenders a tendency among raters to become increasingly more reflective of their cognitive processes during marking" (p. 185). However, while intra-rater consistency can improve through training in scoring, inter-rater consistency is not likely to be achieved by the same methods. Shaw and Weir (2007)


indicate that it is difficult to eradicate differences based on individual traits, since traits such as leniency and severity are fixed. Lunz, Wright, and Linacre (1990) also remark that although examiners may adjust their severity and leniency to an acceptable level through training, these differences do exist and some remain despite the initial training. On balance, research reveals that training improves intra-rater consistency (the consistency of an individual examiner's marking) more than inter-rater consistency (the similarity of the marks awarded by different examiners). From his own experience, McNamara (1996) suggests that instead of improving both inter- and intra-rater consistency, rater training should try to "get raters become more focused and to encourage new examiners to be self-consistent" (p. 182). This is also the view adopted by Cambridge ESOL. As the current research seeks to address the rating practices currently used by Cambridge ESOL and ETS, it also adopts the view that rater training should encourage new examiners to be consistent with their own marking practices, rather than focusing on consistency between raters.

3.6.3. Scoring Training Methods

3.6.3.1. Norming

Norming is defined as "the practice of having planned, regular discussions with fellow faculty members to share and combine ideas and make decisions that will be carried out by all participants within their areas" (Sholars & Terreri, 2009). The norming process has been discussed in a number of materials, among which the norming protocol provided by teachers and administrators on the site Teaching Matters (http://www.teachingmatters.org/toolkit/norm-setting-protocol) includes clear instructions and videos that illustrate how the steps of a norming session are conducted in practice (See Table 4). These instructions and illustrations are beneficial to NNES


raters in EFL teaching contexts. NNES raters can utilize the resources from the site to plan workshops for scoring training.

Table 4: The Norming Process

Step 1: Review the Process
- Discuss the value of norming and scoring this writing
- Emphasize that measurement is only useful if scoring is consistent.

Step 2: Discuss the Prompt
- Read the grade level prompt or discuss the task that students were assigned.

Step 3: Review the Rubric
- Review dimension definitions
- Identify components within each dimension

Step 4: Review the Anchor Papers
- Read anchor papers over
- Review commentary on anchor papers in order to fully understand scoring

Step 5: Score Practice Papers
- Read a practice paper
- Score paper independently using rubric

Step 6: Compare Scores and Discuss
- Discuss impressions of student work
- Compare teacher scores

Step 7: Compare Scores to Expert
- Compare teacher scores to expert scores
- If discrepancy, refer back to rubric and anchor papers for insight
- Repeat scoring practice papers and comparing scores until high level of agreement is reached

(Taken from "Norming Process", n.d.)

3.6.3.2. Other Scoring Training Methods

Other training methods in current use by standardized testing centers include peer rating, Multi-faceted Rasch measurement, and the rating scale (Shaw & Weir, 2007). The Pearson Test of English Academic pairs raters during scoring as one of its training methods. Pearson suggests that through peer rating, both members of the pair gain knowledge by means of constant peer review and peer remediation. Thus, false interpretations of the scoring rubrics can be reduced or avoided.


Another training method currently in use is Multi-faceted Rasch Measurement. This tool is built on the view that rater leniency and severity are fixed characteristics of raters. Proponents of the Rasch approach advocate that instead of trying to make all raters award similar marks, testing agencies should determine the degree of raters' severity and leniency, from which the true score of the test taker can be calculated. In other words, individualized feedback on raters' rating behavior, combined with Multi-faceted Rasch Measurement, can enhance raters' own consistency (O'Sullivan & Rignall, 2007). The standardization effect of rating scales on rater judgments is another salient feature of scoring training that has been explored in a number of studies. In a study that analyzes differences between the marks awarded by trainee examiners and senior examiners, the rating criteria themselves are reported to have some standardizing effect on rating performance (Furneaux & Rignall, 2000). Using rubrics, some raters who have not undergone training are able to score similarly to senior examiners. This finding echoes Shaw's (2002) position that the marking scheme itself has a powerful standardizing impact on raters. In Shaw's study, half of the new examiners are unfamiliar with the rating criteria and yet are able to attain scores relatively similar to the standard score. It is significant that detailed and explicit descriptors of the rating criteria engender a standardizing effect even when a formalized training program is not conducted. Given their benefits, scoring training methods such as norming, peer rating, Multi-faceted Rasch measurement, and rating scales can be described as helpful in improving rating performance in general, and NNES rating in particular, in the quest to acquire the best scoring practices possible. These scoring methods should be considered by raters in EFL teaching contexts when planning scoring training.
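The severity-adjustment idea behind Multi-faceted Rasch Measurement can be sketched in greatly simplified form. The sketch below is only an illustration of the underlying intuition: the function names and the sample scores are hypothetical, and the actual many-facet Rasch model estimates examinee ability, task difficulty, and rater severity jointly on a logit scale rather than by simple mean deviations.

```python
# Greatly simplified illustration of the idea behind Multi-faceted Rasch
# Measurement: estimate each rater's fixed severity/leniency, then remove
# it from raw scores to approximate the test taker's "true" score.
# The real model works on a logit scale; mean deviation from the panel
# average is used here purely for illustration.

def rater_severity(scores_by_rater):
    """Estimate each rater's severity as the mean deviation of their
    scores from the panel mean for each essay (negative = harsh)."""
    raters = list(scores_by_rater)
    n_essays = len(next(iter(scores_by_rater.values())))
    panel_means = [
        sum(scores_by_rater[r][i] for r in raters) / len(raters)
        for i in range(n_essays)
    ]
    return {
        r: sum(scores_by_rater[r][i] - panel_means[i]
               for i in range(n_essays)) / n_essays
        for r in raters
    }

def adjusted_scores(scores_by_rater):
    """Remove each rater's severity offset from their raw scores."""
    severity = rater_severity(scores_by_rater)
    return {
        r: [round(s - severity[r], 2) for s in scores]
        for r, scores in scores_by_rater.items()
    }

# Hypothetical panel: rater B is consistently one band harsher.
scores = {"A": [7, 6, 8], "B": [6, 5, 7], "C": [7, 6, 8]}
print(rater_severity(scores))  # B's severity comes out negative (harsh)
print(adjusted_scores(scores))  # all raters agree after adjustment
```

Once each rater's offset is known, discrepant raw scores can be reconciled without forcing raters to abandon their individual (but self-consistent) marking styles, which is the point the text attributes to the Rasch approach.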


3.7. Summary of the Results of the Literature Review

The literature review has addressed the following research questions:

1. What are the best practices that standardized test raters use when they score English language essays?

2. How can non-native English teachers acquire these best practices?

In answering the first question, it was found that decision-making is an interactive and multifaceted process. Raters carry out both interpretation and judgment processes while attending to various aspects of rhetoric and ideas and of language use (Cummings, Kantor, & Powers, 2002). Extensive research on the behaviors and practices recommended for raters has been conducted by the test designers of both the IELTS (Cambridge ESOL) and the TOEFL-iBT (ETS). The purpose of Question 1 was also to identify the best practices for the training of NNES raters in rating standardized writing tests. In terms of the second question of how non-native English teachers can acquire these best practices for evaluating student writing, studies of ESL writing evaluation suggest that a lack of language proficiency is a barrier to non-native English-speaking raters. Studies on the training of both NES and NNES raters highlight the benefits of clarifying marking criteria, adjusting expectations of examinee responses, and providing a comparative reference group of raters (Weigle, 1994). Two kinds of rater consistency, intra-rater and inter-rater, are also discussed. More improvement is seen in intra-rater consistency than in inter-rater consistency as a result of training. The last section on rater training focuses on training methods currently used by standardized testing centers. These methods include norming, peer rating, Multi-faceted Rasch measurement, and the rating scale. Notably, the two models which provide the best rating practices are (1) the model suggested in the IELTS Writing Revision Project (2006) carried out


by Cambridge ESOL using the analytic scoring approach (See Table 1), and (2) Milanovic, Saville, and Shen's (1996) model of the decision-making processes of examiners using a holistic scoring approach (See Table 3). With clear instructions and illustrative videos, the norming process on the site teachingmatters.org (See Table 4) can be utilized by NNES raters in planning scoring training workshops. Lastly, the current study emphasizes that, together with applying the recommended rating practices, NNES raters are capable of becoming expert scorers for standardized tests, provided that they acquire native-like proficiency in writing. The above results are from the literature review. The second part of this study gathers information about how experienced NES raters arrive at a score for a writing response, using analytic and holistic rating criteria. It surveyed three experienced NES raters who teach in both the Intensive English Program and the Academic English Program at Saint Michael's College. It is hoped that this information will further illuminate the results from the literature review so that they can be used to develop effective materials and procedures for training NNES raters.


PART B: RESULTS OF THE QUESTIONNAIRES AND INTERVIEWS ABOUT NATIVE ENGLISH-SPEAKING RATERS' DECISION-MAKING BEHAVIOR WHEN RATING STANDARDIZED WRITING TESTS

3.8. Summary of Questionnaire Procedures

3.8.1. Participants

The participants in this project were three writing instructors who teach in both the Intensive English Program and the Academic English Program at Saint Michael's College. They include two males and one female, each of whom has been scoring ESL students' essays for a minimum of five years. The researcher asked the participants to self-rank their experience as raters of writing according to three levels: very experienced, experienced, and novice. The ratings were: Participants 1 and 2, experienced (five years and fourteen years, respectively), and Participant 3, very experienced (ten years). The participants were accustomed to evaluating and grading student classroom written responses at all levels, from low intermediate through advanced. In addition, all participants noted that they were more experienced in rating student placement essays, which are similar to the TOEFL-iBT Independent Writing Task. (It should be noted that student placement essays and the TOEFL-iBT Independent Writing Task do not utilize the same evaluation rubric.) The raters' familiarity with the TOEFL-iBT Independent Writing Task is not surprising, as the IELTS is only beginning to be used in the United States. To protect participants' confidentiality, they will be referred to only as Rater 1 (R1), Rater 2 (R2), and Rater 3 (R3).

3.8.2. Writing Sample Responses

The four writing sample responses in the data collection (See Appendix A), hereafter referred to as the responses, include one response each of IELTS Writing Task 1, IELTS Writing


Task 2, TOEFL-iBT Integrated Writing, and TOEFL-iBT Independent Writing. The two IELTS responses were taken from the Cambridge IELTS 9 Self-study Pack (2013), and the two TOEFL-iBT responses were taken from the TOEFL-iBT Writing Sample Response (2003). All four of these responses had been scored by expert scorers of the IELTS and TOEFL-iBT testing agencies. The two responses to IELTS Writing Task 1 and Task 2 will be referred to as IELTS Response 1 and IELTS Response 2, respectively. The two responses to TOEFL-iBT Independent and Integrated Writing will be referred to as TOEFL-iBT Response 3 and TOEFL-iBT Response 4, respectively.

3.8.3. Questionnaire Procedures

The three raters in this project were each given the four above-mentioned responses, as well as the rating criteria for the assigned tasks from the IELTS and TOEFL-iBT writing tests. The raters then scored the responses independently and according to their own schedules in order to assign a level of writing proficiency. After scoring the responses, the raters answered a list of written questions (See Appendix B). This follow-up questionnaire sought to determine the linguistic features that most shaped the raters' decisions in assigning a score for student writing proficiency. In addition to the questionnaire, the raters were asked to highlight words and phrases that illustrated certain linguistic features of the student response. No instructions were provided for either the use of the rubrics or the recommended decision-making behaviors for scoring.

3.9. IELTS Writing Task 1

3.9.1. Description of IELTS Writing Task 1

On IELTS Writing Task 1 (See Appendix A), test candidates are asked to describe visual information illustrated in four pie charts that contain data on the ages of the populations of Yemen and Italy in the year 2000, and the projections for these populations in 2050. Test-takers


are asked to summarize the information by selecting and reporting the main features, and to make comparisons where relevant. Their description is to be presented in their own words. They need to write 150 words in about 20 minutes. The responses to the IELTS Writing Tasks were taken from Cambridge IELTS 9 Self-study Pack (2013). IELTS candidates receive scores on a scale from 1 to 9, with a score being reported for each test component. The individual test scores are then averaged and rounded to produce an overall score according to a confidential IELTS score conversion table. Overall scores and individual test scores are reported in whole and half scores.

3.9.2. Description of the Scoring of the Response to IELTS Writing Task 1 (IELTS Response 1)

Table 5: Description of the Scoring of IELTS Response 1 (Scoring Scale: 1 - 9)

Criterion                       Examiner   R1 (15')   R2 (5')   R3 (15')
Averaged Score                  6          7.5        6         7.5
Task Achievement                -          7          5         7
Coherence & Cohesion            -          8          6         7
Lexical Resource                -          8          6         8
Grammatical Range & Accuracy    -          7          8         8

(Times in parentheses indicate the minutes each rater spent scoring the response; the Examiner's component scores were not reported.)
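The averaging-and-rounding step that produces the overall band can be sketched as follows. This is a hypothetical illustration only: the actual IELTS score conversion table is confidential, so simple rounding to the nearest half band is assumed here, and the function name is the author's own.

```python
# Hypothetical sketch of the IELTS overall-band computation: the four
# component scores are averaged and reported as a whole or half band.
# The real IELTS conversion table is confidential; rounding to the
# nearest 0.5 is assumed purely for illustration.

def overall_band(components):
    """Average the component scores and round to the nearest half band."""
    mean = sum(components) / len(components)
    return round(mean * 2) / 2

# R1's component scores for IELTS Response 1 (Task Achievement,
# Coherence & Cohesion, Lexical Resource, Grammatical Range & Accuracy)
print(overall_band([7, 8, 8, 7]))  # 7.5, matching R1's reported total
```

Under this assumed rounding, R1's and R3's component scores both yield the 7.5 totals they reported.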

IELTS Response 1 was scored on four writing criteria: Task Achievement, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy. The total score is the average of these four components. R1 spent 15 minutes scoring IELTS Response 1 and assigned scores of 7, 8, 8, and 7, respectively, on these criteria, for a total score of 7.5. R2 spent 5 minutes on IELTS Response 1 and assigned scores of 5, 6, 6, and 8, respectively, on the same criteria. The total score for IELTS Response 1


that R2 assigned was 6. R3 spent 15 minutes on IELTS Response 1 and assigned scores of 7, 7, 8, and 8, respectively, again on the same criteria. The total score for IELTS Response 1 that R3 assigned was 7.5. The examiner in Cambridge IELTS 9 Self-study Pack (2013) assigned a score of 6 to IELTS Response 1. The scores for the four components were not given. However, together with the total score, the examiner offered some remarks on each response component: Task Achievement, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy. These remarks provided clarification regarding how the examiner responded to the tasks and how the evaluation of the examiner differed from that of the participants. This is discussed below.

Generally, for IELTS Response 1, Rater 1 and Rater 3 assigned higher total scores than either the IELTS Examiner or Rater 2, while R2 assigned the same total score as the IELTS Examiner. With regard to time spent, R1 and R3 spent much more time on scoring than R2. R1 explained that this was because the task was new; hence, it took R1 additional time to interpret the task and the rubric before arriving at a score for the writing.

3.9.3. Report of the Participants' Responses and Researcher's Comments on IELTS Response 1

Table 6: Report on Task Achievement of IELTS Response 1
IELTS Response 1: Task Achievement - "Summarize the information by selecting and reporting the main features, and make comparisons where relevant."

R1 (7): The test taker did a good job of summarizing, selecting and reporting the main features as well as making relevant comparisons.

R2 (5): Data not fully exploited. Did not make appropriate comparisons. Details are all accurate but ignores important parts of the chart, e.g. older population in Italy vs. younger population in Yemen. More focus on key trends needed. Too detailed.

R3 (7): Addresses the task: sufficient details but has problem interpreting one chart.

IELTS Examiner (6, total score): Clear comparisons. Overview given although focuses only on one age group.

Researcher's Comments: R1, R3, and the Examiner gave positive overview responses to Task Achievement. R2 did not give an overview response. All four raters agreed that the data was not fully exploited, and more key trends should have been presented. R1 and the Examiner agreed that comparisons were appropriate. Meanwhile, R2 stated that comparisons were not appropriate. R3 did not comment on comparisons.
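Rater differences such as those in the component scores reported for IELTS Response 1 can be quantified with simple agreement statistics. The sketch below is the author's own illustration (the function names are hypothetical; the score values come from the raters' reported component scores): exact agreement counts identical band scores, while adjacent agreement tolerates a one-band difference.

```python
# Simple inter-rater agreement statistics over component scores.
# Exact agreement counts identical scores; adjacent agreement allows a
# difference of up to one band. Illustrative only; operational testing
# programs typically use correlation or many-facet Rasch analyses.

def exact_agreement(a, b):
    """Proportion of components on which two raters gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def adjacent_agreement(a, b, tolerance=1):
    """Proportion of components within `tolerance` bands of each other."""
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)

# Component scores for IELTS Response 1 (TA, C&C, LR, GRA)
r1 = [7, 8, 8, 7]
r3 = [7, 7, 8, 8]
print(exact_agreement(r1, r3))     # 0.5
print(adjacent_agreement(r1, r3))  # 1.0
```

On these values, R1 and R3 agree exactly on half of the components but are always within one band of each other, which is consistent with their identical averaged totals of 7.5.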

Table 7: Report on Coherence and Cohesion of IELTS Response 1
IELTS Response 1: Coherence and Cohesion

R1 (8): The test taker did a nice job of sequencing the summary. I also appreciated the range of discourse employed.

R2 (6): Good use of discourse markers. However, the selection and presentation of information was not done adequately.

R3 (7): Logical progression, correct use of cohesive devices.

IELTS Examiner (6, total score): The information is well-organized. A range of linking devices is used, e.g. "whereas, the latter country, while, we can see that".

Researcher's Comments: All three raters and the Examiner agreed on the test taker's good use of discourse markers. In assessing Coherence & Cohesion, it seems that discourse markers or linking devices were one of the important indicators of Coherence & Cohesion. R1 indicated the errors with the following phrases: 'the diagram show that, also, in + year, whereas, on the other hand, while, the projections, in contrast, overall, there is an upward trend.' R1, R3, and the Examiner agreed that the writing was well-organized. R2, on the other hand, stated that the selection and presentation of information was not done adequately.

Table 8: Report on Lexical Resource of IELTS Response 1
IELTS Response 1: Lexical Resource

R1 (8): This was another strong area for this test taker. I was impressed by many of the expressions and academic words that were used. In particular, I was impressed by the appropriate selection of words/expressions like respectively, upward trend, habitants, former and latter.

R2 (6): Vocabulary had several errors, including word forms and word choice. However, the range of vocabulary was quite good.

R3 (8): Fluent with some word choice errors.

IELTS Examiner (6, total score): Vocabulary is adequate for the task and generally accurate, although attempts to use less common words are less successful. A few errors in word formation, e.g. statistic (statistical); estimative (estimate), but they do not affect understanding.

Researcher's Comments: Raters differed greatly regarding the score for Lexical Resource. While R1 and R3 favorably reported the appropriate selection of words/expressions and assigned a relatively high score for Lexical Resource (8 out of 9), R2 and the Examiner assigned a lower score (6 out of 9). All raters indicated a few errors with word formation, word choice, and prepositions. R2 scored the writing down due to the several errors in the response.

Table 9: Report on Grammatical Range and Accuracy of IELTS Response 1

IELTS Response 1: Grammatical Range and Accuracy
R1 (7): Overall, this was not bad. The test taker lost points for a "few errors". In addition, some sections required a second reading due to errors with prepositions and punctuation.
R2 (8): Grammar was mostly perfect with one mistake.
R3 (8): Wide variety of structures with occasional errors.
IELTS Examiner (6, total score): Simple and complex sentence forms are produced with few grammatical errors, but the range of structures is rather restricted.
Researcher's Comments: The three raters and the Examiner agreed that there were a few errors with the grammar and structures used in the writing. R2, however, said there was only one grammatical error. The Examiner stated that the range of structures was rather restricted, while R3, on the other hand, complimented the wide range of structures utilized in the writing. R1 and R2 did not comment on the range of sentence structures.

3.9.4. Further Comments on the Scoring of Grammatical Range and Accuracy of IELTS Response 1

Grammatical Range and Accuracy is another item on the IELTS rubric where there were major differences between raters in the assignment of scores. The three raters of this study all reported positively on the Grammatical Range and Accuracy of the response. Two of them assigned a relatively high score for the use of grammar (8 out of 9), while the third was quite close (7). The IELTS Examiner, on the other hand, did not find that the response contained either 'a wide range of structures' (Score 8) or 'a variety of complex structures' (Score 7). Instead, the Examiner assigned a score of 6, indicative of 'a mix of simple and complex sentence forms' (Score 6). This has to do with the distinction between the accuracy and the complexity of the response: the incorporation of accurate grammar does not necessarily equate to the use of a wide range of varied and complex structures. Descriptors such as 'a wide range of structures with full flexibility and accuracy; uses a wide range of structures; uses a variety of complex structures; uses a mix of simple and complex sentence forms' need to be clarified and illustrated through examples so that raters interpret these phrases consistently.

Raters also differed regarding the number of errors that they reported. R2 stated that the grammar was mostly perfect with one mistake, while R1 reported that there were several errors. Indeed, the writing had one error with subject-verb agreement, but other grammar errors with articles (a, an, the) and prepositions (with) could also be found. It appears that the raters tended to score a response down if the writing had errors with subject-verb agreement, whereas errors with punctuation, articles, and prepositions seem not to have been noticed. This behavior may be explained by the fact that errors in punctuation, articles, and prepositions tend to be considered less severe and may take several readings to register.

3.10. IELTS Writing Task 2

3.10.1. Description of IELTS Writing Task 2

On IELTS Writing Task 2 (See Appendix A), test candidates are asked to answer the question 'To what extent do you agree or disagree?' The three sentences preceding this prompt present the discussion topic exactly. Candidates need to present information which is relevant to


the topic, in this particular case, the advantages and disadvantages associated with nuclear weapons and nuclear power. Relevant examples or evidence should be used to support their views. It is not enough for IELTS candidates to give only their personal opinion about this topic. They must additionally indicate the extent to which they agree or disagree with the use of nuclear technology, and follow the instruction which states 'Give reasons for your answer and include any relevant examples from your own knowledge or experience'. Candidates' answers optimally utilize academic language, and their responses should be clearly understandable to someone who has no special knowledge of the topic. The instruction provided is to write 250 words in 40 minutes. Task 2 of the IELTS Writing Test was taken from Sample Candidate Writing Scripts and Examiner Comments (2013).

IELTS candidates receive scores on a scale from 1 to 9, with a score being reported for each test component. The individual test scores are then averaged and rounded to produce an overall score according to a confidential IELTS score conversion table. Overall scores and individual test scores are reported in whole and half scores.

3.10.2. Description of the Scoring of the Response to IELTS Writing Task 2 (IELTS Response 2)

Table 10: Description of the Scoring of IELTS Response 2

IELTS Response 2 (Scoring Scale: 1-9)
Sample Score (IELTS Examiner's Score): 7

                                 Rater 1 (R1)   Rater 2 (R2)   Rater 3 (R3)
Averaged Score                        7              5              7
Task Response                         6.5            5              7
Coherence & Cohesion                  6              5              7
Lexical Resource                      8              6              8
Grammatical Range & Accuracy          8              5              7
Time                                  20'            8'             20'

As previously described, IELTS Response 2 was scored based on four writing criteria: Task Response, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy. The total score is the average of scores on these four components. Rater 1 spent 20 minutes on scoring IELTS Response 2 and assigned scores of 6.5, 6, 8, and 8, respectively, on these criteria. The total score that R1 assigned was just over 7 out of 9. Rater 2 spent 8 minutes on IELTS Response 2 and assigned scores of 5, 5, 6, and 5, respectively, on the same criteria. The total score for IELTS Response 2 assigned by R2 was 5 out of 9. Rater 3 spent 20 minutes on IELTS Response 2 and assigned scores of 7, 7, 8, and 7, respectively, again, on the same criteria. The total score that R3 assigned for IELTS Response 2 was 7. In the manual Sample Candidate Writing Scripts and Examiner Comments (2013), the IELTS Examiner assigned a score of 7 for IELTS Response 2. The individual scores of the four components were not provided. However, in the reporting of the total score, the Examiner made some remarks on each component: Task Response, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy. Generally, for IELTS Response 2, R1 and R3 assigned scores similar to the IELTS Examiner. R2 assigned a significantly lower score to this response than either the other raters or the IELTS Examiner. In this regard, it should be noted that R1 and R3 spent significantly more time on scoring this response than R2 did. This difference in length of scoring time may have contributed to the raters’ varied scores on this response. In other words, R2 might have assigned a lower score than the other raters since R2 did not spend equivalent time on the scoring procedures.
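The averaging described above can be sketched in code. This is an illustrative calculation only: the actual IELTS score conversion table is confidential, so the simple round-to-nearest-half rule used here is an assumption, not the official procedure.

```python
import math

def overall_band(component_scores, step=0.5):
    """Average the component scores and round to the nearest half band.

    Assumption: plain half-up rounding in steps of 0.5. The real IELTS
    conversion table is confidential, so this is only an approximation.
    """
    avg = sum(component_scores) / len(component_scores)
    return math.floor(avg / step + 0.5) * step

# Rater 1's component scores for IELTS Response 2 (Task Response 6.5,
# Coherence & Cohesion 6, Lexical Resource 8, Grammatical Range &
# Accuracy 8) average to 7.125, which this rule reports as band 7.0.
print(overall_band([6.5, 6, 8, 8]))
```

Under this assumed rule, R1's mean of 7.125 ("just over 7") is reported as a band of 7, matching the total score R1 assigned.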


3.10.3. Report of the Participants' Responses and Researcher's Comments to IELTS Response 2

Table 11: Report on Task Response of IELTS Response 2

IELTS Response 2: Task Response
Rater 1 (R1) (7): This is well-written, but misses the mark on how nuclear power maintains world peace ("Some parts ... more fully covered..."). It is also "repetitive".
Rater 2 (R2) (5): Has a lot of irrelevant information. The position is not always clear; insufficient detail to support the position.
Rater 3 (R3) (7): Key features present, discussed clearly and appropriately. Some repetition. More specifics needed.
IELTS Examiner (7, total score): The answer is well written and contains some good arguments. It does tend to repeat these arguments but the writer's point of view remains clear throughout.
Researcher's Comments: R1, R3, and the Examiner agreed that the answer was well-written, contained some good arguments, and addressed all parts of the task. R2, on the other hand, stated that the response addressed the task only partially and that the writer's position was not always clear. Three raters agreed that the response lacked sufficient detail to support the positions given. All three raters and the Examiner agreed that the response was repetitive; however, while R3 referred to some repetition, R2 pointed out that the answer had a lot of repetition.

Table 12: Report on Coherence and Cohesion of IELTS Response 2

IELTS Response 2: Coherence and Cohesion
R1 (6): I believe this is the area that needs the most work. Each "paragraph" is jumbled, lacking unity and clarity. I also would like a position in the introduction.
R2 (5): Lack of overall progression. Paragraphing needs to be developed more. The first three paragraphs should be combined and reduced. The fourth should be explained and split, and the fifth is too brief.
R3 (7): Advantages and disadvantages discussed; gives opinions.
IELTS Examiner (7, total score): The message is easy to follow and ideas are arranged well with good use of cohesive devices. There are minor problems with coherence and at times the expression is clumsy and imprecise.
Researcher's Comments: R1 and R2 specifically commented on the paragraphing of the essay, which needed development. R3 pointed out the main ideas covered in the essay but did not comment on progression or paragraphing. The Examiner commented on minor problems with coherence but did not indicate what the problems were. R2 and the Examiner both commented on the progression of the essay; however, while R2 indicated a lack of progression, the Examiner stated that the writer communicated the essay well despite some problems with coherence. The Examiner mentioned the good use of cohesive devices but stated that the expression was at times clumsy and imprecise. The other raters did not comment on cohesive devices.

Table 13: Report on Lexical Resource of IELTS Response 2

IELTS Response 2: Lexical Resource
R1 (8): Excellent. Like a native speaker – words/expressions used appropriately with context. A couple of errors that did not interfere with my comprehension of the test taker's message.
R2 (6): Many word choice errors, one spelling mistake; otherwise good academic vocab.
R3 (8): Sufficient range, with occasional inaccuracies in word choice, spelling.
IELTS Examiner (7, total score): Only small problems in the use of vocabulary, mainly in the areas of spelling and word choice.
Researcher's Comments: All three raters and the Examiner agreed that the range of vocabulary was adequate. R1 and R3 highly appreciated the use of vocabulary in the essay, while R2 scored the test taker's use of vocabulary down due to many errors with word choice. Raters pointed out some errors with word choice and spelling. R1 reported positively on the lexical use in the response, with comments such as "excellent" and like "a native speaker". R2 did not deny the writer's good use of academic lexicon, but stated that the writer had many word choice errors and spelling mistakes.

Table 14: Report on Grammatical Range and Accuracy of IELTS Response 2

IELTS Response 2: Grammatical Range and Accuracy
R1 (8): Fantastic. I could have scored this higher.
R2 (5): No sentence structure mistakes, generally complex and varied; however several article, participle and agreement errors.
R3 (7): Many complex structures. Some errors (run-ons).
IELTS Examiner (7, total score): There is a wide range of structures that are well handled.
Researcher's Comments: R2 assigned a low score of 5 for the grammatical range and accuracy of the essay, while the other two raters and the Examiner reported favorably on the use of grammar and assigned relatively high scores (7 and 8). All three raters and the Examiner agreed that there were a few errors with the grammar and structures used in the writing. R2 pointed out several errors with the use of articles, participles, and agreement. All raters agreed that the sentence structures used were complex and varied.

3.10.4. Further Comments on the Rating of Grammatical Range and Accuracy of IELTS Response 2

There were some inconsistencies in the raters' scoring of the Grammatical Range and Accuracy of IELTS Response 2. Rater 1 and Rater 3 assigned scores of 8 and 7, respectively, on this criterion. R1 mentioned that he/she could have scored the response higher, perhaps as a 9, indicating that the response "uses a wide range of structures with full flexibility and accuracy; rare minor errors occur only as 'slips'" (IELTS Writing Task 2 Rubrics). In fact, there were some errors in the writing, such as the wrong use of prepositions (danger for, a risk for), errors in punctuation (the writer did not use a comma before coordinating conjunctions), and incorrect adverb placement in the phrase (there really is no danger). Other errors, including articles (limitless supply, safe military power) and participles (if mishandles), could also be found. This suggests that such errors may not carry significant weight, provided that the writer already has a good overall command of the language.

R2, in particular, remarked that there are 'no sentence structure mistakes, [sentences are] generally complex and varied.' In actuality, a wide range of structures was used by the test taker in the response, including conditional sentences and a mix of simple and complex sentences (complex sentences with relative clauses; future continuous, passive or present continuous, and


conditional sentences). Indeed, there were occasional small grammatical and punctuation errors, but they did not impede raters from understanding the response. Generally, however, the raters did not provide a fully accurate evaluation of the test taker's grammatical range and accuracy.

3.11. TOEFL-iBT Integrated Writing Task

3.11.1. Description of the TOEFL-iBT Integrated Writing Task

The TOEFL-iBT Integrated Writing Task requires test takers to read, listen, and then write in response to what they have just read and heard. Test takers first read a passage that appears on the computer screen for three minutes. They must then listen to a three-minute lecture while taking notes, which will help them write the response. Finally, based on information from both the listening and the reading, test takers are asked to summarize the points made in the lecture and explain how these points (1) cast doubt on the points made in the reading, (2) challenge arguments made in the reading, (3) answer the problems raised in the reading, or (4) support the explanations given in the reading. Test takers have 20 minutes to plan and write their response. The response is assessed on the basis of the overall quality of the writing and the extent to which it presents the points in the lecture and their relationship to the reading passage. Typically, an effective response is from 150 to 225 words. TOEFL-iBT Integrated Task results are graded on a scale from 0 to 5.

The reading in the TOEFL-iBT Integrated Writing Task above provides test takers with information about the benefits of group work, such as sharing expertise and skills as well as making creative solutions and risky decisions. Other stated advantages include a feeling of authority over one's decisions and opportunities to be recognized by one's peers. The listening casts doubt on the assertions in the reading by indicating several disadvantages of group work.


The main disadvantage provided is that some team members receive recognition without any significant involvement in a group task. Another disadvantage is that it often takes groups a long time to reach a decision. In addition, a few influential members in a group may dominate the meeting, and their ideas are given preference over others. In their responses, test takers are supposed to summarize the points made in the lecture and explain how the lecturer casts doubt on the advantages of group work enumerated in the reading. In contrast to the IELTS Writing Tasks, the TOEFL-iBT Integrated Task provides test takers with some language which they may utilize in their responses.

The TOEFL-iBT Integrated Task also tests different sets of skills than the IELTS Writing Tasks: it tests candidates' abilities to read and listen, as well as their ability to connect opinions from the reading and the listening. This writing task addresses the integration of reading, listening, and writing skills as much as the writing itself.

3.11.2. Description of the Scoring of the Response to the TOEFL-iBT Integrated Writing Task (TOEFL-iBT Response 3)

Table 15: Description of the Scoring of TOEFL-iBT Response 3

TOEFL-iBT Response 3 (Scoring Scale: 1-5)
Sample Score (Examiner's Score): 5

            Rater 1 (R1)   Rater 2 (R2)   Rater 3 (R3)
Score            5             3.5             5
Time            10'             7'            10'

Rater 1 spent 10 minutes scoring Response 3, then assigned a score of 5 to the response. Rater 2 spent 7 minutes on the response, assigning a score of 3.5. Rater 3 spent 10 minutes on scoring and assigned a score of 5. The time that these raters spent on scoring TOEFL-iBT Integrated Writing was only half as much as that spent on scoring the IELTS Writing Tasks. Two


out of three raters (R1 and R3) assigned the same score for the TOEFL-iBT Integrated Writing Task, and this score was identical to that of the TOEFL-iBT Examiner (i.e. a score of 5). R2, on the other hand, assigned a much lower score to the response, a 3.5 out of 5.

3.11.3. Report of the Participants' Responses and Researcher's Comments to TOEFL-iBT Response 3

Table 16: Report on TOEFL-iBT Response 3

TOEFL-iBT Response 3: The TOEFL-iBT Integrated Writing Task
Rater 1 (R1) (5): No comments.
Rater 2 (R2) (3.5): Mostly clear and accurately related to task. However, imprecise connection to reading. A number of small errors in grammar, spelling and word choice.
Rater 3 (R3) (5): Good synthesis, well-organized; only occasional spelling, word form, grammar errors; no interference with clarity.
TOEFL-iBT Examiner (5): Once you read past what seem to be the results of poor typing, this Benchmark 5 does an excellent job of presenting the points about the contribution and recognition of group members as well as about the speed of group decisions. The final paragraph contains one noticeable error ("influent"), which is then used correctly two sentences later ("influential"). Overall, this is a successful response and scored within (though perhaps not at the top of) the 5 level.
Researcher's Comments: All raters agreed that the response addressed the task well and showed a clear connection between the listening and the reading. However, while R1, R3, and the Examiner pointed out the accuracy of the connection between the listening and the reading, R2 indicated some imprecise connections to the reading. All raters pointed out errors with grammar, spelling, and word choice in the response; however, as the Examiner commented, these errors did not impede readers from understanding it.

3.11.4. Further Comments on the Ratings of TOEFL-iBT Response 3

A number of grammatical and linguistic features of the response to the TOEFL-iBT Integrated Writing Task were highlighted by the raters. R3 pointed out some prominent language features, such as the use of hedging in the writing (might emerge, would be, might turn into). All raters also highlighted the fluency of the language used to develop coherence and cohesion, especially cohesive devices such as discourse markers (that is, first, second, third), repetition of key words, and the use of synonyms (the passage, the speaker, the lecturer). R3 also highlighted the summary word (this), which is used many times in the response.

Positive features such as these also figure prominently in Kennedy and Thorp's (2007) study of the linguistic features of IELTS writing scripts. Kennedy and Thorp analyzed 130 responses to IELTS Writing Task 2 in order to identify the linguistic nature of the writing at three proficiency levels: 8 (expert user), 6 (competent user), and 4 (limited user). According to Kennedy and Thorp (2007), Level 8 writers employ a significantly higher number of different words and a greater number of once-only words than the other groups. Level 8 scripts also show more involvement with the reader than Levels 6 and 4, utilizing methods such as rhetorical questions and hedging. Generally, high-scoring essays on the IELTS and TOEFL-iBT Writing Tasks are characterized by certain positive features, and these features, in turn, assist raters in assigning a score appropriately.

Regarding the errors made on TOEFL-iBT Response 3, different raters highlighted different kinds of errors. R1 indicated errors with spelling and word form. R2 pointed out a number of errors with word choice, spelling, and grammar. R3 indicated errors with subject-verb agreement, a feature of basic grammar that clearly needs to be acknowledged. However, despite highlighting these errors, the raters did not appear to make significant deductions for them in their scoring. This may be because the response was well written, with a high command of language.
According to Longman Preparation Course for the TOEFL® Test: iBT Writing (Deborah Phillips, 2007), easily-corrected mechanical errors, such as capitalization of the first word in a sentence, some spelling errors, and some minor grammatical errors, do not seem to play an


important role in scoring, provided that the overall level of grammatical and lexical competence of the writing is high.

3.12. TOEFL-iBT Independent Writing Task

3.12.1. Description of the TOEFL-iBT Independent Writing Task

The TOEFL-iBT Independent Task requires test takers to support an opinion on a topic. Test takers are expected to present their points of view with relevant examples and convincing reasons in an appropriate discourse style that highlights the topic-centered nature of English writing. In this independent task, test takers have 30 minutes to plan, write, and revise their essay. Typically, an effective independent essay contains a minimum of 300 words. TOEFL-iBT Independent Task results are graded on a scale from 0 to 5.

3.12.2. Description of the Scoring of the Response to the TOEFL-iBT Independent Writing Task (TOEFL-iBT Response 4)

Table 17: Description of the Scoring of TOEFL-iBT Response 4

TOEFL-iBT Response 4 (Scoring Scale: 1-5)
Sample Score (Examiner's Score): 3

            Rater 1 (R1)   Rater 2 (R2)   Rater 3 (R3)
Score           3.5             2              4
Time            10'             3'            10'

Rater 1 spent 10 minutes scoring the TOEFL-iBT Independent Writing Task, assigning a score of 3.5 to the response. Rater 2 spent 3 minutes on scoring this task and assigned the response a score of 2. Rater 3 spent 10 minutes on scoring the response and assigned it a score of 4. The Examiner in TOEFL-iBT Writing Sample Response (2003) assigned a score of 3 to the response.


3.12.3. Report of the Participants' Responses and Researcher's Comments to TOEFL-iBT Response 4

Table 18: Report on TOEFL-iBT Response 4

TOEFL-iBT Response 4 – TOEFL-iBT Independent Writing Task
Prompt: Do you agree or disagree with the following statement? Always telling the truth is the most important consideration in any relationship. Use specific reasons and examples to support your answer.
Rater 1 (R1) (3.5): There was good organization and examples here. The test taker stayed on task, although the idea of "relationship" could have been expressed more clearly. I also scored down due to the accumulation of errors that I have identified.
Rater 2 (R2) (2): Many sentence structure, grammar and vocabulary errors. Poorly organized and lacking sufficient connections between ideas. Some parts of the essay are not clear because of language issues.
Rater 3 (R3) (4): Topic/task addressed, not all fully elaborated; examples given; progression and coherence present, some unclear connections; errors in articles and word endings; no interference with meaning.
TOEFL-iBT Examiner (3): In this essay, the writer advances a number of points about telling the truth. First, the writer seems to equate telling the truth with a self-defeating notion of "disclosure" of trade secrets and then discusses some instances when a lie might be good. This is a fairly strong 3-level essay, but the problems in language use (the overall level of vocabulary, the control of phrase-level grammar, and some problems using connectives) are why this response was scored as a 3.
Researcher's Comments: All raters agreed that the response addressed the topic, although some opinions should have been developed further and examples given. The Examiner, R1, and R3 all agreed on the good organization of the response, while R2 scored the response down due to its poor organization.

3.12.4. Further Comments on the Ratings of TOEFL-iBT Response 4 Along with word choice, all raters highlighted other examples of good use of language on TOEFL-iBT Response 4, including vocabulary such as ‘tender, stabilize, accompanied results, illustrate.’ Overall, they seemed to have been favorably impressed with the level of vocabulary. Also, all raters focused on the use of cohesive devices in the response. Specifically, they highlighted cohesive devices such as ‘however, for example, when we were children, despite.’ In


fact, cohesive devices and word choice were both highly valued by all of the raters, and this was true on both the IELTS and the TOEFL-iBT writing tasks.

3.13. Discussion of the Questionnaire

3.13.1. Score Assignment by Raters

The three raters shared similar opinions about the major issues (strengths and weaknesses) of each response. Rater 1 and Rater 3 assigned the same score to three out of four responses, and these scores were in alignment with the IELTS and TOEFL-iBT Examiners' scores. Rater 2 assigned the same score as the IELTS Examiner on the response to IELTS Writing Task 1; otherwise, Rater 2 scored consistently lower than the others.

3.13.2. Importance of Norming

The data analysis points to several other important issues regarding the need to interpret the rubrics and task prompts, and these issues articulate the importance of norming when assessing writing. First, norming helps to explain and clarify a number of phrases contained in the descriptors on the rubrics. This, in turn, improves scoring reliability and promotes similar interpretation among raters. As already discussed, the data analysis revealed that raters held diverse expectations of some features of the responses. For example, regarding Coherence and Cohesion, Rater 1 suggested that the student should work on Coherence and Cohesion the most, specifically within each paragraph, since each "paragraph is jumbled, lacking unity and clarity". In the same category, R1 also noted, "I also would like a position in the introduction." This means the rater expected test-takers to indicate or foreshadow their opinion in the introductory paragraph, indicating the extent to which they agree or disagree with the prompt. However, as pointed out in some preparatory materials which accompany the IELTS


writing test, IELTS Writing Task 2 does not require test-takers to state their positions in the introduction. Test-takers can simply state that the issue is problematic and that the essay is going to investigate it in depth. Certainly, there are different opinions on how the essay should be organized, as exemplified by the raters' comments. These opinions could most profitably be discussed by the raters in a preliminary norming session, where it is important that the raters come to understand the perspective of the test creators. In other words, expectations regarding the organization of a response should be discussed with raters in advance so that they can reach a consensus on the method and features of organization.

Norming also helps raters to clarify the meaning of a number of phrases in the rating criteria of the IELTS and TOEFL-iBT Writing Tasks. This need for clarification is amply demonstrated in the interviews with the raters and through the scoring of the responses. Rater 2 suggested that it would have been helpful for him/her to read benchmark essays with their accompanying descriptors in order to assess the value of each essay more accurately. R3 questioned the meaning of the phrase "bullet points" on the IELTS Writing Tasks. In general, descriptors such as "a wide range of vocabulary," "a sufficient range of vocabulary," and "an adequate range of vocabulary" are likely to be interpreted differently by different raters. When rating both the IELTS and the TOEFL-iBT, sufficient knowledge of the level of vocabulary appropriate for a given task type is required in order to assess the candidates' use of vocabulary accurately. For example, Task 1 of the IELTS Writing Test utilizes language for describing charts, graphs, or tables, while the Independent Task of the TOEFL-iBT calls for the language abilities to express comparison and contrast, to cast doubt, or to report.
These differences can be best explained through the use of benchmark essays, illustrative of different student levels, during norming sessions. The use of such benchmark essays promotes rater consistency.


3.13.3. Priority of Errors in Assessing Writing Tasks

The data analysis shows that some types of errors were weighted more heavily than others, both by the raters in this study and by the IELTS and TOEFL-iBT Examiners. On both the IELTS and TOEFL-iBT Writing Tasks, easily-corrected mechanical errors (capitalization of the first word in a sentence, spelling errors) and minor grammatical errors (articles, punctuation, and prepositions) do not impede readers' understanding of the response and do not appear to have received deductions in the raters' scoring. In contrast, responses with errors of word choice, plural forms, or subject-verb agreement tend to receive such deductions. This difference in the priority assigned to various errors articulates the importance of prior knowledge regarding which linguistic features the test developers aimed to measure.

3.13.4. Time in Scoring Writing

Table 19: Time in Scoring Writing (minutes)

Response                                          Raters 1 and 3    Rater 2
Response 1 (IELTS Writing Task 1)                       15              5
Response 2 (IELTS Writing Task 2)                       20              8
Response 3 (TOEFL-iBT Integrated Writing Task)          10              7
Response 4 (TOEFL-iBT Independent Writing Task)         10              3

It is noteworthy that Rater 1 and Rater 3 spent much more time on scoring than Rater 2, and this was true on all of the writing tasks. Apparently, the short amount of time that R2 allocated to each task did not allow him/her to comment on the full range of errors with grammar and vocabulary. While the other two raters highlighted several errors when assessing the Grammatical Range and Accuracy of IELTS Writing Task 1, R2 highlighted only one error, which was with subject-verb agreement. Additionally, R1 and R3


even indicated the pattern of error in the response to IELTS Writing Task 1, i.e. the candidate tended to use the preposition "with" irrespective of what the correct combination should have been. It appears that a rater's time and degree of attention may be significant factors in assigning a score to a piece of writing. That is to say, if raters have adequate time, they will pay closer attention to the writing, and more features of the writing will thereby be revealed. In a shortened period of time, however, there may be some lapse of rater attention, and raters may not notice all of the errors in grammar and structure. This inattention can lead to the assignment of a deflated (or even an inflated) score.

3.14. Summary of the Interview Procedures

In an interview session between this researcher and the raters, there was further examination of both the language features which were deemed pertinent to the scoring and the scoring instruments that were applied. Topics such as the instructors' scoring experience, raters' decision-making behaviors, linguistic features of high-scoring and low-scoring essays, the use of standardized rubrics, and raters' confidence in their own scoring performance were investigated (See Appendix C). Factors that affected scoring performance, including validity and reliability in scoring, frequency of training or norming, and raters' perspectives regarding training or norming, were also explored. Data from the interviews, to a certain extent, illuminated the raters' decision-making behaviors.

3.14.1. Interview Question 1: What do you think about the tasks?

Table 20: Interview Question 1 – What do you think about the tasks?

Task 1 (IELTS Writing Task 1): difficult to assess; authentic; interesting
Task 2 (IELTS Writing Task 2): easy to write; not as authentic as Tasks 1 & 3
Task 3 (TOEFL-iBT Integrated Writing Task): authentic; instructions are not clear
Task 4 (TOEFL-iBT Independent Writing Task): the most familiar to raters; easy to write


Most of the raters' comments clearly registered their viewpoints regarding the authenticity of the writing tasks. IELTS Writing Task 1, which utilizes charts, graphs, and tables, and the TOEFL-iBT Integrated Writing Task, which requires synthesis of reading, listening, and writing, were deemed more authentic for academic writing than IELTS Writing Task 2 and the TOEFL-iBT Independent Task, which simply require the writing of personal opinions. However, although the raters had their own preferences among the tasks, they did, in all cases, base their scoring on the criteria provided. They did not let their personal preferences determine or interfere with their scoring. This is, no doubt, due to the extent of their professional training. The raters also expressed that they wanted to know more about the test makers' expectations for IELTS Writing Task 1 and the TOEFL-iBT Integrated Writing Task, with which they were unfamiliar. These requests underscore the need for clarification of the descriptors on the rubrics, and for the reading of benchmark essays prior to scoring, when raters are assigned an unfamiliar task involving the integration of skills.

3.14.2. Interview Question 2 - What are your opinions regarding the analytic IELTS rating scale and the holistic TOEFL-iBT rating scale?

According to all three raters, the IELTS rubrics are the most specific and analytic, as indicated by the fact that raters take more time to score, to think, to read, and to understand the tasks. Significantly, R2 suggested that analytic rubrics may help raters set aside their biases better than holistic rubrics do. R2 explained that analytic rubrics comprise four criteria, namely Task Achievement, Lexical Resource, Coherence and Cohesion, and Grammatical Range and Accuracy, and each criterion has its own descriptors.
This separation of the four rating criteria allows raters to attend to all four aspects when evaluating the essay, rather than gravitating toward a single criterion or relying on their own idiosyncratic judgments. Although all raters stated
that scoring the IELTS writing tasks with analytic rubrics is more time-consuming than scoring the TOEFL-iBT tasks, R3 shared the view that analytic rating scales have been shown to be more statistically reliable than holistic ones. In their own teaching context, all three raters have been using analytic rubrics with seven criteria in order to assess students for placement. The holistic writing rubrics used on the TOEFL-iBT, on the other hand, are clear, convenient, and easy to use, in the sense that they aided the raters in scoring quickly. R3 added that the TOEFL-iBT writing rubrics emphasized important points. These comments align with the ways that the three raters scored the responses. R2 spent only three minutes scoring the two responses to the TOEFL-iBT Integrated and Independent Writing Tasks, both of which use holistic rubrics. R1 and R3 devoted only half as much time to TOEFL responses (10 minutes) as to IELTS responses (20 minutes). In other words, while the IELTS writing rubrics allow raters to score more precisely, the TOEFL-iBT writing rubrics help raters score quickly and perhaps assess the whole picture more clearly.

3.14.3. Interview Question 3 - What are the procedures that you follow when scoring?

All three raters followed the same three steps in arriving at a score for an essay:
1. Read and interpret the task
2. Read the responses
3. Compare the rubric and the response

When reading and interpreting the task, the raters addressed the question: 'What is important to look for in the writing?' Then, as they interpreted the task, they underlined key words in the prompt and considered the possible organization and the ideas in each part of that organization. In the second step, the raters read the responses a first time for content and organization. Together with the use of pencils to circle errors in vocabulary, grammar, and
mechanics, the raters sought to gain a holistic understanding of the test-taker's response, proficiency, and educational background. In the first reading, the raters checked the ideas that the test-taker wanted to communicate and noted whether these ideas were in accordance with the ideas that the tasks embodied. Then they read the script a second time, reviewing vocabulary and grammar. Sometimes the raters considered all the factors and arrived at a score in the first reading; at other times they had to read the response several times in order to assess it fully. In either case, after reading the task and the response, the raters assigned the score. Finally, they examined the rating descriptors above and below their predicted score in order to confirm that their scoring was accurate.

3.14.4. Interview Question 4 - Do you score the responses based on your preconceived ideas on good writing?

While rating the responses, the raters expressed different perspectives on what constitutes good writing. Rater 1 acknowledged that his/her scoring of the responses was based on preconceived ideas of what good writing is. In other words, R1 has firm beliefs regarding the most important criteria that the rubrics should contain. For instance, according to R1, the rubrics should have about seven or eight criteria. He/she suggested that Task Achievement should be expanded to include categories such as Fluency and Clarity. He/she also stated that Coherence and Cohesion should be separated into Discourse and Organization. Regarding Grammar Accuracy and Sentence Structure, R1 stated that grammar should be evaluated separately, not on the same scale as mechanics. Such assumptions on the part of Rater 1 came from the evaluation of writing for student placement that he/she does regularly at work. R1 suggested that organization and discourse can be taught and learned in a relatively short time, whereas the teaching of grammar takes
longer. Significantly, according to R1, a well-organized essay which still contains several errors with verb tenses is, in general, not a good essay. R1 stated that the use of the present perfect tense is a good indicator of students' grammar, since the present perfect can reveal a learner's knowledge of past participles, basic building blocks of English. As R1 acknowledged, the underlying theory about language use that he/she holds does play an important role in his/her knowledge of and attitude toward student writing. R1 highlighted that his/her opinions of the essays were, to a considerable extent, shaped by his/her prior experience. The rubrics only fine-tuned his/her evaluation and scoring of the responses. It would appear that the rubrics helped to clarify his/her opinions, leading to increased consistency. Rater 2 agreed with the criteria contained in both the IELTS and TOEFL-iBT writing rubrics. That is, R2 stated that good writing should incorporate the following descriptors: rich content, academic vocabulary, and clear organization. The writing should be free from grammar errors and use a wide range of sentence structures. R2 suggested that good writing should be easy to follow and easy to understand. At the same time, good writing should achieve fluency; that is, the ability to use vocabulary, structure, and grammar to communicate the test-taker's ideas convincingly. Like Rater 1, Rater 2 had his/her own bias regarding the linguistic features that raters attend to in scoring an essay. R2 prioritized the vocabulary used in the essay, such as the correct and varied use of word forms and prepositions. According to R2, good vocabulary use is an indicator of good writing, since vocabulary can help increase a learner's ability to communicate ideas. In other words, content is dependent on vocabulary. A rich vocabulary allows learners to communicate ideas more fluently and concisely, whereas the use of limited
vocabulary impedes learners from expressing their ideas clearly, no matter how logical those ideas may be. Unlike Raters 1 and 2, Rater 3 did not state any preconceived ideas of what constitutes good writing. R3 seemed to attend primarily to the descriptors of the IELTS and TOEFL-iBT criteria when evaluating the responses. R3 stressed the importance of using rubrics to guide raters in forming their scores. The interviews with the raters in this study revealed several interesting points, especially those having to do with the raters' preconceived opinions about what constitutes good writing. The raters in this study differed in their opinions about what, in fact, makes writing good. They also differed in the ways that they weighed their own opinions against the criteria provided. This might be profitably examined in future research.

3.14.5. Interview Question 5 - What are the best practices for improving rating performances for NES and NNES raters?

All of the raters asserted that conducting a norming and training session in advance of scoring is crucial for enhancing rating performance. The three raters stated that this procedure was applicable to both NES and NNES raters. All agreed that through norming, raters become more aware of the important elements of the rating process by (1) participating in the reading of prompts, (2) studying the rubrics, (3) reading benchmark essays, and (4) comparing prompts and rubrics. Norming also provides a clearer understanding of the terms used in the rubrics and motivates discussion of how these terms can be identified in student writing. Overall, norming ensures a more appropriate, consensual interpretation of the rubrics and a stricter adherence to the procedures of scoring.
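The degree of consensus that a norming session produces can also be monitored quantitatively. The short sketch below shows one simple way a norming coordinator might compute exact and adjacent agreement between pairs of raters on IELTS-style band scores. The rater labels and band scores are invented for illustration; they are not data from this study.

```python
# Hypothetical norming-session check: how closely do raters agree on
# band scores? All names and scores below are invented examples.

from itertools import combinations

def exact_agreement(a, b):
    """Proportion of scripts on which two raters assigned identical bands."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def adjacent_agreement(a, b, tolerance=0.5):
    """Proportion of scripts scored within `tolerance` of a band of each other."""
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)

# Band scores assigned by three hypothetical raters to the same six scripts.
scores = {
    "R1": [6.0, 7.0, 5.5, 6.5, 7.0, 5.0],
    "R2": [6.0, 6.5, 5.0, 6.5, 7.5, 5.0],
    "R3": [6.0, 7.0, 5.5, 6.0, 7.0, 5.5],
}

for (r1, s1), (r2, s2) in combinations(scores.items(), 2):
    print(f"{r1} vs {r2}: exact {exact_agreement(s1, s2):.2f}, "
          f"adjacent (within 0.5) {adjacent_agreement(s1, s2):.2f}")
```

A coordinator could run such a check after each norming round; rising exact-agreement figures would suggest that the discussion of descriptors is converging.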


3.14.6. Interview Question 6 - Can NNES become good raters?

According to Rater 2, NNES teachers can become good raters. However, he/she added that NNES teachers may emphasize grammar more than NES teachers do. He/she attributed this difference to the way NES and NNES raters acquire the English language. NES raters acquire the language in content-based ways; that is, they are inclined to evaluate the content of the essay. In contrast, NNES raters often acquire the language through grammar-focused methods. Accordingly, R2 stated that NNES raters pay more attention to grammar, punctuation, and syntax than NES raters do. R2 also suggested that the way one rates writing is likely to embody the way in which one acquired the language. Thus, NNES raters who learned the language primarily in a communicative environment in an English-speaking country focus more on clear communication than on accurate grammar. Regarding the IELTS and TOEFL-iBT writing tasks, R2 stated that there are no differences in language requirements (e.g. vocabulary range) between NES and NNES raters. In other words, R2 supported the notion that NNES teachers can become competent raters. Similarly, Rater 1 and Rater 3 agreed that NNES teachers can become good raters. R1 declared that NNES raters with an MA or a PhD in the English language are eligible to score essays for standardized tests. R3 raised the importance of norming and of learning about English rhetorical conventions in order to improve scoring consistency among NNES raters. Noteworthy to this study is that one of the three raters was, in fact, a non-native English-speaking teacher. This rater, who has lived and worked in the U.S. for 35 years, described himself/herself as a very experienced rater. His/her written and spoken English is commensurate with that of his/her native English-speaking counterparts, and he/she assigned consistent scores to the responses. This rater has lived in the U.S. for a period extensive enough to become an experienced teacher in an accredited English program who demonstrates consistent and
appropriate scoring on IELTS and TOEFL-iBT writing tasks. Similarly, other NNES teachers can succeed in becoming good raters of written English, provided that they acquire a native-like written fluency. The case of this NNES rater helps to answer one of the basic questions of the current study: Can non-native English-speaking teachers acquire these best practices in rating standardized test essays?

3.15. Discussion of the Results from the Interview

The interviews with the raters revealed several interesting points, especially those concerning (1) the raters' preconceived opinions on the authenticity of the tasks, (2) the relevance of the rubrics, and (3) the elements that specifically constitute good writing. First, most of the raters expressed their viewpoints regarding authenticity. According to their reports, IELTS Writing Task 1 (which utilizes charts, graphs, and tables) and the TOEFL-iBT Integrated Task (which integrates reading, listening, and writing) are more authentic tasks for academic writing than IELTS Writing Task 2 and the TOEFL-iBT Independent Task (both of which require solely personal opinions). Second, the raters responded more positively to analytic rating scales than to holistic ones, due to the benefits that analytic scales provide in improving the reliability of scoring. Third, all raters had their own ideas of what a good response looks like, ideas that had been shaped by their learning and teaching. However, although the raters had their own preferences regarding the tasks, the rubrics, and the responses, they nonetheless relied on the provided rating scales when determining the score for a response. At the same time, the raters also expressed the need to be aware of the testing institution's language requirements (e.g. vocabulary, rhetorical features, or sentence structures) for unfamiliar tasks like IELTS Writing Task 1 and the TOEFL-iBT Integrated Task. The need for clarification of terms on the rubrics, in addition to the inclusion of reading benchmark essays in norming sessions, apparently
becomes even more important when the task is unfamiliar and involves an integration of skills. Indeed, norming is a sine qua non of writing assessment. Major recommendations of this study will be presented in Chapter 4.


Chapter Four
CONCLUSION

This study explores the best practices that raters use when they score standardized test essays. It also aims to provide feasible solutions to the scoring of popular standardized tests, notably the IELTS and the TOEFL-iBT, by non-native English-speaking raters in the English language teaching context of Vietnam. This chapter summarizes the results on the best practices in rating standardized test essays that help to improve the scoring abilities of non-native English-speaking EFL teachers in Vietnam. Limitations of the study and suggestions for future research are also presented.

4.1. Research Question 1: What Are the Best Practices that Raters Use When They Score Standardized Test Essays?

The literature review provided many insights into scoring practices, especially those which would benefit non-native English-speaking raters in the English language teaching context of Vietnam when learning to score essays. These results were also confirmed in the interviews with experienced raters conducted in the current research.

4.1.1. Rating Procedures for the IELTS and TOEFL-iBT Writing Tasks

It was evident that each rater brings his or her own rating experience, language background, and personal preferences to the rating process. These differences can contribute to variance in rating behaviors and, thereby, in rating performance. This variance in rating behaviors indicates that there is a need for the adoption of a framework that meets the demands of the specific standardized test, and that each rating behavior in the recommended framework
should serve a pre-determined purpose. The following framework, which is recommended by the IELTS testing agency, can be used in an EFL teaching context.

Table 1: Revised Method of Assessment for IELTS Writing

1. Work through the four criteria in order, starting with Task Achievement or Task Response, noting length requirements
2. For each criterion, start with the over-arching statement that most closely matches the appropriate features of the script
3. Read through the most detailed features of performance at that band and match these to the script
4. Check that all positive features of that band are evident in the script
5. Check through the descriptors BELOW the band to ensure that there are no penalties/ceilings that are relevant
6. Check through the descriptors ABOVE the band to ensure that the rating is accurate
7. Where necessary, check the number of words and note the necessary penalty on the Answer Sheet
8. Write the band for the criterion in the appropriate box on the Answer Sheet and ensure that any other relevant boxes are completed
9. Rate all Task 1 and Task 2 responses together
(Taken from Shaw and Weir, 2007, p. 274)

4.1.2. Other Factors that Impact Scoring

First, time itself can be a factor in improving the reliability of scoring writing: spending sufficient time on the evaluation of a given script allows the rater to attend more consistently to the task prompt and to the details of lexical resource and grammar. Second, analytic rubrics are preferable for use by raters, since they help raters assign more accurate scores than holistic rubrics do. Third, some types of errors are weighted more heavily than other types of
errors. In both the IELTS and TOEFL-iBT writing tasks, easily corrected mechanical errors (capitalization of the first word in a sentence and spelling) and minor grammatical errors (articles, punctuation, and prepositions) are not weighted as heavily as errors in word choice, plural forms, or subject-verb agreement.

4.2. Research Question 2: How Can Non-Native English-Speaking Teachers Acquire the Best Practices in Rating Standardized Test Essays?

Both the discussion sections in the literature review and the interviews with the experienced raters suggest that NNES teachers who have acquired a native-like written fluency can become good raters. Notably, one rater in this study is a prime example of the fact that NNES teachers can become experienced raters, provided that they are exposed to sufficient knowledge of written English. Rater training which utilizes benchmark essays plays a substantial role in helping to ensure that raters attain a consensus on the application of language descriptors and appropriate rating behaviors. Training methods for scoring, notably norming and rating scales, will aid NNES raters in internalizing the marking scheme and interpreting the tasks. Norming is necessary in order to raise raters' scoring performance, as it yields a number of benefits, benefits confirmed through the findings of the literature review, the questionnaire, and the interviews conducted for this study. First, norming helps to establish a consensus in understanding the descriptors, which, in turn, contributes to improved scoring reliability and promotes similarity in interpretive schemata among raters. Second, norming helps to ensure that raters share similar perspectives regarding what constitutes good writing. Discussion which elucidates difficult terms on the rubric must be conducted in order to ensure that the progression of scoring and the terms of the rubrics are clearly and consensually
understood by everyone. The reading of benchmark essays by raters is also necessary, so that raters may compare features between essays and evaluate how those features are reflected in the rating criteria. The standardizing effect of rating scales on rater judgments is another salient feature of scorer training that has been explored in a number of studies. Two of the three raters in this study assigned the same scores, and these scores were very similar to those assigned by the TOEFL-iBT and IELTS examiners. These raters focused closely on the descriptors on the rubrics and prioritized the linguistic features suggested in the rating criteria in order to assign a score to a response. In other words, both their actions and the results suggest that rubrics provide reliable guidance for raters in ascertaining the features of a response when scoring. Future studies on the analysis of rubric descriptors and rating criteria are required in order to determine the effectiveness of rubrics in improving rating performance among raters.

4.3. Limitations of the Study

The researcher gained many insights into scoring and scorer training from writing this thesis. However, if similar research were to be conducted in the future, there are some changes that should be made. First, the raters' levels of experience should be considered and made comparable. Though the three raters were familiar with the TOEFL-iBT writing tasks, they had not been normed with benchmark essays or rubrics of either the IELTS or the TOEFL-iBT. Therefore, as they did not follow all the procedures recommended by the IELTS and TOEFL-iBT testing agencies, the scoring behaviors that these raters exhibited may not be a fully accurate mirror of what IELTS and TOEFL-iBT raters should do in order to achieve accurate scoring. In
future, participants should be selected based on their experience, to make sure that all participants have relatively similar scoring experience and training. The procedures for scoring employed by the raters also need to be clarified and specified in advance, to ensure that all raters proceed in a similar way: there should either be a set procedure for all raters, or the raters should describe their procedure for each task. In this study, the raters were asked to highlight linguistic features after finishing the scoring. The raters were then expected to record the total time spent on scoring, not including the time spent on highlighting. This was necessary for the researcher to know the exact amount of time that the raters spent on scoring a script. However, asking the raters to highlight the linguistic features that drove their score assignment after they finished scoring might have caused the raters to change their initial scores. That is, the raters may have noticed other linguistic features that more closely matched the descriptors and highlighted them. This, in turn, may have affected the data collected on scoring behavior, since the raters might not have noticed such linguistic features during the actual scoring. In the future, factors such as when to score (e.g. after the first reading or the second reading) should be considered so as to enhance the validity of the data on scoring behavior. Another item requiring further attention is how the raters define "scoring". The raters showed differences in how they defined scoring. The third rater defined "scoring" as only the actual time spent on the procedure of scoring, a definition that included neither reading the prompts nor interpreting the task. The other two raters, on the other hand, defined "scoring" as the total time spent both reading the prompts and interpreting the tasks. Thus, the method of scoring and what counts as scoring time should have been discussed with the raters and clarified in advance in order to
collect more reliable data. Such considerations are important to the planning of any future research.

4.4. Recommendations for Future Studies

In the future, responses at different levels of proficiency on the same task should be analyzed in order for researchers to learn about the differences in content, organization, grammar, and vocabulary of essays at the different levels. This would help differentiate essays at different levels of language proficiency. In addition, more studies need to be conducted on the standardizing effects of rubrics and the effectiveness of benchmark essays in scoring. This could be accomplished by utilizing three groups: a group rating without rubrics, a group rating with rubrics, and a group rating with benchmark essays and rubrics. By this method, the effectiveness of rubrics and benchmark essays can be demonstrated and assessed, and their significance established. Further recommendations can then be proposed for utilizing rubrics and benchmark essays to improve the scoring performance of NNES teachers in EFL teaching contexts.

4.5. Application of the Thesis to NNES Raters in Vietnam

Information on the recommended rating behaviors for standardized rating criteria, derived from the literature, will allow the researcher not only to improve her own rating practices, but also to instruct other raters in her EFL teaching context in scoring standardized test essays. In the future, the researcher will be able to arrange meetings among teachers to norm the scoring of writing in classrooms in Vietnam using the recommended framework and benchmark essays. It is hoped that through such practices, NNES teachers in EFL teaching contexts can acquire an accurate understanding of the IELTS and TOEFL-iBT writing rubrics and of how IELTS
and TOEFL-iBT expert examiners score student responses, so that they can employ such practices themselves and better evaluate students' writing.

4.6. Summary

This study explores the best practices that raters use when they score standardized test essays and provides feasible solutions to the scoring of popular standardized tests, notably the IELTS and the TOEFL-iBT, by non-native English-speaking raters in the English language teaching context of Vietnam. The current research is based primarily on a review of data from previous studies; a thorough review provided this study with essential information on the best rating practices. In addition, a small set of anecdotal data from questionnaires and interviews with three experienced raters was used to supplement key ideas from the literature review. A framework of decision-making behaviors and rating instruments was proposed, which enables raters to align their decision-making behaviors more closely with the rating requirements, thereby enhancing the validity and consistency of their scoring of EFL writing. The researcher plans to use the thesis as a pilot for further research and as a model for teacher education and scorer training workshops.
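The three-group comparison proposed in Section 4.4 could be piloted with a very simple summary statistic. The sketch below, using invented scores rather than data from this study, compares each hypothetical group's ratings against official examiner benchmark scores by mean absolute deviation; a lower value would suggest that rubrics and benchmark essays bring raters closer to examiner scoring.

```python
# A minimal sketch of one way to analyze the hypothetical three-group
# design (no rubric / rubric only / rubric + benchmark essays).
# All scores below are invented for illustration only.

examiner = [6.0, 7.0, 5.0, 6.5]  # hypothetical benchmark scores for four scripts

groups = {
    "no rubric":           [5.0, 7.5, 6.0, 5.5],
    "rubric only":         [6.0, 6.5, 5.5, 6.5],
    "rubric + benchmarks": [6.0, 7.0, 5.0, 6.0],
}

def mean_abs_deviation(scores, reference):
    """Average absolute distance between a group's scores and the reference."""
    return sum(abs(s, ) if False else abs(s - r) for s, r in zip(scores, reference)) / len(reference)

for name, group_scores in groups.items():
    print(f"{name}: MAD = {mean_abs_deviation(group_scores, examiner):.2f}")
```

In a real study, this summary would be complemented by proper reliability statistics, but even a deviation measure of this kind would make the relative benefit of rubrics and benchmark essays visible at a glance.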


APPENDIX A: TASKS & RESPONSES TO TASKS

IELTS Writing Task 1

You should spend about 20 minutes on this task.
The charts below give information on the ages of the populations of Yemen and Italy in 2000 and projections for 2050.
Summarize the information by selecting and reporting the main features, and make comparisons where relevant.
Write at least 150 words.
(Source: Test 3, Writing Task 1, Cambridge IELTS 9 Self-study Pack (Student's Book with Answers and Audio CDs (2)): Examination Papers from University of Cambridge ESOL Examinations (IELTS Practice Tests), 2013, p. 168)


Response 1: Response to IELTS Writing Task 1

The diagrams show statistic information regarding the ages of the habitants of Yemen and Italy in 2000 and also a estimative for 2050. We can see that in 2000 the majority of people in Yemen was between 0 and 14 years old, whith 50.1%, whereas in Italy most of the population was between 15-59 years old (61.6%), in the same year. On the other hand, just 3.6% of people in the former country w 60 years old or more in 2000, while in the latter country this figure is represented with 24.1%.
The projections for 2050 show that the number of people with 15-59 years and 60 years or more will increase in Yemen, reaching 57.3% and 5.7% respectively. In contrast, in Italy, the population with 15-59 years will decrease to 46.2%, while people with 60 years or more will grow to 42.3%
Overall, it is possible to see that there is an upward trend on the rates of people with 60 years or more in both countries.


IELTS Writing Task 2

You should spend about 40 minutes on this task.
Write about the following topic:
The threat of nuclear weapons maintains world peace. Nuclear power provides cheap and clean energy. The benefits of nuclear technology far outweigh the disadvantages. To what extent do you agree or disagree?

Give reasons for your answer and include any relevant examples from your own knowledge or experience. Write at least 250 words. (Source: IELTS Academic Writing Task 2 Activity – Sample Script B. Retrieved from http://www.cambridgeenglish.org/teaching-english/resources-for-teachers/, pp. 18-19)


Response 2: Response to IELTS Writing Task 2

Nuclear power is an alternative source of energy which is carefully being evaluated during these times of energy problems. During these years we can say that we have energy problems but in more or less 50 years, we will be facing an energy crisis. Nuclear power is an alternative source of energy and unlike other sources such as solar energy, nuclear power is highly effective for industrial perpouses. If it is handled correctly, there really is no danger for the public. It is cheap, there is no threat of pollution and best of all it is limitless.
It is difficult to think about nuclear power as a good source of energy for people in general. This is due to the use it has been given since its birth during the second world war. It is expressed as military power and in fact at the moment nuclear power is limited to few hands who consider themselves world powers. When and if there is a change of ideology regarding the correct use of nuclear power, then we may all benefit from all the advantages nuclear power can give us.
If we outweigh the advantages and disadvantages of nuclear technology we then have the following: As stated before, the advantages are that there is limitless supply, it is cheap, it is effective for industrial perpouse and still there are many benefits which have not yet been discovered. The disadvantages are at present time that it is limited to only a few countries who regard it is as safe military power. Also if mishandles there is a risk for the popullation around the plant to undergo contamination as we all know happened in Chernobyl. If these disadvantages can be overcome, then it is clear that nuclear energy can give us more benefits than problems. It will in the future be very important as the energy crisis is not far ahead.
In conclusion, nuclear power is good, it can be safe, and we will all benefit. It is up to our leaders to see that it is handled well so that we can all benefit from it.


TOEFL Internet-based Test (iBT) Integrated Task
(Source: TOEFL® iBT Writing Sample Responses - ETS. Retrieved from https://www.ets.org/, pp. 1-3)

READING

First, examinees see the following reading passage on their computer screen for three minutes:

In many organizations, perhaps the best way to approach certain new projects is to assemble a group of people into a team. Having a team of people attack a project offers several advantages. First of all, a group of people has a wider range of knowledge, expertise, and skills than any single individual is likely to possess. Also, because of the numbers of people involved and the greater resources they possess, a group can work more quickly in response to the task assigned to it and can come up with highly creative solutions to problems and issues. Sometimes these creative solutions come about because a group is more likely to make risky decisions that an individual might not undertake. This is because the group spreads responsibility for a decision to all the members and thus no single individual can be held accountable if the decision turns out to be wrong.
Taking part in a group process can be very rewarding for members of the team. Team members who have a voice in making a decision will no doubt feel better about carrying out the work that is entailed by that decision than they might doing work that is imposed on them by others. Also, the individual team member has a much better chance to "shine," to get his or her contributions and ideas not only recognized but recognized as highly significant, because a team's overall results can be more far-reaching and have greater impact than what might have otherwise been possible for the person to accomplish or contribute working alone.

A narrator then says, "Now listen to part of a lecture on the topic you just read about." Then examinees listen to and can take notes on the following lecture, the script of which is given below.


LISTENING

They view: a picture of a male professor standing in front of a class.

They listen to:

(Professor) Now I want to tell you about what one company found when it decided that it would turn over some of its new projects to teams of people, and make the team responsible for planning the projects and getting the work done. After about six months, the company took a look at how well the teams performed. On virtually every team, some members got almost a “free ride” ... they didn’t contribute much at all, but if their team did a good job, they nevertheless benefited from the recognition the team got. And what about group members who worked especially well and who provided a lot of insight on problems and issues? Well...the recognition for a job well done went to the group as a whole, no names were named. So it won’t surprise you to learn that when the real contributors were asked how they felt about the group process, their attitude was just the opposite of what the reading predicts.

Another finding was that some projects just didn’t move very quickly. Why? Because it took so long to reach consensus...it took many, many meetings to build the agreement among group members about how they would move the project along.

On the other hand, there were other instances where one or two people managed to become very influential over what their group did. Sometimes when those influencers said “That will never work” about an idea the group was developing, the idea was quickly dropped instead of being further discussed. And then there was another occasion when a couple influencers convinced the group that a plan of theirs was “highly creative.” And even though some members tried to warn the rest of the group that the project was moving in directions that might not work, they were basically ignored by other group members. Can you guess the ending to *this* story? When the project failed, the blame was placed on all the members of the group.


WRITING

The reading passage then reappears, and the following directions and question appear on the screen.

They read: You have 20 minutes to plan and write your response. Your response will be judged on the basis of the quality of your writing and on how well your response presents the points in the lecture and their relationship to the reading passage. Typically, an effective response will be 150 to 225 words.

They respond to: Summarize the points made in the lecture you just heard, explaining how they cast doubt on points made in the reading.

Response 3: Response to TOEFL-iBT Integrated Writing Task

The lecturer talks about research conducted by a firm that used the group system to handle their work. He says that the theory stated in the passage was very different and somewhat inaccurate when compared to what happened for real.

First, some members got free rides. That is, some didn’t work hard but gotrecognition for the success nontheless. This also indicates that people who worked hard was not given recognition they should have got. In other words, they weren’t given the oppotunity to “shine”. This derectly contradicts what the passage indicates.

Second, groups were slow in progress. The passage says that groups are nore responsive than individuals because of the number of people involved and their aggregated resources. However, the speaker talks about how the firm found out that groups were slower than individuals in dicision making. Groups needed more time for meetings, which are neccesary procceedures in decision making. This was another part where experience contradicted theory.

Third, influetial people might emerge, and lead the group towards glory or failure. If the influent people are going in the right direction there would be no problem. But in cases where they go in the wrong direction, there is nobody that has enough influence to counter the decision made. In other words, the group might turn into a dictatorship, with the influential party as the leader, and might be less flexible in thinking. They might become one-sided, and thus fail to succeed.


Response 4: TOEFL Internet-based test (iBT) Independent Writing Task

They read: Read the question below. You have 30 minutes to plan, write, and revise your essay. Typically, an effective essay will contain a minimum of 300 words.

They respond to: Do you agree or disagree with the following statement? Always telling the truth is the most important consideration in any relationship. Use specific reasons and examples to support your answer.

(Source: TOEFL® iBT Writing Sample Responses – ETS. Retrieved from https://www.ets.org/, p. 13)

Response 4: Response to TOEFL-iBT Independent Writing Task

When we were children, we were taught to tell truth. Telling truth and being honest also become the criteria of judging a person. I do think that most people prefer to live in a world of pure truth and it is also the best wish of all kind hearted people. However, what we have to face is not the dream land. Telling truth all the time could only exist in our dreams. In another word, lies could not be avoided. I will illustrate my opinion with the following facts.

In the business world, always telling truth equals to commit suicide. For example, when competitor company is asking the content of a tender or the cost of a major product, telling truth is nonsense. Furthermore, we could learn from old collegue that to keep the business secrete is to keep the position.

For those people whose profession is politics, lies is their favorite language. In order to stablize the mood of anxcious people, some times president has to tell lies to comfort the peolple so that no more serious result will hapen.

Among friends, allways telling truth will also cause lots of avoidable confllicts. For example, when a 49 years old lady ask her froend whether she is beautiful or not. If the answer is No (truth for sure). We could imagine the accompanied results. This is why we are using the term white lie to find the suitable reason to use minor lies.


However, despite all these difficulties, we should always try to be honest and tell truth as much as we could. In lots of situation, to tell truth can strengthen the relationship and deepen the communication and understanding among people. Especially among family memers, telling truth is quite necessary for every one. A family full of lie will bankrupt immediately.

In a word, in different situation, we should tell different words.


APPENDIX B: QUESTIONNAIRES WHEN SCORING RESPONSES

Please answer the following questions after scoring the response to IELTS Writing Task 1:

1. How much time did you spend scoring the student response to IELTS Writing Task 1?
______________________________________________________________________________

2. What score did you assign to the student response to IELTS Writing Task 1?
______________________________________________________________________________

3. How did you arrive at your score for Task Achievement on the student response to IELTS Writing Task 1? Please use the yellow marker to highlight words and phrases that show Task Achievement features of the student response to IELTS Writing Task 1. E.g. in arriving at Score 7 for Vocabulary, you could write ‘fluent language’ and then highlight academic words and phrases in the response that show evidence of the writer’s fluent language.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________


4. How did you arrive at your score for Coherence and Cohesion on the student response to IELTS Writing Task 1? Please use the blue marker to highlight words and phrases that show Coherence and Cohesion features of the student response to IELTS Writing Task 1.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________

5. How did you arrive at your score for Lexical Resource on the student response to IELTS Writing Task 1? Please use the green marker to highlight words and phrases that show Lexical Resource features of the student response to IELTS Writing Task 1.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________

6. How did you arrive at your score for Grammatical Range and Accuracy on the student response to IELTS Writing Task 1? Please use the pink marker to highlight words and phrases that show Grammatical Range and Accuracy features of the student response to IELTS Writing Task 1.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________


Please answer the following questions after scoring the response to IELTS Writing Task 2:

1. How much time did you spend scoring the student response to IELTS Writing Task 2?
______________________________________________________________________________

2. What score did you assign to the student response to IELTS Writing Task 2?
______________________________________________________________________________

3. How did you arrive at your score for Task Response on the student response to IELTS Writing Task 2? Please use the yellow marker to highlight words and phrases that show Task Response features of the student response to IELTS Writing Task 2.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________

4. How did you arrive at your score for Coherence and Cohesion on the student response to IELTS Writing Task 2? Please use the blue marker to highlight words and phrases that show Coherence and Cohesion features of the student response to IELTS Writing Task 2.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________


______________________________________________________________________________
______________________________________________________________________________

5. How did you arrive at your score for Lexical Resource on the student response to IELTS Writing Task 2? Please use the green marker to highlight words and phrases that show Lexical Resource features of the student response to IELTS Writing Task 2.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________

6. How did you arrive at your score for Grammatical Range and Accuracy on the student response to IELTS Writing Task 2? Please use the pink marker to highlight words and phrases that show Grammatical Range and Accuracy features of the student response to IELTS Writing Task 2.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________


Please answer the following questions after scoring the response to the TOEFL-iBT Integrated Writing Task:

1. How much time did you spend scoring the student response to the TOEFL-iBT Integrated Writing Task?
______________________________________________________________________________

2. What score did you assign to the student response to the TOEFL-iBT Integrated Writing Task?
______________________________________________________________________________

3. How did you arrive at your score for the student response to the TOEFL-iBT Integrated Writing Task? Please use the markers to highlight words and phrases in the response that shaped your decision.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________


Please answer the following questions after scoring the response to the TOEFL-iBT Independent Writing Task:

1. How much time did you spend scoring the student response to the TOEFL-iBT Independent Writing Task?
______________________________________________________________________________

2. What score did you assign to the student response to the TOEFL-iBT Independent Writing Task?
______________________________________________________________________________

3. How did you arrive at your score for the student response to the TOEFL-iBT Independent Writing Task? Please use the markers to highlight words and phrases in the response that shaped your decision.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________


APPENDIX C: QUESTIONS FOR THE INTERVIEW

1. What do you think about the tasks?
2. What are your opinions regarding the analytic IELTS rating scale and the holistic TOEFL-iBT rating scale?
3. What procedures do you follow when scoring?
4. Do you score the responses based on your preconceived ideas of good writing?
5. What are the best practices for improving rating performance for native English-speaking and non-native English-speaking raters?
6. Can non-native English-speaking teachers become good raters?


References

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Belcher, D. D., & Braine, G. (1995). Academic writing in a second language: Essays on research and pedagogy. Norwood, NJ: Ablex Publishing Corporation.

Bowden, H. W., Steinhauer, K., Sanz, C., & Ullman, M. T. (2013). Native-like brain processing of syntax can be attained by university foreign language learners. Neuropsychologia, 51(13), 2492-2511. doi: 10.1016/j.neuropsychologia.2013.09.004

Brookhart, S. M. (2013). How to create and use rubrics for formative assessment and grading. Alexandria, VA: Association for Supervision & Curriculum Development.

Cambridge IELTS 9 self-study pack. (2013). Cambridge: Cambridge University Press.

Charney, D. (1984). The validity of using holistic scoring to evaluate writing: A critical overview. Research in the Teaching of English, 18(1), 65-81.

Cooper, C. R. (1977). Holistic evaluation of writing. In C. R. Cooper & L. Odell (Eds.), Evaluating writing: Describing, measuring, judging (pp. 3-31). Urbana, IL: The National Council of Teachers of English.

Crossley, S. A., Roscoe, R., & McNamara, D. S. (2014). What is successful writing? An investigation into the multiple ways writers can write successful essays. Written Communication, 31(2), 184-214. doi: 10.1177/0741088314526354

Cumming, A. (1998). An investigation into raters’ decision making, and development of a preliminary analytic framework, for scoring TOEFL essays and TOEFL 2000 prototype writing tasks. Princeton, NJ: Educational Testing Service.


Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. Modern Language Journal, 86(1), 67-96. doi: 10.1111/1540-4781.00137

Davies, A. (1990). Principles of language testing. Oxford: Blackwell.

Educational Testing Service. (2008). TOEFL iBT writing sample responses. Retrieved from https://www.ets.org/Media/Tests/TOEFL/pdf/ibt_writing_sample_responses.pdf

Freedman, S. W. (1979). Why do teachers give the grades they do? College Composition and Communication, 30(2), 365-387. doi: 10.2307/356323

Freedman, S. W. (1981). Influences on evaluators of expository essays: Beyond the text. Research in the Teaching of English, 15(3), 245-255.

Fulcher, G. (2010). Practical language testing. London: Hodder Education.

Furneaux, C., & Rignall, M. (2007). The effect of standardisation-training on rater judgements for the IELTS Writing Module. In L. Taylor & P. Falvey (Eds.), IELTS collected papers: Research in speaking and writing assessment (pp. 422-445). Cambridge: Cambridge University Press.

Hamp-Lyons, L. (1991). Assessing second language writing in academic contexts. Norwood, NJ: Ablex Publishing Corporation.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

Huot, B. A. (1990). Reliability, validity, and holistic scoring: What we know and what we need to know. College Composition and Communication, 41, 201-213.


Kennedy, C., & Thorp, D. (2007). A corpus-based investigation of linguistic responses to an IELTS academic writing task. In L. Taylor & P. Falvey (Eds.), IELTS collected papers: Research in speaking and writing assessment (pp. 316-378). Cambridge: Cambridge University Press.

Kobayashi, T. (1992). Native and nonnative reactions to ESL compositions. TESOL Quarterly, 26, 81-112.

Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331-345.

Marsh, H. W., & Ireland, R. (1987). The assessment of writing effectiveness: A multidimensional perspective. Australian Journal of Psychology, 39(3), 353-367. doi: 10.1080/00049538708259059

McNamara, T. F. (1996). Measuring second language performance. London: Longman.

Norming process. (n.d.). Retrieved from http://www.teachingmatters.org/toolkit/norm-setting-protocol

O'Loughlin, K. J. (1992). The assessment of writing by English and ESL teachers. Australian Review of Applied Linguistics, 17(1), 23-44.

O'Sullivan, B., & Rignall, M. (2007). Assessing the value of bias analysis feedback to raters for the IELTS writing module. In L. Taylor & P. Falvey (Eds.), IELTS collected papers: Research in speaking and writing assessment (pp. 446-478). Cambridge: Cambridge University Press.

Phillips, D. (2007). Longman preparation course for the TOEFL® test: iBT writing. New York: Pearson Education ESL.


Sakyi, A. A. (2003). The study of the holistic scoring behaviours of experienced and novice ESL instructors (Doctoral dissertation). Retrieved from ProQuest Dissertations and Theses database. (UMI No. NQ78033)

Sample candidate writing scripts and examiner comments. (n.d.). Retrieved from http://www.ielts.org/

Shaw, S. D. (2002). The effect of training and standardization on rater judgement and inter-rater reliability. Research Notes, 9, 13-17.

Shaw, S. D. (2003). IELTS Writing Assessment Revision Project (Phase 3): Validating the revised rating scale - A quantitative analysis. Cambridge: UCLES internal report.

Shaw, S. D., & Weir, C. J. (2007). Examining writing: Research and practice in assessing second language writing. Cambridge: Cambridge University Press.

Shi, L. (2001). Native- and nonnative-speaking EFL teachers’ evaluation of Chinese students’ English writing. Language Testing, 18, 303-325. doi: 10.1177/026553220101800303

Tedick, D. J. (2002). Proficiency-oriented language instruction and assessment: Standards, philosophies, and considerations for assessment. In D. J. Tedick (Ed.), Proficiency-oriented language instruction and assessment: A curriculum handbook for teachers (Rev. ed.). CARLA Working Paper Series. Minneapolis, MN: University of Minnesota, The Center for Advanced Research on Language Acquisition.

Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11, 197-223.

Weigle, S. C. (2009). Assessing writing. Cambridge: Cambridge University Press.


Weir, C. J. (1983). Identifying the language problems of overseas students in tertiary education in the United Kingdom (Unpublished doctoral dissertation). Institute of Education, London.

Weir, C. J. (1990). Communicative language testing. New York: Prentice Hall.

Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Basingstoke: Palgrave Macmillan.

White, E. M. (1995). An apologia for the timed impromptu essay test. College Composition and Communication, 46(1), 30-45. doi: 10.2307/358868

Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305-335.

Wolfe, E. W. (1997). The relationship between essay reading style and scoring proficiency in a psychometric scoring system. Assessing Writing, 4, 83-106.

Wolfe, E. W., Kao, C. W., & Ranney, M. (1998). Cognitive differences in proficient and nonproficient essay scorers. Written Communication, 15(4), 465-492. doi: 10.1177/0741088398015004002