Spaan Fellow Working Papers in Second or Foreign Language Assessment, Volume 6: 181–205. Copyright © 2008.

English Language Institute, University of Michigan, www.lsa.umich.edu/eli/research/spaan

Ratings of L2 Oral Performance in English: Relative Impact of Rater Characteristics and Acoustic Measures of Accentedness

Okim Kang

University of Georgia

ABSTRACT

A review of the literature indicates that rating scores may be tied to particular characteristics of the raters who score oral performances. However, previous studies have not generally taken into account how each individual rater characteristic can affect the rating of oral assessments, and no prior study has examined relations between raters' assessments of oral performances and acoustic measures of accentedness in the speech samples being rated. This study therefore tested the supposition that raters' background characteristics, including their rater training status, along with their attitudes toward L1-accented English, influence their rating of oral performances. It was predicted that the impact of rater characteristics would be comparable to the influence of acoustic properties of the speech samples on ratings. In light of the growing internationalization of the U.S. teaching workforce, the speech samples were produced by international teaching assistants (ITAs). The study revolved around two phases of data collection. In Phase I, 63 raters with no training intervention rated 11 ITAs' speaking performances. In Phase II, 29 of the raters were given a social-psychological rater intervention (training), in which the undergraduates interacted informally with ITAs for an hour. The 63 raters (29 trained and 34 untrained) then rated the same 11 speech samples. All data collection from participant-raters was conducted online. Using the PRAAT computer program, four minutes of continuous speech were acoustically analyzed for measures of speech rate, pauses, and intonation. Results revealed that about 20% of the variance in proficiency and intelligibility ratings was due to variables relevant to accent (speech rate, pauses, and stress). Raters from different backgrounds (NS/NNS status, extent of contact with NNSs, and prior teaching experience) perceived the ITAs' accented speech differently, and the intercultural intervention (rater training) exerted an impact on accentedness ratings. The findings yield valuable insights into the mechanisms of second and foreign language assessment and of language stereotyping.

In rating English language speaking performances of nonnative speakers (NNSs), certain objectively measurable features of pronunciation are certainly relevant to “true score” variance. On the other hand, variance due to rater background and disposition counts as measurement error.


In other words, the phonetic and prosodic qualities of the speech samples should count toward speaker proficiency scores, but it should not matter who the particular raters are. Accordingly, the present study compares the impact of selected linguistic and nonlinguistic factors that may affect ratings of speech samples of NNSs' oral performances. More specifically, it explores attitudinal, experiential, and acoustic factors (the rate, pauses, and intonation of speech) in the evaluation of accented speech. It further seeks to determine the degree to which rater training (a brief sociopsychological intervention) may reduce the impact of extraneous rater characteristics on speech ratings.

The current movements of globalization and World Englishes have muddled our professional knowledge (Canagarajah, 2006). These trends pose fresh questions about raters of oral performance assessment in particular, such as what characteristics raters need in order to evaluate nonnative speakers' English oral performances, or how valid raters' impressions of “foreign” accentedness are. Because listeners are so prone to drawing social inferences about speakers on the basis of just a few seconds of speech, rating oral performances is well known to be extremely susceptible to rater expectation and stereotype (Bradac, Cargile, & Hallet, 2001; Piché, Michlin, Rubin, & Sullivan, 1977). This linguistic stereotyping effect is so pernicious that ratings of speaker accent are distorted by perceptions along extraneous dimensions, even when raters are asked to reflect on possible biasing factors (Nisbett & Wilson, 1977). Rubin's (2002) program of research on U.S. undergraduates' ratings of, and comprehension of, ITAs offers stark documentation of that phenomenon. Listeners who harbor negative stereotypes of nonnative speakers may “hear” interference where there is none. Such listeners can hardly be expected to render accurate judgments of oral language performances. What is less well understood is whether listeners with ample exposure to nonnative speech, and with positive dispositions, are likewise hampered from rendering accurate discriminations among levels of comprehensibility. Rubin's work suggests that sheer exposure to nonnative speakers can acclimate listeners to accents, and familiarity with NNSs' speech does facilitate comprehension by native speakers (Gass & Varonis, 1984). Anecdotally, we know that international educators and ESL specialists often become desensitized to elements of accentedness that may be quite salient to mainstream listeners. Thus stereotypes of NNSs, degree of exposure to nonnative speech, and technical knowledge of language and language variation are all rater characteristics likely to exert an impact upon evaluations of NNSs and their speaking proficiency.

On the other hand, speech science has made progress toward identifying objectively measurable features of pronunciation that affect comprehensibility. Pickering's (2001, 2004) research, in particular, indicates that intonational indices derived from fundamental speech formants are associated with the expected comprehensibility of instructors' speech. Speech rate also interacts with accent to affect comprehensibility (Anderson-Hsieh & Koehler, 1988). Typically, accent comprehensibility is determined by listener dictation accuracy, and degree of accentedness is judged by panels of experts. It is now possible, however, for elements of accent to be detected by instrumental, computer-assisted acoustical analysis (Pickering, 1999). This approach offers a possible solution to the problem of subjectivity in rating oral performances. To the degree that conforming to native speaker comprehension needs constitutes a criterion for oral performance assessment, these acoustical parameters measured via instrumentation can be considered legitimate proxies of true score components of speaking proficiency. That is the rationale for the present project: to ascertain the proportion of variance in speaker ratings attributable to measurable parameters of accentedness and the proportion attributable to potentially biasing rater characteristics.


This study is guided by the following research questions:

1. To what extent do raters' perceptual judgments of the accentedness of ITAs' oral performance reflect objectively measured acoustic characteristics of accented English?
2. What are the relations between raters' backgrounds and their ratings of ITAs' oral performance?
3. To what extent does a course of training (a brief sociopsychological intervention) mitigate the impact of measurement error factors such as rater backgrounds and attitudes toward accented English?

Theoretical Consideration

In terms of theoretical accounts of language assessment, the present study is situated in Spolsky's (1978, 1995) integrative-sociolinguistic approach and Canale's (1988) naturalistic-ethical tradition. The study then offers a new interpretation of the validity of oral performance assessment from the perspective of World Englishes.

The integrative-sociolinguistic approach (Spolsky, 1978) emerged from dissatisfaction with structuralist and behaviorist approaches to language teaching and assessment (e.g., their lack of room for creativity), a dissatisfaction that led to a wealth of linguistic research on communicative competence and on the contexts of language. The notion of communicative competence was expanded to encompass the importance of context beyond the sentence for appropriate language use, including the sociolinguistic situation (Hymes, 1972). Accordingly, the practice of language testing is also connected to social relationships, power, and control (Shohamy, 2001; Norton & Toohey, 2005).

Canale's (1988) naturalistic-ethical tradition reflects the social responsibility and ethical aspects of testing. This tradition holds that language assessment can measure students' competence by using naturalistic language, as well as evaluate competence by observing students as they perform authentic language tasks. This movement justifies the use of ITAs' instructional (natural and authentic) presentations in this project. An ethical approach to language testing makes the limitations of our tests clear to everyone involved: not only test takers, but also their parents, their teachers, their raters, and political decision makers (Hamp-Lyons, 2002). In this respect, a question arises regarding the validity of score interpretations of NNSs' English language oral performances when those performances are judged only by native speakers of English (Lowenberg, 2002). In fact, in international contexts nonnative speakers use English more in communication with other nonnative speakers than with native speakers. More varieties of English are used by nonnative speakers in their local communities than are used by so-called inner circle English speakers (Kachru, 1997; Yano, 2001). Despite these changes in the uses and functions of the English language, however, the norms of inner circle English continue to frame the dominant conventions for the assessment of English oral performances (Kim, 2005).

Validity is the most important concept in language assessment and involves the adequacy and appropriateness of the ways that assessments are interpreted and used (Bachman, 1990). With regard to the validity of rating, there are a number of studies on test method effects, including rater factors (e.g., Bachman, 2002; Chalhoub-Deville & Wigglesworth, 2005). As for rater characteristics, raters' sex, occupation, teaching experience, and training experience have been examined (e.g., Brown, 1995; Lynch & McNamara, 1998; Upshur & Turner, 1995),


but raters' characteristics within the World Englishes perspective, such as their native/nonnative status and their attitudes toward World Englishes for assessment purposes, still require examination. Brown (1995) asserted that raters' decision-making about what is right comes from their experience. In investigating rater effects on the assessment of NNSs' English oral performances, this study proposes not only that raters must be part of the research framework, as they may influence assessment results of NNSs' oral performance, but also that raters' attitudes toward World Englishes should be the baseline for developing a research framework for speaking performance assessment.

Potentially Biasing Speech Rater Characteristics

Oral assessment in language learning has been the subject of intensive attention among second-language acquisition researchers (Iwashita, McNamara, & Elder, 2001). But research specifically pertaining to rater effects (e.g., Brown, Iwashita, & McNamara, 2005) and to the evaluation of spoken English proficiency (Shohamy, 1993) is still in some respects in exploratory stages (Boulet, van Zanten, Mckinley, & Gary, 2001). Linacre (1989/1993) used the term severity to refer both to the overall severity of the rater and to differences between raters in the way they interpret rating scales or criteria. McNamara and Adams (1991/1994) used the term rater characteristics to cover both overall severity and more specific effects such as rater bias.

Effects of Rater Educational and Professional Experience

Expert raters, as compared with novice or native raters, may be less influenced by surface features and more capable of examining language use, content, and rhetorical organization concurrently (Cumming, 1990). Studies by Galloway (1980), Barnwell (1989), and Hadden (1991) all found that classroom teachers and nonteaching native speakers differ in their assessments of learners' second language (L2) oral ability. Whereas Barnwell's (1989) results indicated that nonteaching raters were relatively harsher than the teaching rater group, Galloway's (1980) and Hadden's (1991) findings indicated that teachers were more critical of students' grammatical abilities than were laypersons. Chalhoub-Deville's (1995) investigation of rater groups (teachers of Arabic and nonteaching Arabic speakers) indicated that teachers tended to emphasize grammar in their assessments of students' proficiency, whereas nonteachers tended to be concerned with the more communicative aspects of the language. In contrast, another set of studies found that linguistically naïve listeners are quite sensitive to cross-dialectal and cross-linguistic differences in prosodic patterns, and that such listeners perform reliably when judging foreign accents (Anderson-Hsieh & Koehler, 1988; Munro, 1995). On that basis, Piske, MacKay, and Flege (2001) suggest that a broad and diverse sample of raters should be recruited, rather than only one particular type of rater.


Effects of Rater Nationality and Native Language

Research is inconclusive regarding how raters' nationality and native language affect their ratings of examinee oral proficiency. Brown's (1995) results pertaining to the Japanese Test for Tour Guides showed little evidence that native speakers are more suitable raters than nonnative speakers. However, some studies (e.g., Fayer & Krasinski, 1987; Santos, 1988) have found NNS raters to be more severe than NS raters. Chalhoub-Deville and Wigglesworth (2005) inquired whether there was a shared perception of speaking proficiency among raters from different English-speaking countries (Australia, Canada, the United Kingdom, and the United States) when rating speech samples of international English language students. They found that the U.K. raters were the harshest and the U.S. raters the most lenient.

Effects of Raters' Intercultural Contact and Exposure to NNS Varieties

According to Thompson (1991), individuals unfamiliar with a particular World English variety generally perceive a higher degree of L2 foreign accent than those who are familiar with that variety. Powers, Schedl, Wilson-Leung, and Butler (1999) had listeners complete a language background questionnaire at the outset of their experiment, indicating their degree of familiarity with languages other than English; the scales pertained to participants' foreign-language study and travel and to the nature and frequency of their contact with nonnative speakers of English. Their findings showed that no variables were consistently related to judges' performance. Derwing and Munro (1997), in contrast, asked listeners to indicate on a scale the amount of contact they had had with people who speak with any foreign accent. Their results showed that listeners' self-reported familiarity with various accents predicted their success at language identification, and that familiarity correlated with intelligibility scores. Indeed, these findings are generally consistent with evidence that the amount of interaction with speakers of specific languages or World Englishes facilitates listening comprehension of those English varieties (Field, 2003; Gass & Varonis, 1984).

Effects of Rater Training

Through exposure to anchor-point examples, through review of rating criteria, and by means of comparison with other raters, conventional rater training should be able to “calibrate” novice raters to a consistent standard and to recalibrate more experienced raters who may have drifted from that standard. Yet a body of research indicates that conventional training does not always reduce inter-rater differences (Brown, 1995; Myford & Wolfe, 2000), and the benefits of training may be short-term at best (Lumley & McNamara, 1995). Even with training, then, raters are not interchangeable. On the other hand, compensation for the nonconformity of individual raters can be achieved through statistical approaches such as generalizability theory (e.g., Stansfield & Kenyon, 1991) and many-facet Rasch modeling (e.g., Linacre, 1993; Lumley & McNamara, 1995). Generalizability theory provides a methodological approach to estimating the relative effects of variation in test tasks and rater judgments on test scores (Shavelson & Webb, 1991). Many-facet Rasch measurement investigates rater fit and adjusts for rater severity (Linacre, 1993).
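For reference, one common statement of the many-facet Rasch model (following Linacre's general formulation; the specific notation here is illustrative rather than quoted from the study) is:

ln(P_nijk / P_nij(k-1)) = B_n - D_i - C_j - F_k

where P_nijk is the probability that examinee n receives a rating in category k on task i from rater j, B_n is the examinee's ability, D_i the task's difficulty, C_j the rater's severity, and F_k the difficulty of the step from category k-1 to k. The C_j facet is what allows the model to adjust scores for rater severity.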


Sensitivity of Speech Evaluation to Listeners' Social Stereotypes

Another important factor influencing the rating process can be found in raters' attitudes toward nonnative speakers' Englishes, although there has been little research on raters' attitudes toward accented English in language assessment. Lambert, Hodgson, Gardner, and Fillenbaum (1960) first posited the Linguistic Stereotype Hypothesis, which holds that even short samples of nonprestige varieties of speech are sufficient to trigger among listeners a cascade of negative evaluations of speakers. Many of those evaluations are quite extraneous to language behaviors and touch upon physical characteristics such as height and attractiveness, general intelligence, and civility. The intervening years have produced a slew of research validating that hypothesis and extending it to a wide catalogue of speech varieties (e.g., gender-typed speech, World English pronunciations) and a wide range of social judgments (e.g., occupational competence, teaching ability). Yet perhaps the most consequential of such language attitudes are those that bear on teachers' biased assessments of student performance (e.g., Piché, Michlin, Rubin, & Sullivan, 1977), because negative linguistic stereotypes held by teachers are likely to impede the achievement of language minority students.

Don Rubin has extended this tradition of language attitude research (summarized in Rubin, 2002). His body of research reveals that not only does speech style affect judgments of teachers, but the converse is also true: social expectations regarding teachers affect students' judgments of those teachers' speech proficiency. When students are led to believe that they are listening to a NNS instructor, they rate that instructor's speech as more highly accented, and their listening comprehension measurably declines. This expectation effect holds even when the actual speech stimulus is thoroughly Standard American English (Rubin, 1992). The sort of “reverse linguistic stereotype” effect to which Rubin's work points, that is, the washback of general social judgments onto more specific evaluations of language proficiency, is substantiated by other research. Nguyen (1993) has claimed that inherent rater biases against certain nationalities render valid standardized testing of oral proficiency unattainable for speakers from those countries. In short, when listeners harbor certain stereotypes about speaker identity, they are rendered incapable of objectively assessing speaker pronunciation. However, little is known about what predisposes some listeners to hold more negative expectations than others.

Attitudes toward International Teaching Assistants (ITAs)

Too often, undergraduates who perceive problems complain about the intelligibility of international instructors (Lindemann, 2003). Distortion can be particularly potent in the undergraduate student/ITA relationship. It is often believed that ITAs' lack of English proficiency hinders undergraduates' ability to comprehend subject material (Smith, Strom, & Muthuswamy, 2005). Even though research shows that most nonnative speakers possess sufficient language proficiency to accomplish their instructional goals, some people may still question the language skills of the nonnative teacher (Llurda, 2000). In fact, NNS instructors are almost universally regarded as less competent teachers than their native-English-speaking peers (Rubin & Smith, 1990).
Previous research (Rubin, 2002; Lindemann, 2002) documents that these student complaints are frequently more a function of students' stereotyped expectations than of instructors' objective language performance.


Nonetheless, the negative attitudes and expectations held by many U.S. undergraduates can materially interfere with information uptake and exert a deleterious effect on learning. According to one survey, 40% of undergraduates at some point in their education dropped or switched classes because the instructor was a NNS (Rubin & Smith, 1990). Lindemann (2003) demonstrated that U.S. students who hold the most prejudiced expectations of internationals' English proficiency are least likely to engage in vigorous questioning of those internationals. Reluctance to interact with one's instructor no doubt results in lowered learning outcomes.

Intergroup Contact as a Tool for Reducing Prejudice

Interactional contact between two groups can have positive effects on group attitudes, improve group relations, and reduce prejudice if certain conditions are met (Voci & Hewstone, 2003). Those conditions include: (1) equal status is ensured for members of the different groups; (2) the groups work toward common goals; (3) contact tasks involve cooperative interdependence between group members; (4) sufficient time is given, with other stresses relieved; (5) a high potential for interpersonal acquaintance between members is provided; and (6) participants are seen as being typical of their groups (Milhouse, Asante, & Nwosu, 2001). According to the contact hypothesis (Allport, 1954), contact under these conditions will create a positive intergroup encounter, which, in turn, will bring about an improvement in intergroup relations (Amichai-Hamburger & McKenna, 2006). Moreover, true acquaintance lessens prejudice (Allport, 1954).

Measurable Components of Accent and Comprehensibility

Experts are hardly unanimous regarding which specific aspects of NNS pronunciation most affect intelligibility (Fayer & Krasinski, 1987). Until recently, most research examining aspects of L2 speech production was concerned with phonemic segmental phenomena (e.g., Cole, Jakimik, & Cooper, 1978). The consensus, however, now seems to be moving toward an appreciation of the role that differences in speaking rates, pauses, and intonation patterns may play in intelligibility and listeners' assessments (Derwing & Munro, 1997, 2005; Munro, 1995). Acoustic parameters combining the rate, pauses, and intonation of speech have not yet been well studied, and most research utilizing acoustical analyses of speech is directed toward phonetics or phonology in a linguistic sense rather than toward the assessment of oral performances.

Anderson-Hsieh and Koehler (1988) emphasize the importance of speaking rate for the comprehension of heavily accented NNS speech. Nonstandard word stress can also erode comprehensibility (Gallejo, 1990; Field, 2005). In addition, analysis of nonnative speaker data shows qualitative differences in both the placement and the length of pauses, which can materially affect the overall prosodic structure of the discourse (Pickering, 1999). In Pickering's (1999) study of two parallel lecture extracts, one given by NS teaching assistants and the other by Chinese ITAs, she found that pauses in the NNS data were both longer and more erratic than those in the NS data and tended to regularly break up conceptual units. Rounds (1987) found a prevalence of “empty pauses,” regular moments of silence unrelated to board work or dramatic effect, which artificially increased the amount of silence in the discourse.
These empty pauses were likely to be linked to negative perceptions of ITAs on the part of undergraduate students. Intonation is likewise a key component of comprehensibility and of communication in general (Brazil, 1997).


For example, intonation patterns characteristic of many East Asian speakers can cause U.S. listeners to lose concentration or to misunderstand the speaker's intent (Pickering, 2001). Elements of accent can now be detected by instrumental, computer-assisted acoustical analysis, without reference to listener dictation accuracy or expert ratings. Computer-assisted phonetic analysis (e.g., Computerized Speech Laboratory and PRAAT) has assisted in characterizing different accents by examining patterns of fundamental frequencies or F0 formants. In recent studies, it is becoming common to use such instrumentation rather than to depend on subjective judgments (see, e.g., Ingram & Park, 1997; Pickering, 2001, 2004; Levis & Pickering, 2004). Generally the methodology must incorporate discourse analysis: an analyst first identifies a pragmatic context in which a particular intonational contour would be expected, and computer-based analysis is then used to confirm (or disconfirm) that the expected contour does indeed appear at that site in the speech stream (see Wennerstrom, 2001). Computer-assisted phonetic analysis likewise simplifies, and renders more precise, the task of ascertaining speech rate.

Method

Participants

The study involved two phases of data collection. In Phase I, raters with no training intervention rated speaking performances. In Phase II, a subset of the raters was given a social-psychological rater intervention (training). Sixty-three undergraduate participants (13 male and 50 female) attending a large public university in the Southeastern United States were selected for this study such that they collectively varied across the predictor variables: (a) English language native speaker status (43 native and 20 nonnative); (b) a composite index of exposure to nonnative-English-speaking friends and acquaintances (self-report: 98.4% of respondents were exposed to more than one NNS; mean = 6.27, SD = 5.98); and (c) formal training in language studies or experience in English language teaching (42 with no formal training and 21 with at least one formal class in linguistics; for foreign languages studied at high school or college, 1 with none, 26 with one, 25 with two, and 11 with three or more; for language teaching experience, 52 with none and 11 with at least one course tutored). Only undergraduates with no previous experience in standardized assessment rating activities were recruited. They were remunerated at a rate of $8.00 per hour.

Speech Performance Samples

Speech samples were obtained from 11 male international teaching assistants (3 Chinese, 2 Japanese, 1 Korean, 2 Arabic, 1 Russian, 1 Nepali, 1 Sri Lankan). The selected segments of the ITAs' in-class presentations each described a concept from the presenter's major course of study and were approximately four minutes in length (+/- 10 seconds). Each consisted of a TA's narrative lecture with little disfluency (e.g., repetitions and restarts) and no interruptions from students' questions, responses, or laughter. The ITAs' TAST (TOEFL® Academic Speaking Test) scores were collected from their course instructor to independently identify the ITAs' communication skills. In addition, classroom lectures of three male U.S. native TAs were audio recorded and used to examine the degree of difference between native and nonnative speaker realizations in terms of pause structures, speech rates, and intonation.
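The acoustic comparisons just described were carried out with PRAAT. As an illustration only, the following minimal sketch shows how such an analysis can be scripted with Parselmouth, a Python interface to PRAAT. The use of Parselmouth, the file name, and the analysis settings are assumptions for illustration, not the study's actual procedure.

```python
# Minimal sketch (assumes Parselmouth is installed; the study itself used
# PRAAT directly). Extracts an F0 track from a hypothetical recording and
# summarizes the speaker's pitch range.
import parselmouth

snd = parselmouth.Sound("ita1_lecture.wav")   # hypothetical file name
pitch = snd.to_pitch(time_step=0.01)          # one F0 estimate every 10 ms

f0 = pitch.selected_array["frequency"]        # Hz; 0.0 marks unvoiced frames
voiced = f0[f0 > 0]

print(f"mean F0:  {voiced.mean():.1f} Hz")
print(f"F0 range: {voiced.min():.1f}-{voiced.max():.1f} Hz")
```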
Table 1 shows the ITAs' background information along with their performance scores on the linguistic measures. The ITAs' collaboration was compensated at a rate of $10.00 per hour.

Table 1. Speech Performance Samples and Their Linguistic Measures

Participant  Nationality   TAST Score  SR    AR    MLR    PTR    MLSP  SP/min  FP/min  MLFP  Pace   Space
ITA1         Saudi Arabia  21.25/30    2.99  4.27   5.90  69.90  0.58   30.83    7.67  0.27  29.49  0.25
ITA2         Nepal         18.75/30    2.69  4.23   5.19  63.68  0.69   31.70    5.38  0.38  28.14  0.30
ITA3         Saudi Arabia  18.75/30    2.26  4.10   4.50  55.07  0.89   30.43    3.98  0.21  26.21  0.26
ITA4         South Korea   40/50*      2.75  4.45   4.69  61.10  0.66   35.40    7.62  0.38  28.60  0.24
ITA5         Japan         16.25/30    2.14  3.84   3.86  55.79  0.78   33.83    9.14  0.35  32.40  0.26
ITA6         Russia        23.75/30    2.67  4.30   5.80  62.05  0.80   28.20    9.20  0.48  26.38  0.35
ITA7         India         25/30       3.05  4.69   4.30  64.91  0.49   43.29   15.91  0.38  26.53  0.32
ITA8         China         20/30       2.84  4.41   4.88  64.33  0.60   35.46    3.72  0.33  16.61  0.38
ITA9         Japan         16/30       2.25  3.61   3.76  62.40  0.61   36.50    2.99  0.34  35.07  0.18
ITA10        China         14.75/30    2.20  3.81   4.23  57.80  0.79   31.83    1.44  0.39  37.97  0.25
ITA11        China         12.25/30    1.23  3.11   4.12  39.74  1.97   18.34    1.44  0.14  46.19  0.26
US TA1       US            NA          4.82  6.20  11.11  77.81  0.52   22.75    3.53  0.45  31.09  0.29
US TA2       US            NA          4.04  4.94   8.61  81.30  0.38   28.40    6.10  0.31  30.83  0.24
US TA3       US            NA          4.39  5.09  12.02  86.15  0.37   22.22    3.10  0.26  30.54  0.27

Note. SR = speech rate (syllables/second); AR = articulation rate; MLR = mean length of runs (syllables); PTR = phonation time ratio (%); MLSP = mean length of silent pauses (seconds); SP/min = silent pauses per minute; FP/min = filled pauses per minute; MLFP = mean length of filled pauses (seconds); Pace = stressed words per minute; Space = proportion of stressed words to total words. *ITA4's score is a SPEAK test score (out of 50) rather than a TAST score.
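As described in the Data Analysis section below, the ITAs and U.S. TAs were compared on each acoustic measure with independent-samples t-tests. The following is a minimal sketch of that comparison for the speech-rate column of Table 1, using scipy (an assumption for illustration; the original analyses were not necessarily run in Python):

```python
# Sketch: independent-samples t-test (equal variances assumed, as in the
# study) comparing ITAs and U.S. TAs on speech rate (syllables/second),
# with values taken from Table 1.
from scipy import stats

ita = [2.99, 2.69, 2.26, 2.75, 2.14, 2.67, 3.05, 2.84, 2.25, 2.20, 1.23]
us_ta = [4.82, 4.04, 4.39]

t, p = stats.ttest_ind(ita, us_ta)  # default: pooled-variance t-test
print(f"t = {t:.2f}, p = {p:.4f}")
```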


Linguistic Measures: Acoustic Analysis of Speech Rate and Pause Structure

As these acoustic parameters are gradient in nature, measurements were taken of the range of baseline native speaker realizations of significant features, and the degree of difference between native and nonnative speaker realizations was calculated. The acoustic indices were chosen on the basis of previous work investigating the phonetic cues used to signal discourse structure in native speaker models and the perceived comprehensibility of the discourse (e.g., Riggenbach, 1991; Pickering, 1999; Towell, Hawkins, & Bazergui, 1996). Using the PRAAT computer program, four minutes (+/- 10 seconds) of continuous speech were identified and demarcated with the program cursor. Rates of speech were calculated according to the method recommended by Riggenbach (1991) and Kormos and Denes (2004). For the measures of pause structure, the pause unit model of Brown (1977) and Brown and Yule (1983) was utilized; that is, only pauses over 0.1 seconds were considered.

Rate measures:
1. Speech Rate: mean number of syllables produced per second
2. Articulation Rate: mean number of syllables produced per second over the total amount of time talking, excluding pause time
3. Mean Length of Runs: average number of syllables produced in utterances between pauses of 0.1 seconds and above
4. Phonation Time Ratio: percentage of time spent speaking as a proportion of the total time taken to produce the speech sample

Pause measures:
1. Number of Silent Pauses: number of silent pauses per four-minute speech sample
2. Mean Length of Pauses: mean length of silent pauses of 0.1 seconds or greater
3. Number of Filled Pauses: number of filled pauses (not including repetitions, restarts, or repairs)
4. Length of Filled Pauses: average length of filled pauses

Intonation measures:
1. Pace: number of stressed words per minute (Vanderplank, 1993)
2. Space: proportion of stressed words to the total number of words (Vanderplank, 1993)
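To make these definitions concrete, here is a minimal computational sketch. It assumes, hypothetically, that a syllable count, silent-pause intervals, and stress counts have already been obtained from a PRAAT annotation of one four-minute sample; the numbers are illustrative placeholders, not data from the study.

```python
# Sketch of the rate, pause, and intonation measures defined above.
# All inputs are hypothetical placeholders for values that would come
# from a PRAAT annotation of a single four-minute speech sample.
TOTAL_TIME = 240.0                          # seconds (four minutes)
N_SYLLABLES = 650                           # assumed syllable count
silent_pauses = [(3.2, 3.9), (8.1, 8.4)]    # assumed (start, end) pairs, >= 0.1 s
n_words, n_stressed = 520, 120              # assumed word and stress counts

pause_time = sum(end - start for start, end in silent_pauses)
speaking_time = TOTAL_TIME - pause_time

speech_rate = N_SYLLABLES / TOTAL_TIME               # syllables per second
articulation_rate = N_SYLLABLES / speaking_time      # excludes pause time
mean_length_of_runs = N_SYLLABLES / (len(silent_pauses) + 1)  # runs = pauses + 1
phonation_time_ratio = 100 * speaking_time / TOTAL_TIME       # percent

silent_pauses_per_min = len(silent_pauses) / (TOTAL_TIME / 60)
mean_pause_length = pause_time / len(silent_pauses)  # seconds

pace = n_stressed / (TOTAL_TIME / 60)       # stressed words per minute
space = n_stressed / n_words                # proportion of stressed words

print(f"speech rate: {speech_rate:.2f} syll/s, "
      f"articulation rate: {articulation_rate:.2f} syll/s")
```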


Procedures

Participants were administered a measure of reverse linguistic stereotyping (Rubin, 1992; Zahn & Hopper, 1985). They heard two different sections of a four-minute audiotaped lecture in Standard American English, each associated with a different face: a slide photograph representing the instructor, showing either a Caucasian or an Asian man, was projected on the screen. A distracter task was included between the two sessions. Immediately after hearing the lecture, participants were asked to complete a speech evaluation instrument (SEI) and a partial cloze test of listening comprehension.

After about four weeks, participants were provided with a URL and rated the 11 four-minute ITA speech samples online, presented as streaming audio, on each of the dependent variables. The dependent variables included: (a) a measure of speaker comprehensibility, (b) a measure of instructional competence, (c) a measure of language proficiency, and (d) an accentedness rating. Raters first listened to one trial speech sample with a brief online training tutorial. The 11 speech samples were ordered randomly. Each of the four response measures utilized sets of semantic differential items on seven-point scales.

The measure of speaker comprehensibility had five subscales and is an expansion of Derwing and Munro's (1995, 1997) single item. Each element was scored from 1 (was hard to understand) to 7 (was easy to understand). The Cronbach alpha coefficient across the five items was 0.94, so all five items were summed into a composite measure of comprehensibility. The measures of instructional competence (e.g., Poor teacher : : : : : : : Effective teacher) and English language proficiency (e.g., Low proficiency : : : : : : : High proficiency) each consisted of eight semantic differential items. The former was based on Sarwark, Smith, MacCallum, and Cascallar (1995) and Rubin (1992), and the Cronbach alpha of its subscales was 0.92; the latter was derived from the ETS rating rubrics, and the internal consistency reliability coefficient of its scales was 0.91. The accentedness rating subscales (e.g., Speaks with a foreign accent : : : : : : : Speaks with an American accent) comprised four semantic differential items, and their internal consistency was marginally acceptable (0.76). Accordingly, the total scores of these measures were used for the final analysis. The four dependent variable measures turned out to be moderately correlated, with Pearson r values of 0.67 and above, all significant at the p < .01 level.

Four weeks after Phase I, the Phase II rater training, designed as a social-psychological inoculation against linguistic stereotyping, took place with the 11 ITAs and 29 randomly selected raters. Each of the three meeting sessions consisted of a one-hour informal meeting between five or six ITAs and a random sample of nine or ten raters. Then the 63 undergraduate raters (29 trained and 34 untrained) listened to the same 11 samples and completed the four measures utilized in Phase I. The time between Phase I and Phase II was six weeks.

Data Analysis

The data were sets of repeated measures. Multiple regression was employed to account for the variance in an interval dependent variable on the basis of linear combinations of independent variables. A set of independent variables can explain a proportion of the variance in a dependent variable through a significance test of R², and the relative predictive importance of the independent variables can be ascertained by comparing beta weights. Stepwise multiple regressions were used for purposes of pure prediction. That is, in stage one, the independent variable that best correlated with the dependent variable was entered into the equation. In the second stage, the remaining independent variable with the highest partial correlation with the dependent variable, controlling for the first independent variable, was entered. This process was repeated until the addition of a remaining independent variable no longer increased R² by a significant amount.

In order to examine the effects of the different variables, the analyses required separate multiple regression runs: one set with the speech acoustic indicators as predictors and one with the rater background characteristics as predictors. When testing the predictive power of the acoustic indicators, variance due to rater background characteristics was ignored in the regression model. In addition, independent-samples t-tests were conducted to determine whether the means of each acoustic variable differed significantly between the ITAs and the U.S. TAs. Next, the variance attributable to the rater background characteristics was isolated, ignoring variance due to acoustic features.
To follow up significant effects on specific dimensions, a multivariate analysis of variance (MANOVA) was performed. Both of the regression models were run separately for each of the dependent variables. Only the rating scores from Phase I were analyzed in the regression models, because those in Phase II had a different structure of rater evaluations for the trained and untrained groups.
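The stepwise procedure described above amounts to forward selection: at each stage, the remaining predictor with the strongest partial correlation (equivalently, the largest significant increment in R²) is entered. The following is a minimal sketch with hypothetical data, assuming statsmodels is available; it is an illustration of the technique, not the study's actual analysis script.

```python
# Sketch: forward stepwise multiple regression. Predictors enter one at a
# time; entry stops when no remaining predictor yields a significant gain.
# The data frame below is entirely hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 63  # one row per rater, matching the study's sample size
df = pd.DataFrame({
    "speech_rate":   rng.normal(2.5, 0.5, n),
    "silent_pauses": rng.normal(30.0, 5.0, n),
    "pace":          rng.normal(30.0, 6.0, n),
})
# Fabricated dependent variable so that selection has something to find.
df["rating"] = 2.0 * df["speech_rate"] - 0.05 * df["silent_pauses"] + rng.normal(0, 1, n)

def forward_stepwise(data, dv, alpha=0.05):
    selected = []
    remaining = [c for c in data.columns if c != dv]
    while remaining:
        # p-value of each candidate when added to the current model;
        # for a single added predictor this matches the F-test on the
        # increment in R^2.
        pvals = {}
        for cand in remaining:
            X = sm.add_constant(data[selected + [cand]])
            pvals[cand] = sm.OLS(data[dv], X).fit().pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:   # no significant R^2 gain: stop
            break
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_stepwise(df, "rating"))  # e.g., ['speech_rate', 'silent_pauses']
```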


For the analysis of attitude perceptions, the linguistic stereotyping measures were divided into four separate dependent variables (three factors and one cloze test score) and submitted to a 2 x 2 mixed factorial ANCOVA, with perceived physical attractiveness as a covariate. Finally, in order to determine the effect of the rater training, an ANCOVA was computed for each of the dependent variables, with the rating scores from Phase I used as covariates.

Results

Perceptual Judgments and Acoustic Measures of Accentedness

Prior to the primary analysis of the multiple regressions, the ITAs and U.S. TAs were compared by means of independent-samples t-tests, which determined whether each of the variable mean scores for the two groups was distinct, provided that the underlying distributions could be assumed to be normal and to have equal variances. The results in Table 2 reveal significant differences between the ITAs and the U.S. instructors on six of the ten acoustic variables investigated: speech rate, articulation rate, mean length of runs, phonation time ratio, number of filled pauses per minute, and number of stressed words per minute (p