
Language Assessment Quarterly, 10: 196–218, 2013 Copyright © Taylor & Francis Group, LLC ISSN: 1543-4303 print / 1543-4311 online DOI: 10.1080/15434303.2012.721423

Does Test Preparation Work? Implications for Score Validity

Qin Xie

Hong Kong Institute of Education, Tai Po, Hong Kong

This article reports an empirical study that examined the pattern of test preparation for College English Test Band 4 (CET4) and the differential effects of test preparation practices on its scores, thereby drawing implications for CET4 score validity. Data collection involved 1,003 test takers of CET4. A pretest was administered at the beginning of a 2-month test preparation period; a posttest and a test preparation questionnaire were administered about 9 weeks later. Multiple regression and structural equation modeling were used in data analysis. The study found that the pattern of test preparation was better explained by the perceived functions of strategy use in improving test scores than by the information-processing mechanisms underlying strategy use. Test preparation did improve test scores, but the effects came primarily from preparation practices that narrow the curriculum, especially drilling. Although the effects were small in absolute terms, they represented almost one third of the effects of the pretest. Findings of this study suggest that the extrapolation validity of CET4 scores may warrant special attention in future validation studies.

The effectiveness of test preparation in improving test scores is of considerable interest not only to educators and test takers but also to test developers and users. For the latter, the core questions are as follows: Does test preparation inflate test scores? If so, to what extent do the inflated scores reflect a corresponding increase on the intended constructs of the tests? To what extent does test preparation affect the validity and reliability of test scores? Messick (1982) provided an inclusive definition of test preparation (or coaching) as “any intervention procedure specifically undertaken to improve test scores, whether by improving the skills measured by the test or by improving the skills for taking the test, or both” (p. 70).

In the literature on educational assessment and measurement, there are two strands of studies on test preparation. One strand focuses on the effects of test preparation on scores. Most studies in this strand (e.g., Powers, 1985; Powers & Rock, 1999) were conducted in the 1980s and early 1990s, focusing on the Scholastic Aptitude Test (SAT) and the Graduate Record Examination (GRE). These studies employed a longitudinal, pretest-and-posttest design to investigate the effects of test preparation on improving test scores. Their findings suggested that test preparation could improve scores, but the effects tended to be small. Messick (1982) reported an average effect size of less than one fifth of a standard deviation based on his meta-analysis of 39 studies on test preparation for the SAT.

Correspondence should be sent to Qin Xie, B4-2f/12, Department of Linguistics and Modern Languages, Hong Kong Institute of Education, 10 Lo Ping Road, Tai Po, New Territory, Hong Kong. E-mail: [email protected]


The other strand falls within the area of washback studies of high-stakes language tests. Washback, or backwash, refers to the influence of tests or testing on teaching and learning (Alderson & Wall, 1993). Within this strand, test preparation is also termed teaching and learning to the test and is considered one aspect of washback. Prodromou (1995) referred to test preparation as “overt” washback, in contrast to the less observable, hence “covert,” form of washback. Examples of the latter are test influences on teaching pedagogy, textbooks, and school administration. Many washback studies focus on the “covert” forms of washback and attempt to identify the reasons why the intended washback does or does not happen (e.g., Alderson & Hamp-Lyons, 1996; Cheng, 1999; Wall & Alderson, 1993; Watanabe, 2004). In comparison, studies focusing on “overt” washback are relatively few; only a handful have explored the relations between test preparation and scores (e.g., Gan, 2009; A. Green, 2007), and even fewer have investigated the implications for score validity. Although 11 years apart, both Hamp-Lyons (1998) and Gan (2009) noted the inadequacy of studies on test preparation in contrast to its booming market in many parts of the world.

Washback studies that investigate the effects of test preparation (e.g., Brown, 1998; Elder & O’Loughlin, 2003; A. Green, 2007; Robb & Ercanbrack, 1999) usually adopt a cross-sectional, comparative design. These studies compare a test-taker group engaged in preparation courses with another group in general English language courses and/or a group in English language courses for specific purposes (e.g., academic, business). Findings of these studies remain inconclusive, though most find that test preparation courses do not have significantly larger effects than English language courses. Brown (1998) found that preparation courses for the International English Language Testing System (IELTS) outperformed English for Academic Purposes (EAP) courses in improving test performance. A. Green (2007), however, observed that students in IELTS preparation courses did not obtain larger gain scores than those in EAP courses. Similarly, Robb and Ercanbrack (1999) reported that preparation courses for the Test of English for International Communication (TOEIC) showed no clear advantages over business English courses or general English courses in improving overall TOEIC scores. Moreover, Gan (2009) found no significant differences in IELTS scores between students who participated in preparation courses and those who did not.

In both strands of studies on test preparation, the exact nature of test preparation remains underinvestigated and underdocumented. Messick (1982), for instance, noted that many studies on test preparation did not clearly describe the exact preparation practices involved, making it difficult to compare findings across studies. Furthermore, few studies have examined the differential effects of preparation practices on test scores. Because some practices may improve scores via enhancing target constructs, whereas other practices may do so without improving target constructs, the differential effects of preparation practices clearly have different implications for score validity. Intuitively, test preparation practices are related to language learning and use strategies. Within that area of study, the differential effects of strategy use on test performance are of constant interest (Cohen & Macaro, 2007).
O’Malley and Chamot (1990) classified learner strategies according to information-processing theory into three types: metacognitive, cognitive, and socioaffective. They noted that goal setting and planning, monitoring the learning process, assessing, and evaluating were metacognitive strategies, whereas information comprehension, memorizing and storing, and retrieval were cognitive strategies. Metacognition means “beyond cognition,” which “is viewed as encompassing strategies for regulating and controlling one’s own cognition” (Oxford & Schramm, 2007, p. 63).


Similarly, Oxford (1990) classified learning strategies into six types in her well-known Strategy Inventory for Language Learning (SILL): memory, cognitive, compensation, metacognitive, affective, and social. Later studies (e.g., J. M. Green & Oxford, 1995; Park, 1997) found these six strategies loaded on fewer factors, lending support to the broader three-part classification. Within language testing, Purpura (1999) found metacognitive strategy was better explained as a unidimensional factor labelled “assessing.” Goal setting and planning strategies did not load on the “assessing” factor consistently and were deleted from this factor. Both Purpura (1999) and Phakiti (2003) observed positive and strong relationships between the uses of metacognitive and cognitive strategies, which in turn contributed to test performance. Song and Cheng (2006) focused on College English Test Band 4 (CET4) in China and obtained findings similar to those of Purpura and Phakiti. However, none of the aforementioned studies examined the implications of their findings for score validity.

Messick (1982) and Haladyna and Downing (2004) are useful for understanding the differential implications of test preparation practices for score validity. Messick (1982) classified test preparation into three types according to their implications for score validity. According to Messick, Type 1 test preparation improves test scores through enhancing the intended constructs of the tests and thus poses no harm to construct validity. Type 2 test preparation improves test scores through reducing construct-irrelevant difficulties, such as test anxiety and unfamiliarity with test formats. Because the presence of construct-irrelevant difficulties may inaccurately lower test scores, Type 2 enhances the accuracy of target construct measurement and is beneficial to construct validity. Type 3 improves test scores through enhancing test-taking skills that are irrelevant to the construct. Test-taking skills can help test takers who do not possess the target skills to get correct answers, thus rendering their test scores inaccurately higher. Therefore, Messick considered Type 3 to be harmful to score validity.

Messick (1982), however, did not differentiate test preparation via narrowing the curriculum from learning strategies focusing on developing a broad range of domain skills. Test preparation via narrowing the curriculum (Type 4) is characterised by using test materials, following a test-based curriculum, using similar or identical test items, or focusing exactly on what the test measures. Although both Type 4 and Type 1 test preparation may improve test scores via enhancing target constructs, Type 4 is intrinsically different from Type 1. Whereas Type 1 focuses on the development of a broad range of skills entailed by the target domain, Type 4 focuses on a narrow range of skills sampled by the test. Haladyna and Downing (2004) criticized test preparation via narrowing the curriculum as “unethical,” because it may weaken the extrapolation link from test scores to untested behaviours (Kane, 1992). Achievement tests infer test takers’ ability on a large domain from their performance on a limited number of tasks sampled from the domain. If test takers focus on the narrow range of content and skills that a test samples, thereby improving test scores, the inflated scores are unable to represent a corresponding increase on the domain of interest.
Thus, Type 4 may weaken the inference link between test scores and the domain of their reference.

Based on the aforementioned literature review, this study set out to address the following two research questions within the context of CET4 in China.

RQ1: To what extent do test takers make use of preparation strategies to enhance test scores? What is the pattern of their test preparation?


RQ2: To what extent and in what ways do uses of preparation strategies affect test performance? What are the implications for score validity?

CET4 is a paper-based, standardized test of academic English language proficiency. The test is developed by a professional test agency, the National College English Test Committee (NCETC), under the Ministry of Education, China. It is administered twice a year to all non-English-major undergraduate students of all universities across Mainland China. Each administration involves more than 2 million test takers (Yang & Weir, 1998). CET4 is considered high stakes because its results are widely used as a university graduation requirement and for job applications. These high-stakes uses of test results drive test takers to engage in various forms of test preparation. However, empirical studies investigating CET4 takers’ preparation practices and their differential effects on test scores appear to be insufficient (Jin, 2006).

METHOD

This study investigated test takers’ preparation practices via a self-report questionnaire. Besides answering the questionnaire, test takers also took a pre- and a posttest of CET4 at the beginning and toward the end of their test preparation.

Participants

This study involved a large sample of test takers over a 2-month period before they took CET4 in June 2009. All participants were Year 2 undergraduates (2007 cohort) from one university in South China. They all registered to take CET4 soon after this study. A total of 1,003 students, representing one third of this cohort, were recruited on a voluntary basis from 34 intact classes taught by 17 English language teachers. Twenty of the 31 teachers teaching this cohort were selected with reference to the types of students they taught (science, engineering, or arts), their teaching experience (in years), and gender. The selection aimed at achieving a balanced sample of teachers and students that could optimally represent the general teaching and learning practices of this cohort at this university. The selected teachers were approached one after another, with 17 of them agreeing to participate.

Data Collection Procedure

The study had three data collection points. The first data collection started 10 weeks before the CET4 administration, when participants took a pretest. Approximately 9 weeks later, they took a posttest. Three days after taking the posttest, they completed a test preparation questionnaire. The pre- and the posttest took approximately 2 hr each, and the questionnaire took about 20 min. Because of the multiple data collection points, the number of participants at each point varied slightly. After careful data cleaning and examination, 847 valid cases were kept for the pretest, 833 for the posttest, and 873 for the questionnaire. The resultant sample sizes were considered satisfactory and sufficient for the purpose of this study.


Instrumentation

Test preparation questionnaire. The questionnaire requires test takers to report their use of strategies in preparation for the CET4 during the 8-week period immediately before the test. It contains 40 items; all items are presented as statements in the past tense with explicit reference to the preparation period and the CET4. Each item is rated on a 5-point Likert scale with the following tags: 1 (does not apply to me at all), 2 (applies to me sometimes), 3 (applies to me half of the time), 4 (applies to me most of the time), and 5 (applies to me entirely). The 40 items fall under six subscales, which represent six preparation practices: test preparation management (TPM), rehearsing test-taking skills (TTS), memorizing (MEM), drilling (DRILL), socioaffective strategies (SOAF), and language-skill development strategies (LSD).

The questionnaire was developed primarily through focus group interviews and an open-ended questionnaire in a linked study conducted 1 year earlier (Xie, 2008). Analysis of the interview data from 30 test takers generated six categories of test preparation practices. Based on these results, as well as on established questionnaires such as the Cognitive and Metacognitive Strategy Questionnaire (Purpura, 1999) and the SILL (Oxford, 1990), a test preparation questionnaire was developed. It was piloted with a small sample (N = 157) half a year later. Forty items with adequate psychometric qualities (i.e., high loadings on the intended factors and low loadings on the unintended factors) were selected for the present study.

The factor structure of the questionnaire was cross-validated in the present study and achieved satisfactory results. The total sample (N = 873) was split into halves randomly. Exploratory factor analysis (EFA) was performed with the first half of the sample (n = 436). Based on the EFA-derived factor structure, a measurement model was specified and validated by confirmatory factor analysis (CFA) with the other half of the sample (n = 437). EFA generated six factors, which fit the intended structure of the questionnaire very well (Table 1). CFA also achieved a satisfactory model fit: χ²(721) = 1434.14, χ²/df = 1.989, p = .000, root mean square error of approximation (RMSEA) = .048, 90% confidence interval (CI) [.044, .051], standardized root mean square residual (SRMR) = .054, normed fit index (NFI) = .846, comparative fit index (CFI) = .916, and Tucker-Lewis index (TLI) = .910. Table 2 presents the six factors with example items and reliability statistics.

TTS refers to practices whereby test takers explore and experiment with test-taking skills for different sections of the test. TPM resembles the metacognitive strategy factor of “assessing” in Purpura (1999). It refers to test preparation practices such as reading coaching materials, analyzing past test papers to identify frequently assessed areas, and conducting self-assessments of personal strengths and weaknesses. DRILL refers to intensive and repetitive practice of the narrow range of language skills and knowledge tested by CET4. MEM refers to the practice of memorizing vocabulary, phrases, linking words, and exemplar essays on past CET4 essay topics. SOAF refers to test takers’ use of social strategies to seek support from teachers and peers, and their use of affective strategies to motivate themselves and to reduce test anxiety.
Items in this scale are similar to the social and affective strategy scales of the SILL (Oxford, 1990) but are contextualized to the CET4 test preparation setting. LSD refers to the learning strategies that test takers use to develop language skills via extensive exposure to and functional uses of the English language in authentic contexts. The seven items in this scale were initially drawn from the Cognitive Learning Strategies Questionnaire (Purpura, 1999) but were modified to cater for the test preparation context of CET4.

The six test preparation scales are classified as Types 1, 2, 3, and 4 according to their implications for score validity, as reviewed earlier.

TABLE 1
Exploratory Factor Analysis Pattern Matrix With Half Sample

Item       TTS      LSD      TPM      MEM      SOAF     DRILL
TTS1       .715     .002     .089     .013     .011     .043
TTS2       .679     .043    −.025     .006    −.084    −.013
TTS3       .668     .047     .009    −.023    −.057    −.081
TTS4       .661     .000    −.013    −.146     .059     .035
TTS5       .608     .027     .070    −.011    −.054     .027
TTS6       .583     .051     .115    −.035    −.018    −.130
TTS7       .572    −.015     .080    −.052    −.088    −.041
TTS8       .495     .050     .112    −.027    −.024    −.159
LSD1      −.035     .855     .008     .005     .057     .017
LSD2       .042     .786     .011     .034    −.003    −.010
LSD3      −.004     .736     .002    −.031     .004     .006
LSD4       .062     .729    −.055     .045    −.056     .010
LSD5       .108     .709    −.027     .055    −.068     .021
LSD6      −.031     .649     .024    −.028    −.009    −.045
LSD7      −.063     .493     .069    −.187    −.040     .009
TPM1       .121    −.018     .645    −.058     .023    −.013
TPM2       .152    −.064     .645    −.018     .005     .130
TPM3       .041     .032     .643     .001     .028    −.075
TPM4      −.019     .025     .639     .016    −.011    −.004
TPM5      −.171     .099     .619    −.070    −.012    −.064
TPM6       .046    −.003     .606    −.001    −.074    −.066
TPM7       .135    −.028     .589    −.052    −.018     .120
TPM8      −.039    −.023     .530    −.052    −.124     .007
TPM9      −.016     .006     .511    −.015     .028    −.135
TPM10      .059     .045     .483     .047    −.052    −.027
MEM1      −.001     .023    −.047    −.839    −.025     .033
MEM2      −.050     .026     .010    −.822     .061    −.004
MEM3       .142    −.004     .050    −.548    −.077    −.010
MEM4       .127    −.036     .061    −.503    −.123    −.072
MEM5       .027     .031     .147    −.389    −.136    −.071
SOAF1     −.007    −.091    −.012    −.015    −.816    −.067
SOAF2     −.036    −.002    −.033    −.022    −.761     .027
SOAF3      .098    −.033     .045    −.057    −.592    −.031
SOAF4     −.057     .167     .050    −.029    −.544    −.051
SOAF5      .027     .120     .014     .016    −.531     .039
SOAF6      .100     .108     .100    −.014    −.462     .035
DRILL1     .266    −.001    −.019    −.184    −.090    −.546
DRILL2     .411    −.005     .032    −.070    −.019    −.541
DRILL3    −.076     .042     .166    −.001    −.091    −.385
DRILL4     .294     .064     .054    −.116    −.034    −.352

Note. N = 436. TTS = test-taking skills; LSD = language-skill development strategies; TPM = test preparation management; MEM = memorizing; SOAF = socioaffective strategies; DRILL = drilling. Each item's loading on its intended factor appears in the column of the same name.
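To make the split-half cross-validation procedure concrete, the sketch below re-creates the EFA step in Python rather than in SPSS, which the study actually used. The item-level DataFrame `items`, the random seed, and the function name are illustrative assumptions; a pattern matrix such as Table 1 implies an oblique rotation, so promax is assumed here, although the study does not report the rotation method.

```python
# Minimal sketch (not the study's actual SPSS procedure): split-half EFA
# with the factor_analyzer package. `items` is assumed to be a DataFrame
# of the 40 questionnaire items (TTS1 ... DRILL4), one row per respondent.
import pandas as pd
from factor_analyzer import FactorAnalyzer

def split_half_efa(items: pd.DataFrame, n_factors: int = 6, seed: int = 1):
    # Randomly split the sample into two halves, as described above.
    half_a = items.sample(frac=0.5, random_state=seed)
    half_b = items.drop(half_a.index)  # reserved for the CFA step (done in AMOS in the study)

    # EFA on the first half; promax (oblique) rotation is an assumption.
    fa = FactorAnalyzer(n_factors=n_factors, rotation="promax")
    fa.fit(half_a)

    # Pattern matrix in the shape of Table 1: 40 items x 6 factors.
    pattern = pd.DataFrame(fa.loadings_, index=items.columns)
    return pattern, half_b
```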

TABLE 2
Test Preparation Questionnaire: Example Items

1. Rehearse test-taking skills (TTS); standardized α = .888 (8 items); Types 2 & 3.
   Example items: trained my skills to choose options through logic elimination; tried to find ways to keep concentrated on the listening tasks.

2. Test preparation management (TPM); standardized α = .871 (10 items); Type 4.
   Example items: analyzed test papers to identify frequently assessed areas; analyzed test papers to identify the level of difficulty of each section.

3. Drilling (DRILL); standardized α = .782 (4 items); Type 4.
   Example items: focused on improving my skills to understand key sentences; timed test paper practice to improve my reading speed.

4. Memorizing (MEM); standardized α = .835 (5 items); Type 4.
   Example items: recited sentence patterns for improving my performance on writing; memorized useful linking words and phrases.

5. Socioaffective strategies (SOAF); standardized α = .836 (6 items); Type 2.
   Example items: tried to learn from others about ways to improve test scores; sought teachers’ advice on how to improve test performance.

6. Language-skill development strategies (LSD); standardized α = .885 (7 items); Type 1.
   Example items: kept reading extensively in English, e.g., books, websites; kept writing in English, e.g., e-mails, blogs.

Note. Example items complete the stem “During the past 8 weeks, I . . .”

LSD is considered to be Type 1, which improves test scores via enhancing a broad range of language skills. TPM, MEM, and DRILL are different from LSD in their deliberate attempt to identify and focus on the narrow range of skills and knowledge sampled by the test. They are considered to be Type 4, which may inflate test scores without improving the larger domain of interest. TPM represents purposeful uses of metacognitive strategies to identify and prioritize a narrow range of skills and knowledge for test preparation. MEM and DRILL represent test takers’ cognitive engagement with a narrow range of tested skills and knowledge. SOAF is classified as Type 2, because these strategies may help reduce test anxiety and boost test-taking motivation, thus reducing the impact of construct-irrelevant variances on test taking.

TTS is considered to have mixed effects on test performance. Within this scale, some test-taking skills, for example, “I trained my skills to choose options through logic elimination,” are test-taking tricks used to circumvent target constructs and thus are potentially harmful to score validity (Type 3). However, some test-taking skills are more accurately considered to be Type 2, for instance, “I tried to find ways to keep concentrated on the listening tasks.” This strategy may reduce construct-irrelevant variances of listening tests. Although the two types of TTS are conceptually distinct, they could not be separated empirically in the present study; that is, all eight items of TTS loaded on one factor in EFA, suggesting that test takers could not distinguish the two types. Treating TTS as Type 3 only seemed unwarranted; TTS was therefore considered to have mixed effects on score validity.

CET4 test papers and scoring. CET4 consists of 10 tasks in four components (Table 3). Two sets of past CET4 papers were used for the pretest and the posttest. Before participants were recruited, a survey was administered to investigate the exposure rates of past CET4 papers among the participants. The two test paper sets with the lowest exposure rates (below 3%) were selected.

TABLE 3
CET4 Composition

Component                     Subtask                     Weight    Component Sum
I-Listening comprehension     1. Short conversation         8%          35%
                              2. Long conversation          7%
                              3. Speech                    10%
                              4. Compound dictation (a)    10%
II-Reading comprehension      5. Careful reading           15%          35%
                              6. Banked cloze              10%
                              7. Speed reading             10%
III-Integrative skills        8. Cloze                     10%          10%
IV-Writing                    9. Translation                5%          20%
                              10. Essay writing            15%

Note. Source: http://www.cet.edu.cn.
(a) For this test method, test takers are given a passage in written form and are asked to fill in the blanks in the passage according to what they hear. They hear the passage read three times and fill in 11 blanks, among which eight blanks are for word dictation (each blank is a missing word) and three blanks are for sentence dictation (each blank is a missing sentence). For the missing words, they need to fill in the exact word heard; for the missing sentences, they can construct their own sentences according to what they heard.
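For illustration, the weighting in Table 3 can be applied as in the minimal sketch below. The task names and the assumption that each task score has already been rescaled to a 0-1 proportion of its maximum are hypothetical; only the weights come from Table 3 as reconstructed above.

```python
# Sketch of combining task-level scores into a weighted CET4 total,
# following the weight distribution in Table 3. Task names and the
# 0-1 proportion-correct scaling are illustrative assumptions.
TASK_WEIGHTS = {
    "short_conversation": 0.08, "long_conversation": 0.07,
    "speech": 0.10, "compound_dictation": 0.10,   # Listening: 35%
    "careful_reading": 0.15, "banked_cloze": 0.10,
    "speed_reading": 0.10,                        # Reading: 35%
    "cloze": 0.10,                                # Integrative: 10%
    "translation": 0.05, "essay_writing": 0.15,   # Writing: 20%
}

def weighted_total(task_scores: dict) -> float:
    """Combine 0-1 task scores into a 0-100 weighted total."""
    assert abs(sum(TASK_WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%
    return 100 * sum(TASK_WEIGHTS[task] * score
                     for task, score in task_scores.items())
```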

The pretest and posttest were scored manually. Standardized answer sheets were used for both tests. Scores of all tasks involving subjective judgment were computed as the average of two ratings. For the semiconstructive tasks (i.e., filling in the blanks in Speed Reading, Compound Dictation, and Translation), two raters marked all three tasks independently following strict rating criteria. Marking of these tasks was analytic, awarding scores according to the key words prescribed in the rating criteria; the marking was therefore essentially mechanical. High agreement was achieved between the two raters: (a) the Pearson correlation coefficients for the three semiconstructive tasks are .90, .92, and .90 for the pretest and .91, .93, and .90 for the posttest; (b) the results of paired t tests across the two ratings are not significant at the .05 level.

For the essay task, three experienced CET essay raters and a researcher (also an experienced essay rater) marked all essays independently. Each of the three teacher raters marked one third of the essays once, whereas the researcher marked all essays a second time independently. The interrater reliabilities are acceptable (r = .68, .76, and .66 for the pretest, and .70, .78, and .69 for the posttest). Although paired t tests find significant differences, the differences are consistent and systematic, and their magnitude is acceptable: The researcher tended to be more lenient than the other three raters by about one point (M difference = 1.16, 1.09, and 1.20 for the pretest, and 1.10, 1.05, and 1.15 for the posttest; p < .001). This systematic difference lends support to computing essay scores by averaging the two ratings, which adjusts for score differences due to the two rating tendencies as well as enhances interrater consistency across the three teacher raters.
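As a concrete illustration of these rater-agreement checks, the sketch below computes the same statistics with SciPy. The arrays `r1` and `r2` are hypothetical vectors of two independent ratings for one task, not the study's data, and the function name is invented for illustration.

```python
# Sketch of the rater-agreement checks described above: Pearson
# correlation between two ratings, a paired t test for systematic
# differences, and the final score as the average of the two ratings.
import numpy as np
from scipy import stats

def agreement_summary(r1, r2):
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    pearson_r, _ = stats.pearsonr(r1, r2)   # interrater correlation
    t, p = stats.ttest_rel(r1, r2)          # paired t test across ratings
    return {
        "pearson_r": pearson_r,
        "t": t, "p": p,
        "mean_diff": float(np.mean(r1 - r2)),  # systematic leniency of rater 1
        "final": (r1 + r2) / 2,                # averaged scores, as in the study
    }
```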


Data Analysis

SPSS was used for data preparation (data imputation, missing value, and outlier analysis), EFA, and sequential multiple regression (SMR). AMOS was used for CFA and structural equation modelling (SEM). For the test preparation questionnaire, the item-parcelling and cross-validation procedures suggested that the 40 items could be reduced to six composite variables in a reliable and valid manner. Composite variables were created for SMR and SEM by averaging the aggregated item scores within each scale. For the test scores, eight composite variables were computed following the CET4 score weight distribution (Table 3). Tests of factor invariance were conducted across the pretest and posttest.

To address Research Question 1, the six preparation variables were analysed to identify the underlying pattern of test preparation. To address Research Question 2, SMR and SEM were conducted to investigate the relationships between test preparation and test performance. SEM estimates the relationships among latent and observed variables, whereas multiple regression (MR) estimates only those among observed variables; the two therefore address Research Question 2 in different manners. Compared with standard MR, SMR is more theory driven (Timothy, 2006). Timothy noted that a predictor with known effects on the dependent variable is usually entered first to take away the part of the variance that it can explain; afterward, variables whose effects on the dependent variable are not known (the variables under study) are entered into the model to examine any additional variance they can explain. In this study, the pretest was entered into the model first; afterward, the six preparation variables were entered. The pretest was a logical and temporal antecedent of test preparation, and previous studies (e.g., those reviewed in Messick, 1982) had found consistent, strong effects of prior achievement on test scores. SMR is considered more rigorous than MR because it examines the effects of test preparation after removing the variance accounted for by a strong antecedent variable.

The purpose of structural modelling is to verify whether the hypothesized relationships among the observed and latent variables are supported by empirical data (Hoyle, 1995). Technically, SEM tries to explain the variance–covariance matrix underlying a set of observed variables with a prespecified model. The differences between the variance–covariance matrix of the data and that implied by the model are captured by the chi-square statistic: the smaller the chi-square value, the closer the data fit the model. Besides the chi-square statistic, five other model fit indices are used in this study: RMSEA, SRMR, goodness-of-fit index (GFI), CFI, and TLI. RMSEA values of .05 or lower represent a close fit; values up to .08 represent a reasonable fit. SRMR values lower than .08 are considered to be a good fit. For GFI, CFI, and TLI, values above .95 are considered to be close model fits (Hu & Bentler, 1999).
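To make the two-step SMR procedure and the RMSEA criterion concrete, here is a minimal sketch using statsmodels in place of SPSS/AMOS. The column names (`pretest`, `posttest`, and the six scale scores) are assumptions for illustration; the RMSEA helper implements the standard formula, not AMOS internals.

```python
# Sketch of sequential multiple regression: enter the pretest first,
# then the six preparation scales; the R-squared gain in step 2 is the
# additional variance explained by test preparation.
import pandas as pd
import statsmodels.api as sm

PREP_SCALES = ["TPM", "TTS", "MEM", "DRILL", "SOAF", "LSD"]

def sequential_regression(df: pd.DataFrame):
    y = df["posttest"]
    step1 = sm.OLS(y, sm.add_constant(df[["pretest"]])).fit()
    step2 = sm.OLS(y, sm.add_constant(df[["pretest"] + PREP_SCALES])).fit()
    return step1.rsquared, step2.rsquared - step1.rsquared

def rmsea(chi2: float, df: int, n: int) -> float:
    """RMSEA from the model chi-square, degrees of freedom, and sample
    size (standard formula, truncated at zero)."""
    return max((chi2 - df) / (df * (n - 1)), 0.0) ** 0.5

# Check against the CFA fit reported earlier:
# rmsea(1434.14, 721, 437) gives approximately .048
```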

RESULTS

In this section, descriptive statistics are reported to examine statistical assumptions as well as to provide an overview of the test results and test preparation practices. Afterward, results of SMR and SEM are presented.

Descriptive Statistics

Table 4 displays descriptive statistics of the six test preparation variables. Except for LSD, the variables have skewness and kurtosis within the acceptable range (±1.0). Although the two normality statistics for LSD are not within the satisfactory range, the deviation is not substantial.

TABLE 4
Test Preparation Practices: Descriptive Statistics

Scale        Min     Max      M      SD     Skew    Kurtosis
1. SOAF     1.00    5.00    2.45    .66     .30      −.24
2. TPM      1.00    5.00    3.18    .70    −.23       .25
3. TTS      1.00    5.00    3.42    .77    −.45      −.02
4. MEM      1.00    5.00    2.68    .88    −.11      −.43
5. DRILL    1.00    5.00    2.77    .80    −.11      −.38
6. LSD      1.00    5.00    1.60    .66    1.51      2.80

Note. N = 873. SOAF = socioaffective strategies; TPM = test preparation management; TTS = test-taking skills; MEM = memorizing; DRILL = drilling; LSD = language-skill development strategies.

The multivariate normality index, Mardia's coefficient (5.149), is below the acceptable limit (Muthén & Kaplan, 1985). Muthén and Kaplan found from their simulation studies that the impact of nonnormality on parameter estimation was negligible when multivariate kurtosis was lower than 13.92 and most variables had skewness and kurtosis within the range of ±1.0.

As can be seen from Table 4, LSD is the least used practice (M = 1.60): test takers did not use this strategy at all or used it scarcely during the test preparation period, whereas they used the other five test preparation practices much more frequently. TTS is the most frequently used (M = 3.42), followed by TPM (M = 3.18). Uses of DRILL (M = 2.77) and MEM (M = 2.68) are similar; both have lower frequencies than the aforementioned two strategic practices. Socioaffective strategies are used only occasionally, with much lower frequency (M = 2.45).

Table 5 reports descriptive statistics for the test scores. All variables have skewness and kurtosis within the acceptable range (±1.0). Mardia's coefficients for the pretest variables (2.753) and the posttest variables (2.258) are satisfactory. Paired t tests found all posttest scores were significantly higher than the pretest scores for all four components at the .01 level.

SMR Analysis

Table 6 presents the results of SMR. The two diagnostic indices are within the satisfactory range (Tolerance values > .20, VIF values