Ardasheva, Y., Wang, Z., Adesope, O. O., & Valentine, J. C.

This is the peer reviewed version of the following article: Ardasheva, Y., Wang, Z., Adesope, O. O., & Valentine, J. C. (2017). Exploring effectiveness and moderators of language learning strategy instruction on second language and self-regulated learning outcomes. Review of Educational Research, 87(3), 544–582. doi:10.3102/0034654316689135, which has been published in final form at http://journals.sagepub.com/doi/abs/10.3102/0034654316689135. This article may be used for non-commercial purposes.

Exploring Effectiveness and Moderators of Language Learning Strategy Instruction on Second Language and Self-Regulated Learning Outcomes

Abstract

This meta-analysis synthesized recent research on strategy instruction (SI) effectiveness in order to estimate SI effects and their moderators for two domains: second/foreign language and self-regulated learning. A total of 37 studies (47 independent samples) for the language domain and 16 studies (17 independent samples) for the self-regulated learning domain contributed effect sizes to this meta-analysis. Findings indicate that the overall effects of SI were large, 0.78 and 0.87 for language and self-regulated learning, respectively. A number of context (e.g., educational level, script differences), treatment (e.g., delivery agent), and methodology (e.g., pretest) characteristics were found to moderate SI effectiveness. Notably, the moderating effects varied by language versus self-regulated learning domains. The overall results identify SI as a viable instructional tool for second/foreign language classrooms, highlight more effective SI design features, and suggest a need for a greater emphasis on self-regulated learning in SI interventions and research.

Keywords: strategy instruction; language learning strategies; second/foreign language acquisition; self-regulated learning; meta-analysis


Exploring Effectiveness and Moderators of Language Learning Strategy Instruction on Second Language and Self-Regulated Learning Outcomes

In today’s increasingly mobile and globalized world, developing higher levels of proficiency in a language other than one’s own becomes ever more important to boost one’s “stability, employability, and prosperity” (British Council, 2013, p. 3). Studying variables that may enhance the learning of additional languages, then, becomes crucial for informing practice, policy, and theory affecting the quality and outcomes of language learning experiences. One such variable—identified as a powerful tool in additional language development (as well as in native language development; see Pressley, 2002)—has been referred to as language learning strategies (LLS). LLS have been defined as actions learners use consciously to learn a new language more efficiently (Griffiths, 2007). Although LLS taxonomies vary (see a review of the topic in Barjesteh, Mukundan, & Vaseghi, 2014), most LLS researchers distinguish among three LLS categories: (a) cognitive strategies, “to do with the behaviors and mental processes of the learning” (e.g., keyword, rehearsal, note-taking); (b) metacognitive strategies, “to do with awareness of the learning” (e.g., focusing attention, planning for learning); and (c) socioaffective strategies, to do with “interactions with others” and “personality traits” (e.g., asking for help, self-encouragement; Hassan et al., 2005, p. 1). Higher use of spontaneous (non-instructed) LLS has been associated with higher proficiency both in second language (L2) and foreign language (FL) contexts (e.g., Ardasheva, 2016; Hu, Gu, Zhang, & Bai, 2009; Huang, Chern, & Lin, 2009; Nahavandi & Mukundan, 2014) and with higher performance on self-regulated learning measures (e.g., motivation: MacIntyre & Noels, 1996; self-efficacy: Magogwe & Oliver, 2007).
(In this study, we use the term L2 to refer to a language studied inside a country where it is commonly spoken as an official language, such as P-16/adult English-as-a-Second-Language classes in the United States or French/English immersion classes in Canada, and we use the term FL to refer to a language studied inside a country where it is not commonly spoken, such as French classes in the United States or English classes in Japan.)

The results of extensive, qualitative reviews of LLS research conducted in both L2 and FL settings since the 1970s (Cohen & Macaro, 2007; McDonough, 1999) suggest that LLS can be successfully taught through what is known as strategy instruction (SI), the variable at the heart of the present study. SI has been defined as “any intervention which focuses on the strategies to be regularly adopted and used by language learners to develop their proficiency, to improve particular task performance, or both” (Hassan et al., 2005, p. 1). There is emergent evidence linking LLS not only with better language outcomes (e.g., Plonsky, 2011), but also with better academic performance in content areas (language arts, mathematics, science; Ardasheva, 2016; Ardasheva & Tretter, 2013; Chamot, 2007; Martínez-Álvarez, Bannan, & Peters-Burton, 2012; Montes, 2002). In other words, LLS are a potentially beneficial, malleable factor under the control of educational systems.

Yet, evidence regarding the effectiveness of SI on language outcomes remains inconsistent. In their systematic review of SI literature, Hassan et al. (2005) found that, out of 25 qualifying experimental design investigations, 17 studies reported positive results, 6 studies reported mixed results, and 2 studies reported negative results. Overall, the authors concluded that evidence of SI effectiveness was strongest for reading and writing but less so for listening, speaking, overall proficiency, and vocabulary.
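The syntheses discussed here and below summarize experimental treatment-control comparisons as standardized mean differences with a small-sample correction (Hedges' g). As a point of reference only, a minimal sketch of that computation follows; it is not the authors' code, and all numbers are invented.

```python
# Illustrative sketch: Cohen's d with the Hedges small-sample correction,
# the effect-size metric commonly used in this literature. The summary
# statistics below are hypothetical, not from any synthesized study.
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / sd_pooled

def hedges_g(d, n_t, n_c):
    """Apply the small-sample correction J = 1 - 3 / (4*df - 1)."""
    df = n_t + n_c - 2
    return d * (1 - 3 / (4 * df - 1))

# Hypothetical posttest summaries for a treatment and a control group:
d = cohens_d(72.0, 61.0, 15.0, 14.0, 62, 54)
g = hedges_g(d, 62, 54)
```

With these made-up summaries the correction shrinks d only slightly (roughly 0.76 to 0.75), but it matters more for the very small samples typical of classroom studies.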
In a subsequent meta-analysis of 61 studies with 95 unique samples, Plonsky (2011) found that SI had an overall positive, moderate
effect on student outcomes and identified a number of study characteristics associated with SI effectiveness differentials. In contrast to Hassan et al. (2005), however, Plonsky (2011) reported that SI effectiveness was strongest for speaking and smallest for writing; the effects for reading and vocabulary were average (a similar effect for reading was reported in a narrower scope meta-analysis comparing the effectiveness of two interventions, SI and glosses, with reading serving as the only outcome examined [Taylor, 2014]). Notably, both Hassan et al. (2005) and Plonsky (2011) noted that the generalizability of their respective synthesis findings was undermined by the substantial between-study differences in structural features of SI treatments designed, particularly in early SI research, “based largely on convenience, intuition, and/or some level of idiosyncrasy” (Plonsky, 2011, p. 998). This concern is particularly salient given the notably greater standardization of intervention frameworks that has emerged in recent years due to recommendations for practice and research emanating from 30 years of SI research (Cohen & Macaro, 2007), the proliferation of pedagogical, how-to literature (e.g., Chamot, 2009; Rivera–Mills & Plonsky, 2007), and an emergent theoretical focus on self-regulated learning skills as an enabling mechanism of SI effectiveness (e.g., Grenfell & Macaro, 2007; Macaro, 2006). Such theory- and practice-driven changes in the SI field suggest a need to periodically re-evaluate SI effectiveness across outcomes, considering the most recent trends and evidence.

The purpose of this meta-analysis, then, was twofold. The first main objective was to synthesize the most recent (2008-2014) SI research that emerged in response to recommendations for research and practice emanating from 30 years of SI research and the field’s call for a greater emphasis on self-regulated learning (see the discussion above).
Relatedly, the second main objective was to, in contrast to the earlier meta-analysis by Plonsky (2011), separately estimate SI effects and their moderators for two domains, language and self-regulated learning. This decision was made in response to a shift from exclusively focusing on what is learned when a new language is acquired (the product or outcome of learning) to studying how a new language is learned (the process of learning), both in the field of second language acquisition in general (Griffiths, 2007; Oxford, 2011; see also Dörnyei, 2003, 2005) and in the field of SI in particular (see the discussion above). In addition, this study explored a number of moderators not explored by previous meta-analytic research (e.g., language typology, technology- versus instructor-delivered and researcher- versus teacher-led SI interventions).

Strategy Instruction

Theoretical underpinnings. In her seminal works, Rubin (1975, 1981) argued that observing “good language learners” (i.e., those identified as effective in acquiring a new language) and studying processes that help them become successful may inform theories of language processing. This information, in turn, could be taught to less successful language learners. The goals of strategy instruction, then, are to (a) increase learners’ awareness of the most effective methods of language learning (Cohen, Weaver, & Li, 1996; Dabarera et al., 2014; De Silva, 2014; Hu et al., 2009) and (b) develop independent, self-regulated learners who actively participate in their own learning processes (Graham & Macaro, 2007; Oxford, 1999). Focus on self-regulated learning, Zimmerman (1990) argued, is needed because the “initial optimism that teaching students various learning strategies would lead to improved self-regulated learning has cooled with mounting evidence that strategy use involves more than mere knowledge of a strategy” (p. 9; see also Dörnyei, 2003, 2005; Grenfell & Macaro, 2007).
He further argued that to promote self-regulated learning, instruction should focus on supporting three component processes: (a) behavioral, the knowledge and use of learning strategies; (b)
metacognitive, self-feedback regarding the effectiveness of learning and learning strategies and student responsiveness to such feedback; and (c) motivational, the interdependence between learning and motivational processes. Figure 1 summarizes these theoretically formulated relationships among SI, self-regulated learning, and achievement.

With regard to the behavioral component, much theoretical and empirical evidence supports the need for the learner to both know and use LLS. Cohen (1998) and Genesee, Lindholm-Leary, Saunders, and Christian (2005) observed that conscious awareness and use of LLS is characteristic of L2 development. This is because L2 learners—beginning to learn a new language at a more mature age—are more aware of language features they need to learn and can consciously draw on explicit LLS to enhance their learning. According to Macaro (2006), LLS do not simply make learning more efficient, but are “the raw material without which L2 learning cannot take place” (p. 332). Indeed, despite some inconsistencies (Nisbet, Tindall, & Arroyo, 2005; Takeuchi, 1993) and evidence to the contrary (Gardner, Tremblay, & Masgoret, 1997), a positive relationship between LLS and L2 proficiency has been documented in a number of studies (e.g., Lan & Oxford, 2003; Nahavandi & Mukundan, 2014; Peacock & Ho, 2003; see also Cohen & Macaro, 2007). In synthesizing findings from 12 studies, Oxford (1999), for example, reported that LLS use accounted for a substantial amount of variance in L2 proficiency, ranging from 21% (a study of Taiwanese students learning English in secondary and tertiary institutions) to 58% (a study of first-year English learners in a Japanese women’s college).
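Variance-explained figures of this kind are squared correlations (R² = r²). A quick back-of-envelope check (ours, not the original computation) shows that the 21% and 58% figures would correspond to correlations of roughly .46 and .76, respectively:

```python
# Back-of-envelope check (not from the synthesized studies): percent of
# variance explained is the squared correlation between LLS use and
# L2 proficiency. The r values below are inferred for illustration.
def variance_explained_pct(r):
    """Percent of variance explained by a correlation of r."""
    return 100 * r ** 2

low = variance_explained_pct(0.46)   # about 21%
high = variance_explained_pct(0.76)  # about 58%
```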
With regard to the metacognitive component, Zimmerman (1990) argued that learners need to engage in a cyclic “self-oriented feedback loop” through which they monitor LLS effectiveness and respond to self-feedback in varied ways, “ranging from covert changes in self-perception to overt changes in behavior” (e.g., altering LLS use; p. 5). Many LLS researchers (e.g., Cohen, 1998; Hsiao & Oxford, 2002; Vandergrift & Tafaghodtari, 2010; Zenotz, 2012) argued that L2 success did not depend on the number or frequency of LLS used but rather on the learner’s ability to select and orchestrate the LLS that are most appropriate for completing a given learning task. After all, Grenfell and Macaro (2007) noted, if inappropriately used, any LLS may result in failure. Indeed, metacognitive knowledge—including knowledge of self, knowledge of task, knowledge of learning goals, and strategic competence—has been linked with higher L2 outcomes (Ardasheva, 2016; Ardasheva & Tretter, 2013; Graham & Macaro, 2008; Huang et al., 2009; Kolic-Vehovec & Bajsanski, 2007; Schoonen, Hulstijn, & Bossers, 1998; van Gelderen et al., 2004).

Finally, the motivational component of self-regulated learning relies on learners’ willingness to commit time, effort, and vigilance to initiate and regulate LLS (Zimmerman, 1990). In other words, Zimmerman contends, self-regulated learning is not only “self-determined in a metacognitive sense” (p. 6) but also self-motivated. With some inconsistencies (Takahashi, 2005; Vandergrift, 2005), research supported the existence of a direct (e.g., Ehrman & Oxford, 1995; Wang, 2008) or mediated (e.g., Ardasheva, 2016; Pae, 2008) relationship between motivation and L2 achievement. Examples of investigated individual difference characteristics mediating motivation include metacognitive strategies (Ardasheva, 2016), effort (Bernaus & Gardner, 2008), and intensity and self-confidence (Pae, 2008).
Further, studies found a robust association between motivation and LLS (e.g., MacIntyre & Noels, 1996; Schmidt & Watanabe, 2001; Vandergrift, 2005). Peacock and Ho (2003), for example, found that high LLS users reported being highly motivated and perceived learning an L2 as personally important and enjoyable. Other studies comparing high versus low motivation students found that the former group used LLS with greater frequency (Oxford & Nyikos, 1989) and reported
knowing more LLS and tended to find LLS more effective and easier to use (MacIntyre & Noels, 1996). Importantly, research indicated that motivation may be stimulated by instructional environments that satisfy human needs for competence (the know-how), autonomy (self-initiation, -regulation), and relatedness (“secure and satisfying relationships with others”; Deci, Vallerand, Pelletier, & Ryan, 1991, p. 327; see Noels, 2001; Noels, Clément, & Pelletier, 1999; Wu, 2003). Taken together, these findings suggest that motivation is an important, manipulable-by-instruction component of self-regulated learning; its impact on L2 development may be direct or mediated by LLS and related individual difference characteristics.

Although, as noted above, a number of statistical modeling studies examined mediating effects of different aspects of self-regulated learning on L2 achievement, SI studies per se typically examine only the direct SI impacts on self-regulated learning and/or on L2 achievement, without considering the theoretically posited mediating effects. Following the typical SI study design, this meta-analysis will separately estimate SI effects and their moderators for two domains: language and self-regulated learning. Below we first describe a typical SI study that exemplifies the common structures of SI studies included in this meta-analysis and then discuss variables that may moderate SI effectiveness depending on the operationalization of such common structures across individual studies.

A typical SI study showcasing common SI structures. Macaro and Erler (2008) conducted a 14-month SI study in reading with a sample of young (11–12 year olds), beginner (year 1) learners of French enrolled in six secondary schools in England.
Six intact classrooms were matched on teachers’ years of experience and non-randomly assigned to either treatment or control conditions (one control classroom dropped from the experiment before the end of the study, contributing to an attrition rate of 30.2%; participants and nonparticipants did not differ in reading on the pretest, suggesting that attrition was not an issue). Whereas 62 students in the treatment group received awareness raising of 12 strategies listed in a pre-intervention questionnaire plus six additional strategies, 54 students in the control groups received regular instruction. Strategies targeted in the treatment group included cognitive (e.g., sounding out, using context clues, background knowledge) and socio-affective (e.g., asking for help, not giving up easily) strategies. Both treatment and control groups were taught by their regular teachers; treatment teachers used researcher-developed reading/SI materials, in addition to regular textbooks. SI lasted, on average, about 10 min per day, with instructional procedures including the following steps: (a) awareness raising, (b) modelling of strategies, (c) scaffolded practice, (d) removal of scaffolding, and (e) evaluation. The latter component included individualized feedback by the teacher and pair and whole-group discussions. Importantly, “implicit in all these activities was the concurrent development of metacognition with regard to making decisions about clusters of strategies available, and evaluating strategies used” (p. 105). Participants were assessed pre- and post-intervention on three researcher-developed measures: (a) a reading comprehension test, (b) a reading strategy use questionnaire, and (c) a French attitudes scale. The reading test—the posttest was administered one month after the intervention—included narrow (translation) and broad (idea unit identification) reading tasks.
The complexity and length of the L2 text increased from pre- to post-intervention to reflect the growing proficiency of the students; test responses were provided in the first language (L1; also referred to as the native language). To allow for results’ comparability, reading tests were scored as percent correct. Strategy use and attitudes measures were both self-report, with 3- and 5-point Likert-type scales, respectively. The reported reliabilities for all three measures were acceptable (.7 and above). Results indicated that SI significantly improved comprehension with an effect
size—corrected for small sample size—of 1.17, “brought about changes in strategy use” (p. 90; with some of the changes favoring the control group), and improved reading attitudes. Yet, no benefits of SI were found in at least two other recently published L2 reading comprehension studies (Gladwin & Stepp-Greany, 2008; Takallou, 2011) qualifying for this meta-analysis; similar discrepancies in SI impacts were documented for other language (e.g., listening; Cross, 2009; Vandergrift & Tafaghodtari, 2010) and self-regulated learning (e.g., strategy use; Ranalli, 2013; Zenotz, 2012) outcomes. Cross (2009) and Vandergrift and Tafaghodtari (2010), for example, found no versus positive SI impacts, respectively. Notably, although both studies were similar in terms of SI instructional features and focused on adult FL learners, the two studies differed in terms of SI duration, participants’ proficiency levels, and L1–L2 linguistic proximity. These findings suggest that variations in common SI structures across individual studies may play a moderating role in SI effectiveness. We discuss such moderators next, including both known and novel (not explored by previous meta-analytic research) moderators.

Moderator Variables

Outcomes. Research on spontaneous strategy use linked LLS with higher L2 achievement across a broad range of language and self-regulated learning outcomes. Examples of the former include reading (Huang et al., 2009; Kolic-Vehovec & Bajsanski, 2007; Schoonen et al., 1998; van Gelderen et al., 2004), listening (Dreyer & Oxford, 1996; Peacock & Ho, 2003; Takeuchi, 1993; Vandergrift, Goh, Mareschal, & Tafaghodtari, 2006), speaking and writing (Peacock & Ho, 2003), overall proficiency (Nisbet et al., 2005; Takeuchi, 1993), and vocabulary and grammar knowledge (Dreyer & Oxford, 1996; Fraser, 1999; Peacock & Ho, 2003; Takeuchi, 1993). Examples of the latter include motivation (MacIntyre & Noels, 1996) and self-efficacy (Magogwe & Oliver, 2007).
Research on instructed strategy use, in turn, suggested that SI effectiveness varied depending on the targeted outcome, both for language and self-regulated learning domains. Hassan et al.’s (2005) systematic review, for example, found that SI “works for reading comprehension and writing skills, and [that] the research evidence for this is stronger than it is for listening, speaking and overall proficiency” (p. 6). Evidence regarding SI effectiveness for vocabulary, in turn, was judged as being weak. In line with Hassan et al.’s (2005) synthesis, Plonsky’s (2011) meta-analysis estimated that SI effects on overall proficiency and listening were negligible. In contrast to Hassan et al.’s (2005) work, however, Plonsky (2011) found that SI effectiveness was strongest for speaking; the effects for reading and vocabulary were average and the effect for writing was the smallest. Further, of the two SI effects on self-regulated learning outcomes—namely, strategy use and attitudes—only that on strategy use was statistically significant and large in size; the effect on attitudes was estimated to be negligible. Two reasons may account for such discrepancies, namely, the scope of work included in each synthesis (published over half a decade later, Plonsky’s [2011] study included eight additional years of research) and differences in analytic approaches (meta-analytic techniques in Plonsky [2011] and team-assigned weights of evidence regarding study trustworthiness, methodological soundness, and relevance in Hassan et al. [2005]). In addition to these obvious reasons, the between-syntheses discrepancies in conclusions may be attributed to methodological advancements in the field itself, such as the differentiation between generic and task-specific LLS (e.g., speaking LLS, listening LLS; see Dörnyei, 2005; Hsiao & Oxford, 2002; Oxford, 2011; Oxford, Cho, Leung, & Kim, 2004).
This latter trend, in particular, may explain the drop in the percentage of SI studies targeting overall proficiency over time. That is, among all synthesized
studies targeting a specific language outcome, studies that focused on overall proficiency were 5 out of 27 in Hassan et al. (2005; see Table 4.1) and 4 out of 95 in Plonsky (2011), roughly 18% and 4% of all qualifying studies, respectively. (We discuss other advancements in the field later in the paper.) Taken together, these findings highlight language and self-regulated learning outcomes as an important moderator of SI effectiveness and suggest a need to periodically re-evaluate SI effectiveness across outcomes, considering the most recent trends and evidence.

Further, it is important to recognize that language outcomes, specifically, may be broadly categorized into language outcomes per se (speaking, listening, vocabulary, and grammar) and literacy outcomes (reading and writing). Although the two categories may somewhat overlap in adult L2 learning, this may not be the case for young learners, who typically develop language skills first (for this reason, bilingualism does not necessarily imply biliteracy); moreover, as noted earlier, few SI interventions place an equal emphasis on both language and literacy skills. These considerations suggest that both learners’ age and proficiency (language and literacy skills) may moderate SI effectiveness, along with other learner and study characteristics discussed next.

Context. Researchers have long recognized that context-related variables may differentially impact the effectiveness of any intervention. Literature suggests that when it comes to SI interventions, such variables may include: L2 versus FL setting, age, educational level, proficiency, and language typology.

L2 versus FL setting. SI effectiveness has been extensively tested in both FL (e.g., Graham & Macaro, 2008; Vandergrift & Tafaghodtari, 2010) and L2 (Gunning, 2011; Ranalli, 2013) contexts.
Although it is difficult for individual studies to account for such setting differences, Plonsky’s (2011) meta-analysis did find that the SI effect size was almost two times larger in L2 than in FL contexts. A narrower-in-scope (see below) meta-analysis conducted by Taylor (2014), however, reported an opposite trend, with effect sizes favoring FL contexts. There were, however, two substantial differences between these two meta-analytic studies. First, the overall SI effectiveness (i.e., an overall, main effect size) tested for the FL-L2 setting moderator was operationalized differently in the two studies. That is, whereas in Plonsky (2011) the overall effect was synthesized across both language and self-regulated learning outcomes, in Taylor (2014) the overall effect integrated SI and glossing studies and was synthesized only for a single language outcome (reading). Second, whereas Plonsky (2011) considered two moderator levels (FL, L2), Taylor (2014) considered three moderator levels (ESL, EFL, FL). Such differences in construct conceptualization may have contributed to the above-mentioned discrepancy in research findings. To address this discrepancy for SI studies, we will ascertain the moderating effect of setting at three levels (ESL, EFL, FL), and we will do so separately for language and for self-regulated learning outcomes.

Age and educational level. Age and educational level have been identified among variables directly related to “the choice, use, or evaluation” of LLS (Oxford & Leaver, 1996, p. 227; e.g., Hu et al., 2009). Magogwe and Oliver (2007), for example, found that learners preferred strategies of different complexity depending on their educational level: Whereas primary students favored social strategies, secondary (and above) students favored metacognitive strategies. Peacock and Ho (2003) found that older students used significantly more strategies than did younger students.
These findings may be attributed to increasing cognitive maturity, as with age students develop increasingly sophisticated perceptions of self and academic tasks (Zimmerman, 1990). In testing “whether greater (meta)cognitive capacity of adults offers an advantage over
children” (p. 997) in terms of benefiting from SI, Plonsky (2011), however, found a somewhat unexpected result. That is, the SI effect for younger learners (i.e., under 12 years old, primary education) was 1.4 to 3.3 times larger than that for older and upper educational level students. As SI effectiveness research continues with different age groups, including children (e.g., Martinez-Alvarez et al., 2012), adolescents (e.g., Graham & Macaro, 2008), and adults (e.g., Soleimani, Zandiye, & Esmaeili, 2014), we will further investigate this issue by examining whether age-related cognitive maturity may impact SI effectiveness differently, depending on the outcome domain—language versus self-regulated learning.

Proficiency. With some notable exceptions (e.g., Hong-Nam & Leavell, 2006; Phillips, 1992), much research found a positive and linear relationship between LLS use and L2 proficiency (e.g., Hu et al., 2009; Nahavandi & Mukundan, 2014). Students with higher proficiency were found to use LLS more frequently (Dreyer & Oxford, 1996; Griffiths, 2007) and within an increasing range (Griffiths, 2007; Kaylani, 1996). In a longitudinal study, Chesterfield and Chesterfield (1985), for example, found that increased levels of L2 competence seemed to imply the ability to use a larger range of increasingly sophisticated strategies, progressing from receptive (cognitive) strategies to more interactive (socioaffective) strategies and to more self-regulatory (metacognitive) strategies by the end of the study. Notably, this developmental pattern in LLS use documented in individual studies, with “more strategy options becom[ing] available to the L2 learner as competency increases” (Taylor, 2014, p. 45), parallels, to some extent, findings indicating learners’ ability to make greater use of SI at greater L2 proficiency levels.
That is, although individual studies (e.g., De Silva, 2014; Jurkovic, 2011; Urlaub, 2012) continue to report discrepant results in terms of at what proficiency level learners benefit most from SI, two recent meta-analyses (Plonsky, 2011; Taylor, 2014) suggested that, overall, SI is more beneficial for more advanced learners. An important question remains, however: Do the relationships between learner proficiency and SI vary by the learning outcome domain—language versus self-regulated learning? Based on previous empirical (Plonsky, 2011; Taylor, 2014) and theoretical (Zimmerman, 1990; see also Dörnyei, 2003, 2005; Oxford, 2011) research, one would expect that higher proficiency students would be greater SI beneficiaries and that the impacts for language and self-regulated learning outcomes within proficiency levels would be comparable in size. With some notable exceptions (confirming the latter but not the former hypothesis; De Silva, 2014), however, individual studies rarely control for proficiency and rarely do so for both outcomes, a shortcoming that is easily addressed by the synthetic nature of the meta-analytic approach used in this study.

Language typology. Although the impacts of L1/L2 differences on SI effectiveness have not yet been closely examined, our interest in investigating this potentially moderating effect is grounded in research on transfer (also referred to as crosslinguistic influence; Kellerman, 1995), a phenomenon referring to transferring prior linguistic knowledge from L1 to L2. When L1 items are applied correctly to L2 contexts, transfer is said to be positive; negative transfer occurs when application of L1 forms disrupts performance in L2 (Saville-Troike, 2006).
Although there have been different perspectives on the role of transfer in L2 acquisition (see discussions in Gass & Selinker, 2008; Kellerman, 1995; MacWhinney, 1992), there is evidence to suggest that L1/L2 differences may impact L2 acquisition (Navarra, Sebastián-Gallés, & Soto-Faraco, 2005; van Boxtel, Bongaerts, & Coppen, 2003). Tao and Healy (1998), for example, found that English reading task performances of native speakers of Dutch and English were similar to each other, but differed from the performances of native speakers of Japanese and Chinese. The similarity in
Dutch and English speakers’ performance could be attributed to two language-related features not shared by the Japanese and Chinese languages. That is, both Dutch and English share the same script (Latin) and belong to the same language family (Indo-European), both features plausibly increasing opportunities for L1/L2 transfer due to shared orthographic and common grammatical features (the latter being a characteristic of genetically related languages; Fromkin, Rodman, & Hyams, 2007). On the other hand, whereas the Chinese language belongs to the Sino-Tibetan language family and uses the Han script, the Japanese language belongs to the Japonic language family and uses the Hiragana and Katakana scripts in addition to the Han script (see Ethnologue at https://www.ethnologue.com/statistics/family). To capture these two language-difference features in our study, we operationalize L1/L2 differences in two ways, namely, as L1/L2 belonging versus not belonging to the same language family and as L1/L2 sharing versus not sharing the same script.

Treatment. In an earlier systematic review of SI studies, Hassan et al. (2005) noted that between-study differences in structural features of SI treatments—such as the selection of strategy type (metacognitive, cognitive, or socioaffective strategies), strategy scope (single or packaged-together strategy/ies), treatment duration (from up to two weeks to up to a school year), and instructional approach (awareness-raising versus behavior-modeling)—“limited the degree to which studies could be combined cumulatively” (p. 64) to assess the overall effectiveness of SI interventions. The authors called for greater standardization of intervention frameworks in the future and for more research comparing the effectiveness of awareness-raising versus behavior-modeling approaches.
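The cumulative combining that Hassan et al. (2005) found infeasible is what meta-analytic pooling provides. As a reference point only (this does not reproduce any of the analyses discussed here, and the effect sizes and variances below are invented), a random-effects pool can be sketched with the DerSimonian-Laird estimator:

```python
# Illustrative sketch of inverse-variance random-effects pooling
# (DerSimonian-Laird), one common way to combine study-level effect sizes.
# The per-study effects and sampling variances below are made up.
import math

def dersimonian_laird(effects, variances):
    """Pool effect sizes under a random-effects model; return (pooled, SE)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * ei for wi, ei in zip(w, effects)) / sum(w)
    # Cochran's Q and the between-study variance estimate tau^2
    q = sum(wi * (ei - fixed) ** 2 for wi, ei in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    # Re-weight using total (within + between) variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * ei for wi, ei in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, se

# Hypothetical per-study Hedges' g values and sampling variances:
pooled, se = dersimonian_laird([0.3, 0.8, 1.1], [0.04, 0.06, 0.09])
```

Moderator analyses of the kind discussed throughout this section then compare such pooled estimates across subgroups of studies (e.g., L2 versus FL settings).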
Subsequently, the use of meta-analytic techniques—as contrasted with Hassan et al.’s (2005) weight-of-evidence judgments—allowed Plonsky (2011) to quantify the extent to which differences in structural features of interventions, but not in the selection of an instructional approach, contributed to differences in SI effectiveness. In particular, Plonsky (2011) found that teaching fewer strategies (eight or fewer), targeting cognitive rather than metacognitive strategies, and providing longer interventions (over two weeks) were associated with larger SI impacts. Echoing Hassan et al. (2005), Plonsky (2011) noted the continued lack of SI standardization, which he attributed to the absence of a comprehensive SI theory, leaving “researchers and practitioners to design studies of SI based largely on convenience, intuition, and/or some level of idiosyncrasy” (p. 998). It is worthwhile to note, however, that although Plonsky (2011) incorporated more recent research, both Plonsky’s (2011) and Hassan et al.’s (2005) reviews included an overlapping subset of early (1980s and 1990s) studies. Due, in part, to recommendations for research and practice emanating from 30 years of SI research (Cohen & Macaro, 2007), the proliferation of pedagogical, how-to literature (e.g., Chamot, 2009; Rivera-Mills & Plonsky, 2007; see a discussion of SI models in Chamot, 2004), and an emergent theoretical focus on self-regulated learning—particularly on its metacognitive component—as an enabling mechanism of SI effectiveness (e.g., Grenfell & Macaro, 2007; Macaro, 2006), notably greater standardization of intervention frameworks has gradually emerged in the last decade.
In particular, most recent SI interventions adopt an awareness-raising instructional model targeting task-specific strategy clusters—rather than single strategies—often across metacognitive, cognitive, and socioaffective strategy types (e.g., Dabarera et al., 2014; Lam, 2009; Macaro & Erler, 2008; Takallou, 2011; Vandergrift & Tafaghodtari, 2010). With some minor variations, the awareness-raising model typically includes the following steps: (a) consciousness raising, in which students reflect on learning and on their current and potential strategies; (b)

modeling, in which teachers introduce and model new strategies appropriate for the learning goal; (c) guided practice, in which students are given opportunities to practice strategies with gradual removal of teacher prompts; and (d) evaluation and goal-setting, in which “students identify problem areas, select strategies that might help remedy them and evaluate their success” (Graham & Macaro, 2008, p. 753). Yet a number of studies still employ a behavior-modeling approach (including only modeling and practice; Kim, 2013; Morett, 2012; Urlaub, 2012). Accordingly—beyond the duration and strategy number moderators examined in Plonsky (2011)—we will explore whether the selection of SI approach (awareness-raising versus behavior-modeling) affects SI effectiveness. Other moderators possibly affecting SI effectiveness, not investigated in previous meta-analyses, will include SI scope and delivery mode. With regard to the former, we will examine whether a larger strategy selection scope (clustering strategies across metacognitive, cognitive, and socioaffective strategy types; e.g., Dabarera et al., 2014; Lam, 2009; Takallou, 2011) would be associated with greater gains than a narrower scope (a single strategy type only; e.g., Ding & Songphorn, 2012; Gladwin & Stepp-Greany, 2008). With regard to delivery mode, we will estimate whether technology-delivered instruction (an emergent trend in recent SI research; e.g., e-learning tools: Morett, 2012; Ranalli, 2013; video: Kim, 2013) would be associated with greater or lesser gains when compared to human-delivered instruction, and whether greater familiarity with SI theories and procedures and greater control over fidelity of implementation would benefit researcher-led (e.g., Baleghizadeh & Mortazavi, 2013; Bozorgian & Pillay, 2013; Chan, 2014; Dabarera et al., 2014) over teacher-led (e.g., Grenfell & Harris, 2013; Gunning, 2011; Hu et al., 2009) interventions.

Methodological features.
Past meta-analytic research across educational fields has demonstrated that research design and implementation features (e.g., assignment to conditions) may serve as sources of variance in the estimated outcomes. Plonsky (2011), for example, found that although studies with stronger designs (pretest, random assignment, reliability reported) consistently yielded larger effects than did studies with weaker designs (no pretest, nonrandom assignment, reliability not reported), on average, all studies included in his meta-analysis yielded statistically detectable effects, regardless of the strengths or weaknesses of their designs. By contrast, Adesope, Lavin, Thompson, and Ungerleider (2010) reported that, for at least some of the examined outcomes, weaker design studies (reliability not reported) “produced a statistically detectable mean effect size” whereas stronger design studies (reliability reported) did not (p. 226). Thus, in addition to exploring the substantively and practically meaningful moderators described in the literature review, we will also investigate methodological moderators. In particular, building on previous research, we will investigate the extent to which individual study quality markers (reliability reported: e.g., Adesope et al., 2010; Plonsky, 2011; pretest and random assignment implemented: Plonsky, 2011) covary with study effect sizes.

Research Questions

There are three objectives beyond identifying the main effect of SI on language and self-regulated learning outcomes, specifically, identifying: (a) L2 outcomes most affected by strategy instruction; (b) effectiveness differentials across contexts; and (c) structural features of strategy training practices associated with positive L2 outcomes. In addition, this research will investigate ways in which methodological features of the studies are related to the observed outcomes. The following research questions will guide this study:

1. What is the overall effectiveness of strategy instruction in improving L2 outcomes?


a) What L2 outcomes (reading, writing, listening, speaking, overall proficiency, vocabulary, grammar) are most positively affected by strategy instruction?

b) What contextual (e.g., ESL/EFL/FL), treatment (e.g., SI delivery mode), language typology (e.g., L1/L2 script similarity), and research (e.g., random assignment) characteristics moderate strategy instruction effectiveness on L2 outcomes?

2. What is the overall effectiveness of strategy instruction in improving self-regulated learning?

a) What self-regulated learning outcomes (anxiety, self-efficacy, attitudes, strategy use, strategy effectiveness) are most positively affected by strategy instruction?

b) What study characteristics (contextual, treatment, methodological) moderate strategy instruction effectiveness on self-regulated learning?

Method

Data Sources

Literature search procedures. To ensure that a high proportion of eligible sources (both published and unpublished) was located, the search strategy included an electronic literature search as well as a number of complementary literature searches. Complementary literature searches served to locate literature not easily accessible through electronic databases (e.g., research reports, book chapters, dissertations, and theses). This was done to minimize the threat of publication bias (i.e., the underrepresentation of studies lacking statistically significant findings in published sources, which can lead to overestimating the effects of interest in meta-analytic studies; Lipsey & Wilson, 2001). The electronic literature search included the following databases: (1) Linguistics and Language Behavior Abstracts [LLBA], (2) ProQuest Dissertations & Theses [PQDT], (3) ProQuest Research Library, (4) ERIC, (5) PsycINFO, (6) Sociological Abstracts, and (7) Web of Science.
Basic search terms in three categories—intervention, outcome, and methodological characteristics (e.g., language learning strategies, second language, experiment)—and subject indices in each database were used to locate eligible studies. Because individual databases use different keywords and search terms, the basic search terms were tailored for each database to maximize the number of results and their relevance. An example of a library search strategy for ProQuest is: (strateg* NEAR/2 lang* OR learn* strateg* NEAR/2 lang* OR strateg* instruction NEAR/2 lang* OR strateg* train* NEAR/2 lang* OR learn* train* NEAR/2 lang* OR process-based teach* NEAR/2 lang* OR learn* skills NEAR/2 lang* OR learn* behave* NEAR/2 lang* OR self* learn* NEAR/2 lang* OR autonom* learn* NEAR/2 lang*) AND (ab(second language OR foreign language OR biling* OR immersion)) AND (ab(intervention OR treatment OR control OR comparison OR experiment OR effect OR impact OR outcome)). Complementary literature search procedures included a manual search of the reference lists of qualifying studies and, through Google Scholar, forward citations of earlier syntheses.

Inclusion criteria. To capitalize on a broader knowledge base and to reflect the large variability in language learners’ backgrounds, the present meta-analysis aimed to integrate studies conducted between 2008 and 2014 around the world in varied L1/L2 combination contexts. This decision is based on precedent in educational (August & Shanahan, 2006), applied linguistics (Spada & Tomita, 2010), and educational psychology (Adesope et al., 2010) research. To capture relevant studies on the effectiveness of SI, the following inclusion criteria were developed:

(a) Studies must involve an explicit strategy training intervention targeting a single or multiple learning strategies in one or multiple strategy categories (i.e., cognitive, metacognitive, and/or socioaffective);

(b) Studies must have been carried out in a second or foreign language setting;

(c) Studies must be primary, quantitative investigations to allow for statistical data extraction;

(d) Studies must be experimental or quasi-experimental, between-groups designs, thus allowing for more valid inferences regarding instructional strategy effectiveness;

(e) Studies must be presented in English, French, Chinese, or Russian. (The language selection in “e” was limited to those languages which are spoken by the authors and in which research is typically published.)

Data Coding

After running electronic and complementary literature searches, the titles and abstracts of identified studies were scanned using a Screening Guide aligned with our inclusion criteria. The Screening Guide served as a basis for rater judgments regarding the likely relevance of the studies. If study relevance could not be determined from readings of the titles and abstracts, full-length study reports were retrieved and reviewed. Final relevance decisions for all studies were based on readings of the full text. The initial electronic and complementary literature searches yielded a total of 2,628 potentially relevant articles. Following a screening of titles and abstracts, 2,528 studies were removed. A full-text screening of the remaining 98 studies (two studies could not be retrieved) was conducted independently by two authors; a third author was consulted in cases of disagreement. Reasons for excluding studies during the full-text screening process included: irrelevant treatment (e.g., LLS were part of a larger-scope intervention with no possibility of ascertaining the unique effects of LLS; n = 43); irrelevant design (e.g., two SI interventions were compared with no true control, or results for language learners were reported in aggregate with those of native-speaking participants; n = 15); and irrelevant outcomes (i.e., neither FL/L2 nor self-regulated learning outcomes were assessed; n = 1).
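As a sanity check, the screening flow reported above can be reproduced with simple arithmetic (the counts are taken from the text; the variable names are ours):

```python
# Screening-flow counts as reported in the text.
identified = 2628                  # records from electronic + complementary searches
removed_by_title_abstract = 2528   # excluded after title/abstract screening
not_retrievable = 2                # full texts that could not be obtained

full_text_screened = identified - removed_by_title_abstract - not_retrievable
print(full_text_screened)  # 98 full-text reports screened

# Full-text exclusions by reason.
excluded = {"irrelevant treatment": 43, "irrelevant design": 15, "irrelevant outcomes": 1}
eligible = full_text_screened - sum(excluded.values())
print(eligible)  # 39 studies in the final sample
```

The counts are internally consistent: 98 full-text reports minus 59 exclusions yields the final sample of 39 studies reported below.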
The final sample of eligible studies across the two research questions was 39: 37 studies (47 independent samples) for the first research question and 16 studies (17 independent samples) for the second research question (two studies were non-overlapping, focusing exclusively on self-regulated learning outcomes). Next, studies meeting the inclusion criteria were coded by two independent raters using a pre-established coding protocol. The protocol was developed based on Valentine and Cooper’s (2003, 2008) recommendations for study quality assessment, piloted in a scoping review study, and refined by the research team. It included eight sections that captured information about the review process (e.g., reviewer and study identification), the study setting (e.g., educational level), sampling procedures (e.g., assignment to treatment), sample characteristics (e.g., age), treatment details (e.g., intervention duration), data collection (e.g., outcome measures), data analyses (e.g., unit of analysis), and the results (e.g., means, standard deviations). One outcome of the data extraction procedures was a set of effect sizes to be synthesized in the meta-analysis. Another was a set of study codes—categorical codes along four dimensions: outcomes, learning contexts, structural features of SI treatments, and methodological characteristics of individual studies—for subsequent moderator analyses. After the initial training, the average interrater reliability on the coding protocol was about 92%, with most disagreements falling under the treatment details categories. All disagreements were resolved between the two coding authors, consulting with a third author as needed.

Data Analyses

Effect size estimation. The standardized mean difference (Cohen’s d) served as the measure of effect size; its estimation for individual studies and integration across studies

followed the common and best practices recommended by the Cochrane Handbook (Higgins & Green, 2011). For each individual study, a Cohen’s d index (Lipsey & Wilson, 2001) was computed initially. This index is the standardized difference between two group means, divided by the pooled standard deviation. Cohen’s d was computed as:

d = (Y_T − Y_C) / s_p,

where Y_T is the treatment group mean, Y_C is the comparison group mean, and s_p is the pooled standard deviation. Then, because of the inherent bias of standardized mean difference effect sizes in small-sample studies, we applied Hedges’ g (Hedges, 1981) as the usual correction for this issue. When both pretests and posttests were available, we used adjusted posttest effect sizes, computing the “difference in differences” in the means from pretest to posttest. This strategy allowed for more comparable groups, thus addressing some of the shortcomings of studies with quasi-experimental designs:

d_post-pre = [(Y_postT − Y_preT) − (Y_postC − Y_preC)] / s_p,

where Y_postT is the treatment group mean at posttest, Y_preT is the treatment group mean at pretest, Y_postC is the comparison group mean at posttest, Y_preC is the comparison group mean at pretest, and s_p is the pooled (posttest) standard deviation. When primary studies presented only inferential statistics (e.g., an F ratio), we used formulas given in Borenstein, Hedges, Higgins, and Rothstein (2009) to convert the statistics to the appropriate effect size metric. For studies reporting results from analyses of covariance, we used adjusted means (when reported). When studies reported more than one effect size from the same sample (e.g., multiple measures of the same construct), we either selected the most relevant outcome (e.g., a test rather than a self-assessment) or used a weighted average effect size (e.g., synthesizing results reported individually by LLS), so that each study contributed only one independent effect size to the main analysis; we used the shifting unit of analysis approach (Cooper, 2010) when testing moderators. The shifting unit of analysis approach involves averaging effects when appropriate (e.g., averaging related language outcome effects when testing the overall SI effect), then splitting the effects when testing the dimension on which they differ. In this example, when asking whether program effects appear to be larger for one language outcome than for another, a study with two or more language outcome measures contributed to both levels of that moderator.

Effect size integration and moderator analyses. Although the interpretation of results could potentially be maximized by providing both fixed- and random-effects models, a decision was made to use the random-effects model to compute the main language and self-regulated learning effects of explicit SI, for two major reasons. First, unlike the fixed-effect model, the random-effects model assumes that the true effect size varies from one study to the next (Hedges & Vevea, 1998). That is, the fixed-effect model assumes that all studies are estimating a single population parameter.
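The effect size computations described above can be sketched in a few lines of code (an illustrative reimplementation, not the software actually used in the analysis; the function names and example values are ours):

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference: (treatment mean - comparison mean) / pooled SD."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

def hedges_g(d, n_t, n_c):
    """Small-sample bias correction (Hedges, 1981): g = J * d."""
    correction = 1 - 3 / (4 * (n_t + n_c - 2) - 1)
    return correction * d

def d_post_pre(post_t, pre_t, post_c, pre_c, pooled_post_sd):
    """Pretest-adjusted 'difference in differences' over the pooled posttest SD."""
    return ((post_t - pre_t) - (post_c - pre_c)) / pooled_post_sd

# Hypothetical study: treatment M = 75, comparison M = 70, both SD = 10, n = 30 each.
d = cohens_d(75, 70, 10, 10, 30, 30)   # 0.50
g = hedges_g(d, 30, 30)                # slightly smaller after the correction
```

With equal group sizes and equal standard deviations, the pooled SD reduces to the common SD, and the Hedges correction shrinks d by a factor of 1 − 3/(4df − 1), which matters mainly for small samples.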
This assumption implies that if all studies were infinitely large, they would all have the same effect size. By extension, it also implies that methodological differences across studies do not matter. In contrast, the random-effects model assumes that studies are drawn from a distribution of effect sizes. This assumption implies that (a) differences in methods across studies do matter and (b) even if all studies were infinitely large, they would not all have the same effect size. Thus, the random-effects model is more appropriate

for our study: even though recent SI research has seen notably greater standardization of SI intervention frameworks, other variations in methods (e.g., duration, delivery mode) and samples (e.g., age, educational level) remain. Second, a random-effects model allows for broader generalization of results (Lazowski & Hulleman, 2016). Considering our goal to draw inferences beyond the studies included in this meta-analysis, we decided to use the random-effects model for the main effects. As recommended by Borenstein et al. (2009), all moderator analyses were carried out using a mixed-effects model, which includes a random-effects model within subgroups. Within constructs, all individual effect sizes were integrated to compute a weighted average effect size and its 95% confidence interval. Weighting of contributing individual effect sizes was done using the typical inverse-variance weighting method. The heterogeneity of study effects was explored using a statistical significance test (the Q statistic; a significant Q statistic indicates that the effect sizes vary more than would be expected given sampling error alone, suggesting that studies estimate different population parameters). To estimate the amount (degree) of such heterogeneity, we also computed τ² and I². (These data are available upon request.) All effect size integration and moderator analyses were conducted using the Comprehensive Meta-Analysis software.

Missing data. Missing data can undermine the interpretation of any meta-analysis. Publication bias occurs because studies lacking statistically significant results are less likely to be published. Because published studies are easier to locate than unpublished studies, this can lead to a bias against the null hypothesis.
There are no well-accepted statistical methods for dealing with this problem; thus, we used complementary literature searches to locate unpublished material (the procedures used to locate the highest number of eligible studies, both published and unpublished, are described earlier). In addition, to assess the likelihood of the presence and the magnitude of publication bias, we carried out three different tests: Classic Fail-Safe N, Orwin’s Fail-Safe N, and Egger’s regression analysis (Borenstein et al., 2009).

Results

The first research question addressed the effectiveness of SI in improving second/foreign language learning. The overall weighted mean of .78 (Q = 247.38, p < .001; see Table 1) represents a large effect on Cohen’s (1988) scale. The second research question addressed the effectiveness of SI on self-regulated learning associated with second/foreign language learning. The overall weighted mean for this learning domain of .87 (Q = 167.59, p < .001) also represents a large effect. These results indicate that, on average, the participants in the SI treatment groups scored approximately 0.8 and 0.9 standard deviations above those in control groups on language and self-regulated learning measures, respectively. Statistically significant heterogeneity test results (Q statistics) indicated that the studies estimated different population parameters. To assess the likelihood of the presence and the magnitude of publication bias, we first conducted Egger’s regression test (Egger, Smith, Schneider, & Minder, 1997). With regard to the study sample for the first research question, the results showed a symmetrical distribution around the weighted mean effect size, indicating no evidence of publication bias (p > .05). Second, a Classic fail-safe N test was performed to determine the number of null-effect studies needed to raise the p value associated with the mean effect above an arbitrary alpha level (α = .05).
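Both fail-safe statistics have simple closed forms, sketched below (a generic illustration with our own function names; the values actually reported in this review depend on the precise weighted means and numbers of effects used, so the example inputs are rounded figures from the text):

```python
def orwin_fail_safe_n(k, mean_es, criterion_es):
    """Number of zero-effect studies needed to drag the mean effect size of k
    studies down to criterion_es (Orwin, 1983), assuming missing studies average zero."""
    return k * (mean_es - criterion_es) / criterion_es

def classic_fail_safe_n(z_values, z_alpha=1.645):
    """Rosenthal's (1979) classic fail-safe N: number of null studies needed to
    raise the combined one-tailed p value above alpha (.05 by default)."""
    z_sum = sum(z_values)
    return (z_sum / z_alpha) ** 2 - len(z_values)

# Rounded inputs from this review: 37 studies, mean effect .78, criterion .05.
print(round(orwin_fail_safe_n(37, 0.78, 0.05)))  # 540
```

Applied to these rounded inputs, Orwin’s formula gives roughly 540, in the neighborhood of the value reported below; the exact figure reflects the precise weighted inputs used in the analysis.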
The Classic fail-safe N test revealed that 3,473 additional qualified studies would be required to nullify the overall effect size found in this meta-analysis. Finally, we conducted a more stringent test (Orwin’s fail-safe N) to examine the number of studies needed to invalidate the results of this meta-analysis. Results showed that 533 missing null studies would be required

to bring the mean effect size found in this meta-analysis to a trivial level of .05. Similar results were found for the second research question sample (Classic fail-safe N = 718; Orwin’s fail-safe N = 176). The results of these fail-safe tests show that the number of null or additional studies needed to nullify the overall effect size found in this meta-analysis is larger than the 5k + 10 threshold suggested by Rosenthal (1995). Hence, the meta-analytic results of this study are unlikely to be threatened by publication bias. For individual outcomes, the largest effects were detected for vocabulary and reading, followed by listening, general proficiency, and speaking (language learning domain) and for strategy effectiveness, strategy use, and anxiety (self-regulated learning domain; see Table 1). Non-trivial effects were detected for writing, attitudes, and self-efficacy. Statistically detectable effect sizes were obtained irrespective of the strengths or weaknesses of primary study designs (i.e., whether or not reliability of the instruments used was reported, assignment was random, and pretests were administered; see Table 2). As elaborated below, a number of context, treatment, and methodological characteristics were found to have moderating influences on SI effectiveness for both the language learning and self-regulated learning domains. When a moderator had three levels and the omnibus moderator analysis indicated significant differences among the three levels, we conducted follow-up pairwise comparisons to examine the source of the significant differences.

SI effectiveness for language. Six contextual variables—setting, age, educational level, proficiency, and two language typology variables—were assessed with respect to their relationship with the effectiveness of SI on language learning (see Table 2).
Moderate to large effects were obtained for all subgroups, with the smallest effect associated with the ESL setting (ES = 0.58) and the largest effect associated with the elementary education level (ES = 1.31). The overall trends in the results show that larger effects on language learning were obtained in FL settings, for younger learners (both in terms of age and educational level), for learners above minimum proficiency, and for learners studying target languages belonging to the same family—but not necessarily having the same script—as their native languages. In addition, six treatment-related variables were assessed (strategy scope, number of strategies, instructional procedures, duration, and two delivery variables). Medium to large effects were detected for all these subgroups, with only two medium effect size exceptions: SI interventions targeting more than eight strategies (ES = 0.55) and those not focusing on metacognition (i.e., the behavior-modeling approach; ES = 0.57). The overall trends in the results show that larger effects on language learning were obtained for interventions that were shorter, focused on eight or fewer strategies, incorporated more than one strategy type, and used an awareness-raising approach. The effect for interventions targeting eight or fewer strategies (ES = 1.01) was significantly larger than that for interventions targeting more than eight strategies (ES = 0.55; Q = 6.08, p = .01). Last, the three methodological moderators (reliability reported, random assignment, and pretest) were assessed. Medium to large effects were detected across different methodological features; the only medium-sized effect, associated with studies not using a pretest (ES = 0.53), was significantly smaller than that for studies using a pretest (ES = 0.72; Q = 4.69, p = .03).

SI effectiveness for self-regulated learning.
With regard to self-regulation, the moderating effects for all contextual variables except language family were statistically significant (see Table 2). Prior to further discussion of the results for this learning domain, however, it is worthwhile to note that moderator analyses for educational level, duration, and


procedures should be interpreted with caution, as at least one of the comparison groups had a small sample size (i.e., a single study). Overall, contrasting to some extent with the results for language learning, larger effects were obtained in ESL settings, for older learners (both in terms of age and educational level), for learners above minimum proficiency, and for learners studying target languages that differ from their native language to a greater extent (belonging to a different language family and not sharing the same script). A post hoc analysis conducted for the 3-level settings moderator revealed that both the ESL and EFL effects were significantly larger than the FL effect (Q = 12.41, p < .001 and Q = 5.33, p < .05, respectively). There was no significant difference between the ESL and EFL setting effects (Q = 2.45, p = .117). Another post hoc analysis, for the 3-level educational level moderator, revealed that the effect for university/adult learners was larger than those for middle/high school (Q = 8.19, p < .001) and elementary school (Q = 30.99, p < .001) learners. In addition, the effect for middle/high school learners was larger than that for elementary school learners (Q = 6.15, p < .05). Notably, the effect on self-regulated learning was significantly and substantively (by about 3.4 times) larger for older participants (12 years and older; ES = 0.98) than it was for younger participants (under 12 years old; ES = 0.29; Q = 7.25, p = .01). Statistically and practically larger SI effects were found for learners studying languages with scripts different from—rather than similar to—their native language scripts (effect sizes of 1.12 and .26, respectively; Q = 9.83, p < .01). For treatment-related variables, larger effects on self-regulated learning were obtained for interventions that were longer in duration, focused on a single strategy type, used an awareness-raising approach, were delivered by technology, and were researcher-led.
Two of the moderating effects were statistically significant. Specifically, the effects were significantly larger for studies with SI focusing on only one strategy type (ES = 2.30), as contrasted with SI focusing on a combination of strategy types (ES = 0.69; Q = 4.91, p = .03), and for SI led by the researcher (ES = 0.97), as contrasted with SI led by the teacher (ES = 0.31; Q = 9.75, p = .02). For methodological moderators, there was only one significant effect, associated with assignment to treatments: studies employing random assignment yielded a statistically larger effect size (ES = 1.19) than did studies with non-random assignment (ES = 0.37; Q = 6.89, p = .01).

Correlations among moderators. Correlation analyses among statistically significant moderator results by domain and group indicated that, for self-regulated learning, unsurprisingly, age and educational level correlated positively, r = .54, p < .05. Interestingly, there was a significant negative correlation between proficiency and script, r = -.67, p