Teaching Thinking Skills

Psychology in Action

Richard J. Herrnstein, Raymond S. Nickerson, Margarita de Sánchez, and John A. Swets

ABSTRACT: A course was developed to teach cognitive skills that apply to learning and intellectual performance independently of subject matter, stressing observation and classification, reasoning, critical use of language, problem solving, inventiveness, and decision making. With pretests and posttests, it was taught experimentally to over 400 Venezuelan seventh graders, whose classes were matched to those of a control group. Although evaluation of such a course is conceptually difficult and long-term effects have not been assessed, standard and special objective tests and various subjective tests indicate consistently that the course had sizable, beneficial effects on its students.

Interest in classroom teaching of generally useful thinking skills has increased markedly. The prospects of improving intellectual competence have been discussed in several conferences and workshops, and intervention programs developed in several countries have recently been reviewed under the auspices of the National Institute of Education (Nickerson, Perkins, & Smith, 1985). This article describes one such program that was distinguished by an eclectic approach to the conceptual structure of the curriculum and a relatively extensive evaluation of results. The program described here was initiated by the Venezuelan government, which requested the assistance of Harvard University. From the outset, the collaboration also included researchers at Bolt Beranek and Newman, Inc. and in the Venezuelan Ministry of Education. The larger program in Venezuela of which this project was a part was described in Science (Walsh, 1981) and in the APA Monitor (Cordes, 1985). Although the course materials were developed for Venezuelan schools, in particular the seventh grade, and much attention was given to ensuring the appropriateness of the materials to that culture, they have since been published in English (Adams, 1986a). The course was designed to enhance performance on a wide variety of tasks that require careful observation and classification, deductive or inductive reasoning, critical use of language, hypothesis generation and testing, problem solving, inventiveness, and decision making. The focus was on cognitive skills that apply to learning and intellectual performance independently of subject matter, rather than conventional academic content. Skills that can reasonably be considered to be components of intelligence and that are sufficiently well defined to lend themselves to explicit instruction were the targets of the course.

American Psychologist, November 1986, Vol. 41, No. 11, 1279-1289. Copyright 1986 by the American Psychological Association, Inc. 0003-066X/86/$00.75

Author affiliations: Harvard University; BBN Laboratories Incorporated, Cambridge, MA; Ministry of Education, Caracas, Venezuela; BBN Laboratories Incorporated, Cambridge, MA

The design and preparation of the course material began in the fall of 1980. About a dozen experienced Venezuelan teachers worked with the project staff during part of the summer of 1981, becoming familiar with the material and suggesting modifications. Some of these teachers later trained the Venezuelan teachers who would actually teach the course in its final form, at which time they supervised the classroom teaching. During the second academic year, 1981-1982, some of the material was tested in Venezuelan classrooms and evaluated informally by the teachers, after which the material was further modified. Also during that year the remainder of the course was prepared, as were the testing instruments that were to be used for formal evaluation of the course during the academic year 1982-1983. Various tests of mental abilities were adapted, developed, and used during the second year. Two main purposes were served by the testing at this stage: The first was to assess the feasibility of the tests and to adapt them where necessary for Venezuelan seventh graders. The second purpose was to find schools in which the distributions of test scores were sufficiently similar to permit the selection of comparable experimental and control groups for the third year. Besides several standard tests of general aptitude, some 500 test items reflecting the evolving content of the experimental course were written, of which about 70% were administered to samples of schoolchildren. The resulting Target Abilities Tests, as well as the final set of standard mental abilities tests, are described later.

The Experimental Course

The course as finally composed contains six lesson series, each of which addresses a topic that the project staff considered to be important to intellectual competence. Each lesson series is divided into 2 or more units that focus on specific aspects of the series topic. Each of the 20 units consists of 3 or more lessons, so that the total number of lessons is approximately 100. Table 1 gives a complete list of the series and their units, the number of lessons prepared, and the numbers of lessons in each unit actually taught in the experimental course. The table also includes brief descriptions of some of the main objectives of each unit. A central feature of the course not evident in the table is that the basic concepts and skills developed in Lesson Series I are generalized and exercised in new contexts throughout Lesson Series II through V. The course's length and general nature preclude our describing it more fully here; an analysis of its design appears elsewhere (Adams, 1986b). The individual lessons are the backbone of the course. They are presented in a common format, and in every case a detailed set of suggestions is given regarding how to proceed in the classroom. The format always gives the teacher the following categories of information: (a) a rationale, an explanation of why the lesson is included in the course; (b) objectives, the specification of the skills or concepts that the lesson is intended to develop; (c) target abilities, a list of things the students should be able to do after completing the lesson; (d) products, tangible things the students are required to produce; (e) materials, other than paper and pencil, that are needed for the lesson; and (f) classroom procedure, detailed instructions to the teacher regarding how to proceed. The material presented under the category of classroom procedure provides the teacher with a detailed plan for conducting the class, including estimated amounts of time for each activity. Specifying the classroom procedure in detail was part of an attempt to make the material usable by teachers without special training or qualifications.
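The six-category lesson format lends itself to a simple record structure. The following sketch is illustrative only: the field names follow categories (a) through (f) above, but the sample lesson contents are invented, not drawn from the actual course materials.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Lesson:
    """One lesson in the six-category format described above."""
    rationale: str                   # (a) why the lesson is in the course
    objectives: List[str]            # (b) skills or concepts to be developed
    target_abilities: List[str]      # (c) what students should be able to do afterward
    products: List[str]              # (d) tangible things students must produce
    materials: List[str]             # (e) materials needed beyond paper and pencil
    classroom_procedure: List[str]   # (f) step-by-step plan, with time estimates

# A hypothetical lesson record (contents invented for illustration):
ordering_lesson = Lesson(
    rationale="Sequences underlie many reasoning tasks.",
    objectives=["Recognize ascending and descending sequences"],
    target_abilities=["Extrapolate the next element of a simple sequence"],
    products=["A completed worksheet of extended sequences"],
    materials=["Number cards"],
    classroom_procedure=["Review previous lesson (5 min)",
                         "Group exercise on sequences (25 min)"],
)
```

A structure of this kind also makes plain why detailed classroom procedures supported teaching without special training: every category a teacher needs is filled in before the class begins.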
In addition, for purposes of evaluation, it was important that the course be taught as similarly as possible by each of the teachers who participated in the evaluation effort. The instructions for conducting classes include many suggestions for what the teacher and the students might say as they progress through a lesson. These dialogues, or scripts, were not meant to be read in class or memorized by the teacher, nor were the students' responses in the scripts the only acceptable ones. Rather, the purpose of the scripts was to emphasize the importance of a continuous interaction between teacher and students and to provide a model of how it might be achieved.

This project was inspired by Luis Alberto Machado, then Minister for the Development of Human Intelligence in Venezuela. It was funded by Petróleos de Venezuela. Although it is impractical to mention all of the people who have made significant contributions to it, we do wish to note the following people. Developers of the curriculum material included Marilyn J. Adams, José Buscaglia, Allan Collins, Carl Feehrer, Mario Grignetti, Susan Herrnstein, A. W. F. Huggins, Catalina Laserna, David Perkins, W. J. Salter, Kathryn Spoehr, Brenda Starr, and the authors of this article. Design and preparation of the lessons were coordinated by Marilyn J. Adams and supervised for the Ministry of Education by Margarita de Sánchez. David Getty played a key role in the evaluation effort. Jorge Domínguez, then chairman of Harvard's Committee on Latin American and Iberian Studies, was the project's principal investigator. The authors constituted the project's steering committee and are listed in alphabetical order. Correspondence concerning this article should be addressed to John A. Swets, BBN Laboratories, 10 Moulton St., Cambridge, MA 02238.

Teaching and Evaluating the Course

The Course Taught
The approximately 100 lessons prepared were more than could be taught in a single year, given the scheduling and time constraints under which the course was presented. In anticipation of the need for compression, we selected approximately 60 lessons to represent the course's coverage. Fifty-six lessons were actually used during the final evaluation year, as shown in Table 1. In general, the class met four days a week. Succeeding lessons were used on three of these days; the fourth day was devoted to review or completion of unfinished lessons.

Locale
The city of Barquisimeto was the site of the experiment. This city of approximately 400,000 inhabitants is the capital of the state of Lara in the central region of Venezuela and is approximately 300 kilometers southwest of Caracas. In consultation with Venezuelan associates, we decided to use classes only in schools classified by educational authorities as "barrio schools," a designation that indicates that the schoolchildren come from families of low socioeconomic status and minimal parental education. Of the 12 Barquisimeto schools thus designated at the appropriate grade level, 4 were eliminated from consideration either because of an insufficient number of seventh-grade sections or because standardized test scores obtained during the preliminary phase of the study suggested that the student population tested higher than the student population in other barrio schools. From the remaining 8 schools, three pairs (6 schools) were identified as good matches in terms of school size, physical layout, and type of neighborhood; thus, it was considered appropriate to use these schools to make up the experimental and control groups.

Students
Twenty-four seventh-grade classes from these six schools (4 from each school) participated in the final administration and evaluation of the course. Each class had approximately 30 to 40 students. Twelve of the classes (4 from each of three schools) were designated experimental classes, and 12 classes (from the three remaining schools) were designated control classes. The experimental group contained 463 students; the control group contained 432 students. All of the participating classes from a given school were either experimental classes or control classes; no school had classes in both categories. Because of the way students are assigned to classes in a given grade in Venezuela, classes varied considerably in average student age (from 145.5 months to 181.6 months). Experimental classes were paired with control classes so as to equate as nearly as possible the average student age in the two classes of each pair.

Table 1
Contents of the Course

Lesson Series I: Foundations of Reasoning (21 prepared, 18 taught)
  Unit 1: Observation and classification (6, 6). Using dimensions and characteristics to analyze and organize similarities and differences; discovering the basics of classification and hypothesis testing.
  Unit 2: Ordering (5, 5). Recognizing and extrapolating different types of sequences; discovering special properties of orderable dimensions.
  Unit 3: Hierarchical classification (3, 3). Exploring the structure and utility of classification hierarchies.
  Unit 4: Analogies: Discovering relationships (4, 4). Analyzing the dimensional structure of simple and complex analogies.
  Unit 5: Spatial reasoning and strategies (3, 0). Developing strategies to solve problems of resource allocation via tangrams.

Lesson Series II: Understanding Language (15 prepared, 15 taught)
  Unit 1: Word relations (5, 5). Appreciating the multidimensional nature of word meanings.
  Unit 2: The structure of language (5, 5). Discovering the logic and utility of rhetorical conventions.
  Unit 3: Reading for meaning (5, 5). Analyzing text for explicit information, implicit information, and point of view.

Lesson Series III: Verbal Reasoning (20 prepared, 0 taught)
  Unit 1: Assertions (10, 0). Exploring the structure and interpretation of simple propositions.
  Unit 2: Arguments (10, 0). Analyzing logical arguments; evaluating and constructing complex arguments.

Lesson Series IV: Problem Solving (18 prepared, 9 taught)
  Unit 1: Linear representations (5, 5). Constructing linear representations to interpret n-term series problems.
  Unit 2: Tabular representations (4, 0). Constructing tabular representations to solve multivariate word problems.
  Unit 3: Representations by simulation and enactment (4, 4). Representing and interpreting dynamic problem spaces through simulation and enactment.
  Unit 4: Systematic trial and error (2, 0). Developing systematic methods for enumerating all possible solutions; developing efficient methods for selecting among such solutions.
  Unit 5: Thinking out the implications (3, 0). Examining the constraints of givens and solutions for problem-solving clues.

Lesson Series V: Decision Making (10 prepared, 5 taught)
  Unit 1: Introduction to decision making (3, 3). Identifying and representing alternatives; trading off outcome desirability and likelihood in selecting between alternatives.
  Unit 2: Gathering and evaluating information to reduce uncertainty (5, 0). Appreciating the importance of being thorough in gathering information; evaluating consistency, credibility, and relevance of data.
  Unit 3: Analyzing complex decision situations (2, 2). Evaluating complex alternatives in terms of the dimensions on which they differ and the relative desirability of their characteristics on each of those dimensions.

Lesson Series VI: Inventive Thinking (15 prepared, 9 taught)
  Unit 1: Design (9, 9). Analyzing the designs of common objects in terms of functional dimensions; inventing designs from functional criteria.
  Unit 2: Procedures as designs (6, 0). Analyzing and inventing procedures in terms of the functional significance of their steps.

Total lessons prepared and taught, 1982-1983: 99 prepared, 56 taught.

Note. Numbers in parentheses after each unit are (lessons prepared, lessons taught).
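The per-unit lesson counts in Table 1 can be cross-checked in a few lines. This sketch simply verifies that the unit counts sum to the reported series totals and grand totals; the numbers are transcribed from the table.

```python
# (prepared, taught) lesson counts per unit, transcribed from Table 1
series = {
    "I: Foundations of Reasoning": [(6, 6), (5, 5), (3, 3), (4, 4), (3, 0)],
    "II: Understanding Language":  [(5, 5), (5, 5), (5, 5)],
    "III: Verbal Reasoning":       [(10, 0), (10, 0)],
    "IV: Problem Solving":         [(5, 5), (4, 0), (4, 4), (2, 0), (3, 0)],
    "V: Decision Making":          [(3, 3), (5, 0), (2, 2)],
    "VI: Inventive Thinking":      [(9, 9), (6, 0)],
}

prepared = sum(p for units in series.values() for p, _ in units)
taught = sum(t for units in series.values() for _, t in units)
# Table 1 reports series totals of 21/18, 15/15, 20/0, 18/9, 10/5, and 15/9,
# and grand totals of 99 lessons prepared and 56 taught.
```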

The largest difference between the average age of an experimental class and that of its matched control class was 4.6 months; the smallest was 0.1 month. The average age of experimental students was 158.6 months (13.22 years); the average age of control students was 158.8 months (13.23 years).
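One simple way to pair classes so that average ages match is to sort both groups by mean age and pair them rank for rank. This is an illustrative sketch of such matching, not the authors' actual procedure, and the class mean ages below are invented (the study's classes ranged from 145.5 to 181.6 months).

```python
def pair_by_mean_age(experimental, control):
    """Greedy matching: sort both lists of class mean ages (in months)
    and pair them rank for rank, keeping within-pair differences small."""
    return list(zip(sorted(experimental), sorted(control)))

# Hypothetical class mean ages in months:
exp_ages = [150.2, 158.0, 163.4, 171.9]
ctl_ages = [162.8, 149.9, 172.3, 158.4]

pairs = pair_by_mean_age(exp_ages, ctl_ages)
max_diff = max(abs(a - b) for a, b in pairs)  # worst within-pair age gap
```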

Tests of academic potential administered to approximately 900 seventh-grade students during the year preceding the course provided grounds for anticipating an approximate comparability of the seventh-grade student populations from which the experimental and control groups were to be selected. As will be shown, our expectations were approximately confirmed by the test results obtained from the students in the study.

Test Battery
Course evaluation was primarily based on objective, multiple-choice tests given immediately before (pretest), during (interim test), and immediately after (posttest) the teaching of the course. The standardized tests included the Otis-Lennon School Ability Test (OLSAT; Otis & Lennon, 1977), which provided 80 scorable items¹; the Cattell Culture Fair Intelligence Test (CATTELL; Cattell & Cattell, 1961), 89 scorable items²; and a representative group of General Abilities Tests (GAT; Manuel, 1962a), 239 scorable items.³ Not including practice items or time spent giving instructions, the standardized testing required about 4.5 hours. (The time limits for some of the subtests were extended beyond the recommended durations so as to give most students a chance to work on all the items. Because we could not have properly used United States norms in any event, we decided to relax the time pressure on answering test items.)

Target Abilities Tests (TATs) were tests we designed specifically along the lines of the course itself, drawing on the skills and processes, but not the content, employed in the lessons. The five TATs that appeared in both the pretest and posttest contained 296 items and took about 3.5 hours to complete, exclusive of practice items and instructions. Besides the pretest and posttest, four of the five TATs were given as interim tests, immediately or shortly after the lesson series for which they were appropriate. The fifth TAT, Inventive Thinking, was not given as an interim test because the relevant lessons came at the end of the course, just before the posttest.⁴

The tests were administered by 12 Venezuelans: teachers from Barquisimeto and professional staff from the Ministry of Education. The test administrators received intensive training in the use of the test materials, including practice administrations in the presence of both Venezuelan and North American observers. They were provided with detailed scripts for the conduct of testing sessions and with battery-operated, hand-held, electronic countdown timers. To the extent possible, the matched experimental and control classes were given their tests by the same administrators, at approximately the same point in the school year; time-of-day effects were also balanced across experimental and control classes.

¹ The OLSAT Intermediate Level 1, Form R (Otis & Lennon, 1977), intended for Grades 6 through 8, was translated from English to Spanish and, where necessary, acculturated for the Venezuelan context. Three of the 80 items in the original English version were judged to be impossible to acculturate satisfactorily; for these, 3 comparable items were selected from earlier versions of Otis-Lennon tests.
² Scale 2, Forms A and B (for age 8 to average adult), of the CATTELL (Cattell & Cattell, 1961) were administered in Spanish. Each form contains four subtests, in which only pictorial shapes and figures are used to complete a series, classify figures, complete a matrix, and infer rules. Because the test is quite short, we merged the two forms by combining the corresponding subtests. It was later decided that 3 of the 92 items in the combined test were ambiguous, and hence unscorable, so these items were eliminated, resulting in a test of 89 items.
³ The GAT combined eight subtests from various sources: three from Guidance Testing Associates' Tests of General Ability (Manuel, 1962a) and three from their Tests of Reading (Manuel, 1962b); one from the Puerto Rican Department of Education's Test of General Ability; and one developed by our own staff. Five of the subtests were verbal (i.e., vocabulary, sentence completion, synonyms, verbal analogies, and reading comprehension), two were quantitative (i.e., arithmetic, numerical series), and one was pictorial (visual analogies).
⁴ A practice test consisted of 25 items adapted from earlier Otis-Lennon tests and was unscored. An attitude questionnaire was designed to assess students' attitudes toward school, peers, teachers, and self. In this report, nothing more will be said about the results of this questionnaire, which, in fact, were not significantly correlated with the measures of the educational effects of the course.

Teacher Selection and Preparation
The course was taught by Venezuelan middle-school teachers. They were chosen from a group of 12 candidate teachers who were selected during the preceding year from the pool of experienced seventh- to ninth-grade teachers in the barrio district of Barquisimeto. Besides being available for and interested in full-time participation in the project during the year, the candidates had to be qualified teachers in standard academic subjects, such as civics, mathematics, biology, social science, or language. The teachers selected were those who maintained interest in the project as preparations for the course evolved and whose practice sessions with seventh graders in the preceding year seemed most satisfactory. Eight were designated as course teachers: six who were to give the course, and two who observed classes and filled in as substitutes when necessary. Occasionally, a class was taught by a member of the Ministry of Education staff who had also been observing classes. Two teachers were assigned at random to each of the three experimental schools, and within each school, each teacher was randomly assigned to two seventh-grade sections to which he or she (there was one male teacher in the group) taught the entire course. All classes were monitored by one or more members of the Venezuelan staff and sometimes by visiting members of the North American staff.

The course involved a regular cycle of activities. First, the teachers and other members of the Venezuelan staff met to read, analyze, and discuss a lesson, and to agree on how the lesson script would be used. Next, the lesson was rehearsed, with one of the teachers addressing the rest of the group. Subsequent discussion provided further tuning of the script. The lesson was then taught to the students, in the presence of Venezuelan observers. For the beginning of a new lesson series, North American members of the team were also present. Review and feedback sessions were held at the end of each school day on which a lesson had been taught. A weekly review session addressed more general considerations of the progress of the course.

Results

Evaluation of an effort of this type is a difficult undertaking. What should be taken as evidence of success or failure? Indeed, what would constitute success or failure? How do we determine what the students, and the teachers, learned? If the students' performance on test instruments improved, how do we know to what to attribute the improvement? In particular, how do we know that any improvement is not simply a Hawthorne effect, a strengthening of performance that is due to generalized motivational factors rather than to specific improvements in intellectual skills? How do we judge whether specific effects are likely to persist? How do we evaluate the merits of particular parts or aspects of a course so as to determine what modifications would improve it as a whole? How do we assess the roles of uncontrolled variables as determinants of results? At this point, few if any of these questions can be answered decisively. However, an analysis of the data leaves no doubt that the course had statistically reliable and substantial effects on the students who took it and that the effects reflected the course content.

Standard Tests

Table 2 summarizes the results for the general mental abilities tests before and after the course, for experimental and control students. The relative ability levels of the two groups were established by subtracting the number of items correctly answered on the pretest by the average control student from the number correctly answered by the average experimental student. On all three tests, the experimental students had a slightly higher average pretest score, but for the OLSAT and GAT, the difference fell short of conventional statistical significance, as estimated from the t distribution. The gain score (posttest - pretest) was significantly higher for the experimental students on all three tests, although just marginally so on the CATTELL. As background, Table 3 shows the averages of the number of correct items for the experimental and control groups on the pretests and posttests. The bottom row of Table 2, the d value, is a measure of "effect size," as used in the evaluation literature. (Various uses of d are discussed in Light, 1983.)

Table 2
Tests of General Mental Ability in Percentage of Items Correctly Answered

                                   OLSAT          CATTELL           GAT
Measure                        Pretest  Gain   Pretest  Gain   Pretest  Gain
Difference
  (Experimental - Control)       0.9     4.1     2.1     1.3     1.4    10.0
t value                          1.2     7.1     2.4     2.0     0.6     9.1
p <                              .23    .001     .02     .02     .6     .001
d value                         0.09    0.43    0.19    0.11    0.05    0.35

Note. OLSAT = Otis-Lennon School Ability Test. CATTELL = Cattell Culture Fair Intelligence Test. GAT = General Abilities Tests. The OLSAT included 80 items; the CATTELL, 89 items; and the GAT, 239 items. Gain = posttest - pretest; difference in gain = experimental (posttest - pretest) - control (posttest - pretest); d value = (experimental group M - control group M) ÷ control group SD.


Table 3
Pretest and Posttest Scores for the Experimental and Control Groups on the Standard Tests

                          OLSAT               CATTELL                GAT
Measure             Pretest  Posttest   Pretest  Posttest   Pretest  Posttest
Experimental M        27.0     39.9       49.9     57.6      123.1    147.9
  SD                   9.0     11.6       11.2      9.3       29.0     27.7
Control M             26.2     34.9       47.8     54.2      121.7    136.5
  SD                   9.7     10.8       11.4      9.3       28.8     27.8

Note. OLSAT = Otis-Lennon School Ability Test. CATTELL = Cattell Culture Fair Intelligence Test. GAT = General Abilities Tests.
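From the means in Table 3, the gain scores and the experimental-to-control gain ratios (the quantity plotted in Figure 1) can be recomputed. A short sketch with the table's values:

```python
# Pretest and posttest means transcribed from Table 3.
means = {
    # test: (exp_pretest, exp_posttest, ctl_pretest, ctl_posttest)
    "OLSAT":   (27.0, 39.9, 26.2, 34.9),
    "CATTELL": (49.9, 57.6, 47.8, 54.2),
    "GAT":     (123.1, 147.9, 121.7, 136.5),
}

ratios = {}
for test, (ep, eo, cp, co) in means.items():
    exp_gain = eo - ep                       # experimental gain
    ctl_gain = co - cp                       # control gain
    ratios[test] = 100 * exp_gain / ctl_gain  # percentage, as in Figure 1

# The article reports the CATTELL ratio as roughly 121%; computed from the
# rounded table means it comes out near 120%, so the agreement is close.
```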

As indicated in Table 2, the d value is calculated by subtracting the control group mean from the experimental group mean and dividing the result by the control group standard deviation. It thus expresses the experimental group gains in units of control group variability in gains. Standardized in this manner, the differences between the experimental group's scores and the control group's scores were plainly affected by the course, especially for the OLSAT and the GAT. Considering that d = .35 for the gains on the GAT, it follows that by taking the course, an "average" student would shift from the 50th percentile to the 64th percentile on the GAT. Given essentially equal standard deviations of the experimental group's and control group's distributions, and the demonstration that students in the experimental group at different initial levels of ability were affected about equally in terms of their performance on the GAT (described later), other increases in percentile rank to be expected from a d of .35 can be estimated, such as 64 to 76, 76 to 85, and 85 to 92. If we think of the GAT scores as a test for entry into some higher level of education or occupational capability, then the higher the selection criterion, the greater the impact of the course. With d = .35, a criterion that permitted half of the control population to pass would permit 64% of the experimental population to pass, which is a 28% improvement; a criterion that permitted only 10% to pass would result in a 70% improvement for those who took the course. With higher values of d (as for the TATs, described later), the effects are correspondingly larger. Of the four subtests of the CATTELL, only the gain for visual series was reliably greater for experimental students than for control students, even though the overall gain was itself significantly different. On the other three subtests (classification, matrices, and inferring rules), the experimental students and control students gained about equally during the school year.
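The percentile arithmetic above follows from the normal curve: add d (in control-SD units) to a student's z score and convert back to a percentile. A minimal sketch using only the standard library (the bisection inverse is an illustration, not a production-grade quantile function):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Inverse normal CDF by bisection (adequate for this illustration)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def shifted_percentile(start_percentile, d):
    """Percentile reached when a score at start_percentile gains d SDs."""
    z = phi_inv(start_percentile / 100.0)
    return 100.0 * phi(z + d)

# d = .35 (GAT gains): 50th percentile -> about the 64th, and further up the
# chain 64 -> 76, 76 -> 85, 85 -> 92, matching the figures in the text.
# d = .70 (all TATs combined, reported later): 50th -> about the 75th.
```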
On the GAT, the gain was significantly greater for the experimental group for seven of the eight subtests, the one exception being the subtest for numerical series, on which the experimental students showed an insignificantly larger gain than the control students did. Both the effect on visual series and the absence of an effect on numerical series appear to be consistent with the course content.

Table 4
Target Abilities Tests in Percentage of Items Correctly Answered

                    Foundations                  Problem       Decision      Inventive
                    of Reasoning    Language     Solving        Making       Thinking
Measure             Pre.   Gain   Pre.   Gain   Pre.   Gain   Pre.   Gain   Pre.   Gain
Difference
  (Exp. - Control)   1.2    5.0    2.6    6.1    1.1    2.6    0.4    2.2    0.8    2.4
t value              2.0   11.1    3.2   10.1    2.5    6.1    1.6    8.2    2.1    6.9
p <                  .04   .001   .001   .001    .01   .001    .12   .001    .03   .001
d value             0.15   0.64   0.27   0.62   0.19   0.46   0.12   0.77   0.17   0.50

Note. Pre. = pretest; Exp. = experimental. The Foundations of Reasoning test had 74 items; Language, 85 items; Problem Solving, 60 items; Decision Making, 28 items; and Inventive Thinking, 42 items. Gain = posttest - pretest; difference in gain = experimental (posttest - pretest) - control (posttest - pretest); d value = (experimental group M - control group M) ÷ control group SD.

Table 5
Pretest and Posttest Scores for Experimental and Control Groups on the Target Abilities Tests

                  Foundations                    Problem        Decision       Inventive
                  of Reasoning    Language       Solving         Making        Thinking
Measure           Pre.   Post.   Pre.   Post.   Pre.   Post.   Pre.   Post.   Pre.   Post.
Experimental M    37.9   44.7    41.2   51.2    22.0   26.7    10.1   12.7    19.2   23.4
  SD               7.8    8.5    10.3   11.8     5.7    6.7     3.0    3.7     4.9    5.8
Control M         36.7   38.5    38.6   42.4    20.9   23.0     9.7   10.2    18.4   20.2
  SD               7.8    8.1     9.9   10.9     5.7    6.4     2.9    2.8     4.8    5.3

Note. Pre. = pretest; Post. = posttest.
Visual series involve skills explicitly addressed in Lesson Series I, Unit 2: Ordering, whereas numerical series skills are not taught at all. What may not be clear is why performance failed to improve on the two other CATTELL subtests that overlapped with the course, namely, classification and inferring rules. The CATTELL as a whole is meant to measure "fluid g," which, according to Cattell (1971), is a pervasive mental capacity that is not readily trainable. It is typically contrasted with "crystallized g," an equally pervasive mental capacity that depends on fluid g but also reflects the effects of learning and is captured by more conventional omnibus measures of intelligence, such as the OLSAT and GAT. Our results are, in short, approximately consistent with Cattell's distinction between fluid g and crystallized g.

Target Abilities Tests
In Table 4, the comparison of pretest and gain scores on the TATs indicates that all the skills taught by the course improved. The experimental group scored higher than the control group on the pretest and then gained substantially more during the year. The measure of effect size (d) for the gains on all the TATs combined is .70, twice that for the GAT. An average student would shift from the 50th percentile to the 75th. In terms of numbers of students passing various criteria, a d of .70 means that the course produces a 50% improvement for a selection criterion that allows half the control group to pass and a 170% improvement for one that allows one tenth of the control group to pass.⁵ As background, Table 5 shows the averages of the number of correct items for the experimental and control groups on the TATs.

Figure 1 summarizes the major results by showing the experimental group gains as a percentage of control group gains for the three standard tests and the TATs. Identical performance by the experimental and control groups (no difference) would have been shown as a 100% gain. Even the smallest of the obtained values, 121% for the CATTELL, was a statistically significant advantage in gains for the experimental group. The largest value, that for the TATs, was 217%.⁶

Figure 1
Summary of Major Results: Gains of Experimental Group as Percentage of Gains of Control Group for the Three Standard Tests and the Target Abilities Tests

Interactions With Initial Ability
A fundamental question is whether the course differentially affected students who began at different levels of ability. The question of initial ability would arise in any event, but it is especially germane here because the experimental group was superior to the control group on most pretest measures, although only slightly so. To examine the contribution of initial ability, Figures 2 through 5 divide the two groups into deciles of pretest general mental ability scores and plot the gains as measured by the various tests. The measure of initial ability is based on the 408 items of the combined OLSAT, CATTELL, and GAT when the TAT is plotted (in Figure 5), and on the appropriate subset of those 408 items that excludes the OLSAT (Figure 2), the CATTELL (Figure 3), or the GAT (Figure 4) when their respective gains are plotted. In each of the figures, the upper panel traces, against deciles of initial ability, the pretest and posttest scores (in percentage correct) for the experimental and control groups. In all cases, there was room for improvement over pretest performance, even for students in the top decile. The lower panel shows just the gains in percentage correct for the experimental and control groups. The two functions with open symbols are for pretests; filled symbols, for posttests; squares, for the experimental group; and triangles, for the control group. The quite consistent gains in percentage correct (bottom panels) indicate that the course generally benefited students throughout the range of initial abilities.⁷

Figure 2

Age
Younger children tested higher than older ones, a not uncommon finding in school systems that do not rigidly adhere to "social" promotion. The gains were also inversely correlated with age, but only for the experimental students. A young child in an advanced class probably got there because of greater receptivity to education. Gains for the control students, presumably owing to factors external to the course, were virtually independent of age. But even the significant correlations were small, ranging from -.15 to -.36.

Sex
In contrast to age, sex seemed to have no measurable relation to ability levels or benefits from the course. Of 66 comparisons of male students' and female students' pretests, posttests, and gains, 3 were found to be statistically significant at the .05 level of confidence, which is consistent with differences due to chance.

Attendance
Students' recorded rate of absence from classes was approximately 10%, after students who were absent from

⁵ As shown in Table 4, the smallest difference in gains (as measured by d) was for the Problem Solving test, but this was itself almost half a standard deviation. The test called Foundations of Reasoning, which superficially resembles the CATTELL most, showed a relatively large superiority in the gains of the experimental students as compared with the control students.
⁶ Experimental gain as a percentage of control gain is not a safe measure if the control gain is small. That, of course, is not the case here, and the percentage index is supplied as a summary and as a matter of facile communication, in the context of the several other indices reported.

Percentages Correct on the OLSA T for the Experimental and Control Groups on Pretest and Posttest (upper pane/) and Gains in Percentage Correct for Each Group (lower panel) for Each Decile of Scores on the Pretest of Genera/Mental Ability

Age The students spanned several years in age. Although not shown here, correlations were calculated for pretest, posttest, and gain scores for the experimental and control 7Not havingadduced evidencefor an intervalscale of test scores, we shall nt)t look for patterns in the functions shown in the bottom panels of Figures2 through 5, for example, in their slopes.The general appearanceis of slightdifferencesin gain acrossabilitylevelsfor a given test and only unsystematicdifferencesfrom test to test. November 1986 9 American Psychologist

1285

Figure 3

Figure 4

Percentages Correct on the CA TTELL for the Experimental and Control Groups on Pretest and Posttest (upper pane/) and Gains in Percentage Correct for Each Group (lower pane/) for Each Decile of Scores on the Pretest of Genera/Mental Ability

Percentages Correct on the GA T for the Experimental and Control Groups on Pretest and Posttest (upper pane/) and Gains in Percentage Correct for Each Group (lower pane/) for Each Decile of Scores on the Pretest of Genera/Mental Ability

more than 80% of classes were excluded. The correlation between classes missed (during the last four lesson series, for which data are available) and gains in percentage correct on the last four TATs combined was −.18, again a small but statistically significant relationship. We cannot say that greater attendance per se would have increased gains, for absenteeism may be correlated with other variables that affect performance, such as age.

Other Variables

With an analysis of variance (fixed-effect, completely nested model with unequal cell sizes and an unweighted-means procedure), we attempted to assess the contributions of school, class, and teacher to the variance in gains among the experimental students on the standardized tests as a whole and on the TATs. Differences among the 12 classes contributed no significant variation; the three schools contributed significantly to variation in the TATs, but not in the standardized tests; and the teachers contributed significantly to variation in both the standardized tests and the TATs.⁸

Table 6 presents the correlations between the general mental abilities tests, separately and combined (the combination of the standard tests is called "STD"), and the TATs for the pretest, posttest, and gains. The experimental and control groups had quite similar patterns during pretest and posttest, and the most similar tests (e.g., TAT and GAT) had the highest correlations. The much lower correlations for gains confirm the earlier conclusion that ability levels were only weakly associated with changes in scores during the year.⁹

⁸ We had not attempted to obtain an elite cadre of teachers for the course, and the finding of a "teacher effect" within the experimental group suggests significant variation across the teachers selected. Although the selection was not random, we believe that the teachers in both groups reasonably represented the teacher population in the district. We mention here that the experimental group was exposed to the selected teachers for about 15% of their total class time, and thus we do not entertain the hypothesis that the differences we observed between the experimental group and control group resulted from differences in teacher skill unrelated to the course.

⁹ One unexpected and potentially important effect of the course was to reduce slightly the reliability of the standardized tests, as measured by pretest-posttest correlations. Increases in measured ability may be an additional source of variance for posttest scores, which here may be showing up as reductions in pretest-posttest correlations. For the experimental group, this correlation for the standardized tests combined was .89; for the control group, it was .93, a highly significant difference for groups of this size. Each of the individual general mental abilities tests had a similar difference. For the TATs, the pretest-posttest correlation was .88 for the experimental group and .89 for the control group, an insignificant difference.

Figure 5. Percentages Correct on the TATs for the Experimental and Control Groups on Pretest and Posttest (upper panel) and Gains in Percentage Correct for Each Group (lower panel) for Each Decile of Scores on the Pretest of General Mental Ability

Supplementary Assessments

The hope for any educational innovation, of course, is that it will have large, positive, long-term effects. Unfortunately, long-term effects can be determined directly only after years have passed, and then only with considerable effort. There are, however, several ways to assess short-term effects. No one of these ways is sufficient by itself,
but taken together they may provide a fair representation of what a course such as this one has or has not accomplished. Objective testing probably has the greatest credence, but even the results of carefully designed objective tests can be misleading. If, for example, literacy is a basic problem, then written tests may fail to provide a good indication of what a child has or has not learned. Moreover, improvement in test scores per se is not the final objective of the effort; one wants some evidence that the teaching has not been so narrowly focused on solving problems of the kinds represented on the tests that the benefits have little generality.

Oral and open-ended test items. In addition to the results summarized earlier, several additional posttests were conducted to assess more precisely how the course had affected cognitive skills. A subset of 30 items of demonstrated intermediate difficulty from the TATs was chosen as especially representative of the underlying processes (as opposed to specific content) taught by the course. This "summary TAT" was administered to samples of experimental students and control students either as open-ended (rather than multiple-choice) written questions or as oral questions (read aloud twice by a teacher while the student followed the written text). Samples of 82 experimental students and 81 control students took the oral test, and samples of 179 experimental students and 182 control students took the open-ended written test. The students' answers were scored by members of the Venezuelan staff who did not know which group the students were in. Both experimental and control students correctly answered oral items more often than written items. However, the ratio of the experimental group's performance to the control group's performance was 120% in both formats. Written items were, as expected, harder than oral ones, but the benefits of the course were equally evident for both formats.
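Footnote 9 reports that the pretest-posttest correlation of the combined standardized tests was .93 for the control group but only .89 for the experimental group, "a highly significant difference for groups of this size." As an illustration only, that claim can be checked with Fisher's r-to-z transformation. The group sizes of 450 used below are an assumption for the sketch (the article says only that over 400 experimental students took the course), not a figure from the source:

```python
import math

def fisher_z(r: float) -> float:
    """Fisher's r-to-z transformation."""
    return math.atanh(r)

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two independent correlations."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# .93 (control) vs. .89 (experimental); n = 450 per group is assumed.
z = compare_correlations(0.93, 450, 0.89, 450)
p_two_tailed = 2.0 * (1.0 - normal_cdf(abs(z)))
print(f"z = {z:.2f}, two-tailed p = {p_two_tailed:.4f}")
```

With group sizes anywhere in this range the difference is significant well beyond the .01 level, while the TAT correlations of .88 and .89, run through the same test, come out nonsignificant, consistent with the footnote.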
Comparisons of the students' scores with their posttest scores on the regular multiple-choice version of the items indicated that experimental students were helped less by oral presentation than control students were (2 points of improvement in percentage correct for experimental students; 8 points for control students). The course apparently improved not only the underlying cognitive skills across formats, but also the ability to deal with written multiple-choice questions, a specialized skill on which modern educational systems rely considerably.

Design problem. There was some concern that multiple-choice questions do not adequately probe the originality that one might hope would be fostered by lessons on inventive thinking. Accordingly, a special 20-min Design Test was administered after the course to a random sample of 90 experimental and 90 control students. The single task on the test was to design a table for a small apartment that lacked room for a normal table. The instructions asked for drawings of the design as well as a written description of how the design worked. Two judges who were unaware of the students' group affiliations independently scored each test on 14 variables (e.g., number of words, relevant features, irrelevant features, multiple views, clarity of construction, fasteners). When the judges' ratings (which were highly correlated) were pooled, the experimental students and control students differed significantly on all 14 scales at the .02 level, and on 12 scales at the .001 level. In general, the experimental students provided more structured, elaborated, explicit, relevant, and functional designs. For the 12 variables showing unambiguously the superiority of the experimental group's designs, the value of d was .70, as compared with .50 for the inventive thinking TAT (see Table 4) or .70 for the TATs as a whole. The course evidently sharpened a student's ability to solve a practical design problem and to express the solution both pictorially and verbally.

Oral argument. The third and final special posttest was an Oral Reasoning Test, an attempt to provide a more ecologically valid indication of the course's impact on practical reasoning of a type not directly addressed in that part of the course used during the year. Samples of 79 experimental students and 80 control students were given the written question, "If a person has to have only two meals per day and cannot eliminate lunch, which would be the best for health--to eliminate breakfast or supper?" The students were tested individually by one of six interviewers, who instructed the student in the procedure and then gave him or her 5 min to prepare a position and arguments in support of it. The interviewer did not know the student or his or her group; the students tested by each interviewer were drawn in a random sequence from the experimental and control groups. The students' arguments were tape-recorded, then scored, usually by two independent judges (not interviewers) who were unaware of the students' group affiliations; these ratings were then averaged, but occasionally scoring was done by just one judge. The two variables scored were "numbers of reasons" and "quality of reasons."

Experimental students earned higher scores on both variables at the .01 level of significance. The value of d for the two variables pooled was .50, roughly halfway between the d of .35 for the GAT and the d of .70 for the TATs. In whatever way the students were changed by the course, it was sufficiently general to appear in a wide variety of performance evaluations.

Table 6
Intertest Correlations

                      Pretest            Posttest           Gain
                  Exper.  Control    Exper.  Control    Exper.  Control
OLSAT-CATTELL      .60     .62        .70     .68        .03     .02
OLSAT-GAT          .77     .77        .84     .83        .19     .10
CATTELL-GAT        .66     .70        .66     .66        .30     .20
TATs-OLSAT         .77     .77        .84     .84        .39     .24
TATs-CATTELL       .64     .62        .63     .59        .04     .06
TATs-GAT           .88     .88        .85     .87        .14     .15
STD-TATs           .88     .87        .87     .88        .25     .18

Note. OLSAT = Otis-Lennon School Ability Test; CATTELL = Cattell Culture Fair Intelligence Test; GAT = General Abilities Tests; TATs = Target Abilities Tests; STD = all general standard mental abilities tests combined.
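The effect sizes quoted throughout (d of .35 for the GAT, .50 here, .70 for the TATs) translate into percentile shifts and pass-rate improvements by normal-curve arithmetic. The following sketch, which assumes normally distributed scores with equal variances in the two groups, approximately reproduces the figures given earlier for d = .70 (the shift to roughly the 75th percentile, and the 50% and 170% pass-rate improvements; small discrepancies presumably reflect rounding in the published values):

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_quantile(p: float) -> float:
    """Inverse normal CDF by bisection (plenty accurate for this purpose)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

d = 0.70  # effect size for the gains on the combined TATs

# An average experimental student sits at about the 76th percentile
# of the control distribution (reported as "the 75th").
percentile = normal_cdf(d)

# Improvement in pass rate for a cutoff that half the controls pass:
# the experimental pass rate is Phi(d), about a 50% improvement.
improvement_half = normal_cdf(d - normal_quantile(0.5)) / 0.5 - 1.0

# ...and for a cutoff that one tenth of the controls pass: roughly
# a 180% improvement under these assumptions (reported as 170%).
cut = normal_quantile(0.9)
improvement_tenth = (1.0 - normal_cdf(cut - d)) / 0.1 - 1.0
```

The same arithmetic explains why a fixed d helps proportionally more at stricter cutoffs: the shifted distribution gains the most mass, relative to the control group, in the upper tail.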

Overview

In its use of large, matched groups, a fully specified intervention lasting an entire academic year, and extensive and objective pretesting and posttesting, this project was of unprecedented scale. A 56-lesson course directed toward fundamental cognitive skills was shown to have sizable and beneficial effects on a sample of Venezuelan seventh graders from economically and educationally deprived backgrounds. The benefits were assessed by general mental abilities tests and by specialized tests of the skills addressed by the course. The magnitude of the net beneficial effects ranged from about .10 to .75 in standard score units. The larger benefits were generally obtained on measures closest to the course content, but moderate to large benefits were obtained for some measures quite remote from the specific course content. Moreover, even several of the smaller effects were statistically significant.

Although the course succeeded in various respects, major issues remain unresolved. First and foremost, only short-term results are available now. Whether the fate of fading benefits awaits this program as it has many others, only time will tell--and only if nothing more is attempted with these students. Second, we cannot say specifically what aspects of the course were responsible for the beneficial effects. We cannot even positively exclude a Hawthorne effect, that is, an effect due to a generalized motivational enhancement for experimental students, as compared with control students, analogous to the placebo effect in medical intervention. However, the fact that the magnitude of the benefits on various tests bore fairly specific relationships to the structure of the course is evidence against a pure motivational enhancement. There is, moreover, at least one reason to believe that effects of the course will be sustained.
From direct observation, we are convinced that the lessons created a new, dynamic interaction between teacher and student, changing the classroom profoundly for both. The teacher

received continual feedback from the students; the typical student shifted from a somewhat passive classroom mode to a much more active involvement with the flow of material, more like a natural social interaction outside the classroom. The classes gave an observer a sense of 45 minutes packed with intellectual activity, characterized by probing and mutual questioning. From the protocols of both student and teacher reactions, it is clear that this heightened participation was also felt (and appreciated) by the participants themselves. Such a new interaction, and the thoughtfulness on which it depends, may be self-sustaining even after the specific results of an intervention have faded.

The composition of the lesson units was determined by several factors, in particular, the literature on intelligence, the intuitions of the project staff, and the requirement that the course not cover material included in the standard seventh-grade curriculum--which was taken to rule out inclusion of units on quantitative skills that would have overlapped with seventh-grade mathematics. The established findings of factor analysis of intelligence test items show that certain kinds of items are loaded with high-level factors, such as g itself or the so-called primary factors for language and reasoning. These findings account for the presence in the course of lessons on perceptual and verbal reasoning, verbal analogies, and reading comprehension, as well as various other lessons that involved materials known to be loaded with such factors. To these lessons were added lessons dictated by more pragmatic concerns, such as problem solving, decision making, and inventive thinking. Given the abilities we hoped ultimately to enhance, several lessons were implicitly designed to teach preparatory skills. That the course worked suggests that at least some of our conceptions were well founded.

Virtually everyone who has been associated with the course has come away with a strengthened belief in the possibility of teaching intellectual competences more directly than conventional school subjects do. This is not to deny the benefits of conventional academic subjects.


Rather, it is to point to another role for the classroom, one that traditional education often ignores, although it is well recognized at the frontiers of educational research (e.g., Detterman & Sternberg, 1982). Whether thinking skills are better taught with contents other than those of the usual school subjects or by training in the context of specific school subjects--or by a combined approach--remains an open question (Adams, 1986b; Glaser, 1984). Our results show that cognitive skills can be enhanced by direct instruction; it remains to be found how best to achieve pervasive and sustained benefits, and just how far such educational interventions can go.

REFERENCES

Adams, M. J. (Coordinator). (1986a). Odyssey: A curriculum for thinking. Watertown, MA: Mastery Education Corp.
Adams, M. J. (1986b). Schema mediation as robust thinking: The theory behind the Odyssey curriculum. Manuscript submitted for publication.
Cattell, R. B. (1971). Abilities: Their structure, growth, and action. Boston, MA: Houghton Mifflin.
Cattell, R. B., & Cattell, A. K. S. (1961). Culture Fair Intelligence Test (Scale 2, Forms A & B). Champaign, IL: Institute for Personality and Ability Testing.
Cordes, C. (1985, March). Venezuela tests 6-year emphasis on thinking skills. APA Monitor, pp. 26-28.
Detterman, D. K., & Sternberg, R. J. (1982). How and how much can intelligence be increased. Norwood, NJ: Ablex.
Glaser, R. (1984). Education and thinking: The role of knowledge. American Psychologist, 39, 93-104.
Light, R. J. (Ed.). (1983). Evaluation studies: Review annual (Vol. 8). Beverly Hills, CA: Sage.
Manuel, H. T. (1962a). Tests of General Ability: Inter-American Series (Spanish, Level 4, Forms A & B). San Antonio, TX: Guidance Testing Associates.
Manuel, H. T. (1962b). Tests of Reading: Inter-American Series (Spanish, Levels 3 & 4, Forms A & B). San Antonio, TX: Guidance Testing Associates.
Nickerson, R. S., Perkins, D. N., & Smith, E. E. (1985). The teaching of thinking. Hillsdale, NJ: Erlbaum.
Otis, A. S., & Lennon, R. T. (1977). Otis-Lennon School Ability Test (Intermediate Level 1, Form R). New York: Harcourt Brace Jovanovich.
Walsh, J. (1981). A plenipotentiary for human intelligence. Science, 214, 640-641.
