Educational Psychology, Vol. 20, No. 1, 2000

0144-3410/00/010099-17 © 2000 Taylor & Francis Ltd

Validity of the Differential Aptitude Test for the Assessment of Immigrant Children

JAN TE NIJENHUIS, ARNE EVERS & JAKKO P. MUR [1], Work and Organizational Psychology, University of Amsterdam, The Netherlands

ABSTRACT This paper addresses the construct as well as the criterion validity of the Differential Aptitude Test (DAT) for the assessment of secondary school minority group students (N = 111) as compared to majority group students (N = 318) in The Netherlands. Comparison of the test dimensions with the structural equation modelling program EQS showed that construct validity was good for both groups. With one exception, the subtests of the DAT measured the cognitive abilities of minority and majority group students equally well. The estimate of g as computed with the DAT showed strong predictive validity with little bias for various school subjects and for achievement tests for mathematics and Dutch. Although some criteria revealed prediction bias to the disadvantage of the minority group, these differences concerned very small changes in R². Conversely, the predictive value decreased substantially when an estimate of g was used that excluded the subtests measuring aspects of crystallised intelligence. Spearman's hypothesis tested with DAT subtest scores and criterion scores showed that g explained most of the group differences. Professional test users can safely draw conclusions from the DAT regardless of the students' ethnicity.

Standardised ability tests were first developed early in the 20th century. Since then, the use of these kinds of tests has increased substantially. A large part of this increase resulted from the assessment of the cognitive abilities of children by schools, which were the largest consumers of intelligence tests (Suzuki & Valencia, 1997). The use of standardised ability tests in educational settings is justified by their high predictive value: the average correlation between high school grades and standardised ability test scores is about 0.50. According to Jensen (1998), the highest validity coefficients (0.60 to 0.70) are found in elementary schools. This value drops slightly (0.50 to 0.60) for high school students, and still lower values are reported when the samples are even more restricted in range, as in higher educational courses. Similar values are reported by the Board of Scientific Affairs of the American Psychological Association (Neisser et al., 1996).

Educational decisions may have long-term consequences for the lives of the children or students concerned. Since test results often play an important role in the decision-making process, only tests with adequate validity should be used. Furthermore, tests should be equally valid for different groups; in other words, they should be unbiased, as small group differences in validity can have large consequences and may result in unfair treatment of the members of those groups. Review of the research on prediction bias in standardised ability tests shows that most tests are not biased against minority group children. Having evaluated a large body of research on test bias, Jensen (1980, p. 474) concluded that, "with a good choice of predictors … it is possible to predict scholastic achievement in the elementary grades with considerable validity (about 0.70) for white, black, and Mexican-American children without having to take their ethnicity into account." The situation in the European Community can be qualified as non-optimal: only a handful of studies is available, whereas the results of American research on native-born English-speaking minorities cannot be extrapolated indiscriminately to the immigrant populations of West European countries. In The Netherlands, research on ethnic bias will become more and more important, since the composition of the Dutch population has changed substantially during recent decades and will continue to do so in the near future. Owing to the increase of the immigrant population, minority group participation in Dutch high schools increased from 3.7% to 7.3% between 1992 and 1996; in the four major cities this figure rose from 18% to 30% (Dutch Ministry of Education, Culture and Science, 1997). However, there is no published research on test bias in secondary school children.
The few published studies on test bias deal with job applicants (te Nijenhuis & Van der Flier, 1997) or with primary school children (De Jong & Batenburg, 1984; Resing et al., 1986). Although these studies show little or no bias of well-known intelligence tests with respect to construct as well as predictive validity, the use of tests for the assessment of the heterogeneous immigrant population in The Netherlands is criticised both in the press and in the academic world. For instance, the Dutch Bureau for Prevention of Discrimination (LBR), which has great political influence, states that the use of traditional psychological tests strongly reduces the chances of minority group members in selection procedures and therefore advises the use of culture-reduced tests (Pattipawae & Tazelaar, 1997). On various grounds such a strategy is questionable, yet one cannot support the use of standardised ability tests for the assessment of secondary school children of minority groups when research on bias is absent. Moreover, the growing number of minority children justifies and indicates the need for further research. The length of residence in The Netherlands correlates about 0.30 with scores on tests with a verbal component and correlates near to zero with tests having no verbal component (te Nijenhuis, 1997). In their attempts to construct special tests for immigrants, Dutch researchers have followed strategies that are used in cross-cultural research, for instance the elimination of language in items and instructions, of scholastic and cultural context, and of specialised skills. A popular test in this respect is the Multicultural Capacities Test (MCT-M) (Bleichrodt & Van den Berg, 1997). An important conceptual distinction when testing in various groups is to be clear about whether one's purpose involves predictive validity or construct validity. A quite highly culture-specific test may derive much of its predictive validity from its cultural specificity per se.
As a consequence, the elimination of tests with a verbal component from a test battery usually reduces the predictive validity of the battery when the criterion involves verbal ability, as is the case with scholastic performance (Jensen, 1980). Cattell's Culture Fair Test of g, for instance, with its strong emphasis on nonverbal and pictorial tests, has only a moderate predictive validity for scholastic achievement (0.20–0.50) (Cattell, 1987).

Research Question

The central question of this study is whether the validity of standardised ability tests is the same for minority and majority group secondary school children. Validity is defined by Messick (1989, p. 13) as "an integrated judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment." Applying this definition to the current study, the research question can be reformulated as: can we draw the same conclusions from the test scores of minority and majority group members? This evaluation revolves around two issues. The first is whether children from the two groups with the same test score have equal cognitive abilities as measured with standardised ability tests. The second is whether majority and minority group students with the same test scores will show equal educational performance. The first issue concerns, among other things, the equality of the dimensions measured in the different groups. Cross-cultural researchers (Kagitcibasi & Savasir, 1988) state that invariance of factor structures in different cultural-ethnic groups and in different environments is a good indicator of test equivalence for these groups. The second issue concerns the degree to which children with the same test score show the same criterion behaviour. A regression analysis of criterion variables on test scores should show coinciding regression lines.
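The requirement of coinciding regression lines can be illustrated with a small simulation (a hypothetical sketch, not the study's data): when a criterion is generated from test scores by the same equation in both groups, allowing group-specific intercepts and slopes adds almost nothing to R².

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: an unbiased test predicts a criterion identically in two groups.
n = 200
group = np.repeat([0, 1], n)                      # 0 = majority, 1 = minority
test = rng.normal(100, 15, 2 * n) - 10 * group    # lower mean test score in one group
crit = 0.5 * test + rng.normal(0, 10, 2 * n)      # same regression line in both groups

def r2(X, y):
    """R-squared of an OLS fit of y on X (intercept added)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# Common regression line vs. group-specific intercepts and slopes
r2_common = r2(test, crit)
r2_separate = r2(np.column_stack([test, group, test * group]), crit)
print(round(r2_separate - r2_common, 3))  # near 0: no intercept or slope bias
```

With a genuinely biased test, the same comparison would show a clear gain in R² for the group-specific equations.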

Method

Research Participants

The data for this research were collected at three secondary schools: one located in the west, one in the middle and one in the east of The Netherlands. All first-year students participated, except for those who were ill on the day of testing. The individual results were reported to the schools and used for counselling. In the Dutch educational system, all secondary school students, irrespective of level of ability, follow the same curriculum in the first year. It is only in the second or later years that differentiation according to level takes place. In the participating schools, the lowest-level courses were not part of the curriculum; the schools combined first-year lower general secondary education (MAVO), higher general secondary education (HAVO) and pre-university education (VWO) students. The students were rated as minority or majority group members by their teachers. Minority students were defined as students whose parents were born outside The Netherlands or students with a clearly different cultural background from the majority group members. There was no information as to the exact period the minority students had lived in The Netherlands or as to their country of birth, but the schools indicated that most of the minority group students were first-generation immigrants. Because ethnic classification is a highly sensitive topic in The Netherlands, more detailed background information could not be collected.


The sample consisted of 429 students: 215 males (50.1%) and 214 females (49.9%); 318 (74.1%) were majority group members and 111 (25.9%) minority group members. Members of both ethnic groups were about equally divided with respect to gender.

Test

The Dutch Differential Aptitude Test (DAT; Evers & Lucassen, 1992) is partly a translation and partly an adaptation of the American DAT, Forms S&T (Bennett et al., 1974). In addition, a vocabulary subtest has been added to the eight American subtests. The Dutch version of the DAT does not have two parallel forms; mainly items from Form T were used for the construction of the Dutch test. The nine Dutch subtests are all timed and are administered in multiple-choice format. The DAT is one of the most widely used tests in The Netherlands. Evers and Zaal (1982) showed that 40% of a sample of 300 professional test users used the DAT. Extrapolation of this figure to all test users led them to the conclusion that in 1975 more than 200,000 people had been tested with the DAT in The Netherlands. Of all tests in The Netherlands developed for the assessment of the cognitive abilities of children of 12 years and older and of adults, the psychometric qualities of the DAT are rated highest by the Dutch Committee on Testing (COTAN; Evers et al., 1992). Studies reported in the Dutch DAT manual (Evers & Lucassen, 1992) show good reliability and validity. The reliability coefficients reported in the DAT manual for groups of the same level as in the present study are: Vocabulary 0.84, Spelling 0.73, Language Usage 0.78, Verbal Reasoning 0.75, Abstract Reasoning 0.85, Spatial Relations 0.90, Mechanical Reasoning 0.84, Numerical Ability 0.83 and Clerical Speed and Accuracy 0.86. All reliabilities are Cronbach's alpha coefficients, except for the Clerical Speed test, for which a test–retest coefficient was computed. As hierarchical factor models are at present the best validated models of psychometric intelligence, it is useful to describe the DAT subtests within the framework of such a model, although the test was not designed with a hierarchical factor model of intelligence in mind.
For this purpose, the most widely accepted hierarchical factor model, the three-stratum model of intelligence (Carroll, 1993), was used. In this model the highest level of the hierarchy (Stratum III) is general intelligence, or g. One level below are the broad abilities (Stratum II): Fluid Intelligence (Gf), Crystallised Intelligence (Gcr), General Memory and Learning, Broad Visual Perception (Gv), Broad Auditory Perception, Broad Retrieval Ability and Broad Cognitive Speediness. At the lowest level (Stratum I) we find narrow abilities like Sequential Reasoning, Spelling Ability and Visualisation. The subtests of the DAT are:

(1) Vocabulary (VO): from five words, one has to choose the word that has the same meaning as the target word. According to Carroll's (1993) taxonomy it measures Lexical Knowledge at Stratum I and is a measure of Gcr at Stratum II.
(2) Spelling (SP): one has to judge whether or not words are spelled correctly. This subtest measures Spelling Ability and is a measure of Gcr.
(3) Language Usage (LU): one has to look for grammatical errors in sentences. This subtest measures Grammatical Sensitivity and is a measure of Gcr.
(4) Verbal Reasoning (VR): one has to complete verbal analogies that are stated in the form "… is to B as C is to …". This subtest measures both Induction and Lexical Knowledge and is therefore a measure of both Gf and Gcr.
(5) Abstract Reasoning (AR): one has to choose the diagram that should logically be the next in a series of changing diagrams, thereby showing that the operating principle is understood. This subtest measures Induction and is a measure of Gf.
(6) Space Relations (SR): one has to imagine an object by unfolding and rotating it. This subtest measures Visualisation and is a measure of Gv.
(7) Mechanical Reasoning (MR): questions are asked about pictorial representations of mechanical situations and about their operating mechanisms. This subtest measures Mechanical Knowledge and is a measure of Gv.
(8) Numerical Ability (NA): arithmetic problems have to be solved. The arithmetic computations in this test are complicated; therefore, it is classified as Quantitative Reasoning and not as Numerical Facility (Carroll, 1993, p. 350), and it is a measure of Gf.
(9) Clerical Speed and Accuracy: one has to find target letter and number combinations in strings of letters and numbers. This subtest yields two scores: one indicating the number of right combinations, the other indicating the number of errors. Two scores from the same test may make them interdependent, and the violation of independence among observations can have a serious effect on the probability of Type I errors (Stevens, 1996). To prevent this, only the number of right answers (CSR) was used in the analyses. This subtest measures Speediness in the domain of Broad Visual Perception (Carroll, 1993, p. 465), which classifies it as a measure of Gv.
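Most of the subtest reliabilities quoted in this section are Cronbach's alpha coefficients. As a reminder of what that statistic is, here is a minimal implementation on simulated item scores (hypothetical data, not the DAT items themselves):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_persons, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Toy check: parallel items sharing a common trait yield a high alpha.
rng = np.random.default_rng(1)
trait = rng.normal(0, 1, 500)
items = trait[:, None] + rng.normal(0, 1, (500, 8))   # 8 items, equal loadings
print(round(cronbach_alpha(items), 2))
```

With eight items whose average inter-item correlation is about 0.5, alpha lands near 0.89, in the same range as the DAT subtests reported above.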

All tests were administered according to the conditions prescribed in the manual.

Criteria

Two types of criterion measures were collected. The first type consists of school grades for the subjects biology, Dutch, English, French, geology, history and mathematics. Grade points can range from 1 to 10, where 10 represents a perfect score. A score of 5.5 usually means that a student has sufficient command of the subject. For convenience, grade points in this study were multiplied by 10, giving a possible range from 10 to 100, with a score of 55 indicating sufficient command of the subject. The reliabilities of the school grades could not be computed. The second type of criterion measures consists of scores on scholastic achievement tests for mathematics and Dutch. These tests were constructed especially for this study according to the instructions of the Dutch Central Institute of Test Development (CITO). The test for Dutch has a possible range from 0 to 57; the test for maths from 0 to 46. In one of the schools the achievement tests were not administered; therefore this school is not included in the analysis of the predictive validity of the DAT. The Cronbach's alpha coefficients as computed in this study were 0.78 for the Dutch achievement test and 0.76 for the mathematics achievement test.

Statistical Analyses

Means and g Scores. The means of the subtests for the majority group, the minority group and a norm group of first-year secondary school students are reported. The DAT


manual provides a wide range of norm groups. The reported norm group consists of first-year secondary school students who attend schools comparable to the ones the research participants go to. The size of the mean differences between minority and majority group members was calculated in terms of the standard deviations of the majority group members. According to Jensen and Weng (1994), a good estimate of the subtests' g values can be made when a wide range of broad cognitive abilities (Stratum II) is being measured. In turn, each of these Stratum II abilities must be measured by at least three first-order cognitive abilities (Stratum I). As the nine DAT subtests are tests of Fluid Intelligence, Crystallised Intelligence and Broad Visual Perception, and none of them measures the other Stratum II cognitive abilities, the first requirement is not met. Instead, we used the subtests' loadings on the first unrotated factor of a principal axis factor analysis to obtain an estimate of the subtests' g values (Jensen & Weng, 1994; Thorndike, 1985). The g score of each research participant was computed by summing the products of the participant's z scores and the subtests' g values over all the subtests. The g scores were then used as a predictor.

Dimensional Comparability. In order to judge dimensional comparability, the factor structures of the DAT scores of the two groups were compared with the structural equation modelling program EQS (Bentler, 1995). The data were fitted to several models with increasing degrees of constraint: first the covariance matrices were compared, then the factor models and, finally, the subtest loadings on the Stratum II factors. The factor model to which the data were fitted was based on Carroll's (1993) hierarchical model of cognitive abilities. As the Stratum II factors are oblique in Carroll's model, we allowed the Stratum II factors to correlate in the model in our study.
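The g-score computation described under Means and g Scores can be sketched as follows. This is a rough illustration on simulated subtest scores: it approximates the first unrotated principal axis factor by a single eigendecomposition of the correlation matrix, without the communality iteration of a full principal axis factor analysis.

```python
import numpy as np

def g_scores(scores: np.ndarray):
    """Estimate subtest g loadings from the first unrotated factor of the
    correlation matrix (one-step approximation), then compute each person's
    g score as the loading-weighted sum of z scores."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)     # eigenvalues in ascending order
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    if loadings.sum() < 0:                      # fix the arbitrary sign
        loadings = -loadings
    return loadings, z @ loadings

# Toy data: nine positively correlated "subtests" driven by one latent g
rng = np.random.default_rng(2)
g = rng.normal(0, 1, 300)
subtests = g[:, None] * rng.uniform(0.4, 0.8, 9) + rng.normal(0, 1, (300, 9))
loadings, scores_g = g_scores(subtests)
print(round(np.corrcoef(scores_g, g)[0, 1], 2))
```

On data with a positive manifold, the estimated g scores track the latent factor closely; with real DAT data the loadings would play the role of the subtests' g values.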
The Stratum III ability g was not included in the model, as the low number of only three Stratum II factors would have caused identification problems. We used Carretta and Ree's (1995) study on sex differences in cognitive abilities as a standard for the interpretation of group differences in factor loadings. Most of the discrepancies they found were below 0.05, the biggest difference being 0.12. Therefore, a difference in factor loadings above 0.10 is considered substantial in this study. To explore the comparability of the dimensions further, loadings on the first unrotated principal axis factor and subtest loadings on the Stratum II factors were compared by calculating the congruence coefficient φ (Tucker, 1951). Values greater than 0.85 are generally considered high; from such values one may conclude that the factors have the same interpretation.

Differential Prediction. The predictive validity of g was calculated for both types of criterion measures. The resulting regression equations were tested for prediction bias with a step-down hierarchical regression analysis (Lautenschlager & Mendoza, 1986). This procedure begins by testing the hypothesis that a common regression line alone is sufficient to account for the relation of g with the criteria. If the null hypothesis of identical regression lines is rejected, it is tested whether the group × test term shows significant slope bias or whether the test and group terms show significant intercept bias. The procedure stops whenever the significance level is not reached in one of the above-mentioned steps. Additional predictive validities were computed for a g score based on the Fluid and Broad Visual Perception subtests only; all Crystallised tests were excluded. The

Validity of the Differential Aptitude Test

105

remaining six subtests closely resemble tests that are generally constructed for the assessment of immigrants.

Spearman's Hypothesis Tested with DAT Scores. Spearman's hypothesis (Jensen, 1993) states that g is the predominant factor, but not the sole factor, determining the size of the differences between two groups. To test this hypothesis, the correlation between the g values of the subtests (vector of g values, Vg) and the values of the group differences on the subtests (vector of effect sizes, Ves) was computed. To test Spearman's hypothesis, seven methodological requirements have to be met (Jensen, 1993):

(1) The sample should not be selected on criteria with a high g loading.
(2) The variables should have a reliable variation in their g values.
(3) The variables must measure the same latent traits in the different groups.
(4) The variables should measure equal g values in the subgroups, i.e. the value of the congruence coefficient of the first unrotated principal axis factor should be above 0.95.
(5) The g values should be computed separately for each group; if the congruence coefficient indicates a high degree of similarity, one should take the average g value.
(6) In order to avoid differences in reliability influencing the correlation between Vg and Ves, the variables should be corrected for attenuation.
(7) The test of Spearman's hypothesis is the Pearson correlation (r) between Vg and Ves; Spearman's rank order correlation (rs) should be computed to denote statistical significance.

To estimate the degree to which g can account for the group differences on the subtests, the mean differences on the subtests were regressed on the g values of the subtests. When subtests show greater group differences than might be expected from their g values, other variables in addition to g may have had an effect on the test scores, for example bias.

Spearman's Hypothesis Tested with Criterion Scores. Spearman's hypothesis was also tested for the criterion scores. As the criterion scores were not developed to measure g, it would not make sense to estimate g loadings on the basis of a factor analysis of the criterion scores themselves. Instead, the correlations of the criteria with the g score as measured with the DAT were used as an estimate of their g values. The application of Spearman's hypothesis to criterion scores gives an indication of the degree to which the differences in mean criterion scores of the two groups are accounted for by g.
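The core of the test in requirements (6) and (7) can be sketched as follows, using the majority-group first-factor loadings (Table II), the subtest effect sizes (Table I, Dev. column) and the manual reliabilities. Dividing both vectors by the square root of the reliabilities is one common form of the attenuation correction, not necessarily the exact computation used in the paper.

```python
import numpy as np

def spearman_hypothesis(g_loadings, effect_sizes, reliabilities):
    """Correlate the vector of g loadings (Vg) with the vector of standardized
    group differences (Ves), after a correction for attenuation."""
    rel = np.sqrt(np.asarray(reliabilities, float))
    vg = np.asarray(g_loadings, float) / rel
    ves = np.asarray(effect_sizes, float) / rel
    r = np.corrcoef(vg, ves)[0, 1]                 # Pearson r: the test proper
    rank = lambda v: np.argsort(np.argsort(v))     # 0-based ranks (no ties here)
    rs = np.corrcoef(rank(vg), rank(ves))[0, 1]    # Spearman rho for significance
    return r, rs

# Subtest order: VO, SP, LU, VR, AR, SR, MR, NA, CSR
vg = [0.64, 0.32, 0.66, 0.57, 0.63, 0.70, 0.61, 0.52, 0.14]
ves = [0.97, 0.05, 1.02, 0.49, 0.82, 0.65, 1.07, 0.49, -0.10]
rel = [0.84, 0.73, 0.78, 0.75, 0.85, 0.90, 0.84, 0.83, 0.86]
r, rs = spearman_hypothesis(vg, ves, rel)
print(round(r, 2), round(rs, 2))
```

The strong positive correlation between the two vectors is exactly the pattern Spearman's hypothesis predicts: the more g-loaded a subtest, the larger the group difference on it.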
Results

Means and g Scores

Table I shows the mean scores of the majority group students, the minority group students and a norm group of the same educational level (Evers & Lucassen, 1992). Furthermore, the deviations of the minority group scores from the majority group scores are given. The means and standard deviations of the majority group members are slightly and consistently lower than those of the norm group. Generally, the scores of the minority group members are about one standard deviation below the majority group means. Subtests with a language component, such as Mechanical Reasoning, Language Usage and Vocabulary, show the largest deviations from the majority group scores. However, subtests with no language component, such as Abstract Reasoning, Spatial Relations and Numerical Ability, show substantial differences in mean scores as well. The scores on the subtests Spelling and Clerical Speed and Accuracy do not differ much. Table I also shows the g scores of the research participants. The mean g score of the minority group members is 1.14 SD below that of the majority group. The loadings on the first unrotated principal axis factor are given in Table II. The pattern of loadings is similar for both groups, though there are some differences for individual subtests.

TABLE I. Means and standard deviations of the DAT subtests for the majority group, norm group and minority group, and the deviation of the minority group from the majority group in terms of the majority group standard deviation (Dev.)

                              Majority group    Norm group(a)    Minority group
Subtest                         M      SD         M      SD        M      SD     Dev.
Vocabulary                    31.66   9.91      32.6   10.1      22.03   9.13    0.97
Spelling                      53.13   8.46      55.6    8.8      52.64   8.92    0.05
Language Usage                24.41   6.87      25.3    6.8      17.40   5.58    1.02
Verbal Reasoning              13.88   5.37      16.1    6.8      11.23   3.89    0.49
Abstract Reasoning            29.63   8.78      31.5    8.1      22.44   9.34    0.82
Spatial Relations             26.23   9.42      29.1    9.9      20.13   8.58    0.65
Mechanical Reasoning          36.31   9.13      36.7    9.2      26.54   8.55    1.07
Numerical Ability             14.53   5.75      16.8    6.6      10.88   5.35    0.49
Clerical Speed and Accuracy   25.66   7.43      28.2    7.5      26.39   7.64   -0.10
g score                        0.86   3.21        —      —       -2.47   2.67    1.14

(a) For a description of the norm group, see text; norm group scores are reported to one decimal place in the DAT manual (Evers & Lucassen, 1992).

Dimensional Comparability

The model used in this study is Carroll's (1993) hierarchical model; the Broad Cognitive Abilities (Crystallised Intelligence, Fluid Intelligence and Broad Visual Perception) were allowed to covary. Subtest correlations are presented in Table III.

TABLE II. Loadings on the first unrotated principal axis factor of the subtests for the majority and minority group

                              Group membership
Subtest                       Majority   Minority
Vocabulary                      0.64       0.54
Spelling                        0.32       0.40
Language Usage                  0.66       0.59
Verbal Reasoning                0.57       0.39
Abstract Reasoning              0.63       0.81
Spatial Relations               0.70       0.71
Mechanical Reasoning            0.61       0.51
Numerical Ability               0.52       0.61
Clerical Speed and Accuracy     0.14       0.38

TABLE IV. Factor loadings of the subtests resulting from applying the data to Carroll's hierarchical model of intelligence, and congruence indices of the factor loadings

                                 Majority group            Minority group
Subtest                        Gf     Gcr     Gv         Gf     Gcr     Gv
Vocabulary                           0.699                     0.642
Spelling                             0.474                     0.378
Language Usage                       0.831                     0.844
Verbal Reasoning              0.330  0.257               0.414  0.251
Abstract Reasoning            0.755                      0.798
Spatial Relations                            0.882                     0.840
Mechanical Reasoning                         0.541                     0.606
Numerical Ability             0.624                      0.592
Clerical Speed and Accuracy                  0.177                     0.155
Congruence coefficient        0.998  0.997   0.999       0.998  0.997  0.999

Gf = Fluid Intelligence; Gcr = Crystallised Intelligence; Gv = Broad Visual Perception.

In order to fit our data to the Broad Abilities Gcr, Gf and Gv, the subtests which best represented these Broad Abilities were fixed, so that the other parameters could be estimated. For Gcr the best representative subtest was Vocabulary, for Gf it was Abstract Reasoning and for Gv Spatial Relations. The standardised loadings of the subtests on the Broad Cognitive Abilities are presented in Table IV. The fit of the model appeared to be non-optimal in both groups: CFI = 0.89, χ²(23) = 99.50, p < 0.001 in the majority group and CFI = 0.89, χ²(23) = 52.05, p < 0.001 in the minority group. Although the model's fit was not as good as we had expected, the dimensions of the groups were compared by applying models with an increasing degree of constraint. The first constraint tested was equality of covariance matrices; secondly, equality of the factor models was tested; and, finally, the subtests were tested for equal factor loadings. Table V shows the resulting comparative fit indices. The covariance matrices of the groups are highly equivalent: CFI = 0.97, χ²(45) = 68.10, p = 0.014. The test of equal factor models showed that the factor models of the two groups were only moderately equivalent, CFI = 0.89, χ²(48) = 161.98, p < 0.001. The same conclusion applies to the factor loadings, CFI = 0.88, χ²(53) = 169.64, p < 0.001.

TABLE V. Comparative fit indices resulting from the comparison of majority and minority group dimensions

Model                                  χ²        df    CFI
Subtest MR included
  Equality of covariance matrices     68.10*     45    0.97
  Equality of factor models          161.92**    48    0.89
  Equality of factor loadings        183.62**    53    0.88
Subtest MR excluded
  Equality of covariance matrices     54.12*     36    0.98
  Equality of factor models           83.71**    32    0.95
  Equality of factor loadings         68.19**    38    0.94

CFI = Comparative Fit Index; MR = Mechanical Reasoning; *p < 0.05; **p < 0.001.

TABLE VI. Means and standard deviations of the criterion measures for majority and minority group members and the deviation of the minority group from the majority group in terms of the majority group standard deviation (Dev.)

                    Majority group(a)    Minority group
Criteria              M       SD           M       SD      Dev.
Achievement tests
  Dutch             37.29    6.29        31.58    6.48     0.91
  Maths             24.29    5.85        22.88    6.42     0.24
Grades
  Biology           68.37   10.13        59.94   10.78     0.83
  Dutch             62.37   11.48        58.07   11.01     0.37
  English           63.91   13.66        62.75   14.99     0.08
  French            61.64   15.87        61.96   17.19    -0.00
  Geology           67.33   11.61        60.57   12.64     0.58
  History           66.14   10.79        63.97   11.53     0.20
  Maths             65.52   12.40        61.85   12.54     0.30

(a) N = 111 for the minority group; N = 231 for the majority group members.

With regard to the residual covariance matrices, it appeared that the subtest Mechanical Reasoning accounted for the moderate to low fit indices. The model fitted much better without this subtest: CFI = 0.96, χ²(16) = 39.09, p < 0.001 in the majority group and CFI = 0.94, χ²(16) = 28.98, p = 0.024 in the minority group. For this adjusted model, the same constraints were applied to the data. Table V shows that the covariance matrices of the groups remained highly equivalent, CFI = 0.98, χ²(36) = 54.12, p = 0.027; furthermore, the test of equal factor models showed that the factor models of the groups also became highly equivalent, CFI = 0.95, χ²(32) = 68.19, p < 0.001. The same applies to the factor loadings, CFI = 0.94, χ²(38) = 83.71, p < 0.001. Although there were significant differences in the test score dimensions that were compared, the high fit indices indicate that these differences were actually small. Additional analyses revealed that the loadings of the subtests on the Stratum II factors in the two groups are highly equivalent: the congruence coefficients φ computed for the loadings of the subtests on the Stratum II factors and for the separate Stratum II factors were all at least 0.99. Table IV illustrates that the factor loadings showed no substantial differences between the groups. The largest differences were found in tests with a language component, for example Spelling, Vocabulary and Verbal Reasoning, although the differences were never larger than 0.10. The subtest Verbal Reasoning loaded on both Fluid and Crystallised Intelligence; however, the pattern of loadings is highly similar in both groups. The loading of Mechanical Reasoning on Broad Visual Perception was almost the same in both groups, which indicates that the way this subtest disturbed the model fit was not the result of differences in its factor loading on Broad Visual Perception.
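The congruence coefficient used throughout these comparisons is Tucker's φ, the uncentred correlation of two loading vectors. A minimal sketch, applied to the first-factor loadings of Table II (the Stratum II loadings give the values of at least 0.99 reported above):

```python
import numpy as np

def congruence(x, y):
    """Tucker's congruence coefficient between two loading vectors:
    phi = sum(x*y) / sqrt(sum(x^2) * sum(y^2))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

# First unrotated factor loadings for the two groups (Table II)
majority = [0.64, 0.32, 0.66, 0.57, 0.63, 0.70, 0.61, 0.52, 0.14]
minority = [0.54, 0.40, 0.59, 0.39, 0.81, 0.71, 0.51, 0.61, 0.38]
print(round(congruence(majority, minority), 3))  # → 0.972
```

By the rule of thumb cited earlier (values above 0.85 are high), 0.972 indicates that the first factor has the same interpretation in both groups.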

Differential Prediction Table VI shows the means and standard deviations of the criterion scores for the majority and minority group. The means of the minority group are, on average, one-third of a standard deviation lower than the means of the majority group. The

Validity of the Differential Aptitude Test

109

TABLE VII. First step in the hierarchical regression procedure: are the criterion measures significantly predicted by g?

Criterion measure    R     R²    F        R (without Crystallised tests)**
Achievement test
  Dutch              0.62  0.38  191.17*  0.52
  Maths              0.49  0.24   98.36*  0.52
Grades
  Biology            0.65  0.42  238.27*  0.54
  Dutch              0.51  0.26  113.76*  0.39
  English            0.39  0.15   58.38*  0.25
  French             0.33  0.11   41.28*  0.30
  Geology            0.56  0.31  149.07*  0.50
  History            0.46  0.21   88.16*  0.38
  Maths              0.58  0.34  170.70*  0.60

*p < 0.01; **for explanation, see text.

standard deviations are very similar. The g scores of the participants were used as a predictor. We subjected the data to a step-down hierarchical regression procedure (Lautenschlager & Mendoza, 1986) to find out whether the criteria were differentially predicted by g for the majority and minority groups. The first step in this procedure was to test whether the criterion scores were significantly predicted by the g scores. Table VII shows highly significant results for each criterion. Without the Crystallised tests, the predictive validity decreased on average by 14%; for the criteria involving verbal abilities, predictive validity went down by as much as 36%. The second step was to look for bias, either in the intercept or in the slope of the regression lines. As Table VIII shows, different regression lines resulted in a significantly better prediction in five cases. The third step was the assessment of bias in slope, in intercept, or in both.

TABLE VIII. Second step in the hierarchical regression procedure: increase in the prediction if different regression equations for the majority and minority groups are assumed

Criterion measure    R² change  F
Achievement test
  Dutch              0.01       2.54
  Maths              0.02       4.92*
Grades
  Biology            0.01       1.32
  Dutch              0.01       1.69
  English            0.03       6.12**
  French             0.04       7.68**
  Geology            0.00       0.12
  History            0.02       4.80**
  Maths              0.03       7.49**

*p < 0.05; **p < 0.01.
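The second step — testing whether allowing each group its own regression line yields a significant R² increase — can be sketched as an incremental F test. This is an illustrative reimplementation on synthetic data, not the study's data or code; the variable names and the simulated effect are invented:

```python
import numpy as np

def r_squared(y, X):
    """R^2 from an OLS fit of y on design matrix X (intercept column included)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

def step_two_F(y, g, group):
    """F test for the R^2 increase when each group gets its own regression line
    (extra intercept and slope terms), as in the step-down procedure."""
    n = len(y)
    ones = np.ones(n)
    X_common = np.column_stack([ones, g])                    # one line for everyone
    X_separate = np.column_stack([ones, g, group, g * group])  # group-specific lines
    r2_c, r2_s = r_squared(y, X_common), r_squared(y, X_separate)
    df1, df2 = 2, n - 4                                      # 2 extra parameters
    F = ((r2_s - r2_c) / df1) / ((1.0 - r2_s) / df2)
    return r2_s - r2_c, F

# Synthetic data: the second group's criterion scores sit above the common
# line (pure intercept bias), so the separate-lines model should win.
rng = np.random.default_rng(0)
group = np.repeat([0.0, 1.0], 150)
g = rng.normal(size=300)
y = 0.5 * g + 0.4 * group + rng.normal(scale=0.8, size=300)
dr2, F = step_two_F(y, g, group)
print(f"R2 change = {dr2:.3f}, F(2, 296) = {F:.2f}")
```

A significant F here corresponds to the starred entries in Table VIII; steps such as Table IX then test the intercept and slope terms separately.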

110

J. te Nijenhuis et al.

TABLE IX. Third step in the hierarchical regression procedure: increase in prediction if slope or intercept bias is assumed

                     Slope bias            Intercept bias
Criterion measure    R² change  F          R² change  F
Achievement test
  Maths              0.00       0.00       0.02        8.18*
Grades
  English            0.01       2.78       0.03       12.22**
  French             0.00       1.46       0.03       15.03**
  History            0.00       0.37       0.02        8.95**
  Maths              0.00       1.76       0.03       14.79**

*p < 0.05; **p < 0.01.

Table IX shows that intercept bias was found in the prediction of the maths achievement test score, F(3,308) = 8.18, p < 0.05, and in the prediction of the grades for English, F(3,329) = 12.22, p < 0.05, French, F(3,332) = 15.03, p < 0.05, history, F(3,331) = 8.95, p < 0.05, and maths, F(3,332) = 14.79, p < 0.05. Slope bias was not found.

Spearman's Hypothesis Tested with DAT Scores

Prior to testing this hypothesis, we ensured that the above-mentioned methodological requirements of Jensen (1993) were met. The participating students were distributed over a wide range of educational levels; there was no selection based on g-loaded criteria. Table II shows that the subtests have sufficient variation in their g loadings.

Fig. 1. Regression of subtest group differences (effect sizes) on subtest g loadings.


TABLE X. Factor loadings of criteria on the first principal axis factor, correlation of the criteria with g score, and congruence indices

                         Factor loading       Correlation with g
Criterion measure        Minority  Majority   Minority  Majority   Mean r
Achievement tests
  Dutch                  0.60      0.54       0.53      0.54       0.53
  Maths                  0.70      0.59       0.47      0.53       0.50
Grades
  Biology                0.82      0.75       0.60      0.58       0.59
  Dutch                  0.83      0.81       0.50      0.49       0.49
  English                0.78      0.70       0.49      0.38       0.43
  French                 0.88      0.84       0.44      0.36       0.40
  Geology                0.87      0.74       0.45      0.55       0.50
  History                0.82      0.73       0.48      0.47       0.48
  Maths                  0.86      0.81       0.65      0.57       0.61
Congruence coefficient     0.999                 0.992

Fig. 2. Regression of criterion group differences (effect sizes) on criterion g loadings.

However, the number of subtests with low g loadings was rather small. To check for the third and the fourth criterion, we computed Tucker's φ for both the factor structure and the g loadings, which resulted in values of 0.99 and 0.97, respectively. Thus, the values of the two groups could be averaged to give an estimate of the subtests' g loadings. With all necessary conditions fulfilled, the correlation between Ves and Vg could be computed. Pearson's correlation (r) was 0.78, Spearman's correlation (rs) was 0.60, p = 0.086, and the regression formula was ES = 2.21g − 0.58. The regression line in Fig. 1 shows that the group differences are for a large part accounted for by the difference in g. Subtests above the regression line are relatively more difficult for minority group members than predicted by their g loading, whereas subtests under the regression line are easier.
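The regression line reported here is an ordinary least-squares fit of the vector of standardised group differences on the vector of g loadings. A small sketch: the g loadings below are hypothetical stand-ins (Table II is not reproduced here), and the effect sizes are generated exactly from the reported formula, purely to show how the line relates the two vectors:

```python
import numpy as np

# Hypothetical subtest g loadings (stand-ins, not the study's Table II values).
g_loadings = np.array([0.36, 0.48, 0.59, 0.63, 0.70])
# Standardised group differences generated from the reported line,
# ES = 2.21g - 0.58, so the fit below recovers that slope and intercept.
effect_sizes = 2.21 * g_loadings - 0.58

# Fitting a first-degree polynomial gives the regression slope and intercept.
slope, intercept = np.polyfit(g_loadings, effect_sizes, 1)
print(round(slope, 2), round(intercept, 2))
```

With real data the points scatter around the line, and the Pearson and Spearman correlations between the two vectors quantify how well g loadings track the group differences.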


Spearman's Hypothesis Tested with Criterion Scores

Jensen's (1993) first requirement was met in the same way as above. Table X shows the correlations of the criterion variables with g. It appears that the second requirement was not entirely met: criteria with low g loadings were underrepresented. Again, the third and the fourth requirement were fully met, with congruence indices φ of 0.99 for both the factor loadings and the g loadings. So, the values of the g loadings shown in Table X could be averaged to give an estimate of the g loadings of the criteria. The correlation between Ves and Vg was 0.65, rs = 0.73, p = 0.025, and the regression formula was ES = 3.08g − 1.16. The regression line in Fig. 2 shows that group differences in g were to a high extent, but not fully, responsible for group differences in criterion scores. Criteria above the regression line are relatively more difficult for minority group members than their g loading would suggest and criteria under the regression line are easier.
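Tucker's congruence coefficient used throughout is simply the cosine between two loading vectors. A minimal sketch, using the factor loadings of the seven school grades from Table X, which reproduces the reported value of 0.999:

```python
import numpy as np

def tucker_phi(x, y):
    """Tucker's congruence coefficient: the cosine between two loading vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

# Factor loadings of the seven school grades, taken from Table X.
minority = [0.82, 0.83, 0.78, 0.88, 0.87, 0.82, 0.86]
majority = [0.75, 0.81, 0.70, 0.84, 0.74, 0.73, 0.81]
print(round(tucker_phi(minority, majority), 3))  # reproduces the 0.999 in Table X
```

Unlike a Pearson correlation, φ is computed about the origin rather than the vector means, so it rewards proportionality of loadings, not just linear association.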

Discussion

The central question of this study addressed the validity of the DAT (Evers & Lucassen, 1992) for the assessment of minority group students. The results indicate that most of the DAT subtests are valid measures of the cognitive abilities of Dutch minority group students. The g score resulting from the DAT is a good predictor of school grades and scholastic achievement test scores. However, some criteria are slightly better predicted when differential prediction equations for the majority and the minority group are used instead of one overall equation. As these differences are small and, moreover, apply only to some of the criteria, professional test users can safely draw their conclusions from the DAT regardless of the ethnicity of the students. However, a careful test user should be cautious with the interpretation of the scores on the subtest Mechanical Reasoning when testing minority group students. The results from this study, combined with the outcomes of other studies, suggest that tests that are specifically designed for the assessment of immigrants will generally show a lower predictive validity than traditional tests that cover a broader range of abilities, such as the DAT. The conclusion that group differences in abilities found with the DAT are only to a small extent caused by test bias corroborates the findings of American and Dutch studies on test bias. It appears that cognitive tests can be used quite well, though they are not perfect, for the assessment of immigrants.

Dimensional Comparability

The question whether the DAT subtests measure the same cognitive abilities in the two groups was addressed by comparing the test score dimensions. The indices that were first computed when fitting the data to Carroll's (1993) model of cognitive abilities showed differences in the dimensions measured. This was, to a large extent, caused by the subtest Mechanical Reasoning, which was found to have a negative effect on the model fit of both groups. Furthermore, this subtest is more difficult for the minority group students than is expected from the g value of the test. Despite this difference in difficulty, the loading of Mechanical Reasoning on the factor Broad Visual Perception is the same for both groups. These are contradictory findings. The model without Mechanical Reasoning showed equivalent dimensions in the two groups. This indicates that the DAT subtests, except for Mechanical Reasoning, measure the cognitive abilities of minority group students just as well as those of majority group students. The loadings of the subtests on the broad cognitive abilities were similar, showing some discrepancies for tests involving language. The differences in factor loadings were, on average, 0.05 and never larger than 0.10. This is similar to the discrepancies found in studies on sex differences by Carretta and Ree (1995).

Differential Prediction

The question of whether students with the same test scores show the same criterion behaviour was addressed with a hierarchical regression analysis. The g scores were shown to be very strong predictors of school grades and performance on scholastic achievement tests. For several criteria, a differential prediction equation improved the prediction, though only to a small extent. Despite these small differences, it is of interest to emphasise that two of these criteria concern grades for foreign languages, namely French and English, and that in both cases a common regression line underestimates the scores of the minority group. This indicates that minority group members get better grades for these foreign languages than estimated by their g score. This may be the result of a greater familiarity of some minority group members with one or both of these languages: Moroccans often speak French, and Surinamese and Antillians often speak English as a second language. In those cases where differential prediction for school grades is found, the prediction line of the minority group is located above that of the majority group. Consequently, the use of a common regression line provides an underestimation of the minority group for English, French, history and mathematics. This may be due to bias in the criteria. Teachers may judge minority students on other grounds than purely their command of the subject; Van de Vijver and Willemse (1991) found that minority group grades reflect improvement of students rather than command of the subject. This constitutes an alternative explanation for the differential prediction of the grades for French and English as suggested before. The underestimation of the objective mathematics achievement test score of the minority group is a unique finding that contradicts the usual finding of overprediction of the criterion through intercept bias (Arvey & Faley, 1988). This effect may be due to the way the predictor was assembled, namely by deriving a g score from the subtests.

Spearman's Hypothesis

Strong relations were found between differences in subtest scores and the g loadings of the subtests, and between differences in educational achievement and its estimated cognitive complexity. This clearly makes g the predominant factor accounting for the differences in subtest and criterion scores. The fact that this relation was non-optimal



could be the result of: (a) different intelligence profiles, (b) sampling error in g values, (c) biasing factors and (d) measurement error. Most discussions about the use of IQ tests for immigrants state that the test scores of immigrants are negatively influenced by the assumed predominance of language components in these tests. However, the findings of this study strongly refute that theory. The mean scores on the four subtests with a language component, Language Usage (g = 0.63), Vocabulary (g = 0.59), Verbal Reasoning (g = 0.48) and Spelling (g = 0.36), can be predicted quite well from their g loadings. Language Usage and Vocabulary have similar g loadings and show about the same group differences; for Verbal Reasoning, the g loading and effect size take an intermediate position, whereas Spelling has the lowest g loading and shows the smallest group differences. It appears that language is only a vehicle for g (Jensen, 1992). This illustrates how the use of outdated intelligence taxonomies, such as Thurstone's Primary Mental Abilities (1938), in which the classification Verbal Abilities is used for tests that largely measure g, leads one astray.

Correspondence: Jan te Nijenhuis, Work and Organizational Psychology, Roetersstraat 15, 1018 WB Amsterdam, The Netherlands. Email: ao_tenijenhuis@macmail.psy.uva.nl

NOTE

[1] The first two authors contributed equally to this study; the order of the names is random.

REFERENCES

ARVEY, R.D. & FALEY, R.H. (1988) Fairness in Selecting Employees (New York, Addison-Wesley).
BENNETT, G.K., SEASHORE, H.G. & WESMAN, A.G. (1974) Manual for the Differential Aptitude Test Forms S and T (New York, The Psychological Corporation).
BENTLER, P.M. (1995) EQS, A Structural Equation Program Version 5.4 [Computer software] (Multivariate Software, Inc.).
BLEICHRODT, N. & VAN DEN BERG, R.H. (1997) Voorlopige Handleiding MCT-M. Multiculturele Capaciteiten Test Middelbaar niveau [Provisional Manual MCT-M. Multicultural Capacities Test Intermediate Level] (Amsterdam, Stichting NOA).
CARRETTA, T.R. & REE, M.J. (1995) Near identity of cognitive structure in sex and ethnic groups, Personality and Individual Differences, 19, pp. 149–155.
CARROLL, J.B. (1993) Human Cognitive Abilities: a survey of factor-analytic studies (New York, Cambridge University Press).
CATTELL, R.B. (1987) Intelligence: its structure, growth and action (Amsterdam, North-Holland).
DE JONG, M.J. & VAN BATENBURG, TH.A. (1984) Etnische herkomst, intelligentie en schoolkeuze-advies [Ethnic origin, intelligence, and school choice advice], Pedagogische Studiën, 61, pp. 362–371.
DUTCH MINISTRY OF EDUCATION, CULTURE AND SCIENCE [MINISTERIE VAN ONDERWIJS, CULTUUR & WETENSCHAPPEN] (1996) Voortgezet onderwijs in cijfers 1996 [Secondary education in figures 1996]. Internet: www.minocw.nl
EVERS, A. & LUCASSEN, W. (1992) Handleiding DAT '83 [DAT '83 Manual] (Lisse, Swets & Zeitlinger).
EVERS, A., VLIET-MULDER, J.C. & TER LAAK, J. (1992) Documentatie van tests en test research in Nederland [Documentation of tests and test research in the Netherlands] (Assen, Van Gorcum).
EVERS, A. & ZAAL, J.N. (1982) Trends in test use in the Netherlands, International Review of Applied Psychology, 31, pp. 35–53.
JENSEN, A.R. (1980) Bias in Mental Testing (London, Methuen).
JENSEN, A.R. (1992) Vehicles of g, Psychological Science, 3, pp. 275–278.
JENSEN, A.R. (1993) Spearman's hypothesis tested with chronometric information-processing tasks, Intelligence, 17, pp. 47–77.
JENSEN, A.R. (1998) The g Factor: the science of mental ability (Westport, Praeger).
JENSEN, A.R. & WENG, L.J. (1994) What is a good g?, Intelligence, 18, pp. 231–258.
KAGITCIBASI, C. & SAVASIR, I. (1988) Human abilities in the Eastern Mediterranean, in: S.H. IRVINE & J.W. BERRY (Eds) Human Abilities in Cultural Context, pp. 232–262 (New York, Cambridge University Press).
LAUTENSCHLAGER, G.J. & MENDOZA, J.L. (1986) A step-down hierarchical multiple regression analysis for examining hypotheses about test bias in prediction, Applied Psychological Measurement, 10, pp. 133–139.
MESSICK, S. (1989) Validity, in: R.L. LINN (Ed.) Educational Measurement, 3rd Edn (Washington, DC, American Council on Education/Macmillan).
NEISSER, U., BOODOO, G., BOUCHARD, T.J., BOYKIN, A.W., BRODY, N., CECI, S.J., HALPERN, D.F., LOEHLIN, J.C., PERLOFF, R., STERNBERG, R.J. & URBINA, S. (1996) Intelligence: knowns and unknowns, American Psychologist, 51, pp. 77–101.
PATTIPAWAE, C.F. & TAZELAAR, C.A. (1997) Met recht discriminatie bestrijden: een juridische handleiding bij de bestrijding van discriminatie op grond van ras en nationaliteit [Battling racism with the law: a manual to combat racism on grounds of race or nationality] (Utrecht, LBR).
RESING, W.C.M., BLEICHRODT, N. & DRENTH, P.J.D. (1986) Het gebruik van de RAKIT bij allochtoon etnische groepen [Use of the RAKIT for the assessment of immigrants], Nederlands Tijdschrift voor de Psychologie, 41, pp. 179–188.
STEVENS, J. (1996) Applied Multivariate Statistics for the Social Sciences (Mahwah, NJ, Lawrence Erlbaum Associates).
SUZUKI, L.A. & VALENCIA, R.R. (1997) Race-ethnicity and measured intelligence: educational implications, American Psychologist, 52, pp. 1103–1114.
TE NIJENHUIS, J. (1997) Comparability of test scores for immigrants and majority group members in the Netherlands, Ph.D. thesis, Vrije Universiteit, Amsterdam.
TE NIJENHUIS, J. & VAN DER FLIER, H. (1997) Comparability of GATB scores for immigrant and majority group members: some Dutch findings, Journal of Applied Psychology, 82, pp. 675–687.
THORNDIKE, R.L. (1985) The central role of general ability in prediction, Multivariate Behavioral Research, 20, pp. 241–254.
THURSTONE, L.L. (1938) Primary mental abilities, Psychometric Monographs, No. 1.
TUCKER, L.R. (1951) A Method for Synthesis of Factor Analysis Studies (Personnel Research Section Report No. 984) (Washington, DC, Department of the Army).
VAN DE VIJVER, F.J.R. & WILLEMSE, G.R. (1991) Are reaction time tasks better suited for cultural minorities than paper-and-pencil tests?, in: N. BLEICHRODT & P.J.D. DRENTH (Eds) Contemporary Issues in Cross-cultural Psychology, pp. 450–464 (Amsterdam, Swets & Zeitlinger).