Testing for Unidimensionality of GAT Data

Dimiter M. Dimitrov TR029-2013

The National Center for Assessment in Higher Education encourages its researchers to express their opinions freely and professionally. The opinions and suggestions in this paper therefore do not necessarily represent the view of the Center or its official position. All research published by the National Center for Assessment has been peer reviewed by specialists in the field of research.


Introduction: Testing the (uni)dimensionality of test data is critical for selecting an appropriate model for data analysis and validation in any measurement framework: classical true-score theory (CTT), item response theory (IRT), or confirmatory factor analysis (CFA). Rather than reviewing existing approaches to testing dimensionality, this task uses bifactor modeling in the CFA framework to investigate the dimensionality of data collected through the administration of the GAT assessment at the NCA. The rationale for this choice rests on arguments in the dimensionality literature that the bifactor model (a) provides an evaluation of the distortion that may occur when unidimensional models are fit to multidimensional data, (b) allows researchers to evaluate the utility of forming subscales, and (c) provides an alternative to nonhierarchical multidimensional models for scaling individual differences (e.g., Chen, West, & Sousa, 2006; Reise, Morizot, & Hays, 2007). The main question addressed with a bifactor model is whether the data are sufficiently unidimensional to apply a unidimensional IRT model without significant distortion in item parameters, or whether a multidimensional item response theory (MIRT) model is more appropriate. Also, if a unidimensional model is appropriate, are the domain-specific clusters of items substantial and reliable enough to allow for scoring the examinees on such domains? These questions are particularly relevant to NCA tests, which typically assume one general dimension (factor) that underlies the examinees’ performance on the test and several domain-specific subdomains.

Method

Data: This task investigates the dimensionality of data collected through the administration of a specific NCA test, namely the Arabic version of the General Aptitude Test (GAT). The test consists of 120 multiple-choice items, of which 68 items form the Verbal Section and 52 items form the Quantitative Section of the test. The data consisted of the binary (1/0) responses of 5,970 students on this test. The items in the Verbal Section are clustered into four content-specific domains: word meaning (13 items), sentence completion (16 items), analogy (17 items), and reading comprehension (22 items). The items in the Quantitative Section are clustered into six content-specific domains: analysis (3 items), geometry (8 items), arithmetic (26 items), logical reasoning (1 item), algebra (5 items), and math comparison (9 items).

Confirmatory Factor Analysis: Given the planned grouping of items into content-specific domains in each section (Verbal and Math) of GAT, a confirmatory factor analysis (CFA) was appropriate for testing the dimensionality of the GAT data used in this study. The CFA was conducted in the following steps:

1. Nine separate CFA procedures were used to test for unidimensional coherence of the items in each content-specific domain in the verbal and mathematics sections of GAT (as the logical reasoning domain consists of one item, there was no CFA on this “domain”). The purpose of these analyses was to support the meaning of scoring student performance by content-specific domains. Each CFA was conducted using the computer program Mplus (Muthén & Muthén, 2008) with the WLSMV estimator for categorical indicators; that is, the item scores (1/0) were interpreted as categorical data to avoid the problems associated with treating such data as continuous and using the maximum likelihood (ML) estimator. The examination of the CFA results for the domain “geometry” from the math section led to the exclusion of one item (the fourth item in this domain) from subsequent analyses due to poor fit. Likewise, two items from the math domain “math comparison” (the 4th and 5th items in this domain) were excluded from subsequent analyses due to poor fit. Thus, the math section was reduced from 52 to 49 items in subsequent CFA analyses.

2. A one-factor CFA was conducted with the 68 items in the verbal section of GAT used as categorical indicators of a single factor, verbal skills.

3. A one-factor CFA was conducted with the 49 items in the math section of GAT used as categorical indicators of a single factor, math skills.

4. A two-factor CFA was conducted with the two factors being verbal and math skills. Due to the increased complexity of this CFA model, with 68 binary indicators for the verbal factor and 49 binary indicators for the math factor, the domain-specific items were treated as “parcels,” so the average score on the items in a parcel was used as an indicator of the respective verbal or math factor. In this way, the verbal factor was related to four indicators: the mean scores of the items in each of the four content domains in the verbal section of GAT. Likewise, the math factor was related to five indicators: the mean scores of the items in each of the five content domains in the math section of GAT (e.g., see Kishton & Widaman, 1994; Yuan, Bentler, & Kano, 1997, for benefits and cautions with parceling items in CFA). As the parcel scores are treated as continuous variables, a maximum likelihood (ML) estimator was used with the CFA in this case.

5. A bifactor CFA was conducted, treating the verbal and math parts of GAT as specific aspects of a general aptitude (GA). The purpose of this CFA was to justify the use of unidimensional IRT calibration of GAT data, using a total GAT score and separate scores on its verbal and math sections.

Results

The results from the five-step CFAs described above are summarized in Table 1. The goodness-of-fit of each CFA model is tested with three indices: the comparative fit index (CFI), the Tucker-Lewis Index (TLI), and the Root Mean Square Error of Approximation (RMSEA). For the purposes of this study, CFI > .90, TLI > .90, and RMSEA < .05 were used jointly as an indication of adequate data fit of the respective CFA model. The chi-square test values are provided in Table 1 only as descriptive information, as they were all statistically significant due to the very large sample size in this study, N = 5,969 (e.g., see Dimitrov, 2012, p. 105).
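The joint fit criteria stated above (CFI > .90, TLI > .90, RMSEA < .05) can be written as a simple check. The following is a minimal sketch, not part of the original analysis: the function name `adequate_fit` is hypothetical, and the example values are the bifactor and one-factor (9 parcels) rows reported in Table 1.

```python
def adequate_fit(cfi, tli, rmsea, cfi_min=0.90, tli_min=0.90, rmsea_max=0.05):
    """Return True only if all three cutoffs used in this study are met jointly."""
    return cfi > cfi_min and tli > tli_min and rmsea < rmsea_max


# Values taken from Table 1 of the report.
bifactor_ok = adequate_fit(cfi=0.997, tli=0.994, rmsea=0.023)
one_factor_parcels_ok = adequate_fit(cfi=0.945, tli=0.926, rmsea=0.079)

print(bifactor_ok)            # True
print(one_factor_parcels_ok)  # False (RMSEA exceeds .05)
```

The cutoffs are keyword arguments so that stricter conventions could be substituted without changing the logic.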

Table 1: CFA on GAT Data by Verbal and Mathematics Parts and Their Content-Specific Domains

Scale / Subdomains                                      Chi-square     df     CFI     TLI   RMSEA

VERBAL SECTION DOMAINS
  Word Meaning (13 items)                                  382.772     65   0.960   0.952   0.029
  Sentence Completion (16 items)                           539.037    104   0.957   0.951   0.026
  Analogy (17 items)                                       677.427    119   0.985   0.982   0.028
  Reading Comprehension (22 items)                         658.392    209   0.962   0.957   0.019

MATH SECTION DOMAINS
  Arithmetic (26 items)                                    960.259    299   0.952   0.948   0.019
  Geometry (8 items)                                        47.115     14   0.917   0.875   0.020
  Algebra (5 items)                                          5.330      5   0.999   0.998   0.003
  Analysis (3 items) [just-identified]                       0.000      0   1.000   1.000   0.000
  Math Comparison (9 items)                                 82.667     14   0.918   0.877   0.029

VERBAL SECTION: One-factor model (68 items)               8723.574   2210   0.948   0.946   0.022
MATH SECTION: One-factor model (49 items)                 3017.436   1127   0.939   0.937   0.017
VERBAL & MATH SECTIONS: Two-factor model (9 parcels)       552.897     26   0.971   0.960   0.058
GAT (VERBAL & MATH): One-factor model (9 parcels)         1038.094     27   0.945   0.926   0.079
GAT: Bifactor model (9 parcels)                             75.279     18   0.997   0.994   0.023
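The degrees of freedom in Table 1 can be cross-checked against the number of indicators in each model. The sketch below is an illustration under stated assumptions rather than the Mplus computation itself: it assumes a standard parameterization (factor variances fixed at 1, uncorrelated general and specific factors in the bifactor model), and the function names are hypothetical. For the binary-item WLSMV models the counting differs in detail (thresholds and tetrachoric correlations instead of variances and covariances), but the one-factor result, p(p - 3)/2, is the same. Geometry and math comparison are entered with 7 items each, and the math section with 49 items, reflecting the item exclusions described in step 1.

```python
def cfa_df(n_indicators, n_free_params):
    """Model df = unique variances/covariances minus freely estimated parameters."""
    moments = n_indicators * (n_indicators + 1) // 2
    return moments - n_free_params


def one_factor_df(p):
    # p loadings + p residual variances (factor variance assumed fixed at 1)
    return cfa_df(p, 2 * p)


def two_factor_df(p):
    # p loadings + p residual variances + 1 factor correlation
    return cfa_df(p, 2 * p + 1)


def bifactor_df(p):
    # p general loadings + p specific loadings + p residual variances,
    # with all factors assumed uncorrelated
    return cfa_df(p, 3 * p)


# Number of indicators -> df reported in Table 1 for the one-factor models.
reported_one_factor = {13: 65, 16: 104, 17: 119, 22: 209, 26: 299,
                       7: 14, 5: 5, 3: 0, 68: 2210, 49: 1127, 9: 27}

all_match = all(one_factor_df(p) == df for p, df in reported_one_factor.items())
print(all_match)          # True
print(two_factor_df(9))   # 26
print(bifactor_df(9))     # 18
```

Under these assumptions every df value in Table 1 is reproduced, which is consistent with 13 word-meaning items and 49 math items having entered the analyses.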

Interpretation: The values of the goodness-of-fit indices (CFI, TLI, and RMSEA) for the CFA by content-specific domains in both the verbal and math sections of GAT indicate an adequate data fit of the respective nine CFA models. Thus, it is meaningful to score student performance on each of these domains, as they are coherently represented by their respective items.

The results for the one-factor models, fit separately for the verbal and math sections of GAT, indicate that the 68 items in the verbal section coherently measure verbal skills and, likewise, the 49 items in the math section coherently measure the math skills of the students. Thus, reporting verbal and math scores is meaningful. Further, a very good data fit is indicated for the two-factor model, where the verbal factor is measured by four parcels of items and the math factor is measured by five parcels of items; that is, there are nine parcel indicators in this CFA model. The correlation between the two latent (verbal and math) factors in this model was found to be quite high (.835), thus indicating the existence of a general factor (general aptitude) represented by its verbal and mathematics aspects. In support of this expectation is the excellent fit of the bifactor model (CFI = .997, TLI = .994, and RMSEA = .023). Graphically, this bifactor model is depicted in Figure 1, where VP stands for “Verbal Parcel” and MP for “Math Parcel.”

Conclusion: There is evidence of a general unidimensionality of the GAT data used in this study, so it is appropriate to calibrate the data in the framework of unidimensional IRT models. Also, the verbal and mathematics parts of GAT are coherently measured by the items associated with them. Finally, the content-specific domains (four in the verbal part and five in the mathematics part) in GAT are also adequately measured by their respective items. These findings imply that it is meaningful to score the students’ performance on the total test, on each of its verbal and math parts, and on each content-specific domain, for example to give more refined feedback about the performance profiles of students on GAT.

Figure 1
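The parcel indicators (VP and MP) used in the two-factor and bifactor models are mean item scores within each content domain. A minimal sketch of how such parcel scores could be computed from a binary response matrix is given here; the function name `parcel_scores`, the toy responses, and the column-to-domain mapping are hypothetical and are not taken from the GAT data.

```python
def parcel_scores(responses, domains):
    """Compute the mean item score per domain (parcel) for each examinee.

    responses: list of rows, each a list of 0/1 item scores
    domains:   dict mapping a parcel name to the column indices of its items
    """
    return [
        {name: sum(row[j] for j in cols) / len(cols)
         for name, cols in domains.items()}
        for row in responses
    ]


# Hypothetical toy data: 3 examinees, 5 items split into two made-up parcels.
toy_responses = [
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
]
toy_domains = {"VP1": [0, 1, 2], "MP1": [3, 4]}

parcels = parcel_scores(toy_responses, toy_domains)
print(parcels[1])  # {'VP1': 1.0, 'MP1': 0.0}
```

Because each parcel is an average of several binary items, its scores take many more distinct values than a single item does, which is why the report treats parcels as continuous indicators and uses the ML estimator for the parcel-based models.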
