Why the Major Field (Business) Test Does Not Report Subscores of Individual Test-takers—Reliability and Construct Validity Evidence

Guangming Ling
ETS, Princeton, NJ

Paper presented at the annual meeting of the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME) April 13-17, 2009, San Diego, CA.

Unpublished Work Copyright © 2009 by Educational Testing Service. All Rights Reserved. These materials are an unpublished, proprietary work of ETS. Any limited distribution shall not constitute publication. This work may not be reproduced or distributed to third parties without ETS's prior written consent. Submit all requests through www.ets.org/legal/index.html. Educational Testing Service, ETS, the ETS logo, and Listening. Learning. Leading. are registered trademarks of Educational Testing Service (ETS).

Abstract

The current study evaluated whether to report subscores of the Major Field Test in Business (MFT Business) for individual test-takers by analyzing the subscores' reliabilities and the internal structure of the test. The reliability analysis found that, for each individual student, the observed subscores did not contribute statistically meaningful information beyond the total score of the test. In addition, analysis of the internal structure of the MFT Business found a unidimensional construct, which also did not support reporting subscores for each individual student. The relationship between the two analyses is also discussed, and an alternative method is recommended for future research. The study concluded that the MFT Business should not report subscores of individual students.


Introduction

Reporting scores on the subscales (or sub-domains) of a test may provide test-takers and test users with better knowledge of test performance on a sub-domain or a subset of items, especially when the sub-domains of a test vary by content or underlying construct. For example, an exit test for undergraduate business majors may contain items related to different aspects of business knowledge, including knowledge of accounting, economics, management, quantitative and information systems, finance, marketing, and the legal and social environment. In addition to the total test score, students or teachers may also be interested in knowing examinees' competence in each aspect of the curriculum. In recent years, there have been increasing demands for the reporting of subscores, especially subscores of individuals. In their review, Goodman and Hambleton (2004) found that all of the testing programs reviewed, including two Canadian provinces, eleven U.S. states, and three U.S. testing companies, provided some information at the sub-domain level (i.e., subscores of individual students).

Not surprisingly, test-takers and score users of the Major Field Test in Business (MFT Business) are also requesting more information from the test, including subscores of individual students in addition to the total test score. The MFT Business is a comprehensive outcomes assessment of the basic, critical knowledge obtained by students in a business major (for an associate, bachelor's, or MBA degree). Students typically take the Major Field Tests after they successfully complete the major's required courses. Individuals' total test scores are reported on a scale of 120–200 (the MBA test has a scale of 220–300). The MFT Business, like the MFTs for other majors, also reports subscores (on a scale of 20–100) at aggregate levels (e.g., the average subscore of a class or a program) in subfields of the discipline (ETS, 2008). The subscores of the MFT Business at the institutional level are used to indicate mastery of business knowledge by a group (i.e., a class or a program) rather than by an individual student. However, the MFT Business does not report subscores of individual students.

In order to report subscores to individual students¹, several steps are required during the test development procedure. For example, it is necessary to ensure that each content or domain area is well and equally (or proportionally) represented, and that each subscale has similar and sound psychometric properties (e.g., equally or comparably high reliabilities). When a test was not designed to report subscores of individual students, a post hoc evaluation is necessary, and different methods and perspectives need to be considered. There has been debate on whether to report subscores and under what conditions. Ferrara and DeMauro (2007) recommended that a subscore be reported if it has a high reliability (i.e., an internal consistency reliability estimate of .85 or higher) and does not correlate highly with other subscores; a subscore with a low reliability should not be reported. On the other hand, subscores for different content areas, sub-domains, etc., are in great demand by stakeholders (Haladyna & Kramer, 2004) regardless of the original purpose for which the test was developed. Test-takers want to know about their strengths and weaknesses in different content areas for future improvement; teachers, deans, and institutions want to know test performance in various sub-areas to make necessary improvements to programs' curricula.

¹ In this paper, subscore, if not specified otherwise, refers to the subscore of an individual student.


The internal structure of the MFT Business may also need to be considered in addition to the subscores' reliabilities. The internal structure of the test could provide information such as what sub-constructs or sub-contents are implied in the test design, how they are measured, and what their inter-relationships are. Understanding the internal structure would help in interpreting the test scores as well as the subscores. For example, if the MFT Business presented a multidimensional structure, that finding might support reporting individuals' subscores on each dimension.

The current study aimed to address three research questions:

1. Do subscores of individual students add information to the total score statistically?
2. What is the internal structure of the MFT Business? Does the internal structure support the reporting of content-related subscores of individual students?
3. Are the answers to these two questions consistent with one another?

Methods

Instrument and Data

The data came from the MFT Business test administered between 2002 and 2006. A total of 155,921 students took the test during this period. Students with incomplete records were excluded from the analysis, leaving a final sample of 155,235 students. The test included 118 multiple-choice items from seven sub-domains of business². The test was divided into two 60-item sections when administered. The items came from seven content areas: accounting, economics, management, quantitative business analysis & information systems, finance, marketing, and legal & social environment, labeled S1 to S7 respectively in the current paper. The number of items per subscale varied from 12 to 21. Each item was scored one if correct and zero if incorrect. In addition to the total score, subscores of a class or a program for each of the seven content areas were reported for the MFT Business. Table 1 displays the descriptive statistics for each subscale. The total test had an acceptable reliability of .89 in terms of both Cronbach's alpha and KR-20 (Kuder & Richardson, 1937) for binary items.

Table 1. Subscales and relevant statistics

                          Acount.  Econ.  Mgmt.  Quant. &    Finance  Mrkt.  Leg. & Soc.  Total
                          (S1)     (S2)   (S3)   Info. (S4)  (S5)     (S6)   Envrn. (S7)  test
# of items                21       20     19     19          13       14     12           118
Mean                      9.37     8.57   10.95  10.79       4.75     6.48   6.02         56.92
SD                        3.50     3.22   3.30   3.27        2.28     2.27   2.16         14.98
Correlation with total    .78      .78    .78    .81         .69      .66    .67
Cronbach's alpha          .64      .60    .64    .65         .53      .44    .43          .89
KR-20*                    .64      .59    .64    .65         .53      .44    .41          .89
Expected reliability**    .57      .55    .54    .54         .45      .47    .43

* KR-20 is the Kuder-Richardson formula 20 (Kuder & Richardson, 1937) reliability for tests or subscales consisting of binary item responses.
** The expected reliability is suggested by Wainer et al. (2001) and is computed from the Spearman-Brown prophecy formula, given the total test reliability, the number of items in the subscale, and the number of items in the total test. Please refer to the Results section for the computation method.

² The full test includes 120 items; two items (items 6 and 18 in the second section) were excluded from the scoring and equating procedures and are therefore also excluded from this study.
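As a concrete illustration of the reliability estimates reported in Table 1, the following is a minimal sketch of the KR-20 computation for binary item scores. It is not the code used in the study; the `scores` array and the column ranges in the usage comments are hypothetical.

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 reliability for a (persons x items) matrix of 0/1 item scores.

    For binary items, KR-20 is algebraically equivalent to Cronbach's alpha.
    """
    n = items.shape[1]
    p = items.mean(axis=0)                       # proportion correct per item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (n / (n - 1)) * (1.0 - (p * (1 - p)).sum() / total_var)

# Hypothetical usage: scores would be a 155235 x 118 array of 0/1 responses,
# with columns 0-20 holding the 21 accounting (S1) items.
# kr20(scores)          # total test: reported as .89 in Table 1
# kr20(scores[:, :21])  # S1 subscale: reported as .64 in Table 1
```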


Methods and Analyses

Wainer et al. (2001) suggested using an augmented score (borrowing strength from the other items in the test to compute a given student's subscore) to improve the subscore's reliability. Wainer's augmented scores can be treated as a special case of the several indices suggested by Haberman (2005; Haberman, Sinharay, & Puhan, 2006; Sinharay, Haberman, & Puhan, 2006). Haberman suggested that a reliability-based analysis may help inform the decision about subscore reporting. Haberman's approach compares the mean squared error (MSE) of the true subscore when it is predicted (or approximated) by the observed subscale score, by the observed total test score, and by the two scores jointly. The rationale is that if the MSE based on the observed subscore is smaller than the MSEs implied by the other models or estimations, then the observed subscore contributes something unique and substantial beyond the total score. Otherwise, it provides only statistically redundant information, since the same level of information could have been obtained from the observed total test score with less error.

Descriptive analyses of item scores, subscores, and total test scores were conducted. Moreover, the reliability of each subscale and of the total test, the correlations between subscores, and the correlations between each subscore and the total test score were analyzed and compared. Following the approach of Haberman (2005) and his colleagues, MSEs of the true subscores were computed and compared for each of the seven subscales.
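The logic of this comparison can be sketched in a few lines under classical test theory (CTT). The sketch below is our own reconstruction of the idea, not the study's code: the function `prmse_comparison`, its arguments, and the error-covariance simplification (which assumes the subscale items are a subset of the total-test items) are our illustration; see Haberman (2005) for the exact estimators.

```python
import numpy as np

def prmse_comparison(sub: np.ndarray, total: np.ndarray, rho_sub: float):
    """Compare how well the true subscore is recovered from the observed
    subscore versus from the observed total score (after Haberman, 2005).

    sub, total : observed subscore and total score for each examinee
    rho_sub    : reliability of the subscore (e.g., KR-20 from Table 1)

    Returns the proportional reduction in MSE (PRMSE) for each predictor;
    a larger PRMSE means a smaller MSE for the true subscore. The subscore
    adds value only if its PRMSE exceeds that of the total score.
    """
    var_s = sub.var(ddof=1)
    var_x = total.var(ddof=1)
    cov_sx = np.cov(sub, total, ddof=1)[0, 1]
    var_true = rho_sub * var_s                        # Var(S_t) under CTT
    # Because the subscale items are part of the total test, the subscore's
    # measurement error is a component of the total score's error term:
    cov_true_x = cov_sx - (1.0 - rho_sub) * var_s     # Cov(S_t, X)
    prmse_sub = rho_sub                               # corr(S_t, S)^2
    prmse_tot = cov_true_x ** 2 / (var_true * var_x)  # corr(S_t, X)^2
    return prmse_sub, prmse_tot
```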

It should be noted that the reliability-based approach does not take into account the internal structure of the test. In many cases, the dimensionality of a test may influence the interpretation of the subscores and affect the decision about reporting subscores of individual students as well. Analyzing the internal structure (or dimensionality) of the test and subscales may therefore provide evidence beyond the reliability analysis. Factor analysis and structural equation modeling were applied to evaluate the internal structure of the MFT Business.

Traditional factor analysis extracts factors from a set of continuous observed variables by assuming that a subset of variables represents a latent factor (or construct). However, such an approach may lead to biased estimation of the latent construct and other parameters when the observed variables are categorical or binary. More contemporary factor analysis methods, designed for categorical or binary variables (i.e., item responses in true/false form or on a Likert scale), were developed over the last three decades. Woods (2002) summarized the differences between traditional factor analysis of continuous observed variables and contemporary factor analysis of categorical (binary) variables. He suggested that applying traditional factor analysis to binary item scores can lead to biased estimates of the standard errors, biased significance tests, overestimation of the number of factors, and underestimation of the factor loadings.

Two general methods have been developed to link item responses (categorical or binary) to the underlying latent factor or construct: the probit linking function (the limited-information approach; Jöreskog, 1999) and the logit linking function (the full-information approach; Bock & Aitkin, 1981). Models using the probit linking function fall largely under the framework of structural equation modeling (SEM), first developed by Jöreskog and later extended by Muthén and others (MPLUS, Muthén & Muthén, 1998; LISREL, Jöreskog, 1999; EQS, Bentler, 2001). Such a model is typically fitted to the tetrachoric/polychoric correlation matrix and is generally called limited-information factor analysis (LIFA). On the other hand, the logit approach is used in logistic IRT models, known as item factor analysis or full³-information item factor analysis (FIFA; Bock & Aitkin, 1981; Bock, Gibbons, & Muraki, 1988; Takane & de Leeuw, 1987; Wood, Wilson, Gibbons, Schilling, Muraki, & Bock, 1991). Both approaches assume that a continuous, normally distributed latent variable underlies the observed dichotomous or categorical responses. TESTFACT (Wilson, Wood, & Gibbons, 1991) is a factor analysis program typically used for binary-scored items, based on tetrachoric correlations and full-information item factor analysis. It computes maximum likelihood estimates of the parameters via the expectation-maximization algorithm (Bock & Aitkin, 1981; Bock & Lieberman, 1970; Dempster, Laird, & Rubin, 1977), which provides a significance test for a change in the number of factors. Other analytic tools using the logit approach include MULTILOG (Thissen, 1991) and ConQuest (Wu, Adams, & Wilson, 1998; Wang, 1995; Adams, Wilson, & Wang, 1997).

However, the FIFA approach requires a large sample size relative to the number of items (2ⁿ for n binary items) to obtain a reasonable frequency for all possible response patterns. For example, when there are only 2 binary items, there are four (2²) possible patterns: 00, 01, 10, and 11. The sample size should be at least four in order for all possible patterns to occur (du Toit, 2003).

³ "Full" means using all of the item-level response information, in contrast to analyses based only on the correlation/covariance matrix.

In order for all possible unique patterns of the 118 binary items to occur, the sample size would need to be at least 2¹¹⁸. Unfortunately, the sample size in this study was only 155,235 (between 2¹⁷ and 2¹⁸), far smaller than required. Thus the FIFA approach was not considered in this study; the major concern was that the sample size was not large enough relative to the number of items (118).

Two levels of exploratory analysis were conducted to evaluate the internal structure of the test: one based on the variance-covariance matrix of the observed subscores, and the other based on the tetrachoric correlation matrix of the item scores. Traditional factor analysis (both principal component analysis and exploratory factor analysis with maximum likelihood estimation) was performed on the variance-covariance matrix of the seven subscale scores (treated as continuous variables). Confirmatory factor analysis was then performed on the variance-covariance matrix of the subscale scores. The more contemporary factor analysis (both exploratory and confirmatory) was performed on the binary item scores (the tetrachoric correlation matrix). The purpose was to examine three questions: how well each subscale was measured by the subset of items related to a sub-domain of business knowledge; how the general knowledge and skills of business majors were measured by the 118 items; and how the seven subscales related to one another.

Four models were examined based on the item scores. First, a single-factor measurement model was fit to the data for all items, examining whether all 118 items measured a single factor. Second, a measurement model was examined for each subscale based on its binary item scores, to examine how the items of the same sub-domain performed when set to measure a common latent construct. Third, a full structural equation model, including all seven subscales, was fit to the data, with all the subscale-related latent variables intercorrelated. The fourth model had the same seven subscales as the third, with a second-order common factor extracted from the seven latent variables. All four models were based on the probit linking function between the binary item scores and the underlying latent variables. The analyses were based on the tetrachoric correlation matrix of the binary scores, applying the diagonally weighted least squares (DWLS) estimation method in LISREL 8.8 (Jöreskog & Sörbom, 1999).
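The study computed tetrachoric correlations with LISREL; to make the probit-linking idea concrete, here is a minimal, scipy-based sketch of how a single tetrachoric correlation can be estimated. The function name and approach are our illustration, not the study's implementation, and the root-finding step assumes the observed joint proportion lies strictly inside the bounds implied by r = ±1.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

def tetrachoric(x: np.ndarray, y: np.ndarray) -> float:
    """Tetrachoric correlation of two binary items: the correlation of the
    bivariate-normal latent variables assumed to underlie the 0/1 scores."""
    tau_x = norm.ppf(1.0 - x.mean())     # threshold from proportion correct
    tau_y = norm.ppf(1.0 - y.mean())
    p11 = np.mean((x == 1) & (y == 1))   # observed "both correct" proportion

    def gap(r: float) -> float:
        bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, r], [r, 1.0]])
        # P(Z1 > tau_x, Z2 > tau_y) equals P(Z1 < -tau_x, Z2 < -tau_y)
        # by the symmetry of the centered bivariate normal.
        return bvn.cdf([-tau_x, -tau_y]) - p11

    return brentq(gap, -0.999, 0.999)    # the r that reproduces p11
```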

Results

Results of Descriptive Analysis and Reliability Analysis

Descriptive results showed that the reliability of each subscale (Cronbach's alpha coefficient) ranged from .43 to .65, much lower than the total test reliability (.89). These subscale reliabilities would be considered low against the commonly accepted value of .70 in psychological and educational measurement (Nunnally & Bernstein, 1994, p. 265). Similarly low reliabilities were found for each subscale using the KR-20 formula (Kuder & Richardson, 1937). The expected reliability⁴ of each subscale was also computed using the Spearman-Brown formula, given the total test reliability (Nunnally & Bernstein, 1994).

⁴ The expected reliability $\rho_s$ of a given subscale is calculated from the Spearman-Brown formula, given the number of items in the total test ($K$), the number of items in the subscale ($k$), and the total test reliability $\rho$:

$$\rho = \frac{(K/k)\,\rho_s}{1 + (K/k - 1)\,\rho_s}$$

(see Nunnally & Bernstein, 1994; Wainer et al., 2001).
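Solving the footnoted formula for $\rho_s$ gives the expected subscale reliability directly. The small sketch below is our illustration (the function name is ours, not from the paper); values computed this way land close to, though not always exactly on, the Table 1 entries, which likely reflects rounding in the reported figures.

```python
def expected_subscale_reliability(rho_total: float, k: int, K: int) -> float:
    """Spearman-Brown formula solved for rho_s: the reliability a k-item
    subscale is expected to have, given a K-item total test with
    reliability rho_total."""
    n = K / k                                    # shortening factor K/k
    return rho_total / (n - (n - 1.0) * rho_total)

# e.g., the 21-item accounting subscale of the 118-item test (rho = .89):
# expected_subscale_reliability(0.89, 21, 118)  # ~.59; Table 1 reports .57
```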

The expected reliabilities were generally lower than the actual reliabilities across the seven subscales, which supports a unidimensional model for the test (Wainer et al., 2001).

Following the approach of Haberman and his colleagues (Haberman, 2005; Sinharay, Haberman, & Puhan, 2006), four sets of MSEs were computed and compared with the MSE of the true subscore when estimated from the observed subscore, SD(Se). The four sets of mean squared errors were based on the regression of the true subscore on the observed total score, SD(L-St); on the observed total score and the subscore together, SD(M-St); on Kelley's estimate of the true subscore, SD(K-St); and on the approximated true residual of the subscore, SD(F-Dt). If the subscore provided substantial information in addition to the total test score, SD(Se) would be expected to be the smallest. However, results showed that the MSE of the true subscale score estimated from the observed subscale score was consistently greater than those estimated from the observed total score, from the combination of the subscale score and the total score, and from the approximated true MSE (see Table 2). Using the first subscale, S1-accounting, as an example: the SD(Se) for S1 was 2.09, the standard error of the true subscore of S1 when predicted from the observed subscore, while the standard error based on the observed total score, SD(L-St), was only 1.38 (see Table 2). This suggests that predicting subscale S1's true score from the observed subscore produces more error than predicting it from the observed total score. Similarly, the MSEs of the other predictors and approximation methods were all smaller than that of the observed S1 subscore. The same pattern held for the other six subscales (see Table 2): the true subscore's MSE was greater when predicted by the observed subscore than by any other predictor. These results suggest that the observed subscore of a given student provides similar, if not redundant, information relative to the total test score, but at a less reliable level.

Table 2. Mean squared error comparisons using Haberman's (2005) approach

Subscale     S1     S2     S3     S4     S5     S6     S7
# of items   21     20     19     19     13     14     12
SD(Se)       2.09   2.05   1.99   1.93   1.56   1.70   1.66
SD(K-St)     1.67   1.58   1.59   1.56   1.14   1.12   1.06
SD(L-St)     1.38   1.06   1.25   1.06   0.86   0.64   0.58
SD(M-St)     1.53   1.39   1.44   1.40   0.97   0.88   0.83
SD(F-Dt)     1.67   1.58   1.59   1.56   1.14   1.12   1.06

Note: S stands for the observed subscale score; the subscript t denotes the true score, and the subscript e denotes the measurement error in classical test theory; E( ) stands for the expected value or average; SD( ) stands for the standard deviation; Dt is the approximated true residual defined by Haberman (2005). SD(Se) is the mean standard deviation of the measurement error (the regression residual when using the observed subscale score to predict the subscale true score St); SD(K-St) is the mean squared error of the regression when using K to predict the subscale true score St (K is the Kelley approximation of St); SD(L-St) is the mean squared error of the regression when using the observed total test score to predict the subscale true score St (L is the predicted St from the observed total score); SD(M-St) is the mean squared error of the regression when using the observed subscale score S and the observed total test score X together to predict the subscale true score St; SD(F-Dt) is the mean squared error of the regression when using F to predict the true residual Dt (F is an approximation of the true residual Dt; see Haberman, 2005).

Results of Subscale Score Factor Analysis

Exploratory factor analysis (EFA) was conducted on the variance-covariance matrix of the seven observed subscores (number-correct raw scores), using SAS 9.1 (SAS Institute, 2002). The scree plot in Figure 1 suggests that one or two dimensions may represent the seven observed subscores at an acceptable level. A clear elbow was present at the second factor, with 54% of the total variance explained by the first factor and 65% by the first two factors.

[Scree plot omitted: the first eigenvalue is approximately 3.0; the remaining six eigenvalues fall below 1.]

Figure 1. Scree plot of eigenvalues
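For readers who want to reproduce the scree analysis, the following sketch computes the eigenvalues and explained-variance shares from the subscore covariance matrix. It is our illustration, not the SAS code used in the study; `subscores` is a hypothetical persons-by-7 array of raw subscale scores.

```python
import numpy as np

def scree_eigenvalues(subscores: np.ndarray) -> np.ndarray:
    """Eigenvalues of the subscore covariance matrix, in descending order,
    as plotted in a scree plot."""
    cov = np.cov(subscores, rowvar=False)
    return np.linalg.eigvalsh(cov)[::-1]

# eig = scree_eigenvalues(subscores)
# eig / eig.sum() gives each component's share of total variance; the text
# reports 54% for the first component and 65% for the first two combined.
```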

The factor loadings for the one-factor and two-factor models, estimated by maximum likelihood, are presented and compared in Table 3. The subscales' loadings on the single factor were relatively high, ranging from .61 to .76. For the two-factor structure (after PROMAX rotation), five subscales (S2, S3, S4, S6, and S7) had a substantial loading on the first factor, while the remaining two subscales (S1 and S5), together with subscales S2 and S4, had a substantial loading on the second factor. The two factors were not clearly interpretable: the second factor seemed to be related to quantitative knowledge, but the quantitative and information systems subscale also had high loadings on the first factor, and the first factor seemed to be related to less quantitative business concepts and theories.

Table 3. Factor loadings for the one-factor and two-factor models

                                   # of    1-factor   2-factor          2-factor after
                                   items              before rotation   PROMAX rotation
                                                      1       2         1'      2'
S1-Accounting                      21      .71        .71     -.12      .21     .54
S2-Economics                       20      .72        .72     -.06      .30     .46
S3-Management                      19      .72        .72      .15      .62     .15
S4-Quantitative & Info. Systems    19      .76        .76      .01      .43     .38
S5-Finance                         13      .64        .64     -.18      .09     .59
S6-Marketing                       14      .63        .63      .09      .48     .19
S7-Legal & Social Environment      12      .61        .61      .11      .50     .14
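A rough way to reproduce the flavor of Table 3 in Python is sketched below, using the third-party factor_analyzer package rather than the SAS procedure used in the study. The function name and the `subscores` array (again a hypothetical persons-by-7 matrix) are ours, and the exact loadings will depend on estimation details.

```python
import numpy as np
from factor_analyzer import FactorAnalyzer  # pip install factor_analyzer

def table3_loadings(subscores: np.ndarray):
    """Maximum likelihood loadings for a one-factor model and a two-factor
    model with PROMAX rotation, as compared in Table 3."""
    fa1 = FactorAnalyzer(n_factors=1, rotation=None, method='ml')
    fa1.fit(subscores)
    fa2 = FactorAnalyzer(n_factors=2, rotation='promax', method='ml')
    fa2.fit(subscores)
    return fa1.loadings_, fa2.loadings_
```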

Given that the second factor's eigenvalue was less than one and the two factors were difficult to interpret, only the one-factor model was examined in the next stage, confirmatory factor analysis. Factor loadings below .2 were omitted (fixed at zero) in the model specification. The model was fitted to the data using LISREL 8.8 (Jöreskog, 1999). The fit indices suggest that the model fitted the data acceptably well: the root mean square error of approximation (RMSEA) was .064, the comparative fit index (CFI) was .99, the Tucker-Lewis index (TLI) was .98, the standardized root mean square residual (SRMR) was .024, and the chi-square value was 8851 (p