VOLUME 9 ISSUE 3-4

The International Journal of Interdisciplinary Educational Studies

Male and Female Performance in Mathematics: Empirical Evidence from Italy

CLELIA CASCELLA

thesocialsciences.com

THE INTERNATIONAL JOURNAL OF INTERDISCIPLINARY EDUCATIONAL STUDIES
www.thesocialsciences.com

First published in 2015 in Champaign, Illinois, USA by Common Ground Publishing LLC
www.commongroundpublishing.com

ISSN 2327-011X

© 2015 (individual papers), the author(s)
© 2015 (selection and editorial matter) Common Ground

All rights reserved. Apart from fair dealing for the purposes of study, research, criticism or review as permitted under the applicable copyright legislation, no part of this work may be reproduced by any process without written permission from the publisher. For permissions and other inquiries, please contact [email protected].

The International Journal of Interdisciplinary Educational Studies is peer-reviewed, supported by rigorous processes of criterion-referenced article ranking and qualitative commentary, ensuring that only intellectual work of the greatest substance and highest significance is published.

Male and Female Performance in Mathematics: Empirical Evidence from Italy

Clelia Cascella, Italy

Abstract: The term Differential Item Functioning (DIF) refers to questions that are relatively easier or more difficult to answer for different groups of people who are matched on ability. When this happens, the probability of a correct answer is not exclusively explained by the person's relative ability (i.e., the subject's ability over the item's difficulty), but also by other factors, such as gender, socio-economic status, and so on. Unlike previous studies of the gender gap, this paper does not aim to compare male and female students' performance from a substantive standpoint; it is mainly aimed at checking the items' functioning. This investigation plays a key role in developing an achievement test because, when DIF occurs, local independence (one of the main properties of the Rasch model) is violated. To achieve this goal, we observed the items' location order along the latent trait in both the male and female groups and then compared them. Given the aim of this study, our methodological approach could be more fruitfully implemented in the pre-testing phase, but, in order to illustrate it, we applied our strategy to the Italian census data gathered by INVALSI (Italian National Institute for Educational Evaluation) in 2013. The data matrix included 484,279 pupils attending the 5th grade of elementary school.

Keywords: Differential Item Functioning (DIF), Rasch Model, Gender Gap, Students' Achievement, Testing Instruments' Development, Pre-test, Mathematics

Introduction

Over the last 50 years, the investigation of gender differences in academic achievement has received increasing attention. From the first studies in this field until now, the assessment of students' performance has highlighted a significant difference between boys and girls: on average, boys perform better than girls in mathematics, but they fall behind girls in textual comprehension and grammar. Similar results have recently been confirmed for Italian students by national and international investigations, carried out respectively by INVALSI (Italian National Institute for Educational Evaluation) and PISA (Program for International Student Assessment). Both INVALSI and PISA study gender differences on the basis of the mean total score observed in the male and female groups (i.e., the mean number of correct answers given by students to the items that compose the achievement test). This is certainly a valid indicator of the mean ability in a certain group of subjects and therefore allows performance to be compared across two or more groups.

An alternative way to pursue a similar goal is to compare the positioning of items along the latent trait on the basis of their difficulty. One of the most widely used techniques to assess students' abilities and competencies is the Rasch model. According to this model, both items and subjects can be scaled along the same latent trait, according to (respectively) their difficulty (δi) and their ability (βn). Even though the probability of correctly answering an item is estimated as a function of the person's relative ability (i.e., the ability of person n over the difficulty of item i), and consequently as a function of the difference between βn and δi, one of the main properties of the Rasch model is the sufficiency of the estimated parameters. This means that the model estimates the subjects' ability (β) regardless of the items' specific features (i.e., their difficulty level) and, at the same time, estimates the items' difficulties (δ) regardless of the subjects' characteristics (i.e., without taking into account their ability level). This mathematical feature of the model allows us to treat βn and δi separately. So, by comparing the positioning of the items' difficulty parameters (δ) along the latent trait in two or more groups, clustered on the basis of one or more criteria, we can immediately notice whether those items are relatively easier or more difficult for different groups of people (matched on ability). This approach qualifies our study mainly from a measurement standpoint.

Sometimes, in fact, the probability of giving a correct answer changes as a function of one or more "clustering" variables (such as gender, the socio-economic status of the student's family, and so on) that can "disturb" the item behaviour expected under the Rasch hypothesis: in this case, the probability of giving a correct answer is not only governed by the difference between βn and δi (as required by the Rasch model) but is also influenced by other factors (Ackerman 1992; Meredith 1993; Roussos and Stout 1996; Embretson and Reise 2000). Previous studies (such as the INVALSI and PISA investigations) focus on the comparison between the mean total scores observed in the male and female groups. In this study, by contrast, our main interest is the psychometric analysis of item functioning: we want to verify whether (and, if so, how) gender disturbs the items' functioning. To achieve this goal, we carried out a differential item functioning (DIF) analysis. This technique requires a nonlinear, weakly monotonically increasing relation between items, which is perfectly consistent with the Rasch model, a logistic function that requires the cumulativity of the persons' response set (Bliss 1935; Bock and Lieberman 1970; McIver and Carmines 1981). So, in our case, if the probability of a correct answer changes across the groups clustered by gender, we can conclude that students' gender disturbs the items' functioning. To illustrate the approach, we apply our strategy to the census data gathered by INVALSI in 2013. The population includes 484,279 pupils attending the 5th grade of elementary school. We analyzed the answers given to a questionnaire composed of 47 polytomous and binary items aimed at testing students' skills in mathematics. Obviously, given the goal pursued here, this kind of investigation would be more adequate (and fruitful) in the pre-testing phase, to guarantee the testing instrument's fairness. We end with some final remarks and critical considerations about our procedure.
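For reference, the dichotomous form of the Rasch model discussed above can be written as follows; this is the standard formulation, and the polytomous items in the questionnaire require an extended (e.g., partial credit) version of the same idea:

$$P(X_{ni} = 1 \mid \beta_n, \delta_i) = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}$$

The probability of a correct answer thus depends only on the difference βn − δi, which is what makes the separate comparison of item difficulty orderings across groups meaningful.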

Differential Item Functioning (DIF)

Differential item functioning (or measurement bias) "refers to differences in the way a test item functions across demographic groups that are matched on the attribute measured by the test or the test item (Camilli 2006; Camilli and Shepard 1994; Holland and Wainer 1993; Penfield and Camilli 2007)" (Osterlind 2009). Over the last two decades, many different techniques to detect DIF have been proposed, such as, for example, the Mantel-Haenszel procedure and its follow-up based on a revised chi-square statistic (Osterlind 2009, 29-37). In more recent years, an increasing number of IRT methods have been developed to detect DIF (Clauser and Mazor 1998; Millsap and Everson 1993; Osterlind 2006; Penfield and Camilli 2007). The IRT approaches have proven useful "for both conceptualizing and assessing DIF by examining between-group differences in a test item's, or a set of items', features, independent of examinee ability. Both theoretically and procedurally, IRT-based DIF methods provide an opportunity for a more comprehensive investigation of the phenomenon than can be done using Classical Test Theory" (Osterlind 2009). In the IRT framework, in fact, parameter estimates are less confounded with sample characteristics than those of classical measurement theory. Moreover, the statistical properties of items can be described more precisely, so that when a test item functions differently in two groups, the differences can be described more precisely as well (Camilli and Shepard 1994, 47).

Formally, DIF occurs if the conditional probability distribution of Y (i.e., the response of person n to the items in an achievement test) depends on group membership, that is, if there is an interaction effect between the person's response to an item and group membership (see, e.g., Swaminathan and Rogers 1990; Swanson et al. 2002; Van den Noortgate and De Boeck 2005). In this way, we can identify a subset of items that are very similar to the other ones (and may therefore correlate strongly with them) but that are sensitive not only to the model's parameters but also to other factors (Ackerman 1992; Meredith 1993; Roussos and Stout 1996; Embretson and Reise 2000). In order to reveal the existence of those "hidden" factors, we observed the items' hierarchy, i.e., the items' positioning along the latent trait. If some items are relatively easier or more difficult for different groups of people (who are matched on ability), then the clustering factor(s) influence the items' functioning and, consequently, have to be conceived as disturbing variables, in the sense specified above.

In practice, our methodology consists of the following steps. Preliminarily, on the basis of the gender variable, we divided the Italian population into two groups (called focal and reference). Statistically, it makes no difference which group is designated as focal and which as reference. However, according to the psychometric literature, the focal group is generally composed of the students whom we suspect the test favors, and the reference group of those individuals who are at risk of being disadvantaged by the test. Given that girls typically fall behind boys in mathematics tests such as the SAT (the Scholastic Aptitude Test, later the Scholastic Assessment Test) or the NAEP (the National Assessment of Educational Progress) (Wilder and Powell 1989), our focal group for mathematical competencies is composed of boys and the reference group of girls. Subsequently, we carried out a preliminary analysis to observe the β distributions in both subgroups. On the basis of those results, we reorganized the data matrix as shown in Table 1, creating four subsets each for boys and girls, matched on ability. To assess measurement bias, in fact, it is essential to observe the performance of groups matched on ability, so that the differences are comparable: otherwise, an apparent DIF effect may merely reflect group differences in the ability distributions, and one would erroneously conclude that an item shows DIF.
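In symbols, the no-DIF condition stated at the beginning of this section can be written as follows (a standard formulation, e.g., Millsap and Everson 1993; here Y_i denotes the response to item i, θ the ability on which groups are matched, and G group membership):

$$P(Y_i = y \mid \theta, G = g) = P(Y_i = y \mid \theta) \quad \text{for every group } g$$

Any dependence of the conditional response distribution on G, at fixed θ, signals DIF.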

Table 1: Descriptive Statistics of Groups Matched on Ability, β (Frequencies), Mathematics

| Group | β range        | Girls A1 | Girls A2 | Girls A3 | Girls A4 | Girls A5 | Boys A1 | Boys A2 | Boys A3 | Boys A4 | Boys A5 |
|-------|----------------|----------|----------|----------|----------|----------|---------|---------|---------|---------|---------|
| 1     | [−4.0, −2.1]*  | 329      | 272      | 280      | 479      | 412      | 901     | 685     | 706     | 708     | 347     |
| 2     | [−2.0, −0.1]   | 28,271   | 20,602   | 21,426   | 25,778   | 18,910   | 27,519  | 20,469  | 20,088  | 25,585  | 18,385  |
| 3     | [0, +1.9]      | 29,872   | 21,475   | 19,817   | 24,771   | 16,769   | 30,417  | 22,100  | 21,584  | 25,345  | 18,877  |
| 4     | [+2.0, +4.0]** | 2,511    | 1,746    | 2,124    | 2,134    | 2,170    | 2,584   | 1,804   | 2,036   | 2,506   | 1,485   |
| TOT.  |                | 60,983   | 44,095   | 43,647   | 53,162   | 38,261   | 61,421  | 45,058  | 44,414  | 54,144  | 39,094  |

* There are a few cases with β lower than −4; those cases were not excluded.
** There are a few cases with β higher than +4; those cases were not excluded.

Table 1 shows the data divided into four groups matched on ability level and into five geographical areas (A1-A5). We considered this territorial parceling because many previous studies (see also the INVALSI Technical Reports) have highlighted systematic differences in students' performance across those five areas. (The areas are clustered according to specific criteria, such as the socio-economic conditions of families, well-being indicators, and so on. For an in-depth presentation of the methodology adopted by INVALSI, see the INVALSI National Annual Report 2012.) So, in order to guarantee a straightforward comparison between boys and girls, we subdivided our data matrix taking into account those five geographical areas as well.
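As an illustration of this data-preparation step, the following sketch shows how the gender-by-ability groups of Table 1, crossed with the five areas, could be built. The file and column names (gender, area, beta) are hypothetical, and the bin edges approximate the β intervals of Table 1:

```python
import pandas as pd

# Hypothetical input: one row per pupil, with a Rasch ability estimate 'beta',
# 'gender' in {'F', 'M'}, and geographical 'area' in {1, ..., 5}.
df = pd.read_csv("invalsi_grade5_math_2013.csv")

# Four ability bands, approximating the beta intervals of Table 1
# (each band merges two adjacent one-logit intervals).
edges = [-4.0, -2.0, 0.0, 2.0, 4.0]
df["ability_group"] = pd.cut(df["beta"], bins=edges, labels=[1, 2, 3, 4])

# Frequencies per ability band, gender, and geographical area (layout of Table 1).
table1 = pd.crosstab(df["ability_group"], [df["gender"], df["area"]])
print(table1)
```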

Results

In order to make the results of our procedure easier to interpret, we propose a synthetic measure that summarizes the comparison between the items' location orders in the male and female groups. Since we made no a priori assumption about the kind of relationship between the items' location orders in the reference and focal groups, we calculated the Spearman rank correlation coefficient (ρ). As is well known, ρ is a nonparametric measure of statistical dependence: it does not require a linear relationship between the variables, and it is particularly appropriate for ranked data. Tables 2.1-2.5 summarize the output of the correlation analysis carried out using Spearman's coefficient. Spearman's coefficient is interpreted like the Bravais-Pearson index: when it tends to +1 or −1, there is a perfect (respectively, positive or negative) relationship between the variables, and in that case no significant difference in the items' location order is revealed. On the contrary, when ρ tends to zero, there is no relationship between the two orderings, and gender works as a disturbing factor, in the sense specified above. Spearman's coefficient is preferable to the Bravais-Pearson index because the latter is sensitive only to a linear relationship between two variables, and we did not have sufficient information to make an a priori assumption about the kind of relationship between the items' positions along the latent trait.

Table 2.1: Synopsis (Area 1). Spearman's rank correlation ρ = 1.00 in each of the four ability groups (Beta 1-4); Bravais-Pearson correlations r = 0.97, 0.98, 0.98, and 0.97. [Scatter plots of boys' versus girls' item difficulty estimates omitted; the fitted regression lines have R² between 0.93 and 0.96.]
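A minimal sketch of this comparison, assuming two vectors of item difficulty estimates (one per group) obtained from separate Rasch calibrations; the variable names and toy values are hypothetical:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

# Item difficulty estimates (one value per item) for the reference (girls)
# and focal (boys) groups within the same ability band and area.
delta_girls = np.array([-1.2, -0.4, 0.1, 0.7, 1.5])   # toy values
delta_boys  = np.array([-1.0, -0.3, 0.2, 0.9, 1.6])

rho, _ = spearmanr(delta_girls, delta_boys)   # rank-based: compares orderings
r, _ = pearsonr(delta_girls, delta_boys)      # linear association only
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")
```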


Comparing the items' hierarchies, they appear nearly identical in both groups, as confirmed by the high values of the correlation coefficients. According to those results, there is no evidence of DIF.

Table 2.2: Synopsis (Area 2). Spearman's rank correlation ρ = 1.00 in each of the four ability groups; Bravais-Pearson correlations r = 0.98, 0.98, and 0.95 in three groups, and r = 0.61 in the remaining one. [Scatter plots omitted; the regression line for the group with r = 0.61 has R² = 0.367, while the others have R² between 0.90 and 0.96.]

The results obtained in geographical Area 2 (Table 2.2) are substantially similar to those of the other areas (see Tables 2.1 and 2.3-2.5). In Area 2, only one group requires particular attention, because it shows an interesting difference between r and ρ (Table 2.2). In this case, a linear relationship between the variables does not hold (as revealed by R² = 0.367), and this confirms that the Spearman rank correlation is more adequate than the Pearson correlation coefficient: the correlation value calculated by r is remarkably lower than the value of ρ. The latter, independently of the kind of relationship between the variables, is able to compare the items' orderings along the latent trait in the reference and focal groups, and therefore gives us more useful information.
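The divergence observed in Area 2 is exactly what happens when the relation between the two sets of difficulty estimates is monotonic but not linear; a toy demonstration:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

x = np.linspace(-2.0, 2.0, 47)   # difficulties of 47 items in one group
y = np.exp(x)                    # strictly increasing but nonlinear transform

rho, _ = spearmanr(x, y)   # 1.0: the item ordering is perfectly preserved
r, _ = pearsonr(x, y)      # < 1: attenuated by the nonlinearity
print(rho, r)
```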


Table 2.3: Synopsis (Area 3). Spearman's rank correlation ρ = 1.00 in each of the four ability groups (Beta 1-4); Bravais-Pearson correlations r = 0.97, 0.98, 0.98, and 0.97. [Scatter plots omitted; regression R² between 0.94 and 0.97.]

Table 2.4: Synopsis (Area 4). Spearman's rank correlation ρ = 1.00 in each of the four ability groups; Bravais-Pearson correlations r between 0.94 and 0.99 (0.97, 0.99, 0.98, and 0.94). [Scatter plots omitted; regression R² between 0.89 and 0.97.]


Table 2.5: Synopsis (Area 5). Spearman's rank correlation ρ = 1.00 in each of the four ability groups; Bravais-Pearson correlations r = 0.95, 0.99, 0.99, and 0.99. [Scatter plots omitted; regression R² between 0.91 and 0.98.]

As clearly shown in the previous tables, in both the male and female groups (and in each geographical area) the values of the correlation coefficient are close to one, so no significant difference in the items' location order can be revealed. We can therefore conclude that, from a measurement standpoint, gender does not work as a disturbing factor.

Discussion and Final Remarks

One of the main properties of the Rasch model is the sufficiency of its statistics, i.e., of the model's parameters βn (subject's ability) and δi (item's difficulty). Unlike the current INVALSI and PISA approach, we studied gender differences in academic achievement by focusing on the items' difficulty parameters, and in particular on their positioning along the latent trait. The main goal of our study was thus to check the items' functioning, in order to verify the possible presence of factors that disturb the item behavior hypothesized by the Rasch model.

Our procedure consists of the following phases. First of all, we divided the Italian census data into two groups, called reference (girls) and focal (boys). For each of them, we estimated the distribution of the ability parameters, using the Rasch model. On the basis of those results, we reorganized the data matrix into eight groups matched on ability (four for girls and four for boys). Then we carried out a new analysis, using the same model, to estimate the items' difficulty parameters again, and, finally, we compared the items' location orderings (i.e., the positioning of the items' difficulty parameters along the latent trait) in the male and female groups, using Spearman's rank correlation coefficient (ρ). This measure was preferred to the Bravais-Pearson correlation index (r) because we could not make any preliminary assumption about the relationship between the items' location orderings in the reference and focal groups. Moreover, our choice was also confirmed by the empirical results: comparing ρ to r, the former appeared more adequate.

The findings obtained in the empirical part of the study showed no evidence of DIF in our gender-based analysis. Obviously, this kind of investigation can reveal some interesting information ex post, after the administration of the testing instruments, but it would be more fruitful before or during the pre-testing phase, to construct items and to develop testing instruments.
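Operationally, the ordering comparison at the heart of this procedure can be sketched as below. The logit-of-proportion-correct proxy is a simplification introduced here for illustration (within a group that is roughly homogeneous in ability it preserves the Rasch ordering of item difficulties), not the full Rasch calibration used in the study:

```python
import numpy as np
from scipy.stats import spearmanr

def difficulty_proxy(responses):
    """responses: (n_pupils, n_items) binary matrix for one ability-matched group.
    Under the Rasch model with (near-)constant ability beta, the proportion
    correct is p_i = exp(beta - delta_i) / (1 + exp(beta - delta_i)), so
    log((1 - p_i) / p_i) = delta_i - beta preserves the difficulty ordering."""
    p = np.clip(responses.mean(axis=0), 1e-6, 1 - 1e-6)
    return np.log((1 - p) / p)

def ordering_agreement(responses_ref, responses_focal):
    """Spearman rho between the item difficulty orderings of the two groups:
    values near 1 mean no evidence of DIF for this ability band."""
    rho, _ = spearmanr(difficulty_proxy(responses_ref),
                       difficulty_proxy(responses_focal))
    return rho
```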

Acknowledgement

I would like to thank the Italian National Institute for Educational Evaluation (INVALSI) for giving me all the needed information and data. In addition, I would like to specify that the content of this paper is attributable only to me, and INVALSI is not responsible for it.

REFERENCES

Ackerman, Terry A. 1992. "A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective." Journal of Educational Measurement 29: 67-91.

Allalouf, Avi, Ronald K. Hambleton, and Stephen G. Sireci. 1999. "Identifying the causes of DIF in translated verbal items." Journal of Educational Measurement 36: 185-198.

Bliss, Chester I. 1935. "The calculation of the dosage mortality curve." Annals of Applied Biology 24: 815-852.

Bock, R. Darrell, and Michael Lieberman. 1970. "Fitting a response model for n dichotomously scored items." Psychometrika 35: 179-197.

Camilli, Gregory. 2006. "Test fairness." In Educational Measurement, edited by R. L. Brennan, 220-256. Westport, CT: American Council on Education.

Camilli, Gregory, and Lorrie A. Shepard. 1994. Methods for identifying biased test items. Thousand Oaks, CA: Sage.

Clauser, Brian E., and Kathleen M. Mazor. 1998. "Using statistical procedures to identify differential item functioning test items." Educational Measurement: Issues and Practice 17: 31-44.

Cvencek, Dario, Andrew N. Meltzoff, and Manu Kapur. 2014. "Cognitive consistency and math-gender stereotypes in Singaporean children." Journal of Experimental Child Psychology 117: 73-91.

Embretson, Susan E., and Steven P. Reise. 2000. Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Gallagher, Ann M., Richard De Lisi, Patricia C. Holst, A. V. McGillicuddy-De Lisi, Mary Morely, and Cara Cahalan. 2000. "Gender differences in advanced mathematical problem solving." Journal of Experimental Child Psychology 75: 165-190.

Holland, Paul W., and Howard Wainer. 1993. Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum.

INVALSI. 2012. OCSE-PISA 2012. Rapporto Nazionale. http://www.invalsi.it/invalsi/ri/pisa2012/rappnaz/Rapporto_NAZIONALE_OCSE_PISA2012.pdf

INVALSI. 2013. Rapporto SNV 2013. http://www.invalsi.it/invalsi/istituto.php?page=rapporti

Le, Luc T. 2009. "Investigating gender differential item functioning across countries and test languages for PISA science items." International Journal of Testing 9: 122-133.

Legewie, Jochen, and Thomas A. DiPrete. 2012. "School context and the gender gap in educational achievement." American Sociological Review 77: 463-485.

McCullough, Laura. 2004. "Gender, context, and physics assessment." Journal of International Women's Studies 5: 20-30.

McIver, John P., and Edward G. Carmines. 1981. Unidimensional Scaling. Beverly Hills, CA: Sage.

Meredith, William. 1993. "Measurement invariance, factor analysis and factorial invariance." Psychometrika 58: 525-543.

Millsap, Roger E., and Howard T. Everson. 1993. "Methodological review: Statistical approaches for assessing measurement bias." Applied Psychological Measurement 17: 297-334.

Osterlind, Steven J., and Howard T. Everson. 2009. Differential Item Functioning. Quantitative Applications in the Social Sciences. Thousand Oaks, CA: Sage.

Pae, Tae-Il. 2004. "Gender effect on reading comprehension with Korean EFL learners." System 32: 265-281.

Penfield, Randall D., and Gregory Camilli. 2007. "Differential item functioning and item bias." In Handbook of Statistics, edited by C. R. Rao, 127-167. New York: Elsevier.

Roussos, Louis, and William F. Stout. 1996. "A multidimensionality-based DIF analysis paradigm." Applied Psychological Measurement 20: 355-371.

Swaminathan, Hariharan, and H. Jane Rogers. 1990. "Detecting differential item functioning using logistic regression procedures." Journal of Educational Measurement 27: 361-370.

Swanson, David B., Brian E. Clauser, Susan M. Case, Ronald J. Nungester, and Carol Featherman. 2002. "Analysis of differential item functioning (DIF) using hierarchical logistic regression models." Journal of Educational and Behavioral Statistics 27: 53-75.

Van den Noortgate, Wim, and Paul De Boeck. 2005. "Assessing and explaining differential item functioning using logistic mixed models." Journal of Educational and Behavioral Statistics 30: 443-464.

Wilder, Gita, and Ken Powell. 1989. Sex Differences in Test Performance: A Survey of the Literature. Report No. RR-89-4. Princeton, NJ: Educational Testing Service.

Wu, Margaret. 2010. "Measurement, sampling, and equating errors in large-scale assessments." Educational Measurement: Issues and Practice 29: 15-27.

Zirk-Sadowski, Jan, Charlotte Lamptey, Amy Devine, Mark Haggard, and Denes Szucs. 2014. "Young-age gender differences in mathematics mediated by independent control or uncontrollability." Developmental Science 17: 1-10.

Zumbo, Bruno D. 1999. A Handbook on the Theory and Methods of Differential Item Functioning: Logistic Regression Modeling as a Unitary Framework for Binary and Likert-Type Item Scores. Ottawa: Directorate of Human Resources Research and Evaluation, National Defence Headquarters.

ABOUT THE AUTHOR Dr. Clelia Cascella: Researcher, Italian National Institute for Educational Evaluation, Rome, Italy.


The International Journal of Interdisciplinary Educational Studies is one of eight thematically focused journals in the collection of journals that support the Interdisciplinary Social Sciences knowledge community—its journals, book series, conference and online community. The journal explores the processes of learning about the social and social learning. As well as papers of a traditional scholarly type, this journal invites case studies that take the form of presentations of practice—including documentation of socially-engaged practices and exegeses analyzing the effects of those practices. The International Journal of Interdisciplinary Educational Studies is a peer-reviewed scholarly journal.

ISSN 2327-011X