Journal of Autism and Developmental Disorders, Vol. 34, No. 1, February 2004 (© 2004)
Methodological Issues in Group-Matching Designs: α Levels for Control Variable Comparisons and Measurement Characteristics of Control and Target Variables
Carolyn B. Mervis¹,³ and Bonita P. Klein-Tasman²
Group-matching designs are commonly used to identify the diagnosis-specific characteristics of children with developmental disabilities. In this paper, we address three issues central to the use of this design. The first concerns the α level to be used for considering groups to be matched on the control variable(s). The second involves the measurement characteristics of the control and target variables. We discuss the properties of standard scores, raw scores, and age equivalents and argue against the use of age equivalents. In addition, we consider the appropriateness of the commonly made prediction that groups that are matched for a control variable such as language ability or nonverbal reasoning ability but are not matched for chronological age should perform at equivalent levels on the target variable. Finally, we discuss issues related to the interpretation of significant between-group differences on the target variable, assuming groups are well-matched on the control variables, and describe the benefits of a method that focuses on characterizing a disorder on a case-by-case basis and then aggregating the cases, using the measures of sensitivity and specificity from signal detection theory.
KEY WORDS: Autism; developmental disability; matching; mental age; methodology; signal detection theory; spectrum.
INTRODUCTION
To home in on characteristics that are universal among and specific to individuals with a particular syndrome or disability, researchers often compare the performance of a target group to that of one or more groups with other disabilities and/or to a group of children who are developing normally. For example, to determine if children with pervasive developmental disorders (PDDs) show more extreme deficits in social skills than children with other disabilities, the social skills of a group of children with PDDs might be compared to the social skills of a group of younger children with a similar developmental level or those of a group of children who have developmental delays but do not have PDDs. To be confident that a significant between-group difference on social skills is not due simply to differences in other abilities such as language, the groups would be matched on level of ability on the potentially confounding variables.
As the field of molecular genetics advances, characteristic cognitive and behavioral profiles can begin to be linked to their genetic underpinnings. The ability to demonstrate genotype–phenotype correlations is dependent on phenotypic descriptions that focus on characteristics that are as universal to and specific to members of a particular diagnostic group as possible. Group-matching designs provide an important first step in identifying these characteristics.
In this paper, we address three issues central to the use of the group-matching design. The first concerns the α level to be used for considering groups matched on the control (confounding) variable(s). The second concerns measurement characteristics of the control and target variables. Included in this issue are the questions of whether groups should be matched for chronological age (CA) and whether it is appropriate to predict that groups matched for a control variable such as language ability but not matched for CA should perform at equivalent levels on the target variables. The third concerns interpretation of significant between-group differences on the target variable, assuming groups are well-matched on the control variables.
¹ Department of Psychological and Brain Sciences, University of Louisville, Louisville, Kentucky 40292.
² Department of Psychology, University of Wisconsin–Milwaukee, Milwaukee, Wisconsin 53211.
³ Correspondence should be addressed to Carolyn B. Mervis; e-mail: [email protected]
0162-3257/04/0200-0007/0 © 2004 Plenum Publishing Corporation
DETERMINING IF GROUPS ARE MATCHED: ACCEPTING THE NULL HYPOTHESIS
Most researchers are very well aware that if the groups included in a study are not matched on the control variables, then any significant between-group differences found on the target variables are not interpretable because these differences could be due to group differences on the control variables. Determination that the groups do not differ on the control variable usually is based on the results of a test of the difference between groups (e.g., a t test). If the null hypothesis that the groups do not differ on the control variable cannot be rejected (typically because p > 0.05), researchers often automatically respond by tacitly accepting the null hypothesis that the groups are equivalent. Harcum (1990, p. 404) describes this phenomenon as “casual acceptance of the null hypothesis.” For example, in a study comparing the receptive vocabulary ability of a group of adolescents and young adults with Williams syndrome and a group of adolescents and young adults with Down syndrome, Paterson (2001) considered the two groups to be matched for cognitive ability based on the results of a t test with a significance value of p < 0.07. In a series of studies concerning the social competence of children with autism, children with Down syndrome, children with other types of developmental delays, and typically developing children, Sigman and Ruskin (1999) used p > 0.10 as their criterion for considering the groups to be matched.
The null hypothesis should not be accepted so easily, however. Rejecting the null hypothesis because it is improbable under the theoretical sampling distribution should be a very different decision from accepting the null hypothesis because it is likely to be true. In terms of the group-matching design, the probability of making a type-II error (accepting the null hypothesis that the groups do not differ even though they do) is of primary concern. As the α level increases, the probability of making a type-II error decreases. As Cohen (1990) pointed out, the null hypothesis is almost never literally true. The question then becomes: how high should α be before the researcher accepts the null hypothesis that the groups do not differ on the matching variable? Frick (1995) proposes the following guidelines: Any p value less than 0.20 is too low to accept the null hypothesis. A p value greater than 0.50 is large enough to accept the null hypothesis. A p value between 0.20 and 0.50 is ambiguous. Thus, Mervis and Robinson (2003) suggest that groups not be considered matched on the control variable unless a p value of at least 0.50 is found on the test of group differences.
To illustrate the impact of different criteria for considering groups matched on the control variable on the determination of whether the groups differ on the target variable, we present an example based on actual data. As part of a study examining intellectual strengths and weaknesses of 9- and 10-year-olds with Williams syndrome or Down syndrome (Klein & Mervis, 1999), we tested a cohort of 18 children with Williams syndrome and 23 children with Down syndrome on the McCarthy Scales of Children’s Abilities (McCarthy, 1972) and the Peabody Picture Vocabulary Test-Revised (Dunn & Dunn, 1981). The two groups were well-matched on CA (p = 0.502). Here, we use these samples to address the question of whether 9- and 10-year-olds with Williams syndrome have significantly larger receptive vocabularies than same-CA children with Down syndrome.
As a first step, we compared the receptive vocabulary ability of the full sample of 18 children with Williams syndrome and 23 children with Down syndrome. As indicated in Table I, the results of this analysis indicated that receptive vocabulary size was significantly larger for the group of children with Williams syndrome.
Given that the two groups were well-matched for CA, the significant difference in receptive vocabulary can appropriately be interpreted to indicate that within the CA range studied, children with Williams syndrome on average have larger receptive vocabularies than children with Down syndrome. However, because the two groups were not matched on overall cognitive ability, there is a reasonable possibility that the apparent weakness of children with Down syndrome on receptive vocabulary reflected not a specific difficulty with receptive vocabulary but rather a general weakness in overall cognitive ability. Therefore, we took the important second step of performing a more considered test of possible vocabulary differences in which general cognitive ability was taken into account. To better match the two groups on the control variable of general cognitive ability, we gradually removed the children with the lowest McCarthy raw scores from the Down syndrome group and the children with the highest McCarthy raw scores from the Williams syndrome group until the difference between the two groups on McCarthy raw score reached three different p levels, all >0.05.

Table I. Impact of Group-Match p Level for Cognitive Performance on the Significance of Between-Group Differences on the PPVT-R
p level for cognitive match | Match status | p level for PPVT-R comparison
p < 0.001 | Not matched | p < 0.001
p = 0.106 | Not matched (but commonly used) | p = 0.030
p = 0.254 | Unclear if matched | p = 0.074
p = 0.650 | Definitely matched | p = 0.204
PPVT-R, Peabody Picture Vocabulary Test-Revised.

As shown in Table I, the p level used to judge whether the groups were matched on the control variable of overall cognitive ability had a strong effect on the findings regarding the target variable. If the groups are considered matched simply because no significant group difference was found at the p = 0.10 level, then one could conclude that they differed significantly on receptive vocabulary ability, even when matched for overall cognitive ability. This finding would typically be interpreted as indicating that children with Down syndrome have smaller receptive vocabularies than would be expected for their overall level of cognitive ability (or alternatively that children with Williams syndrome have larger receptive vocabularies than would be expected for overall level of cognitive ability). However, as the matching of cognitive abilities of the two groups is conducted according to increasingly stringent criteria (i.e., higher α levels), the group difference in receptive vocabulary disappears. Once matched closely for overall cognitive abilities, 9- and 10-year-olds with Down syndrome or Williams syndrome do not differ on average in the size of their receptive vocabularies. This finding indicates that, at least within this CA range, there is no syndrome-specific weakness in receptive vocabulary for children with Down syndrome nor is there a syndrome-specific strength in receptive vocabulary for children with Williams syndrome. Rather, as revealed when the groups are closely matched for cognitive ability, differences in receptive vocabulary size between children with Williams syndrome or Down syndrome are likely attributable to between-syndrome differences in overall cognitive ability.
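The trimming procedure used in this example can be sketched in code. The following is a minimal illustration under stated assumptions, not the analysis that produced Table I: the function names are hypothetical, the scores are invented, children are removed alternately from the two groups (the text does not specify the order of removal), and an exact permutation test on the difference in means stands in for the t test so that the sketch needs only the Python standard library.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))
    hits = 0
    splits = 0
    # Enumerate every way of assigning n_a of the pooled scores to group A.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
        splits += 1
    return hits / splits


def trim_until_matched(lower_group, higher_group, target_p=0.50, min_size=3):
    """Drop the lowest scorers from the lower-scoring group and the highest
    scorers from the higher-scoring group, alternating (an assumption), until
    the p value reaches target_p or both groups are down to min_size."""
    lower = sorted(lower_group)    # lowest score first
    higher = sorted(higher_group)  # highest score last
    drop_from_lower = True
    p_value = exact_permutation_p(lower, higher)
    while p_value < target_p:
        if drop_from_lower and len(lower) > min_size:
            lower.pop(0)
        elif len(higher) > min_size:
            higher.pop()
        elif len(lower) > min_size:
            lower.pop(0)
        else:
            break  # nothing left to trim without emptying a group
        drop_from_lower = not drop_from_lower
        p_value = exact_permutation_p(lower, higher)
    return lower, higher, p_value


# Invented control-variable raw scores, for illustration only.
group_low = [20, 25, 30, 35, 40, 45]
group_high = [40, 45, 50, 55, 60, 65]
kept_low, kept_high, p = trim_until_matched(group_low, group_high, target_p=0.50)
print(len(kept_low), len(kept_high), round(p, 3))
```

Exact enumeration is practical only for small groups; with samples the size of those in the study, a t test or a randomly sampled permutation test would be the more realistic choice. The sketch also makes the cost of a stringent criterion visible: each increase in the target p value is paid for with a smaller retained sample.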
In sum, when matching groups on a control variable, researchers need to move beyond determination that the null hypothesis of no difference cannot be rejected. Researchers often rely on a finding of “no significant group difference” to determine that groups are matched on a control variable. Such a finding at the p = 0.05 level reveals only that, if the groups did not truly differ, a difference at least as large as the one observed would be expected to arise by chance more than 5% of the time. This finding does not indicate that the groups are highly similar on the control variable, which is the desired outcome if groups are to be considered matched. Thus, it is important to show that the group distributions on the matching variable overlap strongly, as evidenced, we suggest, by a p level of at least 0.50 on the test of mean differences for the control variable(s). As demonstrated above, the amount of overlap of group distributions on the control variable often has a significant impact on the group difference results for the target variable.
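The decision rule discussed in this section can be written as a short function. This is a sketch only: the function name and the label strings are hypothetical, the bands follow Frick's (1995) guidelines as quoted above, and a p value of exactly 0.50 is treated as matched, following the "at least 0.50" wording.

```python
def classify_match(p_value):
    """Label a control-variable comparison using the p-value bands in the text.

    Assumed boundaries: p < 0.20 is too low to accept the null hypothesis,
    0.20 <= p < 0.50 is ambiguous, and p >= 0.50 counts as matched.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p value must lie between 0 and 1")
    if p_value < 0.20:
        return "not matched"
    if p_value < 0.50:
        return "ambiguous"
    return "matched"


# The three p levels above 0.05 reported for the cognitive match in Table I.
for p in (0.106, 0.254, 0.650):
    print(p, classify_match(p))
```

Applied to the Table I values, the rule reproduces the labels given there: p = 0.106 is not matched even though it exceeds the conventional 0.05 and 0.10 criteria, p = 0.254 is unclear, and p = 0.650 is matched.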
IMPACT OF MEASUREMENT CHARACTERISTICS OF SCORES FOR CONTROL AND TARGET VARIABLES
Comparisons using the group-matching design necessarily require measuring both control and target variables. These designs usually include at least one control variable that is measured by performance on a standardized assessment. In many cases, target variables are measured by standardized assessments, even when they are not used to match groups. Standardized assessments usually offer several types of scores: standard scores and raw scores, which are measured on an interval scale, and age equivalents, which are measured on an ordinal scale. As described below, these differences have important implications for the appropriateness of between-group statistical comparisons. Ideally, the groups are also closely matched for CA, over a narrow range; studies that include groups that are not matched for CA and/or that include a broad CA range face additional complications, as described later in this paper.
THE CASE OF CA-MATCHED NARROW-CA-RANGE GROUPS
When groups are well-matched for CA over a narrow CA range and this range is within the norming range for the assessment used as the control variable, groups could in principle be matched on standard score, raw score, or age equivalent (AE). Standard scores, which place an individual’s raw score relative to the distribution of raw scores for similar-CA individuals in the norming sample, are generally the best measurement choice for both control and (if based on a standardized assessment) target variables. In cases in which several participants earn the lowest possible standard score on the control variables, special attention needs to be paid to these individuals because the lowest standard score often corresponds to a wide range of raw scores. In such cases, groups should be considered matched only if the raw scores are closely matched. (Although this may seem also to be a solution for groups differing in CA or for groups covering a wide CA range, it is not, for reasons we describe later.) The use of raw scores to match groups may also be appropriate when the groups are well-matched for CA and include only a narrow CA range, but that range is older than the oldest group included in the norming sample. This procedure is reasonable if the assessment accurately captures the ability level of the participants so that floor and/or ceiling effects are not a problem on any of the subtests included in the raw score used for matching.
Researchers studying children with developmental disabilities often match groups based on language age (LA) or other types of age equivalents. The label “age equivalent” appears to offer face validity to researchers who are trying to match groups. This apparent face validity may encourage the use of AE scores without consideration of their measurement characteristics, which are highly problematic.
In some cases, researchers use AE scores because the measure does not provide standard scores; in other cases, study participants may be older than the oldest CA for which standard scores are available. The lack of availability of standard scores, however, does not justify the use of AE scores, given the inherent measurement problems. An AE score is simply the median CA at which a particular raw score was obtained. Thus, AE scores are not measured on an interval scale and therefore have multiple measurement characteristics that make them inappropriate for statistical comparisons. Unlike standard scores, AE scores do not reflect relative standing among CA peers. Thus, the same AE score on different subtests
of the same assessment is often associated with very different standard scores, and the same standard score on different subtests may be associated with very different AE scores (see Mervis [in press] for further discussion). In discussing the use of AE scores, Semel, Wiig, and Secord (2003) point out that contrary to what might be expected for a measure labeled “AE,” AE scores substantially above or below CA may be well within the average range for the child’s CA. For example, on the Recalling Sentences subtest of the Clinical Evaluation of Language Fundamentals (CELF-4; Semel et al., 2003), the AE range corresponding to standard scores that are within 1 standard deviation of the CA mean for a child aged 7;6 (7 years 6 months) is from 5;9 to 10;11. Semel et al. also note that the difference between two same-CA children in AE is not equivalent to the difference in their language skills. The example they use is that a 12-month difference in AE between two children aged 5;4 does not indicate that the child with the higher AE score has language skills that are 12 months more advanced than those of the other child. Unfortunately, statistical analyses with AE as the dependent variable make precisely this assumption. The measurement problems with AE scores are common to all aspects of development. These are well illustrated by an example from the language domain: If LA was appropriately measured on an interval scale, then the 4-month difference in LA between 1;4 and 1;8 would represent the same amount of growth as the 4-month difference in LA between 5;4 and 5;8. This is clearly false. The first 4-month interval corresponds to enormous differences (Robinson & Mervis, 1998). At 1;4, children typically have very small vocabularies and speak in single-word utterances. 
At 1;8, they usually are in the midst of a vocabulary spurt, with expressive vocabularies between 2 and 3 times as large as at 1;4, as measured by the MacArthur Communicative Development Inventory (Fenson et al., 1993). Furthermore, a significant proportion of their utterances involves multiple words. In contrast, the 4-month interval between 5;4 and 5;8 corresponds to a very slight change in language ability. Most people who are familiar with young children would not be surprised that these two 4-month differences do not correspond to equivalent amounts of language development; they know full well that development is nonlinear. This knowledge should provide the basis for rejecting the use of AE scores in statistical analyses; the standard statistical comparisons (e.g., t tests, ANOVA, ANCOVA) treat all differences of N months in LA as equivalent. If all differences of a particular number of months are not equivalent, then these
statistical comparisons are clearly invalid. Statistical comparisons such as these depend critically on the use of scores measured on an interval scale, preferably with a normal distribution.
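The language-age example above can be made concrete with a few lines of code. This is an illustrative sketch only: the vocabulary sizes are invented placeholders (chosen so that the early interval shows growth between 2 and 3 times, as described in the text) and are not taken from any norming sample, and the function names are hypothetical.

```python
def age_to_months(age):
    """Convert 'years;months' notation (e.g. '5;4') to a number of months."""
    years, months = age.split(";")
    return int(years) * 12 + int(months)


# Hypothetical median expressive vocabulary sizes (in words).
# These numbers are invented for illustration and are not published norms.
HYPOTHETICAL_MEDIAN_VOCABULARY = {"1;4": 40, "1;8": 110, "5;4": 5000, "5;8": 5150}


def compare_interval(start, end):
    """Return (difference in months, proportional vocabulary growth)."""
    months = age_to_months(end) - age_to_months(start)
    growth = HYPOTHETICAL_MEDIAN_VOCABULARY[end] / HYPOTHETICAL_MEDIAN_VOCABULARY[start]
    return months, growth


for start, end in (("1;4", "1;8"), ("5;4", "5;8")):
    months, growth = compare_interval(start, end)
    print(f"{start} to {end}: {months} months, vocabulary x{growth:.2f}")
```

Both intervals are 4 months long, so a t test or ANOVA on AE scores would treat them as identical differences, yet under these assumed values the first corresponds to vocabulary nearly tripling and the second to a change of about 3%.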
THE CASE OF CA-MATCHED NARROW-CA-RANGE GROUPS: MEASUREMENT CHOICE FOR TARGET VARIABLES
Target variables often involve performance on researcher-designed measures. In these cases, provided that groups are closely matched for CA and any other control variable included by the researcher and that each group covers only a narrow CA range, group comparison of raw scores is appropriate. For example, a researcher interested in the ability of young children with developmental disabilities to engage in joint attention might design a script including 15 bids for joint attention and count the number of bids to which the child responded.
In many cases, especially when researchers are interested in determining if children with a given syndrome show a particular pattern of strengths or weaknesses, target variables are measured from standardized assessments. In these cases, the same guidelines as for control variables apply: Measurement on an interval scale is critical. The appropriate variable to determine if a child (or syndrome) shows a particular pattern of strengths and weaknesses is standard score.⁴ A person who evidences the same standing relative to his or her CA-peers on subtests measuring different abilities will earn the same standard score (within the measurement error of the test) on each of the subtests. These identical standard scores may be associated with very different AEs, however, especially for children whose abilities are at the high or low end of the distribution. For example, a 6½-year-old who has syndrome A and earned a scaled score of 4 (2.00 standard deviations below the mean) on each of the Level 1 subtests of the CELF-4 (Semel et al., 2003) would earn the following AEs: Concepts and Following Directions, 4;10; Word Structure, 4;2; Recalling Sentences, 4;9; and Formulated Sentences, 5;2–5;3. Use of AE instead of scaled score would lead to the incorrect conclusion that the child showed a jagged profile with a strength in Formulated Sentences and a weakness in Word Structure when in fact the child showed a flat profile. Now consider a 6½-year-old who has syndrome B and earned an AE of 4;5–4;6 on all four subtests. This pattern often is interpreted as indicating that the child has equivalent levels of ability for the skills measured by each of the four subtests. Examination of the standard scores corresponding to these AEs, however, contradicts this interpretation: This AE is associated with a scaled score of 1 for the Concepts and Following Directions subtest, a considerably higher scaled score of 5 for Word Structure, a scaled score of 2, 3, or 4 for Recalling Sentences,⁵ and a scaled score of 1 or 2⁶ for Formulated Sentences. Assuming the findings for these two children are representative of individuals with their syndrome, the use of scaled scores would indicate a flat profile for syndrome A and a jagged profile for syndrome B whereas the use of AE scores would erroneously lead to exactly the opposite conclusion that syndrome A is associated with a jagged profile and syndrome B with a flat profile.
This problem is by no means restricted to the CELF-4; it is most likely manifested on all standardized assessments. Even relatively similar abilities may develop at different rates, especially for children at the extremes of the distribution (who either are gifted or have developmental delay or mental retardation). Children of the same CA who have identical standard scores on two measures normed on the same sample may still have very different AE scores on these measures. As another example, consider performance on a measure of receptive single-word vocabulary, the Peabody Picture Vocabulary Test-III (PPVT-III; Dunn & Dunn, 1997), and a measure of expressive single-word vocabulary, the Expressive Vocabulary Test (EVT; Williams, 1997). These two measures were co-normed on the same extensive and carefully stratified sample. Nevertheless, a child with syndrome C aged 6;6 who earned the identical standard score of 55 on both the PPVT-III and the EVT would have AE scores that differ by 12 months: 2;7 for the PPVT-III and 3;7 for the EVT. Now consider a child of the same CA who has syndrome D and received the same AE score of 2;9 on both assessments. This child would earn a standard score of 57 on the PPVT-III but only 41 on the EVT, more than a full standard deviation lower. Assuming the findings for these two children are representative of individuals with their syndrome, the use of standard scores would correctly indicate a flat vocabulary profile for syndrome C and a relative strength in receptive vocabulary for syndrome D whereas the use of AE scores would erroneously lead to the very different conclusion that syndrome C is associated with a strength in expressive vocabulary and syndrome D with a flat profile.
Examples such as these illustrate the problem of using AE scores even when children are the same CA and the measures are normed on the same sample. These problems are due to the fact that AE scores do not constitute an interval scale and are subject to distortion due to patterns of developmental trajectories for different abilities. Standard scores derived from subtests of the same assessment or from measures that were co-normed are not susceptible to these problems.
⁴ If a researcher wishes to compare a pattern of strengths and weaknesses across measures for which there are no norms or which were normed on very different samples or many years apart, he or she can develop standard-score norms for the CA groups to be included in the research. Development of such norms is discussed in Mervis and Robinson (1999, 2003).
⁵ For the Recalling Sentences subtest, an AE of 4;6 is associated with a nine-point range of raw scores and a three-point range of scaled scores.
⁶ For the Formulated Sentences subtest, an AE of 4;6 is associated with a four-point range of raw scores and a two-point range of standard scores.
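The syndrome A example can be checked with a short script that measures the spread of a profile on each metric. This is a sketch under stated assumptions: the scaled scores and AEs are those given in the text for the CELF-4 example, the Formulated Sentences AE of 5;2–5;3 is represented by its lower bound, the scaled-score metric is assumed to have a mean of 10 and a standard deviation of 3 (consistent with a scaled score of 4 lying 2.00 standard deviations below the mean), and all function and variable names are hypothetical.

```python
def age_to_months(age):
    """Convert 'years;months' notation (e.g. '4;10') to a number of months."""
    years, months = age.split(";")
    return int(years) * 12 + int(months)


def scaled_to_z(scaled_score, mean=10, sd=3):
    """Express a scaled score in standard deviation units (assumed mean 10, SD 3)."""
    return (scaled_score - mean) / sd


# Syndrome A example from the text: subtest -> (scaled score, age equivalent).
SYNDROME_A = {
    "Concepts and Following Directions": (4, "4;10"),
    "Word Structure": (4, "4;2"),
    "Recalling Sentences": (4, "4;9"),
    "Formulated Sentences": (4, "5;2"),  # text gives 5;2–5;3; lower bound used
}


def profile_spreads(profile):
    """Return (range of scaled scores, range of AEs in months) across subtests."""
    scaled = [score for score, _ in profile.values()]
    ae_months = [age_to_months(ae) for _, ae in profile.values()]
    return max(scaled) - min(scaled), max(ae_months) - min(ae_months)


scaled_range, ae_range = profile_spreads(SYNDROME_A)
print(f"scaled-score range: {scaled_range}; AE range: {ae_range} months")
```

A scaled-score range of zero describes a flat profile, whereas the same four results expressed as AEs span a full year, which is the spurious jaggedness described above.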
THE CASE OF GROUPS THAT ARE NOT MATCHED FOR CA
AE scores are especially likely to be used when groups are not matched for CA. This situation most often occurs when researchers wish to compare the performance of one or more groups of children with developmental disabilities to that of younger children who are developing normally. In some studies involving multiple groups of children with disabilities, these groups also differ from each other in CA (e.g., Jarrold, Baddeley, & Phillips, 2002; Sigman & Ruskin, 1999). A better option than matching for AE when groups differ in CA is to match for raw score, which does not have the same problematic psychometric properties. After groups are matched on the control variable, they usually are compared for level of performance (AE or raw score) on the target variable. The assumption is that if the groups show the same relation between control and target variables and they are matched on the control variables, then they should not differ significantly on the target variable. There is a fundamental flaw in this logic, however: It is based on the assumption that rate of development is similar for the control and target variables. This assumption is often not correct. In such cases, differences in CA will confound the group-matching design because the differences in raw scores (and therefore in AE) will not be stable across CA. In these cases, the researcher should not predict that two groups with identical raw scores or AEs on the control variable but different CAs should have similar scores on the target variable.
The situation in which two or more variables are developing at different rates is quite common. To illustrate this issue, we examined the relations between AEs on different domains from the Vineland Adaptive Behavior Scale (VABS; Sparrow, Balla, & Cicchetti, 1984), a measure commonly used to assess the communication, socialization, and daily living skills of children with autism and other developmental disabilities. We selected an AE of 2;7 in the Communication domain and then compared this to the expected AEs for the Daily Living Skills and Socialization domains, for different CAs, based on the VABS norms. Thus, we treated Communication AE as the control variable and the AEs for the other subtests as target variables. We asked the question, based on the norms for the VABS, “If a child has an AE of 2;7 in the Communication domain, and all of his or her abilities are at the same level (i.e., the child earns the same standard score in each domain), what are his or her predicted AEs for the other domains?” As one would expect, if a child’s CA is 2;7 and his or her Communication AE is also 2;7, then AEs for the other subtests would be 2;7 as well (Figure 1). This hypothetical child is performing at the median level for his or her CA in Communication and would therefore be expected to perform at the median level on the other two domains, assuming he or she has equivalent abilities in each of the domains tested.

Fig. 1. Variability in expected AE for the VABS Daily Living Skills and Socialization domains as a function of CA, given a constant AE of 2 years 7 months for the Communication domain and the assumption that the child has earned identical standard scores in all domains.

But what happens when a child’s performance in Communication is below the median for his or her CA, as would be expected for most children with autism? As Figure 1 indicates, this relation varies as a function of which domain is the target variable. Relative to Communication AE, expected Daily Living Skills AE increases (at a variable rate) with CA. In contrast, expected Socialization AE is always lower than Communication AE. The shape of the Socialization AE curve is particularly disturbing in that it is not monotonic; AE decreases at a variable rate until age 8;7 and then begins to increase slowly as CA increases. Therefore, if Communication AE was the matching variable and Daily Living Skills AE was the target variable, then rather than predicting equivalent AE scores for the two domains, the researcher should predict that older children with a Communication AE of 2;7 (who are hence delayed in their communication development) would have stronger absolute levels (higher raw score or AE) of Daily Living Skills and weaker absolute levels of Socialization skills than younger children with the same Communication AE. If the participants all fall in a narrow age range reflecting more average functioning (e.g., CA between 2 and 3 years, with AE in Communication of 2;7), then the prediction that all groups should perform at the same absolute level (same raw score or AE) on the target variable(s) is reasonable. However, sampling from a large CA range (with children whose abilities are weaker than expected based on their CA) would invalidate the prediction that all groups should earn the same raw score (or AE) on the target variable(s). 
Thus, a difference in AE between groups of 2- and 7-year-olds would be predicted for both Daily Living Skills and Socialization, assuming the groups had been well-matched for AE in the Communication domain and the individuals in the group had the same level of ability (same standard score) for the constructs measured by the three domains. Figure 1 makes it clear that children whose Communication AE lags behind their CA (as is the case for most individuals with autism spectrum disorders) would be expected to show a relative strength in Daily Living Skills and a relative weakness in Socialization skills based on AE, even when they show no difference in skills based on standard scores. This is precisely the pattern that is generally seen in studies of adaptive behavior in autism spectrum disorders (e.g., Carpentieri & Morgan, 1996; Carter et al., 1998). In fact, in their multisite study of adaptive behavior of 684 individuals with autism, Carter et al. state, “The expected profile of a relative weakness in Socialization and relative strength in Daily Living Skills was obtained with age-equivalent but not standard scores” (p. 287). The authors attribute their failure to find the expected profile using standard scores to a basal effect in the standard scores of the autism group. However, this pattern of discrepancy between AEs and standard scores was observed in all groups, even though scores were near floor level only for the lowest functioning group. The properties of AEs on the VABS described above account better for the pattern of findings. As illustrated, evenly developed (relative to CA expectations) but delayed skills would be expected to show the pattern of AE differences across domains that we described, due to differences in the developmental trajectories of the domains measured by the VABS. Hence, it remains likely that the pattern of findings based on AE is simply consistent with delayed adaptive behavior rather than reflecting an autism-specific pattern of strengths and weaknesses.
The finding of variable developmental trajectories for the different domains of the VABS is not an isolated result. A similar pattern has been demonstrated for subtests of the Differential Ability Scales (DAS; Elliott, 1990), a full-scale measure of cognitive ability, using either verbal reasoning ability (Mervis & Robinson, 1999, 2003) or nonverbal reasoning (Mervis, 2004) as the control variable. The examples from the CELF-4, PPVT-III, and EVT presented earlier in this paper indicate the same difficulty with these measures. The problem is not actually with the various assessments, however; it is inherent in AE scores. 
A critical point that follows from these analyses is that even if groups that differ in CA are closely matched for raw score (or AE) on a control variable, one should not automatically predict that if the relation between the control variable and the target variable is the same for both groups, then both should be expected to earn the same raw score (or AE) on the target variable. This prediction follows only if the developmental trajectories of the control and target variables are parallel for the CAs included in the study. What one can assume, if groups differ in CA but are closely matched for raw score on the control variable, is that if the relation between the control variable and the target variable is the same for all of the groups, then participants should earn the same standard score on the target and control variables.7 In cases in which the developmental trajectories of the control and target variables are not parallel for the CAs included in the study, if the groups earn the same standard score on the target variable as on the control variable, the result will be that the groups earn different raw scores on the target variable, even though the relation between the control and target variables is the same for all of the groups.
7 That is, a given individual’s standard score on the target variable should be the same as his or her standard score on the control variable. Individuals in the younger group will earn higher standard scores than individuals in the older group.
Table II. Predicted EVT and PPVT-III Scores for Three Groups of Differing CA Closely Matched for EVT AE = 3;0
CA | EVT raw score corresponding to AE = 3;0 | EVT standard score for raw score = 33 | Correctly predicted PPVT raw score (based on EVT standard score) | Incorrectly predicted PPVT raw score (based on AE = 3;0) | PPVT standard score for incorrectly predicted PPVT raw score
3;0 | 33 | 101 | 39 | 38 | 100
5;0 | 33 | 73 | 34 | 38 | 77
7;0 | 33 | 40 | 22 | 38 | 55
EVT, Expressive Vocabulary Test; PPVT-III, Peabody Picture Vocabulary Test-III; CA, chronological age; AE, age equivalent.
Consider the example in Table II of three groups differing in CA, each of whom earned the same raw score of 33 on the EVT, corresponding to an AE of 3;0. Not surprisingly, the standard scores corresponding to this AE differ radically as a function of CA. As shown earlier in this paper, the developmental trajectories for expressive vocabulary and receptive vocabulary as measured by the EVT and PPVT-III are not parallel. Consequently, the correctly predicted PPVT-III raw scores are different for each of the three CA groups, based on the assumption that each group shows the same relation between receptive and expressive single-word vocabulary ability. If the three groups earned the same raw score on the PPVT-III, this would actually indicate that they do not show the same relations between receptive and expressive vocabulary (see the final column in Table II). In Table III, a second example is provided, this time with the control variable being receptive vocabulary and the target variable expressive vocabulary.
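The arithmetic behind Table II can be checked with a short sketch. The dictionary below only transcribes the values in Table II; the field and function names are illustrative labels of our own choosing, not terms from the EVT or PPVT-III manuals, and no norm tables are consulted.

```python
# Values transcribed from Table II; keys are CA groups (years;months).
# Field names are hypothetical labels chosen for this illustration.
TABLE_II = {
    "3;0": {"evt_raw": 33, "evt_ss": 101, "ppvt_raw_from_ss": 39,
            "ppvt_raw_from_ae": 38, "ppvt_ss_from_ae": 100},
    "5;0": {"evt_raw": 33, "evt_ss": 73, "ppvt_raw_from_ss": 34,
            "ppvt_raw_from_ae": 38, "ppvt_ss_from_ae": 77},
    "7;0": {"evt_raw": 33, "evt_ss": 40, "ppvt_raw_from_ss": 22,
            "ppvt_raw_from_ae": 38, "ppvt_ss_from_ae": 55},
}


def raw_score_error(ca: str) -> int:
    """AE-based PPVT-III raw score minus the raw score predicted
    from the EVT standard score, for one CA group."""
    row = TABLE_II[ca]
    return row["ppvt_raw_from_ae"] - row["ppvt_raw_from_ss"]


def implied_ss_discrepancy(ca: str) -> int:
    """PPVT-III standard score implied by the AE-based raw score
    minus the EVT standard score, for one CA group."""
    row = TABLE_II[ca]
    return row["ppvt_ss_from_ae"] - row["evt_ss"]


if __name__ == "__main__":
    for ca in TABLE_II:
        print(ca, raw_score_error(ca), implied_ss_discrepancy(ca))
```

For the 7;0 group, the AE-based prediction exceeds the standard-score-based prediction by 16 raw-score points (38 versus 22) and corresponds to a PPVT-III standard score 15 points above the EVT standard score (55 versus 40). For the 3;0 group, whose CA equals the matched AE, both discrepancies are only 1 point.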
Table III. Predicted PPVT and EVT Scores for Three Groups of Differing CA Closely Matched for PPVT-III AE = 3;0
CA | PPVT raw score corresponding to AE = 3;0 | PPVT standard score for raw score = 38 | Correctly predicted EVT raw score (based on PPVT standard score) | Incorrectly predicted EVT raw score (based on AE = 3;0) | EVT standard score for incorrectly predicted EVT raw score
3;0 | 38 | 100 | 32 | 33 | 101
5;0 | 38 | 77 | 35 | 33 | 73
7;0 | 38 | 55 | 41 | 33 | 40
EVT, Expressive Vocabulary Test; PPVT-III, Peabody Picture Vocabulary Test-III; CA, chronological age; AE, age equivalent.
Hence, the expected relation between the control variable and the target variable is affected by the developmental trajectories of the abilities in question. Researchers generally assume that the abilities are developing at the same rate throughout the CA range included in the study. However, as illustrated above, a second very common possibility is that the control and target variables do not actually develop at the same rate across the CA range included in the study. The impact of each of these scenarios on determination of group difference is further discussed below.
If the control variable and the target variable are indeed developing at the same rate during the CA range included in the study and both groups earn the same standard score on the target variable as they did on the control variable, then the raw scores will be identical for the groups, even though the younger group(s) will have a higher standard score. The raw score earned by the younger group reflects stronger performance. Consider the example in Table IV. As measured by the Picture Similarities and Verbal Comprehension subtests of the DAS-Preschool, nonverbal reasoning and receptive language are developing at the same rate over the CA range from 2;10 to 6;10. Therefore, as shown in the table, groups of children aged 2;10, 4;10, and 6;10 matched for Ability score (the DAS equivalent of raw score) on Picture Similarities would all be expected to earn the same Ability score on Verbal Comprehension (even though the T scores for the three groups differ substantially), assuming that all three groups showed the same relation between the control and target variables.
Table IV. Predicted DAS Picture Similarities and Verbal Comprehension Scores for Three Groups of Differing CA Closely Matched for Picture Similarities AE = 4;10
CA | PS ability (Ab) score corresponding to AE = 4;10 | PS T score for Ab score = 75 | Correctly predicted VC Ab score (based on PS T score) | Predicted VC Ab score (based on AE = 4;10) | VC T score for Ab score predicted based on AE = 4;10
2;10 | 75 | 76 | 121–122 | 121–123 | 76–77
4;10 | 75 | 50 | 121–122 | 121–123 | 50–51
6;10 | 75 | 36 | 121–122 | 121–123 | 36–37
DAS, Differential Ability Scales; PS, Picture Similarities; VC, Verbal Comprehension; CA, chronological age; AE, age equivalent.
If the control variable and the target variable are indeed developing at the same rate during the CA range included in the study and the groups earn significantly different raw scores on the target variable even though they were closely matched for raw score on the control variable, then it can legitimately be concluded that the relation between the control variable and the target variable is different for the groups. Interpretation of the difference should be based on comparisons of the standard scores for the control and target variables. There are several possible interpretations of such a group difference. These comparisons could indicate, among other possibilities, that the group with the lower raw score on the target variable has a relative weakness on that variable whereas the other group is performing at the expected level; that the group with the higher raw score has a relative strength on the target variable whereas the other group is performing at the expected level; that one group has a relative weakness on the target variable and the other group has a relative strength; or that both groups have a relative strength (or weakness) on the target variable, but one group shows a larger relative strength (or weakness).
If the control variable and the target variable are not developing at the same rate (as for receptive and expressive single-word vocabulary), then, as we have shown earlier, two groups that are closely matched on raw score on the control variable but differ in CA would be predicted to earn different raw scores on the target variable. Once again, determination of whether the relation between the control and target variables is similar for the two groups is best made based on standard scores.
Interpretation of differences follows the same principles as described above for the case in which the control and target variables are developing at the same rate.
FOCUSING ON SENSITIVITY AND SPECIFICITY OF DEVELOPMENTAL PROFILES AS A FUNCTION OF DISABILITY
A group difference in performance on a target variable, provided alternative explanations have been ruled out, is an initial indication that it may be possible to link performance on the target variable with the etiology of the target disorder. However, it is not sufficient to demonstrate that, as a group, individuals with the target disability have stronger (or weaker) abilities than other groups on the target variable; even highly significant group differences can occur when there is substantial overlap in the score distributions for the groups. Graphing the raw data is an excellent way to visualize the extent of this overlap. To determine if a characteristic on which groups differ significantly is likely to be important in differentiating individuals in the target group from members of other groups, signal detection theory (e.g., Kraemer, 1988; Siegel, Vukicevic, Elliott, & Kraemer, 1989) may be used to identify the cut-point that best separates the target group from other groups. Performance of individuals can then be examined to determine the proportion of people in the target group who meet the cut-point criterion (sensitivity, or Se) and the proportion of people in the contrast group(s) who do not (specificity, or Sp).
For example, we (Klein-Tasman & Mervis, 2003) recently examined the personality of 8-, 9-, and 10-year-olds with Williams syndrome using two parent-report measures: the Multidimensional Personality Questionnaire (Tellegen, 1985) and the Children’s Behavior Questionnaire (Rothbart & Ahadi, 1994; Rothbart, Ahadi, & Hershey, 1994). We first demonstrated that a group of children with Williams syndrome differed significantly from a CA- and IQ-matched group of children with other developmental disabilities in parental ratings of their personality characteristics. Next, we used signal detection theory to identify the personality profile that best differentiated individuals in the Williams syndrome group from individuals in the contrast group. The resulting personality profile, based in part on the group difference findings, was characteristic of 96% of the children with Williams syndrome (Se = 0.96) but only 15% of the children in the contrast group (Sp = 0.85). Hence, Se and Sp were used to move beyond group level differences and home in on the characteristic and distinctive personality of children with Williams syndrome at the level of the individual child. Characteristics that have high Se and Sp are most likely to be helpful in genotype–phenotype research.
This approach can be implemented even more powerfully by examining the Se and Sp of within-child strengths and weaknesses. An example is the Williams Syndrome Cognitive Profile (Mervis et al., 2000), which was used in the research implicating the hemizygous deletion of the LIM-kinase 1 gene, in transaction with other genes and the environment, in the extreme difficulty individuals with Williams syndrome have with visuospatial construction (Frangiskakis et al., 1996; Morris et al., 2003).
The distinctiveness of the characteristics studied (Sp) can differ dramatically, depending on the makeup of the contrast group(s). For instance, if the target group was individuals with autism and the contrast group was males with Fragile X syndrome, the Sp for mental retardation would be very low. If the contrast group was children who were developing normally, however, Sp would be very high. Therefore, Sp can only be appropriately determined if a large contrast group (or groups) is used, which includes adequate representation of those groups that are most likely to be similar to the target group on the variables of interest.
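The Se and Sp calculations described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the studies cited: it assumes a single numeric score on which higher values meet the criterion, and it selects the cut-point by maximizing Se + Sp - 1 (Youden's index), a simple stand-in for the quality indices discussed by Kraemer (1988). The function names are hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    target_scores: Sequence[float],
    contrast_scores: Sequence[float],
    cut_point: float,
) -> Tuple[float, float]:
    """Return (Se, Sp) for one cut-point.

    Se: proportion of the target group scoring at or above the cut-point.
    Sp: proportion of the contrast group scoring below the cut-point.
    Assumes that higher scores meet the criterion.
    """
    if not target_scores or not contrast_scores:
        raise ValueError("both groups must contain at least one score")
    se = sum(s >= cut_point for s in target_scores) / len(target_scores)
    sp = sum(s < cut_point for s in contrast_scores) / len(contrast_scores)
    return se, sp


def best_cut_point(
    target_scores: Sequence[float],
    contrast_scores: Sequence[float],
) -> float:
    """Return the observed score that maximizes Se + Sp - 1.

    Youden's index is an assumption made for this sketch; ties are
    resolved in favor of the lowest candidate cut-point.
    """
    candidates = sorted(set(target_scores) | set(contrast_scores))
    return max(
        candidates,
        key=lambda c: sum(
            sensitivity_specificity(target_scores, contrast_scores, c)
        ),
    )
```

Because Sp depends on who is in the contrast group, the same cut-point can yield very different Sp values when `contrast_scores` is drawn from a different population, which is the point made above about the makeup of the contrast group(s).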
CONCLUSIONS
Matching designs have traditionally been important in characterizing disorder phenotypes. In this paper, we presented a number of important guidelines for matching participant groups that contribute to the validity of interpretation of group differences. First, we illustrated the importance of using a stringent criterion for accepting the null hypothesis that the groups do not differ on the control variable(s). Second, we underscored the importance of using standard scores to measure performance on the control and target variables and cautioned strongly against using AE scores due to their measurement properties. Relatedly, we discussed the impact of group differences in CA on interpretation of findings of significant between-group differences on the target variable and stressed that often significant differences should be predicted even when groups are well-matched for the control variable(s); these differences frequently are due to variability in the developmental trajectories of different abilities. Finally, we suggested that a significant difference between well-matched groups is only a first step in making the case that a certain level of performance is characteristic of members of a particular group and atypical for members of other groups. Once a significant between-group difference is found, it is important to examine the applicability of the characteristic to the individual members of both the target group and the contrast group(s), as measured by Se and Sp. Given the expected heterogeneity in the genetic basis of autism (Cook, 1998), it may be unlikely that performance on one or two target variables will yield high Se and Sp for a group of children with autism. However, combined with close attention to the issues of effective matching discussed above, such an approach may help to identify subtypes of the disorder that may share genetic etiology.
ACKNOWLEDGMENTS
Preparation of this manuscript was supported by Grant No. NS35102 from the National Institute of Neurological Disorders and Stroke and by Grant No. HD29957 from the National Institute of Child Health and Human Development. We thank Byron Robinson for discussions regarding many of the methodological issues discussed in this chapter and Joanie Robertson for preparation of Fig. 1.
REFERENCES
Carpentieri, S., & Morgan, S. B. (1996). Adaptive and intellectual functioning in autistic and nonautistic retarded children. Journal of Autism and Developmental Disorders, 26, 611–620.
Carter, A. S., Volkmar, F. R., Sparrow, S. S., Wang, J-J., Lord, C., Dawson, G., Fombonne, E., Loveland, K., Mesibov, G., & Schopler, E. (1998). The Vineland Adaptive Behavior Scales: Supplementary norms for individuals with autism. Journal of Autism and Developmental Disorders, 28, 287–302.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cook, E. H. (1998). Genetics of autism. Mental Retardation and Developmental Disabilities, Research Reviews, 4, 113–120.
Dunn, L. E., & Dunn, L. E. (1981). Peabody Picture Vocabulary Test-Revised. Circle Pines, MN: American Guidance Service.
Dunn, L. E., & Dunn, L. E. (1997). Peabody Picture Vocabulary Test, 3rd ed. Circle Pines, MN: American Guidance Service.
Elliott, C. D. (1990). Differential Ability Scales. San Diego: Harcourt Brace Jovanovich.
Frangiskakis, J. M., Ewart, A. K., Morris, C. A., Mervis, C. B., Bertrand, J., Robinson, B. F., Klein, B. P., Ensing, G. J., Everett, L. A., Green, E. D., Pröschel, C., Gutowski, N., Noble, M., Atkinson, D. L., Odelberg, S., & Keating, M. T. (1996). LIM-kinase 1 hemizygosity implicated in impaired visuospatial constructive cognition. Cell, 86, 59–69.
Fenson, L., Dale, P. S., Reznick, J. S., Thal, D., Bates, E., Hartung, J. P., Pethick, S., & Reilly, J. S. (1993). MacArthur Communicative Development Inventories: User’s guide and technical manual. San Diego, CA: Singular.
Frick, R. W. (1995). Accepting the null hypothesis. Memory & Cognition, 23, 132–138.
Harcum, E. R. (1990). Methodological vs. empirical literature: Two views on the acceptance of the null hypothesis. American Psychologist, 45, 404–405.
Jarrold, C., Baddeley, A. D., & Phillips, C. E. (2002). Verbal short-term memory in Down syndrome: a problem of memory, audition, or speech? Journal of Speech, Language, & Hearing Research, 45, 531–544.
Klein, B. P., & Mervis, C. B. (1999). Cognitive strengths and weaknesses of 9- and 10-year-olds with Williams syndrome or Down syndrome. Developmental Neuropsychology, 16, 177–196.
Klein-Tasman, B. P., & Mervis, C. B. (2003). Distinctive personality characteristics of children with Williams syndrome. Developmental Neuropsychology, 23, 271–292.
Kraemer, H. C. (1988). Assessment of 2 × 2 associations: Generalization of signal-detection methodology. The American Statistician, 42, 37–49.
McCarthy, D. (1972). McCarthy Scales of Children’s Abilities. New York: The Psychological Corporation.
Mervis, C. B. (2004). Cross-etiology comparisons of cognitive and language development. In M. L. Rice & S. Warren (Eds.), Developmental language disorders: From phenotypes to etiologies (pp. 153–185). Mahwah, NJ: Lawrence Erlbaum.
Mervis, C. B., & Robinson, B. F. (1999). Methodological issues in cross-syndrome comparisons: Matching procedures, sensitivity (Se), and specificity (Sp). Commentary on M. Sigman & E. Ruskin, Continuity and change in the social competence of children with autism, Down syndrome, and developmental delays. Monographs of the Society for Research in Child Development, 64(256), 115–130.
Mervis, C. B., & Robinson, B. F. (2003). Methodological issues in cross-group comparisons of language and/or cognitive development. In Y. Levy & J. Schaeffer (Eds.), Language competence across populations: Toward a definition of specific language impairment (pp. 233–258). Mahwah, NJ: Lawrence Erlbaum.
Mervis, C. B., Robinson, B. F., Bertrand, J., Morris, C. A., Klein-Tasman, B. P., & Armstrong, S. C. (2000). The Williams Syndrome Cognitive Profile. Brain and Cognition, 44, 604–628.
Morris, C. A., Mervis, C. B., Hobart, H. H., Gregg, R. G., Bertrand, J., Ensing, G. J., Sommer, A., Moore, C. A., Hopkin, R. J., Spallone, P., Keating, M. T., Osborne, L., Kimberley, K. W., & Stock, A. D. (2003). GTF2I hemizygosity implicated in mental retardation in Williams syndrome: Genotype/phenotype analysis of 5 families with deletions in the Williams syndrome region. American Journal of Medical Genetics, 123A, 45–59.
Paterson, S. (2001). Language and number in Down syndrome: The complex developmental trajectory from infancy to adulthood. Down Syndrome Research and Practice, 7, 79–86.
Robinson, B. F., & Mervis, C. B. (1998). Disentangling early language development: Modeling lexical and grammatical development using an extension of case-study methodology. Developmental Psychology, 34, 363–375.
Rothbart, M. K., & Ahadi, S. A. (1994). Temperament and the development of personality. Journal of Abnormal Psychology, 103, 55–66.
Rothbart, M. K., Ahadi, S. A., & Hershey, K. L. (1994). Temperament and social behavior in childhood. Merrill-Palmer Quarterly, 40, 21–39.
Semel, E., Wiig, E. H., & Secord, W. A. (2003). Clinical Evaluation of Language Fundamentals, 4th ed. San Antonio, TX: Psychological Corporation.
Siegel, B., Vukicevic, J., Elliott, G. R., & Kraemer, H. C. (1989). The use of signal detection theory to assess DSM-IIIR criteria for autistic disorder. Journal of the American Academy of Child & Adolescent Psychiatry, 28, 542–548.
Sigman, M., & Ruskin, E. (1999). Continuity and change in the social competence of children with autism, Down syndrome, and developmental delays. Monographs of the Society for Research in Child Development, 64(256), 1–114.
Sparrow, S. S., Balla, D. A., & Cicchetti, D. V. (1984). Vineland Adaptive Behavior Scales: Interview Edition. Circle Pines, MN: American Guidance Service.
Tellegen, A. (1985). Structures of mood and personality and their relevance to assessing anxiety, with an emphasis on self-report. In A. H. Tuma & J. D. Maser (Eds.), Anxiety and the anxiety disorders (pp. 681–716). Hillsdale, NJ: Lawrence Erlbaum.
Williams, K. T. (1997). Expressive Vocabulary Test. Circle Pines, MN: American Guidance Service.