Cognitive Validity of Test Items and Scores

What Are You Thinking? Postsecondary Student Think-Alouds of Scientific and Quantitative Reasoning Items

Amy D. Thelk, Emily R. Hoole, and Susan M. Lottridge
James Madison University


Introduction

Assessment, a key component of program evaluation, is often accomplished through multiple-choice testing. This method has many advantages (mostly related to efficiency), but one significant drawback is the lack of information about how items are interpreted, reacted to, and solved by the examinee. Greater understanding of the test-taker experience provides greater evidence for the validity of the scores produced by an instrument. Here, validity means that a test measures what it is intended to measure. Throughout the different stages of assessment (item development, test administration, and interpretation of results), attention to validity is our professional duty and our ethical responsibility. For instance, the consequences of testing (consequential validity) have received more press in recent years, drawing attention to the uses and abuses of test results. Before one can even consider reporting on test data, test score validity (structural validity) must be examined through statistical analysis and evaluation of administrative conditions. To further deconstruct the idea of validity evidence, one cannot trust validity data at the test level until test-taker performance on each individual item (cognitive validity) has been considered. Our research targets validity at this "building-block" level (Ferrara, Duncan, Perie, Freed, McGivern, & Chilukuri, 2003). Through the use of think-aloud procedures, we can make well-informed statements about the cognitive validity of the items used for measuring scientific and quantitative reasoning.

We believe this research is quite valuable: to date, a literature search fails to reveal other research on think-alouds with a postsecondary population in the area of scientific reasoning. Additionally, cognitive validity is an essential, and often overlooked, area when validity studies are being conducted. Finally, the novice-expert design in our research is unique in that we used pre- and post-treatment groups to distinguish the two skill levels. This process can endorse change-over-time results, thereby attesting to the value of higher education and providing a key component for program evaluation.

Validity has long been recognized as the most important aspect of testing and psychological assessment (Standards for Educational and Psychological Testing, 1999). Whenever a test user wishes to make an inference from test scores, the validity of those inferences must be verified. Current conceptions of validity are best represented by Messick's (1989) unified theory, which placed all other types of validity under the umbrella of construct validity. In this framework, all evidence provided strengthens the argument that the construct of interest is actually the construct the scores represent. A key point in understanding validity is the realization that it is not the test which is valid or invalid, but the test scores and the proposed inference the test user wishes to make that are valid or invalid.

Given that validity is so important in testing, researchers have developed a wide array of techniques to investigate it. Messick (1995), father of the modern conceptualization of validity, provided six methods of investigating the construct validity of scores. Content validity studies investigate whether test items appear to be measuring the construct of interest. This is often examined through content experts' review of test items or use of a test blueprint during development to match each item to its desired content area. The use of a test blueprint also helps to eliminate construct underrepresentation. Substantive validity studies deal with the theoretical foundation underlying the construct of interest. Empirical evidence must be gathered which demonstrates that test-takers are actually using the processes which theory states they should be using. Structural validity is investigated by examining the interrelationships of dimensions measured by the test and their relationship with the construct of interest, along with possible score implications. Generalizability deals with the extent to which a test user has confidence
that the performance observed on these items adequately represents performance on the construct, and whether the results observed would generalize across groups, settings, and tasks. Within external validity reside convergent and discriminant validity as well as criterion and predictive validity. These methods investigate the extent to which test scores relate to scores on similar constructs (convergent) and the extent to which test scores do not relate to a construct theoretically assumed to be different (discriminant). Criterion validity and predictive validity convey the degree to which the scores relate to an external standard (criterion) and the degree to which the scores provide accurate prediction of future performance (predictive). Consequential validity is described by Messick as investigation of the impact of using invalid scores for decision-making purposes, or for purposes for which the score interpretation is inappropriate.

There are two major threats to score validity: construct underrepresentation and construct-irrelevant variance. Construct underrepresentation occurs when the test fails to adequately represent the breadth and depth of the construct, if the intent is to generalize the scores to a broader domain. Construct-irrelevant variance is a threat when answers to questions may be influenced (positively or negatively) by constructs other than the one we are interested in measuring. An example would be when we are interested in measuring math ability, yet the reading load influences people's ability to demonstrate their knowledge. Another example is when test takers eliminate options on a multiple-choice item and arrive at the correct response despite not possessing the knowledge required to arrive at a correct answer.

Cognitive validity becomes important when the construct of interest is demonstrated through the use of cognitive skills, abilities, and/or processes. Standard 1.8 of the Standards for Educational and Psychological Testing states, "If the rationale for test use or score interpretation depends upon the premises about the psychological processes or cognitive operations used by
examinees, then theoretical or empirical evidence in support of those premises should be provided" (p. 19). Cognitive validity relates directly to Messick's evidence for substantive validity in articulating and testing the theoretical assumptions regarding the skills and abilities used when answering test items. The key idea is that test developers and users need to verify that the assumed processes are actually used by test-takers, as opposed to contradictory processes (such as option elimination) that introduce construct-irrelevant variance into the scores. The National Research Council (2001), in "Knowing What Students Know," strongly advocates for investigation of the connection between cognition, observation, and interpretation.

The idea of investigating the cognitive validity of test scores is still in its early stages, though a variety of frameworks have been utilized. Baxter and Glaser (1998) investigated the cognitive complexity of science performance assessments, specifically looking at how the performance of high scorers differs in cognitive quality from that of low scorers. The framework they developed is composed of three steps: specification of intended task demands, investigation of the inferred cognitive activity, and congruence with the scores obtained. Baxter and Glaser specified the intended task demands within a content-process space, in which content is rated from "lean" to "rich" and the process spans from "open" to "constrained." The inferred cognitive activity is investigated through the use of verbal protocols to gather evidence on the methods used by test takers to solve the tasks. Ruiz-Primo, Schultz, Li, and Shavelson (1999) used Baxter and Glaser's framework to investigate the cognitive validity of three types of concept maps: a construct-a-map, a fill-in-the-nodes map, and a fill-in-the-lines map. The task demands were explicated based upon the degree of directedness inherent in the maps, with the construct-a-map having low directedness and the two fill-in maps having a high degree of directedness. They used verbal reports to gather evidence on the inferred cognitive activity, and
compared task demands and the verbal reports to the observed scores. The results they obtained indicate that low directedness in the task (i.e., the construct-a-map) required more content knowledge than the high-directedness techniques, while high directedness led to greater self-monitoring during the task. They concluded that high task directedness caused an overestimation of students' knowledge. This provides direct evidence of invalidity regarding the inference to be made from test scores on the fill-in-the-nodes and fill-in-the-lines techniques. A similar study (Ruiz-Primo, Schultz, Li, & Shavelson, 1998) using two concept-mapping techniques, again varying in directedness, produced the same results.

Baxter, Glaser, and Raghavan (1993) applied their framework to three science performance tasks: exploratory investigation, conceptual integration, and component identification. They compared the verbal protocols collected with direct observations of performance, reviews of the test booklets, and a comparison of the scoring rubric to the intended and actual cognitive processes. Their study resulted in a series of guidelines for task development to increase cognitive validity in performance tasks: 1) using procedurally open-ended items; 2) designing items to draw on content area knowledge; 3) requiring the application of knowledge; 4) developing rubrics that clearly match the task expectations; 5) employing scoring methods which are sensitive to meaningful use of knowledge; and 6) scoring in a manner which clearly captures the process students engage in during the task.

Ayala, Shavelson, and Ayala (2001) studied three science performance assessments using Baxter and Glaser's framework, employing three additional dimensions: basic knowledge and reasoning, spatial mechanical reasoning, and quantitative science reasoning. One task was used to measure each dimension, targeted to the content-rich, process-open quadrant of the content-process space. A novice/expert design was utilized in the verbal protocols. Their purpose was to "tease
out the reasoning needed to complete science performance assessments, testing the validity of these cognitive (reasoning) claims" (p. 2). Differences in reasoning demands were observed. All items drew on basic knowledge and reasoning, with the expected task drawing on spatial reasoning but also on more basic knowledge and reasoning than expected. Participants demonstrated quantitative science reasoning in the third task, as expected.

A second framework developed to investigate the cognitive validity of test items and scores is the Assessment Square (Ruiz-Primo, Shavelson, Li, & Schultz, 2001), modified from the Assessment Triangle (Pellegrino, Chudowsky, & Glaser, 2001).

______________________________________________________________________________
[Figure not reproduced. Labels in the original figure: Construct; Warranted Inference?; Conceptual Analysis; Statistical and/or Qualitative Analysis; Logical Analysis; Cognitive Analysis; Assessment; Observation.]

Figure 1. The Assessment Square
______________________________________________________________________________

Use of the Assessment Square begins with the construct of interest in the top left-hand corner. From this initial construct, a conceptual analysis is undertaken to develop a working definition of the construct along with a detailed description of the domain areas to be covered and the types of behaviors to be produced. At this stage, possible tasks, theorized to elicit the desired
response, would be developed. After the tasks are developed, a logical analysis details the intended task demands to determine whether the tasks will logically produce the behaviors the test developer is seeking. In the bottom left-hand corner of the square is the actual assessment, the tasks developed to generate the desired display of skills, ability, and knowledge. The cognitive analysis is used to compare the observed scores to the expected scores on the tasks. The observed cognitive activity gathered through think-alouds is compared to the expected cognitive activities conceptualized in the logical analysis. The final two processes of the cognitive analysis look at the relationship between the observed scores on the tasks and the cognitive activity. The ability of the scoring procedure to reflect the quality of the cognitive activity is investigated, and the relationship between the observed scores on these tasks and associated and divergent constructs is examined. Observation is the third corner of the Assessment Square, and the cognitive analysis links the assessment corner to the observation corner. After the cognitive validity of the test scores is investigated, a decision is made regarding the inferences to be made from the assessment scores.

Shavelson, Ruiz-Primo, Li, and Ayala (2003) applied the Assessment Square framework to an analysis of the Third International Mathematics and Science Study (TIMSS), using achievement as the construct of interest. Achievement was operationally defined in the conceptual analysis as declarative knowledge (what), procedural knowledge (how), schematic knowledge (why), and strategic knowledge (when and why). Items were coded for task demands, inferred cognitive demands, item openness (indicating constructed response or selected response), and complexity (an indication of the cognitive load imposed by such factors as reading or language). Both think-alouds and retrospective interview questions were used in a design combining novices and experts. Eighty-five percent agreement was realized between the expected item
characteristics and the actual item characteristics. In their second example, concept maps were used in which the key characteristics were process-open (construct-a-map) and process-constrained (fill-in-a-map). Again, an expert/novice design was used to compare the quality of the cognitive activity elicited. They expanded the coding for the verbal protocols, adding microlevel and macrolevel codes. Microlevel codes include explanations, monitoring, conceptual errors, and inapplicable events (e.g., "I am reading the instructions again"). Macrolevel coding reflects cognitive activity related to planning and strategy. Results from the study indicate that the amount of monitoring and the number of explanations verbalized by the test-takers were related to ability level, with higher-ability individuals displaying a lower amount of monitoring and a higher number of explanations, and lower-ability individuals engaging in a higher level of monitoring while providing fewer explanations. Once again, support was found for the validity of the construct-a-map, but less support for the cognitive validity of the fill-in-the-map technique.

A third framework, from Snow, Corno, and Jackson (1996), incorporates more than just cognitive processes. Their framework combines cognitive, affective, and conative concepts into an "aptitude complex." The cognitive component contains the knowledge, perceptions, and reasoning of the test-taker, while the affective folds in emotions, moods, and temperament, and the conative adds motivation and volition. Test-takers exhibit this aptitude complex through two pathways, the performance pathway (cognitive processes) and the commitment pathway (affective and conative processes). This taxonomy encourages test developers and users to recognize performance as driven by all three processes, with the validity of the inferences dependent upon all three. For example, a student who is not motivated to do well is unlikely to engage in the desired cognitive processes required by the item. In this case, our inferences may not be valid regarding what this specific student knows and can do.


The final framework, and the most informative for our study, is from Ferrara, Duncan, Perie, Freed, McGivern, and Chilukuri (2003). These researchers argue that inferences made about test scores are based upon item-level responses as the "building blocks" of total test scores. Therefore, an item must properly assess what it is meant to measure in order for the test score to be valid. Their framework encompasses three steps: determine the intended content area knowledge, skills, and processes as specified by the test developer; establish the enacted content area knowledge, skills, and processes as coded by the researcher; and collect evidence regarding the actual content area knowledge, skills, and processes via think-aloud interviews.

Figure 2. Ferrara, Duncan, Perie, Freed, McGivern and Chilukuri (2003) Framework

Within this framework the researchers have also incorporated Snow's aptitude complex of cognitive, affective, and conative components.

Item coding for the intended content area knowledge, skills, and processes is based upon the test developer's specifications; two independent researchers code the enacted content area knowledge, skills, and processes. Content area knowledge, or declarative knowledge, is classified by the science discipline (e.g., earth science or chemistry). Content area skills, or procedural knowledge, are coded based upon the responses required of the students. For example, does the item require the student to apply knowledge, answer and explain, analyze, categorize, or describe? Does the item require the student to think beyond the current context? Broader cognitive processes are defined as processes relevant but not specific to science, such as memory and language demands, metacognition, visualization or interpretation of graphics, and needing to experience empathy or alternative viewpoints. The fourth coding category is response strategy opportunities, such as elimination of response options or guessing.

The framework used in our study is modified from Ferrara, Duncan, Perie, Freed, McGivern, and Chilukuri (2003), with a slight wording change from enacted content area knowledge, skills, and processes to anticipated content area knowledge, skills, and processes. This change was made to better reflect the fact that the researcher codes the item based upon a logical analysis of task demands, making the coding what the researcher anticipates test-takers will use, rather than enacted, which implies the actual performance of task demands. The study is designed with item and test improvement in mind, to identify ancillary skills which could lead to construct-irrelevant variance and threaten the interpretation of the test scores. We also hoped to find little discrepancy between the intended, anticipated, and actual content area knowledge, skills, and processes elicited by the items. The study also utilized a novice/expert design, with incoming freshmen serving as the "novices" (as they had not started any college coursework in scientific and quantitative reasoning) and second-semester sophomores/juniors who had taken targeted general-education classes serving as the "experts." Through the study we wished to investigate the difference between freshmen and sophomores with regard to cognitive sophistication and results (number correct), with the expectation that sophomores would display greater scientific and quantitative reasoning than incoming freshmen.

Method

At James Madison University, the site of our research, general education is part of the undergraduate experience. The coursework is divided into five "clusters" and is a required part of every curricular plan. According to the JMU website on general education, "This core of knowledge, skills, and experiences transcends every major and professional program and is essential for successful and rewarding careers and lives." Assessment of general education clusters is achieved through university-wide assessment sessions, held twice a year. At the first (fall-semester) session, incoming first-year students' baseline knowledge is measured prior to their participating in college coursework. At the second (spring-semester) assessment day, second-year students (actually, any students who have completed between 45 and 60 credits) complete cluster tests, and the data are then interpreted from a change-over-time perspective. This information is used in university, state, and federal reporting to support the value-added facet of higher education, and is used as feedback to improve programs.

Natural World-6 (NAW-6) Test

The Natural World-6 exam (NAW-6) was first administered in the fall of 2003 and given for the second time in the spring of 2004. This 80-item multiple-choice instrument is intended to measure the scientific and quantitative reasoning skills of postsecondary students. For this study, 10 items were chosen from the NAW-6. These items represent a cross-section of the intended content areas developed by the JMU faculty for measuring scientific and quantitative reasoning (Appendix A) and, through piloting, were determined to lend themselves best to the think-aloud process.

Participants and Sampling

For the fall 2003 sample, freshmen (N = 30) were randomly selected and e-mailed with a request for participation. Of these, about half responded, but many of those students were wary of the study or uncomfortable with the alternative to assessment day and chose not to participate. Of the sample who performed think-alouds (N = 7), one of the audio-tapes could not be transcribed, leaving us with a final freshman sample of N = 6. The spring 2004 sample was also randomly drawn. These second-year students were contacted via e-mail and informed that their Assessment Day requirement was to participate in the study; students could decline participation and complete the available make-up assessment activities if they wished. Of the 30 students asked to participate, 7 opted for the alternative activity, and two of the audio-tapes could not be transcribed, leaving us with a final sample of N = 21 sophomores.

Think-aloud Procedures

Interviews were conducted beginning the week preceding university-wide testing and continued through the week after. As per the instructions contained in the e-mail, students called in for an appointment time. Four female researchers conducted the interviews. At the think-aloud session, the researcher introduced herself and had the student sign in (to get credit for attendance). After the purpose and methods were briefly explained, the researcher demonstrated the process of "thinking aloud" using a test item specifically chosen for its dissimilarity to subsequent items in the verbal protocol task (to reduce a priming effect). The participant then completed a warm-up activity to practice verbalizing the problem-solving procedure. The warm-up was chosen carefully to reduce the likelihood that it would influence student responses on later items (a practice effect). When the participant had completed this warm-up and had no further questions, the first test item, printed on a sheet of paper, was presented. For each item presented, the student was asked to read the item out loud and then to speak aloud while tackling the problem. Standard interventions included prompting students to think out loud if they appeared to be working the problem but not talking ("Can you talk about what you are doing now?"). After each item was answered, two structured interview questions were asked: "Can you summarize for me how you arrived at the answer?" (retrospective reporting) and "Can you tell me where you learned how to solve this type of problem?" Responses to this second question were collected as further evidence of whether the test items reflect learning acquired in the general education program the items were created to assess. Beginning with the first warm-up item, researchers audio-taped responses and took notes about nonverbal activity such as gestures and expressions. This standardized process is based upon the protocol developed by Ericsson and Simon (1993). To provide codes for the anticipated content area knowledge and skills, two faculty members (both considered to be experts in the area of quantitative and scientific reasoning) performed think-alouds in a manner identical to that of the students.


Coding of Think-Aloud Sessions

Results from the interviews (actual content area knowledge and skills) were compared with the test specifications used during item development (intended content area knowledge and skills) and with the anticipated content area knowledge and skills (based on coding of the items by two neutral experts). Following the completion of all think-aloud sessions, the audio-tapes were transcribed and printed out by the first researcher. The other two researchers were given copies of the transcripts and entered what they determined to be the most appropriate codes for each item in a grid (Figure 3).

Figure 3
Item coding grid
_____________________________________________________________________________
Transcript ________________

Item #     Content Area Skills     Broader Cognitive Processes     Examinee Response Strategies
1
2
3
4
5
6
7
8
9
10
_____________________________________________________________________________

The first researcher then compared the codes. When the raters disagreed, as inevitably happened, the codes assigned by the rater with much greater familiarity with the research study were used. If neither rater was able to assign a single code (difficulty choosing between two or more options) and none of the options matched, the first researcher reviewed the transcript and made a judgment about which code should be used as the final rating.
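The resolution rule just described can be expressed as a short sketch. This is an illustration only, not software used in the study (the comparisons were done by hand from printed transcripts); the function name, the use of sets of candidate codes, and the tie-breaking when the raters' candidates overlap are assumptions made for the example.

```python
from typing import Callable, Optional, Set


def resolve_item_code(familiar_rater: Set[str], other_rater: Set[str],
                      adjudicate: Callable[[], str]) -> Optional[str]:
    """Resolve the final code for one item from two raters' candidate codes.

    familiar_rater holds the candidate code(s) of the rater with greater
    familiarity with the study; her choice wins a simple disagreement.
    adjudicate() stands in for the first researcher re-reading the transcript.
    """
    if familiar_rater == other_rater and len(familiar_rater) == 1:
        return next(iter(familiar_rater))  # both raters settled on the same single code
    if len(familiar_rater) == 1 and len(other_rater) == 1:
        return next(iter(familiar_rater))  # simple disagreement: defer to the familiar rater
    shared = familiar_rater & other_rater
    if shared:
        return sorted(shared)[0]           # assumed: a shared candidate counts as a match
    return adjudicate()                    # no candidates match: adjudicate from the transcript


# Hypothetical usage: one rater wavered between two codes, the other chose one of them.
print(resolve_item_code({"11b"}, {"11b", "29"}, adjudicate=lambda: "29"))  # -> 11b
```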

Analysis

1. The intended content area is based on the general education objectives for the cluster the NAW-6 assesses. These objectives are listed in Appendix A and were drafted in advance of test item development. On the 80-item NAW-6, items are distributed approximately evenly across the objectives. Test items were either created to assess a particular objective or later paired with an objective through back translation (Smith & Kendall, 1963; Dawis, 1987). The back translation process involves matching test items to objectives and then convening with others who have completed the same task and comparing results. Discrepancies are discussed until the group is in total agreement about item-objective pairings.

2. The anticipated content area was developed based on performance on the think-aloud task by content experts (two professors at our university).

3. Finally, the actual content area is based on the results of the think-alouds.

For each item, an alignment summary was completed (Figure 4). Items were deemed "ALIGNED" if the modal (most frequently occurring) code for all three groups (experts, freshmen, and sophomores) matched. Assigning an item a "Partially ALIGNED" code involved some judgment: the ratings partially matched in some manner. Finally, items were determined to be "Not ALIGNED" when the most commonly occurring answers in each cell did not intersect. The alignment summary grid is presented below; the 10 completed alignment summaries (one for each of the 10 items) can be found in Appendix B.
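A minimal sketch of this classification rule is given below. It is illustrative only: the function name and the use of plain code lists are assumptions, and because "Partially ALIGNED" was ultimately a judgment call in the study, the sketch merely flags the in-between cases rather than deciding them.

```python
from collections import Counter
from typing import List, Set


def modal_codes(codes: List[str]) -> Set[str]:
    """Return the most frequently occurring code(s) for one group (ties are kept)."""
    counts = Counter(codes)
    top = max(counts.values())
    return {code for code, n in counts.items() if n == top}


def alignment_status(experts: List[str], freshmen: List[str],
                     sophomores: List[str]) -> str:
    """Classify one row of an alignment summary.

    ALIGNED: a single modal code is shared by experts, freshmen, and sophomores.
    Not ALIGNED: the three groups' modal codes do not intersect at all.
    Anything in between is flagged for the judgment call the researchers made.
    """
    modes = [modal_codes(experts), modal_codes(freshmen), modal_codes(sophomores)]
    if modes[0] == modes[1] == modes[2] and len(modes[0]) == 1:
        return "ALIGNED"
    if not (modes[0] & modes[1] & modes[2]):
        return "Not ALIGNED"
    return "Partially ALIGNED (judgment required)"


# Hypothetical codings for one item's Content Area Skills row.
print(alignment_status(["11f", "11f"],
                       ["11f"] * 5 + ["13f"] * 2,
                       ["11f"] * 12 + ["13f"] * 6))  # -> ALIGNED
```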


Figure 4
Alignment Summary Grid
______________________________________________________________________________
Item: Brief description of item          Objective: Analogous to intended content area

(Row)                                   | Anticipated (N = 2)                      | Actual: Freshmen                            | Actual: Sophomores                          | Alignment result
Where learned¹                          | Answers given by our experts             | List of answers given and N for each answer | List of answers given and N for each answer | Aligned, partially aligned, or misaligned?
Group B: Science Skills                 | Coded responses from expert think-alouds | Codes endorsed and by how many students     | Codes endorsed and by how many students     | Aligned, partially aligned, or misaligned?
Group C: Broader Cognitive Processing   | Coded responses from expert think-alouds | Codes endorsed and by how many students     | Codes endorsed and by how many students     | Aligned, partially aligned, or misaligned?
Group D: Examinee Response Strategies   | Coded responses from expert think-alouds | Codes endorsed and by how many students     | Codes endorsed and by how many students     | Aligned, partially aligned, or misaligned?

¹ "Where do you think you learned how to do an item like this one?"
______________________________________________________________________________

Results and Discussion

Novice-expert design

Due to the low number of participating first-year students, we decided to eliminate the comparison of novice and expert students on the number of correctly answered items and the types of strategies employed in the think-aloud sessions. Even with a larger sample, we are skeptical that our results would demonstrate much difference between the freshman and sophomore samples, because so few students finish their related general-education coursework in science and math prior to the second semester of their sophomore year. A much larger or more deliberately selected sample is needed to investigate the expected changes in scientific and quantitative reasoning ability, due to participation in general-education coursework, from the freshman to the sophomore/junior year. This is a general problem faced by the Cluster 3 exam, The Natural World, at JMU. A variety of methods are now being employed at the university to encourage and facilitate earlier completion of cluster coursework by students. Since assessment at JMU is focused on evaluating the value-added aspect of higher education, it is important that the items adequately reflect the changes that occur in knowledge, skills, and abilities. It is also important that students be in a position to demonstrate the hypothesized changes in learning.

Comparisons between intended, anticipated, and actual knowledge, skills, and processes

The item grids comprising Appendix B summarize the data for each item. The first row contains responses given to the question "Where did you learn how to do this type of item?" The test developers had hoped to target knowledge gained through attending general education courses. However, experienced students (sophomores) most commonly invoked knowledge garnered earlier in their academic careers rather than during college. It is unclear whether the statements by the sophomores indicate that the test is not actually accessing knowledge from college-level courses, or that the test does access appropriate knowledge gained in these courses but this knowledge is not recognized as such by the sophomores.

The second row comprises the codes assigned by the raters for Content Area Skills. A complete listing of all codes can be found in Appendix C. Alignment of these codes is of interest to this project: Content Area Skill codes tap into the types of knowledge and skills utilized to solve the item.

The third row, Broader Cognitive Processes, covers processes relevant but not specific to science, such as memory and language demands, metacognition, visualization or interpretation of graphics, and needing to experience empathy or alternative viewpoints. Alignment of these codes is not essential, since individuals can use various types of related behaviors to arrive at an answer without compromising the validity of the inferences arising from test scores. A larger concern with broader cognitive processes is whether these processes, such as reading load, compromise the student's ability to solve the problem, since the construct being measured is not reading comprehension.

The fourth row covers response strategy opportunities, such as elimination of response options or guessing. While students are apt to vary widely in these strategies, alignment in this area provides the test developers with important information about construct-irrelevant forces influencing student responses.

For the 10 items sampled in this study, 30% are aligned in the content area, an additional 30% are partially aligned, and 40% are misaligned. These are telling numbers, since the primary purpose of the test is to measure content area knowledge, skills, and abilities. Steps should be taken to revise these items to bring them into alignment, or to replace them with items that are aligned. Within the area of broader cognitive processing, 70% of the items are aligned, 10% are partially aligned, and 20% are misaligned. Finally, for examinee response strategies, only one item (10%) was aligned with the anticipated strategies, with 40% partially aligned and 50% not aligned. This likely illustrates that experts in the field, even seasoned teachers with years of experience in writing classroom tests, are not proficient at predicting the wide variety of strategies that students engage in while answering test items.


Use of Rubric developed by Ferrara et al.

The rate of agreement between coders was 39.8% for Content Area Skills codes, 41.4% for Broader Cognitive Processing codes, and 35.9% for Examinee Response codes. The discordant ratings reflect that many codes were difficult to discriminate among. The researchers who developed the coding scheme did so with middle school children, and the rubric was also developed for use with constructed-response items, with many of the available codes applicable only to constructed-response items. It is also possible that our older, postsecondary population exhibits strategies different enough that revising the framework to reflect the cognitive advancement of our population is warranted.
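The figures above are simple percent agreement. Below is a minimal sketch of that computation, assuming each rater assigns exactly one code per coded response; the variable names and data are invented for the example, and this is not the analysis code used in the study.

```python
from typing import List


def percent_agreement(rater_a: List[str], rater_b: List[str]) -> float:
    """Simple percent agreement: share of codings on which two raters match."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must code the same non-empty set of responses")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Hypothetical codings for a handful of item responses.
a = ["11b", "11f", "13f", "29", "33", "11e", "13a", "36b"]
b = ["11b", "13f", "13f", "29", "34", "11e", "13f", "36b"]
print(f"{percent_agreement(a, b):.1f}% agreement")  # 62.5% for this toy example
```

Percent agreement treats every coding decision equally and does not adjust for agreement expected by chance.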

Previously unforeseen sources of irrelevance

Perhaps the most surprising, not to mention entertaining, aspect of this research was the revelation of the strategies and interpretations students used as active problem-solving approaches. There were multiple cases in which correct answers were inadvertently reached by construct-irrelevant methods, or in which the correct answer was eliminated as a possible choice because of surreptitious cues undetected by the "experts." Specifically, one item designed to measure the ability to read and interpret information from a table appears instead to be measuring knowledge and beliefs regarding SAT scores, upon which the table is based. These beliefs about the interpretation of SAT scores led some students to select distracters that might not have been chosen had a different table been used. It was also common for students to eliminate options based upon their knowledge of the proper interpretation of SAT scores and thereby arrive at the correct answer without actually knowing what a standard deviation is or how to interpret one. If the purpose of the item is to measure the use of graphical, symbolic, and numerical methods to analyze, organize, and interpret natural phenomenon, a table less likely to tap into previously held folk beliefs regarding interpretation of the SATs should be used. Another item requires students to recognize that not all of the information needed to solve the problem is provided, but the correct response option is phrased "none of the above." Students who construct an incorrect formula and solve it do not find their answer among the options and so choose the correct response through incorrect reasoning.

The examples provided above are reminders that investigating the cognitive validity of test scores is of serious importance. Item-total correlations, response option analysis, and other common test analysis procedures are unlikely to identify misaligned items and certainly do not provide insight into the cognitive processes used by examinees. As institutions face the mounting pressures of accountability, ensuring the accuracy of the assessments used in reporting to stakeholders is more critical than ever. Use of a cognitive validity framework to compare task demands, actual cognitive activity, and the ability of the scores to differentiate cognitive quality is an important, yet rarely used, component of the test developer's toolbox.

Future research

• Develop a new or adjusted coding scheme for postsecondary students.

• Offer incentives to students in an attempt to control for motivational issues.

• Secure larger sample sizes to facilitate the novice-expert comparison and strengthen the findings overall.

• Allow adequate time for coding so that raters may meet regularly to establish reliability between their findings, and use additional raters to prevent rater fatigue and the shifting of assigned codes.


References

Ayala, C. C., Shavelson, R., & Ayala, M. A. (2001). On the cognitive interpretation of performance assessment scores (Report No. CSE-546). Los Angeles: Center for the Study of Evaluation.

Baxter, G. P., Glaser, R., & Raghavan, K. (1993). Analysis of cognitive demands in selected alternative science assessments. Analysis of structures and processes assessed in science. Project 2.1, Alternative approaches to assessment in mathematics and science: Cognitive theory as the basis for design of innovative assessment. Los Angeles: Center for the Study of Evaluation.

Baxter, G. P., & Glaser, R. (1998). Investigating the cognitive complexity of science assessments. Educational Measurement: Issues and Practice, 17(3), 37-45.

Dawis, R. (1987). Scale construction. Journal of Counseling Psychology, 34(4), 481-489.

Duncan, T., Perie, M., Ferrara, S., & Chilukuri, R. (2002). Cognition, conation, and affect in middle school students: Implications for the construct validity of science assessment items. Report of the American Institutes for Research funded by the National Science Foundation under Grant #0126088.

Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.

Ferrara, S., Duncan, T., Perie, M., Freed, R., McGivern, J., & Chilukuri, R. (2003). Item construct validity: Early results from a study of the relationship between intended and actual cognitive demands in a middle school science assessment. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18, 5-11.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.

Pellegrino, J. W., Chudowsky, N., & Glaser, R. (Eds.). (2001). Knowing what students know. Washington, DC: National Academy Press.

Ruiz-Primo, M. A., Schultz, S. E., Li, M., & Shavelson, R. (1998). Comparison of the reliability and validity of scores from two concept-mapping techniques. Concept mapping representation of knowledge structures: Report of year 2 activities (Report No. CSE-492). Los Angeles: Center for the Study of Evaluation.

Ruiz-Primo, M. A., Schultz, S. E., Li, M., & Shavelson, R. (1999). On the cognitive validity of interpretations of scores from alternative concept mapping techniques (Report No. CSE-503). Los Angeles: Center for the Study of Evaluation.

Ruiz-Primo, M. A., Schultz, S. E., Li, M., & Shavelson, R. (2001). On the validity of cognitive interpretations of scores from alternative concept-mapping techniques. Educational Assessment, 7(2), 99-141.

Shavelson, R., Ruiz-Primo, M. A., Li, M., & Ayala, C. C. (2003). Evaluating new approaches to assessing learning (Report No. CSE-604). Los Angeles: Center for the Study of Evaluation.

Smith, P., & Kendall, L. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149-155.

Snow, R. E., Corno, L., & Jackson, D. (1996). Individual differences in affective and conative functions. In D. C. Berliner & R. C. Calfee (Eds.), Handbook of educational psychology (pp. 243-310). New York: Macmillan.

Standards for educational and psychological testing. (1999). Washington, DC: American Educational Research Association.


APPENDIX A

Learning Objectives for the Natural World Assessment (NAW-6) of Scientific and Quantitative Reasoning

Objective 1: Use theories and models as unifying principles that help us understand natural phenomena and make predictions.
Objective 2: Use graphical, symbolic, and numerical methods to analyze, organize, and interpret natural phenomenon.
Objective 3: Describe the methods of inquiry that lead to mathematical truth and scientific knowledge and be able to distinguish science from pseudoscience.
Objective 4: Recognize the interdependence of applied research, basic research, and technology, and how they affect society.
Objective 5: Evaluate the credibility, use, and misuse of scientific and mathematical information in scientific developments and public-policy issues.
Objective 6: Illustrate the interdependence between developments in science and social and ethical issues.
Objective 7: Formulate hypotheses, identify relevant variables, and design experiments to test hypotheses.
Objective 8: Discriminate between association and causation, and identify the types of evidence used to establish causation.


APPENDIX B
Alignment Summaries

Item 1: Alternatives to Animal Testing Anticipated 1

Where learned

Group B: Content Area Skills Group C: Broader Cognitive Processing Group D: Examinee Response Strategies 1

Code (N) Defining words Reading comp.

Objective: Illustrate the interdependence between developments in science and social and ethical issues. Actual Alignment Freshmen Sophomores result Code (N) Code (N) 6th grade (1) None of the English (2) students Test-taking (3) Grade school (3) Mom (1) indicated that Math (2) English (2) they gained the High school SATs (3) knowledge Volunteer work Test-taking (1) Math (1) required to Context cues High school (2) solve the item SAT Review Geology (1) while in a Common Sense (2) Language class college course

11b (2)

11b (7)

29 (2)

29 (6)

33 (2)

33 (4) 36b (2)

Reading Elimination 11b (19) 13e (1) 13f (1) 22 (4) 23 (2) 29 (16) 33 (19) 34 (1) 36b (3)

“Where do you think you learned how to do an item like this one?”

25

ALIGNED

ALIGNED

ALIGNED

Item 2: Conversion of mythical currency system to our monetary system Anticipated Code (N) Where learned

HS Algebra Elementary

Group B: Content Area Skills Group C: Broader Cognitive Processing Group D: Examinee Response Strategies

11e (2)

26

Objective: Use theories and models as unifying principles that help us understand natural phenomena and make predictions. Actual Alignment Freshmen Sophomores result Code (N) Code (N) None of the Algebra - 4 5th grade students Math - 6 Math indicated that Physics Chemistry they gained the Currency conv. Chemistry - 2 knowledge Middle sch - 2 Geometry required to Stats - 2 Elementary – 3 solve the item while in a Pre-algebra college course Science 11e (7) 11e (22) ALIGNED

29 (2)

23 (3) 24c (1) 29 (2)

23 (9) 24c (2) 29 (11)

Partially ALIGNED

36b 37

32a (1) 35a (2) 36b (3)

32 (1) 32a (3) 34a (1) 35a (7) 35c (1) 36 a (3) 36b (5) 35c (1) 37 (1)

Partially ALIGNED

Item 3: Mean and standard deviation of SAT scores Anticipated Code (N)

Objective: Use graphical, symbolic, and numerical methods to analyze, organize, and interpret natural phenomenon. Actual Alignment Freshmen Sophomores result Code (N) Code (N) Middle Sch (2) Some of the Math (3) HS/AP Stats (2) sophomores Reasoning indicated that ISAT** Stats they gained the Math Geometry knowledge Stats (3) required to High Sch (2) College in genl. solve the item while in a Taught herself college course Liberal arts Drugs & Behav Not ALIGNED 11b (3) 11b (2) 11f (14) 11e (1) 13f (5) 11f (2) 13f (2)

Where learned

SAT prep College stats

Group B: Content Area Skills

11b (2)

Group C: Broader Cognitive Processing Group D: Examinee Response Strategies

24a (2)

23 (2) 24c (1) 29 (3)

34 (2)

32a (1) 34 (2) 35c (3)

**Integrated Science and Technology: Dept. at JMU

27

23 (9) 24c (3) 26 (3) 29 (7) 33 (2) 34 (14) 35a (1) 35c (5) 36b (1)

Not ALIGNED

Partially ALIGNED

Item 4: Legitimacy of published research findings Anticipated Code (N)

Objective: Evaluate the credibility, use, and misuse of scientific and mathematical information in scientific developments and public-policy issues. Actual Alignment Freshmen Sophomores result Code (N) Code (N) Word probs Middle School Some of the sophomores Statistics Chemistry indicated that Science ISAT they gained the Critical reading Geology knowledge Magazine English required to Health Class Critic thinking solve the item History SAT Prep while in a SOLs Test taking college course English Psych Test-taking skills Nutrition class Earth Science Science (4)

Where learned

Grad school Reading comp

Group B: Content Area Skills Group C: Broader Cognitive Processing Group D: Examinee Response Strategies

11b (2)

11f (3) 13f (4)

29 (2)

23 (1) 29 (5)

36b 36c

34 (3) 35c (2) 36a (1)

28

11b (1) 11f (4) 13f (17) 23 (5) 26 (2) 29 (15)

Not ALIGNED

34 (11) 35c (7) 36a (2) 36b (2) 36c (1)

Not ALIGNED

ALIGNED

Item 5: Features of experimental design Anticipated Code (N) Where learned

Grad school (2)

Group B: Content Area Skills Group C: Broader Cognitive Processing Group D: Examinee Response Strategies

11f (2)

29

29 (2)

36b (2)

Objective: Formulate hypotheses, identify relevant variables, and design experiments to test hypotheses. Actual Alignment Freshmen Sophomores result Code (N) Code (N) Some of the Psych (4) Environ. Sci. sophomores Sociology Math indicated that HS/AP Stats Chemistry (4) Science (5) Science they gained the Elementary sch knowledge Psychology required to Chemistry (3) Biology solve the item Biology while in a college course ALIGNED 11f (18) 11b (1) 13a (1) 11f (4) 13f (3) 13f (1) 23 (1) 26 (9) ALIGNED 29 (5) 29 (13)

33 (1) 34 (1) 35c (1) 36b (3)

33 (7) 34 (6) 35c (2) 36b (7) 36c (1)

Partially ALIGNED

Item 6: Causal relationships

Anticipated Code (N)

Objective: Discriminate between association and causation, and identify the types of evidence used to establish causation. Actual Alignment Freshmen Sophomores result Code (N) Code (N) Middle/High Sc Some of the Math (3) sophomores Philosophy Chemistry (2) indicated that Statistics (4) Science they gained the HS Geometry HS Psych knowledge Geometry/Proofs Genl Knowl required to SATs Deductive reas. solve the item while in a Science (2) college course Math Psych (2) Probability

Where learned

G-Sci 103

Group B: Content Area Skills

11b 11f

11f (2) 13f (5)

Group C: Broader Cognitive Processing Group D: Examinee Response Strategies

29 (2)

23 (2) 29 (4)

33 34

33 (1) 34 (2) 35a (1) 35c (1) 36b (1)

30

11b (1) 11f (8) 13a (2) 13e (1) 13f (10) 23 (7) 26 (1) 29 (14)

Partially ALIGNED

33 (1) 34 (13) 35a (2) 35c (5) 36a (1) 36b (1)

Partially ALIGNED

ALIGNED

Item 7: Velocity question

Anticipated Code (N) Where learned

HS Physics (2)

Group B: Content Area Skills

11e 13f

Group C: Broader Cognitive Processing

26 (2)

Group D: Examinee Response Strategies

38 39

31

Objective: Use theories and models as unifying principles that help us understand natural phenomena and make predictions. Actual Alignment Freshmen Sophomores result Code (N) Code (N) Some of the Word problems Algebra sophomores Physics (4) Physics (2) indicated that Calculus (2) Chemistry they gained the Chemistry (3) Math (2) knowledge H.S. Math Algebra required to 5th grade Trig solve the item Science while in a Math (2) college course Partially 11e (5) 11b (3) ALIGNED 13f (1) 11e (15) 11f (2) 13a (1) 13f (1) Not ALIGNED 23 (8) 23 (2) 24c (2) 24c (1) 26 (5) 29 (3) 29 (6) 31 (1) Not ALIGNED 32a (7) 32a (2) 32b (2) 33 (1) 35a (10) 35a (1) 35c (2) 36b (2) 36b (2)

Item 8: Type of research Anticipated Code (N)

Objective: Formulate hypotheses, identify relevant variables, and design experiments to test hypotheses. Actual Alignment Freshmen Sophomores result Code (N) Code (N) Some of the Psych (7) Science (3) sophomores Sociology HS Psych (4) indicated that Stats History Political Sci. Statistics they gained the knowledge Microecon. Sociology required to Macroecon. SOLs solve the item HS Psych College in gen’l. while in a Science college course Social Studies Government

Where learned

Grad school (2)

Group B: Content Area Skills

11f (2)

13a (7)

Group C: Broader Cognitive Processing Group D: Examinee Response Strategies

29 (2)

23 (1) 26 (1) 29 (4)

33 36b

33 (3) 34 (2) 35c (1)

32

11e (1) 11f (1) 13a (18) 13e (1) 13f (1) 23 (1) 26 (4) 29 (17)

Not ALIGNED

33 (1) 34 (10) 35c (2) 36a (1) 36b (7) 36c (2)

Not ALIGNED

ALIGNED

Item 9: Types of research Anticipated Where learned

Grad school (2)

Objective: Formulate hypotheses, identify relevant variables, and design experiments to test hypotheses. Actual Alignment Freshmen Sophomores result Psych - 8 (Aligned, Science - 3 Sociology partially Psych - 5 Stats aligned or History Chemistry misaligned?) Statistics Sociology (HS) Microecon. Macroecon. SOLs Psych (HS) College in genl. Interpersonal-skills class

Group B: Content Area Skills

11f (2)

13a (7)

Group C: Broader Cognitive Processing Group D: Examinee Response Strategies

29 (2)

23 (1) 26 (1) 29 (4)

33 36b

33 (2) 34 (2) 35c (1) 36b (1)

33

11f (1) 13a (19) 13e (1) 13f (1) 26 (3) 29 (19)

Not ALIGNED

33 (3) 34 (10) 35c (2) 36b (4) 36c (4)

Not ALIGNED

ALIGNED

Item 10: Placebo

Anticipated Where learned

Grad school “Sometime before college”

Group B: Content Area Skills

11b 11f

Group C: Broader Cognitive Processing Group D: Examinee Response Strategies

29 (2)

34

36c (2)

Objective: Illustrate the interdependence between developments in science and social and ethical issues. Actual Alignment Freshmen Sophomores result Common Sense Some of the Research sophomores English Statistics indicated that Stats Math Science - 2 Critical Read. they gained the knowledge Psych - 5 Psych - 2 Governor's Sch. required to Science solve the item Comm. Knowl. Opinion while in a college course Partially 11b (1) 11f (5) ALIGNED 13f (6) 13a (1) 13e (2) 13f (14) ALIGNED 23 (2) 23 (5) 29 (4) 26 (1) 29 (16) 34 (1) 35c (1) 36b (3) 39 (1)

33 (1) 34 (7) 35c (4) 36b (10) 36c (1)

Not ALIGNED

APPENDIX C
CODING SCHEME from Ferrara et al.

Group B: Science Content Area Skills

This set of codes is intended to document the science content area/skills used by the respondent in order to identify a plausible answer to the item. These skills incorporate the idea of science as inquiry. Note: Science Skills are strategies and processes that tend to be used more frequently in science. In contrast, Broader Cognitive Processes are more general and tend to cross content domains.

11. Use/Apply
- 11a. Experimental design and scientific process skills: When examinees design and conduct a scientific investigation; when examinees use/apply skills that are specifically part of experimental and investigative procedures. This code should be used only when the item specifically requires the examinee to design and conduct an experiment, in contrast to code 11d, where the examinee conducts an experiment but is guided throughout the process by explicit, step-by-step instructions.
- 11b. Information given within the item or item set: When examinees use/apply content area information given with the item or item set (e.g., in a passage, graph, or table) to respond to the current item. In contrast, factual items such as "What is Boyle's Law?" would not require the respondent to use any information within the item stem or item set. NOTE: Codes 11b and 24a-c are often used together. When you find code 11b to be applicable, also investigate the appropriateness of codes 24a-c for that situation.
- 11c. Information generated as part of responding to an item: When examinees generate or collect information in order to respond to an item at hand.
- 11d. Use of scientific tools: When examinees use appropriate tools and techniques to gather, analyze, and interpret data. This includes instances where the examinees are asked to follow explicit directions for conducting an experiment, in contrast to code 11a, where the examinees are asked to design an experiment on their own.
- 11e. Use of mathematics: When examinees use mathematics in various aspects of scientific inquiry, such as measurement or using a formula.
- 11f. Prior knowledge and expectations, including everyday science knowledge: When examinees use/apply (a) prior science knowledge that is not given with the item, (b) prior expectations in order to respond successfully to an item, or (c) science content area skills that are likely to be learned in informal, everyday situations rather than in formal situations such as school science classes.

12. Answer, Explain, or Communicate Scientific Procedures and Explanations
- 12a. Defend the answer given: When examinees provide a rationale, data, or other support to justify an answer, illustrate its accuracy, and so forth. NOTE: If this code is used for a MC item, it must be used in conjunction with 34 or a 36 code and can only be used as a secondary code. Code as primary for CR items.
- 12b. Explain why something happened: When examinees (a) provide the likely cause of something observed or something concluded, or (b) propose possible explanations. NOTE: If this code is used for a MC item, it must be used in conjunction with 34 or a 36 code and can only be used as a secondary code. Code as primary for CR items.
- 12c. Explain or describe thought process or skills used to arrive at a response: When examinees describe the thinking processes, reasoning, or information they use to arrive at a response to an item. In math, the analogue is "showing all one's work." NOTE: This code is only used when required by the item. It does not apply to MC items.
- 12d. Reiterate, re-explain, or summarize: When examinees repeat, rephrase, or summarize a response given previously or a step taken previously. NOTE: If this code is used for a MC item, it must be used in conjunction with 34 or a 36 code and can only be used as a secondary code. Code as primary for CR items.
- 12e. Explain using scientific principles and concepts: When examinees explain principles and concepts in a way that facilitates learning and problem solving and that reflects conceptual understanding. NOTE: If this code is used for a MC item, it must be used in conjunction with 34 or a 36 code and can only be used as a secondary code. Code as primary for CR items.
- 12f. Provide an explanation: When examinees give explanations where providing an explanation is how one answers the question. The code is used for items that DO NOT require a two-part answer.

13. Analyze, Categorize, or Hypothesize
- 13a. Formulate hypotheses: When examinees identify questions that can be answered through scientific investigations. This includes making predictions about what will happen if certain conditions were true.

- 13b. Formulate competing hypotheses: When examinees recognize and analyze alternative explanations and predictions.
- 13c. Identify components: When examinees test objects to determine their components or how those components are organized, or identify characteristics and features of objects through observation.
- 13d. Describe patterns in data, procedures, or results: When examinees describe (a) patterns in data, (b) procedures in a science investigation, and/or (c) results of a science investigation.
- 13e. Identify relationships: When examinees think critically and logically to make connections between evidence and explanations.
- 13f. Analyze, interpret, and/or draw conclusions: When examinees demonstrate (a) analysis and interpretation of data or results from a scientific investigation, and/or (b) a final judgment or decision based on results from an investigation, data that are given with the item, or data collected as part of an item.
- 13g. Observe processes: When examinees perform observations or model a process that cannot be manipulated.
- 13h. Compare/contrast: When examinees (a) conduct an experiment to compare two or more objects on some attribute, or (b) otherwise describe similarities and differences.
- 13i. Classify: When examinees classify objects according to critical attributes to serve a practical or conceptual purpose.

14. Create, Invent, or Learn

- 14a. Invent a solution algorithm: When examinees demonstrate the invention of an algorithm or other solution strategy specifically to respond to the item.
- 14b. Acquire new knowledge in order to complete the item: When examinees demonstrate (a) the learning of new science concepts or skills, or the integration of new science concepts or skills into existing knowledge structures, or (b) the use of previous conclusions or other information from the task in order to respond successfully to the item.
- 14c. Formulate models: When examinees develop descriptions, explanations, predictions, and models using evidence.

19. Other Content Area Skill not listed above.
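Because the Group B codes form a small, fixed taxonomy, it can be convenient to keep them in a simple lookup structure when tallying coded think-aloud segments. The sketch below is illustrative only and is not part of the published coding protocol; the data format (one code per coded segment) and the names GROUP_B_CODES and tally_codes are assumptions made for this example.

# Illustrative only: a minimal representation of part of the Group B code
# hierarchy, assuming coders record one code per think-aloud segment.
# Labels are abbreviated from the codebook.
from collections import Counter

GROUP_B_CODES = {
    "11a": "Experimental design and scientific process skills",
    "11b": "Information given within the item or item set",
    "11c": "Information generated as part of responding to an item",
    "11d": "Use of scientific tools",
    "11e": "Use of mathematics",
    "11f": "Prior knowledge and expectations",
    "12a": "Defend the answer given",
    "12b": "Explain why something happened",
    "13a": "Formulate hypotheses",
    # ... remaining 12x, 13x, 14x, and 19 entries omitted for brevity
}

def tally_codes(coded_segments):
    """Count how often each Group B code was assigned across segments.

    coded_segments: iterable of (segment_id, code) pairs produced by a coder.
    """
    return Counter(code for _, code in coded_segments if code in GROUP_B_CODES)

# Example: three segments from one hypothetical examinee protocol.
print(tally_codes([("s1", "11b"), ("s2", "11b"), ("s3", "13a")]))
# Counter({'11b': 2, '13a': 1})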

Group C: Broader Cognitive Processing

This set of codes is intended to document additional cognitive processes used by the respondent in order to identify a plausible answer to the item. NOTE: Broader Cognitive Processes are more general and tend to cross content domains. In contrast, Science Content Area Skills are strategies and processes that tend to be used more frequently in science.

21. Demands on working memory: [Not Applicable]

22. Language-related demands

- 22a. Language processing: When examinees (a) generate extensive amounts of text, (b) use complex or challenging text, or (c) use challenging language (e.g., concepts or technical terms) in response to the item.
- 22b. Language production requirements (oral and written): [Not Applicable]

23. Metacognition: When examinees demonstrate (a) skills such as planning, monitoring, goal setting, adjusting strategy use, and efficiency in time use, (b) the ability to control and regulate their thinking and reasoning, (c) checking their thinking for accuracy, or (d) awareness of how their personal experiences, habits, and prejudices shape and impede their own understanding.

24. Visual-Spatial Analysis and Use of Graphics

- 24a. Use of graphs, charts, and other representations of information and data given: When examinees use external graphs, charts, and other representations of information and data made available in the item to visualize information given in or otherwise relevant to the item. NOTE: Codes 11b and 24a are often used together. When you find code 24a to be applicable, also investigate the appropriateness of code 11b for that situation.
- 24b. Use of graphical representations given: When examinees use external graphical representations (e.g., drawings, sketches) made available in the item to visualize information given in or otherwise relevant to the item. NOTE: Codes 11b and 24b are often used together. When you find code 24b to be applicable, also investigate the appropriateness of code 11b for that situation.
- 24c. Use of tabular information given: When examinees use external tabular information made available in the item to visualize information given in or otherwise relevant to the item. NOTE: Codes 11b and 24c are often used together. When you find code 24c to be applicable, also investigate the appropriateness of code 11b for that situation.
- 24d. Visual-spatial analysis: When examinees manipulate a graphic, such as folding paper or measuring a graphic.
- 24e. Create a graphic: When examinees comply with an item's directive to create a graphic. In contrast to code 31b, this code should not be applied when the examinee spontaneously creates a graphic (e.g., a sketch, table, or graph).

25. Perspective and Empathy: When examinees demonstrate that (a) they understand different viewpoints and "see the big picture," or (b) they "find value in what others find odd, alien, or implausible."

26. Retrieval from Long-Term Memory: When examinees make an active attempt to search for and retrieve concepts, terms, and other distant memories in an effort to recall information from their memory stores.

29. Other Broader Cognitive Processing not listed above.
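The recurring note that codes 11b and 24a-c are often used together suggests a simple mechanical aid during coding: flag any segment where one member of the pairing appears without the other so a coder can take a second look. This is only a review aid, not a rule that the codes must co-occur. The sketch below is hypothetical; the segment format and the names PAIRED_WITH_11B and flag_possible_missed_pairs are assumptions for illustration.

# Illustrative only: flag coded segments where one of the paired codes
# (11b with 24a/24b/24c) appears without its usual companion, so a coder
# can revisit the segment. Assumed format: (segment_id, set_of_codes) pairs.
PAIRED_WITH_11B = {"24a", "24b", "24c"}

def flag_possible_missed_pairs(coded_segments):
    flagged = []
    for segment_id, codes in coded_segments:
        has_24 = bool(codes & PAIRED_WITH_11B)
        has_11b = "11b" in codes
        if has_24 != has_11b:  # one member of the pairing present without the other
            flagged.append(segment_id)
    return flagged

segments = [
    ("s1", {"11b", "24a"}),   # consistent pairing
    ("s2", {"24c"}),          # 24c without 11b -> flag for review
    ("s3", {"11b"}),          # 11b without any 24 code -> flag for review
]
print(flag_possible_missed_pairs(segments))  # ['s2', 's3']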

Group D: Examinee Response Strategies

Different types of items offer the examinee different opportunities to use one or more response strategies in determining a correct response. The examinee response strategies listed here are intended to document the strategies/tactics/approaches that an examinee might use.

31. Visualization and Gestures

- 31a. Using visualization/making gestures: When examinees actually use (internal) visualization or make gestures as part of their response path. NOTE: This code should be used any time an examinee interacts with a graphic (e.g., pointing to or tracing the image). In such cases, code 24a must also be applied.
- 31b. Making graphics: When examinees make external graphical representations (e.g., sketches) to visualize information given in or otherwise relevant to the item. In contrast to code 24e, this code should be used when the examinee spontaneously creates a graphic rather than simply complying with an item's directives.

32. Use Estimation, Insight, or an Abstract/Analytic Approach

- 32a. Use estimation or insight: When examinees use estimation or insight because it is quicker than formally applying algorithms.
- 32b. Use an abstract/analytic approach: When examinees use abstract reasoning, without assigning numerical values or performing direct calculations, to respond to items that may require them to apply textbook algorithms.

33. Trial and Error: When examinees work backwards from the multiple-choice options, substituting each of the responses into the item stem.

34. Identifying plausible or implausible correct answer(s): When examinees use partial knowledge to eliminate one or more multiple-choice response options or constructed-response possibilities. This includes cases where examinees explain their rationale for eliminating various responses.

35. Approximating or Compensating for Limited or No Knowledge

- 35a. Guessing: When examinees have little or no relevant knowledge and therefore guess outright to choose or generate a response to a multiple-choice question. This includes the case where the examinee explains his/her guessing strategy.
- 35b. Approximating for constructed-response items: When examinees have little or no relevant knowledge and therefore provide loosely organized, inexact, incomplete, or imprecise responses to a constructed-response item. This includes regurgitating any and all information associated with the topic, rephrasing the item stem, and so forth, as well as the case where the examinee explains his/her attempt to fill in something to write in the blank space.
- 35c. Approximating for multiple-choice items: When examinees bring some knowledge to the item and use a combination of elimination and guessing. NOTE: This code differs from 35a (Guessing) and 34 (Eliminating) in that the examinee uses some knowledge to respond to the item but does NOT have enough knowledge to systematically eliminate response options.

36. Identifying a known correct, or believed to be correct, answer

- 36a. Know and Respond: When examinees know, or think they know, the correct answer before reading the answer choices and search to identify their choice among the response options (or begin filling in the constructed response without hesitation). This includes cases where an examinee remarks, "I know this" (or some variation of that phrase).
- 36b. Read and Recognize: When examinees read the item stem and answer choices and quickly identify the correct answer from the response options. This code is used in cases where examinees are required to read the answer choices (e.g., "Which of the following …") or when the examinee does so voluntarily. This code is used for multiple-choice items only.
- 36c. Find and Confirm: When examinees read the item stem and answer choices, immediately identify the correct answer from the response options, and review the other options to reconfirm their choice. In this case examinees know the answer but explain why the other choices are incorrect. This code differs from code 34 in that the examinee does not eliminate options in order to arrive at the correct response; the examinee arrives at the correct response and then provides a rationale for not choosing the other options.

37. Substitution and Value Assignment: In science items that involve mathematical formulas or calculations, when examinees replace a variable with numerical values from the item stem, followed by direct computation, either to (a) obtain a solution or (b) generalize about the relationship between two variables.

38. Asking for Additional Information: When examinees ask for additional information or clarification about a test question, or for a definition of unfamiliar terms. Although there is no guarantee that asking for this information would be permitted during an actual test administration, this response strategy is one that some students use. NOTE: This code does not apply when examinees ask the interviewer questions about the test directions or procedures.

39. Other Examinee Response Strategy not listed above.
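Once response-strategy codes have been assigned, a natural summary is the frequency of each Group D strategy per item across examinees, which can then be compared descriptively between groups of protocols. The sketch below is a hypothetical illustration, not the analysis used in this study; the record format and the name strategy_profile are assumptions.

# Illustrative only: summarize which Group D response strategies were coded
# for each item across a set of think-aloud protocols. The record format
# (item_id, examinee_id, strategy_code) is an assumption for this sketch.
from collections import defaultdict, Counter

def strategy_profile(records):
    """Return {item_id: Counter(strategy_code -> frequency)}."""
    profile = defaultdict(Counter)
    for item_id, _examinee_id, strategy_code in records:
        profile[item_id][strategy_code] += 1
    return dict(profile)

records = [
    ("Q1", "E01", "36b"),  # read and recognize
    ("Q1", "E02", "34"),   # eliminated implausible options
    ("Q1", "E03", "35a"),  # outright guess
    ("Q2", "E01", "37"),   # substitution and value assignment
]
for item, counts in strategy_profile(records).items():
    print(item, dict(counts))
# Q1 {'36b': 1, '34': 1, '35a': 1}
# Q2 {'37': 1}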
