Validity and Reliability of an Instrument for Assessing Case Analyses in Bioengineering Ethics Education

Ilya M. Goldin 1
Center for Digital Data, Analytics & Adaptive Learning, Pearson
[email protected]

Rosa Pinkus
Department of Bioengineering, University of Pittsburgh
[email protected]

Kevin Ashley
Learning Research & Development Center, University of Pittsburgh
[email protected]

Keywords: moral reasoning, assessment, ethics case analysis, mixed methods, validity, reliability

Science and Engineering Ethics. DOI: 10.1007/s11948-015-9644-2

1 At the time of writing, Ilya Goldin was a PhD student in the Intelligent Systems Program, University of Pittsburgh.

Abstract

Assessment in ethics education faces a challenge. From the perspectives of teachers, students, and third-party evaluators like the Accreditation Board for Engineering and Technology and the National Institutes of Health, assessment of student performance is essential. Because of the complexity of ethical case analysis, however, it is difficult to formulate assessment criteria, and to recognize when students fulfill them. Improvement in students’ moral reasoning skills can serve as the focus of assessment. In previous work, Pinkus and Gloeckner developed a novel instrument for assessing moral reasoning skills in bioengineering ethics. In this paper, we compare that approach to existing assessment techniques, and evaluate its validity and reliability. We find that it is sensitive to knowledge gain and that independent coders agree on how to apply it.


Introduction

As in many areas of discipline-based applied ethics, teachers and students working in bioengineering ethics would benefit from a valid, reliable, objective assessment tool. This tool would also be important to external agencies interested in professional ethics pedagogy. For example, the Accreditation Board for Engineering and Technology (ABET) requires that “by the time of their graduation [students are expected to have] an understanding of professional and ethical responsibility” (Accreditation Board of Engineering and Technology 2006), and engineering education researchers have responded with various assessment tools that address different ABET criteria (Shuman et al. 2005). The National Institutes of Health requires that researchers receive “human subjects education,” so as to conduct research in a manner that protects human subjects, i.e., satisfies the requirements of the Belmont Report. (National Institutes of Health Office of Extramural Research 2006) The U.S. Department of Health and Human Services promotes training in responsible conduct of research (RCR), including areas such as mentor/trainee responsibilities and conflict of interest. (Steneck 2007)

Bioengineering ethics presents an especially difficult challenge for assessment. Consider that many cases in bioengineering ethics correspond to the cognitive science notion of ill-defined problems. (Voss and Post 1988) Although usage is unsettled, to us, ill-defined problems encompass open-ended ones (i.e., problems without exactly one plausible answer) and ill-structured ones (i.e., problems without exactly one solution path). Moreover, ill-defined problems require the problem solver to define them better through framing. For instance, in professional ethics, one must add constraints, and these in turn affect how ethical principles or institutional codes of ethics apply to the problem at hand. Mapping the principles to the situation and weighing the effects of alternative actions or creative compromises in order to resolve conflicts are skills that must be learned. The correctness of a particular resolution depends on how one frames the case. (Goldin et al. 2006a)

How one can assess student work on an ill-defined problem is often unclear. Related work (Kipnis 2000; Goldie et al. 2001; Goldie et al. 2002) recognizes that one practical measure of a solution to an ill-defined problem is its acceptability to a community of practitioners. Note that this would be a “communitarian” assessment.

One’s methodological approach to teaching ethics also affects assessment. Given the variations in teaching applied ethics, one must be clear about the goals of teaching, and the real opportunities for assessment. For example, a theoretical approach may define ethics in a relatively rigid logical framework. Assessing whether or not a student uses the framework appropriately could be comparatively straightforward. If Kant’s deontological theory is to be taught, then a student must include its central tenet: never use a person merely as a means to an end. Whatever other concepts, qualifiers or inferences a student adds to the analysis, this tenet should be evident. An approach based on the “four principles” (Beauchamp and Childress 2001) provides a terminological framework and shifts the analysis from abstract theories closer to the case. The principles complicate assessment, however, because a student needs not only to invoke the principles, but also to prioritize them. A “casuist” approach (Jonsen and Toulmin 1988) involves reasoning with cases to help map principles to the problem and resolve conflicts. In solving ill-defined ethical problems, casuistry constitutes “reflection and deliberation about what to do…based on comparison of cases and agreed upon moral standards, as well as engineering codes of ethics.” (Weil 1990, 5,7) Creating an instrument to assess formatively how a student reasons “with cases” presents special challenges, discussed below. (Arras 1991; Kuczewski 1997; Pinkus 1997; Pinkus et al. 2015)

Assessment of Analytical Components of Methods of Moral Reasoning

The Assessment Instrument we evaluate was developed in the context of a graduate and undergraduate course, Bioengineering Ethics, taught by author Pinkus. The course is designed to instruct bioengineering students to identify the values embedded in their practice. It stresses the acquisition of methods of moral problem solving to be used to identify, analyze and resolve dilemmas posed when these values conflict.

The Instrument is distinct from other kinds of assessment instruments in ethics, such as those that assess a student’s general level of moral development, reasoning or judgment (Kohlberg 1981; Gibbs and Widaman 1982; Kohlberg 1984; Rest et al. 1999; Lind 2000; Comunian 2002), or those that survey moral values, as reviewed by Lynch et al. (2004) and Rudnicka (2004). Rather, it was designed to assess students’ mastery of a set of analytical skills and moral problem-solving methods taught over the semester. The skills match those presented in the course textbook (Harris et al. 2000), which invites students to consider the morally relevant facts of the case, both known and unknown; to structure the analysis via the conceptual issues that can relate the facts to each other; to use their moral imagination to propose and compare alternative resolutions to the dilemma; and finally to justify a particular resolution.

Pinkus has applied the Assessment Instrument to the capstone assignment in her course. Students write a one- or two-page case study based on their technical area of engineering research. Usually, the protagonist is an engineer faced with a dilemma that arises from, or will have ramifications for, the engineer’s professional duties. The course of action is unclear, and requires analysis. Each student presents the case to the class for comment and then writes a paper that analyzes the case using the methods taught in the class.

Given the importance of framing in ethical problem-solving, asking students to create their own cases is an ideal pedagogical exercise; it requires them to frame the problems. Furthermore, assigning students to create cases close to their professional expertise and interests has been shown to be a positive factor in student learning. (Pinkus et al. 2015) On the other hand, this need for framing recalls the prospect that a given ill-defined problem may have multiple valid analyses. The answer depends in part on the justifications, as well as on how the problem is conceived.

When students analyze a specific, well-studied case, one can attempt to assess their work against a “gold standard” list of facts and issues that they ought to mention. This can be thorny (Hébert et al. 1992; Hayes et al. 1999; Savulescu et al. 1999), and even when the relevant facts of a given case can be enumerated, the quantity of facts that students identify is a poor predictor of the holistic quality of their analyses. (Pinkus et al. 2015) Moreover, when students have nearly complete freedom in describing the facts of the case that they will analyze, the resulting variety of cases and framings makes it impossible to create a gold standard.

In lieu of a gold standard, the Instrument looks for evidence that a student has grasped certain analytical skills important in reasoning with ethical cases, called higher-level moral reasoning skills (HLMRS). (Pinkus et al. 2015) The five measures are whether a student:

a) Employs professional engineering knowledge to frame issues. One’s engineering expertise can impose useful constraints on an ill-defined problem.

b) Views the problem from multiple levels. This skill speaks to viewing a problem from the perspectives of different stakeholders.

c) Moves flexibly among multiple levels. Students who weave together the perspectives of different stakeholders throughout their analyses go beyond merely identifying the multiple perspectives.

d) Identifies analogous cases and articulates ways the cases were analogous. To articulate the way the problem at hand is analogous to a past case is to find deep similarity by using problem structure.

e) Employs a method of moral reasoning in conducting the analysis. A methodical analysis brings organization to solving ill-defined problems.

The first four were identified within the context of an NSF-funded project (award #9720341), and the fifth within the context of teaching. Together, they are representative of the “measurable” learning goals of the Bioengineering Ethics course. Pinkus arrived at the HLMRS by scoring (on a three-point poor-mediocre-excellent scale) nine students’ analyses of an assigned case. The scoring criterion was her holistic understanding of what analyses were the best-considered, based on her experience as a teacher of bioengineering ethics. A comparison of the highest-scoring analyses against the lowest-scoring ones focused her on key distinctions, which became formalized as the HLMRS. She then “tested” the HLMRS by seeing whether they were helpful in assessing other students’ analyses of various cases.

The Instrument poses the questions verbatim regarding whether a student’s analysis demonstrates skills a through d. The fifth HLMRS, “Employs a method of moral reasoning in conducting the ethical analysis,” was operationalized with an approach called labeling, defining, and applying (LDA) ethics concepts. Thus, one of the five HLMRS enjoys a standardized method by which an evaluator can code for evidence of the skill. The gist of the LDA approach is to look for evidence that the student has invoked specific concepts of applied ethics. The LDA approach can be said to operationalize the fifth HLMRS in the sense that the use of a concept in one’s analysis indicates an attempt to “employ a method of moral reasoning” in the context of the Bioengineering Ethics course.

Pinkus and Gloeckner identified pedagogically important concepts taught in the course texts and lectures that are associated with methods of moral reasoning (Figure 1). The list is derived from case books (Pinkus 1997; Harris et al. 2000; Pence 2000) and term papers from several semesters of the Bioengineering Ethics course. Although this list is not exhaustive, it is as comprehensive as possible, to enhance the content validity of the Instrument.

[IG: Please insert Figure 1 about here.]

Specifically, an evaluator records whether the student has ‘labeled,’ ‘defined,’ and/or correctly ‘applied’ the relevant concepts (Figure 1) to the case at hand. In a case analysis, an ethics concept is said to be ‘labeled’ if the term for the concept is present; ‘defined’ if a dictionary-like definition of the concept is present; and ‘applied’ if the concept is brought to bear appropriately in the context of the particular case. Consider this excerpt from an actual term paper, in which ‘paternalism’, one of the concepts of interest, is invoked. We annotated the excerpt to highlight some, but not all, instances of labeling, defining, and applying; the annotation tags enclosing those passages are omitted here. A coder applying the Instrument produces similar annotations.

There are several principles to address in this case, specifically, Dr. Hall’s paternalistic attitude and the informed consent process. Dr. Hall’s attitude toward Paul is paternalistic. In a paternalistic model, the physician promotes the patient’s medical well being, independent of the patient’s current preferences. There are several models of paternalism: paternalism for incompetence, trust-based paternalism and best-interest paternalism. Dr. Hall’s attitude is an example of best interest paternalism. Paul’s values play a minimal role in Dr. Hall’s decision. Dr. Hall feels Paul’s best interests are served most by administering the mechanical valve. Dr. Hall’s pursuits are what are best medically and it is obvious that Dr. Hall’s main emphasis is on Paul’s health, not his well-being. Paul’s ability for self-determination was negated in Dr. Hall’s decision.

Coders using the instrument receive written instructions and a glossary that gives a definition and an example for each concept. Here is the entry for ‘paternalism’:

Paternalism: Paternalism means substituting one’s own judgment about what is good for a person for the judgment of that person. It is usually equated with a doctor or engineer making judgments for the patient/client. (Harris et al. 2000, 2nd:240) (See also doctor-patient relationship)

Example: “Robins’ firm operates a large pineapple plantation in Country X. The firm has required the workers to leave their traditional villages and to live in company villages…The workers, however, prefer the older villages…The managers refuse to relent, saying that the workers will be healthier and happier in the new environment.” (Harris et al. 2000, 2nd:251)

Throughout, the Glossary cross-references related concepts and cites authoritative sources. (Harris et al. 2000; Beauchamp and Childress 2001) In this way, the Glossary provides a network of concepts, definitions, and examples that serves as an annotating aid. It can be especially useful where a student applies a concept without explicitly labeling it, or if the coder suspects an error in how the student defined or applied a concept.
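The glossary’s per-concept structure (definition, example, source citation, cross-references) is straightforward to represent explicitly. The following is a minimal, hypothetical sketch of such a lookup aid, not the authors’ software; the entry content paraphrases the ‘paternalism’ entry above, and the data structure and helper function are our own illustration.

```python
# Hypothetical representation of the glossary network described above: each
# concept carries a definition, an example, a source citation, and
# cross-references that a coder can follow while annotating.

GLOSSARY = {
    "paternalism": {
        "definition": ("Substituting one's own judgment about what is good for "
                       "a person for the judgment of that person."),
        "example": "Robins' firm requires plantation workers to move to company villages.",
        "source": "Harris et al. 2000, 2nd ed.",
        "see_also": ["doctor-patient relationship"],
    },
    # ...the remaining Figure 1 concepts would be entered the same way
}

def related_entries(concept):
    """Follow cross-references one step, as a coder might when a concept is
    applied without being labeled explicitly."""
    entry = GLOSSARY.get(concept, {})
    return {c: GLOSSARY.get(c) for c in entry.get("see_also", [])}

print(related_entries("paternalism").keys())  # doctor-patient relationship
```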


The assessment instrument accommodates the variety of cases created by students, and the variety of ways students frame them. Given that students author the cases that they go on to analyze, not all of the concepts taught in the course and listed in the instrument will be relevant to every case. Which concepts are relevant depends on how the student frames the problem. Since the instrument lists most concepts covered in the course, the concepts relevant to a particular case will generally be a subset of those listed. An evaluator who finds that the student labels, defines, or applies a concept that is missing from the instrument may write it in the spaces provided.

One anonymous reviewer wondered whether this constitutes “an adequate measure of the student’s skill or depth in moral reasoning.” As one of five HLMRS, it provides an objective scoring of a student’s depth of understanding of concept use. If a concept is applied correctly and also defined correctly, we believe that it is more likely that the student has a comprehensive understanding of the concept than if the concept is merely labeled. Note also that this is not a “stand-alone” measure of learning. It is one of five interrelated skills, all of which are attended to in the Assessment Instrument. Objectifying one of these skills makes assessment more transparent. (Pinkus et al. 2015)

An evaluator who holds a list of the concepts that a student labeled, defined and applied can use this information to understand the student’s strengths and weaknesses, and to provide constructive feedback. In this way, the list of concepts facilitates a formative assessment of student work. Formative assessment provides students and teachers with specific critiques, rather than the aggregate score one draws from summative assessment. A sketch of how such a concept list might support feedback follows.
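As a concrete illustration of how a coder’s concept list could support formative feedback, the sketch below records LDA judgments for a single paper and flags concepts the student invoked only superficially. The record format and helper names are hypothetical; they are not part of the Assessment Instrument itself.

```python
# Illustrative sketch (not the authors' instrument): one way to record a
# coder's LDA (labeled/defined/applied) judgments for a single term paper and
# turn them into a formative feedback summary. Concept names follow Figure 1.

from dataclasses import dataclass, field

@dataclass
class ConceptCoding:
    concept: str
    labeled: bool = False
    defined: bool = False
    applied: bool = False

@dataclass
class PaperCoding:
    student: str
    codings: list = field(default_factory=list)

    def feedback(self):
        """Concepts the student invoked only superficially (labeled but
        neither defined nor applied), which an instructor might flag."""
        return [c.concept for c in self.codings
                if c.labeled and not (c.defined or c.applied)]

paper = PaperCoding("student_01", [
    ConceptCoding("paternalism", labeled=True, defined=True, applied=True),
    ConceptCoding("informed consent", labeled=True),
    ConceptCoding("justice", labeled=True, defined=True),
])

print(paper.feedback())  # ['informed consent'] -> suggest defining/applying it
```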


Shapiro and Miller (1994) also develop an instrument for annotating analyses of cases authored by the students themselves. It asks coders to tally principles and theories invoked by the students, where the master list is “utilitarianism, consequentialism, deontology, virtue, autonomy, justice, beneficence and nonmaleficence.” Our Instrument essentially subsumes this; the additional concepts and HLMRS make it useful not only for instructors who focus on theory or principles, but also for those who teach with a case-based approach.

Empirical Evaluation of the Assessment Instrument

Assessment instruments for engineering education, however, need to satisfy standards of methodological rigor, including validity and reliability. (Olds et al. 2005) We report the results of two studies in which we sought to evaluate the Assessment Instrument. In the first study, we focused on validity of the Instrument: we asked whether the Instrument reports scores that correspond to what one would expect. In the second study, we focused on inter-rater reliability, i.e., whether the Instrument reports similar scores when it is applied by different people. In this section, we describe the context of the studies; the following sections report the methodology and results.

Validity describes how well an instrument measures that which is important to observe, and that which it is intended to observe; we claim that the Assessment Instrument is a valid measure because it is sensitive to relevant skills of students in performing ethical case analysis. We investigated whether the Instrument is sensitive to changes in learning during a semester-long graduate bioengineering ethics course. We compared students’ skills at ethical case analysis, as measured by the Instrument, before and after the course. We argue that if one assumes that students actually learned in the class, then their learning can only be reflected in posttest scores if the Instrument is sensitive to changes in learning.

To evaluate whether the Assessment Instrument is reliable is to ask whether the results one obtains from applying the Instrument are reproducible or just a one-time occurrence. One way to evaluate the reliability of an instrument is with a study of inter-rater reliability, which measures how much independent coders agree when they apply the instrument. High coder agreement implies that an instrument is reliable.

In both the sensitivity and the inter-rater reliability studies we treat the Instrument as consisting of two parts: the four questions that address Higher-Level Moral Reasoning Skills constitute one part; the second part consists of questions that operationalize the HLMRS “Employed a method in conducting the ethical analysis” in that they ask whether or not a concept has been labeled, defined, or applied. Notably, the two types of questions place different cognitive demands on the person applying the Instrument. Specifically, the HLMRS require reflection about the paper as a whole and free-text responses, whereas the concept-related questions call for decisions at the level of a page, a paragraph, or a sentence (depending on how one deploys the Instrument), and yes-or-no responses. Thus, one impetus for distinguishing the questions is the need to compare “apples to apples” when performing statistical computations.

In statistical analysis it is important to consider whether the phenomena under examination are independent of each other. For example, it is unlikely (but possible) that a student will define a concept without labeling it. To take this into account, we could consider a space of events: labeling, defining, and applying a concept; labeling and defining, but not applying; and so forth, for a total of eight possible combinations. The alternative is to “assume” (i.e., stipulate) independence of labeling, defining, and applying for the purpose of the statistical analyses. In the analyses here, we always assume independence of labeling, defining and applying because to do otherwise adds little information (which we have verified), and yet requires an involved explanation of statistics.

The two studies apply the Instrument to two different data sets. Recall that the Instrument was designed to assess term papers that students write at the end of a semester-long course. Accordingly, the inter-rater reliability study evaluates the Instrument as applied to a data set of term papers. In the sensitivity study, however, the baseline is student performance at the beginning of the semester, when it is unreasonable to expect an in-depth analysis of a case. (In the argot of instructional science, the difficulty of the pretest would lead to a floor effect.) Thus the sensitivity study uses a data set of short (one- to two-page) case analyses that were produced in response to prompts that were known to be capable of eliciting students’ thoughts at the beginning and at the end of a semester.

Circumstances precluded us from training coders in a uniform manner such that their judgments would approach consensus in either study. For example, consider the training for scoring essays from the standardized tests of the Educational Testing Service, e.g., the SAT. Since all test-takers respond to the same pool of essay questions, the coders can be trained in depth on each question and its range of appropriate answers. Coder agreement is measured, and coders are retrained if their scores diverge beyond acceptable limits. These uniform training procedures are necessary when agreement itself is the goal. In the two studies here, however, we look at how the Instrument holds up even if coders do not receive such rigorous uniform training. The reason is that the data used to develop assessment instruments and to measure coder agreement are conventionally gathered when students write responses to set questions, e.g., (Sindelar et al. 2003). By contrast, in the term papers data set, students wrote not only the responses (the case analyses), but also the questions (the cases). With such data, we can never train coders on the full range of questions; the questions are not known until the students create them.

During the course of the studies reported here, the Instrument was a work in progress. When the data analyzed in these studies were collected, the coders received no glossary, and only oral instructions. In their annotations, the coders’ unit of observation was either the entire paper (for HLMRS questions) or the page (for concept questions). For simplicity, the unit of analysis for all the statistical tests is the paper. Coders now annotate at a finer “grain” using annotation software, and receive the written instructions and glossary. We expect that these measures will only improve upon the results reported here.

Sensitivity Study Methods

We investigated whether the Instrument is sensitive to changes in learning during a semester-long bioengineering ethics course. We compared students’ skills at ethical case analysis, as measured by the Instrument, before and after the course.

Three coders applied the Instrument to all of the responses. Coders M and K did not know that we were measuring sensitivity, that responses came from pre- and posttests, that responses could be paired by author, or that other coders worked on the same task. Although the third coder, A, was not blinded, she endeavored to apply the Instrument objectively.

To compare student performance on the pre- and posttest, we calculated two scores, HLMRS Sensitivity and Concept Sensitivity. The first corresponds to the four HLMRS, and the second to the concepts operationalizing the HLMRS “Employed a method in conducting the ethical analysis.” Our study has two “fixed effects”: coder and test time. The significance tests allow for four possible outcomes: a) that there are significant differences between pretest and posttest scores; b) that there are significant differences among the coders; c) both a and b; d) neither a nor b.

Briefly, the remainder of this section proceeds as follows. We first explain how to compute and interpret HLMRS Sensitivity and Concept Sensitivity. We then perform significance tests on these scores across test times and coders using the technique of Generalized Linear Mixed Modeling (GLMM). (Venables and Ripley 2002) GLMM was appropriate for our data because they fit a Poisson distribution (rather than a normal one). Using GLMM we find that there are significant differences between pretest and posttest scores, and also significant differences between some coders. We believe that coder disagreements can be explained by the fact that the coders did not undergo uniform training for the task. Importantly, although coders sometimes disagreed, each coder independently reported differences between pretest and posttest scores. We conclude with some reflections on methodology.

The Concept Sensitivity score is a simple tally of every instance in which a concept was labeled, defined, or applied in a given paper. For example, if the concept “justice” is labeled and defined, and “bribery” is applied, Concept Sensitivity = 3. We emphasize that the ideal student analysis will not invoke all or even most concepts, because the concepts that are germane to a given analysis depend on the particular student’s framing of the case. Therefore, the maximum Concept Sensitivity = 3 actions * 41 concepts = 123 is not only far from ideal, but essentially impossible.

If the Concept Sensitivity scores followed a normal distribution, we could calculate whether there were significant differences between pre- and posttest scores by employing a t-test. However, we know that the underlying phenomena that “generate” the Concept Sensitivity scores (the “true”, coder-independent distribution of labeling, defining, and applying of concepts) are infrequent in practice but could, hypothetically, occur often. This resembles a Poisson process. If we could assume that the Concept Sensitivity scores follow a Poisson distribution, then that would provide a model for our analysis. First, the Poisson assumption is justifiable if we assume 1) that labeling, defining, and applying are independent actions and 2) that these actions occur with equal probability. In combination, assumptions 1 and 2 imply that Concept Sensitivity scores follow a binomial distribution. Another way of saying the same thing is that for a given student’s test we observe 41 concepts times 3 possible actions = 123 independent Bernoulli random variables, and thus we have a binomial distribution. Binomial distributions are defined by parameters n and p; here, n is the number of concept questions (123). Typical Concept Sensitivity scores ranged from 1 to 15 (i.e., p of the assumed binomial distribution was small). When n is rather large (as is 123) and p is small, the Poisson distribution is known to approximate the binomial distribution.

Second, we evaluate whether our data are approximately Poisson distributed by comparing them with data that are known to be so. We present side-by-side boxplots of our data and 100 randomly generated observations from Poisson distributions with means that are similar to the estimates of our group means (Figure 2). The similarity of the boxplots also supports treating Concept Sensitivity observations as Poisson distributed.

[IG: Please insert Figure 2 around here]

The mean of the Poisson distribution is related to the appropriate combination of the fixed effects. As noted above, our study has two fixed effects: three coders, A, M and K, and two test times, pretest and posttest, or six “repeated measures” of HLMRS Sensitivity plus six measures of Concept Sensitivity per student. It follows that we can use generalized linear mixed modeling (GLMM), which accommodates the repeated measures for each student, to perform significance testing on our data. (Venables and Ripley 2002)

The HLMRS Sensitivity score is the proportion of affirmative responses to the four HLMRS questions for a given paper. For example, in response to “Viewed problem from multiple levels” a coder might list the multiple levels: “doctor, patient, engineer” (which we would interpret as ‘Yes’), and leave the other HLMRS questions blank. For that paper, HLMRS Sensitivity = 0.25. Not all HLMRS are necessary to produce a thorough analysis of every case, but one could consider them all in one paper.
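To make the two scores concrete, here is a minimal sketch of how Concept Sensitivity and HLMRS Sensitivity could be computed from a coder’s responses. The input format is an assumption made for illustration; the numeric examples reproduce those given in the text.

```python
# Minimal sketch of the two sensitivity scores as described in the text.
# The input format (dicts of per-concept L/D/A flags and per-HLMRS yes/no
# responses) is an assumption made for illustration.

def concept_sensitivity(lda):
    """Tally every labeling, defining, or applying event in one paper.
    `lda` maps concept name -> (labeled, defined, applied) booleans."""
    return sum(int(labeled) + int(defined) + int(applied)
               for labeled, defined, applied in lda.values())

def hlmrs_sensitivity(hlmrs):
    """Proportion of affirmative responses over the four HLMRS questions.
    `hlmrs` maps skill name -> True/False."""
    return sum(hlmrs.values()) / len(hlmrs)

lda = {"justice": (True, True, False),   # labeled and defined
       "bribery": (False, False, True)}  # applied only
print(concept_sensitivity(lda))          # 3, as in the example in the text

hlmrs = {"professional knowledge": False,
         "multiple levels": True,        # e.g., "doctor, patient, engineer"
         "moves flexibly": False,
         "analogous cases": False}
print(hlmrs_sensitivity(hlmrs))          # 0.25, as in the example in the text
```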

For the HLMRS Sensitivity score, we model each student’s responses with a binomial distribution directly, avoiding the Poisson approximation, and then again use GLMM. These distributions are defined by parameters (n, p), where n is the number of HLMRS questions (four), and p is related to the combination of the fixed effects (coder and test time). GLMM incorporates dependencies among the repeated measures within each subject into the estimated fixed effects. We treat student as a random effect. The resulting model yields one common mean, and each student is represented via a different intercept for a repeated measure. To simplify the significance testing, we exclude the two drop-outs.
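The Poisson approximation invoked above for the Concept Sensitivity model can also be checked numerically: with n = 123 Bernoulli trials and a small success probability, the binomial and Poisson probability mass functions nearly coincide. The value of p below is hypothetical, chosen only so that the mean falls within the observed range of scores.

```python
# Numerical check of the approximation invoked above: a Binomial(n=123, p) with
# small p is close to a Poisson with the same mean n*p. The value of p here is
# illustrative, chosen so the mean falls in the observed score range (1-15).

from scipy.stats import binom, poisson

n, p = 123, 0.05            # 41 concepts x 3 actions; p is illustrative
mean = n * p                # ~6.15, a plausible Concept Sensitivity mean

max_gap = max(abs(binom.pmf(k, n, p) - poisson.pmf(k, mean))
              for k in range(0, 31))
print(f"largest pointwise pmf difference: {max_gap:.4f}")  # a small value
```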

Sensitivity Study Results and Discussion

The participants were students in a graduate bioengineering ethics class taught by author Pinkus. All students participated, but two dropped the class during the semester. On the first day of class, the students analyzed an ethical case, Artificial Heart (n=15). For the posttest, after their final exam, the students analyzed one of two cases, Price is Right (n=6) or Trees (n=7). Students did not choose which case to analyze, and their responses were all one or two pages long.

[IG: Please insert Table 1 around here]

There is an effect due to test time (p ≈ 0) for Concept Sensitivity scores, i.e., the difference from pre- to posttest was significant. There is also a partial coder effect: the Concept Sensitivity scores assigned by coder M differed significantly from the scores assigned by A and K (p < 0.001). The coder effect is at least in part due to the fact that the coders did not receive uniform training, which usually ensures consistency among coders, and thus they interpreted the Instrument somewhat differently. (We compute inter-rater reliability on this particular dataset in (Goldin et al. 2006b), and we give another example of differing coder perspectives under Inter-rater Reliability below.) The coder effect does not take away from the effect due to test time for each coder; that is, the operationalization of labeling, defining and applying concepts reflects student learning over the course of the semester.

When we compare HLMRS Sensitivity scores, there is again an effect due to test time (p < 0.001), which means that the difference in student scores from pretest to posttest was significant. There is also a coder effect (p ≈ 0), i.e., significant differences in scores assigned by all three coders. Again, we attribute the coder effects to lack of uniform training, and emphasize that they do not take away from the effect due to test time.

[IG: Please place Table 2 about here]

A potential weakness in our study arose from a mix-up in administering the posttest. If the pretest case were more difficult than the posttest cases, that would provide an alternative explanation of the effect due to test time. In the parlance of Campbell and Stanley (1963), this would be an example of instrumentation, or “autonomous changes in the measuring instrument which might account” for pre- to posttest difference. We had intended to control for this by separating the students into two groups and crossing the cases the groups see at pre- and posttest, but we made an error when administering the tests. Nonetheless, we controlled for other aspects of instrumentation by randomly assigning students to one of two posttest groups, by shuffling together pre- and posttest responses at coding time, and by employing multiple coders blinded to the nature of our study.

Furthermore, we chose similarly difficult cases as tests. Artificial Heart and Trees are both “conflict” cases; that is, a student ought to identify the values in conflict, and consequently to frame an analysis either by prioritizing values, or by proposing a “creative middle-way solution” to reconcile the conflict. Price is Right is a line-drawing dilemma that requires one to clarify concepts, such as what constitutes “quality”, and whether “long-term testing” is needed. Thus, the cases were chosen for structure, and for complexity with regard to concepts. All three cases focus on bioengineering ethics; all could be analyzed using the same method (Harris et al. 2000); none have easy answers. We also know that all three are of comparable difficulty based on our experience of analyzing them with students. In a related study that involved prompted case-analysis interviews, these cases elicited similarly complex reactions. (Pinkus et al. 1999)

Finally, we can verify empirically that the two posttest cases were similarly difficult. We compared student performance on the posttest cases using a GLMM. There is no significant difference in Concept Sensitivity according to coders M and K (p = 0.624), nor when we include non-blinded coder A (p = 0.9407). Similarly, there is no significant difference in HLMRS Sensitivity according to M and K (p = 0.2167), nor when we include A (p = 0.2137). However, we note that the samples are small, and we may lack the power to detect a moderate difference.


Potential confounds aside, the sensitivity study shows that the Instrument reflects student learning over the course of the semester with regard to the higher-level moral reasoning skills as well as to labeling, defining, and applying ethics concepts.

Inter-rater Reliability Study Methods

In the inter-rater reliability study we look at whether independent coders agree when they apply the Instrument to ethical case analyses. We calculate one agreement score for the four HLMRS, and another for the concepts operationalizing the HLMRS “Employed a method of moral reasoning in conducting the ethical analysis”. In the remainder of this section, we describe how we can tally our coders’ judgments using contingency tables, and how we use Cohen’s κ (Kappa) metric for computing coder agreement. (Cohen 1960) We briefly consider how κ is affected by the distribution of the data, and how to interpret κ.

We arrange our data in two-by-two contingency tables. A contingency table can compare two coders’ answers to a given question across a set of student papers. For us, this could be an HLMRS question such as “Did the student view the problem from multiple levels?” or a concept question such as “Did the student define the concept ‘informed consent’?” We compute two agreement scores, one that averages agreement across all HLMRS and one across labeling, defining, and applying concepts.

We calculate agreement as Cohen’s κ (Kappa). (Cohen 1960) κ is widely used for reporting agreement. (Byrt et al. 1993; Carletta 1996; Di Eugenio and Glass 2004; Sim and Wright 2005) The κ metric corrects the observed probability of agreement for the probability of agreement expected by chance; κ ranges from 1 (perfect agreement) to -1 (“perfect” disagreement).

While κ is widely recommended for reporting agreement, it is also susceptible to the problems of bias and prevalence, which can obscure the true levels of agreement in a dataset. Bias means that κ can be artificially inflated because of differences in the coders’ beliefs about the actual distribution of LDA combinations in our dataset. Prevalence means that κ can be artificially depressed because of imbalance in the “true”, coder-independent underlying distribution of LDA. In fact, we know that the actual distribution is unbalanced. First, there must always be many more instances of the absence of labels, definitions, or applications than of their presence; recall that not all of the concepts taught in the course and listed in the assessment instrument will be relevant to every case. Second, it is to be expected that coders have an easier time agreeing on instances where a concept is absent than on the instances that are less clear-cut.

Following (Byrt et al. 1993; Di Eugenio and Glass 2004; Sim and Wright 2005), we compute bias-adjusted κ (BAK) and prevalence-adjusted bias-adjusted κ (PABAK). Since BAK and PABAK complement κ, it is customary to interpret all three on the same scale. We found that κ and BAK diverge little for our data, which means that there is little difference in coder bias; thus, we omit BAK. However, κ and PABAK do differ, which confirms the intuition described above that prevalence of concept absences over presences depresses our κ values. We therefore focus on PABAK, and report κ for completeness.


As a guide to interpreting the metrics, Sim and Wright (2005) cite an interpretation of κ from Landis and Koch (1977): κ ≤ 0 is poor, 0.01–0.20 is slight, 0.21–0.40 is fair, 0.41–0.60 is moderate, 0.61–0.80 is substantial, and 0.81–1 is almost perfect. We are aware of other interpretations; “the choice of such benchmarks, however, is inevitably arbitrary, and the effects of prevalence and bias on κ must be considered when judging its magnitude.” (Sim and Wright 2005)
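For reference, the agreement statistics discussed above can be computed directly from a two-by-two contingency table. The sketch below follows the definitions cited in the text (Cohen 1960; Byrt et al. 1993; Landis and Koch 1977); the example table is hypothetical and is not drawn from our data.

```python
# Agreement statistics for one yes/no question coded by two raters, following
# the definitions cited in the text. In the 2x2 table, a = both coders say
# "yes", d = both say "no", and b, c are the two kinds of disagreement.

def cohen_kappa(a, b, c, d):
    n = a + b + c + d
    po = (a + d) / n                                     # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

def pabak(a, b, c, d):
    # Prevalence- and bias-adjusted kappa reduces to 2 * Po - 1 (Byrt et al.).
    n = a + b + c + d
    return 2 * (a + d) / n - 1

def landis_koch(k):
    # Benchmark interpretation quoted in the text (Landis and Koch 1977).
    for cutoff, word in [(0.0, "poor"), (0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial"),
                         (1.0, "almost perfect")]:
        if k <= cutoff:
            return word
    return "almost perfect"

# Hypothetical table for one concept over 41 papers: 5 papers where both
# coders mark the concept present, 30 where both mark it absent, and 6
# disagreements. The prevalence of absences depresses kappa relative to PABAK.
a, b, c, d = 5, 4, 2, 30
k = cohen_kappa(a, b, c, d)
print(round(k, 3), landis_koch(k))                       # ~0.536 moderate
print(round(pabak(a, b, c, d), 3), landis_koch(pabak(a, b, c, d)))  # ~0.707 substantial
```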

Inter-rater Reliability Results and Discussion

We collected regular required term papers from three graduate sections and one undergraduate section of the bioengineering ethics class taught by author Pinkus (N=41, Table 3). Students authored their own cases, and analyzed them using skills they learned in class. This longitudinal sample (all papers from all students in the four sections) is representative of the target population of papers by bioengineering ethics students. Each section’s papers were annotated using the Instrument by two of three possible coders: A, J, or C.

[IG: Please place Table 3 around here]

Coders were trained to use the instrument by practicing on several term papers under the supervision of author Pinkus. Coders C and J had been introduced to bioengineering ethics from an applied perspective through their work with Pinkus. C had been a student in the graduate bioethics course and then completed two teaching assistantships in that course under Pinkus’s supervision. J was a research assistant with Pinkus for two years and worked on several applied ethics projects during that time. Coder A had just completed a Master's in Bioethics that focused on a theoretical, philosophical approach to ethics. She was predisposed to this approach when she coded papers. Accordingly, we frame our discussion in terms of agreement either between two coders who share the applied ethics perspective (C vs. J), or between coders with differing perspectives (philosophical vs. applied, i.e., A vs. C, and A vs. J). As discussed below, differences in coder perspective have an apparent effect on reliability.

Because students chose paper topics themselves, the frequency of the concepts varies greatly. The least frequent was the concept ‘divergence’, which was never labeled, defined, or applied in any term papers; the most frequent was ‘utilitarianism’, labeled in all 41 papers, defined in 31, and applied in 38. Nineteen concepts were labeled, 12 were defined, and 23 were applied in at least 25% of the papers.

[IG: Please insert Figure 3 about here]

Let us begin by considering agreement on labeling, defining and applying concepts. There are three basic results: first, when both coders share the applied ethics perspective, they agree almost perfectly on whether a concept is labeled (PABAK=0.931, Figure 3), defined (0.943), or applied (0.911). Second, when coders hold different perspectives, agreement is substantial or better (PABAK between 0.698 and 0.874, ibid.). Third, PABAK is higher than κ across all coder pairs: the prevalence of concept absences “penalizes” κ. (We omit BAK. It is almost identical to κ, which indicates that there are few discrepancies in our coders’ biases.)

[IG: Please insert Figure 4 about here]


The coders’ agreement on the four HLMRS (Figure 4) is lower than agreement on labeling, defining and applying concepts. Coders sharing the applied ethics perspective have moderate agreement (PABAK=0.583), and when one coder holds a different perspective, slight to fair agreement (PABAK between 0.132 and 0.4). This makes sense, because the HLMRS are relatively coarse-grained and abstract, and thus more difficult to agree on than concepts, which operationalize one of the HLMRS at a fine representational grain, and whose specific nature permits more reliable, objective coding.

The gist of the agreement measurements is that coders who hold the same perspective on ethics agree on whether students label, define and apply ethics concepts (Figure 3, C vs. J), but they do not agree on whether students employ higher-level moral reasoning skills (Figure 4). These basic results are also true when coders hold distinct perspectives, although their agreement scores are lower. We infer that the Instrument can be applied reliably. These findings also underscore the value of operationalizing one of the HLMRS in terms of labeling, defining and applying ethics concepts. Even though the question of whether a student has labeled, defined or applied a concept is subject to the interpretation of coders, educating coders about applied ethics makes it possible to approach consensus.

Because our coders did not have the opportunity to practice and hone their skills on multiple case analyses of each of the 41 distinct cases, we believe the agreement results reported above are all the more remarkable. In future work, we may test whether agreement would be stronger if coders were trained under a uniform procedure on multiple standardized analyses of a few hand-chosen cases, and retrained if they diverged too much.

The way we conceive of labeling, defining and applying leads us to believe that a coder’s task ought to be most difficult in detecting applications, less so for detecting definitions, and least difficult for detecting labels. This intuition is supported by prior work (Goldin et al. 2006b), but on the term papers dataset, coder pairs A vs C and J vs C agree more about defining than about labeling (Figure 3, PABAK). One cause could be that while one label is trivial to find (a coder only needs to look for the particular term), it is difficult to continuously look out for any one of over 40 labels, and it may be easier to pick up on the longer definitions. Indeed, coders have an easier time finding labels that are relevant to the subject of a particular paper. (Goldin et al. 2006b) Ultimately, detection of labels ought to be performed automatically by a computer with greater accuracy than by a human coder.
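As a hint of what such automation might involve, the sketch below implements the simplest possible label detector: matching concept terms from Figure 1 against the text of an analysis. The term list excerpt and matching rule are illustrative only; detecting definitions and applications would require considerably more than term matching.

```python
# Sketch of the simplest possible automatic label detector: scan an analysis
# for concept terms drawn from Figure 1. This covers only the "labeling" part
# of LDA; the concept-to-term mapping shown here is a small illustrative
# excerpt, not the instrument's full list of 41 concepts.

import re

CONCEPT_TERMS = {
    "paternalism": ["paternalism", "paternalistic"],
    "informed consent": ["informed consent"],
    "utilitarianism": ["utilitarianism", "utilitarian"],
    "conflict of interest": ["conflict of interest"],
}

def detect_labels(text):
    found = set()
    lowered = text.lower()
    for concept, terms in CONCEPT_TERMS.items():
        if any(re.search(r"\b" + re.escape(t) + r"\b", lowered) for t in terms):
            found.add(concept)
    return found

excerpt = ("Dr. Hall's attitude toward Paul is paternalistic. The informed "
           "consent process was also compromised.")
print(detect_labels(excerpt))  # paternalism and informed consent are labeled
```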

Conclusion

As far as we know, the Assessment Instrument is unique in that it fits the following criteria, all relevant and desirable for assessment in bioengineering ethics as taught in a university setting. The Instrument is both theoretically and empirically motivated. It has stood up to an empirical evaluation of its validity and reliability, as reported here. It fits the setting and goals of assessment of a semester-long course, and instructors and teaching assistants adapt to it readily. It is comprehensive in that it can be used to assess a wide range of case analyses of a wide range of bioengineering ethical cases with no or minimal changes. It does not require a gold standard to assess student answers. It applies to student answers in essay form, which is the highest fidelity representation of a case analysis. Finally, its application is transparent and comprehensible even to a non-expert. It is unique, too, in its inclusion of the HLMRS and its focus on operationalizing one of those in terms of labeling, defining and applying the concepts targeted in the course.

The results of the inter-rater reliability study underscore the value of operationalizing theoretical constructs like the HLMRS in terms of more specific ones. While the Assessment Instrument only operationalizes in a standardized way the HLMRS “Employs a method of moral reasoning in conducting the ethical analysis”, we are interested in operationalizing the remaining four. We believe this would bring welcome reproducibility to assessment in ethics education. This is a matter of continuing research. For example, the combination of skills “identifies different perspectives” and “moves flexibly among the various perspectives” is comparable to adaptive expertise (Martin et al. 2005), a cognitive skill that is recognized as important in bioengineering and other professional endeavors. Fisher and Peterson (2001) propose a tool to measure this skill, which indicates a growing interest in objectifying cognitive aspects of moral reasoning. Operationalizations not only enhance the reliability of the Instrument; they make its formative application possible: the detail of student work revealed via the operationalized HLMRS makes plain to instructors the opportunities for feedback to students.

Aside from our effort to bring methodological rigor to assessment in bioengineering ethics education, the work reported here serves another purpose. We have proposed to build an intelligent tutoring system that helps students learn bioengineering ethics. (Goldin et al. 2006a) The system aims to respond intelligently to student case analyses by detecting whether the students label, define, and apply ethics concepts. The detection patterns are to be learned from a training set of manually annotated essays. This automated detection is useful only if the LDA operationalization is valid, and it is possible only if the coding of LDA is reliable. The evidence reported here suggests that the ability to invoke concepts by labeling, defining and applying is both measurable and a strong predictor of the ability to analyze a case, and thus the Instrument can serve as a basis for the system we propose to build.

Acknowledgments

The authors thank Christian Schunn, Janyce Wiebe, Diane Litman, and our reviewers for valuable feedback; Christos Theodoulou and Scott Nickleach, working under the supervision of Dr. Alan Sampson, for statistics consulting; Claire Gloeckner and Jessica DiFrancesco for coding and data entry; Mark Sindelar and Kaitlin Jones for coding; and Angela Fortunato for extensive assistance and advice. This work was supported by NSF Engineering and Computing Education grant #0203307.

References

Accreditation Board of Engineering and Technology. 2006. Criteria for Accrediting Engineering Programs.
Arras, John D. 1991. Getting Down to Cases: The Revival of Casuistry in Bioethics. Journal of Medicine and Philosophy 16: 29–51. doi:10.1093/jmp/16.1.29.
Beauchamp, T.L., and J.F. Childress. 2001. Principles of Biomedical Ethics. 5th ed. New York, N.Y.: Oxford University Press.
Byrt, T., J. Bishop, and J.B. Carlin. 1993. Bias, prevalence and kappa. J Clin Epidemiol 46: 423–429.
Campbell, D.T., and J.C. Stanley. 1963. Experimental and quasi-experimental designs for research. In Handbook of Research on Teaching, ix, 84. Chicago: Rand McNally.
Carletta, J. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22: 249–254.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37–46.
Comunian, A.L. 2002. Structure of the Padua Moral Judgment Scale: A Study of Young Adults in Seven Countries. In 110th Annual Conference of the American Psychological Association. Chicago, IL.
Di Eugenio, B., and M. Glass. 2004. The Kappa statistic: a second look. Computational Linguistics 30.
Fisher, F.T., and P.L. Peterson. 2001. A Tool to Measure Adaptive Expertise in Biomedical Engineering Students. In American Society for Engineering Education. Mira Digital Publishing, Inc.
Gibbs, J.C., and K. Widaman. 1982. Social Intelligence: Measuring the Development of Sociomoral Reflection, 191–211. New Jersey: Prentice-Hall.
Goldie, J., L. Schwartz, A. McConnachie, and J. Morrison. 2001. Impact of a New Course on Students’ Potential Behaviour on Encountering Ethical Dilemmas. Medical Education 35: 295–302.
Goldie, J., L. Schwartz, A. McConnachie, and J. Morrison. 2002. The impact of three years’ ethics teaching, in an integrated medical curriculum, on students’ proposed behaviour on meeting ethical dilemmas. Med. Educ. 36: 489–497.
Goldin, Ilya M., Kevin D. Ashley, and Rosa L. Pinkus. 2006a. Teaching Case Analysis through Framing: Prospects for an ITS in an Ill-defined Domain. In Workshop on Intelligent Tutoring Systems for Ill-Defined Domains, 8th International Conference on Intelligent Tutoring Systems. Jhongli, Taiwan.
Goldin, Ilya M., Kevin D. Ashley, and Rosa L. Pinkus. 2006b. Assessing Case Analyses in Bioengineering Ethics Education: Reliability and Training. In Proceedings of the 2006 International Conference on Engineering Education. San Juan, Puerto Rico.
Harris, C.E., M.S. Pritchard, and M.J. Rabins. 2000. Engineering Ethics: Concepts and Cases. 2nd ed. Belmont, CA: Wadsworth.
Hayes, R.P., A. Stoudemire, K. Kinlaw, M.L. Dell, and A. Loomis. 1999. Qualitative outcome assessment of a medical ethics program for clinical clerkships: A pilot study. Gen. Hosp. Psych. 21: 284–295.
Hébert, P.C., E.M. Meslin, and E.V. Dunn. 1992. Measuring the Ethical Sensitivity of Medical Students: A Study at the University of Toronto. J. Med. Ethics 18: 142–147.
Jonsen, A.R., and S.E. Toulmin. 1988. The Abuse of Casuistry: A History of Moral Reasoning. Berkeley: University of California Press.
Kipnis, K. 2000. Medical Ethics Education in a Problem-based Learning Curriculum. American Philosophical Association Newsletter on Philosophy and Medicine 100.
Kohlberg, L. 1981. The Philosophy of Moral Development. Vol. 1. Harper & Row.
Kohlberg, L. 1984. The Psychology of Moral Development. Vol. 2. Harper & Row.
Kuczewski, M.G. 1997. Fragmentation and Consensus: Communitarian and Casuist Bioethics. Washington, D.C.: Georgetown University Press.
Landis, J.R., and G.G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33: 159–174.
Lind, G. 2000. Moral Regression in Medical Students and Their Learning Environment. Revista Brasileira de Educacao Médica 24: 24–33.
Lynch, D.C., P.M. Surdyk, and A.R. Eiser. 2004. Assessing professionalism: a review of the literature. Medical Teacher 26: 366–373.
Martin, T., K. Rayne, N.J. Kemp, J. Hart, and K. Diller. 2005. Teaching Adaptive Expertise in Biomedical Engineering Ethics. Science and Engineering Ethics 11: 257–276.
National Institutes of Health Office of Extramural Research. 2006. Frequently Asked Questions for the Requirement for Education on the Protection of Human Subjects. May 9.
Olds, B.M., B.M. Moskal, and R.L. Miller. 2005. Assessment in Engineering Education: Evolution, Approaches and Future Collaborations. Journal of Engineering Education 94: 13–25.
Pence, G.E. 2000. Classic Cases in Medical Ethics: Accounts of Cases That Have Shaped Medical Ethics, with Philosophical, Legal, and Historical Backgrounds. 3rd ed. Boston: McGraw-Hill.
Pinkus, R.L. 1997. Engineering Ethics: Balancing Cost, Schedule, and Risk--Lessons Learned from the Space Shuttle. New York: Cambridge University Press.
Pinkus, R.L., M.T.H. Chi, J. McQuaide, K.D. Ashley, and M. Pollack. 1999. Some Preliminary Thoughts On Reasoning With Cases: A Cognitive Science Approach. Symposium presented at the Association for Moral Education, November 20, Minneapolis, MN.
Pinkus, R.L., C. Gloeckner, and A. Fortunato. 2015. The Role of Professional Knowledge in Case Based Reasoning. Science and Engineering Ethics. doi:10.1007/s11948-015-...
Rest, J.R., D. Narvaez, M.J. Bebeau, and S.J. Thoma. 1999. Postconventional Moral Thinking: A Neo-Kohlbergian Approach. New Jersey: Lawrence Erlbaum Associates.
Rudnicka, E. 2004. A review of instruments for measuring moral reasoning/values. In 10th International Conference on Industry, Engineering, and Management Systems, 305–311. Cocoa Beach, FL.
Savulescu, J., R. Crisp, K.W.M. Fulford, and T. Hope. 1999. Evaluating ethics competence in medical education. J. Med. Ethics 25: 367–374.
Shapiro, J., and R. Miller. 1994. How medical students think about ethical issues. Academic Medicine 69: 591–593.
Shuman, L.J., M. Besterfield-Sacre, and J. McGourty. 2005. The ABET “Professional Skills” -- Can They Be Taught? Can They Be Assessed? Journal of Engineering Education 94: 41–55.
Sim, J., and C.C. Wright. 2005. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther 85: 257–268.
Sindelar, M., L. Shuman, M. Besterfield-Sacre, R. Miller, C. Mitcham, B. Olds, R.L. Pinkus, and H. Wolfe. 2003. Assessing Engineering Students’ Abilities to Resolve Ethical Dilemmas. In 33rd Annual Frontiers in Education. Boulder, CO: IEEE.
Steneck, N.H. 2007. ORI Introduction to the Responsible Conduct of Research. August 3.
Venables, W.N., and B.D. Ripley. 2002. Modern Applied Statistics with S. 4th ed. New York: Springer.
Voss, J.F., and T.A. Post. 1988. On the solving of ill-structured problems. In The Nature of Expertise, ed. M. T. H. Chi, R. Glaser, and M. J. Farr, 261–285. Hillsdale, NJ: Lawrence Erlbaum.
Weil, V. 1990. Engineering Ethics in Engineering Education: Summary Report. In Conference at the Center for the Study of Ethics in the Professions at the Illinois Institute of Technology. Chicago.

Appendix: Figures and Tables

Theoretical Approaches: Utilitarianism (Act utilitarianism / Rule utilitarianism, Cost/benefit analysis, Risk analysis, Risk-benefit analysis, Risk disclosure, Risk management); Respect for persons (Golden rule); Informed consent

Analytical Techniques: Convergence, Creative middle way, Divergence, Line drawing, Multiple perspectives, Reversibility, Universalizability

Principles: Autonomy, Beneficence, Cicero’s creed, Common morality, Confidentiality, Failure to seek out the truth, Honesty, Justice, Nonmaleficence, Paternalism

Professional Ethics Issues: Animal rights, Bribery, Conflict of interest, Definition of death, Doctor-engineer responsibility, Doctor-patient relationship, Physician responsibility, Product liability, Research vs. clinical ethics, Responsibility of the bioengineer, Risk analysis, Safety, Whistle-blowing

Psychological “barriers” to ethical decisions: Groupthink

Figure 1: Concepts operationalizing the HLMRS criterion "Employed a method in conducting the ethical analysis", grouped by category. These categories are approximate and simplifying, and not mutually exclusive.

Figure 2: Concept Sensitivity scores (left) and simulated Concept Sensitivity scores (right, marked with S). "A, pre" is coder A, pretest, etc.


Dataset | Students | “Target” Concepts
pre (Artificial Heart) | 15 | 25
post case 1 (Price is Right) | 6 | 19
post case 2 (Trees) | 7 | 15

Table 1: “Target” concepts were possibly relevant to cases in the sensitivity study (the instrument lists 41 concepts).

Sensitivity Measure | Coder A (not blinded) | Coder K | Coder M
Mean pretest Concepts | 2.880 | 2.898 | 1.822
Mean posttest Concepts | 8.239 | 8.291 | 5.214
Mean pretest HLMRS | 0.627 | 0.333 | 0.303
Mean posttest HLMRS | 0.796 | 0.537 | 0.502

Table 2: Pre- and posttest mean Concept Sensitivity and HLMRS Sensitivity scores estimated from fits to Poisson and binomial distributions, respectively.

Pair of coders | N Graduates | N Undergraduates | N Total
A vs C | 11 | 8 | 19
A vs J | 10 | 0 | 10
J vs C | 12 | 0 | 12

Table 3: The number of items (term papers) in each condition (coder pair), including graduate and undergraduate term papers from four sections of Bioengineering Ethics.

Figure 3: Agreement over labeling (L), defining (D), and applying (A) concepts: κ and PABAK for coder pairs A vs C, A vs J, and J vs C. (Bar chart not reproduced.)

Figure 4: Average HLMRS agreement: κ, BAK, and PABAK for coder pairs A vs C, A vs J, and J vs C. (Bar chart not reproduced.)