Assessing the Pedagogical Effectiveness and Conversational Appropriateness in Three Versions of AutoTutor

Tanner Jackson1, Jim Mueller1, Natalie K. Person1, and Arthur C. Graesser2

1 Department of Psychology, Rhodes College, 2000 N. Parkway, Memphis, TN 38112
2 Department of Psychology, University of Memphis, Memphis, TN 38152

Abstract. AutoTutor's effectiveness as a tutor and conversational partner was assessed during three development cycles of the system. In Cycle 1 AutoTutor interacted with virtual students, whereas in Cycles 2 and 3, AutoTutor interacted with human students. The tutoring transcripts for the three cycles were analyzed by two sets of knowledgeable judges. One set of judges rated the pedagogical quality of each AutoTutor dialog move; the other set rated the conversational appropriateness of each move. Data from three evaluative cycles are presented in the paper.

1 Background

AutoTutor is a fully automated computer tutor that responds to student contributions by simulating the dialog moves of untrained human tutors. AutoTutor is currently designed to help college-level students learn about computer literacy topics. AutoTutor's architecture includes the following components: Curriculum Script, Language Analyzers, Latent Semantic Analysis (LSA), Dialog Move Generator, Dialog Advancer Network (DAN), and an Animated Agent. All of these architecture components have been discussed extensively in other publications [1, 2, 6, 15].

We currently have three versions of AutoTutor: AutoTutor 1.0, AutoTutor 1.1, and AutoTutor 2.0. The versions primarily differ in the mechanisms that control their respective dialog move generators. Versions 1.0 and 1.1 simulate the dialog moves of normal, untrained human tutors via production rules, whereas AutoTutor 2.0 uses production rules along with particular dialog move configurations to simulate more ideal tutoring strategies. The purpose of this paper is to present data that illustrate AutoTutor's strengths and weaknesses as a tutor and conversational partner over the three development cycles.

2 How AutoTutor works

AutoTutor helps students learn by having a conversation with them. The material that AutoTutor covers in the tutoring session is organized in a curriculum script. A curriculum script is a well-defined, loosely structured lesson plan that includes important concepts, questions, cases, and problems that teachers and tutors plan on covering in a particular lesson [3, 4, 8, 12]. AutoTutor's curriculum script includes 36 computer literacy questions and problems along with corresponding "ideal answers" (each composed of many potential good answer aspects), anticipated bad answers, corrections for the anticipated bad answers, and all of AutoTutor's dialog moves that contain information about the question/problem (i.e., assertions, hints, prompts, prompt responses, and summaries).

AutoTutor begins the tutoring session with a brief introduction and then asks the student a question from the curriculum script. The student responds to the question by typing her/his answer on the keyboard and hitting the "Enter" key. A number of language analyzers operate on the words in the student's contribution: a word and punctuation segmenter, a syntactic class tagger, and a speech act classifier. The speech act classifier assigns the student's input to one of five speech act categories: Assertion, WH-question, Yes/No question, Frozen Expression, or Prompt Completion. These speech act categories enable AutoTutor to sustain mixed-initiative dialog by responding to the student in conversationally and pedagogically appropriate ways [9, 10, 11].

The LSA component is used to assess the quality of student Assertions and to monitor other informative parameters such as Topic Coverage and Student Ability Level. LSA is a statistical technique that measures the conceptual similarity between two text sources. In AutoTutor, LSA assesses the quality of a Student Assertion by comparing it against two other computer literacy text sources: one that contains potential good answer aspects of the topic being discussed and one that contains the anticipated bad answers. LSA computes a geometric cosine that indicates the best conceptual match between two text sources and thereby determines how AutoTutor responds to the Student Assertion. Past analyses have indicated that our application of LSA is quite accurate in evaluating the quality of learner Assertions [5, 16].

The Dialog Move Generator is governed by a set of production rules that exploit data produced by LSA. The production rules are tuned to the following LSA parameters: (1) the quality of the student's Assertion, (2) the student's ability level, (3) the student's verbosity level, and (4) topic coverage (i.e., how much of the topic has been covered in the tutoring session). AutoTutor currently has 12 dialog moves that are frequently used by untrained, but effective, human tutors. The dialog moves include five forms of immediate short feedback (e.g., positive feedback), pump, positive pump, hint, correction, prompt, assertion, and summarize. A set of fifteen fuzzy production rules determines which dialog move AutoTutor should produce next [7].
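The geometric cosine at the heart of this evaluation can be sketched as follows. This is an illustrative sketch only: the vectors below are hand-written stand-ins, whereas AutoTutor derives its vectors from an SVD-reduced LSA space trained on computer literacy texts.

```python
import math

def cosine(u, v):
    # Geometric cosine between two vectors: dot(u, v) / (|u| * |v|).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a Mute (empty) contribution matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical LSA vectors for one student Assertion and the two
# comparison text sources (good answer aspects vs. anticipated bad answers).
assertion_vec   = [0.8, 0.1, 0.3]
good_answer_vec = [0.9, 0.2, 0.2]
bad_answer_vec  = [0.1, 0.9, 0.4]

good_match = cosine(assertion_vec, good_answer_vec)
bad_match  = cosine(assertion_vec, bad_answer_vec)
# Whichever cosine is larger indicates whether the Assertion is closer
# to a good answer aspect or to an anticipated bad answer.
```

The two cosines feed the production rules described next: a high match with the bad-answer text, for instance, is the trigger condition for a correction.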
Each fuzzy production rule specifies the conditions under which a particular dialog move should be initiated. The conditions specified in the production rules are based on previous research on human tutors' dialog move choices [3, 4]. For example, consider the CORRECTION rule:

IF [student ability = LOW or MEDIUM &
    student initiative = LOW or MEDIUM &
    subtopic coverage = LOW or MEDIUM &
    Student Assertion match with bad answer text = HIGH]
THEN [select CORRECTION]

If each of these conditions is met (that is, the LSA values for student ability, student initiative, and subtopic coverage fall in their specified ranges, and the bad answerness value exceeds the HIGH threshold), then AutoTutor will generate an answer correction on the next tutor turn.
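A minimal sketch of how the CORRECTION rule might be coded, using hard cutoffs in place of true fuzzy membership grading; the threshold values and function names here are hypothetical, not AutoTutor's actual parameters:

```python
# Sketch of the CORRECTION production rule. AutoTutor's fifteen fuzzy
# rules grade degrees of membership in LOW/MEDIUM/HIGH [7]; this sketch
# approximates that with crisp, assumed cutoffs.

HIGH_THRESHOLD = 0.7  # assumed cutoff for a HIGH fuzzy value

def fuzzy_level(value):
    # Map a [0, 1] LSA parameter onto the LOW/MEDIUM/HIGH labels.
    if value < 0.33:
        return "LOW"
    if value < HIGH_THRESHOLD:
        return "MEDIUM"
    return "HIGH"

def correction_rule(ability, initiative, coverage, bad_answer_match):
    # IF ability, initiative, and subtopic coverage are LOW or MEDIUM
    # AND the Assertion's match with the bad-answer text is HIGH,
    # THEN select the CORRECTION dialog move.
    if (fuzzy_level(ability) in ("LOW", "MEDIUM")
            and fuzzy_level(initiative) in ("LOW", "MEDIUM")
            and fuzzy_level(coverage) in ("LOW", "MEDIUM")
            and fuzzy_level(bad_answer_match) == "HIGH"):
        return "CORRECTION"
    return None  # some other rule fires instead
```

For example, `correction_rule(0.4, 0.2, 0.5, 0.9)` fires (a struggling student whose Assertion strongly matches an anticipated bad answer), whereas a high-ability student or a weak bad-answer match blocks the rule.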

3 Assessing Pedagogical Effectiveness and Conversational Appropriateness

The dialog moves generated by the three versions of AutoTutor were assessed in three evaluation cycles. After each evaluation cycle, the curriculum script, the fuzzy production rules, and the LSA parameter thresholds were adjusted in ways we believed would enhance AutoTutor's performance in subsequent cycles. For each of the evaluation cycles, two sets of judges rated AutoTutor's dialog moves on two holistic dimensions, namely pedagogical effectiveness (PE) and conversational appropriateness (CA). The two judges who rated PE were human tutoring experts who are quite knowledgeable about the effective pedagogical strategies frequently employed by untrained human tutors. The two judges who rated CA considered several factors relevant to conversation in their ratings of AutoTutor's dialog moves. These factors included the Gricean maxims of quality, quantity, relevance, and manner, along with the overall awkwardness of the tutor dialog move. Both PE and CA were rated on a six-point scale (1 = very poor, 6 = very good). Reliability measures were computed for both pairs of judges; the results indicated high inter-judge reliability on both dimensions (Cronbach's alpha = .94 for PE and .89 for CA). The particulars of the evaluation cycles are discussed below. The descriptive statistics for each cycle are presented in Table 1.

3.1 Cycle 1 Assessment

In Cycle 1, AutoTutor 1.0 interacted with virtual students. The use of virtual (or synthetic) students to test tutoring systems is not uncommon and has been advocated by other researchers [13, 14]. Our virtual students were created to emulate human students of varying ability and verbosity levels. Specifically, seven virtual students were created from the answers provided by human students in a pilot study. For example, the Good Verbose student was created by compiling all of the lengthy good answers from the human corpus for each of the 36 subtopics. The seven virtual students included in this analysis were Good Verbose, Good Succinct, Good Coherent, Vague, Mute, Bad/Erroneous, and Monte Carlo. The virtual student contributions were then fed to AutoTutor, and transcripts of the tutoring sessions between AutoTutor and the seven virtual students were created and evaluated by the two sets of judges. The PE and CA means (M), standard deviations (SD), and number of rated dialog moves (n) for each of the virtual students are presented in Table 1.

Table 1. Means for Pedagogical Effectiveness & Conversational Appropriateness Ratings

                              Pedagogical            Conversational
                              Effectiveness          Appropriateness
AutoTutor Version             M     SD    n          M     SD    n
AutoTutor 1.0 (Cycle 1)
  Good Verbose                4.52  1.41   210       4.77  1.52   210
  Good Succinct               4.63  1.42   283       4.48  1.17   283
  Vague                       4.12  1.52   338       4.47  1.24   338
  Erroneous                   3.30  1.59   301       3.97  1.59   301
  Mute                        3.75  1.47   592       4.52  1.20   592
  Good Coherent               5.26  1.26   286       4.86  1.29   286
  Monte Carlo                 4.03  1.77   273       4.39  1.41   273
  All Virtual Students        4.23  1.47  2283       4.49  1.34  2283
AutoTutor 1.1 (Cycle 2)
  Human Students              4.13  1.24  2058       4.02  1.57  2070
AutoTutor 2.0 (Cycle 3)
  Human Students              3.75  1.29  2470       3.81  1.61  2568
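The inter-judge reliability reported above can be reproduced with the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of per-judge variances / variance of totals), with k = 2 judges. The sketch below uses hypothetical ratings for illustration, not data from the study:

```python
def cronbach_alpha(ratings_by_judge):
    # ratings_by_judge: one list of scores per judge, equal lengths;
    # each position is one rated dialog move.
    k = len(ratings_by_judge)

    def variance(xs):
        # Population variance.
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # Total score per dialog move, summed across judges.
    totals = [sum(move) for move in zip(*ratings_by_judge)]
    item_var_sum = sum(variance(judge) for judge in ratings_by_judge)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

# Hypothetical six-point ratings from two judges (illustrative only).
judge_a = [4, 5, 3, 6, 4, 5, 2, 4]
judge_b = [4, 5, 4, 6, 4, 5, 2, 5]
alpha = cronbach_alpha([judge_a, judge_b])  # close to 1 for agreeing judges
```

With perfectly agreeing judges the formula returns 1.0; the .94 (PE) and .89 (CA) values reported above reflect near but not perfect agreement.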

3.2 Cycle 2 and Cycle 3 Assessment

The dialog moves of AutoTutor 1.1 and AutoTutor 2.0 were evaluated in Cycles 2 and 3, respectively. In these cycles, AutoTutor interacted with 60 human students (36 in Cycle 2 and 24 in Cycle 3) enrolled in a college-level computer literacy course. Computer literacy is a required course at the institution where the study was conducted; therefore, it is reasonable to assume the participants represented students of varying ability levels. All students received extra credit in the course or money for their participation.

Versions 1.1 and 2.0 primarily differ in the mechanisms that control which dialog moves are generated after a student contribution. AutoTutor 1.1 simulates the dialog moves of untrained human tutors using a revised set of the fuzzy production rules mentioned earlier. AutoTutor 2.0 uses a combination of production rules and predetermined discourse patterns to simulate more ideal tutoring strategies. More specifically, AutoTutor 2.0 incorporates tutoring tactics that prod students to elaborate their knowledge. Whereas AutoTutor 1.1 considers information about a topic to be covered when either the student or the tutor articulates it, AutoTutor 2.0 considers only what the student says when computing topic coverage. Therefore, if the student does not articulate information about a particular topic, it is not considered covered. This tactic forces the student to articulate the explanations in their entirety, an extreme form of constructivism. To get the student to contribute and elaborate, AutoTutor 2.0 uses discourse patterns that organize dialog moves in terms of their progressive specificity: Hints are less specific than Prompts, and Prompts are less specific than Assertions. AutoTutor 2.0 cycles through a Hint-Prompt-Assertion configuration until the student articulates the information AutoTutor is expecting. The other dialog moves (e.g., short feedbacks and summaries) are controlled by the production rules that were described for AutoTutor 1.1.
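The Hint-Prompt-Assertion cycling can be sketched as follows. The function and parameter names are hypothetical; in particular, the real system decides coverage with an LSA score over what the student has articulated, not a boolean flag:

```python
# Sketch of AutoTutor 2.0's progressive-specificity discourse pattern:
# Hints are less specific than Prompts, which are less specific than
# Assertions. The tutor cycles through these until the student
# articulates the expected information.

SPECIFICITY_ORDER = ["Hint", "Prompt", "Assertion"]

def next_dialog_move(turns_on_aspect, student_covered_aspect):
    # turns_on_aspect: how many tutor turns have targeted this answer
    # aspect so far; student_covered_aspect: whether the student has
    # articulated it (decided by LSA in the real system).
    if student_covered_aspect:
        return None  # aspect covered by the student; move on
    return SPECIFICITY_ORDER[turns_on_aspect % len(SPECIFICITY_ORDER)]
```

For example, three unsuccessful turns on one answer aspect yield a Hint, then a Prompt, then an Assertion, and the cycle restarts on the fourth turn; the moment the student articulates the aspect, the tutor moves on. This mirrors why Cycle 3 sessions ran long: a silent student keeps the cycle going.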

4 Conclusions

Although the reported PE and CA means were progressively lower in each evaluation cycle, the mean values indicate that AutoTutor is a somewhat effective tutor and conversational partner (recall that a value of 6 is "very good"). That is, the judges' PE and CA ratings were more positive than negative for all student types in all evaluation cycles. AutoTutor clearly has room for improvement on both dimensions; however, we are not terribly discouraged by these assessment means, for the following reasons. First, to some extent we anticipated that the Cycle 2 ratings would be lower than the Cycle 1 ratings because Cycle 2 was AutoTutor's first encounter with human students. The interactions with the virtual students in Cycle 1 did not possess any of the mixed-initiative qualities that occurred in the Cycle 2 interactions. The human students in Cycle 2 asked questions and typed a number of meta-communicative and meta-cognitive speech acts (e.g., "What did you just say?", "How am I supposed to know that?") that AutoTutor was not equipped to handle. Second, we suspect that the lower mean ratings for Cycle 3 are in large part due to the length of the interactions with AutoTutor 2.0. Recall that AutoTutor 2.0 forces the student to construct the answer to the topic question or problem. This strategy resulted in considerably longer interactions with the tutor, as well as a good bit of redundancy in AutoTutor's dialog moves. Hence, AutoTutor 2.0 may nevertheless prove to be a better overall tutor when student learning gains are measured. The lengthier sessions, however, may have produced frustration in the students, which in turn affected the nature of the tutorial interaction.

References

[1] Foltz, P. W. (1996). Latent semantic analysis for text-based research. Behavior Research Methods, Instruments, and Computers, 28, 197-202.
[2] Graesser, A. C., Franklin, S., Wiemer-Hastings, P., & the Tutoring Research Group (1998). Simulating smooth tutorial dialog with pedagogical value. Proceedings of the American Association for Artificial Intelligence (pp. 163-167). Menlo Park, CA: AAAI Press.
[3] Graesser, A. C., & Person, N. K. (1994). Question asking during tutoring. American Educational Research Journal, 31, 104-137.
[4] Graesser, A. C., Person, N. K., & Magliano, J. P. (1995). Collaborative dialog patterns in naturalistic one-on-one tutoring. Applied Cognitive Psychology, 9, 359-387.
[5] Graesser, A. C., Wiemer-Hastings, P., Wiemer-Hastings, K., Harter, D., Person, N., & the Tutoring Research Group (in press). Using latent semantic analysis to evaluate the contributions of students in AutoTutor. Interactive Learning Environments.
[6] Hu, X., Graesser, A. C., & the Tutoring Research Group (1998). Using WordNet and latent semantic analysis to evaluate the conversational contributions of learners in the tutorial dialog. Proceedings of the International Conference on Computers in Education, Vol. 2 (pp. 337-341). Beijing, China: Springer.
[7] Kosko, B. (1992). Neural networks and fuzzy systems. New York: Prentice Hall.
[8] McArthur, D., Stasz, C., & Zmuidzinas, M. (1990). Tutoring techniques in algebra. Cognition and Instruction, 7, 197-244.
[9] Person, N. K., Bautista, L., Kreuz, R. J., Graesser, A. C., & the Tutoring Research Group (in press). The dialog advancer network: A conversation manager for AutoTutor. In the ITS 2000 Proceedings of the Workshop on Modeling Human Teaching Tactics and Strategies. Montreal, Canada.
[10] Person, N. K., Graesser, A. C., & the Tutoring Research Group (2000). Designing AutoTutor to be an effective conversational partner. Proceedings of the 4th International Conference of the Learning Sciences. Ann Arbor, MI.
[11] Person, N. K., Graesser, A. C., Kreuz, R. J., Pomeroy, V., & the Tutoring Research Group (2000). Simulating human tutor dialog moves in AutoTutor. International Journal of Artificial Intelligence in Education.
[12] Putnam, R. T. (1987). Structuring and adjusting content for students: A study of live and simulated tutoring of addition. American Educational Research Journal, 24, 13-48.
[13] Ur, S., & VanLehn, K. (1995). Steps: A simulated, tutorable physics student. Journal of Artificial Intelligence in Education, 6(4), 405-437.
[14] VanLehn, K., Ohlsson, S., & Nason, R. (1994). Applications of simulated students: An exploration. Journal of Artificial Intelligence in Education, 5(2), 135-175.
[15] Wiemer-Hastings, P., Graesser, A. C., Harter, D., & the Tutoring Research Group (1998). The foundations and architecture of AutoTutor. Proceedings of the 4th International Conference on Intelligent Tutoring Systems (pp. 334-343). Berlin, Germany: Springer-Verlag.
[16] Wiemer-Hastings, P., Wiemer-Hastings, K., & Graesser, A. C. (1999). Improving an intelligent tutor's comprehension of students with latent semantic analysis. International Journal of Artificial Intelligence in Education (pp. 535-542). Amsterdam: IOS Press.