© 2001 by Michael A. Erickson

Submitted for publication: 8 November 2001. Do not quote without permission.

Multiple Representations in Inductive Category Learning: Evidence of Stimulus- and Time-Dependent Representation

Michael A. Erickson
Carnegie Mellon University

John K. Kruschke
Indiana University, Bloomington

ATRIUM, a rule-and-exemplar theory of category learning, posits that people can learn to use different psychological representations to classify different stimuli. This entails that people's representations will change from stimulus to stimulus at a given time and will change over time. Experiment 1 extended work by Aha and Goldstone (1992) to demonstrate that participants utilize multiple representations once training is complete and that the representation that is selected varies systematically from stimulus to stimulus. Experiment 2 extended work by Nosofsky, Clark, and Shin (1989) to demonstrate that many participants use exemplar-similarity-based representations during training and shift to rule representations when feedback is eliminated and participants are required to generalize to novel stimuli. ATRIUM is able to account for the data and to assay participants' representations.

The ability to recognize a dog as a "dog" is one that most people take for granted. As people hear words and see objects, their percepts are reliably transformed into concepts. The question of the process and representation underlying this transformation has been central to cognitive psychology. In this article, we address the question of representation using category learning tasks. Theorists have proposed a wide variety of representations for concepts and categories. These may be grouped into two classes: Under one class of theories, category representations are summaries of previous perceptual experiences. Sometimes these summaries are conceived of as a single prototype or average per category (Homa, 1984; Posner & Keele, 1968; Reed, 1972), sometimes they are conceived of as a collection of prototypes that are combined to represent each category (Anderson, 1991; Ashby & Waldron, 1999), and sometimes they are conceived of as a collection of rules or propositions that define conditions that must be met for membership in each category (Anderson, 1993; Bruner, Goodnow, & Austin, 1956; Medin, Wattenmaker, & Michalski, 1987;

Nosofsky, Palmeri, & McKinley, 1994; Trabasso & Bower, 1968). Under a second class of theories, category representations are composed of entire collections of previous perceptual experiences. These theories are known as exemplar theories (Medin & Schaffer, 1978; Nosofsky, 1986; Kruschke, 1992).

We have previously argued that people use both classes of representations (Erickson & Kruschke, 1998, in press). In particular, we provided evidence that human learners used rule representation when extrapolating outside the range of training. This was done using a category structure that was composed of a simple rule with a few high-frequency exceptions. At the conclusion of training, participants were asked to classify transfer items among which were some that were beyond the range of training and were more similar to the exception training items than to the rule training items. Participants tended to classify these items according to the rule. This finding could not be accounted for by a model that only incorporated exemplar representation.

Further, we proposed a hybrid theory of categorization called ATRIUM (Attention To Rules and Instances in a Unified Model). According to this theory, people learn to classify stimuli using both rule and exemplar representations, and they learn which representation is best suited to classify different stimuli correctly. This model was able to account for participants' behavior in these experiments. The purpose of the present article is to test two predictions made by ATRIUM. The first is that people use different representations to classify different stimuli within a single task, and the second is that over time people learn which representation is best suited to classify the different stimuli that are presented.

Michael A. Erickson, Department of Psychology, Carnegie Mellon University and Center for the Neural Basis of Cognition, Pittsburgh, Pennsylvania, and John K. Kruschke, Department of Psychology and Cognitive Science Program, Indiana University, Bloomington. This work was supported in part by Indiana University Cognitive Science Program Fellowships, by National Institute of Mental Health (NIMH) Research Training Grant PHS-T32-MH19879, NIMH National Research Service Award 1-F32-MH12722, and in part by NIMH FIRST Award 1-R29-MH51572. We thank Jason Arndt, Melanie Cary, Michael Gasser, Jay McClelland, Robert Nosofsky, and Richard Shiffrin for their thoughtful comments. Correspondence concerning this article should be addressed to Michael A. Erickson, Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213. Electronic mail may be sent to [email protected].

This article proceeds as follows: First, we give an overview of ATRIUM. Next, we describe two experiments in which participants learn to classify stimuli from category structures that contain multidimensional rules. In the first experiment, we demonstrate that people learn to use different representations for different stimuli. In the second experiment, we examine how participants' representations change over time. Because representations are not directly observable, mathematical models are utilized in both experiments to indicate the degree to which different underlying representations are able to account for the data. Finally, we consider the implications of these experiments for related theories of category learning.


ATRIUM: A Hybrid Rule-and-Exemplar Model of Categorization

ATRIUM is a model of category learning that instantiates five psychological principles. The first and second are, respectively, that people use exemplar and rule representation when learning to categorize. The third principle, which is the major focus of this article, concerns the mediation among different representations. It is that people learn to use representations that are well suited to the stimuli that are being classified. For example, when classifying different members of the violin family, one might use rules based on size, whereas when classifying different types of guitars (e.g., electric vs. acoustic), shape-based rules might prove more fruitful. Additionally, if one has extensive experience with a particular guitar, one might be able to classify that guitar, and others that are similar, using direct associations. We call this third principle representational attention. The fourth principle is that people learn to attend to the aspects of stimuli that are relevant for the task at hand. This principle has been termed dimensional attention (Kruschke, 1992; Nosofsky, 1986). The final principle is that people learn when they make mistakes. This principle is called error-driven learning.

In this section, we describe how each of these principles functions in ATRIUM at a conceptual level. A more formal description is presented in Appendix A. Additionally, we describe differences between the current formalization of ATRIUM and the one we previously described (Erickson & Kruschke, 1998). The reason for these differences is that the previous version of ATRIUM had a simplified rule-learning mechanism: It could only learn to apply a rule it had been given.
At the time, this simplification was deemed adequate because the primary focus of that work was to understand whether the principles in ATRIUM could provide a satisfactory account of the interaction between rule and exemplar representation (Erickson & Kruschke, 1998, p. 117). In the present article, the simplification is no longer acceptable inasmuch as the category structures that are utilized permit participants to choose among a number of possible rules.

ATRIUM is formalized as a hybrid connectionist model. As can be seen in Figure 1, it is composed of several modules that are mediated by a competitive gating mechanism. Each module uses a different representation to classify the current stimulus. It is the job of the gating mechanism to learn which module does the best job from stimulus to stimulus.

Figure 1. A simplified representation of the architecture of ATRIUM, the hybrid rule and exemplar model with multiple rules, used to fit the experimental data. The dotted lines represent connections with learned weights.

Rule Modules

The principle of rule representation is instantiated in the rule modules. In ATRIUM the notion of a rule is formalized

as a boundary that is perpendicular to a single psychological dimension. Each boundary divides the psychological space into two regions, and each region is associated with a category label. For example, according to ATRIUM, if a participant were given the task of classifying items according to size, this could be represented by a boundary perpendicular to the size dimension in psychological space. Because this boundary is perpendicular to the size dimension, it is not influenced by the value of other dimensions such as shape or color. Thus, because the rule modules are each sensitive to variation only along a single stimulus dimension, they also implement the principle of dimensional attention. If large items were assigned to Category A and small items to Category B, this could be represented in ATRIUM by an association of the large region with Category A and the small region with Category B.

In our previous formalization of ATRIUM, the model contained only a single rule module, so it could only account for a rule on one dimension. Moreover, this rule module had a rule boundary that was fixed in the correct position at the beginning of each simulation. In the present version of ATRIUM, however, there is a rule module for each dimension of stimulus variation. Thus, ATRIUM can learn to form rules on multiple dimensions. Additionally, the present version of ATRIUM learns to select boundaries that correspond to the category structure being learned. This allows ATRIUM to learn where the rule boundary is on each dimension and, if necessary, to learn multiple boundaries along a single dimension.
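A single-dimension rule of this kind can be illustrated with a minimal sketch. The sigmoid (graded) boundary and the function names below are our assumptions for illustration; ATRIUM's exact equations are given in Appendix A.

```python
import math

def rule_module_response(stimulus, dim, boundary, gain=1.0):
    """Evidence that the stimulus falls on the 'Category A' side of a rule
    boundary perpendicular to a single psychological dimension. A sigmoid
    gives a graded rather than all-or-none boundary (an illustrative choice)."""
    x = stimulus[dim]  # only the rule-relevant dimension influences the output
    return 1.0 / (1.0 + math.exp(-gain * (x - boundary)))

# A height rule with its boundary at 5.5: a tall stimulus is confidently
# assigned to the 'large' region, and the line-segment position is ignored.
stim = {"height": 8.0, "position": 2.0}
p_a = rule_module_response(stim, "height", boundary=5.5, gain=2.0)
```

Because the boundary is perpendicular to the height dimension, changing the segment position of the stimulus leaves the module's output unchanged, which is the sense in which a rule module implements dimensional attention.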

Exemplar Module

In addition to the rule modules, ATRIUM contains an exemplar module that implements exemplar representation. This is done using Kruschke's (1992) ALCOVE model as the exemplar module. Under exemplar representation, a stimulus is classified according to its similarity to members of each category. Unlike the rule modules, the exemplar module can integrate information from all the dimensions of stimulus variation simultaneously. Using this integrated information, the exemplar module assesses the overall similarity of the current item to the previously categorized items. The current item is then classified as a member of the category to which its overall similarity is greatest. The exemplar module can also learn that similarity along one dimension is more important for categorization than similarity along other dimensions. Thus, because it can weight the relative importance of stimulus differences along each dimension differently, it also instantiates the principle of dimensional attention.

Gating Mechanism

The gating mechanism uses exemplar representation for a different purpose than the exemplar module. Rather than learning to classify each stimulus into the correct category, the gating mechanism learns which module should be used to classify each item. For example, if the current item is similar to items that have been successfully classified by a rule based upon size, the size rule module will be responsible for ATRIUM's ultimate category decision.

As the gating mechanism learns which module to use to classify each stimulus, it is controlled by a set of cost parameters. Each module has its own cost parameter. These parameters control how easy it is to use each of the modules to make a classification. For example, if it were very difficult to apply a particular rule, the cost of learning to use the module for that rule would probably be high relative to the other modules.

The gating mechanism in ATRIUM uses the mixture of experts algorithm described by Jacobs, Jordan, Nowlan, and Hinton (1991). As described by Jacobs et al., the gating mechanism is a stochastic device that probabilistically selects one of the modules on each trial to provide the model's output. According to Jacobs (personal communication, August 7, 1996), however, in their simulations, they used the gating mechanism to provide weights for a linear mixture of the outputs from each module. The rationale for this is that it produces the expected value of a stochastic gate. ATRIUM adopts this latter procedure. The gating mechanism has one output (or gate unit) for each module in the model. The gate units are normalized to values between 0 and 1 that sum to 1, and each gate unit acts as a multiplier on the output of its corresponding categorization module. Thus, by controlling the degree to which each module, that is, each representation, contributes to the classification of each stimulus, the gating mechanism implements the principle of representational attention.
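The normalize-and-mix step can be sketched as below. The softmax normalization and the variable names are assumptions of the sketch; what the sketch preserves from the text is that gate units lie between 0 and 1, sum to 1, and linearly weight the module outputs.

```python
import math

def gate_weights(gate_activations):
    # Normalize raw gate activations to values between 0 and 1 that sum to 1.
    exps = [math.exp(a) for a in gate_activations]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_output(module_outputs, gate_activations):
    # Linear mixture of module outputs weighted by the gate: the expected
    # value of a stochastic gate that picks one module per trial.
    return sum(w * out for w, out in zip(gate_weights(gate_activations), module_outputs))

# Three modules' evidence for Category A; the gate strongly favors module 0,
# so the mixture lies close to module 0's output.
outputs = [0.9, 0.2, 0.5]
p_a = mixed_output(outputs, [3.0, 0.0, 0.0])
```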

Error-Driven Learning

During training, after making its response, ATRIUM learns based on the same feedback given to participants in the experiment. The feedback indicates which category label is the correct one. Each of the modules, as well as the gating mechanism, adjusts its behavior incrementally so that if the same stimulus were presented again, ATRIUM's response would be more likely to be correct. This is done using backpropagation of error within the mixture of experts framework (Jacobs et al., 1991; Rumelhart, Hinton, & Williams, 1986).

Under the mixture of experts framework, the amount of feedback that each module receives is a function of how well it was able to classify the current stimulus. This is computed for each module by combining its accuracy on the current trial with the activation of its gate unit, which provides an indication of its past accuracy. Thus, for a given stimulus, if one module is more accurate than the other modules and it has been more accurate in the past, it will receive almost all of the feedback. Equally important, a module that is poorly suited to classify a given stimulus, relative to the other modules available, will receive very little feedback and, therefore, will not try to adapt to classify such stimuli. This allows each module to learn to classify the stimuli for which it is an "expert."
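The credit-assignment step can be sketched as follows. The exp(−error) likelihood term follows the standard mixture-of-experts formulation of Jacobs et al. (1991); the exact cost terms in ATRIUM's learning rule may differ and are given in Appendix A.

```python
import math

def responsibilities(gate_weights, module_errors):
    # Posterior credit per module: gate weight (an index of past accuracy)
    # times exp(-error) (current accuracy), renormalized to sum to 1.
    raw = [g * math.exp(-e) for g, e in zip(gate_weights, module_errors)]
    total = sum(raw)
    return [r / total for r in raw]

# Module 0 is both favored by the gate and accurate on this trial, so it
# absorbs almost all of the learning signal; the poorly suited modules
# receive very little and so are barely perturbed.
credit = responsibilities([0.6, 0.3, 0.1], [0.05, 2.0, 2.0])
```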

Continuity

Because ATRIUM has undergone changes since it was first described by Erickson and Kruschke (1998), one might wonder whether it is still able to account for the behaviors observed in those experiments. Erickson and Kruschke (in press) utilized the current version of ATRIUM to account for people's behaviors with a category structure that was highly similar to the ones used by Erickson and Kruschke (1998), and it was able to provide an excellent account of the major phenomena of interest.

Using Models to Understand Behavior

The purpose of mathematical models in this article is to understand human behavior in much the same way as general linear models are used in analysis of variance to understand human behavior. For example, in analysis of variance, one of the questions that is often asked is whether the dependent measures from two groups deviate from the grand mean to a sufficient degree to claim that the groups' behaviors are different. One way of testing this is to propose two competing models to account for the data. One model is designed to be simple by ignoring group membership: It includes a single free parameter that is estimated from data to be equal to the grand mean. The second model is designed to be more complex by incorporating an additional principle, the principle of group membership. It, therefore, has two free parameters that are set equal to the means from the two groups. If the two-parameter model provides an account of the data that is sufficiently superior, it is inferred that there is a systematic relationship between the dependent measure and the additional principle in the more complex model, that is, group membership. Thus, by using model-fitting techniques, ANOVA tests provide information about systematic relationships in the data.
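The grand-mean-versus-group-means comparison can be made concrete with a small sketch (the data are hypothetical; the F ratio below is the standard one-way ANOVA statistic for two groups):

```python
def sse(data, prediction):
    # Sum of squared deviations of the data from a constant prediction.
    return sum((x - prediction) ** 2 for x in data)

def f_ratio(group1, group2):
    # Simple model: one free parameter, the grand mean.
    both = group1 + group2
    sse_simple = sse(both, sum(both) / len(both))
    # Complex model: two free parameters, one mean per group.
    sse_complex = (sse(group1, sum(group1) / len(group1))
                   + sse(group2, sum(group2) / len(group2)))
    # F compares the improvement in fit per extra parameter to the
    # residual error per remaining degree of freedom.
    df_between, df_within = 1, len(both) - 2
    return ((sse_simple - sse_complex) / df_between) / (sse_complex / df_within)

# Well-separated groups: the two-parameter model fits far better,
# so the F ratio is large and group membership is inferred to matter.
f = f_ratio([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
```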


In the two experiments described in this article, two models are fit to the data. One model, ALCOVE, is simpler than the other, ATRIUM. ALCOVE is simpler because it does not instantiate the principle of representational attention; ATRIUM is more complex and has more free parameters because it does. In the fitting process, parameters are estimated from the data using a maximum likelihood process that yields the same statistical properties for these parameters as for means in the case of an ANOVA. Once the best estimates of the parameters are obtained, the fits of the two models can be compared in a way that is highly similar to the F-test used to evaluate an ANOVA. If the more complex model yields a sufficient improvement in fit, it is inferred that there is a systematic relationship between the dependent measure and the additional principle (or principles) in the more complex model, in this case, representational attention.

When examining an ANOVA, it is often fairly simple to relate the results of significance tests to the data. However, with a sufficiently complex design, three- or four-way interactions may be found that are often difficult to visualize or to understand. Nevertheless, if a statistical test indicates the significance of a complex interaction, it is not ignored or set aside. The model-fitting analyses in this article fall along the same continuum as do ANOVAs. As will be seen, in Experiment 1, the analyses confirm what is suggested by a visual inspection of the data. In Experiment 2, however, the statistical conclusions are not immediately obvious in the data. Just as with ANOVA, however, this does not mean that the result should be ignored or set aside.
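For maximum-likelihood fits of nested models, the analogous comparison can be sketched as a likelihood-ratio statistic. The log-likelihood values below are hypothetical, and 3.84 is the chi-square critical value for one extra free parameter at α = .05; this is a generic sketch of the technique, not the specific test statistic used later in the article.

```python
def likelihood_ratio_statistic(loglik_simple, loglik_complex):
    # G^2 = 2 (ln L_complex - ln L_simple); for nested models this is
    # asymptotically chi-square distributed under the simpler model,
    # with df equal to the number of extra free parameters.
    return 2.0 * (loglik_complex - loglik_simple)

CHI2_CRIT_DF1 = 3.84  # alpha = .05, df = 1

# Hypothetical fits: the more complex model improves the log likelihood
# by 9.5, so G^2 = 19.0 exceeds the critical value and the extra
# principle is inferred to capture systematic structure in the data.
g2 = likelihood_ratio_statistic(-520.3, -510.8)
prefer_complex = g2 > CHI2_CRIT_DF1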

Experiment 1: Context-Specific Rules

When people make categorization judgments in everyday life, the way they weight attributes varies according to context. For example, when building a house, one would be unlikely to consider the attributes that allow a rock to be used as a "hammer." When pounding in tent stakes, however, these attributes are likely to be quite salient. Experiment 1 adapts a category structure first used by Aha and Goldstone (1992) to determine the degree to which participants learn context sensitivity in a single experimental session. This experiment has two main purposes: The first is to examine patterns of individual differences in this task to verify that the results obtained by Aha and Goldstone accurately reflected individual performances and were not the result of averaging. The second purpose is to use model-based theoretical analyses to demonstrate that the representations used by participants are stimulus dependent. We first give an overview of the experiment, followed by an explanation of possible shortcomings in the analysis performed by Aha and Goldstone.

Aha and Goldstone (1992) used a clever category learning experiment to demonstrate a shortcoming of models such as the Generalized Context Model (GCM, Nosofsky, 1986) and ALCOVE (Kruschke, 1992). These models incorporate the principle of dimensional attention. They posit that people are able to allocate attention to dimensions of stimulus variation differentially. Further, they hold that dimensional attention changes to optimize overall accuracy and that it does not change as a function of the stimulus. Therefore, according to these models, dimensional attention is a function of a participant's history of exposure to the stimuli and the feedback they received. Aha and Goldstone, however, argued that people could learn to shift dimensional attention from stimulus to stimulus. In other words, they held that dimensional attention is a function of the current stimulus as well as the participant's history of exposure and feedback.

In their experiments, Aha and Goldstone (1992) used a category learning task with a category structure similar to that shown in Figure 3. The training items are indicated by cells containing the letter A or B, which indicates the category assignment. The key stimuli in this category structure are indicated by the cells labeled A′ or B′. As will be explained, they provide an assay to determine whether people are consistently attending to one dimension more than another or are shifting attention from dimension to dimension as a function of which stimulus is presented.

In this experiment, participants learned to classify stimuli into one of two categories, labeled A and B. Each stimulus was a rectangle with a short interior line segment located near the bottom (see Figure 2 for an example). These stimuli varied along two psychologically separable dimensions: rectangle height and the horizontal position of the line segment.

Figure 2. A sample stimulus from Experiment 1. On each trial, the rectangle and line segment as well as the tic marks and labels appeared on the screen. The rectangle height and line segment position fell at regular intervals between the tic marks and were, therefore, never aligned with any of the tic marks.

The training stimuli were drawn from two clusters, one in the lower-left quadrant of Figure 3, which are referred to as height stimuli, and one in the upper-right quadrant of Figure 3, which are referred to as position stimuli. The height

Figure 3. Category structure for Experiment 1. The rows and columns represent the stimulus values along each dimension (rectangle height or segment position). The cells containing the letter A or B were training items for their respective categories. The cells labeled A′ or B′ indicate key transfer stimuli. The numbers 0–9 along the horizontal and vertical axes indicate the location of the tic marks displayed beside the stimuli. The numbers 0–19 along the horizontal and vertical axes are used to label the different stimulus values.


stimuli could be classified by using a rule that considered only the stimulus height, whereas the position stimuli could be classified by using a rule that considered only the stimulus position. The labels of the key stimuli indicate how they would be classified if these two rules were used.


The key stimuli were selected to be equally similar to each of the two categories, where similarity is defined as in ALCOVE and the GCM. (See Appendix A for the formal definition of similarity used in these models.) According to these models, similarity is conceived of as an inverse function of distance in diagrams like the one shown in Figure 3. If two stimuli are equally distant from a third in such a diagram, they are also equally similar. Further, similarities are combined by addition, so that if the sums of distances between a stimulus and two groups are equal, then the stimulus is equally similar to the two groups. Under conditions such as the ones in this experiment, in which stimuli vary along psychologically separable dimensions, distance is measured using a city-block metric. To compute distance between two stimuli in Figure 3 using a city-block metric, the horizontal and vertical differences between their positions are simply summed to obtain the total distance. Using simple counting, therefore, one can verify that each of the key stimuli in Figure 3 is equally similar to the training items from Categories A and B.

According to exemplar models of categorization such as ALCOVE and the GCM, if a stimulus is equally similar to the members of the available categories, it should have an equal probability of being classified into each of the categories. According to these exemplar models, however, attention is modeled by allowing distances along one dimension to be scaled differently than distances along another. If participants are attending to one dimension more than another, distances along that dimension will be stretched so that they will be greater than along the other. Greater distances along a dimension in these models cause differences along that dimension to have a greater impact on the categorization decision. This allows these models to predict classifications of the key stimuli that favor one category or another in a specific way. If, for example, distances along the height dimension are twice as great as distances along the segment position dimension, then the A′ stimuli in the lower-left quadrant would tend to be classified as members of Category A and the B′ stimuli in that quadrant would tend to be classified as members of Category B. At the same time, the A′ stimuli in the upper-right quadrant would tend to be classified as members of Category B and the B′ stimuli in that quadrant would tend to be classified as members of Category A. These exemplar models cannot, however, account for data in which all of the A′ stimuli are classified as members of Category A and all of the B′ stimuli are classified as members of Category B. Such a pattern of classification would require greater attention to height for the height stimuli (those in the lower-left quadrant) and greater attention to segment position for the position stimuli (those in the upper-right quadrant).

Thus, this yields two profiles: In the first, the stimulus-independent profile, the A′ and B′ stimuli in one quadrant are classified in Categories A and B, respectively, and the A′ and B′ stimuli in the other quadrant are classified in the opposite categories, that is, in Categories B and A, respectively. This profile is consistent with standard exemplar models. In the second, the stimulus-dependent profile, all of the A′ stimuli are classified in Category A and all of the B′ stimuli are classified in Category B. This second profile contradicts standard exemplar models.

Aha and Goldstone's (1992) experiment had the diagnostic properties just described, and they found the stimulus-dependent profile. They proposed an extension to exemplar models that allowed them to account for the data. According to their model, distance was computed differently depending upon the exemplar that was being presented and the category label that was under consideration. This allowed the model to allocate attention to the height dimension when classifying the height stimuli, and to allocate attention to the segment position dimension when classifying the position stimuli.

Aha and Goldstone (1992), however, obtained their results by averaging over the responses of the 20 participants in each of their studies. It is possible that the pattern of responses they obtained was an artifact of the averaging process. For example, with this category structure, there might have been participants who learned to classify the stimuli from one quadrant but were only guessing in the other quadrant. It is possible that their results might have been obtained by combining the responses of a group of participants that only learned to classify the height stimuli with a group of participants that only learned to classify the position stimuli. If the number of participants in the two groups were equal, then one would expect the overall proportion of Category A classifications for the A′ stimuli and Category B classifications for the B′ stimuli to be (1/2)(1.0) + (1/2)(0.5) = .75. Assuming that responses were distributed binomially with p = .75 and n = 20 and using α = .05, the critical value is 19 stimulus-dependent classifications. All of the classification counts for the key stimuli in Aha and Goldstone's experiments are less than this critical value, and hence, this possibility cannot be excluded. It should be noted, however, that even if the individuals in Aha and Goldstone's study were exhibiting the behavior described here, it would still show evidence of stimulus-dependent representation that would be difficult to account for using a standard exemplar model of categorization.

In the present experiment, we modify the category structure of Aha and Goldstone (1992) to permit the evaluation of additional stimuli at increasing distances from the training items. This provides a rich set of data to constrain possible accounts of participants' behavior. Further, we consider the performance of individual participants to determine the degree to which their performance is consistent with the average performance described by Aha and Goldstone.


Method

Participants. The participants were 157 Indiana University undergraduate students. Four of the participants were paid for their participation, and the remaining 153 received partial credit for their introductory psychology course. All 157 participants were eligible for an additional monetary award based on performance.

Stimuli and Apparatus. The stimuli were rectangles that varied in height and contained a vertical line segment, located near the base of the rectangle, that varied in its horizontal position (see Figure 2). The stimuli were presented with tic marks labeled with letters to help participants distinguish different stimuli. Figure 3 shows that no stimulus value (rectangle height or line segment position) was exactly aligned with any tic mark. The stimulus values were always one-third of a tic-mark interval before or beyond the nearest tic mark. The letters C–L were used to label the tic marks on one dimension and the letters O–X were used to label the other. Which dimension was labeled with which series of letters was counterbalanced between participants.

Figure 3 shows the category structure and training stimuli. Each axis represents one dimension of stimulus variation, and each cell represents a possible stimulus. There were 20 possible values along each dimension of stimulus variation, yielding 400 possible stimuli. The cells containing an A or B signify training items for Categories A and B, respectively. The remaining cells, including the ones labeled A′ and B′, represent possible transfer stimuli used to test generalization.


Although there were 400 possible stimuli, to reduce the duration of the experiment each participant saw only a random sample of 200 stimuli (including some training items) in the transfer portion of the experiment. The category structure was counterbalanced between participants by using its mirror image with the diagonal extending from the lower-left to the upper-right corner as the axis of symmetry. For example, the stimulus in the condition shown in Figure 3 with Segment Position 5 and Height 2 would have Segment Position 2 and Height 5 in the counterbalanced category structure. Because the tic-mark labels were also counterbalanced between participants, there were a total of four different instantiations of the abstract structure in this experiment.

Procedure. At the beginning of the experiment, participants were presented with instructions emphasizing that it was possible to achieve perfect accuracy if they attended to both the height of the rectangles and the position of the interior line segments, and that their primary concern should be to respond as accurately as possible. To reinforce the importance of accuracy, participants were told that the person who achieved the best performance over the course of the experiment would receive a $25 award, the person with the second-best performance would receive a $15 award, and the person with the third-best performance would receive a $10 award.

Participants were trained over the course of 20 blocks of 24 trials each. Within each block every training item shown in Figure 3 was presented once. After each block of training trials participants were given a self-timed break and shown the percent of correct classifications they made during the previous block.
During the break they were reminded of the importance of attending to both the rectangle height and the line segment position, and they were told that if they did so they would be able to get 100% correct, but that if they did not they could not expect to get consistently more than 75% correct.

In each training trial, participants were shown a stimulus and were instructed to assign it to one of two categories by pressing either 4 or 6 on the keyboard. The assignment of the two response keys to Categories A and B was randomized between participants. Numbers were used rather than letters to minimize any confusion with the letters used to label the tic marks. Participants were given 30 s to respond. When participants gave the correct response, the computer displayed, "Correct!" When they failed to give the correct response, the computer displayed, "Wrong!" and generated a 600-ms tone. If they failed to respond within the 30-s time period, the computer displayed, "Faster" and generated a high-pitched tone for 250 ms. The correct response was then displayed, and the participants could study it and the stimulus for up to 30 s.

Training was followed by one block of transfer trials. In the transfer block, a random sequence of 200 of the 400 possible stimuli was presented to the participants. Before the transfer block began, participants were told that they should assign labels to the stimuli as before, although they would no longer be told whether or not their responses were correct. They were told that they would see rectangles they had not

LEARNING CATEGORY REPRESENTATIONS

seen before and that they should make their best educated guess based on what they had learned earlier in the experiment. They were also told that because they would receive no feedback, they should not try to alter their strategy based on these rectangles.

Empirical Results

Participants' performance during the last three blocks of learning was evaluated to determine whether they had learned to classify the training stimuli correctly before they began the transfer trials. To determine whether participants had learned to classify both the height and the position stimuli, these two classes of stimuli were evaluated separately for each participant. In each block, twelve height and twelve position stimuli were presented. Thus, over the last three blocks, n = 36 of each were presented. Because both categories appeared an equal number of times in each block, the probability of guessing correctly on any trial was p = .5. A binomial distribution with n and p as indicated and α = .05 yielded a critical value of 24 correct to exceed chance performance. Of the 157 participants in the experiment, 75 learned to classify both the height and the position stimuli better than would be expected by chance, 12 learned to classify only the height stimuli at above-chance levels, 38 learned to classify only the position stimuli at above-chance levels, and the remaining 32 failed to obtain above-chance performance for either the height or the position stimuli. Because these 32 participants failed to learn to classify the training stimuli, their data were not included in any further analyses.
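The above-chance criterion can be reproduced with an exact binomial tail computation. A stdlib-only sketch (our own, not from the article) that finds the smallest number of correct responses whose upper-tail probability under guessing falls at or below α:

```python
from math import comb

def binomial_critical_value(n, p, alpha):
    """Smallest k such that P(X >= k) <= alpha for X ~ Binomial(n, p)."""
    for k in range(n + 1):
        tail = sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))
        if tail <= alpha:
            return k
    return None

# n = 36 trials per stimulus class, guessing probability p = .5, alpha = .05:
print(binomial_critical_value(36, 0.5, 0.05))  # -> 24
```

This recovers the criterion in the text: 24 or more correct out of 36 exceeds chance at α = .05 (the exact tail probability at 24 is about .033, whereas at 23 it is about .066).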



   

Learning. Because participants' performance during the transfer trials should be influenced by what they learned over the course of training, the performance of participants who learned to classify the training items from both quadrants was analyzed to determine whether their learning performance for one quadrant was superior to that for the other quadrant. For each participant we performed a t-test with α = .1 on the final blocks of training, beginning with the first block in which a perfect score was obtained for the stimuli from either quadrant. Based on the results of the t-test and the direction of the difference, participants who learned to classify the training stimuli from both quadrants were assigned to the Height Stimuli First group, the Position Stimuli First group, or the Height and Position Together group. Table 1 shows each group's average proportion correct during training. It can be seen that the groups that only learned to classify the stimuli from one quadrant performed very close to chance on the unlearned quadrant. The remaining groups all showed reasonable levels of performance for both sets of stimuli, and approximately equal overall levels of performance.
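The grouping rule can be sketched as a paired t-test on block-wise accuracy differences. This is our reading of the procedure, stdlib only; the critical value t_crit must be chosen to match the intended α and degrees of freedom (a full analysis would use a t-distribution table or a statistics library):

```python
from math import sqrt
from statistics import mean, stdev

def assign_group(height_acc, position_acc, t_crit):
    """Assign a both-quadrant learner to a group from per-block
    accuracies on the two stimulus sets, via a paired t-test on the
    block-wise differences. A sketch of the grouping rule described
    in the text; t_crit is supplied by the caller."""
    diffs = [h - p for h, p in zip(height_acc, position_acc)]
    t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    if t > t_crit:
        return "Height Stimuli First"
    if t < -t_crit:
        return "Position Stimuli First"
    return "Height and Position Together"

# Hypothetical accuracies over 10 final blocks; t_crit = 1.833 is the
# two-tailed critical value for alpha = .1 with df = 9.
height = [0.90, 0.95, 0.90, 0.95, 0.90, 0.95, 0.90, 0.95, 0.90, 0.95]
position = [0.70] * 10
print(assign_group(height, position, 1.833))  # -> Height Stimuli First
```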

 

Transfer. One purpose of analyzing participants' learning data was to suggest groups that might have qualitatively different patterns of generalization during transfer. Figure 4 shows clear differences in the pattern of generalization for each group.¹ Each group's pattern of generalization reflects


Table 1
Empirical and predicted proportions correct during learning for each group.

                         Stimuli
Group            n  Source      Height     Position
Height Only     12  Empirical   .66 (.11)  .53 (.05)
                    ALCOVE      .65        .52
                    ATRIUM      .66        .52
Position Only   38  Empirical   .53 (.06)  .74 (.10)
                    ALCOVE      .52        .74
                    ATRIUM      .52        .75
Neither First   24  Empirical   .75 (.13)  .78 (.10)
                    ALCOVE      .75        .79
                    ATRIUM      .75        .78
Height First    10  Empirical   .84 (.10)  .70 (.16)
                    ALCOVE      .82        .71
                    ATRIUM      .81        .69
Position First  41  Empirical   .70 (.12)  .87 (.08)
                    ALCOVE      .69        .85
                    ATRIUM      .69        .86
Note. Standard deviations are listed in parentheses.

its history of learning. Participants who only learned to classify the height stimuli showed a systematic gradation of classification responses along the dimension of rectangle height. This pattern is seen even more clearly for the participants who only learned to classify the position stimuli (along the dimension of segment position). The differences between the three groups who learned to classify both sets of training stimuli are more subtle, but they still make sense in light of participants' history of learning. One might conjecture, for example, that the participants who learned to classify the position stimuli first began much as the participants who only learned to classify the position stimuli. These participants may have considered the position stimuli to be primary in the sense that they were the stimuli with which they first had success. The height stimuli might then have been treated as secondary stimuli or "exceptions" to a position-based rule. If this is an accurate description of participants' behavior, the pattern of generalization for participants who learned to classify the position stimuli first should be largely the same as that of the participants who only learned to classify the position stimuli, with the exception of the height stimuli (and those nearby). In other words, the primary rule for participants who learned to classify the position stimuli first should be the same as the rule induced by the participants who only learned to classify the position stimuli. In regions of the stimulus space in which the primary rule holds sway for the participants who learned to classify the position stimuli first, their performance should be highly similar to that of the participants who only learned to classify the position stimuli. The same holds for the participants who learned to classify the height stimuli first and the participants who only learned to classify the height stimuli. This can be seen most easily by comparing the lower-right corners of each of the five panels (i.e., items with positions 15–19 and heights 0–4). In all of the cases except the one in which the height and position rules were learned together, the stimuli shown in the lower-right corners were classified according to the primary rule.

¹ These data are presented in tabular form in Appendix B.

[Figure 4 comprises five panels, one per learning group — Height Only (n = 12), Position Only (n = 38), Height and Position Together (n = 24), Height First (n = 10), and Position First (n = 41) — each plotting Rectangle Height against Segment Position.]
Figure 4. Proportion of responses for each stimulus in the transfer phase of Experiment 1 by learning group. The shading in each cell indicates the proportion of responses in each category. Dark cells indicate a high proportion of Category A responses; light cells indicate a high proportion of Category B responses.

Critical Stimuli. As was described previously, in addition to the overall patterns of generalization for each group, the groups' performance on the critical stimuli was critical for distinguishing between different explanations of the data. The top panel of Figure 5 shows the proportion of boundary-consistent responses given by participants for the critical stimuli as well as for the neighboring training items from each category. Category A responses to the critical A stimuli, Category B responses to the critical B stimuli, and all correct responses to training items were considered boundary consistent. The denotations same side and other side are relative to the critical items and to the boundaries between the Category A and B stimuli in both quadrants. For example, relative to the three critical stimuli in the lower-left quadrant, the three Category B training items with positions 3, 4, and 5 were considered same-side stimuli, and the three Category A training items with positions 2, 3, and 4 were considered other-side stimuli. For the purposes of Figure 5, the data from the two groups who only learned to classify the stimuli from one quadrant were collapsed, and the data from the three groups who learned to classify the stimuli from both quadrants were collapsed, retaining the primary and secondary quadrants for each participant.²

The top panel shows that participants who only learned to classify the training items from one quadrant exhibited the stimulus-independent profile described previously. That is, they tended to make boundary-consistent classifications of the critical stimuli in the primary quadrant and boundary-inconsistent classifications of the critical stimuli in the secondary quadrant. The participants who learned to classify the training items from both quadrants showed a more equivocal profile. They tended to make boundary-consistent classifications of the critical stimuli in both quadrants, although only marginally more than chance in the secondary quadrant (M = .58, SD = .39), t(73) = 1.86, p = .067. This tendency was clearly weaker for the secondary than for the primary quadrant (M = .27, SD = .46), t(73) = 5.12, p < .0001.³

In sum, grouping participants according to their learning performance provided a way to avoid averaging over participants with qualitatively different patterns of performance. This gives a different picture of participants' performance than did the analysis of Aha and Goldstone (1992). To a certain extent, the participants who learned to classify the items from both quadrants have a profile somewhat like that given illustratively in the introduction to the present experiment, inasmuch as their boundary-consistent classifications of the critical stimuli in the secondary quadrant were only marginally different from chance. Even so, given participants' performance on the training items, it is highly unlikely that these participants were merely guessing. By taking the whole set of each participant's responses into consideration, there were clear, systematic patterns for all the stimuli. What remains to be seen is whether these patterns can be accounted for adequately by a standard exemplar-only model such as ALCOVE, or whether they provide evidence that participants are using a more complex attention-switching strategy such as that embodied in ATRIUM.
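The boundary-consistency coding described above can be sketched directly. The item-type labels below are ours, for illustration; the coding rule itself is the one stated in the text:

```python
def boundary_consistent(item_type, response):
    """Code a transfer response as boundary consistent: Category A
    responses to the critical A stimuli, Category B responses to the
    critical B stimuli, and correct responses to training items."""
    expected = {"critical_A": "A", "critical_B": "B",
                "train_A": "A", "train_B": "B"}
    return response == expected[item_type]

# Proportion boundary-consistent over a toy set of three responses:
responses = [("critical_A", "A"), ("critical_B", "A"), ("train_B", "B")]
proportion = sum(boundary_consistent(t, r) for t, r in responses) / len(responses)
```

Aggregating this proportion separately for the primary and secondary quadrants yields the quantities plotted in the top panel of Figure 5.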

Theoretical Analyses

Because of the complexity of mathematical models, it is difficult to state with certainty the degree to which one model or another will be able to account for a given set of data without actually generating computations from the model. Because the technique of grouping participants according to their learning performance attenuated the evidence of a stimulus-dependent profile of classification relative to the results of Aha and Goldstone (1992), it may be the case that ALCOVE will be able to account for participants' behavior in the present experiment as well as the more complex ATRIUM does. If, on the other hand, ATRIUM is able to provide a substantially better account of the data, this will furnish evidence that (a) people can use stimulus-dependent representations in category learning tasks, and (b) the principles in ATRIUM provide an adequate account of this stimulus dependence.

The models were fit to the data from each group of participants by adjusting the parameters shown in Tables 2 and 3 for ALCOVE and ATRIUM, respectively. An overview of the meaning of these parameters is briefly presented here; more detail is given in the description of the fits of the two models and in Appendix A. As is shown in these tables, most of the parameters were constrained to be the same between groups of participants. Only two parameters in each model varied between groups. For ALCOVE, these two parameters controlled the initial distribution of attention between the two dimensions of stimulus variation (α1) and the rate at which the model learned to adjust the dimensional attention (λα). For ATRIUM, these two parameters controlled the cost of learning a rule along the line segment position dimension (cr1) and the cost of learning a rule along the rectangle height dimension (cr2).
The remaining parameters, those that are fixed between groups, control the specificity of exemplars, c; the learning rates of the exemplars, the rules, and the gate, λe, λr, and λg, respectively; and the degree of certainty or decisiveness during learning and transfer, φl and φt, respectively. Of these parameters, the learning rates for the rules and the gate pertain only to ATRIUM, and the initial distribution of attention pertains only to ALCOVE. Although ALCOVE can usually be thought of as a restricted version of ATRIUM, it is not in this case because it includes the extra α1 parameter. Because of this, Akaike's information criterion statistic (AIC; Akaike, 1974) was used to compare the models.⁴ The AIC is a measure of the lack of fit between a model and data that includes a penalty for model flexibility as measured by the number of free parameters. Therefore, AIC values may be compared between models, and higher values indicate a worse fit.

² Because the participants who learned to classify the Height and the Position stimuli together did not have a primary quadrant, they were assigned to have the position dimension as their primary dimension. This is consistent with their slightly better classification of the position stimuli during learning, as shown in Table 1, t(23) = 2.24, p = .035.

³ By chance, one participant was not presented with a critical stimulus in the secondary quadrant. Hence, only 74 of the 75 participants were included in these tests, yielding 73 degrees of freedom.

⁴ The AIC is given by

    AIC = −2 log L + 2N,    (1)

where log L is the log of the maximum likelihood of the data given the model, and N is the number of free parameters in the model. The log likelihood of the data given the model can be computed as

    log L = Σᵢ log (Σₖ f_ik)! − Σᵢ Σₖ log f_ik! + Σᵢ Σₖ f_ik log p_i(k),    (2)

where f_ik is the observed frequency with which the stimuli of type i were classified in category k, and p_i(k) is the model's overall predicted probability that the stimuli of type i were classified as members of category k.

[Figure 5 comprises three panels — Empirical, ALCOVE, and ATRIUM — each plotting the proportion of boundary-consistent responses (0.00–1.00) for Test Stimuli, Training Stimuli Same Side, and Training Stimuli Other Side, in the Primary and Secondary quadrants, for One Quadrant Learners and Both Quadrant Learners.]
Figure 5. Proportion of boundary-consistent classifications during transfer for the critical stimuli (indicated by Test Stimuli) and the nearby training stimuli (indicated by Training Stimuli Same/Other Side). Error bars for the empirical data extend 1 SE above or below the mean. Asterisks above or below bars for the predicted data indicate that the prediction falls outside the 95% confidence interval for the empirical mean.

Table 2
Best fitting parameters for ALCOVE in Experiment 1 and corresponding fit statistics.

             Shared  Height    Position   Both      Height    Position
Parameter    Value   Only      Only       Together  First     First
c            0.1361
λe           0.0038
φl           6.8172
φt           2.3942
λα                   0.0000    0.0000     1.1910    0.3587    0.5709
α1                   0.2237    0.7534     0.0001    0.0000    0.3501

Fit Statistic         Height    Position   Both      Height    Position
(−2 log L)   Total    Only      Only       Together  First     First
Learning   1,832.77   319.54    351.14     308.60    345.31    508.18
Transfer   7,878.30  1,318.64  1,739.60   1,571.23  1,250.51  1,998.32
Overall    9,711.07  1,638.18  2,090.74   1,879.83  1,595.82  2,506.50

The models were fit using the same sequence of stimuli as was seen by each of the participants. The models were fit to each block of data, including both training and transfer. To compute the fit of the models, the training data for each of the five groups of participants in Experiment 1 were placed in a three-way table comprising 20 blocks × 2 sets of training stimuli (position stimuli and height stimuli) × 2 category labels. The transfer data for each group of participants were placed in a two-way table comprising 400 stimuli × 2 category labels. As suggested by the previous discussion of the critical stimuli, although the models were fit to 440 data points, a central focus of the theoretical analysis will be the performance of the models for these few stimuli. Nevertheless, it was important that the models be constrained by a broad parametric sample of the possible stimulus space so that the solution produced by the models for the critical stimuli would be constrained to be consistent with participants' broader behavior. Another way of thinking about this is that although these data points were included in the set to which the models were fit, the parameter-adjusting procedure has no way of "knowing" which data points were the critical ones. Thus, as long as the proportion of critical data points is small, they are unlikely to have a disproportionate impact on the estimated parameter values. In this case, the critical set comprised just 12 data points out of the 440 (less than 3%). This method of examining qualitative predictions for a set of critical data is, in many ways, more easily apprehended than quantitative measures of fit. By examining these critical data, one can determine the degree to which a model can or cannot account

for key aspects of participants' behavior.

ALCOVE Predictions. The purpose of fitting ALCOVE to the data from the present experiment was to determine the best possible account of the data assuming stimulus-independent dimensional attention. As has been stated, it is not obvious from the pattern of results that ALCOVE will not be able to provide an acceptable account of the data. This is true for two reasons: First, ALCOVE has successfully accounted for the results from a wide variety of category learning experiments (e.g., Choi, McDaniel, & Busemeyer, 1993; Kruschke, 1992, 1993a, 1993b, 1996; Nosofsky, Gluck, Palmeri, McKinley, & Glauthier, 1994; Nosofsky & Kruschke, 1992; Nosofsky, Kruschke, & McKinley, 1992). Second, the pattern of results shown in the top panel of Figure 5 does not show the strong stimulus-dependent pattern of results obtained by Aha and Goldstone (1992), for which a stimulus-independent model of category learning was unable to account.

When ALCOVE is fit to human learning data, it is generally assumed that participants initially attend to all the input dimensions equally. The logic underlying that assumption is based on the source of initial psychological stimulus representation in ALCOVE, which is a multidimensional scaling study that yields, among other information, a measure of the relative degree of attention given to each psychological dimension of the stimuli, prior to the imposition of any category structure. One of the tasks of ALCOVE, then, is to learn how to reallocate attention based on evidence drawn from the category structure. A scaling study performed by Erickson and Kruschke (1998, Appendix C) found that, on average, naive participants' attention to the position of the line segment in the stimulus is about 1.5 times greater than their attention to the rectangle height. In this study, however, this initial distribution of attention appeared to vary between groups. To accommodate this variability, ALCOVE's initial distribution of attention weights was free to vary between groups of participants. At the start of training, therefore, attention to the segment-position stimulus dimension, α1, was chosen freely for each group, and attention to the rectangle-height dimension, α2, was set to α2 = 1 − α1.

A second difference between the five groups in Experiment 1 appeared to be the rate at which they learned to reallocate their attention across the course of the experiment. In the two groups that only learned to classify one of the sets of stimuli, participants' attentional allocation appeared to remain fixed very near its initial values. In the three groups that learned to classify all the stimuli, part of what participants may have learned was to shift attention away from their initial biases. Therefore the attention learning rate parameter, λα, was allowed to vary freely between the five groups to allow ALCOVE to account for these differences.

Table 2 shows the best fitting parameters and the fit statistics obtained using those parameters in ALCOVE. The fit statistics displayed in Table 2 indicate the discrepancy between the model and the data. They are computed as −2 log L, where log L is computed as in Footnote 4. The AIC is computed from this value by adding a term that penalizes the fit for each free parameter in the model. For ALCOVE, there were 4 shared free parameters and 10 group-specific free parameters, for a total of 14 free parameters. Thus, the overall AIC computed from this fit is 9,711.07 + 28 = 9,739.07. This statistic is discussed further when the fits of ALCOVE and ATRIUM are compared directly.

Varying the attention learning rate (λα) and the initial distribution of attention (α1) between groups allowed ALCOVE to do a reasonable job of accounting for the different patterns of learning in the five groups of participants.
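The fit statistics can be reproduced from Equations 1 and 2 (Footnote 4). A stdlib sketch using lgamma for the log-factorial terms; the frequency table f and model probabilities p below are made-up toy values, not values from the experiment:

```python
from math import lgamma, log

def log_likelihood(f, p):
    """Multinomial log likelihood of frequency table f under model
    probabilities p (Equation 2): rows index stimulus types i,
    columns index categories k."""
    total = 0.0
    for f_i, p_i in zip(f, p):
        total += lgamma(sum(f_i) + 1)               # log (sum_k f_ik)!
        total -= sum(lgamma(x + 1) for x in f_i)    # - sum_k log f_ik!
        total += sum(x * log(q) for x, q in zip(f_i, p_i) if x > 0)
    return total

def aic(f, p, n_free_params):
    """Akaike's information criterion (Equation 1): -2 log L + 2N."""
    return -2.0 * log_likelihood(f, p) + 2 * n_free_params

# Toy example: two stimulus types, two categories, two free parameters.
f = [[3, 1], [0, 4]]
p = [[0.75, 0.25], [0.2, 0.8]]
print(round(aic(f, p, 2), 3))  # -> 7.511
```

Because the penalty term 2N is what separates the AIC from the raw −2 log L, the overall AIC reported above is simply the tabled −2 log L value plus 2 × 14 = 28.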
Table 1 shows that with no attention learning in the fits of the data from participants who only learned to classify stimuli from a single quadrant (i.e., either the position stimuli or the height stimuli, but not both), only the classification of the learned set improved above chance. Likewise, in the three remaining groups, the order of learning and the degree of advantage during learning for each set of stimuli were very close to the empirical values (cf. Table 1).

[Figure 6 comprises five panels, one per learning group — Height Only (n = 12), Position Only (n = 38), Height and Position Together (n = 24), Height First (n = 10), and Position First (n = 41) — each plotting Rectangle Height against Segment Position.]
Figure 6. ALCOVE's predicted proportion of responses for each stimulus in the transfer phase of Experiment 1 by learning group. The shading in each cell indicates the proportion of responses in each category. Dark cells indicate a high proportion of Category A responses; light cells indicate a high proportion of Category B responses.

ALCOVE's predicted patterns of transfer responses for each of the five groups of participants are shown in Figure 6. By comparison with Figure 4, it can be seen that ALCOVE does a reasonably good job of predicting the gross pattern of participants' responses. The evaluation of the model's predictions for the critical stimuli, however, is the critical test for determining whether the model was able to account for participants' behavior. A comparison of the top two panels in Figure 5 indicates that ALCOVE was unable to account for participants' classification of the critical stimuli in every case in which they learned to classify the training items in that quadrant. As indicated by the asterisks, for the participants who only learned to classify the items from one quadrant, ALCOVE's prediction for the critical stimuli from that quadrant was below the 95% confidence interval of the mean. Likewise, for the two sets of critical stimuli for the participants who learned to classify the items from both quadrants, ALCOVE's predictions were both below the 95% confidence intervals. Thus, even though the empirical data seemed to indicate that participants in this experiment might not be performing stimulus-dependent shifts of attention as readily as was suggested by Aha and Goldstone (1992), ALCOVE was still unable to account for the participants' classification of the critical stimuli.

Why did ALCOVE fail to account for the participants' pattern of responses for the critical stimuli in any quadrant that was learned? There are two reasons, but both revolve around the lack of stimulus dependence. Note in Figure 4 that, even for the participants who only learned to classify the stimuli from one quadrant, the gradient between Category A and B responses in the quadrants that were learned was fairly abrupt. This is reflected in Figure 5 by high proportions of boundary-consistent responses for the critical stimuli as well as for the training items in the primary quadrant.
In contrast, the gradient between Category A and B responses in the unlearned quadrants was much more gradual.⁵ Models such as ALCOVE have two mechanisms that could account for the degree of abruptness in the gradient between two categories. One solution is to increase the associative weights on either side of the boundary between categories. A second solution is to increase the distinctiveness of the stimuli. This latter solution can be executed in ALCOVE either through attention learning or by adjusting the initial specificity parameter, c (see Equation 6 in Appendix A). Each of these solutions, however, would have stimulus-independent effects upon the behavior of the model. Any adjustment to ALCOVE's parameters that would allow it to sharpen the gradient from Category A to Category B responses in the learned quadrant during transfer would equally affect the gradient in the unlearned quadrant. Thus, the first reason ALCOVE was unable to account for participants' classifications of the critical stimuli in the learned quadrants was that its lack of stimulus dependence did not permit it to vary the slope of the gradient between category labels for different stimuli.

ALCOVE's predicted pattern of responses for the participants who learned to classify the training items in both quadrants highlights the second reason the model was unable to account for participants' classifications of the critical stimuli. Figure 4 shows that the empirical gradients between the two category labels within each quadrant are fairly sharp and are aligned along the relevant stimulus dimension for that quadrant. In contrast, ALCOVE's predictions in Figure 6 are more gradual and tend to be oblique. The predicted gradients between category labels for these participants are gradual for the same reasons that they were for the other participants. The inability of the model to align the gradients with the dimensions of stimulus variation is more fundamental: ALCOVE is restricted to shift from one category label to another at locations in stimulus space where items are equally similar to training stimuli from both categories. As described previously, the critical stimuli were selected so that they would be equally similar to both categories when both dimensions of stimulus variation were attended to equally. Although ALCOVE can stretch and shrink the stimulus space along each dimension, the lines of equal similarity will move together throughout the space. For example, ALCOVE's predicted pattern of transfer responses for participants who learned to classify the height and the position stimuli together shows a fairly vertical boundary along the entire right side of the stimulus space. A consequence of that, however, is that as generalization is tested beyond the training items in the lower-left quadrant, the boundary between Category A and Category B responses is not horizontal as in the empirical data. This entails that, in that quadrant, ALCOVE under-predicted the proportion of Category A classifications for the critical A stimuli and the proportion of Category B classifications for the critical B stimuli. Therefore, although the empirical data were not decisively stimulus-dependent, ALCOVE's absolute stimulus-independence still prevents it from accounting for the participants' pattern of responses to the critical stimuli.
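The stimulus-independence of both mechanisms can be illustrated with ALCOVE's exemplar-similarity function. The sketch below uses the standard ALCOVE form, exp of a specificity- and attention-weighted city-block distance (the article's exact version is Equation 6 in Appendix A): because c and the attention weights rescale distances identically everywhere, raising c sharpens the gradient by the same factor for every stimulus-exemplar pair.

```python
from math import exp

def alcove_similarity(stimulus, exemplar, attention, c):
    """Standard ALCOVE similarity: exp(-c * sum_d alpha_d * |x_d - e_d|)."""
    dist = sum(a * abs(x - e)
               for a, x, e in zip(attention, stimulus, exemplar))
    return exp(-c * dist)

# A nearby and a distant stimulus relative to the same exemplar:
near = alcove_similarity((5, 2), (5, 3), (0.5, 0.5), c=1.0)
far = alcove_similarity((5, 2), (5, 9), (0.5, 0.5), c=1.0)
# Quadrupling c raises every similarity to the fourth power -- a
# stimulus-independent change that cannot sharpen one quadrant's
# gradient while leaving another quadrant's gradual.
near_sharp = alcove_similarity((5, 2), (5, 3), (0.5, 0.5), c=4.0)
```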









ATRIUM Predictions. Unlike ALCOVE, which was limited to a single representational system to classify all the stimuli in the experiment, ATRIUM could learn to use different representations for different exemplars. In other words, ATRIUM could learn to be stimulus dependent. We hypothesized that the key difference between the five groups of participants was not limited to their dimensional attention alone as was instantiated in ALCOVE, but additionally that it included their ability to learn stimulus-dependent representations. In this experiment, ATRIUM could learn to use one of three representations to make its classifications: height rules, segment-position rules, or exemplar similarity (see Figure 1). Associated with each of these representations was a cost (see Equation 13 in Appendix A). The cost parameters controlled the ease with which the three representations could be learned. For example, if the two rule modules 5

Individual participant’s gradients were examined to determine whether this was an artifact of averaging. That is, that participants’ may have been equally steep in both quadrants, but they were only aligned on a single stimulus value in the primary quadrant. A comparison of participant’s maximum category assignment differences between adjacent levels of the primary dimension within each quadrant indicated that gradients were steeper in the primary than in the secondary quadrant (M 0 14, SD 0 23), t 49 4 26, p 0001.


ERICKSON AND KRUSCHKE

[Figure 7 appears here. Panels (by learning group): Height Only (n = 12), Position Only (n = 38), Height and Position Together (n = 24), Height First (n = 10), Position First (n = 41); each panel plots Segment Position (0–18) against Rectangle Height (1–19).]
Figure 7. ATRIUM’s predicted proportion of responses for each stimulus in the transfer phase of Experiment 1 by learning group. The shading in each cell indicates the proportion of responses in each category. Dark cells indicate a high proportion of Category A responses; light cells indicate a high proportion of Category B responses.


LEARNING CATEGORY REPRESENTATIONS

had very high costs relative to the exemplar module, ATRIUM would be biased away from rule-based classification. Therefore, in fitting ATRIUM to the data, the relative values of the three cost parameters (i.e., cr1, cr2, and ce) were free to vary between the five groups of participants. Because only the relative costs matter, however, the cost of the exemplar module was fixed at ce = 1, and only the two rule-module costs varied. We further hypothesized that because the two rule representations each corresponded to one dimension of stimulus variation, the distribution of attention should be related to their relative costs. For example, it seems unlikely that participants would tend to attend to rectangle height if they preferred to form rules based on the segment position. This hypothesis was formalized by setting the initial attentional value for each dimension inversely proportional to the relative cost of the rule module for that dimension. For example, if the cost of learning rules based upon rectangle height is high relative to the cost of learning rules based upon position, then attention in ATRIUM is initially shifted away from height. The initial attention to the segment-position dimension was set to α1 = cr2/(cr1 + cr2), and the initial attention to the rectangle-height dimension was set to α2 = 1 − α1 = cr1/(cr1 + cr2).

Finally, in fitting ATRIUM to the data from Experiment 1, two parameters that had been allowed to vary freely in past fits (Erickson & Kruschke, 1998, in press) were fixed. The parameter that controls the sharpness of boundaries within the rule modules was fixed at a fairly high level, γr = 10, and the attention learning mechanism for the exemplar module was eliminated by setting λα = 0.

Table 3 shows the best fitting parameters and the fit statistics obtained using those parameters in ATRIUM.

Table 3
Best fitting parameters for ATRIUM and corresponding fit statistics.

  Shared parameters:
    c = 0.8268; λe = 1.6357; λr = 1.9973; λg = 0.0003; φl = 8.3696; φt = 5.7432

  Group-specific parameters:
                   Height    Position  Both      Height    Position
    Parameter      Only      Only      Together  First     First
    cr1            1.7840    0.3631    2.8633    3.2375    1.3760
    cr2            0.1868    6.6323    2.7451    2.0348    2.3916

  Fit statistics (−2 log L):
                   Total     Height    Position  Both      Height    Position
                             Only      Only      Together  First     First
    Learning       1,512.73  267.31    338.73    263.96    304.16    338.57
    Transfer       6,898.40  1,224.53  1,625.16  1,344.27  1,063.15  1,641.29
    Overall        8,411.13  1,491.84  1,963.89  1,608.23  1,367.31  1,979.86

The AIC was again computed by adding a term that penalized the fit for each free parameter in the model. For ATRIUM, there
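The inverse-cost normalization of the initial attention weights amounts to a two-line computation. The sketch below is ours, and it reads cr1 as the cost of the segment-position rule module and cr2 as the cost of the height rule module, which is how the fitted values in Table 3 pattern (e.g., the Position Only group has the smallest cr1):

```python
def initial_attention(cost_pos_rule, cost_height_rule):
    """Initial attention to each dimension, inversely proportional to that
    dimension's rule-module cost and normalized to sum to 1:
    alpha_pos = (1/c_pos) / ((1/c_pos) + (1/c_height))
              = c_height / (c_pos + c_height)."""
    alpha_pos = cost_height_rule / (cost_pos_rule + cost_height_rule)
    return alpha_pos, 1.0 - alpha_pos

# Position Only group (Table 3): cr1 = 0.3631, cr2 = 6.6323, so attention
# starts almost entirely on the segment-position dimension.
alpha_pos, alpha_height = initial_attention(0.3631, 6.6323)
```

With these fitted costs, alpha_pos comes out near 0.95, matching the intuition that position learners should begin with attention concentrated on the position dimension.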

were 6 shared free parameters and 10 group-specific free parameters for a total of 16 free parameters. Thus, the overall AIC computed from this fit was 8,411.13 + 32 = 8,443.13. In comparison with ALCOVE's AIC of 9,735.07, it can be seen that ATRIUM provided a fit that was substantially better than would be expected had there been no systematic relationship between the added capability of ATRIUM and participants' behavior. The fit statistics for corresponding subsets of the data in Tables 2 and 3 may be compared to indicate the generality of ATRIUM's improvements over ALCOVE. To ensure that these differences in fit are due to theoretical differences in ATRIUM and not merely due to the addition of free parameters, penalties analogous to the AIC penalty may be added. It is not clear, however, exactly how much each parameter contributed to fitting each subset of the data. To be maximally conservative (i.e., to favor ALCOVE maximally), it can be assumed that only ATRIUM was helped by its parameters in any of these subsets of data. Thus, no penalty terms are added to ALCOVE's fit statistics whereas, for ATRIUM, the maximum penalty terms are added. Because it is possible that all 16 parameters were used to optimize just the learning or just the transfer data in ATRIUM, for the total sub-fit statistics the maximum penalty is 32. Likewise, within each of the five groups, only 8 parameters were available, yielding a maximum penalty of 16. Even using these conservative statistics, ATRIUM provided a superior fit for each sub-category of data except for the learning data of the participants who only learned to classify the position stimuli (ALCOVE: 351.14 + 0 = 351.14; ATRIUM: 338.73 + 16 = 354.73). Manipulation of the cost parameters between groups allowed ATRIUM to do a very good job of accounting for differences in the learning and transfer data for each of the five groups of participants.
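The AIC bookkeeping in this comparison is simple enough to restate as code. The values are taken from the text; the helper function is ours:

```python
def aic(neg2_log_l, k):
    """Akaike information criterion with the penalty used in the text:
    AIC = (-2 log L) + 2k, where k counts free parameters."""
    return neg2_log_l + 2 * k

# From the text: ATRIUM's overall -2 log L was 8,411.13 with k = 16 free
# parameters; ALCOVE's overall AIC was reported as 9,735.07.
atrium_aic = aic(8411.13, 16)   # 8,411.13 + 32 = 8,443.13
alcove_aic = 9735.07
assert atrium_aic < alcove_aic  # ATRIUM's penalized fit still wins
```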
Table 1 shows that, like ALCOVE, ATRIUM did a very good job of accounting for the overall differences between groups during learning. Although learning


curves are not shown in the table, the fit statistics indicate that ATRIUM generally provided a superior account of the shape of the learning curves block by block relative to ALCOVE. As shown in Figure 7, ATRIUM was also able to provide a good account of participants' behavior during transfer. ATRIUM's predictions during transfer for the stimuli and for training items are shown in the bottom panel of Figure 5. This figure shows clearly that ATRIUM provided an excellent account of these data. All four of its predictions for the items were within the 95% confidence intervals for the means of the empirical data. Further, of the 12 predictions shown in the figure, only 1 fell outside the 95% confidence intervals. ATRIUM was able to account for participants' classification of the training items and the stimuli during transfer because of its ability to learn to use stimulus-dependent representations, whereas ALCOVE could not. How did ATRIUM use its three representations to classify the stimuli in the two quadrants? To answer this question, the activations of the gate units during transfer were examined for each stimulus. The gate units in ATRIUM control how much influence each module has on the model's final output. Therefore, the activation of these units indicates the degree to which each representation is being used. For the participants who only learned to classify the training items from one of the two quadrants, ATRIUM used the rule module associated with the relevant dimension almost exclusively. This confirms that according to ATRIUM, their representations were not stimulus-dependent. For the participants who learned to classify the height and position stimuli together, ATRIUM used the height rule to classify the height stimuli and the exemplar representation to classify the position stimuli.
For the participants who learned to classify the height stimuli first, ATRIUM learned to use the height rule to classify the height stimuli and learned to use both the position rule and the exemplar module to classify the position stimuli. For the participants who learned to classify the position stimuli first, ATRIUM learned to use exemplars to classify the position stimuli and learned to use the height rule to classify the height stimuli. It was possible for ATRIUM to use the exemplar module in this task because the gating mechanism controls both the forward flow of activation and the backward flow of error. By using the gating mechanism to control the backward flow of error, ATRIUM was able to allow the exemplar module to learn to classify the stimuli from one quadrant without interference from the stimuli in the other. To summarize, this experiment provided empirical evidence that people can learn to use stimulus-dependent representations even in simple, inductive category learning experiments. This study refines the results reported by Aha and Goldstone (1992) by dividing participants based upon their performance during training to avoid averaging across participants who might have learned to classify the stimuli from one quadrant but not the other. In so doing, the evidence for stimulus-dependent representation appeared somewhat less compelling. Nevertheless, by fitting two models to the data, one that incorporated stimulus-dependent representation and one that did not, it became clear that the results from this experiment could not be accounted for adequately without







stimulus-dependent representation.
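The gating principle described above, in which gate activations weight each module's contribution to the output and, symmetrically, scale the error signal that reaches each module, can be sketched in the abstract. This is a generic gated-mixture update with made-up numbers, not ATRIUM's full learning rule:

```python
import numpy as np

def gated_step(module_outputs, gate, target, lr=0.1):
    """One forward/backward pass of a simple gated mixture.

    module_outputs: (n_modules,) each module's output for this stimulus
    gate:           (n_modules,) gate activations (assumed to sum to 1)
    Returns the blended output and the error signal each module receives;
    modules with low gate activation are shielded from the error.
    """
    y = gate @ module_outputs            # gated forward flow of activation
    err = target - y
    per_module_error = lr * gate * err   # gated backward flow of error
    return y, per_module_error

# Hypothetical outputs for a height-rule, position-rule, and exemplar module:
outputs = np.array([0.9, -0.2, 0.1])
gate = np.array([0.8, 0.1, 0.1])
y, deltas = gated_step(outputs, gate, target=1.0)
# The dominant module (gate = 0.8) receives most of the error-driven learning,
# so the weakly gated modules learn largely without interference.
```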

Experiment 2: Multiple Simultaneous Rules

The category structure used in Experiment 1 was well designed by Aha and Goldstone (1992) to demonstrate participants' ability to use stimulus-dependent representations. Although Experiment 1 showed that consideration of individual differences made the results slightly less clear-cut than originally described, one might argue that the superiority of ATRIUM's account of the data over ALCOVE's could have been predicted from the start. Because of this, in Experiment 2, a different category structure was used. This category structure was derived from one used originally by Nosofsky et al. (1989), who found that the GCM, an exemplar-only model of categorization, could account for the data quite well. This experiment, therefore, provides an opportunity to supply further evidence of ATRIUM's utility in indicating participants' use of more complex representations than are present in exemplar-only models of categorization. Beyond this general goal, this experiment also has a more specific goal: to examine transitions between rule and exemplar representation over the course of an experiment. There has been considerable research suggesting that exemplar representation impinges upon rule representation over time (Allen & Brooks, 1991; Anderson & Betz, in press; Ashby, Alfonso-Reese, Turken, & Waldron, 1998; Johansen & Palmeri, 2001; Noelle & Cottrell, 2000; Regehr & Brooks, 1993). These researchers have argued that over the course of learning, there is a general tendency to shift from rule to exemplar representation. A central goal of the present experiment is to use model-based techniques to assay this tendency and to evaluate its generality. The category structure used in this experiment is shown in Figure 8. It was selected for two reasons: First, it admits a number of different classification strategies, including pure rule strategies, pure exemplar strategies, and combined strategies.
Second, as stated, it should provide a significant challenge to ATRIUM and allow the findings from this study to be connected with previous research because it is similar to the category structure used by Nosofsky et al. (1989) to provide evidence for an exemplar-only model of category learning. These two motivations are discussed in order.

As in Experiment 1, participants in the present experiment were given the task of learning to classify rectangular stimuli as shown in Figure 9. In contrast to Experiment 1, in this experiment the rectangles were presented without tic-marks or labels. The category structure (Figure 8) shows that there were seven training items to be classified into two categories. The training items could easily be classified using a variety of fairly simple rules. For example, participants could learn that an item was in Category A if its segment position was between 2 and 4 inclusive and its height was greater than 2. Alternatively, with just seven training items, participants should have had little difficulty identifying and classifying each one even without rules. Finally, participants might have mixed these two strategies, using a simpler rule than the one described previously with the inclusion of an exception to that rule (Medin et al., 1987; Nosofsky et al., 1989; Nosofsky & Palmeri, 1998; Nosofsky et al., 1994). One way this strategy could have been instantiated might have been to classify items as members of Category A if their segment positions were between 2 and 4 inclusive unless the stimulus was the one with segment position 4 and height 1.

It seems that each of these strategies should yield a clearly different pattern of acquisition during learning and generalization during transfer. For example, one might anticipate that if a participant used the strategy of memorizing the seven training items and classified the transfer items according to their similarity to the training items, the transfer items in the upper right portion of Figure 8 would tend to be classified as members of Category A. Alternatively, if a participant used the rule-and-exception strategy described previously, there should be a sharp division in the labels assigned to items in the upper right portion of the category structure with line segments in positions 4 and 5. As is shown in the results section, these types of qualitative differences do exist between participants. It turns out, however, that by manipulating dimensional attention, exemplar models such as ALCOVE and the GCM can qualitatively account for these patterns. Thus, the interpretation of this experiment relies more heavily upon fit statistics than upon the classification of critical stimuli as in Experiment 1.

There are a few differences between the present experiment and the experiments reported by Nosofsky et al. (1989). First, the stimuli were different. Whereas Nosofsky et al. presented circles with radial lines, in this experiment participants were presented with rectangles. Nevertheless, in both experiments, the key element of psychological separability between the two dimensions of stimulus variation is present. A second difference between Nosofsky et al.'s (1989) experiment and the present one is highlighted in Figure 8. In their category layout, participants saw just four levels along each dimension, yielding 16 stimuli total. The four levels presented by Nosofsky et al. are indicated by the shaded areas in the figure. In the present experiment, participants were presented with 49 stimuli. This served two purposes: First, it permitted the testing of extrapolation to assess participants' underlying representation (DeLosh, Busemeyer, & McDaniel, 1997; Erickson & Kruschke, 1998, in press), and second, it provided 33 additional data points per participant which, in turn, furnished additional constraints to help distinguish between the models that were fit to the data. A final difference from earlier work was that in the present experiment, participants' performance during learning was considered. Previous accounts of data from similar experiments have focused only on the results of transfer trials (Nosofsky et al., 1989; Nosofsky & Palmeri, 1998). As will be seen, we argue that careful analysis of both learning and transfer data can prove crucial to understanding category learning behavior.

[Figure 8 appears here.]
Figure 8. Category structure for Experiment 2. The rows and columns respectively represent the stimulus values for the segment positions and rectangle heights. The cells containing the letter A or B were training items for their respective categories. The cells numbered 1–9 indicate the transfer items used by Nosofsky et al. (1989). The numbers 0–6 along the axes are used to label the different stimulus values for the purposes of this figure only and are not indicated in the stimulus presentation.
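The candidate rule strategies for this category structure can be made concrete. The sketch below implements the two rule-based strategies exactly as the text states them; the function names and the probe values are ours:

```python
def rule_strategy(pos, height):
    """Conjunctive rule from the text: Category A iff the segment position
    is between 2 and 4 inclusive and the height exceeds 2."""
    return 'A' if 2 <= pos <= 4 and height > 2 else 'B'

def rule_plus_exception(pos, height):
    """Rule-plus-exception variant: same positional rule, except that the
    training item at position 4, height 1 is memorized as Category B."""
    if (pos, height) == (4, 1):
        return 'B'
    return 'A' if 2 <= pos <= 4 else 'B'

# Both strategies agree on the exception item itself but can disagree on
# nearby stimuli, e.g. at position 3, height 1:
labels = (rule_strategy(3, 1), rule_plus_exception(3, 1))   # -> ('B', 'A')
```

An exemplar-similarity strategy would instead compare each probe with the memorized training items; it is not instantiated here because the full set of training coordinates is given only in Figure 8.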

Method

Participants. The participants were 123 Indiana University undergraduate students who received partial credit toward the completion of their introductory psychology course.

Stimuli and Apparatus. The stimuli were rectangles that varied in height and contained a vertical line segment, located near the base of the rectangle, that varied in horizontal position (see Figure 9). Figure 8 shows the category structure and training stimuli. Each axis represents one dimension of stimulus variation, and each cell represents a possible stimulus. There were 7 possible values along each dimension, yielding 49 possible stimuli. The cells containing an A or B signify training items for Categories A and B, respectively. The remaining cells represent possible transfer stimuli used to test generalization.

Procedure. At the beginning of the experiment, participants were presented with instructions that emphasized (a) that it was possible to achieve perfect accuracy if they attended to both the height of the rectangles and the position of the interior line segments and (b) that their primary concern should be to respond as accurately as possible. Participants were trained over the course of 42 blocks of 7 trials each. Within each block, each of the training items shown in Figure 8 was presented once. After every fifth block of training trials, participants were given a self-timed break. During the break they were reminded of the importance of attending to both the rectangle height and the line segment position, and they were told that if they did they would be able to get 100% correct. In each training trial, participants were shown a stimulus and were instructed to assign it to one of two categories by pressing either "F" or "J" on the keyboard. The assignment of Categories A and B to the two response keys was randomized between participants. Participants were given 30 s to respond. When participants gave the correct response, the computer displayed, "Correct!" When they failed to give the correct response, the computer displayed, "Wrong!" and generated a 600-ms tone. If they failed to respond within the 30-s time period, the computer displayed, "Faster" and generated a high-pitched tone for 250 ms. The correct response was then displayed, and the participants could study it and the stimulus for up to 30 s. Training was followed by three blocks of transfer trials. In each transfer block, all of the 49 possible stimuli, including the training stimuli, were presented to the participants. Before the transfer block began, participants were told that they should assign labels to the stimuli as before, although they would no longer be told whether or not their responses were correct. They were told that they would see rectangles that they had not seen before and that they should make their best educated guess based on what they had learned earlier in the experiment. They were also told that because they would receive no feedback, they should not try to alter their strategy based on these rectangles.

Figure 9. A sample stimulus from Experiment 2 consisting of a rectangle that varied in height with an interior line segment that varied in position.

Results and Discussion

Participants' performance during training was examined to evaluate the degree to which each participant learned the category structure. To maintain as much similarity as possible with Nosofsky et al.'s (1989) Experiment 1, a participant who classified every training exemplar correctly on more than 50% of the trials in the last 22 blocks of training was classified as a learner. Using this criterion, 85 of the 123 participants (69%) were classified as learners (compared with 53 of the 96 categorization participants [55%] in Nosofsky et al.'s study). The data from the remaining participants were not analyzed further.
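The learner criterion described above (every training exemplar classified correctly on more than 50% of its trials in the last 22 blocks) can be sketched as a filter over per-trial accuracy data. The array layout below is an assumption of ours, one trial of each of the 7 exemplars per block:

```python
import numpy as np

def is_learner(correct, last_blocks=22):
    """correct: boolean array of shape (n_blocks, n_exemplars).
    Returns True if every exemplar was classified correctly on more than
    50% of its trials in the final `last_blocks` blocks."""
    tail = correct[-last_blocks:]
    return bool(np.all(tail.mean(axis=0) > 0.5))

# A perfect record passes; an exemplar at exactly 50% fails the
# strictly-greater-than criterion.
perfect = np.ones((42, 7), dtype=bool)
borderline = perfect.copy()
borderline[-22:, 0] = [True] * 11 + [False] * 11
```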

Individual Differences. As in Experiment 1, it was important to examine the results of individual participants to identify their patterns of behavior correctly. This was done in two ways: First, to facilitate comparison between this experiment and the one performed by Nosofsky et al. (1989), individual differences were identified in the same way as in their study. As is explained later, however, this form of analysis was ill-suited to the full set of transfer stimuli. Thus, individuals' performance was also analyzed using cluster analysis to identify groups of participants whose performance was qualitatively similar.

In their analysis of individual differences, Nosofsky et al. (1989) examined participants' patterns of generalization for the nine stimuli in their experiment that were not training exemplars. (Recall that their category layout was composed of just the 16 stimuli shown in the shaded areas in Figure 8.) Nosofsky et al. determined whether each participant classified each of these nine stimuli more often in Category A or in Category B (out of five trials per stimulus). Because each of the nine stimuli could be classified as belonging to one of two categories, there were a total of 2⁹ = 512 possible patterns of generalization. Nosofsky et al. found that of the 512 possible patterns of generalization, only seven were exhibited by four or more participants. Using this same technique in our experiment, only four patterns of generalization were exhibited by four or more participants. The two most frequent patterns of generalization in our experiment were the same as the two most frequent patterns found by Nosofsky et al. (1989). These patterns are indicated in text as ⟨ℓ1 ℓ2 ℓ3 ℓ4 ℓ5 ℓ6 ℓ7 ℓ8 ℓ9⟩, where the subscripts index the stimulus under consideration and correspond to the numbers 1–9 in the cells of Figure 8. Each ℓi can take on a value of A or B to indicate that each participant who exhibited that pattern classified that transfer stimulus most often in Category A or Category B, respectively. Thus, as in Nosofsky et al.'s study, the two most frequent patterns of generalization were ⟨B B A A B A B B B⟩ and ⟨B B A A A A B B B⟩. The first pattern was exhibited by 33 participants and the second was exhibited by 9. The two remaining patterns were both exhibited by 4 participants. These were ⟨B B A A B A B A A⟩ and ⟨A A A A A A B A A⟩. The first of these was reported to have been observed by Nosofsky et al.; the second was not. This left 35 participants who learned to classify the training items but whose pattern of generalization was either unique or shared by only one or two others. Whereas Nosofsky et al. (1989) had 9 transfer stimuli, the present experiment had 9 + 33 = 42 transfer stimuli. This yielded 2⁴² ≈ 4.4 × 10¹² (4.4 trillion) different patterns of generalization. To identify qualitatively different patterns of generalization, therefore, the similarity between different




patterns of generalization was considered by performing a cluster analysis of participants' responses during the transfer phase of the experiment. Recall that in this experiment, participants saw each stimulus (including the training stimuli) three times during transfer. The distribution of each participant's transfer responses was placed in a 49-dimensional vector. Each element in the vector contained a number from 0 to 3 indicating the number of Category A responses for each stimulus. The 85 vectors were submitted to a clustering algorithm using the complete linkage method to maximize compactness and minimize chaining (Cormack, 1971; Sokal & Sneath, 1963). The distance between vectors Pi and Pj was computed as dij = ‖Pi − Pj‖₂. Inspection of participants' patterns of generalization indicated that within each of the four least similar clusters, performance tended to be fairly homogeneous. Of the 85 participants, 50 were clustered into Group 1, 16 into Group 2, 15 into Group 3, and the remaining 4 were grouped into Group 4. The aggregate transfer data and the transfer data for each of the four groups are shown in Figure 10. As would be expected, given that almost 60% of the participants were in Group 1, the aggregate data were quite similar to the data from that group. Without considering participants individually, those who exhibited qualitatively different patterns of responses could easily have been passed over as random variation around one central pattern.
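The clustering just described can be reproduced in outline with scipy; this is a sketch under stated assumptions, with hypothetical stand-in data rather than the actual response vectors, and the original analysis may differ in detail:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical stand-in profiles: 85 participants x 49 transfer stimuli,
# each entry the number of Category A responses (0-3) for that stimulus.
rng = np.random.default_rng(1)
profiles = rng.integers(0, 4, size=(85, 49)).astype(float)

# Complete-linkage clustering on pairwise Euclidean distances, cut so that
# at most four groups remain, mirroring the grouping in the text.
Z = linkage(profiles, method='complete', metric='euclidean')
groups = fcluster(Z, t=4, criterion='maxclust')
```

Complete linkage merges clusters by their farthest pair of members, which is what keeps the resulting groups compact and avoids the chaining that single linkage produces.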


Transfer. Participants in each of the four groups appear to have utilized clearly different strategies in their approaches to the classification task. Because the patterns of generalization between each of the four groups were so different, additional characterization of the collective data has little meaning. Participants in Group 1 appeared to have used a rule-and-exception strategy with a greater emphasis on the rules than on the exception. The rules simply placed stimuli with a segment position between 2 and 4 in Category A unless participants perceived the stimulus to be sufficiently similar to the training item with segment position 4 and height 1, the exception to the rule. Participants in Group 2 appeared to have used a rule-and-exemplar strategy. The rule applied to stimuli with segment positions less than (or perhaps equal to) 3 and grouped stimuli with segment positions less than 2 into Category B. Because of the graded response proportions in the upper right cells, unlike the sharp transition in Group 1, the pattern of responses for the remaining stimuli appeared as if it could be accounted for by exemplars placed at the locations of the training items. The Group 3 participants appeared to have used rules similar to the Group 1 participants, but based upon rectangle height rather than line segment position. Although very few participants were grouped in Group 4, their pattern of generalization is interesting nevertheless. Their transfer performance can be described in two parts. First, participants in Group 4 shared a strong tendency to learn that stimuli with segment position 1 should be classified in Category B. Unlike participants in Group 2 who

learned a category boundary between segment positions 1 and 2, participants in Group 4 learned to classify segment position 1 by itself. It was only because generalization was tested with a more extreme segment position value than the training stimuli (i.e., 0) that it was possible to detect a difference between these two types of classification. Second, participants in Group 4 appeared to classify the remaining stimuli as did the participants in Group 2, as if they made their judgments by comparing transfer stimuli to memorized training items.

Learning. Although all the participants included in the analysis learned to classify all the training items at above-chance levels, it might be expected that the differences in transfer performance between groups would have been reflected earlier in their learning performance. For example, participants in Group 1 might have found the training item at position 4 and height 1 difficult to classify because it was an exception to their more general rules. A three-way mixed ANOVA was performed that included group as a between-participant factor and epoch (six consecutive blocks) and training exemplar as within-participant factors. Although there was a main effect of epoch, F(6, 486) = 4.47, MSE = 0.0398, p < .0001, it did not interact with any of the other factors and is not considered further. The proportion of participants' correct responses by group, collapsed across all blocks of training, is shown in the top panel of Figure 10. Overall, participants tended to classify the training exemplars in segment positions 1 and 2 best and the training exemplars in segment position 4 worst. A significant main effect of training exemplar indicates reliable differences between training exemplars, F(6, 486) = 4.60, MSE = 0.064, p < .0001. Further, there were clear differences between the four groups of participants in their ability to classify the different training items, F(18, 486) = 3.48, MSE = 0.064, p < .0001. For example, the training exemplar at position 4 and height 1 was an exception to the rule for the participants in Group 1, and thus they found it difficult to learn. Participants in Groups 2 and 4 found it less difficult, presumably because it and the training exemplar at position 5 and height 2 are so similar. Finally, participants in Group 3 classified it best of all, probably because of the confluence of rule- and exemplar-based information. Although the overall pattern of interactions is too complicated to be explained at this level, the correspondence between participants' patterns of learning and generalization emphasizes the importance of providing an account of both aspects of performance. There was no main effect of group, F(3, 81) = 1.76, MSE = 0.237, p = .15.