Toward a Cognitive Sequence Learner: Hierarchy, Self-Organization, and Top-down Bottom-up Interaction Martin V. Butz

IlliGAL Report No. 2004021 May, 2004

Illinois Genetic Algorithms Laboratory University of Illinois at Urbana-Champaign 117 Transportation Building 104 S. Mathews Avenue Urbana, IL 61801 Office: (217) 333-2346 Fax: (217) 244-5705

Toward a Cognitive Sequence Learner: Hierarchy, Self-Organization, and Top-down Bottom-up Interaction Martin V. Butz Illinois Genetic Algorithms Laboratory (IlliGAL) University of Illinois at Urbana-Champaign 61801 Urbana, IL, USA {butz}@illigal.ge.uiuc.edu

Abstract

This paper introduces a hierarchical learning architecture that grows an adaptive problem representation online, from scratch. The representation extracts frequent perceptual patterns, representing them in a layered hierarchy in which neural activity in higher layers is initiated bottom-up by firing neurons in lower layers and, vice versa, firing neurons in higher layers predispose activity in and provide reinforcement feedback to neurons in lower layers. The structure evolves by a mixture of reinforcement learning and a genetic algorithm. Learning is biased towards extracting frequently recurring sequences in an input stream. We evaluate the architecture on a text document that is presented iteratively, character by character, to the system. The results show that the proposed system reliably evolves representations of the most frequent characters, syllables, and words in the document. We also confirm that top-down influences bias the evolution of syllable and character representations.

1 Introduction

Recent insights into cognitive and neural processing mechanisms suggest that many brain areas are structured hierarchically. Different but related areas usually interact bidirectionally. The auditory system of humans and birds is also hierarchically structured (Feng & Ratnam, 2000). Lower-level areas respond to specific sounds, such as phonemes or syllables, and higher levels use the extracted features to respond to larger chunks of auditory input, such as words or song parts. Despite this strong evidence from neuroscience, only a few artificial neural systems have been developed that mimic such hierarchical structures. This paper introduces a cognitive sequence learning architecture (COSEL) that grows a similar hierarchical representation dependent on the provided input. COSEL evolves a hierarchical neural representation concurrently. The number of neurons adapts to the complexity of the provided input. Each hierarchical layer develops a set of neurons that cluster activity through time in the layer below and propagate activity to the layer above. In turn, the layer above rewards activity in the layer below. Neurons are endowed with an internal state so that neurons themselves represent short sequences. The aim of this paper is to introduce COSEL and to show that the system is able to evolve a neural representation of frequent sequences in a data stream. Additionally, we show that COSEL is flexible and extendable and should be regarded as a general learning architecture rather than an application to one particular task. The paper is structured as follows. The next section provides some motivation based on biological and physiological research. Section 3 provides background on similar learning architectures.

Next, we specify the addressed problem and then introduce COSEL. Section 6 presents the obtained results. Section 7 summarizes and concludes the paper.

2 Biological and Physiological Background

Many neuroscience-related research areas have discovered and investigated the modular aspects of the brain over the last decades. Most prominently, vision research has shown that there are many areas in the brain that are responsible for feature extraction and basic visual stimulus processing. These mechanisms work in parallel and are hardly influenced by the bottleneck of cognitive attention (Pashler, 1998). However, it was also shown that these structure-extraction mechanisms are strongly influenced by top-down processes such as attention related to object properties, location properties, color properties, as well as predictive behavioral properties (Pashler, Johnston, & Ruthruff, 2001). On the auditory side, research has not made as much progress as in vision, but several facts seem to be established. "[...] neurons extract behaviorally relevant features in parallel hierarchically arranged pathways. [...] it is now recognized that descending auditory pathways can modulate information processing in the ascending pathway, leading to improvements in signal detectability and response selectivity." (Feng & Ratnam, 2000, p. 699) Several aspects are emphasized in this quotation: First, behaviorally relevant features are extracted. Second, the pathways are structured hierarchically. Third, stimuli are processed in parallel. Fourth, top-down influences enhance stimulus detectability and response selectivity. Several selective neural layers and neurons were identified, including specializations in spectral features of sound (posterior thalamic nucleus), temporal features of sound (central thalamic nucleus), and duration-selective neurons (inferior colliculus) (Feng & Ratnam, 2000). While behavioral relevance is not considered in our non-situated learning system (the system does not influence the world), all other structural properties are considered in the COSEL architecture. Besides the human auditory system, sound-processing research focuses on frogs, bats, and birds. For our text-sequencing endeavor, the most relevant are the studies on bird song learning and production. Recently, Margoliash (2003) identified significant parallels between the speech understanding and production system in humans and the bird song production and perception system, suggesting, among other things, potential similarities in structure and off-line processing mechanisms. While a complete review of bird-song production, recognition, and learning is beyond the scope of this paper, several interesting observations can be made regarding the neural organization of song production and learning. Bird song production and recognition is also realized by highly interconnected and interactive modular brain structures (Brenowitz, Margoliash, & Nordeen, 1997). The essential modules are characterized as follows: "[...] The neurons in the descending motor pathway (HVc and RA) are organized in a hierarchical arrangement of temporal units of song production, with HVc neurons representing syllables and RA neurons representing notes. the nuclei UVa and NIf, which are afferent to HVc, may help organize syllables into larger units of vocalization." (Margoliash, 1997, p. 671) Thus, research clearly suggests the distinct representation of short sound units such as units of vocalization, longer sound units such as syllables, and even larger units of vocalization such as lines and strophes. While the quoted passage focuses on the motor pathway, Margoliash also emphasizes the sensory-motor interactivity, so that the pathway may not only be structured during song production but also during song listening, learning, and recognition (Margoliash, 1997).

This paper builds on these insights in designing COSEL, a hierarchically structured learning system that processes input in parallel, abstracting hierarchically. We show that the system reliably and emergently develops syllable and word representations without any supervision.

3 Related Systems

Several systems approach the problem of (hierarchical) sequence learning. The most straightforward approach is a multi-layer perceptron (MLP) that learns to predict the next input given the current input. However, in the case of complex sequences, that is, sequences in which the next input does not only depend on the current input, a simple MLP approach is not sufficient. Introducing recurrences to an MLP (Jordan, 1986) allows the formation of a decaying short-term memory. Elman (1990) showed that adding recurrences from the hidden layer to the input layer allows the recognition of simple syntactic patterns in the input stream. The recurrent connections evolve an implicit short-term memory predicting sequential input patterns. Another approach uses a short-term memory to provide the MLP with historic input, limiting the network to predicting sequences that depend at most on the short-term memory size. In general, though, the MLP architectures appear limited to a fixed degree of context or are prone to ambiguity in the input stream (Mozer, 1992; Wang, 2003). Associative learning architectures are an alternative to MLPs. In the simplest case, the networks are able to learn k-order sequences, but the number of connections grows exponentially in k (Wang, 2003). Wang and Yuwono (1995) introduced an anticipation model that self-organizes the generation of complex patterns. They use a shift-register assembly to create a short-term memory that, in connection with a winner-take-all layer, represents sequential patterns. The system is limited to reproducing sequences of at most the size of the shift-register assembly. Tests showed that the system is able to store overlapping short sequences such as remember and memory. Another interesting approach is the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995), which forms a minimal representation of the encountered input by a top-down bottom-up interactive mechanism. The algorithm learns to reconstruct provided inputs in the wake phase. In the sleep phase, the process is reversed, adjusting bottom-up connections. Previous hierarchical approaches tried to form hierarchies in which the upper layer predicts the prediction-failure cases of the lower layer, eventually embedding the knowledge in the lower layer itself (Schmidhuber, 1992). Another approach used decaying activity patterns to evolve an implicit short-term memory in special neurons (Mozer, 1992). Here, the tuning of the decay was most crucial for successfully learning complex sequences. Hierarchical clustering and hierarchical self-organizing maps are similar in their general architecture. However, regardless of whether the approach is agglomerative, in which many clusters are progressively combined, or divisive, in which one cluster is progressively divided, the clusters in each level of the hierarchy are usually non-overlapping (Duda, Hart, & Stork, 2001). COSEL is somewhat similar to the wake-sleep algorithm. However, it does not distinguish two learning phases and grows adaptively dependent on the available input. Additionally, the evolving clusters in each layer (or hierarchical level) can overlap.


4 The Problem

The problem we address in this paper is a text-sequencing problem. We provide a stream of characters taken from a text document and are interested in whether we are able to concurrently and adaptively evolve a representation of the document from scratch. That is, inputs are presented iteratively, one per time step. When a word ends, additional reinforcement is provided to the system. The task is to evolve representations of the most frequent characters, syllables, and words in the received text. Of course, this would be easy with an exhaustive statistical analysis. However, we are not interested in such an analysis. Rather, we are interested in whether we can create a concurrent learning system that learns a representation of the encountered text from scratch. The resulting system should be able both to identify the most frequent characters, syllables, and words and to predict subsequent text input and possible word completions. Note that the problem is different from the usual lexicon-learning approaches, in which supervised input is provided, such as a word and a meaning that need to be associated, or in which the system is forced to explicitly and continuously predict the next inputs. COSEL adaptively evolves a hierarchical neural net structure. No supervised learning takes place but only unsupervised learning (that is, hierarchical online clustering in time) with additional reinforcement learning. Thus, COSEL can be characterized as an unsupervised, reinforcement-supported clustering mechanism. A minimal sketch of this presentation protocol is given below; the next section then introduces COSEL in detail.
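The sketch below (in Python) is illustrative only: the class name CoselNetwork and the methods present_character and word_end_reward are hypothetical placeholders, since the report does not specify a programming interface, and the word-boundary test is one simple reading of "a word ends".

```python
# Sketch of the text-sequencing protocol: one character per time step,
# extra reinforcement when a word ends. CoselNetwork, present_character,
# and word_end_reward are hypothetical names, not taken from the report.
def present_text(network, text, passes=1):
    for _ in range(passes):
        for position, character in enumerate(text):
            network.present_character(character)  # bottom-up input for this time step
            next_character = text[position + 1] if position + 1 < len(text) else " "
            if character != " " and next_character == " ":
                network.word_end_reward()         # additional reinforcement at a word boundary
```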

5 COSEL: A Cognitive Sequence Learning Architecture

The COSEL architecture consists of three major components: (1) a hierarchical structure of layers of neurons that interact concurrently, (2) a reinforcement-learning (RL) based evaluation system (Sutton & Barto, 1998), and (3) a genetic algorithm (Holland, 1975; Goldberg, 1989) and learning classifier system (LCS) (Holland & Reitman, 1978) based developmental (adaptive) component.

5.1 Hierarchical Architecture

Neurons, the behavioral entities in the system, are embedded in an initially empty hierarchical structure. The architecture specifies several layers of neurons that can interact only with the next lower and next upper layer. The complete architecture is shown in Figure 1a. The lowest layer receives character inputs and represents characters. We will refer to it as the character-layer. It sends character activities to the next higher layer, which represents short sequences of character-neuron activities. We will refer to it as the syllable-layer. It sends "syllable" activities to the next higher layer and reward to the character-layer for the provision of the necessary character activities. The third layer evolves sequences of the syllable-layer, roughly representing word activities. Thus, we will refer to it as the word-layer. Similarly, the word-layer sends activities to the next upper layer and sends reward back to the syllable-layer for the provision of the necessary syllables. As can be seen, the learning architecture consists of three major layers. The architecture is designed to evolve representations of progressively longer text sequences. While the lowest level represents single characters, the next level represents short letter sequences designed to evolve syllable representations. The third layer represents sequences of these syllable sequences, designed to evolve representations of frequent words. Due to the interactivity of the layers, each layer depends on the next lower level and on the next higher level. Initially, the dependence will be determined largely by the lower-layer activity because only lower activity allows higher-level activity. That is, if a neuron in the lower level dies, the neuron in the next higher level will eventually die out as well because it can only be active (and thus receive reward, as described below) if the neuron in the lower layer becomes active. Precise neuron characteristics, reward distribution, and development of the layers are specified in the subsequent sections.
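The layer chain can be pictured with the following sketch (illustration only; the class Layer and the helper build_stack are invented here and are not part of the report):

```python
# Illustrative wiring of the character, syllable, and word layers.
# Each layer only knows its direct neighbors: activity flows upward,
# reward flows downward. All names are invented for this sketch.
class Layer:
    def __init__(self, name):
        self.name = name
        self.below = None      # source of bottom-up activity
        self.above = None      # source of top-down reward
        self.s_neurons = []    # S-neurons evolved for this layer (initially empty)

def build_stack(names):
    layers = [Layer(name) for name in names]
    for lower, upper in zip(layers, layers[1:]):
        lower.above = upper
        upper.below = lower
    return layers

character_layer, syllable_layer, word_layer = build_stack(["character", "syllable", "word"])
```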



Figure 1: (a) The hierarchical learning architecture. (b) A neural layer interacts with the layers above and below. The special L-neuron is responsible for reward propagation to currently active S-neurons and for the generation of new S-neurons. (c) One S-neuron receives and spreads activity, expecting particular sequential input. Fitness and numerosity measure its relative strength in the neural layer.

5.2 Neuron Characteristics

Each entity in the system is characterized by a neuron that is able to receive messages, send messages, and create new neurons. There are two types of neurons in our system: (1) a simple neuron (referred to as S-neuron) that specifies a short sequence Sd of other S-neurons that it depends on as well as a set of S-neurons Su that depend on it, and (2) a layer neuron (referred to as L-neuron) that is responsible for covering, reproduction, and reward distribution in each layer. Figure 1b visualizes the architecture of one layer in further detail. An S-neuron is active if the first S-neuron in its sequence Sd fires. An S-neuron fires if subsequently all S-neurons in its sequence fire. A simple example would be an S-neuron in the syllable-layer that specifies the S-neuron sequence Sd = the. Once the S-neuron t in the character-layer fires (indicating the reception of the character t), the is activated and expects h to fire. If h and e fire in sequence, the fires itself. Upon firing, a neuron notifies all neurons in the set Su as well as the L-neurons of this and the next upper layer.


More precisely, each S-neuron has the following components: (1) a sequence Sd of neurons (or characters in the lowest level) that determines its activity, (2) a set Su of S-neurons in the upper layer that depend on it, (3) a fitness f that determines its value, (4) a numerosity num that specifies how many identical (micro-)neurons this neuron comprises, and (5) references to the L-neurons of this and the upper layer. To give another example, an S-neuron X in the syllable-layer may specify the sequence Sd = (b, e), which means that X is activated if the S-neuron with sequence Sd = 'b' fires in the character-layer. Once activated, X expects the S-neuron with sequence Sd = 'e' to fire. If this is the case, X fires itself and is then deactivated. Otherwise, X becomes inactive. S-neurons can duplicate themselves, as specified below, and can destroy themselves. If a duplication or deletion occurs, the neuron informs the L-neuron of this layer and the next layer as well as the S-neurons in the set Su. Figure 1c shows the structure of one S-neuron. L-neurons keep track of the activities of the S-neurons in their level and of the activities in the lower level. They exist from the beginning, cannot be destroyed, and might be considered the activity control mechanism of a layer. Each layer of neurons has one L-neuron. The L-neuron knows all the neurons of its layer, and each S-neuron in the layer knows the L-neuron. Additionally, each S-neuron knows the L-neuron in the upper layer, notifying it when activated and when firing. If no S-neuron is active, the L-neuron generates a new S-neuron that is currently active. The L-neuron may also generate a concatenated S-neuron out of previously active S-neurons. Finally, each time step, the L-neuron sends a reward share to all firing S-neurons in its layer.
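The bookkeeping of a single S-neuron can be sketched as follows (an assumed structure: the field names mirror Sd, Su, f, and num from the text, while the method name and the exact matching logic are mine):

```python
# Sketch of an S-neuron: it expects the elements of its sequence Sd to fire
# one after another, and it fires itself once the whole sequence was observed.
class SNeuron:
    def __init__(self, sequence, initial_fitness=2.0):
        self.sequence = list(sequence)   # Sd: lower-level neuron (or character) identities, in order
        self.dependents = set()          # Su: upper-layer S-neurons that depend on this one
        self.fitness = initial_fitness   # f, initialized to Fi = 2 in the experiments
        self.numerosity = 1              # num: number of identical micro-neurons represented
        self.position = 0                # progress through Sd (0 = currently inactive)

    def step(self, fired_below):
        """Advance on matching lower-level firings; return True when this neuron fires."""
        if self.sequence[self.position] in fired_below:
            self.position += 1
            if self.position == len(self.sequence):
                self.position = 0        # fired: reset, then notify Su and the L-neurons
                return True
        elif self.position > 0:
            self.position = 0            # expected element did not fire: deactivate
        return False
```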

5.3 Reward Distribution

As outlined in the previous section, layers interact with each other by (1) signaling activity and firing, and (2) distributing reward. While activity is propagated through the network, reward determines the chances of survival and propagation. Reward may be received if a neuron fires. There are two sources of reward: (1) base reward and (2) activity reward. Base reward is provided by the L-neuron of each layer. Each time step, all neurons that fire receive a |Sd|-proportional share of the base reward Rb in their layer (Rb = 10 in our experiments). Thus, all active neurons in a layer compete for the currently available reward. The longer the sequence a neuron specifies, the higher its reward share. The basic idea of reward sharing is derived from the reward sharing mechanism in the learning classifier system ZCS (Wilson, 1994), where reward sharing was shown to be crucial for a successful evolution of a complete problem representation (Bull & Hurst, 2002). In combination with the evolutionary algorithm, neurons that represent the most frequent patterns take over the layer. Upon the reception of layer reward, a neuron updates its fitness by the following reinforcement-related delta rule:

    f ← f + δ ( Rb · |Sd| / Σi |Sdi| − f ),        (1)

where |Sdi| refers to the length of the sequence of the currently firing neuron i and f refers to the fitness of the neuron that is updated. The rule essentially keeps a moving average of the reward the neuron receives. The learning rate δ is set to 0.2 in our experiments. Activity reward is propagated downward from the next higher level. A firing neuron sends a reward share δf/|Sd| to each neuron in Sd, which is added to that neuron's fitness. This method gives an additional boost to neurons that are associated with neurons in the next higher level. The provided reward depends on the fitness of the neuron in the upper layer. The upper-layer neuron distributes its fitness share to the neurons in the lower level, sharing it equally among the neurons it depends on.
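A compact sketch of the two reward paths follows (assuming the SNeuron structure sketched in Section 5.2; the function names and the guards against empty inputs are mine). Base reward is shared |Sd|-proportionally among the firing neurons of a layer and folded into fitness by the delta rule of Equation 1. For the downward activity reward, the share is taken here to be proportional to how much of the sequence each lower neuron covers, following the behavior example below; this reduces to δf/|Sd| when all parts are equally long.

```python
DELTA = 0.2         # learning rate delta used in the experiments
BASE_REWARD = 10.0  # per-layer base reward Rb used in the experiments

def share_base_reward(firing_neurons, base_reward=BASE_REWARD, delta=DELTA):
    """Equation 1: each firing neuron moves its fitness toward its
    |Sd|-proportional share of the layer's base reward."""
    total_length = sum(len(n.sequence) for n in firing_neurons)
    if total_length == 0:
        return
    for neuron in firing_neurons:
        share = base_reward * len(neuron.sequence) / total_length
        neuron.fitness += delta * (share - neuron.fitness)

def send_activity_reward(firing_neuron, lower_neurons, delta=DELTA):
    """Top-down activity reward: distribute a delta * f share of the firing
    neuron's fitness over the lower-level neurons listed in its sequence Sd,
    proportionally to the part of the sequence each of them covers."""
    total = sum(len(lower.sequence) for lower in lower_neurons)
    if total == 0:
        return
    for lower in lower_neurons:
        lower.fitness += delta * firing_neuron.fitness * len(lower.sequence) / total
```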


For example, consider a neuron that represents the word behavior (which indeed emerges in our experiments). The word consists of the four syllables be, ha, vi, or, and indeed, usually four neurons evolve that represent those four syllables. When the word behavior is perceived, the four syllable neurons become active and fire in sequence. The S-neuron behavior becomes active once be fires. Once or fires, the neuron behavior fires as well, sending reward to all four syllable neurons and providing a 0.25δf fraction of its own fitness to each of them. If the word neuron behavior were linked to the syllable neurons beha, vi, and or instead, the syllable neuron beha would receive 0.5δf and the two others would receive 0.25δf each. Since the word neuron knows the IDs of the syllable neurons as specified in its Sd sequence, it can send the appropriate rewards directly. In Section 6 we show that this top-down influence causes the stable evolution of lower-level structure important for the higher level.

5.4 Evolutionary Algorithm

As the final component of the current COSEL architecture, we chose an evolutionary mechanism that enables distributed, adaptive processing and ensures noise robustness (Holland, 1975; Goldberg, 2002). The evolutionary algorithm is a steady-state, niched genetic algorithm (GA) as used in the ZCS classifier system (Wilson, 1994) and most Michigan-style learning classifier systems (Lanzi & Riolo, 2000). The GA continuously reproduces parts of the current population (that is, the set) of neurons and deletes others. In contrast to usual learning classifier systems, though, the population size is not fixed but dynamically adapts to the input received. COSEL is a distributed architecture that does not rely on any global measures such as population size. Rather, the propagation of activity itself causes the system to change and adapt. Each layer and each S-neuron in a layer evolves rather independently of the other neurons. The competition between the neurons arises from the reward sharing mechanisms explained above. There are four reproduction mechanisms in each layer: (1) duplication, in which a neuron duplicates itself, (2) deletion, in which a neuron deletes itself with a certain probability partially dependent on its fitness value, (3) covering, in which a neuron is generated by the L-neuron if the current activity from the lower layer is not represented by any neuron, and (4) reproduction, in which a neural sequence produces a combined offspring neuron. The four mechanisms are triggered after each learning step. In the word-layer, this step is triggered at the word ending only. In duplication, individual neurons decide autonomously if they should duplicate. Each neuron has a certain probability of duplication. After a certain number of time steps, θd (set to 100 in our experiments), a neuron may duplicate dependent on its current fitness. The probability of duplication equals (f − 1)/10. If duplication occurs, the parental neuron shares its current fitness with its offspring. The resulting offspring is identical to the parent. For computational reasons, instead of reproducing the parental neuron, we increase its numerosity num and deduct the fitness share accordingly. In this way, a neuron is effectively a macro-neuron that represents num micro-neurons (this is similar to the method used in the XCS classifier system (Wilson, 1995) and has been shown to only slightly influence performance in XCS (Kovacs, 1999)). Similarly to the probability of duplication, the neuron has a probability of deletion that is equal to (1 − f)². The quadratic term further decreases the deletion probability. If a neuron is deleted, its numerosity is decreased by one. If the numerosity drops to zero, the neuron is removed from the system. An additional random deletion is executed, in which a neuron deletes itself with a certain small probability φd (set to 0.01 in our experiments). Given the probabilities of reproduction and deletion, neurons tend to evolve toward a fitness level of 1, since higher fitness implies reproduction (and thus a decrease in fitness) whereas lower fitness implies deletion.


If fitness drops below 1, though, the neuron relies on new reward to increase its fitness back toward 1. If the delay until the next reward is too large, the neuron is likely to die out. Covering is realized by the L-neuron in each layer. When receiving activity from the lower layer, the L-neuron monitors whether any neuron is active in its layer. If no S-neuron is active, the L-neuron generates a new S-neuron that specifies a sequence of two or more neurons that were subsequently active in the lower layer, ending with the currently active neuron in the lower layer. The fitness of the new neuron is set to a fixed value Fi (set to 2 in our experiments). The generated neuron is immediately activated and fires, sending a message to the next higher layer. The length of the matching sequence is limited to a constant θlc (set to 5 in our experiments). Reproduction is also realized by the L-neuron, which monitors the subsequent firings of active neurons. With a certain probability φre (set to 0.01 in our experiments), the L-neuron creates a new neuron out of previously subsequently active neurons. The parental neurons do not provide a fitness share to the offspring in this case. The offspring is initialized with the fixed fitness value Fi. It is immediately activated since the generation procedure assures its current activity (that is, it matches the last few activity patterns). However, it may not fire immediately (or at all) since its matching sequence may extend into the future. For example, the subsequent neurons be and ha may be combined by the L-neuron into the neuron beha or beh or eha or eh. The length of the offspring sequence is limited in our experiments to a constant θlr (set to 6 in our experiments). If the generated offspring already exists in the population, the offspring is discarded. Summing up, each neuron underlies certain reproduction and deletion pressures. Reproduction depends on current fitness only and may occur only if a neuron's fitness is greater than one. Deletion occurs with a small random probability, which results in a deletion pressure towards extendedly inactive neurons. The additional deletion due to low fitness further increases this pressure. Thus, the system favors frequently active neurons as well as a complete, equally distributed, non-overlapping representation of the encountered text input.
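With the quoted parameter values (θd = 100, φd = 0.01), a per-neuron duplication and deletion step could look roughly like the following sketch. It is one reading of the described pressures, not the report's code; details such as how the parent's fitness is split with the offspring and whether the quadratic deletion term applies above f = 1 are my own assumptions.

```python
import random

THETA_D = 100  # minimum age in time steps before duplication is considered
PHI_D = 0.01   # small unconditional deletion probability

def duplication_deletion_step(neuron, age):
    """Apply the duplication and deletion pressures of Section 5.4 to one macro-neuron."""
    # Duplication: after theta_d steps, with probability (f - 1) / 10; instead of a copy,
    # numerosity is increased and the parent's fitness share is deducted (here: halved).
    if age >= THETA_D and random.random() < (neuron.fitness - 1.0) / 10.0:
        neuron.numerosity += 1
        neuron.fitness /= 2.0

    # Fitness-based deletion with probability (1 - f)^2, taken here to act only when f < 1.
    if neuron.fitness < 1.0 and random.random() < (1.0 - neuron.fitness) ** 2:
        neuron.numerosity -= 1
    # Additional small random deletion with probability phi_d.
    elif random.random() < PHI_D:
        neuron.numerosity -= 1

    return neuron.numerosity > 0  # False means the neuron is removed from the system
```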

5.5 Why and How Does it Work?

Several adaptive mechanisms interact in COSEL to evolve a good problem representation. Covering assures initial coverage of all inputs. Reward sharing causes competition among similar neurons. The genetic algorithm takes care of the evolution of new sequences and the propagation of frequent sequences. Since covering is completely randomized, it can only supply random sequences. The reinforcement learning mechanism, in combination with genetic reproduction, recombination, and deletion, is responsible for evolving frequent sequences. The reward update mechanism keeps the fitness estimates of the neurons at a competitive level. If some neurons get deleted, the fitness updates of similar neurons increase fitness (since a larger share becomes available) and the lost neurons will be replaced. Additionally, reproduction decreases fitness so that no over-reproduction of one type of neuron can occur. Thus, disregarding interactions with other neurons, the numerosity of each S-neuron will reach a steady state. In our experiments, the layer reward is set to 10 for each layer, so that an average of about 10 S-neurons is expectable for each frequently encountered text sequence. Additional interactions occur. S-neurons may overlap and specify sequences of different length. If they terminate at the same position, the reward is shared among them, providing a greater reward share to the longer ones. Thus, longer reliably occurring stretches are favored. Moreover, additional activity reward is provided from the next upper layer. The word-layer receives additional top-down reward upon word endings, distributed to all firing neurons that are not longer than the word plus one leading character.
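A back-of-the-envelope check of this expectation, under the simplifying assumption that all n micro-neurons representing the same frequent sequence fire together and split the base reward equally:

```latex
% Delta-rule fixed point when n identical micro-neurons share the base reward equally,
% combined with the duplication/deletion balance around a fitness of 1:
\[
  f^{*} = \frac{R_b}{n}, \qquad f^{*} \approx 1
  \quad\Longrightarrow\quad n \approx R_b = 10 .
\]
```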

Note that syllable and character S-neurons may be connected to many S-neurons in the next higher level. Thus, there is additional evolutionary pressure to evolve characters and especially syllables that are suitable for the formation of frequent words. The top-down influence is confirmed in Sections 6.3 and 6.4.

6 Results

The primary goal of the basic COSEL architecture introduced herein is to reliably develop representations of frequent words by neurons in the highest level depending on sequential activity in the lower levels. Thus, we investigate which types of sequences evolve in our experiments. For this purpose, we check the final populations of neurons for particular neurons that represent the most frequent characters, syllables, and words in the provided text input. First, we provide a short analysis of the text data.

6.1 Text Input

To test our system, we took the text of a research paper by the author (Butz, in press) and evaluated the performance by monitoring which character, syllable, and word representations emerge. The text consists of 24387 characters in total, drawn from 50 distinct character types. Table 1 specifies the occurrences of the most frequent characters in the text. Table 2 specifies the occurrences of frequent syllables. On the word level, there are 3590 words in total, 971 of them distinct. Table 3 specifies the most frequent word occurrences. In one experiment, the whole text was presented (character by character) 60 times to the learning architecture. The obvious desire is that the character and syllable layers develop stable representations of the frequent characters and syllables in the text, and that the word layer evolves stable representations of the most frequent words.
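The counts reported in Tables 1-3 are plain frequency statistics over the document. As a hedged sketch of how such reference counts could be recomputed for evaluation (this is not part of COSEL, which never sees these statistics, and the bigram count is only a crude stand-in for the unspecified syllable extraction):

```python
from collections import Counter

def reference_statistics(text):
    """Exhaustive frequency counts used only to judge what COSEL evolves."""
    characters = Counter(text)
    words = Counter(word for word in text.split() if word)
    # Crude proxy for syllable frequencies: within-word character bigrams.
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1)
                      if " " not in text[i:i + 2])
    return characters, words, bigrams
```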

6.2 Character Representation

The character layer evolves stable neurons for the most frequent characters (see Table 1). Interestingly, several characters became stable that were less frequent than others. For example, the character '-' was stable in all ten runs whereas the numbers '0' and '1' were not, although zero occurred more than twice as often as '-'. Two influences can explain this result: (1) numbers occurred mainly at the beginning of the text (usually citations with dates), so their distribution is less uniform than that of the symbol '-'; (2) more importantly, the character '-' is surrounded by spaces in several positions, which triggers the additional word reward to the word layer. Thus, the additional top-down reward boost allows the '-' symbol to be stably represented in the character layer (and in fact, the "word" '-' was also stably represented in the word layer).

6.3 Syllable Representation

The syllable layer is a little more difficult to analyze because the most frequent syllables are difficult to extract from the text. Nonetheless, Table 2 shows some syllable occurrences and how often the syllables were represented at the end of a run. The syllable results also confirm that more frequent syllables become more stable. Interestingly, again some syllables are represented less stably than other, less frequent syllables. For example, the syllable "vi" was not reliably present in the final population whereas the less frequent "if" was. A similar argument holds as in the case of the character layer. Since the syllable "if" itself coincides with the word "if", the frequent additional reward boost from the top layer allows "if" to be stably represented despite its infrequent occurrence. A different reason explains the unstable representation of "ry". Although this syllable occurs only at the end of a word, which should actually result in an additional reward boost, it does not become stable. The main reason is the competition with other syllables. All syllables that end with an 'r' (that is, "ir", "or", "ar", "ur", and "er") were present in the final population of the syllable layer. Additionally, "y" was always present in the syllable layer. Thus, "ry" continuously competes with these other, overlapping syllables and sometimes goes extinct. Moreover, all runs contained a representation of an extension of the syllable "ry", such as "ory", "tory", or "atory", stemming from the frequent word "anticipatory". On the other hand, the syllable "ly", which was always present at the end of a run, has fewer competitors (the syllable "ol" was not always present) and no frequently occurring extensions (as for "ry") are available.


Table 1: The number of runs in which a stable representation evolved (out of ten runs) confirms that most frequent characters are represented reliably. The average population size (in macro-neurons) was 46.2 with a standard deviation of 1.4. Each entry reads character=occurrences: stable runs (out of ten).

' '=3642: 10   'e'=2504: 10   't'=1948: 10   'n'=1716: 10   'i'=1640: 10   'a'=1578: 10   'o'=1404: 10   's'=1335: 10   'r'=1255: 10
'c'=847: 10    'h'=827: 10    'l'=742: 10    'd'=682: 10    'u'=566: 10    'p'=528: 10    'm'=417: 10    'f'=412: 10    'g'=333: 10
'y'=318: 10    'b'=305: 10    'v'=261: 10    'w'=211: 10    ','=178: 10    '.'=173: 10    'k'=88: 10     'x'=74: 10     ')'=47: 10
'('=45: 10     '0'=42: 8      '9'=42: 10     '1'=33: 8      'q'=26: 10     '2'=25: 3      'z'=24: 10     'j'=23: 8      '4'=23: 0
'&'=22: 3      '-'=18: 10     ';'=17: 0      '6'=15: 0      '3'=10: 0      others=14: 0

Table 2: Also frequent syllables are evolved reliably. The average population size of the syllable layer was 511.1 with a standard deviation of 12.1. Each entry reads syllable=occurrences: stable runs (out of ten).

an=375: 10   the=370: 10   ti=370: 10   or=329: 10   to=198: 10   ha=192: 9    be=126: 10   vi=119: 8
of=117: 10   pa=108: 9     ci=107: 10   ly=107: 10   ry=96: 7     if=39: 10    rd=20: 6     yc=6: 0

6.4 Word Representation

The word representations that evolved can be found in Table 3. It can be seen that all of the most frequent words were reliably present at the end of a run. Even long words such as "behavior", "anticipatory", and "learning" evolved reliably. Below a word frequency of about twenty, representations start to become unreliable. Longer words are harder to represent, so these representations become unreliable first. The table also indicates that, especially for shorter words, representations often evolved that included the leading space symbol. Obviously both representations compete for the provided reward, so that both evolve and often the longer ones, which receive a higher reward share at the syllable level, survive.


Table 3: Also representations of the most frequent words evolved reliably (numbers in brackets indicate the presence of the word with a leading space character). The average population size of the word layer was 4432.5 with a standard deviation of 60.3. Each entry reads word=occurrences: stable runs (out of ten).

the=301: 10           and=126: 10       of=108: 10        to=101: 10       in=85: 10        behavior=57: 10
anticipatory=54: 10   is=58: 10         a=50: 10          that=46: 10      as=39: 10        learning=38: 10
be=37: 10             or=33: 10         are=22: 10        on=25: 10        more=21: 9(10)   can=21: 10
processing=19: 4(4)   for=18: 9(10)     an=18: 9(10)      it=17: 10        by=17: 8(10)     sensory=17: 8(7)
structure=17: 3(1)    current=16: 5(6)  which=16: 6(10)   was=16: 9(10)    others=2197      diff.others=943

7 Summary and Conclusions

The introduced cognitive sequence learner COSEL is a concurrent learning system that evolves a reliable representation of frequent sequences. Several interesting observations were possible: COSEL is able to evolve stable clusters of recurring sequences in hierarchical form using reinforcement learning and genetic algorithm techniques. Each layer progressively develops more complex sequences (that is, single characters, syllables, and words), which stabilize each other in a bottom-up, top-down interactive fashion by means of reinforcement sharing. By providing a base reward and a covering mechanism for each layer, a base representation of the most frequent patterns is ensured. With the additional top-down reward, more complex patterns evolve and are sustained on demand. Thus, top-down "attentional" mechanisms can be simulated within the learning architecture. It needs to be recognized that the system is currently completely uninformed about symbol types (numbers vs. characters vs. blanks, and so on). Additional mechanisms that detect, for example, blanks could be used to distinguish words more easily and thus improve the performance of COSEL. Currently, the only additional feedback was provided at the end of a word. Such feedback can in fact be compared with a cognitive representation of a communication item. A baby in the first stages of language learning may have only very simple cognitive capabilities, so that essential words such as Mom and Dad are understood (and eventually reproduced) first. Only once cognitive representations are available that provide top-down feedback may more infrequent, longer, and special words evolve. Additionally, it is known that language learning undergoes several stages, the earliest of which is the recognition of phonemes and syllables. Thus, COSEL may be taught in several stages, first developing the character layer, then adding the syllable layer, and finally the word layer. The provided base reward and additional activity reward may be modified further as well. In general, reward indicates importance in COSEL. Higher reward triggers a more detailed representation and learning (offspring generation). Thus, in a more complex cognitive architecture, base reward may vary dependent on the current significance of, interest in, or attention to the current input. Learning may only be possible if attention is paid to the current input in the first place. Another issue is the anticipatory capability of the system. Already in its current stage, the system implicitly predicts the next characters, syllables, and words. Any neuron that becomes active in the syllable or word layer predicts subsequent characters or subsequent syllables, respectively. Right now this feature is not used, but it may show advantages in (1) faster comprehension and (2) possibly improved learning speed. For example, a strongly violated prediction may trigger an offspring creation directly (as the psychology literature often suggests that surprise triggers learning (Rescorla & Wagner, 1972)), so that the learning mechanism may be directed more towards novelty and "interesting" events.


The top-down influence may also be used to produce output. Similar to the observed interaction of speech perception and production in linguistics, the same neural layers may be active, and potentially interfere, in a task that requires perception and reproduction. Additionally, the formed neural clusters may be reused and adapted further to allow speech production. Currently, COSEL is applied deterministically. There are no degrees of activity, nor can neurons become active due to the activity of more than one lower-level neuron. For example, the COSEL architecture might be applied to real speech patterns that are pre-processed by auditory filtering techniques. In general, each neuron may extract any type of information from the lower-level activity. Given real-valued input, neurons might be formed that analyze pitch, frequency, volume, and so forth. Thus, the lower-level neurons may not simply fire in the event of a character, but they may evolve to fire in the event of a certain phoneme sound, indicated by the activity of appropriate feature-extraction neurons. The system may also be applied to higher-stage learning such as grammar learning (Morris, Cottrell, & Elman, 2000) by adding a "sentence-layer" on top of the word-layer. We also would like to mention that the approach might be used as a first learning stage that establishes the wiring in the network. Once basic neural connections are established, a second learning stage may adjust connection weights as in a normal neural network. The plasticity of the learning system would decrease but its stability would increase, very similar to the actual cognitive development in humans. In general, the COSEL architecture is rather general and is not restricted to the investigated text-sequencing task. Each layer may comprise different feature-extracting mechanisms. In effect, each neuron might be considered a kernel whose activity indicates the presence of a certain type of pattern. The higher levels serve as combinatory layers that combine features identifying frequent patterns, very similar to a multilayer neural network, but in a much more adaptive, self-organized style with the potential for effective concurrent computing. Further investigations and improvements of the learning mechanisms in the COSEL architecture are necessary to increase the robustness and understanding of the system as well as to evaluate the full potential of the approach.

Acknowledgment

I am grateful to Pier-Luca Lanzi, Xavier Llorà, Kei Onishi, Martin Pelikan, Kumara Sastry and the whole IlliGAL lab for their help and the useful discussions. Many thanks also to David E. Goldberg for his encouragement in pursuing this work and the useful discussions as well as to Joachim Hoffmann for his inspiring perspective and encouragement. The work was sponsored by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant F49620-03-1-0129. The US Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. Additional funding from the German research foundation (DFG) under grant DFG HO1301/4-3 is acknowledged. Additional support from the Computational Science and Engineering graduate option program (CSE) at the University of Illinois at Urbana-Champaign is acknowledged. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Office of Scientific Research or the U.S. Government.


References

Brenowitz, E. A., Margoliash, D., & Nordeen, K. W. (1997). An introduction to birdsong and the avian song system. Journal of Neurobiology, 33, 495–500.
Bull, L., & Hurst, J. (2002). ZCS redux. Evolutionary Computation, 10(2), 185–205.
Butz, M. V. (in press). Anticipation for learning, cognition, and education. On the Horizon.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York, NY: John Wiley & Sons.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Feng, A. S., & Ratnam, R. (2000). Neural basis of hearing in real-world situations. Annual Review of Psychology, 51, 699–725.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Reading, MA: Addison-Wesley.
Goldberg, D. E. (2002). The design of innovation: Lessons from and for competent genetic algorithms. Boston, MA: Kluwer Academic Publishers.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press. Second edition 1992.
Holland, J. H., & Reitman, J. S. (1978). Cognitive systems based on adaptive algorithms. In Waterman, D. A., & Hayes-Roth, F. (Eds.), Pattern directed inference systems (pp. 313–329). New York: Academic Press.
Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. Proceedings of the 8th Annual Conference of the Cognitive Science Society, 531–546.
Kovacs, T. (1999). Deletion schemes for classifier systems. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-99), 329–336.
Lanzi, P. L., & Riolo, R. L. (2000). A roadmap to the last decade of learning classifier system research. In Lanzi, P. L., Stolzmann, W., & Wilson, S. W. (Eds.), Learning classifier systems: From foundations to applications (LNAI 1813) (pp. 33–61). Berlin Heidelberg: Springer-Verlag.
Margoliash, D. (1997). Functional organization of forebrain pathways for song production and perception. Journal of Neurobiology, 33, 671–693.
Margoliash, D. (2003). Offline learning and the role of autogenous speech: New suggestions from birdsong research. Speech Communication, 41, 165–178.
Morris, W. C., Cottrell, G. W., & Elman, J. L. (2000). A connectionist simulation of the empirical acquisition of grammatical relations. In Wermter, S., & Sun, R. (Eds.), Hybrid Neural Systems (pp. 177–193). Berlin Heidelberg: Springer-Verlag.
Mozer, M. C. (1992). Induction of multiscale temporal structure. In Lippman, D. S., Moody, J. E., & Touretzky, D. S. (Eds.), Neural Information Processing Systems 4 (pp. 275–282). San Mateo, CA: Morgan Kaufmann.
Pashler, H., Johnston, J. C., & Ruthruff, E. (2001). Attention and performance. Annual Review of Psychology, 52, 629–651.

Pashler, H. E. (1998). The psychology of attention. Cambridge, MA: MIT Press.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In Black, A. H., & Prokasy, W. F. (Eds.), Classical Conditioning II: Current Research and Theory (pp. 64–99). New York: Appleton-Century-Crofts.
Schmidhuber, J. (1992). Learning complex extended sequences using the principle of history compression. Neural Computation, 4(2), 234–242.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Wang, D. (2003). Temporal pattern processing. In Arbib, M. A. (Ed.), The Handbook of Brain Theory and Neural Networks (2nd ed., pp. 1163–1167). Cambridge, MA: MIT Press.
Wang, D., & Yuwono, B. (1995). Anticipation-based temporal pattern generation. IEEE Transactions on Systems, Man, and Cybernetics, 25(4), 615–628.
Wilson, S. W. (1994). ZCS: A zeroth level classifier system. Evolutionary Computation, 2, 1–18.
Wilson, S. W. (1995). Classifier fitness based on accuracy. Evolutionary Computation, 3(2), 149–175.
