CONNECTIONIST LEXICAL PROCESSING

Ivilin P. Stoianov

The research reported in this thesis was funded by the Groningen Behavioural and Cognitive Neuroscience research school (BCN) with an Ubbo Emmius grant. It was also supported by the Netherlands Organisation for Scientific Research (NWO).

Copyright © 2001 by Ivilin P. Stoianov
Typeset with LaTeX
Cover design by Ivilin Stoianov
Printed in The Netherlands
Groningen Dissertations in Linguistics (GRODIL) 31
ISSN 0928-0030

Rijksuniversiteit Groningen

Connectionist Lexical Processing

Thesis (proefschrift) submitted to obtain the doctoral degree in Mathematics and Natural Sciences at the Rijksuniversiteit Groningen, on the authority of the Rector Magnificus, Dr. D.F.J. Bosscher, to be publicly defended on Friday 23 March 2001 at 14:15 by Ivilin Peev Stoianov, born on 29 December 1969 in Yambol, Bulgaria

Promotor: Prof. Dr. Ir. J.A. Nerbonne

Beoordelingscommissie (assessment committee):
Prof. Dr. Ir. D. Duifhuis
Prof. Dr. N. Petkov
Prof. Dr. J. Elman


To my teachers Rusko Shikov & Georgi Tenev

Foreword

In 1983 a young teacher of mathematics, Georgi Tenev, at the Mathematical Gymnasium of Yambol, Bulgaria, attracted a few of his new students to the vast field of mathematics, me among them. Shortly after, another scholar, Rusko Shikov, won me over to the cause of computer science (Informatica), which would prove to be the most important step of my professional life. A number of competitions and awards followed, among them my admission to the University of Sofia. During my university years, I worked in parallel at the Bulgarian Academy of Sciences (BAS), where I also completed my M.Sc. research supervised by Dr. Vladimir Shapiro, who drew me to the field of connectionism. A scientific position in the Institute of Information Technologies, BAS, followed, under the great supervision of Dr. Ieroham Baruch, who shaped me as a scientist. Equally important to me during the years in this institute was my 'guru' Dr. Nikolai Spasov, as well as all my other colleagues from the institute and the academy, Dr. Anisava Miltenova among them.

Realising the importance of Cognitive Science, I could not resist responding to an announcement about a doctoral study position in connectionist natural language processing at the University of Groningen, The Netherlands. Language is an important means of producing and communicating thoughts, and the connectionist paradigm is currently closest to the structure of the brain, so working on this topic provided me with an interesting and challenging opportunity to explore human cognition. This I managed to do for a full four years, supervised by Prof. Dr. John Nerbonne, who coached my work very successfully. It is successful also due to the kind collaboration of Dr. Laurie Stowe, who helped me to work out the psycholinguistic details of the thesis, Walter Jansen, who helped me a lot to get into phonology, Dr. Tier De Graaf, Dr. Peter Been, Dr. Gosse Bouma, and other colleagues from the Alfa-Informatica Department. Useful to the research in this thesis was also my collaboration with Dr. Ronan Reilly and Dr. James Hammerton, with whom I discussed connectionist aspects of the work. The experimental side of the research could hardly have existed without the collegial and friendly support of the system administrator of the Unix system in our department, san-sei Shoji Yoshikawa, who did his best to provide me with all available computational resources for training the connectionist models presented in the thesis. I also thank all my colleagues from Alfa-Informatica for their patience.

To do successful research, one must be able to concentrate on the scientific problems and not have to be too concerned with the big or small administrative details. Thanks to the excellent organisation in the department and the faculty, these things stayed somehow invisible. The people who managed to make this happen were Mark Kas, Sietze Looyendga, Rob Visser, and the rest of the administrative staff, who with their friendly support changed my ideas about administration.

The hard work in Groningen would have been much harder without the relaxing evenings with my colleagues Tony Mullen, Stasinos Konstantopoulos, Rob Koeling, Walter Jansen, Miles Osborne and Rob Malouf. However, Groningen became 'home' to me mainly due to my best friends here, Budi Oetomo & Jaike Blankestijn. My friends Ramo Gopal, Sergei Grachev, Dorin Saban, Sina Koening, and all others also contributed a lot to my maintaining a good mood in spite of not often seeing blue skies and sunny days. 'Home' means lots of things, but also one's native language, customs and culture. These things were transferred to Dutch soil by my wonderful Bulgarian friends, Stefka & Stoian Stoianovi, Elka Augoston-Nikolova, Eli, Emilia, Vera, Milena & Erick, Liubo & Liana, Lili Grozeva, Donka & Boyana, Maria & Paul Shumsk, Anko Popov, and many others with whom I spent many happy hours. Thank you very much!

Doing research is fun, but writing a thesis is a trial, especially when it is a regular-sized book written in a non-native language learned at the age of twenty. My supervisor took on not only the task of commenting on and discussing the thesis, but also the difficult task of understanding what I wanted to say and helping me to express it in a plausible way, chapter by chapter, for which I am very grateful to him! Other colleagues who helped me with this job were Dr. Laurie Stowe and Tony Mullen.

A written manuscript turns into a proper, full-fledged thesis only after respectable evaluation. Three people are responsible for the promotion event on the 23rd of March 2001 due to their positive evaluation. I thank Prof. Dr. Jeff Elman, Prof. Dr. Nikolai Petkov and Prof. Dr. D. Duifhuis for agreeing to be on my evaluation committee.

Finally, and most importantly, I thank my family in Bulgaria, who have always been and will always be my secure support in any situation. I am also grateful to anybody else not mentioned here who directly or indirectly has had a hand in the long process of completing my thesis (proefschrift).

5 Feb. 2001, Groningen
Ivilin Stoianov

Contents

1 INTRODUCTION  1

I BACKGROUND  5

2 ASPECTS OF NATURAL LANGUAGE  7
  2.1 Natural Language Phenomena  7
    2.1.1 Studies about Natural Language  8
    2.1.2 Organisation of Natural Languages  9
    2.1.3 Languages, Grammars and Finite State Automata  10
    2.1.4 Phonotactics and Syntax  13
    2.1.5 Chomsky's Universal Grammar  15
    2.1.6 Smolensky's Optimality Theory  17
    2.1.7 Written Languages and Grapheme-To-Phoneme Conversion  18
  2.2 Methods for Language Learning and Processing  19
    2.2.1 N-grams  20
    2.2.2 Hidden Markov Models  21
    2.2.3 A different look at language learning  22

3 CONNECTIONIST MODELING  23
  3.1 Foreword  23
  3.2 Computational paradigms  24
  3.3 Biological Neuronal system. Organisation.  26
  3.4 Artificial Neurons and Neural Networks  31
    3.4.1 Artificial Neurons  31
    3.4.2 Neural Structures  33
    3.4.3 Neural Networks Learning  34
  3.5 Data Representation  35
  3.6 NN architectures  39
  3.7 Simple Recurrent Networks  44
    3.7.1 The Back-Propagation Through Time Learning Algorithm  46
    3.7.2 Computational Power of Simple Recurrent Networks  49

II SEQUENTIAL LEXICAL MODELLING  55

4 LEARNING PHONOTACTICS  57
  4.1 Motivations  59
  4.2 Formalisation of Learning  60
    4.2.1 Neural Transducer and Neural Predictor  62
    4.2.2 Aspects of Learning Phonotactics with RNNs  67
  4.3 Evaluating the Learning: Formalisation  72
    4.3.1 Matching the Successor Distribution  73
    4.3.2 Interpreting the Output - the Threshold problem  74
    4.3.3 Optimal Phonotactics Learning  76
    4.3.4 Optimal Grammar Learning  77
  4.4 Experiments - Phonotactics Learning  79
    4.4.1 SRN  79
    4.4.2 Data  82
    4.4.3 Training  84
    4.4.4 Evaluation  86
  4.5 Network Analysis  91
    4.5.1 Weight Analysis  91
    4.5.2 Functional analysis: Theory  93
    4.5.3 Findings  95
    4.5.4 Syllabic structure  96
  4.6 Conclusions  97

5 GRAPHEME-TO-PHONEME CONVERSION  101
  5.1 Psycholinguistic Lessons  103
  5.2 Computational GPC Models  106
    5.2.1 Symbolic GPC models  107
    5.2.2 Connectionist GPC models  110
  5.3 GPC with a Neural Transducer  115
  5.4 Data and Representation  117
    5.4.1 Database - Dutch words  118
    5.4.2 Distributed Representations  121
  5.5 Learning Monosyllabic GPC  122
    5.5.1 Training  124
    5.5.2 General Performance  125
  5.6 Evaluation  126
    5.6.1 Frequency  127
    5.6.2 Consistency  128
    5.6.3 Phoneme Position and Word Length  130
    5.6.4 Phonemes  133
    5.6.5 Phonetic Features  136
  5.7 Improved training methods  138
    5.7.1 Focusing on inconsistent patterns  138
    5.7.2 Focusing on patterns with errors  140
  5.8 GPC of Polysyllabic words  144
  5.9 Modeling Dyslexia  145
  5.10 Conclusions  149

III HOLISTIC LANGUAGE MODELLING  153

6 RECURRENT AUTOASSOCIATIVE NETWORKS  155
  6.1 Introduction  155
  6.2 Sequences, Hierarchies and Representations  158
  6.3 Developing linguistic representations  162
  6.4 Recurrent Autoassociative Networks  170
    6.4.1 Summary of the Processing Steps  175
    6.4.2 Experimenting with RANs - I: Learning Syllables  176
    6.4.3 RAN Experiments - II: Developing Representations of Numbers  179
  6.5 A Cascade of RANs  183
    6.5.1 Size of Distributed Representations  189
    6.5.2 Simulation III: representing polysyllabic words with a RAN-cascade  191
    6.5.3 A more realistic experiment: looking for systematicity  192
  6.6 Toward Cognitive Modeling  195
  6.7 Discussion  198
  6.8 Conclusions  204

7 RAN, HOLISM, and HOLISTIC COMPUTATIONS  207
  7.1 Introduction  207
  7.2 Holistic Systems and Holistic Computations  209
    7.2.1 Earlier studies on holistic computations  212
  7.3 Implementation of Holistic Computations  214
    7.3.1 Data: RAN-developed holistic representations  216
  7.4 Extracting tokens from DRs  218
    7.4.1 Holistic operator 1: Extracting symbols at a specific position  218
    7.4.2 Holistic operator 2: Extracting symbols at a variable position  220
  7.5 Holistic operator 3: Reversing strings  221
  7.6 Scaling up  222
  7.7 Discussion  225

8 CONCLUSIONS  227
A Phoneme Encoding  235
B A part of the Dutch dataset  239
C SRN15 pronunciation errors  243
D Non-Words  245
E Nonword Pronunciation  247
F Samenvatting  261

Chapter 1

INTRODUCTION

People strive toward the future. People strive to find out about themselves. This thesis is hopefully a small step in those two directions. It is a work on natural language modeling and processing with connectionist models. The first goal is situated within language processing, the importance of which everybody would agree on in the digital era, where "communication" is of great importance. Connectionism, on the other hand, is a computational paradigm inspired by the way humans think, developed to discover the secrets of human intelligence.

Complex symbolic natural languages are one of the things most peculiar to people. Human languages have hierarchical structures, they can be represented in many ways, and they can ultimately express anything possible to imagine. Yet, all this would mean nothing to us if we did not produce and comprehend this language; without production and comprehension capabilities, it would be like the unlimited number of other representational systems, meaningless to us. People can produce and comprehend almost any expression in the language they speak, even though it may carry very complex meaning. Our nervous system does this in a very reliable way. Directly inspired by the structure of this system is a computational paradigm known variously as Neural Networks, Parallel Distributed Processing, or connectionism (Rumelhart & McClelland 1986). Yet, natural language modeling is mostly practiced with other, classical approaches which find their foundations in the natural languages themselves: symbolic methods. Those approaches usually perform very well at this job, but they do not explain the way humans process their languages: they provide no link to the subsymbolic architectures of the human brain. All they can suggest in that respect is the basic assumption that a neuronal device capable of complex language processing and containing inherent linguistic knowledge, for example Universal Grammar, has been developed during evolution and is now exploited.


Connectionism goes further than this. Connectionist models suppose only a general neuronal substrate without specific (linguistic) bias, which can learn from the environment, that is, which can adjust its memory in order to improve the human's behaviour in that environment.

The main point in this thesis is to show that natural languages can be modelled and processed with connectionist models, at least at a lexical level. For this purpose, two basic lexical problems will be addressed. The first one is modeling the sound structure of words, also known as lexical phonotactics, studied in Chapter 4. The other one is translating words from one representational modality (written forms) into another one (phonological forms), which is explored in Chapter 5. Those two problems do not at all exhaust the space of problems related to lexical language modeling. Yet, successful work on them contributes to our understanding of what kind of connectionist models might be used for lexical modeling, how they do it, and what connectionist models might be used for other linguistic problems.

No doubt, studying complex language expressions is even more challenging work. To do this, however, proper connectionist tools are needed. Recognising this and the limitations of some of the existing models, Chapter 6 presents a model aimed at developing representations of sequential structures to be used for further complex processing. This model is called "Recurrent Autoassociative Networks" (RAN). It exploits two basic observations. Firstly, external data is mostly dynamic (varying temporally), but internal processing benefits in speed if it uses static representations of sequential data. The other observation is that most learning is based on repetition. This suggests the use of sequential autoassociation: repeating the just-observed data in order to develop unique static representations of the input sequences. This approach results in one very useful side-effect: having both an encoder and a decoder in the same computational device. Further, I suggest two directions for using the representations developed by this model. The first one is modeling hierarchy in languages, which is explicated by a two-level lexical model. The other, very promising direction is holistic modeling, which is a step toward high-level connectionist symbolic modeling. Chapter 7 suggests some ideas about how this could be done and provides some basic holistic operators.

This thesis targets two main groups of readers: (1) people with connectionist backgrounds aiming at cognitive language modeling, and (2) people with linguistic backgrounds willing to explore cognitively plausible methods for linguistic modeling. For that reason, the thesis will start by presenting background material in two introductory chapters: Chapter 2 will outline some issues related to natural languages and their modeling, and Chapter 3 will focus on connectionism and will present in detail the main connectionist model used in this thesis, the Simple Recurrent Network, developed by Elman (1988). In addition, since this is a bridging work between different sciences, the main chapters will also provide extended explanations for easier comprehension. I will further provide a detailed introduction to the different problems outlined above locally in each chapter, since those problems have their own specificity and it is not necessary to reiterate this information.

Finally, I want to finish this introduction by remarking, first, that all the models presented here were implemented in my own programs, written in the programming language C++ under Unix, and next, that most of the work in this thesis has already been published in seven conference papers and one chapter of a book on Recurrent Neural Networks. Four are co-authored with my supervisor Prof. Dr. Nerbonne and/or Dr. Laurie Stowe (Stoianov, Nerbonne & Bouma 1998, Stoianov & Nerbonne 2000b, Stoianov, Stowe & Nerbonne 1999, Stoianov & Nerbonne 2000a); the others were written independently (Stoianov 1998, Stoianov 1999, Stoianov 2000b, Stoianov 2000a). This thesis presents that work systematically and contains some new material as well.


Introduction Recap:

- The main perspective is the Cognitive Science approach toward human cognition, and in particular the language capacity. Theories of language will be judged with regard to their fitness to the brain's structure and capacity.
- Main NL problems to be exploited: lexical processing: (1) lexical learning, (2) mapping from orthography to phonology, (3) developing representations of sequences, and (4) holistic computations.
- Main connectionist model: the Simple Recurrent Network (Elman 1988), as a universal sequential connectionist device. I will use this model throughout the thesis, exploiting the principle of reusing existing models for different functions, as outlined by Reilly (1997).
- Main claim: Lexical learning with general connectionist models is possible (Chapters 4, 5).
- Main achievements: (1) Lexical learning with Simple Recurrent Networks (Chapter 4). (2) Modeling Grapheme-to-Phoneme Conversion of Dutch words (Chapter 5). (3) A connectionist model for general language modeling and holistic computations: Recurrent Autoassociative Networks (Chapters 6, 7).

Part I

BACKGROUND


Chapter 2

ASPECTS OF NATURAL LANGUAGE

2.1 Natural Language Phenomena

Complex natural language (NL) is widely seen as one of the features that make us human (Shepherd 1994), and also as a necessary condition for some of the other distinctive human features: social interaction and culture. That is, our communal life entirely depends on the existence of languages. On the other hand, perhaps the complexity of our social life is one of the main evolutionary requirements for the development of complex human languages, because there are other animals that also communicate with each other and live in small communities (e.g., whales and primates) but have much simpler signalling systems, which one would not classify as languages. This complexity makes human languages very interesting, and recent developments in computational methods and equipment make them practically interesting as well, since it would be very convenient for humans to communicate with machines in natural language. There are also possibilities for machines to assist humans in communicating with each other, for example in on-line speech translation.

Computational methods for language modeling and processing have been known for quite some time (e.g. Chomsky 1957). New challenges that continue to make this field interesting to explore are alternative approaches to language modeling, especially connectionist ones (see Chapter 3 for an introduction to connectionism). This thesis concerns language processing, and connectionist language modeling will be its focus. In this chapter, however, I will present to the reader some background information about the nature of natural languages, what sorts of problems they suggest, and what kinds of methods are being used to solve those problems.

The structure of the chapter is as follows: This section continues with a short description of the basic organisation of human languages and methods for their description. Note that the language structure noted here is the simple result of observations. Languages vary, and therefore methods used for their processing should be capable of learning, preferably by using very little background knowledge. Section 2.2 describes two popular methods for language learning, and an alternative learning approach will be suggested, which will be the first introduction to connectionist language learning.

2.1.1 Studies about Natural Language

NL is studied from different perspectives. The theory of languages, that is, their properties, organisation, etc., is studied by General Linguistics. It comprises two opposite objectives: firstly, to determine the universal properties of all human languages, and secondly, to determine and explain all possible variations among languages. It will be interesting to us from the perspective of providing basic descriptive concepts about languages. On the other hand, languages exist only due to humans, and Psycholinguistics is concerned with human language processing. Psycholinguistic studies show how humans perform different linguistic tasks. This will be of help when comparing various computational models on the extent to which they match human performance. Neurolinguistics looks at the neurobiological substrate, being interested in how languages are represented and processed in the brain. Its basic arsenal of techniques includes neuroimaging, with methods such as Positron Emission Tomography (Stowe, Wijers, Willemsen, Reuland, Paans & Vaalburg 1994) and (Functional) Magnetic Resonance Imaging (Salle, Formisano, Linden, Goebel, Bonavita, Pepino, Smaltino & Tedeschic 1999). In the next chapter I will outline some of the basic findings, such as the location of different brain regions involved in language processing and the timing of some of the processes. Those results also help in language modeling, by suggesting what kinds of modules are involved in human language processing and what the global architecture of our language system is.

Natural language is one of the most convenient ways of communicating with the increasing number of electronic devices we now have. Also, there is a growing demand for electronic assistance in document processing, speech translation, etc. All those processes are applications of another discipline, called Computational Linguistics. At this time, computational linguistics mostly exploits "classical symbolic approaches" to language processing. Although they are not a direct subject of this work, outcomes from such methods will be used as a benchmark against which to compare the performance of the connectionist models explored here. Finally, let us just mention that all of these studies view NL phenomena within the comprehensive study of human intelligence: Cognitive Science.

2.1.2 Organisation of Natural Languages

Unless one is particularly interested in language organisation, it is easy to underestimate its complexity. One can easily observe the basic building blocks (phonemes in speech or letters in written languages), the elements conveying meaning (words), and sentences. Language structure, however, is more complicated, as general linguistic studies reveal.

Phones constitute the basic sound inventory of a language (O'Grady, Dobrovolsky & Katamba 1997). We speak and understand languages by producing and perceiving relatively stable patterns of sounds, called phones, which convey no meaning. The first step abstracting from the physical reality is the phonemic level, whose inventory is a set of abstract tokens called phonemes. They roughly correspond to the phones, in a relationship which is language dependent. Phonemes are organised into sequences known as syllables, the minimal pronounceable units. Not all sequences of phonemes constitute legitimate syllables; there are some restrictions. All languages have their own system of which combinations of phonemes are allowed, and phonotactics studies those systems. Syllables in turn combine to produce words. The words represent the first linguistic level whose elements convey meanings. Those meanings are in general independent of the shape of the words, but still, some exceptions apply. For example, the pattern 'X-s' in English means a plural of the word 'X', and a pattern 'X-able' signals an adjective that means capable of being/subjected to the process 'X'. Morphology is particularly concerned with such patterns.

The next, more abstract level of the organisation of human languages is the phrasal one. Phrases usually restrict the more general meanings of the words or modify them. For example, noun phrases can specify which particular object is meant. Similarly, verb phrases specify an action performed, as well as some more specific details about it. The combination of a noun phrase and a verb phrase is a clause, which conveys a fact, an action, etc., with specified semantic roles: agent (the subject of the noun phrase), theme or action (the head of the verb phrase), patient (specified in the verb phrase), etc. A clause that stands alone constitutes a sentence, which, however, might have an even more complex structure. A few successive sentences can constitute a paragraph. We can go even further: sequences of paragraphs make dialogues, stories, etc. For a complete description of the structure of languages, one might refer to a textbook on linguistics, e.g. (O'Grady et al. 1997, Finegan 1999).

This thesis will focus on the syllable and word (lexical) levels of natural languages. Chapter 4 will concern methods for studying syllabic structure; Chapter 5 will focus on methods for translating words from their written forms into phonological representations; and in Chapter 6 and Chapter 7 an approach for developing representations of sequences, including words, will be presented, whose main purpose is to prepare the ground for studying the most challenging, syntactic level of NL.

Evolution of structures in Human Languages

Why do human languages have such a complex organisation? Among the answers given by anthropologists, a very reasonable one stems from Carstairs-McCarthy (1999, p.131), based on the ever-increasing need to convey more and different messages. Firstly, very basic sounds were used. Then, sequences of those sounds made "words". But memorising tens and hundreds of thousands of words is a burden. Therefore, the words may have split into meaningful sub-sequences, easier to remember, but restricted in ordering. This duality of patterning involves one set of patterns organising sounds into syllables and words, and a second, independent set of patterns organising words into sentences. According to Carstairs-McCarthy's hypothesis, at this moment the first sentences appeared. In time, those organisations gradually became more elaborate. The author goes even further, hypothesising that the very first rules for combining words may have derived from the organisation of syllables into words.

2.1.3 Languages, Grammars and Finite State Automata

The structure of natural languages described above may be defined quite precisely in the framework of mathematical and computational linguistics, and indeed, a lot of effort has been spent in this direction (Pavlov 1982, Chomsky 1965, Pollard & Sag 1994). In this section I will note the basic terms used in such studies, which will also be used in the rest of the thesis. We begin with a formal definition of a language L:


Definition 1 (Formal Language $L$) A language $L = \{w_j\}_{j=1}^{|L|}$ is a set of sequences $w_j = \langle c_{j_1}, c_{j_2}, \ldots, c_{j_{|w_j|}} \rangle$, each sequence $w_j$ built out of the symbols of an alphabet $\Sigma = \{c_1, \ldots, c_{|\Sigma|}\}$.

It is possible that the number of sequences (words) in $L$ is infinite. The number of basic symbols, however, is finite. There are two ways to describe a language $L$ from a processing perspective: (1) by defining a system that generates all sequences $w_j \in L$, which system is also called a generative grammar $G^L$, and (2) by defining a system that recognises all sequences from that language, also known as a recognising automaton $F^L$. There are also declarative definitions which ignore processing.

To define a grammar, another set of symbols is also used, the so-called non-terminal symbols. They define intermediate states during the generation of sequences from $L$. There is one special non-terminal symbol $S$, which defines the initial state of the grammar. Finally, there is a special set of rules that replaces one sub-sequence of symbols with another in the string $w$ being generated.

Definition 2 (Generative Grammar $G^L_{\Sigma,N,S,P}$)

A generative grammar $G^L_{\Sigma,N,S,P}$ is a quadruple $(\Sigma, N, S, P)$ such that:

- $\Sigma$ is a set of terminal symbols (elements of the language): $\Sigma = \{c_1, \ldots, c_{|\Sigma|}\}$.
- $N$ is a set of non-terminal symbols (internal elements of the grammar): $N = \{v_1, \ldots, v_{|N|}\}$.
- $S \in N$ is the start symbol of $G$.
- $P$ is a finite set of replacement rules $x \to y$, where $x$ and $y$ are strings from $(\Sigma \cup N)$ and there is at least one non-terminal symbol in $x$.

A grammar $G$ generates a word $w$ by repeatedly applying rules from $P$, starting with a string containing only the start symbol: $w_0 = S$. This symbol is then replaced by applying a rule from $P$ which has the string $S$ on its left-hand side; the resulting string $w_1$ is processed further by applying another appropriate rule from $P$, which replaces a substring of $w_1$ with another string, resulting in a string $w_2$. This process goes on until the resulting string $w_k$ contains terminal symbols only.
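As a small illustration of the derivation process just described, the sketch below simulates it directly. The code is my own and is not taken from the thesis; for simplicity the toy rules are restricted to context-free form (a single non-terminal on the left-hand side), and all rule names are invented.

    # Illustrative sketch only (not the thesis' code): deriving a word w_k from the
    # start symbol S by repeatedly applying replacement rules from P.  The toy rules
    # are context-free (single non-terminal on the left) for simplicity.
    import random

    P = {
        "S": ["NP VP"],
        "NP": ["the N"],
        "VP": ["V NP", "V"],
        "N": ["dog", "cat"],
        "V": ["sees", "sleeps"],
    }

    def derive(start="S", max_steps=50):
        string = [start]                                    # w_0 = "S"
        for _ in range(max_steps):
            sites = [i for i, sym in enumerate(string) if sym in P]
            if not sites:                                   # w_k contains terminals only
                return " ".join(string)
            i = sites[0]                                    # rewrite the leftmost non-terminal
            string[i:i + 1] = random.choice(P[string[i]]).split()
        return " ".join(string)

    print(derive())   # e.g. "the dog sees the cat"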


Language hierarchy

Languages are divided into five main categories, depending on the size $|L|$ and the type of replacement rules $P$ in their generative grammar. Each successive language type is a subset of the earlier one. The largest class is the languages of class zero, which are infinite in size and have no restriction on $P$. Context-sensitive languages are infinite in size, with rules like $xVy \to xwy$, where $x, y, w$ are strings from $(\Sigma \cup N)$ and $V \in N$. They form type one. Type two, the context-free languages, is formed by languages in which the rules have the form $V \to w$. The simplest infinite languages, of type three, are the regular languages, and they only have rules like $V \to aY \mid b$, where $a, b \in \Sigma$ and $V, Y \in N$. These languages are also called finite-state languages, since they can be recognised by automata with a finite number of states (deterministic Finite State Machines; they will be described later in this subsection). Finally, if $|L|$ is finite, then the language might be described by a list of expressions. This is also the simplest type of language.

I want to remark that natural languages are not formal languages. Different levels of NLs can be described with different types of formal languages, but there are always points at which descriptions are more natural than sets of rules. In addition, languages continuously change. Nevertheless, grammars describing human languages are very useful for studying them from a theoretical point of view. Making the structure of languages transparent by extracting rules describing them also helps students of second languages, who can make use of such high-level abstract knowledge. However, languages also exist without such explicit descriptions. An explicitly stated grammar is clearly not required for learning a language: children learn their native languages without having any explicit knowledge of those structures.

Finite State Automata

I will complete this subsection on language formalisation with a definition of Finite State Automata (FSA), which in their simpler version only recognise regular languages. Extended FSA are finite state transducers that transform sequences of an input regular language $L^\Sigma$ into sequences of an output regular language $L^\Gamma$ (Pavlov 1982). Two versions of the latter are known (Carrasco, Forcada & Neco 1999). The so-called Moore machines associate an output symbol $c^\Gamma_q$ with every automaton state $q$ and produce this symbol whenever they visit this state. Alternatively, Mealy machines associate an output symbol $c^\Gamma_t$ with every automaton transition $t(q_i \xrightarrow{c^\Sigma} q_j)$ and produce this symbol whenever they perform this transition. It is easy to show that these two state machines are identical in terms of the class of operations they can perform. The following Definition 3 describes Moore machines under the name Finite State Automata.

Definition 3 (Finite State Automaton $FSA_{\Sigma,\Gamma,Q,\delta,\lambda,q_I}$)

A finite state automaton $FSA$ is a sextuple $(\Sigma, \Gamma, Q, \delta, \lambda, q_I)$ such that:

- $\Sigma = \{c^\Sigma_1, \ldots, c^\Sigma_{|\Sigma|}\}$ is a finite input alphabet;
- $\Gamma = \{c^\Gamma_1, \ldots, c^\Gamma_{|\Gamma|}\}$ is a finite output alphabet;
- $Q = \{q_1, \ldots, q_{|Q|}\}$ is a finite set of states;
- $\delta : Q \times \Sigma \to Q$ is the state transition function;
- $\lambda : Q \to \Gamma$ is the production function;
- $q_I \in Q$ is the initial state.

The more specific language-recognising FSA (deterministic FSA, DFSA) are a subclass of the Moore machines and can be defined as Moore machines with an output alphabet $\Gamma$ = {"yes", "no"}: the answer of the DFSA as to whether the input string belongs to the input regular language $L^\Sigma$ or not.
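To make Definition 3 concrete, here is a minimal sketch of a Moore machine and of its DFSA special case. It is my own illustration: the example language (ab)* and all names are invented and are not taken from the thesis.

    # Illustrative sketch of a Moore machine (Definition 3); the example is invented.
    # The DFSA special case uses the output alphabet {"yes", "no"}.

    class MooreMachine:
        def __init__(self, delta, lam, q_initial):
            self.delta = delta          # transition function delta: (state, input symbol) -> state
            self.lam = lam              # production function lambda: state -> output symbol
            self.q_initial = q_initial  # initial state q_I

        def run(self, string):
            q = self.q_initial
            outputs = [self.lam[q]]                  # a Moore machine emits in every state it visits
            for c in string:
                q = self.delta.get((q, c))
                if q is None:                        # undefined transition: reject
                    return outputs, "no"
                outputs.append(self.lam[q])
            return outputs, self.lam[q]              # final output = accept/reject for a DFSA

    # A DFSA recognising the regular language (ab)* over the input alphabet {a, b}.
    dfsa = MooreMachine(
        delta={("q0", "a"): "q1", ("q1", "b"): "q0"},
        lam={"q0": "yes", "q1": "no"},
        q_initial="q0",
    )
    print(dfsa.run("abab")[1])   # yes
    print(dfsa.run("aba")[1])    # no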

2.1.4 Phonotactics and Syntax

Two examples of NL organisation are phonotactics and syntax. Syllable (word) phonotactics describes the structure of the syllables (words) of a given language, that is, the rules by which phonemes are organised in order to form syllables (words). Similarly, syntax describes the structure of sentences, the word order allowed in the language. A substantial difference between those two levels is the number of basic items the grammars use. Syllables and words are built out of some 30-100 phonemes (this is language dependent), while sentences are built out of tens of thousands of words. This alone makes syntax harder to study. Words are divided into syntactic categories, which are the building blocks of all syntactic theories (e.g. UG (Chomsky 1965) and HPSG (Pollard & Sag 1994)). Nevertheless, although the number of those categories is more nearly comparable to the number of phonemes, syntactic rules seem to involve some lexically specific dependencies. Therefore, semantic/lexical information is also necessary, which significantly increases the number of basic elements in syntax.


Another important difference between syntax and phonotactics is that the former is also productive, that is, sequences of words are generated according to the syntactic rules of the language. In contrast, phonotactics only seems to be used for recognising correct phonemic sequences, although neologisms (new words in languages) usually follow the existing phonotactic rules.

Next, according to Kaplan & Kay (1994), the structure of natural language words is describable by regular languages. In particular, they have proven that the system of morphology-to-phonology rules can be realised by a finite-state transducer, from which one can derive that phonology, and correspondingly phonotactics, can be described by regular languages, that is, finite-state automata. In contrast, syntax is more powerful, and syntactic rules (e.g., S → NP VP) must, at least in theory, be described by context-free, or perhaps even context-sensitive formalisms. In practice, however, humans show limited processing capacities, especially when left-embedded and centre-embedded recursion is involved. This may suggest that humans process syntax with a more limited processing device, perhaps as powerful as an FSA, but this problem will not be of concern here.

Normally, humans are not consciously aware of their linguistic knowledge. It is rather implicit, or intuitive. We can find it, however, by performing linguistic tasks at various levels. For example, we can test the grammaticality of sentences (syntax), or whether they make sense (semantics). Similarly, lexical-level rules can be tested by estimating the well-formedness of sequences of letters/phonemes taken as words in the language. This is demonstrated in the following example: The sequences in (2.1) definitely are English words.

(2.1) mother, father, brother, sister

None of the sequences in (2.2) are, however:

(2.2) *mther, *faeaer, *brthr, *obxtve

None of these sound like English words. More importantly, we are quite sure that they are not English words at all. However, the sequences in (2.3)

(2.3) *mithor, *fothur, *brithir, *santer


concept. This happens all the time, and it is part of the process of language development. This example shows that we have a feeling for word structure, even if no explicit knowledge. Also, the huge variety of words makes it dicult to put it into a compact form. Rather, we might classify the phonemes into groups and outline some very simple grammar, such as LW in (2.4): vowel =ajojuje vowels = vowel [vowels] consonant bjcjdjfjg (2.4) consonants = = consonant [consonants] syllable = [consonants] vowels [consonants] word = syllable [word] Of course, this grammar is too simple to describe the actual structure of English words. It just gives us an idea how this structure might be described with a grammar. In order to produce the real grammar of the words from a natural language, we have to consider all existing words in the language and extract the rules that govern their structure. This process of extracting the structure of the words is called learning lexical phonotactics and it is the subject of Chapter 4.

2.1.5 Chomsky's Universal Grammar

No doubt, language is much more complex than the simple examples given in the previous section. Even though syntax, the grammar at the sentence level, has been studied by linguists for many years, there are still different opinions about the details of one of the best studied languages, English. How, then, can people, especially children, master such complexity in great detail? Chomsky (1965), trying to solve this problem, postulated the existence in the brain of a universal language device which he called a "Universal Grammar" (UG) that contains an "initial state" of the human language faculty, prior to any linguistic experience. Chomsky requires that any theory explaining the brain-and-language relation must answer three basic questions: (1) What constitutes knowledge of language? (2) How is knowledge of language acquired? and (3) How is knowledge of language put to use? (Chomsky 1986).

According to Chomsky, the knowledge of language is a theory concerned with the state of the mind of the person who speaks a particular language. More specifically, the theory of language UG is a complex set of grammars that work with representations at different levels: logical forms (abstract representations of meanings), deep structures (internal structured representations of sentences), surface structures (representations of sentences as they are pronounced), and phonetic forms (phonetic sequences to be articulated) (Chomsky 1957). In the most recent form of UG, the Minimalist program, the second and the third (internal) representational levels are reduced (Veenstra 1998). The representations at each level are generated by context-free grammars and are transformed between the levels by context-sensitive transformational grammars.

The most intriguing part of this inquiry comes from the answer to the second question, which basically says that knowledge of language comes as a result of mastering an initial universal theory about language (UG) (Chomsky 1986), or, as it is stated more directly in Chomsky (1965, p. 27), "the child approaches the data with the presumption that they are drawn from a language of a certain antecedently well-defined type, his problem being to determine which of the (humanly) possible languages is that of the community in which he is placed. Language learning would be impossible unless this were the case. ... innate schemata that are rich, detailed, and specific enough to account for the fact of language acquisition."

The disagreement which other theories about our language capacity have with UG is not about the existence of a language learning device. No doubt, we speak complex symbolic languages, while other animals do not; at least, their communication systems are not as highly developed as humans'. The disagreement is about the prior knowledge, the biases in our language learning system. Is it a super-complex, language-biased processor (UG) which is only tuned to the language environment it experiences in the first few years, or is it a general learning mechanism with very few specifically linguistic properties, which profits in learning from some of its model-specific properties, such as distributed representations and generalisation in the case of connectionist models (see Chapter 3 for details)? It is the second hypothesis that I will explore throughout this thesis: that a general connectionist learning device, exploiting the principles of distributed data representation and generalisation, can learn complex grammars with very little linguistic knowledge. Although the work in this thesis does not concern syntax, learning grammars at a lexical level (Chapter 4) contributes to the argument that complex, abstract knowledge can be developed by attending to the environment and generalising from it. My conjecture would be that syntactic categories and combinatoric patterns can be developed in a very similar way, by learning and generalisation.


2.1.6 Smolensky's Optimality Theory


Another theory that embraces the idea of universality in human languages is Optimality Theory (OT) by Prince & Smolensky (1993); see also Archangeli & Langendoen (1997) for a very good introduction. It was developed as an attempt to solve the problem of the too many candidate structures produced by Chomsky's UG when explaining input data. The main idea of OT is the existence of a sort of filter that assesses each of the candidate structures and filters out those which violate a set of ranked, language-specific constraints.

Optimality Theory works in the following manner. Firstly, input data arrives. The input is a well-formed sequence built out of an alphabet, part of Universal Grammar. Different levels of UG provide different types of input data. So far, OT has been exploited mostly in Phonology and Morphology, so features, phonemes and syllables form the input at those different levels. The main purpose of OT is to transform this input data into output data according to some pre-specified function, for example, to syllabify an input word. Next, a generator transforms this input into a number of potential output candidates, for example, different variants of syllabification. The following step in OT is to evaluate each of the candidates. Here OT makes use of another universal set: innate universal constraints. Those constraints are ranked, and the violation of a lower-ranked constraint is overshadowed by the satisfaction of higher-ranked constraints. Here the variety of languages plays a role: different languages rank those constraints differently. Also, the ranks determine which constraints are more vulnerable to violation, and correspondingly, this determines the markedness of the language. After the evaluation, an evaluator selects the optimal output candidate and uses it as the output for the given input.

Although OT is well established, there are some issues which confuse the potential computational linguist willing to implement the complete story. For example, the input is supposed to be well-formed. What sort of device will provide this kind of input? Also, linguists still argue about what kind of input there should be at all, e.g., whether it should be plain or rich. As for the generator, here is where OT provides a great challenge to computational linguistics, because the number of potential outputs might be unlimited. The neurolinguist would also wonder how brain structure could provide the theoretically unlimited number of possible structures, especially if we go to syntax. Also, the innateness of UG and of the set of constraints raises other sorts of psycholinguistic and neurolinguistic concerns.

There is one curious aspect here: this theory grew out of the connectionist perspective (see Chapter 3 for an introduction to connectionism).


OT is a direct successor to Smolensky's Harmony Theory (Smolensky 1986), developed as a connectionist model of human cognition. However, whereas OT ranks the set of constraints in a strict order, Harmony Theory weights those constraints, which is biologically more plausible.
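To give a concrete feeling for the evaluation step described above, here is a minimal sketch of strict constraint ranking. The constraints, their ranking, the candidate syllabifications and all names are invented for illustration and are not taken from the thesis.

    # Illustrative sketch of OT's evaluator: candidates are compared on their
    # violation profiles under a strict constraint ranking (invented example).

    def onset(syllables):       # ONSET: a syllable should begin with a consonant
        return sum(1 for s in syllables if s[0] in "aeiou")

    def no_coda(syllables):     # NOCODA: a syllable should not end in a consonant
        return sum(1 for s in syllables if s[-1] not in "aeiou")

    RANKED = [onset, no_coda]   # strict ranking: ONSET >> NOCODA

    def evaluate(candidates):
        # Lexicographic comparison of violation counts implements strict domination:
        # one violation of a higher-ranked constraint outweighs any number of
        # violations of lower-ranked constraints.
        return min(candidates, key=lambda cand: tuple(c(cand) for c in RANKED))

    # Two candidate syllabifications of an invented input /pakta/:
    print(evaluate([["pak", "ta"], ["pa", "kta"]]))   # ['pa', 'kta'] under this toy ranking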

2.1.7 Written Languages and Grapheme-To-Phoneme Conversion

Languages existed originally in a spoken form only, and most of the approximately 8,000 languages of the world still have no written form. Much later, written representations were invented; the oldest evidence of written scripts representing language dates from 5,500 years ago (Finegan 1999). Evolutionary pressure led to a wide expansion of written languages, because writing speeded up the spread of knowledge and human development. Yet, perhaps half of the human population is still illiterate.

There are different types of written forms of languages. The first scripts used picture-like signs (logographs), whose numbers grew as the languages developed. The Chinese and the Japanese languages, for example, have such logographic writing systems (Kanji). More recently developed writing systems are derived from the phonology of the languages and have many fewer signs, e.g., the Greek, the Latin and the Cyrillic alphabets, with 24, 26 and 30 letters, respectively. Children learn to read the logographic codes with greater difficulty than the phonological ones. Among the latter, there is variability, too. For example, languages that adopted the Latin alphabet are less straightforward to learn, since this alphabet represents their native phonology less directly (e.g., in the Germanic languages). The Romance languages, on the other hand, have a more systematic relationship between written and phonological word representations. This is also valid for the even more recent Cyrillic alphabet, invented and developed especially for the Bulgarian language and used later for other languages (Boyadjiev, Kutcarov & Pentchev 1999). Bulgarian words written with Cyrillic letters are pronounced in a very regular way.

The basic representational form of languages, however, is speech (or sign language for deaf people), since infants learn their languages by hearing and producing speech. Therefore, humans initially comprehend written languages by converting the written codes into phonological representations, which is how the process of reading begins. Later, after mastering the reading process, they might associate written forms directly with semantic representations, but since syntactic and semantic analyses are necessary to get the meaning of the written text, it is quite likely that text-to-speech conversion also plays a role later, during adulthood. In this respect, modeling Grapheme-to-Phoneme Conversion (GPC) is a substantial lexical problem to study, both from a cognitive standpoint, to explain this cognitive process, and from a practical point of view, as part of an automatic system that learns to read text aloud. For that purpose, Chapter 5 presents a connectionist model that sequentially transduces orthographic sequences into their phonological counterparts.

2.2 Methods for Language Learning and Processing

Besides working on descriptive linguistics, computational linguistics is also concerned with learning NL within different computational paradigms. There are two main groups of methods here: symbolic and stochastic approaches. Among the first type of methods are, for example, N-gram models, which represent sequential data with chunks of it (there are also stochastic extensions of this method), and Inductive Learning (IL) methods, where knowledge is encoded into grammars or automata such as Finite State Machines (FSM), in which knowledge is represented as nodes and connections between them. Every node there has its own meaning and is similar to a non-terminal symbol in artificial grammars. The first subsection here will present the principles of the N-gram models. Stochastics is a relatively new stream in NLP, as compared to the classical deterministic symbolism, and it embraces a number of methods, such as Hidden Markov Models and Maximum Entropy modeling. A closer look at those methods would reveal their symbolic ancestors, enriched with probabilities in the transitions between states or nodes. The second subsection will briefly present Hidden Markov Models and will explain how they are used for lexical modeling.

All of these methods share one important feature: well-founded theory and, correspondingly, successful applications. Furthermore, they share an absence of similarity to the structure of the neurobiological substrate. This does not underestimate their importance for NLP; it is just to remind us that the candidate theory (or theories) explaining our cognitive capacity for language should be sought somewhere else. Actually, some of these models have also been used in cognitively more plausible frameworks. Specifically, the N-gram model is often used in this scenario, with resulting models such as "Memory-based models" (Daelemans 1999) or the neurobiologically much more elaborate "Adaptive Resonance Theory" (ART) (Carpenter & Grossberg 1992). Yet, the explicit use of classical symbols there makes large-scale applications difficult to implement. In ART, for example, the required symbolic nature of the category nodes (localistic encoding; see the next section for details) makes it difficult to work with sequential data, where theoretically unlimited contextual states are possible (and language is such a case). Still, there are some attempts to overcome this problem, although in a very limited domain (Cotteleer & Stoianov 1999).

2.2.1 N-grams

$N$-grams refer to models $N_k^L$ of a language $L$ that consist of the set of all unique fixed-size ($k$ items long) sub-sequences occurring in the words of $L$. Strings are recognised as words from $L$ iff all their substrings of length $k$ are in $N_k^L$. N-grams can also be used for generating words, starting with an initial n-gram from the model and then gradually increasing the length of the generated string $w$ by choosing an n-gram whose left substring (head) of length $k-1$ is the same as the right substring (tail) of length $k-1$ of $w$, or vice versa (tail/head match). In order to rule out some impossible combinations when generating or recognising strings, initial and final symbols (or "spaces") may be attached to the words of the modelled language $L$.

$N$-gram models are compact descriptions of the sequences of the modelled language, due to the redundancy among the sub-sequences. However, they are not exact models of the language, since they can also recognise and produce sequences which are not in the original language. For example, the bi-gram model $N_2^{L_1}$ of a language $L_1$ containing the sequences {"[language]", "[languages]", "[model]", "[models]"} is the following set: $N_2^{L_1}$ = {[l, la, an, ng, gu, ua, ag, ge, e], es, s], [m, mo, od, de, el, l], ls} (the square brackets are used as left and right word delimiters). In this model, besides the words from $L_1$, other strings such as "lage" are also recognised as belonging to $L_1$, which is due to the limited descriptive power of the model. Nevertheless, such a basic model serves as a good baseline when examining the performance of other language models, and it will be used in Chapter 4.
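The bigram case can be made concrete with a few lines of code. The sketch below is my own illustration (not the implementation used in the thesis) and reproduces the $N_2^{L_1}$ example above, including its over-acceptance of "lage".

    # Illustrative sketch of a bigram (k = 2) model with word delimiters, reproducing
    # the N_2^{L_1} example above; '[' and ']' mark the beginning and end of a word.

    def ngrams(word, k=2):
        padded = "[" + word + "]"
        return {padded[i:i + k] for i in range(len(padded) - k + 1)}

    def build_model(lexicon, k=2):
        model = set()
        for w in lexicon:
            model |= ngrams(w, k)
        return model

    def recognise(word, model, k=2):
        # A string is recognised iff all of its k-grams occur in the model.
        return ngrams(word, k) <= model

    L1 = ["language", "languages", "model", "models"]
    N2 = build_model(L1)
    print(recognise("language", N2))   # True
    print(recognise("lage", N2))       # True: over-acceptance due to the model's limited power
    print(recognise("dog", N2))        # False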


2.2.2 Hidden Markov Models

Hidden Markov Models (HMMs) are discrete stochastic models which explain sequences of external events generated by an underlying (hidden) process (Rabiner 1989). In the model, the hidden process is represented with a set of states, only one of them being active at a certain moment, and at every processing step it evolves from one state to another according to a given transitional distribution. At every state, the model generates an output from a given alphabet or function, according to a speci c output distribution associated to each state. Formally, HMMs might be viewed as stochastic Finite State Automata, built up from: (1) a nite set of (hidden) states Q = fqi gNi=1 , (2) dynamics, determined by a transitional probability distribution P T (Q=Q) and an initial state probability distribution P I (Q), and (3) emission (output) probability distribution P O (C=Q). Note, that the di erence between HMMs and FSA are (1) the choice of transitions from one state to another: in FSA this transition is determined by the current input, while in HMMs the transition is chosen according to the stochastic transitional distribution; (2) the output in FSA is deterministic { a speci c output symbol is attached to a transition or to a state, while in HMMs the output is stochastic { it is produced by stochastic output functions attached to each state (or to the transitions in some alternative HMM models). HMMs are popular in speech recognition, since they provide a good tool for describing the complexity of the speech signal and because they are trainable. A very e ective HMM training algorithm is the Baum-Welch algorithm that produces a set of models describing a set of training sequences optimally. Training essentially consists of tuning the probability distributions P I (Q), P T (Q=Q), and P O (Q) in a way that a given trained model best explains a given training sequence. HMMs can be applied for lexical modeling by training an HMM L on the structure of the words of a language L. Then, the phonotactic constraints describing the words of the language can be extracted from the underlying model and the emission probabilities. In addition, the model can be used to estimate the probability that a given sequence w belongs to the training language L, by using the so-called Viterbi algorithm. Then, the question of whether this sequence is a word or not might be answered by thresholding this probability. Actually, the same problem applies to neural networks used for language modeling, what will be discussed in Chapter 4. HMMs were used to model the phonotactics of Dutch monosyllabic words in Tjong Kim Sang (1998), where learning phonotactics was viewed as a lexical recognition problem (section 4.3 explains this). Various approaches


Various approaches with HMMs there led to about 99% acceptance of positive data (words from the training language) and between 91% and 99% correct rejection of negative data (random strings). These results are generally comparable to the success of the lexical learning with connectionist models which I will present in Chapter 4.

2.2.3 A different look at language learning

The enormous world-wide every-day process of ordinary human development in different forms of social communities displays one very simple fundamental of NL: it is developed by active repetition. Even more strikingly, this strategy of language acquisition is not species-specific. Every animal with some sort of a neural system grows up in the presence of animals similar to it and, just by repeating the acts of its parents or other mature animals, it learns the qualities necessary for a normal life. The human language faculty does not develop in a very different manner. Arguments speaking in favour of the importance of repetition come, for example, from constant language evolution. Languages, as we know, are in a constant process of change. This can be explained by the relatively small diversity of individual linguistic knowledge, which as a whole drives the language spoken in a given community in one or another direction. Individuals in their early stage do not choose which language to speak. They simply listen to the speech around them and gradually start to produce it, with possible modifications. All humans are different and all of them contribute, more or less, to language change. Sequences of phonemes, words, and sentences "unacceptable" 1000, or even 100, years ago constitute nowadays normal words, grammatically correct sentences, and socially accepted dialogues. Similarly, formerly popular words have vanished, almost forgotten in old books and early dictionaries. Actually, languages owe a debt to dictionaries, books and other media, which, by recording languages, words and their use, slow language change by preserving language material in a relatively stable form. Learning language by repetition is also the basic learning strategy in this thesis. For that purpose, learning models (neural networks, explained in Chapter 3) will be trained on different language tasks by letting these models repeat the objects they are given, eventually making mistakes and correcting these in such a way that the next time the models process the same object, they do so better. Active repetition is also the basic idea behind a language learning model presented in Chapter 6, which sequentially autoassociates input sequences in order to develop their static representations.

Chapter 3

CONNECTIONIST MODELING

3.1 Foreword

It is not strange that evolution has developed a completely different sort of computational device than the common computer, which has a single processor (CPU) and shared memory. Everyone working with computers has experienced the unpleasantness of a computer that crashes and loses data; moreover, this might happen even to the most reliable of such single-processor systems. Fortunately, the nervous systems of all animals on our planet, including ourselves, do not crash, not only under normal conditions but also in extreme situations, such as a moderate loss of neurons. The main reason for this reliability is that the neurons, the main building blocks of the nervous systems, work independently of each other and in parallel. Even if some of them fail, the rest still correctly process the incoming information, thus most probably controlling the organism correctly. This parallelism also derives huge computational power from slowly working chemical processors. Unfortunately, however, most current computers rely on one processing unit only, whose crash would be fatal for a working system. The computational power of those linear computers is also limited by the physical properties of their processing units. For example, the maximal CPU processing speed, the most critical productivity factor, is almost attained in current CPUs running at about 1 GHz. In this chapter I will briefly introduce the basics of an alternative to the classical computational architectures, which is inspired by the neural systems of living organisms. This alternative is connectionism, also


known as artificial neural network systems, which has been popular for more than two decades. In the remainder of this thesis I will present my ideas about how connectionism can be used to model one aspect of the human cognitive capacity: the phenomena of natural language. The chapter starts with a short comparison of the two main computational paradigms used nowadays, linear vs. parallel, connectionism being a special subclass of the latter. Then, background information about biological neurons and neuronal structures will be presented, which provide the main inspiration for the developments in connectionism. Basic information about artificial neurons and neuronal structures is presented in the next section, followed by a discussion of the way information may be represented in connectionist models. I also present a few established connectionist models, focusing on dynamic neural networks, which are the most adequate for linguistic processing. Special attention will be paid in the last section, 3.7, to the so-called Simple Recurrent Networks (SRNs), which are used as the basic neural network throughout the remaining chapters. Readers who are already conversant with connectionism might skip this introductory chapter and only use section 3.7 as a reference on SRNs. It will, however, be necessary for readers with a linguistic background who may still feel uncomfortable with connectionist ideas and terminology.

3.2 Computational paradigms

In this section I introduce the two basic computational paradigms used today, linear and parallel computing, stressing some advantages of the latter. Then, the connectionist paradigm will be introduced and shown to offer a very reliable method for computation and modeling.

Linear vs. Parallel computations

There are two very different ideas about how computing should be done: linear and parallel. Linear computing is what most computers today do: a single central processing unit (CPU) performs one operation at a time. Those operations are successively taken from a written algorithm (program) stored in a special common memory. This model corresponds to the classical von Neumann (1903-1957) architecture. Although it offers a simple method for computation, linear computing is vulnerable to crashes, because the single processing unit, as a physical device, is subject to failure. Another disadvantage of a single-processor machine is the processing speed, which is directly dependent on the speed of the processor, as I already mentioned.


On the other hand, parallel computing is based on systems consisting of more than one processing unit, which together perform more than one operation at a time. Those processing units have their own local memory and may work independently of each other. Parallelism at such a physical level directly addresses the problems of limited speed and vulnerability to system failure. Still, in order to perform a global task, synchronisation mechanisms are necessary, such as common memory and special signalling. In spite of the parallel processing, such centralisation of control again makes these systems vulnerable to crashes and is the bottleneck of the speed of parallel systems. However, there are other types of parallel systems, connectionist systems, that exploit the principles of parallel processing even further, at the data level, which increases the reliability even more.

Connectionism / Parallel Distributed Processing

Connectionism refers to a special sort of parallel processing, where densely interconnected processing units perform very simple operations, signal accumulation and transmission, and the memory is represented as the activation of the units and the strength of the connections between them. Those connections are modified according to very simple learning mechanisms, which are usually local in time and space. External signals enter the system at some pre-specified units which have interface functions only and which distribute their activation to other processing units. In turn, when those processing units accumulate a critical amount of signal, they produce impulses which are propagated to yet other units, and so on. The activation of some of the units in the system is interpreted as the product of the system (see Fig. 3.5) and it might directly activate external devices (effectors). In living systems effectors are muscles and glands; in artificial systems effectors are motors, computer displays, and so on. The capacity to adjust the reaction of connectionist systems adaptively in response to the current input, according to their "task", is what makes them very appropriate for modeling. They can learn almost any desired function if a pre-specified behaviour is required, or an input signal needs to be converted so that subsequent processing can handle it more easily. Further, in such distributed systems there is no central single controlling processor. Some neuronal sub-systems might be involved in some sort of higher-level controlling mechanisms, but even then control is distributed. No matter what the function of a neuron in such a system is, each neuron performs the very simple tasks of signal accumulation, reacting and synapse


adjusting, which makes connectionist systems very reliable and resistant to failures of single neurons. This kind of processing is also called Parallel Distributed Processing (Rumelhart, Hinton & Williams 1986). This name emphasises one very important feature of connectionism, namely that the signal being processed is distributed among a number of units. Besides increasing the reliability of the system, this peculiarity induces inherent semantics in the distributively represented data being processed. In contrast, the classical symbolic systems work with symbols which do not have inherent meaning; the meaning there is externally attached. In a PDP system, such as connectionist image processing, data is represented by its content and each of its elements (here pixels) is processed in parallel by one neuron. Similarly, linguistic data might be represented with vectors in which each element stands for a certain feature. It is also possible that the semantics of those features overlap, and this redundancy further increases the reliability of the processing. Further, input data which share most of their features will naturally be processed in a similar way, which leads to another very important property of connectionism, namely generalisation: the capacity to correctly process unseen data. If the model has learned from examples how to process a certain class of data, other unseen examples from this class will most probably be processed in a similar way due to the shared features.

3.3 Biological Neuronal System. Organisation.

The idea of Parallel Distributed Processing (PDP) is inspired by the organisation of the brain, which consists of about 10^11 neurons, organised into different structures and densely interconnected. Different sorts of signals flow throughout the brain, all of them being processed by the neurons almost simultaneously. In this section I will sketch the organisation of the nervous system, not focusing on details, but rather looking from a structural point of view. For a comprehensive description of the neural system, one might refer to (Kandel, Schwartz & T.M. Jessell 1991, Shepherd 1994, Nicholls, Martin & Wallace 1992). To describe a system as complex as the human brain, one must start from the very low molecular level. Nevertheless, since the models that will be considered in this work approximate the neuronal system at a very coarse level, the description here will start from the neuronal level.


Figure 3.1: Structure of a biological neuron. Functionally there are four main elements: (1) a soma collecting the signal coming through (2) the dendrites (input fibres) and distributing the signal further through (3) the axons (output fibres). An axon from one neuron is connected to the dendrites of another neuron through (4) synapses (or terminal buttons), which represent the main memory of the neurons.

Neurons

Biological neurons (Fig. 3.1) are cells with the following main functional parts: a body (soma), which accumulates signals coming from the input fibres (dendrites) and which produces at the axonal hillock a series of bursts (impulses) when the accumulated signal reaches a critical threshold. These impulses are propagated to other neurons through an output fibre called an axon. The axon branches and connects to other neurons via connections of variable strength called synapses. On average, one neuron receives signal from 1000 other neurons. The synapses are in fact the Long-Term Memory (LTM) of the neurons. The synaptic size (conductance) varies in time, in accordance with certain biological laws called learning mechanisms. On the other hand, the current activation of the neuron is its Short-Term Memory (STM), and it changes much faster than the synapses. The signals that are propagated through the neurons consist of a train of impulses, also called action potentials, or spikes. Those spikes


have an electrochemical nature. The basic mechanism that propagates them through the nerve cell is called the sodium-potassium pump (Shepherd 1994). The stronger the signal, the longer this train. The spikes themselves have a relatively constant nature: at the place of a spike, the cell is depolarised from a rest potential of about -70 mV to some +40 mV, for about 1-10 ms. The length of the train of spikes varies from 0-10 spikes per second, which corresponds to an almost inactive state of the neuron, to about 1000 spikes per second, which is a very active burst. There are various types of synapses. Firstly, there are chemical and electrical ones, the first type being predominant throughout the nervous system (Nicholls et al. 1992). Electrical synapses are much faster, but they are sensitive to small changes in the cells. We find them in places where time is critical, such as the very first layers of the retina. Chemical synapses, on the other hand, are slower, since they use chemical (neuro)transmitters to transfer signal and a number of related biochemical processes. Most of them use glutamate and acetylcholine (ACh) as neurotransmitters, with primarily excitatory synaptic action. There are negative (inhibitory) connections, too, mostly based on the GABA and glycine neurotransmitters. Further, synapses connect not only axons to dendrites, but also axons with axons, axons with somas and even dendrites with dendrites. This means that even at a neuronal level there is a possibility for signal transformation.

Neuronal circuits and systems

The huge power of the nervous system is due to the complex organisation of the neurons, which occurs at various levels. Local circuits refer to a group of neurons within a certain region whose function is to implement very simple local computations, such as re-excitation, antagonistic interactions, and so on (Shepherd 1994). The distributed nature of data processing is exhibited at another level of organisation, the neuronal field: a set of neurons which perform one step of the processing of one type of signal, with possible inter-connections among them. One or a few neurons in this field process in parallel one feature of the distributed signal (data). Further, neuronal fields located at adjacent layers in the brain are organised into a column, a neuronal structure that performs a complex transformation of the input signals distributed over many neurons. A typical example of this structure are the columns that process visual signals in the brain, in the so-called V1-V4 visual fields. The raw visual signal there is gradually transformed by extracting features of increasing complexity.


Figure 3.2: A schematic representation of the brain regions involved in language processing. Of particular interest are Broca's and Wernicke's areas.

At a still higher level of organisation are the pathways and distributed systems. The former usually refer to the complete processing of a signal from a certain modality. For example, the visual pathway starts from the retina, where the visual signal is detected, goes through the thalamus into the visual cortex, and then through the different visual regions V1, V2, and so on. At every step complex processing is done, with possible associations to other modalities. It has been found that in such pathways the signal proceeds not only in a forward direction, but that there is also extensive feedback, which is responsible for processes such as prediction and signal completion. Distributed systems, on the other hand, refer to a number of connected regions which together mediate a certain aspect of the behaviour of the organism. For example, there are auditory, visual, motor systems, and so on. All functional systems build up the complete nervous system and most of them are located in the brain. The systems performing the particularly interesting high-level functions are located in the so-called cerebrum, or neocortex. A number of brain-imaging studies as well as earlier experiments have led to the general acceptance that language processing is done primarily in


a portion of the posterior aspect of the third frontal convolution (Broca's area) and the region including the posterior aspect of the superior temporal gyrus (Wernicke's area) of the left cerebral hemisphere. A lateral view of the left hemisphere, with the language-relevant structures represented, is given in Fig. 3.2.

Learning and Memory

It is the capacity of the neurons to adaptively modify the strength of the synapses and to keep these changes relatively stable that provides living organisms with the capacity for complex behaviour. The synaptic modification is called learning; it is a dedicated synaptic process that is expressed externally as a change in behaviour and is caused by experiencing the external environment (world) (Shepherd 1994, Martinez & Kesner 1998). The modification is adaptive because it has some meaning for the behaviour and survival of the organism. We might call the capacity to keep the changes relatively stable memory. Memory also means the capacity to store and recall previous experiences, which is a more complex process. Memory and learning are interconnected: memory is necessary for learning, and it is the product of the learning process. The main principle of learning at a neuronal level was postulated by Hebb (1949): "When an axon of a cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased". This is to say that any two connected cells that are repeatedly active at the same time will tend to become associated, so that the activity of only one of the cells facilitates the activity in the other. Nevertheless, different types of neurons in the brain have different learning mechanisms. They differ in (a) the signals participating in learning, (b) the neurobiological levels at which learning takes place, and (c) the processes which realise the actual learning. With regard to this, the strength of a connection might show a short-lived increase or it might involve some long-lasting structural change in the synapse, as in long-term memory. Learning at a neuronal level is also expressed at more global levels. For example, what is referred to as habituation and sensitisation, the weakening and strengthening of a connection, is expressed both at a neuronal (synaptic) level and at a behavioural level, where it represents a decrease or enhancement of the animal's response after repeated presentation of a given stimulus (Shepherd 1994). At a higher level, other more complex types of learning are also distinguished. One of them is associative learning,


in which an animal makes a connection between a neutral stimulus and a second, rewarding or punishing, stimulus. A more specific type of such associative learning is classical conditioning, where a conditional stimulus is paired with an unconditional stimulus. In classical conditioning, the animal is a passive recipient. By contrast, an animal might be asked to learn a task or solve a problem, for which it is rewarded or punished. This is called operant conditioning; since the animal usually makes mistakes before learning the task, it is also called trial-and-error learning. As far as memory at a global level is concerned, there is a growing conviction that the hippocampus plays a critical role in learning and memory. The central assumption is that stimuli enter the neocortex via the sensory system and subsequently activate the hippocampus, which plays the role of a (global) Short-Term Memory, or episodic memory. The hippocampus in turn repeatedly provides feedback to the neocortex and gradually initiates activation patterns there, a process termed memory consolidation (Gluck & Myers 1998). In this way, evolution has solved one very important problem in learning: the capacity to learn new patterns without interfering with older memories, which Carpenter & Grossberg (1992) also call the stability-plasticity dilemma.

3.4 Artificial Neurons and Neural Networks

The structure of biological neurons, with many simplifications, is directly used for the development of artificial neurons, as explained in the following subsection. However, the interest here is focused on the structures that can be built with those neurons, since networks of neurons are the models that provide the computational power and reliability of connectionism.

3.4.1 Artificial Neurons

The artificial models of the neurons that are exploited here, and in most of the literature, make a number of simplifications. First, the train of neuronal spikes is replaced by a continuous activation. Since this loses some of the temporal characteristics of the signal, more recent models of artificial neurons, spiking neurons, represent the activation with spikes (Maass 1997). Another simplifying assumption is that the artificial synapses are able both to excite and to inhibit the connected neurons, which does not occur that often in the brain. Yet, as was recently found, the acetylcholine neurotransmitter can indeed produce both excitatory and inhibitory responses in the hippocampus.


Figure 3.3: A schematic diagram of a McCulloch & Pitts-like artificial neuron. The neuron is activated by a distributed input signal X that enters the neuron through weighted input lines. The accumulated signal, biased by a threshold (a special parameter usually represented by the weight of one of the input lines, the first one, with a constant input of one), is then passed through an activation function f(.) which squashes the signal into the range 0...1.

The first artificial models of the neurons and the brain were proposed by McCulloch and Pitts [1943] (Figure 3.3). Their artificial neuron is still the basis of today's artificial neurons. It has a vector of input lines (x_1, x_2, ..., x_N) with variable conductances (w_1, w_2, ..., w_N) (which stand for the variable synapses) that transmit the incoming signal into a summator S (formula 3.1). The accumulated signal is biased by a rest-potential value, or threshold θ. More complicated artificial neurons might also have internal memory, a leaking parameter, etc. Biological neurons produce an output impulse if the accumulated activation exceeds a certain value, which is followed by a reduction of the current activation. If after a short period (refraction) there is still enough signal, the neuron fires again, and so on, thus producing a train of impulses. As noted earlier, the length of this train of impulses may be interpreted as the strength of the output signal. To model this, the accumulated signal S in the artificial neurons is propagated through an output function which produces the output Y of the neuron (3.2). The earliest versions of the artificial neurons used the so-called hard-limited output function,


Figure 3.4: Neuron activation functions: a hard-limited function (left), a linear threshold (middle), and a sigmoidal function (right).

which outputs one if the signal S is greater than zero and otherwise produces zero (see Fig. 3.4, to the left). A more useful activation function is the sigmoid function, which nonlinearly squashes the input signal S into the range 0...1:

S=

3.4.2 Neural Structures

N X w x + i i

(3.1)

Y = f (S )

(3.2)
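As an informal illustration of formulas (3.1) and (3.2), the following Python fragment computes the output of a single artificial neuron with a logistic activation function; the weights, threshold and input values are arbitrary and chosen only for the example.

    import math

    def sigmoid(s):
        return 1.0 / (1.0 + math.exp(-s))

    def neuron_output(x, w, theta):
        """Formula (3.1): weighted sum plus threshold; formula (3.2): squashing."""
        s = sum(wi * xi for wi, xi in zip(w, x)) + theta
        return sigmoid(s)

    # Example: three input lines with arbitrary weights and threshold.
    x = [0.0, 1.0, 0.5]
    w = [0.8, -0.4, 1.2]
    theta = -0.1
    print(neuron_output(x, w, theta))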

3.4.2 Neural Structures

The artificial models of the neuronal system, the Artificial Neural Networks (NNs) (see Fig. 3.5), consist of artificial neurons organised into layers (similarly to the neuronal fields). The neurons in one layer perform the same type of computation, which realises the idea of distributed representations and computations. A neural network consists of a few layers, the neurons from two adjacent layers being connected in such a way that the neurons from the first layer send a signal to all neurons from the second layer. A convenient method to represent the set of connections from a layer L^A to a layer L^B is to use a weight matrix W^{L^B,L^A}, in which an element W^{L^B,L^A}_{i,j} from a row W^{L^B,L^A}_i represents the synapse connecting neuron L^A_j to neuron L^B_i. A neural network with a few successive layers and this sort of connectivity is called a feed-forward neural network. This type of NN normally processes static signals, which do not change in time.


In addition, the neurons within one layer can also be interconnected, which constitutes a recurrent layer. If the neurons from a layer later in the pathway of the signal provide activation to the neurons from earlier layers, the structure is called a recurrent neural network (RNN); NNs which have recurrent layers are also RNNs. The recurrence induces internal state memory into the NNs, represented as the activation of the neurons providing the past signal. This type of network is used when dynamic data are processed, such as in robot control, speech recognition, general signal processing, etc. Similarly to the sensory receptors in organisms, all NNs have an input layer where external signals are sensed and transmitted to intermediate (hidden) layers, which perform the actual NN task. In organisms some of the neurons stimulate effector cells (muscles, glands, etc.). Similarly, there is an output layer in the artificial neural networks, whose neurons are meant to control physical devices directly. In connectionism, the structure of the brain is further reflected in more complex artificial neural network structures consisting of various simple NNs, which perform different functions on the data. Those NN modules are connected in such a way that they perform complex tasks, similarly to large software packages.

3.4.3 Neural Networks Learning

One of the most important properties of Neural Networks is their capacity to learn from their environment, that is, to adapt their long-term memory according to some learning strategy. As already noted, there are many notions associated with the term "learning", but at any rate most of them define it in terms of adaptation aimed at improving the behaviour in the environment (Haykin 1994). More precisely, NN learning is a process of systematic adjustment of the NN's free parameters (synapses or weights) in order to improve the network's performance in some particular learning environment to acceptable levels. We see here that learning also involves: (1) an environment providing learning examples, and (2) a designated task. The learning algorithms in Artificial NNs fall into three main categories, depending on the information which the training environment provides. If the NN is given both an input signal and a desired output signal, this is called supervised learning. On the contrary, if the learning mechanism does not use a desired output signal, then the learning is unsupervised. Another class of learning mechanisms, which is close to supervised learning, is the so-called reinforcement learning, in which the environment provides only a


Figure 3.5: An example of an artificial neural network. The neurons (squares) are organised in layers. Each neuron processes one feature of the distributively represented data. The distributed signal enters the network at the input layer (here at the bottom), is subsequently processed by all layers and is emitted at the output layer (here at the top).

positive (rewarding) or negative (punishing) signal to every NN action. The NN can use this signal to encourage or discourage similar behaviour. The Hebbian learning mechanism belongs to the class of unsupervised learning methods. Section 3.7 provides in detail one particular supervised learning algorithm used to train Recurrent Neural Networks.

3.5 Data Representation

An important question concerning the NN implementation is the way external data is encoded and presented to the networks. NNs typically have one input layer and one output layer, also called interface layers. In a connectionist system with an interface layer N = (n_1, n_2, ..., n_{|N|}), there are two basic methods to represent the items from a set C = {c_i}_{i=1}^{|C|}.

    Symbol  f1  f2  f3  f4  f5
    a       1   0   0   0   0
    b       0   1   0   0   0
    c       0   0   1   0   0
    d       0   0   0   1   0
    i       0   0   0   0   1

Table 3.1: Orthogonal encoding of a set of symbols C_A = {a, b, c, d, i}. There are as many neurons as items in the set, each neuron representing one symbol.

The more apparent encoding method, orthogonal (localistic) encoding, is similar to the methods of encoding symbols in symbolic systems, with one neuron standing for one item. On the other hand, feature-based representations are truly distributed representations and they exhibit the full power of PDP. I will explain these in turn.

Localistic representations

To be more specific, a localistic representation is defined as follows:

- there are as many neurons as symbols (tokens): |N| = |C|, that is, for every token c_i there is a representative neuron n_i;
- a neuron n_i is active if and only if the corresponding token c_i is present; the activations of all other neurons are set to zero or another small value.

An exemplary orthogonal encoding of the tokens from the set C_A = {a, b, c, d, i} is given in Table 3.1, where each letter features only one active property f_1, f_2, ..., f_5. This makes it difficult for a system to generalise, and a failure of one neuron results in incorrect processing of the corresponding item.

Feature-based representations

A representation that is much more appropriate for the connectionist ideas is feature-based encoding. It is based on a set of features F = {f_1, f_2, ..., f_{|F|}}, each of them having some explicit semantics. Interface layers working with these representations have as many neurons as features. A given item c_i is


    Symbol  Vowel  Consonant  Bi-Labial  Dental  High
    a       1      0          0          0       0
    b       0      1          1          0       0
    c       0      1          0          0       0
    d       0      1          0          1       0
    i       1      0          0          0       1

Table 3.2: Feature-based encoding of a set of letters C_A = {a, b, c, d, i} (rows) with the features 'Vowel', 'Consonant', 'Bi-Labial', 'Dental', and 'High' (columns). Letters belonging to the same class share common values for the features determining this class, while having distinct values for other features.

represented as a vector of values for all features. Those values can be binary or continuous. Characteristic of this type of representation is that items belonging to the same class of the data share common feature values determining this class, while having distinct values for other features. For example, the same set of letters C_A encoded earlier with the orthogonal representation is encoded in Table 3.2 with the feature set {'Vowel', 'Consonant', 'Bi-Labial', 'Dental', 'High'}. In this representation, the vowels 'a' and 'i' share common vowel-specific values for the features 'Vowel' (active), and 'Consonant', 'Bi-Labial', and 'Dental' (inactive), but differ in the value of the feature 'High', which is active for 'i' and inactive for 'a'. Similarly, the consonants agree in features for shared properties and differ in some distinct features. The feature-based representations exploit all properties of the distributed representations. Firstly, we can encode many more tokens with a limited number of neurons. The theoretical number of distinct items that can be encoded with k binary features (those that can take two possible states, e.g., 0 and 1) is 2^k, that is, with 10 features we can represent 1024 items. However, in order to take advantage of some other useful properties such as generalisation, it is reasonable not to use all available combinations of features. Another advantage of this encoding is that once we have chosen the set of features and have started to exploit a model with this set, we can dynamically add other tokens. The network will generalise and work with a reasonable degree of precision even for the new, unseen tokens. Further,


just a short extra training would refine the network's performance for all tokens. Therefore, wherever possible, it is preferable to use this feature-based encoding. Still, there are some tasks, e.g., phonotactics learning, where orthogonal representations are preferable (see Chapter 4 for details).

Symbol Encoding Scheme

Since data encoding will play an important role in the rest of this work, I will define here more precisely the encoding of external items with a token encoding scheme. This definition will make use of feature-based representations, but note that the localistic encoding is simply a more specific feature-based representation with a dedicated feature for each item.

Definition 4 (Symbol encoding scheme E_F) Let the set of symbols {c_i} to be encoded constitute an alphabet Σ = {c_i}_{i=1}^{|Σ|}. Then, a symbol encoding scheme E_F of the symbols c_i ∈ Σ with a feature set F = {F_1, F_2, ..., F_{|F|}} is a look-up table T(Σ × F) which associates a vector of feature values with each symbol c_i ∈ Σ: C^i = (f^i_1, f^i_2, ..., f^i_{|F|}).

The item encoding will be used when external data is to be presented to the interface layers of the NNs. However, we will often also need the inverse process, namely decoding the output of the networks. In a pure connectionist system this would not be necessary, since the data there is represented only in a distributed way. However, for the purpose of testing connectionist models, and in hybrid systems where symbolic methods are used together with connectionist ones, there should be a mechanism that translates the distributed output into symbols. Such a decoding mechanism will employ the same look-up table T(Σ × F) that associates symbols c_i with their distributed representations C^i, but applied inversely: there should be a look-up mechanism that finds the vector C^j among all vectors {C^i}_{i=1}^{|Σ|} from this table which is closest to the NN output vector Y according to a norm ||.|| (as yet to be determined), and returns the symbol c_j which corresponds to this vector C^j. Another, connectionist, approach is to use a NN that translates distributed representations into localistic representations.
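The following Python sketch (not part of the thesis) illustrates such a look-up table and the inverse, nearest-vector decoding step, using the feature vectors of Table 3.2 and assuming the Euclidean distance as the norm ||.||.

    import math

    # Look-up table T(Sigma x F): feature vectors from Table 3.2
    # (features: Vowel, Consonant, Bi-Labial, Dental, High).
    T = {
        "a": (1, 0, 0, 0, 0),
        "b": (0, 1, 1, 0, 0),
        "c": (0, 1, 0, 0, 0),
        "d": (0, 1, 0, 1, 0),
        "i": (1, 0, 0, 0, 1),
    }

    def encode(symbol):
        return T[symbol]

    def decode(y):
        """Return the symbol whose feature vector is closest to the output y."""
        def dist(c):
            return math.sqrt(sum((yi - ti) ** 2 for yi, ti in zip(y, T[c])))
        return min(T, key=dist)

    print(encode("b"))                        # (0, 1, 1, 0, 0)
    print(decode((0.9, 0.1, 0.0, 0.1, 0.8)))  # 'i': the noisy output is decoded

The choice of norm is left open in the text; Euclidean distance is used here merely as one convenient option.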

Internal representations vs. External distributed representations

A more specific subset of distributed representations are the internal representations which the NNs develop in the course of their processing. I


draw attention to this group because later in the thesis, in Chapter 6, I will explicitly make use of those representations in order to recode input data. The internal representations the networks develop do not differ much from the feature-based representations in terms of the truly distributed way of representing data. They rather differ in terms of 'understandability' to humans. The internal representations are developed by the NNs; they have meaning specifically for the NN which has developed them and are almost meaningless at first glance. Yet, some numerical methods such as Principal Component Analysis (Haykin 1994) might extract some information about the representations. In contrast, external distributed representations have more explicit meaning because they are usually designed by the human designer of a connectionist system, or have some natural semantics which is known to us, too.

3.6 NN architectures

After presenting some background information about connectionist systems, in this section I will briefly review a few of the well-established neuronal architectures and will focus on those models which are adequate for language processing and modeling. The extensive Neural Network research of the last two decades has resulted in a number of neural network models (Haykin 1994), which have already found different practical applications. Given a specific type of basic neuron, the functionality of a neural network model depends on its architecture (the organisation of the neurons into layers and the connectivity among them), learning algorithms and control mechanisms. Those parameters alone provide a big space of possible models, to which finer specifications, such as data organisation, hierarchical architectures, etc., may be added. Since the purpose of this chapter is only to acquaint the reader with background information relevant to the remainder of the thesis, I will mention quickly just some of the existing NN models and present in detail the so-called Simple Recurrent Networks.

Supervised NNs

The NN models are divided into two basic classes according to their learning method: supervised and unsupervised. Typical models belonging to the first class are the Multilayer Perceptron (MLP) trained with the Error-Backpropagation learning algorithm (Rumelhart, Hinton & Williams 1986)


and the Radial Basis Function neural networks (Haykin 1994). These networks are feed-forward and static. One of the most famous NN models is the Multilayer Perceptron. It has drawn special attention because it was theoretically proven (Hornik, Stinchcombe & White 1989) to be capable of approximating any static function, provided there are enough neurons in the hidden layer, which has also been investigated in a number of experimental works (Lawrence, Giles & Fong 1995). The MLP has one input layer Inp, one output layer Out, and one or more hidden layers {Hidd_i} (see Fig. 3.5). It is feed-forward: the signal enters the input layer, is propagated to the hidden layer through the weight matrix W^{Hidd,Inp} that connects those two layers, and then goes to the output layer through the corresponding weight matrix W^{Out,Hidd}. The hidden layer(s) and the output layer are those whose neurons perform the real computations; the input layer only registers the input signal. The MLP operates in two regimes: (1) use and (2) training. In the first regime, the MLP simply propagates the input signal to the output layer, and the eventually connected devices make use of it. The training regime, on the other hand, involves two main steps: first, a forward step as in the normal regime of use. After that, the network's output is compared with the desired output activations and an error is computed, namely the difference between the two. This error signal is then propagated backward through the layers, down to the input layer. Therefore, the learning mechanism of the MLP is supervised: an error signal is used to adjust the neurons. One can find the concrete computations describing the BP algorithm in (Haykin 1994, Reed & Marks II 1999), but they are very similar to those in Section 3.7.

Supervised Recurrent NNs

Dynamic problems such as speech processing, robot control, etc., pushed connectionist investigations toward NN models capable of handling dynamic data. The first NN models capable of processing dynamic (sequential) data were still static, encoding limited dynamics by means of a window shifting over the sequential data. For example, the NETtalk model (Sejnowski & Rosenberg 1987), which used this approach with an MLP, was trained to produce phonetic representations of words. The contextual information required to map the current letter correctly was encoded within a shifting window of size seven: three letters to the left and three letters to the right of the letter to be pronounced. This is an example of the so-called Finite Impulse Response filter (FIR), where the system response to a given input is limited to a predefined number of steps.
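A small Python sketch (illustrative only; the padding symbol and example word are invented) of how such a shifting window of size seven can be built from a word:

    def windows(word, size=7, pad="_"):
        """Yield (window, centre letter) pairs: three letters of left and
        right context around each letter to be pronounced."""
        half = size // 2
        padded = pad * half + word + pad * half
        for i in range(len(word)):
            yield padded[i:i + size], word[i]

    for win, centre in windows("cat"):
        print(win, "->", centre)
    # ___cat_ -> c
    # __cat__ -> a
    # _cat___ -> t

Each window would then be encoded with a symbol encoding scheme and presented to the static network as one input pattern.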


The first real recurrent models were extensions of the Multilayer Perceptron with recurrent connections. They implement another type of dynamics, Infinite Impulse Response, where the input at a certain time influences the system response until the dynamics is externally reset. Several recurrent versions of the MLP were developed. In one of them (Jordan 1986), the network state at any point in time is a function of the input at the current time step, plus the state of the output units at the previous step. In another recurrent model, independently proposed by Elman (1988) and Robinson & Fallside (1988), the Simple Recurrent Networks (SRNs), the network's current state depends on the current input and its own internal state, which is represented by the activation of the hidden units at the previous moment (see Fig. 4.3). This internal state is considered as a context that provides information about the past. The latter model is computationally more powerful, since it has an internal state from which the output is computed, while the context state in the earlier model is directly the NN output. SRNs have been successfully employed for many linguistic and other tasks where the objects have a sequential nature (Reilly 1995, Wilson 1996, Cairns, Shillcock, Chater & Levy 1977, Stoianov et al. 1998). SRNs, the Jordan recurrent network, and other similar architectures belong to the more general class of first-order Discrete-Time Recurrent Neural Networks (DTRNNs) (Carrasco et al. 1999, Tsoi & Back 1997), which are very similar to SRNs, but with an output layer that may also receive signal directly from the input layer via another set of connections (matrix). In addition, there might be extra hidden layers between the hidden layer and the output, which slightly increases the computational power of this class of networks. DTRNNs for sequence processing can be defined as follows:

Definition 5 (Discrete-Time Recurrent Neural Network RNN_W)

A discrete-time recurrent neural network RNN_W is a layered neural network with: (1) an input interface layer Inp, representing one static pattern X at a time and containing |Inp| neurons, each neuron standing for a feature k_i of the input pattern; (2) an output interface layer Out, producing one static output pattern Y at a time and containing |Out| neurons, each standing for a feature f_i of the output pattern; (3) one or more hidden layers Hid_1, Hid_2, ..., each of them having H_1, H_2, ... neurons; (4) a network weight space (long-term memory) W representing the


connectivity in the RNN (for the more specific Simple Recurrent Networks, W = {W^{(Inp+Hid),Hid}, W^{Hid,Out}}); (5) an inherent internal dynamics represented as a network state Z, such as a global memory (e.g., a context layer Con in SRNs) and/or local internal memory (memory in the neurons). RNN_W processes sequences of externally presented patterns in the following way: (I) The internal network state Z is reset to its initial state Z_0 before a new sequence is presented to the network. (II) Presenting a pattern X to the input layer triggers one network processing step, which consists of: (a) propagating the incoming signal through the network weights W, from the input layer, through the hidden layers, to the output layer, resulting in an output pattern Y, and (b) updating the internal network state Z. DTRNNs can be trained to learn a set of input/output sequences with different algorithms, such as the Temporal Back-Propagation learning algorithm or the Back-Propagation Through Time (BPTT) algorithm (Haykin 1994). SRNs were initially trained by Elman with the standard Backpropagation (BP) learning algorithm, in which errors are computed and weights are updated at each time step. While biologically better motivated because of the temporally local weight adjustments, BP is not as effective as the BPTT learning algorithm, in which the error signal is propagated back through time and temporal dependencies are learned better. Since the SRN is the connectionist model that will be used throughout the rest of the thesis, section 3.7 provides a detailed technical description of the BPTT algorithm that is used for the network training. There is also an even more general and powerful class of discrete-time neural networks, second-order DTRNNs, which feature a two-dimensional neuronal input: each input connection is a weighted multiplication between a neuron from the input layer and a neuron from the context layer. Those networks, however, are biologically less well motivated due to this specific connectivity (although there are such types of connections in the brain). Currently, they do not have well-studied learning algorithms, either. However, such networks are very useful for compact representation of Finite State Automata in the connectionist paradigm (the next section discusses encoding of FSA in first-order DTRNNs), e.g., (Carrasco et al. 1999, Omlin & Giles 1996).


Unsupervised NNs and their problem with Recurrence

The other big class of NN models are the unsupervised neural networks. Two very typical examples of such models are the self-organising Kohonen maps (Kohonen 1984) and the Adaptive Resonance Theory (ART) (Carpenter & Grossberg 1992). Those two models offer two very different solutions for self-organising data into categories (clusterisation), but as far as recurrence is concerned, they lack an inherent capacity for processing dynamic data. The problem is that they organise the data in a sort of localistic way: individual neurons become tuned to specific patterns. In order to provide a general capacity for recurrence, a model should be able to encode any possible sequence, so that this information can be reused as contextual information for the following decisions. Supervised NNs find proper ways to encode such context in limited contextual layers due to the pressure of the error-driven algorithms. Unsupervised networks, however, do not have such a driving force, and what they can do is drive individual neurons to respond to specific input patterns. If those neurons are used to represent the context, then it will be represented in a localistic way. Then, in order to be able to represent sequences of |Σ| distinct elements with a maximal length of |L|, |Σ|^{|L|} neurons would be required. It is easy to see that with such an exponential dependency, even large neural networks would quickly run out of neurons. In contrast, a distributed context representation, such as in SRNs, allows an exponential number of patterns to be encoded into a context limited in size. Nonetheless, attempts to develop dynamic versions of unsupervised NN models have been made. For example, in order to learn bi-grams for the purpose of phonotactic learning, Cotteleer & Stoianov (1999) extended the ART model (Carpenter & Grossberg 1992) with a context layer. Yet, bi-grams are not enough to solve the phonotactics problem (see Chapter 4 for details), as is the case with many other sequential tasks. And, as just explained, since ART develops localistically represented categories, it is difficult to provide longer-term contextual memory. The static self-organising Kohonen Map neural network was also extended with recurrent connections, which made the network responses dependent on both the current input and the last neural map activations. Models following this idea are the Temporal Kohonen Map (TKM) by Chappel & Taylor (1993) and the Self-Organising Feature Map for Sequences (SARDNET) by James & Miikkulainen (1995), among others. The methods of encoding dynamics discussed so far employ a global memory approach, with dedicated neurons representing contextual information.


Another way to deal with dynamic data is to implement local dynamics in non-specialised neurons (local memory) instead. The latter type of architecture varies with regard to the place of this dynamics: in the weights, in the activation function, or both (Lawrence et al. 1995, Tsoi & Back 1994). Finally, I conclude this section by noting a common problem of recurrent networks: the input data they process has to be linear in the temporal dimension. These networks are able to recognise and classify the temporal sequences they have been trained on (SRNs and Jordan Networks) or have clustered during the self-organisation process (TKM and SARDNET), but they do not extract more complex temporal features or substructures explicitly. In addition, as the length of the sequences becomes greater, the performance worsens, which has been recognised by a number of authors. Bengio, Simard & Frasconi (1994) showed that learning long-distance dependencies is difficult even for very simple tasks (long strings of a few basic symbols). Miikkulainen & Dyer (1991) emphasised that the required network size, the number of training examples and the training time become intractable as the temporal complexity of the sequences grows. To cope with this problem, Stoianov (2000b) has suggested a method for dealing with sequential data with hierarchical structure by using a hierarchical system of a special NN model, Recurrent Autoassociative Networks, which will be presented in detail in Chapter 6.

3.7 Simple Recurrent Networks

This section presents in detail Simple Recurrent Networks (Elman 1988, Robinson & Fallside 1988), the recurrent connectionist model that will be used in the rest of this thesis. The section starts with a general presentation of the model, continues with a detailed description of its processing mechanisms and of the Back-Propagation Through Time learning algorithm that is used to train the model. Finally, a discussion of the computational capacity of the network is presented. Simple Recurrent Networks have the structure shown in Figure 4.3. They operate as follows: input sequences S^I are presented to the input layer, one element S^I(t) at a time. The purpose of the input layer is just to feed the hidden layer through a weight matrix; the hidden layer in turn copies its activations after every step to a context layer, which provides another input to the hidden layer: information about the past. Since the activation of the hidden layer depends both on its previous state (the context) and on the current input, SRNs have the theoretical capacity to be sensitive to the entire history


of the input sequence. However, practical limitations restrict the time span of the context information to, e.g., 10-15 steps. Finally, the hidden layer neurons output their signal, through the weight matrix connecting the hidden layer to the output layer, to the output layer neurons. The activation of the latter is interpreted as the product of the network. The network is trained with a supervised training algorithm, which implies two working regimes: a regime of training and a regime of network use. In the latter, the network is presented with the sequential input data S^I(t), computes the output N(t) using also the contextual information, and its reaction N(t) is used for the task at hand. The training regime also comprises a second, training step, which compares the network reaction N(t) to the desired one S^T(t) and uses the difference to adjust the network behaviour in a way that improves future network performance on the same data. The two most popular supervised learning algorithms used to train SRNs are the simple Back-Propagation algorithm (Rumelhart, Hinton & Williams 1986) and the Back-Propagation Through Time algorithm (Haykin 1994). While the former is simpler because it uses information from one previous time step only (the context activation, the current network activations, and the error), the latter trains the network faster, because it collects errors from all time steps during which the network processes the current sequence and therefore adjusts the weights more precisely. However, the BPTT learning algorithm is also cognitively less plausible, since the collection of the time-spanning information requires mechanisms specific to symbolic methods. Still, this compromise allows more extensive research, and without it the problems which will be discussed in the following sections would require much longer learning times. Therefore, in the experiments reported here the BPTT learning algorithm will be used. In short, it works in the following way: the network reaction to a given input sequence is compared to the desired target sequence at every time step and an error is computed. The network activation and error at each step are kept in a stack. When the whole sequence has been processed, the error is propagated back through space (the layers) and time, and weight-updating values are computed. The weights are modified when all time steps have been processed in this manner. As noted, this procedure results in faster training than the original simple backpropagation learning algorithm used by Elman (1988) when he introduced SRNs. The following subsection 3.7.1 describes the BPTT algorithm in detail. The reader is advised to read it, but skipping it will not cause problems with understanding the rest of the thesis. Nevertheless, it is


important to read subsection 3.7.2, on the computational power of SRNs.

3.7.1 The Back-Propagation Through Time Learning Algorithm

SRNs have two working regimes: (1) use of a trained network and (2) network training. The first one simply applies a forward pass, where the current input signal is propagated forward throughout the network and the current context layer activation is used. After each forward step, the hidden layer activation is copied to the context layer, to be used later. Network use is the same as the forward step in the BPTT, described in the following subsection. The BPTT learning algorithm itself is more complicated. It includes three main steps. Firstly, there is a forward pass for all tokens from the input sequence, during which all network activations are kept in a stack (3.3, 3.4, 3.5, and 3.6). Secondly, there is a backward pass through time until the beginning of the sequence, where errors are computed at the output layer and back-propagated through the network layers and through time. Those errors and the stored network activations are used to compute weight update values (3.7, 3.8, 3.9, and 3.10). Finally, the algorithm ends by updating the weights with the accumulated weight-updating values (3.11). The second step requires that at each time moment but the last, a future error be used, processed and back-propagated further through time. In the next subsections follows a detailed description of the forward pass and the BPTT algorithm. Before that, all notations are explained.

Notations

In the following description, |IL|, |HL|, |CL|, |OL| stand for the sizes of the input, hidden, context and output layers, correspondingly. The input signals provided to the i-th hidden layer neuron and the l-th output layer neuron are denoted net^H_i(t) and net^O_l(t), respectively. Next, in_j(t), cn_k(t), hn_i(t) and on_l(t) stand for the activations of the j-th input, k-th context, i-th hidden and l-th output neurons at time t. Finally, w^{HI}_{ij}, w^{HC}_{ik} and w^O_{li} are the weights of the connections between the j-th input neuron and the i-th hidden neuron, the k-th context neuron and the i-th hidden neuron, and the i-th hidden neuron and the l-th output neuron, respectively. For convenience, the bias for all layers is encoded as an extra input neuron (j = 0, i = 0) with constant activation 1. The activation function f(.) is sigmoidal: the logistic function f(x) = 1/(1 + e^{-x}) or the hyperbolic


tangent function. The training data consist of a set of pairs {(S^I, S^O)} – input sequences S^I and corresponding target sequences S^O. Each input sequence has the form S^I = <c^I_0 c^I_1 ... c^I_{|S|-1}>, and the corresponding target sequence is S^O = <c^O_0 c^O_1 ... c^O_{|S|-1}>. As will be explained in Chapter 4, if the network is trained on prediction, the input sequence is also used as the target sequence, starting from its second element and usually finishing with a special end-of-sequence pattern '#'. Next, if the elements c_i of the sequences are symbols from an alphabet Σ, they are encoded with an input/output symbol encoding scheme E_F before being presented to the network.

Forward Pass

Processing a new sequence begins with resetting the context layer, by setting all context neurons cn_{k=1...|CL|}(t = 0) to zero. The sequences are presented to the network one element c^I_t at a time. Each input token is encoded with a pre-specified input encoding scheme: in(t) = E^{Inp}(c^I_t). For each token the forward pass is applied. The forward pass starts with the activation of the hidden layer in accordance with (3.3) and (3.4):

net^H_i(t) = \sum_{j=0}^{|IL|} w^{HI}_{ij} in_j(t) + \sum_{k=1}^{|CL|} w^{HC}_{ik} cn_k(t)        (3.3)

hn_i(t) = f(net^H_i(t))        (3.4)

which is followed by a direct copying of the activation values of the hidden neurons to the context neurons. Next, the signal is propagated further to the output layer by activating the output layer neurons (3.5 and 3.6):

net^O_l(t) = \sum_{i=0}^{|HL|} w^O_{li} hn_i(t)        (3.5)

on_l(t) = f(net^O_l(t))        (3.6)
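As an illustration, a minimal NumPy sketch of this forward step is given below; the variable names mirror the notation above, and the function is only an assumed reading of equations (3.3)-(3.6), not code from the thesis.

import numpy as np

def logistic(x):
    # the logistic activation function f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def forward_step(x_in, context, W_HI, W_HC, W_O):
    # x_in    : input activations in_j(t), with in_0 = 1 serving as the bias unit
    # context : previous hidden activations cn_k(t) = hn_k(t-1)
    # W_HI    : |HL| x (|IL|+1) input-to-hidden weights  w^HI_ij
    # W_HC    : |HL| x |CL|     context-to-hidden weights w^HC_ik
    # W_O     : |OL| x (|HL|+1) hidden-to-output weights  w^O_li
    net_h = W_HI @ x_in + W_HC @ context            # eq. (3.3)
    hidden = logistic(net_h)                        # eq. (3.4)
    new_context = hidden.copy()                     # hidden layer copied to the context layer
    net_o = W_O @ np.concatenate(([1.0], hidden))   # eq. (3.5), with a bias unit prepended
    output = logistic(net_o)                        # eq. (3.6)
    return hidden, output, new_context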

Backward Through Time Pass

The second step of the BPTT learning algorithm for a given pair of training sequences (S^I, S^O) is propagating the error signal back through the network


and time. We suppose that the forward steps for each token c^I_t in the input sequence have already been made, keeping the activations and the target patterns in a stack. At this stage we also need to encode the target output tokens according to the output token encoding scheme: d_l(t) = E^{Out}(c^O_t), for l = 1...|OL|. Next, errors and weight-updating values are computed in reverse time order, that is, starting from the last token. Firstly, output neuron errors and deltas are computed with (3.7), in which the second term computes the neuron error and the first term computes the derivative of the activation function with respect to its input net^O_l(t). The neuron deltas δ^O_l(t) represent the output error transferred back through the activation function, that is, through the body of the neurons. Further, the updates Δw^O_{li}(τ) of the weights connecting the hidden layer to the output layer are computed with (3.8). The weight-updating rule mirrors the Hebbian learning law that synapses change according to the strength of the signal entering the synapse (here hn_i(t)) and the strength of the signal at the post-synaptic side (here the error signal δ^O_l(t)). We use τ to denote a global time index and Δw(τ) to stand for the accumulated Δw(t) over all items of the current sequence.

\delta^O_l(t) = f'(net^O_l(t)) (d_l(t) - on_l(t))        (3.7)

\Delta w^O_{li}(\tau) = \eta \sum_{t=1}^{|S|} \delta^O_l(t) hn_i(t)        (3.8)

Provided that the activation function f(x) is the logistic function, its derivative f'(x), expressed in terms of the output y = f(x), is f'(x) = y(1 - y). Next, the deltas and the updating values of the weights connecting the hidden layer to the input and the context layers are computed in accordance with (3.9) and (3.10):

\delta^H_i(t) = f'(net^H_i(t)) [ \sum_{l=1}^{|OL|} w^O_{li} \delta^O_l(t) + \sum_{k=1}^{|CL|} w^{HC}_{ki} \delta^H_k(t+1) ]        (3.9)

\Delta w^H_{ij}(\tau) = \eta \sum_{t=1}^{|S|} \delta^H_i(t) n_j(t)        (3.10)

where i = 1...|HL|, j = 0...(|IL| + |CL|) (the 0-th neuron represents the bias) and n(t) is a joined vector containing both in(t) and cn(t). In (3.9), in addition to the error that comes from the output neurons at the current


time t (the first sum), there is also error coming from the future step (t+1), represented by the second sum. More precisely, the latter represents the context-layer delta term δ^C_k(t), computed by back-propagating the future hidden-layer deltas δ^H_i(t+1) through the weights connecting the context neurons to the hidden neurons. Finally, all weights are updated according to (3.11) with the accumulated weight-updating values computed with (3.8) and (3.10).

w(\tau) = w(\tau - 1) + \Delta w(\tau)        (3.11)
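A sketch of the backward-through-time pass in the same NumPy notation is given below; it assumes that the per-step activations were stored during the forward pass, and it should be read only as one possible rendering of equations (3.7)-(3.11), not as the implementation used in the thesis.

import numpy as np

def bptt_backward(stack, W_HI, W_HC, W_O, eta=0.1):
    # stack : list of (x_in, context, hidden, output, target) tuples, one per time step,
    #         saved during the forward pass; x_in includes the bias unit in_0 = 1
    dW_HI = np.zeros_like(W_HI)
    dW_HC = np.zeros_like(W_HC)
    dW_O = np.zeros_like(W_O)
    future_delta_h = np.zeros(W_HC.shape[0])        # delta^H(t+1); zero beyond the last step

    for x_in, context, hidden, output, target in reversed(stack):
        # eq. (3.7): output deltas, using f'(x) = y(1 - y) for the logistic function
        delta_o = output * (1.0 - output) * (target - output)
        # eq. (3.8): accumulate hidden-to-output updates (bias unit prepended to hn(t))
        dW_O += eta * np.outer(delta_o, np.concatenate(([1.0], hidden)))
        # eq. (3.9): hidden deltas from the current output error and the future context error
        delta_h = hidden * (1.0 - hidden) * (W_O[:, 1:].T @ delta_o + W_HC.T @ future_delta_h)
        # eq. (3.10): accumulate input- and context-to-hidden updates
        dW_HI += eta * np.outer(delta_h, x_in)
        dW_HC += eta * np.outer(delta_h, context)
        future_delta_h = delta_h

    # eq. (3.11): update the weights with the accumulated values
    W_HI += dW_HI
    W_HC += dW_HC
    W_O += dW_O
    return W_HI, W_HC, W_O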

Back-propagation learning algorithms are known to run the risk of getting stuck in local minima on the error surface. There are a number of techniques designed to overcome this problem. The most useful is to apply a momentum term to (3.11), as is done in (3.12). The momentum term keeps the movement over the weight error space going for some time, even if the network has fallen into a local minimum. Usually, α = 0.7.

\Delta w'(\tau) = \alpha \Delta w(\tau - 1) + (1 - \alpha) \Delta w(\tau)        (3.12)

Another technique with a similar effect is to apply a higher learning coefficient η initially and then to decrease it gradually. This implements a quick, rough initial search for the region where the global minimum is located; later, the exact location of the error minimum is sought with smaller steps. Usually, the initial η = 0.2 and the decrease might be exponential with a factor very close to one (e.g. 0.9995). For further reading about SRNs, BP, BPTT and other recurrent learning algorithms, one can refer to (Haykin 1994, Reed & Marks II 1999), among others.
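The two heuristics can be sketched as follows; the constants are the illustrative values mentioned in the text, not tuned settings.

alpha = 0.7      # momentum coefficient
eta = 0.2        # initial learning coefficient
decay = 0.9995   # multiplicative decay factor

def smoothed_update(previous_update, accumulated_update, alpha=0.7):
    # eq. (3.12): momentum keeps part of the previous movement over the error surface
    return alpha * previous_update + (1.0 - alpha) * accumulated_update

# after each epoch the learning coefficient is shrunk slightly:
eta *= decay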

3.7.2 Computational Power of Simple Recurrent Networks

In this section I will briefly present a few studies of the computational capacities of SRNs, which basically show that this connectionist model is approximately equivalent to deterministic Finite State Automata, while also featuring the useful properties of connectionism, such as noise resistance and generalisation. In addition, experimental work shows that SRNs can also learn limited-depth context-free languages.


Experimental works

SRNs were first experimentally demonstrated to be able to learn to mimic deterministic FSA closely, both in terms of behaviour and of state representations. In all training procedures, the networks were trained on examples generated by the FSA. In particular, Cleeremans, Servan-Schreiber & McClelland (1989) and Ghosh & Karamcheti (1992) succeeded in training SRNs to learn the so-called Reber grammar – representing a regular language with four intermediate states, three closed loops, and up to two allowed successors at each state – but they had difficulties in learning the Embedded Reber grammar, a more complex version of the former. Later, Manolios & Fanelli (1994) demonstrated the capacity of an extended SRN – with extra recurrent connections at the output layer – to learn several Tomita regular grammars, e.g., (10)*. In those works, the authors also used cluster analysis to determine the internal representations of their networks, and they showed that the neural states tend to group into clusters that roughly correspond to the states of the deterministic FSA being approximated. However, the equivalence shown was very limited, since the performance of SRNs degraded significantly as the string length increased – a problem extensively studied later by Bengio et al. (1994). Learning languages generated by context-free grammars was shown by Elman (1988, 1991) in his attempt to explain how complex natural languages are accommodated by fixed-resource connectionist systems and, correspondingly, what the nature is of the language representations developed during training in the context state of the SRNs. The language he explored was formed from a lexicon of 23 items and was generated by a small context-free grammar with several rules forming sentences with relative clauses, number agreement, and various direct-object requirements. The training set consisted of 10,000 sentences, each containing up to 16 words. After training, Elman analysed the hidden layer activations and found that, without any prior linguistic information, the network had learned to predict the category of the words that may follow the left context presented to the input so far. That is, the network discovered the grammar underlying the surface forms experienced during training. Looking for an optimal language learning architecture, Lawrence, Fong & Giles (1996) used a variety of learning models, including SRNs, and trained them to classify 552 positive and negative examples of English sentences as grammatical or ungrammatical. In contrast to Elman's experiments, the category of the words was here explicitly represented. An important outcome of this experiment was that SRNs were found to perform best:


they learned to classify correctly the whole training set and scored 64% / 74% on the testing set.

Theoretical works on representational capacity

Later studies, such as (Kremer 1995, Alquezar & Sanfeliu 1995), showed theoretically that SRNs can encode any deterministic FSA. Those studies view SRNs, and more generally Discrete-Time Neural Networks (DTNNs), as neural state machines: the possible activations of the hidden units (and the context ones) are discretized into discrete regions (states), and each time a new input symbol (pattern) is applied, a new state of the hidden units is computed from the previous state and the current input symbol. In particular, Kremer (1995) showed how to represent FSA in SRNs with a threshold activation function, while Alquezar & Sanfeliu (1995) showed that this can also be done with a sigmoid activation function, provided that rational numbers are returned. Those proofs are constructive, in that they show how to construct an SRN which simulates a given FSA. However, in order to be able to encode any FSA, those works rely on converting a deterministic FSA into a new FSA in which every original state is duplicated as many times as there are symbols in the input alphabet of the original FSA. In the new automaton, each new state represents (a) the corresponding state of the original automaton and (b) the input transition due to the corresponding past input symbol. Other works regard SRNs and their computational capacity as a subclass of the more general first-order Discrete-Time Recurrent Neural Networks (DTRNNs). The latter models have a more general set of connections than SRNs do, for example connections from the input layer to the output layer. Omlin & Giles (1996) and Carrasco et al. (1999) applied the approach outlined above in order to prove that first-order DTRNNs, and in particular SRNs with sigmoidal units and continuous activation, can encode any deterministic FSA, and in particular finite-state machines that act as transducers T(L_Inp, L_Out) (Mealy or Moore machines), translating strings of an input language L_Inp into strings of equal length over an output alphabet L_Out.

Theoretical works on learnability

Proofs about the theoretical representational capacity still do not guarantee that a learning process will end up with an optimal neural network. For example, error-driven learning algorithms might not be able to converge


due to the recurrence, or the error surface might be too complex and cause the network to fall into local minima. In that respect, theoretical works on the learnability of RNNs are very important. In addition, when DTRNNs are trained to mimic FSA, there is a problem related to the stability of the contextual memory viewed as a state of an FSA: the recurrent connections tend to drive the activations of the context neurons away from the states representing FSA states, so that the trained network would no longer behave like the target FSA. Kuan, Hornik & White (1994) address the learnability problem, devoting a study to the convergence of the back-propagation learning algorithm training recurrent neural networks with hidden-layer recurrence (SRNs) or output-layer recurrence (Jordan RNNs). The learning task there is approximating E(Y_t | <X_0 ... X_t>) – the conditional expectation of Y_t given the history of input data <X_0 ... X_t> – by a parametric function (a recurrent neural network) RNN(<X_0 ... X_t>; W), as the parameters W range over a parameter space. The training data are sequences {Z_t} of random vectors Z_t = (X_t, Y_t), where Y_t is a scalar and X_t is a vector. The study proves a theorem about the convergence of the learning process to a weight set W* that minimises the network approximation error, given a few important assumptions, which can be summarised approximately as follows:
A1. The training data – sequences {Z_t} of random vectors Z_t = (X_t, Y_t) – are generated by a stochastic process which is "near epoch dependent" on a bounded underlying mixing process and which has limited memory (intuition says that Finite State Automata belong to this category of processes, which was also confirmed by one of the authors of this paper).
A2. The error function is continuously differentiable up to second order with respect to the parameter space W, the input sequence (a weak assumption), and the state variables.
A3. The recurrent process is a contraction mapping, in order to avoid chaotic behaviour.
A4. The training algorithm is such that (a) it includes recursion on the parameters (weights) of the recurrent neurons (as the BPTT learning algorithm does), (b) the weight set is kept bounded, and (c) the learning coefficient η_t is such that Σ_{t=0..∞} η_t² < ∞ and Σ_{t=0..∞} η_t = ∞.
A5. There is a limit to which the learning process can converge (the studies above on the representational power of SRNs guarantee that such a limit exists).
If those assumptions hold, then a theorem proven earlier applies, which roughly states that the gradient learning process converges to a weight set W* that minimises the network approximation error. For the particular case of training Elman's SRNs on data generated by FSA, assumption A2 holds and A3 is replaced by assumption B3, which requires that the recurrent weight vectors are bounded. Using the notation presented earlier in this section, assumption B3 looks

like this:

( \sum_{i=1}^{|CL|} \sum_{k=1}^{|CL|} (w^{HC}_{ik})^2 )^{1/2} \le 4(1 - \epsilon),  for some \epsilon > 0.

To guarantee this, the weight-updating rule of the learning algorithm (3.11) is extended with a restriction mapping that enforces assumption B3. Arai & Nakano (2000) consider the stability of an SRN trained to simulate an FSA. The problem is that even if the learning process converges to an optimal solution – a DTRNN that learns the training set of sequences perfectly – this does not guarantee that the DTRNN will mimic the FSA inferable from the training sequences, because, as I noted above, the continuously activated neurons may drift away from the regions of states (orbits) that represent FSA states. The works cited above on the representational power of SRNs to encode any FSA deal with this by setting the weights to proper values. Arai & Nakano (2000) solve the problem by (a) making the sigmoidal nodes operate almost as threshold units, which keeps the activation of the recurrent nodes near the corners of a hypercube (ones and zeros), thus stabilising the context memory, and (b) introducing a prior into the BPTT learning algorithm by adding an internal-representation term to the optimisation function of the learning algorithm. Thus, the former study deals with the learnability of training sequences, and the latter provides a method to extend the learning in a way that produces SRNs which stably mimic FSA, that is, it guarantees generalisation: good performance for all sequences generated by the corresponding FSA. Nevertheless, the work on learnability with the BP algorithm still does not provide a solution to the local-minima problem, which is normally addressed with heuristic approaches.
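Returning to assumption B3, a possible way of enforcing such a bound on the recurrent weights after every update is sketched below; the rescaling is only an illustrative choice of restriction mapping, not the one used by Kuan et al.

import numpy as np

def restrict_recurrent_weights(W_HC, epsilon=0.05):
    # keep the Euclidean norm of the context-to-hidden weights below 4*(1 - epsilon),
    # as assumption B3 requires; rescale the whole matrix if the bound is exceeded
    bound = 4.0 * (1.0 - epsilon)
    norm = np.linalg.norm(W_HC)
    if norm > bound:
        W_HC *= bound / norm
    return W_HC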


Part II

SEQUENTIAL LEXICAL MODELLING


Chapter 4

CONNECTIONIST LEARNING OF LEXICAL PHONOTACTICS

The language phenomenon of phonotactics was presented earlier in Chapter 2. Phonotactic rules constrain the possible combinations of the phonemes of a given natural language when they form larger linguistic units (syllables, words) in that language. In natural languages those rules are implicit, however, and human learners require no explicit grammar to tell them which combinations are allowed and which are not. Instead, a grammar describing the words of a given language at a particular moment must be abstracted from the words in that language – the words determine those rules, not vice versa – and the ever-evolving natural languages may change those constraints in time. The speakers of a given language normally learn its phonotactics during their early language development and probably update it only slightly as the language evolves. In that respect, any device aimed at representing phonotactic constraints should be adaptive. Further, Kaplan & Kay (1994) showed that phonotactics can be described with regular languages, which in turn are recognised by FSA. Therefore, learning devices aimed at studying phonotactics should have the power to learn regular languages. Learning lexical grammars is not an easy problem, especially when cognitively plausible models such as neural networks are involved. Abstracting symbolic knowledge in models employing low-level representations involves discretization of the continuous space (Carrasco et al. 1999), possibly localistic data representations, generalisation (Arai & Nakano 2000), and


experiencing large amounts of data, which is not always intuitive. That is why the history of research on connectionist learning of language grammars shows both successes and failures in modeling even relatively simple language structures, such as the phonotactics of natural languages (Stoianov et al. 1998, Stoianov & Nerbonne 2000b, Tjong Kim Sang 1995, Tjong Kim Sang & Nerbonne 1999, Pacton, Perruchet, Fayol & Cleeremans in press). As for higher language levels, such as syntax, even proponents of symbolic methods claim that it is almost impossible to learn the syntax of languages (which is perhaps describable by context-free languages) just from exposure to the language, which has motivated theories about the existence of genetically encoded background linguistic knowledge (Chomsky 1986). Still, Elman (1991) and Lawrence et al. (1996) learned limited syntax with SRNs. This chapter will attack the problem of learning language phonotactics with connectionist models that have no linguistic knowledge encoded in advance. For that purpose, a more specific first-order Discrete-Time Recurrent Neural Network model will be used – the Simple Recurrent Network (Elman 1988), which was introduced in section 3.7. SRNs were chosen since they have been found capable of representing regular languages (Omlin & Giles 1996, Carrasco et al. 1999). In addition, as I explained in the previous chapter, there is theoretical research concerning the learnability of regular languages by SRNs (Kuan et al. 1994, Arai & Nakano 2000). In that respect, in section 4.2 the task of phonotactic learning will be viewed as a subtask of the more general case of learning regular languages represented by a finite set of training examples. In spite of the capacities claimed above, the BP/BPTT learning algorithms do not always find an optimal solution of the learning task – an SRN that produces only correct context-dependent successors or recognises only strings from the training language. Hence, section 4.3 focuses on formalising the evaluation of network learning from different perspectives – grammar learning, phonotactics learning, and language recognition. The last two methods need one language-specific parameter – a threshold – that distinguishes allowed successors/words in the training language. This threshold is found with a post-training procedure, but it can also be sought interactively during training. The next section, 4.4, provides experimental work on lexical learning, in which the above evaluation strategies are examined. Finally, the network is assessed from linguistic and psycholinguistic perspectives: a static analysis extracts learned linguistic knowledge from the network weights, and the network performance is matched against that of humans in a lexical decision task. The network performance in time


will be used to draw a conclusion about the Dutch syllabic structure – something known from earlier psycholinguistic experiments about English syllables (Kessler & Treiman 1997).

4.1 Motivations

The demand for developing a phonotactic device can be justified from both cognitive and practical points of view. In Chapter 2, I presented three lists containing: English words (2.1); sequences which do not sound English at all (2.2); and sequences which sound like well-formed English words but have no meaning and therefore are not English words (2.3). This simple example shows that we make use of implicit (phonotactic) rules that we are not aware of, but which tell us which phonemic combinations sound correct and which do not. Similarly, second-language learners experience a period in which they recognise that certain phonemic combinations (words) belong to the language they are learning without knowing the meaning of these words. Therefore, we might conclude that a neurobiological device dealing with phonotactics (lexical grammar) exists, and that it is worth modeling it as a subtask of modeling our language processing mechanisms and intelligence in general. Besides cognitive modeling, there are also a number of practical problems that would benefit from knowledge of phonotactics. In speech recognition, for example, a number of hypotheses that explain the speech signal are created, and the impossible sound combinations have to be filtered out before further processing. This is a lexical decision task, in which a model is trained on a language L and then tests whether a given string belongs to L. Here a phonotactic device would be of help. Another important problem in speech recognition is word segmentation. Speech is continuous, but we divide it into psychologically significant units such as words and syllables. There are a number of cues we can use to distinguish these elements – prosodic markers, context, but also phonotactics. Similarly to the former problem, an intuitive strategy here is to split the phonetic/phonemic stream at the points where phonotactic constraints are violated. See McQueen (1998) for psycholinguistic insights on this problem and Shillcock, Cairns, Chater & Levy (1997) and Cairns et al. (1977) for connectionist modeling. Similarly, the constraints on the letters forming words in written languages (graphotactics) are useful in word-processing applications, for example spell-checking. There is another interesting aspect to phonotactics modeling. Searching for an explanation of the structure of the natural languages, Carstairs-


McCarthy presented in his recent book (1999) an analogy between syllable structure and sentence structure. He argues there that sentences and syllables have a similar type of structure. Therefore, if we find a proper connectionist mechanism for learning syllabic structure, we might apply a similar mechanism to learning syntax, too. Of course, syntax is much more complex and more challenging, but the basic principles of both devices might be the same. Other sources of justification for a neurobiological device that masters phonotactics come from neurolinguistics and neuroimaging studies. It is widely accepted that the neuronal structure of the so-called Broca area (located in the left frontal lobe) is used for language processing, and more specifically that it implements a general sequential device (Stowe et al. 1994, Reilly 2000, in press). Such a device might serve both recognition and generation purposes, and there might be substructures which process language units at different levels, including the phonemic level. Therefore, Broca's area is a plausible locus for a neuronal phonotactic device. The classical approach to grammar modeling is to use symbolic methods and artificial grammars – methods of Artificial Intelligence designed especially for dealing with symbols. However, symbolism has a very important problem: symbolic models deal only with abstract symbols, which are removed from the physical environment we normally experience and perceive. Incorporated in a larger cognitive system, the symbolic structures lack "real" semantics; they need an additional interface module to connect them to the world. This is also known as the symbol grounding problem (Harnad 1990). An alternative to this approach is connectionism, inspired by the structure of the brain. Some of the main features and models of connectionism were outlined in Chapter 3, and one particular model – Simple Recurrent Networks – was presented in detail in section 3.7. The SRN is a connectionist model that is very suitable for language modeling. It works with distributed data representations and contains internal memory which allows sequential processing. It will be used as the basic model in this work.

4.2 Formalisation of Connectionist Phonotactics Learning

In this section I will formalise the task of phonotactics learning with neural networks, for which I will represent in a formal framework the more general case of learning sequences with general recurrent neural networks. To begin,


I will specify more precisely the subject of learning. One can study phonotactics at both the syllabic level and the word level. However, while word phonotactics includes syllable phonotactics, it also adds the complexity of syllable combinations and morphological problems, which I will not consider here. The focus of this work is rather computational; therefore the experimental subject of study in this chapter will be restricted to the phonotactics of syllables, and more precisely the phonotactics of monosyllabic words. On the other hand, phonotactics is a specific version of the more general sequential prediction problem. Therefore, when formalising the process of phonotactics learning, I will talk of sequences instead of syllables or words. The term "phonotactic constraints" PC_L of a set of sequences (language) L is comparable to the grammar P_L describing those sequences and therefore will also be used to describe the same phenomenon. As I noted earlier, Kaplan & Kay (1994) claim that the specific subclass of grammars that can describe phonotactic constraints is the class of regular languages, also known as finite-state languages (see section 2.1.3 for details). An equivalent framework that might be used for representing such knowledge is that of the deterministic Finite State Automaton (FSA). Representing phonotactics with neural networks is different from representations in grammars or in FSAs. There are no explicit rules or discrete states of activation in NNs, but rather continuously varying activations of many neurons which jointly transform the incoming distributed signal into other distributed representations. Yet, the operations the NNs perform could be interpreted as rules in some specific cases. More specifically, for recurrent neural networks there is contextual information that represents the state of the neural network and that is updated after each processing step according to mechanisms specific to each dynamic connectionist model. In this way, recurrent neural networks keep track of the history of the input and can learn left-context-dependent tasks. The grammar P_L can be developed manually by studying the language L; the phonotactics PC_L, too. Once such a (regular) grammar is developed, it can also be encoded in an SRN, as Alquezar & Sanfeliu (1995) and Kremer (1995) have shown. However, learning PC_L / P_L from scratch is more challenging and more appropriate from a cognitive point of view: the model should be adaptive. Children, too, learn languages and their words only if they are exposed to a language environment, which shows that language learning is cognitively feasible. Learning grammars has a long history in Artificial Intelligence. Some well-known machine learning techniques are Decision Tree Learning, Instance-


based Learning, and Inductive Logic Programming (Mitchell 1997). They all start with a most general or a most specific model and, by observing incoming training examples, try to generate a model that covers the training set only. However, they all lack neural plausibility. Applying neural networks to this problem is similar. Once a NN model is chosen, its initially randomly set memory (the weights of its neurons) is adapted in such a way that the NN, first, can recognise the strings from the training set and, second, can produce valid words. For that purpose, the network is trained on a data set L1 ⊂ L containing a representative subset of sequences from the training language L. During the training process the network progressively learns which phonemes/characters {c_i} might follow a given preceding substring (left context) s in L. After training, the network should be able to recognise whether input strings are words from the training language or not. The same network could be used for word generation, similarly to the way artificial grammars are used to produce words from a given language. But importantly for the phonotactic learning task, the neural network will produce at every step the set of allowed successors. The rest of this section provides a formalisation of connectionist phonotactics learning. The purpose of the formalisation is to state and keep track of the assumptions and restrictions made. First, I will define a more general neural sequential device – a neural transducer – that transforms input sequences into output sequences according to some learned transformations. As already noted, SRNs have been shown capable of representing any FSA (Omlin & Giles 1996, Carrasco et al. 1999). There are also works on the learnability of regular languages by SRNs (Kuan et al. 1994, Arai & Nakano 2000). In that respect, if the neural transducer is implemented with SRNs, and if the symbols are distributively encoded with features, then it can be trained to produce at each step the mean feature distribution of the trained target patterns. Then, a more specific neural transducer will be presented – a neural predictor – whose training output sequences are simply the input sequences, shifted one symbol back. This model learns to predict the mean feature distribution of the successors to any initial substring in the training dataset. When applied to the more specific localistic encoding, this leads directly to predicting the distribution of the successors themselves. This particular version of a neural predictor will be used for phonotactics learning.

4.2.1 Neural Transducer and Neural Predictor

This subsection defines two discrete-time recurrent connectionist devices – a Neural Transducer, which represents a general sequential associative


device, and a Neural Predictor, a subclass of the Neural Transducer trained to output the input sequence shifted one step back. These models can be implemented with any discrete-time recurrent neural network, but I will focus on one particular subclass of discrete-time recurrent neural networks – Simple Recurrent Networks (Elman 1988), described in detail in Chapter 3. Neural transducers will be defined as DTRNNs specialised to associate input sequences of symbols sequentially with output sequences of symbols. Each symbol c will be encoded with a certain symbol encoding scheme into a distributed pattern C (as defined earlier in section 3.5).

Definition 6 (Sequential Neural Transducer NT_W^{L_I,L_O})

Let there be an input language L_I = {w^I_i}, an output language L_O = {w^O_i}, and a transducer T^{L_I,L_O} presented with a finite set of training examples T = {(w^I_i, w^O_i)}_{i=1}^{N}. Let \hat{F}^O(s_l) be the mean output pattern in T for each initial substring s_l derived from {w^I_i}_{i=1}^{N}. Then, a neural transducer NT_W^{L_I,L_O} is a discrete-time recurrent neural network RNN_W that sequentially transforms sequences of the input language L_I into sequences of the output language L_O, according to the transformation T^{L_I,L_O}. More specifically, NT_W^{L_I,L_O}:
(1) sequentially receives sequences w^I_i ∈ L_I as input, one symbol c^I_{it} at a time, encoded as vector-patterns C^I_{it} according to the input encoding scheme E_{Inp}, and at each time step t produces an output pattern F_t^{NT_W};
(2) is trained with a supervised error-driven learning algorithm on the training transformation patterns (w^I_i, w^O_i) ∈ T in such a way that at every training step t, having processed the input sub-sequence (left context) w^I_{j,1...t} = <c^I_{j1}, c^I_{j2} ... c^I_{jt}>, it is driven to produce the target output pattern F_t^{NT_W} = C^O_{jt}, which is the encoding of the corresponding output symbol c^O_{jt} under the output encoding scheme E_{Out};
(3) after sufficient training, resulting in a network weight-state W*, for every left context s_l ∈ L_I, NT_{W*}^{L_I,L_O}(s_l) produces an output feature vector F_{W*}^{NT^{L_I,L_O}}(s_l) that differs from the expected mean feature vector \hat{F}^O(s_l) with error e_{W*,s_l} = || F_{W*}^{NT^{L_I,L_O}}(s_l) - \hat{F}^O(s_l) ||_{L_2}.

A neural device that represents phonotactics – a neural predictor – might be viewed as a particular case of a neural transducer in which the input and the output alphabets are the same, possibly encoded with different encoding schemes, and in which every training output sequence is simply the input


sequence, but shifted one step back. Since the left substrings s_l of the words w of the training language L are usually followed by more than one successor (which is also possible in the training data of a neural transducer), the neural predictor will learn the distribution of those successors. This distribution can be explicated as follows:

Definition 7 (Distribution of successors P^L_{s_l})

Let, in a language L with a symbol encoding scheme E_F: (1) each initial substring w_{j,1...t} = <c_{j1} c_{j2} ... c_{jt}> of w_j be followed by a successor symbol c_{j,t+1} in w_j (unless t = |w_j|), and (2) a set of unique initial substrings (left contexts) {s_l}_l be extracted from the set of all initial substrings {w_{j,t}}_{j,t}. Then, a distribution of successors P^L_{s_l} is the context-dependent empirical distribution P^L_{s_l}(C) = (p(c_1), p(c_2) ... p(c_{|Σ|})) over all possible successor symbols {c_i}_{i=1}^{|Σ|}, which can be translated into a mean feature vector \hat{F}^L(s_l) = (\hat{f}_1, \hat{f}_2, ... \hat{f}_{|F|}) according to the symbol encoding scheme E_F.

The following definition specifies Neural Predictors as a specific subclass of neural transducers, trained on transformations in which the output sequences are derived from the input sequences by shifting them one symbol back.

Definition 8 (Neural Predictor NP_W^L)

Let there be a language L and a transducer T^{L,L_O}, represented with a finite set of examples T = {(w_i, w^O_i)}_{i=1}^{N}, such that each output word w^O_i is the input word w_i, concatenated with a special end-of-word symbol '#' and shifted one step back, thus cutting the first symbol w_{i0}. Then, a neural predictor NP_W^L of the language L is a neural transducer NT_W^{L,L_O} trained on the sequential associations from T, resulting in a neural predictor with a weight state W*, which in turn produces output patterns F_{W*}^{NP^L}(s_l) in response to input left contexts s_l derived from L; these patterns differ from the mean feature vectors \hat{F}^L(s_l) (corresponding to the empirical distribution P^L_{s_l}(C) of successors to s_l in L) with error e_{W*,s_l} = || F_{W*}^{NP^L}(s_l) - \hat{F}^L(s_l) ||_{L_2}.

Note that this definition regards the neural predictor as a recurrent neural network that is trained to produce the encoding of the successor c_{l+1} to each left context s_l = <c_1 ... c_l> in the current training sequence w = <c_1 ... c_l c_{l+1} ... c_{|w|}>, but that after training produces the mean encoding \hat{F}^L(s_l) of all possible successors of a given processed context s_l.



Figure 4.1: Neural Predictor. The current input symbol c_t is encoded with an input encoding scheme into the input feature space. This representation is processed by the neural predictor, and the output – a vector of feature values – is then decoded with an output decoding scheme into a vector of likelihoods that each symbol is a successor to the current context/input.

A schematic representation of the neural predictor is given in Figure 4.1. The input symbols are encoded one after another and processed by the network, whose output also depends on the internally encoded history. Then, the produced output representation is decoded into a vector of likelihoods that each symbol can follow the current input/context. In the case of an orthogonal encoding of the symbols at the output layer, the output neurons directly contain the network's expectations that each symbol c_i ∈ Σ would follow the current left context. There is a variety of recurrent neural architectures with different feed-


forward architectures and different types of internal dynamics. Depending on their structure and dynamics, they also have different computational capacities. For example, as noted earlier, Omlin & Giles (1996) and Carrasco et al. (1999) showed that general discrete-time Recurrent Neural Networks (DTRNNs) with sigmoidal activation functions can be constructed in such a way that they recognise regular languages or act as transducers, similarly to deterministic Finite State Automata. However, these works did not claim that the networks constructed can also be learned. I am not aware of a work that directly proves the learnability of regular languages by SRNs with the Back-Propagation learning algorithm. Yet, two works which closely concern this subject were presented in section 3.7. First, Kuan et al. (1994) provided a proof of a theorem about the convergence properties of back-propagation learning algorithms training SRNs on sequential data generated by a bounded stochastic process. In particular, the theorem states that such a learning process converges, with probability 1, to a network weight set W* that minimises the network approximation error. The theorem concerns scalar network output, but this can easily be generalised to vector network output. FSA may be considered such bounded generators of stochastic data, in contrast to state automata with unlimited memory generating context-free languages. Second, Arai & Nakano (2000) proposed an extension of the BP learning algorithm which guarantees that if it converges the weights of the SRN to a good approximating solution, then the resulting network behaves as a stable FSA for any input sequence from the input language. This work is based on (a) making sigmoidal nodes operate almost as threshold units, which keeps the activation of the recurrent nodes near the corners of a hypercube (ones and zeros), thus stabilising the context memory, and (b) introducing a prior into the BP learning algorithm that makes it produce such a stable network. The combination of those two works actually provides a solution to, first, the learnability of finite-state transducers represented by a finite set of examples and, second, the stability of emulating the finite-state transducer. Therefore, I will conjecture the following:

Conjecture 1 (For the learnability of a Neural Transducer)

For every finite-state transducer T(L_Inp, L_Out) represented with a finite set of examples T, training an SRN-based Neural Transducer NT_W on the input/output sequences (w_Inp, w_Out) ∈ T with the BPTT learning algorithm results in an approximately correct neural transducer NT_{W*}, in the sense that for every left context s_l ∈ L_Inp, NT_{W*} produces an approximately correct output pattern F_{W*}^{NT^{L_Inp,L_Out}}(s_l) with error e_{W*,s_l} → 0.


4.2.2 Aspects of Learning Phonotactics with RNNs

In this section I will present some specific details of learning phonotactics with recurrent neural networks. The complete implementation of the learning process depends on the specific NN. The learning algorithm of one RNN – the SRN – was presented in detail in section 3.7. Let the training set L contain sequences (words) {w_1, w_2 ... w_{|L|}} and let each word w_i be a sequence of symbols (c_{i1}, c_{i2} ... c_{i|w|}). As I noted earlier in the definition of a Neural Predictor, a special end-of-word symbol '#' will be appended to each word, effectively becoming the last symbol of all words, which allows the network to predict the end of the words. During training, the words w_i are presented sequentially to the network, two symbols (c_{ij}, c_{ij+1}) at a time, in order, j = 1 ... |w_i|. The first symbol is presented to the input of the network and the second one is targeted at the output layer, both of them encoded according to the input/output encoding schemes. The training process is divided into training epochs, during which each word is randomly selected and presented to the neural network. After every epoch, the NN is evaluated and the result of this evaluation controls whether the learning should continue or not. Different learning procedures use different control strategies. One of the simplest procedures is to stop the training when the network error E_N drops below some error threshold E, e.g., 1%. A more complicated method is to test the network on another – testing – set and to stop the training when the performance on the testing set stops improving (see Section 4.3). Supervised error-driven algorithms are well known for their tendency to fall into local error minima while searching for the global minimum. Learning a task as complex as phonotactics requires guarding against this tendency. Therefore a special algorithm was developed that supervises the training process and minimises this risk (Stoianov et al. 1998); it is explained in the following subsection.
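A sketch of such an epoch-based control loop is given below; train_on_word and evaluate are hypothetical helpers, and the two stopping criteria from the text are included only as an illustration of the control strategies, not as the exact procedure used in the experiments.

import random

def train(net, training_words, testing_words, error_threshold=0.01, max_epochs=1000):
    best_testing_error = float("inf")
    for epoch in range(max_epochs):
        words = list(training_words)
        random.shuffle(words)                      # words are presented in random order
        for word in words:
            net.train_on_word(word + "#")          # append the end-of-word symbol
        if net.evaluate(training_words) < error_threshold:   # criterion 1: error drops below E
            break
        testing_error = net.evaluate(testing_words)
        if testing_error > best_testing_error:     # criterion 2: testing performance stops improving
            break
        best_testing_error = testing_error
    return net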

Pool of Networks

Experience showed that a single network trained on the prediction problem achieves good performance in most cases. Yet, if we need the best of the model, a set (pool) of networks {NN_i}_{i=1}^{P} can be trained simultaneously on the same problem and the one achieving the best performance can be used. The pool behaviour is guided by an evolutionary-like algorithm called the supervisor. During training it keeps two versions of the networks being


trained: their previous state NN_i(t) and the network's state after training for one epoch NN^1_i(t). The new copies are evaluated and the networks in the pool are updated according to a specific algorithm. The simplest one is to update each network NN_i(t) with the result of the training if it performs better than the older version, and otherwise to restore the previous copy. This method is useful if we want to estimate the reliability with which a single neural network will learn the problem at hand. A better, evolutionary approach is, after each training epoch, first, to evaluate (a) the previous P copies of the networks from the pool ({NN_i}_{i=1}^{P}) and (b) the P new networks ({NN^1_i}_{i=1}^{P}), and second, to eliminate the P networks with the worst performance, keeping clones of the P best networks. This way, only the best P exemplars are trained further, thus avoiding the training of less promising networks. The method is called evolutionary since ideas from natural evolution are applied, namely that the best-fitted exemplars survive and produce offspring. The experimental work showed that this method indeed converges faster and more reliably to an optimally performing network than standard single-network training. The evolutionary approach may lack cognitive plausibility, but it will be used at times in order to arrive at a better-performing model. Actually, most of the training results presented in this thesis result from the first approach, in order to show the reliability of the learning processes alone; that is, unless explicitly stated, the networks are trained independently of each other.
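The evolutionary variant of the supervisor might be sketched as follows; train_one_epoch and evaluate are hypothetical helpers, and lower evaluation scores are assumed to be better.

import copy

def pool_epoch(pool, train_one_epoch, evaluate):
    # one supervisor step over a pool of P networks: train a copy of every network,
    # let old and new versions compete, and keep clones of the P best performers
    candidates = []
    for net in pool:
        trained = train_one_epoch(copy.deepcopy(net))   # the new version NN^1_i(t)
        candidates.extend([net, trained])
    candidates.sort(key=evaluate)                       # best (lowest error) first
    return [copy.deepcopy(net) for net in candidates[:len(pool)]]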

Word Frequencies

An important issue in learning to process words is the number of presentations of each word to the network in one training epoch. The simplest approach is to present each word just once. In natural language modeling, however, it is better to present the data to the learning model according to their frequencies of occurrence in experience. This emphasises the more frequent sequences and biases the network performance toward them, increasing the overall performance. Therefore, the latter approach will be used and the words will be presented according to their frequencies of occurrence. Actually, if the frequency of a word w is f_w, then the frequency that is used is the logarithm ln(f_w) of that frequency. The logarithmic frequency significantly reduces the learning time, while still preserving the differences between more and less frequent words. This approach was also used in a number of earlier studies, such as (Plaut, McClelland, Seidenberg & Patterson 1996).
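A minimal sketch of turning corpus frequencies f_w into per-epoch presentation counts via ln(f_w) is given below; the rounding and the minimum of one presentation are illustrative choices, and the example frequencies are invented.

import math

def presentations_per_epoch(word_frequencies):
    # map each word's corpus frequency f_w to a rounded log frequency ln(f_w),
    # with at least one presentation per epoch
    return {word: max(1, round(math.log(freq)))
            for word, freq in word_frequencies.items()}

# e.g. presentations_per_epoch({"bad": 1200, "bag": 35}) -> {"bad": 7, "bag": 4}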


Symbol Encoding

A neural predictor NP should be able to produce the likelihoods p_s(c_i) that a symbol c_i ∈ Σ is a successor in the current context s. An output encoding using (phonetic) features does not predict the following symbol without a further step (e.g., interpreting a vector of likelihoods in terms of the nearest encoding of a symbol). Therefore, the network output layer should contain single-neuron representations of every symbol c_i ∈ Σ. Recalling the encoding mechanisms from Chapter 3, we see that only the orthogonal encoding mechanism guarantees that every symbol has its own neuron. If the output activation function of the neurons is continuous, e.g. sigmoidal, then the neurons can encode likelihoods (0...1). On the other hand, feature-based encoding is more adequate in connectionism than localistic encoding. If the former is used to encode the output data in a neural predictor, then, as was just shown, a trained predictor should output the statistically mean activation \hat{f}^L_i(s) of every feature F_i, given a context s. Also, a feature-based encoding fixes the possible combinations of symbols at the interface layer, which in most cases will be useful, but not if one wants to study the data from scratch, as we do now. The output feature vector F can be transformed (decoded) into the symbol space Σ, resulting in a vector of likelihoods (p(c_1), p(c_2) ... p(c_{|Σ|})) that each symbol c_i is a successor to this context, but this is not the same as direct symbol prediction, because the underlying symbolic structure – represented with the feature-based encoding – naturally induces inter-symbol dependencies. Looking for phonotactic constraints in the data, we need direct predictions about symbols, and therefore in the following sections the orthogonal encoding scheme will be used for encoding the symbols at the output layer of the SRNs. This means that in order to encode the 44 phonemes used in the Dutch language, we will need 44 output neurons.
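A minimal sketch of the orthogonal (one-hot) encoding and of reading successor likelihoods directly from the output layer is given below; the phoneme inventory is assumed to be supplied as a list, and the helper names are illustrative.

import numpy as np

def make_orthogonal_encoder(alphabet):
    # one output neuron per symbol, e.g. 44 neurons for the 44 Dutch phonemes
    index = {symbol: i for i, symbol in enumerate(alphabet)}

    def encode(symbol):
        pattern = np.zeros(len(alphabet))
        pattern[index[symbol]] = 1.0
        return pattern

    def decode(output_activations):
        # each neuron's activation is read directly as the likelihood that the
        # corresponding symbol continues the current left context
        return {symbol: float(output_activations[i]) for symbol, i in index.items()}

    return encode, decode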

Tree-based data representation

Learning lexical phonotactics also presents some practical challenges, which can be met if additional data structures are used. For example, extracting the distribution of the successors to every left context in a language L is a computationally heavy task, which is expedited when the database L is represented in a tree-based format. To begin, let us recall two popular methods for representing a given set L of sequences. Lists are the simplest, but they are not optimal with regard to the complexity of memory and time of access.

Figure 4.2: Tree-based word-list representation. Words are encoded as paths in the tree, starting from the root.

A more efficient method is a tree-based representation, known also as a trie (Aho, Hopcroft & Ullman 1983). This representation is effectively a k-ary tree, where k stands for the number of different tokens in L. Sequences are encoded as paths s in this tree, starting from the root. Every node in the tree represents a left context s in L (the path from the root to the current node) and may or may not be the end of a word, for which an additional key v_s is necessary (v_s = 1 if s is a word in L, that is, s ∈ L, and v_s = 0 otherwise). One might also attach other information to every node s, for example the number m_s of words encoded in its sub-tree T_s. This number can be calculated recursively, by performing a depth-first (post-order) tree traversal (4.1). For every leaf in the tree, this procedure assigns the number v_s to m_s (which is normally one; see below for alternatives). Then it calculates, in an upward direction, the number of words in each subtree T_{[s c_i]} of the current node s, according to the following procedure:

if s is a leaf then  m_s = v_s,  else  m_s = v_s + \sum_{c_i \in \Sigma} m_{[s c_i]}        (4.1)

Once we have the numbers m_{[s c_i]} for every node/context s and every possible successor c_i ∈ Σ of s, we can immediately calculate the empirical distribution P_s(c) of the successors {c_i} of s by normalising the vector (m_{[s c_1]}, m_{[s c_2]}, ... m_{[s c_{|Σ|}]}) as follows:

P_s(c) = ( m_{[s c_1]} / m_s , m_{[s c_2]} / m_s , ... , m_{[s c_{|Σ|}]} / m_s ),   where   m_s = \sum_{i=1}^{|Σ|} m_{[s c_i]}        (4.2)


This procedure computes the so-called type distribution, and correspondingly a type frequency P_s(c_i) for each symbol c_i ∈ Σ, where every word is counted once. Alternatively, instead of counting the number of words in each sub-tree, we might compute the total of the frequencies of all words in each (sub-)tree by substituting v_w = 1 with v_w = f_w, where f_w is the frequency of occurrence of the word w. Similarly to (4.2), we can then compute the empirical distribution F_s(c) with word frequency included, which is also called the token distribution and which in turn contains token frequencies f_s(c_i) for each symbol c_i ∈ Σ.
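A sketch of this trie representation, of the recursive count (4.1) and of the normalisation (4.2) is given below; setting value to the word frequency f_w instead of 1 yields the token distribution, compute_counts must be run once before querying successor distributions, and the explicit '#' entry for contexts that are themselves complete words is an added convenience not spelled out in (4.2).

class TrieNode:
    # every node stands for a left context s; value is v_s, count is m_s
    def __init__(self):
        self.children = {}   # successor symbol c_i -> TrieNode for the context [s c_i]
        self.value = 0       # v_s: 1 (or f_w) if s is a word in L, 0 otherwise
        self.count = 0       # m_s: number (or total frequency) of words in the subtree T_s

def insert(root, word, value=1):
    node = root
    for symbol in word:
        node = node.children.setdefault(symbol, TrieNode())
    node.value = value       # v_s = 1 for type counts, v_s = f_w for token counts

def compute_counts(node):
    # depth-first (post-order) traversal implementing (4.1): m_s = v_s + sum_i m_[s c_i]
    node.count = node.value + sum(compute_counts(child) for child in node.children.values())
    return node.count

def successor_distribution(root, context, end_symbol="#"):
    # empirical distribution P_s(c) of (4.2), with the end-of-word symbol '#'
    # standing for the case that the context itself is a complete word
    node = root
    for symbol in context:
        node = node.children[symbol]
    distribution = {symbol: child.count / node.count for symbol, child in node.children.items()}
    if node.value:
        distribution[end_symbol] = node.value / node.count
    return distribution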

Learning phonotactics by an orthogonal neural predictor

The rest of the chapter will focus on learning phonotactics with the orthogonal encoding scheme at the output layer, in which every symbol c_i ∈ Σ is represented essentially by only one neuron n_i: it is set active (one) if that symbol is present at the input layer or targeted at the output layer, and set inactive (zero) otherwise. This representation drives the network, at every moment during training, to make the neuron corresponding to the targeted symbol more active and to make the rest of the neurons less active. Intuitively, if such moves happen during training with different successors in a given context, this will result in likelihoods that those successors follow this context. For example, let us take two training sequences, bad and bag. Since we use supervised, error-driven learning, training on those two sequences will make the network, first, activate the neuron n_d (corresponding to the symbol d) and suppress the neuron n_g (standing for the symbol g), and second, activate the neuron n_g and deactivate n_d. As a result, both neurons n_d and n_g will be driven once toward 1 and once toward 0, perhaps resulting in an activation of 0.5 for both of them when the initial conditions (the particular left context "ba") have been presented to the input of the network. Of course, there is a learning coefficient η, 0 <