LANGUAGE AND COGNITIVE PROCESSES, 1998, 13 (2/3), 337–371

The Development of Spelling–Sound Relationships in a Model of Phonological Reading

Marco Zorzi
Department of Psychology, University of Trieste, Italy, and Department of Psychology, University College London, UK

George Houghton and Brian Butterworth
Department of Psychology, University College London, UK

Developmental aspects of the spelling to sound mapping for English monosyllabic words are investigated with a simple two-layer network model using a simple, general learning rule. The model is trained on both regularly and irregularly spelled words, but extracts the regular spelling to sound relationships, which it can apply to new words and which cause it to regularise irregular words. These relationships are shown to include single letter to phoneme mappings as well as mappings involving larger units such as multiletter graphemes and onset–rime structures. The development of these mappings as a function of training is analysed and compared with relevant developmental data. We also show that the two-layer model can generalise after very little training, in comparison to a three-layer network. This ability relies on the fact that orthography and phonology can make direct contact with each other, and its importance for self-teaching is emphasised.

Requests for reprints should be addressed to M. Zorzi, Dipartimento di Psicologia, Via dell'Università, 7, I-34123 Trieste, Italy. Marco Zorzi was supported by a grant from MURST. Brian Butterworth's reading research is supported by Grant G9015838N from the Medical Research Council. © 1998 Psychology Press Ltd

INTRODUCTION

Experimental psychologists have long been interested in how skilled readers convert a written word into its spoken form, and a great deal of data is available on the reading of words and nonwords in both normal and
impaired subjects (see Ellis, 1993, for a review). This empirical work has inspired a number of competing models of reading, and, following Seidenberg and McClelland (1989), recent theoretical discussion has been dominated by implemented computational models of reading, mostly in a neural network format (Bullinaria & Chater, 1995; Coltheart, Curtis, Atkins, & Haller, 1993; Norris, 1994; Plaut, McClelland, Seidenberg, & Patterson, 1996; Zorzi, Houghton, & Butterworth, 1998). These models have been tested against a wide variety of data from skilled and impaired readers.

In parallel to studies of mature reading skill, developmental psychologists have investigated the development of reading in children (see Goswami & Bryant, 1990, for review). Most of the current theoretical accounts of learning to read are entirely informal, and are typically expressed in terms of discrete "stages" (Goswami & Bryant, 1990, for review), for instance a logographic (whole-word) stage, followed by an alphabetic (grapheme–phoneme) stage, and so on (Frith, 1985). These accounts are unrelated to any kind of more general learning theory, which would explain how anything is learned at all and why the stages progress as they do. These "broad brush" accounts can therefore be criticised as essentially descriptive, that is, their primary purpose is to summarise the developmental data. A notable exception to this is represented by the recent development of connectionist models (the seminal work in this respect is Seidenberg & McClelland, 1989), which leads to the possibility of exploiting the well-known learning capacity of neural networks to develop models of learning to read with a built-in developmental dynamic which is not specific to the area of reading. However, in contrast to the experimental work with adult subjects, data from children's reading have not so far played a large part in the development of the computational models. This seems an unfortunate omission: Children's reading can supply additional data to constrain models.

In this article, we examine one aspect of learning to read, the development of knowledge of the spelling to sound mapping in English. We first introduce a simple network model of this mapping which we have previously developed and tested with respect to data from adult readers (Zorzi, Houghton, & Butterworth, 1998). We then examine in more detail certain developmental aspects of the model and investigate how the knowledge and use of orthographic and phonological units such as onsets, rimes, graphemes, and phonemes emerge as a result of training the network on a corpus of English words. Finally, we explore how such knowledge applies to new words when exposure to written words has been limited, a situation resembling that of young children (e.g. first or second graders). We also show that a network using a mediated mapping (hidden units) does not perform nearly as well.

MODELS OF READING ALOUD

As noted, reading models have recently taken a computational turn, with neural network theory being the dominant framework. None the less, many of the issues germane to traditional box-and-arrow models of reading are still debated. In particular, the dual-route model of reading (Coltheart, 1978) is still accepted in some form by most theorists (but see Glushko, 1979; Plaut et al., 1996; Seidenberg & McClelland, 1989). According to the dual-route model, there are two different ways to name a written word. One, usually referred to as the lexical route, stores representations of the spelt form of all learned words, and operates by retrieving pronunciations through word-specific associations. The phonological form of any word is thus directly "addressed" from its orthographic form as a whole (this is sometimes referred to as "addressed" or "retrieved" phonology). The other route, sometimes named the assembly, or grapheme–phoneme conversion (GPC), route, is a system of general knowledge about the common spelling-to-sound relationships in the language; such knowledge can be applied to any string of letters in order to derive a set of sounds that are assembled into a phonological code (assembled phonology) (see Carr & Pollatsek, 1985; Patterson & Coltheart, 1987, for reviews). The assembly route can be used both on known words (and will yield the correct pronunciation if they follow the standard pronunciation rules, so-called "regular" words) and on novel items (new words, nonwords). The lexical route, on the other hand, is necessary for the pronunciation of words that do not follow the standard spelling–sound correspondences (exception words). In this latter case, the assembly route would deliver an incorrect pronunciation (a "regularisation"), which conflicts with the correct output generated by the lexical route.

Within the dual-route framework, one main motivation for the separation of lexical and sublexical knowledge comes from neuropsychological studies. These studies have provided evidence that the two reading procedures can be selectively impaired by brain damage (acquired dyslexia; see Coltheart, 1985; Denes, Cipolotti, & Zorzi, 1996; Shallice, 1988, for reviews). The two forms of the syndrome are known as surface dyslexia and phonological dyslexia. For instance, the purest surface dyslexic patient, KT (McCarthy & Warrington, 1986), in spite of near-perfect reading of regular words and nonwords (the latter was 100% correct), misread many exception words. Phonological dyslexics show the opposite pattern, that is, a great deal of difficulty in reading unfamiliar words or nonwords. The purest case is patient WB (Funnell, 1983): his ability to read nonwords was completely abolished (e.g. he scored 0/20 on short monosyllabic nonwords), whereas his performance on both regular and exception words was almost perfect. Notably, pure cases of surface and phonological dyslexia have been found
even in the developmental counterpart of the syndrome. In a large study of developmental dyslexia, Castles and Coltheart (1993) identified 10 dyslexic children who were selectively impaired in reading exception words, and 8 other children who were selectively impaired in reading nonwords. This finding has recently been replicated in a second group study conducted by Manis, Seidenberg, Doi, McBride-Chang, and Petersen (1996).

Coltheart and colleagues have developed implementations of aspects of dual-route theories. The assembly route has been implemented by Coltheart et al. (1993) using explicit GPC rules; the lexical route has been implemented by Coltheart and Rastle (1994) as an interactive activation model, where the written form of a word accesses orthographic and then phonological lexical representations. The complete dual-route model is known as the DRC model; this model can account for the main empirical findings with normal subjects (in terms of latency and accuracy data; Coltheart & Rastle, 1994) and for the various acquired reading disorders (Coltheart, Langdon, & Haller, 1996).

The main tenet of the dual-route model, that is, the separation of "generative" (i.e. sublexical) and "case-specific" (i.e. lexical) knowledge (see Houghton & Zorzi, submitted, for discussion), has been repeatedly challenged by Seidenberg and colleagues (see, for example, Manis et al., 1996; Plaut et al., 1996; Seidenberg & McClelland, 1989). The Seidenberg and McClelland (1989; henceforth S&M) model is a three-layer, feedforward neural network, which attempts to read regular, exception and non-words in a single route, from spelling to sound. In a broader framework, S&M acknowledge the existence of a semantic route (not implemented), that is, an indirect pathway that goes from print to meaning and then from meaning to sound. The mapping in the implemented network is mediated by a set of "hidden units", and the network is trained using the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986). The S&M framework can be thought of as an integration of lexical analogy and multiple-level theories (Plaut et al., 1996), as it denies the existence of separate lexical and sublexical procedures. However, it has been criticised by several authors (see, for example, Besner, Twilley, McCann, & Seergobin, 1990; Coltheart et al., 1993): For instance, it showed poor nonword reading, and attempts to simulate the surface dyslexic syndrome through damage to the network were unsuccessful. The model has since been superseded by that of Plaut et al. (1996). This later model is still single-route but can read monosyllabic nonwords at a level of performance similar to that of human subjects. In contrast to the highly distributed representations employed by S&M at input and output (Wickelgraphs and Wickelfeatures, respectively, in which no single node stood for anything recognisable as an individual letter or phoneme), Plaut et al.'s model uses orthographic and phonological representations that are based upon graphemes (that is, single
nodes stand for letters and for whole groups of letters, e.g. WH, CH, AY, EA, CK, TCH, etc.) and phonemes; in addition, orthographic and phonological units are segmented into onset, vowel, and coda. The improved nonword (i.e. generalisation) performance of the model over that of S&M appears to be largely due to this change. However, acquired surface dyslexia has been successfully simulated only within a "two-route" framework, which postulates the active interaction between the (implemented) phonological pathway and the (unimplemented) semantic pathway, so that the reading disorder would arise as a consequence of semantic deficits (Plaut et al., 1996; see Coltheart et al., 1996, for discussion).

THE TLA NETWORK MODEL OF SPELLING–SOUND CONVERSION

Zorzi et al. (1998) develop a connectionist dual-process model of reading, which maintains the uniform computational style of the PDP models, but avoids the rigid commitment to a single route. One main component of the Zorzi et al. model is the two-layer network model of phonological assembly (henceforth TLA), that is, a network that learns the (monosyllabic) regular spelling–sound mapping in English. This network contains two layers of processing units (orthographic and phonological layers, respectively) and no intermediate (hidden) layers. The aim of the mechanism is simply for it to achieve human-like performance on the phonological reading of monosyllabic words and nonwords. The mature TLA model should therefore read nonwords and regular words well, and tend to regularise irregular words. Note that it is explicitly not required that the mechanism should be able to correctly read exception words. This is assumed to be achieved through a mediated mapping, which can be based on lexical nodes (as in traditional dual-route models), or on a distributed lexicon (Zorzi et al., 1998). The model's reading is tested using stimulus lists such as those of Glushko (1979). Zorzi et al. describe a number of detailed simulations with the model, looking at both error rates and reaction times, and comparing these with adult human data. Here we briefly describe the main features of the model.

Architecture

The input to the model is a representation of the spelling of a monosyllabic word. Letters in words are represented using a positional code, where each node represents both a letter and the position in the word occupied by that letter. There are no nodes representing combinations of letters, such as graphemes (e.g. TH, EE, etc.). The letter positions are defined with respect to orthographic onset and rime. All letters before the first vowel letter form the onset, and all letters from the vowel onward form the rime. There are three onset positions and five rime positions. For the sake of simplicity, and at the cost of some redundancy, each letter has a representation (node) at each position, giving 8 × 26 = 208 input nodes. Within each group, successive letters occupy successive positions (i.e. are "left-justified"). Thus, using "_" to denote an empty position, milk would be represented as M _ _ I L K _ _, old as _ _ _ O L D _ _, and strength as S T R E N G T H. The phonological representation has a similar format, with phonemes in a syllable aligned to phonological onset and rime positions in the same way. In this case there are three onset positions and four rime positions (e.g. blood as /b/ /l/ _ /V/ /d/ _ _). The phonemic representation recognises 44 different phonemes of English. Clearly, phonological constraints dictate that most phonemes can occur in only a few syllabic positions (Hartley & Houghton, 1996). However, once again economy is sacrificed for simplicity, and all 44 phoneme nodes occur in all seven positions, giving 308 output units. The input and output layers are fully connected (Fig. 1). The output layer has competitive interactions within each phoneme position. That is, for a given orthographic input it is possible for more than one phoneme to become active in a given position. Activated phonemes compete via lateral inhibition to become the dominant response. An executable phonological specification is considered to be achieved when at most one phoneme is active above a response threshold in each position.
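As an illustration, a minimal sketch of this positional, onset/rime-aligned input coding might look as follows (Python; the letter ordering within slots, the treatment of Y, and the function name encode_orthography are our own assumptions, not details of the published model):

import numpy as np
import string

ONSET_SLOTS, RIME_SLOTS = 3, 5           # 8 letter slots in total
N_LETTERS = 26                           # 8 * 26 = 208 input nodes
VOWELS = set("aeiou")                    # simplification: Y is not treated as a vowel here

def encode_orthography(word):
    """Return a 208-dim binary vector: letters left-justified within onset and rime."""
    word = word.lower()
    # split at the first vowel letter: everything before it is the onset
    first_vowel = next((i for i, ch in enumerate(word) if ch in VOWELS), len(word))
    onset, rime = word[:first_vowel], word[first_vowel:]
    vec = np.zeros((ONSET_SLOTS + RIME_SLOTS, N_LETTERS))
    for slot, ch in enumerate(onset[:ONSET_SLOTS]):
        vec[slot, string.ascii_lowercase.index(ch)] = 1.0
    for slot, ch in enumerate(rime[:RIME_SLOTS]):
        vec[ONSET_SLOTS + slot, string.ascii_lowercase.index(ch)] = 1.0
    return vec.ravel()

print(encode_orthography("milk").sum())  # 4 active nodes: M, I, L, K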

FIG. 1. Architecture of the model. The arrow means full connectivity between layers. Each slot stands for a group of letters (26) or phonemes (44).

The representational scheme adopted in the Zorzi et al. model is purely motivated by psychological data, rather than by computational considerations (e.g. to maximise the potential for generalisation; Plaut et al., 1996). Note that the orthographic and phonological representations do not incorporate "explicit" onset units or rime units (e.g. a single node representing ILK in milk; cf. Norris, 1994). Letters and phonemes are "functionally aligned" to these units, so that their relative position in a letter string is invariant with respect to the rime (or onset) units to which they belong. Therefore, the network may detect this invariance during learning and use onset/rime orthographic subpatterns (in addition to the identity of the single letters) as predictors of the word's pronunciation.

The importance of onset–rime structures in the orthographic and phonological representations is supported by a large body of empirical evidence. Bradley and Bryant (1983) found a strong connection between children's sensitivity to rhyme and their success in reading: Rhyming skills are an excellent predictor of later reading skill, even two years or more before children learn to read. By contrast, there is good evidence that phonemic awareness arises as a consequence of learning to read, both in children (Goswami & Bryant, 1990) and in illiterate adults (Morais, Cary, Alegria, & Bertelson, 1979). For instance, children have difficulties in performing phonological tasks such as phoneme deletion (e.g. delete /p/ from /plant/) before they begin learning to read. That the onset–rime structure is an intrinsic feature of the phonological code is independently supported by psycholinguistic studies (Treiman, 1986; Treiman & Danis, 1988; Treiman, 1989, for review) and models of speech production (Dell, 1986; Hartley & Houghton, 1996). Children have been found to rely heavily on these larger phonological units when they have to make connections between sounds and letters. Using larger print-to-sound correspondences, the onset and rime spelling patterns, they are able to draw analogies from words they already know to new words (Goswami, 1986, 1988, 1991, 1993; Goswami & Bryant, 1990, for review). Treiman and collaborators (Treiman & Chafetz, 1987; Treiman, Mullennix, Bijeljac-Babic, & Richmond-Welty, 1995) have specifically argued that the orthographic structure of written English reflects the phonological structure of spoken English, that is, that written words have corresponding onset–rime orthographic units, and have shown that these orthographic units play an important role in both adults' and children's pronunciation of printed words (Treiman et al., 1995). Note that the orthographic and phonological representations adopted in the Plaut et al. (1996) model incorporate a further segmentation of the rime into vowel and coda; such a distinction at the orthographic level is not supported by empirical data. More crucially, our orthographic representation does not include complex graphemes, as used by Plaut and colleagues, that is, there are no nodes standing for whole groups of letters.

Training

The model is trained on a set of English monosyllabic spelling–sound pairs. The number of items in the set can vary, but is usually around 2500. Note that this set contains numerous words containing some degree of spelling–sound irregularity. If the model is to exhibit good nonword reading, it must extract the statistically most reliable spelling–sound relationships and "ignore" the rest. The model is trained using the delta rule (Widrow & Hoff, 1960). For each spelling–sound pair in the training set, an appropriate orthographic input is established, setting each activated letter-position node to a value of 1. Activations propagate in the usual way to the output layer, using the dot product net input rule to calculate the inputs to each phoneme unit. Connection weights are all initialised to zero, and units have no bias term. Phonemic activations are a sigmoidal function f of their net input, bounding phoneme activations in the range [0, 1], and with f(0) = 0 (no input, no output). This output activation is compared with the target activation (nodes that should be on have a target activation of 1, nodes that should be off a target of 0). The error for each phoneme unit is the difference between the target and actual activations. Where errors occur, weights to the offending units are changed according to the delta rule, that is:

    Δw_ij = λ a_i (t_j − a_j)

where w_ij is the weight from input unit u_i to output unit u_j, a_i and a_j are the activations of the input and output units respectively, t_j is the target activation for output unit j, and λ is a learning rate parameter. Since the TLA model is inherently incapable of learning the whole training set, it cannot be trained until errors reach zero. We typically train the model until errors have apparently reached the global minimum.
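A minimal sketch of this training step, assuming NumPy and a particular squashing function chosen only to satisfy the stated constraints (range [0, 1] with f(0) = 0; the exact function used in the original model is not specified here):

import numpy as np

N_IN, N_OUT = 208, 308        # letter-position and phoneme-position nodes
LEARNING_RATE = 0.05          # value used in Study 1 below

W = np.zeros((N_IN, N_OUT))   # weights initialised to zero; no bias terms

def squash(net):
    """Sigmoidal output bounded in [0, 1] with f(0) = 0 (assumed form)."""
    return np.clip(2.0 / (1.0 + np.exp(-net)) - 1.0, 0.0, 1.0)

def train_step(x, target, W, lr=LEARNING_RATE):
    """One delta-rule update for a single spelling-sound pair.
    x: 208-dim binary orthographic vector; target: 308-dim binary phonological vector."""
    net = x @ W                     # dot-product net input to each phoneme unit
    a = squash(net)                 # actual phoneme activations
    error = target - a              # t_j - a_j for every output unit
    W += lr * np.outer(x, error)    # delta rule: dw_ij = lr * a_i * (t_j - a_j)
    return float(np.abs(error).sum())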

Recall

The recall process is the same as that which generates the network's output during training, except that a competitive process is implemented at the output layer, whereby multiple candidates compete via lateral inhibition to be the dominant response in a given phoneme position. This process is important in Zorzi et al. (1998) in modelling reaction time (RT) data, since it recognises the role of "response competition" in accounting for variation in reaction times (e.g. Houghton & Tipper, 1994; Zorzi & Umiltà, 1995). In the present work, RTs are not simulated. In positions where more than one phoneme is activated, the most active is taken to be the network's output at that position (this is the typical result of the lateral inhibition process, in the absence of noise).
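Since RTs are not simulated here, the competitive process reduces to a winner-take-all readout per position, which can be sketched as follows (the threshold value and the helper name are illustrative assumptions; the phoneme activations are those produced by the training sketch above):

import numpy as np

N_PHONEMES, N_SLOTS = 44, 7          # 308 output units = 7 positions x 44 phonemes
RESPONSE_THRESHOLD = 0.3             # hypothetical value; the text gives no number

def read_out(phoneme_activations, phoneme_labels):
    """Return at most one phoneme label per position: the most active unit in each
    slot, provided it exceeds the response threshold (stands in for the full
    lateral-inhibition dynamics, which give the same result in the absence of noise)."""
    a = np.asarray(phoneme_activations).reshape(N_SLOTS, N_PHONEMES)
    response = []
    for slot in range(N_SLOTS):
        winner = int(a[slot].argmax())
        if a[slot, winner] > RESPONSE_THRESHOLD:
            response.append(phoneme_labels[winner])
    return response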

Mature Performance

The model typically reaches an asymptotic level of performance after something like 15–20 passes through its training set. The mature performance of the TLA model is analysed in detail in Zorzi et al. (1998), who also show how the output of this sublexical network can be integrated with the output of another route (network) capable of handling inputs on a whole-word basis (including, of course, exception words). This interaction between the two routes in computing the final phonological output leads to observed interactions between lexical variables (e.g. word frequency) and sublexical ones (e.g. regularity, consistency). At the same time, the separation of these different kinds of knowledge into different components of the system enabled successful modelling of the surface dyslexic syndrome (see Zorzi et al., 1998). These analyses will not be repeated here. To motivate our investigation of the model's developmental properties, we show here that the TLA model performs sufficiently well as a mechanism for spelling to sound conversion that further investigation of those properties is warranted. Any putative model of this process should show good nonword and regular word reading, and a tendency to regularise irregular words. Table 1 shows the model's pronunciations of all the words and pseudowords used in Glushko's (1979) Experiment 1. The model gives near-perfect performance on the regular items, and shows a strong regularisation tendency on the irregular words, similar to that of surface dyslexic subjects (Zorzi et al., 1998). Following this finding, a similar architecture has been applied to nonword spelling (Glasspool, Houghton, & Shallice, 1995; Shallice, Glasspool, & Houghton, 1995). The model's level of performance on the regular nonwords is at least as good as has been reported for any other network model. We conclude that the model demonstrates that a basically adequate spelling-to-sound mapping for regular monosyllables can be efficiently extracted from a representative sample of English words by a two-layer network, using a simple learning rule. The task appears not to require the machinery of hidden units, recurrent nets, and associated complex learning algorithms. This point seems to us particularly important in a developmental context, in which the psychological validity of the training procedures must surely be a source of constraint.

TABLE 1 Pronunciations Produced by the “Mature” Model for the Stimuli from Glushko (1979), Experiment 1

Regular word | Exception word | Regular nonword | Exception nonword
bath /b&T/ | both */b0T/ | cath /k&T/ | coth /k0T/
beef /bif/ | been /bin/ | heef /hif/ | heen /hin/
bleed /blid/ | blood */blud/ | dreed /drid/ | drood /drud/
breed /brid/ | bread /bred/ | sheed /Sid/ | shead /Sid/
buff /bVf/ | bull */bVl/ | wuff /wVf/ | wull /wUl/
bust /bVst/ | bush */bVS/ | nust /nVst/ | nush /nVS/
cold /k6Udl/ | comb */k0m/ | pold /p6Uld/ | pomb /p0m/
code /k6Ud/ | come */k6Um/ | gode /g6Ud/ | gome /g6Um/
deal /dil/ | dead /ded/ | feal /fil/ | fead /fid/
dean /din/ | deaf */dif/ | hean /hin/ | heaf /hif/
dream /drim/ | dread /dred/ | bleam /blim/ | blead /blid/
dune /djun/ | done */d6Un/ | mune /mjun/ | mone /m6Un/
feet /fit/ | foot */fut/ | peet /pit/ | poot /put/
goad /g6Ud/ | good /gUd/ | soad /s6Ud/ | sood /sUd/
greet /grit/ | great */grit/ | steet /stit/ | steat /stit/
haze /heIz/ | have */heIv/ | taze /teIz/ | tave /teIv/
heat /hit/ | head /hed/ | weat /wit/ | wead /wid/
heed /hid/ | hood */hud/ | beed /bid/ | bood /bUd/
hoop /hup/ | hoof /huf/ | moop /mup/ | moof /muf/
lobe /l6Ub/ | lose */l6Us/ | cobe /k6Ub/ | cose /k6Us/
lode /l6Ud/ | love */l6Uv/ | hode /h6Ud/ | hove /h6Uv/
meld /meld/ | mild /maIld/ | beld /beld/ | bild /baIld/
mode /m6Ud/ | move */m6Uv/ | pode /p6Ud/ | pove /p6Uv/
must /mVst/ | most /m6Ust/ | sust /sVst/ | sost /s0st/
note /n6Ut/ | none */n6Un/ | wote /w6Ut/ | wone /w6Un/
pink /pI9k/ | pint */pInt/ | bink /bI9k/ | bint /bInt/
plain /pleIn/ | plaid */pleId/ | prain /preIn/ | praid /preId/
port /pOt/ | post */p0st/ | bort /bOt/ | bost /b0st/
posh /p0S/ | push /pUS/ | wosh /w0S/ | wush /wUS/
probe /pr6Ub/ | prove */pr6Uv/ | brobe /br6Ub/ | brove /br6Uv/
puff /pVf/ | pull /pUl/ | suff /sVf/ | sull /sVl/
shore /SO/ | shove */S6Uv/ | plore /plO/ | plove /pl6Uv/
soil /soIl/ | said */seId/ | hoil /hoIl/ | haid /heId/
sole /s6Ul/ | some */s6Um/ | lole /l6Ul/ | lome /l6Um/
soon /sun/ | soot */sut/ | doon /dun/ | doot /dut/
spoon /spun/ | spook */spUk/ | grool /grul/ | grook /grUk/
steal /stil/ | steak */stik/ | sweal /swil/ | sweak /swik/
sweet /swit/ | sweat */swit/ | speet /spit/ | speat /spit/
told /t6Uld/ | tomb */tVm/ | dold /d6Uld/ | domb /d0m/
wail /weIl/ | wool */wul/ | lail /leIl/ | lool /lul/
weak /wik/ | wear */wI6/ | meak /mik/ | mear /mI6/
wilt /wIlt/ | wild /waIld/ | pilt /pIlt/ | pild /paIld/
wore /wO/ | were /w3/ | dore /dO/ | dere /d3/

* = Regularisation errors. Vowels: /&/ as in cAt, /e/ as in bEt, /I/ as in hIt, /0/ as in hOt, /V/ as in hUt, /i/ as in bEAt, /u/ as in bOOt, /U/ as in pUt, /O/ as in dOOR, /6U/ as in grOve, /aI/ as in fIle, /eI/ as in wAIt, /3/ as in BURn, /I6/ as in chEER. Consonants: most have standard values, e.g. /d/ as in Door. Also /S/ as in SHed, /9/ as in siNG, /T/ as in THin.

The delta rule learning procedure used in the current work is formally equivalent to a classical conditioning law (the Rescorla–Wagner rule; Sutton & Barto, 1981), and has been directly applied to human learning by a number of authors (see, for example, Gluck & Bower, 1988a, 1988b; Shanks, 1991; Siegel & Allan, 1996, for review). Its use in the present context can thus be supported by appeal to its much wider applicability in predicting learning data. A further advantage of the two-layer architecture is that the learning takes place in a single set of weights directly connecting orthographic to phonological representations. This means that the model's knowledge about spelling–sound mappings is relatively easy to analyse, and in many cases will directly implement basic intuitions such as that an initial letter b is always pronounced /b/ (Zorzi et al., 1998; see later).
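For instance, with the slot layout assumed in the encoding sketch above, such intuitions can be read directly off the trained weight matrix (the index helpers and the phoneme_labels list are hypothetical, not part of the published model):

import string

N_LETTERS, N_PHONEMES = 26, 44

def letter_node(letter, slot):                 # slots 0-2 = onset, 3-7 = rime
    return slot * N_LETTERS + string.ascii_lowercase.index(letter)

def phoneme_node(phoneme, slot, phoneme_labels):   # slots 0-2 = onset, 3-6 = rime
    return slot * N_PHONEMES + phoneme_labels.index(phoneme)

# After training, the weight from first-position letter B to first-position /b/
# should be strongly positive, directly implementing "initial b -> /b/":
# W[letter_node('b', 0), phoneme_node('b', 0, phoneme_labels)]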

DEVELOPMENT OF PHONOLOGICAL READING SKILLS

Standard accounts of reading development (e.g. Frith, 1985; Marsh, Friedman, Welch, & Desberg, 1980) report that children begin with a "logographic stage", in which words are treated as "unanalysed" wholes, which have a rote association with their pronunciation. In the terms of current reading models, this view translates into the idea that a lexical reading route develops first (though presumably this is influenced to some degree by the teaching method used). We can assume that children already have lexical phonological representations for words they know how to say. To develop a whole-word reading mechanism, letter representations might become linked directly with the phonological nodes, or lexical orthographic representations might first develop which then become associated with items in the phonological output lexicon (as proposed, for instance, in the network model of Coltheart & Rastle, 1994). However it is achieved exactly, this kind of skill is "case-specific" and limits children to reading words they already know. For instance, a study by Seymour and Elder (1986) looked at the reading of 26 four- to five-year-olds in their first year at school, who had acquired a "sight vocabulary". Only one of the 26 children could read unfamiliar words to any significant degree. Many of the children appeared to make guesses at unfamiliar words on the basis of their visual similarity to words they knew. (This pattern of performance is also characteristic of phonological dyslexics, who seem reliant on whole-word reading; Campbell & Butterworth, 1985.)

With further practice, and possibly needing explicit instruction in "sounding out" words, most children come to acquire phonological reading skills which enable them to read new words. The crucial difference between this and lexical reading is that phonological reading requires the explicit interaction of sublexical spelling and sound representations (e.g. letters and phonemes). This is not necessary for lexical reading. In a model such as that of Coltheart and Rastle (1994), lexical reading occurs due to associations between (input) lexical orthographic nodes and (output) lexical
phonological nodes. The constituents of orthographic and phonological representations make no contact with each other. Learning spelling–sound relationships requires that both representations be "active" in the child's mind simultaneously, so that they can make contact, and the "phonics" teaching method supplies information on the correct pronunciation of words and letters as the child is looking at them and attempting to say them (Ellis, 1993). Although we would not want to claim that it is identical to the kind of training our model gets, this "supervised" learning situation is similar to it in a number of important ways. First, the model requires that orthographic and phonological representations be simultaneously active, and direct contact between them must be possible. Second, the computation of error requires external teaching input, and, in particular, the ability to compare this with the network's own pronunciation to determine precisely where they differ. We propose that, in the case of children, this latter ability would require skills of phonological manipulation involving verbal short-term memory (to keep the target pronunciation supplied by the teacher in mind). There is good evidence that developmental phonological dyslexics, who develop poor nonword reading skills, are impaired on just these skills of phonological manipulation and short-term memory (e.g. Campbell & Butterworth, 1985; Howard & Best, 1996; Hulme & Snowling, 1992; Stothard, Snowling, & Hulme, 1996). In addition, other studies (e.g. Treiman, 1984) have shown individual differences in children's reading "strategies", some children relying heavily on whole-word reading and others on spelling–sound conversion. Stuart and Masterson (1992) found that scores on tasks of phonological awareness at age four correlated with reading ability at age ten, and that children who had scored highly on phonological awareness (six years earlier) were better at reading nonwords than children who scored lower.

The basic feature of phonological reading is the ability to read new words aloud. Therefore, in the first simulation we simply chart the model's nonword generalisation ability over time, as a function of its training on whole-word spelling–sound pairs. As noted, this can be thought of as a kind of phonics training, as the network's "attention" is simultaneously focused on the spelling and sound of a word, and the teacher is providing the correct pronunciation while the letter pattern for the word is activated.

Study 1: Development of Nonword Reading as a Function of Spelling–Sound Training

In this study, the network was trained, as described earlier, on a set of 2300 monosyllabic words, with the learning rate set to 0.05. No training was given on isolated letter–sound mappings. At various stages, the model was tested
on the set of Glushko regular nonwords (see Table 1). The model's mature performance on these words (Table 1) was taken to be correct, and its performance at each stage was measured in terms of the percentage of correct responses compared to the mature state. The results are shown in Fig. 2. The times at which the model is tested are given in terms of training episodes (single exposures to input–output pairs) rather than epochs (exposures to the whole training set), as significant learning goes on within epochs. The main thing to note about Fig. 2 is that the model starts successfully to generalise its learning very early in training, for instance scoring 15% correct after about 700 individual learning trials on different words. The model reaches 90% correct on Glushko's regular nonwords after about 7000 learning episodes (i.e. less than three epochs), and 100% after 22,000–25,000 learning episodes, that is, less than ten epochs. Plaut et al. (1996) did not explicitly test nonword reading in their model at earlier points of training; however, inspecting Fig. 23 in Plaut et al. (1996), it can be noted that nonword reading performance in the feed-forward version of the model reaches about 95% correct after 100 epochs. Note that this is the most favourable comparison with respect to the models developed by Plaut and colleagues, since their attractor network version takes far longer to train.

FIG. 2. Performance of the model (% correct pronunciation) on Glushko's (1979) regular nonwords as a function of amount of training. Amount of training is given in number of training episodes (trials). After reaching about 80% correct, improvement slows down. Note that the calibration of the X-axis changes after 4000 trials, each tick mark representing 1000 trials.

A difference between this training and children's experience is that children typically see a smaller number of words many times. Furthermore, the idea that the pronunciation for many thousands of words is externally supplied by a teacher (i.e. "direct teaching", as in all currently implemented models) cannot reasonably be maintained (Share, 1995; see later). Young readers continually encounter new items; it has been estimated that the average fifth-grader encounters around 10,000 new words per year (Nagy & Herman, 1987). Therefore, it is crucial to demonstrate that exposure to a very small set of words is sufficient for the successful decoding of new items. The generalisation performance of the model under such conditions is looked at in a later simulation.

This first simulation shows that the model can learn the regular mapping quite efficiently, and starts to do so from an early stage. In the following simulations, we investigate what kinds of sublexical relationships between spelling and sound (single letters, graphemes, etc.) this performance is based on, and consider how this relates to evidence from children's reading for the use of grapheme–phoneme relationships as well as larger structures based on the onset–rime division.
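The testing regime of Study 1 can be summarised in a short sketch built on the helpers introduced earlier (encode_orthography, train_step, squash, read_out); the random sampling of training pairs, the test schedule, and the argument names are illustrative assumptions rather than details of the original simulations:

import random

def chart_generalisation(W, training_pairs, test_nonwords, mature_pronunciations,
                         phoneme_labels, n_episodes=25_000, test_every=500):
    """Interleave single-item training episodes with periodic tests on a nonword
    list, scoring the developing network against the mature model's responses."""
    history = []
    for episode in range(1, n_episodes + 1):
        spelling, target = random.choice(training_pairs)     # one learning episode
        train_step(encode_orthography(spelling), target, W)
        if episode % test_every == 0:
            correct = sum(read_out(squash(encode_orthography(nw) @ W), phoneme_labels)
                          == mature_pronunciations[nw]
                          for nw in test_nonwords)
            history.append((episode, 100.0 * correct / len(test_nonwords)))
    return history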

Study 2: Development of Grapheme-to-phoneme Mappings

As children's phonological reading skills develop, they become able to assign phonological values to isolated letters and graphemes (e.g. Treiman, Goswami, & Bruck, 1990). Does the model acquire this ability, without explicit training on isolated letters and sounds? This question is not as obvious as it may at first appear. In the absence of any "attentional focusing" on parts of the input–output representations, the delta rule learning procedure distributes the blame for any error found on a given trial across the connections from all currently active input nodes. For instance, suppose the network were required to learn just one word, say blot → /bl0t/. If the network learned sublexical correspondences in this case, we would expect there to be a strong excitatory weight specifically linking the initial letter b to the initial phoneme /b/, the letter l to /l/, and so on. Presentation of an initial b alone would activate initial /b/. However, this is not what would happen. Because the letter nodes are linked to all the phoneme nodes, the learning of this isolated mapping would be distributed, so that all the input letters contributed equally to the activation of all of the phonemes. As such, the network would read the word "holistically", though without having a lexical node standing for the word. Presentation of isolated letters from blot would
partially activate all the phonemes in /bl0t/ to the same extent. Note that at the outset of its training, when the network has only seen a few words, this is precisely how it will treat them. At this early stage, generalisation to new words is likely to be weak (see Fig. 2), and the network will go through a kind of "logographic" stage. As noted by Share (1995, p. 159), "logographic reading must necessarily be short-lived because the alphabetic nature of English orthography dictates complete or near-complete processing of orthographic detail." In the absence of explicit grapheme–phoneme training, sublexical correspondences only emerge as a function of experience of a wider vocabulary.

Before looking at this issue developmentally, we first show the model's mature performance on individual letters and graphemes. Table 2 shows the model's pronunciation of all the individual letters of the alphabet in isolation, and a set of consonant, vowel, and consonant–vowel graphemes. In the case of the isolated consonant letters and graphemes, they are assumed to be read as though they are the beginning of a word (i.e. they activate onset letter positions). In most cases this leads to activation of phonemes in the first onset position. The only exceptions to this are the letters x and q. The first never occurs in initial position in the training set and consequently generates no phoneme when in the initial position. The letter x was tested instead in the contexts ax, ex, ix, ox, and ux. In all cases it was pronounced /ks/. The situation with q is a little different. It always appears in the combination qu in training, and in isolation gives partial activation to /w/, in the onset /kw/.

TABLE 2 Model’s Pronunciation of Single Letters and Graphemes Presented in Isolation

Letters Phonemes

Letter

Phonemes

Letters Phonemes

Letters Phonemes

a e i o u b c d f g h j k l m

n p q(u) r s t v w x y z th sh ch wh

/n/ /p/ /kw/ /r/ /s/ /t/ /v/ /w/ /ks/ /j/ /z/ /T/, /D/ /S/ /tS/, /S/ /w/, /h/

wr kn

/r/ /n/

ai ea ee ei oa oi oo ou ar er ir or

/eI/ /i/ /i/ /eI/ /6 U/ /oI/ /u/ /aU/, /u/ /A/, /O/ /3/ /3/ /O/, /3/

ur aw ew ow ay ey oy air ear our aze eze ize oze uze

/&/ /e/ /I/ /0/ /V/ /b/ /k/ /d/ /f/ /g/ /h/ /dZ/ /k/ /l/ /m/

/3/ /O/ /ju/ /6 U/, /aU/ /eI/ /eI/ /oI/ /e6 / /I6 / /O/ /eIz/ /iz/, /eIz/ /aIz/ /6 Uz/ /juz/

352

ZORZI, HOUGHTON, BUTTERWORTH

The consonant letters were also tested for their pronunciation in the "coda", by presenting them in the context aC (where C is any consonant). Most were pronounced as in Table 2, though r and w were treated as part of vowel graphemes. Vowel letters and graphemes activate orthographic rime positions, and the corresponding phonemes always become active in the phonological rime. The vowel graphemes tested include two- and three-letter VC combinations such as ur, aw, air, etc. In addition, the model's handling of the "rule of e" (how a vowel letter is pronounced due to a final e) is tested using the letter strings aze, eze, ize, oze, uze. In some cases, the model significantly activates more than one phoneme at a given position, though always with different activation levels. In these cases, the alternative pronunciations are given, ordered by activation level (i.e. most active first). Although we have no data regarding the normal pronunciations of these letters and graphemes (in isolation), the ones generated by the model seem to us intuitively acceptable. The model can therefore be taken to have extracted (positional) grapheme–phoneme correspondences without explicit training on subword mappings, or the use of explicit (i.e. local) representation of graphemes. We are not aware of any other connectionist model which shows such a capacity without training on GPCs, or explicit coding of graphemes.

In looking at this issue developmentally, we will take the model's "mature" performance to represent desired or correct performance (where the model generates competing pronunciations, we simply take the most active one to be the target). The development of this capacity of the network was tracked during learning by presenting the strings in Table 2 after every 100 learning episodes, which amounts to about 25 times per epoch. At each time the model's performance was separately analysed on four categories of input: (1) single vowel letters, (2) single consonant letters, (3) consonant graphemes, and (4) vowel graphemes (including the VC graphemes in Table 2). Correct output was taken to be the dominant pronunciation produced by the mature model. For the developing model, if it produced more than one phoneme at a given position, the most active phoneme was taken as the model's response and all other (competing) responses were ignored. Results are shown in Fig. 3.

As can be seen from Fig. 3, the different classes of spelling–sound relationship offer different degrees of difficulty for the network, single letters being generally easier than graphemes. The pronunciations of consonant graphemes are learned especially slowly. Note that young skilled readers are usually better with items that only require knowledge of invariant context-free correspondences (e.g. Coltheart & Leahy, 1992; Marsh et al., 1980).

FIG. 3. Performance of the model on isolated letters and graphemes during training. Calibration of X-axis is adjusted after 5000 trials.

Study 3: Use of Onset–Rime Based Associative Structures

As previously stated, there is good evidence that orthographic and phonological onset–rime structures play an important role in both adults' and children's reading (e.g. Treiman et al., 1995). Sensitivity to larger units such as the rime is clearly expected in a three-layer network with hidden units (and indeed this is what happens; see Plaut et al., 1996, for discussion). The hidden units of the Plaut et al. model can be sensitive to higher-order combinations of input units (which may actually reach the size of the whole word, e.g. in PINT). On the other hand, as stated by Plaut et al. (1996, p. 81), "In a . . . network without hidden units, the contribution of an input unit to the total signal received by an output unit is unconditional; that is the contribution of each input unit is independent of the state of other input units." It is therefore crucial to establish whether the TLA model (a network without hidden units) has extracted some associative knowledge about such higher-order structures, which would be potentially very useful for achieving good phonological reading. In particular, we wish to determine whether the network develops a set of connections that can aid the generation of the correct vowel pronunciations in response to particular combinations of vowel and consonant letters (i.e. VC units).

Although the spelling and sound representations used by the model have the onset–rime templates imposed on them, no alignment of the two representations is presupposed in the model. That is, all letter nodes are connected to all phoneme nodes, and there is nothing in the initial structure of the model which dictates that, say, orthographic onsets should map to phonological onsets, or orthographic vowels to phonological vowels. If the model ends up doing this, it is by virtue of discovering, for instance, that the best way to predict the vowel sounds in the words it sees is by associating those sounds with vowel letters. As before, we first analyse the representation used by the mature model, and then look at the development of this representation as a function of learning.

The simplest way to look at the use of onset–rime associative structures in the model is to analyse the weight matrix produced by the learning algorithm. Notably, the analysis of a two-layer network is much simpler than that required by multilayer networks (such as backpropagation networks). We therefore analyse the connectivity pattern from orthographic onset to phonological onset and from orthographic rime to phonological rime, and vice versa. If the model is aligning the respective onsets and rimes, then we would expect to find, for instance, that the majority of the weights projecting from the orthographic onset are received by the phonological onset, and similarly for the rime. Thus, there are four different orthography-to-phonology interactions to analyse: (1) onset-to-onset, (2) onset-to-rime, (3) rime-to-onset, and (4) rime-to-rime. The results are shown in Table 3.

TABLE 3 Mean Connection Strengths of Slots in the Orthographic Onset and Rime to Phonological Onset and Rime

                      Phonological onset             Phonological rime
                      Excitatory   Inhibitory        Excitatory   Inhibitory
Orthographic onset    +21.1        −56.5             +12.7        −65.0
Orthographic rime     +5.2         −55.5             +40.8        −111.4

Because there are more slots (i.e. positions) in the orthographic rime than in the onset, we first compute the total input weights for the relevant units (e.g. from onset to onset) and then divide by the number of input positions to get a mean input over a single slot. In Table 3, excitatory (i.e. positive) and inhibitory (i.e. negative) weights are shown separately. Apart from the similarity in the inhibitory inputs to the phonological onset, the results clearly indicate a functional alignment of spelling and sound onset and rime constituents. Note that, although inhibition is important in the contextual resolution of locally ambiguous spelling–sound relationships, the activation of the correct phonemes can only be achieved by excitatory connections, and here the differences in the connectivity patterns are quite pronounced. For both phonological onset and rime structures the mean excitatory input from the corresponding orthographic constituents is three to four times as large as that from the complementary constituents. This is shown more clearly in Fig. 4.

FIG. 4. Excitatory connectivity between onset–rime associative structures in the model.

Having established that the network has discovered onset–rime associative mappings, one further question is how these mappings develop during training. Following Treiman et al. (1995), we also wish to define the role of consonant letters in determining the pronunciation of the vowel. On the basis of statistical analysis of a large corpus of words, Treiman et al. conclude that orthographic units VC2 (vowel plus final consonants) are very useful as guides to pronunciation of the vowel; in other words, VC2 letter clusters have relatively stable pronunciations and are much more consistent than vowels alone (V) or than letter clusters formed by initial consonants and vowel (C1V). The same conclusions about the important role of C2 and VC2 units, compared to C1V and the intra-rime units V and C2, were reached by Treiman and colleagues with further experiments on both adults and children. We therefore analyse the relative contribution of the orthographic units C1 (initial consonants) and C2 (final consonants) to the pronunciation of the vowel. In this analysis we look at the input to the phonological vowel slot from letters in the orthographic onset and coda. We therefore exclude the orthographic vowel (slot 4) and look at weights projecting from the first three (onset) and last four (coda) letter slots. The results are shown in Fig. 5 in terms of the proportion of total input weights to the vowel. Clearly, the influence of C2 on the phonological vowel tends to increase as training proceeds. This is true for both excitatory and inhibitory weights. On the other hand, the contribution of C1 decreases, as demonstrated by the smaller proportion of input weights (of both signs) that send activation to the vowel phonemes. These results are quite consistent with Treiman et al.'s (1995, p. 130) suggestion that their results "could be taken to suggest that children develop a sensitivity to VC2 units based on their experience with the orthography and its relation to the phonology. If so, children must grasp the statistical regularities of the language quite quickly."

FIG. 5. Proportion of input weights from orthographic onset (C1) and orthographic coda (C2) to phonological vowel (V), as a function of training stage (in learning episodes). Excitatory (exc) and inhibitory (inh) weights are shown in separate curves.
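These weight-matrix analyses are straightforward to express in code. The following sketch of the Table 3 computation follows the slot layout assumed in the earlier sketches; the function name and the summary format are our own:

import numpy as np

N_LETTERS, N_PHONEMES = 26, 44
ORTH_ONSET_SLOTS, ORTH_RIME_SLOTS = 3, 5
PHON_ONSET_SLOTS = 3

def mean_block_weights(W):
    """Summarise a trained 208 x 308 weight matrix as in Table 3: mean excitatory
    and mean inhibitory input per orthographic slot, for each orthography-to-
    phonology constituent pairing (onset/rime to onset/rime)."""
    o_onset = ORTH_ONSET_SLOTS * N_LETTERS          # first 78 input nodes
    p_onset = PHON_ONSET_SLOTS * N_PHONEMES         # first 132 output nodes
    blocks = {
        "onset->onset": W[:o_onset, :p_onset],
        "onset->rime":  W[:o_onset, p_onset:],
        "rime->onset":  W[o_onset:, :p_onset],
        "rime->rime":   W[o_onset:, p_onset:],
    }
    summary = {}
    for name, block in blocks.items():
        n_slots = ORTH_ONSET_SLOTS if name.startswith("onset") else ORTH_RIME_SLOTS
        summary[name] = (block[block > 0].sum() / n_slots,   # mean excitatory input per slot
                         block[block < 0].sum() / n_slots)   # mean inhibitory input per slot
    return summary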

Study 4: Onset–Rime Structure and Nonword Reading

We have shown that in the model the weights from the orthographic coda have a greater influence on the phonological vowel than do weights from the orthographic onset. This means that, for orthographic CVC words, the VC component functions more as a group than the CV component. Treiman et al. (1990) carried out a study to see whether children and adults were differentially sensitive to VC as opposed to CV constituents. They constructed two sets of nonwords, labelled H (High) and L (Low), with 24 items in each set. All the stimuli consisted of a consonant grapheme, followed by a vowel grapheme, followed by a consonant grapheme, so all had a CVC pronunciation. Each grapheme could contain one or two letters, for example chud, tain, fesh. The H and L items were matched so that both sets contained the same graphemes. However, the graphemes were composed such that the VC units of the H items were more frequent in real words than those in the L items (see Treiman et al., 1990, p. 560, for details of the basis of the frequency estimation). The frequencies of the CV units in the two word sets were closely matched. Treiman et al. argued that: (1) if readers use only grapheme–phoneme associations in reading nonwords, then the H and L items should be indistinguishable (they contain the same graphemes); (2) if readers use the initial CV unit as a basis for nonword reading, they should also read the two sets equally well (CV frequency was matched); (3) if readers use the VC constituent, they should perform better on the H nonwords than the L nonwords, as the H nonwords have more frequent VC components. The stimuli were tested on four groups of subjects: first-graders, good and bad third-grade readers, and adults. Performance was measured as the percentage of correct pronunciations of the nonwords. Children were also
tested on their pronunciation of the graphemes used to make up the nonwords when presented in isolation. All four groups showed better performance on the H nonwords than on the L nonwords.

To test the model against these findings, we tracked the performance of the model on Treiman et al.'s (1990) nonwords while it was being trained on the usual set of real words. The model was tested at various stages on the two sets of nonwords. A given nonword pronunciation was taken to be correct if it matched that reported by Treiman et al. (1990, Appendix, p. 567), which is actually the standard GPC pronunciation (Venezky, 1970). The times at which the model was tested are given in terms of training episodes (single exposures to input–output pairs). The performance at each stage was measured in terms of the percentage of correct responses (see Fig. 6). Clearly, the performance on L nonwords is poorer from the very beginning of training, and it tends to remain so even at later stages.

FIG. 6. Performance of the model on Treiman et al.'s (1990) H and L nonwords during training. Calibration of X-axis is adjusted after 12,000 trials.

We can compare the model's performance to that of the experimental groups in Treiman et al.'s (1990) study by finding those points in learning where the performance on H nonwords is the same as that of the various experimental groups. We then take the model's performance on the L nonwords at the same point of learning, and compare this with the experimental results. This comparison of model and data is shown in Fig. 7. It is clear that the model shows the same difference in performance level at each stage as the different subject groups.

FIG. 7. Performance on Treiman et al.'s (1990) L and H nonwords. The graph shows the proportion of correct responses of the four experimental groups and of the model at different stages of training (number in parentheses indicates learning episodes).

As noted earlier, although H and L nonwords contain exactly the same graphemes, H nonwords have more frequent VC components. We have previously shown that the model extracts grapheme-to-phoneme correspondences; none the less, the consistent advantage of H nonwords over L nonwords demonstrates that the model, similarly to human readers, largely uses VC units in reading nonwords (Fig. 7).

The studies described so far have analysed the kinds of sublexical regularities the model extracts from its training and compared them with evidence for similar learning in children. However, the model is trained on a large corpus of different words, and will not see any given word many times (15 to 20 times, typically). It may reasonably be objected that this training is unlike that which children experience, where they see a smaller number of words many times. Furthermore, as previously noted, the idea that several thousands of words can be taught by means of "direct teaching", that is, by externally supplying the correct pronunciation, is clearly flawed (see Share, 1995, for discussion). In the next section, we look at the performance of the model when trained repeatedly on a small number of words.

ANALOGY AND TRANSFER: READING NEW WORDS WHEN YOU DON'T KNOW MANY OLD ONES

In this simulation we investigate the nonword generalisation performance of the model when trained on a very small set of real words. These words are Glushko's (1979) regular and exception words, 43 of each type, for a total of 86 training examples. Most of the nonwords in the list (86 items) were
obtained by Glushko by changing the onsets of the real words. Therefore, the network could, in principle, benefit from the training that is given on the orthographic rimes of the real words. Note that this is exactly the kind of rationale that underlies transfer experiments on young children. The data from transfer experiments suggest that beginning readers are able to draw analogies from words they know to new words (Baron, 1977, 1979; Goswami, 1986, 1988, 1991, 1993; Goswami & Mead, 1992; see Goswami & Bryant, 1990, for review). In Baron's (1979) study, for instance, children were given three lists of words. The lists were of regular, exception, and nonsense words. He found that children read nonwords to rhyme with a real word which has the same orthographic rime, whether this is an exception or a regular word. In Goswami's studies, first- and second-graders were taught some "clue" words such as BEAK to see if they could use this information to decode target words such as PEAK and nonwords such as NEAK, which share the orthographic rime with the clue word. Goswami found that children make analogies about spelling sequences that correspond to onsets and rimes.1

1 Note that in Goswami's studies the similar words (clues) were directly in front of the child when attempting to pronounce nonwords. This might have induced some form of phonological priming (see Patterson & Coltheart, 1987, for a discussion of intralist priming effects due to mixed word–nonword stimuli in the context of adult nonword reading). However, since in our network no priming can take place, it is instructive to see whether the "analogy" effect can be induced simply by learning the clue words.

Study 5

In this study, the training corpus for the network consisted of just 86 words (43 regular and 43 exceptions, from Glushko, 1979, Experiment 1). Otherwise, everything was the same as in the previous studies. Training proceeded for 80 epochs, after which the error reached asymptote. In the testing phase, the network was presented with a stimulus word and the output activations were recorded. The network's response for a given slot position was taken to be that of the most active unit, if any was active. Performance was tested on both the training set and on the test set (86 nonwords from Glushko, 1979, Experiment 1).

Results. The network produced the correct pronunciation of all words in the training set (100% correct). Note that the network, although twolayered, is able to pronounce both regular and exception words. This performance is notably different from that shown when the model is trained on the full set of monosyllables, as reported in Table 1, where most of the exception words were regularised. The reason for this is that the notions of consistency and regularity depend on the context of the whole learning set. In this simulation, both regular and exception words are “regular”, in the 1 Note that in Goswami’s studies the similar words (clues) were directly in front of the child when attempting to pronounce nonwords. This might have induced some form of phonological priming (see Patterson & Coltheart, 1987, for a discussion of intralist priming effects due to mixed word-nonword stimuli in the context of adult nonword reading). However, since in our network no priming can take place, it is instructive to see whether the “analogy” effect can be induced simply by learning the clue words.


However, some inconsistency is present at the level of the single graphemes, in particular in the vowel graphemes (e.g. the EA in BREAD, DREAM, GREAT; all of which are in the training set). The fact that vowel grapheme inconsistency has little (if any) effect on the network's performance means that the network has discovered the reliability of the larger VC2 (rime) unit, so that V inconsistency is overridden by VC2 consistency. This finding is clearly in agreement with Treiman et al.'s (1995) conclusions about the importance of VC2 units in both adults' and children's reading.

We now examine the generalisation performance of the network trained on just the 86 real words. As shown below, the performance of the network on Glushko's (1979) nonwords is very good. Note that, because they are derived from real regular and exception words, the set of nonwords is composed of two lists of "regular" nonwords and "exception" nonwords. If the basis for the network's decoding ability is the orthographic rime (VC2), rather than single graphemes, we expect that the exception nonwords should be given an irregular pronunciation, that is, they should rhyme with the exception words from which they were derived.

The network responses are listed in Table 4. Of the regular nonwords, the network gives a "regular" pronunciation (i.e. one that obeys the standard GPC rules) to 38 items (88.4%). Amongst the remaining five, one (GROOL) is pronounced to rhyme with an exception word (WOOL → /wUl/), and one has a spurious doubling of the final consonant (WEAT → /witt/). With a less stringent criterion the performance is therefore 93% correct. The remaining three responses do not have acceptable pronunciations, either because of a missing phoneme (e.g. WOTE → /w6U/) or because the pronunciation is irregular even at the level of the rime (e.g. SWEAL → /swel/).

Turning to the exception nonwords, four responses were actual errors (90.7% correct). Of the correct responses, only five nonwords were given pronunciations that follow the standard GPC rules (11.6%). Most of the other nonwords (34/43, 79%) were pronounced to rhyme with a trained word that has the same orthographic rime (VC2) as the nonword. In other words, when the number of trained words is relatively small, individual words can have a much stronger effect on the model's performance, producing a "lexical analogy" effect.
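Whether a given response counts as a rime analogy can be expressed as a simple membership test. The sketch below uses an assumed dictionary mapping orthographic bodies to the rime pronunciations attested among the trained words; the data format and function name are ours, for illustration only.

```python
def is_rime_analogy(nonword_body, response_rime, trained_rimes):
    """True if some trained word sharing the nonword's orthographic body (VC unit)
    was pronounced with the same rime phonology as the network's response."""
    return response_rime in trained_rimes.get(nonword_body, set())

# Illustrative example: the nonword GROOL (body "ool"), produced with the rime /Ul/,
# counts as an analogy because the trained exception word WOOL has the rime /Ul/.
trained_rimes = {"ool": {"Ul"}, "eak": {"ik"}}
print(is_rime_analogy("ool", "Ul", trained_rimes))  # True
```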

2 This is not strictly true for a few rimes within the list of exception words, where the shared rime is pronounced in two different ways (e.g. OVE in LOVE and MOVE; OOD in BLOOD and GOOD).

TABLE 4
Pronunciations Produced by the Network for the Stimuli from Glushko (1979), Experiment 1

Regular    Response      Exception    Response
cath       /kAT/         coth         /k6UT/*
heef       /hif/         heen         /hin/
dreed      /drid/        drood        /drud/
sheed      /Sid/         shead        /Sed/*
wuff       /wVf/         wull         /wUl/*
nust       /nVst/        nush         /nUS/*
pold       /p6Uld/       pomb         /p6Um/*
gode       /g6Ud/        gome         /gVm/*
feal       /fil/         fead         /fed/*
hean       /hin/         heaf         /hef/*
bleam      /blim/        blead        /bled/*
mune       /mjun/        mone         /mun/**
peet       /pit/         poot         /pUt/*
soad       /s6Ud/        sood         /sUd/*
steet      /stit/        steat        /steIt/*
taze       /teIz/        tave         /t&v/*
weat       /witt/**      wead         /wed/*
beed       /bid/         bood         /bUd/*
moop       /mup/         moof         /muf/
cobe       /k6Ub/        cose         /k6Us/
hode       /h6Ud/        hove         /huv/*
beld       /beld/        bild         /baIld/*
pode       /p6Ud/        pove         /puv/*
sust       /sVst/        sost         /s6Ust/*
wote       /w6U/**       wone         /wVn/*
bink       /bI9k/        bint         /bVnt/**
prain      /preIn/       praid        /pred/*
bort       /bOt/         bost         /b6Ust/*
wosh       /w0S/         wush         /wUS/*
brobe      /br6Ub/       brove        /bruv/*
suff       /sVf/         sull         /stUl/**
plore      /plO/         plove        /plVv/*
hoil       /hoIl/        haid         /hed/*
lole       /l6Ul/        lome         /lVm/*
doon       /dun/         doot         /dUt/*
grool      /grUl/*       grook        /gruk/
sweal      /swel/**      sweak        /swek/*
speet      /spit/        speat        /spet/*
dold       /d6Uld/       domb         /dum/*
lail       /leIl/        lool         /lUl/*
meak       /mik/         mear         /me6/*
pilt       /paIlt/**     pild         /paIld/*
dore       /dOn/**       dere         /d3n/**

* irregular pronunciation; ** non-accepted pronunciation.


This study shows that the model can achieve good generalisation (about 90% correct on Glushko's (1979) nonwords) when repeatedly trained on a smaller set of words. The model still exploits regularities above the level of single graphemes, but does not show the regularisation tendency produced by training on large numbers of words. Instead, the influence of individual words is much more clearly seen. This reinforces the point made earlier, that early in the development of spelling–sound knowledge, influences of whole words may be manifest, even in a system (such as this one) with no lexical representations.

The demonstration that exposure to a very small set of words is sufficient for the successful decoding of new items has some important implications. As noted earlier, Share (1995) strongly argued that "direct teaching", that is, direct input of target pronunciations for the thousands of words used to train connectionist models, is a very implausible training regimen. However, this might not be necessary if we give up the assumption, often made by dual-route theorists (e.g. Coltheart et al., 1993), that the two routes are fully independent of each other. In fact, one of the most interesting theories of reading acquisition proposes that the full development of each route is dependent on the other (the self-teaching hypothesis; see Share, 1995). Share and colleagues (Jorm & Share, 1983; Share, 1995; see also Skoyles, 1988) propose that the spelling–sound conversion route functions as a self-teaching mechanism enabling the learner to acquire word-specific orthographic representations (i.e. lexical representations); phonological recoding is regarded as critical to successful reading acquisition. This also implies an asymmetrical pattern in developmental reading disorders: deficits that result in impaired spelling–sound conversion are much more detrimental to reading progress than deficits resulting in impaired orthographic knowledge, because the formation of word-specific associations depends on self-teaching (Share, 1995).

The good generalisation shown by the TLA model when trained on a small set of words opens up the possibility of implementing a self-teaching mechanism. Successful decoding of new items, even with little experience of words, will be very advantageous for acquiring new vocabulary (self-teaching); this, in turn, will extend and refine the knowledge of spelling–sound relationships.
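The self-teaching idea can be summarised procedurally: sublexical decoding supplies a candidate pronunciation for an unfamiliar printed word, and a successful decoding attempt is the occasion for setting up a word-specific orthographic entry. The sketch below is only a schematic rendering of Share's (1995) proposal, not a component of the implemented model; the lexicon formats and the matching step are assumptions.

```python
def self_teach(printed_word, decode, spoken_vocabulary, orthographic_lexicon):
    """One schematic self-teaching step (after Share, 1995).

    decode: a spelling-to-sound routine (e.g. the two-layer network's readout).
    spoken_vocabulary: pronunciations the learner already knows from speech.
    orthographic_lexicon: word-specific spelling entries acquired so far.
    All three are illustrative stand-ins."""
    candidate = decode(printed_word)        # phonological recoding of the new word
    if candidate in spoken_vocabulary:      # decoding matches a known spoken word
        # Successful decoding acts as the teacher: set up a word-specific
        # orthographic entry so the item can later be read lexically.
        orthographic_lexicon[printed_word] = candidate
        return candidate
    return None                             # decoding failed; no lexical entry is formed
```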

Study 6

We argued earlier that the nonword reading abilities of the TLA network, and particularly the capacity to generalise after exposure to a few words, depend on the fact that the input–output representations can make direct contact with each other. In this study we support this idea by looking at the performance of a network in which only indirect (mediated) connections are possible. We therefore prevent the orthographic and phonological
representations from making direct contact with each other. Instead, all interactions are via a layer of hidden units, so that phonology is computed by means of intermediate representations rather than directly from orthography. If such a network does not show the kind of early generalisation illustrated previously, then it would not be of use in self-teaching. We consider the implications of this for models of developmental phonological dyslexia.

It might be objected that this manipulation produces an architecture that is in principle more powerful than the TLA model. Our prediction that the generalisation ability will be poorer would therefore seem counterintuitive, because a three-layer network has more processing resources than the two-layer model. The point we wish to establish here, however, is that, when trained on a small corpus, a network with hidden units will most likely begin by adopting a highly lexical strategy, whereby hidden units become specialised for recognising individual input words. This will lead to poor generalisation compared to the TLA model given the same small amount of training.

For this simulation we used a three-layer feed-forward network, trained with the back-propagation learning algorithm (Rumelhart et al., 1986). The input–output representations for this model were the same as in the previous simulations. To compare this model directly with the model used in Study 5, we used the same training corpus of 86 words and the same testing set of 86 nonwords. The hidden layer of the model contained 50 hidden units. Updating of the weights took place "online", that is, after every presentation of an input–output pair; the learning rate was set to 0.1 and the momentum to 0.6. Training proceeded for 200 epochs, after which the error descent reached asymptote. In the testing phase, the network was presented with a stimulus word and the output activations were recorded. The network's response for a given slot position is that of the most active unit, if any is active, with the additional constraint that it must exceed a threshold of 0.2. Performance was tested on both the training set and the test set (86 nonwords from Glushko, 1979, Experiment 1).
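For comparison with the sketch given for Study 5, a minimal three-layer network of the kind described here (50 hidden units, online weight updates, learning rate 0.1, momentum 0.6) might be written as follows. The input/output sizes and the choice of sigmoid units are our own illustrative assumptions, not the details of the simulation actually run.

```python
import numpy as np

class ThreeLayerNet:
    """Minimal three-layer feed-forward network trained by online back-propagation
    with momentum, using the parameters reported in the text (50 hidden units,
    learning rate 0.1, momentum 0.6).  Sizes and sigmoid units are assumptions."""

    def __init__(self, n_in=8 * 27, n_hidden=50, n_out=6 * 45,
                 lr=0.1, momentum=0.6, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.uniform(-0.1, 0.1, (n_hidden, n_in))   # input -> hidden weights
        self.W2 = rng.uniform(-0.1, 0.1, (n_out, n_hidden))  # hidden -> output weights
        self.dW1, self.dW2 = np.zeros_like(self.W1), np.zeros_like(self.W2)
        self.lr, self.momentum = lr, momentum

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x):
        hidden = self._sigmoid(self.W1 @ x)
        return hidden, self._sigmoid(self.W2 @ hidden)

    def train_pattern(self, x, target):
        """Weights are updated after every (orthography, phonology) pair ("online")."""
        hidden, output = self.forward(x)
        delta_out = (target - output) * output * (1.0 - output)        # output error signal
        delta_hid = (self.W2.T @ delta_out) * hidden * (1.0 - hidden)  # back-propagated error
        self.dW2 = self.lr * np.outer(delta_out, hidden) + self.momentum * self.dW2
        self.dW1 = self.lr * np.outer(delta_hid, x) + self.momentum * self.dW1
        self.W2 += self.dW2
        self.W1 += self.dW1

# Readout at test: as in the earlier sketch, the most active unit per phoneme slot,
# but only if its activation exceeds the 0.2 threshold mentioned in the text.
```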

Results. After 250 epochs of training, the network shows 100% correct performance on the training corpus. Turning to the generalisation performance on Glushko's (1979) nonword stimuli, we find that the results are strikingly poor compared to those of the network in the previous simulation. To assess the network's performance, we will not use the stringent criterion based on "regular" (i.e. GPC) nonword pronunciations, but rather the "lenient" criterion (see Plaut et al., 1996; Zorzi et al., 1998), which scores a given response as correct if the rime pronunciation is that of some real word in the training set.
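To make the scoring concrete, a minimal sketch of the lenient criterion, together with the lexicalisation category used in the error analysis below, might look as follows; the dictionary formats and helper names are our own assumptions, not the scoring code actually used.

```python
def score_response(nonword_body, response, response_rime,
                   trained_pronunciations, trained_rimes):
    """Classify one nonword response under assumed, illustrative data formats.

    trained_pronunciations: set of whole-word pronunciations in the training corpus.
    trained_rimes: dict mapping an orthographic body (e.g. "eet") to the set of
    rime pronunciations given to trained words with that body."""
    if response in trained_pronunciations:
        # The output is exactly the phonology of a trained word: a lexicalisation.
        return "lexicalisation"
    if response_rime in trained_rimes.get(nonword_body, set()):
        # Lenient criterion: the rime matches some trained word's rime pronunciation.
        return "correct (lenient)"
    return "true error"

# Illustrative use: PEET read as /fit/ is a lexicalisation if /fit/ (FEET) was trained.
print(score_response("eet", "fit", "it", {"fit", "pit"}, {"eet": {"it"}}))
```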


On the regular nonwords, the network produces 23/43 errors (46.5% correct); 18 of these are true errors (wrong or missing phonemes, e.g. BLEAM → /bid/, MOOP → /hu/) and 5 are lexicalisations. The latter are responses that correspond precisely to the phonology of a word the network was trained on (e.g. CATH → /bAT/, from BATH; PEET → /fit/, from FEET). On the exception nonwords, the network produces 22/43 errors (48.8% correct), 18 of which are true errors and 4 of which are lexicalisation errors. The nonword pronunciations are reported in Table 5. Thus, on the entire set of 86 nonwords, the network gives wrong pronunciations to over 52% of the stimuli, and 20% of the errors are lexicalisations.

Overall, then, the model produces good reading of known words, poor nonword reading, and a high percentage of lexicalisation errors, a profile suggesting that the network has adopted a lexical strategy. Notably, this happens in spite of the fact that the orthographic and phonological representations are still based on positionally defined letters and phonemes. A possible objection about the size of the training corpus (e.g. Seidenberg & McClelland, 1990) cannot strictly be applied to our simulation, because the network's performance is in sharp contrast with the results of the previous simulation, where the two-layer network trained on the same corpus was able to read 90% of the same nonword stimuli correctly. Clearly, a larger training corpus would force the network to rely more on the relationships between single letters and phonemes; in the limit, this will also depend on the number of hidden units provided to the network. The important point, however, is that the poor generalisation that results from exposure to a small set of words would have a crucial impact on the self-teaching mechanism: decoding of new words has little chance of being successful and will in turn result in poor real word reading (see Share, 1995).

This study has some implications with regard to developmental phonological dyslexia. Manis et al. (1996) and Plaut et al. (1996) offer the S&M model as an account of phonological dyslexia. The argument of Manis et al. is that the model showed little generalisation to new items due to its defective input and output representations (Wickelfeatures), and hence that representational problems might be a sufficient cause of the phonological dyslexic pattern. Our study suggests that, in addition, it is important for early generalisation (and hence self-teaching) that the components of the spelling and sound representations make direct contact.


TABLE 5
Pronunciations Produced by the Network for the Stimuli from Glushko (1979), Experiment 1

Regular    Response      Exception    Response
cath       /bAT/**       coth         /k6UT/
heef       /hif/         heen         /hin/
dreed      /drid/        drood        /dUd/*
sheed      /sed/*        shead        /sed/*
wuff       /Vf/*         wull         /wUl/
nust       /Vst/*        nush         /pUS/**
pold       /p6Uld/       pomb         /pum/*
gode       /g6Ud/        gome         /Vm/*
feal       /wil/*        fead         /ed/*
hean       /hin/         heaf         /hef/
bleam      /bid/*        blead        /bled/
mune       /mjun/        mone         /mun/
peet       /fit/**       poot         /pUt/
soad       /s6Ud/        sood         /sUd/
steet      /swit/**      steat        /swit/**
taze       /heIz/**      tave         /huv/*
weat       /wit/         wead         /wed/
beed       /bid/         bood         /bVd/
moop       /hu/*         moof         /uf/*
cobe       /k6Ub/        cose         /6U/*
hode       /h6Ud/        hove         /huv/
beld       /beld/        bild         /baIld/
pode       /p6Ud/        pove         /puv/
sust       /sVst/        sost         /s6Ust/
wote       /w6U/*        wone         /wn/*
bink       /bink/        bint         /bVnt/*
prain      /preIn/       praid        /pr&d/
bort       /6Ut/*        bost         /b6Ust/
wosh       /w0l/*        wush         /wUS/
brobe      /br6Ub/       brove        /brVv/*
suff       /sVf/         sull         /sUl/
plore      /pO/*         plove        /plVv/*
hoil       /l/*          haid         /hl&d/*
lole       /l6U/*        lome         /lVm/
doon       /dun/         doot         /fUt/**
grool      /gu/*         grook        /guk/*
sweal      /stel/*       sweak        /stek/*
speet      /swit/**      speat        /swet/**
dold       /k6Uld/**     domb         /dum/
lail       /eI/*         lool         /Ul/*
meak       /ek/*         mear         /me/*
pilt       /paIlt/*      pild         /paIld/
dore       /dO/          dere         /d3n/*

* True errors; ** lexicalisations.


CONCLUSIONS

It can be maintained that the theoretical progress of developmental cognitive psychology has been limited by its lack of any kind of unifying learning theory, a situation which has been exacerbated by the general rejection of associationism that accompanied the "cognitive revolution". The development of connectionist network models has brought associationist accounts of knowledge into the mainstream of cognitive psychology and, with that, must inevitably bring associative learning into developmental cognitive psychology (see Elman, Bates, Johnson, Karmiloff-Smith, Parisi, & Plunkett, 1996, for a review of relevant modelling work). Models of reading are one area where cognitive theory has benefited greatly from connectionist modelling, permitting detailed quantitative accounts to be developed which would be impossible with informal, diagrammatic models. This development must raise the issue of the relevance of associative learning to the problem of learning to read.

In this article, we have looked at a simple model of the mapping from spelling to sound in English, which can be constructed from example data using an associative learning rule that has previously found considerable support in studies of animal and human learning. To work well, the model benefits from the use of input–output representations which have independent empirical support, and requires that the spelling–sound representations can make direct contact with each other. Given this, the model will discover spelling–sound regularities at the single letter, grapheme, and further levels, and can exploit these regularities in generalisation from an early stage, as do children. Even though it is trained on many exception words, in the long run the exceptions tend to get obliterated by regularities in the mass of words, owing to the architecture's inherent lack of representational power. When these features are not present, and the mapping must be indirect, whole-word reading is good but generalisation is impaired, because the model will tend to read on a lexical basis, a pattern resembling developmental phonological dyslexia.

Interestingly, Share (1995, p. 199) notes that "Although connectionist models claim to simulate human printed word learning, direct input of target pronunciations for the several thousand words used in the training corpus implies subscription to the dubious direct teaching option." It is therefore important to demonstrate, as we did in Study 5, that exposure to a very small set of words is sufficient for the successful decoding of new items. We believe that this finding opens up the possibility of implementing the idea of a self-teaching mechanism (Share, 1995) that enables the learner (i.e. the model) to acquire word-specific orthographic representations when new words are encountered.

There is a large amount of data indicating that the representation of the English spelling–sound system becomes modified and refined with increasing print experience, evolving into a more complete and sophisticated understanding of the relationships between orthography and phonology (see Share, 1995, for review). This and other developmental
aspects are not captured by the Coltheart et al. (1993) implementation of the phonological assembly route. In the DRC model, this route uses explicit GPC rules that are discovered and stored by a specially constructed "learning algorithm" in a single pass through the training database; our model, on the other hand, discovers various levels of spelling–sound correspondence using a simple, general learning rule. One might question the plausibility of a learning algorithm that is specific only to reading; furthermore, many recent studies suggest that the GPC characterisation is inadequate, and that the phonological route also operates on larger orthographic units (e.g. subsyllabic units such as word onsets and word bodies; see Treiman et al., 1995).

Just as with adult reading data, the ability to address developmental findings in any quantitative detail will depend on the use of explicit computational models. In turn, constraints derived from developmental studies, including the viability of learning rules, should inform models of mature performance. We hope the studies presented here will contribute to the further integration of developmental data into models of mature cognitive performance.

REFERENCES

Baron, J. (1977). Mechanisms for pronouncing printed words: Use and acquisition. In D. LaBerge & S.J. Samuels (Eds), Basic processes in reading: Perception and comprehension (pp. 175–216). Hillsdale, NJ: Lawrence Erlbaum Associates Inc.
Baron, J. (1979). Orthographic and word specific mechanisms in children's reading of words. Child Development, 50, 60–72.
Besner, D., Twilley, L., McCann, R.S., & Seergobin, K. (1990). On the association between connectionism and data: Are a few words necessary? Psychological Review, 97, 432–446.
Bradley, L., & Bryant, P.E. (1983). Categorising sounds and learning to read: A causal connection. Nature, 301, 419–421.
Bullinaria, J.A., & Chater, N. (1995). Connectionist modelling: Implications for cognitive neuropsychology. Language and Cognitive Processes, 10, 227–264.
Campbell, R., & Butterworth, B. (1985). Phonological dyslexia and dysgraphia in a highly literate subject: A developmental case with associated deficits of phonemic processing and awareness. Quarterly Journal of Experimental Psychology, 37A, 435–475.
Carr, T.H., & Pollatsek, A. (1985). Recognizing printed words: A look at current models. In D. Besner, T.G. Waller, & G.E. MacKinnon (Eds), Reading research: Advances in theory and practice, Vol. 5 (pp. 1–82). San Diego, CA: Academic Press.
Castles, A., & Coltheart, M. (1993). Varieties of developmental dyslexia. Cognition, 47, 148–180.
Coltheart, M. (1978). Lexical access in simple reading tasks. In G. Underwood (Ed.), Strategies of information processing. London: Academic Press.
Coltheart, M. (1985). Cognitive neuropsychology and the study of reading. In M.I. Posner & O.S.M. Marin (Eds), Attention and performance, Vol. XI. Hillsdale, NJ: Lawrence Erlbaum Associates Inc.
Coltheart, M., Curtis, B., Atkins, P., & Haller, M. (1993). Models of reading aloud: Dual-route and parallel-distributed-processing approaches. Psychological Review, 100, 589–608.

Coltheart, M., Langdon, R., & Haller, M. (1996). Computational cognitive neuropsychology. In B. Dodd, L. Worrall, & R. Campbell (Eds), Models of language: Illuminations from impairment. London: Whurr Publishers.
Coltheart, M., & Rastle, K. (1994). Serial processing in reading aloud: Evidence for dual-route models of reading. Journal of Experimental Psychology: Human Perception and Performance, 20, 1197–1211.
Coltheart, V., & Leahy, J. (1992). Children's and adults' reading of nonwords: Effects of regularity and consistency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 183–196.
Dell, G.S. (1986). A spreading activation theory of retrieval in sentence production. Psychological Review, 93, 283–321.
Denes, F., Cipolotti, L., & Zorzi, M. (1996). Dislessie e disgrafie acquisite. In L. Pizzamiglio & G. Denes (Eds), Manuale di Neuropsicologia, II Edizione. Bologna, Italy: Zanichelli.
Ellis, A.W. (1993). Reading, writing and dyslexia: A cognitive analysis (2nd edn). Hove, UK: Lawrence Erlbaum Associates Ltd.
Elman, J.L., Bates, E.A., Johnson, M.H., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking innateness: A connectionist perspective on development. Cambridge, MA: MIT Press.
Frith, U. (1985). Beneath the surface of surface dyslexia. In K.E. Patterson, J.C. Marshall, & M. Coltheart (Eds), Surface dyslexia: Neuropsychological and cognitive studies of phonological reading. Hove, UK: Lawrence Erlbaum Associates Ltd.
Funnell, E. (1983). Phonological processing in reading: New evidence from acquired dyslexia. British Journal of Psychology, 74, 159–180.
Glasspool, D.W., Houghton, G., & Shallice, T. (1995). Interactions between knowledge sources in a dual-route connectionist model of spelling. In L.S. Smith & P.J.B. Hancock (Eds), Neural computation and psychology. London: Springer-Verlag.
Gluck, M.A., & Bower, G.H. (1988a). Evaluating an adaptive network model of human learning. Journal of Memory and Language, 27, 166–195.
Gluck, M.A., & Bower, G.H. (1988b). From conditioning to category learning: An adaptive network model. Journal of Experimental Psychology: General, 117, 227–247.
Glushko, R.J. (1979). The organization and activation of orthographic knowledge in reading aloud. Journal of Experimental Psychology: Human Perception and Performance, 5, 674–691.
Goswami, U. (1986). Children's use of analogy in learning to read: A developmental study. Journal of Experimental Child Psychology, 42, 73–83.
Goswami, U. (1988). Orthographic analogies and reading development. Quarterly Journal of Experimental Psychology, 40A, 239–268.
Goswami, U. (1991). Learning about spelling sequences in reading: The role of onsets and rimes. Child Development, 62, 1110–1123.
Goswami, U. (1993). Towards an interactive analogy model of reading development: Decoding vowel graphemes in beginning reading. Journal of Experimental Child Psychology, 56, 443–475.
Goswami, U., & Bryant, P. (1990). Phonological skills and learning to read. Hove, UK: Lawrence Erlbaum Associates Ltd.
Goswami, U., & Mead, F. (1992). Onset and rime awareness and analogies in reading. Reading Research Quarterly, 27, 152–162.
Hartley, T., & Houghton, G. (1996). A linguistically constrained model of short-term memory for nonwords. Journal of Memory and Language, 35, 1–31.
Houghton, G., & Tipper, S.P. (1994). A model of inhibitory mechanisms in selective attention. In D. Dagenbach & T.H. Carr (Eds), Inhibitory processes in attention, memory, and language (pp. 53–112). New York: Academic Press.

Houghton, G., & Zorzi, M. (submitted). On the interaction of knowledge sources in cognitive processes: Insights from models of reading and spelling.
Howard, D., & Best, W. (1996). Developmental phonological dyslexia: Real word reading can be completely normal. Cognitive Neuropsychology, 13, 887–934.
Hulme, C., & Snowling, M. (1992). Deficits in output phonology: An explanation of reading failure? Cognitive Neuropsychology, 9, 47–72.
Jorm, A.F., & Share, D.L. (1983). Phonological recoding and reading acquisition. Applied Psycholinguistics, 4, 103–147.
Manis, F.R., Seidenberg, M.S., Doi, L.M., McBride-Chang, C., & Petersen, A. (1996). On the bases of two subtypes of developmental dyslexia. Cognition, 58, 157–195.
Marsh, G., Friedman, M.P., Welch, V., & Desberg, P. (1980). A cognitive–developmental approach to reading acquisition. In G.E. McKinnon & T.G. Waller (Eds), Reading research: Advances in theory and practice, Vol. 3. New York: Academic Press.
McCarthy, R., & Warrington, E.K. (1986). Phonological reading: Phenomena and paradoxes. Cortex, 22, 359–380.
Morais, J., Cary, L., Alegria, J., & Bertelson, P. (1979). Does awareness of speech as a sequence of phonemes arise spontaneously? Cognition, 7, 323–331.
Nagy, W.E., & Herman, P.A. (1987). Breadth and depth of vocabulary knowledge: Implications for acquisition and instruction. In M. McKeown & M. Curtis (Eds), The nature of vocabulary acquisition. Hillsdale, NJ: Lawrence Erlbaum Associates Inc.
Norris, D. (1994). A quantitative, multiple levels model of reading aloud. Journal of Experimental Psychology: Human Perception and Performance, 20, 1212–1232.
Patterson, K.E., & Coltheart, V. (1987). Phonological processes in reading: A tutorial review. In M. Coltheart (Ed.), Attention and performance: Vol. XII. The psychology of reading. Hillsdale, NJ: Lawrence Erlbaum Associates Inc.
Plaut, D.C., McClelland, J.L., Seidenberg, M.S., & Patterson, K.E. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103, 56–115.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart & J.L. McClelland (Eds), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press.
Seidenberg, M.S., & McClelland, J.L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.
Seidenberg, M.S., & McClelland, J.L. (1990). More words, but still no lexicon: Reply to Besner et al. (1990). Psychological Review, 97, 452–477.
Seymour, P.H.K., & Elder, L. (1986). Beginning reading without phonology. Cognitive Neuropsychology, 3, 1–36.
Shallice, T. (1988). From neuropsychology to mental structure. Cambridge, UK: Cambridge University Press.
Shallice, T., Glasspool, D., & Houghton, G. (1995). Can neuropsychological evidence inform connectionist modelling? Analyses from spelling. Language and Cognitive Processes, 10, 195–225.
Shanks, D.R. (1991). Categorization by a connectionist network. Journal of Experimental Psychology: Learning, Memory and Cognition, 17, 433–443.
Share, D.L. (1995). Phonological recoding and self-teaching: Sine qua non of reading acquisition. Cognition, 55, 151–218.
Siegel, S., & Allan, L.G. (1990). The widespread influence of the Rescorla–Wagner model. Psychonomic Bulletin and Review, 3, 314–321.
Skoyles, J.R. (1988). Training the brain using neural-network models. Nature, 333, 401.
Stothard, S.E., Snowling, M.J., & Hulme, C. (1996). Deficits in phonology but not dyslexic? Cognitive Neuropsychology, 13, 641–672.

Stuart, M., & Masterson, J. (1992). Patterns of reading and spelling in 10-year-old children related to prereading phonological abilities. Journal of Experimental Child Psychology, 54, 168–187.
Sutton, R.S., & Barto, A.G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–170.
Treiman, R. (1984). Individual differences among children in reading and spelling styles. Journal of Experimental Child Psychology, 37, 463–577.
Treiman, R. (1986). The division between onsets and rimes in English syllables. Journal of Memory and Language, 25, 476–491.
Treiman, R. (1989). The internal structure of the syllable. In G. Carlson & M. Tanenhaus (Eds), Linguistic structure in language processing (pp. 27–52). Dordrecht, The Netherlands: Kluwer.
Treiman, R., & Chafetz, J. (1987). Are there onset- and rime-like units in printed words? In M. Coltheart (Ed.), Attention and performance: Vol. XII. The psychology of reading. Hove, UK: Lawrence Erlbaum Associates Ltd.
Treiman, R., & Danis, C. (1988). Short-term memory errors for spoken syllables are affected by the linguistic structure of the syllables. Journal of Experimental Psychology: Learning, Memory and Cognition, 14, 145–152.
Treiman, R., Goswami, U., & Bruck, M. (1990). Not all nonwords are alike: Implications for reading development and theory. Memory and Cognition, 18, 559–567.
Treiman, R., Mullennix, J., Bijeljac-Babic, R., & Richmond-Welty, E.D. (1995). The special role of rimes in the description, use, and acquisition of English orthography. Journal of Experimental Psychology: General, 124, 107–136.
Venezky, R. (1970). The structure of English orthography. The Hague, The Netherlands: Mouton.
Widrow, B., & Hoff, M.E. (1960). Adaptive switching circuits. In Institute of Radio Engineers, Western Electronic Show and Convention Record, Pt. 4 (pp. 96–104).
Zorzi, M., Houghton, G., & Butterworth, B. (1998). Two routes or one in reading aloud? A connectionist dual-process model. Journal of Experimental Psychology: Human Perception and Performance, 24.
Zorzi, M., & Umiltà, C. (1995). A computational model of the Simon effect. Psychological Research, 58, 193–205.