Master Thesis - ENS

Model of Acquisition of Phonetic Categories by Young Infants: A Templatic Representation for Speech Signals

Master Thesis Submitted in September 2007 by

Guillaume Beraud-Sudreau
Master of Cognitive Science / Master de Sciences Cognitives
Ecole Normale Supérieure / Ecole des Hautes Etudes en Sciences Sociales / Université Paris V

Internship Directors:

Emmanuel Dupoux, Laboratoire de Sciences Cognitives et Psycholinguistique (EHESS/ENS)
Shigeki Sagayama, Sagayama & Ono Laboratory, University of Tokyo


Table of Contents

1 Introduction
  1.1 The logical problem of language acquisition
  1.2 Early language acquisition
  1.3 Focus of the present work
  1.4 Why is the acquisition of phonetic categories a hard problem?
    1.4.A Parsing of continuous signal
      i: General problem
      ii: Cross-linguistic variations in spectral boundaries
      iii: Cross-linguistic variations in temporal boundaries
      iv: Cross-linguistic variations in inventory sizes
      v: Within-language variability
    1.4.B Problems treated, simplifications
  1.5 Featural codes for speech recognition
    1.5.A Limitations of featural codes
  1.6 Holistic or template-based codes
    1.6.A Other templatic approaches
2 A template-based representation of the speech stream
  2.1 Presentation
    2.1.A Structure of the model
    2.1.B Produced representation
    2.1.C Nature of the reference sounds
  2.2 Particularities of the new representation
    2.2.A Adaptation to the learned language
    2.2.B Transformation of trajectories into points
    2.2.C Specific case: affricate phonemes
  2.3 Implementation and testing of the model
    2.3.A Objectives and presentation of the implementation
    2.3.B Detailed algorithm
      i: Constitution of the reference base
      ii: Comparison with new stimuli
      iii: Obtained representation
      iv: Clustering and score
      v: Comparison with a direct clustering algorithm
    2.3.C Input data
      i: Easy set: [r m s a e i]
      ii: Hard set: [p t k u o y]
      iii: Complete set ('all'): [r m s p t k a e i u o y]
      iv: Polysyllabic set: [R ȓ d m ø a i u]
  2.4 Experiments achieved
    2.4.A Experiment 1: efficiency of the new comparison
      i: Supervised clustering
        a) Monosyllabic sets
        b) Polysyllabic sets
          2.4.A.i.b.1 Presentation of the problem
          2.4.A.i.b.2 Comparison between syllabic and phonetic segmentations
          2.4.A.i.b.3 Segmentation of the continuous speech
          2.4.A.i.b.4 Random templates
      ii: Unsupervised clustering
    2.4.B Impact of the language of the reference base
    2.4.C Enhancements
  2.5 Known limitations of the proposed model
3 Conclusion

Appendix
1 Used algorithms: DTW and 1-pass DP
  1.1 DTW algorithm
  1.2 Comparison between the proposed clustering method and the "K-means over Trajectories" method
2 Speech segmentation based on template permutations
  2.1.A Basic concept
  2.1.B Description of the algorithm
    i: Expectation step
    ii: Maximization step

1 Introduction

1.1 The logical problem of language acquisition

During their first years of life, babies spontaneously learn the language spoken by the adults around them. This acquisition, which is remarkably fast, relies on mechanisms that are still unknown. In 1979, N. Chomsky proposed the concept of Universal Grammar, according to which the acquisition of language rests on innately specified, specialized mechanisms [Chomsky 79]. This point of view is supported by the observation that babies acquire the capacity to generate an infinite set of sentences from a finite, small, and noisy set of stimuli. In addition, the input provided to infants contains only positive examples of structures that are admissible in their language, with no counter-examples and no negative feedback. The impossibility of inducing an infinite language from such partial and finite data hence motivated the idea that infants come equipped with knowledge about the subject matter, in the form of an innate universal grammar (see also [Osherson et al., 1984]). Such a priori arguments are based on an analysis of the structure of language competence in the adult, not on real acquisition data. They therefore leave unspecified many aspects of the actual learning mechanisms that may be used by human infants (see Elman et al. and Marcus for an extended debate on nativism versus empiricism). Here, we propose to explore the opposite approach, namely, to examine how far one can go with rather simple statistical learning mechanisms or signal processing algorithms running on actual speech data. Our aim is hence to specify which parts of human language competence have to be prewired, and which can be extracted from the data using statistical procedures.

1.2 Early language acquisition

Human babies learn a great deal about their language before they speak, but experimental studies of such early language acquisition appeared only 30 years ago, with the development of new experimental techniques (such as the High Amplitude Sucking Paradigm, the Preferential Looking Paradigm, or the Head Turn Procedure [Eimas et al., 1971]). In the 90's, brain imaging techniques allowed a more direct observation of the cerebral activity of babies (using Event-Related Potentials [Dehaene-Lambertz and Dehaene, 1994], Near Infrared Spectroscopy [Pena et al., 2003], or functional Magnetic Resonance Imaging [Dehaene-Lambertz et al., 2002]). These techniques, based on indirect measures instead of the infant's oral productions, enabled researchers to gather a great amount of new data and observations regarding the first steps of speech acquisition.

One of the surprising findings of these studies is that newborns and preverbal infants are extremely competent at making very detailed phonetic distinctions. They can discriminate minimal phonetic contrasts in stop consonants, which are notoriously difficult to classify. Infants are typically better at discrimination than adults: for instance, English 6-month-old infants, unlike adults, can discriminate the contrast between dental [t] and retroflex [ʈ], which is used in Hindi but not in English [Werker and Tees, 1984a; 1984b]. They can nevertheless ignore changes in talker when making consonant or vowel discriminations [Kuhl, 1983]. In addition, human newborns can determine whether pairs of sentences belong to the same language, even if they have never heard it before [Nazzi et al., 1998], and they can discriminate multisyllabic utterances on the basis of their number of syllables [Bertoncini and Mehler, 1981].

During the first year of life, perception becomes more and more specific to the language(s) spoken in the environment, and contrasts which are not used in these language(s) are less and less well perceived. As far as phonetic categories are concerned (the consonants and vowels of the language), this tuning starts with the acquisition of the vowel categories, which seems to take place around 6 months of age [Kuhl et al., 1992], followed by the acquisition of the consonants and the loss of the non-distinctive contrasts at the end of the 1st year [Werker and Tees, 1984a]. At the same time, infants learn the sequential constraints which govern the ordering of consonants and vowels: at 9 months, a baby can detect which sequences of phonemes are allowed and which are not [Friederici and Wessel, 1993]. During the second half of the first year of life, infants develop more and more specialized processing of the prosody of their language [Hirsh-Pasek et al., 1987], as well as of the cues which signal word boundaries [Jusczyk et al., 1996]. This early tuning process seems to go hand in hand with a more definite specialization of the left hemisphere for language [Dehaene-Lambertz et al., 2006].

Although the exact mechanisms governing the acquisition of language are still unknown, it is agreed that the baby uses the perceived speech stream to carry out a statistical learning of the particularities of his language. This learning is based on the analysis of regularities within the language. For instance, it has been shown that babies are sensitive to the distribution of sounds (unimodal vs. bimodal distributions), and can build categories according to these distributions [Maye et al, 2002]. Babies are also sensitive to other statistical regularities, such as transition probabilities between segments [Mattys and Jusczyk, 2001], as well as the presence of complementary distributions between segments [Peperkamp and Morgan, in press].
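To make the distributional learning idea concrete, here is a minimal sketch (not from the thesis) of how a learner could decide whether a one-dimensional acoustic cue, such as Voice Onset Time, is drawn from one mode or two, in the spirit of [Maye et al, 2002]. The data are synthetic, and the use of scikit-learn Gaussian mixtures compared by BIC is an illustrative assumption.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic VOT values (ms): a bimodal sample, e.g. /b/-like vs /p/-like.
vot = np.concatenate([rng.normal(15, 5, 200), rng.normal(70, 10, 200)])
X = vot.reshape(-1, 1)

# Compare a one-mode and a two-mode description of the distribution.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in (1, 2)}
n_categories = min(bics, key=bics.get)  # 2 for this bimodal sample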

1.3 Focus of the present work

The objective of this report is to present a model of phonetic category acquisition. As we have seen previously, this acquisition is achieved before one year of age, which means that infants cannot have been using lexical knowledge to drive it: at the end of the first year of life, the recognition lexicon of infants is believed to contain only around 20 words. Hence, the data suggest that the only possible strategy for such an early acquisition is an unsupervised algorithm looking for modes in the phonetic distribution [Maye and Gerken, 2002]. Yet, this leaves open a number of hard questions regarding the particular clustering algorithm which is used, as well as the type of preprocessing applied to the acoustic signal to optimize the possibility of finding proper phonetic categories.

1.4 Why is the acquisition of phonetic categories a hard problem?

1.4.A Parsing of continuous signal

i: General problem

The physical substratum of speech is continuous, both in the spectral dimension and in the time dimension. Segmenting this signal into discrete phonetic units requires finding the spectral and the temporal boundaries of these units simultaneously. Yet, these boundaries are difficult to find for two reasons: they vary widely across languages, and they are also highly variable within a language.

ii: Cross-linguistic variations in spectral boundaries

The fact that languages vary in the physical realization of phonetic categories is best illustrated by the inventory of vowels. Vowel categories have been described in terms of prototype theory [Kuhl, 1992; Lacerda, 1995]: they have a center, or best exemplar, and a domain, where exemplars further and further from the center are perceived as less and less typical. For instance, French and English differ in the center and domain of most of their vowels. Kuhl nevertheless proposed that such cross-linguistic variation might be limited by the existence of innate psychophysical boundaries that have to be respected by linguistic systems. Although an innate 30 ms psychoacoustic boundary has been proposed for the durational cue distinguishing voiced from unvoiced consonants (Voice Onset Time) [Pisoni et al., 1980], such boundaries have not been documented for vowels or for other spectral characteristics of consonants. Language-specific phonetic categories hence appear to be primarily the result of some clustering process applied to the speech sounds of a given language in a space of spectral or spectro-temporal parameters.

iii: Cross-linguistic variations in temporal boundaries

A less well studied domain of variation regards the temporal axis. Here, we are not talking about temporal cues for phonetic contrasts (such as VOT, or the slope of formant transitions), but about the way in which a given stretch of speech is segmented into consecutive phonemes. For instance, the non-word "ebzo" is considered to have 2 consonants and 2 vowels by English listeners, but Japanese listeners will report hearing 2 consonants and 3 vowels ("ebuzo"). These perceptual differences are mainly due to the fact that sequences of consonants are allowed in English, while they are not possible in Japanese. The Japanese perceptual system interprets the small formant transition present between /b/ and /z/ as a full vowel, which turns the sequence into a legal one in the language. In contrast, the English or French perceptual system simply ignores this information, or interprets it as being part of the respective consonants [Dupoux et al., 1999].

This problem is very general and not restricted to the perception of illusory vowels. For instance, let us consider the sequence "ts". In some languages (e.g. Italian), it is considered as a single affricate phoneme [tˢ]; in others (e.g. French), as the concatenation of two phonemes, "t" and "s"; in yet others (e.g. Canadian French), as a contextual allophone of the phoneme [t]. In order to construct a model of phonetic category acquisition, it is hence important to pay attention to this double aspect of parsing, simultaneously along the time and the spectral dimensions.

iv: Cross-linguistic variations in inventory sizes

These difficulties are increased by the lack of a priori knowledge about the phonetic categories to learn. In particular, the number of categories to create is a priori unknown, and can vary greatly among languages: some languages use 3 vowels, while others use 20, and the number of consonants in a given language can vary from 6 to almost a hundred. On the other hand, the average duration of phonetic units seems relatively stable across languages (between 50 ms and 150 ms); this duration window is probably a very useful cue for building the phonetic units. Note that the lack of a priori knowledge regarding the phonetic categories makes their acquisition all the more difficult given that the task has to be fully achieved in an unsupervised way (i.e., the learning is not guided by any information other than the speech signal itself).

v: Within-language variability

Finally, even if one focuses on a single language, the physical signal available for the acquisition of the phonetic categories is very noisy. This is due to three factors. First, the speech stream which arrives at the ears of infants is mixed with a variety of nonspeech noises (environmental noises), and the speech signal is itself distorted (filtering, reverberation). Second, the phonetic realizations of the categories are not the same across individuals: differences in gender, age, and vocal tract size considerably modify the acoustic properties of speech segments.

Finally, and most importantly for us, the phonetic categories are also variable within individuals. This source of variation, highly specific to speech signals, is generally called the coarticulation effect. It is due to the fact that speech is produced by a complex set of independently controlled articulators. In order to minimize energy expenditure, our motor control system tends to anticipate future segments by preparing the articulators for their upcoming targets. For instance, in the syllable /Su/ (as in "shoe"), the fricative is already rounded in preparation for the rounded vowel; in /Si/ (as in "she"), in contrast, the fricative is not rounded. The result is that the spectral characteristics of the fricative are contaminated by the following vowel. Vice versa, due to the inertia of the articulators, consonants and vowels are influenced by the preceding ones. Hence, depending on the context, the physical realization of a phoneme varies strongly from one occurrence to another. Some researchers have even proposed that segments have no invariant properties, because they are planned as entire articulatory gestures [Fowler et al., 1993]. These noise sources, which strongly disturb the mapping between abstract phonemes and their acoustic realizations, do not seem to disturb either infant acquisition or adult comprehension of the speech signal.

1.4.B Problems treated, simplifications

In this report, we ignore the first two sources of variability by considering a single speaker, recorded in a quiet and non-reverberating environment. These precautions allowed us to focus on what we consider to be the main difficulties, in particular the coarticulation problem and the unsupervised nature of phonetic acquisition.

1.5 Featural codes for speech recognition

Early models of speech acquisition have assumed that the speech stream is coded into a low dimensional space of local features (see [Eimas et al., 1975] for a theory of phonetic features). Speech recognition systems typically encode the speech stream into a low dimensional vector of local spectral features (for instance, a vector of 20 or so MFCC or LPC coefficients). These coefficients are computed every 5 ms, and represent an analysis of the speech stream over a time window of around 10-30 ms. Unlike in the proposal of Eimas, however, these features are not claimed to have a linguistic interpretation, but are used solely for efficiency reasons. Finally, neurophysiological models of speech perception also propose an early representation of speech in terms of local features: the cochlea performs a spectral decomposition of sounds, and the coding in the auditory nerve has been described as filtering through a basis of short-term wavelet-type functions [Smith and Lewicki, 2006]. All of these models share the idea that the speech perception system is organized hierarchically, starting with local acoustic features, which are used to build higher order structures like phonemes, syllables, and words.

1.5.A Limitations of featural codes

The main deficit of featural approaches is that they are too local to adequately capture the sources of variation that affect the acoustic properties of phonetic categories. Psychoacoustic models have for the most part failed to provide features with the required degree of invariance with respect to coarticulation [Blumstein and Stevens, 1979]. Speech recognition systems compensate for this defect of features by providing another, much more holistic representational level: the level of words. Words are represented as sets of nodes or states connected through transition probabilities, each state representing a probability distribution over the feature space. Such models, called Hidden Markov Models (HMM), are very successful in achieving invariant recognition of words, but they are heavily trained in a supervised fashion (the words or phonetic tags are provided to the learning system). Obviously, such an approach is not possible for an unsupervised learning system.

1.6 Holistic or template-based codes

This report proposes to explore another, more recent class of models, based on more global or holistic representations. The basic idea is that the invariant properties that correspond to phonemes are not present at the level of local features, but rather at the level of entire articulatory trajectories (i.e., chunks of speech of the size of syllables). In such models (see Figure 1), the system is based on segmenting and storing large templates, which are the basis for discovering the more abstract segments, which in turn can be used to recover the even more abstract linguistic features. In other words, this model turns the standard hierarchical model on its head, and starts with big units rather than small ones. This approach makes two important presuppositions: 1. that the problem of segmentation into templates or acoustic syllables is simpler than the original problem of segmentation into phonetic categories; 2. that the acoustic templates provide a representation which is more useful for discovering phonetic categories than the original featural representations. In this report, we mostly focus on the second presupposition, although we report in the Appendix an original algorithm achieving the syllabic segmentation, based on an even more global point of view, and thus coherent with the proposed model.

[Figure 1 diagram: two panels, "Classical architecture" (Acoustics, Acoustic features, Phonemes, Syllables, Word, Grammar) and "Proposed architecture" (Acoustics, Acoustic syllables, Phonemes, Features, Word, Grammar).]

Figure 1. Comparison between the classical hierarchical architecture based on local features and our anti-hierarchical approach based on templates. Note that a template model is consistent with the finding that newborn infants can count the number of syllables in the speech stream before they can pay attention to individual segments [Bertoncini and Mehler, 1981].


1.6.A Other templatic approaches

The idea of a similarity-based representation was introduced by R. N. Shepard [Shepard, 1968], and more recently developed by S. Edelman [Edelman, 1998], who proposed a visual recognition system based on an analysis of similarities over similarities: visual stimuli are compared to a given set of patterns, and the results of these comparisons are used as a new representation of the stimuli. Conventional recognition systems often categorize objects according to their similarities with prototypes of the categories to recognize; these similarities are usually treated as the result of the classification, since the category associated with a stimulus is the one whose prototype has the minimum distance to the input. Edelman instead proposes to base the analysis on similarities between observations, and to compare these similarities with one another (creating similarities of similarities). Such analyses, named second-order isomorphism, allow building much more complex classifications than a simple recognition system based on a similarity measure followed by a winner-takes-all rule. In our work, we follow this concept in the most literal way: the stimuli are first compared with the reference sounds, and the results of these comparisons are then compared together by a clustering algorithm to create the phonetic categories (so these categories are built from an analysis of similarities over similarities).
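A minimal sketch of the second-order isomorphism idea, on abstract vectors rather than sounds (the reference patterns, dimensions, and distance measure below are illustrative assumptions): each stimulus is first re-represented by its similarities to a fixed set of stored references, and stimuli are then compared through these similarity vectors, so that any subsequent clustering operates on similarities of similarities.

import numpy as np

rng = np.random.default_rng(0)
references = rng.normal(size=(10, 5))  # 10 stored reference patterns
stimuli = rng.normal(size=(100, 5))    # new observations

# First-order similarities: one dimension per reference pattern.
sim = -np.linalg.norm(stimuli[:, None, :] - references[None, :, :], axis=2)

# Second-order comparison: distances between the similarity vectors, on
# which a clustering algorithm would then operate.
second_order = np.linalg.norm(sim[:, None, :] - sim[None, :, :], axis=2)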

The idea of using templatic representations of sounds has already been applied in practical work. In particular, M. Coath and S. L. Denham [Coath and Denham, 2005] studied a representation based on comparisons with syllabic templates, with the aim of recognizing spoken digits. They propose to represent sounds as responses to convolution filters, each filter corresponding to a feature to detect. This operation produces a distance between the stimuli and several instances of the sounds to detect. This representation is quite comparable to the one we are proposing: unlike classical speech recognition systems, the model uses the comparisons to create a new representation, and bases recognition on this new representation instead of classifying the stimuli by a simple winner-takes-all test. However, the authors compare the syllables using convolution products, which do not provide enough robustness to obtain satisfying results. They also consider only the global result of the sound comparisons, and consequently cannot efficiently distinguish units smaller than the templates used for comparison. This limitation is not considered a problem by the authors, as they focus on digit recognition rather than phonetic acquisition. Finally, the authors do not report any error rates, or any efficiency comparison between their system and conventional representations.

Similar ideas are extensively applied to object recognition. In particular, comparing an observed shape with multiple templates of a given set of objects greatly decreases the difficulties related to the rotation or illumination of this object. This approach has been successfully used in face recognition, where it provides robustness to face orientation and illumination [Beymer and Poggio, 1996]. However, these applications raise more difficulties than sound representation, as they involve comparisons of 2-dimensional images of 3-dimensional objects: such comparisons necessarily require more complex machinery than comparisons between speech streams, which are 1-dimensional.

Finally, experiments on humans have been carried out to support the concept of visual representation by similarities. For instance, F. Cutzu and S. Edelman [Cutzu and Edelman, 1996] designed an experiment in which subjects indirectly reconstructed the relations between 3-dimensional objects: several three-dimensional animal-like shapes were generated, controlled by a large number of parameters hidden from the subjects. Pairs of pairs of pictures of these shapes were then presented to the subjects, who had to decide which pair was the most similar; from the answers, the relations between the animals were reconstructed by multidimensional scaling. These reconstructed relations corresponded to the (hidden) relations between the objects in parameter space. This experiment suggests that humans are sensitive to the relations between objects (recognition of objects by features, for instance, would not lead people to reproduce the relations between the shapes presented in this experiment).

It is interesting to highlight that, although it is common to represent data using similarities to reference templates, these templates are usually prototypes of the objects to identify, or sometimes subparts of these objects. The model proposed hereafter, on the other hand, aims to identify units (phonemes) that are subparts of the patterns used for the comparisons (syllables).


2 A template-based representation of the speech stream

Here, we propose a new representation of the speech signal based on syllabic templates. This representation can be considered as a first step in the infant's acquisition of phonetic categories, as it facilitates the acquisition of the phonemes of the language. In this report we show that this representation helps in learning a given language, while it makes the discrimination of phonemes that are not part of this language more difficult.

2.1 Presentation

2.1.A Structure of the model

The proposed model relies on an original representation of speech stimuli, based on multiple comparisons between perceived stimuli and a bank of reference sounds (we suppose this set of sounds has been stored beforehand). In order to carry out these comparisons, it is necessary to segment the speech stream into relatively short templates, of the same shape as the reference sounds. This segmentation is a prerequisite for the comparison of new sounds with the reference base (and, consequently, can itself be considered as a first step in the acquisition of language). A similarity measure is obtained by comparing every new sound with every reference template; the similarity between a new stimulus and one reference sound constitutes one dimension of the new representation. The created representation thus transforms a sequence of physical values (the physical, or at least acoustic, representation of the speech stream) into a set of sequences of similarities (one per reference sound). Before the comparison, and in order to improve its efficiency, the sounds are time-warped. The distortion applied to the sounds carries useful information about the relation between the new stimuli and the reference sounds, so it can be valuable to take it into account during the clustering step; in the experiments reported later, we present the gain obtained by using this information. The acquisition of phonetic categories strictly speaking is achieved in a second step, using the computed representation. We will show that the new representation allows easier learning of these categories, and thus that it can be considered as a first step of their acquisition.

An illustration of this model is given in Figure 2.
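The pipeline of Figure 2 can be summarized by the following Python sketch (the thesis implementation was in Matlab; the names and the crude frame-wise similarity below are placeholders, and the actual DTW-based comparison is detailed in section 2.3.B):

import numpy as np

def similarity(stimulus, reference):
    # Placeholder similarity: resample the reference to the stimulus length
    # and take a per-frame distance. The model actually uses DTW (2.3.B ii).
    n = len(stimulus)
    ref = reference[np.linspace(0, len(reference) - 1, n).astype(int)]
    return -np.linalg.norm(stimulus - ref, axis=1)

def template_representation(stimulus, reference_base):
    # One similarity trajectory per stored template: the result has shape
    # (n_frames, n_references), the high dimensional similarity space (4).
    return np.stack([similarity(stimulus, ref) for ref in reference_base],
                    axis=1)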


[Figure 2 diagram: acoustic signal (1) → Code (A) → low dimensional spectral feature space (2) → Segment & store (B) → reference base (3); Match (C) → high dimensional similarity space (4).]

Figure 2. Global presentation of the proposed algorithm: the physical signal (1) is transformed (A) into an acoustic representation (2; for the practical implementation of the model, MFCC coefficients are used for efficiency). This data is segmented, and part of it is stored (B) to create the reference base (3). In a second step, new stimuli are compared with this reference base (C) to create a similarity representation (4).

2.1.B Produced representation

The representation obtained after the comparisons is based on the reference templates: each of its dimensions corresponds to one of the reference sounds. For instance, if the syllables [ba], [na], [ra]… have been stored in the reference base, one dimension of the new representation will correspond to the similarity between the stimulus and the reference sound [ba], another to the reference [na], and so on for the other references (cf. Figure 3).

Figure 3. Projection "ma" vs. "mi": two axes of the new representation. The horizontal axis corresponds to the reference template "ma", the vertical axis to the reference "mi". Each point corresponds to a frame (5 milliseconds long), indicated by its label. Notice that the "m" sounds have a high similarity along both axes, while the phonemes "a" and "i" have a high similarity along their corresponding reference axis only; the other phonemes match neither axis and lie close to the origin. This figure displays two axes only, but the representation is highly dimensional (with one dimension per reference sound).


2.1.C Nature of the reference sounds

The nature of the reference sounds is fundamental for an efficient representation of the speech stimuli. We assumed that syllabic units were optimal, for the following reasons. The coarticulation effect is very strong within syllables, and thus creates very strong noise if units smaller than syllables are considered; using syllabic units cancels this noise, as different instances of the same phoneme (appearing in different contexts) can be stored. On the other hand, the coarticulation effect is very weak between syllables, so there is no need to consider units longer than syllables: such units would increase the complexity of the representation without providing any gain. This choice is also coherent with observations on newborn babies. The use of syllabic templates for sound comparisons requires babies to be able to detect such units in the speech stream, and it has been shown that newborns have this ability, as they can discriminate between strings of bisyllables and trisyllables (which means they can count the number of syllables in the speech stream [Bertoncini and Mehler, 1981]). Finally, this hypothesis has been successfully tested on our model: the experiments reported below show the superiority of a syllable-based representation for phoneme acquisition, which is thus coherent with both theoretical considerations and experimental observations.

The exact mechanism allowing babies to achieve this segmentation is still unknown, and is not the focus of this report. Nevertheless, a syllabic segmentation mechanism is proposed in the annex.

2.2 Particularities of the new representation

2.2.A Adaptation to the learned language

The proposed representation of the speech stream differs in several major ways from a "physical" representation such as the spectral representation. The first noticeable difference is its adaptation to a particular language: the representation is adapted to learning the phonemes that appear in the reference base, while it should be inefficient for learning to discriminate phonetic categories that do not appear as distinct in that base. The representation is therefore clearly adapted to the learned language, and can be considered as a first step in its acquisition. This adaptation will be clearly underlined in the tests carried out on the proposed implementation of the model.

2.2.B Transformation of trajectories into points

As another major difference, it is important to note that our representation transforms data appearing as sequences of physical values into a roughly constant similarity. The physical representation of a phoneme is clearly not constant over time, but its similarity to a reference sound featuring the same phoneme remains continuously high. This greatly simplifies the learning of the phonetic categories, as a phoneme corresponds to a point, or an area, of the new representation, instead of a complex trajectory in the acoustic representation. On the other hand, phonemes that do not appear in the reference base are not associated with any zone of the new representation (the representation of such a phoneme is still a trajectory), so our representation is not adapted to learning such phonemes. These differences offer an interesting way to understand discrimination abilities (or inabilities) as grounded in a property (or limitation) of the representation, prior to any clustering.


2.2.C Specific case: affricate phonemes

Noticeable examples are the affricate phonemes. For instance, consider a language including a phoneme [tˢ]. A reference base acquired from such a language would incorporate a set of templates including the phoneme [tˢ], so the sound [tˢ] would correspond to an area in the new representation. New stimuli containing the phoneme [tˢ] would create a cluster in this area, which would be considered as an independent phoneme.

On the other hand, a representation built on a language containing the phonemes "t" and "s", but not [tˢ], would have no constant area corresponding to the phoneme [tˢ]. If a sound [tˢ] is represented using a reference base built from this language, it does not correspond to a fixed area (it successively matches "t", then "s"). In these conditions, the sound [tˢ] is treated as the concatenation of two phonemes, instead of a third one. This property would enable any acquisition technique based on the proposed representation to easily learn discriminations between phonemes that could not possibly be discriminated using local information only.

2.3 Implementation and testing of the model

2.3.A Objectives and presentation of the implementation

A Matlab implementation of the described model has been written. The objectives of this implementation were:
- to demonstrate the efficiency of the proposed representation for learning the phonetic categories, compared with a physical representation of the speech signal;
- to compare the efficiency of the proposed representation depending on the speech templates considered (for instance, to compare syllabic templates with other kinds of segmentation);
- to propose some optimizations of the representation.

2.3.B Detailed algorithm

The input data provided to the algorithm was coded using MFCC coefficients [Mermelstein, 1976]. The MFCC representation of a sound is computed from the spectrum of the speech signal: the logarithm of the spectrum is mapped onto a mel scale, and the MFCC coefficients are the amplitudes of the discrete cosine transform of the resulting signal. This choice was motivated by technical considerations, as MFCC coefficients provide an efficient way to encode a speech stream for speech recognition and are relatively fast to compute. In the experiments described in this report, the MFCC coefficients were computed every 5 ms, but each MFCC frame actually represents 15 milliseconds of the speech signal (the frames are computed on overlapping windows). The signal was represented by the first 13 coefficients. In order to judge the efficiency of the proposed representation, we often compared the error rates obtained from clustering based on our representation with results obtained directly from the MFCC coefficients. In these cases, we used the MFCC coefficients plus their first two derived coefficients (so the considered signal was 39-dimensional); these derived coefficients greatly improve the learning and recognition of phonetic categories (as shown for instance in Figure 5). Starting from the input sounds represented by their MFCC coefficients, the implemented algorithm proceeded in the following successive steps:

i: Constitution of the reference base

This step was, for most of the trials, achieved in a supervised way: a given number of realizations of every possible syllable of the considered language were randomly selected to compose the reference base. This procedure yields balanced reference bases, created from optimal templates. The balance of the reference base is probably not mandatory, as long as every syllable is represented (with an unbalanced template base, the results would probably be equivalent to those of a balanced base with as many elements as the least represented syllable in the database). In some experiments, however, the reference base was built in an unsupervised way (these cases will be specified). We also developed and implemented an unsupervised segmentation algorithm which can produce relevant templates; this algorithm is described in the annex.
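A minimal sketch of this supervised constitution of the reference base (the recordings dictionary and its layout are assumptions; each token would be an MFCC frame matrix):

import random

def build_reference_base(recordings, n_per_syllable, seed=0):
    # recordings: dict mapping a syllable label to its recorded tokens;
    # every syllable must have at least n_per_syllable tokens.
    rng = random.Random(seed)
    base = []
    for syllable, tokens in recordings.items():
        for template in rng.sample(tokens, n_per_syllable):
            base.append((syllable, template))
    return base  # balanced: n_per_syllable templates per syllable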

ii: Comparison with new stimuli

The comparisons between the database templates and the new stimuli are carried out using the Dynamic Time Warping (DTW) algorithm [Myers and Rabiner, 1981]. The DTW algorithm compares two sounds, deforming one of them if necessary, in order to align the parts of the sounds that most probably match. This comparison method is robust against changes in speech rate, and synchronizes the templates before matching them together.

Figure 4. Result of the comparison of two sounds using the DTW algorithm: one of the sounds (the new stimulus, here "ba") is displayed along a horizontal time axis, while the other (the reference sound, here "be") is displayed along a vertical time axis. The DTW algorithm compares these sounds by computing the distance between every possible pair of frames from the two sounds (these distances are stored in the "distance matrix" represented in the figure: similar frames are shown in yellow, distant ones in red). It then finds the best possible path (the path joining the beginnings of the sounds to their ends with a minimal total distance, in blue on the figure). Finally, it projects the distance along this path onto one of the sounds (in our algorithm, the new stimulus). This yields a frame-wise distance between the new stimulus and the reference sound. The same comparison is applied with every reference sound, providing a vector of similarities for every frame of the new stimulus.

In some cases, the 1-pass DP algorithm was preferred (1-pass DP is a generalization of the DTW algorithm which allows comparing one sound to many references simultaneously, or comparing one continuous stimulus to a shorter reference, repeating the reference as many times as necessary); in these cases, the comparison method and the reason for the choice will be specified. In addition, we extracted from the comparisons information about the distortions that have to be applied to the sounds in order to make them match. To collect this time-distortion measure, which gives additional information about the comparison, the optimal path produced by the DTW algorithm is considered: the distance between the derivative of this path at a given moment and a linear distortion of the reference sound provides a dissimilarity measure (the higher the value of this distortion distance, the more the sounds had to be warped before matching).
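The following sketch implements the frame-wise DTW comparison of Figure 4 in Python with NumPy (the thesis implementation was in Matlab; the Euclidean local distance is an assumption):

import numpy as np

def dtw_framewise(stimulus, reference):
    # stimulus, reference: MFCC frame matrices of shape (n_frames, n_coeffs).
    # Returns, for every stimulus frame, its distance to the reference
    # frame(s) matched along the optimal warping path.
    n, m = len(stimulus), len(reference)
    # Local distance matrix between every pair of frames.
    dist = np.linalg.norm(stimulus[:, None, :] - reference[None, :, :], axis=2)
    # Accumulated cost with the classic (diagonal, up, left) recursion.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                 acc[i - 1, j],
                                                 acc[i, j - 1])
    # Backtrack the optimal path from the end of both sounds.
    path, (i, j) = [], (n, m)
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        moves = {(i - 1, j - 1): acc[i - 1, j - 1],
                 (i - 1, j): acc[i - 1, j],
                 (i, j - 1): acc[i, j - 1]}
        i, j = min(moves, key=moves.get)
    path.append((0, 0))
    # Project the path distances onto the stimulus axis (averaging when a
    # stimulus frame is matched to several reference frames). The time-
    # distortion measure described above could be derived from `path`, e.g.
    # as its deviation from the linear alignment j = i * m / n (assumption).
    per_frame, counts = np.zeros(n), np.zeros(n)
    for si, rj in path:
        per_frame[si] += dist[si, rj]
        counts[si] += 1
    return per_frame / counts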

iii: Obtained representation

Given the input data, the representation obtained after the comparisons is an N- or 2N-dimensional signal (if the reference base contains N distinct templates, the comparisons produce N similarities and N distortion distances), sampled every 5 milliseconds. From a computational point of view, this representation is relatively heavy; the proposed algorithm should therefore not be considered efficient from an engineering point of view.

iv: Clustering and score

Depending on the experiment, the clustering achieved over the new representation used supervised or unsupervised algorithms. In the supervised case, a perceptron was used, trained with the Rprop algorithm (Rprop performs supervised training of a perceptron network using a backpropagation-style optimization technique; see [Riedmiller and Braun, 1993] for further details). The training and the recognition of the phonetic categories were carried out frame-wise: every frame was considered as an independent datapoint, and the order of the frames was not used for clustering or recognition. This point of view is clearly simplistic; however, our objective was to compare different representations of the speech signal, and for this task it was appropriate to use a simple clustering algorithm. We only suffered from this simplicity in the unsupervised tests, where the proposed algorithms were unable to converge toward the appropriate clusters. In all cases, the measured error rate was the percentage of misclassified frames (so this error rate is frame-wise). In every experiment, the same clustering algorithm was also applied to the MFCC coefficients and their first two derived coefficients, in order to judge the gain obtained by the proposed representation. For the learning, we used a training set including 34 instances of every possible syllable (so from 34 x 9 = 306 sounds for the smallest sets to 34 x 36 = 1224 sounds for the biggest set; each sound contains an average of 40 to 50 frames, depending on the considered set).
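A minimal sketch of this frame-wise supervised evaluation (scikit-learn has no Rprop, so a standard linear perceptron stands in here as an assumption; it still measures the linear separability we are after):

import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

def framewise_error(frames, labels):
    # frames: (n_frames, n_dims) representation; labels: (n_frames,) phoneme
    # labels. Every frame is treated as an independent datapoint.
    X_tr, X_te, y_tr, y_te = train_test_split(frames, labels,
                                              test_size=0.25, random_state=0)
    clf = Perceptron(max_iter=1000, random_state=0).fit(X_tr, y_tr)
    # Frame-wise error rate: percentage of misclassified frames.
    return 100.0 * (clf.predict(X_te) != y_te).mean()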

v: Comparison with a direct clustering algorithm

From a computer science point of view, our approach can be considered as related to a K-means algorithm based on DTW distances (an application of such an algorithm is proposed, for instance, in [Somervuo and Harma, 2004]). This algorithm computes a K-means clustering directly on sequences of points, using the DTW algorithm to compare the different stimuli to the centroids. It would not need any intermediate representation to compute a relatively efficient clustering of the phonetic categories. But this algorithm can only provide a clustering of entire sound sequences; in order to converge to phonetic categories, it would require the speech to be segmented into phonetic units. More details on this algorithm and its connection with the proposed model are given in the annex.
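A minimal sketch of this direct clustering baseline (medoids stand in for true DTW centroids, an assumption made to keep the sketch short; dtw_framewise is the function sketched in section 2.3.B ii):

import numpy as np

def dtw_cost(a, b):
    return dtw_framewise(a, b).sum()  # total DTW cost between two sequences

def dtw_kmedoids(sequences, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(sequences), size=k, replace=False))
    assign = []
    for _ in range(n_iter):
        # Assignment step: each whole sequence joins its nearest medoid.
        assign = [int(np.argmin([dtw_cost(s, sequences[m]) for m in medoids]))
                  for s in sequences]
        # Update step: the new medoid of a cluster is the member minimizing
        # the summed DTW cost to the other members.
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            if members:
                costs = [sum(dtw_cost(sequences[i], sequences[j])
                             for j in members) for i in members]
                medoids[c] = members[int(np.argmin(costs))]
    return assign, medoids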

2.3.C Input data

The input data provided to our algorithm was encoded using the first 13 MFCC coefficients, sampled every 5 ms. The choice of MFCC coefficients as the input representation was motivated by the fact that these coefficients are directly extracted from a physical representation of the speech signal (being essentially the "Fourier transform of the Fourier transform"), while at the same time being relevant for speech recognition. For the tests of the algorithms, we considered 4 pseudo-languages. All the stimuli were recorded by the same male speaker, in a quiet and non-reverberating environment. In order to evaluate the results of the proposed experiments, all the data were manually labeled (even though some of the proposed experiments are based on an unsupervised learning system, and thus do not actually rely on these labels).
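A minimal sketch of this input encoding (the original implementation was in Matlab; librosa and the file name are illustrative assumptions):

import librosa

y, sr = librosa.load("syllable.wav", sr=16000)  # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            hop_length=int(0.005 * sr),  # one frame per 5 ms
                            n_fft=int(0.015 * sr))       # ~15 ms windows
frames = mfcc.T  # shape (n_frames, 13): one 13-dim vector per frame
# Delta and delta-delta coefficients, for the 39-dimensional MFCC control:
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)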

i: Easy set: [r m s a e i]

This set was composed of 6 phonemes (a, e, i, r, m, and s), selected to be relatively easy to differentiate. These phonemes were used to create syllables with a Consonant-Vowel (CV) structure, pronounced in isolation (thus, the considered data were monosyllabic words). The set was balanced (every possible syllable appeared the same number of times). Each syllable was recorded 55 times, so this set contained 3 x 3 x 55 = 495 elements (as 3 x 3 = 9 syllables can be created with these 3 consonants and 3 vowels).

ii: Hard set: [p t k u o y]

This set was also composed of 6 phonemes (p, t, k, u, o, y), selected to be difficult to differentiate: in particular, the 3 stops (p, t, k) are extremely hard to distinguish, as are the 3 selected vowels (u, o, y). These phonemes were also used to create monosyllabic words with a CV structure. This set was also balanced, and each syllable was recorded 55 times.

iii: Complete set ('all'): [r m s p t k a e i u o y]

This set was a combination of the two sets described above. It was constituted of CV monosyllabic words, including all possible syllables created from the phonemes of the easy and the hard sets (such as "ru" or "pa"). Like the previous sets, this set was balanced, with 55 realizations of each possible syllable (so it was composed of 6 x 6 x 55 = 1980 elements).

iv: Polysyllabic set: [R ȓ d m ø a i u]

This set contained trisyllabic words, composed from 8 phonemes (R ȓ d m ø a i u). These phonemes were arranged following the same CV structure as in the previous sets, so the recorded trisyllables had a CVCVCV structure. 512 trisyllabic words were recorded, and the set was built in such a way that every phoneme was pronounced the same number of times in every position.

2.4 Experiments achieved

In order to test the predictions of our model, we carried out several experiments using the data presented above.

2.4.A Experiment 1: efficiency of the new comparison

The first test concerned the "efficiency" of the proposed representation for phoneme acquisition. To estimate this efficiency, we measured the frame-wise error rate of the phonetic classification of our data, after either a supervised or an unsupervised clustering.

i: Supervised clustering

This experiment was carried out in order to measure the linear separability of the phonetic categories under different kinds of representation. The clustering of the data was achieved using a perceptron, whose parameters were determined using the (supervised) Rprop algorithm [Riedmiller and Braun, 1993]. The perceptron output attaches a label to every input frame, and the error rate was obtained by comparing this label with the actual label manually attributed to the frame. We tested the different sets of stimuli under the following conditions:
- MFCC representation, with the first 2 derived coefficients (control);
- our representation, testing:
  - the influence of the number of reference sounds;
  - the influence of the distortion distance.

While this experiment showed an overall superiority of the proposed representation, the results were fairly contrasted depending on the set it was carried out on:

a) Monosyllabic sets

The tests on the monosyllabic sets showed a clear superiority of the proposed representation. For these tests, the considered templates were monosyllabic words, compared using the DTW algorithm. We used reference bases created with 4 to 12 instances of every possible syllable (there were 36 possible syllables), and tested the discrimination with and without the distortion distance.


[Figure 5 chart: "Hard set 3x3 [ptk][uyo]"; classification error (training vs. generalization) as a function of the number of detectors, for mfcc / mfcc+delta2 and for the similarity representation (4, 8, or 12 references per syllable), spectrum only vs. spectrum+time.]

Figure 5. Efficiency of a supervised classification, achieved on MFCC coefficients (as a control, with or without the 2 first derived coefficients, "mfcc" / "mfcc+delta2"), or on the similarity-based representation, with 4, 8, or 12 examples of every reference, and with or without the time distance ("spectrum only" / "spectrum+time").

As expected, a bigger reference base allows a more precise discrimination of the stimuli, and the test also shows the efficiency of the distortion distance for discriminating phonemes. However, these findings could be considered obvious, as the representation based on the biggest database (12 instances of each syllable) is extremely complex: it counts 12 * 36 * 2 = 864 dimensions (12 instances of every syllable, 36 different syllables in the language, and the number of references multiplied by 2 because we used the distortion distance). With such a representation, the discrimination is necessarily higher than with simpler data. Nevertheless, all this information is directly extracted from the MFCC coefficients; and, even if the representation based on a large number of references is complex, a projection of this representation onto its most significant dimensions (using a Principal Component Analysis) yields much better results than directly processing the MFCC coefficients (and their derivatives) at the same level of complexity. It can therefore be deduced from these experiments that, for monosyllabic data, the proposed representation is more efficient for discriminating phonemes. It is also shown that increasing the number of reference sounds improves the discrimination, and that the distortion distance helps discriminate phonemes more efficiently.
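A minimal sketch of the dimensionality-matched comparison mentioned above: the high-dimensional similarity representation is projected onto its principal components before classification, so that it can be compared with the 39-dimensional MFCC control at equal complexity (the target dimension here is an assumption):

import numpy as np
from sklearn.decomposition import PCA

def reduce_representation(sim_frames, n_dims=39):
    # sim_frames: (n_frames, n_references) similarity representation; keeps
    # the n_dims most significant directions (requires n_references >= n_dims).
    return PCA(n_components=n_dims).fit_transform(sim_frames)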

We ran the same tests using a more complex representation of the sounds as input to our algorithm: instead of the plain MFCC coefficients, we added the derived coefficients. While the use of these derived coefficients significantly improved the results for direct recognition (clustering directly based on the MFCC coefficients), it did not produce any significant change when used as input to our algorithm, which suggests that the information carried by these coefficients is already extracted by our representation.

b) Polysyllabic sets

2.4.A.i.b.1 Presentation of the problem

The case of continuous speech is tested here using polysyllabic words (trisyllables). This kind of treatment raises additional problems compared with the monosyllabic data presented above. In particular, the continuous speech has to be segmented into templates prior to the creation of the reference base and the comparison between the new stimuli and the sounds of this base.

Considering the central role of the reference templates in our algorithm, it is relevant to study the impact of their selection on the results.

2.4.A.i.b.2 Comparison between syllabic and phonetic segmentations

We first compared representations based on syllabic templates with representations based on phonetic templates (both manually created), and with the direct clustering achieved over the MFCC coefficients (plus derived coefficients). The use of syllabic templates proved more efficient than the use of phonetic templates. The syllabic-template representation also proved more efficient than the MFCC coefficients (in order to compare models of equivalent complexity, we artificially increased the number of dimensions of the MFCC representation; in this case, the MFCC-based representation was equivalent to the phonetic-template one, but the representation based on syllabic comparisons still improved the discrimination of the phonetic categories).

Training Generalization

25

Classification error

20

15

10

5

0 mfcc delta2

mfcc delta 20 (13 * 20 coefs)

syllabic templates

phonetic templates

Figure 6. Polysyllabic set: efficiency of the supervised clustering, over the MFCC coefficients (plus derivatives) and the comparison-based representation, using syllabic and phonetic templates. For the phonetic-template-based representation, the clustering clearly suffers from over-fitting: although the representation is complex (the training error rate is lower than the one obtained from the MFCC coefficients), it is inadequate (the rules extracted from the training set cannot be applied to the generalization set).

These experiments were performed in "ideal" conditions, as the templates were segmented manually. Of course, any automatic segmentation of the speech signal would be less precise and would generate errors in the template boundaries. But the sounds are warped before the comparisons (using the DTW algorithm), so a high precision of the segmentation is not mandatory. The result of this experiment, showing a superiority of the syllabic templates, is not surprising: unlike phonetic units, syllabic ones take coarticulation effects into account, and consequently avoid a very strong source of noise. The optimality of these syllabic sound units also strongly supports our model, as it has been shown that newborn babies can perceive such units: it would give a role to this early sensitivity (which precedes phonetic acquisition), and would involve it in full language acquisition (unlike more classical models).

b.3 Segmentation of the continuous speech

A segmentation algorithm is presented in the appendix. This algorithm is coherent with our approach, as it mainly considers global information about the speech.

It is based on the idea that the most relevant units are those that optimally describe the speech stream. Using this assumption, the algorithm finds a (locally) optimal set of sounds that can, with a minimal distortion, be concatenated to match new speech stimuli in an optimal way.

b.4 Random templates

We also tested the efficiency of a sound representation based on randomly segmented templates. We considered 3 types of templates that could be relevant:

- syllabic templates (a priori assumed to be optimal, as said above)
- phonetic templates (already described hereinabove)
- random templates

For the "random templates", we segmented the speech at a constant length. We tried several possible lengths, in order to find the optimal one.

Figure 7. 3 different types of segmentation, for the creation of the template base: the portions of signal between the red lines will be kept as reference templates. The 1st image corresponds to a syllabic segmentation, the 2nd one is a phonetic segmentation, and the 3rd one corresponds to a random segmentation, as the boundaries are fixed at regular intervals, without considering the data contained in the signal. The syllabic and phonetic segmentations are obtained manually.
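Producing the constant-length segmentation of Figure 7 is straightforward; the following minimal sketch assumes feature frames extracted at a hypothetical 10 ms frame rate.

```python
import numpy as np

def fixed_length_templates(frames, length_ms, frame_ms=10):
    """Cut a stream of feature frames into consecutive fixed-length templates,
    ignoring the content of the signal (the 'random' segmentation of Figure 7)."""
    step = length_ms // frame_ms
    n = len(frames) // step
    return [frames[i * step:(i + 1) * step] for i in range(n)]

frames = np.random.default_rng(0).normal(size=(500, 13))   # stand-in MFCC stream
templates = fixed_length_templates(frames, length_ms=120)  # 120 ms, the optimum found below
```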

The comparison was performed using the 1-pass DP algorithm, independently for every reference sound (so there is no interaction between the best paths through the different sounds, and the dimensions are independent). We also tried to synchronize the paths through the different dimensions (by forcing the paths along all the dimensions to come back to the origin of their corresponding reference sounds at the same moment as the optimal path does). However, this type of comparison wasn't more efficient than comparing the stimuli independently with every reference sound.


[Figure 8 chart: "Polysyllabic set [SRdm][aiu2]: Influence of the templates". Classification error (training and generalization) for mfcc delta2, syllabic templates, phonetic templates, and randomly segmented templates with lengths from 40 to 400 ms.]

Figure 8. Comparison between different segmentation techniques; for the randomly segmented templates, lengths from 40 to 400 ms have been tried. These results show the superiority of the syllabic templates. The U-shape of the score function according to the length of the random units is also noticeable. The optimal value (120 ms) is longer than an average phoneme, but shorter than a syllable. This result suggests that templates have to represent the transition periods between phonemes to be efficient.

The results are presented in Figure 8.

- The syllabic templates proved to be optimal.
- As expected, the random templates were not optimal. It should be noted that the best random templates corresponded to a length intermediate between that of a phoneme and that of a syllable.
- The optimal randomly segmented templates proved more efficient than the phonetic templates. This result is due to the additional freedom given to the comparisons with the randomly segmented templates by the use of the 1-pass DP algorithm.

These experiments support the idea of syllables as basic perceptual units, prior to the learning of the perception of phonemes. They also suggest that, while an appropriate segmentation improves the efficiency of a later clustering, a roughly performed segmentation can still give relatively efficient results.

ii :

Unsupervised clustering

The same experiment was performed using unsupervised learning. This experiment should be more relevant in terms of phonetic acquisition, considering that this acquisition is achieved by babies in an unsupervised way (i.e. it rests on passive listening to speech, and is not directed by any exterior information). However, we didn't focus on this part of the learning mechanism in this report: the learning of units such as phonemes involves complex learning algorithms (usually based on Hidden Markov Models (HMM); for an example of such an algorithm, see [Takami and Sagayama, 1992]). These algorithms usually build allophonic categories, which are then merged into phonetic categories [Peperkamp and Le Calvez, 2003].

We tried to avoid these difficulties by using a simple language, featuring very distinct phonemes. In these conditions, it is possible to obtain a satisfying result with simple algorithms such as K-means or EM (unlike HMM-based algorithms, these algorithms don't use time-related information). For the more complex languages, the simple clustering methods used didn't allow us to obtain satisfying results: the clusters found, from both the MFCC and the similarity-based representations, did not correspond to phonetic units (the algorithm proposed allophonic units for some phonemes, and merged other ones).
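Such a clustering step can be sketched as follows, with stand-in arrays in place of the actual similarity features; scikit-learn's K-means and Gaussian-mixture EM serve as generic implementations of the two algorithms mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 72))  # stand-in for (frames x reference sounds) similarities

# 6 clusters, one per phoneme of the Simple set [r, m, s, a, e, i]
kmeans_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
em_labels = GaussianMixture(n_components=6, random_state=0).fit_predict(X)
# the found clusters are then compared with the supervised phonetic categories
```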

A test was performed on our simple language ("Simple set", containing the phonemes [r, m, s, a, e, i]). The simplicity of this language allowed the use of simple unsupervised clustering algorithms (we tested the K-means and EM algorithms). This test showed a clear superiority of our representation: while the clustering proposed over the MFCC representation proved irrelevant, the clustering proposed on our representation corresponds to the categories found by a supervised algorithm (see Figure 9). As could be expected, the K-means and EM algorithms obtain the same error rate on our representation, even though the second is much more complex. This is due to the special shape of the speech signal in our representation (where the values correspond to similarities with other sounds, and thus "have a meaning" for phonetic clustering).

[Figure 9 chart: "Unsupervised learning, Easy Set [mrs][aei]". Classification error (training and generalization) for the MFCC and the 8*9 similarity-based representations, clustered with K-means and EM.]

Figure 9. Unsupervised learning of the phonetic categories, using MFCC (and derivative coefficients) versus the similarity-based representation. The similarity-based representation proved clearly superior to the MFCC-based coefficients. In both cases, the number of clusters (corresponding to the 6 phonemes) was provided to the algorithm.

In these experiments, the number of categories to build was provided to the algorithm. However, this additional information is not necessary to find the optimal partition, as shown in Figure 10.


[Figure 10 chart: "Optimal number of clusters for the easy set [m R s a e i]". BIC (Bayesian Information Criterion) values for 3 to 12 clusters; the minimum is reached at 6 clusters.]

Figure 10. Evaluation of the efficiency of the clustering, as a function of the number of clusters created. The efficiency was computed using the Bayesian Information Criterion (BIC). The minimum value of the BIC corresponds to the optimal number of clusters; here, the optimal number of clusters corresponds to the 6 phonetic categories: [m R s a e i].
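The BIC-based selection of the number of clusters can be sketched as follows, again with stand-in data; a Gaussian mixture fitted by EM provides the BIC directly.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 72))  # stand-in for the similarity-based frames

# fit one mixture per candidate number of clusters and keep the lowest BIC
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(3, 13)}
best_k = min(bics, key=bics.get)  # BIC trades off likelihood against model complexity
```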

2.4.B

Impact of the language of the reference base

The representation built strongly depends on the reference base used (as the dimensions of the representation correspond to sounds of the reference base). For instance, a well-balanced base containing sounds of the learned language should be optimal for the acquisition of phonetic categories, while learning a language from a representation composed of inadequate sounds (like syllables of another language) should be harder. In order to test the impact of the language of the reference base, we tested the learning of a language represented on a base containing different phonemes. We considered 2 languages: the "Hard Set" (phonemes [p, t, k, u, o, y]) and the "Easy Set" (phonemes [m, r, s, a, e, i]). We tried to discriminate phonemes from the Easy Set using a representation based on the Hard Set, and vice versa (discriminating the phonemes of the Hard Set, represented with a reference base built from sounds of the Easy Set). Both sets contain the same number of phonemes and possible syllables (so the numbers of dimensions in an Easy-Set-based representation and in a Hard-Set-based representation are equal).


[Figure 11 chart: "Inadequate Reference Bases: Easy Set [Rms][aei] / Hard Set [ptk][uyo]". Classification error (training and generalization) for the four conditions: Reference = Easy / Learned = Easy; Reference = Hard / Learned = Easy; Reference = Hard / Learned = Hard; Reference = Easy / Learned = Hard.]

Figure 11. Influence of the reference base on the acquisition of the phonetic categories: clustering algorithms were trained to learn phonetic categories from data expressed on an adequate representation (when the language of the reference base corresponds to the learned language) or on an inadequate one (when the language of the reference base differs from the learned one).

These results clearly demonstrate the impact of the language of the reference base: it is much harder to learn the phonetic categories if the speech is represented on an inadequate base. This shows that the construction of a reference base adapted to the learned language is a first step in the acquisition of phonetic categories. It matches the behavioral observations, as adults are less efficient at discriminating phonemes that do not appear in their native language.

2.4.C

Enhancements

The proposed representation exhibits some important particularities, due to the nature of its dimensions (which are independent similarities). These particularities can be used to adapt the clustering algorithm used to actually create the phonetic categories. For instance, it can be admitted that each syllable contains a limited number of phonemes (for instance, one could say that, in most languages, most syllables are composed of 3 phonemes or less). Based on this assumption, and noticing that the detectors are syllabic templates, some observations can be made on the shape of the categories to create. One reference sound should contain a limited number of phonemes (usually up to 3). So one dimension, corresponding to this reference sound, should have a limited number of phonemes (those included in the reference sound, thus matching with it) centered at a high similarity value, while all the other phonemes should be centered at a low similarity.

It is also possible to simplify the representation, considering it is based on similarities. Instead of the continuous similarity measure used in the experiments proposed so far, it is possible to use a simple binary notation: if, at a given moment, the stimulus matches a reference sound, the corresponding dimension is set to 1; otherwise its value is set to 0 (the decision "matching"/"not matching" can be based on a simple threshold over the similarities). This simplification offers the advantage of being relatively similar to the data manipulated by neural models. We compared the efficiency of the recognition based on a continuous and on a binary representation, using a simple artificial neural network (a perceptron) trained in a supervised way. The binary notation is also coherent with the concept of representation by similarities, as the only information representing the sounds was the list of references matching them (this information was sufficient to achieve the recognition, as the number of reference sounds was relatively high). In terms of the quantity of data manipulated, this binary representation was clearly less complex than the MFCC representation. We tested the efficiency of this simplified representation both on the "Full Set" (composed of the phonemes [m, s, R, p, t, k, a, e, i, u, y, o]) and on the polysyllabic set (manually segmented, composed of the phonemes [R, S, d, m, a, e, i, u]).
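The thresholding described in the Figure 12 caption (average matching plus one standard deviation) takes a few lines; whether the statistics are computed per reference dimension or globally is not specified in the text, so the per-dimension variant below is an assumption.

```python
import numpy as np

def binarize_similarities(S):
    """Binarize a (frames x reference sounds) similarity matrix.
    Values above mean + 1 std (computed per reference dimension, an assumption)
    are set to 1 ("matching"), all others to 0 ("not matching")."""
    threshold = S.mean(axis=0) + S.std(axis=0)
    return (S > threshold).astype(np.uint8)

S = np.random.default_rng(0).normal(size=(1000, 192))  # stand-in similarities
B = binarize_similarities(S)
```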

[Figure 12 charts: left panel "Trisyllabes [SRdm][aiu2]", right panel "all easy + hard set 9x9 [ptkRms][uyoaei]". Classification error (training and generalization) as a function of the type of representation and number of detectors: MFCC (and mfcc+delta2), binary similarity-based representations, and continuous similarity-based representations with and without the time distance.]

Figure 12. Comparison of the efficiency of different representations: the MFCC representation (and MFCC plus 2 derivative coefficients, "delta2"), the binary similarity-based representation (with 12x16 or 24x16 reference sounds), and the continuous similarity-based representation (using or not the temporal distance). For these tests, the threshold used to create the binary representation was set at the average matching plus one standard deviation (all the coefficients higher than this level were set to 1, all the ones matching less were set to 0).

These results show the efficiency of a binary representation. Despite its simplicity, the results obtained are relatively similar to those obtained using a continuous representation (and still show an improvement compared with the MFCC-based representations). It is also important to notice that the binary representation allows the results to improve when the database is enlarged (as shown with the polysyllabic examples), while a database of the same size would lead to over-fitting if used with a continuous representation. These considerations suggest that the thresholding performed to obtain the binary representation actually keeps most of the information relevant for phoneme discrimination.

2.5

Known limitation of the proposed model

One strong criticism against the presented model could be that it doesn't correspond to the observed chronology of phonetic discrimination. It has been shown that, prior to any phonetic acquisition, newborn babies can discriminate virtually every phonetic category. However, according to our model, infants should learn discriminations (by adding discriminant pairs of templates to their reference base), not forget distinctions, as is observed (which would mean they would have to start with a full initial reference base and forget references during learning, which is implausible). However, an evolution of this model can be suggested, in which the reference base would not be a set of real speech templates, but more abstract data. These data, which could correspond to artificial sounds, can be initialized in a homogeneous way along the possible speech spectrum. As the baby hears his language, he adapts the reference sounds to fit the statistical distribution of the heard speech. By doing so, he will decrease his sensitivity to sounds that do not correspond to any receptor, and increase it for sounds in zones containing many receptors.


3

Conclusion

We have proposed in this report an original model for the acquisition of phonetic categories. While this model mainly focuses on the representation of the speech stream by young infants, it corresponds to a first stage of language acquisition. The proposed representation can adapt to the linguistic environment in which it develops, which simplifies the acquisition of the phonetic categories. We have shown that the proposed representation clearly depends on the stimuli perceived during the early stage of language acquisition, as an inadequate representation makes the acquisition of phonetic categories harder, while an appropriate one facilitates the learning of these categories.

We showed that a syllabic point of view is the most relevant for the acquisition of phonetic categories through our model, which is coherent with the early perception of syllables by infants. Consequently, the developed model allows proposing a coherent chronology of speech acquisition during the first year of life.

Our approach is closely related to the idea proposed by M. Coath and S. L. Denham, who propose to represent sounds using similarities with reference templates. However, we propose a more complex algorithm (warping the sounds before comparison), which allows us to obtain a representation relevant for phoneme acquisition. We also propose a comparison between our representation and a more conventional one, which gives an estimate of the gain obtained by our system. We also compare the efficiency of different kinds of templates: while this previous study only considered reference templates corresponding to the sounds they wanted to discriminate (digits pronounced in English), we used templates that did not correspond to the units we wanted to discriminate (using syllables while we were aiming to discriminate phonemes). Nevertheless, that work, like the one we presented in this report, was a direct application of the theoretical considerations previously developed by S. Edelman, who underlined the benefits for recognition of a representation based on similarity measures.


APPENDIXES

1 Used algorithms: DTW and 1-Pass DP

1.1

DTW algorithm

The DTW algorithm provides an efficient measure of similarity between two sounds. In order to increase the robustness of the comparison between the sounds, they are warped one relative to the other. This warping aligns the parts of the two sounds that correspond to each other, in order to compare the sounds in a relevant way. Considering 2 sounds, coded as two sequences of L1 and L2 vectors (or frames), the comparison is performed in 3 steps:

1st, a "distance matrix" D between the two sounds is created. Each element D(t1,t2) of this matrix, of size L1xL2, corresponds to the distance between the t1-th frame of sound 1 and the t2-th frame of sound 2 (so this matrix represents the distance between every pair of frames of the two sounds). A path along this matrix (a sequence of couples (t1,t2), each element t1 and t2 increasing) corresponds to a distortion of sound 1 relative to sound 2, or vice versa. The aim of the DTW algorithm is to find a path connecting the 1st element of the distance matrix, D(0,0), to the last one, D(L1,L2), minimizing the sum of the elements encountered (each element corresponding to the distance between the 2 sounds at a given time, the found path corresponds to the distortion minimizing the distance between the sounds). The distance between two frames can be computed using the Euclidean distance, after a normalization of every component of the vectorial representation of the sound. It is also possible to use the Mahalanobis distance between the two frames (this distance takes into account the statistical distribution of the frames, via the covariance matrix of a set of frames). In the experiments presented in this report, the first solution (Euclidean distance between normalized vectors) has been used, for computational reasons.

ILLUSTRATION: DISTANCE MATRIX

In a 2nd step, the "cumulated distance matrix" C is created. Each element of this matrix corresponds to the sum of the distances along the shortest path from the 1st element of the distance matrix, D(0,0), to a given couple of coordinates (t1,t2).

The cumulated distance matrix is built recursively: the 1st element of this matrix, C(0,0), is fixed at 0. Then, the element C(t1,t2) is equal to the smallest possible previous element (that is, the minimum of C(t1-1,t2-1), C(t1-1,t2) and C(t1,t2-1); this minimum is called the "origin" of the element C(t1,t2), as the optimal path leading to this element comes from its origin), plus the value of D(t1,t2) (corresponding to the distance between the t1-th frame of the 1st sound and the t2-th frame of the 2nd sound). In order to find the best path leading to an element, the origin of every couple of frames is kept in a matrix called the "path matrix".

ILLUSTRATION: CUMULATED DISTANCE MATRIX

Finally, in a 3rd step, the path matrix is used to find the optimal path connecting the 1st element to the last one. This path is found iteratively, starting from the final element. The previous element in the path is its origin, which can be found using the "path matrix". It is then possible to find the origin of that element, and, gradually, to trace back the best path, corresponding to the distortion of the two sounds that minimizes the distance between them.
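A minimal sketch of these three steps is given below, assuming the two sounds are arrays of normalized feature frames, and using the Euclidean frame distance retained in this report.

```python
import numpy as np

def dtw(sound1, sound2):
    """Align two sounds (arrays of shape (L1, d) and (L2, d)) and return the
    cumulated distortion and the optimal warping path."""
    L1, L2 = len(sound1), len(sound2)
    # step 1: distance matrix D between every pair of frames (Euclidean distance)
    D = np.linalg.norm(sound1[:, None, :] - sound2[None, :, :], axis=2)
    # step 2: cumulated distance matrix C and path matrix (origin of each element)
    C = np.full((L1, L2), np.inf)
    origin = np.zeros((L1, L2, 2), dtype=int)
    C[0, 0] = D[0, 0]
    for t1 in range(L1):
        for t2 in range(L2):
            if t1 == t2 == 0:
                continue
            candidates = []
            if t1 > 0 and t2 > 0:
                candidates.append((C[t1-1, t2-1], (t1-1, t2-1)))
            if t1 > 0:
                candidates.append((C[t1-1, t2], (t1-1, t2)))
            if t2 > 0:
                candidates.append((C[t1, t2-1], (t1, t2-1)))
            best, orig = min(candidates, key=lambda c: c[0])
            C[t1, t2] = best + D[t1, t2]
            origin[t1, t2] = orig
    # step 3: backtrack the optimal path from the last element to the first
    path, pos = [], (L1 - 1, L2 - 1)
    while pos != (0, 0):
        path.append(pos)
        pos = tuple(origin[pos])
    path.append((0, 0))
    return C[-1, -1], path[::-1]
```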

The 1-Pass DP algorithm extends the DTW algorithm, allowing the comparison of a long (continuous) sound S to several shorter templates {S1... Sn}. The output of this algorithm is the optimal sequence of short templates, Si1...Sik, and, for each template Sij, a sequence of couples of frames (one corresponding to the continuous sound, the other one to the template Sij). These sequences create a path through the continuous sound and the set of templates, minimizing the distance between them.
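A compact, unoptimized sketch of this decoding follows. It reuses the DTW moves of the previous sketch; for brevity, the backtracking only recovers the template sequence and the segment boundaries, not the full frame-level alignment that the real algorithm also provides.

```python
import numpy as np

def one_pass_dp(stream, templates):
    """Decode a continuous sound (T, d) as an optimal concatenation of templates
    (each of shape (L_k, d)); returns the total distortion and the decoded
    segments as (template_index, t_start, t_end) triples."""
    T = len(stream)
    dist = [np.linalg.norm(t[:, None, :] - stream[None, :, :], axis=2) for t in templates]
    D = [np.full(d.shape, np.inf) for d in dist]          # cumulated costs
    start = [np.zeros(d.shape, dtype=int) for d in dist]  # stream frame where the instance began
    best_end = np.full(T, np.inf)                         # best cost of a template ending at t
    best_seg = [None] * T                                 # (template, start frame) achieving it

    for t in range(T):
        for k, d in enumerate(dist):
            L = d.shape[0]
            for i in range(L):
                cands = []
                if i == 0:                                # a new template instance can start at t
                    entry = 0.0 if t == 0 else best_end[t - 1]
                    cands.append((entry, t))
                if i > 0 and t > 0:
                    cands.append((D[k][i-1, t-1], start[k][i-1, t-1]))  # diagonal move
                if t > 0:
                    cands.append((D[k][i, t-1], start[k][i, t-1]))      # horizontal move
                if i > 0:
                    cands.append((D[k][i-1, t], start[k][i-1, t]))      # vertical move
                best, s = min(cands, key=lambda c: c[0])
                if np.isfinite(best):
                    D[k][i, t] = best + d[i, t]
                    start[k][i, t] = s
            if D[k][L-1, t] < best_end[t]:
                best_end[t] = D[k][L-1, t]
                best_seg[t] = (k, start[k][L-1, t])

    segments, t = [], T - 1                               # backtrack the template sequence
    while t >= 0:
        k, s = best_seg[t]
        segments.append((k, s, t))
        t = s - 1
    return best_end[T-1], segments[::-1]
```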

1.2

Comparison between the proposed clustering method and the "Kmeans over Trajectories" method

Our model can be related to a more conventional algorithm for the clustering of multidimensional time series. This algorithm is simply an adaptation of the K-means algorithm, and is thus an unsupervised clustering algorithm. The K-means algorithm usually considers points, instead of sequences. For creating N clusters, it is initialized by associating random values to N points, considered as the centroids of the initial clusters. Then, it repeats the following steps:

- for every point, find the closest centroid, using the Euclidean distance; the point is considered part of the cluster associated with its closest centroid;
- for every cluster, recompute the centroid according to the set of points that have been declared part of it.

After iterating these operations, the centroids should converge to some values, creating a Voronoi decomposition of the space (each cluster being a cell of this decomposition). This algorithm can be modified in order to classify trajectories instead of points: trajectories are considered as centroids, and the distances between trajectories are computed using the DTW algorithm instead of the Euclidean distance. Such an algorithm could be used to classify sounds in an unsupervised way, and thus to find the phonetic units (a sketch is given after this list). However, this algorithm suffers from several weaknesses:

- 1st, it requires the sequences to be segmented in a way relevant for the clustering. For the acquisition of phonetic units, it would require the sounds to be segmented following the phoneme boundaries, raising a bootstrapping problem: how to find these boundaries if the phonemes are unknown? An algorithm dealing with this bootstrapping problem can be derived from the segmentation algorithm proposed below (however, its complexity is much higher than that of the algorithm proposed in this report).

- 2nd, it is, from a computational point of view, a very expensive algorithm. It involves, for every iteration of the K-means algorithm, the computation of the distance between every instance of the sounds to classify and the centroid of every cluster. From this point of view, the proposed algorithm, proceeding in two steps (first comparing the sounds to a predetermined database, and then applying the standard K-means algorithm on points), can be considered as an approximation of this algorithm, as it is computationally less expensive.
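A minimal sketch of this "K-means over trajectories" is given below, assuming presegmented sequences of feature frames. Because averaging warped trajectories is not well defined, the sketch recomputes each centroid as the medoid of its cluster; this is one common variant, not necessarily the exact update discussed here.

```python
import numpy as np

def dtw_cost(a, b):
    """Cumulated DTW distortion between two frame sequences (L1, d) and (L2, d)."""
    D = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    C = np.full(D.shape, np.inf)
    C[0, 0] = D[0, 0]
    for i in range(D.shape[0]):
        for j in range(D.shape[1]):
            if i == j == 0:
                continue
            prev = min(C[i-1, j-1] if i and j else np.inf,
                       C[i-1, j] if i else np.inf,
                       C[i, j-1] if j else np.inf)
            C[i, j] = D[i, j] + prev
    return C[-1, -1]

def kmeans_trajectories(sequences, n_clusters, n_iter=10, seed=0):
    """K-means-like clustering of variable-length sequences under the DTW distance."""
    rng = np.random.default_rng(seed)
    centroids = [sequences[i] for i in rng.choice(len(sequences), n_clusters, replace=False)]
    for _ in range(n_iter):
        # assignment step: closest centroid under DTW
        dists = np.array([[dtw_cost(s, c) for c in centroids] for s in sequences])
        labels = dists.argmin(axis=1)
        # update step: the medoid of each cluster becomes its new centroid
        for k in range(n_clusters):
            members = [s for s, l in zip(sequences, labels) if l == k]
            if members:
                within = [[dtw_cost(a, b) for b in members] for a in members]
                centroids[k] = members[int(np.argmin(np.sum(within, axis=1)))]
    return labels, centroids
```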

2

Speech segmentation based on template permutations

The algorithm presented in this section proposes an original and purely top-down method for syllabic or phonetic segmentation. The originality of this technique is its extensive use of global information, and the absence of any direct use of local information.

2.1.A

Basic concept

The method proposed hereafter relies on a conception of phonemes or syllables as the best units for a description of the speech signal: the syllabic or phonetic segmentations are optimal to describe the speech as a permutation/segmentation of a finite number of units.

This algorithm thus tries to find a segmentation of a relatively short speech stimulus, providing independent units that describe new speech stimuli as well as possible. The best segmentation is the one minimizing the deformation that would have to be applied to the new stimuli in order to obtain the best matching (i.e. the best description) between these new stimuli and the considered units. An analogy can be proposed between this task and a puzzle game: given a set of pieces, one can try to rebuild a given picture. The pieces of the puzzle are the sound units, the picture to rebuild is the speech stream; the algorithm proposed here finds the optimal set of pieces to rebuild the image. We would want to prove that these are the phonetic or syllabic units.

2.1.B

Description of the algorithm

The proposed mechanism relies on an EM algorithm. It is initialized with a given set of units. These units don't necessarily have to correspond to real linguistic units, considering that their boundaries will be modified in order to find the "best units" (corresponding to syllables or phonetic units). It is an iterative algorithm: every iteration computes an Expectation phase, which uses the current set of segmented units to describe new input stimuli as well as possible. The Expectation phase is followed by a Maximization phase, during which, using the description of the new speech stimuli found previously, the units are redefined.

i:

Expectation step

In this step, we try to find a "description" of speech stimuli, based on a limited set of segmented units. The description considered is obtained using the 1-pass DP algorithm. This algorithm finds the optimal concatenation of the syllable units to match the new speech stream. The 1-pass DP algorithm also provides the deformations of the sounds that are necessary to obtain this matching. This deformation is the objective we will try to minimize.

ii :

Maximization step

The result of the expectation step described hereinabove is used to modify the boundaries of the sound units considered. These boundaries are moved according to the deformation proposed by the 1-pass DP algorithm.

This algorithm should converge to units locally minimizing the distortion needed to describe new speech stimuli. Ideally, there should be two sets of units minimizing this distortion, namely the syllabic and the phonetic units. Unfortunately, I couldn't fully develop and optimize this algorithm. However, a naïve version has been implemented, providing promising results on simple tests. Further development would be needed.
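A naïve skeleton of one such iteration is sketched below. It assumes the one_pass_dp function from appendix 1.1 is available, and it replaces the alignment-driven boundary update with a simple local search (shift each boundary by one frame and keep the shift if the total distortion decreases), which is only one possible reading of the maximization step.

```python
import numpy as np
# assumes one_pass_dp(stream, templates) from appendix 1.1 is in scope

def em_segmentation(train_stream, held_out, boundaries, n_iter=5):
    """Iteratively adjust segmentation boundaries on train_stream so that the
    resulting templates describe held_out with minimal total distortion."""
    def cost(bounds):
        cuts = [0] + sorted(bounds) + [len(train_stream)]
        templates = [train_stream[a:b] for a, b in zip(cuts[:-1], cuts[1:]) if b > a]
        return one_pass_dp(held_out, templates)[0]

    boundaries = list(boundaries)
    best = cost(boundaries)
    for _ in range(n_iter):
        for j in range(len(boundaries)):      # greedy local search on each boundary
            for shift in (-1, +1):
                trial = boundaries.copy()
                trial[j] += shift
                if 0 < trial[j] < len(train_stream):
                    c = cost(trial)
                    if c < best:
                        best, boundaries = c, trial
    return boundaries, best
```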


3 References

[Bertoncini and Mehler, 1981] Bertoncini, J., Mehler, J. (1981). Syllables as units in infant speech perception. Infant Behavior and Development.
[Beymer and Poggio, 1996] Beymer, D., Poggio, T. (1996). Image representations for visual learning. Science.
[Blumstein and Stevens, 1979] Blumstein, S.E., Stevens, K.N. (1979). Acoustic invariance in speech production: evidence from measurements of the spectral characteristics of stop consonants. Journal of the Acoustical Society of America.
[Coath and Denham, 2005] Coath, M., Denham, S.L. (2005). Robust sound classification through the representation of similarity using response fields derived from stimuli during early experience. Biological Cybernetics.
[Cutzu and Edelman, 1996] Cutzu, F., Edelman, S. (1996). Faithful representation of similarities among 3D shapes in human vision. PNAS.
[Dehaene-Lambertz and Dehaene, 1994] Dehaene-Lambertz, G., Dehaene, S. (1994). Speed and cerebral correlates of syllable discrimination in infants. Nature.
[Dehaene-Lambertz et al., 2002] Dehaene-Lambertz, G., Dehaene, S., Hertz-Pannier, L. (2002). Functional neuroimaging of speech perception in infants. Science.
[Dehaene-Lambertz et al., 2006] Dehaene-Lambertz, G., Hertz-Pannier, L., Dubois, J. (2006). Nature and nurture in language acquisition: anatomical and functional brain-imaging studies in infants. Trends in Neurosciences.
[Dupoux et al., 1999] Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., Mehler, J. (1999). Epenthetic vowels in Japanese: a perceptual illusion? Journal of Experimental Psychology.
[Edelman, 1998] Edelman, S. (1998). Representation is representation of similarities. Behavioral and Brain Sciences.
[Eimas et al., 1971] Eimas, P.D., Siqueland, E.R., Jusczyk, P., Vigorito, J. (1971). Speech perception in infants. Science.
[Eimas et al., 1975] Eimas, P.D., Tartter, V.C. (1975). The role of auditory feature detectors in the perception of speech. Perception and Psychophysics.
[Fowler et al., 1993] Fowler, A., Saltzman, E. (1993). Coordination and coarticulation in speech production. Language and Speech.
[Friederici and Wessels, 1993] Friederici, A.D., Wessels, J.M.I. (1993). Phonotactic knowledge of word boundaries and its use in infant speech perception. Perception & Psychophysics.
[Hirsh-Pasek et al., 1987] Hirsh-Pasek, K., Golinkoff, R.M., Cauley, K.M., Gordon, L. (1987). The eyes have it: Lexical and syntactic comprehension in a new paradigm. Journal of Child Language.
[Jusczyk et al., 1996] Jusczyk, P.W., Myers, J., Kemler Nelson, D.G., Charles-Luce, J., Woodward, A.L., Hirsh-Pasek, K. (1996). Infants' sensitivity to word boundaries in fluent speech. Journal of Child Language.
[Kuhl, 1983] Kuhl, P.K. (1983). Perception of auditory equivalence classes for speech in early infancy. Infant Behavior and Development.
[Kuhl et al., 1992] Kuhl, P.K., Williams, K.A., Lacerda, F., Stevens, K.N. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science.
[Kuhl, 1992] Kuhl, P.K. (1992). Infants' perception and representation of speech: Development of a new theory.
[Lacerda, 1995] Lacerda, F. (1995). The perceptual-magnet effect: An emergent consequence of exemplar-based phonetic memory. Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm.
[Mattys and Jusczyk, 2001] Mattys, S.L., Jusczyk, P.W. (2001). Do infants segment words or recurring contiguous patterns? Journal of Experimental Psychology.
[Maye et al., 2002] Maye, J., Werker, J.F., Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition.
[Mermelstein, 1976] Mermelstein, P. (1976). Distance measures for speech recognition, psychological and instrumental. In Pattern Recognition and Artificial Intelligence.
[Myers and Rabiner, 1981] Myers, C.S., Rabiner, L.R. (1981). A comparative study of several dynamic time-warping algorithms for connected word recognition. The Bell System Technical Journal.
[Nazzi et al., 1998] Nazzi, T., Bertoncini, J., Mehler, J. (1998). Language discrimination by newborns: towards an understanding of the role of rhythm. Journal of Experimental Psychology.
[Osherson et al., 1984] Osherson, D.N., Stob, M., Weinstein, S. (1984). Learning theory and natural language. Cognition.
[Pena et al., 2003] Pena, M., Maki, A., Kovacic, D., Dehaene-Lambertz, G. (2003). Sounds and silence: An optical topography study of language recognition at birth. PNAS.
[Peperkamp and Le Calvez, 2003] Peperkamp, S., Le Calvez, R. (2003). The acquisition of allophonic rules: statistical learning with linguistic constraints. Cognition.
[Pisoni et al., 1980] Pisoni, D.B., Jusczyk, P.W., Walley, A., Murray, J. (1980). Discrimination of relative onset time of two-component tones by infants. Journal of the Acoustical Society of America.
[Riedmiller and Braun, 1993] Riedmiller, M., Braun, H. (1993). A direct adaptive method for faster back-propagation learning: the RPROP algorithm.
[Shepard, 1968] Shepard, R.N. (1968). Cognitive psychology: A review of the book by U. Neisser. American Journal of Psychology.
[Smith and Lewicki, 2006] Smith, E.C., Lewicki, M.S. (2006). Efficient auditory coding. Nature.
[Somervuo and Harma, 2004] Somervuo, P., Harma, A. (2004). Bird song recognition based on syllable pair histograms. Acoustics, Speech, and Signal Processing.
[Takami and Sagayama, 1992] Takami, J., Sagayama, S. (1992). A successive state splitting algorithm for efficient allophonic modeling. Acoustics, Speech, and Signal Processing.
[Werker and Tees, 1984a] Werker, J.F., Tees, R.C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development.
[Werker and Tees, 1984b] Werker, J.F., Tees, R.C. (1984). Phonemic and phonetic factors in adult cross-language speech perception. Journal of the Acoustical Society of America.
