
A New Methodology for Comparing Speech Rhythm Structure between Utterances: Beyond Typological Approaches

Plínio A. Barbosa and Wellington da Silva
Instituto de Estudos da Linguagem, State University of Campinas, Brazil

Abstract. This paper proposes a new methodology for automatically comparing the speech rhythm structure of two utterances. Eleven parameters were automatically extracted from 44 pairs of audiofiles, yielding 11-dimensional difference vectors. The parameters include speech rate, duration-related stress group rate, prominence and prosodic boundary strength, f0 peak rate, as well as the coupling strength between underlying syllable and stress group oscillators. The 11-parameter difference vectors were used to infer the perceptual differences identified by a group of 10 listeners who judged the same 44 pairs of audiofiles. The results indicate that duration-related prominence or prosodic boundary rate and speech rate, taken together, predict up to 71 % of the response variance. To a minor extent, prominence/boundary strength mean and non-prominent VV unit rate predict up to 60 % of the response variance when combined with prominence or prosodic boundary rate.

Keywords: speech rhythm, prominence, rhythm perception, speech rate.

1 Introduction

This paper explores a formal device for answering two related questions: what makes utterances sound prosodically distinct? And what makes utterances differ as to the manner of speaking? We think the answer to these questions concerns differences in speech rhythm structure. Speech rhythm is related to the variable interaction of a structuring component with a regularity component [1]. Since timing and prominence organisation are the main variables which define rhythm, a methodology for examining different aspects of timing and prominence organisation throughout an utterance is relevant.

Two main approaches to speech rhythm have been proposed by researchers. One group of researchers is interested in finding typological rhythmic differences among languages or language varieties. This group proposed several measures (nPVI, rPVI, VarCo, %V, ΔC, inter alia; see, for instance, [2,3,4]) for examining patterns of data, with different clusters of data corresponding to different rhythm classes. The main problem with this approach is the lack of a universal principle of speech rhythm applicable to all languages, because its tenets presuppose an a priori classification of languages into two or three rhythm classes. Furthermore, events related to phonotactically-conditioned processes and hypoarticulation processes are usually invoked to explain why the data associated with a particular rhythm class occupy a particular region of the mathematical spaces formed by the proposed indexes. All proposals by this group of researchers reveal part of the consequences of speech production onto phoneme-sized variables. However, to advance the knowledge of speech rhythm, a key aspect of rhythm production (and perception) should be considered: the interplay between regularity and structuring constraints taking place between the syllable and the higher-level units (see [1] for emphasis on the importance of this interplay to all aspects of human movement). Structuring has to do with the hierarchical pattern of prominence or prosodic boundary levels. Regularity has to do with the (quasi-)regular recurrence of syllables and prominent syllables in time.

The other group of researchers has proposed hierarchical models such as [5,6], which posit the coupling between two or more levels of interacting oscillators, such as between the syllable and the stress-group oscillators. This coupling makes it possible to explain both universal and language-specific properties of rhythm by means of general principles of production applicable to all languages, parameterised by a coupling strength variable (ω). These hierarchical models separate the contributions of segmental factors from prosodic factors in speech rhythm, and this is one of their main strengths as compared with the typological approach. In fact, the hierarchical models capture the view of rhythm as the alternation of strong and weak beats as speech unfolds through time, and fulfil the three properties of an adequate model of speech rhythm proposed by [7]: predictivity, explicitness, and universality.

(Published in: H. Caseli et al. (Eds.): PROPOR 2012, LNAI 7243, pp. 329–337. © Springer-Verlag Berlin Heidelberg 2012.)
Unfortunately, the hierarchical models either present a level of abstraction that makes it difficult to infer relevant information from short excerpts of speech data [5,8,6] or evaluate speech rhythm in very particular experimental settings, such as speech synchrony and utterance repetition [9,10]. In this paper we present a methodology which takes prominence organisation and timing into account, while allowing a multiparametric comparison of any two excerpts of speech in terms of speech rhythm.

2 Methodology

To evaluate possible rhythmic differences across distinct speaking styles, reading and storytelling were chosen. This choice is motivated by the fact that storytelling presents elements which can be found in spontaneous conversation, such as hesitations due to macro- and microplanning of the discourse. Though hesitations can also occur in read speech, they are much less frequent there than in storytelling. This feature is important to consider when developing an approach for describing speech rhythm under natural conditions and for investigating possible differences between less and more controlled situations of utterance production.

2.1 Corpora

The corpus consisted of texts recorded by three speakers of Brazilian Portuguese (henceforth BP). Two female and one male native speakers read a 1,500-word text on the origin of the Belém pastries (reading style, RE). After the reading, the three subjects told what the text was about (storytelling style, ST). The speakers were Linguistics students aged between 30 and 45 years at the time of recording. For the perception tests (see section 2.3), a subset of the corpus was used so that sessions lasted no more than 25 minutes. This subset was formed by the ST style of one of the female speakers, and the RE style of the other two speakers (one male and one female). Excerpts of 8.9 to 18.2 seconds were extracted from several parts of the material in order to make up the subsets for the discrimination tasks. The choice of relatively long stretches of speech was guided by the high standard-deviations of the listeners' responses obtained in a previous study for excerpts of 1 to 2 seconds [11]. The long-duration excerpts allow the listeners to evaluate the manner of speaking more accurately than short-duration excerpts (see a similar extension for voice similarity judgement in [12]). The 44 excerpts were segmented and labelled in VV units.

2.2 Measuring Techniques and Parameters Extracted

According to a traditional approach in speech research [13,14,15], syllables were phonetically segmented by tracking two consecutive vowel onsets (VO). The segmentation was performed semi-automatically in Praat [16] in two stages: automatic VO detection using the BeatExtractor Praat script available in [17], followed by manual correction where applicable. Two consecutive VOs define a VV unit, which contains only a single vowel, starting at the first VO. The BeatExtractor script detects points in the speech signal where changes in the previously filtered energy envelope are relatively fast and positive (from low to high energy). According to [18], the speech signal energy in the region of the first and second formants simulates the spectral region our auditory system tracks for detecting syllables. Duration-related stress groups were then delimited by automatically detecting duration-related phrase stress positions throughout the utterances. The sequence of phrase stress positions was automatically tracked by serially applying two techniques for normalising the VV durations: a z-score transform (equation 1),

    z_i = (dur_i - μ_i) / sqrt(var_i),    (1)

where dur_i is the VV duration in ms and the pair (μ_i, var_i) is the reference mean and variance in ms of the phones within the corresponding VV unit (reference values for BP and Standard French are given in [17, pp. 489-490]); followed by a 5-point moving-average filtering (equation 2):

    z_filt_i = (5 z_i + 3 z_{i-1} + 3 z_{i+1} + z_{i-2} + z_{i+2}) / 13    (2)
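The two-step normalisation can be sketched as follows. This is a minimal illustration, not the SGdetector script itself, and the reference table below uses made-up values (real per-phone means and variances for BP and Standard French are tabulated in [17]).

```python
import math

# Hypothetical per-phone reference (mean in ms, variance in ms^2);
# real values are tabulated in [17, pp. 489-490].
REF = {"t": (70.0, 250.0), "a": (90.0, 400.0), "s": (110.0, 500.0)}

def zscore(vv_units):
    """Equation (1): normalise each VV duration against the summed
    reference mean and variance of the phones inside the unit."""
    z = []
    for dur_ms, phones in vv_units:
        mu = sum(REF[p][0] for p in phones)
        var = sum(REF[p][1] for p in phones)
        z.append((dur_ms - mu) / math.sqrt(var))
    return z

def smooth(z):
    """Equation (2): 5-point moving average with weights 1, 3, 5, 3, 1
    (edge values are repeated to pad the sequence)."""
    pad = [z[0], z[0]] + list(z) + [z[-1], z[-1]]
    return [(pad[i] + 3 * pad[i + 1] + 5 * pad[i + 2]
             + 3 * pad[i + 3] + pad[i + 4]) / 13
            for i in range(len(z))]
```

Local maxima of the smoothed sequence then mark the duration-related phrase stress positions described below.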


The normalisation technique and the detection of duration-related phrase stress positions (detection of z_filt maxima) were performed by the Praat script SGdetector. The computation of both the stress group duration and the number of VV units in the stress group is automatically performed by this script. This two-step normalisation technique aims at minimising the effects of phoneme-size intrinsic duration in the VV unit. The maxima of the normalised duration signal both prominence degree and prosodic boundary strength, indistinctly. This is not a drawback of the approach, since the salience of these two prosodic functions on a particular word equally signals perceived rhythm. Listeners of Romance languages often attribute both functions to a prominent or a pre-boundary word when evaluating these functions in their own languages [19]. As presented in [6], the ratio between the intercept and the slope of the regression line designed to explain stress group duration from the number of VV units in the group (predictor variable) is related to a measure of the amount of stressing in a particular language, the coupling strength ω. The higher the coupling strength, the greater the influence of the structuring component on VV duration regularity, that is, VV duration becomes less regular. The coupling strength defined as the aforementioned ratio is the first parameter extracted from the corresponding annotation file of each audio excerpt. Besides this parameter, ten other parameters were automatically computed by a new RhythmParameterExtractor script running on Praat, which extracts the 11 parameters from each excerpt by using a pair of audio and annotation files (TextGrid file in Praat). The second parameter is speech rate in VV units per second, extracted from the corresponding TextGrid file.
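The coupling-strength estimate amounts to fitting a line to stress-group duration as a function of the number of VV units in the group and taking the intercept/slope ratio. A pure-Python sketch (illustrative only; the paper computes this with its Praat scripts):

```python
def coupling_strength(n_vv, sg_dur_ms):
    """Regress stress-group duration (ms) on the number of VV units in
    the group via ordinary least squares, and return the intercept/slope
    ratio, which is related to the coupling strength omega of [6]."""
    n = len(n_vv)
    mx = sum(n_vv) / n
    my = sum(sg_dur_ms) / n
    sxx = sum((x - mx) ** 2 for x in n_vv)
    sxy = sum((x - mx) * (y - my) for x, y in zip(n_vv, sg_dur_ms))
    slope = sxy / sxx              # ms per additional VV unit
    intercept = my - slope * mx    # ms
    return intercept / slope
```

For example, stress groups of 2 to 5 VV units with durations lying on the line 100 + 200n ms give a ratio of 0.5.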
The third to fifth parameters are the mean, standard-deviation and skewness of the z_filt maxima, which reveal the structure of duration-related pooled prominence degree and boundary strength in the excerpt. The use of the prominence/boundary distribution is crucial for producing an accurate description of speech rhythm, as recently claimed by [20,21]. The sixth parameter is the rate of z_filt maxima in peaks per second, which stands for the prominence or prosodic boundary rate, for the reasons mentioned before. The seventh parameter is the rate of f0 peaks in peaks per second. The sequence of f0 peaks is obtained from the audiofile in five steps: (1) extracting the f0 trace using Praat (limits between 75 and 600 Hz), (2) smoothing the obtained contour with a 5-Hz filter, (3) interpolating the gaps due to unvoiced segments, (4) automatically counting the number of peaks in the contour, and (5) dividing the number of peaks by the total duration of the excerpt. The next three parameters are the coefficients of variation (the ratio between standard-deviation and mean) of the following variables: the number of VV units per stress group, the duration of the stress group, and the VV duration. The last parameter is the rate of non-prominent VV units, which is close to the articulation rate, for non-prominent VV units contain neither silent pauses nor final-lengthened acoustic segments. Non-prominent VV units are those whose normalised duration shows no peak.


Their rate was computed by dividing the number of such units in a particular utterance by the total duration between the first and the last VO of the utterance.
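Steps (3) to (5) of the f0 peak rate computation can be sketched as follows, over a uniformly sampled contour in which unvoiced frames are None. This is an illustration only: the paper extracts and smooths the contour in Praat (5-Hz filter), which is omitted here.

```python
def f0_peak_rate(f0, dt):
    """Peaks per second in an f0 contour sampled every dt seconds.
    Unvoiced frames (None) are linearly interpolated before counting
    strict local maxima; the count is divided by total duration."""
    voiced = [i for i, v in enumerate(f0) if v is not None]
    if len(voiced) < 3:
        return 0.0
    filled = list(f0)
    for a, b in zip(voiced, voiced[1:]):
        for j in range(a + 1, b):      # fill the unvoiced gap linearly
            frac = (j - a) / (b - a)
            filled[j] = f0[a] + frac * (f0[b] - f0[a])
    peaks = sum(1 for i in range(voiced[0] + 1, voiced[-1])
                if filled[i - 1] < filled[i] > filled[i + 1])
    return peaks / (len(f0) * dt)      # step 5: divide by total duration
```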

2.3 Perception Test: Discrimination Tasks

Two discrimination tasks were designed for comparing two randomly-combined audio excerpts. Each excerpt was also delexicalised using the technique developed by [22]. Delexicalisation suppresses the segmental information of the speech signal to render it unintelligible while preserving its prosodic characteristics. Vainio and colleagues' method combines inverse-filtered glottal flow and all-pole modelling of the vocal tract, with the advantage of preserving voice quality. Each pair of excerpts was combined in two orders of presentation (AB and BA), both in the delexicalised version and in the original version. This design allowed us to examine the degree of consistency when the same audiofiles are evaluated in the two different orders. Consistency was defined as the absolute difference between the judgment responses for the AB and BA orders (perfect consistency is a difference of zero). Two subsets of 44 audio pairs were built, one with the delexicalised pairs (DS), and the other with the same pairs in their original version (OS). Each listener judged the DS first (task 1), and then the OS (task 2), instructed by this single sentence: "evaluate how different the manner of speaking (modo de falar in Portuguese) of the excerpts in the pair is, on a scale from 1 (same manner of speaking) to 5 (very different manner of speaking)". In each subset, the two excerpts in a pair were separated by a short 1,000-Hz tone to signal the boundary between the audiofiles being evaluated. A group of ten listeners, all Linguistics majors, participated in the experiment. As regards their performance in the two tasks, we evaluated two hypotheses: (1) the task with the DS has higher and less variable consistency than the task with the OS (because the listeners would focus their attention on prosodic elements only in the case of the DS), and (2) one or more parameters among the 11 difference values for the two paired excerpts can satisfactorily predict the listeners' responses.
This second hypothesis presupposes a link between perceptual and production differences, at least in terms of the 11 parameters proposed in this paper.
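The response transform used in the analysis (section 3) and the consistency measure can be written compactly. A sketch; the mapping (r - 3)/2 is our reading of "linearly transformed into a scale from -1 to 1":

```python
def transform(resp):
    """Map a 1-5 judgment linearly onto [-1, 1] (3 becomes the neutral 0)."""
    return (resp - 3) / 2

def consistency(resp_ab, resp_ba):
    """Absolute difference between the transformed judgments of the same
    pair heard in AB and BA order; 0 means perfect consistency."""
    return abs(transform(resp_ab) - transform(resp_ba))
```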

3 Results and Partial Discussion

The response scale from 1 to 5 was linearly transformed into a scale from -1 to 1, with zero standing for a neutral response. As regards consistency, that is, the ability of the listener to choose the same response for a pair of excerpts irrespective of the presentation order, the results indicated a higher and less variable degree of consistency for the OS, contradicting our first hypothesis (mean, standard-deviation): (0.7, 0.6) for the DS, and (0.4, 0.5) for the OS (significant difference, t(df = 398) = 4.2, p < 10^-4). Both means are close to


the distance between two points in the transformed response scale (0.5). Apparently, the lexical and acoustic segmental information in the original subset, via indexical (speaker recognition), lexical or semantic memory, helps maintain the same judgment for the same pair of excerpts in different orders. The reason why listeners did not give exactly the same evaluation when the order was exchanged is related to slight changes in the manner of speaking throughout the excerpts, as well as to increased memory load. Due to the limits of the post-recognition temporal buffer of up to 10 or 20 seconds [23, pp. 56-66], probably only the final parts of the first excerpt are retained in memory for comparison with the second excerpt. As regards the responses themselves, there was no significant difference (paired t-test, t(df = 439) = 1.6, p < 0.2) between the OS and DS. In order to predict the responses from the 11-parameter difference vectors, multiple linear regression models were designed. The predicted variable was the mean transformed value of the ten listeners' responses to the original subset of excerpts. The reason for this is that, although there was no difference in judgment between the OS and DS, the OS produced more consistent responses across presentation orders. Only the pairs with response consistency equal to or less than 0.5 were chosen for the models. From these pairs, only those with standard-deviation across listeners less than 0.5 were considered for analysis. The rationale behind this choice is the use of relatively homogeneous judgments. Each predicted variable for a selected pair is the mean response value for the two presentation orders. This produced a set of 15 paired excerpts comparing subjects and styles. The predictor variables were the 11 components of the absolute difference vector corresponding to each pair of excerpts. The best model explained 71 % of the variance of the listeners' responses (lr):

    lr = -1.5 + 10.4 pr + 2.65 sr - 10.75 pr * sr,    (3)

with p-values of 0.009 or better for all coefficients (F(3,11) = 12.4, p < 0.0008). This model predicts the listeners' responses from two production parameters: speech rate (sr) difference and z_filt maxima peak rate (pr) difference. Taken separately, these two parameters explain 40 % (pr) and 50 % (sr) of the responses' variance. Duration-related prominence/boundary strength mean (z_filt maxima mean) difference also explains 50 % of the responses' variance. Non-prominent VV rate (ur) difference, on the other hand, explains 31 % of the variance by itself. These four parameters are the best single predictors. Taken together, z_filt maxima mean and z_filt maxima peak rate differences explain 60 % of the variance, whereas combining z_filt maxima peak rate with non-prominent VV rate difference explains 56 % of the variance, with all coefficients significant or marginally significant (p-values from 0.08 to 0.003). It is not necessary to use logistic regression to restrict the predicted values to the [-1, 1] interval, because values less than -1 can be interpreted as highly similar, and values greater than 1 as highly distinct in the manner of speaking. As regards intra-speaker differences for different excerpts as well as differences between speaking styles and between speakers, response means (and standard-deviations) are: -0.7 (0.2) for excerpts from the same speakers, which indicates a


similar manner of speaking; 0 (0.4) for excerpts of the same style (reading) by different speakers, which indicates that the two speakers read in relatively different ways; and 0.5 (0.6) for different speaking styles (and speakers).
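Equation (3) can be applied directly to a new pair of excerpts once their parameter differences are known. A sketch using the reported coefficients:

```python
def predicted_difference(pr_diff, sr_diff):
    """Predicted perceived difference in manner of speaking (equation 3).
    pr_diff: absolute difference in z_filt maxima peak rate (peaks/s);
    sr_diff: absolute difference in speech rate (VV units/s).
    On the transformed scale, values near -1 mean 'same manner of
    speaking' and values near 1 mean 'very different'."""
    return -1.5 + 10.4 * pr_diff + 2.65 * sr_diff - 10.75 * pr_diff * sr_diff
```

Identical parameter values (both differences zero) yield the model floor of -1.5, which, as noted above, is simply read as "highly similar".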

4 General Discussion

It can be inferred from the results that, at least for explaining what is perceived as differences in the manner of speaking, listeners seem to rely on up to four parameters: speech rate, duration-related prominence or boundary rate (and not f0 peak rate, which was not significant in any combination), mean prominence/boundary strength (estimated by the z_filt maxima mean), and non-prominent VV rate. These results confirm hypothesis 2 above: at least two rate-related parameters (syllable rate and stress group rate) satisfactorily predict the listeners' responses, explaining 71 % of the response variance. None of the variability descriptors, i.e., the coefficients of variation of VV duration, of the number of VV units in the stress group, and of stress group duration, made significant predictions of the responses. The successful predictors (speech rate in VV units per second, stress group rate, duration-related prominence/boundary strength, and non-prominent VV unit rate per second) are closely related to the parameters which predict judgments of voice similarity [12]: pausing and articulation rate. The former are, to some extent, subsumed by speech rate and prominence degree/boundary strength rate, and the articulation rate is very close to the non-prominent VV unit rate. This result does not mean that the other descriptors are not useful for describing rhythm at all, but that the four descriptors presented above seem to be used for perceiving rhythm (or the influence of rhythm on judgments of the manner of speaking). As regards differences across styles and speakers, inter-style differences were well perceived, but so too, to some extent, were differences between readers (as the RE style was evaluated with two speakers). The methodology shown here seems quite robust and indicates that it is probably better to evaluate rhythmic differences and their degree than to try to classify speech rhythm with typological approaches.
In speech technology, our approach can be used to automatically detect rhythmic differences between a pre-recorded utterance from one or more reference databases and a new utterance whose rhythmic structure is unknown. The multiple regression equation (3) can be used to predict how far apart a reference utterance and a new utterance are in terms of perceived rhythm. This figure can help in making decisions about prosodic differences in devices that automatically detect or recognise languages and dialects in person-machine dialogue systems. According to the results presented here, the questions that open this paper (what makes utterances sound prosodically distinct? What makes utterances differ as to the manner of speaking?) have the following answer: speech rate and stress group rate (and not f0 peak rate), as well as duration-related prominence/boundary strength mean and non-prominent VV rate (the latter close to the articulation rate). All these measures are related to syllable and stress group rates, combined with a measure of prominence or boundary strength.


Acknowledgments. The first author acknowledges grant number 300371/2008-0 from CNPq, and thanks Sandra Madureira for proof-reading. We thank our listeners and speakers, and Juva Batella for adapting the text from European Portuguese to BP. The original text comes from INESC-Lisboa.

References

1. Fraisse, P.: Les Rythmes. Journal Français d'Oto-Rhino-Laryngologie, Supplément 7, 23–33 (1968)
2. Dellwo, V.: The Role of Speech Rate in Perceiving Speech Rhythm. In: Proc. Speech Prosody 2008, Campinas, Brazil, pp. 375–378 (2008)
3. Low, E.L., Grabe, E., Nolan, F.: Quantitative Characterisations of Speech Rhythm: Syllable-Timing in Singapore English. Language and Speech 43, 377–401 (2000)
4. Ramus, F., Nespor, M., Mehler, J.: Correlates of Linguistic Rhythm in the Speech Signal. Cognition 73, 265–292 (1999)
5. Barbosa, P.A.: From Syntax to Acoustic Duration: A Dynamical Model of Speech Rhythm Production. Speech Communication 49, 725–742 (2007)
6. O'Dell, M.L., Nieminen, T.: Coupled Oscillator Model of Speech Rhythm. In: Proc. of ICPhS 1999, San Francisco, USA, pp. 1075–1078 (1999)
7. Bertinetto, P.M., Bertini, C.: Towards a Unified Predictive Model of Natural Language Rhythm. Quaderni del Laboratorio di Linguistica della SNS 7 (2008)
8. Barbosa, P.A.: Measuring Speech Rhythm Variation in a Model-Based Framework. In: Proc. of Interspeech 2009 - Speech and Intelligence, Brighton, UK, pp. 1527–1530 (2009)
9. Cummins, F., Port, R.: Rhythmic Constraints on "Stress-Timing" in English. J. Phon. 26, 145–171 (1998)
10. Cummins, F.: Entraining Speech with Speech and Metronomes. Cadernos de Estudos Linguísticos 43, 55–70 (2002)
11. Silva, W., Barbosa, P.A.: Caracterização Semiautomática da Tipologia Rítmica do Português Brasileiro. Anais do Colóquio Brasileiro de Prosódia da Fala, ID [2432011] (2011), http://www.experimentalprosodybrazil.org/III_CBPF_Anais.html
12. Öhman, L., Eriksson, A., Granhag, P.A.: Mobile Phone Quality vs Direct Phone Quality: How the Presentation Format Affects Earwitness Identification Accuracy. The European Journal of Psychology Applied to Legal Context 2(2), 161–182 (2010)
13. Classe, A.: The Rhythm of English Prose. Blackwell, Oxford (1939)
14. Lehiste, I.: Suprasegmentals. MIT Press, Cambridge (1970)
15. Dogil, G., Braun, G.: The PIVOT Model of Speech Parsing. Verlag, Wien (1988)
16. Boersma, P., Weenink, D.: Praat: Doing Phonetics by Computer. Version 5.2.44, http://www.praat.org
17. Barbosa, P.A.: Incursões em torno do Ritmo da Fala. Pontes/FAPESP, Campinas (2006)
18. Scott, S.K.: Perceptual Centres in Speech: An Acoustic Analysis. PhD Thesis, University College London (1993)
19. Beckman, M.E.: Evidence for Speech Rhythms across Languages. In: Tohkura, Y., et al. (eds.) Speech Perception, Production and Linguistic Structure, pp. 457–463. IOS Press, New York (1992)
20. Kohler, K.J.: Rhythm in Speech and Language: A New Research Paradigm. Phonetica 66, 29–45 (2009)
21. Cumming, R.E.: The Language Specific Interdependence of Tonal and Durational Cues in Perceived Rhythmicality. Phonetica 68, 1–25 (2011)
22. Vainio, M., et al.: New Method for Delexicalization and Its Application to Prosodic Tagging for Text-to-Speech Synthesis. In: Proc. of Interspeech 2009 - Speech and Intelligence, pp. 1703–1706 (2009)
23. Cowan, N.: Attention and Memory: An Integrated Framework. Oxford University Press, New York (1997)