Emotional Speech Synthesis Based on Improved Codebook Mapping Voice Conversion

Yu-Ping Wang, Zhen-Hua Ling, Ren-Hua Wang
iFlytek Speech Laboratory, University of Science and Technology of China, Hefei
{ypwang2, zhling}@ustc.edu, [email protected]

Abstract. This paper presents a spectral transformation method for emotional speech synthesis based on a voice conversion framework. Three emotions are studied: anger, happiness and sadness. To obtain high naturalness, good speech quality and expressive emotion, our original STASC system is modified by introducing a new feature selection strategy and a hierarchical codebook mapping procedure. Our analysis shows that the low-frequency LSF coefficients carry more emotion-related information, and therefore only these coefficients are converted. Listening tests show that the proposed method achieves a satisfactory balance between the emotional expressiveness and the speech quality of the converted speech signals.

1. Introduction

In recent years, the speech quality and naturalness of TTS (text-to-speech) systems have reached a high level, which opens new areas of research such as emotional speech synthesis. The acoustic features of emotional speech and rules for synthesizing it have been studied [1], especially from the viewpoint of prosody, but generating the spectrum of emotional speech remains a problem. Corpus-based approaches select waveform units from a large emotional speech corpus [2][3]; they achieve high speech quality, but designing, recording and labeling the database remain difficult. Therefore, to generate the spectrum of emotional speech automatically, a voice conversion system is adopted here to transform the spectrum of a neutral voice into an emotional one.

In general, there are two methods for spectrum conversion: codebook mapping and GMM-based conversion. Codebook mapping [4] is adopted here because each codeword represents one segment of the training data and therefore preserves the information in the training data well, whereas in GMM-based methods a conversion function optimized over a whole group of data may lose the information of some training samples and smooth away the speaker's or emotion's characteristics. In this paper, the emotional spectrum is converted from that of neutral speech. To achieve high accuracy in the spectrum conversion, based on our analysis only the low-frequency LSF coefficients are converted and DAL is introduced to modify the converted LSF coefficients.

In addition, a hierarchical codebook mapping procedure is used to improve the speech quality of the conversion output. Three emotions are synthesized from reading-style speech by the presented method: anger, happiness and sadness. Comparing the improved voice conversion system with the original one, speech quality is improved while the identification rate of the emotion type remains stable.

This paper is arranged as follows: Section 2 introduces the baseline voice conversion system, the experimental data and the details of the proposed improvements. Section 3 describes the experimental results. After a discussion, we conclude the paper.

2. Method

2.1. Basic Idea

Codebook mapping voice conversion is used to convert neutral speech into emotional speech. According to the analysis of the conversion parameters, only the low-frequency LSF coefficients are converted, because they are shown to carry more emotion-related information. DAL (Distance of Adjacent-dimension LSF coefficients) is introduced to modify the converted LSF, and a hierarchical codebook mapping procedure is used to improve the speech quality of the conversion output.

2.2. Baseline Voice Conversion System

The baseline system is based on codebook mapping for spectral conversion [5]. To reduce smoothing effects, the system uses a phoneme-tied weighting strategy that takes into account the phoneme types and state types of the codewords in addition to the objective distance between spectral coefficients. This strategy reduces smoothing effects while maintaining high speech quality. For prosodic conversion, a decision tree is adopted. STRAIGHT is used for speech analysis, and an all-pole model represents the spectrum of the analysis output: 40 poles give a precise model of the spectrum for speech data at a 16 kHz sampling rate. These poles are then converted to LSF coefficients, which are the spectral parameters used in the conversion procedure.
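To make the pole-to-LSF step concrete, the following is a minimal sketch of converting all-pole coefficients to line spectral frequencies via the standard sum/difference polynomials. The input vector `a` is an assumption (the 40-pole coefficients estimated from the STRAIGHT spectrum); the STRAIGHT analysis itself is not shown and is not part of this sketch.

```python
# Minimal sketch of the LPC-to-LSF step (Sec. 2.2). `a` is assumed to hold
# the all-pole coefficients [1, a1, ..., a40] of 1/A(z); the STRAIGHT
# analysis that produces them is not shown.
import numpy as np

def lpc_to_lsf(a: np.ndarray) -> np.ndarray:
    """Convert LPC coefficients of 1/A(z) to LSFs (radians, ascending)."""
    a_ext = np.concatenate([a, [0.0]])
    p_poly = a_ext + a_ext[::-1]   # symmetric polynomial P(z)
    q_poly = a_ext - a_ext[::-1]   # antisymmetric polynomial Q(z)
    # LSFs are the angles of the unit-circle roots of P(z) and Q(z),
    # excluding the trivial roots at z = 1 and z = -1.
    roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
    angles = np.angle(roots)
    return np.sort(angles[(angles > 1e-9) & (angles < np.pi - 1e-9)])
```

For a 40-pole model this yields the 40 LSF values per frame that serve as the spectral parameters in the conversion procedure.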

2.3. Speech Database

The target emotions are anger, happiness and sadness; they are selected because they are primary emotions and can be expressed continuously in speech. Our experiments show that emotionally loaded texts help emotional expression. Since collecting emotional texts is difficult, texts with or without emotional content are collected first, and emotional texts are then obtained by adding, deleting or changing a few key words. For each emotion, 100 emotional sentences are collected, and one female speaker reads the texts in both the target emotional style and a neutral style. For each emotional voice and the neutral one, 10 sentences are selected at random from the recorded database for a listening test in which ten listeners are asked to identify the emotion of each sentence among 5 choices: neutral, anger, happiness, sadness and fear. Table 1 shows the result of the identification test.

Table 1. Result of the identification test for the emotional speech database (%). Columns give the recorded emotion, rows the perceived emotion.

Perceived    Neutral   Anger   Happiness   Sadness
Neutral       92.5      3.75     8.0         6.5
Anger          0.75    89.75     6.0         0.25
Happiness      1.5      3.75    83.75        0.5
Sadness        2.75     0.5      0.75       82.75
Fear           2.5      2.25     1.5        10.0

For each emotion type, only the first 50 emotional sentences and the corresponding neutral sentences are used for training the voice conversion system.

2.4. Improvement on Feature Selection

2.4.1. Significance Analysis of LSF Coefficients at Different Orders

To analyze how significant each LSF order is in representing the spectral differences between neutral and emotional speech, the following analysis is conducted. Accurate alignments are first made between the emotional utterances and the corresponding neutral ones, and then 10000 frames of LSF coefficients on vowels are analyzed with a paired-samples t-test. Figure 1 shows the significance of each LSF order in distinguishing neutral and happy speech; the results for angry and sad speech are similar and are not listed here. The following conclusions can be drawn from Figure 1:

1. There clearly exist spectral components that strongly distinguish neutral and emotional speech, and they must be adjusted as the emotion changes.

2. These components appear mainly in the low-order LSF coefficients; the high-order coefficients are much less significant in distinguishing the two.

Since the high-order LSF coefficients of neutral and emotional speech are close to each other, only the low orders need to be converted when going from neutral to emotional, while the high orders can be left unchanged. This avoids smoothing effects at high frequencies and yields more natural and clearer speech. In this experiment, therefore, only the lowest 10 orders are converted and the other 30 are left unchanged.

Fig. 1. T-test result of the significance analysis (T value for each LSF order, 1-40)
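The per-order significance analysis can be sketched as follows. The array names and shapes are assumptions (time-aligned neutral and emotional LSF frames), not the paper's actual implementation.

```python
# Sketch of the paired-samples t-test of Sec. 2.4.1. `neutral_lsf` and
# `emotional_lsf` are assumed to be (n_frames, 40) arrays of time-aligned
# LSF vectors taken from vowel frames (about 10000 frames in the paper).
import numpy as np
from scipy import stats

def lsf_order_significance(neutral_lsf, emotional_lsf):
    """Return the paired t statistic for each LSF order."""
    t_values, _p_values = stats.ttest_rel(emotional_lsf, neutral_lsf, axis=0)
    return t_values

# Orders with large |T| (in the paper: the lowest 10) are selected for
# conversion; the remaining 30 orders are copied from the neutral speech.
```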

2.4.2. LSF Modification Based on DAL Conversion

Although converting only the lowest 10 LSF orders improves the quality of the output speech, it is still not as good as the neutral voice. One problem is that the synthesized speech is not clear enough: the smoothing effects blur the speech formants, so to improve the synthesized speech the formants should be sharpened. DAL, defined in formula (1), is strongly correlated with the speech formants:

DAL[i] = LSF[i+1] - LSF[i]    (1)

DAL is therefore added to the parameter space and takes part in training and conversion. After all the coefficients have been generated, let DAL[i] and LSF[i] denote the generated DAL and LSF coefficients; DAL is used to modify the LSF data as in formula (2):

LSF[i] = LSF[i-1] + DAL[i-1]/2 + (2*DAL[i-1] * ((LSF[i+1] - LSF[i-1]) - (DAL[i] + DAL[i-1]))) / (2*DAL[i-1] + DAL[i])    (2)

After this modification, the new LSF coefficients are converted into a spectrum and used, together with the prosody information, to synthesize emotional speech.
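A sketch of the DAL modification follows. Note that formula (2) is reconstructed here from a garbled source, so this implementation mirrors the cleaned-up equation above and should be taken as one plausible reading, not the authors' verified code.

```python
# Sketch of the DAL-based LSF modification (Sec. 2.4.2). `lsf` and `dal`
# are assumed to be the generated coefficient vectors for one frame;
# formula (2) is implemented as reconstructed above, which is one
# plausible reading of the garbled original.
import numpy as np

def compute_dal(lsf: np.ndarray) -> np.ndarray:
    """Formula (1): DAL[i] = LSF[i+1] - LSF[i]."""
    return lsf[1:] - lsf[:-1]

def modify_lsf_with_dal(lsf: np.ndarray, dal: np.ndarray) -> np.ndarray:
    """Apply the formula-(2) correction to the generated LSF coefficients."""
    out = lsf.copy()
    for i in range(1, len(lsf) - 1):
        gap = (lsf[i + 1] - lsf[i - 1]) - (dal[i] + dal[i - 1])
        out[i] = (lsf[i - 1] + dal[i - 1] / 2.0
                  + 2.0 * dal[i - 1] * gap / (2.0 * dal[i - 1] + dal[i]))
    return out
```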

2.5. Hierarchical Codebook Mapping

The precision of the predicted spectrum affects the quality of the synthesized speech, so improving this precision yields higher-quality speech. Since our synthesis system is based on codebook mapping, some smoothing effect is inevitable, and several attempts [6][7] have been made to reduce it. Here a hierarchical codebook mapping method is proposed to improve the precision of the spectral conversion. Let LSFn[i] denote the natural LSF parameters and LSFp[i] the predicted ones; the residual Res[i] = LSFn[i] - LSFp[i] is also converted and used to modify LSFp[i] to obtain more accurate LSF coefficients.

To predict Res[i], the conversion residuals of the LSF coefficients on the training set are combined with the neutral LSF to construct a residual codebook, which predicts Res[i] in the usual codebook mapping way during the conversion procedure. The final LSF is obtained as in formula (3):

LSF[i] = LSFp[i] + Res[i]    (3)

The flowcharts for constructing the residual codebook and for the conversion process of the hierarchical codebook mapping method are shown in Fig. 2 and Fig. 3.

Fig. 2. The flowchart of constructing the residual codebook

Fig. 3. The conversion process of the hierarchical codebook mapping method
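The two-stage conversion in Figs. 2 and 3 can be sketched as below. The codebooks are assumed to be plain NumPy arrays of paired source/target codewords, and a simple inverse-distance weighting stands in for the phoneme-tied weighting of the baseline system, which is omitted here.

```python
# Sketch of the hierarchical codebook mapping (Sec. 2.5). Codebooks are
# assumed to be (n_codewords, dim) arrays of paired entries; the baseline's
# phoneme-tied weighting is replaced by a simple inverse-distance weighting
# over the n_best nearest codewords.
import numpy as np

def codebook_convert(x, src_cb, tgt_cb, n_best=4):
    """Map a source vector to the target space via weighted codeword lookup."""
    dist = np.linalg.norm(src_cb - x, axis=1)
    idx = np.argsort(dist)[:n_best]
    w = 1.0 / (dist[idx] + 1e-8)
    w /= w.sum()
    return w @ tgt_cb[idx]

def hierarchical_convert(neutral_lsf, lsf_cb, emo_cb, res_src_cb, res_cb):
    """Formula (3): final LSF = predicted LSF + predicted residual."""
    lsf_pred = codebook_convert(neutral_lsf, lsf_cb, emo_cb)      # stage 1
    res_pred = codebook_convert(neutral_lsf, res_src_cb, res_cb)  # stage 2
    return lsf_pred + res_pred
```

Following Fig. 2, `res_src_cb` holds the neutral LSF codewords and `res_cb` the corresponding conversion residuals collected on the training set.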

3. Experiments

3.1. Spectrum Comparison

To test the effect of the improvements, four conversion systems are used separately to convert the emotional spectrum from the neutral one. Figure 4 compares the sad spectra converted by the four systems with the natural spectra.

Fig. 4. Comparison of converted spectra

From left to right, the spectra are generated by the baseline system, the system improved as in 2.4.1, the system improved as in 2.4.2 and the system improved as in 2.5, followed by the neutral spectrum and the sad spectrum. As Figure 4 shows, when the LSF difference coefficients are added, when only the low LSF orders are converted, or when the spectral residual is used to modify the LSF, the formants of the synthesized speech become clearer; listening tests also show that the speech quality improves.

3.2. Listening Tests

To evaluate the effect of the conversion and of the improvements, listening tests are conducted. Ten listeners are asked, on the one hand, to identify the emotion type of each synthetic utterance (a forced-choice test with four options: anger, happiness, sadness and neutral) and, on the other hand, to give a Mean Opinion Score (MOS) for the quality of each utterance. As in Section 2.3, for each emotion type only the first 50 emotional sentences and the corresponding neutral sentences are used for training the voice conversion system. The following six systems are used for the listening tests:

A. Converted prosody with neutral spectrum: prosody is converted by the decision tree, with no spectrum conversion.
B. Baseline system: both spectrum and prosody are converted.
C. Baseline system with improvement 2.4.1: only the lowest 10 orders of the LSF parameters are converted.
D. Baseline system with improvement 2.4.2: DAL modification.
E. Baseline system with improvement 2.5: spectral residual modification.
F. Baseline system with all three improvements.

20 sentences of each emotion are selected and synthesized by each system. Table 2 and Table 3 show the results of the listening tests.

Table 2. Identification rate of the emotion type for the different conversion systems (%)

Emotion      A      B      C      D      E      F
Anger       56.0   76.0   76.0   78.0   76.0   78.0
Happiness   76.0   84.0   82.0   82.0   84.0   82.0
Sadness     28.0   60.0   62.0   62.0   62.0   60.0

Table 3. The MOS score for the different conversion systems

Emotion      A      B      C      D      E      F
Anger       4.03   3.03   3.32   3.14   3.10   3.34
Happiness   3.88   3.18   3.36   3.22   3.20   3.40
Sadness     2.62   2.26   2.97   2.75   2.83   3.10

From Table 2 and Table 3 we can conclude that prosodic features contribute most of the emotion in speech, especially for anger and happiness, while the converted spectrum reinforces the emotional expression, notably for sadness. Because the spectrum of neutral speech differs greatly from that of sad speech, and the spectrum is also important for emotional expression, combining the neutral spectrum with sad prosody yields a low identification rate and MOS; after the spectrum is converted and the improvements are applied, both the identification rate and the MOS become acceptable.

4. Discussion

As our listening tests show, prosodic features are important for emotional expression, especially for anger and happiness: when the neutral spectrum is combined with emotional prosody, the identification rate of the emotions in the synthetic speech is already acceptable, and after the spectrum is converted it becomes much higher. In other words, prosodic features contribute most of the emotion in speech, while the spectrum reinforces the emotional expression. For sadness, the spectrum is comparatively more important, so using the converted spectrum instead of the original one increases the identification rate considerably. To achieve high accuracy in the spectrum conversion, only the low-frequency LSF coefficients are converted, DAL is introduced to modify the converted LSF, and a hierarchical codebook mapping procedure is used to improve the speech quality of the conversion output.

5. Conclusion

In this paper we describe an emotional speech synthesis system based on codebook mapping: prosodic parameters are predicted by a decision tree while the spectrum is converted from neutral to emotional. Because smoothing effects deteriorate speech quality, several improvements are applied, and listening tests show that the proposed methods effectively improve the speech quality of the synthesizer while the identification rate in the perception test remains high. However, many problems remain unresolved; for example, the accuracy of the converted prosodic features is still low, as is that of the converted spectrum. More work on these problems is needed to generate high-quality emotional speech.

6. References

1. Murray, I. R., et al.: Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion. Journal of the Acoustical Society of America, 93(2), pp. 1097-1108 (1993)
2. Iida, A., et al.: A Speech Synthesis System with Emotion for Assisting Communication. Proc. ISCA Workshop on Speech and Emotion, pp. 167-177 (2000)
3. Iida, A., Campbell, N.: A Corpus-Based Speech Synthesis System with Emotion. Speech Communication, 40 (2003)
4. Abe, M., Nakamura, S., Shikano, K., Kuwabara, H.: Voice Conversion through Vector Quantization. Proc. ICASSP 1988, pp. 655-658
5. Shuang, Z. W., Wang, Z. X., Ling, Z. H., Wang, R. H.: A Novel Voice Conversion System Based on Codebook Mapping with Phoneme-Tied Weighting. Proc. ISCSLP 2004, pp. 1197-1200
6. Maeda, N., Banno, H., Kajita, S.: Speaker Conversion through Non-Linear Frequency Warping of STRAIGHT Spectrum. Proc. EUROSPEECH 1999, pp. 827-830
7. Toda, T., Saruwatari, H., Shikano, K.: Voice Conversion Algorithm Based on Gaussian Mixture Model with Dynamic Frequency Warping of STRAIGHT Spectrum. Proc. ICASSP 2001, pp. 841-844