A Study of Thai Tone Classification

Li Tan, Montri Karnjanadecha, Thanate Khaorapapong and Pichaya Tandayya
Department of Computer Engineering, Faculty of Engineering, Prince of Songkhla University, Hat Yai, Songkhla, Thailand 90112
E-mail: [email protected], [email protected], [email protected], [email protected]

Abstract. Tone classification is a necessary part of Thai speech recognition. This paper presents work that builds on our earlier study [1] and on the study reported in [2]. Several configurations of a tone-classification front end for a large-vocabulary Thai speech corpus are implemented and compared, covering the choice of tone-critical segment, the feature setting, the frequency scale, and the normalization technique. The experiments are presented and discussed. The best configuration achieves 72.21% accuracy on a large-vocabulary Thai continuous speech corpus [3] containing 5726 tones (training set: 4506; testing set: 1220).

1. Introduction

Tone classification is a critical component of speech recognition for tonal languages such as Thai, which has five tones: mid, low, falling, high, and rising. The identification of tone relies mainly on the contour of the fundamental frequency (F0), so pitch detection is the first stage of tone classification; it was addressed in our former work [1]. There have been a number of studies of Thai tone recognition. Reference [2] presents an empirical study of Thai tone models on three speech corpora: the Potisuk-1999 corpus, a Thai proverb corpus, and a Thai animal story corpus. In this paper we follow the methods of [2], applying them to the large-vocabulary Thai continuous speech corpus from the project described in [3], and analyze the experimental results. In the following, we first describe the speech data used in this work. Then the experimental setting is described, after which the experimental results are presented and discussed. Finally, conclusions are drawn.

2. Subject

In our experiments we use a large-vocabulary continuous Thai speech corpus taken from the project described in ref. [3]. The corpus consists of 360 utterances collected from 18 native Thai speakers (9 male and 9 female). Each speaker read 20 utterances randomly arranged from Thai words. In total, the corpus contains 5726 tones. The speech signals were digitized by a 16-bit A/D converter at 16 kHz, and the speech data were manually segmented and transcribed at the phoneme level using waveform analysis software.

3. Experimental Setting

The speech data are separated into two sets. The first 15 utterances of each speaker (270 utterances) form the training set, and the last 5 utterances of each speaker (90 utterances) form the testing set. The experiments are performed with a three-layer feedforward neural network. The number of input units depends on the number of features; the number of hidden units was chosen empirically from training performance; and the number of output units is 5, corresponding to the five tones of Thai. All feature parameters are normalized to lie between -1.0 and 1.0. The network is trained with the back-propagation algorithm. All of the work presented in this paper was implemented in MATLAB.
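The original experiments were implemented in MATLAB; as an illustrative sketch only (the hidden-layer size and activation below are assumptions, since the paper tuned them empirically), an equivalent three-layer network could be set up like this:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_tone_classifier(X: np.ndarray, y: np.ndarray) -> MLPClassifier:
    # X: tone feature vectors, pre-scaled to [-1, 1]
    # y: tone labels 0..4 (mid, low, falling, high, rising)
    clf = MLPClassifier(
        hidden_layer_sizes=(20,),  # assumed size; the paper chose it from training performance
        activation="tanh",         # assumption; a common choice for inputs in [-1, 1]
        solver="sgd",              # gradient descent with back-propagation
        learning_rate_init=0.01,
        max_iter=2000,
    )
    return clf.fit(X, y)
```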

4. Tone-critical Segment

The tone-critical segment refers to the part of the syllable containing the critical information for tone classification. It is generally accepted in current research that tone information lies mainly in the vowel part of the syllable, and ref. [2] concludes that the rhyme segment outperforms the whole-syllable segment. We first ran experiments on the vowel segment only, as described in ref. [1], and obtained 63% accuracy on 373 tones. Based on these experiments, we believe that the final consonant should also carry tone information. The observed pitch contours are shown in Fig. 1.


Fig. 1. Pitch contour with the final consonant and with the vowel only


From this observation, for the rising tone the pitch contour of the vowel alone contains 6 points, while the contour of the vowel together with the final consonant contains 14 points, a difference of 8 points. A classification experiment was also run on the vowel /aa/. The results are shown in Table 1.

Table 1. Tone-critical segment classification results

Segment                     Accuracy
Vowel only                  63%
Vowel + final consonant     77.48%

After combining the final consonant into tone classification, the accuracy increased from 63% to 77.48%. This confirms that the final consonant also contains critical tone information.
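To make the segment choice concrete, here is a minimal sketch (a hypothetical helper, assuming the frame-level pitch track and the manual phoneme-level time alignments described in Section 2) of extracting the rhyme (vowel plus final consonant) pitch contour:

```python
import numpy as np

def rhyme_f0(f0: np.ndarray, frame_rate: float,
             vowel_start: float, coda_end: float) -> np.ndarray:
    # f0: frame-level pitch track in Hz; frame_rate: frames per second.
    # vowel_start, coda_end: times in seconds from the manual segmentation.
    i = int(vowel_start * frame_rate)
    j = int(coda_end * frame_rate)
    segment = f0[i:j]
    return segment[segment > 0]  # keep voiced frames only (assumes 0 marks unvoiced)
```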

5. Feature Setting

Tone is represented by the shape of the pitch contour, that is, by its level and slope. For isolated tone recognition, third-order polynomial coefficients are generally used because the typical contour shapes differ distinctly. For continuous speech, however, the variation of the pitch contour is affected by many other factors, such as the speaker and the continuous context. To capture more of the intrinsic difference between the tones, we use 5 F0 heights and 5 slopes at 0%, 25%, 50%, 75% and 100% of the segment as the tone features, as shown in Fig. 2.
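As an illustration of this feature setting (a sketch, not the paper's original MATLAB code; the local-slope estimator is an assumption, since the paper does not specify how slopes were computed), the 10-dimensional feature vector could be built as follows:

```python
import numpy as np

def tone_features(f0: np.ndarray) -> np.ndarray:
    # Return 5 F0 heights and 5 slopes at 0%, 25%, 50%, 75%, 100% of the contour.
    # f0: voiced pitch contour of the tone-critical segment, one value per frame.
    n = len(f0)
    positions = [int(round(p * (n - 1))) for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
    heights = [f0[i] for i in positions]
    slopes = []
    for i in positions:
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        # assumed slope estimate: first difference around each sample point
        slopes.append((f0[hi] - f0[lo]) / max(hi - lo, 1))
    return np.array(heights + slopes)
```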

Fig. 2. The tone features: five F0 heights and five slopes at 0%, 25%, 50%, 75% and 100% of the segment

Here the classification considers all of the speech in the corpus: 18 speakers with 20 utterances each. The 10 tone features improve performance significantly, by about 6%, over the four third-order polynomial coefficients. This matches the results presented in ref. [2].

6. Frequency Scaling

Two frequency scales are examined: the semitone scale and the ERB-rate scale. The semitone scale is defined as

semitone = 12 · log2(f / f_ref)    (1)

and the ERB-rate scale as

ERB-rate = 21.4 · log10(0.00437 · f + 1)    (2)

where f is the frequency in hertz and f_ref is a reference frequency. The semitone scale is logarithmic; it is the scale used to define musical tones, and perceptual studies report that listeners judge F0 intervals to be equivalent when they are equal in semitones. Researchers therefore often use semitone scaling to reduce tone variation. ERB frequency scaling is similar to critical-band frequency scaling: the ERB scale is defined from knowledge of the human auditory system, so researchers use the ERB-rate to model human tone perception and to improve classification performance. Fig. 4 shows the histograms of one speaker's pitch data before and after scaling.
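A minimal sketch of the two conversions, assuming the standard definitions above (the reference frequency f_ref = 50 Hz is an assumption; the paper does not state its choice):

```python
import numpy as np

def hz_to_semitone(f_hz: np.ndarray, f_ref: float = 50.0) -> np.ndarray:
    # Equation (1): semitone scale relative to an assumed reference frequency.
    return 12.0 * np.log2(f_hz / f_ref)

def hz_to_erb_rate(f_hz: np.ndarray) -> np.ndarray:
    # Equation (2): ERB-rate scale (Glasberg & Moore form), f in Hz.
    return 21.4 * np.log10(0.00437 * f_hz + 1.0)
```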

Fig. 4. Histograms of one speaker's pitch data: (a) original speech, (b) semitone scaling, (c) ERB scaling

From the histograms we can see that the shape of the distribution is basically unchanged after scaling, but the variation range of the pitch is compressed. The classification results are shown in Table 3.

Table 3. Classification results of frequency scaling

            5 heights + 5 slopes   Semi-tone            ERB
Percent     66.15% (807/1220)      66.89% (816/1220)    68.61% (837/1220)

The table shows that both kinds of scaling improve performance, with ERB scaling giving the larger gain; this follows the result reported in ref. [2].

7. Normalization

F0 is speaker dependent; therefore, normalization is necessary for speaker-independent tone classification. To observe the pitch variation between speakers, the pitch histograms of two speakers (one male and one female) are shown in Fig. 5.

Fig. 5. Pitch histograms of two speakers (one male, one female)

Fig. 5 clearly shows the variation of pitch data between speakers: the mean F0 of the female speaker is clearly higher than that of the male, and the distributions of pitch also differ. This is why normalization is a critical part of speaker-independent tone classification. Generally, mean normalization and z-score normalization are used.

The mean normalization is shown in equation (3):

norm_f = f - m    (3)

where f is the original frequency, norm_f is the normalized frequency, and m is the mean F0 of the whole utterance. Mean normalization shifts the mean of the utterance to zero while keeping the relative differences within the utterance.

The z-score normalization is shown in equation (4):

norm_f = (f - m) / sd    (4)

where norm_f is the normalized frequency, m is the mean, and sd is the standard deviation of the pitch within the utterance. It normalizes the pitch distribution within the utterance to a zero-mean, unit-variance normal distribution. Fig. 6 shows the histograms after normalization.
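As a minimal sketch (not the authors' original MATLAB code), the two normalizations of equations (3) and (4) can be written as:

```python
import numpy as np

def mean_normalize(f0: np.ndarray) -> np.ndarray:
    # Equation (3): shift the utterance mean to zero, keeping relative differences.
    return f0 - f0.mean()

def zscore_normalize(f0: np.ndarray) -> np.ndarray:
    # Equation (4): map the utterance's pitch to zero mean and unit variance.
    return (f0 - f0.mean()) / f0.std()
```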

Fig. 6. Histograms of the two speakers' pitch data after mean normalization and z-score normalization

From Fig. 6 we note that the variation between the speakers is reduced. Mean normalization keeps the distribution of the original data, whereas z-score normalization reshapes the pitch distribution into a Gaussian. The classification results after normalization are shown in Table 4.

Table 4. Classification results of normalization

            Semi-tone            Semi + z-score       Semi + m-norm
Percent     66.15% (807/1220)    69.84% (852/1220)    72.21% (881/1220)

            ERB                  ERB + z-score        ERB + m-norm
Percent     68.61% (837/1220)    70.16% (856/1220)    72.05% (879/1220)

All of these classifications use the 10 tone features. From the table, semi-tone scaling with mean normalization gives the highest performance, which differs from the results in ref. [2], where ERB scaling with z-score normalization performed best. Based on this difference and on the histograms of the different speakers, we conclude that the distribution of the pitch data does not strictly follow a Gaussian distribution, while z-score normalization forces the distribution toward a Gaussian; this is likely the main reason for its lower performance here. In ref. [2], z-score normalization gave the higher performance, possibly because the amount of speech data was small.


8. Confusion Analysis

From the above, the best performance is obtained using the 10 features with semi-tone scaling and mean normalization. Table 5 is the confusion matrix under this best configuration.

Table 5. Confusion matrix

          0(M)   1(L)   2(F)   3(H)   4(R)   Percent(%)
0(M)      319    32     18     16     5      83.07
1(L)      51     159    6      16     17     63.86
2(F)      20     8      212    14     1      83.14
3(H)      48     8      28     103    6      53.37
4(R)      19     24     0      8      88     63.31
Overall                                      72.21

From the confusion matrix it can be seen that the falling tone achieves the highest recognition rate, while the high tone gives the poorest. There is also considerable confusion between the mid and low tones, the low and high tones, and the falling and high tones. Since the number of mid-tone samples is much larger than that of the other tones, the classifier may be biased toward the mid tone; thus falling, high and rising tones are often misclassified as mid.
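The per-tone rates and the overall accuracy in Table 5 follow from the matrix in the usual way; a generic sketch (illustrative only):

```python
import numpy as np

def confusion_rates(conf: np.ndarray):
    # conf[i, j]: number of tokens of true tone i classified as tone j
    # (rows/columns ordered M, L, F, H, R as in Table 5).
    per_tone = np.diag(conf) / conf.sum(axis=1)  # per-tone recognition rate
    overall = np.trace(conf) / conf.sum()        # overall accuracy
    return per_tone, overall
```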

9. Conclusions

In this paper we have presented a study of Thai tone classification with respect to the tone-critical segment, the feature setting, frequency scaling and normalization, using a feedforward neural network as the tone classifier. From the experiments we conclude that it is necessary to include the final consonant in Thai tone classification; that the feature setting of 5 heights and 5 slopes performs about 6% better than the third-order polynomial coefficients; that, for frequency scaling alone, the ERB scale is better than the semitone scale; and that mean normalization outperforms z-score normalization because it preserves the distribution information of the tones. The best configuration uses the 10 features with semitone scaling and mean normalization, achieving 72.21% accuracy.

In future work we plan to consider more of the factors that affect the shape of the pitch contour in order to further improve classification performance.

10. References

[1] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article Title", Journal, Publisher, Location, Date, pp. 1-10.
[2] C. D. Jones, A. B. Smith, and E. F. Roberts, Book Title, Publisher, Location, Date.