Bangla Phoneme Recognition for ASR Using Multilayer Neural Network

Abstract: This paper presents a Bangla phoneme recognition method for Automatic Speech Recognition (ASR). The method consists of two stages: i) a multilayer neural network (MLN), which converts acoustic features, mel frequency cepstral coefficients (MFCCs), into phoneme probabilities, and ii) a hidden Markov model (HMM)-based classifier, into which the phoneme probabilities from the first stage and the corresponding ∆ and ∆∆ parameters calculated by linear regression (LR) are fed to obtain more accurate phoneme strings. Experiments on a Bangla speech corpus prepared by us show that the proposed method provides higher phoneme recognition performance than the existing method, while requiring fewer mixture components in the HMMs.

Index terms: Multilayer Neural Network, Hidden Markov Models, Acoustic Features, Phoneme Probabilities, Automatic Speech Recognition.

I. INTRODUCTION

A new vocabulary word, or out-of-vocabulary (OOV) word, often causes an "error" or a "rejection" in current hidden Markov model (HMM)-based automatic speech recognition (ASR) systems. To resolve this OOV-word problem, an accurate phonetic typewriter or phoneme recognizer is needed [1-3]. Phoneme recognition for ASR has been studied extensively for almost all of the world's major spoken languages. Unfortunately, very little work has been done on ASR for Bangla (also known as Bengali), although it is one of the most widely spoken languages in the world: more than 220 million people speak Bangla as their native language, and it ranks sixth by number of native speakers [4]. A major obstacle to research on Bangla ASR is the lack of a proper speech corpus. Some efforts have been made to develop a Bangla speech corpus for building a Bangla text-to-speech system [5].

However, that effort is part of a project to develop speech databases for Indian languages, in which Bangla figures as the language spoken in the eastern region of India (West Bengal, with Kolkata as its capital). Most native speakers of Bangla (more than two thirds), however, reside in Bangladesh, where it is the official language. Although the written characters of Standard Bangla are the same in both countries, some sounds are pronounced differently across varieties of Standard Bangla, in addition to the myriad phonological variations in non-standard dialects [6]. Therefore, there is a need for ASR research on the mainstream variety of Bangla spoken in Bangladesh. Recognition of Bangla phonemes by artificial neural networks (ANNs) is reported in [7-8]. However, most of these works concentrate on simple recognition tasks over very small databases, or simply on the frequency distributions of different vowels and consonants. Moreover, the methods in [7-8] use a single multilayer neural network (MLN) in their architecture. Because a single MLN cannot resolve coarticulation effects [9], these methods do not achieve high phoneme recognition performance. In this paper, we build a large-scale Bangla phoneme recognition system for ASR. For this purpose, we first develop a medium-sized (compared to the existing corpora in the Bangla ASR literature) Bangla speech corpus comprising native speakers from almost all the major cities of Bangladesh. Then, mel-frequency cepstral coefficients (MFCCs) of 39 dimensions are extracted from the input speech.
The proposed method consists of two stages: i) a multilayer neural network (MLN), which converts the acoustic features (MFCCs) into phoneme probabilities, and ii) a hidden Markov model (HMM)-based classifier, into which the phoneme probabilities from the first stage and the corresponding ∆ and ∆∆ parameters calculated by linear regression (LR) are fed to obtain more accurate phoneme strings. Incorporating the dynamic parameters ∆ and ∆∆ resolves coarticulation effects and consequently increases phoneme recognition performance. For evaluating Bangla phoneme correct rate (PCR) and phoneme accuracy (PA), we have designed three experiments: (i) MFCC+MLN, (ii) MFCC+MLN+∆ and (iii) MFCC+MLN+∆.∆∆ (proposed).
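The ∆ and ∆∆ computation can be sketched as follows. This is a minimal NumPy illustration, assuming the common three-point regression formula ∆_t = (x_{t+1} − x_{t−1})/2 and edge-frame repetition at the boundaries (the paper does not specify its edge handling); the function names are ours:

```python
import numpy as np

def three_point_delta(feats):
    """Three-point linear-regression slope along the time axis.

    feats: (T, D) array, e.g. per-frame phoneme probabilities.
    Edge frames are handled by repeating the first/last frame.
    """
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    # Regression slope over the window {t-1, t, t+1}: (x[t+1] - x[t-1]) / 2
    return (padded[2:] - padded[:-2]) / 2.0

def stack_dynamic_features(probs):
    """Stack static, delta and delta-delta features: (T, D) -> (T, 3*D)."""
    delta = three_point_delta(probs)
    delta2 = three_point_delta(delta)
    return np.hstack([probs, delta, delta2])

# 39-dimensional phoneme probabilities over 100 frames -> 117-dim features
probs = np.random.rand(100, 39)
features = stack_dynamic_features(probs)
assert features.shape == (100, 117)
```

Stacking the 39-dimensional probabilities with their ∆ and ∆∆ in this way yields the 117-dimensional vectors consumed by the HMM stage.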

The paper is organized as follows. Section II briefly describes the approximate Bangla phonemes with their corresponding phonetic symbols; Section III describes the Bangla speech corpus; Section IV gives a brief description of the existing and proposed methods; Section V gives the experimental setup; Section VI explicates the experimental results and discussion; and finally, Section VII draws conclusions and remarks on future work.

II. PHONETIC SYMBOLS FOR BANGLA PHONEMES

The Bangla phonetic inventory consists of 8 short vowels (A, Av, B, D, G, H, I, J), excluding the long vowels (C, E), and 29 consonants. Table I shows the Bangla vowel phonemes with their corresponding International Phonetic Alphabet (IPA) symbols and our proposed symbols. The consonants used in the Bangla language are presented in Table II, which lists the same items for consonants as Table I does for vowels. In Table II, the pronunciations of /k/, /l/ and /m/ are the same, as in the words wek (/biʃ/), wel (/biʃ/) and wWm (/ɖiʃ/) respectively, shown in Fig. 1; their English meanings are "twenty (20)", "poison" and "bowl" respectively. Similarly, in the words Rvg (/dʒam/) and hvK (/dʒak/), there is no difference between the pronunciations of /R/ and /h/, as depicted in Fig. 2; their English meanings are "blackberry" and "go" respectively. Again, Fig. 3 shows that there is no difference between /Y/ and /b/ in the words nwiY (/hɾin/) and bvwZb (/natin/), which mean "deer" and "granddaughter" respectively. Moreover, the phonemes /o/ and /p/ carry the same pronunciation in the words cvnvo (/pahaɽ/) and Avlvp (/aʃaɽ/), shown in Fig. 4; their English meanings are "hill" and "rainy season" respectively. Initial consonant clusters are not allowed in native Bangla: the maximum syllable structure is CVC (i.e., one vowel flanked by a consonant on each side) [10]. Sanskrit words borrowed into Bangla possess a wide range of clusters, expanding the maximum syllable structure to CCCVC. English and other foreign borrowings add even more cluster types to the Bangla inventory.

Figure 1: Spectrogram of Bangla phonemes /k/, /l/ and /m/ in the words wek (/biʃ/), wel (/biʃ/) and wWm (/ɖiʃ/) respectively.

Figure 2: Spectrogram of Bangla phonemes /R/ and /h/ in the words Rvg (/dʒam/) and hvK (/dʒak/) respectively.

Figure 3: Spectrogram of Bangla phonemes /Y/ and /b/ in the words nwiY (/hɾin/) and bvwZb (/natin/) respectively.

Figure 4: Spectrogram of Bangla phonemes /o/ and /p/ in the words cvnvo (/pahaɽ/) and Avlvp (/aʃaɽ/) respectively.

TABLE I: Bangla Vowels.

Letter   IPA            Our Symbol
A        /ɔ/ and /o/    a
Av       /a/            aa
B        /i/            i
C        /i/            i
D        /u/            u
E        /u/            u
G        /e/ and /æ/    e
H        /oj/           oi
I        /o/            o
J        /ow/           ou

TABLE II: Bangla Consonants.

Letter   IPA            Our Symbol
K        /k/            k
L        /kh/           kh
M        /ɡ/            g
N        /ɡʱ/           gh
O        /ŋ/            ng
P        /tʃ/           ch
Q        /tʃʰ/          chh
R        /dʒ/           j
S        /dʒʱ/          jh
U        /ʈ/            ta
V        /ʈʰ/           tha
W        /ɖ/            da
X        /ɖʱ/           dha
Y        /n/            n
Z        /t̪/            t
_        /t̪ʰ/           th
`        /d̪/            d
a        /d̪ʱ/           dh
b        /n/            n
c        /p/            p
d        /pʰ/           ph
e        /b/            b
f        /bʱ/           bh
g        /m/            m
h        /dʒ/           j
i        /ɾ/            r
j        /l/            l
k        /ʃ/ and /s/    s
l        /ʃ/            s
m        /ʃ/ and /s/    s
n        /h/            h
o        /ɽ/            rh
p        /ɽ/            rh
q        /e̯/            y
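Collapsing the duplicate symbols in Tables I and II gives 8 vowel and 29 consonant units, which together with the short-pause and silence models form the 39-unit monophone set used later in Section V. The following snippet records our reading of the tables; the exact merging of duplicates is our assumption:

```python
# Vowel and consonant symbols from Tables I and II (duplicate symbols merged)
VOWELS = ["a", "aa", "i", "u", "e", "oi", "o", "ou"]                     # 8
CONSONANTS = ["k", "kh", "g", "gh", "ng", "ch", "chh", "j", "jh",
              "ta", "tha", "da", "dha", "n", "t", "th", "d", "dh",
              "p", "ph", "b", "bh", "m", "r", "l", "s", "h", "rh", "y"]  # 29
# Short pause (sp) and silence (sil) complete the 39 monophone HMMs
MONOPHONES = VOWELS + CONSONANTS + ["sp", "sil"]
assert len(MONOPHONES) == 39
```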

III. BANGLA SPEECH CORPUS

A real problem in experimenting with Bangla ASR is the lack of a proper Bangla speech corpus; in fact, no such corpus is available, or at least none is referenced in the existing literature. Therefore, we developed a medium-sized Bangla speech corpus, described below. One hundred sentences from the Bengali newspaper "Prothom Alo" [11] were uttered by 30 male speakers from different regions of Bangladesh; these 3,000 (30x100) utterances form the training corpus (D1). A different set of 100 sentences from the same newspaper, uttered by 10 other male speakers (1,000 utterances in total), forms the test corpus (D2). All of the speakers are Bangladeshi nationals and native speakers of Bangla, aged between 20 and 40 years. We chose the speakers from a wide area of Bangladesh: Dhaka (central region), Comilla and Noakhali (east), Rajshahi (west), Dinajpur and Rangpur (north-west), Khulna (south-west), and Mymensingh and Sylhet (north-east). Although all of them speak Standard Bangla, they are not free from their regional accents. Recording was done in a quiet room at United International University (UIU), Dhaka, Bangladesh, using a desktop computer and a head-mounted close-talking microphone; a ceiling fan and an air conditioner were switched on, and some low-level street or corridor noise could be heard. Jet Audio 7.1.1.3101 software was used to record the voices. The speech was sampled at 16 kHz and quantized to 16-bit stereo coding without any compression, and no filter was applied to the recorded voice.

IV. PHONEME RECOGNITION METHODS

4.1 Existing Method

Fig. 5 shows the existing phoneme recognition method using an MLN [8]. At the acoustic feature extraction stage, the input speech is converted into MFCCs of 39 dimensions (12-MFCC, 12-∆MFCC, 12-∆∆MFCC, P, ∆P and ∆∆P, where P stands for the raw log energy of the input speech signal). The MFCCs are input to an MLN with two hidden layers after combining the preceding (t-3)-th and succeeding (t+3)-th frames with the current t-th frame. The MLN has 39 output units (one per monophone) giving phoneme probabilities for the current frame t. The two hidden layers consist of 300 and 100 units, respectively. The MLN is trained using the standard back-propagation algorithm. This method yields comparable recognition performance; however, a single MLN suffers from an inability to model dynamic information precisely.
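As a rough illustration, the frame splicing and forward pass of such a network can be sketched as below. This is our own NumPy sketch (the function and class names are ours), assuming a 117-300-100-39 topology obtained by splicing the (t−3)-th, t-th and (t+3)-th 39-dimensional frames; back-propagation training is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def splice_frames(mfcc, offset=3):
    """Concatenate frames t-3, t and t+3 (edges clamped): (T, 39) -> (T, 117)."""
    T = len(mfcc)
    idx = np.arange(T)
    prev = mfcc[np.clip(idx - offset, 0, T - 1)]
    nxt = mfcc[np.clip(idx + offset, 0, T - 1)]
    return np.hstack([prev, mfcc, nxt])

class MLN:
    """117-300-100-39 network with sigmoid units, as described in the text."""
    def __init__(self, sizes=(117, 300, 100, 39), seed=0):
        rng = np.random.default_rng(seed)
        self.weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes, sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def forward(self, x):
        for W, b in zip(self.weights, self.biases):
            x = sigmoid(x @ W + b)
        return x  # per-frame phoneme probabilities in (0, 1)

mfcc = np.random.randn(50, 39)          # 50 frames of 39-dim MFCCs
probs = MLN().forward(splice_frames(mfcc))
assert probs.shape == (50, 39)
```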

Figure 5: MLN-based Existing Phoneme Recognition Method.

4.2 Proposed Method

Fig. 6 shows the proposed phoneme recognition method, which comprises two stages: i) an MLN, which converts the acoustic features (MFCCs) into phoneme probabilities of 39 dimensions, and ii) an HMM-based classifier, into which the phoneme probabilities from the first stage and the corresponding ∆ and ∆∆ parameters calculated by LR are fed to obtain more accurate phoneme strings. The architecture of the MLN and its training procedure are the same as in Section 4.1. The 39-dimensional output probabilities obtained by the MLN and the corresponding ∆ and ∆∆ computed by three-point LR are combined into input features of 117 dimensions (39x3) for the HMM-based classifier.

Figure 6: Proposed Phoneme Recognition Method.

V. EXPERIMENTAL SETUP

The frame length and frame rate (shift between two consecutive frames) are set to 25 ms and 10 ms, respectively, to obtain the acoustic features (MFCCs) from the input speech. The MFCC feature vector comprises 39 dimensions (12-MFCC, 12-∆MFCC, 12-∆∆MFCC, P, ∆P and ∆∆P, where P stands for the raw log energy of the input speech signal). For designing an accurate phoneme recognizer, PCR and PA on the D2 data set are evaluated using an HMM-based classifier. The D1 data set is used to train 39 Bangla monophone HMMs (8 vowels, 29 consonants, sp and sil) with five states, three loops, and a left-to-right topology. The input features for methods (i), (ii) and (iii) have 39, 78 and 117 dimensions, respectively. In the HMMs, the output probabilities are represented by Gaussian mixtures with diagonal covariance matrices; the number of mixture components is set to 1, 2, 4, 8, 16 and 32. In the MLN, the non-linearity is the sigmoid function 1/(1+exp(-x)), ranging from 0 to 1, for both the hidden and output layers.
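With this framing, a 25 ms window at 16 kHz is 400 samples and a 10 ms shift is 160 samples, so the number of full analysis frames per utterance follows directly. A small sketch (our own helper, counting only complete frames):

```python
def num_frames(n_samples, sr=16000, frame_ms=25, shift_ms=10):
    """Number of full analysis frames for the 25 ms / 10 ms setup."""
    frame_len = sr * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sr * shift_ms // 1000       # 160 samples at 16 kHz
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // shift

# A 1-second utterance at 16 kHz yields 98 full frames
assert num_frames(16000) == 98
```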

To obtain the PCR and PA, we have designed the following experiments:
(i) MFCC+MLN [8]
(ii) MFCC+MLN+∆
(iii) MFCC+MLN+∆.∆∆ [Proposed]

VI. EXPERIMENTAL RESULTS AND DISCUSSION

Figs. 7 and 8 show the PCR and PA, respectively, for the investigated methods on the training data set D1. The figures show that the proposed method provides higher phoneme recognition performance than the other methods investigated. For example, with 32 mixture components, the proposed method (iii) achieves 70.78% PCR, while methods (i) and (ii) achieve 67.75% and 70.37%, respectively. At the same number of mixture components, the accuracies of methods (i), (ii) and (iii) are 52.98%, 59.95% and 61.31%, respectively. The PCR and PA for the test data set D2 are shown in Figs. 9 and 10, respectively. The proposed method (iii) outperforms the other methods on both evaluations (PCR and PA) for 2, 4, 8, 16 and 32 mixture components. Notably, with 32 mixture components in Fig. 10, method (iii) attains 54.66% accuracy, the best among the investigated methods.
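PCR and PA are assumed here to follow the usual HTK-style definitions: PCR = (N − S − D)/N × 100 and PA = (N − S − D − I)/N × 100, where S, D and I are the substitutions, deletions and insertions in a minimum-edit-distance alignment of the hypothesis against N reference phonemes. A self-contained sketch of this scoring (our own implementation, not the authors' tooling):

```python
def phoneme_scores(ref, hyp):
    """HTK-style phoneme correct rate and accuracy (%)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (total_errors, S, D, I) aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):                       # all-deletion column
        e, s, d, ins = dp[i - 1][0]
        dp[i][0] = (e + 1, s, d + 1, ins)
    for j in range(1, m + 1):                       # all-insertion row
        e, s, d, ins = dp[0][j - 1]
        dp[0][j] = (e + 1, s, d, ins + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            cands = [
                (dp[i - 1][j - 1][0] + sub, dp[i - 1][j - 1], (sub, 0, 0)),
                (dp[i - 1][j][0] + 1, dp[i - 1][j], (0, 1, 0)),   # deletion
                (dp[i][j - 1][0] + 1, dp[i][j - 1], (0, 0, 1)),   # insertion
            ]
            cost, (e, s, d, ins), (ds, dd, di) = min(cands, key=lambda c: c[0])
            dp[i][j] = (cost, s + ds, d + dd, ins + di)
    _, S, D, I = dp[n][m]
    N = len(ref)
    return 100.0 * (N - S - D) / N, 100.0 * (N - S - D - I) / N
```

An inserted phoneme lowers PA but not PCR, which is why the two measures can diverge, as in the figures.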

Figure 7: Phoneme correct rate for training data using investigated methods (phoneme correct rate (%) versus number of mixture components 1-32; curves: MFCC+MLN, MFCC+MLN+Δ, MFCC+MLN+Δ.ΔΔ).

Figure 8: Phoneme accuracy for training data using investigated methods (phoneme accuracy (%) versus number of mixture components 1-32; curves: MFCC+MLN, MFCC+MLN+Δ, MFCC+MLN+Δ.ΔΔ).

Figure 9: Phoneme correct rate for test data using investigated methods (phoneme correct rate (%) versus number of mixture components 1-32; curves: MFCC+MLN, MFCC+MLN+Δ, MFCC+MLN+Δ.ΔΔ).

Figure 10: Phoneme accuracy for test data using investigated methods (phoneme accuracy (%) versus number of mixture components 1-32; curves: MFCC+MLN, MFCC+MLN+Δ, MFCC+MLN+Δ.ΔΔ).

Method (iii) performs best because it includes the dynamic parameters ∆ and ∆∆, whereas the existing method (i) contains no dynamic parameters. The second method (ii) embeds only the velocity coefficients (∆), which cover a context window of limited size, and consequently achieves higher recognition performance than the existing method (i). Since the proposed method incorporates both velocity (∆) and acceleration (∆∆) coefficients, it performs better than method (ii). The proposed method also reduces the number of mixture components needed in the HMMs, and hence the computation time: for example, in Fig. 10, approximately 47.50% phoneme recognition accuracy is obtained by method (i) with 32 mixture components but by method (iii) with only four.

VII. CONCLUSION

In this paper, we proposed a Bangla phoneme recognition method using a multilayer neural network. The following conclusions are drawn from the study:
(i) The proposed method provides a higher phoneme correct rate and phoneme accuracy than the other methods investigated.
(ii) The proposed method requires fewer mixture components in the HMMs.
(iii) The dynamic parameters ∆ and ∆∆ significantly improve phoneme recognition performance.
In the near future, we plan to evaluate phoneme recognition performance using a recurrent neural network.

REFERENCES
[1] I. Bazzi and J. R. Glass, "Modeling OOV words for ASR," in Proc. ICSLP, Beijing, China, pp. 401-404, 2000.
[2] S. Seneff et al., "A two-pass strategy for handling OOVs in a large vocabulary recognition task," in Proc. Interspeech, 2005.
[3] K. Kirchhoff, "OOV detection by joint word/phone lattice alignment," in Proc. ASRU, Kyoto, Japan, Dec. 2007.
[4] http://en.wikipedia.org/wiki/List_of_languages_by_total_speakers, last accessed July 12, 2010.
[5] S. P. Kishore, A. W. Black, R. Kumar, and R. Sangal, "Experiments with unit selection speech databases for Indian languages," Carnegie Mellon University.
[6] http://en.wikipedia.org/wiki/Bengali_phonology, last accessed July 12, 2010.
[7] K. Roy, D. Das, and M. G. Ali, "Development of the speech recognition system using artificial neural network," in Proc. 5th International Conference on Computer and Information Technology (ICCIT02), Dhaka, Bangladesh, 2002.
[8] M. R. Hassan, B. Nath, and M. A. Bhuiyan, "Bengali phoneme recognition: a new approach," in Proc. 6th International Conference on Computer and Information Technology (ICCIT03), Dhaka, Bangladesh, 2003.
[9] T. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Trans. Neural Networks, vol. 5, no. 3, 1994.
[10] C. Masica, The Indo-Aryan Languages, Cambridge University Press.
[11] Daily Prothom Alo. Online: www.prothom-alo.com