Multi-agent Based Arabic Speech Recognition


Muhammad Taha*, Tarek Helmy**, and Reda Abo Alez***
* Cairo University, Faculty of Science, Mathematics Department, Cairo, Egypt.
** College of Computer Science and Engineering, King Fahd University of Petroleum and Minerals, Mail Box 413, Dhahran 31261, Kingdom of Saudi Arabia.
*** Al-Azhar University, Faculty of Engineering, Computers and Systems Engineering Department, Nasr City, Cairo, Egypt.
Emails: [email protected], [email protected].

Abstract

This paper presents a novel agent-based design for Arabic speech recognition. We define Arabic speech recognition as a multi-agent system in which each agent has a specific goal and deals with that goal only; once all of the small tasks are accomplished, so is the overall task. A number of agents are required to recast Arabic speech recognition in this way, namely the Feature Extraction Agent and the Pattern Classification Agent. These agents are detailed in this paper.

Keywords: agent, speech recognition, neural network, linear predictive coding.

1. Introduction

Russell and Norvig [15] define an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors. The multi-agent paradigm promotes the interaction and cooperation of intelligent autonomous agents in order to deal with complex tasks [5]. In this paper, we define Arabic speech recognition as a multi-agent system in which each subsystem has a specific goal and deals with that goal only. The Feature Extraction Agent and the Pattern Classification Agent are wired together to recognize Arabic speech signals. This paper is organized as follows: Section 2 presents the challenges of Arabic speech recognition, Section 3 reviews previous work on Arabic speech recognition, Section 4 describes the Feature Extraction Agent, Section 5 presents the Pattern Classification Agent, and Section 6 reports the results and conclusions.

2. Challenges for Arabic Speech Recognition

Arabic is a Semitic language that differs from Indo-European languages syntactically, morphologically, and semantically. Kirchhoff et al. [11] state that the most difficult problems in developing high-accuracy speech recognition systems for Arabic are: script representation (the Arabic alphabet contains letters only for long vowels and consonants; short vowels and other pronunciation phenomena, such as consonant doubling, can be indicated by diacritics, short strokes placed above or below the preceding consonant); morphological complexity (Arabic has a rich and productive morphology, which leads to a large number of potential word forms); and dialectal vs. formal speech (Arabic dialects are primarily oral languages, while written material is almost invariably in Modern Standard Arabic, MSA).

3. Related Work

A number of researchers have used Modern Standard Arabic (MSA), a formal linguistic standard, for developing Arabic speech recognizers. Kirchhoff et al. [11] used two MSA texts and one Egyptian Colloquial Arabic (ECA) text and found that the diacritization error rate on MSA ranged between 9% and 28%. Kirchhoff et al. [12] used morphology-based language models in a speech recognition system for conversational Arabic. Vergyri et al. [16] developed automatic diacritization of Arabic text for use in acoustic model training for ASR. El-Choubassi et al. [3] developed Arabic speech recognition using recurrent neural networks. Gal [6] used an HMM-based bigram model for restoring diacritics in non-diacritized sentences; applied to the Quran, this technique reached a 14% word error rate.

4. Feature Extraction Agent

The Feature Extraction Agent can be described as follows. Percepts: the agent perceives spoken Arabic words through a microphone. Actions: there are four main actions available: record an utterance, remove room noise, filter the data, and compute linear predictive coefficients (LPCs). Goals: the goal of this agent is to translate acoustic signal information into features that describe the properties of speech sounds. Environment: the environment consists of Egyptian speakers.

The first step in all automatic speech recognition (ASR) algorithms is acoustic feature extraction. The aim of the Feature Extraction Agent is to translate the information contained in acoustic signals into a data representation that is suitable for statistical modeling and likelihood calculation. ASR requires acoustic features that represent reliable phonetic information consistently, i.e. features that describe the distinctive properties of speech sounds efficiently and that are reproducible over many tokens. A multitude of techniques have been proposed to perform acoustic feature extraction efficiently, e.g. filter bank features, frequency-filtered filter bank features [13], mel-frequency cepstral coefficients [1], line spectrum pairs [14], perceptually based linear prediction [9], formant-like features [10], features based on auditory models [9][7][2], and features derived from wavelet transforms. The common aim of almost all these representations is to describe acoustic signals in terms of their short-term spectral energy distribution, which is usually extracted at regular time intervals. In addition to these so-called static coefficients, the first and second order time derivatives of the static features - also referred to as dynamic features - are almost always included in the acoustic feature vectors.

4.1 Linear Predictive Coding

We propose to use Linear Predictive Coding (LPC) to analyze Arabic sound patterns [9]. LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining residue. The method employs a difference equation that expresses each sample of the signal as a linear combination of previous samples; such an equation is called a linear predictor, which is why the method is called Linear Predictive Coding. The basic assumption behind LPC is the correlation between the nth sample and the p previous samples of the target signal: the nth signal sample is represented as a linear combination of the previous p samples, plus a residual representing the prediction error.
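For concreteness, the following sketch (our own illustration, not code from the paper; the function name lpc_coefficients, the Hamming windowing, and the synthetic test frame are assumptions) estimates order-15 LPC coefficients for one frame with the autocorrelation method and the Levinson-Durbin recursion, i.e. it fits the predictor s[n] ≈ a_1 s[n-1] + ... + a_p s[n-p] and also returns the energy of the prediction error.

```python
import numpy as np

def lpc_coefficients(frame, order=15):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns (a, e): predictor coefficients a_1..a_order such that
    s[n] is approximated by sum_k a_k * s[n-k], and the residual energy e."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation values r[0..order]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)            # a[0] is implicitly 1 and stays unused
    e = r[0]                           # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k = acc / e                    # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, e = a_new, e * (1.0 - k * k)
    return a[1:], e

# Toy usage: a 10 ms frame of a synthetic 440 Hz tone sampled at 16 kHz
fs = 16000
t = np.arange(int(0.010 * fs)) / fs
frame = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(len(t))
coeffs, err = lpc_coefficients(frame * np.hamming(len(frame)), order=15)
print(coeffs.shape, err)
```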

5. Pattern Classification Agent

The Pattern Classification Agent can be described as follows. Percepts: the agent receives the LPC values of each spoken word. Actions: there are four main actions available: form the training data, create the neural network object, train the network, and simulate the network response to new inputs. Goals: the goal of this agent is to classify each spoken word. Environment: the environment consists of acoustic signal features.

Once the features have been extracted, the task is to match the right pattern. We use the concept of neural networks [4], [8] for Arabic speech analysis. A neural network is composed of a number of interconnected units (artificial neurons). Each unit has an input/output (I/O) characteristic and implements a local computation or function. The output of any unit is determined by its I/O characteristic, its interconnections to other units, and (possibly) external inputs. The real power of neural networks comes from combining neurons into multilayer structures. A neuron has a set of nodes, also called synapses, that connect it to inputs, outputs, or other neurons. The LPC values of each training sample are fed as input to the neural network. Each output neuron represents a voice command or word, and the target output is set to 1 at the corresponding neuron. There are 15 inputs (the LPC values) and n outputs (corresponding to the n words to be recognized). The training is iterated and the weights are adjusted for each training sample.
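A minimal sketch of this kind of classifier (our own illustration; the hidden-layer size, learning rate, number of epochs, and random toy data are assumptions, not the paper's settings): a back-propagation network with 15 LPC inputs and 7 outputs, trained so that the target output is 1 at the neuron corresponding to the spoken word.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 15, 20, 7            # 15 LPC inputs, 7 word classes

W1 = rng.normal(scale=0.1, size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out)); b2 = np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, target, lr=0.1):
    """One back-propagation update for a single (LPC vector, one-hot word) pair."""
    global W1, b1, W2, b2
    h = sigmoid(x @ W1 + b1)                 # hidden activations
    y = sigmoid(h @ W2 + b2)                 # output activations, one per word
    delta_out = (y - target) * y * (1 - y)   # squared-error gradient at the output
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    W2 -= lr * np.outer(h, delta_out); b2 -= lr * delta_out
    W1 -= lr * np.outer(x, delta_hid); b1 -= lr * delta_hid

# Toy usage: 84 random "LPC vectors" with random word labels (stand-ins for training set A)
X = rng.normal(size=(84, n_in))
T = np.eye(n_out)[rng.integers(0, n_out, size=84)]   # target is 1 at the word's neuron
for epoch in range(50):                               # iterate, adjusting weights per sample
    for x, t in zip(X, T):
        train_step(x, t)
predicted = np.argmax(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), axis=1)
```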

6. Experimental Testing and Results

This section describes the experiments performed to evaluate the performance of the proposed model on Arabic speech recognition. We consider a natural user interface for mobile computer applications, such as a simple drawing program. The speech data was recorded in a quiet room environment to obtain clean speech signals, using a middle-quality speaker and a compatible Sound Blaster card at an 11025 Hz sampling rate; the speech segments were then resampled to 16 kHz. Speech data was prepared for the following 7 basic utterances: مستطيل (rectangle), مربع (square), خط (line), دائرة (circle), صمم (design), أنشئ (create), and ارسم (draw).

The speech data used in these experiments was obtained from a group of Egyptian Arabic speakers: 2 males and 2 females. The data from this group was divided into two sets, training set A and testing set B. Training set A, with 84 examples, was obtained from 3 utterances of each word per speaker; testing set B, with 56 examples, was obtained from 2 utterances of each word per speaker. A graphical user interface was designed to allow interaction with the system. LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining residue. Spectral analysis of the speech signal was performed over 10 ms frames with a Hamming window. After recording an utterance, the system high-pass filters the data to detrend it and remove room noise, computes linear predictive coefficients (LPCs) of order 15 for the segment, and plots the prediction error and the reconstructed signal using finite impulse response (FIR) and infinite impulse response (IIR) implementations of the filter.
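The preprocessing just described can be approximated with standard signal-processing tools. The sketch below is our own illustration (the file name, filter order, and 100 Hz cutoff are assumptions): it high-pass filters a recorded utterance, splits it into 10 ms Hamming-windowed frames, and computes order-15 LPC coefficients per frame by solving the autocorrelation normal equations.

```python
import numpy as np
from scipy.io import wavfile
from scipy.linalg import solve_toeplitz
from scipy.signal import butter, lfilter

# Hypothetical recorded utterance (file name is an assumption for illustration)
fs, samples = wavfile.read("utterance.wav")
samples = samples.astype(float)
if samples.ndim > 1:                         # keep a single channel if the recording is stereo
    samples = samples[:, 0]

# High-pass filter to detrend the data and suppress low-frequency room noise
# (4th-order Butterworth, 100 Hz cutoff; both values are assumptions)
b, a = butter(4, 100.0 / (fs / 2.0), btype="highpass")
clean = lfilter(b, a, samples)

# 10 ms frames, order-15 LPC per frame via the Yule-Walker normal equations R a = r
order, frame_len = 15, int(0.010 * fs)
features = []
for start in range(0, len(clean) - frame_len, frame_len):
    w = clean[start:start + frame_len] * np.hamming(frame_len)
    r = np.array([np.dot(w[:frame_len - k], w[k:]) for k in range(order + 1)])
    if r[0] == 0.0:                          # skip silent frames
        continue
    features.append(solve_toeplitz(r[:order], r[1:]))
features = np.array(features)                # one 15-dimensional LPC vector per frame
```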

The LPC values of each training sample are then fed as input to the neural network. Each output neuron represents a voice command or word, and the target output is set to 1 at the corresponding neuron. There are 15 inputs (the LPC values) and n = 7 outputs. The training is iterated and the weights are adjusted for each training sample. The accuracy of the trained network on each word in the training and testing sets is shown in Table 1, where accuracy is measured as the number of patterns recognized correctly divided by the number of patterns fed to the network.

Table 1: The accuracy results of the trained network.

Arabic Word            Training Set Accuracy    Testing Set Accuracy
ارسم (draw)            11/12                    10/12
أنشئ (create)          10/12                    9/12
صمم (design)           10/12                    10/12
مستطيل (rectangle)     12/12                    12/12
مربع (square)          9/12                     9/12
خط (line)              10/12                    9/12
دائرة (circle)         9/12                     9/12

The back-propagation neural network achieved an average test accuracy of over 80.95% across all utterances, calculated from the per-word accuracies shown in Table 1. The network's accuracy becomes more significant when a larger vocabulary is taken into consideration, because a larger vocabulary provides more opportunity for the network to learn and therefore to generalize to unseen examples.
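As a quick arithmetic check (our own illustration, not part of the paper), the 80.95% figure follows directly from summing the testing-set counts in Table 1:

```python
# Testing-set results per word from Table 1, as (correct, total) pairs
testing = [(10, 12), (9, 12), (10, 12), (12, 12), (9, 12), (9, 12), (9, 12)]
correct = sum(c for c, _ in testing)   # 68
total = sum(t for _, t in testing)     # 84
print(f"Overall test accuracy: {correct}/{total} = {100 * correct / total:.2f}%")  # 80.95%
```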

7. Conclusions

As can be seen from these results, the proposed approach gives promising recognition rates. These results indicate useful future research directions for Arabic speech recognition. The neural network approach is quite general and can be extended to continuous speech to obtain high levels of pattern classification and recognition.

8. Acknowledgment

The authors would like to express their sincere appreciation to Professor Hany Ammar for many useful comments.

9. References

[1] Davis, S. B., and Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980; 28(4):357–366.
[2] De Mori, R., Albesano, D., Gemello, R., and Mana, F. Ear-model derived features for automatic speech recognition. In Proceedings of ICASSP, Istanbul, Turkey, 2000; pp. 1603–1606.
[3] El-Choubassi, M. M., El-Khoury, H. E., Alagha, C. E. J., Skaf, J. A., and Al-Alaoui, M. A. Arabic speech recognition using recurrent neural networks. In Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 2003; pp. 543–547.
[4] Fausett, L. V. Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1994.
[5] Ferber, J. Multi-Agent Systems: An Introduction to Distributed Artificial Intelligence. Addison-Wesley, 1999.
[6] Gal, Y. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, Philadelphia. Association for Computational Linguistics, July 2002; pp. 27–33.
[7] Ghitza, O. Temporal non-place information in the auditory-nerve firing patterns as a front-end for speech recognition in a noisy environment. Journal of Phonetics, 1988; 16(1):109–124.
[8] Golden, R. M. Mathematical Methods for Neural Network Analysis and Design. MIT Press, 1996 (1st ed.).
[9] Hermansky, H. An efficient speaker-independent automatic speech recognition by simulation of some properties of human auditory perception. In Proceedings of ICASSP, Dallas, TX, USA, 1987; pp. 1159–1162.
[10] Holmes, J. N., Holmes, W. J., and Garner, P. N. Using formant frequencies in speech recognition. In Proceedings of Eurospeech, Rhodes, Greece, 1997; pp. 2083–2086.
[11] Kirchhoff, K., et al. Novel approaches to Arabic speech recognition: Report from the 2002 Johns Hopkins summer workshop. In Proceedings of ICASSP, 2003; pp. I-344–I-347.
[12] Kirchhoff, K., Vergyri, D., Bilmes, J., Duh, K., and Stolcke, A. Morphology-based language modeling for Arabic speech recognition. Computer Speech and Language, 2006; 20(4):589–608.
[13] Nadeu, C., Hernando, J., and Gorricho, M. On the decorrelation of filter-bank energies in speech recognition. In Proceedings of Eurospeech, Madrid, Spain, 1995; pp. 1381–1384.
[14] Paliwal, K. On the use of line spectral frequency parameters for speech recognition. Digital Signal Processing, 1992; 2:80–87.
[15] Russell, S. J., and Norvig, P. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995; p. 31.
[16] Vergyri, D., and Kirchhoff, K. Automatic diacritization of Arabic for acoustic modeling in speech recognition. In A. Farghaly and K. Megerdoomian (eds.), COLING 2004 Workshop on Computational Approaches to Arabic Script-based Languages, Geneva, Switzerland, 2004; pp. 66–73.