2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress

Improving Wake-Up-Word and General Speech Recognition Systems

Veton Këpuska
Electrical & Computer Engineering Department
Florida Institute of Technology
Melbourne, FL, USA
[email protected]

Gamal Bohouta

Electrical & Computer Engineering Department
Florida Institute of Technology
Melbourne, FL, USA
[email protected]

Abstract— The aim of this paper is to design a complete speech recognition system that works as both a Wake-up-Word (WUW) recognizer and a General Automatic Speech Recognition (ASR) system with high accuracy. WUW recognition is a form of key-word spotting and a modern technology in Speech Recognition (SR) that is not yet widely recognized. The proposed ASR system addresses one of the hardest problems in speech recognition: discriminating between the use of a word or phrase in an alerting context and in a referential context. Moreover, under this paradigm, the accuracy of spoken commands used to interact with machines, whether a single word or an entire sentence, improves toward 100%. Given the increasing number of speech commands in everyday life, this model is well suited to applications that use speech commands to activate computers, such as medical assistance, human-robot interaction, the telecommunications industry, home automation, and security access control.

Keywords— Automatic Speech Recognition; Improving Speech Recognition techniques; Wake-up-Word; CMU Sphinx.

I. INTRODUCTION
Automatic speech recognition (ASR) is a technology that allows a machine to recognize the utterances a person speaks into a microphone and convert them to text [11]. In recent years, ASR has become popular in many application areas, such as command-and-control systems, medical systems, assistive systems for people with disabilities, dictation systems, telephony systems, and embedded applications. All of these applications increase the number of speech commands, which are Wake-up-Words. WUW speech recognition is similar to key-word spotting: "WUW is a new paradigm in Speech Recognition (SR) that is not yet widely recognized. However, WUW-SR is different in one important aspect of being able to discriminate the specific word/phrase used only in alerting context (and not in the other; e.g. referential contexts)" [1]. For example, in future homes we will need a number of speech commands to control all devices in the home. Most companies that produce speech recognition systems have focused on improving recognition accuracy in general ASR systems rather than in WUW ASR systems; for example, "Google achieved an 8 percent error rate in 2015 that is a reduction of more than 23 percent from year 2013" [12]. Also, a paper published by a team of researchers and engineers from Microsoft Artificial Intelligence and Research reported a speech recognition system that makes the same or fewer errors than professional transcriptionists, with a word error rate (WER) of 5.9 percent, down from 6.3 percent [13].

On the other hand, several researchers have studied the WUW paradigm from different angles, such as WUW in noisy environments, the speed of the utterance, the location of the target speaker, and the use of pitch and energy. Zehetner and others focused on "a WUW spotting database with three different background noise levels, four different speaker distances to the microphone, and ten different speakers" [4]. Another paper demonstrated that "by using prosodic features utilizing only pitch and energy ... a classification model for discriminating an alerting and referential context can be developed achieving the accuracy of 83.67%" [5]. Moreover, in the research of Jwu-Sheng Hu, "a multichannel speech interface is introduced not only to estimate the unknown locations of the sound sources but also to strengthen the speech feature for WUW detection" [3]. In this paper, we propose an approach to detect the Wake-up-Word (WUW) as a command with 100% accuracy and to provide a complete speech recognition system that works in both WUW and general ASR modes with high accuracy.

Moreover, our approach discriminates between a word/phrase in an alerting context and a referential context: for the word "Computer", an alerting use is "Computer, show me the chart?" and a referential use is "Every computer should have a speaker". To achieve this ASR system, the following steps were accomplished: adding new components to the structure of the ASR system, choosing the best platform for testing the approach, simulating the WUW

978-1-5386-1956-8/17 $31.00 © 2017 IEEE DOI 10.1109/DASC-PICom-DataCom-CyberSciTec.2017.67


and General acoustic models, and designing the rule-based system to make the final decision.

III. TESTING THE SYSTEM
In recent years, ASR systems have adopted the latest technologies and operate at accuracies of more than 90% [6]. There are several ASR systems, such as Amazon, Microsoft, Google, Sphinx-4, HTK, Kaldi, and Dragon [15]. In our proposal, we chose Sphinx-4 for testing our approach based on its support, open-source license, programming language, and component structure. There are four main reasons for choosing Sphinx-4. First, it has been developed at Carnegie Mellon University (CMU): "CMU Sphinx has a large vocabulary, speaker independent speech recognition codebase, and its code is available for download and use" [7]. Second, "Sphinx-4 is an open source speech recognition system. To facilitate new innovation in speech recognition research, CMU formed a distributed, cross-discipline team to create Sphinx-4, an open source platform that incorporates state-of-the-art methodologies and also addresses the needs of emerging research areas" [10]. Third, its structure has the same three main components as ours, the FrontEnd, the Decoder, and the Linguist, and "its structure has been designed with a high degree of flexibility and modularity" [8]. Fourth, Sphinx-4 is written in the Java programming language, and additional packages such as PocketSphinx, SphinxBase, and SphinxTrain allowed us to train and test the acoustic model and the language model using other languages such as Perl and Python.
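The accuracy figures in this section and the WER numbers quoted in the introduction are two views of the same measurement. As background, a minimal word-error-rate computation (the standard Levenshtein alignment over words; the function name and example strings below are illustrative, not from the paper's code) can be sketched as:

```python
# Word error rate (WER): edit distance between the reference and
# hypothesis word sequences, divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion ("me") and one substitution ("chart" -> "charts")
# against a 5-word reference: WER = 2/5.
print(wer("computer show me the chart", "computer show the charts"))  # → 0.4
```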

II. THE STRUCTURE OF THE ASR SYSTEM
ASR is an active research area, and many companies use different approaches to design and improve their ASR systems. Many techniques are used to design ASR systems, depending on the application and its complexity. ASR systems can be separated into several classes, such as isolated-word, connected-word, and continuous speech; speaker-dependent and speaker-independent; and small and large vocabulary. However, a general speech recognition system has three main components: the Frontend, the Decoder, and the knowledge base. "There are three main blocks in the design, which are controllable by any external application: the frontend, decoder, and knowledge base (KB)" [9]. The knowledge base usually includes the acoustic model, the language model, and the dictionary. The following is the structure of the general speech recognition system.

IV. THE WUW AND GENERAL ACOUSTIC MODELS
An acoustic model represents the relationship between an audio signal and the phonemes or words of a language. It is created by taking a large database of speech (called a speech corpus) together with its text transcriptions and using training algorithms to create statistical representations of each phoneme or word [15]. These statistical representations are Hidden Markov Models (HMMs). In our proposal, we use two acoustic models: the WUW acoustic model and the General acoustic model.
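The HMM-based scoring that both acoustic models perform can be illustrated with a toy example. The sketch below runs the standard forward algorithm in the log domain over a two-state, discrete-observation HMM; every parameter is a hypothetical stand-in for a trained acoustic model, and the resulting log-likelihood plays the role of the scores reported later in Table I:

```python
import math

def _logsumexp(xs):
    # Numerically stable log(sum(exp(x))) for combining log-probabilities.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_forward(obs, start_p, trans_p, emit_p):
    # Forward algorithm in the log domain for a discrete-observation HMM.
    # alpha[s] = log P(obs[:t+1], state at t = s)
    alpha = [math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in range(len(start_p))]
    for o in obs[1:]:
        alpha = [math.log(emit_p[s][o]) + _logsumexp(
                     [alpha[r] + math.log(trans_p[r][s])
                      for r in range(len(alpha))])
                 for s in range(len(alpha))]
    return _logsumexp(alpha)  # log P(obs) under the model

# Two-state toy model; observations are quantized feature indices.
# These numbers are illustrative, not trained values.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
score = log_forward([0, 1, 2, 1], start, trans, emit)
print(score)  # a negative log-likelihood, analogous to the scores in Table I
```

In the real system each acoustic model assigns such a score to the same utterance, and the two scores are compared downstream.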

Fig. 1. The Structure of General Speech Recognition System

In this paper, we have modified some components of the original structure of general speech recognition systems: (1) modifying the Frontend to accept various kinds of features, such as Mel-scale Filtered Cepstral Coefficients (MFCCs) and Linear Prediction Coefficients (LPC); (2) adding two acoustic models to the knowledge base component, the WUW acoustic model and the General acoustic model; (3) combining the decoder results, with Decoder 1 using the WUW acoustic model and Decoder 2 using the General acoustic model; and (4) using a decision support system to help the system make the final decision. The following is the structure of our proposed ASR system:
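To make the data flow of modifications (1)–(4) concrete, here is a minimal sketch of one frontend feeding the same feature stream to two decoders, whose hypotheses then go to the decision stage. All names and the toy stand-in functions are illustrative assumptions; the actual system is built on Sphinx-4 components:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hypothesis:
    word: str     # best word hypothesis ("" = silence / no match)
    score: float  # acoustic log-likelihood of that hypothesis

def run_pipeline(samples: List[float],
                 frontend: Callable[[List[float]], List[List[float]]],
                 decoder_wuw: Callable[[List[List[float]]], Hypothesis],
                 decoder_gen: Callable[[List[List[float]]], Hypothesis]):
    # One frontend (e.g. MFCC or LPC extraction) feeds both decoders.
    feats = frontend(samples)
    return decoder_wuw(feats), decoder_gen(feats)

# Toy stand-ins that only demonstrate the wiring:
frontend = lambda s: [s[i:i + 4] for i in range(0, len(s), 4)]  # fake framing
dec1 = lambda f: Hypothesis("COMPUTER", -3.20e7)  # WUW acoustic model
dec2 = lambda f: Hypothesis("COMPUTER", -5.19e7)  # General acoustic model

h1, h2 = run_pipeline([0.0] * 16, frontend, dec1, dec2)
print(h1.word, h2.word)  # both hypotheses are handed to the decision stage
```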

A. General acoustic model
In this paper, we used the general English acoustic model of Sphinx-4; Sphinx-4 provides a recent HMM-based codebase and a strong acoustic model trained on a large vocabulary [9]. Moreover, to increase the accuracy of the acoustic model, we used PocketSphinx, SphinxBase, and SphinxTrain to adapt the General acoustic model with several corpora, such as the TIMIT corpus. The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences, and includes time-aligned orthographic, phonetic, and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance [16]. Also, we used the WUW corpus that was collected by the speech recognition group at Florida Institute of

Fig. 2. The Structure of the Proposed Speech Recognition System



Technology. This corpus contains two kinds of sentences: for the word "Computer", an alerting-context example is "Can you solve this, Computer" and a referential-context example is "Computer is normally used in presentation." In our proposal, we used only the referential contexts for adapting the General acoustic model, obtaining the General score (Score 1) from the General acoustic model (e.g. WUW1: -3.20E+07).

B. WUW acoustic model
Using PocketSphinx, SphinxBase, and SphinxTrain together with Perl and Python scripts, we created a new acoustic model. We trained the WUW acoustic model with the WUW corpus collected by the speech recognition group at Florida Institute of Technology and with the TIMIT corpus. In this part, we used only the alerting contexts for training the WUW acoustic model.

Fig. 4. Detection Wake-up-word (Score 1 and Score 2)

B. Making the final decision
In the second stage, we make the final decision based on the results of the two decoders (Decoder 1, Decoder 2) and the scores from the WUW acoustic model (Score 1) and the General acoustic model (Score 2). To evaluate this stage, we tested the model with several corpora; the experimental results show that we can make the final decision with 100% accuracy. For example, when the WUW is "COMPUTER", the system decides whether the recognized word is a WUW by three rules:

Getting Score 2 from the WUW acoustic model (e.g. WUW1: -5.19E+07)

V. THE EXPERIMENT RESULTS

In this paper, we have presented our approach and tested it with Sphinx-4 and several corpora. After testing, we chose the best result from the decoders and acoustic models. Based on the results of the decoders (Decoder 1, Decoder 2) and the scores of the WUW and General acoustic models, we designed a decision support system with two stages: the first stage detects the Wake-up-Word (WUW), and the second makes the final decision.

A. Detecting the Wake-up-Word (WUW)
In this stage, we designed a Support Vector Machine (SVM) that detects a single word or phrase while rejecting all other words or sounds, using the score of the WUW acoustic model (Score 1) and the score of the reverse WUW acoustic model (Score 2). To evaluate WUW detection, we tested the support vector machine with several corpora. For example:

Fig. 5. WUW and General acoustic models scores.
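As a self-contained illustration of this first detection stage, the sketch below applies a fixed linear decision function to the (WUW score, General score) pairs from Table I. This is a hand-made stand-in for the trained SVM, not the SVM itself: the decision boundary is chosen by inspection, and the point is only that the two scores are linearly separable for the listed wake-up words:

```python
# Scores reproduced from Table I: word -> (WUW score, General score).
TABLE_I = {
    "WUW1": (-3.20e7, -5.19e7), "WUW2": (-4.89e7, -8.44e7),
    "WUW3": (-3.86e7, -4.66e7), "WUW4": (-2.12e7, -5.44e7),
    "WUW5": (-5.95e7, -7.98e7), "WUW6": (-3.25e7, -4.68e7),
    "WUW7": (-2.57e7, -5.70e7), "WUW8": (-4.49e7, -5.45e7),
    "WUW9": (-3.97e7, -7.28e7),
}

def is_wuw(score1: float, score2: float) -> bool:
    # Accept when the WUW acoustic model explains the audio better than
    # the other model (linear boundary score1 - score2 = 0; a trained
    # SVM would learn this boundary from labeled score pairs instead).
    return score1 - score2 > 0.0

# Every row in Table I has the WUW score above the other score, so the
# hand-picked boundary accepts all nine examples.
print(all(is_wuw(s1, s2) for s1, s2 in TABLE_I.values()))  # → True
```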

Fig. 3. Detection Wake-up-word (Score 1 and Score 2)

TABLE I. WUW AND GENERAL ACOUSTIC MODELS SCORES

Words   WUW Score    General Score
WUW1    -3.20E+07    -5.19E+07
WUW2    -4.89E+07    -8.44E+07
WUW3    -3.86E+07    -4.66E+07
WUW4    -2.12E+07    -5.44E+07
WUW5    -5.95E+07    -7.98E+07
WUW6    -3.25E+07    -4.68E+07

TABLE I. (continued)

Words   WUW Score    General Score
WUW7    -2.57E+07    -5.70E+07
WUW8    -4.49E+07    -5.45E+07
WUW9    -3.97E+07    -7.28E+07

The steps included choosing the best platform for testing the system, simulating the WUW and General acoustic models, and designing the rule-based decision system. The new model was tested using Sphinx-4 and several corpora, and the experimental results show that the final decision of whether a recognized word is a WUW can be made with 100% accuracy.


1) If the two decoders produce the same result and the score from the WUW acoustic model (Score 1) is higher than the score from the General acoustic model (Score 2):


IF the results are the same (Decoder 1 = "COMPUTER", Decoder 2 = "COMPUTER") and Score 1 = -3.609822E7 > Score 2 = -5.9299232E7, THEN the final decision is that "COMPUTER" from Decoder 1 is a WUW with Confidence Word (50)% (Decoder 1 = "COMPUTER", Score 1 = -3.609822E7).


2) If the decoders produce different results and the result of Decoder 1 is a WUW word, then the final result is that Decoder 1's word is a WUW with Confidence Word (50)% and Decoder 2's word is a general word with Confidence Word (0)%.


IF the results differ (Decoder 1 = "COMPUTER", Decoder 2 = "COMPANY") and the result of Decoder 1 ("COMPUTER") is a WUW, THEN the final decision is that "COMPUTER" from Decoder 1 is a WUW with Confidence Word (50)%.


3) If the result of Decoder 1 is silence ("") and the result of Decoder 2 is a WUW or a general word, then the final result is that Decoder 2's word is a general word with Confidence Word (100)%. IF the result is (Decoder 1 = "", Decoder 2 = "COMPUTER" or a general word), THEN the final decision is that Decoder 2's word is a general word with Confidence Word (100)%.


The experimental results show that we can make the final decision that a word is a WUW with 100% accuracy (Confidence Word (100)%) or a general word with Confidence Word (100)%. For example, if the word "COMPUTER" receives Confidence Word (50)% in the first stage and Confidence Word (50)% in the second stage, then "COMPUTER" is declared a WUW with Confidence Word (100)%.
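The three rules and the per-stage confidence bookkeeping can be written out as a small function. This is an illustrative reconstruction of the rules as stated above, not the paper's actual code; the WUW vocabulary parameter is a hypothetical stand-in:

```python
def final_decision(word1: str, word2: str,
                   score1: float, score2: float,
                   wuw_vocab=("COMPUTER",)):
    """Return (label, word, confidence%) for the second decision stage."""
    # Rule 3: Decoder 1 heard silence -> Decoder 2's word is a general
    # word with Confidence Word (100)%.
    if word1 == "":
        return ("general", word2, 100)
    # Rule 1: the decoders agree and the WUW model's score is higher ->
    # Decoder 1's word is a WUW with Confidence Word (50)% from this stage.
    if word1 == word2 and score1 > score2:
        return ("WUW", word1, 50)
    # Rule 2: the decoders disagree but Decoder 1's word is a WUW word ->
    # WUW with Confidence Word (50)%; Decoder 2's word gets 0%.
    if word1 in wuw_vocab:
        return ("WUW", word1, 50)
    # Otherwise fall back to Decoder 2's word as a general word.
    return ("general", word2, 0)

# The worked example from rule 1: both decoders say "COMPUTER" and the
# WUW model scores higher, so this stage contributes 50% confidence.
print(final_decision("COMPUTER", "COMPUTER", -3.609822e7, -5.9299232e7))
# → ('WUW', 'COMPUTER', 50)
```

Combining this stage's 50% with the first stage's 50% yields the overall Confidence Word (100)% described above.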


VI. CONCLUSIONS
This paper proposes a new speech recognition system that works as both a Wake-up-Word (WUW) recognizer and a general speech recognition system with high accuracy. Moreover, this system can discriminate between the uses of a word or phrase in an alerting and a referential context with virtually 100% accuracy. To achieve this ASR system, we accomplished several steps, such as adding new components to the structure of the ASR system and choosing the best platform for testing the approach.




REFERENCES
[1] V. Këpuska and T. Klein, "A novel Wake-Up-Word speech recognition system, Wake-Up-Word recognition task, technology and evaluation", Nonlinear Analysis, vol. 71, pp. e2772-e2789, 2009.
[2] J. Hu, M. Lee and T. Wang, "Wake-Up-Word Detection for Robots Using Spatial Eigenspace Consistency and Resonant Curve Similarity", presented at the 2011 IEEE Int. Conf. on Robotics and Automation, Shanghai, China, May 9-13, 2011.
[3] J. Hu, M. Lee and T. Wang, "Wake-Up-Word Detection by Estimating Formants from Spatial Eigenspace Information", presented at the 2012 IEEE Int. Conf. on Mechatronics and Automation, Shanghai, China, August 5-8, 2012.
[4] A. Zehetner, M. Hagmuller, and F. Pernkopf, "Wake-Up-Word Spotting for Mobile Systems", Signal Processing and Speech Communication Laboratory, 2011.
[5] V. Këpuska, R. Sastraputera and C. Shih, "Discrimination of Sentinel Word Contexts Using Prosodic Features", Florida Institute of Technology, USA.
[6] D. Isaacs and D. Mashao, "A Comparison of the Network Speech Recognition and Distributed Speech Recognition Systems and their effect on Speech Enabling Mobile Devices", doctoral diss., Speech Technology and Research Group, University of Cape Town, 2010.
[7] P. Lamere, P. Kwok, E. Gouvêa, B. Raj, R. Singh, W. Walker, M. Warmuth and P. Wolf, "Sphinx-4: A Flexible Open Source Framework for Speech Recognition", Sun Microsystems, USA, November 2004.
[8] P. Lamere, P. Kwok, E. Gouvêa, B. Raj, R. Singh, W. Walker, M. Warmuth and P. Wolf, "Design of the CMU Sphinx-4 Decoder", Mitsubishi Electric Research Laboratories, USA, August 2003.
[9] P. Lamere, P. Kwok, E. Gouvêa, B. Raj, R. Singh, W. Walker, M. Warmuth and P. Wolf, "The CMU Sphinx-4 Speech Recognition System", Sun Microsystems Laboratories, Carnegie Mellon University, Mitsubishi Electric Research Labs and University of California, USA.
[10] E. Babu, S. Jeelan and P. Prakash, "Static dictionary for Telugu speech recognition system", Int. J. of Conceptions on Computing and Information Technology, vol. 1, pp. 29-32, November 2013.
[11] I. Kaur, N. Kaur, A. Ummat, J. Kaur and N. Kaur, "Automatic Speech Recognition: A Review", Int. J. of Computer Science and Technology, vol. 7, pp. 43-49, Oct.-Dec. 2016.
[12] V. Këpuska and G. Bohouta, "Comparing Speech Recognition Systems (Microsoft API, Google API and CMU Sphinx)", Int. J. of Engineering Research and Application, vol. 7, pp. 20-24, March 2017.
[13] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu and G. Zweig, "Achieving Human Parity in Conversational Speech Recognition", Microsoft Research Technical Report, February 2017.
[14] A. Stiles, B. Schmitt, F. Gertz, T. Klein and V. Kepuska, "Testing and Improvement of the Triple Scoring Method for Applications of Wake-up Word Technology", 2007 Amalthea REU Site, 2007.
[15] M. Sharma and O. Kumari, "Speech Recognition: A Review", Nat. Conf. on Cloud Computing & Big Data.
[16] TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium: https://catalog.ldc.upenn.edu/LDC93S