Reducing Computational Complexity of Dynamic Time Warping Based Isolated Word Recognition with Time Scale Modification

Peter H. W. Wong*, Oscar C. Au**, Justy W. C. Wong***, William H. B. Lau
Department of Electrical and Electronic Engineering
The Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong
Email: [email protected]*, [email protected]**, [email protected]***, [email protected]
Tel.: +852 2358-7053**

ABSTRACT
In this paper, we propose an algorithm to reduce the computational complexity of dynamic time warping (DTW) for isolated word recognition. Prior to the feature extraction stage, we apply time scale compression to both the reference and test utterances in order to reduce their durations. Experimental results show that the computational complexity can be reduced by up to 75% without affecting recognition accuracy. The time scale process also suppresses noise to some degree, so that recognition accuracy can be improved for noisy test utterances. The proposed algorithm also alleviates severe duration mismatch between two utterances.

I. INTRODUCTION
Automatic speech recognition is well known to be of great importance as a natural man-machine interface. There are two main approaches to speech recognition [1], namely dynamic time warping (DTW) and hidden Markov models (HMM). HMM can capture the statistical characteristics of word and subword units across different speakers, even for large vocabularies, and is thus generally better than DTW for speaker-independent, large-vocabulary speech recognition. However, DTW remains useful for small-vocabulary, isolated-word, speaker-dependent or multi-speaker recognition because of its relative simplicity and good recognition performance in these situations. In the training phase, DTW extracts the speech features of each training utterance and stores them as templates. In the recognition phase, the speech features of the test utterance are extracted and time-aligned with each template before a mismatch score is computed. The word whose template yields the smallest mismatch is declared to be the recognized word. This time alignment of test and reference utterances is quite effective in differentiating isolated words. DTW can also be used for continuous speech recognition, but its performance there is usually less satisfactory than that of HMM. While DTW is computationally simpler than HMM, its computational requirement is still high, so it is desirable to reduce the computation without sacrificing recognition performance. In this paper, we propose to use Time Scale Modification (TSM) to shrink the test and reference utterances in time before performing DTW. The reference template creation process and the recognition process are shown in Figure 1 and Figure 2 respectively. Our simulations suggest that a computation reduction of as much as 75% is possible without affecting recognition performance.
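The template-matching step described above can be sketched as a standard dynamic-programming alignment. This is a minimal illustration, not the paper's exact implementation: the local constraint used here is the common symmetric one (diagonal, vertical, horizontal moves), which may differ from the constraint path of Figure 3.

```python
import numpy as np

def dtw_distance(test, ref):
    """Time-align two feature sequences (frames x coefficients) and
    return the cumulative mismatch score via dynamic programming."""
    n, m = len(test), len(ref)
    # Local frame-to-frame mismatch: Euclidean norm of the feature difference.
    cost = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Symmetric local constraint: best of diagonal, vertical, horizontal.
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

def recognize(test, templates):
    """Return the vocabulary word whose template yields the smallest mismatch."""
    return min(templates, key=lambda word: dtw_distance(test, templates[word]))
```

The nested loop makes the quadratic cost explicit: the grid has (test frames) x (template frames) cells, which is what the proposed time scale compression shrinks.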

Figure 1: Process for the creation of reference templates

Figure 2: Recognition phase

II. DTW WITH TIME SCALE MODIFICATION (TSM)

Time Scale Modification
Time scale modification (TSM) [2-6] is a class of algorithms for changing the time scale of a signal. An associated parameter α, called the TSM factor, controls the scaling. When α is one, the signal is unchanged. When α is greater than one, the signal is time-expanded (e.g. from 1 second to 2 seconds if α = 2). When α is less than one, the signal is time-compressed (e.g. from 1 second to 0.5 second if α = 0.5). Some TSM algorithms, such as overlap-and-add (OLA), are very simple and fast, but they tend to yield rather poor speech quality with annoying reverberation. The algorithm used in this paper is synchronized overlap-and-add (SOLA) [7], a time domain algorithm with very good speech quality. In SOLA, the input signal is divided into frames. When α is less than one, which is the case in this paper, each frame is moved to the left (i.e. to an earlier time) to overlap with, and add to, previous frames. The optimal shifting point (synchronization point) is found by searching over a search window. This search is essential for the good performance of SOLA but is computationally expensive; fast algorithms exist [8-9] to speed it up.

Synchronized Overlap-and-Add (SOLA)
The input (or analysis) signal x[n] is segmented into overlapping frames of length N spaced Sa samples apart. The first frame is copied directly to the output (or synthesis) signal y[n]. The (m+1)-th frame, which starts at m·Sa, slides along the synthesized signal y[n] around the location m·Ss, over the range [k_min, k_max], to find the location that maximizes the normalized cross-correlation over the overlapping region:

R[k] = \frac{\sum_{i=0}^{L-1} y[m \cdot Ss + k + i] \, x[m \cdot Sa + i]}{\left( \sum_{i=0}^{L-1} y^2[m \cdot Ss + k + i] \; \sum_{i=0}^{L-1} x^2[m \cdot Sa + i] \right)^{1/2}}    (1)

Here Sa and Ss are called the analysis and synthesis frame periods respectively, and they are related by

Ss = Sa \cdot α    (2)

where α is the time scale factor. The signal is time-scale expanded when α is greater than one and time-scale compressed when α is smaller than one. L is the length of the overlapping region between the shifted analysis frame and the synthesized signal. Usually k_min and k_max are set to -N/2 and N/2 respectively. Once the location maximizing the cross-correlation is determined, the overlapping region is cross-faded and the remainder of the analysis frame is copied directly.
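The SOLA procedure above can be sketched as follows. This is an illustrative implementation under assumed parameter choices (frame length N = 400, analysis period Sa = 200, a linear cross-fade); the paper follows [7] and does not specify these values.

```python
import numpy as np

def sola(x, alpha, N=400, Sa=200):
    """Time-scale signal x by factor alpha via synchronized overlap-and-add.
    N: analysis frame length; Sa: analysis frame period (samples)."""
    Ss = int(Sa * alpha)                 # synthesis frame period, Eq. (2)
    y = np.array(x[:N], dtype=float)     # first frame copied directly
    m = 1
    while m * Sa + N <= len(x):
        frame = np.asarray(x[m * Sa : m * Sa + N], dtype=float)
        # Search k in [-N/2, N/2] for the shift maximizing the
        # normalized cross-correlation of Eq. (1) over the overlap.
        best_k, best_R = 0, -np.inf
        for k in range(-N // 2, N // 2 + 1):
            start = m * Ss + k
            if start < 0 or start >= len(y):
                continue
            L = min(len(y) - start, N)   # overlap length
            num = np.dot(y[start:start + L], frame[:L])
            den = np.sqrt(np.dot(y[start:start + L], y[start:start + L]) *
                          np.dot(frame[:L], frame[:L]))
            R = num / den if den > 0 else 0.0
            if R > best_R:
                best_R, best_k = R, k
        # Cross-fade the overlap at the synchronization point, then
        # copy the remainder of the analysis frame directly.
        start = m * Ss + best_k
        L = min(len(y) - start, N)
        fade = np.linspace(1.0, 0.0, L)
        y[start:start + L] = fade * y[start:start + L] + (1.0 - fade) * frame[:L]
        y = np.concatenate([y, frame[L:]])
        m += 1
    return y
```

With alpha < 1 each new frame is shifted toward earlier time, so the synthesized signal grows more slowly than the input and the output duration is roughly alpha times the input duration.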

Computation Reduction of DTW
To reduce the computation of DTW, we propose to apply TSM before DTW, as follows. In the training phase, the time scale of each training utterance is modified using SOLA with a TSM factor less than 1. Speech features are then extracted and stored as templates. While a fast implementation of SOLA has its own computational cost, this is more than compensated by the reduced duration of the training utterance, which lowers the computational requirement of feature extraction and the memory needed to store the reference templates. The saving is particularly significant when the number of reference templates is large. In the recognition phase, the time scale of the test utterance is modified using SOLA with the same TSM factor. Speech features are extracted from the time-scaled test utterance and then time-aligned with the templates using DTW. The template with the lowest mismatch score gives the recognized word. The shorter durations of the test and reference utterances reduce the computation of both feature extraction and DTW. To maintain the recognition performance of DTW, the TSM algorithm must yield good speech quality; thus SOLA rather than OLA is chosen.
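The 75% figure follows directly from the quadratic dependence of DTW on utterance length: compressing both utterances by a factor α shrinks the DTW cost grid by α². A back-of-envelope check (frame counts here are illustrative, not from the paper):

```python
def dtw_cells(n_frames, m_frames, alpha=1.0):
    """Number of cells in the DTW cost grid after compressing both
    the test utterance (n_frames) and template (m_frames) by alpha."""
    return int(alpha * n_frames) * int(alpha * m_frames)

full = dtw_cells(100, 120)             # 100 x 120 = 12000 cells
compressed = dtw_cells(100, 120, 0.5)  # 50 x 60 = 3000 cells
saving = 1 - compressed / full         # 0.75, i.e. a 75% reduction at alpha = 0.5
```

Note the saving depends only on α, not on the utterance lengths, which is why a single TSM factor yields a predictable speed-up across the whole vocabulary.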

III. NOISE SUPPRESSION PROPERTY OF SOLA
If the test utterances are corrupted by noise, the recognition rate can decrease significantly. As mentioned in Section II, SOLA cross-fades the overlapping region between the analysis frame and the synthesized signal. This process can reduce the noise power effectively, so the recognition rate can be increased when TSM is applied to noisy test utterances. In our simulations, white Gaussian noise was added to the test utterances before applying TSM, and we found that the recognition rate increased with the application of TSM.
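The noise reduction from cross-fading can be seen with a simple idealized model: at the midpoint of the fade, two overlapping segments are averaged with equal weights, and if their noise components are independent the noise power is halved while the synchronized speech components remain aligned. A quick numerical check of that assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a = rng.normal(0.0, 1.0, n)    # noise in the synthesized signal
b = rng.normal(0.0, 1.0, n)    # independent noise in the incoming analysis frame
mixed = 0.5 * a + 0.5 * b      # equal-weight point of the cross-fade
# Var(0.5a + 0.5b) = 0.25 + 0.25 = 0.5 for independent unit-variance noise,
# i.e. roughly a 3 dB noise power reduction at this point of the fade.
print(mixed.var())
```

This is only a sketch of the mechanism: the actual reduction varies along the fade and depends on how correlated the noise is between the overlapped regions.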

IV. SIMULATION RESULTS
We performed a simple experiment to verify the feasibility of the proposed combination of TSM and DTW. One speaker from the TI-46 database is used, and the vocabulary contains only the ten digits. This is a small-vocabulary, isolated-word, speaker-dependent situation. Four training utterances of each digit are used as templates, and four other utterances of each digit are used for testing. The speech features are the first twenty mel-frequency cepstral coefficients (MFCC). DTW is applied with the local constraint path shown in Figure 3, and the mismatch measure is simply the norm of the difference of the 20 MFCC. The training utterances are pre-processed using SOLA with various TSM factors, and the test utterances are pre-processed with SOLA in the same way before DTW is applied. The results are shown in Table 1. They suggest that applying SOLA before DTW does not affect the recognition accuracy of DTW for TSM factors ranging from 1 down to 0.2. When the TSM factor is too low, such as 0.1, the recognition accuracy degrades because SOLA itself distorts the speech significantly before the data is fed into DTW. A large speed-up factor of 22.8 compared with DTW alone (α = 1) is possible when α = 0.2. To evaluate the noise suppression property of TSM, we added different amounts of white Gaussian noise to the test utterances before applying TSM. The results are shown in Table 2 and Figure 6. As expected, recognition accuracy drops when noise is added to the test utterances and no TSM is applied. The recognition rates increase when TSM is applied, and they continue to increase as the time scale factor α decreases, down to a certain value; in our experiment, the recognition rates for the different Signal-to-Noise ratios (SNR) attain their maximum at α = 0.4.

V. CONCLUSIONS
We proposed applying Time Scale Modification (TSM) before carrying out Dynamic Time Warping (DTW) for isolated word recognition. Since TSM can remove redundant information while retaining important information, the proposed algorithm can reduce the computational complexity of DTW by up to 75% without affecting recognition accuracy. The storage requirement for the reference templates is also reduced, since the durations of the reference utterances are reduced. TSM can also improve recognition accuracy in noisy recording environments.

Figure 3: Local constraint for DTW path search

Time scale factor (α) | Recognition rate (%) | CPU time (s) | Speed-up factor
1                     | 97.5                 | 2.93         | 1
0.8                   | 97.5                 | 2.01         | 1.46
0.6                   | 97.5                 | 1.18         | 2.48
0.4                   | 97.5                 | 0.50         | 5.87
0.2                   | 97.5                 | 0.13         | 22.8
0.1                   | 92.5                 | 0.037        | 79.4
0.08                  | 85.0                 | 0.024        | 121.4

Table 1: Recognition rate for different TSM factors

Recognition rate (%) at each Signal-to-Noise ratio (dB):

α    | 15 dB | 20 dB | 25 dB | 30 dB | No noise
1    | 15.0  | 30.0  | 73.5  | 90.0  | 97.5
0.8  | 13.5  | 46.3  | 86.3  | 92.5  | 97.5
0.7  | 12.5  | 42.5  | 78.8  | 93.8  | 97.5
0.6  | 16.3  | 50.0  | 82.5  | 90.0  | 97.5
0.5  | 15.5  | 60.0  | 83.5  | 92.5  | 97.5
0.4  | 41.3  | 72.5  | 86.3  | 93.8  | 97.5
0.3  | 40.0  | 67.5  | 77.5  | 88.8  | 97.5
0.2  | 38.8  | 58.8  | 78.8  | 86.3  | 97.5
0.1  | 32.5  | 60.0  | 82.5  | 88.8  | 92.5

Table 2: Recognition rate for different TSM factors and Signal-to-Noise ratios

Figure 4: Recognition rate for different time scale factors

Figure 5: CPU time for different time scale factors

Figure 6: Recognition rate for different time scale factors and Signal-to-Noise ratios

REFERENCES
[1] J.R. Deller, J.G. Proakis, J.H.L. Hansen, Discrete Time Processing of Speech Signals, MacMillan, 1993.
[2] W.D. Garvy, "The Intelligibility of Abbreviated Speech Pattern", Quarterly Journal of Speech, Vol. 39, pp. 296-306, 1953.
[3] G. Fairbanks, W.L. Everitt, R.P. Jaeger, "Method for Time or Frequency Compression-Expansion of Speech", IRE Trans. Professional Group on Audio, Vol. AU-2, pp. 7-12, Jan. 1954.
[4] J.L. Flanagan, R.M. Golden, "Phase Vocoder", Bell System Technical Journal, Vol. 45, pp. 1493-1509, Nov. 1966.
[5] D. Malah, "Time Domain Algorithms for Harmonic Bandwidth Reduction and Time Scaling of Speech Signals", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 27, No. 2, Apr. 1979.
[6] M.R. Portnoff, "Time Scale Modification of Speech Based on Short Time Fourier Analysis", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 29, pp. 374-390, Jun. 1981.
[7] S. Roucos, A.M. Wilgus, "High Quality Time Scale Modification for Speech", Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vol. 1, pp. 493-496, 1985.
[8] S. Yim, B.I. Pawate, "Computationally Efficient Algorithm for Time Scale Modification (GLS-TSM)", Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vol. 2, pp. 1009-1012, 1996.
[9] J.W.C. Wong, O.C. Au, P.H.W. Wong, "Fast Time Scale Modification Using Envelope Matching Technique (EM-TSM)", Proc. IEEE Int. Symp. on Circuits and Systems, 1998.