Proceedings of IC-NIDC2009

SPORTS AUDIO SEGMENTATION AND CLASSIFICATION

Jun Huang 1, Yuan Dong 1, Jiqing Liu 1, Chengyu Dong 2, Haila Wang 2

1 Beijing University of Posts and Telecommunications, Beijing
2 France Telecom Research & Development Center, Beijing
[email protected], [email protected]

Abstract
The audio stream is an important component of a sports video. In this paper, we present a system for audio segmentation and classification, which segments the sports audio stream and classifies each segment as speech or non-speech with good accuracy. The novel point of our research is that we apply the segmentation and clustering method commonly used in speaker diarization systems for broadcast news to the analysis of sports videos. After segmentation and Bayesian Information Criterion (BIC) clustering are performed, Gaussian Mixture Models (GMMs) are used in the classifier to identify the kind of sound in each segment. Experiments on a database of more than 6 hours of audio from Eurosport TV programs show that the average accuracy reaches 87.3% for segmentation and classification. This research is very useful for analyzing the content of sports videos in detail.

Keywords: audio segmentation and classification; sports audio; GMM; content analysis

1. Introduction
As the amount of multimedia data on the Internet increases rapidly, we need efficient methods to segment and classify audio streams automatically based on their content. The audio stream in sports videos is rich in semantic information, which is helpful for content analysis of sports video; improving the performance of sports video parsing is also a purpose of our research.
There have been many studies on audio segmentation and classification, using different algorithms and different features. Barras et al. [1, 2] presented an improved multistage speaker diarization system in which standard Bayesian information criterion (BIC) agglomerative clustering is used. El-Maleh et al. [3] reported results on combining linear spectral pairs (LSP) and zero-crossing based features for frame-level narrowband speech/music discrimination. Scheirer and Slaney [4] introduced a real-time computer system capable of distinguishing speech signals from music signals.

They evaluated 13 features, such as spectral centroid and spectral flux, and combined them in the classification. Pinquier et al. [5] reported an approach to speech/music segmentation based on four original features: 4 Hz modulation energy, modulation entropy and two segmental parameters. Lu et al. [6, 7] presented a robust algorithm for audio segmentation and classification, in which one-second audio clips were classified into speech, music, environment sound and silence; this classification was also a process of segmentation. Saunders [8] introduced a speech/music classifier based on simple features such as zero-crossing rate and short-time energy, using a window size of 2.4 s. There are also some of our previous works on text-independent speaker verification [9-11], which are employed in the classification step of this paper. However, segmenting an audio stream by classifying fixed 1 s or 2.4 s windows has a limitation: an accurate sound change point cannot be located.
In this paper, we present a system for audio segmentation and classification in sports videos. Potential change points between different kinds of sounds are first found by KL2 distance peak detection. After the KL2 distance based segmentation, Bayesian information criterion (BIC) agglomerative clustering is performed until no more merging of segments is possible, and the segment boundaries are then refined with energy constraints. In the last step, classification is performed using Gaussian Mixture Models trained on different kinds of sounds.
The rest of this paper is organized as follows. The audio segmentation method is discussed in detail in Section 2, and the classification scheme is presented in Section 3. In Section 4, the evaluation method is introduced and experimental results of the proposed algorithms are presented.

2. Audio Segmentation
In this section, the segmentation system is described. The method is commonly used in speaker diarization systems, and our experiments show that it is also useful for general audio segmentation. The architecture is shown in Figure 1.

Figure 1. Architecture of the segmentation system

2.1. Feature Extraction
Linear spectral pairs (LSPs) are derived from linear predictive coefficients (LPC). Previous research has shown that LSPs are effective for speech and non-speech discrimination [3, 6], and it has also been shown that LSPs are more robust in noisy environments [12]. 10-order LSPs and the short-time energy (STE) are extracted from the input audio signal every 10 ms using a 25 ms window on a 0-8 kHz band, as shown in Figure 2.

Figure 2. Feature extraction
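To make this front end concrete, the sketch below extracts 10-order LSPs plus log short-time energy on 25 ms windows with a 10 ms step. It assumes librosa is available for framing and LPC analysis and uses the standard sum/difference-polynomial construction for the LPC-to-LSP conversion; it is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
import librosa

def lpc_to_lsp(a):
    """Convert LPC coefficients a = [1, a1, ..., ap] into p line spectral pairs (radians)."""
    p = len(a) - 1
    # Symmetric and antisymmetric polynomials of order p+1.
    P = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    Q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    # Their roots lie on the unit circle; the LSPs are the root angles in (0, pi).
    angles = np.concatenate([np.angle(np.roots(P)), np.angle(np.roots(Q))])
    return np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])[:p]

def lsp_ste_features(y, sr=16000, win=0.025, hop=0.010, order=10):
    """10-order LSPs plus log short-time energy, computed every 10 ms on 25 ms windows."""
    frame_len, hop_len = int(win * sr), int(hop * sr)
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop_len).T
    feats = []
    for frame in frames:
        frame = frame * np.hamming(frame_len)
        a = librosa.lpc(frame, order=order)           # 10th-order LPC analysis
        ste = np.log(np.sum(frame ** 2) + 1e-10)      # log short-time energy
        feats.append(np.concatenate([lpc_to_lsp(a), [ste]]))
    return np.array(feats)                            # shape: (n_frames, 11)
```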

2.2. Initial Segmentation
Generally, the miss rate of the sound change points should be kept low, even if the false alarm rate is high, because false change points can easily be removed after the agglomerative clustering is performed, while the purity of each segment must be ensured. The audio stream is segmented by taking the maxima of a KL2 distance between two adjacent sliding windows (A and B) of audio frames. Assuming that the 11 features (10-order LSPs and short-time energy) in the individual windows are independent and follow a Gaussian distribution, we obtain:

KL2(A, B) = C(A, B) + M(A, B)    (1)

C(A, B) = \frac{1}{2}\left[\mathrm{tr}(\Sigma_A^{-1}\Sigma_B) + \mathrm{tr}(\Sigma_B^{-1}\Sigma_A)\right] - d    (2)

M(A, B) = (\mu_A - \mu_B)(\Sigma_A^{-1} + \Sigma_B^{-1})(\mu_A - \mu_B)^T    (3)

where \mu_x is the mean vector, \Sigma_x is the covariance matrix, and d is the dimensionality of the feature vector (x = A or B). M(A, B) often directly reflects, among other physical influences, possible environmental changes in the audio source [13]. Local maxima within a certain surrounding range of the KL2 curve are searched, and the corresponding potential sound change boundaries are found. This is illustrated in Figure 3.

Figure 3. Illustration of sound change detection
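As an illustration of Eqs. (1)-(3), the KL2 distance between two adjacent windows of frame-level features could be computed as below; the window length, step and regularization term are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def kl2_distance(win_a, win_b):
    """Symmetric KL (KL2) distance of Eqs. (1)-(3) between two feature windows.
    win_a, win_b: arrays of shape (n_frames, d) holding the 11-dim LSP+STE features."""
    d = win_a.shape[1]
    mu_a, mu_b = win_a.mean(axis=0), win_b.mean(axis=0)
    reg = 1e-6 * np.eye(d)                          # guard against singular covariances
    cov_a = np.cov(win_a, rowvar=False) + reg
    cov_b = np.cov(win_b, rowvar=False) + reg
    inv_a, inv_b = np.linalg.inv(cov_a), np.linalg.inv(cov_b)
    c = 0.5 * (np.trace(inv_a @ cov_b) + np.trace(inv_b @ cov_a)) - d   # Eq. (2)
    diff = mu_a - mu_b
    m = diff @ (inv_a + inv_b) @ diff                                   # Eq. (3)
    return c + m                                                        # Eq. (1)

def kl2_curve(features, win_frames=100, step=10):
    """Slide two adjacent windows over the frame-level features and return the KL2 curve."""
    ts = range(win_frames, len(features) - win_frames, step)
    return np.array([kl2_distance(features[t - win_frames:t],
                                  features[t:t + win_frames]) for t in ts])
```

Potential change points would then be taken at local maxima of the returned curve, as described above.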

2.3. BIC agglomerative clustering
Agglomerative clustering is necessary for eliminating the false sound change points. It is performed on the segments resulting from the initial segmentation. In this step, each segment is initially regarded as one cluster, and each cluster is modeled by a single Gaussian with a full covariance matrix estimated on the feature vectors of the frames of that cluster. Suppose the Gaussian models of two audio segments are C_1 and C_2, and C is the Gaussian model estimated on the merged segment. The BIC difference between the two models is

\mathrm{BIC}(C_1, C_2) = \frac{1}{2}\left[(N_1 + N_2)\log|\Sigma| - N_1\log|\Sigma_1| - N_2\log|\Sigma_2|\right] - \frac{1}{2}\lambda\left(d + \frac{1}{2}d(d+1)\right)\log(N_1 + N_2)    (4)

where \Sigma is the covariance matrix of the merged cluster C, \Sigma_1 that of cluster C_1, \Sigma_2 that of cluster C_2, N_1 and N_2 are respectively the numbers of frames in the two segments, d is the dimension of the feature space, and \lambda is the penalty factor [6]. At each iteration, the two nearest clusters are merged and the distances of the remaining clusters to the new cluster are updated, until the BIC distances between all cluster pairs become positive.

2.4. Viterbi resegmentation
An eight-component GMM with a diagonal covariance matrix is trained for each segment resulting from the BIC agglomerative clustering. The boundaries are then shifted to the nearest point of low energy within a 1 s interval, so that the segment boundaries are located in silent parts of the audio. The segment boundaries obtained after this step are the final segmentation result.
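The BIC difference of Eq. (4) for a candidate merge of two clusters could be computed as in the sketch below; the penalty factor lambda is a tuning parameter and the value shown is only an assumption.

```python
import numpy as np

def delta_bic(seg1, seg2, lam=1.0):
    """BIC difference of Eq. (4) for merging two clusters of frame-level features.
    seg1, seg2: arrays of shape (N1, d) and (N2, d)."""
    n1, n2 = len(seg1), len(seg2)
    d = seg1.shape[1]

    def logdet_cov(x):
        # log-determinant of the full covariance matrix of a cluster
        return np.linalg.slogdet(np.cov(x, rowvar=False))[1]

    merged = np.vstack([seg1, seg2])
    gain = 0.5 * ((n1 + n2) * logdet_cov(merged)
                  - n1 * logdet_cov(seg1) - n2 * logdet_cov(seg2))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return gain - penalty
```

At each iteration, the pair with the lowest value would be merged, and merging stops once all pairwise values are positive, as described in Section 2.3.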

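The energy-constrained boundary refinement of Section 2.4 could be sketched as follows: each hypothesized boundary is snapped to the lowest-energy frame within a 1 s interval. The 100 frames-per-second rate corresponds to the 10 ms frame step; the function and parameter names are illustrative.

```python
import numpy as np

def refine_boundaries(boundaries, ste, frames_per_sec=100):
    """Shift each boundary (a frame index) to the nearest low-energy point within 1 s.
    ste: per-frame short-time energy; boundaries: iterable of frame indices."""
    half = frames_per_sec // 2          # +/- 0.5 s around each boundary
    refined = []
    for b in boundaries:
        lo, hi = max(0, b - half), min(len(ste), b + half)
        refined.append(lo + int(np.argmin(ste[lo:hi])))
    return refined
```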

3. Classification of Audio Segments
After clustering, classification is performed to determine which kind of sound each cluster contains. More complex feature vectors and Gaussian Mixture Models are employed in this step. Mel-frequency cepstral coefficients (MFCCs) have been commonly used in speech recognition. They are based on the known variation of the human ear's critical bandwidths with frequency. MFCCs are calculated from the log filterbank amplitudes {m_j} using the Discrete Cosine Transform:

c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left(\frac{\pi i}{N}(j - 0.5)\right), \quad i = 1, \ldots, L    (5)

where N is the number of filterbank channels and L is the desired length of the cepstrum; here we set N = 24 and L = 12. In the classification step, 12 MFCC coefficients plus c_0 are computed from the sports audio stream with a frame period of 10 ms and a window size of 25 ms. To represent the dynamic information of the features, we compute the first and second derivatives and append them to the original feature vector, forming a 39-dimensional feature vector. Considering that the sports audio is recorded with high background noise, we carry out Cepstral Mean Normalization (CMN) on the original 13-dimensional MFCCs before computing the first and second order derivatives; CMN is a very effective technique in practice, since it compensates for long-term spectral effects such as those caused by different microphones and audio channels, and thus alleviates the interference. Our previous experiments show that GMMs with 128 Gaussian mixtures and a diagonal covariance matrix, trained on speech data (including pure speech, speech with noise and speech with music) and non-speech data (including music and other environmental sounds such as applause, cheering and motors), can discriminate speech and non-speech in sports audio very well. For each D-dimensional feature vector x, its likelihood under a claimed audio type model \lambda = \{p_i, \mu_i, \Sigma_i\}, i = 1, \ldots, M, with M components is defined as [14]:

p(x \mid \lambda) = \sum_{i=1}^{M} p_i\, b_i(x)    (6)

where b_i(x) are the component densities

b_i(x) = \frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i)\right\}    (7)

and p_i are the mixture weights, while \mu_i and \Sigma_i are the mean vector and covariance matrix, respectively. Classification is carried out on the clusters obtained in the clustering step. After removing silence frames, every frame of a cluster is scored against each model, and the overall score summation for each cluster decides the sound type of that cluster.
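A possible realization of this classification step is sketched below, assuming librosa for the 39-dimensional MFCC front end (13 CMN-normalized coefficients plus first and second derivatives) and scikit-learn's GaussianMixture for the 128-component diagonal-covariance models; the function names, 24-channel filterbank and other parameters are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_39(y, sr=16000):
    """13 MFCCs (incl. c0) with cepstral mean normalization, plus deltas and delta-deltas."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=24,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    mfcc -= mfcc.mean(axis=1, keepdims=True)              # CMN over the whole stream
    feats = np.vstack([mfcc, librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                                         # shape: (n_frames, 39)

def train_models(speech_frames, nonspeech_frames, n_components=128):
    """Train one diagonal-covariance GMM per class on pooled training frames."""
    models = {}
    for name, data in [("speech", speech_frames), ("non-speech", nonspeech_frames)]:
        models[name] = GaussianMixture(n_components=n_components,
                                       covariance_type="diag").fit(data)
    return models

def classify_cluster(cluster_frames, models):
    """Sum per-frame log-likelihoods over a cluster and pick the best-scoring class."""
    scores = {name: gmm.score_samples(cluster_frames).sum()
              for name, gmm in models.items()}
    return max(scores, key=scores.get)
```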

4. Experiments on sports audio data
4.1. Data Sets
Unlike the data sets used in most previous work, our database of sports audio is not composed of relatively clean audio such as CD recordings or the TIMIT database. It comes from broadcast TV, an uncontrolled audio environment with high background sound interference, which makes audio classification much more difficult. We have collected more than 6 hours of evaluation audio from Eurosport TV programs of soccer, tennis, car racing and skiing. Sound change points are hand-labeled, and each segment is labeled with one of two classes: speech and non-speech.

4.2. Evaluation measure
In the experiments, we adopt the commonly used F-measure as our evaluation measure, which is the harmonic mean of the precision P and recall R:

F = \frac{2PR}{P + R}    (8)

where the precision P and recall R are defined as:

P = \frac{\text{target} - \text{miss}}{\text{target} - \text{miss} + \text{false\_alarm}}    (9)

R = \frac{\text{target} - \text{miss}}{\text{target}}    (10)

There are some other notions which we define as:

\text{error\_rate} = \frac{\text{miss} + \text{false\_alarm}}{\text{target} + \text{non\_target}}    (11)

\text{miss\_rate} = \frac{\text{miss}}{\text{target}}    (12)

\text{false\_alarm\_rate} = \frac{\text{false\_alarm}}{\text{non\_target}}    (13)

These notions are illustrated in Figure 4.

Figure 4. Illustration of the evaluation of experiments (taking speech as the target, for example)
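As a worked check of Eqs. (8)-(13), the measures can be computed from accumulated durations (in seconds) as below; the example call reuses the speech-column durations from Table 1 in the next subsection.

```python
def evaluation_measures(target, non_target, miss, false_alarm):
    """Evaluation measures of Eqs. (8)-(13) from accumulated durations in seconds."""
    precision = (target - miss) / (target - miss + false_alarm)      # Eq. (9)
    recall = (target - miss) / target                                # Eq. (10)
    f_measure = 2 * precision * recall / (precision + recall)        # Eq. (8)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "error_rate": (miss + false_alarm) / (target + non_target),  # Eq. (11)
        "miss_rate": miss / target,                                  # Eq. (12)
        "false_alarm_rate": false_alarm / non_target,                # Eq. (13)
    }

# Speech column of Table 1: reproduces P = 97.8%, R = 87.8%, F = 0.925, error rate = 12.7%.
print(evaluation_measures(target=21589.797, non_target=2590.630,
                          miss=2628.694, false_alarm=433.953))
```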


4.3. Results
In the experiments on the sports data described above, we obtained satisfying results for speech and non-speech segmentation, shown in Table 1.

Table 1. Experiment results for speech and non-speech segmentation and classification
Sound type          Speech         Non-speech
Target time         21589.797 s    2590.630 s
Miss time           2628.694 s     433.963 s
False alarm time    433.953 s      2628.664 s
Error rate          12.7%          12.7%
Miss rate           12.176%        16.751%
False alarm rate    16.751%        12.175%
Recall              87.824%        83.249%
Precision           97.763%        45.068%
F-measure           0.925          0.585

For further research, we also attempted to segment the sports audio into speech, music and background sound. Three GMMs were trained, one for each kind of sound. In the experiments, however, we found that it is difficult to discriminate music and background noise in the sports data, and clustering often merged music and background noise segments, which also affected the classification result. The experiment results are shown in Table 2.

Table 2. Experiment results for speech, music and background sound segmentation and classification
Sound type          Speech         Music         Background
Target time         21589.797 s    1687.096 s    903.534 s
Miss time           2587.876 s     888.929 s     554.478 s
FA time             446.088 s      1428.976 s    2156.179 s
Error rate          12.5%          9.6%          11.2%
Miss rate           11.987%        52.690%       61.368%
FA rate             17.219%        6.353%        9.263%
Recall              88.013%        47.310%       38.632%
Precision           97.706%        35.838%       13.933%
F-measure           0.926          0.408         0.205

From the above two tables we see that the system works well for speech and non-speech segmentation, but does not work well on environmental sound. The main reason is that the LSP and MFCC features are very effective for speech/music or speech/non-speech discrimination, but do not work well for music/background discrimination. The features of music and environmental sound are so similar that we cannot get a good result for segmentation and classification; for example, some elements of music, e.g. drumbeats, are acoustically noise-like.
Also, we can see that the duration of speech is much larger than the non-speech duration. Even a small amount of missed speech target therefore has a great impact on the precision of the non-speech class, although the recall of non-speech is still high.

5. Discussion
In this paper, we have proposed a scheme for sports audio segmentation and classification. Our algorithm discriminates well between speech and non-speech, while it performs relatively poorly between music and other environmental sounds. Until now, the speech and non-speech structure of sports audio can provide rough information for sports video parsing in our research. In future work, more knowledge of the characteristics of different types of sports audio data should be investigated, which will be very useful for refining the results of audio classification. We will also bring in new features to make our system work well for speech, music, and background sound segmentation and classification, so that we can provide more semantic information for the content analysis of sports video.

Acknowledgements
Supported by the Key Project of the Ministry of Education of P. R. China (108012), and the Scientific Research Fund for Overseas Returned Staff, Ministry of Education.

References

[1] C. Barras, X. Zhu, S. Meignier, and J. Gauvain, "Multistage speaker diarization of broadcast news," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, 2006, pp. 1505-1512.
[2] C. Barras, X. Zhu, S. Meignier, and J. Gauvain, "Improving speaker diarization," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), 2004.
[3] K. El-Maleh, M. Klein, G. Petrucci, and P. Kabal, "Speech/music discrimination for multimedia applications," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, 2000, pp. 2445-2448.
[4] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1997, pp. 1331-1334, vol. 2.
[5] J. Pinquier, J. Rouas, and R. Andre-Obrecht, "A fusion study in speech/music classification," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003, pp. II-17-20, vol. 2.
[6] L. Lu, H.-J. Zhang, and H. Jiang, "Content analysis for audio classification and segmentation," IEEE Transactions on Speech and Audio Processing, vol. 10, 2002, pp. 504-516.
[7] L. Lu, S. Z. Li, and H.-J. Zhang, "Content-based audio segmentation using support vector machines," in Proc. IEEE International Conference on Multimedia and Expo (ICME), 2001, pp. 749-752.
[8] J. Saunders, "Real-time discrimination of broadcast speech/music," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996, pp. 993-996, vol. 2.
[9] Y. Dong, L. Lu, and H. Wang, "Confusion based automatic question generation," IET Conference Publications, vol. 2008, Jan. 2008, pp. 64-67.
[10] Y. Dong, J. Zhao, L. Lu, J. Liu, X. Zhao, and H. Wang, "Eigenchannel compensation and symmetric score for robust text-independent speaker verification," in Proc. 6th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2008, pp. 1-4.
[11] Y. Dong, L. Lu, Z. Xianyu, and Z. Jian, "Studies on model distance normalization approach in text-independent speaker verification," Acta Automatica Sinica, vol. 35, May 2009, pp. 556-560.
[12] L. Lu, H. Jiang, and H. Zhang, "A robust audio classification and segmentation method," in Proc. 9th ACM International Conference on Multimedia, Ottawa, Canada, 2001, pp. 203-211.
[13] A. Haubold and J. R. Kender, "Accommodating sample size effect on similarity measures in speaker clustering," arXiv:cs/0612138, Dec. 2006.
[14] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, 1995, pp. 72-83.
