
Video Indexing Using Face Detection and Face Recognition Methods

Stefan Eickeler, Stefan Müller, Gerhard Rigoll
Gerhard-Mercator-University Duisburg, Department of Computer Science, Faculty of Electrical Engineering, 47057 Duisburg, Germany
e-mail: {eickeler,stm,rigoll}@fb9-ti.uni-duisburg.de

Abstract

This paper presents a video indexing approach that combines face detection and face recognition. The frames of the video sequence are scanned for faces by a neural-network-based face detector, and the detected faces are extracted from the sequence. The faces are then grouped into clusters by combining a face recognition method based on pseudo two-dimensional Hidden Markov Models with the k-means clustering algorithm. Each resulting cluster consists of the face images of one person. In the next step, each detected face is labeled as one of the people in the video sequence, so that the occurrences of each person can be evaluated. Results of the proposed approach on a TV broadcast news sequence are presented. The system was able to distinguish between three different newscasters and an interviewed person.

1 Introduction

People are among the most important content of movies and other video sequences. A good video indexing approach would therefore detect the people in the sequence, recognize them, and analyze their actions. This paper explores the usability of face detection and recognition to find the faces in a video sequence and assign them to the different characters. First we give a short description of the face detection algorithm, the face recognition method based on pseudo two-dimensional Hidden Markov Models, and the k-means clustering using HMMs. Then we explain how these methods are combined into a video indexing system. The paper ends with the results on a sample sequence from a TV broadcast news show and the conclusions.

2 Face Detection

The face detection is very similar to the neural-network-based method presented in [6]. A square sampling window extracts subimages of different sizes from the frames of the video sequence. These subimages are preprocessed by subtracting a best-fit brightness plane and performing a histogram equalization. The subimages are then scaled to a size of [...].

Each pixel of the scaled subimage is used as an input of a neural network that is trained on face and non-face samples. The output neuron gives the probability that the subimage represents a face. Subimages with a probability higher than a threshold are extracted from the frames. The face detection is applied to the frames of a TV news show; Fig. 1 shows the results for two frames.

Figure 1: Results of face detection
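The scanning and preprocessing steps of Section 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the window sizes, the step width, and the network input size `net_size` are hypothetical, the brightness plane is fitted over the full window, the histogram equalization is rank based, and the trained neural network is represented by an arbitrary callable `face_probability`.

```python
def sliding_windows(width, height, sizes, step):
    """Yield (left, top, size) for every square window that fits in the frame."""
    for size in sizes:
        for top in range(0, height - size + 1, step):
            for left in range(0, width - size + 1, step):
                yield left, top, size


def subtract_brightness_plane(window):
    """Subtract the least-squares plane a*x + b*y + c from a grayscale window.

    With coordinates centred on the window the normal equations decouple,
    so a, b and c have closed forms (assumption: fit over the full window).
    """
    n_rows, n_cols = len(window), len(window[0])
    xs = [c - (n_cols - 1) / 2 for c in range(n_cols)]
    ys = [r - (n_rows - 1) / 2 for r in range(n_rows)]
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    mean = sum(window[r][c] for r, c in cells) / len(cells)
    sxx = n_rows * sum(x * x for x in xs)
    syy = n_cols * sum(y * y for y in ys)
    a = sum(xs[c] * window[r][c] for r, c in cells) / sxx if sxx else 0.0
    b = sum(ys[r] * window[r][c] for r, c in cells) / syy if syy else 0.0
    return [[window[r][c] - (a * xs[c] + b * ys[r] + mean)
             for c in range(n_cols)] for r in range(n_rows)]


def equalize(window):
    """Rank-based histogram equalization: map each value to 255 * CDF(value)."""
    flat = sorted(v for row in window for v in row)
    cdf = {}
    for i, v in enumerate(flat):
        cdf[v] = (i + 1) / len(flat)  # ties keep the highest rank
    return [[255.0 * cdf[v] for v in row] for row in window]


def resize_nearest(window, out_size):
    """Scale a square window to out_size x out_size by nearest-neighbour sampling."""
    n = len(window)
    return [[window[r * n // out_size][c * n // out_size]
             for c in range(out_size)] for r in range(out_size)]


def detect_faces(frame, face_probability, sizes, step, threshold, net_size):
    """Return (left, top, size, probability) for windows scoring above threshold.

    face_probability is a hypothetical stand-in for the trained network.
    """
    hits = []
    for left, top, size in sliding_windows(len(frame[0]), len(frame), sizes, step):
        window = [row[left:left + size] for row in frame[top:top + size]]
        window = equalize(subtract_brightness_plane(window))
        window = resize_nearest(window, net_size)
        p = face_probability(window)
        if p > threshold:
            hits.append((left, top, size, p))
    return hits
```

A frame is passed as a list of rows of gray values, for example `detect_faces(frame, net, sizes=[20, 30, 40], step=4, threshold=0.5, net_size=20)`, where every parameter value is chosen purely for illustration.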

Figure 2: 1-D Hidden Markov Model

Figure 3: Pseudo 2-D Hidden Markov Model

3 Face Recognition

The face recognition module uses pseudo 2-D Hidden Markov Models (HMMs) and DCT coefficients [3]. The image of the face is scanned with a sampling window from top to bottom and from left to right. The size of the sampling window depends on the size of the face image. The pixels in the sampling window are transformed using the DCT, and the first 15 coefficients are arranged in a feature vector. Using DCT coefficients as features has two important advantages:

1. The DCT decorrelates the subimage and allows the use of diagonal covariance matrices for the probability density functions of the HMMs.
2. The face recognition can be applied directly to JPEG and MPEG compressed images.

An overlap between adjacent sampling windows can be used to improve the ability of the HMM to model the neighborhood relations between the windows. The resulting array of feature vectors is classified using pseudo 2-D HMMs. A single HMM is trained for each person on the training images using the Baum-Welch algorithm. For recognition, the Viterbi algorithm or the Forward-Backward algorithm is used to determine the probability of each face model for the test image.

HMMs [5] are statistical models that consist of several states. At each step, a state transition is performed according to a transition probability matrix, and a symbol is emitted according to the probability density function (pdf) assigned to the current state. Figure 2 shows a one-dimensional Hidden Markov Model with four states and their assigned pdfs.

Pseudo 2-D HMMs extend the one-dimensional case to two-dimensional data such as images. They are nested one-dimensional HMMs: a superior HMM models the sequence of columns in the image, and instead of a probability density function, each state of the superior model (superstate) contains a one-dimensional HMM that models the cells inside the column.
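The feature extraction described at the beginning of this section can be sketched as follows. The sketch rests on assumptions that the text does not confirm: an orthonormal DCT-II, a zig-zag (JPEG-style) ordering to select the "first 15" coefficients, square sampling windows, and the hypothetical names `dct2`, `zigzag_indices`, and `feature_array`. The result is indexed as `[column][cell]`, matching the column-wise structure of the pseudo 2-D HMM.

```python
import math


def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)
    scale = [math.sqrt(1.0 / n)] + [math.sqrt(2.0 / n)] * (n - 1)
    # cos_table[k][i] = cos(pi * (2i + 1) * k / (2n))
    cos_table = [[math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
                 for k in range(n)]
    return [[scale[u] * scale[v] * sum(block[y][x] * cos_table[u][y] * cos_table[v][x]
                                       for y in range(n) for x in range(n))
             for v in range(n)] for u in range(n)]


def zigzag_indices(n, count):
    """First `count` (row, col) positions of an n x n block in zig-zag order."""
    def key(pos):
        diagonal = pos[0] + pos[1]
        # Odd diagonals run top to bottom, even diagonals bottom to top.
        return diagonal, pos[0] if diagonal % 2 else pos[1]
    order = sorted(((u, v) for u in range(n) for v in range(n)), key=key)
    return order[:count]


def feature_array(face, block, overlap, n_coeffs=15):
    """Scan the face image column by column and return DCT feature vectors.

    face    : grayscale image as a list of rows
    block   : side length of the square sampling window (needs block**2 >= n_coeffs)
    overlap : number of pixels shared by adjacent windows (must be < block)
    """
    step = block - overlap
    indices = zigzag_indices(block, n_coeffs)
    columns = []
    for left in range(0, len(face[0]) - block + 1, step):
        column = []
        for top in range(0, len(face) - block + 1, step):
            window = [row[left:left + block] for row in face[top:top + block]]
            coeffs = dct2(window)
            column.append([coeffs[u][v] for u, v in indices])
        columns.append(column)
    return columns
```

Choosing `block` in proportion to the face size keeps the number of feature vectors per face roughly constant, which is the role the block size plays when faces of different sizes are compared.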
Figure 3 shows a pseudo 2-D HMM with four superstates, each containing a three-state 1-D HMM. The probability density functions of the inferior models are omitted in this figure.

The Baum-Welch algorithm provides the HMM parameters corresponding to a local maximum of the likelihood function, and which maximum is reached depends on the initial model parameters [5]. It is therefore very important to use a good initial model for the training. We train a common initial model on all faces in the training set using the Baum-Welch algorithm. This common model is then refined on the training faces of one person to obtain the model for this person.

The face recognition based on pseudo 2-D HMMs was evaluated on the Olivetti Research Laboratory (ORL) face database. The recognition rate of 100 % was better than the recognition results of all other methods evaluated on this database.

4 HMM-Clustering

HMM-clustering is an unsupervised grouping of data into classes containing similar members. It is a k-means clustering in which each cluster (codebook entry) is represented by a Hidden Markov Model instead of a vector, and it therefore allows the clustering of 1-D vector sequences. A similar method was published in [4]. To initialize the clustering process, a number of sequences are assigned randomly to each cluster. The codebook is then refined iteratively by training the HMM of each cluster on its assigned sequences and then reassigning each sequence to the HMM that has the highest probability of producing it. This is repeated until the likelihoods of the clusters converge. The result consists of the clusters of data and the codebook of HMMs.

Pseudo 2-D HMMs can be used to represent a cluster without any other modification of the clustering algorithm. This allows face images to be grouped into classes, which cannot be done by classical k-means clustering because the large variety of facial expressions requires the incorporation of a face recognition method. Smoothing the variances of the HMMs with the variances of the common initial model prevents clusters with only a few members from overfitting and gives a better similarity inside the resulting classes.

5 Video Indexing

For video indexing, one frame per second of the video sequence is used for face detection. The image regions containing a face are extracted from the frames. On these images the DCT features for the face recognition are calculated using a block size that gives approximately the same number of feature vectors for all face images. The feature arrays are then grouped by the HMM-clustering. The biggest clusters contain the main people of the video sequence, and their occurrence in the video sequence can be evaluated further.

6 Experiments and Results

To show the capabilities of the proposed approach, we applied it to the indexing of TV news. The TV news show was captured at a resolution of [...] with a frame rate of 5 fps. Three different newscasters read the news in this show, so the approach presented in [1, 2] cannot be used in this case, because it can only cope with one newscaster. The proposed approach was able to assign the video segments to the three newscasters correctly. Figures 4, 5, and 6 show images of the three clusters representing the three newscasters. Figure 7 shows images of the cluster of an interviewed person.

7 Conclusions

This paper presented a video indexing approach based on the detection and recognition of people.
It was shown that this method has advantages compared to our previous video indexing method. The proposed approach is capable of indexing a video sequence without any prior knowledge of the sequence, because no video model has to be trained on training samples as in [1, 2]. The presented method can be further improved by detecting cuts and edit effects and by tracking faces, which simplifies the detection of the face of the same person in consecutive frames of the video sequence. In the future we will apply this method to detect the main characters in movies. We hope to be able to extend the method to work directly on MPEG data. Our face recognition approach [3] is already capable of recognizing faces directly in JPEG data. For face detection in the compressed domain, a scheme similar to [7] has to be developed.

References

[1] S. Eickeler, A. Kosmala, and G. Rigoll. A New Approach to Content-Based Video Indexing Using Hidden Markov Models. In Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pages 149–154, Louvain-la-Neuve, Belgium, June 1997.
[2] S. Eickeler and S. Müller. Content-Based Video Indexing of TV Broadcast News Using Hidden Markov Models. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 2997–3000, Phoenix, Apr. 1999.
[3] S. Eickeler, S. Müller, and G. Rigoll. Recognition of JPEG Compressed Face Images Based on Statistical Methods. Technical report, Faculty of Electrical Engineering - Computer Science, Gerhard-Mercator-University Duisburg, 1999. http://www.fb9-ti.uni-duisburg.de/report.html.
[4] L. M. D. Owsley, L. E. Atlas, and G. D. Bernard. Automatic Clustering of Vector Time-Series for Manufacturing Machine Monitoring. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 3393–3396, Munich, Apr. 1997.
[5] L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. of the IEEE, 77(2):257–285, Feb. 1989.
[6] H. A. Rowley, S. Baluja, and T. Kanade. Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, Jan. 1998.
[7] H. Wang and S.-F. Chang. A Highly Efficient System for Automatic Face Region Detection in MPEG Video. IEEE Transactions on Circuits and Systems for Video Technology, 7(4):615–628, Aug. 1997.

Figure 4: First newscaster

Figure 5: Second newscaster

Figure 6: Third newscaster

Figure 7: Interviewed person