A Fast and Accurate Face Detector for Indexation of Face Images

Raphael Feraud, Olivier Bernier, Jean Emmanuel Viallet and Michel Collobert
France Telecom CNET
[email protected]

Abstract

Detecting faces in images with complex backgrounds is a difficult task. Our approach, which obtains state-of-the-art results, is based on a generative neural network model: the Constrained Generative Model (CGM). To detect side view faces and to decrease the number of false alarms, a conditional mixture of networks is used. To decrease the computational cost, a fast search algorithm is proposed. The level of performance reached, in terms of detection accuracy and processing time, allows us to apply this detector to a real-world application: the indexation of face images on the Web.

1. Introduction and Related Work

Detecting faces in images with complex backgrounds is a difficult task. Among other techniques [9], neural networks are efficient face detectors ([3], [6]). Web indexation applications, for which faces are high level semantic objects ([4], [7], [8]), are characterized by the amount and variability of the images to be processed. While the detection rate is not critical (faces may be missed), false alarms must be avoided to obtain a noise-free indexation. Processing time must be kept as low as possible, as the number of images on the Web increases continuously. In this paper, we describe a state-of-the-art neural network model, the Constrained Generative Model (CGM). To detect side view faces and to decrease the false alarm rate, a conditional mixture of networks is used. Then, to decrease processing time, a fast search algorithm is proposed. Finally, the level of performance reached for Web image indexation is discussed.

2. The face detector

To detect a face in an image means to find its position in the image plane and its size or scale. Our approach first applies simple processes based on standard image processing and then, considering that an image of a face is a particular event in the set of all possible images, more sophisticated processes based on statistical analysis (Figure 1).

2.1. Hypothesis elimination

When color information is available, a color filter, built from a table of skin pixels collected manually on a large collection of face images, is applied. Subwindows containing a small number of skin pixels are considered background subwindows. The others, corresponding to approximately 40% of the total number of subwindows (depending on the image), are evaluated by a neural network filter. The filter is a single multilayer perceptron (MLP) [2]. It has 300 inputs, corresponding to the size of the extracted subwindows (15x20 pixels), 20 hidden neurons, and one output (face/non-face), for a total of 6041 weights. The network is trained using standard backpropagation. The face training set is composed of 8 000 front view and side view faces. Approximately 50 000 specific non-face examples were collected using an iterative algorithm [10]. The subwindows are enhanced by histogram equalization and smoothed, then normalized by subtraction of the average face. The resulting network is relatively small and fast, with a very high detection rate (above 99%) but also a high false alarm rate (up to 1%). Unusable alone because of its poor false alarm rate, this network is used as a filter which discards more than 99% of the hypotheses.
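The two cheap elimination stages can be sketched as follows (a hypothetical implementation: the skin table, its quantization, and the smoothing kernel are illustrative stand-ins for the paper's actual filter):

```python
import numpy as np

def skin_fraction(subwindow_rgb, skin_table):
    """Fraction of pixels whose quantized chromaticity falls in the skin table.

    skin_table is a hypothetical 32x32 boolean lookup table over normalized
    (r, g) chromaticities, standing in for the manually collected pixel table.
    """
    rgb = subwindow_rgb.reshape(-1, 3).astype(float)
    total = rgb.sum(axis=1) + 1e-8
    r = np.clip((rgb[:, 0] / total * 32).astype(int), 0, 31)
    g = np.clip((rgb[:, 1] / total * 32).astype(int), 0, 31)
    return skin_table[r, g].mean()

def normalize_subwindow(gray, mean_face):
    """Histogram equalization, light smoothing, then mean-face subtraction."""
    # histogram equalization via the empirical CDF of pixel values
    values, counts = np.unique(gray, return_counts=True)
    cdf = np.cumsum(counts) / gray.size
    equalized = np.interp(gray.ravel(), values, cdf * 255.0).reshape(gray.shape)
    # 3x3 box smoothing (borders handled by edge padding)
    padded = np.pad(equalized, 1, mode="edge")
    smoothed = sum(padded[i:i + gray.shape[0], j:j + gray.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0
    return smoothed - mean_face
```

Subwindows whose `skin_fraction` is below a threshold would be rejected outright; the survivors are normalized and passed to the MLP filter.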

2.2. The Modular System

The last detection stage is a modular system, based on a new neural network model (Figure 2), the Constrained Generative Model (CGM) [3]. The goal of the learning process of the CGM is to approximate the projection P of a point x of the input space onto the set of face subwindows F:

    P(x) = x,                        if x ∈ F,
    P(x) = arg min_{y ∈ F} d(x, y),  if x ∉ F,

where d is the Euclidean distance.
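With a finite sample of face subwindows, this projection reduces to a nearest-neighbor lookup, which can be sketched as follows (the CGM approximates this mapping with a network rather than storing the face set):

```python
import numpy as np

def project_onto_faces(x, faces):
    """Nearest-neighbor approximation of the projection P onto the face set F.

    x     : flattened subwindow, shape (d,)
    faces : array of flattened face subwindows, shape (m, d)
    Returns the element of `faces` closest to x in Euclidean distance
    (and x itself if x already belongs to the set).
    """
    distances = np.linalg.norm(faces - x, axis=1)
    return faces[np.argmin(distances)]
```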

[Figure 1: The face detection system. Subwindows extracted from the image pass through a color filter (for color images) and an MLP filter; surviving hypotheses are evaluated by the modular system, in which a gate network combines four CGMs (front view 1-2, side view 1-2) to reach the face/non-face decision.]

When training the neural network, each face example is reconstructed on the output layer. Each non–face example is constrained to be reconstructed as the mean of the 2 nearest neighbors of the nearest face example. Therefore the neural network learns to estimate the projection of a subwindow onto the set of face subwindows: the reconstruction error is related to the distance between a point and the set of faces:
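The target construction described above can be sketched as follows (a hypothetical helper operating on flattened subwindows):

```python
import numpy as np

def cgm_target(x, is_face, faces):
    """Output target for one training example of the CGM.

    A face example is reconstructed as itself; a non-face example is targeted
    at the mean of the 2 nearest neighbors (within the face set) of its
    nearest face example.
    """
    if is_face:
        return x
    d = np.linalg.norm(faces - x, axis=1)
    nearest_face = faces[np.argmin(d)]
    # the 2 nearest neighbors of that face among the other faces
    d2 = np.linalg.norm(faces - nearest_face, axis=1)
    order = np.argsort(d2)
    neighbors = faces[order[1:3]]  # order[0] is the nearest face itself
    return neighbors.mean(axis=0)
```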

The reconstruction error used is the mean squared error:

    ε(x) = (1/N) Σ_{i=1}^{N} (x_i − x̂_i)²

where N = 15 x 20 = 300 is the size of the input image x and x̂ is the image reconstructed by the neural network. As a consequence, if we assume that the learning process is consistent, our algorithm is able to evaluate the probability that a point belongs to the set of faces. Let y be a binary random variable, y = 1 corresponding to a face example and y = 0 to a non-face example; we express this probability as:

    P(y = 1 | x) = exp(−ε(x) / C)

where C is used to adjust the sensitivity.
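One plausible reading of this probability estimate maps the mean reconstruction error through a decaying exponential with a sensitivity constant (the exact functional form and the value of the constant are assumptions of this sketch):

```python
import numpy as np

def face_probability(x, x_hat, sensitivity):
    """Map a subwindow's reconstruction error to a face probability.

    The decaying-exponential form and the sensitivity value are illustrative
    assumptions; x_hat is the network's reconstruction of x.
    """
    n = x.size  # 15 x 20 = 300 for the subwindows used in the paper
    error = np.sum((x - x_hat) ** 2) / n  # mean squared reconstruction error
    return float(np.exp(-error / sensitivity))
```

A perfect reconstruction yields probability 1; the probability decays as the subwindow moves away from the set of faces.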

The training face set and the normalization of the subwindows are the same as those used for the previous neural network filter. The training set is divided into four sets (labeled with the variable v). These four sets, each containing approximately 2000 face subwindows, are used to build four CGMs specialized on different face orientations: [0°, 20°], [20°, 40°], [40°, 60°] and [60°, 80°], respectively corresponding to CGM1 to CGM4. To each set of faces corresponds a set of counter-examples, containing approximately 2000 subwindows, collected by an iterative algorithm [10]. Each CGM module evaluates the probability that an extracted subwindow of the image is a face, knowing the value of the random variable v. Assuming that the orientation partition can be generalized to every input, including the non-face subwindows, the modular system combines the estimators using a gate network.
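A minimal sketch of this gated combination (hypothetical helper names; it assumes the per-module probabilities and the gate outputs have already been computed):

```python
import numpy as np

def modular_output(per_module_probs, gate_probs):
    """Combine per-orientation CGM outputs with the gate network outputs:
    P(y = 1 | x) = sum_i P(y = 1 | x, v = i) * P(v = i | x).

    per_module_probs : P(y = 1 | x, v = i) for each CGM i, shape (n,)
    gate_probs       : P(v = i | x) from the gate network, shape (n,), sums to 1
    """
    return float(np.dot(per_module_probs, gate_probs))
```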

[Figure 2: Architecture of a CGM — 15 x 20 inputs, hidden layers of 35 and 50 neurons, and 15 x 20 outputs.]

The combined output of the modular system is:

    P(y = 1 | x) = Σ_{i=1}^{n} P(y = 1 | x, v = i) P(v = i | x)

where n is the number of CGMs, P(v = i | x) the output of the gate network and P(y = 1 | x, v = i) the output of the CGM i. This system is different from the mixture of experts introduced in [5]: each module is trained separately on a subset of the training set, and the gating network then learns to combine the outputs. Since prior knowledge is used to split the training set, and since each module is trained separately, the capacity of this system is less than that of the more general case, the mixture of experts.

3. The search algorithm

In this part, we focus on a technique to reduce the computational cost of the face detection process. The detector locates faces in a subwindow of fixed size, 15x20 pixels. To detect faces at different scales, a subsampling of the original image is performed. An exhaustive search leads to evaluating a very large number of subwindows: all the subwindows in all the subsampled images have to be tested. To reduce this computational cost, Rowley [6] uses a simple multilayer perceptron, similar to our filter network, which determines the possible locations of faces; a larger network is then used to achieve precise localization. In WebSeer [4], this reduction is obtained with a photograph/graphic classification. Another approach [1] is to calculate the Fourier transform of the image and of the neural network filter, and then to process the image in the Fourier space. This interesting approach cannot be used in our case: the image normalization we use is local and not easily computed in the Fourier space of the whole image.

To reduce the computational cost of the face detection process, our approach is to reduce the number of subwindows analyzed. We assume that around a face subwindow, the output of the modular network is a continuous monotonic function: the farther an input subwindow is from a face subwindow, the lower the output of the face detector. This assumption is verified on a large set of images (Figure 3). Consequently, we use the following algorithm to speed up the face detection process:

Step 1. At each scale, the subsampled subwindows centered on the intersection points of a regular grid (Figure 4) are tested by the detector (color filter, neural network filter and modular system).

Step 2. A local exhaustive search is performed around the points of the grid where P(y = 1 | x), the output of the modular system, is greater than a first threshold.

Step 3. At each scale, the subwindows corresponding to the points of the local exhaustive search (step 2) where P(y = 1 | x) is greater than a second threshold are stored in a set S.

Step 4. An overlapping elimination or summation (depending on the overlapping surface), between the different positions and scales of the subwindows of S, is performed to locate the faces.
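Steps 1-3 can be sketched, for a single scale, as the following hypothetical routine (`detector`, the grid step, the thresholds, and the search radius are illustrative stand-ins for the paper's values; the overlap handling of Step 4 is omitted):

```python
def grid_search(detector, width, height, step=4, t1=0.5, t2=0.8, radius=2):
    """Coarse-to-fine search: detector(x, y) returns P(y = 1 | x) at (x, y)."""
    candidates = []
    # Step 1: evaluate the detector only on a regular grid
    for gy in range(0, height, step):
        for gx in range(0, width, step):
            if detector(gx, gy) <= t1:
                continue
            # Step 2: local exhaustive search around promising grid points
            for y in range(max(0, gy - radius), min(height, gy + radius + 1)):
                for x in range(max(0, gx - radius), min(width, gx + radius + 1)):
                    # Step 3: keep positions above the second threshold
                    if detector(x, y) > t2:
                        candidates.append((x, y))
    return sorted(set(candidates))
```

Because only grid points are evaluated at full cost, the number of detector calls drops roughly by a factor of step² away from faces.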



4. Indexation of face images

Currently, most of the indexation engines on the Web are based on textual information. Nevertheless, the information in a Web page consists of both text and images. Therefore, the results of an image search using a textual indexation engine can be very noisy. In this section, we propose an image indexation engine based on our face detector, in order to collect Web images containing faces [4]. The proposed application makes it easy to sort images of faces. Knowing the location, scale and number of faces, the image can be indexed with the following labels: portrait or group picture, image containing a face, and background image. Merging this information with the textual information, the functionalities proposed by our system allow searching for a particular image such as: an image of John Coltrane, a portrait of Bill Clinton, a picture of the Beatles. Moreover, access providers could store the face information at low cost: the cropped frames corresponding to the faces present in the image can be stored instead of the whole image and summarize it (Figure 5). The difficulty of this problem is to process the amount of information contained in the Web pages. The answers
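The labeling rule described above can be sketched as follows (a hypothetical rule of thumb; the size threshold distinguishing a portrait from an image merely containing a face is illustrative):

```python
def index_label(faces, image_area):
    """Assign an indexation label from a list of (x, y, w, h) detections."""
    if not faces:
        return "background image"
    if len(faces) == 1:
        x, y, w, h = faces[0]
        # a single sufficiently large face is indexed as a portrait
        if w * h > 0.1 * image_area:
            return "portrait"
        return "image containing a face"
    return "group picture"
```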
