Robust Smile Detection using Convolutional Neural Networks

Additional Material

Simone Bianco, Luigi Celona∗, Raimondo Schettini

Dipartimento di Informatica, Sistemistica e Comunicazione (DISCo), University of Milano-Bicocca, 20126 Milano, Italy

Abstract

We present a fully automated approach for smile detection. Faces are detected using a multi-view face detector, then aligned and scaled using automatically detected eye locations. A Convolutional Neural Network (CNN) then determines whether the face is smiling or not. To this end, we investigate different shallow CNN architectures that can be trained even when the amount of learning data is limited. We evaluate our complete processing pipeline on the largest publicly available image database for smile detection in an uncontrolled scenario. We investigate the robustness of the method to different kinds of geometric transformations (rotation, translation, and scaling) due to imprecise face localization, and to several kinds of distortions (compression, noise, and blur). To the best of our knowledge, this is the first time that this type of investigation has been performed for smile detection. Experimental results show that our proposal outperforms state-of-the-art methods on both high- and low-quality images.

Keywords: Smile detection, Deep learning, Convolutional Neural Networks, Face detection, Face alignment

1. The proposed approach

The difference between configuration A and configuration B is that the latter uses two 3 × 3 convolutional layers instead of a single 5 × 5 layer. In this way, we incorporate two non-linear rectification layers instead of a single one, which makes the decision function more discriminative and the CNN deeper. Furthermore, the use of two smaller filters also decreases the number of parameters, from 79,712 (configuration A) to 72,544 (configuration B). In fact, assuming both the input and the output of the two configurations have C channels, a single 5 × 5 convolutional layer requires 5²C² = 25C² parameters, whereas a two-layer 3 × 3 convolutional stack requires 2(3²C²) = 18C² parameters. Thus, the use of two smaller filters can be considered a regularization approach (with injected non-linearity).

Unlike configuration A, configuration C contains one more fully-connected layer, with 1024 neurons, before the FC-2. In this way, we capture the global properties of the previous convolutional layer before the final fully-connected layer.

2. Experimental setup

2.1. Image database

Figure 1 shows some typical images of the GENKI-4K database.

∗Corresponding author. Tel: +39 02 6448 7893. Email addresses: [email protected] (Simone Bianco), [email protected] (Luigi Celona), [email protected] (Raimondo Schettini)
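The 25C² versus 18C² comparison can be checked with a quick calculation (a sketch counting convolution weights only, with no bias terms, as in the totals of Table 1):

```python
# Weights of a single 5x5 convolution vs. a stack of two 3x3
# convolutions, both mapping C channels to C channels (no biases).
def single_5x5(C):
    return 5 * 5 * C * C        # 25 C^2 parameters

def stacked_3x3(C):
    return 2 * (3 * 3 * C * C)  # 18 C^2 parameters

C = 32
print(single_5x5(C))                   # 25600
print(stacked_3x3(C))                  # 18432
print(single_5x5(C) - stacked_3x3(C))  # 7168
```

The difference of 7,168 weights is exactly the gap between the totals reported for configurations A (79,712) and B (72,544).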

Table 1: CNN configurations investigated in this work (shown in columns). The convolutional layer parameters are denoted as "conv⟨receptive field size⟩-⟨number of channels⟩". The ReLU activation function is not shown for brevity.

                     CNN Configuration
A                    B                    C
4 weight layers      5 weight layers      5 weight layers
               input (32 × 32 RGB image)
conv3-32             conv3-32             conv3-32
                        maxpool
                          LRN
conv5-32             conv3-32             conv5-32
                     conv3-32
                        avgpool
                          LRN
conv5-64             conv5-64             conv3-64
                        avgpool
                                          FC-1024
                          FC-2
                        soft-max
                     # of parameters
79,712               72,544               1,093,472
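The totals in the last row of Table 1 for configurations A and B can be reproduced with a short calculation (a sketch counting weights only, with no bias terms; the 4 × 4 × 64 input to FC-2 assumes each of the three pooling layers halves the spatial resolution of the 32 × 32 input):

```python
# Weight counts for CNN configurations A and B of Table 1 (no biases).
def conv_params(k, c_in, c_out):
    """Weights of a k x k convolution from c_in to c_out channels."""
    return k * k * c_in * c_out

def fc_params(n_in, n_out):
    """Weights of a fully-connected layer."""
    return n_in * n_out

# Three pooling stages halve the 32x32 input to 4x4 before FC-2.
fc_in = 4 * 4 * 64

# Configuration A: conv3-32, conv5-32, conv5-64, FC-2.
params_a = (conv_params(3, 3, 32)       # first conv sees the RGB input
            + conv_params(5, 32, 32)
            + conv_params(5, 32, 64)
            + fc_params(fc_in, 2))

# Configuration B: the single 5x5 (32->32) layer of A is replaced
# by a stack of two 3x3 (32->32) layers.
params_b = (conv_params(3, 3, 32)
            + 2 * conv_params(3, 32, 32)
            + conv_params(5, 32, 64)
            + fc_params(fc_in, 2))

print(params_a)  # 79712
print(params_b)  # 72544
```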

2.2. CNN training

We train our CNNs from scratch using Stochastic Gradient Descent (SGD) with a batch size of 256, momentum set to 0.9, and a weight decay parameter of 0.002. We initialize the learning rate to 0.001 and drop it by a factor of 10 every 6000 iterations. We train for a total of 30000 iterations, using the Caffe [1] library.

Figure 1: Typical sample images from the GENKI-4K database. Smiling faces (top) and non-smiling faces (bottom) are shown.

2.3. Performance evaluation and results

Concerning our method, some examples of misclassified images are reported in Figures 2 and 3. Figure 2 depicts some faces labeled as non-smile in the database that the CNN-A classifies as smile. Conversely, Figure 3 reports some examples of faces labeled as smile in the database that our approach classifies as non-smile. From these images it is possible to see that some classification errors are due to incorrect labels in the dataset. Apart from these misclassifications, the greatest source of error is very poor facial landmark localization, as shown in Figure 4. In the following sections we therefore investigate the robustness of the CNN to bad face alignment and image distortions.

Figure 2: Faces labeled as non-smile in the GENKI-4K database that the CNN-A classifies as smile. Images are reported in order of decreasing confidence p(smile).

Figure 3: Faces labeled as smile in the GENKI-4K database that the CNN-A classifies as non-smile. Images are reported in order of increasing confidence p(smile).

Figure 4: Some misclassified examples caused by bad alignment. The first row shows the faces before alignment with the detected facial landmarks overlaid; the second row shows the faces after alignment.

2.4. Classification robustness with respect to face alignment

Table 2: Types and ranges of the geometric transformations applied to the original images to simulate a bad alignment.

Type          Amount
Rotation      Angle: −30°, −20°, ..., 30°
Scaling       Factor: 0.80, 0.90, ..., 1.20
Translation   Offset: −8, −6, ..., 8

Figure 5: Examples of the transformations applied to cropped faces to simulate a bad alignment: (a) rotation, (b) scaling, (c) translation.

2.5. Classification robustness with respect to image artifacts

We run two different experiments: in the first one we consider a single artifact at a time; in the second one, images are corrupted by multiple artifacts together. The results of the single-artifact experiment are reported in Figure 7. In the same plots, the results obtained by our implementation of the method by Gao et al. [2] are also reported. From the plots it is possible to notice that CNN-A shows a very high level of robustness against JPEG compression and Gaussian noise: in both cases the performance remains almost unaltered even for large distortion levels. Concerning Gaussian and motion blur, the CNN shows a lower level of robustness, with performance decreasing for filter sizes and pixel lengths larger than 10. In comparison with the method by Gao et al. [2], we observe a very similar behavior against JPEG compression and a higher robustness to Gaussian noise and motion blur. For Gaussian blur we notice an inversion in performance, with the method by Gao et al. [2] having higher robustness for large filter sizes. Some samples of face crops after the application of the multiple artifacts at the six distortion levels considered are reported in Figure 8.

Table 3: Types and ranges of the distortions applied to the original images.

Type            Amount
JPEG compr.     Quality: 99%, ..., 0%
Gaussian noise  Zero mean, σ² = 0.01, 0.02, ..., 0.06
Gaussian blur   Filter size: 3 × 3, 9 × 9, ..., 25 × 25; σ² = filter size × 0.25
Motion blur     Pixel length: 5, 10, ..., 30; angle: 45°

Figure 6: Some examples of the face crops after artifact application: (a) JPEG compression, (b) Gaussian noise, (c) Gaussian blur, (d) Motion blur.

Figure 8: Some samples of face crops after the application of a combination of the three artifacts (motion blur, Gaussian noise and JPEG compression) at the six distortion levels considered.

2.6. Network training adding images distorted by artifacts

We fine-tune the CNN-A by chopping off the FC-2 layer and retraining it from scratch. We fine-tune using the same parameters used for training (batch size of 256 and momentum of 0.9), except that the starting learning rate is set to 1e-4, the weight decay parameter to 2e-4, and the total number of iterations to 15000.

References

[1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the ACM International Conference on Multimedia, ACM, 2014, pp. 675–678.

[2] Y. Gao, H. Liu, P. Wu, C. Wang, A new descriptor of gradients self-similarity for smile detection in unconstrained scenarios, Neurocomputing.

Simone Bianco received the BSc and MSc degrees in mathematics from the University of Milano-Bicocca, Italy, in 2003 and 2006, respectively. He received the PhD degree in computer science from the Department of Informatics, Systems and Communication of the University of Milano-Bicocca, Italy, in 2010, where he is currently a postdoctoral researcher. His research interests include computer vision, machine learning, optimization algorithms, and color imaging.

Figure 7: Classification rates varying (a) the JPEG quality index, (b) the variance of zero-mean Gaussian noise, (c) the filter size of Gaussian blur, (d) the pixel length of motion blur.

Luigi Celona received the BSc and MSc degrees in Computer Science from the University of Messina, Italy, in 2011, and from the University of Milano-Bicocca, Italy, in 2014, respectively. He is currently a PhD student at the Department of Informatics, Systems and Communication of the University of Milano-Bicocca, Italy. His current research interests focus on image analysis and classification, machine learning and face analysis.

Raimondo Schettini is Full Professor at the University of Milano-Bicocca, Italy. He is Vice-Director of the Department of Informatics, Systems and Communication, and head of the Imaging and Vision Lab (www.ivl.disco.unimib.it). He has been associated with the Italian National Research Council (CNR) since 1987, where he led the Color Imaging Lab from 1990 to 2002. He has been team leader in several research projects and has published more than 300 refereed papers and several patents about color reproduction, image processing, analysis and classification. He is a Fellow of the International Association for Pattern Recognition (IAPR) for his contributions to pattern recognition research and color image analysis.