
Race Recognition Using Deep Convolutional Neural Networks

Thanh Vo 1, Trang Nguyen 2 and C. T. Le 3,*

1 Advanced Program in Computer Science, University of Science, VNU HCMC, Ho Chi Minh 700000, Vietnam; [email protected]
2 Faculty of Information Technology, Ho Chi Minh City Open University, Ho Chi Minh 700000, Vietnam; [email protected]
3 Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh 700000, Vietnam
* Correspondence: [email protected]

Received: 7 July 2018; Accepted: 30 October 2018; Published: 1 November 2018

Abstract: Race recognition (RR), which has many applications such as in surveillance systems and image/video understanding and analysis, is a difficult problem to solve completely. To contribute towards solving that problem, this article investigates using a deep learning model. An efficient Race Recognition Framework (RRF) is proposed that includes an information collector (IC), face detection and preprocessing (FD&P), and RR modules. For the RR module, this study proposes two independent models. The first model is RR using a deep convolutional neural network (CNN) (the RR-CNN model). The second model (the RR-VGG model) is a fine-tuning model for RR based on VGG, the well-known trained model for object recognition. In order to examine the performance of our proposed framework, we perform an experiment on our dataset named VNFaces, composed specifically of images collected from Facebook pages of Vietnamese people, to compare the accuracy of RR-CNN and RR-VGG. The experimental results show that for the VNFaces dataset, the RR-VGG model with augmented input images yields the best accuracy at 88.87%, while RR-CNN, an independent and lightweight model, yields 88.64% accuracy. The extension experiments conducted prove that our proposed models can also be applied to other race datasets, such as Japanese, Chinese, or Brazilian faces, with over 90% accuracy; the fine-tuning RR-VGG model achieved the best accuracy and is recommended for most scenarios.

Keywords: race recognition; deep convolutional neural networks; social networks; surveillance system

1. Introduction

Nowadays, surveillance systems contribute vitally to public security. The development of artificial intelligence, especially artificial intelligence for computer vision [1], has made it easier to analyze the resulting videos [2,3]. Several studies have recently addressed the problem of event detection in video surveillance [4], which requires the ability to identify and localize specified spatiotemporal patterns. Another problem in surveillance video analysis, which is attracting much research interest, is person re-identification [5]. Person re-identification describes the task of identifying a person across several images that have been taken using multiple cameras or a single camera. Re-identification is a vital function for surveillance systems as well as human–computer interaction systems, since it facilitates searching for an identity in large amounts of video and images.

Likewise, this study addresses the problem of race recognition (RR) in surveillance videos. In several situations, identifying the race of a person may be helpful for surveillance systems to identify the appropriate emergency supporter. Moreover, another application of RR is video classification/clustering.

The race of the people in a video can be an important factor for video classification/clustering. To address this challenge, significant efforts have been made toward RR and categorization. Fu et al. [6] wrote a comprehensive review on learning race from the face covering many state-of-the-art methods. From their analysis, the problem can be addressed using two main approaches: single-model RR, which tries to extract both appearance features and local discriminative regions, and multi-model RR, which combines features from both 2D and 3D information, or the fusion of face and gait, etc.

In recent years, social networks have become popular, with billions of users around the world and millions of pieces of information shared daily. Many applications of social networks have been analyzed recently. In 2016, Farnadi et al. [7] gave a detailed analysis of various state-of-the-art methods for personality recognition on several datasets from Facebook, Twitter, and YouTube. In the same year, Nguyen et al. [8] used a Deep Neural Network (DNN) to meet two types of information needs of response organizations: (i) informative tweet identification and (ii) topical class classification. They also provided a learning algorithm using stochastic gradient descent to train DNNs online during disaster situations. Recently, Carvalhoa et al. [9] presented a smart platform to efficiently collect, manage, mine, and visualize large Twitter corpora.

Deep learning [10], which tries to learn good representations from raw data automatically with multiple layers stacked on each other, has attracted significant attention from researchers due to its various applications in computer vision [11–15], natural language processing [16–18], and speech processing [19]. Convolutional neural networks (CNNs), a type of deep learning model, have recently achieved many promising results in large-scale image and video recognition [20,21]. The VGG model [20], which was first introduced by Simonyan and Zisserman, achieved very good performance on ImageNet [22] and has been widely used in computer vision studies.

In recent years, many researchers have switched from RR of major race groups such as African Americans, Caucasians, and Asians to that of sub-ethnic groups such as Koreans, Japanese, and Chinese [23–25]. Inspired by this idea, this paper presents a system that has the ability to detect race using deep learning in realistic environments using social network information, for the Vietnamese sub-ethnic group. The main contributions of this work are as follows. (1) We introduce a race dataset of Vietnamese people collected from a social network and published for academic use. (2) We propose an efficient framework including three modules for information collection (IC), face detection and preprocessing (FD&P), and RR. (3) For the RR module, we propose two independent models: an RR model using a CNN (RR-CNN) and a fine-tuning model based on VGG (RR-VGG). Experimental results show that our proposed framework achieves promising results for RR on various race datasets including the Vietnamese sub-ethnic group and others such as Japanese, Chinese, and Brazilian. More specifically, the proposed framework with RR-VGG achieves the best accuracy in most of the scenarios.

The remainder of the paper is organized as follows: Section 2 presents a review of related works including convolutional neural networks and the VGG model. Section 3 proposes the RR framework.
The experimental results are shown in Section 4, and Section 5 gives the conclusions of the paper.

2. Related Work

2.1. Deep CNNs in Computer Vision

In recent years, deep CNNs have been used extensively in computer vision due to their promising performance. For the problem of image classification, Zhang et al. [15] proposed a novel feature learning method for halftone image classification with very good performance. This method uses stacked sparse auto-encoders (SAE) to extract features of halftone images using unsupervised learning, and then uses SoftMax regression to fine-tune the deep neural network using supervised learning to classify halftone images. Wei et al. [26] proposed a flexible deep CNN model, called Hypotheses-CNN-Pooling, for multilabel image classification.
The input of this framework is an arbitrary number of object segment hypotheses. A shared CNN is then connected to each hypothesis, and finally the results from the different hypotheses are aggregated with max pooling to generate the final multilabel predictions.

In face-related applications, there are several problems such as face detection [27], face alignment [12], and facial expression analysis [11]. In 2015, Li et al. [27] proposed a cascade model built on CNNs with very powerful discriminative capability, while maintaining high performance, to deal with changes in visual properties, such as those due to pose, expression, and lighting, in real-world face detection. In 2017, Park et al. [12] designed deep neural networks for face alignment using facial landmark features and recurrent regression. In addition, because a smile is one of the most common facial expressions in daily life, Chen et al. [11] proposed an intelligent method for smile detection using CNNs. Facial expression analysis can also be applied to other problems such as medical assessment, lie detection, human–computer interfaces, and robotics.

In video analysis, the problem of person re-identification—identifying people across images that have been taken using multiple cameras, or over time using a single camera—is an important one. In 2015, Ahmed et al. [5] proposed a method for learning features and a corresponding similarity metric for person re-identification in parallel. The authors also presented a deep convolutional model with layers specially designed to address the re-identification problem, and they achieved good results. Another problem in video analysis is visual target tracking, which has a wide range of applications such as vehicle navigation, augmented reality, and video surveillance. Pang et al. [13] proposed a CNN-based approach to visual target tracking with very good performance that is useful for real-time tracking. Human activity recognition is another problem in video analysis which has attracted a lot of attention in recent years. Ronao and Cho [14] proposed a CNN to perform efficient and effective human activity recognition using smartphone sensors by exploiting the inherent characteristics of activities and 1D time-series signals. This method achieves very good performance on several experimental datasets.

2.2. Transfer Learning

In practice, due to insufficiently sized datasets, modern deep CNNs are rarely trained entirely from scratch. Instead, a network pretrained on a very large dataset (e.g., ImageNet), such as VGG, is usually used as an initialization or a fixed feature extractor for new tasks. The three major transfer learning scenarios are CNNs as fixed feature extractors, fine-tuning CNNs, and pretrained models. Among these, fine-tuning CNNs has been widely used with various models such as VGG [28–30] and GoogLeNet [31,32].

VGG is a convolutional neural network model proposed by Simonyan and Zisserman [20] which achieves 92.7% accuracy on ImageNet [22], a dataset of over 14 million images belonging to 1000 classes. The trained VGG model has two forms—VGG-16 and VGG-19—the structure and parameters of which are freely available online. In this study, VGG-16 will be used in the RR-VGG model presented in Section 3.3. Its macroarchitecture is shown in Figure 1. Many works have leveraged the structure of VGG to perform transfer learning on different problems. Paul et al. [33] applied the pretrained VGG model to the lung adenocarcinoma detection problem to extract useful features and classify images under various classifiers. In another work, by Long et al. [34], several pretrained CNN models (VGG, AlexNet, and ResNet) were adopted into a fully convolutional network and fine-tuned for segmentation tasks. Hoo-Chang et al. [35] extensively studied the application of pretrained VGG models to computer-aided detection problems and achieved promising results. These results make transferring a pretrained network such as VGG widely applicable.

VGG-16 consists of 13 convolutional layers and three fully connected layers. In this model, larger filters (e.g., 5 × 5) are built from multiple smaller filters (e.g., 3 × 3) (Figure 2). Therefore, all convolutional layers have the same filter size of 3 × 3. In total, VGG-16 requires 138 M weights and 15.5 G multiply-and-accumulates to process one 224 × 224 input image [36]. The VGG model has been used in many studies [28–30].

Figure 1. Macroarchitecture of VGG-16.

Figure 2. Decomposing larger filters into the smaller filters used in VGG-16 [36].
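As a concrete illustration of the first two scenarios (fixed feature extraction versus fine-tuning), the snippet below loads the pretrained VGG-16 convolutional base in Keras. It is a generic example, not code from the paper; the framework choice and the input size are assumptions.

```python
# Generic Keras illustration of reusing a pretrained VGG-16; not code from the paper.
from tensorflow.keras.applications import VGG16

# Load the 13 convolutional layers with ImageNet weights and drop the three
# fully connected layers of the original 1000-class head.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Fixed feature extractor: freeze the pretrained layers and train only a new head.
for layer in base.layers:
    layer.trainable = False

# Fine-tuning instead keeps some or all of these layers trainable so that their
# weights continue to adapt to the new task (the approach later taken by RR-VGG).
```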

3. Materials and Methods

3.1. System Architecture

RR is a difficult problem and is almost impossible to solve completely with computer vision alone. However, with the integration of social network information and computer vision, this problem can be partially solved. First, we introduce the Vietnamese Faces (VNFaces) dataset for academic research, which contains only two classes—"Vietnamese" and "other"—for classification. Secondly, we propose a deep learning framework for the problem of RR. Although the model initially uses only the VNFaces dataset, later experiments show that it can be applied to other race datasets.

The proposed framework shown in Figure 3 consists of three main modules as follows. The first module, named the IC module, collects Vietnamese user information, including profile pictures and race, from the social network. Then, the information from the IC module is processed by the FD&P module. Firstly, this module detects faces in users' profile pictures. If faces are detected, the module crops the face frames, resizes them to 64 × 64, and automatically labels these faces as "Vietnamese" or "other" based on the user information. The resulting dataset of labeled face images is called VNFaces. The third module is RR, which consists of two independent models. A race detection model using a CNN (RR-CNN) is proposed in Section 3.2. In addition, Section 3.3 presents a fine-tuning approach for race detection based on the VGG-16 model [20], named RR-VGG. A comparison between RR-CNN and RR-VGG is presented in Section 4 using several race datasets from different sub-ethnic groups.

After RR is trained on the VNFaces dataset, any image from different environments, e.g., other social networks or surveillance videos, can be put into the FD&P module to extract and preprocess the face. Next, the RR module can identify whether or not the person that appears in this image is Vietnamese.

Figure 3. System architecture of the Race Recognition Framework (RRF).

3.2. Race Recognition Using CNN

The architecture of the RR-CNN model is shown in Figure 4. The input for this model is grayscale 64 × 64 images, and the output is a classification into one of two classes, indicating Vietnamese or Other. There are four convolutional layers (C1–C4) and three max-pooling layers (P1–P3), followed by a dropout layer and two fully connected layers between the input and the output (see Figure 4). The first convolutional layer (C1) filters the input image (64 × 64) with 32 learnable kernels of size 3 × 3 to give 32 matrices of size 62 × 62. The results of C1 are passed to the max-pooling layer P1, with 32 kernels of size 2 × 2. The results of P1 are 32 matrices of size 31 × 31; these are passed through the second convolutional layer C2, which has 32 learnable kernels of size 3 × 3, to get 32 matrices of size 29 × 29. Then, P2, with 32 kernels of size 2 × 2, is used to process the previous results to get 32 matrices of size 14 × 14. Next, C3 and C4 are contiguous, with 64 learnable kernels of size 3 × 3 each. The results of C4 are 64 matrices of size 10 × 10, which are passed to P3, with 64 kernels of size 2 × 2, to give 64 matrices of size 5 × 5. These are passed to the flatten layer to give 1600 values. Two fully connected layers, one with 1024 hidden units and the other with 512, follow. Finally, the output layer applies the two classes, "Vietnamese" and "other", of the VNFaces dataset.

Figure 4. The RR-convolutional neural network (CNN) model.
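The layer sequence above can be written down compactly in code. The paper does not name its deep learning framework, so the following Keras sketch is only an assumption-laden reconstruction: the kernel counts, kernel sizes, dense-layer widths, and the Adam learning rate follow the description in this section and Figure 4, while the ReLU activations, dropout rate, and loss variant are guesses.

```python
# Minimal Keras sketch of RR-CNN; framework, activations, and dropout rate assumed.
from tensorflow.keras import layers, models, optimizers

def build_rr_cnn(num_classes=2):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu',
                      input_shape=(64, 64, 1)),            # C1 -> 32 maps of 62 x 62
        layers.MaxPooling2D((2, 2)),                       # P1 -> 32 maps of 31 x 31
        layers.Conv2D(32, (3, 3), activation='relu'),      # C2 -> 32 maps of 29 x 29
        layers.MaxPooling2D((2, 2)),                       # P2 -> 32 maps of 14 x 14
        layers.Conv2D(64, (3, 3), activation='relu'),      # C3 -> 64 maps of 12 x 12
        layers.Conv2D(64, (3, 3), activation='relu'),      # C4 -> 64 maps of 10 x 10
        layers.MaxPooling2D((2, 2)),                       # P3 -> 64 maps of 5 x 5
        layers.Dropout(0.25),                              # dropout layer (rate assumed)
        layers.Flatten(),                                  # 64 * 5 * 5 = 1600 values
        layers.Dense(1024, activation='relu'),
        layers.Dense(512, activation='relu'),
        layers.Dense(num_classes, activation='softmax'),   # "Vietnamese" / "other"
    ])
    # Adam with the learning rate given in the text (eta = 0.001); the sparse loss
    # lets integer class labels be used directly.
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
```

Flattening the final 64 maps of size 5 × 5 yields exactly the 1600 values mentioned in the text.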

The activation function of the output layer is SoftMax, which produces a distribution over the two class labels in RR. However, to generalize the problem of RR, N class labels will be used. The SoftMax function is shown below:

a_i = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}, \quad \forall i = 1, 2, \ldots, N \qquad (1)

where z is the vector of inputs to the output layer and a_i indicates the probability of the ith class. Using this function, the total sum of the outputs is \sum_{i=1}^{N} a_i = 1. The loss value based on the cross-entropy of image x_i with parameters w is computed according to the following formula:

J(w, x_i, y_i) = -\frac{1}{C} \sum_{j=1}^{C} \left[ y_{ji} \log a_{ji} + (1 - y_{ji}) \log(1 - a_{ji}) \right] \qquad (2)

where C is the number of classes and y_{ji} and a_{ji} are the jth values in the y_i and a_i vectors. To minimize this value, the Adaptive Moment Estimation (Adam) optimizer [37], which computes adaptive learning rates for each parameter, was used. In addition to storing an exponentially decaying average of past squared gradients v_t, Adam also keeps an exponentially decaying average of past gradients m_t, like momentum. Let g_t = \nabla_w J(w) be the gradient of the objective function with respect to the parameters w at time step t. The averages are computed using the following formulae:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \qquad (3)

where m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively; m_0 and v_0 are initialized as zero vectors. The resulting biases are counteracted by computing bias-corrected first and second moment estimates:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \qquad (4)

Adam's update rule is then determined by the following formula:

w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t \qquad (5)

The authors of [37] propose default values of \beta_1 = 0.9, \beta_2 = 0.999, and \epsilon = 10^{-8}. Besides this, the RR-CNN model uses a learning rate of \eta = 0.001 to minimize the cross-entropy loss value.

3.3. RR-VGG: A Fine-Tuning Approach

The input images for this model are rescaled to 64 × 64 to help reduce the complexity of training and testing. To generate more input images as well as to increase the accuracy of this model, we applied random flip, zoom, and shear rotate features to each face. Figure 5 shows examples of images augmented by random flip (both horizontal and vertical), zoom (zoom range of 0.3), and shear rotate (shear range of 30 degrees) with batch_size = 32.

To create the RR-VGG model, we first loaded all convolutional and max-pooling layers with the existing weights of VGG-16 (see Figure 1). Next, two fully connected layers, one with 1024 hidden units and one with 512, were added. Finally, the SoftMax function was used for classification into the two classes of the output layer. Therefore, the cross-entropy loss value was determined by Equation (2). To minimize this value, a Mini-batch Gradient Descent (GD) optimizer with batch size n = 32, learning rate \eta = 0.0001, momentum \gamma = 0.9, and learning rate decay t = 0.000001 was used. A time-based learning rate schedule was used to anneal the learning rate over time; therefore, the learning rate for the kth epoch was determined by

\eta_{t+1} = \frac{\eta_t}{1 + k \times t} \qquad (6)

Then, the Mini-batch GD's update rule was calculated by the following formula:

w_{t+1} = w_t - \eta \nabla_w J(w, x_{i:i+n}; y_{i:i+n}) \qquad (7)
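To make the update rules concrete, the following NumPy sketch spells out Equations (1) and (3)–(7); the variable names mirror the symbols in the text, and the code is an illustration rather than the implementation used in the paper.

```python
import numpy as np

def softmax(z):
    """Equation (1): map layer inputs z to class probabilities a."""
    e = np.exp(z - np.max(z))            # shift for numerical stability
    return e / e.sum()

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Equations (3)-(5)) for parameters w with gradient grad."""
    m = beta1 * m + (1 - beta1) * grad             # first-moment estimate, Eq. (3)
    v = beta2 * v + (1 - beta2) * grad ** 2        # second-moment estimate, Eq. (3)
    m_hat = m / (1 - beta1 ** t)                   # bias correction, Eq. (4)
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v   # update, Eq. (5)

def decayed_lr(eta, k, decay=1e-6):
    """Time-based schedule (Equation (6)): anneal eta after the kth epoch."""
    return eta / (1.0 + k * decay)

def sgd_step(w, grad, eta):
    """One mini-batch GD update (Equation (7), momentum omitted as in the text);
    grad is the gradient averaged over the current mini-batch."""
    return w - eta * grad
```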

Figure 5. Augmented images using random flip, zoom, and shear rotate features.
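Putting Section 3.3 together, a Keras sketch of RR-VGG might look as follows. The augmentation ranges, dense-layer sizes, and optimizer settings follow the text; the framework, preprocessing details, and the per-epoch decay callback are assumptions.

```python
# Illustrative Keras sketch of RR-VGG (Section 3.3); not the paper's actual code.
from tensorflow.keras import layers, models, optimizers, callbacks
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Convolutional and max-pooling layers initialized with the VGG-16 weights.
base = VGG16(weights='imagenet', include_top=False, input_shape=(64, 64, 3))

# Two new fully connected layers (1024 and 512 units) and a SoftMax output.
x = layers.Flatten()(base.output)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dense(512, activation='relu')(x)
out = layers.Dense(2, activation='softmax')(x)
rr_vgg = models.Model(base.input, out)     # all layers left trainable (fine-tuning)

# Mini-batch GD with momentum; Equation (6) is approximated per epoch via a callback.
sgd = optimizers.SGD(learning_rate=0.0001, momentum=0.9)
lr_schedule = callbacks.LearningRateScheduler(lambda epoch, lr: lr / (1.0 + epoch * 1e-6))
rr_vgg.compile(optimizer=sgd, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# RR-VGG2 augmentation: random flips, zoom range 0.3, shear of 30 degrees.
augmenter = ImageDataGenerator(horizontal_flip=True, vertical_flip=True,
                               zoom_range=0.3, shear_range=30)
# x_train / y_train are placeholders for the prepared VNFaces arrays.
# rr_vgg.fit(augmenter.flow(x_train, y_train, batch_size=32),
#            epochs=20, callbacks=[lr_schedule])
```

With a 64 × 64 input this graph has 17,338,690 trainable parameters, which matches Table 2 and suggests that the entire convolutional base is fine-tuned rather than frozen.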

4. Results and Discussion

4.1. VNFaces Dataset and Cross-Validation Setting

To conduct the experiment, the IC module was used to collect user information on Facebook, including profile pictures and race, from users of different ages and races. These profile pictures have varying poses, accessories, illumination, and imaging conditions. The second module, FD&P, was used to automatically detect (using the Haar Cascade classifier [38]), crop, and label 6100 faces, including 2892 Vietnamese and 3208 Other, forming a collection called the VNFaces dataset (available at https://goo.gl/M8eww3). Figure 6 shows samples of Vietnamese (left) and Other (right) faces.

Figure 6. Examples of Vietnamese (left) and Other (right) faces.
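The FD&P step (Haar cascade detection, cropping, and resizing to 64 × 64) can be reproduced with OpenCV roughly as follows; the stock frontal-face cascade and the detection parameters are assumptions, since the paper only cites the classifier [38].

```python
import cv2

# Stock OpenCV frontal-face Haar cascade; the exact cascade and parameters used
# in the paper are not stated, so these values are illustrative.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def extract_faces(image_path, size=(64, 64)):
    """Detect faces in a profile picture and return 64x64 grayscale crops."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [cv2.resize(gray[y:y + h, x:x + w], size) for (x, y, w, h) in boxes]
```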

To avoid the overfitting problem and to make the classifier generalizable to independent datasets, we employed 10-fold stratified cross-validation for the VNFaces dataset. We used nine folds for training and the last fold for testing. In the training procedure, the weight parameters of the CNN model were optimized automatically using the training set only. The testing set was hidden from the CNN model and only used after training was complete. Figure 7 shows the numbers of Vietnamese and Other face images in each fold utilized in this experiment.

Figure 7. Numbers of Vietnamese and Other face images in each fold.
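The 10-fold protocol can be expressed compactly with scikit-learn. This sketch is illustrative only: the paper does not say which tooling produced the folds, build_rr_cnn refers to the hypothetical constructor sketched in Section 3.2, and the integer label encoding (0 = Other, 1 = Vietnamese) is an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(images, labels, n_splits=10, epochs=20):
    """Stratified k-fold evaluation; the test fold is only seen after training."""
    accuracies = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(images, labels):
        model = build_rr_cnn(num_classes=2)              # hypothetical, see Section 3.2
        model.fit(images[train_idx], labels[train_idx],
                  epochs=epochs, batch_size=32, verbose=0)   # nine training folds
        _, acc = model.evaluate(images[test_idx], labels[test_idx], verbose=0)
        accuracies.append(acc)                               # held-out fold
    return float(np.mean(accuracies)), float(np.std(accuracies))
```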

4.2. Race Recognition for the VNFaces Dataset

For the RR-VGG model, the input images were augmented by random flip, zoom, and shear rotate. To show the effectiveness of this step, we performed RR-VGG in two cases, denoted RR-VGG1 and RR-VGG2. RR-VGG2 used the random flip, zoom, and shear rotate features with batch_size = 32 for each image in the training and validation sets, while RR-VGG1 used the original images for the training and validation sets.

In this comparison of accuracy, the RR-CNN, RR-VGG1, and RR-VGG2 models were used independently for predicting race with 10-fold cross-validation. For better analysis, we also compared with the original VGG (which we call RR-VGG0), where we used a well-trained VGG model without fine-tuning to predict racial identity. The experimental results in terms of accuracy are shown in Table 1. Firstly, it is clear that the original RR-VGG0 performs worse than the others, with only 72.37% accuracy, since it is not trained for face recognition. Secondly, RR-VGG1 lacks stability and has the worst results among the remaining models at 78.70%, with an oscillation of 18.46%. In detail, at Fold 2, Fold 5, and Fold 10, RR-VGG1 has poor results of approximately 50% accuracy, while at the other folds it has good accuracy values. RR-CNN and RR-VGG2 achieve stable accuracy across the 10 folds, with oscillations of 1.76% and 2.22%, respectively. The average accuracy of RR-VGG2 (88.87%) is slightly better than that of RR-CNN (88.64%).

Table 1. Accuracy of different models for the VNFaces dataset with 10-fold validation.

Folds              RR-CNN (%)     RR-VGG0 (%)    RR-VGG1 (%)     RR-VGG2 (%)
Fold 1             87.54          82.15          92.13           90.49
Fold 2             89.34          55.42          51.15           88.03
Fold 3             88.52          75.21          90.66           91.48
Fold 4             88.85          72.46          91.31           86.89
Fold 5             84.92          40.64          46.56           91.48
Fold 6             88.03          73.24          90.98           88.85
Fold 7             87.87          80.15          88.85           90.66
Fold 8             90.98          79.56          89.84           85.08
Fold 9             88.69          84.46          91.31           90.00
Fold 10            91.64          80.46          54.26           85.74
Average accuracy   88.64 ± 1.76   72.37 ± 3.24   78.70 ± 18.46   88.87 ± 2.22
In this experiment, we studied the effect of the number of epochs on the RR-CNN model. We built RR-CNN models with 10, 20, 30, 40, 50, and 60 epochs; the results are shown in Figure 8. The training and validation accuracies of this model remain stable, between 80% and 90%, as the number of epochs increases from 10 to 60. In addition, the training losses of this model are also stable. However, the validation losses drift upward as the number of epochs increases from 10 to 60. This means that overfitting occurs with a large number of epochs. Therefore, we suggest that only 10 to 20 epochs should be used to train RR-CNN.

Figure 8. Performance of RR-CNN with 10 (A), 20 (B), 30 (C), 40 (D), 50 (E), and 60 (F) epochs.

Next, we compared the computation time for each fold, including training and testing time, among RR-CNN, RR-VGG1, and RR-VGG2. The results (Figure 9) show that RR-CNN requires 91.3 s, while RR-VGG1 and RR-VGG2 take much longer at 1623.1 s and 1648.9 s, respectively. This is expected because RR-VGG (including RR-VGG1 and RR-VGG2) has more layers than RR-CNN; hence, RR-VGG's structure is much more complex than that of RR-CNN. In short, RR-CNN is more efficient than RR-VGG (including RR-VGG1 and RR-VGG2) in terms of computation time.

Figure 9. Computation time in (a) the training process and (b) the testing process of the experimental models for the VNFaces dataset.

Table 2 compares the total number of trainable parameters of the three models RR-CNN, RR-VGG1, and RR-VGG2. As we can see in the table, RR-CNN has far fewer parameters than the other two, which makes it more lightweight and efficient to train.

Table 2. Number of trainable parameters of the RR-CNN, RR-VGG1, and RR-VGG2 models.

Model      Number of Trainable Parameters
RR-CNN     2,230,242
RR-VGG1    17,338,690
RR-VGG2    17,338,690

4.3. Race Recognition: Extension Experiments

In the extension experiments, we aimed to show that our proposed models not only apply to the VNFaces dataset but can also be applied to other race datasets. We conducted two experiments as follows.

In the first experiment, we used the proposed models to perform classification on Japanese, Chinese, and Brazilian datasets; the Japanese Female Facial Expression (JAFFE) [39], Chinese University of Hong Kong Face Sketch (CUFS) [40], and FEI (the Brazilian face database captured at the Artificial Intelligence Laboratory of FEI) [41] datasets, respectively, were used as the image sources. The JAFFE dataset includes 213 images of seven facial expressions of 10 Japanese female models, as exemplified in Figure 10. The CUFS dataset has 188 images of 188 Hong Kong students, and FEI contains 14 images for each of 200 individuals, 2800 images in total; examples from these datasets are shown in Figures 11 and 12, respectively. We conducted experiments classifying each race out of the combined dataset produced by adding the images of 3208 other people (including Asian, African, and Caucasian people) to each dataset, as described in Section 4.1. We used 80% of the people in each race for training and the remaining 20% for testing, as described in Table 3. These experimental settings ensure that the splitting process is a subject-independent partition, where the people in the training and testing sets are completely different. The results are reported in Table 4. They show that the proposed models work well on all of the other race datasets tested. This proves that our defined CNN model and fine-tuning VGG models can be used in RR problems, with fine-tuning VGG achieving the best accuracy.

Figure 10. Samples from the JAFFE dataset.

Figure 11. Samples from the Chinese University of Hong Kong Face Sketch (CUFS) dataset.

Figure 12. Samples from the FEI dataset.

Table 3. Number of images in the training and testing datasets of the three combined datasets.

Dataset          Train                          Test
JAFFE + Others   8 Japanese + 2566 others       2 Japanese + 642 others
CUFS + Others    150 Chinese + 2566 others      38 Chinese + 642 others
FEI + Others     160 Brazilian + 2566 others    40 Brazilian + 642 others

Table 4. Accuracy of the three proposed methods used on the different race datasets.

Dataset          RR-CNN (%)   RR-VGG1 (%)   RR-VGG2 (%)
JAFFE + Others   95.81        99.42         98.72
CUFS + Others    99.94        100.0         100.0
FEI + Others     90.73        95.76         97.25

In the second experiment, we tested the performance of classifying Asian people and Others when we mixed our VNFaces dataset with two other datasets: the JAFFE dataset and the CUFS dataset. The combined dataset has a total of 3090 Asian people and 3208 Other people. It was split into training and testing sets with an 80% proportion for training and a 20% proportion for testing, as shown in Figure 13.

Figure 13. Number of Asian people (including Vietnamese, Japanese, and Chinese) and Others in the training and testing datasets.

The results of the evaluation on the testing set are provided in Table 5. As we can see, all three proposed models performed well, with classification accuracy over 75%. In detail, RR-CNN achieved 76.51%, while RR-VGG1 and RR-VGG2 achieved 86.55% and 87.24%, respectively. This implies that the fine-tuning VGG model, with its deeper convolutional layers, can be used when dealing with complicated datasets containing various races.

Table 5. Accuracy of Asian race classification by the three proposed models on a combined dataset of Vietnamese, Japanese, and Chinese with Others.

Model      Accuracy (%)
RR-CNN     76.51
RR-VGG1    86.55
RR-VGG2    87.24

Last, as considering races of of similar appearances, so as race raceclassification classificationcan canbe bechallenging challengingwhen when considering races similar appearances, that in the third experiment, we we conducted the the classifying people in Asian ethnicity. In details, we so that in the third experiment, conducted classifying people in Asian ethnicity. In details, combined the dataset of Vietnamese, Japanese, and and Chinese together named VCJ dataset with with 2892 we combined the dataset of Vietnamese, Japanese, Chinese together named VCJ dataset Vietnamese, 10 Japanese, and 188and Chinese. We performed experiment where we classified races 2892 Vietnamese, 10 Japanese, 188 Chinese. We performed experiment where wethree classified out the dataset. Thedataset. output layer in RR-CNN, RR-VGG1, RR-VGG2 models has been transformed three races out the The output layer in RR-CNN,and RR-VGG1, and RR-VGG2 models has been to deal with to three Vietnamese, Japanese, and Chinese The experiment was set transformed dealclasses: with three classes: Vietnamese, Japanese, andrespectively. Chinese respectively. The experiment up identically to the previous with 80%with of people forpeople training 20% ofand last20% for testing, in was set up identically to the previous 80% of forand training of last as forgiven testing, Table 6 and the results areresults givenare in given Table in 7. Table Since 7. the number of Vietnamese is much largerlarger than as given in Table 6 and the Since the number of Vietnamese is much Japanese and Chinese which which makes makes it become unbalance problem, the default accuracy than Japanese and Chinese it become unbalance problem, themetrics defaultusing metrics using is less meaningful. In order In to order overcome such problem, the receiver operating characteristic (ROC) accuracy is less meaningful. to overcome such problem, the receiver operating characteristic curve [42] is [42] widely used. However, for multiclass algorithms, we need useto a use multiclass AUC (ROC) curve is widely used. However, for multiclass algorithms, weto need a multiclass method. In thisInapproach, a separate AUCAUC for each classclass is calculated, suchsuch thatthat the the AUC of class AUC method. this approach, a separate for each is calculated, AUC of class is by by considering all all thethe samples of of C asi as positives and thethe samples of of all all other classes as Ci calculated is calculated considering samples positives and samples other classes negatives. TheThe average AUS is calculated as the mean of AUC of three classes. As we see in as negatives. average AUS is calculated as the mean of AUC of three classes. As can we can seethe in table, all three models havehave achieved overover 80%80% in AUC. The The best best average AUCAUC is 99.08% using RRthe table, all three models achieved in AUC. average is 99.08% using VGG1 while RR-VGG2 achieves 94.48% andand RR-CNN accuracy is 90.19%. RR-VGG1 while RR-VGG2 achieves 94.48% RR-CNN accuracy is 90.19%. Table 6. Number of images in the training and testing datasets of the VCJ dataset.

Dataset   Number of Images
Train     2313 Vietnamese + 8 Japanese + 150 Chinese
Test      579 Vietnamese + 2 Japanese + 38 Chinese
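The per-class, one-vs-rest AUC described above can be computed as in the following minimal sketch, assuming scikit-learn and NumPy and a classifier that outputs one probability per class; the function and variable names are illustrative and not the authors' code.

    # Minimal sketch of the one-vs-rest multiclass AUC: each class in turn is treated
    # as positive and all other classes as negative, and the per-class AUCs are averaged.
    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.preprocessing import label_binarize

    def per_class_and_average_auc(y_true, y_score, classes):
        # y_true: (n,) integer labels; y_score: (n, k) predicted class probabilities.
        y_bin = label_binarize(y_true, classes=classes)  # one indicator column per class
        aucs = {}
        for idx, cls in enumerate(classes):
            # Samples of class `cls` are positives; samples of all other classes negatives.
            aucs[cls] = roc_auc_score(y_bin[:, idx], y_score[:, idx])
        aucs["average"] = float(np.mean(list(aucs.values())))
        return aucs

    # Example with the three VCJ classes (0 = Vietnamese, 1 = Japanese, 2 = Chinese):
    # aucs = per_class_and_average_auc(y_test, model.predict(x_test), classes=[0, 1, 2])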

Table 7. AUCs (%) of classifying Vietnamese, Japanese, and Chinese from the VCJ dataset.

Class        RR-CNN    RR-VGG1    RR-VGG2
Vietnamese   89.24     99.13      94.41
Japanese     100.00    100.00     100.00
Chinese      81.35     98.13      89.04
Average      90.19     99.08      94.48

5. Conclusions

This study proposed an efficient RR Framework consisting of three modules: an information collector, face detection and preprocessing, and RR. For the RR module, this study proposed two independent models: RR-CNN and RR-VGG. To evaluate the proposed framework, we conducted experiments on our Vietnamese dataset collected from Facebook, named VNFaces, to compare the accuracy between the RR-CNN and RR-VGG models. The experimental results show that for the VNFaces dataset, the RR-VGG model with augmented input images yields the best accuracy at 88.87%, while RR-CNN achieves 88.64%. In the other examined cases, our proposed models also achieved high accuracy in the classification of other race datasets such as Japanese, Chinese, and Brazilian. Even in the case of classifying people with similar appearances, our models performed well, with overall results over 80%. In most scenarios, the fine-tuning RR-VGG model achieved the best accuracy due to its number of deep layers; this suggests that it could be successfully applied to various RR problems.

In future work, several related issues will be studied. Firstly, a project that collects a race face dataset based on social network data for RR will be opened for contributions from around the world. Secondly, several preprocessing techniques and deep models to improve the accuracy of classification will be studied.

Author Contributions: T.V. proposed the topic and implemented the framework. T.V. wrote the paper. C.T.L. and T.N. improved the paper.

Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Baidyk, T.; Kussul, E.M.; Monterrosas, Z.C.; Gallardo, A.J.I.; Serrato, K.L.R.; Conde, C.; Serrano, A.; Diego, I.M.; Cabello, E. Face recognition using a permutation coding neural classifier. Neural Comput. Appl. 2016, 27, 973–987. [CrossRef]
2. Kardas, K.; Cicekli, N.K. SVAS: Surveillance Video Analysis System. Expert Syst. Appl. 2017, 89, 343–361. [CrossRef]
3. Zhang, Q.; Chen, X.; Zhan, Q.; Yang, T.; Xia, S. Respiration-based emotion recognition with deep learning. Comput. Ind. 2017, 92–93, 84–90. [CrossRef]
4. Cosar, S.; Donatiello, G.; Bogorny, V.; Gárate, C.; Alvares, L.O.; Brémond, F. Toward abnormal trajectory and event detection in video surveillance. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 683–695. [CrossRef]
5. Ahmed, E.; Jones, M.J.; Marks, T.K. An improved deep learning architecture for person re-identification. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3908–3916.
6. Fu, S.; He, H.; Hou, Z.-G. Learning Race from face: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2483–2509. [CrossRef] [PubMed]
7. Farnadi, G.; Sitaraman, G.; Sushmita, S.; Celli, F.; Kosinski, M.; Stillwell, D.; Davalos, S.; Moens, M.F.; Cock, M.D. Computational personality recognition in social media. User Model. User-Adapt. Interact. 2016, 26, 109–142. [CrossRef]
8. Nguyen, D.T.; Joty, S.R.; Imran, M.; Sajjad, H.; Mitra, P. Applications of online deep learning for crisis response using social media information. arXiv 2016. Available online: https://arxiv.org/abs/1610.01030 (accessed on 1 July 2018).
9. Carvalho, J.P.; Rosa, H.; Brogueira, G.; Batista, F. MISNIS: An intelligent platform for twitter topic mining. Expert Syst. Appl. 2017, 89, 374–388. [CrossRef]
10. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef] [PubMed]
11. Chen, J.; Ou, Q.; Chi, Z.; Fu, H. Smile detection in the wild with deep convolutional neural networks. Mach. Vis. Appl. 2017, 28, 173–183. [CrossRef]
12. Park, B.H.; Oh, S.Y.; Kim, I.J. Face alignment using a deep neural network with local feature learning and recurrent regression. Expert Syst. Appl. 2017, 89, 66–80. [CrossRef]
13. Pang, S.; Coz, J.J.; Yu, Z.; Luaces, O.; Díez, J. Deep learning to frame objects for visual target tracking. Eng. Appl. Artif. Intell. 2017, 65, 406–420. [CrossRef]
14. Ronao, C.A.; Cho, S.B. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 2016, 59, 235–244. [CrossRef]
15. Zhang, Y.; Zhang, E.; Chen, W. Deep neural network for halftone image classification based on sparse auto-encoder. Eng. Appl. Artif. Intell. 2016, 50, 245–255. [CrossRef]
16. Majumder, N.; Poria, S.; Gelbukh, A.F.; Cambria, E. Deep learning-based document modeling for personality detection from text. IEEE Intell. Syst. 2017, 32, 74–79. [CrossRef]
17. Poria, S.; Cambria, E.; Gelbukh, A.F. Aspect extraction for opinion mining with a deep convolutional neural network. Knowl. Based Syst. 2016, 108, 42–49. [CrossRef]
18. Yu, X.; Yu, H.; Tian, X.Y.; Yu, G.; Li, X.M.; Zhang, X.; Wang, J. Recognition of college students from Weibo with deep neural networks. Int. J. Mach. Learn. Cybern. 2017, 8, 1447–1455. [CrossRef]
19. Qawaqneh, Z.; Mallouh, A.A.; Barkana, B.D. Deep neural network framework and transformed MFCCs for speaker’s age and gender classification. Knowl. Based Syst. 2017, 115, 5–14. [CrossRef]
20. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556. Available online: https://arxiv.org/abs/1409.1556 (accessed on 1 July 2018).
21. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1106–1114.
22. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
23. Roh, R.C.; Lee, S.W. Performance evaluation of face recognition algorithms on Korean face database. Int. J. Pattern Recognit. Artif. Intell. 2007, 21, 1017–1033. [CrossRef]
24. Bastanfard, A.; Nik, M.A.; Dehshibi, M.M. Iranian face database with age, pose and expression. In Proceedings of the 2007 International Conference on Machine Vision, Islamabad, Pakistan, 28–29 December 2007; pp. 50–58.
25. Gao, W.; Cao, B.; Shan, S.G.; Chen, X.L.; Zhou, D.L.; Zhang, X.H.; Zhao, D.B. The CAS-PEAL large-scale Chinese face database and baseline evaluations. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2008, 38, 149–161.
26. Wei, Y.; Xia, W.; Lin, M.; Huang, J.; Ni, B.; Dong, J.; Zhao, Y.; Yan, S. HCP: A flexible CNN framework for multi-label image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1901–1907. [CrossRef] [PubMed]
27. Li, H.; Lin, Z.; Shen, X.; Brandt, J.; Hua, G. A convolutional neural network cascade for face detection. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5325–5334.
28. He, K.; Wang, Y.; Hopcroft, J.E. A powerful generative model using random weights for the deep image representation. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–11 December 2016; pp. 631–639.
29. Yang, W.; Ouyang, W.; Li, H.; Wang, X. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3073–3082.
30. Li, X.; Zhao, L.; Wei, L.; Yang, M.H.; Wu, F.; Zhuang, Y.; Ling, H.; Wang, J. DeepSaliency: Multi-task deep neural network model for salient object detection. IEEE Trans. Image Process. 2016, 25, 3919–3930. [CrossRef] [PubMed]
31. Carreira, J.; Agrawal, P.; Fragkiadaki, K.; Malik, J. Human pose estimation with iterative error feedback. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4733–4742.
32. Karaoglu, S.; Tao, R.; Gevers, T.; Smeulders, A. Words matter: Scene text for image classification and retrieval. IEEE Trans. Multimed. 2017, 19, 1063–1076. [CrossRef]
33. Paul, R.; Hawkins, S.H.; Balagurunathan, Y.; Schabath, M.B.; Gillies, R.J.; Hall, L.O.; Goldgof, D.B. Deep feature transfer learning in combination with traditional features predicts survival among patients with lung adenocarcinoma. Tomogr. J. Imaging Res. 2016, 2, 388–395.
34. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
35. Hoo-Chang, S.; Roth, H.R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R.M. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 2016, 35, 1285–1298.
36. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. arXiv 2017. Available online: https://arxiv.org/abs/1703.09039 (accessed on 1 July 2018). [CrossRef]
37. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014. Available online: https://arxiv.org/abs/1412.6980 (accessed on 1 July 2018).
38. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, HI, USA, 8–14 December 2001.
39. Lyons, M.; Akamatsu, S.; Kamachi, M.; Gyoba, J. Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 200–205.
40. Wang, X.; Tang, X. Face photo-sketch synthesis and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 1955–1967. [CrossRef] [PubMed]
41. Thomaz, C.E.; Giraldi, G.A. A new ranking method for principal components analysis and its application to face image analysis. Image Vis. Comput. 2010, 28, 902–913. [CrossRef]
42. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).