Face Recognition on Mobile Platforms

Kornél Bertók

Attila Fazekas

Department of Computer Graphics and Image Processing, University of Debrecen, Faculty of Informatics, Debrecen, Hungary. E-mail: [email protected]

Department of Computer Graphics and Image Processing, University of Debrecen, Faculty of Informatics, Debrecen, Hungary. E-mail: [email protected]

Abstract—Every year, a new generation of smartphones is released that is more capable and more powerful than the previous top-class devices. Because of the increased performance and ergonomics of smartphones, the shift away from personal computers is continuously accelerating. Due to this effect, many developers are interested in adapting PC-based solutions to mobile platforms. In this paper, we focus on adapting face recognition algorithms to the Android mobile platform. These algorithms are already part of a Windows desktop application, and our aim is to create an architecture where the application logic shares the same source code on different platforms. That is, the article is about the cross-platform development of image processing algorithms. According to our long-term plans, we develop our face recognition algorithms under Windows, but every stable release will also be built as part of an Android application. In addition, a user interface must be developed for each platform, and we also need interfaces that expose the functionalities of the common application logic to the user interfaces. Besides the concept of the architecture, we also quantify the performance of the same algorithms on different platforms.

Keywords—face recognition; mobile platforms; shared application logic; facial feature extraction; head pose estimation

I. INTRODUCTION

We live in a highly interconnected world, where technology adoption is one of the defining factors in human progress. There has been a noticeable rise over the past years in the percentage of people who say they own and use a smartphone. Smartphone penetration as a share of the population has been growing steadily, and smartphones have become an important part of our everyday life. All aspects of their performance have dramatically improved in recent years, and smartphones are now capable of performing computationally intensive tasks. Several mobile platforms have a dominant market share; the most popular are Google's Android and Apple's iOS. In addition to the progress of smartphone penetration, there is a growing demand for adapting PC-based solutions to mobile platforms. As a result of this adaptation, mobile infocommunications can be expanded by novel information processing and content handling functions. People and their smartphones can have more and more common or mixed cognitive capabilities. In this sense, interdisciplinarity clearly serves continuous digital life, and it also plays a key role in blurring the borders between human and artificial cognitive capabilities, which can have promising results for human-computer interaction and communication [1]. To this end, smartphones can be expanded with a set of cognitive capabilities in order to interconnect the content scape with cognitive and sensory contents [2]. Smartphones can collect or understand data during communication, so the border between natural and artificial cognitive skills can fade. Observing users through their smartphones can have many benefits; in particular, the methods of information access can be extended and made self-evident by the cognitive effect that the smartphone has on the user. The human factor must always be in focus, and as a result of this process, people can interact with smartphones in more diverse and natural ways than before. The sensory information obtained from smartphones can be transferred to the user and transformed into an appropriate sensory modality in a way that the user can process easily and efficiently [3]. This kind of transformation makes smartphones more user-friendly and gives the opportunity to control systems as never before. The future goal of the proposed system is to allow users to use smartphones as controllers of a spatial memory by understanding instructions such as gestures or human activities, [4] and [5].

In this paper, we share our experiences of adapting image processing algorithms to mobile platforms in order to achieve gesture-based control in the future. The control scheme depends purely on the user's face, so here we focus on adapting our existing face recognition solutions (face detection and tracking, facial feature extraction, head pose estimation and head movement detection, see [6], [7] and [8]) to mobile platforms. We chose Android as our new target platform because it is the most widespread nowadays and gives us everything we need to build the application (e.g., developer tools, application model, etc.). The rest of this article is organized as follows. First, we give a short overview of the system architecture and the main concept of our cross-platform development in Section 2. An appropriate system architecture is introduced, where the user interface (UI) is entirely separated from the application logic (AL). The code of the AL is completely shared between the particular platforms, and only a thin layer of platform-specific code is used for transferring data from/to the UI; this platform-specific code is a JNI bridge in the case of our mobile platform. Then, in Section 3, we give a short summary of the applied image processing algorithms of the AL. Finally, in Section 4, we summarize and compare the results of the image processing algorithms running on both PCs and smartphones.

II. SYSTEM ARCHITECTURE

In this section, we give an overview of the main concepts of the architecture. We would like to make our face recognition application available on multiple device platforms (Windows and Android). A simple, straightforward approach is to create multiple versions of the application in different source trees, which means a native Windows and a native Android application must be developed independently from each other. Although this approach is simple, it requires significant development cost and time. Therefore, we decided to create a common and platform-independent AL, which is shared between the platforms and is written purely in C++. The UI, in contrast, is developed separately for each platform using the native technologies of that platform (i.e., Java on Android). Only a thin layer of platform-specific C/C++ code is used for transferring data between the AL and the UI; this platform-specific code is a JNI bridge in the case of the Android platform. The bridge also hides the differences between the programming languages of the two sides. Consequently, C++ is the key to adapting code to mobile platforms: it gives us the ability to write efficient, fast, and cross-platform code which can then be shared across the different device platforms. Figure 1 shows the system architecture. The Shared C++ Application Logic encapsulates the common and platform-independent functionalities (i.e., the image processing procedures), the C++ Bridge is responsible for the communication between the AL and the UIs, and the User Interfaces are developed in the native language of each platform. A possible shape of the shared AL interface is sketched below.
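As a minimal illustration of what the shared application logic could expose to the per-platform UIs, the following header sketch uses only standard C++ types across the boundary. The class name and fields are hypothetical, not taken from the actual code base:

```cpp
// Hypothetical public interface of the shared C++ application logic (AL).
// Only standard C++ types cross this boundary, so the same header can be
// compiled for Windows and Android.
#pragma once
#include <cstdint>
#include <vector>

struct FaceResult {
    int   x, y, width, height;   // face rectangle in pixels
    float yaw, pitch, roll;      // estimated head pose in degrees
};

class FaceRecognitionEngine {
public:
    // Process one BGRA frame; the caller (the UI or the JNI bridge) owns the buffer.
    std::vector<FaceResult> processFrame(const std::uint8_t* bgra,
                                         int width, int height);
};
```

Both the Windows UI and the JNI bridge could include such a header and link against the same compiled library.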

Hereinafter, we summarize the data flow and operation of the Android branch along the two major development tools of Android applications (see Figure 3), and we also briefly describe the bridge layer.

Fig. 1. The system architecture of cross-platform development.

A. UI Development with Android SDK
The so-called native applications of the platform are created with the SDK; native in this context means the set of functionalities offered by the platform. The SDK contains an extensive set of development tools, including libraries, debugging tools, emulators, and documentation. SDK-based applications are written in Java, and the officially supported IDE is Android Studio [9]. The SDK also hides the differences between the supported CPUs and architectures (i.e., the instruction sets of ARM or x86), so it gives the opportunity to develop one application for multiple mobile architectures. Figure 3 depicts how the UI is built using the SDK: several Java and XML files are edited, and then command line tools (of the JDK) are used to build and debug the Android application.

The objective of the Android UI is to access and manage a particular hardware camera (i.e., set image capture settings and start/stop the preview). For this purpose, the Android Camera API is used. The execution flow of the Android UI is the following: first, we get and set the camera parameters and the correct display orientation of the preview frames before starting the preview. The preview can be started once the camera and the preview surface are initialized. Camera preview frames can then be accessed as byte arrays via callback functions. The callbacks are delivered to the event loop of the thread that opened the camera. Setting the display orientation does not have any effect on the byte order of the array passed to the callbacks; therefore, the frames have to be rotated before processing them in the application logic (see the sketch below). By default, the byte array is arranged in the NV21 encoding format, so it also has to be converted to the BGRA format. The rotation and color space conversion for each preview frame can be computationally expensive on several older smartphones. The byte arrays then have to be delivered to the application logic. The communication between the UI and the AL happens within an asynchronous task, which helps avoid blocking the UI thread. This task allows us to perform background operations and publish results on the UI thread without having to manipulate threads directly. It is also important that only copies of the preview frames are delivered to the callback functions, so modifying the byte array in the callback function or in the AL (e.g., drawing primitives on it) does not have any effect on the displayed image. Therefore, debug frames that are sent back to the UI from the AL are displayed over a separate image view. Last but not least, the UI must terminate capturing and drawing frames and release the camera resources. Figure 2 shows what the UI looks like during operation.
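As a rough sketch of the rotation step (assuming an OpenCV-based application logic and a fixed 90° clockwise display orientation, both of which are assumptions not stated above), a decoded BGRA frame could be rotated like this before it reaches the AL:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Rotate a decoded BGRA preview frame so that it matches the display
// orientation before it is handed to the application logic. The fixed
// 90-degree clockwise rotation is an assumption for a portrait device;
// the real angle would come from the camera/display orientation query.
cv::Mat rotateForDisplay(const cv::Mat& bgraFrame)
{
    cv::Mat rotated;
    cv::rotate(bgraFrame, rotated, cv::ROTATE_90_CLOCKWISE);
    return rotated;
}
```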

Fig. 2. The Android User Interface during its operation.


Fig. 3. Major development tools of the Android platform and the building process of the two different strategies: application development and building purely with the SDK (in Java), and using existing C/C++ libraries and sources with the Android NDK.

B. AL Development with Android NDK
Libraries written in C/C++ can be compiled for Android using the NDK, and the native C/C++ classes can then be loaded and called from Java code running under the Dalvik Virtual Machine (VM). Here, native code refers to machine code executed directly by the CPU. Software development using the NDK means manually invoking the build of dynamic libraries from C/C++ sources (see Figure 3). For portability purposes, every CPU has its own Application Binary Interface (ABI), which defines how machine code should interact with the system at runtime. These environmental dependencies, along with the set of sources and other libraries, are defined in makefiles. The libraries can be built by command line tools and can then be loaded on the UI side from Java. The Dalvik VM loads the native library, and the native methods can then be registered when the library is loaded, as sketched below.
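The registration step mentioned above can be illustrated with the following hedged sketch of JNI_OnLoad, which the VM calls after System.loadLibrary(). The Java class name and the method table are hypothetical; the entry point registered here is defined in the Bridge subsection below:

```cpp
#include <jni.h>

// Declaration of the bridge entry point; its body (NV21 -> BGRA conversion
// and the call into the AL) is sketched in the Bridge subsection.
void nativeProcessFrame(JNIEnv* env, jobject thiz,
                        jbyteArray frame, jint width, jint height);

// Called by the VM right after System.loadLibrary(); the native methods of
// the (hypothetical) Java class NativeBridge are registered here.
extern "C" JNIEXPORT jint JNI_OnLoad(JavaVM* vm, void* /*reserved*/)
{
    JNIEnv* env = nullptr;
    if (vm->GetEnv(reinterpret_cast<void**>(&env), JNI_VERSION_1_6) != JNI_OK)
        return JNI_ERR;

    jclass clazz = env->FindClass("com/example/facerec/NativeBridge");  // hypothetical class
    if (clazz == nullptr)
        return JNI_ERR;

    static const JNINativeMethod methods[] = {
        {"processFrame", "([BII)V", reinterpret_cast<void*>(nativeProcessFrame)},
    };
    if (env->RegisterNatives(clazz, methods,
                             sizeof(methods) / sizeof(methods[0])) != JNI_OK)
        return JNI_ERR;

    return JNI_VERSION_1_6;
}
```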

The principle of the AL is "write once, target all", and the only requirement it imposes is the use of standard C++ language and library features that are supported by the compilers of the particular device platforms. Third-party software components can also be used, but in that case we may have to build them manually for each device platform. In general, we are somewhat restricted to the lowest common denominator subset of language features and third-party libraries available on all device platforms, which may limit performance.

C. Bridge
As already mentioned, the bridge has a messenger role between the AL and the UI: the two sides communicate exclusively through it. The bridge also hides the differences between the programming languages of the two sides, that is, the data types and structures of the UI (Java) and the AL (C++) are mapped to each other through the bridge. The mapping between Java and C++ primitive data types is quite straightforward [15], but sending images between the two sides is more complicated. On the Java side, the callback function receives the image as a byte buffer in NV21 format. The byte buffer can be passed as a function parameter through JNI, where a proper representation is created from it. The color space conversion (YUV to BGRA) is also the responsibility of the bridge, as sketched below. In the reverse direction (AL to UI), there is no need to convert back to YUV, because the image view on the Java side can display BGRA images without any problem.
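A minimal sketch of the bridge entry point registered in the JNI_OnLoad sketch above, assuming an OpenCV-based application logic (the paper does not name the image processing library, so the cv:: calls and the AL call are illustrative):

```cpp
#include <jni.h>
#include <opencv2/imgproc.hpp>

// The Java side passes the NV21 preview frame as a byte[]; a cv::Mat view is
// created on it, the frame is converted to BGRA and handed to the shared AL.
void nativeProcessFrame(JNIEnv* env, jobject /*thiz*/,
                        jbyteArray frame, jint width, jint height)
{
    jbyte* data = env->GetByteArrayElements(frame, nullptr);

    // NV21 layout: a full-resolution Y plane followed by an interleaved VU
    // plane, i.e. height * 3/2 rows of 'width' bytes in total.
    cv::Mat yuv(height + height / 2, width, CV_8UC1,
                reinterpret_cast<unsigned char*>(data));
    cv::Mat bgra;
    cv::cvtColor(yuv, bgra, cv::COLOR_YUV2BGRA_NV21);

    // Hypothetical call into the shared application logic:
    // engine.processFrame(bgra.data, bgra.cols, bgra.rows);

    // The Java array was only read, so release it without copying back.
    env->ReleaseByteArrayElements(frame, data, JNI_ABORT);
}
```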

III. FACE RECOGNITION

In this section, we give a short overview of the functionalities of the application logic. As already mentioned, the AL is a standalone library written purely in C++. The goal of the AL is to provide a basic face recognition toolkit based only on camera frames and some pre-trained models as inputs. Currently, the following functionalities are implemented and adapted to the Android platform: face detection, face tracking, user registration, facial feature extraction, and 6DoF head pose estimation.

A. Face Detection
Face detection is based on the well-known algorithm of Viola and Jones [11]. Their algorithm is so efficient that it is very close to being a de facto standard for face detection. This success is principally the outcome of its relative simplicity, real-time execution, and strong detection performance. The algorithm has three main principles: a new image representation (the integral image); variants of the AdaBoost learning algorithm to select features and train classifiers; and the ordering of classifiers into a cascade architecture. The algorithm is relatively simple and allows us to detect multiple faces on video streams, but considering its runtime, we encountered difficulties on both platforms. To make the application run in real time on smartphones, the face detector is not allowed to run on every frame: it runs only with a specified frequency, and the face is tracked by a simple template-based procedure between the endpoints of two face detections. Face tracking also provides a solution for those cases when the face cannot be detected (e.g., when the user turns or tilts their head too far). Our strategy is the following: detect faces every five seconds and track them continuously between two detections. A simple template-based tracking procedure is used to estimate the actual position of faces based on the previous ones. The algorithm tries to find matches between an image patch (the face template) and the actual input image. The initial face template is given by the first face detection; then, on the next frame, we estimate where the face is. The estimation is based on matching the face template over several regions of the next frame, and the output of matching is a candidate region that is similar to the image patch. Matching is based on the normalized cross-correlation between the template and the input image. The normalization is important because it decreases the effect of different lighting conditions. If there is a sufficient correlation between the face template and the input image, we refresh the template from the new scene and repeat the same matching until a new face detection arrives. A sketch of this detect-then-track strategy is given below.
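The following sketch illustrates the detect-then-track strategy, assuming an OpenCV-based implementation; the cascade file, the whole-frame search, and the 0.8 confidence threshold are assumptions, not values from the paper:

```cpp
#include <vector>
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>

struct FaceTracker {
    cv::CascadeClassifier detector{"haarcascade_frontalface_default.xml"};
    cv::Mat  faceTemplate;   // grayscale patch from the last detection/track
    cv::Rect face;

    // Full Viola-Jones detection, run only with a fixed period.
    bool detect(const cv::Mat& gray) {
        std::vector<cv::Rect> faces;
        detector.detectMultiScale(gray, faces);
        if (faces.empty()) return false;
        face = faces.front();
        faceTemplate = gray(face).clone();
        return true;
    }

    // Template tracking between two detections: normalized cross-correlation
    // of the stored template against the new frame.
    bool track(const cv::Mat& gray) {
        cv::Mat response;
        cv::matchTemplate(gray, faceTemplate, response, cv::TM_CCORR_NORMED);
        double maxVal; cv::Point maxLoc;
        cv::minMaxLoc(response, nullptr, &maxVal, nullptr, &maxLoc);
        if (maxVal < 0.8) return false;              // assumed confidence threshold
        face = cv::Rect(maxLoc, faceTemplate.size());
        faceTemplate = gray(face).clone();           // refresh template from the new scene
        return true;
    }
};
```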

B. User Registration
We also implemented a user registration strategy: a user-level tracking subsystem based entirely on the output of face detection and tracking. Metadata is also assigned to each user, in which we record, among others, the detection time, the face rectangle, the face template, and the list of head poses over the user's lifetime.
• New users can be created as the result of a face detection:
  o If a detected rectangle intersects more than 80% with the face rectangle of an existing user, we update that user's metadata with the new detection (see the sketch below);
  o otherwise, a new user is created based on the detection.
• Detected users are continuously tracked over the camera stream:
  o If the tracking is successful, the user is kept active; otherwise the user is deleted.
  o Tracking procedures are also error-prone, as they usually suffer from error accumulation; users who have not been detected for more than 10 seconds are deleted.
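A possible form of the 80% overlap test (an illustrative helper; the paper does not specify whether the ratio is taken relative to the existing user's rectangle or the new detection):

```cpp
#include <opencv2/core.hpp>

// Overlap rule used when a new detection arrives: the intersection area is
// compared against the existing user's face rectangle (assumed reference).
bool matchesExistingUser(const cv::Rect& detection, const cv::Rect& userFace)
{
    const cv::Rect inter = detection & userFace;   // rectangle intersection
    return inter.area() > 0.8 * userFace.area();   // more than 80% covered
}
```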

C. Facial Feature Extraction
An Active Shape Model (ASM) is used to extract several facial feature points of the face, [12] and [13]. The ASM can be considered a statistical model of the face, and the approach consists of two stages: a model training stage and a model fitting stage. The ASM focuses on the object's shape, which is represented by several 2-D points. Several constraints and principles are determined for the shape during the training stage. The shape cannot have an arbitrary layout because it is regularized by several conditions, called the Point Distribution Model (PDM). The PDM guarantees that shapes can only vary in ways seen in a previously annotated training database. During fitting, an initial shape is iteratively deformed over an unknown object. The fitting is the solution of an optimization problem, where a distance-based strategy is used to achieve the best fit between the model and the object in the input image. Figure 4 shows the result of facial feature extraction; the points (with green-yellow-red colors) represent the shape model.

D. Head Pose Estimation
Head pose estimation is performed by the Pose from Orthography and Scaling with ITerations (POSIT) algorithm [14]. Let P be the relative pose of a 3-D object with respect to the camera. In general, P is the combination of a rotation matrix R and a translation vector T: P = [R|T]. POSIT estimates the pose P of the object knowing only at least four non-coplanar 3-D points of the object (in the object coordinate system), their corresponding 2-D projections (on the image plane), and the focal length of the camera. The 3-D object points are given by a statistical 3-D rigid model (a wireframe of the face), and their 2-D projections are given by fitting the ASM. This also implies a one-to-one correspondence between the 2-D and 3-D points. The output of head pose estimation can be seen in Figure 4, where it is represented by the coordinate system drawn over the camera frames. A minimal sketch of this pose estimation step follows.
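The sketch below solves the same P = [R|T] estimation problem from the 2-D/3-D correspondences; the paper uses POSIT, for which cv::solvePnP is used here as a readily available stand-in, and all inputs are assumptions:

```cpp
#include <vector>
#include <opencv2/calib3d.hpp>

// Estimate the head pose P = [R|T] from the rigid 3-D face model points and
// the matching 2-D feature points returned by the ASM fitting. cv::Rodrigues
// can turn rvec into the 3x3 rotation matrix R if needed.
void estimateHeadPose(const std::vector<cv::Point3f>& modelPoints,  // rigid 3-D face model
                      const std::vector<cv::Point2f>& imagePoints,  // fitted ASM points
                      double focalLength, cv::Point2d principalPoint,
                      cv::Mat& rvec, cv::Mat& tvec)
{
    const cv::Mat cameraMatrix = (cv::Mat_<double>(3, 3) <<
        focalLength, 0.0,         principalPoint.x,
        0.0,         focalLength, principalPoint.y,
        0.0,         0.0,         1.0);

    // No lens distortion is assumed (cv::noArray() for the distortion terms).
    cv::solvePnP(modelPoints, imagePoints, cameraMatrix, cv::noArray(), rvec, tvec);
}
```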

Fig. 4. Two examples of the facial feature extraction and head pose estimation (only the rotations around X and Y axes are displayed numerically).

IV. RESULTS

The final precision of the proposed system is evaluated on the Biwi database [16]. During our experiments, we focus on measuring the yaw and pitch errors of the estimated head pose, because the head pose depends on face detection and facial feature extraction at the same time, so the overall error appears in the head pose. Although the code is cross-platform, we would also like to perform several test cases on the mobile platform in the near future, which also require manual annotation. Biwi contains Microsoft Kinect scanned faces and covers a relatively large interval of yaw and pitch. Figure 5 shows the absolute yaw and pitch errors as a function of both directions (yaw and pitch). Negative pitch values mean the upward direction, while negative yaw values mean the leftward direction. In the yaw direction, the range [−40°, +40°] is where the absolute error is below 10 degrees, and [−12°, +12°] is where the error reaches its minimum; there the error is below 5 degrees. In contrast, the pitch error function is not symmetric about zero: the range [−32°, +40°] is where the absolute error is below 10 degrees, and [−6°, +16°] is where the error is below 5 degrees. Figure 5 can be interpreted as follows: we define an N × M grid in the yaw-pitch coordinate system, where the distance between the grid points is 2°. Absolute errors were accumulated over the 2° neighborhood of each grid point, and the empty areas were interpolated (see the sketch below). For example, at 12° of pitch and −40° of yaw, the average pitch error is 7.44° and the average yaw error is 10.06°. We experienced that the pitch error begins to grow rapidly as the head is raised. Furthermore, there is also a correlation between the absolute pitch error and the yaw rotation, because the highest pitch errors were measured in the bottom-left and upper-left regions.
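A minimal sketch of the error-grid accumulation described above; the ±60° extent is an assumption chosen only to make the sketch self-contained:

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Accumulate absolute pose errors in 2-degree bins of the ground-truth
// yaw/pitch, then average per bin; empty bins would be interpolated for the
// plots.
struct ErrorGrid {
    static constexpr double kStep = 2.0;                   // bin size in degrees
    static constexpr int    kBins = 61;                    // covers roughly [-60, +60] degrees

    std::array<std::array<double, kBins>, kBins> sum{};    // accumulated |error|
    std::array<std::array<int, kBins>, kBins>    cnt{};    // samples per bin

    static int bin(double degrees) {
        const int b = static_cast<int>(std::lround(degrees / kStep)) + kBins / 2;
        return std::min(std::max(b, 0), kBins - 1);
    }
    void add(double gtYaw, double gtPitch, double absError) {
        sum[bin(gtPitch)][bin(gtYaw)] += absError;
        cnt[bin(gtPitch)][bin(gtYaw)] += 1;
    }
    double average(double gtYaw, double gtPitch) const {
        const int r = bin(gtPitch), c = bin(gtYaw);
        return cnt[r][c] ? sum[r][c] / cnt[r][c] : 0.0;
    }
};
```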

Moreover, the performance of the proposed mobile application is evaluated with regard to runtime. We also implemented a simple profiling subsystem to measure the time complexity of several procedures; a minimal sketch of such a timer is shown below. The test cases were performed both on a PC and on a smartphone (see Figures 6 and 7), where the same code with the same parameters was running on the different platforms. The test PC has an Intel(R) Core(TM)2 Quad Q9550 CPU @ 2.83 GHz with 4 GB RAM, and the smartphone is a Samsung Galaxy S3 Neo with a Qualcomm Snapdragon 400 CPU @ 1.4 GHz and 1.5 GB RAM. The profiling was run over 1000 frames at VGA resolution on both devices. We found that the total runtime is mostly influenced by the Viola-Jones face detector, which runs every ten seconds. It appears as steep jumps in the characteristics of the total runtime (see the near-vertical gray lines of the Total series in the diagrams below). The average runtime of face detection is 190 ms on the PC and 240 ms on the smartphone, which is not a significant difference between the two platforms. It is an interesting fact that the CPU utilization of the application is nearly twice as high on the PC (70%) as on the smartphone (40%). We assume that the mobile implementation of the algorithms and the third-party dependencies are not as well optimized as on the PC. Nevertheless, this shows that smartphones still have a reserve of resources.
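For reference, a scoped timer in the spirit of the profiling subsystem mentioned above might look like the following (the actual implementation is not described in the paper, so this is only an illustration):

```cpp
#include <chrono>
#include <cstdio>

// Minimal scoped timer: prints the elapsed wall-clock time of the enclosing
// block when it goes out of scope.
class ScopedTimer {
public:
    explicit ScopedTimer(const char* name)
        : name_(name), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        using namespace std::chrono;
        const auto ms = duration_cast<milliseconds>(steady_clock::now() - start_).count();
        std::printf("%s: %lld ms\n", name_, static_cast<long long>(ms));
    }
private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

// Usage inside the application logic, e.g.:
//   { ScopedTimer t("Face Detection"); tracker.detect(gray); }
```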


Fig. 5. Absolute pitch and yaw errors as a function of both yaw and pitch. The errors are represented along the Z axis; the ground-truth pitch and yaw measurements are represented along the X and Y axes.


Fig. 6. Runtime measurements of the two most important modules on the PC. The vertical axis is the runtime (milliseconds); the horizontal axis is the frame number.


Fig. 7. Runtime measurements of the two most important modules on the smartphone. The vertical axis is the runtime (milliseconds); the horizontal axis is the frame number.

The runtime of everything else is negligible on the PC; for example, the average runtime of ASM fitting is about 5 ms, and head pose estimation takes even less. However, this is not true on the mobile platform: there, the average runtime of ASM fitting is 48 milliseconds, almost ten times greater than on the PC. We assume that the implementation is not sufficiently optimized for ARM CPUs, so the total runtime is affected by more algorithms in the case of smartphones. The average runtime for a whole frame is 20 ms on the PC and 84 ms on the smartphone; see Figure 8 for the complete comparison. The system runs at 50 FPS on the PC and 12 FPS on the smartphone.

              Face Detection [ms]   ASM Fitting [ms]   Total [ms]
PC                          189.5                5.6         19.8
Smartphone                  240.4               48.9         84.2


Fig. 8. The average runtimes for both of the platforms.

V. CONCLUSIONS AND PLANS FOR THE FUTURE

In the above, we proposed a system for mobile image processing that is suitable for face recognition. Because of the design of the architecture, the system is flexible and can easily be extended with new functionalities. The application logic is completely separated from the UI; it is written in standard C++11 and can be compiled for both PC and smartphone without any modification. There is a considerable performance loss (about a factor of four) on the mobile platform, but the whole application logic still runs at 12 FPS on the smartphone without any optimization. It is worth dealing with optimization and parameter tuning in the future, because the implementations are optimized for x86/x64 CPUs and not for ARM. This is also supported by the fact that, currently, the CPU utilization on the smartphone is only about 35-45%, so the mobile device still has plenty of spare resources. The communication between the UI and the application logic should also be improved, because images larger than VGA resolution are passed and converted slowly. For example, a frame from the application logic could be uploaded to an OpenGL texture and then rendered to a GLSurfaceView on the UI side, which would save some copies between the components; a possible sketch of this path is given below.
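As an illustration of this planned improvement (an assumption about one possible implementation, not something realized in this paper), a frame produced by the application logic could be uploaded to an OpenGL ES texture on the native side roughly as follows:

```cpp
#include <GLES2/gl2.h>

// Upload a frame produced by the application logic into an existing
// OpenGL ES 2.0 texture, which the Java side can then render to a
// GLSurfaceView. GL_RGBA is used here; strictly speaking, a BGRA frame
// would have to be swizzled or converted to RGBA first.
void uploadFrameToTexture(GLuint texture, const unsigned char* rgba,
                          int width, int height)
{
    glBindTexture(GL_TEXTURE_2D, texture);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, rgba);
}
```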

REFERENCES

[1] P. Baranyi, A. Csapo, and Gy. Sallai, Cognitive Infocommunications (CogInfoCom), Springer International, 2015.
[2] A. Csapo and P. Baranyi, "Definition and Synergies of Cognitive Infocommunications," Acta Polytechnica Hungarica, vol. 9, no. 1, pp. 67-83, 2012.
[3] G. Sallai, "Defining Infocommunications and Related Terms," Acta Polytechnica Hungarica, vol. 9, no. 6, pp. 5-15, 2012.
[4] M. Niitsuma and H. Hashimoto, "Spatial memory as an aid system for human activity in intelligent space," IEEE Transactions on Industrial Electronics, vol. 54, no. 2, pp. 1122-1131, 2007.
[5] A. Csapo and P. Baranyi, "A Unified Terminology for the Structure and Semantics of CogInfoCom Channels," Acta Polytechnica Hungarica, vol. 9, no. 1, pp. 85-105, 2012.
[6] K. Bertok and A. Fazekas, "Recognizing Human Activities Based on Head Movement Trajectories," in Proc. of the IEEE International Conference on CogInfoCom 2014, pp. 273-278, 2014.
[7] K. Bertok and A. Fazekas, "Gesture Recognition - Control of a Computer with Natural Head Movements," in Proc. of GRAPP/IVAPP 2012, pp. 527-530, 2012.
[8] K. Bertok, L. Sajo, and A. Fazekas, "A robust head pose estimation method based on POSIT algorithm," Argumentum, vol. 7, pp. 348-356, 2011.
[9] "Android SDK," developer.android.com, 2016. [Online]. Available: https://developer.android.com/studio/features.html. [Accessed: 03-Jul-2016].
[10] "Android NDK," developer.android.com, 2016. [Online]. Available: https://developer.android.com/ndk/index.html. [Accessed: 03-Jul-2016].
[11] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 511-518, 2001.
[12] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, pp. 38-59, 1995.
[13] S. Milborrow and F. Nicolls, "Active Shape Models with SIFT Descriptors and MARS," in Proceedings of the 9th International Conference on Computer Vision Theory and Applications, pp. 380-387, 2014.
[14] D. DeMenthon and L. S. Davis, "Model-Based Object Pose in 25 Lines of Code," International Journal of Computer Vision, vol. 15, pp. 123-141, 1995.
[15] "JNI Types and Data Structures," docs.oracle.com, 2016. [Online]. Available: http://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/types.html. [Accessed: 03-Jul-2016].
[16] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Van Gool, "Random Forests for Real Time 3D Face Analysis," International Journal of Computer Vision, vol. 101, no. 3, pp. 437-458, 2013.