Low Vision Assistance Using Face Detection and Tracking on Android Smartphones

Andreas Savakis¹, Mark Stump¹, Grigorios Tsagkatakis¹, Roy Melton¹, Gary Behm², and Gwen Sterns³

¹Department of Computer Engineering, Rochester Institute of Technology, Rochester, New York 14623
²National Technical Institute for the Deaf, Rochester Institute of Technology, Rochester, New York 14623
³Department of Ophthalmology, Rochester General Hospital, Rochester, New York 14621
Abstract—This paper presents a low vision assistance system for individuals with blind spots in their visual field. The system identifies prominent faces in the field of view and redisplays them in regions that are visible to the user. As part of the system performance evaluation, we compare various algorithms for face detection and tracking on an Android smartphone, a netbook, and a high-performance workstation representative of cloud computing. We examine processing time and energy consumption on all three platforms to determine the tradeoff between processing on the smartphone and processing on a cloud-based desktop after compression and transmission. Our results demonstrate that Viola-Jones face detection combined with Lucas-Kanade tracking achieves the best performance and efficiency.

I. INTRODUCTION

The National Institutes of Health's Healthy People 2010 Midcourse Review states that "vision and hearing are defining elements of the quality of life." In general, low vision and blindness increase significantly with age across all racial and ethnic groups, and by the age of 65 approximately one third of the population has some form of vision-reducing eye disease. The most common eye diseases among people aged 40 years and older are age-related macular degeneration, cataract, diabetic retinopathy, and glaucoma. The number of individuals in the U.S. affected by age-related macular degeneration alone is expected to increase from 1.75 million currently to almost 3 million by 2020.

Vision impairment appears to be directly associated with loss of self-sufficiency. A survey of 222 subjects aged 65–90 found that 49% had some type of ocular disorder, 29% reported difficulty with basic activities such as grooming and bathing, and 19% reported some difficulty with "instrumental activities" such as shopping or managing money. Finally, 32% of subjects reported difficulty with some mobility tasks. Overall, these results show a notable link between visual impairment and an individual's ability to interact with the world.

This research was supported in part by the RIT/RGHS Alliance and the National Technical Institute for the Deaf.



This research focuses on assisting individuals with blind spots in their visual field. Such conditions may be caused by two specific diseases: Usher's syndrome, a condition that combines loss of hearing with loss of peripheral vision due to retinitis pigmentosa, and macular degeneration, a disorder that affects central vision. As with Usher's syndrome, advanced glaucoma and cerebrovascular accidents (strokes) often cause peripheral visual field loss, impairing the affected patient's ability to navigate in crowded areas, cross streets, or perform tasks such as reading and writing that require scanning. As in the case of macular degeneration, macular edema caused by diabetes mellitus leads to loss of central vision, impairing the ability to read or recognize faces.

Previous research on systems for low vision assistance includes the work of Luo and Peli [1]. Their head mounted display with an augmented-vision device superimposes contour images over the user's field of vision for tunnel vision assistance. The head mounted display is monocular and interacts with the processing device, which receives input from a head mounted camera. The system uses a combination of buzzers and a target contour to assist a user with tunnel vision in finding a designated target. However, a dedicated device requires specialized hardware, which can be costly and difficult to carry.

Scherlen and Gautier have developed a system to assist patients with centrally located scotomas through visual signal adaptive restitution (ViSAR) [2]. Their active vision system is built around the idea of observing the user, interpreting cognitive discomfort, and restituting the signal by relocating information that would normally be obscured by a scotoma. The system requires a head mounted display and eye tracking to detect where the user's eyes are looking and therefore what information the scotoma is obscuring. This system is not portable and requires an expensive setup.

In this paper we propose a system based on an Android smartphone, which is portable and easily accessible to the user. An added advantage of using a smartphone is that mobile computing capabilities continue to increase with new models.


II. MOBILE LOW VISION ASSISTANCE SYSTEM

A prototype solution for mobile low vision assistance was developed and evaluated on an Android smartphone platform. A user interface was designed to identify the user's preferred screen location for viewing, i.e., a region free of blind spots. Once the preferred location is identified, the user has the option to adjust image rendering parameters, such as zoom and contrast, in order to maximize the effectiveness of the display. After initialization, the system locates faces as regions of interest and redisplays them in the user's preferred location. Face tracking follows the detected face between consecutive frames, while face detection takes place every few seconds to reduce the computational load.

By enabling users to visualize the primary face in the visual field, the proposed system supports face-to-face communication for users with blind spots: they are able to observe facial expressions, read lips, and even follow gestures. An example of the system output is shown in Figure 1, with the blind spot superimposed on the image. In this example, the person's facial expression remains viewable because the face is redrawn at a preferred location of the visual field.
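The detect-track-redisplay loop can be summarized in OpenCV-style C++ as in the sketch below. This is an illustrative desktop sketch under assumed parameters (the cascade file name, the 100-frame detection interval reported in Section V, and the hard-coded preferredRegion are placeholders), not the authors' Android/JNI implementation.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::VideoCapture cam(0);
    cv::CascadeClassifier detector("haarcascade_frontalface_default.xml");
    if (!cam.isOpened() || detector.empty()) return 1;

    const int DETECT_INTERVAL = 100;             // re-detect every 100 frames (Sec. V)
    cv::Rect face;                               // last known face location
    cv::Rect preferredRegion(20, 20, 160, 160);  // assumed blind-spot-free area chosen by the user
    cv::Mat frame, gray, prevGray;
    std::vector<cv::Point2f> prevPts, nextPts;

    for (long frameNo = 0; cam.read(frame); ++frameNo) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);

        if (frameNo % DETECT_INTERVAL == 0) {    // periodic face detection
            std::vector<cv::Rect> faces;
            detector.detectMultiScale(gray, faces, 1.2, 3, 0, cv::Size(40, 40));
            if (!faces.empty()) {
                face = faces[0];                 // take the most prominent face
                cv::goodFeaturesToTrack(gray(face), prevPts, 10, 0.01, 5);
                for (auto& p : prevPts)          // ROI to image coordinates
                    p += cv::Point2f((float)face.x, (float)face.y);
            }
        } else if (!prevPts.empty()) {           // track between detections
            std::vector<uchar> status;
            std::vector<float> err;
            cv::calcOpticalFlowPyrLK(prevGray, gray, prevPts, nextPts, status, err);
            cv::Point2f d(0, 0);
            int kept = 0;
            for (size_t i = 0; i < nextPts.size(); ++i)
                if (status[i]) { d += nextPts[i] - prevPts[i]; ++kept; }
            if (kept) { face.x += cvRound(d.x / kept); face.y += cvRound(d.y / kept); }
            prevPts = nextPts;
        }

        face &= cv::Rect(0, 0, frame.cols, frame.rows);  // keep the box inside the frame
        if (face.area() > 0) {                   // redisplay at the preferred location
            cv::Mat patch;
            cv::resize(frame(face), patch, preferredRegion.size());  // zoom to fit
            patch.copyTo(frame(preferredRegion));
        }

        cv::imshow("low vision assist", frame);
        gray.copyTo(prevGray);
        if (cv::waitKey(1) == 27) break;         // Esc quits
    }
    return 0;
}
```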

Fig. 1. System output demonstrating face relocation in a preferred, visible area of the visual field.

When designing a system on a mobile platform, two quantities are of prime importance: processing time (i.e., the time between the user query and the response) and power consumption (i.e., the power demands of each algorithm and therefore the expected battery life). One approach is to perform all processing on the smartphone; another is to compress the images and transmit them to a cloud computer for processing. The decision on the processing platform depends on the available computational and communication capabilities. In this paper, we consider three platforms: a smartphone, a netbook, and a desktop simulating cloud processing. Since face detection and tracking are computationally demanding tasks, we benchmarked the performance of the following algorithms. For face detection, we evaluated the popular Viola-Jones algorithm [3], support vector machines [4], and the method of Pai et al. [5]. For face tracking, we examined template matching, Lucas-Kanade tracking [6], and CAMShift [7]. The next sections discuss these algorithms in more detail and present benchmarking results.
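The platform decision can be framed as a simple cost comparison. The toy rule below weighs latency against an energy budget; it is our own illustration of the tradeoff, and all inputs are hypothetical profiling numbers rather than values measured in this paper.

```cpp
// Toy on-device vs. cloud decision rule for the tradeoff discussed above.
struct PlatformCost {
    double latencySec;     // time from user query to response
    double energyJoules;   // per-query energy (for cloud: compression + radio)
};

bool shouldOffload(const PlatformCost& local, const PlatformCost& cloud,
                   double energyBudgetJoules) {
    bool localOk = local.energyJoules <= energyBudgetJoules;
    bool cloudOk = cloud.energyJoules <= energyBudgetJoules;
    if (localOk != cloudOk) return cloudOk;      // only one option fits the budget
    return cloud.latencySec < local.latencySec;  // otherwise minimize latency
}
```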

III. FACE DETECTION ALGORITHMS

The Viola-Jones detector [3] is considered one of the best performing near-frontal face detection algorithms in terms of both speed and complexity. It is based on a combination of weak classifiers (boosting) to obtain an efficient strong classifier. Each weak classifier is tasked with learning one of three Haar-like features in a 24x24 pixel sub-window: a two-rectangle feature, a three-rectangle feature, or a four-rectangle feature. Instead of running weak classifiers on the raw image, each image is transformed into an "integral" image, allowing fast feature extraction, since any Haar feature can be computed from a few lookup table entries in the integral image. The algorithm performs early rejection if the combination of the weak classifiers (one for each feature) falls below a threshold for a test sub-window. Since most of the sub-windows in an image are negative (non-faces), the algorithm rejects them at the earliest possible stage. When used together, the selected group of weak classifiers becomes a "strong" classifier through a process called "boosting." The most widely used boosting technique, originally employed by Viola and Jones, is AdaBoost.

Support vector machines (SVMs) are popular classifiers used to separate two classes, e.g., faces and non-faces [4]. During training, the SVM determines the hyperplane that maximizes the separation margin between the training data classes. The optimal hyperplane is obtained by solving a convex quadratic programming problem. In cases where the data cannot be linearly separated, the kernel trick is employed to transform the input data into a higher dimensional space Φ, where the transformed data become linearly separable.

The method of Pai et al. [5] is a color-based algorithm that uses the color of human skin as well as the geometry of the human face to detect faces quickly. Lighting compensation is accomplished by computing the average luminosity of the image and compensating the different color channels to generate a chrominance image. After some filtering, skin detection is performed in chrominance space. Finally, bounding boxes are determined for the face-colored blobs found in the image. With regions of potential faces identified, numerous geometric tests are run to validate face regions, e.g., the width-to-height ratio of the face candidate regions, eye and mouth distances, etc.
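A minimal sketch of the integral-image mechanism behind the Haar features in [3] is given below: any rectangle sum reduces to four table lookups, so each feature costs a handful of array accesses. This illustrates the idea only and is not OpenCV's internal cascade code; the input file name is a placeholder.

```cpp
#include <opencv2/opencv.hpp>

// Sum of pixels inside r, using an integral image ii of size (rows+1) x (cols+1),
// where ii(y, x) holds the sum over all pixels above and to the left of (x, y).
static int rectSum(const cv::Mat& ii, const cv::Rect& r) {
    return ii.at<int>(r.y, r.x)
         + ii.at<int>(r.y + r.height, r.x + r.width)
         - ii.at<int>(r.y, r.x + r.width)
         - ii.at<int>(r.y + r.height, r.x);
}

// A two-rectangle (left vs. right) Haar-like feature in a 24x24 sub-window at (x, y).
static int twoRectFeature(const cv::Mat& ii, int x, int y) {
    cv::Rect left(x, y, 12, 24), right(x + 12, y, 12, 24);
    return rectSum(ii, left) - rectSum(ii, right);
}

int main() {
    cv::Mat gray = cv::imread("face.png", cv::IMREAD_GRAYSCALE);  // hypothetical input
    if (gray.empty()) return 1;
    cv::Mat ii;
    cv::integral(gray, ii, CV_32S);
    // A weak classifier thresholds this response; AdaBoost combines many such
    // weak classifiers into the strong classifiers of the cascade stages.
    return twoRectFeature(ii, 0, 0) > 0 ? 0 : 1;
}
```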

IV. FACE TRACKING ALGORITHMS

The template-based approach is an iterative localization algorithm that employs the sum of squared differences to locate the region that best corresponds to the tracked target. Template matching requires a large number of comparisons to determine the region with the best match, but it imposes only mild memory requirements.

The Lucas-Kanade method uses sparse optical flow to track objects [6]. Optical flow is a differential method that assumes a local patch of pixels has spatial coherence, i.e., pixels around a feature tend to move in the same direction, as they likely belong to the same object. To find features to pair across frames, a robust feature detector is used, such as the Harris corner detector. Additionally, it is assumed that pixel values around a chosen feature point remain constant between frames.
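The tracking step might look like the following sketch, assuming OpenCV's pyramidal Lucas-Kanade implementation. The choice of ten Harris-style corners mirrors the configuration reported in Section V, while the mean-displacement update of the face box is our own simplification.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Propagate a face box from prevGray to gray using sparse optical flow.
cv::Rect trackLK(const cv::Mat& prevGray, const cv::Mat& gray, cv::Rect face) {
    std::vector<cv::Point2f> pts, nextPts;
    // Ten discriminative corners inside the face region (cf. Sec. V).
    cv::goodFeaturesToTrack(prevGray(face), pts, 10, 0.01, 5,
                            cv::noArray(), 3, true /* Harris detector */);
    if (pts.empty()) return face;                // nothing trackable; keep the old box
    for (auto& p : pts)
        p += cv::Point2f((float)face.x, (float)face.y);  // ROI to image coordinates

    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, gray, pts, nextPts, status, err);

    // Shift the box by the mean displacement of the successfully tracked points.
    cv::Point2f d(0, 0);
    int kept = 0;
    for (size_t i = 0; i < pts.size(); ++i)
        if (status[i]) { d += nextPts[i] - pts[i]; ++kept; }
    if (kept) {
        face.x += cvRound(d.x / kept);
        face.y += cvRound(d.y / kept);
    }
    return face;
}
```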


Continuously adaptive mean shift (CAMShift) [7] is an object tracking algorithm that tracks both the object's coordinates and its color probability distribution. The core of the algorithm, mean shift, works by adapting the color probability distribution of the tracked region from frame to frame. A color histogram of the region is computed from the HSV-transformed image, and the color probability distribution of the region is found. The center of mass within the search window is then located, and the search window is updated for the next frame.
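A hedged OpenCV sketch of this procedure follows: the hue histogram of an initially detected face seeds the color model, each frame is back-projected into a probability image, and cv::CamShift adapts the search window. The initial window coordinates and histogram settings are illustrative assumptions.

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cam(0);
    cv::Mat frame, hsv, backproj, hist;
    if (!cam.read(frame)) return 1;
    cv::Rect window(100, 100, 80, 80);  // assume this box came from face detection

    // Build the hue histogram of the initially detected face region.
    cv::cvtColor(frame, hsv, cv::COLOR_BGR2HSV);
    int channels[] = {0};               // hue channel
    int histSize[] = {30};
    float hueRange[] = {0, 180};
    const float* ranges[] = {hueRange};
    cv::Mat roi = hsv(window);
    cv::calcHist(&roi, 1, channels, cv::Mat(), hist, 1, histSize, ranges);
    cv::normalize(hist, hist, 0, 255, cv::NORM_MINMAX);

    while (cam.read(frame)) {
        cv::cvtColor(frame, hsv, cv::COLOR_BGR2HSV);
        cv::calcBackProject(&hsv, 1, channels, hist, backproj, ranges);
        // Mean shift to the region's center of mass; the window size self-adapts.
        cv::RotatedRect box = cv::CamShift(backproj, window,
            cv::TermCriteria(cv::TermCriteria::EPS | cv::TermCriteria::COUNT, 10, 1));
        cv::ellipse(frame, box, cv::Scalar(0, 255, 0), 2);
        cv::imshow("camshift", frame);
        if (cv::waitKey(1) == 27) break;
    }
    return 0;
}
```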

V. ALGORITHM AND PLATFORM BENCHMARKING

For our mobile implementation, we considered a Motorola Droid. This Android smartphone is equipped with an ARM processor operating at 528 MHz, 192 MB of RAM, 256 MB of flash memory, a 3.2-megapixel color camera with autofocus, and 3G or Wi-Fi wireless connectivity. In addition, a netbook computer with an Atom processor running at 1.66 GHz and 1 GB of RAM was utilized. A desktop computer with two dual-core 2.7-GHz processors and 8 GB of RAM served as a "cloud" based system by taking data transfer times into consideration. All of the algorithms were implemented using OpenCV, along with some native C/C++ functions and a Java interface. Each algorithm was run at seven different resolutions ranging from 848x480 down to 176x144. Face detection was performed every 100 frames, and face tracking was performed between face detection events. An example of the actual video stream is shown in Figure 2.

For the Viola-Jones implementation on the mobile device, a boosted Haar cascade was used. The minimum search window was 40x40 pixels, and a scale factor of 1.2 was used to resize windows during image scans. A Canny edge detector was used for early rejection of regions with an abundance or shortage of edges. Finally, the Haar cascade was stored on the device's removable flash memory (microSD) card in order to use as little on-board memory as possible.

The SVM was trained using the face library in [9], which contains a total of 6,977 images: 2,429 faces and 4,548 non-faces. A linear kernel was employed due to its lower computational load. Training generated a total of 407 support vectors for the given dataset. The implementation of the SVM classifier was based on the OpenCV library. For the actual detection, a sliding window and image pyramid approach was used, with a step increment of two pixels for scanning an image. A three-level image pyramid was used for processing, obtained by scaling the image to a quarter of its original size, then to half size, and then keeping the original size. As with Viola-Jones, the trained model was stored on a removable flash memory card to conserve the on-board memory of the device.

The Pai et al. method was modified from its initial form. After testing the original algorithm, it became clear that the mouth and eye regions were too small for reliable processing. Instead of using eye and mouth detection, averages were taken over many runs of the algorithm to estimate the typical size of a detected face at a typical distance from the camera. After the height/width ratio check, the area of the face candidate is checked, and it is declared a face if the facial area is within the expected bounds.

For template matching, the search area was restricted to reduce processing time: only an area of 75x75 pixels around the detected face is searched for a match with the template. For Lucas-Kanade tracking, a corner detector is employed to find discriminative features to track. Ten features are detected between the previous and current frames to keep computation to a minimum while maintaining the ability to track a moving face throughout a scene. The CAMShift tracking method was implemented consistently with its theoretical formulation, with the initially detected face used as the original training image.
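The detector configuration and the restricted template search described above might be coded as follows. The parameter values (scale factor 1.2, 40x40 minimum window, Canny pruning, 75x75 search restriction) come from the text, while the function names and the interpretation of the 75x75 area as a margin around the face box are our assumptions.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Viola-Jones detection with the parameters reported above.
std::vector<cv::Rect> detectFaces(const cv::Mat& gray, cv::CascadeClassifier& cascade) {
    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(gray, faces,
                             1.2,                           // window scale factor
                             3,                             // minimum neighbors
                             cv::CASCADE_DO_CANNY_PRUNING,  // early rejection via edges
                             cv::Size(40, 40));             // minimum search window
    return faces;
}

// Template matching confined to a neighborhood around the last detection,
// interpreting "75x75 pixels around the face" as a +/-37-pixel margin.
cv::Rect matchNearLastFace(const cv::Mat& gray, const cv::Mat& tmpl, cv::Rect lastFace) {
    cv::Rect search(lastFace.x - 37, lastFace.y - 37,
                    lastFace.width + 75, lastFace.height + 75);
    search &= cv::Rect(0, 0, gray.cols, gray.rows);  // clip to image bounds
    cv::Mat result;
    cv::matchTemplate(gray(search), tmpl, result, cv::TM_SQDIFF);
    cv::Point best;
    cv::minMaxLoc(result, 0, 0, &best, 0);           // SQDIFF: minimum = best match
    return cv::Rect(search.tl() + best, tmpl.size());
}
```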

Fig. 2. Face relocation example demonstrating the low vision assistance system. Windows are displayed to indicate the detected face (inner), the tracking search space (middle), and the face detection search space (outer).

We collected experimental results on the execution times of the face detection and tracking algorithms as well as their power consumption. To simulate a cloud-based system, the overhead of transmission time was added to the desktop results. These values include JPEG compression time and upload time over a 3G network. For the upload times, the results of Tan, Lam, and Lau were used [8]; upload times for three different carriers were averaged to find a representative upload rate for a 3G network.
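The simulated per-frame cloud cost can be reproduced with a sketch like the one below, where the uplink rate stands in for the averaged 3G figure from [8]; the constant shown is a placeholder, not the measured value.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Simulated cloud latency: JPEG compression + upload estimate + desktop processing.
double simulatedCloudTimeSec(const cv::Mat& frame, double desktopProcSec) {
    const double UPLINK_BYTES_PER_SEC = 60e3;  // hypothetical averaged 3G uplink rate
    std::vector<uchar> jpeg;
    std::vector<int> params;
    params.push_back(cv::IMWRITE_JPEG_QUALITY);
    params.push_back(80);                      // assumed JPEG quality setting

    double t0 = (double)cv::getTickCount();
    cv::imencode(".jpg", frame, jpeg, params); // measure compression time
    double compressSec = ((double)cv::getTickCount() - t0) / cv::getTickFrequency();

    double uploadSec = (double)jpeg.size() / UPLINK_BYTES_PER_SEC;
    return compressSec + uploadSec + desktopProcSec;
}
```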


VI. RESULTS AND CONCLUSIONS

We present algorithm performance across platforms for face detection and tracking in Figures 3 and 4, respectively. We observe a roughly linear trend with increasing image size for all platforms and algorithms. The cloud-computing results incorporate JPEG image compression and transmission over a 3G network. Figures 5 and 6 show power consumption for face detection and tracking. Based on these results, we selected Viola-Jones for face detection, as it provides the best detection at acceptable power consumption and execution time. Both Lucas-Kanade and CAMShift offer acceptable tracking performance; although CAMShift has better power characteristics, we selected Lucas-Kanade because it is less prone to drifting.


Fig. 3. Execution times for Viola-Jones face detection.

Fig. 4. Execution times for Lucas-Kanade face tracking.

Fig. 5. Power consumption for face detection on Android.

Fig. 6. Power consumption for face tracking on Android.

REFERENCES

[1] G. Luo and E. Peli, "Use of an augmented-vision device for visual search in patients with tunnel vision," Investigative Ophthalmology and Visual Science, vol. 47, no. 9, pp. 4152–4159, 2006.
[2] A. Scherlen and V. Gautier, "A new concept for visual aids: 'ViSAR' Visual Signal Adaptive Restitution," in Proc. IEEE Conf. of the Engineering in Medicine and Biology Society (IEEE-EMBS), 2005, pp. 1976–1979.
[3] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), 2001, vol. 1, pp. I-511–I-518.
[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[5] Y.-T. Pai, S.-J. Ruan, M.-C. Shie, and Y.-C. Liu, "A simple and accurate color face detection algorithm in complex background," in Proc. IEEE Intl. Conf. on Multimedia and Expo (ICME), 2006, pp. 1545–1548.
[6] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. 7th Intl. Joint Conf. on Artificial Intelligence (IJCAI), 1981, pp. 674–679.
[7] G. R. Bradski, "Real time face and object tracking as a component of a perceptual user interface," in Proc. 4th IEEE Workshop on Applications of Computer Vision (WACV), 1998, pp. 214–219.
[8] W. L. Tan, F. Lam, and W. C. Lau, "An empirical study on 3G network capacity and performance," in Proc. IEEE INFOCOM, 2007, pp. 1514–1522.
[9] "CBCL Face Database #1," MIT Center for Biological and Computational Learning. Available: http://www.ai.mit.edu/projects/cbcl