AdaBoost Face Detection on the GPU Using Haar-Like Features

M. Martínez-Zarzuela, F.J. Díaz-Pernas, M. Antón-Rodríguez, F. Perozo-Rondón, and D. González-Ortega

Higher School of Telecommunications Engineering, Paseo de Belén, 15, 47007 Valladolid, Spain
[email protected], http://www.gti.tel.uva.es/moodle

Abstract. Face detection is a time-consuming task in computer vision applications. In this article, an approach for AdaBoost face detection using Haar-like features on the GPU is proposed. The GPU-adapted version of the algorithm manages to speed up the detection process when compared with the detection performance of the CPU using a well-known computer vision library. An overall speed-up of ×3.3 is obtained on the GPU for video resolutions of 640x480 px when compared with the CPU implementation. Moreover, since the CPU is idle during face detection, it can be used simultaneously for other computer vision tasks.

Keywords: Face Detection, AdaBoost, Haar-like features, GPU, CUDA, OpenGL.

1 Introduction

In the field of Computer Vision (CV), detecting a specific object in an image is a computationally expensive task. Face detection can be addressed using feature-based approaches, without "a priori" machine learning, or appearance-based approaches, which rely on machine learning. The advantage of feature-based approaches is that they make explicit use of face knowledge: local features of the face (nose, mouth, eyes, ...) and the structural relationships between them. They are generally used for single-face localization and are very robust to illumination conditions, occlusions or viewpoints. However, good quality pictures are required and the algorithms are computationally expensive. On the other hand, appearance-based approaches consider face detection as a two-class pattern recognition problem and rely on statistical learning methods to build a face/non-face classifier from training samples; they can be used for multiple face detection even in low-resolution images. In practice, appearance-based approaches have proven to be more successful and robust than feature-based approaches. Different appearance-based methods mainly differ in the choice of the classifier: support vector machines, neural networks [6], Bayesian classifiers [3] or Hidden Markov Models (HMMs) [12]. A well-known method developed for frontal

J.M. Ferrández et al. (Eds.): IWINAC 2011, Part II, LNCS 6687, pp. 333–342, 2011. © Springer-Verlag Berlin Heidelberg 2011


face detection was independently introduced by Viola and Jones [16], by Romdhani [13], and by Keren [10]. All of these algorithms use a 20x20 pixel patch (searching window) around the pixel to be classified. The main difference between these approaches lies in how the cascade of hierarchical filters that classifies the window as a face or a non-face is obtained, and more specifically in the criterion used for performance optimization during training. As seen in [15], computation time is a key factor, which motivates an implementation that reduces it to a minimum. In this paper we identify the parallelizable steps of the AdaBoost face detection algorithm introduced in [16] and propose how to translate it for execution on the GPU. This way, not only is face detection faster than on the CPU, but the CPU also remains idle and can be used for other computations simultaneously.

1.1 Motivation

Using boosting algorithms reduces the number of computations needed to classify a window as face/non-face. The introduction of these algorithms made it possible to detect faces in real time on CPUs (Central Processing Units). However, detection speed is still limited, and performance drops as image resolution increases. In addition, ad-hoc hardware developments and software implementations for architectures other than the CPU have been proposed. Masayuki Hiromoto et al. studied the requirements of a specialized processor suitable for AdaBoost detection using Haar-like features [9], and Yuehua Shi et al. developed a multi-pipeline cell array architecture to speed up its computation [14]. Other researchers have optimized the well-known open source computer vision library OpenCV to run not only on Intel platforms but also on the Cell BE processor; for face detection using Haar-like features and the AdaBoost algorithm, their implementation speeds up computation between ×3 and ×11 for 640x480 px video resolutions [1]. Ghorayeb et al. proposed a hybrid CPU-GPU implementation of AdaBoost for face detection, using Control Points features instead of Haar-like features, achieving a classification speed of 15 fps for 415x255 px video on an Athlon 64 3500+ at 2.21 GHz with a 6600GT GPU [5]. For many years, CPU software immediately ran faster on each new generation of microprocessors thanks to significant increases in clock frequency. However, the current trend in microprocessor design is to increase not the clock frequency but the number of cores per die. Sequential implementations of algorithms have to be redesigned so that the workload is efficiently delivered to CPUs equipped with 2, 4 or 8 cores.
On the other hand, GPU computing has proven to be an effective technique for speeding up the execution of algorithms on GPUs (Graphics Processing Units), massively parallel processors with hundreds of cores hidden in commodity graphics cards, rather than on CPUs. GPU computing deals with the translation of algorithms into data-parallel operations and their implementation on GPUs.


In this article, some approaches for AdaBoost face detection using Haar-like features on the GPU are proposed. Section 2 gives an overview of the algorithm. Section 3 describes the implementation of the algorithm on the GPU, using CUDA and a combination of CUDA and OpenGL/Cg. Section 4 details experimental performance tests comparing CPU and GPU performance. Finally, Section 5 draws the main conclusions obtained from this research.

2 AdaBoost Face Detection with Haar-Like Features

AdaBoost learning [4], combined with the computation of Haar-like features [7], is one of the most widely employed algorithms for face detection. The AdaBoost learning process is able to significantly reduce the space of Haar features needed to classify a window of an image as containing or not containing a face. Selected Haar-like features are arranged in a classifier made up of a cascade of hierarchical stages. A scanning window is applied over the input image at different locations and scales, and the contents of the window are then labeled along the cascade as a potential face or discarded as a non-face. Haar-like features are widely used in face searching, and numerous prototypes have been trained to accurately represent human faces through the AdaBoost learning algorithm. Results of some of these trainings are available through the open source OpenCV library [11], in which it is possible to find XML descriptions of cascades of classifiers for frontal or partially rotated faces. Viola and Jones proposed four basic types of scalar features for face detection [16]. In this paper we use five different types of Haar-like features, which are shown in figure 1. Every feature can be located on any subregion inside the searching window and can vary in shape (aspect ratio) and size. Therefore, for a window of size 20x20 pixels, there can be tens of thousands of different features. The Haar-like features in figure 1 are computed using equation (1), where h_j(x, y) is the Haar-feature j computed over coordinates (x, y) on the window, relative to the position of the window on the original grey-scale image i. The sum of the pixels over positions (m, n) inside every rectangle r conforming the Haar-feature is weighted by a factor w_{jr}. The selection of h_j and the associated weights are decided by the boosting procedure during training. Stages in the AdaBoost cascade are comprised of a growing number of J Haar-features, as stated in equation (2).
The probability of face or non-face depends on a threshold θ, computed during the training process. Different Haar features contribute with different weights to the final decision of the strong classifier. Thresholds have to be chosen to maximize the number of correct detections while keeping the number of false positives low.

h_j(x, y) = \sum_{r=1}^{R} \Big[ w_{jr} \cdot \sum_{(m,n) \in r} i(m, n) \Big]    (1)


H(x, y) = \begin{cases} 1 & \text{if } \sum_{j=1}^{J} h_j(x, y) < \theta \\ 0 & \text{otherwise} \end{cases}    (2)

3 AdaBoost Haar-Like Face Detection on the GPU

3.1 Parallel Computation of the Integral Image

Within a detection window, computing a stage in the cascade of classifiers implies scanning several Haar features at different scales and positions. Each Haar feature is a weak classifier and its evaluation is based on a sum of intensities, thus requiring a fetch of every pixel under the feature area. This involves many lookups, which is undesirable. A preprocessing stage can be added to speed up the computation of the weak classifiers, by generating an Integral Image (II) [2]. In the integral image, the value stored at a pixel is the sum of the pixel intensities above and to the left of it in the original input image, as described in (3), where i(x, y) is the value in the input image and ii(x, y) the value in the integral image.

ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y')    (3)
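The summed-area table of equation (3) can be sketched in a few lines of Python (a sequential reference only; the paper computes it in parallel on the GPU, and the function name is illustrative):

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[y'][x'] for y' <= y, x' <= x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            # add the accumulated column sums from the row above
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

ii = integral_image([[1, 2], [3, 4]])
# ii == [[1, 3], [4, 10]]
```

Each output value folds in one row prefix plus the already-computed value above it, so the table is built in a single pass over the image.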

Once the II is calculated, computing a Haar-like feature needs just four memory fetches per rectangle. The sum of the intensities under a rectangle at any location and scale can be computed in constant time using equation (4). Features comprised of three different rectangles in figure 1 can be expressed using only two opposite weighted rectangles.

S = ii(x_C, y_C) + ii(x_A, y_A) − ii(x_B, y_B) − ii(x_D, y_D)    (4)
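Equation (4) can be illustrated with a small Python helper. This is a hedged sketch: the corner naming and the boundary handling for rectangles touching the image border are assumptions, since the figure labeling the corners is not reproduced here.

```python
def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixel intensities over the inclusive rectangle
    [x0..x1] x [y0..y1], using the four lookups of equation (4)."""
    a = ii[y0 - 1][x0 - 1] if x0 > 0 and y0 > 0 else 0  # above-left of the rectangle
    b = ii[y0 - 1][x1] if y0 > 0 else 0                 # above-right
    d = ii[y1][x0 - 1] if x0 > 0 else 0                 # below-left
    c = ii[y1][x1]                                      # below-right
    return c + a - b - d

# Integral image of a 4x4 all-ones image: ii[y][x] = (x + 1) * (y + 1)
ii = [[(x + 1) * (y + 1) for x in range(4)] for y in range(4)]
s = rect_sum(ii, 1, 1, 2, 2)  # 2x2 rectangle of ones
# s == 4
```

With the weights w_{jr} of equation (1), a weak classifier then reduces to a handful of `rect_sum` calls, which is what makes the evaluation constant-time per rectangle.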

Fig. 1. Five types of rectangular Haar wavelet-like features. A feature is a scalar calculated by summing up the pixels in the white region and subtracting those in the dark region.

3.2 Parallel Computation of the Cascade of Classifiers: CUDA Naïve Implementation

Detection is divided into several CUDA kernels, pieces of code that are executed in parallel by many GPU threads. Blocks of threads and the grid data organization are mapped around a mask vector of integers, which contains information about the window being processed, or 0 if the window has been discarded in a previous stage. The initial CUDA grid is one-dimensional and has the length of the mask. Each thread classifies one single window at a given stage and scale, sequentially computing all necessary weak classifiers. In a block of threads, the feature description is uploaded to shared memory before it is computed. Figure 2 shows the mask vector and how processing is delivered to blocks of threads and a grid of blocks for an image of size MxN pixels. Each integer in the mask identifies a single sliding window over the original image, placed at coordinates (x, y) = (v % M, v / M), where v is the value read from the mask vector and % is the modulo operation.
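The index-to-coordinate mapping above can be sketched in plain Python (a sequential stand-in for what each CUDA thread computes; the function name is illustrative):

```python
def window_origin(v, m):
    """Map a mask value v to the origin of its sliding window on an
    image that is m pixels wide: (x, y) = (v % m, v / m)."""
    return (v % m, v // m)

# For a 640-px-wide image, mask value 1300 maps to column 20, row 2.
origin = window_origin(1300, 640)
# origin == (20, 2)
```

Because the mapping is pure arithmetic on the thread's own mask value, no thread needs to communicate with its neighbors to locate its window.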

Fig. 2. Data organization for parallel computation (phase 1)

Once the first stage in the cascade of classifiers has been computed, several values in the mask will be 0, and the corresponding windows will not have to be processed in subsequent stages. However, when a kernel is mapped to a grid and block for computation, the hardware executes a minimum group of threads in parallel (a warp of threads). A thread reading a 0 value from the mask has no operation to perform, so it stays idle waiting for the rest of the threads in the warp. A way to avoid idle threads is to compact the mask, putting the values different from zero all together at the beginning of the mask, as shown in figure 3. The algorithm to do this in parallel is described in [8]. The size of the mask, and therefore the size of the grid, after compaction is M'xN'. It is possible to dynamically adjust the size of the kernel on the fly for maximum performance. Although compaction is faster on the GPU than on the CPU only for large masks and input images, doing the compaction on the CPU is not practical, as it would involve two extra memory copies. One of the advantages of the compaction process is that it returns the number of valid points in the mask. Knowing the number of surviving windows for the next stage allows for another optimization. In the first stages of the cascade, there is a small number of features to compute and many windows to evaluate. Along the cascade, the number of candidate windows is reduced exponentially while the number of features grows linearly. Thus, it is possible to divide the computation of the cascade into two different phases: parallelizing the computation of windows (Kernel 1) and parallelizing the computation of features inside a window (Kernel 2). Kernel 1 is used along the first stages of the cascade, where the number of windows is large and the number of features to compute inside a window is smaller than the warp size. Every thread sequentially computes the features inside its window.
The size of the block is dynamically adjusted depending on the number of windows that survive the strong classifier.
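The scan-based compaction can be sketched sequentially in Python. This is an illustrative stand-in for the parallel prefix-sum compaction of [8]: on the GPU the prefix sum and the scatter both run in parallel, while here they are plain loops with the same result.

```python
def compact(mask):
    """Move the nonzero window ids to the front of the mask and report
    how many survived (the count that drives the next kernel launch)."""
    flags = [1 if v != 0 else 0 for v in mask]
    # exclusive prefix sum: each survivor learns its output slot
    slots, total = [], 0
    for f in flags:
        slots.append(total)
        total += f
    out = [0] * total
    # scatter: each surviving value is written to its computed slot
    for v, f, s in zip(mask, flags, slots):
        if f:
            out[s] = v
    return out, total

compacted, count = compact([5, 0, 9, 0, 3])
# compacted == [5, 9, 3], count == 3
```

The returned count is exactly the "number of valid points" the text mentions, which lets the host resize the grid before launching the next stage.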


Fig. 3. Modification of mask along the cascade of classifiers

Kernel 2 can be used instead of Kernel 1 when the number of windows becomes small enough and the number of features inside a window is large, so it is worthwhile to use a block of threads to compute the features inside a window in parallel. Figure 4 shows how the computation of a window is mapped to a block of threads. The size of the block depends on the number of features that have to be computed in the given stage, and is chosen to be a multiple of the warp size. Each thread stores in shared memory the result of the computation of a single feature. These values are then modified depending on each feature threshold and summed up using a parallel reduction. Different blocks are executed on different multiprocessors, so that surviving windows are still computed in parallel. The final result is compared against the strong classifier threshold and the window is discarded if necessary.
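The per-window reduction of Kernel 2 can be emulated sequentially as follows. This is a sketch only: the pairwise additions below would be performed by a block of threads in shared memory, the per-feature values are assumed to have already been thresholded, the names are illustrative, and the inequality direction follows equation (2).

```python
def evaluate_stage(feature_vals, theta):
    """Reduce the per-feature results of one window (one value per thread)
    and decide whether the window survives the stage threshold theta."""
    vals = list(feature_vals)
    n = len(vals)
    # tree reduction: at each step, pair up the ends of the active range
    while n > 1:
        for i in range(n // 2):
            vals[i] += vals[n - 1 - i]
        n = (n + 1) // 2
    return vals[0] < theta  # 1 (face candidate) iff the sum stays below theta
```

On the GPU the inner loop is one synchronized step per level of the tree, so a stage with F features finishes in about log2(F) steps instead of F.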

Fig. 4. Data organization for parallel computation (phase 2)

3.3 Further Optimizations

Processing all the possible windows over an input image is a computationally expensive task. In the implementation considered in section 3.2, all the pixels of the image are considered as potential window origins. Hence, for robustness, all of them are classified as face/non-face regions using a brute-force approach. However, it is possible to increase detection speed by increasing the distance between windows. For better performance, CPU implementations use an offset of y = 2 px between windows, so that only odd rows of the input are processed. An offset of x = 1 px is used when the previous window has been classified as a potential face, and x = 2 px otherwise. For the CUDA implementation, it is possible to define the mask to be of size (M/2)x(N/2), so that it contains one fourth of the possible windows. On odd and even frames of a video sequence it is possible to use an overlapping mask of the same size, displaced by an offset of x = 1 px. For the given masks, the coordinates of a window can be calculated as (x, y) = (v % M, v / M) * (offset_x, offset_y). Moreover, re-scaling the sliding windows and the associated Haar-like features is needed when looking for an object at different sizes. The same integral image can be used, but lookup accesses to textures become more dispersed, which provokes cache misses and slows down the process. Another improvement consists in resizing the input image while keeping the sliding window size, since resizing the input image on the GPU can be done almost for free.
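The offset-scaled mapping can be sketched like the earlier one (a sequential stand-in for the per-thread arithmetic; parameter names, and the way the alternate-frame displacement is passed in, are illustrative assumptions):

```python
def window_origin(v, m_cols, off_x, off_y, shift_x=0):
    """Map a value from the reduced (M/2)x(N/2) mask to image coordinates.
    off_x/off_y are the strides between candidate windows; shift_x displaces
    the mask by 1 px on alternate video frames so that, over two frames,
    the overlapping masks cover the skipped columns."""
    return ((v % m_cols) * off_x + shift_x, (v // m_cols) * off_y)

# 640-px-wide image, half-size mask (320 columns), stride 2 in both axes:
origin = window_origin(321, 320, 2, 2)
# origin == (2, 2); on the next frame, shift_x=1 gives (3, 2)
```

Halving the mask in both dimensions cuts the number of candidate windows to a quarter, at the cost of the coarser stride that the alternating shift compensates for.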

4 Performance Tests

The GPU-based face detectors described in section 3 have two interfaces, able to process static images or video sequences captured from a webcam. Several tests were run to measure the performance of the GPU implementations, using different image sizes and video resolutions on a 3.2 GHz Pentium 4 with 4 GB RAM and an NVIDIA GeForce GTX285 1 GB GPU. The GPU performance was tested against the CPU implementation of the algorithm delivered with the OpenCV library [11]. Both CPU and GPU implementations use the same XML cascade of classifiers for detecting frontal faces, and the detection rate was exactly the same in the tests presented here. Figure 5 compares the time needed to detect a single face on the CPU and on the GPU. Figure 5(a) shows detection times for static images depending on their resolution. Times given are averages over 10 executions on different images for two GPU implementations: the naïve version of the algorithm and the version optimized as detailed in section 3.3. The naïve GPU implementation runs at roughly the same rate as the CPU OpenCV version. The optimized GPU code gives a speed-up of ×1.6 for images of size 800x800 px. Figure 5(b) compares times for single-face detection on video sequences. Again, the optimized GPU implementation is the fastest: the speed-up with respect to the OpenCV implementation is up to ×3.3 for video resolutions of 640x480 px. The speed-up is greater for video than for static images because of the initialization processes that take place on the GPU when a kernel or a shader is launched for the first time; with the video interface, this initialization is only necessary for the first frame. Memory and algorithm initialization also makes face detection on a single image slower than on a single video frame when using OpenCV.


(a) Static images

(b) Video sequences

Fig. 5. Performance test for only one face detection on static images and video sequences at different resolutions

Finally, we analyzed the performance of the different implementations depending on the number of faces to be detected in a single image. Figure 6(a) shows an image from the CMU/VASC database of size 512x512 px. Figures 6(b) to 6(d) have the same size, but incorporate an increasing number of faces to be detected. The number of windows that survive along the cascade of classifiers varies with the number of faces in the image. For the image containing 9 faces, the number of windows that have to be evaluated after the first stage is ×1.5 greater than for an image containing only one face; in the last stage, the number of windows grows by a factor of ×9. In a sequential CPU implementation, the performance of the algorithm can degrade significantly as the number of faces in the image increases. In the parallel GPU-based implementation, however, performance is only reduced by a factor of ×1.3 on a GTX285 GPU, which contains 240 stream processors.

(a) Original

(b) Two faces

(c) Four faces

(d) Nine faces

Fig. 6. Original image from CMU/VASC database with one face and modifications including two, four and nine faces

5 Conclusions

In this paper, different GPU-based approaches for implementing AdaBoost face detection using Haar-like features were described and tested. In the AdaBoost algorithm, detection speed is increased through the evaluation of a cascade of hierarchical filters: windows easy to discriminate as not containing the target object are classified by simple and fast filters, while pixels that resemble the object of interest are classified by more involved and slower filters. The main limitation we found when translating the algorithm was the high level of branching it imposes, which makes data parallelization difficult. For GPU performance, a cascade of classifiers comprised of a smaller number of stages, or even just one computationally expensive stage, would perform rather fast. Mapping the algorithm to the GPU so as to evaluate simultaneously as many windows as possible in the first stages (a large number of windows with a small number of features) and to parallelize the computation of features inside a window in the later stages (a small number of windows with a large number of features) proved to be the more convenient way to take advantage of the underlying parallel hardware. For static images, a maximum speed-up of ×1.6 was obtained with the fastest GPU implementation at resolutions of 800x800 px. For video resolutions of 640x480 px, face detection was ×3.3 times faster than the OpenCV library version for the CPU. A cascade of well-known classifiers was employed; although this cascade was designed for execution on sequential hardware, significant speed-ups are obtained on the GPU.

Acknowledgements This work has been partially supported by the Spanish Ministry of Science and Innovation under project TIN2010-20529.

References

1. CellCV: OpenCV on the Cell. AdaBoost face detection using Haar-like features, optimization for the Cell, code download and performance comparisons (2009), http://cell.fixstars.com/opencv/index.php/Facedetect (last visit February 2009)
2. Crow, F.C.: Summed-area tables for texture mapping. In: SIGGRAPH 1984: Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, pp. 207–212. ACM Press, New York (1984)
3. Elkan, C.: Boosting and naive Bayesian learning. Tech. rep. (1997)
4. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997)
5. Ghorayeb, H., Steux, B., Laurgeau, C.: Boosted algorithms for visual object detection on graphics processing units. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3852, pp. 254–263. Springer, Heidelberg (2006)
6. Rowley, H.A., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Transactions on PAMI (1998)
7. Haar, A.: Zur Theorie der orthogonalen Funktionensysteme. Math. Annalen 69, 331–371 (1910)
8. Harris, M.: Parallel prefix sum (scan) with CUDA. In: Nguyen, H. (ed.) GPU Gems 3, ch. 39, pp. 851–876. Addison Wesley Professional, Reading (2007)
9. Hiromoto, M., Nakahara, K., Sugano, H., Nakamura, Y., Miyamoto, R.: A specialized processor suitable for AdaBoost-based detection with Haar-like features. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8 (June 2007)
10. Keren, D., Osadchy, M., Gotsman, C.: Antifaces: A novel, fast method for image detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(7), 747–761 (2001)
11. OpenCV: Open source computer vision library (2009), http://sourceforge.net/projects/opencvlibrary (last visit February 2009)
12. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition, pp. 267–296 (1990)
13. Romdhani, S., Torr, P., Scholkopf, B., Blake, A.: Computationally efficient face detection. In: IEEE International Conference on Computer Vision, vol. 2, p. 695 (2001)
14. Shi, Y., Zhao, F., Zhang, Z.: Hardware implementation of AdaBoost algorithm and verification. In: 22nd International Conference on Advanced Information Networking and Applications - Workshops, AINAW 2008, pp. 343–346 (March 2008)
15. Vaillant, R., Monrocq, C., Le Cun, Y.: Original approach for the localization of objects in images. In: IEEE Proceedings of Vision, Image and Signal Processing, vol. 141(4), pp. 245–250 (August 1994)
16. Viola, P., Jones, M.: Robust real-time object detection. International Journal of Computer Vision 57(2), 137–154 (2002)