Multiple human tracking based on Daubechies complex wavelet transform combined with histogram of templates features

Sunitha M R
Associate Professor, AIT, Chikmaglalur, Karnataka, India. Email: [email protected]

H S Jayanna
Professor and Head, Dept. of ISE, SIT, Tumkur, Karnataka, India. Email: [email protected]

Ramegowda
Principal, BCE, S.Belagola, Karnataka, India. Email: [email protected]

Abstract—In video processing, tracking objects in motion has attracted a great deal of interest from researchers all over the world. In numerous computer vision applications, such as traffic monitoring, remote video surveillance automation and human tracking, moving object detection in a video sequence is the major step of knowledge extraction. In this paper, an effective human tracking system based on the Daubechies Complex Wavelet Transform (DaubCxWT) combined with Histogram of Template features is introduced. This transform is suitable for tracking a person in video sequences because of its approximate shift-invariance. Initially, the DaubCxWT coefficients associated with the person are computed. Then, in the Daubechies complex wavelet domain, the energy of these coefficients is compared with that of the neighbouring object to carry out tracking in the consecutive frames. Histogram of Template features are used to extract texture and gradient information for the detected human. The Daubechies complex wavelet coefficients and Histogram of Template features are combined to form a feature vector; the computed coefficients are utilised to build a feature vector for every pixel in that area. Further, using the generated feature vectors, an optimal search for the best match is performed inside an adaptive search window. Search window adaptation is employed to estimate the speed and direction of the person in motion. This method has shown appreciable results.

Keywords—Histogram of template, Daubechies complex wavelet transform, AdaBoost, feature vector.

I. INTRODUCTION

Object tracking in video sequences is an important research topic in video analysis applications such as visual navigation, video surveillance, analysis of customers' shopping behaviour in retail shops, and military surveillance. Object tracking may be described as the process of first detecting the object of interest and then continuously estimating its position and other significant information in images against dynamic scenes. The trajectory of a moving object in a video sequence can be obtained over time by locating its position in every image of the video using the object tracking process. Reliable tracking algorithms must cope with issues such as image noise, which causes brisk changes in appearance, variations in illumination, and variations in shape and size. Performance of any

978-1-4673-6667-0/15/$31.00©2015 IEEE

tracking system depends on the observation models and the target representation. For many applications, the behaviour of persons is of most interest, for example in traffic surveillance, sports analysis or military surveillance. People detection from visual observations is therefore a very active research area. Most of the existing methods for multi-human tracking are still restricted to specific application contexts.

The work presented in [1] tracks global statistics of object colour and shape by employing a particle filter. Robustness to occlusions and background noise is achieved by adding colour to the state of the particle filter. However, sampling in the state space is quite expensive, and the method is also computationally expensive because of the huge number of samples needed to represent the probability density function. Apart from this, the particle filter suffers from the particle impoverishment problem [2]. An algorithm for effective tracking of objects in surveillance videos with structured environments has been presented in [3]. The relationship between the environment and the objects acts as additional information in the motion estimation of objects; this fact is exploited by introducing a state term. Distance field models, obtained using a distance transform, are used within a Bayesian tracking framework, and a particle filter is employed to solve the tracking problem. This approach integrates the effect of the environment on the motion of the object into the tracking algorithm, thus improving tracking performance; however, the system does not consider occlusion, which is an obvious parameter in surveillance video. The method proposed in [4] concentrates on tracking multiple human beings whose paths may intersect, while preserving their respective identities. The system involves two steps. First, trajectories that may involve identity switches, but that give information about the grid cells in which people may be spotted at any given time, are found using K-shortest paths. Next, a linear program is executed on a reduced number of grid cells, which saves storage space and time. The system experiences a setback in identifying individuals whose clothes are of the same colour. The work in [5] presented a local background subtracter for tracking humans in surveillance videos. This method consists of two stages. In the first stage, a background model is initialised by considering the information of stationary and non-stationary pixels across the frames. In the following stage, background subtraction based on local thresholding is performed: the foreground object is extracted by comparing any frame of the video with the background model obtained in the previous stage.

In this paper, a system which successfully detects and tracks people in video is presented. Frames are generated from the input video and pre-processing is performed on each frame. The pre-processing stage includes RGB to gray conversion, resizing and normalisation of frames. A few of these frames are used to generate the background, and the foreground object is then extracted by subtracting the generated background from each frame. This foreground is sent to AdaBoost to detect and validate human blobs. Each blob location is computed to crop the area of interest from the gray image. The Daubechies Complex Wavelet Transform (DaubCxWT) is applied to this area to get the coefficients associated with the human, and Histogram of Template (HOT) features are also obtained. The Daubechies complex wavelet coefficients and HOT features are combined to form a feature vector; combining features that best discriminate between object and background improves the tracking performance [6]. Humans are then tracked by defining a search window, and the computed feature vectors are used to update the search window.

The rest of the paper is organised as follows. Section II describes the methodology of the proposed system. In Section III, experimental results are presented. Finally, conclusions and future work are discussed in Section IV.
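The background-generation and subtraction step described above can be sketched as follows. The per-pixel median background model and the threshold value are illustrative assumptions, since the paper does not pin down the exact background-generation procedure:

```python
import numpy as np

def build_background(frames):
    # A simple background model: per-pixel median over the initial frames
    # (the paper's exact background-generation model is not specified here).
    return np.median(np.stack(frames), axis=0)

def foreground_mask(frame, background, thresh=25.0):
    # Foreground = pixels whose absolute difference from the background
    # model exceeds a threshold (threshold value is an assumption).
    diff = np.abs(frame.astype(float) - background.astype(float))
    return diff > thresh
```

On real video the frames would be the pre-processed gray-scale images, and morphological clean-up of the mask is typically added before blob extraction.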

II. METHODOLOGY

The block diagram of the proposed system is illustrated in Fig. 1. Frames are generated from the input video. Each frame undergoes pre-processing, which includes RGB to gray-scale conversion, resizing of the frame to 320 x 240, and normalisation. The next step is object localisation, which is carried out using background subtraction. Finally, the detected objects are tracked using DaubCxWT and HOT features.

A. Foreground Extraction

In the area of video processing and computer vision, background subtraction is a real-time technique used to extract foreground objects from an image or video stream. This technique assists in distinguishing moving people in videos captured with static cameras. The distinction is made by taking the absolute difference between the reference model (with no moving objects) and the current frame.

B. Human Detection Using AdaBoost

After the separation of the moving humans from the background scene, human blobs are generated and validated using the AdaBoost algorithm. It is quite complicated to detect humans in motion, as human bodies exhibit a variety of poses and external appearance; hence, the AdaBoost algorithm is used to obtain a good result. This algorithm can detect people in motion, with varying orientations and sizes, against a challenging background. Boosting is an approach to machine learning which combines many relatively weak and inaccurate rules to create a highly accurate prediction rule. The algorithm automatically assigns the dominant orientations, using oriented gradients that are insensitive to varying illumination and noise, in order to overcome the effects of geometric and rotational variations.
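The boosting idea itself is generic. As a minimal sketch (not the paper's actual detector, which boosts oriented-gradient cell features), the following combines threshold "stumps" on a single scalar feature into a strong classifier:

```python
import numpy as np

def train_adaboost(x, y, rounds=5):
    """Minimal AdaBoost with threshold stumps on one scalar feature.
    x: feature values, y: labels in {-1, +1}."""
    n = len(x)
    w = np.ones(n) / n                       # sample weights
    model = []
    for _ in range(rounds):
        best = None
        for t in x:                          # candidate thresholds
            for s in (1, -1):                # stump polarity
                pred = np.where(x > t, s, -s)
                err = w[pred != y].sum()     # weighted error
                if best is None or err < best[0]:
                    best = (err, t, s)
        err, t, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weak-learner weight
        pred = np.where(x > t, s, -s)
        w *= np.exp(-alpha * y * pred)       # re-weight: boost the mistakes
        w /= w.sum()
        model.append((alpha, t, s))
    return model

def predict(model, x):
    # Strong classifier: sign of the alpha-weighted vote of all stumps.
    votes = sum(a * np.where(x > t, s, -s) for a, t, s in model)
    return np.sign(votes)
```

The same weighted-vote structure underlies cascades that boost richer weak learners, such as the oriented-gradient cell features used here.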

Fig. 1. Block Diagram of the Proposed System.

In a multi-resolution framework to identify human shapes at various scales, the AdaBoost algorithm derives a histogram of gradient orientations by employing a Sobel convolution kernel [7]. This method was originally developed to identify the facial region, using the assumption that the features of a face remain at roughly the same position within a square window of 9 non-overlapping cells. The method is extended to identify human regions by a modification in which 15 cells are located at particular positions around the human silhouette. These 15 cells are the most prominent cells characterising the human shape [8].

After the humans are detected using AdaBoost, features are extracted in order to track every person. Here, we extract two sets of features, which are further combined to form the feature vector: the Daubechies complex wavelet coefficients, obtained by applying the Daubechies Complex Wavelet Transform (DaubCxWT), and the Histogram of Template (HOT) features.

C. Daubechies Complex Wavelet Transform (DaubCxWT)

The DaubCxWT is an approximately shift-invariant transform. This property makes it suitable for tracking objects in a video stream; hence it has been adopted in our system to track multiple humans efficiently. The scaling equation of multi-resolution theory is given by

$$\phi(t) = 2\sum_{n} a_n\, \phi(2t - n) \qquad (1)$$

where the $a_n$ are the filter coefficients with $\sum_n a_n = 1$. The $a_n$ can be real- or complex-valued; Daubechies considered only real values. The DaubCxWT is obtained by allowing complex values of both $a_n$ and $\phi(t)$, which provides a more general solution. A multi-resolution analysis of $L^2(\mathbb{R})$ and the scaling function $\phi(t)$ are used to define Daubechies's wavelet bases $\{\psi_{j,k}(t)\}$ in a single dimension. The details of constructing the Daubechies complex wavelet are explained in [9]. The wavelet $\psi(t)$ is given by

$$\psi(t) = 2\sum_{n} (-1)^n\, \overline{a_{1-n}}\, \phi(2t - n) \qquad (2)$$

Any function $f(t)$ can be decomposed over the complex scaling function and mother wavelet as

$$f(t) = \sum_{k} c_k^{j_0}\, \phi_{j_0,k}(t) + \sum_{j = j_0} \sum_{k} d_k^{j}\, \psi_{j,k}(t) \qquad (3)$$

where $j_0$ is a given resolution level, $\{c_k^{j_0}\}$ are the approximation coefficients and $\{d_k^{j}\}$ are the detail coefficients.

The following significant benefits of the DaubCxWT, based on its properties, are noted for people tracking in video streams:
• The linear phase property of the DaubCxWT retains the shape of the signal, which is essential in object tracking applications; this lessens incorrect tracks of people.
• The real components of the DaubCxWT represent only a few of the stronger edges, whereas the imaginary part represents every strong edge. This helps in retaining edges; hence the DaubCxWT is rightly called a local edge detector.

A transform is said to be shift-sensitive if a shift of the input signal causes an unpredictable change in the transform coefficients. Shift-sensitivity is reduced in the DaubCxWT. As the tracker navigates through the video frames, a reconstruction using real-valued Discrete Wavelet Transform (DWT) coefficients is altered remarkably, whereas with the complex wavelet transform all local shifts and orientations are reconstructed in a similar fashion. Therefore, the boundaries of the object in subsequent frames can be found quickly and accurately by employing the DaubCxWT [13].

D. Histogram of Template (HOT)

Gradient-based feature extraction procedures do not take texture information into consideration; they use only gradient information. When both texture and gradient information are considered, a more precise detection result can be acquired. Histogram of Template, a feature extraction process proposed by S. Tang and S. Goto, is therefore employed here [11]. The method extracts features such as gradient and intensity information and makes the two kinds of information homologous. These features also encode the relationship of three pixels in one template.
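The coefficient-energy comparison at the heart of the DaubCxWT tracking step (Section II-C) can be sketched as follows. The real-valued db2 filter pair is used here as a simplified stand-in for the complex-valued DaubCxWT filters, which is an assumption for illustration only; the complex construction itself follows [9]:

```python
import numpy as np

# Real Daubechies db2 low/high-pass filters, used as a stand-in for the
# complex-valued DaubCxWT filter pair (simplifying assumption).
s3 = np.sqrt(3.0)
LO = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
HI = LO[::-1] * np.array([1.0, -1.0, 1.0, -1.0])  # quadrature mirror filter

def dwt_rows(a, f):
    # Filter every row with f (periodic boundary), then downsample by 2.
    n = a.shape[1]
    idx = (2 * np.arange(n // 2)[:, None] + np.arange(len(f))[None, :]) % n
    return a[:, idx] @ f

def detail_energy(patch):
    """Energy of one level of 2-D detail coefficients of a square patch."""
    lo_r = dwt_rows(patch, LO)        # rows, low-pass
    hi_r = dwt_rows(patch, HI)        # rows, high-pass
    lh = dwt_rows(lo_r.T, HI).T       # the three detail sub-bands
    hl = dwt_rows(hi_r.T, LO).T
    hh = dwt_rows(hi_r.T, HI).T
    return float((lh**2).sum() + (hl**2).sum() + (hh**2).sum())

def track_by_energy(ref_patch, candidates):
    # Pick the candidate patch whose coefficient energy is closest
    # to that of the reference patch from the previous frame.
    e_ref = detail_energy(ref_patch)
    diffs = [abs(detail_energy(c) - e_ref) for c in candidates]
    return int(np.argmin(diffs))
```

In the actual system this comparison runs over neighbouring candidate regions inside the adaptive search window rather than an explicit candidate list.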

Fig. 2. 12 Templates illustrating Spatial Relationship of Three Pixels.
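To make the template tests of functions (4)-(7) concrete, here is a small sketch of the texture test of function (4) accumulated into an 8-bin histogram. The specific neighbour offsets chosen for templates (1)-(8) are illustrative guesses, since the exact geometry is defined by Fig. 2:

```python
import numpy as np

# Offset pairs (P1, P2) relative to the centre pixel P for templates (1)-(8).
# These pairs are illustrative guesses at the Fig. 2 geometry.
TEMPLATES = [((-1, 0), (1, 0)), ((0, -1), (0, 1)),
             ((-1, -1), (1, 1)), ((-1, 1), (1, -1)),
             ((-1, 0), (0, 1)), ((0, 1), (1, 0)),
             ((1, 0), (0, -1)), ((0, -1), (-1, 0))]

def hot_texture_histogram(I):
    """8-bin histogram of pixels meeting each template under function (4):
    P meets template k if I(P) exceeds both of its template neighbours."""
    h = np.zeros(len(TEMPLATES))
    rows, cols = I.shape
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            for k, ((dy1, dx1), (dy2, dx2)) in enumerate(TEMPLATES):
                if I[y, x] > I[y + dy1, x + dx1] and I[y, x] > I[y + dy2, x + dx2]:
                    h[k] += 1
    return h
```

The gradient-magnitude functions (6)-(7) follow the same pattern with gradient magnitudes G in place of intensities I.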

Fig. 2 shows the 12 templates describing the spatial relationship of three pixels [10], [11]. It is expected that these templates can be used to reconstruct the human body; hence all the information of the human body can be represented by them. In our work, templates (1) to (8) are employed for calculating the features. Formulas expressing the human body shape make use of these templates; they have reasonable computational complexity and are based on both texture and gradient information, which gives a concrete definition of these features [10], [11]. Texture information is obtained using the following two functions. The first is

$$F_1(P, k) = \big(I(P) > I(P_1)\big) \;\&\&\; \big(I(P) > I(P_2)\big) \qquad (4)$$

According to this function, for every template $k$, a pixel $P$ is said to meet the template if the intensity value of $P$ is greater than those of the two neighbouring pixels $P_1$ and $P_2$. It captures the pixels with the largest value in one template, and the properties of the local parts of the human body are well reflected by the histogram of pixels that satisfy each template within a sub-window. This histogram of pixels meeting the different templates is calculated as the feature. The histogram of pixels with 8 bins is illustrated in Fig. 3. Each template is represented as a bin in the histogram; the number of pixels meeting the corresponding template in the given region determines the value of each bin. The second function is

$$F_2(P) = \arg\max_{k}\; \big(I(P) + I(P_1) + I(P_2)\big) \qquad (5)$$

Fig. 3. Histogram of Pixels.

Fig. 4. Final HOT Vector.

If, in template $k$, the sum of the intensity values of the three pixels is greater than in the other templates, $P$ is said to meet template $k$; function (5) thus finds the template with the biggest sum and can be used to compute a histogram. Similar formulas are defined for the gradient magnitude information:

$$F_3(P, k) = \big(G(P) > G(P_1)\big) \;\&\&\; \big(G(P) > G(P_2)\big) \qquad (6)$$

$$F_4(P) = \arg\max_{k}\; \big(G(P) + G(P_1) + G(P_2)\big) \qquad (7)$$

Function (6) means that, for each template, if the gradient magnitude of $P$ is greater than those of the other two pixels, $P$ meets this template. For function (7), $P$ meets template $k$ if the sum of the gradient magnitudes of the three pixels in template $k$ is the biggest among all templates. For each function, an $m$-dimensional vector can be extracted from a given region, where $m$ denotes the number of templates used in feature extraction. Here, an 8-dimensional vector is obtained for each function, since we are using 8 templates. The final feature vector is generated by integrating all these vectors together, as shown in Fig. 4; with 8 templates for each of the four functions, the final histogram contains 32 bins. Compared with methods that use only gradient features, such as HOG, the proposed feature shows more discriminative ability. The advantage that the HOT feature possesses over HOG is that it uses texture information along with gradient information, whereas HOG uses the three colour channels of the image only for gradient calculation, neglecting texture information as a cue for detection.

E. Temporal Object Tracking

The generated DaubCxWT coefficients and HOT features are combined to form the feature vector (FV). The next stage is to track the detected people in the successive frames. In our system, temporal tracking is used for object tracking: at a reference frame, a rectangle is drawn around the detected object, and the pixels in this rectangle must be tracked temporally. This is the final step before locating the object in the next frame. The FVs are used by the temporal tracking to find the new pixel locations within an adaptive search window.

The objective of temporal tracking is to find the moving object of interest in the next frame using the available information about the object in both the background frame and the current frame. The FV is constructed so that it corresponds to the pixels around the area of the object of interest; it is therefore used to find the best-matched area of the object in consecutive frames. To find the exact position of the object in frame t+1, a search is carried out within a search window of frame t+1 for the area matching area a of frame t. In order to track the area a of frame t, our method utilises the FV of every pixel and correlation to the best-matched regions, as explained in the following procedure:
• Generate FVs for the pixels present in both the search window and area a.
• Sweep the search window with a search area having the same dimensions as a.
• Perform correlation between the FVs of the pixels in the search area and the FVs of the pixels within area a, to find the optimal matching area.
The procedure specified above is similar to a block-matching algorithm, except that it exploits the FV generated for a pixel rather than its luminance. An adaptive and effective updating mechanism for the search window is required to follow the change of object location. The search window location is set to prevent the loss of the object, which guarantees that the object lies within the search area. A variable and large search window increases computational complexity; this can be efficiently reduced by a location-adaptive, fixed-size search window [12].

III. EXPERIMENTAL RESULTS

The results of the proposed work are presented in this section. The video clips considered here have a frame size of 320 x 240 with a static background [14]. The algorithms are processed on the gray scale of every frame. The foreground is generated by taking the absolute difference between the generated background model and the gray frame. AdaBoost is applied to detect humans in the frame by locating 15 cells at specific locations around the human silhouette. Multiple people are then tracked based on feature updating in the search window. DaubCxWT combined with HOT features is used to track the people inside a search window of size 90 x 90. Each feature is computed on a patch with random size and position, sampled from within the bounding box of the detection. The HOT features are represented as a histogram containing 8 bins, where every bin corresponds to one template; out of the 12 different templates, 8 are used. The average accuracy attained is 90%.
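The search-window sweep used for tracking can be sketched as a normalised-correlation block match over feature maps. Here a single 2-D array stands in for the per-pixel feature vectors, which is a simplification of the actual FV correlation:

```python
import numpy as np

def best_match(ref_fv, window_fv, h, w):
    """Sweep an (h, w) box over window_fv and return the (row, col) offset
    whose contents correlate best with ref_fv (normalised correlation)."""
    a = ref_fv.ravel() - ref_fv.mean()
    na = np.linalg.norm(a)
    best, best_score = (0, 0), -np.inf
    H, W = window_fv.shape
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            b = window_fv[y:y + h, x:x + w].ravel()
            b = b - b.mean()
            nb = np.linalg.norm(b)
            # Zero-variance blocks cannot match; skip them via -inf.
            score = float(a @ b) / (na * nb) if na > 0 and nb > 0 else -np.inf
            if score > best_score:
                best_score, best = score, (y, x)
    return best
```

In the full system the sweep is restricted to the location-adaptive 90 x 90 search window, which keeps this exhaustive scan cheap.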

Fig. 5 represents the background of video 1, obtained by the adaptive background generation model. Fig. 6 (a), (b) and (c) display the human tracking results of video 1 in frames 100, 125 and 150 respectively.

Fig. 5. Background Image of Video 1.

Fig. 6. Tracking Result of Video 1 with Frames #100, #125 and #150.

Fig. 7 (a) and (b) show the efficient tracking of all 3 persons in video 1. Each person is represented with a different colour bounding box to represent the individual's track. Fig. 8 (a), (b) and (c) show another test case, video 2, illustrating successful tracking in frames 100, 125 and 150 respectively. Fig. 9 illustrates the human tracking result for video 2 when the search windows of the humans overlap.

Fig. 7. Tracking Result of Video 1 with Frames #319 and #325.

Fig. 8. Tracking Result of Video 2 with Frames #100, #125 and #150.

Fig. 9. Tracking when Person Search Windows Overlap.

IV. CONCLUSIONS

In this paper, an effective human tracking system is presented. The algorithms used in this work are suitable for improving performance through the target object's representation and the prediction of its location. In the present work, moving objects are effectively detected by human blob generation using the AdaBoost algorithm. The HOT feature is used to extract both the texture information and the gradient information for the detected human. The Daubechies Complex Wavelet Transform (DaubCxWT) is applied to this region to get the coefficients associated with the human, and the Daubechies complex wavelet coefficients and HOT features are combined to form the feature vector. The temporal object tracking serves to enhance the region whose motion is being tracked. Further, humans are tracked by defining a search window, which is updated based on the computed feature vector. The proposed method has given a 90% tracking success rate.

REFERENCES

[1] G. Varghese, Z. Wang, "Video Denoising Based on a Spatiotemporal Gaussian Scale Mixture Model," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 20, no. 7, pp. 1032-1040, 2010.
[2] Y. Wu, T. S. Huang, "A Co-inference Approach to Robust Visual Tracking," in proc. IEEE Int. Conf. on Computer Vision, Vol. 2, pp. 26-33, Vancouver, Canada, Jul. 2001.
[3] U. Orguner, F. Gustafsson, "Risk Sensitive Particle Filters for Mitigating Sample Impoverishment," IEEE Transactions on Signal Processing, Vol. 56, no. 10, pp. 5001-5012, 2008.
[4] H. B. Shitrit, J. Berclaz, F. Fleuret, P. Fua, "Tracking Multiple People Under Global Appearance Constraints," in proc. IEEE Int. Conf. on Computer Vision, pp. 137-144, Barcelona, Nov. 2011.
[5] K. K. Hati, P. Kumar Sa, B. Majhi, "LOBS: Local Background Subtracter for Video Surveillance," in proc. Asia-Pacific Conference on Postgraduate Research in Microelectronics & Electronics, pp. 29-34, Hyderabad, India, 2012.
[6] Sunitha M R, H S Jayanna, Ramegowda, "Tracking Multiple Moving Object based on Combined Color and Centroid Feature in Video Sequence," in proc. IEEE Int. Conf. on Computational Intelligence and Computing Research, pp. 846-850, Coimbatore, India, 2014.
[7] E. Corvee, F. Bremond, "Combining Face Detection and People Tracking in Video Sequences," in proc. IEEE Int. Conf. on Crime Detection and Prevention, pp. 1-6, London, UK, 2009.
[8] S. Bak, E. Corvee, F. Brémond, M. Thonnat, "Person Re-identification Using Haar-based and DCD-based Signature," in proc. IEEE Int. Conf. on Advanced Video and Signal Based Surveillance, pp. 1-8, 2009.
[9] I. Daubechies, "Ten Lectures on Wavelets," CBMS-NSF Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics, vol. 61, Philadelphia, PA, 1992.
[10] S. Tang, S. Goto, "Histogram of Template for Pedestrian Detection," IEICE Transactions on Information and Systems, Vol. 93, no. 7, pp. 1737-1744, 2010.
[11] S. Tang, S. Goto, "Histogram of Template for Human Detection," in proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 2186-2189, Mar. 2010.
[12] M. Khansari, H. R. Rabiee, M. Asadi, M. Ghanbari, "Object Tracking in Crowded Video Scenes Based on the Undecimated Wavelet Features and Texture Analysis," EURASIP Journal on Advances in Signal Processing, pp. 1-18, 2008.
[13] O. Prakash, M. Khare, C. M. Sharma, A. K. Singh Kushwaha, "Moving Object Tracking in Video Sequences Based on Energy of Daubechies Complex Wavelet Transform," Int. Journal of Computer Applications, pp. 6-10, 2012.
[14] http://cvlab.epfl.ch/data/pom