A Metaheuristic Bat-Inspired Algorithm for Full Body Human Pose ...

2012 Ninth Conference on Computer and Robot Vision

A Metaheuristic Bat-Inspired Algorithm for Full Body Human Pose Estimation

S. Akhtar, A.R. Ahmad, E. M. Abdel-Rahman
Systems Design Engineering Dept., University of Waterloo, Waterloo, Canada
sohail.akhtar;[email protected], [email protected]

loses the diversity needed to explore the whole solution space, which results in a sub-optimal solution, and the solution may even diverge. Many methods, such as the Unscented Particle Filter (UPF) [3], Genetic Algorithm (GA) [4], local search [5], multiple hypothesis tracking [6], partitioned sampling [2], covariance scaled sampling [7] and annealed PF [8], have been developed to improve the sampling efficiency of the generic PF and to alleviate the requirement of a large number of particles. Many of these algorithms [2], [3], [6], [7], [8] use complicated optimization methods to allocate the particles in important areas of the search space. Although these approaches produce satisfactory results, they lose diversity in the exploration of the search space.

In the last few decades, researchers have suggested that nature is a great source for the development of intelligent systems and has provided solutions to many complicated problems [9]. Natural systems have evolved over millennia to solve such problems. A close examination of these systems suggests that they contain many simple elements that, when working together, produce complex emergent behavior. These natural systems have inspired several natural computing paradigms, such as PSO [10], Artificial Immune System (AIS) [11], Genetic Algorithm (GA) [12], [13], [14] and Ant Colony Optimization (ACO) [15]. These nature-inspired algorithms are used to solve many optimization problems where conventional computing techniques perform unsatisfactorily. These techniques have a built-in capability to explore a larger region of the solution space and can avoid premature convergence.

In this paper, the recently proposed BA [16], based on the echolocation of bats, is used for full body human tracking. The method allows a diversified search of the solution space. It has sophisticated group behavior like PSO and excellent local search capabilities, along with annealing characteristics.
The rest of this paper is organized as follows: Section II formulates the artificial bat algorithm along with implementation details and modifications specific to the full body human pose estimation problem. Section III explains the articulated human model and the image likelihood computations used in this study. Experimental results are given in Section IV and conclusions are given in Section V.

Abstract—This paper addresses the problem of full body articulated human motion tracking from multi-view video data recorded in a laboratory environment. The problem is formulated as a high dimensional (31-dimensional) non-linear optimization problem. In recent years, metaheuristics such as Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), Artificial Immune System (AIS) and Firefly Algorithm (FA) have been applied to complex non-linear optimization problems. These population based evolutionary algorithms have diversified search capabilities and are computationally robust and efficient. One such recently proposed metaheuristic, the Bat Algorithm (BA), is employed in this work for full human body pose estimation. The performance of BA is compared with Particle Filter (PF), Annealed Particle Filter (APF) and PSO using a standard data set. The qualitative and quantitative evaluation of full body human tracking performance demonstrates that BA performs better than PF, APF and PSO.

Keywords-Human pose estimation, bat optimization, articulated human tracking, soft computing, swarm optimization.

I. INTRODUCTION

Full body human pose estimation from video sequences is a challenging and important computer vision problem. It has numerous applications such as cartoon character animation, human gait analysis, sports biomechanics, robotics, human computer interfaces and gesture recognition. Tracking the human body in three dimensions is a highly non-linear and multimodal optimization problem. The task is further complicated by the high dimensionality of the search space, as even a coarse representation of the human body requires 25-31 degrees of freedom. In a general tracking framework, the target position is first estimated and then a search is carried out around this estimated region. The Particle Filter (PF) has been successfully applied to solve non-linear optimization problems [1], but the number of particles required for reliable tracking grows exponentially with the dimension [2]. This requirement for a high number of particles adds to the computational cost. PF also suffers from degeneracy, which occurs when only a few particles have significant weights. Particles with negligible weights are discarded during the resampling process. As a result, the particles after resampling are no longer truly representative of the posterior distribution. So PF

978-0-7695-4683-4/12 $26.00 © 2012 IEEE DOI 10.1109/CRV.2012.55


$$x_i^t = x_i^{t-1} + v_i^t \tag{3}$$

II. ECHOLOCATION OF BATS

A. Behavior of Natural Bats

Here $\beta \in [0, 1]$ is a random vector drawn from a uniform distribution, and $x^*$ is the current global best solution among all $N$ bats. In our full body human tracking problem, $f_{\min}$ is set to a small value and $f_{\max}$ is varied in accordance with the maximum variance allowed in each time step. Accordingly, following Eq. (1), each bat is assigned a frequency in the range $[f_{\min}, f_{\max}]$ that affects the bat's velocity.
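The movement rules of this section (frequency draw, velocity update, position update) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `move_bat` is hypothetical, positions are plain lists, and a single scalar $\beta$ is drawn per bat, which is an assumption since the text describes $\beta$ as a random vector.

```python
import random

def move_bat(x_prev, v_prev, x_best, f_min, f_max, rng=random):
    """One movement step for a single bat (hypothetical helper)."""
    beta = rng.random()                     # assumption: one scalar beta per bat
    f = f_min + (f_max - f_min) * beta      # frequency, Eq. (1)
    # velocity update: v_i^t = v_i^{t-1} + (x_i^{t-1} - x*) f_i
    v = [vp + (xp - xb) * f for vp, xp, xb in zip(v_prev, x_prev, x_best)]
    # position update: x_i^t = x_i^{t-1} + v_i^t
    x = [xp + vi for xp, vi in zip(x_prev, v)]
    return x, v, f

# With f_min == f_max the step is deterministic.
x, v, f = move_bat([2.0], [1.0], [0.0], 0.5, 0.5)
```

A bat that already sits at the global best with zero velocity does not move under these rules, which is one reason the local search step is needed to keep exploring.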

Bats are the only mammals that can fly. There are many different species of bats, and they vary in size. Among them, microbats make extensive use of echolocation [17], [18]. They use a type of sonar to detect prey, avoid obstacles, and locate their roosting crevices in the dark. They emit loud sound pulses and listen for the echoes that bounce off surrounding objects. These sound pulses vary in their properties, and different species emit pulses of different bandwidths. Bats emit sound pulses of constant frequency in the range of 25 kHz to 150 kHz. Each ultrasonic sound burst lasts for a very short time, typically 5 to 20 ms. Microbats normally emit about 10-20 sound bursts every second. The rate of emission of these sound pulses increases (to about 200 pulses per second) when they fly close to their prey. While searching for prey, bats produce loud sound pulses (in the range of 110 dB), but when they get closer to their prey, they become quieter. Such echolocation capabilities of microbats can be associated with the objective function to be optimized, and optimization algorithms can be formulated that mimic this bat behavior in finding the optimal solution.

D. Local Search

For local search, once a solution is selected among the current best solutions, a new solution is generated on the basis of the current loudness $A_i$ of the bat and the maximum allowed variance $\max(\mathrm{var})$ during a time step as:

$$x_{\text{new}} = x_{\text{old}} + \varepsilon A_i \cdot \max(\mathrm{var}) \tag{4}$$

E. Loudness and Pulse Emission

The loudness usually decreases once a bat has found its prey, and the rate of pulse emission increases. In our experiments, the loudness and pulse emission rate are updated once a solution is improved. The bat moves towards the optimal solution according to:

Bat-inspired algorithms, or bat algorithms, can be developed by idealizing some of the echolocation characteristics of microbats. The following assumptions are made to approximate the bats' echolocation properties in order to solve an optimization problem [16]:
1) All bats use echolocation to sense distance.
2) Bats fly randomly with velocity $v_i$ at position $x_i$ with a fixed frequency/wavelength $f_{\min}$, a varying wavelength/frequency $\lambda$ and loudness $A_0$ to search for prey.
3) Depending on the proximity of the prey, they adjust their wavelength/frequency and can adjust the pulse emission rate $r_i \in [0, 1]$.
4) Their loudness varies from a large value $A_0$ to a small value $A_{\min}$ as they come close to the prey.
In practical implementations, the frequency lies in a range $[f_{\min}, f_{\max}]$ and is chosen such that it is comparable to the size of the domain of interest.
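As a rough illustration of how these idealized rules might fit together, the following Python sketch runs a bat population on a toy sphere function rather than on the pose likelihood. The control flow (when local search is triggered, the acceptance test, a scalar `max_var` step) follows one commonly used arrangement of the bat algorithm and is an assumption, since the flow chart of Figure 1 is not reproduced here. All function and parameter names are hypothetical.

```python
import math
import random

def bat_algorithm(cost, dim, n_bats=20, n_iter=50, f_min=0.0, f_max=1.0,
                  alpha=0.9, gamma=0.9, max_var=1.0,
                  lower=-5.0, upper=5.0, seed=0):
    """Minimize `cost` with a basic bat algorithm (illustrative sketch)."""
    rng = random.Random(seed)
    x = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_bats)]
    v = [[0.0] * dim for _ in range(n_bats)]
    A = [rng.uniform(0.1, 0.9) for _ in range(n_bats)]   # initial loudness
    r0 = [rng.random() for _ in range(n_bats)]           # initial emission rate
    r = [0.0] * n_bats                                   # current emission rate
    fit = [cost(xi) for xi in x]
    b = min(range(n_bats), key=fit.__getitem__)
    best, best_fit = x[b][:], fit[b]
    for t in range(1, n_iter + 1):
        for i in range(n_bats):
            # global move: frequency, velocity and position updates
            f = f_min + (f_max - f_min) * rng.random()
            v[i] = [vi + (xi - xb) * f for vi, xi, xb in zip(v[i], x[i], best)]
            cand = [xi + vi for xi, vi in zip(x[i], v[i])]
            # local search around the best solution, scaled by loudness
            if rng.random() > r[i]:
                eps = rng.uniform(-1.0, 1.0)
                cand = [xb + eps * A[i] * max_var for xb in best]
            cand_fit = cost(cand)
            # accept improvements, then lower loudness and raise pulse rate
            if cand_fit <= fit[i] and rng.random() < A[i]:
                x[i], fit[i] = cand, cand_fit
                A[i] *= alpha
                r[i] = r0[i] * (1.0 - math.exp(-gamma * t))
            if cand_fit < best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit

def sphere(p):
    """Toy objective with its minimum at the origin."""
    return sum(c * c for c in p)

best, best_cost = bat_algorithm(sphere, dim=3)
```

Because the best solution is only replaced by a strictly better candidate, the returned cost can never be worse than the best cost of the initial population.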

$$A_i^{t+1} = \alpha A_i^t \tag{5}$$

$$A_i^t \to 0, \qquad r_i^t \to r_i^0, \qquad \text{as } t \to \infty$$
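The loudness and pulse emission updates of this section can be checked numerically with a short Python sketch. The function name is hypothetical, and the default values $\alpha = \gamma = 0.9$ are the ones reported in the text.

```python
import math

def update_loudness_and_rate(A, r0, t, alpha=0.9, gamma=0.9):
    """Return the next loudness and pulse emission rate (hypothetical helper)."""
    A_next = alpha * A                            # loudness decays geometrically
    r_next = r0 * (1.0 - math.exp(-gamma * t))    # rate rises towards r0
    return A_next, r_next

# Loudness after one update, and the emission rate early and late in a run.
A1, r_start = update_loudness_and_rate(0.9, 0.5, t=0)
_, r_late = update_loudness_and_rate(0.9, 0.5, t=100)
```

With $0 < \alpha < 1$ the loudness shrinks at every accepted improvement, while the emission rate starts at zero and saturates at its initial value $r_i^0$, which mirrors the cooling behavior of simulated annealing mentioned in the text.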

The initial loudness is $A_i \in [0.1, 0.9]$, the initial emission rate is $r_i^0 \in [0, 1]$, and $\alpha = \gamma = 0.9$. $A_i$ decreases as the solution improves, resulting in a more precise local search, as is evident from Eq. (4). Based on the above idealization, a flow chart of the BA is given in Figure 1.

III. ARTICULATED HUMAN MODEL AND THE IMAGE LIKELIHOOD

In this study, the computational framework proposed by Balan et al. [19] is used. It enables us to conduct a fair comparison with other tracking algorithms such as PF, annealed PF and PSO. A generative approach to Articulated Human Tracking (AHT) is adopted in this study. The human body model, which plays a key role in generative AHT, is explained in this section. The image likelihood computation, an important component in any estimation based tracking study, is also explained.

For virtual bats to solve an optimization problem, rules need to be defined to set their positions and velocities in the d-dimensional search space. The new position $x_i^t$ and velocity $v_i^t$ at time step $t$ are given as:

$$v_i^t = v_i^{t-1} + (x_i^{t-1} - x^*) f_i \tag{2}$$

$$r_i^{t+1} = r_i^0 \left[ 1 - e^{-\gamma t} \right]$$

where α and γ are constants. In fact, α is like the cooling factor in a simulated annealing scheme. For 0 < α < 1 and γ > 0, we get

C. Movement of Bats/Generation of New Solutions


where ε ∈ [−1, 1] is a random number.
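The local search step of Eq. (4) can be sketched as follows. This is a minimal illustration under stated assumptions: `max_var` is taken to be a per-dimension list of maximum allowed changes, $\varepsilon$ is a single scalar shared by all dimensions, and the function name is hypothetical.

```python
import random

def local_search(x_old, loudness, max_var, rng=random):
    """Perturb a selected solution as in Eq. (4) (hypothetical helper)."""
    eps = rng.uniform(-1.0, 1.0)    # random number in [-1, 1]
    return [xo + eps * loudness * mv for xo, mv in zip(x_old, max_var)]

# The step in each dimension is bounded by loudness * max_var.
x_new = local_search([0.0, 0.0], loudness=0.5, max_var=[2.0, 4.0])
```

Since the step size is proportional to the loudness, the neighborhood being searched shrinks as the loudness decays, which gives a progressively finer local search.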

B. Bat Algorithm

$$f_i = f_{\min} + (f_{\max} - f_{\min}) \beta \tag{1}$$


intersecting at one point. This makes it clear that the unit of the human action is simply a rotation, i.e. all the human actions are combinations of rotations of body parts around their respective joints.

Figure 1. BA algorithm: Flow chart

Figure 2. The articulated human model

A kinematic tree-like hierarchical structure is used to represent the human body. Ten cylinders are used to represent the body parts, as shown in Figure 2 [19]. The joints have anatomical joint angle limits (e.g. the range of angles through which the human elbow can move is limited), and these limits should be taken into account during the tracking process. Hard priors are implemented to exclude implausible joint angles. The neck, shoulders and thighs each have 3 degrees of freedom, and the elbows and knees each have 1 degree of freedom. In addition, there are 3 degrees of freedom for the global position and 3 degrees of freedom for the global orientation. There are 2 degrees of freedom for torso rotation and 2 degrees of freedom for each of the right and left clavicles. Altogether, this gives a 31-dimensional state-space representation of the human body. The motion of a point on a body part is determined by forward kinematics. Each limb has a local coordinate system with its z-axis directed along the limb. Rigid transformations are used to specify the relative position and orientation of the body parts and for global coordinate translation. Camera calibration data is used to obtain the rotation and translation transformation vectors. These vectors are used to map points from world coordinates to the camera reference frame [21]. Each bat, in this 31-dimensional space, represents a body pose by aligning the skeleton in the 3-dimensional world, which is later projected onto the image plane.
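The 31-dimensional state can be verified by tallying the degrees of freedom listed in the description of the articulated model. The dictionary keys below are hypothetical labels chosen for this sketch, while the counts are the ones given in the text.

```python
# Degrees of freedom per component, as listed for the articulated model.
dof = {
    "global_position": 3,
    "global_orientation": 3,
    "torso": 2,
    "neck": 3,
    "left_clavicle": 2, "right_clavicle": 2,
    "left_shoulder": 3, "right_shoulder": 3,
    "left_elbow": 1, "right_elbow": 1,
    "left_thigh": 3, "right_thigh": 3,
    "left_knee": 1, "right_knee": 1,
}

total_dof = sum(dof.values())   # dimension of the search space for each bat
```

Each bat position is therefore a vector with `total_dof` entries, one per joint angle or global coordinate.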

A. Articulated Human Model

In this work, the human body is modelled as a set of cylinders of different sizes connected by joints [20], [19]. This representation is called an articulated body representation. The body parts include the torso, head, upper arms, forearms, thighs and calves. The torso is considered as the base/root of the articulated model, and all other parts are linked to it through joints. All parts of the human body are connected by revolute joints. A revolute joint has 1 degree of freedom. Its action causes a pure rotation. If a joint has more than one degree of freedom (e.g., the shoulder has 3 degrees of freedom), then it is constructed by joining revolute joints with their axes all

B. Image Likelihoods

Every visual tracking system requires an explicit or implicit model for the shape, appearance and dynamics of the object of interest. The lack of a suitable model limits the performance of the tracking system. We have considered silhouette information as the appearance model and edge gradient as the shape model in our simulations. These image characteristics are used to find the image likelihoods (cost function) as follows:


1) Silhouette Information: The silhouette of the object of interest is obtained by statistical background subtraction with a Gaussian mixture model. The foreground pixels are set to 1 and the background pixels are set to 0 to form a pixel map. The likelihood of a pose is then estimated by taking a number of visible points on each limb and projecting them into the image. The MSE between the predicted and observed silhouette values for these points is computed as [8]:

$$\mathrm{MSE}_{\text{silhouette}} = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - p_i^s \right)^2 \tag{6}$$

IV. EXPERIMENTAL RESULTS

A. Ground Truth and Video Data

To evaluate the performance of the BA for AHT, a publicly available data set provided by [19] is used. This data set contains both video and ground truth data. The ground truth motion data is captured by a commercial Vicon system (Vicon Motion Systems Ltd, Lake Forest, CA) that uses reflective markers and six 1M-pixel cameras to recover the three dimensional pose and motion of human subjects. Video data is captured simultaneously from four Pulnix TM6710 cameras (JAI Pulnix, Sunnyvale, CA). These are grayscale progressive scan cameras with a resolution of 644×488 pixels and a frame rate of 60 Hz. The Vicon system is calibrated using Vicon's proprietary software, while the video cameras are calibrated using Matlab camera calibration software.

where $p_i^s$ are the values of the foreground pixel map at the $N$ sampling points taken from the interior of the predicted cylinders making up the human skeleton [8]. The silhouette likelihood (LH) is then computed as:

$$\mathrm{LH}_{\text{silhouette}} = e^{-\mathrm{MSE}_{\text{silhouette}}} \tag{7}$$
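The silhouette cost can be illustrated with a small Python sketch. It assumes the pixel map has already been sampled, so that `samples` is simply a list of foreground values in [0, 1] at the projected points. The function names are hypothetical.

```python
import math

def mse_from_samples(samples):
    """Mean squared deviation of sampled pixel-map values from 1."""
    return sum((1.0 - p) ** 2 for p in samples) / len(samples)

def likelihood_from_mse(mse):
    """Map an MSE value to a likelihood in (0, 1]."""
    return math.exp(-mse)

# Half of the sampled points fall on the foreground silhouette.
mse = mse_from_samples([1, 0, 1, 0])
lh = likelihood_from_mse(mse)
```

A pose whose sampled points all land on foreground pixels gives an MSE of 0 and hence the maximum likelihood of 1, and the likelihood decreases as more points fall on the background.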

B. Error Measurement

The 3D error between the predicted and the ground truth human pose is computed following the procedure in [19]. Fifteen virtual markers ($m \in M$, $|M| = 15$) are placed on the joints of the subject. For each bat $x_i^t$, the human pose error $e_{\text{human}}(x_i^t, \tau^t)$ is computed as the average distance between the virtual markers, as placed on the predicted human pose, and the true pose $\tau^t$ as:

$$e_{\text{human}}(x_i^t, \tau^t) = \frac{\sum_{m \in M} \left\lVert m(x_i^t) - m(\tau^t) \right\rVert}{|M|} \tag{12}$$
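The error measure of Eq. (12) is the mean Euclidean distance between corresponding markers. A minimal Python sketch is given below, assuming the marker positions are already available as lists of 3D points in millimetres. The function name and the toy coordinates are hypothetical.

```python
import math

def pose_error(predicted_markers, true_markers):
    """Average 3D distance between corresponding markers, as in Eq. (12)."""
    distances = [math.dist(p, q)
                 for p, q in zip(predicted_markers, true_markers)]
    return sum(distances) / len(distances)

# Two markers displaced by 3 mm and 4 mm respectively.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
truth = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error_mm = pose_error(predicted, truth)   # 3.5
```

In the experiments the same computation would run over the 15 virtual markers of each frame.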

2) Edge Gradient: The edges produced by a human subject give a good outline of the visible arms and legs. Moreover, the edges are somewhat invariant to color, clothing, texture and lighting. In this work, a gradient based edge detection mask is used as in [8]. The result is thresholded to eliminate spurious edges and then smoothed with a Gaussian kernel. It is remapped between 0 and 1 to produce a pixel map. A number of points ($p_i^e$) are selected on the edges of the predicted cylinders making up the body pose. The MSE between the true edge and the points on the predicted pose is computed as [8]:

$$\mathrm{MSE}_{\text{edge}} = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - p_i^e \right)^2 \tag{8}$$

where $m(x_i^t)$ is the 3D location of marker $m$ for the predicted pose and $m(\tau^t)$ is the marker location on the true pose.

C. Experiments

All experiments consider a video of a person walking in a circle in front of a multi-camera setup. A total of 150 frames are considered. In the first frame, the human model is initialized using ground truth motion capture data, and the sizes of the body segments are also predefined. Statistical background subtraction with a Gaussian mixture model is used to extract the object of interest. The Brown University framework is used to compare the performance of the various tracking algorithms. The algorithms (PF, APF, PSO, BA) are plugged into this framework and the experiments are conducted. The optimized parameters for each algorithm are used. The predictions for the next frame ($x_{i^-}^t$) at time instant $t$ are made using a temporal model. For PF and APF, a zero-velocity motion model, in which noise is added to the previous state $x_i^{t-1}$, is used to predict the joint angles (states of the system). The noise is drawn from a Gaussian distribution whose standard deviation is equal to the maximum inter-frame joint angle difference. It accounts for model uncertainties and produces a random velocity effect. In the case of PSO and BA, the predictions for the next frame at time instant $t$ are obtained using the best estimate at the previous time instant ($x_*^{t-1}$) as

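The edge likelihood is then computed as:

$$\mathrm{LH}_{\text{edge}} = e^{-\mathrm{MSE}_{\text{edge}}} \tag{9}$$

The two likelihoods are combined to obtain the Image Likelihood (ILH) as:

$$\mathrm{ILH} = \zeta \, \mathrm{LH}_{\text{silhouette}} + \eta \, \mathrm{LH}_{\text{edge}} \tag{10}$$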

where $\zeta$ and $\eta$ are constants in the range [0, 1] such that $\zeta + \eta = 1$. In these experiments $\zeta = \eta = 0.5$, i.e. equal weighting of the two likelihoods is considered. The video sequences obtained from multiple cameras are employed in this study. The ILH values obtained from edge and silhouette for each camera are combined to get an overall $\mathrm{ILH}_{\text{all cameras}}$ function (cost function) as follows:

$$\mathrm{ILH}_{\text{all cameras}} = \sum_{i=1}^{C} \mathrm{ILH}_i \tag{11}$$

where $C$ is the number of cameras used.
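The weighting of the two likelihoods and their combination over cameras can be sketched as follows. The sketch assumes that the per-camera values are summed, and it takes precomputed silhouette and edge MSE values as input. All names are hypothetical.

```python
import math

def image_likelihood(mse_silhouette, mse_edge, zeta=0.5, eta=0.5):
    """Weighted combination of silhouette and edge likelihoods for one camera."""
    lh_silhouette = math.exp(-mse_silhouette)
    lh_edge = math.exp(-mse_edge)
    return zeta * lh_silhouette + eta * lh_edge

def combined_likelihood(per_camera_mse):
    """Combine per-camera image likelihoods (assumption: summed over cameras)."""
    return sum(image_likelihood(ms, me) for ms, me in per_camera_mse)

# Four cameras, each given as (silhouette MSE, edge MSE); values are made up.
cost = combined_likelihood([(0.1, 0.3), (0.2, 0.2), (0.0, 0.5), (0.4, 0.1)])
```

With equal weights, a perfectly matching pose scores 1 in each camera, so four cameras give an upper bound of 4 under the summation assumed here.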


$$x_{i^-}^t = x_*^{t-1} + \eta_i$$

Here again, the noise ($\eta$) is drawn from a Gaussian distribution where the standard deviation of each body angle is equal to the maximum absolute inter-frame angular difference [8]. The biomechanical constraints are used as hard priors to constrain the search space and avoid implausible human poses. Any particle exceeding the joint angle limits is discarded and randomly initialized for the next frame/iteration.

There are two popular ways to compare the performance of different optimization algorithms. One is to set a certain convergence criterion and keep track of the number of function evaluations taken to reach it. The other is to fix the number of function evaluations and compare the function values attained. The second method is adopted in this study. The same number of likelihood evaluations is used to find the human pose, and the Most Likely/Appropriate Pose (MAP) error is then evaluated for each frame using Equation (12) to quantitatively compare the tracking performance. This error (in mm) is the sum of the error between the marker positions on the real person and the virtual markers placed on the 3D predicted skeleton.

A PF with 2500 particles and an APF with 500 particles and 5 annealing layers are used. For PSO, 500 particles and 5 iterations are used. The PSO is used with a fixed inertial weight as in the original algorithm [10]. Similarly, for BA, 300 particles with 5 iterations are used. It is observed that a further 50-60% of the original particles are selected for the neighborhood search. This makes the total number of function evaluations approximately equal to that of the other three algorithms.

Figure 3 shows the tracking performance of the different algorithms for a specific run in terms of MAP error. It can be seen that the performance of PF is inferior to that of the other three algorithms. In fact, the error builds up in this run and progressively diverges.
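The prediction step for PSO and BA and the hard priors on joint angles can be sketched as below. Re-initializing a rejected particle uniformly within the joint limits is an assumption, since the text only says that such particles are randomly initialized. The function names and the example limits are hypothetical.

```python
import random

def predict_pose(x_best_prev, sigma, rng=random):
    """Previous best estimate plus per-angle Gaussian noise."""
    return [xb + rng.gauss(0.0, s) for xb, s in zip(x_best_prev, sigma)]

def within_limits(x, lower, upper):
    """Hard prior: every joint angle must lie inside its anatomical range."""
    return all(lo <= xi <= hi for xi, lo, hi in zip(x, lower, upper))

def enforce_hard_prior(x, lower, upper, rng=random):
    """Keep a plausible pose, otherwise discard it and re-initialize."""
    if within_limits(x, lower, upper):
        return x
    # assumption: uniform re-initialization inside the joint limits
    return [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]

# Two joint angles in radians with made-up limits and noise levels.
lower, upper = [-1.0, 0.0], [1.0, 2.5]
candidate = predict_pose([0.2, 1.0], sigma=[0.1, 0.3])
pose = enforce_hard_prior(candidate, lower, upper)
```

Here `sigma` plays the role of the maximum absolute inter-frame angular difference of each body angle.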
The average distance error results for five independent runs are given in Table I. The table shows the mean value of the MAP error over the 150 frames considered in this experiment, along with the standard deviation. The tracking images for selected frames for one run of each tracking method are shown in Figure 6. In this figure, the black cylindrical model represents the ground truth pose and the cyan cylindrical model represents the estimated pose.

Table I
COMPARISON OF PF, APF, PSO AND BA RESULTS

Algorithm    MAP error/frame (in mm)
PF           57.144 ± 19.639
APF          43.433 ± 9.284
PSO          48.51 ± 7.851
BA           40.87 ± 8.001

Figure 3. Comparison of MAP distance error for PF, APF, PSO and BA

Experiments are run with different bat population sizes. The MAP error results are listed in Table II as an average of the 5 runs over all 150 frames. This table shows the mean and standard deviation of the MAP error. It shows that, as the bat population increases, the MAP error decreases almost linearly, as depicted in Figure 4. The frame-to-frame MAP error for a specific run is also shown in Figure 5. It confirms that the error remains bounded as the number of particles decreases, and that the deviation from the ground truth values decreases on average as the number of particles increases.

Table II
COMPARISON OF DIFFERENT BA POPULATIONS

Bat population    MAP error/frame (in mm)
100               42.743 ± 9.567
200               41.637 ± 9.013
300               40.87 ± 8.001
400               40.01 ± 8.103
500               39.14 ± 7.826

Figure 4. Change in MAP error with increase in Bat population
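The near-linear trend reported for Table II can be checked with an ordinary least-squares line through the five mean values. The sketch below uses only the means from the table and ignores the standard deviations. The helper name is hypothetical.

```python
# Mean MAP error (mm) per bat population, taken from Table II.
populations = [100, 200, 300, 400, 500]
map_error_mm = [42.743, 41.637, 40.87, 40.01, 39.14]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = least_squares_line(populations, map_error_mm)
# slope is about -0.0088 mm per bat, i.e. roughly 0.9 mm per 100 extra bats
```

The fitted gain is small compared with the standard deviations of about 8 to 10 mm reported in the table, which is worth keeping in mind when weighing accuracy against the extra computation.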

Figure 6. Comparison of PF (1st col), APF (2nd col), PSO (3rd col) and BA (4th col) (Frame# 90, 95, 100, 105, 110)


Figure 5. Comparison of MAP distance error of Bat Algorithms for three population sizes

V. CONCLUSION

In this paper, BA is used to estimate the full body human pose in video sequences. BA is based on the echolocation behavior of bats. It combines the advantages of PSO and Harmony Search (HS). The parameters of the original BA are modified in accordance with the requirements of full body human tracking. The bats' positions and velocities are updated in a manner similar to PSO, where $f_i$ essentially controls the pace and range of the movement of the bats. Thus, BA can be considered as a combination of PSO and intensive local search. The computational cost associated with each bat is also low. Each bat represents a potential solution in the search space, which corresponds to a possible skeletal configuration of the human body. A qualitative and quantitative comparison of BA with PF, APF and PSO is carried out. The MAP error results suggest that the BA performs better than the other three algorithms. The BA appears to be robust. Further, it is observed that, as the bat population increases, the tracking accuracy improves at the expense of a higher computational burden.

REFERENCES

[1] D. Simon, Optimal State Estimation: Kalman, H Infinity, and Nonlinear Approaches. New Jersey, USA: John Wiley & Sons Inc., 2006.
[2] J. MacCormick and M. Isard, “Partitioned sampling, articulated objects, and interface-quality hand tracking,” in European Conference on Computer Vision, Dublin, Ireland, June 2000.
[3] V. M. Rudolph and D. Arnaud, “The unscented particle filter,” Department of Engineering, Cambridge University: Technical Report, 2000.
[4] H. T., “Monte carlo filter using the genetic algorithm operators,” Journal of Statistics Computation and Simulation, vol. 59, pp. 1–23, 1997.
[5] P. Troma and C. Szepesvari, “Sls-n-ips: an improvement of particle filters by means of local search,” in Proc. Non-Linear Control Systems (NOLCOS01), 2001.
[6] C. T. and R. J., “A multiple hypothesis approach for figure tracking,” Proc. CVPR (2), pp. 239–245, 1999.
[7] C. Sminchisescu and B. Triggs, “Covariance scaled sampling for monocular 3d body tracking,” in Computer Vision and Pattern Recognition, Kauai Marriott, Hawaii, December 9-14, 2001.
[8] J. Deutscher and I. Reid, “Articulated body motion capture by stochastic search,” in IJCV, (61)2, pp. 185–205, 2004.
[9] E.-G. Talbi, Ed., Metaheuristics. From Design to Implementation. Hoboken, New Jersey: John Wiley & Sons Inc., 2009.
[10] J. Kennedy and R. C. Eberhart, The particle swarm: social adaptation in information-processing systems. Maidenhead, UK, England: McGraw-Hill Ltd., UK, 1999, pp. 379–388. [Online]. Available: http://portal.acm.org/citation.cfm?id=329055.329090
[11] L. N. de Castro and F. J. V. Zuben, “The clonal selection algorithm with engineering applications,” in GECCO’00, Workshop on Artificial Immune Systems and Their Applications, pp. 36–37, 2000.
[12] J. Holland, Adaptation in Natural and Artificial Systems. Ann Arbor, MI, USA: University of Michigan Press, 1975.
[13] A. Ahmad, O. Basir, K. Hassanein, and M. Imam, “An effective module placement strategy for genetic algorithms based layout design,” in Intl Journal of Production Research, vol. 44, no. 8, 2006, pp. 1545–1567.
[14] A. Ahmad, “An Intelligent Expert System for Decision Analysis & Support in Multi-Attribute Layout Optimization,” Ph.D. dissertation, University of Waterloo, 2005.
[15] M. Dorigo, “Optimization, Learning and Natural Algorithms,” Ph.D. dissertation, Politecnico di Milano, Italy, 1992.
[16] X.-S. Yang, “A new metaheuristic bat-inspired algorithm,” Science, vol. 284, pp. 65–74, 2010.
[17] P. Richardson, “Bats,” Natural History Museum, London, UK, 2008.
[18] Richardson. The secret life of bats. [Online]. Available: http://www.nhm.ac.uk
[19] A. O. Balan, L. Sigal, and M. J. Black, “A quantitative evaluation of video-based 3d person tracking,” Proceedings of the 14th International Conference on Computer Communications and Networks, October 15-16, 2005.
[20] H. Sidenbladh, M. Black, and D. Fleet, “Stochastic tracking of 3d human figures using 2d image motion,” in European Conference on Computer Vision, Dublin, Ireland, June 2000.
[21] R. Y. Tsai, “A versatile camera calibration technique for high accuracy 3d machine vision metrology using off-the-shelf tv cameras and lenses,” in IEEE J. Robotics Automat, RA-3(4), pp. 323–344, 1987.
