Human Body Pose Estimation With Particle Swarm Optimisation

S. Ivekovic, E. Trucco, Y. R. Petillot
[email protected]  [email protected]  [email protected]
Department of Electrical, Electronic and Computer Engineering, School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom

Abstract
In this paper we present the Particle Swarm Optimisation (PSO) approach to 3-D human body pose estimation from multi-view image data. Pose estimation from multiple views is a challenging problem requiring a powerful optimisation algorithm. PSO, a fairly recent technique, has been shown to tackle multimodal functions successfully and as such presents a promising tool for this problem. We use silhouettes extracted from the multiple views to construct the objective function and a layered subdivision surface body model to represent the pose. A synthetic sequence of a human performing a rotating walk is used to analyse different parameter settings and the performance of the algorithm. Multi-view image data of the upper human body, typical of immersive videoconferencing scenarios, is then used to show the applicability of the method to real data. The upper-body pose estimation algorithm is formulated in two different ways, in 3-D space and in disparity space, and the resulting pose estimates are presented.

Keywords: Particle Swarm Optimisation, Human Body Pose Estimation, Disparity Space

1 Introduction

Human body pose estimation from images is an important topic in many research areas. Examples include surveillance (Haritaoglu et al. (2000)), motion capture (Deutscher et al. (2000)), human gait analysis (Veeraraghavan et al. (2005)), human activity recognition (Ben-Arie et al. (2002)), sign language recognition (Ong and Ranganath (2005)), medical analysis (Kohle et al. (1997)), communications (Poppe et al. (2005)) and animation (Collomosse et al. (2003)). In this paper we address the problem of human body pose estimation for application in two distinct problem domains: surveillance with full-body passive millimeter-wave sensors (Haworth et al. (2007)) and immersive videoconferencing (Ivekovic and Trucco (2006); Isgrò et al. (2004)). In surveillance, millimeter-wave sensors have been investigated for the detection of threats concealed under clothing. In the setup reported by Haworth et al. (2007), the person is asked to enter the sensor's field of view and perform a rotating walk around their main body axis. The resulting millimeter-wave frame sequence is then inspected automatically to locate abnormal textures and body shapes suggesting the presence of possible threats. Fitting an a priori body model enables comparisons with predicted, synthetic frames generated by a simulator (Grafulla-Gonzalez et al. (2005)) and assists with the tracking of suspicious areas.


Figure 1: Human body pose estimation problem. The body pose is represented with a kinematic chain and constrained by the silhouette. Disc Thrower figure courtesy of FCIT.

Single-view millimeter-wave data does not contain enough information to accurately fit a body model for this purpose. We performed a set of experiments with a simulated synthetic video sequence to show that, for the expected constrained body motion, augmenting the sensor with two video cameras provides a sufficient constraint for reliably solving the pose estimation problem and fitting a model.

Immersive videoconferencing aims at recreating the sense of presence that is an integral part of an ordinary meeting. As it is impossible to place a camera in the middle of the screen (where the 'eyes' of the remote participants appear), the video acquired from the remote cameras must be warped to match the local viewpoint. This requires high-quality view synthesis (Ivekovic and Trucco (2006)), which in turn requires high-quality stereo disparity data, normally difficult to achieve given the geometric constraints imposed by videoconferencing setups (Isgrò et al. (2004)). By estimating the upper-body pose and fitting a model to the available disparity data, the quality of the view synthesis can be significantly improved and the sense of presence strengthened. We describe two different approaches to estimating the upper-body pose from multi-view video data. The first approach uses a body model in 3-D space and demonstrates the intuitive way of solving the problem. The second, more elegant approach uses a body model in disparity space.

In the remainder of this paper we first describe the motivation for using Particle Swarm Optimisation in Section 2. We give an overview of the related work in Section 3. The details of the PSO algorithm that we used in our experiments are described in Section 4 and the pose estimation algorithm in Section 5. Section 6 describes the experiments with the simulated synthetic sequence for surveillance purposes and Section 7 describes the upper-body pose estimation from multi-view video data in 3-D and disparity space. Conclusions are given in Section 8.

2 Motivation

Human body pose is usually represented with a kinematic chain structure consisting of joints and limbs (see Figure 1 for an illustration). Each joint can have up to three rotational degrees of freedom, i.e., rotation around the x, y, and z axes, while the joint at the top of the chain, commonly referred to as the root joint, also possesses up to three translational degrees of freedom, defining the position of the entire structure in the reference space. A simple example of a kinematic (sub)chain in a human body is the set of two angles which influence the movement of the elbow. A (simplified) model of the elbow consists of two rotational degrees of freedom, one in the shoulder and one in the elbow itself, as shown in Figure 2. Let us assume that the human arm is seen from four different viewpoints and that the evaluation function measures the amount of overlap between the model and original silhouettes (as described in Section 5.3).


[Figure 2 plot: graph of the fitness function for the lower arm, plotted over the Shoulder Rotation and Elbow Rotation parameters.]

Figure 2: Illustration of the elbow joint parametrisation.

A graph of such an evaluation function is shown in Figure 2. This example can be treated as a boundary case for pose estimation, as the human kinematic structure normally contains many more than just 2 degrees of freedom. As can be seen from Figure 2, the evaluation function is clearly multimodal. A full kinematic structure describing the entire body will therefore almost certainly require a global optimisation algorithm to adequately address the dimensionality and multimodality of the problem while keeping the method fully automatic. Particle Swarm Optimisation has been reported to perform well on classic multimodal test functions such as, e.g., Schaffer's f6 function (Kennedy and Eberhart (1995); Dawis (1991)). As it is a fairly recent technique, its use for optimisation problems such as human body pose estimation has not been widely explored yet. As shown in Section 5, formulating the pose estimation problem in the context of PSO is actually fairly straightforward. The simplicity of the PSO algorithm, combined with its ability to find global optima, is in itself a very appealing argument for its use, and in this paper we show that it can be used to solve the pose estimation problem as well.

3 Related Work

Human body pose estimation from video data is a well-known problem. In video production and medical contexts, motion and pose are often acquired with commercial systems based on a variety of markers attached to the body. Vision and graphics research has concentrated on pose and motion estimation without markers. Prototype solutions have been reported with and without explicit body models, using image data or 3D scanner data, and single or multiple-viewpoint video sequences. A not-so-recent survey can be found in (Gavrilla (1999)) and partly also in (Moeslund and Granum (2001)). Body pose estimation from images using a human body model has been addressed by various researchers. Plänkers and Fua (Plaenkers and Fua (2001)) report using an implicit surface body model (metaballs) which they fit to 3D stereo data constrained by silhouette contours; they use an implementation of the Levenberg-Marquardt optimisation method to fit the model to the stereo data obtained from 3 views. Carranza et al. (Carranza et al. (2003)) use a triangular mesh body model enhanced with a 1D Bezier spline and downhill optimisation constrained with silhouettes to recover the body pose. Poppe et al. (Poppe et al. (2005)) mention a similar application area to ours, virtual environments, and work on monocular video sequences using a simple body model composed of cylinders. Our work differs from the mentioned related work in two aspects: unlike other reported work, we use a subdivision surface body model, the choice of which was motivated by the requirements of our application, and we recover the pose parameters using a global optimisation algorithm, Particle Swarm Optimisation (PSO).


Evolutionary computation research has addressed articulated body pose estimation and analysis in various ways, including Genetic Algorithms (Ye and Liu (2005); Shoji et al. (2000); Hsu et al. (2006)) and Neural Networks (Gross et al. (2006); Guo et al. (1994)). PSO has been successfully applied to various problems; however, we were not able to find many references to its use for articulated body pose estimation. The closest related work found is by Schutte et al. (Schutte et al. (2004)), who report a parallel PSO implementation and illustrate its performance on an example of a simple kinematic chain similar to our boundary case example described in Section 2.

4 Particle Swarm Optimisation

Particle Swarm Optimisation (PSO) is an evolutionary computation technique introduced by Kennedy and Eberhart in 1995 (Kennedy and Eberhart (1995)). The idea originated from the simulation of a simplified social model in which the agents were thought of as collision-proof birds, and the original intent was to graphically simulate the unpredictable choreography of a bird flock. The original PSO algorithm was later modified by the authors and other researchers to improve its search capabilities and convergence, and several successful applications of PSO have been reported in the literature. For an overview of the relevant research in this area, an interested reader will find a good starting point in (Eberhart and Shi (2004)). One of the important modifications of PSO was introduced in 1998 by Shi and Eberhart (Shi and Eberhart (1998)). They changed the velocity update equation of the swarm by adding an additional parameter called the inertia weight, w. The aim of this parameter is to guide the search behaviour of the swarm: the larger the inertia value, the more global the search, and vice versa. Several other modifications were added later, but for the purpose of this paper we focus on the contribution of (Shi and Eberhart (1998)), as it is also the version of PSO which we used in our experiments. In the following we give a brief overview of the PSO algorithm with the inertia weight parameter.

4.1 PSO Algorithm with Inertia Weight Parameter
Assume an n-dimensional search space S ⊆ R^n, a swarm consisting of N particles and a fitness function f : S → R defined on the search space. The i-th particle is represented as an n-dimensional vector X_i = (x_i1, x_i2, ..., x_in)^T ∈ S. The velocity of this particle is also an n-dimensional vector, V_i = (v_i1, v_i2, ..., v_in)^T ∈ S. The best position encountered by the i-th particle so far (personal best) is denoted as P_i = (p_i1, p_i2, ..., p_in)^T ∈ S and the value of the fitness function at that position as pbest_i = f(P_i). The index of the particle with the overall best position so far (global best) is denoted as g, and gbest = f(P_g). Let us also denote the optimum of the fitness function f by sol = f(P_s), where the index s denotes the solution position in the search space. The PSO algorithm can then be stated as follows.

1. Initialisation:
• Initialise a population of particles {X_i}, i = 1...N, with random positions and velocities in the search space S. For each particle evaluate the desired fitness function and set pbest_i = f(X_i). Identify the best particle in the swarm and store its index as g and its position as P_g.


2. Repeat until |sol − gbest| < ε for some predefined ε, or until the number of iterations reaches a predefined limit:
• Move the swarm by updating the position of every particle according to the following two equations:

V_i = w V_i + ϕ_1 (P_i − X_i) + ϕ_2 (P_g − X_i),
X_i = X_i + V_i,    (1)

where ϕ_1 and ϕ_2 are random numbers defined by an upper limit which is a parameter of the system, and w is the inertia weight parameter.
• For i = 1...N, update pbest_i and gbest.

The value of the inertia weight w can remain constant throughout the search or change with time. The parameters ϕ_1 and ϕ_2 influence the social and cognition components of the swarm behaviour (Shi and Eberhart (1998)). They are composed of a random number and a constant and can be written as ϕ_1 = c_1 rand_1() and ϕ_2 = c_2 rand_2(), where c_1 and c_2 are two constants and rand_1() and rand_2() are two random numbers in the interval [0, 1]. In our experiments the values of the constants c_1 and c_2 were both set to 2, as recommended by (Shi and Eberhart (1998); Kennedy and Eberhart (1995)), which on average makes the weights for the social and cognition components of the swarm equal to 1. Throughout the experiments with pose estimation we concentrated on the influence of the inertia weight parameter on the swarm behaviour and did not experiment with the social or cognition bias that could be introduced by manipulating the values of ϕ_1 and ϕ_2.

4.2 Inertia Weight Parameter
The inertia weight plays an important role in directing the exploratory behaviour of the particles. Higher inertia values push the particles to explore more of the search space and emphasise their individual velocity. This behaviour is useful when trying to coarsely explore the entire search space to find a good starting point for a multimodal optimisation. Lower inertia values force the particles to focus on a smaller search area and move towards the best solution found so far. This approach makes sense when the global optimum region has been successfully identified and the exact optimum location in the search space is required. Shi and Eberhart discussed the influence of different inertia values on the exploratory abilities of the swarm in (Shi and Eberhart (1998)). They used a constant inertia function and one which decreased linearly with time. They tested inertia values in the interval [0, 1.4] and found that a medium constant value, i.e., 0.8 < w < 1.2, had the best chance of finding the global optimum while also requiring a moderate number of iterations. Large values, i.e., w > 1.2, made PSO behave more like a global search method, always trying to explore new search areas. We decided to model the inertia change with an exponential function which allowed us to use a constant sampling step while gradually guiding the swarm from a global to a more local exploration:

w(x) = A / e^x,   x ∈ [0, ln(10A)],    (2)


where A denotes the starting value of w when x = 0. The optimisation terminated when w(x) fell below 0.1. The sampling variable x was incremented by ∆x = ln(10A)/N , where N is the desired number of inertia weight changes. The swarm was allowed to explore the search space with a particular inertia value for as long as every move of the swarm improved the current global optimum estimate. As soon as an iteration failed to improve the estimate, the value of the sampling variable increased and the inertia weight value decreased accordingly. This forced the swarm to identify possible optimum regions at the very beginning, then focus on the best few, and eventually settle down in the most promising region and find the global optimum.
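To make Sections 4.1 and 4.2 concrete, the following Python sketch shows a minimal PSO loop with the exponentially decaying inertia weight of Equation 2. The function and parameter names (fitness, bounds, n_weight_changes) and the choice to maximise the fitness are our own illustrative assumptions, not the original implementation.

```python
import numpy as np

def pso_exponential_inertia(fitness, bounds, n_particles=40, A=2.0, n_weight_changes=20):
    """Minimal PSO sketch: maximise `fitness` over a box-bounded search space.

    bounds: (n_dims, 2) array of [low, high] limits for every dimension.
    The inertia weight follows w(x) = A / e^x and is lowered whenever an
    iteration fails to improve the global best (cf. Section 4.2).
    """
    lo, hi = bounds[:, 0], bounds[:, 1]
    n_dims = len(lo)
    X = np.random.uniform(lo, hi, (n_particles, n_dims))                 # positions
    V = np.random.uniform(-(hi - lo), hi - lo, (n_particles, n_dims))    # velocities
    P = X.copy()                                                         # personal bests
    pbest = np.array([fitness(x) for x in X])
    g = np.argmax(pbest)
    Pg, gbest = P[g].copy(), pbest[g]                                    # global best

    x, dx = 0.0, np.log(10 * A) / n_weight_changes
    c1 = c2 = 2.0
    while True:
        w = A / np.exp(x)
        if w < 0.1:                                       # termination criterion of Section 4.2
            break
        phi1 = c1 * np.random.rand(n_particles, n_dims)
        phi2 = c2 * np.random.rand(n_particles, n_dims)
        V = w * V + phi1 * (P - X) + phi2 * (Pg - X)      # Equation (1), velocity update
        X = np.clip(X + V, lo, hi)                        # Equation (1), position update
        f = np.array([fitness(xi) for xi in X])
        improved = f > pbest
        P[improved], pbest[improved] = X[improved], f[improved]
        if pbest.max() > gbest:                           # improvement: keep this inertia value
            g = np.argmax(pbest)
            Pg, gbest = P[g].copy(), pbest[g]
        else:                                             # no improvement: decrease inertia
            x += dx
    return Pg, gbest
```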

5 Pose Estimation with PSO

In this section we describe the building blocks of our pose estimation algorithm. We begin with the body model representing the estimated pose, describe the PSO parametrisation of the pose, explain the evaluation function, and describe two extensions of the original algorithm presented in Section 4 which allowed us to perform the pose estimation more accurately and efficiently.

5.1 Body Model
We use a 3-D layered subdivision surface body model consisting of two layers, the skeleton and the skin. The skeleton layer is defined as a set of transformation matrices which encode the information about the position and orientation of every joint with respect to its parent joint in the kinematic chain hierarchy:

Skeleton = {T_1^2, T_2^3, ..., T_{N−1}^N}.    (3)

N is the number of joints in the skeleton and T_i^j is a homogeneous transformation matrix encoding the orientation of the coordinate system of joint j with respect to the coordinate system of joint i. The top of the hierarchy is the root joint, which branches out into a number of kinematic sub-chains modelling the skeletal structure of the human body. The skin layer represents the second layer in the model and is connected to the skeleton through the joints' local coordinate systems. Each of the joints controls a certain area of the skin. Whenever a joint or limb moves, the corresponding part of the skin moves and deforms with it. The skin can therefore be described as the set of transformation matrices forming the skeleton layer combined with the sets of points influenced by each of the transformations:

Skin = {{T_1^2, P_{T_1^2}}, {T_2^3, P_{T_2^3}}, ..., {T_{N−1}^N, P_{T_{N−1}^N}}}.    (4)

In order to generate a smooth skin surface of the model, all the skin points P_{T_i^j} have to be transformed into a common coordinate system such as the world coordinate system:

P_w = T_w^1 · T_1^2 · ... · T_i^j · P_{T_i^j},   ∀ i, j ∈ [1, ..., N].    (5)

The points (vertices) P_w^i are connected with edges into faces F to form a base mesh:

M^0 = {V, F},   where V = {P_w^i},  F = {P_w^{i1}, P_w^{i2}, P_w^{i3}, P_w^{i4}},    (6)

which is then subdivided to obtain the smooth limit surface, i.e., the skin:

M^∞ = S^∞ ... S^1 S^0 M^0,    (7)

where S is a subdivision operator.
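To make the layered model concrete, here is a small Python sketch of Equation 5: skin points attached to a joint are pushed through the chain of homogeneous transforms down from the root to the world frame. The transform constructor, the two-joint example chain and the point values are illustrative assumptions, not the actual model data used in the paper.

```python
import numpy as np

def rot_z(angle, t=(0.0, 0.0, 0.0)):
    """Homogeneous transform: rotation about the z-axis by `angle`, then translation t."""
    c, s = np.cos(angle), np.sin(angle)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = t
    return T

def skin_to_world(chain, points_local):
    """Equation (5): transform the skin points of the last joint in `chain` to world coordinates.

    chain: list of 4x4 homogeneous transforms [T_w^1, T_1^2, ..., T_i^j],
           ordered from the root down to the joint owning the points.
    points_local: (n, 3) array of skin points in that joint's local frame.
    """
    T = np.eye(4)
    for T_parent_child in chain:            # accumulate T_w^1 * T_1^2 * ... * T_i^j
        T = T @ T_parent_child
    pts_h = np.hstack([points_local, np.ones((len(points_local), 1))])
    return (T @ pts_h.T).T[:, :3]           # back to Cartesian world coordinates

# Hypothetical two-joint sub-chain (shoulder, then elbow) with two forearm skin points.
shoulder = rot_z(0.3, t=(0.0, 1.4, 0.0))
elbow = rot_z(-0.5, t=(0.0, 0.3, 0.0))
forearm_points = np.array([[0.0, 0.05, 0.0], [0.0, 0.25, 0.0]])
print(skin_to_world([shoulder, elbow], forearm_points))
```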


5.2 PSO Parametrisation of the Pose
In PSO, each particle represents a potential solution in the search space. Our search space is the space of all plausible skeleton configurations. The individual particle's position vector in the search space is therefore specified as follows:

X_i = (root_x, root_y, root_z, α_x^0, β_y^0, γ_z^0, α_x^1, β_y^1, γ_z^1, ..., γ_x^N),    (8)

where root_x, root_y, root_z denote the position of the root joint (first joint in the hierarchy) with respect to the reference (world) coordinate system, and α_x^i, β_y^i, γ_z^i refer to the rotational degrees of freedom of joint i around the x, y, and z-axis, respectively.

5.3 Evaluation Function
The evaluation function compares the silhouettes extracted from the original images acquired by the cameras and the silhouettes generated by the model in its current pose. The original images can be acquired from N different viewpoints. Each of the original images is foreground-background segmented and binarised to obtain a silhouette. Let the images containing the original silhouettes be denoted as I_i^o, i = 1...N. Similarly, let I_i^m, i = 1...N, denote the images of the model silhouettes. The evaluation function can then be written as follows:

E = α Σ_{r=1}^{row} Σ_{c=1}^{col} (I_1^o & I_1^m) + β Σ_{r=1}^{row} Σ_{c=1}^{col} (I_2^o & I_2^m) + γ Σ_{r=1}^{row} Σ_{c=1}^{col} (I_3^o & I_3^m) + ... + ω Σ_{r=1}^{row} Σ_{c=1}^{col} (I_N^o & I_N^m),    (9)

where row and col denote the image dimensions, i.e., the number of rows and columns, respectively, and & denotes the logical AND operation. The coefficients α, β, ..., ω are used to normalise the contribution of every view to the total error count. Let T_1, T_2, ..., T_N denote the total number of pixels in the original silhouettes of views 1...N, respectively. The values of the coefficients are then α = 1/T_1, β = 1/T_2, ..., ω = 1/T_N.

5.4 Hierarchical Approach
Although PSO has been reported to tackle highly multidimensional problems successfully (Kennedy and Eberhart (1995)), the unmodified algorithm described in Section 4 failed to solve the complex pose estimation problem satisfactorily. The main reason for this was the computationally expensive evaluation function, which prevented us from using a large swarm size if the result was to be obtained within an acceptable time frame. We decided to augment the original algorithm to exploit the hierarchy inherent in the kinematic chain model. This hierarchy means that the positions of the joints lower in the chain are constrained by the configurations of the joints higher in the chain. For example, the rotation of the shoulder joint directly restricts the position of the elbow and limits its plausible orientation. This approach greatly reduces the combinatorial complexity of the search, as it constrains the search space much more than if all the joint values were optimised at once. Taking advantage of the dependency constraint between the individual joints is inevitable if the algorithm is to be efficient, given that the swarm size and the available time were limited. Incorporating the hierarchical search in the PSO algorithm proves to be fairly straightforward. The first step consists of identifying the appropriate kinematic chain structure and establishing the dependencies between the individual joints.


Figure 3: (left) full-body model used in the synthetic experiments; (centre) schematic colour-coded illustration of the kinematic subchains; (right) upper-body model used in the multi-view sequences.

A commonly used set of guidelines for levels of articulation and dependencies between the joints of an articulated model is contained in the ISO H-Anim standard (19774:200x (2006)). The Level of Articulation 1 (LOA1) of the H-Anim standard most closely resembles the kinematic structure of our model; a schematic illustration of the individual kinematic subchains is shown in Figure 3. Once the hierarchy has been identified, we can optimise for the joint rotations in a hierarchical manner. Depending on the complexity of the evaluation function, the level of articulation of individual joints, and the dependencies between them, this can be done one joint at a time or by grouping several joints together. We describe the hierarchical approach in more detail in the experimental section.

5.5 Continuity of Pose Estimates
When optimising the pose over a sequence of frames depicting continuous body motion, one of the requirements is the temporal and spatial consistency of the estimates. Optimising the pose from scratch on a frame-by-frame basis can, in principle, produce a continuous articulated sequence. In practice, however, seemingly plausible estimates are likely to be ambiguous if not additionally constrained. For example, the evaluation function comparing the overlap between the silhouettes does not necessarily distinguish between the front and the back of the model. The optimisation can return any of the values ϕ ± kπ, k ∈ Z, as the estimate of the orientation in the reference space, as long as they lie within the allowed parameter boundaries. This becomes obvious when a temporally consistent sequence of estimates is required. A simple solution is to use the best estimate from frame t−1 to initialise the optimisation in frame t. This does not in itself guarantee the smoothness of the continuous motion across frames, as the uncertainty in the individual pose estimates means they will not be entirely consistent with the ground truth. It does, however, provide a constraint that enforces the proximity of neighbouring estimates in the search space and also naturally reduces the complexity of the search.
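As an illustration of how Sections 5.4 and 5.5 fit together, the sketch below drives the PSO routine of Section 4 over a sequence of stages and seeds each frame from the previous estimate. The stage groupings, the index lists, the seeding radius and the frame.silhouette_score hook are illustrative assumptions rather than the exact procedure used in our experiments.

```python
import numpy as np

# Each stage optimises a subset of the pose vector while the rest stays fixed.
# The joint grouping here is illustrative; Table 3 lists the groups actually used.
stages = [
    ("root location",    [0, 1, 2]),
    ("root orientation", [3, 4, 5]),
    ("left arm",         [6, 7, 8, 9]),
    ("right arm",        [10, 11, 12, 13]),
]

def estimate_pose(frame, bounds, previous_pose=None):
    """Hierarchical pose estimation for one frame, optionally seeded from the previous frame."""
    pose = previous_pose.copy() if previous_pose is not None else bounds.mean(axis=1)
    for name, idx in stages:
        def stage_fitness(sub):
            candidate = pose.copy()
            candidate[idx] = sub
            return frame.silhouette_score(candidate)      # hypothetical scoring hook (Eq. 9)
        sub_bounds = bounds[idx]
        if previous_pose is not None:
            # Continuity constraint: shrink the search box around the previous estimate.
            centre = previous_pose[idx]
            width = 0.25 * (sub_bounds[:, 1] - sub_bounds[:, 0])
            sub_bounds = np.stack([centre - width, centre + width], axis=1)
        best_sub, _ = pso_exponential_inertia(stage_fitness, sub_bounds)
        pose[idx] = best_sub
    return pose
```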

6 Experiments with Synthetic Data

In this section we describe the experiments performed on a synthetic data sequence with known ground truth. We generated a sequence in which a simple full-body model was used to simulate the constrained rotating walk that would take place inside a millimeter-wave sensor. The aim of these experiments was to establish how reliably a pose could be estimated for such a constrained full-body motion using only two camera views.


Figure 4: Example frames from the synthetic sequence used as constraints for the pose estimation. Front and top view are shown combined.

Table 1: Degrees of Freedom for Synthetic Data Tests
  JOINT (index)            DOF
  World orientation          1
  Root Location (root)       1
  Left Hip rotation          1
  Right Hip rotation         1
  TOTAL                      4

6.1 The Model and the Sequence
The simple body model used to generate the sequence is shown in Figure 3 (left). As the motion inside the sensor is fairly constrained, it was possible to approximate it with the 4 degrees of freedom shown in Table 1. The synthetic cameras were placed in front of the model and above it, providing two useful constraints for the pose estimation. Figure 4 shows example silhouettes for the top and front view, extracted from the sequence. The sequence consists of 360 frames in which the model performs one full walking rotation around its main axis.

6.2 Evaluation Function
In this set of experiments we use only two views, the front view and the top view, to constrain the pose. This reduces the evaluation function from Equation 9 to:

E = α Σ_{r=1}^{row} Σ_{c=1}^{col} (I_f^o & I_f^m) + β Σ_{r=1}^{row} Σ_{c=1}^{col} (I_t^o & I_t^m),

where the indices f and t denote the front and top view, respectively, and the rest of the parameters remain as in Equation 9.

6.3 PSO Parametrisation of the Pose
Each individual particle models the 4 degrees of freedom that influence the pose of the model:

X_i = (root_x, α_world, α_left_hip, α_right_hip),    (10)

where root_x denotes the position of the root joint with respect to the world coordinate system (translation), α_world denotes the orientation of the model with respect to the world coordinate system, α_left_hip denotes the left hip rotation and α_right_hip denotes the right hip rotation.


6.4 PSO Performance Analysis
The Particle Swarm Optimisation algorithm contains several parameters which influence its performance. The swarm size and the inertia weight are two which, when chosen carefully, can significantly influence the behaviour of the swarm. In pose estimation we are interested not only in the accuracy of the results, but also in the efficiency of the estimation process. To this end we performed several experiments, measuring how different parameter settings affect the uncertainty of the pose estimates. We present the results below. The following two standard formulas for the sample mean and standard deviation were used to calculate the uncertainty intervals of the estimated parameter values:

x̄ = (1/N) Σ_{i=1}^{N} x_i,    (11)

σ_x = sqrt( (1/(N−1)) Σ_{i=1}^{N} (x_i − x̄)² ).    (12)
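As a trivial illustration, the uncertainty interval for one pose parameter over repeated PSO runs can be computed as follows (the values are made up for the example):

```python
import numpy as np

# Hypothetical estimates of one pose parameter from repeated runs of the same frame.
estimates = np.array([0.52, 0.49, 0.50, 0.55, 0.47])
mean = estimates.mean()            # Equation (11), sample mean
std = estimates.std(ddof=1)        # Equation (12), sample standard deviation (N-1 denominator)
print(f"estimate: {mean:.3f} +/- {std:.3f}")
```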

6.4.1 Swarm Size
Common sense tells us that the larger the number of particles, the more likely it is that one of them will explore the right region of the search space and find the optimal solution. However, a large swarm comes at the price of higher computational complexity, which is not always affordable. In our case, the complexity of the evaluation function restricts the number of particles which can be used if the solution is to be obtained within a pre-defined time period. The goal of the first experiment was to establish a good compromise between the size of the swarm, acceptable quality of the pose estimate and reasonable estimation time. We performed pose estimation with varying swarm sizes on a single frame of the sequence. The size of the swarm varied from 5 to 100 particles in increments of 5. The estimation with each swarm size was run 20 times to establish the uncertainty of the result. The plots in Figure 5 show the pose estimate uncertainty interval for each of the tested swarm sizes overlaid on top of the ground truth for all 4 optimised parameters. The time necessary to obtain the estimate and the error value are also shown. The orientation of the model with respect to the world reference frame is plotted separately, as its uncertainty interval is much larger than that of the other parameters. This is a consequence of the lack of the continuity constraint mentioned in Section 5.5. The optimisation performance is very good for the other three optimised parameters, as can be seen from the plots. The results shown in Figure 5 indicate that 40 particles present an acceptable compromise between the quality of the estimate and the required time.


[Figure 5 plots: uncertainty interval plots (with ground truth) for the world rotation parameter and for the root position and left/right hip parameters, together with the best error estimate and the optimisation time, all as a function of population size.]

Figure 5: Uncertainty interval analysis for varying population sizes.

6.4.2 Inertia Weight
A dynamically changing inertia weight, modelled with an exponential function (Equation 2), allows us to influence the exploratory behaviour of the swarm as described in Section 4.2. The higher the initial inertia value, the more global the search; however, it also takes longer before the optimisation converges. Just as with the swarm size, there is a trade-off between the extent of the global exploration and the time necessary for convergence. As our pose estimation problem becomes highly nonlinear with an increasing number of parameters, our next goal was to find the lowest possible inertia value which allows enough global exploration to locate the global optimum basin and then quickly converges towards the exact location. In this experiment we kept the swarm size constant at 40 particles, as suggested by the first experiment. We again performed 20 repetitions of the pose estimation for each different starting inertia value. The values tested started at x = 1.0 for w = 2.0/e^x, and x was increased in steps of ∆x = 0.2 until it reached x = 3.0. Figure 6 shows the results of the experiment. The experiments show that the inertia value w = 2.0/e, which we used as the highest starting point in this experiment, is also the best one. If the starting value is set any lower, the uncertainty in the error estimates increases. This conclusion is made under the assumption that each estimation is run from scratch without any prior knowledge of the pose.

6.4.3 Pose Continuity
In Section 5.5 we mentioned the need for consistent estimates and suggested that initialising the search in a new frame with the best result from the previous frame will ensure the coherency of the estimates. In order to force the swarm to explore only close to the suggested initial estimate, the inertia value has to be kept low as well, which additionally reduces the complexity of the search.


[Figure 6 plots: uncertainty interval plots (with ground truth) for the world rotation estimate and for the root position and left/right hip parameters, together with the best error estimate and the optimisation time, all as a function of the starting inertia weight value.]

Figure 6: Uncertainty interval analysis for varying inertia weight values.

In this experiment we ran the pose estimation for the first half of the synthetic sequence, initialising the search in every new frame with the best estimate from the previous frame. The swarm size was set to 40 particles and the inertia weight started at w = 0.3. Figure 7 shows the results, with the PSO estimates on the left and the ground truth on the right. The experimental results confirm that the algorithm is capable of estimating the pose reliably.

[Figure 7 plots: parameter values over the 180-frame sequence, PSO estimate (left) and ground truth (right).]

Figure 7: Results of the pose estimation for a continuous sequence of 180 frames.


Figure 8: Example pose from the multi-view data set. Each pose is acquired from 4 viewpoints simultaneously.

7 Experiments with Multi-View Still Images

7.1 Data Set
The data for this set of experiments consists of a set of different upper-body poses, typical of immersive videoconferencing scenarios, acquired from 4 different viewpoints. An example pose, shown from all four views, is given in Figure 8. The data was acquired using the setup shown in Figure 9, consisting of 4 fire-wire webcams and off-the-shelf lighting.

Figure 9: Multi-view camera setup used to acquire the data and a detailed view of the lighting.

7.2 Upper Body Model
The upper body model (see Figure 3, right) consists of 10 joints with a total of 20 degrees of freedom, 3 translations and 17 rotations. Limb lengths are fixed. Table 2 shows the detailed list of degrees of freedom used. As the body model is a subdivision surface, it can be used at various levels of smoothness. Two levels of subdivision were sufficient to achieve a shape which was smooth enough to interpret the pose.

7.3 PSO Parametrisation of Pose Estimation
The search space is 20-dimensional and the individual particle's position vector in the search space is specified as follows:

X_i = (root_x, root_y, root_z, α_x^0, β_y^0, γ_z^0, α_x^1, γ_z^1, ..., α_x^7),    (13)

where root_x, root_y, root_z denote the position of the root joint with respect to the world coordinate system, α_x^0 denotes a rotation around the x-axis of the root joint coordinate system by angle α, γ_z^1 denotes a rotation around the z-axis of the clavicle-neck joint by angle γ, etc.

7.4 Evaluation Function
Let the images containing the original silhouettes be denoted as I_l^o, I_c^o, I_r^o, and I_t^o for the left, centre, right, and top original silhouette, respectively. Similarly, let I_l^m, I_c^m, I_r^m, and I_t^m denote the images of the model silhouettes from the left, centre, right, and top view, respectively.


Table 2: Degrees of Freedom for Tests with Real Data
  JOINT (index)                      DOF
  Root location (root)                 3
  Root orientation (0)                 3
  Clavicle-neck orientation (1)        2
  Clavicle-left orientation (2)        2
  Left Shoulder orientation (3)        3
  Left Elbow orientation (4)           1
  Clavicle-right orientation (5)       2
  Right Shoulder orientation (6)       3
  Right Elbow orientation (7)          1
  TOTAL                               20

The evaluation function from Equation 9 then simplifies to:

E = α Σ_{r=1}^{row} Σ_{c=1}^{col} (I_l^o & I_l^m) + β Σ_{r=1}^{row} Σ_{c=1}^{col} (I_c^o & I_c^m) + γ Σ_{r=1}^{row} Σ_{c=1}^{col} (I_r^o & I_r^m) + δ Σ_{r=1}^{row} Σ_{c=1}^{col} (I_t^o & I_t^m).    (14)

7.5 Hierarchical Upper-Body Pose Estimation
In Section 5.4 we mentioned the need for a hierarchical approach to pose estimation in order to limit the complexity of the problem. In the case of the synthetic sequence (Section 6), the number of parameters to be optimised was low enough not to require the hierarchical approach. In the case of upper-body pose estimation, where the search space has expanded to 20 dimensions, this is no longer the case. When we used the original algorithm with an exponentially falling inertia weight, the inertia weight fell to zero well before the optimisation identified the global optimum basin of attraction, and waiting for the right answer by prolonging the optimisation was simply not feasible. Instead, we decided to optimise hierarchically. We performed the hierarchical optimisation in 7 steps (see Table 3). First, we optimised the location of the skeleton in space, i.e., the location of the root joint, followed by the root joint orientation; both are 3 DOF optimisations. Once the skeleton was positioned in space, we optimised the neck and head sub-chain, for which we only used 2 DOF in the clavicle-neck joint to model the tilt of the head. The movement of the clavicle-left and clavicle-right joints on their own does not produce enough variation in the silhouette shape to be optimised individually. Therefore, in the next step, we combined the left clavicle joint with two rotational dimensions of the shoulder joint and optimised the parameters of the left upper arm, a 4 DOF optimisation. Likewise, we then optimised the right upper arm, again 4 DOF.


Table 3: Steps in the hierarchical optimisation
  TORSO
    (1) Root location                                               3 DOF: root_x, root_y, root_z
    (2) Root orientation                                            3 DOF: α_x^0, β_y^0, γ_z^0
  NECK & HEAD
    (3) Clavicle-neck orientation                                    2 DOF: α_x^1, γ_z^1
  LEFT UPPER ARM
    (4) Clavicle-left orientation + Left shoulder orientation        4 DOF: α_x^2, γ_z^2, α_x^3, γ_z^3
  RIGHT UPPER ARM
    (5) Clavicle-right orientation + Right shoulder orientation      4 DOF: α_x^5, γ_z^5, α_x^6, γ_z^6
  LEFT LOWER ARM
    (6) Left shoulder orientation + Left elbow orientation           2 DOF: β_y^3, α_x^4
  RIGHT LOWER ARM
    (7) Right shoulder orientation + Right elbow orientation         2 DOF: β_y^6, α_x^7

At the end we were left with the left and right lower arms, each modelled with 2 DOF as described in the boundary case. The two 4 DOF upper-arm optimisations required a slightly denser sampling of the inertia weight function to correctly locate the optimum region. Our fitness function is based on the silhouette overlap between the model and the original silhouette coming from the video sequence. When optimising joint parameters hierarchically, the joints lower in the hierarchy mislead the silhouette overlap count, as they contribute to it despite not having been optimised yet. We avoid this by deforming the subdivision body model so that, at a particular stage of the optimisation, only those body parts which are currently being optimised or have already been optimised are visible. We also exclude the hands from the model entirely, as the hands in the original sequence exhibit too much articulation and constantly mislead the optimisation. The hands could be a very useful constraint, but that depends entirely on the input images, so explicit modelling of the hands is not necessary.

7.6 Combining the Original and Hierarchical Approaches
The results of the hierarchical approach were encouraging (see Figure 10, left); the optimisation correctly identified the pose. However, having optimised hierarchically without looking back, another problem crept in, namely error propagation. In our hierarchical approach we rely on each individual stage producing the best possible result and therefore providing a good starting point for the next stage. However, the results from Section 6 show that there is always some uncertainty in the pose estimates, and this uncertainty propagates and grows through the hierarchical stages. At this point, going back to optimising all the parameters at once makes much more sense.

15

S. Ivekovic, E. Trucco, Y. R. Petillot

Figure 10: The left image shows the result under the influence of error propagation in the hierarchical approach. The right image illustrates how this can be corrected using the combined approach.

The result of the hierarchical optimisation, even if corrupted by propagated bad estimates, represents an excellent starting point for the original algorithm. If we use it as the initial position in the search space and initialise the particles around it with a low initial inertia value, e.g., w = 0.5, we can force the swarm to explore the space around the provided initial solution and find a better one. This approach successfully corrects the influence of the error propagation, as shown in Figure 10 (right).
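A minimal sketch of this combined step, reusing the PSO routine sketched in Section 4 and assuming box bounds for the 20 pose parameters; the particle spread around the hierarchical estimate is our own illustrative choice.

```python
import numpy as np

def refine_full_pose(hier_pose, bounds, fitness, spread=0.05):
    """Re-optimise all pose parameters at once, starting near the hierarchical estimate.

    The particles are scattered in a small box around `hier_pose` (a fraction
    `spread` of each parameter's range) and the PSO run starts with a low
    inertia weight (A = 0.5), so the swarm only explores locally.
    """
    width = spread * (bounds[:, 1] - bounds[:, 0])
    local_bounds = np.stack([hier_pose - width, hier_pose + width], axis=1)
    local_bounds[:, 0] = np.maximum(local_bounds[:, 0], bounds[:, 0])   # stay inside the
    local_bounds[:, 1] = np.minimum(local_bounds[:, 1], bounds[:, 1])   # global parameter box
    return pso_exponential_inertia(fitness, local_bounds, A=0.5)
```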

7.7 Pose Estimation in Disparity Space
The pose estimation presented in the previous section represents only one of the steps in a much larger application, as shown in Table 4. Once the pose has been estimated, the model is fitted to the incomplete and noisy disparity data. The new, complete data is then used for high-quality view synthesis. The previous section described the upper-body pose estimation in 3-D space. Given that we want to post-process disparity data, it makes sense to investigate the possibility of estimating the pose in disparity space itself and using the model directly, without any additional transformations between the image space and 3-D space, once the pose has been estimated. Another advantage is that the noise associated with {x, y, d} is homoscedastic (Demirdjian and Darrell (2002)), unlike the noise associated with 3-D points reconstructed from stereo data, which is well known to be heteroscedastic. In this section we describe the modifications which are necessary to perform the pose estimation in disparity space.

Figure 12: Combined approach results for the third pose, centre and left view.


Table 4: Diagram of the entire algorithm from image acquisition to view synthesis.
  1. multi-view image acquisition → 2. silhouette extraction → 3. pose estimation → 4. disparity data completion with model → 5. view synthesis

Apart from the described modifications, all the remaining steps of the algorithm are exactly the same as in the 3-D case. Let us assume that a 3-D point M = (X, Y, Z, 1)^T is viewed by two distinct cameras, with left camera P^l and right camera P^r, and that the image of the point M is defined as m_l = (x_l, y_l, 1)^T ≃ P^l M and m_r = (x_r, y_r, 1)^T ≃ P^r M in the left and right camera's image plane, respectively, where "≃" denotes equality up to a scale factor. The corresponding points m_l and m_r are related by a disparity which, in the general case, we define as:

d(m_l, m_r) = m_r − m_l = (x_r − x_l, y_r − y_l).    (15)

In the case of rectified images, the two corresponding points lie on the same scanline, and the disparity simplifies to a displacement along the scanline:

d(m_l, m_r) = x_r − x_l.    (16)

For a rectified stereo pair of images, the disparity space is then defined as a three-dimensional space D³ = {x, y, d}. There exists a projective transformation Γ between the 3-D space R³ and the disparity space D³, Γ : R³ → D³:

Γ = [ p^l_11,           p^l_12,           p^l_13,           p^l_14 ;
      p^l_21,           p^l_22,           p^l_23,           p^l_24 ;
      p^r_11 − p^l_11,  p^r_12 − p^l_12,  p^r_13 − p^l_13,  p^r_14 − p^l_14 ;
      p^l_31,           p^l_32,           p^l_33,           p^l_34 ],    (17)

for which

D ≃ Γ M,   M ∈ R³, D ∈ D³,    (18)

and where p^l_ij and p^r_ij denote the elements of the left and right rectified camera projection matrices, P^l and P^r, respectively.

7.7.1 Body Model in Disparity Space
Given the projective transformation Γ as in Equation 17, there exists a relationship between a homogeneous transformation A = [R|t] in R³ and an equivalent homogeneous transformation B in D³ (see Figure 13):

B = Γ A Γ⁻¹.    (19)

The notion of homogeneous transformations in R³ is therefore directly transferable to the disparity space D³. Points can be rotated and translated in disparity space directly if, for every known homogeneous transformation A in 3-D space, the corresponding transformation B is computed as in Equation 19.

7.8 Articulated Body Pose Estimation in Disparity Space
We use the result from Equation 19 to estimate the pose of an articulated upper-body model in disparity space by estimating the homogeneous transformations of the individual joints composing the model. The skeleton layer of the upper-body model is now defined as follows:

Skeleton = {B_1^2, B_2^3, ..., B_{N−1}^N},    (20)

where N is the number of joints in the skeleton and B_i^j is a homogeneous disparity-space transformation matrix encoding the orientation of the coordinate system of joint j with respect to the coordinate system of joint i. The skin is connected to the skeleton through the joints' local coordinate systems and deforms with it, just like in the 3-D case.
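The mapping can be sketched in a few lines of Python; the code below assumes 3×4 rectified projection matrices as NumPy arrays and is an illustration of Equations 17, 18 and 19 rather than the authors' implementation.

```python
import numpy as np

def gamma_from_projections(P_left, P_right):
    """Equation (17): build the 4x4 map from 3-D space to disparity space {x, y, d}."""
    G = np.empty((4, 4))
    G[0] = P_left[0]                   # image x (left camera, row 1)
    G[1] = P_left[1]                   # image y (left camera, row 2)
    G[2] = P_right[0] - P_left[0]      # disparity d = x_r - x_l (rectified pair)
    G[3] = P_left[2]                   # homogeneous scale (left camera, row 3)
    return G

def to_disparity_space(A, Gamma):
    """Equation (19): B = Gamma A Gamma^{-1}, the same rigid motion expressed in D^3."""
    return Gamma @ A @ np.linalg.inv(Gamma)

def project_point(M_h, Gamma):
    """Equation (18): D ~ Gamma M, homogeneous 3-D point mapped to (x, y, d)."""
    D = Gamma @ M_h
    return D[:3] / D[3]
```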

17

S. Ivekovic, E. Trucco, Y. R. Petillot


Figure 13: Transformation diagram: mapping a 3-D point M with Γ and then applying B gives the same result as applying A first and then mapping with Γ, i.e., BΓM = ΓAM.

Figure 14: An example of estimated upper body pose. The upper row shows a wireframe model overlaid on top of the silhouette to illustrate the evaluation function constraint.

Figure 15: Upper body pose estimates. The estimate in the third row is slightly underconstrained by the available views and as a result the right lower arm estimate is not entirely correct; however, this is due to the underconstrained evaluation function and not a fault of PSO itself.


8 Conclusions

In this paper we have presented an algorithm for human body pose estimation with Particle Swarm Optimisation. Experiments with synthetic and real data illustrated the ability of the method to solve the problem reliably and within an acceptable time frame. Future work will concentrate on demonstrating pose continuity and tracking with PSO on real data, which, due to the restrictions of our simple camera setup, was not possible at this point. The results shown in this paper should be illustrative enough to foster the use of PSO in related problems and so bridge the gap between evolutionary methods and computer vision.

References

19774:200x, I. F. (2006). Information technology - computer graphics and image processing - humanoid animation (H-Anim).

Ben-Arie, J., Wang, Z., Pandit, P., and Rajaram, S. (2002). Human activity recognition using multidimensional indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8).

Carranza, J., Theobalt, C., Magnor, M., and Seidel, H. (2003). Free-viewpoint video of human actors. ACM Transactions on Graphics, 22(3).

Collomosse, J., Rowntree, D., and Hall, P. (2003). Video analysis for cartoon-like special effects. In Proceedings of the 14th British Machine Vision Conference, pages 749–758.

Dawis, L. (1991). Van Nostrand Reinhold, New York.

Demirdjian, D. and Darrell, T. (2002). Using multiple-hypothesis disparity maps and image velocity for 3-D motion estimation. International Journal of Computer Vision, 47(1/2/3):219–228.

Deutscher, J., Blake, A., and Reid, I. (2000). Articulated body motion capture by annealed particle filtering. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2000, volume 2, page 2126.

Eberhart, R. C. and Shi, Y. H. (2004). Special issue on particle swarm optimization. IEEE Transactions on Evolutionary Computation, 8(3).

Gavrilla, D. (1999). Visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73.

Grafulla-Gonzalez, B., Haworth, C., Harvey, A., Lebart, K., Petillot, Y., de Saint Pern, Y., Tomsin, M., and Trucco, E. (2005). Millimetre-wave personnel scanners for automated weapon detection. In Pattern Recognition and Image Analysis, Part 2, Lecture Notes in Computer Science, volume 3687.

Gross, H., Richarz, J., Mueller, S., Scheidig, A., and Martin, C. (2006). Probabilistic multi-modal people tracker and monocular pointing pose estimator for visual instruction of mobile robot assistants. In International Joint Conference on Neural Networks, 2006, pages 4209–4217.

Guo, Y., Xu, G., and Tsuji, S. (1994). Understanding human motion patterns. In Proceedings of the 12th IAPR International Conference on Computer Vision and Image Processing, pages 325–329.

Haritaoglu, I., Harwood, D., and Davis, L. (2000). W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8).

Haworth, C., De Saint-Pern, Y., Clark, D., Trucco, E., and Petillot, Y. (2007). Detection and tracking of multiple metallic objects in millimetre-wave images. International Journal of Computer Vision, 71(2).

Hsu, H.-H., Hsieh, S.-W., Chen, W.-C., Chen, C.-J., and Yang, C.-Y. (2006). Motion analysis for the standing long jump. In 26th IEEE International Conference on Distributed Computing Systems Workshops, page 47.

Isgrò, F., Trucco, E., and Schreer, O. (2004). Three-dimensional image processing in the future of immersive media. IEEE Transactions on Circuits and Systems for Video Technology, 14(3):288–303.

Ivekovic, S. and Trucco, E. (2006). Human body pose estimation with PSO. In World Congress on Computational Intelligence, WCCI 2006.

Kennedy, J. and Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, volume 4, pages 1942–1948. IEEE.

Kohle, M., Merkl, D., and Kastner, J. (1997). Clinical gait analysis by neural networks: issues and experiences. In Proceedings of the Tenth IEEE Symposium on Computer-Based Medical Systems, pages 138–143.

Moeslund, T. and Granum, E. (2001). A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81.

Ong, S. and Ranganath, S. (2005). Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6).

Plaenkers, R. and Fua, P. (2001). Articulated soft objects for video-based body modeling. In 8th International Conference on Computer Vision, ICCV 2001.

Poppe, R., Heylen, D., Nijholt, A., and Poel, M. (2005). Towards real-time body pose estimation for presenters in meeting environments. In Proceedings of the 13th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2005.

Schutte, J. F., Reinbolt, J. A., Fregly, B. J., Haftka, R. T., and George, A. D. (2004). Parallel global optimization with the particle swarm algorithm. International Journal for Numerical Methods in Engineering, 61(13).

Shi, Y. H. and Eberhart, R. C. (1998). A modified particle swarm optimizer. In Proceedings of the IEEE International Conference on Evolutionary Computation.

Shoji, K., Mito, A., and Toyama, F. (2000). Pose estimation of a 2D articulated object from its silhouette using a GA. In Proceedings of the 15th International Conference on Pattern Recognition, volume 3, pages 713–717.

Veeraraghavan, A., Roy-Chowdhury, A., and Chellappa, R. (2005). Matching shape sequences in video with applications in human movement analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12).

Ye, Z. and Liu, Z.-Q. (2005). Genetic condensation for motion tracking. In Proceedings of 2005 International Conference on Machine Learning and Cybernetics, volume 9, pages 5542–5547.