Human Pose Estimation Using Exemplars and Part Based Refinement

Yanchao Su1, Haizhou Ai1, Takayoshi Yamashita2, and Shihong Lao2

1 Computer Science and Technology Department, Tsinghua University, Beijing 100084, China
2 Core Technology Center, Omron Corporation, Kyoto 619-0283, Japan

Abstract. In this paper, we propose a fast and accurate human pose estimation framework that combines top-down and bottom-up methods. The framework consists of an initialization stage and an iterative searching stage. In the initialization stage, an exemplar based method is used to find several initial poses, which are used as searching seeds for the next stage. In the iterative searching stage, a large number of body part candidates are generated by adding random disturbance to the searching seeds. The Belief Propagation (BP) algorithm is applied to these candidates to find the best n poses using the information of the global graph model and the part image likelihoods. These poses are then used as searching seeds for the next iteration. To model the image likelihoods of parts, we designed rotation invariant EdgeField features, based on which we learn boosted classifiers to calculate the image likelihoods. Experimental results show that our framework is both fast and accurate.

1 Introduction

In recent years, human pose estimation from a single image has become an interesting and essential problem in the computer vision domain. Promising applications such as human computer interfaces, human activity analysis and visual surveillance rely on robust and accurate human pose estimation. But human pose estimation remains a challenging problem due to the dramatic changes in shape and appearance of articulated poses. There are mainly two categories of human pose estimation methods: top-down methods and bottom-up methods. Top-down methods, including regression based methods and exemplar based methods, concern the transformation between human pose and image appearance. Regression based methods [1,2] directly learn the mapping from image features to human pose. Exemplar based methods [3–7] find a finite number of pose exemplars that sparsely cover the pose space and store the image features corresponding to each exemplar; the resulting pose is obtained through interpolation of the n closest exemplars. Bottom-up methods, which are mainly part based methods [8–12], divide the human body into several parts and use graph models to characterize the whole body. First, several candidates for each part are found using learned part appearance models, and then the global graph model is used to assemble the parts into a whole body.


Among all these works, top-down methods are usually fast, but their performance relies heavily on the training data. Bottom-up methods are more accurate, but the exhaustive search for body parts is too time consuming for practical applications. Inspired by the work on Active Shape Models (ASM) [13], we find that iterative local search can give accurate and robust shape alignment results at acceptable computation cost, given a proper initialization. We design a novel iterative inference scheme for bottom-up part based pose estimation. To model the image likelihood of each part, we design rotation invariant EdgeField features, based on which we learn a boosted classifier for each body part. To provide initial poses to the part based method, we utilize a sparse exemplar based method that gives several best matches at little time cost, based on the cropped human body image patch given by a human detector. The combination of top-down and bottom-up methods enables us to achieve accurate and robust pose estimation, and the iterative inference scheme speeds up pose inference without degrading accuracy. The rest of this paper is organized as follows: section 2 presents an overview of our pose estimation framework, section 3 describes the top-down exemplar based stage, and section 4 describes the bottom-up iterative searching stage. Experiments and conclusions are given in sections 5 and 6.

2 Overview of Our Framework

A brief flowchart of our framework is given in Fig. 1.

Fig. 1: Flowchart of the framework: from the cropped image, example based initialization produces searching seeds; candidates are generated from the seeds and BP inference selects new seeds, iterating until convergence, which yields the result.

The pose estimation framework takes the cropped image patch as input and consists of two main stages: the initialization stage and the refining stage. In the initialization stage, a sparse top-down exemplar based method is used to find the first n best matching exemplars, whose poses are used as the searching seeds in the searching stage. In the refining stage, we utilize a set of boosted body part classifiers to measure the image likelihoods and a global tree model to constrain the body configuration.


Inspired by the work on Active Shape Models, we design an iterative searching algorithm to further refine the pose estimation result. Unlike ASM, which keeps only the single best shape in each iteration, we find the first n best poses using BP inference on the tree model and use them as searching seeds for the next iteration. The refined pose estimation result is given when the iteration converges.

3 Example Based Initialization

The objective of the example based initialization stage is to provide several searching seeds for the part based refinement stage, so this stage has to be fast and robust. Some previous example-based methods [3,4,7] adopt a dense cover of the pose space that uses all the training samples as exemplars to achieve better performance, but in our case a sparse exemplar set is enough to obtain initial poses for the refining stage. We use the Gaussian Process Latent Variable Model (GPLVM) [14] to obtain a low dimensional manifold of the pose space and use the active set of GPLVM as the exemplars. An illustration of the GPLVM model and the exemplars is given in Fig. 2.

Fig. 2: Illustration of GPLVM and the exemplars

The adopted image feature is an important aspect of example-based methods. In many previous works, silhouettes of the human body are used as the templates of exemplars. But in practice, it is nearly impossible to get accurate silhouettes from a single image with a cluttered background. In [3] the author adopts histograms of oriented gradients (HOG) [15] as the image descriptor to characterize the appearance of the human body. However, there are two main limitations to using a HOG descriptor in a tight bounding box: firstly, the bounding box of a human varies dramatically as the pose changes; secondly, a single HOG descriptor is not expressive enough considering the drastic changes in the appearance


of the human body in different poses. So we first extend the tight bounding rectangle to a square, then divide the bounding square equally into 9 sub-regions; the HOG descriptors of the sub-regions are concatenated into the final descriptor. For each exemplar, we assign a weight to each sub-region that represents the percentage of foreground area in that sub-region. Given an image and a bounding box, we first cut out a square image patch based on the bounding box, and then compute the HOGs of all the sub-regions. The distance between the image and an exemplar is given by the weighted sum of the Euclidean distances of the HOGs in each sub-region. The first n exemplars with the shortest distances give the initial poses.
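To make the descriptor and matching step concrete, here is a minimal sketch assuming a grayscale square patch and skimage's HOG implementation; the HOG parameters, the function names and the 3x3 grid indexing are our own choices, not taken from the paper.

```python
import numpy as np
from skimage.feature import hog  # assumed HOG implementation

def patch_descriptor(square_patch, grid=3):
    """Split the bounding square into a grid x grid lattice and compute
    one HOG per sub-region; the final descriptor is the list of these
    per-region HOGs (concatenation order is fixed by the loop)."""
    h, w = square_patch.shape[:2]
    regions = []
    for r in range(grid):
        for c in range(grid):
            sub = square_patch[r * h // grid:(r + 1) * h // grid,
                               c * w // grid:(c + 1) * w // grid]
            regions.append(hog(sub, orientations=9,
                               pixels_per_cell=(8, 8),
                               cells_per_block=(2, 2)))
    return regions

def exemplar_distance(regions, exemplar_regions, weights):
    """Weighted sum of per-region Euclidean distances; each exemplar
    weight is the foreground fraction of that sub-region."""
    return sum(wk * np.linalg.norm(a - b)
               for wk, a, b in zip(weights, regions, exemplar_regions))
```

Ranking all exemplars by this distance and keeping the n smallest then gives the initial poses.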

4 Part Based Refinement

4.1 Part Based Model

It is natural that the human body can be decomposed into several connected parts. Similar to previous works, we model the whole body as a Markov network, as shown in Fig. 3 (left). The whole body is represented as a tree structure in which each tree node corresponds to a body part and is related to its adjacent parts and its image observation. A human pose can be represented as the configuration of the tree model, L = {l1, l2, ..., ln}, where the state of each part is determined by the parameters of its enclosing rectangle: li = {x, y, θ, w, h} (Fig. 3, right). Each part is associated with an image likelihood, and each edge of the tree between connected parts i and j has an associated potential function that encodes the constraints on the configurations of part i and part j.

Fig. 3: Left: Markov network of the human body. Middle: the constraint between two connected parts. Right: the state of a body part
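For reference, the part state l_i = {x, y, θ, w, h} can be carried around as a small record; a minimal sketch (the class name is ours):

```python
from dataclasses import dataclass

@dataclass
class PartState:
    """State l_i of one body part: the centre (x, y), the orientation
    theta, and the width/height of its enclosing rectangle (Fig. 3, right)."""
    x: float
    y: float
    theta: float
    w: float
    h: float
```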

As shown in Fig. 3(middle), the potential function has two components corresponding to the angle and the position: p θ ψij (li , lj ) = ψij (li , lj ) + ψij (li , lj )

(1)


The constraint on the angle is defined by a von Mises distribution:

$$\psi^{\theta}_{ij}(l_i, l_j) = e^{k\cos(\theta - \mu)} \qquad (2)$$

And the potential function on the position is a piecewise function:

$$\psi^{p}_{ij}(l_i, l_j) = \begin{cases} \alpha e^{-\beta\|p_i - p_j\|^2}, & \|p_i - p_j\|^2 < t \\ 0, & \|p_i - p_j\|^2 \ge t \end{cases} \qquad (3)$$

The parameters α and β can be learnt from the ground truth data. The piecewise definition of the potential function not only guarantees the connection between adjacent parts, but also speeds up the inference procedure. The body pose can be obtained by optimizing:

$$p(L|Y) \propto p(Y|L)\,p(L) = \prod_i \psi_i(l_i) \prod_i \prod_{j\in\Gamma(i)} \psi_{ij}(l_i, l_j) \qquad (4)$$

4.2 Image Likelihood

As in previous works, body silhouettes are the most direct cue for measuring the image likelihood. However, fast and accurate silhouette extraction is itself an open problem. As shown in previous works, boosting classifiers with gradient based features such as the HOG feature [15] is an effective way of measuring image likelihood in pedestrian detection and human pose estimation. However, in human pose estimation the orientation of body parts changes dramatically with the pose, so when measuring the image likelihood of body parts we would need to rotate the image to compute HOG feature values. This time consuming image rotation can be eliminated by using rotation invariant features. Based on this observation, we designed the EdgeField feature. Given an image, we can take each pixel as a charged particle and the image as a 2D field. Following the definition of electric field force, we define the force between two pixels p and q:

$$f(p, q) = \frac{k\, d_p \cdot d_q}{\|r\|^3}\, r \qquad (5)$$

where $d_p$, $d_q$ are the image gradients at pixels p and q, and r is the vector pointing from p to q. The amplitude of the force is proportional to the gradient amplitudes of p and q and to the similarity between the gradients of p and q, and is inversely proportional to the square of the distance between p and q. Given a template pixel p and an image I, we can compute the force between the template pixel and the image by summing the forces between the template pixel and each image pixel:

$$f(p, I) = \sum_{q\in I} \frac{k\, d_p \cdot d_q}{\|r\|^3}\, r \qquad (6)$$

By applying orthogonal decomposition to f(p, I), we get:


$$f(p, I) = \begin{pmatrix} f(p, I)\cdot e_x \\ f(p, I)\cdot e_y \end{pmatrix} = k\begin{pmatrix} d_p \cdot \sum_{q\in\Gamma(p)} \frac{(r_q\cdot e_x)\,d_q}{\|r_q\|^3} \\ d_p \cdot \sum_{q\in\Gamma(p)} \frac{(r_q\cdot e_y)\,d_q}{\|r_q\|^3} \end{pmatrix} = \begin{pmatrix} d_p \cdot F_x(p) \\ d_p \cdot F_y(p) \end{pmatrix} = d_p^T F(p) \qquad (7)$$

where $e_x$ and $e_y$ are the unit vectors of the x and y directions, and F(p) is a constant 2 × 2 matrix at each position p in image I. We call F(p) the EdgeField. An EdgeField feature is defined as a short template curve segment composed of several points with oriented gradients. The feature value is defined as the projection of the force between the template segment and the image onto a normal vector l:

$$f(S, I) = \sum_{p\in S} f(p, I) \cdot l = \sum_{p\in S} d_p^T F(p)\, l \qquad (8)$$

The calculation of the EdgeField feature is clearly rotation invariant: to compute the feature value of an oriented EdgeField feature, we only need to rotate the template. The EdgeField itself can be precomputed from the image by convolution. An example of an EdgeField and its contour lines is shown in Fig. 4 (left and middle); Fig. 4 (right) shows an example of an EdgeField feature.


Fig. 4: Left: EdgeField; Middle: contour lines of the EdgeField; Right: an EdgeField feature
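The precomputation can be sketched as follows: since F_x(p) and F_y(p) in Eq. (7) are sums of d_q weighted by fixed functions of r_q = q − p, each of the four scalar components of F is a correlation of a gradient map with a fixed kernel. The truncation radius, the constant k and all function names below are our assumptions; the paper does not specify them.

```python
import numpy as np
from scipy.ndimage import correlate

def precompute_edgefield(image, k=1.0, radius=8):
    """Precompute the EdgeField F(p) of Eq. (7) for a grayscale image,
    stored as a (H, W, 2, 2) array, via correlation of the gradient
    field with kernels K_x(r) = k*r_x/||r||^3, K_y(r) = k*r_y/||r||^3."""
    gy, gx = np.gradient(image.astype(np.float64))       # d_q per pixel

    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    norm3 = (xs ** 2 + ys ** 2) ** 1.5
    norm3[radius, radius] = np.inf                       # exclude r = 0
    Kx, Ky = k * xs / norm3, k * ys / norm3

    F = np.empty(image.shape + (2, 2))
    for j, K in enumerate((Kx, Ky)):                     # columns F_x, F_y
        # correlate(g, K)[p] = sum_q g(q) K(q - p), matching Eq. (7)
        F[..., 0, j] = correlate(gx, K, mode='constant')
        F[..., 1, j] = correlate(gy, K, mode='constant')
    return F

def edgefield_feature(F, segment, l, origin=(0, 0)):
    """Eq. (8): f(S, I) = sum_p d_p^T F(p) l, for a template segment of
    ((x, y), d_p) pairs (NumPy arrays) placed at 'origin'; positions are
    rounded to the nearest pixel for indexing."""
    ox, oy = origin
    return sum(dp @ F[int(round(oy + y)), int(round(ox + x))] @ l
               for (x, y), dp in segment)

def rotate_template(segment, l, theta):
    """Rotation invariance in practice: rotate the template points,
    gradients and projection vector instead of rotating the image."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return ([(R @ p, R @ dp) for p, dp in segment], R @ l)
```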

Based on the EdgeField feature, we can learn a boosted classifier for each body part. We use the improved AdaBoost algorithm [16], which gives real-valued confidences as its predictions; these can be used as the image likelihood of each part. The weak classifier based on a feature S is defined as follows:

$$h(I_i) = \mathrm{lut}_j, \quad v_j < f(S, I_i) \le v_{j+1} \qquad (9)$$

where the range of the feature value is divided into several sub-ranges $\{(v_j, v_{j+1})\}$, and the weak classifier outputs a constant confidence $\mathrm{lut}_j$ in each sub-range.
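A sketch of this piecewise-constant weak classifier, with the per-bin confidences set by the confidence-rated rule of [16] (lut_j = ½ ln(W+/W−) over the weighted samples falling in bin j); the bin boundaries and smoothing constant are our assumptions:

```python
import numpy as np

class LUTWeakClassifier:
    """Piecewise-constant weak classifier of Eq. (9): the feature range
    is split into bins and each bin outputs a constant confidence."""
    def __init__(self, edges, eps=1e-6):
        self.edges = np.asarray(edges)       # boundaries v_0 < ... < v_n
        self.lut = np.zeros(len(edges) - 1)  # one confidence per bin
        self.eps = eps

    def _bins(self, f_values):
        return np.clip(np.searchsorted(self.edges, f_values) - 1,
                       0, len(self.lut) - 1)

    def fit(self, f_values, labels, weights):
        # Real AdaBoost rule [16]: confidence = 0.5*ln(W_+ / W_-) per bin
        bins = self._bins(f_values)
        for j in range(len(self.lut)):
            w_pos = weights[(bins == j) & (labels > 0)].sum()
            w_neg = weights[(bins == j) & (labels < 0)].sum()
            self.lut[j] = 0.5 * np.log((w_pos + self.eps) /
                                       (w_neg + self.eps))
        return self

    def predict(self, f_values):
        return self.lut[self._bins(f_values)]
```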


A crucial problem in learning a boosted classifier with EdgeField features is that the number of distinct EdgeField features is enormous, so the exhaustive search of conventional boosting learning becomes impractical. We therefore design a heuristic searching procedure as follows:

Algorithm 1: Weak classifier selection for the i-th part
Input: samples {I_ij}, sample labels {y_ij} and the corresponding weights {D_ij}
1. Initialize: the open feature list OL = {S_1, S_2, ..., S_n} and the closed feature list CL = ∅
2. Grow and search: for t = 1 to T
     Select the feature S* with the lowest classification loss from OL:
       S* = argmin_{S∈OL} Z(S), where Z(S) = Σ_i D_i exp(−y_i h(I_i))
     OL = OL − {S*}, CL = CL ∪ {S*}
     Generate new features based on S* under the constraints in Fig. 5, learn their parameters, calculate their classification losses, and put them into OL.
   end for
3. Output: select the feature with the lowest classification loss from OL and CL

This algorithm is initialized with EdgeField features consisting of a single template pixel, and generates new features from the current best feature under growing constraints which keep the template segment of the feature simple and smooth.
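The open/closed-list search of Algorithm 1 can be sketched as below; 'grow' enumerates the children of a feature under the Fig. 5 constraints and 'loss' evaluates the weighted loss Z(S) after fitting the feature's parameters. Both are assumed callables, and features are assumed hashable; none of these names come from the paper.

```python
def select_weak_classifier(seed_features, grow, loss, T):
    """Greedy heuristic of Algorithm 1: repeatedly expand the current
    best feature instead of enumerating all EdgeField features."""
    open_list = {s: loss(s) for s in seed_features}  # single-pixel seeds
    closed = {}
    for _ in range(T):
        best = min(open_list, key=open_list.get)     # lowest Z(S) in OL
        closed[best] = open_list.pop(best)
        for child in grow(best):                     # Fig. 5 constraints
            if child not in open_list and child not in closed:
                open_list[child] = loss(child)
    # final answer: the lowest-loss feature seen in OL or CL
    all_feats = {**open_list, **closed}
    return min(all_feats, key=all_feats.get)
```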

Fig. 5: Growing constraints of features. Left: the second pixel (gray square) can only be placed in 5 specific directions relative to the first pixel (black square); Middle: each following pixel can only be placed in the direction in which the current segment proceeds or in the adjacent directions; Right: the segment cannot cross the horizontal line through the first pixel.

When the positions of the pixels p in the feature are given, we need to learn the parameters of the feature: the projection vector l and the gradient $d_p$ of each pixel. The objective is to best separate the positive and negative samples. As in common object detection tasks, the feature values of positive samples are surrounded by those of negative samples, so we adopt the Maximal-Rejection-Classifier (MRC) criterion [18], which minimizes the positive scatter while maximizing the negative scatter:

$$\{d_{p_i}, l\} = \arg\min R_{pos}/R_{neg} \qquad (10)$$


where $R_{pos}$ and $R_{neg}$ are the covariances of the feature values of the positive and negative samples respectively. Since it is hard to learn l and $d_p$ simultaneously, we first discretize l into 12 directions and learn $\{d_p\}$ for each l. Given l, the feature value can be computed as follows:

$$f(S, I) = \sum_{p_i\in S} d_{p_i}^T F(p_i)\, l = \left(d_{p_1}^T, d_{p_2}^T, \ldots, d_{p_n}^T\right)\left(F(p_1)l, F(p_2)l, \ldots, F(p_n)l\right) \qquad (11)$$

Then the learning of $\{d_p\}$ becomes a typical MRC problem in which $(d_{p_1}^T, \ldots, d_{p_n}^T)$ is taken as the projection vector. Following [16], we adopt weighted MRC to take the sample weights into account. Given the boosted classifier of each part, the image likelihood is calculated as follows:

$$\psi_i(l_i) = \left(1 + \exp(-H_i(I_i))\right)^{-1} = \left(1 + \exp\left(-\sum_k h_{k,i}(I_i)\right)\right)^{-1} \qquad (12)$$
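A sketch of the weighted MRC step, under the reading that Eq. (10) is solved as a generalized Rayleigh quotient: the projection vector minimizing the weighted positive scatter relative to the negative scatter is the generalized eigenvector with the smallest eigenvalue. The function names, covariance weighting and regularization are our assumptions. Rows of X_pos/X_neg are the stacked vectors (F(p_1)l, ..., F(p_n)l) per sample, as in Eq. (11).

```python
import numpy as np
from scipy.linalg import eigh

def learn_mrc_direction(X_pos, X_neg, w_pos=None, w_neg=None, ridge=1e-8):
    """Weighted MRC sketch for Eqs. (10)/(11): returns the stacked
    gradient vector (d_p1, ..., d_pn) as one projection vector."""
    def scatter(X, w):
        if w is None:
            w = np.ones(len(X))
        mu = np.average(X, axis=0, weights=w)
        Xc = (X - mu) * np.sqrt(w)[:, None]
        return Xc.T @ Xc / w.sum()

    R_pos = scatter(X_pos, w_pos)
    R_neg = scatter(X_neg, w_neg) + ridge * np.eye(X_neg.shape[1])
    # Generalized eigenproblem R_pos v = lambda R_neg v; the eigenvector
    # with the smallest eigenvalue minimizes v^T R_pos v / v^T R_neg v.
    vals, vecs = eigh(R_pos, R_neg)
    return vecs[:, 0]        # eigenvalues are returned in ascending order
```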

4.3 Iterative Pose Inference

Belief propagation (BP) [17] is a popular and effective inference algorithm for tree structured models. BP calculates the desired marginal distributions $p(l_i|Y)$ by local message passing between connected nodes. At iteration n, each node $t \in V$ calculates a message $m^n_{ts}(l_s)$ that is passed to each node $s \in \Gamma(t)$:

$$m^n_{ts}(l_s) = \alpha \int_{l_t} \psi_{st}(l_s, l_t)\,\psi_t(l_t) \prod_{u\in\Gamma(t)\setminus s} m^{n-1}_{ut}(l_t)\, dl_t \qquad (13)$$

where $\alpha = 1/\sum_{l_s} m^n_{ts}(l_s)$ is the normalization coefficient. An approximation of the marginal distribution $p(l_s|Y)$ can be calculated in each iteration:

$$\hat{p}(l_s|Y) = \alpha\,\psi_s(l_s) \prod_{u\in\Gamma(s)} m^n_{us}(l_s) \qquad (14)$$

In tree structured models, $\hat{p}(l_s|Y)$ converges to $p(l_s|Y)$ once messages have been passed to every node in the graph. In our case the state space of each part is discrete, so the integral in equation (13) becomes a sum over all feasible states $l_t$. Although BP inference reduces the complexity of pose inference, the feasible state space of each part is still too large for exhaustive search. Inspired by the Active Shape Model (ASM), we designed an iterative inference procedure that limits each part to the neighborhood of several initial states and infers locally in each iteration. The whole procedure is shown in Algorithm 2.


In each iteration, the searching seeds are the initial guesses of the state of each part, and we sample uniformly from the neighborhood of each searching seed to generate searching candidates. BP inference is then applied over the searching candidates to calculate the marginal distributions over the candidates, and the searching seeds for the next iteration are selected from the candidates by MAP according to these marginal distributions. Unlike ASM, which uses the reconstructed shape as the initial shape for the local search stage, we keep the first n best candidates as the searching seeds for the next iteration and generate new searching candidates based on these seeds. Preserving multiple searching seeds reduces the risk of an improper initial pose.

Algorithm 2: Iterative pose inference
Input: searching seeds for each part $S_i = \{l_{i1}, l_{i2}, \ldots, l_{in}\}$
Generate candidates: $C_i = \{l_i \mid l_i \in Neighborhood(l_{ik}),\ l_{ik} \in S_i\}$
For each part i: calculate the image likelihood $\psi_i(l_i)$ using the boosted classifiers
BP inference, loop until convergence:
  For each part s and each $t \in \Gamma(s)$, calculate the messages:
    $m^n_{ts}(l_s) = \alpha \sum_{l_t\in C_t} \psi_{st}(l_s, l_t)\,\psi_t(l_t) \prod_{u\in\Gamma(t)\setminus s} m^{n-1}_{ut}(l_t)$
  Calculate beliefs: $\hat{p}(l_s|Y) = \alpha\,\psi_s(l_s)\prod_{u\in\Gamma(s)} m^n_{us}(l_s)$
Find searching seeds for each part i: $S_i$ = the candidates with the first n largest belief values
Go to "Generate candidates" if the $S_i$ have not converged
Output: the estimated pose $L = \{S_i\}$
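A minimal sum-product sketch of the BP step in Algorithm 2 over discrete candidate sets (the integral of Eq. (13) replaced by a sum); the data layout and names are our assumptions, not the authors' code.

```python
import numpy as np

def bp_marginals(tree_edges, psi_unary, psi_pair, n_iters=10):
    """Discrete BP over candidate sets.

    tree_edges: list of (s, t) part-index pairs forming a tree
    psi_unary:  {part: 1-D array of image likelihoods over candidates}
    psi_pair:   {(s, t): matrix M[a, b] = psi_st(cand. a of s, cand. b of t)}
    Returns {part: normalized marginal over its candidates}.
    """
    nbrs = {}
    for s, t in tree_edges:
        nbrs.setdefault(s, []).append(t)
        nbrs.setdefault(t, []).append(s)
    # msgs[(t, s)]: message from t to s, defined over s's candidates
    msgs = {(t, s): np.ones(len(psi_unary[s])) for s, t in tree_edges}
    msgs.update({(s, t): np.ones(len(psi_unary[t])) for s, t in tree_edges})

    for _ in range(n_iters):
        new = {}
        for (t, s) in msgs:
            # product of messages into t, excluding the one from s (Eq. 13)
            prod = psi_unary[t].copy()
            for u in nbrs[t]:
                if u != s:
                    prod *= msgs[(u, t)]
            M = psi_pair[(s, t)] if (s, t) in psi_pair else psi_pair[(t, s)].T
            m = M @ prod                     # sum over candidates of t
            new[(t, s)] = m / max(m.sum(), 1e-300)
        msgs = new

    marginals = {}                           # beliefs, Eq. (14)
    for s in nbrs:
        b = psi_unary[s].copy()
        for u in nbrs[s]:
            b *= msgs[(u, s)]
        marginals[s] = b / max(b.sum(), 1e-300)
    return marginals
```

The searching seeds for the next iteration are then the candidates with the n largest marginal values for each part.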

5 Experiments

5.1 Experiment Data

Our experiments are conducted on the following datasets:
1. Synthesized dataset: a dataset of 1000 images synthesized using Poser 8. Each image contains one human, and the pose is labeled manually. The backgrounds of these images are randomly selected from other datasets.
2. The Buffy dataset from [11]: this dataset contains 748 video frames over 5 episodes of the fifth season of Buffy the Vampire Slayer. The upper body pose (including head, torso, upper arms and lower arms) is labeled manually in each image. This is a challenging dataset due to the changing poses, different clothes and strongly varying illumination. As in [11,12], we use 276 images from 3 episodes as the testing set, and the remaining 472 as the training set.
3. The playground dataset: this dataset contains 350 images of people playing football and basketball in a playground. The full body pose is labeled manually. 150 images from this dataset are used as the training set, and the remaining 200 as the testing set.


4. The iterative image parsing dataset from [11]: this dataset contains 305 images of people engaged in various activities such as standing, performing exercises, dancing, etc. The full body pose in each image is annotated manually. This dataset is more challenging because there is no constraint on human poses, clothes, illumination or backgrounds. 100 images from this dataset are used as the training set and the remaining 205 as the testing set.

Fig. 6: Mean minimum errors of the selected exemplars in datasets 2, 3, 4 (from left to right). Each plot shows the mean error against the number of selected exemplars for our descriptor and for the HOG descriptor.

5.2 Experiments on Exemplar Based Initialization

To learn the exemplars for full body pose estimation, we use the whole dataset 1. The coordinates of the 14 joint points of each image are concatenated into a pose vector, and GPLVM is used to learn a low dimensional manifold of the pose space. The active set of GPLVM is taken as the exemplar set, and the corresponding descriptors are stored. The upper body exemplars are learnt similarly using the bounding box of the upper body. Given an image and a bounding box of a human, we generate 100 candidate bounding boxes by adding random disturbance to the bounding box. For each exemplar, we find the candidate bounding box with the shortest descriptor distance, and the first 20 exemplars with the shortest distances are used as the initial poses for the next stage. We measure the accuracy of this stage on the datasets in terms of the mean minimum distance between corresponding joint points of the ground truth and the top n selected exemplars, divided by the part length, as shown in Fig. 6.

5.3 Experiments on Pose Estimation

For the part classifiers, we cut out image patches from the training sets according to the bounding box of each part, normalize them to the same orientation and scale, and use them as positive samples. Negative samples are obtained by adding large random disturbances to the ground truth states of each part. In the pose estimation procedure, the top 20 matched exemplars are used as the searching seeds, and 100 searching candidates for each part are generated based on


each searching seed. During each iteration, the candidates with the top 20 posterior marginal probabilities are preserved as the searching seeds for the next iteration. To compare with previous methods, we used the code of [12]. The images are cut out based on the bounding box (with some margin) and then processed by the code. The correctness of each part on the testing sets of datasets 2, 3 and 4 is listed in Table 1. A part is considered correct if the distance between its estimated joint points and the ground truth is below 50% of the part length. From the results we can see that our method achieves better correctness at much lower time cost. Fig. 7 gives some examples from datasets 2, 3 and 4.

Table 1: Correctness of parts in datasets 2, 3, 4 (left/right values for arms and legs)

             dataset 2            dataset 3            dataset 4
part         [12]       ours      [12]       ours      [12]       ours
Torso        91%        93%       81%        84%       84%        87%
Upper Arm    82%/83%    83%/85%   45%/51%    57%/59%   56%/59%    64%/62%
Lower Arm    57%/59%    61%/64%   32%/31%    41%/37%   35%/38%    41%/43%
Upper Leg    -          -         54%/61%    67%/63%   68%/59%    67%/68%
Lower Leg    -          -         43%/45%    51%/50%   66%/51%    69%/60%
Head         96%        95%       81%        89%       78%        80%
Total        78%        80%       52%        59%       59%        64%
Time cost    65.1s      5.3s      81.2s      9.1s      79.1s      8.9s
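The correctness rule behind Table 1 can be written down directly; here we follow the common convention that both joint endpoints of a part must fall within the threshold, which matches the 50%-of-part-length rule stated above (the exact endpoint convention is our assumption).

```python
import numpy as np

def part_correct(est_joints, gt_joints, frac=0.5):
    """A part, given by its two joint endpoints, counts as correct if
    both estimated joints lie within frac (50%) of the part length of
    their ground truth positions."""
    est, gt = np.asarray(est_joints), np.asarray(gt_joints)
    part_len = np.linalg.norm(gt[0] - gt[1])
    errors = np.linalg.norm(est - gt, axis=1)
    return bool(np.all(errors < frac * part_len))
```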

6 Conclusion

In this paper, we propose a fast and robust human pose estimation framework that combines a top-down example based method and a bottom-up part based method. The example based method is first used to get initial guesses of the body pose, and then a novel iterative BP inference scheme is utilized to refine the estimated pose using a tree-structured kinematic model and discriminative part classifiers based on rotation invariant EdgeField features. The combination of top-down and bottom-up methods and the iterative inference scheme enable us to achieve accurate pose estimation at acceptable time cost.

Acknowledgement. This work is supported by the National Science Foundation of China under grant No. 61075026, and it is also supported by a grant from Omron Corporation.

Fig. 7: Examples from datasets 2, 3, 4

References

1. A. Agarwal, B. Triggs: Recovering 3d human pose from monocular images. PAMI 28(1) (2006) 44–58

2. A. Bissacco, M. Yang, S. Soatto: Fast human pose estimation using appearance and motion via multi-dimensional boosting regression. CVPR (2007)
3. R. Poppe: Evaluating example-based pose estimation: experiments on the HumanEva sets. CVPR 2nd Workshop on EHuM2 (2007)
4. R. Poppe, M. Poel: Comparison of silhouette shape descriptors for example-based human pose recovery. AFG (2006)
5. G. Mori, J. Malik: Recovering 3d human body configurations using shape contexts. PAMI (2006) 1052–1062
6. G. Rogez, J. Rihan, S. Ramalingam, C. Orrite, P.H.S. Torr: Randomized trees for human pose detection. CVPR (2008)
7. G. Shakhnarovich, P. Viola, R. Darrell: Fast pose estimation with parameter-sensitive hashing. ICCV (2003)
8. H. Jiang, D. R. Martin: Global pose estimation using non-tree models. CVPR (2008)
9. L. Sigal, M. J. Black: Measure locally, reason globally: occlusion-sensitive articulated pose estimation. CVPR (2006)
10. L. Sigal, S. Bhatia, S. Roth, et al.: Tracking loose-limbed people. CVPR (2004)
11. D. Ramanan: Learning to parse images of articulated bodies. NIPS (2006)
12. M. Andriluka, S. Roth, B. Schiele: Pictorial structures revisited: people detection and articulated pose estimation. CVPR (2009)
13. A. Hill, T. F. Cootes, C. J. Taylor: Active shape models and the shape approximation problem. 6th British Machine Vision Conference (1995) 157–166
14. N. D. Lawrence: Gaussian process latent variable models for visualization of high dimensional data. NIPS 16 (2004) 329–336
15. N. Dalal, B. Triggs: Histograms of oriented gradients for human detection. CVPR (2005)
16. R. E. Schapire, Y. Singer: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3) (1999) 297–336
17. J. S. Yedidia, W. T. Freeman, Y. Weiss: Understanding belief propagation and its generalizations. IJCAI (2001)
18. X. Xu, T. S. Huang: Face recognition with MRC-boosting. ICCV (2005)