2015 IEEE Winter Conference on Applications of Computer Vision

Non-rigid Articulated Point Set Registration for Human Pose Estimation

Song Ge and Guoliang Fan∗
School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK 74078
{song.ge,guoliang.fan}@okstate.edu

Abstract

We propose a new non-rigid articulated point set registration framework for human pose estimation that aims at improving two recent registration techniques and filling the gap between them. One is Coherent Point Drift (CPD), a powerful Gaussian Mixture Model (GMM)-based non-rigid registration method that may not be suitable for articulated deformations because they violate its motion coherence assumption. The other is articulated ICP (AICP), which is effective for human pose estimation but prone to being trapped in local minima without good correspondence initialization. To bridge the gap between the two, a new non-rigid registration method, called Global-Local Topology Preservation (GLTP), is proposed by integrating a Local Linear Embedding (LLE)-based topology constraint with CPD in a GMM-based formulation, which accommodates articulated non-rigid deformations and provides reliable correspondence estimation for AICP initialization. Experiments on both 3D scan data and depth images demonstrate the effectiveness of the proposed framework.

1. Introduction

Point set registration is a fundamental issue for many computer vision applications. Iterative Closest Point (ICP) [3, 20] is a classic rigid registration method which iteratively assigns correspondences and then finds the least-squares solution. In [10, 11], putative correspondences based on feature descriptors are first created, then the non-rigid transformation for inliers is estimated. Gaussian Mixture Model (GMM)-based methods are an important category for non-rigid registration, where point sets are represented by a density function and registration is cast as a density estimation problem [13] or an alignment between two distributions [9].

On the other hand, human pose estimation, as a separate but related topic, has attracted much interest recently. There are three main categories of approaches: discriminative, generative and hybrid. Discriminative approaches usually require a large training dataset to deal with body shape variability and complex poses [1, 16, 17]. Generative approaches fit a template model, which is often pre-segmented along with an articulated structure and is assumed to be locally rigid, to a target point set [5, 14, 8]. Hybrid approaches combine the advantages of both by using a training database to initialize an articulated model, followed by template-based pose refinement [2, 19].

We propose a new generative framework for human pose estimation (as shown in Fig. 1) that takes advantage of two classic non-rigid point set registration algorithms, i.e., Coherent Point Drift (CPD) [13] and articulated ICP (AICP) [14], while addressing their limitations in the context of human pose estimation. The key is a novel GMM-based point set registration algorithm, called Global-Local Topology Preservation (GLTP), that integrates a local linear embedding (LLE)-based topology constraint with the CPD-based regularization. GLTP provides reliable correspondence estimation to initialize AICP, which is also modified to deal more flexibly with highly articulated poses. Furthermore, a subject-specific articulated model is learned that further improves local segment matching in AICP. Our framework uses only one template, which is able to accommodate various articulated poses under different body shapes and sizes.

Figure 1. Overview of the proposed framework.

∗ This work is supported by the Oklahoma Center for the Advancement of Science and Technology (OCAST) under grant HR12-30 and the National Science Foundation (NSF) under grant NRI-1427345.

978-1-4799-6683-7/15 $31.00 © 2015 IEEE DOI 10.1109/WACV.2015.20


2. Related Work

There are many recent algorithms that cast human pose estimation from 3D depth data or point clouds as a registration problem in which an articulated human model is often involved. The major challenge is that registration results can easily be trapped in local minima due to the highly articulated nature of human poses, the variety of body shapes, and the unknown camera view. One approach to this problem is to use a large pre-collected dataset to find a similar pose or to initialize correspondences prior to the registration process. A 3D database which contains mesh models along with embedded skeletons is used in [19] to initialize a similar pose for a given input depth image before CPD is performed to refine pose estimation. A motion database, which includes a large set of poses represented by the 3D positions of five body extremities (hands, feet, and head), is employed in [2] for pose initialization, which facilitates and validates the tracking process. In [17], the correspondences between the depth image and a 3D mesh model are predicted by a pre-trained regression forest, then joint positions are initialized by minimizing an energy function using the predicted correspondences. In those approaches, the initial database has to be sufficiently large to accommodate a variety of poses or body shapes. The other approach to improving pose/correspondence initialization is to pre-process the target data by extracting useful features, such as the upper-body segmentation in depth images in [6]. For sequential depth data, a tracking strategy is often used to simplify the pose estimation problem by using the results from previous frames, under the assumption of pose continuity between adjacent frames [2, 7, 12].

In this paper, we propose a general non-rigid articulated point set registration framework for human pose estimation, and we mainly focus on the gap between two recent registration algorithms, i.e., CPD and AICP. Although CPD has proven powerful for general non-rigid registration, it may not handle articulated non-rigid deformations effectively, such as those observed in 3D human data, where the CPD assumption (motion coherence) may not be valid. On the other hand, just like most ICP-based methods, AICP is prone to being trapped in local minima without good correspondence initialization. It also assumes the template is locally rigid, which demands a good shape initialization. When the target model has significant shape variation or articulated deformation, correspondence estimation may not be accurate or reliable, which undermines the effectiveness of AICP. We present a new articulated registration framework that not only addresses the limitations of CPD and AICP, but also takes advantage of their respective strengths in a unified framework. The proposed framework is compared against state-of-the-art algorithms on both 3D scan data and depth images with competitive performance.

3. Proposed Framework

The proposed framework has three steps, detailed in the following. First, we learn a subject-specific articulated model by a two-stage CPD registration that improves segment matching in AICP. Second, the key step is correspondence estimation using a newly proposed non-rigid registration algorithm, i.e., GLTP, for AICP initialization. Third, a modified AICP is developed with better flexibility and efficiency to handle highly articulated poses.

3.1. CPD for Shape/Skeleton Initialization

The objective of this step is to create a subject-specific articulated model for an unknown subject. It involves a general “T-pose” template $Y = [y_1, \cdots, y_M]^T$, where different body segments are pre-labeled along with an articulated skeleton, and an initial target point set of a “neutral-pose” (preferably similar to the “T-pose”) $Z = [z_1, \cdots, z_N]^T$, which is available in most experiments, as shown in Fig. 1. We apply the original CPD algorithm for non-rigid registration between $Z$ and $Y$ because motion coherence holds well between these two point sets. CPD is a powerful GMM-based registration approach which forces the GMM centroids to move coherently as a group to preserve the topological structure of the point set. It defines the non-rigid transformation as a displacement function. By using calculus of variations, it is found that the optimal displacement function is represented by a linear combination of Gaussian kernel functions as

$$\mathcal{T}(Y, W) = Y + GW, \tag{1}$$

where $G_{M \times M}$ is the Gaussian kernel matrix with element $g_{ij} = \exp\left(-\frac{1}{2}\left\|\frac{y_i - y_j}{\beta}\right\|^2\right)$, $\beta$ is the width of the Gaussian kernel and $W_{M \times D}$ is the weight matrix. The regularization term on $W$ that encourages globally coherent motion is defined as

$$E_{\mathrm{global}}(W) = \mathrm{Tr}(W^T G W), \tag{2}$$

where $\mathrm{Tr}(B)$ denotes the trace of the matrix $B$.

Shape/skeleton initialization is a two-stage, one-time registration process between $Z$ and $Y$ via CPD. First, CPD is used to create a labeled initial target along with the estimated “neutral-pose”. Second, $Y$ (the “T-pose” template) is deformed to the “neutral-pose” estimated from $Z$, creating a more favorable condition for a second round of CPD that can further improve the accuracy of shape/skeleton estimation, especially around joints. As a result, a subject-specific articulated model can be learned, which is represented by a labeled point set $\hat{Z} = [\hat{z}_1, \cdots, \hat{z}_M]^T$, as shown in Fig. 1. This process allows us to deal with various body shapes by using the simple “T-pose” template and an initial pose from the subject, which naturally reveals the body shape/skeleton.
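As a minimal sketch of Eqs. (1) and (2), the following pure-Python code builds the Gaussian kernel matrix, applies the displacement model, and evaluates the global regularizer. The function names (`gaussian_kernel`, `transform`, `e_global`) are illustrative and not taken from the authors' Matlab implementation; points are assumed to be given as lists of coordinates.

```python
import math

def gaussian_kernel(Y, beta):
    """Kernel matrix with g_ij = exp(-0.5 * ||(y_i - y_j) / beta||^2)."""
    M = len(Y)
    return [[math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])) / beta ** 2)
             for j in range(M)] for i in range(M)]

def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transform(Y, G, W):
    """Eq. (1): T(Y, W) = Y + G W."""
    GW = matmul(G, W)
    return [[y + d for y, d in zip(y_row, d_row)] for y_row, d_row in zip(Y, GW)]

def e_global(G, W):
    """Eq. (2): Tr(W^T G W), computed as the sum of W[i][d] * (G W)[i][d]."""
    GW = matmul(G, W)
    return sum(w * gw for w_row, gw_row in zip(W, GW) for w, gw in zip(w_row, gw_row))
```

With all-zero weights the template is left unchanged and the regularizer is zero; larger kernel widths `beta` couple more distant points, which is what makes the induced motion coherent.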

3.2. GLTP for Correspondence Estimation

The “T-pose” template $Y$ is a standard pose where all limbs are fully stretched (Fig. 1), providing a good starting point for correspondence estimation for a new target point set $X = [x_1, \cdots, x_N]^T$ (i.e., a new pose) from the subject. Because the subject-specific articulated model $\hat{Z}$ is not a strict “T-pose”, it may not serve as a good template for articulated registration. Therefore, the original template $Y$ is still used for correspondence estimation in $X$, where the key is to preserve both the global and local structures during articulated registration.

CPD represents a global topology constraint that sustains the spatial coherence of the point set, but it can be too strong to allow local segments to have different transformations during articulated registration. Although CPD with a small Gaussian kernel width can follow the local structure, it can be too weak to keep the global connectivity, leading to chaotic registration results. Therefore, preserving both global connectivity and local structure is necessary for articulated registration. In this work, CPD is still mainly used as a global topology constraint to preserve the holistic structure of a point set. A local topology constraint is also needed to preserve the local structure and thus accommodate non-rigid articulated deformation.

The proposed Global-Local Topology Preservation (GLTP) algorithm is inspired by [15], where Local Linear Embedding (LLE) was proposed as a nonlinear dimensionality reduction method that preserves the local neighborhood structure in the low-dimensional latent space. In the context of non-rigid registration, we want local structures in the template $Y$ to be preserved after non-rigid transformation. The LLE idea is applied in GLTP in three steps. First, the $K$ nearest neighbors of each point in $Y$ are computed according to the Euclidean distance. The value of $K$ is related to the point density and distribution in $Y$. Second, each point in $Y$ is represented by a weighted linear combination of its neighbors. The weights can be obtained by minimizing the reconstruction error:

$$E_{\mathrm{LLE}}(L) = \sum_{m=1}^{M} \left\| y_m - \sum_{i=1}^{K} L_{mi}\, y_i \right\|^2, \tag{3}$$

where $L$ is an $M \times K$ weight matrix containing the neighborhood information for each point in $Y$. Third, we compute the Gaussian kernel weight matrix $W$ that represents the non-rigid transformation in CPD, so that $W$ allows each point to be reconstructed by its neighbors after non-rigid transformation. $W$ is estimated by minimizing the cost function:

$$E_{\mathrm{local}}(W) = \sum_{m=1}^{M} \left\| \big(y_m + G(m,\cdot)W\big) - \sum_{i=1}^{K} L_{mi} \big(y_i + G(i,\cdot)W\big) \right\|^2, \tag{4}$$

where $G(m,\cdot)$ denotes the $m$th row of $G$.

The underlying assumption of LLE is local deformation coherence, which is complementary to that of CPD, i.e., global drift coherence. Using both of them leads to the GLTP algorithm, which can accommodate complex non-rigid articulated deformation. By extending the GMM-based formulation [4, 13], the objective function of GLTP is written as

$$Q_{\mathrm{GLTP}}(W, \sigma^2) = \sum_{m,n=1}^{M,N} p^{\mathrm{old}}(m|x_n)\, \frac{\left\| x_n - \big(y_m + G(m,\cdot)W\big) \right\|^2}{2\sigma^2} + \frac{N_p D}{2} \ln(\sigma^2) + \frac{\alpha}{2} E_{\mathrm{global}}(W) + \frac{\lambda}{2} E_{\mathrm{local}}(W), \tag{5}$$

where $\alpha$ and $\lambda$ are two trade-off parameters controlling the GMM matching term and the topological constraint terms, $D$ is the dimension of the data, $N_p = \sum_{n=1}^{N}\sum_{m=1}^{M} p^{\mathrm{old}}(m|x_n)$, and $p^{\mathrm{old}}(m|x_n)$ are the posterior probabilities from the previous GMM parameters:

$$p^{\mathrm{old}}(m|x_n) = \frac{\exp\left(-\frac{1}{2}\left\|\frac{x_n - \mathcal{T}(y_m, \Theta)}{\sigma^{\mathrm{old}}}\right\|^2\right)}{\sum_{i=1}^{M} \exp\left(-\frac{1}{2}\left\|\frac{x_n - \mathcal{T}(y_i, \Theta)}{\sigma^{\mathrm{old}}}\right\|^2\right) + c}, \tag{6}$$

where $\omega$ ($0 \le \omega \le 1$) is the weight of a uniform distribution which is added to account for outliers, and $c = \frac{(2\pi\sigma^2)^{D/2}\,\omega M}{(1-\omega)N}$. We rewrite the objective function (5) in matrix form as:

$$\begin{aligned} Q_{\mathrm{GLTP}}(W, \sigma^2) = {} & \frac{1}{2\sigma^2}\Big\{ \mathrm{Tr}\big(X^T d(P^T\mathbf{1})X\big) - 2\,\mathrm{Tr}\big(Y^T P X\big) - 2\,\mathrm{Tr}\big(W^T G P X\big) + \mathrm{Tr}\big(Y^T d(P\mathbf{1})Y\big) \\ & + 2\,\mathrm{Tr}\big(W^T G\, d(P\mathbf{1})Y\big) + \mathrm{Tr}\big(W^T G\, d(P\mathbf{1}) G W\big) \Big\} + \frac{N_p D}{2}\ln(\sigma^2) + \frac{\alpha}{2}\mathrm{Tr}\big(W^T G W\big) \\ & + \frac{\lambda}{2}\Big\{ \mathrm{Tr}\big(Y^T \mathbf{M} Y\big) + 2\,\mathrm{Tr}\big(W^T G \mathbf{M} Y\big) + \mathrm{Tr}\big(W^T G \mathbf{M} G W\big) \Big\}, \end{aligned} \tag{7}$$

where the $M \times N$ matrix $P$ has the elements $p^{\mathrm{old}}(m|x_n)$; the $M \times M$ matrix $\mathbf{M} = (I - \tilde{L})(I - \tilde{L})^T$; $\tilde{L}$ is an $M \times M$ matrix with $\tilde{L}_{ij} = L_{ij}$ for $j \le K$ and $\tilde{L}_{ij} = 0$ for $j > K$; $I$ is the identity matrix; $d(v)$ is the diagonal matrix formed from the vector $v$; and $\mathbf{1}$ is the column vector of all ones. To obtain the weight matrix $W$ which minimizes (7), we take the derivative of (7) with respect to $W$ and set it equal to zero. Then $W$ can be obtained by solving a linear system:

$$\big[ d(P\mathbf{1})G + \sigma^2 \alpha I + \sigma^2 \lambda \mathbf{M} G \big] W = PX - \big( d(P\mathbf{1}) + \sigma^2 \lambda \mathbf{M} \big) Y. \tag{8}$$

Similarly, we can obtain $\sigma^2$ by taking the corresponding derivative of (7) and setting it equal to zero. Then we have

$$\begin{aligned} \sigma^2 = \frac{1}{N_p D}\Big( & \mathrm{Tr}\big(X^T d(P^T\mathbf{1})X\big) - 2\,\mathrm{Tr}\big(Y^T P X\big) - 2\,\mathrm{Tr}\big(W^T G^T P X\big) + \mathrm{Tr}\big(Y^T d(P\mathbf{1})Y\big) \\ & + 2\,\mathrm{Tr}\big(W^T G^T d(P\mathbf{1})Y\big) + \mathrm{Tr}\big(W^T G^T d(P\mathbf{1}) G W\big) \Big). \end{aligned} \tag{9}$$

GLTP is solved by an EM algorithm similar to that in CPD. The probability of correspondence between the template $Y$ and the target $X$ is stored in the matrix $P$, from which a labeled target model $\hat{X} = [\hat{x}_1, \cdots, \hat{x}_M]^T$ is obtained. It was observed that the correspondence is largely determined in the initial stage and does not change significantly in later iterations. Therefore, it is helpful to slowly weaken the two topology constraints (i.e., to reduce both $\alpha$ and $\lambda$) during the iterations so that the GMM matching term becomes more and more dominant, which improves the matching accuracy by refining the transformation estimate progressively. The pseudo-code of GLTP is shown in Table 1.

Table 1. Pseudo-code for the proposed GLTP algorithm.

• Initialization: initialize $0 \le \omega \le 1$, $K > 0$, $\alpha_0 > 0$, $\beta > 0$, $\lambda_0 > 0$ and $\sigma^2 = \frac{1}{DMN}\sum_{m,n=1}^{M,N} \|x_n - y_m\|^2$
• Compute the Gaussian kernel $G$: $g_{ij} = \exp\left(-\frac{1}{2}\left\|\frac{y_i - y_j}{\beta}\right\|^2\right)$
• Compute the LLE weight matrix $L$
• Compute the matrix $\mathbf{M} = (I - \tilde{L})(I - \tilde{L})^T$
• While (stopping criteria not satisfied)
    E-step: Compute the matrix $P$ according to (6)
    M-step: Compute the weight matrix $W$ by solving (8)
            Compute $N_p = \mathbf{1}^T P \mathbf{1}$
            Compute $\sigma^2$ according to (9)
• End while
• Obtain the transformed point set $Y_{\mathrm{tran}} = Y + GW$
• Obtain the labeled target model $\hat{X}$ from the matrix $P$

3.3. Modified AICP for Pose Estimation

After the previous two steps, we have two labeled point sets with estimated correspondences, i.e., the subject-specific articulated model $\hat{Z}$ that includes $P$ rigid and connected body segments $\{S_1, \cdots, S_P\}$ (Step 1) and the labeled target model $\hat{X}$ for a new pose (Step 2). It is worth mentioning that correspondence estimation in $\hat{X}$ may not be accurate near joints due to possible strong articulation in $X$. Nevertheless, $\hat{X}$ provides a reasonably good correspondence initialization to be further registered with $\hat{Z}$. The objective now is to obtain the global rigid transformation for each body segment $S_p$, which is represented by

$$T^W_p = T^W_{\mathrm{root}} \cdots T^L_{\vee(p)}\, T^L_p,$$

where $\vee(p)$ denotes the index of the parent segment of $S_p$, $T^W_{\mathrm{root}}$ is the transformation of the root segment in the world coordinate system, and $T^L_p$ is the local transformation with respect to the joint connecting $S_p$ to its parent segment. $T^W_p$ can be obtained by minimizing the objective function:

$$Q(T^W_1, \cdots, T^W_P) = \sum_{p=1}^{P} \sum_{m=1}^{M_p} \left\| T^W_p\, \hat{z}^p_m - \hat{x}^p_m \right\|^2, \tag{10}$$

where $M_p$ is the number of points in $S_p$ and $\hat{x}^p_m \in \hat{X}$ is the correspondence of $\hat{z}^p_m \in S_p$, which has to be initialized.

AICP adopts a divide-and-conquer strategy to minimize (10) iteratively. The key idea is to iteratively estimate the articulated structure by assuming that it is partially rigid, so that ICP can be performed locally. In each iteration, a joint is selected randomly or according to a certain rule, the articulated model is split into two parts at that joint, ICP is performed on either part, and so on. Here we want to improve the flexibility of the original AICP. Instead of splitting the articulated model into two parts based on a selected joint, we iteratively select several connected segments as a rigid part for ICP optimization and then update the selected part along with its associated child segments. If the root (torso) is selected, all segments have to be updated. This scheme, which is referred to as the modified AICP algorithm, provides more flexibility for segment combinations and helps to deal with highly articulated limbs. Let $\Psi = \{S_1, \cdots, S_p\}$ denote $p$ ($p \le P$) connected segments (with $M_\Psi$ points) along the articulated structure of $\hat{Z}$, and let $\Psi_c$ represent the set of all their child segments. We define an objective function with respect to $\Psi$ by simplifying (10) as

$$Q(T^W_\Psi) = \sum_{m=1}^{M_\Psi} \left\| T^W_\Psi\, \hat{z}^\Psi_m - \hat{x}^\Psi_m \right\|^2, \tag{11}$$

where $\hat{z}^\Psi_m$ is a point in part $\Psi$ from $\hat{Z}$, and $\hat{x}^\Psi_m$ is its correspondence in $\hat{X}$, which is initialized by GLTP in Step 2. ICP iteratively updates the correspondence $\hat{x}^\Psi_m$, and the part-level rigid transformation $T^W_\Psi$ can be solved in closed form by minimizing (11). Following the tree structure in $\hat{Z}$, we deform all points that belong to $\Psi$ and $\Psi_c$ by $T^W_\Psi$, then move to the next iteration with a new $\Psi$, and so on. The pseudo-code of the modified AICP is given in Table 2, where the local transformations of hands and feet are ignored due to their small sizes. With a good initialization from GLTP, AICP normally converges quickly (in no more than 10 iterations).

Table 2. Pseudo-code for the modified AICP algorithm.

• Initialization
    • $\hat{Z}$: the subject-specific articulated model
    • $\hat{X}$: the labeled target model
• For ($p$ from the root to all child segments)
    • Local ICP for $\Psi = \{S_p\}$ according to (11)
• End for
• While (stopping criteria not satisfied)
    • For (each of four limbs)
        • Local ICP for $\Psi = \{S_{\mathrm{upper\text{-}limb}}\}$
        • Local ICP for $\Psi = \{S_{\mathrm{lower\text{-}limb}}\}$
        • Local ICP for $\Psi = \{S_{\mathrm{lower\text{-}limb}}, S_{\mathrm{upper\text{-}limb}}\}$
    • End for
• End while
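The part-level least-squares problem in this step (a rigid transformation that best maps the points of a selected part onto their current correspondences) has a closed-form solution. The sketch below illustrates it in 2D to stay dependency-free; this is a simplification chosen for this example, since the paper works with 3D data, where the analogous closed form is usually computed via an SVD. The names `fit_rigid_2d` and `apply_rigid_2d` are hypothetical.

```python
import math

def fit_rigid_2d(Z, X):
    """Least-squares rotation angle and translation mapping points Z onto X (paired by index)."""
    n = len(Z)
    zc = [sum(p[k] for p in Z) / n for k in range(2)]  # centroid of the part
    xc = [sum(p[k] for p in X) / n for k in range(2)]  # centroid of its correspondences
    s_cos = s_sin = 0.0
    for (zx, zy), (xx, xy) in zip(Z, X):
        ax, ay = zx - zc[0], zy - zc[1]
        bx, by = xx - xc[0], xy - xc[1]
        s_cos += ax * bx + ay * by   # accumulated dot products
        s_sin += ax * by - ay * bx   # accumulated 2D cross products
    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    t = (xc[0] - (c * zc[0] - s * zc[1]), xc[1] - (s * zc[0] + c * zc[1]))
    return theta, t

def apply_rigid_2d(points, theta, t):
    """Apply the fitted transformation, e.g. to the selected part and its child segments."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]
```

In a local ICP loop, this fit would alternate with re-assigning each part point to its nearest target point; applying the same transformation to the child segments keeps the articulated structure connected.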

4. Experiments

Our proposed framework is implemented in Matlab and evaluated on two kinds of publicly available datasets, namely SCAPE [1] (captured by a 3D laser scanner) and SMMC-10 [18] (depth sequences captured by a time-of-flight (ToF) depth camera). Below we present the results for the two datasets separately. It is worth noting that our framework does not use any training data (except for a generic “T-pose” template in Steps 1 and 2), and we estimate the pose for each frame individually and independently for the SMMC-10 dataset (no tracking involved).

4.1. Data Preparation

The SCAPE dataset contains a series of 3D point sets (12500 points each) captured from one male subject (the only one made publicly available) under different poses. The SCAPE dataset has one standard pose with ground-truth joint positions. In order to perform quantitative analysis, we need to generate the ground truth for the other poses. Since the dataset is fully registered (the index of each point stays the same across all poses), we can generate the ground-truth joint positions for other poses from the given standard pose in two steps. First, for each joint in the standard pose, we find a set of neighboring points around the joint area between two connected body segments and compute LLE weight coefficients to represent each joint locally. Then, for the new pose, we use the LLE weight coefficients and the associated points on the new pose (which share the same indexes as those in the standard pose) to reconstruct each joint position. The “T-pose” template (1000 points) used for SCAPE data is modified from the MotionBuilder humanoid model, which has a skeleton and can be represented by different body segments, as shown in Fig. 2(a) and (b).

Figure 2. (a) The “T-pose” template model used for the SCAPE dataset. (b) The labeled template. (c) The labeled initial pose. (d) The learned subject-specific articulated model for SCAPE data (black: estimated and blue: ground-truth).

The SMMC-10 dataset contains 28 depth image sequences with different motion activities and provides the corresponding ground-truth marker locations. The input depth image cannot be used directly due to background noise and undesirable objects. Therefore, we performed three pre-processing steps to make the depth data ready for pose estimation: background subtraction by depth thresholding, a modified Locally Optimal Projection (LOP) algorithm for denoising [19], and outlier removal by limiting the maximum allowable distance between two nearest points. Fig. 3 shows an example of depth pre-processing for the SMMC-10 dataset. After pre-processing, the point set for each frame contains around 5000 points. To accommodate the 2.5-dimensional nature of the depth image, we only use the frontal view of the “T-pose” template (1042 visible points), which has a skeleton (Fig. 4(a)) and can be represented by different body segments (Fig. 4(b)).

Figure 3. A sample of SMMC-10 dataset pre-processing. (a) The point set converted from a depth image; (b) The point set after background subtraction; (c) The point set after denoising.

Figure 4. (a) The “T-pose” template model for the SMMC-10 dataset. (b) The labeled template. (c) The initial pose from a depth image. (d) The subject-specific articulated model obtained by transforming the template model.

4.2. Experiments for the SCAPE Dataset

We first validate the proposed framework on 38 target poses from the SCAPE data, most of which have strong non-rigid articulation compared with the template. Since the ground-truth correspondence is not available, we use the labeling accuracy of body segments for correspondence evaluation. The pose estimation accuracy is validated by measuring the distance between each estimated joint position and its corresponding ground-truth position.

Shape/Skeleton Initialization by CPD: We create a subject-specific articulated model for an unknown subject by the two-stage CPD registration. Given the initial pose from the SCAPE data, we can obtain its labeled body segments in Fig. 2(c) and the estimated skeleton (joint positions) in Fig. 2(d). Compared with the given ground-truth skeleton, the average error of joint positions around limbs/shoulders/hips is only 2.88 cm (the error of the first-round CPD registration is 3.08 cm).

Correspondence Estimation by GLTP: We compare GLTP ($\alpha = 10$, $\lambda = 5 \times 10^6$ and $K = 10$) with CPD ($\beta = 2$, $\alpha = 3$) in Fig. 5 in terms of correspondence estimation. When the articulated deformation between the template and target is not significant, such as in the 1st pose, both CPD and GLTP perform well. However, in the cases of highly articulated deformations, e.g., poses 2 to 5, significant correspondence errors are observed around the head, limbs and body joints in the CPD results. On the other hand, GLTP provides stable correspondence estimation across all poses. However, the results around limb joints are not sufficiently reliable, and further refinement is needed for accurate pose estimation. We also compare the labeling accuracy of CPD, GLTP and AICP over all 38 poses in Fig. 6(a), which shows that GLTP is the best of the three and that AICP is better than CPD because its local rigidity assumption is more suitable for 3D human data.
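To make the labeling-accuracy measure concrete, the sketch below computes the posterior matrix of Eq. (6) for an already transformed template, transfers segment labels to the target, and scores them. The paper does not spell out how labels are read off from P; assigning each target point the label of its most probable template point is an assumption made here for illustration, and all function names are hypothetical.

```python
import math

def posterior(X, T, sigma2, omega):
    """P[m][n] = p_old(m | x_n) as in Eq. (6); T is the transformed template, omega < 1."""
    M, N, D = len(T), len(X), len(X[0])
    c = (2 * math.pi * sigma2) ** (D / 2) * omega * M / ((1 - omega) * N)  # outlier term
    P = [[0.0] * N for _ in range(M)]
    for n, x in enumerate(X):
        k = [math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, t)) / sigma2) for t in T]
        denom = sum(k) + c
        for m in range(M):
            P[m][n] = k[m] / denom
    return P

def transfer_labels(P, template_labels):
    """Assumed rule: each target point takes the label of its most probable template point."""
    M, N = len(P), len(P[0])
    return [template_labels[max(range(M), key=lambda m: P[m][n])] for n in range(N)]

def labeling_accuracy(predicted, truth):
    """Fraction of target points whose transferred segment label matches the reference label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```

Because of the outlier term `c`, each column of `P` sums to less than one; the remainder is the probability mass assigned to the uniform outlier component.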

Figure 7. Results of correspondence refinement before (above) and after (below) AICP, especially around limb joints (circled area).

Pose Estimation by Modified AICP: Both CPD and GLTP can be used to initialize correspondences for AICP-based pose estimation. We show the labeling accuracy of body segments (averaged over 38 poses) of our framework (GLTP+AICP) in Fig. 6(a). Using GLTP and AICP jointly (GLTP+AICP) yields a significant improvement and also outperforms the combination of CPD and AICP (CPD+AICP). We compare pose estimation results in terms of joint position error (in meters) in Fig. 6(b), where the estimated and ground-truth skeletons produce similar results, showing the effectiveness of shape/skeleton initialization (Step 1). Our algorithm significantly outperforms the other options, including CPD, GLTP, AICP, and CPD+AICP, showing the effectiveness of GLTP for correspondence estimation (Step 2) and the necessity of AICP for correspondence refinement (Step 3). We visualize some correspondence refinement results in Fig. 7, where obvious improvements are seen around limb joints. We also present pose estimation results in Fig. 8.

Figure 5. Correspondence estimation: Five SCAPE target models (1st row); CPD results (2nd row) and GLTP results (3rd row).





Figure 6. Result comparison on SCAPE data. (a) Labeling accuracy of body segments; (b) Average joint position errors (meter).

Figure 8. Pose estimation results for some SCAPE data.
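The ground-truth joints used in the SCAPE evaluation are reconstructed from LLE weights (Section 4.1), and the same kind of weights underlies the local constraint in GLTP. A minimal sketch of that computation follows. The sum-to-one constraint and the small regularization term come from the standard LLE formulation [15] and are assumptions here, since the paper does not state them; `solve`, `lle_weights` and `reconstruct` are illustrative names.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x

def lle_weights(point, neighbors, reg=1e-3):
    """Weights minimizing ||point - sum_i w_i * neighbor_i||^2 subject to sum_i w_i = 1."""
    K = len(neighbors)
    diffs = [[p - q for p, q in zip(point, nb)] for nb in neighbors]
    C = [[sum(a * b for a, b in zip(di, dj)) for dj in diffs] for di in diffs]  # local Gram matrix
    trace = sum(C[i][i] for i in range(K))
    eps = reg * trace if trace > 0 else reg  # regularize: C is singular when K > D
    for i in range(K):
        C[i][i] += eps
    w = solve(C, [1.0] * K)
    s = sum(w)
    return [wi / s for wi in w]

def reconstruct(weights, neighbors):
    """Rebuild a point (e.g. a joint) from the same-index neighbors in another pose."""
    D = len(neighbors[0])
    return [sum(w * nb[d] for w, nb in zip(weights, neighbors)) for d in range(D)]
```

Because the weights sum to one, the reconstruction is unchanged by any translation or rotation applied jointly to the point and its neighbors, which is why weights computed in the standard pose can be reused on a new pose as long as the neighborhood moves approximately rigidly.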






Figure 9. Comparison results of the mean pose estimation error (meter). (a) For 26 sequences (without 25 and 28); (b) For all 28 sequences.

4.3. Experiments for the SMMC-10 Dataset

Our algorithm is compared against several state-of-the-art algorithms [18, 2, 19, 17, 7, 12] in terms of the error between each estimated joint and its corresponding marker. Due to the inconsistency between the definition of joints and the configuration of markers, we need to remove the constant offset at each joint, as done for the other algorithms.

Shape/Skeleton Initialization by CPD: Unlike the 3D SCAPE data, which are nearly noise-free and complete (without occlusion), the depth data may still be noisy and incomplete due to occlusion, even after pre-processing. We cannot directly use the correspondences estimated from the depth image to learn a subject-specific articulated model. Instead, we use the transformed template obtained by CPD as the subject-specific articulated model, which has no missing data and follows the existing topology constraints among different body segments. Fig. 4(c) and (d) show the initial pose from the depth image and the learned subject-specific articulated model with labeled segments and the estimated skeleton (joint positions), respectively.

Correspondence Estimation by GLTP: Given a new depth image, we use the “T-pose” template model to estimate correspondences via GLTP, which improves the accuracy of correspondence estimation compared with CPD. However, depth images often suffer from poor quality due to noise and occlusion, and we cannot use labeled depth images for AICP-based pose estimation. Similar to the idea of the previous step, we replace the observed depth image by a transformed template obtained by GLTP, where body segments are roughly labeled due to imperfect registration.

Pose Estimation by Modified AICP: We perform the modified AICP for pose estimation by registering two transformed template models. One is the subject-specific articulated model (Step 1), and the other is a substitute for the observed depth image (Step 2). Out of 28 depth sequences, the subject keeps a stable viewpoint in all but two (25 and 28). In sequences 25 and 28, the subject undergoes significant viewpoint changes (i.e., the visible body part changes from the side to the back). Since the “T-pose” template used here only captures the frontal view, GLTP cannot work well in sequences 25 and 28, leading to inaccurate pose estimation. Therefore, our comparative analysis is done in two different settings, i.e., 26 sequences (without sequences 25 and 28) and all 28 sequences. Without the two sequences (25 and 28) that do not follow our assumption about the camera view, our framework achieves a result comparable with the best ones so far [19, 17, 12], as shown in Fig. 9(a). Fig. 9(b) shows that our method is better than [18, 2, 7] but slightly worse than [19, 17, 12] with respect to all sequences. We also illustrate some pose estimation results of four selected sequences (4, 21, 26, 25) in Fig. 10, which are consistent with the above analysis. It is worth noting that our method does not require training data, which is needed in [19, 17, 2, 7], and no tracking is involved, which is a critical part of [2, 7, 12]. There is great potential to further improve the accuracy and robustness of pose estimation by incorporating a complete template model along with occlusion handling and temporal pose tracking.

Figure 10. Pose estimation results for some sequences from SMMC-10 dataset. Rows 1 to 4 are the results of seven sample frames from sequences 4, 21, 26 and 25 respectively.
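The joint-to-marker error used for the SMMC-10 comparison, including the removal of a constant offset at each joint, can be sketched as follows. The paper does not specify how the offset is estimated; taking it as the per-joint mean displacement over all frames, expressed in world coordinates, is an assumption made for this illustration, and the function names are hypothetical.

```python
import math

def per_joint_offset(estimates, markers):
    """Mean marker-minus-estimate vector per joint; inputs are indexed [frame][joint] -> (x, y, z)."""
    F, J = len(estimates), len(estimates[0])
    return [tuple(sum(markers[f][j][d] - estimates[f][j][d] for f in range(F)) / F
                  for d in range(3)) for j in range(J)]

def mean_error(estimates, markers, offsets=None):
    """Mean Euclidean joint-to-marker distance, optionally after adding per-joint offsets."""
    total, count = 0.0, 0
    for est_frame, marker_frame in zip(estimates, markers):
        for j, (e, m) in enumerate(zip(est_frame, marker_frame)):
            if offsets is not None:
                e = tuple(a + b for a, b in zip(e, offsets[j]))
            total += math.dist(e, m)
            count += 1
    return total / count
```

The result is in the same units as the input coordinates, so coordinates given in meters yield an error in meters.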


4.4. Discussion

GLTP plays a critical role in non-rigid registration, which is the prerequisite for the modified AICP. In practice, we found that GLTP still has some failure cases (not correctable by AICP). First, when some body segments are very close or even connected, EM optimization is easily trapped in local minima, leading to mislabeled body segments. Second, when there are significant non-rigid local deformations in some body segments, the LLE-based topology constraint may be corrupted, resulting in invalid registration results. Third, a significant view change may affect correspondence estimation due to incomplete data (e.g., non-visible points).

The main computational complexity of the proposed framework is shown in Table 3. In practice, we test GLTP and the modified AICP in Matlab on a PC with an Intel i7 3.40 GHz CPU and 32 GB RAM. The template and target models both have around 1100 points, and we set 100 iterations for GLTP and 10 iterations for the modified AICP. The running time is around 13 s with an un-optimized Matlab implementation.

Table 3. Computational complexity of the three algorithms involved.

Algorithm    Computational complexity
CPD          $O(MN) + O(M^3)$
GLTP         $O(MN) + O(M^3) + O(M^2) + O(MK^3)$
AICP         $O(M_\Psi^2)$

$M$ and $N$ are the numbers of points in the template and target, respectively. $K$ is the number of LLE neighbors in GLTP. $M_\Psi$ is the number of points in a selected rigid part $\Psi$ in AICP.

5. Conclusion

We have studied human pose estimation from the perspective of non-rigid articulated registration. Specifically, we have focused on two recent techniques, CPD and AICP, by addressing their limitations in the context of human pose estimation. We have developed a new GMM-based registration algorithm which integrates two topologically complementary constraints (CPD and LLE) into a unified probability density estimation framework and provides reliable correspondence estimation for AICP initialization. We also proposed a modified AICP algorithm for human pose estimation. The experiments on 3D scan data and depth images show the competitiveness of our framework. Our future work will focus on developing an integrated registration framework where an articulated structure is fused with other topology constraints to handle more difficult poses with ambiguous body segments and complex local non-rigid deformations, as well as view changes.

References

[1] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. SCAPE: Shape completion and animation of people. ACM Trans. Graph., 24:408–416, 2005.
[2] A. Baak, M. Müller, G. Bharaj, H.-P. Seidel, and C. Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In Proc. ICCV, 2011.
[3] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. TPAMI, 14(2):239–256, 1992.
[4] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., New York, NY, USA, 1995.
[5] D. Demirdjian. Combining geometric- and view-based approaches for articulated pose estimation. In Proc. ECCV, 2004.
[6] D. Droeschel and S. Behnke. 3D body pose estimation using an adaptive person model for articulated ICP. In Proc. ICIRA, 2011.
[7] T. Helten, A. Baak, G. Bharaj, M. Müller, H.-P. Seidel, and C. Theobalt. Personalization and evaluation of a real-time depth-based full body tracker. In Proc. International Conference on 3D Vision, pages 279–286, 2013.
[8] R. Horaud, F. Forbes, M. Yguel, G. Dewaele, and J. Zhang. Rigid and articulated point registration with expectation conditional maximization. TPAMI, 33(3):587–602, 2011.
[9] B. Jian and B. C. Vemuri. Robust point set registration using Gaussian mixture models. TPAMI, 33(8):1633–1645, 2011.
[10] J. Ma, J. Zhao, J. Tian, Z. Tu, and A. L. Yuille. Robust estimation of nonrigid transformation for point set registration. In Proc. CVPR, 2013.
[11] J. Ma, J. Zhao, J. Tian, A. Yuille, and Z. Tu. Robust point matching via vector field consensus. IEEE Trans. Image Processing, 23(4):1706–1721, 2014.
[12] M. Ye and R. Yang. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In Proc. CVPR, 2014.
[13] A. Myronenko and X. Song. Point set registration: Coherent point drift. TPAMI, 32(12):2262–2275, 2010.
[14] S. Pellegrini, K. Schindler, and D. Nardi. A generalization of the ICP algorithm for articulated bodies. In Proc. BMVC, 2008.
[15] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[16] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. CVPR, 2011.
[17] J. Taylor, J. Shotton, T. Sharp, and A. W. Fitzgibbon. The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In Proc. CVPR, 2012.
[18] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. Real time motion capture using a single time-of-flight camera. In Proc. CVPR, 2010.
[19] M. Ye, X. Wang, R. Yang, L. Ren, and M. Pollefeys. Accurate 3D pose estimation from a single depth image. In Proc. ICCV, 2011.
[20] Z. Zhang. Iterative point matching for registration of free-form curves and surfaces. IJCV, 13(2):119–152, 1994.