Reconstructing 3D Human Body Pose from Stereo Image Sequences Using Hierarchical Human Body Model Learning

Hee-Deok Yang and Seong-Whan Lee
Department of Computer Science and Engineering, Korea University
Anam-dong, Seongbuk-gu, Seoul 136-713, Korea
{hdyang, swlee}@image.korea.ac.kr

Abstract

This paper presents a novel method for reconstructing a 3D human body pose using depth information based on top-down learning. The human body pose is represented by a linear combination of prototypes of 2D depth images and their corresponding 3D body models in terms of the position of a predetermined set of joints. For an input 2D depth image, the optimal coefficients for a linear combination of prototypes of 2D depth images are estimated using least-squares minimization. The 3D body model of the input depth image is obtained by applying the estimated coefficients to the corresponding 3D body models of the prototypes. In the learning stage, the model is constructed hierarchically by recursively classifying the training data into several clusters using silhouette images and depth images. By applying hierarchical human body model learning, poses that appear similar in a silhouette image can be estimated as different 3D human body poses. The proposed method has been tested on sequences of 20 persons and achieved an average error of 12.3 degrees over all human body components.

1. Introduction

There has been growing interest in improving the interaction between humans and computers. It is argued that, to achieve comfortable Human-Computer Interaction (HCI), the computer must be able to interact naturally with the human, in a way similar to human-human interaction. In everyday life, humans can easily understand human body pose from low-resolution images or from images obtained directly through human vision. It is believed that humans employ extensive prior knowledge regarding human body structure and motion in this task [5], and that humans use two eyes for capturing images, in effect having a stereo vision system.

From the perspective of a computer vision system, 3D information is lost when the three-dimensional world is projected into a two-dimensional world. Much research has addressed reconstructing a 2D or 3D body pose from 2D information, such as edges or silhouette images, extracted from a monocular image, e.g., a photograph or a video [1, 2, 3, 5, 6, 7]. Sminchisescu and Triggs [6] recovered 3D human body motion from monocular video sequences based on a robust image matching metric and the incorporation of joint limits and non-self-intersection constraints. To overcome the ambiguity of 2D features, depth information is applied to reconstruct the 3D human body pose. As shown in Fig. 1, even if the body poses appear similar, their depth images differ from each other.

Figure 1. Motivation of research (diagram labels: 3D World, 2D Projection, 2D World, Depth Information, 3D Human Body Pose, 3D Reconstruction; views at -45°, 0° and 45°)

2. Human Body Model

For the purpose of human body representation, an articulated human body is modeled. The human body model consists of a kinematic skeleton of articulated joints covered by 'flesh' built from superquadric ellipsoids with additional deformation parameters. The 3D articulated human body model consists of 17 body segments, covering the major human body parts. Fig. 2 presents the structure of the proposed 3D human body model. The articulated human body model has 40 DOF (37 DOF for the body model and 3 DOF for global translation).

0-7695-2521-0/06/$20.00 (c) 2006 IEEE

Figure 2. The proposed 3D human model. Structure of body segments: Lower Torso (1), Middle Torso (2), Upper Torso (3), Neck (4), Head (5), Right Upper Arm (6), Right Lower Arm (7), Right Hand (8), Left Upper Arm (9), Left Lower Arm (10), Left Hand (11), Right Upper Leg (12), Right Lower Leg (13), Right Foot (14), Left Upper Leg (15), Left Lower Leg (16), Left Foot (17)

3. Hierarchical Data Learning

A 3D human body pose can be represented as a 2D depth image and a 3D body model. A depth image is represented by a linear combination of prototypes of 2D depth images. The optimal coefficients for a linear combination of prototypes of 2D depth images and their corresponding 3D body models are estimated using least-squares minimization.

3.1 3D Gesture Representation

An approach based on linear combinations of prototypes is used to reconstruct the 3D human body pose from continuous depth images. If a sufficiently large number of pairs of a depth image and its 3D body model are used as prototypes of the human gesture, an input 2D depth image can be reconstructed by a linear combination of 2D depth image prototypes. The reconstructed 3D body model is obtained by applying the estimated coefficients to the corresponding 3D body models of the prototypes, as demonstrated in Fig. 3. The goal is to find an optimal parameter set $\alpha$ that best estimates the 3D human body pose from a given depth image.

To create varied prototypes of 2D depth images and their 3D body models, data are generated using the 3D human model described in Sec. 2. The depth image is represented by a vector $d_i = (d_1^c, \ldots, d_n^c)^T$, where $n$ is the number of pixels in the image and $d^c$ is the value of a pixel in the depth image. The 3D body model is represented by a vector $p_i = ((x_1, y_1, z_1), \ldots, (x_q, y_q, z_q))^T$, where $x$, $y$ and $z$ are the position of a body joint in the 3D world. Eq. (1) defines the training data:

$$D = (d_1, \ldots, d_m), \quad P = (p_1, \ldots, p_m), \quad S = (s_1, \ldots, s_m) \qquad (1)$$

where $m$ is the number of prototypes, $s_i = (s_1^c, \ldots, s_n^c)^T$ is a silhouette image, and $s^c$ is the value of a pixel in the silhouette image. A 2D depth image is represented by a linear combination of a number of prototypes of 2D depth images, and its 3D body model is obtained by applying the estimated coefficients to the corresponding 3D body models of the prototypes, as in Eq. (4).

Figure 3. Basic idea of the proposed method (diagram: the input depth image equals $\alpha_1$ × prototype depth image 1 + $\alpha_2$ × prototype depth image 2 + … + $\alpha_m$ × prototype depth image m; the 3D human body model equals the same combination of prototype 3D human models 1 to m, taken from pairs of depth images and their 3D human models)
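The representation of Sec. 3.1 can be restated compactly in matrix form. The following is only a restatement under assumptions that are not spelled out in the text: the prototypes are stacked column-wise, $\tilde{d}$ denotes the input depth image, $\tilde{p}$ denotes its reconstructed 3D body model, and $\alpha$ denotes the coefficient vector.

```latex
\[
  \tilde{d} \;\approx\; D\alpha \;=\; \sum_{k=1}^{m} \alpha_k d_k ,
  \qquad
  \tilde{p} \;=\; P\alpha \;=\; \sum_{k=1}^{m} \alpha_k p_k ,
\]
\[
  D \in \mathbb{R}^{n \times m}, \quad
  P \in \mathbb{R}^{3q \times m}, \quad
  \alpha \in \mathbb{R}^{m} .
\]
```

Under this reading, the same $m$ coefficients are shared between the depth-image space ($n$ pixels) and the joint-position space ($3q$ coordinates), which is what allows a fit computed on depth images to be carried over to the 3D body model.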

3.2 Hierarchical Human Body Model Database

To cluster the prototypes, the database is constructed hierarchically. Given a set of silhouette images, depth images and their 3D body models for training, the data are classified into several clusters. A set of clusters is built in which the members of each cluster have a similar shape in 2D silhouette image space. Then, each cluster is recursively divided into several sub-clusters. The K-means algorithm is applied to divide the training data into sub-clusters. The hierarchical model has four levels, as presented in Fig. 4.

Figure 4. Building a hierarchical human body model database (diagram: Level 1 (k=10), Level 2 (k=20), Level 3 (k=20) and leaf nodes (10-60); labels: Lower Level, Higher Level, Using silhouette images, Using depth images)
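The recursive clustering of Sec. 3.2 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names (`kmeans`, `build_tree`, `leaves`), the toy feature vectors, the small `ks` values, and the choice of which feature set drives each level are all assumptions made for the example. The paper itself only states that K-means is applied recursively, using silhouette and depth images, over four levels.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def build_tree(indices, features_per_level, ks, level=0):
    """Recursively split sample indices, one feature set and one k per level."""
    if level == len(ks) or len(indices) <= ks[level]:
        return {"leaf": True, "indices": indices}
    feats = features_per_level[level]
    labels = kmeans([feats[i] for i in indices], ks[level])
    children = []
    for c in range(ks[level]):
        subset = [i for i, lab in zip(indices, labels) if lab == c]
        if subset:
            children.append(build_tree(subset, features_per_level, ks, level + 1))
    return {"leaf": False, "children": children}

def leaves(node):
    """Collect the index lists stored at the leaf nodes."""
    if node["leaf"]:
        return [node["indices"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]

# Toy data (hypothetical): 60 samples with 4-dimensional features.
rng = random.Random(1)
silhouettes = [[rng.random() for _ in range(4)] for _ in range(60)]
depths = [[rng.random() for _ in range(4)] for _ in range(60)]
# Assumed split order: silhouette features first, then depth features.
tree = build_tree(list(range(60)), [silhouettes, depths], ks=[3, 2])
```

Each leaf of `tree` holds the indices of the prototypes that would be used together at reconstruction time; every training sample ends up in exactly one leaf.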

4. Reconstruction of 3D Human Body Pose

To reconstruct a 3D human body pose, a four-level hierarchical model is used. In the first level, the 3D human body pose is estimated with a Silhouette History Image (SHI) [9], a spatio-temporal feature that includes continuous silhouette information. The input silhouette images are compared with the prototypes of 2D silhouette images, and the prototype with the smallest distance is selected. Template matching is used to compare two silhouette images. In the bottom level, the 3D human body pose is estimated using a linear combination of prototypes of 2D depth images.

To reconstruct the 3D human pose at the bottom level, the inverse of the matrix $D$ in Eq. (2) is required. The inverse $D^{-1}$ of a matrix $D$ exists only if $D$ is square. However, $D$ is not square. In this case, Singular Value Decomposition (SVD) is applied, and the pseudo-inverse $D^{+}$ of $D$ is computed as:

$$D = U W V^T, \quad D^{+} = V W^{-1} U^T \qquad (2)$$

In addition, the solution of $D\alpha = \tilde{D}$ can be rewritten as:

$$\alpha = D^{+} \tilde{D} \qquad (3)$$

After calculating $D^{+}$, the set of coefficients $\alpha$ of the prototypes is solved using Eq. (3). The depth of the image and the position of each segment of the 3D human model are calculated using Eq. (4):

$$\tilde{D}_i = \sum_{k=1}^{m} \alpha_k d_k, \quad P_i = \sum_{k=1}^{m} \alpha_k p_k \qquad (4)$$
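The bottom-level estimation of Eqs. (2) to (4) can be sketched with NumPy as follows. This is a sketch under stated assumptions rather than the authors' code: the function name `estimate_coefficients`, the toy sizes, the random prototype matrices, and the singular-value tolerance are all hypothetical, and the prototypes are assumed to be stored as matrix columns.

```python
import numpy as np

def estimate_coefficients(D, d_input):
    """Least-squares coefficients for D @ alpha ~= d_input via the SVD pseudo-inverse."""
    U, w, Vt = np.linalg.svd(D, full_matrices=False)
    # Invert only singular values above a tolerance (assumed choice) for stability.
    tol = max(D.shape) * np.finfo(float).eps * w.max()
    w_inv = np.zeros_like(w)
    mask = w > tol
    w_inv[mask] = 1.0 / w[mask]
    D_pinv = Vt.T @ np.diag(w_inv) @ U.T   # D^+ = V W^{-1} U^T, as in Eq. (2)
    return D_pinv @ d_input                # alpha = D^+ d, as in Eq. (3)

# Toy sizes (hypothetical): n pixels, q joints, m prototypes.
rng = np.random.default_rng(0)
n, q, m = 50, 4, 5
D = rng.random((n, m))        # columns are prototype depth images d_k
P = rng.random((3 * q, m))    # columns are prototype joint positions p_k

alpha_true = rng.random(m)
d_input = D @ alpha_true      # synthetic input lying in the span of the prototypes

alpha = estimate_coefficients(D, d_input)
pose = P @ alpha              # reconstructed joint positions, as in Eq. (4)
```

Because the synthetic input is an exact combination of the prototypes, the recovered `alpha` matches `alpha_true` up to rounding. For a real depth image the fit is only approximate, and `np.linalg.pinv(D)` could be used in place of the explicit SVD steps.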

5. Experimental Results and Analysis

5.1 Experimental Environment

For training the proposed method, approximately 40,000 pairs of silhouette images, depth images and their 3D human models were generated. Fig. 5(a) presents examples of the training data.

Figure 5. Example of train and test data (panels: (a) Examples of train data; (b) Examples of FBG database; (c) Examples of real data; rows: Left images of stereo camera, Silhouette images, Depth images, 3D human body models)

For testing the performance of the proposed method, two data sets are used. The first is the KU Gesture database [4, 10], referred to below as the FBG (Full-Body Gesture) database. The data are captured as 3D motion data. Fig. 5(b) presents examples of the FBG database. The second consists of real data captured with a stereo camera, the Videre STH-MDCS2.

5.2 Experimental Results and Analysis

To verify the effectiveness of the proposed method, experiments were carried out on the FBG database. The FBG database has the ground truth for the 3D human body pose in the 3D world. Therefore, the ground truth and the reconstructed 3D human body pose in the 3D world are compared. The angle calculated in the xz-plane has various values. Therefore, two angles are measured, in the yz-plane and the xy-plane. The average error is calculated as:

$$\theta_i = \frac{1}{l} \sum_{t=1}^{l} \left| G(\theta_i) - R(\theta_i) \right| \qquad (5)$$

where $l$ is the total number of frames in a sequence, $G(\theta_i)$ is the $\theta$ of the $i$th body segment in the ground truth, and $R(\theta_i)$ is the $\theta$ of the $i$th body segment in the 3D human body reconstructed using the proposed method. $\psi_i$ is also calculated using Eq. (5). The average error is about 12.3 degrees for all human body components.

5.2.1 Experiments with FBG Database

Fig. 6 presents the estimated results obtained for several images from the front view of the FBG database. The estimated 3D human body model is shown from a front view, a left 45-degree view and a right 45-degree view, respectively.

Figure 6. Examples of the reconstructed 3D human body pose with sitting on a chair sequence of the FBG database (columns: Left images of stereo camera; Depth images; Normalized silhouette images; Normalized depth images; Reconstructed 3D human body models (-45, 0, 45 degree view))

Fig. 7 presents the estimated angles of the left upper leg for a walking-in-place sequence. The average error is approximately 7 degrees. As shown in Fig. 7, the estimated joint angle changes rapidly at frames 7, 14 and 27, because these frames lie on the boundaries between clusters produced by the clustering algorithm.

Figure 7. Temporal curve of joint angles of left upper leg with the sequence in Figure 6 (plot: joint angle, 90 to 180, against frame number, 0 to 70; curves: Estimated, Ground Truth)

5.2.2 Experiments with Real Data

Fig. 8 presents the reconstructed 3D human body pose for various poses with real data. Despite raw segmentation, the proposed method outputs the 3D human body pose accurately. However, it is observed that some poses are inherently difficult to detect. Therefore, some results are not accurate.
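The average angular error of Eq. (5), used for the FBG experiments in Sec. 5.2, can be sketched as a short function. This is a minimal illustration: the function name and the sample angle values are hypothetical, the error is taken as the mean absolute difference over frames, and angles are assumed to be in degrees without wrap-around at 360.

```python
def average_angle_error(ground_truth, reconstructed):
    """Mean absolute difference (in degrees) between two per-frame angle sequences."""
    if len(ground_truth) != len(reconstructed) or not ground_truth:
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(abs(g - r) for g, r in zip(ground_truth, reconstructed))
    return total / len(ground_truth)

# Hypothetical joint angles for one body segment over four frames.
gt = [100.0, 110.0, 120.0, 130.0]
rec = [98.0, 115.0, 120.0, 127.0]
err = average_angle_error(gt, rec)  # (2 + 5 + 0 + 3) / 4 = 2.5
```

The same function applies to both measured angles of a body segment; averaging its output over all segments gives a single figure comparable to the overall error reported in the text.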

Figure 8. Examples of reconstructed 3D human body pose with various real data (columns: Left images of stereo camera; Reconstructed 3D human body models (-45, 0, 45 degree view))

6. Conclusion and further research

In this paper, an efficient method for reconstructing a 3D human body pose from a stereo image sequence, using top-down learning, is proposed. A human body pose is represented by a linear combination of 2D depth image prototypes and their corresponding 3D body models, in terms of the position of a predetermined set of joints. Using the 2D depth images and their corresponding 3D body models, optimal coefficients for a linear combination of prototypes of the 2D depth images and their corresponding 3D body models can be estimated using least-squares minimization. By applying hierarchical human body model learning, poses that appear similar in a silhouette image can be estimated as different 3D human body poses.

Two interesting problems remain for further research. The first is a method for handling the differing sizes of real human bodies and the 3D human body model, along with errors in silhouette extraction. The second is a method for extending the number of characteristic views. By using additional low-level information such as color and edge information, the relationship between human body components can be analyzed correctly.

Acknowledgment

This research was supported by the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Commerce, Industry and Energy of Korea. This research was also partially supported by the Brain Korea 21 project in 2006.

References

[1] A. Agarwal and B. Triggs, “3D Human Pose From Silhouette by Relevance Vector Regression,” Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington D.C., USA, July 2004, pp. 882-888.
[2] R. Bowden, et al., “Non-linear Statistical Models for 3D Reconstruction of Human Pose and Motion from Monocular Image Sequences,” Image and Vision Computing, No. 18, 2000, pp. 729-737.
[3] R. T. Heap and D. Hogg, “Improving Specificity in PDMs Using a Hierarchical Approach,” Proc. of 8th British Machine Vision Conference, Colchester, UK, Sep. 1997, pp. 590-599.
[4] B.-W. Hwang, S. Kim, and S.-W. Lee, “Full-Body Gesture Database for Analyzing Daily Human Gestures,” Advances in Intelligent Computing, Lecture Notes in Computer Science, Vol. 3644, Hefei, China, pp. 611-620, The KU Gesture Database, http://GestureDB.korea.ac.kr/.
[5] R. Rosales and S. Sclaroff, “Specialized Mapping and the Estimation of Human Body Pose from a Single Image,” Proc. of IEEE Workshop on Human Motion, Texas, USA, Dec. 2000, pp. 19-24.
[6] C. Sminchisescu and B. Triggs, “Estimating Articulated Human Motion with Covariance Scaled Sampling,” International Journal of Robotics Research, Vol. 22, No. 6, 2003, pp. 371-393.
[7] C. Taylor, “Reconstruction of Articulated Objects from Point Correspondences in a Single Uncalibrated Image,” Computer Vision and Image Understanding, Vol. 80, No. 3, 2000, pp. 349-363.
[8] S. Ullman and R. Basri, “Recognition by Linear Combinations of Models,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 13, No. 10, 1991, pp. 992-1006.
[9] H.-D. Yang, S.-K. Park, and S.-W. Lee, “Reconstruction of 3D Human Body Pose Based on Top-Down Learning,” Advances in Intelligent Computing, Lecture Notes in Computer Science, Vol. 3644, Hefei, China, August 2005, pp. 601-610.
