Upper Body Pose Estimation from Stereo and Hand-face Tracking

Jane Mulligan, University of Colorado at Boulder, [email protected]

Abstract

In applications such as immersive telepresence we want to extract high quality 3D models of collaborators in real time from multiview image sequences. One way to improve the quality of stereo or visual hull based models is to estimate the kinematic pose of the user first and then constrain 3D reconstruction accordingly. To serve as a preprocessing step such pose extraction must be very fast, precluding the usual generate-and-test techniques. We examine a method based on psychophysical evidence that known relative hand position can be used to directly compute the pose of the arm. First we explore a number of possible models for this relationship using motion capture data. We then examine how reconstruction of face and hand position, as well as a patch on the torso, allows us to exploit these simple direct calculations to estimate the pose of a user in a desktop collaboration environment.

Keywords: motion capture, human pose estimation

1. Introduction

In immersive telepresence applications we want representations of remote collaborators which are as true to life as possible [14]. However, stereo algorithms are sensitive to the number and range of disparities tested, and space carving approaches are sensitive to errors in silhouette extraction. A richer model including the kinematic pose of the user is useful for predicting silhouette shape and depth/disparity ranges for correlation. This is similar to the common skeleton-surface model used in computer animation. Pose and trajectory information are also important in networked communication settings because they provide a basis for interpolating over dropped frames, or for animating a suitable avatar over low bandwidth connections. In recent years a large volume of work termed "looking at people" has evolved [1, 13, 15]. We are interested in extracting a kinematic pose to support 3D modeling and rendering of a human subject. Our work therefore falls into the category of pose estimation or motion capture [13] rather than full gesture recognition.

Previous human motion analysis work on extracting 3D pose from video sequences frequently involved matching the projections of kinematic configurations to the image contents, in a classic hypothesize-and-test manner [4]. Problems in correctly identifying the pose arise in cases of self-occlusion and foreshortening [3, 12, 6] or when silhouette extraction fails. Sidenbladh et al. [16] exploit a database of motion capture data to predict the next pose in a sequence by searching for a similar sequence over the d preceding timesteps. Goncalves et al. [7] learn models of human motion based on extracted 2D pose data.

In this paper, we begin our investigation into kinematic model extraction with the problem of extracting only the arm pose. In some sense this is the most challenging body component in the desktop collaboration environment, since the arms and hands tend to move farther, more rapidly and more often than the head and torso. In this environment we expect to be able to extract the left and right arm pose independently using 3D hand trajectories, then compose the arm models with a simple torso position model derived from the tracked face position. Often we would like a means of predicting parameters based on natural human postures: obviously humans choose among their redundant configurations every day. Moeslund and Granum [12] and Tolani et al. [19] use similar models to account for the redundant degree of freedom in the human arm pose. Moeslund's [11] screw axis representation is not related to human anatomy, but defines the screw axis as the vector from shoulder (S) to hand (H). The arm link lengths are fixed, so there is no translation for a particular hand position as the elbow rotates about the screw axis by an angle α. For a shoulder centred coordinate frame the arm pose can be specified by 4 parameters: the hand position (Hx, Hy, Hz) and the angle α. This representation only describes the arm of course, but it is very useful for describing rather natural adjustments of arm pose. Adjusting arm pose to avoid joint limits or match image contents only requires adjusting the elbow swivel angle around the ellipse. The model does not however give us intuitions about what pose to select given some hand target location.
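As a concrete illustration of this swivel parameterization, the sketch below computes an elbow position on the circle about the shoulder-hand axis for a given swivel angle α, assuming fixed link lengths lu and lf. The function, its argument names and its choice of reference vector are illustrative, not the implementation from [11].

```python
import numpy as np

def elbow_from_swivel(S, H, lu, lf, alpha):
    """Elbow position for a swivel-angle arm parameterization.

    S, H   : 3-vectors, shoulder and hand positions
    lu, lf : upper arm and forearm lengths
    alpha  : swivel angle (radians) about the shoulder-hand axis
    """
    S, H = np.asarray(S, float), np.asarray(H, float)
    d = H - S
    L = np.linalg.norm(d)                 # shoulder-hand distance
    axis = d / L                          # unit swivel axis

    # Distance from the shoulder to the elbow-circle centre along the axis,
    # and the circle radius, from the two link-length constraints.
    a = (lu**2 - lf**2 + L**2) / (2.0 * L)
    r = np.sqrt(max(lu**2 - a**2, 0.0))   # clamp for numerical safety
    centre = S + a * axis

    # Orthonormal basis (u, v) spanning the plane of the elbow circle.
    ref = np.array([0.0, 0.0, 1.0])
    if abs(np.dot(ref, axis)) > 0.9:      # axis nearly vertical: pick another reference
        ref = np.array([1.0, 0.0, 0.0])
    u = np.cross(axis, ref); u /= np.linalg.norm(u)
    v = np.cross(axis, u)

    return centre + r * (np.cos(alpha) * u + np.sin(alpha) * v)
```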

Hands can usually be located in the image; unfortunately human arm kinematics are redundant, so we cannot simply use inverse kinematics to identify a pose based on hand centroid (endpoint) position. We need a method to estimate a single pose among the many available to generate a particular hand observation. Of course humans solve this problem as they reach to point or grasp objects in daily life. Psychophysical research by Soechting and Flanders [17, 18] examines how human subjects' performance in pointing to visible and obscured targets relates to target position. They describe a kinematic model which Kondo [9] refers to as the Sensorimotor Transformation Model (STM). This model uses linear functions of the components of target endpoint position to generate the 4 angles required to define the arm pose. Again this model only applies to arm posture, but it does give us a method for choosing poses preferred by humans. Kondo has used it to implement an inverse kinematics algorithm for computer animation [9, 8].

As a preprocessing step for detailed real-time surface reconstruction, pose estimation and tracking must be very fast. Extensive work on hand and face extraction is described in the literature and we will not attempt to improve on these approaches. We use the simplest approach, exploiting colour segmentation of fleshtones in normalized RGB space, which has proved fast and effective [20, 21]. Moeslund and Granum [12] estimate arm pose for a fixed shoulder position by identifying the hand by colour in a monocular image.

Section 2 introduces the STM model. Section 3 explores its validity using motion capture data. In Section 4 we demonstrate how the model can be applied in real binocular sequences. Finally Section 5 summarizes our results.

2. Approximating Arm Pose

Soechting and Flanders' [17] original study describes arm pose using a set of spherical coordinates relative to the shoulder frame pictured in Figure 1. The Z axis points upward, the Y axis points to the front (out of the chest) and the X axis points inward toward the other shoulder. Target positions are described in spherical coordinates with respect to the shoulder:

    R    = \sqrt{x^2 + y^2 + z^2}
    \chi = atan2(x, y)                                  (1)
    \psi = atan2(z, \sqrt{x^2 + y^2})

The four angles describing the pose of the arm are θ, the elevation of the upper arm, η, its rotation with respect to the Z axis, β, the forearm elevation, and α, its Z rotation. The link lengths are lu and lf respectively.

Figure 1. Soechting and Flanders' model for (left) arm pose. θ is the upper arm elevation, η is upper arm rotation about Z, β is forearm elevation and α is Z rotation in the forearm frame.

World endpoint position Hw = (xw, yw, zw) is related to these angles as follows:

    x_w = l_u \sin\theta \sin\eta + l_f \sin\beta \sin\alpha
    y_w = l_u \sin\theta \cos\eta + l_f \sin\beta \cos\alpha      (2)
    z_w = -l_u \cos\theta + l_f \cos\beta
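The two relations above translate directly into code. The following sketch implements Equations (1) and (2) in the shoulder-centred frame of Figure 1; the function names are illustrative and the atan2(x, y) argument order follows Equation (1) exactly as written.

```python
import numpy as np

def extrinsic_coords(x, y, z):
    """Spherical target coordinates (R, chi, psi) of Equation (1),
    in the shoulder-centred frame of Figure 1."""
    R = np.sqrt(x**2 + y**2 + z**2)
    chi = np.arctan2(x, y)                     # azimuth, as in Eq. (1)
    psi = np.arctan2(z, np.sqrt(x**2 + y**2))  # elevation
    return R, chi, psi

def hand_from_angles(theta, eta, beta, alpha, lu, lf):
    """Forward model of Equation (2): world hand position from the four
    intrinsic arm angles and the link lengths lu, lf."""
    xw = lu * np.sin(theta) * np.sin(eta) + lf * np.sin(beta) * np.sin(alpha)
    yw = lu * np.sin(theta) * np.cos(eta) + lf * np.sin(beta) * np.cos(alpha)
    zw = -lu * np.cos(theta) + lf * np.cos(beta)
    return np.array([xw, yw, zw])
```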

Soechting and Flanders' [17] experiments involved 4 subjects in two motor tasks: touching targets (accurate), or pointing to the location of a target which was no longer visible (inaccurate). Ultrasound sensors were used to measure the wrist, elbow, shoulder and the endpoint of the stylus used for pointing, as well as the target position. They were interested in the relationship between the extrinsic world position (R, χ, ψ) and the subjects' intrinsic representation of arm pose (θ, β, η, α) for the motor tasks described. They fitted linear, second and third order models in the extrinsic parameters. Coefficients for each model were fitted to the measured arm position (intrinsic angles) and varied from subject to subject. The inaccurate task was somewhat better modeled by the linear model than the accurate task. Since freely gesturing in space is even less constrained than the inaccurate pointing task, we want to evaluate these linear models as a tool for directly determining arm pose from endpoint position.

The model and results described by Soechting and Flanders seem very promising for arm pose extraction using only tracked hand positions in multiview sequences, but do they generalize to a wide variety of natural gestures? Using centroids of extracted fleshtone regions in multiple views to reconstruct hand (and face) position can give rather noisy estimates of hand position, depending on hand posture and camera viewpoints. Can these models give useful estimates of arm pose under these circumstances, possibly with added tracking?

MoCap Seq.   Num Frames
argue        768
heated       1694
interview    1948
pleading     1844
enthus       540

Table 1. Motion capture sequences.

Relative Error for Linear R, χ, ψ Fit
Seq.         θ       β       η       α
argue        0.525   0.347   0.180   0.625
heated       0.338   0.116   0.318   0.729
interview    0.018   0.093   0.183   0.136
pleading     0.289   0.894   0.295   0.683
enthus       0.447   0.040   0.255   0.123

Table 2. Relative Error over sequence for fitted linear least squares model in R, χ and ψ.

3. Motion Capture Analysis

To evaluate these models for more general human motion, we obtained the SitTalkDrink motion capture dataset from Credo Interactive (www.charactermotion.com). We chose 5 of these sequences (Table 1) to validate the models proposed in the previous section. All of the selected sequences involve people seated and conversing in the kind of desktop setting we are interested in. Demirdjian et al. have incorporated motion capture information in their human tracker to enforce realistic constraints on likely motion [5]. Exploring real motions to get our assumptions right is an important step.

We fitted 2 different models to each of the intrinsic angles for each of the motion capture sequences. The linear model is of the form:

    \theta_i = a_0 + a_1 R_i + a_2 \chi_i + a_3 \psi_i

and the nonlinear model has the form:

    \beta_i = a_0 + a_1 R_i + a_2 \chi_i + a_3 \psi_i + a_4 R_i^2 + a_5 R_i \chi_i + a_6 R_i \psi_i + a_7 \chi_i^2 + a_8 \chi_i \psi_i + a_9 \psi_i^2

We used a standard least squares approach (SVD/back-substitution) to assign the coefficients a_j for each sequence. We use relative error over the sequence to measure the quality of fitted models:

    rel = \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (y_i - \bar{y})^2},       (3)

where \hat{y}_i is the estimated value at step i, y_i is the true value and \bar{y} is the mean of the true values for the sequence. The results for the linear and nonlinear model fits are summarized in Tables 2 and 3. Figure 2 illustrates the fits for angle η for the enthus sequence, with rel = 0.25 for the linear model and rel = 0.07 for the nonlinear model. Relative error essentially measures how much better the model is than merely guessing the mean.
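A minimal sketch of this fitting procedure is shown below, using numpy's least-squares routine in place of explicit SVD and back-substitution; the function names are illustrative.

```python
import numpy as np

def design_matrix(R, chi, psi, quadratic=False):
    """Rows [1, R, chi, psi] for the linear model; quadratic=True adds
    the second-order terms used in the nonlinear fit."""
    cols = [np.ones_like(R), R, chi, psi]
    if quadratic:
        cols += [R**2, R * chi, R * psi, chi**2, chi * psi, psi**2]
    return np.column_stack(cols)

def fit_angle(R, chi, psi, angle, quadratic=False):
    """Least-squares coefficients a_j for one intrinsic angle sequence."""
    A = design_matrix(R, chi, psi, quadratic)
    coeffs, *_ = np.linalg.lstsq(A, angle, rcond=None)
    return coeffs

def relative_error(y_hat, y):
    """Equation (3): residual energy relative to predicting the mean."""
    return np.sum((y_hat - y)**2) / np.sum((y - np.mean(y))**2)
```

For a sequence of extrinsic coordinates and one intrinsic angle, `fit_angle(R, chi, psi, theta)` returns the coefficients a_j, and `relative_error(design_matrix(R, chi, psi) @ coeffs, theta)` reproduces the measure of Equation (3).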

Relative Error for Nonlinear R, χ, ψ Fit
Seq.         θ       β       η       α
argue        0.123   0.212   0.053   0.124
heated       0.155   0.055   0.149   0.574
interview    0.004   0.020   0.022   0.098
pleading     0.076   0.835   0.192   0.401
enthus       0.046   0.007   0.068   0.070

Table 3. Relative Error over sequence for fitted quadratic least squares model in R, χ and ψ.

We can see from these results that for a particular sequence we can fit an effective model in extrinsic parameters which will predict the intrinsic pose angles. The linear models are most effective for the interview and enthus sequences. The nonlinear model fits well in all cases except for β in the pleading sequence. These results are promising. Obviously there is information in the hand pose, from which we calculate (R, χ, ψ), which we can use to infer the overall arm pose.

4. Binocular Sequence Data

The pose estimation system we envision is integrated with, and acts as a preprocessor for, a high quality modeling/telepresence system based on multiview stereo. We therefore assume dense disparity maps are available to us at least intermittently. We can obtain much faster information regarding hand and face positions by combining background subtraction and tracking of fleshtone regions in binocular sequences (see Figure 3). The centroids of corresponding regions are used as correspondences to reconstruct the 3D hand and face positions.
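The paper does not spell out the reconstruction step itself; a standard linear (DLT) triangulation of the matched centroids would look roughly as follows, assuming calibrated 3x4 projection matrices for the two views. The function name and interface are hypothetical.

```python
import numpy as np

def triangulate_centroid(P_left, P_right, c_left, c_right):
    """Linear (DLT) triangulation of one pair of corresponding region
    centroids c_left, c_right (pixel coordinates), given the 3x4
    projection matrices of the calibrated stereo pair."""
    A = np.vstack([
        c_left[0]  * P_left[2]  - P_left[0],
        c_left[1]  * P_left[2]  - P_left[1],
        c_right[0] * P_right[2] - P_right[0],
        c_right[1] * P_right[2] - P_right[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # inhomogeneous 3D point
```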

Figure 2. Examples of linear and nonlinear model fits for η for the enthus motion capture sequence (radians vs. frame).

Figure 3. Hand and face centroids extracted using tracking of fleshtone regions.

Relative Error for Linear R, χ, ψ Fit
Seq.      d            φ            γ
enthus    0.899        0.105        0.816
Seq.      head rel x   head rel y   head rel z
enthus    0.060        0.215        0.063
Seq.      hand rel x   hand rel y   hand rel z
enthus    0.967        0.087        0.869

Table 4. Relative Error for possible elbow encodings given the linear R, χ, ψ fit (enthus sequence).

4.0.1. Segmenting Skin Regions

Skin tones are very consistent and a useful way to identify people and their parts in images. For a particular set of environment conditions we construct a mean and standard deviation for the red, green and blue image components in skin regions, c_s = (r_s, g_s, b_s). We also construct a mean background image with a colour pixel b_{i,j} at each pixel. To emphasize skin regions we examine the angle q between the colourspace vector p_{i,j} at each pixel and the mean skin colour:

    q = acos\left( \frac{p_{i,j} \cdot c_s}{\|p_{i,j}\| \, \|c_s\|} \right)

If q exceeds a threshold, the pixel is classified as non-skin. Similarly, if the angle r between a background pixel and the foreground pixel,

    r = acos\left( \frac{p_{i,j} \cdot b_{i,j}}{\|p_{i,j}\| \, \|b_{i,j}\|} \right),

is less than a threshold, the pixel is considered part of the background. The result of combining these masks is illustrated in Figure 3. This segmentation greatly improves the robustness of tracking hand and face regions.
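A compact sketch of the combined skin/background test described above is given below; the threshold values t_skin and t_bg are illustrative placeholders, not values from the paper.

```python
import numpy as np

def skin_foreground_mask(img, bg, c_s, t_skin=0.15, t_bg=0.10):
    """Combine the colour-angle tests q (to mean skin colour) and
    r (to the mean background image) into one foreground-skin mask.

    img, bg : HxWx3 float images (current frame, mean background)
    c_s     : mean skin colour vector (r_s, g_s, b_s)
    t_skin, t_bg : angle thresholds in radians (illustrative values)
    """
    def angles(a, b):
        dot = np.sum(a * b, axis=-1)
        norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-9
        return np.arccos(np.clip(dot / norms, -1.0, 1.0))

    q = angles(img, np.asarray(c_s, float))   # angle to mean skin colour
    r = angles(img, bg)                       # angle to mean background pixel
    skin = q <= t_skin                        # close enough to skin tone
    foreground = r >= t_bg                    # far enough from background colour
    return skin & foreground
```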

4.1. Other Models

Our data suggests that indeed there are direct relationships between hand position and the pose angles chosen by subjects. The 4 angles used in Soechting's model however are more than we need to resolve the arm pose if we already know the hand and shoulder position. Moeslund [12] and others have pointed out that the elbow is constrained to fall on a circle about the line between the shoulder and hand. In our framework either β and η or α and θ would describe the pose of the lower or upper arm and thus fully constrain the arm pose. Since shoulder position is difficult to extract directly [10], we would prefer to work with the pose angles of the lower arm. We are not restricted to the proposed intrinsic pose angles of course, and we have tested a number of possible encodings of elbow position to determine which can be optimally fit using our extracted hand and face information. For points hand n, head h and elbow e, three potential candidates are (a code sketch of these encodings is given below):

• Head relative position: e_h = e − h
• Hand relative position: e_n = e − n
• Hand relative spherical coordinates: d = \sqrt{\sum e_n^2}, φ = \tan^{-1}(e_n(y) / e_n(x)), γ = acos(e_n(z) / d)

Figures 4, 5 and 6 illustrate the fits for these elbow encodings for the enthus motion capture sequence. The linear model (R, χ, ψ) is the basis for these estimates, but is now computed for the hand relative to the head position, since these are the image measurements available to us. The best overall fit is achieved for the head relative encoding of elbow position, with relative error of 0.060, 0.215 and 0.063 for x, y, and z respectively (Table 4).
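For reference, the three candidate encodings can be computed from reconstructed elbow, hand and head points as in the sketch below (names are illustrative).

```python
import numpy as np

def elbow_encodings(e, n, h):
    """The three candidate elbow encodings: head-relative, hand-relative,
    and hand-relative spherical coordinates (d, phi, gamma)."""
    e, n, h = (np.asarray(p, float) for p in (e, n, h))
    e_h = e - h                             # head relative position
    e_n = e - n                             # hand relative position
    d = np.linalg.norm(e_n)
    phi = np.arctan2(e_n[1], e_n[0])        # tan^-1(e_n(y) / e_n(x))
    gamma = np.arccos(e_n[2] / d)           # acos(e_n(z) / d)
    return e_h, e_n, (d, phi, gamma)
```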

Figure 4. Examples of linear model fits for head relative elbow position (x, y, z components, in metres, vs. frame) for the enthus motion capture sequence.

Figure 5. Examples of linear model fits for hand relative elbow position (x, y, z components, in metres, vs. frame) for the enthus motion capture sequence.

The head relative encoding was then applied to a real image sequence. Frames 50 and 70 from this sequence are illustrated in Figure 8. Predicted (red) and hand-extracted (blue) elbow positions are compared in Figure 7. Predicted values are based on the head relative model fitted to the enthus motion capture data illustrated in Figure 4. Relative errors achieved were 0.6685, 0.1091 and 0.0985 for the x, y and z coordinates respectively.
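Putting the pieces together, the per-frame prediction step might look like the following sketch: the hand position is expressed relative to the face centroid, converted to (R, χ, ψ), and pushed through the fitted linear models to give a head-relative elbow estimate. The 3x4 coefficient matrix is assumed to come from fits like those in Section 3; all names are illustrative.

```python
import numpy as np

def predict_elbow_head_relative(hand, head, coeffs):
    """Predict head-relative elbow position (x, y, z) from tracked hand
    and head centroids using fitted linear models in (R, chi, psi).

    coeffs : 3x4 array, one row [a0, a1, a2, a3] per output coordinate.
    """
    rel = np.asarray(hand, float) - np.asarray(head, float)
    R = np.linalg.norm(rel)
    chi = np.arctan2(rel[0], rel[1])
    psi = np.arctan2(rel[2], np.hypot(rel[0], rel[1]))
    features = np.array([1.0, R, chi, psi])
    return np.asarray(coeffs, float) @ features   # head-relative elbow estimate
```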

Figure 6. Examples of linear model fits for hand relative elbow position (d, φ, γ) for the enthus motion capture sequence.

Figure 7. Predicted head relative elbow position for the image sequence (red) versus a hand-extracted estimate of the true position (blue).

Figure 8. Sequence images with pose estimated from head-hand position overlaid.

4.2. Torso Rotation

The ability to predict arm pose using only head and hand position extracted via colour segmentation in multiview sequences seems very promising. To provide a more complete estimate of user pose we further require an estimate of the user's orientation. For the dense model applications we are interested in, we can address this problem by using a patch of reconstructed surface points on the torso to fit a plane and thereby construct a similar transformation to that used in the target data. This approach would update the torso transformation T at a slower rate (that of the correlation stereo update), but we do not anticipate rapid whole body movement in the desktop collaboration environment. The torso surface region can be selected based on a fixed image relationship to the subject's face. Occlusion of the surface could also be a problem, but the extracted kinematic pose should allow us to determine when the selected patch is self-obscured. The process is as follows: we use the extracted image face location F and the image dimensions of the face (fx, fy) to move to an image window bounded by (F + (−fx/2, fy)) and (F + (fx/2, 1.5fy)). The valid reconstructed depth points associated with this window are used to fit a plane and construct the necessary transformation T. In Figure 9, plane normals calculated from the stereo surface window (red) and the extracted targets (green) have been projected and plotted on the associated image. The normals have an absolute angular difference of 7.6°. The image region selected relative to the face is outlined in blue. Valid stereo depth points from this region are used to estimate T. The stereo normal actually appears to be more accurate, which may be the case since it is based on more data.
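A rough sketch of this torso-patch step under the assumptions above: select the image window below the face and fit a plane to the valid depth points by SVD. The window bounds follow the text; the normal-orientation choice is an illustrative detail, not taken from the paper.

```python
import numpy as np

def torso_window(F, fx, fy):
    """Image window below the face, bounded by (F + (-fx/2, fy)) and
    (F + (fx/2, 1.5*fy)) as described above."""
    x0, y0 = F[0] - fx / 2.0, F[1] + fy
    x1, y1 = F[0] + fx / 2.0, F[1] + 1.5 * fy
    return (x0, y0), (x1, y1)

def torso_plane_normal(points):
    """Least-squares plane fit to the valid reconstructed depth points
    from the torso window; returns the centroid and unit normal used
    to build the torso transformation T."""
    P = np.asarray(points, float)
    centroid = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - centroid)
    normal = Vt[-1]                       # direction of least variance
    if normal[2] < 0:                     # orient consistently (illustrative choice)
        normal = -normal
    return centroid, normal
```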

Figure 9. Fitted plane normals from extracted targets (green) and stereo surface window (red), projected onto the associated image. The angular difference is about 8°.

5. Summary and Conclusions

In this paper we have explored some interesting results from the psychophysics literature, which indicate that the arm pose achieved by human subjects in inaccurate pointing tasks is well modeled by linear combinations of the spherical coordinates of the target or finger position. We have used several motion capture sequences to analyse the potential of using these models to predict arm pose. We observed promising trends which suggest that simple direct calculations can be used to obtain good estimates of arm pose based on hand and face positions acquired and reconstructed using colour segmentation.

The application we are interested in is high quality real-time reconstruction of human users in a desktop telepresence setting. To use pose estimation as a preprocessing step to augment visual hull or stereo calculations, the estimation process must be very fast. Linear functions of hand position are very attractive for this purpose. The full 4 angle pose description used by Soechting and Flanders [17] is not actually required to constrain the elbow position for known hand and shoulder locations. We examined the estimation of only the forearm pose angles, and found we could equally well estimate the elbow pose relative to the hand. We observed that the form in which the elbow position was expressed could greatly affect the quality of the fit. Further, the hand position could be expressed relative to the face centroid as origin, rather than the shoulder position, which is not easily extracted.

Challenges arise with methods using skintone alone to track hands. The system must follow the distal end (hand) when the user wears short sleeves, for example. Further, if the hands cross, confusion in tracking can occur. Demirdjian et al. [5] have shown that understanding natural human motion can reduce some of these uncertainties. Finally, we proposed that given the context of tracking and reconstructing depth data in real time, we could estimate torso rotation by fitting a plane to a region of the depth surface. This region is determined relative to the image position and size of the subject's face. We plan to examine how to make these models generic enough that they can be scaled and adapted online for any user. We also believe that there may be different behaviour in different regions of the reachable space, and therefore we may need several models indexed by hand position. We want to explore the fundamentals of the model which can be scaled to any user [2].

References

[1] J. K. Aggarwal and Q. Cai. Human motion analysis: A review. Computer Vision and Image Understanding, 73(3):428–440, March 1999.
[2] B. Allen, B. Curless, and Z. Popovic. The space of human body shapes: reconstruction and parameterization from range scans. ACM Transactions on Graphics (TOG), 22(3):587–594, July 2003.
[3] I. Cohen and H. Li. Inference of human postures by classification of 3d human body shape. In Proc. IEEE Int. Workshop on Analysis and Modeling of Faces and Gestures (AMFG'03), 2003.
[4] Q. Delamarre and O. D. Faugeras. 3d articulated models and multi-view tracking with silhouettes. In ICCV (2), pages 716–721, 1999.
[5] D. Demirdjian, T. Ko, and T. Darrell. Constraining human body tracking. In Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV'03), volume 2, page 1071, Nice, France, Oct 2003.
[6] D. M. Gavrila and L. S. Davis. 3-d model-based tracking of humans in action: a multi-view approach. In Proceedings IEEE Computer Vision and Pattern Recognition (CVPR'96), pages 73–80, San Francisco, CA, 1996.
[7] L. Goncalves, E. D. Bernardo, and P. Perona. Reach out and touch space (motion learning). In Proc. IEEE Intl. Conf. on Automatic Face and Gesture Recognition, pages 234–238, 1998.
[8] Y. Koga, K. Kondo, J. Kuffner, and J. Latombe. Planning motions with intentions. In Proc. SIGGRAPH '94, pages 395–408, 1994.
[9] K. Kondo. Inverse kinematics of a human arm. Dept. of Computer Science 94-1508, Stanford University, Stanford, CA 94305-9025, 1994.
[10] T. B. Moeslund. Modelling the human arm. Laboratory of Computer Vision and Media Technology CVMT 02-01, Aalborg University, Aalborg East, Denmark, May 2002. http://www.cvmt.dk/ tbm/Publications/.
[11] T. B. Moeslund and E. Granum. Pose estimation of a human arm using kinematic constraints. In Proceedings of the 12th Scandinavian Conference on Image Analysis, Bergen, Norway, June 2001.
[12] T. B. Moeslund and E. Granum. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81(3):231–268, March 2001.
[13] T. B. Moeslund, M. Vittrup, K. S. Pederson, M. K. Laursen, M. K. D. Sorensen, H. Uhrenfeldt, and E. Granum. Estimating the 3d shoulder position using monocular vision and detailed shoulder model. In Proceedings of the Conference on Imaging Science, Systems, and Technology (CISST'02), Las Vegas, NV, June 2002.
[14] J. Mulligan, X. Zampoulis, N. Kelshikar, and K. Daniilidis. Stereo-based environment scanning for immersive telepresence. IEEE Transactions on Circuits and Systems for Video Technology, March 2004.
[15] V. Pavlovic, R. Sharma, and T. Huang. Visual interpretation of hand gestures for human-computer interaction: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677–695, July 1997.
[16] H. Sidenbladh, M. J. Black, and L. Sigal. Implicit probabilistic models of human motion for synthesis and tracking. In Proc. European Conference on Computer Vision (ECCV'02), volume LNCS 2350, pages 784–800, 2002.
[17] J. F. Soechting and M. Flanders. Errors in pointing are due to approximations in sensorimotor transformations. Journal of Neurophysiology, 62(2):595–608, August 1989.
[18] J. F. Soechting and M. Flanders. Sensorimotor representations for pointing to targets in three-dimensional space. Journal of Neurophysiology, 62(2):582–594, August 1989.
[19] D. Tolani, A. Goswami, and N. Badler. Real-time inverse kinematics techniques for anthropomorphic limbs. Graphical Models, 62:353–388, 2000.
[20] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. PAMI, 19(7):780–785, 1997.
[21] M. Yang and N. Ahuja. Recognizing hand gesture using motion trajectories. In Proceedings CVPR, volume 1, pages 466–472, 1999.