
HuMAnS 2000, Hilton Head Island, South Carolina, June 16, 2000

3D Human Pose Estimation using 2D-Data and an Alternative Phase Space Representation

Thomas B. Moeslund and Erik Granum
Laboratory of Computer Vision and Media Technology
Institute of Electronic Systems, Aalborg University
Niels Jernes Vej 14, DK-9220 Aalborg East, Denmark
{tbm,eg}@vision.auc.dk

Abstract

In many 3D human pose estimation applications it is desirable to estimate the pose using monocular vision. The ambiguities this introduces are usually handled by a priori knowledge in the form of a human model. We represent the human model in a phase space spanned by its different degrees of freedom and use the analysis-by-synthesis (AbS) approach to match the phase space model with real images, thereby estimating the pose. An alternative phase space is presented which significantly reduces the size of the phase space, and hence the complexity of matching model and image data. The phase space is reduced further by constraints based on the human motor system. Due to the relatively small size of the phase space we are able to consider the entire phase space in each time-step. This means that our system may be applied to continuous pose estimation as well as initial pose estimation. The approach is based on colours and silhouettes, where the latter is used in two different schemes. Results show that the two schemes complement each other, making the system capable of estimating the pose even during occlusion.

1 Introduction

Lately human motion capture, or pose estimation, has received much attention due to its large number of potential applications. To avoid the intrusiveness of mechanical and magnetic sensors it is desirable to use "touch free" computer vision-based systems. Many computer vision systems have been designed and implemented, as described in [10] and [9]. Due to the rather young age of this research field, most of these systems are, however, limited in their functionality. To handle some of the ambiguities in pose estimation more than one camera may be used [3][4][7][11]. However, in many situations it is desirable to be able to estimate 3D poses using monocular vision. To do so, many systems are based on a priori knowledge in the form of a human model and the constraints related to it. In many of these systems the measured data are used to adjust the parameters of the model, i.e. the current state of the model represents the output of the system - the estimated pose [1][5][6][14][15]. The model is updated from each image by investigating which synthesised model configuration provides the best match to the input data. This is known as the "analysis-by-synthesis" (AbS) approach. The configuration, or pose, of the human model is described by the current values of the variables representing the different degrees of freedom in the model. Hence the configuration is represented in a coordinate system spanned by the different degrees of freedom. This is known as the phase space [2]. The major problem with the AbS approach is the potentially high dimensionality of the phase space, which may make the approach unfeasible for real-time systems. To reduce this problem the approach is usually applied in a local scheme based on prediction. For each image a point in the phase space is predicted, synthesised, and matched against the image data. The result determines which point in the phase space to synthesise next, and so on. This iterative approach continues until a solution is found. Unfortunately the approach has some drawbacks, as it is based on a local search around the predicted value. It may end up with a sub-optimal solution due to a local extremum or a wrong prediction. Also, if the system loses track it might not be able to regain track because it is searching in a wrong region. Further, the AbS approach relies on knowing the initial pose parameters.

2 The Approach

Instead of using the AbS approach in a local region, where it is a matter of predicting a pose and iterating until the optimal solution is found, we will use the AbS approach in a global scheme, thereby avoiding the problems mentioned above. For each image the entire phase space is considered a potential solution - the search space. By identifying and eliminating all points which cannot explain the image data, the search space can (hopefully) be reduced to only one point - the correct pose. To prune the search space a number of constraints are needed. The AbS approach contains straightforward ways to reduce the search space: 1) Its dimensionality is set according to the current application. If a system is only interested in capturing the head pose of a subject, the space can be reduced to only the degrees of freedom associated with the head. 2) Kinematic constraints of the human motor system are also used to constrain the space, e.g. the legs cannot bend forward at the knee. This may also be seen in a temporal context, where the movements of different body parts have velocity and acceleration constraints. 3) Collision constraints: two body parts cannot occupy the same space at the same time. Even after using the above constraints the search space may still be rather large, and therefore we introduce additional constraints based on image data and an alternative phase space representation. We focus on simple methods and image representations which require only limited computational effort, so that the approach remains suitable for real-time applications. The essence of our approach can be stated as:

Section 3 concerns the initialisation and segmentation with colours, while section 4 discusses how to reduce the dimensionality of the phase space. A further pruning is described in section 5. Section 6 defines the image features used for matching between observed data and the alternative model representation. The last sections present results, a discussion, and finally a conclusion.

3 Initialisation and Segmentation

When estimating the pose of a human body(part) based on data from one image, the estimated 3D pose will be relative. This might be sufficient in a number of applications, but in others there is a need for absolute positions. For example, in applications where the subject is interacting with the world (e.g. pointing) there is a need to anchor the relative positions to a known world coordinate system and thereby achieve absolute pose data. Given that we are working within MMI, the subject is likely to be sitting or standing in a fixed position while interacting. This allows an absolute 3D reference point to be calculated once and for all, either by hand measurements or through an automatic initialisation. We choose to have the system find the 3D position of the shoulder and use this as a reference point when estimating the pose of the arm.

3.1 Initialisation

We address the application area of man-machine interaction (MMI). This allows the assumption that the head and torso of the subject to be monitored are frontoparallel with respect to the camera, i.e. the position of the head in 3D allows estimation of the 3D position of other salient points such as the shoulders. To focus on the principles of the approach we work only with 3D pose estimation of the left arm, with the hand being a "stiff" extension of the lower arm along its axis.

The subject places the left arm stretched out and parallel to the XY-plane and the XZ-plane in the world coordinate system defined as shown in figure 1. The distance between the shoulder and the hand (along the arm) will then equal the length of the arm, and the distance between the shoulder and the hand along the Z-axis (towards the camera) will be zero. Together this yields two equations in four unknowns. The camera has been calibrated using Tsai's calibration technique, yielding a number of parameters which can be used to map a 3D point into the image. When mapping an image point to a 3D world point we end up with three equations in four unknowns. Hence the image points of the shoulder and hand result in six equations in eight unknowns. Altogether we have eight linear equations in eight unknowns. This can be solved using standard linear algebra, resulting in the 3D coordinates of the shoulder and the hand. The shoulder coordinates define the anchor point. Whenever the system is started up or the subject changes the pose of the shoulder, the initialisation is carried out. Note that this initialisation should not be mistaken for the initial pose which is required in most other AbS systems. Our initialisation is not needed each time a subject engages in communication with a machine, but only once during an entire session.
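The eight-equations-in-eight-unknowns step can be sketched as follows. This is not the authors' code: it assumes an idealised pinhole camera whose calibration yields a centre and a viewing-ray direction per image point (Tsai's method provides the equivalent image-point-to-line mapping), and all numeric values below are invented for illustration.

```python
# Sketch (with invented values) of the linear initialisation: the shoulder S
# and hand H each lie on a back-projected viewing ray (3 equations, 4 unknowns
# per point), and the stretched arm adds the two linear constraints.
import numpy as np

def init_pose(cam_centre, dir_shoulder, dir_hand, arm_length):
    # Unknowns: [Xs, Ys, Zs, lam_s, Xh, Yh, Zh, lam_h]
    A = np.zeros((8, 8))
    b = np.zeros(8)
    # Shoulder on its viewing ray: S - lam_s * d_s = C  (3 equations)
    A[0:3, 0:3] = np.eye(3); A[0:3, 3] = -dir_shoulder; b[0:3] = cam_centre
    # Hand on its viewing ray: H - lam_h * d_h = C  (3 equations)
    A[3:6, 4:7] = np.eye(3); A[3:6, 7] = -dir_hand; b[3:6] = cam_centre
    # Arm stretched along the X axis: Xh - Xs = arm_length
    A[6, 4] = 1.0; A[6, 0] = -1.0; b[6] = arm_length
    # Shoulder and hand at the same depth: Zh - Zs = 0
    A[7, 6] = 1.0; A[7, 2] = -1.0; b[7] = 0.0
    x = np.linalg.solve(A, b)
    return x[0:3], x[4:7]  # 3D shoulder and 3D hand

# Toy camera at the origin looking down +Z; rays chosen by hand.
S, H = init_pose(np.zeros(3),
                 np.array([-0.1, 0.0, 1.0]),
                 np.array([0.2, 0.0, 1.0]),
                 arm_length=0.6)
```

With these toy rays the equal-depth constraint forces both ray parameters to coincide, and the arm-length constraint fixes the depth, so the system is square and non-singular.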

2.2 Content of the Paper

3.2 Segmentation

The approach and constraints outlined above are implemented as follows.

For the above to work we need to find the image points of the shoulder and the hand.

 

Measurements in the image of salient points on the human body through simple image processing allow us to exploit the structural constraints of the model for the current configuration.



Spatial image features may help to reduce the search space further. Initially we use the silhouette of the subject, which is fairly simple to derive. An alternative representation of the phase space supports a more efficient use of the different constraints.
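The constraint-based pruning of section 2 can be sketched as follows. This is an illustrative toy, not the paper's implementation: the single degree of freedom, the 5-degree sampling, and the knee-angle limits are all invented.

```python
# Toy sketch (not the paper's code) of pruning a discretised phase space
# with hard constraints before any image matching is attempted.
def prune(phase_space, constraints):
    """Keep only the configurations that satisfy every constraint."""
    return [p for p in phase_space if all(c(p) for c in constraints)]

# Hypothetical 1-DoF example: a knee angle sampled every 5 degrees.
knee_angles = [(a,) for a in range(-180, 181, 5)]

# Kinematic constraint (limits invented): the knee cannot bend forward.
kinematic = lambda p: 0 <= p[0] <= 150

candidates = prune(knee_angles, [kinematic])
```

Each additional constraint is just another predicate in the list, which mirrors how kinematic, temporal, and collision constraints compose in the global scheme.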

2.1 Scenario and Model Complexity


Due to the assumption that the subject is facing the camera, we are able to estimate the shoulder based on the position of the head. The hand and head are segmented using colour information. Since RGB colours are sensitive to the intensity of the lighting, we use chromatic colours, which are normalised according to the intensity. The chromatic colours are defined as r = R/(R+G+B), g = G/(R+G+B), and b = B/(R+G+B), where R, G, and B are the red, green, and blue components. Since the three chromatic colours always sum to 1, two components are sufficient to represent a colour image. Upper and lower thresholds for skin colour in the two chromatic components have been found empirically. After thresholding an image, the two largest blobs are found, and their centres of mass represent the head and the hand, respectively. During processing, when the system estimates pose data, the position of the hand in the image needs to be re-estimated. This is done in the same way as during initialisation, except that a 2nd-order predictor is used to limit the area in which the hand is searched for.
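A minimal sketch of the chromatic-colour thresholding might look as follows; the skin thresholds here are invented placeholders, whereas the paper determines its own empirically.

```python
# Sketch of chromatic-colour skin segmentation. The thresholds are invented;
# only r and g are needed since b = 1 - r - g is redundant.
import numpy as np

def skin_mask(rgb, r_lo=0.35, r_hi=0.55, g_lo=0.25, g_hi=0.37):
    rgb = rgb.astype(float)
    s = rgb.sum(axis=-1)
    s[s == 0] = 1.0                 # avoid division by zero on black pixels
    r = rgb[..., 0] / s             # intensity-normalised red
    g = rgb[..., 1] / s             # intensity-normalised green
    return (r_lo <= r) & (r <= r_hi) & (g_lo <= g) & (g <= g_hi)

# A bright and a dark version of the same skin-like colour normalise alike,
# which is exactly the point of using chromatic colours.
img = np.array([[[200, 120, 80], [100, 60, 40], [50, 50, 200]]], dtype=np.uint8)
mask = skin_mask(img)
```

The blob extraction and centre-of-mass step would follow on `mask` with any connected-components routine.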

 



 


Figure 1: The human arm with the possible elbow locations indicated by the circle. α is the angle between the top point on the circle and the actual elbow position. S, E, and H represent the shoulder, elbow, and hand positions, respectively. In the bottom right corner the world coordinate system is illustrated.


The phase space for the left arm model has six dimensions when defined in Cartesian coordinates: three coordinates for the elbow and three for the hand. Since an arm consists of two parts whose lengths are fixed, a more compact representation may be used - spherical coordinates. With the arm lengths fixed we have four angles, two for the shoulder and two for the elbow. Still, this is a four-dimensional phase space with many possible solutions which need to be pruned. An alternative representation can be defined using the fact that when the 3D positions of the hand and the shoulder are known, together with the lengths of the upper and lower arm, the possible elbow positions are located on a circle (in 3D) perpendicular to the line spanned by the shoulder and the hand. This circle is illustrated in figure 1. The angle α is the angle between an elbow position and the top point on the circle. The phase space representation is now given by the three coordinates of the hand and α. We do not know the 3D position of the hand, but the above phase space representation is still useful. Using the parameters obtained during calibration we can map an image point to a line in space. That is, for each value of Hz (the depth of the hand) we can uniquely determine the remaining hand coordinates. So for each image we have a two-dimensional phase space consisting of the variables Hz and α. In this alternative phase space α is bounded by one circle-sweep (360 degrees) while Hz is bounded by the total length of the arm. In table 1 the different phase space representations, their sizes, and their dimensionalities are shown; no constraints whatsoever are used, yielding the full-size phase space. Greek letters denote the resolutions of the coordinates.
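The elbow-circle construction can be sketched as follows. The function and the choice of the world "up" vector as the reference for the top of the circle are our own assumptions; the paper does not give this code.

```python
# Sketch of the alternative parameterisation: with shoulder S, hand H, and the
# two segment lengths fixed, every feasible elbow lies on a circle in the
# plane perpendicular to S->H; alpha (radians) selects a point on it.
import numpy as np

def elbow(S, H, upper, lower, alpha):
    S, H = np.asarray(S, float), np.asarray(H, float)
    u = H - S
    d = np.linalg.norm(u)
    u /= d
    # Distance from the shoulder to the circle centre along S->H, and the
    # circle radius, from the two right triangles sharing that centre.
    t = (upper**2 + d**2 - lower**2) / (2 * d)
    r = np.sqrt(upper**2 - t**2)
    centre = S + t * u
    # Orthonormal basis of the circle plane; v is our "top" reference
    # direction (an assumed convention, taken from the world up vector).
    up = np.array([0.0, 1.0, 0.0])
    v = up - np.dot(up, u) * u
    v /= np.linalg.norm(v)
    w = np.cross(u, v)
    return centre + r * (np.cos(alpha) * v + np.sin(alpha) * w)

# Bent-arm toy case: both segments 0.3, hand 0.4 from the shoulder.
E = elbow([0.0, 0.0, 0.0], [0.4, 0.0, 0.0], upper=0.3, lower=0.3, alpha=0.0)
```

Whatever α is chosen, the returned point keeps both segment lengths fixed, which is the constraint the parameterisation encodes.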


The area of the synthesised bounding box is found by projecting the arm into the image. The arm is modelled as a cylinder, so two parallel lines are projected for both the upper and the lower arm. For each line the extreme image coordinates (x_min, x_max, y_min, and y_max) are found within the cut-off region. This is done using the positions of the shoulder, elbow, and hand, and the positions where the projected lines and the cut-off lines intersect. The matching (similarity measure) between the two bounding boxes is defined as the intersection divided by the union of the two boxes:

S1 = area(B_synth ∩ B_image) / area(B_synth ∪ B_image)
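The overlap measure can be sketched as an axis-aligned intersection-over-union; the box representation and function name here are our own.

```python
# Sketch of the overlap similarity: intersection over union of the
# synthesised and observed bounding boxes (axis-aligned, in pixels).
def iou(box_a, box_b):
    """Boxes as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))   # overlap width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))   # overlap height
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

s1 = iou((0, 0, 10, 10), (5, 0, 15, 10))  # two half-overlapping boxes
```

The measure is 1 for identical boxes and falls to 0 as they separate, matching the similarity range used for the combined measure later.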

In figure 3 the similarity measure is shown for the image in figure 2.A. The entire search space is shown except the points pruned by the Hz-constraint. Due to the nature of the matching process a number of wrong similarity measures might occur when the torso, legs, or head are mistaken for the arm, i.e. when the hand is occluding the rest of the body. This problem is discussed further in section 8. For now we deal with it by only matching in the region to the right of a vertical cut-off line and above a horizontal cut-off line, shown in figure 2.B.

Figure 4: The similarity measure calculated for figure 6.D.

The area of the bounding box in the image is defined by the extreme coordinates x_min, x_max, y_min, and y_max within the cut-off region.


During tests we use a resolution of and , which results in a phase space with distinct values according to table 1. After the phase space is pruned, the number is between and distinct values depending on the current movement. These two numbers represent the entire phase space, and not only a predicted region as in many other systems. That is, the size of the phase space is reduced by a factor due to the constraints.


Measure = S1 if x_h − x_s > D_max; Measure = S2 if x_h − x_s ≤ D_max,

where D_max is the maximum length of the upper arm in the image, and x_h and x_s are the x-coordinates (in the image) of the hand and the shoulder, respectively.


Even though the system can handle the situations in figures 6.C and 6.D, where the hand is occluding the torso, it cannot handle situations where no arm silhouette is visible at all. In these situations we combine predicted values and a default assumption. As time goes by, more emphasis is put on the latter. This will often be incorrect, but it is in many cases a fair estimate and it allows for continuous processing of the data. To achieve a more precise estimate, additional image information is required. This could be texture or flow measurements, which, however, would slow down the entire system without guaranteeing better performance. During initialisation, when calculating the 3D locations of the hand and head (shoulder), and when estimating the position of the hand in the image, an amount of uncertainty has been observed. This uncertainty transforms the elbow circle into a torus and the line of possible hand positions into a cone. I.e., each point inside the torus is an elbow candidate and each point inside the cone is a hand candidate. An investigation into this uncertainty showed that not much can be gained by using this more correct phase space. Also, the uncertainty due to the variation of the clothes is larger than the extra precision gained by using a torus and a cone instead of a circle and a line. We use the assumption that the subject is facing the camera with the head fixed with respect to the shoulder. This might be a reasonable assumption in many MMI cases, but it is not a generally valid assumption. To expand our system to handle this we need to increase the dimensionality of the phase space by adding new degrees of freedom for the head


7 Results


8 Discussion


Using appropriate weights we may simply add the similarities of the two methods, since they have the same range. The weights might depend on different parameters such as: the general uncertainty (measured off-line), the current uncertainty (measured on-line), the model used, the "looseness" of the clothes, etc. We found a dependency on the distance between the hand and the shoulder in the image. With a short distance S2 should have more weight than S1, as illustrated in figures 4 and 5. As the distance increases S1 and S2 become rather similar; however, S1 will be more reliable due to the nature of its calculation and should therefore be given more weight than S2. Without degrading the combined measure we may use binary weights, which makes it sufficient to calculate only S1 or S2 for each image, thereby reducing the complexity of the overall calculations. Which measure to use is then given by:
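A sketch of this binary selection, following our reading of the rule; the function name, the threshold value, and the example coordinates are all illustrative.

```python
# Sketch of the binary weighting: pick a single silhouette measure per frame
# from the image-plane x-distance between hand and shoulder.
def select_measure(x_hand, x_shoulder, max_upper_arm_px):
    """Return which similarity measure to evaluate for this image."""
    if x_hand - x_shoulder > max_upper_arm_px:
        return "S1"   # arm well extended: the overlap measure is reliable
    return "S2"       # hand near the shoulder / occlusion: shape measure

choice = select_measure(x_hand=320, x_shoulder=180, max_upper_arm_px=100)
```

Using a hard switch rather than a weighted sum means only one measure needs to be evaluated per frame, which is the complexity saving the text describes.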

Figure 5: The similarity measure calculated for figure 6.D.

Compared to the non-pruned standard representations, the reduction is approximately a factor for the given resolution. Still, synthesising and matching the remaining values is a comprehensive task, and therefore we adopt a coarse-to-fine search for the peak in the selected measure. This gives good results due to the generally monotonic characteristics of the curves. In figure 6 five images from a sequence are shown. The left column shows the input images. The middle column shows the selected similarity measure, i.e. S1 or S2. The right column shows the results of the pose estimation projected onto the input image. In worst-case scenarios where no significant peak can be found we use the median peak. This gives a maximum error on α. Two worst-case situations are shown in figures 6.A and 6.B. They occur when the silhouette of the arm appears the same for different values of α. In these situations exact pose estimation based only on silhouettes is impossible, even for a human! In most other configurations, represented by the three other images, a significant peak is found and the estimated pose is rather precise, as can be seen in the right column.
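The coarse-to-fine peak search can be sketched in one dimension as follows; the step schedule and the toy similarity function are invented, and the paper searches the 2D (Hz, α) space instead.

```python
# Sketch of a coarse-to-fine search for the peak of a (near-)unimodal
# similarity curve: sample coarsely, then refine around the best sample.
def coarse_to_fine(f, lo, hi, steps=(32, 8, 2, 0.5)):
    """1D illustration over [lo, hi]; step sizes are an invented schedule."""
    best = lo
    for step in steps:
        xs = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
        best = max(xs, key=f)                       # best sample at this level
        lo, hi = max(lo, best - step), min(hi, best + step)  # zoom in
    return best

# Toy unimodal similarity peaking at alpha = 137 degrees.
peak = coarse_to_fine(lambda a: -(a - 137.0) ** 2, 0.0, 360.0)
```

This evaluates far fewer points than an exhaustive scan at the finest resolution, relying on the monotonic shape of the curves around the peak.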


and shoulder. Further constraints should then be incorporated to reduce the search space correspondingly. The same applies to the assumption of the hand being a part of the arm. The clothes of the subject obviously affect the estimated pose. Actually modelling this effect would require computationally heavy methods and knowledge about the clothes worn by the subject. It is clear that the tighter the subject's clothes are, the better the estimated pose will be. However, even with rather "loose" clothes, as in figure 6, a reasonable result can normally be obtained, and that without the constraints normally imposed on a subject's clothes: tight-fitting [16], specially coloured [4], specially textured [13], or markers [12].

References

[1] C. Bregler and J. Malik. Tracking People with Twists and Exponential Maps. In International Conference on Computer Vision and Pattern Recognition, 1998.
[2] L. Campbell and A. Bobick. Recognition of Human Body Motion Using Phase Space Constraints. In International Conference on Computer Vision, Cambridge, Massachusetts, 1995.
[3] Q. Delamarre and O. Faugeras. 3D Articulated Models and Multi-view Tracking with Silhouettes. In International Conference on Computer Vision, Corfu, Greece, September 1999.
[4] D. Gavrila and L. Davis. 3-D Model-Based Tracking of Humans in Action: A Multi-View Approach. In Conference on Computer Vision and Pattern Recognition, San Francisco, USA, 1996.
[5] L. Goncalves, E. Bernardo, E. Ursella, and P. Perona. Monocular Tracking of the Human Arm in 3D. In International Conference on Computer Vision, Cambridge, Massachusetts, 1995.
[6] D. Hogg. Model-Based Vision: A Program to See a Walking Person. Image and Vision Computing, 1(1), February 1983.
[7] I. Kakadiaris and D. Metaxas. Vision-Based Animation of Digital Humans. In Conference on Computer Animation, pages 144-152, 1998.
[8] Y. Kameda, M. Minoh, and K. Ikeda. Three Dimensional Pose Estimation of an Articulated Object from its Silhouette Image. In Asian Conference on Computer Vision, 1993.
[9] T. Moeslund. Summaries of 107 Computer Vision-Based Human Motion Capture Papers. Technical report, Laboratory of Image Analysis, Aalborg University, Denmark, 1999.
[10] T. Moeslund and E. Granum. A Survey of Computer Vision-Based Human Motion Capture. Submitted to International Journal on Computer Vision and Image Understanding, December 1999.
[11] T. Moeslund and E. Granum. Multiple Cues used in Model-Based Human Motion Capture. In The Fourth International Conference on Automatic Face- and Gesture-Recognition, Grenoble, France, March 2000.
[12] O. Munkelt, C. Ridder, D. Hansel, and W. Hafner. A Model Driven 3D Image Interpretation System Applied to Person Detection in Video Images. In International Conference on Pattern Recognition, 1998.
[13] R. Plänkers, P. Fua, and N. D'Apuzzo. Automated Body Modeling from Video Sequences. In International Workshop on Modeling People at ICCV'99, Corfu, Greece, September 1999.
[14] K. Rohr. Human Movement Analysis Based on Explicit Motion Models, chapter 8, pages 171-198. Kluwer Academic Publishers, Dordrecht/Boston, 1997.
[15] S. Wachter and H.-H. Nagel. Tracking Persons in Monocular Image Sequences. Computer Vision and Image Understanding, 74(3):174-192, June 1999.
[16] C. Yaniz, J. Rocha, and F. Perales. 3D Region Graph for Reconstruction of Human Motion. In Workshop on Perception of Human Motion at ECCV, 1998.

9 Conclusion

In this paper we have presented a compact phase space representation for human pose estimation. It is based on the kinematic constraints of the human motor system and it efficiently reduces the dimensionality of the phase space and thereby the number of possible solutions. We have applied the phase space to an analysis-by-synthesis (AbS) approach. Due to the low dimensionality of our phase space we are not limited to a local search as used in other AbS approaches. We use it in a global search where the entire phase space is considered. This also means that we can handle long-term occlusions and do not need to know the initial pose parameters, as is the case in virtually all other model-based systems [10]. Hence our work applies to continuous pose estimation as well as to initial pose estimation. We have focused on simple data types which allow for a real-time implementation. These are colour, to segment the hand and head, and silhouettes of the arm, to be matched against the synthesised model data in the AbS approach. The silhouette matching is carried out using two combined methods, each fairly simple and focusing on overlap and shape similarity, respectively. Combined they are capable of estimating the 3D pose of a human arm even when the hand/arm is occluding the torso. Currently we are heading in two directions. The first is to expand the idea of a compact phase space to pose estimation of the entire human body. This requires a larger phase space and a number of new constraints. The other direction is to tailor our system to VR interface applications, where the use of a tracker (mounted on stereo glasses) allows for continuous updates of the subject's head position, hence no initialisation is required. We are also investigating how to expand our system to handle situations where the torso is not frontoparallel with respect to the camera.

Acknowledgement

We would like to thank the Danish National Research Councils, who funded this work through the project "The Staging of Virtual Inhabited 3D Spaces".

Figure 6: The results for five images (A-E) from a sequence. The left column shows the input image. The middle shows the selected similarity measure (S1 for A, B, and E; S2 for C and D) over the non-pruned part of the search space. The last column shows the result as an overlay on the input image.