Vision-based Human Pose Estimation for Pervasive Computing

Yi Li
State Key Lab for Novel Software Technology
Nanjing University, Nanjing, 210093, China
[email protected]

Zhengxing Sun
State Key Lab for Novel Software Technology
Nanjing University, Nanjing, 210093, China
Corresponding Author: [email protected]

Researchers have worked for decades towards this goal. However, analyzing human pose from visual input is very difficult, since the relation between image observations and poses is multi-valued in both directions and the human body self-occludes. In addition, the high dimensionality of the human body, the complexity of human motion and the difficulty of image feature extraction in videos make the problem a great challenge.

ABSTRACT Vision-based human pose estimation is useful in pervasive computing. In this paper, we propose an example-based approach to human pose estimation from monocular image sequences. We use human motion capture data to synthesize a pose example database in which the 3D information of each pose is known. Firstly, we use shape context to describe the human silhouette detected in each video frame and get candidates from the pose database by silhouette matching. Secondly, we build a probability and statistical model of motion and carry out pose estimation from these candidates. Finally, kernel regression is used to smooth the motion. The proposed method can effectively analyze 3D pose from video, solves the orientation ambiguity problem, and is invariant to viewpoints. Its effectiveness is verified on videos of walking, running and jumping.

In previous work, two main classes of pose estimation approaches can be identified: model-based and learning-based. The majority of model-based approaches employ an analysis-by-synthesis methodology to optimize the similarity between the model projection and the observed images. Their problems are that initialization is difficult and time-consuming because of the high dimensionality of the human body, and that local minima during pose estimation can lead to poor performance. Learning-based methods instead learn a mapping from image space to pose space with no explicit human model. They are less accurate than model-based methods but computationally cheaper, and they can initialize automatically. However, learning-based methods need a large training database, and they are limited to the fixed classes of movements and the range of viewpoints used in training.

Categories and Subject Descriptors I.2.10 [Artificial Intelligence]: Vision and Scene Understanding – motion, video analysis; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism – animation

General Terms Algorithms

Example-based approaches are a special case of learning-based methods, and there have been many advances in this area. Such approaches use a database of examples that describe poses in both image space and pose space. Pose estimation is performed by searching for training images similar to a given image and then interpolating from their poses. These approaches need no explicit initialization and are computationally less expensive than model-based methods. The key tasks of an example-based approach are to encode the image observations and to interpolate a pose from the pose candidates.

1. INTRODUCTION Context-aware computing is one of the important characteristics of pervasive computing. It means that computers should be able to extract information from the environment independently, rather than rely on information supplied by keyboard input. Perhaps the most relevant information to retrieve for interaction is where the humans in the environment are, who they are, and what their activities are. Here, computer vision can play an important role: vision-based human pose estimation and motion analysis can tell what people are doing and what they need, and based on the results the environment can supply services for them.

In this paper we propose a new example-based human pose estimation method for monocular image sequences. We first extract the human silhouette from the image and encode it using shape context; a modified silhouette matching method searches for the K nearest poses in the database, which form the pose candidates. Secondly, we build a probability and statistical model of motion, which improves the accuracy of pose estimation, and a weighted Markov method is used to estimate the pose from the candidates. Finally, kernel regression is used to smooth the motion. The proposed method effectively analyzes 3D pose from video, solves the orientation ambiguity problem, and is invariant to viewpoints. Its effectiveness is verified with videos of multiple human poses.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. AMC'09, October 23, 2009, Beijing, China. Copyright 2009 ACM 978-1-60558-760-8/09/10...$10.00.

The remainder of this paper is organized as follows. Previous work is reviewed in Section 2. Details of the proposed example-based method are presented in Section 3. Experiments and analysis are described in Section 4. We conclude the paper in Section 5.

2. RELATED WORK There has been a great deal of prior work on human motion analysis from video [Hu et al. 2004; Moeslund and Granum 2001; Moeslund et al. 2007; Poppe 2007]. Here we focus only on example-based human pose estimation. In visual analysis of human motion, the description of image features is very important; the most commonly used features are silhouettes and edges. Agarwal and Triggs [2006] presented an example-based approach to 3D pose estimation from single-view image sequences, using shape context to encode the silhouette; their results demonstrated reconstruction of long walking sequences with turns from monocular video. Poppe and Poel [2006] compared three shape descriptors (shape context, Fourier descriptors and Hu moments) on a human upper-body pose database, and also experimented with deformed silhouettes to test each descriptor's robustness against variations in body dimensions, viewpoints and noise. Howe [2004] matched silhouettes to a collection of known poses using turning angle and chamfer distance, with a Markov chain for temporal propagation in 3D pose estimation of walking and dancing. Edges contain more information but are also more sensitive to texture. Mori and Malik [2002] extracted shape contexts of edge points from the image and used an example collection to recover 2D joint positions, which were transformed to a 3D pose estimate in a subsequent step. Poppe [2007a] used Histograms of Oriented Gradients (HOG) as the image descriptor, which needs no human detection, and experimented on the HumanEva sets; Howe [2006] also experimented on the HumanEva sets. The HumanEva sets [Sigal and Black 2006] have become a common benchmark for comparing approaches and evaluating accuracy against ground truth.

Temporal information helps to improve accuracy. Brand [1999] used a hidden Markov model (HMM) to represent the mapping from 2D silhouette sequences in image space to skeletal motion in 3D pose space. Rosales et al. [2001] learned a mapping from visual features of a segmented person to static pose using neural networks; this representation allowed 3D pose estimation invariant to the speed and direction of movement. Grauman et al. [2003] learned a probabilistic representation of the mapping from multiple-view silhouette contours to whole-body 3D joint locations, demonstrating pose reconstruction for close-up images of a walking person from multiple or single views. Similarly, Elgammal and Lee [2004] learned multiple view-dependent mappings from silhouettes to 3D pose for walking actions. Shakhnarovich et al. [2003] introduced Parameter-Sensitive Hashing (PSH) to rapidly estimate the pose for a new image. There is still much to do to make example-based methods run in real time.

3. METHOD The flowchart of our example-based pose estimation method is shown in Fig.1. The main idea is to search the database for the poses most similar to the input image and to use a probability and statistical model to estimate the pose of the input image from those candidates. We first construct a pose and silhouette database from motion capture data. We then detect the silhouette of the human in each frame of the video; shape context is used to describe the silhouette and for silhouette matching. For each input silhouette detected in the video, we select the K most similar poses from the silhouette database to form the pose candidates, and use the probability and statistical model to estimate the pose for each input image. Finally, kernel regression is used to smooth the motion. Below we describe each step in detail.

Figure 1: Flowchart of example-based human pose analysis
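The overall pipeline of Fig.1 (detect the silhouette, retrieve K candidates, select a pose with the probability/statistical model, smooth the sequence) can be outlined as follows. This is only an illustrative skeleton: every function name and stub here is hypothetical, standing in for the components described in the rest of the paper.

```python
# Hypothetical outline of the example-based pipeline in Fig.1.
# The step functions are stubs standing in for the real components.

def estimate_sequence(frames, database, k=10,
                      detect=None, match=None, select=None, smooth=None):
    """Run detection -> K-candidate retrieval -> selection -> smoothing."""
    detect = detect or (lambda f: f)                   # silhouette detection stub
    match = match or (lambda s, db, k: db[:k])         # K nearest silhouettes
    select = select or (lambda prev, cands: cands[0])  # weighted Markov stub
    smooth = smooth or (lambda poses: poses)           # kernel regression stub
    poses, prev = [], None
    for frame in frames:
        silhouette = detect(frame)
        candidates = match(silhouette, database, k)
        prev = select(prev, candidates)
        poses.append(prev)
    return smooth(poses)
```

Each stub can be swapped for the real component as the corresponding section describes it.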

3.1 Silhouette Detection and Sampling Human silhouette detection is very important to example-based pose estimation. However, because of lighting conditions, shadows, moving background objects and occlusions, perfectly detecting the human silhouette in video is a difficult problem. In this paper, we use a correlation-coefficient-based method [Spagnolo et al. 2006] to detect the human in the video. The detected human region may contain noise pixels and small holes inside the body, so we apply morphological operations to filter the noise and fill the small holes. An eight-neighborhood tracking algorithm is then used to extract the silhouette of the human, which we sample equidistantly. The results are shown in Fig.2: a video frame, the detected human region, the human silhouette, and the silhouette sampled with different numbers of points.
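As an illustration of the equidistant sampling step, a minimal sketch follows, assuming the silhouette has already been extracted as an ordered list of boundary points; the helper `resample_equidistant` is ours, not from the paper.

```python
import numpy as np

def resample_equidistant(contour, n_samples):
    """Resample an ordered closed contour (m x 2 array) to n_samples
    points spaced equally by arc length along the boundary."""
    pts = np.asarray(contour, dtype=float)
    closed = np.vstack([pts, pts[:1]])              # close the loop
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative arc length
    targets = np.linspace(0.0, arc[-1], n_samples, endpoint=False)
    # linear interpolation of x and y as functions of arc length
    x = np.interp(targets, arc, closed[:, 0])
    y = np.interp(targets, arc, closed[:, 1])
    return np.stack([x, y], axis=1)

# usage: a unit square sampled with 8 equally spaced points
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]])
samples = resample_equidistant(square, 8)
```

Sampling by arc length rather than by vertex index keeps the point density uniform regardless of how the boundary tracker spaced its output.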


Figure 2: Silhouette detection and sampling. (a) a video frame; (b) detected human region; (c) human silhouette; (d, e) sampled silhouette with different numbers of points.

3.2 Shape Context and Silhouette Matching

Shape context is a silhouette-based shape descriptor [Mori and Malik 2006], which can be described as Eq.1:

h_{p_i}(k) = #{ q ≠ p_i : (q − p_i) ∈ bin(k) }  (1)

where h_{p_i}(k) is the histogram of point p_i. Consider a point p_i on silhouette P = {p_i | i = 1, ..., n} and a point q_j on silhouette Q = {q_j | j = 1, ..., n}; the matching cost of these two points is:

c(p_i, q_j) = χ²(h_{p_i}, h_{q_j}) = (1/2) Σ_{k=1}^{K} [h_{p_i}(k) − h_{q_j}(k)]² / [h_{p_i}(k) + h_{q_j}(k)]  (2)

The outer boundary radius R has a great influence on the shape context. We compute R as Eq.3, where R̄ is the mean distance between all point pairs; α is a scale parameter such that a small α makes the shape context local while a large α makes it global:

R = α R̄ = α · (1 / (n(n−1))) Σ_Ω ||p_i p_j||,  Ω : {p_i, p_j ∈ P, i ≠ j}  (3)

Silhouette matching can be solved using bipartite graph matching in O(n³) time. In this paper we instead use a Modified Hausdorff Distance (MHD) to match two silhouettes. Our MHD-based silhouette matching method reduces the computational complexity to O(n²) without reducing the accuracy much. The method is described as follows.

The similarity of a point p on silhouette P to a point q of silhouette Q is c(p, q), and the minimum matching similarity is C(p, Q). Similarly, we define C(q, P):

C(p, Q) = min_{q∈Q} c(p, q),  C(q, P) = min_{p∈P} c(q, p)  (4)

Then the distance between two silhouettes P and Q is:

D(P, Q) = max(C_PQ, C_QP)  (5)

where C_PQ and C_QP are the directed minimum matching similarities of the two silhouettes:

C_PQ = (1/n) Σ_{p∈P} C(p, Q),  C_QP = (1/n) Σ_{q∈Q} C(q, P)  (6)

3.3 Pose Estimation

3.3.1 Human Model and Motion Data

The skeleton model, block model and 3D human model employed in this paper are shown in Fig.3. The motion data is in BVH file format, which provides skeleton hierarchy information in addition to the motion data. The skeleton model consists of 22 joints. The motion data of each frame defines the position and rotation of the ROOT joint and the rotations of the other 21 joints, which include the head, neck, chest, collar, shoulders, forearms, hands, thighs, shins, feet and so on.

Figure 3: Human models used in this paper. (a) skeleton model; (b) block model; (c) 3D human model.

Motion is a function of time; at each timestamp i its value is a pose. A pose can be defined as:

P_i = (p_i^(1), r_i^(1), r_i^(2), ..., r_i^(j), ..., r_i^(22))  (7)

where p_i^(1) ∈ R³ and r_i^(1) ∈ R³ represent the position and rotation of the ROOT joint, and r_i^(j) ∈ R³, j = 2, 3, ..., 22 represent the rotation of the j-th joint N_j. According to the relations between joints, the position of joint N_j at timestamp i can be obtained as follows:

p̄_i^(j) = T_i^(root) R_i^(root) ⋯ T_0^(grandparent) R_i^(grandparent) T_0^(parent) R_i^(parent) p̄_0^(j)  (8)

where p̄_i^(j) is the world coordinate of joint N_j. T_i^(root) and R_i^(root) are the transform and rotation matrices of the ROOT joint at timestamp i; they are obtained from p_i^(1) and r_i^(1). T_0^(k) is the transform matrix of N_k, obtained from the offset in the local coordinate system of its parent, where N_k is a joint between ROOT and N_j in the skeleton hierarchy. R_i^(k) is the rotation matrix of joint N_k at timestamp i, obtained from r_i^(k). p̄_0^(j) is the offset of N_j in the local coordinate system of its parent.

The coordinates of all joints relative to the ROOT joint can be obtained using Eq.8, so human motion can also be defined as:

F_i′ = (p̄′_i^(1), p̄′_i^(2), ..., p̄′_i^(j), ..., p̄′_i^(22))  (9)

where p̄′_i^(j) = p̄_i^(j) − p̄_i^(1), j = 2, 3, ..., 22, and p̄′_i^(1) = (0, 0, 0).

The angle at joint N_j can be obtained using:

θ_i^(j) = arccos[ ((p̄′_i^(j−) − p̄′_i^(j)) · (p̄′_i^(j+) − p̄′_i^(j))) / (|p̄′_i^(j−) − p̄′_i^(j)| |p̄′_i^(j+) − p̄′_i^(j)|) ]  (10)

where p̄′_i^(j−) and p̄′_i^(j+) are the neighbor joints of N_j.

3.3.2 Probability Modeling

The Nearest Neighbor (NN) method, which takes the pose nearest to the input as the estimate, is a classic method for example-based learning. NN is appealing for its simplicity. However, since NN makes no use of temporal information, the result is often unacceptable due to orientation ambiguity: one silhouette can correspond to several poses. The Local Weighted Regression (LWR) method takes the weighted average of the K candidate poses as the estimate for the input image; LWR is often not very accurate because of error accumulation. In this paper, we take temporal information into account and use a weighted Markov method to estimate the pose. The probability model we propose is described as follows.

The transition probability between two poses P_i and P_j is defined as:

P_ij = exp(−Dis(P_i, P_j) / K_p)  (11)

where Dis(P_i, P_j) is the distance between the two poses:

Dis(P_i, P_j) = Σ_{k=1}^{n} w_k || log((r_i^(k))^(−1) r_j^(k)) ||²  (12)

In Eq.12, w_k is the weight of each joint. We set the weight of the key joints to 1 and the others to 0. The key joints include the ROOT, neck, knees, elbows and crotch.

For an input human silhouette, we search for the K nearest silhouettes in the database to form the pose candidate set {S_i1, S_i2, ..., S_iK}. We set a weight for each silhouette in the candidate set so that the similarity between the input image and the candidate pose is taken into account in pose estimation. The weight is calculated as:

w_ik = exp(−D(S_i, S_ik) / K_s)  (13)

where D(S_i, S_ik) is the distance between S_i and S_ik defined as Eq.5.

When estimating the pose from the candidate set, we consider both the similarities and the transition probability, and the candidate pose is selected as:

k* = arg max_k ( w_ik · P_(i−1)(ik) )  (14)

where P_(i−1)(ik) is the transition probability from P_{i−1} to P_{ik}, the pose corresponding to silhouette S_ik.

3.3.3 Statistical Modeling

For each input, there are K poses in the candidate set. As the number of poses in the database increases, the transition probabilities become complex and the time needed to search for a pose increases too. In this paper, we use a clustering method to build a statistical model of the poses in the database. The statistical model clusters the poses based on their similarity in pose space and builds transition probabilities between pose classes. The proposed statistical modeling increases the efficiency and accuracy of pose estimation. It proceeds as follows:

Firstly, we cluster the human poses with the C-means method, obtaining C classes of human pose C_1, C_2, ..., C_c.

Secondly, for a pose P_i in class C_i, we search for its consecutive frame P_{i+1} in the original sequence. If P_{i+1} does not belong to C_i but is a member of C_j, we set class C_j as a child node of class C_i. Doing this for every pose in class C_i, we find all the child nodes of C_i, which form the cluster tree for C_i. Finally, we build the cluster tree for each class C_1, C_2, ..., C_c; all the cluster trees form a cluster forest.

3.3.4 Human Pose Estimation

Accordingly, our probability and statistical modeling based pose estimation method can be summarized as:

Firstly, modeling. We build the probability and statistical models of the database.

Secondly, pose estimation. For a given pose, the estimation of the consecutive frame is divided into three steps: first we obtain the K candidate poses for the consecutive frame; then we remove the candidate poses that do not belong to the child nodes of the current pose, based on the results of statistical modeling; at last, according to the probability model, we select the pose with the biggest weighted transition probability as the estimated pose.

The pose of the first frame in the sequence is estimated by the nearest neighbor method. The estimation of the first frame is very important, since all following work is based on it.

3.4 Motion Smooth

The estimated human poses for an input image sequence may not be smooth, since the pose database may not contain all the poses of the human in the videos. In order to make the output poses form a plausible human motion, we have to smooth them. Kernel regression is a non-parametric estimation method.
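The shape-context matching of Section 3.2 (Eqs.1-6) can be sketched in code as follows. The log-polar bin layout and bin counts are assumptions (the paper does not specify them), and all function names here are illustrative, not from the paper.

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12, alpha=1.0):
    """Per-point histograms h_{p_i}(k) (Eq.1). The outer radius
    R = alpha * mean pairwise distance (Eq.3). The log-polar bin
    layout is an assumption."""
    pts = np.asarray(points, float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]           # q - p_i
    dist = np.linalg.norm(diff, axis=2)
    R = alpha * dist[~np.eye(n, dtype=bool)].mean()    # Eq.3
    r_edges = np.logspace(np.log10(R / 2**n_r), np.log10(R), n_r)
    theta = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)
    hists = np.zeros((n, n_r * n_theta))
    for i in range(n):
        for j in range(n):
            if i == j or dist[i, j] > R:
                continue
            rb = min(np.searchsorted(r_edges, dist[i, j]), n_r - 1)
            tb = int(theta[i, j] / (2 * np.pi) * n_theta)
            hists[i, rb * n_theta + tb] += 1
    return hists

def chi2_cost(h1, h2):
    """Pairwise chi-square matching cost c(p_i, q_j) (Eq.2)."""
    num = (h1[:, None, :] - h2[None, :, :]) ** 2
    den = h1[:, None, :] + h2[None, :, :]
    return 0.5 * np.where(den > 0, num / np.maximum(den, 1e-12), 0).sum(-1)

def mhd_distance(P, Q):
    """Modified-Hausdorff silhouette distance D(P,Q) (Eqs.4-6):
    directed means of per-point minimum costs, then their max."""
    c = chi2_cost(shape_context(P), shape_context(Q))
    C_PQ = c.min(axis=1).mean()   # mean over p of min over q
    C_QP = c.min(axis=0).mean()   # mean over q of min over p
    return max(C_PQ, C_QP)
```

Because the descriptor depends only on point differences, a translated copy of a silhouette has distance zero to the original, which is the invariance the paper relies on.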
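The weighted Markov selection of Section 3.3.2 (Eqs.11-14) can be sketched as a toy example, reading ||log((r_i^(k))^(−1) r_j^(k))|| in Eq.12 as the relative rotation angle between two joint rotations; the helper names and single-joint setup are hypothetical.

```python
import numpy as np

def rot_z(a):
    """Rotation about z by angle a (toy stand-in for joint rotations)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def pose_distance(Ri, Rj, w):
    """Dis(P_i, P_j) of Eq.12: weighted squared relative rotation
    angles, using angle(R_i^T R_j) = ||log(R_i^-1 R_j)||."""
    d = 0.0
    for k, wk in enumerate(w):
        Rrel = Ri[k].T @ Rj[k]
        cos_a = np.clip((np.trace(Rrel) - 1) / 2, -1, 1)
        d += wk * np.arccos(cos_a) ** 2
    return d

def select_pose(prev_pose, candidates, sil_dists, w, Kp=1.0, Ks=1.0):
    """Eqs.11, 13, 14: pick the candidate maximizing
    w_ik * P_(i-1)(ik), where w_ik weights silhouette similarity."""
    best, best_score = None, -np.inf
    for cand, sd in zip(candidates, sil_dists):
        w_ik = np.exp(-sd / Ks)                                   # Eq.13
        trans = np.exp(-pose_distance(prev_pose, cand, w) / Kp)   # Eq.11
        score = w_ik * trans                                      # Eq.14
        if score > best_score:
            best, best_score = cand, score
    return best
```

With equal silhouette distances, the candidate whose joints rotated least relative to the previous frame wins, which is how the transition prior suppresses orientation-ambiguous matches.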


Kernel regression is widely used in data analysis, and in this paper we use it to smooth the estimated poses. Smoothing is performed on the joint coordinate data. For a given estimated pose sequence:

Test_t = (p̄′_t^(1), p̄′_t^(2), ..., p̄′_t^(j), ..., p̄′_t^(22)),  t = 1, 2, ..., T  (15)

where T is the length of the image sequence, we set a weight for each frame using a kernel function:

w_t = exp(−(X_t − t)² / K_w²),  X_t = 1, 2, ..., T  (16)

Then the smoothed pose sequence is:

Test′_t = ( Σ_{t=1}^{T} w_t p̄′_t^(1) / Σ_{t=1}^{T} w_t ,  Σ_{t=1}^{T} w_t p̄′_t^(2) / Σ_{t=1}^{T} w_t , ..., Σ_{t=1}^{T} w_t p̄′_t^(22) / Σ_{t=1}^{T} w_t )  (17)

4. EXPERIMENTAL RESULTS AND ANALYSIS

In this paper we use POSER to synthesize the pose and human silhouette database. Rather than using random synthetic poses, we use real human poses from the CMU human motion capture data. The motions in the database include walking, running and jumping, and the silhouettes are rendered from different viewpoints. The numbers of frames in the database and for testing are shown in Table 1. The real walking test video is from [Sidenbladh and Black 2003], and we shot running and jumping videos for testing too.

Table 1: Human poses in the database

Action | Training frames | Test frames
Walk   | 2808            | 304
Run    | 4013            | 139
Jump   | 2969            | 350

4.1 Radius and Sampled-Points Parameters for SC and Comparison of Different Shape Descriptors

We compared the estimation errors of shape context (SC) under different radii and numbers of sampled points, and then compared three shape descriptors: Fourier descriptors (FD), SC and Hu moments. The experiments were conducted on a synthesized walking sequence, with the 10-NN method used to estimate the pose. As the angles of the human limbs change rapidly, we use the estimation errors of the right and left knee angles to analyze the recovery results. The results of the radius and sampled-points selection experiments are shown in Fig.4.

1) The results with different numbers of sampled points and α = 1 are shown in Fig.4a. As the number of sampled points increases, the estimation errors of both knee angles decrease and converge at 180 sampled points. But more sampled points also increase the time spent on pose estimation, and noise points are included when sampling. So we use 200 sampled points in this paper.

2) Fig.4b shows the results of using different radii when the number of sampled points is 200. The estimation error is minimal when α = 1, so we set α to 1 in this paper; that is, the mean distance R̄ is used.

Figure 4: Results of radius and sampled-points selection for shape context.

3) The results of pose estimation using SC, FD and Hu moments are shown in Fig.5, where the numbers of sampled points for SC and FD are 200. As shown in Fig.5, the mean error of SC is the lowest, FD is next, and Hu moments is the highest.

Figure 5: Comparison of three descriptors: shape context, Fourier descriptors and Hu moments.

Note that the sampled points and mean errors depend on the training database and test sequences; the given results are obtained on our experimental data as described above.
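The kernel-regression smoothing of Section 3.4 (Eqs.15-17) amounts to a Gaussian-weighted average over frames for each joint coordinate. A minimal per-joint sketch follows; the bandwidth value Kw and the helper name are our assumptions.

```python
import numpy as np

def kernel_smooth(traj, Kw=2.0):
    """Kernel-regression smoothing of one joint trajectory (Eqs.15-17).
    traj: (T, 3) coordinates of a joint over T frames. Each output
    frame is a Gaussian-kernel weighted average of all frames."""
    traj = np.asarray(traj, float)
    T = len(traj)
    t = np.arange(T)
    out = np.empty_like(traj)
    for i in range(T):
        w = np.exp(-((t - i) ** 2) / Kw**2)            # Eq.16, centered at frame i
        out[i] = (w[:, None] * traj).sum(0) / w.sum()  # Eq.17
    return out
```

Applied to each of the 22 joint trajectories independently, this suppresses frame-to-frame jitter while leaving a constant trajectory unchanged.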


4.2 Pose Estimation and Smoothing

We compare the results of LWR, NN, Markov and our method on walking, running and jumping sequences, then give the results of KR-based pose smoothing, and finally compare the mean pose estimation errors of all these methods.

Figure 6: Analysis of knee angles on walking and jumping videos. (a) results of left knee angle on the walking sequence; (b) results of left knee angle on the jumping sequence.

Fig.6 shows the results of the different methods on the walking and jumping sequences. Fig.6a shows the estimated left knee angle on the walking sequence. The black broken line and the green line represent the real angles of the human right and left knees respectively. The yellow line represents the results of the NN method: it mistakes the right knee for the left knee at frames 38, 90 and 154, which is caused by orientation ambiguity. The purple line represents the results of the LWR method; its errors are high at some frames because of error accumulation. The Markov method overcomes the orientation ambiguity problem, but its estimated poses are not smooth. The red line represents the results of our probability and statistical modeling method after pose smoothing. The estimated knee angle is not only closest to the real angle but also smooth, which demonstrates that our method is effective. Fig.6b shows the estimated left knee angle on the jumping sequence; the black broken line represents the real left knee angle and the red line the results of our method. Our method estimates the human pose in the video effectively and obtains smooth poses. The results around frame 200 are far from the real angle, mainly because that part of the test data is not seen in the training database. In addition, the experiments on walking, running and jumping indicate that our method can handle human motions of different kinds and complexities.

Figure 7: Comparison of results of different methods.

Fig.7 shows the mean estimation errors of LWR, NN, Markov and our method on the walking, running and jumping sequences. The mean error of our probability and statistical modeling method is lower than those of NN, LWR and Markov. Our KR-based smoothing not only makes the poses smoother but also decreases the mean error. In our experiments, the error of the left knee angle is higher than that of the right on the running and walking motions. This is because the human in our test videos runs and walks from left to right, so the left knee is occluded.

4.3 Pose Estimation on Real Image Sequences

The results of our method on real walking, running and jumping videos are shown in Fig.8. The videos were shot in natural environments, and the human in the running video runs in a circle. As shown in Fig.8, our method analyzes the 3D pose from the videos effectively. The results on the human running in a circle demonstrate that our method is invariant to viewpoints.

4.4 Discussion on the Database and Our Method

The human pose database is important to example-based methods. The types and complexities of motions that our method can analyze are limited by the types and numbers of motions in the database, and the viewpoints and body dimensions of the humans in the database have a great influence on pose estimation. Constructing a huge database that contains all kinds of motions, dimensions and viewpoints is difficult, and it would also make pose estimation more time-consuming. A specific, well-organized database for a specific video analysis task is the advisable approach. In addition, our method depends on silhouette detection from video, which remains a difficult and not well solved problem; hence people have turned to robust human detection and image descriptors that do not rely on silhouettes. There is still much work to do to make pose estimation more accurate and suitable for real-time applications.


Figure 8: Results on real videos of walking, jumping and running.

5. CONCLUSION

Vision-based human pose estimation is important for pervasive computing. In this paper, we proposed an example-based method for human pose estimation from monocular image sequences. We first build the pose and silhouette databases using motion capture data. Then we use shape context to describe the silhouette of the human extracted from the video frames, and get candidates from the pose database by silhouette matching. Secondly, we build the probability and statistical model of motion and estimate the pose from the candidates; the temporal and spatial models improve the accuracy and efficiency of pose estimation. Finally, kernel regression is used to smooth the motion, which makes the estimated poses form a plausible human motion. The proposed method effectively analyzes 3D pose from videos, solves the orientation ambiguity problem, and is invariant to viewpoints. Its effectiveness is verified on different types of human videos. Non-silhouette-based detection methods and approaches for real-time applications will be our future work.

6. ACKNOWLEDGMENTS

This paper is supported by grants from the National 863 Project of China [No. 2007AA01Z334], the National Natural Science Foundation of China [Grant No. 69903006 and 60373065] and the Program for New Century Excellent Talents in University of China [Grant No. NCET-04-0460].

7. REFERENCES
[1] A. Agarwal and B. Triggs. 2006. Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 1, 44-58.
[2] A. Elgammal and C.S. Lee. 2004. Inferring 3D body pose from silhouettes using activity manifold learning. In Computer Vision and Pattern Recognition, 681-688.
[3] G. Shakhnarovich, P. Viola, and T. Darrell. 2003. Fast pose estimation with parameter-sensitive hashing. In International Conference on Computer Vision, 750-759.
[4] G. Mori and J. Malik. 2006. Recovering 3D human body configurations using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 7, 1052-1062.
[5] G. Mori and J. Malik. 2002. Estimating human body configurations using shape context matching. In European Conference on Computer Vision, 666-680.
[6] K. Grauman, G. Shakhnarovich, and T. Darrell. 2003. Inferring 3D structure with a statistical image-based shape model. In International Conference on Computer Vision, 13-16.
[7] L. Sigal and M. J. Black. 2006. HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical Report CS-06-08, Brown University, Department of Computer Science, Providence, RI, September.
[8] M. Brand. 1999. Shadow puppetry. In International Conference on Computer Vision, 1237-1244.
[9] N. R. Howe. 2006. Evaluating lookup-based monocular human pose tracking on the HumanEva test data. In NIPS Workshop on Evaluation of Articulated Human Motion and Pose Estimation (EHuM).
[10] N. R. Howe. 2004. Silhouette lookup for monocular 3D pose tracking. In Conference on Computer Vision and Pattern Recognition Workshop, 15-22.
[11] P. Spagnolo, T. D'Orazio, M. Leo, and A. Distante. 2006. Moving object segmentation by background subtraction and temporal analysis. Image and Vision Computing, 24, 5, 411-423.
[12] R. Rosales, M. Siddiqui, J. Alon, and S. Sclaroff. 2001. Estimating 3D body pose using uncalibrated cameras. In Computer Vision and Pattern Recognition, 9-14.
[13] R. Poppe and M. Poel. 2006. Comparison of silhouette shape descriptors for example-based human pose recovery. In International Conference on Automatic Face and Gesture Recognition, 541-546.
[14] R. Poppe. 2007a. Evaluating example-based pose estimation: Experiments on the HumanEva sets. In Computer Vision and Pattern Recognition Workshop, 1-8.
[15] R. Poppe. 2007. Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108, 1, 4-18.
[16] T. B. Moeslund and E. Granum. 2001. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81, 3, 231-268.
[17] T. B. Moeslund, A. Hilton, and V. Kruger. 2006. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104, 2, 90-126.
[18] W. Hu, T. Tan, L. Wang, and S. Maybank. 2004. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, 34, 3, 334-352.
[19] H. Sidenbladh and M. J. Black. 2003. Learning the statistics of people in images and video. International Journal of Computer Vision, 54, 1-3, 183-209. Data: http://www.nada.kth.se/~hedvig/data.html
