
Accurate Estimation of Human Body Orientation From RGB-D Sensors

Wu Liu, Yongdong Zhang, Member, IEEE, Sheng Tang, Member, IEEE, Jinhui Tang, Member, IEEE, Richang Hong, and Jintao Li, Member, IEEE

Abstract—Accurate estimation of human body orientation can significantly enhance the analysis of human behavior, which is a fundamental task in the field of computer vision. However, existing orientation estimation methods cannot handle the variety of body poses and appearances. In this paper, we propose an innovative RGB-D-based orientation estimation method to address these challenges. By utilizing RGB-D information, which can be acquired in real time by RGB-D sensors, our method is robust to cluttered environments, illumination changes, and partial occlusions. Specifically, efficient static and motion cue extraction methods are proposed based on RGB-D superpixels to reduce the noise of the depth data. Since it is hard to discriminate all 360° of orientation using static cues or motion cues alone, we propose a dynamic Bayesian network system (DBNS) to effectively exploit the complementary nature of both static and motion cues. In order to verify the proposed method, we build an RGB-D-based human body orientation dataset that covers a wide diversity of poses and appearances. Our intensive experimental evaluations on this dataset demonstrate the effectiveness and efficiency of the proposed method.

Index Terms—DBNS, human body orientation estimation, RGB-D, superpixel.

I. Introduction

In this paper, we focus on a method to estimate human body orientation effectively and efficiently. Similar to [3], [4], [32], [38], human body orientation is defined as the orientation of the torso, considering only the angle around the axis perpendicular to the ground plane.

Manuscript received October 31, 2012; revised April 11, 2013; accepted June 21, 2013. Date of publication July 23, 2013; date of current version September 11, 2013. This work was supported in part by the National Natural Science Foundation of China under Grant 61173054, Grant 61100087, and Grant 61103059, the National Key Technology Research and Development Program of China under Grant 2012BAH39B02, the Program for New Century Excellent Talents in University under Grant NCET-12-0632, and the Natural Science Foundation of Jiangsu Province under Grant BK2012033 and Grant BK2011700. Recommended by Associate Editor L. Shao. W. Liu is with the Advanced Computing Research Laboratory, Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China, and also with the University of CAS, Beijing 100049, China (e-mail: [email protected]). Y. Zhang, S. Tang, and J. Li are with the Advanced Computing Research Laboratory, Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China (e-mail: [email protected]; [email protected]; [email protected]). J. Tang is with the Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: [email protected]). R. Hong is with the Hefei University of Technology, Hefei 230009, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2013.2272636

Fig. 1. Examples of human body orientation. For convenience of analysis, the body orientation space is divided into eight nonoverlapping partitions (S, SE, E, NE, NW, W, SW, and N), each spanning 45°. The RGB-D sensor is located to the south of the people.

Some examples of human body orientation are shown in Fig. 1, in which eight people face eight different directions (S, SE, E, NE, NW, W, SW, and N). Accurate human body orientation estimation can significantly enhance people tracking [4], pose estimation [34], [38], attribute extraction [19], and action recognition [20], which are all fundamental tasks in the field of computer vision. For example, the work in [38] regards body orientation as prior knowledge and shows that incorporating orientation can dramatically increase the accuracy of human pose estimation. Furthermore, human body orientation estimation also contributes to the understanding of human attention and to relation detection, which is very useful in business analysis and perceptual interfaces [25]. Nonetheless, accurate human body orientation estimation is challenging because of the wide diversity of poses and appearances that humans can take. In addition, since body orientation is mostly used as prior knowledge by other tasks, the computational complexity of the estimation method must also be carefully considered.


Previous works on human body orientation estimation can be divided by the type of data they rely on, i.e., 2-D information or 3-D information. Due to the missing geometric information, 2-D-based algorithms are sensitive to cluttered environments, illumination changes, and partial occlusions. Although algorithms relying on 3-D data have achieved good estimation results [1], [26], [32], [43], existing 3-D information acquisition methods are sophisticated and expensive [7]. Recently, low-cost RGB-D sensors (Kinect sensors) have brought an opportunity to address this challenging issue [14]. RGB-D sensors capture an RGB image along with per-pixel depth information in real time. The complementary nature of the appearance (RGB) and depth information opens up new opportunities to solve fundamental problems in computer vision, including robotic manipulation [16], human–computer interaction [9], [34], [38], [39], 3-D mapping and localization [13], [24], and so on. In particular, RGB-D information has been used to estimate head orientation in [7] and achieved state-of-the-art performance. Compared with 2-D information, the geometric information brought by depth makes it possible to overcome cluttered environments, illumination changes, and partial occlusions in orientation estimation; similar to 3-D information, it also discriminates orientation more strongly than 2-D information. More importantly, compared with 3-D information, the acquisition of RGB-D information through RGB-D sensors is cheaper, more convenient, and has lower computational complexity.

In spite of these advantages, compared with existing 2-D- and 3-D-based human orientation estimation methods, methods using RGB-D information face a number of specific challenges. First, the depth information provided by low-cost RGB-D sensors is noisy and inhomogeneous. Only 794 raw depth values are used to encode the depth of each pixel, while the actual measurement range is 0–10 m. As shown in Fig. 2(a), 86.9% of the depth values encode the interval between 0 m and 2.5 m, leaving only 140 values to describe the remaining 2.5–10 m range [37]. Therefore, as shown in Fig. 2(b), the range resolution far from the camera is very coarse. Furthermore, the 3-D information obtained by a single RGB-D camera is incomplete, which is why it is called 2.5-D data [7]: as illustrated in Fig. 2(b), only the front side (the side facing the camera) can be observed. In conclusion, although RGB-D information is useful for orientation estimation, a dedicated algorithm is needed to exploit the mutual complementarity between the appearance information and the noisy, inhomogeneous depth data.

Therefore, an effective and efficient human body orientation estimation method, which takes full advantage of RGB-D information, is proposed in this paper. First, in order to reduce the distraction of the noisy and inhomogeneous depth data, RGB-D-based superpixels are used instead of the original RGB-D points. Based on the superpixels, static and motion cues are extracted to estimate the orientation.


Fig. 2. Character of RGB-D information. (a) Transition from raw depth values to metric depth values. (b) Three different views of one RGB-D frame obtained from a RGB-D sensor.

As static cues, a superpixel-based viewpoint feature histogram (SVFH) that encodes the geometry and viewpoint information is extracted. However, methods depending only on the SVFH cannot distinguish symmetric orientation partitions, e.g., W and E, because their depth information is nearly identical. Hence, we introduce two kinds of motion cues to remedy this. We first extract superpixel-based scene flow (SSF) information as a motion cue, based on the assumption that a person's body orientation is nearly parallel to the moving direction if the speed is faster than a certain threshold [3], [4], [15], [30]. We also extract temporal information indicating the continuity of the human orientation as a motion cue. Consequently, we propose to fuse the static and motion cues with a dynamic Bayesian network system (DBNS), exploiting the complementary nature of these cues to improve the orientation discriminability of our method. Finally, a new RGB-D-based human body orientation dataset is built to evaluate our algorithm.

Our contributions are summarized as follows.
1) We innovatively use RGB-D information to estimate human body orientation. Compared with 2-D information, the additional depth information improves the discriminability of human body orientation, and with the advent of RGB-D sensors it is more conveniently obtained than traditional 3-D information.
2) RGB-D-based superpixels are extracted, which are more robust and sparser than the original RGB-D points. Based on the superpixels, efficient static and motion cue extraction methods are proposed.
3) In order to improve the orientation discriminability for the various body poses and appearances in the real world, a DBNS is designed to effectively exploit the complementary nature of both static and motion cues.

The rest of this paper is organized as follows. Existing related works are reviewed in Section II. In Section III, the


proposed method is presented, including the superpixel extraction, SVFH feature extraction, SSF estimation, and the DBNS that combines all the cues. In Section IV, a new RGB-D-based human body orientation dataset is introduced and experimental evaluations on this dataset are presented. Conclusions and future work are given in Section V.

II. Related Work

There are plenty of previous works related to human body orientation estimation. Depending on the type of data they rely on, existing approaches mainly fall into two categories: 2-D-based methods and 3-D-based methods.

A. 2-D-Based Human Body Orientation Estimation

Generally, 2-D-based human body orientation estimation methods can be divided into three categories: static-cue-based methods [6], [8], [20], [33], motion-cue-based methods [15], [30], and methods combining static and motion cues [3], [4], [25].

Many static-cue-based methods formalize human body orientation estimation as a classification problem: the orientation space is divided into n categories (e.g., eight categories), specific visual features are extracted, and orientation classifiers are trained on these features. For example, Haar wavelets and HOG features are extracted in [8] and [33], respectively, and eight viewpoint classifiers are trained with support vector machines [2]. In [6], pedestrian classification and orientation estimation are integrated into a set of view-related models. Unlike the above popular visual features, the poselets activation vector, which describes a distributed representation of human poses and appearances, is used in [20]. However, static-cue-based methods may introduce ambiguities when discriminating symmetric orientations due to the missing geometric information; for example, it is difficult to distinguish facing left from facing right when a person stands sideways.

In order to reduce these ambiguities, motion-cue-based methods suppose that the body orientation is nearly parallel to the person's moving direction and use motion cues to estimate the orientation [15], [30]. Nevertheless, without exploiting body-pose-related features, these methods are problematic when the target person does not move or moves slowly, as the velocity orientation becomes too noisy to provide reliable information. Thus, Ozturk et al. [25] use a shape-context-based approach to detect the basic body orientation and then propose a primary logic approach to refine the orientation with optical flow. Furthermore, Chen et al. [3] integrate multilevel HOG features into a temporal filtering framework to estimate body orientation. In order to better fuse body-pose-related features with motion cues, a novel semi-supervised approach for coupled adaptive learning is proposed in [4].

In summary, 2-D-based human body orientation estimation methods rely mainly on visual features to distinguish different orientations, and the lack of geometric information reduces their discriminability. Although introducing supplementary motion cues is beneficial, estimating motion cues in a 2-D setting is also very difficult. Therefore, more effective cues are needed to assist human body orientation estimation.

B. 3-D-Based Human Body Orientation Estimation

Due to the difficulties caused by the inherent limitations of 2-D-based human body orientation estimation, in particular the lack of geometric information, many works turn to 3-D information as the primary cue [1], [26], [32]. Peng et al. [26] propose a multi-camera scenario in which orientation vectors are extracted from binary silhouette images by performing multilinear analysis; from these vectors, a 1-D manifold is learned to estimate the body orientation. In contrast, in order to improve efficiency, the approach in [32] analyzes the silhouette information of each camera view separately and fuses all single-view results within a Bayesian filter framework. Also using a Bayesian filter framework, Yao et al. [43] employ three 3-D elliptic cylinders instead of binary silhouettes to represent people. This change allows introducing a spatial color layout that is useful for discriminating the tracked person from potential distracters, and the approach is robust to occlusion and large variations in human appearance. Different from the multi-camera scenario, Andriluka et al. [1] estimate 3-D human poses with a monocular camera, and human body orientation estimation is a key step in their approach; however, their orientation estimation relies on pictorial structures, which is very similar to the 2-D-based method in [20].

Admittedly, the additional dimension can help overcome the limitations of 2-D-based methods, but the 3-D information acquisition process of a multi-camera setup is not available in some surveillance systems or mobile applications [7]. Fortunately, the works in [19], [34] show that the 2.5-D information obtained by recent RGB-D sensors, which is far more conveniently acquired, can substitute for 3-D information. However, for the reasons introduced in Section I, existing 3-D-based methods cannot be used directly on this 2.5-D information. Although Fanelli et al. have addressed the problem of head orientation estimation with depth information [7], very few works have estimated body orientation with depth information. Different from the head, the human body can take a much wider diversity of poses and appearances, so the method proposed in [7] cannot be used to estimate body orientation. Although Sun et al. regard body orientation as prior knowledge to estimate body pose from RGB-D sensors [38], their method only covers the −75° to 75° orientation space.

In conclusion, compared with 2-D-based methods, the additional geometric information (3-D or depth) improves human body orientation estimation in 3-D-based methods, and RGB-D sensors provide an efficient way to acquire geometric information. As existing 3-D-based methods cannot be used directly on RGB-D data, dedicated RGB-D-based methods need to be explored.

III. Proposed Approach

An overview of our approach is shown in Fig. 3. The first stage is superpixel extraction. For a given video frame, we first use RGB-D-based human segmentation to obtain each person's individual region. Next, for each individual region, a modified watershed algorithm that combines the registered RGB and depth information is used to obtain its superpixels. Each person is then represented by these superpixels.


Fig. 3. Overview of our approach. 1. Superpixels extraction. For a given frame, the RGB-D-based human segmentation is used to get its individual region. Then the superpixels are extracted by an improved RGB-D-based watershed algorithm. 2. Static and motion cues extraction. On the one hand, SVFH features are extracted and the static classifier is trained to get the static cues. On the other hand, as the motion cues, SSF information is obtained from two adjacent frames. 3. Combination. DBNS combining all the cues is used to get the final estimation results.

The second stage is static and motion cue extraction. On the one hand, SVFH features are extracted from each individual and a static classifier is trained to predict the static cues; on the other hand, as motion cues, the SSF information is estimated by a particle filter (PF) algorithm. The third stage is the fusion stage, in which the static human orientation classification results, the SSF information, and the temporal information are embedded into a DBNS to obtain the final estimation results.

A. RGB-D-Based Superpixel Acquisition and Representation

Superpixels have been one of the most promising representations, with demonstrated success in image segmentation, object recognition, and tracking [17], [29], [42]. Here, an improved watershed algorithm that combines the registered RGB and depth information is used to obtain the superpixels. First, we need to obtain each person's individual region. Most 2-D-based people detection approaches [5], [10] are plagued by the difficulty of separating subjects from the background, also known as the detection or figure-ground segmentation step. Hence, 3-D information has been employed to address this problem [23], [36]. Compared with methods using only 3-D information, RGB-D-based people detection methods [19], [22], which combine appearance and depth information, are more robust and faster. Therefore, we use the human detection and segmentation method proposed in [19] to obtain each person's individual point cloud, which contains registered RGB and depth information, as shown in Fig. 4(b). Next, M seed points are densely distributed over the individual region (M = 800 in our method). In order to combine the intensity and depth values, the color difference in the watershed algorithm is redefined between two neighboring pixels $p_1$ and $p_2$ as

$\max\bigl(|I_{p_1} - I_{p_2}|,\; |D_{p_1} - D_{p_2}|\bigr)$   (1)

where $I$ is the intensity value and $D$ is the depth value normalized into 0–255. Flooding from the seed points, the superpixels are then obtained by the watershed algorithm [21].
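A minimal sketch of how such RGB-D superpixels might be extracted is given below. It uses scikit-image's marker-based watershed as a stand-in for the modified watershed described above, with the pixel-wise maximum of the intensity and depth gradients as the flooding surface to approximate (1); the plane-fit normal estimate and all function and parameter names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of RGB-D superpixel extraction, assuming registered intensity/depth images.
import numpy as np
from skimage.filters import sobel
from skimage.segmentation import watershed

def rgbd_superpixels(intensity, depth, person_mask, num_seeds=800, seed=None):
    """intensity, depth: HxW float arrays scaled to 0-255; person_mask: bool HxW."""
    rng = np.random.default_rng(seed)
    # Combined flooding surface: the larger of the intensity and depth gradients,
    # so a boundary in either channel stops the flooding (cf. (1)).
    surface = np.maximum(sobel(intensity), sobel(depth))
    # Densely scatter M seed points (markers) inside the person region.
    ys, xs = np.nonzero(person_mask)
    idx = rng.choice(len(ys), size=min(num_seeds, len(ys)), replace=False)
    markers = np.zeros(intensity.shape, dtype=np.int32)
    markers[ys[idx], xs[idx]] = np.arange(1, len(idx) + 1)
    labels = watershed(surface, markers=markers, mask=person_mask)
    # Represent each superpixel sp(i) = {c_i, n_i, pn_i}: center, surface normal
    # (here approximated by a least-squares plane fit over the depth), pixel count.
    superpixels = []
    for lab in range(1, len(idx) + 1):
        region = labels == lab
        if not region.any():
            continue
        v, u = np.nonzero(region)
        z = depth[region]
        center = np.array([u.mean(), v.mean(), z.mean()])
        # Plane z = a*u + b*v + c  ->  normal proportional to (-a, -b, 1).
        A = np.c_[u, v, np.ones(len(u))]
        a, b, _ = np.linalg.lstsq(A, z, rcond=None)[0]
        normal = np.array([-a, -b, 1.0])
        normal /= np.linalg.norm(normal)
        superpixels.append({"center": center, "normal": normal,
                            "pixels": int(region.sum())})
    return superpixels
```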

Fig. 4. Procedure of the RGB-D-based superpixel acquisition. (a) Input RGB and depth information. (b) People’s individual region that combines the RGB and depth information. (c) Individual region with the red seed points. (d) Extracted superpixels (represented by different colors).

Finally, each superpixel is represented by $sp(i) = \{c_i, n_i, pn_i\}$, where $c_i = [x_i, y_i, z_i]$ is the center of the $i$th superpixel, $n_i = [\bar{x}_i, \bar{y}_i, \bar{z}_i]$ is the surface normal vector of the superpixel, and $pn_i$ is the number of pixels in the superpixel, $0 \le i < M$.

B. SVFH Feature Extraction

Although depth cameras have been exploited in computer vision for several years, few adaptive depth features have been proposed to characterize objects, owing to the currently noisy and incomplete depth information. The viewpoint feature histogram (VFH) was first proposed in [31] to recognize objects and their poses from 3-D point cloud data; the feature encodes the geometry and viewpoint information of the 3-D points. However, the original VFH is extracted from all 3-D points, which is time consuming and more sensitive to noise. In particular, different from the objects used in [31], people are nonrigid and much larger. Therefore, we propose the SVFH, i.e., the VFH feature extracted from superpixels. As superpixels are sparser and more stable than individual points, the SVFH is more efficient and stable than the VFH. Furthermore, unlike individual points, superpixels differ in area, and a superpixel with a larger area should contribute more to the histogram. The detailed SVFH extraction method is introduced as follows.

The SVFH feature consists of two parts: a viewpoint direction component and a surface shape component, which are shown in Fig. 5.


The viewpoint direction component is a 128-bin histogram that collects the surface normal directions of all the superpixels. The surface shape component is a 196-bin histogram that is computed as follows. As shown in Fig. 5(a) and (c), for each superpixel $sp(i) = \{c_i, n_i, pn_i\}$, the normal angular deviation between it and the central superpixel $sp(c) = \{c_c, n_c, pn_c\}$ is calculated by

$\alpha = v \cdot n_i, \quad \phi = u \cdot \dfrac{c_i - c_c}{\|c_i - c_c\|_2}, \quad \theta = \arctan(w \cdot n_i,\; u \cdot n_i)$   (2)

where $u$, $v$, and $w$ represent a Darboux frame coordinate system calculated through

$u = n_c, \quad v = u \times \dfrac{c_i - c_c}{\|c_i - c_c\|_2}, \quad w = u \times v.$   (3)

The $\alpha$, $\phi$, and $\theta$ angles are each divided into 45 bins, and another 45 bins count the distances between the central superpixel and the other superpixels. The first 308 bins of the histogram (including the viewpoint direction component) can be calculated using the method in [18]. In addition, the torso region is divided into small regions by the method in [19]; the angle between each region's central point and the body's central point in the horizontal plane is calculated and counted into a 16-bin histogram. Finally, the 324-bin histogram composes the SVFH feature. Different from the VFH, when calculating the histogram, a superpixel with a larger area contributes more. Hence, each superpixel is weighted by the ratio of its real area to that of the central superpixel, which can be calculated by

$w_i = S_{ratio}^{(i)} = \dfrac{S(sp(i))}{S(sp(c))} = \dfrac{(pn_i / pn) \times 4 z_i^2 \times \tan(\Theta_{hor}/2) \times \tan(\Theta_{ver}/2)}{(pn_c / pn) \times 4 z_c^2 \times \tan(\Theta_{hor}/2) \times \tan(\Theta_{ver}/2)} = \dfrac{pn_i \times z_i^2}{pn_c \times z_c^2}$   (4)

where $pn$ is the number of pixels in the whole depth image, and $\Theta_{hor}$ and $\Theta_{ver}$ are the horizontal and vertical fields of view of the RGB-D sensor.

In order to estimate the orientation with the SVFH, static orientation classifiers are trained. First, the human body orientation is quantized into eight classes (S, SW, SE, W, E, NW, NE, and N), each spanning 45°. For each individual region in the training set, the 324-dimensional SVFH feature vector is extracted, and a multiclass support vector machine [2] is employed to obtain the final static classifier. Finally, when new data arrive, the probabilities of the data being classified into the eight orientation classes are predicted by the static classifier. These probabilities are used as static cues.
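As a rough illustration of the angular part of the surface shape component, the sketch below computes the Darboux-frame angles of (2)–(3) and accumulates them into area-weighted histograms following (4). It reuses the hypothetical superpixel dictionaries from the earlier sketch; the bin ranges and the 0–2 m distance normalization are assumptions, and only the 180 angle/distance bins are produced (the 128 viewpoint bins and 16 torso bins are omitted).

```python
# Sketch of the alpha/phi/theta/distance histograms of the SVFH shape component.
import numpy as np

def svfh_shape_component(superpixels, central, n_bins=45):
    """Area-weighted histograms of alpha, phi, theta, and distance w.r.t. the central superpixel."""
    c_c, n_c, pn_c = central["center"], central["normal"], central["pixels"]
    hists = np.zeros((4, n_bins))
    for sp in superpixels:
        c_i, n_i, pn_i = sp["center"], sp["normal"], sp["pixels"]
        d = c_i - c_c
        dist = np.linalg.norm(d)
        if dist < 1e-6:
            continue  # skip the central superpixel itself
        # Darboux frame (3): u = n_c, v = u x d/|d|, w = u x v.
        u = n_c
        v = np.cross(u, d / dist)
        w = np.cross(u, v)
        # Angular deviations (2).
        alpha = np.dot(v, n_i)                               # in [-1, 1]
        phi = np.dot(u, d / dist)                            # in [-1, 1]
        theta = np.arctan2(np.dot(w, n_i), np.dot(u, n_i))   # in [-pi, pi]
        # Area-ratio weight (4): pn_i * z_i^2 / (pn_c * z_c^2).
        weight = (pn_i * c_i[2] ** 2) / (pn_c * c_c[2] ** 2)
        # Assumed bin ranges; the 0-2 m distance normalization is illustrative.
        for k, (val, lo, hi) in enumerate(
            [(alpha, -1, 1), (phi, -1, 1), (theta, -np.pi, np.pi), (dist, 0, 2.0)]
        ):
            b = min(int((val - lo) / (hi - lo) * n_bins), n_bins - 1)
            hists[k, b] += weight
    return hists.ravel()  # 4 * 45 = 180 bins
```

The resulting vector, concatenated with the viewpoint and torso histograms, could then be fed to a multiclass SVM (e.g., LIBSVM [2]) with probability outputs to obtain the static cues.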

Fig. 5. SVFH Extraction. (a) Extraction of SVFH from a fraction of superpixels, where the red point is the center of the central superpixel, the yellow point is the center of each superpixel, and blue arrow is the surface normal of each superpixel. SVFH contains viewpoint direction component and surface shape component. (b) Surface normal of one superpixel and the viewpoint direction component is a 128-bin histogram which collects all superpixels’ surface normal direction. (c) Normal angular deviation between two surface normals.

C. Superpixel-Based Sparse Scene Flow Estimation

The static human orientation classifier gives a basic estimation result for a single frame. However, as the depth data is noisy and incomplete, the results are not satisfactory. In order to improve the discriminative ability, we extract motion cues as a supplement. Scene flow is the full 3-D motion field of an observed scene; it is analogous to optical flow, which is the projection of the 3-D motion field onto the image plane. Reference [11] presents a novel RGB-D-based approach for scene flow estimation, and [12] uses this approach to track hands and estimate their trajectories. However, as the scene flow estimation methods in [11] and [12] consider all the RGB-D points, their complexity is very high. Instead, we propose the SSF, which regards only the superpixels' central points as scene particles, reducing the complexity. Furthermore, the observation model used in [11] considers only the depth information. As the complementary nature of depth and appearance (RGB) information brings more cues for superpixel tracking, our SSF estimation method employs a new observation model that considers both. Compared with existing scene flow estimation methods, our SSF estimation method is more robust and efficient. The detailed SSF estimation process is introduced as follows.

First, we regard each superpixel's central point in frame $t$ as a scene particle. Its state vector is defined as $s_t = [c_t, v_t, pn_t]$, where $c_t = [x_t, y_t, z_t]$ is the particle's 3-D location, $v_t = [\dot{x}_t, \dot{y}_t, \dot{z}_t]$ is its velocity, and $pn_t$ is the number of pixels in the superpixel. $c_0$ and $pn_0$ are initialized from the superpixel and $v_0$ is initialized to 0. Given the state $s_{t-1}$ and observations $z_{1:t-1}$ of every scene particle, the PF algorithm [27] is used to find the next state $s_t = [c_t, v_t, pn_t]$. Similar to [11], the posterior probability $p(s_t \mid z_{1:t})$ can be represented by applying Bayes' rule

$p(s_t \mid z_{1:t}) \propto p(z_t \mid s_t) \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid z_{1:t-1})\, \mathrm{d}s_{t-1}$   (5)

where $z_{1:t}$ denotes the observations up to time $t$. As in general tracking problems solved by PF, it is critical to construct an efficient observation model $p(z_t \mid s_t)$. To exploit the complementary nature of the depth and appearance (RGB) information, we propose a new observation model with two types of observations, $z_t = [z_t^{RGB}, z_t^{D}]$, where $z_t^{RGB}$ is based on the appearance constancy assumption and $z_t^{D}$ is based on the structure constancy assumption. Under the observation independence assumption, the observation model is

$p(z_t \mid s_t) = p(z_t^{RGB} \mid s_t)\, p(z_t^{D} \mid s_t)$   (6)

where $p(z_t^{RGB} \mid s_t)$ is the appearance observation likelihood and $p(z_t^{D} \mid s_t)$ is the structure observation likelihood.

The appearance observation likelihood is calculated as

$p(z_t^{RGB} \mid s_t) = \dfrac{1}{\sqrt{c(z_t^{RGB} \mid r, v)^2 + \varepsilon^2}}$   (7)

where $\varepsilon$ is a smoothing term and $c(z_t^{RGB} \mid r, v)$ is the appearance cost, computed by

$c(z_t^{RGB} \mid r, v) = \sum_i \min\bigl(H_t(i), H_{t-1}(i)\bigr)$   (8)

where $H_t(i)$ is the HSV color histogram of one superpixel; the appearance cost equals the intersection distance of two consecutive superpixels' histograms.

Besides the appearance constancy assumption, the structure constancy assumption is also very important. Instead of the constant depth hypothesis in [11], we suppose that each superpixel's area and relative position on the body should be constant. Hence, in our method, the structure observation likelihood is measured under the area and relative-position constancy assumptions. For each scene particle, its relative position is calculated as the distance from the center of the superpixel to the center of the human body. The structure observation likelihood is therefore calculated by

$p(z_t^{D} \mid s_t) = \dfrac{S(c_t)/S(c_{t-1})}{|d(c_t) - d(c_t - v_t)|}$   (9)

where $d(c) = \|c - c_c\|_2$ is the distance to the body center and $S(c_t)/S(c_{t-1})$ is computed by

$\dfrac{S(c_t)}{S(c_{t-1})} = \dfrac{S(sp(c_t))}{S(sp(c_{t-1}))} = \dfrac{pn_t \times z_t^2}{pn_{t-1} \times z_{t-1}^2}.$   (10)

As the observation models are determined, the SSF information can be estimated by the PF algorithm.
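To make the observation model concrete, the sketch below evaluates (6)–(10) for a single scene particle, as it might be used inside a standard PF weight update. The dictionary fields, the value of ε, and the small guard term in the denominator are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the SSF observation model (6)-(10) used to weight scene particles.
import numpy as np

EPS = 0.1  # smoothing term epsilon in (7); assumed value

def appearance_likelihood(hist_t, hist_tm1):
    """Eq. (7)-(8): likelihood from the HSV histogram intersection of two superpixels."""
    cost = np.minimum(hist_t, hist_tm1).sum()
    return 1.0 / np.sqrt(cost ** 2 + EPS ** 2)

def structure_likelihood(pn_t, z_t, pn_tm1, z_tm1, dist_t, dist_pred):
    """Eq. (9)-(10): area ratio over the change of distance to the body center."""
    area_ratio = (pn_t * z_t ** 2) / (pn_tm1 * z_tm1 ** 2)
    return area_ratio / (abs(dist_t - dist_pred) + 1e-6)  # small term avoids /0

def particle_weight(particle, obs, body_center):
    """Eq. (6): product of appearance and structure likelihoods for one scene particle.
    `particle` (frame t-1) and `obs` (frame t) hold center, velocity, pixel count, HSV hist."""
    p_rgb = appearance_likelihood(obs["hist"], particle["hist"])
    dist_t = np.linalg.norm(obs["center"] - body_center)              # d(c_t)
    dist_pred = np.linalg.norm((obs["center"] - particle["velocity"]) - body_center)  # d(c_t - v_t)
    p_d = structure_likelihood(obs["pixels"], obs["center"][2],
                               particle["pixels"], particle["center"][2],
                               dist_t, dist_pred)
    return p_rgb * p_d
```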

Fig. 6. DBNS combining the static cues and motion cues. Node st is the static cues, vt is the SSF information, and θt is the human body orientation to be estimated.

D. Combination of Static Cues and Motion Cues

After the static cues and motion cues are obtained, combining the two cues is an important step. Recently, graph-based methods have become very popular for solving learning problems [41]. Therefore, different from simple early fusion or late fusion [35], we propose to combine the static cues with the motion cues by the DBNS shown in Fig. 6, which is robust to a wide diversity of body poses and appearances. Here, node $s_t$ denotes the static cues, $v_t$ the SSF information, and $\theta_t$ the human body orientation to be estimated. The edges between the $v$ nodes model the motion relation between consecutive frames, which has already been described in Section III-C. According to the graph, we can define a transition probability term for each $\theta_t$ according to

$p(\theta_t \mid \theta_{t-1}, s_t, v_t) = p(\theta_t \mid \theta_{t-1})\, p(\theta_t \mid s_t)\, p(\theta_t \mid v_t).$   (11)

$p(\theta_t \mid \theta_{t-1})$ favors the continuity of the human body orientation and is modeled as

$p(\theta_t \mid \theta_{t-1}) = \dfrac{e^{k_0 \cos(\mathrm{dif}(\theta_t, \theta_{t-1}))}}{2\pi I_0(k_0)}$   (12)

where $k_0$ is the concentration parameter ($k_0 = 2.4$ in our experiments), $I_0$ is the zeroth-order modified Bessel function, and $\mathrm{dif}(\theta_t, \theta_{t-1})$ is defined as

$\mathrm{dif}(\theta_t, \theta_{t-1}) = \begin{cases} |\theta_t - \theta_{t-1}|, & \text{if } |\theta_t - \theta_{t-1}| \le \pi \\ 2\pi - |\theta_t - \theta_{t-1}|, & \text{otherwise.} \end{cases}$   (13)

Equation (12) means that the body orientation at time $t$ should be distributed around the orientation at the previous time $t-1$, which serves as temporal information [3]. $p(\theta_t \mid s_t)$ is the probability output by the SVFH-based orientation classifier. $p(\theta_t \mid v_t)$ captures the relationship between the human body orientation and the SSF orientation:

$p(\theta_t \mid v_t) = \dfrac{e^{\kappa(v_t) \cos(\mathrm{dif}(\theta_t, \mathrm{ang}(v_t)))}}{2\pi I_0(\kappa(v_t))}$   (14)

where $v_t = [\dot{x}_t, \dot{y}_t, \dot{z}_t]$, $\mathrm{ang}(v_t) = \arctan(\dot{z}_t / \dot{x}_t)$, and $\kappa(v_t)$ is given by

$\kappa(v_t) = \begin{cases} 0, & \text{if } \|v_t\| < \gamma \\ \kappa_1 \|v_t\|, & \text{otherwise} \end{cases}$   (15)

where $\kappa_1$ is the scale parameter and $\gamma$ is the speed threshold ($\kappa_1 = 0.19$, $\gamma = 10$ in our experiments). This formulation means that when the SSF speed is lower than $\gamma$, the person is considered static and the body orientation is independent of the SSF orientation; as the SSF speed increases, the body orientation and the SSF orientation should become more similar. Finally, we obtain $p(\tilde{\theta}_t^{c} \mid \theta_{t-1}, s_t, v_t)$ for each orientation class, where $\tilde{\theta}_t^{c}$ is the central angle of the class, and the estimated orientation is then computed by parabolic interpolation [20].
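The fusion step can be summarized by the short sketch below, which evaluates (11)–(15) over the eight class centers and picks the most likely class. It substitutes a simple argmax for the parabolic interpolation of [20], and the helper names are illustrative assumptions; the speed threshold is in whatever units $v_t$ carries.

```python
# Sketch of the cue fusion of (11)-(15) over eight orientation classes.
import numpy as np
from scipy.special import i0  # zeroth-order modified Bessel function

K0, KAPPA1, GAMMA = 2.4, 0.19, 10.0
CLASS_CENTERS = np.deg2rad(np.arange(8) * 45.0)  # central angles of the eight classes

def ang_dif(a, b):
    """dif(.,.) of (13): absolute angular difference wrapped to [0, pi]."""
    d = np.abs(a - b) % (2 * np.pi)
    return np.where(d > np.pi, 2 * np.pi - d, d)

def von_mises(x, mu, kappa):
    return np.exp(kappa * np.cos(ang_dif(x, mu))) / (2 * np.pi * i0(kappa))

def fuse_orientation(theta_prev, static_probs, velocity):
    """Return the fused orientation (rad) from the previous estimate, the eight
    SVFH class probabilities, and the SSF velocity [vx, vy, vz]."""
    p_temporal = von_mises(CLASS_CENTERS, theta_prev, K0)          # (12)
    p_static = np.asarray(static_probs)                            # p(theta | s_t)
    speed = np.linalg.norm(velocity)
    kappa_v = 0.0 if speed < GAMMA else KAPPA1 * speed             # (15)
    if kappa_v == 0.0:
        p_motion = np.full(8, 1.0 / (2 * np.pi))   # static person: uninformative term
    else:
        move_dir = np.arctan2(velocity[2], velocity[0])            # ang(v_t)
        p_motion = von_mises(CLASS_CENTERS, move_dir, kappa_v)     # (14)
    scores = p_temporal * p_static * p_motion                      # (11)
    return CLASS_CENTERS[int(np.argmax(scores))]
```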


IV. Experiments

In this section, we conduct comprehensive evaluations of our method. First, an RGB-D-based human body orientation estimation benchmark dataset is introduced. Then, the baseline algorithms and evaluation metrics are described. Next, comparison experiments with two state-of-the-art methods are performed to evaluate the proposed approach. In addition, we conduct experiments to evaluate and understand the roles of the different components of the approach. Finally, some qualitative results of our method are shown at the end of this section.

A. Dataset of RGB-D-Based Human Body Orientation Estimation

In order to evaluate the proposed method, we have released an RGB-D-based dataset at http://mcg.ict.ac.cn/mcg-rgbd.html. The data is captured by RGB-D sensors (Kinect sensors). It consists of ten RGB-D video surveillance sequences captured in three different scenes (a meeting room, a corridor, and an entrance), with 4000 frames and 11 different persons. The frame resolution is 640 × 480. Some examples are shown in Fig. 1. In order to better imitate real-world scenarios, a wide diversity of poses is included in the dataset, such as standing, squatting, jumping, walking, running, rotating, waving hands, and hugging. Due to the limitations of the RGB-D sensors, all people are recorded within the range of 2.5–10 m. In addition, all scenes are captured indoors, as RGB-D sensors cannot work in strong light. The dataset is annotated with Poser [28], a 3-D computer graphics package, in a manner similar to the labeling method in [20]. Just like other human body orientation datasets in 2-D [4] or 3-D [1], we only focus on estimating the angle around the axis perpendicular to the ground plane. Each frame is annotated by two to three people and the final annotation is the average value. In total, we have 2700 person annotations and 5400 examples including reflections. The distribution of the dataset is shown in Fig. 7.

B. Experimental Protocol

We first compare our method with two state-of-the-art orientation estimation methods, the Poselets method and the VFH method. The Poselets method corresponds to the poselets-activation-vector-based method in [20], which achieves effective results in 2-D-based human orientation estimation; its implementation is provided by the authors of [20]. The VFH method corresponds to the original VFH-based method used in [31] to estimate object orientation from 3-D point cloud data; its implementation is provided by the Point Cloud Library [18]. For evaluation, we also design three other scenarios: the SVFH method, the SSF method, and the SVFH + SF method. The SVFH method is our static human orientation classification approach based on the SVFH feature. The SSF method uses only motion cues (SSF information) to estimate body orientation; its estimation results are calculated by (16). The SVFH + SF method uses the static human orientation classifier to detect a basic body orientation and the original scene flow information in [11] to refine it. Finally, the DBNS method denotes the proposed method, which

Fig. 7. Distribution of our human body orientation dataset. The red numbers are the actual numbers of annotations in each class.

TABLE I. Average Error for All Classes

combines the static human orientation classification results, the SSF information, and the temporal information.

For the methods using static cues, we use six RGB-D videos, containing six people and 3464 frames, as the training set to train the static models. The remaining four videos, containing five other people and 1954 frames in total, are used as the test set for all methods. For the methods using both static and motion cues, the orientation in the first frame is initialized only by the static cues; for the methods using only motion cues, the initial orientation is set to zero. The experiments run on a desktop computer with a Core i7 2.3 GHz CPU, 8 GB of 1600 MHz RAM, a 128 GB SSD, and 64-bit Windows 7. The overall performance of the methods is measured by the average estimation error in degrees, computed from

$\mathrm{dif}(\theta_{est}, \theta_{tru}) = \begin{cases} |\theta_{est} - \theta_{tru}|, & \text{if } |\theta_{est} - \theta_{tru}| \le \pi \\ 2\pi - |\theta_{est} - \theta_{tru}|, & \text{otherwise} \end{cases}$   (16)

where $\theta_{est}$ is the estimated orientation and $\theta_{tru}$ is the labeled orientation.
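A small sketch of this metric, under the assumption that orientations are handled in radians, is given below; the variable names are illustrative.

```python
# Sketch of the evaluation metric (16): wrapped angular error averaged in degrees.
import numpy as np

def angular_error(theta_est, theta_tru):
    """dif(theta_est, theta_tru) of (16), element-wise, in radians."""
    d = np.abs(np.asarray(theta_est) - np.asarray(theta_tru)) % (2 * np.pi)
    return np.where(d > np.pi, 2 * np.pi - d, d)

def mean_error_degrees(theta_est, theta_tru):
    """Average estimation error in degrees over all annotated frames."""
    return np.degrees(angular_error(theta_est, theta_tru).mean())

# Example: estimates of 350 deg and 90 deg against ground truth of 10 deg and 45 deg
# give errors of 20 deg and 45 deg, i.e., a mean error of 32.5 deg.
print(mean_error_degrees(np.deg2rad([350, 90]), np.deg2rad([10, 45])))
```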


Fig. 8. Comparison of DBNS method with Poselets method and VFH method.

C. Comparison With Other Methods

The DBNS method is first compared with the Poselets method. The Poselets method detects body parts and uses a distributed representation of pose to estimate the orientation. By contrast, the DBNS method estimates the orientation using the geometry and viewpoint information extracted from the RGB-D data. From Fig. 8, we see that both methods achieve good estimates when people face the camera (between −15° and 15°); the main challenge is estimating the orientation when people turn away from the camera. Compared with the Poselets method, the DBNS method clearly improves the performance, especially when the orientation is around NW, W, E, and NE. The mean estimation error of the DBNS method is two times smaller than that of the Poselets method, as shown in Table I. The results demonstrate that the geometry and viewpoint information brought by RGB-D discriminates orientation better than 2-D information.

Different from the DBNS method, the VFH method uses only depth information and achieves a smaller mean error than the Poselets method. However, Fig. 8 shows that the estimation accuracy of the VFH method is lower than that of the Poselets method for NE and N. The reason is that the 2.5-D character of the data makes it difficult to distinguish the body's front side from its back side, whereas the face detection in the Poselets method can distinguish them better. Nonetheless, this problem is resolved by our DBNS method, as the DBNS model can distinguish these situations through the moving direction and the temporal information. As a result, our DBNS method achieves more accurate estimates than the VFH method over all orientations.

The speeds of all the methods are also shown in Table I. From the results, we find that the DBNS method is ten times faster than the Poselets method, which uses 2-D information. Without geometry information, the Poselets method needs to detect the poselet activation vector to estimate the orientation, and according to our experiments this detection process is very time consuming. Furthermore, although the extraction of motion cues increases the algorithm's complexity, the DBNS method is still 1.6 times faster than the VFH method. This demonstrates that the RGB-D-based superpixels, which are sparser than the raw depth points, significantly decrease the cue extraction time.

D. Comparison of Superpixel-Based Methods With RGB-D Point-Based Methods

As analyzed before, the RGB-D-based superpixels reduce the distraction of noisy and inhomogeneous depth data.


Fig. 9. Comparison of original SF method and SSF method. (a) and (c) Results of original SF method. (b) and (d) Results of SSF method.

Fig. 10. Comparison of DBNS method with SVFH and SSF methods.

Fig. 11. Average speed of SSF in different orientation partitions.

In addition, as the superpixels are much sparser than the RGB-D points, the static and motion cue extraction methods based on them are faster than RGB-D point-based methods. In order to verify this analysis, we compare the superpixel-based methods with the RGB-D point-based methods. First, the SVFH method is compared with the VFH method. As shown in Table I, the mean error of the SVFH method is smaller than that of the VFH method, which indicates that the VFH extracted from superpixels is more robust than the VFH based on individual RGB-D points. More importantly, the SVFH method is five times faster than the VFH method. We also use superpixels to estimate the scene flow information. As the depth values are noisy and unstable, much erroneous scene flow information is generated by the original SF method, as can be seen in Fig. 9; in contrast, the results obtained by the SSF method are much better. The comparison of SVFH + SF and DBNS in Table I shows that the improvement is very significant.


Fig. 12. Result illustration of our approach on single person (best seen in color).

Fig. 13. Result illustration of our approach on different styles of body poses and appearances (best seen in color).

Fig. 14. Result illustration of our approach on multipeople (best seen in color).

E. Comparison of the DBNS Method With Static-Cue-Based and Motion-Cue-Based Methods

In order to validate that our DBNS model, which combines the static and motion cues, improves the discriminability of human body orientation, we compare the DBNS method with the static-cue-based method and the motion-cue-based method. As SVFH and SSF are the best static-cue-based and motion-cue-based methods, respectively, among the six methods, we only compare the DBNS method with the SVFH and SSF methods. The comparison results are shown in Fig. 10.

First, compared with the DBNS method, the estimation error of the SVFH method is very high for W, E, NW, NE, and N. The results demonstrate that, relying only on static cues, it is very hard to distinguish symmetric orientations such as NW versus SE, W versus E, and SW versus NE. In contrast, the motion cues are employed to distinguish these orientations: we suppose that the body orientation is nearly parallel to the moving direction, and that as the SSF speed increases, the body orientation and the SSF orientation become more similar. The experimental results shown in Figs. 10 and 11 support this hypothesis. From Fig. 11, we find that the SSF speed is high for W, E, NW, and NE; accordingly, the orientation estimation results of the SSF method shown in Fig. 10 are more accurate in these orientation partitions. Conversely, when the SSF speed is low, for N, SW, SE, and S, the estimation error of the SSF method is higher. According to our observations, the motion cues in these orientation partitions are mostly generated by jumping, squatting, hand waving, and other actions in the dataset; in these situations, the static cues should be weighted more. Different from the two methods, the DBNS method performs best over all orientations, as our dynamic Bayesian network system adaptively adjusts the fusion of static and motion cues, holistically exploiting their mutual complementarity to estimate human body orientation.

F. Qualitative Results

In order to further demonstrate that the proposed method is robust to a wide diversity of human poses and appearances, some example estimation results are shown below. Fig. 12 shows our approach working in a single-person scenario, in which a girl runs around; the body orientation is marked by arrows in ellipses. The results show that our approach can accurately estimate the body orientation over the full 360° range. Fig. 13 shows results on different styles of body poses and appearances, which testify that our approach is robust to various body poses and appearances.


Fig. 15. Some failure examples of our approach. (a) Failure examples while the people turn around. (b) Failures will be revised after people turn around and move toward one direction.

Fig. 14 shows results in a multi-person scenario, where two people hug and then separate. The estimation results remain accurate even when the people are very close, which indicates that our approach is robust to interference between multiple people. When a person turns around in place, our method may fail to estimate the orientation, as shown in Fig. 15(a). In this case, the SSF speed is not high enough to supply effective information for body orientation estimation, so the orientation is mainly obtained from static cues. As methods depending only on static cues cannot distinguish symmetric orientation partitions with nearly identical depth information, the orientation estimation may fail in this case. However, after the person finishes turning and moves in one direction, the failures are corrected by fusing more motion cues, as shown in Fig. 15(b).

V. Conclusion

Since RGB-D information is more informative than 2-D information and more conveniently acquired than traditional 3-D information, we have proposed a novel human body orientation estimation method through the innovative use of RGB-D information. In our method, static cues (SVFH) and motion cues (SSF) are extracted from RGB-D superpixels, which are robust to noisy depth data. Furthermore, a DBNS is proposed to fuse the two complementary types of cues. The intensive evaluations on our released dataset demonstrate the effectiveness and efficiency of the proposed method. Compared with existing methods, our method clearly improves the performance over the whole orientation space, especially when people turn away from the camera.

Several issues are worthy of further investigation. The first is how to simultaneously estimate the human head and body orientation from RGB-D sensors. The head orientation is very useful for gaze determination, identification, and facial expression recognition [7]; however, owing to the limited resolution of the depth information and large pose variations, it is difficult to discriminate the head orientation when people are far from the RGB-D sensor. The second is how to utilize the human orientation to improve human behavior analysis, especially pose estimation. The third is how to employ video-based 3-D object retrieval technology [40] for people search.

References [1] M. Andriluka, S. Roth, and B. Schiele, “Monocular 3-D pose estimation and tracking by detection,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., Mar. 2010, pp. 623–630. [2] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1– 27:27, 2011. [3] C. Chen, A. Heili, and J. Odobez, “A joint estimation of head and body orientation cues in surveillance video,” in Proc. IEEE Int. Conf. Comput. Vision Workshops, Nov. 2011, pp. 860–867. [4] C. Chen and J.-M. Odobez, “We are not contortionists: Coupled adaptive learning for head and body orientation estimation in surveillance video,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., Jun. 2012, pp. 1544–1551. [5] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 1. 2005, pp. 886–893. [6] M. Enzweiler and D. Gavrila, “Integrated pedestrian classification and orientation estimation,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2010, pp. 982–989. [7] G. Fanelli, T. Weise, J. Gall, and L. V. Gool, “Real time head pose estimation from consumer depth cameras,” in Proc. 33rd Int. Conf. Pattern Recognit., 2011, pp. 101–110. [8] T. Gandhi and M. Trivedi, “Image based estimation of pedestrian orientation for improving path prediction,” in Proc. IEEE Intell. Vehicles Symp., 2008, pp. 506–511. [9] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon, “Efficient regression of general-activity human poses from depth images,” in Proc. IEEE Int. Conf. Comput. Vision, 2011, pp. 415–422. [10] R. B. Girshick, P. F. Felzenszwalb, and D. A. Mcallester, “Object detection with grammar models,” in Proc. Adv. Neural Inform. Process. Syst., 2011, pp. 442–450. [11] S. Hadfield and R. Bowden, “Kinecting the dots: Particle based scene flow from depth sensors,” in Proc. IEEE Int. Conf. Comput. Vision, 2011, pp. 2290–2295. [12] S. Hadfield and R. Bowden, “Go with the flow: Hand trajectories in 3-D via clustered scene flow,” in Proc. Int. Conf. Image Anal. Recognit., 2012, pp. 285–295. [13] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon, “Kinectfusion: Real-time 3-D reconstruction and interaction using a moving depth camera,” in Proc. 24th Annu. ACM Symp. User Interface Softw. Technol., 2011, pp. 559–568. [14] J. Han, L. Shao, D. Xu, and J. Shotton, “Enhanced computer vision with microsoft kinect sensor: A review,” in Proc. IEEE Trans. Cybernetics, vol. 43, no. 5, pp. 1318–1334, Oct. 2013. [15] N. Krahnstoever, M.-C. Chang, and W. Ge, “Gaze and body pose estimation from a distance,” in Proc. 8th IEEE Int. Conf. Adv. Video Signal Based Surveillance, 2011, pp. 11–16. [16] K. Lai, L. Bo, X. Ren, and D. Fox, “Detection-based object labeling in 3-D scenes,” in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 1330–1337. [17] A. Levinshtein, A. Stere, K. N. Kutulakos, D. J. Fleet, S. J. Dickinson, and K. Siddiqi, “Turbopixels: Fast superpixels using geometric flows,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 12, pp. 2290–2297, Dec. 2009. [18] Point Cloud Library [Online]. Available: http://pointclouds.org/


[19] W. Liu, T. Xia, J. Wan, Y. Zhang, and J. Li, “RGB-D based multiattribute people search in intelligent visual surveillance,” in Proc. 18th Int. Conf. Adv. Multimedia Modeling, 2012, pp. 750–760. [20] S. Maji, L. Bourdev, and J. Malik, “Action recognition from a distributed representation of pose and appearance,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2011, pp. 3177–3184. [21] F. Meyer, “Color image segmentation,” in Proc. Int. Conf. Image Process. Applicat., 1992, pp. 303–306. [22] M. Munaro, F. Basso, and E. Menegatti, “Tracking people within groups with RGB-D data,” in Proc. Int. Conf. Intell. Robots Syst, 2012, pp. 2101–2107. [23] L. E. Navarro-Serment, C. Mertz, and M. Hebert, “Pedestrian detection and tracking using three-dimensional LADAR data,” Int. J. Robot. Res., vol. 29, no. 12, pp. 1516–1528, 2010. [24] R. A. Newcombe, A. J. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in Proc. 10th IEEE Int. Symp. Mixed Augmented Reality, 2011, pp. 127–136. [25] O. Ozturk, T. Yamasaki, and K. Aizawa, “Tracking of humans and estimation of body/head orientation from top-view single camera for visual focus of attention analysis,” in Proc. IEEE 12th Int. Conf. Comput. Vision Workshops, 2009, pp. 1020–1027. [26] B. Peng and G. Qian, “Binocular dance pose recognition and body orientation estimation via multilinear analysis,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit. Workshops, 2008, pp. 1–8. [27] P. P´erez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic tracking,” in Proc. 7th Eur. Conf. Comput. Vision Part I, 2002, pp. 661–675. [28] Poser [Online]. Available: http://poser.smithmicro.com/. [29] X. Ren and J. Malik, “Learning a classification model for segmentation,” in Proc. Ninth IEEE Int. Conf. Comput. Vision, vol. 1. 2003, pp. 10–17. [30] N. Robertson and I. Reid, “Estimating gaze direction from lowresolution faces in video,” in Proc. 9th Eur. Conf. Comput. Vision Part II, 2006, pp. 402–415. [31] R. Rusu, G. Bradski, R. Thibaux, and J. Hsu, “Fast 3-D recognition and pose using the viewpoint feature histogram,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2010, pp. 2155–2162. [32] L. Rybok, M. Voit, H. Ekenel, and R. Stiefelhagen, “Multi-view based estimation of human upper-body orientation,” in Proc. 20th Int. Conf. Pattern Recognit., 2010, pp. 1558–1561. [33] H. Shimizu and T. Poggio, “Direction estimation of pedestrian from multiple still images,” in Proc. IEEE Intell. Vehicles Symp., 2004, pp. 596–600. [34] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, “Real-time human pose recognition in parts from single depth images,” Commun. ACM, vol. 56, no. 1, pp. 116–124, 2013. [35] C. G. Snoek, M. Worring, and A. W. Smeulders, “Early versus late fusion in semantic video analysis,” in Proc. 13th Annu. ACM Int. Conf. Multimedia, 2005, pp. 399–402. [36] L. Spinello, K. O. Arras, R. Triebel, and R. Siegwart, “A layered approach to people detection in 3-D range data.” in Proc. AAAI Conf. Artificial Intell. Phys. Grounded AI Track, 2010. [37] L. Spinello and K. Arras, “People detection in RGB-D data,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2011, pp. 3838–3843. [38] M. Sun, P. Kohli, and J. Shotton, “Conditional regression forests for human pose estimation,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2012, pp. 3394–3401. [39] J. Taylor, J. Shotton, T. 
Sharp, and A. Fitzgibbon, “The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2012, pp. 103–110. [40] M. Wang, Y. Gao, K. Lu, and Y. Rui, “View-based discriminative probabilistic modeling for 3-D object retrieval and recognition,” IEEE Trans. Image Process., vol. 22, no. 4, pp. 1395–1407, Apr. 2013. [41] M. Wang, X.-S. Hua, J. Tang, and R. Hong, “Beyond distance measurement: Constructing neighborhood similarity for video annotation,” IEEE Trans. Multimedia, vol. 11, no. 3, pp. 465–476, Apr. 2009. [42] S. Wang, H. Lu, F. Yang, and M.-H. Yang, “Superpixel tracking,” in Proc. IEEE Int. Conf. Comput. Vision, Nov. 2011, pp. 1323–1330. [43] J. Yao and J.-M. Odobez, “Multi-camera 3-D person tracking with particle filter in a surveillance environment,” in Proc. 16th Eur. Signal Process. Conf., 2008.


Wu Liu received the B.E. degree from Shandong University, Shandong, China, in 2009. He is currently pursuing the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His current research interests include multimedia information retrieval and computer vision. Dr. Liu is a student member of ACM and China Computer Federation.

Yongdong Zhang (M’07) received the Ph.D. degree from Tianjin University, Tianjin, China, in 2002. He is currently a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His current research interests include the fields of multimedia content analysis and understanding, multimedia content security, video encoding, and streaming media technology.

Sheng Tang (M'07) received the Ph.D. degree in computer application technology from the Institute of Computing Technology, Chinese Academy of Sciences (ICT-CAS), Beijing, China, in 2006. He is currently an Associate Professor at the ICT-CAS. His current research interests include the fields of pattern recognition and machine learning, multimedia information processing, in particular, on indexing, retrieval, and extraction of information in images and videos. Dr. Tang is active in the international research community. He serves as the reviewer of a number of prestigious international conferences and journals. He is a member of ACM and senior member of China Computer Federation.

Jinhui Tang (M'08) received the B.E. and Ph.D. degrees from the University of Science and Technology of China, Hefei, China, in 2003 and 2008, respectively, both in the Department of Electronic Engineering and Information Science. Since July 2008, he has been a Research Fellow at the School of Computing, National University of Singapore, Singapore. His current research interests include content-based image retrieval, video content analysis, and pattern recognition. Dr. Tang is a member of the Association for Computing Machinery. He was a recipient of the 2008 President Scholarship of Chinese Academy of Science, and a co-recipient of the Best Paper Award in ACM Multimedia 2007.

Richang Hong received the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2008. He was a Research Fellow at the School of Computing, National University of Singapore, Singapore, until December 2010. He is currently a Professor in Hefei University of Technology, Hefei, China. He has co-authored more than 50 publications in the areas of his research interests, which include multimedia question answering, video content analysis, and pattern recognition. Dr. Hong is a member of the Association for Computing Machinery. He was a recipient of the Best Paper Award in the ACM Multimedia 2010.

Jintao Li (M'05) received the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 1989. He is currently a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His current research interests include multimedia technology, virtual reality technology, and pervasive computing.