Fast Pose Estimation from Silhouettes in Video Sequences
Christian Reinbacher, Matthias Rüther, Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology
Inffeldgasse 16, A-8010 Graz
{reinbacher,ruether,bischof}@icg.tugraz.at

Abstract. We address the problem of model-based pose estimation from image sequences. While most methods build on local features, we use object silhouettes only, which are weaker but considerably more robust cues. Without initialization of the pose, we are able to track rigid models through a video sequence despite varying texture, illumination and appearance. Additionally, our method handles multiple objects inherently and jointly estimates pose and object type. The method works at interactive frame rates, which makes it an ideal tool for augmented reality applications, active inspection systems and robotic manipulation tasks.


Figure 1. Pose estimation from video sequences by matching synthetically created silhouettes to video silhouettes. Object type and pose are jointly inferred.

1. Introduction

Many applications such as interactive augmentation of 3D objects, automated packaging or robotic pick & place require knowledge of the exact position and orientation (i.e. pose) of objects with respect to a defined coordinate system. In computer vision this problem is addressed as visual pose estimation, which is a challenging task since the 2D projection of an object pose may be ambiguous, and its appearance is susceptible to illumination changes, shadows or the lack of visual features. In this work, we present a method for model-based pose estimation using only silhouette information, which can be extracted reliably for a large number of objects, even if they are textureless, shaded or slightly transparent. Although a single silhouette may be a relatively weak cue for the object pose because of segmentation errors, occlusion or ambiguities, it becomes very robust when observed over a sequence of images. We show how camera-to-object motion coherence can be used in an incremental update of the pose trajectory over the whole image sequence. While other pose estimation methods, typically working on single views or in a synchronized multi-view setup, always require a good initialization, we estimate a valid pose trajectory from any starting position. Even if the object to be tracked is not correctly segmented in all frames, we implicitly identify the correct object and its respective pose after a few video frames.

The idea of our method is to pre-build a database of reference silhouette views by rendering 3D models of objects from known viewpoints. During tracking, we match segmented silhouettes to this database. Pose estimation and object identification are solved on the fly by estimating a discrete probability distribution, which makes it easy to incorporate temporal coherence and to robustly recover from failures. To speed up the process, we only match against a subset of the database, which is determined automatically at runtime. This enables our algorithm to run at interactive frame rates even for a large number of reference views.

2. Related Work

The estimation of the 6 Degrees Of Freedom (DOF) pose of an object relative to an observing camera from a single image is an active area of research. Having a model of the object allows one to infer the pose by either using feature cues such as edges [16] or texture [8], or global measures such as shape [13]. Since the shape of an object usually has less discriminative power than its appearance, purely shape-based pose estimation methods require a good initialization to reach the globally optimal pose. Most methods working on single images therefore try to overcome this problem by using additional cues in the image. Liebelt et al. [8] use textured 3D models and extract features from rendered 2D images. These appearance features are selected by discriminative filtering and then matched to image features; a geometric verification rules out improbable poses. Ulrich et al. [16] utilize the edges present on a CAD model represented by a triangulated mesh. Numerous views of the object are rendered and clustered into representative views in a bottom-up fashion. The focus of this method lies on the exact determination of the pose of one specific object. Depending on the number of reference views and the complexity of the CAD model, the typical runtime for one pose estimation lies between 0.5 and 5 seconds. Recently, pose estimation methods working on image features extracted from video sequences have been proposed [12, 10, 9].

In contrast to feature-based methods, Toshev et al. [14] proposed a purely shape-based object recognition method for video sequences. They obtain silhouettes from 3D models and cluster them to get representative views of each model. During runtime, video frames are jointly segmented and an optimal set of silhouettes from the view database is assigned to the whole sequence. Their work is conceptually similar to ours. However, in contrast to Toshev et al., our method focuses on the accuracy of the estimated pose rather than on the recognition of object classes. We avoid the clustering step by keeping compact representations of all views. Furthermore, we only utilize information from the current input video frame to iteratively update the pose estimate, which allows us to process arbitrarily long sequences at interactive frame rates, and we inherently optimize over the whole set of models in our database.

3. Background

Single-image pose estimates may be error-prone due to ambiguities in shape appearance or segmentation errors. Robustly integrating these individual measurements is essential to end up with a good pose estimate and accurate camera motions. Since we want to process images as they arrive, without using the complete history of all previously seen images, we need to keep multiple hypotheses about the camera trajectory in order to recover from wrong pose predictions. A method which inherently uses multiple hypotheses to describe a system state is Particle Filtering (PF). PF approximates an arbitrary probability density using discrete samples with associated weights (particles) [1]. Each particle represents a hypothesis with an associated probability. Particle filters are used when the probability density cannot be described in closed form, but drawing samples from it is easy. Particle filters have not only been useful in positioning and tracking tasks (see [6] for an overview) but also for recognition and pose estimation of 3D objects [7]. We now briefly review the concept of particle filtering and show its applicability to the problem of sequential pose estimation.

Given a state vector $x_t$ and a set of observations $Y_t = \{y_0, \ldots, y_t\}$, our goal is to estimate the posterior $p(x_t | Y_t)$. The state is represented by particles $x_t^i$, $i = 1, \ldots, N$; in our framework, each particle represents a system state with an associated weight $w_t^i$. PF approximates $p(x_{t+1} | Y_t)$ numerically by updating the weight of each particle, followed by a resampling step that favors particles with higher weights. Following [6], we determine the number of effective particles and perform sampling importance resampling if this number drops below a threshold, which we set to $2N/3$ as suggested in [6]. Finally, a prediction based on a motion model moves the particles to the next state. A minimal sketch of this loop is given below. In the following section we show how a PF can be used to overcome the problems of initialization, tracking and handling of multiple objects.
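To make the particle-filter loop concrete, the following Python sketch shows one measurement-resampling-prediction step under the assumptions above. The `likelihood` and `predict` callables are placeholders for the silhouette matching of Section 4.2 and the motion model described in Section 4; they are not part of the paper's (Matlab) implementation.

    import numpy as np

    def pf_step(particles, weights, observation, likelihood, predict, rng):
        # Measurement update: re-weight each hypothesis by how well its
        # predicted silhouette matches the observed one.
        weights = weights * np.array([likelihood(p, observation) for p in particles])
        weights /= weights.sum()

        # Effective sample size; resample only when it drops below 2N/3,
        # following the suggestion of Gustafsson et al. [6].
        n = len(particles)
        n_eff = 1.0 / np.sum(weights ** 2)
        if n_eff < 2.0 * n / 3.0:
            idx = rng.choice(n, size=n, p=weights)   # sampling importance resampling
            particles = [particles[i] for i in idx]
            weights = np.full(n, 1.0 / n)

        # Prediction: move each particle to the next state via the motion model
        # (a small step within a local neighborhood on the view sphere).
        particles = [predict(p) for p in particles]
        return particles, weights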

4. Pose Estimation from a Sequence of Silhouettes

The main idea of our approach is to improve on the rather weak cue of a single object silhouette by exploiting video sequences. Under the assumption of a smooth camera motion between successive frames, improbable pose estimates can be ruled out over time. Our method maintains multiple hypotheses and collapses to a single estimate once enough evidence has been collected. For our method to be applied, we assume that the object can be segmented in each individual frame and that a 3D model is available.

4.1. Object Pose Representation

To generate a representation of a 3D object by 2D silhouettes, we propose to create a set of reference views by placing virtual cameras at uniform viewpoints on the view sphere. The object is located in the center of this sphere. We use a method for regular sampling of spheres and other rotation groups proposed by Mitchell [11]. The CAD model is given by a triangulated mesh. One reference view consists of a set $S_i$ of $n_i$ outer object contour points

$$S_i = \{x_{i,1}, \ldots, x_{i,n_i}\} \quad (1)$$

and the pose parameters in spherical coordinates $(\alpha, \beta)$, thus one entry in the database is the set

$$V_i = \{S_i, \alpha_i, \beta_i\}. \quad (2)$$
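As an illustration of the offline database construction, a minimal Python sketch is given below (the paper's implementation is in Matlab). The `render_silhouette` helper is hypothetical; it stands for an OpenGL rendering of the mesh from the virtual camera at $(\alpha, \beta)$, returning a binary mask.

    import numpy as np
    import cv2

    def build_view_database(model, viewpoints):
        """viewpoints: list of (alpha, beta) from a uniform sampling of the sphere."""
        database = []
        for alpha, beta in viewpoints:
            mask = render_silhouette(model, alpha, beta)   # hypothetical renderer
            # OpenCV >= 4 signature: extract the outer object contour
            contours, _ = cv2.findContours(mask.astype(np.uint8),
                                           cv2.RETR_EXTERNAL,
                                           cv2.CHAIN_APPROX_NONE)
            outer = max(contours, key=cv2.contourArea).squeeze(1)  # S_i: contour points
            database.append({'S': outer, 'alpha': alpha, 'beta': beta})  # V_i = {S_i, a_i, b_i}
        return database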

Using such a representation of 3D objects, the 6 DOF of the object pose for a single image (3 DOF in rotation and 3 DOF in translation) are reduced to 4 DOF given by camera roll, scale and position on the viewing sphere. Each particle represents an object pose consisting of one entry of the database (2) and a camera roll $\theta_i$. Additionally, each particle carries its complete history of poses for every frame $s = 1, \ldots, t-1$. For the prediction step of the PF, each particle has to be moved according to a motion model. Since we do not restrict the motion of camera or object, but only require it to be small, we allow each particle to move within a local neighborhood on the view sphere.

Figure 2. Example of generated views.

4.2. Silhouette Descriptor and Distance Metric

To match an object silhouette and update the weight $w_t^i$ of each particle, the shape matching not only has to be rotation invariant, but also has to recover the rotation resulting from a camera roll around the optical axis. A standard approach is to transform the contour points $c = c_1, \ldots, c_L$ into polar coordinates $p = p_1, \ldots, p_M$, with the origin given by the center of gravity of the silhouette, and to normalize the contour to a fixed length $M$. The normalization makes the matching scale invariant. Such representations can easily be matched by a normalized cross-correlation of the video silhouette $p_v$ with a database silhouette $p_d$ in a sliding-window fashion:

$$e(n) = \frac{1}{M-1} \sum_{x} \frac{\left(p_v^r(x) - \overline{p_v^r}\right)\left(p_d^r(x+n) - \overline{p_d^r}\right)}{\sigma_{p_v^r}\,\sigma_{p_d^r}} \quad (3)$$

with $n = 0, \ldots, M-1$, where $p^r$ denotes the radial component of the polar coordinates. The matching score is then given by $\max_n e(n)$ and the optimal rotation by $\operatorname{argmax}_n e(n)$.

However, this representation poses some problems if the silhouette is non-convex. When transforming such a contour $c$ into its polar representation, multiple radii are present for some angles. Normalizing the contour to a common length requires the angles to be sorted, which leads to artifacts in concave regions (see Figs. 3(a) and (b)). Fig. 3(e) shows a difference plot of Figs. 3(a) and (b), where (b) is aligned to (a) using (3). The two large peaks stem from the concavities of the shape. In order to circumvent this problem, we use only those points of the shape outline which form a star-convex polygon with respect to the center of gravity. A subset $X$ of $\mathbb{R}^n$ is star convex if there exists an $x_0 \in X$ such that the line segment from $x_0$ to any point in $X$ is contained in $X$. Figs. 3(c) and (d) show the subset of selected points in green. When matching those two contours, the resulting difference plot shows no significant outliers (see Fig. 3(f)).
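The following Python sketch illustrates the descriptor and the matching of Eq. (3). The star-convex selection is approximated here by keeping, for each angular bin, only the innermost radius; the paper does not specify its exact implementation, so this detail is an assumption.

    import numpy as np

    def polar_descriptor(contour, M=1000):
        c = contour - contour.mean(axis=0)                 # center of gravity as origin
        theta = np.arctan2(c[:, 1], c[:, 0])
        r = np.hypot(c[:, 0], c[:, 1])
        bins = ((theta + np.pi) / (2 * np.pi) * M).astype(int) % M
        desc = np.full(M, np.nan)
        for b, radius in zip(bins, r):                     # star-convex approximation:
            if np.isnan(desc[b]) or radius < desc[b]:      # keep the innermost radius
                desc[b] = radius
        valid = ~np.isnan(desc)                            # fill empty angular bins
        desc[~valid] = np.interp(np.flatnonzero(~valid), np.flatnonzero(valid), desc[valid])
        return desc / desc.max()                           # scale invariance

    def match(p_v, p_d):
        """Normalized cross-correlation over all cyclic shifts n, Eq. (3)."""
        M = len(p_v)
        a = (p_v - p_v.mean()) / p_v.std()
        b = (p_d - p_d.mean()) / p_d.std()
        e = np.array([np.dot(a, np.roll(b, -n)) for n in range(M)]) / (M - 1)
        best_n = int(np.argmax(e))                         # optimal rotation (camera roll)
        return e[best_n], best_n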

Figure 3. The shapes in (b) and (d) are rotated versions of (a) and (c), respectively. The circle depicts the center of gravity. The polar representation of the full shape outline is used for (a) and (b); the polar representation using only the star-convex parts of the shape (selected points in green) is used for (c) and (d). (e) shows the difference plot (radial difference δr over θ [rad]) of the two rotated shapes when using the polar representation of the full outline; (f) shows the difference between the curves generated with our method.

4.3. Handling of Multiple Objects

The formulation of the pose estimation problem as a discrete probability distribution allows us to easily incorporate multiple 3D objects. Particles are initialized randomly over the viewing spheres of all input models. No special attention has to be paid in the resampling step, since particles which received a low weight in the measurement update over several frames are more likely to be removed by the resampling procedure. New objects can be added in a plug-and-play fashion since there are no dependencies among them.

In practice, and to speed up the pose estimation, we adaptively vary the number of particles in the resampling step according to Fox [3]. The idea of this approach is to determine the number of particles in each iteration such that the error between the true posterior and the particle-filter approximation is less than $\epsilon$ with probability $1 - \delta$. To apply the method, we divide the state space (the position on the viewing sphere) into $K$ bins, which we determine by a K-means clustering on the coordinates of the predefined viewpoints. With that, the optimal number of particles $N$ can be calculated as

$$N = \frac{k-1}{2\epsilon}\left\{1 - \frac{2}{9(k-1)} + \sqrt{\frac{2}{9(k-1)}}\; z_{1-\delta}\right\}^3, \quad (4)$$

where $k$ denotes the number of bins with support (at least one particle falls into that bin) and $z_{1-\delta}$ is the upper $1-\delta$ quantile of the standard normal distribution. The optimal number of particles is determined for each object individually.
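A small Python sketch of Eq. (4), assuming SciPy for the normal quantile; the paper does not state the value of δ, so the default below is an assumption.

    from scipy.stats import norm

    def kld_sample_size(k, epsilon=0.01, delta=0.05):
        """Particle count so that the error between the true posterior and the
        particle approximation stays below epsilon with probability 1-delta.
        k is the number of bins with support (at least one particle)."""
        if k < 2:
            return 1
        z = norm.ppf(1.0 - delta)            # upper (1-delta) quantile of N(0,1)
        a = 2.0 / (9.0 * (k - 1))
        n = (k - 1) / (2.0 * epsilon) * (1.0 - a + (a ** 0.5) * z) ** 3
        return int(n) + 1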

5. Experiments

For all experiments we use a database of 20 3D models obtained from different sources, some of which are depicted in Fig. 7.

5.1. Synthetic Sequences

Using a 3D model, we create synthetic input data by simulating a flight around the object using OpenGL. The known camera parameters allow us to measure the accuracy of our pose estimation method. We define the pose error as the Euclidean distance between the ground-truth and the estimated camera position. During all experiments we fixed the number of viewpoints per object to 1000, the silhouettes are normalized to a length of 1000, K = 10 and ε = 0.01. All experiments were conducted on a PC with a 2.66 GHz Intel Core i7 processor. We implemented our method in Matlab 2011a.

Number of Particles. In this experiment we investigate the influence of the number of particles on the pose estimation result. We use five 3D models from our database and render them from the same camera trajectory using OpenGL. The camera is rotated around the object with a step width of 1° at a camera-object distance of 1. We vary the number of particles between 10 and 100 per object in the database. Each run is repeated five times to account for the effect of the random initialization of the particle filter. The adaptive resampling step, which dynamically changes the number of particles, is deactivated for this experiment. As a comparison, we match each input image individually against the whole set of silhouettes in the database using the similarity metric defined in Section 4.2.
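For reference, a small Python sketch of the pose-error measure used throughout the experiments; the particular spherical-coordinate convention is an assumption, since the paper only states that viewpoints are stored as $(\alpha, \beta)$ at unit camera-object distance.

    import numpy as np

    def camera_position(alpha, beta, distance=1.0):
        """Cartesian camera center for view-sphere coordinates (alpha, beta).
        The exact angle convention is an assumption made for this sketch."""
        return distance * np.array([np.sin(beta) * np.cos(alpha),
                                    np.sin(beta) * np.sin(alpha),
                                    np.cos(beta)])

    def pose_error(gt_view, est_view):
        # Euclidean distance between ground-truth and estimated camera centers
        return np.linalg.norm(camera_position(*gt_view) - camera_position(*est_view))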

Figure 4. Influence of the number of particles on the pose estimation result: (a) pose error and (b) time per frame [s] over the number of particles. The blue curves (crosses if printed without color) depict the results for a varying number of particles; the red lines show the result when matching the input frames independently against the database.

Figure 4 shows the results of this experiment. Matching against all elements of the database without temporal coherence results in a pose error of 0.149 ± 0.342 and 96.3% correctly classified input images. The large standard deviation is caused by ambiguities in the object silhouettes. As an example, one reconstructed camera trajectory is depicted in Fig. 5. The trajectory is shown as a line connecting the best matching camera for each input frame; the dots represent the camera poses contained in the database. In Fig. 5(a) the trajectory occasionally jumps to the opposite side of the view sphere due to symmetries in the object geometry. Our method successfully overcomes those ambiguities by incorporating the temporal information of previous camera positions. As can be seen from Fig. 4(a), the mean error drops below that of the brute-force match already for 20 particles per object. Beyond 60 particles, the results do not improve significantly. Since the pose estimate is given by the best matching silhouette's position on the view sphere, the precision of our method is limited by the distance between the viewpoints on the viewing sphere. This is a common limitation of view-based pose estimation methods, which can be overcome by a subsequent refinement of the pose using an iterative method such as Iterative Closest Point (ICP). On the computer system used, the mean time for 60 particles and 20 objects in the database is 0.24 s per input image, and 99.8% of all input images are classified correctly. The match against the whole database would take 3.93 s per image on the same system. Figure 6 shows how the temporal coherence helps to resolve ambiguities over time.

Figure 5. Comparison of a recovered camera trajectory using (a) single-image detections and (b) optimization using temporal coherence.

Figure 6. Temporal coherence allows ambiguities to be eliminated over time (pose error over frame number).

Up to frame 14 of the sequence, the camera pose is wrongly estimated due to ambiguities in the geometry of the toy figure. Additional information from the movement of the object helps to resolve the ambiguity. In successive frames, the pose error stays small, even when the object is moved back to the original, wrongly estimated pose.

5.2. Real-World Sequences

Using the parameters determined in the previous experiments, we apply our method to real-world videos of several different objects. Models for these objects were obtained either from the internet or with a structured-light based 3D scanner. For all experiments we use a FireWire camera with a resolution of 640 × 480, calibrated intrinsically using the method of Zhang [17]. Ground-truth poses are generated by detecting a planar marker with known dimensions.

Our method relies on reliably extracted silhouettes, which can easily be obtained in a controlled scenario, e.g. in industrial applications. However, for more general video sequences where light and background cannot be controlled, more sophisticated segmentation approaches for objects in video sequences are needed. Some recently proposed methods for non-rigid object tracking eliminate the rectangular bounding box around the tracked object and operate on a pixel level, delivering a segmentation of the object as a by-product [15, 5, 14]. They have in common that they process the tracking sequence offline, which is not applicable in our scenario. Bibby and Reid [2] overcome this problem by online learning of the object appearance within a probabilistic framework. Another very recent approach for object tracking and segmentation has been proposed by Godec et al. [4], which is based on the generalized Hough transform and uses back-projection to initialize a segmentation of the object. We decided to use this method for our experiments since it works online, delivers good segmentations, and the authors provide freely available code (http://lrs.icg.tugraz.at/research/houghtrack/).

Some input images with the segmentation and estimated pose overlaid are depicted in Fig. 7. For better visualization, the 3D model is rendered with an offset from the estimated camera pose. Table 1 lists the results for the used sequences. Our method does not rely on an initialization and therefore has to estimate both the object type and its orientation. The second column of Tbl. 1 shows the number of frames in which the correct object has been determined. The pose error is given as the mean Euclidean distance between ground-truth and estimated camera positions (for a camera-object distance of 1). The estimated pose and object type for each frame are determined by the particle with the highest weight.

sequence    # frames   # correct   pose error
nelson         160        116         0.16
nelson2         88         70         0.068
cow            135         77         0.23
elephant       102         77         0.19
block          101        101         0.27

Table 1. Pose estimation results for the real-world sequences (see Fig. 7).

Since each particle carries its complete history of poses, it would also be possible to assign one particle to a whole video sequence. Especially for geometrically symmetric objects like the block (second row of Fig. 7), ambiguities cannot always be resolved by looking only at the currently best-performing particle; in this case, two particles on opposite sides of the viewing sphere compete against each other. As can be seen from the results in Fig. 7, severe segmentation failures over longer periods of time cause wrong object detections and pose estimates. However, our method usually recovers within a few frames. For the object used in the sequence shown in the last row of Fig. 7, no exact 3D model was available; nevertheless our method estimates reasonable poses. For more results, we refer the reader to our supplementary video (http://www.youtube.com/user/rvlab), where we provide a visualization of the pose estimation results for complete video sequences.

6. Conclusion

In this work we presented a purely shape-based pose estimation method working on image sequences. Our method simultaneously infers the object type and its pose relative to a calibrated camera from a set of corresponding 3D models. We do not rely on an initialization and cover the possible pose space of the object by sampling the view sphere in an offline step. Our method avoids matching video silhouettes against the whole set of reference views by utilizing a particle filter to rule out improbable poses after a few frames. New models can easily be added to the reference database since all objects are treated independently. In experiments on synthetically created and real-world video sequences we have shown the practical applicability of our method. We successfully applied our algorithm to objects of varying complexity, from simple block-like structures to organic forms.

Figure 7. Visualization of the pose estimation results on real-world sequences. The numbers in the figure denote the frame numbers in the respective input sequence. The segmentation is overlaid in red; the 3D model is rendered from the estimated camera position and depicted in green (best viewed in color).

A possible application of our method is robotic bin picking, where a robotic arm with a head-mounted camera approaches a bin full of different, known objects and has to estimate the object type and pose on the fly to successfully grasp an object. The accuracy of our method is currently limited by the segmentation quality and the number of virtual camera views used to create the database of reference views. The latter can be overcome by a subsequent refinement of the pose estimate using a local optimization method such as ICP. Furthermore, it might be possible to derive the number of reference views from the model complexity. Future work will include the explicit handling of occlusion and the detection of segmentation failures; both problems are currently handled implicitly by the restriction of the camera motion between successive frames. Currently, particles are only allowed to move on the view sphere of one specific object. A possible direction for future research is to allow particles to move to other objects by finding and clustering similar views across different objects.

Acknowledgements

We would like to thank the reviewers for their helpful and positive comments. We are grateful to the authors of [4] for providing code and assistance. This work was supported by the Austrian Research Promotion Agency (FFG) under the project SILHOUETTE (825843).

References

[1] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50:174–188, 2002.
[2] C. Bibby and I. D. Reid. Robust real-time visual tracking using pixel-wise posteriors. In European Conference on Computer Vision, pages 831–844, 2008.
[3] D. Fox. Adapting the sample size in particle filters through KLD-sampling. The International Journal of Robotics Research, 22(12):985–1003, December 2003.
[4] M. Godec, P. Roth, and H. Bischof. Hough-based on-line tracking of non-rigid objects. In International Conference on Computer Vision, 2011.
[5] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In International Conference on Computer Vision, 2010.
[6] F. Gustafsson, F. Gunnarsson, N. Bergman, U. Forssell, J. Jansson, R. Karlsson, and P.-J. Nordlund. Particle filters for positioning, navigation, and tracking. IEEE Transactions on Signal Processing, 50:425–437, 2002.
[7] S. Lee, S. Lee, J. Lee, D. Moon, E. Kim, and J. Seo. Robust recognition and pose estimation of 3D objects based on evidence fusion in a sequence of images. In International Conference on Robotics and Automation, pages 3773–3779, 2007.
[8] J. Liebelt, C. Schmid, and K. Schertler. Viewpoint-independent object class detection using 3D feature maps. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[9] L. Mei, J. Liu, A. Hero, and S. Savarese. Robust object pose estimation via statistical manifold modeling. In International Conference on Computer Vision, 2011.
[10] L. Mei, M. Sun, K. M. Carter, A. O. Hero, and S. Savarese. Unsupervised object pose classification from short video sequences. In British Machine Vision Conference, 2009.
[11] J. C. Mitchell. Discrete uniform sampling of rotation groups using orthogonal images. SIAM Journal of Scientific Computing, 30:525–547, 2007.
[12] H. Najafi, Y. Genc, and N. Navab. Fusion of 3D and appearance models for fast object detection and pose estimation. In Asian Conference on Computer Vision, pages 415–426, 2006.
[13] B. Rosenhahn, T. Brox, D. Cremers, and H.-P. Seidel. A comparison of shape matching methods for contour based pose estimation. In Combinatorial Image Analysis, LNCS 4040, pages 263–276, 2006.
[14] A. Toshev, A. Makadia, and K. Daniilidis. Shape-based object recognition in videos using 3D synthetic object models. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[15] D. Tsai, M. Flagg, and J. M. Rehg. Motion coherent tracking with multi-label MRF optimization. In British Machine Vision Conference, 2010.
[16] M. Ulrich, C. Wiedemann, and C. Steger. CAD-based recognition of 3D objects in monocular images. In International Conference on Robotics and Automation, pages 1191–1198, 2009.
[17] Z. Zhang. Flexible camera calibration by viewing a plane from unknown orientations. In International Conference on Computer Vision, pages 666–673, 1999.