A SMOOTH 6DOF MOTION PRIOR FOR EFFICIENT 3D SURFACE TRACKING

Thomas Popham, Roland Wilson, Abhir Bhalerao
Department of Computer Science, Warwick University, CV4 7AL

ABSTRACT

This paper proposes an efficient method for tracking a 3D surface model, which utilises an accurate neighbourhood motion prior to regularize the solution. Typical 3D motion trackers estimate only the local translations at each point on the surface model, which means that enforcing smooth motion between neighbouring surface points is difficult when the surface undergoes rigid body motion. This paper uses a patch-based representation of the scene surface so that both translations and rotations can be estimated on the surface, leading to smooth neighbouring scene flows under local rigid body motions. Since the translation and rotation motions are estimated at each patch using a variational approach, the proposed tracker is efficient, with relatively few reprojections required at each frame. The proposed method is demonstrated on a real-world multi-camera sequence, and the scene flow is accurately estimated over ninety frames.

1. INTRODUCTION

Estimating the motion of a scene is a longstanding computer vision problem, and most research has focused upon the 2D problem. Estimating the long-term 3D motion of a general scene remains challenging, since the scene surfaces may contain little texture yet move in a complex way. Defining a suitable prior for the motion of the scene is therefore important. If the task is limited to tracking humans, then a skeleton model can be used [1]; however, this often requires an initial human-operator fitting. A common method of enforcing smooth scene motion is Laplacian mesh deformation [2], which ensures that neighbouring mesh nodes have similar motions.
However, a problem with this approach is that the smoothness assumption becomes weak when the surface rotates, since a rotation induces a different translation at each point of the same object. This paper presents a surface tracking method which overcomes this problem by locally estimating the rotation as well as the translation at the surface, using a set of planar patches. A further advantage of the proposed method is that it combines the estimated motions from the patches in a probabilistic way. This is necessary because some surface patches contain very little texture and therefore provide few constraints on the scene motion, whilst others contain more texture and provide stronger constraints upon the patch motion.
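To make the rotation problem concrete, the following toy sketch (illustrative only, not the paper's code) shows that two neighbouring points on a rigidly rotating surface have different per-point translations, so a translation-only smoothness prior penalises even perfectly rigid motion, while a 6DOF patch description remains identical at both points:

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# Two neighbouring points on the same rigid surface, rotated about the origin.
p1 = np.array([1.0, 0.0, 0.0])
p2 = np.array([1.1, 0.0, 0.0])
R = rot_z(np.deg2rad(10.0))

t1 = R @ p1 - p1  # per-point translation ("scene flow") at p1
t2 = R @ p2 - p2  # per-point translation at p2

# The two translations differ even though the motion is rigid, so a prior
# that penalises differences between neighbouring translations fights the
# rotation. A 6DOF patch state (translation + rotation) is the same for
# both points: the same rotation angle about the same axis.
print(np.linalg.norm(t1 - t2))  # non-zero despite rigid motion
```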

Figure 1. Trajectories of patch centres between frames 20 and 70.

2. RELATED WORK

One way to estimate the 3D motion of a scene is to combine 2D motion estimates from several cameras [3, 4]. Since existing state-of-the-art optical-flow techniques can be applied to the problem, scene flow estimates can be readily obtained, and the motion regularization can be performed efficiently in each camera image. However, this approach suffers from the disadvantage that the motion regularization is performed with no depth information.

Another approach is feature matching, which can be carried out either on the input images [2] or on the estimated surface [5, 6]. The main problem with this approach is that if the scene contains surfaces with no texture, then no surface features will be detected and estimating correspondences for these regions will be difficult. A second problem is that the geometry of the scene must be estimated at every frame to be matched, which is computationally expensive.

Several methods estimate the motion of the scene at discrete points on the scene surface. Pons et al. use a variational approach to estimate the motion at each node of a mesh, which is deformed using the Laplace operator as the regularizing energy [7]. Carceroni and Kutulakos estimate the complex motion of a set of surface elements (surfels) over a single frame [8], and Mullins et al. use a particle filter to estimate the motion of a set of patches with a hierarchical Gaussian mixture model as a motion prior [9]. Finally, Cobzas et al. [10] use a single camera to track a set of rigid surface patches, using a variational approach.

All of these approaches either only estimate local translations, or fail to implement an effective motion smoothness prior, or are inherently expensive to evaluate, leading to poor tracking performance over many frames. The approach presented in this paper does not suffer from any of these problems: rigid-body motion is estimated at each point of the surface model; an effective smoothness prior is implemented; and a variational approach is adopted, making this a highly efficient approach.

3. PLANAR PATCH AND MOTION MODEL

In order to smoothly track the non-rigid motion of the scene's surfaces, a piecewise description of the scene surfaces is used: a set of planar patches. Each planar patch is described by six components: three parameters u, v, w describing the patch centre and three parameters θ_u, θ_v, θ_w describing the patch orientation. The patches are fitted to the scene surfaces using the local surface reconstruction algorithm described in [11]. The vector x_t represents the complete description of a patch at time t:

    x_t = (u, v, w, θ_u, θ_v, θ_w)ᵀ.

Each patch is assigned a texture model I(p), which is a function of the pixel co-ordinates on the patch surface. An image J_c from camera c may be projected onto the patch using the projection function H_c(x, p), which maps a point p on the patch surface into the image plane of camera c, according to the state x of the patch. The projected image from camera c onto the patch is therefore J_c(H_c(x, p)). The problem is then to estimate the motion of each patch from time t−1 to time t. The state of the patch at time t−1 is x_{t−1}, and the motion of the patch between t−1 and t is:

    d_t = Δx_t = x_t − x_{t−1}.    (1)

4. GLOBAL ENERGY MODEL

Since the 3D tracking problem is under-defined, the range of possible patch motions must be limited through the use of a prior which enforces local motion smoothness. The total energy to be minimised therefore consists of a data term and a smoothness term:

    E = E_data + λ E_smooth,    (2)

where λ controls the strength of the smoothness prior.

When estimating the motion of a patch across a single frame, one would expect the patch texture I to remain consistent with the set of input images J. A suitable cost function for the motion d_t is the sum of squared differences between the patch texture and the input images projected onto the patch:

    E_data = Σ_{c∈C} Σ_{p∈W} (I(p) − J_c(H_c(x_{t−1} + d_t, p)))²,    (3)

where C is the set of cameras and W is the set of pixels on the patch.

A suitable smoothness energy term may be defined by penalizing the difference between neighbouring motions. A classic example of such an approach is the work of Horn and Schunck [12], who used the following smoothness energy term for 2D optical flow:

    E_smooth = (∂u/∂x)² + (∂u/∂y)² + (∂v/∂x)² + (∂v/∂y)²,    (4)

where u is the flow in the x direction and v is the flow in the y direction. Generalizing this smoothness term to the 3D patch tracking problem yields:

    E_smooth = (∂u/∂x)² + (∂u/∂y)² + (∂u/∂z)² + ... + (∂θ_w/∂z)².    (5)
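A minimal numerical sketch of the total energy in equations (2)-(5) follows. The projection J_c(H_c(x, p)) is mocked by a stand-in `project` function, and the spatial derivatives in equation (5) are discretized as differences between the motions of neighbouring patches; both are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, n_pixels = 4, 16

# 6DOF motion per patch: d = (u, v, w, theta_u, theta_v, theta_w).
d = rng.normal(scale=0.01, size=(n_patches, 6))
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Patch texture models I(p) and a mocked projection standing in for
# J_c(H_c(x_{t-1} + d_t, p)); a real tracker would sample camera images.
textures = rng.normal(size=(n_patches, n_pixels))

def project(i, d_i):
    return textures[i] + 0.1 * np.linalg.norm(d_i)  # toy appearance change

def e_data(d):
    """Equation (3): SSD between patch texture and projected images."""
    return sum(np.sum((textures[i] - project(i, d[i])) ** 2)
               for i in range(n_patches))

def e_smooth(d):
    """Discrete analogue of equation (5): squared differences of all six
    motion components between neighbouring patches."""
    return 0.5 * sum(np.sum((d[i] - d[j]) ** 2)          # 0.5: each pair
                     for i, nbrs in neighbours.items()   # appears twice in
                     for j in nbrs)                      # the symmetric dict

lam = 0.5  # smoothness weight, the λ of equation (2)
E = e_data(d) + lam * e_smooth(d)
```

Note that when all patches share the same 6DOF motion the smoothness term vanishes, so the prior penalises only relative motion between neighbours.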

5. ENERGY MINIMIZATION

In order to minimise the cost function in equation (2), a first-order Taylor-series expansion of the projected image J is taken, allowing the patch motion to be predicted using a set of motion templates. The first-order Taylor expansion of the projection function yields:

    E_data = Σ_{c∈C} Σ_{p∈W} (I(p) − J_c(H_c(x_{t−1}, p)) − (∂J_c/∂x)|_{x_{t−1}} d_t)².    (6)

Differentiating the total energy E = E_data + λ E_smooth with respect to the displacement then gives:

    ∂E/∂d = Σ_{c∈C} (∇J_c)ᵀ(∇J_c d_t − (I − J_c)) − λ (∇²u, ∇²v, ∇²w, ∇²θ_u, ∇²θ_v, ∇²θ_w)ᵀ,    (7)

where ∇J_c is the simplified notation for the Jacobian matrix ∂J_c/∂d and ∇² is the Laplacian operator. Setting ∂E/∂d = 0 and dropping the time subscripts,

    (∇J)ᵀ(∇J) d = (∇J)ᵀ(I − J) + λ ∇²d.    (8)

Approximating the Laplacian ∇²u by the difference between the neighbouring mean ū and u allows the motion to be written in terms of the image gradients and the neighbouring motions:

    d = ((∇J)ᵀ∇J + λI)⁻¹ ((∇J)ᵀ(I − J) + λ d̄).    (9)

This set of equations (six for each patch) can be solved using the Gauss-Seidel algorithm, which updates d at iteration k+1 using the neighbouring information and image terms evaluated at the previous iteration k:

    d^{k+1} = ((∇J)ᵀ∇J + λI)⁻¹ ((∇J)ᵀ(I − J) + λ d̄^k).    (10)

The model in equation (9) assumes that each patch is tracked with equal error; in reality the errors will differ at each patch. The vectors d and d̄ are therefore treated as the means of Gaussian distributions with covariances S and S̄, and equation (10) is modified to give d^{k+1} as the mean of the product of the measurement and neighbouring distributions:

    S^{k+1} = (M⁻¹ + (S̄^k)⁻¹)⁻¹,    (11)

    d^{k+1} = S^{k+1} (M⁻¹ E^k + (S̄^k)⁻¹ d̄^k),    (12)

where E^k is the estimate ((∇J)ᵀ∇J)⁻¹(∇J)ᵀ(I − J), M is the measurement covariance, S^{k+1} is the covariance of the patch at iteration k+1, and S̄^k is the covariance of the mean motion used in approximating the Laplacian.

When estimating the mean, it is important to account for the fact that measurements further away from patch i are less reliable than those closer to patch i. The covariances S_j are therefore spread according to the distance from patch i:

    (S̄_i^k)⁻¹ = Σ_{j∈N(i)} (f(S_j^k, d_ij))⁻¹,    (13)

    d̄_i^k = S̄_i^k Σ_{j∈N(i)} (f(S_j^k, d_ij))⁻¹ d′_j^k,    (14)

where d′_j^k is the estimated motion at patch i given that it moves rigidly with respect to the motion of patch j, and S′ = f(S, d_ij) is a function which spreads the covariance S according to the distance d_ij between patches i and j. It is at this stage that the estimation of the local rotation at each patch becomes advantageous, since the effect of the rotation at j can be included in the predicted translation component of d′_j^k. Through experimentation, it has been found that adding noise according to the square of the distance provides a sufficient model:

    S′ = S + α d_ij² I,    (15)

where I is the identity matrix and α is the coefficient controlling how much the covariance S is spread.

A remaining question is how the measurement noise M should be estimated. Intuitively, one would expect the pixel gradients in (∇J)ᵀ∇J to be inversely proportional to the measurement noise, since a small gradient is more likely to originate from noise than from a true image gradient, i.e.

    M ∝ ((∇J)ᵀ∇J)⁻¹.    (16)

Figure 2 shows a set of example 2D patches with ellipses representing the estimated measurement covariances in the image x and y directions. For patches located along an edge, the measurement uncertainty lies along the direction of the edge. Only two patches (top-left and top-right) have sufficient texture information to constrain the patch motion in both directions, whilst the patches on the bottom-left and bottom-right contain mainly background and therefore have large measurement noise covariances. The advantage of the approach presented here is clear to see: the patches with more texture information should be used to constrain the motion of patches with less texture information.

Figure 2. The yellow ellipses show the estimated measurement noise covariances from equation (16).

6. EXPERIMENTAL RESULTS

The results of the proposed method are demonstrated on the 'Katy' sequence, which is 90 frames long and was captured using 32 cameras at 30 frames per second. Figures 3(a-d) show the view from one camera at 30-frame intervals. Note that this sequence contains surfaces with little texture and changes in lighting conditions. Figure 3(e) shows the 1400 patches fitted to the scene surfaces at frame 0, and figures 3(f-h) show the tracked patches at frames {30, 60, 90}. Figure 1 shows the trajectories of the patch centres between frames 20 and 70.

The resulting tracked motion is smooth and coherent over the 90 frames. Note that even on the arms, where there is little texture, the motion of the surface is accurately tracked over the sequence. This shows the advantage of a regularized texture model approach over a feature-based one [6, 13], for which it would be impossible to estimate correspondences in non-textured regions. Table 1 shows the typical number of correspondences that have been reported by authors for simple human actor sequences. The presented approach is clearly able to estimate more correspondences over a longer time period than other feature-based approaches.
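As a concrete illustration of the covariance-weighted update of Section 5 (equations (11)-(15)), the sketch below fuses a per-patch measurement with a distance-weighted neighbourhood prediction. All numerical values are illustrative, and the spreading function `spread` follows the quadratic model of equation (15); this is a reading of the equations, not the authors' code:

```python
import numpy as np

def spread(S, dist, alpha=1.0):
    """Equation (15): S' = S + alpha * d_ij^2 * I."""
    return S + alpha * dist**2 * np.eye(S.shape[0])

def neighbourhood_mean(motions, covs, dists, alpha=1.0):
    """Equations (13)-(14): information-weighted mean of the motions
    predicted from the neighbouring patches; nearer patches get more
    weight because their spread covariances are smaller."""
    dim = motions[0].shape[0]
    info = np.zeros((dim, dim))
    vec = np.zeros(dim)
    for d_j, S_j, dist in zip(motions, covs, dists):
        w = np.linalg.inv(spread(S_j, dist, alpha))
        info += w
        vec += w @ d_j
    S_bar = np.linalg.inv(info)
    return S_bar @ vec, S_bar

def fuse(E_k, M, d_bar, S_bar):
    """Equations (11)-(12): product of the measurement distribution
    N(E_k, M) and the neighbourhood distribution N(d_bar, S_bar)."""
    S_new = np.linalg.inv(np.linalg.inv(M) + np.linalg.inv(S_bar))
    d_new = S_new @ (np.linalg.inv(M) @ E_k + np.linalg.inv(S_bar) @ d_bar)
    return d_new, S_new

# Toy 2D example: a confident measurement dominates the fused update.
E_k = np.array([1.0, 0.0])
M = 1e-4 * np.eye(2)  # low measurement noise (well-textured patch)
d_bar, S_bar = neighbourhood_mean(
    motions=[np.array([0.0, 1.0]), np.array([0.2, 0.8])],
    covs=[np.eye(2), np.eye(2)],
    dists=[1.0, 2.0])  # the nearer neighbour is weighted more
d_new, S_new = fuse(E_k, M, d_bar, S_bar)
```

With low measurement noise the fused motion stays close to E_k; when the measurement covariance is large, the update is pulled towards the neighbourhood mean, which is how poorly-textured patches inherit motion from their textured neighbours.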

7. CONCLUSIONS

A patch-based tracking system has been proposed for estimating the 3D motion captured by a multi-camera system, which makes use of a novel and accurate neighbourhood prior that takes account of measurement uncertainties. The patch-based representation allows both translations and rotations to be estimated at each point on the surface, increasing the effectiveness of the local rigid motion prior. Planned further work includes performing comparisons on other datasets, and fitting a mesh to the final patches to enable accurate video-based renderings.

Figure 3. (a)-(d) Input images at frames {0, 30, 60, 90}; (e) patches fitted at frame 0; (f)-(h) tracked patches at frames {30, 60, 90}.

Author                 Method                 Sequence        Number of frames   Reported correspondences
Zaharescu et al. [6]   MeshHog                Dance-1         20                 13
Zaharescu et al. [6]   SIFT                   Dance-1         20                 0
Doshi et al. [13]      SIFT                   Roxanne Twirl   4                  < 600 per camera
Doshi et al. [13]      SIFT                   JP Free Dance   10                 < 65 per camera
Aguiar et al. [2]      SIFT and Optical Flow  Various         150-400            20-50
This Paper             Patch Tracking         Katy            90                 > 1000

Table 1. Comparison between different multi-camera 3D correspondence techniques.

8. REFERENCES

[1] T.B. Moeslund, A. Hilton, and V. Krüger, "A survey of advances in vision-based human motion capture and analysis," Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 90-126, 2006.

[2] E. de Aguiar, C. Theobalt, C. Stoll, and H.-P. Seidel, "Marker-less 3D feature tracking for mesh-based human motion capture," Lecture Notes in Computer Science, vol. 4814, pp. 1, 2007.

[3] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, "Three-dimensional scene flow," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 475-480, 2005.

[4] E. de Aguiar, C. Theobalt, C. Stoll, and H.-P. Seidel, "Marker-less deformable mesh tracking for human shape and motion capture," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, USA, June 2007, pp. 1-8.

[5] J. Starck and A. Hilton, "Correspondence labelling for wide-timeframe free-form surface matching," in IEEE 11th International Conference on Computer Vision (ICCV), 2007, pp. 1-8.

[6] A. Zaharescu, E. Boyer, K. Varanasi, and R. Horaud, "Surface feature detection and description with applications to mesh matching," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 373-380.

[7] J.-P. Pons, R. Keriven, and O. Faugeras, "Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score," International Journal of Computer Vision, vol. 72, no. 2, pp. 179-193, 2007.

[8] R.L. Carceroni and K.N. Kutulakos, "Multi-view scene capture by surfel sampling: From video streams to non-rigid 3D motion, shape and reflectance," International Journal of Computer Vision, vol. 49, no. 2, pp. 175-214, 2002.

[9] A. Mullins, A. Bowen, R. Wilson, and N. Rajpoot, "Video based rendering using surface patches," in 3DTV Conference, 2007, pp. 1-4.

[10] D. Cobzas, M. Jagersand, and P. Sturm, "3D SSD tracking with estimated 3D planes," Image and Vision Computing, vol. 27, no. 1-2, pp. 69-79, 2009.

[11] M. Habbecke and L. Kobbelt, "A surface-growing approach to multi-view stereo reconstruction," in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1-8.

[12] B.K.P. Horn and B.G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981.

[13] A. Doshi, A. Hilton, and J. Starck, "An empirical study of non-rigid surface feature matching," in 5th European Conference on Visual Media Production (CVMP 2008), 2008, pp. 1-10.