Accurate Segmentation of Moving Objects in Image Sequence Based on Spatio-Temporal Information*

Dongxiang Zhou1,2 and Hong Zhang1
1. Department of Computing Science, University of Alberta, Edmonton, Canada, T6G 2E8
2. School of Electronic Science and Engineering, National University of Defense Technology, Changsha, China, 410073

ABSTRACT

Accurate segmentation of moving objects in an image sequence is a crucial task in many computer vision and image analysis applications, such as the mineral processing industry and automated visual surveillance. In this paper, we introduce a novel algorithm for spatio-temporal segmentation of image sequences that accurately extracts the boundaries of moving objects from a noisy background. Our approach performs an initial segmentation using a background subtraction method based on a Gaussian mixture model (GMM). A Markov random field (MRF)-based labeling technique is then adopted to remove potentially misclassified regions. The final solution is obtained using the level set method, which improves the results by splitting connected moving objects. The algorithm works well for image sequences with multiple moving objects of different sizes.

1. INTRODUCTION

Segmentation of moving objects aims to partition an image into physical moving objects and static background. Although many motion segmentation approaches have been proposed in the literature, producing an accurate motion flow field at motion boundaries remains a challenging problem. In fact, since background pixels are spatially correlated, pixels should not be considered independent in space. In this paper, we present an algorithm that combines motion segmentation and spatial segmentation methods to obtain accurate segmentation of moving objects. The approaches to motion segmentation can be broadly grouped into three categories: temporal differencing, background subtraction, and optical flow [1].
Temporal differencing, or change detection based on frame difference [2], attempts to detect moving regions by exploiting the difference between consecutive frames (two or three) in a video sequence. This method is highly adaptive to dynamic environments, but generally does a poor job of extracting the complete shapes of certain types of moving objects.

* This research is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), iCORE, Syncrude, and Matrikon.

Background subtraction is a commonly used technique for segmenting objects of interest in static scenes [3]. The Mixture of Gaussians (MoG) method has enjoyed tremendous popularity for background modeling since it was first proposed by Friedman and Russell [4]. Stauffer and Grimson [5] presented an adaptive background mixture model using a mixture of K Gaussian distributions (K is a small number, from 3 to 5). The method is stable and robust. The background subtraction technique works well at identifying moving objects, but it requires an accurate reference image and is usually sensitive to changes in illumination, background trembling, and so on. Optical flow methods group those optical flow vectors that are associated with the same motion or structure [1]. In theory, this is an ideal way to solve the segmentation problem; in practice, however, its accuracy is limited and the segmentation does not recover the exact contours of the objects.

The analysis of the above three motion segmentation methods reveals that segmentation based on motion information alone is unlikely to achieve an accurate result without the help of spatial information. Therefore, many effective segmentation approaches are spatio-temporal schemes: while motion information is used as the main criterion to guide the segmentation process, spatial segmentation also plays an important role. In [6], spatial segmentation is applied twice in the process, while edge information plays a post-processing role. Choi et al.'s segmentation scheme [7] uses morphological filters and the watershed algorithm as the basic tools of its spatial segmentation. Wardhani and Gonzalez [8] proposed a split-and-merge image sequence segmentation scheme for content-based retrieval. As described above, these spatial segmentation methods are based on either contour finding or region growing.
When the background is noisy, it is difficult to detect accurate contours or to decide the boundaries of regions. In this paper, we therefore adopt MRF and level set theory as the spatial segmentation methods, which play a post-processing role after the initial segmentation from motion information. We use a GMM-based background subtraction algorithm to aggressively classify pixels that are likely moving in an image. The resulting image is then improved by a local MRF labeling model to remove potentially misclassified regions. Level set theory is applied to refine the final results by splitting connected objects. The rest of the paper is organized as follows. Section 2 presents background subtraction using a Gaussian mixture model. Section 3 describes post-processing by the MRF model. Section 4 presents segmentation refinement by level set theory. Section 5 provides our experimental results. Finally, Section 6 presents our conclusions.

2. BACKGROUND SUBTRACTION WITH GMM

Background subtraction is a particularly popular method for motion segmentation, and Gaussian background models are among the most robust available. Stauffer and Grimson [5] generalized this model by allowing multiple Gaussian distributions per pixel to account for multimodal backgrounds. Each pixel in the scene is modeled using a mixture of K Gaussian distributions. The probability that a certain pixel has intensity x_t at time t is estimated as

P(x_t) = \sum_{i=1}^{K} \omega_i \cdot \eta(x_t, \mu_i, \Sigma_i)    (1)

where \omega_i is the weight, \mu_i is the mean, and \Sigma_i = \sigma_i^2 is the covariance of the i-th distribution, and \eta is a Gaussian probability density function:

\eta(x_t, \mu, \Sigma) = \frac{1}{(2\pi)^{1/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x_t - \mu)^T \Sigma^{-1} (x_t - \mu)}    (2)

The i-th Gaussian component is updated as follows:

\omega_{i,t} = (1 - M_{i,t})\,\omega_{i,t-1} + (\omega_{i,t-1} + \alpha)\,M_{i,t}
\mu_t = (1 - M_{i,t})\,\mu_{t-1} + ((1 - \rho)\,\mu_{t-1} + \rho\,x_t)\,M_{i,t}
\sigma_t = (1 - M_{i,t})\,\sigma_{t-1} + ((1 - \rho)\,\sigma_{t-1} + \rho\,|x_t - \mu_t|)\,M_{i,t}
\rho = \alpha\,\eta(x_t, \mu_i, \Sigma_i)    (3)

where \alpha is the learning rate, set to 0.05 in our algorithm, and \rho is the learning factor for adapting the current distributions, set to the constant 0.005. M_{i,t} is defined as follows:

M_{i,t} = \begin{cases} 1, & \text{if the } i\text{-th Gaussian component is matched} \\ 0, & \text{otherwise} \end{cases}    (4)
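These update rules, together with the ordering and thresholding of Eq. (5), can be sketched per pixel as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the class and method names, the initial means and deviations, and the reuse of lambda as the match threshold are all ours.

```python
import numpy as np

class PixelGMM:
    """Sketch of a single pixel's K-Gaussian background mixture."""

    def __init__(self, k=3, alpha=0.05, rho=0.005, T=0.7, lam=1.5):
        self.k, self.alpha, self.rho, self.T, self.lam = k, alpha, rho, T, lam
        self.w = np.full(k, 1.0 / k)       # weights omega_i
        self.mu = np.linspace(0, 255, k)   # means mu_i (arbitrary initialization)
        self.sigma = np.full(k, 30.0)      # standard deviations sigma_i

    def update(self, x):
        """Update the mixture with intensity x; return True if x is foreground."""
        # A component "matches" when x lies within lam standard deviations.
        d = np.abs(x - self.mu) / self.sigma
        matched = np.argmin(d) if d.min() < self.lam else None

        if matched is None:
            # Replace the least probable component, as in Stauffer-Grimson.
            i = np.argmin(self.w / self.sigma)
            self.mu[i], self.sigma[i], self.w[i] = x, 30.0, self.alpha
        else:
            i = matched
            # Eq. (3): raise the matched weight, adapt mean and deviation.
            self.w[i] += self.alpha
            self.mu[i] = (1 - self.rho) * self.mu[i] + self.rho * x
            self.sigma[i] = np.sqrt((1 - self.rho) * self.sigma[i] ** 2
                                    + self.rho * (x - self.mu[i]) ** 2)
        self.w /= self.w.sum()             # normalize the weights

        # Eq. (5): the first B components (ordered by w/sigma) model background.
        order = np.argsort(-self.w / self.sigma)
        cum = np.cumsum(self.w[order])
        B = np.searchsorted(cum, self.T) + 1
        bg = order[:B]
        # Foreground if x is far from every background component.
        return bool(np.all(np.abs(x - self.mu[bg]) > self.lam * self.sigma[bg]))
```

After the model has absorbed a few hundred frames of a stable intensity, that intensity is classified as background while a distant intensity is flagged as foreground.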

After the updates, the weights \omega_i are normalized. The K distributions are ordered by \omega_i / \sigma_i, and the first B distributions are used as the model of the scene background, where B is estimated as

B = \arg\min_b \left( \sum_{i=1}^{b} \omega_i > T \right)    (5)

The threshold T is the minimum fraction of the data that should be accounted for by the background model. Background subtraction is then performed by marking as a foreground pixel any pixel that is more than \lambda (1.0 to 1.5 in our experiments) standard deviations away from all of the B background distributions.

3. POST-PROCESSING BY MRF LABELING

Object segmentation can be treated as a Markovian labeling process guided by the maximum a posteriori (MAP) criterion. In object segmentation, only the foreground label F and the background label B are considered. After background subtraction, the region labeling is improved by the MRF method. Let us denote the image by S and the label set by L = {B, F}. The intensity of pixel i is denoted by x_i and its label by f_i. To find the best segmentation label for pixel i, the probability P(f_i | x_i, F - \{f_i\}) should be maximized, where F - \{f_i\} is the segmentation of the image except for the i-th pixel. By Bayes' theorem:

P(f_i | x_i, F - \{f_i\}) = \frac{P(x_i | f_i, F - \{f_i\}) \cdot P(f_i | F - \{f_i\})}{P(x_i | F - \{f_i\})}    (6)

The intensity x_i of the i-th pixel depends only on f_i and is independent of the other regions F - \{f_i\}. In addition, P(x_i) has a fixed value that does not depend on the segmentation. So, to maximize Eq. (6), only the numerator needs to be maximized:

P(f_i | x_i, F - \{f_i\}) \propto P(x_i | f_i) \cdot P(f_i | F - \{f_i\})    (7)

For general static segmentation, it is assumed that each label type has a Gaussian distribution with mean \mu and variance \sigma^2, so we have

P(x_i | f_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2\sigma^2} (x_i - \mu_{f_i})^2 \right]    (8)

in which \mu_{f_i} is the mean intensity of region f_i. In some situations, the background and moving objects have the same gray values, so we cannot assume that each label type follows a single Gaussian distribution. Instead, we suppose that each pixel has a Gaussian distribution whose mean and variance are computed from its eight neighboring pixels, according to the labels produced by the trained GMM model. We also model the spatial connectivity by an MRF. If we denote the neighborhood of pixel i by N_i, then we have

P(f_i | F - \{f_i\}) = P(f_i | f_j, j \in N_i)    (9)

According to the Hammersley-Clifford theorem [9], the density of f_i is given by a Gibbs density of the following form:

P(f_i | f_j, j \in N_i) = \frac{1}{Z} \exp\left\{ -\sum_{c \in C_i} V_c(f_i) \right\}    (10)

where C_i is the set of all cliques that include pixel i and Z is the normalizing constant. So, we can rewrite the term in Eq. (7) that should be maximized as follows:

A=

1

exp{−

1

( xi − µfi ) 2 − ∑Vc ( fi )}

(11)

2σ 2π ⋅ σ ⋅ Z ci As we use an 8-neighborhood system, the spatial clique energies were chosen so that there was no bias towards any particular orientation or location in the image. For this reason only spatial two-cliques are considered and they are all the same, defined as follows: ⎧− β , if fi = fj and i, j ∈ c (12) Vc ( f ) = ⎨ if fi ≠ fj and i, j ∈ c ⎩+ β , From Eq. (11), we can obtain energy function of every pixel xi, so we have: ( xi − µi ) 2 U ( xi, fi ) = ln( 2π ⋅ σi ) + + ∑ Vc ( f i ) (13) 2σi 2 ci For the whole image, the segmentation problem is reduced to the minimization of the following energy function: U ( S , F ) = ∑U ( xi, fi ) (14) 2

i∈S
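Minimizing this energy is what the ICM procedure discussed next does. The following minimal sketch uses global per-label Gaussian statistics rather than the paper's local 8-neighbour estimates, and the value of beta is our own illustrative choice.

```python
import numpy as np

def icm_segment(img, labels, beta=0.8, max_iter=10, tol=1e-3):
    """ICM relaxation of the MRF energy: Eq. (13) per pixel, Eq. (14) in total."""
    labels = labels.copy()
    h, w = img.shape
    prev_energy = np.inf
    for _ in range(max_iter):
        # Gaussian parameters per label, re-estimated every sweep
        # (a simplification of the paper's local neighbourhood statistics).
        stats = {}
        for f in (0, 1):                   # 0 = background, 1 = foreground
            vals = img[labels == f]
            stats[f] = (vals.mean(), vals.std() + 1e-6) if vals.size else (0.0, 1.0)
        energy = 0.0
        for r in range(h):
            for c in range(w):
                best_f, best_u = labels[r, c], np.inf
                for f in (0, 1):
                    mu, sig = stats[f]
                    # Data term of Eq. (13).
                    u = np.log(np.sqrt(2 * np.pi) * sig) \
                        + (img[r, c] - mu) ** 2 / (2 * sig ** 2)
                    # Eq. (12): -beta for agreeing neighbours, +beta otherwise.
                    for dr, dc in ((-1, -1), (-1, 0), (-1, 1), (0, -1),
                                   (0, 1), (1, -1), (1, 0), (1, 1)):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:
                            u += -beta if labels[rr, cc] == f else beta
                    if u < best_u:
                        best_f, best_u = f, u
                labels[r, c] = best_f      # "hard decision" of ICM
                energy += best_u
        if abs(prev_energy - energy) < tol:  # stop when the energy settles
            break
        prev_energy = energy
    return labels
```

On a clean two-region image a single sweep already removes isolated mislabeled pixels; the energy check then terminates the loop.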

Many heuristics have been proposed to solve the MRF problem: Iterated Conditional Modes (ICM) [10], Graduated Non-Convexity (GNC) [11], Mean Field Annealing [12], Simulated Annealing (SA) [13], Dynamic Programming [14], etc. Two frequently used optimization methods are SA and ICM. SA has been shown to reach global minima, although it usually takes too long to converge [15]. ICM is the fastest method; however, its results are often unsatisfactory on relatively complex images due to its "hard-decision" nature, and it is very sensitive to initial conditions. Since we have already obtained a good initial segmentation by background subtraction, we choose ICM for its speed. Specifically, we define the following iterative ICM procedure for solving the MRF problem:
(1) Start from a good initial configuration and set k = 0.
(2) For every pixel x_i in the image, compute U(x_i, B) and U(x_i, F) according to Eq. (13), and select the label with minimal energy: U(x_i, f_i) = min(U(x_i, B), U(x_i, F)).
(3) Compute the total energy U(S, F) according to Eq. (14).
(4) Set k = k + 1 and go to (2) until convergence (for example, until the energy change is less than a certain threshold).

4. REFINEMENT BY THE LEVEL SET THEORY

In the regions classified by MRF, some objects may still be incorrectly connected, and the boundaries of objects are not accurate. Splitting these regions apart and finding accurate object contours is a challenging task because of the irregular shapes of the objects and the lack of sharp boundaries. Active contours, or "snakes" [15], can be used to segment objects automatically, but their performance suffers from changes of topology and the presence of corners. To overcome these problems, the level set approach has been proposed [16]. The level set method handles topological merging and splitting naturally. The main challenge in level set methods is constructing an adequate model for the speed function. Based on the speed function model, level set methods can be classified as edge-based or region-based. The drawback of edge-based methods is that they can only detect objects with edges defined by strong gradients. Region-based methods perform energy-minimization-based segmentation and are suitable for noisy backgrounds. We modify an active contour model based on the Mumford-Shah segmentation technique proposed by Chan and Vese [17] to meet the requirements of splitting connected objects and extracting accurate object boundaries; specifically, we add an external force that makes connected pieces split. We apply this model on a local window around every moving fragment identified by the MRF processing.

4.1 Level set method modified from the Chan-Vese (CV) model

Chan and Vese's model introduced the energy functional F(C), defined by

F(C) = \mu \cdot \mathrm{Length}(C) + \nu \cdot \mathrm{Area}(\mathrm{inside}(C)) + \lambda_1 \int_{\mathrm{inside}(C)} |u_0 - c_1|^2 \, dx\,dy + \lambda_2 \int_{\mathrm{outside}(C)} |u_0 - c_2|^2 \, dx\,dy    (15)

where \mu \ge 0, \nu \ge 0, \lambda_1, \lambda_2 > 0 are fixed parameters, u_0 is the intensity of a pixel, C is the curve, and the constants c_1 and c_2, which depend on C, are the averages of u_0 inside and outside the curve C, respectively. Finding the object boundary amounts to minimizing the energy F(C). Chan and Vese derived the associated Euler-Lagrange equation for the level set function \phi:

\phi_t = \delta(\phi) \left[ \nu \, \mathrm{div}\!\left( \frac{\nabla\phi}{|\nabla\phi|} \right) - \lambda_1 (u_0 - c_1)^2 + \lambda_2 (u_0 - c_2)^2 \right], \quad \phi(x, y, 0) = \phi_0(x, y)    (16)

We add an extra force to split the connected objects during the curve evolution, so Eq. (16) becomes

\phi_t = \delta(\phi) \left[ \nu \, \mathrm{div}\!\left( \frac{\nabla\phi}{|\nabla\phi|} \right) - \lambda_1 (u_0 - c_1)^2 + \lambda_2 (u_0 - c_2)^2 - M \right], \quad \phi(x, y, 0) = \phi_0(x, y)    (17)

where M > 0 is not a fixed parameter; its value changes according to the state of the curve evolution.

4.2 Adaptive Curve Evolution
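A minimal sketch of the modified evolution of Eq. (17) on a grayscale window follows. The smoothed delta function, time step dt, and all parameter values (including a constant M, whereas the paper varies M during evolution) are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def evolve_cv(u0, phi, nu=0.2, lam1=1.0, lam2=1.0, M=0.1, dt=0.5, iters=100):
    """Explicit gradient-descent evolution of Eq. (17) with a fixed splitting force M."""
    for _ in range(iters):
        inside, outside = phi > 0, phi <= 0
        c1 = u0[inside].mean() if inside.any() else 0.0   # mean inside the curve
        c2 = u0[outside].mean() if outside.any() else 0.0  # mean outside the curve
        # Curvature term div(grad phi / |grad phi|) by central differences.
        fy, fx = np.gradient(phi)
        norm = np.sqrt(fx ** 2 + fy ** 2) + 1e-8
        ky = np.gradient(fy / norm)[0]
        kx = np.gradient(fx / norm)[1]
        curv = kx + ky
        # Smoothed delta function concentrated near the zero level set.
        delta = 1.0 / (np.pi * (1.0 + phi ** 2))
        # Eq. (17): curvature + region competition terms, minus the force M.
        phi = phi + dt * delta * (nu * curv
                                  - lam1 * (u0 - c1) ** 2
                                  + lam2 * (u0 - c2) ** 2
                                  - M)
    return phi
```

Starting from a contour enclosing a bright square, the zero level set shrinks onto the square: pixels of the dark surround end with phi < 0 and the bright interior stays positive.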

4.2.1 Geometric properties of the level set curve

The gradient at any point on the curve is defined as

\nabla\phi = \left( \frac{\partial\phi}{\partial x}, \frac{\partial\phi}{\partial y} \right)    (18)

Then the unit normal vector \vec{N} at any point on the curve is given by

\vec{N} = \frac{\nabla\phi}{|\nabla\phi|}    (19)

and the curvature k is obtained as the divergence of the unit normal vector to the front:

k = \nabla \cdot \left( \frac{\nabla\phi}{|\nabla\phi|} \right) = \frac{\phi_{xx}\phi_y^2 - 2\phi_x\phi_y\phi_{xy} + \phi_{yy}\phi_x^2}{(\phi_x^2 + \phi_y^2)^{3/2}}    (20)

Thus k > 0 for concave regions and k < 0 for convex regions.

Fig. 1 Flowchart of the MCV method

For pieces composed of two or more fragments, the percentage of boundary regions with k > 0 (called fPR hereafter) is larger than for pieces consisting of only one fragment (see Fig. 2). Thus, we choose a threshold TA to divide the pieces into two clusters. For the pieces whose fPR is larger than TA, we adopt the MCV (modified Chan-Vese) method; otherwise, we use the original CV method to obtain the accurate contours of the pieces. The value of TA is set to include all the pieces that consist of connected fragments. Although some single fragments may also be included in this cluster, our algorithm automatically decides whether the MCV method should continue to be used after a few iterations of curve evolution.

Fig. 2 Examples of segments after MRF processing: (a) connected fragments, with fPR = 0.39; (b) a single fragment, with fPR = 0.24

5. EXPERIMENTAL RESULTS

This section demonstrates the performance of the proposed algorithms on an image sequence. Fig. 4 shows the results for four example frames. The top row displays the original images. The second row shows the results of background subtraction by the Gaussian mixture model; there are clearly many spurious foreground pixels caused by the trembling background. The third row shows the results after post-processing by MRF labeling: almost all the moving objects are detected, but some moving objects remain connected and some contours are not accurate. The fourth row shows the final segmentation results refined by the level set method. The connected moving objects are partitioned and accurate boundaries are extracted, as can be verified against the manually obtained ground truth images in the bottom row.

We developed a quantitative score metric to measure the similarity between a segmentation result and the ground truth. Let A = {A_1, A_2, ..., A_M} and B = {B_1, B_2, ..., B_N} denote the pieces in the ground truth and in the segmentation result, respectively. A_m \cup B_n and A_m \cap B_n denote the union and intersection of A_m and B_n, and |C| denotes the number of pixels in C. The proposed metric measures the performance of a segmentation algorithm at the object level, as the following weighted sum:

S(A, B) = \sum_{m=1}^{M} \left\{ \frac{|A_m|}{|A|} \sum_{n=1}^{N} \frac{|A_m \cap B_n|}{|A_m \cup B_n|} \right\}    (21)

Table 1 shows the evaluation results for the four sample frames. We can see from the table that the level set method improves the segmentation results by almost 26 percentage points over the MRF method.
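A sketch of the object-level score of Eq. (21). Representing the ground-truth and result pieces as lists of boolean masks, and taking |A| as the total pixel count over all ground-truth pieces, are our own reading of the definition.

```python
import numpy as np

def segmentation_score(A, B):
    """Eq. (21): weighted sum over GT pieces of their total IoU with result pieces."""
    total = sum(int(a.sum()) for a in A)   # |A|: pixels over all ground-truth pieces
    score = 0.0
    for a in A:
        overlap = 0.0
        for b in B:
            inter = np.logical_and(a, b).sum()   # |A_m ∩ B_n|
            union = np.logical_or(a, b).sum()    # |A_m ∪ B_n|
            if union:
                overlap += inter / union
        score += (a.sum() / total) * overlap     # weight by |A_m| / |A|
    return score
```

A perfect segmentation of disjoint pieces scores 1.0, and a missed piece lowers the score in proportion to its size.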

Table 1 Evaluation results of the 4 example frames in Fig. 4

Frame     Post-processing by MRF (%)    Post-processing by level set method (%)
1         60.2                          85.5
2         60.2                          86.6
3         62.7                          89.6
4         63.9                          89.0
Average   61.8                          87.7

6. CONCLUSIONS

In this paper we have presented a method for spatio-temporal segmentation of an image sequence based on background subtraction, MRF theory, and level set theory. It is useful in environments with many moving objects of different sizes, and in particular the method yields very precise boundaries. The approach performed satisfactorily on real image sequences, and the technique is stable enough to support real applications. However, our method is computationally expensive and not suitable for real-time applications, taking an average of four seconds per image on a P4 3.4 GHz machine. Further research is required to reduce the complexity of the algorithm.

REFERENCES

[1] L. Wang, W. Hu, and T. Tan, "Recent developments in human motion analysis", Pattern Recognition, Vol. 36, No. 3, pp. 585-601, Mar. 2003.
[2] R. Radke et al., "Image change detection algorithms: a systematic survey", IEEE Transactions on Image Processing, Vol. 14, No. 3, pp. 294-307, 2005.
[3] A. M. McIvor, "Background subtraction techniques", in Proc. of Image and Vision Computing, New Zealand, Oct. 2000.
[4] N. Friedman and S. Russell, "Image segmentation in video sequences: a probabilistic approach", in Proc. of the 13th Annual Conf. on Uncertainty in Artificial Intelligence (UAI-97), pp. 175-181, Morgan Kaufmann Publishers, San Francisco, CA, 1997.
[5] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking", in Proc. of IEEE CS Conf. on Computer Vision and Pattern Recognition, Vol. 2, pp. 246-252, 1999.
[6] R. Mech and M. Wollborn, "A noise robust method for 2D shape estimation of moving objects in video sequences considering a moving camera", Signal Processing, 66(2):203-217, 1998.
[7] J. G. Choi, S. W. Lee, and S. D. Kim, "Spatio-temporal video segmentation using a joint similarity measure", IEEE Trans. CSVT, Vol. 7, No. 2, pp. 279-286, 1997.
[8] A. Wardhani and R. Gonzalez, "Image structure analysis for CBIR", in Proc. Digital Image Computing: Techniques and Applications (DICTA'99), Perth, Australia, Dec. 1999, pp. 166-168.
[9] J. Besag, "Spatial interaction and the statistical analysis of lattice systems (with discussion)", J. of the Royal Statist. Soc., Series B, 36(2):192-236, 1974.
[10] J. Besag, "On the statistical analysis of dirty pictures", J. of the Royal Statist. Soc., Series B, Vol. 48, pp. 259-302, 1986.
[11] A. Rangarajan and R. Chellappa, "Generalized graduated non-convexity algorithm for maximum a posteriori image estimation", in Proc. ICPR, June 1990, pp. 127-133.
[12] D. Geiger and F. Girosi, "Parallel and deterministic algorithms for MRFs: surface reconstruction and integration", in Proc. ECCV90, Antibes, France, 1990, pp. 89-98.
[13] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images", IEEE Trans. Patt. Anal. Machine Intell., Vol. PAMI-6, pp. 721-741, Nov. 1984.
[14] H. Derin, H. Elliott, R. Cristi, and D. Geman, "Bayes smoothing algorithms for segmentation of binary images modeled by Markov random fields", IEEE Trans. Patt. Anal. Machine Intell., Vol. 6, 1984.
[15] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: active contour models", Int. J. Comput. Vis., Vol. 1, pp. 321-331, 1987.
[16] S. Osher and J. A. Sethian, "Fronts propagating with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations", J. Comput. Phys., Vol. 79, pp. 12-49, 1988.
[17] T. F. Chan and L. A. Vese, "Active contours without edges", IEEE Trans. on Image Processing, Vol. 10, No. 2, pp. 266-277, Feb. 2001.

Fig. 4 Sample frames and segmentation results. The top row shows the original images, the second row the results of background subtraction by GMM, the third row the results of MRF labeling, the fourth row the results of the level set method, and the bottom row the manually obtained ground truth images.