
IEEE SIGNAL PROCESSING LETTERS, VOL. 21, NO. 1, JANUARY 2014


Salient Object Detection via Random Forest

Shuze Du and Shifeng Chen

Abstract—Salient object detection plays an important role in image pre-processing. Existing approaches often neglect the contours of salient objects, resulting in inaccurate detection of large objects; moreover, they mainly focus on detecting only a single object. In this paper, we detect the salient object from the viewpoint of the object contour. We propose to exploit the random forest to measure patch rarities and to compute similarities among patches. A global rarity map is calculated based on each patch's rareness over the whole image. The approximate contour of the salient object is extracted from this rarity map by using an active contour model. Next, a local saliency map is obtained from the similarities between patches inside the contour and those outside. Finally, the local map is refined through image segmentation. Our method can detect not only a single object but also multiple objects. Experimental evaluation on the ASD-1000 and SED2 datasets shows that our method outperforms the state-of-the-art methods.

Index Terms—Random forests, salient object detection.

I. INTRODUCTION

VISUAL saliency is a concept from neuroscience and physiology describing how areas of interest pop out of the human visual field [1]. Salient object detection is a main trend in modeling saliency, which aims to locate the most interesting object(s) in a scene. Recently, it has attracted more and more attention as it serves as a pre-processing step for many applications, such as image segmentation [2] and retargeting [3]. Traditional methods often compute saliency based on contrasts, using either local [1], [2], [4] or global [3], [5]–[9] contrast analysis. Local methods are more sensitive to high-contrast edges and noise, and attenuate smooth regions inside objects, which makes them more appropriate for detecting small objects. Among global methods, the patch-based approach [3] also tends to highlight object boundaries rather than the whole object area. Though segmentation-based methods [5], [6] effectively alleviate the "object attenuation" problem (the attenuated object interior), they still have difficulty highlighting the entire

Manuscript received August 19, 2013; revised September 29, 2013; accepted November 02, 2013. Date of publication November 12, 2013; date of current version November 19, 2013. This work was conducted at Shenzhen Institutes of Advanced Technology, CAS. This work was supported by Guangdong Innovative Research Team Program under Grant 201001D0104648280 and by the Science, Industry, Trade, and Information Technology Commission of Shenzhen Municipality, China, under Grant JC201005270357A. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Tolga Tasdizen. S. Du is with Chengdu Institute of Computer Applications, the Chinese Academy of Sciences (CAS), Chengdu, China, and also with the University of CAS, Beijing, China (e-mail: [email protected]). S. Chen is with the Shenzhen Key Laboratory for Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China (e-mail:[email protected]). (Corresponding author: S. Chen). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LSP.2013.2290547

Fig. 1. Overview of our algorithm. Given the input image, rarity analysis outputs a global rarity map by using the random forest. Based on the rarity map, the rough contour of the salient object is extracted. A local saliency map, which represents the similarity between an outer patch and all inner patches of the contour, and the dissimilarity between an inner patch and all outer patches, is computed. The final map is obtained by refining the local map with graph-based segmentation.

object when the inner region of the object is inhomogeneous; they often detect only several parts of the object. Borji et al. [10] propose to use the global patch rarity (the patch's frequency of occurrence over the entire scene) to fill the blank holes inside the object. For large objects, however, the rare patches are those that lie on the boundaries between objects and the background. Most of the above models are formulated for detecting only a single salient object, and their limitations make it difficult for them to detect more than one salient object in an image.

Our work is partially inspired by [2], in which the authors integrate the saliency map and a shape prior of the object, i.e., that a salient object has a well-defined closed boundary, into a model for segmenting the salient object. The shape prior is extracted by combining saliency with object boundary information obtained by an edge detector. In this paper, we propose a new patch-based approach to detecting the salient object through its approximate contour. The framework of the approach is presented in Fig. 1. First, we use the global patch rarity to capture the approximate contour of the salient object. Unlike [2], our contour extraction method is simple and requires no edge detector. The image is then separated into two parts, one inside the contour and the other outside, corresponding to the object and background regions respectively. As the contour is coarse, we measure the contrasts between the inner patches and the outer patches, aiming to suppress the inner patches that are similar to the outer patches while highlighting the outer patches that are similar to the inner patches. Through these contrasts among patches, our method highlights the whole object uniformly. The rarity analysis and contrast measure are both carried out via the random forest. Finally, since the patch-based map is only a rough estimate of saliency, we refine the local map using graph-based segmentation.
This approach performs well not only for the single-object case, but also for the multiple-object case.

1070-9908 © 2013 IEEE



II. PROPOSED MODEL

Our novel model is patch-based and can be divided into two main stages: approximate contour extraction and contrast measurement among patches. In this section, we first introduce the building process of the random forest, and then describe the details of the two stages.

A. Random Forest

The random forest was first proposed for the classification problem [11] and has since been applied to density estimation [12] and many other tasks. In [13], the authors use the forest to compute similarities between images in their image retrieval system. In our work, we employ a variant of the random forest proposed by Yu et al. [12], and extend the similarity metric in [13] to measure the similarity between a patch and a patch set. The rarity of a patch is determined by the size of the leaf node in which the patch resides.

Given an input image, it is first scaled to the size of $W \times W$. Let $X = \{x_i\}_{i=1}^{N}$ denote the set of linearized image patches of size $k \times k$ ($W$ is divisible by $k$) extracted from top-left to bottom-right of the scaled image without overlap. Hence the total number of patches is $N = (W/k)^2$. As in [10], the RGB and Lab color spaces are used together to represent the color feature, i.e., each patch is represented by six color sub-channels. Thus $x_i$ is a column vector of pixel values of length $6k^2$. Each color channel is normalized to the range $[0, 1]$.

The forest is an ensemble of $T$ trees, formulated as $F = \{T_t\}_{t=1}^{T}$. Each tree consists of splitting nodes and leaf nodes, and is recursively built from all these patches. At a splitting node $s$ in tree $T_t$, two random numbers $a$ and $b$ are generated to indicate which dimension indices of the feature vector are used for splitting the patch set $S_s$ associated with it. Define $v_i = x_i(a) - x_i(b)$; the splitting function at this node can be formulated as:

$$x_i \in \begin{cases} S_l, & v_i \ge 0 \\ S_r, & \text{otherwise} \end{cases} \qquad (1)$$

where $S_l$ and $S_r$ are the patch sets contained in node $s$'s left and right children respectively, $S_l \cup S_r = S_s$, and $|S|$ is the cardinality of $S$. For each splitting node $s$, we try different pairs of $(a, b)$ in our implementation and select one that satisfies $S_l \neq \emptyset$ and $S_r \neq \emptyset$. Then $s$'s two children are selected as new nodes to split. The splitting process terminates at a node when the depth of the node reaches the pre-set maximum depth $d$ or it contains only one patch. Any node that cannot be further divided forms a leaf node. Tree $T_t$ stops growing when no more nodes can be split. In this way, each leaf node contains at least one patch and each patch falls into exactly one leaf node in any tree. After the forest is built, we use it to measure the rarities of patches and to compute similarities among them.

B. Contour Detection

Let $x_i$ be a patch whose center pixel is at position $p_i$ in the scaled image. Suppose that it resides at the leaf node $l_t(x_i)$ in tree $T_t$; then the rarity of $x_i$ is evaluated from

$$R(x_i) = \frac{w_i}{\frac{1}{T}\sum_{t=1}^{T} |l_t(x_i)|} \qquad (2)$$

where $|l_t(x_i)|$ is the number of patches contained in $l_t(x_i)$. The weight $w_i = \exp(-\|p_i - c\|^2/\sigma^2)$ is set to suppress patches far away from the image center $c$, as the rare patches near the image border always belong to the cluttered background. Here, $\sigma$ controls the strength of the center prior. In our experiments, it is set to 0.16 with patch coordinates normalized to $[0, 1]$. According to our observation, the leaf node on which boundary patches lie always has a smaller size than that of patches in the background or the object. In a single tree, however, the size may be large by chance. Thus, we average the leaf-node sizes over all trees (the denominator of Eq. (2)) to weaken this effect. We then adopt the active contour model of [14] to extract the contour $C$ of the salient object based on the rarity map $R$, as it is insensitive to the initial contour. Moreover, it can extract all objects from the rarity map, which is beneficial for detecting more than one salient object. Fig. 1 shows an example of the rarity map and the extracted contour.

C. Salient Object Discovery

In this subsection, we introduce how to discover the salient objects according to the detected contour $C$. We partition the patches into two subsets, $X_{in}$ inside $C$ and $X_{out}$ outside $C$. The formal definition can be written as:

$$x_i \in \begin{cases} X_{in}, & n_i / k^2 > \theta \\ X_{out}, & \text{otherwise} \end{cases} \qquad (3)$$

where $n_i$ denotes the number of pixels of $x_i$ contained inside the extracted contour $C$, and $\theta$ is a constant whose value varies between 0 and 1 according to the dataset. The similarity between $x_i$ and $x_j$ in tree $T_t$ is defined as:

$$s_t(x_i, x_j) = \begin{cases} \frac{1}{|l_t(x_i)|}, & x_i \text{ and } x_j \text{ at the same leaf} \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

The weight $1/|l_t(x_i)|$ is used because two patches falling together in a small leaf is less likely, and it is reasonable to consider them similar if they in fact do (refer to [13]). For the ensemble forest $F$, the similarity between patches $x_i$ and $x_j$ can be obtained by combining the similarities computed from all trees:

$$S(x_i, x_j) = \frac{1}{T}\sum_{t=1}^{T} s_t(x_i, x_j). \qquad (5)$$
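To make the rarity computation concrete, Eq. (2) can be sketched in Python. This is a minimal illustration under assumptions, not the authors' implementation: the per-tree leaf sizes are taken as given, and the exact Gaussian form of the center weight and the name `rarity_map` are hypothetical.

```python
import numpy as np

def rarity_map(leaf_sizes, centers, sigma=0.16):
    """Patch rarity as in Eq. (2): inverse of the mean leaf size over
    all trees, weighted by a Gaussian center prior.

    leaf_sizes : (N, T) array; leaf_sizes[i, t] is the number of patches
                 in the leaf of tree t into which patch i falls.
    centers    : (N, 2) array of patch-center coordinates in [0, 1].
    sigma      : strength of the center prior (assumed Gaussian form
                 w_i = exp(-||p_i - c||^2 / sigma)).
    """
    leaf_sizes = np.asarray(leaf_sizes, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Average leaf size over the T trees to suppress per-tree chance effects.
    mean_size = leaf_sizes.mean(axis=1)
    # Center prior: down-weight patches far from the image center (0.5, 0.5).
    d2 = ((centers - 0.5) ** 2).sum(axis=1)
    w = np.exp(-d2 / sigma)
    return w / mean_size  # rare (small-leaf) central patches score high
```

A centered patch sitting in a singleton leaf thus scores maximally, while patches in large leaves or near the border are suppressed.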
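The tree-growing procedure described here can be sketched as follows. This is a minimal illustration rather than the authors' code: the threshold-at-zero reading of the splitting rule, the accept-the-first-valid-split criterion, and the helper names `build_tree`/`leaf_of` are assumptions.

```python
import numpy as np

def build_tree(patches, indices=None, depth=0, max_depth=7, n_trials=10, rng=None):
    """Recursively grow one randomized tree over linearized patches.

    A split compares two random feature dimensions a, b: a patch goes
    left when x[a] - x[b] >= 0 (an assumed reading of Eq. (1)).  Up to
    n_trials candidate (a, b) pairs are tried per node; the first pair
    producing two non-empty children is accepted.
    Returns a nested dict; leaves store the patch indices they contain.
    """
    rng = np.random.default_rng() if rng is None else rng
    if indices is None:
        indices = np.arange(len(patches))
    # Stop when the maximum depth is reached or only one patch remains.
    if depth >= max_depth or len(indices) <= 1:
        return {"leaf": indices}
    dim = patches.shape[1]
    for _ in range(n_trials):
        a, b = rng.integers(0, dim, size=2)
        go_left = patches[indices, a] - patches[indices, b] >= 0
        left, right = indices[go_left], indices[~go_left]
        if len(left) > 0 and len(right) > 0:  # first valid split wins
            return {"a": int(a), "b": int(b),
                    "left": build_tree(patches, left, depth + 1, max_depth, n_trials, rng),
                    "right": build_tree(patches, right, depth + 1, max_depth, n_trials, rng)}
    return {"leaf": indices}  # no valid split found: form a leaf

def leaf_of(tree, x):
    """Route a single patch x to its leaf; return the stored indices."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["a"]] - x[tree["b"]] >= 0 else tree["right"]
    return tree["leaf"]
```

Because routing uses the same deterministic comparisons as growing, every training patch lands back in the leaf that contains its own index, which is the property the rarity and similarity measures rely on.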
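The per-tree similarity of Eq. (4) and its forest average of Eq. (5) admit a direct sketch; the array layout and the name `forest_similarity` are assumptions for illustration.

```python
import numpy as np

def forest_similarity(leaf_ids, leaf_sizes, i, j):
    """Eqs. (4)-(5): similarity of patches i and j averaged over trees.

    leaf_ids   : (N, T) array; leaf_ids[n, t] identifies the leaf of
                 patch n in tree t.
    leaf_sizes : (N, T) array; leaf_sizes[n, t] is that leaf's size.
    In each tree the pair contributes 1/|leaf| when the two patches
    share a leaf, and 0 otherwise.
    """
    same = leaf_ids[i] == leaf_ids[j]          # per-tree co-occurrence
    contrib = np.where(same, 1.0 / leaf_sizes[i], 0.0)
    return contrib.mean()                      # average over the T trees
```

Two patches that repeatedly co-occur in small leaves thus score high, while patches that never share a leaf score zero.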


Fig. 2. Precision-recall curves for different settings on the ASD dataset. (a) Results of 5 runs using the same setting. (b) Comparison of saliency maps using different numbers of trees. (c) Comparison of saliency maps with different tree depths.

Our purpose is to compare the similarity between a patch in one set and the other set. Motivated by [3], we define the similarity between patch $x_i$ and patch set $X_*$ as:

$$Sim(x_i, X_*) = \frac{1}{|X_*|}\sum_{x_j \in X_*} S(x_i, x_j) \qquad (6)$$

where $X_*$ denotes either the patch set $X_{in}$ or $X_{out}$, and $|X_*|$ indicates the number of patches in $X_*$. Since the detected contour is rough, some patches belonging to the object may be excluded from the object region. In order to complement the object, the saliency value of an outside patch $x_i \in X_{out}$ is defined by:

$$Sal(x_i) = \frac{Sim(x_i, X_{in})}{\max_{x_j \in X_{out}} Sim(x_j, X_{in})} \qquad (7)$$

where the denominator is a normalization factor ensuring that the maximum value of $Sal(x_i)$ is equal to 1. That is, patch $x_i$ is considered salient if it is highly similar to all patches inside the contour. As some background patches may be contained in $X_{in}$, it is necessary to suppress them if they are similar to the outside patches. Likewise, the saliency of an inside patch $x_i \in X_{in}$ is evaluated by:

$$Sal(x_i) = 1 - \frac{Sim(x_i, X_{out})}{\max_{x_j \in X_{in}} Sim(x_j, X_{out})} \qquad (8)$$

where the denominator, as stated above, is a normalization factor. This scheme down-weights the saliency of inside patches that are similar to the outside patches: for patch $x_i$, the larger its similarity to $X_{out}$, the less salient it is. The contour map is considered as an initial map. Based on the computations above, the map is re-estimated more accurately, producing the object-level saliency map. This map is then re-scaled to the size of the original image.

D. Refinement with Image Segmentation

As our approach is patch-based, the detection result at object boundaries is not accurate. We use image segmentation to alleviate this issue. The final saliency map is obtained by averaging saliency values within image regions generated by segmenting the original color image with graph-based segmentation [15]. This step can effectively remove noise and remedy false detections, further refining the results.
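To make Eqs. (3) and (6)–(8) and the refinement step concrete, the partition, the inner/outer saliency assignment, and the per-region averaging can be sketched in Python. This is a minimal illustration under assumptions: the function names are hypothetical, the dense similarity matrix is assumed precomputed via Eq. (5), and the segmentation labels would come from an external graph-based segmentation.

```python
import numpy as np

def partition(n_inside_pixels, k, theta):
    """Eq. (3): a patch joins X_in when more than a fraction theta of
    its k*k pixels lie inside the extracted contour C."""
    return np.asarray(n_inside_pixels, dtype=float) / (k * k) > theta

def object_saliency(S, inside):
    """Eqs. (6)-(8): per-patch saliency from pairwise similarities.

    S      : (N, N) patch-similarity matrix (Eq. (5)).
    inside : (N,) boolean mask, e.g. from partition().
    """
    S = np.asarray(S, dtype=float)
    sim_to_in = S[:, inside].mean(axis=1)    # Eq. (6) with X_* = X_in
    sim_to_out = S[:, ~inside].mean(axis=1)  # Eq. (6) with X_* = X_out
    sal = np.empty(len(S))
    # Eq. (7): recover object patches left outside the rough contour.
    sal[~inside] = sim_to_in[~inside] / sim_to_in[~inside].max()
    # Eq. (8): suppress inside patches that resemble the background.
    sal[inside] = 1.0 - sim_to_out[inside] / sim_to_out[inside].max()
    return sal

def refine(sal_map, labels):
    """Sec. II-D: average saliency within each segmentation region."""
    out = np.zeros_like(sal_map, dtype=float)
    for l in np.unique(labels):
        mask = labels == l
        out[mask] = sal_map[mask].mean()
    return out
```

Outside patches that strongly resemble the inner set are pulled up toward 1, while inside patches that resemble the background are pushed toward 0; the final averaging snaps the patchwise map to segment boundaries.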

III. EXPERIMENTS

A. Experimental Protocol

To validate our proposed approach, we have performed experiments on two datasets: 1) ASD [9], which contains 1000 images with binary ground truth, most of which contain only one salient object, and 2) SED2 [16], which includes 100 images, each containing two salient objects. Following the setting in [9], we evaluate our model using precision and recall analysis. Given a saliency map, a binary mask is obtained with a threshold, which is compared with the ground truth to compute the precision and recall. The precision vs. recall curves are computed by varying the threshold from 0 to 255. Our approach is compared against several state-of-the-art salient object detection methods, including saliency filters (SF) [6], context-aware saliency (CA) [3], context-based saliency (CB) [2], histogram-based contrast (HC) [5], region-based contrast (RC) [5], and prior-based saliency (PBS) [17]. The scaled image size and the patch size are fixed in the following experiments.

B. In-depth Analysis on ASD

We first study the robustness of our model and the influence of different parameters on ASD, as it is widely used as the benchmark in this field. The partition threshold is set empirically, as the objects in this dataset have a moderate size.

Robustness. We run our model 5 times with the same forest setting of 20 trees. The results are shown in Fig. 2(a). The precisions of the 5 runs at the same recall vary within a very narrow range of 0.005, which manifests that the forest is stable. In the following experiments, for each parameter setting, we report results averaged over 5 runs.

Number of trees. Fig. 2(b) shows that the performance is insensitive to the number of trees: increasing the number of trees from 5 to 40 causes little variance, and the result of using 20 trees is slightly better than that of using more or fewer trees.

Tree depth. Our model is only sensitive to the tree depth.
When the tree depth is small, rare patches cannot be distinguished from the others. If the trees grow deeper, however, common patches will be over-split, leading to lower performance. See Fig. 2(c) for the influence of tree depth. We obtain the best results when the tree depth is set to 7, and the performance drops when the depth becomes larger or smaller.

Comparison with other methods. In all of the following experiments, we build a forest with 20 trees of depth 7 for each image. The resulting precision vs. recall curve is given in Fig. 3(a). It is observed that the random forest based rarity map



Fig. 3. Results on ASD dataset. (a) Precision-recall curves for all methods. (b) Visual comparison of other approaches to our method (RFSR) and ground truth (GT). Best viewed in color.

Fig. 4. Quantitative and visual comparisons on dataset SED2. Best viewed in color.

(RFR) significantly outperforms the CA method, which shows that our method can capture the contour information more effectively with a very simple strategy. Our random forest based saliency (RFS) is comparable to the segment-based methods (all except the patch-based CA). After refining the saliency map with graph-based segmentation, the final result (RFSR) outperforms the other approaches over most of the recall range. Salient object detection results for several representative images are shown in Fig. 3(b), which gives a visual comparison of the different methods. Our model is able to highlight the whole object even if the object is located far away from the image center. For a large object with a non-uniform inner area, our model still detects the whole object rather than only some of its parts. We use a PC with a 3.0 GHz CPU and 2 GB RAM; building the forest dominates the running time, while searching the leaves into which the patches fall takes only about 0.01 s.

C. Result on SED2

We set a smaller partition threshold for this set as it contains many small objects; if the threshold is too large, these objects will be excluded from the object regions. Fig. 4(a) shows that the proposed model consistently outperforms the other methods on this two-object dataset. From Fig. 4(b), we note that our model detects both objects well, while the other methods detect only one object or highlight the objects non-uniformly, i.e., pay more attention to one object than the other. This validates that the other methods work well on the single-object dataset but worse on the multiple-object dataset, while our model achieves good performance on both.

IV. CONCLUSION

In this work, we introduce a novel method to evaluate the saliency of an image based on the rarities of patches and contour-based contrast analysis. Our model achieves higher performance than various state-of-the-art works on two datasets. In future work, we believe that investigating more sophisticated techniques for contour extraction will be beneficial. In addition, we will apply our model to other vision problems.

REFERENCES

[1] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, 1998.
[2] H. Jiang, J. Wang, Z. Yuan, T. Liu, and N. Zheng, "Automatic salient object segmentation based on context and shape prior," in Proc. BMVC, 2011.
[3] S. Goferman, L. Zelnik-Manor, and A. Tal, "Context-aware saliency detection," in Proc. CVPR, 2010.
[4] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in Proc. CVPR, 2013.
[5] M.-M. Cheng, G.-X. Zhang, N. Mitra, X. Huang, and S.-M. Hu, "Global contrast based salient region detection," in Proc. CVPR, 2011.
[6] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in Proc. CVPR, 2012.
[7] Y. Xie, H. Lu, and M.-H. Yang, "Bayesian saliency via low and mid level cues," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1689–1698, 2013.
[8] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proc. CVPR, 2013.
[9] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, "Frequency-tuned salient region detection," in Proc. CVPR, 2009.
[10] A. Borji and L. Itti, "Exploiting local and global patch rarities for saliency detection," in Proc. CVPR, 2012.
[11] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[12] G. Yu, J. Yuan, and Z. Liu, "Unsupervised random forest indexing for fast action search," in Proc. CVPR, 2011.
[13] R. Marée, P. Denis, L. Wehenkel, and P. Geurts, "Incremental indexing and distributed image search using shared randomized vocabularies," in Proc. ACM MIR, 2010.
[14] K. Zhang, L. Zhang, H. Song, and W. Zhou, "Active contours with selective local or global segmentation: A new formulation and level set method," Image Vis. Comput., vol. 28, no. 4, pp. 668–676, 2010.
[15] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image segmentation," Int. J. Comput. Vis., vol. 59, no. 2, pp. 167–181, 2004.
[16] S. Alpert, M. Galun, R. Basri, and A. Brandt, "Image segmentation by probabilistic bottom-up aggregation and cue integration," in Proc. CVPR, 2007.
[17] C. Yang, L. Zhang, and H. Lu, "Graph-regularized saliency detection with convex-hull-based center prior," IEEE Signal Process. Lett., vol. 20, no. 7, pp. 637–640, 2013.