Salient Object Detection on Large-Scale Video Data

Shile Zhang, Dept. of Computer Science, Fudan University, Shanghai, China
Jianping Fan, Dept. of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA
Hong Lu, Xiangyang Xue, Dept. of Computer Science, Fudan University, Shanghai, China

[email protected], [email protected], {honglu,xyxue}@fudan.edu.cn

Abstract
Recently, more and more research has focused on concept extraction from unstructured video data. To bridge the semantic gap between low-level features and high-level video concepts, a mid-level representation of video content, the salient object, is detected based on image segmentation and machine learning techniques. Specifically, 21 salient object detectors are developed and tested on the TRECVID 2005 development video corpus. In addition, a boosting method is proposed to select the most representative features, achieving higher performance than using a single modality and lower complexity than taking all features into account.

1. Introduction
With the prevalence of digital cameras and the development of communication techniques, more and more videos are becoming available, so video retrieval over large-scale collections is becoming increasingly important. However, query-by-example (QBE) of video is inconvenient for most users because representative query examples are usually unavailable, and it is very difficult to automatically infer what the user is interested in. Querying by semantics, which specifies the query concepts via keywords, can probably better meet the user's need. However, extracting the semantics of video data is also hard for researchers because of the variety of video concepts. Focusing on specific video domains may avoid this problem [1][17]. For example, in [1] soccer videos are analyzed and features encoding a lot of human prior knowledge, such as the playfield region, the camera motion, and the players' positions, are extracted to describe the characteristics of soccer videos. In [17], pairwise constraints are added into the regularized empirical risk to deal with the problem of insufficient labeled data in video object classification, tested on two surveillance video data sets. A promising way to narrow the semantic gap is to build a large-scale concept ontology for video [12]. In [16], five graphical models are used for learning the

multi-concept relations to improve the performance of concept detection. Four different mining models are proposed in [15] to reveal semantics beyond the discovered patterns. An event, regarded as a stochastic process in the semantic concept space [6], is a specific pattern in which several concurrent concepts evolve. It is therefore important to develop basic concept detectors on which the hierarchy of the video ontology can be built. Naphade et al. [9][10] build models of regional semantic concepts with four features, including color correlogram, edge orientation histogram, co-occurrence texture, and moment invariants, at four different granularities. Regions with similar semantics are grouped together by Self-Organizing Map (SOM) learning in [20]. In [18][19], a semi-supervised learning framework is used to detect objects in aerial imagery. Multiple-instance learning is also adopted by some researchers [2][11]. In [11], a negative hypothesis is built from the instances in negative bags, and the positive hypothesis is then built from the instances in positive bags, with weights evaluated by the negative hypothesis. Chen et al. [2] embed bags into an instance-based feature space and select the most important features to identify instances that are relevant to the observed classification.
In this paper, we develop 21 salient object detectors defined by the Large-Scale Concept Ontology for Multimedia (LSCOM). We label the regions segmented from the key frames of the video shots of the TRECVID 2005 development data set. Eleven regional features, covering color, texture, edge, and shape, are extracted, and support vector machines are trained on each feature. The final results are obtained by the proposed boosting method, which adjusts the weight of each classifier iteratively in a framework similar to AdaBoost.
The rest of the paper is organized as follows. We give the definition of the salient object in Section 2 and introduce the detection function in Section 3. Experiments and their analysis are presented in Section 4. In Section 5, we give the conclusion and future work.

2. Salient Object
Semantics are expected to be extracted from videos to provide better indices. They are also helpful for retrieving information that is distributed over different modalities of data, such as visual, auditory, and textual. For example, a query for "George Bush" may include his appearance on TV, his speech on the radio, and news about him in newspapers or on the web. However, the semantics of a video clip are based on the complicated architecture of human knowledge, whereas the low-level features are based on signal processing and can easily be obtained directly from the video. The low-level features of different video clips with the same semantics may vary over a wide range; this difference is called the "semantic gap". To bridge the semantic gap, a middle-level representation called the "salient object" was first proposed by Fan et al. [7][8] to express video content efficiently. In [8], salient objects are defined as the visually distinguishable video components that are related to human semantics. According to this definition, we define 21 salient objects to detect on the TRECVID 2005 development video data, which contain more than 70,000 shots. They are listed in Table 1.

Airplane          Animal              Boat_Ship
Bus               Building            Computer_TV-screen
Charts            Car                 Explosion_Fire
Desert            Flag-US             Flowers
Hand              Maps                Mountain
Road              Sky                 Snow
Truck             Vegetation          Waterscape_Waterfront

Table 1. Salient objects to detect.

The labeling results provided by NIST and LSCOM are simply judgments on whether a key frame of a shot contains a concept or not. In order to describe object locations more accurately, image segmentation is applied to all key frames. The parameters of the segmentation function are adjusted so that the images are slightly over-segmented and each region contains as few salient objects as possible. The ideal segmentation is that one region contains only one salient object. In practice, if most of a region belongs to a salient object, the region is labeled as positive for that object. The manual labeling is based on the official annotations, which means that, for a given salient object, the assessors only label the regions of the small portion of shots that contain that salient object. Besides visual salient objects, we believe the definition of salient objects can be extended to the auditory domain, although such objects are hard to label because they usually overlap and have different starting and ending times. Nevertheless, they could be of much help in describing a shot, and we leave them to our future work.
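To make the region labeling rule concrete, the following is a minimal sketch of the majority-overlap criterion described above, assuming binary pixel masks for the segmented region and the annotated object; the 0.5 threshold and the function name are our own illustrative choices, not specified in the paper.

```python
import numpy as np

def label_region(region_mask, object_mask, majority=0.5):
    """Return +1 if most of the region's pixels fall inside the annotated
    object mask, -1 otherwise.  Both masks are boolean arrays of the same
    shape; the 0.5 threshold is an assumed reading of "most part of a region"."""
    region_area = region_mask.sum()
    if region_area == 0:
        return -1
    overlap = np.logical_and(region_mask, object_mask).sum()
    return 1 if overlap / region_area > majority else -1
```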

3. Detection Function
3.1. Image Segmentation
There are many segmentation methods based on different criteria [3][5][13]. Specifically, NCut [13] transforms image segmentation into an eigenvalue problem, while JSeg [5] defines a J-value, which considers both the color and texture information of the image, and looks for a segmentation that minimizes the J-value of the segmented image. Mean shift, applied to image segmentation [3], is a procedure of discontinuity-preserving smoothing and grouping. In this paper, JSeg is adopted for its good performance and low complexity. Some segmentation results are shown in Figure 1. Over-segmentation is used for the purpose mentioned in Section 2.

Figure 1. Some segmentation results of JSeg.
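JSeg itself is not commonly available as an off-the-shelf Python package, so the sketch below illustrates the same over-segmentation idea with scikit-image's Felzenszwalb segmenter as a stand-in; the file path and parameter values are illustrative only.

```python
from skimage import io, segmentation

# A key frame extracted from a shot (path illustrative).
frame = io.imread("keyframe.jpg")

# A small scale / min_size deliberately over-segments the frame, mirroring the
# "slightly over-segmented" setting described in Section 2.
labels = segmentation.felzenszwalb(frame, scale=50, sigma=0.8, min_size=100)
print("number of regions:", labels.max() + 1)
```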

3.2. Feature Extraction
Given a region of an image, 11 features are extracted to characterize the visual properties of this region. They are: (a) the average color and the color variance in L*a*b* color space, with 7 dimensions (the average and the variance on each channel, plus the variance using the Euclidean distance over the three channels); (b) Tamura texture with 15 dimensions (the coarseness, the contrast, and some other histograms); (c) four 5-dimensional features of the co-occurrence matrix in the horizontal, vertical, 45-degree, and 135-degree directions, respectively; (d) the color layout with 12 dimensions, defined in MPEG-7; (e) the scalable color with 64 dimensions, defined in MPEG-7; (f) the edge histogram with 5 dimensions, similar to the edge histogram descriptor defined in MPEG-7; (g) 4-dimensional information about the circumscribing rectangle and the coverage ratio of the region; (h) the 7-dimensional invariant moments.
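As an illustration of how two of the simpler regional descriptors could be computed, the sketch below derives the 7-dimensional L*a*b* color statistics (a) and a 4-dimensional location descriptor (g) for a boolean region mask. The exact composition of descriptor (g) and of the "variance using Euclidean distance" are not fully specified in the paper, so this is one plausible reading; the Tamura, co-occurrence, MPEG-7, and moment descriptors are omitted.

```python
import numpy as np
from skimage import color

def color_stats(frame_rgb, region_mask):
    """Descriptor (a): per-channel mean and variance in L*a*b*, plus the
    variance of the squared Euclidean distance to the mean color (7-D)."""
    lab = color.rgb2lab(frame_rgb)                 # H x W x 3
    pixels = lab[region_mask]                      # N x 3 pixels inside the region
    mean, var = pixels.mean(axis=0), pixels.var(axis=0)
    dist_var = np.sum((pixels - mean) ** 2, axis=1).var()
    return np.concatenate([mean, var, [dist_var]])

def location_feature(region_mask):
    """Descriptor (g): normalized bounding-box center, box-to-image area ratio,
    and the region's coverage of its bounding box (4-D, one plausible reading)."""
    ys, xs = np.nonzero(region_mask)
    h, w = region_mask.shape
    box_h, box_w = ys.ptp() + 1, xs.ptp() + 1
    return np.array([ys.mean() / h, xs.mean() / w,
                     (box_h * box_w) / (h * w),
                     region_mask.sum() / (box_h * box_w)])
```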

3.3. Classifier Training
The support vector machine (SVM) performs well on 2-class learning problems. For a given salient object, if most of a region Xi is, or is part of, this salient object, its label is Yi = +1, and there are transformation parameters W and b such that f(Xi) = W · Φ(Xi) + b ≥ +1. For a negative region, we have f(Xi) = W · Φ(Xi) + b ≤ −1. Φ(·) is the mapping induced by the kernel function, which maps Xi into a higher-dimensional space. We adopt the radial basis function (RBF) as the kernel function in our experiments, due to its effective approximation of the geometric property of each homogeneous regional feature of the image:

    ℜ(Xi, Xj) = exp(−γ ‖Xi − Xj‖²).

The SVM training is the following optimization problem:

    min  ‖W‖²/2 + C Σ_{i=1}^{n} ξ_i,                (1)

where ‖W‖²/2 stands for the reciprocal of the margin between the two supporting planes and ξ_i is the slack variable measuring the training error of sample i. C > 0 is the penalty parameter that trades off the margin of the supporting planes against the training error. The parameter pair (C, γ) strongly affects the performance of the model; N-fold cross-validation and grid search over a wide range are used to obtain the optimal pair.
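A minimal sketch of the per-feature SVM training described above, using scikit-learn's RBF SVC with grid search over (C, γ) and N-fold cross-validation; the data, grid ranges, and fold count are illustrative, not the paper's actual settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: regions x dimensions for ONE of the eleven features; y: +1/-1 labels
# for one salient object.  Random placeholders stand in for real data.
X = np.random.rand(200, 15)
y = np.random.choice([-1, 1], size=200)

param_grid = {"C": 2.0 ** np.arange(-3, 8),      # illustrative grid
              "gamma": 2.0 ** np.arange(-7, 3)}
search = GridSearchCV(SVC(kernel="rbf", probability=True), param_grid, cv=5)
search.fit(X, y)

# Confidence of each region being positive (classes_ is sorted, so column 1
# corresponds to label +1); these values become the x_ij used in Section 3.4.
confidence = search.predict_proba(X)[:, 1]
```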

3.4. Classifier Combination
Each homogeneous feature characterizes a certain kind of visual property for the regions of an image, and different salient objects may have different representative features; as a result, the classifiers trained on different features have different accuracies. It is therefore important to tune the weight distribution of the classifiers, which is effectively a feature selection procedure, for each type of salient object in order to achieve higher performance.

Linear regression combines several variables with a linear function to estimate an expected value. In our framework, the expected value is the label of a region and the variables are the confidences on the different features. That is:

    C(R_i) = Σ_{j=1}^{m} w_j C_j(R_i) + b,                (2)

where m is the number of features, w_j is the weight of the j-th feature, C_j(R_i) is the confidence of R_i on the j-th feature, and b is the bias. In fact, linear regression is equivalent to a one-layer neural network, and w_j and b can be obtained by the Adaline algorithm [14], which modifies w_j and b iteratively until convergence and is proved to converge to the least-squares solution.

The proposed boosting algorithm, which is similar to AdaBoost, is another method that can select the most representative features for the detection of a given salient object. The details of the algorithm are as follows:

1) Assume there are m classifiers trained on the m features, respectively. For a region r_i = ((x_{i1}, x_{i2}, ..., x_{im}), y_i), y_i is its label and (x_{i1}, x_{i2}, ..., x_{im}) are the confidences of r_i being positive, output by the classifiers trained on the different features; each dimension ranges from 0 to 1. The training set is R = {r_1, r_2, ..., r_n} with a distribution D = {d_1, d_2, ..., d_n}, d_i = 1/n, and the initial weights of the classifiers are w_{1,1} = w_{1,2} = ... = w_{1,m} = 1/m. Let t = 1.

2) Measure the goodness of the j-th feature at the t-th iteration,

    g_{j,t} = Σ_{i=1}^{n} d_i x_{ij},         if y_i = +1,
    g_{j,t} = Σ_{i=1}^{n} d_i (1 − x_{ij}),   if y_i = −1,

and the goodness of the ensemble at the t-th iteration,

    G_t = Σ_{j=1}^{m} w_{t,j} g_{j,t}.

3) Update

    d_i ← (d_i / Z_D) · exp(0.5 − Σ_{j=1}^{m} w_{t,j} x_{ij}),   if y_i = +1,
    d_i ← (d_i / Z_D) · exp(Σ_{j=1}^{m} w_{t,j} x_{ij} − 0.5),   if y_i = −1,

where Z_D is the normalization factor.

4) Update

    w_{t+1,j} = (w_{t,j} / Z_g) · exp(g_{j,t}),

where Z_g is the normalization factor.

5) Let t ← t + 1 and repeat from step 2) until t = T.

6) The final classifier is

    C(r) = Σ_{t=1}^{T} Σ_{j=1}^{m} G_t w_{t,j} C_j(r) / Z,

where C_j(r) is the confidence of region r on the j-th feature and Z is also a normalization factor.
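The following is a compact numpy sketch of the boosting combination just described (steps 1-6), under the assumption that X holds the per-feature SVM confidences in [0, 1]. The normalization Z in step 6 is not spelled out in the text, so it is taken here as the sum of G_t · w_{t,j} over all rounds and features, one plausible reading.

```python
import numpy as np

def boost_combination(X, y, T=10):
    """X: n x m matrix of per-feature confidences in [0, 1]; y: labels in {+1,-1}.
    Returns per-round feature weights W (T x m), ensemble goodness G (T,), and a
    function computing the final confidence C(r) for a new region."""
    n, m = X.shape
    d = np.full(n, 1.0 / n)              # step 1: sample distribution D
    w = np.full(m, 1.0 / m)              # step 1: initial classifier weights
    pos = (y == 1)
    W, G = [], []
    for _ in range(T):
        g = d @ np.where(pos[:, None], X, 1.0 - X)   # step 2: feature goodness
        W.append(w.copy())
        G.append(w @ g)                               # step 2: ensemble goodness
        score = X @ w                                 # step 3: re-weight samples
        d = d * np.exp(np.where(pos, 0.5 - score, score - 0.5))
        d /= d.sum()                                  # Z_D
        w = w * np.exp(g)                             # step 4: re-weight features
        w /= w.sum()                                  # Z_g
    W, G = np.array(W), np.array(G)

    def confidence(c):
        """Step 6: c is the length-m vector of per-feature confidences C_j(r)."""
        return float((G[:, None] * W * c).sum() / (G[:, None] * W).sum())
    return W, G, confidence
```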

3.5. Object Detection

After shot boundary detection, the key frames of each shot are segmented into regions, and the eleven regional features described above are extracted, normalized, and input into the corresponding SVM models. The final confidence of a region for a given salient object is the boosted combination of the classifiers trained on each feature. The region is recognized as the salient object with the highest confidence, provided that this confidence is also above an empirical threshold; otherwise, the region is left unrecognized. As shown in Figure 2, the detection is satisfactory.


Figure 2. Salient object detection of video shots.
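As a small illustration of the decision rule in Section 3.5, the snippet below picks the salient object with the highest fused confidence and rejects the region when that confidence falls below the threshold; the object list, confidence values, and threshold value are all illustrative.

```python
import numpy as np

object_names = ["Airplane", "Animal", "Boat_Ship"]   # ... 21 detectors in total
conf = np.array([0.12, 0.71, 0.30])                  # fused confidences for one region
THRESHOLD = 0.5                                      # empirical threshold (illustrative)

best = int(conf.argmax())
label = object_names[best] if conf[best] >= THRESHOLD else "unrecognized"
print(label)   # -> Animal
```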

4. EXPERIMENTS

Our experiments are conducted on the TRECVID 2005 development video data, which contain more than 70,000 shots; 2/3 of these shots are used for training and 1/3 for testing. To evaluate the performance more conveniently, an index considering both precision and recall is used, defined as:

    F1 = 2 × Precision × Recall / (Precision + Recall).
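A tiny helper showing the F1 computation, with a guard for the degenerate zero case (our own addition); the example values are illustrative.

```python
def f1_score(precision, recall):
    """F1 as defined above; returns 0 when precision + recall is 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.80, 0.67), 2))   # -> 0.73
```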

4.1. Comparison of Combination Methods

Three different combination methods are compared: equal weights for all classifiers, linear regression, and our boosting algorithm. The comparison is shown in Figure 3. Linear regression performs noticeably worse because its objective is the least-squares error, which is not suitable for this unbalanced dataset, in which there are many more negative samples than positive ones. From Table 2, we find that the detectors using the proposed boosting method achieve the highest maximum, minimum, and average F1 and the lowest variance, which means their performance is the most stable.


Figure 3. Experiment results for different combination methods.

Combination Methods   Max    Min    Avg    Var
Same Weight           0.86   0.49   0.73   0.008
Linear Regression     0.81   0.36   0.60   0.018
Boosting              0.88   0.54   0.74   0.006

Table 2. Statistical comparison of combination methods.

4.2. Boosting Iteration Times
In this experiment, the effect of the number of iterations T in the boosting method is examined. From Figure 4, we can see that only 4 salient objects benefit slightly from increasing the number of iterations while the others remain stable, which shows that the convergence speed of the boosting method is satisfactory.

Figure 4. Experiment results for different iteration times in boosting.

4.3. Number of Features
After boosting, the features with smaller weights are discarded to verify the effectiveness of adding features. Figure 5 shows that most of the features in the feature set contribute little to the performance of most salient object detectors, because adding or discarding features has no obvious effect, except when only one feature is left. The performance drops dramatically when 10 features are discarded and only a single feature remains. This remaining feature has the largest weight, which means it is the best among the eleven features. This demonstrates that multiple-feature fusion is necessary, since it achieves better performance than any single feature, even the best one; but too many features are unnecessary, since the additional features contribute little. In other words, feature fusion helps improve classification performance, but there is an upper bound on the benefit of adding new features.

Figure 5. Experiment results for different numbers of features discarded after boosting.

4.4. Distribution of Features' Weights
For every salient object detector, we output the features' weights after boosting. Simple statistics, such as the average and the variance, are used to analyze the importance of each feature. From Table 3, we notice that the three color features have much larger weights than the four co-occurrence matrix features, and that the variances of the weight distributions of these four co-occurrence matrix features are very small. This means that color features are more discriminative than statistical texture features when used to recognize objects in TV programs. Moreover, the moment feature has a smaller weight than the location feature, because the segmentation is far from perfect and we even over-segment the images on purpose. Location information is more stable than moments, since it only depends on the bounding box of the region, but such information is too weak for classifying objects, so its weight usually remains below the average level.

Features               Max     Min     Avg     Var
Avg. color and var.    0.399   0.068   0.148   0.0065
Color layout           0.264   0.053   0.112   0.0022
Scalable color         0.357   0.048   0.095   0.0054
Tamura                 0.453   0.050   0.128   0.0074
Horizontal CM          0.104   0.042   0.075   0.0002
Vertical CM            0.090   0.042   0.071   0.0002
45-degree CM           0.090   0.041   0.071   0.0003
135-degree CM          0.090   0.041   0.071   0.0003
Edge histogram         0.118   0.051   0.083   0.0004
Invariant moment       0.090   0.040   0.068   0.0003
Location               0.137   0.044   0.079   0.0006

Table 3. Distribution of features' weights.

5. CONCLUSION

By segmenting and labeling the key frames of the shots of the TRECVID 2005 development video, we define 21 salient objects to detect, which help achieve a middle-level understanding of the video shots. Eleven regional features on color, shape, edge, and texture are extracted, and a boosting method is adopted for feature selection after training an SVM on each feature. Besides its fast convergence, our boosting method performs better than the two other common combination methods, and it can be used to select the necessary features so as to reduce the complexity of prediction. More features do not necessarily lead to higher performance, so exploring this upper bound theoretically is an interesting direction. Besides, since a number of salient objects are labeled, the relationships between these objects, such as the ratio of the border length shared by two regions to the perimeter of one of them [4], have not been considered; modeling them may further improve the performance of the detectors. Using these detectors to learn scene-level concepts is also planned as our next step. Furthermore, auditory salient objects are useful, especially in videos, since the audio supplies important cues for semantics learning; the development of auditory salient object detectors is also challenging.

6. References
[1] J. Assfalg, M. Bertini, C. Colombo, A. D. Bimbo, W. Nunziati, "Automatic extraction and annotation of soccer video highlights," in Proc. ICIP, pp. 527-530, 2003.
[2] Y. Chen, J. Bi, J. Z. Wang, "MILES: Multiple-instance learning via embedded instance selection," IEEE Trans. PAMI, vol. 28, no. 12, pp. 1931-1947, 2006.
[3] D. Comaniciu, P. Meer, "Mean shift: a robust approach toward feature space analysis," IEEE Trans. PAMI, vol. 24, no. 5, pp. 603-619, 2002.
[4] R. Datta, W. Ge, J. Li, J. Z. Wang, "Toward bridging the annotation-retrieval gap in image search by a generative modeling approach," in Proc. ACM Multimedia, pp. 977-986, 2006.
[5] Y. Deng, B. S. Manjunath, "Unsupervised segmentation of color-texture regions in images and video," IEEE Trans. PAMI, vol. 23, no. 8, pp. 800-810, 2001.
[6] S. Ebadollahi, L. Xie, S.-F. Chang, J. R. Smith, "Visual event detection using multi-dimensional concept dynamics," IEEE conf. Multimedia and Expo, pp. 881-884, 2006.
[7] J. Fan, H. Luo, A. K. Elmagarmid, "Concept-oriented indexing of video databases: toward semantic sensitive retrieval and browsing," IEEE Trans. Image Processing, vol. 13, no. 7, pp. 974-992, 2004.
[8] J. Fan, H. Luo, J. Xiao, L. Wu, "Semantic video classification and feature subset selection under context and concept uncertainty," in Proc. Joint ACM/IEEE Conf. Digital Libraries, pp. 192-201, 2004.
[9] M. R. Naphade, A. Natsev, C. Lin, J. R. Smith, "Multi-granular detection of regional semantic concepts," IEEE conf. Multimedia and Expo, pp. 109-112, 2004.
[10] M. R. Naphade, J. R. Smith, "Learning visual models of semantic concepts," IEEE conf. Multimedia and Expo, pp. 531-534, 2003.
[11] M. R. Naphade, J. R. Smith, "A generalized multiple instance learning algorithm for large scale modeling of multimedia semantics," ICASSP, pp. 341-344, 2005.
[12] M. R. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, J. Curtis, "Large-scale concept ontology for multimedia," IEEE Multimedia, vol. 13, no. 3, pp. 86-91, 2006.
[13] J. Shi, J. Malik, "Normalized cuts and image segmentation," IEEE Trans. PAMI, vol. 22, no. 8, pp. 888-905, 2000.
[14] B. Widrow, M. E. Hoff, "Adaptive switching circuits," IRE Western Electric Show and Convention Record, vol. 4, pp. 96-104, 1960.
[15] L. Xie, S.-F. Chang, "Pattern mining in visual concept streams," IEEE conf. Multimedia and Expo, pp. 297-300, 2006.
[16] R. Yan, M. Chen, A. G. Hauptmann, "Mining relationship between video concepts using probabilistic graphical models," IEEE conf. Multimedia and Expo, pp. 301-304, 2006.
[17] R. Yan, J. Zhang, J. Yang, A. G. Hauptmann, "A discriminative learning framework with pairwise constraints for video object classification," IEEE Trans. PAMI, vol. 28, no. 4, pp. 578-593, 2006.
[18] J. Yao, Z. Zhang, "Semi-supervised learning based object detection in aerial imagery," IEEE conf. CVPR, vol. 1, pp. 1011-1016, 2005.
[19] J. Yao, Z. Zhang, "Object detection in aerial imagery based on enhanced semi-supervised learning," IEEE conf. ICCV, vol. 2, pp. 1012-1017, 2005.
[20] R. Zhang, Z. Zhang, "Image database classification based on concept vector model," IEEE conf. Multimedia and Expo, pp. 93-96, 2005.