Scene Classification based on Histogram of Detected Objects

Siyi Shuai, Junichi Miyao, and Takio Kurita
Department of Information Engineering, Hiroshima University
1-4-1 Kagamiyama, Higashi-Hiroshima, 739-8527, Japan

Abstract—Video content analysis has been a hot topic for computer vision researchers. The amount of video on the Internet keeps growing, with hundreds of hours of video uploaded to YouTube every minute, so it is necessary to study video-related algorithms that help us handle these videos. In this paper, we propose a method to classify the scenes of video clips based on their context. To extract context information, the objects in each frame are detected by the YOLO9000 object detection algorithm [1], and the detection results of all frames in a video clip are accumulated into a histogram, similar to the bag-of-words method, which is one of the most popular methods for text classification. A support vector machine (SVM) is then used to classify the scene of each video clip. For multi-class classification, the one-against-rest approach is used. The proposed algorithm is applied to a dataset of video clips that contains a part of the Hollywood2 dataset [2], a part of the YouTube 8M dataset [3], and video clips taken from other movies and videos. The dataset contains 7000 video clips for 6 scene categories.

Fig. 1: The 6 scene classes: (a) Restaurant, (b) Road, (c) Office, (d) Living room, (e) Kitchen, (f) Bedroom.

I. INTRODUCTION

Recently, video processing has become a hot topic in computer vision. For a long time, researchers spent a great deal of effort on image classification and image processing. As the study of image analysis progressed and its results improved, researchers began to pay attention to harder topics such as video analysis, and video content understanding has already become a hot topic.

Understanding video content is harder than understanding images. In a video the scene keeps changing with the human activity, whereas an image is static and a single image can represent one kind of scene. A series of images of a scene contains certain objects, and these objects can be used as attributes for scene representation [4]. In a video, however, different objects appear as the camera moves. For example, in a video of a living-room scene we may detect objects such as "desk", "book", "lamp", "chair", and "vase" in the first second, and a different set of objects such as "sofa", "pot", "car", "chair", and "bottle" in the next second. It is therefore hard to represent the scene of a video with a single frame.

Meanwhile, there has been significant progress on related tasks such as object detection [1], the bag-of-words model, and scene classification. For object detection, great progress has been achieved by using deep Convolutional Neural Networks (CNN). Girshick et al. proposed an object detection algorithm that combines selective search with a deep CNN [5]. Since then, research on object detection


has developed rapidly, producing methods such as Fast R-CNN [6], Faster R-CNN [7], YOLO [8], and SSD [9]. In particular, the YOLO9000 algorithm can detect more than 9000 object classes in almost real time [1].

For image classification and image retrieval, bag-of-words is one of the well-known methods to extract features from a given image [10], [11]. Usually, SIFT descriptors [12] are extracted from local regions of the training images and clustered into a visual vocabulary by the k-means algorithm. The histogram of the visual vocabulary is then constructed from the SIFT descriptors extracted from a given image and used as the input of an SVM classifier. In this approach, visual words are created by k-means clustering of features extracted from local regions of the images. It is, however, unlikely that the obtained local regions contain meaningful objects; the local regions are too primitive to serve as visual words. T. Matsukawa et al. proposed an

algorithm in which the information obtained by face detection is accumulated as a histogram and used to classify the state of an audience in video [13].

In this paper, we use the objects detected by YOLO9000 as cues of the scene (visual words) for video scene classification. In the proposed algorithm, the objects in each frame of a video clip are detected, and the occurrences of the objects over all frames of the clip are accumulated into a histogram as a bag of visual words. The histogram is used as the feature vector of the video clip and is classified by a Support Vector Machine (SVM). The proposed algorithm is evaluated on a dataset that consists of 6 classes (bedroom, living room, kitchen, restaurant, office, and road), as shown in Figure 1. The dataset used in this paper contains part of the Hollywood2 data [2] and part of the YouTube 8M data [3], and we also clipped video segments from other movies and videos to build the research dataset.

This paper is organized as follows. Related works are briefly reviewed in Section II and the proposed algorithm is explained in Section III. Experimental results are presented in Section IV. Section V concludes the paper.

II. RELATED WORKS

A. Scene Classification by bag-of-visual-words based on face detection

The original bag-of-words model is one of the most popular methods for text classification and is usually used in information retrieval [14]. The key idea is that word order, grammar, syntax, and other elements of a document are ignored; the document is viewed only as a collection of words, and each word appearing in the document is assumed to be independent. In the bag-of-words representation, a document is represented by a vector of the counts of the words that appear in it. The bag-of-words vector can be normalized to unit length and scaled so that common words are less important than rare words, as in the TF-IDF representation.

The bag-of-words model has also been applied in computer vision, for example to image classification and image retrieval. Inspired by the success of this text classification method, G. Csurka et al. [10] used the term bag of keypoints to describe an approach that applies frequency-based techniques to image classification. In their original algorithm, SIFT descriptors extracted from affine covariant regions are clustered into a visual vocabulary by k-means. The histogram of the visual vocabulary is then constructed from the SIFT descriptors of a given image and classified by a naive Bayes classifier or an SVM. The algorithm was later improved by extracting SIFT descriptors on a regular grid and by using boosting instead of an SVM [11]. S. Lazebnik et al. proposed an approach called spatial pyramid matching [15]. The local descriptors are accumulated in a spatial pyramid of bins containing word counts (histograms), and a pyramid match kernel combines histogram intersection counts in a hierarchical fashion.

In these approaches, visual words are created by k-means clustering of features extracted from local regions. However, it is

unlikely that the obtained local regions contain meaningful objects; the local regions are too primitive to serve as visual words. To extract more meaningful visual words, T. Matsukawa et al. [13] proposed an algorithm in which the information obtained by face detection is accumulated as a histogram. The authors demonstrated the classification of an audience's state in video sequences by voting of facial expressions and face directions.

B. Object detection by using deep CNN

Great progress in object detection has been achieved by using deep CNNs. R. Girshick et al. proposed an object detection algorithm that combines selective search [16] with a deep CNN [5]. In this algorithm, candidate regions of the objects in the image are generated by selective search, the regions are normalized to a fixed size, and the features extracted by a trained CNN are classified by an SVM. To reduce the redundant computation caused by overlaps between the candidate regions, K. He et al. proposed a technique called spatial pyramid pooling [17], [18]. Feature vectors are extracted by spatial pyramid pooling after computing large feature maps from the input image, which yields a considerable speedup. In Fast R-CNN [6], the network structure is simplified by introducing an ROI pooling layer, which provides adjustable pooling, and bounding-box regression and classification are trained simultaneously with a multi-task loss function. Faster R-CNN [7] introduced a region proposal network to estimate the candidate regions, further improving both the accuracy and the speed of object detection. J. Redmon et al. proposed an algorithm called YOLO (You Only Look Once), in which the image is partitioned into a 7 × 7 grid and the bounding boxes and classes are estimated for each grid cell [8]. By using the contextual information around the target object, errors on the background are reduced. SSD (Single Shot MultiBox Detector), proposed by W. Liu et al. [9], takes a similar approach to YOLO and achieves a similar speed with better classification accuracy.

YOLO9000 [1] is an improvement of YOLO. The authors improved the YOLO model structure, calling the result YOLOv2. YOLOv2 achieved 76.8 mAP at 67 FPS on the VOC2007 dataset and was both more accurate and faster than Faster R-CNN and SSD. Based on YOLOv2, the authors proposed the WordTree method to combine various kinds of datasets and a joint optimization of detection and classification to train a single model, called YOLO9000. YOLO9000 achieves 19.7 mAP and 16.0 mAP on the ImageNet dataset and the COCO dataset, respectively.

III. SCENE CLASSIFICATION BASED ON HISTOGRAM OF DETECTED OBJECTS

A. Outline of the proposed Method

In the standard bag-of-words for image classification, the features of local regions are used as visual words. However, it is rare that meaningful objects are contained in such local regions, and many of the local regions probably contain meaningless parts of objects. To extract visual words with a firm meaning from video clips, we use the results of object detection on the frames of each clip. Given the progress of recent object detection methods, we expect that the detection results can be used to define meaningful visual words. Usually, each scene has its own characteristic objects: for example, there must be a bed in a bedroom and there must be cars on a road. The results of object detection are therefore useful for classifying the scene of a video clip. In this paper, the information of the detected objects is accumulated into a histogram of the occurrences of the detected objects, which serves as the bag of visual words. The feature vectors of the video clips are then classified by an SVM. The outline of the proposed algorithm is shown in Figure 2.

Fig. 2: Outline of the proposed method.

YOLO9000 can detect more than 9000 object classes in the video clips, but many of these classes are unnecessary for classifying the target scenes; most of them simply do not appear in the target scenes. To extract information about the important objects, we restricted the object classes to a subset of all YOLO9000 classes by deleting the classes that are not related to the target scenes. We also removed some redundant object classes such as "person", which appears in all scenes and therefore has low discriminative power. Let x be the vector whose elements are selected from the histogram h obtained by the YOLO9000 object detection algorithm. This computation of the bag-of-visual-words is illustrated in Figure 3, and a small code sketch of the selection step is given below.
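As an illustration of the selection step, the following minimal sketch keeps only a whitelist of scene-related object classes from the full detection histogram. The class names, the `SELECTED_CLASSES` list, and the `select_visual_words` helper are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

# Hypothetical whitelist of scene-related YOLO9000 class names (illustrative only).
SELECTED_CLASSES = ["bed", "sofa", "desk", "chair", "car", "traffic light",
                    "stove", "refrigerator", "wine glass", "laptop"]

def select_visual_words(h, class_names, selected=SELECTED_CLASSES):
    """Build the feature vector x by picking the selected object classes
    out of the full detection histogram h (one count per YOLO9000 class)."""
    index = {name: i for i, name in enumerate(class_names)}
    return np.array([h[index[name]] if name in index else 0.0
                     for name in selected])
```

Because the selection is fixed, every video clip yields a feature vector of the same length (54 classes in the experiments of Section IV).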

B. Object Detection

We use YOLO9000 [1] to detect the objects appearing in the video clips. The method takes a color video clip of size 600 × 800 as input and outputs the names of the detected objects together with their class indices. Let O be the number of object classes that the detection algorithm can detect; for YOLO9000, O is equal to 9418. For a video clip with F frames, the detected objects in the f-th frame are accumulated in a histogram

$$ h_f = \left( h_{f1}, h_{f2}, \ldots, h_{fO} \right), \quad f = 1, \ldots, F, \tag{1} $$

where h_fi is the number of occurrences of the i-th object class in frame f. The histograms of all frames in the video clip are then accumulated into a single histogram

$$ h = \sum_{f=1}^{F} h_f. \tag{2} $$
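The accumulation in Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration assuming a hypothetical `detect_objects(frame)` wrapper that returns the list of detected class indices for one frame; it is not the authors' actual code.

```python
import numpy as np

O = 9418  # number of object classes YOLO9000 can detect

def clip_histogram(frames, detect_objects, num_classes=O):
    """Accumulate the per-frame count histograms h_f (Eq. (1)) into a
    single clip histogram h (Eq. (2))."""
    h = np.zeros(num_classes)
    for frame in frames:
        h_f = np.zeros(num_classes)            # histogram of frame f
        for class_index in detect_objects(frame):
            h_f[class_index] += 1              # count the i-th object class
        h += h_f                               # accumulate over all F frames
    return h
```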

C. Bag-of-visual-words

We use the bag-of-words approach to extract the frequencies of the visual words (detected objects) appearing in the video clips.

D. Classification by SVM

We use an SVM for classification. The SVM classifier for binary classification is defined as

$$ y = \operatorname{sgn}(w^T \phi(x) + b), \tag{3} $$

where φ(x) is a mapping from the input vector x to the feature space of the linear classification and sgn is defined as

$$ \operatorname{sgn}(x) = \begin{cases} 1 & x > 0 \\ -1 & \text{otherwise.} \end{cases} \tag{4} $$

Let {(x_i, t_i) | i = 1, ..., N} be the set of training samples, where x_i is the bag-of-visual-words feature vector (histogram) and t_i ∈ {1, −1} is the class label of the i-th training sample. The objective function of the SVM is then defined as

$$ Q = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i, \tag{5} $$

where C is the penalty parameter of the error term, which trades off misclassification of the training examples against simplicity of the decision surface. The objective function is minimized subject to

$$ t_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \tag{6} $$
$$ \xi_i \ge 0, \quad i = 1, \ldots, N. \tag{7} $$

Fig. 3: The computation of the histogram (bag-of-visual-words) of a video clip.

As the kernel function K(x, y) = φ(x)^T φ(y) of the kernel SVM, we used the linear kernel

$$ K_{\mathrm{lin}}(x, y) = x^T y, \tag{8} $$

the Radial Basis Function (RBF) kernel

$$ K_{\mathrm{rbf}}(x, y) = \exp(-\gamma \|x - y\|^2), \tag{9} $$

and the polynomial kernel

$$ K_{\mathrm{poly}}(x, y) = (\gamma x^T y + r)^p. \tag{10} $$

The RBF kernel has a parameter γ, which determines how far the influence of a single training sample reaches. Similarly, the polynomial kernel has three parameters: γ, r, and p. To classify multiple classes, the one-against-rest approach is applied.
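As a rough sketch of this classification stage, the three kernels of Eqs. (8)-(10) and the one-against-rest scheme could be set up with scikit-learn as below. The use of scikit-learn, the helper name `build_classifiers`, and the default parameter values are assumptions for illustration, not necessarily the authors' implementation.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def build_classifiers(C=1.0, gamma=1.0 / 54, r=0.0, p=3):
    """Linear, RBF, and polynomial SVMs wrapped for one-against-rest
    multi-class classification (gamma, coef0, degree follow Eqs. (8)-(10))."""
    return {
        "linear": OneVsRestClassifier(SVC(kernel="linear", C=C)),
        "rbf": OneVsRestClassifier(SVC(kernel="rbf", C=C, gamma=gamma)),
        "poly": OneVsRestClassifier(SVC(kernel="poly", C=C, gamma=gamma,
                                        coef0=r, degree=p)),
    }

# Example usage (X_train holds bag-of-visual-words histograms, y_train scene labels):
# clf = build_classifiers(C=100, gamma=1e-4)["rbf"]
# clf.fit(X_train, y_train)
# accuracy = clf.score(X_test, y_test)
```

Note that a bare SVC would handle multiple classes with a one-vs-one scheme; the OneVsRestClassifier wrapper is what matches the one-against-rest approach described above.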


IV. EXPERIMENTS

A. Dataset

In the fields of action recognition and video content understanding, researchers often use datasets such as UCF101, THUMOS14, Hollywood2, HMDB51, and KTH. For example, the Hollywood2 dataset [2] contains 1,152 video clips from 69 movies representing 10 scene classes. Recently, larger datasets have become available; the YouTube 8M dataset [3], released in 2016, contains 7 million video URLs representing 450,000 hours of video divided into 4716 classes.

To evaluate the effectiveness of the proposed approach, we built a new dataset for classifying video scenes, as shown in Figure 4. The dataset includes videos filtered from the scene videos of Hollywood2 and videos screened from YouTube 8M. We also included clips from


other movies and YouTube videos. Currently, our dataset contains 7000 video clips for 6 scene categories: bedroom, living room, kitchen, office, road, and restaurant. We selected these scenes because they appear in most human activities, so being able to classify them automatically could be useful. The size of each video clip in the dataset is normalized to 600 × 800 pixels and the length of each clip is about 5 seconds.

Fig. 4: Examples from the dataset: (a) Hollywood2, (b) YouTube 8M, (c) clips collected by ourselves.
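The clip normalization described here could be sketched with OpenCV as follows. Interpreting 600 × 800 as height × width, the frame-rate fallback, and the function name are assumptions; the authors do not specify their preprocessing tooling.

```python
import cv2

def load_normalized_frames(video_path, height=600, width=800, max_seconds=5):
    """Read a clip, resize every frame to the target resolution, and keep
    roughly the first `max_seconds` of video (assumed preprocessing)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0    # fall back if FPS is unavailable
    frames = []
    while len(frames) < int(fps * max_seconds):
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (width, height)))  # cv2 expects (w, h)
    cap.release()
    return frames
```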

B. Feature Extraction

As explained in Section III, the bag-of-visual-words feature vectors x were extracted by applying the YOLO9000 object detection algorithm to the frames of each video clip. The detection threshold of YOLO9000 was set to 0.15 and the hierarchical threshold to 0.5; the hierarchical threshold determines the level of the WordTree [1] at which object classes are detected. The number of visual words was restricted to 54 object classes for the 6 scenes by investigating which objects are useful for classification in the video clips. Figure 5 shows examples of the obtained bag-of-visual-words feature vectors (histograms) for the 6 target scenes. It can be seen that the feature vectors clearly differ depending on the scene.

Fig. 5: Examples of the bag-of-visual-words feature vectors (histograms) for 6 scenes: (a) Bedroom, (b) Office, (c) Road, (d) Living room, (e) Kitchen, (f) Restaurant.

C. Classification Results

To evaluate the classification accuracy of the proposed approach, we divided the dataset into 6000 samples for training and 1000 samples for testing. The training samples were used to determine the parameters of the SVM and the test samples were used to evaluate the performance of the proposed algorithm. Figure 6 shows the classification accuracy obtained by the linear SVM for different values of the parameter C, together with the test loss

$$ E_{\mathrm{test}} = \sum_{i=1}^{N_{\mathrm{test}}} \xi_i, \tag{11} $$

where N_test = 1000 is the number of test samples. The recognition accuracy fluctuates between 0.954 and 0.968 and the test loss fluctuates between 0.183 and 0.329 when the parameter C of the linear SVM is changed from 1 to 10^5. The best accuracy, 0.968, is achieved at C = 10^2.

Fig. 6: Scene classification performance of the linear SVM with different values of the parameter C.

For the RBF kernel SVM, the default value of γ is 1/54, where 54 is the number of object classes in the bag-of-visual-words. With both C and γ set to their default values, the classification accuracy is 0.435. We then searched for better parameters by trying various values of γ with C = 1. The classification accuracy increases as γ decreases, and when γ is set smaller than 10^-4 the RBF kernel SVM achieves performance similar to the linear SVM. For the polynomial kernel SVM, the classification accuracy is 0.947 with the default parameters (C = 1, γ = 1/54, and p = 3).

The best parameters for the kernel SVMs were searched by the grid search method based on cross-validation; a sketch of such a search is given after Table I. Table I shows the best classification accuracy obtained for each SVM model.

TABLE I: The classification accuracies obtained by the linear SVM, the RBF kernel SVM, and the polynomial kernel SVM. The best accuracy for each model was found by grid search based on cross-validation. The linear SVM achieves its best performance at C = 2. The best accuracy of the RBF kernel SVM was obtained with C = 100 and γ = 10^-4, and that of the polynomial kernel SVM with C = 1, γ = 0.01, and p = 1.

SVM model               Accuracy
Linear SVM              0.968
RBF kernel SVM          0.968
Polynomial kernel SVM   0.964
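The grid search mentioned above could look roughly like the following scikit-learn sketch. The parameter grids are illustrative; the exact search ranges used by the authors are not reported.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative parameter grids for the three kernels of Eqs. (8)-(10).
param_grid = [
    {"kernel": ["linear"], "C": [1, 2, 10, 100]},
    {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [1e-2, 1e-3, 1e-4]},
    {"kernel": ["poly"], "C": [1, 10], "gamma": [1e-2, 1e-3],
     "degree": [1, 2, 3], "coef0": [0.0, 1.0]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross-validation
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```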

With the parameters found by the grid search, the linear SVM and the RBF kernel SVM reach the same classification performance, which is also the best result.

V. CONCLUSION

In this paper, we proposed a scene classification method based on a bag-of-visual-words feature vector obtained by object detection on each frame of a video clip. By using the results of object detection, we obtain a feature vector whose visual words have a firm meaning. In experiments using video clips from the Hollywood2 dataset [2], the YouTube 8M dataset [3], and our own dataset, 6 classes ("Bedroom", "Living Room", "Office", "Restaurant", "Kitchen", and "Road") were classified and a classification accuracy of about 97% was achieved.

For future work, we would like to include other cues such as the actions appearing in the video clips. We would also like to develop a scene classification method in which all the processing units for feature extraction and classification are integrated into a single deep neural network.

ACKNOWLEDGMENT

This work was partly supported by JSPS KAKENHI Grant Number 16K00239.

REFERENCES

[1] Joseph Redmon and Ali Farhadi, "YOLO9000: Better, Faster, Stronger," Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 6517-6525, 2017.
[2] M. Marszalek, I. Laptev, and C. Schmid, "Actions in Context," Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), 2009.
[3] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan, "YouTube-8M: A Large-Scale Video Classification Benchmark," Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017).
[4] Li-Jia Li, Hao Su, Yongwhan Lim, and Li Fei-Fei, "Objects as Attributes for Scene Classification," Trends and Topics in Computer Vision (ECCV 2010 Workshops), Vol. 1, pp. 57-69, 2010.
[5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), pp. 580-587, 2014.

[6] Ross Girshick, "Fast R-CNN," Proc. of the IEEE International Conference on Computer Vision (ICCV 2015), pp. 1440-1448, 2015.
[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Advances in Neural Information Processing Systems 28 (NIPS 2015), pp. 91-99, 2015.
[8] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pp. 779-788, 2016.
[9] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, "SSD: Single Shot MultiBox Detector," Proc. of the European Conference on Computer Vision (ECCV 2016), pp. 21-37, 2016.
[10] Gabriela Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray, "Visual Categorization with Bags of Keypoints," Proc. of the Workshop on Statistical Learning in Computer Vision, ECCV 2004, pp. 1-22, 2004.
[11] Gabriela Csurka, Christopher R. Dance, Florent Perronnin, and Jutta Willamowski, "Generic Visual Categorization Using Weak Geometry," Toward Category-Level Object Recognition, pp. 207-224, 2006.
[12] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.
[13] Tetsu Matsukawa, Akinori Hidaka, and Takio Kurita, "Classification of Spectators' State in Video Sequences by Voting of Facial Expressions and Face Directions," Proc. of the IAPR Conference on Machine Vision Applications (MVA 2009), pp. 426-430, 2009.
[14] Yin Zhang, Rong Jin, and Zhi-Hua Zhou, "Understanding Bag-of-Words Model: A Statistical Framework," International Journal of Machine Learning and Cybernetics, Vol. 1, Issue 1-4, pp. 43-52, 2010.
[15] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), Vol. 2, pp. 2169-2178, 2006.
[16] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective Search for Object Recognition," International Journal of Computer Vision, Vol. 104, Issue 2, pp. 154-171, 2013.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," Proc. of the European Conference on Computer Vision (ECCV 2014), pp. 346-361, 2014.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 37, No. 9, 2015.