Simple Yet Efficient Content Based Video Copy Detection

arXiv:1804.07019v1 [cs.MM] 19 Apr 2018

Simple Yet Efficient Content Based Video Copy Detection

Jörg P. Bachmann (1) and Benjamin Hauskeller (2)

(1) [email protected]
(2) [email protected]
Humboldt-Universität zu Berlin, Germany

April 20, 2018

Abstract

Given a collection of videos, how can content-based copies be detected efficiently and with high accuracy? Detecting copies in large video collections remains one of the major challenges of multimedia retrieval. While many video copy detection approaches suffer from high computation times and insufficient quality, we propose a new, efficient content-based video copy detection algorithm that improves both aspects. The idea of our approach is to utilize self-similarity matrices as video descriptors in order to capture different visual properties. We benchmark our algorithm on the MuscleVCD ST1 benchmark dataset and show that our approach achieves a score of 100%, and a score of at least 93% over a wide range of parameters.

1 Introduction

Nowadays, a vast amount of video data is uploaded and shared on community sites such as YouTube or Facebook. This leads to many problem statements such as copyright protection, duplicate detection, analyzing statistics of particular broadcast advertisements, or simply searching large videos for certain scenes or clips. Two basic approaches exist to address these challenges, namely watermarking and content-based copy detection (CBCD). Watermarking suffers from being vulnerable to transformations frequently performed during the creation of a copy of a video (e.g. resizing or re-encoding). Furthermore, watermarking cannot be used on videos that were not marked before distribution. In contrast, CBCD finds copies of an original video by directly comparing the contents and is thus more robust against transformations applied during copy creation. These transformations include resolution and encoding changes, cropping, blurring, and the insertion of logos. Hence, copies are near-duplicates, and it is natural to use a distance or similarity function to discover them.

Of course, the distance function needs to be robust against the transformations mentioned above. On the other hand, it has to discriminate videos from different sources in order to reduce the false alarm rate. In this paper, a video descriptor

is proposed which leads to a distance function successfully addressing these goals. The proposed video descriptor is a derivative of the self-similarity matrix, which is a common tool for the structural analysis of time series. The main purpose of the self-similarity matrix is to create recurrence plots; in this way, similar subsequences of one time series can be found [8, 13]. An example application is the structural analysis of music track arrangements [14, 9]. However, we do not use self-similarity to analyze a single time series; instead, we use the (reduced) self-similarity matrices of two time series to assess their similarity.

1.1 Related Work

Although we cannot give a full overview of this topic, we roughly classify the existing approaches related to CBCD. Videos are sequences of images, or frames, hence comparing videos is based on comparing images. Many approaches compare global features created per frame [19, 5, 4]. These global features include mean color values and color histograms. To achieve higher discriminability, each image is partitioned into a grid and global features are calculated for each block of the grid [18, 6, 19]. In contrast to global features, local features are more robust against transformations when searching for similar images [12]. For example, Harris corner detectors are used to create feature descriptors in [15] and [16]. Since these global and local feature descriptors are created on a per-frame basis, they are called spatial descriptors.

For video CBCD, temporal information also needs to be taken into account (see [6, 7, 18, 19, 5, 4]), leading to so-called spatio-temporal descriptors. One approach compares temporal ordinal rankings of global features [6, 11, 10]. Since ordinal rankings depend on a total order of the features, they cannot be built for the more robust local features. A new concept called video strands, indicating the movement of colored regions within half-second segments, was introduced as a space-time descriptor by the authors of [7]. The edit distance was first used to compare videos in 1998 [3, 4]: the global features of frames are quantized to obtain strings over an alphabet, so that the edit distance can be applied to compare two videos. Recently, different derivations of this technique have been developed [18, 19, 5].

1.2 Contribution and Organization

A contribution of our spatio-temporal video descriptor (see Section 2) is its modularity, i.e. the separation of temporal and spatial aspects. It uses an arbitrary distance or similarity function comparing two images to create the descriptor for a video. Hence, the user is free to choose any suitable distance function out of the large set already defined by the research community [17] that properly describes the distances between images. Due to this flexibility, we were able to improve our simple yet fast image distance function to achieve a score of 100% in the MuscleVCD ST1 benchmark. Comparing two videos is then performed by comparing these descriptors without calling the image distance function again. Thus, the execution time of comparing two videos at runtime does not depend on the complexity of the underlying image distance function, even if images are compared using expensive, robust, and discriminating distance functions. With this descriptor, we obtain a distance function fulfilling all the requirements described above.

The remainder of this paper is organized as follows. The proposed video distance function as well as the CBCD decision algorithm are motivated and defined in Section 2. Experiments evaluate their efficiency in Section 3. Section 4 focuses on conclusions and future work.
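To make the notion of a self-similarity matrix concrete, the following is a minimal sketch (not the authors' implementation) of how such a matrix can be computed for a time series; the Euclidean distance used here is a placeholder for any image distance function, and all names are illustrative.

```python
import numpy as np

def self_similarity_matrix(series, dist):
    """Compute the self-similarity (recurrence) matrix of a time series.

    series: sequence of n elements (e.g. per-frame feature vectors)
    dist:   distance function on pairs of elements
    """
    n = len(series)
    ssm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ssm[i, j] = dist(series[i], series[j])
    return ssm

# Toy example: a 1-D "time series" with a repeated subsequence.
series = np.array([[0.0], [1.0], [2.0], [0.0], [1.0], [2.0]])
euclid = lambda a, b: float(np.linalg.norm(a - b))
ssm = self_similarity_matrix(series, euclid)

# The main diagonal is zero; the repetition shows up as a secondary
# zero diagonal offset by the period (here 3).
print(ssm[0, 3], ssm[1, 4], ssm[2, 5])  # all 0.0
```

Repeated subsequences appear as low-valued diagonals parallel to the main diagonal, which is what recurrence plots visualize.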

2

2.1

2.2.1 Pixel Distance

An exact copy of a video is identical to the original at each pixel of each frame. Since we usually do not encounter exact copies in the real world, our first intuition is to sum the distances of all corresponding pixels:

δ1(U, V) := Σ_{0≤i<w} Σ_{0≤j<h} |U(i, j) − V(i, j)|,

where U and V are frames of width w and height h, and U(i, j) denotes the pixel value at position (i, j).
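A direct implementation of this per-pixel distance might look as follows. This is a sketch under the assumption of 8-bit grayscale frames stored as NumPy arrays; the paper's actual pixel representation may differ.

```python
import numpy as np

def delta1(u, v):
    """Sum of absolute differences over all corresponding pixels.

    u, v: grayscale frames of identical shape (h, w), dtype uint8.
    """
    if u.shape != v.shape:
        raise ValueError("frames must have identical dimensions")
    # Cast to a wider signed type so the subtraction cannot wrap around.
    return int(np.abs(u.astype(np.int64) - v.astype(np.int64)).sum())

# An exact copy has distance 0; any pixel deviation increases it.
u = np.array([[10, 20], [30, 40]], dtype=np.uint8)
v = u.copy()
w = u.copy()
w[0, 0] = 13  # simulate a small change, e.g. from re-encoding
print(delta1(u, v), delta1(u, w))  # 0 3
```

Note that such a pixel-wise distance is highly sensitive to cropping, resizing, and logo insertion, which motivates the more robust descriptors developed in the remainder of the paper.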