AUTOMATIC TRADEMARK DETECTION AND RECOGNITION IN SPORT VIDEOS

Lamberto Ballan, Marco Bertini, Alberto Del Bimbo and Arjun Jain
Media Integration and Communication Center, University of Florence, Italy
http://www.micc.unifi.it/vim

ABSTRACT

In this paper we describe a system for automatic detection and recognition of trademarks in sports videos. We propose a compact representation of trademarks based on SIFT feature points and a matching algorithm to robustly detect and retrieve trademarks in a variety of different sports video types. Trademark localization is performed through robust clustering of matched feature points in the video frame. A supervised machine learning approach is used to automatically adapt the similarity threshold used to assess the trademark matches. Experimental results are provided, along with an analysis of the precision and recall. Results show that our proposed technique is efficient and effectively detects and classifies trademarks.

1. INTRODUCTION AND PREVIOUS WORK

Every year, large and small firms all over the world invest extremely large amounts of money in sports marketing and sponsorships. A large part of these investments is used for the placement of the sponsors’ names, trademarks and logos in the form of objects such as billboards, banners and other physical advertising media. These media are usually placed around and within the area in which the sport activity takes place, such as a golf course, a football or soccer field, a motorbike or Formula One race circuit, basketball or volleyball courts, swimming pools, etc. Since the cost of these sponsorships can be extremely high, sponsors require a verification of the level of visibility of their trademarks, in order to evaluate the return on their investment. This work of brand visibility evaluation is currently performed by several sports marketing firms, through manual inspection of the broadcast videos.
Human annotators mark up the start and end time of each trademark appearance, possibly adding a subjective evaluation of its quality. This process is both expensive and slow: even the best annotators can deal with only a limited number of trademarks at the same time, so a complete annotation of a single sport event may require checking the same video several times. The introduction of automatic systems that ease the work of the human annotators would streamline the whole process.

The problem of detecting and recognizing a trademark or a logo is part of the more generic object recognition problem. In this field the two main types of features used are related to shape [1] or are pixel-based [2, 3, 4]. Recently the use of local descriptors has become particularly widespread for object detection and recognition because of their robustness w.r.t. occlusions and image clutter. Most of the work related to trademark recognition deals with the problem of content-based indexing and retrieval in logo databases, with the goal of assisting the process of trademark registration, by comparing a newly designed trademark with archives of already registered logos; a review of these works is provided in [5]. In this case it can be assumed that the image acquisition and processing chain is controlled, so that the images are of acceptable quality and are not distorted. The problem of trademark recognition in videos is inherently harder, due to the relatively low quality of the images, caused by the technical limitations of the imaging equipment used to produce and transmit video (e.g. video interlacing, color sub-sampling, motion blur, compression artifacts, etc.), and to the lack of other information (e.g. no captions advise of the presence of logos).

In [6] the problem of detecting and tracking billboards in soccer videos has been studied, with the goal of superimposing different advertisements, without performing logo recognition. Billboards are detected using color histogram back-projection. In [7] logo appearance is detected by analyzing sets of significant edges and applying heuristic techniques to discard small or sparsely populated edge regions of the image. The logo recognition method proposed in [8] deals with logos appearing on rigid planar surfaces that have a homogeneously colored background; the video frame is binarized and logo regions are combined using heuristics. The Hough transform space of the segmented logo is then searched for large values to find the image intensity profiles along lines, and these profiles are matched with the line profiles of the models. In [9] candidate logo regions are detected using color histogram back-projection and then they are tracked. Multidimensional receptive field histograms are then used to perform logo recognition. For every candidate region the most likely logo is computed, and thus if a region does not contain a logo the precision of identification is reduced.

The most recent works on trademark detection and recognition have started to use interest point descriptors. In [10] the architecture for a system for media monitoring is presented.
The system provides logo detection and recognition functionalities, and the authors briefly discuss a variation of the SIFT algorithm to select and track keypoints in videos. The points are used for trademark recognition, but the logo matching algorithm is not described, and very few results of the proposed variation are provided. In [11] logo descriptors are computed from SIFT points using k-means clustering. Latent Semantic Analysis is then used for logo retrieval, considering the clusters as words, and using these “words” to build the latent semantic space. The proposed system does not take into account that a trademark is often represented by different instances, whose size and appearance may change greatly within the same video; due to the fixed number of clusters, the system would not cope well with small or compact logos. In [12] matching SIFT points are used to select nearby image pixels. Pixels that are near to a certain number of interest points are then clustered, and the densest cluster is retained. If the retained cluster size is above a certain threshold then a trademark match is declared. The paper does not provide experimental results, and the method relies on a number of empirically determined thresholds.

The contribution of our work is the presentation of a method

match entire trademarks. Because of this we use DoG points and SIFT [3] feature descriptors as a compact representation of the important aspects and local texture in trademarks. Trademarks are represented as a bag of SIFT feature points, and each trademark is represented by one or more graphical instances. Trademark $T_j$ is thus represented by the $N_j$ SIFT feature points detected in the image:

$$T_j = \{(x_k^t, y_k^t, s_k^t, d_k^t, O_k^t)\}, \quad k \in \{1, \dots, N_j\},$$

where $x_k^t$, $y_k^t$, $s_k^t$ and $d_k^t$ are, respectively, the x- and y-position, the scale, and the dominant direction of the $k$th detected feature point; $O_k^t$ is a 128-dimensional local orientation histogram of the SIFT point (the superscript $t$ is used only to distinguish points from trademarks and video frames). Each frame $V_i$ of a video is represented similarly as a bag of $M_i$ SIFT feature points detected in frame $i$.

2.2. Detection and retrieval of trademarks

Fig. 1. Overview of our trademark detection system.

that detects, recognizes and performs robust localization of trademarks in sports videos, based on SIFT points. A supervised machine learning approach automatically adapts the similarity threshold used to recognize different trademarks in different sports videos, in order to cope with changes in the visual quality of the frames. The rest of the paper is organized as follows. Sect. 2 provides an overview of the trademark detection system and details of the frame classification procedure and of the automatic threshold adaptation. Experimental results are shown and discussed in Sect. 3. Conclusions are drawn in Sect. 4.

2. THE TRADEMARK DETECTION SYSTEM

We propose a semi-automatic system for detecting and retrieving trademark appearances in sports videos. First of all we describe our system architecture (Fig. 1), which has been developed during a collaboration with a sports marketing firm¹. The system has been designed to attain quasi-real-time processing of videos. In this system a human annotator supervises the results of the automatic annotation through an interface that shows the time and the position of the detected trademarks; for this reason the aim of the system is to provide a good recall figure, so that the supervisor can safely skip the parts of the video that have been marked as not containing a trademark, thus speeding up the work.

Detection and retrieval of trademarks is done by comparing the bag of local features representing the trademark $T_j$ with the local features detected in the frames of the video $V_i$. Therefore, for each point in $T_j$ we search for its two nearest neighbors in the $V_i$ point set:

$$
\begin{aligned}
N_1(T_j^k, V_i) &= \min_q \|O_q^v - O_k^t\| \\
N_2(T_j^k, V_i) &= \min_{q \neq N_1(T_j^k, V_i)} \|O_q^v - O_k^t\|.
\end{aligned}
\tag{1}
$$

Next, for every trademark point we compute its match score:

$$M(T_j^k, V_i) = \frac{N_1(T_j^k, V_i)}{N_2(T_j^k, V_i)}, \tag{2}$$

that is the ratio of the distances to the first and second nearest neighbors. Points are selected as good candidate matches on the basis of their match scores. The match set for trademark $T_j$ in frame $V_i$ is thus defined as: $M_{ij} = \{k \mid M(T_j^k, V_i) < 0.8\}$. That is, if the match score $M(T_j^k, V_i)$ of a point is greater than 0.8, the point is discarded; this approach performs well because a correct match needs to have the closest matching descriptor significantly closer than the closest incorrect match (because of the high dimensionality of the feature space). The final determination of whether a frame $V_i$ contains trademark $T_j$ is made by thresholding the normalized match score: $|M_{ij}| / |T_j| > \tau \iff$ trademark $T_j$ present in frame $V_i$. The threshold $\tau$ requires that a certain percentage of the feature points detected in the reference trademark $T_j$ must be matched to

2.1. Image and video features

One of the distinctive aspects of trademarks is that they are usually planar objects and contain both text and other high-contrast features such as graphic logos. In sport videos they are also often occluded by players or other obstacles between the camera and the trademark. Therefore, the first challenge is to build a model that is able to cope with partial occlusions. To obtain a matching technique that is robust to partial occlusions, we use local neighborhood descriptors of salient points. By combining the results of local point-based matching we are able to

¹ Sport System Europe srl - www.sportsystem.com

Fig. 2. An example of the normalized match score histogram for 4 different trademarks in a MotoGP video.
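The matching procedure of Sect. 2.2, whose normalized match scores are histogrammed in Fig. 2, can be sketched in a few lines of Python. This is only an illustrative sketch and not the system's implementation: descriptors are assumed to be plain sequences of floats (128-dimensional for SIFT, shorter in the toy usage), the nearest neighbors are found by brute force, and the function names `ratio_test_matches` and `trademark_present` are hypothetical. The ratio 0.8 and the threshold 0.2 are the values quoted in the text.

```python
import math


def ratio_test_matches(trademark_desc, frame_desc, ratio=0.8):
    """Return the indices k of trademark descriptors whose match score
    N1/N2 (distance to nearest / second-nearest frame descriptor) is
    below `ratio`."""
    matches = []
    for k, d_t in enumerate(trademark_desc):
        # Brute-force distances from this trademark point to all frame points.
        dists = sorted(math.dist(d_t, d_v) for d_v in frame_desc)
        if len(dists) < 2:
            continue  # the score needs two neighbors
        n1, n2 = dists[0], dists[1]
        # n2 == 0 means two identical frame descriptors: ambiguous, discard.
        if n2 > 0 and n1 / n2 < ratio:
            matches.append(k)
    return matches


def trademark_present(trademark_desc, frame_desc, tau=0.2, ratio=0.8):
    """Threshold the normalized match score |M_ij| / |T_j| against tau."""
    matches = ratio_test_matches(trademark_desc, frame_desc, ratio)
    nm_score = len(matches) / len(trademark_desc)
    return nm_score > tau, nm_score


# Toy 2-D descriptors: the first trademark point has one clear neighbor,
# the second is equidistant from two frame points and is rejected.
trademark = [(0.0, 0.0), (5.0, 0.0)]
frame = [(0.0, 0.1), (4.0, 0.0), (6.0, 0.0)]
present, score = trademark_present(trademark, frame)
```

In the toy usage only the first trademark point passes the ratio test, so the normalized match score is 0.5 and the frame is accepted with the default threshold.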

first one, $C$, is the penalty parameter of the error term, while the second, $\gamma$, controls the width of the RBF kernel. The standard cross-validation approach has been used to select the best pair $(C, \gamma)$, finding that the best combination is $(512, 2.0)$, which gave us a $v = 4$ fold cross-validation classification accuracy of 93.96%.

3. EXPERIMENTAL RESULTS

Fig. 3. Robust trademark localization: points in cyan are those selected as final matches; points in green are SIFT matched points with a low influence and so discarded from the final match set.

the frame $V_i$. Preliminary experiments have shown that a value of ∼0.2 is a reasonable choice (Fig. 2). In order to localize the trademark in the original frame $V_i$ and to approximate its area, we compute a robust estimate of the feature point cloud (Fig. 3). The current feature point locations are denoted as $F = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$. The robust centroid estimate is computed by iteratively solving for $(\mu_x, \mu_y)$ in

$$\sum_{i=1}^{n} \psi(x_i; \mu_x) = 0, \qquad \sum_{i=1}^{n} \psi(y_i; \mu_y) = 0$$

where the influence function $\psi$ used is the Tukey biweight and the scale parameter $c$ is estimated using the median absolute deviation from the median: $\mathrm{MAD}_x = \mathrm{median}_i(|x_i - \mathrm{median}_j(x_j)|)$.

2.3. Frame classification

The main limitation of the proposed detection algorithm is that it does not take into account the visual quality of the frame. Our idea is to classify frames according to their visual quality and to adapt the normalized match threshold (NMT) to obtain better retrieval results in terms of recall. Frame classification has been performed using an SVM. The visual features used for frame classification are related to the artifacts that most severely affect the appearance of trademarks, or to their visual appearance. Camera and motion blur are extremely common artifacts of sports videos, due to the fast-paced action that typically occurs; they have to be evaluated using a no-reference approach since they are not related to the compression process; the blurriness metric proposed in [13] has been used to assess visual quality. Edges and interest points are related to the appearance of trademarks and logos, and their number has been used as a hint to evaluate the likelihood of the detectability of the trademarks. Each frame is represented by a feature vector composed of the blurriness value, the number of edges computed using the Canny edge detector, the number of SIFT points and the corresponding normalized match score. The blurriness value and the number of edges are computed using three different kernels of size 3, 5 and 7; this is done to take into account different scales and sizes of the edges of the trademarks. The SVM uses the standard Radial Basis Function (RBF) kernel: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$, because it can handle the case in which the relation between class labels and attributes is nonlinear and because it requires fewer hyperparameters than other kernels. The RBF kernel requires choosing two parameters: the

A preliminary evaluation of the performance of the proposed technique has been carried out on three videos of three different sports (MotoGP, volleyball and soccer). Each of these is approximately one hour long. The first challenge is that different sports are characterized by different set-ups and problems. Sports like volleyball and basketball present many situations with occlusions or partial appearance of the trademarks. In MotoGP and Formula 1 motion blur is the most common issue; in soccer, instead, there are often trademarks at low resolution with few SIFT feature points. Therefore we have run experiments with these videos and several different trademarks to set all the parameters of the proposed algorithms (Sect. 2), adapting them to the particular type of sport. The main goal of these experiments is to set the NMT $\tau$ and the processing frame step so as to obtain a correct trade-off between precision and recall. These experiments are presented in detail in our previous work [14] and show that the best performance is obtained with $\tau$ equal to 0.2 and 5 fps. However, depending on the type of sport, the recall figure shows great variations (from 80% to 40%); this happens in particular in sports like volleyball and basketball (with $\tau = 0.17$). This threshold is statically set up for each sport type.

3.1. Automatic NM Threshold adaptation

To test the effectiveness of the proposed dynamic threshold adaptation, the following experiments have been performed on volleyball videos. We have determined the lowest acceptable value $\tau_n$ for the NMT by analysing all the frames whose normalized match score (NM-score) is below the static value used for volleyball. Our goal is to find the best trade-off between the number of frames that have an NM-score between 0.17 and $\tau_n$, precision and recall.
The SVM described in the previous section has been trained with frames containing different trademarks, while varying the NMT interval between $\tau_n$ and $\tau$; each of these SVMs has been tested with frames of the corresponding NM-score interval. Results of the classification of the frames with an NM-score below 0.17 are reported in Table 1. Analysis of the table shows that the best results are obtained for $\tau_n = 0.08$. As expected, precision maintains relatively high values within all the NM-score ranges; recall instead shows low values for $\tau_n \leq 0.06$. This is due to the fact that frames with low NM-score values usually do not contain trademarks, and thus the SVMs are biased towards the classification of this type of frame.

3.2. System performances

Having determined the lowest $\tau_n$ of the NM-score range used for the training of the SVM, we have performed experiments to determine how to select the trademarks used for the SVM training. In particular we have analysed whether it is better to use an SVM for each trademark, or whether it is possible to use a generic SVM (trained with different trademarks). For this experiment we have used two different volleyball videos, one (25000 frames) to select the training set and the other (5000 frames) for the test. The trademarks that we have selected for the training are TIM, ALICE, MIZUNO and ERREA;

Data for different NM Thresholds

τn     No. frames   Correct TM   Correct no TM   Miss   False   Precision   Recall
0.00   8685         600          4815            2950   320     0.652       0.169
0.01   8405         605          4570            2905   325     0.650       0.172
0.02   8300         650          4480            2830   340     0.656       0.187
0.03   7825         1270         3715            2060   780     0.619       0.381
0.04   7045         1210         3145            1940   750     0.617       0.384
0.05   6170         1185         2445            1735   805     0.595       0.406
0.06   3920         710          1460            1245   505     0.584       0.363
0.07   3125         1390         870             290    575     0.707       0.827
0.08   2635         1465         445             60     665     0.688       0.961
0.09   2320         1160         535             225    400     0.744       0.837
0.10   1800         860          370             180    390     0.688       0.827
0.11   1735         845          355             180    355     0.704       0.824
0.12   850          520          140             35     155     0.770       0.937
0.13   490          315          30              0      145     0.685       1.000
0.14   475          310          10              0      155     0.667       1.000
0.15   460          305          10              0      145     0.678       1.000
0.16   335          220          0               0      115     0.657       1.000

Table 1. SVM classification results varying NMT ranges for Volleyball videos; τn is the lowest value of the range, 0.17 is the highest.
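The precision and recall columns of Table 1 can be recomputed from the count columns. The short Python check below assumes that "False" counts false detections (false positives) and "Miss" counts missed trademarks (false negatives); the paper does not spell out these formulas, but under this assumption the values in the table are reproduced. The function name `precision_recall` is illustrative.

```python
def precision_recall(correct_tm, miss, false):
    """Precision and recall from the count columns of Table 1.

    Assumes 'False' = false positives and 'Miss' = false negatives.
    """
    precision = correct_tm / (correct_tm + false)
    recall = correct_tm / (correct_tm + miss)
    return precision, recall


# Row tau_n = 0.08 of Table 1: Correct TM = 1465, Miss = 60, False = 665.
p, r = precision_recall(1465, 60, 665)
print(f"precision={p:.3f} recall={r:.3f}")  # prints "precision=0.688 recall=0.961"
```

The same computation on the first row (600, 2950, 320) gives 0.652 and 0.169, again matching the table.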

Overall system performance

Trademark   Thresh. type   Correct TM   Correct no TM   Miss   False   Prec.   Rec.
TIM         Static         980          2575            1415   30      0.97    0.409
TIM         Dynamic        1445         2540            950    65      0.956   0.603
ALICE       Static         760          2505            1605   130     0.854   0.321
ALICE       Dynamic        1030         2180            1420   370     0.736   0.420

Table 3. Comparison of trademark recognition results obtained with static and dynamic thresholds (using SVMu).

these views result in a smaller trademark appearance and thus in the presence of fewer SIFT features. We will also investigate methods to automatically adapt the frame sampling rate to deal with short sequences of trademark appearances and long sequences of low visual quality.

Acknowledgements

This work is partially supported by Sport System Europe srl, Italy, and the EU IST VidiVideo Project (Contract FP6-045547).

5. REFERENCES

for the test we have selected the two main sponsors (TIM and ALICE). The numbers of frames with an NM-score between 0.17 and 0.08 (as determined by the above experiments) are: 10305 for TIM, 5765 for ALICE, 1695 for MIZUNO and 1825 for ERREA. We have trained three different SVMs: SVM1 trained only with frames containing TIM, SVM2 with frames containing ALICE, and SVMu with the frames of all four trademarks. Table 2 reports the classification results of the two main trademarks obtained with the different SVMs. Analysis of the results shows that SVMu gives in general the best trade-off between precision and recall. This solution is also technically appealing because it allows training a single SVM, independently of the number of trademarks that have to be recognized by the system. The overall system performance obtained with the automatic NMT adaptation is compared to the previous system that used a static threshold (Table 3). The dynamic selection greatly improves recall with a minimal loss of precision.

Classification Results

Trademark   SVM    Correct TM   Correct no TM   Miss   False   Prec.   Rec.
TIM         SVM1   950          255             45     225     0.808   0.955
TIM         SVM2   685          465             310    15      0.978   0.688
TIM         SVMu   745          435             250    45      0.943   0.631
ALICE       SVM1   300          345             340    175     0.632   0.469
ALICE       SVM2   255          410             275    220     0.537   0.927
ALICE       SVMu   530          225             110    295     0.642   0.828

Table 2. Classification results of the three different SVMs.
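One possible reading of the dynamic threshold adaptation evaluated above is a two-stage decision: frames whose NM-score exceeds the static threshold are accepted directly, frames in the range between τn and τ are passed to the trained classifier, and the remaining frames are rejected. The Python sketch below illustrates this reading only; the function name, the classifier interface (any callable returning a truth value for a feature vector) and the handling of the interval boundaries are assumptions, not details given in the paper. The default values 0.17 and 0.08 are the volleyball values reported in Sect. 3.

```python
from typing import Callable, Sequence


def frame_contains_trademark(
    nm_score: float,
    features: Sequence[float],
    classifier: Callable[[Sequence[float]], bool],
    tau: float = 0.17,
    tau_n: float = 0.08,
) -> bool:
    """Two-stage decision (assumed reading of the dynamic NMT adaptation)."""
    if nm_score > tau:
        # Above the static threshold: accepted without the classifier.
        return True
    if nm_score >= tau_n:
        # Uncertain range [tau_n, tau]: defer to the frame classifier.
        return bool(classifier(features))
    # Below tau_n: treated as not containing the trademark.
    return False


# Hypothetical stand-in for the trained SVM: accepts frames whose first
# feature is above 0.5.
def toy_classifier(features: Sequence[float]) -> bool:
    return features[0] > 0.5


decision = frame_contains_trademark(0.10, [0.9], toy_classifier)
```

With the stand-in classifier, a frame with NM-score 0.10 is accepted only because the classifier votes for it; the same score with a rejecting classifier, or any score below 0.08, yields a negative decision.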

4. CONCLUSIONS AND FUTURE WORK

In this paper we have presented a system for semi-automatic detection and recognition of trademarks in sports videos. A supervised machine learning approach is used to automatically adapt the similarity threshold used for trademark recognition and localization. Experimental results show that the automatic adaptation provided by an SVM classifier improves the recall of our previous system with a minimal cost in terms of precision. Our future work will address the improvement of trademark detection by adding color features, to improve performance in sports like soccer and rugby, where long-range views are more common;

[1] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE TPAMI, vol. 24, no. 4, pp. 509–522, 2002.
[2] T. Gevers and A. W. M. Smeulders, “Color-based object recognition,” Pattern Recognition, vol. 32, pp. 453–464, 1999.
[3] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[4] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in Proc. of ECCV, Graz, Austria, 2006.
[5] J. Schietse, J. P. Eakins, and R. C. Veltkamp, “Practice and challenges in trademark image retrieval,” in Proc. of CIVR, Amsterdam, NL, 2007.
[6] F. Aldershoff and Th. Gevers, “Visual tracking and localisation of billboards in streamed soccer matches,” in Proc. of SPIE, San Jose, CA, USA, 2004.
[7] B. Kovar and A. Hanjalic, “Logo appearance statistics in a sport video: Video indexing for sponsorship revenue control,” in Proc. of SPIE, San Jose, CA, USA, 2002.
[8] R. J. M. Den Hollander and A. Hanjalic, “Logo recognition in video by line profile classification,” in Proc. of SPIE, 2004.
[9] F. Pelisson, D. Hall, O. Riff, and J. L. Crowley, “Brand identification using gaussian derivative histograms,” Machine Vision and Applications, vol. 16, pp. 41–46, 2003.
[10] G. Kienast, H. Stiegler, W. Bailer, H. Rehatschek, S. Busemann, and T. Declerck, “Sponsorship tracking using distributed multi-modal analysis,” in Proc. of EWIMT, 2005.
[11] J. Wang, Q. Liu, J. Liu, and H. Lu, “Logo retrieval with latent semantic analysis,” in Proc. of VIP, Beijing, China, 2006.
[12] S. Sanyal and S. H. Srinivasan, “Logoseeker: A system for detecting and matching logos in natural images,” in Proc. of ACM Multimedia, Augsburg, DE, 2007.
[13] P. Marziliano, F. Dufaux, S. Winkler, and T. Ebrahimi, “Perceptual blur and ringing metrics: application to jpeg2000,” Signal Processing: Image Communication, vol. 19, no. 2, 2004.
[14] A. D. Bagdanov, L. Ballan, M. Bertini, and A. Del Bimbo, “Trademark matching and retrieval in sports video databases,” in Proc. of MIR, Augsburg, Germany, 2007.