Vis Comput (2013) 29:565–575 DOI 10.1007/s00371-013-0818-0

ORIGINAL ARTICLE

Accurate and efficient cross-domain visual matching leveraging multiple feature representations

Gang Sun · Shuhui Wang · Xuehui Liu · Qingming Huang · Yanyun Chen · Enhua Wu

Published online: 25 April 2013 © Springer-Verlag Berlin Heidelberg 2013

Abstract Cross-domain visual matching aims at finding visually similar images across a wide range of visual domains, and has shown practical impact in a number of applications. Unfortunately, the state-of-the-art approach, which estimates the relative importance of individual feature dimensions, still suffers from low matching accuracy and high time cost. To this end, this paper proposes a novel cross-domain visual matching framework leveraging multiple feature representations. To integrate the discriminative power of multiple features, we develop a data-driven, query-specific feature fusion model, which simultaneously estimates the relative importance of the individual feature dimensions and the weight vector among multiple features. Moreover, to alleviate the computational burden of an exhaustive subimage search, we design a speedup scheme, which employs hyperplane hashing to rapidly collect the hard-negatives. Extensive experiments carried out on various matching tasks demonstrate that the proposed approach outperforms the state of the art in both accuracy and efficiency.

G. Sun (✉) · X. Liu · Y. Chen · E. Wu
State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
e-mail: [email protected]

G. Sun · Q. Huang
University of Chinese Academy of Sciences, Beijing, China

S. Wang · Q. Huang
Key Laboratory of Intelligent Information Processing (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

E. Wu
University of Macau, Macao, China

Keywords Visual matching · Cross-domain · Multiple features · Hyperplane hashing

1 Introduction

The prevalence of imaging devices and the Internet has given ordinary consumers the means to capture their worlds in images and conveniently share them on the web. Today, people can generate volumes of images of exactly the same content across a wide range of visual domains (e.g., natural images, sketches, paintings, and computer-generated (CG) images), with dramatic variations in lighting conditions, seasons, ages, and rendering styles. Matching these images is helpful to a number of applications, such as scene completion [11], Sketch2Photo [3, 4, 7], Internet rephotography [27], painting2GPS [27], and CG2Real [16]. Accurately and efficiently finding visually similar images across different domains is therefore in urgent demand from both research and applications.

Our objective in this paper is to search a large-scale dataset for image patches (subimages) that are visually similar to a given query across different domains. This is a very challenging task because small perceptual differences can result in arbitrarily large differences at the raw pixel level. In addition, without knowledge of the query's specific domain, it is very difficult to develop a generalized solution covering multiple potential visual domains. Finally, rapid system feedback is desired for a number of applications, such as scene completion [11] and Sketch2Photo [3, 4, 7].

Researchers have made significant progress in the study of cross-domain visual matching. Some methods focus on the design of local invariant descriptors (self-similarity (SSIM) [26], symmetry [10], etc.) across different visual domains. As a second paradigm, a data-driven approach [27] focuses


on weighting the dimensions of the Histograms of Oriented Gradients (HOG) [6] descriptor by training a linear classifier at query time. In comparison with uniform weights, the learned weights appropriately measure the relative importance of the feature dimensions, and the performance is thus significantly improved.

Although promising results have been achieved, the above-mentioned approaches are still far from perfect for many practical applications, for two reasons. Firstly, a single image feature (e.g., SSIM, HOG) lacks sufficient discriminative power under dramatic appearance variations, which may lead to restricted generalization ability and low accuracy for cross-domain visual matching. A possible way to tackle this problem is to integrate several kinds of features together, for example by feature concatenation. However, systems that adopt such equally weighted concatenation may perform no better (or even worse) than a single feature, because the information conveyed by different features can be significantly unbalanced: a single feature may even dominate the effectiveness of the feature ensemble. On the other hand, manually setting the feature weights is impractical, as the weights may vary significantly from query to query. Secondly, owing to the tens of millions of negative subimages in the dataset, the data-driven approach [27] requires training a linear Support Vector Machine (SVM) classifier with the hard-negative mining approach [6] at query time. In hard-negative mining, one alternates between using the current classifier to search the retrieval dataset exhaustively for sufficient false positives (hard-negatives), and learning a classifier given the current set of hard-negatives. Exhaustively searching the whole dataset incurs a heavy computational burden and prevents practical use in a number of applications.

To address the above issues, we propose an accurate and efficient cross-domain visual matching framework leveraging multiple feature representations, because multiple features provide complementary discriminative power for boosting matching accuracy. We present a data-driven, query-specific feature fusion model to overcome the unbalanced information conveyed by different features. We train a discriminative linear classifier on multiple feature descriptors to learn uniqueness-based weights, which are derived from both the individual feature dimensions and the weight vector among multiple features. In comparison with [27], our learned weights appropriately encode the discriminative power of multiple features. Moreover, both our approach and [27] involve hard-negative mining, which incurs a heavy computational burden at query time. To alleviate this overhead, we employ compact hyperplane hashing [19] to recast hard-negative mining in a hyperplane-to-point search framework. On a dataset containing tens of millions of subimages, our speedup scheme can rapidly and consistently identify the top matches with higher quality in nearly 10 minutes per query on a PC, which is much faster than the original hard-negative mining approach [6]. Extensive experiments demonstrate that our proposed approach outperforms the state-of-the-art [27] in both accuracy and efficiency.

The contributions of this paper can be summarized as follows:

− We propose an accurate and efficient cross-domain visual matching framework, which is the first work to effectively make use of multiple feature representations in this task.
− We develop a data-driven, query-specific feature fusion model, which significantly outperforms feature concatenation.
− We design a speedup scheme (hashing-based hard-negative mining), which is more efficient than the original hard-negative mining approach.

The remainder of this paper is organized as follows. Sections 2 and 3 review related work and the previous data-driven approach, and Sect. 4 describes our approach. Extensive experimental results are provided in Sect. 5. We discuss limitations and future work in Sect. 6.

2 Related work

We briefly review related work on cross-domain visual matching, multiple features, and approximate near-neighbor search.

2.1 Cross-domain visual matching

Many studies have been devoted to matching images between specific domain pairs, such as photos under different lighting conditions [5], sketches to photographs [3, 4, 7], paintings to photographs [25], and CG images to photographs [16]. However, these domain-specific solutions do not generalize across multiple potential visual domains. Towards a generalized solution, some methods focus on the design of local invariant descriptors (self-similarity (SSIM) [26], symmetry [10], etc.) across different visual domains. In addition, a data-driven approach [27] weights the dimensions of the HOG [6] or SIFT [20] descriptor by training a linear classifier. Nevertheless, none of these approaches effectively makes use of multiple feature representations to boost matching accuracy.

2.2 Multiple features

Researchers in the Content-Based Image Retrieval (CBIR) community have proposed several systems using multiple feature representations [9, 24, 28] to boost retrieval performance. However, how to integrate the discriminative power


of multiple features without relevance feedback is still an open problem. As a promising method successfully applied in image classification and object recognition, Multiple Kernel Learning (MKL) [1, 17, 23, 29, 30] aims to simultaneously optimize a discriminative model built on multiple similarity measures (kernels) and the kernel weights. Compared with feature concatenation, MKL is more capable of identifying the meaningful features for matching cross-domain images.

2.3 Approximate near-neighbor search

Since spatial partitioning and tree-based search algorithms break down for high-dimensional data, many researchers have proposed hashing-based near-neighbor search methods to handle high-dimensional inputs. One of the most successful is Locality-Sensitive Hashing (LSH) [8, 13], which has been employed in a number of applications. However, such point-to-point search methods are not applicable to hard-negative mining, where the query is the current classifier. Recently, the hyperplane-to-point search problem has drawn much attention. In [14, 19], the authors develop hyperplane hashing, which rapidly returns the points with minimal distances to a hyperplane query. Owing to its efficiency, we employ hyperplane hashing to speed up hard-negative mining by avoiding an exhaustive search of the whole retrieval dataset.

3 Data-driven framework

Following [27], cross-domain visual matching is formalized as an optimization problem minimizing the following objective function:

L(w_q, b_q) = \sum_{x_i \in P \cup I_q} h(w_q^\top x_i + b_q) + \sum_{x_j \in N} h(-w_q^\top x_j - b_q) + \lambda \|w_q\|^2    (1)

where w_q is the relative importance vector of the feature dimensions, x_i is subimage I_i's grid-like HOG feature template, and b_q is the bias term. I_q is the query image (positive), and P is a set of extra positive data points obtained by applying small transformations to the query image I_q. N is the set of tens of millions of subimages (negatives) from the retrieval dataset. \lambda is the regularization parameter, and h(x) = max(0, 1 - x) is the hinge loss function.

Given the learned, query-dependent importance vector w_q, the visual similarity between a query image I_q and any other subimage I_i is defined as

S(I_q, I_i) = w_q^\top x_i + b_q    (2)
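For concreteness, below is a minimal Python sketch of this query-time training loop, using an off-the-shelf linear SVM (hinge loss + L2 regularization, matching Eq. (1)) together with hard-negative mining [6]. The function names and the `sample_negatives` callback are our own illustrative assumptions, not the authors' code: in [27] the negatives come from an exhaustive scan of the retrieval dataset, while Sect. 4.2 replaces that scan with hyperplane hashing.

```python
import numpy as np
from sklearn.svm import LinearSVC

def learn_query_classifier(x_q, extra_pos, sample_negatives, lam=1e2, iters=10):
    """Minimize the hinge-loss objective of Eq. (1) with hard-negative
    mining: alternate between scoring negatives with the current
    classifier and retraining on the offending (hard) ones."""
    pos = np.vstack([x_q[None, :], extra_pos])   # query plus jittered copies (P)
    hard = sample_negatives(None, None)          # seed with random negatives
    w, b = None, None
    for _ in range(iters):
        X = np.vstack([pos, hard])
        y = np.r_[np.ones(len(pos)), -np.ones(len(hard))]
        # C is the inverse of the regularization weight lambda in Eq. (1)
        svm = LinearSVC(C=1.0 / lam, loss="hinge").fit(X, y)
        w, b = svm.coef_.ravel(), float(svm.intercept_[0])
        hard = np.vstack([hard, sample_negatives(w, b)])  # grow the hard set
    return w, b

def similarity(w_q, b_q, x_i):
    """Visual similarity of Eq. (2): S(I_q, I_i) = w_q^T x_i + b_q."""
    return float(w_q @ x_i + b_q)
```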


To handle multiple feature representations in the data-driven framework, a possible way is feature concatenation. Let n denote the number of features and x_{i,k} be the kth normalized feature of subimage I_i. Subimage I_i's concatenated feature x_i^c is then given as

x_i^c = [x_{i,1}^\top, x_{i,2}^\top, \ldots, x_{i,n}^\top]^\top    (3)

Then x_i^c can be treated as a single feature, and the relative importance vector can be learned by minimizing Eq. (1). However, owing to the unbalanced information conveyed by different features, such simple feature concatenation may perform no better (or even worse) than a single feature, as illustrated in Sect. 5.
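A one-line sketch of Eq. (3), assuming each feature has already been normalized as described in Sect. 5.1:

```python
import numpy as np

def concat_features(feature_list):
    """Eq. (3): stack the n per-feature vectors x_{i,1..n} of a subimage
    into one concatenated descriptor x_i^c."""
    return np.concatenate([np.asarray(f).ravel() for f in feature_list])
```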

4 Our approach

An overview of our approach is shown in Fig. 1. In the preprocessing stage, we learn a set of hashing functions, convert the concatenated feature of each subimage in the retrieval dataset into a hash code, and store the codes in a hyperplane hash table. At query time, we first extract multiple features of the query. These features form a feature ensemble serving as the initial classifier (decision hyperplane). A set of extra hyperplanes is then created, and the corresponding hash codes are provided as inputs to hard-negative mining. Our hashing-based hard-negative mining is shown in the red box of Fig. 1. Given a set of hash codes, hyperplane hashing rapidly returns a sufficient number of hard-negatives. Using these hard-negatives, the MKL learner (our multiple feature fusion model) trains a classifier and creates a new set of extra hyperplanes, whose hash codes are again looked up in the hyperplane hash table for hard-negatives. In our experiments, we run ten iterations of this procedure. Finally, the system outputs the learned weights and the top matches. Note that the learned weights comprise the relative importance of the individual feature dimensions and the weight vector among multiple features. In this section, we first describe our multiple feature fusion model, and then explain how to rapidly collect hard-negatives using our hashing-based hard-negative mining.

Fig. 1 Overview of our approach. For the initial feature ensemble and the learned weights, the transparency of the colors (red, green, blue) denotes the relative value in individual features; lower transparency means a larger value

4.1 Multiple feature fusion

To balance the information conveyed by different features, we define the visual similarity between a query image I_q and any other subimage I_i as

S'(I_q, I_i) = \sum_{k=1}^{n} d_{q,k} w_{q,k}^\top x_{i,k} + b_q    (4)

where d_q = [d_{q,1}, d_{q,2}, \ldots, d_{q,n}]^\top is the weight vector for the n different features, w_{q,k} is the relative importance vector of the kth normalized feature x_{i,k}, and w_q^c = [w_{q,1}^\top, w_{q,2}^\top, \ldots, w_{q,n}^\top]^\top.

In other words, we aim to construct a linear combination of multiple visual similarities. Instead of an equally weighted or manually set combination, we learn the query-specific weight vectors d_q and w_q^c simultaneously by minimizing the following L2-norm multiple kernel learning objective function:

L(w_q^c, b_q, d_q) = \sum_{x_i^c \in P \cup I_q} h\Big(\sum_{k=1}^{n} d_{q,k} w_{q,k}^\top x_{i,k} + b_q\Big) + \sum_{x_j^c \in N} h\Big(-\sum_{k=1}^{n} d_{q,k} w_{q,k}^\top x_{j,k} - b_q\Big) + \lambda \|w_q^c\|^2 + \mu\, \mathbf{1}^\top (d_q \odot d_q), \quad \text{s.t. } d_q \ge 0    (5)

where the symbol \odot denotes the Hadamard (i.e., elementwise) product, and \lambda and \mu are regularization parameters. In our experiments, \lambda = 10^2 and \mu = 10^8. We use the SMO-MKL package [29] to learn w_q^c, b_q, and d_q.

Usually, one copes with a very large amount of negative data using the hard-negative mining approach [6], alternating between using the current classifier to search the whole negative set exhaustively for sufficient hard-negatives, and learning a classifier given the current set of hard-negatives. However, exhaustively searching the whole negative set is very time consuming, and prevents practical use. Our visual similarity Eq. (4) is still a linear model with respect to the concatenated feature x_i^c, since Eq. (4) can be rewritten as

S'(I_q, I_i) = (w_q^f)^\top x_i^c + b_q    (6)

where w_q^f = [d_{q,1} w_{q,1}^\top, d_{q,2} w_{q,2}^\top, \ldots, d_{q,n} w_{q,n}^\top]^\top. Equation (6) allows us to employ hyperplane hashing to speed up the hard-negative mining approach.
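The rewriting in Eq. (6) is just a block-wise folding of the learned feature weights into one hyperplane normal; a minimal sketch (variable names are ours):

```python
import numpy as np

def fuse_hyperplane(d_q, w_blocks, b_q):
    """Build w_q^f = [d_{q,1} w_{q,1}; ...; d_{q,n} w_{q,n}] so that the
    fused similarity of Eq. (4) becomes the single linear model of Eq. (6)."""
    w_f = np.concatenate([d * np.asarray(w) for d, w in zip(d_q, w_blocks)])
    return w_f, b_q

# S'(I_q, I_i) = w_f @ x_c + b_q now has the plain hyperplane form required
# by the hyperplane hashing of Sect. 4.2.
```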

4.2 Hashing-based hard-negative mining

To alleviate the computational burden of exhaustive subimage search, we employ compact hyperplane hashing [19] to recast hard-negative mining in a hyperplane-to-point search framework. Given a hyperplane query, hyperplane hashing rapidly returns the points with minimal distances to the hyperplane. In the preprocessing stage, we learn a set of hashing functions H = [h_1, h_2, \ldots, h_m] as in [19]. We then convert each subimage's concatenated feature x_i^c into an m-bit hash code and store it in a single hash table with m-bit hash keys as entries. To search for hard-negatives during each hard-negative mining iteration, we design a hard-negative collecting procedure that approximates the exhaustive search.

Given the current classifier (decision hyperplane) parameterized by w_q^f and b_q, we aim to find a sufficient number of subimages, each of which must satisfy S'(I_q, I_i) ≥ −1. In fact, in our experiments, all the hard-negatives satisfy −1 ≤ S'(I_q, I_i) ≤ 1. We create a set of extra hyperplanes by modifying the bias term b_q of the current decision hyperplane. To collect hard-negatives, we look up the hash codes of all the hyperplanes in the hash table until a sufficient number of hard-negatives are found, as illustrated in Fig. 2. In Fig. 2(a), hard-negative mining is depicted as exhaustively searching the retrieval dataset for the hard-negatives (red points). In Fig. 2(b), the extra hyperplanes (p_1, p_2, \ldots, p_l) between S'(I_q, I_i) − 1 = 0 and S'(I_q, I_i) + 1 = 0 are created by modifying the bias term b_q of the current decision hyperplane. Note that although there are innumerable hyperplanes between S'(I_q, I_i) − 1 = 0 and S'(I_q, I_i) + 1 = 0, these hyperplanes correspond to only a few distinct hash codes, as the space has been quantized by hashing. The hard-negative collecting for hyperplanes p_1 and p_2 is shown in Figs. 2(c) and 2(d). For each hyperplane p, we extract its hash key H(p) and apply the bitwise NOT operation to obtain the key \bar{H}(p). We then look up \bar{H}(p) in the hyperplane hash table up to a small Hamming distance to obtain the points with minimal distances to the hyperplane. Once a sufficient number of hard-negatives are found, we retrain the classifier by minimizing Eq. (5).

Fig. 2 Hard-negative collecting. (a) Hard-negative mining can be described as exhaustively searching the retrieval dataset for the hard-negatives (red points); black points denote easy-negatives. (b) A set of extra hyperplanes (p_1, p_2, \ldots, p_l) between S'(I_q, I_i) − 1 = 0 and S'(I_q, I_i) + 1 = 0 is created by modifying the bias term b_q of the current decision hyperplane. (c–d) For each hyperplane p, the green points with minimal distances to p are collected; brown points denote points already collected by previous hyperplanes
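A sketch of this collecting procedure is given below. Here `hash_code` and `table` stand in for the compact hyperplane hashing functions and hash table of [19], whose exact construction we do not reproduce; representing each shifted hyperplane in homogeneous form [w; b] before hashing is also an implementation assumption on our part.

```python
import numpy as np

def collect_hard_negatives(w_f, b_q, m, hash_code, table,
                           n_needed, n_planes=8, max_radius=2):
    """Hashing-based hard-negative collecting (Sect. 4.2, Fig. 2).
    For each extra hyperplane p between S' = -1 and S' = +1, look up the
    bitwise NOT of its m-bit key H(p) within a small Hamming ball."""
    found = set()
    # Extra hyperplanes obtained by shifting the bias of the classifier.
    for b in np.linspace(b_q - 1.0, b_q + 1.0, n_planes):
        p = np.r_[w_f, b]                      # hyperplane in homogeneous form
        key = ~hash_code(p) & ((1 << m) - 1)   # bar{H}(p): bitwise NOT of H(p)
        for radius in range(max_radius + 1):   # probe a small Hamming ball
            found |= table.lookup(key, radius)  # hypothetical table probe
            if len(found) >= n_needed:
                return found
    return found
```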

5 Experiments

In this section, we perform a number of matching experiments across multiple visual domains. We first describe the features used in the experiments, then present the experimental results.

5.1 Image features

In our experiments, we first resize the query image heuristically to limit its feature dimensionality to no more than 10,000, and then extract the following three features for the query image. Each feature is normalized over overlapping spatial blocks, as in [6].

Filter Bank: We compute the output energy of a bank of filters. The filters are Gabor-like functions with 4 orientations at 3 different scales. The output of each filter is averaged on a regular grid with a stride of 8 pixels, and all the local filter bank descriptors are stacked together to form a global descriptor. In comparison with the GIST descriptor [22], the filter bank feature encodes more discriminative information with spatial layout.

HOG: The HOG feature [6] provides excellent performance for object and human detection tasks. As in [27], we also use the HOG feature for the cross-domain visual matching task. The HOG descriptors are densely extracted on a regular grid with a stride of 8 pixels, and all the local HOG descriptors are stacked together to form a global descriptor.

SSIM: Unlike the above two features, the SSIM feature [26] provides a distinct, complementary measure of scene layout using local self-similarities. The SSIM descriptors are extracted on a regular grid with a stride of 8 pixels. Each descriptor is obtained by computing the correlation surface and partitioning it into 24 bins (8 angles, 3 radial intervals). All the local SSIM descriptors are then concatenated to form a global descriptor.
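As an illustration of this grid-based extraction, the global HOG descriptor could be computed with scikit-image as follows (a sketch; the 8-pixel stride matches the paper, while the remaining parameters are our assumptions):

```python
from skimage.feature import hog

def global_hog(image):
    """Densely extract HOG cells on a regular grid with an 8-pixel stride
    and stack them into a single global descriptor."""
    return hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),   # 8-pixel grid stride
               cells_per_block=(2, 2),   # normalize over overlapping blocks
               feature_vector=True)      # flatten into one global vector
```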


As in [27], we match in a sliding-window fashion to increase the number of good matches, and use non-maxima suppression to remove highly overlapping redundant matches (a generic sketch follows below). For each image in the retrieval dataset, we precompute the filter bank, HOG, and SSIM feature pyramids at multiple scales ahead of time, and iteratively collect hard-negative subimages from these feature pyramids at query time. Although only three features are used in our experiments, we emphasize that any rigid grid-like image representation (e.g., dense SIFT [18, 20], Spark feature (Shape Context) [2, 7], Local Binary Patterns (LBP) [21]) can be integrated into our framework.
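The non-maxima suppression step, as a generic greedy sketch (not the authors' exact implementation); boxes are (x1, y1, x2, y2) windows and the overlap threshold is our assumption:

```python
import numpy as np

def nms(boxes, scores, overlap_thr=0.5):
    """Keep the best-scoring subimage windows, suppressing any window
    whose intersection-over-union with a kept one exceeds the threshold."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept window with all remaining windows.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * \
                 (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-12)
        order = rest[iou <= overlap_thr]      # drop highly overlapping windows
    return keep
```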

5.2 Results

We conduct a series of experiments to validate our approach. We collect 500 query images of outdoor scenes at different geographical locations across diversified domains. Like [12], our retrieval dataset is composed of GPS-tagged Flickr images. For each query, we create a set of 5,000 random images plus 5,000 images randomly sampled within a 30-mile radius of the query location (containing tens of millions of subimages). We compare with two baseline methods: the state-of-the-art Shrivastava et al. [27] (using the HOG feature) and feature concatenation (using the filter bank, HOG, and SSIM features) as described in Sect. 3. All the experiments are run on a PC with a 3.40 GHz Intel Core i7-2600 CPU and 8 GB RAM.

Fig. 3 Illustration of the weight vector of multiple features (normalized d_q). In the first query, the SSIM feature dominates the effectiveness of the feature ensemble; in the second query, the relative importance of the three features is almost equal

Image-to-Image: Here, we match photos taken across different ages, seasons, weather, or lighting conditions. Figure 4 shows some queries and the corresponding top matches for our approach and the baselines. To investigate the weight vector of multiple features, we plot the normalized d_q of the two queries in Fig. 4, as shown in Fig. 3. In the first query, the two baseline methods fail to find good matches because the SSIM feature dominates the effectiveness of the feature ensemble; in contrast, our approach finds good matches by properly integrating the discriminative power of multiple features. In the second query, while the HOG feature lacks sufficient discriminative power, both feature concatenation and our approach find good matches, because the relative importance of the three features is almost equal for this query. In most cases, however, the relative importance of multiple features differs greatly.

Sketch-to-Image: Matching sketches to images is a very challenging task, as the majority of sketches tend to use complex lines rather than just simple silhouettes. In addition, sketches show strong abstraction and local deformations with respect to the real scene. While most current approaches focus on designing a sketch-specific solution, our approach provides a general framework that integrates multiple visual cues to boost matching performance. Some qualitative examples can be seen in Fig. 5.

Fig. 4 Qualitative comparison of our approach against baselines for Image-to-Image matching


Fig. 5 Qualitative comparison of our approach against baselines for Sketch-to-Image matching

Fig. 6 Qualitative comparison of our approach against baselines for Painting-to-Image matching

Painting-to-Image: Matching paintings to images is also a very difficult task, as painting styles may vary significantly from painter to painter. Moreover, brush strokes may introduce strong local gradients even in regions such as sky. We apply our approach as before, without any changes. A qualitative comparison of our approach against the baselines can be seen in Fig. 6.

CG Image-to-Image: As another cross-domain visual matching evaluation, we test our approach on matching CG images to real images. CG images usually lack sufficient details (e.g., texture and noise), and their color distributions are oversaturated and exaggerated, which results in insufficient discriminative power for a single feature. A qualitative comparison of our approach against the baselines can be seen in Fig. 7. In the first example, it is important to see that feature concatenation may perform even worse than using a single feature.

Others-to-Image: Furthermore, we match some other queries (e.g., stereoscopic 3D images, logos, and ink-and-wash paintings) to images in our experiments, as shown in Fig. 8. The results further show that our approach consistently achieves impressive matching performance for queries from more visual domains.

For quantitative evaluation, we create labels for the retrieval dataset as ground truth. For each image in the dataset, we label whether the query scenes (Angkor Wat, Sydney Opera House, Venice, etc.) appear or not. For each query, we count how many query-scene subimages are retrieved in the top 20 matches, and use the bounded mean Average Precision (mAP) [15] over the 500 collected query images to evaluate the performance of our approach.
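As a sketch of this protocol (the exact bounded-mAP formula follows [15], which we do not reproduce; here we show the simpler top-20 precision that the per-image labeling supports):

```python
import numpy as np

def precision_at_k(ranked_is_query_scene, k=20):
    """Fraction of the top-k retrieved subimages whose source image is
    labeled as containing the query scene."""
    top = np.asarray(ranked_is_query_scene[:k], dtype=float)
    return float(top.mean())
```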


Fig. 7 Qualitative comparison of our approach against baselines for CG Image-to-Image matching

Fig. 8 Qualitative comparison of our approach against baselines for Others-to-Image matching

Fig. 9 Typical failure case of our approach


Fig. 10 Illustration of the mAP of our approach against the baselines, computed across the 500 collected query images

Fig. 11 Illustration of the average query time (in 10^3 seconds) of our approach against the baselines, computed across the 500 collected query images

Figure 10 shows the mAP of the different approaches. Our approach outperforms the two baselines significantly. In addition, we compare our approach with the variant using the original hard-negative mining (our approach (O)), which shows that our speedup scheme returns results comparable to the original hard-negative mining approach. To evaluate the efficiency of our speedup scheme, we compare the average query time of our approach with the two baselines and our approach (O), as shown in Fig. 11. Our approach takes only about ten minutes per query to find good matches.

6 Limitations and future work

A typical failure case of our approach can be seen in Fig. 9. In this example, we fail to find good top matches even though we have integrated the discriminative power of the filter bank, HOG, and SSIM features, because none of the three features is effective for the query. A possible remedy is to integrate more features into our framework. Moreover, our approach is still too slow for online interactive applications. In future work, we will make further efforts to improve the efficiency of our system.

Acknowledgements The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. This research is supported by the National Fundamental Research Grant 973 Program (2009CB320802), NSFC grants (61272326, 61025011), and a Research Grant of the University of Macau. Image credits: Andy Carvin, Bob Pejman, Leonid Afremov, Mariko Jesse, Risto-Jussi Isopahkala, Steven Allen, www.turbosquid.com.

References

1. Bach, F.R., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the 21st International Conference on Machine Learning (ICML), pp. 6–13 (2004)
2. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 24(4), 509–522 (2002)
3. Cao, Y., Wang, C., Zhang, L., Zhang, L.: Edgel index for large-scale sketch-based image search. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 761–768 (2011)
4. Chen, T., Cheng, M.M., Tan, P., Shamir, A., Hu, S.M.: Sketch2Photo: Internet image montage. ACM Trans. Graph. 28(5), 124 (2009)
5. Chong, H.Y., Gortler, S.J., Zickler, T.: A perception-based color space for illumination-invariant image processing. ACM Trans. Graph. 27(3), 61 (2008) (SIGGRAPH)
6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886–893 (2005)
7. Eitz, M., Hildebrand, K., Boubekeur, T., Alexa, M.: Sketch-based image retrieval: benchmark and bag-of-features descriptors. IEEE Trans. Vis. Comput. Graph. 17(11), 1624–1636 (2011)
8. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), pp. 518–529 (1999)
9. Ha, J.Y., Kim, G.Y., Choi, H.I.: The content-based image retrieval method using multiple features. In: International Conference on Networked Computing and Advanced Information Management (NCM), vol. 1, pp. 652–657 (2008)
10. Hauagge, D.C., Snavely, N.: Image matching using local symmetry features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 206–213 (2012)
11. Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Trans. Graph. 26(3), 4 (2007)
12. Hays, J., Efros, A.A.: Im2gps: estimating geographic information from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2008)
13. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC), pp. 604–613 (1998)
14. Jain, P., Vijayanarasimhan, S., Grauman, K.: Hashing hyperplane queries to near points with applications to large-scale active learning. In: Advances in Neural Information Processing Systems (NIPS), vol. 23 (2010)
15. Jegou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for large scale image search. In: European Conference on Computer Vision (ECCV), pp. 304–317 (2008)
16. Johnson, M.K., Dale, K., Avidan, S., Pfister, H., Freeman, W.T., Matusik, W.: CG2Real: improving the realism of computer generated images using a large collection of photographs. IEEE Trans. Vis. Comput. Graph. 17(9), 1273–1285 (2011)


17. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5, 27–72 (2004)
18. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 2169–2178 (2006)
19. Liu, W., Wang, J., Mu, Y., Kumar, S., Chang, S.F.: Compact hyperplane hashing with bilinear functions. In: Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 17–24 (2012)
20. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
21. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
22. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42(3), 145–175 (2001)
23. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: SimpleMKL. J. Mach. Learn. Res. 9, 2491–2521 (2008)
24. Rui, Y., Huang, T.S., Ortega, M., Mehrotra, S.: Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Trans. Circuits Syst. Video Technol. 8(5), 644–655 (1998)
25. Russell, B.C., Sivic, J., Ponce, J., Dessales, H.: Automatic alignment of paintings and photographs depicting a 3D scene. In: 3rd International IEEE Workshop on 3D Representation for Recognition (3dRR) (2011)
26. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2007)
27. Shrivastava, A., Malisiewicz, T., Gupta, A., Efros, A.A.: Data-driven visual similarity for cross-domain image matching. ACM Trans. Graph. 30(6), 154 (2011) (SIGGRAPH Asia)
28. Vadivel, A., Sural, S., Majumdar, A.K.: Image retrieval from the web using multiple features. Online Inf. Rev. 33(6), 1169–1188 (2009)
29. Vishwanathan, S.V.N., Sun, Z., Theera-Ampornpunt, N., Varma, M.: Multiple kernel learning and the SMO algorithm. In: Advances in Neural Information Processing Systems (NIPS), vol. 23 (2010)
30. Wang, S., Huang, Q., Jiang, S., Tian, Q.: S3MKL: scalable semi-supervised multiple kernel learning for real-world image applications. IEEE Trans. Multimedia 14(4), 1259–1274 (2012)

Gang Sun received his B.Sc. degree from Nankai University, Tianjin, China, in 2009. He is currently a Ph.D. candidate at the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences. His research interests include computer graphics, computer vision, and machine learning.

Shuhui Wang received the B.Sc. degree from Tsinghua University, Beijing, China, in 2006, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2012. He is currently a researcher with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He is also with the Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences. His research interests include semantic image analysis, image and video retrieval, and large scale web multimedia data mining.

Xuehui Liu received her B.Sc. and M.Sc. degrees from Xiangtan University, Hunan, and her Ph.D. degree in 1998 from the Institute of Software, Chinese Academy of Sciences. Since then she has been working at the Institute of Software, Chinese Academy of Sciences, and is now an Associate Professor. Her research interests include realistic image synthesis, physically based modeling, and simulation.

Qingming Huang received the B.Sc. degree in computer science and Ph.D. degree in computer engineering from Harbin Institute of Technology, Harbin, China, in 1988 and 1994, respectively. He is currently a Professor with the Graduate University of the Chinese Academy of Sciences (CAS), Beijing, China, and an Adjunct Research Professor with the Institute of Computing Technology, CAS. His research areas include multimedia video analysis, video adaptation, image processing, computer vision, and pattern recognition.

Yanyun Chen received his Ph.D. from the Institute of Software, Chinese Academy of Sciences, in 2000 and his M.S. and B.S. from the China University of Mining and Technology in 1996 and 1993, respectively. He is currently a researcher at the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences. His current research interests include photorealistic rendering, virtual reality, augmented reality, and complex surface appearance modeling techniques.

Enhua Wu received his B.Sc. degree in 1970 from Tsinghua University, Beijing, and his Ph.D. degree in 1984 from the University of Manchester, UK. Since 1985, he has been working at the Institute of Software, Chinese Academy of Sciences, and since 1997 he has also been teaching at the University of Macau. He is a member of IEEE and ACM. His research interests include realistic image synthesis, virtual reality, physically based modeling, and scientific visualization.