Visual Tag Dictionary: Interpreting Tags with Visual Words

Meng Wang†, Kuiyuan Yang‡, Xian-Sheng Hua†, Hong-Jiang Zhang§

† Microsoft Research Asia, Beijing 100090, P. R. China ({mengwang, xshua}@microsoft.com)
‡ University of Science and Technology of China, Hefei 230027, P. R. China ([email protected])
§ Microsoft Advanced Technology Center, Beijing 100090, P. R. China ([email protected])

ABSTRACT

Visual-word based image representation has shown effectiveness in a wide variety of applications such as categorization, annotation and search. By detecting keypoints in images and treating their patterns as visual words, an image can be represented as a bag of visual words, which is analogous to the bag-of-words representation of text documents. In this paper, we introduce a corpus named the visual tag dictionary. Unlike conventional dictionaries that define terms with textual words, the visual tag dictionary interprets each tag with visual words. The dictionary is constructed in a fully automatic way by exploring tagged image data on the Internet. With this dictionary, tags and images are connected via visual words, and many applications can thus be facilitated. As examples, we empirically demonstrate the effectiveness of the dictionary in tag-based image search, tag ranking and image annotation.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms Algorithms, Experimentation

Keywords Flickr, tag, image search, image annotation

1. INTRODUCTION

Recent years have witnessed the growth of community-contributed social media on YouTube, Flickr, Zooomr, etc. These media repositories not only enable users to collaboratively create, evaluate and distribute media information, but also allow them to annotate their uploaded media data with descriptive keywords called tags. Tags can greatly facilitate the organization of social media, and many interesting applications have been built around these valuable metadata, such as tag-based media search, tag recommendation and tag ranking. There are also many works that exploit these free metadata in multimedia content analysis, such as using tagged images as free training data to learn semantic concept models. Intensive research has been dedicated to the above problems, and encouraging results have been reported.

Different from these efforts, in this paper we do not propose a specific algorithm or solve one particular problem; instead, we introduce an approach (actually a corpus) that can benefit all of these problems. It is well recognized that the key difficulty in the aforementioned problems is the semantic gap between images and tags. More specifically, images are often represented with low-level features and can be viewed as samples in a feature space, whereas tags are textual words and cannot be located in that feature space. Therefore, we introduce a visual tag dictionary that attempts to bridge this gap via visual words. Here a visual word means a pattern of salient local image patches, and many recent studies have demonstrated that visual-word-based image representation is effective in applications such as image categorization and search. In the visual tag dictionary, each tag is interpreted with a distribution of visual words, which is analogous to conventional dictionaries (such as the Merriam-Webster dictionary) that explain terms with textual words. In this way, tags and images can be connected via visual words.

As illustrated in Fig. 1, the visual tag dictionary is constructed as follows. First, a large image dataset is gathered and keypoints are detected in each image. The keypoints are grouped into clusters, and each cluster is treated as a visual word, as it represents a pattern of local image patches. For each specific tag, the images annotated with it are collected, and we then build a Gaussian Mixture Model (GMM) based on their visual-word representations (the reason we adopt GMM is explained in Section 3.4). This model is regarded as the visual-word-based interpretation of the tag. In fact, if we do not take the covariance matrices in the GMMs into account, the visual tag dictionary is very close to a conventional textual dictionary; the only difference is that the elements in our dictionary are visual words rather than textual words. In particular, different Gaussian components can be viewed as different interpretations of a tag, which addresses the polysemy of tags. The visual tag dictionary can be applied in different tasks, and in Section 4 we provide several exemplary applications.



Figure 1: A schematic illustration of the visual tag dictionary construction approach: keypoint detection, keypoint clustering into a visual-word list, and mixture model learning in the visual-word feature space. Different from conventional textual dictionaries, such as the illustrated Merriam-Webster and Wikipedia entries for "Eagle", which explain words with textual sentences, the visual tag dictionary interprets each tag with sets of visual words. In this way, the gap between image visual content and textual tags is bridged and many applications can be facilitated.

The main contributions of this paper can be summarized as follows:

(1) Introduce the visual tag dictionary, which can facilitate different applications.

(2) Describe the construction approach of the dictionary. It is flexible: we can easily change several steps to satisfy different demands, or expand the dictionary by adding new tag entries.

(3) Demonstrate the usefulness of the dictionary in different application scenarios, including tag-based image search, tag ranking and image annotation.

The rest of this paper is organized as follows. In Section 2, we provide a short review of related work. In Section 3, we introduce the construction approach of the visual tag dictionary. In Section 4, we demonstrate the application of the visual tag dictionary in different tasks. Finally, we conclude the paper in Section 5.


Figure 2: Several exemplary images and the associated tags in the NUS-WIDE dataset (e.g., "black, butterfly, insects"; "birds, hawk, raptor"; "zoo, african, lions"; "flowers, red, petal, anemone, bloom"; "air, balloon, aerial, adventure").

Figure 4: Two visual words and several images that contain them.


Figure 3: The frequency distribution of tags and several examples.


2. RELATED WORK

Figure 5: The frequency distribution of visual words.

As a representative behavior in the Web 2.0 era, collaborative tagging has attracted the interest of more and more researchers. The existing research efforts can mainly be classified into two categories. The first category investigates users' tagging behavior and aims to assist users in this process. For example, Sigurbjornsson et al. [13] provided insights into how users tag their photos and what types of tags they provide; they conclude that users usually tag their photos with more than one tag and that these tags span a broad spectrum of the semantic space. Yan et al. [19] proposed a model that predicts the time cost of manual image tagging. Li et al. [8] proposed a method to estimate the relevance scores of tags with respect to an image. Weinberger et al. proposed a method to analyze the ambiguity of tags [17]. Different tag recommendation methods have also been proposed to help users tag images more efficiently [6][13].

The other category regards the tagged multimedia corpus as a knowledge base and explores it in different applications. Kennedy et al. [5] and Tang et al. [14] trained classifiers with Flickr images and the associated tags for image annotation. Jing et al. developed an online travel assistant system by mining Flickr photos and their associated tags [4]. Yang et al. generated a Web 2.0 dictionary by automatically analyzing the relationships among tags [21]. Recently, Wu et al. [18] proposed a method to estimate the semantic distance between tags. Our work also belongs to this category, as we explore a tagged image set to construct a visual tag dictionary that facilitates different applications.

3. THE CONSTRUCTION OF VISUAL TAG DICTIONARY

3.1 Tagged Image Set

The visual tag dictionary is constructed based on the NUS-WIDE dataset, which contains 269,648 images collected from Flickr [2]. It is worth mentioning that our approach is flexible, and the dictionary can also be built from other datasets. There are originally 425,059 tags associated with these images. Chua et al. [2] filtered out the tags that are outside the WordNet lexicon or that appear too infrequently, and 5018 unique tags are kept after this process. Figure 2 illustrates several exemplary images and their associated tags. Each of the 5018 tags is regarded as an entry in our dictionary. Figure 3 illustrates their frequency distribution.

3.2 Keypoint Detection and Clustering

We implement the Difference-of-Gaussian (DoG) [10] method to detect keypoints in the images, and 122,918,378 keypoints are found in total. We then randomly select 10,000,000 keypoints and group them into 1000 clusters using the K-means algorithm. The remaining keypoints are assigned to the clusters by the nearest-neighbor principle. The center of each cluster is regarded as a "visual word" that represents the specific local pattern shared by the patches around the keypoints, as shown in Fig. 4. Figure 5 illustrates the frequency distribution of the visual words. After building the visual-word list, an image can be represented as a visual-word vector in which each component counts the occurrences of the corresponding visual word.
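To make the pipeline concrete, the following sketch shows DoG/SIFT keypoint detection, k-means quantization of a descriptor sample into visual words, and the resulting bag-of-visual-words vector for an image. The library choices (OpenCV, scikit-learn) and the sampling and cluster parameters are assumptions for illustration, not the paper's actual implementation.

```python
# Sketch of the visual-word pipeline (assumed libraries and parameters).
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def extract_descriptors(image_paths):
    """Detect DoG keypoints and compute SIFT descriptors for each image."""
    sift = cv2.SIFT_create()
    descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            continue
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            descriptors.append(desc)
    return descriptors

def build_visual_words(descriptor_list, num_words=1000, sample_size=1_000_000):
    """Cluster a random sample of keypoint descriptors into visual words."""
    data = np.vstack(descriptor_list).astype(np.float32)
    if len(data) > sample_size:
        idx = np.random.choice(len(data), sample_size, replace=False)
        data = data[idx]
    return MiniBatchKMeans(n_clusters=num_words, random_state=0).fit(data)

def bag_of_visual_words(desc, kmeans):
    """Represent one image as a visual-word count vector (nearest-center assignment)."""
    words = kmeans.predict(desc)
    return np.bincount(words, minlength=kmeans.n_clusters).astype(np.float64)
```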

Figure 6: An example that illustrates the ambiguity of tags. The above images are all associated with the tag "apple", and we can see that this tag can correspond to different categories, such as fruit, computer, mobile phone and logo.

3.3 Mixture Model Learning


For each of the 5018 tags, we learn a mixture model from the visual-word-vector representations of the images annotated with it. The mixture model naturally addresses tag polysemy. For example, as illustrated in Fig. 6, the tag "apple" can stand for a fruit, a computer, a mobile phone, etc. A mixture model can capture the multiple meanings or senses of a tag and generate multiple representations for it, analogous to a conventional dictionary that can contain multiple explanations for an entry.

We employ a Gaussian Mixture Model (GMM), and for simplicity the covariance matrix of each Gaussian component is assumed to be diagonal. Given a set of images $I = \{I_1, I_2, \ldots, I_n\}$, where each image $I_i$ is represented by a visual-word vector $x_i$, the GMM takes the form


$$f(x \mid \Theta) = \sum_{k=1}^{K} \pi_k \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\Big\{ -\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) \Big\} \qquad (1)$$

where $\Theta = \{\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K\}$. The parameters are estimated by the Expectation-Maximization (EM) algorithm, i.e., by iterating the following two steps:

• Expectation Step:

$$w_{jk} = \frac{\pi_k f(x_j \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} f(x_j \mid \mu_{k'}, \Sigma_{k'})} \qquad (2)$$

• Maximization Step:

$$\pi_k = \frac{1}{n} \sum_{j=1}^{n} w_{jk} \qquad (3)$$

$$\mu_k = \frac{\sum_{j=1}^{n} w_{jk}\, x_j}{\sum_{j=1}^{n} w_{jk}} \qquad (4)$$

$$\Sigma_k = \frac{\sum_{j=1}^{n} w_{jk}\, (x_j - \mu_k)(x_j - \mu_k)^T}{\sum_{j=1}^{n} w_{jk}} \qquad (5)$$

The number of Gaussian components is decided following the Minimum Description Length (MDL) criterion [12], i.e.,

$$K^* = \arg\min_K \; \frac{K(1+2d)}{2} \log n - \sum_{i=1}^{n} \log f(x_i \mid \Theta) \qquad (6)$$

Figure 7: The distribution of mixture numbers of tags.
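As an illustration of the per-tag model learning, the sketch below fits diagonal-covariance GMMs with EM and selects the number of components with the MDL-style criterion of Eq. (6). It assumes the bag-of-visual-words vectors from the previous sketch and uses scikit-learn's GaussianMixture in place of the paper's own EM implementation; the candidate range for K is an assumption.

```python
# Sketch of per-tag GMM learning with MDL-based model selection (Eq. 6).
import numpy as np
from sklearn.mixture import GaussianMixture

def mdl_score(gmm, X):
    """MDL criterion: (K(1+2d)/2) log n - sum_i log f(x_i | Theta)."""
    n, d = X.shape
    K = gmm.n_components
    return 0.5 * K * (1 + 2 * d) * np.log(n) - gmm.score_samples(X).sum()

def learn_tag_model(X, k_candidates=range(1, 11)):
    """Fit diagonal-covariance GMMs for one tag and keep the MDL-best one."""
    best_gmm, best_score = None, np.inf
    for k in k_candidates:
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=0).fit(X)
        score = mdl_score(gmm, X)
        if score < best_score:
            best_gmm, best_score = gmm, score
    return best_gmm

# Usage sketch: X_tag is an (n_images x 1000) matrix of visual-word vectors
# of the images annotated with one tag.
# tag_model = learn_tag_model(X_tag)
```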

Figure 7 illustrates the distribution of the number of Gaussian components, and we can see that for most tags the number varies between 2 and 10. The corpus therefore contains the following two parts:

(1) The visual-word list, i.e., the centers of the 1000 clusters. Given the feature vector of a new keypoint, its cluster membership can be determined by searching for the nearest cluster center.

(2) The tag list and the parameters of the GMMs, including the weights, mean vectors and covariance matrices of the Gaussian components.

Accordingly, the size of the visual tag dictionary scales as $O(rd + (1+2d)\sum_{i=1}^{t} K_i)$, where $r$ is the dimensionality of the SIFT features, $d$ is the number of visual words, $t$ is the number of tag entries and $K_i$ is the number of Gaussian components of the $i$-th tag. The dictionary is particularly flexible in adding tags, since we only need to add $(1+2d)K$ parameters for each new tag. The size of the corpus is about 60 MB in practice, and it can be downloaded from http://research.microsoft.com/enus/um/people/mengwang/VisualTagDictionary.zip.
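For illustration, a two-part corpus of this kind could be serialized as follows; the file layout and key names are hypothetical and do not describe the actual format of the released dictionary.

```python
# Hypothetical serialization of the two-part corpus: visual words + per-tag GMMs.
import numpy as np

def save_dictionary(path, cluster_centers, tag_models):
    """cluster_centers: (1000 x 128) array; tag_models: {tag: fitted GaussianMixture}."""
    payload = {"visual_words": cluster_centers}
    for tag, gmm in tag_models.items():
        payload[tag + "/weights"] = gmm.weights_          # K component weights
        payload[tag + "/means"] = gmm.means_              # K x d mean vectors
        payload[tag + "/covariances"] = gmm.covariances_  # K x d diagonal covariances
    np.savez_compressed(path, **payload)
```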



Figure 8: The performance comparison of time-based ranking and dictionary-based ranking.


Figure 10: The top results of different ranking methods for the query "rainbow": (a) time-based ranking; (b) dictionary-based ranking.

Figure 9: The top results of different ranking methods for the query "fruit": (a) time-based ranking; (b) dictionary-based ranking.

3.4 Discussion

In the previous sections, we have introduced the visual tag dictionary as a counterpart of the conventional dictionaries (only replacing textual words with visual words). However, our approach can also be viewed as the process of modeling each tag with images' visual-word representations. We emphasize that the introduced approach is not unique; for example, we could also adopt other models such as SVMs. Here we present several of our considerations on why we choose GMM. First, the computational cost of training a GMM with diagonal covariance matrices is only $O(Kld)$, where $K$ is the number of Gaussian components, $l$ is the number of training samples and $d$ is the feature dimensionality; this is advantageous in comparison with many other methods. For example, the computational cost of SVM training scales as $O(s^3 + s^2 l + sld)$, where $s$ is the number of support vectors. In visual modeling, support vectors often account for a large proportion of the training samples due to the high complexity of the data, which makes the cost of SVM training nearly $O(l^3 + l^2 d)$, much more expensive than GMM training. In addition, the storage cost of a GMM is very low; as a public corpus, a smaller size makes the dictionary easier to share and download. A GMM only needs to store $K(1+2d)$ parameters, whereas an SVM model with an RBF kernel contains $1+sl$ parameters. In most cases $s$ is much greater than $K$, so the advantage of GMM in storage is also obvious. For example, Columbia374 [20], which contains 374 SVM models learned from about 61,941 samples, occupies more than 3 GB, whereas our dictionary with more than 5000 tag entries is only about 60 MB. The weakness of GMM lies in its performance: as a generative model, existing studies have shown that it performs worse than discriminative models in visual concept modeling [11]. However, a generative model is actually closer to the conventional textual dictionaries, since it describes the properties of each target concept, i.e., it explains what the tag should be instead of what it should not be.



Recently, several discriminative training methods have been proposed for generative models in order to bridge their performance gap with discriminative models [1]. They can also be incorporated into our visual tag dictionary construction approach to obtain better performance.

4. APPLICATIONS

4.1 Tag-Based Image Search

4.1.1 Application Scenario

Along with the explosive growth of social media data, efficient search of these data becomes highly desirable. Tag-based image search, which returns the images annotated with a specific query tag, is a straightforward approach. Currently, Flickr provides two ranking options for tag-based image search. One is "most recent", which orders images by their uploading time, and the other is "most interesting", which ranks images by "interestingness", a measure that integrates click-through, comments, etc. Both options rank images according to measures (time or interestingness) that are unrelated to relevance levels, and thus may bring many irrelevant images to the top of the returned ranking lists. Here we show the application of the visual tag dictionary in tag-based image search. Given a query tag, we estimate the relevance scores of the images that contain this tag as their likelihoods under the tag's GMM, i.e.,

$$r(x_i \mid t) = \sum_{k=1}^{K} \pi_k \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\Big\{ -\frac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k) \Big\} \qquad (7)$$

These images are then ordered by $r(x_i \mid t)$, and this ranking list will be better in terms of relevance.
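A minimal sketch of this dictionary-based ranking, assuming tag_model is a fitted GaussianMixture for the query tag (as in the earlier sketches) and X holds the visual-word vectors of the images annotated with that tag:

```python
# Sketch of dictionary-based image ranking (Eq. 7).
import numpy as np

def rank_images_by_tag(X, tag_model):
    """Return image indices sorted by descending likelihood under the tag's GMM."""
    log_likelihoods = tag_model.score_samples(X)  # log f(x_i | Theta_t)
    return np.argsort(-log_likelihoods)
```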

4.1.2 Experiments

We select a diverse set of popular queries, including apple, beach, bird, cow, dog, eagle, flower, fruit, lion, statue, fighter, horse, rabbit, furniture, palace, penguin, rainbow, sailboat, spider and swimmer, and then perform tag-based image search with the "ranking by most recent" option. The top 2,000 returned images for each query are collected. We compare the following two methods: (1) time-based ranking, i.e., ordering the images according to their uploading time; and (2) dictionary-based ranking, i.e., re-ordering the images based on their relevance scores estimated with Eq. (7). From the results we can see that the dictionary-based ranking performs better for most queries. The mean average precision of time-based ranking and dictionary-based ranking is 0.545 and 0.599, respectively. This demonstrates that a better ranking list can be obtained by exploring the visual tag dictionary. Figures 9 and 10 illustrate the top results of two exemplary queries ("fruit" and "rainbow").

Figure 11: The performance comparison of different tag ranking methods.

4.2 Tag Ranking

4.2.1 Application Scenario

Generally, the tag list associated with an image is orderless. For example, the lists on Flickr are usually displayed in the order generated from their input sequence, and this order carries little information about the importance or relevance of the tags. This lack of order has significantly limited their usefulness. In [9], Liu et al. proposed a tag ranking approach that orders the tags of an image according to their relevance levels with respect to the image. Here we show that tag ranking can also be accomplished with the visual tag dictionary. From the probabilistic point of view, the relevance level of a tag with respect to an image should be decided by its posterior probability, i.e., $\Pr(t \mid x)$. According to Bayes' law, we obtain

$$\Pr(t \mid x) = \frac{p(x \mid t)\Pr(t)}{p(x)} \qquad (8)$$

where $\Pr(t)$ is the prior appearance probability of the tag $t$, and $p(x)$ and $p(x \mid t)$ are the prior probability density function and the probability density function of the image conditioned on the tag $t$, respectively. Since $p(x)$ is identical for different tags, we drop it and define a tag's relevance level with respect to an image as

$$r(t \mid x) = p(x \mid t)\Pr(t) \qquad (9)$$

Here $p(x \mid t)$ is estimated with the tag's GMM, i.e., we adopt the likelihood $f(x_i \mid \Theta_t)$, and $\Pr(t)$ is estimated as $N(t)/N$, where $N(t)$ is the number of results returned by performing tag-based search on Flickr with query $t$, and $N$ is the total number of images on Flickr. The tags of an image can then be ranked by $r(t \mid x)$ in descending order.
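The following sketch ranks the tags of one image by Eq. (9), assuming hypothetical dictionaries tag_models (tag to fitted GaussianMixture) and tag_priors (tag to Pr(t)); tags missing from the dictionary are pushed to the end, which is the naive strategy also used in the experiments below.

```python
# Sketch of dictionary-based tag ranking (Eq. 9) for a single image.
import numpy as np

def rank_tags_for_image(x, tags, tag_models, tag_priors):
    """Rank an image's tags by log p(x|t) + log Pr(t); unknown tags go last."""
    scored, missing = [], []
    for t in tags:
        if t in tag_models:
            log_px_t = tag_models[t].score_samples(x.reshape(1, -1))[0]
            scored.append((log_px_t + np.log(tag_priors[t]), t))
        else:
            missing.append(t)
    scored.sort(reverse=True)
    return [t for _, t in scored] + missing
```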

4.2.2 Experiments

We conduct experiments on the dataset used in [9], which contains 50,000 images. Liu et al. [9] applied a pre-filtering process to the tags collected from Flickr, leaving 13,330 unique tags after filtering. Ground truth is provided for 10,000 images: for each of these images, every tag in the list is labeled with one of five relevance levels, namely most relevant (score 5), relevant (score 4), partially relevant (score 3), weakly relevant (score 2), and irrelevant (score 1). NDCG [3] is used as the performance evaluation measure for tag ranking. Given an image I with ranked tag list $t_1, t_2, \ldots, t_n$, the NDCG is computed as

$$N_n = Z_n \sum_{i=1}^{n} \frac{2^{r(i)} - 1}{\log(1 + i)} \qquad (10)$$

where $r(i)$ is the relevance level of the $i$-th tag and $Z_n$ is a normalization constant chosen so that the optimal ranking's NDCG score is 1.
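A small sketch of this NDCG computation, assuming the relevance levels r(i) of a ranked tag list are given as integers from 1 to 5:

```python
# Sketch of the NDCG measure in Eq. (10).
import numpy as np

def ndcg(relevances):
    """NDCG of a ranked tag list, normalized so the ideal ranking scores 1."""
    rel = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(rel) + 1)
    dcg = ((2.0 ** rel - 1) / np.log(1 + positions)).sum()
    ideal_dcg = ((2.0 ** np.sort(rel)[::-1] - 1) / np.log(1 + positions)).sum()
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0
```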

Figure 12: The tag ranking results of several exemplary images (e.g., the original tag list "ocean sky tree beach palm" is reordered to "palm beach tree sky ocean", and "red automobile maryland vehicle" to "red vehicle automobile maryland").

Figure 13: The performance comparison of the dictionary-based annotation method and the ALIPR system (mean precision at 5, 10, and 15).

After estimating the NDCG of each image's tag list, we average them to obtain an overall performance measurement for a tag ranking method. We compare the following three methods:

(1) Baseline, i.e., using the order of the tags as displayed on Flickr.

(2) Dictionary-based tag ranking, i.e., ranking the tags by the relevance scores estimated with Eq. (9). It is worth noting that the visual tag dictionary contains only 5013 entries, which means that many of the tags cannot be found in the dictionary; here we adopt a naive approach that simply moves the missing tags to the end of the ranking list.

(3) Random walk-based tag ranking, i.e., the method proposed in [9].

Figure 11 illustrates the results. We can see that the dictionary-based tag ranking outperforms the baseline method, which demonstrates the usefulness of the visual tag dictionary in tag ranking. However, the random walk-based tag ranking still achieves the best performance. This can be attributed to two facts: (1) the dictionary is not large enough; as mentioned above, many tags cannot be found in the dictionary and are moved backward, which degrades the tag ranking performance, and a larger dictionary can be expected to perform better; (2) the dictionary-based tag ranking relies on knowledge from a different dataset, whereas the random walk-based tag ranking is implemented directly on the testing data.

Figure 14: The top five annotation results of several exemplary images.

Even with slightly worse performance than the random walk-based method, the dictionary-based tag ranking can still be very useful, as it is much more computationally efficient: the random walk-based method needs to perform Kernel Density Estimation over all images and then run a random walk on the tags of each image, whereas the dictionary-based method only needs to compute the confidence score of each tag.

4.3 Image Annotation

4.3.1 Application Scenario

Image annotation attempts to assign a given image a set of keywords that are relevant to its content. Many research efforts have been dedicated to this problem [16][7][15]. A typical approach is to train classifiers for a set of concepts based on a labeled training set and then use them to annotate new images. Here we accomplish annotation with the visual tag dictionary: Eq. (9) lets us estimate the relevance score of each tag in the dictionary with respect to an image, and we can thus annotate the image by setting a threshold on the scores or by directly selecting the most confident tags.
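A minimal sketch of this annotation scheme, scoring every tag entry with Eq. (9) and keeping the top-K most confident ones; tag_models and tag_priors are the same hypothetical structures as in the earlier sketches:

```python
# Sketch of dictionary-based annotation: keep the top-K tags by Eq. (9).
import numpy as np

def annotate_image(x, tag_models, tag_priors, top_k=5):
    """Score every dictionary tag for one image and return the K most confident."""
    scores = {t: tag_models[t].score_samples(x.reshape(1, -1))[0] + np.log(tag_priors[t])
              for t in tag_models}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```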


4.3.2 Experiments

We randomly collect 500 images from Flickr and then use our method to annotate them. For each image, the most confident K tags are kept as the annotation results, and we compare this method with the ALIPR system introduced in [7] for different values of K. The results are illustrated in Fig. 13. From the figure we can see that the mean precision of our method consistently exceeds that of ALIPR. Figure 14 illustrates the annotation results of several exemplary images.

5. CONCLUSIONS

This paper introduces a visual tag dictionary and its construction approach. Different from conventional dictionaries that define terms with textual words, the visual tag dictionary interprets tags with distributions of visual words. With this dictionary, images and tags are connected via visual words, and many applications can be facilitated. We have shown the application of the dictionary in three different tasks, i.e., tag-based image search, tag ranking and image annotation, and empirical results have demonstrated its usefulness. There are several directions for future work:

(1) Expanding the dictionary. We can add more tag entries and build the dictionary with more image data.

(2) Adding a tag refinement component to the dictionary construction process. Existing studies reveal that the tags provided by Flickr users are rather noisy, and there is only around a 50% chance that a concept actually appears in an image when it is tagged [5]. This noise will make the visual-word-based tag explanations inaccurate and degrade the performance of the related applications. Implementing a tag refinement process will reduce the noise, so that a better dictionary can be constructed.

(3) Adding a tag filtering component to the dictionary construction process. Many tags are intrinsically not directly related to the visual content of images, such as adjectives and several abstract nouns (e.g., "beautiful", "distance" and "love"). Hence, it is not meaningful to interpret these tags with visual words. We will develop a method to decide whether a tag is visually related or not, such that unrelated tags can be filtered out.

(4) Investigating more applications of the dictionary. Besides the three applications introduced in this paper, the dictionary can be further used in many other tasks, such as estimating the semantic distance between tags.

6. REFERENCES

[1] C. M. D. Alamo, F. J. C. Gil, C. D. T. Munilla, and L. H. Gomez. Discriminative training of GMM for speaker identification. In Proceedings of ICASSP, 1996.
[2] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. T. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, 2009.
[3] K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 2002.
[4] F. Jing, L. Zhang, and W. Y. Ma. Visualtour: an online travel assistant based on high quality images. In Proceedings of ACM Multimedia, 2006.
[5] L. S. Kennedy, S. F. Chang, and I. V. Kozintsev. To search or to label? Predicting the performance of search-based automatic image classifiers. In Proceedings of the ACM International Workshop on Multimedia Information Retrieval, 2006.
[6] R. H. V. Leuken, L. Garcia, X. Olivares, and R. van Zwol. Visual diversification of image search results. In Proceedings of the International World Wide Web Conference, 2009.
[7] J. Li and J. Wang. Real-time computerized annotation of pictures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6), 2008.
[8] X. R. Li, C. G. Snoek, and M. Worring. Learning tag relevance by neighbor voting for social image retrieval. In Proceedings of the ACM International Conference on Multimedia Information Retrieval, 2008.
[9] D. Liu, X. S. Hua, L. Yang, M. Wang, and H. J. Zhang. Tag ranking. In Proceedings of the International World Wide Web Conference, 2008.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 2004.
[11] N. Naphade and J. R. Smith. The role of classifiers in multimedia content management. In SPIE Storage and Retrieval for Media Databases, 2003.
[12] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.
[13] B. Sigurbjornsson and R. van Zwol. Flickr tag recommendation based on collective knowledge. In Proceedings of the International World Wide Web Conference, 2008.
[14] J. Tang, S. Yan, R. Hong, G. J. Qi, and T. S. Chua. Inferring semantic concepts from community-contributed images and noisy tags. In Proceedings of ACM Multimedia, 2009.
[15] M. Wang, X. S. Hua, R. Hong, J. Tang, G. J. Qi, and Y. Song. Unified video annotation via multi-graph learning. IEEE Transactions on Circuits and Systems for Video Technology, 19(5), 2009.
[16] M. Wang, X. S. Hua, J. Tang, and R. Hong. Beyond distance measurement: constructing neighborhood similarity for video annotation. IEEE Transactions on Multimedia, 11(3), 2009.
[17] K. Weinberger, M. Slaney, and R. van Zwol. Resolving tag ambiguity. In Proceedings of ACM Multimedia, 2008.
[18] L. Wu, X. S. Hua, N. Yu, W. Y. Ma, and S. Li. Flickr distance. In Proceedings of ACM Multimedia, 2008.
[19] R. Yan, A. Natsev, and M. Campbell. A learning-based hybrid tagging and browsing approach for efficient manual image annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[20] A. Yanagawa, S.-F. Chang, L. Kennedy, and W. Hsu. Columbia University baseline detectors for 374 LSCOM semantic visual concepts. Columbia University ADVENT Technical Report 222-2006-8, 2007.
[21] Q. Yang, X. Chen, and G. Wang. Web 2.0 dictionary. In Proceedings of the ACM International Conference on Image and Video Retrieval, 2008.
