Social Media Driven Image Retrieval

Adrian Popescu
CEA, LIST, Vision & Content Engineering Laboratory
92263 Fontenay aux Roses, France
[email protected]

Gregory Grefenstette
Exalead
Paris, France
[email protected]

ABSTRACT

People often try to find an image using a short query, and images are usually indexed using short annotations. Matching the query vocabulary with the indexing vocabulary is a difficult problem when little text is available. Textual user generated content in Web 2.0 platforms contains a wealth of data that can help solve this problem. Here we describe how to use Wikipedia and Flickr content to improve this match. The initial query is launched in Flickr and we create a query model based on co-occurring terms. We also calculate nearby concepts using Wikipedia and use these to expand the query. The final results are obtained by ranking the results for the expanded query using the similarity between their annotation and the Flickr model. Evaluation of these expansion and ranking techniques, over the ImageCLEF 2010 Wikipedia Collection containing 237,434 images and their textual annotations, shows that a consistent improvement is obtained compared to existing methods.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—search process

General Terms

Algorithms, Experimentation

Keywords

Wikipedia, Flickr, image retrieval

1. INTRODUCTION

The volume of user generated content represents both a chance and a challenge for many research fields. The unprecedented amount of multimedia information, some structured, some not, promises to yield rich discoveries in language modeling for information retrieval, in semantic gap reduction in pattern recognition, and even in modeling group self-description in the social sciences. Interesting results have already been reported for extracting useful knowledge from Wikipedia ([1, 7, 20]), for understanding the tagging process, with an application to tag recommendation ([18, 23]), and for using social data to diversify image retrieval results ([22, 12]). In this paper, we present an image retrieval system in which knowledge extracted from social media plays a central role. Social efforts can collaboratively create generic resources that describe a large number of concepts, such as Wikipedia, or be devoted to sharing user-created media, such as videos (YouTube) or photographs (Flickr). As might be expected from these different goals, studies [8] have shown that there are significant differences between the language used on photo-sharing platforms and the generic language of the Web, or that of Wikipedia. We show how these differences can be exploited to improve retrieval performance. This paper addresses the following research questions concerning the introduction of knowledge derived from social networks into an image retrieval system:

• given a query, how can we extract similar concepts from Wikipedia in an efficient manner?

• does the difference between general language and photography tagging language have a consistent effect on retrieval performance?

• how do multilingual knowledge structures affect the performance of retrieval over multilingual datasets?

• is the additional complexity introduced by the use of knowledge structures manageable under real-time constraints?

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICMR '11, April 17-20, Trento, Italy
Copyright 2011 ACM 978-1-4503-0336-1/11/04 ...$10.00.

In this paper, we present an image retrieval system answering these questions. Our system was evaluated using the ImageCLEF 2010 Wikipedia Collection [15], a large-scale, noisy, multilingual, heterogeneous, publicly available dataset. Accompanying the dataset are ground truth judgments for a series of queries and published performance results from participating ImageCLEF teams. Our evaluation shows that the combined use of Wikipedia-based query expansion and Flickr-based query modeling yields a consistent performance improvement over the results reported during the official ImageCLEF campaign.

2. RELATED WORK

Wikipedia's rich content and wide coverage make it a valuable resource for knowledge extraction. DBPedia [1] mines the structured parts of encyclopedic articles to produce RDF-structured data. Efforts to exploit Wikipedia to map concepts to each other and to new text are described in WikiRelate! [20], Explicit Semantic Analysis (ESA) [7] and Wikipedia Link-based Measure (WLM) [10]. WikiRelate! modifies techniques previously applied to WordNet in order to suit Wikipedia's structure. The authors of [7] map queries to Wikipedia concept representations in order to find related concepts. WLM exploits only Wikipedia links to find related concepts. [10] finds that ESA performs better than WLM and WikiRelate!, but at the cost of higher computational complexity. ESA is interesting because it finds related concepts for any given query and not only for monoconceptual queries, and is thus usable in Web information retrieval, as shown by the work on cross-lingual text retrieval described in [19]. However, since cosine distance is used to measure similarity between queries and articles in ESA, longer articles, which are potentially interesting in IR, are penalized.

Flickr tag processing has recently received considerable attention. Schmitz [17] applies a statistical subsumption model to induce hierarchies of tags. The authors of [13] use statistical methods to build community-driven folksonomies and also show how to evaluate folksonomies by automatic comparison to a manually created taxonomy, demonstrating that meaningful relations can be extracted from large tag sets. In [18] and [23], the authors analyze a large Flickr dataset to find tag relations, which are then used to suggest new tags based on already existing photo tags and their closely related tags. The authors of [18] also map tags into WordNet to extract the distribution of Flickr tags across conceptual domains, and report that the main tag categories include artifacts and places. In addition to tag correlation, Wu et al. [23] also analyze image content in order to improve tag recommendation. In [21], the authors use Minimum Length Distance to find interesting groups of related pictures associated with a query. Their approach turns a clustering problem into a data compression problem and exploits both tags and URLs instead of tags only. The main roles of tag clustering or compression are to maximize result diversity and to find representative images.

PanImages [6] is an image retrieval system which uses translation dictionaries in order to propose images in languages different from the query language. This type of service is particularly useful for queries in languages that are poorly represented on the Web because it improves the recall of images. The system translates only single words or small phrases that are found in dictionaries, which limits its usefulness in Web image retrieval since more complex queries cannot be processed. Moreover, the user needs to validate the translations before seeing results, which requires linguistic competence not only in the base language but also in the translation languages.

The ImageCLEF 2010 Wikipedia Collection [15] includes 237,434 images and associated textual metadata in none, one, or several languages (English is the most frequently used, followed by German and French). A set of 70 queries, formulated in English, German and French, was also created. Over this test collection, the best textual results are reported by Xerox Research [3], using a standard language modeling approach for textual retrieval. Xerox results are reported for a mix of all three query language variants, considering the image annotations as a single dataset. The same group also explored combinations of textual and content-based retrieval, concluding that the introduction of visual content analysis significantly improves results. The multilingual aspect was investigated by the University of North Texas [16], which reports good results for English and French after an automatic translation of all image annotations into the language of the queries. Whereas this technique is applicable to the ImageCLEF dataset, it would be difficult to scale up to Web-size collections. Dublin City University [11] performed document expansion with Wikipedia content and used the Okapi feedback algorithm for retrieval. In addition, document reduction was exploited to weight the terms in the query. The authors state that document expansion is useful but do not provide a baseline run without it, making the impact of the expansion hard to assess. The University of Berkeley [9] proposed runs for all three query languages using a retrieval model that combined logistic regression and blind relevance feedback. Interestingly, although German annotations are almost twice as numerous as French annotations, the results reported for the two languages are similar. UNED [2] implemented a vector space model approach with TF-IDF weights for the textual metadata and reported slightly worse results compared to language modeling [3] or Okapi-based retrieval [11]. These various efforts cover a broad panel of state-of-the-art retrieval methods applied to a freely available collection and constitute sound baselines for assessing the performance of other retrieval models, such as ours.

3. QUERY MODELING

In most existing image retrieval systems, queries are processed as a series of terms matched against image annotations, with no further treatment of what the query means in whole or in part. To improve search results, clustering over textual or visual features (or both) has been proposed when many images are found. Query expansion has also been proposed to improve recall, though precision is likely to decrease if the expansion terms are not correctly chosen [15]. Here we use separate query models derived from Flickr and Wikipedia, two data sources that differ in the nature of the language used and in the type of conceptual relations we can extract. Co-occurrence relations from Flickr can provide facets to restrain search, as seen in Flickr Clusters, but using these relations to expand a search can lead to semantic drift. On the other hand, Wikipedia is structured in categorical trees and can be used to mine isA relations, which are known to be useful for query expansion ([5], [14]). It is the combination of these two query models that we describe here.

3.1 Flickr Query Modeling

Flickr is one of the most successful photo sharing sites, so it is not surprising that term relations extracted from this source are defined within a photographic tagging language that differs consistently from general language [8]. The Flickr dataset, which includes billions of tagged images, is a fertile source for modeling an image query beyond the terms it explicitly contains. Previous work involving tag modeling focused on tag recommendation [18] or image clustering [12], using raw tag co-occurrence to derive related terms. Such relatedness measures tend to favor terms that are too generic from an information retrieval perspective, which requires finding terms that are frequently associated with the current query but also as specific as possible.

Table 1: Top 10 socially related Flickr tags.

fractals: fractal, apophysis, abstract, mandelbrot, romanesco, art, green, digital, cauliflower, nature, math
tennis player court: wimbledon, racket, players, sport, federer, ball, wta, atp, centre court, tournament
cactus desert: arizona, cacti, saguaro, tucson, cholla, sonoran, phoenix, california, barrel cactus, az

To comply with these dual conditions of frequency and specificity, we introduce an adaptation of the TF-IDF model to the social space. Given a query Q, we define its social relatedness to term T using:

SocRel(T|Q) = users(Q,T) * 1/log(pre(T))    (1)

where users(Q,T) is the number of distinct users who associate tag T with query Q among the top 20,000 results returned by the Flickr API for Q, and pre(T) is the number of distinct users, from a prefetched subset of 30,000 Flickr users, who have tagged photos with T. In this new social weighting scheme, term frequency and document counts from the classical IR formulas are replaced with user counts, which prevents the final relatedness score from being biased by heavy contributions from a small number of users [4]. Computing users(Q,T) from 20,000 results is intended to keep computation time low (under 3 seconds in the current unoptimized implementation) while accounting for different query-relevant contexts. pre(T) is precomputed from all the tags submitted by a random subset of 30,000 Flickr users.

Related terms are computed from preprocessed queries. This preprocessing removes photographic terms (which can have a negative impact on queries with few results) as well as prepositions and articles from the queries. Both prepositions and photographic terms are kept in precompiled lists extracted from Wikipedia: the "List of English prepositions" page, and the lists of photographic terms on the "Category:Film techniques" and "Category:Photographic processes" pages. In the following sections, we will use the preprocessed forms of the queries; for instance, cactus in desert becomes cactus desert. We illustrate the results of our social relatedness term extraction process in table 1, where we present the top 10 related terms for the queries fractals, tennis players on court and cactus in desert. A mix of relation types is found.
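The social tf-idf weighting of relation (1) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the input structures (a list of (user, tags) pairs standing in for the Flickr API response, and a dict of prefetched user tag sets) are hypothetical stand-ins for the paper's data sources.

```python
from collections import defaultdict
from math import log

def soc_rel(query_results, prefetched_user_tags):
    """Social tf-idf relatedness of tags to a query (sketch of Eq. 1).

    query_results: iterable of (user_id, tags) pairs for the top photos
        returned for the query (stand-in for the Flickr API response).
    prefetched_user_tags: dict user_id -> set of tags, for the random
        user subset used to precompute pre(T).
    """
    # users(Q, T): distinct users tagging query results with T
    users_q_t = defaultdict(set)
    for user, tags in query_results:
        for tag in tags:
            users_q_t[tag].add(user)

    # pre(T): distinct prefetched users who ever used tag T
    pre = defaultdict(set)
    for user, tags in prefetched_user_tags.items():
        for tag in tags:
            pre[tag].add(user)

    scores = {}
    for tag, users in users_q_t.items():
        n_pre = len(pre.get(tag, ()))
        if n_pre > 1:  # log(1) = 0 would divide by zero; drop such tags
            scores[tag] = len(users) / log(n_pre)
    return scores
```

A tag used by many distinct users in the query results but by few users overall thus receives a high score, matching the frequency/specificity trade-off described above.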
For fractals, the top related term is fractal, the singular form of the word; mandelbrot and romanesco are types of fractals, and art and abstract are generic terms often associated with the subject. For tennis player court, federer is the name of a famous tennis player; wimbledon and centre court are two well-known tennis venues; wta and atp are the governing bodies of the women's and men's circuits; and tournament names the emblematic type of tennis event. For cactus desert, cacti is the formal plural of cactus; saguaro, cholla and barrel cactus are types of cacti; and arizona and sonoran are well-known natural environments of cacti. The examples in table 1 illustrate that the extracted set of related terms implicitly describes important query facets: location is irrelevant for fractals, but appears for tennis player court (via court names) and for cactus desert.

If used directly for query expansion, some query related terms would introduce a high amount of noise in the results. For instance, while locations are relevant for defining facets of cactus desert, retrieving results with california or arizona would cover many subjects other than the initial query. Ideally, the expansion should be limited to subtypes of the initial query or of some of its terms. The automatic detection of isA relations within Flickr was explored in [17], but the results are not accurate enough to detect subtypes with sufficient precision and recall. The limitation is especially relevant for recall, because many useful subtypes would not appear among the top 20,000 results and would thus be ignored. We decided that another data source was needed for finding subtypes and, given its broad conceptual coverage, Wikipedia was chosen.

Before using Wikipedia, we create an enriched version of the query (Q_E) by selectively stemming original query terms. A Flickr related term is retained as a variant if its edit distance to one stemmed term from the initial query is smaller than three, or if the stemmed form is a prefix of the related term. Thus cactus desert becomes (cactus:cacti) desert and tennis player court is transformed into tennis (players:player) court. The process is identical for all languages. The final query model obtained from Flickr can be expressed as:

M_Flickr(Q) = ∪_{x=1..N} (weight(T_x), T_x)    (2)

where N is the maximum number of retained socially related Flickr tags, T_x is the tag at position x, and weight(T_x) = SocRel(T_x|Q)/SocRel(T_1|Q) is the normalized social weight of T_x calculated using relation (1), with x = 1...N. Words in the initial query and the top related Flickr term have weights of 1, while terms starting from x = 2 have weights normalized by SocRel(T_1|Q). This weighting expresses the fact that the importance of a term in the Flickr model decreases with its rank among the socially related terms.
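The variant-selection rule (edit distance below three to a stemmed query term, or the stem being a prefix of the related term) and the rank-normalized weighting of relation (2) can be sketched as follows. The helper names are illustrative, and stemming itself is assumed to happen upstream (e.g. "cactus" stemmed to "cactu").

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_variant(related_term, stemmed_query_term):
    """A Flickr related tag counts as a variant of a query term if it is
    within edit distance 3 of the stemmed term, or the stem is a prefix."""
    return (edit_distance(related_term, stemmed_query_term) < 3
            or related_term.startswith(stemmed_query_term))

def flickr_model(ranked_tags_with_scores):
    """Normalize SocRel scores by the top tag's score (sketch of Eq. 2).

    ranked_tags_with_scores: list of (tag, SocRel score), sorted by
    decreasing score, so the first entry gets weight 1."""
    top = ranked_tags_with_scores[0][1]
    return [(tag, score / top) for tag, score in ranked_tags_with_scores]
```

For instance, the tag "cacti" is accepted as a variant of the stem "cactu" (edit distance 1), while "arizona" is rejected.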

3.2 Wikipedia Query Modeling

When the terms in a query (or a part of a query) are categorical in nature, results corresponding to their subtypes are missed in the absence of query expansion. For instance, saguaro and pachycereus pringlei are valid subtypes of cactus, which is a part of cactus desert. Images tagged with these subtypes are potentially relevant for cactus desert, but they would not be returned when querying with the initial terms. Query expansion is particularly useful when initial queries return a small number of results, or for languages that are seldom used in the annotations of the images [6]. Since image queries cover a broad range of concepts and are expressed in different languages, a generic, detailed and multilingual data source is needed to enable efficient expansion. Multilingual, large and growing (over 17 million articles in over 100 languages as of December 2010^1), Wikipedia is a good data source for concept representation and query expansion. Here we propose a semantic relatedness measure similar to Explicit Semantic Analysis [7], but adapted for real-time information retrieval. Our main modifications concern the prevalent role accorded to article categories and the replacement of the cosine distance with the dot product to measure similarity.

^1 http://en.wikipedia.org/wiki/Wikipedia

Table 2: Top 10 related Wikipedia concepts.

fractals: Fractal; Fractal compression; Fractal antenna; Fractal cosmology; Mandelbrot set; Fractal landscape; Fractal dimension; Fractal-generating software; Newton fractal; Iterated function system
tennis player court: Roger Federer; Rafael Nadal; World number one male tennis player rankings; The Championships, Wimbledon; Billie Jean King; Venus Williams; Ken Rosewall; Juan Martin del Potro; Serena Williams; Maria Sharapova
cactus desert: Saguaro; Cylindropuntia bigelovii; Pachycereus pringlei; Cylindropuntia fulgida; Stenocereus thurberi; Opuntia ficus-indica; Opuntia basilaris; Opuntia engelmannii; Hylocereus undatus; Tucson, Arizona; Phoenix, Arizona

We express the semantic relatedness between a query and a Wikipedia article as a combination of two scores. We first measure the overlap between the words in the query and the words in the category section of encyclopedic articles, and then compute the dot product between the query and a vectorial representation of the entire article content. Priority is given to the first score. Words in categories are given a preferential role because, even if the Wikipedia categorization process is not completely accurate [20], it still enables the extraction of good quality conceptual subsumption relations. Also, the use of categories is more appropriate for our application than for the original use of Explicit Semantic Analysis, text categorization: whereas in the latter case entry texts have arbitrary length, most image queries are short and the overlap between queries and categories is significant. We compute the overlap using the enriched version of the queries that contains term variants (Q_E). Also, for English, queries are run through WordNet and synonyms are added for the query terms that are unambiguous in WordNet, to avoid introducing noise from polysemous word senses. Overlap is normalized by the number of words in a query, and its values range from 0 (no terms in common between the query and the article's categories) to 1 (all terms in common). Because queries usually contain a small number of words, the overlap scores offer a coarse expression of semantic relatedness, and a large number of articles will share the same scores. The use of overlap rather than cosine similarity avoids one of the drawbacks of purely vectorial measures, namely biasing the final scores towards a small number of vector components which can be overrepresented in target documents. In our system, given the query tennis player court, an article categorized under tennis and player is always ranked better than an article categorized only under player.
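The normalized category overlap described above can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions (category names are split into lowercase words; the query is already enriched with term variants), not the paper's exact code.

```python
def category_overlap(query_terms, article_categories):
    """Fraction of query words that appear among the words of the
    article's category names: 0 = nothing shared, 1 = all query words
    found. A sketch of the coarse relatedness score of Section 3.2."""
    cat_words = set()
    for cat in article_categories:
        cat_words.update(cat.lower().split())
    hits = sum(1 for term in query_terms if term.lower() in cat_words)
    return hits / len(query_terms)
```

For the enriched query tennis (players:player) court, an article categorized under "Tennis players" matches two of the three query words and scores 2/3, whereas an article whose categories share no words with the query scores 0.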
When using the cosine measure, there is no guarantee that this will happen. In order to differentiate between articles with identical overlap counts, we calculate the dot product between the vectorial representations of the query and of the article. Since computing the dot product is more time consuming than computing the overlap, the operation is performed only for articles with a non-null overlap, and the complexity of the Wikipedia query model computation is significantly reduced compared to an exhaustive search for similar articles.

In table 2, we present the top 10 related concepts (Wikipedia articles) for fractals, tennis player court and cactus desert. The related concepts obtained with our semantic relatedness measure are generally representative of their source queries and capture important query facets. For fractals, all top related concepts are relevant for the topic and, whereas most concepts contain fractal, the singular form of the query word, in their title, this is not the case for Mandelbrot set, a well-known type of fractal. The second query, tennis player court, is more complex, and 8 of the 10 top related concepts are tennis players, indicating that being a player is one very important facet of the query. For cactus desert, 8 top related concepts point to cactus species and the other 2 indicate locations situated in a desert region well known for its cacti. The examples in table 2 illustrate how related concepts which do not contain terms from the initial query are extracted from Wikipedia and can be used for query expansion.

Since well-known concepts, which offer a good model of a query, usually have detailed Wikipedia articles, and the cosine distance is known to penalize long documents, the dot product is a good measure for capturing similarity while favoring long, and important, documents. Also, well-known concepts are better suited for information retrieval purposes. As an example, compare illustrating the query tennis player on court with photos of Venus Williams or Rafael Nadal, concepts surfaced by the use of the dot product, versus illustrating the same query with David Sanchez or Luis Ayala, surfaced when using the cosine distance.
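The length effect motivating the choice of dot product over cosine can be demonstrated with a small sketch over sparse term-weight vectors. The example vectors below are invented for illustration only.

```python
from math import sqrt

def dot(q, d):
    """Unnormalized dot product between sparse term-weight vectors."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def cosine(q, d):
    """Cosine similarity: the dot product divided by both norms, which
    penalizes long (high-norm) documents relative to short ones."""
    norm_q = sqrt(sum(w * w for w in q.values()))
    norm_d = sqrt(sum(w * w for w in d.values()))
    return dot(q, d) / (norm_q * norm_d) if norm_q and norm_d else 0.0
```

A short article containing only the query terms maximizes cosine similarity, while a long, detailed article about a well-known concept, which matches the query terms more heavily but carries many other terms as well, wins under the dot product.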
A comparison of the Flickr social relatedness models (table 1) and the Wikipedia query models (table 2) shows that the number of close concepts related via an isA relation to the initial query (or a part of it) is much higher with Wikipedia. This confirms our hypothesis concerning the important differences between the query models generated with Flickr and Wikipedia, and led us to use the Flickr model for ranking results and finding term variants, and the Wikipedia model for expanding the initial query. The Wikipedia query model can be expressed as a ranked list of related concepts:

M_WikDot(Q) = ∪_{x=1..N_Wik} (C_x, Ov(Q,C_x), dot(Q,C_x))    (3)

where Ov(Q,C_x) is the overlap between the query Q and concept C_x, dot(Q,C_x) is the dot product of Q and V_{C_x}, the vectorial representation of C_x, and C_x is the x-th related Wikipedia concept, with x = 1...N_Wik. Related concepts are sorted by Ov(Q,C_x) in decreasing order, and ties are broken using dot(Q,C_x). The Wikipedia query modeling discussed above only takes monolingual versions of concepts into account, but collections are often multilingual. Suppose the English query fractals is launched against a collection that contains annotations in English, but also in French and German. Relevant images include those tagged with fractals and its English Wikipedia related concepts, but also images tagged with the French and German equivalents of the English related concepts. We therefore define a multilingual extension of the model in (3), using the Wikipedia translation graph to find the equivalents of related concepts in other languages:

M_WikML(Q) = ∪_{x=1..N_Wik} (Ov(Q,C_x), dot(Q,C_x), C_x, Tr(C_x))    (4)

where Tr(C_x) is the set of translations of C_x from the pivot language to an arbitrary number of target languages in which C_x is described. The multilingual Wikipedia query model in (4) can be used to automatically query a multilingual collection with concepts that are semantically related to the initial query, as opposed to the interactive multilingual querying process required by PanImages [6]. The creation of our Wikipedia model has a computational cost directly related to the size of Wikipedia for the query language. The authors of [7] remove articles smaller than a given threshold. This decision is arbitrary and, in order not to lose potentially interesting articles, we adopt a different strategy: we pre-select only the articles that are categorized with at least one term from Q_E and compute the query-article similarity only for this subset of articles. Moreover, the process can be distributed over several machines to speed up execution. In our current implementation, which uses an SQL representation of Wikipedia titles, categories and vectors, the calculation of the model takes around 5 seconds per query. However, our focus here was not on optimization, and a significant reduction of the execution time could probably be obtained by optimizing the underlying data structures.
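The pre-selection strategy, scoring only articles that share at least one category word with Q_E, amounts to an inverted index over category words. The following sketch uses in-memory dicts as a hypothetical stand-in for the paper's SQL tables:

```python
from collections import defaultdict

def build_category_index(articles):
    """Inverted index: category word -> ids of articles carrying it.

    articles: dict article_id -> list of category names (a stand-in
    for the SQL representation of Wikipedia categories)."""
    index = defaultdict(set)
    for aid, categories in articles.items():
        for cat in categories:
            for word in cat.lower().split():
                index[word].add(aid)
    return index

def candidate_articles(enriched_query_terms, index):
    """Only articles sharing at least one category word with Q_E are
    retained, avoiding an exhaustive pass over all of Wikipedia."""
    candidates = set()
    for term in enriched_query_terms:
        candidates |= index.get(term.lower(), set())
    return candidates
```

The overlap and dot-product scores then only need to be computed for the returned candidate set, which is typically a tiny fraction of the encyclopedia.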

4. RETRIEVAL PROCESS

The retrieval model developed in this paper is based on a vector space representation of the test collection and is designed so that the Flickr and/or Wikipedia query models described in the preceding section can easily be plugged in. A brief algorithmic description is presented hereafter:

Algorithm 1: Retrieval algorithm using the Flickr and Wikipedia models
1: for each word in Q_E
2:   listInitial = retrieve(word, Vectors_Collection)
3: for each doc in listInitial
4:   docScore = overlap(Q, doc)/|Q|
5:   hashInitial = add(doc, docScore)
6: for each C_x in M_Wik
7:   listWiki = retrieve(C_x, Vectors_Collection)
8:   for each doc in listWiki
9:     hashWiki = add(doc, Ov(C_x))
10:  empty(listWiki)
11: hashAll = merge(hashInitial, hashWiki)
12: for each doc in hashAll
13:   score1 = getOverlapValue(doc)
14:   score2 = cosine(doc, M_Flickr)
15:   intermediary = add(doc, score1, score2)
16: results = reverseSort(intermediary, score1, score2)

The Wikipedia and Flickr models in the algorithm are obtained following the procedures described in the previous section, and all the structures used are initially empty. In steps 1 and 2, we create a list of all collection documents that match at least one word from the enriched version of the query (which contains term variants and WordNet synonyms). Given Q_E = tennis

(player:players) court, we retrieve all documents annotated with any of these terms. Steps 3 and 4 select the distinct documents from the initial list and attribute to each an overlap score expressing its similarity to the query. For instance, a document annotated with tennis and player, like one annotated with tennis and court, has an overlap of 0.67, while if only tennis, player or court appears in the document description, the score is 0.33. This score is computed in the same manner as the coarse Wikipedia similarity score and takes values between 0 and 1. All distinct documents are added to a hash which contains their names and overlap scores (step 5). After querying with terms from Q_E, we create a list of documents for each related concept from M_WikDot (steps 6 and 7). Next (steps 8 and 9), each document is added to a Wikipedia-dedicated hash along with the overlap score of the corresponding concept (Ov(C_x)). For instance, documents annotated with Venus Williams or Rafael Nadal will have scores of 0.67 because these concepts are categorized under tennis players in Wikipedia. Since Wikipedia concepts are sorted by decreasing overlap score and we want to retain only the occurrence of a document with the highest overlap, in steps 8 and 9 we check whether each document is already in the hash and, if not, add it. If a document annotated with Rafael Nadal is also found via Association of Tennis Professionals, this second occurrence is not recorded, since the latter concept is categorized only under tennis and its overlap score (0.33) is smaller than that of Rafael Nadal. The list of documents is emptied after searching for the results corresponding to each Wikipedia related concept. In step 11, we merge the document hashes obtained after querying with terms from Q_E and with concepts from the Wikipedia model.
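The merging and two-key ranking of steps 11-16 can be sketched in Python as follows; the function names are illustrative, and the cosine scores are assumed to be computed elsewhere against the Flickr model.

```python
def merge_hashes(hash_initial, hash_wiki):
    """Keep each document once, with its highest overlap score (step 11)."""
    merged = dict(hash_wiki)
    for doc, score in hash_initial.items():
        merged[doc] = max(score, merged.get(doc, 0.0))
    return merged

def rank(merged, cosine_to_flickr_model):
    """Sort by overlap score, breaking ties with the cosine between the
    document vector and the Flickr model (steps 12-16)."""
    return sorted(merged,
                  key=lambda d: (merged[d], cosine_to_flickr_model[d]),
                  reverse=True)
```

Because Python's sort compares the (overlap, cosine) tuples lexicographically, documents with equal overlap are ordered by their Flickr-model similarity, as required.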
During the merging procedure, if a document appears in both hashes, we retain the highest overlap score. If a document is annotated with tennis, player and court but also with Rafael Nadal, its overlap score (1) will be determined by the first set of annotations. After step 11, the hash contains the distinct documents and their maximum overlap scores. As noted for the Wikipedia models, overlap scores have a small range of values and the hash contains a large number of ties. To break them, we compute score2, the cosine between the Flickr query model and each document's vector (steps 13 and 14), and add the results into a structure that contains the document name, the overlap score and the cosine score. The ranked set of results for query Q is obtained by sorting intermediary first by score1 and then by score2, in decreasing order. Thus, given two documents with identical overlap with the original query, the document with the higher cosine similarity is ranked higher.

As mentioned, the Wikipedia and Flickr models have different roles in the retrieval system: related concepts from the encyclopedia are used for semantic query expansion, whereas the set of related Flickr tags is exploited for result ranking. The presented retrieval algorithm is flexible and allows an easy evaluation of the impact of its different parts on system performance. If we remove steps 6 to 10, the Wikipedia model is discarded and the returned results correspond only to terms in the original query. Conversely, if we remove steps 1 to 5, the final results come only from querying with Wikipedia related concepts. If we replace M_WikDot in step 6 with M_WikML, the documents found correspond to Wikipedia concepts in the language of the query but also to their translations into other languages. Finally, the cosine similarity (step 14) can be calculated using a query model which includes only words from the initial query and no term variants. These particular settings are explored in the evaluation section and compared with state-of-the-art methods.

The complexity of the algorithm presented here is higher than that of an algorithm which does not perform query expansion. The additional complexity is due to steps 6 to 10 (formulation of queries with Wikipedia related concepts) and is linearly dependent on the number of concepts used. Wikipedia queries can easily be distributed over several processors in order to reduce this cost and speed up the retrieval process.

Method:  M_WikESA  M_WikCosine  M_WikDot
MAP:     0.1146    0.1208       0.1303
P@20:    0.3286    0.3714       0.3807

Table 3: MAP and P@20 results for retrieval with a Wikipedia model created with Explicit Semantic Analysis and with versions of our model.

Figure 1: Comparing extraction of Flickr expansion terms using social tf-idf versus using co-occurrence statistics.

5. EVALUATION

We first evaluate only Flickr-based expansion, then only Wikipedia-based expansion, and finally the combined expansion from both derived query models.

5.1 Flickr Model Evaluation

We present here an evaluation of Flickr-only query expansion for English queries. Since co-occurrence based models have been used for tag recommendation, we compare our social tf-idf inspired Flickr query expansion model to a pure co-occurrence model (created by removing the second term in formula 1), replacing MFlickr (step 14 of the algorithm) with MCooc. We also remove steps 6 to 10, which relate to Wikipedia expansion. To assess the influence of N, the number of socially related tags retained, we present in figure 1 the mean average precision (MAP) results for values of N between 0 and 100, with a step of 5. For both measures and all values of N, the social tf-idf inspired model performs better than the co-occurrence based one. A t-test applied to the two methods at N = 5 and N = 100 shows that the difference between MFlickr and MCooc is not statistically significant (p = 0.05). However, the use of the social tf-idf model produces a 5% improvement over the co-occurrence model, and we use MFlickr for the experiments presented hereafter. Expanded query results gradually improve up to N = 15, with little gain thereafter, which confirms the prevalent role of the top related terms in result ranking. However, a large value of N guarantees fine-grained differentiation between image similarity scores, and we perform the remainder of the experiments with N = 100. The effect of introducing socially induced stemmed term variants QE is also tested. We computed performances for model size N = 100 with and without term variant expansion. When using term variants, MAP = 0.2469, whereas when using the initial queries without term variants, MAP = 0.2127 (a relative improvement of 14%). A 1-tailed t-test shows that the two MAP distributions differ at p = 0.13466 (the P@20 distributions differ at p = 0.20276). Finally, we evaluate the difference between ranking results with Flickr models and with a query model obtained from general language. Mgeneral is obtained from Wikipedia articles that describe query related concepts and is based on tf-idf term scoring. To test it, we rank results with Mgeneral instead of MFlickr (step 14 of the algorithm). In this setting, MAP = 0.2132, a result that shows that query models obtained from generic language decrease retrieval performance. The good performance of MFlickr, together with the poor performance of Mgeneral, validates our hypothesis concerning the significant differences between photographic and general language and shows that photographic language should be used for image ranking.
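The MAP and P@20 figures reported throughout this section follow the standard definitions; the sketch below computes both from a ranked result list and a relevance set (the document identifiers are invented for illustration).

```python
def average_precision(ranking, relevant):
    """AP for one query: mean of the precision at the rank of each
    relevant document retrieved, over the number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP: mean of AP over (ranking, relevant-set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

def precision_at_k(ranking, relevant, k=20):
    """P@k: fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k
```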

5.2 Wikipedia Model Evaluation

All evaluations of Wikipedia query models are also performed on English queries only. The extraction of Wikipedia related concepts presented in section 3.2 is compared to Explicit Semantic Analysis [7]2. The ESA baseline (MWikESA) is compared to two different similarity measures for finding Wikipedia related concepts, dot product (MWikDot) and cosine similarity (MWikCosine), by inserting each of the three models for retrieval in steps 6 to 10 of the retrieval algorithm. Steps 1 to 5 of the algorithm are not performed, so that the retrieved results correspond only to Wikipedia concepts and not to terms in the initial query. For the experiments performed here, the number of retained Wikipedia related concepts is NWik = 500, a value which allows a rich representation of the initial queries. The results in table 3 show that the best results are obtained with MWikDot, followed by MWikCosine and MWikESA. The relative MAP improvement brought by MWikDot with respect to MWikESA is 12%. A comparison of these MAP distributions using a paired t-test shows that they differ at p = 0.23368 (for P@20, the two distributions differ at p = 0.095). The performances reported here confirm our hypothesis that, from an information retrieval perspective, the use of the dot product for finding a list of semantically related concepts is more appropriate than the use of cosine similarity or of ESA. Also, the comparison of MWikDot and MWikCosine shows that the query model obtained using the dot product performs better than the cosine based model.

2 Here we used an online implementation of ESA available at http://www.multipla-project.org/research-esa
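The advantage of the dot product over cosine similarity can be seen on toy sparse vectors: cosine normalizes by vector length and therefore pushes down long, richly annotated concepts, while the dot product lets a concept that matches more query terms score higher. The term weights below are invented for illustration, not taken from the actual models.

```python
import math

def dot(q, d):
    """Dot product over sparse term-weight vectors (dicts)."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def cosine(q, d):
    """Length-normalized dot product (cosine similarity)."""
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot(q, d) / (norm_q * norm_d) if norm_q and norm_d else 0.0

# The long concept matches the query on more terms but also carries many
# extra terms; cosine's length normalization ranks it below the short
# concept, while the dot product keeps it on top.
query = {"tennis": 1.0, "player": 0.8}
short_concept = {"tennis": 1.0}
long_concept = {"tennis": 1.0, "player": 0.9, "court": 0.7, "racket": 0.6,
                "tournament": 0.6, "grand": 0.5, "slam": 0.5}
```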

Table 4: MAP and P@20 results for best ImageCLEF Wikipedia Retrieval 2010 runs and for the approach presented in this paper.

Run                     MAP     P@20
ImageCLEF 2010 results
ImageCLEF-multimodal    0.2765  0.5193
ImageCLEF-ENFRDE        0.2361  0.4393
ImageCLEF-EN            0.2251  0.3871
ImageCLEF-DE            0.0936  0.2314
ImageCLEF-FR            0.2200  0.3986
Our results
ENFRDE                  0.2786  0.4843
ENnoWiki                0.2431  0.4736
ENWikDot                0.2631  0.4786
ENWikML                 0.2681  0.4843
DEnoWiki                0.1079  0.2757
DEWikDot                0.1502  0.3186
DEWikML                 0.1751  0.3643
FRnoWiki                0.1441  0.3464
FRWikDot                0.1628  0.3600
FRWikML                 0.1903  0.3786

5.3 Global Evaluation

We evaluate the combination of Flickr and Wikipedia models over the English (EN), German (DE) and French (FR) queries provided to ImageCLEF Wikipedia task participants. The performances of state of the art methods are illustrated through the best runs submitted to ImageCLEF Wikipedia Retrieval 2010. In the first line of table 4, we present the run that achieved the best performance during the campaign. As mentioned, this run, submitted by Xerox Research [3], combines textual and visual information. The following four lines contain the best multilingual textual ImageCLEF run and the best runs for the individual languages3. Then, we present results obtained with the algorithm presented in section 4, using a Flickr model size NFlickr = 100 and a Wikipedia model size NWik = 500. For each language, we use either monolingual Wikipedia models (marked WikDot, from formula 3) or Wikipedia models enriched with an automatic translation of relevant concepts (marked WikML, from formula 4) to perform steps 6 to 10 of the retrieval algorithm. We also present results without the use of Wikipedia models (steps 6 to 10 ignored, marked noWiki). Our focus here is on queries in individual languages, which reproduce the way images are usually searched on the Web. However, since runs that use query variants in all three languages were proposed during the ImageCLEF campaign, we also present such a run (ENFRDE) here. This run is a linear combination of the WikDot runs obtained for each language. As expected, the best results are obtained from the combination of the three query language variants (ENFRDE). For the individual languages, the best results are obtained with the multilingual Wikipedia model. Standard deviations are not reported for ImageCLEF runs, so we cannot perform t-tests to compare these results to ours. Instead, we report relative improvements to discuss the results. The MAP of ENFRDE is comparable to that of the best run submitted to ImageCLEF (0.2786 vs. 0.2765), but ENFRDE is a purely textual run whereas ImageCLEF-multimodal combines visual image processing and textual information. By comparison, ImageCLEF-ENFRDE, the best textual ImageCLEF run, has MAP = 0.2361, which translates into a 15.1% MAP improvement for our approach (9.3% for P@20). For English, MAP improvements with respect to the best ImageCLEF run vary between 7.5% for ENnoWiki and 16% for ENWikML (the corresponding P@20 improvements are 18.3% and 20.1%). In German, MAP improvements are 13.2% for DEnoWiki and 46.5% for DEWikML (the corresponding P@20 improvements are 16% and 36.5%). These scores show that query modeling with Flickr and Wikipedia is highly effective for English and German. For French, the best ImageCLEF run [16] performs better than our best run (13.5% gain for MAP and 5% for P@20). The results reported in [16] are obtained after an automatic translation of all textual metadata into French, while we only use query modeling with Flickr and Wikipedia. For our runs, English results are clearly superior to German and French results, a difference principally due to the fact that a larger number of collection images have English annotations. Interestingly, results for French are better than results for German, although the collection contains more annotations in German than in French. This result confirms a similar finding from [9]. The most probable cause of this inversion is a better adaptation of the query modeling introduced here to French. However, it is also possible that retrieval for German is more difficult than for French due to the structure of the two languages.

3 A complete list of results is available at http://imageclef.org/2010/wikiMM-results
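A linear combination of per-language runs can be sketched as score fusion over the ranked lists. This is an assumption-laden sketch: the paper does not specify the combination coefficients or score normalization, so equal weights are used here purely for illustration.

```python
def combine_runs(runs, weights=None):
    """Linearly combine per-language retrieval runs, each a dict mapping
    document -> score, and return a fused ranking. Equal weights are an
    assumption: the actual coefficients are not given in the paper."""
    if weights is None:
        weights = [1.0 / len(runs)] * len(runs)
    fused = {}
    for run, w in zip(runs, weights):
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + w * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

A document retrieved by several language variants accumulates score from each of them, which is one way the combined ENFRDE run can outperform each monolingual run.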
The introduction of the monolingual Wikipedia models (from formula 3, marked WikDot in table 4) and of the multilingual Wikipedia models obtained through concept name translation (from formula 4, marked WikML) improves retrieval performance compared to the use of query terms alone (noWiki runs). The relative improvement is highest for German (28.2% between DEWikDot and DEnoWiki, and 38.4% between DEWikML and DEnoWiki). Paired t-tests show that the MAP distributions are statistically different at p = 0.02473 for DEWikML and DEnoWiki, and at p = 0.04721 for FRWikML and FRnoWiki. For English, the introduction of multilingual Wikipedia models has a marginal effect on results compared to monolingual models (1.8% of MAP). The same operation has a bigger impact for German (14.2% of MAP) and French (14.5% of MAP), which confirms the hypothesis that query translation is more useful for languages that are sparsely represented in a collection [6].
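The paired t-tests above compare two runs on their per-query scores. The t statistic itself can be computed as below; obtaining a p-value additionally requires the Student t distribution with n - 1 degrees of freedom (e.g. via scipy.stats.ttest_rel). The per-query scores in the usage example are invented.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test over per-query scores of two runs:
    mean of the differences divided by its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```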

6. CONCLUSIONS

In this paper, we introduced a generic social-media driven query expansion model and tested it on a large-scale, noisy image collection. The results presented here indicate that our model outperforms state of the art image retrieval methods. The comparison of the social tf-idf Flickr query model introduced here to the co-occurrence model used in previous studies [18] shows that our approach is better suited to image information retrieval. The extraction of Wikipedia related concepts presented here offers an improved representation of a query compared to Explicit Semantic Analysis. The improvement is explained by the privileged role given to Wikipedia categories and by the use of the dot product, a similarity measure which does not penalize long documents, as does the cosine similarity used in ESA. In the retrieval stage, improvements are particularly consistent for languages which are sparsely represented in the test collection (German and French). Due to this sparsity, the probability of a mismatch between query words and image annotations is larger than for a language which is well represented in the collection. The use of the overlap score for a coarse ranking of retrieved images enables a seamless integration of the Wikipedia query model into the retrieval system. It is complemented by the use of cosine similarity for fine-grained result ranking. The performance improvement comes at a cost, due to the calculation of the Flickr and Wikipedia query models at query time and to the use of Wikipedia related concepts to retrieve results. As mentioned, the calculation of the Flickr and Wikipedia models, which currently takes around 8 seconds per query, can be optimized in order to approach real-time constraints. With the test collection used here, retrieving results for one Wikipedia related concept takes around 0.1 seconds. The process can easily be distributed over several processors in order to reduce the retrieval time. In the future, we plan to study the relation between query complexity and the Flickr and Wikipedia models in order to find a criterion for deciding whether expansion models should be used for a given query. Another direction to explore is the use of Wikipedia query models for a conceptual diversification of image search results, and its comparison to existing techniques which use textual annotations, visual features or a combination of the two.
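The per-concept queries mentioned above are independent of one another, so they parallelize naturally. A minimal sketch, assuming the index lookup is I/O-bound and representable as a callable; search_fn is a hypothetical stand-in for the real index call, not part of the described system.

```python
from concurrent.futures import ThreadPoolExecutor

def query_concepts_parallel(concepts, search_fn, workers=4):
    """Issue one retrieval query per Wikipedia related concept across a
    thread pool, then merge the partial results, keeping the highest
    overlap score per document. search_fn maps a concept name to a list
    of (document, overlap score) pairs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(search_fn, concepts))
    merged = {}
    for result in partial_results:
        for doc, score in result:
            merged[doc] = max(score, merged.get(doc, 0))
    return merged
```

With roughly 0.1 s per concept query, spreading the NWik = 500 concept queries over several workers shrinks the dominant part of the per-query latency roughly linearly in the number of workers, up to index contention.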

7. ACKNOWLEDGMENTS

This research is funded partly via the ANR Georama project (ANR-08-CORD-009) and partly via the Quaero Programme, sponsored by OSEO, French State Agency for Innovation.

8. REFERENCES

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. Lecture Notes in Computer Science, 4825, 2008.
[2] J. Benavent, X. Benavent, E. de Ves, R. Granados, and A. Garcia-Serrano. Experiences at ImageCLEF 2010 using CBIR and TBIR mixing information approaches. In Working notes of the ImageCLEF 2010 Lab, Padua, Italy, 2010.
[3] S. Clinchant, G. Csurka, J. Ah-Pine, G. Jacquet, F. Perronnin, J. Sánchez, and K. Minoukadeh. XRCE's Participation in Wikipedia Retrieval, Medical Image Modality Classification and Ad-hoc Retrieval Tasks of ImageCLEF 2010. In Working notes of the ImageCLEF 2010 Lab, Padua, Italy, 2010.
[4] D. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. Mapping the world's photos. In Proc. of WWW 2009, 2009.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Proc. of CVPR, Miami, USA, 2009.
[6] O. Etzioni, K. Reiter, S. Soderland, and M. Sammer. Lexical Translation with Application to Image Search on the Web. In Proc. of Machine Translation Summit XI, Copenhagen, Denmark, 2007.
[7] E. Gabrilovich and S. Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proc. of IJCAI, Hyderabad, India, 2007.
[8] G. Grefenstette and G. Pitel. Image Specific Language Model: Comparing Language Models from Two Independent Distributions from FlickR and the Web. Language Forum, 34, 2008.
[9] R. R. Larson. Blind Relevance Feedback for the ImageCLEF Wikipedia Retrieval Task. In Working notes of the ImageCLEF 2010 Lab, Padua, Italy, 2010.
[10] D. Milne and I. H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proc. of WIKIAI Workshop, Chicago, USA, 2008.
[11] J. Min, J. Leveling, and G. Jones. Document Expansion for Text-based Image Retrieval at WikipediaMM 2010. In Working notes of the ImageCLEF 2010 Lab, Padua, Italy, 2010.
[12] P.-A. Moëllic, J.-E. Haugeard, and G. Pitel. Image clustering based on a shared nearest neighbors approach for tagged collections. In Proc. of CIVR, Niagara Falls, Canada, 2008.
[13] A. Plangprasopchok and K. Lerman. Constructing folksonomies from user-specified relations on flickr. In Proc. of WWW '09, 2009.
[14] A. Popescu, C. Millet, and P.-A. Moëllic. Ontology Driven Content Based Image Retrieval. In Proc. of CIVR, Amsterdam, The Netherlands, 2007.
[15] A. Popescu, T. Tsikrika, and J. Kludas. Overview of the WikipediaMM task at ImageCLEF 2010. In Working notes of the ImageCLEF 2010 Lab, Padua, Italy, 2010.
[16] M. E. Ruiz, J. Chen, K. Pusapathy, P. Chin, and R. Knudson. UNT at ImageCLEF 2010: CLIR for Wikipedia Images. In Working notes of the ImageCLEF 2010 Lab, Padua, Italy, 2010.
[17] P. Schmitz. Inducing ontology from flickr tags. In Proc. of Collaborative Web Tagging Workshop (WWW '06), 2006.
[18] B. Sigurbjörnsson and R. van Zwol. Flickr tag recommendation based on collective knowledge. In Proc. of WWW '08, 2008.
[19] P. Sorg and P. Cimiano. An Experimental Comparison of Explicit Semantic Analysis Implementations for Cross-Language Retrieval. Lecture Notes in Computer Science, 5723, 2010.
[20] M. Strube and S. P. Ponzetto. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In Proc. of AAAI, Boston, USA, 2006.
[21] M. van Leeuwen, F. Bonchi, B. Sigurbjörnsson, and A. Siebes. Compressing tags to find interesting media groups. In Proc. of CIKM '09, 2009.
[22] R. H. van Leuken, L. Garcia, X. Olivares, and R. van Zwol. Visual diversification of image search results. In Proc. of WWW '09, 2009.
[23] L. Wu, X.-S. Hua, N. Yu, W.-Y. Ma, and S. Li. Flickr distance. In Proc. of ACM Multimedia '08, 2008.